As a programmer, I don’t like bugs. My least favorite kind are the intermittent ones, which are hard to reproduce and thus hard to fix.
Symptom
A few days ago I had such a bug in my home lab configuration, which caused some Kubernetes API calls to intermittently fail like this:
[intermittent Kubernetes API error output]
It would go away within a few seconds, and then the master nodes would look fine again:
[output showing the master nodes looking fine]
It’s also worth noting that there were no events for those nodes.
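In case you want to check the same thing, node events show up at the bottom of kubectl describe node, and can also be queried directly with a field selector:

    # Events for a specific node appear at the end of its description
    kubectl describe node kmaster01

    # Or query node events directly (node name is from my cluster; adjust as needed)
    kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=kmaster01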
Environment
- Proxmox VE: 8.4.1
- Kubernetes cluster: v1.33.3+k3s1
Troubleshooting
Thinking that it was some kind of syncing issue (and being too lazy to check the logs), I first tried stopping the k3s service on each master node, one by one.
As usual, I first drained each node with
[kubectl drain command]
followed by sudo systemctl stop k3s on the node.
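For anyone following along, the drain-and-stop dance looks roughly like this (a sketch; the exact drain flags depend on what your workloads tolerate):

    # Cordon the node and evict its workloads
    kubectl drain kmaster01 --ignore-daemonsets --delete-emptydir-data

    # On the node itself, stop the k3s service (API server and embedded etcd run in this one process)
    sudo systemctl stop k3s

    # ...and when done, bring it back and allow scheduling again
    sudo systemctl start k3s
    kubectl uncordon kmaster01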
Normally, in a three-master k3s cluster, restarting one master node at a time should not cause any downtime. If it was a syncing issue and the restarted node happened to be the problematic one, the remaining two nodes should be able to resolve the issue.
However, no matter which node I restarted, the issue persisted.
To make things worse, with only two master nodes left, the Kubernetes API server went down even more frequently, making the cluster almost unusable; with one etcd member deliberately stopped, the remaining two had no failure tolerance left, so every hiccup between them cost the cluster its quorum. Everything that used the Kubernetes API started to fail, leaving my lab assistant yelling at me non-stop.

Going down the rabbit hole
Realizing this was a more serious issue that even restarting could not fix, I proceeded to check the logs:
[k3s logs from kmaster03: connection errors to 10.0.69.101, no route to host]
This was from kmaster03, and 10.0.69.101 is the IP address of kmaster01. Logs on kmaster02 showed similar errors. On kmaster01, on the other hand:
[k3s logs from kmaster01]
The logs were flooded with etcdserver: no leader errors.
The errors from both sides of the connection led me to believe that the etcd instances were failing to elect a leader due to some networking issue, as seen in the no route to host error.
Since both kmaster02 and kmaster03 were complaining about not being able to reach kmaster01, and kmaster01 was continuously complaining about not being able to find a leader, I suspected that intermittent connectivity between kmaster01 and the other two nodes was the root cause of the issue.
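Two things that make this kind of diagnosis quicker: with the standard systemd-based install, the k3s server logs live in the journal, and the embedded etcd can be queried directly with etcdctl using the client certificates k3s generates. The paths below are the usual k3s locations, and etcdctl itself has to be installed separately:

    # k3s server logs (API server + embedded etcd) are in the systemd journal
    sudo journalctl -u k3s --since "30 minutes ago" | grep -i etcd

    # Ask etcd itself about member health and who the current leader is
    sudo ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
      --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
      --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
      endpoint status --cluster -w table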
The other thing that led me to this conclusion was my network topology:

kmaster01 is connected directly to the router on its own, while the other two master nodes are connected to a switch.
In hindsight, what I should have realized here is that, since shutting down any master node did not resolve the issue, the problem was not simply on a single node.
So I migrated kmaster01 from opx01 over to min01, which is connected to the same switch as kmaster02 and kmaster03, hoping this would resolve the issue.
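On Proxmox the migration itself is a one-liner (the VM ID below is hypothetical; use whatever qm list shows for kmaster01):

    # Run on the source node (opx01): live-migrate the VM to min01
    qm migrate 100 min01 --online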
And it did not.
Encounter with the mastermind
I figured out the root cause not through some intelligent reasoning, but rather somewhat by luck. I say “somewhat” because I didn’t have to be extremely lucky, but it was still pretty luck-based.
When sshing into kmaster03, I was greeted by this familiar message:
[ssh host key changed warning]
I see this message quite often since I tend to re-use IPs when setting up new VMs and tearing down old ones, so I simply ran the suggested command ssh-keygen -f "/home/jy/.ssh/known_hosts" -R "10.0.69.103" and moved on.
However, the message kept appearing, about once in every three ssh attempts.
I could only think of one possibility: there was another host using the IP 10.0.69.103. That is, of course, assuming no man-in-the-middle attack was actually going on, and I’m very relieved that assumption turned out to be right.
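A duplicate IP like this can be confirmed from another machine on the same LAN, since two different MAC addresses end up answering ARP for the same address (the interface name below is an assumption; adjust to yours):

    # Send a few ARP requests for the IP; replies from two different MACs give the duplicate away
    sudo arping -I eth0 -c 5 10.0.69.103

    # Or watch the kernel's neighbor table flip between two MAC addresses
    ip neigh show 10.0.69.103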
Aside from the network infrastructure, there is another piece of lore in my home lab: I had a mini-PC in my Proxmox cluster that I decommissioned a few days ago, then added back to the cluster to run a new VM. The ssh fingerprint warning reminded me that I used to have a VM on that host with the exact same IP address: 10.0.69.103.
That means the mastermind behind all this was…
me.

Confirming the root cause
So I hopped onto apu01, the host for the VM kmaster03, and ran qm list in hopes of finding the rogue VM.
However, there was only one VM there, which was the new VM I deployed after re-adding the node to the cluster, not the old kmaster03.
[qm list output]
qm list --full also did not show any other VMs.
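Since qm only knows about VMs that have a config file under /etc/pve, another quick sanity check is to list those configs directly; the rogue VM, having been removed from Proxmox’s point of view, would not show up here either:

    # VM configs known to this Proxmox node; an orphaned QEMU process has no file here
    ls /etc/pve/qemu-server/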
ps, on the other hand, showed the culprit:
[ps output showing the old kmaster03 VM process still running]
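If you ever need to hunt for one of these yourself, an orphaned QEMU process is still easy to spot, since Proxmox starts its VMs with the VM ID and disk paths right on the command line:

    # On Proxmox, VM processes usually run as /usr/bin/kvm; the [k] trick keeps grep from matching itself
    ps aux | grep '/usr/bin/[k]vm'

    # The long command line includes the VM ID and disk paths (here, file=/dev/pve/vm-101-disk-0),
    # which is how the rogue VM gave itself away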
Restoring peace
Now it’s time to do the cleanup.
Since the VM was not managed by Proxmox or qm anymore, I resorted to killing the process and removing its disk image manually.
After finding the VM disk path in the ps command output (file=/dev/pve/vm-101-disk-0), all I had to do was:
[commands to kill the process and remove its disk]
where 1785 is the PID of the VM process, and 101 is the VM ID.
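For anyone in the same spot, the cleanup boils down to two commands; a minimal sketch, assuming the disk lives on the default LVM-backed local-lvm storage:

    # Stop the orphaned QEMU process (1785 was the PID from the ps output)
    sudo kill 1785

    # /dev/pve/vm-101-disk-0 is a logical volume in the pve volume group; remove it
    sudo lvremove pve/vm-101-disk-0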
With that, the rogue VM was gone, the ssh fingerprint warning disappeared, and finally, the intermittent failure of Kubernetes API calls was resolved.