As a programmer, I don’t like bugs. My least favorite kind are the intermittent ones, which are hard to reproduce and thus hard to fix.
Symptom
A few days ago I had such a bug in my home lab configuration, which caused some Kubernetes API calls to intermittently fail like this:
[intermittent Kubernetes API error output]
It would go away within a few seconds, and then the master nodes would look fine again:
[output showing the master nodes looking fine]
It’s also worth noting that there were no events for those nodes.
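In case you want to check the same thing, node events show up at the bottom of kubectl describe node, and can also be queried directly with a field selector:

    # Events for a specific node appear at the end of its description
    kubectl describe node kmaster01

    # Or query node events directly (node name is from my cluster; adjust as needed)
    kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=kmaster01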
Environment
- Proxmox VE: 8.4.1
- Kubernetes cluster: v1.33.3+k3s1
Troubleshooting
Thinking that it was some kind of syncing issue (and being too lazy to check the logs), I first tried stopping the k3s service on each master node, one by one.
As usual, I first drained each node with
[kubectl drain command]
followed by sudo systemctl stop k3s on the node.
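For anyone following along, the drain-and-stop dance looks roughly like this (a sketch; the exact drain flags depend on what your workloads tolerate):

    # Cordon the node and evict its workloads
    kubectl drain kmaster01 --ignore-daemonsets --delete-emptydir-data

    # On the node itself, stop the k3s service (API server and embedded etcd run in this one process)
    sudo systemctl stop k3s

    # ...and when done, bring it back and allow scheduling again
    sudo systemctl start k3s
    kubectl uncordon kmaster01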
Normally, in a three-master k3s cluster, restarting one master node at a time should not cause any downtime. If it was a syncing issue and the restarted node happened to be the problematic one, the remaining two nodes should be able to resolve the issue.
However, no matter which node I restarted, the issue persisted.
To make things worse, with only two master nodes left, the Kubernetes API server went down even more frequently, making the cluster almost unusable; with one etcd member deliberately stopped, the remaining two had no failure tolerance left, so every hiccup between them cost the cluster its quorum. Everything that used the Kubernetes API started to fail, leaving my lab assistant yelling at me non-stop.

Going down the rabbit hole
Realizing this was a more serious issue that even restarting could not fix, I proceeded to check the logs:
[k3s logs from kmaster03: connection errors to 10.0.69.101, no route to host]
This was from kmaster03, and 10.0.69.101 is the IP address of kmaster01. Logs on kmaster02 showed similar errors. On kmaster01, on the other hand:
[k3s logs from kmaster01]
The logs were flooded with etcdserver: no leader errors.
The errors from both sides of the connection led me to believe that the etcd instances were failing to elect a leader due to some networking issue, as seen in the no route to host error.
Since both kmaster02 and kmaster03 were complaining about not being able to reach kmaster01, and kmaster01 was continuously complaining about not being able to find a leader, I suspected that intermittent connectivity between kmaster01 and the other two nodes was the root cause of the issue.
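Two things that make this kind of diagnosis quicker: with the standard systemd-based install, the k3s server logs live in the journal, and the embedded etcd can be queried directly with etcdctl using the client certificates k3s generates. The paths below are the usual k3s locations, and etcdctl itself has to be installed separately:

    # k3s server logs (API server + embedded etcd) are in the systemd journal
    sudo journalctl -u k3s --since "30 minutes ago" | grep -i etcd

    # Ask etcd itself about member health and who the current leader is
    sudo ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
      --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
      --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
      endpoint status --cluster -w table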
The other thing that led me to this conclusion was my network topology:

kmaster01 is connected directly to the router on its own, while the other two master nodes are connected to a switch.
In hindsight, what I should have realized here is that, since shutting down any master node did not resolve the issue, the problem was not simply on a single node.
So I migrated kmaster01 from opx01 over to min01, which is connected to the same switch as kmaster02 and kmaster03, hoping this would resolve the issue.
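On Proxmox the migration itself is a one-liner (the VM ID below is hypothetical; use whatever qm list shows for kmaster01):

    # Run on the source node (opx01): live-migrate the VM to min01
    qm migrate 100 min01 --online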
And it did not.
Encounter with the mastermind
I figured out the root cause not through some intelligent reasoning, but rather somewhat by luck. I say “somewhat” because I didn’t have to be extremely lucky, but it was still pretty luck-based.
When sshing into kmaster03, I was greeted by this familiar message:
[ssh host key changed warning]
I see this message quite often since I tend to re-use IPs when setting up new VMs and tearing down old ones, so I simply ran the suggested command ssh-keygen -f "/home/jy/.ssh/known_hosts" -R "10.0.69.103" and moved on.
However, the message kept appearing, about once in every three ssh attempts.
I could only think of one possibility: there was another host using the IP 10.0.69.103. That is, of course, assuming no man-in-the-middle attack was actually going on, and I’m very relieved that assumption turned out to be right.
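A duplicate IP like this can be confirmed from another machine on the same LAN, since two different MAC addresses end up answering ARP for the same address (the interface name below is an assumption; adjust to yours):

    # Send a few ARP requests for the IP; replies from two different MACs give the duplicate away
    sudo arping -I eth0 -c 5 10.0.69.103

    # Or watch the kernel's neighbor table flip between two MAC addresses
    ip neigh show 10.0.69.103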
Aside from the network infrastructure, there is another piece of lore in my home lab: I had a mini-PC in my Proxmox cluster that I decommissioned a few days ago, then added back to the cluster to run a new VM. The ssh fingerprint warning reminded me that I used to have a VM on that host with the exact same IP address: 10.0.69.103.
That means the mastermind behind all this was…
me.

Confirming the root cause
So I hopped onto apu01, the host for the VM kmaster03, and ran qm list in hopes of finding the rogue VM.
However, there was only one VM there, which was the new VM I deployed after re-adding the node to the cluster, not the old kmaster03.
[qm list output]
qm list --full also did not show any other VMs.
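Since qm only knows about VMs that have a config file under /etc/pve, another quick sanity check is to list those configs directly; the rogue VM, having been removed from Proxmox’s point of view, would not show up here either:

    # VM configs known to this Proxmox node; an orphaned QEMU process has no file here
    ls /etc/pve/qemu-server/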
ps, on the other hand, showed the culprit:
[ps output showing the old kmaster03 VM process still running]
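If you ever need to hunt for one of these yourself, an orphaned QEMU process is still easy to spot, since Proxmox starts its VMs with the VM ID and disk paths right on the command line:

    # On Proxmox, VM processes usually run as /usr/bin/kvm; the [k] trick keeps grep from matching itself
    ps aux | grep '/usr/bin/[k]vm'

    # The long command line includes the VM ID and disk paths (here, file=/dev/pve/vm-101-disk-0),
    # which is how the rogue VM gave itself away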
Restoring peace
Now it’s time to do the cleanup.
Since the VM was not managed by Proxmox or qm anymore, I resorted to killing the process and removing its disk image manually.
After finding the VM disk path in the ps command output (file=/dev/pve/vm-101-disk-0), all I had to do was:
[commands to kill the process and remove its disk]
where 1785 is the PID of the VM process, and 101 is the VM ID.
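For anyone in the same spot, the cleanup boils down to two commands; a minimal sketch, assuming the disk lives on the default LVM-backed local-lvm storage:

    # Stop the orphaned QEMU process (1785 was the PID from the ps output)
    sudo kill 1785

    # /dev/pve/vm-101-disk-0 is a logical volume in the pve volume group; remove it
    sudo lvremove pve/vm-101-disk-0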
With that, the rogue VM was gone, the ssh fingerprint warning disappeared, and finally, the intermittent failure of Kubernetes API calls was resolved.