Troubleshooting k3s weirdness - cannot add a new node

Recently, I migrated all of my VMs that run my k3s cluster from libvirt to proxmox. I challenged myself to do it without any downtime, and I had been successful, until this happened…

After removing the third master node from the cluster and attempting to add a new one, the installation process hung at this line:

kmaster03: [INFO]  systemd: Starting k3s-agent

Troubleshooting

I left it there for about 10 minutes and nothing happened, so I started digging into journalctl,

journalctl -xeu k3s-agent

where I found a bunch of errors like this:

Feb 06 22:43:06 kmaster02 k3s[512]: {"level":"warn","ts":"2025-02-06T22:43:06.202830-0700","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"ef227cb413536d9b","rtt":"3.085533ms","error":"dial tcp 10.0.69.101:2380: connect: no route to host"}

The IP 10.0.69.101 was the IP of the master node that I had removed from the cluster. Weird. I have removed nodes from kubernetes clusters before, and I’ve never seen the remaining nodes trying to connect to the removed node.

Since I could see no trace of the old node from the control plane, and port 2380 was not a familiar one to me, I suspected that something other than k3s itself was trying to connect to the removed node.
I looked it up and quickly found out it is used for the “etcd server client API” (see: Kubernetes Ports and Protocols).

Solution

From the above information, it was pretty clear that, somehow, the old node was still part of the etcd cluster.
Here is how I removed it from the cluster:

First SSH into one of the remaining master nodes, and install etcdctl (k3s does not come with it by default):

ETCD_VERSION="v3.5.5"
ETCD_URL="https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz"
curl -sL ${ETCD_URL} | tar -zxv --strip-components=1 -C /usr/local/bin

As for the version, I got lazy and just used the latest version listed here. Really, you should check the “Embedded Component Versions” section of the k3s release listed here: k3s releases.

Next, set the environment variables for etcdctl:

export ETCDCTL_ENDPOINTS='https://127.0.0.1:2379'
export ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt'
export ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt'
export ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key'
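Before touching cluster membership, it's worth making sure the environment is actually complete, since etcdctl's TLS errors can be cryptic when a variable is missing. Here is a small guard of my own (the helper name is mine, not part of the original steps):

```shell
# Illustrative guard: verify all four ETCDCTL_* variables are set
# before running any member commands.
check_etcdctl_env() {
  for var in ETCDCTL_ENDPOINTS ETCDCTL_CACERT ETCDCTL_CERT ETCDCTL_KEY; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "missing: $var" >&2
      return 1
    fi
  done
  echo "etcdctl environment looks complete"
}

# With the environment in place, a quick connectivity check:
#   check_etcdctl_env && etcdctl endpoint health
```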

Locate the lingering node’s ID with:

etcdctl member list

Example output:

96b1c2e02ac1, started, kmaster01-f6c3d087, https://10.0.69.101:2380, https://10.0.69.101:2379, false
6cdd8d95dc51014e, started, kmaster03-30aef4cd, https://10.0.69.112:2380, https://10.0.69.112:2379, false
f75a61c7b6416081, started, kmaster02-8cbeac08, https://10.0.69.111:2380, https://10.0.69.111:2379, false
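On a larger cluster you can pull the stale member's ID out by its peer IP instead of eyeballing the list. A small sketch, assuming the default comma-separated output of `etcdctl member list` (the helper name is mine):

```shell
# Print the member ID whose peer URL (4th field) contains a given IP.
# Field layout matches `etcdctl member list` simple output:
#   ID, status, name, peer URL, client URL, is-learner
member_id_for_ip() {
  awk -F', ' -v ip="$1" '$4 ~ ip { print $1 }'
}

# Against the live cluster:
#   etcdctl member list | member_id_for_ip 10.0.69.101
```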

Remove the node:

etcdctl member remove 96b1c2e02ac1

Now go back to the new node that was hanging, and restart the installation process. It should now complete successfully.

Conclusion

Even after resolving the issue, it is still not clear to me why this happened. I had removed nodes before with the same procedure:

  1. Cordon
  2. Drain
  3. Delete node

and every time, the node was automatically removed from the etcd cluster as well.
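For reference, here is that sequence as I would normally run it, wrapped in a helper (the function name and node name are illustrative, not from the original post):

```shell
# The usual node-removal sequence (names illustrative):
remove_node() {
  node="$1"
  kubectl cordon "$node"   # 1. mark unschedulable so no new pods land here
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data  # 2. evict workloads
  kubectl delete node "$node"   # 3. remove the node object from the API server
}

# remove_node kmaster01
```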

I thought this issue might occur again in the future, so I decided to write about it here.
