I had completed my Vault setup in the previous post, done some testing, and was pretty happy with how everything worked together.
However, before trusting it with all my important secrets, I wanted to solidify one thing: backup and recovery.
Environment
- Kubernetes cluster: v1.33.3+k3s1
- ArgoCD: 9.1.4
- HashiCorp Vault: 0.31.0
- External Secrets Operator: 1.2.0
Automated backups
Vault conveniently has a simple command to back up the data stored in its integrated storage (Raft) backend, which is what I'm using.
vault operator raft snapshot save [FILE_NAME]
This is trivial to automate with a CronJob in my Kubernetes cluster.
The tricky part was authenticating the CronJob with Vault. Since I didn't want to use root tokens or be bothered with renewing tokens, I decided to use Vault's Kubernetes authentication method.
Vault configuration
On the Vault side, I created a policy that allows reading the snapshot path, and an associated role that binds to a specific service account in Kubernetes.
Pretty much the same as what I did for the admin user in the previous post.
ROOT_TOKEN="[ROOT_TOKEN]"
vault login $ROOT_TOKEN

# Policy that only allows taking Raft snapshots
vault policy write backup-policy - <<EOF
path "sys/storage/raft/snapshot" {
  capabilities = ["read"]
}
EOF

# Role that binds the policy to the backup service account
vault write auth/kubernetes/role/backup \
  bound_service_account_names=vault-backup \
  bound_service_account_namespaces=backup \
  policies=backup-policy \
  audience="vault-bck" \
  ttl=1h
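This assumes the Kubernetes auth method itself is already enabled and pointed at the cluster (mine was, from the previous post's setup). On a fresh Vault running in-cluster, that part would look roughly like this; Vault uses its own pod's service account token to talk to the API server, so only the API server address is strictly needed:

vault auth enable kubernetes
vault write auth/kubernetes/config \
  kubernetes_host="https://kubernetes.default.svc:443"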
Cronjob
To use the Kubernetes auth method, I set up a service account for the CronJob.
The CronJob uses the service account token to authenticate with Vault, gets a Vault token back, and then runs the snapshot command.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vault-backup
  namespace: backup
Aside from the authentication rituals, the backup script is pretty simple: it takes a snapshot, saves it to a mounted volume, and cleans up old backups.
configmap.yml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-backup-config
  namespace: backup
data:
  VAULT_ADDR: "http://hashi-vault-active.hashi-vault.svc.cluster.local:8200"
  VAULT_ROLE: "backup"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-backup-script
  namespace: backup
data:
  backup.sh: |
    #!/bin/sh
    set -e

    echo "Starting Vault Raft snapshot backup"

    BACKUP_DIR="/backup"
    BACKUP_FILE="${BACKUP_DIR}/vault-snapshot-$(date +%Y%m%d-%H%M%S).snap"

    # Authenticate using Kubernetes service account
    echo "Authenticating to Vault using Kubernetes auth..."
    K8S_TOKEN=$(cat /var/run/secrets/vault/token)
    VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login \
      role=${VAULT_ROLE} \
      jwt=${K8S_TOKEN})
    export VAULT_TOKEN

    # Take the snapshot
    echo "Taking snapshot..."
    vault operator raft snapshot save "${BACKUP_FILE}"
    echo "Snapshot saved to ${BACKUP_FILE}"
    ls -lh "${BACKUP_FILE}"

    # Clean up backups older than 7 days
    echo "Cleaning up old backups..."
    find ${BACKUP_DIR} -name "vault-snapshot-*.snap" -mtime +7 -delete

    echo "Current backups:"
    ls -lh ${BACKUP_DIR}/vault-snapshot-*.snap 2>/dev/null || echo "No backups found"

    echo "Backup completed successfully"
The CronJob itself:
cronjob.yml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-backup
  namespace: backup
spec:
  schedule: "0 3 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vault-backup
          restartPolicy: OnFailure
          containers:
            - name: vault-backup
              image: hashicorp/vault:1.20.4
              command:
                - /bin/sh
                - /scripts/backup.sh
              envFrom:
                - configMapRef:
                    name: vault-backup-config
              volumeMounts:
                - name: backup-script
                  mountPath: /scripts
                - name: backup-storage
                  mountPath: /backup
                  subPath: hashi-vault
                - name: vault-token
                  mountPath: /var/run/secrets/vault
                  readOnly: true
          volumes:
            - name: backup-script
              configMap:
                name: vault-backup-script
                defaultMode: 0755
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup
            - name: vault-token
              projected:
                sources:
                  - serviceAccountToken:
                      path: token
                      expirationSeconds: 600
                      audience: vault-bck
storage.yml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: backup
  namespace: backup
spec:
  storageClassName: csi-cephfs-sc
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 1Gi
  csi:
    driver: cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: csi-cephfs-secret
      namespace: ceph-csi-cephfs
    volumeAttributes:
      "fsName": "bdvault"
      "clusterID": "[CLUSTER_ID]"
      "staticVolume": "true"
      "rootPath": /backup/
    volumeHandle: backup
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backup
  namespace: backup
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: "csi-cephfs-sc"
  volumeMode: Filesystem
  volumeName: backup
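Rather than waiting for the 3 AM schedule, the whole chain can be tested right away with a one-off Job created from the CronJob (the job name here is arbitrary):

# confirm the PVC bound first
k -n backup get pvc backup

# run the backup once, outside the schedule, and follow its logs
k -n backup create job --from=cronjob/vault-backup vault-backup-manual
k -n backup logs -f job/vault-backup-manual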
Recovery test
Burndown for testing
To test the recovery process, I first demolished the data in the existing Vault setup.
After stopping auto-sync in ArgoCD, I deleted the Vault StatefulSet and its persistent volume claims:
kcs hashi-vault # change to vault namespace
k delete statefulsets.apps hashi-vault
k delete pvc data-hashi-vault-0 data-hashi-vault-1 data-hashi-vault-2
Then I turned auto-sync back on in ArgoCD to recreate the Vault StatefulSet, which brought up a fresh Vault instance ("Temp Vault" for convenience) with no data.
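A quick way to confirm the fresh instance really is empty is to check its status, which should report it as uninitialized and sealed:

k exec hashi-vault-0 -- vault status  # exits non-zero while sealed, but still prints the status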
Restoration
Before restoring, it was necessary to initialize and unseal Temp Vault.
cd ~/tmp # my temp dir
kcs hashi-vault
k exec hashi-vault-0 -- vault operator init \
  -key-shares=1 \
  -key-threshold=1 \
  -format=json > vault-keys.json
k exec hashi-vault-0 -- vault operator unseal $(jq -r '.unseal_keys_hex[0]' vault-keys.json)
Restoring from the backup was pretty straightforward: just log in as the root user and run the restore command with the snapshot file.
k cp /mnt/bd/backup/hashi-vault/vault-snapshot-20251224-215949.snap hashi-vault-0:/tmp/bck.snap
k exec -ti hashi-vault-0 -- sh
# inside the pod shell:
vault login [ROOT_TOKEN]
vault operator raft snapshot restore -force /tmp/bck.snap
[ROOT_TOKEN] is the root token generated for the fresh Temp Vault instance, and this is the last time anything from Temp Vault is used: once the snapshot is restored, the tokens and unseal keys from the original cluster apply again.
At this point, the hashi-vault-0 pod was fully restored from the backup and running healthily, which I could confirm with kubectl get pod and in the Vault UI.
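If a token from the original cluster is still at hand, the restored data can also be spot-checked from the CLI inside the pod; the secrets engines and auth methods from before the wipe should all be back ([ORIGINAL_ROOT_TOKEN] is a placeholder for a token from the old cluster, not Temp Vault):

vault login [ORIGINAL_ROOT_TOKEN]
vault secrets list
vault auth list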
The last step was to rejoin the other two pods to the restored pod.
k exec hashi-vault-1 -- vault operator raft join http://hashi-vault-0.hashi-vault-internal:8200
k exec hashi-vault-2 -- vault operator raft join http://hashi-vault-0.hashi-vault-internal:8200
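Depending on the seal setup, the rejoined pods may still need to be unsealed with the unseal key from the original cluster, since the restored data is encrypted with the original master key. To confirm all three nodes are back in the Raft cluster, from hashi-vault-0 (where the root login from the restore step is still cached in ~/.vault-token):

k exec hashi-vault-0 -- vault operator raft list-peers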
ClusterSecretStore
The true last step was to verify that the External Secrets Operator had recovered access to the secrets stored in Vault.
When I first checked, it showed:
$ k get clustersecretstores.external-secrets.io
NAME    AGE   STATUS    CAPABILITIES   READY
vault   26h   Invalid   ReadWrite      false
So I restarted the ESO deployment to force a reload,
k -n external-secrets rollout restart deployment external-secrets-operator
And it became healthy again.
$ k get clustersecretstores.external-secrets.io
NAME    AGE   STATUS   CAPABILITIES   READY
vault   26h   Valid    ReadWrite      True
I'm not sure the restart was necessary; it would probably have resolved itself eventually.
Then I deleted the secret created by ESO (see the previous post) to see if it would be recreated.
k -n test delete secret external
It was recreated immediately, and the secret was still accessible after restarting the pod.
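As a final check, the ExternalSecret resource itself can be inspected; a healthy one should report a SecretSynced status and be Ready (the exact resource name comes from the previous post's setup, so listing the whole namespace here):

k -n test get externalsecrets.external-secrets.io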