
Automated backup and recovery of HashiCorp Vault

I had completed my Vault setup in the previous post, done some testing, and was pretty happy with how everything worked together.

However, before trusting it with all my important secrets, I wanted to solidify one thing: backup and recovery.

Environment

  1. Kubernetes cluster: v1.33.3+k3s1
  2. ArgoCD: 9.1.4
  3. HashiCorp Vault: 0.31.0
  4. External Secrets Operator: 1.2.0

Automated backups

Vault conveniently has a simple command to back up the data stored in its integrated storage (Raft) backend, which is what I’m using.

vault operator raft snapshot save [FILE_NAME]

This is trivial to automate with a CronJob in my Kubernetes cluster.

The tricky part was authenticating the CronJob with Vault. Since I didn’t want to use root tokens or be bothered with renewing them, I decided to use the Kubernetes authentication method in Vault.
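In my case the Kubernetes auth method was already enabled and configured as part of the previous post’s setup. If it weren’t, enabling it and pointing it at the cluster would look roughly like this (the kubernetes_host value here is an assumption for a Vault running in-cluster):

# Enable the Kubernetes auth method and point it at the cluster API server
vault auth enable kubernetes
vault write auth/kubernetes/config \
  kubernetes_host="https://kubernetes.default.svc:443"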

Vault configuration

On the Vault side, I created a policy that allows reading the snapshot path, and an associated role that binds to a specific service account in Kubernetes.
Pretty much the same as what I did for the admin user in the previous post.

ROOT_TOKEN="[ROOT_TOKEN]"
vault login $ROOT_TOKEN

vault policy write backup-policy - <<EOF
path "sys/storage/raft/snapshot" {
    capabilities = ["read"]
}
EOF

vault write auth/kubernetes/role/backup \
  bound_service_account_names=vault-backup \
  bound_service_account_namespaces=backup \
  policies=backup-policy \
  audience="vault-bck" \
  ttl=1h
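As a quick sanity check, the policy and role can be read back:

vault policy read backup-policy
vault read auth/kubernetes/role/backup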

CronJob

To use the Kubernetes auth method, I set up a service account for the CronJob.

The CronJob uses the service account token to authenticate with Vault, gets a Vault token, and then runs the snapshot command.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: vault-backup
  namespace: backup

Aside from the authentication rituals, the backup script is pretty simple: it takes a snapshot, saves it to a mounted volume, and cleans up old backups.

configmap.yml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-backup-config
  namespace: backup
data:
  VAULT_ADDR: "http://hashi-vault-active.hashi-vault.svc.cluster.local:8200"
  VAULT_ROLE: "backup"

---

apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-backup-script
  namespace: backup
data:
  backup.sh: |
    #!/bin/sh
    set -e

    echo "Starting Vault Raft snapshot backup"
    BACKUP_DIR="/backup"
    BACKUP_FILE="${BACKUP_DIR}/vault-snapshot-$(date +%Y%m%d-%H%M%S).snap"

    # Authenticate using Kubernetes service account
    echo "Authenticating to Vault using Kubernetes auth..."
    K8S_TOKEN=$(cat /var/run/secrets/vault/token)
    VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login \
      role=${VAULT_ROLE} \
      jwt=${K8S_TOKEN})

    export VAULT_TOKEN

    # Take the snapshot
    echo "Taking snapshot..."
    vault operator raft snapshot save "${BACKUP_FILE}"

    echo "Snapshot saved to ${BACKUP_FILE}"
    ls -lh "${BACKUP_FILE}"

    # Clean up backups older than 7 days
    echo "Cleaning up old backups..."
    find ${BACKUP_DIR} -name "vault-snapshot-*.snap" -mtime +7 -delete

    echo "Current backups:"
    ls -lh ${BACKUP_DIR}/vault-snapshot-*.snap 2>/dev/null || echo "No backups found"

    echo "Backup completed successfully"

The cronjob itself:

cronjob.yml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-backup
  namespace: backup
spec:
  schedule: "0 3 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vault-backup
          restartPolicy: OnFailure
          containers:
          - name: vault-backup
            image: hashicorp/vault:1.20.4
            command:
            - /bin/sh
            - /scripts/backup.sh
            envFrom:
            - configMapRef:
                name: vault-backup-config
            volumeMounts:
            - name: backup-script
              mountPath: /scripts
            - name: backup-storage
              mountPath: /backup
              subPath: hashi-vault
            - name: vault-token
              mountPath: /var/run/secrets/vault
              readOnly: true
          volumes:
          - name: backup-script
            configMap:
              name: vault-backup-script
              defaultMode: 0755
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup
          - name: vault-token
            projected:
              sources:
              - serviceAccountToken:
                  path: token
                  expirationSeconds: 600
                  audience: vault-bck
The snapshots land on a static CephFS-backed volume:

storage.yml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: backup
  namespace: backup
spec:
  storageClassName: csi-cephfs-sc
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 1Gi
  csi:
    driver: cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: csi-cephfs-secret
      namespace: ceph-csi-cephfs
    volumeAttributes:
      "fsName": "bdvault"
      "clusterID": "[CLUSTER_ID]"
      "staticVolume": "true"
      "rootPath": /backup/
    volumeHandle: backup
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backup
  namespace: backup
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: "csi-cephfs-sc"
  volumeMode: Filesystem
  volumeName: backup
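Rather than waiting for the 3 AM schedule, the whole chain can be tested by creating a one-off Job from the CronJob (the job name here is arbitrary):

k -n backup create job --from=cronjob/vault-backup vault-backup-manual
k -n backup logs job/vault-backup-manual -f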

Recovery test

Burndown for testing

To test the recovery process, I first demolished the data in the existing Vault setup.

After stopping auto-sync in ArgoCD, I deleted the Vault StatefulSet and its PVCs:

kcs hashi-vault # change to vault namespace
k delete statefulsets.apps hashi-vault
k delete pvc data-hashi-vault-0 data-hashi-vault-1 data-hashi-vault-2

Then I turned auto-sync back on in ArgoCD to recreate the Vault StatefulSet, which brought up a fresh Vault instance (calling it “Temp Vault” for convenience) with no data.
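The fresh instance comes up uninitialized and sealed, which vault status confirms (it reports the Initialized and Sealed fields):

k exec hashi-vault-0 -- vault status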

Restoration

Before restoring, it was necessary to initialize and unseal Temp Vault.

cd ~/tmp # my temp dir
kcs hashi-vault
k exec hashi-vault-0 -- vault operator init \
  -key-shares=1 \
  -key-threshold=1 \
  -format=json > vault-keys.json

k exec hashi-vault-0 -- vault operator unseal $(jq -r '.unseal_keys_hex[0]' vault-keys.json)

Restoring from the backup was pretty straightforward: log in as the root user and run the restore command against the snapshot file.

k cp /mnt/bd/backup/hashi-vault/vault-snapshot-20251224-215949.snap hashi-vault-0:/tmp/bck.snap
k exec -ti hashi-vault-0 -- sh
vault login [ROOT_TOKEN]
vault operator raft snapshot restore -force /tmp/bck.snap

Note: [ROOT_TOKEN] is the root token generated for the fresh Vault instance. This is the last time anything from the Temp Vault is used.

At this point, the hashi-vault-0 pod was fully restored from the backup and running healthily. I could confirm it with kubectl get pod and the Vault UI.
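For example, a quick check from the CLI:

k get pod hashi-vault-0
k exec hashi-vault-0 -- vault status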

The last step was to rejoin the other two pods to the restored pod.

k exec hashi-vault-1 -- vault operator raft join http://hashi-vault-0.hashi-vault-internal:8200
k exec hashi-vault-2 -- vault operator raft join http://hashi-vault-0.hashi-vault-internal:8200
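To confirm all three nodes are back in the Raft cluster, the peer list can be checked (this reuses the root login from the restore step):

k exec hashi-vault-0 -- vault operator raft list-peers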

ClusterSecretStore

The true last step was to verify that the External Secrets Operator had recovered access to the secrets stored in Vault.

When I first checked, it showed:

$ k get clustersecretstores.external-secrets.io
NAME    AGE   STATUS     CAPABILITIES   READY
vault   26h   Invalid    ReadWrite      false

So I restarted the ESO deployment to force a reload,

k -n external-secrets rollout restart deployment external-secrets-operator

And it became healthy again.

$ k get clustersecretstores.external-secrets.io
NAME    AGE   STATUS   CAPABILITIES   READY
vault   26h   Valid    ReadWrite      True

I’m not sure the restart was strictly necessary; it would probably have resolved itself eventually.

Then I deleted the secret created by ESO (see the previous post) to see if it would be recreated.

k -n test delete secret external

It was recreated immediately, and the secret was still accessible after restarting the pod.
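For example, to check that it came back:

k -n test get secret external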
