
Automated backup and recovery of HashiCorp Vault

I had completed my Vault setup in the previous post, done some testing, and was pretty happy with how everything worked together.

However, before trusting it with all my important secrets, I wanted to solidify one thing: backup and recovery.

Environment

  1. Kubernetes cluster: v1.33.3+k3s1
  2. ArgoCD: 9.1.4
  3. HashiCorp Vault: 0.31.0
  4. External Secrets Operator: 1.2.0

Automated backups

Vault conveniently has a simple command to back up the data stored in its integrated storage (Raft) backend, which is what I’m using.

vault operator raft snapshot save [FILE_NAME]

This is trivial to automate with a CronJob in my Kubernetes cluster.

The tricky part was authenticating the CronJob with Vault. Since I didn’t want to use root tokens or be bothered with renewing them, I decided to use the Kubernetes authentication method in Vault.
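In my case the Kubernetes auth method was already enabled and configured as part of the previous post’s setup. If it weren’t, enabling it and pointing it at the cluster would look roughly like this (the kubernetes_host value here is an assumption for a Vault running in-cluster):

# Enable the Kubernetes auth method and point it at the cluster API server
vault auth enable kubernetes
vault write auth/kubernetes/config \
  kubernetes_host="https://kubernetes.default.svc:443"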

Vault configuration

On the Vault side, I created a policy that allows reading the snapshot path, and an associated role that binds to a specific service account in Kubernetes.
Pretty much the same as what I did for the admin user in the previous post.

ROOT_TOKEN="[ROOT_TOKEN]"
vault login $ROOT_TOKEN

vault policy write backup-policy - <<EOF
path "sys/storage/raft/snapshot" {
    capabilities = ["read"]
}
EOF

vault write auth/kubernetes/role/backup \
  bound_service_account_names=vault-backup \
  bound_service_account_namespaces=backup \
  policies=backup-policy \
  audience="vault-bck" \
  ttl=1h
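As a quick sanity check, the policy and role can be read back:

vault policy read backup-policy
vault read auth/kubernetes/role/backup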

CronJob

To use the Kubernetes auth method, I set up a service account for the CronJob.

The CronJob uses the service account token to authenticate with Vault, gets a Vault token, and then runs the snapshot command.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: vault-backup
  namespace: backup

Aside from the authentication rituals, the backup script is pretty simple: it takes a snapshot, saves it to a mounted volume, and cleans up old backups.

configmap.yml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-backup-config
  namespace: backup
data:
  VAULT_ADDR: "http://hashi-vault-active.hashi-vault.svc.cluster.local:8200"
  VAULT_ROLE: "backup"

---

apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-backup-script
  namespace: backup
data:
  backup.sh: |
    #!/bin/sh
    set -e

    echo "Starting Vault Raft snapshot backup"
    BACKUP_DIR="/backup"
    BACKUP_FILE="${BACKUP_DIR}/vault-snapshot-$(date +%Y%m%d-%H%M%S).snap"

    # Authenticate using Kubernetes service account
    echo "Authenticating to Vault using Kubernetes auth..."
    K8S_TOKEN=$(cat /var/run/secrets/vault/token)
    VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login \
      role=${VAULT_ROLE} \
      jwt=${K8S_TOKEN})

    export VAULT_TOKEN

    # Take the snapshot
    echo "Taking snapshot..."
    vault operator raft snapshot save "${BACKUP_FILE}"

    echo "Snapshot saved to ${BACKUP_FILE}"
    ls -lh "${BACKUP_FILE}"

    # Clean up backups older than 7 days
    echo "Cleaning up old backups..."
    find ${BACKUP_DIR} -name "vault-snapshot-*.snap" -mtime +7 -delete

    echo "Current backups:"
    ls -lh ${BACKUP_DIR}/vault-snapshot-*.snap 2>/dev/null || echo "No backups found"

    echo "Backup completed successfully"

The cronjob itself:

cronjob.yml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-backup
  namespace: backup
spec:
  schedule: "0 3 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vault-backup
          restartPolicy: OnFailure
          containers:
          - name: vault-backup
            image: hashicorp/vault:1.20.4
            command:
            - /bin/sh
            - /scripts/backup.sh
            envFrom:
            - configMapRef:
                name: vault-backup-config
            volumeMounts:
            - name: backup-script
              mountPath: /scripts
            - name: backup-storage
              mountPath: /backup
              subPath: hashi-vault
            - name: vault-token
              mountPath: /var/run/secrets/vault
              readOnly: true
          volumes:
          - name: backup-script
            configMap:
              name: vault-backup-script
              defaultMode: 0755
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup
          - name: vault-token
            projected:
              sources:
              - serviceAccountToken:
                  path: token
                  expirationSeconds: 600
                  audience: vault-bck
The snapshots land on a static CephFS-backed volume:

storage.yml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: backup
  namespace: backup
spec:
  storageClassName: csi-cephfs-sc
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 1Gi
  csi:
    driver: cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: csi-cephfs-secret
      namespace: ceph-csi-cephfs
    volumeAttributes:
      "fsName": "bdvault"
      "clusterID": "[CLUSTER_ID]"
      "staticVolume": "true"
      "rootPath": /backup/
    volumeHandle: backup
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backup
  namespace: backup
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: "csi-cephfs-sc"
  volumeMode: Filesystem
  volumeName: backup
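Rather than waiting for the 3 AM schedule, the whole chain can be tested by creating a one-off Job from the CronJob (the job name here is arbitrary):

k -n backup create job --from=cronjob/vault-backup vault-backup-manual
k -n backup logs job/vault-backup-manual -f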

Recovery test

Burndown for testing

To test the recovery process, I first demolished the data in the existing Vault setup.

After stopping auto-sync in ArgoCD, I deleted the Vault StatefulSet and its PVCs:

kcs hashi-vault # change to vault namespace
k delete statefulsets.apps hashi-vault
k delete pvc data-hashi-vault-0 data-hashi-vault-1 data-hashi-vault-2

Then I turned auto-sync back on in ArgoCD to recreate the Vault StatefulSet, which brought up a fresh Vault instance (calling it “Temp Vault” for convenience) with no data.
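The fresh instance comes up uninitialized and sealed, which vault status confirms (it reports the Initialized and Sealed fields):

k exec hashi-vault-0 -- vault status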

Restoration

Before restoring, it was necessary to initialize and unseal Temp Vault.

cd ~/tmp # my temp dir
kcs hashi-vault
k exec hashi-vault-0 -- vault operator init \
  -key-shares=1 \
  -key-threshold=1 \
  -format=json > vault-keys.json

k exec hashi-vault-0 -- vault operator unseal $(jq -r '.unseal_keys_hex[0]' vault-keys.json)

Restoring from the backup was pretty straightforward: log in as the root user and run the restore command against the snapshot file.

k cp /mnt/bd/backup/hashi-vault/vault-snapshot-20251224-215949.snap hashi-vault-0:/tmp/bck.snap
k exec -ti hashi-vault-0 -- sh
vault login [ROOT_TOKEN]
vault operator raft snapshot restore -force /tmp/bck.snap

Note: [ROOT_TOKEN] is the root token generated for the fresh Vault instance. This is the last time anything from the Temp Vault is used.

At this point, the hashi-vault-0 pod was fully restored from the backup and running healthily. I could confirm it with kubectl get pod and the Vault UI.
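For example, a quick check from the CLI:

k get pod hashi-vault-0
k exec hashi-vault-0 -- vault status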

The last step was to rejoin the other two pods to the restored pod.

k exec hashi-vault-1 -- vault operator raft join http://hashi-vault-0.hashi-vault-internal:8200
k exec hashi-vault-2 -- vault operator raft join http://hashi-vault-0.hashi-vault-internal:8200
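To confirm all three nodes are back in the Raft cluster, the peer list can be checked (this reuses the root login from the restore step):

k exec hashi-vault-0 -- vault operator raft list-peers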

ClusterSecretStore

The true last step was to verify that the External Secrets Operator had recovered access to the secrets stored in Vault.

When I first checked, it showed:

$ k get clustersecretstores.external-secrets.io
NAME    AGE   STATUS     CAPABILITIES   READY
vault   26h   Invalid    ReadWrite      false

So I restarted the ESO deployment to force a reload,

k -n external-secrets rollout restart deployment external-secrets-operator

And it became healthy again.

$ k get clustersecretstores.external-secrets.io
NAME    AGE   STATUS   CAPABILITIES   READY
vault   26h   Valid    ReadWrite      True

I’m not sure the restart was strictly necessary; it would probably have resolved itself eventually.

Then I deleted the secret created by ESO (see the previous post) to see if it would be recreated.

k -n test delete secret external

It was recreated immediately, and the secret was still accessible after restarting the pod.
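For example, to check that it came back:

k -n test get secret external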
