I have set up Ceph RGW to provide object storage for my self-hosted applications. Naturally, I want a reliable backup strategy to ensure all that data is safe.
Although Ceph is designed for redundancy and fault tolerance, its primary purpose is high availability, not backup. For instance, if I lose the Ceph configuration and am left with only the raw data, recovering the RGW data can be very challenging.
So the end goal is to periodically back up all RGW data to non-Ceph storage, such as an external ZFS pool. For now, since I haven’t set up such a pool yet, I’m just simulating the backup process by copying data to a CephFS directory.
If a bucket contains an absurdly large number of objects, backing up to a file system might introduce performance issues or even hit inode limits. In my case I only plan to use RGW for a handful of services, so this approach should be good enough (at least for a while).
Environment
- Ceph: 19.2.3 Squid, installed via Proxmox APT repository
- Kubernetes: v1.33.4+k3s1
Backup cronjob
My favorite way to run scheduled tasks is still a Kubernetes CronJob.
I decided to use the AWS CLI to interact with RGW: RGW is compatible with the S3 API, and the AWS CLI has an official container image readily available.
The goal is to keep a fairly recent copy of each RGW bucket in the backup storage. Since I’m not aiming for something like point-in-time recovery, I decided to simply use aws s3 sync to perform incremental backups.
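Conceptually, the job just runs an incremental sync per bucket. Doing the same thing by hand looks roughly like this (the endpoint matches my setup; the bucket name is just an example):

# Credentials for an RGW user that can read the buckets
export AWS_ACCESS_KEY_ID=<access-key-id>
export AWS_SECRET_ACCESS_KEY=<secret-access-key>

# List buckets on the RGW endpoint, then incrementally sync one of them
aws s3 ls --endpoint-url https://rgw.i.junyi.me
aws s3 sync --endpoint-url https://rgw.i.junyi.me s3://gitlab-backups /backup/gitlab-backups --delete

The CronJob below just wraps this in a loop over every bucket.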
script-cm.yml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rgw-backup-script
  namespace: backup
data:
  backup.sh: |
    #!/bin/sh
    set -e
    echo "Starting Ceph RGW backup"
    BACKUP_DIR="/backup"
    # sync all buckets incrementally
    aws s3 ls --endpoint-url "${RGW_ENDPOINT}" | awk '{print $3}' | while read -r bucket; do
      if [ -n "$bucket" ]; then
        echo "Syncing bucket: ${bucket}"
        # --delete removes local copies of objects that no longer exist in the bucket
        aws s3 sync --endpoint-url "${RGW_ENDPOINT}" "s3://${bucket}" "${BACKUP_DIR}/${bucket}" --delete
      fi
    done
    echo "Backup completed"
    du -sh "${BACKUP_DIR}"
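The awk '{print $3}' part relies on the output format of aws s3 ls, which prints one bucket per line as a creation date, time, and bucket name, so the third whitespace-separated field is the name. An illustrative listing (timestamps are made up):

$ aws s3 ls --endpoint-url https://rgw.i.junyi.me
2025-09-01 04:00:02 gitlab-backups
2025-09-01 04:00:02 registry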
cronjob.yml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rgw-backup
  namespace: backup
spec:
  schedule: "0 4 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: rgw-backup
              image: amazon/aws-cli:latest
              command:
                - /bin/sh
                - /scripts/backup.sh
              envFrom:
                - secretRef:
                    name: rgw-admin
              volumeMounts:
                - name: backup-script
                  mountPath: /scripts
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: backup-script
              configMap:
                name: rgw-backup-script
                defaultMode: 0755
            - name: backup-storage
              persistentVolumeClaim:
                claimName: rgw-bck
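To test the job without waiting for the schedule, a one-off Job can be created from the CronJob (the job name here is arbitrary):

kubectl create job --from=cronjob/rgw-backup rgw-backup-manual -n backup
kubectl logs -n backup job/rgw-backup-manual -f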
secret.yml
apiVersion: v1
kind: Secret
metadata:
  name: rgw-admin
  namespace: backup
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: <access-key-id>
  AWS_SECRET_ACCESS_KEY: <secret-access-key>
  RGW_ENDPOINT: https://rgw.i.junyi.me
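The access key pair belongs to an RGW user that has read access to all the buckets being backed up. If you need a fresh key pair, radosgw-admin can create a user and print its keys (the uid and display name below are just examples):

radosgw-admin user create --uid=backup --display-name="RGW backup user"
# The JSON output contains the "access_key" and "secret_key" that go into the Secret above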
volume.yml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: rgw-bck
spec:
  storageClassName: csi-cephfs-sc
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 1Gi
  csi:
    driver: cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: csi-cephfs-secret
      namespace: ceph-csi-cephfs
    volumeAttributes:
      "fsName": "<fs-name>"
      "clusterID": "<cluster-id>"
      "staticVolume": "true"
      "rootPath": /backup/rgw
    volumeHandle: rgw-bck
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rgw-bck
  namespace: backup
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: "csi-cephfs-sc"
  volumeMode: Filesystem
  volumeName: rgw-bck
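One gotcha with statically provisioned CephFS volumes: the rootPath directory must already exist on the filesystem, since the CSI driver does not create it for static volumes. I created it from a host that has the CephFS filesystem mounted (at /mnt/bd in my case):

# On a host with the CephFS filesystem mounted at /mnt/bd
mkdir -p /mnt/bd/backup/rgw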
Conclusion
I now have an automated backup job that dumps all my buckets in the following fashion.
jy@opx02:/mnt/bd/backup/rgw$ ls
git-lfs gitlab-backups gitlab-dependency-proxy gitlab-packages gitlab-terraform-state registry tmp
gitlab-artifacts gitlab-ci-secure-files gitlab-mr-diffs gitlab-pages gitlab-uploads runner-cache
Since I now have object storage in place, and it will be backed up properly in the future, I can finally configure CNPG to back up to object storage instead of relying on a custom pg_dump job.
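As a rough idea of where that is heading, CNPG can ship base backups and WAL files to an S3-compatible store through its barmanObjectStore settings. A minimal sketch, assuming a dedicated bucket and credentials Secret (both names here are hypothetical):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db
spec:
  instances: 3
  storage:
    size: 10Gi
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://cnpg-backups   # hypothetical bucket on RGW
      endpointURL: https://rgw.i.junyi.me
      s3Credentials:
        accessKeyId:
          name: cnpg-rgw-creds             # hypothetical Secret
          key: AWS_ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-rgw-creds
          key: AWS_SECRET_ACCESS_KEY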