There are various ways to back up and restore Percona Server for MongoDB clusters when you run them on Kubernetes. Percona Operator for MongoDB uses Percona Backup for MongoDB (PBM) to take physical and logical backups, continuously upload oplogs to object storage, and manage the backup lifecycle.
Cloud providers and various storage solutions offer the capability to create volume snapshots. Snapshots are useful for owners of large data sets with terabytes of data, as they let you rely on efficient storage-level copies to recover data faster. In this blog post, we are going to look at how you can back up and restore MongoDB clusters managed by the Operator with snapshots. This is a proof of concept that will be fully automated in future releases of the Operator.
The goal
- Take the snapshot
  - Prepare the cluster for backup with Percona Backup for MongoDB (PBM)
  - Leverage Kubernetes Volume Snapshots that, on the infrastructure level, trigger cloud volume snapshots (for example, AWS EBS snapshots)
- Recover to the new cluster using these snapshots
Consistency considerations
Snapshots alone don’t guarantee data consistency. On clusters that receive a lot of writes, you might find that not all data has been written to disk at the moment the snapshot is taken.
This is where Percona Backup for MongoDB, which we use for backups and restores in the Operator, steps in. It provides the interface for making snapshot-based physical backups and restores and ensures data consistency. As a result, database owners benefit from increased performance and reduced downtime while staying sure that their data remains consistent.
Set it up
All manifests and other configuration files used in this blog post are stored in the blog-data/mongo-k8s-volume-snapshots git repository.
Prepare Percona Server for MongoDB
Deploy the Percona Operator for MongoDB in your preferred way. I will use plain kubectl and version 1.17.0 (the latest at the time of writing):
kubectl apply -f https://raw.githubusercontent.com/percona/percona-server-mongodb-operator/refs/tags/v1.17.0/deploy/bundle.yaml
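Before moving on, you can check that the Operator is up. The deployment name below is the one created by the bundle manifest:

kubectl get deployment percona-server-mongodb-operator
kubectl wait --for=condition=Available deployment/percona-server-mongodb-operator --timeout=120s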
RPO considerations
Restoring from snapshots alone provides quite a poor Recovery Point Objective (RPO), as it depends on how often you take the snapshots. To improve the RPO, we are going to upload oplogs to the object storage.
There is a special flag, spec.backup.pitr.oplogOnly, that enables uploading only oplogs to the object storage. The backup section in the Custom Resource manifest would look like this:
backup:
  enabled: true
  image: percona/percona-backup-mongodb:2.5.0
  pitr:
    enabled: true
    oplogOnly: true
    compressionType: gzip
    compressionLevel: 6
  storages:
    sp-test:
      type: s3
      s3:
        bucket: BUCKET
        credentialsSecret: SECRET_WITH_KEYS
        endpointUrl: OBJECT_STORAGE_URL
Read more about backup configuration in our documentation.
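The credentialsSecret referenced in the storage configuration has to exist before PBM can reach the bucket. A minimal sketch of creating it, assuming the standard AWS-style key names described in the documentation (SECRET_WITH_KEYS and the placeholder values are yours to replace):

kubectl create secret generic SECRET_WITH_KEYS \
  --from-literal=AWS_ACCESS_KEY_ID=<your-access-key-id> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<your-secret-access-key>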
Apply the custom resource:
kubectl apply -f 00-cr.yaml
Take the snapshots
Volume Snapshot Class
To create snapshots, you need to have a Volume Snapshot Class. I’m running my experiments on GKE, and my snapshot class looks like this:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: gke-snapshot-class
driver: pd.csi.storage.gke.io
deletionPolicy: Delete
The driver depends on your cloud or storage provider. Create the snapshot class:
kubectl apply -f 02-snapshot-class.yaml
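You can verify that the class is registered and see which snapshot classes your cluster already provides:

kubectl get volumesnapshotclass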
Prepare the cluster for backup
To prepare the cluster for backup, we need to run a short Percona Backup for MongoDB command – pbm – as described in the documentation. To do that, we will exec into one of the PBM containers:
% kubectl get pods
NAME                                                READY   STATUS    RESTARTS   AGE
demo-cluster1-rs0-0                                 2/2     Running   0          117m
demo-cluster1-rs0-1                                 2/2     Running   0          116m
demo-cluster1-rs0-2                                 2/2     Running   0          115m
percona-server-mongodb-operator-8664c5b8fc-4mdkv    1/1     Running   0          155m

% kubectl exec -ti demo-cluster1-rs0-0 -c backup-agent bash
Now let’s run a pbm command that will prepare the cluster for backup – it opens the backup cursor and stores the metadata on the disk:
$ pbm backup -t external
Starting backup '2024-09-25T10:54:24Z'......Ready to copy data from:
  - demo-cluster1-rs0-1.demo-cluster1-rs0.default.svc.cluster.local:27017
After the copy is done, run:
  pbm backup-finish 2024-09-25T10:54:24Z
The output of this command tells you which node to use for a snapshot. In my case, it is demo-cluster1-rs0-1.demo-cluster1-rs0.default.svc.cluster.local:27017.
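If you want to double-check the backup state while the cursor is open, PBM reports it as an in-progress external backup. The commands below are standard PBM CLI calls; the exact output depends on your PBM version:

pbm status
pbm describe-backup 2024-09-25T10:54:24Z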
Take the snapshots
In Kubernetes, you can create a snapshot of a Persistent Volume through a VolumeSnapshot resource. It should reference both a Persistent Volume Claim (PVC) and a Volume Snapshot Class. For example (see 02-snapshot.yaml):
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snapshot-demo-cluster1-rs0-1
spec:
  volumeSnapshotClassName: gke-snapshot-class
  source:
    persistentVolumeClaimName: mongod-data-demo-cluster1-rs0-1
You will create a single snapshot per replica set. Apply the manifest to create the snapshot:
kubectl apply -f 02-snapshot.yaml
Check if snapshots were created:
% kubectl get volumesnapshots
NAME                           READYTOUSE   SOURCEPVC                         SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS        SNAPSHOTCONTENT                                    CREATIONTIME   AGE
snapshot-demo-cluster1-rs0-0   true         mongod-data-demo-cluster1-rs0-1                           3Gi           gke-snapshot-class   snapcontent-34a398ed-4408-454f-bcad-8b2ba8f22a18   42s            43s
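Before proceeding, make sure the snapshot is ready to use. One way to block until that happens (this requires a kubectl version that supports jsonpath waits; the snapshot name matches the manifest above):

kubectl wait --for=jsonpath='{.status.readyToUse}'=true \
  volumesnapshot/snapshot-demo-cluster1-rs0-1 --timeout=5m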
Close the cursor
Now we need to go back to the PBM container and finish the backup, which closes any open backup cursors:
$ pbm backup-finish 2024-09-25T10:54:24Z
Command sent. Check `pbm describe-backup 2024-09-25T10:54:24Z` for the result.
Restore
There are a few caveats you need to know about before restoring:
- It is not possible to do an in-place restore with snapshots. The restore target can be:
  - A completely new cluster
  - The existing cluster, but you will need to pause it and delete the existing volumes
- You must also back up the Secrets (TLS keys and users). This is no different from any other backup method in the Operator. We recommend using a Kubernetes Secret storage solution, for example, Vault; a minimal kubectl-based sketch is shown after this list.
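As a quick sketch of that step, you can export the relevant Secrets with kubectl before deleting anything. The secret names below are assumptions for a cluster named demo-cluster1; check your Custom Resource and kubectl get secrets for the actual names:

# Secret names are assumptions - list your secrets first to confirm them
kubectl get secret demo-cluster1-secrets demo-cluster1-ssl demo-cluster1-ssl-internal \
  -o yaml > demo-cluster1-secrets-backup.yaml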
Let’s look at the use case where you want to recover the existing cluster. Here is how it can be done:
1. Delete the cluster
kubectl delete -f 00-cr.yaml
2. Delete the Persistent Volume Claims (PVCs) that belong to the cluster
kubectl delete pvc -l app.kubernetes.io/instance=demo-cluster1
3. Create PVCs from the snapshot (see 03-volumes-from-snapshots.yaml). You need to create the PVCs with the same names the Operator would use; in our case, these are the same names we had before. A sketch of such a PVC manifest follows the output below:
% kubectl apply -f 03-volumes-from-snapshots.yaml
persistentvolumeclaim/mongod-data-demo-cluster1-rs0-0 created
persistentvolumeclaim/mongod-data-demo-cluster1-rs0-1 created
persistentvolumeclaim/mongod-data-demo-cluster1-rs0-2 created
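For reference, each PVC in that manifest points at the snapshot through a dataSource reference. This is only a sketch of what 03-volumes-from-snapshots.yaml contains; the storage class and size are assumptions and must match what your Custom Resource expects:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongod-data-demo-cluster1-rs0-0   # must match the name the Operator expects
spec:
  storageClassName: standard-rwo          # assumption - use the storage class of your cluster
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: snapshot-demo-cluster1-rs0-1    # the snapshot taken earlier
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi                        # assumption - match the original volume size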
Now we can start the cluster:
kubectl apply -f 00-cr.yaml
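You can watch the custom resource until it reports a ready state, which means all replica set members have started from the restored volumes:

kubectl get psmdb demo-cluster1 -w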
The cluster is now restored and has the data that was captured when you took the snapshots.
Point-in-time recovery
As we explained above, the Recovery Point Objective (RPO) can be improved by storing oplogs in the object storage separately.
To recover the data from oplogs, you will need to exec into the backup-agent container again:
% kubectl exec -ti demo-cluster1-rs0-0 -c backup-agent bash
Check if oplog chunks are stored:
$ pbm status
...
PITR chunks [131.70KB]:
  2024-09-27T08:04:05Z - 2024-09-27T08:07:05Z (no base snapshot)
You can replay the oplogs with the following command (take the timestamps from the pbm status output and adjust them to your timezone):
pbm oplog-replay --start 2024-09-27T11:04:05 --end 2024-09-27T11:07:05
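If you need help with the conversion, GNU date (assuming it is available in your shell) can turn the UTC timestamps from pbm status into local time in the format the command expects:

date -d '2024-09-27T08:04:05Z' '+%Y-%m-%dT%H:%M:%S'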
Now you will have the latest data.
Conclusion
Even though this snapshot-based backup and restore solution is currently a proof of concept, it still demonstrates the Percona Operator for MongoDB’s flexibility and adaptability in managing large datasets, especially in scenarios where traditional logical or physical backups might not be ideal.
While the process may involve a few manual steps, it underscores the Operator’s commitment to providing comprehensive data protection options. Future releases will focus on streamlining and automating this process further, making snapshot-based backup and recovery even more seamless and user-friendly.
In the meantime, for those dealing with massive datasets where efficient storage and rapid recovery are paramount, this PoC offers a valuable tool for safeguarding critical data.