We have successfully used ZFS for MySQL® backups, and MongoDB® is no different. Normally, backups are taken from a hidden secondary, either with mongodump, WiredTiger (WT) hot backup, or filesystem snapshots. For the latter, instead of LVM2 we will use ZFS and discuss its other potential benefits.
Preparation for initial snapshot
Before taking a ZFS snapshot, it is important to use db.fsyncLock(). This provides a consistent on-disk copy of the data by blocking writes, and it gives the server the time it needs to commit the journal to disk before the snapshot is taken.
My MongoDB instance below is running on a ZFS dataset, and we will take an initial snapshot.
revin@mongodb:~$ sudo zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
zfs-mongo        596M  9.04G    24K  /zfs-mongo
zfs-mongo/data   592M  9.04G   592M  /zfs-mongo/data
revin@mongodb:~$ mongo --port 28020 --eval 'db.serverCmdLineOpts().parsed.storage' --quiet
{
    "dbPath" : "/zfs-mongo/data/m40",
    "journal" : {
        "enabled" : true
    },
    "wiredTiger" : {
        "engineConfig" : {
            "cacheSizeGB" : 0.25
        }
    }
}
revin@mongodb:~$ mongo --port 28020 --eval 'db.fsyncLock()' --quiet
{
    "info" : "now locked against writes, use db.fsyncUnlock() to unlock",
    "lockCount" : NumberLong(1),
    ...
}
revin@mongodb:~$ sleep 0.6
revin@mongodb:~$ sudo zfs snapshot zfs-mongo/data@full
revin@mongodb:~$ mongo --port 28020 --eval 'db.fsyncUnlock()' --quiet
{
    "info" : "fsyncUnlock completed",
    "lockCount" : NumberLong(0),
    ...
}
Notice the addition of sleep just before the snapshot command above. This is to ensure that even with the maximum storage.journal.commitIntervalMs of 500ms, we allow enough time for the data to be committed to disk. It is simply an extra layer of guarantee and may not be necessary if you have a very low journal commit interval.
revin@mongodb:~$ sudo zfs list -t all
NAME                  USED  AVAIL  REFER  MOUNTPOINT
zfs-mongo             596M  9.04G    24K  /zfs-mongo
zfs-mongo/data        592M  9.04G   592M  /zfs-mongo/data
zfs-mongo/data@full   192K      -   592M  -
Now I have a snapshot…
At this point, I have a snapshot I can use for a number of purposes.
- Replicate full and delta snapshots to remote storage or another region with tools like zrepl (or with plain zfs send/receive, as sketched after this list). This allows for an extra layer of redundancy and disaster recovery.
- Use the snapshots to rebuild, replace or create new secondary nodes or refresh test/development servers regularly.
- Use the snapshots for point-in-time recovery. ZFS snapshots are relatively cost-free, so it is possible to take them even at five-minute intervals! This is actually my favorite use case and feature.
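As a rough sketch of the first item, full and incremental replication can also be done with plain zfs send/receive. The remote host name and destination dataset below are assumptions; only the local dataset and snapshot names come from this setup.

# Hypothetical remote host "backuphost" and destination dataset "backup/mongo-data".
# Ship the full snapshot once...
sudo zfs send zfs-mongo/data@full | ssh backuphost sudo zfs receive backup/mongo-data
# ...then, after taking a newer snapshot, send only the changes since @full.
sudo zfs snapshot zfs-mongo/data@delta1
sudo zfs send -i zfs-mongo/data@full zfs-mongo/data@delta1 | ssh backuphost sudo zfs receive backup/mongo-data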
Let’s say we take snapshots every five minutes. If a collection was accidentally dropped, or even if just a few documents were deleted, we can mount the last snapshot taken before the event. If the event was discovered in less than five minutes (perhaps that’s unrealistic), we only need to replay less than five minutes of oplog!
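Here is a minimal sketch of how such a recurring, write-consistent snapshot could be scripted, for example from cron every five minutes. The dataset name, port, and sleep come from this setup; the script itself, including the snapshot naming, is an assumption.

#!/bin/bash
# Hypothetical helper: take a write-consistent ZFS snapshot of the MongoDB dataset.
set -euo pipefail

DATASET="zfs-mongo/data"
PORT=28020
SNAPNAME="auto-$(date +%Y%m%d-%H%M%S)"

# Block writes and flush data to disk before snapshotting.
mongo --port "$PORT" --quiet --eval 'db.fsyncLock()'

# Make sure the lock is always released, even if the snapshot fails.
trap 'mongo --port "$PORT" --quiet --eval "db.fsyncUnlock()"' EXIT

# Allow for the maximum storage.journal.commitIntervalMs of 500ms.
sleep 0.6

sudo zfs snapshot "${DATASET}@${SNAPNAME}"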
Point-in-Time Recovery
To start a PITR, first clone the snapshot. Cloning the snapshot as shown below automatically mounts it. We can then start a temporary mongod instance against this mounted directory.
revin@mongodb:~$ sudo zfs clone zfs-mongo/data@full zfs-mongo/data-clone
revin@mongodb:~$ sudo zfs list -t all
NAME                   USED  AVAIL  REFER  MOUNTPOINT
zfs-mongo              606M  9.04G    24K  /zfs-mongo
zfs-mongo/data         600M  9.04G   592M  /zfs-mongo/data
zfs-mongo/data@full   8.46M      -   592M  -
zfs-mongo/data-clone     1K  9.04G   592M  /zfs-mongo/data-clone
revin@mongodb:~$ ./mongodb-linux-x86_64-4.0.8/bin/mongod \
    --dbpath /zfs-mongo/data-clone/m40 \
    --port 28021 --oplogSize 200 --wiredTigerCacheSizeGB 0.25
Once mongod has started, I would like to find out the last oplog entry it has applied.
revin@mongodb:~$ mongo --port 28021 local --quiet \
>   --eval 'db.oplog.rs.find({},{ts: 1}).sort({ts: -1}).limit(1)'
{ "ts" : Timestamp(1555356271, 1) }
We can use this timestamp to dump the oplog from the current production instance and replay it on our temporary instance.
revin@mongodb:~$ mkdir ~/mongodump28020
revin@mongodb:~$ cd ~/mongodump28020
revin@mongodb:~/mongodump28020$ mongodump --port 28020 -d local -c oplog.rs \
>   --query '{ts: {$gt: Timestamp(1555356271, 1)}}'
2019-04-16T23:57:50.708+0000    writing local.oplog.rs to
2019-04-16T23:57:52.723+0000    done dumping local.oplog.rs (186444 documents)
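Before replaying, we need to know when the bad event actually happened. As one possible way to locate it, assuming for illustration that a collection was accidentally dropped, we could search the production oplog for the corresponding command entry; the collection name below is purely a placeholder.

mongo --port 28020 local --quiet --eval '
    db.oplog.rs.find(
        { op: "c", "o.drop": "mycollection" },   // drop commands; "mycollection" is a placeholder
        { ts: 1, ns: 1, o: 1 }
    ).sort({ ts: -1 }).limit(1)'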
Assuming our bad incident occurred 30 seconds after the snapshot was taken, we can apply the oplog dump with mongorestore. Be aware that you would have to identify this cut-off point from your own oplog.
revin@mongodb:~/mongodump28020$ mv dump/local/oplog.rs.bson dump/oplog.bson
revin@mongodb:~/mongodump28020$ rm -rf dump/local
revin@mongodb:~/mongodump28020$ mongo --port 28021 percona --quiet --eval 'db.session.count()'
79767
revin@mongodb:~/mongodump28020$ mongorestore --port 28021 --dir=dump/ --oplogReplay \
>   --oplogLimit 1555356302 -vvv
Note that the oplogLimit above is 31 seconds past the snapshot's timestamp. mongorestore only applies oplog entries whose timestamp is strictly below the limit, so to replay the next 30 seconds from the time the snapshot was taken we specify a value one second beyond the last timestamp we want applied.
2019-04-17T00:06:46.410+0000    using --dir flag instead of arguments
2019-04-17T00:06:46.412+0000    checking options
2019-04-17T00:06:46.413+0000    dumping with object check disabled
2019-04-17T00:06:46.414+0000    will listen for SIGTERM, SIGINT, and SIGKILL
2019-04-17T00:06:46.418+0000    connected to node type: standalone
2019-04-17T00:06:46.418+0000    standalone server: setting write concern w to 1
2019-04-17T00:06:46.419+0000    using write concern: w='1', j=false, fsync=false, wtimeout=0
2019-04-17T00:06:46.420+0000    mongorestore target is a directory, not a file
2019-04-17T00:06:46.421+0000    preparing collections to restore from
2019-04-17T00:06:46.421+0000    using dump as dump root directory
2019-04-17T00:06:46.421+0000    found oplog.bson file to replay
2019-04-17T00:06:46.421+0000    enqueued collection '.oplog'
2019-04-17T00:06:46.421+0000    finalizing intent manager with multi-database longest task first prioritizer
2019-04-17T00:06:46.421+0000    restoring up to 4 collections in parallel
...
2019-04-17T00:06:46.421+0000    replaying oplog
2019-04-17T00:06:46.446+0000    timestamp 6680204450717499393 is not below limit of 6680204450717499392; ending oplog restoration
2019-04-17T00:06:46.446+0000    applied 45 ops
2019-04-17T00:06:46.446+0000    done
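The large integers in the last few log lines are the oplog BSON timestamps in their packed 64-bit form: the seconds value in the high 32 bits and the ordinal in the low 32 bits. A quick shell calculation shows the limit corresponds to our --oplogLimit of 1555356302; the first entry that was skipped, 6680204450717499393, is then Timestamp(1555356302, 1).

echo $(( 1555356302 << 32 ))
# 6680204450717499392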
After applying 45 oplog events, we can see that additional documents have been added to the percona.session collection.
revin@mongodb:~/mongodump28020$ mongo --port 28021 percona --quiet --eval 'db.session.count()'
79792
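From here you would typically extract the recovered data from the temporary instance, for example with mongodump/mongorestore against production. Once done, the temporary mongod can be shut down and the clone destroyed; a minimal sketch, assuming nothing else uses the clone:

mongo --port 28021 --quiet --eval 'db.getSiblingDB("admin").shutdownServer()'
sudo zfs destroy zfs-mongo/data-clone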
Conclusion
Because snapshots are available almost instantly and support incremental (delta) transfers, ZFS is ideal for large datasets that would otherwise take hours to back up with other tools.