We have successfully used ZFS for MySQL® backups, and MongoDB® is no different. Normally, backups are taken from a hidden secondary, either with mongodump, a WiredTiger hot backup, or filesystem snapshots. For the latter, we will use ZFS instead of LVM2 and discuss some of its other potential benefits.
Before taking a ZFS snapshot, it is important to run db.fsyncLock(). This blocks writes, ensuring a consistent on-disk copy of the data, and gives the server the time it needs to commit the journal to disk before the snapshot is taken.
My MongoDB instance below runs on a ZFS dataset, and we will take an initial snapshot.
```
revin@mongodb:~$ sudo zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
zfs-mongo        596M  9.04G    24K  /zfs-mongo
zfs-mongo/data   592M  9.04G   592M  /zfs-mongo/data
revin@mongodb:~$ mongo --port 28020 --eval 'db.serverCmdLineOpts().parsed.storage' --quiet
{
    "dbPath" : "/zfs-mongo/data/m40",
    "journal" : {
        "enabled" : true
    },
    "wiredTiger" : {
        "engineConfig" : {
            "cacheSizeGB" : 0.25
        }
    }
}
revin@mongodb:~$ mongo --port 28020 --eval 'db.fsyncLock()' --quiet
{
    "info" : "now locked against writes, use db.fsyncUnlock() to unlock",
    "lockCount" : NumberLong(1),
...
}
revin@mongodb:~$ sleep 0.6
revin@mongodb:~$ sudo zfs snapshot zfs-mongo/data@full
revin@mongodb:~$ mongo --port 28020 --eval 'db.fsyncUnlock()' --quiet
{
    "info" : "fsyncUnlock completed",
    "lockCount" : NumberLong(0),
...
}
```
Notice the sleep just before the snapshot is taken. This ensures that even with the maximum storage.journal.commitIntervalMs of 500ms, we allow enough time for the journal to be committed to disk. This is simply an extra layer of assurance and may not be necessary if you run a very low journal commit interval.
```
revin@mongodb:~$ sudo zfs list -t all
NAME                  USED  AVAIL  REFER  MOUNTPOINT
zfs-mongo             596M  9.04G    24K  /zfs-mongo
zfs-mongo/data        592M  9.04G   592M  /zfs-mongo/data
zfs-mongo/data@full   192K      -   592M  -
```
At this point, I have a snapshot I can use for a number of purposes.
Let’s say we take snapshots every five minutes. If a collection was accidentally dropped, or even if just a few documents were deleted, we can mount the last snapshot taken before the event. If the event was discovered in under five minutes (perhaps that’s unrealistic), we only need to replay less than five minutes of oplog!
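As a sketch, the lock–snapshot–unlock sequence could be wrapped in a small script and run from cron every five minutes. The dataset name, port, and snapshot naming scheme below are my assumptions, not something this setup prescribes:

```shell
#!/bin/sh
# Hypothetical five-minute snapshot job; dataset, port, and naming scheme
# are assumptions -- adapt them to your own layout before wiring to cron.
snapshot_mongo() {
    dataset=${1:-zfs-mongo/data}
    stamp=$(date -u +%Y%m%d-%H%M)
    mongo --port 28020 --quiet --eval 'db.fsyncLock()'    # block writes
    sleep 0.6    # cover the maximum 500ms journal commit interval
    sudo zfs snapshot "${dataset}@auto-${stamp}"
    mongo --port 28020 --quiet --eval 'db.fsyncUnlock()'  # resume writes
}
```

A crontab entry along the lines of `*/5 * * * * /usr/local/bin/zfs-mongo-snapshot.sh` would then cap the oplog you ever need to replay at five minutes.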
To start a PITR, first clone the snapshot. Cloning it, as below, automatically mounts it. We can then start a temporary mongod instance on the mounted directory.
```
revin@mongodb:~$ sudo zfs clone zfs-mongo/data@full zfs-mongo/data-clone
revin@mongodb:~$ sudo zfs list -t all
NAME                   USED  AVAIL  REFER  MOUNTPOINT
zfs-mongo              606M  9.04G    24K  /zfs-mongo
zfs-mongo/data         600M  9.04G   592M  /zfs-mongo/data
zfs-mongo/data@full   8.46M      -   592M  -
zfs-mongo/data-clone     1K  9.04G   592M  /zfs-mongo/data-clone

revin@mongodb:~$ ./mongodb-linux-x86_64-4.0.8/bin/mongod \
    --dbpath /zfs-mongo/data-clone/m40 \
    --port 28021 --oplogSize 200 --wiredTigerCacheSizeGB 0.25
```
Once mongod has started, I want to find the last oplog entry it has applied.
```
revin@mongodb:~$ mongo --port 28021 local --quiet \
> --eval 'db.oplog.rs.find({}, {ts: 1}).sort({ts: -1}).limit(1)'
{ "ts" : Timestamp(1555356271, 1) }
```
We can use this timestamp to dump the oplog from the current production instance and replay it on our temporary instance.
```
revin@mongodb:~$ mkdir ~/mongodump28020
revin@mongodb:~$ cd ~/mongodump28020
revin@mongodb:~/mongodump28020$ mongodump --port 28020 -d local -c oplog.rs \
> --query '{ts: {$gt: Timestamp(1555356271, 1)}}'
2019-04-16T23:57:50.708+0000    writing local.oplog.rs to
2019-04-16T23:57:52.723+0000    done dumping local.oplog.rs (186444 documents)
```
Assuming our bad incident occurred 30 seconds after the snapshot was taken, we can apply the oplog dump with mongorestore. Be aware that you’d have to identify the exact cutoff from your own oplog.
```
revin@mongodb:~/mongodump28020$ mv dump/local/oplog.rs.bson dump/oplog.bson
revin@mongodb:~/mongodump28020$ rm -rf dump/local
revin@mongodb:~/mongodump28020$ mongo --port 28021 percona --quiet --eval 'db.session.count()'
79767
revin@mongodb:~/mongodump28020$ mongorestore --port 28021 --dir=dump/ --oplogReplay \
> --oplogLimit 1555356302 -vvv
```
Note that the oplogLimit above is 31 seconds past the snapshot’s last timestamp. Since --oplogLimit is exclusive (entries at or after the limit are not applied), we pass the first second we do not want replayed, which gives us exactly the next 30 seconds of writes.
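The arithmetic behind that limit, spelled out (the 30-second window is just our assumed time of the incident):

```shell
SNAP_TS=1555356271   # last oplog timestamp captured in the snapshot
WINDOW=30            # seconds of post-snapshot writes we want back
# --oplogLimit is exclusive, so pass the first second we do NOT want:
echo $(( SNAP_TS + WINDOW + 1 ))   # -> 1555356302
```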
```
2019-04-17T00:06:46.410+0000    using --dir flag instead of arguments
2019-04-17T00:06:46.412+0000    checking options
2019-04-17T00:06:46.413+0000    dumping with object check disabled
2019-04-17T00:06:46.414+0000    will listen for SIGTERM, SIGINT, and SIGKILL
2019-04-17T00:06:46.418+0000    connected to node type: standalone
2019-04-17T00:06:46.418+0000    standalone server: setting write concern w to 1
2019-04-17T00:06:46.419+0000    using write concern: w='1', j=false, fsync=false, wtimeout=0
2019-04-17T00:06:46.420+0000    mongorestore target is a directory, not a file
2019-04-17T00:06:46.421+0000    preparing collections to restore from
2019-04-17T00:06:46.421+0000    using dump as dump root directory
2019-04-17T00:06:46.421+0000    found oplog.bson file to replay
2019-04-17T00:06:46.421+0000    enqueued collection '.oplog'
2019-04-17T00:06:46.421+0000    finalizing intent manager with multi-database longest task first prioritizer
2019-04-17T00:06:46.421+0000    restoring up to 4 collections in parallel
...
2019-04-17T00:06:46.421+0000    replaying oplog
2019-04-17T00:06:46.446+0000    timestamp 6680204450717499393 is not below limit of 6680204450717499392; ending oplog restoration
2019-04-17T00:06:46.446+0000    applied 45 ops
2019-04-17T00:06:46.446+0000    done
```
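The 19-digit numbers near the end of that log are BSON timestamps packed into a single 64-bit integer: the high 32 bits are the Unix seconds and the low 32 bits are the ordinal within that second. A quick way to decode them in the shell:

```shell
# Unpack a 64-bit BSON timestamp as printed by mongorestore -vvv.
decode_ts() {
    printf 'Timestamp(%d, %d)\n' $(( $1 >> 32 )) $(( $1 & 0xFFFFFFFF ))
}

decode_ts 6680204450717499392   # the --oplogLimit boundary -> Timestamp(1555356302, 0)
decode_ts 6680204450717499393   # first rejected entry      -> Timestamp(1555356302, 1)
```

This confirms replay stopped exactly at the first entry of second 1555356302, our cutoff.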
After applying 45 oplog events, we can see that additional documents have been added to the percona.session collection.
```
revin@mongodb:~/mongodump28020$ mongo --port 28021 percona --quiet --eval 'db.session.count()'
79792
```
Because snapshots are available almost instantly, and because ZFS supports incremental deltas between them, it is ideal for large datasets that other backup tools would otherwise take hours to cover.
—
Photo by Designecologist from Pexels