MongoDB Disaster, Snapshot Restore and Point-in-time Replay

Mistakes can happen. If only we could go back in time to the very second before that mistake was made.

Act 1: The Disaster


Act 2: Time travel with a Snapshot restore + Oplog replay



The ‘TLDR’

The oplog of the damaged replicaset is your valuable, idempotent history if you have a backup from a recent enough time to apply it on.

  • Identify your disaster operation’s timestamp value in the oplog.
  • Before shutting the damaged replicaset down: mongodump connection-args --db local --collection
    • (Necessary workaround #1) Use a --query '{"ns": {"$nin": ["config.system.sessions", "config.transactions", "config.transaction_coordinators"]}}' argument to exclude entries for the transaction-related system collections from v3.6 and v4.0 (and maybe 4.2+ too) that can’t be restored.
  • (Necessary workaround #2) Get rid of the subdirectory structure mongodump makes and keep just the file.
  • (Necessary workaround #3) Make a fake, empty directory somewhere else too, to trick mongorestore later.
  • Use bsondump | head -n 1 to check that this oplog starts before the time of your last backup
  • Shut the damaged DB down.
  • Restore to the latest backup before the disaster.
  • (Possibly-required workaround #4) If the oplog updates other system collections, create a user-defined role that grants anyAction on anyResource and grant it to your user as well. (See special section on system collections below.)
  • Replay up to but not including the disaster second: mongorestore connection-args --oplogReplay --oplogFile --oplogLimit disaster_epoch_sec:0 /tmp/fake_empty_directory
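
Workarounds #2 and #3 from the list above can be sketched as a small shell helper. This is only an illustration: the function name, the output file name oplog_replay_input.bson and the two directory arguments are made up, and it assumes mongodump wrote its usual <dump_dir>/local/oplog.rs.bson layout for a dump of the oplog collection.

```shell
# Hypothetical helper for workarounds #2 and #3.
# Assumption: mongodump produced <dump_dir>/local/oplog.rs.bson.
prepare_oplog_replay() {
    dump_dir=$1    # directory given to mongodump as its output location
    work_dir=$2    # where the flattened file and the empty directory will live
    src="$dump_dir/local/oplog.rs.bson"

    # Refuse to continue if the dump is not where we expect it.
    [ -f "$src" ] || { echo "missing $src" >&2; return 1; }

    # Workaround #3: an empty directory to pass as mongorestore's positional argument.
    mkdir -p "$work_dir/fake_empty_directory" || return 1

    # Workaround #2: keep just the BSON file, outside the dump's subdirectory structure.
    mv "$src" "$work_dir/oplog_replay_input.bson" || return 1

    # Print the path that would be handed to --oplogFile.
    echo "$work_dir/oplog_replay_input.bson"
}

# Example (paths are placeholders):
# prepare_oplog_replay /tmp/oplog_dump /tmp/replay_work
```

The printed path is the one to give to --oplogFile, and the empty directory is the one to give as the final positional argument.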

See the ‘Act 2’ video for the details.

So how did that work?

If you’re having the kind of disaster presented in this article, I assume you are already familiar with the mongodump and mongorestore tools and MongoDB oplog idempotency. Taking that for granted, let’s go to the next level of detail.

The applyOps command – Kinda secret; Actually public

In theory you could iterate over the oplog documents and write an application that runs an insert command for each “i” op, an update for each “u” op, various different commands for the “c” ops, and so on. The simpler way is to submit them as they are (well, almost exactly as they are) using the applyOps command, and this is what the mongorestore tool does.
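
To make that concrete, here is a minimal sketch of what a hand-built applyOps call could look like. The namespace test.demo, the document contents and the function name are all invented for illustration, and a real oplog entry carries more fields (such as ts) than the three shown. The shell function only prints a mongo shell snippet so it can be reviewed before anything is run.

```shell
# Print a mongo shell snippet that submits one oplog-style insert via applyOps.
# Hypothetical example: the namespace "test.demo" and the document are invented.
print_apply_ops_example() {
    cat <<'EOF'
db.adminCommand({
  applyOps: [
    // "i" = insert, "ns" = <db>.<collection>, "o" = the document itself
    { op: "i", ns: "test.demo", o: { _id: 1, note: "replayed insert" } }
  ]
});
EOF
}

print_apply_ops_example
```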

The permission to run applyOps is granted to the “restore” role for all non-system collections, and there is no ‘reject if a primary’ rule. So you can make a primary apply oplog docs like a secondary does.

N.b. for some system collections, the “restore” role is not enough. See the bottom section for more details.

It might seem a bit strange that users can have this privilege, but without it there would be no convenient way for dump-restore tools to guarantee consistency. “Consistency” here means that all the restored data will be exactly as it was at a single point in time (the end of the dump) and will not contain earlier versions of documents from some midpoint during the dumping process.

Achieving that data consistency is why the --oplog option for mongodump was created, and why mongorestore has the matching --oplogReplay option. (Those two options should be on by default i.m.o., but they are not.) The short oplog span made during a normal dump will be at <dump_directory>/, but the --oplogFile argument lets you choose any arbitrary path.


We could have limited the oplog docs during mongodump to only those before the disaster time with a --query parameter such as the following:

mongodump ... --query '{"ts": {"$lt": new Timestamp(1560915610, 0)}}' ...

But --oplogLimit makes it easier. You can dump everything, but then use --oplogLimit <epoch_sec_value>[:<counter>] when you run mongorestore with the --oplogReplay argument.

If you’re getting confused about whether it’s UTC or your server timezone – it’s UTC. All timestamps inside MongoDB are UTC if they represent ‘wall clock’ times, and for ‘logical clocks’ timezone is a non-applicable concept.
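
If you only know the disaster moment as a UTC wall-clock time, the epoch value can be computed in the shell. The following is a small sketch, not part of the original procedure: the function name is made up, and it assumes a date command that accepts -d "YYYY-MM-DD hh:mm:ss" (GNU coreutils does; BSD/macOS date uses different flags).

```shell
# Convert a UTC wall-clock time into the <epoch_sec>:<counter> form used by --oplogLimit.
# Assumption: GNU-style "date -d"; with -u the input string is read as UTC.
utc_to_oplog_limit() {
    epoch=$(date -u -d "$1" +%s) || return 1
    # Counter 0 means "stop before the first operation of that second".
    printf '%s:0\n' "$epoch"
}

utc_to_oplog_limit '2019-06-19 03:40:10'   # prints 1560915610:0
```

The result matches the Timestamp(1560915610, 0) value used in the --query example above.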

When the oplog includes system collection updates

In the built-in roles documentation, inserted after the usual and mostly fair warnings on why you should not grant users the most powerful internal role, comes this extra note that tells you what you actually need to do to allow oplog-replay updates on all system collections too:

If you need access to all actions on all resources, for example to run applyOps commands … create a user-defined role that grants anyAction on anyResource and ensure that only the users who need access to these operations have this access.

Translation: if your oplog replay fails because it hit a system collection update the “restore” role doesn’t cover, upgrade your user to be able to run with all the privileges that a secondary runs oplog replication with.
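
As a rough sketch of what that could look like in practice, the shell function below writes a mongo shell script that creates such a role and grants it. The role name oplogReplayAnyAction, the user name restore_user and the function name are hypothetical placeholders; only the anyAction on anyResource privilege comes from the documentation note quoted above.

```shell
# Write a mongo shell script that creates an "anyAction on anyResource" role
# and grants it to a user. Role and user names are hypothetical placeholders.
write_replay_role_script() {
    script=$1
    cat > "$script" <<'EOF'
var admin = db.getSiblingDB("admin");
admin.createRole({
  role: "oplogReplayAnyAction",
  privileges: [ { resource: { anyResource: true }, actions: [ "anyAction" ] } ],
  roles: []
});
admin.grantRolesToUser("restore_user", [ "oplogReplayAnyAction" ]);
EOF
}

# Example (review the generated file, then run it as a user allowed to manage roles):
# write_replay_role_script /tmp/create_replay_role.js
# mongo <connection-args> /tmp/create_replay_role.js
```

Given how powerful this role is, it makes sense to revoke or drop it again once the replay has finished.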

As an alternative to granting the role described above, you could restart the mongod with security disabled; in this mode, all operations work without access control restrictions.

It’s not quite as simple as that though because transaction stuff is currently (v3.6, v4.0) throwing a spanner in the works. So I’ve found explicitly excluding config.system.sessions and config.transactions during mongodump is the best way to avoid those updates. They are logically unnecessary in a restore because the sessions/transactions finished when the replica set was completely shut down.
