This blog post is the second in a series covering MongoDB PIT backups. You can find the first part here.
Sharding Makes Everything Fun(ner)
The first blog post in this series looked at MongoDB backups in a simple single-replica set environment. In this post, we’ll look at the scale-out use case. When sharding, we have exactly the same problem as we do on a single replica set. However, now the problem is multiplied by the number of replica sets in the cluster. Additionally, we have a bonus problem: each replica set has unique data. That means to get a truly consistent snapshot of the cluster, we need to orchestrate our backups to capture a single consistent point in time. Just so we’re on the same page, that means that every replica set needs to stop their backups at, or near, the same time that the slowest replica set stops. Are you sufficiently confused now? Let me get to a basic concept that I forgot to cover in the first post, and then I’ll give you a simple description of the problem.
Are you Write Concerned?
So far, I’ve neglected to talk about the very important role of “write concern” when taking consistent backups. In MongoDB, the database is not durable by default. By “durable,” I mean “on disk” when the database acknowledges receipt of an operation from your application. There are most likely several reasons for this, and the biggest one originally was probably throughput, given a lack of concurrency.
However, the side effect is possible data loss, because operations that were applied only in memory can be lost. Changing the write concern to “journaled” (j: true) changes this behavior so that MongoDB journals changes before acknowledging them (you also need to be running with the journal enabled).
TIP: For true durability in a replica set, you should use a write concern of “majority” for operations and set writeConcernMajorityJournalDefault: true on all replica set members (new in v3.4). This has the added benefit of greatly decreasing the chance of rollback after an election.
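As a rough illustration, the write concern described above travels with each write as a small document. The sketch below builds that document and attaches it to an insert command using plain Python dictionaries. The collection name `orders` and the helper names are hypothetical, and a real application would normally set the write concern through its driver rather than by hand.

```python
def durable_write_concern():
    """Write concern asking for majority acknowledgement plus journaling."""
    return {"w": "majority", "j": True}


def durable_insert_command(collection, documents):
    """Build an insert command document carrying the durable write concern.

    `collection` and `documents` are placeholders for illustration only.
    """
    return {
        "insert": collection,
        "documents": list(documents),
        "writeConcern": durable_write_concern(),
    }


# Hypothetical collection and document, used only to show the shape.
command = durable_insert_command("orders", [{"_id": 1, "total": 42}])
print(command["writeConcern"])
```

The point of the sketch is only the shape of the `writeConcern` field: `w` controls how many members must acknowledge the write, and `j` asks for the journal to be written first.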
Wow, you’re inconsistent
At the risk of being repetitive, the crux of this issue is that we need to run a backup on every shard (replica set). This is necessary because every shard has a different piece of the data set. Each piece of that data set is necessary to get an entire picture of the data set for the cluster (and thus, your application). Since we’re using mongodump, we’ll only have a consistent snapshot at the point in time when the backup completes. This means we must end each shard’s backup at a consistent point in time. We cannot expect the backup to complete in exactly the same amount of time on every shard, yet that is what we’d need for a consistent point in time across the cluster. This means that Shard1 might have a backup that is consistent to 12:05 PM, while Shard2 has a backup that is consistent to 12:06 PM. In a high traffic environment (the kind that’s likely to need horizontal scale), this could mean thousands of lost documents. Here’s a diagram:
Here’s the math to illustrate the problem:
- Shard1’s backup will contain 30,000 documents ((100 docs/sec * 60 secs) * 5 mins)
- Shard2’s backup will contain 36,000 documents ((100 docs/sec * 60 secs) * 6 mins)
In this example, to get a consistent point in time you’d need to remove all insert, update and delete operations that happened on Shard2 after Shard1’s backup completed (6,000 documents). This means examining the timestamp of every operation in the oplog and reversing its effect. That’s a very intensive process, and it will be unique for every mongodump that’s executed. Furthermore, this is a pretty tricky thing to do. The repeatable and much more efficient method is to have backups that finish in a consistent state, ready to restore when needed.
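The arithmetic above can be checked with a few lines of Python. This is only a sketch: the rate of 100 documents per second is the one implied by the example’s math, and the function name `docs_captured` is made up for illustration.

```python
DOCS_PER_SECOND = 100  # insert rate implied by the example's arithmetic


def docs_captured(backup_minutes, docs_per_second=DOCS_PER_SECOND):
    """Documents written by the time a backup of the given length completes."""
    return docs_per_second * 60 * backup_minutes


shard1 = docs_captured(5)  # backup consistent to 12:05 PM
shard2 = docs_captured(6)  # backup consistent to 12:06 PM
drift = shard2 - shard1    # documents in Shard2's backup written after Shard1 finished

print(shard1, shard2, drift)  # prints: 30000 36000 6000
```

One extra minute of backup time on a single shard is enough to put 6,000 documents out of step with the rest of the cluster at this write rate.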
Luckily, Percona has you covered!
You’re getting more consistent
Having data is important, but knowing what data you have is even more important. Here’s how you can be sure you know what you have in your MongoDB backups:
David Murphy has released his MongoDB Consistent Backup Tool in the Percona Labs github account, and has written a very informative blog post about it. My goal with these blog posts is to make it even easier to understand the problem and how to solve it. We’ve already had an exhaustive discussion about the problem on both small and large scales. How about the solution?
It’s actually pretty simple. The solution, at a basic level, is to use a simple algorithm to decide when a cluster-wide consistent point-in-time can be reached. In the MongoDB Consistent Backup tool, this is done by the backup host kicking off backups on a “known good member” of each shard (that’s a pretty cool feature by itself) and then tracking the progress of each dump. At the same time the backup is kicked off, the backup host starts a separate thread that tails the oplog on each “known good member” until the mongodump on the slowest shard completes. By using this method, we have a very simple way of deciding when we can get a cluster-wide consistent snapshot: when the slowest member completes its piece of the workload. Here’s the same workload from Figure 4, but with the MongoDB Consistent Backup Tool methodology:
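To make the idea concrete, here is a minimal simulation of that decision in Python. It is not the tool’s actual code. The class `ShardBackup`, the function names, and the use of plain integer seconds as timestamps are all assumptions made for illustration. It assumes each shard’s dump is consistent to the moment it finished, and that the tailed oplog supplies everything after that moment.

```python
from dataclasses import dataclass


@dataclass
class ShardBackup:
    name: str
    dump_finished_at: int  # seconds after the backup was kicked off
    tailed_oplog: list     # timestamps (seconds) of ops seen by the oplog tailer


def consistent_point(backups):
    """Cluster-wide consistent point: when the slowest shard's dump completes."""
    return max(b.dump_finished_at for b in backups)


def ops_to_replay(backup, point):
    """Tailed ops newer than this shard's dump, up to the consistent point."""
    return [ts for ts in backup.tailed_oplog
            if backup.dump_finished_at < ts <= point]


# Simulated tailer output: one operation every 20 seconds on each shard.
oplog = list(range(0, 400, 20))
backups = [
    ShardBackup("Shard1", dump_finished_at=300, tailed_oplog=oplog),
    ShardBackup("Shard2", dump_finished_at=360, tailed_oplog=oplog),
]

point = consistent_point(backups)
for b in backups:
    print(b.name, ops_to_replay(b, point))
```

In this simulation the consistent point is 360 seconds, so the faster shard (Shard1) needs the three tailed operations at 320, 340 and 360 seconds to catch up, while the slowest shard needs none.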
TIP: The amount of time that it takes to perform these backups is often decided by two factors:
- How evenly distributed the data is across the shards (balanced)
- How much data each shard contains (whether or not it’s balanced).
The takeaway here is that you may need to shard so that each shard has a manageable volume of data. This allows you to hit your backup/restore windows more easily.
…The Proverbial “Monkey Wrench”
There’s always a “gotcha” just when you think you’ve got your mind around any difficult concept. Of course, this is no different.
There is one very critical concept in sharding that we didn’t cover: tracking what data lies on which shard. This is important for routing the workload to the right place, and balancing the data across the shards. In MongoDB, this is handled by the config servers. If you cannot reach (or recover) your config servers, your entire cluster is lost! For obvious reasons, you need to back them up as well. With the Percona Labs MongoDB Consistent Backup Tool, there are actually two modes used to back up config servers: v3.2 and greater, and legacy. The reason is that in v3.2, config servers went from mirrors to a standard replica set. In v3.2 mode, we just treat the config servers like another replica set. They have their own mongodump and oplog tail thread. They get a backup that is consistent to the same point in time as all other shards in the cluster. If you’re on a version of MongoDB prior to v3.2, and you’re interested in an explanation of legacy mode, please refer back to David’s blog post.
The Wrap Up
We’ve examined the problems with getting consistent backups in a running MongoDB environment in this and the previous blog posts. Whether you have a single replica set or a sharded cluster, you should have a much better understanding of what the problems are and how Percona has you covered. If you’re still confused, or you’d just like to ask some additional questions, drop a comment in the section below. Or shoot me a tweet @jontobs, and I’ll make sure to get back to you.