What’s new in TokuMX 1.4, Part 5: Faster chunk migrations

We just released version 1.4.0 of TokuMX, our high-performance distribution of MongoDB. There are a lot of improvements in this version (release notes), the most of any release yet. In this series of blog posts, we describe the most interesting changes and how they’ll affect users.

Sharding in MongoDB and TokuMX does a great job of scaling an application beyond what a single machine can do, but it also brings new challenges to the table.  One of those challenges is how to deal with the impact of migrations on the running system.  Many users try to avoid migrations completely by using a hashed shard key, but if that isn’t suitable for the application, some will schedule the balancing window to periods of lower traffic.

With the improvements to sharding in TokuMX 1.4, users generally don’t need to worry about migrations anymore.  I introduced two big improvements yesterday; this one is a longer story.

In MongoDB and TokuMX 1.3 and earlier, migrations are a complicated process, involving many locks, background threads, coordination, and a plethora of internal commands back and forth between the routers, shard servers, and config servers.  This is necessary in MongoDB to ensure that no data is lost along the way, and that no modifications to the migrating chunk are missed or duplicated because of the migration.  Earlier TokuMX releases implemented a similar algorithm for safety, but now we’ve leveraged transactions internally to make migrations simpler, more reliable, and much faster.  I’m going to go into some detail on this to show how transactions helped us.

To migrate a chunk, there are basically three steps: cloning the existing data, catching up with modifications that happened along the way, and committing the migration with a form of two-phase commit among the servers.  In MongoDB, cloning happens with some special commands that sort existing documents based on their location on disk and then transfer them in batches.  Modifications during this period are tracked in memory and eventually all modified documents are re-copied (or deleted) in order to catch up (and if there are too many changes to keep in memory, the migration will fail).  Each cloned document needs to be added as an upsert to avoid duplicating documents (the operation must be idempotent).
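The clone and catch-up phases described above can be sketched as a toy model. This is only an illustration, not the actual MongoDB implementation: shards are plain dicts mapping _id to document, the function name and arguments are hypothetical, concurrent writes are applied after the clone loop rather than interleaved with it, and the two-phase commit is left out.

```python
def migrate_with_upserts(donor, recipient, chunk_ids, concurrent_writes):
    """Toy model of clone + catch-up; donor and recipient are dicts of _id -> doc.

    concurrent_writes is a list of (_id, doc) pairs, where doc=None means delete.
    """
    modified = set()  # ids touched during the clone, tracked in memory

    # Clone phase: every document is applied as an upsert, so sending the
    # same document twice leaves the recipient in the same state (idempotent).
    for _id in chunk_ids:
        if _id in donor:
            recipient[_id] = donor[_id]

    # Writes that reach the donor while the clone is in progress
    # (simplified here: they all arrive after the clone loop).
    for _id, doc in concurrent_writes:
        if doc is None:
            donor.pop(_id, None)
        else:
            donor[_id] = doc
        modified.add(_id)

    # Catch-up phase: re-copy or delete every document that was modified.
    for _id in modified:
        if _id in donor:
            recipient[_id] = donor[_id]
        else:
            recipient.pop(_id, None)

    return recipient
```

Because the catch-up phase re-copies whole documents by _id, it only works if the set of modified ids fits in memory, which is the failure mode mentioned above.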

Two-phase commit is needed for safety, so we didn’t change that.  In TokuMX 1.3 we kept the cloning commands but had them just do normal queries internally (because the shard key is a clustering key, the documents are already sorted properly for fast access), and we changed the modification tracking to essentially log modifications to a temporary collection that looks like the oplog.

In TokuMX, each upsert needs to query first to see if the document already exists, and this query may cost a random I/O, which will be slow.  It’s important to note that TokuMX’s Fractal Tree indexes can insert much faster than this, without causing any random I/O.

In TokuMX 1.4, we’ve done away with the cloning commands entirely (except for compatibility with 1.3 servers) and instead, we just start an MVCC snapshot transaction (and start logging modifications), and then do a normal cursor query over the chunk on the donor shard, which works just like any other cursor, instead of sending commands back and forth.  On the recipient shard, rather than do upserts with their hidden queries, we now send insert messages, which we can do because we’re reading them off the donor with a snapshot transaction.  The catchup phase will still just replay the operations that happened on the donor.
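As a rough sketch of the snapshot-based approach, the toy model below uses a deep copy as a stand-in for the MVCC snapshot transaction and a plain list as a stand-in for the modification log. All names here are hypothetical and the chunk is modeled as a half-open range of integer keys; the real servers stream documents over a cursor rather than copying a dict.

```python
import copy

def apply_op(store, op, _id, doc):
    """Apply one logged modification to a dict-based store."""
    if op == "delete":
        store.pop(_id, None)
    else:  # "insert" or "update"
        store[_id] = doc

def migrate_with_snapshot(donor, lo, hi, writes_during_clone):
    """Toy model: clone keys in [lo, hi) from a snapshot, then replay the log."""
    snapshot = copy.deepcopy(donor)  # stand-in for an MVCC snapshot transaction
    oplog = []                       # modifications logged since the snapshot

    # Writes keep hitting the donor while the clone runs; those in the
    # migrating chunk are logged for the catch-up phase.
    for op, _id, doc in writes_during_clone:
        apply_op(donor, op, _id, doc)
        if lo <= _id < hi:
            oplog.append((op, _id, doc))

    # Clone phase: an ordinary range scan over the snapshot. Each document
    # appears exactly once in the snapshot, so plain inserts are enough.
    recipient = {}
    for _id in sorted(snapshot):
        if lo <= _id < hi:
            recipient[_id] = snapshot[_id]

    # Catch-up phase: replay the logged operations in order.
    for op, _id, doc in oplog:
        apply_op(recipient, op, _id, doc)

    return recipient
```

The point of the sketch is that the snapshot never changes underneath the scan, so the recipient does not need upserts to stay consistent.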

This simplifies the process and makes testing easier, but what about performance?  On the donor, the only work is a range query on the shard key, which is clustering by default, so that’s all fast sequential I/O.  On the recipient, things are a little more complicated.  While we no longer automatically do queries for upserts, we still need to check for uniqueness of the “_id” field.  If the shard key is just {_id: 1}, then these unique checks will be on sequential keys, so I/O isn’t a problem and they will be very fast.  But if the shard key is anything else, then these will be random queries, and can be slow.
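To see why the shard key matters for the unique checks, here is a small illustration of the order in which _id values would be probed on the recipient. The field name "user" and both helper functions are made up for this example, and counting backward jumps is only a crude proxy for random I/O.

```python
def id_probe_order(docs, shard_key):
    """Return the _ids in the order a range scan on shard_key delivers them."""
    return [d["_id"] for d in sorted(docs, key=lambda d: d[shard_key])]

def backward_jumps(ids):
    """Count how often the next _id is smaller than the previous one."""
    return sum(1 for prev, cur in zip(ids, ids[1:]) if cur < prev)

# Ten documents whose "user" field is unrelated to the _id order.
docs = [{"_id": i, "user": (i * 7) % 10} for i in range(10)]

sequential = backward_jumps(id_probe_order(docs, "_id"))   # sharded on {_id: 1}
scattered = backward_jumps(id_probe_order(docs, "user"))   # sharded on another key
```

With the shard key on _id, the probes walk the _id index in order; with any other shard key, they jump around it.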

Why bother with the unique checks if we already did them when we inserted the data the first time?  A sharded collection can’t ensure uniqueness on any index but the shard key.  This means that a misbehaving application could insert two documents with the same _id, but different shard keys so they initially go to separate shards.  If a migration moves those documents to the same shard, we want to detect that the documents have the same _id.  If both of the following apply, then skipping the unique checks during migration could lead to data loss:

  • The collection is sharded on a key other than the primary key.  This is not the default behavior of shardCollection, but is possible, especially when sharding an existing collection.
  • The application assigns _ids to documents manually and assigns the same _id to two different documents, which is already a misuse of MongoDB.
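A toy model shows how this combination loses data. It assumes, purely for illustration, that the recipient stores documents keyed by _id and that an unchecked insert silently overwrites an existing entry; the "region" shard key, the function, and the exception name are all hypothetical.

```python
class DuplicateIdError(Exception):
    """Raised when an incoming document collides with an existing _id."""

def receive_chunk(recipient, incoming_docs, unique_checks=True):
    """Insert migrated documents into a recipient keyed by _id."""
    for doc in incoming_docs:
        if unique_checks and doc["_id"] in recipient:
            raise DuplicateIdError("duplicate _id: %r" % (doc["_id"],))
        recipient[doc["_id"]] = doc  # unchecked: overwrites any existing doc
    return recipient

# Two different documents were given the same _id but different shard keys,
# so they started out on different shards.
recipient_shard = {7: {"_id": 7, "region": "eu", "name": "alice"}}
migrating_docs = [{"_id": 7, "region": "us", "name": "bob"}]

# Without unique checks the original document is silently replaced.
after = receive_chunk(dict(recipient_shard), migrating_docs, unique_checks=False)
```

With unique_checks=True the same call raises DuplicateIdError instead, which is the detection the checks exist to provide.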

To protect our users in all cases, TokuMX 1.4 always does the unique checks during a migration, and data loss is not possible.

(By default.)

There is a new setting in TokuMX 1.4 called migrateUniqueChecks, which controls this behavior.  Its default is “true”, but if you know that any of the following are true, you can turn this off:

  • You are sharding on the primary key.
  • You are allowing MongoDB to assign _ids for your application.
  • You know your _ids are all globally unique and are willing to accept the consequences if they aren’t.

If any of those apply, you can start your shard servers with --setParameter migrateUniqueChecks=false.  If you do this, the bulk clone phase causes no random I/O on the recipient shard either, and migrations will be significantly faster.  With a workload that doesn’t have updates, you can actually do a full migration on a loaded cluster without inducing any random I/O on any shard.

With TokuMX 1.4.0, chunk migrations are now faster than ever.  Check out Tim’s chunk migration benchmark to see just how fast they are.

Want to check out the newest version of TokuMX?  Download TokuMX 1.4.0 here:
