Why TokuMX Changed MongoDB’s Oplog Format for Operations

Over several posts, I’ve explained the differences between TokuMX replication and MongoDB replication, and why they are completely incompatible. In this (belated) post, I explain one last difference: the oplog format for operations. Specifically, TokuMX and MongoDB log updates and deletes differently.

Suppose we have a collection foo, with the following element:

We perform the following update:

In TokuMX, the operation’s entry looks like this:

In MongoDB, the operation’s entry looks like this:

While there are several differences, I want to draw attention to the fields that define the update. Within the TokuMX oplog entry, they are “o” and “m”. “pk” is used as well, but let’s disregard that, as it needlessly complicates the story. Within MongoDB, they are “o2” and “o”. In both cases, the first field defines the pre-image of the document, and the second field defines the modification to be made. TokuMX and MongoDB handle both fields differently. Below, I explain why.

Let’s address the pre-image first. MongoDB only includes the _id field in the oplog entry. On the secondary, to perform the update, MongoDB does the following:

  • Use the _id field to perform a lookup to find the entire pre-image document.
  • Perform the update of the document, in place if possible, otherwise by moving the document to a new location.
  • Update secondary indexes if necessary.
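
To make the three steps above concrete, here is a minimal Python sketch. It is a toy model, not MongoDB's actual code: the collection is a plain dict keyed by _id, the single secondary index on "a" is a dict of sets, and the function name and sample values are my own assumptions. Only the field names "o2" and "o" come from the oplog entry discussed above.

```python
def apply_update_with_lookup(collection, index_on_a, entry):
    """Apply an oplog entry that carries only the _id ("o2") and a $set ("o").

    Hypothetical model: `collection` maps _id -> document and
    `index_on_a` maps a value of field "a" -> set of _ids.
    """
    doc_id = entry["o2"]["_id"]

    # Step 1: look up the full pre-image by _id. This is the read.
    pre_image = collection[doc_id]

    # Step 2: apply the modification to the document.
    post_image = dict(pre_image)
    post_image.update(entry["o"]["$set"])
    collection[doc_id] = post_image

    # Step 3: maintain the secondary index only if the indexed field changed.
    if pre_image.get("a") != post_image.get("a"):
        if "a" in pre_image:
            index_on_a.get(pre_image["a"], set()).discard(doc_id)
        if "a" in post_image:
            index_on_a.setdefault(post_image["a"], set()).add(doc_id)

    return post_image


# Made-up sample data, for illustration only.
collection = {1: {"_id": 1, "a": 10}}
index_on_a = {10: {1}}
entry = {"o2": {"_id": 1}, "o": {"$set": {"a": 11}}}
print(apply_update_with_lookup(collection, index_on_a, entry))
```

The point of the sketch is that step 1 cannot be skipped: nothing in the entry says what the old value of "a" was, so the index cannot be fixed up without reading the document.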

Note that for MongoDB, retrieving the document is a requirement for updating it, because the update may be done in place. Therefore, if the entire pre-image were written to the oplog instead of just the _id field, MongoDB would gain nothing. That’s just a property of its B-tree-based and heap-based storage.

TokuMX, on the other hand, logs the full pre-image of the document. We do this because if we know the full pre-image and necessary modifications, we can perform the update without retrieving the document. The full pre-image, along with the modifications to be made, define what changes will occur in the primary key (which stores the main copy of the document) and all associated secondary keys. With this information, we can leverage Fractal Tree indexing by sending messages down trees, which will perform the index maintenance without doing lookups first. So, by including the full pre-image in the oplog, TokuMX secondaries avoid an unnecessary lookup, which means avoiding an unnecessary I/O.
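
Here is a rough Python sketch of that idea. It is not TokuMX's implementation: the message tuples, the function name, and the sample values are hypothetical, and I am assuming for illustration that the "m" field carries a $inc modifier. What it shows is that the pre-image ("o") plus the modification ("m") are enough to compute every index change without reading anything from storage.

```python
def index_messages(entry, indexed_field="a"):
    """Derive index maintenance messages from the oplog entry alone.

    Hypothetical model: each message is (index_name, action, key, value).
    No storage read happens anywhere in this function.
    """
    pre_image = entry["o"]

    # Compute the post-image purely from the logged pre-image and modification.
    post_image = dict(pre_image)
    for field, delta in entry["m"].get("$inc", {}).items():
        post_image[field] = post_image.get(field, 0) + delta

    doc_id = pre_image["_id"]

    # The primary key index stores the main copy of the document.
    messages = [("primary", "put", doc_id, post_image)]

    # The secondary index changes only if the indexed field changed.
    if pre_image.get(indexed_field) != post_image.get(indexed_field):
        if indexed_field in pre_image:
            old_key = (pre_image[indexed_field], doc_id)
            messages.append((indexed_field, "delete", old_key, None))
        if indexed_field in post_image:
            new_key = (post_image[indexed_field], doc_id)
            messages.append((indexed_field, "put", new_key, None))

    return messages


# Made-up sample entry, for illustration only.
entry = {"o": {"_id": 1, "a": 10, "b": 5}, "m": {"$inc": {"a": 1}}}
for message in index_messages(entry):
    print(message)
```

In a Fractal Tree index, messages like these would be sent down the trees rather than applied after a lookup. The sketch stops at producing them.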

Thanks to this, TokuMX secondaries perform significantly less I/O than primaries. All the hidden I/Os required to look up documents for updates and deletes are not performed on secondaries. For this reason, we think TokuMX secondaries do a wonderful job at scaling reads across a replica set. To show this, we ran the following experiment. On both MongoDB and TokuMX, we ran sysbench on a replica set such that the primary was utilizing 100% of available I/O. Note that some of the I/O was used for performing queries, which are not replicated. We proceeded to measure the I/O utilization of secondaries as they were replicating off I/O-bound primaries. Below is a graph showing our results.


You’ll see that the TokuMX secondary is using significantly less I/O to keep up with the primary.

A downside of the current design is that large documents consume more network bandwidth across the replica set. That is because, at the moment, we encode the full pre-image of the entire document. What we ought to be doing, and what we will do in the future, is encode just the pre-image of the fields affected by the update.

Now let’s address the post-image. You’ll notice that the user issued a $inc update, but MongoDB changed it to a $set within its oplog. The reason (I believe) is that MongoDB needs the oplog to be idempotent, as mentioned here. $set is an idempotent operation; $inc is not. As I’ve mentioned in the past, we don’t want idempotency to be a requirement for oplog entries, because it hinders our ability to innovate. Now, I want to give an example.
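
For readers who want to see the idempotency difference spelled out, here is a small Python sketch. The helper names and sample values are my own; the helpers only mimic the semantics of $set and $inc on a plain dict.

```python
def apply_set(doc, fields):
    """Mimic $set: overwrite the given fields with fixed values."""
    result = dict(doc)
    result.update(fields)
    return result


def apply_inc(doc, deltas):
    """Mimic $inc: add each delta to the field's current value."""
    result = dict(doc)
    for field, delta in deltas.items():
        result[field] = result.get(field, 0) + delta
    return result


doc = {"_id": 1, "a": 10}

# Replaying a $set leaves the document unchanged after the first application.
set_once = apply_set(doc, {"a": 11})
set_twice = apply_set(set_once, {"a": 11})
print(set_once["a"], set_twice["a"])  # 11 11

# Replaying a $inc changes the document again each time it is applied.
inc_once = apply_inc(doc, {"a": 1})
inc_twice = apply_inc(inc_once, {"a": 1})
print(inc_once["a"], inc_twice["a"])  # 11 12
```

If an oplog entry may be applied more than once, only the first form is safe to replay.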

Some day, we’d like to change the internal implementation of more updates to not require a read before performing the write. Mark Callaghan describes these writes here. He mentions increment and decrement as examples, but really any type of update can be fast, provided the query uses the primary key and secondary indexes are not modified. In fact, if any user is interested in this feature, drop us a line at support@tokutek.com and tell us about it. However, if the oplog must be idempotent, as MongoDB’s is, I see no way to optimize non-idempotent update operations such as increment and decrement. As far as I can tell, MongoDB’s requirement of changing a $inc operation to a $set in the oplog forces the storage engine to perform a read before applying the modification, in order to find what value we will be setting the field “a” to.
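
To illustrate why the idempotent form forces a read, here is a small Python sketch. Everything in it is hypothetical: CountingStore, both logging functions, and the shape of the non-idempotent entry are made up for this example. Only the "o2" and "o" field names follow the MongoDB entry discussed earlier.

```python
class CountingStore:
    """Toy document store that counts how many reads it serves."""

    def __init__(self, docs):
        self.docs = dict(docs)
        self.reads = 0

    def read(self, doc_id):
        self.reads += 1
        return self.docs[doc_id]


def log_inc_as_set(store, doc_id, field, delta):
    """Build an idempotent entry: the new value must be known, so we read."""
    current = store.read(doc_id)
    new_value = current.get(field, 0) + delta
    return {"o2": {"_id": doc_id}, "o": {"$set": {field: new_value}}}


def log_inc_as_inc(doc_id, field, delta):
    """Build a non-idempotent entry: no current value needed, so no read."""
    return {"_id": doc_id, "$inc": {field: delta}}


store = CountingStore({1: {"_id": 1, "a": 10}})

print(log_inc_as_set(store, 1, "a", 1), "reads:", store.reads)
print(log_inc_as_inc(1, "a", 1), "reads:", store.reads)
```

The first function cannot produce its entry without consulting the store, while the second one never touches it.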

So, when I said forcing TokuMX’s oplog entries to be idempotent would limit our ability to innovate, a prime example is making more updates really fast (in the future).
