We just released version 1.4.0 of TokuMX, our high-performance distribution of MongoDB. There are a lot of improvements in this version (release notes), the most of any release yet. In this series of blog posts, we describe the most interesting changes and how they’ll affect users.
In TokuMX 1.4.0, we improved performance by making two big changes to how updates are implemented:
In MongoDB, updates are performed by first searching for the document to update, then applying the update in-place if it can be done without growing the document, or by making a new copy of the document and then updating all the indexes regardless of the effect of the update. This process is the same on all secondaries, which means that for working sets larger than RAM, update workloads can quickly consume all of the I/O resources in the cluster.
In TokuMX, updates still do a search on the primary, to find the original document, but after this point, we don’t do any more random I/O for the update. Instead, we send messages into the affected indexes (how all modifications to Fractal Tree indexes happen) that will delete invalidated entries and replace them with new versions. Since TokuMX secondary indexes use logical rather than physical primary key identifiers, there is nothing special about growing a document, only the keys actually affected by the update modification will be updated, and this is all done without any further random I/O, the same as normal inserts.
For the secondaries in a replica set, TokuMX puts more information in the oplog than MongoDB does, so that the secondaries have enough information just from the oplog entry to replicate the update (deletes and inserts into affected indexes) without doing any I/O at all, which makes them much more effective for scaling read workloads. However, for coding simplicity, until 1.4.0, the entire old and new versions of the document were put into the oplog.
Since MongoDB, like many NoSQL databases, encourages denormalization, many existing MongoDB applications frequently update small portions of large documents. For a small update, TokuMX might log two copies of a very large document. These would tend to compress well in the oplog, but it is still a significant waste of space, and the MongoDB wire protocol doesn’t support compression (yet), so this inflates network traffic in a replica set.
There is still work remaining, beyond the two changes above, to reduce the oplog size further in future releases, but this is a big step forward for some of our users.
Most users should simply upgrade and immediately see the benefits. One thing to keep in mind about this is that, since this change adds a new type of message that a primary can put in the oplog, it is unsafe to have a replica set with a primary at version 1.4.0 or greater and any secondaries at an earlier version. To upgrade a replica set, one must upgrade all secondaries first (without letting upgraded nodes step up to become primary—their priority can be set to zero to ensure this), then step one of them up to be the primary and upgrade the original primary. This process is covered fully in the Users Guide.
Want to check out the newest version of TokuMX? Download TokuMX 1.4.0 here: