How TokuMX was Born

With TokuMX 1.4 coming out soon, with (teaser) wonderful improvements made to sharding and updates (and plenty of other goodies), I’ve recently reminisced about how we got TokuMX to this point. We (actually, really John) started dabbling with integrating Fractal Tree® indexes into MongoDB in the summer of 2012, where we (really, he) prototyped using Fractal Tree indexes only for secondary indexes. As cool as that prototype was, it wasn’t production ready, and we knew we had a challenge creating a usable product.

Back in early 2013, when we were analyzing how we wanted to (actually) integrate Fractal Tree indexes into MongoDB, internally, we saw three choices:

  1. Use Fractal Trees only for secondary indexes, hence giving the user the option to add a fractal tree secondary index.
  2. Use Fractal Tree indexes only for secondary indexes and the main data heap. This would allow users to make entire collections with our technology, and other collections using MongoDB’s storage technology. “Metadata” collections such as the oplog, system.namespaces, profiling collections, and other collections would still use MongoDB’s storage.
  3. Use Fractal Trees for everything. And we mean EVERYTHING: user collections, oplog, system.namespaces, system.indexes, config servers, etc… Any and all data would be stored with Fractal Tree indexes.

Needless to say, we chose option #3, and TokuMX was born soon after (with a lot of hard work). What I would like to do now is give some insight into our reasoning. But really, it comes down to these reasons:

  • Simplicity
  • Opportunity to innovate

Here are the benefits and drawbacks we saw to each approach.

Using Fractal Tree indexes only for secondary indexes

The big advantage we saw was the ability to non-intrusively inject our technology into MongoDB. Users could opt-in only with secondary indexes that were likely to benefit. Theoretically, we could slide ourselves in and not worry about the rest of the MongoDB stack such as the query planner, replication, and sharding. All we would care about is parts where our index is modified, and where our index is queried. Theoretically, this would be small in scope.

Unfortunately, the disadvantages we saw were quite big:

  • Recovery upon a crash could be challenging. How could we ensure that the main data heap, which is stored with MongoDB, would be in sync with our Fractal Tree secondary index? MongoDB uses journaling, whereas we have our own logging and recovery mechanism. Could we get this working in a performant way? Our analysis of the code, and feedback we received, told us this would be very challenging.
  • If the main data heap used vanilla MongoDB storage, users would not get the benefits of our compression.
  • The database level reader/writer lock, and MongoDB’s lock yielding behavior, made it difficult to find a performant way to integrate ourselves as just a secondary index. In MongoDB’s code, if an I/O is going to be performed down low in the stack with the database level reader/writer lock held, an exception is thrown, and high up in the stack, the lock is yielded so the I/O can be performed lock-free. Our Fractal Tree library is not designed to throw an exception on I/O and to have an API to prefetch some data. Integrating this identical behavior into our Fractal Tree library seemed challenging enough that we did not consider it. So, if we were to integrate ourselves as just a secondary index, we likely would have had our I/O’s happen with the reader/writer lock held, whereas MongoDB’s I/O’s would not. We did not want this performance characteristic. So, one could argue that with this approach, our locking characteristics would be worse than MongoDB’s existing locking characteristics.

Using Fractal Tree indexes for entire collections

The big benefits to this approach over the “just a secondary index” approach were the following:

  • Users would get our full compression benefits.
  • A collection’s main data was guaranteed to be in sync with secondary indexes, because all the data would be stored with our transactional Fractal Tree library

Unfortunately, we reasoned that the disadvantages of the “just a secondary index” approach also applied here:

  • Recovery would still remain challenging. We would need the collections to be in sync with information in metadata collections upon recovering from a crash. In TokuDB and MySQL, this has never been an easy problem, even though MySQL has done so much great work to make this problem simpler for storage engines. Attacking this problem fresh with MongoDB’s storage would be challenging.
  • The database level reader/writer lock’s yielding behavior presents the same problem here as it did with the “just a secondary index” approach.

Using Fractal Tree Indexes for EVERYTHING

The third option was to “take control over more of the stack”, use Fractal Tree indexes for everything and completely replace the MongoDB storage code. The challenge (but not downside) was that this approach was a lot of work. However, it was arguably less work than getting either of the above options working. We had to become experts in how replication and sharding, and in some cases, rewrite existing algorithms with new ones to better utilize Fractal Tree indexes. The benefits we saw were:

  • Crash recovery was no longer an issue, because all work is done with our transactional storage engine
  • Locking was not an issue, because we could refine the locking ourselves, which we did. This is why TokuMX has document level locking.

But really, the BIGGEST benefit to this approach was the following: we could innovate on more of the MongoDB core server stack in ways the other approaches would not allow. Prior to TokuMX 1.4, such innovations include (but are not limited to):

  • Document level locking
  • Multi-statement transactions (on non-sharded clusters)
  • MVCC snapshot query semantics
  • Clustering indexes (although, to be fair, this was possible in other approaches)
  • Dramatically reduced I/O utilization on secondaries (which we will elaborate on in a future post)
  • Fast bulk loading
  • Enterprise hot backup

For these reasons, we chose this option, and after some hard work, TokuMX was born.

What really has me excited about TokuMX 1.4 and beyond is that we have taken these innovations further. We have improved sharding and updates in ways that would be impossible had we taken another approach. Also, we have plans to improve other areas of the system in similar ways. So stay tuned.

Share this post

Comment (1)

  • Cyrus Reply

    Excellent job. TokuMX solved MongoDB ‘s the most serious problem ,fragmentation,memory management,compression(BSON is too large),and perfermance when datasets is very large.
    By the way, howto pronounce TokuMX ?

    February 18, 2014 at 10:37 am

Leave a Reply