In Part 1, we showed performance results of some of the work that’s gone in to TokuDB v6.6. In this post, we’ll take a closer look at how this happened, on the engineering side, and how to think about the performance characteristics in the new version.
It’s easiest to think about our concurrency changes in terms of a Fractal Tree® index that has nodes like a B-tree index, and buffers on each node that batch changes for the subtree rooted at that node. We have materials that describe this available here, but we can proceed just knowing that:
It’s these operations that modify and examine the buffers in the root that were the main reason we used to serialize operations inside a single index.
Read concurrency isn’t too tricky. In a Fractal Tree index, when a reader applies messages to a leaf node to bring it up to date, we don’t actually remove those messages from the buffer where we found them. This means that if you have two readers that are each looking at different leaf nodes, they can each get exclusive access to their leaf nodes if they need to apply any messages, but they can share access to their common ancestor nodes since queries don’t modify them.
It turns out that we actually have some metadata in the ancestors that we do modify, and it took a bit of work to stop doing that (some of which I describe in a previous post). But now, it’s safe for multiple readers to do their work within a single buffer concurrently, as long as there are no writers.
After this work, in the steady state of a read-only workload, there are no new messages to apply, so readers just get read locks everywhere and concurrency is fantastic. As you add new data, you need to start taking some exclusive locks on leaves sometimes and this can cause serialization, but it’s often quite hard to notice if the table is large.
You can do some back-of-the-envelope calculations that say that even if every reader needed an exclusive lock on a leaf, if you have enough active leaves, all your processors will find something useful to do almost all of the time. What’s more, readers only need an exclusive lock if there was some write targeted at their leaf node since the last reader brought it up to date.
To inject data into a Fractal Tree index, we put a message in a buffer in the root node, and that message gets moved down the tree somehow. Usually, when a node reaches a certain size, the contents of one of its buffers get flushed down the tree.
The concept of making writes concurrent is easy: instead of putting a message in a buffer right at the root, we could start the message in a buffer a little bit farther down the tree where it would end up after a few flushes. In this way, writers operating on different sections of the tree would go down the tree, sharing access to the nodes close to the root, and only need exclusive access to the node where they put their messages. Such writers could all operate on different subtrees concurrently, and queries targeted at leaves in other subtrees would be free to operate concurrently with those writers as well. We’ve used this strategy of doing flushing work up front before (but not for concurrency), and we call it “promotion” internally (as in: a message is “promoted” to a deeper node).
TokuDB takes frequent checkpoints, currently about once every 60 seconds. We want our checkpoints to be frequent so that recovery is fast, and we want to minimize the amount of data we need to checkpoint each time so that we can avoid latency problems like Vadim Tkachenko reported for InnoDB in MySQL 5.5.
The advantage of putting messages in the root right away is that modifications get localized in the root node—which only costs one disk I/O to save during a checkpoint—until there are enough modifications for it to be worth our time to do more than one I/O. This is how Fractal Tree indexes achieve their insertion speed. The trade-off is that the root node forms a sort of in-memory bottleneck at which insertion threads serialize. Making writes concurrent is just a question of balancing these concerns.
If we promote a message to a deeper node, we’re taking a bet. If we were going to insert enough data that we would have flushed that message down the tree before the next checkpoint anyway, then we haven’t cost ourselves any extra I/O. But if we weren’t going to insert very much data, then in the next checkpoint, where we would have just had to write the root node to disk, now we may need to write a bunch of its children to disk and that’s more I/O. What we gain from this bet is concurrency as described above, and that we do somewhat less total CPU work (the flushing work).
To strike this balance, we track some statistics and try to predict how much data is going into a given region of the tree. If one section of the tree is hot, then we’d have to do more I/O for those nodes anyway, so we promote farther down the tree and gain concurrency.
The result is that, in our testing, promotion adapts well to a wide range of workloads.