In my last post, we talked about the read/write tradeoff of indexing data structures, and some ways that people augment B-trees in order to get better write performance. We also talked about the significant drawbacks of each method, and I promised to show some more fundamental approaches.
We had two “workload-based” techniques (inserting in sequential order and using fewer indexes) and two “data structure-based” techniques (a write buffer and OLAP). Remember, the most common thing people do when faced with an insertion bottleneck is to use fewer indexes, and this kills query performance. So keep in mind that all our work on write-optimization is really work for read-optimization, in that write-optimized indexes are cheap enough that you can keep all the ones you need to get good read performance.
Today, I’ll draw some parallels with the write buffer and OLAP. Recall that the write buffer gets you a small insertion speedup but doesn’t really hurt query time, and OLAP gets you a big insertion speedup but doesn’t let you query your data for a long time. We’ll also use the fact that sequential insertions are orders of magnitude faster than random insertions.
With that in mind, let’s move on to the new generation of write optimization.
We couldn’t manage to get both insertion and query performance out of either a write buffer or OLAP. But they’re similar techniques, just at two extremes.
LSMs (log-structured merge trees) are two great tastes that taste great together. To start, make the buffer big, but make it a B-tree, so you can query data as it arrives. In fact, suppose you have log(N) B-trees, B1, B2, …, B_log(N), each twice the size of the one before. If B-trees are slow, using more of them sounds crazy, but I promise we’re getting somewhere.
Each B-tree has twice the capacity of the one before it. So Bk can hold 2^k elements. When a new row arrives, put it in B-tree B1. When B1 reaches capacity (which in this case is 2 rows), dump those rows into B2. B2’s capacity is 4 rows. When B2 overflows, you dump the items into B3, which has a capacity of 8 rows, and so on. The trick here is that each time we dump things down to the next B-tree, they’re already sorted, so we get the insertion boost out of doing sequential insertions.
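To make the cascade concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any real LSM is implemented: plain sorted lists stand in for the B-trees, everything lives in memory, and the names `DoublingLevels` and `insert` are invented for illustration.

```python
import bisect
import heapq


class DoublingLevels:
    """Toy size-doubling hierarchy; sorted lists stand in for B-trees."""

    def __init__(self):
        # levels[k] plays the role of B(k+1) and has capacity 2**(k+1).
        self.levels = [[]]

    def insert(self, key):
        # B1 is tiny, so keeping it sorted on every insert is cheap.
        bisect.insort(self.levels[0], key)
        k = 0
        # When a level reaches capacity, dump it into the next one down.
        while len(self.levels[k]) >= 2 ** (k + 1):
            if k + 1 == len(self.levels):
                self.levels.append([])
            # Both inputs are already sorted, so the merge is one sequential pass.
            self.levels[k + 1] = list(
                heapq.merge(self.levels[k + 1], self.levels[k])
            )
            self.levels[k] = []
            k += 1


store = DoublingLevels()
for key in [5, 3, 8, 1, 9, 2, 7]:
    store.insert(key)
print(store.levels)  # [[7], [2, 9], [1, 3, 5, 8]]
```

Notice that no level ever sees a random insertion except the tiny first one: everything else is a merge of two sorted runs.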
The first log(M) B-trees are in memory (where M is the size of memory). A simple optimization is to just have one B-tree for all these levels, because in-memory B-trees are fast. Once you start flowing out of memory, you are always merging one tree with another which has at most twice as many rows. This way, the smaller B-tree can be treated like the large, OLAP-style buffer, and you get a similarly large speedup. In fact, this merge happens at disk bandwidth speeds.
Not so fast, you say: You don’t get to use all the bandwidth, because each row gets moved from B-tree to B-tree, and it uses up bandwidth each time. This is true, but it turns out that you’re operating at a 1/log(N/M) fraction of bandwidth, which is a lot better than a B-tree, by orders of magnitude.
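If you want to see where the log comes from, you can count the rewrites directly. The sketch below is a back-of-the-envelope simulation, not a benchmark: it tracks only level sizes, assumes every level doubles, and ignores the in-memory levels entirely (so it counts all log(N) levels rather than log(N/M)). The function name `average_rewrites` is made up for this example.

```python
def average_rewrites(n_rows):
    """Average number of times a row is written by merges, doubling levels."""
    sizes = [0]          # sizes[k] = rows currently in level k (capacity 2**(k+1))
    rows_written = 0
    for _ in range(n_rows):
        sizes[0] += 1
        k = 0
        while sizes[k] >= 2 ** (k + 1):
            if k + 1 == len(sizes):
                sizes.append(0)
            # A merge rewrites the incoming rows plus the rows already there.
            rows_written += sizes[k] + sizes[k + 1]
            sizes[k + 1] += sizes[k]
            sizes[k] = 0
            k += 1
    return rows_written / n_rows


for n in (1024, 2048, 4096):
    print(n, average_rewrites(n))
```

Each doubling of the data adds one more level and therefore a constant number of extra rewrites per row, which is the logarithmic cost that eats into the raw bandwidth.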
Alas, the queries are not so great. Even though we made the buffer into B-trees, which are good for queries, you now need to do a query in each one. There are log(N/M) of them on disk, so this ends up being slower than a B-tree by a log(N/M) factor. There’s that pesky tradeoff, which is much better than the B-tree tradeoff, but still not the mathematically optimal tradeoff.
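The query penalty is easy to see in code. In this sketch (again with sorted lists standing in for B-trees, and a hypothetical `lookup` function), a point query has to run a fresh search in every level until it finds the key, so a miss pays for one full search per level.

```python
import bisect


def lookup(levels, key):
    """Search each level in turn; return (level index or None, levels probed)."""
    probes = 0
    for depth, level in enumerate(levels):
        probes += 1
        # A full search from scratch in this level.
        i = bisect.bisect_left(level, key)
        if i < len(level) and level[i] == key:
            return depth, probes
    return None, probes


levels = [[7], [2, 9], [1, 3, 5, 8]]  # smallest (newest) level first
print(lookup(levels, 7))  # (0, 1)
print(lookup(levels, 5))  # (2, 3)
print(lookup(levels, 4))  # (None, 3)
```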
One last point: if instead of growing the B-trees by a factor of 2, you grow them by a larger factor, you slow down your insertions but speed up your queries. Once again, the tradeoff emerges.
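Here is a rough way to put numbers on that tradeoff. The cost model is my own simplifying assumption, not a measurement: a query probes one tree per level, and a row can be rewritten up to `growth` times per level, because that is how many batches from the level above fit before the level overflows. The ratio of 1024 for N/M is a made-up figure.

```python
def levels_needed(ratio, growth):
    """Number of on-disk levels needed to cover N/M = ratio at this growth factor."""
    levels, capacity = 0, 1
    while capacity < ratio:
        capacity *= growth
        levels += 1
    return levels


ratio = 1024  # hypothetical N/M
for growth in (2, 8, 32):
    levels = levels_needed(ratio, growth)
    # Upper bound on rewrites per row under the assumed cost model.
    print(f"growth={growth:2d}  levels to query={levels:2d}  "
          f"rewrites per row <= {growth * levels}")
```

Fewer levels means fewer trees to search, but each row gets shuffled around more within a level before it moves on.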
A COLA (that’s Cache-Oblivious Lookahead Array) is a lot like an LSM, with the queries done in a better way. To begin with, you use a technique called fractional cascading to maintain some information about the relationship from one level to the next. This information can take many forms, but what’s important is that you don’t restart your query at each level and end up doing a full B-tree query log(N) times. Instead, you get to do a small local search. If you do things just right, you can match the query time of a single B-tree. This is true even if you are doubling your structures at each level, so in addition, COLAs are as fast at insertions as LSMs.
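Here is a minimal sketch of the fractional cascading idea. It is the textbook version on static sorted lists, not the layout from the COLA paper: each level is augmented with every second key from the augmented level below it, plus a "bridge" index into that level. All names (`build_cascade`, `find`, `local`, `bridge`) are hypothetical, and keys are assumed to be finite numbers.

```python
import bisect

INF = float("inf")  # sentinel that ends every augmented list


def build_cascade(levels):
    """For each level: (augmented keys, index into the level, index into next)."""
    cascade = [None] * len(levels)
    below = None  # augmented list of the next level down
    for i in range(len(levels) - 1, -1, -1):
        # Borrow every second key from the augmented list below.
        sample = [x for x in below[1::2] if x != INF] if below else []
        merged = sorted(levels[i] + sample) + [INF]
        # local[p]: first slot in the real level holding a key >= merged[p].
        local = [bisect.bisect_left(levels[i], x) for x in merged]
        # bridge[p]: first slot in the next augmented list with a key >= merged[p].
        bridge = [bisect.bisect_left(below, x) for x in merged] if below else None
        cascade[i] = (merged, local, bridge)
        below = merged
    return cascade


def find(levels, cascade, key):
    """Return the index of the first level containing key, or None."""
    pos = bisect.bisect_left(cascade[0][0], key)  # the only full binary search
    for i, level in enumerate(levels):
        _, local, bridge = cascade[i]
        j = local[pos]
        if j < len(level) and level[j] == key:
            return i
        if bridge is None:
            break
        pos = bridge[pos]
        # Because every second key was sampled, the right slot in the next
        # level is either pos or the slot just before it: a constant-size step.
        if pos > 0 and cascade[i + 1][0][pos - 1] >= key:
            pos -= 1
    return None


levels = [[7], [2, 9], [1, 3, 5, 8]]
cascade = build_cascade(levels)
print(find(levels, cascade, 5))  # 2
print(find(levels, cascade, 4))  # None
```

After the first binary search, every further level costs a pointer hop and at most one comparison, instead of a fresh search from the root.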
Let me repeat that: they match B-trees for queries while simultaneously matching LSMs for insertions. It’s nice to note that COLAs are on the mathematically optimal write/read tradeoff curve, and they’re a proof, by example, that B-trees are not optimal.
This flavor of data structure, which combines the insertion speed of the size-doubling hierarchy of sorted structures (the LSM) with the query speed boost of fractional cascading, goes by many names and can be found dressed up in a bunch of surprising ways, but the underlying math, as well as the performance, is exactly the same.
For bonus points, if you read my colleagues’ paper on COLAs, you’ll see that they are described as being log(B) slower than B-trees on queries, where B is the disk block size. This log(B) is easily recouped in practice, giving you the same query speed as B-trees, if you give up so-called cache obliviousness (a property which is nice mathematically, but not as nice as having faster queries).
I’ve been focusing on write optimization, and Fractal Trees do go a couple of orders of magnitude faster than B-trees for indexing that pesky non-sequential data. What that means for the user is typically read optimization: you start adding all the indexes you needed all along, since indexes are so wonderfully cheap to update. My motto is: write optimization is the best read optimization!
You can get COLA-style read-optimal, write-optimized goodness here at Tokutek, where it is marketed as Fractal Trees and available in TokuDB for MySQL and MariaDB.