In my last post, we talked about the read/write tradeoff of indexing data structures, and some ways that people augment B-trees in order to get better write performance. We also talked about the significant drawbacks of each method, and I promised to show some more fundamental approaches.

We had two “workload-based” techniques (inserting in sequential order and using fewer indexes) and two “data structure-based” techniques (a write buffer and OLAP). Remember, the most common thing people do when faced with an insertion bottleneck is to use fewer indexes, and this kills query performance. So keep in mind that all our work on write-optimization is really work for read-optimization, in that write-optimized indexes are cheap enough that you can keep all the ones you need to get good read performance.

Today, I’ll draw some parallels between the write buffer and OLAP. Recall that the write buffer gets you a small insertion speedup but doesn’t really hurt query time, and OLAP gets you a big insertion speedup but doesn’t let you query your data for a long time. We’ll also use the fact that sequential insertions are orders of magnitude faster than random insertions.

With that in mind, let’s move on to the new generation of write optimization.

Two Great Tastes: Log-Structured Merge trees (LSMs)

We couldn’t manage to get both insertion and query performance out of either a write buffer or OLAP. But they’re similar techniques, just at two extremes.

LSMs are two great tastes that taste great together. To start, make the buffer big, but make it a B-tree, so you can query data as it arrives. In fact, suppose you have log(N) B-trees, B_1, B_2, …, B_log(N), each twice the size of the one before. If B-trees are slow, using more of them sounds crazy, but I promise we’re getting somewhere.

[Figure: An LSM Tree with 4 levels]

Each B-tree has twice the capacity of the one before it. So B_k can hold 2^k elements. When a new row arrives, put it in B-tree B_1. When B_1 reaches capacity (which in this case is 2 rows), dump those rows into B_2. B_2’s capacity is 4 rows. When B_2 overflows, you dump the items into B_3, which has a capacity of 8 rows, and so on. The trick here is that each time we dump things down to the next B-tree, they’re already sorted, so we get the insertion boost out of doing sequential insertions.
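To make the cascade concrete, here is a minimal Python sketch of the doubling-level idea. This is my own toy code, not TokuDB’s or any production LSM: it uses in-memory sorted lists where a real system would use on-disk B-trees, and the names ToyLSM, insert, and contains are made up.

```python
import bisect


class ToyLSM:
    """Toy sketch of an LSM: level k is a sorted list holding at most
    2**(k+1) rows; an overflowing level is merged, already in order,
    into the next, larger level."""

    def __init__(self):
        self.levels = []  # levels[k] is sorted; capacity 2 ** (k + 1)

    def insert(self, key):
        carry = [key]
        k = 0
        while carry:
            if k == len(self.levels):
                self.levels.append([])
            # merging two sorted runs is a sequential pass -- the cheap kind of I/O
            merged = sorted(self.levels[k] + carry)
            if len(merged) <= 2 ** (k + 1):
                self.levels[k] = merged
                carry = []
            else:
                # this level overflows: empty it and push everything down
                self.levels[k] = []
                carry = merged
                k += 1

    def contains(self, key):
        # a point query has to look in every level -- the extra factor
        # discussed in the query-cost paragraph below
        for level in self.levels:
            i = bisect.bisect_left(level, key)
            if i < len(level) and level[i] == key:
                return True
        return False


lsm = ToyLSM()
for row in [5, 3, 9, 1, 7, 2]:
    lsm.insert(row)
print(lsm.contains(7), lsm.contains(4))  # True False
```

After a few insertions the rows end up spread across a handful of sorted levels, and a lookup has to consult each one, which is exactly the query cost discussed below.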

The first log(M) B-trees are in memory (where M is the size of memory). A simple optimization is to just have one B-tree for all these levels, because in-memory B-trees are fast. Once you start flowing out of memory, you are always merging one tree with another that has at most twice as many rows. This way, the smaller B-tree can be treated like the large, OLAP-style buffer, and you get a similarly large speedup; in fact, the merge happens at disk bandwidth.

Not so fast, you say: you don’t get to use all the bandwidth, because each row gets moved from B-tree to B-tree, and it uses up bandwidth each time. This is true, but it turns out that you’re operating at a 1/log(N/M) fraction of bandwidth, which is still better than a B-tree by orders of magnitude.

Alas, the queries are not so great. Even though we made the buffers into B-trees, which are good for queries, you now need to do a query in each one. There are log(N/M) of them on disk, so this ends up being slower than a B-tree by a log(N/M) factor. There’s that pesky tradeoff again; it’s much better than the B-tree tradeoff, but still not the mathematically optimal one.
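To put rough numbers on that, here is a back-of-the-envelope summary in the usual I/O model; this is my hedged sketch of the standard analysis, not figures from the post, with N rows total, M rows fitting in memory, and B rows per disk block:

$$
\text{B-tree insert} \approx O(\log_B N) \text{ I/Os}, \qquad
\text{LSM insert} \approx O\!\left(\frac{\log_2(N/M)}{B}\right) \text{ I/Os amortized},
$$
$$
\text{B-tree point query} \approx O(\log_B N) \text{ I/Os}, \qquad
\text{LSM point query} \approx O\!\left(\log_2(N/M) \cdot \log_B N\right) \text{ I/Os}.
$$

The 1/B in the insert bound is where the sequential-merge speedup shows up, and the extra log_2(N/M) in the query bound is the cost of searching every on-disk level.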

One last point: if instead of growing the B-trees by a factor of 2, you grow them by a larger factor, you slow down your insertions but speed up your queries. Once again, the tradeoff emerges.
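The same back-of-the-envelope analysis with a general growth factor k (again my own hedged summary, not something from the post) makes the knob explicit:

$$
\text{insert} \approx O\!\left(\frac{k \cdot \log_k(N/M)}{B}\right) \text{ I/Os amortized}, \qquad
\text{point query} \approx O\!\left(\log_k(N/M) \cdot \log_B N\right) \text{ I/Os}.
$$

A larger k means fewer levels, so queries touch fewer trees, but each row gets rewritten more times within a level before it moves down, so insertions slow down.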

Have Your Cake and Eat It Too: COLAs

A COLA (that’s Cache-Oblivious Lookahead Array) is a lot like an LSM, with the queries done in a better way. To begin with, you use a technique called fractional cascading to maintain some information about the relationship from one level to the next. This information can take many forms, but what’s important is that you don’t restart your query at each level and end up doing a full B-tree query log(N) times. Instead, you get to do a small local search. If you do things just right, you can match the query time of a single B-tree. This is true even if you are doubling your structures at each level, so in addition, COLAs are as fast at insertions as LSMs.
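Here is a toy Python illustration of the pointer-following idea as I understand it. This is my own simplification, not the COLA paper’s actual layout: a real COLA also copies a fraction of each level’s keys upward so that each local scan is provably short, which this sketch omits, and build_bridges, query, and the bridge arrays are names I made up.

```python
import bisect


def build_bridges(levels):
    # bridges[i][j] = index in levels[i+1] of the first key >= levels[i][j],
    # filled in with one merge-style pass over each adjacent pair of levels
    bridges = []
    for upper, lower in zip(levels, levels[1:]):
        pointers, k = [], 0
        for key in upper:
            while k < len(lower) and lower[k] < key:
                k += 1
            pointers.append(k)
        bridges.append(pointers)
    return bridges


def query(levels, bridges, key):
    # one full binary search at the smallest level; every larger level is
    # entered through a bridge pointer and finished with a short local scan
    hits = []
    pos = bisect.bisect_left(levels[0], key)
    for i, level in enumerate(levels):
        if i > 0:
            upper = levels[i - 1]
            if upper and pos > 0:
                # the predecessor in the level above points close to our key here
                pos = bridges[i - 1][min(pos, len(upper)) - 1]
            else:
                pos = 0
            while pos < len(level) and level[pos] < key:
                pos += 1
        if pos < len(level) and level[pos] == key:
            hits.append((i, pos))
    return hits


levels = [[10, 30], [5, 20, 25, 40], [1, 2, 15, 22, 35, 50, 60, 70]]
bridges = build_bridges(levels)
print(query(levels, bridges, 25))  # [(1, 2)]
```

The point of the sketch is that only the smallest level gets a full binary search; every larger level is entered through a saved pointer and finished with a short scan, rather than restarting the search from scratch.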

Let me repeat that: they match B-trees for queries while simultaneously matching LSMs for insertions. It’s nice to note that COLAs are on the mathematically optimal write/read tradeoff curve, and they’re a proof, by example, that B-trees are not optimal.

[Figure: COLAs are on the optimal read/write tradeoff curve]

This flavor of data structure, which combines the insertion speed of the size-doubling hierarchy of sorted structures (the LSM) with the query speed boost of fractional cascading, goes by many names and can be found dressed up in a bunch of surprising ways, but the underlying math, as well as the performance, is exactly the same.

For bonus points, if you read my colleagues’ paper on COLAs, you’ll see that they are described as being log(B) slower than B-trees on queries. This log(B) is easily recouped in practice—giving you the same query speed as B-trees—if you give up so-called cache obliviousness (a property which is nice mathematically, but not as nice as having faster queries).

Write Optimization is the Best Read Optimization

I’ve been focusing on write optimization, and Fractal Trees do go a couple of orders of magnitude faster than B-trees for indexing that pesky non-sequential data. What that means for the user is typically read optimization: you start adding all the indexes you needed all along, since indexes are so wonderfully cheap to update. My motto is: write optimization is the best read optimization!

You can get COLA-style read-optimal, write-optimized goodness here at Tokutek, where it is marketed as Fractal Trees and available in TokuDB for MySQL and MariaDB.

Comments
Deepak N

Very nicely written. Puts the larger ideas into perspective.

One question – in an LSM tree, how are the updates carried out? The paper is difficult to read, and I got the idea that in the implementation there has to be a background thread merging the trees.

Thanks!

Deepak N

I can imagine that the implementation is indeed quite tricky. Thanks a lot! That was a super answer 😀

Dhruv

So w.r.t. LSM trees, I’m seeing that at least one new implementation (LevelDB) uses a fanout of 10 instead of the 2 that I am used to seeing in COLA, for example. i.e., each lower (or higher) level is 10x the previous level. Assume that L0 can hold at most 1 element, L1 at most 10, L2 at most 100, etc…

This doesn’t affect the asymptotics since 10 is still a constant and inserts are still O((1/B) * log N) amortized, but there’s a huge honking constant (~55) hidden in the big-oh. What do you make of this, and why do you think this tradeoff was made?

p.s. In spam protection, you should start asking big-Oh related questions rather than algebra 😉

Dhruv

Oops forgot to divide the constant. Should be 5, not 55 – not so bad now huh? But I’m still interested in knowing why the magic number 10 might have been chosen.

Dhruv

Thanks a lot for your comments Leif! 🙂

> In a Fractal Tree, when you increase the fanout, you
> decrease the size of a flush, which is the number that
> represents the amount of amortizing going on.

I am assuming 2 things:
1. I am assuming you are speaking of a tree that roughly looks like a buffer-tree (with sqrt(B) keys, and hence fanout, and the rest of the space used for the insert buffer).
2. You keep the size of the node the same, which means that there’s less buffer space.

> … which is a nice thing to say on marketing materials.

That’s actually pretty neat as calculations go! 🙂

> Changing the fanout is something that we talk about from
> time to time, it’s somewhat open for discussion.

Is there any method to this “madness” of selecting the fanout/block-size? I was thinking that a simple program could do a bunch of seeks, read 1 byte from each seeked location, then do reads of 1k, 2k, 4k, etc., and the point where the seek time equals the read time would dictate the block size. i.e., if it takes 2ms to seek and 2ms on average to read 143kiB, then the block size should be 143kiB, since that’s where the two meet. Is this intuitively okay or do you see some holes in it?

Dhruv

> Yep, it looks like a buffer-tree, but with 4 <= F […] F = sqrt(B).

Makes sense. Do you have any numbers on how insert/query performance is affected by changing these?

My guess would be that if F grows past the point where one page exceeds the “optimal block size”, then both suffer, since you end up reading more data than you can in the time a seek takes.

> And we keep the max node size the same (but it’s allowed to be
> smaller, and it can be bigger for a moment before it flushes or splits).

Am guessing this is an implementation issue to account for rows that cause the buffer to spill out of the allocated page size. And smaller might be to reduce the amount of data read in case most of the buffer space is unused. Please correct me if I am wrong, since these seem like useful optimizations.

Will let you know when I run tests. I am guessing the results will vary across HDD and SSD.

Scott Carey

This talks about disk efficiency quite a bit, but what about CPU efficiency? How do these data structures compare for CPU usage for writes, scans, and queries? It has been my experience that many data structures are built to optimize disk but fail to optimize well for CPU, leading to bottlenecks if the I/O subsystem is fast enough.

Lower CPU use can translate into less I/O bandwidth — just like how write optimization can allow for more indexes and thus faster reads, lower CPU can be used to add delta compression of keys or compression of disk blocks.

Does this data structure support any delta-compression of keys (or similar)? I’m working with large, long keys where shared prefixes are common and long (20 to 150 bytes).

Michael

Information about cache-friendly trees is hard to come by, and mostly concerned with optimizing the on-disk portion, whereas I am concerned mostly with in-memory data structures.

I read the COLA paper a few years ago and again this last week. You mention briefly that you can in practice speed up the queries. Can you point me in the direction of how to do this?

In the paper, they implemented cache-aware COLAs, called g-COLAs. It was against these that they ran the performance tests, so I’m guessing it is something above and beyond this that you are talking about.