Write Optimization: Myths, Comparison, Clarifications

Some indexing structures are write optimized in that they are better than B-trees at ingesting data. Other indexing structures are read optimized in that they are better than B-trees at query time. Even within B-trees, there is a tradeoff between write performance and read performance. For example, non-clustering B-trees (such as MyISAM) are typically faster at indexing than clustering B-trees (such as InnoDB), but are then slower at queries.

This post is the first of two about how to understand write optimization, what it means for overall performance, and what the difference is between different write-optimized indexing schemes. We’ll be talking about how to deal with workloads that don’t fit in memory—in particular, if we had our data in B-trees, only the internal nodes (perhaps not even all of them) would fit in memory.

As I’ve already said, there is a tradeoff between write and read speed in B-trees. There is also a mathematically optimal tradeoff that’s been proven between data ingestion and query speed, no matter what data structure you have.

But these are not the same tradeoff! B-trees are not even remotely optimal in terms of read/write tradeoff. I’d say this is the most common confusion I run into on this topic. The way it comes up is in the assumption people seem to make that if you are faster than a B-tree at indexing, you must pay a read penalty. This is simply not true, and I intend to convince you, but today I’ll just describe how I think people arrive at this confusion.

B-trees are not on the optimal insert/query tradeoff curve

Extreme solutions

What’s the best we can do if we only worry about writes? Or reads? By considering these extreme cases; we can get a feel for the space of all possible solutions.

What’s the most write-optimized structure? Simply log all the data. You ingest data at the bandwidth of the storage system, and you can’t do better than that. The problem is that each query now requires all the data to be examined, and that’s a pretty lousy way to get query performance (unless, of course, you intend to look at it all anyway, like MapReduce does).

At the other extreme, you get maximum read performance by re-sorting the data and laying it out optimally for queries (for each index!) every time new data is inserted. The layout is technical, but can be done, and query performance doesn’t get any better than this. But then insertion speeds are disastrously slow.


For most real workloads, we can’t sacrifice one aspect of performance to benefit another, we need something that works well for both.

A B-tree is one compromise between reads and writes. They’re easy to understand and they’ve been popular for decades, but as the data structure grows, their performance falls to pieces. Heavy use over the years has led to all kinds of refinements, and there are plenty of write optimizations for B-trees we could discuss at length. Here’s a little observation that will help us out: write optimization in B-trees involves maximizing the useful work we accomplish each time we touch a leaf.1

Let’s examine some write optimizations and see how this observation applies. We’ll also see some drawbacks of each technique.

  • Insert in sequential order: B-trees are a couple of orders of magnitude faster at inserting data in sequential order compared to random order. This is because once we insert a row into a leaf, that leaf is in memory, and since the next run of rows are destined for the same leaf, they can all be inserted cheaply. According to the disk, it’s almost like you’re just logging the data.

    This property of B-trees leads many people to bend over backwards trying to keep their insertions sequential. In many applications, this is not a natural or easy thing to do, and causes all sorts of problems elsewhere in the system.

  • Use a write buffer: We store up a bunch of insertions in memory. When we have too many, we pick a leaf and write all the rows we’ve stored that are going to that leaf.

    This optimization works when we get to pick a bunch of stored rows that are going to the same leaf. When this happens, we see a speedup: you get to accomplish a bunch of work just for touching a leaf once.

    You can expect a win of a factor of 2 or so for this optimization, which is nice, but leaves a factor of more than 100 on the table. The query cost doesn’t take much of a hit in this case. Even though you do have to query the write buffer, it’s in memory so it’s way faster than querying the tree on disk.

  • OLAP: OLAP is a bunch of things, but with respect to insertions, the idea is to save up a big batch of rows, pre-sort them, and then insert them into an existing B-tree in sorted order. You can think of it like an on-disk version of the insertion buffer. The big win happens when the amount you batch up is of size comparable to the B-tree you already have, and this gets you the insertion boost that a write buffer can’t achieve.

    The downside is that, unlike the in-memory write buffer, the rows you are buffering don’t get indexed—they just get logged, and we already said how bad query performance is for a log. In practice, OLAP users don’t even look at their data until it gets indexed.

    So by batching more, you get better insertion speed, but wait longer before your data is available for queries. In this case, you get a write/freshness tradeoff, rather than a write/read tradeoff, but the fundamental reason is the same.

  • Use fewer indexes: In general, each row inserted must go to the leaf of every index. This technique is just: maintain fewer indexes, so we have less to do. We wanted to maximize the work accomplished per leaf we touch, but now we’re minimizing the number of leaves we need to touch per amount of work (row insertion), but we get the same effect.

    This is less like a technique to be employed, and more like an artifact of bad B-tree performance, but it’s probably the most common “optimization.” Its downside makes it the clearest case for true write optimization: if you can’t afford to keep the right set of indexes, you kill your query performance, and this is without a doubt a read/write tradeoff.

These are pretty common tricks for speeding up B-tree insertions. You may have even tried some of them yourself. Each one has a downside though, and often they’re not obvious in production. Maybe if you didn’t need to insert sequentially for speed, you could simplify your application in a useful way. Or if you thought you could afford more indexes, you’d spend the time to think up cool new ways to analyze your data.

I think the reason people are convinced that B-trees are on the optimal tradeoff curve is pretty simple. Since all of the popular write optimizations are modifications to a B-tree, most people end up stuck on the B-tree tradeoff curve. But contrary to conventional wisdom, it is possible to do a lot better. Next post, I’ll explain why.

1: That’s because we can amortize the cost of the I/O—which is needed by a B-tree for most leaves, when the database is bigger than memory—against the work we are able to complete.

Share this post

Comments (5)

  • Mark Callaghan

    I don’t dispute the value of making writes fast without a read-penalty as done by TokuDB, although I want to see that in practice to confirm that TokuDB doesn’t do too many extra reads for an IO-bound workload.

    However I think you understate the value of write-optimizations for traditional b-trees. I published benchmark results showing that the row-insert rate was 80X better with the insert buffer enabled.

    October 7, 2011 at 4:26 pm
  • leif

    The relative sizes of the table and the insert buffer affect how well it
    performs, and that’s on a sliding scale. If a lot of the table fits in
    memory, then there will be a smaller fraction of the leaves that are on
    disk. So the probability that two inserts in the buffer target the same
    leaf (and get amortized) goes up as the table gets smaller, even with a
    fixed insert buffer size. If instead you increase the insert buffer size
    while keeping the table size constant, then you just get more chances to
    amortize away inserts (because there are more of them buffered), and this
    also speeds up inserts.

    In your benchmark, with 1 billion rows and a 16G buffer pool, a lot of the
    table is going to be cached. So whether the buffer pool is being used for
    the insert buffer or to cache leaves, the ratio between the size of the
    insert buffer and the number of leaf nodes that are stuck on disk is going
    to be pretty high, compared to the workload I described in the post. I’m
    not surprised you saw such good insert performance with the buffer, since
    you’re actually closer to the OLAP configuration. The theory predicts the
    performance you saw, it’s just at a different point on the scale from what
    I described in the post.

    More importantly, the client in your benchmark is single-threaded. A
    multi-threaded client should be able to get up to around the number of
    IOPS on your RAID, even without the insert buffer. The fact that you only
    get around 100 inserts per second on a system with 1600 IOPS indicates to
    me that this is the bigger issue: I think the insert buffer is “faking”
    multi-threading for you, so you’re getting more utilization out of all
    your disk heads. If that’s true, comparing multi-threaded insertions with
    and without the insert buffer (on a large table or with a small buffer)
    should only show a small factor improvement, because you’ll get better
    performance without the insert buffer, and that’s a more fair comparison.

    October 13, 2011 at 8:18 pm

Comments are closed.

Use Percona's Technical Forum to ask any follow-up questions on this blog topic.