The Depth of a B-tree

Shlomi Noach recently wrote a useful primer on the depth of B-trees and how that plays out for point queries — in both clustered indexes, like InnoDB’s, and unclustered indexes, like MyISAM’s. Here, I’d like to talk about the effect of B-tree depth on insertions and range queries. And, of course, I’ll talk about alternatives like Fractal Trees, since that’s the basis of Tokutek’s storage engine for MySQL.

Please see Shlomi’s post for details, but I can summarize a few points, partly because I need some vocabulary for the points I’d like to make below. Shlomi notes that there are two main features determining the depth of a B-tree (or B+-tree):

  1. The number of rows in the database. We’ll call that N.
  2. The size of the indexed key. Let’s call B the number of keys that fit in a B-tree node. (Sometimes B refers to the node size itself, rather than the number of keys it holds, but I hope my choice will make sense shortly.)

Given these quantities, the depth of a B-tree is log_B N, give or take a little. That’s just (log N)/(log B). Now we can rephrase Shlomi’s point as noting that smaller keys mean a bigger B, which reduces (log N)/(log B). If we cut the key size in half, then the depth of the B-tree goes from (log N)/(log B) to (log N)/(log 2B) (twice as many keys fit in each tree node), and that’s just (log N)/(1 + log B).

Let’s put some numbers in there. Say you have a billion rows, and you can currently fit 64 keys in a node. Then the depth of the tree is (log 10^9)/(log 64) ≈ 30/6 = 5. Now you rebuild the tree with keys half the size and you get (log 10^9)/(log 128) ≈ 30/7 ≈ 4.3. Assuming the top 3 levels of the tree are in memory, you go from 2 disk seeks on average to 1.3 disk seeks on average, for a 35% speedup.
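If you want to check this arithmetic yourself, here’s a quick back-of-the-envelope sketch in Python, using the same assumptions as above (a billion rows, 64 or 128 keys per node, the top 3 levels cached in memory):

```python
import math

def btree_depth(n_rows, keys_per_node):
    # Depth of a B-tree is roughly log_B(N) = log(N) / log(B).
    return math.log(n_rows) / math.log(keys_per_node)

N = 10**9                          # a billion rows
depth_big = btree_depth(N, 64)     # original key size: ~5 levels
depth_small = btree_depth(N, 128)  # keys half the size: ~4.3 levels

# If the top 3 levels are cached in memory, each remaining
# level costs roughly one disk seek.
seeks_big = depth_big - 3          # ~2 seeks per point query
seeks_small = depth_small - 3      # ~1.3 seeks per point query
savings = 1 - seeks_small / seeks_big  # ~35%
```

The exact numbers wobble a bit depending on how full the nodes are, but the shape of the result is the same: halving the key size shaves a fraction of a level off the tree, and that fraction is all that matters once the top levels are cached.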

That’s a nice savings, assuming, of course, that the new, smaller key is as useful for queries. The time for an insertion into a B-tree enjoys the same savings: an insertion is O((log N)/(log B)), about the same as a point query up to a constant, so you’d get a similar speedup.

What about range queries? Here things aren’t so sensitive to the depth of the tree. In a range query, you seek to the leaf that holds your first row, then iterate through the rows, jumping to sibling leaves as necessary, until you reach your ending row. The initial seek to the first leaf — the part that depends on the depth of the tree — is typically in the noise.
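As a toy illustration (not real storage-engine code), here’s why the scan dominates: once the descent has found the first leaf, a range query just walks sibling pointers, so its cost tracks the number of rows returned rather than the depth of the tree.

```python
class Leaf:
    """A toy B-tree leaf: sorted keys plus a pointer to the right sibling."""
    def __init__(self, keys, next_leaf=None):
        self.keys = keys
        self.next = next_leaf

def range_scan(start_leaf, lo, hi):
    # Assume the O(log_B N) descent already found start_leaf, the leaf
    # containing lo. From here, the work is proportional to the number
    # of keys in [lo, hi], not to the tree depth.
    result = []
    leaf = start_leaf
    while leaf is not None:
        for key in leaf.keys:
            if key > hi:
                return result
            if key >= lo:
                result.append(key)
        leaf = leaf.next
    return result

# Three leaves linked left to right.
c = Leaf([9, 11, 13])
b = Leaf([5, 6, 8], c)
a = Leaf([1, 2, 4], b)
print(range_scan(a, 2, 11))  # [2, 4, 5, 6, 8, 9, 11]
```

Only the one descent to `a` depends on the tree’s depth; everything after that is sequential leaf-to-leaf traversal.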

There is, however, a more subtle effect going on here. If you have fast insertions (say, by using smaller keys), you can afford to keep more indexes. And having the right index around can make a huge difference to a range query: not having an index can mean that a range query over a small number of rows becomes a full table scan. I’ve seen a 5-order-of-magnitude speedup in such queries once the right index is added.

Here’s a simple example of what I’m talking about, using iiBench. [Details available]

We issue a query on a table with no secondary indexes, and then again after building an appropriate secondary index (with query caching off, of course). The query time drops from more than 11 minutes to 0.31 seconds, in this case a speedup of about 2185x.

So for range queries, it all boils down to having the right set of indexes. Faster insertions mean you can keep more indexes, so the savings on insertion time mentioned above can help: if you can afford to keep 4 indexes on your live data using big keys, you can afford around 6 indexes if you use keys half the size. [Well, it’s more complicated than that, because this assumes you keep the top 3 levels of each index in memory. Caveat indexor.]
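The 4-versus-6 figure is just the seek arithmetic from earlier. Here’s a rough sketch, under the same assumptions (a billion rows, the top 3 levels of each index cached in memory):

```python
import math

def seeks_per_insert(n_rows, keys_per_node, cached_levels=3):
    # Each uncached level of the B-tree costs roughly one disk seek.
    return math.log(n_rows) / math.log(keys_per_node) - cached_levels

N = 10**9
cost_big = seeks_per_insert(N, 64)    # ~2 seeks per insert, per index
cost_small = seeks_per_insert(N, 128) # ~1.3 seeks per insert, per index

budget = 4 * cost_big             # seek budget spent on 4 big-key indexes
affordable = budget / cost_small  # ~6 half-size-key indexes in the same budget
```

The caveat in the text applies here too: each extra index needs its own top levels cached, so the memory budget grows even as the seek budget holds steady.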

Once again, this assumes that you can find small keys to use for all your indexes. And the fastest indexes for range queries are covering indexes, which tend to have large keys. [Stay tuned for a tokuview posting on covering indexes.] For non-covering indexes, we’d have to do lots of point queries into the main table. Yikes!

So we are caught in a bind. We’d like to build secondary indexes to make range queries fast, and we’d like the secondary indexes to be covering for our important queries, but covering indexes are slow to build using B-trees, partly because the keys are large and the trees are deep, and partly for other reasons I’ll cover in another blog post.

What to do? Of course, my solution is to use TokuDB for MySQL, which has much faster insertions. TokuDB isn’t built on B-trees, but rather on Fractal Trees, so the whole large-key question goes away. See the Technology Brief on Tokutek’s Technology Page for a discussion of the insertion performance of Fractal Trees and TokuDB.
