Sorry for the delay; now on to range queries and lenient updates. Let’s call them queries and updates, for short. So far, I’ve shown that B-trees (and any of a number of other data structures) are very far from the “tight bound.” I’ll say a bound is tight if it’s a lower bound and you can come up with a data structure that matches it.
So how do we match the bandwidth bound for queries and updates? I already mentioned in passing how to do this, but let’s look more closely.
The way to get fast updates is to log them. You can easily saturate disk bandwidth by writing out the insertion, deletion and update requests with no index.
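The logging idea can be sketched in a few lines. This is a minimal illustration (the class and method names are mine, not from the post): every update is appended to a flat, unindexed log, so writes proceed at sequential bandwidth instead of seek speed.

```python
import io
import json

# A hypothetical append-only update log: no index, no seeks, just
# sequential writes of the raw insert/delete/update requests.
class UpdateLog:
    def __init__(self, f):
        self.f = f  # any append-only file-like object

    def insert(self, key, value):
        self.f.write(json.dumps({"op": "ins", "k": key, "v": value}) + "\n")

    def delete(self, key):
        self.f.write(json.dumps({"op": "del", "k": key}) + "\n")

log = UpdateLog(io.StringIO())  # stand-in for a real disk file
log.insert("a", 1)
log.delete("a")
```

The point is what the code *doesn’t* do: no tree traversal, no random I/O, nothing but an append per request.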
A query now will typically start by sorting the data. Even a point query requires looking at all the data, but a range query requires looking at all the data Θ(log N) times (in order to sort it), or using a large amount of extra storage. Let’s focus on sorting for range queries.
For queries that range over all the data, sorting may be tolerable (after all, you’d be doing a lot better in terms of bandwidth than a B-tree gets over the same query!). This solution does, however, have some drawbacks. First, for smaller queries, where you only look at x% of the data, the fraction of bandwidth you are getting is x/log N, for an appropriate base of the log. You can beat this somewhat if you put a lot of work into it. But you’re still going to get very little of your bandwidth.
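To put a (hypothetical) number on the x/log N penalty: take a query touching 1% of a billion-item store, using base-2 logs. These figures are illustrative, not from the post.

```python
import math

# A query that touches x = 1% of the data in a store of N = 10^9 items.
x = 0.01
N = 10**9

# Fraction of disk bandwidth the sort-everything approach delivers.
fraction = x / math.log2(N)
```

That works out to roughly 0.03% of your bandwidth, which is the “very little” in question.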
[2 points if you’ve figured out that some queries don’t actually require sorting the data or using lots of space. 2 more points if you figured out that end users don’t want this kind of noise in their database, where some queries turn out to be unpredictably slow. A final 2 points for figuring out that almost no one wants to learn enough about a database to be able to predict which queries will be slow.]
The way to get fast queries is to lay out the data on disk carefully. That is, the best you could do would be to take all your data, sort it, and insert it in order into a sensible database (MySQL + InnoDB will do nicely) that will lay out the leaves more or less in order. Then range queries become linear scans on the disk, and you can get close to bandwidth rates. This is one of the main tricks for getting performance out of a data warehouse. Dumping and reloading data for this purpose is also one of the time-consuming things that DBAs do.
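The sorted-layout trick can be sketched as follows, using an in-memory list as a stand-in for the on-disk leaves (the data and function names are illustrative): once the rows are bulk-loaded in key order, a range query reduces to one contiguous slice, which on a real disk is a sequential scan at near-bandwidth rates.

```python
import bisect

# Bulk-load: sort the rows by key once, up front, so they sit
# contiguously in key order (as InnoDB leaves would, roughly).
rows = sorted([(17, "a"), (3, "b"), (99, "c"), (42, "d")])
keys = [k for k, _ in rows]

def range_query(lo, hi):
    # Find the endpoints, then return one contiguous slice --
    # the analogue of a single sequential scan on disk.
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return rows[i:j]
```

The query itself does no per-row seeking; all the random-access work was paid once, at load time.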
Inserting data now requires batching it up, sorting the batch, and bulk-inserting it. The time per insertion might not be so bad, but the latency is killer. It is not uncommon for data to become available for query in the database a day or two after it comes into the system.
As we try to tease apart what’s fast and what’s inherently slow in a database, we get the idea that there’s some basic tradeoff between updates (even lenient ones) and range queries. Is this tradeoff inherent, or is there some way to get bandwidth speeds on both lenient updates and range queries?
Next time we’ll look at some tradeoff lower bounds and try to get closer to some answer.