Last time I talked about point queries. The conclusion was that big databases and point queries don’t mix. It’s ok to do them from time to time, but it’s not how you’re going to use your database, unless you have a lot of time. Today, I’d like to talk about range queries, which seem much more useful for the analysis of big databases, say in a business intelligence setting.
Recall that the focus is on the storage engine (a la MySQL) level, and a database on a single disk — the one we are using for illustration is the 1TB Hitachi Deskstar 7K1000. It has a disk seek time 14ms and transfer rate of around 69MB/s [See tomshardware.com] Now imagine filling the disk with random
Suppose the above data is stored in a B-tree, and that you’d like to iterate over all the data in order by key. Further suppose that the B-tree has 4KB blocks. Thus each leaf has 256 key-value pairs. The tree has around 244 million leaves. Before we figure out how long it’s going to take to look at those leaves, we have to understand how aged the B-tree is.
A B-tree that you get from loading up the data in sorted order in one fell swoop has all the leaves laid out in a relatively nice order, so that sibling leaves tend to be on the same track, and you get lots of leaves for each disk seek. As the B-tree ages (either from doing insertions & deletions, or from inserting things in a more random order to begin with), the leaves tend to be scattered all over the place. Let’s assume that the tree is well aged to see how bad things can get.
A range query over the whole data set in our scenario will give a disk seek per leaf, which would take 3.4 x 10^6 seconds (= 244 x 10^6 x 14 x 10^-3 seconds) or 39 days. For one query.
Now the question is, can you do better? The first thing to note is that I have made my argument as to how long a B-tree would take, and in particular one that’s aged and has a specific block size. In the point query case, I made an argument about *any* data structure. In this case, I can’t. I can’t say that there isn’t some other data structure that ages well, for example, and thus does better. For example, in the extreme case where the data has been sorted on disk, it can be read at full bandwidth, in which case it would take 10 hours to do a range query. That’s still a longish time, but it’s not 39 days, and it means that the B-tree example is using about 1% of available disk bandwidth.
So there are certainly ways to fix this. Use a bigger block size for the B-tree (and kill insertion performance). Sort the data before insertion into the B-tree (and kill insertion performance). The point is that unlike point queries, the natural lower bound is band width, not disk seeks, and the textbook data structure for storing data on disk has poor performance for this operation.
As I’ve already argued, range queries are the natural query type for large data sets, not least because of the argument above, that point queries can’t even be implemented on disks, but also more positively because large data is useful for large-scale market analysis, where aggregate statistics about the data set are mined. And that’s implemented by range queries.
Unlike point queries, where performance will only be gained by (expensive!) hardware changes, data structural changes are key for range queries. In a way, we already know this. What is a data warehouse if not a database that’s been optimized for range queries at the expense of insertion times.
In future postings, we’ll see if there are any options to data warehouse for achieving fast range queries (hint: there are). In the next posting, we’ll look at insertion rates.