# Tradeoff: Insertions versus Point Queries

| March 11, 2008 |  Posted In: Tokutek, TokuView

I’ve been waving my hands about lower bounds. Well, sometimes I haven’t been waving my hands, because the lower bounds are tight. But in other cases (lenient insertions, range queries), the lower bounds are very far from what we’re used to.

So now, for a bit of math:

Brodal and Fagerberg showed in 2003 that there’s a tradeoff between insertions and queries. The insertions they consider are lenient. Well, any lower bound for lenient is a lower bound for strict, but they also gave upper bounds, so it matters. Also, they don’t know from lenient, but if you look at their upper bound, they are implementing lenient insertions. The queries they consider are, unfortunately, point queries. That’s too bad for us, because we’ve already seen that point queries are just too slow to be of interest on hard disks.

Still, they have matching upper and lower bounds, so let’s see what they say:

They show that, for any p between 0 and 1, if Query time < 1/p * log_B N then Insertion time > (1/pB^{1-p} log_B N). And visa versa, that is, we get a lower envelope of Query, Insertion times that are possible. Let try p=1. This says that if Query time is at most log_B N (which a B-tree gets), then insertion time is at least log_B N (also what a B-tree gets). So a B-tree is optimal for one point of this tradeoff.

More interestingly, when p=1/2, the query time doubles to 2log_B N. The insertion time is then no less than 2/sqrt{B} * log_B N. Wow. Taking B=4K, sqrt{B} = 64, and we get something that’s 32 times faster than a B-tree. But we’re taking about a lower bound. Maybe it’s not a very good lower bound. Maybe there’s no data structure that does this well.

It turns out there is (please see their paper for more details). Consider what that means: you can speed up insertions by a factor of 32 by slowing down queries by a factor of 2. That’s a pretty favorable tradeoff! Unfortunately, their solution doesn’t do range queries any better than B-trees — in fact, it’s somewhat more cumbersome, though no worse asymptotically than B-trees.

Conclusions

The Brodal & Fagerberg is quite nice. They’ve really captured something interesting about storing data on disk. However, it’s not the result we need, because the tradeoff that I think is interesting “in the field” is one between insertions and range queries, the kind of thing that translates to a MySQL database. But range queries depend on locality on secondary storage, and the typical mathematical models in use for thinking about external memory (the Disk Access Model and the Cache-Oblivious Model, about which more later) don’t capture this sort of locality. They focus on intra-block locality, not inter-block locality. I don’t know of any theoretical results that address the problem we have been considering.

So if you can’t do theory, you do hacks. Next time, we’ll start looking at some of the hacks that people use (or claim to use) to speed up their databases. Which of them play nicely with disks?