Recall that I’ve claimed that it takes 28 years to fill a disk with random insertions, given a set of reasonable assumptions. Recall what they are:
We are focusing on the storage engine (a la MySQL) level, and we are looking at a database on a single disk — the one we are using for illustration is the 1TB Hitachi Deskstar 7K1000. It has a disk seek time 14ms and transfer rate of around 69MB/s. [See tomshardware.com] We insert random
Now, my analysis requires each insertion to induce a disk seek. Suppose we do something clever with main memory. After all, we have this main memory hanging around. It should be possible to buffer up some insertions, and once we fill up main memory, insert key/value pairs that belong on the same leaf. Thus, fetching a leaf will be amortized over the number of rows that we expect to insert into a leaf.
So what’s the best we could do? If we could somehow figure it out, the best we could do would be to take the leaf that’s going to receive the most insertions, fetch that one in, and do all the insertions from memory into that one. This assume we won’t overflow that leaf and have to rebalance, etc. We’re just going to count the cost of fetching in a leaf.
At first, you’d get a lot of advantage, since you start with an empty database with no leaves, so the first set of insertions (256 of them, in fact) will all go to the first leaf. So let’s consider the situation where we’ve filled up the database half way. We’ll discount this time and just look at the time to fill up the second half of the disk. At that point, we have 2^27 leaves.
Now, let’s suppose we have 1GB = 2^30 bytes of memory. It really won’t make an appreciable difference, it turns out, if we have as much or half as much memory. With this much memory, we can get 2^27 rows into memory.
Now we have what is know as a ball and bins problem. Think of the leaves as a set of 2^27 bins, and the in-memory rows as 2^27 balls. Since the insertion keys are random, we can think of each row/ball as picking a random leaf/bin where it goes. Now, what is the expected maximum number of balls in any single bin? The answer is well known — ln (2^27) / lnln (2^27) = 18.7/2.9 = 6.4. So we expect the very best leaf to typically get 6.4 rows inserted.
So now the maximum speedup we could expect is 14 years/6.4 = 2.1875 years. Now, that’s a lot better, but in fact, it’d be hard (and probably impossible) to get anywhere near that performance. Why? First, there’s the question of how you know which rows go on the same leaf without actually having the leaves. You could keep a B-tree in memory of the minimum key in each leaf, but this would fill half of memory. Second, we didn’t count the time to fill the first half of memory. We could do a finer analysis of the time needed to load the last half of the database, plus the preceding 1/4 rows, plus the preceding 1/8, etc., until you have a small enough database to fit in memory, and that ends up at around 4 years.
Finally, and most importantly, the factor of 6.4 is far too generous. The average number of balls in each bin is 1, and we only expect a single bin to get above 6 balls. Then we immediately empty that bin (by inserting the rows into the database). Now we throw some more rows in memory (balls into bins), and they are very unlikely to produce yet another 6-ball bin. That are almost certainly going to land in bins that are either empty or have a single ball. Not to get too technical, the number of bins with a large number of balls drops exponentially as the number of balls increases, until we get down to single bin with more than 6 balls. As you empty these very full bins, it takes a large number of new balls to get more bins that have a lot of balls.
Now, if I were still an academic, I’d take the time to study this recycling balls-and-bins problem, but instead I have a company now, and theorems of this time are not really on any critical path! Still, one of my partners, Michael Bender, suspects that the answer is going to be lnln 2^27 = 2.9. I’d be very surprised if the answer is any more than this.
So a realistic number for going through all this work is that it’ll take 28 years/2.9 = 9.7 years. Your disk will still die before you fill it.
My recommendation: don’t use this heuristic!
Sorting before inserting?
What about another heuristic? How about, when memory fills, sort the in-memory rows, then merge the sorted order of the database with the sorted order of the in-memory inserted rows. To compute the time, we need to figure out how good a job we do of keeping the leaves of the tree in order on the disk. If we keep them all nicely in order and packed — which is to say, if every time we merge, we rebuild the index from scratch, we get the following: The bandwidth time for one batch of memory is 1GB/69MB/s = 14.5s. Discounting sort time, the first memory-full of rows takes 14.5s, the second batch takes 14.5s * 2, then 14.5s * 3, up until 14.5s * 1000, for the last batch before filling memory. That gives a total time of 14.5s * 1000 * 1001/2 = 7.3 million seconds = 84 days. What an improvement! Still, it’s almost 3 months to index the data, compared to the 10 hours it takes to log the data. So your yield is 0.2% of bandwidth.
Let me point out that we’ve made some pretty big assumptions here. For one thing, we’ve assumed that you are rebuilding the index from scratch each time. It’s much faster to build an index on data given in sequential order, but surely it’s not free! So the 3 months to build the index this way is a gross underestimate. If you don’t rebuild the index each time, your database ages. It’s pretty tricky to estimate the time it takes to build the index this way, but a very nasty bit of estimating later, I get that it takes at least 41 years (my main assumption is that once the database is half full, it is fully aged). Oy!
By now you’re saying to yourself that you can get a 1 TB database built in much less than 3 months. If you have all the data ahead of time, you can sort the data in roughly 40 hours, plus another at least 10 hours to build the index (assuming our magical no-overhead indexer for sorted data). So you get a tradeoff between
- Waiting for all the data to come in (and get stale), then 2 days to build the index.
- Insert as you go at a limited rate, so that the entire project takes 3 months. Throughout the process, all inserted data is available for querying.
- Anything in between.
And we’re back to the old tradeoff: you can get faster insertion if you are willing to let enough of your data get stale. I want a storage engine that’s fast for insertions, fast for range queries and doesn’t age. So does everyone else. This would end 90% of the kludges of how DBAs deal with database (ok, so I made up the 90%, but it’s probably not far off, is it?).
Next, we’re going to look at Cache Obliviousness, and see if we can get any closer to this impossible dream.