Last time, I introduced the notion of strict and lenient updates. Now it’s time to see what the performance characteristics are of each.
Just to rehash, we are focusing on the storage engine (a la MySQL) level, and we are looking at a database on a single disk — the one we are using for illustration is the 1TB Hitachi Deskstar 7K1000. It has a disk seek time 14ms and transfer rate of around 69MB/s [See tomshardware.com] We will insert and delete random
These are the easier update types to analyze. Please review the definition of strict updates from the last blog entry. Now notice that each insertion or deletion requires a point query. For example, during an insertion, in order to determine if there’s already a row with a particular key value in the database, one must look up that key. In order to tell how many rows are removed during a delete operation (if any), one must look up the key being deleted. There’s no getting around the fact that the disk head needs to move to the location on disk of the information being returned.
Since these point queries are not the main point of what’s going on, and since many people might not realize that they are even happening, I’m going to call them cryptogets (the get because a point query is often referred to as a get at the storage engine level).
Since no strict update can be faster than a point query, filling up the disk with data takes at least 28 years. And once again, this is independent of the data structure used.
What about lenient updates? The only lower bound is the 10 hour bandwidth lower bound we’ve already talked about. Ten hours is the time it takes to write the data out sequentially on disk. And certainly, if you don’t need to return any information at the time of an update, you could just log the updates, which is to say, you could write them out sequentially to disk.
That makes queries slow, but the point is that a log is a (very simple) data structure that matches the disk-bandwidth lower bound.
What about B-trees? A B-tree index will do (at least) one disk seek per update. That means that it will take at least 28 years to finish filling the disk. This ratio is even worse than B-trees achieved for range queries.
[Yes, yes, I know, what about presorting the data before inserting? Won’t that make things much faster? Yes, it will, but that’s the subject of another blog entry.]
B-trees do a disk seek per insertion during which they swap in all the information needed to implement a strict update. They don’t get any performance benefit from implementing lenient updates. I believe that this is the reason that people don’t make the distinction between different classes of B-trees.
So far I’ve focused on lower bounds. In future blog entries, I’ll talk about algorithms for actually implementing lenient updates. Ok, here comes the punch line: lenient updates stomp all over strict updates, performancewise.
But first, we’ll look at the tradeoff between updates and range queries. Stay tuned.