Making “Replace Into” Fast, by Avoiding Disk Seeks

Posted on:



Share Button

In this post two weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use “replace into” or “insert ignore” over normal inserts, because the semantics of these statements do NOT require disk seeks. In this post, I explain how the command “replace into” can be fast with fractal trees.

The semantics of “replace into” are as follows:

  • if the primary (or unique) key does not exist, insert the new row
  • if the primary (or unique) key does exist, overwrite the existing row with the new row

The slow, expensive way B-trees use to implement these semantics are:

  • look up the primary (or unique key), to verify its existence
  • if it does not exist, insert the new row, otherwise overwrite the existing row

The first step incurs a disk seek. That slows down performance considerably.

Instead, with TokuDB’s fractal tree data structure, we can follow a similar strategy to what we use for deletes. Recall that for deletes, we do not look up the key we wish to delete, and physically remove its data from the tree. Instead, we use tombstone messaging, or tombstone deletions, to defer the physical removal of the data to a better time. The same idea applies here. Instead of searching for the primary key, as we are forced to do with B-trees, we insert an insertion message into the fractal tree, and defer the existence check for later.

Let us look at an example. Take the following table:

Suppose the fractal tree for this table looks as follows:

The ‘i’ stands for insertion message. Now suppose we do:

With fractal trees, we simply insert (i (1000,1001)) into the top node. The tree then looks as such:

Similar to deletes, with this scheme, “replace into” can be two orders of magnitude faster than insertions into a B-tree.

On queries, a message in a higher node overrides messages in lower nodes. So upon querying the key ‘1000’, a cursor notices that (1000,1001) is located higher than (1000,1000), and therefore returns (1000,1001) to the user. On merges, the message in the higher node overwrites the message in the lower node.

So, by using messages, fractal trees can achieve the same performance boost for “replace into” as it does for insertions. In fact, using “replace into” in the manner above is how a customer has achieved an 80x speedup under actual field conditions. The details can be found here.

Next week, I explain how “insert ignore” can similarly be fast. Also, while this shows how “replace into” can be fast (and IS fast for a lot of scenarios we see with TokuDB), there are some caveats (with indexes, triggers, and replication). I will get into those in a couple of weeks.

Share Button


, , , , , , , ,

Tokutek, TokuView


Leave a Reply

Percona’s widely read Percona Data Performance blog highlights our expertise in enterprise-class software, support, consulting and managed services solutions for both MySQL® and MongoDB® across traditional and cloud-based platforms. The decades of experience represented by our consultants is found daily in numerous and relevant blog posts.

Besides specific database help, the blog also provides notices on upcoming events and webinars.

Want to get weekly updates listing the latest blog posts? Subscribe to our blog now! Submit your email address below.

No, thank you. Please do not ask me again.