Making “Replace Into” Fast, by Avoiding Disk Seeks

In this post two weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use “replace into” or “insert ignore” over normal inserts, because the semantics of these statements do NOT require disk seeks. In this post, I explain how the command “replace into” can be fast with fractal trees.

The semantics of “replace into” are as follows:

  • if the primary (or unique) key does not exist, insert the new row
  • if the primary (or unique) key does exist, overwrite the existing row with the new row

The slow, expensive way B-trees use to implement these semantics are:

  • look up the primary (or unique key), to verify its existence
  • if it does not exist, insert the new row, otherwise overwrite the existing row

The first step incurs a disk seek. That slows down performance considerably.

Instead, with TokuDB’s fractal tree data structure, we can follow a similar strategy to what we use for deletes. Recall that for deletes, we do not look up the key we wish to delete, and physically remove its data from the tree. Instead, we use tombstone messaging, or tombstone deletions, to defer the physical removal of the data to a better time. The same idea applies here. Instead of searching for the primary key, as we are forced to do with B-trees, we insert an insertion message into the fractal tree, and defer the existence check for later.

Let us look at an example. Take the following table:

Suppose the fractal tree for this table looks as follows:

The ‘i’ stands for insertion message. Now suppose we do:

With fractal trees, we simply insert (i (1000,1001)) into the top node. The tree then looks as such:

Similar to deletes, with this scheme, “replace into” can be two orders of magnitude faster than insertions into a B-tree.

On queries, a message in a higher node overrides messages in lower nodes. So upon querying the key ‘1000’, a cursor notices that (1000,1001) is located higher than (1000,1000), and therefore returns (1000,1001) to the user. On merges, the message in the higher node overwrites the message in the lower node.

So, by using messages, fractal trees can achieve the same performance boost for “replace into” as it does for insertions. In fact, using “replace into” in the manner above is how a customer has achieved an 80x speedup under actual field conditions. The details can be found here.

Next week, I explain how “insert ignore” can similarly be fast. Also, while this shows how “replace into” can be fast (and IS fast for a lot of scenarios we see with TokuDB), there are some caveats (with indexes, triggers, and replication). I will get into those in a couple of weeks.

Share this post

Comments (7)

  • Kristian Nielsen

    > * if the primary (or unique) key does exist, overwrite the existing row with the
    > new row

    Actually, the semantics is that any existing row is deleted, and then the row is inserted:

    This has subtle implications (as the article hints in the last paragraph). For example DELETE triggers are fired when a previous row exists, and “rows affected” is supposed to return two rows affected rather than one. And columns not mentioned explicitly in REPLACE are given DEFAULT values, not left alone as with UPDATE.

    (And some of these implications are annoying, as they prevent some optimisations that engines like NDB wants to do. Though an engine can choose to implement different semantics if it wants, of course, at the cost of risking user confusion).

    June 30, 2010 at 3:39 am

Comments are closed.