How clustering indexes sometimes help UPDATE and DELETE performance

I recently posted a blog entry on clustering indexes, which are good for speeding up queries. Eric Day brought up the concern that clustering indexes might degrade update performance. This is often true, since any update will require updating the clustering index as well.

However, there are some cases in TokuDB for MySQL, where the opposite is true: clustering indexes can drastically improve the performance of updates and deletions. Consider the following analysis and example.

Updates and deletions generally have two steps. First, a query is done to find the necessary rows (the where clause in the statement), and then the rows are modified: they are deleted or updated, as the case may be. So the total time to do a deletion or an update is

Eric noted that T_change depends on the number of indexes. So adding an index will slow down T_change for deletion, because for each row removed, there will need to be a separate deletion in each index.

For updates, it’s a little more complicated. Only those indexes that have data being changed need to be updated. Still, a clustering index contains all data, so every new clustering index will increase T_change.

It should be clear where I’m heading. If we want to reduce T_total, and T_change just went up by adding a clustering index, then the clustering index had better reduce T_query to compensate.

Of course, that’s the whole point of indexes, to reduce query times. But in the case of a clustering index, we have the entire row. Normally, you have to go to the primary index to look up everything you need to do a deletion or update. But any clustering index will suffice. Some clustering index may be a lot faster for some updates or deletes than the primary table.

Here is an example where a clustering key speeds up T_query enough to compensate for an increase in T_change. Suppose we have 50M rows in the iiBench table, but with only the primary key defined:

Now suppose we want to update about 50k rows (about 0.1% of the database) with the following statement:

MyISAM executes this update in 37.27 seconds. In this case, there’s no index on customerid, so T_query is the time of a full table scan. Perhaps we can speed up this update by adding an index to reduce T_query.

Suppose we add a key on on the field (customerid). Now T_total = 293s. This has to be because T_query is much worse, since T_change is not changed by having this secondary index (because price is not present in the index).

The slowdown results from a problem in the query plan. MyISAM tries to use the index (customerid) to retrieve the necessary rows for modification instead of using a full table scan. Because the index is not clustering, MyISAM has to make random point queries into its main table to retrieve all of the rows. This turned out to be much slower than doing a full table scan. As a result, using the index slows down MyISAM by a lot.

Now suppose we have a similar tokudb table:

TokuDB executes this update in 91s. Once again, we want to reduce T_query, but in this case, we can use a clustering index on (customerid) . This query should be fast, because the query is a pure range query, with no subsequent point queries. (Getting rid of secondary point queries is the biggest benefit of clustering indexes.)

After adding the clustering index:

TokuDB executes this query in 1.97s. This is nearly a 20X improvement over MyISAM with no index and a 45X improvement over TokuDB with no index.

Just a note here: The whole technical innovation of Fractal Trees is to make T_change very small. Normally, you see this on very fast insertions on lots of indexes, but here you see in an update.

A similar example shows the same benefits for deletions. Suppose we wanted to delete 50K rows (still 0.1% of the table):

MyISAM executes this query in 38.75s. TokuDB, with a clustering index on (customerid), executes this query in 1.77s.

This is not to say that clustering indexes always speed up updates and deletions. When applying an update or deletion, both the main table and clustering index must be modified. This is the T_query vs T_change tradeoff mentioned above.

Clustering indexes do not always speed up updates and deletions, and certainly not when updating or deleting large fractions of the table. To take an extreme example, if you are updating half of a table, the time savings on T_query is not that big (versus doing a full table scan of the primary index), whereas now you do get hurt by an increase in T_change. The exact tipping point between helping and hurting depends on too many factors to summarize easily.

In conclusion, for updates and deletions that modify a small percentage of the table, by adding a clustering key to reduce the time for the query in the statement, a user can significantly reduce the runtime of updates and deletions.

Share this post