Yesterday, I (Zardosht) posted an entry introducing clustering indexes. Here, I elaborate on three differences between a clustering index and a covering index:
MySQL allows at most 16 columns per index and at most 3072 bytes per index key. For tables with more than 16 columns, or rows larger than 3072 bytes, one cannot create a covering index that includes all of the fields. Suppose one had the following table with 26 columns:
```sql
create table foo (a int, b int, c int, ..., y int, z int);
```
In standard MySQL, one cannot create a covering index for the query:
```sql
select * from foo where a=a_num and b=b_num;
```
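An attempt to index all 26 columns is rejected because it exceeds the 16-column limit. As an illustration (the index name here is made up, and the exact error text can vary by MySQL version):

```sql
alter table foo add index all_cols (a, b, c, ..., y, z);
-- ERROR 1070 (42000): Too many key parts specified; max 16 parts allowed
```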
However, one can create the following clustering index to speed up the query as:
```sql
alter table foo add clustering index (a,b);
```
The intent of a clustering index definition is easier to understand because it only includes columns necessary to achieve the desired sort order. Columns needed to satisfy the select clause can be left out. As users add more indexes to speed up more queries, simplifying schema definitions becomes increasingly important.
Consider the table:
```sql
create table foo (
  a int,
  aa int,
  aaa int,
  aaaa int,
  b int,
  bb int,
  bbb int,
  bbbb int,
  c int,
  cc int,
  ccc int,
  cccc int,
  d int,
  dd int,
  ddd int,
  dddd int
);
```
Now suppose you want to speed up this query:
```sql
select * from foo where a=a_num and b=b_num;
```
To create a covering index for this query you do:
```sql
alter table foo add index cvr_idx (a,b,aa,aaa,aaaa,bb,bbb,bbbb,c,cc,ccc,cccc,d,dd,ddd,dddd);
```
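You can confirm that MySQL is actually using the index as a covering index with EXPLAIN: when the query is satisfied entirely from the index, without touching the base rows, the Extra column reports "Using index". A quick sketch (the literal values here are placeholders):

```sql
explain select * from foo where a=1 and b=2;
-- key: cvr_idx
-- Extra: Using index   -- the query never reads the base table rows
```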
To create a clustering index for this you do:
```sql
alter table foo add clustering index cvr_idx (a,b);
```
Another example of the simplification shows up when you want to add a column. With clustering indexes you can write
```sql
alter table foo add column z int;
```
and you don’t need to change the index definition. With a covering index, you need to:
```sql
alter table foo drop index cvr_idx;
alter table foo add index cvr_idx (a, b, aa, aaa, aaaa, bb, bbb, bbbb, c, cc, ccc, cccc, d, dd, ddd, dddd, z);
```
If you don’t drop and re-add the covering index, it won’t cover the new column. Covering indexes have similar problems if you drop a column.
The key length of a clustering index is much smaller than that of a covering index, and smaller keys have performance benefits: inserting into a clustering index is often faster than inserting into an equivalent covering index. Some other (non-MySQL) databases provide syntax to include nonkey columns in an index. One can think of a clustering index as the extreme case, in which all remaining columns are added as nonkey columns. As Kristian Nielsen phrased it in the comments of the earlier posting, the extra fields are stored only in the leaves of a B-tree, not in the internal nodes. That can be useful when the row size is much larger than the key size. (Although Fractal Tree® indexes aren’t exactly organized like B-trees, this insight is essentially correct.)
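For comparison, the nonkey-column syntax mentioned above looks like the following in databases that support it, such as SQL Server and PostgreSQL (11+). This is a sketch in their dialect, not valid MySQL, and the index name is made up:

```sql
-- Not MySQL syntax: c is carried only in the index leaves, so it can
-- satisfy queries on (a, b) that also select c, without inflating the
-- key stored in the internal nodes used for searching.
create index idx_ab on foo (a, b) include (c);
```

A clustering index behaves as if every column not already in the key were listed in such an INCLUDE clause.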
Maintaining clustering indexes has a similar cost to maintaining covering indexes. Specifically, data gets inserted into multiple places. This is the tradeoff for many storage engines: to use clustering indexes to get faster queries, one must use more I/O operations and extra disk storage.
For TokuDB these tradeoffs are a lot easier to make. TokuDB can insert data an order of magnitude faster than other storage engines, while maintaining indexes. Percona measured our insertion rate at 10.6x faster than InnoDB’s, and measured our database size at 5x to 6x smaller than InnoDB and MyISAM. Thus, the tradeoff is more manageable for TokuDB than it is for other storage engines. This was the motivation for implementing the clustering index feature.
Clustering indexes provide the benefits of covering indexes in an easy-to-use manner. They offer immediate performance gains that are difficult or impossible to achieve with covering indexes. TokuDB can maintain indexes while inserting data an order of magnitude faster than other storage engines, and it compresses the data, making clustering indexes really valuable.