Introducing Multiple Clustering Indexes

In this posting I’ll describe TokuDB’s multiple clustering index feature.

In general (not just for TokuDB) a clustered index or a clustering index is an index that stores the all of the data for the rows. Quoting the MySQL 5.1 reference manual:

Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.

Most storage engines allow at most one clustered index for each table. For example, MyISAM does not support clustered indexes at all, whereas InnoDB allows only the primary key be a clustered index.

In TokuDB, we have added functionality and grammar syntax to define multiple clustered indexes. As a result, users can get the performance advantages of clustered indexes for multiple indexes.

How to define multiple clustered indexes

To define a clustered index in TokuDB, simply add the key word “clustering” in front of any index that is to be created. Here are some examples:

How clustering keys work

Like InnoDB, TokuDB clusters data with its primary key (if no primary key is defined, then TokuDB auto-generates a hidden one). The primary index maintains a copy of the entire row. Each secondary index stores, for each row, the row’s secondary key and primary key.

A clustering index maintains a copy of the entire row, not just the primary key. As a result, when querying on a clustering key lookups in the primary index are always avoided. A clustering index is a covering index for any query. Another way to think of a clustering index is that it is a materialization of the table sorted in another order.

An example using clustering keys

Suppose we had 40M rows in an iiBench table with this schema:

Now suppose we wanted to perform the following queries in sequence:

  • Query 1:

    which scans more than 1200 rows.
  • Query 2:

    which returns 406 rows.
  • Query 3:

    which returns 820 rows.

Ideally, to perform the first query, we would want a covering index of (customerid, price) on the table. To perform the third query, we would want a covering index of (customerid, productid, dateandtime). For the second query, a covering index would need to include all the fields.

With other storage engines, the common solution is to define a key on the field (customerid). Here are the results with TokuDB when using a key on customerid:

Here are the results with MyISAM when using a key on customerid:

Here are the results with TokuDB when using a clustering key on customerid:

As we can see, clustering indexes provide the same order of magnitude of performance as covering indexes, but for a broader range of queries.

A limitation of clustering keys

A clustering key cannot be defined to be unique.

Share this post

Comments (26)

  • Eric Day

    Hi! I can see this being very useful for some workloads, but I think it is very important to mention the main reason why most engines don’t support multiple clustered indexes: INSERT and UPDATE performance will degrade and disk usage will go up. If you maintain multiple copies of the data, it’s that many more I/O ops that need to happen when the data changes.

    May 27, 2009 at 8:05 pm
  • Zardosht Kasheff

    @Eric, thanks for the comment – excellent points! Because TokuDB provides very high indexed insertion rates and compression, TokuDB can really make clustering indexes work. In many cases, TokuDB can meet or exceed insertion rate and disk use requirements even with multiple clustering and/or non-clustering indexes, even on large tables.

    May 27, 2009 at 9:51 pm
  • Kristian Nielsen

    How is this any different from a covering index in MyISAM (or InnoDB) obtained by just appending all remaining columns after the indexed keys?

    Especially with the restriction on no unique indexes, that seems to be exactly the same?

    From the description it sounds like this is just a syntax shortcut for creating covering indexes, which are supported in MyISAM and InnoDB (but not Falcon). But a covering index is not a clustered index, since as Eric writes it does not improve insert/update speed in any scenarios.

    May 27, 2009 at 10:38 pm
  • Zardosht Kasheff

    @Kristian, good questions. I will address covering indexes vs. clustering indexes in a blog post later today.

    May 28, 2009 at 2:53 pm
  • Venu Kalyan

    Can you describe how much extra storage is needed for each add-on clustered index/table ?

    May 28, 2009 at 3:34 pm
  • Zardosht Kasheff

    @Venu, with TokuDB, we typically measure compression ratios of 3x to 12x relative to the amount of raw data in a table. Since a clustering index is another copy of the table in a different sort order, it will probably consume somewhere between 1/3 and 1/12 of the table’s raw data size.

    May 28, 2009 at 5:59 pm
  • Kristian Nielsen

    Ah, now I think I see the difference between this and normal covering indexes. The difference here is that the extra fields are only stored in the leaf nodes (assuming Btree index). The keys in interior nodes need not store the extra fields, unlike in traditional covering indexes where the extra fields needlessly increase the key size proper.

    That might be useful, especially when the row size is much bigger than the key size (not at all uncommon).

    May 28, 2009 at 9:07 pm
  • Mark Callaghan

    This is very nice. I missed this post when you first wrote it. It can be used to provide much better bounds on worst-case query performance (a few disk reads versus thousands) and do a much better job at making queries ‘online’.

    March 15, 2010 at 5:19 pm
  • moses wejuli

    hi zardosht,

    got a few queries for u here..

    1) i know clustering indexes include all columns for the table, however, am keen to know if we can define multi-column clustering indexes for purposes of sorting e.g.
    — create clustering index clstr_key on foo(a,b,c)
    where column c is sorted relative to b, and b sorted relative to a, etc….. is this feasible?!

    2) where does the index sorting happen with TokuDB? On disk or in memory? at what point do we get back to memory during this process?

    3) with the advent of multiple clustering indexes per table with TokuDB, how useful are single column indexes for multi-row, multi-field result sets? seems to me that single column indexes will only be useful for a single row fetches only, if at all!

    3) what are the advantages of defining a clustering index with a column that has been defined as a single row index elsewhere? or will that only be a waste of index?

    4) what is the maximum number of columns that can be defined as part of a single composite or covering index (please specify for both TokuDB and InnoDB).



    December 19, 2011 at 4:26 pm
    • Zardosht Kasheff

      1) Yes, this is feasible. A clustering index can be defined just as a normal index is defined.

      2) Both! For more information, see

      3a) I assume you are asking about the advantages of single column non clustering indexes, such as key(a). Such an index takes less space on disk (so that is an advantage), but covers fewer queries. I don’t think there is an advantage to single row fetches, because if the index is not covering, then this may result in 2 disk seeks, where as a covering index may answer the query in one disk seek.

      3b) I assume you are asking if there is an advantage to having both a clustering key (a) and a key(a). I see no advantage, other than for a small set of queries for which both key(a) and clustering key(a) are covering, key(a) is a little more efficient.

      4) In our releases, the maximum number of columns is 32.

      December 22, 2011 at 6:26 pm
  • just-working

    Hi, I’m not sure where to ask this question. But I’m very interested in this database engine for a high score database. The main index will be user-id,game-id. The secondary index will be game-id,score,timestamp. Would it help if the secondary index is also clustered because then it’s already ordered and it will allow fast retrieval of scores as they are stored in ranking order?

    But that’s what brings me to my next question (my main question). It’s about rank. If I have only the user-id and the game-id of a record, can I find the rank of that user’s score compared to others with the same game-id in O(log n) time? It would use the first index to find the user, then the second index to count the number of records that are higher ranking so that’s two look ups. It seems feasible but it depends on the engine and I don’t know if the engine keeps track of how many nodes there are greater or less than at every node or not. I’m talking about a potentially very large database with many millions of records and hundreds of thousands of records with each game-id, and many of such look ups (maybe 200 of them in a row), so that’s why I am interested in this. 200 O(log n) wouldn’t be so bad if it doesn’t have to perform a count every time.

    April 6, 2014 at 10:21 pm
    • Zardosht Kasheff

      Please email tokudb-user google group. That is the appropriate place for this discussion

      April 7, 2014 at 2:37 am
  • Raghu ram reddy

    How multiple clustered indexes sorts the data?
    If we have one clustered index then the data will be in sorted order.what if have multiple clustered index then how the data is sorted?can anyone explain physical data when we create multiple clustered index??

    November 12, 2014 at 12:25 am
    • Zardosht Kasheff

      Each clustering index has a copy of the data sorted in that index’s order

      November 12, 2014 at 9:33 am
  • Doron Grinstein

    If my primary key is a GUID, isn’t utilizing a clustered primary key index inefficient? I want to use GUIDs so that I can have active-active replication where each active database can insert its own records.

    I see this as a major issue with TokuDB and InnoDB. I would appreciate some words of wisdom from MySql experts.

    August 19, 2015 at 5:24 pm

Comments are closed.

Use Percona's Technical Forum to ask any follow-up questions on this blog topic.