# Covering Indexes: How many indexes do you need?

I’ve recently been blogging about how partitioning is a poor man’s answer to covering indexes. I got the following comment from Jaimie Sirovich:

“There are many environments where you could end up creating N! indices to cover queries for queries against lots of dimensions.”

[Just a note: this is only one of several points he made. I just wanted to dig into this one in some detail. Here goes…]

Although it is, in theory, possible to generate a workload that would take N! indexes, this is not a realistic (or useful) bound (leaving aside that this workload would kill partitioning!). For one thing, it would take N! queries to exercise all those indexes. And the queries would have to include every field in the where clause — as we’ll get into below.

So what is a reasonable bound on the number of covering indexes that would be useful? Not surprisingly, there’s no one answer, so let’s look at a couple of cases.

Before we get started, let’s simplify the discussion by making all indexes CLUSTERING, which means they include all fields, and therefore, we only need to look at the WHERE clause of our queries to figure out what needs covering. More on that later.

### How many indexes does it take to beat partitioning?

When partitions are used to avoid table scans, it’s because the secondary indexes do not cover the queries. All it takes to make them cover whatever query they are used on is to declare them CLUSTERING. Once you do that, they will be faster than using the same indexes with partitions.

So not only do you not need N! indexes, what you need is the same number of indexes! (Here the first ‘!’ is factorial, and the second is emphasis!)

A little detail here: suppose you have partitioned on field T and you have defined an index on (A,B,C). Now get rid of partitioning and define CLUSTERING index (T,A,B,C). Lather, rinse, repeat.

### How many indexes does it take in general to cover your queries?

As noted above, you only need to look at the WHERE clause because making your indexes CLUSTERING takes care of the rest of the fields you might need in your query.

Next we need to consider what rows get examined in a query. Suppose you have a query like:

SELECT * from foo WHERE A = 5 and B = 10 and C < 20 and D = 30;

What do you need to cover this query? It turns out that once you hit an inequality, MySQL stops searching in the index and starts a scan of the index. So that means that a CLUSTERING index on (A,B,C) is all you need. There’s no need for D in the index definition. If you also had:

SELECT * from foo WHERE A = 5 and B > 10 and D = 30;

the CLUSTERING index on (A,B,C) would also work for this query.

So how many indexes do you need? Certainly not N!. Take a look at your most important queries. Consider their WHERE clauses and throw away all the fields after the first inequality test. Throw away any set of fields that’s a prefix of another set. So if you end up with:

(A,B,C)

(A,B)

(C,D,A)

you don’t need the (A,B). The remaining sets of field are your candidates. Each CLUSTERING index is as big as the primary table — that’s what makes them so good for covering lots of different queries! While this may sound like it would take up a lot of space, TokuDB uses very aggressive compression, typically getting 10x-15x compression, so it’s possible to add quite a few such indexes.

### So how many indexes do I need?

How many mission-critical queries do you have? How many of them can be clumped together using CLUSTERING indexes? When talking to customers, the most number of useful indexes I’ve seen is 20. That sounds like a lot for two reasons: size and update time. But the total size should be less than twice the original data size, assuming at least 10x compression. And TokuDB is all about updating indexes fast, even when they don’t fit in memory, so this hasn’t been a complaint either. I’m going to go out on a limb and say that a typical upper bound is 20.

### What if I don’t know what queries I’ll be performing?

In this case, partitioning isn’t the answer. We already beat the performance of partitioning by switching over the indexes to CLUSTERING. What you need in this case is flexibility. When you get a new query type, what you need is TokuDB’s hot indexing, which allows you to add an index with basically no downtime (3 seconds downtime is typical).

But what happens if I always get queries that are unrelated to previous queries and they are mission critical and they are never repeated? Such a hypothetical leads us down the N!-index path. It doesn’t seem that realistic.

### Good but not perfect covering is OK too

Finally, it’s not necessary to tailor your indexes perfectly to your queries. The CLUSTERING index (A,B) covers the query

SELECT * from foo where A = 5 and B = 10 and C = 20 and D = 30;

even though C and D are not mentioned. CLUSTERING indexes cover all queries. It’s all a matter of getting the table into the best possible order for your queries. You can still get benefit from an index that isn’t the perfect order, and that index will cover more queries. It’s a tradeoff.

If you are interested in evaluating TokuDB for deployment and want to know how to proceed, try the following