Quickly finding unused indexes (and estimating their size)

I had a customer recently who needed to reduce their database size on disk quickly without a lot of messy schema redesign and application recoding.  They didn’t want to drop any actual data, and their index usage was fairly high, so we decided to look for unused indexes that could be removed.

Collecting data

It’s quite easy to collect statistics about index usage in Percona Server (and others) using the User Statistics patch.  By enabling ‘userstat_running’ (called simply ‘userstat’ in more recent Percona Server releases), we start to get information in the INFORMATION_SCHEMA.INDEX_STATISTICS table.  This data collection does add some overhead to your running server, but it’s important to leave it running long enough to capture a dataset that represents as much of your workload as possible.
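As a sketch (the variable name differs by version, as noted above), enabling collection at runtime looks like this:

```sql
-- Enable User Statistics collection at runtime (no restart required).
-- On Percona Server 5.5 and later the variable is named 'userstat'.
SET GLOBAL userstat_running = ON;

-- Spot-check that counters are accumulating:
SELECT * FROM INFORMATION_SCHEMA.INDEX_STATISTICS LIMIT 5;
```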

If you miss collecting index stats while some occasional queries run, you risk dropping indexes that are seldom used but still important to the health of your system.  This may or may not affect you, but I’d highly recommend you manually review the generated list of unused indexes before you simply drop any of them.

Depending on your sensitivity to production load, you may therefore want to run this for several days, or just sample a few short windows during your normal production peak.  In either case, you may want to compare or repeat this index analysis, so let’s set up a separate schema for it.  It’s important that this analysis runs on a server with your full production dataset loaded, but that could be the master or just a slave somewhere (just be careful not to break replication!).

If the index statistics are being collected on the same server we’re analyzing, then we can simply snapshot them into our schema with one command:
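A minimal sketch, assuming a working schema named index_analysis and a snapshot table named index_usage (both names are mine, not fixed):

```sql
CREATE SCHEMA IF NOT EXISTS index_analysis;

-- Snapshot the current index usage counters into our working schema
CREATE TABLE index_analysis.index_usage AS
  SELECT * FROM INFORMATION_SCHEMA.INDEX_STATISTICS;
```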

If the stats come from some other server, then you may need to dump and load a copy of that table into your working index_analysis schema.
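One way to do that (host names and credentials here are purely illustrative) is to pull the rows out as tab-separated text and load them on the analysis server:

```shell
# Pull the stats off the remote server as tab-separated rows...
mysql --host=slave1 --user=stats --password --batch --skip-column-names \
    -e 'SELECT * FROM INFORMATION_SCHEMA.INDEX_STATISTICS' > slave1_stats.tsv

# ...then load them into a pre-created table in the working schema
mysql --host=analysis --user=stats --password index_analysis \
    -e "LOAD DATA LOCAL INFILE 'slave1_stats.tsv' INTO TABLE slave_index_usage"
```

The target table needs to exist first with the same columns as INDEX_STATISTICS; LOAD DATA’s default tab-separated format matches the mysql client’s --batch output.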

Merging stats from several servers

In the case of this client, they had a master and several slaves taking read traffic.  The index workload on these two sets of servers was different, and I wanted to make sure I considered the index statistics from both sources.  Be sure you include index stats from all relevant parts of your application; otherwise it’s garbage in, garbage out, and you risk dropping necessary indexes.

To accomplish merging multiple result sets, I gathered statistics from both their master and slave and loaded them into my schema as separate tables.  Then I simply created a view of a UNION DISTINCT of those two tables:
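Assuming the two snapshots landed in tables I named master_index_usage and slave_index_usage (the names are mine), the view might look like this; UNION DISTINCT de-duplicates indexes that were used on both servers:

```sql
CREATE OR REPLACE VIEW index_analysis.all_known_index_usage AS
  SELECT table_schema, table_name, index_name
    FROM index_analysis.master_index_usage
  UNION DISTINCT
  SELECT table_schema, table_name, index_name
    FROM index_analysis.slave_index_usage;
```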

Now I can query the ‘all_known_index_usage’ and see the union of both of those datasets.  This, of course, can be extended to all the datasets you want.

Interpreting the data

So, this is all well and good, but how do we then easily determine which indexes are not being used?  Well, for this we need to go back to the INFORMATION_SCHEMA to get a list of ALL the indexes on the system (or at least in the schemas where we want to consider dropping indexes).  Let’s keep using views so this updates dynamically as our schema changes over time:
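One way to sketch this view is from INFORMATION_SCHEMA.STATISTICS, which has one row per column in each index (hence the GROUP BY), skipping the system schemas:

```sql
CREATE OR REPLACE VIEW index_analysis.all_indexes AS
  SELECT table_schema, table_name, index_name,
         -- non_unique is 0 for PRIMARY/UNIQUE indexes, 1 otherwise;
         -- it is constant per index, so MAX() just collapses the rows
         MAX(non_unique) AS non_unique
    FROM INFORMATION_SCHEMA.STATISTICS
   WHERE table_schema NOT IN
         ('mysql', 'information_schema', 'performance_schema')
   GROUP BY table_schema, table_name, index_name;
```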

Now I can query this view to see my indexes:
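For example, assuming the view lives in our index_analysis working schema:

```sql
SELECT * FROM index_analysis.all_indexes
 ORDER BY table_schema, table_name
 LIMIT 10;
```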

Now I need a way to find the set of indexes in all_indexes, but not in used_indexes.  These indexes (if our original index statistics are good) are candidates to be dropped:
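A sketch of that set difference as an anti-join, treating the all_known_index_usage view as our record of used indexes (the unused_indexes name and the assumption that all_indexes exposes STATISTICS.non_unique are mine):

```sql
CREATE OR REPLACE VIEW index_analysis.unused_indexes AS
  SELECT a.table_schema, a.table_name, a.index_name
    FROM index_analysis.all_indexes a
    LEFT JOIN index_analysis.all_known_index_usage u
      ON a.table_schema = u.table_schema
     AND a.table_name   = u.table_name
     AND a.index_name   = u.index_name
   WHERE u.index_name IS NULL         -- never seen in the usage stats
     AND a.index_name != 'PRIMARY'    -- keep primary keys
     AND a.non_unique = 1;            -- keep UNIQUE constraints
```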

Note that we also want to avoid dropping PRIMARY and UNIQUE indexes since those tend to enforce important application data constraints, so we added some additional criteria to the end of our SELECT.

I can now select my droppable (unused) indexes from this view:
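Continuing with my hypothetical view name from above:

```sql
SELECT table_schema, table_name, index_name
  FROM index_analysis.unused_indexes
 ORDER BY table_schema, table_name;
```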

From here I can use some clever SQL to generate the precise ALTER TABLE statements to drop these indexes, an exercise left to the reader. 🙂

Estimating the size of these indexes

But, what if we want to see if it’s worth doing first?  Do these indexes actually represent a significant enough amount of disk space for it to be worth our while?

We need some more information to answer this question, but fortunately in Percona Server we have it in the INFORMATION_SCHEMA.INNODB_INDEX_STATS table and its ‘index_total_pages’ column.  A page in InnoDB is (usually) 16KB, so some simple math should tell us how much disk space an index occupies.

Let’s go update our all_indexes view to include this information:
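A sketch of the updated view; note that the INNODB_INDEX_STATS join columns used here (table_schema, table_name, index_name) are my assumption from that table’s layout in Percona Server of this era, so check them against your version:

```sql
CREATE OR REPLACE VIEW index_analysis.all_indexes AS
  SELECT s.table_schema, s.table_name, s.index_name,
         MAX(s.non_unique) AS non_unique,
         -- index_total_pages * 16KB page size = approximate on-disk size
         MAX(i.index_total_pages) * 16 * 1024 AS index_size_bytes
    FROM INFORMATION_SCHEMA.STATISTICS s
    LEFT JOIN INFORMATION_SCHEMA.INNODB_INDEX_STATS i
      ON s.table_schema = i.table_schema
     AND s.table_name   = i.table_name
     AND s.index_name   = i.index_name
   WHERE s.table_schema NOT IN
         ('mysql', 'information_schema', 'performance_schema')
   GROUP BY s.table_schema, s.table_name, s.index_name;
```

With this in place, summing index_size_bytes over the unused-index candidates gives a rough estimate of how much disk space dropping them would reclaim.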