September 1, 2014

Tools and Techniques for Index Design Webinar Questions Followup

I presented a webinar this week to give an overview of Tools and Techniques for Index Design. Even if you missed the webinar, you can register for it, and you’ll be emailed a link to the recording.

I’d like to invite folks who are interested in tools for query optimization to attend the new Percona online virtual training Analyzing SQL Queries with Percona Toolkit, October 29 – November 2, 2012. See http://store.percona.com/catalog.php?category=17 for details on this training class and how to register to attend.

During my webinar, I got a number of good questions. Here are the questions (paraphrased), with my answers.

Q: Can you link to the book on index optimization you mentioned in the webinar?

Sure!  The book is Relational Database Index Design and the Optimizers by Tapio Lahdenmäki and Michael Leach.

Also Baron Schwartz has written a good review of this book.

Q: Given that the book was written in 2005, before MySQL 5.1 and 5.5 were released, is it still relevant?

That book is designed to be RDBMS-neutral, and the concepts it covers apply to other products besides MySQL. The logical principles still hold true: using an index really means leveraging a pre-sorted data structure to reduce the work a query needs to do.  The basic data structure of an index hasn’t changed in recent releases.  So yes, the principles of this book are still relevant to current versions of MySQL.

Q: How do the upcoming changes in MySQL 5.6 change the best practices for index design?

Fundamentally, index design best practices will not change.  MySQL 5.6 is still in beta as of this writing, but the new version promises some exciting new ways that MySQL can use indexes, such as:

Q: Is it always the case that no second-star index exists for a query with both a GROUP BY and an ORDER BY clause?

Right; you’re stuck, unless your query groups by the same column as its sort column. You can optimize by adding the grouping column to your index, so all rows in a given group are read together. But doing so fixed the order in which rows are accessed, so if you need to sort by another column, it must collect the interim result set and sort it as a separate step. This is when we see “Using temporary; Using filesort” in the EXPLAIN report.

Hopefully, the size of this interim result set is relatively small, after the GROUP BY reduces it to one row per distinct value in the grouping column. Then sorting it may be more likely to happen in memory.

Q: Can you comment on the QUBE method for index estimation?

QUBE stands for Quick Upper-Bound Estimate. This is a method for estimating the theoretical cost of a query, by counting how many lookups to index entries or table rows will be necessary for a given query in the worst case. This method tries to quantify the cost of a query, but ultimately it arrives at similar conclusions as the star system: define an index with all columns in the predicate, add columns to avoid sorting per query, and add further columns to cover the select-list. The difference is that QUBE helps you to predict quantitatively how much a new index is likely to improve the query.

Q: Could you provide some best practices for partitioning and its effect on index design?

The most important consideration is that “every unique key on the table must use every column in the table’s partitioning expression.”  It might even be impossible to partition a table while using UNIQUE KEY constraints the way that makes most sense for your table.

Logically, your index design will be the same with a partitioned table as with a non-partitioned table. But since each partition is physically stored as a separate table, each index contains only values of its respective subset of the data. The index is more shallow and quicker to search. If your query has predicates that allow partition pruning to reduce the scope of the search, the index search will be even more efficient than if you were searching for the same rows in an non-partitioned table.

Thanks to folks who attended my webinar, and thanks for the questions!

About Bill Karwin

Bill Karwin has been a software professional for over 20 years. He's helped thousands of developers with SQL technology. Bill authored the book "SQL Antipatterns," collecting frequent blunders and showing better solutions.

Comments

  1. Elaine says:

    This was a very interesting webinar; thank you very much for making it available. I feel just a little bit smarter.

    What is the best way to handle a situation where we have dozens of different kinds of queries against a table? How do I work out the tradeoff between the more versatile single column indexes versus having multiple multi-column indexes that are tailored for each query type? At what point do I need to say, there are too many indexes that are too large and they’re eating into my memory pool?

  2. Hi Elaine,

    When I do an audit of queries for a site, I use pt-query-digest to generate a ranked list of queries. I start at the top of the list of queries, and analyze what index can I create to be optimal for the query that accounts for the most total database response time. Then do the same for the second query, and so on.

    I find that the new indexes I create are good for multiple query types. For example, I get to query #6 and I see that it’s got a different fingerprint, but the best index for query #2 is also the best index for query #6. This repeats itself as I go down the list, so I might only create three new indexes for the top 10 queries.

    It’s also common to find cases where an index seems like it has more columns than it needs for query #2, but it does help that query, and the additional columns are important for query #6. So one index serves both queries effectively, even if it’s not the minimal index needed for one of them.

    Once I get past the top queries, the remaining queries usually each account for less than 1% of the load on the database. I probably won’t add new indexes specifically for those queries, because it would have less benefit to the total server load.

    Suppose I can speed up query #27 by 80% if I create a new index. But if that query is run so infrequently that it even without that index the query only accounts for 0.3% of the total database response time, I’m not getting much benefit out of that new index relative to the total server load. I’d rather create the indexes that help queries ranked higher.

    So start at the top of the ranked list, try to find indexes that benefit more than one case, and stop adding indexes when it seems like they don’t offer much benefit anymore.

  3. Elaine says:

    Thank you so much for that answer. It was very helpful.

    As I am playing with some of these strategies, it is obvious to me that the order of the columns in the index matters. When I am not certain, testing with my real queries helps to make my decisions. In general, I am starting with the strategy to index WHERE columns, then ORDER BY columns, and then SELECT columns in left-to-right order when trying to get an index that will hit all three sweet spots.

  4. Elaine says:

    My application has quite a few join tables where we use a compound primary key. At this point, is there a performance difference between a table with a (mostly unused) single id field as a primary key and then a unique compound index on the columns we use for queries? IE, does PRIMARY confer any speed or performance advantages over any other index that is unique and NOT NULL?

  5. Hi Elaine,

    Yes, in the case of InnoDB (which should be your default choice of storage engine), a primary key index is called a “clustered index” which means the table’s data is literally stored as the primary key. Whereas any secondary key (unique or otherwise) is stored separately. Any lookup by secondary key potentially does two searches: first a lookup in the secondary index, and then if the query needs other columns than those in the index, it needs to do a second lookup in the clustered index to fetch those other columns.

    For this reason, there’s an advantage to designing the clustered index (i.e. the primary key) as the columns that you search on most frequently. In the case when you have a candidate key (your unique/not null index), that may be better to use as the primary key than having a pseudokey.

    But this choice, like any optimization, depends on the load you want to optimize for in your application.

Speak Your Mind

*