Where the open source database community meets: Use code PERCONA75 and secure your spot for Percona Live. Register

Downloads

Blog

Choosing Column Order in Indexes

June 5, 2009

Author

Baron Schwartz

Insight for DBAs

Share this Post:

I wanted to share a little rule of thumb I sometimes use to choosing column order in indexes. This is not specific to MySQL, it’s generally applicable to any database server with b-tree indexes. And there are a bunch of subtleties, but I will also ignore those for the sake of simplicity.

Choosing Column Order

Let’s start with this query, which returns zero rows but does a full table scan. EXPLAIN says there are no possible_keys.

SELECT * FROM tbl WHERE status='waiting' AND source='twitter'
 AND no_send_before <= '2009-05-28 03:17:50' AND tries <= 20
 ORDER BY date ASC LIMIT 1;

SELECT * FROM tbl WHERE status='waiting' AND source='twitter'

AND no_send_before <= '2009-05-28 03:17:50' AND tries <= 20

ORDER BY date ASC LIMIT 1;

Don’t try to figure out the meaning of the query, because that’ll add complexity to the example 😉 In the simplest case, we want to put the most selective column first in the index, so that the number of possible matching rows is the smallest, i.e. we find the rows as quickly as possible. Assuming that all the columns have an even distribution of values, we can just count the number of matching rows for each criterion.

select sum(status='waiting'), sum(source='twitter'),
 sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)
 from tblG
 *************************** 1. row ***************************
                       sum(status ='waiting'): 550
                        sum(source='twitter'): 37271
 sum(no_send_before <= '2009-05-28 03:17:50'): 36975
                             sum(tries <= 20): 36569
                                     count(*): 37271

select sum(status='waiting'), sum(source='twitter'),

sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)

from tblG

*************************** 1. row ***************************

sum(status ='waiting'): 550

sum(source='twitter'): 37271

sum(no_send_before <= '2009-05-28 03:17:50'): 36975

sum(tries <= 20): 36569

count(*): 37271

This is pretty simple — all I did was wrap each clause in a SUM() function, which in MySQL is equivalent to COUNT(number_of_times_this_is_true). It looks like the most selective criterion is “status=waiting”. Let’s put that column first in the index. Now, pull it out of the SELECT list and put it into the WHERE clause, and run the query again to get numbers within the subset of rows that match:

select sum(source='twitter'),
 sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)
 from tbl where status='waiting'G
 *************************** 1. row ***************************
                        sum(source='twitter'): 549
 sum(no_send_before <= '2009-05-28 03:17:50'): 255
                             sum(tries <= 20): 294
                                     count(*): 549

select sum(source='twitter'),

sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)

from tbl where status='waiting'G

*************************** 1. row ***************************

sum(source='twitter'): 549

sum(no_send_before <= '2009-05-28 03:17:50'): 255

sum(tries <= 20): 294

count(*): 549

So we’re down to a reasonable number of rows (the count() is changing because I’m running this on live data, by the way). It looks like the ‘source’ is no more selective, that is, it won’t filter out any more rows within this set. So adding it to the index would not be useful. We can filter this set further by either the ‘no_send_before’ or the ‘tries’ column. Doing so on either will reduce the count of matches for the other to zero:

select sum(source='twitter'),
 sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)
 from tbl where status='waiting' and no_send_before
 <= '2009-05-28 03:17:50'G
 *************************** 1. row ***************************
                        sum(source='twitter'): 255
 sum(no_send_before <= '2009-05-28 03:17:50'): 255
                             sum(tries <= 20): 0
                                     count(*): 255

select sum(source='twitter'),
 sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)
 from tbl where status='waiting' and tries <= 20G
 *************************** 1. row ***************************
                        sum(source='twitter'): 294
 sum(no_send_before <= '2009-05-28 03:17:50'): 0
                             sum(tries <= 20): 294
                                     count(*): 294

select sum(source='twitter'),

sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)

from tbl where status='waiting' and no_send_before

<= '2009-05-28 03:17:50'G

*************************** 1. row ***************************

sum(source='twitter'): 255

sum(no_send_before <= '2009-05-28 03:17:50'): 255

sum(tries <= 20): 0

count(*): 255

select sum(source='twitter'),

sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)

from tbl where status='waiting' and tries <= 20G

*************************** 1. row ***************************

sum(source='twitter'): 294

sum(no_send_before <= '2009-05-28 03:17:50'): 0

sum(tries <= 20): 294

count(*): 294

That means we can add an index on either of (status, tries) or (status,no_send_before) and we will find the zero rows pretty efficiently. Which is better depends on what this table is really used for, which is a question I’m avoiding.

0 0 votes

Article Rating

18 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Cary Millsap

16 years ago

Baron, if you haven’t seen “Relational Database Index Design and the Optimizers” by Tapio LahdenmÃ¤ki and Michael Leach, you should find a copy.

â€”Cary

Author

Baron Schwartz

16 years ago

Cary: thanks for the recommendation I’m adding it to my to-buy list for my next book order. For the benefit of the readers, if there’s some deadly sin I’m committing in my simplified example above (aside from all the assumptions I’ve sidestepped), would you please let me know?

Nicolae Namolovan

16 years ago

That’s a cool technique, thanks for sharing 😉

Roman Petrichev

16 years ago

Live data can be changed dramatically within few minutes and previously looked efficient compound index can easily turn quite inefficient for the same query but for changed data set. So such technique is more suitable for immutable tables IMHO.

Rob Wultsch

16 years ago

I have done similar things in the past though I have mostly looked at cardinality as I assumed this would be a better general use case. (note: I think this is a foolish assumption when most queries are looking for the same values…)

I have wondered if there should be some account of the size of the columns in place. For example lets say a int is slightly less selective than a varchar(50). I would think that the smaller datatype should be somewhat preferred as a smaller index would be created and I imagine traversed more quickly.

Also I think worth note would be the effect of range type restrictions for user that are unfamiliar with their effects on queries that might make use of composite indexes.

Any thoughts?

Anthony Linton

16 years ago

Thanks the post, a great help. A lot of the ones recently have seemed to be at a more complex level, which isn’t so useful to me =)

Author

Baron Schwartz

16 years ago

Rob, it’s a good point about size of the column. This was meant to be kind of an all-things-being-equal example. I didn’t want to get into all the subtleties of range criteria. That’s a big mess to explain clearly 🙂

Admin

Peter Zaitsev

16 years ago

Baron,

I think it also makes sense to look into things in the cost/benefit way. Extending the index makes it larger. In your case adding extra column which halves amount of rows to be traversed for the query probably makes sense. In some other cases when it is only 10% gain it does not make sense.

Another thing which often make sense to consider is covering index potential in particular when trying to target several query patterns with single indexes.

Covering index slows down queries which can’t use it as covering index because it is longer but speeds up queries which can use covering index dramatically in particular in IO bound scenarios.

ibrahim oguz

16 years ago

is it possible that using 2 server one for inserting data and the other one for reporting(making select queries) first server replicate to second server but in first one there is no index but in second server there are indexes.
because indexes make slow when you inserting,updating or deleting data but it makes fast when you select data.
how can i reorganize index when replicate data.

thanks

Stefan Stubbe

16 years ago

Wouldn’t it be better to check the cardinality of the columns instead of the selectivity of specific values ?
When you want to select the rows with another STATUS, another column order in the index might be better.

Author

Baron Schwartz

16 years ago

Right. That’s another half-written blog post from long ago, and as people are asking about all the nuances around this I’m regretting trying to simplify things too much. There are a dozen things to think about, though — cardinality has skew, queries use certain values more often than others, etc etc. Ultimately you really need statistical information about the values in the table and the values that are used in WHERE clauses. In the example, the query came from a real application that queries for twitter/waiting FAR more often than anything else, by the way.

Rob Wultsch

16 years ago

For whatever it is worth I regret my comment. It is easy to be a critic (which is a hair away from what I did of asking questions I knew were hard to answer).

This technique is very good for beginners and a very good option for dealing with an unfamiliar dbms. One way or another seat time and understanding whatever optimizer you are dealing with can not be easily replaced with a single blog post. Understanding the optimizer is really key and that can not be easily summarized.

It is a pipe dream for many (most?) databases to see a process list that evenly distributes queries to hit all portions of tables evenly. Much more likely is getting hit with many queries of the same form (possibly exactly the same) on a table that changes just enough for query cache to be a bad idea. In this case the above technique with some knowledge of nuances is I think very useful…

Thank you for your time.

John Larsen

16 years ago

I ran into an interesting multi key issue last week, wondering if anyone had any suggestions.
Table sent_items looked like
id int(10) autoincrement unsigned
sender_id int(11) unsigned
receiver_id int(11) unsigned
item_id tinyint(3)
sent_time int(11) (stored unix timestamp, stamped by application not server)
in InnoDB
This issue started popping up somewhere around 20 million rows in the table, possibly earlier though we didn’t notice.
There was a single key on receiver_id and a combo key on (sender_id, sent_time) due to a frequent query
to get the 50 most recently received items.
Average use had about a dozen items sent.
select * from sent_items where sender_id = ‘12345678’ limit 50;
would give 12 rows in 0.5 seconds under heavy server load.
select * from sent_items where sender_id = ‘12345678’ ORDER BY sent_time DESC limit 50;
would give the same 12 rows in 5 full seconds under heavy load.
Both queries would be nearly instant under light server load.
A query on
select * from sent_items where receiver_id = ‘12345678’ ORDER BY sent_time DESC limit 50;
With or without ORDER would be 0.5 seconds under load.
Now the sender_id has a very different cardinality than the receiver_id and records are generally grouped by a sender as a person can send many items to a number of different receivers in an instant.
Going to try changing the multi_key to a single key later, but very strange results. Wondering if you have any ideas?

Nikolay

15 years ago

I knew this but I love the way you describe it.

Dawn Green

16 years ago

What a fantastic article worthy of my retweet. Thank you!

praji venu

11 years ago

i thought even the compound indexes are single btree. so the column order in the index should not matter if all columns of the index are present in the where clause…

Yzmir Ramirez

8 years ago

In Cary Millsap comment above he recommended “Relational Database Index Design and the Optimizers” by Tapio Lahdenm and Michael Leach, but that was back in 2009. Is that book still good to reference?