I wanted to share a little rule of thumb I sometimes use to choosing column order in indexes. This is not specific to MySQL, it’s generally applicable to any database server with b-tree indexes. And there are a bunch of subtleties, but I will also ignore those for the sake of simplicity.
Let’s start with this query, which returns zero rows but does a full table scan. EXPLAIN says there are no possible_keys.
|
1 2 3 |
SELECT * FROM tbl WHERE status='waiting' AND source='twitter' AND no_send_before <= '2009-05-28 03:17:50' AND tries <= 20 ORDER BY date ASC LIMIT 1; |
Don’t try to figure out the meaning of the query, because that’ll add complexity to the example 😉 In the simplest case, we want to put the most selective column first in the index, so that the number of possible matching rows is the smallest, i.e. we find the rows as quickly as possible. Assuming that all the columns have an even distribution of values, we can just count the number of matching rows for each criterion.
|
1 2 3 4 5 6 7 8 9 |
select sum(status='waiting'), sum(source='twitter'), sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*) from tblG *************************** 1. row *************************** sum(status ='waiting'): 550 sum(source='twitter'): 37271 sum(no_send_before <= '2009-05-28 03:17:50'): 36975 sum(tries <= 20): 36569 count(*): 37271 |
This is pretty simple — all I did was wrap each clause in a SUM() function, which in MySQL is equivalent to COUNT(number_of_times_this_is_true). It looks like the most selective criterion is “status=waiting”. Let’s put that column first in the index. Now, pull it out of the SELECT list and put it into the WHERE clause, and run the query again to get numbers within the subset of rows that match:
|
1 2 3 4 5 6 7 8 |
select sum(source='twitter'), sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*) from tbl where status='waiting'G *************************** 1. row *************************** sum(source='twitter'): 549 sum(no_send_before <= '2009-05-28 03:17:50'): 255 sum(tries <= 20): 294 count(*): 549 |
So we’re down to a reasonable number of rows (the count() is changing because I’m running this on live data, by the way). It looks like the ‘source’ is no more selective, that is, it won’t filter out any more rows within this set. So adding it to the index would not be useful. We can filter this set further by either the ‘no_send_before’ or the ‘tries’ column. Doing so on either will reduce the count of matches for the other to zero:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
select sum(source='twitter'), sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*) from tbl where status='waiting' and no_send_before <= '2009-05-28 03:17:50'G *************************** 1. row *************************** sum(source='twitter'): 255 sum(no_send_before <= '2009-05-28 03:17:50'): 255 sum(tries <= 20): 0 count(*): 255 select sum(source='twitter'), sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*) from tbl where status='waiting' and tries <= 20G *************************** 1. row *************************** sum(source='twitter'): 294 sum(no_send_before <= '2009-05-28 03:17:50'): 0 sum(tries <= 20): 294 count(*): 294 |
That means we can add an index on either of (status, tries) or (status,no_send_before) and we will find the zero rows pretty efficiently. Which is better depends on what this table is really used for, which is a question I’m avoiding.