November 26, 2014

Using UNION to implement loose index scan in MySQL

One little known fact about MySQL Indexing, however very important for successfull MySQL Performance Optimization is understanding when exactly MySQL is going to use index and how it is going to do them.

So if you have table people with KEY(age,zip) and you will run query something like
SELECT name FROM people WHERE age BETWEEN 18 AND 20 AND zip IN (12345,12346, 12347) do you think it will use index effectively ? In theory it could – it could look at each of the ages from the range and look at all zip codes supplied. In practice – it will not:

As you see instead only first index keypart is used (key_len is 4) and zip part where clause is applied after rows are retrived. Notice Using Where. There are even more bad news. Full rows will need to be read to check if zip is in the list, while it could be done only by reading data from the index. MySQL can ether read index only for all rows, in this case you will see “Using Index” in EXPLAIN output or it will read row data for all rows – it can’t read Index and perform row read only if it needs to be done at this point.

So MySQL Will not use indexes in all cases when it is technically possible. For multiple key part indexes MySQL will only be able to use multiple keyparts if first keyparts matched with “=”. Here is example:

Note number of rows has decreased from 90556 to 3, whle “key_len” remains the same. This however looks like a bug in the MySQL 5.0.18 I’m using for this demo. It should have had increased to 8.

Lets see how query times differ in these cases:

As you see difference is tremendous. And it is not what you would intuitively expect – why range which covers 5 rows is hundreds of times slower than single row ? If MySQL Optimizer would handle this case right it would not be but in this case we only can give a hand to MySQL Optimizer and change the query so it can handle it well…. use UNION:

Ethen though this query looks much more complicated MySQL is able to execute it much faster, delivering us expected performance.

You can also use this approach when first key column is not in where clause at all if it has just few values. For example if we would have gender instead of age with just two possible values it would be faster to run such query with union. I bet it would even be so with age even if it would take some 100 queries in the union to do so.

This strategy is best applied if no others work well. Ie if there are range on both keyparts and none of them is selective enough by itself. For example if we would like to only lookup people within single zip I would advice to use index in (zip,age) instead of using this workaround.

And… yes this example is a bit artificial. You would probably use date (or at least year) or birth instead of age, and put zip as first column in the index as it is more selective but it is good enough for illustrative purposes :)

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Michael Otto says:

    How does SELECT name FROM people WHERE age in (18,19,20,21,22) AND zip IN (12345,12346, 12347); perform here?

  2. Dmitry says:

    Talking about MySQL 5 – what if you had two indices on age and zip separately, how would it look like in explain?

  3. peter says:

    Michael,

    You’re right. This problem seems to be partially fixed and IN works for this case as well as union

    mysql> explain SELECT name FROM people WHERE age in (18,19,20,21,22) AND zip IN (12345,12346, 12347);
    +—-+————-+——–+——-+—————+——+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+——–+——-+—————+——+———+——+——+————-+
    | 1 | SIMPLE | people | range | age | age | 4 | NULL | 16 | Using where |
    +—-+————-+——–+——-+—————+——+———+——+——+————-+
    1 row in set (0.16 sec)

    mysql> SELECT name FROM people WHERE age in (18,19,20,21,22) AND zip IN (12345,12346, 12347);
    +———————————-+
    | name |
    +———————————-+
    | ed4481336eb9adca222fd404fa15658e |
    | 888ba838661aff00bbbce114a2a22423 |
    +———————————-+
    2 rows in set (0.05 sec)

    I actually ment a bit different case but I thought I should simplify it which turned to be not the best way around.
    I’ll post one more post in couple of days to show where union will be the only this which can help you :)

  4. peter says:

    Dmitry,

    This is not the case where index merge can apply:

    mysql> explain SELECT SQL_NO_CACHE name FROM people WHERE age BETWEEN 18 AND 22 AND zip IN (12345,12346, 12347);
    +—-+————-+——–+——-+—————+——+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+——–+——-+—————+——+———+——+——+————-+
    | 1 | SIMPLE | people | range | age,zip | zip | 3 | NULL | 19 | Using where |
    +—-+————-+——–+——-+—————+——+———+——+——+————-+
    1 row in set (0.00 sec)

    As you see it can use only one of the indexes it selects to use zip as it is more selective;
    As corelation between indexes is not known it is hard to tell what will be faster to use one of indexes only or
    to intersect row pointers from both indexes.

    If you would need OR in this case Index merge would apply.

    mysql> explain SELECT SQL_NO_CACHE name FROM people WHERE age BETWEEN 18 AND 22 OR zip IN (12345,12346, 12347);
    +—-+————-+——–+————-+—————+———+———+——+——-+—————————————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+——–+————-+—————+———+———+——+——-+—————————————-+
    | 1 | SIMPLE | people | index_merge | age,zip | age,zip | 1,3 | NULL | 38821 | Using sort_union(age,zip); Using where |
    +—-+————-+——–+————-+—————+———+———+——+——-+—————————————-+
    1 row in set (0.00 sec)

  5. Yuan WANG says:

    Hi peter:

    Sorry to disturb you but I don’t quite understand the big performance difference between the “age=19″ query and the “age BETWEEN 18 AND 22 AND” query. According to the output of the EXPLAINs, I think the process of executing the “age=19″ query is:
    1. Perform an index scan of KEY(age, zip), with scankey “age=19″. For key_len is 4, the condition on zip is useless in this step;
    2. For each rows returned by the index scan above, retreive that row from the table and test where it’s zip is in (12345, 12346, 12347)
    And I think the process of executing the “age BETWEEN 18 AND 22 AND” query is:
    1. Perform an index range scan of KEY(age, zip), start with “age=18″ and end with “age=22″. For key_len is 4, the condition on zip is useless in this step;
    2. For each rows returned by the index scan above, retreive that row from the table and test where it’s zip is in (12345, 12346, 12347)
    So the main difference should be the first step. If the data has a uniform distribution, number of rows checked and returned by the first step of the “age=19″ query should be about 1/5 of the “age BETWEEN 18 AND 22 AND” qeury, so would be the total cost.

    However, the difference of the real cost is much more tremendous. So chould you be so kind to give some more words about the reason of the performance difference?

  6. peter says:

    Yuan WANG,

    If query with where clause “age=19 and zip in (…)” is executed what really is performed is sequence of lookups
    age=19 and zip=const. Key length=4 is missleading in this case. It seems to be EXPLAIN bug in this case.

    For query with “age between 19 and 22 and zip in (…)” what you have described is happening – index range lookup is performend on age condition and zip condition only checked after row is read.

    You can use FLUSH STATUS; SHOW STATUS; and check Handler_XXX stats to see how query execution was performed.

  7. Apachez says:

    Is this a MySQL Optimizer bug (which we could expect to be fixed in near future) or is this “by design” for the next couple of years ?

    I mean the UNION is a nice workaround but needs more code and logic where just writing a “WHERE col BETWEEN 19 AND 22″ is so much easier :-)

  8. James Day says:

    Apachez, one of many areas where it’s known that the optimizer can be improved. They are gradually being done, no particular schedule for when each one is done. Other than more slowly than we’d all like. :)

    James Day
    Support Engineer, MySQL AB

  9. Mark says:

    Its 3 years later, has this been fixed? If so, in what version of MySQL?

    Thanks!

  10. Jindra says:

    Thanks, helped me to optimize query with agregate function on table with multiple key:

    table:
    Create table tabData (
    channel_id Int NOT NULL,
    date_time Datetime NOT NULL,
    value Double NOT NULL,
    Primary Key (channel_id, date_time)
    );

    too slow (more then a few seconds when thousands of records):

    select min(date_time) from tabData where id_virtual_channel in (1, 2, 3)

    optimized – query returns immediately:

    select min(m)
    from (
    select min(date_time) as m from tabData where channel_id = 1
    union
    select min(date_time) from tabData where channel_id = 2
    union
    select min(date_time) from tabData where channel_id = 3
    ) as min_datetimes

  11. Spyd says:

    Hey, i found this page while googling for more information after i implemented something exactly like this! I thought i had done something “dirty” but now i’m happy that this “tricky” method is suggested by you! My queries went from 0,6 to 0,01 :) :) :)

  12. You know, for kids! says:

    In
    mysql> SELECT name FROM people WHERE age=18 AND zip IN (12345,12346, 12347)
    -> UNION ALL
    -> SELECT name FROM people WHERE age=19 AND zip IN (12345,12346, 12347)
    -> UNION ALL
    -> SELECT name FROM people WHERE age=20 AND zip IN (12345,12346, 12347)
    -> UNION ALL
    -> SELECT name FROM people WHERE age=21 AND zip IN (12345,12346, 12347)
    -> UNION ALL
    -> SELECT name FROM people WHERE age=22 AND zip IN (12345,12346, 12347);
    we have 5 different ages. What if we have 20? Would we have to manually enter 15 additional unions?

Speak Your Mind

*