October 31, 2014

Multi-Column IN clause – Unexpected MySQL Issue

We have an application which stores massive amount of urls. To save on indexes instead of using URL we index CRC32 of the URL which allows to find matching urls quickly. There is a bit of chance there would be some false positives but these are filtered out after reading the data so it works all pretty well.

If we just process urls one by one it works great:

Handling URLs one by one is however not efficient if you’re processing millions of them so we tried to do bulk fetches:

As you can see just using multiple column IN makes MySQL to pick doing full table scan in this case, even though the cardinality on the first column is almost perfect. I did some more testing and it looks like a bug or missing optimizer feature.

I should not be surprised though as multi-column in is not the most used MySQL feature out there.

For given application case we could simply rewrite query using more standard single column IN clause:

Theoretically speaking this query is not equivalent to the first one – because row having url_crc=2752937066 and url=’http://www.coxandforkum.com/’ would match it, while it should not. It however does not happen in our case as url_crc is functionally dependent on url so both queries are equivalent.

So we’ve got our work around and can forget about the issue and MySQL team gets yet another bug to deal with.
What worries me again is – this is very simple case which seems to to be generally broken which raises a question how good coverage MySQL tests have.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Petya says:

    How much faster this works than more common idea
    of index on url column? Did you make any tests?

  2. Mark Robson says:

    Perhaps the optimiser could do a better job if you used the more conventional

    WHERE (url_crc=12345 AND url=’http://something’) OR (url_crc=56789 AND url=’http://else’)

    OR

    Use a derived table, fetch the rows with the right url_crc first:

    SELECT whatever FROM (SELECT cols FROM tbl WHERE url_crc IN (12345,56789, … )) WHERE (url_crc,url) IN ( … )

    Just an idea.

    Mark

  3. tom says:

    just avoid to put AND url=’http://something’ in your query; it’s quite useless as there are very few false positive and it’s efficient to handle them in the application logic

  4. Xaprb says:

    Peter, it’s not quite the same thing, but see also

    http://dev.mysql.com/doc/refman/5.0/en/row-subqueries.html

    I have a draft post about 2 years old on this topic, warning people about this :-) I thought such row constructors were never optimized, though the manual says that 5.0.26 and newer does optimize this (I have not tested). Your bug report says you see this on 5.0.54, so I guess the optimizer is filling in this functionality slowly.

  5. peter says:

    Petya,

    It is hard to tell how much faster – it depends on a lot of questions, data size kind of lookups. If you have few duplicates so lookups mostly return nothing it can be many times faster on large data sets – because small index is much easier to fit in memory.

    For lookups which match the data you get space savings up to 1.5-1.9 times depending on other columns you have in the table, which improves you cache hit ratio, plus if you will be able to fun full index in memory you will need to do 1 IO instead of 2 IO for random lookups on large data sets.

  6. peter says:

    Mark,
    Indeed you can use AND OR variant of this query and it uses index properly:

    mysql> EXPLAIN SELECT url FROM 106pages.106pages WHERE (url_crc=2752937066 and url=’http://members.aye.net/~gharris/blog/’) OR (url_crc=3799762538 and url=’http://www.coxandforkum.com/’);
    +—-+————-+———-+——-+—————+———+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+———-+——-+—————+———+———+——+——+————-+
    | 1 | SIMPLE | 106pages | range | url_crc | url_crc | 4 | NULL | 2 | Using where |
    +—-+————-+———-+——-+—————+———+———+——+——+————-+
    1 row in set (0.00 sec)

    This is especially fun as it is different form of same stuff. Though I guess it triggers other execution path and is not optimized well.

    The thing I do not really like about this query form is more cluttered and is harder on the parser. Plus INs are optimized better by sorting IDs (that is for single column case).

  7. peter says:

    Tom,

    Sure you can make post filtering in your application but especially if you’re using scripting language this will take more CPU time plus makes things less clear. If there is no significant performance gains it is easier to write the code so you can just replace the queries on their more optimal variants.

  8. peter says:

    Thanks Baron,

    Indeed this is not the most commonly used feature so I guess this is why it is not getting too many optimization team focus. I see the bug Mark reported scheduled for 6.0 so I would not expect quick fixes either.

  9. tom says:

    Hello Peter, I agree in part with you; this is the same as to use or not stored procedures :)
    is it better to have the logic in mysql or in the application? I think the the answer is “depends”, as perhaps it’s in this case.
    cpu time spent in scripting languages is usually worse then cpu time spent in mysql; but if using odd queries avoid correct using of indexes or incur in more disk seekes perhaps it’s better application logic postprocessing.

    As in many other cases the best things is to make tests against your own data

  10. Ruslan Zakirov says:

    I think that at some number prepared statement will be faster than bulk.

    May be UNION ALL is more native representation.

    Also, wonder why mysql even doesn’t consider index merge.

  11. peter says:

    Tom,

    In this case it is easy enough to get the query which works. There are many cases when you can’t get MySQL to do what you want – consider for example bunch of subqueries, in these cases I would handle thing on the application – ie do SELECT create IN list and generate second query.

  12. peter says:

    Ruslan,

    I do not think prepared statements will be faster if you’re doing one by one checks – the savings on query parsing are much smaller than loss by having many roundtrips.

    If you would need to check 1000000 urls such a way you can use prepared statements together with batches and check them by 1000 or something like that. This likely would be the most efficient approach.

  13. Hi Peter,

    I know it shouldn’t matter, but have you tried:

    EXPLAIN SELECT url FROM 124pages.124pages WHERE (url_crc,url) = (484036220,’http://www.dell.com/’);

    to see if it is due to IN or rather due to multiple columns?

    If there is a difference, it seems likely that it is a bug, which may turn out to be trivial to fix.

  14. peter says:

    Roland,

    It has to do with IN, your query works fine.

    mysql> EXPLAIN SELECT url FROM 124pages.124pages WHERE (url_crc,url) = (484036220,’http://www.dell.com/’);
    +—-+————-+———-+——+—————+———+———+——-+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+———-+——+—————+———+———+——-+——+————-+
    | 1 | SIMPLE | 124pages | ref | url_crc | url_crc | 4 | const | 1 | Using where |
    +—-+————-+———-+——+—————+———+———+——-+——+————-+
    1 row in set (0.03 sec)

  15. ries says:

    I am sorry but what is the use on having an CRC in your table?? You could only really figure out IF a URL exists, and only with some certainty so you need to check your result set . USE a SP for that, don’t even think to do this on the application level.

    What if your boss comes to you and asks you : hey dude, how many URL’s do you have in your database that points to the website http://members.aye.net/

    For sure you HAVE to do something like this SELECT * FROM table WHERE url LIKE (‘http://members.aye.net/%’);

    What I would do is may be create a domain table to store domains only based on CRC32,
    then create 26 tables for your URL’s and partition your URI’s based on the URI

    so table domain stores : http://www.dell.com, http://www.vantwisk.nl, http://www.mysqlperformanceblog,com
    then table URI stores : 2008/04/04/multi-column-in-clause-unexpected-mysql-issue/#more-361, blabla.html, myblog/entry/bla.html

    On the doamin table you can create an idnex by CRC, but I would just use the.
    On table URI you simple create an index on the first XX characters to keep index size low.

    Data retrieval is done using a couple of stored procedures to get the right data and filter any duplicates.

    Ries
    PS: you could partition your tables if you need more. Since you didn’t mention how many rows you ant to store it’s hard to guess….
    You did mention massive, how much is massive for you??? in the order of 1000mil records???

    Ries

  16. peter says:

    Ries,

    I do not understand you. You can perfectly check EXACTLY if URL exists or not using plain SQL statement. Simply only CRC part will be checked by index and post-checking will be done after full row is read.

    Now planning your database structure you plan it for certain queries. If you need to serve arbitrary queries you design schema appropriately. If you only need some queries but you need them really fast you design schema differently.

    Think about it in car terms. If you only want to entertain your girlfriend and get yourself to work 2 seater sportscar may do good job for you. If you may need to load 7 people or load a sofa Minivan works better. So what ? You can’t say one or another is better without knowing the purpose.

    With domain table you propose certain way of normalization which is good for some applications again.

    For some workloads prefix indexes may work or other index structures.

    Speaking about 1-10b rows is a prototype size so we’re rather picky on optimizing queries.

  17. Ken says:

    Peter,
    Why use CRC32 at all… have your app use a common compression algorithm for the url, store the result… it will be unique, save space, etc….

  18. peter says:

    Ken,

    Because CRC32 is just 32bits 4bytes compressed url would take much longer.

    Of course if you can sue some simple compression like you need to store image URLs and so would store 123 instead of img.site.com/123.jpg but for general urls you will not compress them even close to this number

  19. Abhishek Soni says:

    Hi all

    I’ve a solution for above mentioned queries. After hard digging I found out that when we use IN clause, MySQL first executes the outer query i.e. the PRIMARY query as shown in the EXPLAIN for the query and then it executes the inner subquery.

    To bypass this problem I used JOIN and the execution time reduced drastically from few seconds to few msec’s

    I don’t know what is schema of your table. But here is what I would like to propose as a possible solution

    EXPLAIN SELECT url FROM 106pages.106pages as t1 left outer join 106pages.106pages as t2 on t1.url = t2.url WHERE t2.url_crc
    IN ((2752937066,3799762538);

    I have used self join in this case. I haven’t executed the query because of lack of knowledge about table schema. hope it works.

  20. Rick says:

    I have a similar issue with the in clause (5.0.77). It seems that if there are more than 2 items in the list, the optimizer abandons use of an existing index.

    Any ideas (besides moving to 5.1.xx)?

    mysql> explain select * from projects where project_phase_id in (2,3);
    +—-+————-+———-+——-+—————+—————+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+———-+——-+—————+—————+———+——+——+————-+
    | 1 | SIMPLE | projects | range | proj_phase_id | proj_phase_id | 4 | NULL | 137 | Using where |
    +—-+————-+———-+——-+—————+—————+———+——+——+————-+
    1 row in set (0.00 sec)

    mysql> explain select * from projects where project_phase_id in (2,3,4);
    +—-+————-+———-+——+—————+——+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+———-+——+—————+——+———+——+——+————-+
    | 1 | SIMPLE | projects | ALL | proj_phase_id | NULL | NULL | NULL | 757 | Using where |
    +—-+————-+———-+——+—————+——+———+——+——+————-+
    1 row in set (0.00 sec)

  21. Saru says:

    Index Performance :

    CREATE TABLE Persons (
    id int(11) NOT NULL DEFAULT 20,
    name varchar(20) NOT NULL DEFAULT ” “,
    age int(100) NOT NULL,
    PRIMARY KEY (id,name)
    ) ENGINE=InnoDB;

    insert into Persons (id,name) values (1,”name1″),(2,”name2″);

    1. select id from Persons where id = 1; -> this query use PRIMARY KEY index
    2. select id from Persons name = “name1″; -> internal index
    3. select name from Persons where id=1 and name=”name1″ and age=20; -> ??
    4. select name from Persons where id=1 and name=”name1″ and age = 30; -> PRIMARY KEY index
    5. select name from Persons where (id,name) =(1 , “name1″) and age = 30 ; -> PRIMARY KEY index

    Please explain index lookups for above queries.
    Query 4 and 5 are same.But which performance is best query 4 and query 5?

Speak Your Mind

*