MySQL Limitations Part 3: Subqueries

PREVIOUS POST
NEXT POST

This is the third in a series on what’s seriously limiting MySQL in certain circumstances (links: part 1, 2). This post is about subqueries, which in some cases execute outside-in instead of inside-out as users expect.

It’s easy to pick on subqueries in MySQL, so I’ll try to be gentle. The following query will surprise users unpleasantly:

Users expect the inner query to execute first, then the results to be substituted into the IN() list. But what happens instead is usually a full scan or index scan of table a, followed by N queries to table b. This is because MySQL rewrites the query to make the inner query dependent on the outer query, which could be an optimization in some cases, but de-optimizes the query in many other cases. NOT IN(SELECT …) queries execute badly, too. (Note: putting a literal list of items in the IN() clause performs fine. It’s only when there is a SELECT inside it that it works poorly.)

The fix for this has been in progress for a few years, and Sergey Petrunia committed working code to the stalled 6.0 release. But it’s not quite clear whether that code was a complete solution. It has not been in any GA or RC release, so it hasn’t been used widely.

To be fair, many other database servers also have poor subquery performance, or have had it in the past and have fixed it. And many MySQL users have learned to simply write JOINs instead, so it isn’t that much of a limitation. But it would be a big improvement if it were fixed.

See if you can guess what limitation number 4 will be!

PREVIOUS POST
NEXT POST

Comments

  1. Foobarista says

    One thing that helps is to make sure you use a range variable on the subquery if there’s any columns in tables in the outer query that have the same name. I think what’s going on is mysql’s parser doesn’t do scoping very well and thinks the subquery may be correlated if there’s any ambiguity in column names. So, rewrite the above to something like

    select * from a where a.id in (select x.id from b x);

  2. Patrick Casey says

    My guess for the “normal case” #4 is the threading model and coarse grained memory locks inside the system. Its most prevelent in thinks like the myisam key buffer cache, but there are still a lot of nasty big locks inside innodb that limit practical concurrency. Given that a modern server class machine has a minimum of 8/16 cores (physical/ht), a database which can only use, say, 4-6 of them effectively is leaving a lot of horsepower on the table.

    My two personal vote for #4 because it impacts me but probably not most workload are:

    #4 Inability to span innodb data files across disks or luns. Even if I use file per table, its not practical to put, say, tables apple.idb, pear.idb and banana.idb on one disk and taco.idb, burrito,idb, and burger.idb on another. On my oracle servers I can (and do) provision a lot of luns and put redo on two luns, data files on multiple luns, etc. On my INNODB servers its really not practical so we just end up with single big luns and pray the san can handle it. I’m reasonably sure I could shut down the database, move the file manually, and leave sym-links in their place, but that’s an extreme non starter for me where the production databases do no come down.

  3. Patrick Casey says

    One interesting point on the subqueries:

    Yes, I would dearly like to see this fixed as well. I’d even take a query hint that told the optimizer “eval the inner loop first” although I’d prefer the optimizer to be brighter.

    As the OP said, a lot of times you are better off with a join, but not every case converts into join semantics easily or precisely and, frankly, a lot more people can write a logically correct inner select than will get a logically correct join.

  4. tobi says

    It seems to me that many common subqueries can be transformed to joins very easily. The difficult cases can be left aside for the moment. Why not transform the simple cases?

  5. tobi says

    Limitation 4 is that hash and merge joins are not supported which seriously destroys large query performance for reporting. It can also lead to a runtime that is exponential in the number of joins.

  6. tobi says

    It also does not have a hash aggregate so that big data sets have to be sorted to be grouped. Sorting big data sets is not a good idea when you can just stuff the data in a hash table instead. Hash joins and aggregates are what drive large scale queries in postgres, sql server and oracle.

  7. Patrick Casey says

    Tobi:
    I believe the challenge with transforming inner selects into joins is that to do it accurately (without changing the results) requires information that, while readily available to a human, isn’t available to an optimizer.

    Consider:

    select * from food where id in (select food_id from meals where user = me)

    The trivial rewrite is:

    select food.* from food inner join meals on food.id = meals.food_id where meals.user = me

    The problem is the result above *might* change the answer.

    Consider:

    Foods
    1 taco
    2 burger
    3 apple

    meals
    breakfast me 3
    lunch me 1
    dinner me 1

    It seems I had a taco for lunch and for dinner (not unreasonable).

    Our original query gives back:
    1 taco
    3 apple

    our rewritten query gives back
    3 apple
    1 taco
    1 taco

    In order to know if we can safely rewrite we have to know if the inner select will contain any duplicates. An application designer knows this. The optimizer cannot, although in certain special cases he can figure it out.

    Likewise, in a query like this in the absence of an order by clause, the mere fact that we did a rewrite will likely change the row retrieval order. Technically this is legit since, in the absence of order by, the user has no right to expect a particular order. In practise though there’s lots of bad code out there that makes assumptions like this and global changes like that would break it.

  8. Dmitri Mikhailov says

    The correct rewrite is:

    select distinct foods.*
    from foods
    inner join meals
    on meals.food_id = foods.id
    where meals.user = ‘me’
    ;

    – or –

    select distinct foods.*
    from foods
    inner join meals
    on meals.food_id = foods.id
    and meals.user = ‘me’
    ;

    – or –

    select foods.*
    from
    ( select distinct food_id
    from meals
    where meals.user = ‘me’
    ) meals
    inner join foods
    on foods.id = meals.food_id
    ;

  9. Michael Peters says

    Does this limitation on subqueries also affect subqueries that are used in JOIN clauses? For instance

    SELECT a.* FORM a JOIN (SELECT b.a_id FROM b WHERE b.blah = ‘asdf’)

  10. Daniël van Eeden says

    My guesses for limitation number 4:
    1. single-threaded dump/restore (mysqldump). Which is partly solved by mk-parallel-dump/mk-parallel-restore
    2. single-threaded restore from a crashed database
    3. performance within a stored procedure (Which Roland Bouman explained during fosdem)
    4. the optimizer choosing the wrong index (esp. with prepared statements, but that’s not really the optimizers fault)
    5. TCP scalability issues (SCTP could solve that)

  11. Matt says

    It is my understanding that some of they subquery fixes were put into 5.4. Materialization to be specific. Doesn’t that solve this case?

  12. Fedya says

    There is a problem where i dont see any solution without using sub-queries:

    For example:

    Suppliers
    id, name

    Documents
    id (document version id)
    document_id (document id)

    Suppliers Documents
    supp_id
    doc_id (version id)
    verified

    One document can have many versions (ids). The last id for there document_id is current document version;

    How to count supplier total last documents and documents verified ?

    just have to use join clause like “documents.id IN (SELECT MAX(id) FROM documents)”
    Of course we can run “SELECT MAX(id) FROM documents” and the just construct and insert string like (2,6,7..) in the main query, but that’s not the case

    if any one have ideas, how to re wright this query…
    thanks

  13. Rand says

    Why not use an EXISTS query?

    select *
    from food
    where exists
    (select 1
    from meals
    where food.id = meals.food_id
    and user = me)

    I have had no issues with EXISTS/NOT EXISTS queries on mysql 5.0.x. They work quite well.

  14. York says

    I agree with Tobi that ‘hash join’ / ‘hash aggregate’ / ‘hash join hints’ are one of the most-wanted features in MySQL.

Leave a Reply

Your email address will not be published. Required fields are marked *