MySQL Limitations Part 3: Subqueries

This is the third in a series on what’s seriously limiting MySQL in certain circumstances (links: part 1, 2). This post is about subqueries, which in some cases execute outside-in instead of inside-out as users expect.

It’s easy to pick on subqueries in MySQL, so I’ll try to be gentle. The following query will surprise users unpleasantly:

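A representative query of the shape described — the table names `a` and `b` and the columns are illustrative stand-ins:

```sql
SELECT * FROM a WHERE a.id IN (SELECT a_id FROM b);
```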
Users expect the inner query to execute first, then the results to be substituted into the IN() list. But what happens instead is usually a full scan or index scan of table a, followed by N queries to table b. This is because MySQL rewrites the query to make the inner query dependent on the outer query, which could be an optimization in some cases, but de-optimizes the query in many other cases. NOT IN(SELECT …) queries execute badly, too. (Note: putting a literal list of items in the IN() clause performs fine. It’s only when there is a SELECT inside it that it works poorly.)

The fix for this has been in progress for a few years, and Sergey Petrunia committed working code to the stalled 6.0 release. But it’s not quite clear whether that code was a complete solution. It has not been in any GA or RC release, so it hasn’t been used widely.

To be fair, many other database servers also have poor subquery performance, or have had it in the past and have fixed it. And many MySQL users have learned to simply write JOINs instead, so it isn’t that much of a limitation. But it would be a big improvement if it were fixed.
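The JOIN rewrite mentioned above is mechanical when the subquery's column is duplicate-free. A sketch with hypothetical tables `a` and `b`, using Python's sqlite3 purely as a stand-in to check that both forms return the same rows:

```python
import sqlite3

# Hypothetical schema: a(id, val), b(a_id). sqlite3 stands in for MySQL
# only to verify that the two query forms are equivalent when b.a_id
# contains no duplicates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (1), (3);
""")

# The IN(SELECT ...) form that MySQL executes poorly.
subquery_form = conn.execute(
    "SELECT a.id, a.val FROM a WHERE a.id IN (SELECT a_id FROM b) ORDER BY a.id"
).fetchall()

# The manual JOIN rewrite many users adopted instead.
join_form = conn.execute(
    "SELECT a.id, a.val FROM a JOIN b ON a.id = b.a_id ORDER BY a.id"
).fetchall()

print(subquery_form == join_form)  # True: identical here because b.a_id is duplicate-free
```

The caveat, discussed at length in the comments below, is that the plain JOIN form returns duplicate rows when the subquery column is not unique.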

See if you can guess what limitation number 4 will be!


Comments (21)

  • Foobarista

    One thing that helps is to make sure you use a range variable (table alias) on the subquery if any columns in the outer query's tables have the same name. I think what's going on is MySQL's parser doesn't do scoping very well and treats the subquery as correlated if there's any ambiguity in column names. So, rewrite the above to something like

    select * from a where a.id in (select x.id from b x);

    October 25, 2010 at 12:00 am
  • Yves Trudeau

    Hi Baron,
    Look at the execution plans for 6.0.x and 5.0.x in

    October 25, 2010 at 11:41 am
  • Dathan Pattishall

    Guess for #4 is the Query Cache.

    October 25, 2010 at 12:11 pm
  • Patrick Casey

    My guess for the “normal case” #4 is the threading model and coarse-grained memory locks inside the system. It’s most prevalent in things like the MyISAM key buffer cache, but there are still a lot of nasty big locks inside InnoDB that limit practical concurrency. Given that a modern server-class machine has a minimum of 8/16 cores (physical/HT), a database which can only use, say, 4-6 of them effectively is leaving a lot of horsepower on the table.

    My personal vote for #4, because it impacts me but probably not most workloads, is:

    #4 Inability to span InnoDB data files across disks or LUNs. Even if I use file-per-table, it’s not practical to put, say, tables apple.ibd, pear.ibd and banana.ibd on one disk and taco.ibd, burrito.ibd, and burger.ibd on another. On my Oracle servers I can (and do) provision a lot of LUNs and put redo on two LUNs, data files on multiple LUNs, etc. On my InnoDB servers it’s really not practical, so we just end up with single big LUNs and pray the SAN can handle it. I’m reasonably sure I could shut down the database, move the files manually, and leave symlinks in their place, but that’s an extreme non-starter for me where the production databases do not come down.

    October 25, 2010 at 12:35 pm
  • Patrick Casey

    One interesting point on the subqueries:

    Yes, I would dearly like to see this fixed as well. I’d even take a query hint that told the optimizer “eval the inner loop first” although I’d prefer the optimizer to be brighter.

    As the OP said, a lot of times you are better off with a join, but not every case converts into join semantics easily or precisely and, frankly, a lot more people can write a logically correct inner select than will get a logically correct join.

    October 25, 2010 at 12:39 pm
  • tobi

    It seems to me that many common subqueries can be transformed to joins very easily. The difficult cases can be left aside for the moment. Why not transform the simple cases?

    October 25, 2010 at 1:07 pm
  • tobi

    Limitation 4 is that hash and merge joins are not supported which seriously destroys large query performance for reporting. It can also lead to a runtime that is exponential in the number of joins.

    October 25, 2010 at 1:09 pm
  • tobi

    It also does not have a hash aggregate so that big data sets have to be sorted to be grouped. Sorting big data sets is not a good idea when you can just stuff the data in a hash table instead. Hash joins and aggregates are what drive large scale queries in postgres, sql server and oracle.
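A hash aggregate amounts to a single pass over unsorted rows with a hash table keyed on the grouping columns, with no sort step. A minimal sketch in Python (a dict playing the role of the hash table; column names are illustrative):

```python
# Minimal sketch of a hash aggregate: GROUP BY key, SUM(value) in one
# pass over unsorted rows, using a dict as the hash table. A sort-based
# aggregate would have to order the rows by key first.
def hash_aggregate(rows):
    groups = {}
    for key, value in rows:
        groups[key] = groups.get(key, 0) + value
    return groups

print(hash_aggregate([("a", 1), ("b", 2), ("a", 3)]))  # {'a': 4, 'b': 2}
```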

    October 25, 2010 at 1:13 pm
  • Patrick Casey

    I believe the challenge with transforming inner selects into joins is that to do it accurately (without changing the results) requires information that, while readily available to a human, isn’t available to an optimizer.


    select * from food where id in (select food_id from meals where user = me)

    The trivial rewrite is:

    select food.* from food inner join meals on food.id = meals.food_id where meals.user = me

    The problem is the result above *might* change the answer.


    food:
    1 taco
    2 burger
    3 apple

    meals:
    breakfast me 3
    lunch me 1
    dinner me 1

    It seems I had a taco for lunch and for dinner (not unreasonable).

    Our original query gives back:
    1 taco
    3 apple

    our rewritten query gives back
    3 apple
    1 taco
    1 taco

    In order to know whether we can safely rewrite, we have to know whether the inner select will contain any duplicates. An application designer knows this. The optimizer cannot, although in certain special cases it can figure it out.

    Likewise, in a query like this with no ORDER BY clause, the mere fact that we did a rewrite will likely change the row retrieval order. Technically this is legitimate since, in the absence of ORDER BY, the user has no right to expect a particular order. In practice, though, there’s lots of bad code out there that makes assumptions like this, and a global change like that would break it.
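Patrick's duplicate-row hazard can be checked mechanically. A sketch with his food/meals data, using Python's sqlite3 as a stand-in for MySQL:

```python
import sqlite3

# Patrick's example data: one user ate a taco twice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE food (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE meals (meal TEXT, user TEXT, food_id INTEGER);
    INSERT INTO food VALUES (1, 'taco'), (2, 'burger'), (3, 'apple');
    INSERT INTO meals VALUES ('breakfast', 'me', 3), ('lunch', 'me', 1), ('dinner', 'me', 1);
""")

q_in = "SELECT food.* FROM food WHERE id IN (SELECT food_id FROM meals WHERE user = 'me')"
q_join = ("SELECT food.* FROM food JOIN meals ON food.id = meals.food_id "
          "WHERE meals.user = 'me'")
q_join_distinct = q_join.replace("SELECT food.*", "SELECT DISTINCT food.*")

print(len(conn.execute(q_in).fetchall()))            # 2 rows: taco, apple
print(len(conn.execute(q_join).fetchall()))          # 3 rows: taco appears twice
print(len(conn.execute(q_join_distinct).fetchall())) # 2 rows again
```

Adding DISTINCT to the join form restores the two-row result, at the cost of a duplicate-elimination step.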

    October 25, 2010 at 1:31 pm
  • Dmitri Mikhailov

    The correct rewrite is:

    select distinct foods.*
    from foods
    inner join meals
    on meals.food_id = foods.id
    where meals.user = 'me'

    – or –

    select distinct foods.*
    from foods
    inner join meals
    on meals.food_id = foods.id
    and meals.user = 'me'

    – or –

    select foods.*
    from ( select distinct food_id
    from meals
    where meals.user = 'me'
    ) meals
    inner join foods
    on foods.id = meals.food_id

    October 25, 2010 at 2:30 pm
  • Michael Peters

    Does this limitation on subqueries also affect subqueries that are used in JOIN clauses? For instance

    SELECT a.* FROM a JOIN (SELECT b.a_id FROM b WHERE b.blah = 'asdf')

    October 25, 2010 at 2:51 pm
  • Shlomi Noach

    Next limitation might be: only one CPU per query.

    October 25, 2010 at 8:42 pm
  • Shlomi Noach

    (make that one core)

    October 25, 2010 at 8:43 pm
  • Ryan Huddleston

    MariaDB 5.3 has the latest work on subquery optimization:

    A huge amount of optimizer improvements are being put into MariaDB 5.3 and it will be exciting to see it get released, as it’s been a long road to get at some of these features.

    Not sure when they plan to release for beta though.

    October 25, 2010 at 9:50 pm
  • Daniël van Eeden

    My guesses for limitation number 4:
    1. single-threaded dump/restore (mysqldump). Which is partly solved by mk-parallel-dump/mk-parallel-restore
    2. single-threaded restore from a crashed database
    3. performance within a stored procedure (which Roland Bouman explained during FOSDEM)
    4. the optimizer choosing the wrong index (esp. with prepared statements, but that’s not really the optimizer’s fault)
    5. TCP scalability issues (SCTP could solve that)

    October 26, 2010 at 1:35 am
  • Baron Schwartz

    Ryan, I agree 100% that the MariaDB team is doing great work and it’s exciting to see it happening and becoming available for people to use.

    October 26, 2010 at 5:23 am
  • Matt

    It is my understanding that some of the subquery fixes were put into 5.4. Materialization, to be specific. Doesn’t that solve this case?

    October 26, 2010 at 10:53 pm
  • Fedya

    There is a problem where I don’t see any solution without using subqueries.

    For example:

    Suppliers:
    id, name

    Documents:
    id (document version id)
    document_id (document id)

    Suppliers Documents:
    doc_id (version id)

    One document can have many versions (ids). The highest id for a given document_id is the current document version.

    How to count a supplier’s total latest documents and documents verified?

    You just about have to use a clause like “doc_id IN (SELECT MAX(id) FROM documents GROUP BY document_id)”.
    Of course we can run “SELECT MAX(id) FROM documents GROUP BY document_id” first and then construct and insert a literal list like (2,6,7..) into the main query, but that’s not the point.

    If anyone has ideas how to rewrite this query…
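For what it's worth, the standard subquery-free form of "latest version per document" is a join against a grouped derived table. A sketch with hypothetical data, sqlite3 standing in for MySQL:

```python
import sqlite3

# Hypothetical versions: document 10 has versions 1 and 2,
# documents 11 and 12 have one version each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, document_id INTEGER);
    INSERT INTO documents VALUES (1, 10), (2, 10), (6, 11), (7, 12);
""")

# Latest version per document: join against a derived table holding
# MAX(id) per document_id, instead of IN (SELECT MAX(id) ...).
latest = conn.execute("""
    SELECT d.id, d.document_id
    FROM documents d
    JOIN (SELECT document_id, MAX(id) AS max_id
          FROM documents GROUP BY document_id) m
      ON d.id = m.max_id
    ORDER BY d.id
""").fetchall()

print(latest)  # [(2, 10), (6, 11), (7, 12)]
```

Joining Suppliers Documents to this derived table lets the counts be computed without an IN(SELECT ...) clause.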

    October 29, 2010 at 2:39 am
  • Rand

    Why not use an EXISTS query?

    select *
    from food
    where exists
    (select 1
    from meals
    where food.id = meals.food_id
    and user = me)

    I have had no issues with EXISTS/NOT EXISTS queries on mysql 5.0.x. They work quite well.

    November 21, 2010 at 5:20 pm
  • Baron Schwartz

    EXISTS subqueries generally work as expected in MySQL.

    December 6, 2010 at 12:58 pm
  • York

    I agree with Tobi that ‘hash join’ / ‘hash aggregate’ / ‘hash join hints’ are one of the most-wanted features in MySQL.

    February 27, 2013 at 1:52 pm

