Four Ways MySQL Executes GROUP BY

MySQL GROUP BYIn this blog post, I’ll look into four ways MySQL executes GROUP BY. 

In my previous blog post, we learned that indexes or other means of finding data might not be the most expensive part of query execution. For example, MySQL GROUP BY could potentially be responsible for 90% or more of the query execution time. 

The main complexity when MySQL executes GROUP BY is computing aggregate functions in a GROUP BY statement. How this works is shown in the documentation for UDF Aggregate Functions. As we see, the requirement is that UDF functions get all values that constitute the single group one after another. That way, it can compute the aggregate function value for the single group before moving to another group.

The problem, of course, is that in most cases the source data values aren’t grouped. Values coming from a variety of groups follow one another during processing. As such, we need a special step.

Handling MySQL GROUP BY

Let’s look at the same table we looked at before:

And the same GROUP BY statements executed in different ways:

1: Index Ordered GROUP BY in MySQL

In this case, we have an index on the column we use for GROUP BY. This way, we can just scan data group by group and perform GROUP BY on the fly (inexpensively).

It works especially well when we use LIMIT to restrict the number of groups we retrieve or when a “covering index” is in use, as a sequential index-only scan is a very fast operation.

If you have a small number of groups though, and no covering index, index order scans can cause a lot of IO. So this might not be the most optimal plan.

2: External Sort GROUP BY in MySQL

If we do not have an index that allows us to scan the data in the group order, we can instead get data sorted through an external sort (also referred to as “filesort” in MySQL).

You may notice I’m using an SQL_BIG_RESULT hint here to get this plan. Without it, MySQL won’t choose this plan in this case.

In general, MySQL prefers to use this plan only if we have a large number of groups because in this case sorting is more efficient than having a temporary table (which we will talk about next).

3: Temporary Table GROUP BY in MySQL

In this case, MySQL also does a full table scan. But instead of running additional sort passes, it creates a temporary table instead. This temporary table contains one row per group, and with each incoming row, the value for the corresponding group is updated. Lots of updates! While this might be reasonable in-memory, it becomes very expensive if the resulting table is so large that updates are going to cause a lot of disk IO. In this case, external sort plans are usually better.

Note that while MySQL selects this plan by default for this use case, if we do not supply any hints it is almost 10x slower than the plan we get using the SQL_BIG_RESULT hint.

You may notice I added “ORDER BY NULL” to this query. This is to show you “clean” the temporary table only plan. Without it, we get this plan: