Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Four Ways MySQL Executes GROUP BY

February 5, 2018

Author

Peter Zaitsev

Insight for DBAs

MySQL

Share this Post:

In this blog post, I’ll look into four ways MySQL executes GROUP BY.

In my previous blog post, we learned that indexes or other means of finding data might not be the most expensive part of query execution. For example, MySQL GROUP BY could potentially be responsible for 90% or more of the query execution time.

The main complexity when MySQL executes GROUP BY is computing aggregate functions in a GROUP BY statement. How this works is shown in the documentation for UDF Aggregate Functions. As we see, the requirement is that UDF functions get all values that constitute the single group one after another. That way, it can compute the aggregate function value for the single group before moving to another group.

The problem, of course, is that in most cases the source data values aren’t grouped. Values coming from a variety of groups follow one another during processing. As such, we need a special step.

Handling MySQL GROUP BY

Let’s look at the same table we looked at before:

mysql> show create table tbl G
*************************** 1. row ***************************
      Table: tbl
Create Table: CREATE TABLE `tbl` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `k` int(11) NOT NULL DEFAULT '0',
 `g` int(10) unsigned NOT NULL,
 PRIMARY KEY (`id`),
 KEY `k` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=2340933 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> show create table tbl G

*************************** 1. row ***************************

Table: tbl

Create Table: CREATE TABLE `tbl` (

`id` int(11) NOT NULL AUTO_INCREMENT,

`k` int(11) NOT NULL DEFAULT '0',

`g` int(10) unsigned NOT NULL,

PRIMARY KEY (`id`),

KEY `k` (`k`)

) ENGINE=InnoDB AUTO_INCREMENT=2340933 DEFAULT CHARSET=latin1

1 row in set (0.00 sec)

And the same GROUP BY statements executed in different ways:

1: Index Ordered GROUP BY in MySQL

mysql> select k, count(*) c from tbl group by k order by k limit 5;
+---+---+
| k | c |
+---+---+
| 2 | 3 |
| 4 | 1 |
| 5 | 2 |
| 8 | 1 |
| 9 | 1 |
+---+---+
5 rows in set (0.00 sec)

mysql> explain select k, count(*) c from tbl group by k order by k limit 5 G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: index
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 5
    filtered: 100.00
       Extra: Using index
1 row in set, 1 warning (0.00 sec)

mysql> select k, count(*) c from tbl group by k order by k limit 5;

+---+---+

| k | c |

+---+---+

| 2 | 3 |

| 4 | 1 |

| 5 | 2 |

| 8 | 1 |

| 9 | 1 |

+---+---+

5 rows in set (0.00 sec)

mysql> explain select k, count(*) c from tbl group by k order by k limit 5 G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: tbl

partitions: NULL

type: index

possible_keys: k

key: k

key_len: 4

ref: NULL

rows: 5

filtered: 100.00

Extra: Using index

1 row in set, 1 warning (0.00 sec)

In this case, we have an index on the column we use for GROUP BY. This way, we can just scan data group by group and perform GROUP BY on the fly (inexpensively).

It works especially well when we use LIMIT to restrict the number of groups we retrieve or when a “covering index” is in use, as a sequential index-only scan is a very fast operation.

If you have a small number of groups though, and no covering index, index order scans can cause a lot of IO. So this might not be the most optimal plan.

2: External Sort GROUP BY in MySQL

mysql> explain select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5 G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using filesort
1 row in set, 1 warning (0.00 sec)


mysql> select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5;
+---+---+
| g | c |
+---+---+
| 0 | 1 |
| 1 | 2 |
| 4 | 1 |
| 5 | 1 |
| 6 | 2 |
+---+---+
5 rows in set (0.88 sec)

mysql> explain select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5 G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: tbl

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 998490

filtered: 100.00

Extra: Using filesort

1 row in set, 1 warning (0.00 sec)

mysql> select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5;

+---+---+

| g | c |

+---+---+

| 0 | 1 |

| 1 | 2 |

| 4 | 1 |

| 5 | 1 |

| 6 | 2 |

+---+---+

5 rows in set (0.88 sec)

If we do not have an index that allows us to scan the data in the group order, we can instead get data sorted through an external sort (also referred to as “filesort” in MySQL).

You may notice I’m using an SQL_BIG_RESULT hint here to get this plan. Without it, MySQL won’t choose this plan in this case.

In general, MySQL prefers to use this plan only if we have a large number of groups because in this case sorting is more efficient than having a temporary table (which we will talk about next).

3: Temporary Table GROUP BY in MySQL

mysql> explain select  g, sum(g) s from tbl group by g limit 5 G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using temporary
1 row in set, 1 warning (0.00 sec)


mysql> select  g, sum(g) s from tbl group by g order by null limit 5;
+---+------+
| g | s    |
+---+------+
| 0 |    0 |
| 1 |    2 |
| 4 |    4 |
| 5 |    5 |
| 6 |   12 |
+---+------+
5 rows in set (7.75 sec)

mysql> explain select g, sum(g) s from tbl group by g limit 5 G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: tbl

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 998490

filtered: 100.00

Extra: Using temporary

1 row in set, 1 warning (0.00 sec)

mysql> select g, sum(g) s from tbl group by g order by null limit 5;

+---+------+

| g | s |

+---+------+

| 0 | 0 |

| 1 | 2 |

| 4 | 4 |

| 5 | 5 |

| 6 | 12 |

+---+------+

5 rows in set (7.75 sec)

In this case, MySQL also does a full table scan. But instead of running additional sort passes, it creates a temporary table instead. This temporary table contains one row per group, and with each incoming row, the value for the corresponding group is updated. Lots of updates! While this might be reasonable in-memory, it becomes very expensive if the resulting table is so large that updates are going to cause a lot of disk IO. In this case, external sort plans are usually better.

Note that while MySQL selects this plan by default for this use case, if we do not supply any hints it is almost 10x slower than the plan we get using the SQL_BIG_RESULT hint.

You may notice I added “ORDER BY NULL” to this query. This is to show you “clean” the temporary table only plan. Without it, we get this plan:

mysql> explain select  g, sum(g) s from tbl group by g limit 5 G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)

mysql> explain select g, sum(g) s from tbl group by g limit 5 G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: tbl

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 998490

filtered: 100.00

Extra: Using temporary; Using filesort

1 row in set, 1 warning (0.00 sec)

In it, we get the “worst of both worlds” with Using Temporary Table and filesort.

MySQL 5.7 always returns GROUP BY results sorted in group order, even if this the query doesn’t require it (which can then require an expensive additional sort pass). ORDER BY NULL signals the application doesn’t need this.

You should note that in some cases – such as JOIN queries with aggregate functions accessing columns from different tables – using temporary tables for GROUP BY might be the only option.

If you want to force MySQL to use a plan that does temporary tables for GROUP BY, you can use the SQL_SMALL_RESULT hint.

4: Index Skip-Scan-Based GROUP BY in MySQL

The previous three GROUP BY execution methods apply to all aggregate functions. Some of them, however, have a fourth method.

mysql> explain select k,max(id) from tbl group by k G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 2
    filtered: 100.00
       Extra: Using index for group-by
1 row in set, 1 warning (0.00 sec)

mysql> select k,max(id) from tbl group by k;
+---+---------+
| k | max(id) |
+---+---------+
| 0 | 2340920 |
| 1 | 2340916 |
| 2 | 2340932 |
| 3 | 2340928 |
| 4 | 2340924 |
+---+---------+
5 rows in set (0.00 sec)

mysql> explain select k,max(id) from tbl group by k G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: tbl

partitions: NULL

type: range

possible_keys: k

key: k

key_len: 4

ref: NULL

rows: 2

filtered: 100.00

Extra: Using index for group-by

1 row in set, 1 warning (0.00 sec)

mysql> select k,max(id) from tbl group by k;

+---+---------+

| k | max(id) |

+---+---------+

| 0 | 2340920 |

| 1 | 2340916 |

| 2 | 2340932 |

| 3 | 2340928 |

| 4 | 2340924 |

+---+---------+

5 rows in set (0.00 sec)

This method applies only to very special aggregate functions: MIN() and MAX(). These do not really need to go through all the rows in the group to compute the value at all.

They can just jump to the minimum or maximum group value in the group directly (if there is such an index).

How can you find MAX(ID) value for each group if the index is only built on (K) column? This is an InnoDB table. Remember InnoDB tables effectively append the PRIMARY KEY to all indexes. (K) becomes (K,ID), allowing us to use Skip-Scan optimization for this query.

This optimization is only enabled if there is a large number of rows per group. Otherwise, MySQL prefers more conventional means to execute this query (like Index Ordered GROUP BY detailed in approach #1).

While we’re on MIN()/MAX() aggregate functions, other optimizations apply to them as well. For example, if you have an aggregate function with no GROUP BY (effectively having one group for all tables), MySQL fetches those values from indexes during a statistics analyzes phase and avoids reading tables during the execution stage altogether:

mysql> explain select max(k) from tbl G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: NULL
  partitions: NULL
        type: NULL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: NULL
    filtered: NULL
       Extra: Select tables optimized away
1 row in set, 1 warning (0.00 sec)

mysql> explain select max(k) from tbl G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: NULL

partitions: NULL

type: NULL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: NULL

filtered: NULL

Extra: Select tables optimized away

1 row in set, 1 warning (0.00 sec)

Filtering and Group By

We have looked at four ways MySQL executes GROUP BY. For simplicity, I used GROUP BY on the whole table, with no filtering applied. The same concepts apply when you have a WHERE clause:

mysql> explain select  g, sum(g) s from tbl where k>4 group by g order by NULL limit 5 G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 1
    filtered: 100.00
       Extra: Using index condition; Using temporary
1 row in set, 1 warning (0.00 sec)

mysql> explain select g, sum(g) s from tbl where k>4 group by g order by NULL limit 5 G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: tbl

partitions: NULL

type: range

possible_keys: k

key: k

key_len: 4

ref: NULL

rows: 1

filtered: 100.00

Extra: Using index condition; Using temporary

1 row in set, 1 warning (0.00 sec)

For this case, we use the range on the K column for data filtering/lookup and do a GROUP BY when there is a temporary table.

In some cases, the methods do not conflict. In others, however, we have to choose either to use one index for GROUP BY or another index for filtering:

mysql> alter table tbl add key(g);
Query OK, 0 rows affected (4.17 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> explain select  g, sum(g) s from tbl where k>1 group by g limit 5 G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: index
possible_keys: k,g
         key: g
     key_len: 4
         ref: NULL
        rows: 16
    filtered: 50.00
       Extra: Using where
1 row in set, 1 warning (0.00 sec)

mysql> explain select  g, sum(g) s from tbl where k>4 group by g limit 5 G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k,g
         key: k
     key_len: 4
         ref: NULL
        rows: 1
    filtered: 100.00
       Extra: Using index condition; Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)

mysql> alter table tbl add key(g);

Query OK, 0 rows affected (4.17 sec)

Records: 0 Duplicates: 0 Warnings: 0

mysql> explain select g, sum(g) s from tbl where k>1 group by g limit 5 G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: tbl

partitions: NULL

type: index

possible_keys: k,g

key: g

key_len: 4

ref: NULL

rows: 16

filtered: 50.00

Extra: Using where

1 row in set, 1 warning (0.00 sec)

mysql> explain select g, sum(g) s from tbl where k>4 group by g limit 5 G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: tbl

partitions: NULL

type: range

possible_keys: k,g

key: k

key_len: 4

ref: NULL

rows: 1

filtered: 100.00

Extra: Using index condition; Using temporary; Using filesort

1 row in set, 1 warning (0.00 sec)

Depending on specific constants used in this query, we can see that we either use an index ordered scan for GROUP BY (and “give up” benefiting from the index to resolve the WHERE clause), or use an index to resolve the WHERE clause (but use a temporary table to resolve GROUP BY).

In my experience, this is where MySQL GROUP BY does not always make the right choice. You might need to use FORCE INDEX to execute queries the way you want them to.

Summary

I hope this article provides a good overview of how MySQL executes GROUP BY. In my next blog post, we will look into techniques you can use to optimize GROUP BY queries.

You May Also Like

Alexander Rubin and I wrote the book on practical MySQL performance optimization. The book is available to download at no cost to you.

If your business is looking for a new open-source database option, our white paper covers an array of services that fit into four general categories: Support, Consulting, Managed DBA Services, and Training. Download our paper for guidance on deciding which option is the right fit for your business needs.

Upgrade your MySQL experience with Percona Server for MySQL, a superior, open source alternative to traditional MySQL databases. Merging the best of MySQL Community Edition with features on par with the Enterprise Edition, it offers the perfect blend of performance, security, and cost-effectiveness. Say goodbye to restrictive licensing and embrace a database solution tailored for enterprise needs without the overhead.

0 0 votes

Article Rating

1 Comment

Oldest

Newest Most Voted

Rey

7 years ago

Thank Peter for the great article.

I have 1 concern about “Using temporary” and “Using filesort”.

How “Using temporary” works behind the scene?
MySQL will create a temp in-memory table, and if the size of table is too big, it will need to save to disk (in temp folder), when we do group by, and we have a lot of result, it means a lot of UPDATEs to the temporary table, it leads to heavy IO?

And why MySQL choose “Using temporary” as the default execute plan, with the hint of SQL_BIG_RESULT | SQL_SMALL_RESULT, I guess “Using temporary” will be effective in the case we have small result, I did some tests but the response time is so so, not really fast compare to “Using filesort”

I tried to find document, article for it but I could not find clear enough one.

I understand that “Using filesort” is not really using “file” for “sort”, MySQL will sort data in memory, if the data can not fit in memory, it will separate data into chunks, sort each chunks and save to file, then merge all the chunk using merge sort.

Regards,
Rey