Quite typical query for reporting applications is to find top X values. If you analyze Web Site logs you would look at most popular web pages or search engine keywords which bring you most of the traffic. If you’re looking at ecommerce reporting you may be interested in best selling product or top sales people. This information may often need simple select query, however what if you would like to show percents not just absolute value ?
For illustration purposes I’ve created a syntetic table filled with some 30mil rows evenly spread in 10.000 groups.
|
1 |
<br> CREATE TABLE `dt` (<br> `grp` int(10) unsigned NOT NULL,<br> `slack` varchar(50) DEFAULT NULL<br>) ENGINE=MyISAM DEFAULT CHARSET=latin1<br> |
And I’m using some silly like query to illustrate some search is required, so count(*) can’t be optimized away for MyISAM Tables.
To show percents for the values we need to know total number of rows which matches our where clause.
Most Simple Way Number One is to simply run two queries:
|
1 |
<br>mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by cnt desc limit 10;<br>+------+-----+<br>| grp | cnt |<br>+------+-----+<br>| 9879 | 300 | <br>| 3888 | 298 | <br>| 1793 | 297 | <br>| 2082 | 294 | <br>| 9729 | 293 | <br>| 3760 | 292 | <br>| 2588 | 290 | <br>| 7918 | 290 | <br>| 4055 | 290 | <br>| 6950 | 290 | <br>+------+-----+<br>10 rows in set (21.12 sec)<br><br>mysql> select count(*) cnt from dt where slack like "a%";<br>+---------+<br>| cnt |<br>+---------+<br>| 2352996 | <br>+---------+<br>1 row in set (18.71 sec)<br> |
This allows us to get results in about 40 seconds and as we can see is pretty inefficient – almost half of the time is spent counting the rows which are already traversed and counted for group by operation.
The obvious optimization is to get rid of LIMIT 10 and just fetch all groups and sum values on the application side. Which is the second way.
|
1 |
<br>mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by cnt desc; <br>+-------+-----+<br>| grp | cnt |<br>+-------+-----+<br>| 9879 | 300 | <br>| 3888 | 298 | <br>| 1793 | 297 | <br>...<br>| 7195 | 180 | <br>| 6703 | 178 | <br>| 10000 | 107 | <br>| 0 | 105 | <br>+-------+-----+<br>10001 rows in set (21.30 sec)<br> |
We even can got read of sorting and add ORDER BY NULL so MySQL does not bother to sort results, though this has a little difference in this case as number of groups is relatively small:
|
1 |
<br>mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by null; <br>+-------+-----+<br>| grp | cnt |<br>+-------+-----+<br>| 2257 | 251 | <br>| 2418 | 266 | <br>| 5842 | 258 | <br>| 90 | 276 | <br>...<br>| 2595 | 243 | <br>| 4446 | 239 | <br>| 9802 | 233 | <br>| 1861 | 227 | <br>+-------+-----+<br>10001 rows in set (21.39 sec)<br> |
This method indeed works great if you have relatively small number of groups which you can fetch on the application side (you can store result in temporary table and run sum() and sort query on that table instead if amount of groups is much larger).
The other idea came to my mind is using GROUP BY WITH ROLLUP so I can get grouped result set together with total value for
the groups:
|
1 |
<br>mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup order by cnt desc limit 11;<br>ERROR 1221 (HY000): Incorrect usage of CUBE/ROLLUP and ORDER BY<br> |
Oops. Bad luck – for some reason you can’t use order by together with ROLLUP. This does not make much sense to me and I find it quite inconvenient whenever it is MySQL limitation or SQL Standard. This would be extremely useful feature and I do not see good technical reaso not allowing it either.
OK Lets see how fast it is without order by (and limit):
|
1 |
<br>mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup; <br>+-------+---------+<br>| grp | cnt |<br>+-------+---------+<br>| 0 | 105 | <br>| 1 | 259 | <br>| 2 | 200 | <br>...<br>| 9999 | 235 | <br>| 10000 | 107 | <br>| | 2352996 | <br>+-------+---------+<br>10002 rows in set (28.79 sec)<br> |
As you can see there is considerable penalty associalted with GROUP BY WITH ROLLUP in MySQL – it is significantly slower than standard group by even though the only thing it needs to do is maintain an extra count.
So if MySQL does not allow us to use ORDER BY together with GROUP BY WITH ROLLUP can we still do something to get the result set we want from single query ? Sure. Here is Way Number Three:
|
1 |
<br>mysql> select * from (select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup) t order by cnt desc limit 11;<br>+------+---------+<br>| grp | cnt |<br>+------+---------+<br>| NULL | 2352996 | <br>| 9879 | 300 | <br>| 3888 | 298 | <br>| 1793 | 297 | <br>| 2082 | 294 | <br>| 9729 | 293 | <br>| 3760 | 292 | <br>| 7918 | 290 | <br>| 4055 | 290 | <br>| 3749 | 290 | <br>| 6950 | 290 | <br>+------+---------+<br>11 rows in set (29.68 sec)<br> |
Use of extra temporary table for buffering helps us to get result set we’re looking for and workaround this limitation, though it of course comes with performance penalty.
This approach is however still faster than running 2 queries completing in 30 seconds rather than 40. Though fetching all result set in this case is still significantly faster. So GROUP BY WITH Rollup is not the killer solution for this problem, while it well could be if well implemented inside MySQL.
Why am I looking on reporting performance optimization ? It is for ClickAider project which is growing rapidly and use of MySQL for reporting quite expectedly starts to become the bottleneck. We use a lot of other black magic to keep things as optimal as possible though performance is still far from our goal of real time reporting on 10.000.000+ recorded events.
UPDATE:
Looking a bit further into it I found why GROUP BY WITH ROLLUP is slower and why ORDER BY does not work with it. The thing is – it is using filesort as group by execution method, not temporary table as ordinary GROUP BY:
|
1 |
<br>mysql> explain select grp, count(*) cnt from dt where slack like "a%" group by grp G<br>*************************** 1. row ***************************<br> id: 1<br> select_type: SIMPLE<br> table: dt<br> type: ALL<br>possible_keys: NULL<br> key: NULL<br> key_len: NULL<br> ref: NULL<br> rows: 37748736<br> Extra: Using where; Using temporary; Using filesort<br>1 row in set (0.01 sec)<br><br>mysql> explain select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup G<br>*************************** 1. row ***************************<br> id: 1<br> select_type: SIMPLE<br> table: dt<br> type: ALL<br>possible_keys: NULL<br> key: NULL<br> key_len: NULL<br> ref: NULL<br> rows: 37748736<br> Extra: Using where; Using filesort<br>1 row in set (0.00 sec)<br> |
As I forced FileSort execution method for GROUP BY by using SQL_BIG_RESULT hint I can see GROUP BY executing about same time as GROUP BY WITH ROLLUP.
Resources
RELATED POSTS