November 26, 2014

Using GROUP BY WITH ROLLUP for Reporting Performance Optimization

Quite typical query for reporting applications is to find top X values. If you analyze Web Site logs you would look at most popular web pages or search engine keywords which bring you most of the traffic. If you’re looking at ecommerce reporting you may be interested in best selling product or top sales people. This information may often need simple select query, however what if you would like to show percents not just absolute value ?

For illustration purposes I’ve created a syntetic table filled with some 30mil rows evenly spread in 10.000 groups.

And I’m using some silly like query to illustrate some search is required, so count(*) can’t be optimized away for MyISAM Tables.

To show percents for the values we need to know total number of rows which matches our where clause.
Most Simple Way Number One is to simply run two queries:

This allows us to get results in about 40 seconds and as we can see is pretty inefficient – almost half of the time is spent counting the rows which are already traversed and counted for group by operation.

The obvious optimization is to get rid of LIMIT 10 and just fetch all groups and sum values on the application side. Which is the second way.

We even can got read of sorting and add ORDER BY NULL so MySQL does not bother to sort results, though this has a little difference in this case as number of groups is relatively small:

This method indeed works great if you have relatively small number of groups which you can fetch on the application side (you can store result in temporary table and run sum() and sort query on that table instead if amount of groups is much larger).

The other idea came to my mind is using GROUP BY WITH ROLLUP so I can get grouped result set together with total value for
the groups:

Oops. Bad luck – for some reason you can’t use order by together with ROLLUP. This does not make much sense to me and I find it quite inconvenient whenever it is MySQL limitation or SQL Standard. This would be extremely useful feature and I do not see good technical reaso not allowing it either.

OK Lets see how fast it is without order by (and limit):

As you can see there is considerable penalty associalted with GROUP BY WITH ROLLUP in MySQL – it is significantly slower than standard group by even though the only thing it needs to do is maintain an extra count.

So if MySQL does not allow us to use ORDER BY together with GROUP BY WITH ROLLUP can we still do something to get the result set we want from single query ? Sure. Here is Way Number Three:

Use of extra temporary table for buffering helps us to get result set we’re looking for and workaround this limitation, though it of course comes with performance penalty.

This approach is however still faster than running 2 queries completing in 30 seconds rather than 40. Though fetching all result set in this case is still significantly faster. So GROUP BY WITH Rollup is not the killer solution for this problem, while it well could be if well implemented inside MySQL.

Why am I looking on reporting performance optimization ? It is for ClickAider project which is growing rapidly and use of MySQL for reporting quite expectedly starts to become the bottleneck. We use a lot of other black magic to keep things as optimal as possible though performance is still far from our goal of real time reporting on 10.000.000+ recorded events.

UPDATE:
Looking a bit further into it I found why GROUP BY WITH ROLLUP is slower and why ORDER BY does not work with it. The thing is – it is using filesort as group by execution method, not temporary table as ordinary GROUP BY:

As I forced FileSort execution method for GROUP BY by using SQL_BIG_RESULT hint I can see GROUP BY executing about same time as GROUP BY WITH ROLLUP.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Daniel Ciulinaru says:

    Hi Peter,

    Looks like you CAN order the results the way you want in a GROUP BY … WITH ROLLUP (at least in 5.1.11-beta – I haven’t checked other versions).
    The key word is DESC after immediately after GROUP BY:

    mysql> select IFNULL(date,’Total’) as ‘date_total’, no_of_files, raw_size as ‘raw_size’, compressed_size as ‘compressed_size’, avg_ratio as ‘avg_ratio’ from (select date(log_datestamp) as date, count(log_instance) as no_of_files, sum(log_uncompressed_size) as raw_size, sum(log_compressed_size) as compressed_size, round(avg(log_compress_ratio_percentage),3) as avg_ratio from logs_stats group by date desc with rollup) as a;
    +————+————-+————–+—————–+———–+
    | date_total | no_of_files | raw_size | compressed_size | avg_ratio |
    +————+————-+————–+—————–+———–+
    | 2007-09-21 | 16 | 2605136552 | 131024683 | 94.988 |
    | 2007-09-20 | 171 | 26946514751 | 1724108146 | 93.246 |
    | 2007-09-19 | 355 | 53270319908 | 11372204678 | 79.300 |
    | 2007-09-18 | 375 | 57924126151 | 12854828516 | 80.481 |
    ……………………………………………………………….
    | 2007-07-16 | 1 | 113834620 | 82608679 | 27.400 |
    | 2007-06-08 | 1 | 132729275 | 12107490 | 90.900 |
    | 2007-05-22 | 1 | 180724880 | 16671364 | 90.800 |
    | 2007-05-03 | 1 | 180577870 | 16670719 | 90.800 |
    | Total | 5641 | 602786174381 | 126053043926 | 80.638 |
    +————+————-+————–+—————–+———–+
    37 rows in set (0.04 sec)

    “Sure, trust the gurus. Just don’t believe anything they say, especially when it comes to performance.”
    – Steven Feuerstein

    Cheers!

  2. peter says:

    Daniel,

    Please compare the query I’m saying which can’t be run and your query. They are VERY different.

    Indeed you can get results sorted in any direction by columns you use for group by
    but if you’re using group by with roll up you can’t sort by value of aggregate function such as count(), avg() sum() – hope it makes things a bit more clear.

  3. Larry Zaetz says:

    Hello Peter-
    I was surprised to see a last name such as yours.
    I was told that some of my family routes could have
    had a similar name. Father was from a town that sounded
    like “Chipchevitz”. Relatives close by came from Sarne.

  4. peter says:

    Indeed. Your last name directly translates as “Hare” in Russian. I guess it should be from some slavic country but probably not Russian directly as Russian last names usually formed using some apendix like “ov”, “ev” etc.

  5. richard says:

    In MySQL 5 and up you could do the following. Since you already have the counts, as long as you grouop by the same fields, using the sum function in the upper query does not change the counts, and you gain speed by using the limit in the subquery.

    SELECT grp
    , sum(cnt)
    FROM ( SELECT grp
    , count(*) cnt
    FROM dt
    WHERE slack LIKE “a%”
    GROUP BY grp
    ORDER BY cnt DESC
    LIMIT 10
    ) t1
    GROUP BY grp WITH rollup;

    -richard

  6. vepa says:

    Having same problem with count() for tagcloud generation. have 300.000 different tags (groups) and 1.5 mil records. takes 120 sec. to populate a tagcloud. Do you know any optimal ways to do tagclouds for huge websites?

    I wanted to store total numebers and update time on each tag but I am using same tagcloud for different things. so I need to store several totals for each tag which will be 20 extra columns for each record.

  7. peter says:

    vepa,

    This may be long (though can be OK if the load is IO bound) – the first thing you should not run such complex queries in real time :)

  8. Dasher says:

    The Tag Cloud only changes when an article is added or changed.

    So geneate a normalised value tag information and store the information in a table.
    Then when needed update the tag table with the tags used for the post.

    Each day – re-normalise the tag table.

    You’ll need to normalise the values otherwise you’ll end up with very large values over time – when what’s important is the relative values for the tags.

    For optimum performance:
    If you’re using PHP for the site – you can also use APC to store the site-wide TAG data (the HTML will be enough) – and then the code that updates the tag table – also updates the APC tag data.

  9. Excellent article!!! Thank you ! It solved my issues with ORDER BY and GROUP BY ..WITH ROLLUP in a great way. I also noticed that this is significant faster than the regular WITH ROLLUP

  10. Great work. I admire when someone sits down and explains the nitty gritty and simple and yet full true details about a technology such that it is understandable.

    As an educator for more than 15 years, and in parallel herewith a professional in software with R&D and Software Production Line experience in Nokia, Ericsson, Motorola, CISCO etc., I greet you for your good work.

    I was using your article to illustrate some ways of making summaries for my accountant (yes – she’s educating in MySQL! – why not – better than making software for her all the time !)

    Humble Thanks,
    Digital Age Institute Ltd.
    David Svarrer

  11. Nahrae says:

    You have written this article about 5 years ago, but this is still worth to be recommended strongly!
    It saved much times for me.

    Thanks in advance,

Speak Your Mind

*