Speeding up GROUP BY if you want aproximate results

Doing performance analyzes today I wanted to count how many hits come to the pages which get more than couple of visits per day. We had SQL logs in the database so It was pretty simple query:

Unfortunately this query ran for over half an hour badly overloaded server and I had to kill it in the end.

The reason for slowness was of course huge temporary table was required (there were about 5 million of distinct pages visited during that day) which resulted in on disk temporary table which as we know quite slow.

Of course it would be possible to allocate more memory to the temporary table or switch to filesort method and get result faster.

I however picked another road which is quite helpful in similar cases – I did not need exact result but approximate figure so I could trick MySQL to do group by a hash of the page instead of page itself:

As you can see now it completes in about 30 seconds – quite handy.

Another trick I want to share which I use a lot when I want to analyze data distribution but table is to large is to just limit it to first number of rows:

Again this is not exact value but normally close enough to make a decision.

Share this post

Comments (24)

  • Dale Johnson

    Peter, that’s a great result. But I don’t really understand though why sorting an int is over 120 times faster than sorting the site name.

    March 7, 2008 at 1:20 am
  • Milos

    Was this a typo:
    HAVING cnt>2) pv;
    on the first query.
    And on the second:
    HAVING cnt>3) pv

    Could this be the performance gain? 🙂

    March 7, 2008 at 2:30 am
  • Robert Synnott

    I suppose that it depends on your application, but I couldn’t possibly agree with the second example in the general case. I’ve seen plenty of applications where the size of DB output (especially logs) is EXTREMELY variable, and where logs emitted by the startup code might be very different to those emitted by service X.

    March 7, 2008 at 5:51 am
  • Tobias Petry

    Dale, the problem is String-operations are really slow.
    if you want to compare two strings all chars have to be the same, so it is very hard to compare two Strings. And if you want to compare 5million Strings this will be much more to do.
    Comparing integers is not very hard, crc32 generates an 32-bit integer which can be compared by the CPU directly, Strings have to be compared in the application, there’s no native Method


    March 7, 2008 at 8:16 am
  • peter


    HAVING is checked after temporary table is populated so it does not give a lot of difference. In reality I did bunch of such queries with different HAVING values they all took about same number so I just copied wrong one I guess 🙂

    March 7, 2008 at 9:25 am
  • peter


    If you become picky indeed there are cases when it does not work well, for example if you have auto populated string lookup table – this one tends to get short typical strings in lower values.

    I’m not giving it as general recipe but it is very handy in large number of cases. Especially if you consider you can use much larger number – having sample of 1mil on 50mil row table will be even more close.

    Of course to make it classical Monte Carlo method we would need to pick random values instead of taking first ones but order by rand() would be too slow and other methods of picking large number of random values are not so general.

    BTW – this is the real average page length value so you can see how far off it is. It is worse than expected if we would have completely random sample but probably better than Innodb row count estimate 🙂

    mysql> select avg(length(page)) from performance_log_080306;
    | avg(length(page)) |
    | 71.2611 |
    1 row in set (24.84 sec)

    March 7, 2008 at 10:01 am
  • peter

    Dale, Tobias

    There are couple of things here. Most important is because of long strings it usually spills out to be on disk temporary table. Even though we just store first 255 characters of the URL (technically you’ve got to use longer values if you want full urls) temporary tables are created on disk because of the size – 5 mil rows * 260 bytes per row (MEMORY tables use static storage) is over 1GB which is way too much for temporary table.

    Table with two integers per row is however much smaller and can fit in in memory temporary table.

    Strings comparisons are also slower though it is lesser impact.

    March 7, 2008 at 10:56 am
  • Jay

    If the execution engine is clever enough, then you should be able to clear this up by placing an index over the ‘page’ column; the sub-query should turn into a simple index scan (rather than full table scan, which I assume is happening?)

    If it’s even cleverer, then it might realise that it can ignore any entries where there aren’t 2 successive entries for each ‘page’ value during the index scan.

    March 7, 2008 at 12:58 pm
  • Dale Johnson

    Maybe I’m giving too much credit to MySQL here, but I assume it uses a heap table to hold the internal list of items, then does a sort, and groups the results to get the counts, but if the heap table size gets too large then it would switch over to a temporary table, probably MyISAM.

    If this were the case, then if the field is a varchar(256), for example, then each row would use 256 bytes instead of the 71 bytes, since memory tables need to be fixed. This would make more sense in terms of a multiplier, since the int is 4 bytes, the ratio of int to char becomes 64:1. I guess the swap to disk becomes the real trigger to slow this one down.

    March 7, 2008 at 2:00 pm
  • peter


    Indeed you can add the index but how long it is going to take ? You can also have index when you create the table but this will slow down inserts. This is log file and query is pretty much adhoc – next time I may be looking for average page generation time number of SQL queries or memcache efficiency.

    This is pretty much classical log table which has no indexes to have inserts as fast as possible which of course affects query times.

    March 7, 2008 at 4:46 pm
  • peter


    There are two reasons for temporary tables to be on disk. First it is created on disk if there are any BLOB or TEXT columns because these are not supported by MEMORY storage engine. Second it can be converted to on disk table if it growths too large.

    In this case it was converted.

    Sorting is whole another beast it uses temporary file if sort size is too large. For some queries you get both temporary table and temporary file but these are very different.

    March 7, 2008 at 4:49 pm
  • Arjen Lentz

    Hi Peter… the method is a good hint, but the function used (CRC) is not.
    CRC is not a hash function, it does not provide a good spread, and you will find a very high collision rate. CRCs are intended to identify bit-errors when transporting a block of data; it is *not* intended to compare different blocks of data, and it is in fact completely unsuited for that task.

    March 8, 2008 at 4:34 am
  • peter


    I know in theory CRC32 is not a hash function… but MySQL does not offer any fast integer hash functions…. MD5/SHA1 are crypto hash functions which is completely different task.

    Even though CRC32 is created for the different task it works pretty good as a hash function.

    On what kind of task did you see collision rate which would be far more than for hash function of given bits ?

    Here is example:

    mysql> select count(distinct page) from (select page from performance_log_080306 limit 100000) x;
    | count(distinct page) |
    | 83102 |
    1 row in set (0.65 sec)

    mysql> select count(distinct crc32(lower(page))) from (select page from performance_log_080306 limit 100000) x;
    | count(distinct crc32(lower(page))) |
    | 83101 |
    1 row in set (0.44 sec)

    mysql> select count(distinct md5(lower(page))) from (select page from performance_log_080306 limit 100000) x;
    | count(distinct md5(lower(page))) |
    | 83102 |
    1 row in set (0.72 sec)

    As you see on about 80K objects we have 1 collision for CRC32 which is quite in line with what you should expect for 32bit hash function.

    What kind of practical data did you see high collision rate for.

    P.S I wish MySQL would get some good 64bit Int hash function built in 🙂

    March 8, 2008 at 11:36 am
  • Diego

    Hi Peter,

    Would adding a ORDER BY NULL improve even better the speed of your query?

    March 10, 2008 at 7:33 am
  • peter

    Yes indeed order by null can be help for group by queries. How good depends on the group by execution mode and amount of groups you get in result set.

    March 10, 2008 at 7:05 pm
  • Andrew Sutherland

    I’ve been playing with crc32 and saw some very high collision rates, so I investigated a bit further and discovered GROUP BY was only taking 2^32 as a max. If I cast crc32 to BINARY though, my results worked perfectly.

    mysql> SELECT tag, COUNT(*) AS count, crc32(tag), BINARY crc32(tag)
    -> FROM tags
    -> GROUP BY BINARY crc32(tag)
    -> ORDER BY count DESC
    -> LIMIT 10
    -> ;
    | tag | count | crc32(tag) | BINARY crc32(tag) |
    | spanish | 4576 | 874050868 | 874050868 |
    | vocab | 4103 | 1178479308 | 1178479308 |
    | vocabulary | 2786 | 2147483647 | 2425997691 |
    | french | 2247 | 2147483647 | 2943733342 |
    | english | 2087 | 746783232 | 746783232 |
    | science | 1957 | 1729573288 | 1729573288 |
    | latin | 1411 | 1421320458 | 1421320458 |
    | chapter | 1274 | 2147483647 | 4186027310 |
    | history | 1171 | 666529867 | 666529867 |
    | words | 939 | 1904025228 | 1904025228 |
    10 rows in set (0.32 sec)

    Notice that both chapter, vocabulary, and french come up with the same crc32 values, but different BINARY crc32 values. So if you GROUP BY with regular crc32, those will group together.

    Thanks for the initial tip though, it’s made a 30 second query into a .32 second query.

    April 9, 2008 at 12:49 pm
  • peter


    Something wrong with your query result – note 2147483647 value for CRC32 for few columns – this is MAX value for SIGNED int while CRC32 should be unsigned. Also for BINARY CRC32 you can get above this value. You might have found a bug in MySQL 🙂

    April 9, 2008 at 1:00 pm
  • Andrew Sutherland

    Hmm – I initially thought it was signed, because PHP’s crc32 is signed. That is interesting though, the mysql docs definitely say unsigned. Do you think there’s any danger in using my method? I haven’t found any collisions yet.

    April 9, 2008 at 1:04 pm
  • peter

    CRC32 in PHP is crazy in the sense it gives results on 32bit and 64bit platforms…

    In your case casting as BINARY you should get string which is slower. Typically just CRC32 should work. Your high collision rate was because it was running into the wall with max signed integer. What MySQL version is this ?

    April 9, 2008 at 1:37 pm
  • Andrew Sutherland

    MySQL 5.0.41

    BINARY crc32(tag) is still much faster than tag. Is there a different way I can cast it to work with integers possibly?

    April 9, 2008 at 1:42 pm
  • peter

    Upgrade to latest 5.0

    simply CRC32(tag) should work.

    April 9, 2008 at 2:07 pm
  • Benni

    Hi all,

    I wonder if this procedure would seed up a very slow query in my application: There is a value ‘x’ in each entry in my database between 0.0 and 100.0 (float). I want to divide this value into N “bins” and make mysql count how many entries are in each bin. My approach for e.g. 2 bins would be an INTERVAL query:

    SELECT INTERVAL(x,50,101), count(*) FROM table GROUP BY 1;

    Problem is: mysql can’t use an index on x because of the INTERVAL function. Is there a way I can use BINARY CRC32() to speed things up? Are there better ways to solve this issue than using an INTERVAL?

    Thank you!


    September 4, 2008 at 5:15 am

Comments are closed.