October 24, 2014

Should MySQL Extend GROUP BY Syntax ?

Jan has a good article about finding the row matching some value in the group:

This is one illustration of group by limitations in SQL language which is not offset by any MySQL specific extensions,yet
As you can see if you want to get one row from the group which is sorted some way you have to use ugly hacks. This is because SQL does not have a notion of sorting the data within the groups and in fact ANSI SQL even forbids you to select columns which are not aggregates or part of group by because result in this case is not defined.

What would be quite helpful is to have GROUPORDER (pick the name) clause which defines which element is selected for non aggregate columns, something like

This could give you row which has highest population for which country, and really you would not even need to query MAX(population) if you do not need that because you’re soring by population you would see MAX(Population) in the population column:

would simply return city with highest population for each country.

In fact we had similar problem in Sphinx when we were working on BoardReader project – we wanted to group search results by thread and return only one best matching post from each thread while matches themselve could be sorted by relevance or freshness. The benefit of not following any standards is what you can always add what you want and this is just what we did with sphinx in similar case.

SQL Standard might be good for enterprise applications but huge amount of Web applications do not care – they are stuck enough in MySQL they would have rewrite a lot going to other database anyway while they would love to get features which allow you to do stuff more efficient or more beautiful way.

On other hand what I’m talking about if probably the most commonly requested extension which is already supported by number of vendors – CONNECT BY is still not implemented.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. No says:

    Great post. Lucky for me I can use “ALTER TABLE tbl ORDER BY col;” because my data is nearly static and just one GROUP BY query needs a “GROUPORDER”.

    The issue as I see it thought, is that those queries aren’t actually any GROUP BY queries, but queries where you want the best matching row for a certain WHERE.

    Basically MySQL should handle those queries like this:

    -Sort by Population
    -Scan through the table and
    a. if we already took a row with that city, trash it
    b. take the city

    This can even fully run on indices!

    Something like

    SELECT DISTINCT( Country ) City, Population, Country FROM City ORDER BY Country, Population DESC

    This would pick all all cities with highest population, and sort them by country.

    **Sadly enough DISTINCT doesn’t take any arguments.**

    I never understood why it doesn’t. Makes no sense to me.

  2. Would window functions provide what you want? Oracle has them (http://orafaq.com/node/55). I thought they were part of a standard.

  3. > in fact ANSI SQL even forbids you to select columns
    > which are not aggregates or part of group by

    ANSI SQL doesn’t require not aggregated selected columns to be present in GROUP BY in all cases. Here is some elaboration on this:

    http://www.oreillynet.com/databases/blog/2007/05/debunking_group_by_myths.html

  4. peter says:

    Olexandr,

    Yes this is simplification there are other bits in pieces which are allowed but generally if selected column is not dependent on the group by column, like in this case City is not Dependent on Country it either would not be allowed or you could see surprising results in the city column.

    That exact article gives good examples and a bit more explanations.

  5. peter says:

    No,

    Sorting table is dangerous trap as if you change query a bit, ie to add range which will be using index to access the rows you will get different plan and not result you’re expecting.

    Regarding DISTINCT – applies to the Row rather than the column – it says basically remove duplicate rows.

    If you would make it to take argument and read like remove duplicate values from this column you would have same problem as group by – what values you should see for other columns ? This would be hard to define independently on query execution plan.

  6. peter says:

    Mark,

    Analytical functions are quite helpful but they are generally designed for the different purpose – as you can see they do aggregates but they run on result set not on the grouped result set (which ensures only one row for each group is left)

    Or may be I do not see how you can get the same result set as I’m looking for using these functions without too much complications :)

    It surely would be helpful to get them in MySQL.

  7. David Grant says:

    I was just wishing for this feature the other day.

  8. Hi Peter,

    “In fact ANSI SQL even forbids you to select columns which are not aggregates or part of group by because result in this case is not defined.”

    Well, it depends which version of ANSI SQL you are referring too. SQL92 says:

    “If T is a grouped table, then each in each that references a column of T shall reference a grouping column or be specified within a .”

    Which indeed wants us to always include all non-aggregated columns in the GROUP BY clause.

    But SQL1999 and SQL2003 have a more sophisticated view on the matter:

    “If T is a grouped table, then let G be the set of grouping columns of T. In each contained in , each column reference that references a column of T shall reference some column C that is functionally dependent on G or shall be contained in an aggregated argument of a whose aggregation query is QS.”

    Which just says that you are allowed to include non-aggregated columns in the GROUP BY clause as long as they are completely determined by the GROUP BY clause, in other words, when there is exactly one distinct value per combination of values in the GROUP BY clause.

    (http://rpbouman.blogspot.com/2007/05/debunking-group-by-myths.html)

    kind regards,

    Roland Bouman

  9. mmm, seems all terms enclosed in angle brackets got removed….quotes again:

    First Quote, GROUP BY according to SQL92:

    “If T is a grouped table, then each [column reference] in each [value expression] that references a column of T shall reference a grouping column or be specified within a [set function specification].”

    Second Quote, GROUP BY according to SQL1999 and SQL2003:

    “If T is a grouped table, then let G be the set of grouping columns of T. In each [value expression] contained in [select list], each column reference that references a column of T shall reference some column C that is functionally dependent on G or shall be contained in an aggregated argument of a [set function specification] whose aggregation query is QS.”

  10. peter says:

    Ronald,

    You article was already quoted and I replied :)

    “Functionally Dependent” is really more elaborate explanation of columns in group by – the basics remain the same – you can have the column in the list if there is only one value of that column in the group.

    I’m speaking about different case when there are many column value for each group and we just want to define which one is picked.

    For example in our example grouping by Country we may also include CountryCode because there is only one for each country, however there are multiple Cities for each Country and group by does not allow to pick which one is selected (or even select it at all in more strict databases)

  11. Clem says:

    I believe this is a semi-elegant solution to this ugly problem:

    SELECT City, Country, Population FROM
    (SELECT City, Country, Population
    ORDER BY Country, /* Artificial GROUP BY Country */
    Population DESC /* Artificial MAX of Population within Country */
    ) temptable
    GROUP BY Country; /* Select first row of each country grouping */

Speak Your Mind

*