Wrong GROUP BY makes your queries fragile

This is probably well known issue for everyone having some MySQL experience or experience with any other SQL database. Still I see this problem in many production applications so it is worth to mention it, especially as it is connected to MySQL Performance. No it might not affect MySQL Performance per say but it limits our ability tune MySQL Performance as queries become fragile – changing execution plan leads to different query results.

So what I’m speaking about ?

Lets say you have query something like SELECT A,B,MAX(C) FROM TBL GROUP BY A – what would you expect from such query ? Column A is part of group by so its value is same for whole group. MAX(C) is also particular value for each group, while B is not part of GROUP BY and may well correspond to different values. Which one are you looking to get ? In fact this is where results becomes non-deterministic and fragile – any B from the group could be returned, while you might be expecting some particular one.

Many other DBMS and ANSI SQL Simply forbids such queries, MySQL is more permissive and will return you first B it runs into. This value however can be dependent on selected execution plan and change if you add some indexes. Also plan may change as database statistics changes and your query may become broken without any reason.

My Advice is to stay away from such group by statements, however as usually there are exceptions and such GROUP BY statements may be faster than alternatives. Assume for example we have non-normalized table (or join result) where two columns have 1-1 relationship. For example we have user id and login which are both unique. In this case running

SELECT id,login,max(login_time) FROM log GROUP BY id,login May be replaced by SELECT id,login,max(login_time) FROM log GROUP BY id which may be faster especially if GROUP BY is executed via sorting.

In this case even though we do not know which value from the group will be returned we do not care as it will be same for all rows.

9 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Lukas

19 years ago

I recommend people enable the sql mode to prevent these kinds of queries:

ONLY_FULL_GROUP_BY

Do not allow queries for which the SELECT list refers to non-aggregated columns that are not named in the GROUP BY clause. The following query is invalid with this mode enabled because address is not named in the GROUP BY clause:

SELECT name, address, MAX(age) FROM t GROUP BY name;

As of MySQL 5.0.23, this mode also restricts references to non-aggregated columns in the HAVING clause that are not named in the GROUP BY clause.

Sheeri

19 years ago

especially if GROUP BY is executed via sorting.

Um, in your example, you say that id and login are unique — I’m assuming that’s enforced with UNIQUE KEY, right? The internals manual states that GROUP BY uses sorting when there is no index…..so while you are correct in the general case, that GROUP BY may be faster if it does not have to sort, your example does not fit your point.

The query in your example is a bad one, but not because of the optimizer’s behavior. The query is poor (and I’m sure it’s straight from clients who need your help!) because you do not NEED to GROUP BY 2 elements if they have a 1:1 relationship. Folks need to pick one element in that case. I’m sure other DBMS’ don’t handle this case “right” either.

This can be summed up with a “Best practice” — “Do not GROUP BY more than you need to.”

see
http://dev.mysql.com/doc/internals/en/optimizer.html
under “GROUP BY”

Peter Zaitsev

Author

19 years ago

Sheeri,

In my example I’m saying login and id have one to one relationship in schema. But they are not unique in this example – This is “log” table which for example logs all accesses by users including their id and login.

I also do not understand you further comment – I specially show this as a case when columns can be accessed in group by even though they are not being grouped by.

Really it does not have to be one to one relationship, it is enough to have same value for particular value of the column in group by. For example if we have City and County and each City corresponds to only one County we could have something like:

SELECT county,city, avg(age) FROM people GROUP BY city;

Not correct according to the standard but should not be fragile as there is only one county per city so it does not matter which one will be picked.

cemozdemir

19 years ago

CREATE TABLE tblmakale (makaleid tinyint(3) unsigned NOT NULL default ‘0’, uid tinyint(3) unsigned default ‘0’, makaleicerik varchar(255) default NULL, puan int(3) unsigned default NULL, PRIMARY KEY (makaleid), KEY NewIndex (puan,makaleid,makaleicerik,uid)) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO tblmakale VALUES(“3”, “2”, “gjgh”, “0”);
INSERT INTO tblmakale VALUES(“12”, “5”, “545645645”, “1”);
INSERT INTO tblmakale VALUES(“2”, “3”, “ghjghjghjghj”, “2”);
INSERT INTO tblmakale VALUES(“4”, “1”, “gg”, “3”);
INSERT INTO tblmakale VALUES(“9”, “7”, “thgjnm”, “4”);
INSERT INTO tblmakale VALUES(“0”, “2”, “fdsd”, “5”);
INSERT INTO tblmakale VALUES(“8”, “6”, “rtret”, “5”);
INSERT INTO tblmakale VALUES(“11”, “6”, “ghnb”, “6”);
INSERT INTO tblmakale VALUES(“10”, “5”, “ewrt65”, “7”);
INSERT INTO tblmakale VALUES(“1”, “5”, “fgjhgj”, “10”);
INSERT INTO tblmakale VALUES(“5”, “0”, “g”, “10”);
INSERT INTO tblmakale VALUES(“13”, “3”, “12323423”, “11”);
INSERT INTO tblmakale VALUES(“14”, “2”, “reggfjmnljÃ§pÄ±”, “12”);
INSERT INTO tblmakale VALUES(“16”, “0”, “iiiiiiiiiiiiii”, “18”);
INSERT INTO tblmakale VALUES(“6”, “0”, “hhh”, “20”);
INSERT INTO tblmakale VALUES(“15”, “2”, “ÅŸliÅŸlilÅŸilÅŸi”, “21”);
INSERT INTO tblmakale VALUES(“7”, “3”, “jkll”, “22”);

select a.uid, a.makaleicerik, a.puan, a.makaleid from tblmakale a
inner join tblmakale b on a.uid = b.uid group by a.uid, a.makaleicerik, a.puan having a.puan = max(b.puan)

jawaharlal

18 years ago

select sum of marks for all six columns and display the count more than one row.
my query is like following

select max(marks) from xyz where class=’abc’ group by sname;

marcus

17 years ago

SELECT name, address, MAX(age) FROM t GROUP BY name;

isnt this sql statement wrong? if you are grouping by name, you get a subset of address and age. on age you are applying an aggregate operator (MAX), but which item of the subset is seleected from address?

if table name, address, age contains

meier, street a, 23
meier, street b, 26

the group-by subset should be meier, (street a, street b), (23, 26)

max means

meier, (???), 26

so i’d expect sorta subset instead of questionmark. which aggregate operator is applied here? and, isnt it fatal for development to select apples and max(price) while grouping by peaches?

regards,
marcus

PesAfTeftReem

19 years ago

Bloomberg Drug News anxiety medication buspar BuSpar is available as tablets for oral administration containing 5 mg, 10 mg, 15 mg, or 30 mg of buspirone hydrochloride. http://www.kindercareschool.com/ – buy buspar online

Abduraoof

9 years ago

Using group by we get error
field is not in list
eg : (select name form user group by name)
name is not in list

Abduraoof

9 years ago

Using group by we get error
field is not in list
eg : (select name from user group by name)
name is not in list

MySQL 5.7
Support

Compare Percona to Leading Database Solutions

Software
Downloads

Valkey Contribution

Product Documentation

Resource Hub

Why Percona for MongoDB?

Why Percona for PostgreSQL?

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Wrong GROUP BY makes your queries fragile

Related Blog Articles

RECOMMENDED ARTICLES

Rate Limiting Strategies with Valkey/Redis

A Failing Unit Test, a Mysterious TCMalloc Misconfiguration, and a 60% Performance Gain in Docker

Semantic Caching for LLM Apps: Reduce Costs by 40-80% and Speed up by 250x

MOST POPULAR ARTICLES

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL Performance Tuning: Maximizing Database Efficiency and Speed

The Ultimate Guide to Open Source Databases

MySQL 5.7 Support

Compare Percona to Leading Database Solutions

Software Downloads

Valkey Contribution

Product Documentation

Resource Hub

Why Percona for MongoDB?

Why Percona for PostgreSQL?

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Wrong GROUP BY makes your queries fragile

About the Author

Share This Post!

Stay up to date with the Percona Blog

Related Blog Articles

RECOMMENDED ARTICLES

Rate Limiting Strategies with Valkey/Redis

A Failing Unit Test, a Mysterious TCMalloc Misconfiguration, and a 60% Performance Gain in Docker

Semantic Caching for LLM Apps: Reduce Costs by 40-80% and Speed up by 250x

MOST POPULAR ARTICLES

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL Performance Tuning: Maximizing Database Efficiency and Speed

The Ultimate Guide to Open Source Databases

MySQL 5.7
Support

Software
Downloads