This is probably well known issue for everyone having some MySQL experience or experience with any other SQL database. Still I see this problem in many production applications so it is worth to mention it, especially as it is connected to MySQL Performance. No it might not affect MySQL Performance per say but it limits our ability tune MySQL Performance as queries become fragile – changing execution plan leads to different query results.
So what I’m speaking about ?
Lets say you have query something like SELECT A,B,MAX(C) FROM TBL GROUP BY A – what would you expect from such query ? Column A is part of group by so its value is same for whole group. MAX(C) is also particular value for each group, while B is not part of GROUP BY and may well correspond to different values. Which one are you looking to get ? In fact this is where results becomes non-deterministic and fragile – any B from the group could be returned, while you might be expecting some particular one.
Many other DBMS and ANSI SQL Simply forbids such queries, MySQL is more permissive and will return you first B it runs into. This value however can be dependent on selected execution plan and change if you add some indexes. Also plan may change as database statistics changes and your query may become broken without any reason.
My Advice is to stay away from such group by statements, however as usually there are exceptions and such GROUP BY statements may be faster than alternatives. Assume for example we have non-normalized table (or join result) where two columns have 1-1 relationship. For example we have user id and login which are both unique. In this case running
SELECT id,login,max(login_time) FROM log GROUP BY id,login May be replaced by SELECT id,login,max(login_time) FROM log GROUP BY id which may be faster especially if GROUP BY is executed via sorting.
In this case even though we do not know which value from the group will be returned we do not care as it will be same for all rows.