Last few days I had a lot of a lot of questions at MySQL Performance Forum as well as from our customers regarding query optimization… which had one thing in common – It is not query which needed to be optimized.
Way too frequently people design schema first and then think how the queries they are looking to be answered can be expressed using that schema… which just does not work as a lot of rows needs to be traversed to get results you’re looking for. Here are couple of examples:
GROUP BY Consider SELECT COUNT(*) cnt, page FROM log GROUP BY page ORDER BY cnt DESC limit 10, the query to display top popular pages. There are very many queries of this type in many web applications which do group by one field sort by another and then use LIMIT to output first few rows. This query is not going to be efficient as it needs to traverse a lot of rows to perform group by so for large number of rows you’ve simply got to use different solutions, such as summary table. With MySQL 5.0 it is also easy to implement one by use of triggers.
From Schema Design standpoint it is best to make sure your most frequent queries do not need GROUP BY. It can be reachable by normalization or other techniques.
COUNT Many people think counting is fast, some have heard MyISAM is instant with COUNT(*) without reading enough details to know it only applies to very special case – when there is no WHERE clause. Quite typically you need to page over results and even if you just show first 10 of them and do it by index COUNT(*) query can cause a lot of troubles. Imagine for example some sort of profile browsing at social networking site – SELECT COUNT(*) FROM profiles where STATE_ID=12 and CITY_ID=234 and GENDER=’M’. If number of rows is not too large using covering index ie (STATE_ID,CITY_ID,GENDER) is great because scanning index is frequently makes COUNT 5-10 times faster, sometimes more. This however still works up to certain number of rows in result set, with results set of 100.000+ it becomes pretty inefficient.
There is surely SQL_CALC_FOUND_ROWS but it can often make query slower than two queries, because it will not be able to use covering index to compute the rows. If the number of combinations is small, like in this example you can use precomputed count values with great success. You can also aggressively cache them if they are high – really who cares if there is 100345 or 100360 matching profiles ?
Yet another approach is to throw away count and run main query with LIMIT 1000 or so, cache results for all pages and if 1000 rows were returned write “1000 or more rows matched”. Many people dislike it because match counts as in Google make it look more professional but in practice for many applications users do not really care.
In some cases you can avoid COUNT completely by storing count values in the tables, in others you can but you at least should try to make them efficient – make sure all where clauses are using indexes, try to have queries index covered and avoid joins as much as possible.
ORDER BY I already wrote about ORDER BY LIMIT Optimization so I will not repeat it here. In the nutshell ORDER BY LIMIT can be very fast… if it is done using Index or it can be slow on large data sets if external filesort is used so make sure to check explain for this one. So frequently I see dynamic query generation when some query variants fall back to use filesort, and it does not take many of such queries to overload server.
Also watch high limit values and “post filtering” with Index based ORDER BY, these also can cause a lot of troubles.
So designing schema make sure for queries you’re going to run ORDER BY .. LIMIT can be performed using the index, for example simple JOIN is often showstopper for this.