Missing Data – rows used to generate result setPeter Zaitsev
As Baron writes it is not the number of rows returned by the query but number of rows accessed by the query will most likely be defining query performance. Of course not all row accessed are created equal (such as full table scan row accesses may be much faster than random index lookups row accesses in the same table) but this is very valuable data point to optimize query anyway.
The question of optimizing number of rows accessed is what would be the optimal number indicating query is typically well optimized ? Of course in the perfect world we would like to see rows returned = rows analyzed. though this is only possible to reach for small fraction of queries.
If you’re joining multiple tables or if you have GROUP BY query the number of rows which need to be utilized to create the result set will be larger than number of rows returned.
What I would like to see (for example as another slow query log record) is the number of rows which MySQL used to generate result set. Comparing this number with number of rows query actually accessed we can guess (what is important automatically !) there is potential for optimizing this query.
SELECT GENDER, COUNT(*) FROM PEOPLE GROUP BY GENDER
This query will return only couple of rows but it is clear all rows from the table were used to generate result set and it is not possible to optimize this query directly to only access couple of rows (though this gives us another idea to possibly keep cache table with couple of rows in it)
Now if we have the same table with no indexes and query
SELECT GENDER, COUNT(*) FROM PEOPLE WHERE COUNTRY=’USA’ GROUP BY GENDER
even though full table scan is performed only rows with COUNTRY=’USA’ are used in results set which clearly puts query as optimization candidate.
It is not always possible to optimize queries so the number of rows accessed is same as number of rows used to generate result set – for example any filter which can’t use indexes will make these number different, though such filter will be suboptimal and you may think how to fix the situation.
For example if you have clause like TITLE LIKE “%MYSQL%” you may instead use Full Text Search indexes. If you have WHERE ID%100=0 you can have extra column divisible_by_hundred and keep it indexed. Of course in all cases there is extra cost involved and you should weight if it make sense to optimize such queries. I’m just describing the possibility.
Sounds nice as described right ? Unfortunately it is not that easy to implement it in the general sense as you can’t always track the future of individual row. Queries with temporary result set are especially complicated, for example:
SELECT * FROM (SELECT COUNTRY,COUNT(*) FROM PEOPLE GROUP BY COUNTRY) C WHERE COUNTRY=’USA’
As of MySQL 5.0 MySQL will materialize the subquery in the from clause fully and so “use” all rows in result set while in reality only fraction of them will be needed for end result set as most of the groups are filtered out. There are many similar cases when decision of whenever row is used for result set or not is taken long after it stop existed as individual row which just was accessed.
At the same time I think starting with something and covering basic “single level” queries keeping in account JOINs, GROUP BY, LIMIT would already be helpful for many cases.