As Baron writes, it is not the number of rows returned by a query but the number of rows it accesses that will most likely define query performance. Of course, not all row accesses are created equal (for example, rows read by a full table scan may be much faster to access than rows fetched by random index lookups in the same table), but it is a very valuable data point for optimizing queries anyway.

The question, when optimizing the number of rows accessed, is what the optimal number would be, i.e. the number indicating the query is well optimized. In a perfect world we would like to see rows returned = rows accessed, though this is only achievable for a small fraction of queries.

If you’re joining multiple tables, or if you have a GROUP BY query, the number of rows which need to be used to create the result set will be larger than the number of rows returned.

What I would like to see (for example, as another slow query log field) is the number of rows which MySQL actually used to generate the result set. Comparing this number with the number of rows the query accessed, we could guess (and, importantly, guess automatically!) that there is potential for optimizing the query.
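For reference, the slow query log already records the other half of this comparison in its Rows_sent and Rows_examined fields. An entry (the values here are purely illustrative) looks like:

# Query_time: 12 Lock_time: 0 Rows_sent: 2 Rows_examined: 1000000
SELECT GENDER, COUNT(*) FROM PEOPLE GROUP BY GENDER

What is missing is the third number sitting between Rows_sent and Rows_examined: the rows actually used to build the result.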

For example:

SELECT GENDER, COUNT(*) FROM PEOPLE GROUP BY GENDER

This query will return only a couple of rows, but clearly all rows from the table were used to generate the result set, and it is not possible to optimize this query directly so it accesses only a couple of rows (though this gives us another idea: keep a cache table with just a couple of rows in it).
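As a sketch of that cache-table idea (the table and column names are my assumptions, and something, such as triggers or a periodic job, has to keep it in sync with PEOPLE):

CREATE TABLE PEOPLE_GENDER_COUNTS (
  GENDER CHAR(1) NOT NULL PRIMARY KEY,
  CNT INT UNSIGNED NOT NULL
)

-- populate once; refresh periodically or maintain with triggers
INSERT INTO PEOPLE_GENDER_COUNTS SELECT GENDER, COUNT(*) FROM PEOPLE GROUP BY GENDER

-- the report now reads a couple of rows instead of scanning the whole table
SELECT GENDER, CNT FROM PEOPLE_GENDER_COUNTS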

Now, if we have the same table with no indexes and the query

SELECT GENDER, COUNT(*) FROM PEOPLE WHERE COUNTRY='USA' GROUP BY GENDER

even though a full table scan is performed, only the rows with COUNTRY='USA' are used in the result set, which clearly marks the query as an optimization candidate.
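The obvious fix here is an index, for example a composite index on (COUNTRY, GENDER), which lets both the WHERE clause and the GROUP BY be resolved from the index (the index name is my invention):

ALTER TABLE PEOPLE ADD INDEX idx_country_gender (COUNTRY, GENDER)

With it in place, the rows accessed are essentially the rows used.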

It is not always possible to optimize queries so that the number of rows accessed is the same as the number of rows used to generate the result set. For example, any filter which can't use indexes will make these numbers differ; such a filter will be suboptimal, though, and you may want to think about how to fix the situation.

For example, if you have a clause like TITLE LIKE '%MYSQL%', you may instead use full text search indexes. If you have WHERE ID%100=0, you can add an extra column, divisible_by_hundred, and keep it indexed. Of course, in all such cases there is extra cost involved, and you should weigh whether it makes sense to optimize such queries. I'm just describing the possibility.
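A sketch of both workarounds (the table and index names are my assumptions; note that in MySQL 5.0 FULLTEXT indexes require MyISAM, and the helper column has to be maintained by the application or a trigger):

-- full text search instead of TITLE LIKE '%MYSQL%'
ALTER TABLE ARTICLES ADD FULLTEXT INDEX ft_title (TITLE)
SELECT * FROM ARTICLES WHERE MATCH(TITLE) AGAINST ('MYSQL')

-- indexed helper column instead of WHERE ID%100=0
ALTER TABLE T ADD COLUMN divisible_by_hundred TINYINT NOT NULL DEFAULT 0, ADD INDEX (divisible_by_hundred)
UPDATE T SET divisible_by_hundred = (ID % 100 = 0)
SELECT * FROM T WHERE divisible_by_hundred = 1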

Sounds nice as described, right? Unfortunately, it is not that easy to implement in the general case, as you can't always track the fate of an individual row. Queries with temporary result sets are especially complicated, for example:

SELECT * FROM (SELECT COUNTRY, COUNT(*) FROM PEOPLE GROUP BY COUNTRY) C WHERE COUNTRY='USA'

As of MySQL 5.0, MySQL will fully materialize the subquery in the FROM clause and so “use” all its rows for the result set, while in reality only a fraction of them are needed for the final result, as most of the groups are filtered out. There are many similar cases where the decision of whether a row is used for the result set is made long after it has stopped existing as the individual row that was just accessed.
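In this particular case a human (or a smarter optimizer) can push the condition inside the subquery, so grouping happens only over the needed rows, but the general row-tracking problem remains:

SELECT COUNTRY, COUNT(*) FROM PEOPLE WHERE COUNTRY='USA' GROUP BY COUNTRY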

At the same time, I think starting with something that covers basic “single level” queries, taking JOINs, GROUP BY, and LIMIT into account, would already be helpful in many cases.

Comments
Baron Schwartz

We could add up the Handler_XXX counts and put the total into the slow query log patch, or put them all into the slow query log output individually.

Instead of resetting the Handler_XXX counters to 0 before each query, we could keep a copy of them: record the current values before each query, then take the difference after the query is done.
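For reference, this is roughly what can already be done by hand for a single session; a minimal illustration:

FLUSH STATUS
SELECT GENDER, COUNT(*) FROM PEOPLE WHERE COUNTRY='USA' GROUP BY GENDER
SHOW SESSION STATUS LIKE 'Handler_read%'

A Handler_read_rnd_next value close to the table size gives away the full table scan.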

Mark Callaghan

For the upcoming Google patch, SHOW USER_STATISTICS includes CPU time, bytes sent, bytes received, binlog bytes written, and rows read per account. Alas, we have yet to add disk reads/writes per user; they are not as easy to count.

Baron Schwartz

Peter, I think there need to be at least 2 new status variables: Filtered_by_where_clause and Filtered_by_having_clause.

Mark Callaghan

We use Linux 2.6, and clock_gettime from librt provides the CPU time. That library isn’t used by MySQL by default. Because of all the difficulties of accounting for IO that you mention, I am not in a rush to count it. We don’t use O_DIRECT, so many reads for InnoDB may be served only from the OS buffer cache.

Mark Callaghan

We were using 32-bit builds until recently. Even with 64-bit builds, buffered IO has advantages. Unless you use the Google patch, dirty page writes can be slow because there is one background thread to handle them. One thread is sufficient for buffered IO because most writes to the buffer cache are fast.

Mark Callaghan

Some of us don’t get to use fancy stuff like that. I also don’t get to use a SAN. Is the problem worse there? I would be unhappy if I bought an expensive SAN and then watched one thread handle most writes to it.

Mark Callaghan

You can use more than one writer thread if you apply the Google patch. I doubt I am write-bound like Kevin Burton, but there are spikes, and I want concurrent writes for that case.