Finding bad queries is a big part of optimization. A scientific optimization process can be simplified to “can anything be improved for less than it costs not to improve it? – if not, we’re done.” In databases, we care most about the work the database is doing. That is, queries. There are other things we care about, but not as much as queries. The key here is that the question “can anything be improved” requires a process for identifying queries that can be improved. The easiest way to do that is to look for things that indicate sub-optimality.
All of the above might seem to be obvious, but I wanted to frame the rest of this post with a logical starting point. Next let’s see what kinds of things about queries might indicate that they aren’t optimal.
A responsible approach to a task such as finding bad queries begins with a definition of badness that we can be pretty sure is complete. I am interested in knowing whether I’m missing anything, but I believe the definition of badness in this article is fairly complete. (Post comments if I’m wrong.)
Here is my definition:
This last three-part definition is where I can’t be sure that I’ve got a complete definition.
Now, given the definition above, how do we find these queries in some input such as a log of queries? It shouldn’t surprise you that there is a working implementation of most of this in a tool, Maatkit’s mk-query-digest. Here’s how:
The missing piece, part 3 — finding queries that block other queries — is best done by observing clusters in query end times. If there is a sudden burst of queries completing, the first one to complete is likely — but not certain — to have blocked the others from running. This could be due to internal locks, I/O starvation, or what have you.
In all of the above cases, the additional information in Percona Server’s query execution logs is absolutely vital. Without data to analyze, you’re left to guess at what the server is doing. For example, in Percona Server we have microsecond-precision timestamps for log events. Edit: note that I’m referring to the timestamp when the query executed, not merely the duration the query executed for, which has been available in MySQL for years. This is necessary for determining whether a group of queries completed in a cluster. One-second granularity (what you get in standard MySQL) might be enough precision for finding some cases of queries blocking others, but it’s nowhere near precise enough to find and diagnose the kinds of sub-second performance stalls we are interested in. Without data, you can only guess; and guessing ain’t optimizing.
We are considering improving mk-query-digest do to this missing analytical step. I have done it many times with simple awk scripts and various statistical measures. What do you think? Is there more to be done? Is there a better approach? Are we missing some notion of badness, or some heuristic for detecting bad queries? Let me know by posting in the comments.