How to Identify Bad Queries in MySQL

PREVIOUS POST
NEXT POST

Finding bad queries is a big part of optimization. A scientific optimization process can be simplified to “can anything be improved for less than it costs not to improve it? – if not, we’re done.” In databases, we care most about the work the database is doing. That is, queries. There are other things we care about, but not as much as queries. The key here is that the question “can anything be improved” requires a process for identifying queries that can be improved. The easiest way to do that is to look for things that indicate sub-optimality.

All of the above might seem to be obvious, but I wanted to frame the rest of this post with a logical starting point. Next let’s see what kinds of things about queries might indicate that they aren’t optimal.

A responsible approach to a task such as finding bad queries begins with a definition of badness that we can be pretty sure is complete. I am interested in knowing whether I’m missing anything, but I believe the definition of badness in this article is fairly complete. (Post comments if I’m wrong.)

Here is my definition:

  • Queries are bad either singly (an individual query is bad) or in aggregate (a group or class of queries is bad). I think this is a complete definition because the only other kind of number besides “one” and “many” is “zero”, and a nonexistent query can’t be bad.
  • Queries, or groups of queries, are bad because
    1. they are slow and provide a bad user experience, or
    2. they add too much load to the system, or
    3. they block other queries from running

This last three-part definition is where I can’t be sure that I’ve got a complete definition.

Now, given the definition above, how do we find these queries in some input such as a log of queries? It shouldn’t surprise you that there is a working implementation of most of this in a tool, Maatkit’s mk-query-digest. Here’s how:

  1. It finds queries (or groups of queries) that are slow and provide a bad user experience by the parameters to the –outliers option. If a query is slower than X more than N times, it’s bad. You can look at factors such as the variance-to-mean ratio, the Apdex score, and the explain sparkline to see whether the query is likely to be possible to optimize.
  2. It finds queries or groups of queries that add too much load to the system by grouping queries into classes and summing the execution time together. Then it prints out a profile of the classes, so you can see which ones contribute what fraction of total execution time. You can use Amdahl’s Law to decide which of those are candidates for improvement. (The V/M ratio, Apdex score, and the explain sparkline are useful here too.)
  3. mk-query-digest doesn’t do part 3 — finding queries that block other queries — yet.

The missing piece, part 3 — finding queries that block other queries — is best done by observing clusters in query end times. If there is a sudden burst of queries completing, the first one to complete is likely — but not certain — to have blocked the others from running. This could be due to internal locks, I/O starvation, or what have you.

In all of the above cases, the additional information in Percona Server’s query execution logs is absolutely vital. Without data to analyze, you’re left to guess at what the server is doing. For example, in Percona Server we have microsecond-precision timestamps for log events. Edit: note that I’m referring to the timestamp when the query executed, not merely the duration the query executed for, which has been available in MySQL for years. This is necessary for determining whether a group of queries completed in a cluster. One-second granularity (what you get in standard MySQL) might be enough precision for finding some cases of queries blocking others, but it’s nowhere near precise enough to find and diagnose the kinds of sub-second performance stalls we are interested in. Without data, you can only guess; and guessing ain’t optimizing.

We are considering improving mk-query-digest do to this missing analytical step. I have done it many times with simple awk scripts and various statistical measures. What do you think? Is there more to be done? Is there a better approach? Are we missing some notion of badness, or some heuristic for detecting bad queries? Let me know by posting in the comments.

PREVIOUS POST
NEXT POST

Comments

  1. says

    I think there’s a fairly obvious omission from the “what makes a bad query” list – they cause errors or warnings.

    Another would be “they run too often” – running in milliseconds is all well and good, but if you execute that same statement 1000 times a second, then that’s bad (especially with the same literals). Perhaps this falls under #2 as well though.

    As for finding blocked queries, there is no longer any need to “guess”, at least when using InnoDB and talking about locking – that is what the INFORMATION_SCHEMA.INNODB_LOCK_WAITS table is for:

    http://dev.mysql.com/doc/refman/5.5/en/innodb-lock-waits-table.html
    http://dev.mysql.com/doc/refman/5.5/en/innodb-information-schema-transactions.html#innodb-information-schema-examples

    Finally, the standard MySQL has had microsecond slow query log (not table) precision since 5.1.21, which was released in August 2007, let’s try not to turn that in to yet another “MySQL Myth”.. ;)

    As a side note, is there any reason the Percona patch for that still exists? Does it do something different?

  2. says

    Mark,

    Thanks. I think we are talking about a little bit different things in a couple of these cases. Finding queries that block others often needs to happen after the fact, so we need to use a log for that. And it isn’t only the query response time that Percona’s server logs with microseconds, but also the time of occurrence of the query, i.e. the time of day. (In the slow query log, this is the end time, not the start time.) I’m sorry that this looked confusing and I don’t want to look like I’m promulgating a myth, so I will edit the post to clarify.

    The Percona patch does an incredible amount of very useful/indispensable stuff besides just microsecond precision. You should read our docs :)

  3. says

    Well, I’m not so sure it *requires* a log – it simply requires … monitoring it.. :) The monitor can output a log if it wants to, or record it in some database, or detect it on the fly and alert about it etc.

    That’s not to say it’s not of use in the slow log too, and that a log may be the finer grained option to use when sampling doesn’t cut it for you, I was just pointing out that there are other, more precise, ways of finding sessions that are blocked now.

    I hadn’t realized it was the actual timestamp rather than runtime for microsecond precision. My bad, and thanks for the clarification. :)

  4. says

    Mark, yes I agree that there are better ways to find blocked sessions inside InnoDB. But there are also other reasons that a session might block that won’t show up in the lock-waits table, such as table locks, resource starvation, etc. I know that some of these can also be found in the performance schema tables. But I think that with the addition of good tools for finding perpetrators and victims in a generic way, we are likely to find and solve many problems we would not see merely by looking in a place where the light shines brightly. It’s easy to fall in love with the data we have — we also have to find ways to get the data we don’t have :)

  5. Willem says

    “Finding bad queries is a big part of optimization.” – I disagree, this is the easy part. Fighting DBmodel issue’s ( e.g. a generic model) is much harder.

    “A scientific optimization process can be simplified to “can anything be improved for less than it costs not to improve it? – if not, we’re done.” ” – I disagree, the costs cannot be determined easily. You can get away with a slow system, but when customers start to run away, you’re too late for tuning.

    “In databases, we care most about the work the database is doing. That is, queries. There are other things we care about, but not as much as queries.” – I disagree, storage, OS and RDBMS settings (memory, isolation level etc) need to be taken into account as well, otherwise your analyses is incomplete.

    “The key here is that the question “can anything be improved” requires a process for identifying queries that can be improved. The easiest way to do that is to look for things that indicate sub-optimality.” – I disagree. In my systems, there are hundreds of SQL statements which are not optimal ( e.g. full tablescan, not using indexes). The trick is not to look for sub-optimality, but which SQL statement have a big impact on the database ( responsetime times frequency, yes, it’s the Amdahl law you mentioned), and then check if this SQL can be tuned.

    You definition:

    “Queries, or groups of queries, are bad because

    1. they are slow and provide a bad user experience, or
    2. they add too much load to the system, or
    3. they block other queries from running ”

    is useless to me. What is your definition of slow, bad user experience, too much load? I have done many performance tuning jobs, and my first task is to quantify the problems: Figures vs gut-feelings. Maatkit’s mk-query-digest is a good tool for this.

    Can you explain what you mean with : “In all of the above cases, the additional information in Percona Server’s query execution logs is absolutely vital. ” ? For what? I have tuned many queries with help of that maatkit tool. What’s the gain?

    Apart from your call for help on the functionality of mk-query-digest, i don’t see where this article is going, sorry..

    About your question:
    “Are we missing some notion of badness, or some heuristic for detecting bad queries? Let me know by posting in the comments.”
    (1) I look at the cardinality (number of records) visited compared to the cardinality send. Have 2 million records been checked for sending 3 records to the client? I make an efficiency % of this.
    (2) Queries that return a lot of records. Is that user really asking 2 million records, or is there some clientside filtering being done? If so, filtering is best done at DB side.

  6. says

    “Amen brother”.

    And as you mention performance schema – “we’re working on it”, and allow me to cast out a call for feedback too.. :)

    We recently opened up the following Worklogs on the forge site:

    PERFORMANCE_SCHEMA Instrumenting Table IO – http://forge.mysql.com/worklog/task.php?id=4895
    PERFORMANCE SCHEMA TABLE / INDEX IO SUMMARY – http://forge.mysql.com/worklog/task.php?id=5379
    PERFORMANCE_SCHEMA Instrumenting Table Locks – http://forge.mysql.com/worklog/task.php?id=5371
    PERFORMANCE SCHEMA TABLE LOCK WAIT SUMMARY – http://forge.mysql.com/worklog/task.php?id=5420
    PERFORMANCE_SCHEMA Instrumenting Net IO – http://forge.mysql.com/worklog/task.php?id=5420
    PERFORMANCE_SCHEMA Instrumenting Stages – http://forge.mysql.com/worklog/task.php?id=4813
    PERFORMANCE_SCHEMA statements – http://forge.mysql.com/worklog/task.php?id=2515

    All of the above take the current performance_schema (which does file IO and synchronization point instrumentation) literally all the way up to the statement the level, along through “stages” (think, SHOW PROFILE), allowing a fully nested tree of events showing where time and resources (bytes read/written generally) are taken.

    We’d welcome any and all feedback.

  7. Patrick Casey says

    My approach to tuning is pretty similar to yours, first step is always identifying the bottlenecks then figuring out if there’s anything that can be done. I don’t personally run mk_digest since I have app layer logging I can enable on the main deployment I care about right now, but the output is virtually identical.

    There are a couple of red flags I do look for when trolling the logs (and this is just my punch list, not sure it should be generalized into somebody else’s):

    1) Queries with with a high degree of variability in the response time.

    If I run the same query 5 times and get response times of 100ms, 100ms, 14 seconds, 110ms, 98ms, that tells me that something *weird* was up on run number three (the 14 second one). Usually its a lock, or data not in buffer or something, but those can be symptoms of bigger problems.

    2) Clusters of queries taking abnormally long and terminating within a few ms of each other.

    This usually means a lock (back on myisam it almost always meant a lock), but there can be other causes.

    3) Periods of time when all queries show poor performance.

    Can mean either server load is up (backups maybe), a single evil query is nuking the IO subsystem (somebody did an unindexed sort on a 20G log file), or innodb is doing a panic flush or something similar.

    In my experience my users want everything to be fast, but they’ll tolerate a certain set of transactions taking a long time (within reason), especially if you give them some sort of feedback that the operation they’re requesting is probably going to take a long time. What they don’t tolerate though is unpredictable performance. If there’s a link they can click that usually takes 10ms. In my experience the risk of changing things to add that optimization is greater than the resource cost of fetching that data inefficiently.

    Its a balancing act of course, but not everything I find that *can* be optimized should be imho. Its all about the front end, if that performs within spec and the back end is holding up, then poking the system because I can make one minor part ever so slightly more efficient is a bad decision.

  8. says

    Willem, those are good heuristics, thanks.

    I think I can respond to the rest of your comment by an anecdotal story. I saw one tuning project go on for over three months because nobody gathered the correct diagnostic data, properly scoped, and complete enough for analysis. Theories varied from “it’s the database server” to “it’s the RAID controller” to memcached, the load balancer, the switch, the network, the connection pool, lighttpd configuration… it was insanity. After I gathered the right data, I was able to determine that there was a bug in InnoDB and Yasufumi fixed it. That took a few hours total. Gathering and analyzing the data is 95% or more of the work; if you have the right data and you analyze it correctly, the answer is almost always obvious, although implementing the fix might be a lot of work.

    I wrote mk-query-digest, so I certainly understand your points, and they are good — my intent with this article was to show where mk-query-digest is doing an incomplete job and ask how it can be better. It already makes hard things easy, as you mentioned. I want it to do more of that. Perhaps if it weren’t so good, you wouldn’t say that finding bad queries is “the easy part.”

  9. Willem says

    “Perhaps if it weren’t so good, you wouldn’t say that finding bad queries is “the easy part.””

    I can agree with that, and mk-query-digest is an excellent tool. I do not know of any comparable tool for mysql, but this one gives me exactly the info i need on the SQL side. For about your question about what’s missing on maatkit, I’ll tell you how i tune, and what i’m missing. Mind you that i am new to mysql, so i just could be missing sometime from the manual:

    For me, there are 3 components in the system :

    (1) the RDBMS (in this case, Mysql) – think about memory settings, parameters, locking contention etc

    (2) The application – for us DBA’s that’s just a stack of SQL statements, ok? So accesspaths etc.

    (3) The OS and Storage – In the old days we had local disks, be we now have the pleasure of shared storage. If one user copies a multiGIG file over the network, this disturbance in the force is noticed in the slowlogs.

    For SQL tuning, I check the top-sql as query-digest shows. Check the plan and indexes involved. I want to check how much resources this SQL statement takes, i am now trying to get it to work by running show session status before and after the SQL, but i need to analyse the output more. Sort of autotrace command in Oracle.
    This is to me vital if you want to make the cross to one of the other components – how much response came from the i/o system? How much was the waiting compared to the work? ( that well known curve or responsetime = servicetime + waiting time). Instrumentation of the RDBMS here is key. So a good resource profile for an SQL statement is my nr one.

    What would be a good help is to have a tool that can correctly rewrite the SQL statement, make several rewrites and runs these against the database and check the runtime and/or the resource consumption. ( like QUEST SQL optimizer).
    That’s a bit brute-force tuning, but a great help when i am stuck: this happens when the query is overly complex, or the query has a small responsetime already but needs to be cut down further because it’s executed a zillion times.
    Of course, this is not a diagnostic tool but a tool for finding a solution when the problem is identified as a rogue SQL. Thats my wish nr 2 for the fairy.

    ” Gathering and analyzing the data is 95% or more of the work; if you have the right data and you analyze it correctly, the answer is almost always obvious, although implementing the fix might be a lot of work.”
    Thats my mantra i broadcast when people want me to solve a performance problem: “I do NOT look for possible solutions of a performance problem. Solutions come to me, when my diagnosis is ready”. I use this mantra to stop the everlasting : ” “it’s the database server” to “it’s the RAID controller” to memcached, the load balancer, the switch, the network, the connection pool, lighttpd configuration”.

    Regards,
    Willem

  10. says

    Willem, you asked:

    “how much response came from the i/o system? How much was the waiting compared to the work?”

    Your wish is granted: the Percona server writes that information into the query log, and mk-query-digest auto-magically consumes it and adds it to the query report. That’s why I called our enhancements “vital” :-)

  11. Franco Corbelli says

    It’s not so easy to say, generally speaking, without considering the dataset

    Consider this query (bad or good?)

    SELECT SQL_NO_CACHE *
    FROM ps_order_history p1
    WHERE (date_add=(SELECT MAX(p2.date_add)
    FROM ps_order_history p2
    WHERE p1.id_order=p2.id_order))
    and (id_order_state=6)

    running here (about 45s., VERY bad)
    http://www.franzauker.com/pluto.7z

    and here (0.015s, VERY good), about 3000 times faster (!)
    http://www.franzauker.com/pluto2.7z

    “Gathering and analyzing the data is 95% or more of the work; if you have the
    right data and you analyze it correctly, the answer is almost always obvious”
    ;)

    Regards,
    Franco

  12. Franco Corbelli says

    This is obvious for me too, which is why I get paid to do this for last 24 years ;)

    But, in the context of the thread,

    “…I am interested in knowing whether I’m *missing* anything, but I believe the definition of badness in this article is *fairly complete*. (Post comments if I’m wrong.)”

    I think that your’s “bad\good-query” definitions lacks the assumption, “… on a certain dataset…”, here the trivial example of a “good\bad” query on different tuple.

    Kind regards,
    Franco

  13. says

    Franco, I see what you mean now, thanks for clarifying. I actually was considering all queries in the context of their execution on a given dataset. The query analysis I want to do is analysis on a log that results from executing the queries. So I think the answer is that on one of the servers your query would be considered bad, and on the other it would probably be considered okay depending on how often it runs.

  14. Gleb Deykalo says

    Hi Baron,

    Could you please help me with some query execution plan? It looks pretty simple at first glance, but…

    The table is very simple:

    CREATE TABLE test_idx (
    a int(11) DEFAULT NULL,
    b int(11) DEFAULT NULL,
    c int(11) DEFAULT NULL,
    KEY idx_cover_all (a,b,c)
    ) ENGINE=InnoDB;

    Test data:
    insert into test_idx values (1, 1, 1), (2, 2, 2), (3, 3, 3), (1, 2, 3), (3, 2, 1), (2, 3, 1), (2, 1, 3), (1, 3, 2), (1, 2, 2);

    When I ask for an exact row, it uses index and join type is REF:

    mysql> explain select c from test_idx where a = 2 and b = 2 and c = 2\G
    *************************** 1. row ***************************
    id: 1
    select_type: SIMPLE
    table: test_idx
    type: ref
    possible_keys: idx_cover_all
    key: idx_cover_all
    key_len: 15
    ref: const,const,const
    rows: 1
    Extra: Using where; Using index

    (1) But why do I see “Using where”? I can not understand why engine (InnoDB) can not filter row inside.

    When I ask for a rows with little more complicated query, it confuses me even more:
    mysql> explain select c from test_idx where a = 2 and b = 2 and c IN (1, 2, 3)\G
    *************************** 1. row ***************************
    id: 1
    select_type: SIMPLE
    table: test_idx
    type: range
    possible_keys: idx_cover_all
    key: idx_cover_all
    key_len: 15
    ref: NULL
    rows: 3
    Extra: Using where; Using index

    Join type “range”, but I know that MySQL uses it for IN queries even if it is not actually range, so it is OK. But…

    (2) No reference fields — why?
    (3) Still can see “User where” — why?

    Why InnoDB can not be sure it found exactly correct row? Or is it MySQL server does not believe InnoDB and want to check data on its side?

    I can reproduce it both on Percona Server 5.1 and Percona Server 5.5
    5.1.54-rel12.5-log Percona Server with XtraDB (GPL), Release 12.5, Revision 188
    5.5.21-55 Percona Server (GPL), Release 25.1

Leave a Reply

Your email address will not be published. Required fields are marked *