Analyzing Amazon Aurora Slow Logs with pt-query-digest

Amazon Aurora MySQL slow query logs with pt-query-digest slowIn this blog post we shall discuss how you can analyze slow query logs from Amazon Aurora for MySQL, (referred to as Amazon Aurora in the remaining blog). The tools and techniques explained here apply to the other MySQL compatible services available under Amazon Aurora. However, we’ll focus specially on analyzing slow logs from Amazon Aurora version 2 (MySQL 5.7 compatible) using pt-query-digest. We believe there is a bug in Aurora where it logs really big numbers for query execution and lock times for otherwise really fast queries.

So, the main steps we need are:

  1. Enable slow query logging on your Amazon Aurora DB parameter group, apply the change when appropriate.
  2. Download the slow log(s) that match the time that you are interested to investigate, and optionally concatenate them.
  3. Run pt-query-digest on the downloaded logs and check the results.

Enable slow query logging

For our testing we decided to capture all the SELECT queries that were hitting our Amazon Aurora instance, mainly because we had a sysbench OLTP read only workload and that wouldn’t really have a lot of slow queries. An easy way to do so is to enable the capture of slow query logs and set long_query_time to 0 — you will need to enable slow query logging. To achieve that, we created a new DB parameter group and applied it to our test Aurora instance with the following three parameters set as below:

Once you have the above configuration applied to Amazon RDS, you will be able to see slow query logs being created in the Amazon RDS console.

Download the log file

You can download the log file of your choice using either the Amazon RDS console OR you can use the following AWS CLI command to achieve the same:

Depending on the size of the chosen log file, the above command will take some time to complete the download.

Run pt-query-digest on the log file

Once the file has been downloaded you can analyse that using the following pt-query-digest command.

On our Aurora test slow log file, the initial results didn’t look right so we had to apply a workaround. Here is the header of the initial results from pt-query-digest:

What’s wrong with the above results is that the total query Exec time and Lock time are very large numbers. Digging deeper into the logs revealed a problem with the slow logs themselves that had very large numbers for Query time & Lock time for some queries. For instance in our case, of 5.13 million queries in the log file, only 111 had the anomaly. Even so, it was enough to skew the results.

Incorrect logging

The above two queries are, in fact, really fast, but for some reason the execution time & lock times are wrongly logged in the slow query log. Since the number of such query log records is statistically negligible compared to the total number of queries, we decided to ask pt-query-digest to ignore them using the command line parameter –attribute-value-limit . The default value of this parameter is 0. We decided to increase that to 2^32, and make it ignore the large numbers from the slow query log. So, the pt-query-digest command became:

This caused the 111 queries with the bad log times to be ignored and the results looked good. In our case, the ignored queries were bad variants of queries for which good versions existed. You can tell this because the number of unique queries remained the same as before after the bad variants were ignored. However, this may not always hold true and one should expect to lose some fidelity, especially if you are analyzing a smaller slow log.

That number looks familiar

The really big number 18446744073709.550781 seemed to ring a bell. A quick web search revealed that it could be a regression of an old bug in MySQL’s code. The following bugs were found to have the same value being reported for query exec time & query lock time.

Once slow logs were enabled, we used this sysbench command  to generate the workload for the Amazon Aurora instance. You might like to try it yourselves. Please note that this used sysbench version 1.0.14.

If you are an Amazon Aurora user, have you found any problems analyzing slow query logs? You are welcome to use the comments section, below, to let me know.

Percona Toolkit

pt-query-digest is part of Percona Toolkit, a collection of advanced open source command-line tools, developed and used by the Percona technical staff. Percona Toolkit is open source and free to download and use.

Share this post

Comments (2)

  • Rahul Sinha

    Hi ,I have done same but output is blank.
    Is pt-query-digest is not work with procedure_name doesnot captured by this.
    below is the details.

    [root@ip-172-31-6-146 tmp]# pt-query-digest –group-by fingerprint –order-by Query_time:sum slowquery.log

    # 220ms user time, 20ms system time, 37.44M rss, 232.91M vsz
    # Current date: Thu Aug 1 11:05:47 2019
    # Hostname: ip-172-31-6-146.ap-southeast-1.compute.internal
    # Files: slowquery.log
    # Overall: 0 total, 1 unique, 0 QPS, 0x concurrency ______________________
    # Attribute total min max avg 95% stddev median
    # ============ ======= ======= ======= ======= ======= ======= =======
    # Query size 1.03M 1.03M 1.03M 1.03M 1.03M 0 1.03M

    # Profile
    # Rank Query ID Response time Calls R/Call V/M Item
    # =========== =========== =========== =========== =========== ===== ======
    [root@ip-172-31-6-146 tmp]#

    below is the slow log output
    [root@ip-172-31-6-146 tmp]# head -20 slowquery.log
    b/34888351875705522600949060125748799650296161739669241886 f/34893005952025475147876458366857132852441865462075490304
    EVENTS 1564447805084 # Time: 190730 0:50:05
    # User@Host: mxusr_lsqapps[mxusr_lsqapps] @ [] Id: 760584
    # Query_time: 2.056237 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 215758
    SET timestamp=1564447805;
    CALL mxradon_config.udsp_SetSubscriptionSentTimeStamp(‘d0dc6b17-94c1-11e9-8ffe-02ceb8e07d66’); 1564447805000
    EVENTS 1564447815084 # Time: 190730 0:50:10
    # User@Host: mxusr_lsqapps[mxusr_lsqapps] @ [] Id: 760634
    # Query_time: 2.052549 Lock_time: 0.000100 Rows_sent: 0 Rows_examined: 215758
    SET timestamp=1564447810;
    CALL mxradon_config.udsp_SetSubscriptionSentTimeStamp(‘00620051-659a-11e9-8ffe-02ceb8e07d66’); 1564447810000
    EVENTS 1564447815084 # Time: 190730 0:50:11
    # User@Host: mxusr_lsqapps[mxusr_lsqapps] @ [] Id: 760633
    # Query_time: 2.074477 Lock_time: 0.000084 Rows_sent: 0 Rows_examined: 215758
    SET timestamp=1564447811;
    CALL mxradon_config.udsp_SetSubscriptionSentTimeStamp(‘50964ae7-a15a-11e9-8ffe-02ceb8e07d66’); 1564447811000
    EVENTS 1564447815084 # User@Host: mxusr_lsqapps[mxusr_lsqapps] @ [] Id: 760623
    # Query_time: 2.197263 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 215758
    SET timestamp=1564447811;
    CALL mxradon_config.udsp_SetSubscriptionSentTimeStamp(‘eae30df7-40b2-11e9-8ffe-02ceb8e07d66’); 1564447811000

    August 1, 2019 at 7:14 am
  • Andres