Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Replaying database load with Percona Playback

October 25, 2012

Author

Miguel Angel Nieto

Insight for DBAs

MySQL

Share this Post:

If you are planning to upgrade or make any configuration change on your MySQL database the first advice usually is:

– Benchmark!

How should we do that benchmark? People usually run generic benchmark tools like sysbench, tpcc or mysqlslap that are good to know the number of transactions per seconds that a database can do but it doesn’t care about your workload, data set or queries. Those tools just run random queries against random generated data.

The best way to run a benchmark is replaying the load of your production server against a different database server and here is where Percona Playback can help us. Percona Playback is a new tool that can replay the data captured from the production server in another different server. It can replay queries from tcpdump or slow query logs. With this tool you can measure how a database upgrade, change on my.cnf or schema change can affect the overall performance of your application.

In the example I’m going to show how to meassure the impact of my.cnf changes replaying a slow query log. The service starts with an empty my.cnf configuration so all parameters are set to their default values. We have two servers, PROD and DEV. PROD is the database that is in production and where our customers are working. DEV is where we are going to test all the changes and replay the load. Both servers should have an identical copy of the data, using DEV as slave of PROD or taking an xtrabackup copy just before the data capture.

Important: don’t run the benchmark on the production server because Percona Playback will run both SELECT and DML queries. There is still no –read-only parameter 🙂

So now we need to enable the slow query log on the PROD server. We should have it enabled during the peak time to get valuable information:

SET GLOBAL slow_query_log=1;
SET GLOBAL long_query_time=0;

1 2	SET GLOBAL slow_query_log=1; SET GLOBAL long_query_time=0;

With long_query_time we are logging all queries that are being executed in our database. After capturing enough data and copying the slow query log to the DEV server we can replay the load:

percona-playback --mysql-host=127.0.0.1 --mysql-user=root --mysql-schema=my_db --query-log-file=slow.log

Executed 429467 queries
Spent 00:00:59.668649 executing queries versus an expected 00:00:47.085301 time.
98966 queries were quicker than expected, 330501 were slower
A total of 51 queries had errors.
Expected 372462 rows, got 371851 (a difference of 611)
Number of queries where number of rows differed: 324.

percona-playback --mysql-host=127.0.0.1 --mysql-user=root --mysql-schema=my_db --query-log-file=slow.log

Executed 429467 queries

Spent 00:00:59.668649 executing queries versus an expected 00:00:47.085301 time.

98966 queries were quicker than expected, 330501 were slower

A total of 51 queries had errors.

Expected 372462 rows, got 371851 (a difference of 611)

Number of queries where number of rows differed: 324.

That’s the report of the tool. Seems that the it needed 59 seconds to execute the load but it was 47 seconds originally on the production server. 98966 queries were faster and 330501 slower. We also see that some queries had errors. Those error are shown on the stderr so you can redirect them and review later. The typical reason for an error is that a particular query returns different number or rows. For example:

Connection 43 Rows Sent: 210 != expected 211 for query: SELECT DISTINCT ol_i_id FROM order_line WHERE ol_w_id = 1 AND ol_d_id = 10 AND ol_o_id < 3415 AND ol_o_id >= (3415 - 20);

1	Connection 43 Rows Sent: 210 != expected 211 for query: SELECT DISTINCT ol_i_id FROM order_line WHERE ol_w_id = 1 AND ol_d_id = 10 AND ol_o_id < 3415 AND ol_o_id >= (3415 - 20);

Why there are some queries returns different number of rows? Usually because the data set is not exactly the same on PROD and DEV servers. In the previous report we see that 324 queries returned different row numbers from 429467 queries in total.

Let’s modify the my.cnf to see if we can get any improvement. Take in account that the data set on DEV has been modified by the previous replay so you should restore it before the next benchmark. I’m going to add 1GB of buffer pool and run again the benchmark (innodb_buffer_pool_size=1G).

Executed 429467 queries
Spent 00:00:59.856656 executing queries versus an expected 00:00:47.085301 time.
98730 queries were quicker than expected, 330737 were slower
A total of 51 queries had errors.
Expected 372462 rows, got 371851 (a difference of 611)
Number of queries where number of rows differed: 324.

Executed 429467 queries

Spent 00:00:59.856656 executing queries versus an expected 00:00:47.085301 time.

98730 queries were quicker than expected, 330737 were slower

A total of 51 queries had errors.

Expected 372462 rows, got 371851 (a difference of 611)

Number of queries where number of rows differed: 324.

Not too many changes, is very similar to the previous report. What is the reason to see no improvement? Well, I haven’t warmed up the buffer pool. Usually when doing a benchmark is a good idea to read the data first to warm up the buffer pool. This blog post is not a benchmark per se, just an overview of Percona Playback so we’re going to ignore that. Second try, add 2GB of InnoDB Log Files (innodb_log_file_size=1G, innodb_log_files_in_group=2):

Executed 429467 queries
Spent 00:00:56.772552 executing queries versus an expected 00:00:47.085301 time.
99959 queries were quicker than expected, 329508 were slower
A total of 51 queries had errors.
Expected 372462 rows, got 371851 (a difference of 611)
Number of queries where number of rows differed: 324.

Executed 429467 queries

Spent 00:00:56.772552 executing queries versus an expected 00:00:47.085301 time.

99959 queries were quicker than expected, 329508 were slower

A total of 51 queries had errors.

Expected 372462 rows, got 371851 (a difference of 611)

Number of queries where number of rows differed: 324.

Now we see an improvement but we can go further. Last change, flush logs every second instead of every commit (innodb_flush_log_at_trx_commit=2):

Executed 429467 queries
Spent 00:00:51.605932 executing queries versus an expected 00:00:47.085301 time.
111833 queries were quicker than expected, 317634 were slower
A total of 51 queries had errors.
Expected 372462 rows, got 371851 (a difference of 611)
Number of queries where number of rows differed: 324.

Executed 429467 queries

Spent 00:00:51.605932 executing queries versus an expected 00:00:47.085301 time.

111833 queries were quicker than expected, 317634 were slower

A total of 51 queries had errors.

Expected 372462 rows, got 371851 (a difference of 611)

Number of queries where number of rows differed: 324.

Much better, 51 seconds. Very near to the original value 🙂

This is an easy example of how we can test our changes or updates using real data and real queries from our application. The tool is under active development. If you find any bugs we would appreciate it being filed on launchpad. You can download the tool from:

https://www.percona.com/downloads/Percona-Playback/

0 0 votes

Article Rating

13 Comments

Oldest

Newest Most Voted

Jonathan Levin

13 years ago

The problem is that you miss out on concurrency. Quite a few variables account for improving performance with high concurrency.
This tool is not much different than simply running your log against the database with the obvious advantage seeing the time for each query.
I would argue about this tool generating “your workload”. It replays a log linearly and that is not the same as accurately generating load on the server.

It would be better if you parsed a slow log for different thread ids and times and try to make a tool that replays the log across 5-10 threads (or another number). That would be much closer to generating an accurate load on the database.

Roel Van de Paar

13 years ago

Besides replaying load for quick benchmarking purposes, it would be interesting to see how this tool can be used for QA ends: “recording” a load from the start in order to replay it (and potentially generate the same regression, crash, …).

Akshay Suryavanshi

13 years ago

Regarding the concurrency, a simple workaround could be to just run 5-10 percona-playback sessions simultaneously. This could possibly simulate the desired load on the server.

Peter Boros

13 years ago

Playback handles parallelism automatically. Whenever it sees a new thread id in the slow log, it opens a new mysql client thread, and uses that to replay the statements executed for a given thread id. So, if you have 5 distinct thread ids in the slow log (your application used 5 threads), playback will also use 5 threads, one thread will execute the same statements as the corresponding thread executed in the production workload. It also check for quit commands in the slow log, and whenever it sees the quit command for a thread, playback tears down the corresponding thread it created.

By default it tries to replay the load as fast as it can (note that the mechanism above retains the parallelism characteristics), this is good for example for capacity planning purposes.
The other mode is when it makes each statement last at least as long as we have them in the slow log (this should give you a more accurate reproduction of the production workload). Sometimes this is successful, sometimes it isn’t (for example if a statement is slower than originally, it will take more time).

fher98

13 years ago

Mmmm… if I have no dev server… Would it be a bad idea to run it against production?

alalei

13 years ago

hi:
i try to use tcpdump,and i get errors like：
Database Plugin: libmysqlclient
Running…
terminate called after throwing an instance of ‘boost::thread_resource_error’
what(): boost::thread_resource_error

i am doing this following over redhat as6.3:
1.tcpdump -i any port 3306 -s0 -w aaaa.cap
2. ./percona-playback –input-plugin=tcpdump –tcpdump-file=/root/aaaa.cap –tcpdump-mode=accurate –db-plugin=libmysqlclient –mysql-host=xxx.xxx.xxx.xxx –mysql-port=3306 –mysql-username=root –mysql-schema=sbtest –mysql-password=xxxx –queue-depth 10000
Database Plugin: libmysqlclient
Running…
terminate called after throwing an instance of ‘boost::thread_resource_error’
what(): boost::thread_resource_error
Aborted

what is wrong?

Author

Miguel Angel Nieto

13 years ago

fher98 yes, it would be a bad idea. This tool will run *all* your queries, that means also the DML queries like INSERT, UPDATE and DELETE.

alalei: sorry, don’t know, can you please fill a bug report on launchpad? If it is possible, attach a sample pcap file so we’ll be able to reproduce the problem.

https://launchpad.net/percona-playback

alalei

13 years ago

hi Miguel Angel Nieto: i have submit a bug report at page https://bugs.launchpad.net/percona-playback/+bug/1077792,please help analyse it.

another question:does percona-playback only support mysql with percona-server version? how about mysql-server community version? i test it with mysql-server 5.1,and find that there is no ‘thread-id’ information in the mysql-slow.log file,so the thread-per-connection can not be support in this version?(no connection info in mysql-slow.log,no more thread can be used)

Peter Colclough

13 years ago

Hi Miguel. This is a good place to start. The issue with parallelism exists if you don’t have Percona Server, but at leasst you have a start point. We have had similar issues in the past, and are developing our own logger/query tools, based on Percona slow logs.
The major issue I see is that its all well and good writing all queries to teh slow log (similar to the MySql general log). However, at 4000 per second over the estate , we tried running the General log a couple of years back, and stuffed the server in around 30 minutes.
Is it possible to generate a trimmed down log with just time/thread/sql, with an option to turn off DML statements?

Igor Kryltsov

12 years ago

I guess it is important to somehow create database backup and start capturing workload at the same time so replay will not fail due to missing records. Any suggestions on how to do this. Same problem exists here – https://github.com/buger/gor/ – you can record web load to replay but how monetarily create a snapshot of your server with users sessions from prod to dev to replay it correctly….

Thank you

Jesus Alvarado

10 years ago

terminate called after throwing an instance of ‘tbb::captured_exception’

Adam Swanson

10 years ago

I assume my_db is the database you want to re-run the slow-query log against. What if we have multiple databases? Is there a flag to just run against all databases that are in the slow-query?