I recently had a case where replication lag on a slave was caused by a backup script. My first reaction was to blame the additional pressure on the disks, but the cause turned out to be more subtle: Percona XtraBackup was not able to execute FLUSH TABLES WITH READ LOCK because of a long-running query, and the server ended up being read-only. Let’s see how we can deal with that kind of situation.
Starting with Percona XtraBackup 2.1.4, you can:

- make the backup abort if FLUSH TABLES WITH READ LOCK cannot be granted quickly enough, with the lock-wait-threshold, lock-wait-query-type and lock-wait-timeout options
- let the tool kill offending queries, with the kill-long-queries-timeout and kill-long-query-type settings

Full documentation is available in the Percona XtraBackup manual.
Percona XtraBackup has to run FLUSH TABLES WITH READ LOCK before backing up non-InnoDB tables, and by default it will wait forever for this lock to be granted. So you can end up with the following situation:

- Percona XtraBackup waits for FLUSH TABLES WITH READ LOCK to be granted, so the backup is stalled
- All write queries wait for FLUSH TABLES WITH READ LOCK to complete, so the server is effectively read-only

There are several ways to diagnose the problem. First, if you look at the output of innobackupex, you will see that FLUSH TABLES WITH READ LOCK is being requested and that everything is stalled from this point on:
[...]
131002 17:49:58  innobackupex: Starting to lock all tables...

>> log scanned up to (4061999899)
>> log scanned up to (4061999899)
>> log scanned up to (4061999899)
>> log scanned up to (4061999899)
>> log scanned up to (4061999899)
>> log scanned up to (4061999899)
>> log scanned up to (4061999899)
[...]
And with SHOW PROCESSLIST, we can see that write queries are waiting for FLUSH TABLES WITH READ LOCK to be released, while FLUSH TABLES WITH READ LOCK is waiting for the long query to finish:
+----+------+-----------+-----------+---------+------+------------------------------+-------------------------------------------------------------+
| Id | User | Host      | db        | Command | Time | State                        | Info                                                        |
+----+------+-----------+-----------+---------+------+------------------------------+-------------------------------------------------------------+
|  5 | root | localhost | NULL      | Query   |  244 | Sending data                 | select count(*) from employees.employees,employees.salaries |
|  6 | root | localhost | NULL      | Query   |  206 | Waiting for table flush      | FLUSH TABLES WITH READ LOCK                                 |
|  7 | root | localhost | NULL      | Query   |    0 | NULL                         | show processlist                                            |
| 12 | root | localhost | employees | Query   |    2 | Waiting for global read lock | update employees set gender='M' where emp_no=40000          |
| 13 | root | localhost | employees | Query   |    5 | Waiting for global read lock | update employees set gender='M' where emp_no=30000          |
+----+------+-----------+-----------+---------+------+------------------------------+-------------------------------------------------------------+
Let’s now use the new options of Percona XtraBackup to get better control over this global lock.
Here is a way to get a long-running query with the employees database:
mysql> select count(*) from employees.employees,employees.salaries;
Now let’s run a backup that will automatically abort if it cannot take the global lock after waiting for some time:
innobackupex --user=root --password=root --lock-wait-threshold=10 --lock-wait-query-type=all --lock-wait-timeout=5 /root/backup
This command line tells Percona XtraBackup to wait for up to 5 seconds for any query that has been running for more than 10 seconds to complete; if such a query is still running after that, the backup is aborted before FLUSH TABLES WITH READ LOCK is even issued.
If we check the output of innobackupex, we will see something like:
[...]
[01] Copying ./employees/employees.ibd to /root/backup/2013-10-07_15-15-38/employees/employees.ibd
[01]        ...done
>> log scanned up to (4062000510)
xtrabackup: Creating suspend file '/root/backup/2013-10-07_15-15-38/xtrabackup_suspended_2' with pid '4313'

131007 15:15:58  innobackupex: Continuing after ibbackup has suspended

131007 15:15:58  innobackupex: Waiting for query 17 (duration 37 sec): select count(*) from employees.employees,employees.salaries
>> log scanned up to (4062000510)

131007 15:15:59  innobackupex: Waiting for query 17 (duration 38 sec): select count(*) from employees.employees,employees.salaries
>> log scanned up to (4062000510)

131007 15:16:00  innobackupex: Waiting for query 17 (duration 39 sec): select count(*) from employees.employees,employees.salaries
>> log scanned up to (4062000510)

131007 15:16:01  innobackupex: Waiting for query 17 (duration 40 sec): select count(*) from employees.employees,employees.salaries
>> log scanned up to (4062000510)

131007 15:16:02  innobackupex: Waiting for query 17 (duration 41 sec): select count(*) from employees.employees,employees.salaries
>> log scanned up to (4062000510)

131007 15:16:03  innobackupex: Waiting for query 17 (duration 42 sec): select count(*) from employees.employees,employees.salaries
>> log scanned up to (4062000510)
innobackupex: Error: Unable to obtain lock. Please try again. at /usr/bin/innobackupex line 3229.
131007 15:16:04  innobackupex: Waiting for ibbackup (pid=4312) to finish
In this case, the backup was aborted because the long-running query did not finish in time. As a side note, because of this bug, it is safer to use Percona XtraBackup 2.1.5+.
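An aborted backup leaves you without a fresh copy, so it can make sense to simply retry later, once the long-running query has had a chance to finish. Here is a minimal sketch of such a wrapper; the retry count, the delay and the way innobackupex is invoked are assumptions to adapt to your environment:

```shell
#!/bin/sh
# Hypothetical wrapper: innobackupex exits with a non-zero status when it
# aborts because the global lock could not be obtained, so we can retry
# the whole command a few times with a pause in between.
run_with_retries() {
    max=$1; shift
    i=1
    while [ "$i" -le "$max" ]; do
        "$@" && return 0              # backup succeeded
        echo "attempt $i/$max failed, retrying..." >&2
        i=$((i + 1))
        sleep 1                       # pick a delay that fits your workload
    done
    return 1                          # every attempt failed
}

# Usage (credentials and paths are placeholders):
# run_with_retries 3 innobackupex --user=root --password=root \
#     --lock-wait-threshold=10 --lock-wait-query-type=all \
#     --lock-wait-timeout=5 /root/backup
```

The wrapper only looks at the exit status, so it works with any backup command; just make sure the delay between attempts is longer than your typical long-running queries.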
Let’s change the settings we give to innobackupex. Instead of aborting the backup if it cannot run FLUSH TABLES WITH READ LOCK, let’s now allow the tool to kill any offending query after waiting for at most 10 seconds:
innobackupex --user=root --password=root --kill-long-queries-timeout=10 --kill-long-query-type=all /root/backup
The output of innobackupex shows that the long-running query is killed and that the whole backup can complete:
[...]
>> log scanned up to (4062000510)
xtrabackup: Creating suspend file '/root/backup/2013-10-07_15-27-05/xtrabackup_suspended_2' with pid '4359'

131007 15:27:25  innobackupex: Continuing after ibbackup has suspended
131007 15:27:25  innobackupex: Starting to lock all tables...
>> log scanned up to (4062000510)
>> log scanned up to (4062000510)
>> log scanned up to (4062000510)
>> log scanned up to (4062000510)
>> log scanned up to (4062000510)
>> log scanned up to (4062000510)
>> log scanned up to (4062000510)
>> log scanned up to (4062000510)
>> log scanned up to (4062000510)
>> log scanned up to (4062000510)
131007 15:27:35  innobackupex: Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_group=xtrabackup' as 'root' (using password: YES).
131007 15:27:35  innobackupex: Connected to MySQL server

131007 15:27:35  innobackupex: Killing query 17 (duration 734 sec): select count(*) from employees.employees,employees.salaries
131007 15:27:35  innobackupex: Connection to database server closed
Query killing process is finished
131007 15:27:35  innobackupex: All tables locked and flushed to disk
[...]
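If you would rather not let the backup tool kill queries blindly, you can check for long-running queries yourself before starting the backup and decide what to do. A minimal sketch, assuming you capture the processlist as tab-separated id/time/info columns (the query against information_schema.processlist is standard; the helper name is made up):

```shell
# Hypothetical helper: read tab-separated "id<TAB>time<TAB>info" lines on
# stdin and print the ids of queries running longer than the threshold.
long_query_ids() {
    threshold=$1
    awk -F'\t' -v t="$threshold" '$2 + 0 > t { print $1 }'
}

# Usage: feed it output captured with something like
#   mysql -N -B -e "SELECT id, time, info FROM information_schema.processlist"
# and skip or postpone the backup if any id is printed.
```

Running such a check from the backup script lets you log which query would have blocked FLUSH TABLES WITH READ LOCK, instead of discovering it after the fact in the innobackupex output.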
The new settings of Percona XtraBackup are nice for avoiding nasty interactions between backups and long-running queries. However, if you want to use the settings discussed in this article, be prepared for the potential consequences: either the backup is aborted and you are left without a fresh copy, or application queries are killed before they can complete.