Handling long-running queries in MySQL with Percona XtraBackup

I recently had a case where replication lag on a slave was caused by a backup script. First reaction was to incriminate the additional pressure on the disks, but it turned out to be more subtle: Percona XtraBackup was not able to execute FLUSH TABLES WITH READ LOCK due to a long-running query, and the server ended up being read-only. Let’s see how we can deal with that kind of situation.

In short

Starting with Percona XtraBackup 2.1.4, you can:

  • Configure a timeout after which the backup will be aborted (and the global lock released) with the lock-wait-threshold, lock-wait-query-type and lock-wait-timeout options
  • Or automatically kill all queries that prevent the lock to be granted with the kill-long-queries-timeout and kill-long-query-type settings

Full documentation is here

Diagnosing the problem

Percona XtraBackup has to run FLUSH TABLES WITH READ LOCK before backing up non InnoDB tables. And by default, it will wait forever for this lock to be granted. So you can end up with the following situation:

  • A backup is started while a long-running query is executing
  • The long-running query holds locks that prevent FLUSH TABLES WITH READ LOCK to complete, the backup is stalled
  • All new write queries are blocked, waiting for FLUSH TABLES WITH READ to complete

There are several ways to diagnose the problem. First if you look at the output of innobackupex, you will see that FLUSH TABLES WITH READ LOCK is being requested and that everything is stalled from this point on:

And with SHOW PROCESSLIST, we can see that write queries are waiting for FLUSH TABLES WITH READ LOCK to be released, while FLUSH TABLES WITH READ LOCK is waiting for the long query to finish:

Let’s now use the new options of Percona XtraBackup to have a better control over this global lock.

Aborting the backup if long queries are running

Here is a way to get a long-running query with the employees database:

Now let’s run a backup which will automatically aborts if it cannot take the global lock after waiting for some time:

This command line tells Percona XtraBackup to only wait up to 5 seconds if there is any query running for more than 10 seconds.

If we check the output of innobackupex, we will see something like:

In this case, the backup was aborted because the long-running query did not finish in time. As a side note, because of this bug, it is safer to use Percona XtraBackup 2.1.5+.

Killing long running queries to allow the backup to complete

Let’s change the settings we give to innobackupex. Instead of aborting the backup if it cannot run FLUSH TABLES WITH READ LOCK, let’s now allow the tool to kill any offending query after waiting for maximum 10 seconds:

The output of innobackupex shows that the long-running query is killed and that the whole backup can complete: