Taking backups on Percona XtraDB Cluster (without stalls!)

I occasionally see customers taking backups from their PXC clusters who complain that the cluster “stalls” during the backup.  As I wrote in a previous blog post, these stalls are often really just flow control.  But why would a backup cause flow control?

Most backups I know of (even Percona XtraBackup) take a FLUSH TABLES WITH READ LOCK (FTWRL) at some point in the backup process.  This can be disabled in XtraBackup (in certain circumstances), but it is enabled by default.

If you go to your active cluster right now and execute an FTWRL (don’t actually do this in production!), you’ll see this message in your error log on that node:

This indicates that Galera is unable to apply writes on the local node.  This by itself does not indicate flow control, but flow control is likely if the lock is held for too long.  Once the lock is released, we get a message that Galera is at work again:

During this interval (9 seconds in this case), the wsrep_local_recv_queue was backing up on this node, which could trigger flow control depending on how the fc_limit and other settings are configured.  I talk about how to tune flow control in my other post, but what we really want is for flow control to not be in effect for the duration of our backup on this one specific node.
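To watch the backlog described above while a lock is held, a small helper along these lines can poll the relevant counters.  This is only a sketch: the function names `wsrep_status` and `report_backlog` are made up for illustration, and it assumes the `mysql` client can log in to the local node without extra arguments (for example via `~/.my.cnf`).  The status variables `wsrep_local_recv_queue` and `wsrep_flow_control_paused` are standard Galera counters.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one Galera status counter on the local node.
# Assumes `mysql` can connect without arguments (credentials in ~/.my.cnf or similar).
wsrep_status() {
    # -N drops the header row, -B gives tab-separated rows: "<name>\t<value>"
    mysql -N -B -e "SHOW GLOBAL STATUS LIKE '$1'" | awk '{print $2}'
}

# Hypothetical helper: show the receive queue depth and the fraction of time
# replication has been paused by flow control.
report_backlog() {
    queue=$(wsrep_status wsrep_local_recv_queue)
    paused=$(wsrep_status wsrep_flow_control_paused)
    echo "recv_queue=${queue} flow_control_paused=${paused}"
}
```

Calling `report_backlog` in a loop on the locked node should show the queue growing while the lock is held; whether flow control then engages depends on the configured limit.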

Astute Galera users know that a Donor during SST does not trigger flow control, even though it may get far behind the rest of the cluster.  What if we could manually make a node act like a donor for the purposes of a backup?  Turns out we now can.

Starting with PXC 5.5.33, a new variable called ‘wsrep_desync’ has been added.  This allows us to manually toggle a node into and out of the ‘Donor/Desynced’ state.  The Donor/Desynced state is nothing magical.  It really just turns off flow control for that node, and allows the node to fall arbitrarily far behind the rest of the cluster, but only when something forces it to.  That something could be a FTWRL, but also anything else that makes the node lag, such as heavy disk utilization.
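Putting the idea together, a backup wrapper might look roughly like the sketch below.  The function names (`run_sql`, `recv_queue_depth`, `desynced_backup`) and the `POLL_SECONDS` variable are hypothetical, and the script assumes the `mysql` client can reach the local node without extra arguments.  Waiting for the receive queue to drain before switching desync off is my own precaution here, not something the variable requires.

```shell
#!/bin/sh
# Hypothetical helper: run one SQL statement on the local node.
# Assumes `mysql` can connect without arguments (credentials in ~/.my.cnf or similar).
run_sql() {
    mysql -N -B -e "$1"
}

# Hypothetical helper: current depth of the local receive queue.
recv_queue_depth() {
    run_sql "SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue'" | awk '{print $2}'
}

# desynced_backup CMD [ARGS...]
# Desync the node, run the given backup command, wait for the node to
# catch up, then return it to normal flow control.
desynced_backup() {
    run_sql "SET GLOBAL wsrep_desync=ON" || return 1

    # Run the backup and remember its exit status.
    rc=0
    "$@" || rc=$?

    # Precaution (an assumption, see above): let the backlog drain first,
    # so the node does not trigger flow control the moment it is resynced.
    while depth=$(recv_queue_depth); [ "${depth:-0}" -gt 0 ]; do
        sleep "${POLL_SECONDS:-1}"
    done

    run_sql "SET GLOBAL wsrep_desync=OFF" || return 1
    return "$rc"
}
```

It would be invoked with whatever backup command you normally run, for example `desynced_backup innobackupex /path/to/backups` (the path is a placeholder).  The wrapper returns the backup command's own exit status, so existing monitoring around the backup job keeps working.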

So, I can set Desync like this:

When I do that, I can see the node drop into the ‘Donor/Desynced’ state:

However, notice that my wsrep_local_recv_queue is still empty, and flow control does not appear to be in effect.  myq_status agrees with this:

Moving to the Donor/Desynced state does not force the node to fall behind; it just allows the node to lag without triggering flow control.  Now, let’s take a FTWRL on node3 and observe: