Galera Flow Control in Percona XtraDB Cluster for MySQL

May 2, 2013

Author: Jay Janssen

Last week at Percona Live, I delivered a six-hour tutorial about Percona XtraDB Cluster (PXC) for MySQL. I actually had more material than I covered (by design), but one thing I regret we didn’t cover was flow control. So, I thought I’d write a post covering flow control because it is important to understand.

What is flow control?

One of the things that people don’t often expect when switching to Galera is the existence of a replication feedback mechanism, unlike anything you find in standard async MySQL replication. It is my belief that a lack of understanding of this system, or even of the fact that it exists, leads to unnecessary frustration with Galera and to cluster “stalls” that are preventable.

This feedback, called flow control, allows any node in the cluster to instruct the group when it needs replication to pause and when it is ready for replication to continue. This prevents any node in the synchronous replication group from getting too far behind the others in applying replication.

This may sound counter-intuitive at first: how would synchronous replication get behind? As I’ve mentioned before, Galera’s replication is synchronous to the point of ensuring transactions are copied to all nodes and global ordering is established, but apply and commit are asynchronous on all but the node the transaction is run on.

It’s important to realize that Galera prevents conflicts with transactions that have been certified but not yet applied, so multi-node writing will not lead to inconsistencies, but that is beyond the scope of this post.

Tuning flow control

Flow control is triggered when the receive queue on a Synced node (visible via the wsrep_local_recv_queue global status variable) exceeds a specific threshold. Donor/Desynced nodes do not apply flow control, though they may enter states where the recv_queue grows substantially. Therefore, applications should take care to avoid using Donor/Desynced nodes, particularly when using a blocking SST method like rsync or mysqldump.
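
As a rough sketch of how an application or health check might steer clear of Donor/Desynced nodes, the function below looks at a node's wsrep status and only accepts Synced nodes. The function name, the `max_recv_queue` parameter, and the idea of passing the status in as a dict are assumptions made for illustration; the dict is assumed to have been filled from SHOW GLOBAL STATUS LIKE 'wsrep_%'.

```python
# Hypothetical helper, not part of PXC or Galera.
# `status` is assumed to be a dict of wsrep status variables for one node,
# e.g. built from the rows of SHOW GLOBAL STATUS LIKE 'wsrep_%'.

def node_accepts_traffic(status, max_recv_queue=None):
    """Return True only for a Synced node, optionally with a short receive queue."""
    # Donor/Desynced (and any other non-Synced state) is rejected outright.
    if status.get("wsrep_local_state_comment") != "Synced":
        return False
    # Optional extra guard: reject nodes whose receive queue is already long.
    if max_recv_queue is not None:
        return int(status.get("wsrep_local_recv_queue", 0)) <= max_recv_queue
    return True


if __name__ == "__main__":
    donor = {"wsrep_local_state_comment": "Donor/Desynced", "wsrep_local_recv_queue": "1200"}
    print(node_accepts_traffic(donor))
```

A load balancer check built on something like this would take a donor out of rotation for the duration of a blocking SST.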

So, flow control kicks in when the recv queue gets too big, but how big is that? And when is flow control relaxed? There are a few settings that are relevant here, and they are all configured via the wsrep_provider_options global variable.

gcs.fc_limit

This setting controls when flow control engages. Simply speaking, if the wsrep_local_recv_queue exceeds this size on a given node, a pausing flow control message will be sent. However, it’s a bit trickier than that, because of fc_master_slave (see below).

The fc_limit defaults to 16 transactions. Effectively, this is the furthest a given node can fall behind the rest of the cluster in committing transactions.

gcs.fc_master_slave

The fc_limit is modified dynamically if you have fc_master_slave disabled (which it is by default). In this mode, the effective fc_limit is scaled according to the number of nodes in the cluster: the more nodes in the cluster, the larger the calculated fc_limit becomes. The theory behind this is that the larger the cluster gets (and presumably busier with more writes coming from more nodes), the more leeway each node will get to be a bit further behind applying.

If you only write to a single node in PXC, then it is recommended you disable this feature by setting fc_master_slave=YES. Despite its name, this setting really does no more than change whether the fc_limit is dynamically resized or not. It contains no other magic that helps single node writing in PXC to perform better.
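
To give a feel for the dynamic resizing, here is a small sketch. The exact formula is not spelled out in this post; the code assumes the configured fc_limit is multiplied by the square root of the cluster size and rounded to the nearest transaction, which is my understanding of Galera's behavior and should be treated as an assumption. The function name is made up for this example.

```python
import math


def effective_fc_limit(fc_limit, cluster_size, fc_master_slave=False):
    """Estimate the limit a node actually enforces.

    ASSUMPTION: with fc_master_slave=NO the limit scales with
    sqrt(cluster_size); with fc_master_slave=YES it is used as configured.
    """
    if fc_master_slave:
        return fc_limit
    # Round to the nearest whole transaction.
    return int(fc_limit * math.sqrt(cluster_size) + 0.5)


if __name__ == "__main__":
    for nodes in (1, 3, 5, 9):
        print(nodes, effective_fc_limit(16, nodes))
```

Under that assumption, the default of 16 on a three-node cluster works out to roughly 28 transactions rather than 16.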

gcs.fc_factor

If fc_limit controls when flow control is enabled, then fc_factor addresses when it is released. The factor is a number between 0.0 and 1.0, which is multiplied by the current fc_limit (adjusted by the above calculation if fc_master_slave=NO). This yields the number of transactions the recv queue must fall BELOW before another flow control message is sent by the node giving the cluster permission to continue replication.

This setting traditionally defaulted to 0.5, meaning the queue had to fall below 50% of the fc_limit before replication was resumed. A large fc_limit in this case might mean a long wait before flow control gets relaxed again. However, this was recently modified to a default of 1.0 to allow replication to resume as soon as possible.
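
The interplay between fc_limit and fc_factor is easier to see with numbers. The sketch below is a toy model, not Galera's implementation: it walks through a series of sampled receive-queue lengths and reports whether the node would be asking the cluster to pause at each sample. The function name and the sampling idea are assumptions; `fc_limit` here stands for the effective limit after any dynamic resizing.

```python
def flow_control_states(queue_lengths, fc_limit=16, fc_factor=0.5):
    """Return one boolean per sample: True while the node is requesting a pause."""
    resume_below = fc_limit * fc_factor  # queue must fall BELOW this to resume
    paused = False
    states = []
    for queue in queue_lengths:
        if not paused and queue > fc_limit:
            paused = True    # queue exceeded the limit: send a pause message
        elif paused and queue < resume_below:
            paused = False   # queue drained far enough: allow replication again
        states.append(paused)
    return states


if __name__ == "__main__":
    samples = [10, 17, 12, 9, 7, 3]
    print(flow_control_states(samples, fc_limit=16, fc_factor=0.5))
    print(flow_control_states(samples, fc_limit=16, fc_factor=1.0))
```

With a factor of 0.5 the pause in this example lasts until the queue drops under 8, while a factor of 1.0 releases it as soon as the queue is back under 16.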

An example configuration tuning flow control in a master/slave cluster might be:

mysql> set global wsrep_provider_options="gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0";
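
When checking what a node is currently running with, it can help to pull the flow control settings out of the long wsrep_provider_options string. The following is a minimal sketch that assumes the value is a semicolon-separated list of key=value pairs, as in the statement above; both function names are made up for this example.

```python
def parse_provider_options(raw):
    """Split a wsrep_provider_options string into a dict of option -> value."""
    options = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon or empty segments
        key, _, value = item.partition("=")  # split on the first '=' only
        options[key.strip()] = value.strip()
    return options


def flow_control_settings(raw):
    """Keep only the gcs.fc_* options discussed in this post."""
    return {k: v for k, v in parse_provider_options(raw).items() if k.startswith("gcs.fc_")}


if __name__ == "__main__":
    raw = "gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0"
    print(flow_control_settings(raw))
```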

Working with flow control

What happens during flow control

Simply speaking: flow control makes replication stop, and therefore makes writes (which are synchronous) stop, on all nodes until flow control is relaxed.

In normal operation we would expect that a large receive queue might be the result of some brief performance issue on a given node, or perhaps the effect of some large transaction briefly stalling an applier thread.

However, it is possible to halt queue applying on any node simply by running “FLUSH TABLES WITH READ LOCK”, or perhaps “LOCK TABLE”, in which case flow control will kick in just as soon as the fc_limit is exceeded. Therefore, care must be taken that your application or some other maintenance operation (like a backup) doesn’t inadvertently cause flow control on your cluster.

The cost of increasing the fc_limit

Keeping the fc_limit small has ~~two~~ three purposes:

1. It limits the amount of delay any node in the cluster might have applying cluster transactions. Therefore, it keeps reads more up to date without needing to use wsrep_causal_reads.

2. It minimizes the expense of certification by keeping the window between new transactions being committed and the oldest unapplied transaction small. The larger the queue is, the more costly certification gets. EDIT: actually the cost of certification depends only on the size of the transactions, which translates into the number of unique key lookups into the certification index, which is a hash table. A small fc_limit does however keep the certification index smaller in memory.

3. It keeps the certification interval small, which minimizes replication conflicts on a cluster where writes happen on all nodes.

On a master/slave cluster, therefore, it’s reasonable to increase the fc_limit because the only lagging nodes will be the slaves with no writes coming from them. However, with multi-node writing, larger queues will make ~~certification more expensive~~ replication conflicts more likely and therefore time-consuming to the application.

How to tell if flow control is happening and where it is coming from

There are two global status variables you can check to see what flow control is happening:

- wsrep_flow_control_paused – the fraction of time (out of 1.0) since the last SHOW GLOBAL STATUS that flow control has been in effect, regardless of which node caused it. Generally speaking, anything above 0.0 is to be avoided.

- wsrep_flow_control_sent – the number of flow control messages sent by the local node to the cluster. This can be used to discover which node is causing flow control.

I would strongly recommend monitoring and graphing wsrep_flow_control_sent so you can tell if and when flow control is happening and what node (or nodes) are causing it.
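
If you don't have graphing in place yet, even a crude comparison of two samples will point at the culprit. The sketch below assumes you have collected wsrep_flow_control_sent from every node at two points in time and stored the samples as dicts keyed by node name; the function name and the data layout are assumptions for illustration.

```python
def flow_control_senders(previous, current):
    """Return {node: messages_sent} for nodes whose counter grew between samples."""
    senders = {}
    for node, status in current.items():
        before = int(previous.get(node, {}).get("wsrep_flow_control_sent", 0))
        after = int(status.get("wsrep_flow_control_sent", 0))
        # A counter that went down (e.g. after a restart) is ignored here.
        if after > before:
            senders[node] = after - before
    return senders


if __name__ == "__main__":
    earlier = {
        "node1": {"wsrep_flow_control_sent": "0"},
        "node2": {"wsrep_flow_control_sent": "4"},
        "node3": {"wsrep_flow_control_sent": "7"},
    }
    later = {
        "node1": {"wsrep_flow_control_sent": "0"},
        "node2": {"wsrep_flow_control_sent": "4"},
        "node3": {"wsrep_flow_control_sent": "9"},
    }
    print(flow_control_senders(earlier, later))
```

Only the node whose own counter increased is reported, which matches how wsrep_flow_control_sent counts messages sent by the local node.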

Using myq_gadgets, I can easily see flow control if I execute a FLUSH TABLES WITH READ LOCK on node3:

[root@node3 ~]# myq_status wsrep
Wsrep    Cluster        Node           Queue   Ops     Bytes     Flow        Conflct
    time  name P cnf  #  name  cmt sta  Up  Dn  Up  Dn   Up   Dn pau snt dst lcf bfa
09:22:17 myclu P   3  3 node3 Sync T/T   0   0   0   9    0  13K 0.0   0 101   0   0
09:22:18 myclu P   3  3 node3 Sync T/T   0   0   0  18    0  28K 0.0   0 108   0   0
09:22:19 myclu P   3  3 node3 Sync T/T   0   4   0   3    0 4.3K 0.0   0 109   0   0
09:22:20 myclu P   3  3 node3 Sync T/T   0  18   0   0    0    0 0.0   0 109   0   0
09:22:21 myclu P   3  3 node3 Sync T/T   0  27   0   0    0    0 0.0   0 109   0   0
09:22:22 myclu P   3  3 node3 Sync T/T   0  29   0   0    0    0 0.9   1 109   0   0
09:22:23 myclu P   3  3 node3 Sync T/T   0  29   0   0    0    0 1.0   0 109   0   0
09:22:24 myclu P   3  3 node3 Sync T/T   0  29   0   0    0    0 1.0   0 109   0   0
09:22:25 myclu P   3  3 node3 Sync T/T   0  29   0   0    0    0 1.0   0 109   0   0
09:22:26 myclu P   3  3 node3 Sync T/T   0  29   0   0    0    0 1.0   0 109   0   0
09:22:27 myclu P   3  3 node3 Sync T/T   0  29   0   0    0    0 1.0   0 109   0   0
09:22:20 myclu P   3  3 node3 Sync T/T   0  29   0   0    0    0 1.0   0 109   0   0
09:22:21 myclu P   3  3 node3 Sync T/T   0  29   0   0    0    0 1.0   0 109   0   0
09:22:22 myclu P   3  3 node3 Sync T/T   0  29   0   0    0    0 1.0   0 109   0   0

Notice that node3’s queue fills up, it sends one flow control message (to pause), and then flow control is in a paused state 100% of the time. We can tell flow control came from this node because ‘Flow snt’ shows a message sent as soon as flow control is engaged.

Flow control and State transfer donation

Donor nodes should not cause flow control because they are moved from the Synced to the Donor/Desynced state. Donors in that state will continue to apply replication as they are permitted, but will build up a large replication queue without flow control if they are blocked by the underlying SST method, i.e., by FLUSH TABLES WITH READ LOCK.

11 Comments

Anthony Somerset

13 years ago

seems i was caught out by this when running a mysqldump on a particularly large database the other day…

is there anyway to manually set Donor/Desynced status and to exit the State so that manual mysqldumps can be run for other reasons – much akin to doing a stop slave/start slave in traditional mysql replication setups?

Author

Jay Janssen

13 years ago

Anthony, that’s an excellent question. The supported method is to use the Galera arbitration daemon and you can find documentation on that here:

http://www.codership.com/wiki/doku.php?id=data_backup

This is a good topic for a followup blog post, stay tuned…

Raghavendra

13 years ago

@Jay, @anthony:

There is a variable for this — wsrep_sst_donor_rejects_queries
— from https://www.percona.com/doc/percona-xtradb-cluster/wsrep-system-index.html

However, its value is checked only during SST, I think it
shouldn’t take more work to make it work elsewhere as well.

Is this something which will be useful?

Raghavendra

13 years ago

Filed https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1179805 for above.

Anthony Somerset

13 years ago

Thanks the data_backup link was exactly what i needed 🙂 just had to adapt and rename my script and it works

Tim Vaillancourt

11 years ago

Thanks for this great article Jay.

Is it possible to change the gcs.fc_limit and gcs.fc_master_slave variables dynamically?

I was hoping I could write to just 1 node in a cluster (and increase the flow control + set fc_master_slave=YES) but if I were failing over to a different node I could drop back to defaults to ensure the nodes are completely “caught up” ONLY during the fail-over. Is that possible?

Thanks!

Tim

Author

Jay Janssen

11 years ago

@Tim: Yes, set global wsrep_provider_options="gcs.fc_limit=1024", for example. Note that these won’t take effect if the node is IN flow control when you set them, i.e., if the node is stuck and already has a large wsrep_local_recv_queue; they will apply the *next* time a FC calculation is done.

Note that even the default allows nodes to be at least 16 transactions behind in applying. If you want everything “caught up”, you probably want to check out wsrep_causal_reads (and now its replacement, wsrep_sync_wait https://www.percona.com/doc/percona-xtradb-cluster/5.6/wsrep-system-index.html#wsrep_sync_wait)

Tim Vaillancourt

11 years ago

Great, thanks a lot Jay!

Phil Stracchino

11 years ago

@Jay, is there any known/documented interaction between flow control in XtraDB Cluster and SHOW SLAVE STATUS that would cause a running SHOW SLAVE STATUS NONBLOCKING to cause a delay in recovery from flow control? A customer is trying to refactor code in his applications that is causing flow control events in a pre-production five-node 5.6 cluster, of which node 1 is an asynchronous replication slave, and he has discovered that if he kills any running SHOW SLAVE STATUS operations when flow control occurs, the flow control condition is recovered nearly instantly.

Author

Jay Janssen

11 years ago

@phil: I’ve seen situations where a backlogged queue (which is why FC happens in the first place) can cause SHOW STATUS to hang, but this was in older PXC. I can imagine a similar situation with SHOW SLAVE STATUS, but it does seem unlikely. What version are you running?

Phil Stracchino

11 years ago

@Jay,
Server version: 5.6.22-72.0-56-log Percona XtraDB Cluster (GPL), Release rel72.0, Revision 978, WSREP version 25.8, wsrep_25.8.r4150

After further examination, it appears that part of the problem is probably that Cacti is using the NONBLOCKING extension, which exists only in MySQL 5.7.0-5.7.5. One would expect that the cluster should just ignore this and return an error to Cacti. Instead, it appears to trigger an issue with flow control resolution.