Galera Flow Control in Percona XtraDB Cluster for MySQLJay Janssen
Last week at Percona Live, I delivered a six-hour tutorial about Percona XtraDB Cluster (PXC) for MySQL. I actually had more material than I covered (by design), but one thing I regret we didn’t cover was Flow control. So, I thought I’d write a post covering flow control because it is important to understand.
What is flow control?
One of the things that people don’t often expect when switching to Galera is existence of a replication feedback mechanism, unlike anything you find in standard async MySQL replication. It is my belief that the lack of understanding of this system, or even that it exists, leads to unnecessary frustration with Galera and cluster “stalls” that are preventable.
This feedback, called flow control, allows any node in the cluster to instruct the group when it needs replication to pause and when it is ready for replication to continue. This prevents any node in the synchronous replication group from getting too far behind the others in applying replication.
This may sound counter-intuitive at first: how would synchronous replication get behind? As I’ve mentioned before, Galera’s replication is synchronous to the point of ensuring transactions are copied to all nodes and global ordering is established, but apply and commit is asynchronous on all but the node the transaction is run on.
It’s important to realize that Galera prevents conflicts to such transactions that have been certified but not yet applied, so multi-node writing will not lead to inconsistencies, but that is beyond the scope of this post.
Flow control is triggered when a Synced node exceeds a specific threshold relative to the size of the receive queue (visible via the wsrep_local_recv_queue global status variable). Donor/Desynced nodes do not apply flow control, though they may enter states where the recv_queue grows substantially. Therefore care should be taken for applications to avoid using Donor/Desynced nodes, particularly when using a blocking SST method like rsync or mysqldump.
So, flow control kicks in when the recv queue gets too big, but how big is that? And when is flow control relaxed? There are a few settings that are relevant here, and they are all configured via the wsrep_provider_options global variable.
This setting controls when flow control engages. Simply speaking, if the wsrep_local_recv_queue exceeds this size on a given node, a pausing flow control message will be sent. However, it’s a bit trickier than that, because of fc_master_slave (see below).
The fc_limit defaults to 16 transactions. This effectively means that this is as far as a given node can be behind committing transactions from the cluster.
The fc_limit is modified dynamically if you have fc_master_slave disabled (which it is by default). This mode actually adjusts the fc_limit dynamically based on the number of nodes in the cluster. The more nodes in the cluster, the larger the calculated fc_limit becomes. The theory behind this is that the larger the cluster gets (and presumably busier with more writes coming from more nodes), the more leeway each node will get to be a bit further behind applying.
If you only write to a single node in PXC, then it is recommended you disable this feature by setting fc_master_slave=YES. Despite its name, this setting really does no more than to change if the fc_limit is dynamically resized or not. It contains no other magic that helps single node writing in PXC to perform better.
If fc_limit controls when flow control is enabled, then fc_factor addresses when it is released. The factor is a number between 0.0 and 1.0, which is multiplied by the current fc_limit (adjusted by the above calculation if fc_master_slave=NO). This yields the number of transactions the recv queue must fall BELOW before another flow control message is sent by the node giving the cluster permission to continue replication.
This setting traditionally defaulted to 0.5, meaning the queue had to fall below 50% of the fc_limit before replication was resumed. A large fc_limit in this case might mean a long wait before flow control gets relaxed again. However, this was recently modified to a default of 1.0 to allow replication to resume as soon as possible.
An example configuration tuning flow control in a master/slave cluster might be:
mysql> set global wsrep_provider_options="gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0";
Working with flow control
What happens during flow control
Simply speaking: flow control makes replication stop, and therefore makes writes (which are synchronous) stop, on all nodes until flow control is relaxed.
In normal operation we would expect that a large receive queue might be the result of some brief performance issue on a given node, or perhaps the effect of some large transaction briefly stalling an applier thread.
However, it is possible to halt queue applying on any node by simply by running “FLUSH TABLES WITH READ LOCK”, or perhaps by “LOCK TABLE”, in which case flow control will kick in just as soon as the fc_limit is exceeded. Therefore, care must be taken that your application or some other maintenance operation (like a backup) doesn’t inadvertently cause flow control on your cluster.
The cost of increasing the fc_limit
Keeping the fc_limit small has
two three purposes:
- It limits the amount of delay any node in the cluster might have applying cluster transactions. Therefore, it keeps reads more up to date without needing to use wsrep_causal_reads.
It minimizes the expense of certification by keeping the window between new transactions being committed and the oldest unapplied transaction small. The larger the queue is, the more costly certification gets.EDIT: actually the cost of certification depends only the size of the transactions, which translates into number of unique key lookups into the certification index, which is a hash table. A small fc_limit does however keep the certification index smaller in memory.
- It keeps the certification interval small, which minimizes replication conflicts on a cluster where writes happen on all nodes.
On a master/slave cluster, therefore, it’s reasonable to increase the fc_limit because the only lagging nodes will be the slaves with no writes coming from them. However, with multi-node writing, larger queues will make
certification more expensive replication conflicts more likely and therefore time-consuming to the application.
How to tell if flow control is happening and where it is coming from
There are two global status variables you can check to see what flow control is happening:
- wsrep_flow_control_paused – the fraction of time (out of 1.0) since the last SHOW GLOBAL STATUS that flow control is effect, regardless of which node caused it. Generally speaking, anything above 0.0 is to be avoided.
- wsrep_flow_control_sent – the number of flow control messages sent by the local node to the cluster. This can be used to discover which node is causing flow control.
I would strongly recommend monitoring and graphing wsrep_flow_control_sent so you can tell if and when flow control is happening and what node (or nodes) are causing it.
Using myq_gadgets, I can easily see flow control if I execute a FLUSH TABLES WITH READ LOCK on node3:
[root@node3 ~]# myq_status wsrep
Wsrep Cluster Node Queue Ops Bytes Flow Conflct
time name P cnf # name cmt sta Up Dn Up Dn Up Dn pau snt dst lcf bfa
09:22:17 myclu P 3 3 node3 Sync T/T 0 0 0 9 0 13K 0.0 0 101 0 0
09:22:18 myclu P 3 3 node3 Sync T/T 0 0 0 18 0 28K 0.0 0 108 0 0
09:22:19 myclu P 3 3 node3 Sync T/T 0 4 0 3 0 4.3K 0.0 0 109 0 0
09:22:20 myclu P 3 3 node3 Sync T/T 0 18 0 0 0 0 0.0 0 109 0 0
09:22:21 myclu P 3 3 node3 Sync T/T 0 27 0 0 0 0 0.0 0 109 0 0
09:22:22 myclu P 3 3 node3 Sync T/T 0 29 0 0 0 0 0.9 1 109 0 0
09:22:23 myclu P 3 3 node3 Sync T/T 0 29 0 0 0 0 1.0 0 109 0 0
09:22:24 myclu P 3 3 node3 Sync T/T 0 29 0 0 0 0 1.0 0 109 0 0
09:22:25 myclu P 3 3 node3 Sync T/T 0 29 0 0 0 0 1.0 0 109 0 0
09:22:26 myclu P 3 3 node3 Sync T/T 0 29 0 0 0 0 1.0 0 109 0 0
09:22:27 myclu P 3 3 node3 Sync T/T 0 29 0 0 0 0 1.0 0 109 0 0
09:22:20 myclu P 3 3 node3 Sync T/T 0 29 0 0 0 0 1.0 0 109 0 0
09:22:21 myclu P 3 3 node3 Sync T/T 0 29 0 0 0 0 1.0 0 109 0 0
09:22:22 myclu P 3 3 node3 Sync T/T 0 29 0 0 0 0 1.0 0 109 0 0
Notice node3’s queue fills up, it sends 1 flow control message (to pause) and then Flow control is in a pause state 100% of the time. We can tell flow control came from this node because ‘Flow snt’ shows a message sent as soon as flow control is engaged.
Flow control and State transfer donation
Donor nodes should not cause flow control because they are moved from the Synced to the Donor/Desynced state. Donors in that state will continue to apply replication as they are permitted, but will build up a large replication queue without flow control if they are blocked by the underlying SST method, i.e., by FLUSH TABLES WITH READ LOCK.