November 26, 2014

Galera Flow Control in Percona XtraDB Cluster for MySQL

Last week at Percona Live, I delivered a six-hour tutorial about Percona XtraDB Cluster (PXC) for MySQL.  I actually had more material than I covered (by design), but one thing I regret we didn’t cover was Flow control.  So, I thought I’d write a post covering flow control because it is important to understand.

What is flow control?

One of the things that people don’t often expect when switching to Galera is existence of a replication feedback mechanism, unlike anything you find in standard async MySQL replication. It is my belief that the lack of understanding of this system, or even that it exists, leads to unnecessary frustration with Galera and cluster “stalls” that are preventable.

This feedback, called flow control, allows any node in the cluster to instruct the group when it needs replication to pause and when it is ready for replication to continue. This prevents any node in the synchronous replication group from getting too far behind the others in applying replication.

This may sound counter-intuitive at first: how would synchronous replication get behind? As I’ve mentioned before, Galera’s replication is synchronous to the point of ensuring transactions are copied to all nodes and global ordering is established, but apply and commit is asynchronous on all but the node the transaction is run on.

It’s important to realize that Galera prevents conflicts to such transactions that have been certified but not yet applied, so multi-node writing will not lead to inconsistencies, but that is beyond the scope of this post.

Galera Flow Control in Percona XtraDB Cluster for MySQLTuning flow control

Flow control is triggered when a Synced node exceeds a specific threshold relative to the size of the receive queue (visible via the wsrep_local_recv_queue global status variable). Donor/Desynced nodes do not apply flow control, though they may enter states where the recv_queue grows substantially. Therefore care should be taken for applications to avoid using Donor/Desynced nodes, particularly when using a blocking SST method like rsync or mysqldump.

So, flow control kicks in when the recv queue gets too big, but how big is that? And when is flow control relaxed? There are a few settings that are relevant here, and they are all configured via the wsrep_provider_options global variable.

gcs.fc_limit

This setting controls when flow control engages. Simply speaking, if the wsrep_local_recv_queue exceeds this size on a given node, a pausing flow control message will be sent. However, it’s a bit trickier than that, because of fc_master_slave (see below).

The fc_limit defaults to 16 transactions. This effectively means that this is as far as a given node can be behind committing transactions from the cluster.

gcs.fc_master_slave

The fc_limit is modified dynamically if you have fc_master_slave disabled (which it is by default). This mode actually adjusts the fc_limit dynamically based on the number of nodes in the cluster. The more nodes in the cluster, the larger the calculated fc_limit becomes. The theory behind this is that the larger the cluster gets (and presumably busier with more writes coming from more nodes), the more leeway each node will get to be a bit further behind applying.

If you only write to a single node in PXC, then it is recommended you disable this feature by setting fc_master_slave=YES. Despite its name, this setting really does no more than to change if the fc_limit is dynamically resized or not. It contains no other magic that helps single node writing in PXC to perform better.

gcs.fc_factor

If fc_limit controls when flow control is enabled, then fc_factor addresses when it is released. The factor is a number between 0.0 and 1.0, which is multiplied by the current fc_limit (adjusted by the above calculation if fc_master_slave=NO). This yields the number of transactions the recv queue must fall BELOW before another flow control message is sent by the node giving the cluster permission to continue replication.

This setting traditionally defaulted to 0.5, meaning the queue had to fall below 50% of the fc_limit before replication was resumed. A large fc_limit in this case might mean a long wait before flow control gets relaxed again. However, this was recently modified to a default of 1.0 to allow replication to resume as soon as possible.

An example configuration tuning flow control in a master/slave cluster might be:

Working with flow control

What happens during flow control

Simply speaking: flow control makes replication stop, and therefore makes writes (which are synchronous) stop, on all nodes until flow control is relaxed.

In normal operation we would expect that a large receive queue might be the result of some brief performance issue on a given node, or perhaps the effect of some large transaction briefly stalling an applier thread.

However, it is possible to halt queue applying on any node by simply by running “FLUSH TABLES WITH READ LOCK”, or perhaps by “LOCK TABLE”, in which case flow control will kick in just as soon as the fc_limit is exceeded. Therefore, care must be taken that your application or some other maintenance operation (like a backup) doesn’t inadvertently cause flow control on your cluster.

The cost of increasing the fc_limit

Keeping the fc_limit small has two three purposes:

  1. It limits the amount of delay any node in the cluster might have applying cluster transactions. Therefore, it keeps reads more up to date without needing to use wsrep_causal_reads.
  2. It minimizes the expense of certification by keeping the window between new transactions being committed and the oldest unapplied transaction small. The larger the queue is, the more costly certification gets. EDIT: actually the cost of certification depends only the size of the transactions, which translates into number of unique key lookups into the certification index, which is a hash table.  A small fc_limit does however keep the certification index smaller in memory.
  3. It keeps the certification interval small, which minimizes replication conflicts on a cluster where writes happen on all nodes.

On a master/slave cluster, therefore, it’s reasonable to increase the fc_limit because the only lagging nodes will be the slaves with no writes coming from them. However, with multi-node writing, larger queues will make certification more expensive replication conflicts more likely and therefore time-consuming to the application.

How to tell if flow control is happening and where it is coming from

There are two global status variables you can check to see what flow control is happening:

  • wsrep_flow_control_paused – the fraction of time (out of 1.0) since the last SHOW GLOBAL STATUS that flow control is effect, regardless of which node caused it. Generally speaking, anything above 0.0 is to be avoided.
  • wsrep_flow_control_sent – the number of flow control messages sent by the local node to the cluster. This can be used to discover which node is causing flow control.

I would strongly recommend monitoring and graphing wsrep_flow_control_sent so you can tell if and when flow control is happening and what node (or nodes) are causing it.

Using myq_gadgets, I can easily see flow control if I execute a FLUSH TABLES WITH READ LOCK on node3:

Notice node3’s queue fills up, it sends 1 flow control message (to pause) and then Flow control is in a pause state 100% of the time.  We can tell flow control came from this node because ‘Flow snt’  shows a message sent as soon as flow control is engaged.

Flow control and State transfer donation

Donor nodes should not cause flow control because they are moved from the Synced to the Donor/Desynced state. Donors in that state will continue to apply replication as they are permitted, but will build up a large replication queue without flow control if they are blocked by the underlying SST method, i.e., by FLUSH TABLES WITH READ LOCK.

About Jay Janssen

Jay joined Percona in 2011 after 7 years at Yahoo working in a variety of fields including High Availability architectures, MySQL training, tool building, global server load balancing, multi-datacenter environments, operationalization, and monitoring. He holds a B.S. of Computer Science from Rochester Institute of Technology.

Comments

  1. seems i was caught out by this when running a mysqldump on a particularly large database the other day…

    is there anyway to manually set Donor/Desynced status and to exit the State so that manual mysqldumps can be run for other reasons – much akin to doing a stop slave/start slave in traditional mysql replication setups?

  2. Anthony, that’s an excellent question. The supported method is to use the Galera arbitration daemon and you can find documentation on that here:

    http://www.codership.com/wiki/doku.php?id=data_backup

    This is a good topic for a followup blog post, stay tuned…

  3. @Jay, @Anthony:

    There is a variable for this — wsrep_sst_donor_rejects_queries
    — from http://www.percona.com/doc/percona-xtradb-cluster/wsrep-system-index.html

    However, its value is checked only during SST, I think it
    shouldn’t take more work to make it work elsewhere as well.

    Is this something which will be useful?

  4. Thanks the data_backup link was exactly what i needed :) just had to adapt and rename my script and it works

  5. Thanks for this great article Jay.

    Is it possible to change the gcs.fc_limit and gcs.fc_master_slave variables dynamically?

    I was hoping I could write to just 1 node in a cluster (and increase the flow control + set fc_master_slave=YES) but if I were failing over to a different node I could drop back to defaults to ensure the nodes are completely “caught up” ONLY during the fail-over. Is that possible?

    Thanks!

    Tim

  6. @Tim: Yes, ‘set global wsrep_provider_options=”gcs.fc_limit=1024″‘, for example. Note that these won’t take effect if the node is IN flow control when you set them — i.e., if the node is stuck and it already has a large wsrep_local_recv_queue, they will apply the *next* time a FC calculation is done.

    Note that even the default allows nodes to be at least 16 transactions behind in applying. If you want everything “caught up”, you probably want to check out wsrep_causal_reads (and now its replacement, wsrep_sync_wait http://www.percona.com/doc/percona-xtradb-cluster/5.6/wsrep-system-index.html#wsrep_sync_wait)

  7. Great, thanks a lot Jay!

Speak Your Mind

*