Monitoring flow control in a Galera cluster is important: without it, you will not understand why writes sometimes stall. Percona XtraDB Cluster 5.6 provides two status variables for this: wsrep_flow_control_paused and wsrep_flow_control_paused_ns. Which one should you use?
What is flow control?
Flow control exists only in Galera replication, not in regular MySQL replication. It is the mechanism a node uses when it cannot keep up with the write load: to keep replication synchronous, the node that starts to lag instructs the other nodes to pause writes for some time so that it does not fall too far behind.
If you are not familiar with this notion, you should read this blog post.
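To make the mechanism more concrete, here is a toy Python model of a single lagging node. It is only a sketch and not how Galera implements flow control: the function name simulate, the per-tick rates and the rule "pause while the queue is above fc_limit" are assumptions made for illustration.

```python
def simulate(write_rate, apply_rate, fc_limit, ticks):
    """Toy model of one lagging node (hypothetical, not Galera's real algorithm).

    write_rate: writesets replicated to the node per tick while writes are allowed
    apply_rate: writesets the node can apply per tick
    fc_limit:   queue length above which the node asks the cluster to pause
    """
    queue = 0
    paused_ticks = 0
    history = []
    for _ in range(ticks):
        paused = queue > fc_limit  # the lagging node asks the others to pause
        if paused:
            paused_ticks += 1
        else:
            queue += write_rate  # writes keep arriving in the receive queue
        queue = max(queue - apply_rate, 0)  # the node applies what it can
        history.append((queue, paused))
    return paused_ticks, history


if __name__ == "__main__":
    paused_ticks, history = simulate(write_rate=3, apply_rate=1, fc_limit=4, ticks=6)
    print(paused_ticks, history)
```

In this model, writes are paused only for as long as the slow node needs to drain its queue back to the limit, which is the behavior described above: the cluster slows down to the pace of its slowest member.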
Triggering flow control and graphing it
For this test, we’ll use a 3-node Percona XtraDB Cluster 5.6 setup. On node 3, we will adjust gcs.fc_limit so that flow control is triggered very quickly, and then we will lock the node:
pxc3> set global wsrep_provider_options="gcs.fc_limit=1";
pxc3> flush tables with read lock;
Now we will use sysbench to insert rows on node 1:
$ sysbench --test=oltp --oltp-table-size=50000 --mysql-user=root --mysql-socket=/tmp/pxc1.sock prepare
Because of flow control, writes will be stalled and sysbench will hang. So after some time, we will release the lock on node 3:
pxc3> unlock tables;
During the whole process, wsrep_flow_control_paused and wsrep_flow_control_paused_ns are recorded every second with mysqladmin ext -i1. We can then graph the evolution of both variables.
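If you want to reproduce the recording, the samples can be pulled out of the mysqladmin output with a few lines of Python. This is a minimal sketch that assumes the usual table layout printed by mysqladmin ext (one "| name | value |" row per variable, with a Variable_name header row at the start of every interval). parse_samples is a hypothetical helper name, not part of any tool.

```python
import re

# One table row of mysqladmin output: "| name | value |"
ROW = re.compile(r"^\|\s*(\S+)\s*\|\s*(\S*)\s*\|\s*$")
WANTED = ("wsrep_flow_control_paused", "wsrep_flow_control_paused_ns")


def parse_samples(text):
    """Return one dict per reporting interval with the two flow control variables."""
    samples = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # border lines such as +------+------+
        name, value = match.groups()
        if name == "Variable_name":
            samples.append({})  # the header row starts a new interval
        elif name in WANTED and samples:
            # the _ns variable is read as an integer, the other one as a float
            samples[-1][name] = int(value) if name.endswith("_ns") else float(value)
    return samples
```

For example, after saving the output with mysqladmin ext -i1 > flow.log (the file name is arbitrary), parse_samples(open("flow.log").read()) returns one dict per second, ready to be plotted.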
While we can clearly see on both graphs when flow control was triggered, it is much easier to tell when it stopped with wsrep_flow_control_paused_ns. This would be even more obvious if there had been several periods during which flow control was in effect.
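The reason wsrep_flow_control_paused_ns shows the end of flow control so clearly can be sketched in a few lines. Assuming it is a cumulative counter of the nanoseconds spent paused (which is what its name and the graph suggest), the difference between two consecutive readings tells how much of that interval was spent paused, and the difference drops back to 0 as soon as flow control stops. pause_fraction and its interval_s parameter are hypothetical names used for illustration.

```python
def pause_fraction(ns_samples, interval_s=1.0):
    """Fraction of each interval spent paused, from cumulative nanosecond readings.

    ns_samples: successive wsrep_flow_control_paused_ns values (assumed cumulative)
    interval_s: seconds between two readings (1 for `mysqladmin ext -i1`)
    """
    interval_ns = interval_s * 1e9
    fractions = []
    for previous, current in zip(ns_samples, ns_samples[1:]):
        delta = max(current - previous, 0)  # a counter reset is treated as "no pause"
        fractions.append(min(delta / interval_ns, 1.0))  # cap sampling jitter at 100%
    return fractions


if __name__ == "__main__":
    # Made-up readings, one per second: no pause, a pause starts, full pause, released
    readings = [0, 0, 800_000_000, 1_800_000_000, 1_800_000_000]
    print(pause_fraction(readings))
```

With these made-up readings the result is 0.0, 0.8, 1.0 and 0.0: the last value returns to zero as soon as the counter stops growing, which is exactly the "flow control has stopped" signal that is hard to read from a value averaged since startup.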
Monitoring a server is obviously necessary if you want to catch issues, but you need to look at the right metrics. So don’t be alarmed if you see that wsrep_flow_control_paused is not 0: it simply means that flow control has been triggered at some point since the server started up. If you want to know what is happening right now, prefer