PXC – Incremental State transfers in detail

IST Basics

State transfers in Galera remain a mystery to most people.  Incremental State Transfers (IST), as opposed to full State Snapshot Transfers (SST), are used when all of the following conditions hold:

  • The Joiner node reports a valid Galera GTID to the cluster
  • The selected Donor node still has, in its Gcache, all the transactions the Joiner needs to catch up to the rest of the cluster
  • The Donor node can establish a TCP connection to the Joiner on port 4568 (by default)
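The three conditions above can be written down as a small decision function. This is only a sketch of the logic as described here, not Galera's actual implementation: the class names, the field names, and the rule that the Donor's oldest cached seqno must be no greater than the Joiner's seqno plus one are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JoinerState:
    cluster_uuid: str  # UUID part of the GTID the Joiner reports
    seqno: int         # seqno part of the GTID; -1 is treated as "unknown" here


@dataclass
class DonorState:
    cluster_uuid: str
    gcache_first_seqno: int  # oldest write-set still held in the Gcache (assumed field)
    can_reach_joiner: bool   # TCP connection to the Joiner's IST port (4568 by default)


def ist_possible(joiner: JoinerState, donor: DonorState) -> bool:
    """Return True if, under this simplified model, IST can be used instead of SST."""
    # Condition 1: the Joiner reports a valid GTID belonging to this cluster.
    valid_gtid = joiner.seqno >= 0 and joiner.cluster_uuid == donor.cluster_uuid
    # Condition 2: the Donor's Gcache reaches back far enough to cover
    # the first transaction the Joiner is missing.
    has_history = donor.gcache_first_seqno <= joiner.seqno + 1
    # Condition 3: the Donor can open the IST connection to the Joiner.
    return valid_gtid and has_history and donor.can_reach_joiner
```

If any one of the checks fails, the function returns False, which in this model corresponds to falling back to a full SST.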

IST states

Galera has many internal node states related to Joiner nodes.  They currently are:

  1. Joining
  2. Joining: preparing for State Transfer
  3. Joining: requested State Transfer
  4. Joining: receiving State Transfer
  5. Joining: State Transfer request failed
  6. Joining: State Transfer failed
  7. Joined

I don’t claim any special knowledge of most of these states apart from what their titles indicate.  Many of these states occur very briefly, and it is unlikely you’ll ever actually see them in a node’s wsrep_local_state_comment.

During IST, however, I have observed that the following state can take a long while:

Joining: receiving State Transfer

During this state, transactions are being streamed to the Joiner’s wsrep_local_recv_queue.  You can connect to the node at this time and poll its state.  If you do, you’ll easily see the inbound queue increasing (usually quickly) but no writesets being ‘received’ (read: applied).  It’s not clear to me whether there is a reason why transaction apply couldn’t be started during this stream, but it does not do so currently.

The further behind the Joiner is, the longer this can take.  Here’s some output from the latest release of myq-tools showing wsrep stats:

The node is in ‘J:Rc’ (Joining: Receiving) state and we can see the Inbound queue growing (wsrep_local_recv_queue). Otherwise this node is not sending or receiving transactions.


Once all the requested transactions are copied over, the Joiner flips to the ‘Joining’ state, during which it starts applying the transactions as quickly as the wsrep_slave_threads can go.  For example:

Notice that the Inbound msgs counter (wsrep_received) starts increasing rapidly and the queue decreases accordingly.
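The two phases described above can be told apart from two consecutive samples of the status counters. Here is a minimal sketch, assuming the samples have already been collected (for example from SHOW GLOBAL STATUS) into plain dictionaries; the function name and the phase labels are hypothetical, and the rule simply encodes the behavior observed in this post.

```python
def classify_ist_phase(prev: dict, curr: dict) -> str:
    """Guess the IST phase from two samples of wsrep status counters.

    Each sample is a dict holding 'wsrep_local_recv_queue' and 'wsrep_received'.
    The returned labels are illustrative, not Galera state names.
    """
    queue_delta = curr["wsrep_local_recv_queue"] - prev["wsrep_local_recv_queue"]
    received_delta = curr["wsrep_received"] - prev["wsrep_received"]

    # Queue grows while nothing is counted as received: write-sets are
    # still being streamed in ('Joining: receiving State Transfer').
    if queue_delta > 0 and received_delta == 0:
        return "receiving"
    # Received counter climbs while the queue drains: the node is applying
    # what it was sent ('Joining', then 'Joined').
    if received_delta > 0 and queue_delta < 0:
        return "applying"
    return "unknown"
```

Anything that matches neither pattern is reported as "unknown" rather than guessed at, since a node outside of IST can show all sorts of combinations.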


Towards the end, the node briefly switches to the ‘Joined’ state, though that state passes quickly in this case. ‘Joining’ and ‘Joined’ are similar states. The difference (I believe) is that:

  • ‘Joining’ is applying transactions acquired via the IST
  • ‘Joined’ is applying transactions that have queued up via standard Galera replication since the IST (i.e., everything that has happened on the cluster since the IST)

Flow control during Joining/Joined states

The Codership documentation says something interesting about ‘Joined’ (from experimentation, I believe the ‘Joining’ state behaves the same way here):

Nodes in this state can apply write-sets. Flow Control here ensures that the node can eventually catch up with the cluster. It specifically ensures that its write-set cache never grows. Because of this, the cluster wide replication rate remains limited by the rate at which a node in this state can apply write-sets. Since applying write-sets is usually several times faster than processing a transaction, nodes in this state hardly ever effect cluster performance.

What this essentially means is that a Joiner’s wsrep_local_recv_queue is allowed to shrink but NEVER GROW during an IST catchup.  Growth will trigger flow control, but why would it grow?  Writes on other cluster nodes must still be replicated to our Joiner and added to the queue.
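To make the "shrink but never grow" rule concrete, here is a toy simulation. It is a deliberately simplified model, not how Galera implements flow control: it assumes fixed per-second rates and that flow control admits exactly as many cluster writes as the Joiner can apply. All names are made up for illustration.

```python
def simulate_catchup(initial_queue: int, offered_rate: int,
                     apply_rate: int, seconds: int) -> list:
    """Simulate a Joiner's receive queue, one step per second.

    offered_rate: write-sets/sec the rest of the cluster wants to commit
    apply_rate:   write-sets/sec the Joiner can apply
    Returns a list of (queue_length, admitted_rate) tuples.
    """
    queue = initial_queue
    history = []
    for _ in range(seconds):
        # Toy flow control: never admit more than the Joiner can apply,
        # so the queue cannot grow.
        admitted = min(offered_rate, apply_rate)
        queue = max(0, queue + admitted - apply_rate)
        history.append((queue, admitted))
    return history
```

With offered_rate=500 and apply_rate=200 the queue stays flat and the cluster is held to 200 writes per second. With offered_rate=100 and apply_rate=300 the cluster is not throttled and the queue drains by 200 per second.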

If the Joiner’s apply rate is less than the rate of writes coming from Cluster replication, flow control will be applied to slow down Cluster replication (read: your application writes).  As far as I can tell, there is no way to tune this or turn it off.  The Codership manual continues here:

The one occasion when nodes in the JOINED state do effect cluster performance is at the very beginning, when the buffer pool on the node in question is empty.

Essentially a Joiner node with a cold cache can really hurt performance on your cluster.  This can possibly be improved by:

  • Better IO and other resources available to the Joiner for a quicker cache warmup.  A prime example of this would be flash over conventional storage.
  • Buffer pool preloading
  • More Galera apply threads
  • etc.


From what I can tell, the ‘Joined’ state ends when the wsrep_local_recv_queue drops below the node’s configured flow control limit.  At that point it changes to ‘Synced’ and the node behaves more normally (with respect to flow control).
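Given that observation, a rough back-of-the-envelope estimate of how long a node will stay in ‘Joined’ is possible. The sketch below assumes constant rates and takes the flow control limit as a plain number (presumably the value of gcs.fc_limit, though that mapping is an assumption on my part). The function name is hypothetical.

```python
import math


def seconds_until_synced(recv_queue: int, fc_limit: int,
                         apply_rate: float, incoming_rate: float) -> float:
    """Estimate the seconds until the receive queue falls to the flow control limit.

    apply_rate:    write-sets/sec the Joiner applies
    incoming_rate: write-sets/sec arriving via normal replication
    """
    if recv_queue < fc_limit:
        return 0.0  # already below the limit
    net_drain = apply_rate - incoming_rate
    if net_drain <= 0:
        # The queue is not shrinking, so under this model the node never catches up.
        return math.inf
    return (recv_queue - fc_limit) / net_drain
```

For example, a queue of 1016 with a limit of 16, applying 300 per second against 100 per second incoming, needs to drain 1000 write-sets at a net 200 per second, so about 5 seconds.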