Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

PXC – Incremental State transfers in detail

August 5, 2015

Author

Jay Janssen

MySQL

Percona Software

Share this Post:

IST Basics

State transfers in Galera remain a mystery to most people. Incremental State transfers (as opposed to full State Snapshot transfers) are used under the following conditions:

- The Joiner node reports Galera a valid Galera GTID to the cluster

- The Donor node selected contains all the transactions the Joiner needs to catch up to the rest of the cluster in its Gcache

- The Donor node can establish a TCP connection to the Joiner on port 4568 (by default)

IST states

Galera has many internal node states related to Joiner nodes. They currently are:

1. Joining

1. Joining: preparing for State Transfer

1. Joining: requested State Transfer

1. Joining: receiving State Transfer

1. Joining: State Transfer request failed

1. Joining: State Transfer failed

1. Joined

I don’t claim any special knowledge of most of these states apart from what their titles indicate. Many of these states are occur very briefly and it is unlikely you’ll ever actually see them on a node’s wsrep_local_state_comment.

During IST, however, I have observed the following states have the potential to take a long while:

Joining: receiving State Transfer

During this state transactions are being streamed to the Joiner’s wsrep_local_recv_queue. You can connect to the node at this time and poll state. If you do, you’ll easily see the inbound queue increasing (usually quickly) but no writesets being ‘received’ (read: applied). It’s not clear to me if there is a reason why transction apply couldn’t be started during this steam, but it does not do so currently.

The further behind the Joiner is, the longer this can take. Here’s some output from the latest release of myq-tools showing wsrep stats:

[root@node2 ~]# myq_status wsrep
mycluster / node2 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:40 P   4  2 J:Rc 0.4ms    0   0b   0    1 197b  4k   0ns   0   0   0  367k    0  94%
14:04:41 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  5k   0ns   0   0   0  368k    0  93%
14:04:42 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  6k   0ns   0   0   0  371k    0  92%
14:04:43 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  7k   0ns   0   0   0  373k    0  92%
14:04:44 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  8k   0ns   0   0   0  376k    0  92%
14:04:45 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 10k   0ns   0   0   0  379k    0  92%
14:04:46 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 11k   0ns   0   0   0  382k    0  92%
14:04:47 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 12k   0ns   0   0   0  386k    0  91%
14:04:48 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 13k   0ns   0   0   0  390k    0  91%
14:04:49 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 14k   0ns   0   0   0  394k    0  91%
14:04:50 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 15k   0ns   0   0   0  397k    0  91%
14:04:51 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 16k   0ns   0   0   0  401k    0  91%
14:04:52 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 18k   0ns   0   0   0  404k    0  91%
14:04:53 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 19k   0ns   0   0   0  407k    0  91%
14:04:54 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 20k   0ns   0   0   0  411k    0  91%

[root@node2 ~]# myq_status wsrep

mycluster / node2 (idx: 1) / Galera 3.11(ra0189ab)

Cluster Node Outbound Inbound FlowC Conflct Gcache Appl

time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef

14:04:40 P 4 2 J:Rc 0.4ms 0 0b 0 1 197b 4k 0ns 0 0 0 367k 0 94%

14:04:41 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 5k 0ns 0 0 0 368k 0 93%

14:04:42 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 6k 0ns 0 0 0 371k 0 92%

14:04:43 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 7k 0ns 0 0 0 373k 0 92%

14:04:44 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 8k 0ns 0 0 0 376k 0 92%

14:04:45 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 10k 0ns 0 0 0 379k 0 92%

14:04:46 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 11k 0ns 0 0 0 382k 0 92%

14:04:47 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 12k 0ns 0 0 0 386k 0 91%

14:04:48 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 13k 0ns 0 0 0 390k 0 91%

14:04:49 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 14k 0ns 0 0 0 394k 0 91%

14:04:50 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 15k 0ns 0 0 0 397k 0 91%

14:04:51 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 16k 0ns 0 0 0 401k 0 91%

14:04:52 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 18k 0ns 0 0 0 404k 0 91%

14:04:53 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 19k 0ns 0 0 0 407k 0 91%

14:04:54 P 4 2 J:Rc 0.4ms 0 0b 0 0 0b 20k 0ns 0 0 0 411k 0 91%

The node is in ‘J:Rc’ (Joining: Receiving) state and we can see the Inbound queue growing (wsrep_local_recv_queue). Otherwise this node is not sending or receiving transactions.

Joining

Once all the requested transactions are copied over, the Joiner flips to the ‘Joining’ state, during which it starts applying the transactions as quickly as the wsrep_slave_threads can go. For example:

         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:55 P   4  2 Jing 0.6ms    0   0b   0 2243 3.7M 19k   0ns   0   0   0  2236  288  91%
14:04:56 P   4  2 Jing 0.5ms    0   0b   0 4317 7.0M 16k   0ns   0   0   0  6520  199  92%
14:04:57 P   4  2 Jing 0.5ms    0   0b   0 4641 7.5M 12k   0ns   0   0   0 11126  393  92%
14:04:58 P   4  2 Jing 0.4ms    0   0b   0 4485 7.2M  9k   0ns   0   0   0 15575  200  93%
14:04:59 P   4  2 Jing 0.5ms    0   0b   0 4564 7.4M  5k   0ns   0   0   0 20102  112  93%

Cluster Node Outbound Inbound FlowC Conflct Gcache Appl

time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef

14:04:55 P 4 2 Jing 0.6ms 0 0b 0 2243 3.7M 19k 0ns 0 0 0 2236 288 91%

14:04:56 P 4 2 Jing 0.5ms 0 0b 0 4317 7.0M 16k 0ns 0 0 0 6520 199 92%

14:04:57 P 4 2 Jing 0.5ms 0 0b 0 4641 7.5M 12k 0ns 0 0 0 11126 393 92%

14:04:58 P 4 2 Jing 0.4ms 0 0b 0 4485 7.2M 9k 0ns 0 0 0 15575 200 93%

14:04:59 P 4 2 Jing 0.5ms 0 0b 0 4564 7.4M 5k 0ns 0 0 0 20102 112 93%

Notice the Inbound msgs (wsrep_received) starts increasing rapidly and the queue decreases accordingly.

Joined

14:05:00 P   4  2 Jned 0.5ms    0   0b   0 4631 7.5M  2k   0ns   0   0   0 24692   96  94%

1	14:05:00 P 4 2 Jned 0.5ms 0 0b 0 4631 7.5M 2k 0ns 0 0 0 24692 96 94%

Towards the end the node briefly switches to the ‘Joined’ state, though that is a fast state in this case. ‘Joining’ and ‘Joined’ are similar states, the difference (I believe) is that:

- ‘Joining’ is applying transactions acquired via the IST

- ‘Joined’ is applying transactions that have queued up via standard Galera replication since the IST (i.e., everything has been happening on the cluster since the IST)

Flow control during Joining/Joined states

The Codership documentation says something interesting about ‘Joined’ (from experimentation, I believe the ‘Joining’ state behaves the same here.):

Nodes in this state can apply write-sets. Flow Control here ensures that the node can eventually catch up with the cluster. It specifically ensures that its write-set cache never grows. Because of this, the cluster wide replication rate remains limited by the rate at which a node in this state can apply write-sets. Since applying write-sets is usually several times faster than processing a transaction, nodes in this state hardly ever effect cluster performance.

What this essentially means is that a Joiner’s wsrep_local_recv_queue is allowed to shrink but NEVER GROW during an IST catchup. Growth will trigger flow control, but why would it grow? Writes on other cluster nodes must still be replicated to our Joiner and added to the queue.

If the Joiner’s apply rate is less than the rate of writes coming from Cluster replication, flow control will be applied to slow down Cluster replication (read: your application writes). As far as I can tell, there is no way to tune this or turn it off. The Codership manual continues here:

The one occasion when nodes in the JOINED state do effect cluster performance is at the very beginning, when the buffer pool on the node in question is empty.

Essentially a Joiner node with a cold cache can really hurt performance on your cluster. This can possibly be improved by:

- Better IO and other resources available to the Joiner for a quicker cache warmup. A huge example of this would be flash over convention storage.

- Buffer pool preloading

- More Galera apply threads

- etc.

Synced

From what I can tell, the ‘Joined’ state ends when the wsrep_local_recv_queue drops lower than the node’s configured flow control limit. At that point it changes to ‘Synced’ and the node behaves more normally (WRT to flow control).

         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:05:01 P   4  2 Sync 0.5ms    0   0b   0 3092 5.0M   0   0ns   0   0   0 27748  150  94%
14:05:02 P   4  2 Sync 0.5ms    0   0b   0 1067 1.7M   0   0ns   0   0   0 28804  450  93%
14:05:03 P   4  2 Sync 0.5ms    0   0b   0 1164 1.9M   0   0ns   0   0   0 29954   67  92%
14:05:04 P   4  2 Sync 0.5ms    0   0b   0 1166 1.9M   0   0ns   0   0   0 31107  280  92%
14:05:05 P   4  2 Sync 0.5ms    0   0b   0 1160 1.9M   0   0ns   0   0   0 32258  606  91%
14:05:06 P   4  2 Sync 0.5ms    0   0b   0 1154 1.9M   0   0ns   0   0   0 33401  389  90%
14:05:07 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   1   0ns   0   0   0 34534  297  90%
14:05:08 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   0   0ns   0   0   0 35667  122  89%
14:05:09 P   4  2 Sync 0.5ms    0   0b   0 1121 1.8M   0   0ns   0   0   0 36778  617  88%

Cluster Node Outbound Inbound FlowC Conflct Gcache Appl

time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef

14:05:01 P 4 2 Sync 0.5ms 0 0b 0 3092 5.0M 0 0ns 0 0 0 27748 150 94%

14:05:02 P 4 2 Sync 0.5ms 0 0b 0 1067 1.7M 0 0ns 0 0 0 28804 450 93%

14:05:03 P 4 2 Sync 0.5ms 0 0b 0 1164 1.9M 0 0ns 0 0 0 29954 67 92%

14:05:04 P 4 2 Sync 0.5ms 0 0b 0 1166 1.9M 0 0ns 0 0 0 31107 280 92%

14:05:05 P 4 2 Sync 0.5ms 0 0b 0 1160 1.9M 0 0ns 0 0 0 32258 606 91%

14:05:06 P 4 2 Sync 0.5ms 0 0b 0 1154 1.9M 0 0ns 0 0 0 33401 389 90%

14:05:07 P 4 2 Sync 0.5ms 0 0b 0 1147 1.8M 1 0ns 0 0 0 34534 297 90%

14:05:08 P 4 2 Sync 0.5ms 0 0b 0 1147 1.8M 0 0ns 0 0 0 35667 122 89%

14:05:09 P 4 2 Sync 0.5ms 0 0b 0 1121 1.8M 0 0ns 0 0 0 36778 617 88%

Conclusion

You may notice these states during IST if you aren’t watching the Joiner closely, but if your IST is talking a long while, it should be easy using the above situation to understand what is happening.