Automatic replication relaying in Galera 3.x (available with PXC 5.6)

December 19, 2013

Author

Jay Janssen

A decade ago MySQL folks were in love with the concept of a relay slave for MySQL high availability across data centers. A relay is a single slave in a remote data center that receives replication from the global master and, in turn, replicates to all the other local slaves in that data center. This saved a lot of bandwidth, especially back in the days before memcached when scaling reads meant lots of slaves. Sending 20 copies of your replication stream cross-WAN gets expensive.

In Galera and Percona XtraDB Cluster (PXC), by default when a transaction commits on a given node it is sent to every other node in the cluster from that node. That is, the actual writeset payload (the RBR events) is sent over the network to every other node, so the bandwidth to replicate is roughly:

<writeset size> * (<number of nodes> - 1)

If any of your nodes happen to be in a remote data center, the replication is still duplicated for each remote node, much like a master-slave topology without a relay.
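
As a rough sanity check on the formula above, here is a minimal Python sketch. The function name and the 1 KB writeset size are illustrative assumptions, not measurements from this test, and the model ignores group communication overhead.

```python
def replication_bytes(writeset_size: int, num_nodes: int) -> int:
    """Bytes the originating node sends to replicate one writeset
    when every other node receives its own copy (no segments)."""
    if num_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return writeset_size * (num_nodes - 1)


# Hypothetical 1 KB writeset in a 3-node cluster: two copies leave the origin.
print(replication_bytes(1024, 3))  # 2048
```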

Replication traffic with default Galera tuning (and pre-3.x)

To illustrate this, I set up a 3-node PXC 5.6 cluster test environment (it would work the same on PXC 5.5 and Galera 2.x).

This isn’t the best design for HA, but let’s assume nodes 2 and 3 are in a remote data center. If I use some simple iptables ACCEPT rules in the OUTPUT chain, I can easily track the amount of bandwidth replication uses on each node in a simple 1 minute sysbench update-only test that writes only on node1:

	 pkts bytes target     prot opt in     out     source               destination
node1:
	24689   18M ACCEPT     tcp  --  any    eth1    192.168.70.2         192.168.70.3
	24389   18M ACCEPT     tcp  --  any    eth1    192.168.70.2         192.168.70.4
node2:
	24802 2977K ACCEPT     tcp  --  any    eth1    192.168.70.3         192.168.70.2
	20758 2767K ACCEPT     tcp  --  any    eth1    192.168.70.3         192.168.70.4
node3:
	22764 2871K ACCEPT     tcp  --  any    eth1    192.168.70.4         192.168.70.2
	20872 2772K ACCEPT     tcp  --  any    eth1    192.168.70.4         192.168.70.3

We can see that node1 sends a full 18M of data to each of node2 and node3. The traffic from nodes 2 and 3 between each other and back to node1 is group communication; you can think of it as replication acknowledgements and other cluster communication.
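
To compare nodes at a glance, the byte counters can be summed per sending node. The sketch below is only an illustration: the helper name is made up, the suffix multipliers (K = 10^3, M = 10^6) are an assumption about how iptables rounds its verbose counters, and because the counters are already rounded the totals are approximate.

```python
# Decimal suffixes as printed in verbose iptables listings (assumed: K = 10^3, M = 10^6).
SUFFIXES = {"K": 1_000, "M": 1_000_000, "G": 1_000_000_000}


def parse_bytes(field: str) -> int:
    """Convert a rounded iptables byte counter such as '18M' to bytes."""
    if field and field[-1] in SUFFIXES:
        return int(field[:-1]) * SUFFIXES[field[-1]]
    return int(field)


# Byte counters copied from the table above, keyed by (source, destination).
sent = {
    ("node1", "node2"): "18M",
    ("node1", "node3"): "18M",
    ("node2", "node1"): "2977K",
    ("node2", "node3"): "2767K",
    ("node3", "node1"): "2871K",
    ("node3", "node2"): "2772K",
}

# Sum outbound bytes per sending node.
totals = {}
for (src, _dst), counter in sent.items():
    totals[src] = totals.get(src, 0) + parse_bytes(counter)

for node, total in sorted(totals.items()):
    print(f"{node}: {total / 1_000_000:.1f} MB sent")
```

By this estimate node1 sends roughly 36 MB in total, against under 6 MB from each of the other two nodes.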

Replication traffic with Galera 3 WAN segments configured

Galera 3 (available with PXC 5.6) introduces a new feature called WAN segments that basically implements the relay-slave concept, but in a more elegant way. To enable this, we simply assign each node in a given data center a common gmcast.segment integer in wsrep_provider_options. Each data center must have a distinct identifier and each node in that data center should have the same segment.
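
The assignment rule (one distinct integer per data center, shared by every node in it) can be sketched in a few lines of Python. The data center names and the helper function are hypothetical; only gmcast.segment and wsrep_provider_options come from Galera itself. If a node already sets other provider options, the segment setting would need to be merged into that same option string rather than added as a second line.

```python
# Hypothetical mapping of node name -> data center for a test cluster.
NODE_DATACENTER = {"node1": "dc-east", "node2": "dc-west", "node3": "dc-west"}


def assign_segments(node_dc: dict[str, str], first_segment: int = 1) -> dict[str, int]:
    """Give every data center its own integer and each node its data center's integer."""
    segment_of_dc: dict[str, int] = {}
    for dc in node_dc.values():
        if dc not in segment_of_dc:
            segment_of_dc[dc] = first_segment + len(segment_of_dc)
    return {node: segment_of_dc[dc] for node, dc in node_dc.items()}


for node, segment in assign_segments(NODE_DATACENTER).items():
    # Line intended for the [mysqld] section of that node's my.cnf.
    print(f'{node}: wsrep_provider_options="gmcast.segment={segment}"')
```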

If we apply this configuration to our above environment where node1 is in gmcast.segment=1 and nodes 2 and 3 are in gmcast.segment=2, we get the following network throughput from the same 1 minute test:

    pkts bytes target     prot opt in     out     source               destination
node1:
	20642   15M ACCEPT     tcp  --  any    eth1    192.168.70.2         192.168.70.3
	 6088  317K ACCEPT     tcp  --  any    eth1    192.168.70.2         192.168.70.4
node2:
	19045 2368K ACCEPT     tcp  --  any    eth1    192.168.70.3         192.168.70.2
	33652   17M ACCEPT     tcp  --  any    eth1    192.168.70.3         192.168.70.4
node3:
	14682 2144K ACCEPT     tcp  --  any    eth1    192.168.70.4         192.168.70.2
	21974 2522K ACCEPT     tcp  --  any    eth1    192.168.70.4         192.168.70.3

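We can now clearly see that our replication follows the path node1 to node2 to node3, using node2 as a relay: node1 sends 15M to node2 but only 317K to node3, while node2 forwards 17M on to node3.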

So our hypothetical WAN link here between segment 1 and segment 2 only needs a single copy of the replication stream instead of one per remote node.
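
The saving on the WAN link can be expressed as a simplified model: without relaying, every remote node costs one copy across the WAN; with relaying, every remote segment costs one copy. The function below is an illustrative sketch, not Galera's actual logic, and it counts only writeset copies, not group communication traffic.

```python
from collections import Counter


def wan_copies(segments: dict[str, int], origin: str, relay: bool) -> int:
    """Copies of one writeset that leave the origin node's segment.

    Without relaying, every remote node gets its own copy over the WAN.
    With relaying, each remote segment receives a single copy.
    """
    home = segments[origin]
    remote = Counter(seg for seg in segments.values() if seg != home)
    return len(remote) if relay else sum(remote.values())


cluster = {"node1": 1, "node2": 2, "node3": 2}  # segments used in this test
print(wan_copies(cluster, "node1", relay=False))  # 2
print(wan_copies(cluster, "node1", relay=True))   # 1
```

The gap grows with the size of the remote segment: with 20 nodes in the remote data center, the same model gives 20 copies without relaying and still 1 with it.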

But why is this better than a regular old async relay slave? It’s better because node2 was chosen dynamically to be the relay; I did not configure anything special besides the segment designation. The cluster could just as easily have chosen node3. If node2 failed, node3 would simply take over relay responsibilities (assuming there were more nodes in the segment).

Further, as I understand the feature, there’s nothing forcing all replication to get relayed through a single node in each segment. Any given transaction from any given node in the cluster might use any node in a given segment as a relay. The relaying is actually per-transaction and fully dynamic. No fuss, no muss.

What about commit latency?

Astute readers know that node1 still must ultimately get acknowledgement from all other nodes before responding to the client. When we are using segment relays, this should add some latency to commit time.

In my testing I was on a single virtual LAN, but my commit latency averages came out pretty close. I also set up a WAN environment on AWS where node1 was in us-east-1 and nodes 2 and 3 were in us-west-1, and the difference in commit latency was effectively nil.

The additional latency is about 1ms in the LAN test case. These are 3 VMs on the same physical host, so there’s probably some additional overhead in play here. The high latency between the data centers fully masks the relaying overhead in a true WAN case.

Here are the raw results from the WAN tests:

No Segments

Sysbench run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 8
Random number generator seed is 0 and will be ignored

Threads started!

OLTP test statistics:
    queries performed:
        read:                            0
        write:                           3954
        other:                           0
        total:                           3954
    transactions:                        0      (0.00 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 3954   (65.80 per sec.)
    other operations:                    0      (0.00 per sec.)

General statistics:
    total time:                          60.0952s
    total number of events:              3954
    total time taken by event execution: 480.4790s
    response time:
         min:                                 83.20ms
         avg:                                121.52ms
         max:                                321.30ms
         approx.  95 percentile:             169.67ms

Threads fairness:
    events (avg/stddev):           494.2500/1.85
    execution time (avg/stddev):   60.0599/0.03

With Segments

Sysbench run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 8
Initializing random number generator from seed (1).

Threads started!

OLTP test statistics:
    queries performed:
        read:                            0
        write:                           3944
        other:                           0
        total:                           3944
    transactions:                        0      (0.00 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 3944   (65.63 per sec.)
    other operations:                    0      (0.00 per sec.)

General statistics:
    total time:                          60.0957s
    total number of events:              3944
    total time taken by event execution: 480.1212s
    response time:
         min:                                 82.96ms
         avg:                                121.73ms
         max:                                226.33ms
         approx.  95 percentile:             166.85ms

Threads fairness:
    events (avg/stddev):           493.0000/1.58
    execution time (avg/stddev):   60.0151/0.03
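
To put "effectively nil" into numbers, the response times from the two WAN runs can be compared directly. This is a small illustrative sketch; the dictionary keys are my own labels, and a single 60-second run of each configuration is not enough to treat the differences as statistically meaningful.

```python
# Response times in milliseconds, copied from the two sysbench runs above.
no_segments = {"min": 83.20, "avg": 121.52, "max": 321.30, "p95": 169.67}
with_segments = {"min": 82.96, "avg": 121.73, "max": 226.33, "p95": 166.85}

# Difference of each metric (with segments minus without), absolute and relative.
deltas = {}
for metric, base in no_segments.items():
    delta = with_segments[metric] - base
    deltas[metric] = delta
    print(f"{metric}: {delta:+.2f} ms ({delta / base:+.2%})")
```

The average differs by about 0.2 ms on a roughly 120 ms commit, which is well within run-to-run noise.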

Test for yourself

I built my test environment on both local VMs and in AWS using an open source Vagrant environment you can find here: https://github.com/jayjanssen/pxc_testing/tree/5_6_segments (check the run_segments.sh script as well as the README.md and documentation for the submodule).

We’ve also released Percona XtraDB Cluster 5.6 RC1 with Galera 3.2; the above Vagrant environment should pull in the latest 5.6 build automatically.