Group Replication in Percona Server for MySQL

Percona Server for MySQL 8.0.18 ships all functionality to run Group Replication and InnoDB Cluster setups, so I decided to evaluate how it works and how it compares with Percona XtraDB Cluster in some situations.

For this I planned to use three bare metal nodes with SSD drives and a 10Gb network for inter-node communication, but later I also added tests on three bare metal nodes with NVMe drives and 2x10Gb network cards.

To simplify deployment, I created simple Ansible scripts.

Load Data

The first logical step is to load data into an empty cluster, so let’s do this with our sysbench-tpcc script.

The resulting dataset is about 100GB.
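
The post does not list the exact sysbench-tpcc invocation, so the sketch below is only an assumption of what such a prepare run can look like; the hosts, credentials, thread count, and the tables/scale combination (10 table sets x 100 warehouses, roughly 100GB) are placeholders, not the parameters used in this test.

# Hedged sketch: a sysbench-tpcc "prepare" run against the cluster's primary.
# All connection details and sizing parameters below are placeholders.
git clone https://github.com/Percona-Lab/sysbench-tpcc.git
cd sysbench-tpcc
mysql -h primary-node -u sbtest -psbtest -e "CREATE DATABASE IF NOT EXISTS tpcc"
./tpcc.lua --db-driver=mysql \
  --mysql-host=primary-node --mysql-user=sbtest --mysql-password=sbtest \
  --mysql-db=tpcc \
  --tables=10 --scale=100 \
  --threads=56 \
  prepare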

Group Replication, Load Time

The time to finish the script is 61 minutes, 19 seconds.

Let’s review how the network was loaded on a secondary node in Group Replication during the execution:

Average network traffic: 19.02 MiB/sec

PXC 5.7.28, Load Time

The time to finish the script is 39 minutes, 27 seconds.

Average network traffic: 29.81 MiB/sec

PXC 8.0.15 Experimental, Load Time

The time to finish the script is 43 minutes, 22 seconds.

Average network traffic: 27.35 MiB/sec

One Node PXC 5.7.28 Load Time

To see how PXC would perform without network interactions, I loaded data into a one node PXC cluster and it took 36 minutes, 34 seconds.  So there is a minimal network overhead for PXC 5.7 (36 minutes for one node vs 39 minutes for three nodes).

Node Joining

The next experiment I wanted to perform is to see how long it takes a new node to join the cluster with the data loaded in the previous part. Group Replication supports two catch-up methods: incremental (applying data from binary logs) and the clone plugin (a physical copy of the data).
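
A minimal sketch of what enabling the clone path can look like on the group members, assuming stock MySQL 8.0.17+ behavior; the threshold value and the replication user name are placeholders, not the settings used in this test:

# Load the clone plugin on every member so any of them can act as donor or recipient.
mysql -e "INSTALL PLUGIN clone SONAME 'mysql_clone.so';"
# The recovery user needs BACKUP_ADMIN on the donors (user name is a placeholder).
mysql -e "GRANT BACKUP_ADMIN ON *.* TO 'replication_user'@'%';"
# By default the threshold is so high that distributed recovery picks binary logs;
# lowering it makes Group Replication prefer clone for a far-behind joiner.
mysql -e "SET GLOBAL group_replication_clone_threshold = 10000;"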

Let’s measure how long it takes a new node to join and catch up with each method:

Incremental

It took 1 hour and 52 minutes for a node to join and apply binary logs.

Incremental State Transfer in Percona XtraDB Cluster

It might not be obvious, but it is actually possible to have an incremental state transfer for big dataset changes in Percona XtraDB Cluster too; we just need a big enough gcache.

For testing purposes, I will set wsrep_provider_options="gcache.size=150G" to check how long it will take to ship and apply IST in PXC.
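
For reference, this is a startup setting rather than a dynamic one, so a sketch of how it can be applied (the config path is an assumption, and any existing wsrep_provider_options line should be merged rather than duplicated), plus a way to check how far back the gcache currently reaches:

# gcache.size is read at startup, so it goes into the config and needs a node restart.
cat >> /etc/my.cnf <<'EOF'
[mysqld]
wsrep_provider_options="gcache.size=150G"
EOF
# After restart: the lowest sequence number still held in the gcache, i.e. how far
# behind a joiner may be and still get IST instead of a full SST.
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_cached_downto';"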

Log extract:

In total it took 29 minutes 30 seconds to transfer and apply IST. So it was four times faster to apply IST than to apply binary logs.

Clone

I show the full log during the clone process, as it contains interesting information:

In total it took about 5 minutes for a new node to catch up; as we see, Group Replication ran the clone process automatically.

Network load during clone transfer:

Network traffic was 525 MiB/sec, which is about half of the 10Gb network bandwidth.
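
If you want to watch the copy while it runs, the clone plugin exposes its state in performance_schema on the recipient; a small sketch using the stock MySQL 8.0.17+ tables:

# Overall state of the current (or last) clone operation on the recipient.
mysql -e "SELECT STATE, BEGIN_TIME, END_TIME, SOURCE, ERROR_NO FROM performance_schema.clone_status\G"
# Per-stage progress: bytes estimated vs. bytes already copied.
mysql -e "SELECT STAGE, STATE, ESTIMATE, DATA FROM performance_schema.clone_progress;"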

Notes about Clone: it required the new node to restart, and because I used a custom systemd unit file, MySQL could not perform the restart automatically; I had to restart it manually.
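
As Kenny Gryp points out in the comments below, the RESTART statement only works under systemd when the unit exports MYSQLD_PARENT_PID=1; a sketch of a drop-in that would add it (the unit name and path are assumptions for a custom setup):

# Systemd drop-in so mysqld can restart itself during clone-based recovery.
mkdir -p /etc/systemd/system/mysqld.service.d
cat > /etc/systemd/system/mysqld.service.d/parent-pid.conf <<'EOF'
[Service]
Environment=MYSQLD_PARENT_PID=1
EOF
systemctl daemon-reload
systemctl restart mysqld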

Actually, in this case, we are limited by SATA SSD read performance, which is why we see only 525MiB/sec.

This brings up an interesting point: although the clone plugin works fast, it can also exhaust the available resources, and if there is a live client load, it will likely be affected.
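
If the clone would otherwise starve live traffic, the plugin has bandwidth caps that can be set on the node running the clone; a hedged sketch with arbitrary values:

# Caps are in MiB/sec; 0 means unlimited. The values below are illustrative only.
mysql -e "SET GLOBAL clone_max_data_bandwidth = 200;"
mysql -e "SET GLOBAL clone_max_network_bandwidth = 200;"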

SST in PXC 5.7.28

Let’s compare how SST in Percona XtraDB Cluster performs for joining a new node.

So it took 9 minutes for SST to complete and a new node to join.

Network load during SST:

Network traffic during SST was 229 MiB/sec, which is only 25% of the 10Gb network bandwidth. The reason SST uses less network bandwidth is a bug in the SST script, where our xbstream binary does not use multi-threading.

If we apply the fix to use multiple threads:

With that, transfer speed goes up to 496MiB/sec, and the total time for SST is 5 minutes, 23 seconds.

Clone and SST on NVMe Storage with 2x10Gb Network

In the previous cases our transfer rate was limited by SATA SSD read performance, so let’s see if we can achieve a faster transfer time when more resources are available. The data is stored on NVMe devices, and for the network I used a 2x10Gb network connection.

The raw network throughput I am able to achieve with iPerf3 is 17.8 Gbits/sec (or 2.23 GiB/sec).
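
For reference, a raw throughput number like this can be reproduced with iperf3 between two nodes; the hostname, stream count, and duration below are placeholders:

# On the receiving node:
iperf3 -s
# On the sending node; parallel streams help fill a bonded 2x10Gb link:
iperf3 -c node1 -P 4 -t 30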

Clone plugin:

So the clone plugin used all the available network bandwidth at 2.22GiB/sec, finishing the transfer four times faster than with SATA SSD and a 10Gb network.

As for PXC SST, without a fix for xbstream:

Here I see only 909MiB/sec.

And with the fix for xbstream:

In this case, we get 1.26GiB/sec, which is noticeably slower than the clone plugin. I do not know yet why SST cannot reach 2GiB/sec of throughput.

Conclusions

The clone plugin is really a fast way to transfer data, and in my experiments it was faster than SST in Percona XtraDB Cluster. The only downside is that it could not join a new node without a manual restart.

The incremental method was really slow and not the best way to perform a node join, yet Group Replication chose it by default.

Both technologies are sensitive to the available hardware, and a hardware upgrade (storage, network) can be a viable option to improve performance.

As for loading data, the 3-node Group Replication cluster was slower than Percona XtraDB Cluster: 61 minutes for Group Replication versus 39 minutes for PXC.


Comments (3)

  • Kenny Gryp

    Hi Vadim, Thank you for evaluating!

    Group Replication uses (classic) Replication for incremental mode. Configure MTS writeset based replication to improve catchup performance! MTS LOGICAL_CLOCK is already enabled but binlog_transaction_dependency_tracking=WRITESET is missing (https://dev.mysql.com/doc/refman/8.0/en/replication-options-binary-log.html#sysvar_binlog_transaction_dependency_tracking)

    In your custom systemctl, to enable RESTART in MySQL 8.0, make sure to set MYSQLD_PARENT_PID=1 (https://dev.mysql.com/doc/mysql-secure-deployment-guide/8.0/en/secure-deployment-post-install.html#secure-deployment-systemd-startup)

    January 17, 2020 at 7:19 pm
  • Sids

    Awesome insights @Vadim. By any chance, did you perform multi-master stress use-cases on GR and PXC?! I think there would be a significant difference in both as well.

    January 17, 2020 at 9:01 pm
  • lefred

    Hi Vadim,

    Thank you for testing MySQL Group Replication and InnoDB Clone !

    I checked a bit your config and I think some adjustments might provide you better results with the tpcc prepare workload.

    Could you try to set these on **all members** of the group (using set persist as below or in your ansible playbook):

    set persist binlog_transaction_dependency_tracking=WRITESET;
    set persist binlog_transaction_dependency_history_size=1000000;
    set persist slave_checkpoint_period=3000;
    set persist slave_pending_jobs_size_max=13421772800;
    set persist slave_checkpoint_group=52420;
    — IIRC, these you should already have —
    set persist slave_parallel_workers=16;
    set persist slave_parallel_type = LOGICAL_CLOCK;

    I’m looking forward to see your results !

    Cheers,

    lefred.

    January 22, 2020 at 5:06 pm
