I was curious to check how Percona XtraDB Cluster behaves when it comes to MySQL replication latency — or better yet, call it data propagation latency. It was interesting to see whenever I can get stale data reads from other cluster nodes after write performed to some specific node. To test it I wrote quite a simple script (you can find it in the end of the post) which connects to one node in the cluster, performs an update and then immediately does the read from second node. If the data has been already propagated — good, if not we’ll continue to retry reads until it finally propagates, and then measure the latency. This is used to see whenever application can see any stale reads.
My setup is 3 Percona XtraDB Cluster nodes talking through dedicated 1Gbit cluster network (DPE1, DPE2, DPE3) and I’m running the test from 4th server (SMT2) so it is pretty realistic setup from typical data center latency point of view though the server hardware is not the most recent.
First lets look at baseline when cluster has no load but running the script doing writes to DPE1 and immediately reading from DPE2
Summary: 94 out of 10000 rounds (0.94%) Delay distribution: Min: 0.71 ms; Max: 2.16 ms Avg: 0.89 ms
These results tell me 2 things. First Replication by default in Percona XtraDB Cluster is Asynchronous from Data Propagation Standpoint – it takes time (though short one in this case) for changes committed on the one node to become visible to the other. Second – it is actually doing quite well with less than 1% of tests able to see any inconsistency and the delay being less than 1ms in average with rather stable results.
But we do not setup clusters to be idle right ? So lets to another test, now running the Sysbench load on DPE1. With concurrency of 32 this corresponds to pretty significant load.
sysbench --test=oltp --mysql-user=root --mysql-password="" --oltp-table-size=1000000 --num-threads=32 --init-rng=on --max-requests=0 --oltp-auto-inc=off --max-time=3000 run
Results become as follows:
Summary: 3901 out of 10000 rounds (39.01%) Delay distribution: Min: 0.66 ms; Max: 201.36 ms Avg: 3.81 ms
Summary: 3893 out of 10000 rounds (38.93%) Delay distribution: Min: 0.66 ms; Max: 42.9 ms Avg: 3.76 ms
As expected we can observe inconsistency much more frequently in almost 40% though the average observed delay remains just few milliseconds, which is something most applications would not even notice.
Now if we run sysbench on DPE2 (the load on the node which we’re reading from)
Summary: 3747 out of 10000 rounds (37.47%) Delay distribution: Min: 0.86 ms; Max: 108.15 ms Avg: 8.62 ms
Summary: 3721 out of 10000 rounds (37.21%) Delay distribution: Min: 0.81 ms; Max: 291.81 ms Avg: 8.54 ms
We can observe the effect in similar amount of cases but delay is higher in this case both in average and the Maximum one. This tells me from data propagation standpoint the cluster is more sensitive to the load on the nodes which receive the data, not the ones where writes are done.
Lets remember though what Sysbench OLTP has only rather small portion of writes. What if we look at workloads which consists 100% of Writes. We can do it with Sysbench, for example:
sysbench --test=oltp --oltp-test-mode=nontrx --oltp-nontrx-mode=update_key --mysql-user=root --mysql-password="" --oltp-table-size=1000000 --num-threads=32 --init-rng=on --max-requests=0 --max-time=3000 run
Running this load on DPE1 I’m getting:
Summary: 1062 out of 10000 rounds (10.62%) Delay distribution: Min: 0.71 ms; Max: 285.07 ms Avg: 3.21 ms
Summary: 1113 out of 10000 rounds (11.13%) Delay distribution: Min: 0.81 ms; Max: 275.94 ms Avg: 5.06 ms
Surprise! results are actually better than if we put mixed load as we can observe any delay only in about 11%.
However if we run the same side load on DPE2 we get:
Summary: 5349 out of 10000 rounds (53.49%) Delay distribution: Min: 0.81 ms; Max: 519.61 ms Avg: 5.02 ms
Summary: 5355 out of 10000 rounds (53.55%) Delay distribution: Min: 0.81 ms; Max: 526.95 ms Avg: 5.06 ms
Which is the worse result out there with over 50% samples produced inconsistent data and average delay for those over 5ms and outliers going to half a second.
From these results I read the side load on the node TO which updates are being propagated causes largest delay.
At this point I remembered there is one more test I can run. What is if I put side load on DPE3 server, from which I’m not touching from the test at all ?
Summary: 833 out of 10000 rounds (8.33%) Delay distribution: Min: 0.66 ms; Max: 353.61 ms Avg: 2.76 ms
No surprise here as DPE3 is not being directly read or written to the load on it should cause minimal delays to data propagation from DPE1 to DPE2.
The propagation latency we’ve observed in the test so far is quite good but it is not synchronous replication behavior – we still can’t treat the cluster as if it were single server from generic application. Right. Default configuration for Percona XtraDB Cluster at this point is to replicate data asynchronously, but still guaranty there is no conflicts and data inconsistency then updates are done on multiple nodes. There is an option you can enable to get fully synchronous replication behavior:
mysql> set global wsrep_causal_reads=1;
Query OK, 0 rows affected (0.00 sec)
When this option is enabled the cluster will wait for the data to be actually replicated (committed) before serving the read. The great thing is wsrep_causal_reads is session variables so you can mix different applications on the same cluster – some requiring better data consistency guarantees, other being OK with a little bit stale data but looking for the best performance possible.
So far so good. We can make cluster to handle significant load with small transactions and still have very respectful data propagation delay or we can enable wsrep_causal_reads=1 option and get full data consistency. But what happens if we have some larger transactions ? To test this I have created the copy of sbtest table and will run a long update while running my test to see how the latency is impacted:
mysql> update sbtest2 set k=k+1;
Query OK, 1000000 rows affected (1 min 14.12 sec)
Rows matched: 1000000 Changed: 1000000 Warnings: 0
Running this query on the DPE1 box I’m getting following result:
Result Mismatch for Value 48; Retries: 1 Delay: 0.76 ms
Result Mismatch for Value 173; Retries: 1 Delay: 1.21 ms
Result Mismatch for Value 409; Retries: 1 Delay: 0.86 ms
Result Mismatch for Value 460; Retries: 142459 Delay: 46526.7 ms
Result Mismatch for Value 461; Retries: 65 Delay: 22.92 ms
Result Mismatch for Value 464; Retries: 1 Delay: 0.71 ms
Result Mismatch for Value 465; Retries: 1 Delay: 0.76 ms
Summary: 452 out of 10000 rounds (4.52%) Delay distribution: Min: 0.66 ms; Max: 46526.7 ms Avg: 104.28 ms
So the propagation delay was pretty good until this given query had to be replicated, in which case we could observe the replication delay for over 45 seconds which is quite nasty.
Note though delay was for less period than it takes to execute the query on the master. This is because application of the changes on the master in parallel and updates to the sbtest table and sbtest2 table can be done in parallel (even changes to the same table can) but the certification process is serial as well as sending write set to the other nodes, and it must be taking some 45 seconds to send the write set and perform certification.
If we run the same query on DPE2 the interesting thing happens. The script does not show any data propagation delays but it visibly stalls, as I guess because the UPDATE statement issued to DPE1 is blocked for some time. To check this idea I decided to use the sysbench script with very simple point update queries to see if we get any significant stalls. My base run on DPE1 is as follows:
root@dpe01:/etc/mysql# sysbench --test=oltp --oltp-auto-inc=off --oltp-test-mode=nontrx --oltp-nontrx-mode=update_key --mysql-user=root --mysql-password="" --oltp-table-size=1000000 --num-threads=1 --init-rng=on --max-requests=0 --max-time=300 run
approx. 95 percentile: 0.94ms
We can see quite respectful performance with longest request taking some 300ms – so no stalls. Lets do the run again now running the same update statement on the different cluster node:
approx. 95 percentile: 0.97ms
As we see there is a stall in update for 50+ second, again while certification is happening. So certification does not only delay data propagation but can stall updates done to the different tables on the different nodes.
Percona XtraDB Cluster performs very well when it comes to small transactions offering very small propagation delay and an option of synchronous replication all together. However when it comes to large transactions you can get in a lot of trouble with major stalls both in terms of data propagation and in terms of writes. The system I did test on is pretty old and I would expect modern systems can run certification several times faster still taking tens of seconds for what I would consider medium size transaction modifying 1 million of rows is rather long time. So make sure to have a good understanding how large transactions your application has and how longs stalls it can handle.
As promised the script I was using for testing.
# The idea with this script is as follows. We have 2 nodes. We write to one node and when read from second node
# To see whenever we get the same data or different
$writer->query(“UPDATE $table set k=$val where id=$id”);
/* Loop while we get the right result */
$result=$reader->query(“SELECT k from $table where id=$id”);
if ($retries!=0) /* If we had to retry compute stats */
echo(“Result Mismatch for Value $val; Retries: $retries Delay: $delay ms\n”);
echo(“Summary: $delays out of $val rounds ($delay_pct%) Delay distribution: Min: $min_delay ms; Max: $max_delay ms Avg: $avg_delay ms\n”);
Appendix2: Percona XtraDB Cluster related configuration
# PXC Settings for Version: '5.5.29-55-log' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.2.r3843