WAN Synchronous Clusters: Dealing with Latency Using Concurrency
Przemysław Malkowski
In this blog, we’ll discuss how to use concurrency to help with WAN latency when using synchronous clusters.
WAN Latency Problem
Our customers often ask us for help or advice with WAN clustering problems. Historically, the usual solution for MySQL WAN deployments has been to have the primary site in one data center and a standby backup site in another data center (replicating from the primary asynchronously). These days, however, there is a huge desire to employ the available synchronous replication solutions for MySQL. These include Galera (for example, Percona XtraDB Cluster) and the recently released MySQL Group Replication. This trend is attributable to the fact that these solutions are less problematic and provide more automatic failover and failback procedures. But it’s also because businesses want to write in both data centers simultaneously.
Unfortunately, WAN link reliability and latency make synchronous replication a big challenge. In many cases, these challenges force geographically separate data centers to still replicate asynchronously.
From a requirements point of view, the official documentation from Galera's founders has WAN-related recommendations and some dedicated options (like segments), as described in Jay’s blog post. WAN deployments are absolutely possible in Galera, and are even an advertised option. The MySQL Group Replication team, however, seems to discourage such use cases, as we can read:
Group Replication is designed to be deployed in a cluster environment where server instances are very close to each other, and is impacted by both network latency as well as network bandwidth.
While perhaps obvious to some, I would like to point out a simple dependency that might be a viable solution in some deployments that face significant network latency. That solution is concurrency! When write throughput is limited by transaction commit latency, you can employ more writer threads. By using separate connections to MySQL, you can commit more transactions at the same time overall.
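To make the dependency concrete, here is a minimal Python sketch. It does not talk to MySQL at all: it assumes every commit simply waits out one fixed delay (a stand-in for the replication round trip) and ignores server-side processing, conflicts and flow control. The function names are hypothetical and exist only for this illustration.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def estimated_seconds(total_commits: int, writers: int, commit_latency_s: float) -> float:
    """Idealized wall time: each writer commits serially, writers run in parallel."""
    return math.ceil(total_commits / writers) * commit_latency_s


def simulated_commit(commit_latency_s: float) -> None:
    # Assumption: a commit costs one fixed delay; the thread just sleeps through it.
    time.sleep(commit_latency_s)


def run_simulated_writes(total_commits: int, writers: int, commit_latency_s: float) -> float:
    """Run the simulated commits on `writers` threads and return elapsed seconds."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=writers) as pool:
        # list() drains the iterator so any worker exception would surface here.
        list(pool.map(simulated_commit, [commit_latency_s] * total_commits))
    return time.perf_counter() - started


if __name__ == "__main__":
    # 50 commits at 50ms each, with an increasing number of writer threads.
    for writers in (1, 5, 25):
        elapsed = run_simulated_writes(50, writers, 0.05)
        estimate = estimated_seconds(50, writers, 0.05)
        print(f"{writers:>2} writers: {elapsed:.2f}s measured, {estimate:.2f}s estimated")
```

In this idealized model the total time shrinks roughly in proportion to the number of writers, because the waiting overlaps. Real clusters will not scale this cleanly, but the direction of the effect is the same.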
Let me demonstrate with example results based on a very simple test case. I tested both Percona XtraDB Cluster (with Galera replication) and MySQL Group Replication. I configured a minimal cluster of three nodes in each case, running as Docker containers on the same host, and simulated a WAN network between them. In this setup, the baseline latency is around 0.05ms on average. I then introduced an artificial network latency of 50ms and 100ms on the network interface of one of the nodes. I later repeated the same tests using VirtualBox VM instances, running on a completely different server. The results were very similar. The command to simulate additional network latency is:
# tc qdisc add dev eth0 root netem delay 50ms
The result is a delayed ping to the other nodes in the cluster:
# ping -c 2 172.17.0.2
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=50.0 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=50.1 ms
The test is very simple: execute 500 small insert transactions, each inserting just a single row (but that is less relevant now).
For testing, I used a simple mysqlslap command:
mysqlslap --password=*** --host=$IP --user=root --delimiter=";" --number-of-queries=500 --create-schema=test --concurrency=$i --query="insert into t1 set a='fooBa'"
and a simple single table:
CREATE TABLE `t1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`a` char(5) DEFAULT NULL,
PRIMARY KEY (`id`)
);
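The $i in the mysqlslap command above is the concurrency level, so the test is repeated once per level. A small Python sketch of such a sweep could look like the following. The list of concurrency levels and the helper name are my assumptions for illustration, not the exact values used in the test; the mysqlslap flags are the ones shown above.

```python
import shlex

# Illustrative concurrency levels (an assumption, not the original test matrix).
CONCURRENCY_LEVELS = (1, 2, 4, 8, 16, 32)


def build_mysqlslap_args(host: str, concurrency: int, password: str) -> list[str]:
    """Return the argv for one mysqlslap run, mirroring the flags used above."""
    return [
        "mysqlslap",
        f"--password={password}",
        f"--host={host}",
        "--user=root",
        "--delimiter=;",
        "--number-of-queries=500",
        "--create-schema=test",
        f"--concurrency={concurrency}",
        "--query=insert into t1 set a='fooBa'",
    ]


if __name__ == "__main__":
    for level in CONCURRENCY_LEVELS:
        args = build_mysqlslap_args("172.17.0.2", level, "***")
        # Dry run: print the command. To execute it for real, pass `args`
        # to subprocess.run(args, check=True) with a real password.
        print(shlex.join(args))
```

Building the command as an argument list rather than a single string avoids shell quoting problems with the query text.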
Interestingly, without increased latency, the same test takes much longer against the Group Replication cluster, even though by default Group Replication runs with group_replication_single_primary_mode enabled and group_replication_enforce_update_everywhere_checks disabled. Theoretically, it should be a lighter operation from a “data consistency checks” point of view. With WAN-type latencies, Percona XtraDB Cluster also seems to be slightly faster in this particular test. Here are the test results for the three different network latencies:
I used the same InnoDB settings for both clusters, with each node in a separate Docker container or VirtualBox VM. Results of a similar test could differ a lot in real production systems, where more CPU cores provide better conditions for concurrency.
My intent wasn’t to benchmark Galera against Group Replication, but rather to show that the same dependency of write throughput on concurrency applies to both technologies. I might be missing some tuning on the Group Replication side, so I don’t claim any verified winner here.
Just to provide some more details, I was using Percona XtraDB Cluster 5.7.16 and MySQL with Group Replication 5.7.17.
One important note: when you expect higher concurrency to provide better throughput, you must make sure the concurrency is not limited by server settings. For example, you must look at innodb_thread_concurrency (I used 0, so unlimited), slave_parallel_workers for Group Replication, and wsrep_slave_threads for Galera (among others related to IO operations, etc.).
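As a rough illustration of that check, the sketch below takes variable values (as you might copy them from SHOW GLOBAL VARIABLES) and flags the settings named above when they look like they could cap concurrency. The function name and the thresholds are my assumptions, not official guidance; adjust them to your own workload.

```python
def concurrency_warnings(variables: dict[str, int], writers: int) -> list[str]:
    """Flag settings that may limit the benefit of `writers` concurrent connections."""
    warnings = []

    # 0 means unlimited; a non-zero value below the writer count caps InnoDB concurrency.
    itc = variables.get("innodb_thread_concurrency")
    if itc is not None and 0 < itc < writers:
        warnings.append(f"innodb_thread_concurrency={itc} is below {writers} writers")

    # Group Replication: 0 or 1 workers means changes are applied serially.
    spw = variables.get("slave_parallel_workers")
    if spw is not None and spw <= 1:
        warnings.append(f"slave_parallel_workers={spw} applies changes serially")

    # Galera: a single applier thread applies write sets serially.
    wst = variables.get("wsrep_slave_threads")
    if wst is not None and wst <= 1:
        warnings.append(f"wsrep_slave_threads={wst} applies write sets serially")

    return warnings


if __name__ == "__main__":
    # Hypothetical values for a Galera node and 16 writer connections.
    sample = {"innodb_thread_concurrency": 4, "wsrep_slave_threads": 1}
    for line in concurrency_warnings(sample, writers=16):
        print("WARNING:", line)
```

Variables missing from the input are skipped, so the same helper can be used for a Galera node or a Group Replication node.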
Apart from “concurrency tuning,” which could involve application changes if not architectural re-design, there are of course more possible optimizations for WAN environments. For example:
Minimizing network utilization by using binlog_row_image=minimal and other variables.
But these are out of the scope of this post. I hope this simple post helps you deal with the speed of light better! 😉