Making the impossible: 3 nodes intercontinental replication

In this post I want to show new possibilities which open with Percona XtraDB Cluster.
We will create 3 nodes Cluster with nodes on different continents (Europe, USA, Japan) and each node will accept write queries.
Well, you theoretically could create 3 node traditional MySQL ring replication, but this is not what you want to use day-to-day.

To show how it works I will use Amazon m1.xlarge instances, by one in Tokyo, Ireland and North California, running RedHat Entreprise 6.2 64bit.

In fact to create instances is most time consuming task. After that using my script you will have cluster running in 5 min or less.

There however some precautions needed if you run Amazon instances.
First, you need to open ports in the firewall. For the communication the nodes need, by default, 4444, 4567, 4568 ports
(see our FAQ why)

Second, please take into account that the communication is done using open channels, and you may want to establish an encrypted connection, using it in real life.

Now, as we have running instances, you can install Cluster packages from RPM repositories.
You can follow steps from documentation.

Or I prepared simple script which does all the work:,
you also can follow steps from the script to adjust it for your environment.

You just need to change nodes host names and your keys in the file:

When all nodes start, we have running 3 nodes EU<->USA<->JAPAN, and each node is ready to execute both read and write queries.

Of course you may wonder what is query response time in such environment. We can check it.

Simple table: CREATE TABLE t (ID INT) and simple query: INSERT INTO t VALUES (1)

Response time on single node in EU (no cluster setup): 0.005100 sec
Response time on two-nodes (EU<->JAP) cluster: 0.275642 sec
Response time on three-nodes (EU<->JAP<->USA) cluster: 0.294754 sec

Well, one may argue that 0.27 sec for single query is kind of big, but this is the physic law, you cannot
go faster than speed of light, and the round trip between Europe and Japan takes time (or at least until scientists figure out how to attach transaction to Faster than light neutrino).
Also note, that XtraDB Cluster can apply events in parallel and throughput should be less affected by big distance.


Share this post

Comments (31)

  • Harrison Reply

    .25 seconds for a write seems pretty slow, particularly if you have to do multiple per request. How does galera work with batching?

    If you did a multi-insert packet, would it be .25? If you did a multi-row insert would it be .25 seconds? If you had a SP which did 10 writes, would it still be roughly .25?

    January 11, 2012 at 6:47 pm
  • Vadim Tkachenko Reply


    Synchronization happens on COMMIT stage.
    That is if you have batch in BEGIN / COMMIT; each individual INSERT inside will be quick, but COMMIT will take .25 sec.


    mysql> BEGIN;
    Query OK, 0 rows affected (0.00 sec)

    mysql> insert into t1 values (3);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into t1 values (4);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into t1 values (5);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into t1 values (6);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into t1 values (7);
    Query OK, 1 row affected (0.00 sec)

    mysql> COMMIT;
    Query OK, 0 rows affected (0.16 sec)

    In this case it is 0.16 sec as I use USA-WEST < -> EU-WEST setup.

    You see that only COMMIT is 0.16.

    Does it answer ?

    January 11, 2012 at 7:48 pm
  • Harrison Reply

    Yup, sounds good.

    January 11, 2012 at 9:56 pm
  • Florian Haas Reply

    For the sake of comparison, what’s the latency penalty when running on real hardware, in a directly switched Gigabit Ethernet network?

    January 12, 2012 at 1:25 am
  • Henrik Ingo Reply

    To continue Harrison’s trail, it now depends on what you are comparing against whether this is truly amazing, ok or a degradation. The question, is what have you done with 3 continents until today?

    If the answer is that you have a readonly datacenter with mysql slaves, and all writes are directed to one master data center, then this is a lot faster. You can now write at any datacenter, which is closest to the user. The commit will be slower, but the actual inserts and updates are now fast. On average it is a net gain (if you have users spread around the world).

    If the answer is that you already write in multiple data centers, for instance using a “Master of Record” scheme, then this is not faster and might be slower. Otoh this is much simpler to manage as you don’t need to map each user to a “home” datacenter.

    Florian: Codership has done a similar EU-US test with throughput and latency numbers published here:

    January 12, 2012 at 2:11 am
  • Florian Haas Reply

    Ingo, thanks a lot, but while that post claims to compare performance both in a LAN environments and in EC2 instances, all I actually see is single-node vs. EC2 numbers. Am I missing something? Please point me to the single-LAN, multi-node, multi-master replication numbers if that post (or another) has them. Thanks.

    January 12, 2012 at 2:43 am
  • Florian Haas Reply

    Henrik, sorry for mixing up your first and last names. What was I thinking? I’m juggling a few tasks here today.

    January 12, 2012 at 2:45 am
  • Henrik Ingo Reply

    Hehe, it’s common with Germans and I’m completely ok with it 🙂

    The codership blog post has tests with:
    – single node
    – clusters inside one Amazon data center
    – clusters spanning Amazon data centers on two continents

    Was there something more you were looking for?

    January 12, 2012 at 3:53 am
  • Florian Haas Reply

    Hehe, it’s common with Germans and I’m completely ok with it

    … as it seems common among Finns to mix up Germans and Austrians. 🙂

    I was looking for benchmarks for >= 2 non-virtual hosts running in a local LAN with synchronous replication, no EC2 involved.

    January 12, 2012 at 4:31 am
  • Henrik Ingo Reply

    Ok, touche, but still within German speaking culture.

    I have tested Galera non-virtualized, but then you don’t get a comparison against multiple data center setups:

    January 12, 2012 at 4:37 am
  • Florian Haas Reply

    I’m actually not so much looking for multi-DC setups at this point (although having something like that for MySQL will come in handy), I’m just trying to get an idea about the performance in comparison to “traditional” MySQL HA solutions.

    And looks like it’s just what I’ve been looking for, thanks!

    January 12, 2012 at 4:41 am
  • Henrik Ingo Reply

    Oh, sorry, I didn’t publish a lot of latency numbers though. As far as I remember (I could dig them up from logs one day) they are like for semi-sync replication: noticeable but not of any practical consequence. Most importantly tps on a parallel workload is equal or better than a single node, so 1 ms of latency up or down is a nice tradeoff to make.

    January 12, 2012 at 4:42 am
  • marrtins Reply

    What happens if say one continent goes offline for a day? Is synchronization done automatically after re-join?
    This CAPTCHA is very hard to crack.

    January 12, 2012 at 5:03 am
  • Alex Reply

    Indeed. It will either replay missed transactions of take full DB snapshot via mysqldump/rsync/xtrabackup – depending on the size of transaction cache on other nodes.

    January 12, 2012 at 5:23 am
  • Max Reply

    Looks pretty intersting. It’s been a awhile since I looked at Galera so I can’t remember, but is there an asynch mode or do you have to synchronously commit everything?

    January 12, 2012 at 12:54 pm
  • Vadim Tkachenko Reply


    There are numbers from local DC, nodes connected via 1Gb network.

    1 node. INSERT: 0.000308 sec
    2 nodes. INSERT: 0.000619 sec

    January 12, 2012 at 4:31 pm
  • Vadim Tkachenko Reply


    only “sync” mode is supported.
    As soon as we do “async” we immediately get into problems inherent for regular MySQL replication,
    and it would make it as a product with totally different characteristics.

    January 12, 2012 at 4:34 pm
  • Henrik Ingo Reply

    Max: If you want async replication you can just use MySQL replication alongside Galera. For instance 1 node in a Galera cluster can also be a MySQL replication master, to which some other MySQL servers connect.

    Otoh Vadim’s tests here show that the most likely reason you’d want to do that is moot: You can just run a Galera cluster over long distances and it will work and the overhead is still acceptable.

    January 13, 2012 at 12:51 am
  • Max Reply

    Henrik: I think it’s fair to say that 0.25 latency can be too high for many applications. It would be cool to be able to decide between synch and asynch on a database/table/query basis.

    January 13, 2012 at 11:48 am
  • Henrik Ingo Reply

    Max: If you want your write to be synchronously written to 2 or more continents, then it’s not going to be faster than that. All of the magical NoSQL solutions (MongoDB, Voldemort, etc…) would be equally slow (or more, considering they don’t support multiple operations within one transaction) too, or you would have to run them in a mode where you don’t really commit to more than one data center after all. Here too you can of course choose to use MySQL replication for Data Center redundancy and only use Galera within a data center (like you would with NDB). But then you’ll have to deal with the mess of eventual consistency some other way. A 0.2 sec latency on commit might be a nice tradeoff to pay for that convenience.

    Note that this latency only applies for writes. Reads will not have this latency at all. Also the latency happens at commit time – it will often be possible to design your application (with AJAX or other asynchronous means) such that it is very responsive for any interactive part, and then the commit is more or less silent when everything else is done (only errors would bring something back to the user).

    January 13, 2012 at 1:55 pm
  • Max Reply

    Henrik:Totally agree with you. I was referring to being able to do some operations in asynch mode to speed up the writes. I agree that 0.25 latency for intercontinental transactions is good. It might still be too much in some cases though.

    January 16, 2012 at 8:39 am
  • Maarten Reply

    It’d be cool to have benchmarks of a setup that I think a lot of companies looking for more or less HA are looking for that are not operating worldwide: a cluster in 2 datacenters in 2 cities in the same country.

    March 9, 2012 at 1:32 pm
  • Henrik Ingo Reply

    Maarten: Benchmark results are of course the real facts, but based on my experience, a Galera cluster spanning something like 50 km with good fiber connection between will perform really well. Most workloads you might not see any overhead at all.

    However, note that the way Galera handles failovers by majority, you’d really want to have 3 separate data centers.

    March 9, 2012 at 2:36 pm
  • Suman Reply

    Hi, thanks for your notes and the setup script. I had a quick question: I set up three nodes on three different Availability Zones (US East, US West and Europe) to simulate this cluster, and when I run the script, it gives me this error message:
    “mysql_install_db: command not found”

    I’m not able to find mysql_install_db anywhere on the file system either.

    I tried this with both Amazon’s default as well as RHEL 6.2 stock AMI images (I had to modify the script slightly for the Amazon image) but it didn’t work on either.

    Should I install Percona server:
    on these images before running this intercontinental script?

    March 21, 2012 at 8:38 am
  • Suman Reply

    Sorry I figured out how to get the script working.

    Just need to ADD these two lines:
    yes | yum -y remove mysql-libs
    yes | yum -y install Percona-XtraDB-server Percona-XtraDB-client

    right BEFORE:
    yes | yum -y install Percona-XtraDB-Cluster-server Percona-XtraDB-Cluster-client

    and then the mysql_install_db will be installed properly.

    March 21, 2012 at 9:18 am
  • Steve Mushero Reply

    Great stuff but I worry what happens when the link is poor, like we have in China to the outside world. If the link is down/bad for an hour, one server will be thrown out of the cluster, I assume, but what to do with the clients of that DB, and how to redirect them to other DBs, especially if the orphaned server takes a long time to recover. I believe on the LAN Galera will move virtual IPs, but that’s useless for WAN use, I’d think.

    Also, it’s nice to talk about re-sync by snapshot or backup but at 100GB+ and global links, this is a many hours/days effort to just move the data, so not realistic. It would have to re-sync automatically as best speed, like a regular MySQL slave would.

    August 30, 2012 at 8:56 pm
  • Alex Reply


    1) I guess in WAN case you could use DNS-based failover, which is analogous to VIP moving. However in your case the server in China must be dedicated to Chinese clients probably because it is too slow for the clients to reach the outside. And so if it looses connection to the outside, most likely the clients do too or it is too slow for them anyways.

    2) As of version 2.0 Galera now supports missed transactions replay on rejoining the cluster, like regular MySQL slave. Except that it can do it much faster thanks to parallel applying.

    August 30, 2012 at 9:45 pm
  • Steve Mushero Reply

    Yes, we could, and we do this even in LAN based failover if necessary, via local host files or local/remote DNS. Fast enough as long as not global public DNS.

    Bigger problem is the slow reads this creates – the cluster is great because normally only writes are slow, but reads are local and fast, but with the local node out of the cluster, now both reads and writes are slow.

    Can we have a local read-only slave from the cluster, i.e. use normal MySQL replication with the cluster ?

    August 31, 2012 at 6:40 am
  • Alex Reply


    1) yes, you can use normal MySQL master-slave replication from/to any node in the cluster.
    2) you still can use disconnected node for reads if stale data is acceptable. Just set wsrep_on=0 globally or per session in addition to read_only.

    August 31, 2012 at 7:07 am
  • laapsaap Reply

    Article should be corrected. Faster-than-light neutrino anomaly was caused by a measurement error, it was within speed of light in reality.

    November 27, 2013 at 9:56 am
  • Nikke Reply

    What is the latency between the nodes? Trying to compare this to a data ceter build that I am interested in.

    September 27, 2016 at 11:14 pm

Leave a Reply