November 26, 2014

Percona XtraDB Cluster Feature 2: Multi-Master replication

This post is about the second great feature you get with Percona XtraDB Cluster – Multi-Master replication.

It is recommended that you get familiar with the general architecture of the cluster, described in the previous post.

By Multi-Master I mean the ability to write to any node in your cluster without worrying that you will eventually end up in an out-of-sync situation, as regularly happens with standard MySQL replication if you imprudently write to the wrong server.

This is a long-awaited feature – I’ve seen growing demand for it for the last two years or more.
Percona XtraDB Cluster provides it, so let’s see how it works.

With our Cluster you can write to any node, and the Cluster guarantees consistency of writes. That is, a write is either committed on all nodes or not committed at all.

Let me draw a diagram here. For simplicity I will use a two-node example, but the same logic applies when you have N nodes.

As you can see, all queries are executed locally on the node, and there is special handling only on COMMIT. When you issue a COMMIT, the transaction has to pass certification on all nodes. If it does not pass, you
will receive “ERROR” as the response to the query. If it passes, the transaction is applied on the local node.

That is, the response time of COMMIT consists of several parts:

  • Network roundtrip
  • Certification time
  • Local applying

Please note that applying the transaction on the remote nodes does not affect the response time of COMMIT,
as it happens in the background after the certification response.

There are two important consequences of this architecture:

  • First: we can have several appliers working in parallel. This gives us true parallel replication – a slave can run many parallel threads, and you can tune this with the variable wsrep_slave_threads.
  • Second: there might be a small period of time when a slave is out of sync with the master. This happens because the master may apply events faster than a slave, and if you read from the slave, you may read data that has not changed yet. You can see that in the diagram. However, you can change this behavior with the variable wsrep_causal_reads=ON; in this case a read on the slave will wait until the event is applied (this, however, will increase the response time of the read) – see the example after this list. This gap between slave and master is the reason why this replication is called “virtually synchronous replication” rather than true “synchronous replication”.
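
For example, here is a minimal sketch of how these two variables can be set (the values and the table are purely illustrative):

    -- Allow several applier threads on this node (the value 8 is illustrative)
    SET GLOBAL wsrep_slave_threads = 8;

    -- Make reads on this connection wait until the node has applied
    -- everything it has certified, at the cost of higher read latency
    SET SESSION wsrep_causal_reads = ON;
    SELECT balance FROM accounts WHERE id = 1;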

The described behavior of COMMIT also has a second serious implication.
If you run write transactions on two different nodes, the cluster uses an optimistic locking model.
That means a transaction is not checked for possible locking conflicts during individual queries, but rather at the COMMIT stage, and you may get an ERROR response to your COMMIT. I am highlighting this because it is one of the incompatibilities with regular InnoDB that you may run into. In InnoDB, DEADLOCK and LOCK TIMEOUT errors usually happen in response to a particular query, not on COMMIT. If you follow good practice you check the error code after the “COMMIT” query anyway, but I have seen many applications that do not.

So, if you plan to use the Multi-Master capabilities of XtraDB Cluster and run write transactions on several nodes, you need to make sure you handle the response to the “COMMIT” query, as in the example below.
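
To illustrate, here is a sketch of how such a conflict can look when two sessions on different nodes update the same row (the table is hypothetical; Galera reports a certification failure as a deadlock error):

    -- On node1:
    BEGIN;
    UPDATE accounts SET balance = 10 WHERE id = 1;

    -- On node2, concurrently:
    BEGIN;
    UPDATE accounts SET balance = 20 WHERE id = 1;

    -- node1 commits first and passes certification:
    COMMIT;   -- OK

    -- node2's COMMIT now fails certification:
    COMMIT;
    -- ERROR 1213 (40001): Deadlock found when trying to get lock;
    -- try restarting transaction

The application should catch this error and retry the whole transaction, just as it would retry a deadlock with regular InnoDB.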

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Vadim,

    Do I understand correctly that if the master fails on commit after a slave has committed, it will fail and drop out of the cluster?

  2. Peter,

    That is the idea, I guess.
    If the master fails to make the commit, it will commit an act of suicide and throw itself out of the cluster.
    The same goes for a slave, if it fails to commit a certified transaction.

  3. Alex says:

    Peter, yes, that’s the case. The only exception is DDL statements, which are allowed to fail.

  4. Alex,

    Why would DDL be allowed to fail?

  5. Alex says:

    Peter,

    DDLs are not transactional, so we need to deal with them differently: first replicate, then execute (this is the only way to ensure that they are executed in the same order on all of the nodes). So at the point of execution we don’t know whether this DDL would have failed on the master or not. Perhaps you’re trying to drop a table which does not exist any more (some other client dropped it while your statement was being replicated).
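
    A sketch of this race, with a hypothetical table t1:

    -- Client 1 on node1:
    DROP TABLE t1;   -- replicated first, executes in order on all nodes, succeeds

    -- Client 2 on node2, issued at almost the same time:
    DROP TABLE t1;   -- ordered after client 1's statement
    -- ERROR 1051 (42S02): Unknown table 't1'
    -- The statement fails, but the node stays in the cluster.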

  6. Doug says:

    Vadim,

    I use Percona Server 5.1 in a two-master, no-slave MMM configuration. The read/write ratio of our application is 90/10. How do I distribute the reads and writes in a 3-node Percona XtraDB Cluster? Today I do it via MySQL MMM read and write role virtual IPs through a JDBC connection.

  7. Doug,

    You actually do not need to worry about read-write splitting now.
    You just send both kinds of queries to any available node.

    If you would still like to have a dedicated write server, then you pick one of the three nodes for read-write, and the remaining two for reads.

  8. Alex says:

    Depending on how much control you want over R/W splitting, you may try the internal multi-master JDBC load balancer (developed for NDB; it sends writes everywhere):
    jdbc:mysql:loadbalance://,,…/?loadBalanceBlacklistTimeout=5000

  9. Alex says:

    oops, comment formatting ate my text. Check the full version here: http://www.codership.com/content/setting-multi-master-jdbc-mysqlgalera

    To blog maintainer: when comments are formatted by the engine, a preview button is a must. Or at least a warning.

  10. You could always use HAProxy to distribute your requests evenly across the cluster nodes, or even distribute them based on server load. We do that currently for our non-SQL front-end servers, so adding the database cluster would be trivial. I’m really excited about this development and waiting to see it in production. We were researching Cassandra and contemplating a switch as our system and dataset grow, but this is a far easier solution for us. (And we already know how to deal with most MySQL issues, so that’s an added bonus.)

  11. Hi Alex/Vadim,

    We have installed mysql-server-wsrep-5.5.20-23.4 with galera-23.2.0.

    When starting the second node and beyond, we get a warning that IST will be unavailable (the logs are below).
    Other than that, the replication works fine. We are concerned about the possibility that incremental state transfer is not working.
    Does it mean that the cluster copies all data from the main node to the others on each transaction?

    Can you please help us trace this problem?
    Thanks,
    Eitan

    Apr 1 14:14:26 ubuntu-server-129 mysqld: 120401 14:14:26 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (8630f2a8-7be8-11e1-0800-7e1ae413f999): 1 (Operation not permitted)
    Apr 1 14:14:26 ubuntu-server-129 mysqld: #011 at galera/src/replicator_str.cpp:prepare_for_IST():429. IST will be unavailable.

    Apr 1 14:14:26 ubuntu-server-129 mysqld: 120401 14:14:26 [Note] WSREP: Node 1 (ubuntu-server-129) requested state transfer from '*any*'. Selected 0 (ubuntu-server-128)(SYNCED) as donor.
    Apr 1 14:14:26 ubuntu-server-129 mysqld: 120401 14:14:26 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 5010)
    Apr 1 14:14:26 ubuntu-server-129 mysqld: 120401 14:14:26 [Note] WSREP: Requesting state transfer: success, donor: 0
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:26 [Warning] IP address '192.168.210.128' could not be resolved: Name or service not known
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: wsrep_start_position var submitted: '8630f2a8-7be8-11e1-0800-7e1ae413f999:5010'
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: Received SST: 8630f2a8-7be8-11e1-0800-7e1ae413f999:5010
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: SST received: 8630f2a8-7be8-11e1-0800-7e1ae413f999:5010
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: 1 (ubuntu-server-129): State transfer from 0 (ubuntu-server-128) complete.
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: Shifting JOINER -> JOINED (TO: 5010)
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: Member 1 (ubuntu-server-129) synced with group.
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 5010)
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: Synchronized with group, ready for connections
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: 0 (ubuntu-server-128): State transfer to 1 (ubuntu-server-129) complete.
    Apr 1 14:14:27 ubuntu-server-129 mysqld: 120401 14:14:27 [Note] WSREP: Member 0 (ubuntu-server-128) synced with group.
    Apr 1 14:17:01 ubuntu-server-129 CRON[3512]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)

  12. Hi,

    I have a thought that the warning appears because the node is not yet in the cluster, and in that case the only way to sync is SST.

    Am I right?

    I came to this conclusion because when I dropped the second node for a short while, made some changes to the main node, and restarted the second node, this time I could see some log messages relating to IST:

    120401 19:31:44 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.210.129:4568
    …….

    120401 19:31:45 [Note] WSREP: Received SST: 8630f2a8-7be8-11e1-0800-7e1ae413f999:5011
    120401 19:31:45 [Note] WSREP: SST received: 8630f2a8-7be8-11e1-0800-7e1ae413f999:5011
    120401 19:31:45 [Note] WSREP: Receiving IST: 1 writesets, seqnos 5011-5012

  13. Alex says:

    Hi Eitan,

    You’re absolutely correct!

  14. Thanks a lot Alex!

    One more question relating to the (write) performance of Galera.

    We have compared a standalone MySQL with a one-node Galera cluster (inserting new records).
    The one-node cluster is sometimes slower than the standalone MySQL by about 20%.
    Is that expected behavior?

    Thanks,
    Eitan

  15. erkules says:

    Eitan, what have you compared?
    With Galera, one approach is to compare against semi-sync replication. Even if Galera offers much more than semi-sync, have a look at http://linsenraum.de/erkules/2012/03/galera-als-replikationsersatz.html – it is written in German, but the graphs speak for themselves.
    You can see Galera is faster than MySQL 5.5.21 semi-sync (blue line). MySQL dropped out of semi-sync after a concurrency of 64.
    The green line is MariaDB 5.5.20 with group commit and XtraDB. Even if they are faster, Galera performs awesomely, and you don’t have the data sitting only in the relay log.
    And don’t forget, you can write to the ‘slave’ too :)

  16. Alex says:

    Eitan,

    Depending on the configuration you used, it may be. Understandably, Galera adds some overhead even in a single-node cluster, since it still has to prepare and pack the writesets. The reason for this is keeping track of the global transaction ID and caching the writesets for a potential IST. This overhead can be turned off either by unloading the Galera wsrep provider or by setting wsrep_on=0 globally, but that of course will effectively turn the node into a non-Galera node.
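
    A minimal sketch of the runtime option (it affects transactions started after the change):

    -- Disable wsrep replication hooks globally, for benchmarking only:
    SET GLOBAL wsrep_on = 0;

    Unloading the provider entirely can be done with wsrep_provider=none in my.cnf, followed by a server restart.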

    So if you want a single-node comparison, you should probably compare a single Galera node with plain MySQL with the binlog turned on.

    Having said that, 20% may not really be justified (so that is our bad), but 10% is probably unavoidable.

  17. Saeed says:

    Hi,

    Could you please point me to an online tutorial for setting up master-master replication with XtraDB Cluster?

    Thanks

  18. Eugene says:

    Hi,

    I understand that Percona XtraDB Cluster helps with High Availability, but how does it help with scalability? As I understand it, replication just copies, or makes sure all database servers have the same information (data), so that every server acts as a master. To me this also means that all servers will likely get filled up at the same time. Does Percona XtraDB Cluster have some kind of sharding, so that each database server (MySQL) added to the cluster helps increase the storage space?

  19. alex says:

    Eugene,
    Sure, replication “just copies” the changes made by a transaction. Copying changes is much faster than executing the transaction, so this leaves the slave with spare cycles to execute more transactions.

  20. san says:

    How do I separate read and write traffic in a typical JDBC web application? Is mysql-proxy the only solution for this? I would like to use HAProxy, because it is highly recommended for Percona XtraDB Cluster.

  21. gpfeng.cs says:

    Does the master need to get the slave’s certification result before the real physical COMMIT (or ROLLBACK)?

    What about the case where the master and a slave get different certification results? In that case, which one commits suicide?

    from http://www.mysqlperformanceblog.com/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/#comment-1279058

    it seems that local trxs (before broadcast) will be aborted/rolled back

  22. gpfeng.cs says:

    it seems that local trxs (before broadcast) will be aborted/rolled back
    ->
    it seems that ONLY local trxs (before broadcast) will be aborted/rolled back

  23. Ashok says:

    Team

    I want to sync a Percona XtraDB Cluster with another Percona XtraDB Cluster.

    Please let us know how we can do it.

    Thanks
    Ashok

  24. Hi Ashok,

    Thanks for the comment, but 1) this is an old post and 2) the place to ask this question is on our discussion forums, under Percona XtraDB Cluster. Here’s the URL: http://www.percona.com/forums/questions-discussions/percona-xtradb-cluster

    That’s the place to connect with your peers and many times with Percona experts, too.
