September 2, 2014

State of the art: Galera – synchronous replication for InnoDB

I first heard about Galera at the Percona Performance Conference 2009, where Seppo Jaakola presented “Galera: Multi-Master Synchronous MySQL Replication Clusters”. I was impressed, as I have personally always wanted synchronous replication for InnoDB, but it sat at the bottom of our list of plans because it is very hard to implement properly.
The idea itself is not new: I remember synchronous replication being announced for SolidDB at MySQL UC 2007, but the product was later killed by IBM.

For a long time after PPC 2009 the only available version was mysql-galera-0.6, which had a serious flaw: to set up a new node you had to take down the whole cluster. All this time Codership (the company that develops Galera) has been working on the 0.7 release, which introduces node propagation while keeping the cluster online. You can play with the 0.7pre release yourself: MySQL/Galera Release 0.7pre.

In the current version, propagation is done with mysqldump from one of the nodes (the “donor”). In the next release Codership is going to support LVM snapshots and xtrabackup, which will make setting up a new node even easier. The current annoyance I see is that if you shut down one node briefly for quick maintenance, after restart the node has to load the whole mysqldump, as if it were a new, empty node. I hope the Codership guys will address this as well.
Another thing I miss for now is support for the InnoDB plugin, which as we know performs much better than standard InnoDB.

So what is so interesting about Galera? A couple of things:

- High Availability. Any of the N standby nodes is available immediately when the main node fails. Galera is a serious contender for inclusion in the list Yves posted recently: http://www.percona.com/blog/2009/10/16/finding-your-mysql-high-availability-solution-%e2%80%93-the-questions/. I am not sure how many nines it will provide :), but the effort for test setup and deployment should be comparable to an MMM setup.

- Scale Writes. Galera allows you to write to any of the N nodes and automatically propagates the changes to the other nodes. It sounds too ideal, and there is a drawback: as the number of nodes you write to increases, your transaction rollback rate may increase, especially if you are working on the same dataset. You can find some results on Codership’s page, and I am going to run my own benchmarks as well. The benchmarks also show that the communication overhead may be significant for short writes.

- Scale Reads. This can be done with regular replication, but with synchronous replication your “slave nodes” are all in the same state, so there is no “slave behind”. When you read from any slave, you read the actual data. It also has a serious drawback, though: the cluster is only as fast as the “weakest” node in the chain. So if one node gets overloaded and its performance degrades, the same happens to the whole cluster.

- Heterogeneous-database replication. It is not here yet, and I do not know what is on Codership’s roadmap, but the group manager protocol in Galera is database independent, so it is only a matter of database drivers. For InnoDB it is currently a set of patches, and I think it is quite possible to do the same for Postgres. So a MySQL-Postgres cluster setup is not so far ahead :)
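
To make the write-scaling trade-off from the list above a little more concrete, here is a toy model in Python. It is a sketch under loud assumptions, not Galera's certification algorithm: it assumes each writing node commits one single-row transaction at the same moment, that rows are picked uniformly at random from a shared dataset, and that two transactions touching the same row count as a conflict. The function names and the row count are hypothetical and chosen only for illustration.

```python
from math import comb


def expected_conflicting_pairs(writers: int, rows: int) -> float:
    """Expected number of writer pairs that picked the same row."""
    # There are C(writers, 2) pairs, and each pair collides
    # with probability 1/rows under the uniform assumption.
    return comb(writers, 2) / rows


def probability_of_any_conflict(writers: int, rows: int) -> float:
    """Probability that at least two writers picked the same row."""
    # Birthday-problem style product: each new writer must avoid
    # every row already taken by the writers before it.
    p_all_distinct = 1.0
    for already_taken in range(writers):
        p_all_distinct *= max(0.0, 1.0 - already_taken / rows)
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    rows = 10_000  # hypothetical size of the contended dataset
    for writers in (1, 2, 4, 8, 16):
        pairs = expected_conflicting_pairs(writers, rows)
        p_any = probability_of_any_conflict(writers, rows)
        print(writers, pairs, round(p_any, 6))
```

In this model the number of colliding pairs grows with the number of writer pairs, so doubling the number of nodes you write to more than doubles the expected conflicts. Real workloads will differ, which is why benchmarks on your own dataset matter.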

On its “Company page” Codership says its goal is “to promote and exploit the latest developments in computer science to produce fast and scalable synchronous replication solution that “just works” for databases and similar applications”, and I think they are succeeding at it. Implementing a fast, scalable and working group communication and transaction manager is an art.

For now I would not put the 0.7 release into production, but you may seriously consider playing with it in a test environment and reporting bugs to the Codership team; they are very responsive.
I am waiting for the next releases and looking forward to integrating it with XtraDB.

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Tim Vaillancourt says:

    Exciting. I know what I am doing this weekend..

    Tim

  2. Hi Vadim,

    Thanks for the kind words and nice title!

    We are working on innodb plugin version in our current R&D sprint, it should be available
    in ~3 weeks time.

    So far, MySQL and MariaDB have taken all our attention and we haven’t been able to devote
    much time to PostgreSQL work. But in theory, heterogeneous replication is certainly possible.
    However, it would work only with SQL statement level replication. Note that the write
    scalability is mostly due to the efficiency of RBR events, so a heterogeneous cluster
    would be best suited for read scaling.

    -seppo

    P.S. For those, who are suspicious to run with multi-master, it is always
    possible to direct writes just to one node. This setup works then as synchronous
    master slave replication.

  3. John says:

    Sounds like a dream, can’t wait to give it a try. I agree, shutting a node down for a period of time requiring a db dump would be a show stopper for me. Would be better if they had some form of an async “catch up” and once up to speed enable the node as being online and switch to sync mode.

  4. Alex says:

    Hi John,

    “Donor” node is not “shut down”, it is just blocked for the duration of state snapshot transfer. This first implementation is using mysqldump which is slow, but at least as reliable as mysqldump itself and “just works”. Later we plan to add state transfer modes which would block donor for much shorter time, but may require some special setup (like LVM). Incremental state transfer is also in the works.

    But for now you could just keep a special reserved “donor” node for such purposes, and it does not have to be as powerful as “working” nodes: applying writesets requires much less resources than serving clients.

  5. Andy says:

    What kind of performance penalty does synchronous replication incur compared to no replication and async replication?

    In an N nodes cluster, an update will only commit after it’s been propagated to all N nodes, right? That sounds like it could introduce significant performance drop.

  6. Vadim says:

    Andy,

    I did not run benchmarks myself yet, but you can find some results on Codership’s website: http://www.codership.com/en/content/benchmarking-write-scalability . I have no reason to not believe them.

    As you can see, for short transactions the penalty may be significant (compare the numbers for 1 node), and it decreases with the size of the transaction, which is understandable.

    It is certainly a drawback of the technology, but it is the price of consistent data.

  7. Alex says:

    Andy,

    This is a rather interesting question, but the answer is not so straightforward. Galera cluster performance overhead comes not only from group communication latencies, but from a number of other factors and strongly depends on the load profile, number of CPU cores, IO subsystem and network configuration.

    You’re absolutely right suggesting that more nodes mean more overhead, but even with TCP transport adding new node results in disproportionally low additional latency. You may want to check http://www.codership.com/en/content/sysbench-ec2-size-matters which is about the only case where it could be reliably measured. With multicast transport communication latencies should depend on the number of nodes even less.

    That said, Galera replication overhead may be significant regardless of the number of nodes. You can see this effect taken to the extreme in a very synthetic mysqlslap benchmark here: http://www.codership.com/en/content/benchmarking-write-scalability. Curiously, one way to make up for the additional communication overhead is to increase the number of concurrent server connections. The rule of thumb is to double the number of connections per node compared to what gives the best performance on a standalone server.

    Much bigger overhead comes from certification conflicts and resulting transaction rollbacks. The conflict rate grows roughly as N^2 and is the major limiting factor for multi-master scalability.

    As for comparison with async master-slave: we don’t have any numbers yet. We don’t know of any benchmarks that can be used to measure the performance of an async master-slave cluster as a whole, and comparing just the master’s performance is not very informative, since the master behaves much like a standalone server.

    Vadim,

    I would not call that a “drawback” – it is a limitation ;). Galera replication is just a tool suitable for some tasks and unsuitable for others. As they say, YMMV.

  8. Mahesh says:

    Hi,

    Does Galera work with MySQL 5.0?

  9. alex says:

    Mahesh,

    No. MySQL must be patched to make use of Galera replication, and the patch exists only for 5.1. There are no principal obstacles to port it to 5.0 (there is nothing special about 5.1 in this regard), however there is significant code divergence between 5.0 and 5.1, so it won’t be trivial.

  10. Mahesh says:

    Alex,

    Thanks for the reply. Which version (major+minor) of mysql should we use with it and can it be reliably used in a production environment. I was searching for such a solution and your answer will decide whether I can recommend to my client.

    Thanks.

  11. alex says:

    Mahesh,

    The release that we tout for production is 0.7 and it is for MySQL 5.1.39. It’s been released Dec. 1st and we provide support for it (you can check out the patch and binary demos at https://launchpad.net/codership-mysql). 0.7.1 is coming out soon and will be for MySQL 5.1.41. In general, each MySQL/Galera release targets particular MySQL version (since it is a patch). In future we will continue to follow official MySQL GA releases with our own.

    BR

  13. Sohail says:

    Hi
    Can u plz share a How To on percona server master-master DB replication.

    Regards

  14. Michael Will says:

    The link http://www.codership.com/en/content/benchmarking-write-scalability does not work anymore?

    I am running mysqlslap against a galera-head of a master/master pair right now and don’t see the same kind of disk i/o on it that I have seen when running mysql 5.1 or mariadb 5.5 on the same hardware in a master/slave. I wonder if it is writing as synchronous?

    I was able to write 100,000 queries in about 50-60 seconds to mysql 5.1 and in about 6-9 seconds on mariadb 5.5, while the galera two-node cluster seems to get it done in 11-14 seconds. This is all on centos 6.4, adaptec raid controller with six 300G SAS drives in a raid10 for the database and four 2TB SATA drives in a raid10 for the binlog (yeah I know, not very balanced in terms of I/O).

    mysqlslap --concurrency=25,50,100,200,500,1000,2000 --iterations=10 --number-int-cols=6 --number-char-cols=6 --auto-generate-sql-secondary-indexes=3 --auto-generate-sql --csv=/tmp/mysqlslap_3index_innodb.csv --engine=innodb --auto-generate-sql-add-autoincrement --auto-generate-sql-load-type=mixed --number-of-queries=100000

    MariaDB-Galera-server-5.5.29-1.x86_64

    mysql-5.1.67-1.el6_3.x86_64

    MariaDB-server-5.5.30-1.x86_64

  15. Alex says:

    Hi Michael,

    Yes, I’m afraid the link is gone. But it is hardly relevant any more – it’s been 4 years!

    You probably are not seeing the same amount of IO because of different MySQL configuration. Make sure you’re using the same innodb_* parameters and binlog settings (log-bin, binlog-format, sync-binlog, etc.)
