Percona Replication Manager, a solution for MySQL high availability with replication using Pacemaker

The content of this article is outdated, look here for more up to date information.

Over the last year, the frustration of many of us at Percona regarding issues with MMM has grown to a level where we started looking at other ways of achieving higher availability using MySQL replication. One of the weakness of MMM is its communication layer, so instead of reinventing a flat tire, we decided, Baron Schwartz and I, to develop a solution using Pacemaker, a well known and established cluster manager with a bullet proof communication layer. One of the great thing about Pacemaker is its flexibility but flexibility may results in complexity. With the help of people from the Pacemaker community, namely Florian Haas and Raoul Bhatia, I have been able to modify the existing MySQL Pacemaker resource agent in a way that it survived our replication tests and offered a behavior pretty similar to MMM regarding Virtual IP addresses, VIPs, management. We decided to call this solution PRM for Percona Replication Manager. All the parts are opensource and available under the GPL license.

Keep in mind this solution is hot from the press, consider it alpha. Like I said above, it survived testing in a very controlled environment but it is young and many issues/bugs are likely to be found. Also, it is different from Yoshinori Matsunobu’s MHA solution and in fact it is quite a complement to it. One of my near term goal is to integrate with MHA for master promotion.

The solution is basically made of 3 pieces:

  • The Pacemaker cluster manager
  • A Pacemaker configuration
  • A MySQL resource agent

Here I will not cover the Pacemaker installation since this is fairly straightforward and covered elsewhere. I’ll discuss the MySQL resource agent and the supporting configuration while assuming basic knowledge of Pacemaker.

But, before we start, what does this solution offers.

  • Reader and writer VIPs behaviors similar to MMM
  • If the master fails, a new master is promoted from the slaves, no master to master setup needed. Selection of master is based on scores published by the slaves, the more up to date slaves have higher scores for promotion
  • Some nodes can be dedicated to be only slaves or less likely to become master
  • A node can be the preferred master
  • If replication on a slave breaks or lags beyond a defined threshold, the reader VIP(s) is removed. MySQL is not restarted.
  • If no slaves are ok, all VIPs, readers and writer, will be located on the master
  • During a master switch, connections are killed on the demoted master to avoid replication conflicts
  • All slaves are in read_only mode
  • Simple administrative commands can remove master role from a node
  • Pacemaker stonith devices are supported
  • No logical limits in term of number of nodes
  • Easy to add nodes

In order to setup the solution you’ll need my version of the MySQL resource agent, it is not yet pushed to the main Pacemaker resource agents branch. More testing and cleaning will be needed before that happen. You can get the resource agent from here:

You can also the whole branch from here:

On my Ubuntu Lucid VM, this file goes in /usr/lib/ocf/resource.d/heartbeat/ directory.

To use this agent, you’ll need a Pacemaker configuration. As a starting point, I’ll discuss the configuration I use during my tests.

Let’s review the configuration. First it begins by 3 node entries defining the 3 nodes I have in my cluster. One attribute is required to each node, the IP address that will be used for replication. This is a real IP address not a reader or writer VIP. This attribute allows the use of a private network for replication if needed.

Next is the mysql primitive resource declaration. This primitive defines the mysql resource on each node and has many parameters, here’s the ones I had to define:

  • config: The path of the my.cnf file. Remember that Pacemaker will start MySQL, not the regular init.d script
  • pid: The pid file. This is use by Pacemaker to know if MySQL is already running. It should match the my.cnf pid_file setting.
  • socket: The MySQL unix socket file
  • replication_user: The user to use when setting up replication. It is also currently used for the ‘CHANGE MASTER TO’ command, something that should/will change in the future
  • replication_passwd: The password for the above user
  • max_slave_lag: The maximum allowed slave lag in seconds, if a slave lags by more than that value, it will lose its reader VIP(s)
  • evict_outdated_slaves: A mandatory to set this to false otherwise Pacemaker will stop MySQL on a slave that lags behind. This will absolutely not help its recovery.
  • test_user and test_passwd: The credentials to test MySQL. Default is to run select count(*) on mysql.user table, so the user given should at least have select on that table.
  • op monitor: An entry is needed for each role, Master and Slave. Intervals must not be the same.

Following the mysql primitive declaration, the primitives for 3 reader vips and one writer vip are defined. Those are straightforward so I’ll skip detailed description. The next interesting element is the master-slave “ms” declaration. This is how Pacemaker defines an asymmetrical resource having a master and slaves. The only thing that may change here is clone-max=”3″ which should match the number of database nodes you have.

The handling of the VIPs is the truly new thing in the resource agent. I am grateful to Florian Haas who told me to use node attributes to avoid Pacemaker from over reacting. The availability of a reader or writer VIPs on a node are controlled by the attributes readerOK and writerOK and the location rules. An infinite negative weight is given when a VIP should not be on a host. I also added a few colocation rules to help spread the reader VIPs on all the nodes.

As a final thought on the Pacemaker configuration, remember that in order for a pacemaker cluster to run correctly on a 2 nodes cluster, you should set the quorum policy to ignore. Also, this example configuration has no stonith devices defined so stonith is disable. At the end of the configuration, you’ll notice the replication_info cluster attribute. You don’t have to define this, the mysql RA will add it automatically when the first a node will promoted to master.

There are not many requirements regarding the MySQL configuration, Pacemaker will automatically add “skip-start-slave” for a saner behavior. One of the important setting is “log_slave_updates = OFF” (default value). In some cases, if slaves are logging replication updates, it may cause failover issues. Also, the solution relies on the read_only setting on the slave so, make sure the application database use doesn’t have the SUPER privilege which overrides read_only.

Like I mentioned above, this project is young. In the future, I’d like to integrate MHA to benefit for its capacity of bringing all the nodes to a consistent level. Also, the security around the solution should be improved, a fairly easy task I believe. Of course, I’ll work with the maintainers of the Pacemaker resources agents to include it in the main branch once it matured a bit.

Finally, if you are interested by this solution but have problems setting it up, just contact us at Percona, we’ll be pleased to help.

Share this post

Comments (107)

  • Shlomi Noach

    Good work! A couple comments:

    The pacemaker config looks quite intimidating, what with all those unclear abbreviations and the amount of options.
    I sure wasn’t satisfied with MMM’s behavior, but its configuration was simple enough to understand. Here, it looks like I’m going to copy+paste your config and hope for the best. If anything goes wrong — I’ll have to have deeper understanding of pacemaker.

    This is not a criticism, but an observation: in order to set up the PRM high-avaliability solution for MySQL, you’ll need a sys-admin in addition to your DBA. Not all DBAs will know how to manage and analyze a Pacemaker configuration.

    Just consider the fact that you, as in Percona, had to go to Florian, who is probably one of the most knowledgeable people on Pacemaker, to make it work (e.g. Florian told you you had better used node attributes). I suspect things will not go smooth on all installations. How many Florians are there to be contacted?

    Again, this is merely an observation. Perhaps there is no easy way out. I would surely like a solution which focuses on usability and just wraps it all up for you.

    November 29, 2011 at 10:51 pm
  • vineet

    What about data integrity in such cluster environment?

    As per my understanding this is still asynchronous replication and there are chance of data loss in case of master node failure.

    November 29, 2011 at 11:15 pm
  • Florian Haas

    Yves & Shlomi, thanks for the kudos.

    Yves, I know I still owe you a review on that RA; sorry this has taken a while — I’ll try to get to it today.

    Shlomi, as to your comment about the config looking intimidating: I concur, but fear not: it can be made a lot less so. Most of what you see under $cib-bootstrap-options is just scaffolding. The mysql_replication property is auto-managed by the RA. And for the reader VIPs, we can make use of a cool feature in Pacemaker: clones allow us to manage an entire IP range as one resource, and we can then have all those constraints just apply to the clone. This makes for a much-condensed configuration:

    node testvirtbox1 \n attributes IP=""
    node testvirtbox2 \n attributes IP=""
    node testvirtbox3 \n attributes IP=""
    primitive p_mysql ocf:heartbeat:mysql \n params config="/etc/mysql/my.cnf" pid="/var/run/mysqld/" \n socket="/var/run/mysqld/mysqld.sock" replication_user="root" \n replication_passwd="rootpass" max_slave_lag="15" evict_outdated_slaves="false" \n binary="/usr/bin/mysqld_safe" test_user="root" \n test_passwd="rootpass" \n op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \n op monitor interval="2s" role="Slave" OCF_CHECK_LEVEL="1"
    # unique_clone_address="true" configures the resource
    # to manage an IP range when cloned
    primitive p_reader_vip ocf:heartbeat:IPaddr2 \n params ip="" unique_clone_address="true"
    clone reader_vip p_reader_vip \n meta globally-unique="true" clone-max=3 clone-node-max=3
    primitive writer_vip ocf:heartbeat:IPaddr2 \n params ip="" nic="eth0" \n meta target-role="Started"
    ms ms_MySQL p_mysql \n meta clone-max="3" notify="true"
    location reader_vip_reader_ok reader_vip \n rule -inf: readerOK eq 0
    location writer_vip_writer_ok writer_vip \n rule -inf: writerOK eq 0
    property $id="cib-bootstrap-options" \n stonith-enabled="false" \n no-quorum-policy="ignore"
    rsc_defaults $id="rsc-options" \n resource-stickiness="100"

    It’s still not super simple, but it’s a lot simpler than hacking a million shell scripts to do this on your own, less reliably. As Yves also mentions, this is an “alpha” solution and his changes to the RA are not yet merged upstream, so we are expecting a few more changes to happen before it’s merged.

    November 30, 2011 at 1:04 am
    • Yves Trudeau

      I tried the clone set with IPAddr2, it sort of work but the behavior is not entirely good. All the reader vips can end up on the same node even and valid nodes can be left without any reader vips. I tried a negative colocation rule with not luck. That’s why I reverted to using individual IPAddr2 resources.

      November 30, 2011 at 8:43 am
  • Florian Haas

    Argl. It seems like I missed a closing </code> tag. Yves, if you’re able to edit the comment and fix that, please do.

    November 30, 2011 at 1:06 am
  • Florian Haas

    vineet, that problem is well understood. Yves’ approach is for scale-out. If you need transaction-safe synchronous replication, slap MySQL on DRBD, put everything under Pacemaker management, and you’re good to go. That solution has been around for years.

    November 30, 2011 at 1:08 am
  • Marcus Bointon

    15 seconds for slave lag seems short. If you have a query that takes 15 seconds to run on a master, it’s not going to be complete on a slave until 30 seconds have elapsed (assuming servers have the same performance). It seems silly to knock slaves offline just because they’re running a long query – it’s just a known limitation of async replication. If you had several slaves, a slow query like that would knock them all offline at once, which seems to be inviting disaster.

    November 30, 2011 at 4:02 am
  • Yves Trudeau

    I indeed help from Florian but that was for the resource agent design, the implementation is not that complex. For sure the is a learning curve from MMM but Pacemaker is not that complicated.

    Slave lag is adjustable, so adjust it to what makes sense to you. MMM was also removing slaves from cluster if the were lagging behind so it is not a new behavior.

    Like you said, data integrity is not guaranteed by replication but yet, in many deployment, it is not a hard requirement and replication just do the job.

    November 30, 2011 at 6:34 am
  • Viacheslav Biriukov

    What about set_read_only function and timeouts?
    If I get it right: resource notify master to kill all connections before it sets read-only (I mean during live migrations)?

    November 30, 2011 at 7:03 am
  • William

    Thanks for all of the work on this. I would like to add the main thing missing from this blog post ( and many others around the web) is the description of the problem you are trying to solve and the draw backs to the approach. Sadly, I do see a lot of solutions to scale out just assume that data integrity is not an absolute requirement. Doing both is very difficult, no doubt.

    November 30, 2011 at 8:15 am
  • Florian Haas

    William, doing both is actually not much of a problem at all if you combine Yves’ approach for managing the slaves with the traditional DRBD approach for managing the master. And you can run all of that under Pacemaker as your unifying cluster infrastructure umbrella,

    November 30, 2011 at 8:20 am
  • Florian Haas

    Yves, if you want IP addresses to move away from the node you’ll just have to reset the resource stickiness.

    Let’s see if the configuration snippet works out better this time. 🙂

    node testvirtbox1
    node testvirtbox2
    node testvirtbox3
    primitive p_mysql ocf:heartbeat:mysql \
    params config=”/etc/mysql/my.cnf” pid=”/var/run/mysqld/” \n socket=”/var/run/mysqld/mysqld.sock” \
    replication_user=”root” replication_passwd=”rootpass” max_slave_lag=”15″ evict_outdated_slaves=”false” \
    binary=”/usr/sbin/mysqld” test_user=”root” test_passwd=”rootpass” \
    op monitor interval=”20s” role=”Master” OCF_CHECK_LEVEL=”1″ \
    op monitor interval=”30s” role=”Slave” OCF_CHECK_LEVEL=”1″
    ms ms_MySQL p_mysql \
    meta clone-max=”3″ notify=”true”
    primitive p_reader_vip ocf:heartbeat:IPaddr2 \
    params ip=”″ unique_clone_address=”true” \
    meta resource-stickiness=0
    clone reader_vip p_reader_vip \
    meta globally-unique=”true” clone-max=3 clone-node-max=3
    primitive writer_vip ocf:heartbeat:IPaddr2 \
    params ip=”″ nic=”eth0″
    location reader_vip_reader_ok reader_vip rule -inf: readerOK eq 0
    location writer_vip_writer_ok writer_vip rule -inf: writerOK eq 0
    property stonith-enabled=”false” no-quorum-policy=”ignore”
    rsc_defaults resource-stickiness=”100″

    November 30, 2011 at 10:12 am
    • Yves Trudeau

      I tried your config but still some issues:

      Online: [ testvirtbox1 testvirtbox3 testvirtbox2 ]

      writer_vip (ocf::heartbeat:IPaddr2): Started testvirtbox3
      Master/Slave Set: ms_MySQL
      Masters: [ testvirtbox3 ]
      Slaves: [ testvirtbox1 testvirtbox2 ]
      Clone Set: reader_vip (unique)
      p_reader_vip:0 (ocf::heartbeat:IPaddr2): Started testvirtbox2
      p_reader_vip:1 (ocf::heartbeat:IPaddr2): Started testvirtbox1
      p_reader_vip:2 (ocf::heartbeat:IPaddr2): Started testvirtbox1
      root@testvirtbox1:/tmp/mysql.ocf.ra.debug# crm_attribute -N testvirtbox3 -l reboot –name readerOK –query -q

      November 30, 2011 at 12:01 pm
  • Henrik Ingo

    +1 to what Schlomi says. While choosing Pacemaker as a robust clustering framework may not be the worst idea, the next step should be some kind of wrapper where the user provides some simple ini file and you hide all the Pacemaker complexity from end users. Without that, this is a lost cause.

    Also, if you think pacemaker configuration is a lot to digest, you should see the logs that it outputs!

    December 1, 2011 at 2:04 am