Percona Replication Manager, a solution for MySQL high availability with replication using Pacemaker

November 29, 2011

Author

Yves Trudeau

MySQL

The content of this article is outdated, look here for more up to date information.

Over the last year, the frustration of many of us at Percona with MMM has grown to the point where we started looking at other ways of achieving higher availability using MySQL replication. One of the weaknesses of MMM is its communication layer, so instead of reinventing a flat tire, Baron Schwartz and I decided to develop a solution using Pacemaker, a well-known and established cluster manager with a bulletproof communication layer. One of the great things about Pacemaker is its flexibility, but flexibility may result in complexity. With the help of people from the Pacemaker community, namely Florian Haas and Raoul Bhatia, I have been able to modify the existing MySQL Pacemaker resource agent so that it survived our replication tests and offered behavior pretty similar to MMM regarding the management of virtual IP addresses (VIPs). We decided to call this solution PRM, for Percona Replication Manager. All the parts are open source and available under the GPL license.

Keep in mind this solution is hot off the press; consider it alpha. Like I said above, it survived testing in a very controlled environment, but it is young and many issues and bugs are likely to be found. Also, it is different from Yoshinori Matsunobu’s MHA solution and in fact complements it. One of my near-term goals is to integrate with MHA for master promotion.

The solution is basically made of 3 pieces:

- The Pacemaker cluster manager

- A Pacemaker configuration

- A MySQL resource agent

Here I will not cover the Pacemaker installation since this is fairly straightforward and covered elsewhere. I’ll discuss the MySQL resource agent and the supporting configuration while assuming basic knowledge of Pacemaker.

But before we start, what does this solution offer?

- Reader and writer VIPs behaviors similar to MMM

- If the master fails, a new master is promoted from the slaves; no master-to-master setup is needed. The new master is selected based on scores published by the slaves: the more up to date a slave is, the higher its promotion score

- Some nodes can be dedicated to be only slaves or less likely to become master

- A node can be the preferred master

- If replication on a slave breaks or lags beyond a defined threshold, its reader VIP(s) are removed. MySQL is not restarted.

- If no slaves are ok, all VIPs, readers and writer, will be located on the master

- During a master switch, connections are killed on the demoted master to avoid replication conflicts

- All slaves are in read_only mode

- Simple administrative commands can remove master role from a node

- Pacemaker stonith devices are supported

- No logical limit on the number of nodes

- Easy to add nodes

In order to set up the solution you’ll need my version of the MySQL resource agent; it is not yet pushed to the main Pacemaker resource agents branch. More testing and cleaning will be needed before that happens. You can get the resource agent from here:

https://github.com/y-trudeau/resource-agents/raw/master/heartbeat/mysql

You can also get the whole branch from here:

https://github.com/y-trudeau/resource-agents/zipball/master

On my Ubuntu Lucid VM, this file goes in the /usr/lib/ocf/resource.d/heartbeat/ directory.
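The installation step can be sketched as a small shell function. This is only an illustration, under the assumption that the Ubuntu Lucid path above applies. The function name install_prm_agent and the OCF_DIR and FETCH variables are hypothetical and not part of the resource agent. Note that it overwrites the packaged mysql agent.

```shell
#!/bin/sh
# Sketch: install the modified mysql resource agent on one node.
RA_URL="https://github.com/y-trudeau/resource-agents/raw/master/heartbeat/mysql"
# Directory used on Ubuntu Lucid; other distributions may differ.
OCF_DIR="${OCF_DIR:-/usr/lib/ocf/resource.d/heartbeat}"
# Download command, called as: $FETCH <output-file> <url>
FETCH="${FETCH:-wget -q -O}"

install_prm_agent() {
    dest="$OCF_DIR/mysql"
    tmp="$(mktemp)" || return 1

    # Download to a temporary file first so that a failed transfer
    # never leaves a truncated agent in place.
    if ! $FETCH "$tmp" "$RA_URL"; then
        rm -f "$tmp"
        return 1
    fi

    # Copy into place and make the agent executable.
    cp "$tmp" "$dest" && chmod 755 "$dest"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Run as root on every cluster node:
#   install_prm_agent
```

Downloading to a temporary file is a design choice of this sketch: Pacemaker should never find a half-written agent in its resource directory.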

To use this agent, you’ll need a Pacemaker configuration. As a starting point, I’ll discuss the configuration I use during my tests.

node testvirtbox1 \
        attributes IP="10.2.2.160"
node testvirtbox2 \
        attributes IP="10.2.2.161"
node testvirtbox3 \
        attributes IP="10.2.2.162"
primitive p_mysql ocf:heartbeat:mysql \
        params config="/etc/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" \
               socket="/var/run/mysqld/mysqld.sock" replication_user="root" \
               replication_passwd="rootpass" max_slave_lag="15" evict_outdated_slaves="false" \
               binary="/usr/bin/mysqld_safe" test_user="root" \
               test_passwd="rootpass" \
        op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \
        op monitor interval="2s" role="Slave" OCF_CHECK_LEVEL="1"
primitive reader_vip_1 ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.171" nic="eth0"
primitive reader_vip_2 ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.172" nic="eth0"
primitive reader_vip_3 ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.173" nic="eth0"
primitive writer_vip ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.170" nic="eth0" \
        meta target-role="Started"
ms ms_MySQL p_mysql \
        meta master-max="1" master-node-max="1" clone-max="3" clone-node-max="1" notify="true" globally-unique="false" target-role="Master" is-managed="true"
location No-reader-vip-1-loc reader_vip_1 \
        rule $id="No-reader-vip-1-rule" -inf: readerOK eq 0
location No-reader-vip-2-loc reader_vip_2 \
        rule $id="No-reader-vip-2-rule" -inf: readerOK eq 0
location No-reader-vip-3-loc reader_vip_3 \
        rule $id="No-reader-vip-3-rule" -inf: readerOK eq 0
location No-writer-vip-loc writer_vip \
        rule $id="No-writer-vip-rule" -inf: writerOK eq 0
colocation reader_vip_1_dislike_reader_vip_2 -200: reader_vip_1 reader_vip_2
colocation reader_vip_1_dislike_reader_vip_3 -200: reader_vip_1 reader_vip_3
colocation reader_vip_2_dislike_reader_vip_3 -200: reader_vip_2 reader_vip_3
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-a15ead49e20f047e129882619ed075a65c1ebdfe" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="3" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1322236006"
property $id="mysql_replication" \
        replication_info="10.2.2.162|mysql-bin.000090|106"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

Let’s review the configuration. It begins with three node entries defining the three nodes in my cluster. One attribute is required for each node: the IP address that will be used for replication. This is a real IP address, not a reader or writer VIP. This attribute allows the use of a private network for replication if needed.

Next is the mysql primitive resource declaration. This primitive defines the mysql resource on each node and has many parameters; here are the ones I had to define:

- config: The path of the my.cnf file. Remember that Pacemaker will start MySQL, not the regular init.d script

- pid: The pid file. This is used by Pacemaker to know if MySQL is already running. It should match the my.cnf pid_file setting.

- socket: The MySQL unix socket file

- replication_user: The user to use when setting up replication. It is also currently used for the ‘CHANGE MASTER TO’ command, something that should/will change in the future

- replication_passwd: The password for the above user

- max_slave_lag: The maximum allowed slave lag in seconds. If a slave lags by more than that value, it will lose its reader VIP(s)

- evict_outdated_slaves: It is mandatory to set this to false, otherwise Pacemaker will stop MySQL on a slave that lags behind. This will absolutely not help its recovery.

- test_user and test_passwd: The credentials used to test MySQL. The default test runs select count(*) on the mysql.user table, so the given user needs at least the SELECT privilege on that table.

- op monitor: An entry is needed for each role, Master and Slave. Intervals must not be the same.
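To make the max_slave_lag behavior concrete, here is a minimal sketch of the decision described in the list, not the resource agent's actual code. The function name reader_ok_for_lag is hypothetical. It assumes the lag comes from the Seconds_Behind_Master column of SHOW SLAVE STATUS, which is NULL when replication is broken, and it prints the value the readerOK attribute would take.

```shell
#!/bin/sh
# Hypothetical helper: prints 1 when a slave may keep its reader VIP,
# 0 when it should lose it. Illustration only, not the agent's code.
reader_ok_for_lag() {
    lag="$1"        # Seconds_Behind_Master from SHOW SLAVE STATUS
    max_lag="$2"    # the max_slave_lag parameter, in seconds

    case "$lag" in
        ''|NULL|*[!0-9]*)
            # Replication is broken or the lag is unknown.
            echo 0
            return
            ;;
    esac

    if [ "$lag" -gt "$max_lag" ]; then
        echo 0
    else
        echo 1
    fi
}

reader_ok_for_lag 3 15      # prints 1
reader_ok_for_lag 42 15     # prints 0
reader_ok_for_lag NULL 15   # prints 0
```

On a live cluster, the value actually published for a node can be read with crm_attribute -N <node> -l reboot --name readerOK --query -q.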

Following the mysql primitive declaration, the primitives for three reader VIPs and one writer VIP are defined. Those are straightforward, so I’ll skip the detailed description. The next interesting element is the master-slave “ms” declaration. This is how Pacemaker defines an asymmetrical resource having a master and slaves. The only thing that may change here is clone-max="3", which should match the number of database nodes you have.

The handling of the VIPs is the truly new thing in the resource agent. I am grateful to Florian Haas, who told me to use node attributes to keep Pacemaker from overreacting. The availability of reader or writer VIPs on a node is controlled by the attributes readerOK and writerOK and the location rules. An infinite negative weight is given when a VIP should not be on a host. I also added a few colocation rules to help spread the reader VIPs across all the nodes.
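Since every reader VIP needs its own primitive, its own location rule, and one colocation rule per pair of VIPs, the repetition grows quickly with the number of readers. The following sketch prints those entries for any number of reader VIPs, using the resource and rule names from the sample configuration and the backslash line continuations the crm shell expects. The function gen_reader_vips and its arguments are hypothetical helpers, not part of PRM.

```shell
#!/bin/sh
# Sketch: print crm configuration entries for N reader VIPs.
gen_reader_vips() {
    prefix="$1"   # first three octets, e.g. 10.2.2
    first="$2"    # last octet of the first reader VIP, e.g. 171
    count="$3"    # number of reader VIPs
    nic="$4"      # network interface, e.g. eth0

    i=1
    while [ "$i" -le "$count" ]; do
        octet=$((first + i - 1))
        printf 'primitive reader_vip_%s ocf:heartbeat:IPaddr2 \\\n' "$i"
        printf '        params ip="%s.%s" nic="%s"\n' "$prefix" "$octet" "$nic"
        # Ban the VIP from any node whose readerOK attribute is 0.
        printf 'location No-reader-vip-%s-loc reader_vip_%s \\\n' "$i" "$i"
        printf '        rule $id="No-reader-vip-%s-rule" -inf: readerOK eq 0\n' "$i"
        i=$((i + 1))
    done

    # One mild anti-colocation rule per pair of VIPs to spread them out.
    a=1
    while [ "$a" -lt "$count" ]; do
        b=$((a + 1))
        while [ "$b" -le "$count" ]; do
            printf 'colocation reader_vip_%s_dislike_reader_vip_%s -200: reader_vip_%s reader_vip_%s\n' \
                "$a" "$b" "$a" "$b"
            b=$((b + 1))
        done
        a=$((a + 1))
    done
}

gen_reader_vips 10.2.2 171 3 eth0
```

With three VIPs this prints three primitives, three location rules and three colocation rules; with four VIPs the number of colocation rules grows to six, which is why generating them is less error-prone than typing them.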

As a final thought on the Pacemaker configuration, remember that for a Pacemaker cluster to run correctly with only two nodes, you should set the quorum policy to ignore. Also, this example configuration has no stonith devices defined, so stonith is disabled. At the end of the configuration, you’ll notice the replication_info cluster attribute. You don’t have to define this; the mysql RA will add it automatically when the first node is promoted to master.
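The reason a two-node cluster needs no-quorum-policy="ignore" comes down to simple arithmetic: a partition has quorum only when it holds more than half of the expected votes, and by default Pacemaker stops resources in a partition without quorum. The sketch below works through the numbers, assuming one vote per node. The function name has_quorum is hypothetical.

```shell
#!/bin/sh
# Majority quorum: a partition has quorum when it holds more than half
# of the expected votes (one vote per node assumed here).
has_quorum() {
    expected="$1"   # expected-quorum-votes
    alive="$2"      # nodes still visible in this partition
    needed=$((expected / 2 + 1))
    [ "$alive" -ge "$needed" ]
}

for expected in 2 3; do
    alive=$((expected - 1))   # one node has failed
    if has_quorum "$expected" "$alive"; then
        echo "$expected nodes, 1 down: quorum kept"
    else
        echo "$expected nodes, 1 down: quorum lost"
    fi
done
```

With two nodes, the surviving node alone can never hold a majority, so without the ignore policy a single failure would take the whole service down.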

There are not many requirements regarding the MySQL configuration; Pacemaker will automatically add “skip-slave-start” for saner behavior. One important setting is “log_slave_updates = OFF” (the default value). In some cases, if slaves are logging replication updates, it may cause failover issues. Also, the solution relies on the read_only setting on the slaves, so make sure the application's database user doesn’t have the SUPER privilege, which overrides read_only.
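One way to check the SUPER requirement is to inspect the output of SHOW GRANTS for the application account. The helper below is a hypothetical sketch that reads that output on standard input and succeeds when it finds a global grant that bypasses read_only, either SUPER itself or ALL PRIVILEGES ON *.*. The function name and the account used in the example are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: reads SHOW GRANTS output on stdin and succeeds
# when the account holds a global privilege that bypasses read_only.
grants_bypass_read_only() {
    # SUPER is granted explicitly or through ALL PRIVILEGES ON *.*
    grep -E -q 'GRANT (ALL PRIVILEGES|([A-Z ]+, )*SUPER(, [A-Z ]+)*) ON \*\.\* TO'
}

sample="GRANT SELECT, SUPER ON *.* TO 'app'@'%'"
if printf '%s\n' "$sample" | grants_bypass_read_only; then
    echo "WARNING: this account can write to read_only slaves"
fi

# Usage on a live server (the account name is an example):
#   mysql -N -e "SHOW GRANTS FOR 'app'@'%'" | grants_bypass_read_only \
#       && echo "WARNING: app can write to read_only slaves"
```

A grant of ALL PRIVILEGES on a single schema does not match, since SUPER is a global privilege and only exists at the *.* level.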

Like I mentioned above, this project is young. In the future, I’d like to integrate MHA to benefit from its capacity to bring all the nodes to a consistent level. Also, the security around the solution should be improved, a fairly easy task I believe. Of course, I’ll work with the maintainers of the Pacemaker resource agents to include it in the main branch once it has matured a bit.

Finally, if you are interested in this solution but have problems setting it up, just contact us at Percona; we’ll be pleased to help.

107 Comments

Shlomi Noach

14 years ago

Good work! A couple comments:

The pacemaker config looks quite intimidating, what with all those unclear abbreviations and the amount of options.
I sure wasn’t satisfied with MMM’s behavior, but its configuration was simple enough to understand. Here, it looks like I’m going to copy+paste your config and hope for the best. If anything goes wrong — I’ll have to have deeper understanding of pacemaker.

This is not a criticism, but an observation: in order to set up the PRM high-availability solution for MySQL, you’ll need a sys-admin in addition to your DBA. Not all DBAs will know how to manage and analyze a Pacemaker configuration.

Just consider the fact that you, as in Percona, had to go to Florian, who is probably one of the most knowledgeable people on Pacemaker, to make it work (e.g. Florian told you you had better used node attributes). I suspect things will not go smooth on all installations. How many Florians are there to be contacted?

Again, this is merely an observation. Perhaps there is no easy way out. I would surely like a solution which focuses on usability and just wraps it all up for you.

vineet

14 years ago

What about data integrity in such cluster environment?

As per my understanding this is still asynchronous replication and there is a chance of data loss in case of master node failure.

Florian Haas

14 years ago

Yves & Shlomi, thanks for the kudos.

Yves, I know I still owe you a review on that RA; sorry this has taken a while — I’ll try to get to it today.

Shlomi, as to your comment about the config looking intimidating: I concur, but fear not: it can be made a lot less so. Most of what you see under $cib-bootstrap-options is just scaffolding. The mysql_replication property is auto-managed by the RA. And for the reader VIPs, we can make use of a cool feature in Pacemaker: clones allow us to manage an entire IP range as one resource, and we can then have all those constraints just apply to the clone. This makes for a much-condensed configuration:
node testvirtbox1 \
        attributes IP="10.2.2.160"
node testvirtbox2 \
        attributes IP="10.2.2.161"
node testvirtbox3 \
        attributes IP="10.2.2.162"
primitive p_mysql ocf:heartbeat:mysql \
        params config="/etc/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" \
               socket="/var/run/mysqld/mysqld.sock" replication_user="root" \
               replication_passwd="rootpass" max_slave_lag="15" evict_outdated_slaves="false" \
               binary="/usr/bin/mysqld_safe" test_user="root" \
               test_passwd="rootpass" \
        op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \
        op monitor interval="2s" role="Slave" OCF_CHECK_LEVEL="1"
# unique_clone_address="true" configures the resource
# to manage an IP range when cloned
primitive p_reader_vip ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.171" unique_clone_address="true"
clone reader_vip p_reader_vip \
        meta globally-unique="true" clone-max=3 clone-node-max=3
primitive writer_vip ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.170" nic="eth0" \
        meta target-role="Started"
ms ms_MySQL p_mysql \
        meta clone-max="3" notify="true"
location reader_vip_reader_ok reader_vip \
        rule -inf: readerOK eq 0
location writer_vip_writer_ok writer_vip \
        rule -inf: writerOK eq 0
property $id="cib-bootstrap-options" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"
It’s still not super simple, but it’s a lot simpler than hacking a million shell scripts to do this on your own, less reliably. As Yves also mentions, this is an “alpha” solution and his changes to the RA are not yet merged upstream, so we are expecting a few more changes to happen before it’s merged.

Author

Yves Trudeau

14 years ago

Reply to Florian Haas

Florian
I tried the clone set with IPaddr2. It sort of works, but the behavior is not entirely good. All the reader VIPs can end up on the same node, and valid nodes can be left without any reader VIPs. I tried a negative colocation rule with no luck. That’s why I reverted to using individual IPaddr2 resources.

Florian Haas

14 years ago

Argl. It seems like I missed a closing </code> tag. Yves, if you’re able to edit the comment and fix that, please do.

Florian Haas

14 years ago

vineet, that problem is well understood. Yves’ approach is for scale-out. If you need transaction-safe synchronous replication, slap MySQL on DRBD, put everything under Pacemaker management, and you’re good to go. That solution has been around for years.

Marcus Bointon

14 years ago

15 seconds for slave lag seems short. If you have a query that takes 15 seconds to run on a master, it’s not going to be complete on a slave until 30 seconds have elapsed (assuming servers have the same performance). It seems silly to knock slaves offline just because they’re running a long query – it’s just a known limitation of async replication. If you had several slaves, a slow query like that would knock them all offline at once, which seems to be inviting disaster.

Author

Yves Trudeau

14 years ago

@Shlomi
I indeed got help from Florian, but that was for the resource agent design; the implementation is not that complex. For sure there is a learning curve coming from MMM, but Pacemaker is not that complicated.

@Marcus
Slave lag is adjustable, so adjust it to what makes sense to you. MMM was also removing slaves from the cluster if they were lagging behind, so it is not a new behavior.

@vineet
Like you said, data integrity is not guaranteed by replication, and yet in many deployments it is not a hard requirement and replication just does the job.

Viacheslav Biriukov

14 years ago

What about set_read_only function and timeouts?
If I get it right: resource notify master to kill all connections before it sets read-only (I mean during live migrations)?

William

14 years ago

@Yves Trudeau
Thanks for all of the work on this. I would like to add that the main thing missing from this blog post (and many others around the web) is a description of the problem you are trying to solve and the drawbacks of the approach. Sadly, I do see a lot of scale-out solutions that just assume that data integrity is not an absolute requirement. Doing both is very difficult, no doubt.

Florian Haas

14 years ago

William, doing both is actually not much of a problem at all if you combine Yves’ approach for managing the slaves with the traditional DRBD approach for managing the master. And you can run all of that under Pacemaker as your unifying cluster infrastructure umbrella.

Florian Haas

14 years ago

Yves, if you want IP addresses to move away from the node you’ll just have to reset the resource stickiness.

Let’s see if the configuration snippet works out better this time. 🙂

node testvirtbox1
node testvirtbox2
node testvirtbox3
primitive p_mysql ocf:heartbeat:mysql \
        params config="/etc/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" \
               socket="/var/run/mysqld/mysqld.sock" \
               replication_user="root" replication_passwd="rootpass" max_slave_lag="15" evict_outdated_slaves="false" \
               binary="/usr/sbin/mysqld" test_user="root" test_passwd="rootpass" \
        op monitor interval="20s" role="Master" OCF_CHECK_LEVEL="1" \
        op monitor interval="30s" role="Slave" OCF_CHECK_LEVEL="1"
ms ms_MySQL p_mysql \
        meta clone-max="3" notify="true"
primitive p_reader_vip ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.171" unique_clone_address="true" \
        meta resource-stickiness=0
clone reader_vip p_reader_vip \
        meta globally-unique="true" clone-max=3 clone-node-max=3
primitive writer_vip ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.170" nic="eth0"
location reader_vip_reader_ok reader_vip rule -inf: readerOK eq 0
location writer_vip_writer_ok writer_vip rule -inf: writerOK eq 0
property stonith-enabled="false" no-quorum-policy="ignore"
rsc_defaults resource-stickiness="100"

Author

Yves Trudeau

14 years ago

Reply to Florian Haas

Florian,
I tried your config but still some issues:

Online: [ testvirtbox1 testvirtbox3 testvirtbox2 ]

writer_vip (ocf::heartbeat:IPaddr2): Started testvirtbox3
Master/Slave Set: ms_MySQL
Masters: [ testvirtbox3 ]
Slaves: [ testvirtbox1 testvirtbox2 ]
Clone Set: reader_vip (unique)
p_reader_vip:0 (ocf::heartbeat:IPaddr2): Started testvirtbox2
p_reader_vip:1 (ocf::heartbeat:IPaddr2): Started testvirtbox1
p_reader_vip:2 (ocf::heartbeat:IPaddr2): Started testvirtbox1
root@testvirtbox1:/tmp/mysql.ocf.ra.debug# crm_attribute -N testvirtbox3 -l reboot --name readerOK --query -q
1

Henrik Ingo

14 years ago

+1 to what Shlomi says. While choosing Pacemaker as a robust clustering framework may not be the worst idea, the next step should be some kind of wrapper where the user provides some simple ini file and you hide all the Pacemaker complexity from end users. Without that, this is a lost cause.

Also, if you think pacemaker configuration is a lot to digest, you should see the logs that it outputs!

Author

Yves Trudeau

14 years ago

Reply to Henrik Ingo

Hi Henrik,
I do both MMM and Pacemaker, and I don’t agree with you that Pacemaker is a much more complex setup. The problem with MMM is that it fails to deliver what it is supposed to do. I do agree though that we need step-by-step documentation, and the idea of a configuration wrapper is interesting. I’ll work on that in the near future.

erkan yanar

14 years ago

Thx for the post. I’ve got some questions.

1. Why do you use no-quorum-policy="ignore" at all (there is expected-quorum-votes="3" defined also)? Having 3 nodes you should be able to go for quorum.

2. binary="/usr/bin/mysqld_safe": why not simply take mysqld? mysqld_safe is going to restart mysqld if it exits in an ‘abnormal’ way. Imho only Pacemaker should do that.

3. I don’t see where/how you are starting your master initially, e.g. if you want to upgrade an existing installation with the HA capabilities of Pacemaker.

4. Still reading the ocf. Great work!

Author

Yves Trudeau

14 years ago

Reply to erkan yanar

Hi Erkan,
1. I used no-quorum-policy="ignore" to allow the cluster to start with only 1 or 2 nodes (in a 3-node cluster). With more nodes, “ignore” would be less necessary.
2. You have a point, the old ocf was using mysqld_safe and I just didn’t change it
3. Indeed, I need to document more. I have in mind at least these: step by step install, migration from MMM, adding nodes, common problems and how to solve them. I’ll work on that.
4. thanks

Florian Haas

14 years ago

Henrik, please be introduced to Pacemaker crm shell templates. http://www.clusterlabs.org/doc/crm_cli.html#_templates

No need to hack your own wrapper or come up with an .ini file. You just create and publish a template, and users only need to fill in the blanks.

Henrik Ingo

14 years ago

Florian, so why not create a system that has web.ini, with this part as its only content:
$ grep -n ^%% ~/.crmconf/web
23:%% ip 192.168.1.101
31:%% netmask
35:%% lvs_support
61:%% id websvc
65:%% configfile /etc/apache2/httpd.conf
71:%% options
76:%% envfilesthese

Why do I need to even enter the crm for a basic setup?

And remember, the crm is already a wrapper around the official, internal xml based configuration format. That the configuration is so difficult that we are discussing a wrapper around a wrapper to hide it… This is the reason why Pacemaker is right up there with Symbian C++ and autotools as most difficult technologies I’ve ever tried to learn. (For a wrapper around autotools, see pandora build system. For Symbian there was no cure and it is now dying a slow death…)

Yves: I didn’t ever use MMM, but I thought it was supposed to be simple to use. If it is difficult *and* doesn’t work, I’m surprised people ever used it. But look, you and Florian are at this moment the only people in the world who know how to use Pacemaker to manage MySQL replication clusters *correctly*. Even you can only do it with Florian’s help. So I’m wishing you good luck in bringing this technology to the masses, and based on our talks in London I do see there is a chance you can actually come up with something great, but at the moment it’s not there yet.

Author

Yves Trudeau

14 years ago

Reply to Henrik Ingo

Henrik,
Why do people use MMM if it doesn’t work? Because the only other solution is Flipper, and it too had issues. I mean, I had many customers with a hung MMM blocking their production. I submitted patches to the MMM LP project but they have never even been acknowledged. The MMM code is also terrible to follow and trace. Managing replication in a distributed environment is surprisingly challenging, and that is why using Pacemaker is so great: all the aspects of dealing with inter-node communication are handled. Also, Pacemaker is incredibly powerful and flexible. If you deal with distributed computing you should know how to use it; it is a life saver.

Florian Haas

14 years ago

Henrik, what on earth makes you think the “internal xml based configuration format” is “official”? Yes we all know that due to a misguided release decision back in Heartbeat 2 (that was what? 5 years ago?) a predecessor of this actually required you to edit XML to retrieve or modify the configuration.

But one of the core changes to Pacemaker when it was spun off from Heartbeat, in 2008, was the introduction of the crm shell as the preferred way to manage the cluster configuration. The crm shell syntax is no less “official” than the underlying XML, which you do not need to touch. Ever.

See http://theclusterguy.clusterlabs.org/post/178680309/configuring-heartbeat-v1-was-so-simple.

What’s next? “Linux is crap for text editing” because way-back-when there was only ed? Please.

Andrew Beekhof

14 years ago

I can officially say that the shell is no less official than writing raw XML.
We chose XML for the CIB so that machines could easily read it, users were never intended to see the XML

There was always supposed to be a GUI or CLI sitting on top, it just took a few more years than we intended for them to get written.

Henrik Ingo

14 years ago

Thank gods we now have vi and emacs, the easy to use text editors 🙂 (It won’t let me quit, it won’t let me quit… Please if you tell me how to get out of this I promise I will never use it again…)

I’m thinking the internal xml configuration format is the official one because that’s what is actually used internally. crm converts to that. There are things you can do with the internal xml and half-xml that you cannot do with the crm shell. Yes, I did have to use them when I wrote my own MySQL agent last summer. (It’s true the person who’s going to use my agent doesn’t need to understand that.)

But what really worries me is that I’m not 100% convinced all the Pacemaker developers, who write the internal code, will understand the xml configuration stuff and I fear they will fumble because it’s difficult and unintuitive. You of course will have no problems, but the average dev will.

Btw, when I go to clusterlabs.org and click Explore | Reference documentation, the first document in that list uses the xml stuff to do something. Yes, I started reading from the top, silly me. When that didn’t satisfy me, I then, for whatever reason, jumped instead to read the 1.0 documentation “Pacemaker Explained”. Does it advise me to use the xml notation? Well, yes it does.

If you want people to use the crm way, you can make it more explicit. (But in my case I’m happy I did read about the xml syntax since I really used it for one line of code eventually.)

For the record, I did learn Symbian. I get stuff done with autotools even if I’m sure I will never understand it. I did write a Pacemaker mysql agent some months ago. (You would hate me if I told you what I made it do 🙂 But I’m also saying these technologies are more difficult than they should be.

Henrik Ingo

14 years ago

Florian: Btw, just want to say for the record: Using Pacemaker and DRBD for MySQL HA, while still perhaps difficult to setup, does work correctly and people use it for good reason.

For using Pacemaker for MySQL replication I did not yet see that there would be an agent that I would actually trust to do all the correct steps in a failover situation, and they certainly would not be capable of handling more than a 2 node system (ie they couldn’t do what MHA does). But the design you explained in London does sound correct, so what Yves publishes here – while I haven’t reviewed any of it – should be an awesome improvement. (Still not easy to setup, but at least worth the trouble perhaps 🙂

Florian Haas

14 years ago

Henrik, sorry, you’re just not making sense. You’re complaining about the fact that a piece of infrastructure allows you access to its internal configuration syntax. So what? You don’t have to use it. The fact that it uses XML internally doesn’t make that the “official” interface. Is the “official” view on a btrfs filesystem the internal B-tree structure, or is it perhaps file handles and inodes and everything else that makes it a POSIX compliant filesystem? Is the “official” means of access to an RDBMS the internal storage engine implementation, or is it maybe SQL?

“But what really worries me is that I’m not 100% convinced all the Pacemaker developers, who write the internal code, will understand the xml configuration stuff and I fear they will fumble because it’s difficult and unintuitive. You of course will have no problems, but the average dev will.”

Huh? “The internal code” is well below the XML layer.

Henrik Ingo

14 years ago

Look, if you want you can take my previous comment as feedback on clusterlabs.org usability. I went to read your documentation. I read about the xml configuration syntax. I didn’t like it. (And it seems you don’t either.) I didn’t read about crm.

Improve the website so people read what you want them to read.

Marcus Bointon

14 years ago

I didn’t twig that pacemaker had anything to do with crm until this thread! I’m using heartbeat/crm for a redundant web front-end, managing a traditional floating IP. When I set it up I did find it really cryptic, especially since there didn’t seem to be any config file at all, depending entirely on some distributed database with no simple on-disk representation. That also meant that it didn’t seem possible to pre-configure it then bring it up in a working state – I had to bring it up broken, then configure it interactively. Now that it’s up and running, it’s working fine, but getting it there wasn’t easy or comprehensible, and that was only to manage a simple IP address on two nodes! I don’t think I’d want to attempt anything more complex with it, at least not without a lot of time. Also the crm_mon application doesn’t work properly under anything but bash (I prefer zsh) and often shows confusing information about things that should be absolutely clear-cut (this node is up, this node is down, node x is holding resource y etc). It might be completely brilliant of course (and I doubt Percona would have chosen it otherwise), but it’s not obvious.

FWIW, I really like mmm – it’s been working really well for me (1.x originally installed for me by Percona, and I’ve done several 2.x deploys since then), coping beautifully with all kinds of weird network and incidents, allowing downtime-free upgrades and more.

Andrew Beekhof

14 years ago

Marcus:

Actually there is an on-disk representation (also in XML).
We don’t encourage people to modify it directly because most of the time this will not achieve the intended result, but if you’re careful it is possible to do safely.

With the caveat that you appear to be running a pre-Pacemaker (ie. very old) version of our software, I’d have thought crm_mon already did a reasonable job of showing “node x is holding resource y”, perhaps you’d prefer the output with -n instead?

Also, please file a bug with details on the crm_mon/zsh issue. This is the first I’ve heard of it.

Henrik:

Could you clarify who you mean by “the Pacemaker developers, who write the internal code”?
Do you refer to the people writing the resource agent scripts that Pacemaker uses or to the authors of Pacemaker itself?

To your comment that “If you want people to use the crm way, you can make it more explicit”, I would make two points:

– People should use whatever they are comfortable with.

There are multiple graphical and command-line options for configuring the cluster.
The proper thing for the project to do is make the options known, not to dictate a specific tool that admins must use.

Having said that, the first document listed under the Explore tab is “Clusters from Scratch” which does not use XML.
That you jumped ahead to the reference material and clicked on the first alphabetically sorted document should not imply much about our relative preference for either configuration method.

– The purpose of “Pacemaker Explained” is very different from documents like “Clusters from Scratch”.

One does not negate the other. The first (XML, reference) needs to exist so that shells and GUIs can be written; this is the API that Pacemaker itself commits to, and it details all the available options and configuration constructs. The second (crm, howto) is needed because XML is hard to read.

I would also point out that “Pacemaker Explained” does /not/ advise you “to use the xml notation”. It only says that the shell syntax is not within the scope of that particular document.

Henrik Ingo

14 years ago

By “Pacemaker developers” I suppose I mean both of those. My point here was that, in my opinion, the xml format wasn’t entirely intuitive, i.e. the way some xml attributes map to real life objects wasn’t easy for me to remember. So it makes me wonder if even the average Pacemaker and/or Corosync developer fully understands them, or is at risk of making mistakes due to not remembering some exception or interpretation of some configuration parameter. For instance I remember there was something that if I set unique="1" it will trigger a restart of the resource (as a side effect, it seems?). Why? If I wanted to restart a resource I would want to say something like restart="1".

Anyway, I suppose what it really comes down to is that if you did usability testing along the lines of a person unfamiliar with pacemaker going to clusterlabs.org to learn how to set it up, I’m afraid the results would be poor – using myself as an example, I didn’t end up reading the document that Florian at least feels is the recommended one, which resulted in a poor experience.

Even there I’m being generous, if you really want to make a usability test you should ask people to first search for “Pacemaker documentation” and then learn to set it up. When “clusterlabs.org” shows up in search results, it’s not at all obvious that is what I’m looking for…

I admit that on the Explore page “Clusters from Scratch” comes before reference documentation. But the title made me think of “Linux from Scratch”, so not the user friendly easy documentation I was looking for. “HowTo Guides” made me hopeful but that goes to a wiki page about contributing code to the project and such. So really “Reference documentation” was the only one that looked like documentation, and there Pacemaker Explained is what seems to define the essence of Pacemaker. The Pacemaker project should think about this: how should people use Pacemaker easily, and does the web page support that experience or not?

Andrew Beekhof

14 years ago

> My point here was that, in my opinion, the xml format wasn’t entirely intuitive

See above, it wasn’t meant to be.

> So it makes me wonder if even the average Pacemaker and/or Corosync developer fully understands them,
> or is at risk of making mistakes due to not remembering some exception or interpretation of some configuration parameter.

We have approximately 500 automated regression tests to avoid relying on people’s memory.

> For instance I remember there was something that if I set unique="1" it will trigger a restart of
> the resource (as a side effect, it seems?). Why? If I wanted to restart a resource I would want
> to say something like restart="1".

1. This is something that came from the OCF standard, not Pacemaker.
2. It means more than just “restart”, it means “never have two things in the cluster with the same value for this”

For the rest, I disagree but you are of course entitled to your opinion, especially if you back it up with contributions.
People are always invited to contribute to the project and improve the things, such as the website and documentation, that they consider lacking.

Marcus Bointon

14 years ago

Andrew,

I’m just setting up another heartbeat/CRM installation and ran into the zsh issue I mentioned again. A bit of rummaging led me to this post which describes the problem: http://oss.clusterlabs.org/pipermail/pacemaker/2010-September/007645.html

Short version: crm uses shell options that only work with bash.

The server I was working on had root’s shell set to zsh; changing it to bash fixed crm (using ‘sudo -i’ from a user account, as that article suggests), but of course left me with bash as the default shell.

I don’t know if this is fixed in later versions of heartbeat – I’m using the stock package for Ubuntu Lucid.

Andrew Beekhof

14 years ago

Marcus:
That mail had a couple of work-arounds listed, did you try them?
Perhaps ask Ubuntu to include the patch listed in the bugzilla.

If you’re still having problems, please consider contacting the mailing list with the output and errors.

Florian Haas

14 years ago

Marcus, the issue you’re having is not related to Heartbeat, only Pacemaker. And the stock packages from Lucid are pretty dated at this point; here’s an updated PPA:

https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa

Marcus Bointon

14 years ago

Thanks for that Florian – I installed packages from that PPA and my zsh problems have gone away.

Florian Haas

14 years ago

Marcus, good to know. Everyone else, apologies that we took this thread off on an OT tangent; we can go back to MySQL integration in Pacemaker now. 🙂

Lars Fronius

14 years ago

Well, there is Galera’s synchronous multi-master replication around, which works quite nicely. Is there any approach to build an RA for this? It would be quite nice to see, because you don’t need to keep struggling with asynchronous replication, which can break your cluster when you want to migrate your cluster’s master IP.

Florian Haas

14 years ago

Lars, feel free to either write one, or integrate Galera replication with the existing MySQL RA. http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html is the OCF resource agent developer’s guide, and linux-ha-dev is the correct mailing list to post to if you’re looking for help.

Henrik Ingo

14 years ago

Lars: Since Galera monitors its own state, and takes all the actions necessary upon node or network failures, there is little need left for an external cluster manager.

Mainly, if you like to use virtual IPs (which are not at all necessary with galera, but could be convenient for some operational tasks depending on how you are used to doing things) you could still use Pacemaker to move VIPs around, even if not Galera. You would then have to write some Pacemaker functionality that is aware of Galera, even if not managing Galera itself.

The other thing Pacemaker could be used for would be to restart a galera node after it has crashed. Given that mysqld_safe already does that, I consider using Pacemaker for that purpose completely overkill.

Florian Haas

14 years ago

> The other thing Pacemaker could be used for would be to restart a galera node after it has crashed. Given that mysqld_safe already does that, I consider using Pacemaker for that purpose completely overkill.

Henrik: that suggestion is completely misguided. Yves: can you please change your config snippet to include binary=/usr/sbin/mysqld. In a Pacemaker cluster, it should be Pacemaker that takes care of resource monitoring and recovery. Thanks.

Lars Fronius

14 years ago

I think Pacemaker would be useful for Galera when it comes to provisioning new nodes. One node in the cluster becomes the donor and transfers its state. I would want Pacemaker to move a VIP away from that (donor) node. You could also choose a load balancer, which then automatically disables this IP.

Author

Yves Trudeau

14 years ago

Lars, Galera is an interesting product for sure but it is far from a one size fits all. There are many cases where normal replication is better.

Author

Yves Trudeau

14 years ago

Florian, mysqld_safe does more than just restarting MySQL; it also sets the ulimits and redirects logs to syslog. For my part, I found it very convenient to have binary=/usr/bin/mysqld_safe.

Henrik Ingo

14 years ago

Lars: Exactly. You need some sort of failover or load balancing also with Galera. VIPs are not your only option and imo not even the best option, but they are simple and well understood. If you want to use VIPs then you need something that will move them around.

If you use Galera and don’t use VIPs, then using Pacemaker purely as a mysqld_safe replacement would be pure folly.

Yves: I don’t have a lot of field experience there, but I would consider Florian’s advice. If Pacemaker sees that MySQL is not responding, it will try to restart something. Will it then try to restart mysqld or mysqld_safe? And will mysqld_safe simultaneously try to restart something? It sounds like you’re in for a mess… Better to move the functionality provided by mysqld_safe into the pacemaker agent and let pacemaker control everything.

Andreas Stallmann

14 years ago

Hi!

I’m using MySQL in a “classical” setup with Pacemaker and DRBD for failover and replication. I stumbled over your solution while looking for an alternative, which works with the built-in MySQL replication AND Pacemaker. Nice work!

Still, the aspect of asynchronous replication is something that worries me a bit. I read that there are already synchronous replication mechanisms out there for MySQL, especially Continuent Tungsten Replicator (see http://www.continuent.com). Couldn’t one weld together your approach for high availability and Continuent’s replication solution? Just a thought (probably based on misunderstood information)…

Cheers and good bye,

Andreas

Henrik Ingo

14 years ago

Andreas: Tungsten is also asynchronous. Possibly what you are referring to is Galera, which is synchronous. Please see this more recent Percona blog for more info on that one: http://www.facebook.com/profile.php?id=665178886

Mark Stunnenberg

14 years ago

Nice article!

I’m trying to get this working in my test setup, though, and I’m having problems.
I have 3 servers; all have the standard MySQL databases set up as by mysql_install_db. I created a replication client on all 3, reset the masters, etc. I stopped all mysqld instances and used your config to get this running, but I’m stuck at an error that shows up in the logs:

“Jan 19 13:01:56 tabit mysql[818]: ERROR: /usr/lib/ocf/resource.d//heartbeat/mysql: 1313: -q: not found”

No resources are assigned to any machine.

status keeps at:
Online: [ tabit meissa toucan ]

Failed actions:
p_mysql:1_start_0 (node=toucan, call=20, rc=1, status=complete): unknown error
p_mysql:0_start_0 (node=tabit, call=20, rc=1, status=complete): unknown error
p_mysql:0_start_0 (node=meissa, call=20, rc=1, status=complete): unknown error

Any ideas?

Andreas Stallmann

14 years ago

Hi!

@Yves Trudeau:

I currently have a setup (MySQL on DRBD with Pacemaker) where my applications read and write to the same VIP and I can’t change that (and can’t force our dev team to change it). Does your configuration work with only one VIP, too? If yes, what would the appropriate crm setup look like?

Thanks for your good work,

Andreas

Author

Yves Trudeau

14 years ago

Reply to Andreas Stallmann

Hi Andreas,
It is easy to use only the write_vip, the reader_vips are not mandatory at all.

Andreas Stallmann

14 years ago

Just another question. I receive the error

WARNING: p_mysql: action monitor_Slave_0 not advertised in meta-data, it may not be supported by the RA

when committing the config. Additionally, the following error shows in crm_mon:

Failed actions:
p_mysql:1_monitor_0 (node=int-ipfuie-mgmt01, call=36, rc=5, status=complete): not installed
p_mysql:0_monitor_0 (node=int-ipfuie-mgmt02, call=49, rc=1, status=complete): unknown error
p_mysql:0_stop_0 (node=int-ipfuie-mgmt02, call=57, rc=1, status=complete): unknown error

Any suggestions?

Cheers,

Andreas

Author

Yves Trudeau

14 years ago

Reply to Andreas Stallmann

@Andreas,
Have you installed the mysql RA from GitHub? The default one that comes with many distributions will not work.

Andreas Stallmann

14 years ago

Hi again,

I still see

Master/Slave Set: ms_MySQL
p_mysql:0 (ocf::heartbeat:mysql): Slave int-ipfuie-mgmt01 (unmanaged) FAILED
p_mysql:1 (ocf::heartbeat:mysql): Slave int-ipfuie-mgmt02 (unmanaged) FAILED

and

p_mysql:0_monitor_0 (node=int-ipfuie-mgmt01, call=35, rc=1, status=complete): unknown error
p_mysql:0_stop_0 (node=int-ipfuie-mgmt01, call=37, rc=1, status=complete): unknown error
p_mysql:0_monitor_0 (node=int-ipfuie-mgmt02, call=24, rc=1, status=complete): unknown error
p_mysql:0_stop_0 (node=int-ipfuie-mgmt02, call=26, rc=1, status=complete): unknown error

There are no obvious errors in /var/log/messages. Any ideas where to look?

By the way: If this is not the right place to ask such questions, please redirect me to a more appropriate site/forum/mailing list.

Thanks,

Andreas

Andreas Stallmann

14 years ago

…and just another thought:

Because you did not mention the prerequisites for your setup, I did it according to

http://dev.mysql.com/doc/refman/5.1/en/replication-howto.html and http://dev.mysql.com/doc/refman/5.1/de/replication-howto.html

Anything wrong with that? Do you perhaps rely on a “clean” setup without the nodes being preconfigured for replication?

Cheers,

Andreas

PS: You were right with your last comment; I forgot to copy the agent to one of my nodes.

Florian Haas

14 years ago

Andreas, the Pacemaker mailing list is the best option to discuss configuration issues. http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Andreas Stallmann

14 years ago

One possible alteration to your resource script (for pacemaker-1.1.2.1 under OpenSuSE 11.3):

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d//heartbeat/}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

The paths you provided did not work. I don’t know if this is relevant to other distributions.

Cheers,

Andreas

Lars Fronius

14 years ago

There was a change made when the RHCS resource agents were merged into Pacemaker. The path in versions of resource-agents before that merge was
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d//heartbeat/}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs
, for later versions it is
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
You just have to adapt it by hand…
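
Instead of editing the path by hand, the agent could probe for both layouts. This is only a minimal sketch, not part of the published agent: locate_ocf_shellfuncs is a hypothetical helper, and it assumes the two file locations quoted above are the only ones that need checking.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the OCF shell function library
# found under the given OCF root, or return 1 if neither layout exists.
locate_ocf_shellfuncs() {
    root="$1"
    # Newer layout first, then the older pre-merge layout.
    for candidate in \
        "${root}/lib/heartbeat/ocf-shellfuncs" \
        "${root}/resource.d/heartbeat/.ocf-shellfuncs"
    do
        if [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Possible use inside an agent (assumes OCF_ROOT defaults to /usr/lib/ocf):
#   . "$(locate_ocf_shellfuncs "${OCF_ROOT:-/usr/lib/ocf}")"
```

Checking the newer location first means an upgraded system that still has a stale old file lying around picks up the current library.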

Andreas Stallmann

14 years ago

I found a possible bug in the script. When I call it with ocf-tester, I get:

mysql[31420]: ERROR: /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs: line 332: -q: command not found

Indeed, in line 889 of your script I read:

ocf_run -q $MYSQL $mysql_options

whereas ocf_run is called without -q everywhere else in your script.
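
The "-q: command not found" message is what a shell prints when a wrapper executes its arguments verbatim and the first argument is -q. The sketch below reproduces that in isolation; run_plain is a hypothetical stand-in, and the assumption that the installed ocf_run does no option parsing is a guess, not something confirmed from its source.

```shell
#!/bin/sh
# Hypothetical stand-in for a wrapper without option parsing:
# it executes its arguments as a command and returns that exit status.
run_plain() {
    "$@"
}

# With a leading -q, the shell looks for a program literally named "-q".
status=0
run_plain -q true 2>/dev/null || status=$?
echo "exit status with -q: ${status}"

# Without the option, the same call succeeds.
status=0
run_plain true || status=$?
echo "exit status without -q: ${status}"
```

If that assumption holds, either dropping -q on that line or using a resource-agents version whose ocf_run accepts the option should make the error go away.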

Secondly, the ocf-tester reports:

ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)

This happens, although I provided a password for “test_user=root”. Could it be, that the password is not read from the OCF_RESKEY_test_passwd-Parameter but instead still uses the default (which is empty)?
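
One way to narrow this down is to check how a default assignment of the kind OCF agents commonly use behaves. This is a sketch under assumptions: the empty default and the resolve_test_passwd helper are hypothetical, and the agent's actual handling of OCF_RESKEY_test_passwd may differ.

```shell
#!/bin/sh
# Hypothetical default, mirroring the empty default password described above.
OCF_RESKEY_test_passwd_default=""

# Hypothetical helper: keep the value from the environment if the variable
# is set, otherwise fall back to the default, then print the result.
resolve_test_passwd() {
    : "${OCF_RESKEY_test_passwd=${OCF_RESKEY_test_passwd_default}}"
    printf '%s\n' "$OCF_RESKEY_test_passwd"
}
```

If a check like this prints an empty line inside the agent even though a password was supplied, the parameter is probably not reaching the agent's environment. With ocf-tester, parameters are normally passed as -o name=value, so it is worth confirming that test_passwd was given that way.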

Cheers,

Andreas

Florian Haas

14 years ago

Andreas, when you test this agent, please rebuild the resource-agents package from upstream git, or at least get a reasonably recent one that your distro may ship. Don’t expect to be able to drop this agent into an age-old install.