November 23, 2014

Percona XtraDB Cluster: Failure Scenarios with only 2 nodes

When designing a new cluster, it is always advised to have at least 3 nodes (this is the case with PXC, but it is also true for PRM). But why, and what are the risks?

The goal of having more than 2 nodes (in fact, an odd number is recommended for this kind of cluster) is to avoid a split-brain situation. This can occur when quorum (which can be simplified as a “majority vote”) is not honoured. A split-brain is a state in which the nodes lose contact with one another and then each tries to take control of shared resources or to provide the cluster service simultaneously.
With PRM the problem is obvious: both nodes will try to run the master and slave IPs and will accept writes. But what could happen with Galera replication on PXC?

OK, first let’s have a look at a standard PXC setup (no special Galera options) with 2 nodes:
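The exact configuration does not matter much for what follows, but as a minimal sketch (the IP addresses are the ones used later in this post; the cluster name and SST method are arbitrary choices), each node carries something like this in my.cnf:

[mysqld]
# path to the Galera library may differ on your system (e.g. /usr/lib64/libgalera_smm.so)
wsrep_provider           = /usr/lib/libgalera_smm.so
wsrep_cluster_name       = pxc_test
wsrep_cluster_address    = gcomm://192.168.70.2,192.168.70.3
wsrep_node_address       = 192.168.70.2   # 192.168.70.3 on percona2
wsrep_sst_method         = xtrabackup
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2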

Two running nodes (percona1 and percona2); communication between the nodes is OK

Same output on percona2

Now let’s check the status variables:
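On each node they can be listed with queries like these (only the output is node-specific):

SHOW GLOBAL STATUS LIKE 'wsrep_cluster%';
SHOW GLOBAL STATUS LIKE 'wsrep_local_index';
SHOW GLOBAL STATUS LIKE 'wsrep_ready';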

on percona2:

Only wsrep_local_index differs, as expected.

Now let’s stop the communication between both nodes (using firewall rules):
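For example, the rule used again further below in this post, applied on percona2 (assuming percona1 is 192.168.70.2 and percona2 is 192.168.70.3):

iptables -A INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT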

This rule simulates a network outage that makes connections between the two nodes impossible (e.g. a switch or router failure).

We can see that the node appears down, but we can still run some statements on it:

on node1:

on node2:

And if you try to use the MySQL server:

If you try to insert data just as the communication problem occurs, here is what you will get:
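For illustration, assume a small hypothetical test table test.t (say, an auto-increment id plus a msg column) created before the outage; trying to write to it now:

INSERT INTO test.t (msg) VALUES ('during the split');
-- this fails: first with a deadlock error while the partition is still being handled,
-- then with an "unknown command" error once wsrep_ready has gone to OFF
-- (both errors are discussed in the comments below)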

Is it possible to have a two-node cluster anyway?

So, by default, Percona XtraDB Cluster does the right thing; this is how it needs to work, and you don’t suffer critical problems when you have enough nodes.
But how can we deal with this and avoid the service being stopped? If we check the list of parameters on Galera’s wiki, we can see that there are two options relating to this:

pc.ignore_quorum: Completely ignore quorum calculations. E.g. in case master splits from several slaves it still remains operational. Use with extreme caution even in master-slave setups, because slaves won’t automatically reconnect to master in this case.
pc.ignore_sb: Should we allow nodes to process updates even in the case of split brain? This is a dangerous setting in multi-master setup, but should simplify things in master-slave cluster (especially if only 2 nodes are used).
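Both are Galera provider options, so they are passed as semicolon-separated key=value pairs inside wsrep_provider_options; for example, to enable both at once (not that you should):

wsrep_provider_options = "pc.ignore_quorum = true; pc.ignore_sb = true"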

Let’s first try ignoring the quorum:

By ignoring the quorum, we ask the cluster not to perform the majority calculation used to define the Primary Component (PC). A component is a set of nodes which are connected to each other, and when everything is OK the whole cluster is one component. For example, if you have 3 nodes and 1 node gets isolated (2 nodes can see each other and 1 node can see only itself), we then have 2 components: the quorum calculation gives 2/3 (66%) on the 2 nodes communicating with each other and 1/3 (33%) on the isolated one. In that case the service is stopped on the nodes where the majority is not reached. The quorum algorithm helps to select a PC and guarantees that there is never more than one Primary Component in the cluster.
In our 2-node setup, when the communication between the 2 nodes is broken, the quorum is 1/2 (50%) on both nodes, which is not a majority… therefore the service is stopped on both nodes. In this context, service means accepting queries.
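You can check on each node which component it ended up in: wsrep_cluster_status reports Primary inside the PC and non-Primary on a node that lost quorum:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';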

Back to our test, we check that the data is the same on both nodes:
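Any simple comparison will do; for example, with the hypothetical test.t table from above, run on both nodes and compare the results:

SELECT COUNT(*) FROM test.t;
CHECKSUM TABLE test.t;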

Add ignore_quorum and restart both nodes:

SET GLOBAL wsrep_provider_options="pc.ignore_quorum=true"; (this currently does not seem to work properly; I needed to set it in my.cnf and restart the nodes)

wsrep_provider_options = "pc.ignore_quorum = true"
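This goes in the [mysqld] section of my.cnf on both nodes; then restart them, for example with the init script (assumed here to be called mysql, as in the PXC packages):

service mysql restart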

Break the connection between the two nodes again:

iptables -A INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT

and perform an insert; this first insert will take longer:

and on node2 you can also add records:
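For example, still with the hypothetical test.t table:

-- on node1 (the first statement after the split takes noticeably longer)
INSERT INTO test.t (msg) VALUES ('written on node1');
-- on node2
INSERT INTO test.t (msg) VALUES ('written on node2');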

The wsrep status variables now look like this:
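A few of the relevant ones can be checked on each node; some of them, e.g. wsrep_last_committed, will now differ between the nodes, since each one is committing its own writes:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';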

Then we fix the connection problem:

iptables -D INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT

Nothing changes; the data stays different:

You can keep inserting data, but it won’t be replicated, and you will end up with 2 different versions of your inconsistent data!

Also, when we restart a node, it will request an SST or, in certain cases, fail to start like this:

Of course all the data that was written on the node we just restarted is lost after the SST.

Now let’s try with pc.ignore_sb=true:

When the quorum algorithm fails to select a Primary Component, we have a split-brain condition. In our 2-node setup, when a node loses the connection to its only peer, the default is to stop accepting queries to avoid database inconsistency. We can bypass this behaviour and ignore the split-brain by adding

wsrep_provider_options = "pc.ignore_sb = true" in my.cnf

Then we can insert on both nodes without any problem while the connection between the nodes is down:

When the connection is back, the two servers stay independent; they are now two single-node clusters.

We can see it in the log file:

Now, when we set both settings to true on a two-node cluster, it acts exactly as it does when only ignore_sb is enabled.
And if the local state seqno is greater than the group seqno, the node fails to restart. You again need to delete the grastate.dat file to request a full SST, and you lose some data again.
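A sketch of that recovery, assuming the default datadir /var/lib/mysql and the mysql init script:

# discard the local state so the node requests a full SST on startup
# (everything written locally on this node since the split is lost)
rm /var/lib/mysql/grastate.dat
service mysql start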

This is why two-node clusters are not recommended at all. If you only have the storage for 2 nodes, using the Galera arbitrator (garbd) is a very good alternative.

On a third node, instead of running Percona XtraDB Cluster (mysqld) just run garbd:

Currently there is no init script for garbd, but that is easy to write, since it can run in daemon mode using -d.
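A minimal invocation could look like this (the cluster name and addresses match the assumptions made earlier in this post; check garbd --help for the options shipped with your version):

garbd -a gcomm://192.168.70.2,192.168.70.3 -g pxc_test -d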

If the communication between node1 and node2 fails, they will keep communicating through the node running garbd (node3) and eventually send the changes via it. If one node dies, the other one keeps working without any problem, and when the dead node comes back it performs an IST or SST.

In conclusion: a 2-node cluster is possible with Percona XtraDB Cluster, but it is not advised at all, because it will generate a lot of problems in case of an issue on one of the nodes. It is much safer to add a 3rd node, even a fake one running garbd.

If you plan to run a cluster with only 2 nodes anyway, don’t forget that:
– by default, if one peer dies or if the communication between the two nodes becomes unstable, both nodes will stop accepting queries
– if you choose to ignore split-brain or quorum, you can very easily end up with inconsistent data

About Frederic Descamps

Frédéric joined Percona in June 2011. He is an experienced Open Source consultant with expertise in infrastructure projects as well as in development tracks and database administration.

Frédéric is a believer in the devops culture.

Comments

  1. Fred,

    I see that when you’re trying to work with node1 you seem to get different error messages, “deadlock” and “unknown command”. Why is this? Wouldn’t it be more practical if the same error message were used for all cases when the cluster is in a state where it can’t run commands?

  2. Peter: deadlocks are common in Galera replication (for example when you have highly concurrent writes and you perform them on several nodes). In this case it returns a deadlock because the split-brain handling is not yet finished, and to avoid inconsistent data the transaction cannot be committed on the second node. The “unknown command” error is returned when the node is set to no longer allow queries, i.e. the value of wsrep_ready is OFF. Generally the load balancer checks that status and doesn’t send any connections to that node, so the application shouldn’t have to cope with that error.

    So to summarize, the deadlock occurs just because the cluster hasn’t yet finished taking all the nodes “offline”.

  3. Asko Aavamäki says:

    Nice rundown!

    I wonder if another article could demonstrate how that two-node cluster would perform during the IST/SST mentioned at the end, after the other node rejoins the cluster. Would this show that the node that remained operational gets blocked as well in order to serve the SST request to the node that is returning to the cluster, making the whole DB inoperational?

    Maybe the effect could be demonstrated both for mysqldump and rsync SSTs (or even XtraBackup if that could be used to achieve non-blocking in that scenario!)

    The issue is noted in the documentation and is the other big reason for having a minimum of three nodes, in addition to split brain etc., but I think it’d be good to see what that type of situation looks like in the logs, so if someone runs into it, it’s less of a mystery.

  4. Asko,

    thank you for the idea, I’ll try to find some time to write such post.

    cheers,

  5. Balasundaram Rangasamy says:

    Nice one. I had a similar problem when I was working with only a 2-node cluster. This article nicely explains why we should not go for a 2-node cluster.

    When I tried to update a table on both nodes simultaneously, I got deadlock errors, “unknown command” errors, and the second node even became unavailable.

  6. REJECT doesn’t really simulate a failed network connection, as an ICMP reject will be sent. Using DROP is a better test. It probably doesn’t change the results… and a misconfigured router/switch might be more common than a failed network connection.

  7. Jalvin Trivedi says:

    If I am working with two nodes and I issue a query, and after some time node1 fails while the query is still not complete, what happens to the query? Does it keep running or does it restart?
