Percona XtraDB Cluster: Failure Scenarios with only 2 nodes

When designing a new cluster, it is always advised to have at least 3 nodes (this is true for PXC, but the same applies to PRM). But why, and what are the risks?

The goal of having more than 2 nodes (in fact, an odd number is recommended for this kind of cluster) is to avoid a split-brain situation, which can occur when quorum (which can be simplified as "majority vote") is not honoured. A split-brain is a state in which the nodes lose contact with one another and each then tries to take control of shared resources or to provide the cluster service simultaneously.
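The quorum arithmetic can be sketched as follows. This is a simplification, not actual Galera code (Galera also supports weighted quorum and excludes nodes that left gracefully), but it shows why a 2-node cluster is fragile: when the link breaks, each half only sees 1 node out of 2, which is not a majority, so both sides drop out of the Primary component.

```shell
# Simplified quorum ("majority vote") check -- an illustration, not Galera's
# real algorithm.
cluster_size=2          # nodes in the last known cluster membership
visible=1               # nodes this partition can still reach (itself only)

quorum=$(( cluster_size / 2 + 1 ))   # strict majority: 2 for a 2-node cluster

if [ "$visible" -ge "$quorum" ]; then
    echo "Primary"       # this partition keeps serving the cluster
else
    echo "non-Primary"   # with 2 nodes, each half sees 1 < 2: both go non-Primary
fi
```

With 3 nodes the surviving pair still sees 2 >= 2 and stays Primary, which is the whole point of the recommendation.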
On PRM the problem is obvious: both nodes will try to bring up the master and slave IPs and will accept writes. But what could happen with Galera replication on PXC?

OK, first let's have a look at a standard PXC setup (no special Galera options) with 2 nodes:

Two running nodes (percona1 and percona2), communication between the nodes is OK.

Same output on percona2
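The check above can be reproduced with something like the following (hostnames and credentials are placeholders, and the exact values shown are only what you would typically expect on a healthy cluster):

```shell
# Run on percona1, then again on percona2 -- the output should match:
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster%';"
# On a healthy 2-node cluster you would typically see:
#   wsrep_cluster_size     2
#   wsrep_cluster_status   Primary
#   wsrep_cluster_conf_id  (same value on both nodes)
```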

Now let’s check the status variables:

on percona2:

Only wsrep_local_index differs, as expected.
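This is normal: wsrep_local_index is the node's position inside the cluster, so it is the one status variable expected to differ between the two nodes. A quick way to compare it (illustrative command, run on each node):

```shell
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_index';"
# percona1 might report 0 and percona2 report 1 (or vice versa)
```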

Now let’s stop the communication between both nodes (using firewall rules):

This rule simulates a network outage that makes connections between the two nodes impossible (e.g. a switch or router failure).
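A minimal sketch of such a rule, assuming hypothetical addresses (percona1 = 192.168.70.2, percona2 = 192.168.70.3; Galera group communication uses TCP port 4567 by default, but dropping all traffic from the peer is the simplest way to simulate a dead link):

```shell
# Run on percona1: silently drop everything to/from percona2
iptables -A INPUT  -s 192.168.70.3 -j DROP
iptables -A OUTPUT -d 192.168.70.3 -j DROP
```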

We can see that the node appears down, but we can still run some statements on it:

on node1:

on node2:
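The checks on each node can be sketched like this (illustrative command; the values are what you would typically see once each node has declared its peer dead and, having lost quorum, moved to the non-Primary state):

```shell
# Run on percona1 and on percona2:
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
          ('wsrep_cluster_size','wsrep_cluster_status','wsrep_ready');"
# Typically, on both nodes (each one is now alone in its own partition):
#   wsrep_cluster_size     1
#   wsrep_cluster_status   non-Primary
#   wsrep_ready            OFF
```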

And if you try to use the MySQL server:
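A node in the non-Primary state refuses regular queries. A sketch of what this looks like (table name is hypothetical; the exact error text varies by version):

```shell
mysql -e "SELECT * FROM test.t1;"
# Typically rejected with something like:
#   ERROR 1047 (08S01): WSREP has not yet prepared node for application use
# (older releases printed a generic "Unknown command" with the same error code)
```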

If you try to insert data right when the communication problem occurs, here is what you will get:
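A sketch of that race (hypothetical table, and the error is only what you may typically observe, as timing and version both matter):

```shell
mysql -e "INSERT INTO test.t1 VALUES (1);"
# If the statement is in flight exactly when the node drops out of the
# Primary component, the write cannot be certified cluster-wide and fails;
# depending on timing you may get a deadlock-style error (ERROR 1213) or
# the same ERROR 1047 as above.
```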