Percona XtraDB Cluster: Quorum and Availability of the cluster
Stephane Combaudon
Percona XtraDB Cluster (PXC) has become a popular option to provide high availability for MySQL servers. However, many people still have a hard time understanding what happens to the cluster when one or several nodes leave it, gracefully or ungracefully. This is what we will clarify in this post.
Nodes leaving gracefully
Let’s assume we have a 3-node cluster and all nodes have an equal weight, which is the default.
What happens if Node1 is gracefully stopped (service mysql stop)? When shutting down, Node1 will inform the other nodes that it is leaving the cluster. We now have a 2-node cluster and the remaining members have 2/2 = 100% of the votes. The cluster keeps running normally.
What happens now if Node2 is gracefully stopped? Same thing, Node3 knows that Node2 is no longer part of the cluster. Node3 then has 1/1 = 100% of the votes and the 1-node cluster can keep on running.
In these scenarios, there is no need for a quorum vote as the remaining node(s) always know what happened to the nodes that are leaving the cluster.
Nodes becoming unreachable
On the same 3-node cluster with all 3 nodes running, what happens now if Node1 crashes?
This time Node2 and Node3 must run a quorum vote to determine whether it is safe to continue: they have 2/3 of the votes, 2/3 is > 50%, so the remaining 2 nodes have quorum and they keep on working normally.
Note that the quorum vote does not happen immediately when Node2 and Node3 can no longer reach Node1. It only happens after the ‘suspect timeout’ (evs.suspect_timeout), which is 5 seconds by default. Why? It makes the cluster resilient to short network failures, which can be quite useful when operating the cluster over a WAN. The tradeoff is that if a node crashes, writes are stalled for the duration of the suspect timeout.
Now what happens if Node2 also crashes?
Again a quorum vote must be performed. This time Node3 has only 1/2 of the votes: this is not > 50% of the votes. Node3 doesn’t have quorum, so it stops processing reads and writes.
If you look at the wsrep_cluster_status status variable on the remaining node, it will show NON_PRIMARY. This indicates that the node is not part of the Primary Component.
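The vote arithmetic from the scenarios above can be sketched in a few lines of Python. This is only an illustration of the majority rule, not Galera's actual code: the function name has_quorum and its parameters are made up for this example, and every node is assumed to have the default weight of 1.

```python
def has_quorum(reachable, previous, left_gracefully=0):
    """Return True if the reachable nodes hold a strict majority.

    reachable:       number of nodes that can still talk to each other
    previous:        size of the last known Primary Component
    left_gracefully: nodes that announced their departure, so they
                     no longer count in the total number of votes
    All names are hypothetical; each node is assumed to have weight 1.
    """
    return reachable > 0.5 * (previous - left_gracefully)


# Graceful shutdowns: the leaving node is removed from the vote total.
graceful_first = has_quorum(reachable=2, previous=3, left_gracefully=1)   # 2/2
graceful_second = has_quorum(reachable=1, previous=2, left_gracefully=1)  # 1/1

# Crashes: the unreachable node still counts in the vote total.
crash_first = has_quorum(reachable=2, previous=3)   # 2/3 of the votes
crash_second = has_quorum(reachable=1, previous=2)  # 1/2 of the votes

print(graceful_first, graceful_second, crash_first, crash_second)
```

Only the last case fails the test, because 1/2 is not strictly greater than 50%. That is the situation in which Node3 switches to NON_PRIMARY.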
Why does the remaining node stop processing queries?
This is a question I often hear: after all, MySQL is up and running on Node3 so why is it prevented from running any query? The point is that Node3 has no way to know what happened to Node2:
- Did it crash? In this case, it is safe for the remaining node to keep on running queries.
- Or is there a network partition between the two nodes? In this case, it is dangerous to process queries because Node2 might also process other queries that will not be replicated because of the broken network link: the result will be two divergent datasets. This is a split-brain situation, and it is a serious issue as it may be impossible to later merge the two datasets. For instance if the same row has been changed in both nodes, which row has the correct value?
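To make the divergence concrete, here is a toy Python model of the second case. It does not involve Galera at all: the two dictionaries stand in for the same row on Node2 and Node3, and the column name balance is invented for the example.

```python
# The same row, identical on both nodes before the network partition.
node2_row = {"id": 1, "balance": 100}
node3_row = dict(node2_row)

# During the partition, each node accepts a local write that is
# never replicated to the other side.
node2_row["balance"] -= 30
node3_row["balance"] += 50

# Once the link is back, the two copies disagree and neither node
# can tell which value is the correct one.
diverged = node2_row != node3_row
print(node2_row, node3_row, diverged)
```

Refusing queries on a node without quorum is what prevents the two copies from drifting apart like this.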
Quorum votes are not held because it’s fun, but only because the remaining nodes have to talk together to see if they can safely proceed. And remember that one of the goals of Galera is to provide strong data consistency, so any time the cluster does not know whether it is safe to proceed, it takes a conservative approach and it stops processing queries.
In such a scenario, the status of Node3 will be set to NON_PRIMARY and manual intervention is needed to re-bootstrap the cluster from this node by running:
SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
A side question: it is now clear why writes should be forbidden in this scenario, but what about reads? Couldn’t we allow them?
Actually this is possible from PXC 5.6.24-25.11 with the wsrep_dirty_reads setting.
Split-brain is one of the worst enemies of a Galera cluster. Quorum votes take place every time one or several nodes suddenly become unreachable, and they are meant to protect data consistency. The tradeoff is that they can hurt availability, because in some situations manual intervention is necessary to tell the remaining nodes that they can safely resume executing queries.