Achieving Disaster Recovery with Percona XtraDB Cluster

October 7, 2019

Author

Michael Patrick

Percona Software

Share this Post:

Disaster Recovery with Percona XtraDB Cluster One thing that comes up often from working with a variety of clients at Percona is “How can I achieve a Disaster Recovery (DR) solution with Percona XtraDB Cluster (PXC)?” Unfortunately, decisions are sometimes made with far-reaching consequences by individuals who often do not well understand the architecture and its limitations. As a Technical Account Manager (TAM), I am often engaged to help clients look for better solutions, or at least try to help them with it by mitigating as many issues as possible. Clearly, in a perfect world, we would like to get the right experts involved in these types of discussions to ensure more appropriate solutions, but we all know this is not a perfect world.

One such example involves the idea that if we take a PXC cluster and split it into two datacenters with two nodes in a primary datacenter and one node in a separate datacenter, we will have a hot standby node at all times. In this case, the application can be pointed to the third node in the event of something catastrophic. This sounds great…in theory. The problem is latency.

Latency can cripple a PXC cluster

By design, PXC is meant to work with nodes that can communicate with one another quickly. The underlying cluster technology, known as Galera, is considered “virtually synchronous” in nature. In this architecture, writesets are replicated to all active nodes in the cluster at transaction commit and will go into a queue. Next, each node performs a certification of the writeset which is deterministic in nature. A bug, notwithstanding, each node will either accept or reject the certification in the same manner. So, either the writeset is applied on all nodes or it is rolled back by all nodes. What matters in this discussion about the above is the write queue.

As writes come in on one of the nodes, the writesets are replicated to each of the other nodes. In the above three-node cluster, the writes are certified quickly with the two nodes in the same datacenter. However, the third node is located in a different datacenter some distance away. In this case, the writeset must travel across the WAN and will go into a queue (wsrep_local_recv_queue).

So, what’s the problem?

To ensure that one node does not get too far behind the rest of the cluster, any node can send a flow control message to the cluster. This instructs the cluster to stop replicating new events until the slow node catches up to within some number of writesets as defined by gcs.fc_limit in the configuration. Essentially, when the number of transactions in the queue exceeds the gcs.fc_limit, flow control messages will be sent and the cluster will stop replicating new writesets. Unless you have changed it, this will be 16 writesets.

Remember that PXC is virtually synchronous

When replication stops, all nodes stop accepting writes momentarily. In this event, the system seems to stall until the local recv queue makes some space for new writesets, at which point replication will continue. This can appear as a stall to the application and leads to huge performance issues.

So, what is a better solution to the above situation? It is preferable to utilize an asynchronous Slave server replicating from the PXC cluster for failover. This is the same replication that is built into Percona Server and not Galera. While this may include adding another server, this could be mitigated by the use of garbd. This process will act as an arbitrator to maintain quorum of the cluster and thus decrease the number of data nodes in PXC by running the lightweight garbd process on an app server or some other server in the environment. This keeps server count down in cases that are cost-sensitive.

The asynchronous nature means no sending of flow controls from the node in the remote datacenter. Replication will lag and catch up as needed with the PXC cluster none the wiser. Because all nodes in the PXC cluster are local to one another, ideally latency is minimal and writesets can be applied much more quickly and stalls minimized. Of course, this does come with a few challenges.

One challenge is that due to the nature of asynchronous replication, there can be significant lag in the DR node. This is not always an issue, however, as the only time you are using this server is during a disaster. The writes will be sent over immediately by the Master, so it is reasonable to expect that the DR node will eventually catch up and you hope to not lose any data, although there are no guarantees.

This brings us to another concern. Simple asynchronous replication has no guarantee of consistency like PXC provides. In PXC, there are controls in place to guarantee consistency, but asynchronous replication provides none. There are, therefore, cases where data drift can occur between Master and Slave. To mitigate this risk, you can use pt-table-checksum from the Percona Toolkit to detect inconsistency between the Master node of the PXC cluster and the Slave and rectify it with pt-table-sync from the same toolkit. This, of course, requires that you run this process often. If it is done as part of an automated process, it should also be monitored to ensure it is being done and whether or not data drift is occurring.

You will also want to monitor that the Master node does not go down, as there is no built-in process of failing the DR node over to a new Master within the PXC cluster. Our very own Yves Trudeau wrote a utility to manage this, and more information can be found here: https://github.com/y-trudeau/Mysql-tools/tree/master/PXC

Improving Performance

While this solution presents some additional complexity, it does provide for a more performant PXC cluster. As a TAM, I have seen geographically-distributed PXC result in countless incidents of a system down in production. Even when the system doesn’t go down, it often slows down due to the latency issues. Why take that performance impact on every transaction for a DR solution that you hope to never use when there is an alternative solution as has been proposed here? Instead, you maybe could benefit from an alternative approach that provides an acceptable failover solution while improving performance day in and day out.