In this blog post, we will discuss the PXC Replication Manager script, which facilitates both source and replica failover when working with multiple PXC clusters across different data centers/networks connected via an asynchronous replication mechanism.

Such topologies emerge from requirements like database version upgrades, reporting or streaming for applications, separate disaster recovery or backup solutions, and multi-source data needs.

We will explore the usage further with the help of a demo, so let's dive into the practical scenario below.

Topology

Above, we have two separate PXC clusters that will be connected via async replication. Each cluster will act as both source and replica for the other.

Minimal configuration for PXC/Galera and async replication

DC1-1:

DC1-2:

DC2-1:

DC2-2:
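As an illustration (the actual listings will vary per node), a minimal my.cnf along these lines would be expected on each node, assuming PXC 8.0. The IP addresses, names, and server_id values below are placeholders for DC1-1; the other nodes differ only in their node-specific settings (server_id, wsrep_node_name, wsrep_node_address, and the DC2 cluster name/address).

[mysqld]
# Galera/PXC settings
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so
wsrep_cluster_name=DC1
wsrep_cluster_address=gcomm://10.0.1.11,10.0.1.12
wsrep_node_name=DC1-1
wsrep_node_address=10.0.1.11
wsrep_sst_method=xtrabackup-v2
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2

# Settings needed for the async/GTID replication between the clusters
server_id=111
log_bin=mysql-bin
log_slave_updates=ON
gtid_mode=ON
enforce_gtid_consistency=ON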


Bootstrap/PXC Ready

Both clusters should be bootstrapped separately, and the other PXC nodes started normally to sync via SST.

First Node:
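For example, with PXC 8.0 managed by systemd, bootstrapping the first node of each cluster would look like this:

$ systemctl start mysql@bootstrap.service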

Second/other Nodes:
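The remaining nodes can then be started normally and will join the cluster via SST/IST, for example:

$ systemctl start mysql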

At this stage, both clusters should be active and running, but not connected to each other in any way.

Manual Asynchronous Replication setup

We need to set up async replication manually once so that the automation can work later on with the help of the Replication Manager script.

  • We will take a dump from the DC1 [DC1-1] node.

  • Then, we will restore the dump on the DC2 [DC2-1] node.

  • Once the restoration is finished, we can establish the replication (a minimal sketch follows the note below).

Note: Ensure the user for replication purposes is available on the source node.
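A minimal sketch of these steps is shown below, assuming MySQL 8.0 syntax and GTID auto-positioning; the replication user, password, and IP addresses are placeholders.

# On DC1-1 (source): create the replication user
CREATE USER 'repl_user'@'%' IDENTIFIED BY 'Repl@123';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';

# On DC1-1: take a consistent dump (with GTID enabled, mysqldump records the GTID state by default)
$ mysqldump --all-databases --triggers --routines --events --single-transaction > dc1_dump.sql

# On DC2-1: restore the dump (depending on the existing GTID state of DC2, a prior RESET MASTER may be needed)
$ mysql < dc1_dump.sql

# On DC2-1: point async replication to DC1-1 and start it
CHANGE REPLICATION SOURCE TO SOURCE_HOST='10.0.1.11', SOURCE_USER='repl_user',
  SOURCE_PASSWORD='Repl@123', SOURCE_AUTO_POSITION=1;
START REPLICA;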

Similarly, to set up circular/multi-source asynchronous replication, we can establish an async replication channel on DC1 as well.

  • Async replication is established on [DC1-2], connecting to source [DC2-2].
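As a sketch (again with placeholder credentials and IPs), the reverse channel on DC1-2 would look like the following; no fresh dump is needed in this direction as long as no independent writes have happened on DC2 yet:

# On DC1-2: connect to DC2-2 as the source
CHANGE REPLICATION SOURCE TO SOURCE_HOST='10.0.2.12', SOURCE_USER='repl_user',
  SOURCE_PASSWORD='Repl@123', SOURCE_AUTO_POSITION=1;
START REPLICA;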

At this stage, both clusters [DC1 and DC2] act as both source and replica for each other. Any write on any of the cluster nodes will sync to the others.

Now, we will set up the PXC replication manager to control the failover events inside both clusters.

PXC Replication Manager Setup/Configuration

1) First, we need to create some tables manually on DC1 [DC1-1], which will capture the cluster information and the other associated metadata needed for failover decision-making.
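The authoritative DDL ships with the replication manager script itself, so the sketch below is only indicative of the kind of tables involved: a dedicated schema (here assumed to be called percona) holding a cluster/credentials table, a source-replica mapping table, a weight table, and the script-managed replication state table. The schema, table, and column names below are assumptions for illustration.

CREATE DATABASE IF NOT EXISTS percona;

-- Cluster definitions: source candidates and replication credentials (illustrative columns)
CREATE TABLE percona.cluster (
  cluster          VARCHAR(31)  NOT NULL,
  masterCandidates VARCHAR(255) NOT NULL,
  replCreds        VARCHAR(255) NOT NULL,
  PRIMARY KEY (cluster)
);

-- Mapping table: which cluster replicates from which (illustrative columns)
CREATE TABLE percona.link (
  clusterSlave  VARCHAR(31) NOT NULL,
  clusterMaster VARCHAR(31) NOT NULL,
  PRIMARY KEY (clusterSlave, clusterMaster)
);

-- Per-node weights used when electing source/replica candidates (illustrative columns)
CREATE TABLE percona.weight (
  cluster  VARCHAR(31)  NOT NULL,
  nodename VARCHAR(255) NOT NULL,
  weight   INT NOT NULL DEFAULT 0,
  PRIMARY KEY (cluster, nodename)
);

-- The [replication] state table is created alongside these, but is written to only by the script.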

2) Next, we need to insert the desired information into these tables (an illustrative example follows the note below). Please note that we will not perform any manual writes on the [replication] table; this table is managed by the replication manager script.

  • The below table contains the PXC cluster and replication credential-related information.

  • The below mapping table decides which cluster participates as source/replica.

  • The below table assigns a particular weight to each node, determining its eligibility for source/replica failover.

Note – The above details will sync to all connected cluster and async nodes.
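Purely as an illustration (the exact column layout and credential format depend on the script version deployed), populating the tables could look like the following, with node lists, credentials, and weights adjusted to the environment:

-- Cluster entries with the source candidates and the replication credentials (placeholder values)
INSERT INTO percona.cluster (cluster, masterCandidates, replCreds)
VALUES ('DC1', '10.0.1.11 10.0.1.12', "master_user='repl_user', master_password='Repl@123'"),
       ('DC2', '10.0.2.11 10.0.2.12', "master_user='repl_user', master_password='Repl@123'");

-- Circular topology: each cluster replicates from the other
INSERT INTO percona.link (clusterSlave, clusterMaster)
VALUES ('DC2', 'DC1'),
       ('DC1', 'DC2');

-- Node weights (here assuming a higher weight marks the preferred candidate)
INSERT INTO percona.weight (cluster, nodename, weight)
VALUES ('DC1', 'DC1-1', 10), ('DC1', 'DC1-2', 9),
       ('DC2', 'DC2-1', 10), ('DC2', 'DC2-2', 9);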

3) Finally, we have to download the replication manager script and set it up via cron so that it can continuously monitor the topology and perform the failover when required.

  • By default, the script takes its inputs from [/root/.my.cnf], so we must ensure the credentials are added there.

  • It’s not required to use the super/root user only; any customized user with a minimal set of grants can be used as well (see the sketch after this list).

  • Once the above steps are done, we can test the script. It will output verbose information about the events to the log file /tmp/replication_manager.log. By default, this file does not exist, so we must create it.

  • Lastly, we can put the script in cron as per the requirement. We should activate the script on all participating nodes, as the roles can switch at each failover.
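Putting these last few items together, a sketch of the supporting setup could look like the following; the dedicated user, its privilege list, the script path, and the cron frequency are assumptions and should be matched against the documentation of the script version being deployed:

# /root/.my.cnf - credentials the script reads by default
[client]
user=repl_manager
password=Manager@123

-- Illustrative minimal grants for a dedicated (non-root) user
CREATE USER 'repl_manager'@'localhost' IDENTIFIED BY 'Manager@123';
GRANT SELECT, INSERT, UPDATE, DELETE ON percona.* TO 'repl_manager'@'localhost';
GRANT REPLICATION CLIENT ON *.* TO 'repl_manager'@'localhost';
GRANT REPLICATION_SLAVE_ADMIN ON *.* TO 'repl_manager'@'localhost';

# Create the log file the script writes to
$ touch /tmp/replication_manager.log

# Cron entry (on every participating node), assuming the script was placed in /usr/local/bin
* * * * * /usr/local/bin/replication_manager.sh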

Testing Replica Failover

  • Below, we can see the table “replication” showing the current status of which nodes are replicas at the moment.

DC2-1 (Replica):

DC1-1 (Source):
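For reference, the same status can be checked from any node with a simple query against the state table (assuming the percona schema used in the earlier sketch):

mysql> SELECT * FROM percona.replication\G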

  • Now we will stop the database service of one of the replicas [DC2-1] and observe the behaviour.
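For example, on DC2-1:

$ systemctl stop mysql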

  • The other remaining PXC node [DC2-2] is now reflected in the “Proposed” stage, and the previous replica [DC2-1] is shown as stopped, with a “No” status, as below.

  • Finally, after some time, we can see that node [DC2-2] becomes the new replica.

DC1-1 (Source):

The replication manager log file, /tmp/replication_manager.log, would also show information about the replica node changes and the replication table updates.

Testing Source Failover

  • Now we will test what happens when the source [DC1-1] itself goes down or becomes unavailable.

Before source failover:

Stopping the source:

After source failover:
  • Once the source [DC1-1] is down, the replica [DC2-2] connects to another source, [DC1-2].
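A quick way to confirm the switch from the replica side is to check which host it is now connected to; on MySQL/PXC 8.0.22 and later this would be:

mysql> SHOW REPLICA STATUS\G
(the Source_Host field should now point to DC1-2; on older versions, SHOW SLAVE STATUS and Master_Host apply instead)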

Important considerations

  • As the replication manager script is still in technical preview, it is advisable to test it thoroughly before using it in production. The topology we used above is for demo purposes; we suggest using a minimum of three PXC nodes in production. Also, avoid doing writes from multiple nodes of the cluster, to lessen the risk of inconsistency issues.
  • It is supported only in PXC/MariaDB-based Galera environments. It won't work for isolated async nodes; for those, we have to rely on regular async replication failover.
  • The script depends on GTID-based replication. Auto failover won't work when using binary log file/position-based replication.


Final Thought

We discussed some use cases of the PXC replication manager script, which is capable of handling both source and replica failovers within PXC/MariaDB-based Galera clusters. The script has certain limitations as it does not work with standard asynchronous replicas. It is applicable only when the nodes are part of a PXC cluster.

The script is also capable of handling more complex topologies, such as multi-source replication, where a single cluster synchronizes data from multiple sources using multiple replication channels. We may explore these scenarios in some separate blog posts.
