Group Replication is a fault-tolerant, highly available replication topology: if the primary node goes down, one of the other candidate/secondary members takes over, so write and read operations can continue without interruption.

However, there are some scenarios where, due to outages, network partitions, or database crashes, the group membership can be broken, or we end up with isolated members. In those situations, we have to perform either a partial or a full recovery to make the whole topology active again.

MySQL Group Replication topology with group nodes down

MySQL Group Replication topology with a network partition

Let’s consider the scenario below: all the group nodes go down while read/write operations are ongoing on the primary member [Node1].

1) We have the group replication topology below, which I set up using the dbdeployer tool.
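For reference, a three-node group sandbox like this one can be created with a single dbdeployer command (the version string below is just an example):

# Deploy a three-node, single-primary Group Replication sandbox
dbdeployer deploy replication --topology=group --single-primary 8.0.33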

2) Running workload on the primary [127.0.0.1:23637].
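The exact workload commands are not shown here; as an illustration, a sysbench oltp_read_write run against the primary could look like this (the schema name and the msandbox credentials are assumptions based on dbdeployer defaults):

# Create a schema for the benchmark, then prepare tables and run a read/write workload
mysql -h 127.0.0.1 -P 23637 -u msandbox -pmsandbox -e "CREATE DATABASE IF NOT EXISTS sbtest"
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=23637 \
  --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest \
  --tables=4 --table-size=10000 prepare
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=23637 \
  --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest \
  --tables=4 --table-size=10000 --threads=4 --time=300 run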

 Output:

3) Here, we have deliberately stopped all the group nodes.
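With dbdeployer, all sandbox members can be stopped at once with the sandbox helper script (the sandbox path is illustrative):

# Stop every node in the group sandbox
$HOME/sandboxes/group_msb_8_0_33/stop_all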

At this stage, the cluster is completely down, and all group members have stopped. Now, we will try to recover the cluster again.

Bootstrapping/recovering cluster nodes

4) First, let’s start each node. Since I am using dbdeployer, I can manage the nodes (start/stop) and perform other tasks with the internal scripts shown below.
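For example, each node can be started with its own script, or all of them at once (paths are illustrative):

# Start the nodes individually ...
$HOME/sandboxes/group_msb_8_0_33/node1/start
$HOME/sandboxes/group_msb_8_0_33/node2/start
$HOME/sandboxes/group_msb_8_0_33/node3/start
# ... or all at once
$HOME/sandboxes/group_msb_8_0_33/start_all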

5) Verify the current status by connecting to each node.
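The membership status can be checked with the following query on each node:

SELECT member_host, member_port, member_state, member_role
  FROM performance_schema.replication_group_members;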

All of the nodes are offline at the moment.

6) We can recover the cluster by bootstrapping one of the group nodes, and then, after starting the other nodes, they will join via the distributed recovery process. But before that, it’s important to know the status of the transactions certified/committed on each node so we don’t end up bootstrapping the wrong node. Bootstrapping the wrong node can result in extreme data loss!
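A minimal way to collect this information from each node is to look at the executed GTID set and at the transactions received on the group_replication_applier channel:

-- Transactions already applied on this node
SELECT @@GLOBAL.gtid_executed;

-- Transactions received/certified but possibly not yet applied
SELECT received_transaction_set
  FROM performance_schema.replication_connection_status
 WHERE channel_name = 'group_replication_applier';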

Note: “received_transaction_set” is empty in our case because all the nodes were restarted. However, if the nodes were up and only Group Replication was broken due to a network disruption or some other event, then we would observe values in this section as well.

From the gtid_executed information here, it is clear that Node1 (“00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-148034”) has the latest GTID coordinates, so we should bootstrap this node.

Please note: When you have collected the transaction sets from all the group members, compare them to find which member has the biggest transaction set overall, including both the executed transactions (gtid_executed) and the certified transactions (on the group_replication_applier channel). You can do this manually by looking at the GTIDs, or you can compare the GTID sets using the built-in functions GTID_SUBSET() and GTID_SUBTRACT(), as shown below.
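For example, to check how another member compares against Node1 (Node1’s set comes from this demo; the second member’s set below is hypothetical):

-- Returns 1 if the first set is fully contained in the second one
SELECT GTID_SUBSET('00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-147000',
                   '00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-148034') AS is_subset_of_node1;

-- Shows the transactions present on Node1 but missing on the other member
SELECT GTID_SUBTRACT('00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-148034',
                     '00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-147000') AS missing_transactions;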

7) Let’s start the bootstrap process on Node1.
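On Node1 (the member with the most advanced GTID set), the standard bootstrap sequence is:

-- Allow this member to bootstrap a new group, start it, then disable the flag again
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;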

Now, if we check the status on Node1, it will reflect that the member is online.

8) Now, we can start the other nodes by simply starting group replication on them. Each node joins via the distributed recovery mechanism, either through state transfer from the binary logs available on a donor or through a full data clone via the clone plugin if the required binary logs no longer exist on any of the available group members.
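On Node2 and Node3, joining is just a matter of:

-- Run on each remaining member; distributed recovery picks a donor automatically
START GROUP_REPLICATION;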

Note: If the binary logs have been rotated or purged, we need to restore from backups; alternatively, we can set up the clone plugin, which will automatically perform the full snapshot process.

Also, we can verify whether the node is still performing distributed recovery by executing the command below. When the recovery is complete and the member shows as an online “SECONDARY”, the recovery replication threads [Slave_IO_Running/Slave_SQL_Running] will be stopped.
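The original output is not reproduced here; queries along these lines show the recovery progress (the group_replication_recovery channel is only active while distributed recovery is running):

-- Check the distributed recovery replication threads
SHOW SLAVE STATUS FOR CHANNEL 'group_replication_recovery'\G

-- And the overall member state (RECOVERING vs. ONLINE)
SELECT member_host, member_port, member_state, member_role
  FROM performance_schema.replication_group_members;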

At this stage, we have a fully running cluster now.

Recovery from the backups

In rare situations, when the data is completely corrupted, when we want to do a partial recovery, or when we want to add new nodes, we can perform backup/restore activity directly.

Let’s try to understand it using the scenario below. Here, we will bring back the cluster with a physical backup taken with Percona XtraBackup.

1) We took a backup from one of the group nodes for the demo purpose.
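The backup command was along these lines (paths, user, and socket are illustrative):

# Take a full physical backup from one of the group members
xtrabackup --backup --user=msandbox --password=msandbox \
  --socket=/tmp/mysql_sandbox23636.sock --target-dir=/backups/full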

Output:

2) In order to make that backup useful, let’s prepare it as well.
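Preparing the backup applies the redo log so the data files are consistent (path is illustrative):

# Prepare (apply logs to) the backup
xtrabackup --prepare --target-dir=/backups/full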

Output:

Now, let’s assume we want to bring the cluster back with this fresh data.

3) First, we will clean the data directory of the target member [Node1].
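With the sandbox layout used here, cleaning Node1 would look like this (stop the instance first; paths are illustrative):

# Stop Node1 and empty its data directory
$HOME/sandboxes/group_msb_8_0_33/node1/stop
rm -rf $HOME/sandboxes/group_msb_8_0_33/node1/data/*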

4) Then, we will copy the prepared data into the Node1 data directory.
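xtrabackup can copy the prepared files back into the (empty) data directory; make sure the file ownership matches the user running mysqld afterwards (paths are illustrative):

# Copy the prepared backup into Node1's data directory
xtrabackup --copy-back --target-dir=/backups/full \
  --datadir=$HOME/sandboxes/group_msb_8_0_33/node1/data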

5) Bootstrap the Node [Node1].

The other nodes could be joined via the distributed recovery/cloning process (if already set up); however, here we will demonstrate how to perform the restore/recovery by using the backups.

6) Let’s take a fresh backup from Node1 and remember the GTID execution information.

Output:

GTID execution details:
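With Percona XtraBackup, the binary log coordinates and the executed GTID set recorded at backup time are stored inside the backup directory (path is illustrative):

# Binary log file, position, and GTID set captured with the backup
cat /backups/full/xtrabackup_binlog_info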

7)  Next, prepare the backup.

Output:

8) Clean the Node2 data directory and copy the prepared backup to Node2.

9) Once the service is restarted, we can verify whether gtid_purged is applied or not.
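For instance:

-- gtid_purged should match the GTID set recorded in the backup
SELECT @@GLOBAL.gtid_purged, @@GLOBAL.gtid_executed\G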

Otherwise, we can set the gtid_purged with the help of the below command.
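A typical sequence looks like this (the GTID set shown is illustrative; use the value captured from the backup):

-- Clear any local binary log/GTID state, then seed gtid_purged from the backup
RESET MASTER;
SET GLOBAL gtid_purged = '00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-148034';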

10) Finally, we can establish the replication and initiate the GR process.
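With the GTID state in place, we configure the recovery channel credentials and start Group Replication (MySQL 8.0.23+ syntax; the user follows the dbdeployer defaults used in this setup, and the password is an assumption):

-- Credentials for the distributed recovery channel
CHANGE REPLICATION SOURCE TO SOURCE_USER = 'rsandbox', SOURCE_PASSWORD = 'rsandbox'
  FOR CHANNEL 'group_replication_recovery';

START GROUP_REPLICATION;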

Similar steps can be performed for the subsequent nodes [Node3] as well. Moreover, in place of Percona XtraBackup, we can also use logical backup/restore tools such as mydumper, mysqldump, or MySQL Shell to perform the recovery and establish the GR process. We are not going to discuss them here; however, the process is usually the same as for any async replication setup. Once the restoration completes with the expected GTID coordinates, we configure the group_replication_recovery channel and start group replication just as shown in the previous step.

Moreover, since MySQL 8.0.17, the clone plugin has been available and can be used for remote cloning operations during distributed recovery in Group Replication. This requires a bit of additional configuration and a few extra steps, which we walk through next.

Recovery via cloning

Need to execute on the Donor node [Node1]:

1) Dynamically installing the plugin.
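For example:

-- Load the clone plugin at runtime
INSTALL PLUGIN clone SONAME 'mysql_clone.so';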

2) Persist the changes on the database.
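One way to make sure the plugin is loaded again after a restart is through the option file (the original post may have used a different mechanism):

# In the [mysqld] section of the node's my.cnf / my.sandbox.cnf
plugin-load-add=mysql_clone.so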

3) Grant the BACKUP_ADMIN privilege to the replication user “rsandbox”, which is responsible for the distributed recovery process.
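For example (adjust the host part to match the existing replication account in your setup):

-- The recovery user needs BACKUP_ADMIN on the donor for clone-based recovery
GRANT BACKUP_ADMIN ON *.* TO 'rsandbox'@'%';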

4) Verify the plugin status.
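For example:

-- Should report ACTIVE once the plugin is installed
SELECT plugin_name, plugin_status
  FROM information_schema.plugins
 WHERE plugin_name = 'clone';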

Need to execute on Recipient node [Node3]:

1) Dynamically installing the plugin.

2) Persist the changes on the database.

3) Verify the plugin status.

4) Define the threshold for clone-based distributed recovery on all the group nodes. A joining node will use the incremental (binary-log-based) data sync unless the transaction gap crosses the threshold below, in which case the data cloning process runs instead. For this test, we took a very small value, although the default is quite big, so cloning normally kicks in only when the binary logs are unavailable.
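For this demo a deliberately low value is used so that even a small transaction gap triggers a clone (the exact value is illustrative):

-- Run on all group members; a GTID gap larger than this triggers clone-based recovery
SET GLOBAL group_replication_clone_threshold = 10;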

5) Finally, start Node3. In the MySQL logs, we can see that the distributed recovery via the clone plugin has started. This process will replace or clone the whole data directory from the Donor/Source [Node1].

6) Next, we can track the GTID execution and coordinate information using the clone status file or the command below.
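The clone progress and the GTID coordinates applied by the clone operation can be read from the performance_schema clone tables, for example:

-- GTID set and binlog position transferred by the clone operation
SELECT state, binlog_file, binlog_position, gtid_executed
  FROM performance_schema.clone_status\G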

Otherwise, we can set gtid_purged manually, passing the GTID coordinates from the clone status above (the same approach used in the backup-based recovery section).

7) Run the CHANGE REPLICATION SOURCE command for the group_replication_recovery channel and initiate group replication, as in the previous section.

8) Finally, we can see Node3 showing in the group replication topology.

Wrap-up

In this blog post, we explored ways to recover Group Replication from some common failures. We also discussed other approaches, like backups via Percona XtraBackup or the clone plugin process, which facilitate the distributed recovery/data sync-up among the group nodes.


