Group Replication is a fault-tolerant, highly available replication topology: if the primary node goes down, one of the other candidate/secondary members takes over, so write and read operations can continue without interruption.

However, there are some scenarios where, due to outages, network partitions, or database crashes, the group membership can be broken, or we end up with isolated members. In those situations, we have to perform either a partial or a full recovery to make the whole topology active again.

MySQL Group Replication topology with group nodes down

MySQL Group Replication topology with a network partition

Let’s consider the scenario below: all the group nodes go down while read/write operations are ongoing on the primary member [Node1].

1) We have the group replication topology below, which I set up using the dbdeployer tool.
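For reference, a three-node group sandbox like this one can be created with a single dbdeployer command (the version string below is just an example):

# Deploy a three-node, single-primary Group Replication sandbox
dbdeployer deploy replication --topology=group --single-primary 8.0.33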

2) Running workload on the primary [127.0.0.1:23637].
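The exact workload commands are not shown here; as an illustration, a sysbench oltp_read_write run against the primary could look like this (the schema name and the msandbox credentials are assumptions based on dbdeployer defaults):

# Create a schema for the benchmark, then prepare tables and run a read/write workload
mysql -h 127.0.0.1 -P 23637 -u msandbox -pmsandbox -e "CREATE DATABASE IF NOT EXISTS sbtest"
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=23637 \
  --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest \
  --tables=4 --table-size=10000 prepare
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=23637 \
  --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest \
  --tables=4 --table-size=10000 --threads=4 --time=300 run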

 Output:

3) Here, we have deliberately stopped all the group nodes.
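With dbdeployer, all sandbox members can be stopped at once with the sandbox helper script (the sandbox path is illustrative):

# Stop every node in the group sandbox
$HOME/sandboxes/group_msb_8_0_33/stop_all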

At this stage, the cluster is completely down, and all group members have stopped. Now, we will try to recover the cluster again.

Bootstrapping/recovering cluster nodes

4) First, let’s start each node. Since I am using dbdeployer, I can manage the nodes (start/stop) and perform other tasks with the internal scripts shown below.
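For example, each node can be started with its own script, or all of them at once (paths are illustrative):

# Start the nodes individually ...
$HOME/sandboxes/group_msb_8_0_33/node1/start
$HOME/sandboxes/group_msb_8_0_33/node2/start
$HOME/sandboxes/group_msb_8_0_33/node3/start
# ... or all at once
$HOME/sandboxes/group_msb_8_0_33/start_all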

5) Verify the current status by connecting to each node.
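The membership status can be checked with the following query on each node:

SELECT member_host, member_port, member_state, member_role
  FROM performance_schema.replication_group_members;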

All of the nodes are offline at the moment.

6) We can recover the cluster by bootstrapping one of the group nodes, and then, after starting the other nodes, they will join via the distributed recovery process. But before that, it’s important to know the status of the transactions certified/committed on each node so we don’t end up bootstrapping the wrong node. Bootstrapping the wrong node can result in extreme data loss!
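A minimal way to collect this information from each node is to look at the executed GTID set and at the transactions received on the group_replication_applier channel:

-- Transactions already applied on this node
SELECT @@GLOBAL.gtid_executed;

-- Transactions received/certified but possibly not yet applied
SELECT received_transaction_set
  FROM performance_schema.replication_connection_status
 WHERE channel_name = 'group_replication_applier';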

Note: “received_transaction_set” is empty in our case because all the nodes were restarted. However, if the nodes were up and only Group Replication was broken due to a network disruption or some other event, then we would observe values in this section as well.

From the gtid_executed information here, it is clear that Node1 (“00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-148034”) has the latest GTID coordinates, so we should bootstrap this node.

Please note: When you have collected the transaction sets from all the group members, compare them to find which member has the biggest transaction set overall, including both the executed transactions (gtid_executed) and the certified transactions (on the group_replication_applier channel). You can do this manually by looking at the GTIDs, or you can compare the GTID sets using the built-in functions GTID_SUBSET() and GTID_SUBTRACT(), as shown below.
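For example, to check how another member compares against Node1 (Node1’s set comes from this demo; the second member’s set below is hypothetical):

-- Returns 1 if the first set is fully contained in the second one
SELECT GTID_SUBSET('00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-147000',
                   '00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-148034') AS is_subset_of_node1;

-- Shows the transactions present on Node1 but missing on the other member
SELECT GTID_SUBTRACT('00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-148034',
                     '00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-147000') AS missing_transactions;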

7) Let’s start the bootstrap process on Node1.
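On Node1 (the member with the most advanced GTID set), the standard bootstrap sequence is:

-- Allow this member to bootstrap a new group, start it, then disable the flag again
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;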

Now, if we check the status on Node1, it will reflect that the member is online.

8) Now, we can start the other nodes by simply starting group replication on them. Each node joins via the distributed recovery mechanism, either through state transfer from the binary logs available on a donor or through a full data clone via the clone plugin if the required binary logs no longer exist on any of the available group members.
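On Node2 and Node3, joining is just a matter of:

-- Run on each remaining member; distributed recovery picks a donor automatically
START GROUP_REPLICATION;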

Note: If the binary logs have been rotated or purged, we need to restore from backups; alternatively, we can set up the clone plugin, which will automatically perform the full snapshot process.

Also, we can verify whether the node is still performing distributed recovery by executing the command below. When the recovery is complete and the member shows as an online “SECONDARY”, the recovery replication threads [Slave_IO_Running/Slave_SQL_Running] will be stopped.
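The original output is not reproduced here; queries along these lines show the recovery progress (the group_replication_recovery channel is only active while distributed recovery is running):

-- Check the distributed recovery replication threads
SHOW SLAVE STATUS FOR CHANNEL 'group_replication_recovery'\G

-- And the overall member state (RECOVERING vs. ONLINE)
SELECT member_host, member_port, member_state, member_role
  FROM performance_schema.replication_group_members;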

At this stage, we have a fully running cluster now.

Recovery from the backups

In rare situations, when the data is completely corrupted, when we want to do a partial recovery, or when we want to add new nodes, we can perform backup/restore activity directly.

Let’s try to understand it using the scenario below. Here, we will bring back the cluster with a physical backup taken with Percona XtraBackup.

1) We took a backup from one of the group nodes for the demo purpose.
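The backup command was along these lines (paths, user, and socket are illustrative):

# Take a full physical backup from one of the group members
xtrabackup --backup --user=msandbox --password=msandbox \
  --socket=/tmp/mysql_sandbox23636.sock --target-dir=/backups/full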

Output:

2) In order to make that backup useful, let’s prepare it as well.
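Preparing the backup applies the redo log so the data files are consistent (path is illustrative):

# Prepare (apply logs to) the backup
xtrabackup --prepare --target-dir=/backups/full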

Output:

Now, let’s assume we want to bring the cluster back with this fresh data.

3) First, we will clean the data directory of the target member [Node1].
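With the sandbox layout used here, cleaning Node1 would look like this (stop the instance first; paths are illustrative):

# Stop Node1 and empty its data directory
$HOME/sandboxes/group_msb_8_0_33/node1/stop
rm -rf $HOME/sandboxes/group_msb_8_0_33/node1/data/*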

4) Then, we will copy the prepared data into the Node1 data directory.
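xtrabackup can copy the prepared files back into the (empty) data directory; make sure the file ownership matches the user running mysqld afterwards (paths are illustrative):

# Copy the prepared backup into Node1's data directory
xtrabackup --copy-back --target-dir=/backups/full \
  --datadir=$HOME/sandboxes/group_msb_8_0_33/node1/data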

5) Bootstrap the Node [Node1].

The other nodes could be joined via the distributed recovery/cloning process (if already set up); however, here we will demonstrate how to perform the restore/recovery by using the backups.

6) Let’s take a fresh backup from Node1 and remember the GTID execution information.

Output:

GTID execution details:
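With Percona XtraBackup, the binary log coordinates and the executed GTID set recorded at backup time are stored inside the backup directory (path is illustrative):

# Binary log file, position, and GTID set captured with the backup
cat /backups/full/xtrabackup_binlog_info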

7)  Next, prepare the backup.

Output:

8) Clean the Node2 data directory and copy the prepared backup to Node2.

9) Once the service is restarted, we can verify whether gtid_purged is applied or not.
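For instance:

-- gtid_purged should match the GTID set recorded in the backup
SELECT @@GLOBAL.gtid_purged, @@GLOBAL.gtid_executed\G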

Otherwise, we can set the gtid_purged with the help of the below command.
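A typical sequence looks like this (the GTID set shown is illustrative; use the value captured from the backup):

-- Clear any local binary log/GTID state, then seed gtid_purged from the backup
RESET MASTER;
SET GLOBAL gtid_purged = '00023636-bbbb-cccc-dddd-eeeeeeeeeeee:1-148034';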

10) Finally, we can establish the replication and initiate the GR process.
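With the GTID state in place, we configure the recovery channel credentials and start Group Replication (MySQL 8.0.23+ syntax; the user follows the dbdeployer defaults used in this setup, and the password is an assumption):

-- Credentials for the distributed recovery channel
CHANGE REPLICATION SOURCE TO SOURCE_USER = 'rsandbox', SOURCE_PASSWORD = 'rsandbox'
  FOR CHANNEL 'group_replication_recovery';

START GROUP_REPLICATION;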

Similar steps can be performed for the subsequent nodes [Node3] as well. Moreover, in place of Percona XtraBackup, we can also use logical backup/restore tools such as mydumper, mysqldump, or MySQL Shell to perform the recovery and establish the GR process. We are not going to discuss them here; however, the process is usually the same as for any async replication setup. Once the restoration completes with the expected GTID coordinates, we configure the group_replication_recovery channel and start group replication just as shown in the previous step.

Moreover, since MySQL 8.0.17, the clone plugin has been available and can be used for remote cloning operations during distributed recovery in Group Replication. This requires a bit of additional configuration and a few extra steps, which we walk through next.

Recovery via cloning

Need to execute on the Donor node [Node1]:

1) Dynamically installing the plugin.
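For example:

-- Load the clone plugin at runtime
INSTALL PLUGIN clone SONAME 'mysql_clone.so';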

2) Persist the changes on the database.
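One way to make sure the plugin is loaded again after a restart is through the option file (the original post may have used a different mechanism):

# In the [mysqld] section of the node's my.cnf / my.sandbox.cnf
plugin-load-add=mysql_clone.so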

3) Grant the BACKUP_ADMIN privilege to the replication user “rsandbox”, which is responsible for the distributed recovery process.
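For example (adjust the host part to match the existing replication account in your setup):

-- The recovery user needs BACKUP_ADMIN on the donor for clone-based recovery
GRANT BACKUP_ADMIN ON *.* TO 'rsandbox'@'%';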

4) Verify the plugin status.
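For example:

-- Should report ACTIVE once the plugin is installed
SELECT plugin_name, plugin_status
  FROM information_schema.plugins
 WHERE plugin_name = 'clone';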

Need to execute on Recipient node [Node3]:

1) Dynamically installing the plugin.

2) Persist the changes on the database.

3) Verify the plugin status.

4) Define the threshold for clone-based distributed recovery on all the group nodes. A joining node will use the incremental (binary-log-based) data sync unless the transaction gap crosses the threshold below, in which case the data cloning process runs instead. For this test, we took a very small value, although the default is quite big, so cloning normally kicks in only when the binary logs are unavailable.
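For this demo a deliberately low value is used so that even a small transaction gap triggers a clone (the exact value is illustrative):

-- Run on all group members; a GTID gap larger than this triggers clone-based recovery
SET GLOBAL group_replication_clone_threshold = 10;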

5) Finally, start Node3. In the MySQL logs, we can see that the distributed recovery via the clone plugin has started. This process will replace or clone the whole data directory from the Donor/Source [Node1].

6) Next, we can track the GTID execution and coordinate information using the clone status file or the command below.
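The clone progress and the GTID coordinates applied by the clone operation can be read from the performance_schema clone tables, for example:

-- GTID set and binlog position transferred by the clone operation
SELECT state, binlog_file, binlog_position, gtid_executed
  FROM performance_schema.clone_status\G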

Otherwise, we can set gtid_purged manually, passing the GTID coordinates from the clone status above (the same approach used in the backup-based recovery section).

7) Run the CHANGE REPLICATION SOURCE command for the group_replication_recovery channel and initiate group replication, as in the previous section.

8) Finally, we can see Node3 showing in the group replication topology.

Wrap-up

In this blog post, we explored ways to recover Group Replication from some common failures. We also discussed other approaches, like backups via Percona XtraBackup or the clone plugin process, which facilitate the distributed recovery/data sync-up among the group nodes.


