In this blog post, we will explore how network partitions impact group replication and the way it detects and responds to failures. In case you haven’t checked out my previous blog post about group replication recovery strategies, please have a look at it for some insight.

Topology:

Scenario 1: One of the GR nodes [node3] faces some network interruption

Group Replication network partition in case of single node failures

This diagram depicts a single node down/partitioned.

1. Here, we will block the group replication communication on port [23764] for node3.

Blocking communication:
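A minimal sketch of the rule I used, assuming the sandbox members communicate over the loopback interface (lo0); adjust the interface and port to your own setup:

$ echo "block drop quick on lo0 proto tcp from any to any port = 23764" | sudo pfctl -ef -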

Note – The PF (Packet Filter) tool is used here to enable the firewall and ruleset because it is native to macOS, which I used for my setup. On Linux/Windows, other native firewalls/tools can be used to break network connectivity in a similar way.

Verifying the rule:
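The loaded ruleset can be listed with pfctl, for example:

$ sudo pfctl -sr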

Output:
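With the rule above in place, the listing should contain an entry roughly like the following (exact formatting varies with the PF version):

block drop quick on lo0 proto tcp from any to any port = 23764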


 2. Now, let’s verify the group status again.
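The group membership can be checked from any reachable member with the standard performance_schema view:

mysql> SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;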

Note – node3 [127.0.0.1:23639] is showing “UNREACHABLE”; the node will be expelled/removed from the group after ~10 seconds, because there is a 5-second waiting period before the suspicion is created, followed by the expel timeout (group_replication_member_expel_timeout, default 5 seconds).

Note – When dealing with less stable or slower networks, we can also increase the value of group_replication_member_expel_timeout to avoid unnecessary expulsions.
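For example, a sketch of raising the timeout to 30 seconds (the value here is purely illustrative, not a recommendation):

mysql> SET GLOBAL group_replication_member_expel_timeout = 30;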

Finally, node3 does not appear anymore in the list.

While looking over the node3 details, it displays “ERROR” and “Invalid Protocol”. The caveat here is that if an application is still using this node, it will be served stale data. We can change this behaviour by setting group_replication_exit_state_action, which controls what happens if the node leaves the cluster and is unable to join back.
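For instance, a sketch of switching the exit action to OFFLINE_MODE so regular clients are not served stale data once the node drops out (choose whichever action fits your application):

mysql> SET PERSIST group_replication_exit_state_action = 'OFFLINE_MODE';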

3. Now, we will remove the port [23764] block from node3 by flushing/clearing the rules.
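On my macOS setup, flushing the rules and disabling PF again looks roughly like this:

$ sudo pfctl -F rules
$ sudo pfctl -d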

We can see node3 showing in RECOVERING mode, which indicates it is performing distributed recovery to catch up on the missed changes and sync up.

In the logs, we can verify that it is performing incremental recovery from the donor. If the gap of missing transactions doesn’t reach the value of group_replication_clone_threshold, the node pulls the data from the donor’s binary logs to perform incremental recovery. If the threshold is reached, a full distributed recovery via the clone process is triggered instead. So, binary log retention is equally important to allow the data to sync.

Note – The default value [9223372036854775807] of group_replication_clone_threshold is very high and unrealistic, so in a production environment it is better to set a safer threshold so that, in case of missing binary logs, the full distributed recovery (clone) can still kick in instead of the node halting.
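As an illustration (the number is arbitrary, not a recommendation), a safer threshold could be persisted like this:

mysql> SET PERSIST group_replication_clone_threshold = 100000;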

Once the recovery is finished the node3 appears ONLINE.

Scenario 2: Now, two of the GR nodes [node2 & node3] face some network interruption

Group Replication network partition when multiple nodes are down.

This diagram depicts multiple node failures.

1. Here again, we will block the group replication communication on ports [23763 & 23764] for both node2/node3.

Blocking communication:
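The same PF approach as in scenario 1, just covering both ports this time (verification is again a plain pfctl -sr listing):

$ echo "block drop quick on lo0 proto tcp from any to any port { 23763, 23764 }" | sudo pfctl -ef -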

Verifying the rule:

Output:

2. Now, let’s verify the group status again from node1.

Now, do the same for node2.

Finally, let’s check from node3’s perspective.
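In my sandbox, each member has its own client port, so the per-node view can be compared by pointing the client at each port in turn (node2’s port 23638 is an assumption based on the numbering; node1/node3 ports come from the member list; credentials are placeholders):

$ mysql -h 127.0.0.1 -P 23637 -uroot -p -e "SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE FROM performance_schema.replication_group_members;"
$ mysql -h 127.0.0.1 -P 23638 -uroot -p -e "SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE FROM performance_schema.replication_group_members;"
$ mysql -h 127.0.0.1 -P 23639 -uroot -p -e "SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE FROM performance_schema.replication_group_members;"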

So here only node1 [127.0.0.1:23637] is ONLINE and in PRIMARY state, while each node has a different VIEW of the cluster.

As the cluster was not able to reach a majority, node1 could not continue further with any load/traffic. 

Workload impacted:
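As a rough illustration (the table and data are hypothetical), a simple write on node1 just hangs, because transactions cannot be certified without a majority:

mysql> INSERT INTO test.t1 (id) VALUES (100);
-- the statement blocks until quorum is restored, the member leaves the group, or the client gives up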

3. Now, we will remove the blocking of those ports [23763 & 23764] from both node2/node3.

4. After some time, node2 and node3 rejoined the group via the usual recovery process.

There is a limit on how many attempts an expelled node makes to rejoin the cluster, and this is decided by the settings below. group_replication_autorejoin_tries specifies the number of tries (default 3) that a member makes to automatically rejoin the group if it is expelled/removed or is unable to contact a majority of the group.

After an unsuccessful auto-rejoin attempt, the member waits ~five minutes before the next try. If the specified limit is exhausted without the member rejoining or being stopped, the member proceeds to the action specified by the group_replication_exit_state_action [ABORT_SERVER, OFFLINE_MODE, READ_ONLY] parameter.
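A sketch of raising the rejoin attempts (the value is illustrative):

mysql> SET PERSIST group_replication_autorejoin_tries = 5;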

In some situations, especially when the auto-rejoin attempts [group_replication_autorejoin_tries] are exhausted, the node doesn’t automatically rejoin the group even after the network or other blockers have been fixed. In those scenarios, some manual intervention is required, such as restarting group replication or the database service itself.

E.g.,
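Restarting group replication on the stuck member would look like this; if that doesn’t help, restarting the mysqld service itself is the next step:

mysql> STOP GROUP_REPLICATION;
mysql> START GROUP_REPLICATION;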

How long a member waits, especially under a network partition where the majority is lost, is decided by group_replication_unreachable_majority_timeout. This specifies the number of seconds a member that suffers a network partition and is unable to reach the majority waits before leaving the group. The default value is 0, which means that members that find themselves in a minority due to a network partition wait forever and never leave the group on their own.

If you set any timeout and that time period elapses, all pending transactions processed by the minority nodes are rolled back, and the server will move to the ERROR state.

If we define a timeout for the above variable, a node stuck in the minority will leave the cluster once the timeout elapses. As long as the member can still reach a majority of nodes, this parameter doesn’t come into effect; it is only applicable when the majority of members in the group has been lost.

E.g.,
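A sketch of applying the timeout used for this test (matching the ~10 seconds mentioned below):

mysql> SET GLOBAL group_replication_unreachable_majority_timeout = 10;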

In a 3-node setup, if we lose two members and with them the majority, the behaviour of the remaining node is decided by the value of group_replication_unreachable_majority_timeout: either the node waits forever or it leaves the group once the timeout elapses.

For testing purposes, we have set it to ~10 seconds so the node leaves the group after reaching the threshold.

Final thoughts:

So, in the above blog post, we saw how a network partition or interruption affects the group nodes and how the resilient behavior of group replication helps in adding the nodes back to the group again. We also discussed some important parameters that affect the node expulsion/rejoin process. Network partitions also have a serious impact on the performance of transactions/workload. It’s worth mentioning a performance problem reported here – https://bugs.mysql.com/bug.php?id=99133 – by one of my colleagues, which highlights the drop in write throughput when a network partition occurs among the group nodes. It seems this behavior is still the same and not fully solved.


