Managing fleets of MySQL servers in a replication environment becomes much easier with Orchestrator, a MySQL replication topology management and high-availability tool. It ensures a smooth transition of the primary role, whether during an ad hoc failover or a planned/graceful switchover.
Several configuration parameters play a crucial role in controlling and influencing failover behavior. In this blog post, we’ll explore some of these key options and how they can impact the overall failover process.
Let’s discuss each of these settings one by one with some examples.
The first option, “FailMasterPromotionIfSQLThreadNotUpToDate,” is disabled by default. When it is “true” and a master failover takes place while the candidate master has still not consumed all of its relay log events, the failover/promotion process is aborted.
If this setting remains “false,” then in a scenario where all the replicas are lagging and the current master is down, one of the lagging members will still be chosen as the new master, which can lead to data loss on the new master node. Later, when the old master is re-added as a replica, this can cause duplicate-entry errors.
Considering “FailMasterPromotionIfSQLThreadNotUpToDate” is enabled in the orchestrator configuration file “orchestrator.conf.json”:
```json
"FailMasterPromotionIfSQLThreadNotUpToDate": true
```
This is the topology managed by the orchestrator:
```
Anils-MacBook-Pro.local:22637 [0s,ok,8.0.36,rw,ROW,>>,GTID]
+ Anils-MacBook-Pro.local:22638 [0s,ok,8.0.36,ro,ROW,>>,GTID]
+ Anils-MacBook-Pro.local:22639 [0s,ok,8.0.36,ro,ROW,>>,GTID]
```
Below, we run a sysbench workload to build up replication lag for our testing.
```shell
sysbench \
--db-driver=mysql \
--mysql-user=sbtest_user \
--mysql-password=Sbtest@2022 \
--mysql-db=sbtest \
--mysql-host=127.0.0.1 \
--mysql-port=22637 \
--tables=15 \
--table-size=3000000 \
--create_secondary=off \
--threads=100 \
--time=0 \
--events=0 \
--report-interval=1 /opt/homebrew/Cellar/sysbench/1.0.20_7/share/sysbench/oltp_read_write.lua run
```
Output:
```
[ 256s ] thds: 100 tps: 1902.77 qps: 38008.31 (r/w/o: 26611.72/7591.05/3805.54) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 257s ] thds: 100 tps: 1960.71 qps: 38960.02 (r/w/o: 27292.45/7746.15/3921.42) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 258s ] thds: 100 tps: 1803.48 qps: 35773.52 (r/w/o: 24991.65/7174.91/3606.96) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
```
After a while, replication lag builds up on the replicas; at that point, we stopped the primary [127.0.0.1:22637].
```
slave1 [localhost:22638] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
 Slave_IO_State: 
 Master_Host: 127.0.0.1
 Master_User: rsandbox
 Master_Port: 22637
 Connect_Retry: 60
 Master_Log_File: mysql-bin.000012
 Read_Master_Log_Pos: 941353804
 Relay_Log_File: mysql-relay.000034
 Relay_Log_Pos: 318047296
 Relay_Master_Log_File: mysql-bin.000012
 Slave_IO_Running: No
 Slave_SQL_Running: Yes
…

 Seconds_Behind_Master: 215
Master_SSL_Verify_Server_Cert: No
 Last_IO_Errno: 2003
 Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
```
```
slave2 [localhost:22639] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
 Slave_IO_State: 
 Master_Host: 127.0.0.1
 Master_User: rsandbox
 Master_Port: 22637
 Connect_Retry: 60
 Master_Log_File: mysql-bin.000012
 Read_Master_Log_Pos: 941353804
 Relay_Log_File: mysql-relay.000002
 Relay_Log_Pos: 302890408
 Relay_Master_Log_File: mysql-bin.000012
 Slave_IO_Running: No
 Slave_SQL_Running: Yes

…
 Seconds_Behind_Master: 215
Master_SSL_Verify_Server_Cert: No
 Last_IO_Errno: 2003
 Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
```
The result was that the master promotion failed as the SQL thread was not up to date.
```
2025-06-04 23:42:21 ERROR RecoverDeadMaster: failed promotion. FailMasterPromotionIfSQLThreadNotUpToDate is set and promoted replica Anils-MacBook-Pro.local:22638 's sql thread is not up to date (relay logs still unapplied). Aborting promotion
```
Now, if “FailMasterPromotionIfSQLThreadNotUpToDate” is left at its default of “false,” the failover proceeds even when the replica is suffering from replication lag.
In the same scenario as above, with “FailMasterPromotionIfSQLThreadNotUpToDate”: false, the master promotion completes successfully.
cat /tmp/recovery.log:
```
20250604 23:51:19: Detected AllMasterReplicasNotReplicating on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250604 23:52:41: Detected DeadMaster on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250604 23:52:56: Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250604 23:53:07: Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
20250604 23:53:07: (for all types) Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Successor: Anils-MacBook-Pro.local:22638
```
“DelayMasterPromotionIfSQLThreadNotUpToDate” is the counterpart of the option discussed above. Instead of aborting the master failover, it delays the promotion until the candidate master has consumed all of its relay logs. When it is “true,” the orchestrator process waits for the SQL thread to catch up before promoting the new master.
Considering “DelayMasterPromotionIfSQLThreadNotUpToDate” is enabled in the orchestrator configuration file “orchestrator.conf.json”:
```json
"DelayMasterPromotionIfSQLThreadNotUpToDate": true
```
Some workload was running on the master [127.0.0.1:22637], and within a few seconds, replication lag started to build. We stopped the master node at around this point.
We can see in the log file /tmp/recovery.log that the initial failover process started.
```
20250605 19:10:04: Detected UnreachableMasterWithLaggingReplicas on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250605 19:10:06: Detected DeadMaster on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250605 19:10:06: Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250605 19:10:16: Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
```
However, we can observe that, due to the replication lag on the candidate master [Anils-MacBook-Pro.local:22638], the promotion was put on hold until the lag was recovered before completing the failover.
```
2025-06-05 19:10:27 ERROR DelayMasterPromotionIfSQLThreadNotUpToDate error: 2025-06-05 19:10:27 ERROR WaitForSQLThreadUpToDate stale coordinates timeout on Anils-MacBook-Pro.local:22638 after duration 10s
...
2025-06-05 19:10:27 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:28 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:28 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:29 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:29 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:30 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
...
2025-06-05 19:10:37 INFO topology_recovery: DelayMasterPromotionIfSQLThreadNotUpToDate error: 2025-06-05 19:10:37 ERROR WaitForSQLThreadUpToDate stale coordinates timeout on Anils-MacBook-Pro.local:22638 after duration 10s
```
```
slave1 [localhost:22638] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
 Slave_IO_State: Waiting for source to send event
 Master_Host: 127.0.0.1
 Master_User: rsandbox
 Master_Port: 22637
 Connect_Retry: 60
 Master_Log_File: mysql-bin.000017
 Read_Master_Log_Pos: 981203536
 Relay_Log_File: mysql-relay.000002
 Relay_Log_Pos: 179833778
 Relay_Master_Log_File: mysql-bin.000017
 Slave_IO_Running: Yes
 Slave_SQL_Running: Yes
...
 Seconds_Behind_Master: 227
```
```
slave1 [localhost:22638] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
 Slave_IO_State: 
 Master_Host: 127.0.0.1
 Master_User: rsandbox
 Master_Port: 22637
 Connect_Retry: 60
 Master_Log_File: mysql-bin.000017
 Read_Master_Log_Pos: 1017383839
 Relay_Log_File: mysql-relay.000002
 Relay_Log_Pos: 237223256
 Relay_Master_Log_File: mysql-bin.000017
 Slave_IO_Running: No
 Slave_SQL_Running: No

…
 Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
 Last_IO_Errno: 2003
 Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
```
```
slave2 [localhost:22639] {root} ((none)) > show slave status\G
*************************** 1. row ***************************
 Slave_IO_State: Waiting for source to send event
 Master_Host: 127.0.0.1
 Master_User: rsandbox
 Master_Port: 22637
 Connect_Retry: 60
 Master_Log_File: mysql-bin.000017
 Read_Master_Log_Pos: 907360640
 Relay_Log_File: mysql-relay.000004
 Relay_Log_Pos: 167191740
 Relay_Master_Log_File: mysql-bin.000017
 Slave_IO_Running: Yes
 Slave_SQL_Running: Yes
...
 Seconds_Behind_Master: 210
```
```
slave2 [localhost:22639] {root} ((none)) > show slave status\G
*************************** 1. row ***************************
 Slave_IO_State: 
 Master_Host: 127.0.0.1
 Master_User: rsandbox
 Master_Port: 22637
 Connect_Retry: 60
 Master_Log_File: mysql-bin.000017
 Read_Master_Log_Pos: 1017383839
 Relay_Log_File: mysql-relay.000004
 Relay_Log_Pos: 237135927
 Relay_Master_Log_File: mysql-bin.000017
 Slave_IO_Running: No
 Slave_SQL_Running: No

…
 Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
 Last_IO_Errno: 2003
 Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
```
There is one more observation here. If we attempt a graceful switchover while replication lag persists on all replicas, we get the message below.
```shell
shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638
```
Output:
```
Desginated instance Anils-MacBook-Pro.local:22638 seems to be lagging too much for this operation. Aborting.
```
This happens because of the condition below, which requires the replication lag to be less than or equal to the configured “ReasonableMaintenanceReplicationLagSeconds” (set to 20 here).
```go
if !designatedInstance.HasReasonableMaintenanceReplicationLag() {
	return nil, nil, fmt.Errorf("Desginated instance %+v seems to be lagging too much for this operation. Aborting.", designatedInstance.Key)
}
```
```go
func (this *Instance) HasReasonableMaintenanceReplicationLag() bool {
	// replicas with SQLDelay are a special case
	if this.SQLDelay > 0 {
		return math.AbsInt64(this.SecondsBehindMaster.Int64-int64(this.SQLDelay)) <= int64(config.Config.ReasonableMaintenanceReplicationLagSeconds)
	}
	return this.SecondsBehindMaster.Int64 <= int64(config.Config.ReasonableMaintenanceReplicationLagSeconds)
}
```
https://github.com/openark/orchestrator/blob/730db91f70344e38296dbb0fecdbc0cefd6fca79/go/logic/topology_recovery.go#L2124
https://github.com/openark/orchestrator/blob/730db91f70344e38296dbb0fecdbc0cefd6fca79/go/inst/instance.go#L37
Once this condition is met, either because the replication lag has dropped below the threshold or because we have increased the threshold, the graceful takeover works fine.
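If we choose to raise the threshold rather than wait for the lag to drain, the change goes in “orchestrator.conf.json”. A sketch (the parameter name comes from the orchestrator source quoted above; the value of 120 is purely illustrative):

```json
"ReasonableMaintenanceReplicationLagSeconds": 120,
```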
```shell
shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638
```
Here, the orchestrator service logs show the takeover process waiting for all the relay logs to be applied.
```
2025-06-05 20:15:01 INFO CommandRun successful. exit status 0
2025-06-05 20:15:01 INFO topology_recovery: Completed PreGracefulTakeoverProcesses hook 1 of 1 in 5.463653s
2025-06-05 20:15:01 INFO topology_recovery: done running PreGracefulTakeoverProcesses hooks
2025-06-05 20:15:01 INFO GracefulMasterTakeover: Will set Anils-MacBook-Pro.local:22637 as read_only
2025-06-05 20:15:01 INFO instance Anils-MacBook-Pro.local:22637 read_only: true
2025-06-05 20:15:01 INFO auditType:read-only instance:Anils-MacBook-Pro.local:22637 cluster:Anils-MacBook-Pro.local:22637 message:set as true
2025-06-05 20:15:01 INFO GracefulMasterTakeover: Will wait for Anils-MacBook-Pro.local:22638 to reach master coordinates mysql-bin.000021:221642748

2025-06-05 20:19:53 INFO topology_recovery: DelayMasterPromotionIfSQLThreadNotUpToDate: waiting for SQL thread on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:53 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:53 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:54 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:54 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:55 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:55 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pr

..

2025-06-05 20:26:37 INFO topology_recovery: DelayMasterPromotionIfSQLThreadNotUpToDate: SQL thread caught up on Anils-MacBook-Pro.local:22638
2025-06-05 20:26:37 INFO topology_recovery: RecoverDeadMaster: found no reason to override promotion of Anils-MacBook-Pro.local:22638
2025-06-05 20:26:37 INFO topology_recovery: RecoverDeadMaster: successfully promoted Anils-MacBook-Pro.local:22638
2025-06-05 20:26:37 INFO topology_recovery: - RecoverDeadMaster: promoted server coordinates: mysql-bin.000017:221167478
2025-06-05 20:26:37 INFO topology_recovery: - RecoverDeadMaster: will apply MySQL changes to promoted master
2025-06-05 20:26:37 INFO Will reset replica on Anils-MacBook-Pro.local:22638
```
Once lag resolves, the takeover process runs successfully.
```
20250605 20:19:53: Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250605 20:26:39: Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
20250605 20:26:39: (for all types) Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Successor: Anils-MacBook-Pro.local:22638
20250605 20:26:39: Planned takeover complete
```
```shell
shell> orchestrator-client -c topology -a testcluster
```
Output:
```
Anils-MacBook-Pro.local:22638 [0s,ok,8.0.36,rw,ROW,>>,GTID]
- Anils-MacBook-Pro.local:22637 [null,nonreplicating,8.0.36,ro,ROW,>>,GTID]
+ Anils-MacBook-Pro.local:22639 [0s,ok,8.0.36,ro,ROW,>>,GTID]
```
“FailMasterPromotionOnLagMinutes” ensures that a master promotion is aborted if the replica is lagging by the configured number of minutes or more. To use this flag, we must also set “ReplicationLagQuery” and run a heartbeat mechanism such as “pt-heartbeat” to measure the true replication lag.
Let’s see how it works.
We have set the value below in the orchestrator configuration “orchestrator.conf.json,” which ensures that if the lag reaches one minute or more, the master promotion will fail.
```json
"FailMasterPromotionOnLagMinutes": 1,
```
As we discussed above, enabling this option requires setting “ReplicationLagQuery,” which fetches the replication lag from the heartbeat mechanism instead of relying on the Seconds_Behind_Master status. If “FailMasterPromotionOnLagMinutes” is set without it, orchestrator refuses to start:
```
2025-06-06 08:54:47 INFO starting orchestrator, version: 3.2.6, git commit: 89f3bdd33931d5e234890787a24cc035fa106b32
2025-06-06 08:54:47 INFO Read config: /Users/aniljoshi/orchestrator/conf/orchestrator.conf.json
2025-06-06 08:54:47 FATAL nonzero FailMasterPromotionOnLagMinutes requires ReplicationLagQuery to be set
```
By default, orchestrator uses the replica status value “Seconds_Behind_Master” to monitor replication lag. However, in a scenario where replication is already broken and the master has also failed, “Seconds_Behind_Master” would be NULL, which is of no use for making an accurate promotion decision.
So here we are going to use pt-heartbeat as the source of replication lag. pt-heartbeat is a replication delay monitoring tool that measures delay by looking at actual replicated data. This provides “absolute” lag from the master as well as sub-second resolution.
Below is the “ReplicationLagQuery” configuration, which we will define in the orchestrator configuration file.
```json
"ReplicationLagQuery": "SELECT CAST((UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(ts)) AS unsigned INTEGER) AS 'delay' FROM percona.heartbeat ORDER BY ts DESC LIMIT 1",
```
We also need to have a separate pt-heartbeat process that will run on both source/replica instances.
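Before starting these processes, the heartbeat user must exist on the instances. A hypothetical setup matching the credentials used in the commands below; the exact privilege list is an assumption (with --create-table, pt-heartbeat needs to create and write to the percona.heartbeat table):

```sql
-- Hypothetical pt-heartbeat user; adjust host and privileges for your environment.
CREATE USER IF NOT EXISTS 'heartbeat'@'127.0.0.1' IDENTIFIED BY 'Heartbeat@1234';
GRANT CREATE, SELECT, INSERT, UPDATE, DELETE ON percona.* TO 'heartbeat'@'127.0.0.1';
```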
```shell
shell> pt-heartbeat --check-read-only --read-only-interval=1 --fail-successive-errors 5 --interval=0.1 --create-table --create-table-engine=InnoDB --database=percona --table=heartbeat --host=127.0.0.1 --user=heartbeat --password=Heartbeat@1234 --port=22637 --update &
```
```shell
shell> pt-heartbeat --check-read-only --read-only-interval=1 --fail-successive-errors 5 --interval=0.1 --create-table --create-table-engine=InnoDB --database=percona --table=heartbeat --host=127.0.0.1 --user=heartbeat --password=Heartbeat@1234 --port=22638 --update &
```
```shell
shell> pt-heartbeat --check-read-only --read-only-interval=1 --fail-successive-errors 5 --interval=0.1 --create-table --create-table-engine=InnoDB --database=percona --table=heartbeat --host=127.0.0.1 --user=heartbeat --password=Heartbeat@1234 --port=22639 --update &
```
- --read-only-interval => When --check-read-only is specified, the interval to sleep while the server is found to be read-only. If unspecified, --interval is used.
- --fail-successive-errors => If specified, pt-heartbeat will fail after the given number of successive DBI errors (failure to connect to the server or issue a query).
- --interval => How often to update or check the heartbeat --table. Updates and checks begin on the first whole second, then repeat every --interval seconds for --update and every --interval plus --skew seconds for --monitor.
Reference – https://docs.percona.com/percona-toolkit/pt-heartbeat.html
The delay is calculated on the replicas as the difference between the current system time and the replicated timestamp value from the heartbeat table. On the master node, pt-heartbeat updates the heartbeat table at the configured --interval (every 0.1 seconds in our setup) with the server ID and the current timestamp. These updates reach the replica nodes through normal asynchronous replication.
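The same calculation can be sketched outside of SQL; a minimal illustration (not orchestrator code; the timestamp format matches the pt-heartbeat `ts` column shown in the example that follows):

```python
from datetime import datetime

def heartbeat_lag_seconds(heartbeat_ts: str, now: datetime) -> int:
    """Replication lag: wall-clock time on the replica minus the most
    recently replicated heartbeat timestamp from the source."""
    ts = datetime.fromisoformat(heartbeat_ts)
    return int((now - ts).total_seconds())

# A heartbeat row replicated 53 seconds ago yields a 53-second delay,
# mirroring the UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(ts) query above.
now = datetime.fromisoformat("2025-06-06T10:35:49.410320")
print(heartbeat_lag_seconds("2025-06-06T10:34:56.410320", now))  # 53
```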
E.g.,
```
slave1 [localhost:22638] {msandbox} (percona) > select * from percona.heartbeat;
+----------------------------+-----------+------------------+----------+-----------------------+---------------------+
| ts                         | server_id | file             | position | relay_source_log_file | exec_source_log_pos |
+----------------------------+-----------+------------------+----------+-----------------------+---------------------+
| 2025-06-06T10:34:56.410320 | 100       | mysql-bin.000023 | 1654366  | NULL                  | NULL                |
+----------------------------+-----------+------------------+----------+-----------------------+---------------------+
2 rows in set (0.00 sec)
```
```
slave2 [localhost:22639] {root} (percona) > SELECT CAST((UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(ts)) AS signed INTEGER) AS 'delay' FROM percona.heartbeat ORDER BY ts DESC LIMIT 1;
+-------+
| delay |
+-------+
|    53 |
+-------+
1 row in set (0.01 sec)
```
Let’s see the behaviour of enabling “FailMasterPromotionOnLagMinutes” with a quick scenario.
We ran some workload in the background to create replication delay/lag.
```
slave1 [localhost:22638] {root} ((none)) > show slave status\G
*************************** 1. row ***************************
 Slave_IO_State: Waiting for source to send event
 Master_Host: 127.0.0.1
 Master_User: rsandbox
 Master_Port: 22637
 Connect_Retry: 60
 Master_Log_File: mysql-bin.000005
 Read_Master_Log_Pos: 355883980
 Relay_Log_File: mysql-relay.000003
 Relay_Log_Pos: 452576487
 Relay_Master_Log_File: mysql-bin.000003
 Slave_IO_Running: Yes
 Slave_SQL_Running: Yes
 Replicate_Do_DB: 
 Replicate_Ignore_DB: 
 Replicate_Do_Table: 
 Replicate_Ignore_Table: 
 Replicate_Wild_Do_Table: 
 Replicate_Wild_Ignore_Table: 
 Last_Errno: 0
 Last_Error: 
 Skip_Counter: 0
 Exec_Master_Log_Pos: 453709829
 Relay_Log_Space: 2503386167
 Until_Condition: None
 Until_Log_File: 
 Until_Log_Pos: 0
 Master_SSL_Allowed: No
 Master_SSL_CA_File: 
 Master_SSL_CA_Path: 
 Master_SSL_Cert: 
 Master_SSL_Cipher: 
 Master_SSL_Key: 
 Seconds_Behind_Master: 696
```
Then we attempted a graceful master takeover, but it failed due to the replication lag.
```shell
shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638
Desginated instance Anils-MacBook-Pro.local:22638 seems to be lagging too much for this operation. Aborting.
```
However, as soon as the replication lag dropped below one minute, the threshold we specified via [FailMasterPromotionOnLagMinutes], the failover process ran smoothly.
```
slave1 [localhost:22638] {root} ((none)) > show slave status\G
*************************** 1. row ***************************
 Slave_IO_State: Waiting for source to send event
 Master_Host: 127.0.0.1
 Master_User: rsandbox
 Master_Port: 22637
 Connect_Retry: 60
 Master_Log_File: mysql-bin.000005
 Read_Master_Log_Pos: 605457002
 Relay_Log_File: mysql-relay.000008
 Relay_Log_Pos: 605455949
 Relay_Master_Log_File: mysql-bin.000005
 Slave_IO_Running: Yes
 Slave_SQL_Running: Yes
 Replicate_Do_DB: 
 Replicate_Ignore_DB: 
 Replicate_Do_Table: 
 Replicate_Ignore_Table: 
 Replicate_Wild_Do_Table: 
 Replicate_Wild_Ignore_Table: 
 Last_Errno: 0
 Last_Error: 
 Skip_Counter: 0
 Exec_Master_Log_Pos: 605455733
 Relay_Log_Space: 605457511
 Until_Condition: None
 Until_Log_File: 
 Until_Log_Pos: 0
 Master_SSL_Allowed: No
 Master_SSL_CA_File: 
 Master_SSL_CA_Path: 
 Master_SSL_Cert: 
 Master_SSL_Cipher: 
 Master_SSL_Key: 
 Seconds_Behind_Master: 1
```
So, the master failed over to “Anils-MacBook-Pro.local:22638”.
```shell
shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638
Anils-MacBook-Pro.local:22638
```
Failover logs from “/tmp/recovery.log”:
```
20250607 22:19:53: Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
20250607 22:19:54: (for all types) Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Successor: Anils-MacBook-Pro.local:22638
20250607 22:19:54: Planned takeover complete
```
The purpose of the options discussed above is to provide fine-grained control over the MySQL orchestrator failover process, especially in scenarios where replicas are experiencing replication lag. Essentially, we can either wait for the lag to resolve before triggering a failover, or proceed with an immediate failover despite the lag. Additionally, settings like [FailMasterPromotionOnLagMinutes, FailMasterPromotionIfSQLThreadNotUpToDate] ensure the failover fails outright when the lag exceeds the configured conditions, favoring maximum consistency.