Managing fleets of MySQL servers in a replication environment becomes far easier with a tool such as MySQL Orchestrator. It ensures a smooth transition whenever an ad hoc failover occurs or a planned/graceful switchover is performed.
Several configuration parameters play a crucial role in controlling and influencing failover behavior. In this blog post, we’ll explore some of these key options and how they can impact the overall failover process.
Let’s discuss each of these settings one by one with some examples.
FailMasterPromotionIfSQLThreadNotUpToDate
By default, this option is disabled. However, when it is set to “true” and a master failover takes place while the candidate master has still not consumed all of its relay log events, the failover/promotion process will be aborted.
If this setting remains “false,” then in a scenario where all the replicas are lagging and the current master is down, one of the replicas will still be chosen as the new master, which can lead to data loss on the new master node. Later, when the old master is re-added as a replica, this can surface as duplicate entry errors.
Considering “FailMasterPromotionIfSQLThreadNotUpToDate” is enabled in the orchestrator configuration file “orchestrator.conf.json”:
"FailMasterPromotionIfSQLThreadNotUpToDate": true
This is the topology managed by the orchestrator:
Anils-MacBook-Pro.local:22637   [0s,ok,8.0.36,rw,ROW,>>,GTID]
+ Anils-MacBook-Pro.local:22638 [0s,ok,8.0.36,ro,ROW,>>,GTID]
+ Anils-MacBook-Pro.local:22639 [0s,ok,8.0.36,ro,ROW,>>,GTID]
Below, we run some workload via sysbench to build up replication lag for our testing purposes.
sysbench --db-driver=mysql \
  --mysql-user=sbtest_user \
  --mysql-password=Sbtest@2022 \
  --mysql-db=sbtest \
  --mysql-host=127.0.0.1 \
  --mysql-port=22637 \
  --tables=15 \
  --table-size=3000000 \
  --create_secondary=off \
  --threads=100 \
  --time=0 \
  --events=0 \
  --report-interval=1 \
  /opt/homebrew/Cellar/sysbench/1.0.20_7/share/sysbench/oltp_read_write.lua run
Output:
[ 256s ] thds: 100 tps: 1902.77 qps: 38008.31 (r/w/o: 26611.72/7591.05/3805.54) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 257s ] thds: 100 tps: 1960.71 qps: 38960.02 (r/w/o: 27292.45/7746.15/3921.42) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 258s ] thds: 100 tps: 1803.48 qps: 35773.52 (r/w/o: 24991.65/7174.91/3606.96) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
After some time, the replication lag starts increasing on the replicas; at that point, we stopped the primary [127.0.0.1:22637].
slave1 [localhost:22638] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000012
          Read_Master_Log_Pos: 941353804
               Relay_Log_File: mysql-relay.000034
                Relay_Log_Pos: 318047296
        Relay_Master_Log_File: mysql-bin.000012
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
…
        Seconds_Behind_Master: 215
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
slave2 [localhost:22639] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000012
          Read_Master_Log_Pos: 941353804
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 302890408
        Relay_Master_Log_File: mysql-bin.000012
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
…
        Seconds_Behind_Master: 215
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
The result was that the master promotion failed as the SQL thread was not up to date.
2025-06-04 23:42:21 ERROR RecoverDeadMaster: failed promotion. FailMasterPromotionIfSQLThreadNotUpToDate is set and promoted replica Anils-MacBook-Pro.local:22638 's sql thread is not up to date (relay logs still unapplied). Aborting promotion
Now, if the option “FailMasterPromotionIfSQLThreadNotUpToDate” is left at its default of “false,” the failover will go through even if the replica is suffering from replication lag.
In the same scenario as above, with “FailMasterPromotionIfSQLThreadNotUpToDate”: false, the master promotion happens successfully.
cat /tmp/recovery.log:
20250604 23:51:19: Detected AllMasterReplicasNotReplicating on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250604 23:52:41: Detected DeadMaster on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250604 23:52:56: Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250604 23:53:07: Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
20250604 23:53:07: (for all types) Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Successor: Anils-MacBook-Pro.local:22638
DelayMasterPromotionIfSQLThreadNotUpToDate
This parameter is the opposite of the one discussed above. Here, instead of aborting the master failover, it is delayed until the candidate master has consumed all of its relay logs. When it is “true,” the orchestrator process will wait for the SQL thread to catch up before promoting the new master.
Considering “DelayMasterPromotionIfSQLThreadNotUpToDate” is enabled in the orchestrator configuration file “orchestrator.conf.json”:
"DelayMasterPromotionIfSQLThreadNotUpToDate": true
Some workload was running on the master [127.0.0.1:22637], and within a few seconds, replication lag started building up. We stopped the master node at around that point.
We can see in the log file /tmp/recovery.log that the initial failover process started.
20250605 19:10:04: Detected UnreachableMasterWithLaggingReplicas on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250605 19:10:06: Detected DeadMaster on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250605 19:10:06: Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250605 19:10:16: Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
However, we can observe that, due to the replication lag on the candidate master [Anils-MacBook-Pro.local:22638], the promotion was put on hold until the lag was recovered, and only then did the failover proceed.
2025-06-05 19:10:27 ERROR DelayMasterPromotionIfSQLThreadNotUpToDate error: 2025-06-05 19:10:27 ERROR WaitForSQLThreadUpToDate stale coordinates timeout on Anils-MacBook-Pro.local:22638 after duration 10s
...
2025-06-05 19:10:27 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:28 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:28 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:29 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:29 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:30 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
...
2025-06-05 19:10:37 INFO topology_recovery: DelayMasterPromotionIfSQLThreadNotUpToDate error: 2025-06-05 19:10:37 ERROR WaitForSQLThreadUpToDate stale coordinates timeout on Anils-MacBook-Pro.local:22638 after duration 10s
Anils-MacBook-Pro.local:22638
slave1 [localhost:22638] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000017
          Read_Master_Log_Pos: 981203536
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 179833778
        Relay_Master_Log_File: mysql-bin.000017
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
        Seconds_Behind_Master: 227
slave1 [localhost:22638] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000017
          Read_Master_Log_Pos: 1017383839
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 237223256
        Relay_Master_Log_File: mysql-bin.000017
             Slave_IO_Running: No
            Slave_SQL_Running: No
…
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
Anils-MacBook-Pro.local:22639
slave2 [localhost:22639] {root} ((none)) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000017
          Read_Master_Log_Pos: 907360640
               Relay_Log_File: mysql-relay.000004
                Relay_Log_Pos: 167191740
        Relay_Master_Log_File: mysql-bin.000017
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
        Seconds_Behind_Master: 210
slave2 [localhost:22639] {root} ((none)) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000017
          Read_Master_Log_Pos: 1017383839
               Relay_Log_File: mysql-relay.000004
                Relay_Log_Pos: 237135927
        Relay_Master_Log_File: mysql-bin.000017
             Slave_IO_Running: No
            Slave_SQL_Running: No
…
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
There is one more observation here: if we attempt a graceful switchover while replication lag persists on all replicas, we get the message below.
shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638
Output:
Desginated instance Anils-MacBook-Pro.local:22638 seems to be lagging too much for this operation. Aborting.
This happens because of the condition below, which requires the replication lag to be less than or equal to the defined “ReasonableMaintenanceReplicationLagSeconds: 20.”
if !designatedInstance.HasReasonableMaintenanceReplicationLag() {
    return nil, nil, fmt.Errorf("Desginated instance %+v seems to be lagging too much for this operation. Aborting.", designatedInstance.Key)
}
func (this *Instance) HasReasonableMaintenanceReplicationLag() bool {
    // replicas with SQLDelay are a special case
    if this.SQLDelay > 0 {
        return math.AbsInt64(this.SecondsBehindMaster.Int64-int64(this.SQLDelay)) <= int64(config.Config.ReasonableMaintenanceReplicationLagSeconds)
    }
    return this.SecondsBehindMaster.Int64 <= int64(config.Config.ReasonableMaintenanceReplicationLagSeconds)
}
https://github.com/openark/orchestrator/blob/730db91f70344e38296dbb0fecdbc0cefd6fca79/go/logic/topology_recovery.go#L2124
https://github.com/openark/orchestrator/blob/730db91f70344e38296dbb0fecdbc0cefd6fca79/go/inst/instance.go#L37
Once this condition is met, either because the replication lag has dropped below that threshold or because we have increased the threshold, the graceful takeover works fine.
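For instance, raising the threshold is just a matter of adjusting one setting in “orchestrator.conf.json”. The value 60 below is purely illustrative (20 seconds is what is used in this setup):
"ReasonableMaintenanceReplicationLagSeconds": 60,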
shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638
Here, the orchestrator service logs show the failover process waiting for all the relay log events to be applied.
2025-06-05 20:15:01 INFO CommandRun successful. exit status 0
2025-06-05 20:15:01 INFO topology_recovery: Completed PreGracefulTakeoverProcesses hook 1 of 1 in 5.463653s
2025-06-05 20:15:01 INFO topology_recovery: done running PreGracefulTakeoverProcesses hooks
2025-06-05 20:15:01 INFO GracefulMasterTakeover: Will set Anils-MacBook-Pro.local:22637 as read_only
2025-06-05 20:15:01 INFO instance Anils-MacBook-Pro.local:22637 read_only: true
2025-06-05 20:15:01 INFO auditType:read-only instance:Anils-MacBook-Pro.local:22637 cluster:Anils-MacBook-Pro.local:22637 message:set as true
2025-06-05 20:15:01 INFO GracefulMasterTakeover: Will wait for Anils-MacBook-Pro.local:22638 to reach master coordinates mysql-bin.000021:221642748
2025-06-05 20:19:53 INFO topology_recovery: DelayMasterPromotionIfSQLThreadNotUpToDate: waiting for SQL thread on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:53 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:53 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:54 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:54 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:55 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:55 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pr
..
2025-06-05 20:26:37 INFO topology_recovery: DelayMasterPromotionIfSQLThreadNotUpToDate: SQL thread caught up on Anils-MacBook-Pro.local:22638
2025-06-05 20:26:37 INFO topology_recovery: RecoverDeadMaster: found no reason to override promotion of Anils-MacBook-Pro.local:22638
2025-06-05 20:26:37 INFO topology_recovery: RecoverDeadMaster: successfully promoted Anils-MacBook-Pro.local:22638
2025-06-05 20:26:37 INFO topology_recovery: - RecoverDeadMaster: promoted server coordinates: mysql-bin.000017:221167478
2025-06-05 20:26:37 INFO topology_recovery: - RecoverDeadMaster: will apply MySQL changes to promoted master
2025-06-05 20:26:37 INFO Will reset replica on Anils-MacBook-Pro.local:22638
Once lag resolves, the takeover process runs successfully.
20250605 20:19:53: Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250605 20:26:39: Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
20250605 20:26:39: (for all types) Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Successor: Anils-MacBook-Pro.local:22638
20250605 20:26:39: Planned takeover complete
shell> orchestrator-client -c topology -a testcluster
Output:
Anils-MacBook-Pro.local:22638   [0s,ok,8.0.36,rw,ROW,>>,GTID]
- Anils-MacBook-Pro.local:22637 [null,nonreplicating,8.0.36,ro,ROW,>>,GTID]
+ Anils-MacBook-Pro.local:22639 [0s,ok,8.0.36,ro,ROW,>>,GTID]
FailMasterPromotionOnLagMinutes
This parameter ensures that a master promotion will be aborted if the replica is lagging by the configured number of minutes or more. In order to use this flag, we must also configure “ReplicationLagQuery” and a heartbeat mechanism such as “pt-heartbeat” to assess the correct replication lag.
Let’s see how it works.
We have set the value below in the orchestrator configuration “orchestrator.conf.json,” which ensures that if the lag exceeds ~1 minute, the master promotion process will fail.
"FailMasterPromotionOnLagMinutes": 1,
As we discussed above, enabling this option depends on setting the “ReplicationLagQuery,” which fetches the replication lag details from the heartbeat mechanism instead of relying on the seconds_behind_master status.
2025-06-06 08:54:47 INFO starting orchestrator, version: 3.2.6, git commit: 89f3bdd33931d5e234890787a24cc035fa106b32
2025-06-06 08:54:47 INFO Read config: /Users/aniljoshi/orchestrator/conf/orchestrator.conf.json
2025-06-06 08:54:47 FATAL nonzero FailMasterPromotionOnLagMinutes requires ReplicationLagQuery to be set
By default, the orchestrator uses the slave status field “Seconds_Behind_Master” to monitor replication lag. However, in a scenario where replication is already broken and the master has also failed, “Seconds_Behind_Master” would be “NULL,” which is of no use for making an accurate failover decision.
So here we are going to use pt-heartbeat as the source of replication lag. pt-heartbeat is a replication delay monitoring tool that measures delay by looking at actual replicated data. This provides “absolute” lag from the master as well as sub-second resolution.
Below is the “ReplicationLagQuery” configuration, which we will define in the orchestrator configuration file.
"ReplicationLagQuery": "SELECT CAST((UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(ts)) AS unsigned INTEGER) AS 'delay' FROM percona.heartbeat ORDER BY ts DESC LIMIT 1",
We also need a separate pt-heartbeat process running against the source and replica instances.
Anils-MacBook-Pro.local:22637:
shell> pt-heartbeat --check-read-only --read-only-interval=1 --fail-successive-errors 5 --interval=0.1 --create-table --create-table-engine=InnoDB --database=percona --table=heartbeat --host=127.0.0.1 --user=heartbeat --password=Heartbeat@1234 --port=22637 --update &
Anils-MacBook-Pro.local:22638:
shell> pt-heartbeat --check-read-only --read-only-interval=1 --fail-successive-errors 5 --interval=0.1 --create-table --create-table-engine=InnoDB --database=percona --table=heartbeat --host=127.0.0.1 --user=heartbeat --password=Heartbeat@1234 --port=22638 --update &
Anils-MacBook-Pro.local:22639:
shell> pt-heartbeat --check-read-only --read-only-interval=1 --fail-successive-errors 5 --interval=0.1 --create-table --create-table-engine=InnoDB --database=percona --table=heartbeat --host=127.0.0.1 --user=heartbeat --password=Heartbeat@1234 --port=22639 --update &
- --read-only-interval => When --check-read-only is specified, the interval to sleep while the server is found to be read-only. If unspecified, --interval is used.
- --fail-successive-errors => If specified, pt-heartbeat will fail after the given number of successive DBI errors (failure to connect to the server or issue a query).
- --interval => How often to update or check the heartbeat --table. Updates and checks begin on the first whole second, then repeat every --interval seconds for --update and every --interval plus --skew seconds for --monitor.
Reference – https://docs.percona.com/percona-toolkit/pt-heartbeat.html
The delay is calculated on the replicas, as the difference between the current system time and the replicated timestamp value from the heartbeat table. Basically, on the master node, pt-heartbeat updates the heartbeat table every second with the server ID and the current timestamp. These updates are replicated to the replica nodes through asynchronous replication.
E.g.,
slave1 [localhost:22638] {msandbox} (percona) > select * from percona.heartbeat;
+----------------------------+-----------+------------------+----------+-----------------------+---------------------+
| ts                         | server_id | file             | position | relay_source_log_file | exec_source_log_pos |
+----------------------------+-----------+------------------+----------+-----------------------+---------------------+
| 2025-06-06T10:34:56.410320 |       100 | mysql-bin.000023 |  1654366 | NULL                  | NULL                |
+----------------------------+-----------+------------------+----------+-----------------------+---------------------+
2 rows in set (0.00 sec)
slave2 [localhost:22639] {root} (percona) > SELECT CAST((UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(ts)) AS signed INTEGER) AS 'delay' FROM percona.heartbeat ORDER BY ts DESC LIMIT 1;
+-------+
| delay |
+-------+
|    53 |
+-------+
1 row in set (0.01 sec)
Let’s see the behavior of enabling “FailMasterPromotionOnLagMinutes” with a quick scenario.
We were running some workload in the background to generate replication delay/lag.
slave1 [localhost:22638] {root} ((none)) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000005
          Read_Master_Log_Pos: 355883980
               Relay_Log_File: mysql-relay.000003
                Relay_Log_Pos: 452576487
        Relay_Master_Log_File: mysql-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 453709829
              Relay_Log_Space: 2503386167
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 696
Then we tried doing a master graceful failover, but this failed due to the replication lag.
shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638
Desginated instance Anils-MacBook-Pro.local:22638 seems to be lagging too much for this operation. Aborting.
However, as soon as the replication lag dropped below one minute, the threshold we specified via “FailMasterPromotionOnLagMinutes,” the failover process ran smoothly.
slave1 [localhost:22638] {root} ((none)) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000005
          Read_Master_Log_Pos: 605457002
               Relay_Log_File: mysql-relay.000008
                Relay_Log_Pos: 605455949
        Relay_Master_Log_File: mysql-bin.000005
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 605455733
              Relay_Log_Space: 605457511
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 1
So, the master failed over to “Anils-MacBook-Pro.local:22638”.
shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638
Anils-MacBook-Pro.local:22638
Failover logs from “/tmp/recovery.log”:
20250607 22:19:53: Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
20250607 22:19:54: (for all types) Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Successor: Anils-MacBook-Pro.local:22638
20250607 22:19:54: Planned takeover complete
Conclusion
The purpose of the options discussed above is to provide fine-grained control over the MySQL Orchestrator failover process, especially in scenarios where replicas are experiencing replication lag. Essentially, we have the option to either wait for the lag to resolve before triggering a failover, or to proceed with an immediate failover despite the lag. Additionally, settings like FailMasterPromotionOnLagMinutes and FailMasterPromotionIfSQLThreadNotUpToDate ensure that the failover fails when the lag exceeds the configured conditions, providing maximum consistency.
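As a quick reference, the fragment below sketches how the options discussed in this post could sit together in “orchestrator.conf.json.” The values are illustrative, and in practice you would pick either the “fail” or the “delay” variant of the SQL-thread check depending on whether you prefer aborting the promotion or waiting for the lag to clear:
"FailMasterPromotionIfSQLThreadNotUpToDate": false,
"DelayMasterPromotionIfSQLThreadNotUpToDate": true,
"FailMasterPromotionOnLagMinutes": 1,
"ReplicationLagQuery": "SELECT CAST((UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(ts)) AS unsigned INTEGER) AS 'delay' FROM percona.heartbeat ORDER BY ts DESC LIMIT 1",
"ReasonableMaintenanceReplicationLagSeconds": 20,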
Thanks for the great article.
We are going to use Orchestrator, and Orchestrator supports semi-sync.
To use semi-sync replication, are the ‘LockedSemiSyncMaster, DetectSemiSyncEnforcedQuery, MasterWithTooManySemiSyncReplicas, RecoverLockedSemiSyncMaster, EnforceExactSemiSyncReplicas’ settings in orchestrator.conf.json mandatory?
How can I use semi-sync replication?
You need to have an existing semi-sync topology (https://dev.mysql.com/doc/refman/8.4/en/replication-semisync.html) before using the orchestrator semi-sync feature.
Then, you can apply the semi-sync-related configuration (https://github.com/openark/orchestrator/blob/master/docs/configuration-discovery-classifying.md#semi-sync-topology) accordingly in the orchestrator configuration file.
Anil, great post!
What is the future of orchestrator now that its own development stopped in 2021?
What is the future of primary-replica topology if there is no active open source tool able to do a proper failover?
Well, the good news is that Percona has already forked it and is updating the releases quite regularly. You can check it out here: https://github.com/percona/orchestrator/tags