Last week I helped 2 different customers with infinite replication loops. I decided to write a blog post about these infinite loop of binary log statements in MySQL Replication. To explain what they are, how to identify them… and how to fix them.
An infinite replication loop is one or more transactions that are replicating over and over in a Multi Master Replication relationship. This happens because those transactions belong to a server-id which is different from the two actual masters.
Let’ s have a look first at the normal flow:
Replication flow: [1]
When you setup MySQL replication, you define the global variable server_id, this value must be unique for each host. In the illustration above, arrows represent the replication flow, so 1 is replicating to 2 and to 3, and 2 is replicating to 1.
When your active master (having more masters receiving simultaneous writes causes problems and is not beneficial in 99% of all cases) commits a transaction, the DML statements from that transaction are written to the binary log. They are called binary log events. A binary log event contains the server-id where it was executed. Then that event is replicated to the slaves (the I/O slave thread filters out only the events with a different server_id value. It is also possible to change that behaviour by enabling replicate_same_server_id but this not today’s focus[2]).
As we are in multi-master replication, the other master should also replicate events :
Closer look:
Infinite Replication Loop flow:
If you have master-master with slaves or if you want to backup one master, you need log_slave_updates[3] to be enabled. This is for example the case if you re using MMM with slaves.
But if for one reason that I explain later, one or more binary events with a different server-id than those of the two current masters are still in the relay log, they are just applied. Then written with their server-id on the binary log, and the slaves (included the other master) will replicate them… and write them in their binary log… and then the loop is created.
Look at the following process, notice that the server-id has changed on our server 2:
And that’s the loop ! The event for server-id 2 gets replicated over and over…
How can you create loops ?
How to identify an infinite replication loop ?
When you have an infinite replication loop, the value of Seconds_behind_master will have many randomly looking values. The value of Seconds_behind_master is calculated form the timestamp included in the last executed sql statement. As some of them are looping, the timestamp of those statements does not increase, causing a flip of values from 0 to the time those looping statement were first inserted.
Another way to see it is by monitoring your MySQL server with cacti, you will immediately notice the problem. Look at those graphs:
You can see the CPU usage and the replication lag… once again cacti is very helpful !
How to fix then ?
There are several possibilities to fix this problem. They are also linked to the situation.
Example on MMM:
1. check the mmm mode and put it to passive to disable any automatic failover during the process because high availability tools like mmm can cause failover if you stop replication
# mmm_control @production set_passive (to avoid automatic failover)
2. stop replication on the master having writer role (mysql1)
3. wait until some statements are replicated
4. change the replication master position on the active master to a newer position from the passive master (get position from show master statusG on mysql 4)
Illustration of the recommended solution :
Server 1 is the active master (receive the writes)
Server 4 is the passive master
Server 3 is a normal slave
Step1 Stop the replication on the active master (server 1):
mysql> stop slave;
Step 2: there are writes on the active master, the loop statements are also replicated from the relay-log but they aren’t not replicated again.
Step 3: the binary event with the wrong server-id are performed on the two slaves (mysql 3 and mysql 4) and the new events (with server-id 1) are also replicated.
Step 4: the event with server-id 2 on mysql 4 (the passive master) is written in the binary log and the new events are also performed on the two slaves
Step 5: the new statements (with server-id 1) are also saved in the binary-log of mysql 4 and the master position is increased…
Step 6: the slave position on the active master (mysql 1) is changed to the position returned by SHOW MASTER STATUS on mysql 4 (this position will increase every time writes are performed on the active master, so just take a position when you are sure that the looping statements have been executed) and slave is restarted on mysql 1
Note: you can also check all the statements that were in the loop using mysqlbinlog.
For example what was the loop of events from server-id 2?
# mysqlbinlog mysql-relay-bin.000x | grep ‘id 2’ | sort | uniq
Special thanks to Kenny[5] who helped me on the cases and to have reviewed this post.
[1] http://dev.mysql.com/doc/refman/5.1/en/replication-implementation-details.html
[4] http://dev.mysql.com/doc/refman/5.1/en/change-master-to.html