The Multi-Source GTID Replication MazeAnanias Tsalouchidis
In this blog post, we’ll look at how to navigate some of the complexities of multi-source GTID replication.
GTID replication is often a real challenge for DBAs, especially if this has to do with multi-source GTID replication. A while back, I came across a really interesting customer environment with shards where multi-master, multi-source, multi-threaded MySQL 5.6 MIXED replication was active. This is a highly complex environment that has both pros and cons, introducing risks as a trade-off for specific customer requirements.
This is the set up of part of this environment:
I started looking into this setup when a statement broke replication between
db10. Replication broke due to a statement executed on a schema that was not present on
db10. This also resulted in changes originating from
db1 to not being pushed down to
db10, as we stopped the replication thread (for db1 channel).
On the other hand, replication was not stopped on
db2 because the schema in question was present on
db2. Replication between
db20 was broken as well because the schema was not present in
In order to fix
db10 replication, four GTID sets were injected in
Here are some interesting blog posts regarding how to handle/fix GTID replication issues:
After injecting the GTID sets, we started replication again and everything ran fine.
After that, we had to check the
db20 replication, which, as I’ve already said, was broken as well. In this case, injecting only the first GTID trx into
db20 instead of all of those causing issues on
db10 was enough!
You may wonder how this is possible. Right? The answer is that the rest of them were replicated from
db20, although the channel was not the same.
Another strange thing is the fact that although the replication thread for the
db20 channel was stopped (broken), checking the slave status on
db20 showed that
Executed_Gtid_Set was moving for all channels even though
Retrieved_Gtid_Set for the broken one was stopped! So what was happening there?
This raised my curiosity, so I decided to do some further investigation and created scenarios regarding other strange things that could happen. An interesting one was about the replication filters. In our case, I thought “What would happen in the following scenario … ?”
Let’s say we write a row from
db123.table789. This row is replicated to
db10 (let’s say using channel 1) and to
db2 (let’s say using channel2). On channel 1, we filter out the
db123.% tables, on channel2 we don’t.
db1 writes the row and the entry to the binary log.
db2 writes the row after reading the entry from the binary log and subsequently writes the entry to its own binary log and replicates this change to
db20. This change is also replicated to
db10. So now, on
db10 (depending on which channel finds the GTID first) it either gets filtered on channel1 and written to its own bin log at just
commit with any actual DDL/DML removed, or if it is read first on channel2 (
db2 and then
db10) then it is NOT filtered out and executed instead. Is this correct? It definitely ISN’T!
Points of interest
You can find answers to the above questions in the points of interest listed below. Although it’s not really clear through the official documentation, this is what happens with GTID replication and multi-source GTID replication:
- As we know GTID sets are unique across all nodes in a given cluster. In multi-source replication,
Executed_Gtid_Setis common for all channels. This means that regardless the originating channel, when a GTID transaction is executed it is recorded in all channels’
Executed_Gtid_Set. Although it’s logical (each database is unique, so if a trx is going to affect a database it shouldn’t be tightened to a single channel regardless of the channel it uses), the documentation doesn’t provide much info around this.
- When we have multi-source, multi-level replication, there are cases where the GTID sets originating from one master can end up on one slave via different replication paths. It’s not clear if it applies any special algorithm (although it doesn’t seem that there could be one), but the preferred method seems to be FIFO. The fastest wins! This means that GTID sets can travel to the slave via different channels, and it’s related to how fast the upper-level slaves can commit changes. In fact, the path doesn’t really matter as it only executes each GTID trx once.
- Replication filters are global regardless the channel. This means they apply each filter to all channels. This is normal as we can’t define a replication filter per channel. In order to be able to debug such cases, adding a small replication delay per channel seems a good idea.