Multi-Source Replication Performance with GTID

In this blog post, we’ll look at the performance of multi-source replication with GTID.

Note: This blog was edited after I received some comments from Percona colleagues. I've updated the contents in the marked EDIT section. Thanks!

Multi-Source Replication is a topology I've seen discussed recently, so I decided to look into how it performs with the different replication concepts. Multi-source replication uses replication channels, which allow a slave to replicate from multiple masters. This is a great way to consolidate data that has been sharded for production, or to simplify the analytics process by using the same server. Since multiple masters are taking writes, care is needed not to overload the slave. The traditional replication concept uses the binary log file name and the position inside that file.

This was the standard until the release of global transaction identifiers (GTID). I have set up a test environment to validate which concept would perform better, and be a better choice for use in this topology.

SETUP

My test suite is rather simple, consisting of only three virtual machines: two masters and one slave. The slave's replication channels are set up using the same concept for each run, and no run had any replication filters. To prevent any replication errors, each master took writes against a different schema, and the user grants were identical on all three servers. The setup below ran with both replication channels using binary log file and position. The tables were then dropped and the servers changed to use GTID for the next run.
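
For reference, the slave's channel configuration would look roughly like the following. This is a minimal sketch, not my exact commands: the channel names ('master1', 'master2'), host names, credentials, and file/position values are all placeholders.

# File and position based channels for the first run:
mysql -e "CHANGE MASTER TO MASTER_HOST='db1', MASTER_USER='repl', MASTER_PASSWORD='<password>', MASTER_LOG_FILE='db1-bin.000001', MASTER_LOG_POS=4 FOR CHANNEL 'master1';"
mysql -e "CHANGE MASTER TO MASTER_HOST='db2', MASTER_USER='repl', MASTER_PASSWORD='<password>', MASTER_LOG_FILE='db2-bin.000001', MASTER_LOG_POS=4 FOR CHANNEL 'master2';"

# GTID based channels for the second run (requires gtid_mode=ON on all three servers):
mysql -e "CHANGE MASTER TO MASTER_HOST='db1', MASTER_USER='repl', MASTER_PASSWORD='<password>', MASTER_AUTO_POSITION=1 FOR CHANNEL 'master1';"
mysql -e "CHANGE MASTER TO MASTER_HOST='db2', MASTER_USER='repl', MASTER_PASSWORD='<password>', MASTER_AUTO_POSITION=1 FOR CHANNEL 'master2';"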

Prepare the sysbench tables:
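
The original commands are not reproduced here, but the prepare step would look something like this, assuming sysbench 1.0 with the bundled oltp_* Lua scripts, one schema per master (sbtest1 on DB1, sbtest2 on DB2), and placeholder credentials and table sizes:

sysbench oltp_read_write --mysql-host=db1 --mysql-user=sysbench --mysql-password=<password> --mysql-db=sbtest1 --tables=10 --table-size=1000000 prepare
sysbench oltp_read_write --mysql-host=db2 --mysql-user=sysbench --mysql-password=<password> --mysql-db=sbtest2 --tables=10 --table-size=1000000 prepare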

I used a read-only sysbench to warm up the InnoDB buffer pool. Both commands ran on the slave to ensure both schemas were loaded into the buffer pool:
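
A sketch of the warm-up runs, with the same assumed schema names and an arbitrary duration:

sysbench oltp_read_only --mysql-host=slave --mysql-user=sysbench --mysql-password=<password> --mysql-db=sbtest1 --tables=10 --table-size=1000000 --time=600 run
sysbench oltp_read_only --mysql-host=slave --mysql-user=sysbench --mysql-password=<password> --mysql-db=sbtest2 --tables=10 --table-size=1000000 --time=600 run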

After warming up the buffer pool, the slave should be fully caught up with both masters. To remove IO contention as a possible factor, I stopped the SQL thread while I generated load on the masters. Leaving the IO thread running allowed the slave to write the relay logs during this process, and helped ensure that the test only measures the difference in the slave SQL thread.
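
Stopping only the SQL thread, per channel (channel names as assumed above):

mysql -e "STOP SLAVE SQL_THREAD FOR CHANNEL 'master1';"
mysql -e "STOP SLAVE SQL_THREAD FOR CHANNEL 'master2';"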

Each master had a sysbench run against its designated schema in order to generate the writes:
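
Again a sketch rather than the exact commands; the one-hour duration matches the run length mentioned in the edit below, and the thread count is an assumption:

sysbench oltp_write_only --mysql-host=db1 --mysql-user=sysbench --mysql-password=<password> --mysql-db=sbtest1 --tables=10 --table-size=1000000 --threads=4 --time=3600 run
sysbench oltp_write_only --mysql-host=db2 --mysql-user=sysbench --mysql-password=<password> --mysql-db=sbtest2 --tables=10 --table-size=1000000 --threads=4 --time=3600 run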

Once the writes completed, I monitored the IO activity on the slave to ensure it was 100% idle and that all relay logs were fully captured. Once everything was fully written, I enabled a capture of the replication lag once per minute for each replication channel, and started the slave's SQL threads:
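
Something along these lines, assuming pt-heartbeat is already running with --update on each master and that the heartbeat tables live in each master's schema; channel names and credentials are placeholders:

# Record the lag from each master once per minute.
pt-heartbeat --monitor --interval=60 --database=sbtest1 --master-server-id=1 --host=slave --user=percona --password=<password> >> lag_master1.log &
pt-heartbeat --monitor --interval=60 --database=sbtest2 --master-server-id=2 --host=slave --user=percona --password=<password> >> lag_master2.log &

# Restart the SQL thread on both channels.
mysql -e "START SLAVE SQL_THREAD FOR CHANNEL 'master1';"
mysql -e "START SLAVE SQL_THREAD FOR CHANNEL 'master2';"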

The above chart depicts the cumulative lag seen on the slave by pt-heartbeat since the sql_thread was started. The first thing to notice is that the replication delay was higher overall with binary log file and position. This could be because the SQL threads were stopped for different amounts of time, and it may appear to give GTID an advantage in this test; but remember that here the amount of delay matters less than the rate at which it is processed. Focusing on the point where replication began to catch up in earnest, you can see two distinct drops in delay. This is because the slave has two replication threads that each monitor their own delay: one caught up fully while the other remained delayed for a bit longer.

In every test run, GTID took slightly longer to fully catch up than the traditional method. There are a couple of reasons to expect GTIDs to be slightly slower. One possibility is that there are additional writes on the slave to keep track of all the GTIDs the slave has executed. I removed the initial write to the relay log from the measurement, but the committed GTIDs must still be retained, and this causes additional writes. I used the default settings for MySQL, so log_slave_updates was disabled. This causes each replicated GTID to be stored in a table, which is periodically compressed. You can find more details on how log_slave_updates impacts GTID replication here.
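
With log_slave_updates disabled, you can observe this bookkeeping directly on the slave; for example:

# Replicated GTIDs are persisted in the mysql.gtid_executed table.
mysql -e "SELECT * FROM mysql.gtid_executed;"

# How often that table is compressed is controlled by this variable (measured in transactions).
mysql -e "SHOW GLOBAL VARIABLES LIKE 'gtid_executed_compression_period';"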

EDIT TO ORIGINAL POST

I received some feedback on the above test pointing out that I had missed a variable which could have impacted the results: with a one-hour sysbench run, a different number of transactions could be written to each server, so the runs are not necessarily comparable. To resolve this, I set up a different test where I could easily control the number and size of the transactions.

I created an identical table inside each master's schema.
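
The exact definition isn't reproduced here; something like the following, with hypothetical table and column names, ran against both masters:

mysql -h db1 -e "CREATE TABLE sbtest1.t1 (id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, c VARCHAR(100) NOT NULL) ENGINE=InnoDB;"
mysql -h db2 -e "CREATE TABLE sbtest2.t1 (id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, c VARCHAR(100) NOT NULL) ENGINE=InnoDB;"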

Just as in the previous test, I then stopped the sql_thread on both channels of the slave, using the same STOP SLAVE SQL_THREAD FOR CHANNEL commands shown above.

I used a very simple bash loop to generate 500,000 insert statements against each of the masters:
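
The loop looked roughly like this, using the hypothetical table from above; an equivalent loop ran against DB2 and its schema:

# 500,000 single-row, autocommitted INSERTs against DB1.
for i in $(seq 1 500000); do
  mysql -h db1 sbtest1 -e "INSERT INTO t1 (c) VALUES ('row ${i}');"
done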

I also removed pt-heartbeat from this test, since I did not want its additional writes to influence the results; instead, I used the output of SHOW SLAVE STATUS. I ran this test 10 times with each replication concept, truncating the table between runs. I also changed the way I looked at replication speed: since this test focuses on replicating 1 million insert statements, I looked at the results as transactions per second.
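
As a rough illustration of how the transactions-per-second figure can be derived (channel names as assumed earlier), one run boils down to timing how long the SQL threads take to apply the 1 million relay-logged inserts:

start=$(date +%s)
mysql -e "START SLAVE SQL_THREAD FOR CHANNEL 'master1'; START SLAVE SQL_THREAD FOR CHANNEL 'master2';"

# Poll SHOW SLAVE STATUS until both channels report zero lag.
lag() { mysql -e "SHOW SLAVE STATUS FOR CHANNEL '$1'\G" | awk '/Seconds_Behind_Master/ {print $2}'; }
until [ "$(lag master1)" = "0" ] && [ "$(lag master2)" = "0" ]; do
  sleep 5
done

elapsed=$(( $(date +%s) - start ))
echo "Applied 1,000,000 inserts in ${elapsed}s (~$(( 1000000 / elapsed )) TPS)"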

Overall, the transactions per second remained within 10% for each concept. Looking at the average, my initial response was that it validated my original statement that GTID is "slightly" slower. However, looking further, the difference in the average transactions per second is only 32.32 transactions, which is 0.97% of the transactions per second that GTID averaged. Since the difference is less than 1%, I have to adjust my conclusion: there is no significant difference in the performance of the two replication concepts.

END EDITED SECTION

 

So the question remains: why should we use GTID, especially with multi-source replication? I've found that the answer lies in the composition of a GTID. From MySQL's GTID Concepts documentation, a GTID is composed of two parts: the source_id and the transaction_id. The source_id is a unique identifier for the server that originally wrote the transaction. This allows you to identify in the binary log which master took the initial write, so you can pinpoint problems much more easily.

The excerpt from DB1's (a master in this test) binary log shows that, before the transaction is written, "SET @@SESSION.GTID_NEXT" is executed. This is the GTID that you can follow through the rest of the topology to identify the same transaction.
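
The original excerpt is not reproduced here, but based on the server_uuid and transaction ID quoted below, the relevant mysqlbinlog line would look like this:

SET @@SESSION.GTID_NEXT= 'd1ab72e9-0220-11e8-aee7-00155dab6104:270035'/*!*/;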

“d1ab72e9-0220-11e8-aee7-00155dab6104” is the server_uuid for DB1, and 270035 is the transaction id.

Conclusion

The tests I ran show that there is no significant difference in throughput between the two replication concepts. GTID is recommended for use with multi-source replication because it makes managing and troubleshooting this complex topology much easier.

The GTID concept allows a slave to know exactly which server initially wrote a transaction, even in a tiered environment. This means that if you need to promote a slave from the bottom tier to the middle tier, simply changing the master is all that is needed: the slave can pick up from the last transaction it ran from that server and continue replicating without a problem. Stephane Combaudon explains this in detail in a pair of blogs. You can find part 1 here and part 2 here. Facebook also has a great post about their experience deploying GTID-based replication and the troubles they faced.

Comments (2)

  • René Cannaò

    Bradley,
    Any reason why you didn’t use parallel replication in this test?

    March 30, 2018 at 9:48 pm
    • Bradley Mickel

      Hi René,

      I didn’t use parallel replication because I was trying to keep everything as simple as possible in order to reduce the variance additional features might add.

      April 2, 2018 at 10:02 am
