This post discusses ways of fixing broken GTID replication.
This blog series is all about the daily stories we have in Managed Services, dealing with customers’ environments (mostly when we need to quickly restore a service level within the SLA time).
One of the issues we encounter daily is broken replication when using the GTID protocol. While there are a lot of blogs written about this subject, I would like to highlight GTID replication operations and the way you can deal with broken replication.
Most of the time we face way more complex scenarios than the one I’m about to present as an example, but the main goal of this blog is to quickly highlight the tools that can be used to fix issues and resume replication.
After reading this blog, you might ask yourself “Now, we know how to fix replication, but what about consistency?” The next blog will be entirely focused on that matter, data consistency!
Little less talk, little more action…
Replication is broken, and the SHOW SLAVE STATUS command output looks like this:
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.0.12
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.000005
          Read_Master_Log_Pos: 879
               Relay_Log_File: mysqld-relay-bin.000009
                Relay_Log_Pos: 736
        Relay_Master_Log_File: mysqld-bin.000005
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1062
                   Last_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into wb.t1 set i=1'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 634
              Relay_Log_Space: 1155
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1062
               Last_SQL_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into wb.t1 set i=1'
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 3
                  Master_UUID: 46fdb7ad-5852-11e6-92c9-0800274fb806
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 161108 16:47:53
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-4,
4fbe2d57-5843-11e6-9268-0800274fb806:1-3
            Executed_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-3,
4fbe2d57-5843-11e6-9268-0800274fb806:1-3,
81a567a8-5852-11e6-92cb-0800274fb806:1
                Auto_Position: 1
1 row in set (0.00 sec)
When a slave configured to replicate using the GTID protocol breaks, pay attention to the SHOW SLAVE STATUS command output. Among the listed columns you will find Retrieved_Gtid_Set and Executed_Gtid_Set. There you can see that the last global transaction ID retrieved from the current master was not executed: it appears in Retrieved_Gtid_Set but not in Executed_Gtid_Set (both written in the GTID format).
That means the slave has retrieved a transaction that, for some reason, it couldn’t execute. That’s the global transaction ID you need if you want to inject a fake transaction and get replication resumed. The fake transaction you inject takes the place of the one whose statement cannot be executed, due to the error reported in Last_Error of the SHOW SLAVE STATUS output.
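If you prefer not to eyeball the two sets, MySQL’s GTID_SUBTRACT() function can compute the difference for you. A quick check, feeding it the sets from the output above, would look something like this and should return the missing transaction:

mysql> SELECT GTID_SUBTRACT(
    ->   '46fdb7ad-5852-11e6-92c9-0800274fb806:1-4,4fbe2d57-5843-11e6-9268-0800274fb806:1-3',
    ->   '46fdb7ad-5852-11e6-92c9-0800274fb806:1-3,4fbe2d57-5843-11e6-9268-0800274fb806:1-3,81a567a8-5852-11e6-92cb-0800274fb806:1'
    -> ) AS missing_gtids\G
*************************** 1. row ***************************
missing_gtids: 46fdb7ad-5852-11e6-92c9-0800274fb806:4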
#: replication is broken due to error 1062, a primary key violation on a particular table
Last_Errno: 1062
Last_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into wb.t1 set i=1'

#: here you can identify the problematic global transaction ID, the one breaking the replication stream
Retrieved_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-4,
4fbe2d57-5843-11e6-9268-0800274fb806:1-3
Executed_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-3,
4fbe2d57-5843-11e6-9268-0800274fb806:1-3,
81a567a8-5852-11e6-92cb-0800274fb806:1
As shown above, we can clearly see that the transaction causing issues is global transaction ID number 4, coming from the master with UUID 46fdb7ad-5852-11e6-92c9-0800274fb806. You can use SHOW RELAYLOG EVENTS to verify that it is indeed that transaction’s query that is causing the problem:
mysql> show relaylog events in 'mysqld-relay-bin.000009' from 736\G
*************************** 1. row ***************************
   Log_name: mysqld-relay-bin.000009
        Pos: 736
 Event_type: Gtid
  Server_id: 3
End_log_pos: 682
       Info: SET @@SESSION.GTID_NEXT= '46fdb7ad-5852-11e6-92c9-0800274fb806:4'
*************************** 2. row ***************************
   Log_name: mysqld-relay-bin.000009
        Pos: 784
 Event_type: Query
  Server_id: 3
End_log_pos: 755
       Info: BEGIN
*************************** 3. row ***************************
   Log_name: mysqld-relay-bin.000009
        Pos: 857
 Event_type: Query
  Server_id: 3
End_log_pos: 848
       Info: insert into wb.t1 set i=1
*************************** 4. row ***************************
   Log_name: mysqld-relay-bin.000009
        Pos: 950
 Event_type: Xid
  Server_id: 3
End_log_pos: 879
       Info: COMMIT /* xid=66 */
4 rows in set (0.00 sec)
Before fixing and resuming the replication stream, we need to check why that INSERT query breaks replication. Let’s SELECT the data and check the structure of table wb.t1:
mysql> select * from wb.t1;
+---+
| i |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql> show create table wb.t1;
+-------+-----------------------------------------------------+
| Table | Create Table                                        |
+-------+-----------------------------------------------------+
| t1    | CREATE TABLE `t1` (
  `i` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`i`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+-----------------------------------------------------+
1 row in set (0.01 sec)
The row with i=1 already exists on the slave, so the INSERT replicated from the master fails with a duplicate-key error; clearly something is wrong somewhere other than in the database itself. It’s time to fix and resume replication, and catch up with the master’s data. To fix GTID-based replication, you can use the tools below:
- pt-slave-restart
- mysqlslavetrx
- inject a fake/empty transaction (see the sketch right after this list)
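For reference, here is a minimal sketch of the third option, injecting an empty (fake) transaction by hand for the GTID identified above. Run it on the broken slave, and double-check the GTID value before doing this in production:

-- on the broken slave: inject an empty transaction in place of the failing one
STOP SLAVE;
SET GTID_NEXT='46fdb7ad-5852-11e6-92c9-0800274fb806:4';
BEGIN;
COMMIT;
SET GTID_NEXT='AUTOMATIC';
START SLAVE;

The empty BEGIN/COMMIT is recorded under the failing GTID, so the slave considers that transaction executed and moves on to the next one when replication restarts.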
pt-slave-restart
One of the easiest ways of resuming replication on slaves when replication is broken is using pt-slave-restart, which is part of Percona Toolkit. Once you have established the above facts (mainly the master UUID of the problematic global transaction ID that broke slave replication), you can move forward using pt-slave-restart with the GTID flag --master-uuid. This passes the master’s UUID so the tool skips all the global transactions from that master that are breaking replication on a specific slave server, as you can see below:
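For illustration only, an invocation could look something like this (the DSN values are placeholders for the broken slave, not taken from the environment above):

# hypothetical invocation on the broken slave; adjust the DSN to your environment
pt-slave-restart --master-uuid=46fdb7ad-5852-11e6-92c9-0800274fb806 h=localhost,u=root --ask-pass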