This post discusses ways of fixing broken GTID replication.
This blog series is all about the daily stories we have in Managed Services, dealing with customers’ environments (mostly when we need to quickly restore a service level within the SLA time).
One of the issues we encounter daily is replication using the GTID protocol. While there are a lot of blogs written about this subject, I would like to just highlight GTID replication operations, and the way you can deal with broken replication.
Most of the time we face way more complex scenarios then the one I’m about to present as an example, but the main goal of this blog is to quickly highlight the tools that can be used to fix issues to resume replication.
After reading this blog, you might ask yourself “Now, we know how to fix replication, but what about consistency?” The next blog will be entirely focused on that matter, data consistency!
Little less talk, little more action…
Replication is broken, and the SHOW SLAVE STATUS command output looks like below:
|
1 |
mysql> show slave statusG<br>*************************** 1. row ***************************<br> Slave_IO_State: Waiting for master to send event<br> Master_Host: 192.168.0.12<br> Master_User: repl<br> Master_Port: 3306<br> Connect_Retry: 60<br> Master_Log_File: mysqld-bin.000005<br> Read_Master_Log_Pos: 879<br> Relay_Log_File: mysqld-relay-bin.000009<br> Relay_Log_Pos: 736<br> Relay_Master_Log_File: mysqld-bin.000005<br> Slave_IO_Running: Yes<br> Slave_SQL_Running: No<br> Replicate_Do_DB:<br> Replicate_Ignore_DB:<br> Replicate_Do_Table:<br> Replicate_Ignore_Table:<br> Replicate_Wild_Do_Table:<br> Replicate_Wild_Ignore_Table:<br> Last_Errno: 1062<br> Last_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into wb.t1 set i=1'<br> Skip_Counter: 0<br> Exec_Master_Log_Pos: 634<br> Relay_Log_Space: 1155<br> Until_Condition: None<br> Until_Log_File:<br> Until_Log_Pos: 0<br> Master_SSL_Allowed: No<br> Master_SSL_CA_File:<br> Master_SSL_CA_Path:<br> Master_SSL_Cert:<br> Master_SSL_Cipher:<br> Master_SSL_Key:<br> Seconds_Behind_Master: NULL<br>Master_SSL_Verify_Server_Cert: No<br> Last_IO_Errno: 0<br> Last_IO_Error:<br> Last_SQL_Errno: 1062<br> Last_SQL_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into wb.t1 set i=1'<br> Replicate_Ignore_Server_Ids:<br> Master_Server_Id: 3<br> Master_UUID: 46fdb7ad-5852-11e6-92c9-0800274fb806<br> Master_Info_File: mysql.slave_master_info<br> SQL_Delay: 0<br> SQL_Remaining_Delay: NULL<br> Slave_SQL_Running_State:<br> Master_Retry_Count: 86400<br> Master_Bind:<br> Last_IO_Error_Timestamp:<br> Last_SQL_Error_Timestamp: 161108 16:47:53<br> Master_SSL_Crl:<br> Master_SSL_Crlpath:<br> Retrieved_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-4,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3<br> Executed_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-3,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3,<br>81a567a8-5852-11e6-92cb-0800274fb806:1<br> Auto_Position: 1<br>1 row in set (0.00 sec)<br> |
That means that the slave has retrieved a transaction that, for some other reason, it couldn’t execute. That’s the global transaction ID you need if you want to inject a fake transaction and get replication resumed. The fake transaction you inject takes the place of the one that has an SQL that cannot be executed due to an error found in Last_Error from the SHOW SLAVE STATUS.
|
1 |
#: replication is broken due to error 1062, when the primary key of a particular table is violated<br>Last_Errno: 1062<br>Last_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into wb.t1 set i=1'<br> <br>#: you can identify what is the global transaction id with problems, so, getting the replication streaming broken<br> Retrieved_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-4,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3<br> Executed_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-3,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3,<br>81a567a8-5852-11e6-92cb-0800274fb806:1 |
|
1 |
mysql> show relaylog events in 'mysqld-relay-bin.000009' from 736G<br>*************************** 1. row ***************************<br> Log_name: mysqld-relay-bin.000009<br> Pos: 736<br> Event_type: Gtid<br> Server_id: 3<br>End_log_pos: 682<br> Info: SET @@SESSION.GTID_NEXT= '46fdb7ad-5852-11e6-92c9-0800274fb806:4'<br>*************************** 2. row ***************************<br> Log_name: mysqld-relay-bin.000009<br> Pos: 784<br> Event_type: Query<br> Server_id: 3<br>End_log_pos: 755<br> Info: BEGIN<br>*************************** 3. row ***************************<br> Log_name: mysqld-relay-bin.000009<br> Pos: 857<br> Event_type: Query<br> Server_id: 3<br>End_log_pos: 848<br> Info: insert into wb.t1 set i=1<br>*************************** 4. row ***************************<br> Log_name: mysqld-relay-bin.000009<br> Pos: 950<br> Event_type: Xid<br> Server_id: 3<br>End_log_pos: 879<br> Info: COMMIT /* xid=66 */<br>4 rows in set (0.00 sec) |
|
1 |
mysql> select * from wb.t1;<br>+---+<br>| i |<br>+---+<br>| 1 |<br>+---+<br>1 row in set (0.00 sec)<br> <br>mysql> show create table wb.t1;<br>+-------+-----------------------------------------------------+<br>| Table | Create Table |<br>+-------+-----------------------------------------------------+<br>| t1 | CREATE TABLE `t1` (<br> `i` int(11) NOT NULL DEFAULT '0',<br> PRIMARY KEY (`i`)<br>) ENGINE=InnoDB DEFAULT CHARSET=latin1 |<br>+-------+-----------------------------------------------------+<br>1 row in set (0.01 sec)<br> |
pt-slave-restart
One of the easiest ways of resuming replication on slaves when replication is broken is using pt-slave-restart, which is part of Percona Toolkit. Once you find the above facts (mainly the master UUID of the problematic global transaction ID that broke slave replication), you can move forward using pt-slave-restart with the GTID flag —-master-uuid. This passes the master’s UUID and it skips all global transactions breaking replication on a specific slave server, as you can see below:
|
1 |
[root@dbops02 ~]# pt-slave-restart --master-uuid 46fdb7ad-5852-11e6-92c9-0800274fb806 --host=localhost -u root<br>2016-11-08T17:24:09 h=localhost,u=root mysqld-relay-bin.000009 736 1062<br>2016-11-08T17:24:25 h=localhost,u=root mysqld-relay-bin.000010 491 1062<br>2016-11-08T17:24:34 h=localhost,u=root mysqld-relay-bin.000010 736 1062<br>2016-11-08T17:24:35 h=localhost,u=root mysqld-relay-bin.000010 981 1062<br>2016-11-08T17:24:36 h=localhost,u=root mysqld-relay-bin.000010 1226 1062 |
mysqlslavetrx
To use mysqlslavetrx (which is part of MySQL Utilities developer by Oracle), I recommend you read the article written by Daniel Guzman, and install MySQL Utilities on your database servers. Using it to skip problematic transactions and inject fake ones is pretty straightforward as well .
So, find the below on the slave side:
|
1 |
Retrieved_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-13,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3<br> Executed_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-8,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3,<br>81a567a8-5852-11e6-92cb-0800274fb806:1 |
|
1 |
#: this below function is pretty cool and will exactly shows you if the slave is lacking some<br>#: GTIDs master have and vice-versa - this is going to help out using mysqlslavetrx to put<br>#: master and slave consistently with the same binary logs contents<br>mysql> SELECT GTID_SUBTRACT('46fdb7ad-5852-11e6-92c9-0800274fb806:1-13','46fdb7ad-5852-11e6-92c9-0800274fb806:1-8') gap;<br>*************************** 1. row ***************************<br>gap: 46fdb7ad-5852-11e6-92c9-0800274fb806:9-13<br>1 row in set (0.00 sec) |
|
1 |
[root@dbops02 mysql-utilities-1.6.2]# mysqlslavetrx --gtid-set=46fdb7ad-5852-11e6-92c9-0800274fb806:9-13 --verbose --slaves=wb@localhost:3306<br>WARNING: Using a password on the command line interface can be insecure.<br>#<br># GTID set to be skipped for each server:<br># - localhost@3306: 46fdb7ad-5852-11e6-92c9-0800274fb806:9-13<br>#<br># Injecting empty transactions for 'localhost:3306'...<br># - 46fdb7ad-5852-11e6-92c9-0800274fb806:9<br># - 46fdb7ad-5852-11e6-92c9-0800274fb806:10<br># - 46fdb7ad-5852-11e6-92c9-0800274fb806:11<br># - 46fdb7ad-5852-11e6-92c9-0800274fb806:12<br># - 46fdb7ad-5852-11e6-92c9-0800274fb806:13<br>#<br>#...done.<br># |
When you get back to the MySQL client on the slave, you’ll see that the retrieved and executed out of SHOW SLAVE STATUS will point that they are in the same position:
|
1 |
Retrieved_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-13,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3<br> Executed_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-13,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3,<br>81a567a8-5852-11e6-92cb-0800274fb806:1 |
Inject a Fake Transaction
Fake transactions are called empty transactions as well, but the fact is if a global transaction is affecting a slave, you must fake empty transactions that won’t affect data to resume replication and carry on processing the data streaming from the master aka replication! We need to know that it’s not going to affect future slaves, especially if this server becomes a new master after a failover/switchover process. You can get more information about Errant Transactions here and here.
|
1 |
mysql> stop slave;<br>Query OK, 0 rows affected (0.01 sec)<br> <br>mysql> set gtid_next='46fdb7ad-5852-11e6-92c9-0800274fb806:14';<br>Query OK, 0 rows affected (0.00 sec)<br> <br>mysql> begin; commit;<br>Query OK, 0 rows affected (0.00 sec)<br> <br>Query OK, 0 rows affected (0.00 sec)<br> <br>mysql> set gtid_next=automatic;<br>Query OK, 0 rows affected (0.00 sec) |
|
1 |
Retrieved_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-14,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3<br> Executed_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-14,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3,<br>81a567a8-5852-11e6-92cb-0800274fb806:1 |
|
1 |
mysql> show slave statusG<br>*************************** 1. row ***************************<br> Slave_IO_State: Waiting for master to send event<br> Master_Host: 192.168.0.12<br> Master_User: repl<br> Master_Port: 3306<br> Connect_Retry: 60<br> Master_Log_File: mysqld-bin.000005<br> Read_Master_Log_Pos: 3329<br> Relay_Log_File: mysqld-relay-bin.000011<br> Relay_Log_Pos: 491<br> Relay_Master_Log_File: mysqld-bin.000005<br> Slave_IO_Running: Yes<br> Slave_SQL_Running: Yes<br> Replicate_Do_DB:<br> Replicate_Ignore_DB:<br> Replicate_Do_Table:<br> Replicate_Ignore_Table:<br> Replicate_Wild_Do_Table:<br> Replicate_Wild_Ignore_Table:<br> Last_Errno: 0<br> Last_Error:<br> Skip_Counter: 0<br> Exec_Master_Log_Pos: 3329<br> Relay_Log_Space: 3486<br> Until_Condition: None<br> Until_Log_File:<br> Until_Log_Pos: 0<br> Master_SSL_Allowed: No<br> Master_SSL_CA_File:<br> Master_SSL_CA_Path:<br> Master_SSL_Cert:<br> Master_SSL_Cipher:<br> Master_SSL_Key:<br> Seconds_Behind_Master: 0<br>Master_SSL_Verify_Server_Cert: No<br> Last_IO_Errno: 0<br> Last_IO_Error:<br> Last_SQL_Errno: 0<br> Last_SQL_Error:<br> Replicate_Ignore_Server_Ids:<br> Master_Server_Id: 3<br> Master_UUID: 46fdb7ad-5852-11e6-92c9-0800274fb806<br> Master_Info_File: mysql.slave_master_info<br> SQL_Delay: 0<br> SQL_Remaining_Delay: NULL<br> Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it<br> Master_Retry_Count: 86400<br> Master_Bind:<br> Last_IO_Error_Timestamp:<br> Last_SQL_Error_Timestamp:<br> Master_SSL_Crl:<br> Master_SSL_Crlpath:<br> Retrieved_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-14,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3<br> Executed_Gtid_Set: 46fdb7ad-5852-11e6-92c9-0800274fb806:1-14,<br>4fbe2d57-5843-11e6-9268-0800274fb806:1-3,<br>81a567a8-5852-11e6-92cb-0800274fb806:1<br> Auto_Position: 1<br>1 row in set (0.00 sec) |