MySQL replication primer with pt-table-checksum / pt-table-sync, part 2

MySQL replicationThis is the second and last tutorial blog post on how to use pt-table-checksum / pt-table-sync tools for MySQL replication.

In the first post, I showed you how to use the pt-table-checksum  / pt-table-sync  tools in a typical MySQL replication setup. In this post, we’ll discuss MySQL replication for more advanced topologies. I will show you how to use these tools in a chained master/slave replication setup in general, and in Percona XtraDB Cluster specifically. (Here is another good post about how to use the pt-table-checksum  tool in Percona XtraDB Cluster.)

Let me first show you an example where I tried to run pt-table-checksum / pt-table-sync on a chained master/slave setup. In this example, the setup is Master -> Slave1 -> Slave2, where the master has a binary log format set to STATEMENT, while the chained Slave1 has a binary log format set to ROW.

As soon as I tried to execute pt-table-checksum on this replication topology, it failed because the binary logs use different formats. The pt-table-checksum documentation mentions this issue in the Limitations section.

The problem here is that Slave1 receives changes from its Master, which will be STATEMENT ( pt-table-checksum  will force this into STATEMENT binlog_format in its session). Slave1 executes the checksum queries, and all seems good until that point. But then Slave1 will also write the changes to its binary log to replicate to Slave2. pt-table-checksum  only works properly when STATEMENT-based formatting is used to generate the checksum. If it’s ROW-based, the row changes to the checksums table are just replicated from the chain master,  and no checksumming happens. In this case, this is what happens with Slave2, which is why  pt-table-checksum  doesn’t perform its checksum.

To remedy this problem we need to ignore this checking by using –no-check-binlog-format, or we need to change the binary format to STATEMENT on chain master (Slave1).

In our second example, I’m going to run pt-table-checksum  on a three-node Percona XtraDB Cluster. Data inconsistencies can occur in Galera Cluster because of human errors or bugs. For this purpose, I inserted a few rows in Node1, and the Node1/Node2 test.dummy table data look like the following:

Whereas Node3 is missing last two records, as shown below (I intentionally deleted the last couple rows with the setting wsrep_on off/on in between):

I have filled the DSN table accordingly with checksum user credentials:

pt-table-checksum  on Percona XtraDB Cluster works with a single cluster, where all nodes are cluster nodes and not regular replicas. The example above is a good one – Node1, Node2 and Node3 all belong to a single cluster. Cluster-to-cluster setups aren’t supported. (Refer to documentation for more details.)

Let’s try to run pt-table-checksum  from authoritative node Node1:

pt-table-checksum  prints messages during the checksum process saying it can’t check the replica lag. It’s obvious why: SHOW SLAVE STATUS is not supported on cluster nodes as it would be in a traditional master -> slave(s) replication setup. Further, pt-table-checksum  found the difference on the test.dummy table. Now, pt-table-checksum  also supports “recursion-method=cluster,” which will auto-discover cluster nodes by querying the variable wsrep_incoming_addresses status. This new recursion-method “cluster” works equally well in Percona XtraDB Cluster. You may use “recursion-method=cluster” in place of “recursion-method=dsn” Let’s find the differences again using recursion-method “cluster”:

The differences are found on Node3. Now the next big challenge is to synchronize Node3 with the other nodes in the cluster. pt-table-sync  is the tool to synchronize MySQL table data differences, but unfortunately, the pt-table-sync  tool is not supported on Galera Cluster. I tried few hacks with pt-table-sync in Percona XtraDB Cluster, but no luck (as shown below):

I tried to run pt-table-sync  from Node1 to sync the differences on Node3. I first used –dry-run to make sure everything was good. Later, I passed the –print option to review queries before actually syncing. In XtraDB Cluster, pt-table-sync  prints INSERT/UPDATE/DELETE queries instead of REPLACE queries. Executing pt-table-sync  with the –execute option will break the cluster, as those changes will be replicated to all the cluster nodes. Since Node1 and Node2 already have those records in their table, the cluster will break with a “Duplicate Key” error. To workaround this problem, use the pt-table-sync  –replace option (as shown below):

When I run pt-table-sync  with “–execute” and “–replace”, it breaks with deadlock errors (see here: https://bugs.launchpad.net/percona-toolkit/+bug/1250450). Let’s try to run those REPLACE statements from one of the authoritative nodes (i.e., Node1):

As you can see, pt-table-sync  tool prints REPLACE INTO statements, which seems correct as those will be no-op changes for nodes that already have that data, and replication will sync unsynchronized nodes. But here is the problem: Galera Cluster works through a ROW binary log format, and REPLACE statements won’t get to an unsynchronized node since there won’t be a binary log event recorded for REPLACE statements into the binary log if executed from a node that is already synchronized (Node1 and Node2 in this case) since a ROW image change didn’t occur.

So how do we synchronized those changes in Galera Cluster? I’ve found a couple of workarounds to synchronize nodes in Galera Cluster!

Approach # 1:
Generate a .sql script via pt-table-sync  tool with –print option, and review the changes that need to occur on the unsynchronized node. Once you are satisfied, you can execute this script directly on the problematic node with binary logging disabled for that particular execution (you can push SET SQL_LOG_BIN=0 at first and enable SQL_LOG_BIN=1 again at last line of script). This ensures those changes don’t synchronize to other nodes of the cluster.

NOTE: This could be risky if you forgot to disable binary logging (i.e., SET SQL_LOG_BIN=0 for the synchronization process) because as previously explained changes will replicate to all nodes, and all the remaining nodes in the cluster will go down (as those rows already exist on the other nodes). So, be careful with this method!

Also, it’s important to mention that when manually syncing with an SQL file, it may not be an atomic operation. Between the time it takes to generate the SQL file and execute it, there is a chance that the rows in question could change. To deal with this, you should engage LOCK TABLE tablename WRITE followed by UNLOCK TABLES to release the lock after the syncing process. In fact, the most secure method is to stop writing to the table that needs to be synchronized on all the nodes before running pt-table-sync. You can resume writes on the table after the sync operation for the target table completes. One other important thing is if there are a vast number of table rows to sync via this method, it makes sense to enable wsrep_desync (set global wsrep_desync=on ) until the node gets synchronized, and then disable wsrep_desync  again.

As during this synchronization process, this node may lag behind and flow control will no longer take care of the desynced node.

Approach # 2:
Another approach is to rebuild database table in question. I got this idea from Jervin’s impressive post. You need to use the pt-online-schema-change tool and a NOOP ALTER. You need to run it through one of the authoritative synchronization nodes of the cluster, as below:

Once the table is rebuild, ROW images will be copied to the node needing synchronization. It might be a good idea to use the –dry-run option before doing the actual changes with –execute.

You’ll need to decide which of these approaches to choose: while this second approach seems a much safer option, it may take some time to complete. If the database table is in gigabytes and terabytes, and the differences in rows are few compared to the actual table size, then the first approach could be faster (but riskier).

In next example, I have added an async node as a replica of Node1, where Node1 acts as the master server for the async slave. The setup looks like:

pt-table-checksum  can only detect differences on a replica from the master node (i.e., Node1). In this case, the tool needs to be executed from the master node (Node1). If you run pt-table-checksum  from Node2 or Node3, it won’t detect diffs on the replica because its master node is Node1. You can find details about this in the documentation.

Let’s try to see all this in below examples. In this test, I’m going to run the pt-table-checksum  tool from Node1 (master) of the async node (slave) to see if it can find the differences on the async replica node.

The pt-table-checksum  tool can find the differences on async node test.dummy table. Let’s try to sync those diffs with the help of pt-table-sync tool:

As you can see, I executed pt-table-sync tool from the async node to find the exact differences, and it printed the two rows that were missing on the dummy table on the async replica. Further, when executed with the –execute option, the tool fails with a “Deadlock” error. In reality, there is no deadlock found from the InnoDB status. In order to sync the slave with its master, you can take one of the approaches described above: either use the .sql script to execute into slave server directly, or rebuild the table on the master Galera node and it will rebuild across the slaves.

Let me show you one final example where I tried to run pt-table-checksum  from Node3 to find the differences on the async slave with Node1 as its master host:

Here, Node3 refused to find differences because there is no direct replica attached to it. The documentation mentions this behavior, as cluster environment, differences can only be detected on the master node directly connected to async slave (as is the case here with Node1 to async replica). In order to run a checksum from Node3 to check for differences between cluster nodes, use the –recursion-method mentioned in previous examples.

Conclusion:
Galera Cluster doesn’t support pt-table-sync in general. But here we’ve found a couple of workarounds to solve this problem:

  • You can sync those changes manually by redirecting the pt-table-sync --print  output to file to generate a .SQL template, review changes carefully and execute it on nodes with the missing data. Again, this approach is fast but possibly dangerous.
  • Another option is the noop alter method using pt-online-schema-change  to rebuild the table. It’s the recommended method, but it could be slower.

Finally, after synchronizing the data changes it is a good idea to re-run the checksum tool to verify any discrepancies.

Share this post

Comments (3)

  • Aftab Khan Reply

    Nice article! I think there is one more way to synchronise inconsistent node using latest incremental backup. Quoting from Jay’s blog:
    “but this method will only work if and only if:

    * You have a large gcache
    * You can restore a backup faster than it takes for your gcache to overflow ”

    https://www.percona.com/blog/2015/07/16/bypassing-sst-pxc-incremental-backups/

    March 21, 2016 at 5:27 am
  • snk Reply

    does pt-table-checksum work with multi-source ?

    December 1, 2017 at 7:17 am
  • Jagannath Reply

    How to use pt-table-checksum and pt-table-sync in master-master replication with mixed binlog.

    May 10, 2018 at 7:00 am

Leave a Reply