Changing an async slave of a PXC cluster to a new Master

Async and PXC

A common question I get about Percona XtraDB Cluster is if you can mix it with asynchronous replication, and the answer is yes!  You can pick any node in your cluster and it can either be either a slave or a master just like any other regular MySQL standalone server (Just be sure to use log-slave-updates in both cases on the node in question!).  Consider this architecture:

Canvas 1

However, there are some caveats to be aware of.  If you slave from a cluster node, there is no built in mechanism to fail that slave over automatically to another master node in your cluster.  You cannot assume that the binary log positions are the same on all nodes in your cluster (even if they start binary logging at the same time), so you can’t issue a CHANGE MASTER without knowing the proper binary log position to start at.

Canvas 2

Until recently, I thought it was not possible to easily do this without scanning both the relay log on the slave and the binary log on the new master and trying to match up the (binary) replication event payloads somehow.  It turns out it’s actually quite easy to do without the old master being available and without pausing writes on your cluster.

Galera Seqnos and Binary log Xids

Given the above scenario, we have a 2 node cluster with node3 async replicating from node1.  Writes are going to node2.  If I stop writes briefly and look at node1’s binary log, I can see something interesting:

Notice the Xid field. Galera is modifying this field that exists in a standard binlog with its last commited seqno:

If I check node2, sure enough its latest binlog has the same Xid, even though it’s in a different file at a different position:

So, with both nodes doing log-bin (and both with log-slave-updates), even though they are on different log files, they still have the same Xid that is very easy to search for.

But, does our slave (node3) know about these Xids? If we check the latest relay log:

Yes, indeed! We’re assuming the SQL thread on this slave has processed everything in the relay log and, if it has, then we know that the Galera transaction seqno 86277 was the last thing written on the slave. If it hasn’t, you’d have to find the last position applied by the SQL thread using SHOW SLAVE STATUS and find the associated Xid with that position.

Testing it out

Now, let’s simulate a failure to see if we can use this information to our advantage. I have write load running against node2 and I kill node1: