Percona XtraDB Cluster: Multi-node writing and Unexpected deadlocks

Percona XtraDB Cluster (PXC) and the technology it is built on (Galera) are an exciting alternative to traditional MySQL replication.  For those who don’t know, it gives you:

  • Fully synchronous replication with a write latency increase equivalent to a ping RTT to the furthest node
  • Automatic cluster synchronization, with both incremental and full state transfers when a node joins or rejoins
  • The ability to read and write on every node without needing to worry about replication delay

However, good engineers know there is always a tradeoff.  If someone tries to sell you on a technology that sounds amazing without telling you the tradeoffs, be careful.

One of the tradeoffs in Galera is how multi-node writing is implemented.  To simplify greatly, locking inside of an RDBMS can be either optimistic or pessimistic.

Pessimistic locking is what InnoDB does as a transaction executes.  All the locks it needs (or thinks it needs) are acquired over the course of the transaction.  If a given statement in the transaction needs a lock, it waits for that lock before proceeding (this is where deadlocks and lock wait timeouts can happen).  By the time the transaction reaches the COMMIT, it should be guaranteed to succeed (unless something weird happens, like the disk failing or the server crashing) because it acquired all the locks it needed in advance.
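For example, on a single InnoDB node, two sessions that want the same row lock simply queue up.  A minimal sketch, assuming a hypothetical table t with a row where id = 1:

    -- session 1
    mysql> BEGIN;
    mysql> UPDATE t SET val = 'a' WHERE id = 1;   -- takes the row lock and holds it

    -- session 2
    mysql> BEGIN;
    mysql> UPDATE t SET val = 'b' WHERE id = 1;   -- blocks until session 1 commits (or a lock wait timeout fires)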

Now, think about pessimistic locking in terms of a cluster.  It would be incredibly expensive to acquire every lock you need on every node for all running transactions.  How would deadlock detection work?  On a single node, a deadlock is fairly easy to see by looking at all transactions waiting for locks and checking for circular dependencies, but that would be much more complicated in a cluster.  That’s not to say pessimistic locking isn’t possible in a cluster, but it is pretty tricky, and all that extra network traffic would really slow down transactions.

You may have already guessed it: Galera does not use pessimistic locking cluster-wide; it uses optimistic locking.  Initially a transaction on PXC behaves much like a regular InnoDB transaction.  It does pessimistic locking on the node it is started from, all the way up to the point of commit.  At that point it knows it has gained all the locks necessary on the local node, but it has no idea about the rest of the cluster.  So, it optimistically shoots the transaction out to the rest of the nodes in the cluster to see if they will agree to it.  If they do, the commit succeeds at the simple cost of one network roundtrip per node (which is done in parallel to all nodes at once).

This sounds all well and good, but what is the tradeoff?

Not locking resources cluster-wide means there can be locking conflicts.  In short, these take the form of deadlock errors returned to the clients of the affected transactions.  Technically these aren’t necessarily deadlocks, but rather cross-node locking conflicts of some kind.  These deadlocks can happen in places where you wouldn’t necessarily expect one.

So, enough talking; let’s see what one looks like.  Suppose we have a 3-node PXC cluster.  Let’s create a table to test on and insert some data:
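(The exact schema doesn’t matter; this is just a sketch using a simple hypothetical test.t table, not the original session output.)

    node1 mysql> CREATE TABLE test.t (id INT PRIMARY KEY, val VARCHAR(32)) ENGINE=InnoDB;
    node1 mysql> INSERT INTO test.t VALUES (1, 'one'), (2, 'two');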

Now, on two different nodes we will open transactions and attempt to modify the same row:
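First, on node1 (continuing the hypothetical test.t sketch):

    node1 mysql> BEGIN;
    node1 mysql> UPDATE test.t SET val = 'node1 was here' WHERE id = 1;
    -- no COMMIT yet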

So far, so good.  We have an open transaction on node1 updating row 1 in our table. Note we haven’t committed yet.  Now, let’s go to node3:
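(Sketch continues on node3.)

    node3 mysql> BEGIN;
    node3 mysql> UPDATE test.t SET val = 'node3 was here' WHERE id = 1;
    node3 mysql> COMMIT;
    -- the COMMIT returns successfully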

Our commit succeeded!  On a single node, this update would have blocked waiting for our first transaction, but here it did not.  Let’s go back to node1 and see what happened:
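(Sketch; the error is MySQL’s standard deadlock error.)

    node1 mysql> COMMIT;
    ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction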

Whoops!  We got a deadlock error on the commit; that doesn’t happen in normal InnoDB.

Galera is letting the first transaction to commit “win”.  Any other transactions on other nodes with competing locks are automatically rolled back and will receive a deadlock error no matter what they try to do next (even a SELECT or something else seemingly innocuous).
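For example, if we had left the node1 transaction open instead of committing it, then once node3’s commit replicated, even a harmless statement in that doomed transaction would come back with the same error (again, just a sketch):

    node1 mysql> SELECT * FROM test.t;
    ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction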

Dealing with this problem is, quite simply, the tradeoff for multi-node writing in a synchronous replication system with optimistic locking.

So, when is this really a problem?  Essentially, whenever transactions on multiple nodes can take the same write locks.  Whether or not this is a problem for your system is purely application-dependent, but if you’re not sure, I’d say you should assume it will be a problem.
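If you want a rough idea of how often this is already happening on an existing cluster, Galera keeps counters you can check.  The names below are Galera’s wsrep status variables; consult the documentation for your version:

    mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';  -- local transactions that failed certification
    mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';      -- local transactions aborted by replicated ones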

What can you do to work with/around this problem?  Here are some ideas from simplest to most complex:

  • Only write to a single node in the cluster (at a time): all the pessimistic locking will happen as usual on that node only, and there should be no conflicts from other nodes.  This is pretty easy to rig up with an HAProxy configuration or similar, so if your main writer node fails, clients fail over gracefully.
  • Update your application DB library to handle sudden deadlocks: if you end up with one of these deadlock errors, simply retry the transaction.  Any software using transactions should have some facility to do this anyway (IMO).
  • Restructure your application/schema to avoid data hot-spots: this is probably easier said than done in most cases, but in principle, the less likely it is for a single piece of data to be written on multiple nodes at the same time, the less often this problem will happen.  For some databases this may be perfectly natural, but I believe most conventional relational databases will naturally have these kinds of problems.

Want to learn more about PXC and go hands-on?  I will walk you through real examples of this and a lot more at my Percona XtraDB Cluster in Practice tutorial at Percona Live NYC on Oct 1-2, 2012.

Comments (21)

  • pcrews

    I was also seeing this in my galera tests for kewpie, glad to see you’ve distilled the problem further and I’ll have to post once I’ve dusted those tests off to make sure they are still working.

    Nice, informative post!