October 31, 2014

Understanding Multi-node writing conflict metrics in Percona XtraDB Cluster and Galera

I have addressed previously how multi-node writing causes unexpected deadlocks in PXC, at least, it is unexpected unless you know how Galera replication works.  This is a complicated topic and I personally feel like I’m only just starting to wrap my head around it.

The magic of Galera replication

The short of it is that Galera replication is not a doing a full 2-phase commit to replicate (where all nodes commit synchronously with the client), but it is instead something Codership calls “virtually synchronous” replication.  This could either be clever marketing or clever engineering, at least at face value.   However, I believe it really is clever engineering and that it is probably the best compromise for performance and data protection out there.

There’s likely a lot more depth we could cover in this definition, but fundamentally “virtually synchronous replication” means:

  • Writesets (or “transactions”) are replicated to all available nodes in the cluster on commit (and enqueued on each).
  • EDIT: Writesets are then “certified” on every node (in order).  This certification should be deterministic on every node, so every node either accepts or rejects the writeset.  There is no way for a node to tell the rest of the cluster a writeset didn’t pass certification (or this would be a form of two-phase commit), so the only way nodes might get different certification results is if there is a Galera bug.
  • Enqueued writesets are applied on those nodes independently and asynchronously from the original commit on the source node.  And:
  • At this point the transaction can and should be considered permanent in the cluster.  But how can that be true if they are not applied?  Because:
    • Galera can do conflict detection between different writesets, so enqueued (but not yet committed) writesets are protected from local conflicting commits until our replicated writeset is committed. AND:
    • When the writeset is actually applied on a given node, any locking conflicts it detects with open (not-yet-committed) transactions on that node cause that open transaction to get rolled back.
    • Writesets being applied by replication threads always win.
So why is this “virtually synchronous”?  Because simply getting our writesets to every node on commit means that they are guaranteed to apply — therefore we don’t have to force all nodes to commit simultaneously to guarantee cluster consistency as you would in a two-phase commit architecture.

Seeing when replication conflicts happen

This brings me to my topic for today, the mysterious SHOW GLOBAL STATUS variables called:

  • wsrep_local_cert_failures
  • wsrep_local_bf_aborts

I found that understanding these helped me understand Galera replication better.  If you are experiencing the “unexpected deadlocks” problem, then you are likely seeing one or both of these counters increase over time, but what do they mean?

Actually, they are two sides to the same coin (kind of).  Both apply to some local transaction getting aborted and rolled back, and the difference comes down to when and how that transaction conflict was detected.  It turns out there are two possible ways:

wsrep_local_cert_failures

The Galera documentation states that this is the:

Total number of local transactions that failed certification test.

What is a local certification test?  It’s quite simple:  On COMMIT, galera takes the writeset for this transaction and does conflict detection against all pending writesets in the local queue on this node.  If there is a conflict, the deadlock on COMMIT error happens (which shouldn’t happen in normal Innodb), the transaction is rolled back, and this counter is incremented.

To put it another way, some other conflicting write from some other node was committed before we could commit, and so we must abort.

This local certification failure is only triggered by a Galera writeset comparison operation comparing a given to-be-commited writeset to all other writesets enqueued locally on the local node.  The local transaction always loses.

EDIT: certification happens on every node.  A ‘local’ certification failure is only counted on the node that was the source of the transaction.

wsrep_local_bf_aborts

Again, the Galera documentation states that this is the:

Total number of local transactions that were aborted by slave transactions while in execution.

This kind of sounds like the same thing, but this is actually an abort from the opposite vector:  instead of a local transaction triggering the failure on commit, this is triggered by Galera replication threads applying replicated transactions.

To be clearer: a transaction was open on this node (not-yet-committed), and a conflicting writeset from some other node that was being applied caused a locking conflict.  Again, first committed (from some other node) wins, so our open transaction is again rolled back.  “bf” stands for brute-force:  any transaction can get aborted by galera any time it is necessary.

Note that this conflict happens only when the replicated writeset (from some other node) is being applied, not when it’s just sitting in the queue.  If our local transaction got to its COMMIT and this conflicting writeset was in the queue, then it should fail the local certification test instead.

A brute force abort is only triggered by a locking conflict between a writeset being applied by a slave thread and an open transaction on the node, not by a Galera writeset comparison as in the local certification failure.

Testing it all out

So this is the part of the post where I wanted to show that these counters were being incremented using an example from my last post.  Those examples should trigger brute force aborts, but they didn’t seem to increment either of these counters on any of my testing nodes.   Codership agrees this seems like a bug and is investigating.  I’ll update this post if and when an actual bug is opened, but I have seen these counters being incremented in the wild, so any bug is likely some edge case.

By the way, I can’t think of how to reliably produce local certification errors without just a lot of fast modifications to a single row, because those depend on the replication queue being non-empty and I don’t know any way to pause the Galera queue for a controlled experiment.

 

About Jay Janssen

Jay joined Percona in 2011 after 7 years at Yahoo working in a variety of fields including High Availability architectures, MySQL training, tool building, global server load balancing, multi-datacenter environments, operationalization, and monitoring. He holds a B.S. of Computer Science from Rochester Institute of Technology.

Comments

  1. To product certification errors try using a “wayback”. :-) IE clever use of the sleep() function.

    This post by Peter Laursen might help you understand what I’m thinking about.

    http://blog.webyog.com/2012/11/13/the-wonderful-way-back-machine-in-mysql/?utm_source=rss&utm_medium=rss&utm_campaign=the-wonderful-way-back-machine-in-mysql

  2. Sergey says:

    Don’t have enough understanig about bf_abort.
    As I know, after the certification stage transaction is physically commited on the source node. So, after the physical commit failure on the local (slave) node Galera should abort a transaction that already have been commited on the source node. Is it correct and Galera does it?

  3. Sergey,

    The flow is:

    source node commit: local certification, enqueue on all other nodes, commit locally

    then:

    on all other nodes: trx sits in queue (Briefly) and cause local_cert_failures for local conflicting trxs trying to commit, then apply, and cause bf_aborts for any open transactions causing lock conflicts with apply

    Once the commit succeeds on the source node, nothing can cause it to rollback. bf_abort is an already replicated transaction blasting local transactions out of its way. Galera is designed so a transaction getting applied by replication cannot fail.

  4. Hi Jay

    I easily get a small percentage of deadlocks just by running sysbench:

    – create a table with small number of rows
    – use small buffer pool / disk bound workload
    – use lots of sysbench threads, multi-master of course
    – expect 1-5% of deadlocks

    With bigger tables and in-memory workload the amount of deadlocks is negligible so you may only get less than one per minute.

    henrik

  5. Sergey says:

    Sorry, bit no understanding anyway.

    For example, one of the question: where does Galera run Local Certification: on both sides, master and slave? I thought before that’s master’s job, it must send a certification request to slave and get some response from slave (succeess or not). Only if response is success, it can do a physical commit. Is it wrong?

  6. Sergey: “local” certification = only on the node the write occurs from. By locally verifying the transaction against its own slave queue, it guarantees that the transaction should have no conflicts cluster-wide (since all transactions are synchronously copied to all other nodes’ queues.

  7. gpfeng.cs says:

    Thanks Jay Janssen. It helps me a lot

  8. gpfeng.cs says:

    I’m quite clear about the difference of wsrep_local_cert_failures and wsrep_local_bf_aborts after reading this blog.

    From what you say:

    “Writesets (or “transactions”) are replicated to all available nodes in the cluster on commit (and enqueued on each).
    Enqueued writesets are applied on those nodes independently and asynchronously from the original commit on the source node. And:
    At this point the transaction can and should be considered permanent in the cluster”

    I get the conclusion: only local trxs will be aborted or rolled back while broadcasted and enqueued trxs will be promised to commit.

    But I got confused when I read the blog:http://www.mysqlperformanceblog.com/2012/01/19/percona-xtradb-cluster-feature-2-multi-master-replication/, the diagram is similar to http://www.codership.com/wiki/doku.php?id=certification

    from what I see: broadcasted trxs(sit in the queue) need to be certified and may be rollback, as the total order service guarantees that every nodes enqueue trxs in the same order and them produce same result against the certification test.

    Is there something wrong with my understanding?

  9. Hi gpfeng.cs:
    This is a good point: actually the flow is:

    – writeset is replicated
    – writeset is certified on all nodes (pass or fail, they should all produce same result)
    – if pass, writeset is put in apply queue.

    I really only clarified this myself recently. I’ve tried to update the post slightly to make it more accurate.

  10. gpfeng.cs says:

    Jay Janssen

    Good to see the reply!

    so wsrep_local_cert_failures and wsrep_local_bf_aborts only record the aborted/rollback information of the local trxs on the source node(before replicated)

    one more questions:
    Is there a status variable that counts the certification test failures of the replicated(enqueued and sitted) trxs? as high rate of certification test failures should be found and explained.

  11. >> so wsrep_local_cert_failures and wsrep_local_bf_aborts only record the aborted/rollback information of the local trxs on the source node(before replicated)

    local_cert_failure is after replication, but before final commit on the source node of the transaction

    local_bf_abort are transactions on the local node that are aborted because of some replicated transaction from some other node that was committed first. Note that there is a bug with bf_aborts not being incremented all the time: https://bugs.launchpad.net/codership-mysql/+bug/1126281, so the count there might be lower than it really is.

    >>one more questions:
    >>Is there a status variable that counts the certification test failures of the replicated(enqueued and sitted) trxs? as high rate of certification test failures should be found and explained.

    No, local_cert_failure is the only one, and that only counts on the node that was the source of the transaction. All other nodes *should* get the exact same failure (or else there is a bug), but there is nothing that counts that.

  12. gpfeng.cs says:

    >>local_cert_failure is after replication, but before final commit on the source node of the transaction

    from your comment:
    –>
    The flow is:

    source node commit: local certification, enqueue on all other nodes, commit locally

    then:

    –<

    my understanding: when a local trx tries to commit,it must first pass the local certification and then it will be broadcasted, otherwise it will be rollback and nothing replicated to other nodes.

    or is the ture logic that at COMMIT trx will be first bundled into writeset and replicated to all nodes, and then local certification will be tested on the source node?

    I'm not so sure.

  13. gpfeng.cs says:

    my understanding of the flow:

    1. source node commit:
    2. local certification on source node (local_cert_failure will be incremented if failed)
    3. enqueue on all other nodes if pass the local certification test
    4. certification test (all nodes produce same result)
    5. commit on all nodes if passed the certification test(will cause other local open trx to rollback) or rollback if failed

  14. No, flow is:

    1) source node commit (COMMIT from client)
    2) replicate to all nodes and determine GTID
    3) certification on all nodes (on failure: source node: local_cert_failure +1, client deadlock error, other nodes: drop trx)
    4) source node: actual innodb commit(). Other nodes: enter apply queue
    5) Other nodes: actual innodb commit()

    The trx is replicated before it is certified, that is important. I did not realize that myself until recently. It is necessary to do so to keep certification deterministic. If you certified locally before replication, you would have a race condition where certification might fail the second time on some other node, and now you have 2 phase commit. Replicating first allows Galera to certify deterministically because all certification is done in GTID order on all nodes.

  15. gpfeng.cs says:

    Good work!

    >>If you certified locally before replication, you would have a race condition where certification might fail the second time on some other node, and now you have 2 phase commit.

    why 2pc is required here?

  16. 2pc is a common synchronous replication strategy, but Galera does not use it. Because certification is done on all nodes in order of GTID, and certification is deterministic (i.e., will return the same result on all nodes) we don’t need it. This reduces the overhead of replication quite a bit as 2pc is quite chatty over the network and requires many roundtrips to commit, even under circumstances where there is no conflict.

  17. gpfeng.cs says:

    Hi Jay Janssen:

    Sorry, I should have misunderstood you, I guesed that you mean 2pc is required if the logic is certification before replication from the following:

    >>If you certified locally before replication, you would have a race condition where certification might fail the second time on some other node, and now you have 2 phase commit.

    and now I get everything clear, Thank you!

  18. Pablo says:

    Hi,

    What happens if the apply fails on just one node? For example, If the disk is full. Does the apply retries indefinitely ?
    In general, what are the guarantees that a WriteSet will succeed in all the nodes? Is there any?

    Thanks!

Speak Your Mind

*