Optimizing Percona XtraDB Cluster for write hotspotsStephane Combaudon
Some applications have a heavy write workload on a few records – for instance when incrementing a global counter: this is called a write hotspot. Because you cannot update the same row simultaneously from multiple threads, this can lead to performance degradation. When using Percona XtraDB Cluster (PXC), some users try to solve this specific issue by writing on multiple nodes at the same time. Good idea or bad idea? Read on!
Simultaneous writes on a standalone InnoDB server
Say you have these 3 transactions being run simultaneously (id is the primary key of the table):
UPDATE t SET ... WHERE id = 100
UPDATE t SET ... WHERE id = 100
UPDATE t SET ... WHERE id = 101
All transactions will require a row lock on the record they want to modify. So T3 can commit at the same time than T1 and/or T2, because it will not lock the same record as T1 and T2.
But T1 and T2 cannot execute simultaneously because they need to set a lock on the same record. Let’s assume T2 is executed by InnoDB first, how long does T1 need to wait? It is essentially the time needed for T2 to execute.
Simultaneous writes on multiple nodes (PXC)
Now is it any different if you have a 3-node PXC cluster and if you want to run T1 on Node1 on T2 on Node2? Let’s review step by step how the cluster will execute these queries:
1. T1 starts executing on Node1: a row lock is set on the record where id=100. T2 also starts executing on Node2 and also sets a row lock on the record where id=100. How is it possible that 2 transactions set the same lock on the same record? Remember that locking in Galera is optimistic, meaning that a transaction can only set locks on the node where it is executing, but never on the remote nodes: here, T1 sets a lock on Node1 and T2 sets a lock on Node2.
2. Let’s assume T1 reaches commit before T2. T1 is then synchronously replicated on all nodes and it gets a global sequence number. At this point, a certification test is run on Node1 to determine whether there is any write conflicts between T1 and other “in-flight” transactions in the cluster. Go to the next section if you want some clarification about “in-flight” transactions and the certification test.
Back to our transactions: T1 is the first transaction to try to commit, so no other transaction is “in-flight”: the certification test will succeed and InnoDB will be able to apply the transaction on Node1. On Node2, the same certification test will be run and T1 will be put in the apply queue (and it will be applied at some point in the near future).
Ok, wait a minute! No other transaction is “in-flight”, really? What about T2 on Node2? Actually T2 is simply a local transaction on Node2 and it is not known by the cluster yet. Therefore it is not what I called an “in-flight” transaction and it does not play any role in the certification test.
3. Now T2 reaches commit on node2. It is then synchronously replicated to all nodes and a certification test will run on node2. If T1 and T2 did commit simultaneously, there is a good chance that when T2 starts committing, T1 is still in the apply queue of Node2. In this case, the certification test for T2 will fail. Why? To make sure that T1 will commit on all nodes no matter what, any other transaction that wants to set a lock on the record where id=100 will be rejected.
Then if the certification test for T2 fails, T2 is rolled back. The only option to commit T2 is to retry executing it.
Let’s assume that this second try is successful, how long did T2 have to wait before being executing? Essentially we had to execute T2 twice so we had to replicate it twice, each replication taking 1 network RTT, we had to roll T2 back on Node2 (rollback is expensive in InnoDB) and the application had to decide that T2 had to be executed a second time. That is a lot more work and wait compared to the scenario on a single server.
So where is the fundamental problem when we tried to write on several nodes? Galera uses optimistic locking, and we had to go very far in the execution of T2 before realizing that the query will not succeed. Multi-node writing is therefore not a good solution at all when the system sees write hotspots.
“In-flight” transactions and certification test
“In-flight” transactions are transactions that have already been applied on at least one node of the cluster but not on all nodes. Remember that even if replicating transactions is synchronous, applying committed transactions on remote nodes is not. So a transaction Tx can be committed and applied on node1 but not on node2: on node2, Tx will simply sit in an apply queue, waiting to be executed. Tx is then an “in-flight” transaction.
The goal of the certification test is to make sure that no transaction can prevent Tx to execute on node2: as Tx is already on node1 and as we want data consistency, we must make sure that Tx will execute successfully no matter what can happen. And the magic of Galera is that the test is deterministic so group communication is not necessary when a node runs a certification test.
So what are your options with PXC when the workload has write hotspots? The most obvious one is to write on a single node: then you will have the same locking model as with a standalone InnoDB server. Performance will not be as good as with a standalone server as you will have to pay for synchronous replication, but you will avoid the very expensive write conflicts.
Another option would be to rearchitect your application to write less often. For instance, for a global counter, you could maintain it with Redis and only periodically flush the value to PXC.