We have seen TokuDB lock escalation stall the execution of SQL operations for tens of seconds. To address this problem, we changed the lock escalation algorithm used by TokuDB and TokuMX so that the cost of lock escalation only affects big transactions. We also eliminated a serialization point when running lock escalation.
Transactions in TokuDB and TokuMX accumulate locks on key ranges while they execute. These locks allow multiple transactions to run concurrently. The locks are released when the transaction commits or aborts.
The locks are stored in an in memory data structure that contains a set of key range and transaction identifier pairs. Since the locks are stored in memory and we want to support arbitrarily large transactions, an algorithm is needed to kick in when the amount of memory used to store locks exceeds a maximum limit. The limit is set when the server is started. The lock escalation algorithm shrinks the total lock memory footprint by merging key ranges. The merged key ranges must have the same transaction identifier. One result of lock escalation is that there are fewer lock ranges in the memory, so the memory footprint is smaller. One downside of lock escalation is that the merged key range may include keys that were not explicitly locked by the application. This may result in less transactional concurrency.
How does lock escalation work? The lock escalation algorithm examines all of the key ranges stored in the lock tree looking for merge opportunities. For a large set of locks, the time to run the algorithm can be very long. I have seen escalations that take tens of seconds to complete. Perhaps a less expensive lock escalation algorithm can be used. We will investigate this someday.
When does lock escalation run? Lock escalation is run when the memory footprint of all of the locks exceeds the lock memory limit. Whenever a transaction needs a lock, it checks the total memory footprint of the locks. If the lock memory footprint is over the limit, then lock escalation is run by the calling thread. In addition, a global mutex is held while lock escalation is running. This is the source of the serialization point. All threads that need a lock get blocked on the mutex until lock escalation finishes. This can take tens of seconds.
Usually, a big transaction owns most of the locks. It would be nice if a big transaction ran lock escalation while small transactions continue to run. What is a big transaction? A big transaction is a transaction with a rollback log that does not fit in memory. When a transaction creates a rollback log that no longer fits in memory by doing a lot of fractal tree operations, its rollback log gets spilled to a file.
We added a lock footprint limit for big transactions that is 1/2 of the total limit. If a big transaction needs a lock and the lock memory footprint is greater than 1/2 of the limit, then the big transaction thread runs lock escalation. Since the big transaction is probably the owner of most of the locks, it should pick up the cost of lock escalation. Small transactions are able to acquire and release locks while lock escalation is in progress since lock escalation is kicked off early.
This new algorithm shipped in TokuDB 7.1.5 and TokuMX 1.4.0.