One of our customers sometimes observed lots of simple insertions taking far longer than expected to complete. Usually these insertions completed in milliseconds, but the insertions sometimes were taking hundreds of seconds. These stalls indicated the existence of a serialization bug in the Fractal Tree index software, so the hunt was on. We found that these stalls occurred when a big transaction was committing and the Fractal Tree index software was taking a checkpoint. This problem was fixed in both TokuDB 7.0.3 and TokuMX 1.0.3. Please read on as we describe some details about this bug and how we fixed it. We describe some of the relevant Fractal Tree index algorithms first.
What is a Big Transaction?
Each transaction builds a rollback log as it performs Fractal Tree index operations. The rollback log is maintained in memory until it gets too big and is spilled to the TokuDB rollback file. We define a small transaction as one whose rollback log resides in memory. We define a big transaction as one whose rollback log is spilled to the rollback file.
When the transaction commits or aborts, the entire rollback log for the transaction is traversed and for each rollback entry the commit or abort action for the rollback entry is executed. For example, it the transaction is committing and the rollback entry is an insertion, then the Fractal Tree index software does nothing since we implicitly commit the insertion. If the transaction is aborting an insertion, then the Fractal Tree index software sends an abort message into the appropriate Fractal Tree index. The time to traverse the rollback log depends on the number of operations that the transaction logged. This time can be huge.
What is a Checkpoint?
Each transaction also appends recovery entries to the recovery log as it performs Fractal Tree index operations. When recovering from a crash, the recovery log is traversed and its log entries are executed. Since we don’t want to process the recovery log from the beginning of time, we periodically checkpoint the open Fractal Tree indexes and write checkpoint log entries to the recovery log. A checkpoint creates a new version of the Fractal Tree index and persists it to storage. The checkpoint log entries allow crash recovery to only have to replay operations that occurred since the last checkpoint, not since the beginning of time.
A checkpoint is split into two phases: the begin phase and the end phase. The begin checkpoint phase logs the state of the open Fractal Tree indexes and the live transactions. It also marks the blocks that need to be written by the end checkpoint phase. The details of checkpoints will not be covered here since they are way beyond the topic of this blog.
A reader writer lock is used to serialize transaction commits, aborts and the checkpoint begin phase. Transaction commits and aborts take a read lock on this lock. Checkpoints take a write lock on this lock. We use a fair reader writer lock so that checkpoints are not starved forever in the face of lots of transaction commits or aborts.
Here is how the stalls occur.
- A big transaction grabs a read lock on the checkpoint lock when it wants to commit or abort.
- Other transactions can still grab a read lock on the checkpoint lock and commit. So far, so good.
- A checkpoint attempts to grab a write lock in the checkpoint lock. Since there is a read lock active, the write lock attempt is blocked.
- When other transactions try to grab a read lock on the checkpoint lock, they are blocked because there is a write lock pending.
- Now, all transaction commits and aborts are serialized until the big transaction completes.
The Bug Fix
We split the checkpoint lock into a low priority reader writer lock and a high priority reader writer lock. A big transaction grabs a read lock on the low priority reader writer lock. A small transaction grabs a read lock on the high priority reader writer lock. The checkpoint first grabs a write lock on the low priority lock. It then grabs a write lock on the high priority lock.
Zardosht and Leif tell me that this design pattern is a special case of a graded lock that supports multiple priority levels. We use two levels here.
Here is how the stalls are avoided.
- A big transaction grabs a read on the low priority checkpoint lock.
- Other big transactions can still grab a read lock on the low priority checkpoint lock.
- Other small transactions can grab a read lock on the high priority checkpoint lock as well.
- A checkpoint attempts to grab a write lock on the low priority checkpoint. Since there is a read lock active, the write lock attempt is blocked.
- Other small transactions can continue to grab a read lock on the high priority checkpoint lock since there are no write locks.
- Eventually, the big transaction releases the low priority lock, the checkpoint thread grabs it, and then the checkpoint thread tries to grab the high priority lock.
- This will serialize the small transactions while the checkpoint begin runs, but we expect checkpoint begin to be relatively fast.
We give priority to small transactions over big transactions in this design. While big transactions are running through their rollback logs, small transactions continue to commit or abort.