Demystifying Sequence Numbers (seqno) in Percona XtraDB ClusterKrunal Bauskar
In this blog post, we’ll look at how sequence numbers work in Percona XtraDB Cluster.
Percona XtraDB Cluster uses multiple sequence numbers (seqno), each having a special role to play. Let’s try to understand the significance of each sequence number.
An active Percona XtraDB Cluster cluster has multiple write-sets (transactions) generated from one or more nodes. These transactions then get replicated to other nodes of the cluster. Each write-set/transaction should have a unique identity across the cluster. global_seqno helps serve this identity. Given write-set will have the same global_seqno on all the nodes of the cluster.
We designed the protocol so that each node processes all write-sets, including self-generated write-sets, in same order – thereby skipping the generation of a unique ID through a common entity (also avoiding single point failure). Each node increments its local counter once it receives the write-set from group channel. On restart, this counter is re-initialized based on the state of the cluster as seen by a respective node. In the attached diagram we can see replicated write-sets across the cluster have the same seqno (x, y, z).
Just like global_seqno, there is a local_seqno, but this seqno is local to the said node. It is, of course, incremented to register receipt of write-set. But in addition, it is also incremented for other activities including configuration change, pause activity, etc…
All internal actions like certification are ordered based on local_seqno through LocalMonitor. This is important in that the first transaction to replicate should be first to go through certification. If there is a conflict, then the followup transaction fails and not the first replicated transaction.
local_seqno resets to 0 on restart.
As we can see in the attached example, each write-set has same or different local seqno, as it is really local to that node and there is no cross-node decision based on this seqno.
When a given transaction is about to replicate, last_seen_seqno is set based on last_committed seqno on the said node. In other words, it is like setting a lower watermark for certification ranges. In simpler terms, this transaction has successfully accommodated changes done until this point. If any new conflicting (conflicting based on keys) transactions are added beyond this point, it can result in certification failure.
Here is an example:
- Say a trx is about to replicate (it doesn’t yet have global_seqno) assigned.
- trx last_seen_seqno is updated based on the last committed transaction. In this case “a”.
- trx is successfully replicated and gets a global_seqno (g). In the meantime, trx with global_seqno b,c,d,e,f are also added to the channel and the respective node certifies and adds them to the certification queue (as seen above).
- The certification algorithm is now trying to certify trx(g). The trx(g) last_seen_seqno is (a), so the range to certify is from b-f. Since b-f write-sets were added after trx(g) was replicated, these write-sets could potentially create certification conflicts with trx(g).
- trx(g) keys are then compared with the certification queue before it hits a conflicting key at trx(e). Since global_seqno(trx(e)) > last_seen_seqno(trx(g)), it indicates certification conflict. It is also important to note that trx(e) is already accepted, so trx(g) is one that gets rolled back.
So last_seen_seqno helps determine the lower bound of the certification range.
Each transaction has a parent transaction based on the data object it modifies. This parent transaction should be applied first. Then the said transaction can be allowed to get processed.
Again, let’s look at an example:
- “create db” is a base parent trx. ( depends_seqno=0)
- “create table” is dependent on it. ( depends_seqno=1 for this trx with global_seqno=2)
- Insert is an independent action and they are not linked to each other. They have a common parent that is trx with global_seqno=2.
- The update statement tends to modify a complete table, so closet transaction in the queue acts as the parent ( depends_seqno=5).
This depends_seqno is set during certification based on the key evaluation, and it dictates the apply order. In turn, it also sets the commit order for the said transaction. Even though apply can run in parallel, it will not allow an update to proceed until insert transaction (with global_seqno=5) is applied. The order will affect the end-result. (The certification queue continues to purge. Say an insert appears after a long interval, and in the meantime, the certification queue has purged. depends_seqno defaults to point to the starting entry of the queue. A purge happens only when the said entry has been committed on all the nodes and it is safe not to link the insert entry with that old historical entry, as the said goal has been achieved.)
Now, what if there is another transaction with global_seqno=7 (update mysql.t2). This transaction doesn’t conflict with any of the existing transactions so that it will have depends_seqno=1 (first transaction, known transaction in a queue). It can proceed without waiting for (update test.t1) or (insert test.t1) to complete the apply action.
NOTE: I hope this blog helped clarify the basic significance of each seqno. Leave a comment below if you need a more detailed explanation about any of the related aspect.