Recently Codership introduced (with Galera 3.19) a very important and long-awaited feature. Now users can recover the Galera cache on restart.
If you gracefully shut down cluster nodes one after another, with some lag time between nodes, then the last node to shut down holds the latest data. The next time you restart the cluster, the last node to shut down should be the first one to boot. Any follow-up nodes that join the cluster after the first node will demand an SST.
Why SST, when these nodes already have data and only a few write-sets are missing? The DONOR node caches missing write-sets in the Galera cache, but on restart this cache is wiped clean and restarted fresh. So the DONOR node doesn’t have a Galera cache from which to donate the missing write-sets.
This painful setup made it necessary for users to think and plan before gracefully taking down the cluster. With the introduction of this new feature, the user can retain the Galera cache.
How does this help?
On restart, the node will revive the Galera cache. This means the node can act as a DONOR and service missing write-sets (facilitating IST, instead of using SST). Retaining the Galera cache is controlled by the option gcache.recover=yes/no. The default is no (the Galera cache is not retained). The user can set this option for all nodes, or for selected nodes, based on disk usage.
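As a minimal sketch of how this could be set, assuming the node reads its settings from a standard my.cnf and that Galera options are passed through wsrep_provider_options (the file layout is an assumption, not something taken from this post):

```ini
[mysqld]
# Keep the Galera cache across restarts so this node can serve IST as a DONOR.
# wsrep_provider_options is a semicolon-separated list; if other provider
# options are already set, append gcache.recover=yes to that same string.
wsrep_provider_options="gcache.recover=yes"
```

The node has to be restarted for the provider option to take effect from the configuration file.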
gcache.recover in action
The example below demonstrates how to use this option:
- Let’s say the user has a three node cluster (n1, n2, n3), with all in sync.
- The user gracefully shuts down n2 and n3.
- n1 is still up and running, and processes some workload, so now n1 has the latest data.
- n1 is eventually shut down.
- Now the user decides to restart the cluster. Obviously, the user needs to start n1 first, followed by n2/n3.
- n1 boots up, forming a new cluster.
- n2 boots up, joins the cluster, finds there are missing write-sets and demands IST. But given that n1 doesn’t have a gcache, it falls back to SST.
n2 (JOINER node log):
2016-11-18 13:11:06 3277 [Note] WSREP: State transfer required:
Group state: 839028c7-ad61-11e6-9055-fe766a1886c3:4680
Local state: 839028c7-ad61-11e6-9055-fe766a1886c3:3893
n1 (DONOR node log), gcache.recover=no:
2016-11-18 13:11:06 3245 [Note] WSREP: IST request: 839028c7-ad61-11e6-9055-fe766a1886c3:3893-4680|tcp://192.168.1.3:5031
2016-11-18 13:11:06 3245 [Note] WSREP: IST first seqno 3894 not found from cache, falling back to SST
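The decision visible in these logs can be sketched as a small model. This is a simplified illustration, not Galera's actual implementation: the function name and the use of None for "no usable cache" are assumptions made for the example. The seqno values come from the log excerpts in this post.

```python
from typing import Optional


def choose_state_transfer(joiner_seqno: int, group_seqno: int,
                          donor_cached_downto: Optional[int]) -> str:
    """Return "NONE", "IST" or "SST" for a joining node (simplified model)."""
    if joiner_seqno >= group_seqno:
        return "NONE"  # the joiner already has everything
    first_needed = joiner_seqno + 1  # first write-set the joiner is missing
    # IST is only possible if the donor's cache still holds first_needed.
    if donor_cached_downto is not None and donor_cached_downto <= first_needed:
        return "IST"
    return "SST"


# gcache.recover=no: after the restart the donor has no cached write-sets
# that the joiner needs (modeled here as None).
print(choose_state_transfer(3893, 4680, None))  # SST

# gcache.recover=yes: the donor's cache was recovered down to seqno 1.
print(choose_state_transfer(769, 1495, 1))      # IST
```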
Now let’s re-execute this scenario with gcache.recover=yes.
n2 (JOINER node log):
2016-11-18 13:24:38 4603 [Note] WSREP: State transfer required:
Group state: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:1495
Local state: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:769
2016-11-18 13:24:41 4603 [Note] WSREP: Receiving IST: 726 writesets, seqnos 769-1495
2016-11-18 13:24:49 4603 [Note] WSREP: IST received: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:1495
n1 (DONOR node log):
2016-11-18 13:24:38 4573 [Note] WSREP: IST request: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:769-1495|tcp://192.168.1.3:5031
2016-11-18 13:24:38 4573 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
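If you want to check many restarts, the DONOR log can be scanned for the two messages shown in the excerpts above. The following is a rough sketch: the patterns are derived only from those excerpts, so other Galera versions may word the messages differently, and the helper name summarize_donor_log is hypothetical.

```python
import re

# Patterns derived from the log excerpts in this post (an assumption:
# other versions may format these messages differently).
IST_REQUEST = re.compile(
    r"WSREP: IST request: (?P<uuid>[0-9a-f-]+):(?P<first>\d+)-(?P<last>\d+)\|")
SST_FALLBACK = re.compile(
    r"WSREP: IST first seqno \d+ not found from cache, falling back to SST")


def summarize_donor_log(lines):
    """Report the requested seqno range and whether the donor fell back to SST."""
    result = {"requested": None, "fell_back_to_sst": False}
    for line in lines:
        match = IST_REQUEST.search(line)
        if match:
            result["requested"] = (int(match.group("first")),
                                   int(match.group("last")))
        if SST_FALLBACK.search(line):
            result["fell_back_to_sst"] = True
    return result


donor_log = [
    "2016-11-18 13:24:38 4573 [Note] WSREP: IST request: "
    "ee8ef398-ad63-11e6-92ed-d6c0646c9f13:769-1495|tcp://192.168.1.3:5031",
    "2016-11-18 13:24:38 4573 [Note] WSREP: wsrep_notify_cmd is not defined, "
    "skipping notification.",
]
print(summarize_donor_log(donor_log))
```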
You can also validate this by checking the lowest write-set available in gcache on the DONOR node.
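mysql> show status like 'wsrep_local_cached_downto';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| wsrep_local_cached_downto | 1     |
+---------------------------+-------+
1 row in set (0.00 sec)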
So as you can see, gcache.recover can restore the cache on restart and help service IST instead of SST. This is a major resource saver for most graceful shutdowns.
gcache revive doesn’t work if . . .
If gcache pages are involved. Gcache pages are still removed on shutdown, and the write-sets cached up to that point are cleared as well.
Again, let’s see an example:
- Let’s assume the same configuration and workflow as mentioned above. We will just change the workload pattern.
- n1, n2, n3 are in sync and an average-size workload is executed, such that the write-sets fit in the gcache. (seqno=1-x)
- n2 and n3 are shut down.
- n1 continues to operate and executes some average-size workload followed by a huge transaction that results in the creation of a gcache page. (1-x-a-b-c-h) [h represents the seqno of the huge transaction]
- Now n1 is shut down. During shutdown, gcache pages are purged (irrespective of the gcache.keep_pages_size setting).
- The purge ensures that all the write-sets that have a seqno smaller than that of the write-set residing in the gcache page are purged, too. This effectively means everything in (1-h) is removed, including (a,b,c).
- On restart, even though n1 is able to recover the gcache, there is nothing left to recover, as all the write-sets were purged.
- When n2 boots up, it requests IST, but n1 can’t service the missing write-sets (a,b,c,h). This causes SST to take place.
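The scenario above can be modeled in a few lines. This is a simplified sketch of the behavior as described in this post, not of Galera's internals: the function names are hypothetical, and the concrete seqnos (x=100, a,b,c=101-103, h=104) are made-up values for illustration.

```python
def shutdown_and_recover(ring_seqnos, page_seqnos, recover=True):
    """Return the seqnos still cached after a graceful restart (simplified model)."""
    if not recover:
        return []  # gcache.recover=no: the cache starts empty
    if not page_seqnos:
        return sorted(ring_seqnos)  # no pages involved: everything is recovered
    # Pages are dropped on shutdown, together with every write-set whose
    # seqno is smaller than one stored in a page.
    cutoff = max(page_seqnos)
    return sorted(s for s in ring_seqnos if s > cutoff)


def can_serve_ist(cached_seqnos, first_needed):
    """IST is possible only if the cache reaches down to the first missing seqno."""
    return bool(cached_seqnos) and min(cached_seqnos) <= first_needed


# n2 stopped at x=100; n1 then executed a, b, c (101-103) in the regular
# gcache and the huge transaction h (104) in a gcache page.
with_page = shutdown_and_recover(range(1, 104), [104])
print(with_page, can_serve_ist(with_page, 101))  # [] False -> SST

# Same workload, but h small enough to fit in the regular gcache.
without_page = shutdown_and_recover(range(1, 105), [])
print(can_serve_ist(without_page, 101))          # True -> IST
```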
Summing it up
Needless to say, gcache.recover is a much-needed feature, given that it saves SST pain. (Thanks, Codership.) It would be good to see if the feature can be optimized to work with gcache pages.
And yes, Percona XtraDB Cluster inherits this feature in its upcoming release.