Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Finding a good IST donor in Percona XtraDB Cluster 5.6

January 8, 2014

Author

Jay Janssen

MySQL

Share this Post:

Gcache and IST

The Gcache is a memory-based cache of recent Galera transactions that is local to each node in a cluster. If a node leaves and rejoins the cluster, it can use the gcache from another node that stayed in the cluster (i.e., its donor node) to fetch the transactions it missed (IST) as opposed to doing a full state snapshot transfer (SST). However, there are a few nuances that are not obvious to the beginner:

- The Gcache is lost when a node restarts

- The Gcache is fixed size and implemented as a LRU. Once it is full, older transactions roll off.

- Donor selection is made irregardless of the gcache state

- If the given donor for a restarting node doesn’t have all transactions needed, a full SST (read: full backup) is done instead

- Until recent developments, there was no way to tell what, precisely, was in the Gcache.

So, with (somewhat) arbitrary donor selection, it was hard to be certain that a node restart would not trigger a SST. For example:

- A node crashed over night or was otherwise down for some length of time. How do you know if the gcache on any node is big enough to contain all the transactions necessary for IST?

- If you brought two nodes in your cluster simultaneously, the second one you restart might select the first one as its donor and be forced to SST.

Along comes PXC 5.6.15 RC1

Astute readers of the PXC 5.6.15 release notes will have noticed this little tidbit:

New wsrep_local_cached_downto status variable has been introduced. This variable shows the lowest sequence number in gcache. This information can be helpful with determining IST and/or SST.

Until this release there was no visibility into any node’s Gcache and what was likely to happen when restarting a node. You could make some assumptions, but now it its a bit easier to:

1. Tell if a given node would be a suitable donor

1. And hence select a donor manually using wsrep_sst_donor instead of leaving it to chance.

What it looks like

Suppose I have a 3 node cluster where load is hitting node1. I execute the following in sequence:

1. Shut down node2

1. Shut down node3

1. Restart node2

At step 3, node1 is the only viable donor for node2. Because our restart was quick, we can have some reasonable assurance that node2 will IST correctly (and it does).

However, before we restart node3, let’s check the oldest transaction in the gcache on nodes 1 and 2:

[root@node1 ~]# mysql -e "show global status like 'wsrep_local_cached_downto';"
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_cached_downto | 889703 |
+---------------------------+--------+

[root@node2 mysql]# mysql -e "show global status like 'wsrep_local_cached_downto';"
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| wsrep_local_cached_downto | 1050151 |
+---------------------------+---------+

[root@node1 ~]# mysql -e "show global status like 'wsrep_local_cached_downto';"

+---------------------------+--------+

| Variable_name | Value |

+---------------------------+--------+

| wsrep_local_cached_downto | 889703 |

+---------------------------+--------+

[root@node2 mysql]# mysql -e "show global status like 'wsrep_local_cached_downto';"

+---------------------------+---------+

| Variable_name | Value |

+---------------------------+---------+

| wsrep_local_cached_downto | 1050151 |

+---------------------------+---------+

So we can see that node1 has a much more “complete” gcache than node2 does (i.e., a much smaller seqno). Node2’s gcache was wiped when it restarted, so it only has transactions from after its restart.

To check node3’s GTID, we can either check the grastate.dat, or (if it has crashed and the grastate is zeroed) use –wsrep_recover:

[root@node3 ~]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    7206c8e4-7705-11e3-b175-922feecc92a0
seqno:   1039191
cert_index:

[root@node3 ~]# mysqld_safe --wsrep-recover
140107 16:18:37 mysqld_safe Logging to '/var/lib/mysql/error.log'.
140107 16:18:37 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140107 16:18:37 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.pIVkT4' --pid-file='/var/lib/mysql/node3-recover.pid'
140107 16:18:39 mysqld_safe WSREP: Recovered position 7206c8e4-7705-11e3-b175-922feecc92a0:1039191
140107 16:18:41 mysqld_safe mysqld from pid file /var/lib/mysql/node3.pid ended

[root@node3 ~]# cat /var/lib/mysql/grastate.dat

# GALERA saved state

version: 2.1

uuid: 7206c8e4-7705-11e3-b175-922feecc92a0

seqno: 1039191

cert_index:

[root@node3 ~]# mysqld_safe --wsrep-recover

140107 16:18:37 mysqld_safe Logging to '/var/lib/mysql/error.log'.

140107 16:18:37 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

140107 16:18:37 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.pIVkT4' --pid-file='/var/lib/mysql/node3-recover.pid'

140107 16:18:39 mysqld_safe WSREP: Recovered position 7206c8e4-7705-11e3-b175-922feecc92a0:1039191

140107 16:18:41 mysqld_safe mysqld from pid file /var/lib/mysql/node3.pid ended

So, armed with this information, we can tell what would happen to node3, depending on which donor was selected:

Donor selected	Donor’s gcache oldest seqno	Node3’s seqno	Result for node3
node2	1050151	1039191	SST
node1	889703	1039191	IST

So, we can instruct node3 to use node1 as its donor on restart with wsrep_sst_donor:

[root@node3 ~]# service mysql start --wsrep_sst_donor=node1

1	[root@node3 ~]# service mysql start --wsrep_sst_donor=node1

Note that passing mysqld options on the command line is only supported in RPM packages, Debian requires you put that setting in your my.cnf. We can see from node3’s log that it does properly IST:

2014-01-07 16:23:26 19834 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.70.4:4568
2014-01-07 16:23:26 19834 [Note] WSREP: Node 0.0 (node3) requested state transfer from 'node1'. Selected 2.0 (node1)(SYNCED) as donor.
...
2014-01-07 16:23:27 19834 [Note] WSREP: Receiving IST: 39359 writesets, seqnos 1039191-1078550
2014-01-07 16:23:27 19834 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.15-56'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona XtraDB Cluster (GPL), Release 25.2, Revision 645, wsrep_25.2.r4027
2014-01-07 16:23:41 19834 [Note] WSREP: IST received: 7206c8e4-7705-11e3-b175-922feecc92a0:1078550

2014-01-07 16:23:26 19834 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.70.4:4568

2014-01-07 16:23:26 19834 [Note] WSREP: Node 0.0 (node3) requested state transfer from 'node1'. Selected 2.0 (node1)(SYNCED) as donor.

...

2014-01-07 16:23:27 19834 [Note] WSREP: Receiving IST: 39359 writesets, seqnos 1039191-1078550

2014-01-07 16:23:27 19834 [Note] /usr/sbin/mysqld: ready for connections.

Version: '5.6.15-56' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), Release 25.2, Revision 645, wsrep_25.2.r4027

2014-01-07 16:23:41 19834 [Note] WSREP: IST received: 7206c8e4-7705-11e3-b175-922feecc92a0:1078550

Sometime in the future, this may be handled automatically on donor selection, but for now it is very useful that we can at least see the status of the gcache.

0 0 votes

Article Rating

7 Comments

Oldest

Newest Most Voted

Admin

Peter Zaitsev

12 years ago

Jay,

Great information.

I wonder on the other topic – if we’re having multiple nodes starting SST at the same time are they going to use single node multiple nodes in the cluster or there is no guarantee ?

Author

Jay Janssen

12 years ago

Peter,
The rules as I understand them are thus:

– A given donor can only donate to one joiner at a time.
– Multiple joiner/donor pairs can exist in the same cluster as long as there are available donors (i.e., two nodes can join two existing nodes and SST simultaneously)
– If there are no available donors (i.e., they are all busy doing other donations), the joiner blocks until one becomes available. There maybe some timeout in play here.

Author

Jay Janssen

12 years ago

I believe the above rules hold true for both IST and SST.

Alex

12 years ago

Yes, for the moment it holds for both SST and IST. Although for IST we should be able to relax that.

Rick James

12 years ago

Perhaps Galera could (should) do the steps you suggest automagically?

For the messy case you mentioned, do something like this… For each possible donor, estimate how long it would take to finish the IST/SST.
* Donor available and IST possible: estimate amount to transfer.
* Donor available and SST required: estimate how long SST would take.
* Machine is busy: Estimate how long before it will finish with the current transfer, then add on what it would take (IST/SST) to do the desired transfer.
Then pick the one that would finish the task fastest. So, in your example… If the IST is deemed significantly faster than SST, it should decide to wait node1 to finish the first IST, then do a second IST.

Alex

12 years ago

He-he, “estimate”… 😉

However, in great majority of cases IST is way better than SST, if for nothing else, then for having least impact on donor. And several concurrent ISTs from a single donor is possible, SST can be only one. So we are working on it.

Anil

11 years ago

Hi Jay,

you said: “The Gcache is lost when a node restarts”

Does that mean what’s written into galera.cache won’t persist between node restarts?

I am having trouble booting up nodes after a graceful cluster shutdown. I have a 3 node cluster. I’ve shutdown all 3 nodes gracefully. Then I bootstrapped cluster using the most advanced node. And then, when I booted the second node, it tried to initiate an IST but the primary component replied it didn’t have enough write sets in its gcache, and started SST instead.

Now, gcache size is pretty big and node_2 was only behind by a handful of transactions. How come node_1 didn’t have enough cache stored for IST?

Then I ran into this blog post.

If the gcache gets reset between node restarts, nodes will not be able to do IST any more, after an entire cluster restart. Isn’t it?