Feature in details: Incremental state transfer after a node crash in Percona XtraDB Cluster
With our newest release of Percona XtraDB Cluster, I would like to highlight a very nice ability to recovery a node and bring it back to the cluster with an incremental transfer after a crash.
This feature was available even in previous release, but now I want to give some details.
So, MySQL crashes from time to time and this a fact of life. HA solution is exactly needed to deal with an one node failure and allowing whole cluster continuing to work.
The idea is, if a node crashed, after it recovered – we just transfer all changes that happened in the cluster, while the node was down. It sounds easy in words, but proven hard when it comes to implementation. It all comes to the question: if mysqld crashes, how do we know what is the last transaction was executed. For a single InnoDB instance it is easy, there is always LSN, which is used for recovery, but in a cluster all nodes have their individual LSNs. Instead Cluster uses Global Transaction ID (GTID), in form
50176f05-69b5-11e2-0800-930817fe924a:8549230
.
So, how can we store GTID so it is available after a crash? Of course there is always a good all way to store it in a separate file, which however will require an additional fsync
call for each transaction, and it is know performance killer.
Instead, we store GTID in InnoDB system area, which is updated for each transaction. So even the system crashes, we can access information about the last commited transaction.
In XtraDB Cluster you can access this information by calling mysqld with option:
mysqld --wsrep-recover
and having this information, we can force the node to start forcing to use initial GTID:, i.e
mysqld --wsrep_start_position=50176f05-69b5-11e2-0800-930817fe924a:8549230
In fact the same methodology can be used if to restore nodes from backup.
We can start all nodes from an identical starting position, so they all will assume to start on identical data. Well, you can do it even on not identical data, but you know that you do not have a consistent cluster in this case.
As it all may sound complicated, this logic
--wsrep-recover / --wsrep_start_position=
is implemented in mysqld_safe
script, so you have it out of box.
Here is how a start process looks like in an error.log:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
130129 23:01:22 mysqld_safe Starting mysqld daemon with databases from /mnt/data/mysql 130129 23:01:22 mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.l1zf2BIDja 130129 23:01:45 mysqld_safe WSREP: Recovered position 50176f05-69b5-11e2-0800-930817fe924a:8549230 130129 23:01:45 mysqld_safe WSREP: Recovered position 50176f05-69b5-11e2-0800-930817fe924a:8549230 130129 23:01:45 [Note] WSREP: wsrep_start_position var submitted: '50176f05-69b5-11e2-0800-930817fe924a:8549230' 130129 23:01:45 [Note] WSREP: Read nil XID from storage engines, skipping position init 130129 23:01:45 [Note] WSREP: wsrep_load(): loading provider library '/usr/local/mysql/lib/libgalera_smm.so' 130129 23:01:45 [Note] WSREP: wsrep_load(): Galera 2.3(r143) by Codership Oy <info@codership.com> loaded succesfully. ... 130129 23:01:45 [Note] WSREP: Setting initial position to 50176f05-69b5-11e2-0800-930817fe924a:8549230 130129 23:01:56 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 8549242) 130129 23:01:56 [Note] WSREP: State transfer required: Group state: 50176f05-69b5-11e2-0800-930817fe924a:8549242 Local state: 50176f05-69b5-11e2-0800-930817fe924a:8549230 .... 130129 23:01:58 [Note] WSREP: Prepared IST receiver, listening at: tcp://208.88.225.242:4568 130129 23:01:58 [Note] WSREP: Node 5 (node3) requested state transfer from '*any*'. Selected 0 (node1)(SYNCED) as donor. 130129 23:01:58 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 8549306) 130129 23:01:58 [Note] WSREP: Requesting state transfer: success, donor: 0 130129 23:01:58 [Note] WSREP: SST complete, seqno: 8549230 ... 130129 23:02:01 [Note] WSREP: Receiving IST: 12 writesets, seqnos 8549230-8549242 130129 23:02:01 [Note] /usr/local/mysql/bin/mysqld: ready for connections. |
So what happens there? Basically at start node detects that the last transaction had GTID:
50176f05-69b5-11e2-0800-930817fe924a:8549230
, but joining cluster, it figures out
that the cluster already at position
50176f05-69b5-11e2-0800-930817fe924a:8549242
,
and to catch up, the node has to recieve 12 events, which succesfully happen there:
130129 23:02:01 [Note] WSREP: Receiving IST: 12 writesets, seqnos 8549230-8549242
After applying 12 events locally the node is ready and succesfully joins cluster.
You may try it with the latest release Percona XtraDB Cluster 5.5.29
Comments (4)
One thing that isn’t always clear is when you need to recover the wsrep position. Typically the answer is “if you know the node crashed”, but that may not always be clear and to avoid an SST, you want to be very sure your node has the best possible change of ISTing before you issue a restart.
Why? Because as soon as PXC decides to do an SST, it starts dutifully overwriting your datadir and once that starts, you have to continue with a full SST.
To do this, you can check your grastate.dat:
[root@node3 ~]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 8d211006-5bf5-11e2-0800-067f71542765
seqno: -1
cert_index:
On a running node, the seqno is always ‘-1’. If the node crashes, this remains -1, and this will trigger a full SST on restart unless you intervene. However, if you see: a positive integer there, then this would be from mysqld being shut down cleanly.
In addition to starting mysqld or mysqld_safe directly on the command line as Vadim suggests, you can modify the grastate.dat directly to match your recovered sequence number directly and just restart mysql via your init script:
[root@node3 ~]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 8d211006-5bf5-11e2-0800-067f71542765
seqno: -1
cert_index:
So we can see our seqno is -1, this is asking for an SST (this assumes mysqld is not still running). Let’s find out where it left off using –wsrep-recover:
[root@node3 ~]# mysqld_safe –wsrep-recover
130131 12:44:17 mysqld_safe Logging to ‘/var/lib/mysql/error.log’.
130131 12:44:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
130131 12:44:17 mysqld_safe WSREP: Running position recovery with –log_error=/tmp/tmp.YLnmVW4zAl
nohup: ignoring input and appending output to `nohup.out’
130131 12:44:23 mysqld_safe WSREP: Recovered position 8d211006-5bf5-11e2-0800-067f71542765:314773
130131 12:44:28 mysqld_safe mysqld from pid file /var/lib/mysql/node3.pid ended
So I know the position, now I can carefully edit grastate.dat (I use vi) and update the uuid and seqno fields):
[root@node3 ~]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 8d211006-5bf5-11e2-0800-067f71542765
seqno: 314773
cert_index:
[root@node3 ~]# service mysql start
Starting MySQL (Percona XtraDB Cluster).. SUCCESS!
[root@node3 ~]# grep IST /var/lib/mysql/error.log | tail -n 1
130131 12:49:10 [Note] WSREP: IST received: 8d211006-5bf5-11e2-0800-067f71542765:318688
The caveat to relying on –wsrep-recover is that it pulls it’s state from the Innodb redo logs. Any updates to any MyISAM tables, or any ongoing DDL (ALTER or similar) are not transactional. It’s probably the case that wsrep-recovery can be relied on most of the time since those types of changes are normally rare, but it is not a silver bullet.
FTR, I found what I think is a bug where wsrep_start_position will not work: https://bugs.launchpad.net/codership-mysql/+bug/1112724
thanks very much jay janssen. editing /var/lib/mysql/grastate.dat directly really solved my problem!
all methods mentioned in comments did not work with galera-3 and mariadb 10.0-17 on ubuntu 14.04. I had to restore db backup to a newly install mariadb cluster