September 2, 2014

Feature in details: Incremental state transfer after a node crash in Percona XtraDB Cluster

With our newest release of Percona XtraDB Cluster, I would like to highlight a very nice ability to recovery a node and bring it back to the cluster with an incremental transfer after a crash.
This feature was available even in previous release, but now I want to give some details.

So, MySQL crashes from time to time and this a fact of life. HA solution is exactly needed to deal with an one node failure and allowing whole cluster continuing to work.

The idea is, if a node crashed, after it recovered – we just transfer all changes that happened in the cluster, while the node was down. It sounds easy in words, but proven hard when it comes to implementation. It all comes to the question: if mysqld crashes, how do we know what is the last transaction was executed. For a single InnoDB instance it is easy, there is always LSN, which is used for recovery, but in a cluster all nodes have their individual LSNs. Instead Cluster uses Global Transaction ID (GTID), in form
50176f05-69b5-11e2-0800-930817fe924a:8549230.
So, how can we store GTID so it is available after a crash? Of course there is always a good all way to store it in a separate file, which however will require an additional fsync call for each transaction, and it is know performance killer.

Instead, we store GTID in InnoDB system area, which is updated for each transaction. So even the system crashes, we can access information about the last commited transaction.
In XtraDB Cluster you can access this information by calling mysqld with option:
mysqld --wsrep-recover
and having this information, we can force the node to start forcing to use initial GTID:, i.e
mysqld --wsrep_start_position=50176f05-69b5-11e2-0800-930817fe924a:8549230

In fact the same methodology can be used if to restore nodes from backup.
We can start all nodes from an identical starting position, so they all will assume to start on identical data. Well, you can do it even on not identical data, but you know that you do not have a consistent cluster in this case.

As it all may sound complicated, this logic
--wsrep-recover / --wsrep_start_position= is implemented in mysqld_safe script, so you have it out of box.

Here is how a start process looks like in an error.log:

So what happens there? Basically at start node detects that the last transaction had GTID:
50176f05-69b5-11e2-0800-930817fe924a:8549230, but joining cluster, it figures out
that the cluster already at position
50176f05-69b5-11e2-0800-930817fe924a:8549242,
and to catch up, the node has to recieve 12 events, which succesfully happen there:
130129 23:02:01 [Note] WSREP: Receiving IST: 12 writesets, seqnos 8549230-8549242

After applying 12 events locally the node is ready and succesfully joins cluster.

You may try it with the latest release Percona XtraDB Cluster 5.5.29

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Clould Tools, the Percona Server, Percona XraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. One thing that isn’t always clear is when you need to recover the wsrep position. Typically the answer is “if you know the node crashed”, but that may not always be clear and to avoid an SST, you want to be very sure your node has the best possible change of ISTing before you issue a restart.

    Why? Because as soon as PXC decides to do an SST, it starts dutifully overwriting your datadir and once that starts, you have to continue with a full SST.

    To do this, you can check your grastate.dat:

    [root@node3 ~]# cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid: 8d211006-5bf5-11e2-0800-067f71542765
    seqno: -1
    cert_index:

    On a running node, the seqno is always ‘-1′. If the node crashes, this remains -1, and this will trigger a full SST on restart unless you intervene. However, if you see: a positive integer there, then this would be from mysqld being shut down cleanly.

    In addition to starting mysqld or mysqld_safe directly on the command line as Vadim suggests, you can modify the grastate.dat directly to match your recovered sequence number directly and just restart mysql via your init script:

    [root@node3 ~]# cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid: 8d211006-5bf5-11e2-0800-067f71542765
    seqno: -1
    cert_index:

    So we can see our seqno is -1, this is asking for an SST (this assumes mysqld is not still running). Let’s find out where it left off using –wsrep-recover:

    [root@node3 ~]# mysqld_safe –wsrep-recover
    130131 12:44:17 mysqld_safe Logging to ‘/var/lib/mysql/error.log’.
    130131 12:44:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    130131 12:44:17 mysqld_safe WSREP: Running position recovery with –log_error=/tmp/tmp.YLnmVW4zAl
    nohup: ignoring input and appending output to `nohup.out’
    130131 12:44:23 mysqld_safe WSREP: Recovered position 8d211006-5bf5-11e2-0800-067f71542765:314773
    130131 12:44:28 mysqld_safe mysqld from pid file /var/lib/mysql/node3.pid ended

    So I know the position, now I can carefully edit grastate.dat (I use vi) and update the uuid and seqno fields):

    [root@node3 ~]# cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid: 8d211006-5bf5-11e2-0800-067f71542765
    seqno: 314773
    cert_index:

    [root@node3 ~]# service mysql start
    Starting MySQL (Percona XtraDB Cluster).. SUCCESS!
    [root@node3 ~]# grep IST /var/lib/mysql/error.log | tail -n 1
    130131 12:49:10 [Note] WSREP: IST received: 8d211006-5bf5-11e2-0800-067f71542765:318688

    The caveat to relying on –wsrep-recover is that it pulls it’s state from the Innodb redo logs. Any updates to any MyISAM tables, or any ongoing DDL (ALTER or similar) are not transactional. It’s probably the case that wsrep-recovery can be relied on most of the time since those types of changes are normally rare, but it is not a silver bullet.

  2. FTR, I found what I think is a bug where wsrep_start_position will not work: https://bugs.launchpad.net/codership-mysql/+bug/1112724

  3. guimaluf says:

    thanks very much jay janssen. editing /var/lib/mysql/grastate.dat directly really solved my problem!

Speak Your Mind

*