Troubleshooting Relay Log Corruption in MySQL

PREVIOUS POST
NEXT POST

Have you ever seen the replication stopped with message like this:

Last_Error: Could not parse relay log event entry. The possible reasons are: the master’s binary log is corrupted (you can check this by running ‘mysqlbinlog’ on the binary log), the slave’s relay log is corrupted (you can check this by running ‘mysqlbinlog’ on the relay log), a network problem, or a bug in the master’s or slave’s MySQL code. If you want to check the master’s binary log or slave’s relay log, you will be able to know their names by issuing ‘SHOW SLAVE STATUS’ on this slave.

This is relay relay log corruption and you can check details in the MySQL Error log file. The error message describes few reasons and indeed because there is little validation (ie no checksums) in the replication there are multiple reasons for bad event to show up in relay logs.

Really this is only one of various error messages you could see if relay log corrupted. You could also see malformed queries (with some junk), complaining about event to big etc if there is a garbage in relay logs.

If relay logs are corrupted it is surely worth to check what could cause it – it could be network (especially if replicating over unreliable long distance networks), MySQL bugs on master or slave, hardware problems and few others. In any case it is worth investigating.

Investigating is what you do later but how do you fix the problem first ? The important question you need to have answered – are logs corrupted on the master ? If logs on the master are OK you can just run SHOW SLAVE STATUS on slave experiencing error and use CHANGE MASTER TO to re-point replication to Relay_Master_Log_File:Exec_Master_Log_Pos:

This will purge existing relay logs re-fetch all events which have not been executed yet. Doing this command make sure your master is operational and it still has all the logs needed to re-fetch events.

How would you know if logs are OK on the master ? Well in this case there were probably 5 another slaves which did not have the problem – which means Master is most likely OK. In any case it is little harm to try restarting from the same position – if logs are bad on the master you would get the same error message again and can continue with investigation.

What if logs on the master are corrupted ? In this case you have couple of choices (and you also potentially have multiple slaves to deal with). You can use mysqlbinlog (or you favorite hex editor if mysqlbinlog does not work) to find the next event start and potentially recover “corrupted” event to be manually executed on the slaves.

Skipping around event makes master and slave potentially inconsistent and you should access the risks depending on applications (and on amount of events which were corrupted) you may want to let replication continue from the new position or resync the slaves to the master.

How can you recover the slave ? As all slaves are likely to be affected in this case you can’t clone another slave. You also can’t use classical method of recovery from backup – because you would need relay logs to roll forward, and they are corrupted. You can either re-clone the data from Master. (This is where LVM or similar techniques can help you a lot) or skip bad events as described and when use Maatkit mk-table-checksum to check what tables are out of sync and when use mk-table-sync to resync them.

Last method works in particularly well in case you can afford to run for a while with slaves which are a bit out of sync, which is quite often better than having just master available (also having extra load of data copied from it).

PREVIOUS POST
NEXT POST

Comments

  1. Josue says

    Hello Peter,

    This apply to situations where you have the following error:
    “Error reading packet from server: Client requested master to start replication from impossible position ( server_errno=1236)”

    Tnkx for this great post.

    JD

  2. says

    Dude, you rock. You saved my life. I was planning to spend another night trying to figure out what went wrong and with your clear and pragmatic explanations I solved the problem in no time at all.

  3. Justin says

    Let me also say thank you for writing this up so thoroughly. I was *mostly* sure it would work, but it’s always good to get confirmation from an expert! This site has saved me so much time and trouble, I really appreciate it.

  4. tony balleras says

    hi peter,

    thanks for this information. saved my life. by the way when i executed your script i was able to get a syntax error.
    it seems the Relay_Master_Log_File should be ‘Relay_Master_Log_File’ (notice the single qoute).

    anyway thanks peter.

    tony

  5. Pan Admin says

    Петя, ты мой герой! Избавил мой сайт от двухчасовой пересинхронизации слейва.
    PS. Гори в аду, мастерхост.

  6. Rob says

    Great intructions, but missing what would have been useful information in my scenerio.

    If you expire your binary logs after X amount of days (expire_logs_days=X) on your Master server and your Slave replication gets hung up on a Relay_Master_Log_File which no longer exists, as it has been expired. You can use:

    Change MASTER to RELAY_LOG_FILE = ‘Relay_Log_File’, RELAY_LOG_POS = ‘Relay_Log_Pos';

    to use the Relay logs which have been stacking up on the Slave replication.

    Also, note this bug applied to my case as well: http://bugs.mysql.com/bug.php?id=55460

  7. says

    Hi Saved my life too , and it worked so smoothly that i could not believe.

    In my case master is not corrupted as , i have more slaves which did not get the similar error.

    Is there any way to troubleshoot at slave level.

    Thanks

    Suyash Jain
    http://linuxhacks.in

Leave a Reply

Your email address will not be published. Required fields are marked *