September 15, 2014

Making replication a bit more reliable

Running MySQL slave is quite common and regular task which we do every day, taking backups from slave is often recommended solution. However the current state of MySQL replication makes restoring slave a bit tricky (if possible at all). The main problem is that InnoDB transaction state and replication state are not synchronized. If we speak about backup and you can execute SHOW SLAVE STATUS command you can get reliable information about current state, but some solutions does not allow that. Look for example Sun Storage 7410, which provides storage via NFS and where you can make ZFS snapshots without any info what kind of data you are storing there. What makes situation worse is that files with replication state (relay-log.info, master.info) are not synchronized on disk after each update, and even wrose – in case with NFS they are stored on client side OS/NFS cache for long time. As solution we can do patch to execute fsync() for these files after each write, but I can’t predict how much performance penalty we will see here, I expect it will be very significant.

Our idea is not new, it was taken from TransactionalReplication http://code.google.com/p/google-mysql-tools/wiki/TransactionalReplication patch and http://bugs.mysql.com/bug.php?id=34058 bug report. Basically we want to store the state of replication in InnoDB transactional log file, in this case we will be able to see what position in replication the last executed transaction corresponds to. Of course it will work if you have writes exclusively to InnoDB storage engine, case with mix of storage engines is much more complex and I do not see easy way to solve it.

So we propose overwrite_relay_log_info extension for XtraDB storage engine, the name comes from fact that XtraDB will try to rewrite relay-log.info by replication position, or at least this info is available in error-log output, so you can repoint your slave to correct position by executing CHANGE MASTER command.

Currently patch is available in Launchpad lp:~percona-dev/percona-xtradb/overwrite-relay-log-info , and after testing it will go to main XtraDB tree. If there is interest the patch will be ported to 5.0 and 5.1 trees. Some info available at this documentation page.

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Clould Tools, the Percona Server, Percona XraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Very nice!!

  2. it is my favorite feature from google patch ;-)

  3. Alex P. says:

    hi, i have a ‘Master’ DB replicating to 3 ‘Slaves’ DB (with InnoDB engine) and it’s a bit difficult to restore when some slave stops replicate… if im not wrong this “patch/feature” will help me to restore replication rigth?
    thanks in advance and sry for my ‘not very good’ english :)
    btw, great mysql blog.

  4. Vadim says:

    Alex,

    It may help, or may not. To be sure I need to know what exact problem you have with restoring replication.

  5. Alex P. says:

    hi. Sometimes i got (in slave) the “Duplicate key entry” error (i see this in the ‘show slave status’ Last_error) always with an insert query. Dont know exactly why but just happened in 1 of 3 slave, and the others was fine and running normally. With Myisam just need “load data from master” but with InnoDB we have to download and run a backup of the Master DB and it takes some minutes….. i have this problem around 1 time every 3 months but im looking for a better solution..
    thanks again.

  6. Gil says:

    Alex, I have run into similar duplicate key errors in the past. It turns out the slave I was using to take backups had slightly different data than the master. You may want to do a consistency check using Maatkit just to be sure.

  7. Vadim says:

    Alex P, Gil

    Actually this patch can fix that issue.

    The problem is when your slave crash or some another problem – relay-log.info contains old information, and when you do slave start – slave tries to repeat transactions which were already executed (and you getting Duplicate key error).

    Our patch tries to fix that and change relay-log.info to the real last transaction.

    However Dublicate key error may be related to different issues also, so I can’t say for sure our patch is solution.

  8. Alex P. says:

    yes i know the “duplicate key error” may be related to different issues (sometimes its isp problem, manual inserts, our slaves uses dyndns and the update of the service sometime make it crash, etc ), the way i resolve the crash is making a backup of the master, send this backup to the slave (via sftp) then i execute this backup, delete the ‘relay’ files and restart the mysql service… after that slave is runnning ok again. I know this “solution” is not very good (what do you think? ) so im trying to find another one, that’s why i think this patch/feature will help me.

  9. Ernesto Vargas says:

    Alex P,

    You could also do:

    STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;

    So the replication will jump that Dup Key error and continue with the next statement of the replication.

  10. Alex P. says:

    thank you. i will try it and find if can solve the problem.

    :-)

  11. Aric says:

    Also, in my.cnf you can skip replication errors.

    slave-skip-errors = all

  12. These “solutions” make replication LESS reliable and MORE likely to have further problems! FIX the problem, don’t hide it.

  13. @Aric,

    Skipping errors by specifying “slave-skip-errors = all” in my.cnf is quite dangerous, since you will not even be aware of replication issues; with Ernesto’s solution, you are at least forced into manual intervention, which may supply you with enlightenment about the problem’s origins.

    Instead, your replicated data will keep on moving away from master’s data.

  14. Robin says:

    Hello,everybody.
    Does mysql have a feature to customize the slave port? I see my slave port is random.
    Thank you!

  15. James Parks says:

    Your slave port is random because you’re looking at it’s ephemeral port. The slave does not listen for incoming connections from the master, but is rather a client of the master.

    I imagine you were asking this question because you were trying to figure out which ports to allow through your firewall. If you’ll allow all established TCP connections, you’ll solve this problem. In IPtables, it would look something like this:

    -A INPUT -m state –state RELATED,ESTABLISHED -j ACCEPT

    I know it’s been a long time since you posted, but thought someone else might find this useful.

Speak Your Mind

*