MySQL Crash RecoveryPeter Zaitsev
MySQL is known for its stability but as any other application it has bugs so it may crash sometime. Also operation system may be flawed, hardware has problems or simply power can go down which all mean similar things – MySQL Shutdown is unexpected and there could be various inconsistences. And this is not only problem as we’ll see.
MySQL has angel process mysqld_safe which will restart MySQL Server in most cases. It is great, unless you have run into some bug which causes it to crash again – such crashes qucikly following one another are kind of worse because they explore many less tested code paths in MySQL and so problem potential is larger.
So lets look at the problem which happen during the crash which might need to take care of or which may seriously affect MySQL Performance.
MyISAM Corruption – If you’re writting to MyISAM tables there is very large chance of them becoming corrupted during the crash. Note corruption may be hidden and do not expose itself instantly – you may notice wrong query results days after crash. Sometimes corrupted tables may be reason for further crashes or hangs, and corruption may spread itself further in the table. You probably do not want any of these so it is very good idea to run MySQL with myisam_recover option which will make sure all improperly closed MyISAM tables are checked first time it is accessed. This option is however rather painful to use with web applications – users may issue different queries which may trigger check/repair running for many tables at onces, which typically make system extremely slow and also can use up all allowed connections or run out of memory ( myisam_sort_buffer_size is normally set pretty lage). If this becomes the problem I use tiny script which moves out all MyISAM tables out of MySQL database directory, checks them with MyISAMchk and moves them back to running server. This looks scary but it works great – until table is checked and ready application gets error rather than stalling forever which allows application to become partially functional as soon as possible. This hack is needed only in some cases – in most cases using Innodb for tables which you need to be recovered fast is better solution.
Innodb Recovery – Unless you have some hardware problems (99%) or found new Innodb bug (1%) Innodb recovery should be automatic and bring your database to consistent state. Depending on innodb_flush_lot_at_trx_commit setting you may lose few last committed transactions but it is it. It is Performance of this process which may cause the problems. As I already wrote innodb_log_file_size and innodb_buffer_pool_size affect recovery time significantly as well as your workload. I should also mention if you have innodb_file_per_table=1 your recovery speed will depend on number of Innodb tables you have, as well as many other operations, so beware.
Binary log corruption – Binary log may become corrupted and out of sync with database content. This will sometimes break replication but if you’re just planning on using binary log for point in time recovery it can go unnoticed. sync_binlog Is helping by syncing binary log, but at performance penalty. If using Innodb you also might with to use innodb-safe-binlog option in MySQL 4.1 so your Innodb log and binary log are synchronized. In MySQL 5.0 XA is taking care of this synchronization.
.frm Corruption – Few people know MySQL is not really ACID even with Innodb tables, at least not for DDL statements.
There is a chance of failing for example during CREATE statement with table created in Innodb dictionary but .frm not created or not completely written. Partially written .frm files or .frm being unsync with internal Innodb dictionary may cause MySQL to fail with wierd error messages. In MySQL 4.1 sync_frm option was added which reduces this problem as time window when it can happen is much less. Still if failure happens just during writting .frm file nasty things may happen, not to mention such potentially multiple operation DDL statements as RENAME TABLE – these are most vulnerable.
master.info corruption – If slave happens to crash you can also have relay logs corruption and master.info being corrupted. Not to mention MyISAM tables can contain partially completed statements as well as some of updates totally lost. The safe approach it to reclone the slaves if they crash or you can take the risks and try to continue. Sometimes you might be able to manually find appropriate position even if master.info file is out of sync but I would not be basing my failure handling scenarios.
Cold Start – If you restart MySQL server its caches (key_buffer, innodb_buffer_pool, query_cache,table_cache) are cleaned, so may be OS caches. This may reduce performance dramatically. So if you’re bringing server back after crash you might want to populate caches. For MyISAM key_cache this can be done by using LOAD INDEX INTO CACHE statement, for other storage engines it can be done by issuing large index scan queries. Full table scan queries allow to preload table data ether in storage engine caches or in OS cache. You can save these into .sql file and use –init-file to make sure it is run on startup. The other approach is to prime server with real servers (ie clone queries from other slave) before putting traffic to it.
In case application is not highly available so there is only one server you might with to start serving only some users initially (returning error to others) and gradually increase the load as server warms up. This may sound strange but makes a lot of sense as not only waiting for pages which never load is more frustrating for users than getting honest “try again later” message, but also – warmup takes longer time on extreme load.
Innodb statistics – Unlike MyISAM Innodb does not store index cardinality in tables, instead it computes them on first table access after startup. This may take significant time if you have very large number of tables (Some users have hundreds of thousands of tables per database host). This one is pretty much part of cold start problems but I wanted to point out it separately. To warmup this data you might run select 1 from _table_ limit 1 for each table or any other statement – it is table open which is important.
There are other problems which you may experience related to MySQL Crash Recovery – Restoring data from backup, corrupted Innodb tablespace recovery etc but I should write about them some other time.