September 16, 2014

White Paper: Preventing MySQL Emergencies

About a year ago, I started a study of emergency incidents that our customers filed with us. What I found was really surprising, and defied conventional wisdom. I learned a lot about preventing emergencies. I just published the outcome as a white paper, including checklists that you can use and modify for your own servers. (Analysis of the nature and causes of emergencies is due to be published in the next issue of IOUG’s quarterly SELECT journal.)

About Baron Schwartz

Baron is the lead author of High Performance MySQL.
He is a former Percona employee.

Comments

  1. Baron Schwartz says:

    MySQL doesn’t always do it reliably, or someone messes things up and causes it to fail. For example, someone changes server configuration, and as a result, the server doesn’t read the same information after restart that it wrote before shutdown. I’ve seen it fail a number of times in interesting ways — these are just examples.

  2. Vojtech Kurka says:

    “Before shutting down the server, stop replication
    and issue SHOW SLAVE STATUS. Save the result to
    a file and refer to it after restart to ensure that replication starts in the correct position”

    Can please anyone explain me, why to do this? AFAIK mysql saves this information into the master.info file, so you only need to sync it to disk before you power the machine off (which does the OS for you).
    I probably missed some other reasons to do this?

    Vojtech

  3. Sean C says:

    Nice paper. Most production issues/outages can avoided with proper controls and checklists. A book I highly recommend reading and that compliments this post: Visible Ops Handbook (http://www.amazon.com/Visible-Ops-Handbook-Starting-Practical/dp/0975568604)

  4. Thanks for the link. I believe that the _SELECT_ article, when it is published, will further validate your viewpoint. The TL;DR version is that most emergencies in databases are caused by uncontrolled changes.

  5. Wow! Thank for this effort!

    May I add some shameless plugs?

    With regard to shutdowns (3.5, 3.6): openark kit’s oak-prepare-shutdown will perform both actions: reducing the dirty pages percent until no improvement is seen, as well as verifying no temporary tables are open on slave. It will also gracefully stop replication beforehand.

    check configuration before shutdown (3.8): mycheckpoint (monitoring tool for MySQL) records all status variables, as well as all server variables. By querying a single view, you get the entire history of parameter changes, timestamps and all. Actually, it is also easy to cross reference with the “uptime” value per entry, so you may easily relate changes that were suspiciously close to a restart (indication someone may have changed a variable dynamically yet forgot to update cnf file).

    Security (10.4, 10.4, 10.5) all these and *many* others (e.g. different users sharing the same password) are handled by openark kit’s oak-security-audit. Basically, the tool provides you with a report and recommendations on your MySQL’s security flaws.

  6. Vojtech Kurka says:

    “Before shutting down the server, stop replication
    and issue SHOW SLAVE STATUS. Save the result to
    a file and refer to it after restart to ensure that replication starts in the correct position”

    Can please anyone explain me, why to do this? AFAIK mysql saves this information into the master.info file, so you only need to sync it to disk before you power the machine off (which does the OS for you).
    I probably missed some other reasons to do this?

    Vojtech

  7. MySQL doesn’t always do it reliably, or someone messes things up and causes it to fail. For example, someone changes server configuration, and as a result, the server doesn’t read the same information after restart that it wrote before shutdown. I’ve seen it fail a number of times in interesting ways — these are just examples.

Speak Your Mind

*