About a year ago, I started a study of emergency incidents that our customers filed with us. What I found was really surprising, and defied conventional wisdom. I learned a lot about preventing emergencies. I just published the outcome as a white paper, including checklists that you can use and modify for your own servers. (Analysis of the nature and causes of emergencies is due to be published in the next issue of IOUG’s quarterly SELECT journal.)

7 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Sean C

Nice paper. Most production issues/outages can avoided with proper controls and checklists. A book I highly recommend reading and that compliments this post: Visible Ops Handbook (http://www.amazon.com/Visible-Ops-Handbook-Starting-Practical/dp/0975568604)

Shlomi Noach

Wow! Thank for this effort!

May I add some shameless plugs?

With regard to shutdowns (3.5, 3.6): openark kit’s oak-prepare-shutdown will perform both actions: reducing the dirty pages percent until no improvement is seen, as well as verifying no temporary tables are open on slave. It will also gracefully stop replication beforehand.

check configuration before shutdown (3.8): mycheckpoint (monitoring tool for MySQL) records all status variables, as well as all server variables. By querying a single view, you get the entire history of parameter changes, timestamps and all. Actually, it is also easy to cross reference with the “uptime” value per entry, so you may easily relate changes that were suspiciously close to a restart (indication someone may have changed a variable dynamically yet forgot to update cnf file).

Security (10.4, 10.4, 10.5) all these and *many* others (e.g. different users sharing the same password) are handled by openark kit’s oak-security-audit. Basically, the tool provides you with a report and recommendations on your MySQL’s security flaws.

Vojtech Kurka

“Before shutting down the server, stop replication
and issue SHOW SLAVE STATUS. Save the result to
a file and refer to it after restart to ensure that replication starts in the correct position”

Can please anyone explain me, why to do this? AFAIK mysql saves this information into the master.info file, so you only need to sync it to disk before you power the machine off (which does the OS for you).
I probably missed some other reasons to do this?

Vojtech

Vojtech Kurka

“Before shutting down the server, stop replication
and issue SHOW SLAVE STATUS. Save the result to
a file and refer to it after restart to ensure that replication starts in the correct position”

Can please anyone explain me, why to do this? AFAIK mysql saves this information into the master.info file, so you only need to sync it to disk before you power the machine off (which does the OS for you).
I probably missed some other reasons to do this?

Vojtech