Percona Live 2017 Open Source Database Conference

April 24 - 27, 2017

Santa Clara, California

Autopsy of an automation disaster

25 April - 5:15 PM - 5:40 PM @ Room 210
Experience level: 
Intermediate
Duration: 
25 minutes conference
Tracks:
Business / Case Studies
Topics:
MySQL
Devops
High Availability

Description

You’ve deployed automation, enabled automatic master failover and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expected. This talk is about one such surprise. Once upon a time, a failure brought down a master. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, an edge-case recovery, and a lack of oversight in automation, led to a split-brain. This talk will go into detail about the convoluted, but still real-world, sequence of events that led to this disaster. I will cover what could have avoided the split-brain and what could have made it easier to fix.

Speakers

Simon Mudd

Senior Database Administrator, booking.com

Biography:

Simon Mudd is a Senior Database Administrator at booking.com, working from Madrid, Spain. He has been working with MySQL in large production environments for over 10 years. He has contributed heavily to orchestrator (https://github.com/github/orchestrator) and is the author of ps-top (https://github.com/sjmudd/ps-top). He previously managed replicated Sybase servers for a stock trading system. He has a degree in Computation from the University of Manchester Institute of Science and Technology (UMIST), now part of the University of Manchester.
