Autopsy of an automation disaster
You’ve deployed automation, enabled automatic master failover, and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expected. This talk is about one such surprise. Once upon a time, a failure brought down a master. Automation kicked in and fixed things. However, a fancy failure, combined with human error, an edge-case recovery, and a lack of oversight in automation, led to a split-brain. This talk will go into detail about the convoluted, but still real-world, sequence of events that led to this disaster. I will cover what could have prevented the split-brain and what could have made it easier to fix.
Senior Database Administrator, booking.com
Simon Mudd is a Senior Database Administrator at booking.com, working from Madrid, Spain. He has been working with MySQL in large production environments for over 10 years. He has contributed heavily to orchestrator (https://github.com/github/orchestrator) and is the author of ps-top (https://github.com/sjmudd/ps-top). He previously managed replicated Sybase servers for a stock trading system. He has a degree in Computation from the University of Manchester Institute of Science and Technology (UMIST), now part of the University of Manchester.