How fast can you fail over your databases? Do you trust it? Do you trust the process enough to let [almost] anyone do it, at any time? We do!
At Square, we manage thousands of MySQL and Redis database clusters. We recently rewrote all of the automation that fails over our MySQL databases, making it even faster and more reliable. The time from a user requesting a failover to database writes going to the new target is now generally under 2 seconds, with no real downtime or risk. The rewrite went so well for MySQL that we decided to further abstract the process and apply the exact same set of tools to our Redis clusters.
This talk describes the prerequisites, process, tooling, and lessons learned in safely cutting over database traffic and abstracting the process to apply to both MySQL and Redis.
Brian Ip is a software engineer on the Online Data Stores team at Square. He spends his time writing tools to help manage the MySQL and Redis fleet.
Emily is a software engineer on the Online Data Stores team at Square. She spends her time writing tools to help manage the MySQL and Redis fleet.