The Math of Automated Failover
Peter Zaitsev
Here is my take on it. When we look at systems providing high availability, we can identify two ways such a system breaks down. The first is when the system itself has a bug or limitation that prevents it from making the right decision. The second is a configuration issue: this can be hardware configuration, such as redundant network paths and STONITH, as well as settings like the various timeouts used to tell transient errors or performance problems apart from a real failure.
To be truly prudent it is better not to trust anyone, including yourself, and to assume that the software is buggy and limited and that you have not configured it properly. What can give us assurance? The right testing! If you’re really serious about implementing an HA setup, you might need to spend 3-10 times as much time testing it as you spend setting it up. In my experience, unfortunately, very few people are really ready to invest that much time. And you do not only need time for testing, you need to test the right things. Quite frequently I see people not only spending too little time on testing but also testing the wrong things. Taking shortcuts is one problem; simply not understanding what can go wrong is another. One example is testing network failures by rejecting packets on the firewall, which gives you an instant connection failure, while in real life a connection may take significant time before timing out. This can make a big difference. The more complex situations that happen with databases, such as performance problems, running out of connections (so that connections intermittently succeed and fail), or the database stalling forever on queries, are often omitted.
The fact that there are so many things that can go wrong means two things. First, you want to use very well tested software for your high availability solution. It takes many man-years to polish such software, both in terms of theory and in terms of a clean, bug-free implementation. Second, your testing is unlikely to cover every real-world, multi-dimensional failure scenario that will happen in production. So there is still a chance for things to go wrong.
There is another complexity at play as well: in many environments, such as the cloud or rented dedicated servers, you can’t even test everything you would like. Moreover, you might have limited information about how exactly things are set up in an environment you do not control, and so about what kind of failure scenarios to expect.
As a result, you really will not have confidence that the system works correctly until you have seen it working in practice. This is where my liking for manual failover comes into play. I would like the system to detect the problem and report in the logs what decision it would take, but to page me to perform the actual failover the first few times. Once I have seen the system making good decisions in real-world scenarios, I would consider enabling automated failover, because then I can trust it. Until then, I would like the failover process itself to be automated, but a person to make the decision to initiate it and to verify that it completed successfully.
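As a rough illustration of this "log the decision, page a human" approach, here is a minimal Python sketch. The names `FailoverMode`, `Decision` and `handle` are hypothetical and do not come from any particular HA tool; the sketch only shows how the same detection logic could run in an advisory mode first and be switched to automatic later.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FailoverMode(Enum):
    ADVISORY = "advisory"    # log the decision and page a human
    AUTOMATIC = "automatic"  # log the decision and act on it


@dataclass
class Decision:
    incident: str         # what the monitor detected
    proposed_action: str  # what the system would do about it


def handle(decision: Decision, mode: FailoverMode) -> List[str]:
    """Return the steps taken for a decision (hypothetical sketch)."""
    # The decision is always logged, so it can be reviewed later
    # even when no automatic action is taken.
    steps = [f"log: {decision.incident} -> would run {decision.proposed_action}"]
    if mode is FailoverMode.ADVISORY:
        steps.append("page: on-call person decides, starts failover, verifies result")
    else:
        steps.append(f"execute: {decision.proposed_action}")
    return steps


if __name__ == "__main__":
    d = Decision("master unreachable", "promote replica")
    for step in handle(d, FailoverMode.ADVISORY):
        print(step)
```

Reviewing the logged decisions after each incident is what would tell you whether the system would have done the right thing on its own.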
Now, as promised, let’s do some math. Suppose you have manual failover, and when things go south it takes on average 15 minutes to get a person to check what is going on and hit the button to initiate failover. Let’s say automated failover is immediate, but if it screws up it takes us a couple of hours to fix the problem of a failover that did not go right. Note that you do not really have a good idea of the probability of a failover going bad until there are some statistics about the system. With the parameters stated, from a purely uptime perspective it makes sense to stay with manual failover until you are confident that automated failover goes wrong in less than 1/8 of cases (15 minutes divided by 120 minutes), which I would not trust until I have seen the system handle some 10 production incidents.
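The arithmetic behind the 1/8 figure can be sketched in a few lines of Python. This assumes "a couple of hours" means 120 minutes and that a successful automated failover costs no downtime at all; the function names are made up for illustration.

```python
def expected_downtime_manual(response_minutes: float) -> float:
    """Every incident waits for a person to respond."""
    return response_minutes


def expected_downtime_automated(p_bad: float, repair_minutes: float,
                                failover_minutes: float = 0.0) -> float:
    """Good failovers cost failover_minutes, bad ones cost repair_minutes."""
    return (1.0 - p_bad) * failover_minutes + p_bad * repair_minutes


def break_even_bad_failover_probability(response_minutes: float,
                                        repair_minutes: float) -> float:
    """Probability of a bad failover at which both approaches cost the same,
    assuming a good automated failover is instant."""
    return response_minutes / repair_minutes


if __name__ == "__main__":
    p = break_even_bad_failover_probability(15, 120)
    print(f"break-even probability of a bad failover: {p}")
    for p_bad in (0.05, 0.125, 0.25):
        auto = expected_downtime_automated(p_bad, 120)
        print(f"p_bad={p_bad}: automated {auto:.1f} min vs manual 15.0 min")
```

If a bad failover took 4 hours to clean up instead of 2, the same formula would put the break-even at 1/16, so the threshold is quite sensitive to how expensive a bad failover is.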
Let’s now look at time as a variable, because it is important as well. If you have a single MySQL server hosted in a good place, it should fail less than once a year on average. Some people report mean time between failures as long as 3-5 years, but let’s be conservative. If it takes a year between system failures, it may take many years for the system to really prove itself in production, and by that time you will likely have your hardware, MySQL server and HA solution all changed, and you might need to start gathering stats again.
This means that in most cases I would recommend automated failover for large systems only. If I’m looking not at one Master-Slave pair but at 1000, I will be dealing with failures essentially every day, and assuming they are all configured the same way, I will be able to gain trust in the software’s ability to handle failures well.
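To put numbers on how scale changes this, here is a small Python sketch of how long it takes to collect the roughly 10 production incidents mentioned earlier. It assumes each pair produces one failover-worthy failure per year on average and that failures are spread evenly over time; `days_to_observe` is a made-up helper name.

```python
def days_to_observe(incidents_needed: int, pairs: int,
                    mtbf_years_per_pair: float = 1.0) -> float:
    """Expected days until the fleet has produced incidents_needed failures."""
    days_between_failures_per_pair = mtbf_years_per_pair * 365.0
    return incidents_needed * days_between_failures_per_pair / pairs


if __name__ == "__main__":
    for pairs in (1, 10, 100, 1000):
        days = days_to_observe(10, pairs)
        print(f"{pairs:>4} pairs: about {days:.1f} days ({days / 365.0:.2f} years)")
```

Under these assumptions a single pair needs about 10 years to produce 10 incidents, while 1000 pairs get there in under 4 days.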
Finally, let’s look at the cost of a bad failover. You want to minimize it, and this is indeed where the “failover only once” and “failover only when it is safe” principles are very good. When we apply them to MySQL replication we can use these simplified rules: we will only fail over from Master to Slave and never fail back without manual intervention, and we will fail over only when replication is caught up (wait for the slave to catch up before bringing it online). If you follow these principles, and if the software implements them properly, there should be less of an issue with taking a wrong turn: doing failover when it is not needed, flapping, and causing a mess all over.

Some people also say you should not fail over when the database merely experiences performance problems, only when it is really down completely. This is a hard question for me. Indeed, there are many cases when performance problems are self-imposed. A developer adds a bad query to the application, and the box gets overloaded and “goes down”. When we fail over, the other server goes down with the same overload. This is not the only case, though. There are many hardware failure and misconfiguration scenarios that cause performance problems on a single node only: a disk failing in a RAID array, a resync being initiated, or a BBU starting its “learning” cycle are just some IO-related examples. In these cases failing over to a working box can be a very good solution. Yet the problem is that it is very hard to automatically tell these two conditions apart.
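The “failover only once” and “failover only when it is safe” rules can be written down as a small guard function. The Python below is a minimal sketch: `ClusterState` and its fields are hypothetical inputs that a real monitoring layer would have to supply, and it deliberately covers only the "master is really down" case, not the performance-problem case, which is hard to detect automatically.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    master_down: bool           # master confirmed unreachable (hypothetical check)
    slave_caught_up: bool       # slave has applied everything it received
    already_failed_over: bool   # a failover has happened since the last manual reset


def should_failover(state: ClusterState) -> bool:
    """Apply the simplified rules; return True only when failover is allowed."""
    if state.already_failed_over:
        # Failover only once: failing back needs manual intervention.
        return False
    if not state.master_down:
        # Nothing to do while the master is still serving.
        return False
    if not state.slave_caught_up:
        # Failover only when it is safe: wait for the slave to catch up.
        return False
    return True


if __name__ == "__main__":
    print(should_failover(ClusterState(True, True, False)))   # allowed
    print(should_failover(ClusterState(True, False, False)))  # slave still behind
    print(should_failover(ClusterState(True, True, True)))    # already failed over once
```

Keeping the rules in one pure function like this also makes them easy to test against the failure scenarios you can think of before trusting them in production.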
Conclusion: Way too often I see people obsessed with systems that provide automated failover when their system is of way too small a scale, they have little discipline, they implement the solution without following best practices, and they do not test it properly. A lot of unfortunate downtime is the result. Running a manual failover solution can be the better choice for a lot of entry-level systems, and “battle testing” a solution before putting it on autopilot is a good approach for larger-scale systems.
Interested in MySQL and High Availability?
Come to Percona Live NYC, taking place Oct 1-2, and learn about many state-of-the-art high availability solutions for MySQL such as Clustrix, Tungsten Replicator, PRM, Percona XtraDB Cluster, MySQL and DRBD, as well as many great sessions on other topics.