The Math of Automated Failover


There are a number of people blogging recently about MySQL automated failover, based on the production incident which GitHub disclosed.

Here is my take on it. When we look at systems providing high availability, we can identify two ways a system breaks down. First is when the system itself has a bug or limitation which does not allow it to take the right decision. Second is a configuration issue – which can be hardware configuration, such as redundant network paths or STONITH, as well as things like the various timeouts needed to tell the difference between a transient error or performance problem and a real failure.

To be truly prudent it is better not to trust anyone – including yourself – and assume the software is buggy and limited, and that you have not configured it properly. What can give us assurance? Right, testing! If you're really serious about implementing an HA setup, you might need to spend 3-10 times as much time testing as you spend setting it up. In my experience, unfortunately, very few people are really ready to invest that much time. You not only need time for testing, you need to test the right things. Quite frequently I see people not only not spending enough time on testing but also not testing for the right things. Taking shortcuts is one problem; simply not understanding what can go wrong is another. One example is testing network failures by rejecting packets on a firewall, which gives you an instant connection failure, while in real life a connection could take significant time before timing out. That can make a lot of difference. The complex situations which happen with databases – performance problems, running out of connections (and hence a mix of successful and failed connections), or the database stalling forever on queries – are often omitted.

The fact that there are so many things that can go wrong means two things. First, you want to use very well tested software for your high availability solution; it takes many man-years to polish these solutions, both in terms of theory and a bug-free implementation. Second, you will be unlikely to cover in your testing every real-world, multi-dimensional failure scenario that will happen in production. So there is still a chance for things to go wrong.

There is another complexity at play as well – in many environments, such as cloud or rented dedicated servers, you can't even test everything you would like. Moreover, you might have limited information about how exactly things are set up in an environment you do not control, and so about what kinds of failure scenarios you can expect.

As a result you really will not have confidence in the system working correctly until you have seen it work in practice. This is where my liking for manual failover comes into play. I would like the system to detect the problem and report in the logs what decisions it would take, but to page me to perform the actual failover the first few times. Once I have seen the system take good decisions in real-world scenarios, I would consider enabling automated failover, because then I can trust it. I would like the failover process to be automated, but a person to take the decision to initiate it and verify that it completed successfully.

Now, as promised, let's do some math. Consider you have manual failover, and when things go south it takes on average 15 minutes to get a person to check what is going on and hit the button to initiate failover. Let's say automated failover is immediate, but if it screws up it takes a couple of hours to fix the problem of a failover which did not go right. Note that you do not really have a good idea of the probability of a failover going bad until you have some statistics about the system. With the parameters stated, from a purely uptime perspective it would make sense to have manual failover until you are confident the automated failover goes bad in less than 1/8 of cases… which I would not trust until I have seen the system handle some 10 production incidents.
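The break-even point above can be sketched as a tiny calculation. This is a minimal model using the numbers from the post (15 minutes for manual intervention, ~120 minutes to clean up a botched automated failover); the function names are my own, not from any HA tool.

```python
# Break-even analysis: manual vs. automated failover.
# Assumptions from the post: manual failover costs ~15 minutes of downtime
# per incident; a botched automated failover costs ~120 minutes to clean up;
# a successful automated failover is treated as instantaneous.

MANUAL_MINUTES = 15         # time for a human to diagnose and hit the button
BAD_FAILOVER_MINUTES = 120  # cleanup time when automation gets it wrong

def expected_automated_downtime(p_bad: float) -> float:
    """Expected downtime per incident if automation goes bad with probability p_bad."""
    return p_bad * BAD_FAILOVER_MINUTES

# Automation wins, on expectation, only while p_bad is below this threshold.
break_even = MANUAL_MINUTES / BAD_FAILOVER_MINUTES
print(break_even)  # 0.125, i.e. 1/8 of failovers going bad
```

At exactly p_bad = 1/8 the expected downtime per incident is 15 minutes either way; any higher and the automation is a net loss.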

Now let's look at time as a variable, because it is important as well. If you have a single MySQL server hosted in a good place, it should fail less than once a year on average. Some people report numbers as high as 3-5 years for mean time between failures, but let's be conservative. If a year passes between system failures, it may take many years for the system to really prove itself in production… and by that time you will likely have changed your hardware, MySQL server version, and HA solution, and you might need to start gathering stats all over again.

This means in most cases I would recommend automated failover for large systems only. If I'm looking not at one master-slave pair but at 1000, I will be dealing with failures essentially every day, and assuming they are all configured the same way, I will be able to gain trust in the software's ability to handle failures well.
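The scale argument above is easy to quantify. Here is a rough sketch, assuming a conservative 1-year MTBF per master-slave pair and the "10 incidents to build trust" rule of thumb from earlier; the helper names are hypothetical.

```python
# How long until we have seen enough failures to trust the HA tooling?
# Assumption from the post: ~1 failure per year per master-slave pair
# (conservative; some report 3-5 year MTBFs), and ~10 observed production
# incidents needed before trusting automated failover.

MTBF_DAYS = 365  # conservative: one failure per pair per year

def incidents_per_day(num_pairs: int) -> float:
    """Expected fleet-wide failure rate, assuming independent failures."""
    return num_pairs / MTBF_DAYS

def days_to_n_incidents(num_pairs: int, n: int = 10) -> float:
    """Days until roughly n incidents have been observed across the fleet."""
    return n * MTBF_DAYS / num_pairs

print(days_to_n_incidents(1))     # 3650.0 days, i.e. ~10 years for one pair
print(days_to_n_incidents(1000))  # 3.65 days for a 1000-pair fleet
```

A single pair would need about a decade to accumulate 10 incidents, far longer than any hardware or software generation survives, while a 1000-pair fleet gets there in under a week.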

Finally, let's look at the cost of a bad failover. You want to minimize it, and this is indeed where the "failover only once" and "failover only when it is safe" principles are very good. When we apply them to MySQL replication, we can use these simplified rules: we will only fail over from master to slave, never failing back without manual intervention; and we will fail over only when replication is caught up (wait for the slave to catch up before bringing it online). If you follow these principles, and the software implements them properly, there should be fewer issues with taking a wrong turn – doing a failover when it is not needed, flapping, and causing an overall mess. Some people also say you should not fail over when the database merely experiences a performance problem, only when it is really down completely. This is a hard question for me. Indeed, there are many cases when performance problems are self-imposed: a developer adds a bad query to the application, the box becomes overloaded and "goes down", and as we fail over, the next server goes down the same way under the same overload. But this is not the only case – there are many hardware failure and misconfiguration scenarios which can cause a performance problem on a single node only; a disk failing in RAID, a resync being initiated, or a BBU starting its "learning" cycle are just some I/O-related examples. In these cases failing over to a working box can be a very good solution. Yet the problem is that it is very hard to automatically tell the difference between these two conditions.

Conclusion: way too often I see people obsessed with a system providing automated failover when their system is of way too small a scale, they have little discipline, they implement the solution without following best practices, and they do not test it properly. Lots of unfortunate downtime is the result. Running a manual failover solution can be the better choice for a lot of entry-level systems, and "battle testing" the solution before putting it on auto-pilot is a good approach for larger-scale systems.

Interested in MySQL and High Availability?
Come to Percona Live NYC, taking place Oct 1-2, and learn about many state-of-the-art high availability solutions for MySQL – Clustrix, Tungsten Replicator, PRM, Percona XtraDB Cluster, MySQL and DRBD – as well as many great sessions on other topics.


Comments

  1.

    Hi Peter! One point about failure rates–while they are increasingly low for well-managed on-premise computing, in my experience the rates are a lot higher for cloud environments like Amazon. I’m curious if Percona has any scientific numbers on this.

  2.

    Robert,

    I think they are higher, but I still think they are close to a 1-year MTBF for well-maintained MySQL systems.
    There are also other factors in the cloud which make automated failover a good solution at lower scale. Systems in the cloud generally have more servers (because you can't get as good hardware in the cloud, and because of prevailing cloud design patterns), as well as a generally higher level of automation and testing (DevOps etc).

  3.

    I agree and argued in a previous blog article that Amazon changes calculations on availability, including considerations about automation. The model here is companies like Netflix that assume everything is going to crash and even give it a shove now and then to prove they can recover. There are many other companies doing similar things but Adrian Cockcroft and the Netflix team are very good at explaining what they do. (Example: http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html)

    See you in NYC. :)

  4.

    I believe the risk of auto failover is higher compared to manual, importantly because you come to know about an auto failover only after it actually happens and the problem gets detected.
    We faced this with an Amazon instance (Multi-AZ): it took an hour to detect the issue and the failover scenario, and then the RDS support conversation began :) .

    As far as I've seen the world, I prefer manual failover as and when possible, as it provides
    – Knowledge of the issue.
    – Confirmation of successful failover, as checks are run immediately.

    Importantly, thanks for the math, and surely all these posts have a lot of take-aways. Thanks GitHub & all for the dissection.

    – Kedar

  5.

    Robert,

    I completely agree it makes sense for companies like Netflix, which have both a large infrastructure and a business large enough to allow them to invest in the team, processes, testing, etc. If you look closer at my point, the question is: can you be sure the automation does not fail very frequently in your case? Unfortunately most teams can't ensure that.

  6.

    @Peter, to your point about unnecessary failover, if you depend on automation it's important to go with mature products. Netflix develops such products internally, but smaller shops need to depend on vendors to get it mostly right and provide 24×7 support to cover the corner cases where it does not work. It's possible to get solutions sufficiently "right" that they make economic sense for a lot of people to use. For instance, Continuent puts a lot of effort into avoiding up-front configuration errors that might cause failover later on. You can and should test for these, of course, but it's even better to avoid them in the first place. Reducing the scope of required testing is necessary to make the problem tractable.

    Obviously not everyone makes the choice to go with full automation. It's interesting that a lot of the voices preferring manual failover come from sophisticated operations that are staffed 24×7 and have advanced monitoring. That affects the economics significantly.

  7.

    A lot of the voices that prefer manual failover are also those that have helped the small companies without lots of staff and advanced monitoring, after their inadequately implemented/tested automatic failover caused them to experience business-threatening downtime ;-)

  8.

    @Baron, that's a fair point. I have also seen the flip side, where users didn't notice failures for many hours due to monitoring problems. There's clearly no one-size-fits-all answer, and the answer changes over time. There's definitely room to improve across the board.

  9.

    Robert… I think if you do not have monitoring working properly, you're screwed, and working automated failover will be of little help. It will help with a small variety of cases, but you will still run out of space, run out of connections, and overload the system when a developer pushes an unindexed query live and goes back to sleep.
