For my first contribution to the MySQL Performance Blog (I joined Percona at the beginning of September), I chose to cover the various high-availability (HA) options available for MySQL. I have done dozens of MySQL HA-related engagements while working for Sun/MySQL over the last couple of years, using Heartbeat, DRBD and NDB Cluster, and I’ll probably be doing the same at Percona. I built my first DRBD-based HA solution nearly 10 years ago.
There is quite a lot of confusion surrounding HA solutions for MySQL. I will try to present them objectively; my goal here is not to sell any specific technology but to help people choose the right one for their needs. This post is the first of a series, and I don’t yet know how many posts the series will contain.
Before we start, it must be stated that high availability is not only a matter of technical solutions. Good management practices covering monitoring, alerting, security and documentation are also needed to ensure a successful solution. In other words, no solution is foolproof: if a high-availability solution runs in recovery mode for months without anybody caring about it, the risk of a complete failure is much higher.
So that we are all on the same page, I will first give some definitions of the key terms. I don’t pretend those definitions are perfect, but let’s build on them.
Let’s first define what is meant by high availability. The most general definition would be that a high-availability setup is a special computer architecture designed to improve the availability of a computer service, like a MySQL database. High availability, HA for short, introduces a wealth of peculiar concepts; we will first review the main ones.
Uptime means the service is available, even if degraded, as long as it is above some defined performance threshold. Downtime means the opposite: the service is either completely down or unresponsive according to the defined performance threshold. In many cases, people don’t define a performance threshold explicitly, and it is effectively the service monitoring frequency and timeout that set it.
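To make the idea of a performance threshold concrete, here is a minimal sketch of how a monitoring tool might classify probes as up or down. The names `ProbeResult`, `is_up` and `measured_uptime` are hypothetical and chosen for illustration only; a real monitoring system would also have to deal with probe frequency, retries and flapping.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    # Response time of one monitoring probe, in seconds.
    # None means the service did not answer at all.
    response_seconds: Optional[float]


def is_up(result: ProbeResult, threshold_seconds: float) -> bool:
    """A probe counts as 'up' only if it answered within the threshold."""
    if result.response_seconds is None:
        return False
    return result.response_seconds <= threshold_seconds


def measured_uptime(results: List[ProbeResult], threshold_seconds: float) -> float:
    """Fraction of probes (0.0 to 1.0) that met the performance threshold."""
    if not results:
        raise ValueError("at least one probe result is required")
    up_count = sum(1 for r in results if is_up(r, threshold_seconds))
    return up_count / len(results)


if __name__ == "__main__":
    # Hypothetical sample: two fast answers, one timeout, one slow answer.
    probes = [ProbeResult(0.2), ProbeResult(0.8), ProbeResult(None), ProbeResult(3.0)]
    print(measured_uptime(probes, threshold_seconds=1.0))
```

Note that the slow answer (3.0 seconds) counts as downtime here even though the service technically responded, which is exactly the effect of choosing a threshold.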
Level of Availability
The level of availability is basically the guaranteed percentage of uptime you will get over a year. It has always been a subject of debate, and it is hard to evaluate since, most of the time, the samples are small and not all the conditions of the deployments are easily controlled. See the level of availability as the availability you, as the operator of the service, can promise in a worst-case scenario. For example, 98% availability means a downtime of a little more than 7 days per year. The cost is approximately an exponential function of the level of availability and has to be compared with the cost of downtime. While an HA setup with a level of availability of 99% is fairly simple and affordable, moving to 99.9% and 99.99% can be much more expensive and complex. You also need to consider environmental factors. If your ISP cannot guarantee a level of availability of more than 99.9% for the Internet access, it is useless to go beyond that, no matter how important the application is.
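The arithmetic behind these figures can be sketched in a few lines of Python. The function names are hypothetical, a year is taken as 365 days, and the combined figure assumes that the components fail independently and that the service needs all of them at once, which is a simplification.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def yearly_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year allowed by a given level of availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def serial_availability(*availability_percents: float) -> float:
    """Availability of a chain of components that must all be up.

    Assumes independent failures, so the availabilities simply multiply.
    """
    combined = 1.0
    for percent in availability_percents:
        combined *= percent / 100
    return combined * 100


if __name__ == "__main__":
    for level in (98, 99, 99.9, 99.99):
        hours = yearly_downtime_hours(level)
        print(f"{level}% -> {hours:.2f} hours/year ({hours / 24:.2f} days)")
    # A 99.99% database behind a 99.9% Internet link is still below 99.9% overall.
    print(f"{serial_availability(99.9, 99.99):.3f}%")
```

With these assumptions, 98% works out to about 7.3 days of downtime per year, and the last line illustrates why going beyond what the ISP guarantees brings little benefit.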
Single point of failure (SPOF)
Single points of failure are the things you are looking to remove when you build an HA solution. Basically, they are the devices or components whose unavailability takes the service down. A data center can be considered a SPOF at a high enough level of availability. Usually, the more SPOFs you address, the higher the availability of your solution and the higher its cost.
Recovery (or failover) is the process by which an HA setup recovers from a failure. During the recovery time, the service is down. With the simplest solutions, it can be a manual process, but most of the time it is automatic. There is also a time associated with the recovery. If a failure happens during the night and the operator is only available from 8am to 5pm, you might have a recovery time of more than 12 hours. The more complex solutions have automatic recovery and do not need human intervention. Once again, although there are some exceptions, faster and automatic recovery usually means higher costs.
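As a rough illustration of the manual recovery case, the sketch below estimates how long a failure waits before an operator is even available. The function names are hypothetical, the 8am to 5pm window comes from the example above, and weekends and holidays are deliberately ignored.

```python
def hours_until_operator(failure_hour: float,
                         shift_start: float = 8.0,
                         shift_end: float = 17.0) -> float:
    """Hours between a failure and the moment an operator is on duty.

    Hours are expressed on a 24-hour clock (19.5 means 7:30pm).
    Assumes the same shift every day, with no weekends or holidays.
    """
    if not 0 <= failure_hour < 24:
        raise ValueError("failure_hour must be in [0, 24)")
    if shift_start <= failure_hour < shift_end:
        return 0.0  # somebody is already there
    if failure_hour < shift_start:
        return shift_start - failure_hour  # wait until this morning's shift
    return 24 - failure_hour + shift_start  # wait until tomorrow morning


def manual_recovery_hours(failure_hour: float, repair_hours: float) -> float:
    """Total downtime: waiting for the operator plus the repair itself."""
    return hours_until_operator(failure_hour) + repair_hours


if __name__ == "__main__":
    # A failure at 7pm with a one-hour repair keeps the service down 14 hours.
    print(manual_recovery_hours(failure_hour=19.0, repair_hours=1.0))
```

An automatic failover removes the waiting term entirely, which is where most of the recovery time goes in this example.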
A cluster is a group of servers used for the same task; in our case, servers dedicated to the high availability of the MySQL database service.
With these common definitions in place, we can move to the second step: the questions.