So you need to design highly available MySQL powered system… how do you approach that ?
Too often I see the question is approached by focusing on expensive hardware which in theory should be reliable. And this really can work quite well for small systems. It is my experience – with quality commodity hardware (Dell,HP,IBM etc) you would see box failing once per couple of years of uptime which is enough to maintain level of availability needed by many small systems. In fact they typically would have order of magnitude more availability issues caused by their own software bugs, DOS attacks and other issues.
However as your system growths the reliability goes down. If you have 100 servers with each failing every 2 years this is about a server a week which is bad and if you’re into thousands and tens of thousands of servers server failures are becoming common place so it is important to make sure failing server does not affect your system and also what you can recover from server failure easily
So you should assume every component in the system can fail (if it is Server,Switch,Router,Cable, SAN) etc and you’re ready to deal with this. It does not mean you always have to ensure you stay fully operational after any failure but at least you should understand the risks. For example you may want to choose to keep single Cisco router because it has its own internal high availability on the component level which makes it extremely unlikely to fail, because you have 4 hour onsite repair agreement and because it is just freaking expensive. Though may be redundant less expensive systems could be better choice.
I would highlight again every component can fail it does not matter how redundant it is inside. The SAN is very good example – I’ve seen Firmware glitches causing failure in the SAN which was fully redundant on the component level. It is not every hardware component but also any code may fail as well. This is actually what makes your own code often the weakest link in availability.
Depending on failure rate you also should be thinking about automation – for frequent failures you want to recovery (like getting spare Web server and putting it online) to be automatic or done with simple manual command. For complex and rare failures you may have less automation – if certain type of failure happens once per couple of years for many evolving systems there is very high chance the old automation tools may not work well (this is of course unless you always test all automated failure scenarios regularly).
So if we’re designing the system so it can tolerate hardware failures should we bother about hardware quality at all ? The answer is yes in particular for classic database/storage systems. Few systems are design with so much error detection and automated handling in mind as Google File System.
In particular you want to make sure Error Detection is on the good level. For example if you’re running the system without ECC memory chances are your data will melt down and you will not notice it for long time (in particular if you’re using MyISAM tables) which can cause the error to propagate further in the system and make recovery much more complicated than simply swapping the server. This is exactly one of the reasons many high scale installations prefer Innodb – it is paranoid and this is how you want your data storage to be. This is also why Sun is so proud about checksums on the file system level in ZFS.
What is about RAID when ? As strange as it may sound but you should not relay on RAID for your data safety. There are many ways to loose data on RAID system even if you’re running RAID6 with couple of hot spare. The RAID is just dramatically reduces chance of data loss in case of hard drive failure and this is good because recovering database servers is not fully automated in most cases. Plus there may be system performance impact and (in particular if you use MySQL Replication for HA) the switch to the new server may not be 100% clean with few updates lost. RAID, especially with BBU also makes a good sense to get extra performance out of the box.
Some installations are using RAID0 for slaves – in these cases there are typically many slaves and recovery of the slave is very easy and causes no business impact. This is fine assuming you do the math and the performance gains or cost savings are worth it.
Another good RAID question is if Hot Spare should be used. I normally do not use it because it a large waste, especially as most of systems have even number of drives, so if you’re looking for RAID10 setting up hot spare costs you 2 drives. Having hot spare does not add a lot to high availability – if you have proper RAID monitoring in place and keep couple of spare hard drives on the shelf in the data center we’re speaking about couple of extra hours running in degraded mode. Even if you do not have spare hard drive you can often pool the one from the spare server and have the “warranty man” to replace it instead.
It is also a good question if you need redundant power supplies. In my experience they rarely fail so having redundant power supplies does not increase availability when it comes to hardware failures that much and so if you just look from this angle it may be justified only for the most critical servers. Do not forget redundant power supplies also increase server power usage a bit. Redundant power supplies however are helpful if you have multiple power feeds, so server can stay up if one of the phases has a power loss. Another benefit is – in redundant power supply will often allow to do some power work (like moving server to different circuit) without downtime which may be or may not be something important for you.
Finally I should mention about spare component. These are paramount if you’re designing highly available system. Having spare drives on the shelf, spare switches, spare servers (which are same as better as servers which are in production) is paramount. It is important promotion happens easily and there are no performance gotchas (ie 8 core server can be slower than 4 core with MySQL). It is best if you just put couple of spare servers in each purchase batch so they are absolutely same configuration but I know it is not always possible. Dealing with spares is yet another reasons to avoid the “server zoo” and have limited set of purchased configurations which are reviewed yearly (on other regular interval) rather than finding different best configuration each week.
Having spare servers also means you often do not need most expensive support agreements and Business Hours Monday-Friday is good enough for you – you’re not waiting for support for production anyway just fall back to another server and use it. Of course you can imagine cases when problem could affect all servers of the same type but it is not that frequently seen in practice.
To avoid multiple servers failing at the same time it is of course important to QA/Stress test servers before you put the load on them. I’ve seen multiple cases when something would go wrong and all servers of same configuration will experience the same problem. Proper QA/Stress test reduces the chance of this but you better to be testing with load similar to what you expect in production.
Requirement to have Spare hardware is also the reason why commodity inexpensive hardware is often better choice. If you have couple of $1M in production you need another $1M server as a spare and this is expensive. If instead you have 10 pairs $10K boxes having couple of spares would cost you only $20K plus I found it in many cases much easier to convince “finance” people to buy something cheap which is not used most of the time when to spend a lot of money on the server which will be where sitting doing idle.
How many spare servers you need – you would see it in practice. As I mentioned at least one for any hardware class you have. If you have many failures you need more of course. You may also decide to keep more spare systems when you can use them to help capacity management, especially if you have multiple applications which do not share hardware but share the data center. You may have “spares” to provide extra on demand capacity for web servers or memcached quite easily, or say increase number of slaves if you have unexpectedly high number of reports launched by users etc.