Well, it happened again… Another lengthy EBS outage in the US-East region impacted several sites across the net. While failures like this are rare, they can be quite costly and translate into headaches for the operations team when they impact production systems for any length of time. At Percona, we routinely help clients architect and deploy highly available systems designed with disaster recovery in the cloud. Here are a few high-level best practices that I’ve seen when helping clients with AWS deployments:
- Plan for failure
- Plan for failure
- Plan for … you get the idea
Plan for Failure
The single most critical piece is to plan for and expect failure. The ease of setting up infrastructure in the cloud, combined with promises of HA, can lead to a false sense of confidence. Assume that parts of the robust cloud are going to fail and work to eliminate any single point of failure (SPOF) within the cloud architecture. Simulate random things going away at random times (cue the Chaos Monkey). While outages are going to happen, preparing for and planning on cloud failures can help you mitigate the impact on your application.
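As a rough illustration of hunting for SPOFs before an outage finds them, here is a minimal Python sketch that takes an inventory of nodes and reports which region or AZ failures would leave no database node standing. The `Node` record, the node names, and the region/AZ labels are all hypothetical; this only simulates failures on paper and makes no AWS API calls.

```python
from collections import namedtuple

# Hypothetical inventory record; names and fields are illustrative only.
Node = namedtuple("Node", ["name", "region", "az", "role"])

def surviving(nodes, failed_region=None, failed_az=None):
    """Return the nodes still up after losing a whole region or a single AZ.

    failed_az is a (region, az) tuple, since AZ names repeat across regions.
    """
    return [
        n for n in nodes
        if n.region != failed_region and (n.region, n.az) != failed_az
    ]

def single_points_of_failure(nodes, role):
    """List failure domains whose loss leaves no node with the given role."""
    spofs = []
    for region in sorted({n.region for n in nodes}):
        left = surviving(nodes, failed_region=region)
        if not any(n.role == role for n in left):
            spofs.append(("region", region))
    for region, az in sorted({(n.region, n.az) for n in nodes}):
        left = surviving(nodes, failed_az=(region, az))
        if not any(n.role == role for n in left):
            spofs.append(("az", region + "/" + az))
    return spofs

# Two database nodes in different AZs of the same region (assumed layout).
nodes = [
    Node("db1", "us-east-1", "a", "db"),
    Node("db2", "us-east-1", "b", "db"),
]
print(single_points_of_failure(nodes, "db"))
```

In this assumed layout, losing either AZ is survivable, but the region itself still shows up as a single point of failure. Adding a database node in a second region clears the list.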
AWS Infrastructure (High Level)
The Amazon Web Services infrastructure contains several individual components that can be combined to create a highly available architecture. While Amazon claims that issues in a single Availability Zone should have no impact on other zones in the same region, empirical evidence from past outages has shown otherwise. It is crucial to have components geographically isolated as well as isolated at the data center level.
The top level of isolation in AWS is the region. Regions are geographically isolated in different physical data centers around the world. Bandwidth across regions is similar to standard traffic across the internet, and is charged as such. In order to have a fully redundant solution, you need to have working instances in multiple regions that are able to operate independently.
Within a single region, there are multiple Availability Zones (AZs). They are designed to operate independently, but there have been examples where issues in a single AZ impacted resources in different AZs. Data transfer within a single AZ is free while data transfer across AZs (but within the same region) is charged at a discounted Regional transfer rate. Having instances in multiple AZs is a minimal level of availability, but can’t be trusted alone.
Looking at the history of AWS outages, they have been isolated to a single region but have impacted multiple Availability Zones. Other regions may also suffer a bit of a slowdown as customers fail over into them. In general, there will be a larger load across the system and some API calls may not be fully responsive.
I would say that the number one strategy is making sure that you are geographically isolated. While that isn’t the end-all (multiple cloud providers, physical datacenter, etc), it should give your cloud app more resilience when faced with a cascading failure within a single region. This is a very similar principle to running real gear – you can’t rely on simply keeping servers in different racks when thinking of HA and DR. Rather, you need to have geographically isolated data centers in the event of a catastrophic failure.
I have seen EBS volume failover used as a viable HA option within a single AZ. Essentially, your instance starts having issues, so you mount your EBS data directory on another instance and simply fire up MySQL. This works wonderfully unless EBS is the component experiencing the issue. In that case, having a hot slave in another region is really the only way to “spin up another instance”. In general, unless your data resides in another location, you can’t always assume you can simply mount your storage on another instance (think of a SAN failure in the real gear analogy).
Multi Region Replication
The easiest approach would be to use native replication from a master to a slave across regions. This will allow you to keep a relatively in-sync instance in another region ready to take over in the event of a full region outage where your primary server resides. Being asynchronous, there is always a potential for some slave lag, but 1-2 seconds of lost transactions (which may or may not be recoverable once the downed region recovers) compared with hours of downtime is probably a decent tradeoff.
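To put rough numbers on that tradeoff, here is a back-of-the-envelope Python sketch. The write rate, the lag, and the outage length are assumed values for illustration only, and the function names are my own; plug in figures measured from your own workload.

```python
def transactions_at_risk(lag_seconds, writes_per_second):
    """Writes committed on the master but not yet applied on the slave."""
    return lag_seconds * writes_per_second

def writes_refused_during_outage(outage_hours, writes_per_second):
    """Writes the application could not accept while the region is down."""
    return outage_hours * 3600 * writes_per_second

# Assumed numbers for illustration only; measure your own workload.
WRITES_PER_SECOND = 200
at_risk = transactions_at_risk(lag_seconds=2, writes_per_second=WRITES_PER_SECOND)
refused = writes_refused_during_outage(outage_hours=3, writes_per_second=WRITES_PER_SECOND)
print("at risk after failover:", at_risk)
print("refused without a standby:", refused)
```

With these assumed inputs, a couple of seconds of lag puts a few hundred transactions at risk, while a multi-hour outage with no standby turns away millions of writes.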
You can also combine this approach with a tool like pt-query-digest or pt-playback to keep your standby server primed in the event of failover. I only mention this because a cold start can often result in degraded service for quite some time as well.
Percona XtraDB Cluster (PXC)
Another option would be to use PXC and keep a node (or more) in a separate region. While this will allow for synchronous writes to the remote node, it will also have some latency impact on write operations (the ping time to the most distant node) but that may be something your application can tolerate.
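A quick way to reason about that latency impact is to treat the round trip to the most distant node as a floor on commit time. The Python sketch below does that arithmetic. The node labels and round-trip times are made-up examples, not measurements, and the model is deliberately simplistic, so treat the result as an estimate to validate with a load test.

```python
def added_commit_latency_ms(rtt_ms_by_node):
    """Approximate added latency per commit: the RTT to the most distant node."""
    return max(rtt_ms_by_node.values())

def max_serial_commits_per_second(rtt_ms_by_node):
    """Rough ceiling on back-to-back commits from a single connection."""
    return 1000.0 / added_commit_latency_ms(rtt_ms_by_node)

# Assumed round-trip times in milliseconds, for illustration only.
rtts = {
    "us-east-1b": 1.0,   # another AZ in the same region
    "us-east-1c": 1.5,   # another AZ in the same region
    "us-west-2a": 80.0,  # node in a remote region
}
print(added_commit_latency_ms(rtts))
print(max_serial_commits_per_second(rtts))
```

Under these assumptions the remote node dominates: each commit waits on the order of 80 ms, so a single connection issuing commits one after another tops out around a dozen per second. Concurrent connections can still commit in parallel, which is why the aggregate impact depends heavily on the application.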
In conjunction with having a node outside of the region, you can also keep your other nodes in different AZs within the same region to at least give yourself some isolation at the data center level (think keeping your nodes in separate racks).
If the extra write latency is a show-stopper for your application, you may consider replicating from a PXC cluster to another cluster or standalone server in another region via standard asynchronous replication. This would be similar to the above approach, but you gain some level of resiliency within the region as well. As a note, you will definitely want to load test this solution when evaluating a cross-region cluster.
Overall, issues like this are inevitable and will likely cause at least some downtime or degraded service (unless you run active/active, but that is another discussion). However, given the issues that have occurred in the past, you can mitigate that by treating the cloud like you would a physical datacenter and planning accordingly. Expect failures and simulate/test your failover procedures so you can be confident the next time #ec2pacolypse hits.