September 30, 2014

Minimizing Downtime from Lengthy AWS Outages

Well, it happened again… Another lengthy EBS outage in the US-East region impacted several sites across the net. While failures like this are rare, they can be quite costly and translate into headaches for the operations team when they impact production systems for any length of time. At Percona, we routinely help clients architect and deploy highly available systems designed with disaster recovery in the cloud. Here are a few high-level best practices that I’ve seen when helping clients with AWS deployments:

  1. Plan for failure
  2. Plan for failure
  3. Plan for … you get the idea

Plan for Failure

The single most critical piece is to plan for and expect failure. The ease of setting up an infrastructure in the cloud, combined with promises of HA, can lead to a false sense of confidence. Assume that parts of the robust cloud are going to fail and work to eliminate any single point of failure (SPOF) within the cloud architecture. Simulate random things going away at random times (cue the Chaos Monkey). While outages are going to happen, preparing for and planning on cloud failures can help you mitigate the impact to your application.
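As a rough illustration of "random things going away at random times", here is a minimal Python sketch of a failure drill run against a description of your topology rather than against real servers. The `chaos_drill` function, the node names, and the role labels are all hypothetical; this is not the Chaos Monkey tool itself, just the same idea on paper.

```python
import random


def chaos_drill(nodes, required_roles, rounds=100, seed=0):
    """Kill one random node per round; return the rounds that caused an outage.

    nodes: dict mapping a (hypothetical) node name to its role, e.g. {"db1": "db"}.
    required_roles: roles that must keep at least one surviving node.
    """
    rng = random.Random(seed)  # seeded so a drill can be replayed
    names = sorted(nodes)
    outages = []
    for round_no in range(rounds):
        victim = rng.choice(names)
        # Roles still served once the victim is gone.
        surviving_roles = {role for name, role in nodes.items() if name != victim}
        missing = set(required_roles) - surviving_roles
        if missing:
            outages.append((round_no, victim, sorted(missing)))
    return outages


# Illustrative topologies only.
redundant = {"web1": "web", "web2": "web", "db1": "db", "db2": "db"}
fragile = {"web1": "web", "web2": "web", "db1": "db"}

print(chaos_drill(redundant, ["web", "db"]))  # → []
# Only the lone database node can ever show up as the cause of an outage.
print({victim for _, victim, _ in chaos_drill(fragile, ["web", "db"])})
```

A drill like this only tells you where the obvious gaps are. It is no substitute for actually turning things off in a test environment.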

AWS Infrastructure (High Level)

The Amazon Web Services infrastructure contains several individual components that can be combined to create a highly available architecture.  While Amazon claims that issues in a single Availability Zone should have no impact on other zones in the same region, empirical evidence from past outages has shown otherwise.  It is crucial to have components geographically isolated as well as isolated at the data center level.

Regions

The top level of isolation in AWS is the region. Regions are geographically isolated in different physical data centers around the world. Data transfer across regions behaves much like standard traffic across the internet, and is charged as such. In order to have a fully redundant solution, you need to have working instances in multiple regions that are able to operate independently.

Availability Zones

Within a single region, there are multiple Availability Zones (AZs).  They are designed to operate independently, but there have been examples where issues in a single AZ impacted resources in different AZs.  Data transfer within a single AZ is free while data transfer across AZs (but within the same region) is charged at a discounted Regional transfer rate.  Having instances in multiple AZs is a minimal level of availability, but can’t be trusted alone.

Failure Scenarios

Looking at the history of AWS outages, they have been isolated to a single region, but have impacted multiple availability zones. Also, other regions may suffer a bit of a slowdown as customers in the affected region fail over into them. In general, there will be a larger load across the system and some API calls may not be fully responsive.

Possible Strategies

I would say that the number one strategy is making sure that you are geographically isolated.  While that isn’t the end-all (multiple cloud providers, physical datacenter, etc), it should give your cloud app more resilience when faced with a cascading failure within a single region.  This is a very similar principle to running real gear – you can’t rely on simply keeping servers in different racks when thinking of HA and DR.  Rather, you need to have geographically isolated data centers in the event of a catastrophic failure.
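To make the rack versus data center analogy concrete, here is a small Python sketch that checks an inventory for roles that would disappear if an entire region or AZ went away. The function name, the tuple layout, and the instance names are assumptions made for illustration, and the region/AZ labels are just example values.

```python
def single_points_of_failure(instances, required_roles):
    """Return (level, domain, missing_roles) for every region or AZ whose
    loss would leave a required role with no surviving instance.

    instances: list of (name, role, region, az) tuples (hypothetical layout).
    """
    findings = []
    for level, index in (("region", 2), ("az", 3)):
        domains = sorted({inst[index] for inst in instances})
        for domain in domains:
            # Roles still available after losing this whole region or AZ.
            survivors = {inst[1] for inst in instances if inst[index] != domain}
            missing = sorted(set(required_roles) - survivors)
            if missing:
                findings.append((level, domain, missing))
    return findings


# Illustrative inventory: two database servers in different AZs of one region.
multi_az_only = [
    ("db1", "db", "us-east-1", "us-east-1a"),
    ("db2", "db", "us-east-1", "us-east-1b"),
]
# Same inventory plus a standby in a second region.
multi_region = multi_az_only + [("db3", "db", "us-west-2", "us-west-2a")]

print(single_points_of_failure(multi_az_only, ["db"]))  # → [('region', 'us-east-1', ['db'])]
print(single_points_of_failure(multi_region, ["db"]))   # → []
```

The multi-AZ layout survives the loss of either AZ but not of the region, which is exactly the cascading-failure case described above.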

EBS Failover

I have seen EBS volume failover used as a viable HA option within a single AZ.  Essentially, your instance starts having issues so you mount your EBS data directory on another instance and simply fire up MySQL.  This works wonderfully unless EBS is the component experiencing the issue.  In this case, having a hot slave in another region is really the only way to “spin up another instance”.  In general, unless your data resides in another location, you can’t always assume you can simply mount your storage to another instance (thinking SAN failure in the real gear analogy).

Multi Region Replication

The easiest approach would be to use native replication from a master to a slave across regions.  This will allow you to keep a relatively in-sync instance in another region ready to take over in the event of a full region outage where your primary server resides.  Being asynchronous, there is always a potential for some slave lag, but 1-2 seconds of lost transactions (which may or may not be recoverable once the downed region recovers) compared with hours of downtime is probably a decent tradeoff.

You can also combine this approach with a tool like pt-query-digest or pt-playback to keep your standby server primed in the event of failover. I only mention this because a cold start can result in degraded service for quite some time as well.
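The lag tradeoff above can be written down as an explicit failover rule. The following Python sketch is one way to do that, under stated assumptions: the function name and both thresholds are hypothetical, and the lag value is assumed to come from something like the Seconds_Behind_Master field of SHOW SLAVE STATUS, captured before the primary region went dark.

```python
def should_promote(replica_lag_seconds, primary_down_seconds,
                   max_acceptable_lag=2.0, min_outage_before_failover=60.0):
    """Decide whether to promote the cross-region replica.

    replica_lag_seconds: last observed replication lag in seconds, or None
        if replication was already broken before the outage.
    primary_down_seconds: how long the primary has been unreachable.
    Both thresholds are illustrative defaults, not recommendations.
    """
    if replica_lag_seconds is None:
        return False  # unknown amount of data loss; leave it to a human
    if primary_down_seconds < min_outage_before_failover:
        return False  # avoid failing over on a brief blip
    return replica_lag_seconds <= max_acceptable_lag


print(should_promote(1.5, 300))   # → True: small lag, sustained outage
print(should_promote(30.0, 300))  # → False: too many transactions at risk
print(should_promote(1.0, 10))    # → False: outage too short to act on
```

Whatever numbers you pick, the point is to decide them ahead of time rather than in the middle of an outage.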

Percona XtraDB Cluster (PXC)

Another option would be to use PXC and keep a node (or more) in a separate region.  While this will allow for synchronous writes to the remote node, it will also have some latency impact on write operations (the ping time to the most distant node) but that may be something your application can tolerate.
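For a back-of-the-envelope feel of that latency impact, the Python sketch below treats the per-commit overhead as the round trip to the most distant node, as described above. The function names and the round-trip times are made up for illustration. It is a deliberately rough model, so treat it as a starting point for load testing rather than a prediction.

```python
def estimated_commit_latency_ms(rtt_ms_by_node):
    """Approximate per-commit overhead as the round trip to the farthest node."""
    return max(rtt_ms_by_node.values())


def max_sequential_commits_per_second(rtt_ms_by_node):
    """Rough upper bound for one connection committing back-to-back."""
    return 1000.0 / estimated_commit_latency_ms(rtt_ms_by_node)


# Made-up round-trip times in milliseconds; measure your own with ping.
rtts = {"same-az": 0.5, "other-az": 2.0, "remote-region": 80.0}

print(estimated_commit_latency_ms(rtts))        # → 80.0
print(max_sequential_commits_per_second(rtts))  # → 12.5
```

Many connections committing in parallel can still reach a much higher aggregate rate. The bound here applies to a single connection committing one transaction after another.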

In conjunction with having a node outside of the region, you can also keep your other nodes in different AZs within the same region to at least give yourself some isolation at the data center level (think keeping your nodes in separate racks).

If the extra write latency is something that is a show-stopper for your application, you may consider replicating from a PXC cluster to another cluster or standalone server in another region via standard asynchronous replication.  This would be similar to the above approach, but you gain some level of resiliency within the region as well.  As a note, you will definitely want to load test this solution when evaluating a cross region cluster.

 

Overall, issues like this are inevitable and will likely cause at least some downtime or degraded service (unless you run active/active, but that is another discussion).  However, given the issues that have occurred in the past, you can mitigate that by treating the cloud like you would a physical datacenter and planning accordingly.  Expect failures and simulate/test your failover procedures so you can be confident the next time #ec2pacolypse hits.

About Mike Benshoof

Michael joined Percona in 2012 as a US based consultant. Prior to joining Percona, Michael spent several years in a DevOps role maintaining a SaaS application specializing in social networking. His experiences include application development and scaling, systems administration, along with database administration and design. He enjoys designing extensible and flexible solutions to problems.

When not working, he enjoys golfing, grilling, watching sports, and spending time with the family.

Comments

  1. Karl says:

    And avoid North Virginia at all cost. Most of the issues are coming from this region, since it’s their first one.

  2. I like the idea of using two PXC clusters in different regions. Use at least three nodes in each region and place each node into a different AZ. The extra latency on commit won’t be too bad as the latency between AZs is fairly low.

    You can use normal asynch replication between regions. I suggest this for two reasons. First, you don’t want an Internet transit problem (suddenly very slow connections, or lots of packet loss etc) to slow down writing into your primary region.

    Second, while PXC performs much better when geographically distributed compared to semi-sync replication, the extra latency on commit may not be acceptable for your application if you extend your cluster across regions, particularly if you want your primary region to be in the US and the secondary region in Europe.

    Some other things to keep in mind:
    Make sure you encrypt your traffic between zones and regions. Use SSL, stunnel, openvpn, etc between your nodes.

    Make sure you run backups in each of the regions.

    Make sure you test failures and failure modes in each region.

  3. Jacky says:

    Just curious, has anyone done performance tests for PXC in a single availability zone, multiple availability zones, and multiple regions? This would give good insight into how well PXC does in an HA setup on EC2.

  4. Jeremiah says:

    Not using “The Cloud” seems to be the obvious way to avoid the impact of outages like this.

  5. William says:

    @Jeremiah Avoiding AWS will help you avoid AWS outages, but then you’ll have your own infrastructure problems. Mixed in with a few AWS specifics, there are some geographic distribution tips as well.

  6. Aldo says:

    Thanks for the post! The benefits of using AWS are reason enough to have a good contingency plan. I’ll suggest this topic be included in the course on AWS.

  7. Anil says:

    Please post your experience with Percona XtraDB Cluster (PXC) on AWS. It would help us since we want to explore this option.
