
Gene Kranz was 37 years old when Apollo 13, the space mission he served on as lead Flight Director, launched. A few days after takeoff, the Apollo 13 Service Module exploded, jeopardizing the lives of the three astronauts on board. While he never used the phrase “Failure Is Not An Option” (other than in the film version of the story), it was the attitude of mission control.
In light of the AWS EC2 crisis, which has left some companies with their services crippled or completely disabled for over 24 hours, it’s a good time to reflect on whether or not failure is an option for you. I know that most of us aren’t writing software that controls the launch codes, but we are writing software that businesses, and therefore people’s livelihoods, depend upon.
Evan Cooke, Twilio’s CTO, wrote a great post yesterday about how they managed to avoid any downtime. Pay attention to his update at the end of the post regarding how they’ve resisted EBS adoption.
By the way…
At Mashery, we did extensive I/O testing on EBS versus RAID0 striped ephemeral stores, and there’s no comparison. RAID0 blows EBS away on performance. Sure, you don’t get the magical features off-the-shelf that Amazon provides for EBS, but then again … how magical are those features feeling today?
So ask yourself: is failure an option?

When you’re designing a system, you make tradeoffs. Choices. You must choose among variables like:
… just to name a few. Most people are familiar with the Project Triangle: “Good, Fast and Cheap: Choose any two” … but a simple triangle doesn’t fit today’s projects.
Most of the above items are self-explanatory, but I specifically separated resiliency and durability. A project/site/service may be resilient, which to me means it can go down and bounce back quickly. A durable site, on the other hand, won’t go down. The two also differ on whether or not a site can rebound without data loss. A resilient service may bounce back, but suffer data loss … or, perhaps a resilient site doesn’t have any data to worry about. There aren’t many services that have value without some data, but many services provide value with minimal data dependencies.
As a handy reminder, I give you … The Heptmogrifier. Adjust as you like, but remember that your head will explode if you try to set all sliders to the max.

(I know “performant” isn’t really a word, but it’s used often enough that I’ll deal with those complaints.)
This is what you’re really working with when you deal with “internet-scale” applications.
The traditional good/fast/cheap tradeoffs will dictate the project’s code quality. But those three factors are influenced heavily by the choices you make about the other four.
Sustainable: How much is this thing going to cost to keep running, given the choices made?
Durable: How susceptible is the service to failure? More importantly, how likely is data loss as a result of failure?
Resilient: How quickly can the service recover from a failure? It’s extremely difficult for a service to have zero downtime. So, even if you’re building something durable, factor in what it takes to fail over. Example: a DNS TTL of 60 seconds, with a health check every 30 seconds and a two-strikes policy, might take as much as 120 seconds to detect a failure and switch traffic elsewhere.
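If you want to play with those numbers, here is a rough sketch of the arithmetic. The function name and parameters are made up for illustration, and it assumes the DNS record is switched the instant the last strike lands and that resolvers honor the TTL; real-world behavior will be messier.

```python
def worst_case_failover_seconds(dns_ttl, check_interval, strikes):
    """Rough upper bound on time from failure to traffic reaching the new target."""
    # Detection: the failure can happen right after a passing check, so the
    # final consecutive failed check may land up to strikes * interval later.
    detection = check_interval * strikes
    # Propagation: a resolver may have cached the old record just before the
    # switch, and can keep serving it for up to one full TTL.
    propagation = dns_ttl
    return detection + propagation


# The example from the text: 60s TTL, 30s health checks, two strikes.
print(worst_case_failover_seconds(dns_ttl=60, check_interval=30, strikes=2))  # 120
```

Halving the TTL or the check interval each shaves 30 seconds off that bound, which is the kind of slider adjustment worth pricing out before you need it.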
Performant: How do the choices you make in other areas affect performance overall? Or, perhaps some aspects of the system will need performance expectation adjustments based on the choices you make around durability?
Not all of these choices are as painful as they sound. A comment on my previous post suggested that latency was too high across regions to consider. This simply isn’t true for most applications. A half-second lag for master-slave replication across the United States is nothing when you consider that you’ll be able to cut over to that other region in 2 minutes, promote that slave to a master, and be back up and running before most people finish reading the TechCrunch article about how the sky has fallen.
They probably don’t care about the technical details, but they will certainly care about their investment going down the tubes because the service they invested in goes toes up due to incomplete failure planning. It’s not a popular conversation, and “how will I scale?” is often how questions about uptime and availability are framed.
It’s time to think about it differently. I agree that scaling to millions of users is a nice problem to have — a “high class problem,” as a friend is fond of saying. However, surviving an outage like what’s happening with Amazon EC2 East is an entirely different story. Your service may find it difficult to get to 10,000 users if it faceplants every time an upstream provider has a hiccup. (Investors, please weigh in below — feel free to call bullshit if I’m off the mark.) Make sure you and your other stakeholders are on the same page.
Take control of your destiny.
Reject failure as an option.
In doing so, you’ll not only live to fight another day, you’ll be around to take customers your competitors alienated by choosing … poorly.
Image credits: NASA, Indiana Jones and The Last Crusade

When we started using Amazon EC2 at Mashery in the fall of 2006, it had been in private beta for about three days. Meaning they’d announced it was available, but you had to get an invite to use it.
We managed to score an invite, and I immediately began building Mashery’s entire infrastructure around EC2. There was some angst over this decision internally and from our investors, but in retrospect it was one of the most important (and best!) decisions we made as a young company.
One of the things I remember vividly about the month in which I got my feet wet with EC2 was my take-away from reading the sparse documentation. It didn’t come right out and say it, but the vibe between the lines was:
WE MAY RIP THE RUG OUT FROM UNDER YOUR FEET AT ANY MOMENT.
I had this sensation that ANY instance, at ANY time, would just blink out of existence without any warning (other than the ominous undertone of the documentation).
So we built for fault tolerance from the beginning.
In the years since, I’ve seen services flail due to some EC2 hiccup or another. The buzz around Infrastructure as a Service has shifted from 2006 “IF YOU USE IT YOU ARE CRAZY” tones to “YOU ARE CRAZY IF YOU DON’T USE IT”, and with that shift has come this pleasant sensation that cloud infrastructure never goes down.
Snap out of it.
ANYTHING TECHNICAL has the potential to just up and crap the bed at ANY TIME. A service built without anticipating failure deserves the downtime it experiences.
I’ve heard all the ways of saying that planning for failover is premature optimization, but let’s face it: if the service isn’t built to fail well from the beginning, it’s unlikely that anyone will ever get around to adding graceful fault tolerance.
What most companies do when faced with Technical Failure Awakening is build a “we’re down! Sorry!” page that lives on some other server, and point DNS to that page when their primary infrastructure fails. That’s not a solution, folks, it’s just a slightly better bedwetting than a completely unresponsive server. These guys have gone from failing their customers 100% to failing them 99.9%.
There are plenty of ways to get around this — it takes planning, and prioritization from the top.

Here’s the catch: failing well is hard. In the current “OMG Launch Immediately and Iterate!” and “Customer Service is the Only Defensible Strategy” culture of internet companies, no one has paused to recognize that these are mutually near-exclusive. Planning to fail gracefully in ways that don’t negatively impact customers is HARD WORK. It takes TIME, and simulated failures. (See: Chaos Monkey)
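To make “simulated failures” a little more concrete, here is a toy sketch of a failure drill: pick one instance at random, take it down, and check whether the service is still standing. This is not how Netflix’s Chaos Monkey is implemented; `run_drill`, `terminate` and `is_healthy` are hypothetical names, and the callbacks stand in for whatever actually stops an instance and probes your service.

```python
import random


def pick_victim(instances, rng=random):
    """Choose one instance at random to take down in a failure drill."""
    candidates = list(instances)
    if not candidates:
        raise ValueError("no instances to choose from")
    return rng.choice(candidates)


def run_drill(instances, terminate, is_healthy, rng=random):
    """Terminate one random instance, then report whether the service survived."""
    victim = pick_victim(instances, rng)
    terminate(victim)  # hypothetical callback: whatever really stops the instance
    return victim, is_healthy()


# Usage with a fake in-memory fleet that needs at least two live nodes.
alive = {"web-1", "web-2", "web-3"}
victim, survived = run_drill(
    sorted(alive),
    terminate=alive.discard,
    is_healthy=lambda: len(alive) >= 2,
)
print(victim, "terminated; service healthy:", survived)
```

The value is less in the code than in the habit: run the drill on a schedule, in an environment you care about, and treat a failed drill as a bug.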
This is by no means a complete list of failure avoidance resources, but here are a couple of things to start with:
(No, I don’t work for DynDNS — I just think they have a great product offering at a reasonable price. Similar services can be had from UltraDNS or Akamai if you’re interested in paying 1999 prices.)
Amazon RDS doesn’t currently support auto-failover replication between regions, which is the only thing that would have saved people from this week’s outage. Those that are particularly concerned about having serious uptime can’t be fully bound to what their cloud service offers them. Cloud services are super-awesome, but it may make sense to use them in conjunction with other techniques that aren’t (yet) packaged up for mass consumption. Replication is often one of those hard problems that require custom configurations that off-the-shelf product features won’t cover. Remember: that reality doesn’t minimize the importance of replication to fault tolerance.
Go forth and stay dry under the clouds.
Photo credits: iStockphoto, CaptainD