On the 21st of April, Amazon Web Services experienced a major outage that took many large websites offline, including Reddit.com and Hootsuite. The outage was attributed to a change made to upgrade network capacity, which also shifted traffic off one of the redundant routers for the Amazon Elastic Block Store (EBS). Amazon states that the shift was done incorrectly, which caused the outage.
Netflix also runs on AWS but stayed online through the outage, and the company wants to tell you why. Netflix was designed to be ready for exactly the type of failure that happened at Amazon on the 21st. Its servers did not use EBS as their main data storage, so when the traffic was routed off EBS at Amazon, Netflix was still up.
What Went Well…
The Netflix service ran without intervention but with a higher than usual error rate and higher latency than normal through the morning, which is the low traffic time of day for Netflix streaming. We didn’t see a noticeable increase in customer service calls, or our customers’ ability to find and start movies.
In some ways this Amazon issue was the first major test to see if our new cloud architecture put us in a better place than we were in 2008. It wasn’t perfect, but we didn’t suffer any big external outages and only had to do a small amount of scrambling internally.
What Didn’t Go So Well…
While we did weather the storm, it wasn’t totally painless. Things appeared pretty calm from a customer perspective, but we did have to do some scrambling internally. As it became clear that AWS was unlikely to resolve the issues before Netflix reached peak traffic in the early evening, we decided to manually re-assign our traffic to avoid the problematic zone. Thankfully, Netflix engineering teams were able to quickly coordinate to get this done. At our current level of scale this is a bit painful, and it is clear that we need more work on automation and tooling. As we grow from a single Amazon region and 3 availability zones servicing the USA and Canada to being a worldwide service with dozens of availability zones, even with top engineers we simply won’t be able to scale our responses manually.
Although Netflix stayed up and running during the AWS outage, it ran with higher than normal latency and error rates. Traffic was still low at that point because it was early in the morning. As traffic picked up over the day, the Netflix engineering teams found themselves scrambling to redirect it away from the problematic zone. The outage was an eye-opener for Netflix: many of these traffic changes had to be made manually, and the company will need more automated tools to make them faster in the future. You can learn more about what happened with Netflix here.