Twitter’s bad day: an outage explained

Twitter has had a good run over the last year. There have been almost no outages, and the fail whale had been on hiatus until today.

Twitter was down for almost 2 hours in total across a series of outages this morning. At first, Twitter executives did not have much information and kept silent, which is understandable during an outage. Once they got their bearings, they offered a little more information.

This is what Twitter VP of Engineering Mazen Rawashdeh had to say:

Not how we wanted today to go. At approximately 9:00am PDT, we discovered that Twitter was inaccessible for all web users, and mobile clients were not showing new Tweets. We immediately began to investigate the issue and found that there was a cascading bug in one of our infrastructure components. This wasn’t due to a hack or our new office or Euro 2012 or GIF avatars, as some have speculated today. A “cascading bug” is a bug with an effect that isn’t confined to a particular software element, but rather its effect “cascades” into other elements as well. One of the characteristics of such a bug is that it can have a significant impact on all users, worldwide, which was the case today. As soon as we discovered it, we took corrective actions, which included rolling back to a previous stable version of Twitter.

We began recovery at around 10:10am PDT, dropped again around 10:40am PDT, and then began full recovery at 11:08am PDT. We are currently conducting a comprehensive review to ensure that we can avoid this chain of events in the future.

For the past six months, we’ve enjoyed our highest marks for site reliability and stability ever: at least 99.96% and often 99.99%. In simpler terms, this means that in an average 24-hour period, [Twitter] has been stable and available to everyone for roughly 23 hours, 59 minutes and 40-ish seconds. Not today though.

We know how critical Twitter has become for you — for many of us. Every day, we bring people closer to their heroes, causes, political movements, and much more. One user, Arghya Roychowdhury, put it this way:
It’s imperative that we remain available around the world, and today we stumbled. For that we offer our most sincere apologies and hope you’ll be able to breathe easier now.
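
To put the numbers in that statement in perspective, here is a rough back-of-the-envelope sketch in Python. It converts the quoted availability percentages into average downtime per day, and adds up today's outage using the times Rawashdeh gave. The helper names are my own, and treating 9:00 to 10:10 and 10:40 to 11:08 as the two downtime windows is my assumption based on the statement, not an official breakdown from Twitter.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a 24-hour period


def daily_downtime_seconds(availability_percent: float) -> float:
    """Average seconds of downtime per day at a given availability percentage."""
    return SECONDS_PER_DAY * (1 - availability_percent / 100)


def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day clock times written as 'H:MM'."""
    start_h, start_m = map(int, start.split(":"))
    end_h, end_m = map(int, end.split(":"))
    return (end_h * 60 + end_m) - (start_h * 60 + start_m)


# Availability figures quoted in the statement.
for pct in (99.96, 99.99):
    print(f"{pct}% available: about {daily_downtime_seconds(pct):.1f} s down per day")

# Assumed downtime windows (PDT), taken from the times in the statement.
outage_windows = [("9:00", "10:10"), ("10:40", "11:08")]
total_minutes = sum(minutes_between(s, e) for s, e in outage_windows)
print(f"Today's outage: about {total_minutes} minutes")
```

At 99.96% the site is down roughly 35 seconds a day on average, and at 99.99% roughly 9 seconds, so the "40-ish seconds" in the statement lands between the two. Under the assumed windows, today adds up to 98 minutes, which lines up with the "almost 2 hours" figure above.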

The problem, Rawashdeh explained, had to do with what is called a “cascading bug” — a term that quickly spawned its own parody Twitter account — in one of the company’s infrastructure components. That bug wasn’t confined to an individual element of the company software, so it created a cascading effect, spreading to other parts of the software and affecting Twitter’s 150 million-plus users.

These things happen. Twitter will get back up and brush off its shoulders. I applaud how quickly the Twitter engineers were able to bring such a large service back online. Other blogs are focusing on the idea that advertisers may lose confidence in the service, but I do not see that happening unless this becomes a recurring problem. Tweet on.

Learn more about the author of this post:

I was a Computer and Information Technology student at Purdue University. I have always wanted my own website and have been fascinated with technology my entire life. So here I am, what's next?