Well, Thursday, April 21st, marks a historic day for cloud computing. Amazon Elastic Compute Cloud (EC2), one of the titans, the giants, the behemoths of cloud computing, sustained a massive outage that, as of the time of this post, has lasted 33 hours with no sign of full recovery in sight.
A quick review of the Amazon Service Health Dashboard for Amazon Elastic Compute Cloud (N. Virginia), one of the largest and most heavily populated regions in North America, shows a series of 19 fairly morbid updates starting at 1:41 AM PDT on April 21st, 2011 and ending (as of the time of this post) at 8:49 AM PDT on April 21st.
We at Fully Managed™ were negatively affected because several of our key information systems run within this region. This includes our public website and several ancillary systems we use to support our clients. Luckily, we have a fairly diversified set of tools and a robust disaster recovery plan that have allowed us to sustain this 33-hour outage without impacting our clients negatively. What have I learned from this outage?
The Amazon Service Health Dashboard is a great idea, but in a time of crisis the communication delivered through this feed is completely and utterly useless. While I appreciate Amazon providing us with semi-regular updates every hour (or sometimes every 5 hours, depending on how they're feeling), the content gives us absolutely zero peace of mind as a client. Take this example:
"Our current estimate is that the majority of volumes will be recovered in the next 5 to 6 hours.." and "..a smaller number of volumes will require a more time consuming process... will take longer to recover..."
Clearly, this is written either by a lawyer or a weather forecaster. Since I believe that the vast majority of sites that are offline still reside inside the availability zone that is offline, what help does the statement above provide? Should we start rebuilding the applications from older backups? Will the restore be successful when completed? Have you really just lost all of the data and are trying to buy time to make a press release? We need quantifiable information here. I have to believe that by now, 33 hours later, Amazon has an idea of the damage and a more precise timeline that they could provide on a client-by-client basis. Not sharing that information is a massive failure on their part. Sort of reminds me of the Tiger Woods debacle.
We've been led to believe that massive cloud providers such as Amazon have an incredibly robust, virtually bulletproof back-end infrastructure. They provide money-back SLAs boasting 99.95% uptime. Two issues come to mind here: 1) Are they actually achieving this SLA? Certainly, after this outage they will no longer be hitting 99.95%. 2) Who cares about a bit of money back when your operation has been ground to a halt for two days?
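To put some rough numbers on that first question, here is a back-of-the-envelope sketch. It assumes the 99.95% figure is measured over a 365-day year and that the 33 hours so far are the only downtime in that year; both are my assumptions, and the function names are purely illustrative.

```python
# Rough SLA arithmetic: how much downtime does 99.95% allow,
# and what uptime does a 33-hour outage leave you with?
# Assumption: the SLA is measured over a 365-day (non-leap) year.
HOURS_PER_YEAR = 365 * 24  # 8760


def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Downtime budget, in hours, that an uptime SLA permits over the period."""
    return period_hours * (1 - sla_percent / 100)


def achieved_uptime_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Uptime percentage over the period, given total hours of downtime."""
    return 100 * (1 - downtime_hours / period_hours)


budget = allowed_downtime_hours(99.95)   # yearly downtime budget in hours
achieved = achieved_uptime_percent(33)   # assumes no other downtime this year

print(f"Downtime allowed by a 99.95% SLA: {budget:.2f} hours per year")
print(f"Best-case uptime after a 33-hour outage: {achieved:.2f}%")
print(f"Outage so far is {33 / budget:.1f}x the yearly budget")
```

Under those assumptions, a 99.95% SLA leaves room for only about 4.4 hours of downtime a year, and 33 hours of outage caps the year at roughly 99.62%, even if nothing else goes wrong.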
This incident should raise some concerns for anyone currently considering or leveraging cloud computing of this nature. My recommendation is that you work with your IT leadership team to add some strategy around disaster recovery and business continuity with cloud computing.
The funny thing is, Amazon will probably see some upside from this outage as their clients light up new server instances in different availability zones. I can tell you that we are not looking at this option, and would rather implement failover systems either inside our own private cloud or with another public cloud hosting provider.