The Riverbed Blog (testing)

A blog in search of a tagline

OMG! The Sky is Falling!

Posted by riverbedtest on April 26, 2011

Last Thursday morning I woke up to the kind of headline in the New York Times that I hadn't seen in years. "Amazon Cloud Failure Takes Down Web Sites," it read.  Wow.  An article about downed computers.  That takes me back.  I haven't seen those since the late '90s and early '00s (we still don't have a name for that decade, do we?), when security glitches and outages affected major web sites and became front page news.

Probably the least surprising thing about complex computers and computing networks is that from time to time they go down.  In fact, I'm more surprised when they stay up.  After all, what is the Public Cloud but a collection of computers?  Adding virtual computing doesn't change the fact that there are physical computers underlying the virtual machines.  In fact, the addition of VMs to the mix just makes for more complexity and more things that can fail.

Back in 2003, I wrote a book on Highly Available Systems Design, and while we didn't write about virtual machines or public cloud computing, the principles really haven't changed all that much.  Since every component of a computer system can fail, if you want truly reliable computing you must foresee these outages, build protections into the system, and eliminate as much of the complexity as possible.  Of course you have to balance that with cost; the larger and rarer the prospective outage, the harder and more expensive it is to protect against it.  And if you take that to its ultimate extreme, you soon realize that the Earth is a single point of failure, and there's not much we can do about it.

So newspapers are reporting about computing outages again.  I look at it in a few ways.

  1. The publicity is ultimately GOOD for the Public Cloud Computing industry.  Why?  Because ALL publicity is good, and because in the end more people will learn about Cloud Computing.  
  2. Companies will learn from their mistakes and build in better protections so that the same failures won't cause the same kind of outage next time.
  3. The first time around, one of the companies that had some of the most widely publicized failures was amazon.com.  They seem to have gotten past those early problems, and I hear they're doing pretty well these days.

 
