If you haven't heard by now, Google GMail had a 2.5 hour outage on Tuesday. Acacio Cruz, Gmail’s Site Reliability Manager, explained in a little more detail regarding the outage. My interpretation of his explanation is that they had a bug in how data is shifted around in the cloud, and during a shift one of data centers couldn't handle the new burden, failed, and eventually the effect cascaded.

I have a few comments on this. First, GMail still carries the 'beta' moniker...and aren't service disruptions like this kind of expected/tolerated with beta software? But beta or not, real people use the service--including Google themselves (according to Acacio Cruz's post). So maybe it's time to promote GMail out of beta and designate it as production, since that's what it is to most people anyways.

Second, running a resilient cloud is hard, particularly when it comes to outages and the resulting automatic rebalancing to cover the gap and maintain availability. It's like a tire on a car: everything rotates routinely and drives smoothly when the tire is balanced. But if the center of balance of the tire shifts dramatically at a random moment, things quickly deteriorate and the high-speed rotation can cause significant instability and damage to the rest of the vehicle. Computing clouds are conceptually similar. What happens when an outage takes out a piece of the cloud and the finely orchestrated balance is randomly skewed...is the capacity of the rest of the cloud sufficient to absorb the load? Can the cloud rebalance itself gracefully? These are the same types of questions one will hear when investigating fault tolerance, failover, and redundancy capabilities of any normal technology component (servers, firewalls, WAN links, etc.). But the questions take on a whole new level of scale when you apply them to a cloud.

Plus let's not overlook the fact that we have different fault tolerance expectations for services in the cloud. The typical perception of clouds seems to imply internal failover capabilities. After all, that's part of the appeal of clouds: all that fault tolerance, redundancy, etc. is magically taken care of within the cloud. The cloud is infallible. The cloud cannot fail. Those are the dominant marketing messages. No one talks about having a contingency plan for when the cloud is unavailable (what would that be exactly? A second hot-spare cloud?).

So when choosing a cloud-based vendor, you should try to investigate their cloud resiliency. Vendors who take non-cloud specific software, run copies of it in two or three datacenters, and call it a cloud service are likely not delivering the true expected experience to customers. Software needs to be purpose-built to run in a cloud, including multi-tenancy design and strong internal failover capability. That's because most people wind up putting all of their eggs into the cloud basket. If you are going to be one of those people, you should try to make sure the basket offers the implied safety that you expect.

Until next time,
- Jeff