The fragility of public cloud IT

I have been reading AntiFragile again (by Nassim Taleb). And although he would probably disagree with my use of his concepts, it appears to me that IT is becoming more fragile, not less.

For example, recent outages at major public cloud providers display increased fragility for IT. Yet these problems, although almost national in scope, seldom deter individual organizations from their migration to the cloud.

Tragedy of the cloud commons

The issues are somewhat similar to the tragedy of the commons. When more and more entities use a common pool of resources, occasionally that common pool can become degraded. But because no-one really owns the common resources no one has any incentive to improve the situation.

Now the public cloud, although certainly a common pool of resources, is also most assuredly owned by corporations. So it’s not a true tragedy of the commons problem. Public cloud corporations have a real incentive to improve their services.

However, the fragility of IT in general, the web, and other electronic/data services all increases as they become more and more reliant on public cloud, common infrastructure. And I would propose this general IT fragility is really not owned by any one person, corporation or organization, let alone the public cloud providers.

Pre-cloud was less fragile, post-cloud more so

In the old days of last century, pre-cloud, if a human screwed up a CLI command the worst they could happen was to take out a corporation’s data services. Nowadays, post-cloud, if a similar human screws up a CLI command, the worst that can happen is that major portions of the internet services of a nation go down.

Strange Clouds by michaelroper (cc) (from Flickr)

Yes, over time, public cloud services have become better at not causing outages, but they aren’t going away. And if anything, better public cloud services just encourages more corporations to use them for more data services, causing any subsequent cloud outage to be more impactful, not less

The Internet was originally designed by DARPA to be more resilient to failures, outages and nuclear attack. But by centralizing IT infrastructure onto public cloud common infrastructure, we are reversing the web’s inherent fault tolerance and causing IT to be more susceptible to failures.

What can be done?

There are certainly things that can be done to improve the situation and make IT less fragile in the short and long run:

  1. Use the cloud for non-essential or temporary data services, that don’t hurt a corporation, organization or nation when outages occur.
  2. Build in fault-tolerance, automatic switchover for public cloud data services to other regions/clouds.
  3. Physically partition public cloud infrastructure into more regions and physically separate infrastructure segments within regions, such that any one admin has limited control over an amount of public cloud infrastructure.
  4. Divide an organizations or nations data services across public cloud infrastructures, across as many regions and segments as possible.
  5. Create a National Public IT Safety Board, not unlike the one for transportation, that does a formal post-mortem of every public cloud outage, proposes fixes, and enforces fix compliance.

The National Public IT Safety Board

The National Transportation Safety Board (NTSB) has worked well for air transportation. It relies on the cooperation of multiple equipment vendors, airlines, countries and other parties. It performs formal post mortems on any air transportation failure. It also enforces any fixes in processes, procedures, training and any other activities on equipment vendors, maintenance services, pilots, airlines and other entities that can impact public air transport safety. At the moment, air transport is probably the safest form of transportation available, and much of this is due to the NTSB

We need something similar for public (cloud) IT services. Yes most public cloud companies are doing this sort of work themselves in isolation, but we have a pressing need to accelerate this process across cloud vendors to improve public IT reliability even faster.

The public cloud is here to stay and if anything will become more encompassing, running more and more of the worlds IT. And as IoT, AI and automation becomes more pervasive, data processes that support these services, which will, no doubt run in the cloud, can impact public safety. Just think of what would happen in the future if an outage occurred in a major cloud provider running the backend for self-guided car algorithms during rush hour.

If the public cloud is to remain (at this point almost inevitable) then the safety and continuous functioning of this infrastructure becomes a public concern. As such, having a National Public IT Safety Board seems like the only way to have some entity own IT’s increased fragility due to  public cloud infrastructure consolidation.

~~~~

In the meantime, as corporations, government and other entities contemplate migrating data services to the cloud, they should consider the broader impact they are having on the reliability of public IT. When public cloud outages occur, all organizations suffer from the reduced public perception of IT service reliability.

Photo Credits: Fragile by Bart Everson; Fragile Planet by Dave Ginsberg; Strange Clouds by Michael Roper

2 Replies to “The fragility of public cloud IT”

  1. Well, every significant public cloud service outage is an opportunity to think about how you have deployed your cloud services to avoid being cut-off from your apps and data during an outage. An important thing to remember is AWS is not your network architect, storage administrator or systems admin. Organizations do need to learn how to use and deploy AWS in ways that won’t leave them exposed to service interruptions when something or someone at AWS causes a problem.

    There may also be reasons why the AWS US-East-1 region has been more prone to outages. It was the first region that Amazon brought up in 2006. Amazon acquired existing data centers instead of building new ones, so they are getting old. This AWS region is also one of the largest. Because AWS gets bigger over time, size and complexity may create problems not anticipated by the designers. AWS should be able to survive multiple failures without going off-line, but when some anomalous behavior occurs it can produce unexpected results.

    1. Tim,

      Thanks for your comments. Yes it’s interesting that US-East-1 has been more error prone than the other regions, and I agree that concentration brings its own problems. It’s less clear how customers can easily design there systems for more fault tolerance in the face of AWS regional outages. But I am not a full stack developer.

      My main concern is not that AWS or the others have outages or problems but what’s the best way forward to reduce their frequency, duration and extent over time. I do believe that all the cloud providers are learning from their outages and are making changes to improve all these factors. But we as a society, can do better if we pool these improvement activities rather than continue them in isolation. And as the IoT, AI and automation become increasingly prominent in everyday society, the need for safer IT public cloud services will become even more important.
      Ray

Comments are closed.