The fragility of public cloud IT

I have been reading AntiFragile again (by Nassim Taleb). And although he would probably disagree with my use of his concepts, it appears to me that IT is becoming more fragile, not less.

For example, recent outages at major public cloud providers display increased fragility for IT. Yet these problems, although almost national in scope, seldom deter individual organizations from their migration to the cloud.

Tragedy of the cloud commons

The issues are somewhat similar to the tragedy of the commons. When more and more entities use a common pool of resources, occasionally that common pool can become degraded. But because no-one really owns the common resources no one has any incentive to improve the situation.

Now the public cloud, although certainly a common pool of resources, is also most assuredly owned by corporations. So it’s not a true tragedy of the commons problem. Public cloud corporations have a real incentive to improve their services.

However, the fragility of IT in general, the web, and other electronic/data services all increases as they become more and more reliant on public cloud, common infrastructure. And I would propose this general IT fragility is really not owned by any one person, corporation or organization, let alone the public cloud providers.

Pre-cloud was less fragile, post-cloud more so

In the old days of last century, pre-cloud, if a human screwed up a CLI command the worst they could happen was to take out a corporation’s data services. Nowadays, post-cloud, if a similar human screws up a CLI command, the worst that can happen is that major portions of the internet services of a nation go down.

Strange Clouds by michaelroper (cc) (from Flickr)

Yes, over time, public cloud services have become better at not causing outages, but they aren’t going away. And if anything, better public cloud services just encourages more corporations to use them for more data services, causing any subsequent cloud outage to be more impactful, not less

The Internet was originally designed by DARPA to be more resilient to failures, outages and nuclear attack. But by centralizing IT infrastructure onto public cloud common infrastructure, we are reversing the web’s inherent fault tolerance and causing IT to be more susceptible to failures.

What can be done?

There are certainly things that can be done to improve the situation and make IT less fragile in the short and long run:

  1. Use the cloud for non-essential or temporary data services, that don’t hurt a corporation, organization or nation when outages occur.
  2. Build in fault-tolerance, automatic switchover for public cloud data services to other regions/clouds.
  3. Physically partition public cloud infrastructure into more regions and physically separate infrastructure segments within regions, such that any one admin has limited control over an amount of public cloud infrastructure.
  4. Divide an organizations or nations data services across public cloud infrastructures, across as many regions and segments as possible.
  5. Create a National Public IT Safety Board, not unlike the one for transportation, that does a formal post-mortem of every public cloud outage, proposes fixes, and enforces fix compliance.

The National Public IT Safety Board

The National Transportation Safety Board (NTSB) has worked well for air transportation. It relies on the cooperation of multiple equipment vendors, airlines, countries and other parties. It performs formal post mortems on any air transportation failure. It also enforces any fixes in processes, procedures, training and any other activities on equipment vendors, maintenance services, pilots, airlines and other entities that can impact public air transport safety. At the moment, air transport is probably the safest form of transportation available, and much of this is due to the NTSB

We need something similar for public (cloud) IT services. Yes most public cloud companies are doing this sort of work themselves in isolation, but we have a pressing need to accelerate this process across cloud vendors to improve public IT reliability even faster.

The public cloud is here to stay and if anything will become more encompassing, running more and more of the worlds IT. And as IoT, AI and automation becomes more pervasive, data processes that support these services, which will, no doubt run in the cloud, can impact public safety. Just think of what would happen in the future if an outage occurred in a major cloud provider running the backend for self-guided car algorithms during rush hour.

If the public cloud is to remain (at this point almost inevitable) then the safety and continuous functioning of this infrastructure becomes a public concern. As such, having a National Public IT Safety Board seems like the only way to have some entity own IT’s increased fragility due to  public cloud infrastructure consolidation.

~~~~

In the meantime, as corporations, government and other entities contemplate migrating data services to the cloud, they should consider the broader impact they are having on the reliability of public IT. When public cloud outages occur, all organizations suffer from the reduced public perception of IT service reliability.

Photo Credits: Fragile by Bart Everson; Fragile Planet by Dave Ginsberg; Strange Clouds by Michael Roper