Google releases new Cloud TPU & Machine Learning supercomputer in the cloud

Last year about this time Google released their 1st generation TPU chip to the world (see my TPU and HW vs. SW … post for more info).

This year they are releasing a new version of the hardware, called the Cloud TPU chip, and making it available as a cluster on their Google Cloud. Cloud TPU is in alpha testing now. As I understand it, access to the Cloud TPU will eventually be free to researchers who promise to freely publish their research, and available at a price for everyone else.

What’s different between TPU v1 and Cloud TPU v2

The differences between version 1 and 2 mostly seem to be tied to training Machine Learning Models.

TPU v1 didn’t have any real ability to train machine learning (ML) models. It was a relatively dumb (8-bit ALU) chip, but if you already had an ML model built to do something like understand speech, you could load that model onto the TPU v1 board and have it executed very fast. The TPU v1 chip was also mounted on a separate PCIe board (I think), connected to normal x86 CPUs as a sort of CPU accelerator. The advantage of TPU v1 over GPUs or normal x86 CPUs was mostly in power consumption and speed of ML model execution.

Cloud TPU v2 looks to be a standalone multi-processor device that’s connected to others via what look like Ethernet connections. One thing Google seems to be highlighting is the Cloud TPU’s floating point performance. A Cloud TPU device (board) is capable of 180 TeraFLOPS (trillion, or 10^12, floating point operations per second). A 64-device Cloud TPU pod can theoretically execute 11.5 PetaFLOPS (10^15 FLOPS).

TPU v1 had no floating point capabilities whatsoever. So Cloud TPU is intended to speed up the training part of ML models, which requires extensive floating point calculations. Presumably, they have also improved ML model execution processing in Cloud TPU vs. TPU v1. More information on their Cloud TPU chips is available here.

So how do you code a TPU?

Both TPU v1 and Cloud TPU are programmed by Google’s open source TensorFlow. TensorFlow is a set of software libraries to facilitate numerical computation via data flow graph programming.

Apparently, with data flow programming you have many nodes and many more connections between them. When a connection between nodes fires, it transfers a multi-dimensional array (tensor) to the receiving node. The node takes this multidimensional array, does some (floating point) calculations on the data, and then determines which of its outgoing connections to fire and what tensor to send across those connections.
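
To make the dataflow idea concrete, here’s a minimal sketch in the TensorFlow 1.x graph style (the 1.2-era API mentioned below): nodes are operations, edges carry tensors, and nothing runs until the graph is executed in a session. This is just an illustration of the programming model, not TPU-specific code.

```python
import tensorflow as tf

# Two placeholder nodes whose outgoing edges will carry tensors fed at run time
a = tf.placeholder(tf.float32, shape=(2, 3), name="a")
b = tf.placeholder(tf.float32, shape=(3, 2), name="b")

# A matmul node; its incoming connections deliver the tensors a and b
c = tf.matmul(a, b, name="c")

# Nothing executes until the dataflow graph is run in a session
with tf.Session() as sess:
    result = sess.run(c, feed_dict={a: [[1, 2, 3], [4, 5, 6]],
                                    b: [[1, 0], [0, 1], [1, 1]]})
    print(result)
```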

Apparently, TensorFlow works with x86 servers, GPU chips, TPU v1 or Cloud TPU. Google TensorFlow 1.2.0 is now available. Google says that TensorFlow is in use in over 6,000 open source projects. TensorFlow uses Python, and 1.2.0 runs on Linux, Mac & Windows. More information on TensorFlow can be found here.

So where can I get some Cloud TPUs?

Google is releasing their new Cloud TPU in the TensorFlow Research Cloud (TFRC). The TFRC has 1000 Cloud TPU devices connected together, which can be used by any organization to train and execute machine learning models.

I signed up (here) to be an alpha tester. During the signup process the site asked me: what hardware (GPUs, CPUs) and platforms I was currently using to train my ML models; how long my ML models take to train; and how large a training (data) set I use (ranging from 10GB to >1PB), as well as other ML model oriented questions. I guess they’re trying to understand what the market requirements are outside of Google’s own use.

Google has been using more ML and other AI technologies in many of their products, and this will no doubt accelerate with the introduction of the Cloud TPU. Making it available to others is an interesting play, but it’s one way to amortize the cost of creating the chip. Another way would be to sell the Cloud TPU directly to businesses, government agencies, non-governmental organizations, etc.

I have no real idea what I am going to do with alpha access to the TFRC, but I was thinking maybe I could feed it all my blog posts and train an ML model to start writing blog posts for me. If anyone has any other ideas, please let me know.

Comments?

Photo credit(s): From Google’s website on the new Cloud TPU

 

Disaster recovery from VMware to AWS using Dell EMC Avamar & Data Domain

I was at Dell EMC World 2017 last week and although most of the news was on Dell’s new 14th generation servers and Dell-EMC integration progress, Wednesday’s keynote was devoted to storage and non-server infrastructure news.

There was plenty of non-server news but one item that caught my attention was new functionality from Dell EMC Data Protection Division that used Avamar and Data Domain to provide disaster recovery for VMware VMs directly to AWS.

Data Domain (AWS) Cloud DR

Dell EMC Data Domain Cloud DR (DDCDR) is a new capability that enables DD to back up to AWS S3 object storage and, when needed, restart the protected virtual machines within AWS.

DDCDR requires that a customer with Avamar backup and Data Domain (DD) storage install an OVA, which deploys an “add-on” to their on-prem Avamar/DD system, and install a lightweight VM (Cloud DR server) utility in their AWS domain.

Once the OVA is installed, it will read the changed data and will segment, encrypt, and compress the backup data and then send this and the backup metadata to AWS S3 objects. Avamar/DD policies can be established to control how many daily backup copies are to be saved to S3 object storage. There’s no need for Data Domain or Avamar to run in AWS.
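
As a rough illustration of the backup-side flow described above (segmenting the changed data, compressing and encrypting it, and writing the objects plus backup metadata to S3), here’s a hypothetical Python sketch. This is not Dell EMC’s code; the bucket layout, segment size and key handling are my assumptions, and I’ve applied compression before encryption since encrypting first would defeat compression.

```python
import json
import zlib
import boto3
from cryptography.fernet import Fernet

s3 = boto3.client("s3")
cipher = Fernet(Fernet.generate_key())   # real deployments would manage keys externally

def backup_changed_data(changed_data: bytes, bucket: str, backup_id: str,
                        segment_size: int = 4 * 1024 * 1024) -> None:
    """Segment, compress, encrypt and upload one backup's changed data to S3."""
    segments = [changed_data[i:i + segment_size]
                for i in range(0, len(changed_data), segment_size)]
    for n, seg in enumerate(segments):
        body = cipher.encrypt(zlib.compress(seg))       # compress, then encrypt
        s3.put_object(Bucket=bucket, Key=f"{backup_id}/seg-{n:06d}", Body=body)
    # Backup metadata (catalog) stored alongside the data objects
    metadata = {"backup_id": backup_id, "segments": len(segments)}
    s3.put_object(Bucket=bucket, Key=f"{backup_id}/metadata.json",
                  Body=json.dumps(metadata).encode())
```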

When there’s a problem at the primary data center, an admin can click a button in the Avamar GUI and have the Cloud DR server uncompress, decrypt, rehydrate and restore the backup data into EBS volumes, translate the VMware VM image into an AMI, and then restart that AMI on an AWS virtual server (EC2 instance) with its data on EBS volume storage. The Cloud DR server uses the backup metadata to select an AWS EC2 instance type with the proper CPU and RAM needed to run the application. Once this completes, the VM is running standalone in an AWS EC2 instance. Presumably, you have to have EC2 and EBS resources available under your AWS account to be able to restart the application and restore its data.
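
To give a feel for what that AWS-side restart involves, here’s a hypothetical boto3 sketch of the last two steps: registering an AMI from an EBS snapshot holding the translated VM image and launching it as an EC2 instance. This is not the DDCDR implementation; the snapshot ID, device names and instance sizing are assumptions for illustration.

```python
import boto3

ec2 = boto3.client("ec2")

def restart_vm_in_aws(root_snapshot_id: str, instance_type: str = "m4.large") -> str:
    """Register an AMI backed by a restored EBS snapshot and launch it on EC2."""
    ami = ec2.register_image(
        Name="cloud-dr-restored-vm",
        RootDeviceName="/dev/sda1",
        VirtualizationType="hvm",
        BlockDeviceMappings=[{
            "DeviceName": "/dev/sda1",
            "Ebs": {"SnapshotId": root_snapshot_id, "VolumeType": "gp2"},
        }],
    )
    # The instance type would really be chosen from backup metadata (CPU/RAM needs)
    run = ec2.run_instances(ImageId=ami["ImageId"], InstanceType=instance_type,
                            MinCount=1, MaxCount=1)
    return run["Instances"][0]["InstanceId"]
```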

For simplicity purposes, the user can control almost all of the required functionality for DDCDR from the Avamar GUI alone. But in case of a site outage, the user can initiate the application DR from a portal supplied by the Cloud DR server utility.

There you have it, simplified, easy to use (AWS) Cloud DR for your VM applications all through Dell EMC Avamar, Data Domain storage and DDCDR. At the moment, it only works with AWS cloud but it’s likely to be available for other public clouds in the near future.

~~~~

There was much more infrastructure news at Dell EMC World 2017. I’ll discuss more details on their new storage offerings in my upcoming Storage Intelligence newsletter, due out at the end of this month. If you’re interested in receiving your own copy of my newsletter, check out the signup button in the upper right of this page.

Comments?

[Edits were made for readability and technical accuracy after this post was published. Ed]

Know Fortran, optimize NASA code, make money

Read a number of articles this past week about NASA offering a Fortran optimization contest, the High Performance Fast Computing Challenge (HPFCC), for their computational fluid dynamics (CFD) program. They want to speed up CFD by 10X to 1000X and are willing to pay for it.

The contest is being run through HeroX and TopCoder and they are offering $55K, across the various levels of the contests to the winners.

The FUN3D CFD code (manual) runs on NASA’s Pleiades Linux supercomputer complex, which sports over 245K cores. Even when running on the supercomputer complex, a typical FUN3D CFD run takes thousands to millions of core hours!

The program(s)

FUN3D does a hypersonic fluid analysis over a (fixed) surface which includes a “simulation of mixtures of thermally perfect gases in thermo-chemical equilibrium and non-equilibrium. The routines in PHYSICS_DEPS enable coupling of the new gas modules to the existing FUN3D infrastructure. These algorithms also address challenges in simulation of shocks and boundary layers on tetrahedral grids in hypersonic flows.”

Not sure what all that means, but I am certain there’s a number of iterations across multiple Fortran modules, and it does this over a 3D grid of points corresponding both to the surface being modeled and to the gas mixture it’s running through at hypersonic speeds. Sounds easy enough.

The contest(s)

There are two levels to the contest: an Ideation phase (at HeroX) and an Architecture phase (at TopCoder). The $55,000 is split between the HeroX Ideation phase, which awards a total of $20K ($10K to the winner plus two $5K runner-up prizes), and the TopCoder Architecture phase, which awards a total of $35K ($15K to the winner, $10K for 2nd place and another $10K for a “Qualified improvement candidate”).

The (HeroX) Ideation phase looks for specific new or faster algorithms that could replace current ones in FUN3D which include “exploiting algorithmic developments in such areas as grid adaptation, higher-order methods and efficient solution techniques for high performance computing hardware.”

The (TopCoder) Architecture phase looks at specifically speeding up actual FUN3D code execution. “Ideal submission(s) may include algorithm optimization of the existing code base,  Inter-node dispatch optimization or a combination of the two.  Unlike the Ideation challenge, which is highly strategic, this challenge focuses on measurable improvements of the existing FUN3d suite and is highly tactical.”

Sounds to me like the Ideation phase is selecting algorithm designs and the Architecture phase is implementing the new algorithms or just generally speeding up FUN3D code execution.

The equation(s)

There’s a Navier-Stokes equation algorithm that gets called maybe a trillion times during a run, until the flow settles down, and any minor improvement here would obviously be significant. Perhaps there are algorithmic changes that can be made, if you’re an aeronautical engineer, or perhaps there are compiler speedups to be found, if you’re a Fortran expert. Both approaches can be validated/debugged/proved out on a desktop computer.

You have to be a US citizen to access the code, and you can apply here. You will receive an email to verify your email address, and then, once you’re validated and back on the website, you need to approve the software use agreement. NASA will verify your physical address by sending a letter with a passcode you use to finally access the code. The process may take weeks to complete, so if you’re interested in the contest, best to start now.

The Fortran(s)

I learned Fortran 66 a long time ago and may have dabbled with Fortran 77, but that’s the last time I touched Fortran. But it’s like riding a bike, once you do it, it’s easy to do it again.

As I understand it, FUN3D uses Fortran 2003, and NASA suggests you use the GNU Fortran (gfortran) compiler as the Intel one has some bugs in it. There appears to be a Fortran 2015 standard in the works, but it’s not in mainstream use just yet.

A million core hours, just amazing. If you could save a millisecond out of the routine called a trillion times, you’d save 1 billion seconds, or ~280K core hours.
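
For what it’s worth, the arithmetic checks out. Here’s the back-of-the-envelope calculation (the one-millisecond saving and trillion-call count are just the illustrative figures from above):

```python
calls = 1e12             # routine invoked ~a trillion times per run
saving_per_call = 1e-3   # shave one millisecond per call
seconds_saved = calls * saving_per_call     # = 1e9 seconds
core_hours_saved = seconds_saved / 3600     # ~278,000 core hours
print(f"{seconds_saved:.0e} s saved, ~{core_hours_saved:,.0f} core hours")
```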

Coders start your engines…

 

Quantum computing at our doorsteps

Read an article the other day in MIT’s Technology Review, Google’s new chip is a stepping stone to quantum computing…, about Google’s latest endeavor to create quantum computers. Although digital logic, or classical electronic computation, has been around since the middle of the last century, quantum logic does things differently, and there are many problems that are much faster to compute with quantum computing than with digital computing.

Qubits are weird

Classical or digital electronic computation follows the more physical mechanistic view of the world (for the most part) and quantum computing follows the quantum mechanical view of the world. Quantum computing uses quantum bits or Qubits and the device that Google demonstrated has a 2X3 matrix of qubits, 6 in total.

Unlike a bit, which (theoretically) is a two-state system that can only take on the values 0 and 1, a qubit is a two-level system that can in reality take on infinitely many different states. In practice, with a qubit there are always two states that are distinguishable from one another, but they can be any two of the infinitely many states it can take on.

Also, reading out the state of a qubit is a probabilistic endeavor, and the act of measurement changes the “value” the qubit holds afterwards.
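
A toy numpy sketch of those two points, assuming the usual two-level description (this is my own illustration, not anything from the article): a qubit’s state is a superposition of its two basis states, readout yields 0 or 1 with probabilities set by the amplitudes, and afterwards the state has collapsed to whichever value was observed.

```python
import numpy as np

rng = np.random.default_rng(42)

# |psi> = alpha|0> + beta|1>, with |alpha|^2 + |beta|^2 = 1
alpha, beta = np.sqrt(0.8), np.sqrt(0.2)

def measure(alpha, beta):
    """Probabilistic readout: returns the observed value and the collapsed state."""
    outcome = rng.choice([0, 1], p=[abs(alpha) ** 2, abs(beta) ** 2])
    collapsed = (1.0, 0.0) if outcome == 0 else (0.0, 1.0)
    return outcome, collapsed

outcomes = [measure(alpha, beta)[0] for _ in range(10_000)]
print("fraction measured as 1:", sum(outcomes) / len(outcomes))  # ~0.2
```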

There’s more to quantum computing and I am certainly no expert. So if you’re interested, I suggest starting with this Arxiv article.

Faster quantum algorithms

In any case some difficult and time consuming arenas of classical computation seem to be easier and faster with quantum computation. For example,

  • Factoring large numbers – in classical computation, where “B” is the number of bits in the “large number”, no known algorithm runs in time polynomial in B; the best known algorithms are super-polynomial (sub-exponential, but still far slower than polynomial). But Shor’s quantum factorization algorithm takes only O(B**3) time, which is considerably faster for large numbers. This is important because RSA cryptography and most key exchange algorithms in use today base their security on the difficulty of factoring large numbers. (See the Wikipedia article on Integer Factorization for more information.)
  • Searching an unstructured list – in classical computation, for a list of N items it takes O(N) time. But Grover’s quantum search algorithm only takes O(√N), which is considerably faster for large lists. (See this Arxiv paper for more information; a rough operation-count comparison follows this list.)
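
For a feel for those asymptotic claims, here’s a small back-of-the-envelope script (my own illustration, not a quantum simulation) comparing operation counts: linear search vs. Grover’s ~√N queries, and Shor’s roughly B³ scaling for B-bit numbers.

```python
import math

def classical_search_queries(n: int) -> int:
    """Unstructured search: worst case examines every item, O(N)."""
    return n

def grover_queries(n: int) -> int:
    """Grover's algorithm needs on the order of sqrt(N) oracle queries."""
    return math.isqrt(n) + 1

def shor_scaling(bits: int) -> int:
    """Shor's factoring runs in roughly O(B**3) quantum operations."""
    return bits ** 3

for n in (10**6, 10**9, 10**12):
    print(f"N={n:>15,}  classical~{classical_search_queries(n):>15,}  grover~{grover_queries(n):>9,}")

for b in (1024, 2048, 4096):   # typical RSA modulus sizes in bits
    print(f"B={b}: Shor ~B^3 = {shor_scaling(b):,} (vs. super-polynomial classically)")
```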

Using Shor’s factorization algorithm, researchers have been able to factor the number 15 with just 7 qubits.

There are many quantum algorithms available today (see the Quantum Algorithm Zoo at NIST) with more showing up all the time.  Suffice it to say that quantum computing will be a more time efficient and thus, more effective approach to certain problems than classical computing.

Quantum computers starting to scale

Now back to the chip. According to the article, the new Google chip implements a 2×3 matrix of qubits.

For those old enough to remember, a 3-bit number was called an octal digit, ranging from 0 to 7, and two octal digits can range from 0 to 63. Octals were used for a long time to represent digital information on some computers (mostly mini-computers). This is in contrast to most computing nowadays, which uses hexadecimal digits, or 4-bit numbers ranging from 0 to 15, with two hexadecimal digits ranging from 0 to 255.

Why are octals important? Well, if quantum computing can scale up to multiple octal numbers, then it can start representing really large numbers. According to the article, Google chose the 2×3 qubit structure because it’s easier to scale.

I assume all the piping surrounding the chip package in the above photo are cooling ports. It seems that quantum computing only works at very cold temperatures. And if this is a two-octal computer, scaling it up to multiple octals is going to take lots of space.

How quickly will it scale?

For some history, Intel introduced their 4004 (4-bit) computing chip in 1971 (Wikipedia), their 8-bit Intel 8008 in 1972 (Wikipedia), and their 16-bit Intel 8086 between 1976 and 1978. So in 7 years we went from a 4-bit computer to a 16-bit computer whose (x86) architecture continues on today and rules the world.

Now the Intel 4004 had 16 4-bit registers, a data/instruction bus that could address 4096 words, a 3-level subroutine stack, and was a full fledged 4-bit computer. It’s unclear what’s in Google’s chip. But if this 2×3-qubit computer grows into something with multiple 2×3 qubit registers, a qubit storage bus, a multi-level qubit subroutine (register) stack, etc., then we are well on our way to quantum computing being added to the world’s computational capabilities in less than 10 years.

And of course, Google’s not the only large organization working on quantum computing.

~~~~

So there you have it, Google and others are in the process of making your cryptography obsolete, rapidly speeding up unstructured searching and doing multiple other computations lots faster than today.

Photo Credit(s): from the MIT Technical Review article.

 

Crowdsourced vision for visually impaired

Read an article the other day in Christian Science Monitor (CSM) on the Be My Eyes App. The app is from BeMyEyes.com and is available for the iPhone and Android smart phones.

Essentially there are two groups of people that use the app:

  • Visually helpful volunteers – these people sign up for the app, and when a visually impaired person needs help, they provide visual aid by speaking to the person on the other end.
  • Visually impaired individuals – these people sign up for the app, and when they are having problems understanding what they are (or are not) looking at, they can turn on their camera, take video with their phone, and have it sent to a volunteer. They can then ask the volunteer for help in deciding what they are looking at.

So, the visually impaired ask questions about the scenes they are shooting with their phone camera and volunteers will provide an answer.

It’s easy to register as Sighted and I assume Blind. I downloaded the app, registered and tried a test call in minutes. You have to enable notifications, microphone access and camera access on your phone to use the app. The camera access is required to display the scene/video on your phone.

According to the app there are 492K sighted individuals, 34.1K blind individuals and they have been helped 214K times.

Sounds like an easy way to help the world.

There was no request to identify a language to use, so it may only work for English speakers. And there was no way to disable/enable it for a period of time when you don’t want to be disturbed. But maybe you would just close the app.

But other than that it was simple to use and seemed effective.

Now if there were only an app that would provide the same service for the hearing impaired to supply captions or a “filtered” audio feed to ear buds.

The world need more apps like this…

Comments?

There’s a new cluster filesystem on the block, Elastifile

At SFD12 last month we talked with the team from Elastifile. They are a new startup out of Israel working on a better cluster file system.

Elastifile was designed to support 1000s of nodes, 100,000s of users/clients, and 1000s of data containers (file systems/mount points), together with an effectively unlimited (64-bit) number of files and directories and up to Exabytes (10**18 bytes) in capacity. They also offer a 100% SSD file store capability. I encourage you to view the videos of their presentations at SFD12 to learn more.

Elastifile features

Elastifile supports data compression and optionally deduplication with NAND/Flash (e. g., low-/high-endurance) storage tiering, cloud storage tiering and multi-site storage. They also provide NFSv3/v4, SMB, AWS S3 and HDFS as native access protocols for their file storage.

They also offer non-disruptive hardware/software upgrades, n-way (2- or 3-way) data and metadata redundancy, self-healing capabilities, snapshots, and synchronous/asynchronous data replication or mirroring. Further, they provide multi-tenancy and QoS support.

Elastifile can be used in hyper converged mode as well as in a dedicated storage server mode. For backend storage, they support heterogeneous, physical (block, I think?) storage systems as well as direct access storage in cluster nodes.

Internals matter

Elastifile’s architecture supports accessor, owner and data nodes. But these can all be colocated on the same server or segregated across different servers.

Owner nodes own all the metadata objects for a file or directory and cache the metadata working set in their memory. Ownership of file or directory metadata may change in the case of hardware failures.

Elastifile supports a dynamic write data path, which means they determine, in real time, where to write file data rather than having the data locations identified before hand. They call this distributed write anywhere semantics.

Notably, they don’t do data caching (with NVMe it doesn’t make sense); however, as noted above, they do use metadata caching.

Internally, Elastifile uses variable length objects for both file data and metadata.

  • File data is composed of three object types: a file metadata (FileMD) object, mapping data objects, and file data objects. FileMDs hold the normal file metadata (name, file size, create, access & modify timestamps, etc.) as well as pointers to all of the file’s mapping object IDs (OIDs). A mapping object exists for each 0.5MB of file data and consists of a 128-element table, each element mapping 4KB of file address space to a data object (OID). Each data object holds 4KB of compressed file data plus journal log entries.
  • Directory metadata is composed of a directory metadata (DirMD) object and directory listing objects. Directory listing objects map file/directory names to FileMD or DirMD OIDs. Directory listing objects are accessed via an extensible hash table and contain a list of the filenames/directory names within the directory. (A hypothetical sketch of this layout follows the list.)
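
Here’s a rough Python sketch of how that object layout might look, based purely on the description above; the names and field choices are my assumptions, not Elastifile’s actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

OID = int  # object identifier

@dataclass
class DataObject:
    oid: OID
    compressed_data: bytes                      # up to 4KB of compressed file data
    journal_entries: List[bytes] = field(default_factory=list)

@dataclass
class MappingObject:
    oid: OID
    # 128 entries x 4KB = 0.5MB of file address space per mapping object
    extent_table: List[Optional[OID]] = field(default_factory=lambda: [None] * 128)

@dataclass
class FileMD:
    oid: OID
    name: str
    size: int
    create_time: float
    access_time: float
    modify_time: float
    mapping_oids: List[OID] = field(default_factory=list)   # one per 0.5MB of data

@dataclass
class DirMD:
    oid: OID
    name: str
    listing: Dict[str, OID] = field(default_factory=dict)   # name -> FileMD/DirMD OID
```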

The Elastifile software architecture consists of three layers:

  • A protocol layer which terminates file system access protocols and translates requests into internal requests. The hashing and data compression of file data occur at this level.
  • A metadata layer which provides file system/directory name mapping to objects for owned files/directories and maintains file/directory metadata updates/journals/checkpoints.
  • A data layer which provides transaction consistency and n-way redundant persistent data storage for (file or metadata) objects.

Metadata operations are persisted via journaled transactions, which are distributed across the cluster. For instance, the journal entries for mapping data object updates are written to the same file data object (OID) as the actual file data, the 4KB compressed data object.

There’s plenty of discussion on how they manage consistency for their metadata across cluster nodes. Elastifile invented and uses Bizur, a key-value, consensus based DB. Their chief architect Ezra Hoch (@EzraHoch) did a blog post and paper on Bizur for more information.

~~~~

New file systems generally take many years to mature and get out into the market, cluster file systems even longer. Elastifile, started in 2013 by some very smart engineers, is already on the market just 4 years later. That’s impressive enough, but with their list of advanced functionality plus cloud storage tiering and multi-site operations all shipping in the current product, it’s mind-blowing.

One lingering question is, does a market exist for another cluster file system? All flash is interesting, but most current CFSs do this and ship it today. Cloud storage tiering is interesting and a long term need, but some CFSs already have this and others are no doubt implementing it as we speak. A CFS’s use of objects for internal data and metadata management is not new; it may make the internals cleaner but doesn’t really provide a lot of customer benefit.

Exascale raw capacity, support for 100K users, 1000s of nodes, 1000s of file systems and an infinite # of files/directories is interesting. But most CFSs claim this level of support already, although this is more aspirational for some. And proving support at this scale is difficult, if not impossible.

On the other hand, Bizur is really neat. Its primary benefit is during recovery from hardware failures. For a CFS with 1000s of nodes, failures likely occur quite often. So Bizur’s advantage here may pay significant customer dividends.

Is that enough to market a new CFS?

To see what other SFD12 bloggers have written on Elastifile, please see:

AI’s Image recognition success feeds sound recognition improvements

I must do reCAPTCHA at least a dozen times a week for various websites I use. It’s become a real pain. And the fact that I know that what I am doing is helping some AI image recognition program do a better job of identifying street signs, mountains, or shop fronts doesn’t reduce my angst.

But that’s the thing with deep learning, machine learning, reinforcement learning, etc.: they all need massive amounts of annotated data providing a correct interpretation of a scene in order to train properly.

Computers to the rescue

So, when I read a recent article in MIT News that Computers learn to recognize sounds by watching video, I was intrigued. What the researchers at MIT have done is use advanced image recognition to annotate film clips with the names of things that are making sounds on the film. They then fed this automatically annotated data into a sound identifying algorithm to improve its recognition capability.
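
Here’s a hypothetical, toy-scale sketch of that cross-modal bootstrapping idea: a pretrained image recognizer pseudo-labels video frames, and those labels supervise a sound classifier trained on the corresponding audio. The models and data here are stand-ins for illustration, not the MIT researchers’ code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def image_model_predict(frame: np.ndarray) -> int:
    """Stand-in for a pretrained image recognizer returning a class label
    (e.g., 0 = bird, 1 = traffic) for a video frame."""
    return int(frame.mean() > 0.5)          # toy rule, purely for illustration

def audio_features(clip: np.ndarray) -> np.ndarray:
    """Stand-in for an audio feature extractor (e.g., spectrogram statistics)."""
    return np.array([clip.mean(), clip.std()])

rng = np.random.default_rng(0)
frames = rng.random((200, 64, 64))          # toy video frames
clips = rng.random((200, 16000))            # toy 1-second audio clips

# Step 1: the image model automatically annotates the video frames.
pseudo_labels = np.array([image_model_predict(f) for f in frames])

# Step 2: train a sound classifier on audio features plus those pseudo labels.
X = np.stack([audio_features(c) for c in clips])
sound_model = LogisticRegression().fit(X, pseudo_labels)
print("sound model trained on", len(X), "automatically annotated clips")
```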

They used this approach to train their sound recognition system to be able to identify natural and artificial sounds like bird song, speaking in crowds, traffic sounds, etc.

They tested their automatically trained sound recognition system against standard labeled sound sets, and it was able to categorize sounds with 92% accuracy on a 10-category data set and 74% accuracy on a 50-category dataset. Humans are able to categorize these sounds with 96% and 81% accuracy, respectively.

AI’s need for annotation

The problem with machine learning is that it needs a massive, properly annotated data set in order to learn properly. But getting annotated data takes too long or is too expensive to do for many things that we want AI for.

Using one AI tool to annotate data to train another AI tool is a sort of bootstrapping of AI technology. It’s a cute trick, but it may have only limited application. I could think of only a few more applications of similar technology:

  • Use chest strap or EKG technology to annotate audio clips of heart beat sounds at a wrist or other appendage to train a system to accurately determine pulse rates through sound alone.
  • Use wave monitoring technology to annotate pictures and audio clips of sea waves to train a system to accurately determine wave levels for better tsunami detection.
  • Use image recognition to annotate pictures of food and then use this to train a system to recognize food smells (if they ever find a way to record smells).

But there may be many others. Just further refinement of what they have used could lead to finer grained people detection. For example, as (facial) image recognition gets better, it’s possible to annotate speaking film clips to train a sound recognition system to identify people from just hearing their speech. Intelligence applications for such technology are significant.

Nonetheless, I for one am happy that the next reCAPTCHA won’t be having me identify river sounds in a matrix of 9 sound clips.

But I fear there’s enough GreyBeards on Storage podcast recordings and Storage Field Day video clips already available to train a system to identify Ray’s and for sure, Howard’s voice anywhere on the planet…

Comments?

Photo Credit(s): Wave by Matthew Potter; Waves crashing on Puget Sound by mikeskatie; Day 16: Podcasting by Laura Blankenship

The fragility of public cloud IT

I have been reading Antifragile again (by Nassim Taleb). And although he would probably disagree with my use of his concepts, it appears to me that IT is becoming more fragile, not less.

For example, recent outages at major public cloud providers display increased fragility for IT. Yet these problems, although almost national in scope, seldom deter individual organizations from their migration to the cloud.

Tragedy of the cloud commons

The issues are somewhat similar to the tragedy of the commons. When more and more entities use a common pool of resources, occasionally that common pool can become degraded. But because no one really owns the common resources, no one has any incentive to improve the situation.

Now the public cloud, although certainly a common pool of resources, is also most assuredly owned by corporations. So it’s not a true tragedy of the commons problem. Public cloud corporations have a real incentive to improve their services.

However, the fragility of IT in general, the web, and other electronic/data services all increases as they become more and more reliant on public cloud, common infrastructure. And I would propose this general IT fragility is really not owned by any one person, corporation or organization, let alone the public cloud providers.

Pre-cloud was less fragile, post-cloud more so

In the old days of last century, pre-cloud, if a human screwed up a CLI command, the worst that could happen was to take out a corporation’s data services. Nowadays, post-cloud, if a similar human screws up a CLI command, the worst that can happen is that major portions of a nation’s internet services go down.

Strange Clouds by michaelroper (cc) (from Flickr)

Yes, over time, public cloud services have become better at not causing outages, but outages aren’t going away. And if anything, better public cloud services just encourage more corporations to use them for more data services, causing any subsequent cloud outage to be more impactful, not less.

The Internet was originally designed by DARPA to be more resilient to failures, outages and nuclear attack. But by centralizing IT infrastructure onto public cloud common infrastructure, we are reversing the web’s inherent fault tolerance and causing IT to be more susceptible to failures.

What can be done?

There are certainly things that can be done to improve the situation and make IT less fragile in the short and long run:

  1. Use the cloud for non-essential or temporary data services, that don’t hurt a corporation, organization or nation when outages occur.
  2. Build in fault-tolerance, automatic switchover for public cloud data services to other regions/clouds.
  3. Physically partition public cloud infrastructure into more regions and physically separate infrastructure segments within regions, such that any one admin has limited control over an amount of public cloud infrastructure.
  4. Divide an organization’s or nation’s data services across public cloud infrastructures, across as many regions and segments as possible.
  5. Create a National Public IT Safety Board, not unlike the one for transportation, that does a formal post-mortem of every public cloud outage, proposes fixes, and enforces fix compliance.

The National Public IT Safety Board

The National Transportation Safety Board (NTSB) has worked well for air transportation. It relies on the cooperation of multiple equipment vendors, airlines, countries and other parties. It performs formal post mortems on any air transportation failure. It also enforces fixes in processes, procedures, training and any other activities on equipment vendors, maintenance services, pilots, airlines and other entities that can impact public air transport safety. At the moment, air transport is probably the safest form of transportation available, and much of this is due to the NTSB.

We need something similar for public (cloud) IT services. Yes most public cloud companies are doing this sort of work themselves in isolation, but we have a pressing need to accelerate this process across cloud vendors to improve public IT reliability even faster.

The public cloud is here to stay and, if anything, will become more encompassing, running more and more of the world’s IT. And as IoT, AI and automation become more pervasive, the data processes that support these services, which will no doubt run in the cloud, can impact public safety. Just think of what would happen in the future if an outage occurred in a major cloud provider running the backend for self-guided car algorithms during rush hour.

If the public cloud is to remain (at this point almost inevitable), then the safety and continuous functioning of this infrastructure becomes a public concern. As such, having a National Public IT Safety Board seems like the only way to have some entity own IT’s increased fragility due to public cloud infrastructure consolidation.

~~~~

In the meantime, as corporations, government and other entities contemplate migrating data services to the cloud, they should consider the broader impact they are having on the reliability of public IT. When public cloud outages occur, all organizations suffer from the reduced public perception of IT service reliability.

Photo Credits: Fragile by Bart Everson; Fragile Planet by Dave Ginsberg; Strange Clouds by Michael Roper