The fragility of public cloud IT

I have been reading Antifragile (by Nassim Taleb) again. And although he would probably disagree with my use of his concepts, it appears to me that IT is becoming more fragile, not less.

For example, recent outages at major public cloud providers show just how fragile IT has become. Yet these problems, although almost national in scope, seldom deter individual organizations from migrating to the cloud.

Tragedy of the cloud commons

The issues are somewhat similar to the tragedy of the commons. When more and more entities use a common pool of resources, occasionally that common pool can become degraded. But because no one really owns the common resources, no one has any incentive to improve the situation.

Now the public cloud, although certainly a common pool of resources, is also most assuredly owned by corporations. So it’s not a true tragedy of the commons problem. Public cloud corporations have a real incentive to improve their services.

However, the fragility of IT in general, the web, and other electronic/data services all increases as they become more and more reliant on common public cloud infrastructure. And I would propose that this general IT fragility is really not owned by any one person, corporation or organization, let alone the public cloud providers.

Pre-cloud was less fragile, post-cloud more so

In the old days of last century, pre-cloud, if a human screwed up a CLI command, the worst that could happen was to take out a corporation’s data services. Nowadays, post-cloud, if a similar human screws up a CLI command, the worst that can happen is that major portions of a nation’s internet services go down.


Yes, over time, public cloud services have become better at not causing outages, but outages aren’t going away. And if anything, better public cloud services just encourage more corporations to use them for more data services, making any subsequent cloud outage more impactful, not less.

The Internet was originally designed by DARPA to be more resilient to failures, outages and nuclear attack. But by centralizing IT onto common public cloud infrastructure, we are reversing the web’s inherent fault tolerance and making IT more susceptible to failures.

What can be done?

There are certainly things that can be done to improve the situation and make IT less fragile in the short and long run:

  1. Use the cloud for non-essential or temporary data services that don’t hurt a corporation, organization or nation when outages occur.
  2. Build in fault tolerance and automatic switchover of public cloud data services to other regions/clouds (see the sketch after this list).
  3. Physically partition public cloud infrastructure into more regions, and physically separate infrastructure segments within regions, such that any one admin controls only a limited amount of public cloud infrastructure.
  4. Divide an organization’s or nation’s data services across public cloud infrastructures, spreading them over as many regions and segments as possible.
  5. Create a National Public IT Safety Board, not unlike the one for transportation, that does a formal post-mortem of every public cloud outage, proposes fixes, and enforces fix compliance.
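
As a minimal sketch of item 2, here is what region-level health checking and switchover could look like in Python. The endpoints, the health-check path and the final DNS/load-balancer repointing step are all hypothetical placeholders, not any particular cloud provider’s API.

```python
# Minimal sketch: probe a primary cloud region and fail over to a secondary
# region/cloud when it stops answering. Endpoints are made-up placeholders.
import time
import urllib.request

REGIONS = [
    "https://primary.example.com/health",    # primary region (hypothetical)
    "https://secondary.example.org/health",  # different region or different cloud
]

def healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_active_region() -> str:
    """Walk the region list in priority order and return the first healthy one."""
    for url in REGIONS:
        if healthy(url):
            return url
    raise RuntimeError("no healthy region found")

if __name__ == "__main__":
    while True:
        active = pick_active_region()
        # A real deployment would repoint DNS or a load balancer at `active`;
        # here we just report it.
        print("serving from:", active)
        time.sleep(30)
```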

The National Public IT Safety Board

The National Transportation Safety Board (NTSB) has worked well for air transportation. It relies on the cooperation of multiple equipment vendors, airlines, countries and other parties. It performs formal post-mortems on any air transportation failure. It also enforces any fixes in processes, procedures, training and any other activities on equipment vendors, maintenance services, pilots, airlines and other entities that can impact public air transport safety. At the moment, air transport is probably the safest form of transportation available, and much of this is due to the NTSB.

We need something similar for public (cloud) IT services. Yes, most public cloud companies are doing this sort of work themselves, in isolation, but there is a pressing need to accelerate this process across cloud vendors to improve public IT reliability even faster.

The public cloud is here to stay and, if anything, will become more encompassing, running more and more of the world’s IT. And as IoT, AI and automation become more pervasive, the data processes that support these services, which will no doubt run in the cloud, can impact public safety. Just think of what would happen if, in the future, an outage occurred at a major cloud provider running the backend for self-guided car algorithms during rush hour.

If the public cloud is to remain (and at this point that seems almost inevitable), then the safety and continuous functioning of this infrastructure becomes a public concern. As such, having a National Public IT Safety Board seems like the only way to have some entity own IT’s increased fragility due to public cloud infrastructure consolidation.

~~~~

In the meantime, as corporations, governments and other entities contemplate migrating data services to the cloud, they should consider the broader impact they are having on the reliability of public IT. When public cloud outages occur, all organizations suffer from the reduced public perception of IT service reliability.

Photo Credits: Fragile by Bart Everson; Fragile Planet by Dave Ginsberg; Strange Clouds by Michael Roper

Flash’s only at 5% of data storage

We have been hearing for years that NAND flash is at price parity with disk. But at this week’s Flash Memory Summit, Darren Thomas, VP of Micron’s Storage BU, said in his keynote that NAND only stores 5% of the bits in a data center.

Darren’s session was all about how to get flash to become more than 5% of data storage, which he called “crossing the chasm”. I assume the 5% is measured against yearly data storage shipped.

Flash’s adoption rate

Darren said that last year flash climbed from 4% to 5% of data center storage, but he made no mention of whether flash’s adoption was accelerating. According to another of Darren’s charts, flash is expected to ship ~77B Gb of storage in 2015 and should grow to ~240B Gb by 2019.

If the ratio of flash bits shipped to data centers (vs. all flash bits shipped) holds constant, then flash should be ~15% of data storage by 2019. But this assumes data storage doesn’t grow. If we assume a 10% Y/Y CAGR for data storage, then flash would represent ~9% of overall data storage.

Data growth at 10% could be conservative. A 2012 EE Times article said the 2010-2015 data growth CAGR would be 32%, and IDC’s 2012 digital universe report said that between 2012 and 2020, data will double every two years, a ~44% CAGR. But both numbers could be talking about the world’s data growth, not just data center storage.
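
For what it’s worth, here is the rough arithmetic behind those projections as a small Python sketch. The baseline share, flash shipment figures and growth rates are the ones quoted above; the exact percentages shift a point or two depending on which baseline year you assume.

```python
# Rough sketch of the projection arithmetic: scale flash's ~5% share by the
# expected growth in flash bits shipped (~77B Gb in 2015 to ~240B Gb in 2019),
# then discount by however fast total data storage grows over the same period.
flash_share_2015 = 0.05          # flash's share of data center storage today
flash_growth = 240e9 / 77e9      # ~3.1x more flash bits shipped by 2019
years = 4                        # 2015 -> 2019

for storage_cagr in (0.00, 0.10, 0.32, 0.44):
    total_growth = (1 + storage_cagr) ** years
    share_2019 = flash_share_2015 * flash_growth / total_growth
    print(f"storage CAGR {storage_cagr:>4.0%}: flash share ~{share_2019:.0%} in 2019")
```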

How to cross this chasm?

Geoffrey Moore, author of Crossing the Chasm, came up on stage as Darren discussed what he thought it would take to go beyond early adopters (visionaries) to the early majority (pragmatists) and reach wider flash adoption in data center storage. (See the Wikipedia article for a summary of Crossing the Chasm.)

As one example of crossing the chasm, Darren talked about the electric light bulb. At its introduction it competed against candles, oil lamps, gas lamps, etc., but it was the most expensive lighting system at the time.

But when people realized that electric lights allowed you to do stuff at night and not just go to sleep, adoption took off. At that time, competitors to the electric bulb did provide lighting, it just wasn’t very good; in fact, most people went to bed to sleep at night because the light then available was so poor.

However, the electric bulb’s higher performing lighting solution opened up the night to other activities.

What needs to change in NAND flash marketing?

From Darren’s perspective, the problem with flash today is that marketing and sales of flash storage are all about speeds, feeds and relative pricing against disk storage. But what’s needed is a discussion of the disruptive benefits of flash/NAND storage that are impossible to achieve with disk today.

So what are the disruptive benefits of NAND/flash storage, unrealizable with disk today?

  1. Real time analytics and other RT applications;
  2. More responsive mobile and data center applications;
  3. Greener, quieter, and potentially denser data centers;
  4. Storage for mobile, IoT and other ruggedized application environments.

Only the first three above apply to data centers. And none seem as significant as opening up the night, but maybe I am missing a few.

Also, the Wikipedia article cited above states that a Crossing the Chasm approach works best for disruptive or discontinuous innovations, and that more continuous innovations (ones that don’t cause significant behavioral change) do better with Everett Rogers’ standard diffusion of innovations approach (see the Wikipedia article for more).

So is NAND flash a disruptive or continuous innovation?  Darren seems firmly in the disruptive camp today.

Comments?

Photo Credit(s): 20-nanometer NAND flash chip, IntelFreePress’ photostream

Another Y2K-like problem, this time with Internet routers

Read an article today in Wired, The Internet has grown too big for its aging infrastructure, about a serious problem that’s soon to become more widespread.

This Y2K-like problem is associated with Border Gateway Protocol (BGP) routing table entries, which represent IP address prefixes. Internet routers keep BGP tables in Ternary Content Addressable Memory (TCAM, sort of like a virtual memory page table, only for router addresses), and there are physical limits to how many BGP entries will fit into any specific Internet router. Some routers crash when they exceed their TCAM limit and others just ignore the BGP entries that exceed their limits; neither approach seems workable long term.

Apparently we are approaching one of those hard and fast limits, at least for older routers, as the BGP routing tables reach over 512K entries. As of May 2014, there were in excess of 500,000 BGP prefixes (table entries).

Smoking gun points to …

It appears that this time Verizon was the perpetrator. Yesterday they added 15K BGP entries to the Internet BGP table, kicking some routers over their 512K limit. This was no doubt in anticipation of some growth in Internet addresses on their networks.

The result was that LiquidWeb’s network went down. Supposedly they have an older Cisco 7600 router and the latest addition to BGP entries exceeded its TCAM capacity, crashing their router. Oops!

Verizon quickly withdrew the offending 15K BGP entries and things seem back to normal for the moment. But we are once again close to some arbitrary computerized limit. Only this problem won’t happen at midnight on December 31st. It won’t take long to exceed the current BGP entry limits again, and next time it might not be that easy to back out.

But it’s almost like there’s no stopping it…

Just guessing here, but these types of routers probably have similar limits for BGP entries at 1024K, 2048K, 4096K entries, etc. With the number of internet-connected devices growing exponentially, especially with the Internet of Things, I predict similar problems over the coming years. Indeed, we went from ~400K to ~500K BGP entries in just under two years, and the rate of growth seems to be accelerating.
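
A rough extrapolation of that growth, as a sketch: it assumes the table merely keeps growing at the recent ~12% per year pace, so if growth is accelerating, as it appears to be, these estimates are on the optimistic side.

```python
# Back-of-the-envelope extrapolation: ~400K to ~500K BGP prefixes in roughly
# two years, continued at the same compound rate, against likely TCAM limits.
import math

start, end, years = 400_000, 500_000, 2.0
annual_growth = (end / start) ** (1 / years) - 1     # ~11.8% per year

for limit_k in (1024, 2048, 4096):                   # next power-of-two limits (K entries)
    entries = limit_k * 1024
    years_to_hit = math.log(entries / end) / math.log(1 + annual_growth)
    print(f"{limit_k}K entries reached in ~{years_to_hit:.0f} years "
          f"at {annual_growth:.1%}/yr growth")
```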

It’s really just a matter of time before even today’s routers run out of TCAM slots. Y2K-like, only this time there’s no way to stop it from happening again and again in the future. I suppose it would be better if routers just ignored new BGP entries rather than crashing, but that would seem to put some segment of the Internet out of their reach. There’s got to be a way to intelligently ignore some updates, or summarize prefix updates, when a router runs out of TCAM entries.
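
As a toy illustration of that prefix-summarization idea, Python’s standard ipaddress module can collapse adjacent prefixes into fewer covering routes. The prefixes below are made-up documentation addresses, and a real router could only aggregate routes that share the same next hop and policy, but it shows how many table slots summarization can recover.

```python
# Toy illustration of prefix summarization: adjacent prefixes pointing at the
# same next hop can often be collapsed into fewer, shorter prefixes,
# reclaiming TCAM slots. Addresses are made-up documentation ranges.
import ipaddress

prefixes = [
    ipaddress.ip_network("203.0.113.0/26"),
    ipaddress.ip_network("203.0.113.64/26"),
    ipaddress.ip_network("203.0.113.128/25"),
    ipaddress.ip_network("198.51.100.0/25"),
    ipaddress.ip_network("198.51.100.128/25"),
]

summarized = list(ipaddress.collapse_addresses(prefixes))
print(f"{len(prefixes)} prefixes collapse to {len(summarized)}:")
for net in summarized:
    print(" ", net)   # 203.0.113.0/24 and 198.51.100.0/24
```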

Welcome to the new 512K problem.

~~~~

Comments?

Photo Credit(s): Cisco 7609 @ itb for INHERENT by Affan Basalamah


Extremely low power transistors open up new IoT applications

We have written before about the computational power efficiency law known as Koomey’s Law, which states that the number of computations one can do with the same amount of energy has been doubling every 1.57 years (for more info, please see my No power sensors surface … post).

The dawn of sub-threshold electronics

But just this week there was another article, this time about electronics that use much less power than normal transistors. Achieving this in Internet of Things (IoT) type sensors would take the computations/joule up by orders of magnitude, not just the ~1.6X per year of Koomey’s Law, although how long it will take to come out commercially is another issue.
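
To put that in perspective, here is a back-of-the-envelope sketch, assuming Koomey’s ~1.57-year doubling period continues to hold, of how long a two or three order-of-magnitude efficiency gain would otherwise take.

```python
# Back-of-the-envelope: how long Koomey's Law (computations per joule doubling
# roughly every 1.57 years) would take to deliver the kind of gain that
# sub-threshold designs promise in a single step.
import math

DOUBLING_PERIOD_YEARS = 1.57

for gain in (100, 1000):   # two to three orders of magnitude
    doublings = math.log2(gain)
    years = doublings * DOUBLING_PERIOD_YEARS
    print(f"{gain}x more computations/joule ≈ {doublings:.1f} doublings "
          f"≈ {years:.1f} years at the historical rate")
```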

This new technology is called sub-threshold transistors, and they use much less power than normal transistors. The article in MIT Technology Review, A batteryless sensor chip for the IoT, discusses the phenomenon that sub-threshold transistors exploit: normal transistors, even when they are technically in the "off" state, leak some amount of current. Until recently, this CMOS transistor parasitic leakage had been considered a current drain that couldn’t be eliminated and, as such, wasted energy.

Not so any longer. With the new sub-threshold transistor design paradigm, electronics can now take advantage of this "leakage" current to perform actual computations. And that opens up a whole new level of IoT sensors that could be deployed.

Prototype sub-threshold circuits coming out

One company, PsiKick, is using this phenomenon to design ASICs/chips that, depending on the application, use only 0.1 to 1% of the energy of similar-functioning chips, thanks to sub-threshold transistors plus extensive power-reduction design techniques. Their first prototype was a portable EKG that uses body heat to power itself with a thermo-electric generator rather than a battery. The prototype was just a proof of concept, but they seem to be at work trying to open the technology to broader applications.

One serious consideration limiting the types of sensors that could be deployed in IoT applications has been how to get power to these sensors. The other is how to get information out of the sensor and out to the real world. There are a few ways to attack the power issue for IoT sensors: creating more efficient electronics, more effective/longer-lasting batteries, and smaller electric generators. Sub-threshold transistor electronics is a major leap toward more efficient electronics.

In my previous post we discussed ways to construct smaller electronic generators used by low-power systems/chips. One approach highlighted in that paper used small antennas to extract power from ambient radio waves. But that’s not the only way to generate small amounts of power. I have also heard of piezoelectric generators that use force and movement (such as foot falls) to generate energy. And of course, small solar panels could do the same trick.

Any of these micro energy generators could be made to work, and together with the ability to design circuits that use 0.1 to 1% of the electricity used by normal circuits, this should just about eliminate any computational/power limits on the sorts of IoT sensors that could be deployed.

What about non-sensor/non-IoT electronics?

Not sure why, if this works for IoT sensors, it couldn’t be used for something more substantial like mobile/smart phones, desktop computers, enterprise servers, etc. To that end, it seems that ARM Holdings and IMEC are also looking at the technology.

Only a couple of years ago, everybody was up in arms about the energy consumption of server farms, especially on the west coast of the USA. But with this sort of sub-threshold transistor electronics coming online, maybe servers could run on ambient radio wave energy, data centers could run desktop computers and LED lighting off of thermo-electric generators inside their heat exchangers, and iPhones could run off of accelerometer piezoelectric generators using the motion a phone undergoes while sitting in the pocket of a moving person.

It almost gives the impression of perpetual motion machines, but rather than motion we are talking electronics; sort of like perpetual electronics…

So could a no-battery iPhone be in our future? I wouldn’t bet against it. Remember, the compute engine inside all iPhones is based on ARM technology.

Comments?

Photo credit(s): Intel Free Press: Joshua R. Smith holding a sensor