Why EMC is doing Project Lightning and Thunder

Picture of atmospheric lightning striking ground near a building at night
rayo 3 by El Garza (cc) (from Flickr)

Although Project Lightning and Project Thunder technically represent some interesting offshoots of EMC's software, hardware and system prowess, I wonder why they would decide to go after this particular market space.

There are plenty of alternative offerings in the PCIe NAND memory card space. Moreover, the PCIe card caching functionality, while interesting, is not that hard to replicate, and such software capability is not a serious barrier to entry for HP, IBM, NetApp and many, many others. And the margins cannot be that great.
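To give a sense of why host-side read caching is not much of a barrier to entry, here is a minimal sketch of a least-recently-used (LRU) read-through cache sitting in front of a slower backing store. It is a toy under stated assumptions, not how Project Lightning works: `ReadCache` and `backing_read` are hypothetical names, block ids stand in for disk addresses, and writes (which a real product would have to pass through to the array) are left out entirely.

```python
from collections import OrderedDict
from typing import Callable


class ReadCache:
    """Toy LRU read-through cache in front of a slower backing store.

    Hypothetical illustration only: `backing_read` stands in for a read
    from shared array storage; writes are not handled at all.
    """

    def __init__(self, capacity: int, backing_read: Callable[[int], bytes]):
        self.capacity = capacity
        self.backing_read = backing_read
        self._blocks: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id: int) -> bytes:
        if block_id in self._blocks:
            # Hit: serve from the fast tier and mark as most recently used.
            self._blocks.move_to_end(block_id)
            self.hits += 1
            return self._blocks[block_id]
        # Miss: fetch from the slow tier, then populate the cache.
        self.misses += 1
        data = self.backing_read(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self.capacity:
            # Evict the least recently used block.
            self._blocks.popitem(last=False)
        return data
```

With a two-block cache, reading blocks 1, 2, 1, 3, 2 yields one hit and four misses. The harder engineering in a shipping product, such as keeping the cache consistent with the array and handling card failures, is not shown here.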

So why get into this low margin business?

I can see a couple of reasons why EMC might want to do this.

  • Believing in the commoditization of storage performance. I have had this debate with a number of analysts over the years, but many still firmly believe that storage performance will become a commodity sooner rather than later. By entering the PCIe NAND card IO buffer space, EMC can create a beachhead in this movement that helps them build market awareness, higher manufacturing volumes, and support expertise. As such, when the inevitable happens and high margins for enterprise storage start to deteriorate, EMC will be able to capitalize on this hard-won operational effectiveness.
  • Moving up the IO stack. The journey from an application's IO request to the disk device that actually services it is a long one, with multiple places to make money along the way. Currently, EMC has a significant share of everything that happens after the fabric switch, whether it is FC, iSCSI, NFS or CIFS. What they don't have is a significant share of the switch infrastructure or anywhere on the other (host) side of that interface stack. Yes, they have Avamar, NetWorker, Documentum, and other software that helps manage, secure and protect IO activity, together with other significant investments in RSA and VMware. But these represent adjacent market spaces rather than primary IO stack endeavors. Lightning represents a hybrid software/hardware solution that moves EMC up the IO stack to inside the server. As such, it represents yet another opportunity to profit from all the IO going on in the data center.
  • Making big data more effective. The fact that Hadoop doesn't really need or use high-end storage has not been lost on most storage vendors. With Lightning, EMC has a storage enhancement offering that can readily improve Hadoop cluster processing. Something like Lightning's caching software could easily be tailored to enhance HDFS file access and thus speed up cluster processing. If Hadoop and big data are to be the next big consumer of storage, then speeding up cluster processing will certainly help, and profiting from doing so only makes sense.
  • Believing that SSDs will transform storage. To many of us, the age of disks is waning. SSDs, in some form or another, will be the underlying technology for the next age of storage. The density, performance and energy efficiency of current NAND-based SSD technology are commendable, and they will only get better over time. The capabilities brought about by such technology will certainly transform the storage industry as we know it, if they haven't already. But where SSD technology actually emerges is still being played out in the marketplace. Many believe that when industry transitions like this happen, it's best to be engaged everywhere change is likely, hoping that at least some of those bets succeed. PCIe SSD cards may not take over all server IO activity, but if they do, not being there, or being late, will certainly hurt a company's chances to profit from it.

There may be more reasons I missed here, but these seem to be the main ones. Of the above, I think the last one, that SSDs will rule the next transition, is the most important to EMC.

They have been successful during past industry transitions. If anything, their acquisitions show the same pattern: when they don't own a transition, they buy into it; witness Data Domain, RSA, and VMware. So I suspect the view inside EMC is that doubling down on SSDs will enable them to ride out the next storm and be in a profitable place for the next change, whatever that might be.

And following Lightning, Project Thunder

Similarly, Project Thunder seems to represent EMC doubling their bet on SSDs yet again. Just about every month I talk to another storage startup coming to market with another new take on storage, using every form of SSD imaginable.

However, Project Thunder as envisioned today is not storage, but rather some form of external shared memory.  I have heard this before, in the IBM mainframe space about 15-20 years ago.  At that time shared external memory was going to handle all mainframe IO processing and the only storage left was going to be bulk archive or migration storage – a big threat to the non-IBM mainframe storage vendors at the time.

One problem then was that the shared DRAM memory of the time was far more expensive than sophisticated disk storage, and its price wasn't coming down fast enough to counteract increased demand. The other problem was that making shared memory work with all the existing mainframe applications was not easy. IBM at least had control over the OS, the hardware and most of the larger applications at the time, yet they still struggled to make it usable and effective. There is probably a lesson here for EMC.

Fast forward 20 years, and NAND-based SSDs are the right hardware technology to make inexpensive shared memory happen. In addition, the roadmap for NAND and other SSD technologies looks poised to deliver the continued capacity increases and price reductions needed to compete effectively with disk in the long run.

However, the challenges then and now seem to have as much to do with the software that makes shared external memory universally effective as with the hardware technology to implement it. Providing a new storage tier in Linux, Windows and/or VMware is easier said than done. Most recent successes have been offshoots of SCSI (iSCSI, FCoE, etc.). Nevertheless, if it was good for mainframes then, it's certainly good for Linux, Windows and VMware today.

And that seems to be where Thunder is heading, I think.

Comments?

Correlated risk

Aerial view of damage to Wakuya, Japan following earthquake. by Official U.S. Navy... (cc) (from Flickr)

What’s the chance that

  • an earthquake at sea could knock out primary power and generate a tsunami that would also knock out the backup generators for a nuclear power plant's emergency cooling equipment (1 in 40 yrs),
  • an overextended speculative market segment would collapse and cause widespread ruin that would take down both equity and bond markets and force 100s of financial institutions to go under (1 in 77 yrs),
  • a hurricane would destroy flood barriers, which would then flood your home, your office and the place you store your backups (?)

All of these represent correlated risks that, prior to the actual event, were deemed very improbable. But high improbability doesn't mean it will never happen.

Correlated risk defined

A correlated risk is the risk of any subsequent disaster or event occurring after a primary event or catastrophe has occurred. In the case of natural disasters, any event generated as a consequence of an originating event is a correlated event and, as such, carries a correlated risk.

I once worked for a major company that kept its disaster recovery backups in an underground basement on the same campus as its headquarters. This seemed risky, as any event that took out the campus could potentially damage the basement as well, along with all the backup tapes stored there.

How to understand your correlated risk

It seems to me pretty straightforward to understand correlated risk within the framework of a business continuity or disaster recovery plan (BC or DR plan). In one column, list all possible primary accidents, calamities, disasters, etc., man-made or natural; in another column, list the other possible accidents, calamities, disasters, etc. that could be generated because of each primary event.

One then recurses on this process, generating all possible correlated events associated with the primary or previous correlated event, until all possible chains of catastrophes associated with the primary disaster are exhausted. Then, in a third column, list the potential scope (distance or area impacted) and outcome (what damage could be expected) of each event in the first two columns. In a fourth column, list your best-guess probability of the event and/or correlated event(s) occurring.
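The recursive column-building step can be sketched in a few lines. This is a minimal illustration under assumed inputs: `PRIMARY` holds a best-guess annual probability for each primary event, `TRIGGERS` lists the correlated events each event can set off with a conditional probability, and a chain's probability is taken as the product of the probabilities along it. All names and numbers are hypothetical, and the outcome column is omitted to keep it short.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical best-guess annual probabilities for primary events.
PRIMARY: Dict[str, float] = {"building fire": 0.01, "flood": 0.01}

# Hypothetical correlated events each event can trigger, with P(next | this).
TRIGGERS: Dict[str, List[Tuple[str, float]]] = {
    "building fire": [("wild fire", 0.05)],
    "wild fire": [("loss of power/transport/comms", 0.6)],
    "flood": [("loss of power/transport/comms", 0.8)],
}

# Hypothetical scope (area impacted) for each event.
SCOPE: Dict[str, str] = {
    "building fire": "office building",
    "wild fire": "office and bank, 5 miles apart",
    "flood": "100-year flood plain",
    "loss of power/transport/comms": "surrounding region",
}


@dataclass
class Row:
    chain: Tuple[str, ...]  # primary event first, then each correlated event
    scope: str              # scope of the last event in the chain
    probability: float      # product of probabilities along the chain


def expand(chain: Tuple[str, ...], probability: float, rows: List[Row]) -> None:
    """Record this chain, then recurse into every event it can trigger."""
    last = chain[-1]
    rows.append(Row(chain, SCOPE.get(last, "unknown"), probability))
    for next_event, p_given in TRIGGERS.get(last, []):
        if next_event in chain:  # skip cycles so the recursion terminates
            continue
        expand(chain + (next_event,), probability * p_given, rows)


def build_table() -> List[Row]:
    rows: List[Row] = []
    for event, p in PRIMARY.items():
        expand((event,), p, rows)
    return rows
```

Multiplying conditional probabilities along a chain is a simplifying assumption; if a later event depends on more than the one just before it, the conditional estimates would have to reflect that.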

In the end, you should have an exhaustive list of things to prepare for. Now rank the events in probability order and tackle them from highest to lowest probability. Everyone reaches some cutoff point, depending on their risk tolerance, beyond which dealing with the multiple disasters that could potentially occur becomes too costly. For instance, a nuclear plant probably needs to work much further down that list than your average corporate environment.
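Ranking and the cutoff could look like the following sketch. It assumes, hypothetically, that each event comes with a best-guess probability and an estimated mitigation cost, and it treats risk tolerance as a dollar budget; events are tackled from most to least probable until the next one no longer fits. `plan` is a hypothetical helper name.

```python
from typing import List, Tuple

Event = Tuple[str, float, float]  # (name, best-guess probability, mitigation cost in $)


def plan(events: List[Event], budget: float) -> Tuple[List[str], List[str]]:
    """Split events into (covered, beyond_cutoff).

    Events are tackled from highest to lowest probability; the cutoff is the
    first event whose mitigation cost no longer fits in the remaining budget.
    """
    ranked = sorted(events, key=lambda e: e[1], reverse=True)
    covered: List[str] = []
    spent = 0.0
    for i, (name, _probability, cost) in enumerate(ranked):
        if spent + cost > budget:
            return covered, [n for n, _, _ in ranked[i:]]
        covered.append(name)
        spent += cost
    return covered, []
```

For example, with building fire (0.02, $5K), flood (0.01, $20K) and wild fire (0.005, $40K) against a $50K budget, the first two are covered and wild fire is where the cutoff falls.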

With that in place you have a start on a BC and/or DR plan.  Now all you need to determine is your risk tolerance level and how to handle primary and correlated risks that fall within that level.

A correlated risk analysis

Take Silverton Consulting as an example. I take daily incremental backups stored on a local hard disk; weekly “partial backups” (critical business files only) to removable media, also stored locally in the office; and monthly full backups stored in a safety deposit box in a vault in the basement of a bank within five miles of the office.

If I just look at natural events:

  • My first and most likely natural event is building fire – in this case, the scope of the event would be limited to the building, which would take out both the local hard disk incrementals and the weekly partial backups, but the monthly fulls in the safety deposit box would still be accessible.
  • A possible correlated event, as well as another primary event, could be wild fire – in this case, both the office and the bank could potentially be consumed and all backups would be lost. The fact that the bank is 5 miles away, has its own fire suppression system, and keeps my backups in its basement only reduces the probability of a wild fire impacting both locations; it doesn't eliminate it.
  • Another possible correlated event to any wild fire would be loss of power, transport, and communication services – because the bank is only 5 miles away, if the primary office loses these services, it's highly probable that the bank would lose them as well. Under these circumstances, access to the bank vault backups would be delayed, at best, until such services could be restored. Had I been using a cloud backup service (which I am considering), I couldn't access my data until communication services were restored or until I had moved far enough away to regain access to them. With the roads and other transport out, this would take some time.
  • The next most likely natural event is flood. Our location is within a 100-year flood plain, so a serious flood that would take out the office has roughly a 1% chance of occurring in any given year. I would like to say that our bank is outside our flood plain, but I just don't know yet. I promise to find out.
  • A correlated event to a flood is loss of power, transport and communication services. The scope and consequences of this catastrophe are similar to those discussed above.
  • The next most likely natural event is tornado, …
  • The next most likely natural event is earthquake, …
  • The next most likely natural event is volcanic eruption, …

… and the list goes on. Of course, these are just natural disasters; one would also need to consider man-made catastrophes.
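The physical-loss side of this analysis boils down to checking which backup copies sit outside each event's scope. Here is a minimal sketch using the three backup tiers described above and a hypothetical worst-case scope per event. It models only whether a copy physically survives, not the access delays caused by losing power, transport or communications; `surviving` and the dictionary names are hypothetical.

```python
from typing import Dict, List, Set

# Where each backup copy lives, per the Silverton Consulting example.
BACKUPS: Dict[str, str] = {
    "daily incrementals (local disk)": "office",
    "weekly partials (removable media)": "office",
    "monthly fulls (safety deposit box)": "bank vault",
}

# Hypothetical worst-case scope of each event: the set of locations it reaches.
EVENT_SCOPE: Dict[str, Set[str]] = {
    "building fire": {"office"},
    "wild fire": {"office", "bank vault"},
}


def surviving(event: str) -> List[str]:
    """Backup copies stored outside the event's scope (physical loss only)."""
    impacted = EVENT_SCOPE[event]
    return [name for name, location in BACKUPS.items() if location not in impacted]
```

`surviving("building fire")` leaves only the monthly fulls, while `surviving("wild fire")` leaves nothing, which is the correlated-risk exposure described in the list above.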

In any event, all of these have a distinct, non-zero probability. One can estimate the probability of such primary and correlated events through research and/or other means.
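One simple calculation for the flood case: a 100-year flood plain implies roughly a 1% chance of a serious flood in any year. Assuming, for this sketch, that years are independent, the chance of at least one such flood over a span of years is 1 - (1 - p)^years, and the probability of a primary event together with a correlated one is the primary's probability times the conditional probability of the follow-on event. The function names below are hypothetical.

```python
def prob_at_least_once(annual_p: float, years: int) -> float:
    """Chance of at least one occurrence over `years`, assuming independent years."""
    return 1.0 - (1.0 - annual_p) ** years


def joint(p_primary: float, p_correlated_given_primary: float) -> float:
    """P(primary and correlated) = P(primary) * P(correlated | primary)."""
    return p_primary * p_correlated_given_primary
```

Over 30 years, `prob_at_least_once(0.01, 30)` comes to about 0.26, so a "100-year" flood is far from a once-in-a-lifetime event. With a made-up 0.8 conditional probability of losing services given a flood, `joint(0.01, 0.8)` gives 0.008 per year.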

For instance, I get a fortnightly email from the University of Colorado's Natural Hazards Center that occasionally provides some insight into these probabilities. Your corporation's insurance companies can potentially provide some guidance as well.

What is risk tolerance?

But at some point, only the company can determine its risk tolerance. I believe risk tolerance to be some combination of the money one is willing to invest and one's ability to invest it in mitigating risks. For example, let's say my company makes $10M a year in revenue. Given the importance of IT to my company's activities, a reasonable risk tolerance in dollar terms might be somewhere between 0.1% and 1.0% of revenue, or $10K to $100K. I must say I am probably spending more than that percentage of SCI revenue on my current DR activities, such as they are, but I include weekly and monthly backups in these costs (most would not include these activities in pure DR spending).
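The dollar arithmetic is simple, but a tiny helper makes the assumption explicit. This sketch assumes, as in the example above, that the DR budget band is a fixed percentage range of annual revenue (0.1% to 1.0% by default); `dr_budget_range` and `classify_spend` are hypothetical names.

```python
from typing import Tuple


def dr_budget_range(annual_revenue: float, low_pct: float = 0.1,
                    high_pct: float = 1.0) -> Tuple[float, float]:
    """Low and high DR spend, in $, as percentages of annual revenue."""
    return annual_revenue * low_pct / 100, annual_revenue * high_pct / 100


def classify_spend(spend: float, annual_revenue: float) -> str:
    """Where a proposed DR spend falls relative to the default band."""
    low, high = dr_budget_range(annual_revenue)
    if spend < low:
        return "below band"
    if spend > high:
        return "above band"
    return "within band"
```

For a $10M company, `dr_budget_range(10_000_000)` gives the $10K to $100K band; a $150K spend would come out "above band", as the author suspects his own does once backups are counted.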

—-

So as the disaster in Japan continues, let us pray that it works out well in the end for all parties.  But also let’s use this time to re-examine our risk tolerance and disaster recovery plans with respect to correlated risks.  Hopefully, we will all do better next time.

Comments?

Strategy, as we know it, is dead

Or at least that’s how the WSJ reported it yesterday.

Years back when I was working in corporate strategy we used to have this yearly dance called strategic planning.  Every year we would fan out to all the business units, look at what they were doing and try to figure out what they needed to be doing three to five years down the road.

This process typically lasted the better part of a quarter and culminated in a presentation to upper management on a direction for the business unit to pursue. What happened next was often the best part. Some business units would shelve the work and never look at it again. Others would invest time and effort to incorporate the strategic plan recommendations into that year's work, to try to make it happen in 3 to 5 years' time. At the end of this process, annual budgets would be declared “done” and the world would go back to work.

But that was the old, dead strategy.

The “New Strategy”

The new strategy is defined by adaptability and flexibility to take advantage of any opportunity that presents itself.  This results in strategic plans and operating budgets that are updated monthly, just-in-time decision making, and wider ranging planning scenarios.  For example:

  • Strategic plans and budgets updated monthly – as the economy tanked over the last couple of years, baseline assumptions were rendered useless in no time at all. Budgets updated yearly were no help. Even budgets updated quarterly were subject to significant tracking error. The only way to survive was to look at your budgets every month and adjust for cost of capital, inventory, and revenue mix. This way, a company could adjust its product mix immediately to best match what was selling and thus maximize return.
  • Just-in-time decision making – the WSJ used a factory closing example in their article, but I prefer to look at the SSD vs. HDD product mix. When to get on the SSD bandwagon is a strategic decision. One can examine this decision yearly, quarterly, or monthly to see if it makes sense today, or take the time to identify the trigger points that would make the decision for you. For SSDs, one could decide the price to which SLC-NAND memory has to drop, say $X/GB, for SSDs to make sense. To make this decision, one must determine how long it would take to create and launch SSD product offerings and what SLC-NAND pricing trends look like today, and then back up the trigger point to take all this into account. After that, all one needs to do is monitor SSD pricing daily and, when it hits the trigger point, start the product changeover.
  • Wider-ranging scenarios – the old strategic planning used economic variables such as cost of capital, revenue growth, and cost of goods sold, and many planners would vary each of these factors by +/- 5% to generate operating scenarios that were then fed into the strategic planning process. The problem with such scenarios is that they didn't take into account the extreme circumstances of the last couple of years. Widened to something like +/- 15%, these scenarios become much more useful and would have reflected actual experience.
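Backing up the trigger point, as described in the just-in-time bullet, can be sketched as follows. It assumes, purely for illustration, that SLC-NAND $/GB falls by a constant fraction each year; under that assumption, the price at which to start development is the target price divided by the decline expected over the product lead time. `trigger_price` and `should_start` are hypothetical helpers, and any prices fed to them would be your own estimates.

```python
def trigger_price(target_price: float, annual_decline: float,
                  lead_time_months: float) -> float:
    """Price at which to start the changeover so launch coincides with the target.

    Assumes price(t) = price_now * (1 - annual_decline) ** (t / 12), t in months.
    """
    if not 0.0 < annual_decline < 1.0:
        raise ValueError("annual_decline must be between 0 and 1")
    # Fraction of today's price remaining after the development lead time.
    decline_over_lead_time = (1.0 - annual_decline) ** (lead_time_months / 12.0)
    # Price keeps falling during development, so start while it is still higher.
    return target_price / decline_over_lead_time


def should_start(todays_price: float, trigger: float) -> bool:
    """Daily check: start the changeover once the observed price reaches the trigger."""
    return todays_price <= trigger
```

For instance, with a hypothetical target of $2.00/GB, a 30% annual decline and a 12-month lead time, the trigger is about $2.86/GB; the daily monitoring step then reduces to `should_start(todays_price, trigger)`.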
F-15 F-16 F86 Sabre Jet Heritage Flight by TMWolf (cc) (from flickr)

But in the end most of this speaks to speed and taking advantage of opportunities that are present.

OODA

All this reminds me of Colonel John R. Boyd (USAF, deceased), who came up with a new military and competitive strategy paradigm called OODA: Observe, Orient, Decide, and Act. Observe the competition (or marketplace), orient to (or appreciate) what the market is doing, decide what the most appropriate action will be, and then do it. John believed that the fastest OODA cycle always wins in the end. Every OODA cycle takes time to perform; the fastest one changes the marketplace so that, by the time your (slower) adversary sees what's happening and reacts, you have already changed the world out from under them.

There is a good book on Col. Boyd's life by Robert Coram, Boyd: The Fighter Pilot Who Changed the Art of War. There is also a bio, Genghis John, written by a close friend, Chuck Spinney. If you are interested in understanding more of his views on conflict and strategy, I suggest starting with the bio, but the book is an easy read.

How this all applies to a world of 6-18 month product development cycles and 3-month marketing campaigns will have to be the subject of a future post…