Top 10 storage technologies over the last decade

Aurora's Perception or I Schrive When I See Technology by Wonderlane (cc) (from Flickr)

Some of these technologies were in development prior to 2000, some were available in other domains but not in storage, and some were in a few subsystems but had yet to become as popular as they are today. In no particular order, here are my top 10 storage technologies for the decade:

  1. NAND based SSDs – DRAM-based and other solid state drives (SSDs) were available last century, but over the last decade NAND flash based devices have come to dominate SSD technology and have altered the storage industry forevermore. Today, it's nigh impossible to find enterprise class storage that doesn't support NAND SSDs.
  2. GMR heads – Giant magnetoresistive disk heads have become commonplace over the last decade and have allowed disk drive manufacturers to double data density every 18-24 months. Now GMR heads are starting to transition over to tape storage and will enable that technology to increase data density dramatically.
  3. Data deduplication – Deduplication technologies emerged over the last decade as a complement to higher density disk drives and as a means to more efficiently back up data. Deduplication technology can be found in many different forms today, ranging from file and block storage systems and backup storage systems to backup software-only solutions.
  4. Thin provisioning – No one would argue that thin provisioning emerged last century but it took the last decade to really find its place in the storage pantheon.  One almost cannot find a data center class storage device that does not support thin provisioning today.
  5. Scale-out storage – Last century if you wanted to get higher IOPS from a storage subsystem you could add cache or disk drives but at some point you hit a subsystem performance wall.  With scale-out storage, one can now add more processing elements to a storage system cluster without having to replace the controller to obtain more IO processing power.  The link reference talks about the use of commodity hardware to provide added performance but scale-out storage can also be done with non-commodity hardware (see Hitachi’s VSP vs. VMAX).
  6. Storage virtualization – Server virtualization has taken off as the dominant data center paradigm over the last decade, and its counterpart in storage has become more viable as well. Storage virtualization was originally used to migrate data from old subsystems to new storage, but today it can be used to manage and migrate data across PBs of physical storage, dynamically optimizing data placement for cost and/or performance.
  7. LTO tape – When IBM dominated IT in the mid-to-late last century, the tape format du jour always matched IBM's tape technology. As the decade dawned, IBM was no longer the dominant player and tape technology was starting to diverge into a babble of differing formats. As a result, IBM, Quantum, and HP put their technology together and created a standard tape format, called LTO, which has become the new dominant tape format for the data center.
  8. Cloud storage – It's unclear just when over the last decade cloud storage emerged, but it seemed to be a supplement to cloud computing, which also appeared this past decade. Storage service providers had existed earlier but, due to bandwidth limitations and storage costs, didn't survive the dotcom bubble. Over this past decade both bandwidth and storage costs have come down considerably, and cloud storage has now become a viable technological solution to many data center issues.
  9. iSCSI – SCSI has taken on many forms over the last couple of decades, but iSCSI has altered the dominant block storage paradigm from a single, pure FC-based SAN to a plurality of technologies. Nowadays, SMB shops can have block storage without the cost and complexity of FC SANs, over the LAN networking technology they already use.
  10. FCoE – One could argue that this technology is still maturing today, but once again SCSI has opened up another way to access storage. FCoE has the potential to offer all the robustness and performance of FC SANs over data center Ethernet hardware, simplifying and unifying data center networking onto one technology.

No doubt others would differ on their top 10 storage technologies over the last decade, but I strove to find technologies that significantly changed data storage between 2000 and today. These 10 seemed to me to fit the bill better than most.

Comments?

VPLEX surfaces at EMCWorld

Pat Gelsinger introducing VPLEXes on stage at EMCWorld

At EMCWorld today, Pat Gelsinger had a pair of VPLEXes flanking him on stage, actively moving VMs from a “Boston” to a “Hopkinton” data center. They showed a demo of moving a bunch of VMs from one to the other while all of them were actively performing transaction processing. I have written about EMC’s vision in a prior blog post called Caching DaaD for Federated Data Centers.

I talked to a vSpecialist at the blogging lounge afterwards and asked him where the data actually resided for the VMs that were moved. He said the data was synchronously replicated and actively being updated at both locations. They proceeded to long-distance teleport (Vmotion) 500 VMs from Boston to Hopkinton. After that completed, Chad Sakac powered down the ‘Boston’ VPLEX and everything in ‘Hopkinton’ continued to operate. All this was done on stage, so the Boston and Hopkinton data centers were possibly both located in the convention center, but it was interesting nonetheless.

I asked the vSpecialist how they moved the IP addresses between the sites and he said the sites shared the same IP domain. I am no networking expert, but moving the network addresses seemed to me like the last problem to solve for long distance Vmotion. He said Cisco had solved this with their OTV (Overlay Transport Virtualization) for the Nexus 7000, which can move IP addresses from one data center to another.

1 Engine VPLEX back view

Later at the Expo, I talked with a Cisco rep who said they do this by encapsulating Layer 2 protocol messages into Layer 3 packets. Once encapsulated, the traffic can be routed over anyone’s gear to the other site, and as long as there is another Nexus 7K switch at the other site within the proper IP domain shared with the server targets for Vmotion, it all works fine. I didn’t ask what happens if the primary Nexus 7K switch/site goes down, but my guess is that the IP address movement would cease to work. For active VM migration between two operational data centers, though, it all seems to hang together. I asked Cisco if OTV was a formal standard TCP/IP protocol extension and the rep said he didn’t know, which probably means that other switch vendors won’t support OTV.

4 Engine VPLEX back view

There was a lot of other stuff at EMCWorld today and at the Expo.

  • EMC’s Content Management & Archiving group was renamed Information Intelligence.
  • EMC’s Backup Recovery Systems group was in force on the Expo floor with a big pavilion with Avamar, Networker and Data Domain present.
  • EMC keynotes were mostly about the journey to the private cloud.  VPLEX seemed to be crucial to this journey as EMC sees it.
  • EMCWorld’s show floor was impressive. Lots of  major partners were there RSA, VMware, IOmega, Atmos, VCE, Cisco, Microsoft, Brocade, Dell, CSC, STEC, Forsythe, Qlogic, Emulex and many others.  Talked at length with Microsoft about SharePoint 2010. Still trying to figure that one out.
One table at bloggers lounge StorageNerve & BasRaayman in the foreground hard at work

I would say the bloggers lounge was pretty busy for most of the day. I met a lot of bloggers there including StorageNerve (Devang Panchigar), BasRaayman (Bas Raayman), Kiwi_Si (Simon Seagrave), DeepStorage (Howard Marks), Wikibon (Dave Vellante), and a whole bunch of others.

Well not sure what EMC has in store for day 2, but from my perspective it will be hard to beat day 1.

Full disclosure: I have written a white paper discussing VPLEX for EMC and work with EMC on a number of other projects as well.

7 grand challenges for the next storage century

Clock tower (4) by TJ Morris (cc) (from flickr)

I saw a recent IEEE Spectrum article on engineering’s grand challenges for the next century and thought something similar should be done for data storage. So this is a start:

  • Replace magnetic storage – most predictions show that magnetic disk storage has another 25 years, and magnetic tape another decade after that, before they run out of steam. Such end-dates have been wrong before, but it is unlikely that we will be using disk or tape 50 years from now. Some sort of solid state device seems most probable as the next evolution of storage. I doubt this will be NAND, considering its write endurance and other long-term reliability issues, but if such issues could be resolved maybe it could replace magnetic storage.
  • 1000 year storage – paper can be printed today with non-acidic ink and retain its image for over 1,000 years. Nothing in data storage today can claim much more than 100 years of longevity. The world needs data storage that lasts much longer than 100 years.
  • Zero energy storage – today SSD/NAND and rotating magnetic media consume energy constantly in order to be accessible. Ultimately, the world needs some sort of storage that only consumes energy when read or written; such storage would provide “online access with offline power consumption”.
  • Convergent fabrics running divergent protocols – whether it’s Ethernet, InfiniBand, FC, or something new, all fabrics should be able to handle any and all storage (and datacenter) protocols. The internet has become so ubiquitous because it handles just about any protocol we throw at it. We need the same or something similar for datacenter fabrics.
  • Securing data – securing books or paper is relatively straightforward today: just throw them in a vault/safety deposit box. Securing data seems simple, yet it is not widely practiced today. It doesn’t have to be that way. We need better, longer lasting tools and methodologies to secure our data.
  • Public data repositories – libraries exist to provide access to the output of society in the form of books, magazines, papers and other printed artifacts. No such repository exists today for data. Society would be better served if there were library-like institutions that could store data and provide access to it. Most of the obstacles here are legal, due to data ownership, but technological issues exist as well.
  • Associative accessed storage – Sequential and random access have been around for over half a century now. Associative storage could complement these and be another approach, allowing storage to be retrieved by its content. We can kind of do this today by keywording and indexing data. Biological memory is accessed via associations or linkages to other concepts; once accessed, memory seems almost sequentially accessed from there. Something comparable to biological memory may be required to build more intelligent machines.
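
As a rough illustration of the keywording/indexing approach mentioned in the last item, here is a toy inverted index in Python (purely hypothetical names, nothing like a real associative storage device) that retrieves items by their content rather than their address:

```python
from collections import defaultdict

# Toy content-addressable lookup: items are indexed by the words they
# contain and then retrieved by content rather than by location.
index = defaultdict(set)

def ingest(doc_id, text):
    for word in text.lower().split():
        index[word].add(doc_id)

def retrieve(query):
    """Return IDs of all items containing every word in the query."""
    words = query.lower().split()
    result = set(index[words[0]]) if words else set()
    for word in words[1:]:
        result &= index[word]
    return result

ingest("doc1", "magnetic disk storage density")
ingest("doc2", "solid state storage endurance")
print(retrieve("storage density"))   # {'doc1'}
```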

Some of these are already being pursued while others receive no interest today. Nonetheless, I believe they all deserve investigation if storage is to continue to serve its primary role for society, as a long term storehouse for its culture, thoughts and deeds.

Comments?

Protecting the Yottabyte archive

blinkenlights by habi (cc) (from flickr)

In a previous post I discussed what it would take to store 1YB of data in 2015 for the National Security Agency (NSA). Due to length, that post did not discuss many other aspects of the 1YB archive, such as ingest, indexing, and data protection. I will attempt to cover each of these in turn; this post covers some of the data protection aspects of the 1YB archive and its catalog/index.

RAID protecting 1YB of data

Protecting the 1YB archive will require some sort of parity protection. RAID data protection could certainly be used and may need to be extended to removable media (RAID for tape), but that would require somewhere in the neighborhood of 10-20% additional storage (RAID5 across a 10- to 5-wide tape drive stripe). It’s possible that with Reed-Solomon encoding and RAID6 we could take this down to 5-10% additional storage (RAID6 across a 40- to 20-wide tape drive stripe). Possibly other forms of ECC (such as turbo codes) might be usable in a RAID-like configuration, which would give even better reliability with less additional storage.
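
As a quick check on those overhead figures, here is a back-of-envelope sketch in Python (the stripe widths are simply the ones assumed above, not a recommendation):

```python
def parity_overhead(stripe_width, parity_drives):
    """Fraction of a stripe consumed by parity."""
    return parity_drives / stripe_width

# RAID5 (1 parity drive) across 10- to 5-wide tape stripes
print(parity_overhead(10, 1), parity_overhead(5, 1))    # 0.1, 0.2  -> 10-20%
# RAID6 (2 parity drives) across 40- to 20-wide tape stripes
print(parity_overhead(40, 2), parity_overhead(20, 2))   # 0.05, 0.1 -> 5-10%
```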

But RAID-like protection also applies to the data catalog and indexes required to access the 1YB archive of data. Ditto for the online data itself while it’s being ingested, indexed, or read back. For the remainder of this post I ignore the RAID overhead, but suffice it to say an additional 10% of storage for parity will not change this discussion much.

Also, in the original post I envisioned a multi-tier storage hierarchy, but the lowest tier always held a copy of any files residing in the upper tiers. This would provide some RAID1-like redundancy for any online data. This might be pretty useful: if a file is of high interest, it could have been accessed recently and therefore reside in the upper storage tiers. As such, multiple copies of interesting files could exist.

Catalog and index backups for the 1YB archive

IMHO, RAID or other parity protection is different from data backup. Data backup is generally used as a last line of defense against hardware failure, software failure or user error (deleting the wrong data). It’s certainly possible that the lowest tier data is stored on some sort of WORM (write once read many times) media, meaning it cannot be overwritten, eliminating one class of user error.

But this presumes the catalog is available and the media is locatable, which means the catalog has to be preserved/protected from user error and HW and SW failures. I wrote about whether cloud storage needs backup in a prior post and feel strongly that the 1YB archive would require backups as well.

In general, backup today is done by copying the data to some other storage and keeping that storage offsite from the original data center. At this amount of data, most likely the 2.1×10**21 bytes of catalog (see original post) and index data would be copied to some form of removable media. The catalog is most important, as the other two indexes could potentially be rebuilt from the catalog and original data. Assuming we are unwilling to reindex the data, the catalog and index backups would take 1.3×10**9 LTO-6 cartridges (at 1.6×10**12 bytes/cartridge).

To back up this amount of data once per month would take a gaggle of tape drives. There are ~2.6×10**6 seconds/month and each LTO-6 drive can transfer 5.4×10**8 bytes/sec or 1.4×10**15 bytes/drive-month, but we need to back up 2.1×10**21 bytes of data, so we need ~1.5×10**6 tape transports. Now tapes do not operate 100% of the time because when a cartridge becomes full it has to be changed out with an empty one, but this amounts to a rounding error at these numbers.
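
A minimal sketch of the cartridge and transport arithmetic, using only the figures assumed in this post (not vendor specifications):

```python
catalog_bytes  = 2.1e21   # catalog + index data to back up
lto6_capacity  = 1.6e12   # assumed bytes per LTO-6 cartridge
lto6_rate      = 5.4e8    # assumed bytes/sec per LTO-6 drive
secs_per_month = 2.6e6    # ~30 days

cartridges = catalog_bytes / lto6_capacity           # ~1.3e9 cartridges
bytes_per_drive_month = lto6_rate * secs_per_month   # ~1.4e15 bytes
transports = catalog_bytes / bytes_per_drive_month   # ~1.5e6 tape drives

print(f"{cartridges:.1e} cartridges, {transports:.1e} transports")
```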

To figure out the tape robotics needed to service 1.5×10**6 transports we could use the latest T-Finity tape library just announced by Spectra Logic. The T-Finity supports 500 tape drives and 122,000 tape cartridges, so we would need 3.0×10**3 libraries to handle the drive workload and about 1.1×10**4 libraries to store the cartridge set required, so 11,000 T-Finity libraries would suffice. Presumably, using LTO-7 these numbers could be cut roughly in half: ~5,500 libraries, ~7.5×10**5 transports, and ~6.6×10**8 cartridges.
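
Extending the same sketch to the library count (T-Finity drive and slot counts as quoted above):

```python
transports, cartridges = 1.5e6, 1.3e9          # from the backup math above
drives_per_lib, slots_per_lib = 500, 122_000   # T-Finity as quoted

libs_for_drives = transports / drives_per_lib  # ~3,000 libraries (drive limited)
libs_for_slots  = cartridges / slots_per_lib   # ~10,700 libraries (slot limited)
print(f"~{max(libs_for_drives, libs_for_slots):,.0f} T-Finity libraries")

# If LTO-7 roughly doubles capacity and throughput, all three figures halve.
```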

Other removable media exist, most notably the ProStor RDX. However, RDX roadmap info out to the next generation is not readily available and high-end robotics do not currently support RDX. So for the moment tape seems the only viable removable backup medium for the catalog and index of the 1YB archive.

Mirroring the data

Another approach to protecting the data is to mirror the catalog and index data. This involves taking the data and copying it to another online storage repository. This doubles the storage required (to 4.2×10**21 bytes of storage). Replication doesn’t easily protect from user error but is an option worthy of consideration.

Networking infrastructure needed

Whether mirroring or backing up to tape, moving this amount of data will require substantial networking infrastructure. If we assume that in 2015 we have 32GFC (32 Gb/sec Fibre Channel) interfaces, each interface could potentially transfer 3.2GB/s or 3.2×10**9 bytes/sec. Mirroring or backing up 2.1×10**21 bytes over one month would take ~2.5×10**5 32GFC interfaces. We should probably have twice this amount of networking so that no one link becomes a bottleneck, so ~5×10**5 32GFC interfaces should work.

As for switches, the current Brocade DCX supports 768 8GFC ports, and presumably similar port counts will be available in 2015 to support 32GFC. If we assume at least 2 ports per link, we will need ~650 fully populated DCX switches. This doesn’t account for multi-layer switch topologies and other sophistication, but that could be accommodated with another factor of 2, or ~1,300 switches.
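
And the corresponding sketch for the networking math (32GFC throughput and DCX port count as assumed above):

```python
catalog_bytes  = 2.1e21
gfc32_rate     = 3.2e9    # assumed bytes/sec per 32GFC interface
secs_per_month = 2.6e6
dcx_ports      = 768      # ports per fully populated DCX

interfaces = catalog_bytes / (gfc32_rate * secs_per_month)   # ~2.5e5
ports      = 2 * interfaces                                  # 2 ports per link
switches   = ports / dcx_ports                               # ~650
print(f"~{interfaces:.1e} interfaces, ~{switches:.0f} switches "
      f"(~{2 * switches:.0f} with multi-layer topologies)")
```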

Hot backups require journals

This all assumes we can do catalog and index backups once per month and take the whole month to do them. Now, storage today normally has to be quiesced (via snapshot or some other mechanism) to be backed up in a consistent state. While it’s not impossible to back up data that is concurrently being updated, it is more difficult. In this case, one needs to maintain a journal of the updates going on while the data is being backed up and be able to apply the journaled changes to the backup afterwards.
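
A minimal sketch of the journal-and-apply idea (the structures are entirely hypothetical, just to show the sequence of events):

```python
# Hot backup with a journal: copy the catalog while it is still being
# updated, record the concurrent updates, then replay them onto the copy.
catalog = {"file001": "tier3/tape0001", "file002": "tier1/ssd0042"}

backup_copy = dict(catalog)       # 1. start the (long-running) backup copy
journal = []                      # 2. log every update made during the copy

def update(key, value):
    catalog[key] = value
    journal.append((key, value))  # journaled so the backup can catch up

update("file003", "tier2/disk0007")   # updates arriving mid-backup
update("file001", "tier1/ssd0099")

for key, value in journal:        # 3. apply journaled changes to the copy
    backup_copy[key] = value

assert backup_copy == catalog     # backup is now consistent with the source
```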

For the moment I am not going to determine the storage requirements for the journal file required to cover the catalog transactions for a month, but this is dependent on the change rate of the catalog data. So it will necessarily be a function of the index or ingest rate of the 1YB archive to be covered in a future post.

Stay tuned, I am just having too much fun to stop.

IO Virtualization comes out

Snakes in a plane by richardmasoner (cc) (from flickr)

Prior to last week’s VMworld, I had never heard of IO virtualization products before – storage virtualization yes but never IO virtualization. Then at last week’s VMworld I met with two vendors of IO virtualization products Aprius and Virtensys.

IO virtualization takes the HBAs/CNAs/NICs that would normally be plugged into each tower server and consolidates them into a top-of-rack box that shares these IO cards among the servers. The top-of-rack box is connected to each of the tower servers by extending each server’s PCI-express bus.

Each individual server believes it has a local HBA/CNA/NIC card and acts accordingly. The top-of-rack box handles the mapping of each server to a portion of the HBA/CNA/NIC cards being shared. This all reminds me of server virtualization, which uses software to share a server’s processor, memory and IO resources across multiple applications – but with one significant difference, discussed below.

How IO virtualization works

Aprius depends on the new SR-IOV (Single Root I/O Virtualization) standards. I am no PCI-express expert, but what this seems to do is allow an HBA/CNA/NIC PCI-express card to be a shared resource among a number of virtual servers executing within a physical server. What Aprius has done is sort of a “P2V in reverse” and allows a number of physical servers to share the same PCI-express HBA/CNA/NIC card in the top-of-rack solution.
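
To make the sharing concrete, here is a purely hypothetical sketch of the kind of mapping a top-of-rack box might maintain (server names, card names and bandwidth shares are all invented for illustration):

```python
# Each physical server sees what it believes is a local CNA, but is
# really assigned a slice of a shared card in the top-of-rack box.
shared_cards = {"cna0": 1250, "cna1": 1250}   # hypothetical MB/s per shared CNA port

assignments = {                  # server -> (shared card, fraction of bandwidth)
    "server1": ("cna0", 0.25),
    "server2": ("cna0", 0.25),
    "server3": ("cna0", 0.50),
    "server4": ("cna1", 0.50),
    "server5": ("cna1", 0.50),
}

for server, (card, share) in assignments.items():
    print(f"{server}: virtual CNA backed by {card}, "
          f"~{shared_cards[card] * share:,.0f} MB/s available")
```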

Virtensys says it’s solution does not depend on SRIOV standards to provide IO virtualization. As such, it’s not clear what’s different but the top-of-box solution could conceivably share the hardware via software magic.

From an FC and HBA perspective there seem to be a number of questions as to how all this works.

  • Does the top-of-rack box need to be powered and booted up first?
  • How is FC zoning and LUN masking supported in a shared environment?

Similar networking questions should arise especially when one considers iSCSI boot capabilities.

Economics of IO virtualization

But the real question is one of economics. My lab owner friends tell me that a CNA costs about $800/port these days. When you consider that one could have 4-8 servers sharing each of these ports with IO virtualization, the economics become clearer. With a typical configuration of 6 servers:

  • For a non-IO virtualized solution, each server would have 2 CNA ports at a minimum so this would cost you $1600/server or $9600.
  • For an IO virtualized solution, each server requires a PCI-extender, costing about $50/server or $300, plus at least one CNA (for the top-of-rack box) costing $1600, plus the cost of the top-of-rack box itself.

If the IO virtualization box costs less than $7.7K it would be economical. But IO virtualization providers also claim another savings: fewer switch ports need to be purchased because there are fewer physical network links. It is unclear to me what a 10GbE port with FCoE support costs these days, but my guess is around 2X what a CNA port costs, or another $1600/port, which for the 6 server, dual-ported configuration is ~$19.2K. Thus, the top-of-rack solution could cost almost $27K and still be more economical. When using IO virtualization to also reduce HBAs and NICs, the top-of-rack solution could be even more economical.
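
A quick sketch of that break-even arithmetic (all prices are the rough figures quoted above):

```python
servers       = 6
cna_port_cost = 800     # rough street price per CNA port
switch_port   = 1600    # guess: ~2x a CNA port for a 10GbE/FCoE switch port
extender_cost = 50      # per-server PCI-extender card

# Conventional: 2 CNA ports per server, each consuming a switch port
conventional = servers * 2 * (cna_port_cost + switch_port)   # $28,800

# IO virtualized: extenders plus one dual-ported CNA in the top-of-rack box
iov_fixed = servers * extender_cost + 2 * cna_port_cost      # $1,900
print(f"top-of-rack box breaks even below ~${conventional - iov_fixed:,}")
# -> ~$26,900, i.e. almost $27K
```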

Although the economics may be in favor of IO virtualization – at the moment – time is running out. CNA, HBA and NIC ports are coming down in price as vendors ramp up production. These same factors will reduce switch port costs as well. Thus, the savings gained from sharing CNAs, HBAs and NICs across multiple servers will diminish over time. Also, the move to FCoE will eliminate HBAs and NICs and replace them with just CNAs, so there are even fewer ports to amortize.

Moreover, PCI-express extender cards will probably never achieve volumes similar to HBAs, NICs, or CNAs, so extender card pricing should remain flat. In contrast, any top-of-rack solution will share in the overall technology trends that are reducing hardware pricing, so the relative advantages of IO virtualization appliances over top-of-rack switches should be a wash.

The critical question for the IO virtualization vendors is whether they can support a high enough fan-in (physical servers per top-of-rack box) to justify the additional capital and operational expense of their solution, and whether they can keep ahead of the pricing trends of their competition (top-of-rack switch ports and server CNA ports).

On one hand, as CNAs, HBAs, and NICs become faster and more powerful, no single application can consume all the throughput being made available. On the other hand, server virtualization is now running more applications on each physical server and, as such, amortizing port hardware over more and more applications.

Does IO virtualization make sense today, with HBAs at 8GFC and NICs and CNAs at 10GbE? Would it make sense in the future with converged networks? It all depends on port costs. As port costs come down, these products will eventually be squeezed.

The significant difference between server and IO virtualization is the fact that IO virtualization doesn’t reduce hardware footprint – one top-of-rack IO virtualization appliance replaces a top-of-rack switch, and the server PCI-express slots used by CNAs/HBAs/NICs are now used by PCI-extender cards. In contrast, server virtualization reduced hardware footprint and costs from the start. The fact that IO virtualization doesn’t reduce hardware footprint may doom this product.

VMworld and long distance Vmotion

Moving a VM from one data center to another

In all the blog posts/tweets about VMworld this week I didn’t see much about long distance Vmotion. At Cisco’s booth there was a presentation on how they partnered with VMware to perform Vmotion over a (simulated) distance of 200 miles.

I can’t recall when I first heard about this capability but for many of us this we heard about this before. However, what was new was that Cisco wasn’t the only one talking about it. I met with a company called NetEx whose product HyperIP was being used to performe long distance Vmotion at over 2000 miles apart . And had at least three sites actually running their systems doing this. Now I am sure you won’t find NetEx on VMware’s long HCL list but what they have managed to do is impressive.

As I understand it, they have an optimized appliance (also available as a virtual [VM] appliance) that terminates the TCP session (used by Vmotion) at the primary site and then transfers the data payload using their own UDP protocol over to the target appliance, which reconstitutes (?) the TCP session and sends it back up the stack as if everything were local. According to NetEx CEO Craig Gust, their product typically offers a data payload efficiency of around 90%, compared to around 30% for standard TCP/IP, which automatically gives them a 3X advantage (although he claimed a 6X speed or distance advantage; I can’t seem to follow the logic).
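
The 3X figure follows directly from the two payload efficiencies; here is a trivial check with an assumed link speed (the link rate itself cancels out):

```python
link_mbps = 800                         # assumed WAN link speed, Mb/s (arbitrary)
hyperip_goodput   = 0.90 * link_mbps    # ~720 Mb/s of useful payload
plain_tcp_goodput = 0.30 * link_mbps    # ~240 Mb/s of useful payload
print(hyperip_goodput / plain_tcp_goodput)   # 3.0 -> the 3X advantage
```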

How all this works with vCenter, DRS and HA I can only guess, but my guess is that this long distance Vmotion appears to VMware as a local Vmotion. This way DRS and/or HA can control it all. How the networking is set up to support this is beyond me.

Nevertheless, all of this proves that it’s not just one high-end networking company coming away with a proof of concept anymore; at least two companies exist, one of which has customers doing it today.

The Storage problem

In any event, accessing the storage at the remote site is another problem. It’s one thing to transfer server memory and state information over 10-1000 miles; it’s quite another to transfer TBs of data storage over the same distance. The Cisco team suggested some alternatives to handle the storage side of long distance Vmotion:

  • Let the storage stay in the original location. This would be supported by having the VM in the remote site access the storage across a network
  • Move the storage via long distance Storage Vmotion. The problem with this is that transferring TBs of data (even at 90% data payload efficiency over an 800 Mb/s link) would take hours, as the sketch after this list shows. And 800 Mb/s networking isn’t cheap.
  • Replicate the storage via active-passive replication. Here the storage subsystem(s) concurrently replicate the data from the primary site to the secondary site
  • Replicate the storage via active-active replication where both the primary and secondary site replicate data to one another and any write to either location is replicated to the other
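
As promised in the second bullet above, a rough sense of the transfer time involved in a long distance Storage Vmotion (link speed and payload efficiency are the figures assumed above; the amount of storage is an arbitrary example):

```python
terabytes  = 1.0        # assumed VM storage to move (scales linearly)
link_mbps  = 800        # assumed WAN link, Mb/s
efficiency = 0.90       # HyperIP-style payload efficiency

seconds = (terabytes * 8e12) / (efficiency * link_mbps * 1e6)
print(f"~{seconds / 3600:.1f} hours per TB")   # ~3.1 hours per TB moved
```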

Now, I have to admit that active-active replication, where the same LUN or file system is being replicated in both directions and updated at both locations simultaneously, seems to me like unobtainium, but I can be convinced otherwise. Nevertheless, the other approaches exist today and effectively deal with the issue, albeit with commensurate increases in expense.

The Networking problem

So now that we have the storage problem solved, what about the networking problem? When a VM is Vmotioned to another ESX server it retains its IP addressing so as to retain all its current network connections. Cisco has some techniques here where they can extend the VLAN (or subnet) from the primary site to the secondary site and leave the VM with the same network IP address as at the primary site. Cisco has a couple of different ways to extend the VLAN, optimized for HA, load balancing, scalability, or protocol isolation and broadcast avoidance (all of which is described further in their white paper on the subject). Cisco did mention that their VLAN extension technology currently does not support sites more than 500 miles apart.

Presumably NetEx’s product solves all this by leaving the IP addresses/TCP port at the primary site and just transferring the data to the secondary site. In any event multiple solutions to the networking problem exist as well.

Now that long distance Vmotion can be accomplished, is it a DR tool, a mobility tool, a load balancing tool, or all of the above? That will need to wait for another post.