Server virtualization vs. storage virtualization

Functional Fusion? by Cain Novocaine (cc) (from Flickr)

One can only be perplexed by the seemingly overwhelming adoption of server virtualization and contrast that with the ho-hum, almost underwhelming adoption of storage virtualization.  Why is there such a significant difference?

I think the problem is partly due to the lack of a common understanding of storage performance utilization.

Why server virtualization succeeded

One significant driver of server virtualization was the precipitous drop in server utilization that occurred over the last decade when running single applications on a physical server.  It was nothing unusual to see real processor utilization of less than 10%, and consequently it was easy to envision executing 5-10 applications on a single server. And what’s more, each new generation of server kept getting more powerful, handling double the MIPS every 18 months or so, driven by Moore’s law.
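
The consolidation arithmetic is simple enough to sketch (the 10% figure comes from the discussion above; the 80% ceiling is my assumption, not anything a vendor publishes):

```python
# Back-of-envelope consolidation math: if each app keeps a server
# ~10% busy, how many fit on one host under a utilization ceiling?
per_app_pct = 10    # per-app utilization, per the text above
ceiling_pct = 80    # assumed ceiling, leaving headroom for spikes

apps_per_host = ceiling_pct // per_app_pct
print(apps_per_host)  # -> 8
```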

The other factor was that application workloads weren’t increasing that much. Yes, new applications would come online, but they seldom consumed an inordinate amount of MIPS and were often similar to what was already present. So application processing, while not flatlining, was expanding at a relatively slow pace.

Why storage virtualization has failed

Data, on the other hand, continues its never-ending exponential growth, doubling every 3-5 years or less. And having more data almost always requires more storage hardware to supply the IOPS needed to support it.

In the past, storage IOPS rates were intrinsically tied to the number of disk heads available to service the load.  Although disk performance grew, it wasn’t doubling every 18 months, and real per-disk performance, measured as IOPS per GB, was actually going down over time.
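
To make that IOPS-per-GB decline concrete, here is a quick sketch using illustrative (not measured) drive generations; the point is the trend, since capacity grew far faster than per-spindle random IO:

```python
# Illustrative drive generations: capacity grew ~50x while random
# IOPS per spindle barely improved, so IOPS/GB collapsed.
generations = [
    # (year, capacity_gb, random_iops) -- assumed round numbers
    (2000, 36, 120),
    (2005, 300, 150),
    (2010, 2000, 180),
]
for year, gb, iops in generations:
    print(year, round(iops / gb, 3), "IOPS/GB")
```

Serving a fixed IOPS load against a growing data set therefore meant buying ever more spindles, regardless of how much capacity you actually needed.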

This drove proliferation of disk spindles and as such, storage subsystems in the data center. Storage virtualization couldn’t reduce the number of spindles required to support the workload.

Thus, if you look at storage performance from the perspective of the percentage of IOPS one could support per disk, most sophisticated systems were running anywhere from 75% to 150% (levels above 100% made possible by DRAM caching).

Paradigm shift ahead

But SSDs can change this dynamic considerably.  A typical SSD can sustain 10-100K IOPS, and there is some likelihood that this will increase with each generation that comes out, but application requirements will not increase as fast.  Hence, there is a high likelihood that normal data center utilization of SSD storage performance will start to drop below 50%.  When that happens, storage virtualization may start to make a lot more sense.

Maybe when (SSD) data storage starts moving more in line with Moore’s law, storage virtualization will become a more dominant paradigm for data center storage use.

Any bets on who the VMware of storage virtualization will be?

Comments?

Graphene Flash Memory

Model of graphene structure by CORE-Materials (cc) (from Flickr)

I have been thinking about writing a post on “Is Flash Dead?” for a while now, at least since talking with IBM research a couple of weeks ago about the new memory technologies they have been working on.

But then this new Technology Review article came out  discussing recent research on Graphene Flash Memory.

Problems with NAND Flash

As we have discussed before, NAND flash memory has some serious limitations as it’s shrunk below 11nm or so. For instance, write endurance plummets, memory retention times are reduced and cell-to-cell interactions increase significantly.

These issues are not that much of a problem with today’s flash at 20nm or so. But to continue to follow Moore’s law and drop the price of NAND flash on a $/Gb basis, it will need to shrink below 16nm.  At that point or soon thereafter, current NAND flash technology will no longer be viable.

Other non-NAND based non-volatile memories

That’s why IBM and others are working on different types of non-volatile storage such as PCM (phase change memory), MRAM (magnetic RAM), FeRAM (Ferroelectric RAM) and others.  All these have the potential to improve general reliability characteristics beyond where NAND Flash is today and where it will be tomorrow as chip geometries shrink even more.

IBM seems to be betting on MRAM or racetrack memory technology because it has near DRAM performance, extremely low power and can store far more data in the same amount of space. It sort of reminds me of delay line memory where bits were stored on a wire line and read out as they passed across a read/write circuit. Only in the case of racetrack memory, the delay line is etched in a silicon circuit indentation with the read/write head implemented at the bottom of the cleft.

Graphene as the solution

Then along comes Graphene based Flash Memory.  Graphene can apparently be used as a substitute for the storage layer in a flash memory cell.  According to the report, the graphene stores data using less power and with better stability over time.  Both crucial problems with NAND flash memory as it’s shrunk below today’s geometries.  The research is being done at UCLA and is supported by Samsung, a significant manufacturer of NAND flash memory today.

Current demonstration chips are much larger than would be useful.  However, given graphene’s material characteristics, the researchers believe there should be no problem scaling it down below where NAND Flash would start exhibiting problems.  The next iteration of research will be to see if their scaling assumptions can hold when device geometry is shrunk.

The other problem is getting graphene, a new material, into current chip production.  Current materials used in chip manufacturing lines are very tightly controlled and  building hybrid graphene devices to the same level of manufacturing tolerances and control will take some effort.

So don’t look for Graphene Flash Memory to show up anytime soon. But given that 16nm chip geometries are only a couple of years out and 11nm, a couple of years beyond that, it wouldn’t surprise me to see Graphene based Flash Memory introduced in about 4 years or so.  Then again, I am no materials expert, so don’t hold me to this timeline.


—-

Comments?

Pure Storage surfaces

1 controller X 1 storage shelf (c) 2011 Pure Storage (from their website)

We were talking with Pure Storage last week, another SSD startup which just emerged out of stealth mode today.  Somewhat like SolidFire which we discussed a month or so ago, Pure Storage uses only SSDs to provide primary storage.  In this case, they are supporting a FC front end, with an all SSDs backend, and implementing internal data deduplication and compression, to try to address the needs of enterprise tier 1 storage.

Pure Storage is in final beta testing with their product and plan to GA sometime around the end of the year.

Pure Storage hardware

Their system is built around MLC SSDs, which are available from many vendors but, with a strategic investment from Samsung, they currently use that vendor’s SSDs.  As we know, MLC has write endurance limitations, but Pure Storage was built from the ground up knowing they were going to use this technology and have built their IP to counteract these issues.

The system is available in one or two controller configurations, with an Infiniband interconnect between the controllers, 6Gbps SAS backend, 48GB of DRAM per controller for caching purposes, and NV-RAM for power outages.  Each controller has 12-cores supplied by 2-Intel Xeon processor chips.

With the first release they are limiting the controllers to one or two (HA option) but their storage system is capable of clustering together many more, maybe even up to 8-controllers using the Infiniband back end.

Each storage shelf provides 5.5TB of raw storage using 2.5″ 256GB MLC SSDs.  It looks like each controller can handle up to 2 storage shelves, with the HA (dual controller) option supporting 4 drive shelves for up to 22TB of raw storage.

Pure Storage Performance

Although these numbers are not independently verified, the company says a single controller (with 1 storage shelf) can do 200K sustained 4K random read IOPS, 2GB/sec bandwidth, 140K sustained write IOPS, or 500MB/sec of write bandwidth.  A dual controller system (with 2 storage shelves) can achieve 300K random read IOPS, 3GB/sec bandwidth, 180K write IOPS or 1GB/sec of write bandwidth.  They also claim that they can do all this IO with under 1 msec latency.

One of the things they pride themselves on is consistent performance.  They have built their storage such that they can deliver this consistent performance even under load conditions.

Given the number of SSDs in their system this isn’t screaming performance, but it is certainly up there with many enterprise class systems sporting over 1000 disks.  The random write performance is not bad considering this is MLC.  On the other hand, the sequential write bandwidth is probably their weakest spec and reflects their use of MLC flash.

Purity software

One key to Pure Storage (and SolidFire for that matter) is their use of inline data compression and deduplication. By using these techniques and basing their system storage on MLC, Pure Storage believes they can close the price gap between disk and SSD storage systems.

The problem with data reduction technologies is that not all environments can benefit from them, and they both require lots of CPU power to perform well.  Pure Storage believes they have the horsepower (with 12 cores per controller) to support these services and are focusing their sales activities on those (VMware, Oracle, and SQL Server) environments which have historically proven to be good candidates for data reduction.

In addition, they perform a lot of optimizations in their backend data layout to prolong the life of MLC storage. Specifically, they use a write chunk size that matches the underlying MLC SSDs page width so as not to waste endurance with partial data writes.  Also they migrate old data to new locations occasionally to maintain “data freshness” which can be a problem with MLC storage if the data is not touched often enough.  Probably other stuff as well, but essentially they are tuning their backend use to optimize endurance and performance of their SSD storage.

Furthermore, they have created a new RAID 3D scheme, an adaptive parity scheme based on the number of available drives that protects against any dual SSD failure.  They provide triple parity: dual parity for drive failures and another parity for unrecoverable bit errors within a data payload.  In most cases, a failed drive will not induce an immediate rebuild but rather a reconfiguration of data and parity to accommodate the failing drive, rebuilding it onto new drives over time.
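
Pure Storage hasn’t published the internals of RAID 3D, so as a hedged illustration only, here is the classic XOR-parity idea that underlies any such scheme: one parity strip lets you rebuild any single lost data strip (real dual-parity schemes add a second, independently computed parity using Galois-field math):

```python
# Rebuild a lost strip from survivors plus XOR parity.
# Not Pure Storage's actual algorithm -- just the underlying idea.
def xor_parity(strips):
    parity = bytes(len(strips[0]))
    for s in strips:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return parity

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # 4 data strips
p = xor_parity(data)                          # parity strip

survivors = data[:2] + data[3:]               # strip 2 is lost
rebuilt = xor_parity(survivors + [p])         # XOR of the rest recovers it
print(rebuilt)  # -> b'CCCC'
```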

At the moment, they don’t have snapshots or data replication but they said these capabilities are on their roadmap for future delivery.

—-

In the meantime, all-SSD storage systems seem to be coming out of the woodwork. We mentioned SolidFire, but WhipTail is another one, and I am sure there are plenty more in stealth waiting for the right moment to emerge.

I was at a conference about two months ago where I predicted that all SSD systems would be coming out with little of the engineering development of storage systems of yore. Based on the performance available from a single SSD, one wouldn’t need 100s of SSDs to generate 100K IOPS or more.  Pure Storage is doing this level of IO with only 22 MLC SSDs and a high-end, but essentially off-the-shelf controller.

Just imagine what one could do if you threw some custom hardware at it…

Comments?

SATA Express combines PCIe and SATA

SATA Express plug configuration (c) SATA-IO (from SATA-IO.org website)

SATA-IO recently announced a new working specification, SATA Express (better described in the presentation), that will provide a SATA device interface directly connected to a server’s PCIe bus.

The new working specification offers either 8Gbps or 16Gbps depending on the number of PCIe lanes being used and provides a new PCIe/SATA-IO plug configuration.

While this may be a boon to normal SATA disk drives, I see the real advantage lying with an easier interface for PCIe based NAND storage cards or hybrid disk drives.

New generation of PCIe SSDs based on SATA Express

For example, previously if you wanted to produce a PCIe NAND storage card, you either had to surround it with IO drivers to provide storage/cache interfaces (such as FusionIO does) or provide enough smarts on the card to emulate an IO controller along with the backend storage device (see my post on OCZ’s new bootable PCIe Z-drive).  With the new SATA Express interface, one no longer needs to provide any additional smarts with the PCIe card, as long as it supports SATA Express.

It would seem that SATA Express would be the best of all worlds.

  • If you wanted a directly accessed SATA SSD you could plug it in to your SATA-IO controller
  • If you wanted networked SATA SSDs you could plug it into your storage array.
  • If you wanted even better performance than either of those two alternatives you could plug the SATA SSD directly into the PCIe bus with the PCIe/SATA-IO interface.

Of course supporting SATA Express will take additional smarts on the part of any SATA-IO device but with all new SATA devices supporting the new interface, additional cost differentials should shrink substantially.

SATA-IO 3.2

The PCIe/SATA-IO plug design is just a concept now, but SATA-IO expects to have the specification finalized by year end with product availability near the end of 2012.  The SATA-IO organization has designated the SATA Express standard to be part of SATA 3.2.

One other new capability is being introduced with SATA 3.2, specifically a µSATA specification designed to provide storage for embedded system applications.

The prior generation SATA 3.1, coming out in products soon, includes the mSATA interface specification for mobile device storage and the USM SATA interface specification for consumer electronics storage.   And as most should recall, SATA 3.0 provided 6Gbps data transfer rates for SATA storage devices.

—-

Can “SAS Express” be far behind?

Comments?

OCZ’s latest Z-Drive R4 series PCIe SSD

OCZ_Z-Drive_RSeries (from http://www.ocztechnology.com/ocz-z-drive-r4-r-series-pci-express-ssd.html)

OCZ just released a new version of their enterprise class Z-drive SSD storage with pretty impressive performance numbers (up to 500K IOPS [probably read] with 2.8GB/sec read data transfer).

Bootability

These new drives are bootable SCSI devices and connect directly to a server’s PCIe bus. They come in half height and full height card form factors and support 800GB to 3.2TB (full height) or 300GB to 1.2TB (half height) raw storage capacities.

OCZ also offers their Velo PCIe SSD series which are not bootable and as such, require an IO driver for each operating system. However, the Z-drive has more intelligence which provides a SCSI device and as such, can be used anywhere.

Naturally this comes at the price of additional hardware and overhead.   All of which could impact performance but given their specified IO rates, it doesn’t seem to be a problem.

It’s unclear how many other PCIe SSDs on the market today offer bootability, but it certainly puts these drives in a different class than previous generation PCIe SSDs, such as those available from FusionIO and other vendors, that require IO drivers.

MLC NAND

One concern with new Z-drives might be their use of MLC NAND technology.  Although OCZ’s press release said the new drives would be available in either SLC or MLC configurations, current Z-drive spec sheets only indicate MLC availability.

As  discussed previously (see eMLC & eSLC and STEC’s MLC posts), MLC supports less write endurance (program-erase and write cycles) than SLC NAND cells.  Normally the difference is on the order of 10X less before NAND cell erase/write failure.

I also noticed there was no write endurance specification on the spec sheet for the new Z-drives.  Possibly, at these capacities it may not matter but, in our view, a write endurance specification should be supplied for any SSD, and especially for enterprise class ones.

Z-drive series

OCZ offers two versions of their Z-drive, the R and C series, both of which offer the same capacities and high performance, but as far as I could tell the R series appears to have more enterprise class availability and functionality. Specifically, this drive has power fail protection for writes (capacitance power backup) as well as better SMART support (with “enterprise attributes”). Both seem to be missing from the C series drives.

We hope the enterprise attribute SMART support provides write endurance monitoring and reporting, but no definition of these attributes was easily findable.

Also, the R series power backup, called DataWrite Assurance Technology, would be a necessary component for any enterprise disk device.  It essentially keeps data that has been written to the device, but not yet to NAND, from disappearing during a power outage/failure.

Given the above, we would certainly opt for the R series drive in any enterprise configuration.

Storage system using Z-drives

Just consider what one can do with a gaggle of Z-drives in a standard storage system.  For example, with 5 Z-drives in a server, it could potentially support 2.5M IOPS and 14GB/sec of data transfer, with some resulting loss of performance due to front-end emulation.  Moreover, at 3.2TB per drive, even in a RAID5 4+1 configuration the storage system would support 12.8TB of user capacity. One could conceivably do away with any DRAM cache in such a system and still provide excellent performance.
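
The aggregate arithmetic checks out from the spec-sheet numbers (before any front-end emulation overhead):

```python
# 5 Z-drives, spec-sheet numbers from the announcement above
drives = 5
iops = drives * 500_000          # aggregate IOPS
read_bw = drives * 2.8           # aggregate read GB/sec
usable_tb = (drives - 1) * 3.2   # RAID5 4+1: one drive's worth to parity
print(iops, read_bw, usable_tb)
```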

What the cost for such a system would be is another question. But with MLC NAND it shouldn’t be too obscene.

On the other hand serviceability might be a concern as it would be difficult to swap out a failed drive (bad SSD/PCIe card) while continuing IO operations. This could be done with some special hardware but it’s typically not present in standard, off the shelf servers.

—-

All in all a very interesting announcement from OCZ.  The likelihood that a single server will need this sort of IO performance from a lone drive is not that high (except maybe for massive website front ends) but putting a bunch of these in a storage box is another matter.  Such a configuration would make one screaming storage system with minimal hardware changes and only a modest amount of software development.

Comments?

Intel’s 320 SSD “8MB problem”

Intel SSD 320_001 by TAKA@P.P.R.S (cc) (from Flickr)

Read a recent news item on Intel being able to replicate their 320 SSD 8MB problem that some customers have been complaining about.

Apparently the problem occurs when power is suddenly removed from the device.  The end result is that the SSD’s capacity is restricted to 8MB from 40GB or more.

I have seen these sorts of problems before.  It probably has something to do with table updating activity associated with SSD wear leveling.

Wear leveling

NAND wear leveling looks very similar to virtual memory addressing and maps storage block addresses to physical NAND pages. Essentially something similar to a dynamic memory page table is maintained that shows where the current block is located in the physical NAND space, if present.  Typically, there are multiple tables involved, one for spare pages, another for mapping current block addresses to NAND page location and offset, one for used pages, etc.  All these tables have to be in some sort of non-volatile storage so they persist after power is removed.
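
A minimal sketch of that logical-to-physical mapping follows; the names and structure are illustrative, not any vendor’s actual flash translation layer:

```python
# Toy flash translation layer: NAND pages can't be overwritten in
# place, so every write lands on a fresh page and the map is updated.
class FlashTranslationLayer:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))  # spare-page table
        self.map = {}       # logical block address -> physical page
        self.pages = {}     # physical page -> data (the NAND itself)
        self.garbage = set()  # stale pages awaiting erase

    def write(self, lba, data):
        page = self.free_pages.pop(0)   # always a fresh page
        self.pages[page] = data
        old = self.map.get(lba)
        if old is not None:
            self.garbage.add(old)       # old copy becomes garbage
        self.map[lba] = page            # this update must persist safely

ftl = FlashTranslationLayer(num_pages=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")                     # rewrite lands on a new page
print(ftl.map[0], sorted(ftl.garbage))  # -> 1 [0]
```

Losing power in the middle of that last map update is exactly the kind of window where an un-careful implementation can corrupt its tables.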

Updating such tables and maintaining their integrity is a difficult endeavor.  More than likely some sort of table update is not occurring in an ACID fashion.

Intel’s fix

Intel has replicated the problem and promises a firmware fix. In my experience this is entirely possible.  Most probably customer data has not been lost (although this is not a certainty), it’s just not accessible at the moment. And Intel has reminded everyone that, as with any storage device, periodic backups to other devices are in order; SSDs are no exception.

I am certain that Intel and others are enhancing their verification and validation (V&V) activities to better probe and test the logic behind wear leveling fault tolerance, at least with respect to power loss. Of course, redesigning the table update algorithm to be more robust, reliable, and fault tolerant is a long range solution to these sorts of problems but may take longer than a just a bug fix.
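
For what it’s worth, one common way to make such a table update power-fail safe is a shadow copy plus an atomic swap, so the old table stays valid until the new one is completely on media. A hedged sketch at the file level, not Intel’s actual fix:

```python
# Shadow-copy update: write the new table to a temp file, force it to
# media, then atomically swap it into place. A crash at any point
# leaves either the complete old table or the complete new one.
import json, os, tempfile

def save_table_atomically(table, path):
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as f:
        json.dump(table, f)
        f.flush()
        os.fsync(f.fileno())   # force the shadow copy to media
    os.replace(tmp, path)      # atomic swap: old or new, never half

save_table_atomically({"lba0": 17, "lba1": 42}, "ftl_map.json")
print(json.load(open("ftl_map.json")))  # -> {'lba0': 17, 'lba1': 42}
```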

The curse of complexity

But all this raises a critical question: as one puts more and more complexity outboard into the drive, are we inducing more risk?

It’s a perennial problem in the IT industry. Software bugs are highly correlated to complexity and thereby, are ubiquitous, difficult to eliminate entirely, and often escape any and all efforts to eradicate them before customer shipments.  However, we can all get better at reducing bugs, i.e., we can make them less frequent, less impactful, and less visible.

What about disks?

All that being said, rotating media is not immune to the complexity problem. Disk drives have different sorts of complexity; e.g., their block addressing is mostly static and mapping updates occur much less frequently (for defect skipping) rather than constantly, whenever data is written, as with NAND.  As such, problems with power loss impacting table updates are less frequent and less severe with disks.  On the other hand, stiction, vibration, and HDI are all very serious problems for rotating media to which SSDs have a natural immunity.

—-
Any new technology brings both advantages and disadvantages with it.  NAND based SSD advantages include high speed, low power, and increased ruggedness but the disadvantages involve cost and complexity.  We can sometimes tradeoff cost against complexity but we cannot eliminate it entirely.

Moreover, while we cannot eliminate the complexity of NAND wear leveling today, we can always test it better.  That’s probably the most significant message coming out of today’s issue.  Any product SSD testing has to take into account the device’s intrinsic complexity and exercise that well, under adverse conditions. Power failure is just one example, I can think of dozens more.

Comments?

SolidFire supplies scale-out SSD storage for cloud service providers

SolidFire SF3010 node (c) 2011 SolidFire (from their website)

I was talking with a local start up called SolidFire the other day with an interesting twist on SSD storage.  They were targeting cloud service providers with a scale-out, cluster based SSD iSCSI storage system.  Apparently a portion of their team had come from Lefthand (now owned by HP) another local storage company and the rest came from Rackspace, a national cloud service provider.

The hardware

Their storage system is a scale-out cluster of storage nodes that can range from 3 to a theoretical maximum of 100 nodes (validated node count?). Each node comes equipped with two 2.4GHz 6-core Intel processors and ten 300GB SSDs for a total of 3TB of raw storage per node.  Each node also has 8GB of non-volatile DRAM for write buffering and 72GB of read cache.

The system also uses two 10GbE links for host-to-storage IO and inter-cluster communications and supports iSCSI LUNs.  There are another two 1GbE links used for management communications.

SolidFire states that they can sustain 50K IOPS per node. (This looks conservative from my viewpoint, but they didn’t state any specific R:W ratio or block size for this performance number.)

The software

They are targeting cloud service providers and as such the management interface was designed from the start as a RESTful API but they also have a web GUI built out of their API.  Cloud service providers will automate whatever they can and having a RESTful API seems like the right choice.

QoS and data reliability

The cluster supports 100K iSCSI LUNs and each LUN can have a QoS SLA associated with it.  According to SolidFire one can specify a minimum/maximum/burst level for IOPS and a maximum or burst level for throughput at a LUN granularity.

With LUN based QoS, one can divide cluster performance into many levels of support for the multiple customers of a cloud provider.  Given these unique QoS capabilities, it should be relatively easy for cloud providers to support multiple customers on the same storage, providing very fine grained multi-tenancy capabilities.
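
As a hedged sketch of how per-LUN IOPS shaping might work, here is the textbook token-bucket approach; SolidFire’s actual min/max/burst implementation is surely more involved, and all names here are illustrative:

```python
# Per-LUN token bucket: tokens refill at the max-IOPS rate, the bucket
# depth sets the allowed burst, and each IO spends one token.
class LunThrottle:
    def __init__(self, max_iops, burst):
        self.rate = max_iops        # tokens added per second
        self.burst = burst          # bucket depth (burst allowance)
        self.tokens = float(burst)

    def tick(self, seconds):        # called from a clock to refill
        self.tokens = min(self.burst, self.tokens + self.rate * seconds)

    def admit(self):                # one IO costs one token
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # IO queued until tokens refill

lun = LunThrottle(max_iops=500, burst=5)
allowed = sum(lun.admit() for _ in range(10))
print(allowed)  # -> 5 (burst drained; the rest wait for refill)
```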

This could potentially lead to system over-commitment, but presumably they have some way to detect when over-commitment is near and prevent it from occurring.

Data reliability is supplied through replication across nodes which they call Helix(tm) data protection.  In this way if an SSD or node fails, it’s relatively easy to reconstruct the lost data onto another node’s SSD storage.  Which is probably why the minimum number of nodes per cluster is set at 3.

Storage efficiency

Aside from the QoS capabilities, the other interesting twist from a customer perspective is that they are trying to price an all-SSD storage solution at the $/GB of normal enterprise disk storage. They believe their node with 3TB raw SSD storage supports 12TB of “effective” data storage.

They are able to do this by offering storage efficiency features of enterprise storage using an all SSD configuration. Specifically they provide,

  • Thin provisioned storage – which allows physical storage to be over-subscribed and used to support multiple LUNs whose space hasn’t been completely written.
  • Data compression – which searches for underlying redundancy in a chunk of data and compresses it out of the storage.
  • Data deduplication – which searches multiple blocks and multiple LUNs to see what data is duplicated and eliminates duplicative data across blocks and LUNs.
  • Space efficient snapshot and cloning – which allows users to take point-in-time copies which consume little space useful for backups and test-dev requirements.

Tape data compression gets anywhere from 2:1 to 3:1 reduction in storage space for typical data loads. Whether SolidFire’s system can reach these numbers is another question.  However, tape uses hardware compression, and the traditional problem with software data compression is that it takes lots of processing power and/or time to perform well.  As such, SolidFire has configured their node hardware to dedicate a CPU core to each physical drive (two 6-core processors for the 10 SSDs in a node).
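
The CPU-versus-ratio tradeoff of software compression is easy to demonstrate with Python’s zlib on a highly redundant buffer (illustrative data, not a real workload); higher levels spend more CPU chasing a better ratio:

```python
# Compress a redundant 64KB chunk at several effort levels and
# report the resulting compression ratio.
import zlib

chunk = (b"customer_record:" + b"0" * 48) * 1024   # 64KB, very redundant
for level in (1, 6, 9):
    compressed = zlib.compress(chunk, level)
    print(level, round(len(chunk) / len(compressed), 1), ": 1")
```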

Deduplication savings are somewhat trickier to predict but ultimately depends on the data being stored in a system and the algorithm used to deduplicate it.  For user home directories, typical deduplication levels of 25-40% are readily attainable.  SolidFire stated that their deduplication algorithm is their own patented design and uses a small fixed block approach.
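
SolidFire’s patented algorithm isn’t public, but the small fixed-block idea can be sketched simply (all names illustrative): fingerprint each fixed-size block and store only the unique ones, keeping a reference map per volume:

```python
# Fixed-block dedup sketch: hash each 4KB block; duplicates are
# recorded in the volume map but stored only once.
import hashlib

BLOCK = 4096
store = {}        # fingerprint -> block data (written once)
volume_map = []   # logical block index -> fingerprint

def write_blocks(data):
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # duplicate blocks stored once
        volume_map.append(fp)

write_blocks(b"A" * BLOCK * 3 + b"B" * BLOCK)   # 4 logical blocks
print(len(volume_map), "logical,", len(store), "unique")  # -> 4 logical, 2 unique
```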

The savings from thin provisioning ultimately depends on how much physical data is actually consumed on a storage LUN but in typical environments can save 10-30% of physical storage by pooling non-written or free storage across all the LUNs configured on a storage system.

Space savings from point-in-time copies like snapshots and clones depends on data change rates and how long it’s been since a copy was made.  But, with space efficient copies and a short period of existence, (used for backups or temporary copies in test-development environments) such copies should take little physical storage.

Whether all of this can create a 4:1 multiplier for raw to effective data storage is another question, but they also have an eScanner tool which can estimate the savings one could achieve in their data center. Apparently the eScanner can be used by anyone to scan real customer LUNs, and it will compute how much SolidFire storage would be required to support the scanned volumes.

—–

There are a few items left on their current road map to be delivered later, namely remote replication or mirroring. But for now this looks to be a pretty complete package of iSCSI storage functionality.

SolidFire is currently signing up customers for Early Access but plan to go GA sometime around the end of the year. No pricing was disclosed at this time.

I was at SNIA’s BoD meeting the other week and stated my belief that SSDs will ultimately lead to the commoditization of storage.  By that I meant that it would be relatively easy to configure enough SSD hardware to create a 100K IO/sec or 1GB/sec system without having to manage 1000 disk drives.  Lo and behold, SolidFire comes out the next week.  Of course, I said this would happen over the next decade – so I am off by 9.99 years…

Comments?