SSD shipments start to take off

3 rack V-Max storage subsystem from EMC

I was on an analyst call today where Bob Wambach of EMC discussed their recent success with V-Max, the newest version of their highly successful Symmetrix storage subsystem. More interesting was their announcement of having sold 1PB of enterprise flash storage on Symmetrix and almost 2PB total across all EMC product lines in 1H09. Symmetrix SSD shipments include both DMX and V-Max installs. During 1H09, EMC shipped both 146GB and 400GB SSDs, so it's hard to translate this capacity into a drive count, but at 146GB per drive, 1PB of Symmetrix SSD works out to around 6.9K SSDs, and the full 2PB to a maximum of ~14K drives.

SSD drive shipments vs. hard drives

To put this in perspective, ~540M hard drives were shipped in 2008, and with a ~7% decline in 2009 this should equate to around 502M drive shipments in 2009. But this includes all drives, and if 15-20% of these were data center storage, then ~75 to ~100M data center hard drives will be shipped in '09. Looking at just the first half, probably close to 40% of the whole year, ~30-40M data center hard drives were shipped across the industry in 1H09. In Q2'09 EMC had a 22.4% storage revenue market share; using this market share for all of 1H09, and assuming revenue correlates with drive shipments, they probably shipped ~7.8M data center hard drives during 1H09. Hence, 14K SSDs represents a very small but growing proportion (<0.2%) of all drives sold by EMC.
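For the curious, here's the back-of-envelope arithmetic above as a quick sketch; all of the inputs are the rough estimates used in this post, not vendor-reported unit counts:

```python
# Back-of-envelope check of the drive shipment estimates above. All figures
# are the rough estimates quoted in this post, not vendor-reported numbers.

hdd_2008 = 540e6                      # ~540M hard drives shipped in 2008
hdd_2009 = hdd_2008 * (1 - 0.07)      # ~7% decline => ~502M in 2009

dc_low, dc_high = hdd_2009 * 0.15, hdd_2009 * 0.20   # 15-20% are data center drives
h1_low, h1_high = dc_low * 0.40, dc_high * 0.40      # 1H09 assumed ~40% of the year

emc_share = 0.224                     # EMC Q2'09 revenue share, applied to units
emc_hdds_1h09 = (h1_low + h1_high) / 2 * emc_share   # midpoint => roughly 7.8-7.9M drives

ssd_units = 14_000                    # upper bound on EMC SSDs shipped in 1H09
print(f"EMC data center HDDs, 1H09: ~{emc_hdds_1h09 / 1e6:.1f}M")
print(f"SSDs as share of EMC drives: ~{ssd_units / emc_hdds_1h09:.2%}")
```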

Of course this is just the start

On the analyst call today EMC provided a couple of examples of recent SSD installations. In one example, a customer was looking at a US$3M mainframe upgrade but instead went with a $500K SSD upgrade. EMC examined the customer's current storage, identified the hottest, most active LUNs and converted these to SSDs. Doing so solved the customer's performance problems and allowed them to defer the mainframe upgrade.

Data center access patterns

Statistics from a year-long EMC study analyzing detailed IO traces from around 600 data centers show that over 60% of data center data is infrequently accessed. EMC believes such data is best left on high capacity SATA drives. As for the rest, it wouldn't surprise me if 15-20% is accessed frequently enough to reside on SSD/flash drives for improved performance, with the remaining 20-25% probably best served today by FC drives.

Today, EMC goes through a manual analysis to identify which data to place on SSDs, but in the near future their FAST (Fully Automated Storage Tiering) software will be able to migrate data to the right performing storage tier automatically. With FAST in place, supposedly one only needs to upgrade to SSDs and turn FAST on; after that it will analyze your data reference patterns and automatically move your performance critical data to SSDs.
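To make the idea concrete, here's a toy sketch of what an automated tiering decision could look like: rank extents by observed IO rate and map the hottest ~15% to SSD, the coldest ~60% to SATA, and the rest to FC. This is purely illustrative and is not EMC's FAST algorithm; the thresholds simply mirror the access-pattern estimates above:

```python
# Toy tiering policy: hottest extents on SSD, coldest on SATA, rest on FC.
# Illustrative only; NOT EMC's FAST algorithm.

def assign_tiers(extent_iops, ssd_frac=0.15, sata_frac=0.60):
    """extent_iops: mapping of extent/LUN name -> observed IO rate."""
    ranked = sorted(extent_iops, key=extent_iops.get, reverse=True)
    n = len(ranked)
    ssd_cut = max(1, round(n * ssd_frac))   # hottest extents -> SSD
    fc_cut = n - round(n * sata_frac)       # coldest extents -> SATA
    return {
        ext: ("SSD" if i < ssd_cut else "FC" if i < fc_cut else "SATA")
        for i, ext in enumerate(ranked)
    }

print(assign_tiers({"lun0": 5000, "lun1": 40, "lun2": 900, "lun3": 5, "lun4": 2500}))
```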

The coming SSD world

So, SSDs are starting to be adopted by organizations both large and small. Perhaps current SSD drive shipments are insignificant compared to hard drives, but given today's realities of data use there seems no reason that SSD adoption can't accelerate and someday claim 10% or more of all data center drive shipments. At today's numbers, that would mean almost 10 million SSDs shipped each year.

Why is SSD performance a mystery?

SSDs! :) by gimpbully (cc) (from flickr)

SSD and/or SSS (solid state storage) performance is a mystery to most end-users. The technology is inherently asymmetrical, i.e., it reads much faster than it writes. I have written on some of these topics before (STEC’s new MLC drive, Toshiba’s MLC flash, Tape V Disk V SSD V RAM) but the issue is much more complex when you put these devices behind storage subsystems or in client servers.

Some items that need to be considered when measuring SSD/SSS performance include:

  • Is this a new or used SSD?
  • What R:W ratio will we use?
  • What blocksize should be used?
  • Do we use sequential or random I/O?
  • What block inter-reference interval should be used?

This list is necessarily incomplete but it’s representative of the sort of things that should be considered to measure SSD/SSS performance.

New device or pre-conditioned

Hard drives show little performance difference whether new or pre-owned, defect skips notwithstanding. In contrast, SSDs/SSSs can perform very differently when new versus after a short period of use, depending on their internal architecture. A new SSD can write without erasure throughout its entire memory address space, but sooner or later wear leveling must kick in to equalize the use of the device's NAND memory blocks. Wear leveling causes both reads and rewrites of data during its processing. Such activity takes bandwidth and controller processing away from normal IO. If you have a new device it may take days or weeks of activity (depending on how fast you write) to attain the device's steady state, where each write causes some sort of wear leveling activity.
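To illustrate what steady state looks like, here's a greatly simplified wear-leveling sketch: each write is steered to the least-worn erased block, and once every logical block has been written once, each new write forces an erase somewhere, stealing bandwidth from host IO. The structure is illustrative only, not any vendor's actual flash translation layer:

```python
# Greatly simplified wear-leveling sketch. Real FTLs track pages, garbage
# collect, and do much more; this only shows the steady-state effect.

class ToyFTL:
    def __init__(self, nblocks):
        self.erase_count = [0] * nblocks   # wear per physical block
        self.free = set(range(nblocks))    # erased, ready-to-program blocks
        self.map = {}                      # logical block -> physical block

    def write(self, logical):
        # Wear-leveling decision: program the least-worn erased block.
        target = min(self.free, key=lambda b: self.erase_count[b])
        self.free.remove(target)
        old = self.map.get(logical)
        self.map[logical] = target
        if old is not None:
            # Steady state: every overwrite forces an erase somewhere,
            # taking bandwidth away from host IO.
            self.erase_count[old] += 1
            self.free.add(old)

ftl = ToyFTL(nblocks=8)
for i in range(100):
    ftl.write(i % 4)          # hammer just 4 logical blocks
print(ftl.erase_count)        # wear ends up spread across all 8 physical blocks
```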

R:W Ratio

Historically, hard drives have had slightly slower write seeks than reads, due to the need to be more accurately positioned to write data than to read it. As such, it might take 0.5msec longer to write than to read 4K bytes. For SSDs the problem is much more acute, e.g., read times can be in microseconds while write times approach milliseconds for some SSDs/SSSs. This is due to the nature of NAND flash: a block must be erased before it can be programmed (written), and the programming process takes a lot longer than a read.

So the question for measuring SSD performance is what read to write (R:W) ratio to use. Historically an R:W ratio of 2:1 was used to simulate enterprise environments, but most devices are starting to see more like 1:1 for enterprise applications due to the caching and buffering provided by controllers and host memory. I can't speak as well for desktop environments, but it wouldn't surprise me to see 2:1 used to simulate desktop workloads as well.

SSDs operate a lot faster with a 1000:1 workload than a 1:1 workload. Most SSD data sheets tout a significant read I/O rate, but only for 100% read workloads. This is like a subsystem vendor quoting 100% read cache hit performance (which some do), a situation that is unusual in the real world of storage.
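To see how much the R:W mix moves the headline number, here's a tiny model where the blended IO rate is just the reciprocal of the weighted average service time; the 100µs read and 900µs write latencies are made-up round numbers, not any vendor's spec:

```python
# Blended IO rate for a mixed read/write workload. Latencies are
# illustrative round numbers, not measured or vendor-quoted figures.

def blended_iops(read_lat_us, write_lat_us, read_fraction):
    """Average service time is the weighted mean of read and write latency;
    the IO rate is its reciprocal (one outstanding IO, no parallelism)."""
    avg_lat_us = read_fraction * read_lat_us + (1 - read_fraction) * write_lat_us
    return 1e6 / avg_lat_us

for label, frac in [("100% read", 1.0), ("2:1 R:W", 2 / 3), ("1:1 R:W", 0.5)]:
    print(f"{label:>9}: ~{blended_iops(100, 900, frac):,.0f} IOPS")
```

Even a one-third write fraction drops the quoted 100%-read number by roughly a factor of four in this toy model, which is why the R:W ratio has to be pinned down before comparing devices.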

Blocksize to use

Hard drives are not insensitive to blocksize, as blocks can potentially span tracks, requiring track-to-track seeks to be read or written. But SSDs can also interact adversely with varying blocksizes. This depends on the internal SSD architecture and stems from how aggressively write performance has been optimized.

With an SSD, you erase a block of NAND and write a page or sector of NAND at a time. As writes take much longer than reads, many SSD vendors add parallelism to improve write throughput, programming multiple sectors at the same time. Thus, if your blocksize is an integral multiple of that multi-sector size, write performance is great; if not, performance can suffer.
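A small sketch of the multi-sector programming point: if the host blocksize is an integral multiple of the internal programming stripe and the write is aligned, it maps onto whole stripes; misaligned or odd-sized writes straddle extra stripes and cost extra internal work. The 32KB stripe size is purely an assumption for illustration:

```python
# Blocksize alignment against an SSD's internal programming stripe
# (multiple sectors programmed in parallel). The 32KB stripe is assumed.

STRIPE = 32 * 1024   # bytes programmed in parallel (assumed figure)

def stripes_touched(offset, blocksize):
    """How many internal stripes a single host write touches."""
    first = offset // STRIPE
    last = (offset + blocksize - 1) // STRIPE
    return last - first + 1

# An aligned 64KB write touches exactly 2 stripes...
print(stripes_touched(offset=0, blocksize=64 * 1024))        # -> 2
# ...while the same write shifted by 4KB straddles 3 stripes,
# turning one host write into extra internal work.
print(stripes_touched(offset=4 * 1024, blocksize=64 * 1024)) # -> 3
```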

In all honesty, similar issues exist with hard drive sector sizes. If your blocksize is an integral multiple of the drive sector size then performance is great; if not, performance suffers. In contrast to SSDs, though, drive sector size is often configurable at the device level.

Sequential vs. random IO

Hard drives perform sequential IO much better than random IO. For SSDs this is not much of a problem, as once wear leveling kicks in, it’s all random to the NAND flash. So when comparing hard drives to SSDs the level of sequentiality is a critical parameter to control.

Cache hit rate

The block inter-reference interval simply measures how often the same block is re-referenced. This is important for caching devices and systems because it ultimately determines the cache hit rate (reading data directly from cache instead of the device storage). Hard drives have onboard caches of 8 to 32MB today. SSDs also have a DRAM cache for data buffering and other uses. SSDs typically publicize their cache size, so in order to ensure 0 cache hits one needs a block inter-reference interval close to the device's capacity. Not a problem today with 146GB devices, but as they move to 300GB and larger it becomes more of a problem to completely characterize device performance.
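To see why the inter-reference interval has to approach the device capacity, here's a minimal sketch that generates an IO stream guaranteed to get zero cache hits: it cycles over a region far larger than the onboard cache, so every block is evicted long before it's touched again (the 32MB cache and 4KB block are assumed figures):

```python
# Sketch: a block access stream whose inter-reference interval exceeds the
# device cache, so no request can ever be a cache hit. Cache and block
# sizes are assumptions for illustration.

BLOCK = 4 * 1024                # 4KB blocks
CACHE = 32 * 1024 * 1024        # assume a 32MB onboard cache

def no_hit_stream(device_bytes, n_ios):
    """Cycle sequentially over a region bigger than the cache so each
    block is evicted long before it is re-referenced."""
    region_blocks = device_bytes // BLOCK
    assert region_blocks * BLOCK > CACHE, "working set must exceed the cache"
    for i in range(n_ios):
        yield (i % region_blocks) * BLOCK   # byte offset of the next IO

# With a 146GB device the re-reference distance is ~146GB >> 32MB of cache.
print(list(no_hit_stream(device_bytes=146 * 10**9, n_ios=5)))
```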

The future

So how do we get a handle on SSD performance? SNIA and others are working on a specification on how to measure SSD performance that will one day become a standard. When the standard is available we will have benchmarks and service groups that can run these benchmarks to validate SSD vendor performance claims. Until then – caveat emptor.

Of course most end users would claim that device performance is not as important as (sub)system performance which is another matter entirely…

What’s happening with MRAM?

16Mb MRAM chips from Everspin

At the recent Flash Memory Summit there were a few announcements showing continued development of MRAM, a magnetism-based memory technology that can substitute for NAND or DRAM and has unlimited write cycles. My interest in MRAM stems from its potential use as a substitute storage technology for today's SSDs, which use SLC and MLC NAND flash memory with much more limited write cycles.

MRAM has the potential to replace NAND SSD technology because of its write speed (current prototypes write at 400MHz, i.e., a few nanoseconds) with the potential to go up to 1GHz. At 400MHz, MRAM is already much, much faster than today's NAND. And with no write limits, MRAM technology should be very appealing to most SSD vendors.

The problem with MRAM

The only problem is that current MRAM chips use 150nm chip design technology whereas today’s NAND ICs use 32nm chip design technology. All this means that current MRAM chips hold about 1/1000th the memory capacity of today’s NAND chips (16Mb MRAM from Everspin vs 16Gb NAND from multiple vendors). MRAM has to get on the same (chip) design node as NAND to make a significant play for storage intensive applications.

It's encouraging that somebody, at least, is starting to manufacture MRAM chips rather than leaving the technology in lab prototypes. From my perspective, it can only get better from here…

SSD vs Drive energy use

Hard Disk by Jeff Kubina

Recently, the Storage Performance Council (SPC) introduced a new benchmark series, SPC-1C/E, which provides detailed energy usage for storage subsystems. So far there have been only two published submissions in this category, but we look forward to seeing more in the future. The two submissions are for an IBM SSD and a Seagate Savvio (10Krpm) SAS attached storage subsystem.

My only issue with the SPC-1C/E reports is that they focus on nominal energy consumption rather than reporting peak and idle energy usage. I understand that nominal is probably closer to what an actual data center would see as energy cost, but it buries some intrinsic energy use profile differences.

SSD vs Drive power profile differences

The deltas for reported energy consumption in the two current SPC-1C/E submissions show a ~9.6% difference between peak and nominal energy use for rotating media storage. Similar results for the SSD storage show a difference of ~1.7%. Comparing peak versus idle periods, the difference is ~28.5% for rotating media and ~2.8% for SSD.
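For anyone who wants to reproduce these deltas, the math is just the relative spread between peak, nominal and idle wattage from the SPC-1C/E full disclosure reports. The watt values below are placeholders chosen to roughly reproduce the percentages quoted above, not the actual submission figures:

```python
# Power-profile spread from peak/nominal/idle wattage. The watt values
# are placeholders only; substitute the figures from the actual SPC-1C/E
# full disclosure reports.

def spread(peak_w, base_w):
    """Relative increase of peak power over a baseline (nominal or idle)."""
    return (peak_w - base_w) / base_w

profiles = {
    "rotating media": dict(peak=115.0, nominal=104.9, idle=89.5),
    "SSD":            dict(peak=61.0, nominal=60.0, idle=59.35),
}

for name, p in profiles.items():
    print(f"{name:>14}: peak vs nominal {spread(p['peak'], p['nominal']):.1%}, "
          f"peak vs idle {spread(p['peak'], p['idle']):.1%}")
```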

So, the upside for SSDs is that you can drive them as hard as you want and it will cost only a little more energy. The downside is that leaving them idle costs almost as much as driving them at peak IO rates.

Rotating media storage seems to have a much more responsive power profile. Drive them hard and it will consume more power, leave them idle and it consumes less power.

Data center view of storage power

Now these differences might not seem significant, but given the amount of storage in most shops they could represent significant cost differentials. Although SSD storage consumes less power, its energy use profile is significantly flatter than rotating media's and it will always consume about that level of power when powered on. Rotating media, on the other hand, consumes more power on average, but its power profile is more sloped than the SSD's, consuming much more power at peak workload than when idle.

Usually, it's unwise to generalize from two results. However, everything I know says that these differences in their respective power profiles should persist across other storage subsystem results. As more results are submitted, it should be easy to verify whether I am right.

STEC’s MLC enterprise SSD

So Many Choices by Robert S. Donovan

I haven't seen much of a specification on STEC's new enterprise MLC SSD but it should be interesting. So far everything I have seen seems to indicate that it's a pure MLC drive with no SLC NAND. This is difficult for me to believe but could easily be cleared up by STEC or their specifications. Most likely it's a hybrid SLC-MLC drive similar, at least from the NAND technology perspective, to FusionIO's SSD drive.

MLC write endurance issue

My difficulty with a pure MLC enterprise drive is the write endurance factor. MLC NAND can only endure around 10,000 erase/program passes before it starts losing data. With a hybrid SLC-MLC design one could have the heavy write data go to SLC NAND, which has a 100,000 erase/program pass lifecycle, and have the less heavily written data go to MLC. Sort of like a storage subsystem "fast write" which writes to cache first and then destages to disk, but in this case the destage may never happen if the data is written often enough.

The only flaw in this argument is that as SSD drives get bigger (STEC's drive is available in capacities up to 800GB) this becomes less of an issue, because with more raw storage, the small portion of data that is very actively written gets swamped by the plentiful storage available to hold it. As such, when one NAND cell gets close to its lifetime, another, younger cell can be used in its place. This process is called wear leveling. STEC's current SLC Zeus drive already has sophisticated wear leveling to deal with this sort of problem for SLC SSDs, and doing this for MLC just means having larger tables to work with.

I guess at some point, with multi-TB drives, the fact that MLC cannot sustain more than 10,000 erase/write passes becomes moot, because there just isn't that much actively written data out there in an enterprise shop. When you amortize the highly written data as a percentage of a drive, the more drive capacity, the smaller the active data percentage becomes. As such, as SSD drive capacities get larger this becomes less of an issue. I figure with 800GB drives the active data proportion might still be high enough to cause a problem, but then again it might not be an issue at all.

Of course, with MLC it's also cheaper to over-provision NAND storage to help with write endurance. For an 800GB MLC SSD, you could easily add another 160GB (20% over-provisioning) fairly cheaply. Over-provisioning thus allows the drive to sustain an overall write endurance much higher than that of the individual NAND cells.
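Here's a rough sketch of how capacity and over-provisioning turn the per-cell limit into a drive-level write budget; the 10,000 cycle and 20% over-provisioning figures come from this post, while the write-amplification factor and daily write rate are assumptions:

```python
# Rough drive-level write budget from per-cell endurance, capacity and
# over-provisioning. The 10,000 cycle and 20% over-provisioning figures are
# from the text; write amplification and daily writes are assumed numbers.

def lifetime_host_writes_tb(user_gb, overprov_frac, pe_cycles, write_amp=2.0):
    raw_gb = user_gb * (1 + overprov_frac)        # user capacity + spare area
    # Total program/erase capacity of the NAND, discounted by how much extra
    # internal writing the controller does per host write (write amplification).
    return raw_gb * pe_cycles / write_amp / 1000  # TB of host writes

tb = lifetime_host_writes_tb(user_gb=800, overprov_frac=0.20, pe_cycles=10_000)
years = tb * 1000 / (500 * 365)                   # at an assumed 500GB/day of host writes
print(f"~{tb:,.0f}TB of host writes, ~{years:.0f} years at 500GB/day")
```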

Another solution to the write endurance problem is to increase the power of ECC to handle write failures. This would probably take some additional engineering and may or may not be in the latest STEC MLC drive but it would make sense.

MLC performance

The other issue with MLC NAND is that it has slower read and erase/program cycle times. These are still orders of magnitude faster than standard disk but slower than SLC NAND. For enterprise applications SLC SSDs are blistering fast and are often performance limited by the subsystem they are attached to, so the fact that MLC SSDs are somewhat slower than SLC SSDs may not even be perceived by enterprise shops.

MLC performance is slower because it takes longer to read a cell holding multiple bits than one holding just one. MLC, in one technology I am aware of, encodes 2 bits in the voltage that is programmed into or read out of a cell, e.g., VoltageA = "00", VoltageB = "01", VoltageC = "10", and VoltageD = "11". This gets more complex with 3 or more bits per cell but the logic holds. With multiple voltages, determining which voltage level is present is more complex for MLC and hence takes longer to perform.
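A tiny sketch of that 2-bit encoding: four voltage levels map to the four bit patterns, so a read has to discriminate among more, closer-spaced levels than SLC's two. The voltage values here are arbitrary illustrations, not real device thresholds:

```python
# 2-bit MLC cell: four voltage levels encode four bit patterns.
# Voltage values are arbitrary illustrations, not real device thresholds.

LEVELS = [(0.5, "00"), (1.5, "01"), (2.5, "10"), (3.5, "11")]  # (volts, bits)

def read_cell(sensed_v):
    """Return the bit pattern whose nominal level is closest to the sensed
    voltage; more levels to discriminate means slower, touchier reads."""
    return min(LEVELS, key=lambda lv: abs(lv[0] - sensed_v))[1]

print(read_cell(1.4))   # -> '01'
print(read_cell(3.3))   # -> '11'
```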

In the end I would expect STEC's latest drive to be some sort of SLC-MLC hybrid, but I could be wrong. It's certainly possible that STEC has gone with a pure MLC drive and beefed up the capacity, over-provisioning, ECC, and wear leveling algorithms to handle its lack of write endurance.

MLC takes over the world

But the bigger factor pushing MLC into SSDs is that MLC technology is driving the NAND market. All those items in the photo above are most probably using MLC NAND, if not today then certainly tomorrow. As such, the consumer market will drive MLC NAND manufacturing volumes way above anything the SLC market requires. Such volumes will ultimately make it unaffordable to manufacture and use any other type of NAND, namely SLC, in most applications, including SSDs.

So sooner or later all SSDs will be using only MLC NAND technology. I guess the sooner we all learn to live with that the better for all of us.

Toshiba’s New MLC NAND Flash SSDs

Toshiba has recently announced a new series of SSDs based on MLC NAND (Yahoo Biz story). This is only the latest in a series of MLC SSDs that Toshiba has released.

Historically, MLC (multi-level cell) NAND has supported higher capacity but has been slower and less reliable than SLC (single-level cell) NAND. The capacity points supplied for the new drives (64, 128, 256, & 512GB) reflect the higher density NAND. Toshiba's performance numbers for the new drives also look appealing but are probably overkill for most desktop/notebook/netbook users.

Toshiba's reliability specifications were not listed in the Yahoo story and would probably be hard to find elsewhere (I looked on the Toshiba America website and couldn't locate any). However, the duty cycle for a desktop/notebook data drive is not that severe, so the fact that MLC can only endure ~1/10th the writes that SLC can endure is probably not much of an issue.

SNIA is working on SSD (or SSS as SNIA calls it, see the SNIA SSSI forum website) reliability but has yet to publish anything externally. I am unsure whether they will break out MLC vs. SLC drives, but it's certainly worthy of discussion.

But the advantage of MLC NAND SSDs is that they should be 2 to 4X cheaper than SLC SSDs, depending on the number (2, 3 or 4) of bits/cell, and as such, more affordable. This advantage can be reduced by the need to over-provision the device and add more parallelism in order to improve MLC reliability and performance. But both of these facilities are becoming more commonplace and so should be relatively straightforward to support in an SSD.
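As a rough illustration of that 2 to 4X claim, here's a toy $/GB calculation where cost scales with cell count and over-provisioning eats back part of the advantage; the SLC baseline price and over-provisioning fraction are assumptions:

```python
# Toy $/GB comparison: with the same number of NAND cells, an n-bit-per-cell
# part stores n times the data, so cost per GB falls roughly with bits/cell,
# partially offset by raw capacity reserved for over-provisioning.
# The SLC baseline price and over-provisioning fraction are assumptions.

SLC_COST_PER_GB = 8.0                             # assumed SLC baseline, $/GB

def mlc_cost_per_gb(bits_per_cell, overprov_frac=0.20):
    per_gb = SLC_COST_PER_GB / bits_per_cell      # more bits/cell, cheaper GB
    return per_gb * (1 + overprov_frac)           # pay for the reserved raw space

for bpc in (2, 3, 4):
    cost = mlc_cost_per_gb(bpc)
    print(f"{bpc} bits/cell: ~${cost:.2f}/GB, ~{SLC_COST_PER_GB / cost:.1f}X cheaper than SLC")
```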

The question remains, given the reliability differences, whether MLC NAND will ever become reliable enough for enterprise class SSDs. Many vendors make MLC NAND SSDs for the notebook/desktop market (Intel, SanDisk, Samsung, etc.), but FusionIO is probably one of the few using a combination of SLC and MLC NAND for enterprise class storage (see FusionIO press release), although calling the FusionIO device an SSD is probably a misnomer. What FusionIO does to moderate MLC endurance issues is not clear, but buffering write data to SLC NAND must certainly play some part.

Testing storage systems – Intel’s SSD fix

Intel's latest (34nm NAND) SSD shipments were halted today because a problem was identified when modifying BIOS passwords (see IT PRO story). At least they gave a timeframe for a fix: a couple of weeks.

The real question is whether products can be tested sufficiently these days to ensure they work in the field. Many companies today will ship product to end-user beta testers to work out the bugs before the product reaches the field. But beta testing has to be complemented with active product testing and validation. Unless you plan to get 100s or perhaps 1000s of beta testers, you could have a serious problem with field deployment.

And therein lies the problem: software products are relatively cheap and easy to beta test; just set up a download site and have at it. But with hardware products, beta testing involves sending product to end-users, which costs quite a bit more to support. So I understand why Intel might be having problems with field deployment.

So if you can't beta test hardware products as easily as software, then you have to have a much more effective test process. Functional testing and validation is more of an art than a science and can cost significant money and, more importantly, time. All of which brings us back to some form of beta testing.

Perhaps Intel could use their own employees as beta testers, rotating new hardware products from one organization to another over time to get some variability in the use of a new product. Many companies use their new hardware extensively in their own data centers to validate functionality prior to shipment. In the case of Intel's SSDs, these could be placed in the innumerable servers/desktops that Intel no doubt has throughout its corporation.

One can argue whether beta testing takes longer than extensive functional testing. However given today’s diverse deployments, I believe beta testing can be a more cost effective process when done well.

Intel is probably trying to figure out just what went wrong in their overall testing process today. I am sure, given their history, they will do better next time.

Tape v Disk v SSD v RAM

There was a time not long ago when the title of this post wouldn’t have included SSD. But, with the history of the last couple of years, SSD has earned its right to be included.

A couple of years back I was at a Rocky Mountain Magnetics Seminar (see IEEE magnetics societies) and a disk drive technologist stated that Disks have about another 25 years of technology roadmap ahead of them. During this time they will continue to increase density, throughput and other performance metrics. After 25 years of this they will run up against some theoretical limits which will halt further density progress.

At the same seminar, the presenter said that Tape was lagging Disk technology by about 5-10 years or so. As such, tape should continue to advance for another 5-10 years after disk stops improving at which time tape would also stop increasing density.

Does all this mean the end of tape and disk? I think not. Paper stopped advancing in density about 2,000 to 3,000 years ago (the papyrus scroll was the ultimate in paper "rotating media"). If we move up to the codex or book form, which in my view is a form factor advance, this took place around 400AD (see history of scroll and codex). The paperback, another form factor advance, took place in the early 20th century (see paperback history).

Turning now to write performance, moveable type was a significant paper (write) performance improvement and started in the mid 15th century. The printing press would go on to improve (paper write) performance for the next six centuries (see printing press history) and continues today.

All this indicates that a data technology whose density was capped over 2,000 years ago can continue to advance and support valuable activity in today's world and for the foreseeable future. "Will disk and tape go away?" is the wrong question; the right question is "can disk or tape, after SSDs reach price equivalence on a $/GB basis, still be useful to the world?"

I think yes, but that depends on how the relative SSD, disk and tape technologies advance. Assuming someday all these technologies support equivalent Tb/sq.in. or spatial density, and

  • SSDs retain their relative advantage in random access speed,
  • Tape retains its advantage in sequential throughput, volumetric density, and long media life, and
  • Disk retains its all-around, combined sequential and random access advantage,

It seems likely that each can sustain some niche in the data center/home office of tomorrow, although probably not where they are today.

One can see trends being enacted in enterprise data centers today that are altering the relative positioning of SSDs, disks and tape. Tape is being relegated to long-term, archive storage; Disk is moving to medium-term, secondary storage; and SSDs are replacing top-tier, primary storage.

More thoughts on this in future posts.