SCI’s latest SPC-1&-1/E LRT results – chart of the month

(c) 2010 Silverton Consulting, Inc., All Rights Reserved

It’s been a while since we reported on Storage Performance Council (SPC) Least Response Time (LRT) results (see Chart of the month: SPC LRT[TM]).  This is one of the charts we produce for our monthly dispatch on storage performance (quarterly report on SPC results).

Since our last blog post on this subject there have been 6 new entries in the LRT Top 10 (#3-6 & 9-10).  As can be seen here, in a chart which combines SPC-1 and SPC-1/E results, response times vary considerably.  7 of these top 10 LRT results come from subsystems which either have all SSDs (#1-4, 7 & 9) or have a large NAND cache (#5).    The newest members on this chart are the NetApp FAS3270A and the Xiotech Emprise 5000 with 300GB disk drives, both of which were published recently.

The NetApp FAS3270A, a mid-range subsystem with 1TB of NAND cache (512GB in each controller), did pretty well here, with only all-SSD systems doing better than it and a pair of all-SSD systems doing worse than it.  Coming in under 1msec LRT is no small feat.  We are certain the NAND cache helped NetApp achieve their superior responsiveness.

What the Xiotech Emprise 5000-300GB storage subsystem is doing here is another question.  They have always done well on an IOPs/drive basis (see SPC-1&-1/E results IOPs/Drive – chart of the month) but being top ten in LRT had not previously been their forte.  How one coaxes a 1.47 msec LRT out of a 20 drive system that costs only ~$41K, 12X lower than the median price (~$509K) of the other subsystems here, is a mystery.  Of course, they were using RAID 1 but so were half of the subsystems on this chart.

It’s nice to see some turnover in this top 10 LRT.  I still contend that response time is an important performance metric for many storage workloads (see my IO throughput vs. response time and why it matters post) and improvement over time validates my thesis.  Also, I received many comments discussing the merits of database latencies for ESRP v3 (Exchange 2010) results (see my Microsoft Exchange Performance ESRP v3.0 results – chart of the month post).  You can judge the results of that lengthy discussion for yourselves.

The full performance dispatch will be up on our website in a couple of weeks but if you are interested in seeing it sooner just sign up for our free monthly newsletter (see upper right) or subscribe by email and we will send you the current issue with download instructions for this and other reports.

As always, we welcome any constructive suggestions on how to improve our storage performance analysis.

Comments?

The future of data storage is MRAM

Core Memory by teclasorg

We have been discussing NAND technology for quite a while now but this month I ran across an article in IEEE Spectrum titled “a SPIN to REMEMBER – Spintronic memories to revolutionize data storage“. The article discussed a form of magneto-resistive random access memory or MRAM that uses quantum mechanical spin effects or spintronics to record data. We have talked about MRAM technology before and progress has been made since then.

Many in the industry will recall that current GMR (Giant Magneto-resistance) heads and next generation TMR (Tunnel magneto-resistance) disk read heads already make use of spintronics to detect magnetized bit values in disk media. GMR heads detect bit values on media by changing their electrical resistance.

Spintronics, however, can be used to record data as well as read it. These capabilities are being exploited in MRAM technology which uses a ferro-magnetic material to record data in magnetic spin alignment – spin up means 0; spin down means 1 (or vice versa).

The technologists claim that when MRAM reaches its full potential it could conceivably replace DRAM, SRAM, NAND, and hard disk drives, or all current electrical and magnetic data storage. Some of MRAM’s advantages include unlimited write passes, fast reads and writes and data non-volatility.

MRAM reminds me of old fashioned magnetic core memory (in photo above) which used magnetic polarity to record non-volatile data bits. Core was a memory mainstay in the early years of computing before the advent of semi-conductor devices like DRAM.

Back to future – MRAM

However, the problems with MRAM today are that it is low-density, takes lots of power and is very expensive. But technologists are working on all these problems with the view that the future of data storage will be MRAM. In fact, researchers at the North Carolina State University (NCSU) Electrical Engineering department have been having some success with reducing power requirements and increasing density.

As for data density, NCSU researchers now believe they can record data in cells approximating 20 nm across, better than current bit patterned media which is the next generation disk recording media. However, reading data out of such a small cell will prove to be difficult and may require a separate read head on top of each cell. The fact that all of this is created with normal silicon fabrication methods makes doing so at least feasible but the added chip costs may be hard to justify.

Regarding high power, their most recent design records data by electronically controlling the magnetism of a cell. They are using dilute magnetic semiconductor material doped with gallium manganese which can hold spin value alignment (see the article for more information). They are also using a semiconductor p-n junction on top of the MRAM cell. Apparently at the p-n junction they can control the magnetization of the MRAM cells by applying or removing -5 volts. Today the magnetization is temporary but they are working on solutions for this as well.

NCSU researchers would be the first to admit that none of this is ready for prime time and they have yet to demonstrate in the lab an MRAM memory device with 20nm cells, but the feeling is it’s all just a matter of time and lots of research.

Fortunately, NCSU has lots of help. It seems Freescale, Honeywell, IBM, Toshiba and Micron are also looking into MRAM technology and its applications.

—–

Let’s see, using electron spin alignment in a magnetic medium to record data bits, needs a read head to read out the spin values – couldn’t something like this be used in some sort of next generation disk drive that uses the ferromagnetic material as a recording medium? Hey, aren’t disks already using a ferromagnetic material for recording media? Could MRAM be fabricated/laid down as a form of magnetic disk media?? Maybe there’s life in disks yet….

What do you think?

What eMLC and eSLC do for SSD longevity

Enterprise NAND from Micron.com (c) 2010 Micron Technology, Inc.

I talked last week with some folks from Nimbus Data who were discussing their new storage subsystem.  Apparently it uses eMLC (enterprise Multi-Level Cell) NAND SSDs for its storage and has no SLC (Single Level Cell) NAND at all.

Nimbus believes that with eMLC they can keep the price/GB down and still supply the reliability required for data center storage applications.  I had never heard of eMLC before but later that week I was scheduled to meet with Texas Memory Systems and Micron Technology, who helped get me up to speed on this new technology.

eMLC/eSLC defined

eMLC and its cousin, eSLC, are high durability NAND parts which supply more erase/program cycles than generally available from MLC and SLC respectively.  If today’s NAND technology can supply 10K erase/program cycles for MLC and, similarly, 100K erase/program cycles for SLC, then eMLC can supply 30K.  I have never heard a quote for eSLC but 300K erase/program cycles before failure might be a good working assumption.

The problem is that NAND wears out, and can only sustain so many erase/program cycles before it fails.  By having more durable parts, one can either take the same technology parts (from MLC to eMLC) to use them longer or move to cheaper parts (from SLC to eMLC) to use them in new applications.

This is what Nimbus Data has done with eMLC.  Most data center class SSD or cache NAND storage these days is based on SLC. But SLC, with only one bit per cell, is very expensive storage.  MLC has two (or three) bits per cell and can easily halve the cost of SLC NAND storage.

Moreover, the consumer market which currently drives NAND manufacturing depends on MLC technology for cameras, video recorders, USB sticks, etc.  As such, MLC volumes are significantly higher than SLC and hence, the cost of manufacturing MLC parts is considerably cheaper.

But the historic problem with MLC NAND is the reduction in durability.  eMLC addresses that problem by lengthening the page programming (tProg) cycle which creates a better, more lasting data write, but slows write performance.

The fact that NAND technology already has ~5X faster random write performance than rotating media (hard disk drives) makes this slightly slower write rate less of an issue. If eMLC took this to only ~2.5X disk writes it still would be significantly faster.  Also, there are a number of architectural techniques for speeding up drive writes that can easily be incorporated into any eMLC SSD.

How long will SLC be around?

The industry view is that SLC will go away eventually and be replaced with some form of MLC technology because the consumer market uses MLC and drives NAND manufacturing.  The volumes for SLC technology will just be too low to entice manufacturers to support it, driving the price up and volumes even lower – creating a vicious cycle which kills off SLC technology.  Not sure how much I believe this, but that’s conventional wisdom.

The problem with this prognosis is that by all accounts the next generation MLC will be even less durable than today’s generation (not sure I understand why but as feature geometry shrinks, cells don’t hold charge as well).  So if today’s generation (25nm) MLC supports 10K erase/program cycles, most assume the next generation (~18nm) will only support 3K erase/program cycles. If eMLC can then still support 30K or even 10K erase/program cycles that will be a significant differentiator.
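To put those cycle counts in perspective, here is a rough sketch that turns erase/program cycles into total writable bytes. The cycle counts come from the discussion above; the 200GB capacity, the idealized write amplification of 1.0 and the function names are assumptions made purely for illustration, and real drives will do worse once write amplification is accounted for.

```python
# Rough endurance comparison using the erase/program cycle counts quoted above.
# Capacity and write amplification are hypothetical values for illustration.

CAPACITY_BYTES = 200 * 10**9  # hypothetical 200GB drive (decimal units)

PE_CYCLES = {
    "MLC (today)": 10_000,
    "MLC (next generation)": 3_000,
    "eMLC": 30_000,
    "SLC": 100_000,
}


def total_writable_bytes(capacity_bytes, pe_cycles, write_amplification=1.0):
    """Bytes a host could write before the NAND reaches its cycle limit."""
    return capacity_bytes * pe_cycles / write_amplification


def endurance_table(capacity_bytes=CAPACITY_BYTES, write_amplification=1.0):
    """Map each NAND type to its total writable bytes."""
    return {
        name: total_writable_bytes(capacity_bytes, cycles, write_amplification)
        for name, cycles in PE_CYCLES.items()
    }


if __name__ == "__main__":
    for name, total in endurance_table().items():
        print(f"{name:24s} {total / 10**15:6.1f} PB")
```

Under these assumptions eMLC at 30K cycles would offer ten times the write endurance of a 3K cycle next generation MLC part of the same capacity.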

—-

Technology marches on.  Something will replace hard disk drives over the next quarter century or so and that something is bound to be based on transistorized logic of some kind, not the magnetized media used in disks today. Given today’s technology trends, it’s unlikely that this will continue to be NAND but something else will most certainly crop up – stay tuned.

Anything I missed in this analysis?

SPC-1/E IOPS per watt – chart of the month

SPC*-1/E IOPs per Watt as of 27Aug2010

Not a lot of Storage Performance Council (SPC) benchmark submissions this past quarter, just a new SPC-1/E from HP StorageWorks on their 6400 EVA with SSDs and a new SPC-1 run for the Oracle Sun StorageTek 6780.  Recall that SPC-1/E executes all the same tests as SPC-1 but adds more testing with power monitoring equipment attached to measure power consumption at a number of performance levels.

With this chart we take another look at storage energy consumption (see my previous discussion on SSD vs. drive energy use). As shown above, we graph the IOPS/watt for three different performance environments: Nominal, Medium, and High as defined by SPC.  These are contrived storage usage workloads to measure the variability in power consumed by a subsystem.  SPC defines the workloads as follows:

  • Nominal usage is 16 hours of idle time and 8 hours of moderate activity
  • Medium usage is 6 hours of idle time, 14 hours of moderate activity, and 4 hours of heavy activity
  • High usage is 0 hours of idle time, 6 hours of moderate activity and 18 hours of heavy activity

As for activity, SPC defines moderate activity at 50% of the subsystem’s maximum SPC-1 reported performance and heavy activity is at 80% of its maximum performance.
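The usage profiles above can be turned into a small calculator. This is only a sketch: the hours per day and the 50%/80% load levels come from the definitions above, but treating IOPS/watt as the time-weighted average IOPS divided by the time-weighted average power is an assumption on my part, and the official SPC calculation may differ in detail. The wattages and the maximum IOPS in the usage example are made up.

```python
# Sketch of an IOPS/watt estimate for the three SPC-1/E usage profiles.
# Assumption: IOPS/watt = (24-hour average IOPS) / (24-hour average watts).

# Hours per day spent at (idle, moderate, heavy) activity.
PROFILES = {
    "nominal": (16, 8, 0),
    "medium": (6, 14, 4),
    "high": (0, 6, 18),
}

MODERATE_LOAD = 0.50  # fraction of maximum SPC-1 reported performance
HEAVY_LOAD = 0.80


def iops_per_watt(profile, max_iops, idle_w, moderate_w, heavy_w):
    """Estimate IOPS/watt for one usage profile from measured power levels."""
    idle_h, moderate_h, heavy_h = PROFILES[profile]
    avg_iops = (moderate_h * MODERATE_LOAD + heavy_h * HEAVY_LOAD) * max_iops / 24
    avg_watts = (idle_h * idle_w + moderate_h * moderate_w + heavy_h * heavy_w) / 24
    return avg_iops / avg_watts


if __name__ == "__main__":
    # Hypothetical subsystem: 20,000 IOPS max, 400W idle, 450W moderate, 500W heavy.
    for name in PROFILES:
        print(name, round(iops_per_watt(name, 20_000, 400, 450, 500), 2))
```

Because idle hours add watts but no IOPS, a subsystem with high idle power is hit hardest in the nominal profile, where two thirds of the day is idle.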

With that behind us, now on to the chart.  The HP 6400 EVA had eight 73GB SSDs for storage while the two Xiotech submissions had 146GB/15Krpm and 600GB/15Krpm drives with no flash.  As expected, the HP SSD subsystem delivered considerably more IOPS/watt at the high usage workload – ~2X the Xiotech with 600GB drives and ~2.3X the Xiotech with 146GB drives.  The multipliers were slightly less for medium usage but still substantial nonetheless.

SSD nominal usage power consumption

However, the nominal usage bears some explanation.  Here both Xiotech subsystems beat out the HP EVA SSD subsystem at nominal usage with the 600GB drive Xiotech box supporting ~1.3X the IOPS/watt of the HP SSD system. How can this be?  SSD idle power consumption is the culprit.

The HP EVA SSD subsystem consumed ~463.1W at idle while the Xiotech 600GB only consumed ~23.5W and the Xiotech 146GB drive subsystem consumed ~23.4W.  I would guess that the drives and perhaps the Xiotech subsystem have considerable power savings algorithms that shed power when idle.  For whatever reason the SSDs and HP EVA don’t seem to have anything like this.  So nominal usage with 16 hours of idle time penalizes the HP EVA SSD system, resulting in the poor IOPS/watt for nominal usage shown above.
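A quick sketch of what those idle numbers mean over the 16 idle hours of a nominal usage day. The idle wattages are the ones quoted above; the dictionary keys and function name are just labels chosen for this example.

```python
# Idle energy over the 16 idle hours of the nominal usage profile.
# Wattages are the idle figures quoted in the text.

IDLE_WATTS = {
    "HP EVA SSD": 463.1,
    "Xiotech 600GB": 23.5,
    "Xiotech 146GB": 23.4,
}

NOMINAL_IDLE_HOURS = 16


def idle_energy_wh(idle_watts, hours=NOMINAL_IDLE_HOURS):
    """Energy in watt-hours consumed while sitting idle."""
    return idle_watts * hours


if __name__ == "__main__":
    for name, watts in IDLE_WATTS.items():
        print(f"{name:14s} {idle_energy_wh(watts):8.1f} Wh idle per nominal day")
```

None of that idle energy buys any IOPS, which is why it weighs so heavily on the nominal usage result.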

Ray’s reading: SSDs are not meant to be idled a lot, and disk drives, especially the ones that Xiotech is using, have very sophisticated power management that maybe SSDs and/or HP should take a look at adopting.

The full SPC performance report will go up on SCI’s website next month in our dispatches directory.  However, if you are interested in receiving this sooner, just subscribe by email to our free newsletter and we will send you the current issue with download instructions for this and other reports.

As always, we welcome any suggestions on how to improve our analysis of SPC performance information so please comment here or drop us a line.

Micron’s new P300 SSD and SSD longevity

Micron P300 (c) 2010 Micron Technology

Micron just announced a new SSD drive based on their 34nm SLC NAND technology with some pretty impressive performance numbers.  They used an independent organization, Calypso SSD testing, to supply the performance numbers:

  • Random Read 44,000 IO/sec
  • Random Writes 16,000 IO/sec
  • Sequential Read 360MB/sec
  • Sequential Write 255MB/sec

Even more impressive considering this performance was generated using SATA 6Gb/s and measuring after reaching “SNIA test specification – steady state” (see my post on SNIA’s new SSD performance test specification).

The new SATA 6Gb/s interface is a bit of a gamble but one can always use an interposer to support FC or SAS interfaces.  In addition, today many storage subsystems already support SATA drives so the interface may not even be an issue.  The P300 can easily support 3Gb/s SATA if that’s what’s available; sequential performance will suffer but random IOPs won’t be too impacted by interface speed.

The advantage of SATA 6Gb/sec is that it’s a simple interface and it costs less to implement than SAS or FC.  The downside is the loss of performance until 6Gb/sec SATA takes over enterprise storage.

P300’s SSD longevity

I have done many posts discussing SSDs and their longevity or write endurance but this is the first time I have heard any vendor describe drive longevity using “total bytes written” to a drive. Presumably this is a new SSD write endurance standard coming out of JEDEC but I was unable to find any reference to the standard definition.

In any case, the P300 comes in 50GB, 100GB and 200GB capacities and the 200GB drive has a “total bytes written” to the drive capability of 3.5PB with the smaller versions having proportionally lower longevity specs. For the 200GB drive, it’s almost 5 years of 10 complete full drive writes a day, every day of the year.  This seems enough from my perspective to put any SSD longevity considerations to rest.  Although at 255MB/sec sequential writes, the P300 can actually sustain ~10X that rate per day – assuming you never read any data back??
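The longevity arithmetic above is easy to check with a short sketch. It assumes decimal units (1GB = 10^9 bytes, 1PB = 10^15 bytes) and 365-day years; the function names are made up for illustration, while the 3.5PB, 200GB and 255MB/sec figures are the ones quoted above.

```python
# Checking the P300 "total bytes written" arithmetic with decimal units.

TOTAL_BYTES_WRITTEN = 3.5e15   # 3.5PB rating for the 200GB drive
CAPACITY_BYTES = 200e9         # 200GB
SEQ_WRITE_BYTES_PER_SEC = 255e6  # 255MB/sec sequential write
SECONDS_PER_DAY = 86_400


def full_drive_writes(tbw, capacity):
    """How many times the whole drive can be written before hitting the rating."""
    return tbw / capacity


def years_at(tbw, capacity, drive_writes_per_day):
    """Years of life at a steady number of full drive writes per day."""
    return full_drive_writes(tbw, capacity) / drive_writes_per_day / 365


def max_drive_writes_per_day(bytes_per_sec, capacity):
    """Full drive writes per day if the drive wrote sequentially non-stop."""
    return bytes_per_sec * SECONDS_PER_DAY / capacity


if __name__ == "__main__":
    print(full_drive_writes(TOTAL_BYTES_WRITTEN, CAPACITY_BYTES))
    print(years_at(TOTAL_BYTES_WRITTEN, CAPACITY_BYTES, 10))
    print(max_drive_writes_per_day(SEQ_WRITE_BYTES_PER_SEC, CAPACITY_BYTES))
```

Under these assumptions the rating works out to 17,500 full drive writes, a bit under 5 years at 10 drive writes a day, and the sequential write rate could in principle push roughly 110 drive writes a day.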

I am sure over provisioning, wear leveling and other techniques were used to attain this longevity. Nonetheless, whatever they did, the SSD market could use more of it.  At this level of SSD longevity the P300 could almost be used in a backup dedupe appliance, if there was need for the performance.

You may recall that Micron and Intel have a joint venture to produce NAND chips.  But the joint venture doesn’t include applications of their NAND technology.  This is why Intel has their own SSD products and why Micron has started to introduce their own products as well.

—–

So which would you rather see for an SSD longevity specification:

  • Drive MTBF
  • Total bytes written to the drive,
  • Total number of Program/Erase cycles, or
  • Total drive lifetime, based on some (undefined) predicted write rate per day?

Personally I like total bytes written because it defines the drive reliability in terms everyone can readily understand but what do you think?

SNIA’s new SSD performance test specification

Western Digital's Silicon Edge Blue SSD SATA drive (from their website)

A couple of weeks ago SNIA released a new version of their SSSI (SSD) performance test specification for public comment. Not sure if this is the first version out for public comment or not but I discussed a prior version in a presentation I did for SNW last October and I have blogged before about some of the mystery of measuring SSD performance.  The current version looks a lot more polished than what I had to deal with last year but the essence of the performance testing remains the same:

  • Purge test – using vendor approved process, purge (erase) all the data on the drive.
  • Preconditioning test  – Write 2X the capacity of the drive using 128KiB blocksizes and sequentially writing through the whole device’s usable address space.
  • Steady state testing – varying blocksizes, varying read-write ratios, varying block number ranges, looped until steady state is achieved in device performance.

The steady state testing runs a random I/O mix for a minute’s duration at whatever the current specified blocksize, RW ratio and block number range.  Also, according to the specification, the measurements for steady state are done once 4KiB block sizes and 100% write settles down.  This steady state determinant testing must execute over a number of rounds (4?), after which the other performance test runs are considered at “steady state”.
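To make "settles down" a bit more concrete, here is a minimal sketch of one way such a check could work: look at the last few rounds of the tracking measurement and require that both the spread and the fitted trend stay within a tolerance of the average. The window length and the two tolerances are illustrative assumptions, not values taken from this post or quoted from the specification, and the function name is hypothetical.

```python
# Sketch of a steady state check over per-round results (e.g. 4KiB write IOPS).
# Window size and tolerances are assumed values for illustration only.

def is_steady_state(rounds, window=5, max_range=0.20, max_slope=0.10):
    """Return True if the last `window` rounds look settled.

    Two conditions must hold over the window:
      * max - min is no more than `max_range` of the window average
      * the least-squares trend line rises or falls by no more than
        `max_slope` of the window average across the window
    """
    if window < 2:
        raise ValueError("window must be at least 2 rounds")
    if len(rounds) < window:
        return False

    y = rounds[-window:]
    mean = sum(y) / window
    if mean <= 0:
        return False

    if max(y) - min(y) > max_range * mean:
        return False

    # Least-squares slope of y against round index 0..window-1.
    x_mean = (window - 1) / 2
    numerator = sum((i - x_mean) * (v - mean) for i, v in enumerate(y))
    denominator = sum((i - x_mean) ** 2 for i in range(window))
    slope = numerator / denominator

    drift = abs(slope) * (window - 1)
    return drift <= max_slope * mean


if __name__ == "__main__":
    fresh_out_of_box = [9000, 7000, 5500, 4800, 4500]  # still falling
    settled = [4400, 4450, 4380, 4420, 4410]           # flat
    print(is_steady_state(fresh_out_of_box), is_steady_state(settled))
```

The trend check matters because a drive can still be drifting steadily downward even when each round is close to the one before it.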

SNIA’s SSSI performance test benefits

Let’s start by saying no performance test is perfect.  I can always find fault in any performance test, even my own.  Nevertheless, the new SSSI performance test goes a long way towards fixing some intrinsic problems with SSD performance measurement.  Specifically,

  • The need to discriminate between fresh out of the box (FOB) performance and ongoing drive performance.  The preconditioning test is obviously a compromise in attempting to do this but writing double the full capacity of a drive will take a long time and should cause every NAND cell in the user space to be overwritten.  Once is not enough to overwrite all the device’s write buffers.   However, three times the device’s capacity may still show some variance in performance and will take correspondingly longer.
  • The need to show steady state SSD performance versus some peak value.  SSDs are notorious for showing differing performance over time. Partially this is due to FOB performance (see above) but mostly this is due to the complexity of managing NAND erasure and programming overhead.

The steady state performance problem is not nearly as much an issue with hard disk drives but even here, with defect skipping, drive performance will degrade over time (but a much longer time than for SSDs).  My main quibble with the test specification is how they elect to determine steady state – 4KiB with 100% write seems a bit oversimplified.

Is some proportion of read IO needed to define SSD “steady state” performance?

[Most of the original version of this post centered on the need for some write component in steady state determination.  This was all due to my misreading the SNIA spec.  I now realize that the current spec calls for a 100% WRITE workload with 4KiB blocksizes to settle down to determine steady state.   While this may be overkill, it certainly is consistent with my original feelings that some proportion of write activity needs to be a prime determinant of SSD steady state.]

One concern is the lack of read activity in determining steady state. My other worry with this approach is that the blocksize seems a bit too small; however, this is minor in comparison.

Let’s start with the fact that SSDs are by nature asymmetrical devices.  By that I mean their write performance differs substantially from their read performance due to the underlying nature of the NAND technology.  But much of what distinguishes an enterprise SSD from a commercial drive is the sophistication of its write processing.

But using 100% writes to test for steady state may be too much.

In addition, it is hard for me to imagine any commercial or enterprise class device in service not having some high portion of ongoing read IO activity.  I can easily be convinced that a normal R:W activity for an SSD device is somewhere between 90:10 and 50:50.  But I have a difficult time seeing an SSD R:W ratio of 0:100 as realistic.  And I feel any viable interpretation of device steady state performance needs to be based on realistic workloads.

In SNIA’s defense they had to pick some reproducible way to measure steady state.  Some devices may have had difficulty reaching steady state with any 100% write activity.  However, most other benchmarks have some sort of cut off that can be used to invalidate results.  Reaching steady state is one current criterion for SNIA’s SSSI performance test.  I just think adding some mix of read and write activity would be a better measure of SSD stability.

As for the 4KiB block size, it’s purely a question of what’s the most probable blocksize in the use of SSDs and may vary for enterprise or consumer applications.  But 4KiB seems a bit behind the times especially with today’s 128GB and higher drives…

What do you think, should SSD steady state need some mix of read and write activity or not?

[Thanks to Eden Kim and his team at SSSI for pointing out my spec reading error.]

SPECsfs2008 CIFS vs. NFS results – chart of the month

SPECsfs(R) 2008 CIFS vs. NFS 2010Mar17

We return now to our ongoing quest to understand the difference between CIFS and NFS performance in the typical data center.  As you may recall from past posts and our newsletters on this subject, we had been convinced that CIFS had almost 2X the throughput of NFS in SPECsfs 2008 benchmarks.  Well, as you can see from this updated chart, this is no longer true.

Thanks to EMC for proving me wrong (again).  Their latest NFS and CIFS results utilized an NS-G8 Celerra gateway server in front of a V-Max backend using SSDs and FC disks. The NS-G8 was the first enterprise class storage subsystem to release both a CIFS and NFS SPECsfs 2008 benchmark.

As you can see from the lower left quadrant all of the relatively SMB level systems (under 25K NFS throughput ops/sec) showed a consistent pattern of CIFS throughput being ~2X NFS throughput.  But when we added the Celerra V-Max combination to the analysis it brought the regression line down considerably and now the equation is:

CIFS throughput = 0.9952 X NFS throughput + 10565, with an R**2 of 0.96,

what this means is that CIFS and NFS throughput are roughly the same now.
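As a small sketch, the regression equation above can be evaluated at a few NFS throughput levels to see what CIFS throughput it predicts. The slope and intercept are the ones quoted above; the function names and the sample throughput levels are chosen purely for illustration and are not actual benchmark submissions.

```python
# Evaluating the fitted line: CIFS = 0.9952 * NFS + 10565 (throughput ops/sec).

SLOPE = 0.9952
INTERCEPT = 10565


def predicted_cifs(nfs_ops_per_sec):
    """CIFS throughput predicted by the regression line."""
    return SLOPE * nfs_ops_per_sec + INTERCEPT


def cifs_to_nfs_ratio(nfs_ops_per_sec):
    """Predicted CIFS throughput as a multiple of NFS throughput."""
    return predicted_cifs(nfs_ops_per_sec) / nfs_ops_per_sec


if __name__ == "__main__":
    # Hypothetical NFS throughput levels, not actual submissions.
    for nfs in (10_000, 25_000, 100_000):
        print(nfs, round(predicted_cifs(nfs)), round(cifs_to_nfs_ratio(nfs), 2))
```

With a slope this close to 1, the fixed intercept dominates for small systems and fades as throughput grows, so the predicted CIFS to NFS multiple shrinks toward 1 at enterprise scale.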

When I first reported the relative advantage of CIFS over NFS throughput in my newsletter I was told that you cannot compare the two results, mainly because NFS was “state-less” and CIFS was “state-full”, along with a number of other reasons (documented in the earlier post and in the newsletter).  Nonetheless, I felt that it was worthwhile to show the comparison because, at the end of the day, while it may not matter to the application/user whether some file happens to be serviced by NFS or CIFS, it should matter significantly to the storage administrator/IT staff.  By showing the relative performance of each we were hoping to help IT personnel decide between using CIFS or NFS storage.

Given the most recent results, it seems that the difference in throughput is not that substantial regardless of their respective differences.  Of course more data will help. There seems to be a wide gulf between the highest SMB submission and the EMC enterprise class storage that should be filled out.  As Celerra V-Max is the only enterprise NAS to submit both CIFS and NFS benchmarks there could still be many surprises in store. As always, I would encourage storage vendors to submit both NFS and CIFS benchmarks for the same system so that we can see how this pattern evolves over time.

The full SPECsfs 2008 report should have gone out to our newsletter subscribers last month but I made a mistake with the link.  The full report will be delivered with this month’s newsletter along with a new performance report on Exchange Solution Review Program and storage announcement summaries.  In addition, a copy of the SPECsfs report will be up on the dispatches page of our website later next month. However, you can get this information now and subscribe to future newsletters to receive future full reports even earlier, just email us at SubscribeNews@SilvertonConsulting.com?Subject=Subscribe_to_Newsletter.

As always, we welcome any suggestions on how to improve our analysis of SPECsfs or any of our other storage system performance results.

Save the planet – buy fatter disks and flash

PC hard drive capacity over time (from commons.wikimedia.org) (cc)

Well, maybe that overstates the case but there is no denying that both fatter (higher capacity) drives and flash memory (used as cache or in SSDs) save energy in today’s data center.  The interesting thing is that the trend to higher capacity drives has been going on for decades now (see chart) but only within the last few years has it been given any credit for energy reduction.  In contrast, flash in SSDs and cache is a relative newcomer but saves energy nonetheless.

I almost can’t recall when disk drives weren’t doubling in capacity every 18 to 24 months.  The above chart only shows PC drive capacities over time but enterprise drives have followed a similar curve.  The coming hard drive capacity wall may slow things down in the future but just last week IBM announced they were moving from a 300GB to a 600GB 15Krpm enterprise class disk drive in their DS8700 subsystem.  While doubling capacity may not quite halve energy use, it’s still significant.   Such energy reductions are even more dramatic with slower, higher density disks. These SATA disks are moving from 1TB to 2TB later this year and should cut energy use considerably.

Similarly, the density of NAND flash used in SSDs is increasing at a rate that almost outpaces disk storage.  ASIC feature size continues to shrink and as such, more and more flash storage is packed onto the same die size.  Improvements like these are doubling the capacity of SSDs and flash memory.  While SSD power reduction due to density improvements may not be as significant as disk, we hope to see a flattening out of power use per NAND cell over time.  This flattening out of power use is now happening with processing chips and we see little reason why similar techniques couldn’t apply to NAND.

But the story with flash/SSDs is a bit more complicated:

  • SSDs don’t consume as much energy as a standard disk drive at the same capacity, so a 146GB enterprise class SSD should consume much less energy than a 146GB enterprise class disk drive.
  • SSDs don’t exhibit the significant energy spike that hard disk drives encounter when driven at higher IOPs, as discussed in SSDs vs. Drives energy use.
  • SSDs can often replace many more disk spindles than pure capacity equivalence would dictate.  Some data centers use more disks than necessary to spread workload performance over more spindles, wasting storage, power and cooling.  By moving this data to SSDs or adding flash cache to a subsystem, spindle counts can be reduced dramatically and, as such, slash energy use for storage.

All this says that using SSDs or flash in place of disk drives reduces data center power requirements.  So if you’re interested in saving energy and thus, helping to save the planet, buy fat(ter) disks and flash for your data storage needs.

Brought to you on behalf of Planet Earth in honor of Earth Day.