QoM1610: Will NVMe over Fabric GA in enterprise AFA by Oct’2017

NVMeNVMe over fabric (NVMeoF) was a hot topic at Flash Memory Summit last August. Facebook and others were showing off their JBOF (see my Facebook moving to JBOF post) but there were plenty of other NVMeoF offerings at the show.

NVMeoF hardware availability

When Brocade announced their Gen6 Switches they made a point of saying that both their Gen5 and Gen6 switches currently support NVMeoF protocols. In addition to Brocade’s support, in Dec 2015 Qlogic announced support for NVMeoF for select HBAs. Also, as of  July 2016, Emulex announced support for NVMeoF in their HBAs.

From an Ethernet perspective, Qlogic has a NVMe Direct NIC which supports NVMe protocol offload for iSCSI. But even without NVMe Direct, Ethernet 40GbE & 100GbE with  iWARP, RoCEv1-v2, iSCSI SER, or iSCSI RDMA all could readily support NVMeoF on Ethernet. The nice thing about NVMeoF for Ethernet is not only do you get support for iSCSI & FCoE, but CIFS/SMB and NFS as well.

InfiniBand and Omni-Path Architecture already support native RDMA, so they should already support NVMeoF.

So hardware/firmware is already available for any enterprise AFA customer to want NVMeoF for their data center storage.

NVMeoF Software

Intel claims that ~90% of the software driver functionality of NVMe is the same for NVMeoF. The primary differences between the two seem to be the NVMeoY discovery and queueing mechanisms.

There are two fabric methods that can be used to implement NVMeoF data and command transfers: capsule mode where NVMe commands and data are encapsulated in normal fabric packets or fabric dependent mode where drivers make use of native fabric memory transfer mechanisms (RDMA, …) to transfer commands and data.

12679485_245179519150700_14553389_nA (Linux) host driver for NVMeoF is currently available from Seagate. And as a result, support for NVMeoF for Linux is currently under development, and  not far from release in the next Kernel (I think). (Mellanox has a tutorial on how to compile a Linux kernel with NVMeoF driver support).

With Linux coming out, Microsoft Windows and VMware can’t be far behind. However, I could find nothing online, aside from base NVMe support, for either platform.

NVMeoF target support is another matter but with NICs/HBAs & switch hardware/firmware and drivers presently available, proprietary storage system target drivers are just a matter of time.

Boot support is a major concern. I could find no information on BIOS support for booting off of a NVMeoF AFA. Arguably, one may not need boot support for NVMeoF AFAs as they are probably not a viable target for storing App code or OS software.

From what I could tell, normal fabric multi-pathing support should work fine with NVMeoF. This should allow for HA NVMeoF storage, a critical requirement for enterprise AFA storage systems these days.

NVMeoF advantages/disadvantages

Chelsio and others have shown that NVMeoF adds ~8μsec of additional overhead beyond native NVMe SSDs, which if true would warrant implementation on all NVMe AFAs. This may or may not impact max IOPS depending on scale-ability of NVMeoF.

For instance, servers (PCIe bus hardware) typically limit the number of private NVMe SSDs to 255 or less. With an NVMeoF, one could potentially have 1000s of shared NVMe SSDs accessible to a single server. With this scale, one could have a single server attached to a scale-out NVMeoF AFA (cluster) that could supply ~4X the IOPS that a single server could perform using private NVMe storage.

Base level NVMe SSD support and protocol stacks are starting to be available for most flash vendors and operating systems such as, Linux, FreeBSD, VMware, Windows, and Solaris. If Intel’s claim of 90% common software between NVMe and NVMeoF drivers is true, then it should be a relatively easy development project to provide host NVMeoF drivers.

The need for special Ethernet hardware that supports RDMA may delay some storage vendors from implementing NVMeoF AFAs quickly. The lack of BIOS boot support may be a minor irritant in comparison.

NVMeoF forecast

AFA storage systems, as far as I can tell, are all about selling high IOPS and very-low latency IOs. It would seem that NVMeoF would offer early adopter AFA storage vendors a significant performance advantage over slower paced competition.

In previous QoM/QoW posts we have established that there are about 13 new enterprise storage systems that come out each year. Probably 80% of these will be AFA, given the current market environment.

Of the 10.4 AFA systems coming out over the next year, ~20% of these systems pride themselves on being the lowest latency solutions in the market, and thus command high margins. One would think these systems would be the first to adopt NVMeoF. But, most of these systems have their own, proprietary flash modules and do not use standard (NVMe) SSDs and can use their own proprietary interface to their proprietary flash storage. This will delay any implementation for them until they can convert their flash storage to NVMe which may take some time.

On the other hand, most (70%) of the other AFA systems, that currently use SAS/SATA SSDs, could boost their IOP counts and drastically reduce their IO  response times, by implementing NVMe SSDs and NVMeoF. But converting SAS/SATA backends to NVMe will take time and effort.

But, there are a select few (~10%) of AFA systems, that already use NVMe SSDs in their AFAs, and for these few, they would seem to have a fast track towards implementing NVMeoF. The fact that NVMeoF is supported over all fabrics and all storage interface protocols make it even easier.

Moreover, NVMeoF has been under discussion since the summer of 2015, which tells me that astute AFA vendors have already had 18+ months to develop it. With NVMeoF host drivers & hardware available since Dec. 2015, means hardware and software exist to test and validate against.

I believe that NVMeoF will be GA’d within the next 12 months by at least one enterprise AFA system. So my QoM1610 forecast for NVMeoF is YES, with a 0.83 probability.





Has latency become the key metric? SPC-1 LRT results – chart of the month

I was at EMCworld a couple of months back and they were showing off a preview of the next version VNX storage, which was trying to achieve a million IOPS with under a millisecond latency.  Then I attended NetApp’s analyst summit and the discussion at their Flash seminar was how latency was changing the landscape of data storage and how flash latencies were going to enable totally new applications.

One executive at NetApp mentioned that IOPS was never the real problem. As an example, he mentioned one large oil & gas firm that had a peak IOPS of 35K.

Also, there was some discussion at NetApp of trying to come up with a way of segmenting customer applications by latency requirements.  Aside from high frequency trading applications, online payment processing and a few other high-performance database activities, there wasn’t a lot that could easily be identified/quantified today.

IO latencies have been coming down for years now. Sophisticated disk only storage systems have been lowering latencies for over a decade or more.   But since the introduction of SSDs it’s been a whole new ballgame.  For proof all one has to do is examine the top 10 SPC-1 LRT (least response time, measured with workloads@10% of peak activity) results.

Top 10 SPC-1 LRT results, SSD system response times


In looking over the top 10 SPC-1 LRT benchmarks (see Figure above) one can see a general pattern.  These systems mostly use SSD or flash storage except for TMS-400, TMS 320 (IBM FlashSystems) and Kaminario’s K2-D which primarily use DRAM storage and backup storage.

Hybrid disk-flash systems seem to start with an LRT of around 0.9 msec (not on the chart above).  These can be found with DotHill, NetApp, and IBM.

Similarly, you almost have to get to as “slow” as 0.93 msec. before you can find any disk only storage systems. But most disk only storage comes with a latency at 1msec or more. Between 1 and 2msec. LRT we see storage from EMC, HDS, HP, Fujitsu, IBM NetApp and others.

There was a time when the storage world was convinced that to get really good response times you had to have a purpose built storage system like TMS or Kaminario or stripped down functionality like IBM’s Power 595.  But it seems that the general purpose HDS HUS, IBM Storwize, and even Huawei OceanStore are all capable of providing excellent latencies with all SSD storage behind them. And all seem to do at least in the same ballpark as the purpose built, TMS RAMSAN-620 SSD storage system.  These general purpose storage systems have just about every advanced feature imaginable with the exception of mainframe attach.

It seems nowadays that there is a trifurcation of latency results going on, based on underlying storage:

  • DRAM only systems at 0.4 msec to ~0.1 msec.
  • SSD/flash only storage at 0.7 down to 0.2msec
  • Disk only storage at 0.93msec and above.

The hybrid storage systems are attempting to mix the economics of disk with the speed of flash storage and seem to be contending with all these single technology, storage solutions. 

It’s a new IO latency world today.  SSD only storage systems are now available from every major storage vendor and many of them are showing pretty impressive latencies.  Now with fully functional storage latency below 0.5msec., what’s the next hurdle for IT.


Image: EAB 2006 by TMWolf


Enhanced by Zemanta

SCI SPC-1 results analysis: Top 10 $/IOPS – chart-of-the-month

Column chart showing the top 10 economically performing systems for SPC-1
(SCISPC120226-003) (c) 2012 Silverton Consulting, Inc. All Rights Reserved

Lower is better on this chart.  I can’t remember the last time we showed this Top 10 $/IOPS™ chart from the Storage Performance Council SPC-1 benchmark.  Recall that we prefer our IOPS/$/GB which factors in subsystem size but this past quarter two new submissions ranked well on this metric.  The two new systems were the all SSD Huawei Symantec Oceanspace™ Dorado2100 (#2) and the latest Fujitsu ETERNUS DX80 S2 storage (#7) subsystems.

Most of the winners on $/IOPS are SSD systems (#1-5 and 10) and most of these were all SSD storage system.  These systems normally have better $/IOPS by hitting high IOPS™ rates for the cost of their storage. But they often submit relatively small systems to SPC-1 reducing system cost and helping them place better on $/IOPS.

On the other hand, some disk only storage do well by abandoning any form of protection as with the two Sun J4400 (#6) and J4200 (#8) storage systems which used RAID 0 but also had smaller capacities, coming in at 2.2TB and 1.2TB, respectively.

The other two disk only storage systems here, the Fujitsu ETERNUS DX80 S2 (#7) and the Huawei Symantec Oceanspace S2600 (#9) systems also had relatively small capacities at 9.7TB and 2.9TB respectively.

The ETERNUS DX80 S2 achieved ~35K IOPS and at a cost of under $80K generated a $2.25 $/IOPS.  Of course, the all SSD systems blow that away, for example the Oceanspace Dorado2100 (#2), all SSD system hit ~100K IOPS but cost nearly $90K for a $0.90 $/IOPS.

Moreover, the largest capacity system here with 23.7TB of storage was the Oracle Sun ZFS (#10) hybrid SSD and disk system which generated ~137K IOPS at a cost of ~$410K hitting just under $3.00 $/IOPS.

Still prefer our own metric on economical performance but each has their flaws.  The SPC-1 $/IOPS metric is dominated by SSD systems and our IOPS/$/GB metric is dominated by disk only systems.   Probably some way to do better on the cost of performance but I have yet to see it.


The full SPC performance report went out in SCI’s February newsletter.  But a copy of the full report will be posted on our dispatches page sometime next month (if all goes well). However, you can get the full SPC performance analysis now and subscribe to future free newsletters by just sending us an email or using the signup form above right.

For a more extensive discussion of current SAN or block storage performance covering SPC-1 (top 30), SPC-2 (top 30) and ESRP (top 20) results please see SCI’s SAN Storage Buying Guide available on our website.

As always, we welcome any suggestions or comments on how to improve our analysis of SPC results or any of our other storage performance analyses.


Latest SPC-1 results – IOPS vs drive counts – chart-of-the-month

Scatter plot of SPC-1  IOPS against Spindle count, with linear regression line showing Y=186.18X + 10227 with R**2=0.96064
(SCISPC111122-004) (c) 2011 Silverton Consulting, All Rights Reserved

[As promised, I am trying to get up-to-date on my performance charts from our monthly newsletters. This one brings us current up through November.]

The above chart plots Storage Performance Council SPC-1 IOPS against spindle count.  On this chart, we have eliminated any SSD systems, systems with drives smaller than 140 GB and any systems with multiple drive sizes.

Alas, the regression coefficient (R**2) of 0.96 tells us that SPC-1 IOPS performance is mainly driven by drive count.  But what’s more interesting here is that as drive counts get higher than say 1000, the variance surrounding the linear regression line widens – implying that system sophistication starts to matter more.

Processing power matters

For instance, if you look at the three systems centered around 2000 drives, they are (from lowest to highest IOPS) 4-node IBM SVC 5.1, 6-node IBM SVC 5.1 and an 8-node HP 3PAR V800 storage system.  This tells us that the more processing (nodes) you throw at an IOPS workload given similar spindle counts, the more efficient it can be.

System sophistication can matter too

The other interesting facet on this chart comes from examining the three systems centered around 250K IOPS that span from ~1150 to ~1500 drives.

  • The 1156 drive system is the latest HDS VSP 8-VSD (virtual storage directors, or processing nodes) running with dynamically (thinly) provisioned volumes – which is the first and only SPC-1 submission using thin provisioning.
  • The 1280 drive system is a (now HP) 3PAR T800 8-node system.
  • The 1536 drive system is an IBM SVC 4.3 8-node storage system.

One would think that thin provisioning would degrade storage performance and maybe it did but without a non-dynamically provisioned HDS VSP benchmark to compare against, it’s hard to tell.  However, the fact that the HDS-VSP performed as well as the other systems did with much lower drive counts seems to tell us that thin provisioning potentially uses hard drives more efficiently than fat provisioning, the 8-VSD HDS VSP is more effective than an 8-node IBM SVC 4.3 and an 8-node (HP) 3PAR T800 systems, or perhaps some combination of these.


The full SPC performance report went out to our newsletter subscriber’s last November.  [The one change to this chart from the full report is the date in the chart’s title was wrong and is fixed here].  A copy of the full report will be up on the dispatches page of our website sometime this month (if all goes well). However, you can get performance information now and subscribe to future newsletters to receive these reports even earlier by just sending us an email or using the signup form above right.

For a more extensive discussion of block or SAN storage performance covering SPC-1&-2 (top 30) and ESRP (top 20) results please consider purchasing our recently updated SAN Storage Buying Guide available on our website.

As always, we welcome any suggestions on how to improve our analysis of SPC results or any of our other storage system performance discussions.


SCI May 2011, Latest SPC-1 results IOPS vs. drive count – chart of the month

SCISPC110527-004 (c) 2011 Silverton Consulting, Inc., All Rights Reserved
SCISPC110527-004 (c) 2011 Silverton Consulting, Inc., All Rights Reserved

The above chart is from our May Storage Intelligence newsletter dispatch on system performance and shows the latest Storage Performance Council SPC-1 benchmark results in a scatter plot with IO/sec [or IOPS(tm)] on the vertical axis and number of disk drives on the horizontal axis.  We have tried to remove all results that used NAND flash as a cache or SSDs. Also this displays only results below a $100/GB.

One negative view of benchmarks such as SPC-1 is that published results are almost entirely due to the hardware thrown at it or in this case, the number of disk drives (or SSDs) in the system configuration.  An R**2 of 0.93 shows a pretty good correlation of IOPS performance against disk drive count and would seem to bear this view out, but is an incorrect interpretation of the results.

Just look at the wide variation beyond the 500 disk drive count versus below that where there are only a few outliers with a much narrower variance. As such, we would have to say that at some point (below 500 drives), most storage systems can seem to attain a reasonable rate of IOPS as a function of the number of spindles present, but after that point the relationship starts to break down.  There are certainly storage systems at the over 500 drive level that perform much better than average for their drive configuration and some that perform worse.

For example, consider the triangle formed by the three best performing (IOPS) results on this chart.  The one at 300K IOPS with ~1150 disk drives is from Huawei Symantec and is their 8-node Oceanspace S8100 storage system whereas the other system with similar IOPS performance at ~315K IOPS used ~2050 disk drives and is a 4-node, IBM SVC (5.1) system with DS8700 backend storage.   In contrast, the highest performer on this chart at ~380K IOPS, also had ~2050 disk drives and is a 6-node IBM SVC (5.1) with DS8700 backend storage.

Given the above analysis there seems to be much more to system performance than merely disk drive count, at least at the over 500 disk count level.


The full performance dispatch will be up on our website after the middle of next month but if you are interested in viewing this today, please sign up for our free monthly newsletter (see subscription request, above right) or subscribe by email and we’ll send you the current issue.  If you need a more analysis of SAN storage performance please consider purchasing SCI’s SAN Storage Briefing.

As always, we welcome all constructive suggestions on how to improve any of our storage performance analyses.



Latest SPC-1 IOPS results – chart of the month

SCISPC110221-001 (c) Silverton Consulting, Inc., All Rights Reserved
SCISPC110221-001 (c) Silverton Consulting, Inc., All Rights Reserved

As one can see from this chart clusters of off the shelf components are starting to dominate maximum IOPS performance in SPC-1.

The newest member to this chart is the Huawei Symantec Oceanspace S8100 which sported 8 storage server nodes and came in at just a touch above 300K IO/second in SPC-1 benchmark results.  It used 16-8GFC links and over 1150 disk drives to attain this performance.

Other cluster oriented systems here include all the IBM SVC submissions (#1,2,6, & 7) as well as the now HP 3Par system coming in at number 9.  One could probably argue that the IBM Power 595 w/SSDs also should also be considered a clustered system but it really only had one server (with 96 cores on it though) with SAS connected SSDs behind it.

It’s somewhat surprising not to see better performance from using SSDs on this chart.  The only SSD systems being IBM Power 594 and the two TMS systems.  It’s apparent from this data that one can obtain superior performance just by using lots of disk drives, at least for SPC-1 IOPS.


The full performance dispatch will be up on our website after month end but if one is interested in seeing it sooner sign up for our free monthly newsletter (see subscription request, above right) or subscribe by email and we will send the current issue along with download instructions for this and other reports.  If you need an even more in-depth analysis of SAN storage performance please consider purchasing SCI’s SAN Storage Briefing also available from our website.

As always, we welcome any constructive suggestions on how to improve any of our storage performance analysis.



SCI’s latest SPC-1&-1/E LRT results – chart of the month

(c) 2010 Silverton Consulting, Inc., All Rights Reserved
(c) 2010 Silverton Consulting, Inc., All Rights Reserved

It’s been a while since we reported on Storage Performance Council (SPC) Least Response Time (LRT) results (see Chart of the month: SPC LRT[TM]).  This is one of the charts we produce for our monthly dispatch on storage performance (quarterly report on SPC results).

Since our last blog post on this subject there have been 6 new entries in LRT Top 10 (#3-6 &, 9-10).  As can be seen here which combines SPC-1 and 1/E results, response times vary considerably.  7 of these top 10 LRT results come from subsystems which either have all SSDs (#1-4, 7 & 9) or have a large NAND cache (#5).    The newest members on this chart were the NetApp 3270A and the Xiotech Emprise 5000-300GB disk drives which were published recently.

The NetApp FAS3270A, a mid-range subsystem with 1TB of NAND cache (512MB in each controller) seemed to do pretty well here with all SSD systems doing better than it and a pair of all SSD systems doing worse than it.  Coming in under 1msec LRT is no small feat.  We are certain the NAND cache helped NetApp achieve their superior responsiveness.

What the Xiotech Emprise 5000-300GB storage subsystem is doing here is another question.  They have always done well on an IOPs/drive basis (see SPC-1&-1/E results IOPs/Drive – chart of the month) but being top ten in LRT had not been their forte, previously.  How one coaxes a 1.47 msec LRT out of a 20 drive system that costs only ~$41K, 12X lower than the median price(~$509K) of the other subsystems here is a mystery.  Of course, they were using RAID 1 but so were half of the subsystems on this chart.

It’s nice that some turnover in this top 10 LRT.  I still contend that response time is an important performance metric for many storage workloads (see my IO throughput vs. response time and why it matters post) and improvement over time validates my thesis.  Also I received many comments discussing the merits of database latencies for ESRP v3 (Exchange 2010) results, (see my Microsoft Exchange Perfomance ESRP v3.0 results – chart of the month post).  You can judge the results of that lengthy discussion for yourselves.

The full performance dispatch will be up on our website in a couple of weeks but if you are interested in seeing it sooner just sign up for our free monthly newsletter (see upper right) or subscribe by email and we will send you the current issue with download instructions for this and other reports.

As always, we welcome any constructive suggestions on how to improve our storage performance analysis.


Storage throughput vs. IO response time and why it matters

Fighter Jets at CNE by lifecreation (cc) (from Flickr)
Fighter Jets at CNE by lifecreation (cc) (from Flickr)

Lost in much of the discussions on storage system performance is the need for both throughput and response time measurements.

  • By IO throughput I generally mean data transfer speed in megabytes per second (MB/s or MBPS), however another definition of throughput is IO operations per second (IO/s or IOPS).  I prefer the MB/s designation for storage system throughput because it’s very complementary with respect to response time whereas IO/s can often be confounded with response time.  Nevertheless, both metrics qualify as storage system throughput.
  • By IO response time I mean the time it takes a storage system to perform an IO operation from start to finish, usually measured in milleseconds although lately some subsystems have dropped below the 1msec. threshold.  (See my last year’s post on SPC LRT results for information on some top response time results).

Benchmark measurements of response time and throughput

Both Standard Performance Evaluation Corporation’s SPECsfs2008 and Storage Performance Council’s SPC-1 provide response time measurements although they measure substantially different quantities.  The problem with SPECsfs2008’s measurement of ORT (overall response time) is that it’s calculated as a mean across the whole benchmark run rather than a strict measurement of least response time at low file request rates.  I believe any response time metric should measure the minimum response time achievable from a storage system although I can understand SPECsfs2008’s point of view.

On the other hand SPC-1 measurement of LRT (least response time) is just what I would like to see in a response time measurement.  SPC-1 provides the time it takes to complete an IO operation at very low request rates.

In regards to throughput, once again SPECsfs2008’s measurement of throughput leaves something to be desired as it’s strictly a measurement of NFS or CIFS operations per second.  Of course this includes a number (>40%) of non-data transfer requests as well as data transfers, so confounds any measurement of how much data can be transferred per second.  But, from their perspective a file system needs to do more than just read and write data which is why they mix these other requests in with their measurement of NAS throughput.

Storage Performance Council’s SPC-1 reports throughput results as IOPS and provide no direct measure of MB/s unless one looks to their SPC-2 benchmark results.  SPC-2 reports on a direct measure of MBPS which is an average of three different data intensive workloads including large file access, video-on-demand and a large database query workload.

Why response time and throughput matter

Historically, we used to say that OLTP (online transaction processing) activity performance was entirely dependent on response time – the better storage system response time, the better your OLTP systems performed.  Nowadays it’s a bit more complex, as some of todays database queries can depend as much on sequential database transfers (or throughput) as on individual IO response time.  Nonetheless, I feel that there is still a large component of response time critical workloads out there that perform much better with shorter response times.

On the other hand, high throughput has its growing gaggle of adherents as well.  When it comes to high sequential data transfer workloads such as data warehouse queries, video or audio editing/download or large file data transfers, throughput as measured by MB/s reigns supreme – higher MB/s can lead to much faster workloads.

The only question that remains is who needs higher throughput as measured by IO/s rather than MB/s.  I would contend that mixed workloads which contain components of random as well as sequential IOs and typically smaller data transfers can benefit from high IO/s storage systems.  The only confounding matter is that these workloads obviously benefit from better response times as well.   That’s why throughput as measured by IO/s is a much more difficult number to understand than any pure MB/s numbers.


Now there is a contingent of performance gurus today that believe that IO response times no longer matter.  In fact if one looks at SPC-1 results, it takes some effort to find its LRT measurement.  It’s not included in the summary report.

Also, in the post mentioned above there appears to be a definite bifurcation of storage subsystems with respect to response time, i.e., some subsystems are focused on response time while others are not.  I would have liked to see some more of the top enterprise storage subsystems represented in the top LRT subsystems but alas, they are missing.

1954 French Grand Prix - Those Were The Days by Nigel Smuckatelli (cc) (from Flickr)
1954 French Grand Prix - Those Were The Days by Nigel Smuckatelli (cc) (from Flickr)

Call me old fashioned but I feel that response time represents a very important and orthogonal performance measure with respect to throughput of any storage subsystem and as such, should be much more widely disseminated than it is today.

For example, there is a substantive difference a fighter jet’s or race car’s top speed vs. their maneuverability.  I would compare top speed to storage throughput and its maneuverability to IO response time.  Perhaps this doesn’t matter as much for a jet liner or family car but it can matter a lot in the right domain.

Now do you want your storage subsystem to be a jet fighter or a jet liner – you decide.