SPC-1 IOPS performance per GB-NAND – chart of the month

Bar chart depicting IOPS/GB-NAND, #1 is Datacore Parallel Server with ~266 IOPS/GB-NAND,
(c) 2016 Silverton Consulting, All Rights Reserved

The above is an updated chart from last month's SCI newsletter StorInt™ SPC Performance Report depicting the top 10 SPC-1 submissions by IOPS™ per GB-NAND. We have been searching for a while now for a way to depict storage system effectiveness when using SSD or other flash storage. We have used IOPS/SSD in the past, but IOPS/GB-NAND looks better.

Calculating IOPS/GB-NAND

SPC-1 does not report this metric but it can be calculated by dividing IOPS by NAND storage capacity. One can find out NAND storage capacity by looking over SPC-1 full disclosure reports (FDR), totaling up the NAND storage in the configuration in all the SSDs and flash devices. This is total NAND capacity, not Total ASU (used storage) Capacity. GB-NAND reflects just what’s indicated for SSD/flash device capacity in the configuration section. This is not necessarily the device’s physical NAND capacity when over provisioned, but at least it’s available in the FDR.

DataCore Parallel Server IOPS/GB-NAND explained

The DataCore Parallel Server generated over 5M IOPS (IOs/second) under an SPC-1 (OLTP-like) workload. With their 54-480GB SSDs, totaling ~25.9TB of NAND capacity, that gives them just under 200 IOPS/GB-NAND. The chart in the original report was incorrect. There we used 36-480GB SSDs, or ~17.3TB of NAND, to compute IOPS/GB-NAND, which gave them just under 300 IOPS/GB-NAND. (The full report has since been corrected and is available for re-download for subscribers to our newsletter.)
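For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation. The function name and the round 5,000,000 IOPS figure are our assumptions for illustration (the submission reported "over 5M"); the drive counts and sizes are the ones quoted above.

```python
def iops_per_gb_nand(iops: float, drive_count: int, drive_gb: float) -> float:
    """Divide benchmark IOPS by the total configured NAND capacity in GB."""
    total_gb = drive_count * drive_gb
    return iops / total_gb


# Assumption: a round stand-in for the "over 5M" SPC-1 IOPS result.
ASSUMED_IOPS = 5_000_000

# Corrected configuration: 54 x 480GB SSDs.
corrected = iops_per_gb_nand(ASSUMED_IOPS, drive_count=54, drive_gb=480)
# Configuration mistakenly used in the original report: 36 x 480GB SSDs.
original = iops_per_gb_nand(ASSUMED_IOPS, drive_count=36, drive_gb=480)

print(f"54 x 480GB = {54 * 480 / 1000:.1f}TB NAND -> {corrected:.0f} IOPS/GB-NAND")
print(f"36 x 480GB = {36 * 480 / 1000:.1f}TB NAND -> {original:.0f} IOPS/GB-NAND")
```

With the round 5M figure the two results land a little under 200 and a little under 300 respectively, which matches the corrected and original numbers.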

The 480GB SSDs (Samsung SM863 MZ-7KM480E) were all SATA attached. Samsung lists these SSDs as V-NAND, MLC drives, rated at 97K random reads and 26K random writes. At over 5M IOPS, the system would have to be running close to 100% of the SSDs' rated performance. However, DataCore's Parallel Server included 2 controllers with a total of 3TB of DRAM cache, which were then SAS connected to 4 Dell MD1220 storage arrays, each with 512GB of DRAM cache, so the total configuration had about 5TB of DRAM in it, most of which would have been used as an IO cache.
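A rough sketch of the numbers behind that "close to 100%" remark, in Python. It assumes a round 5,000,000 IOPS and, unrealistically, that every IO is a random read served by an SSD; the variable names are ours.

```python
SSD_COUNT = 54
RATED_READ_IOPS = 97_000    # per-SSD random read rating quoted above
RATED_WRITE_IOPS = 26_000   # per-SSD random write rating quoted above

# Assumption: a round stand-in for the "over 5M" SPC-1 IOPS result.
ASSUMED_IOPS = 5_000_000

aggregate_read = SSD_COUNT * RATED_READ_IOPS
aggregate_write = SSD_COUNT * RATED_WRITE_IOPS

# Fraction of rated read capability needed if the SSDs served every IO as a read.
read_fraction = ASSUMED_IOPS / aggregate_read

# DRAM in the configuration: 3TB in the controllers plus 4 arrays x 512GB.
dram_tb = 3 + 4 * 0.512

print(f"Aggregate rated reads:  {aggregate_read:,} IOPS")
print(f"Aggregate rated writes: {aggregate_write:,} IOPS")
print(f"5M IOPS is {read_fraction:.0%} of the aggregate read rating")
print(f"Total DRAM: ~{dram_tb:.1f}TB")
```

Since SPC-1 is a mixed read-write workload and the aggregate write rating is far below 5M, the SSDs alone could not have produced this result, which is why the DRAM cache matters so much here.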

The SPC-1 submission only used 11.8TB (Total ASU capacity) of storage. All the DRAM cache helps to explain how they attained 5M IOPS. Having a multi-tiered cache like the DataCore-MD1220 configuration doesn't ensure that all the cache is effectively used, but even without cache tiering logic, there might not be much of an overlap between the MD1220 and Parallel Server caches. It would be more interesting to see how busy the SSDs were during this SPC-1 run.

How random the SPC-1 workload is remains subject to much speculation in the industry. Suffice it to say it's not 100% random, but what is? Non-random OLTP workloads would tend to favor larger caches.

SPC is coming out with a new version of their benchmark with supplementary information which may shed more light on device busyness.

All SPC-1 benchmark submissions are available at storageperformance.org.

Want more?

The August 2016 and our other SPC Performance reports have much more information on SPC-1 and SPC-2 performance. Moreover, there's a lot more performance information, covering email and other (OLTP and throughput intensive) block storage workloads, in our SAN Storage Buying Guide, available for purchase on our website. More information on file and block protocol/interface performance is included in SCI's SAN-NAS Buying Guide, also available from our website.

~~~~

The complete SPC performance report went out in SCI’s August 2016 Storage Intelligence e-newsletter.  A copy of the report will be posted on our SCI dispatches (posts) page over the next quarter or so (if all goes well).  However, you can get the latest storage performance analysis now and subscribe to future free SCI Storage Intelligence e-newsletters, by just using the signup form in the sidebar or you can subscribe here.

 

DDN unchains Wolfcreek, unleashes IME and updates WOS

It's not every day that we get a vendor claiming 2.5X the top SPC-1 IOPS (currently held by Hitachi G1000 VSP all flash array at ~2M IOPS) as DataDirect Networks (DDN) has claimed for an all-flash version of their new Wolfcreek hyper converged appliance. DDN says their new 4U appliance is capable of 60GB/sec of throughput and over 5M IOPS. (See their press release for more information.) It's unclear if these are SPC-1 IOPS or not, but I haven't seen any SPC-1 report on it yet.

In addition to the new Wolfcreek appliance, DDN announced their new Infinite Memory Engine™ (IME) flash caching software and WOS® 360 V2.0, an enhanced version of their object storage.

DDN, if you haven't heard of them, has done well in Web 2.0 environments and is a leading supplier to high performance computing (HPC) sites. They offer an object storage system (WOS), all flash block storage (SFA12KXi), hybrid (disk-SSD) block storage (SFA7700X™ & SFA12KX™), a Lustre file appliance (EXAScaler), an IBM GPFS™ NAS appliance (GRIDScaler), a media server appliance (MEDIAScaler™) and software defined storage (Storage Fusion Accelerator [SFX™] flash caching software).

Wolfcreek hyper converged appliance

The converged solution comes in a 4U appliance using dual Haswell Intel microprocessors (with up to 18 cores each), includes a PCIe fabric which supports 48-NVMe flash cards or 72-SFF SSDs. With the NVMe or SSDs, Wolfcreek will be using their new IME software to accelerate IO activity.

Wolfcreek IME software supports either a burst mode IO caching cluster or a storage cluster of nodes. I assume burst mode is a storage caching layer for backend file system storage. As a storage cluster, I assume this would include some of their scale-out file system software on the nodes. The Wolfcreek cluster interconnect is 40Gb Infiniband or 10/40Gb Ethernet and will also support Intel's Omni-Path. The Wolfcreek appliance is compatible with HPC Lustre and IBM GPFS scale out file systems.

Wolfcreek appliance can be a great platform for OpenStack and Hadoop environments. But it also supports virtual machine hypervisors from VMware, Citrix and Microsoft. DDN says that the Wolfcreek appliance can scale up to support 100K VMs. I’ve been told that IME will not be targeted to work with Hypervisors in the first release.

Recall that with a hyper converged appliance, some portion of the system resources (memory and CPU cores) must be devoted to server and VM application activities and the remainder to storage activity. How this is divided up and whether this split is dynamic (changes over time) or static (fixed over time) in the Wolfcreek appliance is not indicated.

The hyper converged field is getting crowded of late what with VMware EVO:RAIL, Nutanix, ScaleComputing, Simplivity and others coming out with solutions. But there aren’t many that support all-flash storage and it seems unusual that hyper converged customers would have need for that much IO performance. But I could be wrong, especially for HPC customers.

There’s much more to hyper convergence than just having storage and compute in the same node. The software that links it all together, manages, monitors and deploys these combined hypervisor, storage and server systems is almost as important as any of the  hardware. There wasn’t much talk about the software that DDN is putting together for Wolfcreek but it’s still early yet. With their roots in HPC, it’s likely that any DDN hyper converged solution will target this market first and broaden out from there.

Infinite Memory Engine (IME)

IME is an outgrowth of DDN's SFX software and seems to act as a caching layer for parallel file system IO. It makes use of NVMe or SSDs for its IO caching. According to DDN, it can offer up to 1000X IO acceleration to storage or 100X file system acceleration.

It does this primarily by providing an application aware IO caching layer and supplying more effective IO to the file system layer using PCIe NVMe or SSD flash storage for hardware IO acceleration. According to the information provided by DDN, IME can provide 50 GB/sec bandwidth to a host compute cluster while only doing 4GB/sec of throughput to a backend file system, presumably by better caching of file IO.

WOS 360 V2.0

The new WOS 360 V2.0 object storage system features include

  • Higher density storage package (98-8TB SATA drives or 768TB raw capacity in 4U) supporting 8B objects each and over 100B objects in a cluster.
  • Native SWIFT API support for OpenStack environments, which includes gateway or embedded deployments, up to 5000 concurrent users and 5B objects/namespace.
  • Global ObjectAssure data encoding with lower storage overhead (1.5x or a 20% reduction from their previous encoding option) for highly durable and available object storage using a two level hierarchical erasure code.
  • Enhanced network security with SSL, which provides end-to-end SSL network data transport between clients and WOS and between WOS storage nodes.
  • Simplified cluster installation, deployment and maintenance: one can now deploy a WOS cluster in minutes, with a simple point and click GUI for installation and cluster deployment and automated non-disruptive software upgrade.
  • Performance improvements for better video streaming, content distribution and large file transfers with improved QoS for latency sensitive applications.

~~~~

Probably more going on with DDN than covered here but this hits the highlights. I wish there was more on their Wolfcreek appliance and its various configurations and performance benchmarks but there’s not.

Comments?

 Photo Credits: wolf-63503+1920 by _Liquid

 

All flash storage performance testing

There are some serious problems with measuring IO performance of all flash arrays with what we use on disk storage systems. Mostly, these are due to the inherent differences between current flash- and disk-based storage.

NAND garbage collection

First off, garbage collection is required by any SSD or NAND storage to be able to write data. Garbage collection coalesces free space by moving non-modified data to new pages/blocks and freeing up the space held by old, no-longer current data.

The problem is that NAND garbage collection takes place only after a suitable amount of write activity, and measuring all-flash array storage system performance without taking garbage collection into account is misleading at best and dishonest at worst.

The only way to control for garbage collection is to write lots of data to an all-flash storage system and measure its performance over a protracted period of time. How long this takes depends on the amount of storage in an all flash array, but filling it up to 75% of its capacity and then measuring IO performance as you fill up another 10-15% of its capacity with new data should suffice. Of course this would all have to be done consecutively, without any time off between runs (which would allow garbage collection to sneak in).
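To put some numbers on that procedure, here is a small Python sketch that sizes such a test. The 75% prefill and the 10-15% measurement window come from the paragraph above; the 100TB array, the 5GB/sec sustained write rate and all of the names are hypothetical values we picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class GcTestPlan:
    prefill_tb: float       # data written before measurement starts
    measure_min_tb: float   # least new data written while measuring
    measure_max_tb: float   # most new data written while measuring


def plan_gc_test(capacity_tb: float,
                 prefill_fraction: float = 0.75,
                 window: tuple = (0.10, 0.15)) -> GcTestPlan:
    """Size the prefill and measurement phases as fractions of array capacity."""
    low, high = window
    if prefill_fraction + high > 1.0:
        raise ValueError("prefill plus measurement window exceeds capacity")
    return GcTestPlan(capacity_tb * prefill_fraction,
                      capacity_tb * low,
                      capacity_tb * high)


def hours_to_write(tb: float, write_gb_per_sec: float) -> float:
    """Hours needed to write `tb` terabytes at a sustained rate (decimal units)."""
    return tb * 1000 / write_gb_per_sec / 3600


plan = plan_gc_test(capacity_tb=100)  # hypothetical 100TB all-flash array
total_hours = hours_to_write(plan.prefill_tb + plan.measure_max_tb,
                             write_gb_per_sec=5)  # hypothetical write rate
print(plan)
print(f"Back-to-back run time: ~{total_hours:.1f} hours")
```

Even with these generous assumptions the run takes hours rather than minutes, which is why short benchmark runs can miss garbage collection entirely.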

Flash data reduction

Second, many all flash arrays offer data reduction like data compression or deduplication. Standard IO benchmarks today don’t control for data reduction.

What we need is a standard corpus of reducible data for an IO workload. Such data would need to be able to be data compressed and data deduplicated. Unclear where such a data corpus could be found but one is needed to properly measure all flash system performance. What would help is some real world data reduction statistics, from a large number of customer installations that could help identify what real-world dedup and compression ratios look like. Then we could use these statistics to construct a suitable data load that can then be scaled and tailored to required performance needs.
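As a thought experiment, a synthetic data load with a chosen dedup ratio and a rough compression ratio could be sketched in Python as below. This is not a proposal for the standard corpus itself: the block size, the zero-fill approach to compressibility, the target ratios and every function name are our own assumptions, and real customer data would look quite different.

```python
import hashlib
import random
import zlib

BLOCK_SIZE = 4096  # assumed block size in bytes


def make_block(rng: random.Random, compressible_fraction: float) -> bytes:
    """One block: an incompressible random part followed by zero fill."""
    random_len = int(BLOCK_SIZE * (1 - compressible_fraction))
    noise = bytes(rng.getrandbits(8) for _ in range(random_len))
    return noise + bytes(BLOCK_SIZE - random_len)


def make_corpus(block_count: int, dedup_ratio: float,
                compressible_fraction: float, seed: int = 0) -> list:
    """Build blocks so that total blocks / unique blocks is about dedup_ratio."""
    if dedup_ratio < 1:
        raise ValueError("dedup_ratio must be at least 1")
    rng = random.Random(seed)
    unique_count = max(1, round(block_count / dedup_ratio))
    unique = [make_block(rng, compressible_fraction) for _ in range(unique_count)]
    # Every unique block appears once; the remainder are repeats of them.
    blocks = unique + [rng.choice(unique) for _ in range(block_count - unique_count)]
    rng.shuffle(blocks)
    return blocks


def measure(blocks: list) -> tuple:
    """Return (dedup ratio, compression ratio of the unique blocks)."""
    unique = {hashlib.sha256(b).digest(): b for b in blocks}
    dedup = len(blocks) / len(unique)
    raw = sum(len(b) for b in unique.values())
    compressed = sum(len(zlib.compress(b)) for b in unique.values())
    return dedup, raw / compressed


blocks = make_corpus(block_count=200, dedup_ratio=4.0, compressible_fraction=0.5)
dedup, compression = measure(blocks)
print(f"dedup {dedup:.2f}:1, compression {compression:.2f}:1")
```

The point of the sketch is only that the two ratios can be dialed in independently; what values to dial in is exactly what the real-world statistics would have to tell us.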

Perhaps SNIA or maybe a big (government) customer could support the creation of this data corpus that can be used for “standard” performance testing. With real world statistics and a suitable data corpus, standard IO benchmarks could control for data reduction on flash arrays and better measure system performance.

Block IO differences

Third, block heat maps (access patterns) need to become much more realistic. For disk based systems it was important to randomize the IO stream to minimize the advantage of DRAM caching. But with all flash storage arrays, cache is less useful, and because flash can't be rewritten in place, having IO occur to the same block (especially overwrites) causes NAND page fragmentation and more NAND write overhead.

~~~~

Only by controlling for garbage collection, using a standard, reducible data load and returning to a cache friendly (or at least write cache friendly) workload will we truly understand all flash storage performance.

Comments?

Thanks to Larry Freeman (@Larry_Freeman) for the idea for today’s post.

Photo Credit(s): Race Faces by Jerome Rauckman

Latest SPC-2 performance results – chart of the month

Spider chart top 10 SPC-2 MB/second broken out by workload LFP, LDQ and VOD

In the figure above you can see one of the charts from our latest performance dispatch on SPC-1 and SPC-2 benchmark results. The above chart shows SPC-2 throughput results sorted by aggregate MB/sec order, with all three workloads broken out for more information.

Just last quarter I was saying it didn't appear as if any all-flash system could do well on SPC-2, throughput intensive workloads.  Well, I was wrong (again): with an aggregate MBPS™ of ~33.5GB/sec, Kaminario's all-flash K2 took the SPC-2 MBPS results to a whole different level, almost doubling the nearest competitor in this category (Oracle ZFS ZS3-4).

Ok, Howard Marks (deepstorage.net), my GreyBeardsOnStorage podcast co-host and long-time friend, had warned me that SSDs had the throughput to be winners at SPC-2, but that they would probably cost too much to be viable.  I didn't believe him at the time. How wrong could I be?

As for cost, both Howard and I misjudged this one. The K2 came in at just under $1M USD, whereas the #2 Oracle system was under $400K. But there were five other top 10 SPC-2 MBPS systems over $1M, so the K2 all-flash system price was about average for the top 10.

Ok, if cost and high throughput aren't the problem, why haven't we seen more all-flash system SPC-2 benchmarks?  I tend to think that most flash systems are optimized for OLTP-like update activity and not sequential throughput. The K2 is obviously one exception. But I think we need to go a little deeper into the numbers to understand just what it was doing so well.

The details

The LFP (large file processing) reported MBPS metric is the average of 1MB and 256KB data transfer sizes, streaming activity with 100% write, 100% read and 50%:50% read-write. In K2’s detailed SPC-2 report, one can see that for 100% write workload the K2 was averaging ~26GB/sec. while for the 100% read workload the K2 was averaging ~38GB/sec. and for the 50:50 read:write workload ~32GB/sec.

On the other hand, the LDQ workload appears to be entirely sequential read-only, but the report shows that it is made up of two workloads, one using 1MB data transfers and the other using 64KB data transfers, with various numbers of streams fired up to generate stress. The surprising item for K2's LDQ run is that it did much better on the 64KB data streams than the 1MB data streams, an average of 41GB/sec vs. 32GB/sec. This probably says something about an internal flash data transfer bottleneck at large data transfers someplace in the architecture.

The VOD workload also appears to be sequential, read-only and the report doesn’t indicate a data transfer size but given K2’s actual results, averaging ~31GB/sec it would seem to indicate it was on the order of 1MB.

So what we can tell is that K2’s SSD write throughput is worse than reads (~1/3rd worse) and relatively smaller sequential reads are better than relatively larger sequential reads (~1/4 better).  But I must add that even at the relatively “slower write throughput”, the K2 would still have beaten the next best disk-only storage system by ~10GB/sec.
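The ratios quoted above can be reproduced with a few lines of Python. The inputs are the rounded GB/sec averages cited in this post, not the exact values from the full disclosure report, and the variable names are ours.

```python
# Rounded LFP averages quoted above, in GB/sec.
lfp = {"100% write": 26, "100% read": 38, "50:50 read:write": 32}
# Rounded LDQ averages quoted above, in GB/sec, by transfer size.
ldq = {"64KB": 41, "1MB": 32}

# Simple mean of the three LFP workload mixes.
lfp_average = sum(lfp.values()) / len(lfp)

# How far write throughput falls short of read throughput.
write_shortfall = 1 - lfp["100% write"] / lfp["100% read"]

# How much better small sequential reads did than large ones.
small_read_gain = ldq["64KB"] / ldq["1MB"] - 1

print(f"LFP average: {lfp_average:.0f}GB/sec")
print(f"Writes are {write_shortfall:.0%} slower than reads")
print(f"64KB reads are {small_read_gain:.0%} faster than 1MB reads")
```

With the rounded inputs, the write shortfall comes out close to one third and the small-read advantage a bit over one quarter.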

Where are the other all-flash SPC-2 benchmarks?

Prior to K2 there was only one other all-flash system (TMS RamSan-630) submission for SPC-2. I suspect that writing 26 GB/sec. to an all-flash system would be hazardous to its health and maybe other all-flash storage system vendors don’t want to encourage this type of activity.

Just for the record, the K2 SPC-2 result has been submitted for “review” (as of 18Mar2014) and may be modified before it is finally “accepted”. However, the review process typically doesn't impact performance results as much as other report items. So, officially, we will need to wait for final acceptance before we can truly believe these numbers.

Comments?

~~~~

The complete SPC  performance report went out in SCI’s February 2014 newsletter.  But a copy of the report will be posted on our dispatches page sometime next quarter (if all goes well).  However, you can get the latest storage performance analysis now and subscribe to future free newsletters by just using the signup form above right.

Even more performance information and OLTP, Email and Throughput ChampionCharts for Enterprise, Mid-range and SMB class storage systems are also available in SCI's SAN Buying Guide, available for purchase from our website.

As always, we welcome any suggestions or comments on how to improve our SPC  performance reports or any of our other storage performance analyses.

Has latency become the key metric? SPC-1 LRT results – chart of the month

I was at EMCworld a couple of months back and they were showing off a preview of the next version VNX storage, which was trying to achieve a million IOPS with under a millisecond latency.  Then I attended NetApp’s analyst summit and the discussion at their Flash seminar was how latency was changing the landscape of data storage and how flash latencies were going to enable totally new applications.

One executive at NetApp mentioned that IOPS was never the real problem. As an example, he mentioned one large oil & gas firm that had a peak IOPS of 35K.

Also, there was some discussion at NetApp of trying to come up with a way of segmenting customer applications by latency requirements.  Aside from high frequency trading applications, online payment processing and a few other high-performance database activities, there wasn’t a lot that could easily be identified/quantified today.

IO latencies have been coming down for years now. Sophisticated disk only storage systems have been lowering latencies for over a decade or more.   But since the introduction of SSDs it’s been a whole new ballgame.  For proof all one has to do is examine the top 10 SPC-1 LRT (least response time, measured with workloads@10% of peak activity) results.

Top 10 SPC-1 LRT results, SSD system response times

 

In looking over the top 10 SPC-1 LRT benchmarks (see Figure above) one can see a general pattern.  These systems mostly use SSD or flash storage except for TMS-400, TMS 320 (IBM FlashSystems) and Kaminario’s K2-D which primarily use DRAM storage and backup storage.

Hybrid disk-flash systems seem to start with an LRT of around 0.9 msec (not on the chart above).  These can be found with DotHill, NetApp, and IBM.

Similarly, you almost have to get as “slow” as 0.93 msec. before you can find any disk only storage systems. But most disk only storage comes with a latency of 1msec or more. Between 1 and 2msec. LRT we see storage from EMC, HDS, HP, Fujitsu, IBM, NetApp and others.

There was a time when the storage world was convinced that to get really good response times you had to have a purpose built storage system like TMS or Kaminario or stripped down functionality like IBM’s Power 595.  But it seems that the general purpose HDS HUS, IBM Storwize, and even Huawei OceanStore are all capable of providing excellent latencies with all SSD storage behind them. And all seem to do at least in the same ballpark as the purpose built, TMS RAMSAN-620 SSD storage system.  These general purpose storage systems have just about every advanced feature imaginable with the exception of mainframe attach.

It seems nowadays that there is a trifurcation of latency results going on, based on underlying storage:

  • DRAM only systems at 0.4 msec to ~0.1 msec.
  • SSD/flash only storage at 0.7 down to 0.2msec
  • Disk only storage at 0.93msec and above.

The hybrid storage systems are attempting to mix the economics of disk with the speed of flash storage and seem to be contending with all these single technology, storage solutions. 

It's a new IO latency world today.  SSD only storage systems are now available from every major storage vendor and many of them are showing pretty impressive latencies.  Now, with fully functional storage latency below 0.5msec., what's the next hurdle for IT?

Comments?

Image: EAB 2006 by TMWolf

 


SPC-2 performance results MBPS/drive – chart of the month

(SCISPC121029-005B) (c) 2013 Silverton Consulting, Inc. All Rights Reserved

The above chart is from our October newsletter and is one of 5 charts we discussed in the Storage Performance Council benchmarks analysis.  There’s something intriguing about the above chart. Specifically, the band of results in numbers 2 through 10 range from a high of 45.7 to a low of 41.5 MBPS/drive.  The lone outlier is the SGI InfiniteStorage system which managed to achieve 67.7 MBPS/drive.

It turns out that the SGI system is actually a NetApp E5460 (from their LSI acquisition) with 60-146GB disk drives in a RAID 6 configuration.  Considering that the configuration ASU (storage capacity used during the test) was 7TB and the full capacity was 8TB, it seemed to use all the drives to the fullest extent possible.  The only other interesting tidbit about the SGI/NetApp system was the 16GB of system memory (which I assume was mostly used for caching).  Other than that it just seemed to be a screamer of a system from a throughput perspective.
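A quick Python sketch of what these per-drive numbers imply. The per-drive figures, drive count and capacities are the ones quoted in this post; the aggregate throughput is back-calculated from them rather than taken from the SPC-2 report, and the names are ours.

```python
def mbps_per_drive(aggregate_mbps: float, drive_count: int) -> float:
    """SPC-2 aggregate MBPS divided by the number of disk drives."""
    return aggregate_mbps / drive_count


SGI_MBPS_PER_DRIVE = 67.7        # the outlier in the chart
NEXT_BEST_MBPS_PER_DRIVE = 45.7  # top of the #2 through #10 band
SGI_DRIVES = 60

# Aggregate throughput implied by the per-drive figure (back-calculated).
implied_aggregate = SGI_MBPS_PER_DRIVE * SGI_DRIVES

# How far ahead of the rest of the top 10 the SGI system was, per drive.
lead = SGI_MBPS_PER_DRIVE / NEXT_BEST_MBPS_PER_DRIVE - 1

# Share of the full capacity (8TB) used as ASU capacity (7TB).
asu_utilization = 7 / 8

print(f"Implied aggregate: ~{implied_aggregate:.0f} MBPS")
print(f"Lead over #2: {lead:.0%} per drive")
print(f"ASU utilization: {asu_utilization:.1%}")
```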

Earlier this year I was at an analyst session with NetApp where they were discussing their thoughts on where E-Series was going to focus. One of the items was going to be high throughput intensive applications. From what we see here, they seem to have the right machine to go after this market.

The only storage to come close was an older Oracle J4200 series system which had no RAID protection, which we would not recommend for any data application.   Not sure what the IBM DS5300 series storage is OEMed from but it might be another older E-Series system.

A couple of caveats are in order for our MBPS/drive charts:

  • These are disk-only systems; any system using SSDs or FlashCache is excluded from this analysis.
  • These systems all use 140GB disks or larger. (Some earlier SPC benchmarks used 36GB drives).

Also, please note the MBPS SPC-2 metric is a composite (average) of Video-on-demand, Large database query and Large file processing workload.

More information on SPC-2 performance as well as our SPC-1, SPC-2 and ESRP ChampionsCharts for block storage systems can be found in our SAN Storage Buying Guide (available for purchase on our web site).

~~~~

The complete SPC-1 and SPC-2 performance report went out in SCI’s October newsletter.  But a copy of the report will be posted on our dispatches page sometime this month (if all goes well).  However, you can get the latest storage performance analysis now and subscribe to future free newsletters by just using the signup form above right.

As always, we welcome any suggestions or comments on how to improve our SPC  performance reports or any of our other storage performance analyses.


 

Top Ten RayOnStorage Posts for 2012

Here are the top 10 blog posts for 2012 from RayOnStorage.com

1. Snow Leopard to Mountain Lion

We discuss our Mac OSX transition from Snow Leopard to Mountain Lion with the good, bad and ugly of Mountain Lion from a novice user’s perspective.

2. Vsphere 5.1 storage enhancements and future vision

We detail some of the storage enhancements and directions for the latest revision of VMware Vsphere 5.1

3.  Object Storage Summit wrap up

We discuss last month's ExecEvent Object Storage Summit and some of the use cases driving customers to adopt object storage for their data centers.

4. EMCWorld2012 part 1 – VNX/VNXe

We analyze the first day of EMCWorld2012 focused on EMC’s VNX/VNXe product enhancements.

5. Dell Storage Forum 2012 – day 2

We discuss the new Compellent and FluidFS systems coming out of Dell Storage Forum and their latest RNA Networks acquisition with a coherent Flash Cache network.

6. EMC buys ExtremeIO

Right before EMCWorld2012, EMC announced their purchase of ExtremeIO, which had been rumored for some time but signaled a new path to flash only SAN storage systems.

7. HDS Influencer Summit wrap up

HDS held their Influencer Summit last month and rolled out their executive team to talk about their storage and service directions and successes.

8. Oracle finally releases StorageTek VSM6

Well, after much delay, we finally get to see the latest generation Virtual Storage Manager 6 (VSM6) for the mainframe System z market place.

9. Coraid, first thoughts

We got to meet with Coraid as part of a Storage TechField Day event and we came away impressed but still wanting to learn more.

10. Latest SPC-1 results IOPS vs. drive counts – chart of the month

Every month (or so) we do a more detailed analysis of a chart that appears in our free monthly newsletter. This one was done earlier in the year and documented the correlation between IOPS and drive counts in SPC-1 results.

Happy New Year.

Latest SPC-2 MBPS vs drive count results – chart-of-the-month

SCISPC120529-001 (c) 2012 Silverton Consulting, All Rights Reserved

The above chart comes from our August performance analysis [yes, I am a bit behind] and is a scatter plot of Storage Performance Council SPC-2 submissions. In the above we plot MBPS™ on the vertical axis and the number of disk drives in the submission on the horizontal.  We have also added a linear regression line using the data with the regression formula listed.

Unlike SPC-1 performance results and IOPS™ vs. drives documented in an earlier post, SPC-2 MBPS results have a much wider variance and the regression coefficient (R**2), at ~0.42, shows it.  In the earlier SPC-1 post the IOPS-drive count linear regression had an R**2 of ~0.96.
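For anyone unfamiliar with how R**2 is derived, the Python sketch below fits a least-squares line and computes it by hand. The two data sets are made up for illustration (they are not SPC submissions); one tracks drive count closely and the other does not.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, plus R**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R**2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical data, NOT actual SPC results.
drives = [100, 200, 400, 800, 1600]
tight_iops = [21_000, 39_000, 82_000, 158_000, 321_000]  # tracks drive count
loose_mbps = [3_000, 1_500, 9_000, 4_000, 11_000]        # scattered

_, _, r2_tight = linear_fit(drives, tight_iops)
_, _, r2_loose = linear_fit(drives, loose_mbps)
print(f"tight fit R**2 = {r2_tight:.2f}, loose fit R**2 = {r2_loose:.2f}")
```

An R**2 near 1 says drive count explains almost all of the variation in the metric; a value near 0.4 says most of the variation comes from something else.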

Why would SPC-1 IOPS be more driven by drive counts than MBPS?  We can only speculate of course,  but it seems to me that SPC-2 MBPS is more a function of system caching effectiveness rather than pure IO transaction speed.

All the SPC-2 workloads (VOD, LFP, and LDQ) are sequential in nature and as such, sophisticated sequential lookahead caching can make more effective use of fewer spindles. In contrast, SPC-1 IOPS workloads are almost inherently random in nature and as such, are poor cache candidates, which by nature depend on high spindle counts to perform well.

In addition, SPC-2 has never been as popular as SPC-1 and as a result, doesn't have as many submissions.  It's never been clear to me why this is the case, as not all enterprise class workloads are random and as such, sequential activity is a necessary requirement for many enterprise storage systems.

Comments?

~~~~

The complete SPC-1 & SPC-2 performance report with more top 10 charts went out in SCI’s August newsletter.  But a copy of the report will be posted on our dispatches page sometime this month (if all goes well).  However, you can get the latest SPC performance analysis now and subscribe to future free newsletters by just using the signup form above right.

For a more extensive discussion of current SAN block system storage performance covering SPC (Top 30) results as well as ESRP results with our new ChampionsChart™ for SAN storage systems, please see SCI’s SAN Storage Buying Guide available from our website.

As always, we welcome any suggestions or comments on how to improve our analysis of SPC results or any of our other storage performance analyses.