Microsoft ESRP database access latencies – chart of the month

sciesrp1030-001
The above chart was included in last month’s SCI e-Newsletter and depicts recent Microsoft Exchange (2013) Solution Reviewed Program (ESRP) results for database access latencies. Storage systems new to this 5000 mailbox and over category include, Oracle, Pure and Tegile. As all of these systems are all flash arrays we are starting to see significant reductions in database access latencies and the #4 system (Nimble Storage) was a hybrid (disk and flash) array.

As you recall, ESRP reports on three database access latencies: read database, write database and log write. All three are shown above but we sort and rank this list based on database read activity alone.

Hard to see above, but reading the ESRP reports, one finds that the top 3 systems had 1.04, 1.06 and 1.07 milliseconds. average read database latencies. So the separation between the top 3 is less than 40 microseconds.

The top 3 database write access latencies were 1.75, 1.62 and 3.07 milliseconds, respectively. So if we were ranking the above on write response times Pure would have come in #1.

The top 3 log write access latencies were 0.67, 0.41 and 0.82 milliseconds and once again if we were ranking based on log response times Pure would be #1.

It’s unclear whether Exchange customers would want to deploy AFAs for their database and log files but these three ESRP reports and Nimble’s show that there should be no problem with the performance of AFAs in these environments.

What about data reduction?

Unclear to me is how much data reduction technologies played in the AFA and hybrid solution ESRP performance. Data reduction advantages would most likely show up in database IOPS counts more so than response times but if present, may still reduce access latencies, as there would be potentially less data to be transferred to-from the backend of the storage system into-out of storage system cache.

ESRP reports do not officially report on a vendor’s data reduction effectiveness, so we are left with whatever the vendor decides to say.

In that respect, Pure FlashArray//m20 indicated in their ESRP report that their “data reduction is significantly higher” than what they see normally (4:1) because Jetstress (ESRP benchmark program) generates lots of duplicated data.

I couldn’t find anything that Tegile T3800 or Nimble Storage said similar to the above, indicating how well their data reduction technologies worked in Jetstress as compared to normal. However, they did make a reference to their compression effectiveness in database size but I have found this number to be somewhat less effective as it historically showed the amount of over provisioning used by disk-only systems and for AFA’s and hybrid – storage, it’s unclear how much is data reduction effectiveness vs. over provisioning.

For example, Pure, Tegile and Nimble also reported a “database capacity utilization” of 4.2%60% and 74.8% respectively. And Nimble did report that over their entire customer base, Exchange data has on average, a 61.2% capacity savings.

So you tell me what was the effective data reduction for their Pure’s, Tegile’s and Nimble’s respective Jetstress runs? From my perspective Pure’s report of 4.2% looks about right (that says that actual database data fit in 4.2% of SSD storage for a ~23.8:1 reduction effectiveness for Jetstress/ESRP data. I find it harder to believe what Tegile and Nimble have indicated as it doesn’t seem to be as believable as they would imply a 1.7:1 and 1.33:1 reduction effectiveness for Jetstress/ESRP data.

Oracle FS1-2 doesn’t seem to have any data reduction capabilities and reported a 100% storage capacity used by Exchange database.

So that’s it, Jetstress uses “significantly reducible” data for some AFAs systems. But in the field the advantage of data reduction techniques are much less so.

I think it’s time that ESRP stopped using significantly reducible data in their Jetstress program and tried to more closely mimic real world data.

Want more?

The October 2016 and our other ESRP reports have much more information on Microsoft Exchange performance. Moreover, there’s a lot more performance information, covering email and other (OLTP and throughput intensive) block storage workloads, in our SAN Storage Buying Guide, available for purchase on our website. More information on file and block protocol/interface performance is also included in SCI’s SAN-NAS Buying Guidealso available from our website. And if your interested in file system performance please consider purchasing our NAS Buying Guide also available on our website.

~~~~

The complete ESRP performance report went out in SCI’s October 2016 Storage Intelligence e-newsletter.  A copy of the report will be posted on our SCI dispatches (posts) page over the next quarter or so (if all goes well).  However, you can get the latest storage performance analysis now and subscribe to future free SCI Storage Intelligence e-newsletters, by just using the signup form in the sidebar or you can subscribe here.

As always, we welcome any suggestions or comments on how to improve our ESRP  performance reports or any of our other storage performance analyses.

 

QoM1610: Will NVMe over Fabric GA in enterprise AFA by Oct’2017

NVMeNVMe over fabric (NVMeoF) was a hot topic at Flash Memory Summit last August. Facebook and others were showing off their JBOF (see my Facebook moving to JBOF post) but there were plenty of other NVMeoF offerings at the show.

NVMeoF hardware availability

When Brocade announced their Gen6 Switches they made a point of saying that both their Gen5 and Gen6 switches currently support NVMeoF protocols. In addition to Brocade’s support, in Dec 2015 Qlogic announced support for NVMeoF for select HBAs. Also, as of  July 2016, Emulex announced support for NVMeoF in their HBAs.

From an Ethernet perspective, Qlogic has a NVMe Direct NIC which supports NVMe protocol offload for iSCSI. But even without NVMe Direct, Ethernet 40GbE & 100GbE with  iWARP, RoCEv1-v2, iSCSI SER, or iSCSI RDMA all could readily support NVMeoF on Ethernet. The nice thing about NVMeoF for Ethernet is not only do you get support for iSCSI & FCoE, but CIFS/SMB and NFS as well.

InfiniBand and Omni-Path Architecture already support native RDMA, so they should already support NVMeoF.

So hardware/firmware is already available for any enterprise AFA customer to want NVMeoF for their data center storage.

NVMeoF Software

Intel claims that ~90% of the software driver functionality of NVMe is the same for NVMeoF. The primary differences between the two seem to be the NVMeoY discovery and queueing mechanisms.

There are two fabric methods that can be used to implement NVMeoF data and command transfers: capsule mode where NVMe commands and data are encapsulated in normal fabric packets or fabric dependent mode where drivers make use of native fabric memory transfer mechanisms (RDMA, …) to transfer commands and data.

12679485_245179519150700_14553389_nA (Linux) host driver for NVMeoF is currently available from Seagate. And as a result, support for NVMeoF for Linux is currently under development, and  not far from release in the next Kernel (I think). (Mellanox has a tutorial on how to compile a Linux kernel with NVMeoF driver support).

With Linux coming out, Microsoft Windows and VMware can’t be far behind. However, I could find nothing online, aside from base NVMe support, for either platform.

NVMeoF target support is another matter but with NICs/HBAs & switch hardware/firmware and drivers presently available, proprietary storage system target drivers are just a matter of time.

Boot support is a major concern. I could find no information on BIOS support for booting off of a NVMeoF AFA. Arguably, one may not need boot support for NVMeoF AFAs as they are probably not a viable target for storing App code or OS software.

From what I could tell, normal fabric multi-pathing support should work fine with NVMeoF. This should allow for HA NVMeoF storage, a critical requirement for enterprise AFA storage systems these days.

NVMeoF advantages/disadvantages

Chelsio and others have shown that NVMeoF adds ~8μsec of additional overhead beyond native NVMe SSDs, which if true would warrant implementation on all NVMe AFAs. This may or may not impact max IOPS depending on scale-ability of NVMeoF.

For instance, servers (PCIe bus hardware) typically limit the number of private NVMe SSDs to 255 or less. With an NVMeoF, one could potentially have 1000s of shared NVMe SSDs accessible to a single server. With this scale, one could have a single server attached to a scale-out NVMeoF AFA (cluster) that could supply ~4X the IOPS that a single server could perform using private NVMe storage.

Base level NVMe SSD support and protocol stacks are starting to be available for most flash vendors and operating systems such as, Linux, FreeBSD, VMware, Windows, and Solaris. If Intel’s claim of 90% common software between NVMe and NVMeoF drivers is true, then it should be a relatively easy development project to provide host NVMeoF drivers.

The need for special Ethernet hardware that supports RDMA may delay some storage vendors from implementing NVMeoF AFAs quickly. The lack of BIOS boot support may be a minor irritant in comparison.

NVMeoF forecast

AFA storage systems, as far as I can tell, are all about selling high IOPS and very-low latency IOs. It would seem that NVMeoF would offer early adopter AFA storage vendors a significant performance advantage over slower paced competition.

In previous QoM/QoW posts we have established that there are about 13 new enterprise storage systems that come out each year. Probably 80% of these will be AFA, given the current market environment.

Of the 10.4 AFA systems coming out over the next year, ~20% of these systems pride themselves on being the lowest latency solutions in the market, and thus command high margins. One would think these systems would be the first to adopt NVMeoF. But, most of these systems have their own, proprietary flash modules and do not use standard (NVMe) SSDs and can use their own proprietary interface to their proprietary flash storage. This will delay any implementation for them until they can convert their flash storage to NVMe which may take some time.

On the other hand, most (70%) of the other AFA systems, that currently use SAS/SATA SSDs, could boost their IOP counts and drastically reduce their IO  response times, by implementing NVMe SSDs and NVMeoF. But converting SAS/SATA backends to NVMe will take time and effort.

But, there are a select few (~10%) of AFA systems, that already use NVMe SSDs in their AFAs, and for these few, they would seem to have a fast track towards implementing NVMeoF. The fact that NVMeoF is supported over all fabrics and all storage interface protocols make it even easier.

Moreover, NVMeoF has been under discussion since the summer of 2015, which tells me that astute AFA vendors have already had 18+ months to develop it. With NVMeoF host drivers & hardware available since Dec. 2015, means hardware and software exist to test and validate against.

I believe that NVMeoF will be GA’d within the next 12 months by at least one enterprise AFA system. So my QoM1610 forecast for NVMeoF is YES, with a 0.83 probability.

Comments?

 

 

 

Bar chart depicting IOPS/GB-NAND, #1 is Datacore Parallel Server with ~266 IOPS/GB-NAND,

SPC-1 IOPS performance per GB-NAND – chart of the month

Bar chart depicting IOPS/GB-NAND, #1 is Datacore Parallel Server with ~266 IOPS/GB-NAND,
(c) 2016 Silverton Consulting, All Rights Reserved

The above is an updated chart from last months SCI newsletter StorInt™ SPC Performance Report depicting the top 10 SPC-1 submissions IOPS™ per GB-NAND. We have been searching for a while now how to depict storage system effectiveness when using SSD or other flash storage. We have used IOPS/SSD in the past but IOPS/GB-NAND looks better.

Calculating IOPS/GB-NAND

SPC-1 does not report this metric but it can be calculated by dividing IOPS by NAND storage capacity. One can find out NAND storage capacity by looking over SPC-1 full disclosure reports (FDR), totaling up the NAND storage in the configuration in all the SSDs and flash devices. This is total NAND capacity, not Total ASU (used storage) Capacity. GB-NAND reflects just what’s indicated for SSD/flash device capacity in the configuration section. This is not necessarily the device’s physical NAND capacity when over provisioned, but at least it’s available in the FDR.

DataCore Parallel Server IOPS/GB-NAND explained

The DataCore Parallel Server generated over 5M IOPS (IO’s/second) under an SPC-1 (OLTP-like) workload. And with their 54-480GB SSDs, totaling ~25.9TB of NAND capacity, it gives them just under 200 IOPS/GB-NAND. The chart in the original report was incorrect.  There we used 36-480GB SSDs or ~17.3TB of NAND to compute IOPS/GB-NAND, which gave them just under 300 IOPS/GB-NAND in the report, which was incorrect. (The full report has been since corrected and is available for re-download for subscribers to our newsletter).

The 480GB (Samsung SM863 MZ-7KM480E)SSDs were all SATA attached. Samsung lists these SSDs as V-NAND, MLC drives, rated at 97K random Reads and 26K random writes. At over 5M IOPS, it should be running close to 100% of the SSDs rated performance. However, DataCore’s Parallel Server included 2 controllers with a total of 3TB of DRAM cache,  which was then SAS connected to 4 DELL MD1220 storage arrays, each with 512GB of DRAM cache, so their total configuration had about 5TB of DRAM in it, most of which would have been used as a IO cache.

The SPC-1 submission only used 11.8TB (Total ASU capacity) of storage. All the DRAM cache help to explain how they attained 5M IOPS. Having a multi-tiered cache like DataCore-MD1220 configuration, doesn’t insure that all the cache is effectively used but even without cache tiering logic, there might not be much of an overlap between the MD1220 and Parallel Server caches. It would be more interesting to see how busy the SSDs were during this SPC-1 run.

How random the SPC-1 workload is, is subject to much speculation in the industry. Suffice it to say it’s not 100% random, but what is. Non-random OLTP workloads would tend to favor larger caches.

SPC is coming out with a new version of their benchmark with supplementary information which may shed more light on device busyness.

All SPC-1 benchmark submissions are available at storageperformance.org.

Want more?

The August 2016 and our other SPC Performance reports have much more information on SPC-1 and SPC-2 performance. Moreover, there’s a lot more performance information, covering email and other (OLTP and throughput intensive) block storage workloads, in our SAN Storage Buying Guide, available for purchase on our website. More information on file and block protocol/interface performance is included in SCI’s SAN-NAS Buying Guidealso available from our website .

~~~~

The complete SPC performance report went out in SCI’s August 2016 Storage Intelligence e-newsletter.  A copy of the report will be posted on our SCI dispatches (posts) page over the next quarter or so (if all goes well).  However, you can get the latest storage performance analysis now and subscribe to future free SCI Storage Intelligence e-newsletters, by just using the signup form in the sidebar or you can subscribe here.

 

Facebook moving to JBOF (just a bunch of flash)

At Flash Memory Summit (FMS 2016) this past week, Vijay Rao, Director of Technology Strategy at Facebook gave a keynote session on some of the areas that Facebook is focused on for flash storage. One thing that stood out as a significant change of direction was a move to JBOFs in their datacenters.

As you may recall, Facebook was an early adopter of (FusionIO’s) server flash cards to accelerate their applications. But they are moving away from that technology now.

Insane growth at Facebook

Why? Vijay started his talk about some of the growth they have seen over the years in photos, videos, messages, comments, likes, etc. Each was depicted as a animated bubble chart, with a timeline on the horizontal axis and a growth measurement in % on the vertical axis, with the size of the bubble being the actual quantity of each element.

Although the user activity growth rates all started out small at different times and grew at different rates during their individual timelines, by the end of each video, they were all almost at 90-100% growth, in 4Q15 (assume this is yearly growth rate but could be wrong).

Vijay had similar slides showing the growth of their infrastructure, i.e.,  compute, storage and networking. But although infrastructure grew less quickly than user activity (messages/videos/photos/etc.), they all showed similar trends and ended up (as far as I could tell) at ~70% growth.
Continue reading “Facebook moving to JBOF (just a bunch of flash)”

Pure Storage FlashBlade well positioned for next generation storage

IMG_6344Sometimes, long after I listen to a vendor’s discussion, I come away wondering why they do what they do. Oftentimes, it passes but after a recent session with Pure Storage at SFD10, it lingered.

Why engineer storage hardware?

In the last week or so, executives at Hitachi mentioned that they plan to reduce  hardware R&D activities for their high end storage. There was much confusion what it all meant but from what I hear, they are ahead now, and maybe it makes more sense to do less hardware and more software for their next generation high end storage. We have talked about hardware vs. software innovation a lot (see recent post: TPU and hardware vs. software innovation [round 3]).
Continue reading “Pure Storage FlashBlade well positioned for next generation storage”

Has triple parity Raid time come?

Data center with hard drives
Data center with hard drives

Back at SFD10 a couple of weeks back now when visiting with Nimble Storage they mentioned that their latest all flash storage array was going to support triple-parity RAID.

And last week at a NetApp-SolidFire analyst event, someone mentioned that the new ONTAP 9 triple parity RAID-TEC™ for larger SSDs. Also heard at the meeting was that a 15.3TB SSD would take on the order of 12 hours to rebuild.

Need for better protection

When Nimble discussed the need for triple parity RAID they mentioned the report from Google I talked about recently (see my Surprises from 4 years of SSD experience at Google post). In that post, the main surprise was the amount of read errors they had seen from the SSDs they deployed throughout their data center.

I think the need for triple-parity RAID and larger (+15TB SSDs) will become more common over time. There’s no reason to think that the SSD vendors will stop at 15TB. And if it takes 12 hours to rebuild a 15TB one, I think it’s probably something like  ~30 hours to rebuild a 30TB one, which is just a generation or two away.

A read error on one SSD in a RAID group during an SSD rebuild can be masked by having dual parity. A read error on two SSDs can only be masked by having triple parity RAID.
Continue reading “Has triple parity Raid time come?”

Surprises from 4 years of SSD experience at Google

Flash field experience at Google 

Overview SSDsIn a FAST’16 article I recently read (Flash reliability in production: the expected and unexpected, see p. 67), researchers at Google reported on field experience with flash drives in their data centers, totaling many millions of drive days covering MLC, eMLC and SLC drives with a minimum of 4 years of production use (3 years for eMLC). In some cases, they had 2 generations of the same drive in their field population. SSD reliability in the field is not what I would have expected and was a surprise to Google as well.

The SSDs seem to be used in a number of different application areas but mainly as SSDs with a custom designed PCIe interface (FusionIO drives maybe?). Aside from the technology changes, there were some lithographic changes as well from 50 to 34nm for SLC and 50 to 43nm for MLC drives and from 32 to 25nm for eMLC NAND technology.
Continue reading “Surprises from 4 years of SSD experience at Google”

Intel Cloud Day 2016 news and views

 A couple of weeks back I was at Intel Cloud Day 2016 with the rest of the TFD team. We listened to a number of presentations from Intel Management team mostly about how the IT world was changing and how they planned to help lead the transition to the new cloud world.

The view from Intel is that any organization with 1200 to 1500 servers has enough scale to do a private cloud deployment that would be more economical than using public cloud services. Intel’s new goal is to facilitate (private) 10,000 clouds, being deployed across the world.

In order to facilitate the next 10,000, Intel is working hard to introduce a number of new technologies and programs that they feel can make it happen. One that was discussed at the show was the new OpenStack scheduler based on Google’s open sourced, Kubernetes technologies which provides container management for Google’s own infrastructure but now supports the OpenStack framework.

Another way Intel is helping is by building a new 1000 (500 now) server cloud test lab in San Antonio, TX. Of course the servers will be use the latest Xeon chips from Intel (see below for more info on the latest chips). The other enabling technology discussed a lot at the show was software defined infrastructure (SDI) which applies across the data center, networking and storage.

According to Intel, security isn’t the number 1 concern holding back cloud deployments anymore. Nowadays it’s more the lack of skills that’s governing how quickly the enterprise moves to the cloud.

At the event, Intel talked about a couple of verticals that seemed to be ahead of the pack in adopting cloud services, namely, education and healthcare.  They also spent a lot of time talking about the new technologies they were introducing today.
Continue reading “Intel Cloud Day 2016 news and views”