Block Storage – Silverton Consulting

StorPool, fast storage for fast times

Posted on March 11, 2019March 12, 2019 by Ray in Block Storage, Distributed computing, Storage, Storage Features, Storage performance

At Storage Field Day 18 (SFD18) a couple of weeks ago, we heard from a new company, StorPool, that provides ultra fast software defined storage for MSPs and other cloud providers. You can watch the videos of their sessions here.

Didn’t know what to make of them at first, but when they started demoing their performance, we all woke up. They ran an all read and mixed read-write IO workload, that almost blew away any other proprietary/non-proprietary storage I’ve seen before.

[Updated 12Mar2019] What they were trying to achieve was to match the performance of an Windows Server 2019 Hyper V benchmark which hit 13.8 M IOPS using 12 nodes of 384GB DRAM, 1.5TB Optane DC persistent memory, 32TB (4X8TB) NVMe SSDs and Mellanox 25Gbps RDMA ethernet, with each VM running on the server that had the VHDX file stored.

Their demo showed 70:30 R:W random 4KB mixed workload and achieved 1M IOPS with a read latency of 140 microsec. and write latency of 100 microsec. (end to end at the VM level). [Updated 12Mar2019] They were able to match the performance of a published benchmark without the 1.5TB Optane memory, without the 25Gbps RDMA Ethernet and without having the VMs and its storage running on same nodes. They were able to show this performance running StorPool, KVM and CentOS 7 across 12 nodes running both VMs and storage services.

They also showed information on a pgbench benchmark, which I was not familiar with. The chart had response times on the horizontal axis and TPS performance on the vertical axis.

What’s even more amazing is that even with the great performance they still offer reasonable data services such as CoW snapshots, asynchronous replication (with changed block tracking), thin provisioning, end to end data integrity, and iSCSI support.

Their target market is mostly MSPs and large customers moving to the private cloud configurations. They mentioned deep support for OpenNebula, [updated 12Mar2019] OpenStack, OnApp and Kubernetes which means each virtual disk is a volume/LUN. They support VMware and Windows Server/Hyper-V through iSCSI.

~~~~

The fact that they have a proprietary protocol is not that great but if they can generate the IOPS and response times they showed here with snapshot, thin provisioning and asynch replication, I’m ok with it. [Updated 12Mar2019]The fact that they were able to match the performance of the more expensive system with standard Ethernet, no Optane memory and all VMs running remote made a significant impression on me.

Want to learn more, check out these other discussions on StorPool (and other SFD18 vendors):

SFD18 – as intense as it gets by Max Mortillaro (@DarkAvenger), and

Podcast #3 review the SFD18 presenters by Chris M. Evans (@ChrisMEvans) and Matt Leib (@MBLeib).

[Updated 12Mar2019 Boyan Krosnov sent me an email indicating that the post had made some mistakes in the post which were corrected via updates above. Editors ]

Screaming IOP performance with StarWind’s new NVMeoF software & Optane SSDs

Posted on September 25, 2018September 25, 2018 by Ray in Block Storage, Distributed computing, NVMe, NVMe storage, Processing performance, SSD storage, Storage performance, Strategic Inflection Points, Uncategorized

Was at SFD17 last week in San Jose and we heard from StarWind SAN (@starwindsan) and their latest NVMeoF storage system that they have been working on. Videos of their presentation are available here. Starwind is this amazing company from the Ukraine that have been developing software defined storage.

They have developed their own NVMe SPDK for Windows Server. Intel doesn’t currently offer SPDK for Windows today, so they developed their own. They also developed their own initiator (CentOS Linux) for NVMeoF. The target system was a multicore server running Windows Server with a single Optane SSD that they used to test their software.

Extreme IOP performance consumes cores

During their development activity they tested various configurations. At the start of their development they used a Windows Server with their NVMeoF target device driver. With this configuration and on a bare metal server, they found that they could max out the Optane SSD at 550K 4K random write IOPs at 0.6msec to a single Optane drive.

When they moved this code directly to run under a Hyper-V environment, they were able to come close to this performance at 518K 4K write IOPS at 0.6msec. However, this level of IO activity pegged 100% of 8 cores on their 40 core server.

More IOPs/core performance in user mode

Next they decided to optimize their driver code and move as much as possible into user space and out of kernel space, They continued to use Hyper-V. With this level off code, they were able to achieve the same performance as bare metal or ~551K 4K random write IOP performance at the 0.6msec RT and 2.26 GB/sec level. However, they were now able to perform only pegging 2 cores. They expect to release this initiator and target software in mid October 2018!

They converted this functionality to run under ESX/VMware and were able to see much the same 2 cores pegged, ~551K 4K random write IOPS at 0.6msec RT and 2.26 GB/sec. They will have the ESXi version of their target driver code available sometime later this year.

Their initiator was running CentOS on another server. When they decided to test how far they could push their initiator, they were able to drive 4 Optane SSDs at up to ~1.9M 4K random write IOP performance.

At SFD17, I asked what they could have done at 100 usec RT and Max said about 450K IOPs. This is still surprisingly good performance. With 4 Optane SSDs and consuming ~8 cores, you could achieve 1.8M IOPS and ~7.4GB/sec. Doubling the Optane SSDs one could achieve ~3.6M IOPS, with sufficient initiators and target cores with ~14.8GB/sec.

Optane based super computer?

ORNL Summit super computer, the current number one supercomputer in the world, has a sustained throughput of 2.5 TB/sec over 18.7K server nodes. You could do much the same with 337 CentOS initiator nodes, 337 Windows server nodes and ~1350 Optane SSDs.

This would assumes that Starwind’s initiator and target NVMeoF systems can scale but they’ve already shown they can do 1.8M IOPS across 4 Optane SSDs on a single initiator server. Aand I assume a single target server with 4 Optane SSDs and at least 8 cores to service the IO. Multiplying this by 4 or 400 shouldn’t be much of a concern except for the increasing networking bandwidth.

Of course, with Starwind’s Virtual SAN, there’s no data management, no data protection and probably very little in the way of logical volume management. And the ORNL Summit supercomputer is accessing data as files in a massive file system. The StarWind Virtual SAN is a block device.

But if I wanted to rule the supercomputing world, in a somewhat smallish data center, I might be tempted to put together 400 of StarWind NVMeoF target storage nodes with 4 Optane SSDs each. And convert their initiator code to work on IBM Spectrum Scale nodes and let her rip.

Comments?

Infinidat’s ~90% (average read hit) solution – #SFD16

Posted on July 15, 2018August 7, 2018 by Ray in Block Storage, Disk storage, Market dynamics, SSD storage, Storage performance, System effectiveness, Visionary leadershp

We attended a SFD16 session at Infinidat‘s offices in Waltham, MA a couple of weeks ago.

The session was led off by Doc D’Errico (@docderrico) but CTO, Brian Carmody (@initzero) did most of the talking and Erik Kaulberg (@ekaulberg) discussed some of their new products. I have known Doc, Erik and Brian for years and all of them are industry heavyweights, having worked at major and startup storage companies for decades.

Infinidat’s challenge

The challenge that Infinidat has is how to perform as well as (or better than) an all flash array when you have hybrid flash – disk storage.

The advantage of spinning disk is that it’s relatively cheap ($/GB) storage with good throughput, great reliability and reasonable volumetric density (GB/mm**3). However, its random read access time is much, two orders of magnitude worse than flash (~10 msec vs. ~100 µsec).

Fortunately, DRAM has a random access time of ~100 nsec which is 3 orders of magnitude better than flash. A manager I had once said everyone wants their data stored on tape but accessed out of memory.

So the problem comes down to insuring that application data is sitting in DRAM (cache) when requested. This is called a read (cache) hit. [Write hits are easier because they are typically written directly into memory and later destaged. So essentially all writes are cache hits.]

Mainframes vs. open systems

In the old days with MVS on mainframes, read hit rates of ~70%+ were considered good and doable. Over time MVS and z/OS, its followon, started providing hints about what IO’s were coming next. With mainframe hints, storage systems started hitting ~90% read hit rates.

Open systems never seemed to come close to those hit rates. A typical open system hit rate was ~50% for well behaved applications. For VMware and it’s IO mixer, a 50% read hit rate was aspirational and hardly ever achieved in reality. This has improved over time, but nothing comes close to 90% read hit rates over extended periods.

So, for a storage systems in open system environments to average an 90% DRAM read hit cache hit rate is unheard of, and only seen for brief intervals at best, with especially well behaved applications and not under virtualization. For a customer to see an average DRAM hit rate exceeding 90%, over the course of multiple days was inconceivable,.

And yet that’s exactly what Brian’s showing in the photo above, an average over 1 week of over 95% DRAM read hit rate over a whole week, for an major retail/e-commerce/ERP customer during one of the heaviest, if not the heaviest activity weeks of the year, The data set size was 1.5PB.

How is this possible, what does Infinidat do differently to predict which data applications need moment to moment, over the course of the heaviest retail week in the year.

Infinidat’s Neural caching algorithm

It all starts with writes. When data is written to Infinidat, it records a terrain map of all the other data that has been written recently. I suppose one could think of this as a 2 dimensional map, with spots on the map being the equivalent of data in the storage system that have recently been written. This map changes over time so it’s more like a movie stream of frames showing, from frame to frame all the recently written data in the system at any point in time. Of course the frame rate for this stream is the IO rate.

When a IO request comes in for a specific record, Infinidat uses an index to locate the last time (frame) the record was written, and using this snapshot of all data written at the same time, it reads into cache all the other data written at that time.

This additional data could be read from disk or SSD. Read throughput is slightly better for flash over disk but not orders of magnitude.

In any case, any system could do this sort of caching algorithm, iff they had the processing power needed, had the metadata layout which made recording the IO stream frame by frame space efficient, had the metadata indexing which would enable them to locate the last frame a record was written in AND had the IO parallelism required to do a whole lot of IO all the time to keep that DRAM cache filled with hit candidates. Did I mention that Infinidat uses a three controller storage system, unlike the rest of the industry that uses a two controller system. This gives them 50% more horse power and data paths to get data into cache.

Brian goes into some depth on Neural caching implementation but there’s plenty of secret sauce behind it.

~~~~

Somewhere in his presentation, Brian stated that across their entire customer base (~3.4EB @ end of last quarter), they average a 90% read (DRAM) cache hit rate, inconceivable in the old days, nigh impossible today. Of course it only gets better from here.

If a hybrid system can continuously average a 95% read cache hit rate for a customer over weeks of IO, there’s no reason that system couldn’t outperform an AFA, even an NVMeoF AFA storage system.

I suppose cache hit rates like these could be application dependent but they didn’t seem to say anything about specific verticals they were targeting. And at 3+ EB it doesn’t appear to be application specific.

Comments?

For more information you may want to see these other SFD16 participants write ups on Infinidat:

Infinidat, and #SFD16, Data is the new currency by Matt Leib (@MBLeib)

Huawei presents OceanStor architecture at SFD15

Posted on May 21, 2018May 21, 2018 by Ray in Block Storage, Clustered storage, Data compression, Data reduction, IOPS, LRT, NVMe storage, SPC-1, SSD storage, Storage architecture, Storage Features, Storage performance

At Storage Field Day 15 (SFD15) we had a few sessions with Huawei, on some of their latest storage technology. One of the sessions I was particularly interested in was, OceanStor Dorado (enterprise class, block storage), an architectural deep dive with Chun Liu, (see video here).

Their latest OceanStor Dorado 18000F storage system, due out soon, can scale up to 16 controllers in a cluster, supporting all flash storage configurations. The new Dorado 18000F block storage system supports inline compression and deduplication for data reduction.

The latest SPC-1 performance showed 800K IOPS at 500usec response time with dedupe and inline compression turned on. Although, it’s unclear whether SPC-1 data is deduplicable or compressible. So this may have hurt them with no corresponding advantage in capacity or cost.

System architecture

Chun had one chart that said historically as you add storage system features you often lose 70-80% performance. However, with their implementation using shards of metadata/other data structures and not using (as much) serialization, they have managed to add features without serious performance impact. In fact with the latest architecture, using RAID-TP (3 parity), inline compression, inline deduplication and metro cluster, they lose only about 20% of their baseline system performance. Although, if the metro cluster their using is synchronous replication, it must not be that far away.

They have a pretty standard protocol layer at the top, replication, snapshot and LUN management below that with a cache layer next. Then it gets interesting, they have a distributed object router layer, with deduplication/compression and metadata management underneath that and then the data layout. With infrastructure (backend) at the bottom and inter-cluster communications that span the cluster of controllers. Every enclosure has 2 controllers and inter-cluster communications is over switched PCIe. SSDs can be NVMe or SAS.

IO without serialization

They support a log structured file system on the back end but not just one log. Their internal architecture is a share nothing approach which shards metadata, fingerprint data bases, logs, and other data. Each of these shards is assigned with CPU core/thread affinity and as long as, nothing goes wrong, the storage code operates on shards with no serialization required.

To maximize IO performance they use a lightweight thread (LWT) compute model, that’s non-preemptive. They partition all data structures into fine shards, such that within each shard. Each metadata shard’s is assigned to have a core/thread affinity. That way they can share nothing across compute threads resulting in lock free execution. The LWT runs beginning to end, without preemption, to complete any data updates required and minimize any contention.

IO flow

Write flow: the system receives data in cache, mirrors it to the adjacent controllers cache and then responds back to the host. Controller cache is battery backed up, non volatile storage.

The cache data is then compressed and with deduplication active, fingerprinted. Data fingerprints are used to determine which fingerprint database shard (and subsequent core/thread) to route the data to for further processing. They also compare any matched fingerprinted data to the unique data already stored, because of their “weak” fingerprint hash. If the data is unique, it’s routed the LUN mapping shard (and subsequent core/thread) to calculate a physical address to write the data. Sometime later the data is routed to RAID aggregation and written out to backend SSDs.

Read flow: when the request comes, they check the LUN map shard (core/thread) and if it’s pointing to a fingerprint index they know it’s deduped block and then read that data to respond to the read request.

Other optimizations

They have some specially, designed, optimized code paths. For example, standard RAID TP algorithms perform RAID protection at 2.3GB/sec or 4.5GB/s but Huawei OceanStor Dorada 18000F can perform triple RAID calculations at 6.5GB/s. Similarly, standard LZ4 data compression algorithms can compress data at ~507MB/sec (on email) but Huawei’s data compression algorithm can perform compression (on email) at ~979MB/s. Ditto for CRC16 (used to check block integrity). Traditional CRC16 algorithms operate at ~2.3GB/sec but Hauwei can sustain ~7.2GB/s.

For data on SSDs, they identify data with a short life span (quickly overwritten) and try to coalesce this short lived data onto their own flash pages. That way all the data in a short life span flash page get’s freed up together, which can then be overwritten, without having to move old, non-deleted (long lived) data to new blocks. They claim to have reduced write amplification (non-new data block writes) by 60% this way.

Also LUNs can be configured as throughput optimized or IOPs optimized. Unclear how, but it probably has something to do with cache management and backend layout.

~~~~

Overall, I was impressed with their capabilities to reduce serialization bottlenecks. Back in the old days, when I was looking for how to optimize code, we always seemed to be spending 30-50% of CPU compute spinning on locks, waiting to obtain a lock before the system could continue the code execution.

It never occurred to me we didn’t have to use locks at all.

For more information, please read these other SFD15 blogger posts on Huawei:

Dorado – All about Speed – Storage Gaga, Chin-Fah Heoh (@StorageGaga)
Huawei – Probably Not What You Expected, Dan Firth (@PenguinPunk)

NetApp’s new NVMeoF/FC AFF & Cloud Data Volumes for every cloud

Posted on May 14, 2018May 14, 2018 by Ray in Block Storage, Cloud services, Cloud storage, Clustered storage, FC, File Storage, NVMe, SSD storage, Storage, Storage performance, Strategic Inflection Points

We attended a NetApp analyst event in their CA HQ last week and they had some interesting announcements as well other information to share. 1st up new faster ONTAP storage.

NVMeoF AFF

NetApp announced this week that their latest generation AFF (All Flash FAS) systems will support FC NVMeoF. We asked if this was just for NVMe SSDs or did it apply to all AFF media. The answer was it’s just another host interface which the customer can license for NVMe SSDs (available only on AFF F800) or SAS SSDs (A700S, A700, and A300). The only AFF not supporting the new host interface is their lowend AFF A220.

As for which NVMeoF, they only support FC at the moment, and it’s our belief that the FC NVMeoF spec is most well defined these days and the FC switch hardware (Brocade-Broadcom since Gen 5, now shipping Gen 6, Cisco not sure) already has NVMeoF support.

NetApp also mentioned support for 100GbE (A800 & A700S only) and 32Gbs FC hardware (all AFF systems but A220). So, presumably they offer NVMeoF for both 32Gbps and 16Gbps FC.

No word on when this will be available for Ethernet FCoE or iSCSI (iNVMe?) but with all the major storage vendors bar one, moving to NVMe SSDs it’s only a matter of time before they also support Ethernet NVMeoF.

As for AFF NVMeoF performance, the answer wasn’t entirely satisfactory. The indication was that the interface reduced response time by 10 usecs or so for NVMe SSDs over SAS SSDs. But I didn’t see any other performance information to substantiate that.

We did see on their AFF d atasheet that with NVMe SSDs and NVMeoF FC, the AFF A800 response time was sub 200usec with throughput of 300GB/s (in a 24 node cluster, 12 HA pairs). This means they add only about 100usec for ONTAP data services, a decent trade off from our perspective. Later in their datasheet they say the A800 is capable of 1.3M IOPS and sub-500usec latencies. Unsure why they quoted both numbers.

Cloud Data Volumes

NetApp is taking storage to the cloud. They just announced that NetApp Cloud Data Volumes will be available as a native service under Google Cloud Platform (GCP). NetApp Cloud Data Volume is a storage-as-a-service offering that provides on demand ONTAP file services in the cloud.

For GCP, both Google and NetApp will be offering the service. Dianne Green, GCP VP said Cloud Data Volumes are a bit like Kubernetes, disruption without disrupting. Customers can easily migrate their onprem file based applications to the cloud without having to worry about the performance of their data or data protection for that matter.

Getting the data there is another matter, but NetApp has other services like CloudSync and someday (maybe for Cloud Data Volumes), SnapMirror, which can help customers move data to and from the cloud.

Currently Cloud Data Volumes are in public preview as an Microsoft Azure Enterprise NFS (and SMB) service. It’s also in beta (I think) in AWS marketplace. And availability on GCP is still restricted. There’s a lot of emphasis at NetApp events on Cloud Data Volumes given its current status on public cloud providers but we think they are trying to gain some experience before they roll it out to the rest of the world.

However, Jean English, NetApp CMO mentioned that NetApp’s Cloud Data Service business unit has over 1800 customers and currently supports a multi-PB storage footprint in various clouds. Note, this is not just Cloud Data Volumes but comprises all NetApp Cloud Data Services, which includes ONTAP Cloud, NPS, CloudSync, AltaVault, etc. Nonetheless, it’s an impressive indicator of just how far they have come in applying their storage magic to the public cloud in a short time. The hyperscalers (read public cloud providers) say NetApp is 2 or more years ahead of all the other competition and from what we can see, it’s true.

One of the key differentiators between NetApp Cloud Data Volumes and ONTAP Cloud is performance SLAs. Cloud Data Volume customers can select and purchase a specified performance SLA. We believe it comes at three levels and is normally purchased on a pay as you go, consumption based, service offering. However, it’s also available to be billed periodically, other purchase options may be available as well.

When asked what storage was behind the service, the only thing NetApp would confirm was that it was ONTAP storage, present in public cloud data centers in various regions. So Cloud Data Volumes is available in only specific regions but I would expect that to expand over time.

Data Visualization Center

They also christened their new Data Visualization Center (DVC) and we had a multi-course meal at the Bistro at the center. The DVC had a wrap around, 1.5 floor tall screen which showed some of NetApp customer success stories. Inside the screen was a more immersive setting and there was plenty of VR equipment in work spaces alongside customer conference rooms.

Full Disclosure: NetApp paid for all our travel, hotel and food during the analyst event and gave us all Google Home Minis as going away presents and NetApp is a long time customer of my firm.

Axellio, next gen, IO intensive server for RT analytics by X-IO Technologies

Posted on June 22, 2017 by Ray in Block Storage, Data reduction, Distributed computing, Machine Learning, NVMe, NVMe storage, PCIe, Storage performance, Strategic Inflection Points, Strategic planning

We were at X-IO Technologies last week for SFD13 in Colorado Springs talking with the team and they showed us their new IO and storage intensive server, the Axellio. They want to sell Axellio to customers that need extreme IOPS, very high bandwidth, and large storage requirements. Videos of X-IO’s sessions at SFD13 are available here.

The hardware

Axellio comes in 2U appliance with two server nodes. Each server supports 2 sockets of Intel E5-26xx v4 CPUs (4 sockets total) supporting from 16 to 88 cores. Each server node can be configured with up to 1TB of DRAM or it also supports NVDIMMs.

There are two key differentiators to Axellio:

The FabricExpress™, a PCIe based interconnect which allows both server nodes to access dual-ported, 2.5″ NVMe SSDs; and
Dense drive trays, the Axellio supports up to 72 (6 trays with 12 drives each) 2.5″ NVMe SSDs offering up to 460TB of raw NVMe flash using 6.4TB NVMe SSDs. Higher capacity NVMe SSDS available soon will increase Axellio capacity to 1PB of raw NVMe flash.

They also probably spent a lot of time on packaging, cooling and power in order to make Axellio a reliable solution for edge computing. We asked if it was NEBs compliant and they told us not yet but they are working on it.

Axellio can also be configured to replace 2 drive trays with 2 processor offload modules such as 2x Intel Phi CPU extensions for parallel compute, 2X Nvidia K2 GPU modules for high end video or VDI processing or 2X Nvidia P100 Tesla modules for machine learning processing. Probably anything that fits into Axellio’s power, cooling and PCIe bus lane limitations would also probably work here.

At the frontend of the appliance there are 1x16PCIe lanes of server retained for networking that can support off the shelf NICs/HCAs/HBAs with HHHL or FHHL cards for Ethernet, Infiniband or FC access to the Axellio. This provides up to 2x100GbE per server node of network access.

Performance of Axellio

With Axellio using all NVMe SSDs, we expect high IO performance. Further, they are measuring IO performance from internal to the CPUs on the Axellio server nodes. X-IO says the Axellio can hit >12Million IO/sec with at 35µsec latencies with 72 NVMe SSDs.

Lab testing detailed in the chart above shows IO rates for an Axellio appliance with 48 NVMe SSDs. With that configuration the Axellio can do 7.8M 4KB random write IOPS at 90µsec average response times and 8.6M 4KB random read IOPS at 164µsec latencies. Don’t know why reads would take longer than writes in Axellio, but they are doing 10% more of them.

Furthermore, the difference between read and write IOP rates aren’t close to what we have seen with other AFAs. Typically, maximum write IOPs are much less than read IOPs. Why Axellio’s read and write IOP rates are so close to one another (~10%) is a significant mystery.

As for IO bandwitdh, Axellio it supports up to 60GB/sec sustained and in the 48 drive lax testing it generated 30.5GB/sec for random 4KB writes and 33.7GB/sec for random 4KB reads. Again much closer together than what we have seen for other AFAs.

Also noteworthy, given PCIe’s bi-directional capabilities, X-IO said that there’s no reason that the system couldn’t be doing a mixed IO workload of both random reads and writes at similar rates. Although, they didn’t present any test data to substantiate that claim.

Markets for Axellio

They really didn’t talk about the software for Axellio. We would guess this is up to the customer/vertical that uses it.

Aside from the obvious use case as a X-IO’s next generation ISE storage appliance, Axellio could easily be used as an edge processor for a massive fabric of IoT devices, analytics processor for large RT streaming data, and deep packet capture and analysis processing for cyber security/intelligence gathering, etc. X-IO seems to be focusing their current efforts on attacking these verticals and others with similar processing requirements.

X-IO Technologies’ sessions at SFD13

Other sessions at X-IO include: Richard Lary, CTO X-IO Technologies gave a very interesting presentation on an mathematically optimized way to do data dedupe (caution some math involved); Bill Miller, CEO X-IO Technologies presented on edge computing’s new requirements and Gavin McLaughlin, Strategy & Communications talked about X-IO’s history and new approach to take the company into more profitable business.

Again all the videos are available online (see link above). We were very impressed with Richard’s dedupe session and haven’t heard as much about bloom filters, since Andy Warfield, CTO and Co-founder Coho Data, talked at SFD8.

For more information, other SFD13 blogger posts on X-IO’s sessions:

SFD13 primer – X-IO Axellio Edge Computing Platform by Max Mortillaro (@Darkkavenger)
X-IO Technology – a #SFD13 preview by Mike Preston (@mwpreston)

Full Disclosure

X-IO paid for our presence at their sessions and they provided each blogger a shirt, lunch and a USB stick with their presentations on it.

Dreaming of SCM but living with NVDIMMs…

Posted on December 8, 2016December 16, 2016 by Ray in Block Storage, Data compression, File Storage, storage class memory

Last months GreyBeards on Storage podcast was with Rob Peglar, CTO and Sr. VP of Symbolic IO. Most of the discussion was on their new storage product but what also got my interest is that they are developing their storage system using NVDIMM technologies.

In the past I would have called NVDIMMs NonVolatile RAM but with the latest incarnation it’s all packaged up in a single DIMM and has both NAND and DRAM on board. It looks a lot like 3D XPoint but without the wait.

The first time I saw similar technology was at SFD5 with Diablo Technologies and SANdisk, a Western Digital company (videos here and here). At that time they were calling them UltraDIMM and memory class storage. ULTRADIMMs had an onboard SSD and DRAM and they provided a sort of virtual memory (paged) access to the substantial (SSD) storage behind the DRAM page(s). I wrote two blog posts about UltraDIMM and MCS (called MCS, UltraDIMM and memory IO, the new path ahead part1 and part2).

NVDIMM defined

NVDIMMs are currently available today from Micron, Crucial, NetList, Viking, and probably others. With today’s NVDIMM there is no large SSD (like ULTRADIMMs, just backing flash) and the complete storage capacity is available from the DRAM in the NVDIMM. At power reset, the NVDIMM sort of acts like virtual memory paging in data from the flash until all the data is in DRAM.

NVDIMM hardware includes control logic, DRAM, NAND and SuperCAPs/Batteries together in one DIMM. DRAM is used for normal memory traffic but in the case of a power outage, the data from DRAM is offloaded onto the NAND in the NVDIMM using the SuperCAP/Battery to hold up the DRAM memory just long enough to transfer it to flash..

Th problem with good, old DRAM is that it is volatile, which means when power is gone so is your data. With NVDIMMs (3D XPoint and other new non-volatile storage class memories also share this characteristic), when power goes away your data is still available and persists across power outages.

For example, Micron offers an 8GB, JEDEC DDR4 compliant, 288-pin NVDIMM that has 8GB of DRAM and 16GB of SLC flash in a single DIMM. Depending on part, it has 14.9-16.2GB/s of bandwidth and 1866-2400 MT/s (million memory transfers/second). Roughly translating MT/s to IOPS, says with ~17GB/sec and at an 8KB block size, the device should be able to do ~2.1 MIO/s (million IO operations per second [never thought I would need an acronym for that]).

Another thing that makes NVDIMMs unique in the storage world is that they are byte addressable.

Hardware – check, Software?

SNIA has a NVM Programming (NVMP) Technical Working Group (TWG), which has been working to help adoption of the new technology. In addition to the NVMP TWG, there’s pmem.io, SANdisk’s NVMFS (2013 FMS paper, formerly known as DirectFS) and Intel’s pmfs (persistent memory file system) GitHub repository. Couldn’t find any GitHub for NVMFS but both pmem.io and pmfs are well along the development path for Linux.

The TWG identified a three prong approach to NVDIMM adoption: crawl, walk, run (see pmem.io blog post for more info).

The Crawl approach uses standard block and file system drivers on Linux to talk to a NVDIMM driver. This way has the benefit of being well tested, well known and widely available (except for the NVDIMM driver). The downside is that you have a full block IO or file IO stack in front of a device that can potentially do 2.1 MIO/s and it is likely to cause a lot of overhead reducing this potential significantly.
The Walk approach uses a persistent memory file system (pmfs?) to directly access the NVDIMM storage using memory mapped IO. The advantage here is that there’s absolutely no kernel code active during a NVDIMM data access. But building a file system or block store up around this may require some application level code.
The Run approach wasn’t described well in the blog post but it seems like SANdisk’s NVMFS approach which uses both standard NVMe SSDs and non-volatile memory to build a hybrid (NVDIMM-SSD) file system.

Symbolic IO as another run approach?

Symbolic IO computationally defined storage is intended to make use of NVDIMM technology and in the Store [update 12/16/16] appliance version has SSD storage as well in a hybrid NVDIMM-SSD run-like solution. The appliance has a full version of Linux SymCE which doesn’t use a file system or the PMEM library to access the data, it’s just byte addressable storage ~~with a PMEM file system embedded within~~ [update 12/16/16]. This means that applications can use standard Linux file APIs to (directly) reference NVDIMM and the backend SSD storage.

It’s computationally defined because they use compute power to symbolically transform the data reducing data footprint in NVDIMM and subsequently in the SSD backing tier. Checkout the podcast to learn more

I came away from the podcast thinking that NVDIMMs are more prevalent than I thought. So, that’s what prompted this post.

Comments?

Photo Credit(s): UltraDIMM photo taken by Ray at SFD5, Architecture picture from pmem.io blog post,

Microsoft ESRP database access latencies – chart of the month

Posted on November 14, 2016 by Ray in Block Storage, Disk storage, ESRP, ESRPv4/Exchange 2013, SSD storage, Storage performance

The above chart was included in last month’s SCI e-Newsletter and depicts recent Microsoft Exchange (2013) Solution Reviewed Program (ESRP) results for database access latencies. Storage systems new to this 5000 mailbox and over category include, Oracle, Pure and Tegile. As all of these systems are all flash arrays we are starting to see significant reductions in database access latencies and the #4 system (Nimble Storage) was a hybrid (disk and flash) array.

As you recall, ESRP reports on three database access latencies: read database, write database and log write. All three are shown above but we sort and rank this list based on database read activity alone.

Hard to see above, but reading the ESRP reports, one finds that the top 3 systems had 1.04, 1.06 and 1.07 milliseconds. average read database latencies. So the separation between the top 3 is less than 40 microseconds.

The top 3 database write access latencies were 1.75, 1.62 and 3.07 milliseconds, respectively. So if we were ranking the above on write response times Pure would have come in #1.

The top 3 log write access latencies were 0.67, 0.41 and 0.82 milliseconds and once again if we were ranking based on log response times Pure would be #1.

It’s unclear whether Exchange customers would want to deploy AFAs for their database and log files but these three ESRP reports and Nimble’s show that there should be no problem with the performance of AFAs in these environments.

What about data reduction?

Unclear to me is how much data reduction technologies played in the AFA and hybrid solution ESRP performance. Data reduction advantages would most likely show up in database IOPS counts more so than response times but if present, may still reduce access latencies, as there would be potentially less data to be transferred to-from the backend of the storage system into-out of storage system cache.

ESRP reports do not officially report on a vendor’s data reduction effectiveness, so we are left with whatever the vendor decides to say.

In that respect, Pure FlashArray//m20 indicated in their ESRP report that their “data reduction is significantly higher” than what they see normally (4:1) because Jetstress (ESRP benchmark program) generates lots of duplicated data.

I couldn’t find anything that Tegile T3800 or Nimble Storage said similar to the above, indicating how well their data reduction technologies worked in Jetstress as compared to normal. However, they did make a reference to their compression effectiveness in database size but I have found this number to be somewhat less effective as it historically showed the amount of over provisioning used by disk-only systems and for AFA’s and hybrid – storage, it’s unclear how much is data reduction effectiveness vs. over provisioning.

For example, Pure, Tegile and Nimble also reported a “database capacity utilization” of 4.2%, 60% and 74.8% respectively. And Nimble did report that over their entire customer base, Exchange data has on average, a 61.2% capacity savings.

So you tell me what was the effective data reduction for their Pure’s, Tegile’s and Nimble’s respective Jetstress runs? From my perspective Pure’s report of 4.2% looks about right (that says that actual database data fit in 4.2% of SSD storage for a ~23.8:1 reduction effectiveness for Jetstress/ESRP data. I find it harder to believe what Tegile and Nimble have indicated as it doesn’t seem to be as believable as they would imply a 1.7:1 and 1.33:1 reduction effectiveness for Jetstress/ESRP data.

Oracle FS1-2 doesn’t seem to have any data reduction capabilities and reported a 100% storage capacity used by Exchange database.

So that’s it, Jetstress uses “significantly reducible” data for some AFAs systems. But in the field the advantage of data reduction techniques are much less so.

I think it’s time that ESRP stopped using significantly reducible data in their Jetstress program and tried to more closely mimic real world data.

Want more?

The October 2016 and our other ESRP reports have much more information on Microsoft Exchange performance. Moreover, there’s a lot more performance information, covering email and other (OLTP and throughput intensive) block storage workloads, in our SAN Storage Buying Guide, available for purchase on our website. More information on file and block protocol/interface performance is also included in SCI’s SAN-NAS Buying Guide, also available from our website. And if your interested in file system performance please consider purchasing our NAS Buying Guide also available on our website.

~~~~

The complete ESRP performance report went out in SCI’s October 2016 Storage Intelligence e-newsletter. A copy of the report will be posted on our SCI dispatches (posts) page over the next quarter or so (if all goes well). However, you can get the latest storage performance analysis now and subscribe to future free SCI Storage Intelligence e-newsletters, by just using the signup form in the sidebar or you can subscribe here.

As always, we welcome any suggestions or comments on how to improve our ESRP performance reports or any of our other storage performance analyses.