Screaming IOP performance with StarWind’s new NVMeoF software & Optane SSDs

Was at SFD17 last week in San Jose and we heard from StarWind SAN (@starwindsan) and their latest NVMeoF storage system that they have been working on. Videos of their presentation are available here. Starwind is this amazing company from the Ukraine that have been developing software defined storage.

They have developed their own NVMe SPDK for Windows Server. Intel doesn’t currently offer SPDK for Windows today, so they developed their own. They also developed their own initiator (CentOS Linux) for NVMeoF. The target system was a multicore server running Windows Server with a single Optane SSD that they used to test their software.

Extreme IOP performance consumes cores

During their development activity they tested various configurations. At the start of their development they used a Windows Server with their NVMeoF target device driver. With this configuration and on a bare metal server, they found that they could max out the Optane SSD at 550K 4K random write IOPs at 0.6msec to a single Optane drive.

When they moved this code directly to run under a Hyper-V environment, they were able to come close to this performance at 518K 4K write IOPS at 0.6msec. However, this level of IO activity pegged 100% of 8 cores on their 40 core server.

More IOPs/core performance in user mode

Next they decided to optimize their driver code and move as much as possible into user space and out of kernel space, They continued to use Hyper-V. With this level off code, they were able to achieve the same performance as bare metal or ~551K 4K random write IOP performance at the 0.6msec RT and 2.26 GB/sec level. However, they were now able to perform only pegging 2 cores. They expect to release this initiator and target software in mid October 2018!

They converted this functionality to run under ESX/VMware and were able to see much the same 2 cores pegged, ~551K 4K random write IOPS at 0.6msec RT and 2.26 GB/sec. They will have the ESXi version of their target driver code available sometime later this year.

Their initiator was running CentOS on another server. When they decided to test how far they could push their initiator, they were able to drive 4 Optane SSDs at up to ~1.9M 4K random write IOP performance.

At SFD17, I asked what they could have done at 100 usec RT and Max said about 450K IOPs. This is still surprisingly good performance. With 4 Optane SSDs and consuming ~8 cores, you could achieve 1.8M IOPS and ~7.4GB/sec. Doubling the Optane SSDs one could achieve ~3.6M IOPS, with sufficient initiators and target cores with ~14.8GB/sec.

Optane based super computer?

ORNL Summit super computer, the current number one supercomputer in the world, has a sustained throughput of 2.5 TB/sec over 18.7K server nodes. You could do much the same with 337 CentOS initiator nodes, 337 Windows server nodes and ~1350 Optane SSDs.

This would assumes that Starwind’s initiator and target NVMeoF systems can scale but they’ve already shown they can do 1.8M IOPS across 4 Optane SSDs on a single initiator server. Aand I assume a single target server with 4 Optane SSDs and at least 8 cores to service the IO. Multiplying this by 4 or 400 shouldn’t be much of a concern except for the increasing networking bandwidth.

Of course, with Starwind’s Virtual SAN, there’s no data management, no data protection and probably very little in the way of logical volume management. And the ORNL Summit supercomputer is accessing data as files in a massive file system. The StarWind Virtual SAN is a block device.

But if I wanted to rule the supercomputing world, in a somewhat smallish data center, I might be tempted to put together 400 of StarWind NVMeoF target storage nodes with 4 Optane SSDs each. And convert their initiator code to work on IBM Spectrum Scale nodes and let her rip.


Skyrmion and chiral bobber solitons for racetrack storage

Read an article this week in Science Daily (Magnetic skyrmions: Not the only one of their class; …) about new magnetic structures that could lend themselves to creating a new type of moving, non-volatile storage.  (There’s more information in the press release and the Nature paper [DOI: 10.1038/s41565-018-0093-3], behind a paywall).

Skyrmions and chiral bobbers are both considered magnetic solitons, types of magnetic structures only 10’s of nm wide, that can move around, in sort of a race track configuration.

Delay line memories

Early in computing history, there was a type of memory called a delay line memory which used various mechanisms (mercury, magneto-resistence, capacitors, etc.) arranged along a circular line such as a wire, and had moving pulses of memory that raced around it. .

One problem with delay line memory was that it was accessed sequentially rather than core which could be accessed randomly. When using delay lines to change a bit, one had to wait until the bit came under the read/write head . It usually took microseconds for a bit to rotate around the memory line and delay line memories had a capacity of a few thousand bits 256-512 bytes per line,  in today’s vernacular.

Delay lines predate computers and had been used for decades to delay any electronic or acoustic signal before retransmission.

A new racetrack

Solitons are being investigated to be used in a new form of delay line memory, called racetrack memory. Skyrmions had been discovered a while ago but the existence of chiral bobbers was only theoretical until researchers discovered them in their lab.

Previously, the thought was that one would encode digital data with only skyrmions and spaces. But the discovery of chiral bobbers and the fact that they can co-exist with skyrmions, means that chiral bobbers and skyrmions can be used together in a racetrack fashion to record digital data.  And the fact that both can move or migrate through a material makes them ideal for racetrack storage.

Unclear whether chiral bobbers and skyrmions only have two states or more but the more the merrier for storage. I am assuming that bit density or reliability is increased by having chiral bobbers in the chain rather than spaces.

Unlike disk devices with both rotating media and moving read-write heads, the motion of skyrmion-chiral bobber racetrack storage is controlled by a very weak pulse of current and requires no moving/mechanical parts prone to wear/tear. Moreover, as a solid state devices, racetrack memory is not sensitive to induced/organic vibration or shock,  So, theoretically these devices should have higher reliability than disk devices.

There was no information comparing the new racetrack memory reliability to NAND or 3D Crosspoint/PCM SSDs, but there may be some advantage here as well. I suppose one would need to understand how to miniaturize the read-erase-write head to the right form factor for nm racetracks to understand how it compares.

And I didn’t see anything describing how long it takes to rotate through bits on a skyrmion-chiral bobber racetrack. Of course, this would depend on the number of bits on a racetrack, but some indication of how long it takes one bit to move, one postition on the racetrack would be helpful to see what its rotational latency might be.


At the moment, reading and writing skyrmions and the newly discovered chiral bobbers takes a lot of advanced equipment and is only done in major labs. As such, I don’t see a skyrmion-chiral bobber racetrack storage device arriving on my desktop anytime soon. But the fact that there’s a long way to go before, we run out of magnetic storage options, even if it is on a chip rather than magnetic media,  is comforting to know. Even if we don’t ever come up with an economical way to produce it.

I wonder if you could synchronize rotational timing across a number of racetrack devices, at least that way you could be reading/erasing/writing a whole byte, word, double word etc, at a time, rather than a single bit.


Photo Credit(s): From Experimental observation of chiral magnetic bobbers in B20 Type FeGe paper

From Experimental observation of chiral magnetic bobbers in B20 Type FeGe paper

From Timeline of computer history Magnetoresistive delay lines

From Experimental observation of chiral magnetic bobbers in B20 Type FeGe paper

NetApp’s new NVMeoF/FC AFF & Cloud Data Volumes for every cloud

We attended a NetApp analyst event in their CA HQ last week and they had some interesting announcements as well other information to share. 1st up new faster ONTAP storage.


NetApp announced this week that their latest generation AFF (All Flash FAS) systems will support FC NVMeoF. We asked if this was just for NVMe SSDs or did it apply to all AFF media. The answer was it’s just another host interface which the customer can license for NVMe SSDs (available only on AFF F800) or SAS SSDs (A700S, A700, and A300). The only AFF not supporting the new host interface is their lowend AFF A220.

As for which NVMeoF, they only support FC at the moment, and it’s our belief that the FC NVMeoF spec is most well defined these days and the FC switch hardware (Brocade-Broadcom since Gen 5, now shipping Gen 6, Cisco not sure) already has NVMeoF support.

NetApp also mentioned support for 100GbE (A800 & A700S only) and 32Gbs FC hardware (all AFF systems but A220). So, presumably they offer NVMeoF for both 32Gbps and 16Gbps FC.

No word on when this will be available for Ethernet FCoE or iSCSI (iNVMe?) but with all the major storage vendors bar one, moving to NVMe SSDs it’s only a matter of time before they also support Ethernet NVMeoF.

As for AFF NVMeoF performance, the answer wasn’t entirely satisfactory. The indication was that the interface reduced response time by 10 usecs or so for NVMe SSDs over SAS SSDs. But I didn’t see any other performance information to substantiate that.

We did see on their AFF datasheet that with NVMe SSDs and NVMeoF FC, the AFF A800 response time was sub 200usec with throughput of 300GB/s (in a 24 node cluster, 12 HA pairs). This means they add only about 100usec for ONTAP data services, a decent trade off from our perspective. Later in their datasheet they say the A800 is capable of 1.3M IOPS and sub-500usec latencies. Unsure why they quoted both numbers.

Cloud Data Volumes

NetApp is taking storage to the cloud. They just announced that NetApp Cloud Data Volumes will be available as a native service under Google Cloud Platform (GCP). NetApp Cloud Data Volume is a storage-as-a-service offering that provides on demand ONTAP file services in the cloud.

For GCP,  both Google and NetApp will be offering the service. Dianne Green, GCP VP said Cloud Data Volumes are a bit like Kubernetes, disruption without disrupting. Customers can easily migrate their onprem file based applications to the cloud without having to worry about the performance of their data or data protection for that matter.

Getting the data there is another matter, but NetApp has other services like CloudSync and someday (maybe for Cloud Data Volumes), SnapMirror, which can help customers move data to and from the cloud.

Currently Cloud Data Volumes are in public preview as an Microsoft Azure Enterprise NFS (and SMB) service. It’s also in beta (I think) in AWS marketplace. And availability on GCP is still restricted. There’s a lot of emphasis at NetApp events on Cloud Data Volumes given its current status on public cloud providers but we think they are trying to gain some experience before they roll it out to the rest of the world.

However,  Jean English, NetApp CMO mentioned that NetApp’s Cloud Data Service business unit has over 1800 customers and currently supports a multi-PB storage footprint in various clouds. Note, this is not just Cloud Data Volumes but comprises all NetApp Cloud Data Services, which includes ONTAP Cloud, NPS, CloudSync, AltaVault, etc. Nonetheless, it’s an impressive indicator of just how far they have come in applying their storage magic to the public cloud in a short time. The hyperscalers (read public cloud providers) say NetApp is 2 or more years ahead of all the other competition and from what we can see, it’s true.

One of the key differentiators between NetApp Cloud Data Volumes and ONTAP Cloud is performance SLAs. Cloud Data Volume customers can select and purchase a specified performance SLA. We believe it comes at three levels and is normally purchased on a pay as you go, consumption based, service offering. However, it’s also available to be billed periodically, other purchase options may be available as well.

When asked what storage was behind the service, the only thing NetApp would confirm was that it was ONTAP storage, present in public cloud data centers in various regions. So Cloud Data Volumes is available in only specific regions but I would expect that to expand over time.

Data Visualization Center

They also christened their new Data Visualization Center (DVC) and we had a multi-course meal at the Bistro at the center. The DVC had a wrap around, 1.5 floor tall screen which showed some of NetApp customer success stories. Inside the screen was a more immersive setting and there was plenty of VR equipment in work spaces alongside customer conference rooms.

Full Disclosure: NetApp paid for all our travel, hotel and food during the analyst event and gave us all Google Home Minis as going away presents and NetApp is a long time customer of my firm.

There’s a new cluster filesystem on the block, Elastifile

At SFD12 last month we talked with the team from Elastifile. They are a new startup out of Israel working on a better cluster file system.

Elastifile was designed to support 1000s of nodes, 100,000 of users/client and 1000s of data containers (file systems/mount points), together with an infinite (64 bit) number of files and directories and up to Exabytes (10**18) in capacity. They also offer a 100% SSD file store capability. I encourage you to view the videos of their presentations at SFD12 to learn more.

Elastifile features

Elastifile supports data compression and optionally deduplication with NAND/Flash (e. g., low-/high-endurance) storage tiering, cloud storage tiering and multi-site storage. They also provide NFSv3/v4, SMB, AWS S3 and HDFS as native access protocols for their file storage.

They also offer non-disruptive hardware/software upgrades, n-way (2- or 3-way) data and metadata redundancy, self-healing capabilities, snapshots, and synchronous/asynchronous data replication or mirroring. Further, they provide multi-tenancy and QoS support.

Elastifile can be used in hyper converged mode as well as a dedicated storage server mode. For backend storage, they support heterogeneous, physical (block, I think?) storage systems as well as direct access storage in cluster nodes

Internals matter

Elastifile’s architecture supports accessor, owner and data nodes. But these can all be colocated on the same server or segregated across different servers.

Owner nodes, own all the metadata objects for a file or directory and caches the metadata working set in i’s memory. Ownership file or directory metadata may change in the case of hardware failures.

Elastifile supports a dynamic write data path, which means they determine, in real time, where to write file data rather than having the data locations identified before hand. They call this distributed write anywhere semantics.

Notably they don’t do data caching (with NVMe it doesn’t make sense) however, as noted above, they do use metadata caching

Internally, Elastifile uses variable length objects for both file data and metadata.

  • File data is composed of three object types: a file metadata (FileMD) object, mapping data objects, and file data objects. FileMD’s hold the normal file metadata (name, file size, create, access & modify ToDs, etc.) as well as pointing to all the Mapping Object (OIDs). Mapping objects exist for each 0.5MB of file data and consist of a 128 element table, each element mapping 4KB of file address space to a data object (OID). Each  data object holds the 4KB of compressed file data and journal log entries.
  • Director metadata is composed of directory metadata (DirMD) object and Directory listing objects. Directory listing objects maps file/directory names to FileMD or DirMD OIDs. Directory listing objects are accessed via an extensible hash table and contain a list of filenames/directory names within the directory

The Elastifile software architecture consists of three layers:

  • A protocol layer which terminates file system access protocols and translates requests into internal requests. The hashing and data compression of file data occur at this level.
  • A metadata layer which provides file system/directory name mapping to objects for owned files/directories and maintains file/directory metadata updates/journals/checkpoints.
  • A data layer which provides transaction consistency and a n-way redundant persistent data storage for (file or metadata) objects.

Metadata operations are persisted via journaled transactions and which are distributed across the cluster. For instance the journal entries for a mapping data object updates are written to the same file data object (OID) as the actual file data, the 4KB compressed data object.

There’s plenty of discussion on how they manage consistency for their metadata across cluster nodes. Elastifile invented and use Bizur, a key-value consensus based DB. Their chief architect Ezra Hoch (@EzraHoch) did a blog post and paper on Bizur for more information


New file systems generally take many years to mature and get out into the market, cluster file systems even longer. Elastifile started in 2013, by some very smart engineers, is already on the market, just 4 years later. That’s impressive enough, but with their list of advanced functionality plus cloud storage tiering and multi-site operations all shipping in the current product is mind-blowing.

One lingering question is, does a market exist for another cluster file system? All flash is interesting but most of the current CFS’s do this and ship this today. Cloud storage tiering is interesting and a long term need but some CFSs already have this and others are no doubt implementing it as we speak. CFS’s use of objects for internal data and metadata management is not new and may make internals cleaner but don’t really provide a lot of customer benefit.

Exascale raw capacity, support for 100K users, 1000s of nodes, 1000s of file systems and an infinite # of files/directories is interesting. But most CFSs claim this level of support already, although this is more aspirational for some. And proving support at this scale is difficult, if not impossible.

On the other hand, Bizur is really neat. Its primary benefit is during recovery from hardware failures. For a CFS with 1000s of nodes, failures likely occur quite often. So Bizur’s advantage here may pay significant customer dividends.

Is that enough to to market a new CFS?

To see what other SFD12 bloggers have written on Elastifile, please see:

Hardware vs. software innovation – round 4

We, the industry and I, have had a long running debate on whether hardware innovation still makes sense anymore (see my Hardware vs. software innovation – rounds 1, 2, & 3 posts).

The news within the last week or so is that Dell-EMC cancelled their multi-million$, DSSD project, which was a new hardware innovation intensive, Tier 0 flash storage solution, offering 10 million of IO/sec at 100µsec response times to a rack of servers.

DSSD required specialized hardware and software in the client or host server, specialized cabling between the client and the DSSD storage device and specialized hardware and flash storage in the storage device.

What ultimately did DSSD in, was the emergence of NVMe protocols, NVMe SSDs and RoCE (RDMA over Converged Ethernet) NICs.

Last weeks post on Excelero (see my 4.5M IO/sec@227µsec … post) was just one example of what can be done with such “commodity” hardware. We just finished a GreyBeardsOnStorage podcast (GreyBeards podcast with Zivan Ori, CEO & Co-founder, E8 storage) with E8 Storage which is yet another approach to using NVMe-RoCE “commodity” hardware and providing amazing performance.

Both Excelero and E8 Storage offer over 4 million IO/sec with ~120 to ~230µsec response times to multiple racks of servers. All this with off the shelf, commodity hardware and lots of software magic.

Lessons for future hardware innovation

What can be learned from the DSSD to NVMe(SSDs & protocol)-RoCE technological transition for future hardware innovation:

  1. Closely track all commodity hardware innovations, especially ones that offer similar functionality and/or performance to what you are doing with your hardware.
  2. Intensely focus any specialized hardware innovation to a small subset of functionality that gives you the most bang, most benefits at minimum cost and avoid unnecessary changes to other hardware.
  3. Speedup hardware design-validation-prototype-production cycle as much as possible to get your solution to the market faster and try to outrun and get ahead of commodity hardware innovation for as long as possible.
  4. When (and not if) commodity hardware innovation emerges that provides  similar functionality/performance, abandon your hardware approach as quick as possible and adopt commodity hardware.

Of all the above, I believe the main problem is hardware innovation cycle times. Yes, hardware innovation costs too much (not discussed above) but I believe that these costs are a concern only if the product doesn’t succeed in the market.

When a storage (or any systems) company can startup and in 18-24 months produce a competitive product with only software development and aggressive hardware sourcing/validation/testing, having specialized hardware innovation that takes 18 months to start and another 1-2 years to get to GA ready is way too long.

What’s the solution?

I think FPGA’s have to be a part of any solution to making hardware innovation faster. With FPGA’s hardware innovation can occur in days weeks rather than months to years. Yes ASICs cost much less but cycle time is THE problem from my perspective.

I’d like to think that ASIC development cycle times of design, validation, prototype and production could also be reduced. But I don’t see how. Maybe AI can help to reduce time for design-validation. But independent FABs can only speed the prototype and production phases for new ASICs, so much.

ASIC failures also happen on a regular basis. There’s got to be a way to more quickly fix ASIC and other hardware errors. Yes some hardware fixes can be done in software but occasionally the fix requires hardware changes. A quicker hardware fix approach should help.

Finally, there must be an expectation that commodity hardware will catch up eventually, especially if the market is large enough. So an eventual changeover to commodity hardware should be baked in, from the start.


In the end, project failures like this happen. Hardware innovation needs to learn from them and move on. I commend Dell-EMC for making the hard decision to kill the project.

There will be a next time for specialized hardware innovation and it will be better. There are just too many problems that remain in the storage (and systems) industry and a select few of these can only be solved with specialized hardware.


Picture credit(s): Gravestones by Sherry NelsonMotherboard 1 by Gareth Palidwor; Copy of a DSSD slide photo taken from EMC presentation by Author (c) Dell-EMC

Intel’s Optane (3D Xpoint) SSD specs in the wild

Read an article the other day in Ars Technica (Specs for 1st Intel 3DX SSD…) about a preview of the Intel Octane specs for their 375GB 3D Xpoint (3DX) flash card. The device is NVMe compliant, PCIe Gen3 add in card, that’s in a half height, half length, low profile form factor.

Intel’s Optane SSD vs. the competition

A couple of items from the Intel Optane spec sheet of interest to me as a storage guru:

  • 30 Drive writes per day/12.3 PBW (written) – 3DX, at launch, had advertised that it would have 1000 times the endurance of (2D-MLC?) NAND. Current flash cards (see Samsung SSD PRO NVMe 256GB Flash card specs) offer about 200TBW (for 256GB card) or 400TBW (for 512GB card). The Samsung PRO is based on 3D (V-)NAND, so its endurance is much better than  2D-MLC at these densities. That being said, the Octane drive is still ~40X the write endurance of the PRO 950. Not quite 1000 but certainly significantly better.
  • Sequential (bandwidth) performance (R/W) of 2400/2000 MB/sec – 3DX advertised 1000 times the performance of (2D-MLC,  non-NVMe?) NAND. Current 3D (V-)NAND cards (see Samsung SSD PRO above) above offers (R/W) 2200/900 MB/sec for an NVMe device. The Optane’s read bandwidth is a slight improvement but the write bandwidth is a 2.2X improvement over current competitive devices.
  • Random 4KB IOPs performance (R/W) of 550K/500K – Similar to the previous bulleted item, 3DX advertised 1000 times the performance of (2D-MLC,  non-NVMe?) NAND. Current 3D (V-)NAND cards like the Samsung SSD PRO offer Random 4KB IOPs performance  (R/W) of 270K/85K IOPS (@4 threads). Optane’s read random 4KB IOPs performance is 2X the PRO 950 but its write performance is ~5.9X better.
  • IO latency of <10 µsec. – 3DX advertised 10X better latency than the current (2D-MLC, non-NVMe) flash drives. According to storage review (Samsung 950 Pro M.2), the Samsung PRO 950 had a latency of ~22 µsec. Optane has at least 2X better latency than the current competition.
  • Density 375GB/HH-HL-LP – 3DX advertised 1000X the density of (then current DRAM). Today Micron offers a 4GiB DDR4/288 pin DIMM which is probably 1/2 the size of the HH flash drive. So maybe in the same space this could be 8GiB. This says that the Optane is about 100X denser than today’s DRAM.

Please note, when 3DX was launched, ~2 years ago, the then current NAND technology was 2D-MLC and NVMe was just a dream. So comparing launch claims against today’s current 3D-NAND, NVMe drives is not a fair comparison.

Nevertheless, the Optane SSD performs considerably better than current competitive NVMe drives and has significantly better endurance than current 3D (V-)NAND flash drives. All of which is a great step in the right direction.

What about DRAM replacement?

At launch, 3DX was also touted as a higher density, potential replacement for DRAM. But so far we haven’t seen any specs for what 3DX NVM looks like on a memory bus. It has much better density than DRAM, but we would need to see 3DX memory access times under 50ns to have a future as a DRAM replacement. Optane’s NVMe SSD at 10 µsec. is about 200X too slow, but then again it’s not a memory device configuration nor is it attached to a memory bus.


Photo Credit(s):  Intel Optane Spec sheet from Ars Technica Article,  DDR4 DRAM from Wikimedia user:Dsimic