Ethernet – Silverton Consulting

Hammerspace and the Open Flash Platform at #AIIFD3

Posted on September 19, 2025 by Ray in AI storage needs, Ethernet, File Storage, Storage density, Storage performance, Strategic Inflection Points

Was at AI Infrastructure Field Day 3 (AIIFD3) last week in CA and Hammerspace presented. (videos here). Molly and Floyd talked about their solution and some of their recent MLCommon’s performance results but Kurt discussed the Open Flash Platform (OFP) Consortium, announced last July which they and partners have been working on..

OFP currently has 6 partners ranging from Hammerspace (storage software supplier), SK Hynix (NAND and SSDs) and Linux Foundation among others and includes end users (Las Alamos National Labs), computational storage (ScaleFlux) and AI solution providers (Xsight).

As I understand it, the OFP is pushing to become a standard adopted by the Open Compute Project (OCP).

OFP is an attempt to redefine NAS as we know it. Hammerspace has been on this journey for a long time with their software only solution but technology is now at a place where it’s time to tackle hardware changes to NAS that would enable even better performance and throughput for AI and other data intensive workloads.

Some of the technology changes driving the need for a different approach to NAS storage:

NAND capacities are going through the roof, accessing all that capacity in an effective and performant way, requires a re-architecturing of the storage stack
Compute is becoming more widespread and ubiquitous. Every thing seems to have more and more compute capability that it’s causing a rethink as to how to take advantage of all this ubiquitous compute to better address IT (and AI) performance needs
AI bandwidth and performance requirements are extreme and are only becoming more so. .
Power has become a limiting factor in many AI deployments.

Hammerspace has addressed much of this from a software perspective with their Linux standards efforts to implement Parallel File System and Flex Files in the Linux kernel and in NFS standards as NFSv4.2. PFS and FlexFiles allows Hammerspace to offer very high file bandwidth and data mobility that can’t be supplied any other way.

So it’s time to see what can be done in hardware to make this even better. Enter OFP.

OFP, NAS storage reborn

The idea is to come up with a new packaging of an NFS (v3) server that’s all storage with high amounts of networking and enough compute to serve the storage. Effectively they are putting a DPU (computational intensive networking card) with 1-800Gbps Ethernet connection in front of a train (or toboggan) of NVMe SSDs and calling this a sled.

Their first version using U.2 NVMe SSDs, offers 1PB of capacity with 800Gbps of networking in a 3.5″ X 1.75″ form factor. They would load a NFS v3 Linux based storage server in the DPU and have it run that along with the Networking stack (and more) on the DPU and have access to all this storage capacity in what essentially is a NFSv3 (relatively dumb storage) storage sled.

Package 6 of these together with a couple of power supplies and now you have 6PB raw capacity in 1RU, with 4.8Tbps of bandwidth, consuming .6 kW of power (presumably this is power consumption at idle).

You will no doubt note that the sled, as configured above, does not allow for hot (or even cold) drive replacement. So when drives fail, the NFSv3 code would need to recover from them and take them out of service. So that over time the sled could still be used even though some SSDs have failed.

In the future, moving from U.2 SSDs to E2(E) NVMe SSDs in the storage sled quadruples the capacity while staying in the same power envelope and supplying the same bandwidth. Again the SSDs are not intended to be (hot or cold) swappable, so drive failure would need to be handled by software. With E2(E) SSDs in a sled and 6 of these in a 1RU, one would have 24PB of storage capacity.

Presumably, OFP Sleds could be hot swappable when enough SSDs in a sled fails.

And of course QLC capacities are not standing still so another doubling of these capacities could easily be possible within the next couple of years (imagine 48PB in a single RU, boggles the mind).

The NAS software one runs in the OFP SLED could be any NFSv3 server software but Hammerspace has their own, called DSX. And when you combine DSX servers with lots of capacity and lots of networking bandwidth, Hammerspace’s NFSv4.2 PFS and FlexFiles can really fly.

And with the power and space efficiency as well as extreme bandwidth available, it could be a winning formula for the AI environments, in contrast to scale-out NAS which is the current alternative.

~~~~

But it seems to me any organization (hypervisors are you listening) with intense storage capacity and storage bandwidth needs would be very interested in the OFP for their own environment.

Comments?

Enfabrica MegaNIC, a solution to GPU backend networking #AIFD5

Posted on September 17, 2024 by Ray in Cognitive computing, Ethernet, Networking, Software Defined Network, Strategic Inflection Points, System effectiveness, Visionary leadershp

I attended AI FieldDay 5 (AIFD5) last week and there were networking vendors there discussing how their systems dealt with backeng GPU network congestion issues. Most of these were traditional vendor congestion solutions.

However, one vendor, Enfabrica, (videos of their session will be available here) seemed to be going down a different path, which involved a new ASIC design destined to resolve all the congestion, power, and performance problems inherent in current backend GPU Ethernet networks.

In essence, Enfabrica’s Super or MegaNIC (they used both terms during their session) combines PCIe lanes switching, Ethernet networking, and ToR routing with SDN (software defined networking) programability to connect GPUs directly to a gang of Ethernet links. This allows it to replace multiple (standard/RDMA/RoCEv2) NIC cards with one MegaNIC using their ACF-S (Advanced Compute Fabric SuperNic) ASIC.

Their first chip, codenamed “Millennium” supports 8Tbps bandwidth.

Their ACF-S chip provides all the bandwidth needed to connect up to 4 GPUs to 32/16/8/4-100/200/400/800Gbps links. And because their ACF-S chip controls and drives all these network connections, it can better understand and deal with congestion issues backend GPU networks. And it is PCIe 5/6 compliant, supporting 128-160 lanes.

Further, it has onboard ARM processing to handle its SDN operations, onboard hardware engines to accelerate networking protocol activity and network and PCIe switching hardware to support directly connecting GPUs to Ethernet links.

With its SDN, it supports current RoCE, RDMA over TCP, UEC direct, etc. network protocols.

It took me (longer than it should) to get my head around what they were doing but essentially they are supporting all the NIC-TOR functionality as well as PCIe functionality needed to connect up to 4 GPUs to a backend Ethernet GPU network.

On the slide above I was extremely skeptical of the Every 10^52 Years “job failures due to NIC RAIL failures”. But Rochan said that these errors are predominantly optics failures and as both the NIC functionality and ToR switch functionality is embedded in the ACF-S silicon, those faults should not exist.

Still 10^52 years is a long MTBF rate (BTW, the universe is only 10^10 years old). And there’s still software controlling “some” of this activity. It may not show up as a “NIC RAIL” failure, but there will still be “networking” failures in any system using ACF-S devices.

Back to their solution. What this all means is you can have one less hop in your backend GPU networks leading to wider/flatter backend networks and a lot less congestion on this network. This should help improve (GPU) job performance, networking performance and reduce networking power requirements to support your 100K GPU supercluster.

At another session during the show, Arista (videos will be available here) said that just the DSP/LPO optics alone for a 100K GPU backend network will take a 96/32 MW of power. Unclear whether this took into consideration within rack copper connections. But anyway you cut it, it’s a lot of power. Of course the 100K GPUs would take 400MW alone (at 4KW per GPU).

Their ACF-S driver has been upstreamed into standard CCL and Linux distributions, so once installed (or if you are at the proper versions of CCL & Linux software), it should support complete NCCL (NVIDIA Collective Communications Library) stack compliance.

And because, with its driver installed and active, it talks standard Ethernet and standard PCIe protocols on both ends, it is should fully support any other hardware that comes along attaching to these networks or busses (CXL perhaps)

The fact that this may or may not work with other (GPU) accelerators seems moot at this point as NVIDIA owns the GPU for AI acceleration market. But the flexibility inherent in their own driver AND on chip SDN, indicates for the right price, just about any communications link software stack could be supported.

After spending most of the rest of AIFD5 discussing how various vendors deal with congestion for backend GPU networks, having startup on the stage with a different approach was refreshing.

Whether it reaches adoption and startup success is hard to say at this point. But if it delivers on what it seems capable of doing for power, performance and network flexibility, anybody deploying new greenfield GPU superclusters ought to take a look at Enfabricas solution. .

MegaNIC/ACF-S pilot boxes are available for order now. No indication as to what these would cost but if you can afford 100K GPUs it’s probably in the noise…

~~~~

Comments?

AI benchmark for Storage, MLpERF Storage

Posted on September 19, 2023 by Ray in Artificial Intelligence, Cognitive computing, Ethernet, File Storage, Infiniband, Storage performance

MLperf released their first round of storage benchmark submissions early this month. There’s plenty of interest how much storage is required to keep GPUs busy for AI work. As a result, MLperf has been busy at work with storage vendors to create a benchmark suitable to compare storage systems under a “simulated” AI workload.

For the v0.5 version ,they have released two simulated DNN training workloads one for image segmentation (3D-Unet [146 MB/sample]) and the other for BERT NLP (2.5 KB/sample).

The GPU being simulated is a NVIDIA V100. What they showing with their benchmark is a compute system (with GPUs) reading data directly from a storage system.

By using simulated (GPU) compute, the benchmark doesn’t need physical GPU hardware to run. However, the veracity of the benchmark is somewhat harder to depend on.

But, if one considers, the reported benchmark metric, # supported V100s, as a relative number across the storage submissions, one is on more solid footing. Using it as a real number of V100s that could be physically supported is perhaps invalid.

The other constraint from the benchmark was keeping the simulated (V100) GPUs at 90% busy. MLperf storage benchmark reports, number of samples/second,MBPS metrics as well as # simulated (V100) GPUs supported (@90% utilization).

In the bar chart we show the top 10 # of simulated V100 GPUs for image segmentation storage submissions, DDN AI400X2 had 5 submissions in this category.

The interesting comparison is probably between DDN’s #1 and #3 submission.

The #1 submission had smaller amount of data (24X3.5TB = 64TB of flash), used 200Gbps InfiniBand, with 16 compute nodes and supported 160 simulated V100s.
The #3 submission had more data (24X13.9TB=259TB of flash),used 400Gbps InfiniBand with 1 compute node and supports only 40 simulated V100s

It’s not clear why the same storage, with less flash storage, and slower interfaces would support 4X the simulated GPUs than the same storage, with more flash storage and faster interfaces.

I can only conclude that the number of compute nodes makes a significant difference in simulated GPUs supported.

One can see a similar example of this phenomenon with Nutanix #2 and #6 submissions above. Here the exact same storage was used for two submissions, one with 5 compute nodes and the other with just 1 but the one with more compute nodes supported 5X the # of simulated V100 GPUs.

Lucky for us, the #3-#10 submissions in the above chart, all used one compute node and as such are more directly comparable.

So, if we take #3-#5 in the chart above, as the top 3 submissions (using 1 compute node), we can see that the #3 DDN AI400X2 could support 40 simulated V100s, the #4 Weka IO storage cluster could support 20 simulated V100s and the #5 Micron NVMe SSD could support 17 simulated V100s.

The Micron SSD used an NVMe (PCIe Gen4) interface while the other two storage systems used 400Gbps InfiniBand and 100Gbps Ethernet, respectively. This tells us that interface speed, while it may matter at some point, doesn’t play a significant role in determining the # simulated V100s.

Both the DDN AI4000X2 and Weka IO storage systems are sophisticated storage systems that support many protocols for file access. Presumably the Micron SSD local storage was directly mapped to a Linux file system.

The only other MLperf storage benchmark that had submissions was for BERT, a natural language model.

In the chart, we show the # of simulated V100 GPUs on the vertical axis. We see the same impact here of having multiple compute nodes in the #1 DDN solution supporting 160 simulated V100s. But in this case, all the remaining systems, used 1 compute node.

Comparing the #2-4 BERT submissions, both the #2 and #4 are DDN AI400X2 storage systems. The #2 system had faster interfaces and more data storage than the #4 system and supported 40 simulated GPUs vs the other only supporting 10 simulated V100s.

Once again, Weka IO storage system came in at #3 (2nd place in the 1 compute node systems) and supported 24 simulated V100s.

A couple of suggestions for MLperf:

There should be different classes of submissions one class for only 1 compute node and the other for any number of compute nodes.
I would up level the simulated GPU configurations to A100 rather than V100s, which would only be one generation behind best in class GPUs.
I would include a standard definition for a compute node. I believe these were all the same, but if the number of compute nodes can have a bearing on the number of V100s supported, the compute node hardware/software should be locked down across submissions.
We assume that the protocol used to access the storage oven InfiniBand or Ethernet was standard NFS protocols and not something like GPUDirect storage or other RDMA variants. As the GPUs were simulated this is probably correct but if not, it should be specfied
I would describe the storage configurations with more detail, especially for software defined storage systems. Storage nodes for these systems can vary significantly in storage as well as compute cores/memory sizes which can have a significant bearing on storage throughput.

To their credit this is MLperfs first report on their new Storage benchmark and I like what I see here. With the information provided, one can at least start to see some true comparisons of storage systems under AI workloads.

In addition to the new MLperf storage benchmark, MLperf released new inferencing benchmarks which included updates to older benchmark NN models as well as a brand new GPT-J inferencing benchmark. I’ll report on these next time.

~~~~

Comments?

Is hardware innovation accelerating – hardware vs. software innovation (round 6)

Posted on December 17, 2020April 8, 2021 by Ray in Artificial Intelligence, Brain emulation, Cognitive computing, Data compression, Data QoS, Data transmission, Deep Learning, Ethernet, Infiniband, Market dynamics, Neuromorphic, NVMe, Processing performance, Strategic Inflection Points

There’s something happening to the IT industry, that maybe has not happened in a couple of decades or so but hardware innovation is back. We’ve been covering bits and pieces of it in our hardware vs software innovation series (see Open source ASiCs – HW vs. SW innovation [round 5] post).

But first please take our new poll:

Hardware innovation never really went away, Intel, AMD, Apple and others had always worked on new compute chips. DRAM and NAND also have taken giant leaps over the last two decades. These were all major hardware suppliers. But special purpose chips, non CPU compute engines, and hardware accelerators had been relegated to the dustbins of history as the CPU giants kept assimilating their functionality into the next round of CPU chips.

And then something happened. It kind of made sense for GPUs to be their own electronics as these were SIMD architectures intrinsically different than SISD, standard von Neumann X86 and ARM CPUs architectures

But for some reason it didn’t stop there. We first started seeing some inklings of new hardware innovation in the AI space with a number of special purpose DL NN accelerators coming online over the last 5 years or so (see Google TPU, SC20-Cerebras, GraphCore GC2 IPU chip, AI at the Edge Mythic and Syntiants IPU chips, and neuromorphic chips from BrainChip, Intel, IBM , others). Again, one could look at these as taking the SIMD model of GPUs into a slightly different direction. It’s probably one reason that GPUs were so useful for AI-ML-DL but further accelerations were now possible.

But it hasn’t stopped there either. In the last year or so we have seen SPUs (Nebulon Storage), DPUs (Fungible, NVIDIA Networking, others), and computational storage (NGD Systems, ScaleFlux Storage, others) all come online and become available to the enterprise. And most of these are for more normal workload environments, i.e., not AI-ML-DL workloads,

I thought at first these were just FPGAs implementing different logic but now I understand that many of these include ASICs as well. Most of these incorporate a standard von Neumann CPU (mostly ARM) along with special purpose hardware to speed up certain types of processing (such as low latency data transfer, encryption, compression, etc.).

What happened?

It’s pretty easy to understand why non-von Neumann computing architectures should come about. Witness all those new AI-ML-DL chips that have become available. And why these would be implemented outside the normal X86-ARM CPU environment.

But SPU, DPUs and computational storage, all have typical von Neumann CPUs (mostly ARM) as well as other special purpose logic on them.

Why?

I believe there are a few reasons, but the main two are that Moore’s law (every 2 years halving the size of transistors, effectively doubling transistor counts in same area) is slowing down and Dennard scaling (as you reduce the size of transistors their power consumption goes down and speed goes up) has stopped almost. Both of these have caused major CPU chip manufacturers to focus on adding cores to boost performance rather than just adding more transistors to the same core to increase functionality.

This hasn’t stopped adding instruction functionality to each CPU, but it has slowed considerably. And single (core) processor speeds (GHz) have reached a plateau.

But what it has stopped is having the real estate available on a CPU chip to absorb lots of additional hardware functionality. Which had been the case since the 1980’s.

I was talking with a friend who used to work on math co-processors, like the 8087, 80287, & 80387 that performed floating point arithmetic. But after the 486, floating point logic was completely integrated into the CPU chip itself, killing off the co-processors business.

Hardware design is getting easier & chip fabrication is becoming a commodity

We wrote a post a couple of weeks back talking about an open foundry (see HW vs. SW innovation round 5 noted above)that would take a hardware design and manufacture the ASICs for you for free (or at little cost). This says that the tool chain to perform chip design is becoming more standardized and much less complex. Does this mean that it takes less than 18 months to create an ASIC. I don’t know but it seems so.

But the real interesting aspect of this is that world class foundries are now available outside the major CPU developers. And these foundries, for a fair but high price, would be glad to fabricate a 1000 or million chips for you.

Yes your basic state of the art fab probably costs $12B plus these days. But all that has meant is that A) they will take any chip design and manufacture it, B) they need to keep the factory volume up by manufacturing chips in order to amortize the FAB’s high price and C) they have to keep their technology competitive or chip manufacturing will go elsewhere.

So chip fabrication is not quite a commodity. But there’s enough state of the art FABs in existence to make it seem so.

But it’s also physics

The extremely low latencies that are available with NVMe storage and, higher speed networking (100GbE & above) are demanding a lot more processing power to keep up with. And just the physics of how long it takes to transfer data across a distance (aka racks) is starting to consume too much overhead and impacting other work that could be done.

When we start measuring IO latencies in under 50 microseconds, there’s just not a lot of CPU instructions and task switching that can go on anymore. Yes, you could devote a whole core or two to this process and keep up with it. But wouldn’t the data center be better served keeping that core busy with normal work and offloading that low-latency, realtime (like) work to a hardware accelerator that could be executing on the network rather than behind a NIC.

So real time processing has become faster, or rather the amount of time to execute CPU instructions to switch tasks and to process data that needs to be done in realtime to keep up with faster line speed is becoming shorter.

So that explains DPUs, smart NICS, DPUs, & SPUs. What about the other hardware accelerator cards.

AI-ML-DL is becoming such an important and data AND compute intensive workload that just like GPUs before them, TPUs & IPUs are becoming a necessary evil if we want to service those workloads effectively and expeditiously.
Computational storage is becoming more wide spread because although data compression can be easily done at the CPU, it can be done faster (less data needs to be transferred back and forth) at the smart Drive.

My guess we haven’t seen the end of this at all. When you open up the possibility of having a long term business model, focused on hardware accelerators there would seem to be a lot of stuff that needs to be done and could be done faster and more effectively outside the core CPU.

There was a point over the last decade where software was destined to “eat the world”. I get a lot of flack for saying that was BS and that hardware innovation is really eating the world. Now that hardtware innovation’s back, it seems to be a little of both.

Comments?

Photo Credits:

Cerebras chip, Cerebras (see SC20 post)
Mythic architecture, Mythic computing (see AI at the edge post)
TPU2-iot, Google (see TPU post)
130nm layouts (see Open source ASICs post)
Moore’s law chart – wikipedia, By Max Roser – https://ourworldindata.org/uploads/2019/05/Transistor-Count-over-time-to-2018.png, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=79751151

There’s a new cluster filesystem on the block, Elastifile

Posted on April 14, 2017April 14, 2017 by Ray in Cloud storage, Clustered storage, Distributed computing, Ethernet, File Storage, NVMe, NVMe storage, SSD storage

At SFD12 last month we talked with the team from Elastifile. They are a new startup out of Israel working on a better cluster file system.

Elastifile was designed to support 1000s of nodes, 100,000 of users/client and 1000s of data containers (file systems/mount points), together with an infinite (64 bit) number of files and directories and up to Exabytes (10**18) in capacity. They also offer a 100% SSD file store capability. I encourage you to view the videos of their presentations at SFD12 to learn more.

Elastifile features

Elastifile supports data compression and optionally deduplication with NAND/Flash (e. g., low-/high-endurance) storage tiering, cloud storage tiering and multi-site storage. They also provide NFSv3/v4, SMB, AWS S3 and HDFS as native access protocols for their file storage.

They also offer non-disruptive hardware/software upgrades, n-way (2- or 3-way) data and metadata redundancy, self-healing capabilities, snapshots, and synchronous/asynchronous data replication or mirroring. Further, they provide multi-tenancy and QoS support.

Elastifile can be used in hyper converged mode as well as a dedicated storage server mode. For backend storage, they support heterogeneous, physical (block, I think?) storage systems as well as direct access storage in cluster nodes

Internals matter

Elastifile’s architecture supports accessor, owner and data nodes. But these can all be colocated on the same server or segregated across different servers.

Owner nodes, own all the metadata objects for a file or directory and caches the metadata working set in i’s memory. Ownership file or directory metadata may change in the case of hardware failures.

Elastifile supports a dynamic write data path, which means they determine, in real time, where to write file data rather than having the data locations identified before hand. They call this distributed write anywhere semantics.

Notably they don’t do data caching (with NVMe it doesn’t make sense) however, as noted above, they do use metadata caching

Internally, Elastifile uses variable length objects for both file data and metadata.

File data is composed of three object types: a file metadata (FileMD) object, mapping data objects, and file data objects. FileMD’s hold the normal file metadata (name, file size, create, access & modify ToDs, etc.) as well as pointing to all the Mapping Object (OIDs). Mapping objects exist for each 0.5MB of file data and consist of a 128 element table, each element mapping 4KB of file address space to a data object (OID). Each data object holds the 4KB of compressed file data and journal log entries.
Director metadata is composed of directory metadata (DirMD) object and Directory listing objects. Directory listing objects maps file/directory names to FileMD or DirMD OIDs. Directory listing objects are accessed via an extensible hash table and contain a list of filenames/directory names within the directory

The Elastifile software architecture consists of three layers:

A protocol layer which terminates file system access protocols and translates requests into internal requests. The hashing and data compression of file data occur at this level.
A metadata layer which provides file system/directory name mapping to objects for owned files/directories and maintains file/directory metadata updates/journals/checkpoints.
A data layer which provides transaction consistency and a n-way redundant persistent data storage for (file or metadata) objects.

Metadata operations are persisted via journaled transactions and which are distributed across the cluster. For instance the journal entries for a mapping data object updates are written to the same file data object (OID) as the actual file data, the 4KB compressed data object.

There’s plenty of discussion on how they manage consistency for their metadata across cluster nodes. Elastifile invented and use Bizur, a key-value consensus based DB. Their chief architect Ezra Hoch (@EzraHoch) did a blog post and paper on Bizur for more information

~~~~

New file systems generally take many years to mature and get out into the market, cluster file systems even longer. Elastifile started in 2013, by some very smart engineers, is already on the market, just 4 years later. That’s impressive enough, but with their list of advanced functionality plus cloud storage tiering and multi-site operations all shipping in the current product is mind-blowing.

One lingering question is, does a market exist for another cluster file system? All flash is interesting but most of the current CFS’s do this and ship this today. Cloud storage tiering is interesting and a long term need but some CFSs already have this and others are no doubt implementing it as we speak. CFS’s use of objects for internal data and metadata management is not new and may make internals cleaner but don’t really provide a lot of customer benefit.

Exascale raw capacity, support for 100K users, 1000s of nodes, 1000s of file systems and an infinite # of files/directories is interesting. But most CFSs claim this level of support already, although this is more aspirational for some. And proving support at this scale is difficult, if not impossible.

On the other hand, Bizur is really neat. Its primary benefit is during recovery from hardware failures. For a CFS with 1000s of nodes, failures likely occur quite often. So Bizur’s advantage here may pay significant customer dividends.

Is that enough to to market a new CFS?

To see what other SFD12 bloggers have written on Elastifile, please see:

Andrew Mauro’s (@Andrew_Mauro) post, Elastifile launches cross-cloud data fabric
Adam Bergh’s (@AJBergh) post recap of SFD12 – day 1
Chin-Fat Heoh’s (@StorageGaga) post The engineering of Elastifile

QoM1610: Will NVMe over Fabric GA in enterprise AFA by Oct’2017

Posted on October 16, 2016 by Ray in Block Storage, CIFS/SMB, Ethernet, FC, FCoE, File Storage, Forecasting, Infiniband, iSCSI, Networking, NFS, NVMe, NVMe storage, Omni-Path, QoM 2016, RDMA, RoCE, SSD storage, Storage performance, Strategic Inflection Points

NVMe over fabric (NVMeoF) was a hot topic at Flash Memory Summit last August. Facebook and others were showing off their JBOF (see my Facebook moving to JBOF post) but there were plenty of other NVMeoF offerings at the show.

NVMeoF hardware availability

When Brocade announced their Gen6 Switches they made a point of saying that both their Gen5 and Gen6 switches currently support NVMeoF protocols. In addition to Brocade’s support, in Dec 2015 Qlogic announced support for NVMeoF for select HBAs. Also, as of July 2016, Emulex announced support for NVMeoF in their HBAs.

From an Ethernet perspective, Qlogic has a NVMe Direct NIC which supports NVMe protocol offload for iSCSI. But even without NVMe Direct, Ethernet 40GbE & 100GbE with iWARP, RoCEv1-v2, iSCSI SER, or iSCSI RDMA all could readily support NVMeoF on Ethernet. The nice thing about NVMeoF for Ethernet is not only do you get support for iSCSI & FCoE, but CIFS/SMB and NFS as well.

InfiniBand and Omni-Path Architecture already support native RDMA, so they should already support NVMeoF.

So hardware/firmware is already available for any enterprise AFA customer to want NVMeoF for their data center storage.

NVMeoF Software

Intel claims that ~90% of the software driver functionality of NVMe is the same for NVMeoF. The primary differences between the two seem to be the NVMeoY discovery and queueing mechanisms.

There are two fabric methods that can be used to implement NVMeoF data and command transfers: capsule mode where NVMe commands and data are encapsulated in normal fabric packets or fabric dependent mode where drivers make use of native fabric memory transfer mechanisms (RDMA, …) to transfer commands and data.

A (Linux) host driver for NVMeoF is currently available from Seagate. And as a result, support for NVMeoF for Linux is currently under development, and not far from release in the next Kernel (I think). (Mellanox has a tutorial on how to compile a Linux kernel with NVMeoF driver support).

With Linux coming out, Microsoft Windows and VMware can’t be far behind. However, I could find nothing online, aside from base NVMe support, for either platform.

NVMeoF target support is another matter but with NICs/HBAs & switch hardware/firmware and drivers presently available, proprietary storage system target drivers are just a matter of time.

Boot support is a major concern. I could find no information on BIOS support for booting off of a NVMeoF AFA. Arguably, one may not need boot support for NVMeoF AFAs as they are probably not a viable target for storing App code or OS software.

From what I could tell, normal fabric multi-pathing support should work fine with NVMeoF. This should allow for HA NVMeoF storage, a critical requirement for enterprise AFA storage systems these days.

NVMeoF advantages/disadvantages

Chelsio and others have shown that NVMeoF adds ~8μsec of additional overhead beyond native NVMe SSDs, which if true would warrant implementation on all NVMe AFAs. This may or may not impact max IOPS depending on scale-ability of NVMeoF.

For instance, servers (PCIe bus hardware) typically limit the number of private NVMe SSDs to 255 or less. With an NVMeoF, one could potentially have 1000s of shared NVMe SSDs accessible to a single server. With this scale, one could have a single server attached to a scale-out NVMeoF AFA (cluster) that could supply ~4X the IOPS that a single server could perform using private NVMe storage.

Base level NVMe SSD support and protocol stacks are starting to be available for most flash vendors and operating systems such as, Linux, FreeBSD, VMware, Windows, and Solaris. If Intel’s claim of 90% common software between NVMe and NVMeoF drivers is true, then it should be a relatively easy development project to provide host NVMeoF drivers.

The need for special Ethernet hardware that supports RDMA may delay some storage vendors from implementing NVMeoF AFAs quickly. The lack of BIOS boot support may be a minor irritant in comparison.

NVMeoF forecast

AFA storage systems, as far as I can tell, are all about selling high IOPS and very-low latency IOs. It would seem that NVMeoF would offer early adopter AFA storage vendors a significant performance advantage over slower paced competition.

In previous QoM/QoW posts we have established that there are about 13 new enterprise storage systems that come out each year. Probably 80% of these will be AFA, given the current market environment.

Of the 10.4 AFA systems coming out over the next year, ~20% of these systems pride themselves on being the lowest latency solutions in the market, and thus command high margins. One would think these systems would be the first to adopt NVMeoF. But, most of these systems have their own, proprietary flash modules and do not use standard (NVMe) SSDs and can use their own proprietary interface to their proprietary flash storage. This will delay any implementation for them until they can convert their flash storage to NVMe which may take some time.

On the other hand, most (70%) of the other AFA systems, that currently use SAS/SATA SSDs, could boost their IOP counts and drastically reduce their IO response times, by implementing NVMe SSDs and NVMeoF. But converting SAS/SATA backends to NVMe will take time and effort.

But, there are a select few (~10%) of AFA systems, that already use NVMe SSDs in their AFAs, and for these few, they would seem to have a fast track towards implementing NVMeoF. The fact that NVMeoF is supported over all fabrics and all storage interface protocols make it even easier.

Moreover, NVMeoF has been under discussion since the summer of 2015, which tells me that astute AFA vendors have already had 18+ months to develop it. With NVMeoF host drivers & hardware available since Dec. 2015, means hardware and software exist to test and validate against.

I believe that NVMeoF will be GA’d within the next 12 months by at least one enterprise AFA system. So my QoM1610 forecast for NVMeoF is YES, with a 0.83 probability.

Comments?

Exablox, bring your own disk storage

Posted on July 8, 2016July 8, 2016 by Ray in CIFS/SMB, Clustered storage, Data reduction, Ethernet, File Storage, NFS, Object storage

We talked with Exablox a month or so ago at Storage Field Day 10 (SFD10) and they discussed some of their unique storage solution and new software functionality. If you’re not familiar with Exablox they sell a OneBlox appliance with drive slots, but no data drives.

The OneBlox appliance provides a Linux based, scale-out, distributed object storage software with a file system in front of it. They support SMB and NFS access protocols and have inline deduplication, data compression and continuous snapshot capabilities. You supply the (SATA or SAS) drives, a bring your own drive (BYOD) storage offering.

Their OneSystem management solution is available on a subscription basis, which usually runs in the cloud as a web accessed service offering used to monitor and manage your Exablox cluster(s). However, for those customers that want it, OneSystem is also available as a Docker Container, where you can run it on any Docker compatible system.
Continue reading “Exablox, bring your own disk storage” →

Pure Storage FlashBlade well positioned for next generation storage

Posted on June 16, 2016June 17, 2016 by Ray in Distributed computing, Ethernet, File Storage, NFS throughput, R&D measures, Software Defined Network, SSD storage, Strategic Inflection Points, System effectiveness

Sometimes, long after I listen to a vendor’s discussion, I come away wondering why they do what they do. Oftentimes, it passes but after a recent session with Pure Storage at SFD10, it lingered.

Why engineer storage hardware?

In the last week or so, executives at Hitachi mentioned that they plan to reduce hardware R&D activities for their high end storage. There was much confusion what it all meant but from what I hear, they are ahead now, and maybe it makes more sense to do less hardware and more software for their next generation high end storage. We have talked about hardware vs. software innovation a lot (see recent post: TPU and hardware vs. software innovation [round 3]).
Continue reading “Pure Storage FlashBlade well positioned for next generation storage” →