4.5M IO/sec@227µsec 4KB Read on 100GBE with 24 NVMe cards #SFD12

At Storage Field Day 12 (SFD12) this week we talked with Excelero, a startup out of Israel that offers software-defined block storage for Linux.

Excelero depends on NVMe SSDs in servers (hyper-converged or as a storage system), 100GBE and RDMA NICs. (At the time I wrote this post, videos from the presentation were not available, but the TFD team assures me they will be up on their website soon).

I know, yet another software defined storage startup.

Well, yesterday they demoed a single storage system that generated 2.5M IO/sec of random 4KB writes or 4.5M IO/sec of random 4KB reads. I didn’t record the random write average response time, but it was less than 350µsec, and the random read average response time was 227µsec. They only ran these 30-second tests a couple of times, but the IO performance was staggering.
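A quick Little's Law check on those headline numbers is instructive (the throughput and latency figures are from the demo; the implied queue depth is my own inference):

```python
# Little's Law sanity check: concurrency = throughput x latency.
# Demo figures from the post; the implied queue depth is my inference.
read_iops = 4.5e6          # 4KB random reads per second
read_latency_s = 227e-6    # average read response time in seconds

outstanding_ios = read_iops * read_latency_s
print(f"implied IOs in flight: {outstanding_ios:.0f}")
```

So to sustain 4.5M IO/sec at 227µsec, roughly a thousand IOs must be in flight at once, which across 24 SSDs is only ~43 outstanding commands per drive, well within NVMe queue depths.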

But they used lots of hardware, right?

No. The target storage system used during their demo consisted of:

  • 1-Supermicro 2028U-TN24RT+, a 2U dual socket server with up to 24 NVMe 2.5″ drive slots;
  • 2-Mellanox ConnectX-5 100Gbs Ethernet R-NICs (RDMA NICs); and
  • 24-Intel 2.5″ 400GB NVMe SSDs.
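A back-of-the-envelope check shows why the target needs two 100Gbs ports rather than one: at 4.5M 4KB reads per second, the read payload alone exceeds a single link.

```python
# Why two 100Gbs NICs: payload bandwidth at the demoed read rate
# exceeds a single 100Gbs link (before any protocol overhead).
iops = 4.5e6
io_size_bytes = 4096

payload_gbps = iops * io_size_bytes * 8 / 1e9
print(f"read payload: {payload_gbps:.1f} Gb/s")  # ~147 Gb/s, > one 100Gbs port
```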

They also had a Dell Z9100-ON switch supporting 32 x 100Gbs QSFP28 ports, and I think they were using 4 hosts, but none of this was part of the storage target system.

I don’t recall the CPU used in the target, but it was a relatively low-end, cheap ($300 or so) dual-core, standard Intel CPU. I think they said the total target hardware cost $13K or so.

I priced out an equivalent system: 24 400GB 2.5″ Intel 750 NVMe SSDs would cost around $7.8K (Newegg); the 2 Mellanox ConnectX-5 cards, $4K (Neutron USA); and the SuperMicro chassis plus an Intel CPU, around $1.5K. So the total comes close to the quoted ~$13K.
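Summing my street-price estimates (rough quotes from the post, not MSRP):

```python
# My pricing from the post, summed; these are rough street quotes.
components_usd = {
    "24 x Intel 750 400GB NVMe SSDs (Newegg)": 7800,
    "2 x Mellanox ConnectX-5 NICs (Neutron USA)": 4000,
    "SuperMicro 2028U chassis + Intel CPU": 1500,
}
total = sum(components_usd.values())
print(f"total: ${total:,}")  # $13,300 -- close to the quoted ~$13K
```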

But it burned out the target CPU, didn’t it?

During the 4.5M IO/sec random read benchmark, the storage target CPU was 0.3% busy, and the highest-consuming process on the target CPU was the Linux “top” command used to display process status.

Excelero claims that the storage target system consumes absolutely no CPU processing to service a 4K read or write IO request. All IO processing is done in hardware (the R-NICs, the NVMe drives and the PCIe bus), which bypasses the storage target CPU altogether.

We didn’t look at host CPU utilization, but driving 4.5M IO/sec would take a high level of CPU power even if their client software does most of this via RDMA messaging magic.

How is this possible?

Their client software running in the Linux host is roughly equivalent to an iSCSI initiator, but it speaks a special RDMA protocol (Excelero’s patent-pending RDDA protocol). The client adds an IO request to the NVMe device’s submission queue and then rings the doorbell on the target system device; the SSD then takes the request off the queue and executes it. In addition to the submission queue IO request, the client preprograms the PCIe MSI interrupt request message to somehow (?) direct the target system R-NIC to send the read data or write status back to the client host.
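As I understood it, the read path looks something like the sketch below. To be clear, Excelero has not published RDDA, so every class and field name here is hypothetical scaffolding of my own, just to show the client-driven sequence of RDMA writes:

```python
# Conceptual model of the RDDA read path as I understood the talk.
# All names are hypothetical -- Excelero has not published the protocol.

class FakeRNIC:
    """Stand-in for an RDMA NIC: just records the remote writes it does."""
    def __init__(self):
        self.writes = []

    def rdma_write(self, remote_addr, value):
        # A real R-NIC would DMA `value` into `remote_addr` on the peer
        # with no target-CPU involvement; here we only log the operation.
        self.writes.append((remote_addr, value))

def rdda_read(client_nic, lba, client_buf_addr):
    # 1. Build an NVMe read command and place it on the target SSD's
    #    submission queue via an RDMA write into target memory.
    sqe = {"opcode": "READ", "lba": lba, "len": 4096}
    client_nic.rdma_write("target_sq", sqe)
    # 2. Pre-program the completion/MSI path so the target R-NIC knows
    #    where to push the read data (straight into the client buffer).
    client_nic.rdma_write("target_msi", client_buf_addr)
    # 3. Ring the SSD's doorbell directly (another RDMA write to a
    #    mapped PCIe address); the SSD dequeues and executes on its own.
    client_nic.rdma_write("target_doorbell", 1)

nic = FakeRNIC()
rdda_read(nic, lba=1234, client_buf_addr=0xDEAD0000)
print(len(nic.writes))  # 3 RDMA writes, and zero modeled target-CPU work
```

The point of the model is simply that every step is initiated by the client NIC; nothing on the target ever schedules a CPU.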

So there’s really no target CPU processing for any NVMe message handling or interrupt processing; it’s all done by the client software and handled between the NVMe drive and the target and client R-NICs.

The result is that the data is sent back to the requesting host automatically: from the drive to the target R-NIC over the target’s PCIe bus, from the target system to the client system via RDMA across 100GBE and the R-NICs, and then from the client R-NIC into the client’s IO memory data buffer over the client’s PCIe bus.

Writes are a bit simpler: the 4KB of write data can be encapsulated into the submission queue command for the write operation that’s sent to the NVMe device, and the write IO status is a relatively small amount of data to send back to the client.

NVMe optimized for 4KB IO

Of course, the NVMe protocol is set up to transfer up to 4KB of data with a (write command) submission queue element. And the PCIe MSI interrupt return message can (I think) be programmed to write a command to the R-NIC that transfers read data directly into the client’s memory using RDMA, with no CPU activity whatsoever in either operation. As long as your IO request is 4KB or less, this all works fine.
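For reference, a standard NVMe submission queue entry is a fixed 64 bytes: command dword 0 (opcode plus command ID), namespace ID, data pointers, and command-specific dwords carrying the starting LBA and block count. This is standard NVMe, not anything Excelero-specific, and the layout below is abbreviated (real SQEs subdivide several of these fields):

```python
import struct

NVM_CMD_READ = 0x02  # NVMe I/O command set: Read opcode

def build_read_sqe(cid, nsid, prp1, slba, nlb):
    """Pack a simplified 64-byte NVMe read SQE (little-endian).

    Fields: CDW0 (opcode + command ID), NSID, 8 reserved bytes,
    metadata pointer, PRP1/PRP2 data pointers, starting LBA
    (CDW10-11), block count (CDW12, 0-based), CDW13-15 unused.
    """
    cdw0 = NVM_CMD_READ | (cid << 16)
    return struct.pack(
        "<II8xQQQQIIII",
        cdw0, nsid,
        0,            # metadata pointer (unused here)
        prp1, 0,      # PRP1 points at the 4KB data buffer; PRP2 unused
        slba,         # starting LBA
        nlb - 1,      # number of logical blocks, 0-based per the spec
        0, 0, 0,      # CDW13-15 unused
    )

sqe = build_read_sqe(cid=1, nsid=1, prp1=0x1000, slba=0, nlb=8)
print(len(sqe))  # 64 -- one fixed-size SQE describes the whole transfer
```

One 64-byte entry fully describes a 4KB transfer, which is why a single RDMA write from the client can stage the entire command.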

There is some minor CPU processing on the target to configure a LUN and set up the client-to-target connection. For data protection they essentially only support replicated RAID 10 across the NVMe SSDs.

They also showed another demo that accessed the same drive both across the 100Gbs Ethernet network and directly as local NVMe storage. The response times for local and remote access were within 5µsec of each other. This means that going over the Ethernet link rather than staying local costs you an additional 5µsec of response time.

Disaggregated vs. aggregated configuration

In addition to their standalone (disaggregated) storage target solution, they also showed an (aggregated) Linux-based, hyper-converged client-target configuration with a smaller number of NVMe drives per node. This could be used in configurations where VMs run alongside both the client and target Excelero software on the same hardware.

Simply amazing

The product has no advanced data services: no high availability, snapshots, erasure coding, dedupe, compression, replication or thin provisioning. But if I can clone a LUN at, let’s say, 2.5M IO/sec, I can get by without snapshots. And with hardware this cheap, I’m not sure I care about thin provisioning, dedupe and compression. Remote site replication is never going to happen at these speeds. OK, HA is an important consideration, but I think they can make that happen, and they do support RAID 10 (data mirroring), so mirroring is there to cover an NVMe device failure.

But if you want 4.5M 4K random reads or 2.5M 4K random writes on <$15K of hardware and happen to be running Linux, I think they have a solution for you. They showed some volume provisioning software but I was too overwhelmed trying to make sense of their performance to notice.

Yes, it really screams for 4KB IO. But that covers a lot of IO activity these days. And if you can do millions of them a second, splitting bigger IOs into 4KB pieces should not be a problem.
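The splitting itself is trivial for a client driver; here’s a minimal sketch of carving an arbitrary request into 4KB-aligned sub-IOs (my own illustration, not Excelero’s code):

```python
# Sketch of splitting an arbitrary IO into 4KB sub-IOs, which is what
# a client driver would do to stay on the 4KB fast path.
CHUNK = 4096

def split_io(offset, length):
    """Yield (offset, length) pieces, each at most 4KB; every piece
    after the first starts on a 4KB boundary."""
    pieces = []
    while length > 0:
        # First piece may be short if the starting offset is unaligned.
        step = min(length, CHUNK - (offset % CHUNK))
        pieces.append((offset, step))
        offset += step
        length -= step
    return pieces

print(split_io(6144, 16384))
# a 16KB IO at offset 6KB -> one 2KB head, three 4KB pieces, one 2KB tail
```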

As far as I could tell they are selling Excelero software as a standalone product and offering it to OEMs. They already have a few customers using Excelero’s standalone software and will be announcing  OEMs soon.

I really want one for my Mac office environment, although what I’d do with millions of IO/sec is another question.


6 Replies to “4.5M IO/sec@227µsec 4KB Read on 100GBE with 24 NVMe cards #SFD12”

  1. And so the storage business reaches the level of maturity when the data plane can take care of an I/O without spending processor cycles (more importantly, waiting to schedule a CPU to spend those cycles), which the network business did roughly two decades ago.

    A modern network switch ASIC, last year, could check and forward roughly 3 billion packets per second. Relative to that scale, forwarding a few million I/Os between a network interface and a collection of SSDs is child’s play, particularly if they can be forwarded blindly rather than following fine-granularity access control rules.

    To simplify a very complex topic, the network industry learned to think in terms of data plane (moving that I/O through the hardware without intervention) and control plane (all of the network stack and switch/router software which coordinates knowing who is where and is allowed to do what to whom, as well as programming all the tables that allow hardware, following access control and other rules, to forward packets without processor intervention). The control plane software is the hard part, with all due respect to the folks who design the ASICs.

    This storage control plane and data plane will get a lot more interesting when we have shared storage class memory, and the collective processors on a memory fabric can issue billions of storage reads and writes per second, all validated against fine grain (this application or container is allowed to access that address range, across the network) permissions.

    The 2020s are going to be fun!


    1. @FStevenChalmers,

      Thanks for your comment. I couldn’t agree more, although in all fairness the previous main storage technology (disk drives) didn’t need the sort of data-plane-level automatic packet switching that’s present in the Excelero solution with NVMe SSDs. And we are talking about 4KB IO operations to a persistent store, not small, transient packets in a network. And we are dealing with a network with very few paths between source and target vs. many paths between point A and point B.

      All that being said, shared storage class memory (SCM) represents a state change in storage as we know it. When access times drop from 200µsec (for NVMe SSDs) to 20µsec (for SCM NVMe SSDs/memory), the storage world as we know it today changes. Even 5µsec of overhead for going remote (or non-local) represents a significant amount of overhead.

      Luckily, SCM is not that far away, and I know of at least a couple of vendors that have their hands on the technology today.

    2. This is good stuff but looks like a “trick”. I mean, if you had a larger ESX cluster (or pick a big machine), it would surely over-run that SuperMicro controller when it actually had inbound/outbound IO and the buffer copy overhead associated with all that. VMAX, for example, has over 80 Intel CPUs handling IO, and that is from memory a few years ago; they may be larger now.

      “And so the storage business reaches the level of maturity when the data plane can take care of an I/O without spending processor cycles”

      There is nothing new under the sun. Back in the day (nearly 20 years ago), we were performing VMS shadow copies to HSJ50s, and VMS had the hooks to let the copies occur at the HSJ50, like we see above – there was zero CPU overhead. Of course, you then switch out for “Enterprise Storage” to a Symmetrix and suddenly you are consuming CPU like crazy, as Enginuity doesn’t have those same hooks, so the IO has to come back out to the OS and back to the storage system to copy to the “other” disk. There is much to be said about owning the entire stack like DEC (and others) did at one time. We should see some excitement in the HCI space with these HCI players that own the stack. What comes around goes around, eh?

      1. Rob,

        Thanks for your comment. It’s unclear what the CPU utilization would be for a mixed workload, but they did 2.5M 4KB write IOs by themselves with nil CPU time, and they did 4.5M 4KB read IOs with nil CPU utilization. The buffer copy overhead was all in the R-NICs, PCIe bus and NVMe SSDs. And again, they had almost nil data management services.

        A better question would be what happens when you start doing 16KB IOs instead of 4KB. NVMe is optimized for 4KB, so unless the client software splits these up into 4KB IOs, the nil CPU overhead may be at risk.

        What happens when you mix IO types is unclear, but with nil CPU for 4KB writes and nil CPU for 4KB reads (meaning it’s all done in hardware, set up by the client software), I don’t think it would change the picture. But you might be limited to under 2M 4KB writes and a little over 2M 4KB reads by what seems to be an SSD limitation rather than a storage server CPU limit.

        In the days of the mainframe, we had specialized hardware to perform DMAs from the channel to the cache but we still had to consume controller CPU resources to set it up, finish it off and check for errors.

        Their secret is the RDDA protocol that they use on top of RDMA/RoCE, which provides all the setup and termination logic in the command issued from the client to the storage server. Another thing we didn’t look at was what was happening on the client, but any CPU overhead there could be amortized over multiple client systems.

  2. The ‘top’ command doesn’t show you everything the CPU is doing. There are plenty of syscalls and kernel activities that don’t show up in these utilities. You can absolutely be hammering your CPU on kernel functions/interrupts/memory related functions and top will be relatively quiet.

    This is why the perf utility exists.

    1. Adam,

      Thanks for the input. They didn’t run the perf utility as far as I can tell. But from what they are telling me, there are still far fewer CPU cycles being consumed per IO with Excelero. Next time I’ll ask for perf to be run…

Comments are closed.