4.5M IO/sec@227µsec 4KB Read on 100GBE with 24 NVMe cards #SFD12

At Storage Field Day 12 (SFD12) this week we talked with Excelero, a startup out of Israel. They offer software defined block storage for Linux.

Excelero depends on NVMe SSDs in servers (hyper converged or as a storage system), 100GBE and RDMA NICs. (At the time I wrote this post, videos from the presentation were not available, but the TFD team assures me they will be up on their website soon.)

I know, yet another software defined storage startup.

Well, yesterday they demoed a single storage system that generated 2.5M IO/sec of random 4KB writes or 4.5M IO/sec of random 4KB reads. I didn’t record the random write average response time, but it was less than 350µsec, and the random read average response time was 227µsec. They only did these 30 second test runs a couple of times, but the IO performance was staggering.
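As a sanity check on those numbers, here is a quick back-of-the-envelope sketch. It assumes 4KB means 4096 bytes and uses Little's law (concurrency = throughput x latency) to estimate how many IOs had to be in flight; the function names are mine, not anything Excelero showed.

```python
# Rough arithmetic on the demo figures quoted above.
IO_SIZE_BYTES = 4 * 1024  # assumption: 4KB = 4096 bytes


def bandwidth_gbit_per_sec(iops, io_size=IO_SIZE_BYTES):
    """Payload bandwidth implied by an IO rate, in gigabits per second."""
    return iops * io_size * 8 / 1e9


def outstanding_ios(iops, latency_usec):
    """Little's law: average number of IOs in flight at a given rate and latency."""
    return iops * latency_usec / 1e6


read_bw = bandwidth_gbit_per_sec(4.5e6)   # about 147.5 Gb/s of read payload
write_bw = bandwidth_gbit_per_sec(2.5e6)  # about 81.9 Gb/s of write payload
read_qd = outstanding_ios(4.5e6, 227)     # about 1,020 reads in flight

print(f"read:  {read_bw:.1f} Gb/s, ~{read_qd:.0f} IOs in flight")
print(f"write: {write_bw:.1f} Gb/s")
```

So the read test was moving more payload than a single 100GBE link can carry, and it needed roughly a thousand reads outstanding across the hosts to sustain that rate at 227µsec.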

But they used lots of hardware, right?

No. The target storage system used during their demo consisted of:

  • 1-Supermicro 2028U-TN24RT+, a 2U dual socket server with up to 24 NVMe 2.5″ drive slots;
  • 2-2 x 100Gbs Mellanox ConnectX-5 100Gbs Ethernet (R[DMA]-NICs); and
  • 24-Intel 2.5″ 400GB NVMe SSDs.

They also had a Dell Z9100-ON switch supporting 32 x 100Gbs QSFP28 ports, and I think they were using 4 hosts, but none of this was part of the storage target system.
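To put the totals in per-device terms, here is a small sketch that divides the demo IO rates evenly across the 24 SSDs. The even spread is my assumption; the presenters did not show a per-drive breakdown.

```python
# Per-drive share of the demo IO rates, assuming load is spread evenly.
DRIVES = 24


def per_drive_iops(total_iops, drives=DRIVES):
    """Average IO rate each drive must sustain for a given total."""
    return total_iops / drives


read_per_drive = per_drive_iops(4.5e6)   # 187,500 reads/sec per drive
write_per_drive = per_drive_iops(2.5e6)  # roughly 104,000 writes/sec per drive

print(f"reads per drive:  {read_per_drive:,.0f} IO/sec")
print(f"writes per drive: {write_per_drive:,.0f} IO/sec")
```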

I don’t recall the CPU used on the target, but it was a relatively low-end, cheap ($300 or so), dual-core standard Intel CPU. I think they said the total target hardware cost $13K or so.

I priced out an equivalent system. 24 400GB 2.5″ NVMe Intel 750 SSDs would cost around $7.8K (Newegg); the 2 Mellanox ConnectX-5 cards $4K (Neutron USA); and the SuperMicro plus an Intel CPU around $1.5K. So the total comes close to the ~$13K they quoted.
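Adding up my price estimates (these are the rough street prices above, not Excelero's figures) gives the following sketch, with a dollars-per-thousand-IOPS figure thrown in for the read test:

```python
# Sum of the component price estimates quoted in the paragraph above.
component_cost_usd = {
    "24 x Intel 750 400GB NVMe SSDs": 7800,
    "2 x Mellanox ConnectX-5 cards": 4000,
    "SuperMicro server plus Intel CPU": 1500,
}

total_usd = sum(component_cost_usd.values())   # 13,300
usd_per_k_read_iops = total_usd / (4.5e6 / 1000)  # cost per 1,000 read IO/sec

print(f"total hardware: ${total_usd:,}")
print(f"cost per 1K read IO/sec: ${usd_per_k_read_iops:.2f}")
```

That works out to about $13.3K in hardware, or a little under $3 per thousand 4KB random read IO/sec.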

But it burned out the target CPU, didn’t it?

During the 4.5M IO/sec random read benchmark, the storage target CPU was at 0.3% busy, and the highest consuming process on the target CPU was the Linux “top” command being used to display process status.

Excelero claims that the storage target system consumes absolutely no CPU processing to service a 4K read or write IO request. All IO processing is done by hardware (the R(DMA)-NICs, the NVMe drives and the PCIe bus), which bypasses the storage target CPU altogether.

We didn’t look at the host CPU utilization, but driving 4.5M IO/sec would take a high level of CPU power even if their client software did most of this via RDMA messaging magic.

How is this possible?

Their client software running in the Linux host is roughly equivalent to an iSCSI initiator, but it talks a special RDMA protocol (RDDA, patent pending by Excelero). The client adds an IO request to the NVMe device submission queue and then rings the doorbell on the target system device; the SSD then takes the request off the queue and executes it. In addition to the submission queue IO request, they preprogram the PCIe MSI interrupt request message to somehow program (?) the target system R-NIC to send the read data/write status data back to the client host.

So there’s really no target CPU processing for any NVMe message handling or interrupt processing. It’s all done by the client SW and handled between the NVMe drive and the target and client R-NICs.
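To make the flow concrete, here is a toy model of the submission queue and doorbell idea as I understood it. It is purely illustrative: the class and field names are hypothetical, everything runs in one Python process, and it is not Excelero's RDDA protocol or a real NVMe driver. The point is only that the client places the command and rings the doorbell, and the drive does the rest.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4096  # the 4KB IO size used throughout the demo


@dataclass
class Command:
    opcode: str        # "read" or "write"
    lba: int           # logical block address
    data: bytes = b""  # payload, used only for writes


@dataclass
class ToyDrive:
    """Stand-in for an NVMe SSD with a single submission queue."""
    blocks: dict = field(default_factory=dict)
    submission_queue: deque = field(default_factory=deque)
    doorbell: int = 0  # count of commands the client has announced

    def process(self):
        """Drive side: consume announced commands and return completions."""
        completions = []
        while self.doorbell > 0:
            cmd = self.submission_queue.popleft()
            self.doorbell -= 1
            if cmd.opcode == "write":
                self.blocks[cmd.lba] = cmd.data
                completions.append(("write", cmd.lba, b""))
            else:
                data = self.blocks.get(cmd.lba, b"\0" * BLOCK_SIZE)
                completions.append(("read", cmd.lba, data))
        return completions


class ToyClient:
    """Stand-in for the host-side client software."""

    def __init__(self, drive):
        self.drive = drive

    def submit(self, cmd):
        # Models the remote write of the command into the drive's queue...
        self.drive.submission_queue.append(cmd)
        # ...followed by ringing the doorbell. No target-side code runs here.
        self.drive.doorbell += 1


drive = ToyDrive()
client = ToyClient(drive)
client.submit(Command("write", lba=7, data=b"x" * BLOCK_SIZE))
client.submit(Command("read", lba=7))
results = drive.process()
print([(op, lba, len(data)) for op, lba, data in results])
```

In the real system the completions would travel back to the client through the R-NICs; here they are simply returned from `process()`.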

The result is that the data is sent back to the requesting host automatically: from the drive to the target R-NIC over the target’s PCIe bus, then from the target system to the client system via RDMA across 100GBE and the R-NICs, and finally from the client R-NIC to the client IO memory data buffer over the client’s PCIe bus.

Writes are a bit simpler, as the 4KB write data can be encapsulated into the submission queue command for the write operation that’s sent to the NVMe device, and the write IO status is a relatively small amount of data that needs to be sent back to the client.

NVMe optimized for 4KB IO

Of course the NVMe protocol is set up to transfer up to 4KB of data with a (write command) submission queue element. And the PCIe MSI interrupt return message can be programmed to (I think) write a command in the R-NIC to cause the data transfer back for a read command directly into the client’s memory using RDMA, with no CPU activity whatsoever in either operation. As long as your IO request is no larger than 4KB, this all works fine.

There is some minor CPU processing on the target to configure a LUN and set up the client to target connection. For data protection, they essentially support only RAID 10 (mirroring) across the NVMe SSDs.
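For a feel of what mirroring costs in capacity, here is a small sketch. It assumes two-way mirroring across all 24 of the 400GB drives in the demo system; I did not hear Excelero state the exact mirror layout, so treat the copy count as my assumption.

```python
# Raw versus usable capacity under mirroring (copy count is an assumption).
def usable_gb(drives, drive_gb, copies=2):
    """Usable capacity when every block is stored `copies` times."""
    return drives * drive_gb / copies


raw_gb = 24 * 400                 # 9,600 GB raw in the demo target
mirrored_gb = usable_gb(24, 400)  # 4,800 GB usable with two copies

print(f"raw: {raw_gb} GB, usable with mirroring: {mirrored_gb:.0f} GB")
```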

They also showed another demo that used the same drive both across the 100Gbs Ethernet network and in local mode, directly attached as local NVMe storage. The response times shown for local and remote were within 5µsec of each other. This means that going over the Ethernet link rather than going local costs you at most an additional 5µsec of response time.

Disaggregated vs. aggregated configuration

In addition to their standalone (disaggregated) storage target solution, they also showed an (aggregated) Linux based, hyper converged client-target configuration with a smaller number of NVMe drives per server. This could be used where VMs run and both the client and target Excelero software run on the same hardware.

Simply amazing

The product has no advanced data services: high availability, snapshots, erasure coding, dedupe, compression, replication, thin provisioning, etc. are all lacking. But if I can clone a LUN at, let’s say, 2.5M IO/sec, I can get by with no snapshotting. And with hardware that’s this cheap, I’m not sure I care about thin provisioning, dedupe and compression. Remote site replication is never going to happen at these speeds. OK, HA is an important consideration, but I think they can make that happen, and they do support RAID 10 (data mirroring), so protection is there for an NVMe device failure.

But if you want 4.5M 4K random reads or 2.5M 4K random writes on <$15K of hardware and happen to be running Linux, I think they have a solution for you. They showed some volume provisioning software but I was too overwhelmed trying to make sense of their performance to notice.

Yes, it really screams for 4KB IO. But that covers a lot of IO activity these days. And if you can do millions of them a second, splitting up bigger IOs into 4K pieces should not be a problem.
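Splitting a larger request into 4KB pieces is simple enough. Here is a minimal sketch of one way to do it, cutting on 4KB-aligned boundaries; the function is my own illustration and says nothing about how Excelero's client actually handles large IOs.

```python
# Split a byte range into pieces that never cross a 4KB boundary.
CHUNK = 4096


def split_io(offset, length, chunk=CHUNK):
    """Return (offset, length) pieces covering the range, each within one chunk."""
    pieces = []
    end = offset + length
    while offset < end:
        boundary = (offset // chunk + 1) * chunk  # next chunk-aligned offset
        piece_end = min(boundary, end)
        pieces.append((offset, piece_end - offset))
        offset = piece_end
    return pieces


# An aligned 16KB request becomes four 4KB IOs.
print(split_io(0, 16384))
# An unaligned 8KB request becomes a short head, one full chunk and a short tail.
print(split_io(1024, 8192))
```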

As far as I could tell, they are selling Excelero software as a standalone product and offering it to OEMs. They already have a few customers using Excelero’s standalone software and will be announcing OEMs soon.

I really want one for my Mac office environment, although what I’d do with millions of IO/sec is another question.