The rise of MinIO object storage

MinIO presented at SFD21 a couple of weeks back (see videos here). They had a great session, as always with Jonathan and AB leading the charge. We’ve had a couple of GreyBeardsOnStorage podcasts with AB as well (listen and see GreyBeards talk open source S3… and GreyBeards talk Data Persistence …). We first talked with MinIO last year at SFD 19 where AB made a great impression on the bloggers (see videos here)

Their customers run the gamut from startups to F500. AB said that ~58% of the F500 have MinIO installed and over 8% of the F500 have added capacity over the last year. AB said they have a big presence in Finance, e.g., the 10 largest banks run MinIO, also the auto and Space/Defense sectors have adopted their product.

One reason for the later two sectors (auto & space/defense) is the size of MinIO’s binary, 50MB. And my guess for why the rest of those customers have adopted MinIO is because it’s S3 API compatible, it’s open source, and it’s relatively inexpensive.

Object storage trends

Customers running in the cloud have a love-hate relationship with object storage, they love that it scales but hate what it costs. There are numerous on prem object storage alternatives from traditional and non-traditional storage vendors, but most are deployed on appliances.

With appliances, customers have to order, wait for delivery, rack-configure-set up and after maybe weeks to months finally they have object storage on prem. But with MinIO a purely software, open source solution, it can be tried by merely downloading a couple of (Docker) containers and deployed/activated in under an hour..

As mentioned above, MinIO is API compatible with AWS S3 which helps with adoption. Moreover, now that it’s an integral part of VMware (see their new Data Persistence Platform), it can be enabled in seconds on your standard enterprise VMware cluster with Tanzu.

The other trend is that the edge needs storage, and lots of it. The main drivers of massive edge storage requirement are TelCos deploying 5G and auto industry’s self-driving cars. But this is just a start, industrial IoT will be generating reams of sensor log data at the edge, it will need to be stored somewhere. And what better place to store all this data, but on object storage. Furthermore, all this is driving more adoption of object storage, with MinIO picking up the lion’s share of deployments.

In addition, MinIO recently ported their software to run on ARM. AB said this was to support the expanding hobbyist and developers community driving edge innovation.

And then there was Kubernetes. Everyone in the industry (with the possible exception of Google) is surprised by the adoption of K8S. Google essentially gifted ~$1Bs in R&D on how to scale apps to the world of IT, and now any startup, anywhere, can scale with as well as Google can. And scaling is the “killer app” for the SW industry.

But performance isn’t bad either

Jonathan made mention of MinIO performance (see MinIO 24 node disk and MinIo 32 node NVMe SSD reports) benchmarks. Their disk data shows avg read and write performance of 16.3GB/s and 9.4GB/s, respectively and their NVMe SSD average read and write performance of 183.2GB/s and 171.3GB/s, respectively. The disk numbers are very good for object storage, but the SSD numbers are spectacular.

It turns out that modern, cloud native apps don’t need quick access to data as much as high data throughput. Modern apps have moved to a processing data in memory rather than off of storage, which means they move (large) chunks of data to memory and crunch on it there, and then spit it back out to storage This type of operating mode seems to scale better (in the cloud at least) than having a high priced storage system servicing a blizzard of IO requests from everywhere.

Other vendors had offered SSD object storage before but it never took off. But nowadays, with NVMe SSDs, MinIO is seeing starting to see healthcare, finance, and any AI/ML workloads all deploying NVMe SSD object storage. Yes for large storage repositories, (object storage’s traditional strongpoint), ie, 5PB to 100PB, disk can’t be beat but where blistering high throughput, is needed, NVMe SSD object storage is the way to go.

Open source vs. open core

AB mentioned that MinIO business model is 100% open source vs. many other vendors that use open source but whose business model is open core. The distinction is that open core vendors use open source as base functionality and then build proprietary, charged for, software features/functions on top of this.

But open source vendors, like MinIO offer all their functionality under an open source license (Apache SM License V2.0, GNU AGPL v3 Open Source license and other FOSS licenses), but if you want to use it commercially, build products with it embedded inside, or have enterprise class support, one purchases a commercial license.

As presented at SFD21, but their website home page has updated numbers reflected below

The pure open source model has some natural advantages:

  1. It’s a great lead gen solution because anyone, worldwide, 7X24X365 can download the software and start using it, (see Docker Hub or MinIO’s download page
  2. It’s a great hiring pool. Anyone, who has contributed to the MinIO open source is potentially a great technical hire. MinIO stats says they have 685 contributors, 19 in just the last month for MinIO base code (see MinIO’s GitHub repo).
  3. It’s a great development organization. With ~20 commits a weekover the last year, there’s a lot going on to add functionality/fix bugs. But that’s the new world of software development. Given all this activity, release frequencies increase, ~4 releases a month ((see GitHub repo insights above).
  4. It’s a great testing pool with, ~480M Docker Pulls (using a Docker container to run a standard, already configured MinIO server, mc, console, etc.) and ~18K enterprises running their solution, that’s an awful lot of users. With open source a lot of eye’s or contributors make all problems visible, but what’s more typical, from my perspective, is the more users that deploy your product, the more bugs they find.

Indeed, with the VMware’s Data Persistence Platform, Tanzu customers can use MinIO’s object storage at the click of a button (or three).

Of course, open source has downsides too. Anyone can access packages directly (from GitHub repo and elsewhere) and use your software. And of course, they can clone, fork and modify your source code, to add any functionality they want to it. Historically, open source subscription licensing models don’t generate as much revenues as appliance purchases do. And finally, open source, because it’s created by geeks, is typically difficult to deploy, configure, and use.

But can they meet the requirements of an Enterprise world

Because most open source is difficult to use, the enterprise has generally shied away from it. But that’s where there’s been a lot of changes to MinIO.

MinIO always had a “mc” (minio [admin] client) that offered a number of administrative services via an API, programmatically controlled interface. but they have recently come out with a GUI offering, the minIO console, which has a similarly functionality to their mc APU. They demoed the console on their SFD21 sessions (see videos above).

Supporting 18K enterprise users, even if only 8% are using it a lot, can be a challenge, but supporting almost a half a billion docker pulls (even if only 1/4th of these is a complete minIO deployment) can be hell on earth. The surprising thing is that MinIO’s commercial license promises customers direct-to-engineer support.

At their SFD21 sessions, AB stated they were getting ~2.7 new (tickets) problems a day. I assume these are what’s just coming in from commercial licensed users and not the general public (using their open source licensed offerings). AB said their average resolution time for these tickets was under 15 minutes.

Enter SubNet, the MinIO Subscription Network and their secret (not open source?) weapon to scale enterprise class support. Their direct-to-engineer support model involves a much, more collaborative approach to solving customer problems then you typical enterprise support with level 1, 2 & 3 support engineers. They demoed SubNet briefly at SFD21, but it could deserve a much longer discussion/demostration.

What little we saw (at SFD21) was that it looked almost like slack-PM dialog between customer and engineer but with unlimited downloads and realtime interaction.

MinIO also supports a very active Slack discussion group with ~11K users. Here anyone can ask a question and it will get answered by anyone. MinIO’s Slack has 2 channels: (Ggeneral and GitHub for notifications). It seems like MinIO is using Slack as a crowdsourced level 1 support.

But in the long run, to continue to offer “direct-to-engineer” levels of support, may require adding a whole lot more engineers. But AB seems prepared to do just that.

~~~~

MinIO is an interesting open source S3 API compatible, object storage solution that seems to run just about anywhere, is freely deployable with enterprise class support available (at a price) and has high throughput performance. What’s not to like.

Screaming IOP performance with StarWind’s new NVMeoF software & Optane SSDs

Was at SFD17 last week in San Jose and we heard from StarWind SAN (@starwindsan) and their latest NVMeoF storage system that they have been working on. Videos of their presentation are available here. Starwind is this amazing company from the Ukraine that have been developing software defined storage.

They have developed their own NVMe SPDK for Windows Server. Intel doesn’t currently offer SPDK for Windows today, so they developed their own. They also developed their own initiator (CentOS Linux) for NVMeoF. The target system was a multicore server running Windows Server with a single Optane SSD that they used to test their software.

Extreme IOP performance consumes cores

During their development activity they tested various configurations. At the start of their development they used a Windows Server with their NVMeoF target device driver. With this configuration and on a bare metal server, they found that they could max out the Optane SSD at 550K 4K random write IOPs at 0.6msec to a single Optane drive.

When they moved this code directly to run under a Hyper-V environment, they were able to come close to this performance at 518K 4K write IOPS at 0.6msec. However, this level of IO activity pegged 100% of 8 cores on their 40 core server.

More IOPs/core performance in user mode

Next they decided to optimize their driver code and move as much as possible into user space and out of kernel space, They continued to use Hyper-V. With this level off code, they were able to achieve the same performance as bare metal or ~551K 4K random write IOP performance at the 0.6msec RT and 2.26 GB/sec level. However, they were now able to perform only pegging 2 cores. They expect to release this initiator and target software in mid October 2018!

They converted this functionality to run under ESX/VMware and were able to see much the same 2 cores pegged, ~551K 4K random write IOPS at 0.6msec RT and 2.26 GB/sec. They will have the ESXi version of their target driver code available sometime later this year.

Their initiator was running CentOS on another server. When they decided to test how far they could push their initiator, they were able to drive 4 Optane SSDs at up to ~1.9M 4K random write IOP performance.

At SFD17, I asked what they could have done at 100 usec RT and Max said about 450K IOPs. This is still surprisingly good performance. With 4 Optane SSDs and consuming ~8 cores, you could achieve 1.8M IOPS and ~7.4GB/sec. Doubling the Optane SSDs one could achieve ~3.6M IOPS, with sufficient initiators and target cores with ~14.8GB/sec.

Optane based super computer?

ORNL Summit super computer, the current number one supercomputer in the world, has a sustained throughput of 2.5 TB/sec over 18.7K server nodes. You could do much the same with 337 CentOS initiator nodes, 337 Windows server nodes and ~1350 Optane SSDs.

This would assumes that Starwind’s initiator and target NVMeoF systems can scale but they’ve already shown they can do 1.8M IOPS across 4 Optane SSDs on a single initiator server. Aand I assume a single target server with 4 Optane SSDs and at least 8 cores to service the IO. Multiplying this by 4 or 400 shouldn’t be much of a concern except for the increasing networking bandwidth.

Of course, with Starwind’s Virtual SAN, there’s no data management, no data protection and probably very little in the way of logical volume management. And the ORNL Summit supercomputer is accessing data as files in a massive file system. The StarWind Virtual SAN is a block device.

But if I wanted to rule the supercomputing world, in a somewhat smallish data center, I might be tempted to put together 400 of StarWind NVMeoF target storage nodes with 4 Optane SSDs each. And convert their initiator code to work on IBM Spectrum Scale nodes and let her rip.

Comments?

Atomristors, a new single (atomic) layer memristor

Read an article the other day about the “Atomristor: non-volatile resistance  switching in atomic sheets of transition metal dichalcogenides” (TMDs), an ACS publication. The article shows research that discovered an atomic sheet level version of a memristor. The device is an atomic sheet of TMD that is sandwiched between two (gold, silver or graphene) electrodes.

They refer to the device switching non-volatile resistance (NVR) from low to high or vice versa but from our perspective it could just as easily be considered a non-volatile device usable for memory, storage, or electronic circuitry.

Prior to this research, it was believed that such resistance switching could not be accomplished with single atomic, sub-nanometre (0.7nm) sized, sheet of material.

NVR atomristor technological properties

The researchers discovered that NVR switching can occur at different device temperatures, sheet areas, compliance current, voltage sweep rate, and layer thickness. In all five degrees of freedom were tested to show that  TMD atomristors had wide applicability and allowed for significant environmental and electronic variability.

Not only was the effect extremely versatile, the researchers identified multiple materials which could be used for the atomic sheet. In fact, TMD are a class of materials and they showed 4 different TMD materials that had the NVR effect.

Surprisingly, some TMD materials exhibited the NVR effect using unipolar voltages and some using bipolar voltages, and some could use both.

The researchers went a long way to showing that the NVR was due to the atomic sheet. In one instance they specifically used non-lithographic measures to fabricate the devices. This process utilized graphene manufacturing like methods to produce an atomic sheet ontop of gold foil and depositing another gold layer ontop of that.

But they also used standard fabrication techniques to build the atomristor devices as well. Using these different fabrication methods, they were able to show the NVR effect using different electrodes types, testing gold, silver, and graphene, all of which worked similarly.

The paper talked of using atomristors in a software defined radio, as a electronic circuit/cross bar switch, or as a memory/storage device. But they also indicated that it could easily be used in a neuromorphic computer as well, effectively simulating neuron like computations.

There’s much more information in the ACS article.

How does it compare to flash?

As compared to flash, atomristors NVR devices should be able to provide higher levels (bits per mm) of density. And due to the lower current (~1v) required for (bipolar) NVR setting, reading and resetting, there’s a lower probability of leakage of stored charges as they’re scaled down to nm sizes.

And of course it comes in 2d sheets, so it’s just as amenable to 3D arrays as NAND and 3DX is today. That means that as fabs start scaling 3D NAND up in layers, atomristor NVR devices should be able to follow their technology roadmap to be scaled up just as high.

Atomristor computers, storage or switch devices

Going from the “lab” to an IT shop is a multifaceted endeavour that takes a lot of time. There are many steps to needed to get to commercialization and many lab breakthroughs never make it that far because of complexity, economics, and other factors.

For instance, memristors were first proposed in 1971 and HP(E) researchers first discovered material that could produce the memristor effect in 2008. In March 2012, HRL fabricated the first memristor chip on CMOS. In Dec. 2017, >9 years later, at their Discover Conference, HPE showed off “The Machine”, a prototype of a memristor based computer to the public. But we are still waiting to see one on the market for sale.

That being said, memristor technologies didn’t exist before 2008, so the use of these devices in a computer took sometime to be understood. The fact that atomristors are “just” an extremely, thinner version of memristors should help it be get to market faster that original memristor technologies. But how much faster than 9-12 years is anyone’s guess.

~~~~

Comments?

Picture Credit(s): All graphics and pictures are from the article in ACS

Axellio, next gen, IO intensive server for RT analytics by X-IO Technologies

We were at X-IO Technologies last week for SFD13 in Colorado Springs talking with the team and they showed us their new IO and storage intensive server, the Axellio. They want to sell Axellio to customers that need extreme IOPS, very high bandwidth, and large storage requirements. Videos of X-IO’s sessions at SFD13 are available here.

The hardware

Axellio comes in 2U appliance with two server nodes. Each server supports  2 sockets of Intel E5-26xx v4 CPUs (4 sockets total) supporting from 16 to 88 cores. Each server node can be configured with up to 1TB of DRAM or it also supports NVDIMMs.

There are two key differentiators to Axellio:

  1. The FabricExpress™, a PCIe based interconnect which allows both server nodes to access dual-ported,  2.5″ NVMe SSDs; and
  2. Dense drive trays, the Axellio supports up to 72 (6 trays with 12 drives each) 2.5″ NVMe SSDs offering up to 460TB of raw NVMe flash using 6.4TB NVMe SSDs. Higher capacity NVMe SSDS available soon will increase Axellio capacity to 1PB of raw NVMe flash.

They also probably spent a lot of time on packaging, cooling and power in order to make Axellio a reliable solution for edge computing. We asked if it was NEBs compliant and they told us not yet but they are working on it.

Axellio can also be configured to replace 2 drive trays with 2 processor offload modules such as 2x Intel Phi CPU extensions for parallel compute, 2X Nvidia K2 GPU modules for high end video or VDI processing or 2X Nvidia P100 Tesla modules for machine learning processing. Probably anything that fits into Axellio’s power, cooling and PCIe bus lane limitations would also probably work here.

At the frontend of the appliance there are 1x16PCIe lanes of server retained for networking that can support off the shelf NICs/HCAs/HBAs with HHHL or FHHL cards for Ethernet, Infiniband or FC access to the Axellio. This provides up to 2x100GbE per server node of network access.

Performance of Axellio

With Axellio using all NVMe SSDs, we expect high IO performance. Further, they are measuring IO performance from internal to the CPUs on the Axellio server nodes. X-IO says the Axellio can hit >12Million IO/sec with at 35µsec latencies with 72 NVMe SSDs.

Lab testing detailed in the chart above shows IO rates for an Axellio appliance with 48 NVMe SSDs. With that configuration the Axellio can do 7.8M 4KB random write IOPS at 90µsec average response times and 8.6M 4KB random read IOPS at 164µsec latencies. Don’t know why reads would take longer than writes in Axellio, but they are doing 10% more of them.

Furthermore, the difference between read and write IOP rates aren’t close to what we have seen with other AFAs. Typically, maximum write IOPs are much less than read IOPs. Why Axellio’s read and write IOP rates are so close to one another (~10%) is a significant mystery.

As for IO bandwitdh, Axellio it supports up to 60GB/sec sustained and in the 48 drive lax testing it generated 30.5GB/sec for random 4KB writes and 33.7GB/sec for random 4KB reads. Again much closer together than what we have seen for other AFAs.

Also noteworthy, given PCIe’s bi-directional capabilities, X-IO said that there’s no reason that the system couldn’t be doing a mixed IO workload of both random reads and writes at similar rates. Although, they didn’t present any test data to substantiate that claim.

Markets for Axellio

They really didn’t talk about the software for Axellio. We would guess this is up to the customer/vertical that uses it.

Aside from the obvious use case as a X-IO’s next generation ISE storage appliance, Axellio could easily be used as an edge processor for a massive fabric of IoT devices, analytics processor for large RT streaming data, and deep packet capture and analysis processing for cyber security/intelligence gathering, etc. X-IO seems to be focusing their current efforts on attacking these verticals and others with similar processing requirements.

X-IO Technologies’ sessions at SFD13

Other sessions at X-IO include: Richard Lary, CTO X-IO Technologies gave a very interesting presentation on an mathematically optimized way to do data dedupe (caution some math involved); Bill Miller, CEO X-IO Technologies presented on edge computing’s new requirements and Gavin McLaughlin, Strategy & Communications talked about X-IO’s history and new approach to take the company into more profitable business.

Again all the videos are available online (see link above). We were very impressed with Richard’s dedupe session and haven’t heard as much about bloom filters, since Andy Warfield, CTO and Co-founder Coho Data, talked at SFD8.

For more information, other SFD13 blogger posts on X-IO’s sessions:

Full Disclosure

X-IO paid for our presence at their sessions and they provided each blogger a shirt, lunch and a USB stick with their presentations on it.

 

There’s a new cluster filesystem on the block, Elastifile

At SFD12 last month we talked with the team from Elastifile. They are a new startup out of Israel working on a better cluster file system.

Elastifile was designed to support 1000s of nodes, 100,000 of users/client and 1000s of data containers (file systems/mount points), together with an infinite (64 bit) number of files and directories and up to Exabytes (10**18) in capacity. They also offer a 100% SSD file store capability. I encourage you to view the videos of their presentations at SFD12 to learn more.

Elastifile features

Elastifile supports data compression and optionally deduplication with NAND/Flash (e. g., low-/high-endurance) storage tiering, cloud storage tiering and multi-site storage. They also provide NFSv3/v4, SMB, AWS S3 and HDFS as native access protocols for their file storage.

They also offer non-disruptive hardware/software upgrades, n-way (2- or 3-way) data and metadata redundancy, self-healing capabilities, snapshots, and synchronous/asynchronous data replication or mirroring. Further, they provide multi-tenancy and QoS support.

Elastifile can be used in hyper converged mode as well as a dedicated storage server mode. For backend storage, they support heterogeneous, physical (block, I think?) storage systems as well as direct access storage in cluster nodes

Internals matter

Elastifile’s architecture supports accessor, owner and data nodes. But these can all be colocated on the same server or segregated across different servers.

Owner nodes, own all the metadata objects for a file or directory and caches the metadata working set in i’s memory. Ownership file or directory metadata may change in the case of hardware failures.

Elastifile supports a dynamic write data path, which means they determine, in real time, where to write file data rather than having the data locations identified before hand. They call this distributed write anywhere semantics.

Notably they don’t do data caching (with NVMe it doesn’t make sense) however, as noted above, they do use metadata caching

Internally, Elastifile uses variable length objects for both file data and metadata.

  • File data is composed of three object types: a file metadata (FileMD) object, mapping data objects, and file data objects. FileMD’s hold the normal file metadata (name, file size, create, access & modify ToDs, etc.) as well as pointing to all the Mapping Object (OIDs). Mapping objects exist for each 0.5MB of file data and consist of a 128 element table, each element mapping 4KB of file address space to a data object (OID). Each  data object holds the 4KB of compressed file data and journal log entries.
  • Director metadata is composed of directory metadata (DirMD) object and Directory listing objects. Directory listing objects maps file/directory names to FileMD or DirMD OIDs. Directory listing objects are accessed via an extensible hash table and contain a list of filenames/directory names within the directory

The Elastifile software architecture consists of three layers:

  • A protocol layer which terminates file system access protocols and translates requests into internal requests. The hashing and data compression of file data occur at this level.
  • A metadata layer which provides file system/directory name mapping to objects for owned files/directories and maintains file/directory metadata updates/journals/checkpoints.
  • A data layer which provides transaction consistency and a n-way redundant persistent data storage for (file or metadata) objects.

Metadata operations are persisted via journaled transactions and which are distributed across the cluster. For instance the journal entries for a mapping data object updates are written to the same file data object (OID) as the actual file data, the 4KB compressed data object.

There’s plenty of discussion on how they manage consistency for their metadata across cluster nodes. Elastifile invented and use Bizur, a key-value consensus based DB. Their chief architect Ezra Hoch (@EzraHoch) did a blog post and paper on Bizur for more information

~~~~

New file systems generally take many years to mature and get out into the market, cluster file systems even longer. Elastifile started in 2013, by some very smart engineers, is already on the market, just 4 years later. That’s impressive enough, but with their list of advanced functionality plus cloud storage tiering and multi-site operations all shipping in the current product is mind-blowing.

One lingering question is, does a market exist for another cluster file system? All flash is interesting but most of the current CFS’s do this and ship this today. Cloud storage tiering is interesting and a long term need but some CFSs already have this and others are no doubt implementing it as we speak. CFS’s use of objects for internal data and metadata management is not new and may make internals cleaner but don’t really provide a lot of customer benefit.

Exascale raw capacity, support for 100K users, 1000s of nodes, 1000s of file systems and an infinite # of files/directories is interesting. But most CFSs claim this level of support already, although this is more aspirational for some. And proving support at this scale is difficult, if not impossible.

On the other hand, Bizur is really neat. Its primary benefit is during recovery from hardware failures. For a CFS with 1000s of nodes, failures likely occur quite often. So Bizur’s advantage here may pay significant customer dividends.

Is that enough to to market a new CFS?

To see what other SFD12 bloggers have written on Elastifile, please see:

Hardware vs. software innovation – round 4

We, the industry and I, have had a long running debate on whether hardware innovation still makes sense anymore (see my Hardware vs. software innovation – rounds 1, 2, & 3 posts).

The news within the last week or so is that Dell-EMC cancelled their multi-million$, DSSD project, which was a new hardware innovation intensive, Tier 0 flash storage solution, offering 10 million of IO/sec at 100µsec response times to a rack of servers.

DSSD required specialized hardware and software in the client or host server, specialized cabling between the client and the DSSD storage device and specialized hardware and flash storage in the storage device.

What ultimately did DSSD in, was the emergence of NVMe protocols, NVMe SSDs and RoCE (RDMA over Converged Ethernet) NICs.

Last weeks post on Excelero (see my 4.5M IO/sec@227µsec … post) was just one example of what can be done with such “commodity” hardware. We just finished a GreyBeardsOnStorage podcast (GreyBeards podcast with Zivan Ori, CEO & Co-founder, E8 storage) with E8 Storage which is yet another approach to using NVMe-RoCE “commodity” hardware and providing amazing performance.

Both Excelero and E8 Storage offer over 4 million IO/sec with ~120 to ~230µsec response times to multiple racks of servers. All this with off the shelf, commodity hardware and lots of software magic.

Lessons for future hardware innovation

What can be learned from the DSSD to NVMe(SSDs & protocol)-RoCE technological transition for future hardware innovation:

  1. Closely track all commodity hardware innovations, especially ones that offer similar functionality and/or performance to what you are doing with your hardware.
  2. Intensely focus any specialized hardware innovation to a small subset of functionality that gives you the most bang, most benefits at minimum cost and avoid unnecessary changes to other hardware.
  3. Speedup hardware design-validation-prototype-production cycle as much as possible to get your solution to the market faster and try to outrun and get ahead of commodity hardware innovation for as long as possible.
  4. When (and not if) commodity hardware innovation emerges that provides  similar functionality/performance, abandon your hardware approach as quick as possible and adopt commodity hardware.

Of all the above, I believe the main problem is hardware innovation cycle times. Yes, hardware innovation costs too much (not discussed above) but I believe that these costs are a concern only if the product doesn’t succeed in the market.

When a storage (or any systems) company can startup and in 18-24 months produce a competitive product with only software development and aggressive hardware sourcing/validation/testing, having specialized hardware innovation that takes 18 months to start and another 1-2 years to get to GA ready is way too long.

What’s the solution?

I think FPGA’s have to be a part of any solution to making hardware innovation faster. With FPGA’s hardware innovation can occur in days weeks rather than months to years. Yes ASICs cost much less but cycle time is THE problem from my perspective.

I’d like to think that ASIC development cycle times of design, validation, prototype and production could also be reduced. But I don’t see how. Maybe AI can help to reduce time for design-validation. But independent FABs can only speed the prototype and production phases for new ASICs, so much.

ASIC failures also happen on a regular basis. There’s got to be a way to more quickly fix ASIC and other hardware errors. Yes some hardware fixes can be done in software but occasionally the fix requires hardware changes. A quicker hardware fix approach should help.

Finally, there must be an expectation that commodity hardware will catch up eventually, especially if the market is large enough. So an eventual changeover to commodity hardware should be baked in, from the start.

~~~~

In the end, project failures like this happen. Hardware innovation needs to learn from them and move on. I commend Dell-EMC for making the hard decision to kill the project.

There will be a next time for specialized hardware innovation and it will be better. There are just too many problems that remain in the storage (and systems) industry and a select few of these can only be solved with specialized hardware.

Comments?

Picture credit(s): Gravestones by Sherry NelsonMotherboard 1 by Gareth Palidwor; Copy of a DSSD slide photo taken from EMC presentation by Author (c) Dell-EMC

4.5M IO/sec@227µsec 4KB Read on 100GBE with 24 NVMe cards #SFD12

At Storage Field Day 12 (SFD12) this week we talked with Excelero, which is a startup out of Israel. They support a software defined block storage for Linux.

Excelero depends on NVMe SSDs in servers (hyper converged or as a storage system), 100GBE and RDMA NICs. (At the time I wrote this post, videos from the presentation were not available, but the TFD team assures me they will be up on their website soon).

I know, yet another software defined storage startup.

Well yesterday they demoed a single storage system that generated 2.5 M IO/sec random 4KB random writes or 4.5 M IO/Sec random 4KB reads. I didn’t record the random write average response time but it was less than 350µsec and the random read average response time was 227µsec. They only did these 30 second test runs a couple of times, but the IO performance was staggering.

But they used lots of hardware, right?

No. The target storage system used during their demo consisted of:

  • 1-Supermicro 2028U-TN24RT+, a 2U dual socket server with up to 24 NVMe 2.5″ drive slots;
  • 2-2 x 100Gbs Mellanox ConnectX-5 100Gbs Ethernet (R[DMA]-NICs); and
  • 24-Intel 2.5″ 400GB NVMe SSDs.

They also had a Dell Z9100-ON Switch  supporting 32 X 100Gbs QSFP28 ports and I think they were using 4 hosts but all this was not part of the storage target system.

I don’t recall the CPU processor used on the target but it was a relatively lowend, cheap ($300 or so) dual core, Intel standard CPU. I think they said the total target hardware cost $13K or so.

I priced out an equivalent system. 24 400GB 2.5″ NVMe Intel 750 SSDs would cost around $7.8K (Newegg); the 2 Mellanox ConnectX-5 cards $4K (Neutron USA); and the SuperMicro plus an Intel Cpu around $1.5K. So the total system is close to the ~$13K.

But it burned out the target CPU, didn’t it?

During the 4.5M IO/sec random read benchmark, the storage target CPU was at 0.3% busy and the highest consuming process on the target CPU was the Linux “Top” command used to display the PS status.

Excelero claims that the storage target system consumes absolutely no CPU processing to service an 4K read or write IO request. All of IO processing is done by hardware (the R(DMA)-NICs, the NVMe drives and PCIe bus) which bypasses the storage target CPU altogether.

We didn’t look at the host cpu utilization but driving 4.5M IO/sec would take a high level of CPU power even if their client software did most of this via RDMA messaging magic.

How is this possible?

Their client software running in the Linux host is roughly equivalent to an iSCSI initiator but talks a special RDMA protocol (patent pending by Excelero, RDDA protocol) that adds an IO request to the NVMe device submission queue and then rings the doorbell on the target system device and the SSD then takes it off the queue and executes it. In addition to the submission queue IO request they preprogram the PCIe MSI interrupt request message to somehow program (?) the target system R-NIC to send the read data/write status data back to the client host.

So there’s really no target CPU processing for any NVMe message handling or interrupt processing, it’s all done by the client SW and is handled between the NVMe drive and the target and client R-NICs.

The result is that the data is sent back to the requesting host automatically from the drive to the target R-NIC over the target’s PCIe bus and then from the target system to the client system via RDMA across 100GBE and the R-NICS and then from the client R-NIC to the client IO memory data buffer over the client’s PCIe bus.

Writes are a bit simpler as the 4KB write data can be encapsulated into the submission queue command for the write operation that’s sent to the NVMe device and the write IO status is relatively small amount of data that needs to be sent back to the client.

NVMe optimized for 4KB IO

Of course the NVMe protocol is set up to transfer up to 4KB of data with a (write command) submission queue element. And the PCIe MSI interrupt return message can be programmed to (I think) write a command in the R-NIC to cause the data transfer back for a read command directly into the client’s memory using RDMA with no CPU activity whatsoever in either operation. As long as your IO request is less than 4KB, this all works fine.

There is some minor CPU processing on the target to configure a LUN and set up the client to target connection. They essentially only support replicated RAID 10 protection across the NVMe SSDs.

They also showed another demo which used the same drive both across the 100Gbs Ethernet network and in local mode or direct as a local NVMe storage. The response times shown for both local and remote were within  5µsec of each other. This means that the overhead for going over the Ethernet link rather than going local cost you an additional 5µsec of response time.

Disaggregated vs. aggregated configuration

In addition to their standalone (disaggregated) storage target solution they also showed an (aggregated) Linux based, hyper converged client-target configuration with a smaller number of NVMe drives in them. This could be used in configurations where VMs operated and both client and target Excelero software was running on the same hardware.

Simply amazing

The product has no advanced data services. no high availability, snapshots, erasure coding, dedupe, compression replication, thin provisioning, etc. advanced data services are all lacking. But if I can clone a LUN at lets say 2.5M IO/sec I can get by with no snapshotting. And with hardware that’s this cheap I’m not sure I care about thin provisioning, dedupe and compression.  Remote site replication is never going to happen at these speeds. Ok HA is an important consideration but I think they can make that happen and they do support RAID 10 (data mirroring) so data mirroring is there for an NVMe device failure.

But if you want 4.5M 4K random reads or 2.5M 4K random writes on <$15K of hardware and happen to be running Linux, I think they have a solution for you. They showed some volume provisioning software but I was too overwhelmed trying to make sense of their performance to notice.

Yes it really screams for 4KB IO. But that covers a lot of IO activity these days. And if you can do Millions of them a second splitting up bigger IOs into 4K should not be a problem.

As far as I could tell they are selling Excelero software as a standalone product and offering it to OEMs. They already have a few customers using Excelero’s standalone software and will be announcing  OEMs soon.

I really want one for my Mac office environment, although what I’d do with a millions of IO/sec is another question.

Comments?