85: GreyBeards talk NVMe NAS with Howard Marks, Technologist Extraordinary and Plenipotentiary, VAST Data Inc.

As most of you know, Howard Marks was a founding co-Host of the GreyBeards-On-Storage podcast and has since joined VAST Data, an NVMe file and object storage vendor headquartered in NY with R&D out of Israel. We first met with VAST at StorageFieldDay18 (SFD18, video presentation). Howard announced his employment at that event. VAST was a bit circumspect at their SFD18 session but Howard seems to be more talkative, so on the podcast we learn a lot more about their solution.

VAST Data is essentially an NFS-S3 object store, scale out solution with both stateless, VAST Data storage servers and JBoF drive enclosures with Optane and NVMe QLC SSDs. Storage servers or JBoFs can be scaled independently. They don’t support tiering or DRAM caching of data but instead seem to use the Optane SSDs as a write buffer for the QLC SSDs.

At the SFD18 event their spokesperson said that they were going to kill off disk storage media. (Ed’s note: Disk shipments fell 18% y/y in 1Q 2019, with enterprise disk shipments at 11.5M units, desktop at 24.5M units and laptops at 37M units).

The hardware

The VAST Data storage servers come in a 2U/4 server configuration that runs interface protocols (NFS & S3), data reduction (see below), data reformatting/buffering, etc. They are stateless servers with all the metadata and other control state maintained on JBoF Optane drives.

Each drive enclosure JBoF has 12 Optane SSDs and 44 U.2 QLC (no DRAM/no super cap) SSDs. This means there are no write buffers on the QLC SSDs that can lose data when power failures occur. The interface to the JBoF is NVMeoF, either RDMA-RoCE Ethernet or InfiniBand (customer selected). Their JBoFs have high availability, with dual fabric modules that support 2-100Gbps Ethernet/InfiniBand ports per module, 4 per JBoF.

Minimum starting capacity is 500TB and they claim support up to exabytes, although how much has actually been tested is an open question. They also support billions of objects/files.

Guaranteed better data reduction

They have a rather unique, multi-level, data reduction scheme. At the start, data is chunked in variable length chunks. They use heuristics to determine the chunk size that fits best. (Ed note, unclear which is first in this sequence below so presented in (our view of) logical order)

  • 1st level computes a similarity hash (56 bit not SHA1), which is used to determine a similarity level with any other currently stored data chunk in the system.
  • 2nd level uses a ZSTD compression algorithm. If a similarity is found, the new data chunk is compressed with the ZSTD compression algorithm and a reference dictionary used by the earlier, similar data chunk. If no existing chunk is similar to this one, the algorithm identifies a semi-unique reference dictionary that optimizes the compression of this data chunk. This semi-unique dictionary is stored as metadata.
  • 3rd level, if it turns out to be a complete duplicate data chunk, then the dedupe count for the original data chunk is incremented, a pointer is saved to the original unique data and the data discarded. If not a complete duplicate of other data, the system computes a delta from the closest “similar” block and stores just the delta bytes, includes a pointer to the original similar block and increments a delta block counter.

So data is chunked, compressed with an optimized dictionary, and then either delta-diffed or deduped. All data reduction is done post data write (after the client is ACKed), and presumably, re-hydrated after being read from SSD media. VAST Data guarantees better data reduction for your stored data than any other storage solution.
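To make the "compress against a similar chunk's dictionary" step concrete, here is a minimal sketch. It is not VAST's implementation: it swaps in the preset dictionary (`zdict`) feature of Python's standard-library zlib in place of ZSTD, uses an earlier, similar chunk directly as the reference dictionary, and the helper names and sample data are made up for illustration.

```python
import hashlib
import zlib


def compress_with_reference(chunk: bytes, reference: bytes) -> bytes:
    """Compress chunk using an earlier, similar chunk as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(chunk) + comp.flush()


def decompress_with_reference(blob: bytes, reference: bytes) -> bytes:
    """Rehydrate a chunk; the same reference dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical data: 4 KB of hash output stands in for an already stored
# chunk that does not compress well on its own.
reference = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
# A "similar" new chunk: the same bytes with a small edit in the middle.
new_chunk = reference[:2000] + b"EDITED!" + reference[2007:]

standalone = zlib.compress(new_chunk, 9)
against_reference = compress_with_reference(new_chunk, reference)

# Round trip must return the original bytes.
assert decompress_with_reference(against_reference, reference) == new_chunk
print(len(new_chunk), len(standalone), len(against_reference))
```

The chunk compressed against its similar neighbor comes out far smaller than the standalone version, which is the effect the similarity hash is there to find. The price is that the reference has to be kept around for as long as anything depends on it, which is presumably why the dictionary is stored as metadata.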

New data protection

They also supply a unique Locally Decodable Erasure Coding with 4 parity (-like) blocks and anywhere from 36 (single enclosure leaving 4 spare U.2 SSDs) to 150 data blocks per stripe, all of which support up to 4 device failures per stripe.

The locally decodable erasure coding scheme allows for rebuilds without having to read all remaining data blocks in a stripe. In this scheme, once you read the 4 parity (-like) blocks, one has all the information calculated from up to ¾ of the remaining drives in the stripe, so the system only has to read the remaining ¼ of the drives in the stripe to reconstruct one, two, three, or four failing drives. Given their data stripe width, this cuts down considerably on the amount of data needing to be read. Even so, with 150 data drives in a stripe, the system still has to read 38 drives’ worth of QLC SSD data to rebuild a data drive.
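The stripe arithmetic above can be checked with a few lines of Python. This is only the ¼ rule of thumb from the discussion, not VAST's actual erasure code, and the function names are hypothetical.

```python
import math

PARITY_BLOCKS = 4  # parity(-like) blocks per stripe, per the description above


def rebuild_reads(data_drives: int) -> int:
    """Data drives read during a rebuild: about 1/4 of the stripe, rounded up."""
    return math.ceil(data_drives / 4)


def parity_overhead(data_drives: int) -> float:
    """Fraction of a stripe's blocks spent on parity."""
    return PARITY_BLOCKS / (data_drives + PARITY_BLOCKS)


# Single enclosure: 36 data + 4 parity + 4 spares accounts for all 44 QLC SSDs.
assert 36 + PARITY_BLOCKS + 4 == 44

for width in (36, 150):
    print(width, rebuild_reads(width), f"{parity_overhead(width):.1%}")
# prints "36 9 10.0%" then "150 38 2.6%"
```

So the widest stripe spends under 3% of its capacity on protection, at the cost of reading 38 drives' worth of data (plus the parity blocks) on a rebuild.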

In addition to all the above, VAST Data also reblocks the data into much larger segments (it writes 1MB segments to the QLC drives) and uses a heat map along with other heuristics to separate actively written data from less actively written data, thus reducing garbage collection and write amplification.

The podcast is a long one and runs ~43 minutes. Howard has always been great to talk with and if anything, now being a vendor, has intensified this tendency. Listen to the podcast to learn more.

Howard Marks, Technologist Extraordinary and Plenipotentiary, VAST Data, Inc.

Howard Marks brings over forty years of experience as a technology architect for hire and industry observer to his role as VAST Data’s Technologist Extraordinary and Plenipotentiary. In this role, Howard demystifies VAST’s technologies for customers and customer requirements for VAST’s engineers.

Before joining VAST, Howard ran DeepStorage, an industry test lab and analyst firm. An award-winning speaker, he has appeared at events on three continents including Comdex, Interop and VMworld.

Howard is the author of several books (all gratefully out of print) and hundreds of articles since Bill Machrone taught him journalism at PC Magazine in the 1980s.

Listeners may also remember that Howard was a founding co-Host of the Greybeards-on-Storage Podcast.


83: GreyBeards talk NVMeoF/TCP with Muli Ben-Yehuda, Co-founder & CTO and Kam Eshghi, VP Strategy & Bus. Dev., Lightbits Labs

This is the first time we’ve talked with Muli Ben-Yehuda (@Muliby), Co-founder & CTO and Kam Eshghi (@KamEshghi), VP of Strategy & Business Development, Lightbits Labs. Keith and I first saw them at Dell Tech World 2019, in Vegas as they are a Dell Ventures funded organization. The company has 70 (mostly engineering) employees and is based in Israel, with offices in NY and the Valley as well as elsewhere around the world. Kam was previously with (Dell) EMC DSSD and Muli’s spent years as a Master Inventor with IBM Research.

[This was Keith Townsend’s (@CTOAdvisor & The CTO Advisor), first time as a GreyBeard co-host and we had a great time with him on the show.]

I would have to say it was a far ranging discussion but focused on their software defined, NVMeoF/TCP storage. As you may recall we talked with Solarflare Communications last year, who were also working on NVMeoF/TCP, only in their case it was an accelerator board. After the recording, Muli said the hardware accelerator they have is their own design.

Why NVMeoF/TCP?

Most NVMeoF today, that uses Ethernet, requires RoCE or iWARP compatible NICs and switches. Lightbits Labs has long been active in the NVMeoF/RoCE-iWARP market place. Early on they noticed that enterprise and cloud service providers were reluctant to adopt NVMeoF technology because of the need to change out all their networking equipment to use it. This is what brought about their focus on NVMeoF/TCP.

The advantage of NVMeoF/TCP is that it can be run on any Ethernet NIC and switch available today. From Muli’s perspective, NVMeoF/TCP is going to become the next SAN of choice for the data center. They were active, early on, in the standards committee to push for NVMeoF/TCP adoption.

How does it work?

Their software defined solution runs LightOS® storage software, a Linux based package, and uses off the shelf, server hardware with persistent storage (Optane DC PM/SSDs, NV DIMMs, V-NAND, etc.). They use persistent memory for a FAST write buffer and a place where they can “mold” the written data into something that can be better written to backend NVMe SSDs.

One surprise about Lightbits’ solution is that it offers a decent set of data services. These include erasure coding, thin provisioning, wire-speed inline compression, QoS and wide striping. It seems any of these can be disabled if a customer wants, but they add very little overhead. I think Muli mentioned one Lightbits customer with encrypted data that disabled compression.

Lightbits also offers a global FTL (flash translation layer), which means they control SSD addressing which maps data to physical/raw NAND locations at the storage system level. If done well, a global FTL can help improve flash endurance and may offer better write performance (through increased parallelism).
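As a rough illustration of what "global" means here, the toy class below keeps one system-level map from logical block addresses to (SSD, page) locations and spreads writes across all SSDs. It is a teaching sketch under simple assumptions (round-robin placement, append-only writes, no garbage collector), not Lightbits' design, and every name in it is hypothetical.

```python
class GlobalFTL:
    """Toy system-level map from logical block address to (ssd, page)."""

    def __init__(self, num_ssds: int):
        self.num_ssds = num_ssds
        self.mapping = {}                 # lba -> (ssd, page)
        self.next_page = [0] * num_ssds   # per-SSD append pointer
        self.stale = set()                # old locations awaiting cleanup
        self._next_ssd = 0

    def write(self, lba: int) -> tuple:
        """Append the write to the next SSD in round-robin order."""
        ssd = self._next_ssd
        self._next_ssd = (self._next_ssd + 1) % self.num_ssds
        location = (ssd, self.next_page[ssd])
        self.next_page[ssd] += 1
        old = self.mapping.get(lba)
        if old is not None:
            self.stale.add(old)           # overwrite: old page becomes garbage
        self.mapping[lba] = location
        return location

    def read(self, lba: int) -> tuple:
        """Look up where a logical block currently lives."""
        return self.mapping[lba]


ftl = GlobalFTL(num_ssds=4)
for lba in range(8):
    ftl.write(lba)
ftl.write(3)                              # overwrite lands on a different SSD
print(ftl.read(3), sorted(ftl.stale))     # prints "(0, 2) [(3, 0)]"
```

Because placement is decided above the individual drives, consecutive writes fan out across SSDs (the parallelism mentioned above) and overwrites never rewrite a page in place, which is where the endurance benefit would come from.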

Lightbits’ claim to inline, wire speed data compression is premised on the use of more current CPUs with high (>=28) core counts in a storage server. If the storage server has older CPUs (<28 cores), they suggest you install their LightField™ hardware accelerator add in card. LightField offers a number of hardware based, performance accelerations in addition to compression speedups.

LightOS requires no host (client) software. Muli’s a long time Linux kernel contributor and indicated that the only thing LightOS needs is a current Linux Kernel (5.0 or later) which has the NVMeoF/TCP driver software (and persistent memory). Lightbits believes that it’s only a matter of time until other OSs also implement NVMeoF/TCP drivers.
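A quick way to check the kernel requirement on a host is to parse the release string. The sketch below only tests for 5.0 or later; it cannot tell whether a given distribution actually built or loaded the NVMe/TCP driver, or whether an older vendor kernel backported it. The function name is made up for illustration.

```python
import platform
import re


def kernel_new_enough_for_nvme_tcp(release: str) -> bool:
    """True if a Linux kernel release string is 5.0 or later."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (5, 0)


# Only meaningful on Linux; other platforms report unrelated release strings.
if platform.system() == "Linux":
    release = platform.release()
    print(release, kernel_new_enough_for_nvme_tcp(release))
```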

Lightbits business considerations

Long term, Lightbits sees a need for compute-storage disaggregation in hyperscaler and enterprise cloud environments. Early on it was relatively easy to replicate servers with DAS storage, but as NVMe SSDs came out, the expense of doing this throughout a >>1000 server environment started to become exorbitant. If only they had an easy way to disaggregate their storage from compute and still enjoy all the performance advantages of DAS NVMe SSDs. With LightOS they can do that.

Lightbits can be sold today through Dell, as a partner solution, which means that Dell can integrate, test and validate their servers with LightField accelerator card and deliver that package to your data center. I believe you still need to purchase and install their LightOS software yourself.

Lightbits charges for LightOS software on a per storage node basis, but they have different charges based on the maximum number of NVMe SSD slots available in a server. There is no capacity charge. They also offer worldwide service and support for LightOS software and LightField hardware.

It’s all about performance

From a performance perspective, one Fortune 500 hyperscaler benchmarked their storage solution against a DAS NVMe server and found it added about 30 µsec to the IO latency as compared to DAS NVMe SSDs. From their perspective, the added data services, better endurance, and disaggregated compute-storage environment provided by LightOS more than made up for the additional overhead.

Finally, I asked whether multiple LightOS storage servers could be clustered together. Muli, after stating some legal stuff, said they were working on the next generation LightOS and that it will support clustered storage servers, local data replication as well as distributed (across storage servers) erasure coding.

The podcast is a long one and runs ~47 minutes. There was a lot to talk about and Kam and Muli seem to know it all. It was interesting to hear the history of their pivot to TCP. They seem to have the right technology to address the market. Listen to the podcast to learn more.

Muli Ben-Yehuda, Co-founder and CTO, Lightbits Labs

Muli Ben-Yehuda is the CTO and Co-Founder of Lightbits Labs, where he leads technological developments.

Prior to founding Lightbits, he was chief scientist at Stratoscale and a researcher and Master Inventor at IBM Research.

He holds an M.Sc. in Computer Science (summa cum laude) from the Technion — Israel Institute of Technology and a B.A. (cum laude) from the Open University of Israel.

He is a long time Linux kernel contributor and his code and ideas are most likely included in an operating system or hypervisor running near you. He is also one of the authors of the NVMe/TCP standard and technology. 

Kam Eshghi, VP Strategy & Business Development, Lightbits Labs

Kam joined Lightbits Labs from Dell EMC and has over 20yrs of experience in strategic marketing and business development with startups and public companies.

Most recently as VP of strategic alliances at startup DSSD, Kam led business development with technology partners and developed DSSD’s partnership with EMC, leading to EMC’s acquisition of DSSD.

Previously as Sr. Director of Marketing & Business Development at IDT, Kam built their NVMe Controller business from scratch. Previous to that, Kam worked in data center storage, compute and networking markets at HP, Intel, and Crosslayer Networks. 

Kam is a U.C. Berkeley and MIT graduate with a BS and MS in Electrical Engineering and Computer Science and an MBA.

69: GreyBeards talk HCI with Lee Caswell, VP Products, Storage & Availability, VMware

Sponsored by:

For this episode we preview VMworld by talking with Lee Caswell (@LeeCaswell), Vice President of Product, Storage and Availability, VMware.

This is the third time Lee’s been on our show; the previous one was back in August of last year. Lee’s been at VMware for a couple of years now and, among other things, is leading the HCI journey at VMware.

The first topic we discussed was VMware’s expanded HCI software defined data center (SDDC) solution, which now includes compute, storage, networking and enhanced operations with alerts/monitoring/automation that ties it all together.

We asked Lee to explain VMware’s SDDC:

  • HCI operates at the edge – with ROBO-2-server environments, VMware’s HCI can be deployed in a closet and remotely operated by a VI from the central site.
  • HCI operates in the data center – with vSphere-vSAN-NSX-vRealize and other software, VMware modernizes data centers for the pace of digital business.
  • HCI operates in the public Cloud – with VMware Cloud (VMC) on AWS, IBM Cloud and over 400 service providers, VMware HCI also operates in the public cloud.
  • HCI operates for containers and cloud native apps – with support for containers under vSphere, vSAN and NSX, developers are finding VMware HCI an easy option to run container apps in the data center, at the edge, and in the public cloud.

The importance of the edge will become inescapable, as 50B edge connected devices power IoT by 2020. Lee heard Pat saying compute processing is moving to the edge because of 3 laws:

  1. the law of physics, light/information only travels so fast;
  2. the law of economics, doing all processing at central sites would take too much bandwidth and cost; and
  3. the law(s) of the land, data sovereignty and control is ever more critical in today’s world.

VMware SDDC is a full stack option, that executes just about anywhere the data center wants to go. Howard mentioned one customer he talked with at FMS18, just wanted to take their 16 node VMware HCI rack and clone it forever, to supply infinite infrastructure.

Next, we turned our discussion to Virtual Volumes (VVols). Recently VMware added replication support for VVols. Lee said VMware has an intent to provide an SRM SRA for VVols. But the real question is why hasn’t there been higher field VVol adoption? We concluded it takes time.

VVols wasn’t available in vSphere 5.5 and nowadays, three or more years have to go by before a significant amount of the field moves to a new release. Howard also said early storage systems didn’t implement VVols right. Moreover, VMware vSphere 5.5 is just now (9/16/18) going EoGS.

Lee said 70% of all current vSAN deployments are AFA. With AFA, hand tuning storage performance is no longer something admins need to worry about. It used to be we all spent time defragging/compressing data to squeeze more effective capacity out of storage, but hand capacity optimization like this has become a lost art. Just like capacity, hand tuning AFA performance doesn’t make sense anymore.

We then talked about the coming flash SSD supply glut. Howard sees flash pricing ($/GB) dropping by 40-50%, regardless of interface. This should drive AFA shipments above 70%, as long as the glut continues.

The podcast runs ~21 minutes. Lee’s always great to talk with and is very knowledgeable about the IT industry, HCI in general, and of course, VMware HCI in particular.  Listen to the podcast to learn more.

Lee Caswell, V.P. of Product, Storage & Availability, VMware

Lee Caswell leads the VMware storage marketing team driving vSAN products, partnerships, and integrations. Lee joined VMware in 2016 and has extensive experience in executive leadership within the storage, flash and virtualization markets.

Prior to VMware, Lee was vice president of Marketing at NetApp and vice president of Solution Marketing at Fusion-IO. Lee was a founding member of Pivot3, a company widely considered to be the founder of hyper-converged systems, where he served as the CEO and CMO. Earlier in his career, Lee held marketing leadership positions at Adaptec, and SEEQ Technology, a pioneer in non-volatile memory. He started his career at General Electric in Corporate Consulting.

Lee holds a bachelor of arts degree in economics from Carleton College and a master of business administration degree from Dartmouth College. Lee is a New York native and has lived in northern California for many years. He and his wife live in Palo Alto and have two children. In his spare time Lee enjoys cycling, playing guitar, and hiking the local hills.

68: GreyBeards talk NVMeoF/TCP with Ahmet Houssein, VP of Marketing & Strategy @ Solarflare Communications

In this episode we talk with Ahmet Houssein, VP of Marketing and Strategic Direction at Solarflare Communications, (@solarflare_comm). Ahmet’s been in the industry forever and has a unique view on where NVMeoF needs to go. Howard had talked with Ahmet at last year’s FMS. Ahmet will also be speaking at this year’s FMS (this week in Santa Clara, CA).

Solarflare Communications sells Ethernet communication gear, mostly to the financial services market and has developed a software plugin for the standard TCP/IP stack on Linux that supports both target and client mode NVMeoF/TCP. That is, their software plugin provides a complete implementation of NVMeoF across TCP Ethernet that extends the TCP protocol but doesn’t require RDMA (RoCE or iWARP) or data center bridging.

Implementing NVMeoF/TCP

Solarflare’s NVMeoF/TCP is a free plugin that, once approved by the NVMe(oF) standards committees, anyone can use to create an NVMeoF storage system and consume that storage from almost anywhere. The standards committee is expected to approve the protocol extension soon and sometime after that the plugin will be added to the Linux Kernel. After standards approval, maybe VMware and Microsoft will adopt it as well, but that may take more work.

Over the last year plus, most NVMeoF/Ethernet we encounter requires sophisticated RDMA hardware. When we talked with Pavilion Data Systems, a month or so ago, they had designed a more networking like approach to NVMeoF using RoCE and TCP [updated 8/8/18 after VR’s comment, the ed.]. When we talked with Attala Systems, they had a special purpose FPGA that’s used in RDMA NICs and Mellanox switches to support target & client mode NVMeoF/UDP [updated 8/8/18 after VR’s comment, the ed.].

Solarflare is taking a different tack.

One problem with the NVMeoF/Ethernet RDMA is compatibility. You can use either RoCE or iWARP RDMA NICs but at the moment you can’t use both. With TCP/IP plugins there’s no hardware compatibility issue. (Yes, there’s software compatibility at both ends of the pipe).

Solarflare recently measured latencies for their NVMeoF/TCP (Iometer/FIO), which show that running the protocol adds about 5-10% to latency versus running RDMA NVMeoF/UDP-RoCE-iWARP.

Performance measurements were taken using a server, running Red Hat Linux + their TCP plugin with NVMe SSDs on the storage side and a similar configuration on the client side without the SSDs.

If they add 10% latency to 10 microsec. IO (for Optane), latency becomes 11 microsec. Similarly for flash NVMe SSDs it moves from 100 microsec to 110 microsec.

Ahmet did mention that their NICs have some hardware optimizations which bring this added latency down to something closer to 5%. And later we discuss the immense parallelism opportunities using the TCP stack in user space. Their hardware also better supports more threads doing IO in parallel.
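For reference, the latency figures above work out as follows. This is just the percentage arithmetic from the discussion, with a hypothetical helper name, not a measurement.

```python
def with_overhead(base_us: float, overhead_pct: float) -> float:
    """Latency in microseconds after adding a percentage overhead."""
    return base_us * (100 + overhead_pct) / 100


for device, base_us in (("Optane", 10), ("flash NVMe", 100)):
    for pct in (5, 10):
        print(f"{device}: {base_us} us + {pct}% = {with_overhead(base_us, pct)} us")
```

At the 5% end the added latency is half a microsecond for Optane and 5 microseconds for flash; at 10% it is 1 and 10 microseconds respectively.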

Why TCP

Ahmet’s on a mission. He says there’s this misbelief that Ethernet RDMA hardware is required to achieve lightning fast response times using NVMeoF, but it’s not true. Standard TCP with proper protocol enhancements is more than capable of performing at very close to the same latencies as RDMA, without special NICs and DCB switch configurations.

Furthermore, TCP/IP already has multipathing support. So current high availability characteristics of TCP are readily applicable to NVMeoF/TCP.

Parallelism through user space

NVMeoF/TCP was the subject of the 1st half of our discussion but we spent the 2nd half talking about scaling or parallelism. Even if you can do 11 or 110 microsecond latency, at some point, if you do enough of these IOs, the kernel overhead in processing blocks and transferring control from kernel space to user space will become a bottleneck.

However, there’s nothing stopping IT from running the TCP/IP stack in user space and eliminating any kernel control transfer whatsoever. By doing so, data centers could parallelize all this IO using as many cores as available.

Running the plugin in a TCP/IP stack in user space allows you to scale NVMeoF lightning fast IO to as many users as you have user spaces or cores, and the kernel doesn’t even break into a sweat.

Anyone could simply download Solarflare’s plugin, configure a white box server with Linux and 24 NVMe SSDs and support ~8.4M IOPS (350Kx24) at ~110 microsec latency. And with user space scaling, one could easily have 1000s of user spaces connected to it.

They’re going to need faster pipes!
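A back-of-the-envelope calculation shows why the pipes matter. The per-SSD IOPS and drive count come from the discussion above; the 4 KiB IO size is an assumption made purely for illustration, since no IO size was stated.

```python
IOPS_PER_SSD = 350_000   # per-SSD figure quoted above
NUM_SSDS = 24
IO_SIZE_BYTES = 4096     # assumption: 4 KiB per IO; no size was stated

total_iops = IOPS_PER_SSD * NUM_SSDS
bytes_per_sec = total_iops * IO_SIZE_BYTES
gbits_per_sec = bytes_per_sec * 8 / 1e9

print(f"{total_iops / 1e6:.1f}M IOPS")   # prints "8.4M IOPS"
print(f"{gbits_per_sec:.0f} Gbps")       # prints "275 Gbps"
```

Under that assumption, one white box server would need close to three 100Gbps links just to carry the payload, before any protocol overhead.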

The podcast runs ~39 minutes. Ahmet was very knowledgeable about NVMe, NVMeoF and TCP.  He was articulate and easy to talk with.  Listen to the podcast to learn more.

Ahmet Houssein, VP of Marketing and Strategic Direction at Solarflare Communications 

Ahmet Houssein is responsible for establishing marketing strategies and implementing programs to drive revenue growth, enter new markets and expand brand awareness to support Solarflare’s continuous development and global expansion.

He has over twenty-five years of experience in the server, storage, data center and networking industry, and held senior level executive positions in product development, marketing and business development at Intel and Honeywell. Most recently Houssein was SVP/GM at QLogic where he successfully delivered first to market with 25Gb Ethernet products securing design wins at HP and Dell.

One of the key leaders in the creation of the INFINIBAND and PCI-Express industry standards, Houssein is a recipient of the Intel Achievement Award and was a founding board member of the Storage Networking Industry Association (SNIA), a global organization of 400 companies in the storage market. He was educated in London, UK and holds an Electrical Engineering Degree equivalent.

66: GreyBeards talk Midrange storage part 2, with Sean Kinney, Sr. Dir. Midrange Storage Mkt, Dell EMC

Sponsored by:

Dell EMC Midrange Storage

In this episode we talk with Sean Kinney (@SeanRKinney14), senior director, midrange storage marketing at Dell EMC. Howard and I have both known Sean for a number of years now. Sean has had multiple roles in the IT industry, doing various marketing and management duties at multiple vendors. He’s back at Dell EMC now and wanted to take the opportunity to discuss Dell EMC midrange storage with us.

As you probably already know, Dell EMC midrange storage dominates their market and has done so for a number of years now. Currently, Dell EMC midrange storage has 2X the revenue of any other competitor.

This is the third time (Dell) EMC has been on our show (see our EMCWorld2015 summary podcast with Chad Sakac, and  Talk with Pierluca Chiodelli sponsored podcast).  Since our last podcast, there’s been plenty of happenings at Dell EMC midrange storage.

Dell EMC Unity and SC storage news

Dell EMC Unity storage has recently added new file data reduction and file sync replication functionality. And a short time ago, Dell EMC came out with an AFA version of their SC Series storage.

With the two midrange product lines there’s been some cross fertilization. That is, Dell EMC is starting to take some of the best features from one solution and apply them to the other.

For example,

  • SC series has had its Health Check offering since Compellent days. This is a PS, offered by Dell EMC, that reviews the health of your data center’s SC storage, DR plans, backup activity, IO performance, etc. and provides recommendations as to how to improve the overall storage environment. The Health Check PS is now also available for Unity storage.
  • Unity storage has had its CloudIQ management/monitoring solution since December of 2016. CloudIQ is a big data analytics-remote management, software-as-a-service offering, running in the cloud that allows customers to manage/monitor Unity storage from anywhere. With SC Series’ latest, 7.3 code update, SC storage is also supported under CloudIQ.

We also discussed some of the inherent advantages to SC Series storage, such as their forever software licensing, storage federation/scale out clusters and economical $/GB pricing.

Sean mentioned some of the Future Proof guarantees that Dell EMC offers on both Unity and SC series storage. These include hardware investment protection, data-in-place upgrades, data reduction guarantees, etc.

The podcast runs ~20 minutes. Sean has been around the storage industry for a long time now and is very knowledgeable about Dell EMC Midrange storage as well as competitive solutions. Howard and I have talked with Sean at a number of industry events in the past and it was fun to talk with him again. Listen to the podcast to learn more.

Sean Kinney, Senior Director,  Dell EMC Midrange Storage Marketing

Sean Kinney is an industry leader in the storage and data protection market, with over 20 years of experience in the IT industry.

Currently, he is the Senior Director for midrange storage marketing at Dell EMC.  He spent the first 10 years of his career at EMC, and then held positions including VP and General Manager of online backup at Acronis and Senior Director, Storage Marketing at Hewlett-Packard Enterprise.

Sean has a B.A. from Dartmouth College and an M.B.A. from the University of Michigan.