85: GreyBeards talk NVMe NAS with Howard Marks, Technologist Extraordinary and Plenipotentiary, VAST Data Inc.

As most of you know, Howard Marks was a founding co-Host of the GreyBeards-On-Storage podcast and has since joined VAST Data, an NVMe file and object storage vendor headquartered in NY with R&D in Israel. We first met with VAST at StorageFieldDay18 (SFD18, video presentation). Howard announced his employment at that event. VAST was a bit circumspect at their SFD18 session, but Howard seems to be more talkative, so on the podcast we learned a lot more about their solution.

VAST Data is essentially a scale-out NFS and S3 object store, built from stateless VAST Data storage servers and JBoF drive enclosures holding Optane and NVMe QLC SSDs. Storage servers and JBoFs can be scaled independently. They don’t support tiering or DRAM caching of data, but instead seem to use the Optane SSDs as a write buffer for the QLC SSDs.

At the SFD18 event their spokesperson said that they were going to kill off disk storage media. (Ed’s note: Disk shipments fell 18% y/y in 1Q 2019, with enterprise disk shipments at 11.5M units, desktop at 24.5M units and laptops at 37M units).

The hardware

The VAST Data storage servers come in a 2U/4 server configuration that runs the interface protocols (NFS & S3), data reduction (see below), data reformatting/buffering, etc. They are stateless servers, with all the metadata and other control state maintained on JBoF Optane drives.

Each JBoF drive enclosure has 12 Optane SSDs and 44 U.2 QLC SSDs (no DRAM, no super cap). This means there are no write buffers on the QLC SSDs that can lose data when a power failure occurs. The interface to the JBoF is NVMeoF over either RDMA-RoCE Ethernet or InfiniBand (customer selected). Their JBoFs are highly available, with dual fabric modules that each support two 100Gbps Ethernet/InfiniBand ports, 4 per JBoF.

Minimum starting capacity is 500TB and they claim support up to Exabytes, although how much has actually been tested is an open question. They also support billions of objects/files.

Guaranteed better data reduction

They have a rather unique, multi-level data reduction scheme. At the start, data is chunked into variable length chunks, using heuristics to determine the chunk size that fits best. (Ed’s note: it’s unclear which comes first in the sequence below, so it’s presented in our view of logical order.)

  • 1st level computes a similarity hash (56-bit, not SHA1), which is used to determine a similarity level with any other data chunk currently stored in the system.
  • 2nd level uses a ZSTD compression algorithm. If a similarity is found, the new data chunk is compressed with ZSTD using the reference dictionary of the earlier, similar data chunk. If no existing chunk is similar to this one, the algorithm identifies a semi-unique reference dictionary that optimizes the compression of this data chunk. This semi-unique dictionary is stored as metadata.
  • 3rd level: if the new chunk turns out to be a complete duplicate, the dedupe count for the original data chunk is incremented, a pointer to the original unique data is saved and the new data is discarded. If it is not a complete duplicate of other data, the system computes a delta from the closest “similar” block, stores just the delta bytes along with a pointer to the original similar block, and increments a delta block counter.

So data is chunked, compressed with an optimized dictionary, and then either delta-diffed or deduped. All data reduction is done post data write (after the client is ACKed) and, presumably, data is re-hydrated after being read from SSD media. VAST Data guarantees better data reduction for your stored data than any other storage solution.
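To make the idea a bit more concrete, here is a minimal Python sketch of "dedupe exact duplicates, compress similar chunks against a reference". It is our illustration only and not VAST’s implementation: it uses the standard library’s zlib with a preset dictionary (`zdict`) as a stand-in for ZSTD with a reference dictionary, SHA-256 for exact-duplicate detection, and it skips the similarity hash entirely by having the caller name the reference chunk. `ChunkStore`, `put`, `get` and `reference_digest` are hypothetical names.

```python
import hashlib
import random
import zlib


def compress(chunk: bytes, reference: bytes = b"") -> bytes:
    """Deflate a chunk, optionally priming the compressor with a reference chunk."""
    if reference:
        c = zlib.compressobj(level=9, zdict=reference)
    else:
        c = zlib.compressobj(level=9)
    return c.compress(chunk) + c.flush()


def decompress(blob: bytes, reference: bytes = b"") -> bytes:
    """Inverse of compress(); needs the same reference that was used to compress."""
    d = zlib.decompressobj(zdict=reference) if reference else zlib.decompressobj()
    return d.decompress(blob) + d.flush()


class ChunkStore:
    """Toy store: exact duplicates are ref-counted, similar chunks share a dictionary."""

    def __init__(self):
        self.blobs = {}     # digest -> (compressed bytes, reference digest or None)
        self.refcount = {}  # digest -> number of logical copies

    def put(self, chunk: bytes, reference_digest=None) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in self.blobs:
            # Complete duplicate: bump the count, keep only the original.
            self.refcount[digest] += 1
            return digest
        reference = self.get(reference_digest) if reference_digest else b""
        self.blobs[digest] = (compress(chunk, reference), reference_digest)
        self.refcount[digest] = 1
        return digest

    def get(self, digest: str) -> bytes:
        blob, reference_digest = self.blobs[digest]
        reference = self.get(reference_digest) if reference_digest else b""
        return decompress(blob, reference)


# Demo: a pseudo-random (hard to compress) chunk and a near-copy of it.
rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(4096))
similar = base[:2000] + b"CHANGED!" + base[2008:]

store = ChunkStore()
base_id = store.put(base)
store.put(base)                                   # duplicate, only ref-counted
similar_id = store.put(similar, reference_digest=base_id)

print("standalone:", len(compress(similar)), "bytes")
print("against reference:", len(store.blobs[similar_id][0]), "bytes")
```

The near-copy compresses to a small fraction of its standalone size because almost all of it can be expressed as back-references into the reference chunk, which is roughly the effect a shared dictionary is meant to deliver.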

New data protection

They also supply a unique Locally Decodable Erasure Coding with 4 parity(-like) blocks and anywhere from 36 (a single enclosure, leaving 4 spare U.2 SSDs) to 150 data blocks per stripe, all of which support up to 4 device failures per stripe.
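A quick back-of-the-envelope check of those stripe widths, as a small Python sketch. The numbers come straight from the post (44 QLC SSDs per enclosure, 4 spares, 4 parity blocks); the function name `parity_overhead` is ours.

```python
QLC_PER_ENCLOSURE = 44
SPARES = 4
PARITY_BLOCKS = 4


def parity_overhead(data_blocks: int, parity_blocks: int = PARITY_BLOCKS) -> float:
    """Fraction of a stripe's raw capacity that holds parity rather than data."""
    return parity_blocks / (data_blocks + parity_blocks)


# Single enclosure: 44 drives minus 4 spares leaves 40, i.e. 36 data + 4 parity.
single_enclosure_data = QLC_PER_ENCLOSURE - SPARES - PARITY_BLOCKS
print("data blocks in a single-enclosure stripe:", single_enclosure_data)  # 36

for data_blocks in (36, 150):
    print(f"{data_blocks}+4: {parity_overhead(data_blocks):.1%} parity overhead")
# 36+4: 10.0% parity overhead
# 150+4: 2.6% parity overhead
```

So the widest stripe pays under 3% of raw capacity for protection against 4 device failures, which is the attraction of going that wide.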

The locally decodable erasure coding scheme allows for rebuilds without having to read all the remaining data blocks in a stripe. In this scheme, once the system has read the 4 parity(-like) blocks, it has all the information calculated from up to ¾ of the remaining drives in the stripe, so it only has to read the remaining ¼ of the drives in the stripe to reconstruct one, two, three, or four failing drives. Given their data stripe width, this cuts down considerably on the amount of data that needs to be read. Even so, with 150 data drives in a stripe, the system still has to read 38 drives’ worth of QLC SSD data to rebuild a data drive.
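The 38-drive figure is just the ¼ fraction rounded up, as this small Python sketch shows. It takes the ¼ number from the post at face value and, for comparison, assumes a conventional erasure code has to read as many surviving blocks as there are data blocks in the stripe. `rebuild_reads` is our own hypothetical helper.

```python
import math


def rebuild_reads(data_blocks: int, parity_blocks: int = 4, fraction: float = 0.25):
    """Return (data blocks read, total blocks read) for a locally decodable rebuild.

    Assumes, per the post, that the parity blocks plus `fraction` of the
    data blocks are enough to reconstruct the failed drives.
    """
    data_reads = math.ceil(data_blocks * fraction)
    return data_reads, data_reads + parity_blocks


for data_blocks in (36, 150):
    data_reads, total = rebuild_reads(data_blocks)
    print(f"{data_blocks} data blocks: read {data_reads} data + 4 parity = {total} "
          f"(vs {data_blocks} for a conventional rebuild)")
# 36 data blocks: read 9 data + 4 parity = 13 (vs 36 for a conventional rebuild)
# 150 data blocks: read 38 data + 4 parity = 42 (vs 150 for a conventional rebuild)
```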

In addition to all the above, VAST Data also reblocks the data into much larger segments (it writes 1MB segments to the QLC drives) and uses a heat map along with other heuristics to separate actively written data from less actively written data, thus reducing garbage collection and write amplification.
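As a rough illustration of what separating hot from cold data into large segments could look like, here is a toy Python sketch. Only the 1MB segment size comes from the post; the single write-count threshold, the block tuples and the name `pack_by_heat` are our assumptions, and VAST’s actual heat map and heuristics are surely more involved.

```python
SEGMENT_BYTES = 1024 * 1024  # 1MB segments, per the post


def pack_by_heat(blocks, hot_threshold=10):
    """Pack blocks into 1MB segments, keeping hot and cold data apart.

    blocks: iterable of (block_id, size_bytes, recent_write_count).
    Returns {"hot": [[block_id, ...], ...], "cold": [...]}, one inner list
    per segment. The threshold is a stand-in for a real heat map.
    """
    segments = {"hot": [], "cold": []}
    open_ids = {"hot": [], "cold": []}   # block ids in the segment being filled
    open_used = {"hot": 0, "cold": 0}    # bytes used in that segment

    for block_id, size, writes in blocks:
        heat = "hot" if writes >= hot_threshold else "cold"
        if open_ids[heat] and open_used[heat] + size > SEGMENT_BYTES:
            # Current segment is full: seal it and start a new one.
            segments[heat].append(open_ids[heat])
            open_ids[heat], open_used[heat] = [], 0
        open_ids[heat].append(block_id)
        open_used[heat] += size

    # Seal any partially filled segments.
    for heat in ("hot", "cold"):
        if open_ids[heat]:
            segments[heat].append(open_ids[heat])
    return segments


# Ten 256KB blocks, even ids frequently rewritten, odd ids rarely.
demo = [(i, 256 * 1024, 50 if i % 2 == 0 else 0) for i in range(10)]
print(pack_by_heat(demo))
# {'hot': [[0, 2, 4, 6], [8]], 'cold': [[1, 3, 5, 7], [9]]}
```

The point of keeping the two apart is that a segment full of hot data tends to be invalidated all at once, so it can be erased without first copying still-valid cold data elsewhere.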

The podcast is a long one and runs ~43 minutes. Howard has always been great to talk with and, if anything, now being a vendor has intensified this tendency. Listen to the podcast to learn more.

Howard Marks, Technologist Extraordinary and Plenipotentiary, VAST Data, Inc.

Howard Marks brings over forty years of experience as a technology architect for hire and industry observer to his role as VAST Data’s Technologist Extraordinary and Plenipotentiary. In this role, Howard demystifies VAST’s technologies for customers and customer requirements for VAST’s engineers.

Before joining VAST, Howard ran DeepStorage, an industry test lab and analyst firm. An award-winning speaker, he has appeared at events on three continents including Comdex, Interop and VMworld.

Howard is the author of several books (all gratefully out of print) and hundreds of articles since Bill Machrone taught him journalism at PC Magazine in the 1980s.

Listeners may also remember that Howard was a founding co-Host of the Greybeards-on-Storage Podcast.
