127: Annual year end wrap up podcast with Keith, Matt & Ray

[Ray’s sorry about his audio, it will be better next time he promises, The Eds] This was supposed to be the year where we killed off COVID for good. Alas, it was not to be and it’s going to be with us for some time to come. However, this didn’t stop that technical juggernaut we call the GreyBeards on Storage podcast.

Once again we got Keith, Matt and Ray together to discuss the past year’s top 3 technology trends that would most likely impact the year(s) ahead. Given our recent podcasts, Kubernetes (K8s) storage was top of the list. To this we add AI-MLops in the enterprise and continued our discussion from last year on how Covid & WFH are remaking the world, including offices, data centers and downtowns around the world. Listen to the podcast to learn more.

K8s rulz

For some reason, we spent many of this year’s podcasts discussing K8s storage. TK8s was never meant to provide (storage) state AND as a result, any K8s data storage has had to be shoe horned in.

Moreover, why would any IT group even consider containerizing enterprise applications let alone deploy these onto K8s. The most common answers seem to be automatic scalability, cloud like automation and run-anywhere portability.

Keith chimed in with enterprise applications aren’t going anywhere and we were off. Just like the mainframe, client-server and OpenStack applications before them, enterprise apps will likely outlive most developers, continuing to run on their current platforms forever.

But any new apps will likely be born, live a long life and eventually fade away on the latest runtime environment. which is K8s.

Matt mentioned hybrid and multi-cloud as becoming the reason-d’etre for enterprise apps to migrate to containers and K8s. Further, enterprises have pressing need to move their apps to the hybrid- & multi-cloud model. AWS’s recent hiccups, notwithstanding, multi-cloud’s time has come.

Ray and Keith then discussed which is bigger, K8s container apps or enterprise “normal” (meaning virtualized/bare metal) apps. But it all comes down to how you define bigger that matters, Sheer numbers of unique applications – enterprise wins, Compute power devoted to running those apps – it’s a much more difficult race to cal/l. But even Keith had to agree that based on compute power containerized apps are inching ahead.

AI-MLops coming on strong

AI /MLops in the enterprise was up next. For me the most significant indicator for heightened interest in AI-ML was VMware announced native support for NVIDIA management and orchestration AI-MLops technologies.

Just like K8s before it and VMware’s move to Tanzu and it’s predecessors, their move to natively support NVIDIA AI tools signals that the enterprise is starting to seriously consider adding AI to their apps.

We think VMware’s crystal ball is based on

  • Cloud rolling out more and more AI and MLops technologies for enterprises to use. on their infrastructure
  • GPUs are becoming more and more pervasive in enterprise AND in cloud infrastructure
  • Data to drive training and inferencing is coming out of the woodwork like never before.

We had some discussion as to where AMD and Intel will end up in this AI trend.. Consensus is that there’s still space for CPU inferencing and “some” specialized training which is unlikely to go away. And of course AMD has their own GPUs and Intel is coming out with their own shortly.

COVID & WFH impacts the world (again)

And then there was COVID and WFH. COVID will be here for some time to come. As a result, WFH is not going away, at least not totally any time soon. And is just becoming another way to do business.

WFH works well for some things (like IT office work) and not so well for others (K-12 education). If the GreyBeards were into (non-crypto) investing, we’d be shorting office real estate. What could move into those millions of square feet (meters) of downtime office space is anyones guess. But just like the factories of old, cities and downtowns in particular can take anything and make it useable for other purposes.

That’s about it, 2021 was another “interesteing” year for infrastructure technology. It just goes to show you, “May you live in interesting times” is actually an old (Chinese) curse.

Keith Townsend, (@TheCTOadvisor)

Keith is a IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations. Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIN.

Matt Leib, (@MBLeib)

Matt Leib has been blogging in the storage space for over 10 years, with work experience both on the engineering and presales/product marketing. His blog is at Virtually Tied to My Desktop and he’s on LinkedIN.

Ray Lucchesi, (@RayLucchesi)

Ray is the host and co-founder of GreyBeardsOnStorage and is President/Founder of Silverton Consulting, and a prominent (AI/storage/systems technology) blogger at RayOnStorage.com. Signup for SCI’s free, monthly industry e-newsletter here, published continuously since 2007. Ray can also be found on LinkedIn

113: GreyBeards talk storage for next gen. workloads with Liran Zvibel, Co-Founder & CEO WekaIO

Sponsored By:

I’ve known Liran Zvibel, Co-founder and CEO of Weka IO for many years now and it’s the second time he’s been on our show, (see: Episode 56: GreyBeards talk high performance file storage...). In those days, WekaIO was just coming out and hitting the world with this extremely high-performing, scale out unstructured data solution. Well since then, they’ve just gotten better.

Keith and I had a great time talking with Liran again. Liran has deep knowledge about unstructured data and how enterprises use it these days. WekaIO’s story, over the last two years has gone beyond great performance to real world, hybrid cloud offerings e as well as going after the cloud native app’s (read Kubernetes [K8S]) persistent storage. Listen to the podcast to learn more.

We started with a history lesson on WekaIO. Back in those days (which persists today, I might add) there were many IO workloads that required companies to purchase different solutions for different work. For example, they needed DAS or SAN for performance, NAS for ease of access and object for scale. WekaIO came out with an answer to all these problems in a single, scaleable storage system. That is, they performed IO as fast as DAS or SAN block, had all the ease of access of NAS, and could scale as much as object.

However, the real culprit holding the world back was “NFS”. At the outset NFS was designed (back in the 1990s) with the then current networking speeds available (10-100Mbps), which performed just fine at those speeds. But when 10-100GbE came out in the 2000’s, NFS’s metadata overhead was too chatty to support wire speeds. Thus, any storage that depended on NFS protocols couldn’t supply (small) files fast enough for modern applications.

This is why WekaIO has moved to not only support NFS and SMB but also POSIX and NVIDIA® GPUDirect® Storage interfaces. By offering POSIX, WekaIO is able to plug into standard Linux and Windows server systems and provide excellent small file performance. Of course applications that demand small file performance today are mostly data analytics and AI/ML/DL workloads.

Consequently., NVIDIA came out with their GPUDirect Storage protocol to address getting small file (data) into GPUs faster. With GPUDirect, storage systems can RDMA data directly from storage to GPU memory and vice versa, with no OS intervention (other than to set up the transfer). If you happen to have a small file, high performing storage system attached to your fabric that supports GPUDirect , like WekaIO, you can significantly speed up your AI/ML/DL workloads.

Next we started talking K8S storage. WekaIO usestheir POSIX interface in their CSI plugin to support K8S container persistent storage. Again, supplying high performance for small files seems to be tailor made for K8S container applications that exist today and will for the foreseeable future.

Enter the cloud. Almong other things, WekaIO is a AWS primary storage vendor. It also offers snap to cloud. And with both of these in tandem, it’s just become a lot easier to move and access your unstructured data in the cloud. Liran mentioned that WekaIO primary storage in AWS operates across AZ’s. This means it can be configured to support better availability than EBS.

Large BioPharma companies are using WekaIO in AWS to store and process field data and research data, so that this work can be done around the world. Some companies have run out of compute in a single AZ (unbelievable I know but it’s COVID). By offering multi-AZ support unstructured data access with WekaIO, these companies can spread their compute across AZ’s and region and still access their data. And when their products are ready for gov’t certification, having all this data in the cloud, can make provide an easy way to have gov’t access this same data.

Liran Zvibel, Co-founder and CEO WekaIO

As Co-Founder and CEO, Mr. Liran Zvibel guides long term vision and strategy at WekaIO. Prior to creating the opportunity at WekaIO, he ran engineering at social startup and Fortune 100 organizations including Fusic, where he managed product definition, design, and development for a portfolio of rich social media applications.

Liran also held principal architectural responsibilities for the hardware platform, clustering infrastructure and overall systems integration for XIV Storage System, acquired by IBM in 2007.

Mr. Zvibel holds a BSc.in Mathematics and Computer Science from Tel Aviv University.

109: GreyBeards talk SmartNICs & DPUs with Kevin Deierling, Head of Marketing at NVIDIA Networking

We decided to take a short break (of sorts) from storage to talk about something equally important to the enterprise, networking. At (virtual) VMworld a month or so ago, Pat made mention of developing support for SmartNIC-DPUs and even porting vSphere to run on top of a DPU. So we thought it best to go to the source of this technology and talk with Kevin Deierling (TechSeerKD), Head of Marketing at NVIDIA Networking who are the ones supplying these SmartNICs to VMware and others in the industry.

Kevin is always a pleasure to talk with and comes with a wealth of expertise and understanding of the technology underlying data centers today. The GreyBeards found our discussion to be very educational on what a SmartNIC or DPU can do and why VMware and others would be driving to rapidly adopt the technology. Listen to the podcast to learn more.

NVIDIA’s recent acquisition of Mellanox brought them Mellanox’s NIC, switch and router technology. And while Mellanox, and now NVIDIA have some pretty impressive switches and routers, what interested the GreyBeards was their SmartNIC technology.

Essentially, SmartNICS provide acceleration and offload of data handling needs required to move data around an enterprise network. These offload services include at a minimum, encryption/decryption, packet pacing (delivering gadzillion video streams at the right speed to insure proper playback by all), compression, firewalls, NVMeoF/RoCE, TCP/IP, GPU direct storage (GDS) transfers, VLAN micro-segmentation, scaling, and anything else that requires real time processing to perform at line speeds.

For those who haven’t heard of it, GDS transfers data from storage directly into GPU memory and from GPU memory directly to storage without any CPU cycles or server memory involvement, other than to set up the transfer. This extends NVMeoF RDMA tech to/from storage and server memory, to GPUs. That is, GDS offers a RDMA like path between storage and GPU memory. GPU to/from server memory direct interface already exists over the PCIe bus.

But even with all the offloads and accelerators above, they can also offer an additional a secure enclave outside the TPM in the CPU, to better isolate security sensitive functionality for a data center. (See DPU below).

Kevin mentioned multiple times that the new unit of computation is no longer a server but rather is now a data center. When you have public cloud, private cloud and other systems that all serve up virtual CPUs, NICs, GPUs and storage, what’s really being supplied to a user is a virtual data center. Cloud providers can carve up their hardware and serve it to you any way you want or need it. Virtual data centers can provide a multitude of VMs and any infrastructure that customers need to use to run their workloads.

Kevin mentioned by using SmartNics, IT or cloud providers can return 30% of the processor cycles (that were being spent doing networking work on CPUs) back to workloads that run on CPUs. Any data center can effectively obtain 30% more CPU cycles and increased networking speed and performance just by deploying SmartNICs throughout all the servers in their environment.

SmartNICs are an outgrowth of Mellanox technology embedded in their HPC InfiniBAND and high end Ethernet switches/routers. Mellanox had been well known for their support of NVMeoF/RoCE to supply high IOPs/low-latency IO activity for NVMe storage over Ethernet and before that their InfiniBAND RDMA technologies.

As Mellanox came out with their 2nd Gen SmartNIC they began to call their solution a “DPU” (data processing unit), which they see forming part of a “holy trinity” underpinning the new data center which has CPUs, GPUs and now DPUs. But a DPU is more than just a SmartNIC.

All NVIDIA SmartNICs and DPUs are based on Mellanox’s BlueField cards and chip technology. Their DPU uses BlueField2 (gen 2 technology) chips, which has a multi-core ARM engine inside of it and memory which can be used to perform computational processing in addition to the onboard offload/acceleration capabilities.

Besides adding VMware support for SmartNICs, PatG also mentioned that they were porting vSphere (ESX) to run on top of NVIDIA Networking DPUs. This would move the core VMware’s hypervisor functionality from running on CPUs, to running on DPUs. This of course would free up most if not all VMware Hypervisor CPU cycles for use by customer workloads.

During our discussion with Kevin, we talked a lot about the coming of AI-ML-DL workloads, which will require ever more bandwidth, ever lower latencies and ever more compute power. NVIDIA was a significant early enabler of the AI-ML-DL with their CUDA API that allowed a GPU to be used to perform DL network training and inferencing. As such, CUDA became an industry wide phenomenon allowing industry wide GPUs to be used as DL compute engines.

NVIDIA plans to do the same with their SmartNICs and DPUs. NVIDIA Networking is releasing the DOCA (Data center On a Chip Architecture) SDK and API. DOCA provides the API to use the BlueField2 chips and cards which are the central techonology behind their DPU. They have also announced a roadmap to continue enhancing DOCA, as they have done with CUDA, over the foreseeable future, to add more bandwidth, speed and functionality to DPUs.

It turns out the real problem which forced Mellanox and now NVIDIA to create SmartNics was the need to support the extremely low latencies required for NVMeoF and GDS IO.

It wasn’t clear that the public cloud providers were using SmartNICS but Kevin said it’s been sort of a widely known secret that they have been using the tech. The public clouds (AWS, Azure, Alibaba) have been deploying SmartNICS in their environments for some time now. Always on the lookout for any technology that frees up compute resources to be deployed for cloud users, it appears that public cloud providers were early adopters of SmartNICS.

Kevin Deierling, Head of Marketing NVIDIA Networking

Kevin is an entrepreneur, innovator, and technology executive with a proven track record of creating profitable businesses in highly competitive markets.

Kevin has been a founder or senior executive at five startups that have achieved positive outcomes (3 IPOs, 2 acquisitions). Combining both technical and business expertise, he has variously served as the chief officer of technology, architecture, and marketing of these companies where he led the development of strategy and products across a broad range of disciplines including: networking, security, cloud, Big Data, machine learning, virtualization, storage, smart energy, bio-sensors, and DNA sequencing.


Kevin has over 25 patents in the fields of networking, wireless, security, error correction, video compression, smart energy, bio-electronics, and DNA sequencing technologies.

When not driving new technology, he finds time for fly-fishing, cycling, bee keeping, & organic farming.

This image has an empty alt attribute; its file name is Subscribe_on_iTunes_Badge_US-UK_110x40_0824.png
This image has an empty alt attribute; its file name is play_prism_hlock_2x-300x64.png
This image has an empty alt attribute; its file name is Spotify_Logo_CMYK_Black-1024x307.png


106: Greybeards talk Intel’s new HPC file system with Kelsey Prantis, Senior Software Eng. Manager, Intel

We had talked with Intel at Storage Field Day 20 (SFD20), about a month ago. At the virtual event, Intel’s focus was on their Optane PMEM (persistent memory) technology. Kelsey Prantis (@kelseyprantis), Senior Software Engineering Manager, Intel was on the show and gave an introduction into Intel’s DAOS (Distributed Architecture Object Storage, DAOS.io) a new HPC (high performance computing, super computers) file system they developed from scratch to use leading edge, Intel technologies, and Optane PMEM was one of them.

Kelsey has worked on LUSTRE and other HPC file systems for a long time now and came into the company from the acquisition of Whamcloud. Currently, she manages the development team working on DAOS. DAOS is a new HPC object storage file system which is completely open source (available on GitHub).

DAOS was designed from the start to take advantage of NVMe SSDs and Optane PMEM. With PMEM, current servers can support up to 20TB of memory. Besides the large memory sizes, Optane PMEM also offers non-volatile memory and byte addressability (just like DRAM). These two characteristics opens up new functionality that allows DAOS to move beyond legacy, block oriented, storage architectures that have been the only storage solution for HPC (and the enterprise) for decades now.

What’s different about DAOS

DAOS uses PMEM for all metadata and for storing small files. HPC IO has always focused on heavy bandwidth (IO using large blocks) oriented but lately newer applications have emerged, such as AI/ML/DL, data analytics and others, that use smaller files/blocks. Indeed, most new HPC clusters and supercomputers are deploying almost as many GPUs as CPUs in their configurations to support AI activities.

The problem is that these newer applications typically consume much smaller files. Matt mentioned one HPC client he worked with was processing small batches of seismic data, to predict, in real time, earthquakes that were happening around the world.

By using PMEM for metadata and small files, DAOS can be much more responsive to file requests (open, close, delete, status) as well as provide higher performing IO for small files. All this leads to a much better performing system for the new HPC workloads as well as great sustainable performance for the more traditional large file workloads.

DAOS storage

DAOS provides a cluster storage system that can be configured with from 1 (no data protection), but more normally 3 nodes (with data protection) at a minimum to 512 nodes (lab tested). Data protection in DAOS is currently based on mirroring data and can use from 0 to the number of nodes in a cluster as data mirrors.

DAOS system nodes are homogeneous. That is they all come with the same amount of PMEM and NVMe SSDs. Note, DAOS doesn’t support disk drives. Kelsey mentioned DAOS node hardware can be tailored to suit any particular application environment. But they typically require an average of 6% of overall DAOS system capacity in PMEM for metadata and small file activity.

DAOS current supports their own API, POSIX, HDFS5, MPIIO and Apache Spark storage protocols. Kelsey mentioned that standard POSIX uses a pessimistic conflict resolution mode which leads to performance bottlenecks during parallel access. In contrast, DAOS’s versos of POSIX uses optimistic conflict resolution, which means DAOS starts writes assuming there’s no conflict, but if one occurs it handles the conflict in real time. Of course with all the metadata byte addressable and in PMEM this doesn’t take up a lot of (IO) time.

As mentioned earlier, DAOS data protection uses mirror-replicas. However, unlike most other major file systems, DAOS mirroring can be done at the object level. DAOS internally is an object store. Data organization on DAOS starts at the pool level, underneath that is data containers, and then under that are objects. Any object in DAOS can have its own mirroring configuration. DAOS is working towards supporting Erasure Coding as another form of data protection for a future release.

DAOS performance

There’s a new storage benchmark that was developed specifically for HPC, called the IO500. The IO500 benchmark simulates a number of different HPC workloads, measures performance for each of them, and computes an (aggregate) performance score to rank HPC storage systems.

IO500 ranks system performance using two lists: one is for any sized configuration that typically range from 50 to 1000s of nodes and their other list limits the configuration to 10 nodes. The first performance ranking can sometimes be gamed by throwing more hardware into a cluster. The 10 node rankings are much harder to game this way and from our perspective, show a fairer comparison of system performance.

As presented (virtually) at ISC 2020, DAOS took the top spot on the IO500 any size configuration list and performed better than 2X the next best solution. And on the IO500 10 node list, Intel’s DAOS configuration, Texas Advanced Computing (TAC) DAOS configuration, and Argonne Nat Labs DAOS configuration took the top 3 spots and had 3X better performance than the next best, non-DAOS storage system.

The Argonne National Labs has already stated that they will be using DAOS in their new HPC system to be deployed in the near future. Early specifications for storage at the new Argonne Lab required support for 230PB of data and 25TB/sec of bandwidth.

The podcast ran ~43 minutes. Kelsey was great to talk with and very knowledgeable about HPC systems and HPC IO in particular. Matt has worked at Argonne in the past so understood these systems better than I. Sadly, we lost Matt’s end of the conversation about 1/2 way into the recording. Both Matt and I thought that DAOS represents the birth of a new generation of HPC storage. Listen to the podcast to learn more.


This image has an empty alt attribute; its file name is Spotify_Logo_CMYK_Black-1024x307.png

This image has an empty alt attribute; its file name is play_prism_hlock_2x-300x64.png
This image has an empty alt attribute; its file name is Subscribe_on_iTunes_Badge_US-UK_110x40_0824.png

Kelsey Prantis, Senior Software Engineering Manager, Intel

 Kelsey Prantis heads the Extreme Storage Architecture and Development division at Intel Corporation. She leads the development of Distributed Asynchronous Object Storage (DAOS), an open-source, low-latency and high IOPS object store designed from the ground up for massively distributed Non-Volatile Memory (NVM).

She joined Intel in 2012 with the acquisition of Whamcloud, where she led the development of the Intel Manager for Lustre* product.

Prior to Whamcloud, she was a software developer at personal genomics and biotechnology company 23andMe.

Prantis holds a Bachelor’s degree in Computer Science from Rochester Institute of Technology

103: Greybeards talk scale-out file and cloud data with Molly Presley & Ben Gitenstein, Qumulo

Sponsored by:

Ray has known Molly Presley (@Molly_J_Presley), Head of Global Product Marketing for just about a decade now and we both just met Ben Gitenstein (@Qumulo_Product), VP of Products & Solutions, Qumulo on this podcast. Both Molly and Ben were very knowledgeable about the problems customers have with massive data troves. 

Molly has been on our podcast before (with another company, (see: GreyBeards talk HPC storage with Molly Rector, CMO & EVP, DDN ). And we have talked with Qumulo before as well (see: GreyBeards talk data-aware, scale-out file systems with Peter Godman, Co-founder & CEO, Qumulo)

Qumulo has a long history of dealing with customer issues with data center application access to data, usually large data repositories, with billions of small or large files, they have accumulated over time.  But recently Qumulo has taken on similar problems in the cloud as well.

Qumulo’s secret has always been to allow researchers to run their applications wherever their data  resides. This has led Qumulo’s software defined storage to offer multiple protocol access as well as a completely native, AWS and GCP cloud version of their solution. 

That way customers can run Qumulo in their data center or in the cloud and have the same great access to data. Molly mentioned one customer that creates and gathers data using SMB protocol on prem and then, after replication, processes it in the cloud. 

Qumulo Shift

Ben mentioned that many competitive storage systems are business model focused. That is they are all about keeping customer data within their solutions so they can charge for capacity. Although Qumulo also charges for capacity, with the new <strong>Qumulo Shift</strong> service, customer can easily move data off Qumulo and into native cloud storage. Using Shift, customers can free up Qumulo storage space (and cost) for any data that only needs to be accessed as objects.

With Shift, customers can replicate or move on prem or in the cloud Qumulo file data to AWS S3 objects. Once in S3, customers can access it with AWS native applications, other applications that make use of AWS S3 data, or can have that data be accessible around the world.

Qumulo customers can select directories to Shift to an AWS S3 bucket. The Qumulo directory name will be mapped to a S3 bucket name and each file in that directory will be copied to an S3 object in that bucket with the same file name.

At the moment, Qumulo Shift only supports AWS S3. Over time, Qumulo plans to offer support for other public cloud storage targets for Shift.

Shift is based on Qumulo replication services. Qumulo has a number of patents on replication technology that provides for sophisticated monitoring, control and high performance for moving vast amounts of data.

How customers use Shift

One large customer uses Qumulo cloud file services to process seismic data but then makes the results of that analysis available to other clients as S3 objects. 

Customers can also take advantage of AWS and other applications that support objects only. For example, AWS SageMaker Machine Learning (ML) processes S3 object data. Qumulo customers could gather training data as files and Shift it to S3 objects for ML training.

Moreover, customers can use Shift to create AWS S3 object  backups, archives and DR repositories of Qumulo file data. Ben mentioned DevOps could also use Qumulo Shift via APIs to move file data to S3 objects as part of new application deployment.

Finally, using Shift to copy or move file data to AWS S3, makes it ideal for collaboration by researchers, analysts and just about other entity that needs access to data. 

The podcast ran ~26 minutes. Molly has always been easy to talk with and Ben turned out also to be easy to talk with and knew an awful lot  about the product and how customers can use it. Keith and I enjoyed our time with Molly and Ben discussing Qumulo and their new Shift service. Listen to the podcast to learn more.

Ben Gitenstein, VP of Products and Solutions, Qumulo

Ben Gitenstein runs Product at Qumulo. He and his team of product managers and data scientists have conducted nearly 1,000 interviews with storage users and analyzed millions of data points to understand customer needs and the direction of the storage market.

Prior to working at Qumulo, Ben spent five years at Microsoft, where he split his time between Corporate Strategy and Product Planning.

Molly Presley, Head of Global Product Marketing, Qumulo

Molly Presley joined Qumulo in 2018 and leads worldwide product marketing. Molly brings over 15 years of file system and archive technology leadership experience to the role. 

Prior to Qumulo, Molly held executive product and marketing leadership roles at Quantum, DataDirect Networks (DDN) and Spectra Logic.

Presley also created the term “Active Archive”, founded the Active Archive Alliance and has served on the Board of the Storage Networking Industry Association (SNIA).

(Updated due to formatting problem, The Eds.)