83: GreyBeards talk NVMeoF/TCP with Muli Ben-Yehuda, Co-founder & CTO and Kam Eshghi, VP Strategy & Bus. Dev., Lightbits Labs

This is the first time we’ve talked with Muli Ben-Yehuda (@Muliby), Co-founder & CTO and Kam Eshghi (@KamEshghi), VP of Strategy & Business Development, Lightbits Labs. Keith and I first saw them at Dell Tech World 2019, in Vegas as they are a Dell Ventures funded organization. The company has 70 (mostly engineering) employees and is based in Israel, with offices in NY and the Valley as well as elsewhere around the world. Kam was previously with (Dell) EMC DSSD and Muli’s spent years as a Master Inventor with IBM Research.

[This was Keith Townsend’s (@CTOAdvisor & The CTO Advisor) first time as a GreyBeard co-host and we had a great time with him on the show.]

I would have to say it was a far-ranging discussion, but it focused on their software defined NVMeoF/TCP storage. As you may recall, we talked with Solarflare Communications last year, who were also working on NVMeoF/TCP, only in their case it was an accelerator board. After the recording, Muli said the hardware accelerator they have is their own design.

Why NVMeoF/TCP?

Most NVMeoF that uses Ethernet today requires RoCE or iWARP compatible NICs and switches. Lightbits Labs has long been active in the NVMeoF/RoCE-iWARP marketplace. Early on they noticed that enterprises and cloud service providers were reluctant to adopt NVMeoF technology because of the need to change out all their networking equipment to use it. This is what brought about their focus on NVMeoF/TCP.

The advantage of NVMeoF/TCP is that it can be run on any Ethernet NIC and switch available today. From Muli’s perspective, NVMeoF/TCP is going to become the next SAN of choice for the data center. They were active, early on, in the standards committee to push for NVMeoF/TCP adoption.

How does it work?

Their software defined solution runs LightOS® storage software, a Linux based package, and uses off the shelf, server hardware with persistent storage (Optane DC PM/SSDs, NV DIMMs, V-NAND, etc.). They use persistent memory for a FAST write buffer and a place where they can “mold” the written data into something that can be better written to backend NVMe SSDs.

One surprise about Lightbits’ solution is that it offers a decent set of data services. These include erasure coding, thin provisioning, wire-speed inline compression, QoS and wide striping. It seems any of these can be disabled if a customer wants, but they add very little overhead. I think Muli mentioned one Lightbits customer with encrypted data that disabled compression.
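That last anecdote reflects a general property: well encrypted data looks random, and random data has no redundancy for a compressor to remove. Here is a minimal sketch of the effect, using Python's standard zlib and os.urandom as a stand-in for ciphertext. Both are assumptions for illustration only; this is not Lightbits' compression engine.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data))


# Highly repetitive payload, standing in for compressible application data.
plain = b"GreyBeards on Storage " * 4096

# Random bytes, standing in for encrypted data (assumption: good ciphertext
# is statistically indistinguishable from random bytes).
cipher_like = os.urandom(len(plain))

if __name__ == "__main__":
    print(f"repetitive data ratio: {compression_ratio(plain):.1f}x")
    print(f"cipher-like ratio:     {compression_ratio(cipher_like):.2f}x")
```

The repetitive payload shrinks dramatically while the random payload does not shrink at all, so spending compression cycles on already encrypted data buys nothing.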

Lightbits also offers a global FTL (flash translation layer), which means they control SSD addressing which maps data to physical/raw NAND locations at the storage system level. If done well, a global FTL can help improve flash endurance and may offer better write performance (through increased parallelism).

Lightbits’ claim of inline, wire-speed data compression is premised on the use of more current CPUs with high (>=28) core counts in a storage server. If the storage server has older CPUs (<28 cores), they suggest you install their LightField™ hardware accelerator add-in card. LightField offers a number of hardware based performance accelerations in addition to compression speedups.

LightOS requires no host (client) software. Muli’s a long time Linux kernel contributor and indicated that the only thing LightOS needs is a current Linux Kernel (5.0 or later) which has the NVMeoF/TCP driver software (and persistent memory). Lightbits believes that it’s only a matter of time until other OSs also implement NVMeoF/TCP drivers.
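As a rough illustration of that host-side requirement, here is a small sketch that checks whether a Linux kernel release string is at least 5.0. The helper name kernel_at_least is hypothetical, and a version check is only a heuristic, since distributions can backport or omit drivers.

```python
import platform
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Return True if a kernel release string such as '5.4.0-42-generic'
    is at least major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    # platform.release() returns the running kernel's release on Linux.
    release = platform.release()
    print(release, "->", kernel_at_least(release, 5, 0))
```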

Lightbits business considerations

Long term, Lightbits sees a need for compute-storage disaggregation in hyperscaler and enterprise cloud environments. Early on it was relatively easy to replicate servers with DAS storage, but as NVMe SSDs came out, the expense of doing this throughout a >>1000 server environment started to become exorbitant. If only these customers had an easy way to disaggregate their storage from compute and still enjoy all the performance advantages of DAS NVMe SSDs. With LightOS they can do that.

Lightbits can be sold today through Dell, as a partner solution, which means that Dell can integrate, test and validate their servers with LightField accelerator card and deliver that package to your data center. I believe you still need to purchase and install their LightOS software yourself.

Lightbits charges for LightOS software on a per storage node basis, but they have different charges based on the maximum number of NVMe SSD slots available in a server. There is no capacity charge. They also offer worldwide service and support for LightOS software and LightField hardware.

It’s all about performance

From a performance perspective, one Fortune 500 hyperscaler benchmarked their storage solution against a DAS NVMe server and found it added about 30 µsec to the IO latency as compared to DAS NVMe SSDs. From their perspective, the added data services, better endurance, and disaggregated compute-storage environment provided by LightOS more than made up for the additional overhead.

Finally, I asked whether multiple LightOS storage servers could be clustered together. Muli intervened and, after stating some legal stuff, said they were working on the next generation LightOS and that it will support clustered storage servers, local data replication as well as distributed (across storage servers) erasure coding.

The podcast is a long one and runs ~47 minutes. There was a lot to talk about and Kam and Muli seem to know it all. It was interesting to hear the history of their pivot to TCP. They seem to have the right technology to address the market. Listen to the podcast to learn more.

Muli Ben-Yehuda, Co-founder and CTO, Lightbits Labs

Muli Ben-Yehuda is the CTO and Co-Founder of Lightbits Labs, where he leads technological developments.

Prior to founding Lightbits, he was chief scientist at Stratoscale and a researcher and Master Inventor at IBM Research.

He holds an M.Sc. in Computer Science (summa cum laude) from the Technion — Israel Institute of Technology and a B.A. (cum laude) from the Open University of Israel.

He is a long time Linux kernel contributor and his code and ideas are most likely included in an operating system or hypervisor running near you. He is also one of the authors of the NVMe/TCP standard and technology. 

Kam Eshghi, VP Strategy & Business Development, Lightbits Labs

Kam joined Lightbits Labs from Dell EMC and has over 20yrs of experience in strategic marketing and business development with startups and public companies.

Most recently as VP of strategic alliances at startup DSSD, Kam led business development with technology partners and developed DSSD’s partnership with EMC, leading to EMC’s acquisition of DSSD.

Previously as Sr. Director of Marketing & Business Development at IDT, Kam built their NVMe Controller business from scratch. Previous to that, Kam worked in data center storage, compute and networking markets at HP, Intel, and Crosslayer Networks. 

Kam is a U.C. Berkeley and MIT graduate with a BS and MS in Electrical Engineering and Computer Science and an MBA.

79: GreyBeards talk AI deep learning infrastructure with Frederic Van Haren, CTO & Founder, HighFens, Inc.

We’ve talked with Frederic before (see: Episode #33 on HPC storage) but since then, he has worked for an analyst firm and now he’s back on his own again, at HighFens. Given all the interest of late in AI, machine learning and deep learning, we thought it would be a great time to catch up and have him shed some light on deep learning and what it needs for IT infrastructure.

Frederic has worked on HPC / Big Data / AI / IoT solutions in the speech recognition industry, providing speech recognition services for some of the largest organizations in the world. As I understand it, the last speech recognition AI application he worked on implemented deep learning.

A brief history of AI

Frederic walked the Greybeards through the history of AI from the dawn of computing (1950s) until the recent emergence of deep learning (2010).

He explained that, early on, one could implement a chess playing program using hand coded rules based on a chess expert’s playing technique. Later, when machine learning came out, one could use statistical analysis on multiple games and limited rule creation to teach an AI machine learning system how to play chess. With deep learning (DL), all you have to do now is feed a DL model all the games you have and it learns how to play chess well all by itself. No rule making needed.

AI DL training and deployment infrastructure

Frederic described some of the infrastructure and data needs for various phases of an industrial scale, AI DL workflow.

Training deep learning models takes data and the more, the better. Gathering/saving large amounts of data used for DL training is a massive write workload and at the end of that process, hopefully you have PB of data to work with.

Selecting DL training data from all those PBs involves a lot of mixed read and write IO. In the end, you have selected and extracted the data to use to train your DL models.

During DL training, IO needs are all about heavy data read throughput. But there’s more. In the latter half of the talk, Frederic talked about the need to keep expensive GPU cores busy, which requires sophisticated caching or Tier 0 storage supporting low latency IO.

Ray’s been doing a lot of blogging and other work on AI machine and deep learning (e.g., see Learning machine learning – parts 1, 2, & 3) so it was great to hear from Frederic, a real practitioner of the art. Frederic (with some of Ray’s help) explained the deep learning training process. But it wasn’t detailed enough for Howard, so per Howard’s request, we went deeper into how it really works.

Once you have a DL model trained and working within specifications (e.g., prediction accuracy), Frederic said deploying DL models into production involves creating two separate clusters. One devoted to deep learning model inferencing, which takes in data from the world and performs inferencing (prediction, classification, interpretations, etc.) and the other uses that information for model adaption to fine tune DL models for specific instances.

Adaption and inferencing were both read and write IO workloads, and the performance of this IO was dependent on a specific model’s use.

Model adaption would personalize model predictions for each and every person, car, genotype, etc. This would be done periodically (based on SLAs, e.g. every 4 hrs). After that, a new, adapted model could be introduced into production, adapted for that specific person/car/genotype.

If the adaption applied more generally, that data and its human-machine validated/vetted prediction, classification, interpretation, etc. would be added back into the DL model training set to be used the next time a full model training pass was to be done. Frederic said AI DL model training is never done.

Sometime later, all this DL training, production and adaption data needs to be archived for long term access.

We then discussed the recent offerings from NVIDIA and major storage vendors that package up a solution for AI deep learning. It seems we are seeing another iteration of Converged Infrastructure, only this time for AI DL.

Finally, over the course of Ray’s AI DL education, he had come to the belief that AI deep learning could be applied by anyone. Frederic corrected Ray stating that AI deep learning should be applied by anyone.

The podcast runs ~44 minutes. Frederic’s been an old friend of Howard’s and Ray’s, since before the last podcast. He’s one of the few people in the world that the GreyBeards know who has real world experience in deploying AI DL at industrial scale. Frederic’s easy to talk with and very knowledgeable about the intersection of AI DL and IT infrastructure. Howard and I had fun talking with him again on this episode. Listen to the podcast to learn more.

Frederic Van Haren

Frederic Van Haren is the Chief Technology Officer @ HighFens. He has over 20 years of experience in high tech and is known for his insights in HPC, Big Data and AI from his hands-on experience leading research and development teams. He has provided technical leadership and strategic direction in the Telecom and Speech markets.

He spent more than a decade at Nuance Communications building large HPC and AI environments from the ground up and is frequently invited to speak at events to provide his vision on the HPC, AI, and storage markets. Frederic has also served as the president of a variety of technology user groups promoting the use of innovative technology.

As an engineer, he enjoys working directly with engineering teams from technology vendors and on challenging customer projects.

Frederic lives in Massachusetts, USA but grew up in the northern part of Belgium where he received his Masters in Electrical Engineering, Electronics and Automation.

78: GreyBeards YE2018 IT industry wrap-up podcast

In this, our year-end industry wrap-up episode, we discuss trends and technology impacting the IT industry in 2018 and what we can see ahead for 2019. First up is NVMeoF.

NVMeoF has matured

In prior years, NVMeoF was coming from startups, but last year it was major vendors like IBM FlashSystem, Dell EMC PowerMax and NetApp AFF releasing new NVMeoF storage systems. Pure Storage was arguably earliest with their NVMeoF JBOF.

Dell EMC, IBM and NetApp were not far behind this curve and no doubt see it as an easy way to reduce response time without having to rip and replace enterprise fabric infrastructure.

In addition, NVMeoF standards have finally started to stabilize. With the gang of startups, standards weren’t as much of an issue as they were more than willing to lead, ahead of standards. But major storage vendors prefer to follow behind standards committees.

As another example, VMware showed off an NVMeoF JBOF for vSAN. A JBOF like this improves vSAN storage efficiency for small clusters. Howard described how this works: with vSAN having direct access to shared storage, it can reduce the data and server protection requirements for storage. This matters especially for the small clusters of servers that are becoming more popular these days for hosting application clusters.

The other thing about NVMeoF storage is that NVMe SSDs have also become very popular. We are seeing them come out in everyone’s servers and storage systems. Servers (and storage systems) hosting 24 NVMe SSDs are just not that unusual anymore. For the price of a PCIe switch, one can have blazingly fast, direct access to TBs of NVMe SSD storage.

HCI reaches critical mass

HCI has also moved out of the shadows. We recently heard news that HCI is outselling CI. Howard and I attribute this to the advances made in VMware’s vSAN 6.2 and the appliance-ification of HCI. That and, we suppose, NVMe SSDs (see above).

HCI makes an awful lot of sense for the application clusters that VMware is touting these days. CI was easy, but an HCI appliance cluster is much simpler to deploy and manage.

For VMware HCI, vSAN Ready Nodes are available from just about any server vendor in existence. With ready nodes, VARs and distributors can offer an HCI appliance in the channel, just like the majors. Yes, it’s not the same as a vendor supplied appliance, doesn’t have the same level of software or service integration, but it’s enough.

[If you want to learn more, Howard is doing a series of deep dive webinars/classes on HCI as part of his friend Ivan’s ipSpace.net. The 1st 2hr session was recorded 11 December, part 2 goes live 22 January, and the final installment on 5 February. The 1st session is available on demand to subscribers. Sign up here.]

Computational storage finally makes sense

Howard and I 1st saw computational storage at FMS18 and we did a podcast with Scott Shadley of NGD systems. Computational storage is an SSD with spare ARM cores and DRAM that can be used to run any storage intensive, Linux application or Docker container.

Because it’s running in the SSD, it has lightning-fast (even faster than NVMe) access to all the data on the SSD. Indeed, with 10s to 1000s of computational storage SSDs in a rack, each with multiple ARM cores, you can have many 1000s of cores available to perform your data intensive processing. Almost like GPUs, only for IO access to storage (SPUs?).
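To put rough numbers on that core count claim, here is a back of the envelope sketch. The figure of 4 ARM cores per SSD is purely an assumption for illustration and is not a specification of any vendor's drive.

```python
def rack_cores(ssd_count: int, cores_per_ssd: int) -> int:
    """Total embedded cores across all computational storage SSDs in a rack."""
    return ssd_count * cores_per_ssd


# Hypothetical value: 4 spare ARM cores per SSD (an assumption, not a spec).
ASSUMED_CORES_PER_SSD = 4

if __name__ == "__main__":
    for ssds in (10, 100, 1000):
        total = rack_cores(ssds, ASSUMED_CORES_PER_SSD)
        print(f"{ssds:>5} SSDs -> {total:>5} cores")
```

Under that assumption, it takes a rack at the 1000 SSD end of the range to reach 1000s of cores, and each core sits right next to the data it processes.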

We tried this at one vendor in the 90s, executing some database and backup services outboard but it never took off. Then in the last couple of years (Dell) EMC had some VM services that you could run on their midrange systems. But that didn’t seem to take off either.

The computational storage we’ve seen all runs Linux. And with data intensive applications coming from everywhere these days, and all the spare processing power in SSDs, it might finally make sense.

Futures

Finally, we turned to what we see coming in 2019. Howard was at an Intel Analyst event where they discussed Optane DIMMs. Our last podcast of 2018 was with Brian Bulkowski of Aerospike who discussed what Optane DIMMs will mean for high performance database systems and just about any memory intensive server application. For example, affordable, 6TB memory servers will be coming out shortly. What you can do with 6TB of memory is another question….

Howard Marks, Founder and Chief Scientist, DeepStorage

Howard Marks is the Founder and Chief Scientist of DeepStorage, a prominent blogger at Deep Storage Blog and can be found on twitter @DeepStorageNet.

Raymond Lucchesi, Founder and President, Silverton Consulting

Ray Lucchesi is the President and Founder of Silverton Consulting, a prominent blogger at RayOnStorage.com, and can be found on twitter @RayLucchesi. Sign up for SCI’s free, monthly e-newsletter here.

73: GreyBeards talk HCI with Gabriel Chapman, Sr. Mgr. Cloud Infrastructure NetApp

Sponsored by: NetApp

In this episode we talk HCI with Gabriel Chapman (@Bacon_Is_King), Senior Manager, Cloud Infrastructure, NetApp. Gabriel presented at the NetApp Insight 2018 TechFieldDay Extra (TFDx) event (video available here). Gabriel also presented last year at the VMworld 2017 TFDx event (video available here). If you get a chance, we encourage you to watch the videos, as Gabriel did a great job providing some design intent and descriptions of NetApp HCI capabilities. Our podcast was recorded after the TFDx event.

NetApp HCI consists of NetApp SolidFire storage re-configured as a small enterprise class AFA storage node occupying one blade of a four blade system, where the other three blades are dedicated compute servers. NetApp HCI runs VMware vSphere but uses enterprise class iSCSI storage supplied by the NetApp SolidFire AFA.

On our podcast, we talked a bit about SolidFire storage. It’s not well known, but the 1st few releases of SolidFire (before the NetApp acquisition) didn’t have a GUI and were entirely dependent on the API/CLI for operations. That heritage continues today, as the NetApp HCI management console is basically a front end GUI for NetApp HCI API calls.

Another advantage of SolidFire storage was its extensive QoS support, which included state of the art service credits as well as service limits. All that QoS sophistication is also available in NetApp HCI, so that customers can more effectively limit noisy neighbor interference on HCI storage.
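To show how service credits and service limits can work together, here is a minimal sketch of a credit based IOPS limiter. It is an illustrative model only and not NetApp's implementation; the class VolumeQoS and its fields max_iops, burst_iops and credit_cap are hypothetical names. A volume that runs below its limit banks the unused headroom as credits, and can later spend them to burst above the limit for a while.

```python
from dataclasses import dataclass


@dataclass
class VolumeQoS:
    max_iops: int      # sustained service limit per one-second interval
    burst_iops: int    # ceiling when spending credits (assumed >= max_iops)
    credit_cap: int    # most credits that can be banked
    credits: int = 0   # credits currently banked

    def admit(self, requested: int) -> int:
        """Return how many IOs are admitted in this one-second interval."""
        if requested <= self.max_iops:
            # Under the limit: bank the unused headroom as credits.
            unused = self.max_iops - requested
            self.credits = min(self.credit_cap, self.credits + unused)
            return requested
        # Over the limit: spend credits, but never exceed burst_iops.
        wanted_extra = min(requested, self.burst_iops) - self.max_iops
        extra = min(wanted_extra, self.credits)
        self.credits -= extra
        return self.max_iops + extra


if __name__ == "__main__":
    vol = VolumeQoS(max_iops=1000, burst_iops=1500, credit_cap=2000)
    for demand in (400, 2000, 2000, 2000):
        print(f"demand={demand} admitted={vol.admit(demand)} credits={vol.credits}")
```

In this model a noisy neighbor can never exceed its burst ceiling, and once its credits run out it is held to the sustained limit, which is what keeps it from starving other volumes.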

Although NetApp HCI runs VMware vSphere as its preferred hypervisor, it’s also possible to run other hypervisors in bare metal clusters with NetApp HCI storage and compute servers. In contrast to other HCI solutions, with NetApp HCI, customers can run different hypervisors, all at the same time, sharing access to NetApp HCI storage.

On our podcast and the Insight TFDx talk, Gabriel mentioned some future deliveries and roadmap items such as:

  • Extending NetApp HCI hardware with a new low-end, 2U configuration designed specifically for RoBo and SMB customers;
  • Adding NetApp Cloud Volume support so that customers can extend their data fabric out to NetApp HCI; and
  • Adding (NFS) file services support so that customers using NFS data stores/VVols could take advantage of NetApp HCI storage.

Another thing we discussed was the new HCI development cadence. In the past they typically delivered new functionality about once a year. With the new development cycle they’re able to deliver functionality much faster, but they have settled on a 2 releases/year cycle, which seems to be about as quickly as their customer base can adopt new functionality.

The podcast runs ~22 minutes. We apologize for any quality issues with the audio. It was recorded at the show and we were novices with the onsite recording technology. We promise to do better in the future. Gabriel has almost become a TFDx regular these days and provides a lot of insight on both NetApp HCI and SolidFire storage. Listen to our podcast to learn more.

Gabriel Chapman, Senior Manager, Cloud Infrastructure, NetApp

Gabriel is the Senior Manager for NetApp HCI Go to Market. Today he is mainly engaged with NetApp’s top tier customers and partners with a primary focus on Hyper Converged Infrastructure for the Next Generation Data Center.

As a 7 time vExpert that transitioned into the vendor side after spending 15 years working in the end user Information Technology arena, Gabriel specializes in storage and virtualization technologies. Today his primary area of expertise revolves around storage, data center virtualization, hyper-converged infrastructure, rack scale/hyper scale computing, cloud, DevOps, and enterprise infrastructure design.

Gabriel is a Prime Mover, Technologist, Unapologetic Randian, Social Media Junky, Writer, Bacon Lover, and Deep Thinker, whose goal is to speak truth on technology and make complex ideas sound simple. In his free time, Gabriel is the host of the In Tech We Trust podcast and enjoys blogging as well as public speaking.

Prior to joining SolidFire, Gabriel was a storage technologies specialist covering the United States with Cisco, focused on the Global Service Provider customer base. Before Cisco, he was part of the go-to-market team at SimpliVity, where he concentrated on crafting the customer facing messaging, pre-sales engagement, and evangelism efforts for the early adopters of Hyper Converged Infrastructure.

69: GreyBeards talk HCI with Lee Caswell, VP Products, Storage & Availability, VMware

Sponsored by:

For this episode we preview VMworld by talking with Lee Caswell (@LeeCaswell), Vice President of Product, Storage and Availability, VMware.

This is the third time Lee’s been on our show; the previous one was back in August of last year. Lee’s been at VMware for a couple of years now and, among other things, is leading the HCI journey at VMware.

The first topic we discussed was VMware’s expanded HCI software defined data center (SDDC) solution, which now includes compute, storage, networking and enhanced operations with alerts/monitoring/automation that ties it all together.

We asked Lee to explain VMware’s SDDC:

  • HCI operates at the edge – with ROBO-2-server environments, VMware’s HCI can be deployed in a closet and remotely operated by a VI from the central site.
  • HCI operates in the data center – with vSphere-vSAN-NSX-vRealize and other software, VMware modernizes data centers for the pace of digital business.
  • HCI operates in the public Cloud – with VMware Cloud (VMC) on AWS, IBM Cloud and over 400 service providers, VMware HCI also operates in the public cloud.
  • HCI operates for containers and cloud native apps – with support for containers under vSphere, vSAN and NSX, developers are finding VMware HCI an easy option to run container apps in the data center, at the edge, and in the public cloud.

The importance of the edge will become inescapable, as 50B edge connected devices power IoT by 2020. Lee heard Pat saying compute processing is moving to the edge because of 3 laws:

  1. the law of physics, light/information only travels so fast;
  2. the law of economics, doing all processing at central sites would take too much bandwidth and cost; and
  3. the law(s) of the land, data sovereignty and control is ever more critical in today’s world.
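The first of these laws is easy to quantify. Here is a small sketch that estimates the best case network round trip between an edge site and a central site. It assumes light in optical fiber travels at roughly two thirds of its vacuum speed, and it ignores switching, queuing and protocol overhead, so real latencies will be higher. The function name round_trip_ms and the sample distances are illustrative.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum, km/s
FIBER_FACTOR = 2 / 3               # assumption: typical slowdown in optical fiber


def round_trip_ms(distance_km: float) -> float:
    """Best case round-trip propagation delay over fiber, in milliseconds."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    for km in (10, 100, 1000, 5000):
        print(f"{km:>5} km -> {round_trip_ms(km):6.2f} ms round trip")
```

A site 1000 km away pays roughly 10 ms per round trip before any processing happens, which is why latency sensitive work tends to move toward the edge.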

VMware SDDC is a full stack option that executes just about anywhere the data center wants to go. Howard mentioned that one customer he talked with at FMS18 just wanted to take their 16 node VMware HCI rack and clone it forever, to supply infinite infrastructure.

Next, we turned our discussion to Virtual Volumes (VVols). Recently VMware added replication support for VVols. Lee said VMware intends to provide an SRM SRA for VVols. But the real question is why there hasn’t been higher VVol adoption in the field. We concluded it takes time.

VVols wasn’t available in vSphere 5.5 and nowadays, three or more years have to go by before a significant amount of the field moves to a new release. Howard also said early storage systems didn’t implement VVols right. Moreover, VMware vSphere 5.5 is just now (9/16/18) going EoGS.

Lee said 70% of all current vSAN deployments are AFA. With AFA, hand tuning storage performance is no longer something admins need to worry about. It used to be we all spent time defragging/compressing data to squeeze more effective capacity out of storage, but hand capacity optimization like this has become a lost art. Just like capacity, hand tuning AFA performance doesn’t make sense anymore.

We then talked about the coming flash SSD supply glut. Howard sees flash pricing ($/GB) dropping by 40-50%, regardless of interface. This should drive AFA shipments above 70%, as long as the glut continues.

The podcast runs ~21 minutes. Lee’s always great to talk with and is very knowledgeable about the IT industry, HCI in general, and of course, VMware HCI in particular. Listen to the podcast to learn more.

Lee Caswell, V.P. of Product, Storage & Availability, VMware

Lee Caswell leads the VMware storage marketing team driving vSAN products, partnerships, and integrations. Lee joined VMware in 2016 and has extensive experience in executive leadership within the storage, flash and virtualization markets.

Prior to VMware, Lee was vice president of Marketing at NetApp and vice president of Solution Marketing at Fusion-IO. Lee was a founding member of Pivot3, a company widely considered to be the founder of hyper-converged systems, where he served as the CEO and CMO. Earlier in his career, Lee held marketing leadership positions at Adaptec, and SEEQ Technology, a pioneer in non-volatile memory. He started his career at General Electric in Corporate Consulting.

Lee holds a bachelor of arts degree in economics from Carleton College and a master of business administration degree from Dartmouth College. Lee is a New York native and has lived in northern California for many years. He and his wife live in Palo Alto and have two children. In his spare time Lee enjoys cycling, playing guitar, and hiking the local hills.