72: GreyBeards talk Computational Storage with Scott Shadley, VP Marketing NGD Systems

For this episode the GreyBeards talked with another old friend, Scott Shadley, VP Marketing, NGD Systems. As we discussed on our FMS18 wrap up show with Jim Handy, computational storage had sort of a coming out party at the show.

NGD Systems started in 2013 and has been working towards a solution that goes to general availability at the end of this year. Their computational storage SSD supplies general purpose processing power sitting inside an SSD. NGD shipped their first prototypes in 2016, shipped an FPGA version of their smart SSD in 2017 and already have their field-upgradable ASIC prototypes in customer hands.

NGD’s smart SSDs have a 4-core ARM processor and run an Ubuntu distro on 3 of the cores. Essentially, anything that could be run on Ubuntu Linux, including Docker containers and Kubernetes, could be run on their smart SSDs.

NGD sells standard (storage only) SSDs as well as their smart SSDs. The smart hardware is shipped with all of their SSDs, but is only enabled after customers purchase a software license key. They currently offer their smart SSD solutions in America and Europe, with APAC coming later.

They offer smart SSDs in both 2.5” and M.2 form factors. NGD Systems is following the flash technology road map and currently offers a 16TB SSD in the 2.5” form factor.

How applications work on smart SSDs

They offer an open-source SDK which creates a TCP/IP tunnel across the NVMe bus that attaches their smart SSD. This allows the host and the SSD server to communicate and send (RPC) work back and forth between them.

A normal smart SSD work flow could be:

  1. Host server writes data onto the smart SSD;
  2. Host signals the smart SSD to perform work on the data it now holds;
  3. Smart SSD processes the data that has been sent to the SSD; and
  4. When smart SSD work is done, it sends a response back to the host.

I assume somewhere before #2 above, you load application software onto the device.
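
The four steps above, plus the application-load step, can be sketched as a toy simulation in Python. This is not NGD's SDK: the `SmartSSD` class, its method names and the `count_words` job are all hypothetical, and an in-memory dictionary stands in for both the flash and the TCP/IP tunnel over NVMe.

```python
from typing import Callable, Dict


class SmartSSD:
    """Toy stand-in for a computational storage drive (hypothetical API)."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}                    # simulated flash
        self._apps: Dict[str, Callable[[bytes], object]] = {}  # loaded applications

    def load_app(self, name: str, func: Callable[[bytes], object]) -> None:
        # Application software is installed on the drive before work is requested.
        self._apps[name] = func

    def write(self, key: str, data: bytes) -> None:
        # Step 1: the host writes data onto the drive.
        self._blocks[key] = data

    def run(self, app: str, key: str) -> object:
        # Steps 2-4: the host signals the drive, the drive processes data it
        # already holds, and only the (small) result travels back to the host.
        return self._apps[app](self._blocks[key])


def count_words(data: bytes) -> int:
    # Example job that would run on the drive's own processor.
    return len(data.split())


ssd = SmartSSD()
ssd.load_app("count_words", count_words)
ssd.write("doc1", b"computational storage moves compute to the data")
result = ssd.run("count_words", "doc1")
print(result)
```

The point of the sketch is the data flow: the bulk data crosses the bus once (on the write), and only the job's result comes back.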

The work to be done could be the same on every attached smart SSD, and could easily be distributed across all attached smart SSDs and the host processor. For example, for image processing, a host processor would write images to be processed across all the SSDs and have each perform image recognition, append tags (or other results info) as metadata onto each image and then respond back to the host. Or for media transcoding, video streams could be written to a smart SSD and have it perform transcoding completely outboard.
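
The image-tagging example amounts to a scatter/gather pattern, which a minimal Python sketch can illustrate. Everything here is an assumption for illustration: the drives are plain dictionaries, `tag_image` is a placeholder for real image recognition, and a thread pool stands in for the drives working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def tag_image(image: bytes) -> str:
    # Placeholder for on-drive image recognition; the "tag" is just a size class.
    return "large" if len(image) > 4 else "small"


def process_on_drive(images):
    # Each simulated drive tags only the images stored on it.
    return {name: tag_image(data) for name, data in images.items()}


def scatter(images, num_drives):
    # Round-robin placement of images across the simulated drives.
    drives = [dict() for _ in range(num_drives)]
    for i, (name, data) in enumerate(sorted(images.items())):
        drives[i % num_drives][name] = data
    return drives


images = {"a.jpg": b"12345678", "b.jpg": b"12", "c.jpg": b"123456", "d.jpg": b"1"}
drives = scatter(images, num_drives=2)

# Every drive works on its own share at the same time.
with ThreadPoolExecutor(max_workers=len(drives)) as pool:
    partial_results = list(pool.map(process_on_drive, drives))

# The host gathers the small per-drive results.
tags = {}
for partial in partial_results:
    tags.update(partial)
print(tags)
```

Adding drives adds both capacity and processing elements, which is why this kind of embarrassingly parallel work suits computational storage.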

The smart SSD processors access the data just like the host processor or could use services available in their SDK which would access the data much faster. Just about any data processing you could do on the host processor could be done outboard, on smart SSD processor elements. Scott mentioned that memory intensive applications are probably not a good fit for computational storage.

He also said that their processing (ARM) elements were specifically designed for low power operations. So although AI training and inference processing might be much faster on GPUs, their power consumption was much higher. As a result, AI training and inference processing power-performance would be better on smart SSDs.

Markets for smart SSDs?

One target market for NGD’s computational storage SSDs is hyperscalers. At FMS18, Microsoft Research published a report on running FAISS software on NGD smart SSDs that led to a significant speedup. Scott also brought up one company they’re working with that was testing to find out just how many 4K video streams can be processed on a gaggle of smart SSDs. There was also talk of three letter (gov’t) organizations interested in smart SSDs to encrypt data and perform other outboard processing of (intelligence) data.

Highly distributed applications and data remind me of a lot of HPC customers I know. But bandwidth is also a major concern for HPC. NVMe is fast, but there’s a limit to how many SSDs can be attached to a server.

However, with NVMeoF, NGD Systems could support a lot more “attached” smart SSDs. Imagine a scoop of smart SSDs, all attached to a slurp of servers, performing data intensive applications on their processing elements in a widely distributed fashion. Sounds like HPC to me.

The podcast runs ~39 minutes. Scott’s great to talk with and is very knowledgeable about the Flash/SSD industry and NGD Systems. His talk on their computational storage was mind expanding. Listen to the podcast to learn more.

Scott Shadley, VP Marketing, NGD Systems

Scott Shadley, Storage Technologist and VP of Marketing at NGD Systems, has more than 20 years of experience with Storage and Semiconductor technology. Working at STEC he was part of the team that enabled and created the world’s first Enterprise SSDs.

He spent 17 years at Micron, most recently leading the SATA SSD product line with record-breaking revenue and growth for the company. He is active on social media, a lover of all things High Tech, enjoys educating and sharing and a self-proclaimed geek around mobile technologies.

56: GreyBeards talk high performance file storage with Liran Zvibel, CEO & Co-Founder, WekaIO

This month we talk high performance, cluster file systems with Liran Zvibel (@liranzvibel), CEO and Co-Founder of WekaIO, a new software defined, scale-out file system. I first heard of WekaIO when it showed up on SPEC sfs2014 with a new SWBUILD benchmark submission. They had a 60 node EC2-AWS cluster running the benchmark and achieved, at the time, the highest SWBUILD number (500) of any solution.

At the moment, WekaIO are targeting HPC and Media & Entertainment verticals for their solution and it is sold on an annual capacity subscription basis.

By the way, a Wekabyte is 2**100 bytes of storage, or ~1 trillion exabytes (an exabyte being 2**60 bytes).
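
The arithmetic is easy to check with Python's arbitrary-precision integers (the variable names are just for illustration):

```python
wekabyte = 2**100   # bytes
exabyte = 2**60     # bytes

ratio = wekabyte // exabyte
print(ratio == 2**40)   # a Wekabyte is 2**40 exabytes
print(f"{ratio:,}")     # a little over one trillion
```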

High performance file storage

The challenge with HPC file systems is that they need to handle a large number of files and large amounts of storage, with high throughput access to all this data. Where WekaIO comes into the picture is that they do all that plus can support high file IOPS. That is, they can open, read or write a high number of relatively small files at an impressive speed, with low latency. These workloads are becoming more popular with AI-machine learning and life sciences/genomic microscopy image processing.

Most file system developers will tell you that they can supply high throughput OR high file IOPS, but doing both is a real challenge. WekaIO is able to do both while at the same time supporting billions of files per directory and trillions of files in a file system.

WekaIO has support for up to 64K cluster nodes and has tested up to 4000 cluster nodes. WekaIO announced an OEM agreement with HPE last year and is starting to build out bigger clusters.

Media & Entertainment file storage requirements are mostly just high throughput with large (media) file sizes. Here WekaIO has more competition from other cluster file systems, but their ability to support extra-large data repositories with great throughput is another advantage.

WekaIO cluster file system

WekaIO is a software defined storage solution. Whereas many HPC cluster file systems have separate metadata and storage nodes, WekaIO’s cluster nodes are combined metadata and storage nodes. So as one scales capacity (by adding nodes), one not only scales large file throughput (via more IO parallelism) but also scales small file IOPS (via more metadata processing capabilities). There’s also some secret sauce to their metadata sharding (if that’s the right word) that allows WekaIO to support more metadata activity as the cluster grows.

One secret to WekaIO’s ability to support both high throughput and high file IOPS lies in their performance load balancing across the cluster. Apparently, WekaIO can be configured to constantly monitor all cluster nodes for performance and can balance all file IO activity (data transfers and metadata services) across the cluster, to ensure that no one node is overburdened with IO.

Liran says that performance load balancing was one reason they were so successful with their EC2 AWS SPEC sfs2014 SWBUILD benchmark. One problem with AWS EC2 nodes is that node performance is quite unpredictable. When running EC2 instances, “noisy neighbors” impact node performance. With WekaIO’s performance load balancing running on AWS EC2 node instances, they can just redirect IO activity around slower nodes to faster nodes that can handle the work, in real time.

WekaIO performance load balancing is a configurable option. The other alternative is for WekaIO to “cryptographically” spread the workload across all the nodes in a cluster.
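
One plausible reading of “cryptographically” spreading the workload is hash-based placement, where a hash of the file name picks the node. That reading is an assumption, not a description of WekaIO's implementation; the `place` function, path names and node count below are hypothetical.

```python
import hashlib
from collections import Counter


def place(path: str, num_nodes: int) -> int:
    # Hash the path and map the digest onto a node index. The same path always
    # lands on the same node, and distinct paths spread out evenly.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


num_nodes = 8
paths = [f"/data/file{i:05d}.dat" for i in range(10_000)]
counts = Counter(place(p, num_nodes) for p in paths)

for node in sorted(counts):
    print(node, counts[node])
```

Hashing gives an even spread without any monitoring, but it is blind to how busy or slow each node currently is, which is the gap that performance load balancing fills.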

WekaIO uses a host driver for POSIX access to the cluster. WekaIO’s frontend also natively supports (without a host driver) NFSv3, SMB3.1, HDFS and AWS S3 protocols.

WekaIO also offers configurable file system data protection that can span 100s of failure domains (racks) supporting from 4 to 16 data stripes with 2 to 4 parity stripes. Liran said this was erasure code like but wouldn’t specifically state what they are doing differently.
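
To get a feel for what those stripe counts mean for capacity, here is a small calculation. It assumes all stripes are the same size and that parity behaves like conventional erasure coding (N parity stripes tolerate N lost stripes); Liran did not confirm either point, so treat the numbers as illustrative.

```python
def usable_fraction(data_stripes: int, parity_stripes: int) -> float:
    # Share of raw capacity left for user data in a data+parity layout.
    return data_stripes / (data_stripes + parity_stripes)


for data, parity in [(4, 2), (16, 2), (16, 4)]:
    frac = usable_fraction(data, parity)
    print(f"{data}+{parity}: {frac:.1%} usable, assumed to survive {parity} failures")
```

Under these assumptions a 4+2 layout leaves two thirds of raw capacity usable, while the wider 16+2 and 16+4 layouts leave about 89% and 80% respectively.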

They also support high performance storage and inactive storage with automated tiering of inactive data to object storage through policy management.

WekaIO creates a global name space across the cluster, which can be sub-divided into one to thousands of file systems.

Snapshotting, cloning & moving work

WekaIO also has file system snapshots (read-only) and clones (read-write) using a redirect-on-write methodology. After the first snapshot/clone, subsequent snapshots/clones are only differential copies.
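
Redirect-on-write in general can be illustrated with a toy block map. This is a generic sketch of the technique, not WekaIO's code; the `Volume` class and its methods are hypothetical.

```python
class Volume:
    """Toy redirect-on-write volume: snapshots share blocks until overwritten."""

    def __init__(self):
        self._store = {}      # physical location -> data
        self._next_loc = 0
        self.block_map = {}   # logical block number -> physical location

    def write(self, block_no, data):
        # New data always goes to a fresh location. The old location is left
        # untouched, so any snapshot still pointing at it stays valid.
        loc = self._next_loc
        self._next_loc += 1
        self._store[loc] = data
        self.block_map[block_no] = loc

    def snapshot(self):
        # A snapshot is only a copy of the block map, not of the data.
        return dict(self.block_map)

    def read(self, block_no, block_map=None):
        mapping = self.block_map if block_map is None else block_map
        return self._store[mapping[block_no]]


vol = Volume()
vol.write(0, b"v1-block0")
vol.write(1, b"v1-block1")
snap = vol.snapshot()
vol.write(0, b"v2-block0")   # redirected to a new location

print(vol.read(0))           # live volume sees the new data
print(vol.read(0, snap))     # snapshot still sees the old data
```

Block 1 was never overwritten, so the live volume and the snapshot point at the same physical location for it, which is why later snapshots only cost the changed blocks.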

Another feature Howard and I thought was interesting was their DR as a Service like capability. This involves using an onprem WekaIO cluster to clone a file system/directory and tiering that clone to an S3 storage object, then using an AWS EC2 WekaIO cluster to import the object(s) and reconstitute that file system/directory in the cloud. Once on AWS, work can occur in the cloud and the process can be reversed to move any updates back to the onprem cluster.

This way if you had work needing more compute than available onprem, you could move the data and workload to AWS, do the work there and then move the data back down to onprem again.

WekaIO’s RtOS, network stack, & NVMeoF

WekaIO runs under Linux as a user space application. WekaIO has implemented their own Realtime O/S (RtOS) and high performance network stack that runs in user space.

With their own network stack they have also implemented NVMeoF support for (non-RDMA) Ethernet as well as InfiniBand networks. This is probably another reason they can have such low latency file IO operations.

The podcast runs ~42 minutes. Liran has been around data storage systems for 20 years and as a result was very knowledgeable and interesting to talk with. Liran almost qualifies as a Greybeard, if not for the fact that he was clean shaven ;/. Listen to the podcast to learn more.

Liran Zvibel, CEO and Co-Founder, WekaIO

As Co-Founder and CEO, Mr. Liran Zvibel guides long term vision and strategy at WekaIO. Prior to creating the opportunity at WekaIO, he ran engineering at social startup and Fortune 100 organizations including Fusic, where he managed product definition, design and development for a portfolio of rich social media applications.

Liran also held principal architectural responsibilities for the hardware platform, clustering infrastructure and overall systems integration for XIV Storage System, acquired by IBM in 2007.

Mr. Zvibel holds a BSc in Mathematics and Computer Science from Tel Aviv University.

53: GreyBeards talk MAMR and future disk with Lenny Sharp, Sr. Dir. Product Management, WDC

This month we talk new disk technology with Lenny Sharp, Senior Director of Product Management, responsible for enterprise disk with Western Digital Corp. (WDC). WDC recently announced their future disk offerings will be based on a new disk recording technology, called MAMR or microwave assisted magnetic recording.

Over the last decade or so the disk industry has been investing in HAMR or heat assisted magnetic recording as the next recording innovation. So, MAMR is a significant departure but appears well worth it.

WDC is arguably the leading supplier of HDD and one of the leading SSD suppliers to the industry today. Any departure from industry technology roadmaps for WDC is big news.

WDC is banking on MAMR technology to continue to offer capacity disk (for big data) at prices that are 10X below the price of flash storage for the foreseeable future. If they and the rest of the disk industry can deliver on that promise then there should be a substantial market for capacity disk for the next decade or so.

What’s MAMR?

HAMR uses lasers to heat up the media spot being recorded. This boost in energy reduces the magnetic threshold of the grains inside the media and allows them to be written or to change state. Once that energy is removed, the data state on the media persists and can be read multiple times without error.

MAMR uses microwaves to add similar energy to the spot being written on disk media. MAMR doesn’t actually heat up the spot with microwaves, but it does add electromagnetic energy to the spot being written, which has the same effect of reducing the threshold for writing the media. I wrote a recent blog post about MAMR describing the technology in more detail.

HAMR heats the media spot to somewhere between 400C and 700C, which potentially reduces disk reliability. MAMR, because it doesn’t heat the disk any more than normal operations do, should not impact disk reliability.

Also, MAMR can use pretty much the same disk substrate used in enterprise disks today and can be fabricated on much the same manufacturing lines used for PMR (perpendicular magnetic recording) heads today.

Disk densities

MAMR should allow the industry to get to ~4.5Tb/sqin. Current PMR technology will probably max out at 1.0 to 1.3Tb/sqin. PMR density growth has flatlined (6-7% per year) recently, but MAMR should put the disk industry back on 15% density growth per year. The new MAMR disks will be sampling for enterprise customers in 2018 and in production by 2019.
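
As a rough sanity check on those growth rates, compound growth says how long it would take to go from the top of the PMR range to the MAMR target. This is back-of-the-envelope arithmetic only; it assumes steady annual growth starting at 1.3Tb/sqin, which is a simplification rather than anything WDC stated.

```python
import math


def years_to_reach(start: float, target: float, annual_growth: float) -> float:
    # Solve start * (1 + g) ** n = target for n.
    return math.log(target / start) / math.log(1 + annual_growth)


pmr_ceiling = 1.3   # Tb/sqin, top of the quoted PMR range
mamr_target = 4.5   # Tb/sqin

fast = years_to_reach(pmr_ceiling, mamr_target, 0.15)
slow = years_to_reach(pmr_ceiling, mamr_target, 0.07)
print(f"at 15%/year: {fast:.1f} years, at 7%/year: {slow:.1f} years")
```

At 15% a year the gap closes in roughly nine years, versus roughly eighteen at the recent 7% rate, which is why the difference in growth rate matters so much.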

As for how far MAMR will take disk, WDC said we can expect a 40TB disk device (using multiple platters) by 2025 and Lenny said perhaps double that eventually.

We ended our discussion with Lenny on WDC and other disk vendor moves outside of the device level. Over time, IT use of disks has changed and the disk vendors seem to believe the best way to address this transition is to look beyond disk/SSD devices and towards manufacturing storage shelves and potentially even systems!? We’ll need to wait and see the dust settle on these moves.

The podcast runs ~45 minutes. Lenny was very knowledgeable about current and future disk technology and seems to have been around the disk industry forever. He’s got an insider’s view of disk technology, IT’s use of disk and storage market dynamics. Both Howard and I enjoyed our time with him. Listen to the podcast to learn more.

Lenny Sharp, Sr. Dir. Product Management, WDC

Lenny Sharp serves as Western Digital’s Sr. Director of Enterprise HDD product line management and planning. He has over 30 years of experience in high technology and storage. Sharp joined HGST in 2009, initially responsible for enterprise SSD.
He has also managed client HDD and spent four years in Japan, working closely with the development team and APAC customers.
Previously, he was responsible for managing systems, software, storage and semiconductors for companies including Dell, Philips, Western Digital and Maxtor (since acquired by Seagate).

33: GreyBeards talk HPC storage with Frederic Van Haren, founder HighFens & former Sr. Director of HPC at Nuance

In episode 33 we talk with Frederic Van Haren (@fvha), founder of HighFens, Inc. (@HighFens), a new HPC consultancy, and former Senior Director of HPC at Nuance Communications. Howard and I got a chance to talk with Frederic at a recent HPE storage deep dive event. I met up with him again during SFD10, where he was talking on behalf of Kaminario, and he was also at the HPE Discover conference last week.

Nuance is the backend speech recognition engine for a number of popular service offerings. Nuance looks very similar to a lot of other hyper-scale customers and ultimately, we feel, may be the way of the future for all IT over the coming decades. Nuance’s data storage journey during Frederic’s tenure with the company holds many lessons for all of us in the storage industry.

Nuance currently has ~6PB usable (~16PB raw) of speech wave files as well as uncountable text and other files, all inside IBM Spectrum Scale (GPFS). They have both lots of big files and lots of small files. These days, Spectrum Scale is processing 2-3M files/second. They have doubled capacity for each of the last 9 years, and today handle a billion new files a month. GPFS stripes data across storage and provides data protection, migration, snapshotting and storage tiering across a diverse mix of storage. At the end of the podcast we discussed some open source alternatives to Spectrum Scale, but at the time Nuance started down this path, GPFS was found to be the only thing that could do the job. This proved to be a great solution, as they have completely swapped out the underlying storage at least 3 times and all their users were none the wiser.

The first storage that Frederic talked about was Coraid (no longer in business) and their ATA over Ethernet storage solution. This used SuperMicro hardware with 24 SATA drives/shelf and they bought 40 shelves. Over time this grew to 1000s of SATA drives and was easily scalable but hard to manage, as it was pretty dumb storage. In fact, they had to deploy video cameras, focused on drive shelves, to detect when drives failed!

Over time, Nuance came to the realization that they had to do something more manageable and brought in HPE MSA storage to replace their Coraid storage. The MSA, which had 96 SAS drives, was a great solution for them. It was able to support both faster “SCRATCH” storage using fast SAS 300GB/15KRPM drives and slower “STATIC” storage using SATA 760GB/7.2KRPM drives, and it was much more manageable than the Coraid solution.

Although MSA storage worked great, after a while Nuance’s sprawling FC environment, which was doubling yearly, caused them to rethink their storage once again. This led them to swap out all their HPE MSA storage for HPE 3PAR to consolidate their FC network and storage footprint.

For metadata, Nuance uses a 76 node Hadoop cluster for sophisticated search queries, as doing an LS on the GPFS file system would take days. Their file metadata is essentially a textual, row by row database and they use queries over the Hadoop cluster to determine things like which files have American English, spoken by females, with 8kHz recording. Not sure when, but eventually Nuance deployed HPE Vertica SQL over Hadoop for their metadata engine and dropped the average query from 12 minutes to 73 sec.(!!)
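
To make the idea of a row-per-file metadata database concrete, here is a small sketch using Python's built-in SQLite as a stand-in for Vertica/Hadoop. The table name, columns and sample rows are all hypothetical; Nuance's real schema was not discussed. The last lines work out the speedup implied by the two query times quoted above.

```python
import sqlite3

# Hypothetical metadata rows: one row per speech wave file.
rows = [
    ("/speech/a.wav", "en-US", "female", 8000),
    ("/speech/b.wav", "en-US", "male", 8000),
    ("/speech/c.wav", "fr-FR", "female", 16000),
    ("/speech/d.wav", "en-US", "female", 16000),
    ("/speech/e.wav", "en-US", "female", 8000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wave_files "
    "(path TEXT, language TEXT, gender TEXT, sample_rate_hz INTEGER)"
)
conn.executemany("INSERT INTO wave_files VALUES (?, ?, ?, ?)", rows)

# "Which files have American English, spoken by females, with 8kHz recording?"
cursor = conn.execute(
    "SELECT path FROM wave_files "
    "WHERE language = ? AND gender = ? AND sample_rate_hz = ? "
    "ORDER BY path",
    ("en-US", "female", 8000),
)
matches = [row[0] for row in cursor]
conn.close()
print(matches)

# Speedup implied by going from a 12 minute to a 73 second average query.
speedup = (12 * 60) / 73
print(f"{speedup:.1f}x faster")
```

Querying the metadata database answers the question without touching the file system at all, which is the whole point when a directory listing would take days.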

Nuance, because of their extreme growth and more open environment to storage innovation, had become a favorite for storage startups and major vendors to do Proofs of Concept (PoC) on new storage offerings. One PoC Nuance did was for Kaminario storage. There is a standard metric that says a CPU core requires so many IOPS, so that when CPU cores increase, you need to supply more IOPS. They went with Kaminario for their test-dev environment and more performance intensive storage. Nuance appreciates Kaminario’s reliability, high availability and highly predictable performance. (See the SFD10 video feed for Frederic’s session)

We talked a bit about how speech recognition’s Hidden Markov Model (HMM) statistical approach was heavily dependent on CPU cores. In the past, if you wanted to do a recognition task, you assigned it to one core and waited until it was done, a serial process dependent on the number of CPU cores you had available. This turned out to be quite a problem, as you had to scale CPU cores if you wanted to do more concurrent speech recognition activities. Then came GPUs and you could do speech recognition work on a GPU core. With the new GPU cards, instead of a server having ~16 CPU cores, you could have a server with multiple graphics cards having 3000 GPU cores. This scaled a lot more easily. Machine learning and deep neural nets have the potential to parallelize this, so that it will scale even better.

In the end, HPC trials, tribulations and ways of doing business are starting to become mainstream. I was recently talking to one vendor that said most HPC groups start out in isolation to support one application, but over time they either subsume corporate IT, get absorbed into corp. IT, or continue to be a standalone group (while waiting until one of the other two happens).

The podcast runs ~41 minutes and covers a lot of ground about one HPC organization’s evolution of their storage environment over time, what was driving some of that evolution and the tools they chose to master it. Listen to the podcast to learn more.

Frederic Van Haren, founder HighFens, Inc.

Frederic Van Haren is the Chief Technology Officer @HighFens and known for his insights in the HPC and storage industry. He has over 20 years of experience in High Tech providing technical leadership and strategic direction in Telecom and Speech markets. Frederic spent the last decade at Nuance Communications building large HPC environments from the ground up. He is frequently invited to speak at events to provide his insights on the HPC and storage markets. He has played leading roles as President of a variety of technology user groups promoting the use of innovative technology. As an Engineer he enjoys working with the engineering teams from technology vendors providing feedback on new and upcoming products.

Frederic lives in Massachusetts, USA but grew up in the northern part of Belgium where he received his Masters in Electrical Engineering, Electronics and Automation.

GreyBeards talk HPC storage with Molly Rector, CMO & EVP, DDN

In our 27th episode we talk with Molly Rector (@MollyRector), CMO & EVP of Product Management/Worldwide Marketing for DDN. Howard and I have known Molly since her days at Spectra Logic. Molly is also on the BoD of SNIA and Active Archive Alliance (AAA), so she’s very active in the storage industry, on multiple dimensions and a very busy lady.

We (or maybe just I) didn’t know that DDN has a 20 year history in storage and in servicing high performance computing (HPC) customers. It turns out that more enterprise IT organizations are starting to take on workloads that look like HPC activity.

In HPC there are 1000s of compute cores that are crunching on PB of data. For Oil&Gas companies, it’s seismic and wellhead analysis; with bio-informatics it’s genomic/proteomic analysis; and with financial services, it’s economic modeling/backtesting trading strategies. For today’s enterprises such as retailers, it’s customer activity analytics; for manufacturers, it’s machine sensor/log analysis; and for banks/financial institutions, it’s credit/financial viability assessments. Enterprise IT might not have 1000s of cores at their disposal just yet, but it’s not far off. Molly thinks one way to help enterprise IT is to provide a SuperComputer as a service (ScaaS?) offering, where top 10 supercomputers can be rented out by the hour, sort of like a supercomputing compute/data cloud.

We start early talking about the DDN WOS object store, which can handle archive to cloud or backend tape libraries. Later we discuss DDN ExaScaler and GridScaler, which are NAS appliances for Lustre and for massively scale out, parallel file system storage, respectively.

Another key supercomputing storage requirement is predictable performance. Aside from sophisticated QoS offerings across their products, DDN also offers the IME solution, a bump-in-the-cable caching system that can optimize large and small file IO activity for backend DDN NAS scalers. DDN IME is stateless and can be removed from the data path while still allowing IT access to all their data.

While we were discussing DDN storage interfaces, Molly mentioned they were working on an Omni Path Fabric. Intel’s new Omni Path Fabric is intended to replace rack scale PCIe networks for HPC.

This month’s edition is not too technical and runs just over 45 minutes. We only got to SNIA and AAA at the tail end and just for a minute or two. Molly’s always fun to talk to, with enough technical smarts to keep Howard and me at bay, at least for a while :). Listen to the podcast to learn more.

Molly Rector, CMO and EVP Product Management & Worldwide Marketing, DDN

With 15 years of experience working in the HPC, Media and Entertainment, and Enterprise IT industries running global marketing programs, Molly Rector serves as DDN’s Chief Marketing Officer (CMO) responsible for product management and worldwide marketing. Rector’s role includes providing customer and market input into the company’s product roadmap, raising the Corporate brand visibility outside traditional markets, expanding the partner ecosystem and driving the end-to-end customer experience from definition to delivery.

Rector is a founding member and currently serves as Chairman of the Board for the Active Archive Alliance. She is also the Storage Networking Industry Association’s (SNIA) Vice Chairman of the Board and the Analytics and Big Data committee Vice Chairman. Prior to joining DDN, Rector was responsible for product management and worldwide marketing as CMO at Spectra Logic. During her tenure at Spectra Logic, the company grew revenues consistently by double digits year-over-year, while also maintaining profitability. Rector holds certifications as CommVault Certified System Administrator; Veritas Certified Data Protection Administrator; and Oracle Certified Enterprise DBA: Backup and Recovery. She earned a Bachelor’s of Science degree in biology and chemistry.