103: GreyBeards talk scale-out file and cloud data with Molly Presley & Ben Gitenstein, Qumulo

Sponsored by:

Ray has known Molly Presley (@Molly_J_Presley), Head of Global Product Marketing, for just about a decade now, and we both just met Ben Gitenstein (@Qumulo_Product), VP of Products & Solutions, Qumulo, on this podcast. Both Molly and Ben were very knowledgeable about the problems customers have with massive data troves.

Molly has been on our podcast before (with another company, see: GreyBeards talk HPC storage with Molly Rector, CMO & EVP, DDN ). And we have talked with Qumulo before as well (see: GreyBeards talk data-aware, scale-out file systems with Peter Godman, Co-founder & CEO, Qumulo ).

Qumulo has a long history of dealing with customer issues around data center application access to data, usually large data repositories with billions of small or large files that customers have accumulated over time. But recently Qumulo has taken on similar problems in the cloud as well.

Qumulo’s secret has always been to allow researchers to run their applications wherever their data resides. This has led Qumulo’s software defined storage to offer multiple protocol access as well as a completely native, AWS and GCP cloud version of their solution.

That way customers can run Qumulo in their data center or in the cloud and have the same great access to data. Molly mentioned one customer that creates and gathers data using SMB protocol on prem and then, after replication, processes it in the cloud.

Qumulo Shift

Ben mentioned that many competitive storage systems are business model focused. That is, they are all about keeping customer data within their solutions so they can charge for capacity. Although Qumulo also charges for capacity, with the new Qumulo Shift service, customers can easily move data off Qumulo and into native cloud storage. Using Shift, customers can free up Qumulo storage space (and cost) for any data that only needs to be accessed as objects.

With Shift, customers can replicate or move on prem or in the cloud Qumulo file data to AWS S3 objects. Once in S3, customers can access it with AWS native applications, other applications that make use of AWS S3 data, or can have that data be accessible around the world.

Qumulo customers can select directories to Shift to an AWS S3 bucket. The Qumulo directory name will be mapped to an S3 bucket name and each file in that directory will be copied to an S3 object in that bucket with the same file name.
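The directory-to-bucket mapping described above can be sketched in a few lines. This is only an illustration under stated assumptions: `plan_shift` and `ShiftItem` are hypothetical names, the directory is treated as flat, and nothing here calls Qumulo or AWS APIs. It just lists which object name each file would end up with.

```python
from pathlib import Path
from typing import NamedTuple


class ShiftItem(NamedTuple):
    source: Path  # file in the selected directory
    bucket: str   # target S3 bucket (assumed to be chosen by the admin)
    key: str      # object name, identical to the file name


def plan_shift(directory: Path, bucket: str) -> list[ShiftItem]:
    """List one copy operation per regular file directly inside `directory`."""
    return [
        ShiftItem(path, bucket, path.name)
        for path in sorted(directory.iterdir())
        if path.is_file()  # subdirectories are ignored in this sketch
    ]
```

A real transfer would then hand each item to whatever copy mechanism is in use; the point is only that file names carry over unchanged as object names.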

At the moment, Qumulo Shift only supports AWS S3. Over time, Qumulo plans to offer support for other public cloud storage targets for Shift.

Shift is based on Qumulo replication services. Qumulo has a number of patents on replication technology that provides for sophisticated monitoring, control and high performance for moving vast amounts of data.

How customers use Shift

One large customer uses Qumulo cloud file services to process seismic data but then makes the results of that analysis available to other clients as S3 objects.

Customers can also take advantage of AWS and other applications that support objects only. For example, AWS SageMaker Machine Learning (ML) processes S3 object data. Qumulo customers could gather training data as files and Shift it to S3 objects for ML training.

Moreover, customers can use Shift to create AWS S3 object backups, archives and DR repositories of Qumulo file data. Ben mentioned DevOps could also use Qumulo Shift via APIs to move file data to S3 objects as part of new application deployment.

Finally, using Shift to copy or move file data to AWS S3 makes that data ideal for collaboration by researchers, analysts and just about any other entity that needs access to it.

The podcast ran ~26 minutes. Molly has always been easy to talk with and Ben turned out also to be easy to talk with and knew an awful lot about the product and how customers can use it. Keith and I enjoyed our time with Molly and Ben discussing Qumulo and their new Shift service. Listen to the podcast to learn more.

Ben Gitenstein, VP of Products and Solutions, Qumulo

Ben Gitenstein runs Product at Qumulo. He and his team of product managers and data scientists have conducted nearly 1,000 interviews with storage users and analyzed millions of data points to understand customer needs and the direction of the storage market.

Prior to working at Qumulo, Ben spent five years at Microsoft, where he split his time between Corporate Strategy and Product Planning.

Molly Presley, Head of Global Product Marketing, Qumulo

Molly Presley joined Qumulo in 2018 and leads worldwide product marketing. Molly brings over 15 years of file system and archive technology leadership experience to the role.

Prior to Qumulo, Molly held executive product and marketing leadership roles at Quantum, DataDirect Networks (DDN) and Spectra Logic.

Presley also created the term “Active Archive”, founded the Active Archive Alliance and has served on the Board of the Storage Networking Industry Association (SNIA).

0101: Greybeards talk with Howard Marks, Technologist Extraordinary & Plenipotentiary at VAST

As most of you know, Howard Marks (@deepstoragenet), Technologist Extraordinary & Plenipotentiary at VAST Data used to be a Greybeards co-host and is still on our roster as a co-host emeritus. When I started to schedule this podcast, it was going to be our 100th podcast and we wanted to invite Howard and the rest of the co-hosts to be on the call to discuss our podcast. But alas, the 100th Greybeards podcast came and went, before we could get it done. So we decided to refocus this podcast back on VAST Data.

We talked with Howard last year about VAST and some of this podcast covers the same ground (see last year’s podcast with Howard on VAST Data) but I highlighted below different aspects of their product that we also discussed.

For starters, VAST just finalized a recent round of funding, which, if I recall, valued them at over $1B USD, making them yet another data storage unicorn.

VAST is a scale out, disaggregated, unstructured data platform that takes advantage of the economics of QLC SSD (from Intel) combined with the speed of 3D XPoint storage class memory (Optane SSD, also from Intel) to support customer data. Intel is an investor in VAST.

VAST uses multiple front end (controller) servers, with one or more HA NVMe drive module(s) connected via a dual InfiniBand or 100Gbps Ethernet RDMA cluster interconnect. The HA NVMe drive module has two adapter cards (IO modules), one for each connection, that take IO and data requests and transfer them across a PCIe bus which connects to QLC and Optane SSDs. They also have a Mellanox (another investor) switch on their backend with a (round robin) DNS router to connect hosts to their storage (front-end) servers.

Each backend HA NVMe drive module has 12 1.5TB Optane U.2 SSDs and 44 15.4TB QLC SSDs, for a total of 56 drives. Customer data is first written to Optane and then destaged to QLC SSD.

QLC has the advantage of being 4 bits per cell (for a lower $/GB stored) but its endurance, or drive writes/day (dw/d), is significantly worse than TLC. So VAST has had to work to increase QLC endurance in their system.

Natively, QLC offers ~0.2 dw/d when doing random 4K writes. However, if your system does 128KB sequential writes, it offers 4.0 dw/d. VAST destages data from Optane SSDs to QLC in 1MB chunks which both optimizes endurance and reduces garbage collection write amplification within the drive.
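To put those endurance figures in perspective, here is a rough back-of-the-envelope calculation. It uses the 15.4TB drive size and the two dw/d ratings quoted above; the 5-year service period and the helper name `lifetime_writes_pb` are assumptions made only for illustration.

```python
QLC_CAPACITY_TB = 15.4   # per-drive capacity quoted in the text
SERVICE_YEARS = 5        # assumed service period, not from the podcast


def lifetime_writes_pb(dwpd: float, years: float,
                       capacity_tb: float = QLC_CAPACITY_TB) -> float:
    """Total data (in PB, decimal units) a drive can absorb at a given dw/d."""
    return dwpd * capacity_tb * 365 * years / 1000


random_4k = lifetime_writes_pb(0.2, SERVICE_YEARS)    # random 4K writes
sequential = lifetime_writes_pb(4.0, SERVICE_YEARS)   # 128KB sequential writes

print(f"random 4K: {random_4k:.1f} PB, sequential: {sequential:.1f} PB")
print(f"endurance gain: {sequential / random_4k:.0f}x")
```

Whatever service period is assumed, the ratio stays the same: large sequential writes buy roughly 20 times the write budget of small random ones, which is why destaging in big chunks matters.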

Howard mentioned their frontend servers are stateless, i.e., maintain no state information about any IO activity going on. Any IO state information is maintained by their system in Optane SSDs. Each server maintains a work log (like) structure on Optane that describes what they are doing in support of host IO and other activities. That way, if one front end server goes down, another one can access its log and take over its activity.

Metadata is also maintained only on Optane SSDs. Howard called their metadata structure a V-tree (B-tree). VAST mirrors all meta-data and customer data to two Optane SSDs. So if one Optane SSD goes down, its pair can be used to continue operations.

In last year’s podcast we talked at length about VAST data protection and data reduction capabilities so we won’t discuss these any further here.

However, one thing worth noting is that VAST has a very large RAID (erasure code protection) stripe. Data is written to the QLC SSDs in a VAST designed, locally decodable erasure coding format.

One problem with large stripes is rebuild time. VAST’s locally decodable parity codes help with this but the other thing that helps is distributing rebuild IO activity to all front end servers in the system.

The other problem with large stripe sizes is garbage collection. VAST segregates customer data by “temporariness” based on their best guess. In this way all data in one stripe should have similar lifetimes. When it’s time for stripe garbage collection, having all temporary data allows VAST to jettison the whole stripe (or most of it) rather than having to collect and re-write old stripe data to another new stripe.
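The idea can be illustrated with a toy model. The podcast didn't describe how VAST predicts data lifetimes or sizes its stripes, so everything below (the `Block` type, the fixed expiry windows, the function names) is a hypothetical sketch of lifetime-based placement in general, not VAST's algorithm.

```python
from collections import defaultdict
from typing import NamedTuple


class Block(NamedTuple):
    block_id: str
    expires_at: int  # predicted expiry time, arbitrary units


def build_stripes(blocks, window):
    """Group blocks whose predicted expiry falls in the same time window."""
    stripes = defaultdict(list)
    for block in blocks:
        stripes[block.expires_at // window].append(block)
    return dict(stripes)


def live_blocks(stripe, now):
    """Blocks garbage collection would have to rewrite if the stripe were reclaimed at `now`."""
    return [b for b in stripe if b.expires_at > now]


blocks = [Block("a", 5), Block("b", 7), Block("c", 95), Block("d", 98)]
segregated = build_stripes(blocks, window=10)
mixed = [blocks[0], blocks[2]]  # short-lived and long-lived data in one stripe

# At time 10 the short-lived stripe is entirely dead and can be dropped whole,
# while the mixed stripe still holds one live block that must be rewritten.
print(len(live_blocks(segregated[0], now=10)), len(live_blocks(mixed, now=10)))
```

The better the lifetime guess, the more often a stripe is entirely dead when it comes up for collection, and the less data has to be copied forward.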

VAST came out supporting NFSv3 and S3 object storage protocols. Their next release adds support for SMB 2.2, data-at-rest encryption and snapshotting to an external S3 store. As you may recall, SMB is a stateful protocol. In VAST’s home grown SMB implementation, front end servers can take over SMB transactions from other failed servers, without having to fail the whole transaction and start over again.

VAST uses a fail in place maintenance policy. That is, failed SSDs are not normally replaced in customer deployments; rather, blocks, pages, or SSDs are marked as failed and the spare capacity available in the drive enclosure is used to provide space for any needed rebuilt data.

VAST offers a 10 year maintenance option where the customer keeps the same storage for 10 full years. That way customers don’t have to migrate data from one system to another until their 10 years are up.

The podcast runs a little under 44 minutes. Howard and I can talk forever. He is always a pleasure to talk with as well as extremely knowledgeable about (VAST) storage and other industry solutions.  The co-hosts and I had a great time talking with him again. Listen to the podcast to learn more.

Howard Marks, Technologist Extraordinary and Plenipotentiary, VAST Data, Inc.

Howard Marks brings over forty years of experience as a technology architect for hire and industry observer to his role as VAST Data’s Technologist Extraordinary and Plenipotentiary. In this role, Howard demystifies VAST’s technologies for customers and customer requirements for VAST’s engineers.

Before joining VAST, Howard ran DeepStorage, an industry test lab and analyst firm. An award-winning speaker, he has appeared at events on three continents including Comdex, Interop and VMworld.

Howard is the author of several books (all gratefully out of print) and hundreds of articles since Bill Machrone taught him journalism at PC Magazine in the 1980s.

Listeners may also remember that Howard was a founding co-Host of the Greybeards-on-Storage Podcast.

098: GreyBeards talk data protection & visualization for massive unstructured data repositories with Christian Smith, VP Product at Igneous

Sponsored By:

Even before COVID-19 there was a lot of file data being created and mined, but with the advent of the pandemic, this has accelerated considerably. As such, it seemed an appropriate time to talk with Christian Smith, VP of Product at Igneous, (@IgneousIO) a company that targets the protection and visibility of massive quantities of unstructured data, on premise, in the cloud, or just about anywhere else it may live.

Let me state at the outset, that my belief had always been, that you don’t backup 10PB of data, rather you bite the (big expense) bullet to replicate it and hope for the best. After talking with Christian and Igneous I am going to have to modify that belief by a couple of more orders of magnitude.

All this data is coming from: LIDAR, RADAR, audio, video, pictures, medical film, MRI/CAT Scans, etc., and as noted above, it’s exploding. Christian talked about one customer of theirs that supplies aerial photography/LIDAR/RADAR scans of areas on request. This can be used to better understand crop, forest, wildlife, land health and use. One surprise Igneous found with this customer is that the data is typically archived after first use, but within a month or so it’s moved back online for some other purpose.

Igneous heritage

Many of the people who started up and currently work at Igneous have been around file storage for some time, having primarily come from (Dell EMC) Isilon, NetApp, Qumulo and other industry heavyweights. When they started Igneous, they realized the world didn’t need another NAS box or file system. Rather, with the advent of 10-100PB unstructured data farms, what was needed was an effective way to protect and understand that data.

When they considered how to protect and visualize 100PB of unstructured data, the only way they found to do this was to build a scale-out solution that used on premise and cloud infrastructure and was offered as a service.

Igneous DataProtect solution

With 10PB or 100PB of files, located across a gaggle of heterogeneous file servers, with billions of files across ~100s of servers, each of which has ~1K or more file shares, just scanning all the file servers would take weeks, if not longer, and then you need to move the data someplace to protect it. Seems like an impossible task.

Igneous immediately figured out the first thing they needed was a radically new, scale out architecture to rapidly scan the file servers. Thus was born ActiveScan. Christian said it was designed to scan a trillion files and they have customers with a billion files using their service today. ActiveScan doesn’t use NFS/SMB/Object (S3) access protocols to talk with file servers; rather, it uses internal APIs to access file metadata. DataProtect currently supports APIs for NetApp, Dell EMC Isilon, Pure FlashBlade, Qumulo, Gluster, Lustre, & GPFS (IBM Spectrum Scale) file systems. They use ActiveScan to build a file index database.

Their other major concern was how to move PBs of data rapidly across to the cloud and other locations. Again they created a scale out, multi-threaded service to do this and also made use of internal APIs rather than standard file or object protocols. This became IntelliMove. That same customer above with billions of files has 6PB of file data to protect.

Normal data movement is fine for largish files but bogs down when there are lots of small files or extremely large files to back up. DataProtect gathers small files together into large chunks and splits extremely large files into smaller chunks, then moves these chunks to secondary storage.
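A minimal sketch of that kind of chunking follows. Igneous's actual chunk format and sizes weren't discussed, so the function name `plan_chunks`, the fixed chunk size and the (name, offset, length) extent layout are all assumptions made for illustration.

```python
def plan_chunks(files, chunk_size):
    """Pack (name, size) pairs into fixed-size chunks.

    Small files share a chunk; a file larger than the remaining room is
    split across consecutive chunks. Each chunk is a list of
    (name, offset, length) extents. Zero-length files produce no extent.
    """
    chunks, current, used = [], [], 0
    for name, size in files:
        offset = 0
        while offset < size:
            take = min(chunk_size - used, size - offset)
            current.append((name, offset, take))
            used += take
            offset += take
            if used == chunk_size:  # chunk is full, start a new one
                chunks.append(current)
                current, used = [], 0
    if current:  # flush the final, partially filled chunk
        chunks.append(current)
    return chunks


plan = plan_chunks([("a", 3), ("b", 2), ("big", 12), ("c", 1)], chunk_size=8)
for number, chunk in enumerate(plan):
    print(number, chunk)
```

Moving a handful of uniformly sized chunks avoids both the per-file overhead of millions of tiny transfers and the long single-stream transfers that very large files would otherwise cause.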

Data expiration is another problem, especially when you chunk files together. Here they came up with an intelligent garbage collection algorithm which only collects free space when it makes the most sense but removes access to data at the time of expiration.

DataProtect uses a cloud based, SaaS control plane that manages and coordinates its activities across data centers, sites and cloud instances. It also has a client VM (OVA, with 8 core CPU, 32GB DRAM, ~100MB) that runs in the customer’s infrastructure, on site, in CoLos or in the cloud, and that is used to scan-move-protect customer unstructured data. If more scan and data movement performance is needed, the VM can spawn additional threads automatically and more VMs can be added to provide even more throughput.

DataDiscover solution

The other service that Igneous offers is DataDiscover, a data visualization tool. DataDiscover uses ActiveScan and its database to provide customers a way to understand the file data that resides in their massive unstructured data farms across the data center, cloud or wherever else it resides.

We didn’t discuss this solution as much but having a way to better understand the files in a 10-100PB unstructured data farm could be very useful and a great way to keep that 100PB from growing to 1EB faster than it has to.

As part of their outreach to the world, Igneous is giving away free DataProtect services to organizations that are focused on COVID-19 research. Check out their offer here

The podcast ran ~24 minutes. Christian was extremely knowledgeable about the problems that happen with very large unstructured data farms and how Igneous solutions can provide a better way to protect and visualize that data. Matt and I had a fun time discussing Igneous’s approach with Christian. Listen to the podcast to learn more.

Christian Smith, VP Product at Igneous

Christian is VP of Product, responsible for product management, solutions, and customer success. Prior to Igneous, Christian spent 15 years running field engineering organizations at EMC, Isilon Systems, NetApp and Silicon Graphics.

Christian has been working with organizations that work with file data since working at Silicon Graphics. Before that Christian was co-founder of a small management consulting company associated with Y2K and deregulation.

Christian received dual bachelor’s degrees in Chemistry and Computer Science from the University of Missouri-Columbia. Christian is an avid camper, skier and traveler and has long since traveled through all of the continental 48 states.

095: GreyBeards talk file sync&share with S. Azam Ali, VP Customer Success at CentreStack

We haven’t talked with a file synch and share vendor in a while now and Matt was interested in the technology. He had been talking with CentreStack, and found that they had been making some inroads in the enterprise. So we contacted S. Azam Ali, VP of Customer Success at CentreStack and asked if he wanted to talk about their product on our podcast.

File synch and share, is part collaboration tool, part productivity tool. With file synch & share many users share the same files, across many different environments and end point devices. It’s especially popular with road warriors that need access to the same files on the road that reside in corporate data centers. With this technology, files updated anywhere would be available to all.

Most file synch&share systems require you to use their storage. But CentreStack just provides synch and share access to NFS and SMB storage that’s already in the data center.

CentreStack doesn’t use VPNs to access data, as many other vendors do. With CentreStack, one just logs into a website (with AD credentials) and has immediate browser access to files.

CentreStack uses a gateway VM, that runs in the corporate data center, configured to share files/file directories/shares. We asked whether they were in the data path and Azam said no. However, the gateway does register for file system notifications (e.g. when files are updated, outside CentreStack, they get notified).

CentreStack does maintain meta-data on the files, directories and shares that are under its control. Presumably, once an admin sets it up, it goes out and accesses the file systems that have shared files and populates its meta-data for those files.

CentreStack works with any NFS and SMB file system as well as NAS servers that support these two. It’s unclear whether customers can have more than one gateway server in their data center supporting synch and share but Azam did say that it wasn’t unusual for customers with multi-data centers to have a gateway in each, to support synch&share requirements for each data center.

They use client software on end point devices, which presents the shared files as an external drive (to Mac), presumably a cloud drive for Windows PCs and similar services (in an App) for other systems (iOS, Android phones, iPad, etc.). We believe Azam said Linux was coming soon.

The client software can be configured in cache mode or offline mode:

  • Cache mode – the admin can configure how much space to use on the endpoint device and the software will cache the most recently used files in that space for faster access
  • Offline mode – the software moves all files that the endpoint login can access, to the device.

In cache mode, when users open a file (not in the most recently used cache), there will be some delay as the system retrieves data from the internet and copies it to the endpoint device. Unclear what the delay might be but it’s probably a function of internet speed and load on the gateway, with possibly some overhead for the NFS/SMB/NAS system to supply the data. If there’s not enough space to hold the file, the oldest non-open file is erased from the cache.
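The eviction rule described above (drop the oldest file that isn't currently open) can be sketched as follows. CentreStack's client internals weren't covered, so the `EndpointCache` class, its method names and the byte-based capacity are hypothetical.

```python
from collections import OrderedDict


class EndpointCache:
    """Toy model of a size-limited file cache on an endpoint device."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.files = OrderedDict()  # name -> size, least recently used first
        self.open_files = set()

    def used(self):
        return sum(self.files.values())

    def access(self, name, size, keep_open=False):
        """Record an access, fetching the file if needed. Returns evicted names."""
        evicted = []
        if name in self.files:
            self.files.move_to_end(name)  # cache hit: mark as most recent
        else:
            while self.used() + size > self.capacity:
                # Oldest cached file that is not currently open, if any.
                victim = next(
                    (n for n in self.files if n not in self.open_files), None
                )
                if victim is None:
                    raise RuntimeError("no evictable file; cannot make room")
                del self.files[victim]
                evicted.append(victim)
            self.files[name] = size  # stands in for the download from the gateway
        if keep_open:
            self.open_files.add(name)
        return evicted

    def close(self, name):
        self.open_files.discard(name)
```

For example, with a 10-byte cache holding an open 4-byte file "a" and a closed 4-byte file "b", fetching another 4-byte file evicts "b" and leaves "a" untouched, even though "a" is older.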

In both modes, CentreStack supports cross domain locking. That is, if one client has a file open (for update), all other systems/endpoints may only access the file in read-only mode. After the file is closed, it can then be opened for update by other users.

When CentreStack clients are used to update files, the data is stored back in the original file systems with versioning. This way if the data is corrupted, admins can easily return back to a known good copy version.

CentreStack also offers a cloud backup and DR service. Gateway admins can request that synch&share files be backed up to cloud storage (AWS S3, Azure Blob and Wasabi). When CentreStack backs up file data to the cloud, it also includes metadata information about the files so they can be re-constituted anywhere.

A CentreStack cloud gateway VM can be activated in the cloud to supply access to backed up files. Unclear whether the CentreStack cloud backup has to be restored to block or file storage first or if it just accesses the data on cloud storage directly. But customers using CentreStack cloud DR would need to run client software in their applications accessing these files.

Wasabi seemed an odd solution to have on their list of supported cloud storage providers, but Azam said for their market, the economics of Wasabi storage were hard to ignore. See our previous podcast with David Friend, Co-Founder& CEO, Wasabi, to learn more about Wasabi.

CentreStack is licensed on a per user basis, not on storage capacity, bucking industry trends. But they don’t actually own the storage so it makes sense. For CentreStack cloud backup, customers also have to supply the cloud storage.

They also offer a 30 day free trial on their website with unlimited users. We assume this uses CentreStack’s cloud gateway and customers bring their own cloud storage to support it.

The podcast runs about 35 minutes. Azam was a bit more marketing than we are used to, but he warmed up once we started asking questions. Listen to the podcast to learn more.

S. Azam Ali, VP of Customer Success, CentreStack

S. Azam Ali is VP of Customer Success at CentreStack and is an executive with extensive experience in managing global teams including sales, support and consulting services.

Azam’s channel experience includes on-boarding new partners including creation of marketing and training collateral for the partners. Azam is an executive with a passion for customer success and establishing long term relationships and partnerships.

Azam is also an advisor to startups as well as established technology companies.

094: GreyBeards talk shedding light on data with Scott Baker, Dir. Content & Data Intelligence at Hitachi Vantara

Sponsored By:

At Hitachi NEXT 2019 Conference, last month, there was a lot of talk about new data services from Hitachi. Keith and I thought it would be a good time to sit down and talk with Scott Baker (@Kraken-Scuba), Director of Content and Data Intelligence, at Hitachi Vantara about what’s going on with data operations these days and how customers are shedding more light on their data.

Information supply chain

Something Scott said in his opening remarks caught my attention when he mentioned customer information supply chains. The information supply chain is similar to manufacturing supply chains, but it’s all about data. Just like manufacturing supply chains, where parts and services come from anywhere and are used to create products/services for customers, information supply chains are about the data used in their organization operations. Information supply chain data is A) being sourced from many places (or applications); B) being added to by supply chain processing (or other applications); and C) ultimately used by the organization to supply a product/service to customers.

But after the product/service is supplied the similarity between manufacturing and information supply chains breaks down. With the information supply chain, data is effectively indestructible, is infinitely re-useable and can live forever. Who throws data away anymore?

The problem most organizations have with information supply chains is once the product/service is supplied, data is often put away never to be seen again or as Scott puts it, goes dark.

This is where Hitachi Content Intelligence (HCI) comes in. HCI is designed to take (unstructured or structured) data and analyze it (using natural language and other processing tools) to surround it with information and other metadata, so that it can become more visible and useful to the organization for the life of its existence.

Customers can also use HCI to extract and blend data streams together, automating the creation of an information rich, data repository. The data repository can readily be searched to re-discover or uncover attributes about the data not visible before.

Scott also mentioned the Hitachi Pentaho Platform which can be used to make real-time decisions from structured data. Pentaho information can also be fed into HCI to provide more intelligence for your structured data.

But HCI can also be used to analyze other database data as well. For instance, database blob and text elements can be fed to and analyzed by HCI. HCI analysis can include natural language processing and other functionality to tag the data by adding key:value information, all of which can be supplied back to the database or Pentaho to add further value to structured data.

Customers can also use HCI to read and transform database tables into XML files. XML files can be stored in object stores as objects or in file systems. XML data could easily be textually indexed and searched by various tools to better understand the structured data information.
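For readers who haven't seen a table-to-XML transformation, here is a generic sketch using only the Python standard library. It is not HCI's API or output format: the function name `table_to_xml`, the element names and the use of SQLite are all assumptions chosen to keep the example self-contained, and it assumes column names are valid XML tag names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table):
    """Serialize every row of `table` as a <row> element under a <table> root."""
    # Table names cannot be bound as SQL parameters, so validate before use.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element("table", name=table)
    for record in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, record):
            ET.SubElement(row, column).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (id INTEGER, name TEXT)")
conn.execute("INSERT INTO parts VALUES (1, 'valve')")
print(table_to_xml(conn, "parts"))
```

Once rows are plain text like this, any full-text indexer can treat them the same way it treats documents, which is the property the paragraph above is pointing at.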

We also talked about Hadoop data that can be offloaded to Hitachi Content Platform (HCP) object storage with a stub left behind. Once data is in HCP, HCI can be triggered to index and add more metadata, which can then later be used to decide when to move data back to Hadoop for further analysis.

Finally, Keith mentioned that he just got back from KubeCon and there was an increasing cry for data being used with containerized applications. Scott mentioned HCP for Cloud Scale, the newest member of the HCP object store family, focused on scale out capabilities to provide highly consistent, object storage performance for customers that need it. Customers running containerized workloads use scale-out capabilities to respond to user demand and now they have on premises object storage that can scale with them, as needs change.

The podcast ran ~24 minutes. Scott was very knowledgeable about data workflows, pipelines and the need for better discovery tools. We had a great time discussing information supply chains and how Hitachi can help customers optimize their data pipelines. Listen to the podcast to learn more.

Scott Baker, Director of Content and Data Intelligence at Hitachi Vantara

Scott Baker is, and has been, an active member of the information technology, data analytics, data management, and data protection disciplines for longer than he is willing to admit.

In his present role at Hitachi, Scott is the Senior Director of the Content and Data Intelligence organization focused on Hitachi’s Digital Transformation, Data Management, Data Governance, Data Mobility, Data Protection and Data Analytics solutions, which include Hitachi Content Platform (HCP), HCP Anywhere, HCP Gateway, Hitachi Content Intelligence, and Hitachi Data Protection Solutions.

Scott is a VMware Certified Professional, recognized as a subject matter expert, industry speaker, and author. Scott has been a panelist on topics related to storage, cloud, information governance, data security, infrastructure standardization, and social media topics. His educational background includes an MBA, Master’s & Bachelor’s in Computer Science.

When he’s not working, Scott is an avid scuba diver, underwater photographer, and PADI Scuba Instructor. He has a passion for public speaking, whiteboarding, teaching, and traveling the world.