098: GreyBeards talk data protection & visualization for massive unstructured data repositories with Christian Smith, VP Product at Igneous

Sponsored By:

Even before COVID-19 there was a lot of file data being created and mined, but with the advent of the pandemic, this has accelerated considerably. As such, it seemed an appropriate time to talk with Christian Smith, VP of Product at Igneous, (@IgneousIO) a company that targets the protection and visibility of massive quantities of unstructured data, on premise, in the cloud, or just about anywhere else it may live.

Let me state at the outset that my belief had always been that you don’t back up 10PB of data; rather, you bite the (big expense) bullet to replicate it and hope for the best. After talking with Christian and Igneous, I am going to have to modify that belief by a couple more orders of magnitude.

All this data is coming from LIDAR, RADAR, audio, video, pictures, medical film, MRI/CAT scans, etc., and as noted above, it’s exploding. Christian talked about one customer of theirs that supplies aerial photography/LIDAR/RADAR scans of areas on request. This can be used to better understand crop, forest, wildlife, land health and use. One surprise Igneous found with this customer is that the data is typically archived after first use, but within a month or so it’s moved back online for some other purpose.

Igneous heritage

Many of the people who started up and currently work at Igneous have been around file storage for some time, having primarily come from (Dell EMC) Isilon, NetApp, Qumulo and other industry heavyweights. When they started Igneous, they realized the world didn’t need another NAS box or file system. Rather, with the advent of 10-100PB unstructured data farms, what was needed was an effective way to protect and understand that data.

When they considered how to protect and visualize 100PB of unstructured data, the only way they found to do this was to build a scale-out solution that used on premise and cloud infrastructure and was offered as a service.

Igneous DataProtect solution

With 10PB or 100PB of files, located across a gaggle of heterogeneous file servers, with billions of files across ~100s of servers, each of which has ~1K or more file shares, just scanning all the file servers would take weeks, if not longer, and then you need to move the data someplace to protect it. Seems like an impossible task.

Igneous immediately figured out the first thing they needed was a radically new, scale out architecture to rapidly scan all of the file servers. Thus was born ActiveScan. Christian said it was designed to scan a trillion files and they have customers with a billion files using their service today. ActiveScan doesn’t use NFS/SMB/Object (S3) access protocols to talk with file servers; rather, it uses internal APIs to access file metadata. DataProtect currently supports APIs for NetApp, Dell EMC Isilon, Pure FlashBlade, Qumulo, Gluster, Lustre, & GPFS (IBM Spectrum Scale) file systems. They use ActiveScan to build a file index database.

Their other major concern was how to move PBs of data rapidly across to the cloud and other locations. Again they created a scale out, multi-threaded service to do this and also made use of internal APIs rather than standard file or object protocols. This became IntelliMove. That same customer above, with billions of files, has 6PB of file data to protect.

Normal data movement is fine for largish files, but bogs down with lots of small files or extremely large files to back up. DataProtect gathers together small files into large chunks and splits up extremely large files into smaller chunks and moves these chunks to secondary storage.
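
To make the chunking idea concrete, here is a minimal sketch of how small files might be packed together and large files split up. It is not Igneous code: the `plan_chunks` function, the `Chunk` class and the 64 MiB target size are all assumptions made up for illustration, and the real service surely does far more (parallelism, ordering, metadata handling).

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # hypothetical target chunk size (64 MiB)


@dataclass
class Chunk:
    # Each extent is (file name, offset within the file, length in bytes).
    extents: list = field(default_factory=list)
    size: int = 0


def plan_chunks(files, chunk_size=CHUNK_SIZE):
    """Pack small files together and split large files into chunk-sized pieces.

    `files` is an iterable of (name, size_in_bytes) pairs. Zero-byte files
    carry no data and therefore produce no extents in this sketch.
    """
    chunks = []
    current = Chunk()
    for name, size in files:
        offset = 0
        while offset < size:
            room = chunk_size - current.size
            take = min(room, size - offset)
            current.extents.append((name, offset, take))
            current.size += take
            offset += take
            if current.size == chunk_size:
                # The chunk is full: close it and start a new one.
                chunks.append(current)
                current = Chunk()
    if current.extents:
        chunks.append(current)
    return chunks
```

With a chunk size of 100 bytes, two small files of 10 and 20 bytes share the first chunk with the opening 70 bytes of a 250-byte file, and the rest of that file spills into two more chunks.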

Data expiration is another problem, especially when you chunk files together. Here they came up with an intelligent garbage collection algorithm which only collects free space when it makes the most sense, but removes access to the data at the time of expiration.
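
The split between expiring access and reclaiming space can be sketched as below. Christian didn't describe the actual algorithm, so the `ChunkStore` class and its `reclaim_threshold` policy (compact a chunk once enough of it is dead) are assumptions of ours, chosen only to show the two-step idea.

```python
class ChunkStore:
    """Toy model: files packed into chunks, expired logically first, reclaimed later."""

    def __init__(self, reclaim_threshold=0.5):
        self.reclaim_threshold = reclaim_threshold  # hypothetical policy knob
        self.index = {}   # file name -> (chunk id, size); the only access path
        self.chunks = {}  # chunk id -> {"total": bytes stored, "dead": bytes expired}

    def add(self, chunk_id, name, size):
        self.index[name] = (chunk_id, size)
        chunk = self.chunks.setdefault(chunk_id, {"total": 0, "dead": 0})
        chunk["total"] += size

    def expire(self, name):
        """Drop access right away; the bytes stay in the chunk until reclaimed."""
        chunk_id, size = self.index.pop(name)
        self.chunks[chunk_id]["dead"] += size

    def collect(self):
        """Compact only the chunks whose dead fraction reaches the threshold."""
        reclaimed = 0
        for chunk in self.chunks.values():
            if chunk["total"] and chunk["dead"] / chunk["total"] >= self.reclaim_threshold:
                reclaimed += chunk["dead"]
                chunk["total"] -= chunk["dead"]
                chunk["dead"] = 0
        return reclaimed
```

The point of the design is that `expire` is cheap and immediate, while the expensive rewrite of a chunk is deferred until it frees enough space to be worth the IO.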

DataProtect uses a cloud based, SaaS control plane that manages and coordinates its activities across data centers, sites and cloud instances. It also has a client VM (OVA, with 8 core CPU, 32GB DRAM, ~100MB) that runs in the customer’s infrastructure, on site, in CoLos or in the cloud, and is used to scan-move-protect customer unstructured data. If more scan and data movement performance is needed, the VM can spawn additional threads automatically and more VMs can be added to provide even more throughput.

DataDiscover solution

The other service that Igneous offers is DataDiscover, a data visualization tool. DataDiscover uses ActiveScan and its database to provide customers a way to understand the file data that resides in their massive unstructured data farms across the data center, cloud or wherever else it resides.

We didn’t discuss this solution as much, but having a way to better understand the files in a 10-100PB unstructured data farm could be very useful and a great way to keep that 100PB from growing to 1EB faster than it has to.

As part of their outreach to the world, Igneous is giving away free DataProtect services to organizations that are focused on COVID-19 research. Check out their offer here

The podcast ran ~24 minutes. Christian was extremely knowledgeable about the problems that happen with very large unstructured data farms and how Igneous solutions can provide a better way to protect and visualize that data. Matt and I had a fun time discussing Igneous’s approach with Christian. Listen to the podcast to learn more.

Christian Smith, VP Product at Igneous

Christian is VP of Product, responsible for product management, solutions, and customer success. Prior to Igneous, Christian spent 15 years running field engineering organizations at EMC, Isilon Systems, NetApp and Silicon Graphics.

Christian has been working with organizations that work with file data since working at Silicon Graphics. Before that Christian was co-founder of a small management consulting company associated with Y2K and deregulation.

Christian received dual bachelor’s degrees in Chemistry and Computer Science from the University of Missouri-Columbia. Christian is an avid camper, skier and traveler and has long since traveled through all of the continental 48 states.

91: Keith and Ray show at CommvaultGO 2019

There was a lot of news at CommvaultGO this year and it was our first chance to talk with their new CEO, Sanjay Mirchandani. Just prior to the show Commvault introduced a new SaaS backup offering for the mid market, Metallic™, and about a month or so prior to the show Commvault had acquired Hedvig, a software defined storage solution. Keith and I also participated in a TechFieldDay Exclusive (TFDx) for Commvault, the day before the show began.

First up is Metallic, a Commvault Venture. When Sanjay arrived he took a worldwide tour of Commvault offices and customers and came back saying they needed a Software-as-a-Service backup offering to go after the mid market. That was about 6 months ago and since then, they have spun up a development and marketing team and today delivered their first product.

Metallic has three offerings all based on Commvault technology but re-implemented to be simpler to use and operate in the cloud.

  1. Metallic Core Backup & Recovery which is targeted at virtualized server environments whether on premises or in the cloud. It covers backup and recovery for VMware vSphere, Microsoft Hyper-V & KVM VMs, SQL server and file servers running on Windows or Linux.
  2. Metallic Office 365 Backup & Recovery, which is targeted at Office 365 solutions and provides backup and recovery solutions for these customer environments.
  3. Metallic Endpoint Backup & Recovery, which is focused on desktop and laptop users and provides backup and recovery for those end-user environments.

Metallic operates in its own cloud environment (believed to be Microsoft Azure) and it’s a bring your own cloud secondary storage solution with an option to use Metallic cloud storage as secondary storage.

At the moment, Metallic is only offered to US based organizations and purchased through Commvault channel partners. However, the free (believe 45 day) trial can be downloaded and purchased without the channel.

Pricing for the Core Backup & Recovery is based on TB/month and pricing for the other two Metallic offerings is based on user seats/month. There doesn’t seem to be any retention limit for the Office365 and Endpoint products. The Core Backup product data retention is only limited by the TBs that are licensed.

Next up is Commvault Activate™. This product was announced at last year’s GO conference but neither Keith nor I took note. Activate is a data management solution using Commvault backup storage and provides three capabilities: File Storage Optimization, which identifies files that are suitable for archive; Sensitive Data Governance, which profiles and identifies sensitive data in files and provides governance; and Compliance Search & eDiscovery, which can be used to put legal holds and create review sets for legal and other compliance activities.

And then there’s Hedvig, a Commvault Venture. At the show there was much talk about the Data Brain as having two sides, one was for the management of data protection and the other was for the management of storage. What Commvault plans to do over the next few years is to deliver on a unified storage and protection Data Brain that supports both of these sides. During the TFD sessions there was quite a lot of chatter, twitter and otherwise, about whether customers would ever be willing to have both primary and secondary storage on the same system, or have both be controlled by the same data plane. Commvault isn’t the only vendor to have gone down this path. We will need to wait and see how customers react.

The podcast is ~23 minutes. As mentioned previously, Keith is a long time friend and co-host of our GreyBeards On Storage podcast. He always has an interesting perspective on how new technology can benefit the data center today. Listen to the podcast to learn more.

Keith Townsend, The CTO Advisor

Keith Townsend (@CTOAdvisor) is an IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations.

Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIN.

90: GreyBeards talk K8s containers storage with Michael Ferranti, VP Product Marketing, Portworx

At VMworld2019 USA there was a lot of talk about integrating Kubernetes (K8s) into vSphere’s execution stack and operational model. We had heard that Portworx was a leader in K8s storage services or persistent volume support and thought it might be instructive to hear from Michael Ferranti (@ferrantiM), VP of Product Marketing at Portworx about just what they do for K8s container apps and their need for state information.

Early on Michael worked for RackSpace in their SaaS team and over time saw how developers and system engineers just loved container apps. But they had great difficulty using them for mission critical applications and containers of the time had a complete lack of support for storage. Michael joined Portworx to help address these and other limitations in using containers for mission critical workloads.

Portworx is essentially a SAN, specifically designed for containers. It’s a software defined storage system that creates a cluster of storage nodes across K8s clusters and provides standard storage services on a container level granularity.

As a software defined storage system, Portworx is right in the middle of the data path, so they must provide high availability, RAID protection and other standard storage system capabilities. But we talked only a little about basic storage functionality on the podcast.

Portworx was designed from the start to work for containers, so it can easily handle provisioning and de-provisioning 100s to 1000s of volumes without breaking a sweat. Not many storage systems, software defined or not, can handle this level of operations and not impact storage services.

Portworx supports both synchronous and asynchronous (snapshot based) replication solutions. As with all synchronous replication, system write performance is dependent on how far apart the storage nodes are, but it can provide RPO=0 (recovery point objective) for mission critical container applications.

Portworx takes this another step beyond just data replication. They also replicate container configuration (YAML) files. We’re no experts, but YAML files contain an encapsulation of everything needed to understand how to run containers and container apps in a K8s cluster. When one combines replicated container YAML files, replicated persistent volume data AND an appropriate external registry, one can start running mission critical container apps at a disaster site in minutes.

Their asynchronous replication for container data and configuration files uses Portworx snapshots, which are sent to an alternate site. But they also support async replication to any S3 compatible storage via CloudSnap.

Portworx also supports KubeMotion, which replicates/copies namespaces, container app volume data and container configuration YAML files from one K8s cluster to another. This way customers can move their K8s namespaces and container apps to any other Portworx K8s cluster site. This works across on prem K8s clusters, cloud K8s clusters, between public cloud provider K8s clusters or between on prem and cloud K8s clusters.

Michael also mentioned that data at rest encryption, for Portworx, is merely a tick box on a storage class specification in the container’s YAML file. They make use of KMIP services to provide customer generated keys for encryption.

This is all offered as part of their Data Security/Disaster Recovery (DSDR) service, which supports any K8s cluster service, whether it be AWS, Azure, GCP, OpenShift, bare metal, or VMware vSphere running K8s VMs.

Like any software defined storage system, customers needing more performance can add nodes to the Portworx (and K8s) cluster or more/faster storage to speed up IO.

It appears they have most if not all the standard storage system capabilities covered but their main differentiator, besides container app DR, is that they support volumes on a container by container basis. Unlike other storage systems that tend to use a VM or higher level of granularity to contain container state information, with Portworx, each persistent volume in use by a container is mapped to a provisioned volume.

Michael said their focus from the start was to provide high performing, resilient and secure storage for container apps. They ended up with a K8s native storage and backup/DR solution to support mission critical container apps running at scale. Licensing for Portworx is on a per host (K8s node) basis.

The podcast ran long, ~48 minutes. Michael was easy to talk with, knew K8s and their technology/market very well. Matt and I had a good time discussing K8s and Portworx’s unique features made for K8s container apps. Listen to the podcast to learn more.

Michael Ferranti, VP of Product Marketing, Portworx

Michael (@ferrantiM) is VP of Product Marketing at Portworx, where he is responsible for communicating the value of containerization and digital transformation to global architects and CIOs.

Prior to joining Portworx, Michael was VP of Marketing at ClusterHQ, an early leader in the container storage market, and spent five years at Rackspace in a variety of product and marketing roles.

88: A GreyBeard talks DataPlatform with Jon Hildebrand, Principal Technologist, Cohesity at VMworld 2019

Sponsored by:

This is another sponsored GreyBeards on Storage podcast and it was recorded at VMworld 2019. I talked with Jon Hildebrand (@snoopJ123), Principal Technologist at Cohesity. Jon’s been a long time friend from TechFieldDay days and has been working with Cohesity for ~14 months now. For such a short time, Jon’s seen a lot of changes in Cohesity functionality.

Indeed, they just announced general availability of Cohesity 6.4, which he called a “major release”. One of the first things we talked about in the 6.4 release was CyberScan, Powered by Tenable, which is a new capability that uses backup data and scans it for vulnerabilities and risk postures. This way customers can assess their data to see if it’s been infected, potentially long before ransomware or other cyber threats can cripple their systems.

One of the other features in 6.4 was a new run book automation, called the Cohesity Runbook application, that can be used, for instance, to stand up a physical clone of customer data and applications in the cloud or elsewhere. This way customers can have a fully operational copy of their applications running in the cloud, automatically supplied by Cohesity Runbook. Besides the great use of this facility for DR and DR testing, such capabilities could be used to fire up a Test/Dev environment of your production applications on public cloud infrastructure.

The last feature of 6.4 that Jon and I discussed supports archiving data from a primary NAS/filer storage system and moving that data out to Cohesity NAS. A stub or SymLink to the data is retained on the primary NAS system. By doing that, customers still have access to all the metadata and can access the data anytime they want, but free up primary storage capacity and most of the IO processing needed to access the data.

Cohesity NAS provides the capacity and the processing power to support the IO and data that has been archived. With the new feature, Cohesity DataPlatform acts as an archive or tier of storage behind the primary NAS server. By doing so, customers should be able to delay tech refresh cycles, which should save them time and money. 

When I asked Jon if there were any last items he wanted to discuss he mentioned the Cohesity Truck. Apparently John, Chris and others at Cohesity have stood up a complete data center inside a semi-trailer. Jon said if we can’t bring customers to the Executive Briefing Center (EBC), then we can bring the EBC to the customers. Jon said the truck is touring the USA and you can arrange a visit by going to Cohesity.com/tour.

The podcast is a little under ~20 minutes. Jon is an old friend from TechFieldDays and seems to be taking to Cohesity very well. I’ve always respected Jon’s knowledge of the customer environment and his technical acumen. Listen to the podcast to learn more.

Jon Hildebrand, Principal Technologist, Cohesity. 

Principal Technologist @ Cohesity | Public Speaker | Blogger | Purveyor of PowerShell | VMware vExpert | Cisco Champion

76: GreyBeards talk backup content, GDPR and cyber security with Jim McGann, VP Mkt & Bus. Dev., Index Engines

In this episode we talk indexing old backups, GDPR and CyberSense, a new approach to cyber security, with Jim McGann, VP Marketing and Business Development, Index Engines.

Jim’s an old industry hand that’s been around backups, e-discovery and security almost since the beginning. Index Engines’ solution to cyber security, CyberSense, is also offered by Dell EMC and Jim presented at a TFDx event this past October hosted by Dell EMC (See Dell EMC-Index Engines TFDx session on CyberSense).

It seems Howard’s been using Index Engines for a long time but keeping them a trade secret. In one of his prior consulting engagements he used Index Engines technology to locate a multi-million dollar email for one customer.

Universal backup data scan and indexing tool

Index Engines has a long history as a tool to index and understand old backup tapes and files. Index Engines did all the work to understand the format and content of NetBackup, Dell EMC Networker, IBM TSM (now Spectrum Protect), Microsoft Exchange backups, database vendor backups and other backup files. Using this knowledge they are able to read just about anyone’s backup tapes or files and tell customers what’s on them.

But it’s not just a backup catalog tool, Index Engines can also crack open backup files and index the content of the data. In this way customers can search backup data, with Google like search terms. This is used day in and day out, for E-discovery and the occasional consulting engagement.

Index Engines technology is also useful for companies complying with GDPR and similar legislation. When any user can request that information about them be purged from corporate data, being able to scan, index and search backups is a great feature.

In addition to backup file scanning, Index Engines has a multi-PB indexing solution which can be used to perform the same Google-like searching on a data center’s file storage. Once again, Index Engines has done the development work to implement their own, highly parallelized metadata and content search engine, demonstrably faster than any open source (Lucene) search solution available today.

CyberSense

All that’s old news, what Jim presented at a TFDx event was their new CyberSense solution. CyberSense was designed to help organizations detect and head off ransomware, cyber assaults and other data corruption attacks.

CyberSense computes a data entropy (randomness) score as well as ~39 other characteristics for every file in backups or online in a customer’s data center. It then uses that information to detect when a cyber attack is taking place and determine the extent of the corruption. With current and previous entropy and other characteristics on every data file, CyberSense can flag files that look like they have been corrupted and warn customers that a cyber attack is in process before it corrupts all of a customer’s data files.
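
For readers who haven’t met entropy scoring before, here is a minimal sketch of a byte-level Shannon entropy score and a before/after comparison. Jim didn’t say how CyberSense computes its score, so this is the textbook formula rather than their method, and the `entropy_jump` helper and its 2.0 bits per byte threshold are assumptions chosen purely for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_jump(previous: bytes, current: bytes, threshold: float = 2.0) -> bool:
    """Flag a file whose entropy rose by more than `threshold` bits per byte."""
    return shannon_entropy(current) - shannon_entropy(previous) > threshold
```

Plain text usually scores well below 8 bits per byte, while encrypted data sits close to 8. Compressed formats such as ZIP or JPEG also score near 8, so an entropy score on its own would not be enough to tell them apart from encrypted files.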

One typical corruption is to change file extensions. CyberSense cracks open file contents and can determine if it’s an office or other standard document type and then check to see if its extension matches its content. Another common corruption is to encrypt files. Such files necessarily have an increased entropy and can be automatically detected by CyberSense.
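
The extension check can be illustrated with the well-known “magic” bytes that many file formats start with. This is only a sketch of the general idea: the `MAGIC` table and the `extension_matches_content` function are ours, the table covers just four signatures, and CyberSense presumably inspects content much more deeply than the first few bytes.

```python
import os

# Leading "magic" bytes for a few common formats, mapped to the extensions
# that normally carry them. A tiny illustrative table, not a complete one.
MAGIC = {
    b"%PDF": {".pdf"},
    b"\x89PNG\r\n\x1a\n": {".png"},
    b"\xff\xd8\xff": {".jpg", ".jpeg"},
    b"PK\x03\x04": {".zip", ".docx", ".xlsx", ".pptx"},  # ZIP-based containers
}


def extension_matches_content(filename: str, header: bytes):
    """Return True/False when the header is recognized, None when it is not."""
    ext = os.path.splitext(filename)[1].lower()
    for magic, extensions in MAGIC.items():
        if header.startswith(magic):
            return ext in extensions
    return None
```

A PDF that has been renamed to `report.pdf.locked` still starts with `%PDF`, so the check returns False and the file can be flagged for a closer look.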

When CyberSense has detected some anomaly, it can determine who last accessed the file and what executable was used to modify it. In this way CyberSense can be used to provide forensics on who, what, when and where about a corrupted file, so that IT can shut the corruption activity down before it’s gone too far.

CyberSense can be configured to periodically scan files online as well as just examine backup data (offline) during or after it’s backed up. Their partnership with Dell EMC is to do just that with Data Domain and Dell EMC backup software.

Index Engines’ proprietary indexing functionality has been optimized for parallel execution and for reduced index size. Jim mentioned that their content indexes average about 5% of the full storage capacity and that they can index content at a TB/hour.
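
To get a feel for what those two numbers imply, here is a back-of-the-envelope calculation. It assumes the 5% ratio and the 1 TB/hour rate both scale linearly and that the rate is for a single indexing stream, neither of which Jim confirmed; the `index_estimate` function is just our own helper.

```python
def index_estimate(capacity_tb: float, index_ratio: float = 0.05, tb_per_hour: float = 1.0):
    """Return (index size in TB, hours for one full indexing pass)."""
    return capacity_tb * index_ratio, capacity_tb / tb_per_hour


# Example: a hypothetical 2 PB (2000 TB) file estate.
index_tb, hours = index_estimate(2000)
days = hours / 24
```

Under those assumptions a 2PB estate would need roughly 100TB of index and about 2000 hours (a bit over 83 days) for a single stream to index it all, which is presumably why parallel execution matters so much.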

Index Engines is a software only offering but they also offer services for customers that want a turn key solution. They also are available through a number of partners, Dell EMC being one.

The podcast runs ~44 minutes. Jim’s been around backups, storage and indexing forever, and seems to have good knowledge of data compliance regimes and current security threats impacting customers across the world today. Listen to our podcast to learn more.

Jim McGann, VP Marketing and Business Development, Index Engines

Jim has extensive experience with eDiscovery and Information Management in the Fortune 2000 sector. Before joining Index Engines in 2004, he worked for leading software firms, including Information Builders and the French based engineering software provider Dassault Systemes.

In recent years he has worked for technology based start-ups that provided financial services and information management solutions. Prior to Index Engines, Jim was responsible for the business development of Scopeware at Mirror Worlds Technologies, the knowledge management software firm founded by Dr. David Gelernter of Yale University. Jim graduated from Villanova University with a degree in Mechanical Engineering.

Jim is a frequent writer and speaker on the topics of big data, backup tape remediation, electronic discovery and records management.