Even before COVID-19 there was a lot of file data being created and mined, but with the advent of the pandemic, this has accelerated considerably. As such, it seemed an appropriate time to talk with Christian Smith, VP of Product at Igneous, (@IgneousIO) a company that targets the protection and visibility of massive quantities of unstructured data, on premise, in the cloud, or just about anywhere else it may live.
Let me state at the outset, that my belief had always been, that you don’t backup 10PB of data, rather you bite the (big expense) bullet to replicate it and hope for the best. After talking with Christian and Igneous I am going to have to modify that belief by a couple of more orders of magnitude.
All this data is coming from: LIDAR, RADAR, audio, video, pictures, medical film, MRI/CAT Scans, etc., and as noted above, it’s exploding. Christian talked about one customer of theirs that supplies aerial photography/LIDAR/RADAR scans of areas on request. This can used to better understand crop, forest, wildlife, land health and use. One surprise Igneous found with this customer is that the data is typically archived after first use, but within a month or so it’s moved back online for some other purpose.
Many of the people who started up and currently work at Igneous have been around file storage for some time having, primarily coming from (Dell EMC) Isilon, NetApp, Qumulo and other industry heavyweights. When they started Igneous, they realized the world didn’t need another NAS box or file system. Rather, with the advent of 10-100PB unstructured data farms, what was needed was an effective way to protect and understand that data.
When they considered how to protect and visualize 100PB of unstructured data, the only they found to do this was to build a scale-out solution that used on premise and cloud infrastructure and was offered as a service.
Igneous DataProtect solution
With 10PB or 100PB of files, located across a gaggle of heterogeneous file servers, with billions of files across ~100s of servers, each of with has ~1K or more file shares, just scanning all the file servers would take weeks, if not longer and then you need to move the data someplace to protect it. Seems like an impossible task.
Igneous immediately figured out the first thing they needed was a radically new, scale out architecture to rapidly scan of the file servers. Thus was born ActiveScan. Christian said it was designed to scan a trillion files and they have customers with a billion files using their service today. ActiveScan doesn’t use NFS/SMB/Object (S3) access protocols to talk with file servers rather it uses internal APIs to access file metadata. DataProtect currently supports APIs for NetApp, Dell EMC Isilon, Pure FlashBlade, Qumulo, Gluster, Lustre, & GPFS (IBM Spectrum Scale) file systems. They use ActiveScan to build a file index database.
Their other major concern was hot to move PBs of data rapidly across to the cloud and other locations. Again they created a scale out, multi-threaded service to do this and also made use of internal APIs rather than standard file or object protocols. This became IntelliMove. That same customer above with billions of files, has 6PB of file data to protect.
Normal data movement is fine for largish, files but bogs down with lots of small files or extremely large files to back up. DataProtect gathers together small files into a large chunks and splits up extremely large files into smaller chunks and moves these chunks to secondary storage.
Data expiration is another problem, especially when you chunk files together. Here they came up with an intelligent garbage collection algorithm which only collects free space when it makes the most sense but deletes data access at the time of expiration.
DataProtect uses a cloud based, SaaS control plane that manages and coordinates its activities across data centers, sites and cloud instances. It also has a client VM (OVA, with 8 core CPU, 32GB DRAM, ~100MB) that runs in the customers infrastructure, on site, in CoLo’s or in the cloud that is used to scan-move-protect customer unstructured data. If more scan and data movement performance is needed, the VM can spawn additional threads automatically and more VMs can be added to provide even more throughput.
The other service that Igneous offers is DataDiscover a data visualization tool. DataDiscover uses ActiveScan and its database to provide customers a way to understand the file data that resides in their massive unstructured data farms across the data center, cloud or wherever else it resides.
We didn’t discuss this solution as much but having a way to better understand the files in a 10-100PB unstructured data farm could be very useful and a great way to keep that 100PB from growing to 1EB faster than it has too.
As part of their outreach to the world, Igneous is giving away free DataProtect services to organizations that are focused on COVID-19 research. Check out their offer here
The podcast ran ~24 minutes. Christian was extremely knowledgeable about the problems that happen with very large unstructured data farms and how Igneous solutions can provide a better way to protect and visualize that data. Matt and I had a fun time discussing Igneous’s approach with Christian. Listen to the podcast to learn more.
Christian Smith, VP Product at Igneous
Christian is VP of Product, responsible for product management, solutions, and customer success. Prior to Igneous, Christian spent 15 years running field engineering organizations at EMC, Isilon Systems, NetApp and Silicon Graphics.
Christian has been working with organizations that work with file data since working at Silicon Graphics. Before that Christian was co-founder of a small management consulting company associated with Y2K and deregulation.
Christian received dual bachelor’s degrees in Chemistry and Computer Science from the University of Missouri-Columbia. Christian is an avid camper, skier and traveler and has long since traveled through all of the continental 48 states.