Tradition says no way. IT backup history says not on your life. Common sense would say never in a million years.
Most organizations with PB of data or more, depend on remote replication to protect against data center outage or massive loss of data. This of course costs ~2X your original data center. And for some organizations one copy is not enough, so ~3X .
I don’t know what a PB scale data storage costs these days but I can’t believe it’s under a couple Million $ USD in hw and sw costs and probably at least another Million or so in OpEx/year. Multiply that by 2 or 3X and you’re now talking real money.
How could backup help?
Well for one you wouldn’t need replicas, so that would cut your hw & sw acquisition costs by a factor of 2 or 3. But backup storage is not free either. So you’d probably need to add back 30-50% of the original data center in hw & sw costs for backups.
You certainly wouldn’t need as many admins. And power for backup storage should also be substantially less. So maybe your OpEx would only be 1.5X in total for the original PB and its backups.
But what could possibly back up a PB of data?
We were talking with Igneous at Cloud Field Day 8 (CFD8, see their video here) a couple of weeks back and they said they could and do backup PBs of data for customers today. A while back, e also talked with them on a GreyBeards on Storage podcast.
The problems with backing up a PB seem insurmountable. First you have to be able to scan a PB of data. This means looking into multiple file systems on many different hardware platforms, across potentially multiple data centers, and that’s just to get a baseline of what all needs to be backed up.
Then at some point you actually have to store all that data on backup storage. So, to gain some cost advantage, you’d want to compress and deduplicate a PB of data, so that the first full backup wouldn’t take a full PB of backup storage.
Then of course you have to transfer a PB of data to your backup storage, in something that wouldn’t take months to perform. And that just gets you the first full backup.
Next, comes the daily scan of what’s changed. This has to re-scan your PB of data to find that 100TB or so, that’s changed over the last 24 hrs. Sometime after that scan completes, then all that 100TB or so of changed data needs to be compressed, deduped and transferred again to backup storage
And if that’s not enough, you have to do it all over again, every day, from now on, almost forever. And data continues to grow. So 1PB today is likely to be 2PB of more in 12 months (it’s great to be in the storage business).
So those are the challenges. How can it be done, effectively, day in and day out, enough so that IT can depend on their data being backed up.
Igneous to the rescue…
First, Igneous came out of stealth a while back (listen to our podcast) with a couple of unique capabilities needed for massive data repository discovery and analysis. That is they built a unique engine to scan and index PB scale data repositories. This was so they couldd provide administrators better visibility into their PB scale data repositories. But this isn’t about that product, it’s about backup.
But some of the capabilities they needed to support that product helped them perform backups as well. For instance, their scan needed to handle PBs of data. They came up with AdaptiveSCAN, which didn’t use standard NFS and SMB data transfer protocols to gain access to file metadata. To open a file on NFS or SMB takes quite a lot of NFS or SMB transactions. But to access metadata only, one doesn’t have to use all those NFS and SMB capabilities, it can be done with much less overhead even when using NFS or SMB.
Of course having a way to scan Billions of files was a major accomplishment, but then where do you put all that metadata. And how can you access it effectively to support backup up a PB data repository. So they needed some serious data indexing capabilities and so came up with InfiniteINDEX
Now a trillion item index, seems a bit much, even for PB scale repositories. But my guess is they have eyes on taking their PB scale backups and going after even bigger fish,. That is offering backups for EB scale data repository. And that might just take a trillion item index
Next, there’s moving PB or even TB of data quickly is no small trick. As the development team at Igneous mostly came from unstructured data providers, they also understood and have access to APIs for most storage vendors (NetApp, Dell-EMC Isilon, Pure FlashBlade, Qumulo, etc.). As such, where available, they utilized those native vendor storage API calls to help them move data rather than having to Open an NFS or SMB file and Read it.
Of course, even doing all that, moving 100TBs of data around or scanning PB sized data repositories is going to take a lot of processing and IO bandwidth to do in a reasonable period of time.
So another capability they developed is massive parallelism. That is being able to distribute scan, indexing or data movement work, out to multiple systems. In that fashion it can be accomplished in significantly less wall clock time.
Well with all that, they pretty much had the guts of a backup application system for PB data repositories but they still didn’t have the glue to put it all together. But recently they announced just that a Igneous’s DataProtect, a full scale backup application for PB of data.
I suppose I haven’t done justice to all of what they have developed or talked about at their session, so I would suggest viewing their talk at CFD8 and listening to our GBoS podcast to learn more. They did demo their product at CFD8 but I believe it was a canned demo.
I didn’t think I’d see the day when some vendor would offer backup services for PBs of data let alone be shooting for more, but there you have it. Igneous means to take your PB scale data repositories and make them as easy to operate as TB scale data repositories. They call that democratizing data.
See these other CFD8 bloggers write ups on Igneous.
Picture credit(s): All from screen saves during Igneous’s session at CFD8