Cloud storage, CDP & deduplication

Strange Clouds by michaelroper (cc) (from Flickr)

Somebody needs to create a system that encompasses continuous data protection, deduplication and cloud storage.  Many vendors have various parts of such a solution, but none, to my knowledge, has put it all together.

Why CDP, deduplication and cloud storage?

We have written about cloud problems in the past (eventual data consistency and what’s holding back the cloud), but despite all that, backup is a killer app for cloud storage.  Many of us would like to keep backup data around for a very long time, but storage costs govern how long data can be retained.  Cloud storage, with its low cost/GB/month, can help minimize such concerns.

We have also blogged about dedupe in the past (describing dedupe) and have written in industry press and our own StorInt dispatches on dedupe product introductions/enhancements.  Deduplication can reduce storage footprint and works especially well for backup, which often saves the same data over and over again.  By combining deduplication with cloud storage we can reduce the data transferred to and stored in the cloud, minimizing costs even more.

CDP is more troublesome and yet still worthy of discussion.  Continuous data protection has always been something of a stepchild in the backup business.  As a technologist, I understand its limitations (application consistency) and understand why it has been unable to take off effectively (false starts).   But, in theory, at some point CDP will work, at some point CDP will use the cloud, at some point CDP will embrace deduplication, and when that happens it could be the start of an ideal backup environment.

Deduplicating CDP using cloud storage

Let me describe the CDP-Cloud-Deduplication appliance that I envision.  Whether through O/S, Hypervisor or storage (sub-)system agents, the system traps all writes (forks the write) and sends the data and meta-data in real time to another appliance.  Once in the CDP appliance, the data can be deduplicated and any unique data plus meta data can be packaged up, buffered, and deposited in the cloud.  All this happens in an ongoing fashion throughout the day.
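
The capture-and-deduplicate path could be sketched roughly as follows.  All names here are hypothetical illustrations, not any vendor's design; a real appliance would use content-defined (variable-size) chunking, a persistent fingerprint index, and an object-store client rather than fixed-size chunks and in-memory structures:

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # fixed-size chunks; real systems often use variable-size chunking

class CdpAppliance:
    """Receives forked writes, dedupes them, and stages unique chunks for the cloud."""
    def __init__(self):
        self.known_hashes = set()   # fingerprints of chunks already headed for the cloud
        self.upload_buffer = []     # unique (hash, chunk) pairs awaiting upload
        self.journal = []           # metadata stream: (volume, offset, [chunk hashes])

    def on_write(self, volume, offset, data):
        """Called for every forked write, in order, throughout the day."""
        hashes = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            h = hashlib.sha256(chunk).hexdigest()
            hashes.append(h)
            if h not in self.known_hashes:        # only never-seen data travels to the cloud
                self.known_hashes.add(h)
                self.upload_buffer.append((h, chunk))
        self.journal.append((volume, offset, hashes))  # metadata is always recorded

appliance = CdpAppliance()
appliance.on_write("vol1", 0, b"A" * CHUNK_SIZE * 2)          # two identical chunks
appliance.on_write("vol1", CHUNK_SIZE * 2, b"A" * CHUNK_SIZE)  # repeats the same data
```

Note that after these three chunks' worth of writes, only one unique chunk sits in the upload buffer while the metadata journal records all three, which is exactly the transfer reduction the combination promises.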

Sometime later, a restore is requested. The appliance looks up the appropriate mapping for the data being restored, issues requests to read the data from the cloud and reconstitutes (un-deduplicates) the data before copying it to the restoration location.
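
Restore is the reverse walk: replay the metadata stream up to the requested point, fetch each referenced chunk from the cloud, and reassemble.  A minimal sketch, with hypothetical names (`cloud_get` stands in for whatever object-store read interface the appliance uses):

```python
def restore(journal, cloud_get, upto):
    """Rebuild volume contents as of journal entry `upto` (exclusive).

    journal:   ordered list of (volume, offset, [chunk hashes])
    cloud_get: callable mapping a chunk hash to its bytes
    Returns {volume: bytearray}.  Writes are replayed in order, so later
    writes to the same offset win, yielding the state at that point in time.
    """
    volumes = {}
    for volume, offset, hashes in journal[:upto]:
        data = b"".join(cloud_get(h) for h in hashes)   # un-deduplicate
        buf = volumes.setdefault(volume, bytearray())
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(b"\x00" * (end - len(buf)))      # grow the volume image as needed
        buf[offset:end] = data
    return volumes

# e.g., with two writes to the same offset, upto=1 restores the older state,
# upto=2 the newer one
store = {"h1": b"old", "h2": b"new"}
journal = [("v", 0, ["h1"]), ("v", 0, ["h2"])]
```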


The problems with this solution include:

  • Application consistency
  • Data backup timeframes
  • Appliance throughput
  • Cloud storage throughput

By tying the appliance to a storage (sub-)system one may be able to get around some of these problems.

One could configure the appliance throughput to match the typical write workload of the storage.  This could provide an upper limit as to when the data is at least duplicated in the appliance but not necessarily backed up (pseudo backup timeframe).

As for throughput, if we could somehow understand the average write and deduplication rates we could configure the appliance and cloud storage pipes accordingly.  In this fashion, we could match appliance throughput to the deduplicated write workload (appliance and cloud storage throughput).
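
The sizing argument reduces to simple arithmetic: the appliance link must absorb the raw write rate, while the cloud pipe only needs to carry the deduplicated remainder.  The numbers below are purely illustrative, not measurements:

```python
def required_bandwidth(avg_write_mbps, dedupe_ratio):
    """Back-of-envelope pipe sizing for the appliance and cloud links.

    avg_write_mbps: average forked-write workload, in MB/s
    dedupe_ratio:   e.g. 10.0 means only 1/10th of the data is unique
    """
    appliance_mbps = avg_write_mbps              # must keep up with raw writes
    cloud_mbps = avg_write_mbps / dedupe_ratio   # only unique data goes out
    return appliance_mbps, cloud_mbps

# e.g., a 50 MB/s write workload with 10:1 dedupe needs only a 5 MB/s cloud pipe
appliance_link, cloud_link = required_bandwidth(50.0, 10.0)
```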

Application consistency is a more substantial concern.  For example, copying every write to a file doesn’t mean one can recover the file.  The problem is that at some point the file is actually closed, and that is the only time it is in an application-consistent state.  Recovering to a point before or after this leaves a partially updated, potentially corrupted file, of little use to anyone without major effort to transform it into a valid and consistent file image.

To provide application consistency, one needs to somehow understand when files are closed or applications quiesced.  Application consistency needs would argue for some sort of O/S or hypervisor agent rather than a storage (sub-)system interface.  Such an approach could be more cognizant of file closure or application quiesce, allowing a sync point to be inserted in the meta-data stream for the captured data.
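
Such a sync point could be as simple as a marker record interleaved with the write metadata; a restore then targets the newest marker at or before the requested time.  A sketch with hypothetical names (the hard part is the agent hook that actually detects file close or application quiesce, not the marker itself):

```python
import time

class MetadataStream:
    """Ordered stream of write records with interleaved consistency markers."""
    def __init__(self):
        self.records = []   # ("write", ts, meta) or ("sync", ts, label)

    def log_write(self, meta):
        self.records.append(("write", time.time(), meta))

    def mark_sync(self, label):
        """The O/S or hypervisor agent calls this on file close / app quiesce."""
        self.records.append(("sync", time.time(), label))

    def last_consistent_index(self):
        """Index of the newest sync marker: the safe restore point."""
        for i in range(len(self.records) - 1, -1, -1):
            if self.records[i][0] == "sync":
                return i
        return None  # no consistent point captured yet

stream = MetadataStream()
stream.log_write({"file": "db.dat", "offset": 0})
stream.mark_sync("db quiesced")
stream.log_write({"file": "db.dat", "offset": 4096})  # after the marker: not yet consistent
```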

Most backup software has long mastered application consistency through the use of application and/or O/S APIs/other facilities to synchronize backups to when the application or user community is quiesced.  CDP must take advantage of the same facilities.

Seems simple enough: tie cloud storage behind a CDP appliance that supports deduplication.  Something like this could be packaged up in a cloud storage gateway or similar appliance.  Such a system could be an ideal application for cloud storage and would make backups transparent and very efficient.

What do you think?

16 thoughts on “Cloud storage, CDP & deduplication”

  1. Hi

    Great idea.

    What you're describing sounds pretty close to what EMC's RecoverPoint appliance does today. Writes get forked, sequenced, compressed and sent over the wire — fairly agnostic to what kind of server, application and/or storage device. The product has been in the market a number of years and has a considerable installed base — as well as some ardent fans.

    You list a number of challenges, but I don't think you got the right ones. Application consistency (both within an application, and across applications) is reasonably well provided for. "Backup timeframes" really should be rephrased as having enough resources to get the writes off-site for a decent RPO. The appliances have decent throughput, and many customers use multiples. Back-end service provider bandwidth is just a matter of having enough horsepower. Network bandwidth is considerably reduced by collapsing and compressing writes, and — of course — it can run async further reducing bandwidth requirements.

    The implementation and management is relatively straightforward as well. I'm not aware of anyone who describes the environment as "troublesome" 🙂

    So what's the downside — other than the cost of the solution and associated bandwidth, capacity, etc.?

    Having enough network bandwidth around in the event you need to do a full recovery. Partial roll-backs aren't that much of a problem though. As a result, most implementations have some server capacity at the remote end to recover the app environment while the primary is being reconstructed. Very popular in VMware environments, BTW.

    We have a number of service providers (cloud?) that offer it as a remote service today, with more coming on-stream as time progresses. If you're interested in more information, please let me know …

    1. Chuck,

       Thanks for the comment. I was aware of EMC's RecoverPoint but thought it only did CDP and replication and didn't think it worked with cloud storage. Although, as you say, it is fairly agnostic as to what kind of target it uses.

       I think you are right that RPO is a better term than “backup timeframes,” and I think we agree that with proper sizing this can be adequately addressed.

       As for the full recovery, I hadn't considered that. It would require some alternative solution, and having a server set aside in a remote location to recover the app environment would do nicely. Although in the case of the appliance I envisioned, it would need to be at the cloud storage provider.

       I keep thinking the one thing holding back CDP is application consistency. But the backup and RecoverPoint guys have figured this out; if they could just help the rest of the CDP industry to understand this. And you're right again, “troublesome” was too strong a word to use.

       In my mind, the other key to network bandwidth and storage costs is deduplication. There are plenty of players that understand this (and EMC probably more than most, what with Avamar and Data Domain), but I don't see any product that puts CDP and deduplication together using cloud storage as its backend, just yet. If you know of any products or service offerings that do all this, please let me know.

       Ray

  2. Hi Ray, our cloud storage gateway product, the Nasuni Filer, does synchronous point-in-time snapshots of the file system every hour and deduplicates between versions and stores all the data in the cloud of your choosing. 90% of our customers are using the Nasuni Filer for primary storage for continuous offsite protection.


    1. Rob,

       Thanks for your comment. I knew that Nasuni snapshots the file systems but wasn't aware of the deduplication. Nonetheless, to my knowledge it does not provide a CDP solution for onsite storage.

       Regards,
       Ray

  3. Hi Ray, that's correct, the Nasuni Filer is primary storage backed by the cloud. Content is locally versioned and then moved back and forth to the cloud based on our caching algorithms. Since we cache both data and metadata, both past (snapshots) and present, you essentially have a local file system with point-in-time consistent snapshots of your active working data set, with the rest of the data scaling out forever to the cloud.

    Thanks again for your thoughts in this area,

  5. Hi Ray,

    Great article. You're right, it's a huge challenge getting a true 'cloud' CDP offering to market along the lines you are suggesting.

    My company offers an appliance-based CDP solution that backs up application-consistent images locally, then replicates to our cloud; however, without de-dupe it fails to tick that key box you're looking for.

    It's all about evolution, though: we have to grow through solutions like ours, Nasuni's and others to get to the utopia you seek, and that no doubt will become a reality in the next few years!


Comments are closed.