Cloud storage, CDP & deduplication

Strange Clouds by michaelroper (cc) (from Flickr)
Strange Clouds by michaelroper (cc) (from Flickr)

Somebody needs to create a system that encompasses continuous data protection, deduplication and cloud storage.  Many vendors have various parts of such a solution but none to my knowledge has put it all together.

Why CDP, deduplication and cloud storage?

We have written about cloud problems in the past (eventual data consistency and what’s holding back the cloud) despite all that, backup is a killer app for cloud storage.  Many of us would like to keep backup data around for a very long time. But storage costs govern how long data can be retained.  Cloud storage with its low cost/GB/month can help minimize such concerns.

We have also blogged about dedupe in the past (describing dedupe) and have written in industry press and our own StorInt dispatches on dedupe product introductions/enhancements.  Deduplication can reduce storage footprint and works especially well for backup which often saves the same data over and over again.  By combining deduplication with cloud storage we can reduce the data transfers and data stored on the cloud, minimizing costs even more.

CDP is more troublesome and yets still worthy of discussion.  Continuous data protection has always been sort of a step child in the backup business.  As a technologist, I understand it’s limitations (application consistency) and understand why it has been unable to take off effectively (false starts).   But, in theory at some point CDP will work, at some point CDP will use the cloud, at some point CDP will embrace deduplication and when that happens it could be the start of an ideal backup environment.

Deduplicating CDP using cloud storage

Let me describe the CDP-Cloud-Deduplication appliance that I envision.  Whether through O/S, Hypervisor or storage (sub-)system agents, the system traps all writes (forks the write) and sends the data and meta-data in real time to another appliance.  Once in the CDP appliance, the data can be deduplicated and any unique data plus meta data can be packaged up, buffered, and deposited in the cloud.  All this happens in an ongoing fashion throughout the day.

Sometime later, a restore is requested. The appliance looks up the appropriate mapping for the data being restored, issues requests to read the data from the cloud and reconstitutes (un-deduplicates) the data before copying it to the restoration location.


The problems with this solution include:

  • Application consistency
  • Data backup timeframes
  • Appliance throughput
  • Cloud storage throughput

By tieing the appliance to a storage (sub-)system one may be able to get around some of these problems.

One could configure the appliance throughput to match the typical write workload of the storage.  This could provide an upper limit as to when the data is at least duplicated in the appliance but not necessarily backed up (pseudo backup timeframe).

As for throughput, if we could somehow understand the average write and deduplication rates we could configure the appliance and cloud storage pipes accordingly.  In this fashion, we could match appliance throughput to the deduplicated write workload (appliance and cloud storage throughput)

Application consistency is more substantial concern.  For example, copying every write to a file doesn’t mean one can recover the file.  The problem is at some point the file is actually closed and that’s the only time it is in an application consistent state.  Recovering to a point before or after this, leaves a partially updated, potentially corrupted file, of little use to anyone without major effort to transform it into a valid and consistent file image.

To provide application consistency, one needs to somehow understand when files are closed or applications quiesced.  Application consistency needs would argue for some sort of O/S or hypervisor agent rather than storage (sub-)system interface.  Such an approach could be more cognizant of file closure or application quiesce, allowing a synch point could be inserted in the meta-data stream for the captured data.

Most backup software has long mastered application consistency through the use of application and/or O/S APIs/other facilities to synchronize backups to when the application or user community is quiesced.  CDP must take advantage of the same facilities.

Seems simple enough, tie cloud storage behind a CDP appliance that supports deduplication.  Something like this could be packaged up in a cloud storage gateway or similar appliance.  Such a system could be an ideal application for cloud storage and would make backups transparent and very efficient.

What do you think?