Pure Storage surfaces

1 controller X 1 storage shelf (c) 2011 Pure Storage (from their website)

We were talking with Pure Storage last week, another SSD startup, which just emerged from stealth mode today.  Somewhat like SolidFire, which we discussed a month or so ago, Pure Storage uses only SSDs to provide primary storage.  In this case, they support an FC front end with an all-SSD backend and implement internal data deduplication and compression to try to address the needs of enterprise tier 1 storage.

Pure Storage is in final beta testing with their product and plan to GA sometime around the end of the year.

Pure Storage hardware

Their system is built around MLC SSDs, which are available from many vendors, but with a strategic investment from Samsung they currently use that vendor’s drives.  As we know, MLC has write endurance limitations, but Pure Storage was designed from the ground up knowing they were going to use this technology and has built its IP to counteract these issues.

The system is available in one or two controller configurations, with an InfiniBand interconnect between the controllers, a 6Gbps SAS backend, 48GB of DRAM per controller for caching, and NV-RAM to ride out power outages.  Each controller has 12 cores supplied by two Intel Xeon processors.

With the first release they are limiting configurations to one or two controllers (the HA option), but their storage system is capable of clustering together many more, perhaps up to eight controllers over the InfiniBand backend.

Each storage shelf provides 5.5TB of raw storage using 2.5″ 256GB MLC SSDs.  It looks like each controller can handle up to two storage shelves, with the HA (dual controller) option supporting four drive shelves for up to 22TB of raw storage.

Pure Storage Performance

Although these numbers are not independently verified, the company says a single controller (with one storage shelf) can do 200K sustained 4K random read IOPS, 2GB/sec of bandwidth, 140K sustained write IOPS, or 500MB/sec of write bandwidth.  A dual controller system (with two storage shelves) can achieve 300K random read IOPS, 3GB/sec of bandwidth, 180K write IOPS, or 1GB/sec of write bandwidth.  They also claim they can do all this IO with under 1 msec of latency.

One of the things they pride themselves on is consistent performance.  They have built their storage such that they can deliver this consistent performance even under load conditions.

Given the number of SSDs in the system this isn’t screaming performance, but it is certainly up there with many enterprise-class systems sporting over 1,000 disks.  The random write performance is not bad considering this is MLC.  On the other hand, the sequential write bandwidth is probably their weakest spec and reflects their use of MLC flash.

Purity software

One key to Pure Storage (and SolidFire for that matter) is their use of inline data compression and deduplication. By using these techniques and basing their system storage on MLC, Pure Storage believes they can close the price gap between disk and SSD storage systems.

The problem with data reduction technologies is that not all environments benefit from them, and both techniques require lots of CPU power to perform well.  Pure Storage believes they have the horsepower (with 12 cores per controller) to support these services and are focusing their sales efforts on those environments (VMware, Oracle, and SQL Server) which have historically proven to be good candidates for data reduction.

In addition, they perform a lot of optimizations in their backend data layout to prolong the life of MLC storage. Specifically, they use a write chunk size that matches the underlying MLC SSDs’ page width so as not to waste endurance on partial-page writes.  They also migrate old data to new locations occasionally to maintain “data freshness”, which can be a problem with MLC storage if the data is not touched often enough.  There is probably other stuff as well, but essentially they are tuning their backend to optimize the endurance and performance of their SSD storage.
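
To make the page-width idea concrete, here is a minimal sketch (my own illustration, not Pure Storage's code; the 8KB page size and the program_page callback are assumptions) of coalescing host writes so the backend only ever issues full-page programs:

```python
PAGE = 8 * 1024   # assumed MLC flash page size; real devices vary

class PageAlignedWriter:
    """Coalesce incoming host writes into full flash pages so the backend
    never issues a partial-page program, which would waste endurance."""
    def __init__(self, program_page):
        self.program_page = program_page   # callable that programs one page
        self.buf = bytearray()

    def write(self, data: bytes):
        self.buf.extend(data)
        while len(self.buf) >= PAGE:                  # flush only whole pages
            self.program_page(bytes(self.buf[:PAGE]))
            del self.buf[:PAGE]
```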

Furthermore, they have created a new RAID 3D scheme, which adapts its parity layout to the number of available drives and protects against any dual SSD failure.  They provide triple parity: dual parity for drive failures and another parity for unrecoverable bit errors within a data payload.  In most cases, a failed drive will not trigger an immediate rebuild but rather a reconfiguration of data and parity to accommodate the failing drive, with its contents rebuilt onto new drives over time.

At the moment, they don’t have snapshots or data replication but they said these capabilities are on their roadmap for future delivery.

—-

In the meantime, all-SSD storage systems seem to be coming out of the woodwork. We mentioned SolidFire, but WhipTail is another, and I am sure there are plenty more in stealth waiting for the right moment to emerge.

I was at a conference about two months ago where I predicted that all-SSD systems would emerge needing little of the engineering development that went into the storage systems of yore. Based on the performance available from a single SSD, one wouldn’t need 100s of SSDs to generate 100K IOPS or more.  Pure Storage is doing this level of IO with only 22 MLC SSDs and a high-end, but essentially off-the-shelf, controller.

Just imagine what one could do if you threw some custom hardware at it…

Comments?

Tape vs. Disk, the saga continues

Inside a (Spectra Logic) T950 library by ChrisDag (cc) (from Flickr)

Was on a call late last month where Oracle introduced their latest generation T10000C tape system (media and drive) holding 5TB native (uncompressed) capacity. In the last 6 months I have been hearing about the coming of a 3TB SATA disk drive from Hitachi GST and others. And last month, EMC announced a new Data Domain Archiver, a disk-only archive appliance (see my post on EMC Data Domain products enter the archive market).

Oracle assures me that tape density is keeping up with, if not gaining on, disk density trends and capacity. But density and capacity are not the only issues causing data to move off of tape in today’s enterprise data centers.

“Dedupe Rulz”

A problem with the data density discussion is that it’s one dimensional (well, literally it’s two dimensional). With data compression, disk or tape systems can easily double the density on a piece of media. But with data deduplication, the multiples become more like 5X to 30X depending on the frequency of full backups or the amount of duplicated data. And numbers like those dwarf any discussion of density ratios and, as such, get everyone’s attention.
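
A back-of-envelope example (illustrative numbers only, my own assumptions) shows why the dedupe multiple dwarfs the compression ratio:

```python
# 12 retained weekly fulls of a 10 TB data set with a 5% weekly change rate
fulls, size_tb, change = 12, 10.0, 0.05

raw        = fulls * size_tb                            # 120 TB sent to backup
compressed = raw / 2                                    # ~2:1 from compression alone
unique     = size_tb + (fulls - 1) * size_tb * change   # first full + changed data only

print(f"compression: {raw / compressed:.0f}:1, dedupe: {raw / unique:.1f}:1")
# -> 2:1 vs roughly 8:1 here; longer retention or more frequent fulls push
#    the dedupe multiple well into the 10-30X range
```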

I can remember talking to an avowed tape engineer years ago, and he described deduplication technology at the VTL level as architecturally impure and inefficient. From his perspective it needed to be done much earlier in the data flow. But what he failed to see was the ability of VTL deduplication to be plug-compatible with the tape systems of that time. Such ease of adoption allowed deduplication systems to build a beachhead and economies of scale. From there such systems have now been able to move upstream, into earlier stages of the backup data flow.

Nowadays, with Avamar, Symantec PureDisk and others, source-level deduplication, or close-to-source deduplication, is a reality. But all this came about because they were able to offer 30X the density on a piece of backup storage.

Tape’s next step

Tape could easily fight back. All that would be needed is some system in front of a tape library that provided deduplication capabilities not just to the disk media but the tape media as well. This way the 30X density over non-deduplicated storage could follow through all the way to the tape media.

In the past, this made little sense because a deduplicated tape could require multiple volumes in order to restore a particular set of data. However, with today’s 5TB on a tape, maybe this no longer has to be the case. In addition, a deduplication system in front of the tape library could support most of the immediate restore activity, while restoring data from tape would be more like pulling something out of an archive and, as such, might take longer to perform. In any event, with LTO’s multi-partitioning and the other enterprise-class tapes having multiple domains, creating a structure with a metadata partition and a data partition is easier than ever.

“Got Dedupe”

There are plenty of places where today’s tape vendors can obtain deduplication capabilities. Permabit offers dedupe code for OEM applications for those that have no dedupe systems today. FalconStor, Sepaton and others offer deduplication systems that can be OEMed. IBM, HP, and Quantum already have tape libraries and their own dedupe systems available today, all of which could readily support a deduplicating front end to their tape libraries, if they don’t already.

Where “Tape Rulz”

There are places where data deduplication doesn’t work very well today, mainly rich media, physics, biopharma and other non-compressible big-data applications. For these situations, tape still has a home, but for the rest of the data center world, deduplication is taking over, if it hasn’t already. The sooner tape gets on the deduplication bandwagon the better for the IT industry.

—-

Of course there are other problems hurting tape today. I know of at least one large conglomerate that has moved all backup off tape altogether, even data which doesn’t deduplicate well (see my previous Oracle RMAN posts). And at least another rich media conglomerate that is considering the very same move. For now, tape has a safe harbor in big science, but it won’t last long.

Comments?

Oracle RMAN and data deduplication – part 2

Insight01C 0011 by watz (cc) (from Flickr)

I have blogged before about the poor deduplication ratios seen when using Oracle 10G RMAN compression (see my prior post), but not everyone uses compressed backupsets.  As such, the question naturally arises as to how well RMAN non-compressed backupsets deduplicate.

RMAN backup types

Oracle 10G RMAN supports both full and incremental backups.  The main potential for deduplication comes when using full backups.  However, 10G also supports RMAN cumulative incremental backups in addition to the more normal differential incrementals.  Cumulative incrementals back up all changes since the last full and, as such, readily re-copy many changes that occur between full backups, also leading to higher deduplication rates.
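
A toy illustration of the difference (my own sketch, not RMAN internals; block change tracking is reduced to a dictionary of change SCNs):

```python
def blocks_to_back_up(changed, last_full_scn, last_backup_scn, cumulative):
    """Differential incrementals copy blocks changed since the last backup,
    cumulative incrementals copy everything changed since the last level-0 full."""
    baseline = last_full_scn if cumulative else last_backup_scn
    return {blk for blk, scn in changed.items() if scn > baseline}

changed = {"blk1": 110, "blk2": 150, "blk3": 205}   # block -> SCN of last change

print(blocks_to_back_up(changed, last_full_scn=100, last_backup_scn=200,
                        cumulative=False))           # {'blk3'}
print(blocks_to_back_up(changed, last_full_scn=100, last_backup_scn=200,
                        cumulative=True))            # all three blocks again
```

Because each cumulative run re-copies blk1 and blk2 until the next full, a downstream dedupe appliance sees those blocks repeatedly and can fold them away.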

RMAN multi-threading

In any event, the other issue with RMAN backups is Oracle’s ability to multi-thread or multiplex backup data. This capability was originally designed to keep tape drives busy and streaming when backing up data.  But the problem with file multiplexing is that file data is intermixed with blocks from other files within a single backup stream, losing context and potentially reducing deduplication ability.  Luckily, 10G RMAN file multiplexing can be disabled by setting FILESPERSET=1, telling Oracle to place only a single file in each backup set.
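
A small simulation (my own sketch; the timing-dependent interleave is just a stand-in for what multiplexing does to the stream) shows why FILESPERSET=1 helps a fixed-chunk dedupe appliance:

```python
import hashlib, random

def chunk_hashes(stream: bytes, size=4096):
    return {hashlib.sha1(stream[i:i + size]).hexdigest()
            for i in range(0, len(stream), size)}

random.seed(0)
fileA = bytes(random.getrandbits(8) for _ in range(64 * 1024))   # two unchanged
fileB = bytes(random.getrandbits(8) for _ in range(64 * 1024))   # datafiles, backed up twice

def multiplexed(seed):
    """Interleave 1 KB pieces of A and B in a (simulated) timing-dependent order."""
    rng, out, a, b = random.Random(seed), [], 0, 0
    while a < len(fileA) or b < len(fileB):
        if b >= len(fileB) or (a < len(fileA) and rng.random() < 0.5):
            out.append(fileA[a:a + 1024]); a += 1024
        else:
            out.append(fileB[b:b + 1024]); b += 1024
    return b"".join(out)

night1 = night2 = fileA + fileB           # FILESPERSET=1: identical streams both nights
m1, m2 = multiplexed(1), multiplexed(2)   # multiplexed: interleave order differs per run

print(len(chunk_hashes(night1) & chunk_hashes(night2)), "of",
      len(chunk_hashes(night1)), "chunks dedupe with FILESPERSET=1")
print(len(chunk_hashes(m1) & chunk_hashes(m2)), "of",
      len(chunk_hashes(m1)), "chunks dedupe when multiplexed")
```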

Oracle’s use of metadata in RMAN backups also makes them more difficult to deduplicate, but some vendors provide workarounds to increase RMAN deduplication (see Quantum DXi, EMC Data Domain and others).

—-

So deduplication of RMAN backups will vary depending on vendor capabilities as well as the admin’s RMAN backup specifications.  As such, to obtain the best data deduplication of RMAN backups, follow deduplication vendor best practices, use periodic full and/or cumulative incremental backups, don’t use compressed backupsets, and set FILESPERSET=1.

Comments?

Top 10 storage technologies over the last decade

Aurora's Perception or I Schrive When I See Technology by Wonderlane (cc) (from Flickr)

Some of these technologies were in development prior to 2000, some were available in other domains but not in storage, and some were in a few subsystems but had yet to become popular as they are today.  In no particular order here are my top 10 storage technologies for the decade:

  1. NAND based SSDs – DRAM and other technology solid state drives (SSDs) were available last century but over the last decade NAND Flash based devices have dominated SSD technology and have altered the storage industry forever more.  Today, it’s nigh impossible to find enterprise class storage that doesn’t support NAND SSDs.
  2. GMR heads – Giant Magneto Resistance disk heads have become commonplace over the last decade and have allowed disk drive manufacturers to double data density every 18-24 months.  Now GMR heads are starting to transition over to tape storage and will enable that technology to increase data density dramatically.
  3. Data Deduplication – Deduplication technologies emerged over the last decade as a complement to higher-density disk drives, as a means to more efficiently back up data.  Deduplication technology can be found in many different forms today, ranging from file and block storage systems to backup storage systems to backup-software-only solutions.
  4. Thin provisioning – No one would argue that thin provisioning emerged last century but it took the last decade to really find its place in the storage pantheon.  One almost cannot find a data center class storage device that does not support thin provisioning today.
  5. Scale-out storage – Last century if you wanted to get higher IOPS from a storage subsystem you could add cache or disk drives but at some point you hit a subsystem performance wall.  With scale-out storage, one can now add more processing elements to a storage system cluster without having to replace the controller to obtain more IO processing power.  The link reference talks about the use of commodity hardware to provide added performance but scale-out storage can also be done with non-commodity hardware (see Hitachi’s VSP vs. VMAX).
  6. Storage virtualization – Server virtualization has taken off as the dominant data center paradigm over the last decade, but its counterpart in storage has also become more viable.  Storage virtualization was originally used to migrate data from old subsystems to new storage but today can be used to manage and migrate data over PBs of physical storage, dynamically optimizing data placement for cost and/or performance.
  7. LTO tape – When IBM dominated IT in the mid-to-late last century, the tape format du jour always matched IBM’s tape technology.  As the decade dawned, IBM was no longer the dominant player and tape technology was starting to diverge into a babble of differing formats.  As a result, IBM, Quantum, and HP put their technology together and created a standard tape format, called LTO, which has become the new dominant tape format for the data center.
  8. Cloud storage – It’s unclear just when over the last decade cloud storage emerged, but it seemed to be a supplement to cloud computing, which also appeared this past decade.  Storage service providers had existed earlier but, due to bandwidth limitations and storage costs, didn’t survive the dotcom bubble. Over this past decade both bandwidth and storage costs have come down considerably, and cloud storage has now become a viable technological solution to many data center issues.
  9. iSCSI – SCSI has taken on many forms over the last couple of decades, but iSCSI has altered the dominant block storage paradigm from a single, pure FC-based SAN to a plurality of technologies.  Nowadays, SMB shops can have block storage without the cost and complexity of FC SANs, over the LAN networking technology they already use.
  10. FCoE – One could argue that this technology is still maturing today, but once again SCSI has opened up another way to access storage. FCoE has the potential to offer all the robustness and performance of FC SANs over data center Ethernet hardware, simplifying and unifying data center networking onto one technology.

No doubt others would differ on their top 10 storage technologies over the last decade, but I strove to pick technologies that significantly changed data storage between 2000 and today.  These 10 seemed to me to fit the bill better than most.

Comments?

Why Bus-Tech, why now – Mainframe/System z data growth

Z10 by Roberto Berlim (cc) (from Flickr)

Yesterday, EMC announced the purchase of Bus-Tech, their partner in mainframe or System z attachment for the Disk Library Mainframe (DLm) product line.

The success of open systems mainframe attach products based on Bus-Tech or competing technology is subject to some debate, but it’s the only inexpensive way to bring such functionality to mainframes.  The other, more expensive approach is to build System z attach directly into the storage system’s hardware and software.

Most mainframers know that FC and FICON (the System z storage interface) utilize the same underlying transport technology.  However, FICON has a few crucial differences when it comes to data integrity, device commands and other nuances, which make easy interoperability more of a challenge.

But all that just covers the underlying hardware; when you factor in disk layout (CKD), tape formats, and disk and tape commands (CCWs), System z interoperability can become quite an undertaking.

Bus-Tech’s virtual tape library maps mainframe tape/tape library commands and FICON protocols into standard FC and tape SCSI command sets. This way one could theoretically attach anybody’s open system tape or virtual tape system onto System z.  Looking at Bus-Tech’s partner list, there were quite a few organizations including Hitachi, NetApp, HP and others aside from EMC using them to do so.

Surprise – Mainframe data growth

Why is there such high interest in mainframes? Mainframe data is big and growing, in some markets almost at open systems/distributed systems growth rates.  I always thought mainframes made better use of data storage, had better utilization, and controlled data growth better.  However, this can only delay growth; it can’t stop it.

Although I have no hard numbers to back up my mainframe data market or growth rates, I do have anecdotal evidence.  I was talking with an admin at one big financial firm a while back and he casually mentioned they had 1.5PB of mainframe data storage under management!  I didn’t think this was possible – he replied that not only was it possible, he was certain they weren’t the largest in their vertical/East coast area by any means.

Ok, so mainframe data is big and needs lots of storage, but this also means that mainframe backup needs storage as well.

Surprise 2 – dedupe works great on mainframes

Which brings us back to EMC DLm and their deduplication option.  Recently, EMC announced a deduplication storage target for disk library data as an alternative to their previous CLARiiON target.  This just happens to be a Data Domain 880 appliance behind a DLm engine.

Another surprise, data deduplication works great for mainframe backup data.  It turns out that z/OS users have been doing incremental and full backups for decades.  Obviously, anytime some system uses full backups, dedupe technology can reduce storage requirements substantially.

I talked recently with Tom Meehan at Innovation Data Processing, creators of FDR, one of only two remaining mainframe backup packages (the other being IBM DFSMShsm).  He reiterated that deduplication works just fine on mainframes, assuming you can separate the metadata from the actual backup data.

System z and distributed systems

In the meantime, this past July IBM announced the zBX, a System z Blade eXtension hardware system which incorporates POWER7 blade servers running AIX into and under System z management and control.  As such, the zBX brings some of the reliability and availability of System z to the AIX open systems environment.

IBM had already supported Linux on System z but that was just a software port.  With zBX, System z could now support open systems hardware as well.  Where this goes from here is anybody’s guess but it’s not a far stretch to talk about running x86 servers under System z’s umbrella.

—-

So there you have it: Bus-Tech is the front end of the EMC DLm system.  As such, it made logical sense, if EMC was going to focus more resources on the mainframe dedupe market, to lock up Bus-Tech, a critical technology partner.  Also, given market valuations these days, perhaps the opportunity was too good to pass up.

However, this now leaves Luminex as the last standing independent vendor to provide mainframe attach for open systems.  Luminex and EMC Data Domain already have a “meet-in-the-channel” model to sell low-end deduplication appliances to the mainframe market.  But with the Bus-Tech acquisition we see this slowly moving away and current non-EMC Bus-Tech partners migrating to Luminex or abandoning the mainframe attach market altogether.

[I almost spun up a whole section on CCWs, CKD and other mainframe I/O oddities but it would have detracted from this post’s main topic.  Perhaps, another post will cover mainframe IO oddities, stay tuned.]

Poor deduplication with Oracle RMAN compressed backups

Oracle offices by Steve Parker (cc) (from Flickr)

I was talking with one large enterprise customer today and he was lamenting how poorly Oracle RMAN compressed backupsets dedupe. Apparently, non-compressed RMAN backupsets generate anywhere from 20 to 40:1 deduplication ratios, but when they use RMAN backupset compression, their deduplication ratios drop to 2:1.  Given that RMAN compression probably only adds another 2:1 reduction, the overall data reduction ends up around 4:1.

RMAN compression

It turns out Oracle RMAN supports two different compression algorithms: zlib (or gzip) and bzip2.  I assume the default is zlib; if desired, one can specify bzip2 for even higher compression rates with the commensurately slower, more processor-intensive compression activity.

  • Zlib uses pretty standard repeated-string elimination followed by Huffman coding, which uses shorter bit strings to represent more frequent characters and longer bit strings to represent less frequent characters.
  • Bzip2 also finishes with Huffman coding but only after a number of other transforms: run-length encoding (changing runs of duplicated characters into a count:character sequence), the Burrows–Wheeler transform (reorders the data stream so that repeating characters come together), a move-to-front transform (replaces each symbol with its recency rank so repeated symbols become small values), another run-length encoding step, and then the Huffman encoding, followed by another couple of steps to shrink the data even more…

The net of all this is that a block of data that is bzip2 encoded may look significantly different if even one character is changed.  Similarly, even zlib-compressed data will look different after a single character insertion, though perhaps not as much.  This depends on the character and where it’s inserted, but even if the new character doesn’t change the Huffman encoding tree, adding a few bits to a data stream will necessarily alter its byte groupings significantly downstream from that insertion. (See Huffman coding to learn more.)
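
A quick experiment with Python's zlib module (my own illustration, not RMAN's actual codepath) makes the point: insert one byte and see how little of the compressed stream remains identical.

```python
import zlib

base = bytes(range(256)) * 256                   # 64 KB of compressible sample data
modified = base[:1000] + b"X" + base[1000:]      # a single inserted character

a, b = zlib.compress(base), zlib.compress(modified)
same_prefix = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y),
                   min(len(a), len(b)))
print(f"compressed streams agree for only {same_prefix} of {len(a)} bytes")
# everything downstream of the insertion shifts, so a dedupe engine sees "new" data
```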

Deduplicating RMAN compressed backupsets

Sub-block level deduplication often depends on seeing the same sequence of data that may be skewed or shifted by one to N bytes between two data blocks.  But as discussed above, with bzip2 or zlib (or any huffman encoded) compression algorithm the sequence of bytes looks distinctly different downstream from any character insertion.
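
That shift sensitivity is exactly what content-defined (variable-length) chunking addresses: cut points are chosen by content rather than by offset, so a single-byte insertion only disturbs the chunk it lands in. A toy sketch (my own, not any vendor's algorithm; the window size and mask are arbitrary assumptions):

```python
import hashlib, random, zlib

WINDOW, MASK, MIN_CHUNK = 48, 0x0FFF, 512     # roughly 4 KB average chunk size

def cdc_chunks(data: bytes):
    """Cut wherever the CRC of the trailing WINDOW bytes has its low 12 bits zero."""
    chunks, start = [], 0
    for i in range(len(data)):
        if (i + 1 - start >= MIN_CHUNK
                and (zlib.crc32(data[i + 1 - WINDOW:i + 1]) & MASK) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fingerprints(data: bytes):
    return {hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)}

random.seed(0)
original = bytes(random.getrandbits(8) for _ in range(128 * 1024))
shifted = original[:5000] + b"X" + original[5000:]      # one-byte insertion

fa, fb = fingerprints(original), fingerprints(shifted)
print(f"chunks still shared after the insert: {len(fa & fb)} of {len(fa)}")
# most chunks survive; only the chunk containing the insertion changes
```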

One way to obtain decent deduplication rates from RMAN compressed backupsets would be to decompress the data at the dedupe appliance and then run the deduplication algorithm on it – but dedupe appliance ingestion rates would suffer accordingly.  Another approach is not to use RMAN compressed backupsets at all, but the advantages of compression are very appealing: less network bandwidth, faster backups (because they are not transferring as much data), and quicker restores.

Oracle RMAN OST

On the other hand, what might work is some form of Data Domain OST/Boost-like support from Oracle RMAN, which would partially deduplicate the data at the RMAN server and then send the deduplicated stream to the dedupe appliance.  This would consume less network bandwidth and provide faster backups but may not do anything for restores.  Perhaps a tradeoff worth investigating.

As for the likelihood that Oracle would make such services available to deduplication vendors, I would have said this was unlikely, but ultimately the customers have a say here.   It’s unclear why Symantec created OST, but it turned out to be a money maker for them, and something similar could be supported by Oracle.  Once an Oracle RMAN OST-like capability was in place, it shouldn’t take much to provide Boost functionality on top of it.  (Although EMC Data Domain is the only dedupe vendor that has Boost today, for OST and for their own NetWorker.)

—-

When I first started this post I thought that if the dedupe vendors just understood the format of the RMAN compressed backupsets they would be able to have the same dedupe ratios as seen for normal RMAN backupsets.  As I investigated the compression algorithms being used I became convinced that it’s a computationally “hard” problem to extract duplicate data from RMAN compressed backupsets and ultimately would probably not be worth it.

So, if you use RMAN backupset compression, you probably ought to avoid deduplicating this data for now.

Anything I missed here?

Data storage features for virtual desktop infrastructure (VDI) deployments

The Planet Data Center by The Planet (cc) (from Flickr)

Was talking with someone yesterday about one of my favorite topics, data storage for virtual desktop infrastructure (VDI) deployments.  In my mind there are a few advanced storage features that help considerably with VDI implementations:

  • Deduplication – almost every one of your virtual desktops will share 75-90% of its O/S disk data with every other virtual desktop.  Having sub-file/sub-block deduplication can be a godsend for all this replicated data and can reduce O/S storage requirements considerably.
  • Zero-storage snapshots/clones – another solution to the duplication of O/S data is to use some sort of space-conserving snapshot.  For example, one creates a master (gold) disk image and makes 100s if not 1000s of snapshots of it, taking almost no additional space (see the copy-on-write sketch after this list).
  • Highly available/highly reliable storage – when you have a lone desktop dependent on DAS for its O/S, it doesn’t impact a lot of users if that device fails. However, when you have 100s to 1000s of users dependent on DAS device(s) for their O/S software, any DAS failure could impact all of them at the same time.  As such, one needs to move off DAS and invest in highly reliable and available external storage of some kind to sustain reasonable uptime for your user community.
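
Here is a minimal copy-on-write sketch of the space-conserving clone idea from the list above (purely illustrative, not any vendor's implementation): every desktop shares the gold master's blocks and only pays for the blocks it overwrites.

```python
class GoldImageClone:
    """A clone shares the gold image's blocks; writes allocate private copies."""
    def __init__(self, gold_blocks):
        self.gold = gold_blocks      # shared, read-only gold O/S image
        self.delta = {}              # block number -> this desktop's private copy

    def read(self, blk):
        return self.delta.get(blk, self.gold[blk])

    def write(self, blk, data):
        self.delta[blk] = data       # first write to a block consumes new space

gold = [bytes(4096) for _ in range(1024)]                 # 4 MB toy gold image
desktops = [GoldImageClone(gold) for _ in range(500)]     # 500 "zero-storage" clones
desktops[0].write(42, b"user profile".ljust(4096, b"\0"))

print(sum(len(d.delta) for d in desktops), "private blocks across 500 desktops")  # 1
```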

Those seem to me to be the most important attributes for VDI storage, but there are a couple more features/facilities which can also help:

  • NAS systems with NFS – VDI deployments will generate lots of VMDKs for all the user desktop C: drives.  Although this can be managed with block-level storage as separate LUNs or multi-VMDK LUNs, who wants to configure 100 to 1,000 LUNs?  NFS files can perform just as well and are much easier to create on the fly, and thus, for VDI, it’s hard to beat NFS storage.
  • Boot storm enhancements – Another problem with VDI is that everyone gets to work at 8am Monday and proceeds to boot up their (virtual) machines, which drives an awful lot of IO to their virtual C: drives.  Deduplication and zero-storage snapshots can help manage the boot storm as long as these characteristics are retained throughout the system cache, i.e., deduplication exists in cache as well as on backend disk.  But there are other approaches to the problem as well, available from various vendors, to better manage boot storms.
  • Anti-virus scan enhancements – Similar to boot storms, A-V scans also typically happen around the same time for many desktop users and can be just as bad for virtual C: drive performance.  Again, deduplication or zero-storage snapshots can help (with the above caveats), but some vendor storage can offload these activities from the desktop altogether.  Also, last week’s VMworld release of VMware’s vShield Edge (see VMworld 2010 review) supports some A-V scan enhancements. Any of these approaches should be able to help.

Regular “dumb” block storage will always work but it will require a lot more raw storage, performance will suffer just when everybody gets back to work, and the administrative burden will be much higher.

I may seem biased, but enterprise-class reliability and availability with some of the advanced storage features described above can help make your deployment of VDI that much better for you and all your knowledge workers.

Anything I missed?

Cloud storage, CDP & deduplication

Strange Clouds by michaelroper (cc) (from Flickr)

Somebody needs to create a system that encompasses continuous data protection, deduplication and cloud storage.  Many vendors have various parts of such a solution but none to my knowledge has put it all together.

Why CDP, deduplication and cloud storage?

We have written about cloud problems in the past (eventual data consistency and what’s holding back the cloud); despite all that, backup is a killer app for cloud storage.  Many of us would like to keep backup data around for a very long time, but storage costs govern how long data can be retained.  Cloud storage, with its low cost/GB/month, can help minimize such concerns.

We have also blogged about dedupe in the past (describing dedupe) and have written in industry press and our own StorInt dispatches on dedupe product introductions/enhancements.  Deduplication can reduce storage footprint and works especially well for backup which often saves the same data over and over again.  By combining deduplication with cloud storage we can reduce the data transfers and data stored on the cloud, minimizing costs even more.

CDP is more troublesome and yet still worthy of discussion.  Continuous data protection has always been sort of a stepchild in the backup business.  As a technologist, I understand its limitations (application consistency) and understand why it has been unable to take off effectively (false starts).   But, in theory, at some point CDP will work, at some point CDP will use the cloud, at some point CDP will embrace deduplication, and when that happens it could be the start of an ideal backup environment.

Deduplicating CDP using cloud storage

Let me describe the CDP-Cloud-Deduplication appliance that I envision.  Whether through O/S, hypervisor or storage (sub-)system agents, the system traps all writes (forks the write) and sends the data and metadata in real time to another appliance.  Once in the CDP appliance, the data can be deduplicated, and any unique data plus metadata can be packaged up, buffered, and deposited in the cloud.  All this happens in an ongoing fashion throughout the day.
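
A skeletal sketch of that flow (my own, with a generic cloud_put callable standing in for whatever object-storage client would actually be used):

```python
import hashlib, json, time

class CdpDedupeCloudAppliance:
    """Trap every write, keep only unique chunks, push data + metadata to the cloud."""
    def __init__(self, cloud_put):
        self.cloud_put = cloud_put   # callable(key, payload) -> cloud object store
        self.seen = set()            # fingerprints of chunks already in the cloud
        self.journal = []            # ordered write metadata for later restores

    def trap_write(self, volume, offset, data):
        fp = hashlib.sha256(data).hexdigest()
        if fp not in self.seen:                      # deduplicate before any transfer
            self.seen.add(fp)
            self.cloud_put(f"chunk/{fp}", data)
        self.journal.append({"ts": time.time(), "vol": volume,
                             "off": offset, "fp": fp})

    def checkpoint(self, tag):
        """A synch point (file close / application quiesce) usable for consistent restores."""
        self.cloud_put(f"journal/{tag}", json.dumps(self.journal).encode())

# usage: appliance = CdpDedupeCloudAppliance(cloud_put=my_object_store.put)
```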

Sometime later, a restore is requested. The appliance looks up the appropriate mapping for the data being restored, issues requests to read the data from the cloud and reconstitutes (un-deduplicates) the data before copying it to the restoration location.

Problems?

The problems with this solution include:

  • Application consistency
  • Data backup timeframes
  • Appliance throughput
  • Cloud storage throughput

By tying the appliance to a storage (sub-)system one may be able to get around some of these problems.

One could configure the appliance throughput to match the typical write workload of the storage.  This could provide an upper limit as to when the data is at least duplicated in the appliance but not necessarily backed up (pseudo backup timeframe).

As for throughput, if we could somehow understand the average write and deduplication rates, we could configure the appliance and cloud storage pipes accordingly.  In this fashion, we could match appliance throughput to the deduplicated write workload (addressing both appliance and cloud storage throughput).
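
The sizing arithmetic is simple once those rates are known; all the figures below are illustrative assumptions only:

```python
write_mb_s   = 200     # average host write rate trapped by the appliance
dedupe_ratio = 10      # only 1/10th of what is written turns out to be unique
compression  = 2       # further ~2:1 compression before upload

cloud_mb_s = write_mb_s / (dedupe_ratio * compression)
print(f"cloud pipe needed: ~{cloud_mb_s:.0f} MB/s for a {write_mb_s} MB/s write load")
```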

Application consistency is a more substantial concern.  For example, copying every write to a file doesn’t mean one can recover the file.  The problem is that at some point the file is actually closed, and that’s the only time it is in an application-consistent state.  Recovering to a point while the file was mid-update leaves a partially updated, potentially corrupted file, of little use to anyone without major effort to transform it into a valid and consistent file image.

To provide application consistency, one needs to somehow understand when files are closed or applications quiesced.  Application consistency needs would argue for some sort of O/S or hypervisor agent rather than a storage (sub-)system interface.  Such an approach could be more cognizant of file closure or application quiesce, allowing a synch point to be inserted in the metadata stream for the captured data.

Most backup software has long mastered application consistency through the use of application and/or O/S APIs/other facilities to synchronize backups to when the application or user community is quiesced.  CDP must take advantage of the same facilities.

Seems simple enough: put cloud storage behind a CDP appliance that supports deduplication.  Something like this could be packaged up in a cloud storage gateway or similar appliance.  Such a system could be an ideal application for cloud storage and would make backups transparent and very efficient.

What do you think?