I have blogged about the poor deduplication ratios seen when using Oracle 10G RMAN compression before (see my prior post), but not everyone uses compressed backupsets. As such, the question naturally arises as to how well RMAN non-compressed backupsets deduplicate.
RMAN backup types
Oracle 10G RMAN supports both full and incremental backups. The main potential for deduplication comes from full backups. However, 10G also supports cumulative incremental backups in addition to the more common differential incremental backups. A cumulative incremental backs up all changes since the last full (level 0) backup and, as such, will re-copy many of the same changed blocks from one cumulative backup to the next, which also leads to higher deduplication rates.
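For illustration, here is a minimal sketch of the two styles at the RMAN prompt (commands only; channel allocation, tags and retention settings are omitted and should follow your own environment's conventions):

# periodic full backup taken as a level 0 incremental (the baseline)
BACKUP INCREMENTAL LEVEL 0 DATABASE;

# cumulative incremental: everything changed since the last level 0
BACKUP INCREMENTAL LEVEL 1 CUMULATIVE DATABASE;

# differential incremental (the default): only changes since the last level 0 or 1
BACKUP INCREMENTAL LEVEL 1 DATABASE;

Because each cumulative incremental re-copies all blocks changed since the last level 0, successive cumulative backups overlap heavily with one another, which is exactly the sort of repetition a dedupe appliance can exploit.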
RMAN multi-threading
The other issue with RMAN backups is Oracle’s ability to multi-thread or multiplex backup data. This capability was originally designed to keep tape drives busy and streaming when backing up data. But the problem with file multiplexing is that a file’s data is intermixed with blocks from other files within a single backup stream, losing file context and potentially reducing deduplication. Luckily, 10G RMAN file multiplexing can be disabled by setting FILESPERSET=1, telling Oracle to place only a single datafile in each backup set, and hence in each data stream.
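A sketch of how this might be specified (FILESPERSET is the documented BACKUP option referenced above; the channel setting shown is illustrative and should follow your deduplication vendor's guidance):

# one datafile per backup set, so each backup stream carries a single file's blocks
BACKUP DATABASE FILESPERSET 1;

# the channel-level MAXOPENFILES parameter also caps multiplexing and can be persisted
CONFIGURE CHANNEL DEVICE TYPE DISK MAXOPENFILES 1;

The effective multiplexing level is governed by both FILESPERSET and MAXOPENFILES, so setting either (or both) to 1 keeps RMAN from interleaving blocks from different datafiles.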
Oracle’s use of metadata in RMAN backups also makes them more difficult to deduplicate, but some vendors provide workarounds to increase RMAN deduplication (see Quantum DXi, EMC Data Domain and others).
—-
So deduplication of RMAN backups will vary depending on vendor capabilities as well as on how the admin specifies the RMAN backups. As such, to obtain the best data deduplication of RMAN backups: follow your deduplication vendor's best practices, use periodic full and/or cumulative incremental backups, don't use compressed backupsets, and set FILESPERSET=1.
I was talking with one large enterprise customer today and he was lamenting how poorly Oracle RMAN compressed backupsets dedupe. Apparently, non-compressed RMAN backupsets generate anywhere from 20:1 to 40:1 deduplication ratios, but when they use RMAN backupset compression, their deduplication ratios drop to around 2:1. Given that RMAN compression itself probably only adds another 2:1 reduction, the overall data reduction works out to roughly 4:1 (2:1 from deduplication times 2:1 from compression), far short of the uncompressed case.
RMAN compression
It turns out Oracle RMAN supports two different compression algorithms: zlib (the algorithm behind gzip) and bzip2. I assume the default is zlib, and one can specify bzip2 for even higher compression rates, with commensurately slower, more processor-intensive compression activity.
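For reference, compressed backupsets are requested explicitly; the first two commands below are standard syntax, while the algorithm-selection command is, as far as I can tell, a later-release (11G) option and is shown only to illustrate where the zlib/bzip2 choice would be made:

# take a compressed backupset explicitly
BACKUP AS COMPRESSED BACKUPSET DATABASE;

# or make compression the persistent default for a device type
CONFIGURE DEVICE TYPE DISK BACKUP TYPE TO COMPRESSED BACKUPSET;

# later releases expose the algorithm choice directly (illustrative)
CONFIGURE COMPRESSION ALGORITHM 'BZIP2';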
Zlib is a pretty standard repeated-string elimination (LZ77) followed by Huffman coding, which uses shorter bit strings to represent more frequent characters and longer bit strings to represent less frequent characters.
Bzip2 also uses Huffman coding, but only after a number of other transforms: run length encoding (changing runs of duplicated characters into a count:character sequence), the Burrows–Wheeler transform (reorders the data stream so that similar characters come together), a move-to-front transform (replaces each symbol with its position in a list of recently used symbols, so repeated characters become small numbers), another run length encoding step, the Huffman encoding itself, followed by another couple of steps to decrease the data length even more…
The net of all this is that a block of data that is bzip2 encoded may look significantly different if even one character is changed. Similarly, even zlib compressed data will look different after a single character insertion, though perhaps not as much. This will depend on the character and where it's inserted, but even if the new character doesn't change the Huffman encoding tree, adding a few bits to a data stream will necessarily alter its byte groupings significantly downstream from that insertion. (See Huffman coding to learn more.)
Deduplicating RMAN compressed backupsets
Sub-block level deduplication often depends on seeing the same sequence of data, possibly skewed or shifted by one to N bytes, in two different data blocks. But as discussed above, with bzip2 or zlib (or any Huffman-encoded) compression, the sequence of bytes looks distinctly different downstream of any character insertion.
One way to obtain decent deduplication rates from RMAN compressed backupsets would be to decompress the data at the dedupe appliance and then run the deduplication algorithm on it – but dedupe appliance ingestion rates would suffer accordingly. Another approach is to not use RMAN compressed backupsets at all, but the advantages of compression are appealing: less network bandwidth, faster backups (because less data is transferred), and quicker restores.
Oracle RMAN OST
On the other hand, what might work is some form of Data Domain OST/Boost-like support from Oracle RMAN, which would partially deduplicate the data at the RMAN server and then send the deduplicated stream to the dedupe appliance. This would use less network bandwidth and speed up backups, but may not do anything for restores. Perhaps a tradeoff worth investigating.
As for the likelihood that Oracle would make such services available to deduplication vendors, I would have said this was unlikely, but ultimately the customers have a say here. It's unclear why Symantec created OST, but it turned out to be a money maker for them, and something similar could be supported by Oracle. Once an Oracle RMAN OST-like capability was in place, it shouldn't take much to provide Boost functionality on top of it. (Although EMC Data Domain is the only dedupe vendor that has Boost yet, whether for OST or their own NetWorker Boost version.)
—-
When I first started this post I thought that if the dedupe vendors just understood the format of RMAN compressed backupsets, they would be able to achieve the same dedupe ratios as seen for normal RMAN backupsets. As I investigated the compression algorithms being used, I became convinced that it's a computationally "hard" problem to extract duplicate data from RMAN compressed backupsets and that it ultimately would probably not be worth it.
So, if you use RMAN backupset compression, you probably ought to avoid deduplicating this data for now.