Data compression lives on

Macroblocking: demolish the eerie ▼oid by Rosa Menkman (cc) (from Flickr)

Last week NetApp announced the availability of data compression on many of their unified storage platforms, which include both block and file storage.  Earlier this year EMC announced data compression for LUNs on CLARiiON and Celerra.  I must commend both of them for re-integrating data compression into primary storage systems, a capability that has been missing since IBM and Sun stopped marketing RVA and SVA.

Data compression algorithms

Essentially, data compression is an algorithm that eliminates redundancy in a data stream.  Data compression can be “lossy” or “lossless”.  Data compression in storage subsystems is typically lossless, which means that the original data can be reconstructed without any loss of information.  One sees lossy algorithms in video/audio compression, where some information can be discarded without noticeably impacting video/audio fidelity.

One simple example of lossless data compression is Run-Length Encoding, which substitutes a trigger, count, and character for any character repeated more than 4 times in a block of data.  This compresses text strings with lots of blanks, numerical data with lots of 0s, and initial format data written with a repeating character quite well.
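
As a rough illustration, here is a minimal Python sketch of that kind of run-length encoding.  The trigger byte, the minimum run length of 4, and the framing are my own illustrative choices, not any vendor's actual format.

    # Minimal run-length encoding sketch: runs of more than 4 identical bytes
    # become (TRIGGER, count, byte). The trigger byte and framing are purely
    # illustrative and assume the trigger value never appears in the input.
    TRIGGER = 0xFF

    def rle_encode(data: bytes) -> bytes:
        out = bytearray()
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i] and j - i < 255:
                j += 1
            run = j - i
            if run > 4:
                out += bytes([TRIGGER, run, data[i]])   # trigger, count, character
            else:
                out += data[i:j]                        # short runs stay as-is
            i = j
        return bytes(out)

    def rle_decode(data: bytes) -> bytes:
        out = bytearray()
        i = 0
        while i < len(data):
            if data[i] == TRIGGER:
                count, value = data[i + 1], data[i + 2]
                out += bytes([value]) * count
                i += 3
            else:
                out.append(data[i])
                i += 1
        return bytes(out)

    # Blank-padded text and zero-filled numeric fields compress nicely
    record = b"name" + b" " * 16 + b"0" * 30
    packed = rle_encode(record)
    assert rle_decode(packed) == record
    print(len(record), "->", len(packed))   # 50 -> 10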

There are other, more sophisticated compression algorithms like Huffman Coding, which identify the most frequent bytes in a block of data and replace them with shorter bit patterns. For example, if ~50% of the characters in a text file are the letters “a”, “e”, “i”, “o”, “t”, and “n” (see Wikipedia, Frequency Analysis) then these characters can take up much less space if we encode them in 4 or fewer bits rather than the 8 bits of a full byte.
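
Here's a small Python sketch of the standard heap-based Huffman construction, just to show that frequent characters end up with short codes and rare ones with long codes.  It computes only the code lengths, not a full encoder, and the sample text is simply an illustrative string.

    import heapq
    from collections import Counter

    def huffman_code_lengths(data: bytes) -> dict:
        """Build a Huffman tree from byte frequencies; return bits per byte value."""
        freq = Counter(data)
        # Heap entries: (frequency, tiebreaker, {byte: code length so far})
        heap = [(f, i, {b: 0}) for i, (b, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
            f2, _, right = heapq.heappop(heap)
            # Merging two subtrees adds one bit to every code beneath them
            merged = {b: depth + 1 for b, depth in {**left, **right}.items()}
            tiebreak += 1
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        return heap[0][2]

    sample = (b"data compression essentially eliminates redundancy in a data "
              b"stream so that the original information can be stored in less "
              b"space and reconstructed exactly when it is read back")
    for byte, bits in sorted(huffman_code_lengths(sample).items(), key=lambda kv: kv[1]):
        print(repr(chr(byte)), bits, "bits")   # frequent letters get the short codes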

I am certain that both EMC and NetApp are using much more sophisticated algorithms than either of these, and it wouldn’t surprise me if they were using an open source algorithm like Zlib (gzip) or Bzip2 (see my Poor deduplication with Oracle RMAN compressed backups post for an explanation), which use Huffman Coding and add even more sophistication.  Data compression algorithms like these could offer something like 50% compression, i.e., your data could be stored in 50% less space.
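
Python's standard library wraps both zlib and bzip2, so it's easy to get a feel for the kind of ratios they deliver.  This is a quick experiment on synthetic, log-like data of my own invention, not a benchmark; real ratios depend heavily on the data, which is why the ~50% figure above is only a rough rule of thumb.

    import bz2
    import zlib

    # Synthetic, fairly redundant sample data standing in for typical file contents
    lines = (f"2010-11-15 12:{m:02d}:{s:02d} host42 app: request completed ok\n"
             for m in range(60) for s in range(60))
    data = "".join(lines).encode()

    for name, compress in (("zlib", zlib.compress), ("bzip2", bz2.compress)):
        compressed = compress(data)
        saved = 1 - len(compressed) / len(data)
        print(f"{name}: {len(data)} -> {len(compressed)} bytes ({saved:.0%} saved)")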

Data compression is often confused with Data Deduplication but it’s not the same thing. Deduplication looks for duplicate data across different data blocks and files, while data compression examines only the data stream within a single block or file and doesn’t depend on any other data.

Storage system data compression

In the past, data compression was relegated to a separate appliance, tape storage systems, and/or host software.  By integrating these algorithms into their main storage engines, both NetApp and EMC are taking advantage of the processor speed increases being embedded in their systems to offer this offline functionality for online data.

Historically, compression algorithms such as these were implemented in hardware, but nowadays they can easily be done in software by relegating them to run during off-peak IO periods or as the lowest priority task in the storage system. As such, there can be no guarantee when your data will finally be compressed, but it will be compressed eventually.

Data compression like this is great for data that isn’t modified frequently.  It takes some processing time to compress data, and that work would need to be repeated after every modification of a compressed block or file.  So if the data isn’t modified that much, compression’s processing cost can be amortized over a longer data lifetime.

Further, data compression must be undone at read time, i.e., the data needs to be de-compressed and handed off to the requesting IO.  De-compression is a much faster algorithm than compression because, in the case of something like Huffman Coding, the character dictionary is already known; it’s just a matter of table lookups and bit field isolation. It would be convenient if this data were already sitting in system DRAM someplace but, lacking that, it could be moved from cache to DRAM quickly enough, decompressed there, and then moved back before final transfer to the requesting IO.
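
That asymmetry is easy to observe with zlib.  Here is a rough timing sketch using the same sort of synthetic data as above; absolute numbers will vary with hardware, data, and compression level.

    import time
    import zlib

    lines = (f"2010-11-15 12:{m:02d}:{s:02d} host42 app: request completed ok\n"
             for m in range(60) for s in range(60))
    data = "".join(lines).encode()

    start = time.perf_counter()
    compressed = zlib.compress(data, 9)          # level 9 = maximum compression
    compress_secs = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_secs = time.perf_counter() - start

    assert restored == data
    print(f"compress:   {compress_secs:.5f}s")
    print(f"decompress: {decompress_secs:.5f}s")  # typically several times faster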

As such, data compression may impact response time for compressed data reads.  How much of an impact is yet TBD.

Data writes will not be impacted at all because the compression activity is done much later. Whether the data stays in cache until compressed or is brought back in at some later time is another algorithmic question which may impact cache hit rates and compression performance, but this doesn’t have to be a serious impediment.

NetApp is able to offer this capability for both block and file storage because of its WAFL backend data structure, which essentially allows it to create variable length blocks for file and block data.  EMC only offers this for LUN data (block storage) as yet, but it’s probably just a matter of time before it’s available for other data as well.

Any questions?

Poor deduplication with Oracle RMAN compressed backups

Oracle offices by Steve Parker (cc) (from Flickr)

I was talking with one large enterprise customer today and he was lamenting how poorly Oracle RMAN compressed backupsets dedupe. Apparently, non-compressed RMAN backupsets generate anywhere from 20 to 40:1 deduplication ratios, but when they use RMAN backupset compression, their deduplication ratios drop to 2:1.  Given that RMAN compression probably only adds another 2:1 compression ratio, the overall data reduction comes to something like 4:1.
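
Put numerically, using the round numbers above, the two reduction ratios multiply, which is what makes the difference so stark:

    # Effective data reduction = dedupe ratio x compression ratio (round numbers)
    uncompressed_backupsets = 30 * 1   # ~20-40:1 dedupe, no RMAN compression
    compressed_backupsets = 2 * 2      # ~2:1 dedupe on ~2:1 RMAN compressed data
    print(f"{uncompressed_backupsets}:1 vs {compressed_backupsets}:1")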

RMAN compression

It turns out Oracle RMAN supports two different compression algorithms: zlib (the same compression used by gzip) and bzip2.  I assume the default is zlib; if you want, you can specify bzip2 for even higher compression rates, with commensurately slower or more processor intensive compression activity.

  • Zlib is pretty standard repeated-string elimination (LZ77) followed by Huffman coding, which uses shorter bit strings to represent more frequent characters and longer bit strings to represent less frequent characters.
  • Bzip2 also uses Huffman coding, but only after a number of other transforms: run length encoding (changing duplicated characters to a count:character sequence), the Burrows–Wheeler transform (reordering the data stream so that repeating characters come together), a move-to-front transform (recoding recently used characters with small indices so that runs of identical characters become runs of zeros), another run length encoding step, Huffman encoding, followed by another couple of steps to decrease the data length even more (a toy sketch of the two less familiar transforms follows below)…
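
Here is a toy Python sketch of the Burrows–Wheeler transform and move-to-front coding mentioned above.  It uses a naive rotation sort just to show the effect on a small string, nothing like bzip2's actual block-sorting implementation.

    def bwt(data: bytes) -> bytes:
        """Naive Burrows-Wheeler transform: sort every rotation, keep the last column."""
        data = data + b"\x00"                    # unique end-of-data sentinel
        rotations = sorted(data[i:] + data[:i] for i in range(len(data)))
        return bytes(rot[-1] for rot in rotations)

    def move_to_front(data: bytes) -> list:
        """Recode each byte as its index in a recency list; repeats become 0s."""
        table = list(range(256))
        out = []
        for b in data:
            idx = table.index(b)
            out.append(idx)
            table.pop(idx)
            table.insert(0, b)
        return out

    text = b"banana bandana banana banana"
    transformed = bwt(text)
    print(transformed)                 # like characters now cluster together...
    print(move_to_front(transformed))  # ...so MTF output is mostly small numbers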

The net of all this is that a block of data that is bzip2 encoded may look significantly different if even one character is changed.  Similarly, even zlib compressed data will look different after a single character insertion, though perhaps not as different.  This will depend on the character and where it’s inserted, but even if the new character doesn’t change the Huffman encoding tree, adding a few bits to a data stream will necessarily alter its byte groupings significantly downstream from that insertion. (See Huffman Coding to learn more.)
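
A quick Python sketch makes the point: flip a single bit in the middle of a buffer and compare, position by position, how much of the zlib and bzip2 output stays the same.  The byte-by-byte comparison is only a crude proxy for what a dedupe engine would find, and the sample data is synthetic.

    import bz2
    import zlib

    original = b"mostly repetitive database block contents " * 2000
    modified = bytearray(original)
    modified[len(modified) // 2] ^= 0x01        # flip one bit mid-stream
    modified = bytes(modified)

    changed_in = sum(x != y for x, y in zip(original, modified))
    print(f"uncompressed: {changed_in} byte of {len(original)} differs")

    for name, compress in (("zlib", zlib.compress), ("bzip2", bz2.compress)):
        a, b = compress(original), compress(modified)
        same = sum(x == y for x, y in zip(a, b))
        print(f"{name}: {same} of {min(len(a), len(b))} compressed bytes still match")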

Deduplicating RMAN compressed backupsets

Sub-block level deduplication often depends on recognizing the same sequence of data even when it is skewed or shifted by one to N bytes between two data blocks.  But as discussed above, with bzip2 or zlib (or any Huffman encoded) compression, the sequence of bytes looks distinctly different downstream of any character insertion.

One way to obtain decent deduplication rates from RMAN compressed backupsets would be to decompress the data at the dedupe appliance and then run the deduplication algorithm on it, though dedupe appliance ingestion rates would suffer accordingly.  Another approach is simply not to use RMAN compressed backupsets, but the advantages of compression are very appealing: less network bandwidth, faster backups (because less data is transferred), and quicker restores.

Oracle RMAN OST

On the other hand, what might work is some form of Data Domain OST/Boost-like support from Oracle RMAN, which would partially deduplicate the data at the RMAN server and then send the deduplicated stream to the dedupe appliance.  This would reduce network bandwidth and speed up backups, but may not do anything for restores.  Perhaps a tradeoff worth investigating.

As for the likelihood that Oracle would make such services available to deduplication vendors, I would have said this was unlikely, but ultimately the customers have a say here.   It’s unclear why Symantec created OST, but it turned out to be a money maker for them and something similar could be supported by Oracle.  Once an Oracle RMAN OST-like capability was in place, it shouldn’t take much to provide Boost functionality on top of it.  (Although EMC Data Domain is the only dedupe vendor that has Boost as yet, either for OST or for their own NetWorker Boost version.)

—-

When I first started this post I thought that if the dedupe vendors just understood the format of RMAN compressed backupsets they would be able to achieve the same dedupe ratios as seen for normal RMAN backupsets.  But as I investigated the compression algorithms being used, I became convinced that extracting duplicate data from RMAN compressed backupsets is a computationally “hard” problem and ultimately would probably not be worth it.

So, if you use RMAN backupset compression, you probably ought to avoid deduplicating this data for now.

Anything I missed here?