Data compression lives on

Macroblocking: demolish the eerie ▼oid by Rosa Menkman (cc) (from Flickr)
Macroblocking: demolish the eerie ▼oid by Rosa Menkman (cc) (from Flickr)

Last week NetApp announced the availability of data compression on many of their unified storage platforms, which includes block and file storage.  Earlier this year EMC announced data compression for LUNs on CLARiion and Celerra.  I must commend both of them for re-integrating data compression back into primary storage systems, missing since IBM and Sun stopped marketing RVA and SVA.

Data compression algorithms

Essentially data compression is an algorithm that eliminates redundancy in data streams.  Data compression can be “lossy” or “loss-less”.  Data compression in storage subsystems is typically loss-less which means that the original data can be reconstructed without any loss of information.  One sees lossy algorithms in video/audio data compression which doesn’t impact video/audio fidelity unless it results in significant loss of data.

One simple example of loss-less data compression is Run-Length Encoding which substitutes a trigger, count, and character string for any character repeated more than 4 times in a block of data.  This compresses well any text strings with lots of blanks, numerical data with lot’s of 0’s and initial format data written with a repeating character.

There are other, more sophisticated compression algorithms like Huffman Coding, which identify the most frequent bytes in a block of data and replace these bytes with shorter bit patterns. For example if ~50% of the characters in a text file are the letters “a”, “e”, “i”, “o”, “t”, and “n” (see Wikipedia, Frequency Analysis) then these characters can take up much less space if we encode them 4 or less bits rather than the 8-bits in a byte.

I am certain that both EMC and NetApp are using much more sophisticated algorithms than either of these and it wouldn’t surprise me to know they are using something like the open source algorithm like Zlib (gzip) or Bzip2 (see my Poor deduplication with Oracle RMAN compressed backups post for an explanation) which uses Huffman Coding and adds even more sophistication.  Data compression algorithms like these could offer something like 50% compression, i.e., your data could be stored in 50% less space.

Data compression is often confused with Data Deduplication but it’s not the same. Deduplication looks for duplicate data across different data blocks and files while data compression is strictly only examining the data stream within a block or file and doesn’t depend on any other data.

Storage system data compression

In the past, data compression was relegated to a separate appliance, tape storage systems, and/or host software.  By integrating these algorithms into their main storage engines, both NetApp and EMC are taking advantage of the recent processor speed increases being embedding into their systems to offer offline functionality for online data.

Historically, compression algorithms such as these were implemented in hardware but nowadays they can easily be done in software by being relegated to operate during off-peak IO time or execute as the lowest priority task in the storage system. As such, there can be no guarantee when your data will finally be compressed but it will be compressed eventually.

Data compression like this is great for data that isn’t modified frequently.  It takes some processing time to compress data and as such, would need to be repeated after every modification of a compressed block or file.  So if the data isn’t modified that much, compression’s processing cost could be amortized over longer data lifetimes.

Further, data compression must be undone at read time, i.e., the data needs to be de-compressed and handed off to the IO requesting it.  De-compression is a much faster algorithm than compression because in the case of something like Huffman Coding the character dictionary is already known and as such, it’s just a matter of table lookup and bit field isolation. It would be convenient if this data were sitting in the system DRAM someplace but lacking that, moving it from cache to DRAM could be done quickly enough, processed there, and then moved back before final transfer to the requesting IO.

As such, data compression may impact response time for compressed data reads.  How much of an impact is yet TBD.

Data writes will not be impacted at all because the compression activity is done much later. Whether the data stays in cache until compressed or is brought back in at some later time is another algorithm question which may impact cache hit rates/compression performance but this doesn’t have to be a serious impediment.

NetApp is able to offer this capability for both block and file storage because of it’s WAFL backend data structure which essentially allows it to create variable length blocks for file and block data.  EMC only offers this for LUN data (block storage) as of yet but it’s probably just a matter of time before it’s available for other data as well.

Any questions?

6 thoughts on “Data compression lives on

  1. Just as a quick comment following your sentence :

    "De-compression is a much faster algorithm than compression"

    This is not always true. Sure, this is a property of LZ77 compressors, such as gzip or 7zip, and therefore it might seem generic, but is not.

    For example, regarding Huffman, decompression is typically slightly slower than compression, due to table look-up. For range coder, decompression is typically twice slower than compression due to divisions. For many context-oriented algorithms with complex probabilities calculations, compression and decompression are mostly identical because their code are nearly identical.

    Obviously, it doesn't change the whole picture. Storage systems would probably do better to keep CPU usage low, and therefore avoid slow & complex algorithm. And algorithm with very fast decoding speed to indeed exist.

    1. Yann,Thanks for the info, when it was all being done with hardware/state-machines none of this mattered that much. However, It's hard for me to understand how software Huffman encoding would be longer (while creating the dictionary) than software Huffman un-coding (decompression) when the dictionary is supplied for you, but I'll take your word for it.Regards,Ray

  2. Hi Ray

    I agree, this is not "common sense". I also had to experience it to properly understand it, then it was confirmed many times by expert compression programmers.

    You're right that encoding has some additional job to do : the huffman table must be generated.
    However, encoding does not suffer from table look-up access.
    All effects combined, table look-up costs more than table generation.
    Getting into detail, it's a matter of update trade-off and microprocessor architecture.

    I you want some concrete examples, you can try the following old website, where you'll find some downloadable huffman encoder/decoder (Huff0).

    If you can measure performance (-b option), you'll see that decoding is slightly slower than encoding.

    I paid a lot of attention on table lookup, to ensure most hits remain in L1 cache, so difference is not too large; but my initial implementation was much less favorable for decoding.

Comments are closed.