
Last week NetApp announced the availability of data compression on many of their unified storage platforms, which include block and file storage. Earlier this year EMC announced data compression for LUNs on CLARiiON and Celerra. I must commend both of them for bringing data compression back into primary storage systems, a capability missing since IBM and Sun stopped marketing RVA and SVA.
Data compression algorithms
Essentially, data compression is an algorithm that eliminates redundancy in a data stream. Data compression can be “lossy” or “lossless”. Data compression in storage subsystems is typically lossless, which means the original data can be reconstructed without any loss of information. Lossy algorithms show up in video/audio compression, where discarding some data doesn’t impact fidelity unless the loss becomes significant.
One simple example of lossless data compression is Run-Length Encoding, which substitutes a trigger, count, and character for any character repeated more than 4 times in a block of data. This works well on text strings with lots of blanks, numerical data with lots of 0s, and initial format data written with a repeating character.
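Here’s a rough Python sketch of that idea; the 0xFF trigger byte and the more-than-4 run cutoff are illustrative choices of mine, not any vendor’s on-disk format:

```python
# Minimal run-length encoding sketch: runs longer than 4 bytes become
# trigger + count + byte; shorter runs pass through untouched.
TRIGGER = 0xFF  # hypothetical escape byte, assumed not to appear in the data

def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run > 4:
            out += bytes([TRIGGER, run, data[i]])  # trigger, count, character
        else:
            out += data[i:i + run]                 # short runs copied as-is
        i += run
    return bytes(out)

sample = b"name" + b" " * 6 + b"0" * 20   # padded text plus a zero-filled field
print(len(sample), "->", len(rle_encode(sample)), "bytes")   # 30 -> 10
```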
There are other, more sophisticated compression algorithms like Huffman Coding, which identify the most frequent bytes in a block of data and replace them with shorter bit patterns. For example, if ~50% of the characters in a text file are the letters “a”, “e”, “i”, “o”, “t”, and “n” (see Wikipedia, Frequency Analysis), then these characters can take up much less space if we encode them in 4 or fewer bits rather than the 8 bits in a byte.
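A toy Huffman coder (my sketch, certainly not what either vendor ships) shows how the frequent symbols end up with the shortest bit patterns:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    # Heap entries are (total frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, low = heapq.heappop(heap)     # merge the two least-frequent subtrees
        f2, tie, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
    return heap[0][2]

codes = huffman_codes("a plain sentence where e t a o i and n appear often")
for sym, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(sym), code)   # the space and the common letters get the shortest codes
```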
I am certain that both EMC and NetApp are using much more sophisticated algorithms than either of these, and it wouldn’t surprise me if they were using something like the open source Zlib (gzip) or Bzip2 algorithms (see my Poor deduplication with Oracle RMAN compressed backups post for an explanation), which use Huffman Coding and add even more sophistication. Data compression algorithms like these could offer something like 50% compression, i.e., your data could be stored in 50% less space.
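For the curious, Python’s standard zlib and bz2 modules give a quick feel for what these lossless compressors do to a sample of text; the exact savings depend entirely on the data, so treat the numbers as illustrative:

```python
import bz2
import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 200   # repetitive sample text

for name, packed in [("zlib", zlib.compress(data, 9)), ("bzip2", bz2.compress(data, 9))]:
    saved = 100 * (1 - len(packed) / len(data))
    print(f"{name}: {len(data)} -> {len(packed)} bytes ({saved:.0f}% saved)")
```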
Data compression is often confused with Data Deduplication but they’re not the same. Deduplication looks for duplicate data across different data blocks and files, while data compression examines only the data stream within a single block or file and doesn’t depend on any other data.
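A short sketch makes the distinction concrete; the 4KB block size and SHA-256 fingerprints here are just illustrative assumptions, not how any particular product does it:

```python
import hashlib
import zlib

BLOCK = 4096
data = (b"A" * BLOCK) * 3 + bytes(range(256)) * 16   # three identical blocks plus one unique
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

# Deduplication: identical blocks collapse to one stored copy plus references.
unique = {hashlib.sha256(b).digest(): b for b in blocks}
print("dedup:", len(blocks), "blocks ->", len(unique), "unique blocks")

# Compression: each block shrinks (or doesn't) entirely on its own.
compressed = [zlib.compress(b) for b in blocks]
print("compress:", sum(len(b) for b in blocks), "->", sum(len(c) for c in compressed), "bytes")
```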
Storage system data compression
In the past, data compression was relegated to a separate appliance, tape storage systems, and/or host software. By integrating these algorithms into their main storage engines, both NetApp and EMC are taking advantage of the faster processors now embedded in their systems to offer this offline functionality for online data.
Historically, compression algorithms like these were implemented in hardware, but nowadays they can easily be done in software, relegated to run during off-peak IO time or as the lowest-priority task in the storage system. As such, there can be no guarantee when your data will finally be compressed, only that it will be compressed eventually.
Data compression like this is great for data that isn’t modified frequently. It takes processing time to compress data, and that work would need to be repeated after every modification of a compressed block or file. So if the data isn’t modified that much, compression’s processing cost can be amortized over a longer data lifetime.
Further, data compression must be undone at read time, i.e., the data needs to be decompressed and handed off to the IO requesting it. Decompression is a much faster algorithm than compression because, in the case of something like Huffman Coding, the character dictionary is already known and it’s just a matter of table lookup and bit-field isolation. It would be convenient if this data were sitting in system DRAM someplace but, lacking that, moving it from cache to DRAM could be done quickly enough, the data decompressed there, and then moved back before final transfer to the requesting IO.
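A rough timing sketch with zlib (results will vary with CPU, compression level, and data) suggests why servicing a read of compressed data is cheaper than the original compression pass:

```python
import time
import zlib

data = b"some representative block of text data " * 25000   # roughly 1MB payload

t0 = time.perf_counter()
packed = zlib.compress(data, 9)
t1 = time.perf_counter()
zlib.decompress(packed)
t2 = time.perf_counter()

print(f"compress:   {t1 - t0:.4f}s")
print(f"decompress: {t2 - t1:.4f}s   # typically several times faster")
```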
As such, data compression may impact response time for compressed data reads. How much of an impact is still TBD.
Data writes will not be impacted at all because the compression activity happens much later. Whether the data stays in cache until compressed or is brought back in at some later time is another algorithmic question, which may impact cache hit rates and compression performance, but it doesn’t have to be a serious impediment.
NetApp is able to offer this capability for both block and file storage because of its WAFL backend data structure, which essentially allows it to create variable-length blocks for file and block data. EMC only offers this for LUN data (block storage) as of yet, but it’s probably just a matter of time before it’s available for other data as well.
Any questions?