Since IBM announced its intent to purchase StorWize, there has been much discussion on whether primary storage data compression can be made to work. As far as I know, StorWize only offered primary storage compression for file data, but nothing prohibits doing something similar for block storage as long as you have some control over how blocks are laid down on disk.
Although secondary block data compression has been around for years in enterprise tape, and more recently in some deduplication appliances, primary storage compression pre-dates secondary storage compression. STK delivered primary storage data compression with Iceberg in the early ’90s, but it wasn’t until a couple of years later that they introduced compression on tape.
In both primary and secondary storage, data compression works to reduce the space needed to store data. Of course, not all data compresses well, most notably image data (as it’s usually already compressed), but compression ratios of 2:1 were common for primary storage of that time and are normal for today’s secondary storage. I see no reason why such ratios couldn’t be achieved for current primary storage block data.
Implementing primary block storage data compression
There is significant interest in implementing deduplication for primary storage, as NetApp has done, but supporting data compression is not much harder. I believe much of the effort to deduplicate primary storage lies in creating a method to address partial blocks out of order. I would call this data block virtual addressing, and it requires some sort of storage pool. The remaining effort to deduplicate data involves implementing the chosen (dedupe) algorithm, indexing/hashing, and other administrative activities. These latter activities aren’t readily transferable to data compression, but the virtual addressing and space pooling should be usable by data compression.
Furthermore, block storage thin provisioning requires some sort of virtual addressing as does automated storage tiering. So in my view, once you have implemented some of these advanced capabilities, implementing data compression is not that big a deal.
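To make the virtual addressing idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any vendor does it: the `CompressedVolume` class, the 4096-byte block size, the append-only pool, and the use of `zlib` are all my own choices for the example. The point is that once a logical block address maps to an (offset, length) extent in a pool, the stored extent no longer has to be a full block.

```python
import zlib

BLOCK_SIZE = 4096  # assumed logical block size in bytes


class CompressedVolume:
    """Toy volume: logical block address -> (offset, length) in a shared pool."""

    def __init__(self):
        self.pool = bytearray()  # stand-in for the physical storage pool
        self.block_map = {}      # virtual address map: lba -> (offset, length)

    def write_block(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        compressed = zlib.compress(data)
        # Append-only for simplicity: extents of overwritten blocks are
        # never reclaimed in this sketch.
        offset = len(self.pool)
        self.pool += compressed
        self.block_map[lba] = (offset, len(compressed))

    def read_block(self, lba):
        offset, length = self.block_map[lba]
        return zlib.decompress(bytes(self.pool[offset:offset + length]))

    def physical_bytes(self):
        # Bytes consumed in the pool, including any stale extents.
        return len(self.pool)
```

A real implementation would also need space reclamation, persistence of the map, and a policy for blocks that do not compress, none of which is shown here.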
The one question that remains is whether to implement compression in hardware or software (see Better storage through hardware for more). Considering that most deduplication is done in software today, it seems that data compression in software should be doable. The compression phase could run in the background sometime after the data has been stored. Real-time decompression in software might take some work, but would cost considerably less than any hardware solution. That said, the intensive bit fiddling required to perform data compression/decompression may argue for some sort of hardware assist.
Data compression complements deduplication
The problem with deduplication is that it needs duplicate data. This is why it works so well for secondary storage (backing up the same data over and over) and for VDI/VMware primary storage (with duplicated O/S data).
But data compression is an orthogonal or complementary technique which uses the inherent redundancy in information to reduce storage requirements. For instance, something like Huffman coding takes advantage of the fact that in text some letters occur more often than others (see letter frequency). In English, ‘e’, ‘t’, ‘a’, ‘o’, ‘i’, and ‘n’ represent over 50% of the letters in most text documents. By using shorter bit combinations to encode these letters, one can reduce the bit-length of any (English) text string substantially. LZ compression, by contrast, replaces repeated strings with short references to their earlier occurrences. Another example is run length encoding, which takes any repeated character and substitutes a trigger character, the character itself, and a count of the number of repetitions for the repeated string.
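Run length encoding as described above is simple enough to sketch in a few lines of Python. The details here are assumptions made for illustration: the trigger byte value (`0x00`), the minimum run length worth encoding (4), and the one-byte count (so runs are capped at 255) are arbitrary choices, and the function names are hypothetical.

```python
TRIGGER = 0x00  # assumed trigger (escape) byte
MIN_RUN = 4     # shorter runs are cheaper to store as-is


def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        run = 1
        # Count how many times this byte repeats, up to a one-byte count.
        while i + run < len(data) and data[i + run] == byte and run < 255:
            run += 1
        if run >= MIN_RUN or byte == TRIGGER:
            # Emit trigger, the byte itself, and the repeat count.
            # A literal trigger byte is always escaped this way so that
            # decoding stays unambiguous.
            out += bytes([TRIGGER, byte, run])
        else:
            out += bytes([byte]) * run
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == TRIGGER:
            byte, run = data[i + 1], data[i + 2]
            out += bytes([byte]) * run
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

With these assumptions, a block of 4,096 zero bytes (common in freshly provisioned volumes) shrinks to a few dozen bytes, while data with no runs passes through at its original size, or slightly larger if it contains the trigger byte.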
Moreover, the nice thing about data compression is that all these techniques can be readily combined to generate even better compression rates. And of course compression could be applied after deduplication to reduce storage footprint even more.
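As a rough illustration of applying compression after deduplication, the sketch below splits data into fixed-size blocks, keeps one copy of each unique block (identified by a SHA-256 fingerprint), and compresses only those unique blocks with `zlib`. The function names, the fixed 4096-byte block size, and the choice of hash and compressor are assumptions for the example, not a description of any product.

```python
import hashlib
import zlib


def dedupe_then_compress(data: bytes, block_size: int = 4096):
    store = {}   # fingerprint -> compressed copy of a unique block
    recipe = []  # ordered fingerprints needed to rebuild the data
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in store:
            # Only blocks not seen before are compressed and stored.
            store[fingerprint] = zlib.compress(block)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)


def stored_bytes(store) -> int:
    # Payload only; ignores the cost of the fingerprints and recipe.
    return sum(len(blob) for blob in store.values())
```

Deduplication removes the repeated blocks, and compression then shrinks whatever unique blocks remain, so the two savings multiply rather than compete.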
Why would any vendor compress data?
For a couple of reasons:
- Compression not only reduces storage footprint but with hardware assist it can also increase storage throughput. For example, if 10GB of data compresses down to 5GB, it should take ~1/2 the time to read.
- Compression reduces the time it would take to clone, mirror or replicate data.
- Compression increases the amount of data that can be stored, which should incentivise customers to pay more for your storage.
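The read-time claim in the first point is simple arithmetic, sketched below. It assumes the transfer is limited purely by back-end bandwidth and that decompression (for instance with hardware assist) keeps up at line rate. The function name and the 0.5 GB/s bandwidth figure are made up for illustration.

```python
def read_seconds(logical_gb: float, ratio: float, bandwidth_gb_per_s: float) -> float:
    """Time to read logical_gb of data stored at the given compression ratio."""
    physical_gb = logical_gb / ratio  # bytes that actually come off disk
    return physical_gb / bandwidth_gb_per_s


uncompressed = read_seconds(10, 1.0, 0.5)  # 10GB stored as 10GB
compressed = read_seconds(10, 2.0, 0.5)    # 10GB stored as 5GB
print(uncompressed, compressed)
```

If decompression is slower than the back-end, the gain shrinks accordingly, which is part of the case for hardware assist.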
On the other hand, with data compression vendors might sell less storage. But the advantages of enterprise storage lie in the advanced functionality/features and higher reliability/availability/performance that are available. I see data compression as just another advantage of enterprise class storage, and as a feature, the user could enable or disable it and see how well it works for their data.
What do you think?