More on data growth from NetApp analyst days customers

Installing a power line at the Tram. Pat, Allan and Chris by bossco (cc) (from Flickr)

Some customers at NetApp’s Analyst Days were discussing deployments of NetApp storage with Dave Hitz, the new storage efficiency czar, and others, but I was more interested in their comments on storage growth issues. Jonathan Bartes of Virginia Farm Bureau mentioned that the “natural growth rate of unstructured data” seemed to be about 20% per year, but some of the other customers had even higher growth rates.

Tucson Electric Power

Christopher Jeffry Rima from Tucson Electric Power is dealing with a 70% CAGR in data growth. This is driven primarily by regulations (power companies are heavily regulated utilities in the USA), high resolution imagery/GIS data and power management/smart metering. It turns out imagery has increased in resolution by about 10X in a matter of years, and they use such images as work plan overlays for field work to fix, upgrade or retire equipment. It seems they have hi-res images of all the power equipment and lines in their jurisdiction, which are updated periodically via flyovers.

The other thing that’s driving their data growth is smart metering and demand power management. I have talked about smart metering’s data appetite before, but demand management was new to me.

Rima said that demand management is similar to smart metering but adds real time modeling of demand and capacity, plus bi-directional transmissions to request that consumers shed demand when required. Smart meters and real time generation data feed the load management model used to predict peak demand over the next time period, which is then used to determine whether to shed demand or not. It turns out that at ~60% utilization the power grid is much more cost effective than at 80%, due to the need to turn on gas generators, which cost more than coal. In any case, when their prediction model shows utilization will top ~60-70%, they start shunting load.


Another customer, Neil Clover from Arup (a construction/engineering firm), started talking about 3D building/site modeling and fire simulation flow dynamics modeling. Clover lamented that it’s not unusual to have a TB of data show up out of nowhere for a project they just took on.

incendio en el edificio 04 by donrenexito (cc) (from Flickr)

Clover said the fire flow modeling’s increasing resolution and multiple iterations under varying conditions were generating lots of data. The 3D models are also causing serious data growth and need to be maintained across the design, build, operate cycle of buildings. A TB of data showing up on your data center storage with no advance notice – incredible. All this and more is causing Clover’s data growth to average around 70% per year.

University Hospitals Leuven, Belgium

The day before, at the analyst meeting, Reinoud Reynders from the University Hospital Leuven, Belgium, mentioned some key drivers of data growth at their hospital: digital pathology studies, which generate about 100GB each and which they do about 100 times a day, and DNA studies, which generate about 1TB of data each and take about a week to create. This seems higher than I predicted, almost 16X higher. However, Reynders said the DNA studies are still pretty expensive at $15K USD each, but he forecasts costs decreasing dramatically over the coming years, with a commensurate volume increase.

But the more critical current issue might be the digital pathology exams at ~10TB per day. The saving grace for pathology exams is that such studies can be archived when completed rather than kept online. Reynders also mentioned that digital radiology and imaging studies are creating massive amounts of data, but unfortunately this data must be kept online because it is re-referenced often and with no predictability.

While data growth was an understated concern during many of the conference sessions, how customers dealt with such (ab?)normal growth by using NetApp storage and Ontap functionality was the main topic of their presentations. An explanation of this NetApp functionality, and how effective it was at managing data growth, will need to await another day.

Describing Dedupe

Hard Disk 4 by Alpha six (cc) (from flickr)

Deduplication is a mechanism to reduce the amount of data stored on disk for backup, archive or even primary storage. For any storage, data is often duplicated, and any system that eliminates storing duplicate data will utilize storage more efficiently.

Essentially, deduplication systems identify duplicate data and store only one copy of such data. They use pointers to incorporate the duplicate data at the right point in the data stream. Such services can be provided at the source, at the target, or even at the storage subsystem/NAS system level.

The easiest way to understand deduplication is to view a data stream as a book. As such, it consists of two parts: a table of contents and the actual chapters of text (or data). The stream’s table of contents provides chapter titles but, more importantly (to us), identifies a page number for each chapter. A deduplicated data stream looks like a book where chapters can be duplicated within the same book or even across books, and the table of contents can point to any book’s chapter when duplicated. A deduplication service inputs the data stream, searches for duplicate chapters and deletes them, and updates the table of contents accordingly.

There’s more to this, of course. For example, chapters or duplicate data segments must be tagged with how often they are duplicated, so that such data is not lost when one of the copies is modified or deleted. Also, one way to determine if data is duplicated is to take one or more hashes of it and compare these to the hashes of other data, but to work quickly, data hashes must be kept in a searchable index.
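The book analogy, together with the duplicate counts and the hash index, can be sketched in a few lines of Python. This is only a toy illustration, not any vendor's implementation: the `DedupeStore` class, the 4-byte fixed chunk size and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # assumed, tiny fixed-size chunks, only to keep the example readable


class DedupeStore:
    """Hypothetical toy store: one copy per unique chunk, plus duplicate counts."""

    def __init__(self):
        self.chunks = {}    # hash -> chunk bytes (doubles as the searchable index)
        self.refcount = {}  # hash -> how many table-of-contents entries point at it

    def write(self, data: bytes) -> list:
        """Store a stream and return its table of contents (a list of chunk hashes)."""
        toc = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = chunk  # first time seen: keep the data
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            toc.append(digest)
        return toc

    def read(self, toc: list) -> bytes:
        """Re-constitute a stream by following its table of contents."""
        return b"".join(self.chunks[digest] for digest in toc)

    def delete(self, toc: list) -> None:
        """Drop a stream; a chunk is freed only when nothing points at it anymore."""
        for digest in toc:
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                del self.refcount[digest]
                del self.chunks[digest]


store = DedupeStore()
toc_a = store.write(b"AAAABBBBAAAA")  # "AAAA" is duplicated within the stream
toc_b = store.write(b"BBBBCCCC")      # "BBBB" is duplicated across streams
print(len(store.chunks), "unique chunks stored for", len(toc_a) + len(toc_b), "chunks written")
```

Deleting the first stream with `store.delete(toc_a)` frees the "AAAA" chunk but leaves "BBBB" in place, because the second stream's table of contents still points at it.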

Types of deduplication

  • Source deduplication involves a repository, a client application, and an operation which copies client data to the repository.  Client software chunks the data, hashes the data chunks, and sends these hashes over to the repository.  On the receiving end, the repository determines which hashes are duplicates and then tells the client to send only the unique data.  The repository stores the unique data chunks and the data stream’s table of contents.
  • Target deduplication involves performing deduplication inline, in-parallel, or post-processing by chunking the data stream as it’s received, hashing the chunks, determining which chunks are unique, and storing only the unique data.  Inline refers to doing such processing while receiving data at the target system, before the data is stored on disk.  In-parallel refers to doing a portion of this processing while receiving data, i.e., portions of the data stream will be deduplicated while other portions are being received.  Post-processing refers to data that is completely staged to disk before being deduplicated later.
  • Storage subsystem/NAS system deduplication looks a lot like post-processing target deduplication.  For NAS systems, deduplication looks at a file of data after it is closed. For general storage subsystems, the process looks at blocks of data after they are written.  Whether either system detects duplicate data below these levels is implementation dependent.
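To make the source deduplication exchange a bit more concrete, here is a minimal sketch of the hash-first handshake described in the first bullet. The `Repository` class, the `client_backup` function, the 4-byte chunk size and SHA-256 are all hypothetical choices for illustration; real products differ in chunking, hashing and wire protocol.

```python
import hashlib


def chunk(data: bytes, size: int = 4) -> list:
    """Split a stream into fixed-size chunks (assumed size, for illustration)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Repository:
    """Hypothetical repository holding unique chunks and per-stream tables of contents."""

    def __init__(self):
        self.chunks = {}   # hash -> chunk bytes
        self.streams = {}  # stream name -> table of contents (list of hashes)

    def missing(self, hashes) -> set:
        """Tell the client which hashes the repository does not already hold."""
        return {h for h in hashes if h not in self.chunks}

    def store(self, name: str, toc: list, new_chunks: dict) -> None:
        """Keep the unique chunks and the stream's table of contents."""
        self.chunks.update(new_chunks)
        self.streams[name] = list(toc)


def client_backup(repo: Repository, name: str, data: bytes) -> int:
    """Back up one stream; return the number of data bytes actually sent."""
    pieces = chunk(data)
    toc = [hashlib.sha256(p).hexdigest() for p in pieces]  # step 1: hash the chunks
    wanted = repo.missing(toc)                              # step 2: send hashes, get reply
    to_send = {h: p for h, p in zip(toc, pieces) if h in wanted}
    repo.store(name, toc, to_send)                          # step 3: send only unique data
    return sum(len(p) for p in to_send.values())


repo = Repository()
sent_first = client_backup(repo, "monday", b"AAAABBBBCCCC")
sent_second = client_backup(repo, "tuesday", b"AAAABBBBDDDD")
print("bytes sent:", sent_first, "then", sent_second)
```

The second backup sends only the one chunk the repository has not seen. Target deduplication would run the same chunk-and-hash steps, only on the receiving side after the full stream has crossed the wire.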

Deduplication overhead

Deduplication processes generate most of their overhead while deduplicating the data stream, essentially during or after the time the data is written. This is the reason target deduplication has so many options: some optimize ingestion while others optimize storage use. There is very little additional overhead for re-constituting (or un-deduplicating) the data for read back, as retrieving the unique and/or duplicated data segments can be done quickly. There may be some minor performance loss because of a lack of sequentiality, but that only impacts data throughput, and not by much.

Where dedupe makes sense

Deduplication was first implemented for backup data streams, because any backup regime that takes full backups on a monthly or even weekly basis will duplicate lots of data. For example, if one takes a full backup of 100TB every week and, let’s say, new unique data created each week is ~15%, then at week 0, 100TB of data is stored for both the deduplicated and un-deduplicated versions; at week 1 it takes 115TB to store the deduplicated data but 200TB for the un-deduplicated data; at week 2 it takes ~132TB to store the deduplicated data but 300TB for the un-deduplicated data, etc. As each full backup completes, it takes another 100TB of un-deduplicated storage but significantly less deduplicated storage. After 8 full backups, the un-deduplicated storage would require 800TB but the deduplicated storage only ~265TB.
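The arithmetic in this example can be reproduced with a short script. It is a sketch under two assumptions read off the figures above: the ~15% of new unique data compounds on the previous week's deduplicated total, and each full backup is counted as a flat 100TB on the un-deduplicated side. The `backup_storage` function name and its parameters are made up for the illustration.

```python
def backup_storage(full_tb: float = 100.0, weekly_growth: float = 0.15, fulls: int = 8) -> list:
    """Return (week, deduplicated TB, un-deduplicated TB) for each weekly full backup."""
    rows = []
    for week in range(fulls):
        # Assumption: unique data compounds by weekly_growth each week.
        deduped = full_tb * (1 + weekly_growth) ** week
        # Assumption: every full backup adds another flat full_tb without deduplication.
        raw = full_tb * (week + 1)
        rows.append((week, deduped, raw))
    return rows


rows = backup_storage()
for week, deduped, raw in rows:
    print(f"week {week}: deduplicated ~{deduped:.0f}TB, un-deduplicated {raw:.0f}TB")
```

Under these assumptions the eighth full backup (week 7) lands at roughly 266TB deduplicated against 800TB un-deduplicated, in line with the ~265TB figure quoted above and about a 3:1 reduction.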

Deduplication can also work for secondary or even primary storage. Most IT shops with 1000’s of users duplicate lots of data. For example, interim files are sent from one employee to another for review, reports are sent out en masse to teams, emails are blasted to all employees, etc. Consequently, any storage (sub)system that can deduplicate data would utilize backend storage more efficiently.

Full disclosure, I have worked for many deduplication vendors in the past.