
Deduplication is a mechanism to reduce the amount of data stored on disk for backup, archive, or even primary storage. In any storage environment, data is often duplicated, and any system that eliminates storing duplicate data will utilize storage more efficiently.
Essentially, deduplication systems identify duplicate data and store only one copy of it, using pointers to incorporate the duplicated data at the right point in the data stream. Such services can be provided at the source, at the target, or even at the storage subsystem/NAS system level.
The easiest way to understand deduplication is to view a data stream as a book, which consists of two parts: a table of contents and the actual chapters of text (or data). The stream's table of contents provides chapter titles but, more importantly to us, identifies a page number for each chapter. A deduplicated data stream looks like a book where chapters can be duplicated within the same book or even across books, and the table of contents can point to any book's chapter when duplicated. A deduplication service takes in the data stream, searches for duplicate chapters, deletes them, and updates the table of contents accordingly.
There's more to this, of course. For example, chapters or duplicated data segments must be tagged with a reference count of how often they are duplicated so that such data is not lost when one copy is modified or deleted. Also, one way to determine whether data is duplicated is to compute one or more hashes of it and compare them against the hashes of previously stored data; to do this quickly, the hashes must be kept in a searchable index.
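To make the hash-and-index idea concrete, here is a minimal Python sketch, not any vendor's implementation: it splits a stream into fixed-size chunks, hashes each chunk with SHA-256, keeps a searchable index of unique chunks, and records the stream's "table of contents" as a list of hashes. The chunk size and names are illustrative assumptions; real products typically use variable-size chunking.

```python
import hashlib

CHUNK_SIZE = 4096   # illustrative fixed-size chunks; real systems often chunk on content boundaries

chunk_store = {}    # searchable index: hash -> unique chunk data

def deduplicate(stream: bytes):
    """Split a stream into chunks, store only unique chunks, return its 'table of contents'."""
    recipe = []
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:      # only previously unseen chunks are written
            chunk_store[digest] = chunk
        recipe.append(digest)              # the recipe points at chunks by hash
    return recipe
```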
Types of deduplication
- Source deduplication involves a repository, a client application, and an operation which copies client data to the repository. Client software chunks the data, hashes the data chunks, and sends these hashes over to the repository. On the receiving end, the repository determines which hashes are duplicates and then tells the client to send only the unique data. The repository stores the unique data chunks and the data stream's table of contents (a rough sketch of this exchange follows the list).
- Target deduplication involves performing deduplication inline, in-parallel, or via post-processing: the target chunks the data stream as it's received, hashes the chunks, determines which chunks are unique, and stores only the unique data. Inline refers to doing this processing while receiving data at the target system, before the data is stored on disk. In-parallel refers to doing a portion of this processing while receiving data, i.e., portions of the data stream are deduplicated while other portions are still being received. Post-processing refers to data that is completely staged to disk first and deduplicated later.
- Storage subsystem/NAS system deduplication looks a lot like post-processing target deduplication. For NAS systems, deduplication looks at a file of data after it is closed. For general storage subsystems, the process looks at blocks of data after they are written. Whether either system detects duplicate data below these levels is implementation dependent.
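The source deduplication exchange can be sketched roughly as below; the Repository class and its request_missing/store methods are hypothetical stand-ins for whatever protocol a real client and repository would actually use.

```python
import hashlib

def client_backup(data: bytes, repository, chunk_size: int = 4096):
    """Source deduplication: send hashes first, then only the chunks the repository lacks."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]

    # Repository replies with the subset of hashes it has never seen.
    missing = repository.request_missing(hashes)

    unique_chunks = {h: c for h, c in zip(hashes, chunks) if h in missing}
    repository.store(table_of_contents=hashes, chunks=unique_chunks)

class Repository:
    """Toy repository: a hash index of unique chunks plus one table of contents per backup."""
    def __init__(self):
        self.index = {}      # hash -> chunk data
        self.backups = []    # one table of contents (list of hashes) per backup

    def request_missing(self, hashes):
        return {h for h in hashes if h not in self.index}

    def store(self, table_of_contents, chunks):
        self.index.update(chunks)
        self.backups.append(table_of_contents)
```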
Deduplication overhead
Deduplication processes generate most of their overhead while deduplicating the data stream, essentially during or after the data is written; this is why target deduplication offers so many options, some optimizing ingestion rate while others optimize storage use. There is very little additional overhead for reconstituting (or un-deduplicating) the data on read back, since retrieving the unique and/or duplicated data segments can be done quickly. There may be some minor performance loss due to lack of sequentiality, but that only affects data throughput, and not by much.
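Read back is comparatively cheap because reconstituting a stream is just a walk of its table of contents, looking each hash up in the chunk index. Continuing the earlier sketch (same hypothetical names):

```python
def reconstitute(recipe, chunk_store) -> bytes:
    """Rebuild the original stream by following its table of contents in order."""
    return b"".join(chunk_store[digest] for digest in recipe)
```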
Where dedupe makes sense
Deduplication was first implemented for backup data streams, because any backup regimen that takes full backups on a monthly or even weekly basis duplicates a lot of data. For example, if one takes a full backup of 100TB every week and, let's say, new unique data created each week is ~15%, then at week 0, 100TB of data is stored for both the deduplicated and un-deduplicated versions; at week 1 it takes 115TB to store the deduplicated data but 200TB for the non-deduplicated data; at week 2 it takes ~132TB to store the deduplicated data but 300TB for the non-deduplicated data, and so on. Each full backup adds another 100TB of un-deduplicated storage but significantly less deduplicated storage. After 8 full backups the un-deduplicated data would require 800TB of storage but only ~266TB for the deduplicated version.
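The arithmetic behind those numbers is simple compounding, as the short sketch below shows; the weekly 100TB full backup and ~15% unique-data growth rate are the example's assumptions, not measured data.

```python
FULL_BACKUP_TB = 100
NEW_DATA_RATE = 0.15   # ~15% new unique data each week
WEEKS = 8              # 8 full backups: week 0 through week 7

non_dedup = 0.0
dedup = float(FULL_BACKUP_TB)
for week in range(WEEKS):
    non_dedup += FULL_BACKUP_TB           # every full backup stores another 100TB as-is
    if week > 0:
        dedup *= (1 + NEW_DATA_RATE)      # only the new unique data adds to dedup storage
    print(f"week {week}: dedup ~{dedup:.0f}TB, non-dedup {non_dedup:.0f}TB")
# Final line prints: week 7: dedup ~266TB, non-dedup 800TB
```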
Deduplication can also work for secondary or even primary storage. Most IT shops with thousands of users duplicate a lot of data. For example, interim files are sent from one employee to another for review, reports are sent out en masse to teams, emails are blasted to all employees, etc. Consequently, any storage (sub)system that can deduplicate data will utilize its backend storage more efficiently.
Full disclosure, I have worked for many deduplication vendors in the past.