Dits, codons & chromosomes – the storage of life

All is One, the I-ching and Genome case by TheAlieness (cc) (from flickr)

I was thinking the other day that DNA could easily be construed as information storage for life.  For example, DNA uses four distinct nucleotide bases (A, C, G & T) as its basic information unit.  I would call these units of DNA information Dits (for DNA digITs), and as such, DNA uses a base-4 number system.
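
To make the base-4 idea concrete, here is a minimal Python sketch (the base-to-bit assignment is an arbitrary choice of mine, nothing biological) that packs a DNA string into 2 bits per Dit, four Dits to a byte:

    # Minimal sketch: treat each DNA base as a 2-bit "Dit" (base-4 digit).
    DIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def pack_dits(seq):
        """Pack a DNA string into bytes, 4 bases (Dits) per byte."""
        out = bytearray()
        for i in range(0, len(seq), 4):
            b = 0
            for base in seq[i:i + 4]:
                b = (b << 2) | DIT[base]
            out.append(b)
        return bytes(out)

    print(pack_dits("GATTACA").hex())    # 7 Dits fit in 2 bytes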

Next in data storage parlance comes the analogue of the binary byte that holds 8 bits.  In the case of DNA the term to use is the codon, a three-nucleotide (or 3-Dit) unit which codes for one of the 20 amino acids used in life, not unlike how a byte of data defines an ASCII character.  With 64 possibilities per codon, there is room for redundant amino-acid encodings and for codes beyond amino acids (see the chart above for the amino-acid codon encoding).  I envision something akin to ASCII control codes such as STX (the start codon ATG, or AUG in mRNA) and ETX (the stop codons TAA, TAG & TGA), codons which for DNA mark where amino-acid coding begins and ends.
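
As a rough illustration of the codon space and these "control codes", here is a short Python sketch; the start/stop codons and sample table entries shown are standard, but the table is deliberately partial:

    from itertools import product

    BASES = "ACGT"
    codons = ["".join(c) for c in product(BASES, repeat=3)]
    print(len(codons))                   # 4**3 = 64 codons for only ~20 amino acids

    # Start/stop codons act loosely like ASCII STX/ETX control codes.
    # (DNA alphabet shown; in mRNA the T's read as U's, e.g. ATG -> AUG.)
    START = {"ATG"}                      # also codes for methionine
    STOP = {"TAA", "TAG", "TGA"}         # code for no amino acid; end of the "record"

    # A few entries of the (redundant) codon -> amino acid table, for illustration:
    SAMPLE_TABLE = {
        "ATG": "Met", "TGG": "Trp",
        "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",  # 4 codons, 1 amino acid
    }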

DNA is stored in two strands, each one the complement of the other.  In data storage terminology we would consider this a form of data protection somewhat similar to RAID1, since either strand can be rebuilt from its partner. Perhaps we should call it -RAID1, as it’s complementary rather than identical storage.
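
A minimal sketch of that "-RAID1" idea, assuming a simple string representation of a strand: either strand can be regenerated from its partner by complementing it, much like rebuilding a RAID1 mirror:

    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def mirror(strand):
        """Return the complementary strand (read in the reverse direction)."""
        return "".join(COMPLEMENT[b] for b in reversed(strand))

    original = "ATGGCCTAA"
    copy = mirror(original)
    assert mirror(copy) == original      # the "rebuild" recovers the original strand
    print(original, "<->", copy)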

DNA chromosomes seem to exist primarily as a means to read out codons.  It seems the chromosomes are unwound, read sequentially and transcribed into intermediate mRNA, and these mRNA intermediates, with the help of ribosomes and enzymes, are then translated into the proteins of life.  Chromosomes would correspond to data blocks in standard IT terminology, as they are read as a single unit and read sequentially.  However, they are variable in length and seem to carry some historical locality-of-reference information with them, but this is only my perception.  mRNA might be considered a storage cache for DNA data, although it’s unclear whether a given mRNA is read multiple times or used just once.

The cell, or rather the cell nucleus, could be construed as an information (data) storage device where DNA blocks or chromosomes are held.  However, when it comes to Dits, as with bits, there are multiple forms of storage device.  For example, it turns out that DNA can exist outside the cell nucleus in the form of mitochondrial DNA.  I like to think of mitochondrial DNA as similar to storage device firmware, as it encodes the proteins needed to supply energy to the cell.

The similarity to data storage starts to break down at this point.  DNA is mostly WORM (Write-Once-Read-Many) tape-like media and is not readily changed except through mutation and evolution (although recent experiments constructing artificial DNA show it can be deliberately written).  As such, DNA is mostly exact copies of other DNA within an organism, or across organisms of the same species (apart from minor individual variation).  Across species DNA is also broadly shared: human DNA bears a high (~94% or more) similarity to chimp DNA and a lower similarity to other mammalian DNA.

For DNA, I see nothing like storage subsystems that hold multiple storage devices with different (data) information on them.  Perhaps seed banks might qualify for plant DNA but these seem a somewhat artificial construct for life storage subsystems.  However, as I watch the dandelion puffs pass by my back porch there seems to be some rough semblance of cloud storage going on as they look omnipresent, ephemeral, but with active propagation (or replication), not unlike the cloud storage that exists today.  Perhaps my environmentalist friends would call the ecosystem a life storage subsystem as it retains multiple DNA instances or species.

Science tells us that human DNA has ~3B (3×10^9) base pairs, or ~1B codons.  To put this into data storage perspective, at 2 bits per base pair human DNA holds roughly 750MB of data.  Density wise, those base pairs laid end to end stretch about a meter, which works out to roughly 6 million bits per mm, hundreds of times the per-track linear bit density of LTO-4 or LTO-5 tape, and that ignores how much narrower a DNA molecule is than a tape.
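
Here is the back-of-the-envelope arithmetic behind those figures; the LTO-4 linear density used at the end is an approximation I am assuming for comparison, not a vendor-quoted figure:

    # All values approximate.
    base_pairs = 3e9                        # ~3 billion base pairs in the human genome
    bits = base_pairs * 2                   # 2 bits per base (4 possible bases)
    print(f"{bits / 8 / 1e6:.0f} MB")       # ~750 MB

    length_m = base_pairs * 0.34e-9         # ~0.34nm rise per base pair -> ~1m stretched out
    bits_per_mm = bits / (length_m * 1000)
    print(f"{bits_per_mm / 1e6:.1f} million bits per mm")    # ~5.9 million bits/mm

    lto4_bits_per_mm = 343e3 / 25.4         # assumed ~343 kbits/inch per track for LTO-4
    print(f"~{bits_per_mm / lto4_bits_per_mm:.0f}x LTO-4's per-track linear density")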

It’s fairly amazing to me that something as marvelous as a human being can be constructed from less than a gigabyte of data.  I now have an unrestrained urge to copy my DNA so I can back it up offline to some other, non-life media.  But it’s not clear what I could do with it beyond that, and restore seems somewhat problematic at best…

Describing Dedupe

Hard Disk 4 by Alpha six (cc) (from flickr)

Deduplication is a mechanism to reduce the amount of data stored on disk for backup, archive, or even primary storage.  In any of these, data is often duplicated, and a system that avoids storing duplicate data will utilize storage more efficiently.

Essentially, deduplication systems identify duplicate data and store only one copy of it, using pointers to re-incorporate the duplicate data at the right points in the data stream. Such services can be provided at the source, at the target, or even at the storage subsystem/NAS system level.

The easiest way to understand deduplication is to view a data stream as a book, which consists of two parts: a table of contents and the actual chapters of text (or data).  The stream’s table of contents provides chapter titles but, more importantly to us, identifies a page number for each chapter.  A deduplicated data stream looks like a book where chapters can be duplicated within the same book or even across books, and the table of contents can point to any book’s chapter when it is duplicated. A deduplication service takes in the data stream, searches for duplicate chapters, deletes them, and updates the table of contents accordingly.

There’s more to this, of course.  For example, chapters, or duplicate data segments, must be tagged with a count of how many times they are referenced, so that shared data is not lost when one copy is modified or deleted.  Also, one way to determine whether data is duplicated is to take one or more hashes of it and compare them to the hashes of other data; to do this quickly, the hashes must be kept in a searchable index.
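
To make the book analogy concrete, here is a minimal, hypothetical sketch of a content-addressed chunk store with a per-stream table of contents, a hash index, and reference counts; the chunk size and data structures are illustrative assumptions, not any particular product's design:

    import hashlib

    CHUNK = 4096
    chunks = {}      # hash -> chunk bytes (each unique "chapter" stored once)
    refcounts = {}   # hash -> how many table-of-contents entries point at this chunk

    def dedupe_write(data):
        """Store a data stream; return its table of contents (a list of chunk hashes)."""
        toc = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            h = hashlib.sha256(chunk).hexdigest()
            if h not in chunks:              # only unique chunks hit the backing store
                chunks[h] = chunk
            refcounts[h] = refcounts.get(h, 0) + 1
            toc.append(h)
        return toc

    def dedupe_read(toc):
        """Re-constitute the stream by following the table-of-contents pointers."""
        return b"".join(chunks[h] for h in toc)

    stream = b"A" * 8192 + b"B" * 4096 + b"A" * 4096     # duplicate data within one stream
    toc = dedupe_write(stream)
    assert dedupe_read(toc) == stream
    print(f"{len(toc)} chunks referenced, {len(chunks)} actually stored")   # 4 vs 2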

Types of deduplication

  • Source deduplication involves a repository, a client application, and an operation that copies client data to the repository.  Client software chunks the data, hashes the chunks, and sends the hashes to the repository.  On the receiving end, the repository determines which hashes are duplicates and tells the client to send only the unique data.  The repository stores the unique data chunks along with the data stream’s table of contents (see the sketch after this list).
  • Target deduplication involves performing deduplication inline, in-parallel, or post-process by chunking the data stream as it’s received, hashing the chunks, determining which chunks are unique, and storing only the unique data.  Inline refers to doing this processing as data is received at the target system, before the data is stored on disk.  In-parallel refers to doing a portion of this processing while data is being received, i.e., portions of the data stream are deduplicated while other portions are still arriving.  Post-processing refers to data that is completely staged to disk first and deduplicated later.
  • Storage subsystem/NAS system deduplication looks a lot like post-processing target deduplication.  NAS systems typically deduplicate a file of data after it is closed; general storage subsystems look at blocks of data after they are written.  Whether either system detects duplicate data below these levels is implementation dependent.
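
As a sketch of the source-deduplication exchange described in the first bullet above (the class name, methods, and chunk size here are hypothetical, not any vendor's protocol):

    import hashlib

    CHUNK = 4096

    class Repository:
        def __init__(self):
            self.chunks = {}                     # hash -> chunk bytes

        def missing(self, hashes):
            """Step 2: report which hashes the repository has never seen."""
            return [h for h in hashes if h not in self.chunks]

        def store(self, new_chunks):
            """Step 3: store only the unique chunks the client sends."""
            self.chunks.update(new_chunks)

    def client_backup(data, repo):
        pieces = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        hashes = [hashlib.sha256(p).hexdigest() for p in pieces]   # step 1: hash locally
        wanted = set(repo.missing(hashes))
        repo.store({h: p for h, p in zip(hashes, pieces) if h in wanted})
        return hashes                                              # the stream's table of contents

    repo = Repository()
    toc1 = client_backup(b"X" * 8192, repo)
    toc2 = client_backup(b"X" * 8192 + b"Y" * 4096, repo)          # only the "Y" chunk travels
    print(len(repo.chunks))                                        # 2 unique chunks stored in total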

Deduplication overhead

Deduplication processes incur most of their overhead while deduplicating the data stream, essentially during or just after the data is written, which is why target deduplication offers so many options: some optimize ingestion while others optimize storage use. There is very little additional overhead in re-constituting (or un-deduplicating) the data for read-back, since the unique and/or duplicated data segments can be retrieved quickly.  There may be some minor performance loss due to reduced sequentiality, but that impacts only data throughput, and not by much.

Where dedupe makes sense

Deduplication was first implemented for backup data streams, because any backup regime that takes full backups on a monthly or even weekly basis duplicates lots of data.  For example, if one takes a full backup of 100TB every week and, let’s say, ~15% of the data is new or changed each week, then at week 0, 100TB is stored for both the deduplicated and un-deduplicated versions; at week 1 it takes ~115TB to store the deduplicated data but 200TB for the non-deduplicated data; at week 2 it takes ~132TB for the deduplicated data but 300TB for the non-deduplicated data, and so on.  Each full backup adds another ~100TB of un-deduplicated storage but significantly less deduplicated storage.  After 8 full backups the un-deduplicated storage would require 800TB, but only ~265TB would be needed for deduplicated storage.
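
A quick sketch that reproduces the week-by-week numbers above, assuming each weekly full backup is ~100TB of source data and ~15% of it is new each week:

    dedup_tb = 100.0     # week 0: the first full backup has nothing to dedupe against
    raw_tb = 100.0

    for week in range(1, 8):                 # 7 more weekly full backups
        raw_tb += 100.0                      # un-deduplicated: another full copy lands on disk
        dedup_tb *= 1.15                     # deduplicated: only the ~15% new data is stored
        print(f"week {week}: dedup ~{dedup_tb:5.1f} TB vs raw {raw_tb:5.0f} TB")

    # week 1: ~115 TB vs 200 TB ... week 7: ~266 TB vs 800 TB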

Deduplication can also work for secondary or even primary storage.  Most IT shops with 1000s of users duplicate lots of data.  For example, interim files are sent from one employee to another for review, reports are sent out en masse to teams, emails are blasted to all employees, etc.  Consequently, any storage (sub)system that can deduplicate data will utilize its backend storage more efficiently.

Full disclosure: I have worked for many deduplication vendors in the past.