Genome informatics takes off at 100GB/human

All is One, the I-ching and Genome case by TheAlieness (cc) (from flickr)

I recently read an article (actually a series of charts and text) in MIT Technology Review called Bases to Bytes, which discusses how the cost of having your DNA sequenced is dropping faster than Moore’s law and how storing a person’s DNA data now takes ~100GB.

Apparently Nature magazine says ~30,000 genomes have been sequenced (not counting biotech sequenced genomes), representing ~3PB of data.

Why it takes 100GB

At the moment, DNA sequencing output is stored without compression, deduplication or any other storage efficiency tools to reduce its capacity footprint.  The 3.2 billion DNA base pairs would each take a minimum of 2 bits to store, which should come to ~800MB, but for some reason more information about each base is saved (for future needs?) and the DNA is often re-sequenced multiple times just to be sure (replicas?).  All this seems to add up to needing 100GB of data for a typical DNA sequencing output.

How they go from 0.8GB to 100GB, or 125X the original data requirement, with more information on each base pair and multiple copies is beyond me.
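The arithmetic above is easy to check with a few lines of Python. This is only a rough sketch: the 3.2 billion base pairs, 2 bits per base and 100GB figures come from the text, while the `stored_gb` helper and its what-if inputs (15 passes, 2 bytes kept per base) are my own illustrative assumptions, not numbers from the article.

```python
# Back-of-the-envelope genome storage arithmetic (decimal units: 1 GB = 1e9 bytes).
BASE_PAIRS = 3.2e9      # base pairs in a human genome, as quoted in the post
BITS_PER_BASE = 2       # four possible bases (A, C, G, T) fit in 2 bits
REPORTED_GB = 100       # per-genome figure quoted from the article

raw_gb = BASE_PAIRS * BITS_PER_BASE / 8 / 1e9
expansion = REPORTED_GB / raw_gb


def stored_gb(passes, bytes_per_base):
    """Size if every pass keeps `bytes_per_base` bytes per base (hypothetical helper)."""
    return BASE_PAIRS * passes * bytes_per_base / 1e9


# ASSUMPTION, for illustration only: 15 passes, with 1 byte for the base letter
# plus 1 byte for a quality score. Neither number comes from the article.
what_if_gb = stored_gb(passes=15, bytes_per_base=2)

print(f"2-bit minimum: {raw_gb:.1f} GB")
print(f"expansion needed to reach {REPORTED_GB} GB: {expansion:.0f}x")
print(f"what-if estimate: {what_if_gb:.0f} GB")
```

Under those made-up inputs the total lands near 100GB, which at least suggests that a full byte or two per base, multiplied by a dozen or so passes, could plausibly account for the gap.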

However, we have written about DNA informatics before (see our Dits, codons & chromozones – the storage of life post).  In that post I estimated that human DNA would need ~64GB of storage, almost right on.  (Although there was a math error somewhere in that analysis. Let’s see, ~1B codons [3.2B base pairs at 3 bases per codon], each with 64 possibilities [needing 6 bits], should require ~6 billion bits or ~750MB of storage, close enough).
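For anyone who wants to redo the codon math, here is a small sketch. It uses the rounded 1B codon count from the text; the variable names are just mine.

```python
import math

BASES = 4                 # A, C, G, T
BASES_PER_CODON = 3
CODONS = 1_000_000_000    # rounded count used in the text

# 4 ** 3 = 64 possible codons, which needs log2(64) = 6 bits each.
codon_values = BASES ** BASES_PER_CODON
bits_per_codon = math.ceil(math.log2(codon_values))

total_bits = CODONS * bits_per_codon
total_mb = total_bits / 8 / 1e6   # decimal MB

print(f"{codon_values} codon values -> {bits_per_codon} bits each")
print(f"{total_bits:,} bits = {total_mb:.0f} MB")
```

Note that 6 bits per codon is the same thing as 2 bits per base, so this is really the base-pair estimate counted in groups of three.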

Dedupe to the rescue

But in my view some deduplication should help.  It’s not clear whether it should happen at the codon level or at some higher organizational level (chromosome, protein, ?), but a “codon-differential” deduplication algorithm might just do the trick and take DNA capacity requirements down to size.  In fact, with all the replication in junk DNA, it starts to look more and more like backup sets already.
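To make the idea a bit more concrete, here is a toy sketch of plain fixed-size chunk deduplication over a string of bases. To be clear, this is not a “codon-differential” algorithm (I don’t know what one of those would actually look like) and it is not how any vendor’s product works. The `dedupe` and `restore` functions, the chunk size and the toy sequence are all made up for illustration.

```python
import hashlib


def dedupe(sequence, chunk_size=6):
    """Split into fixed-size chunks and keep one copy of each distinct chunk."""
    store = {}    # digest -> chunk text (the unique chunks actually kept)
    recipe = []   # ordered digests needed to rebuild the original sequence
    for i in range(0, len(sequence), chunk_size):
        chunk = sequence[i:i + chunk_size]
        digest = hashlib.sha256(chunk.encode("ascii")).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original sequence from the chunk store and the recipe."""
    return "".join(store[d] for d in recipe)


# Made-up, highly repetitive toy sequence; real genomes are far less regular.
toy = "ATGGCC" * 50 + "TTTAAA" + "ATGGCC" * 50
store, recipe = dedupe(toy, chunk_size=6)

assert restore(store, recipe) == toy
print(f"{len(recipe)} chunks referenced, {len(store)} unique chunks stored")
print(f"chunk-level dedupe ratio: {len(recipe) / len(store):.1f}:1")
```

The ratio here counts chunks only and ignores the space taken by the recipe itself, so it flatters the result. Real deduplication systems typically use much larger, often variable-sized chunks so that the index stays small relative to the data.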

I am sure any of my Deduplication friends in the industry such as EMC Data Domain, HP StoreOnce, NetApp, SEPATON, and others would be happy to give it some thought if adequate funding were to follow.  But with this much storage at stake, some of them may take it on just to go after the storage requirements.

Gosh, with a 50:1 deduplication ratio, maybe we could get a human DNA sequence down to 2GB.  Then it would only take 14EB to store sequences for the world’s 7B population today.
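Here is the capacity math spelled out as a quick sketch, using decimal units throughout. The 100GB, 50:1, 7B and 30,000 figures are the ones quoted in this post; the variable names are mine, and the 50:1 ratio is wishful thinking rather than a measured result.

```python
GB = 1e9    # bytes, decimal units
PB = 1e15
EB = 1e18

PER_GENOME_GB = 100          # storage per sequenced human today
DEDUPE_RATIO = 50            # hoped-for 50:1, not a measured figure
WORLD_POPULATION = 7e9
GENOMES_SEQUENCED = 30_000   # the Nature estimate cited earlier

deduped_gb = PER_GENOME_GB / DEDUPE_RATIO
world_deduped_eb = deduped_gb * GB * WORLD_POPULATION / EB
world_raw_eb = PER_GENOME_GB * GB * WORLD_POPULATION / EB
sequenced_pb = GENOMES_SEQUENCED * PER_GENOME_GB * GB / PB

print(f"per person after dedupe: {deduped_gb:.0f} GB")
print(f"whole world, deduped:    {world_deduped_eb:.0f} EB")
print(f"whole world, as is:      {world_raw_eb:.0f} EB")
print(f"sequenced so far:        {sequenced_pb:.0f} PB")
```

Without any deduplication, the same 7B people would need 50 times as much, or 700EB.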

Now if we could just sequence the human microbiome, with metagenomic analysis of the microbial communities that live upon, within and around all of us, then we might have the answer to everything we wanted to know biologically about a person.

What we could do with all this information is another matter.