Big science/big data ENCODE project decodes “Junk DNA”

Results from project ENCODE (ENCyclopedia of DNA Elements) were recently announced. The ENCODE project, carried out by a consortium of over 400 researchers from 32 institutions, has deciphered the functionality of so-called junk DNA in the human genome. They have determined that junk DNA is actually used to regulate gene expression; in other words, junk DNA largely acts as on-off switches for protein-encoding DNA. ENCODE project results were published by Nature, Scientific American, the New York Times and others.

The Nature paper ENCODE Explained is probably the best introduction to the project. But the best resources on the project’s computational aspects are probably these two pieces at Nature: The making of ENCODE: lessons for big-data projects by Ewan Birney and ENCODE: the human encyclopedia by Brendan Maher.

I have been following the bioinformatics/DNA scene for some time now (please see Genome Informatics …, Dits, Codons, & Chromosomes …, DNA Computing …, DNA Computing … – part 2). But this is perhaps the first time it has all come together to explain the architecture of DNA and, potentially, how it all works together to define a human.

Project ENCODE results

It seems like there were at least four major results from the project.

  • Junk DNA is actually programming for protein production in a cell. Scientists previously estimated that <3% of human DNA’s over 3 billion base pairs encode for proteins. Recent ENCODE results indicate that at least 9%, and potentially as much as 50%, of human DNA regulates when that protein-encoding DNA is used.
  • Regulatory DNA undergoes a lot of evolutionary drift; that is, it seems to be heavily modified across species. Protein-encoding genes, for instance, seem to be fairly static and differ very little between species. On the other hand, regulatory DNA varies widely between these very same species. One downside to all this evolutionary variation is that regulatory DNA also seems to be the location for many inherited diseases.
  • Project ENCODE has further narrowed the “known unknowns” of human DNA. For instance, about 80% of human DNA is transcribed into RNA. This means that, on top of the <3% protein-encoding DNA and the ~9-50% regulatory DNA already identified, there is another 27 to 68% of DNA that does something important to help cells transform DNA into life-giving proteins. What that residual DNA does is TBD and is the subject of the next phase of the ENCODE project (see below).
  • There is cell-specific regulatory DNA. That is, there is regulatory DNA that is specifically activated if it’s a bone cell, skin cell, liver cell, etc. Such cell-specific regulatory DNA helps to generate the cells necessary to create each of our organs and regulate their functions. I suppose this was a foregone conclusion, but it’s proven now.
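The “known unknowns” arithmetic in the bullets above can be checked with a quick back-of-the-envelope calculation (the percentages are the rounded figures quoted in the bullets; the breakdown itself is my own):

```python
# Back-of-the-envelope breakdown of the human genome fractions
# quoted in the bullets above (all figures are the rounded ones).
BASE_PAIRS = 3_000_000_000  # ~3B base pairs in human DNA
transcribed = 0.80          # ~80% is transcribed into RNA
protein_coding = 0.03       # <3% encodes proteins
regulatory_low, regulatory_high = 0.09, 0.50  # 9% to 50% is regulatory

# Residual transcribed DNA whose function is still TBD
residual_high = transcribed - protein_coding - regulatory_low   # 0.68
residual_low = transcribed - protein_coding - regulatory_high   # 0.27

print(f"Residual 'unknown' DNA: {residual_low:.0%} to {residual_high:.0%}")
print(f"That is {residual_low * BASE_PAIRS / 1e9:.2f}B to "
      f"{residual_high * BASE_PAIRS / 1e9:.2f}B base pairs")
```

Which is a lot of base pairs still left to figure out.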

There is promoter regulatory DNA, located ahead of and in close proximity to the proteins being encoded, and enhancer/inhibitor regulatory DNA, located a far DNA distance away from the proteins they regulate.

It seems we are seeing two different evolutionary time frames represented in the promoter vs. enhancer/inhibitor regulatory DNA. Whereas promoter DNA seems closely associated with protein-encoding DNA, enhancer DNA seems more like patches or hacks that fixed problems in the original promoter-protein-encoding DNA sequences, sort of like patch-Tuesday DNA that fixes problems with the original regulation activity.

While I am excited about project ENCODE’s results, I find the big science/big data aspects somewhat more interesting.

Genome Big Science/Big Data at work

Some stats from the ENCODE Project:

  • Almost 1650 experiments on around 180 cell types were conducted to generate data for the ENCODE project.   All told almost 12,000 files were analyzed from these experiments.
  • 15TB of data were used in the project
  • ENCODE project internal Wiki had 18.5K page edits and almost 250K page views.

With this much work going on around the world, data quality control was a necessary, ongoing consideration. It took until about halfway into the project before they figured out how to define and assess data quality from experiments. What emerged was a set of published data standards (see the data quality UCSC website) used to determine whether experimental data were to be accepted or rejected as input to the project. In the end, they retrospectively applied the data quality standards to the earlier experiments and had to jettison some that were scientifically important but exhibited low data quality.

There was a separation between the data generation team (experimenters) and the data analysis team. The data quality guidelines represented a key criterion governing the interactions between these two teams.

Apparently, the real analysis began when they started layering the base-level experiments on top of one another. This layering activity led researchers to further identify the interactions and associations between regulatory DNA and protein-encoding DNA.

All the data from the ENCODE project have been released and are available to anyone interested. They also provide search and browser capabilities for the data. All this can be found on the UCSC website. Further, from this same site one can download the software tools used to analyze, browse and search the data if necessary.

This multi-year project had an interesting management team that created a “spine of leadership”.  This team consisted of a few leading scientists and a few full time scientifically aware project officers that held the project together, pushed it along and over time delivered the results.

There were also a set of elaborate rules that were crafted so that all the institutions, researchers and management could interact without friction.  This included rules guiding data quality (discussed above), codes of conduct, data release process, etc.

What, no Hadoop?

What I didn’t find was any details on the backend server, network or storage used by the project, or on the generic data analysis tools. I suspect Hadoop, MapReduce, HBase, etc. were somehow involved but could find no reference to them.

I expected that, with the different experiments and the wide variety of data fusion going on, there would be some MapReduce scripting to transcribe the data so it could be further analyzed by other project tools. Alas, I didn’t find any information about these tools in the 30+ research papers published in the last week or so.

It looks like the genomic analysis tools used in the ENCODE project are all open source. They use the OpenHelix project deliverables. But even a search of the project didn’t reveal any Hadoop references.


The ENCODE pilot project (2003-2007) cost ~$53M, the full ENCODE project’s recent results cost somewhere around $130M, and they are now looking to the next stage of the ENCODE project, estimated to cost ~$123M. Of course, there are thousands more human cell types that need to be examined and ~30% more DNA that needs to be figured out. But this all seems relatively straightforward now that the ENCODE project has laid out an architectural framework for human DNA.

Anyone out there who knows more about the data processing/data analytics side of the ENCODE project, please drop me a line. I would love to hear more about it, or you can always comment here.


Image: From Project Encode, Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

DNA as storage, the end of evolution – part 2

I talked about DNA programming/computing previously (see my DNA computing and the end of natural evolution post), and today we have an example of another step along this journey. A new story in today’s Science News, titled DNA used as rewriteable data storage in cells, discusses another capability needed for computation, namely information storage.

The new synthetic biology “logic” is able to record, erase and overwrite (DNA) data in an E. coli cell.  DNA information storage like this brings us one step closer to a universal biologic Turing machine or computational engine.

Apparently the new process uses enzymes to “flip” a small segment of DNA so it reads backwards and then, with another set of enzymes, flip it back again. With another application of synthetic biology, they were able to have the cell fluoresce in different colors depending on whether the DNA segment was reversed or in its normal orientation.

To top it all off, the DNA data storage device is inheritable. Scientists showed that the data device was still present in the 100th generation of the cell they originally modified. How’s that for persistent storage?
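A toy model of this flip-as-storage idea, sketched as code (my own illustrative caricature: the real system uses recombinase enzymes in a living cell, not string reversal):

```python
# Toy model of rewriteable DNA storage: one bit is stored as the
# orientation of a DNA segment (illustrative only; the real system
# uses recombinase enzymes, not string manipulation).
def reverse_complement(seq: str) -> str:
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(seq))

class DnaBit:
    def __init__(self, segment: str):
        self.segment = segment
        self.flipped = False  # forward orientation = bit 0

    def flip(self):
        # An enzyme "writes" by inverting the segment in place
        self.segment = reverse_complement(self.segment)
        self.flipped = not self.flipped

    def read(self) -> str:
        # The cell fluoresces a different color per orientation
        return "green" if not self.flipped else "red"

bit = DnaBit("ATGCCTA")
print(bit.read())   # forward orientation: green
bit.flip()
print(bit.read())   # reversed orientation: red
bit.flip()          # rewriteable: flip back again
print(bit.read())   # green again
```

And, unlike this Python object, the biological version gets copied into every daughter cell.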

The universal biological Turing machine

Let’s see, my universal Turing machine parts list includes:

  • Tape or infinite memory device = DNA memory device – Check (today’s post; well, maybe not infinite, but certainly single bits today, bytes next year, so it’s only a matter of time before it’s KB)
  • Read head or ability to read out memory information = biological read head – Check (today’s post; it can fluoresce, therefore it can be read)
  • State register = biological counter – Check (seems to have been discovered in 2009; see the Science News article Engineered DNA counts it out, don’t know how I missed that)
  • State transition table or program = biological programming – Check (previous post plus today’s post: able to compute a new state from a given previous state and current data, and write or rewrite data).

As far as I can tell, this means we could construct an equivalent to a universal Turing machine with today’s synthetic biology. Which of course means we could perform just about any computation ever conceived within a single cell AND all generations of the cell would inherit this ability.
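The parts list above maps directly onto a software Turing machine. Here is a minimal sketch in code (the bit-flipping “program” is just an illustrative example):

```python
# Minimal Turing machine skeleton: the same four parts as the list
# above (tape, read head, state register, transition table).
def run_turing_machine(program, tape, state="start", steps=100):
    """program maps (state, symbol) -> (new_state, write_symbol, move)."""
    tape = dict(enumerate(tape))  # "infinite" tape as a sparse dict
    head = 0                      # read head position
    for _ in range(steps):
        if state == "halt":
            break
        symbol = tape.get(head, "_")                   # read
        state, write, move = program[(state, symbol)]  # transition table
        tape[head] = write                             # write back
        head += 1 if move == "R" else -1
    return "".join(tape[i] for i in sorted(tape)).strip("_")

# Example program: flip every bit on the tape (a one-state "NOT" machine)
flip = {
    ("start", "0"): ("start", "1", "R"),
    ("start", "1"): ("start", "0", "R"),
    ("start", "_"): ("halt", "_", "R"),
}
print(run_turing_machine(flip, "1011"))  # prints 0100
```

Swap the dictionaries and strings for enzymes and DNA segments and you have the biological version the list above describes.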

End of natural evolution, …

Gosh, the possibilities of this new synthetic biological Turing machine are both frightening and astonishing. My original post talked about how adding ECC-like functionality plus an ECC codeword to the human DNA strand would spell the end of natural evolution for our species.

I suppose the one comforting thought is that flipping DNA segments takes hours rather than nanoseconds, which means biological computation will never displace electronic/optronic computation. But biological computation really doesn’t have to. All it has to do is repair DNA mutations over the course of days, weeks and/or years, before a mutation has a chance to propagate, in order to end natural evolution.

…,  the dawn of un-natural evolution

Of course, with such capabilities, “un-natural” or programmed evolution is quite possible, but is it entirely desirable? With such capabilities we could readily change a cell’s DNA to whatever we desire it to be.

My real problem is its inheritability. It’s one thing to muck with a person’s genome; it’s another thing to muck with their children’s, children’s, children’s, … DNA.

Let’s say you were able to change someone’s DNA to become a super-athlete, super-brain or super-beautiful/handsome person. (Moving from a single cell’s DNA to a whole person’s is a leap, but not outside the realm of possibility.) Over time, any such changes would accumulate and could confer a seemingly unassailable advantage on an individual’s gene line.

There’s probably some time to think these things through and set up some sort of policies, guidelines, and/or regulations environment around the use of the technology before capabilities get out of hand.

In my mind this goes well beyond genetically modified organisms (GMOs), which are just static changes to a gene line. Programming gene lines to repair DNA, alter DNA, or even make better copies seems to me an order-of-magnitude increase in capability, taking us to genetically programmed organisms with the potential to end evolution itself.

We need to have some serious discussions before it goes that far.


Image: E. coli GFP by KitKor

DNA computing and the end of natural evolution

DNA Molecule Arrangement in the Chip

Read an article the other day in the Economist on how researchers are now performing computation using DNA.  The intent is to someday come up with small biologic computers that can be inserted into cells/organisms which can cure or kill cells that are in trouble and leave the rest alone.

Computing soup?!

Research in the area of molecular computing has been going on since 1994, when a scientist created a DNA-based solution to compute an answer to a specified traveling salesman problem.

In those days the answer was derived from running a centrifuge on the end-product soup of DNA strings and extracting the answer from the resultant gel matrix.

Molecular computing redefined

Since then, there have been significant improvements in DNA computing. Currently, most approaches are based on DNA strand displacement. Today’s molecular computers consist of free-floating DNA or RNA snippets. A logic gate is made up of two strands, one of which is the “computational logic” and the other an “output signal”. In addition to the logic gate, there is another DNA/RNA strand which is an “input signal”, almost like input data. Input signals are matched up to a specific logic gate and cause the output signal snippet to be detached, creating yet another input signal for other computations cascading down the pipeline.
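This gate/input/output cascade can be caricatured in a few lines of code (an abstract model of strand displacement, not the real chemistry or kinetics):

```python
# Abstract caricature of DNA strand displacement: a gate holds a bound
# output strand; a free input strand matching the gate's recognition
# sequence displaces and releases the output, which can then act as
# input to a downstream gate. (Not real chemistry, just the dataflow.)
from collections import deque

def run_cascade(gates, inputs):
    """gates: dict mapping a recognition strand -> the output strand
    released when a matching input strand arrives."""
    free = deque(inputs)   # free-floating input strands in the "soup"
    released = []
    while free:
        strand = free.popleft()
        if strand in gates:
            output = gates.pop(strand)  # a gate is consumed once triggered
            released.append(output)
            free.append(output)         # output cascades downstream
    return released

# Two-stage cascade: input A triggers gate 1, whose output triggers gate 2
gates = {"A": "B", "B": "C"}
print(run_cascade(gates, ["A"]))  # prints ['B', 'C']
```

In the real soup, of course, thousands of copies of each strand react in parallel and the “answer” is a concentration, not a list.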

DNA-RNA based digital logic

2-bit_ALU

By doing all this, researchers have been able to create DNA snippets that perform various logical computing operations, such as AND, OR and NOT gates, and to produce the signal pathways that connect them in a computational sequence or “program”.

The molecular automata all look like elementary electronic circuits made up of base-level logic gates to me, but just as in electronic digital logic, it seems to get the job done. One gets a computation done by adding thousands of copies of the logic gates and input sequences together and somehow assaying the end result many hours later.

Using these capabilities, they have created DNA programs made up of 74 different DNA strands that can calculate the square roots of 4-bit numbers.

Next, they tied artificial neurons, which fire when input signals hit a certain level, together with a soup of 114 different DNA strands to do rudimentary pattern recognition. They then “programmed” their DNA neural net to recognize Yes/No answers provided by different scientists. The report said that the neural net was able to get the correct answer every time but took 8 hours to perform the calculations.
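The neuron described above is essentially a weighted-threshold unit. A toy software version (the weights and threshold here are made up purely for illustration, not taken from the actual DNA circuit):

```python
# Toy threshold "neuron" in the spirit of the DNA neural net described
# above: it fires (releases an output strand) when the weighted sum of
# input strand concentrations crosses a threshold. Weights and threshold
# are invented for illustration.
def dna_neuron(inputs, weights, threshold):
    return sum(i * w for i, w in zip(inputs, weights)) >= threshold

# Rudimentary yes/no pattern recognition over 4 input "strands"
weights = [3, 2, 2, 1]
print(dna_neuron([1, 1, 0, 0], weights, threshold=4))  # True  ("yes")
print(dna_neuron([0, 0, 1, 1], weights, threshold=4))  # False ("no")
```

The DNA version takes 8 hours to evaluate this one line of arithmetic, which puts the speed gap with electronics in perspective.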

There are a couple of groups working on a programming language and a simulator tool for DNA or molecular computing called the DNA Strand Displacement (DSD) tool.

The report went on to say that another set of researchers is fabricating synthetic genes which, when introduced into a cell, could be used to trick the cell into producing the cellular computer itself.

The end of natural evolution?

The end game for all this is to create a computational device that can somehow be injected into tissue cells which would identify “sick” cells then cure or destroy them.

A couple of years ago, I was waiting in a doctor’s office for something or another and penned a poem on the end of human evolution involving ECC combined with DNA.  (No, you can’t see the poem.)

You see, in computers today there is a computational device called an ECC, or error correcting code: a circuit plus a special code word that can be appended to a sequence of data, which together can then be used to correct for errors in transmission or storage of that data.

Once someone can build digital logic out of DNA-RNA, it’s not a big leap to build an ECC circuit. Once the circuit is ready, anyone could potentially have their DNA modified to have an appropriate ECC codeword appended to it. With DNA + ECC codeword and an active ECC circuit in the cell, it’s quite possible that any single, double, or triple mutation could be detected and fixed inside a cell. Of course, ECC can go beyond triple error detection if needed. Also, Reed-Solomon and other erasure codes can go much beyond that.
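To make the ECC idea concrete, here is the classic Hamming(7,4) single-error-correcting code in software, the same detect-and-fix behavior a cellular “mutation repair” circuit would need (this is the standard textbook code, nothing DNA-specific):

```python
# Hamming(7,4): encode 4 data bits with 3 parity bits; any single
# bit flip (a "mutation") can be located and corrected.
def hamming_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]  # codeword positions 1..7

def hamming_correct(c):
    c = list(c)
    # Recompute the parity checks; the syndrome is the error position
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:                     # nonzero -> flip the bad bit back
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]  # extract the 4 data bits

data = [1, 0, 1, 1]
codeword = hamming_encode(data)
codeword[4] ^= 1                  # simulate a single "mutation"
print(hamming_correct(codeword))  # prints [1, 0, 1, 1], repaired
```

A DNA ECC circuit would do the biological equivalent: recompute the checks against the appended codeword and flip the mutated base back.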

After such a device was incorporated into the human genome, it would seem to signal the end to natural evolution, at least for humans.



Dits, codons & chromosomes – the storage of life

All is One, the I-ching and Genome case by TheAlieness (cc) (from flickr)

I was thinking the other day that DNA could easily be construed as information storage for life. For example, DNA uses 4 distinct bases (A, C, G, & T; U replaces T in RNA) as its basic information unit. I would call these units of DNA information Dits (for DNA digITs), and as such, DNA uses a base-4 number system.

Next in data storage parlance comes the analogue of the binary byte that holds 8 bits. In the case of DNA, the term to use is the codon, a three-base (or 3-Dit) unit which codes for one of the 20 amino acids used in life, not unlike how a byte of data defines an ASCII character. With 64 possibilities in a codon, there is room for some amino acid encoding overlap and for encoding other mechanisms beyond just amino acids (see chart above for amino-acid codon encoding). I envision something akin to ASCII non-character codes such as STX (DNA-AUG), ETX (DNA-UAA, -UAG & -UGA), etc., which for DNA would define non-amino-acid-encoding codons.
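The Dit/codon arithmetic maps neatly onto bits, as a quick sketch shows (the start/stop codons are the ones mentioned above; treating codons as base-4 numbers is just my own illustration):

```python
# A Dit (DNA digit) is base-4, so it carries 2 bits; a 3-Dit codon
# carries 6 bits and has 4**3 = 64 possible values, more than enough
# to cover 20 amino acids plus start/stop "control codes".
import math

DIT_VALUES = {"A": 0, "C": 1, "G": 2, "U": 3}  # base-4 digits (RNA alphabet)

def codon_to_number(codon: str) -> int:
    """Treat a 3-Dit codon as a base-4 number (0..63)."""
    n = 0
    for dit in codon:
        n = n * 4 + DIT_VALUES[dit]
    return n

bits_per_dit = math.log2(len(DIT_VALUES))     # 2.0
print(f"bits per codon: {3 * bits_per_dit}")  # 6.0
print(f"codon values:   {4 ** 3}")            # 64

# The "control code" codons mentioned above (start and stops):
for codon in ("AUG", "UAA", "UAG", "UGA"):
    print(codon, "->", codon_to_number(codon))
```

So a codon is a 6-bit “character”, and the 44 spare values beyond the 20 amino acids leave plenty of room for overlap and control codes.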

DNA is stored in two strands, each one a complementary image of the other. In data storage terminology we would consider this a form of data protection somewhat similar to RAID1. Perhaps we should call it -RAID1, as it’s complementary storage.
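The RAID1 analogy can be made literal in a few lines (just an analogy: base pairing, not disk mirroring). Either strand can be rebuilt from the other:

```python
# The "RAID1" analogy made literal: either DNA strand can be rebuilt
# from the other, because each is the reverse complement of its pair.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mirror(strand: str) -> str:
    """Return the complementary strand (the 'mirror copy')."""
    return "".join(COMPLEMENT[b] for b in reversed(strand))

strand = "ATGGTCAC"
backup = mirror(strand)          # the complementary "RAID1" copy
assert mirror(backup) == strand  # lose one strand, rebuild from the other
print(strand, "<->", backup)
```

Hence the -RAID1 quip: the mirror copy is a complement rather than an identical duplicate.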

DNA chromosomes seem to exist primarily as a means to read out codons. It seems chromosomes are split, read sequentially, and transcribed into intermediate mRNA, and then these intermediate mRNA forms, with the help of enzymes, are converted into the proteins of life. Chromosomes would correspond to data blocks in standard IT terminology, as they are read as a single unit and read sequentially. However, they are variable in length and seem to carry with them some historical locality-of-reference information, though this is only my perception. mRNA might be considered a storage cache for DNA data, although it’s unclear whether mRNA is read multiple times or used just once.

The cell, or rather the cell nucleus, could be construed as an information (data) storage device where DNA blocks or chromosomes are held. However, when it comes to Dits, as with bits, there are multiple forms of storage devices. For example, it turns out that DNA can exist outside of the cell nucleus in the form of mitochondrial DNA. I like to think of mitochondrial DNA as similar to storage device firmware, as it encodes for the proteins needed to supply energy to the cell.

The similarity to data storage starts to break down at this point. DNA is mostly WORM (Write-Once-Read-Many times) tape-like media and is not readily changed except through mutation/evolution (although recent experiments to construct artificial DNA belie this). As such, DNA is mostly exact copies of other DNA within an organism or across organisms within the same species (except for minor individualization changes). Across species, DNA is readily copied, and we find that human DNA has a high (94%) similarity to chimp DNA and a lesser similarity to other mammalian DNA.

For DNA, I see nothing like storage subsystems that hold multiple storage devices with different (data) information on them. Perhaps seed banks might qualify for plant DNA, but these seem a somewhat artificial construct for life storage subsystems. However, as I watch the dandelion puffs pass by my back porch, there seems to be some rough semblance of cloud storage going on: they look omnipresent, ephemeral, but with active propagation (or replication), not unlike the cloud storage that exists today. Perhaps my environmentalist friends would call the ecosystem a life storage subsystem, as it retains multiple DNA instances or species.

Science tells us that human DNA has ~3B (3×10**9) base pairs or ~1B codons. To put this into data storage perspective, at 2 bits per base pair (a Dit is a base-4 digit), human DNA holds ~750MB of data. Density wise, at ~0.34nm per base pair, human DNA laid end to end is about a meter long, which works out to roughly 6 million bits per mm, orders of magnitude beyond the linear density of LTO tape.

It’s fairly amazing to me that something as marvelous as a human being can be constructed using only ~750MB of data. I now have an unrestrained urge to copy my DNA so I can back it up offline to some other non-life media. But it’s not clear what I could do with it other than that, and restore seems somewhat problematic at best…
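For the record, the capacity arithmetic works out as follows, assuming 2 bits per base pair (one base-4 Dit) and ~0.34nm per base pair along the helix:

```python
# Raw information capacity of the human genome, treating each base
# pair as one base-4 Dit (2 bits).
BASE_PAIRS = 3_000_000_000       # ~3B base pairs
BITS_PER_BP = 2                  # log2(4) bits per Dit

total_bits = BASE_PAIRS * BITS_PER_BP            # 6e9 bits
total_mb = total_bits / 8 / 1e6                  # ~750 MB
length_m = BASE_PAIRS * 0.34e-9                  # ~0.34nm per bp
bits_per_mm = total_bits / (length_m * 1000)     # linear density

print(f"{total_mb:.0f} MB, {length_m:.2f} m, {bits_per_mm:,.0f} bits/mm")
```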