Big science/big data ENCODE project decodes “Junk DNA”

Project ENCODE (ENCyclopedia of DNA Elements) results were recently announced. The ENCODE project was done by a consortium of over 400 researchers from 32 institutions and has deciphered the functionality of so called Junk DNA in the human genome. They have determined that junk DNA is actually used to regulate gene expression.  Or junk DNA is really on-off switches for protein encoding DNA.  ENCODE project results were published by Nature,  Scientific American, New York Times and others.

The paper in Nature ENCODE Explained is probably the best introduction to  the project. But probably the best resource on the project computational aspects comes from these papers at Nature, The making of ENCODE lessons for BIG data projects by Ewan Birney and ENCODE: the human encyclopedia by Brendan Maher.

I have been following the Bioinformatics/DNA scene for some time now. (Please see Genome Informatics …, DITS, Codons, & Chromozones …, DNA Computing …, DNA Computing … – part 2).  But this is perhaps the first time it has all come together to explain the architecture of DNA and potentially how it all works together to define a human.

Project ENCODE results

It seems like there were at least four major results from the project.

  • Junk DNA is actually programming for protein production in a cell.  Scientists previously estimated that <3% of human DNA’s over 3 billion base pairs encode for proteins.  Recent ENCODE results seem to indicate that at least 9% of this human DNA and potentially, as much as 50% provide regulation for when to use those protein encoding DNA.
  • Regulation DNA undergoes a lot of evolutionary drift. That is it seems to be heavily modified across species. For instance, protein encoding genes seem to be fairly static and differ very little between species. On the the other hand, regulating DNA varies widely between these very same species.  One downside to all this evolutionary variation is that regulatory DNA also seems to be the location for many inherited diseases.
  • Project ENCODE has further narrowed the “Known Unknowns” of human DNA.  For instance, about 80% of human DNA is transcribed by RNA. Which means on top of the <3% protein encoding DNA and ~9-50% regulation DNA already identified, there is another 68 to 27% of DNA that do something important to help cells transform DNA into life giving proteins. What that residual DNA does is TBD and is subject for the next phase of the ENCODE project (see below).
  • There are cell specific regulation DNA.  That is there are regulation DNA that are specifically activated if it’s bone cell, skin cell, liver cell, etc.  Such cell specific regulatory DNA helps to generate the cells necessary to create each of our organs and regulate their functions.  I suppose this was a foregone conclusion but it’s proven now

There are promoter regulatory DNA which are located ahead and in close proximity to the proteins that are being encoded and enhancer/inhibitor regulatory DNA which are located a far DNA distance away from the proteins they regulate.

I believe it seems that we are seeing two different evolutionary time frames being represented in the promoter vs. enhancer/inhibitor regulatory DNA.  Whereas promoter DNA seem closely associated with protein encoding DNA, the enhancer DNA seems more like patches or hacks that fixed problems in the original promoter-protein encoding DNA sequences, sort of like patch Tuesday DNA that fixes problems with the original regulation activity.

While I am excited about Project ENCODE results. I find the big science/big data aspects somewhat more interesting.

Genome Big Science/Big Data at work

Some stats from the ENCODE Project:

  • Almost 1650 experiments on around 180 cell types were conducted to generate data for the ENCODE project.   All told almost 12,000 files were analyzed from these experiments.
  • 15TB of data were used in the project
  • ENCODE project internal Wiki had 18.5K page edits and almost 250K page views.

With this much work going on around the world, data quality control was a necessary, ongoing consideration.   It took about half way into the project before they figured out  how to define and assess data quality from experiments.   What emerged from this was a set of published data standards (see data quality UCSC website) used to determine if experimental data were to be accepted or rejected as input to the project.  In the end the retrospectively applied the data quality standards to the earlier experiments and had to jettison some that were scientifically important but exhibited low data quality.

There was a separation between the data generation team (experimenters) and the data analysis team.  The data quality guidelines represented a key criteria that governed these two team interactions.

Apparently the real analysis began when they started layering the base level experiments on top of one another.  This layering activity led to researchers further identifying the interactions and associations between regulatory DNA and protein encoding DNA.

All the data from the ENCODE project has been released and are available to anyone interested. They also have provided a search and browser capability for the data. All this can be found on the top UCSC website. Further, from this same site one can download the software tools used to analyze, browse and search the data if necessary.

This multi-year project had an interesting management team that created a “spine of leadership”.  This team consisted of a few leading scientists and a few full time scientifically aware project officers that held the project together, pushed it along and over time delivered the results.

There were also a set of elaborate rules that were crafted so that all the institutions, researchers and management could interact without friction.  This included rules guiding data quality (discussed above), codes of conduct, data release process, etc.

What no Hadoop?

What I didn’t find was any details on the backend server, network or storage used by the project or the generic data analysis tools.  I suspect Hadoop, MapReduce, HBase, etc. were somehow involved but could find no reference to this.

I expected with the different experiments and wide variety of data fusion going on that there would be some MapReduce scripting that would transcribe the data so it could be further analyzed by other project tools.  Alas, I didn’t find any information about these tools in the 30+ research papers that were published in the last week or so.

It looks like the genomic analysis tools used in the ENCODE project are all open source. They useh the OpenHelix project deliverables.  But even a search of the project didn’t reveal any hadoop references.


The ENCODE pilot project (2003-2007) cost ~$53M, the full ENCODE project’s recent results cost somewhere around $130M and they are now looking to the next stage of the ENCODE project estimated to cost ~$123M.  Of course there are 1000s of more human cell types that need to be examined and ~30% more DNA that needs to be figured out. But this all seems relatively straight forward now that the ENCODE project has laid out an architectural framework for human DNA.

Anyone out there that knows more about the data processing/data analytics side of the ENCODE project please drop me a line.  I would love to hear more about it or you can always comment here.


Image: From Project Encode, Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

New cloud storage and Hadoop managed service offering from Spring SNW

Strange Clouds by michaelroper (cc) (from Flickr)
Strange Clouds by michaelroper (cc) (from Flickr)

Last week I posted my thoughts on Spring SNW in Dallas, but there were two more items that keep coming back to me (aside from the tornados).  The first was a new startup called Symform in cloud storage and the other was an announcement from SunGard about their new Hadoop managed services offering.


Symform offers an interesting alternative on cloud storage that avoids the build up of large multi-site data centers and uses your desktop storage as a sort of crowd-sourced storage cloud, sort of bit-torrent cloud storage.

You may recall I discussed such a Peer-to-Peer cloud storage and computing services in a posting a couple of years ago.  It seems Symform has taken this task on, at least for storage.

A customer downloads (Windows or Mac) software which is installed and executes on your desktop.  The first thing you have to do after providing security credentials is to identify which directories will be moved to the cloud and the second is to tell whether you wish to contribute to Symform’s cloud storage and where this storage is located.  Symform maintains a cloud management data center which records all the metadata about your cloud resident data and everyone’s contributed storage space.

Symform cloud data is split up into 64MB blocks and encrypted (AES-256) using a randomly generated key (known only to Symform). Then this block is broken up into 64 fragments with 32 parity fragments (using erasure coding) added to the stream which is then written to 96 different locations.  With this arrangement, the system could potentially lose 31 fragments out of the 96 and still reconstitute your 64MB of data.  The metadata supporting all this activity sits in Symform’s data center.

Unclear to me what you have to provide as far as ongoing access to your contributed storage.  I would guess you would need to provide 7X24 access to this storage but the 32 parity fragments are there for possible network/power failures outside your control.

Cloud storage performance is an outcome of the many fragments that are disbursed throughout their storage cloud world. It’s similar to a bit torrent stream with all 96 locations participating in reconstituting your 64MB of data.  Of course, not all 96 locations have to be active just some > 64 fragment subset but it’s still cloud storage so data access latency is on the order of internet time (many seconds).  Nonetheless, once data transfer begins, throughput performance can be pretty high, which means your data should arrive shortly thereafter.

Pricing seemed comparable to other cloud storage services with a monthly base access fee and a storage amount fee over that.  But, you can receive significant discounts if you contribute storage and your first 200GB is free as long as you contribute 200GB of storage space to the Symform cloud.

Sungard’s new Apache Hadoop managed service

Hadoop Logo (from website)
Hadoop Logo (from website)

We are well aware of Sungard’s business continuity/disaster recovery (BC/DR) services, an IT mainstay for decades now. But sometime within the last decade or so Sungard has been expanding outside this space by moving into managed availability services.

Apparently this began when Sungard noticed the number of new web apps being deployed each year exceeded the number of client server apps. Then along came virtualization, which reduced the need for lots of server and storage hardware for BC/DR.

As evident of this trend, last year Sungard announced a new enterprise class computing cloud service.  But in last week’s announcement, Sungard has teamed up with EMC Greenplum to supply an enterprise ready Apache Hadoop managed service offering.

Recall, that EMC Greenplum is offering their own Apache Hadoop supported distribution, Greenplum HD.  Sungard is basing there service on this distribution. But there’s more.

In conjunction with Hadoop, Sungard adds Greenplum appliances.  With this configuration Sungard can load Hadoop processed and structured data into a Greenplum relational database for high performance data analytics.  Once there, any standard SQL analytics and queries can be used against to analyze the data.

With these services Sungard is attempting to provide a unified analytics service that spans all structured, semi-structured and unstructured data.


Probably more to Spring SNW but given my limited time on the exhibition floor and time in vendor discussions these and my previously published post are what I seem of most interest to me.

Archeology meets Big Data

Polynya off the Antarctic Coast by NASA Earth Observatory (cc) (From Flickr)
Polynya off the Antarctic Coast by NASA Earth Observatory (cc) (From Flickr)

Read an article yesterday about the use of LIDAR (light detection and ranging, Wikipedia) to map the residues of an pre-columbian civilization in Central America, the little know Purepecha empire, peers of the Aztecs.

The original study (seeLIDAR at Angamuco) cited in the piece above was a result of the Legacies of Resilience project sponsored by Colorado State University (CSU) and goes into some detail about the data processing and archeological use of the LIDAR maps.


LIDAR sends a laser pulse from an airplane/satellite to the ground and measures how long it takes to reflect back to the receiver. With that information and “some” data processing, these measurements can be converted to an X, Y, & Z coordinate system or detailed map of the ground.

The archeologists in the study used LIDAR to create a detailed map of the empire’s main city at a resolution of +/- 0.25m (~10in). They mapped about ~207 square kilometers (80 square miles) at this level of detail. In 4 days of airplane LIDAR mapping, they were able to gather more information about the area then they were able to accumulate over 25 years of field work. Seems like digital archeology was just born.

So how much data?

I wanted to find out just how much data this was but neither the article or the study told me anything about the size of the LIDAR map. However, assuming this is a flat area, which it wasn’t, and assuming the +/-.25m resolution represents a point every 625sqcm, then the area being mapped above should represent a minimum of ~3.3 billion points of a LIDAR point cloud.

Another paper I found (see Evaluation of MapReduce for Gridding LIDAR Data) said that a LIDAR “grid point” (containing X, Y & Z coordinates) takes 52 bytes of data.

Given the above I estimate the 207sqkm LIDAR grid point cloud represents a minimum of ~172GB of data. There are LIDAR compression tools available, but even at 50% reduction, it’s still 85GB for 210sqkm.

My understanding is that the raw LIDAR data would be even bigger than this and the study applied a number of filters against the LIDAR map data to extract different types of features which of course would take even more space. And that’s just one ancient city complex.

With all the above the size of LIDAR raw data, grid point fields, and multiple filtered views is approaching significance (in storage terms). Moving and processing all this data must also be a problem. As evidence, the flights for the LIDAR runs over Angamuco, Mexico occurred in January 2011 and they were able to analyze the data sometime that summer, ~6 months late. Seems a bit long from my perspective maybe the data processing/analysis could use some help.

Indiana Jones meets Hadoop

That was the main subject of the second paper mentioned above done by researchers at the San Diego Supercomputing Center (SDSC). They essentially did a benchmark comparing MapReduce/Hadoop running on a relatively small cluster of 4 to 8 commodity nodes against an HPC cluster (running 28-Sun x4600M2 servers, using 8 processor, quad core nodes, with anywhere from 256 GB to 512GB [only on 8 nodes] of DRAM running a C++ implementation of the algorithm.

The results of their benchmarks were that the HPC cluster beat the Hadoop cluster only when all of the LIDAR data could fit in memory (on a DRAM per core basis), after that the Hadoop cluster performed just as well in elapsed wall clock time. Of course from a cost perspective the Hadoop cluster was much more economical.

The 8-node, Hadoop cluster was able to “grid” a 150M LIDAR derived point cloud at the 0.25m resolution in just a bit over 10 minutes. Now this processing step is just one of the many steps in LIDAR data analysis but it’s probably indicative of similar activity occurring earlier and later down the (data) line.


Let’s see 172GB per 207sqkm, the earth surface is 510Msqkm, says a similar resolution LIDAR grid point cloud of the entire earth’s surface would be about 0.5EB (Exabyte, 10**18 bytes). It’s just great to be in the storage business.


Hadoop – part 2

Hadoop Graphic (c) 2011 Silverton Consulting
Hadoop Graphic (c) 2011 Silverton Consulting

(Sorry about the length).

In part 1 we discussed some of Hadoop’s core characteristics with respect to the Hadoop distributed file system (HDFS) and the MapReduce analytics engine. Now in part 2 we promised to discuss some of the other projects that have emerged to make Hadoop and specifically MapReduce even easier to use to analyze unstructured data.

Specifically, we have a set of tools which use Hadoop to construct a database like out of unstructured data.  Namely,

  • Casandra – which maps HDFS data into a database but into a columnar based sparse table structure rather than the more traditional relational database row form. Cassandra was written by Facebook for Mbox search. Columnar databases support a sparse data much more efficiently.  Data access is via a Thrift based API supporting many languages.  Casandra’s data model is based on column, column families and column super-families. The datum for any column item is a three value structure and consists of a name, value of item and a time stamp.  One nice thing about Cassandra is that one can tune it for any consistency model one requires, from no consistency to always consistent and points inbetween.  Also Casandra is optimized for writes.  Cassandra can be used as the Map portion of a MapReduce run.
  • Hbase – which also maps HDFS data into a database like structure and provides Java API access to this DB.  Hbase is useful for million row tables with arbitrary column counts. Apparently Hbase is an outgrowth of Google’s Bigtable which did much the same thing only against the Google file system (GFS).  In contrast to Hive below Hbase doesn’t run on top of MapReduce rather it replaces MapReduce, however it can be used as a source or target of MapReduce operations.  Also, Hbase is somewhat tuned for random access read operations and as such, can be used to support some transaction oriented applications.  Moreover, Hbase can run on HDFS or Amazon S3 infrastructure.
  • Hive – which maps a” simple SQL” (called QL) ontop of a data warehouse built on Hadoop.  Some of these queries may take a long time to execute and as the HDFS data is unstructured the map function must extract the data using a database like schema into something approximating a relational database. Hive operates ontop of Hadoop’s MapReduce function.
  • Hypertable – is a Google open source project which is a  c++ implementation of BigTable only using HDFS rather than GFS .  Actually Hypertable can use any distributed file systemand and is another columnar database (like Cassandra above) but only supports columns and column families.   Hypertable supports both a client (c++) and Thrift API.  Also Hypertable is written in c++ and is considered the most optimized of the Hadoop oriented databases (although there is some debate here).
  • Pig – is a dataflow processing (scripting) language built ontop of Hadoop which supports a sort of database interpreter for HDFS  in combination with an interpretive analysis.  Essentially, Pig uses the scripting language and emits a dataflow graph which is then used by MapReduce to analyze the data in HDFS.  Pig supports both batch and interactive execution but can also be used through a Java API.

Hadoop also supports special purpose tools used for very specialized analysis such as

  • Mahout – an Apache open source project which applies machine learning algorithms to HDFS data providing classification, characterization, and other feature extraction.  However, Mahout works on non-Hadoop clusters as well.  Mahout supports 4 techniques: recommendation mining, clustering, classification, and itemset machine learning functions.  While Mahout uses the MapReduce framework of Hadoop, it doesnot appear that Mahout uses Hadoop MapReduce directly but is rather a replacement for MapReduce focused on machine learning activities.
  • Hama – an Apache open source project which is used to perform paralleled matrix and graph computations against Hadoop cluster data.  The focus here is on scientific computation.  Hama also supports non-Hadoop frameworks including BSP and Dryad (DryadLINQ?). Hama operates ontop of MapReduce and can take advantage of Hbase data structures.

There are other tools that have sprung up around Hadoop to make it easier to configure, test and use, namely

  • Chukwa – which is used for monitoring large distributed clusters of servers.
  • ZooKeeper – which is a cluster configuration tool  and distributed serialization manager useful to build large clusters of Hadoop nodes.
  • MRunit – which is used to unit test MapReduce programs without having to test it on the whole cluster.
  • Whirr – which extends HDFS to use cloud storage services, unclear how well this would work with PBs of data to be processed but maybe it can colocate the data and the compute activities into the same cloud data center.

As for who uses these tools, Facebook uses Hive and Cassandra, Yahoo uses Pig, Google uses Hypertable and there are myriad users of the other projects as well.  In most cases the company identified in the previous list developed the program source code originally, and then contributed it to the Apache for use in the Hadoop open source project. In addition, those companies continue to fix, support and enhance these packages as well.

Hadoop – part 1

Hadoop Logo (from website)
Hadoop Logo (from website)

BIGData is creating quite a storm around IT these days and at the bottom of big data is an Apache open source project called Hadoop.

In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).

What is Hadoop and why is it important

Ok, lot’s of money, time and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?

Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most BIGData analytic activities because it bundle two sets of functionality most needed to deal with large unstructured datasets. Specifically,

  • Distributed file system – Hadoop’s Distributed File System (HDFS) operates using one (or two) meta-data servers (NameNode) with any number of data server nodes (DataNode).  These are essentially software packages running on server hardware which supports file system services using a client file access API. HDFS supports a WORM file system, that splits file data up into segments (64-128MB each), distributes segments to data-nodes, stored on local (or networked) storage and keeps track of everything.  HDFS provides a global name space for its file system, uses replication to insure data availability, and provides widely distributed access to file data.
  • MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (map function) which can then be later sorted, analysed and summarized (reduce function) into some worthy output.  MapReduce uses the HDFS to access file segments and to store reduced results.  MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes which execute the map and reduce processes requested. These activities execute on the same nodes that are used for the HDFS.

Hadoop explained

It just so happens that HDFS is optimized for sequential read access and can handle files that are 100TBs or larger. By throwing more data nodes at the problem, throughput can be scaled up enormously so that a 100TB file could literally be read in minutes to hours rather than weeks.

Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of  data onto these same DataNodes, thus distributing the processing of the data, reducing time to process this data to minutes.

HDFS can also be “rack aware” and as such, will try to allocate file segment replicas to different racks where possible.  In addition, replication can be defined on a file basis and normally uses 3 replicas of each file segment.

Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides.  In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.

MapReduce programming is somewhat new, unique, and complex and was an outgrowth of Google’s MapReduce process.  As such, there have been a number of other Apache open source projects that have sprouted up namely, Cassandra, Chukya, Hbase, HiveMahout, and Pig to name just a few that provide easier ways to automatically generate MapReduce programs.  I will try to provide more information on these and other popular projects in a follow on post.

Hadoop fault tolerance

When an HDFS node fails, the NameNode detects a missing heartbeat signal and once detected, the NameNode will recreate all the missing replicas that resided on the failed DataNode.

Similarly, the MapReduce JobTracker node can detect that a TaskTracker has failed and allocate this work to another node in the cluster.  In this fashion, work will complete even in the face of node failures.

Hadoop distributions, support and who uses them

Alas, as in any open source project having a distribution that can be trusted and supported can take much of the risk out of using them and Hadoop is no exception. Probably the most popular distribution comes from Cloudera which contains all the above named projects and more and provides support.  Recently EMC announced that they will supply their own distribution and support of Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters but use other distributions (mostly Cloudera).

As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo including EBay Facebook, Google and Twitter to highlight just a few popular ones. There is a list on the Apache’s Hadoop website which provides more detail if interested.  The list indicates some of the Hadoop configurations and shows anywhere from a 18 node cluster to over 4500 nodes with multiple PBs of data storage.  Most of the big players are also active participants in the various open source projects around Hadoop and much of the code came from these organizations.


I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was.  This is my first attempt – hopefully, more to follow.