Data consistency – Silverton Consulting

MinIO Strong Consistency and object gateways #CFD23

Posted on June 18, 2025June 18, 2025 by Ray in Data consistency, Object storage

I attended Cloud Field Day 23 (#CFD23) a couple of weeks back and MinIO presented on their AIStor for one of the CFD sessions (see videos here). During the session the speaker mentioned that object storage gateways were not strongly consistent.

Object storage has been around a long time but AWS S3 took it from a niche solution to a major storage type. Originally S3 was “eventually consistent” that is if you wrote an object and the system returned to you, that object wouldn’t necessarily be all there (that is stored in the object storage system) until sometime “later” .

It took AWS from Mar’2006 until Dec’2020 to offer “strong consistency” for objects that is when the system returned to a user after having created a new object, that object was guaranteed to have been written to the object storage before the return happened.

What the MinIO presenter was saying is that object storage gateways didn’t offer strong consistency for objects being written to them. Most object storage gateways are built ontop of enterprise class file systems. For these systems to be not strongly consistent seemed pretty unusual to me so I asked them to clarify why they said that.

The speaker turned to Anand Babu Periasamy (AB), CEO & Co-Founder, MinIO to answer my question. Essentially AB answered me that it was all about the size of the volumes (or LUNs) that underpin the objects and buckets in the file object storage gateway system.

For our post we will set aside the few object storage systems based solely ontop of block storage (are there any of these) and will just consider the object storage gateways built ontop of file systems.

In S3, there are Single (object) PUT requests and multi-part (object) uploads to create objects. According to MinIO as objects and their buckets get bigger, object gateways don’t maintain strong consistency. Bucket sizes for S3 Objects storage can be on the order of many PBs.

As for the Objects themselves, AWS S3 supports a max object size of 5TiB, using multipart uploads and Single PUT maximum of 5GiB. According to AWS, during S3 multi-part uploads, the object is not readable until after the last multi-part upload is replied to, only then is the full (5TiB) object available and thereby offers strong consistency over multi-part uploads. For Single PUT object creations, the object is strongly consistent after the PUT is returned to.

Here’s single PUT and mult-part upload object size constraints for select object storage gateways:

NetApp ONTAP offers object storage protocols for their Data ONTAP storage systems. The maximum NetApp S3 put size is 5TiB but anything over 5GiB will be flagged to encourage multi-part uploads
Hitachi VSP One Object storage, Object maximum S3 PUT size is 5GiB and an object is limited to a maximum of 5TiB
Dell PowerScale ObjectScale maximum S3 PUT size is 5GiB, and the maximum S3 object size is 4.4TiB.
VAST Data supports S3 objects and it’s maximum S3 PUT size is 5GiB, and the maximum object size is 5TiB.
WekaFS maximum S3 PUT size is 5GiB and maximum S3 object size is 5TiB.
Scality (also presented at CFD23, videos here) RING and Artesca maximum S3 PUT size is 5GiB, and implies that they support 10K parts in a multi-part upload so a maximum object size of ~50TiB
Qumulo (also presented at CFD23, videos here) maximum S3 PUT size is 5GiB, and for S3 compatibility their maximum object size is 5TiB. But they also state that they can actually support an object of ~50TiB (10K multipart uploads * 5GiB)
MinIO S3 maximum object size is 50TiB and their maximum S3 PUT size is 5TiB.

I’m probably missing a couple of other storage systems that support S3 objects but you get the drift. It’s important to note that just about every object system gateway offers ful S3 compatible single PUT size of 5GiB (NetApp & MinIO offer 5TiB though!). And they all seem to offer 5TiB object sizes (with few exceptions Dell PowerScale ObjectScale’s 4.4TiB, Scality, Qumulo & MinIO 50TiB).

The question needs to be asked is if a customer uses multi-part uploads of 5GiB for a 5TiB object are they guaranteed strong consistency in an object storage gateway

According to everything I see the answer seem to be yes. Although outside of AWS S3 and MinIO I don’t see anything that specifically states they are strongly consistent.

However, most of object storage systems map objects to files and single file creates are strongly consistent for these systems. So for a single (5GiB) PUT the answer would be obviously YES. And most of these systems also support much larger files than 5GiB, so I would suspect that multi-part uploads would also support strong consistency as they would be mapped to a multi-write/single file in these systems. So my answer to the question of whether multi-part uploads are strongly consistent for object storage gateways is a guarded YES.

Perhaps what MinIO was meaning to say was that if you do a 5TiB part for a 50TiB multipart upload will it be strongly consistent. I would have to answer NO for anyone of these systems other than Qumulo, Scality & MinIO because they don’t support anything that large. But this is way (10X for object sizes & 1000X for cumulative multi-part upload part sizes) beyond AWS S3’s current capabilities, so I don’t see how this is relevant for those customers wishing to retain AWS S3 compatibility

And I guess I don’t see the relevance of bucket size to strong consistency for an object. I suppose there may be problems as you populate a bucket with big objects and run out of room during a (multi-part) upload/PUT. But it’s clear to me that when that occurred, the object storage gateway would indicate that the object store failed and then consistency doesn’t matter.

So I have to disagree with my friends at MinIO, regarding the lack of strong consistency with the proviso that all these systems support strong consistency for AWS S3 object sizes.

Comments?

Living forever – the end of evolution part-3

Posted on April 15, 2022April 15, 2022 by Ray in Data consistency, Data integrity, Data precision, data protection, Data retention, Storage longevity, Storage reliability, Strategic Inflection Points

Read an article yesterday on researchers who had been studying various mammals and trying to determine the number of DNA mutations they accumulate at about the time they die. The researchers found that after about 800 mutations for mole rats, they die, see Nature article Somatic mutation rates scale with lifespan across mammals and Telegraph article reporting on the research, Mystery of why humans die around 80 may finally be solved.

Similarly, at around 3500 mutations humans die, at around 3000 mutations dogs die and at around 1500 mutations mice die. But the real interesting thing is that the DNA mutation rates and mammal lifespan are highly (negatively) correlated. That is higher mutation rates lead to mammals with shorter life spans.

C. Linear regression of somatic substitution burden (corrected for analysable genome size) on individual age for dog, human, mouse and naked mole-rat samples. Samples from the same individual are shown in the same colour. Regression was performed using mean mutation burdens per individual. Shaded areas indicate 95% confidence intervals of the regression line. A shows microscopic images of sample mammalian cels and the DNA strands examined and B shows the distribution of different types of DNA mutations (substitutions or indels [insertion/deletions of DNA]).

The Telegraph article seems to imply that at 800 mutations all mammals die. But the Nature Article clearly indicates that death is at different mutation counts for each different type of mammal.

Such research show one way on how to live forever. We have talked about similar topics in the distant past see …-the end of evolution part 1 & part 2

But in any case it turns out that one of the leading factors that explains the average age of a mammal at death is its DNA mutation rate. Again, mammals with lower DNA mutation rates live longer on average and mammals with higher DNA mutation rates live shorter lives on average.

Moral of the story

if you want to live longer reduce your DNA mutation rates.

c, Zero-intercept LME regression of somatic mutation rate on inverse lifespan (1/lifespan), presented on the scale of untransformed lifespan (x axis). For simplicity, the y axis shows mean mutation rates per species, although rates per crypt were used in the regression. The darker shaded area indicates 95% CI of the regression line, and the lighter shaded area marks a twofold deviation from the line. Point estimate and 95% CI of the regression slope (k), FVE and range of end-of-lifespan burden are indicated.

All astronauts are subject to significant forms of cosmic radiation which can’t help but accelerate DNA mutations. So one would have to say that the risk of being an astronaut is that you will die younger.

Moon and Martian colonists will also have the same problem. People traveling, living and working there will have an increased risk of dying young. And of course anyone that works around radiation has the same risk.

Note, the mutation counts/mutation rates, that seem to govern life span are averages. Some individuals have lower mutation rates than their species and some (no doubt) have higher rates. These should have shorter and longer lives on average, respectively.

Given this variability in DNA mutation rates, I would propose that space agencies use as one selection criteria, the astronauts/colonists DNA mutation rate. So that humans which have lower than average DNA mutation rates have a higher priority of being selected to become astronauts/extra-earth colonists. One could using this research and assaying astronauts as they come back to earth for their DNA mutation counts, could theoretically determine the impact to their average life span.

In addition, most life extension research is focused on rejuvenating cellular or organism functionality, mainly through the use of young blood, other select nutrients, stem cells that target specific organs, etc. For example, see MIT Scientists Say They’ve Invented a Treatment That Reverses Hearing Loss which involves taking human cells, transform them into stem cells (at a certain maturity) and injecting them into the ear drum.

Living forever

In prior posts on this topic (see parts 1 &2 linked above) we suggested that with DNA computation and DNA storage (see or listen rather, to our GBoS podcast with CTO of Catalog) now becoming viable, one could potentially come up with a DNA program that could

Store an individuals DNA using some very reliable and long lived coding fashion (inside a cell or external to the cell) and
Craft a DNA program that could periodically be activated (cellular crontab) to access the stored DNA for the individual(in the cell would be easiest) and use this copy to replace/correct any DNA mutation throughout an individuals cells.

And we would need a very reliable and correct copy of that person’s DNA (using SHA256 hashing, CRCs, ECC, Parity and every other way to insure the DNA as captured is stored correctly forever). And the earlier we obtained the DNA copy for an individual human, the better.

Also, we would need a copy of the program (and probably the DNA) to be present in every cell in a human for this to work effectively. .

However, if we could capture a good copy of a person’s DNA early in their life we could, perhaps, sometime later, incorporate DNA code/program into the individual to use this copy and sweep through a person’s body (at that point in time) and correct any mutations that have accumulated to date. Ultimately, one could schedule this activity to occur like an annual checkup.

So yeah, life extension research can continue along the lines they are going and you can have a bunch of point solutions for cellular/organism malfunction OR it can focus on correctly copying and storing DNA forever and creating a DNA program that can correct DNA defects in every individual cell, using the stored DNA.

End of evolution

Yes mammals and that means any human could live forever this way. But it would signify the start of the end of evolution for the human species. That is whenever we captured their DNA copy, from that point on evolution (by mutating DNA) of that individual and any offspring of that individual could no longer take place. And if enough humans do this, throughout their lifespan, it means the end of evolution for humanity as a species

This assumes that evolution (which is natural variation driven by genetic mutation & survival of the fittest) requires DNA variation (essentially mutation) to drive the species forward.

~~~~

So my guess, is either we can live forever and stagnate as a species OR live normal lifespans and evolve as a species into something better over time. I believe nature has made it’s choice.

The surprising thing is that we are at a point in humanities existence where we can conceive of doing away with this natural process – evolution, forever.

Photo Credit(s):

From Nature Article, Somatic mutation rates scale with lifespan across mammals
From Nature Article, Somatic mutation rates scale with lifespan across mammals

CTERA, Cloud NAS on steroids

Posted on August 13, 2021August 13, 2021 by Ray in Cloud services, Cloud storage, data access, Data consistency, Data security, Distributed computing, File Storage, Mobile computing, Object storage, storage scalability, Strategic Inflection Points

We attended SFD22 last week and one of the presenters was CTERA, (for more information please see SFD22 videos of their session) discussing their enterprise class, cloud NAS solution.

We’ve heard a lot about cloud NAS systems lately (see our/listen to our GreyBeards on Storage podcast with LucidLink from last month). Cloud NAS systems provide a NAS (SMB, NFS, and S3 object storage) front-end system that uses the cloud or onprem object storage to hold customer data which is accessed through the use of (virtual or hardware) caching appliances.

These differ from file synch and share in that Cloud NAS systems

Don’t copy lots or all customer data to user devices, the only data that resides locally is metadata and the user’s or site’s working set (of files).
Do cache working set data locally to provide faster access
Do provide NFS, SMB and S3 access along with user drive, mobile app, API and web based access to customer data.
Do provide multiple options to host user data in multiple clouds or on prem
Do allow for some levels of collaboration on the same files

Although admittedly, the boundary lines between synch and share and Cloud NAS are starting to blur.

CTERA is a software defined solution. But, they also offer a whole gaggle of hardware options for edge filers, ranging from smart phone sized, 1TB flash cache for home office user to a multi-RU media edge server with 128TB of hybrid disk-SSD solution for 8K video editing.

They have HC100 edge filers, X-Series HCI edge servers, branch in a box, edge and Media edge filers. These later systems have specialized support for MacOS and Adobe suite systems. For their HCI edge systems they support Nutanix, Simplicity, HyperFlex and VxRail systems.

CTERA edge filers/servers can be clustered together to provide higher performance and HA. This way customers can scale-out their filers to supply whatever levels of IO performance they need. And CTERA allows customers to segregate (file workloads/directories) to be serviced by specific edge filer devices to minimize noisy neighbor performance problems.

CTERA supports a number of ways to access cloud NAS data:

Through (virtual or real) edge filers which present NFS, SMB or S3 access protocols
Through the use of CTERA Drive on MacOS or Windows desktop/laptop devices
Through a mobile device app for IOS or Android
Through their web portal
Through their API

CTERA uses a, HA, dual redundant, Portal service which is a cloud (or on prem) service that provides CTERA metadata database, edge filer/server management and other services, such as web access, cloud drive end points, mobile apps, API, etc.

CTERA uses S3 or Azure compatible object storage for its backend, source of truth repository to hold customer file data. CTERA currently supports 36 on-prem and in cloud object storage services. Customers can have their data in multiple object storage repositories. Customer files are mapped one to one to objects.

CTERA offers global dedupe, virus scanning, policy based scheduled snapshots and end to end encryption of customer data. Encryption keys can be held in the Portals or in a KMIP service that’s connected to the Portals.

CTERA has impressive data security support. As mentioned above end-to-end data encryption but they also support dark sites, zero-trust authentication and are DISA (Defense Information Systems Agency) certified.

Customer data can also be pinned to edge filers, Moreover, specific customer (director/sub-directorydirectories) data can be hosted on specific buckets so that data can:

Stay within specified geographies,
Support multi-cloud services to eliminate vendor lock-in

CTERA file locking is what I would call hybrid. They offer strict consistency for file locking within sites but eventual consistency for file locking across sites. There are performance tradeoffs for strict consistency, so by using a hybrid approach, they offer most of what the world needs from file locking without incurring the performance overhead of strict consistency across sites. For another way to do support hybrid file locking consistency check out LucidLink’s approach (see the GreyBeards podcast with LucidLink above).

At the end of their session Aron Brand got up and took us into a deep dive on select portions of their system software. One thing I noticed is that the portal is NOT in the data path. Once the edge filers want to access a file, the Portal provides the credential verification and points the filer(s) to the appropriate object and the filers take off from there.

CTERA’s customer list is very impressive. It seems that many (50 of WW F500) large enterprises are customers of theirs. Some of the more prominent include GE, McDonalds, US Navy, and the US Air Force.

Oh and besides supporting potentially 1000s of sites, 100K users in the same name space, and they also have intrinsic support for multi-tenancy and offer cloud data migration services. For example, one can use Portal services to migrate cloud data from one cloud object storage provider to another.

They also mentioned they are working on supplying K8S container access to CTERA’s global file system data.

There’s a lot to like in CTERA. We hadn’t heard of them before but they seem focused on enterprise’s with lots of sites, boatloads of users and massive amounts of data. It seems like our kind of storage system.

Comments?

Ethereum enters the enterprise

Posted on March 2, 2017March 2, 2017 by Ray in Crowdsourcing, Data consistency, Distributed computing, Information economy, Strategic Inflection Points, Visionary leadershp

Read an article the other day on NYT (Business Giants Announce Creation of … Ethereum).

In case you don’t know Ethereum is a open source, block chain solution that’s different than the software behind Bitcoin and IBM’s Hyperledger (for more on Hyperledger see our Blockchains at IBM post or our GreyBeardsOnStorage podcast with Donna Dillinger, IBM Fellow).

Blockchains are a software based, permanent ledger which can be used to record anything. Bitcoin, Ethereum and Hyperledger are all based on blockchains that provide similar digital information services with varying security, programability, consensus characteristics, etc.

Blockchains represent an entirely new way of doing business in the digital world and have the potential to take over many financial services and other contracting activities that are done today between organizations.

Blockchain services provide the decentralized recording of transactions into an immutable ledger. The decentralized nature of blockchains makes it difficult (if not impossible) to game the system to record an invalid transaction.

Miners

Ethereum supports an Ethereum Virtual Machine (EVM) application which offers customers and users a more programmable blockchain. That is rather than just updating accounts with monetary transactions like Bitcoin does, one can implement specialized transaction processing for updating the immutable ledger. It’s this programability that allows for the creation of “smart contracts” which can be programmatically verified and executed.

Ethereum miner nodes are responsible for validating transactions and the state transition(s) that update the ledger. Transactions are grouped in blocks by miners.

Miners are responsible for validating the transaction block and performing a hard mathematical computation or proof of work (PoW) which goes along used to validate the block of transactions. Once the PoW computation is complete, the block is packaged up and the miner node updates its database (ledger) and communicates its result to all the other nodes on the network which updates their transaction ledgers as well. This constitutes one state transition of the Ethereum ledger.

Miners that validate Ethereum transactions get paid in Ethers, which are a form of currency throughout the Ethereum ecosystem.

Blockchain consensus

Ethereum ledger consensus is based on the miner nodes executing the PoW algorithm properly. The current Ethereal PoW algorithm is Ethash, which is an “ASIC resistant” algorithm. What this means is that standard GPUs and (less so) CPUs are already very well optimized to perform this algorithm and any potential ASIC designer, if they could do better, would make more money selling their design to GPU and CPU designers, than trying to game the system.

One problem with Bitcoin is that its PoW is more ASIC friendly, which has led some organizations to developing special purpose ASICs in an attempt to dominate Bitcoin mining. If they can dominate Bitcoin mining, this can be used to game the Bitcoin consensus system and potentially implement invalid transactions.

Ethereum Accounts

Ethereum has two types of accounts:

Contract accounts that are controlled by the EVM application code, or
Externally owned accounts (EOA) that are controlled by a set of private keys and represent external agents (miner nodes, people, transaction generating entities)

Contract accounts really are code and data which constitute the EVM bytecode (application). Contract account bytecode is also stored on the Ethereum ledger (when deployed?) and are associated with an EOA that initiates the Contract account.

Contract functionality is written in Solidity, Serpent, Lisp Like Language (LLL) or other languages that can be compiled into EVM bytecode. Smart contracts use Ethereum Contract accounts to validate and execute contract actions.

Ethereum gas pricing

As EVMs contract accounts can consume arbitrary amounts of computation, bandwidth and storage to process transactions, Ethereum uses a concept called “gas” to pay for their resource consumption.

When a contract account transaction is initiated, it identifies a gas price (in Ethers) and a maximum gas amount that it is willing to consume to process the transaction.

When a contract transaction takes place:

If the maximum gas amount is less than what the transaction consumes, then the transaction is executed and is applied to the ledger. Any left over or remaining gas Ethers is credited back to the EOA.
If the maximum gas amount is not enough to execute the transaction, then the transaction fails and no update occurs.

Enterprise Ethereum Alliance

What’s new to Ethereum is that Accenture, Bank of New York Mellon, BP, CreditSuisse, Intel, Microsoft, JP Morgan, UBS and many others have joined together to form the Enterprise Ethereum Alliance. The alliance intends to work to create a standard version of the Ethereum software that enterprise companies can use to manage smart contracts.

Microsoft has had a Azure Blockchain-as-a-Service online since 2015. This was based on an earlier version of Ethereum called Project Bletchley.

Ethereum seems to be an alternative to IBM Hyperledger, which offers another enterprise class block chain for smart contracts. As enterprise class blockchains look like they will transform the way companies do business in the future, having multiple enterprise class blockchain solutions seems smart to many companies.

Comments?

Photo Credit(s): Miner by Mark Callahan; Gas prices by Corpsman.com; File: Ether pharmecie.jpg by Wikimedia

Blockchains at IBM

Posted on September 24, 2016September 24, 2016 by Ray in Cloud services, Crowdsourcing, Data consistency, data logistics, Data security, Distributed computing, Information economy, Strategic Inflection Points

I attended IBM Edge 2016 (videos available here, login required) this past week and there was a lot of talk about their new blockchain service available on z Systems (LinuxONE).

IBM’s blockchain software/service is based on the open source, Open Ledger, HyperLedger project.

Blockchains explained

We have discussed blockchain before (see my post on BlockStack). Blockchains can be used to implement an immutable ledger useful for smart contracts, electronic asset tracking, secured financial transactions, etc.

BlockStack was being used to implement Private Key Infrastructure and to implement a worldwide, distributed file system.

IBM’s Blockchain-as-a-service offering has a plugin based consensus that can use super majority rules (2/3+1 of members of a blockchain must agree to ledger contents) or can use consensus based on parties to a transaction (e.g. supplier and user of a component).

BitCoin (an early form of blockchain) consensus used data miners (performing hard cryptographic calculations) to determine the shared state of a ledger.

There can be any number of blockchains in existence at any one time. Microsoft Azure also offers Blockchain as a service.

The potential for blockchains are enormous and very disruptive to middlemen everywhere. Anywhere ledgers are used to keep track of assets, information, money, etc, that undergo transformations, transitions or transactions as they are further refined, produced and change hands, can be easily tracked in blockchains. The only question is can these assets, information, currency, etc. be digitally fingerprinted and can that fingerprint be read/verified. If such is the case, then blockchains can be used to track them.

New uses for Blockchain

IBM showed a demo of their new supply chain management service based on z Systems blockchain in action. IBM component suppliers record when they shipped component(s), shippers would record when they received the component(s), port authorities would record when components arrived at port, shippers would record when parts cleared customs and when they arrived at IBM facilities. Not sure if each of these transitions were recorded, but there were a number of records for each component shipment from supplier to IBM warehouse. This service is live and being used by IBM and its component suppliers right now.

Leanne Kemp, CEO Everledger, presented another example at IBM Edge (presumably built on z Systems Hyperledger service) used to track diamonds from mining, to cutter, to polishing, to wholesaler, to retailer, to purchaser, and beyond. Apparently the diamonds have a digital bar code/fingerprint/signature that’s imprinted microscopically on the diamond during processing and can be used to track diamonds throughout processing chain, all the way to end-user. This diamond blockchain is used for fraud detection, verification of ownership and digitally certify that the diamond was produced in accordance of the Kimberley Process.

Everledger can also be used to track any other asset that can be digitally fingerprinted as they flow from creation, to factory, to wholesaler, to retailer, to customer and after purchase.

Why z System blockchains

What makes z Systems a great way to implement blockchains is its securely, isolated partitioning and advanced cryptographic capabilities such as z System functionality accelerated hashing, signing & securing and hardware based encryption to speed up blockchain processing. z Systems also has FIPS-140 level 4 certification which can provide the highest security possible for blockchain and other security based operations.

From IBM’s perspective blockchains speak to the advantages of the mainframe environments. Blockchains are compute intensive, they require sophisticated cryptographic services and represent formal systems of record, all traditional strengths of z Systems.

Aside from the service offering, IBM has made numerous contributions to the Hyperledger project. I assume one could just download the z Systems code and run it on any LinuxONE processing environment you want. Also, since Hyperledger is Linux based, it could just as easily run in any OpenPower server running an appropriate version of Linux.

Blockchains will be used to maintain the system of record of the future just like mainframes maintained the systems of record of today and the past.

Comments?

Big science/big data ENCODE project decodes “Junk DNA”

Posted on September 7, 2012 by Ray in Data analytics, Data consistency, Data science, Information economy, System quality

Project ENCODE (ENCyclopedia of DNA Elements) results were recently announced. The ENCODE project was done by a consortium of over 400 researchers from 32 institutions and has deciphered the functionality of so called Junk DNA in the human genome. They have determined that junk DNA is actually used to regulate gene expression. Or junk DNA is really on-off switches for protein encoding DNA. ENCODE project results were published by Nature, Scientific American, New York Times and others.

The paper in Nature ENCODE Explained is probably the best introduction to the project. But probably the best resource on the project computational aspects comes from these papers at Nature, The making of ENCODE lessons for BIG data projects by Ewan Birney and ENCODE: the human encyclopedia by Brendan Maher.

I have been following the Bioinformatics/DNA scene for some time now. (Please see Genome Informatics …, DITS, Codons, & Chromozones …, DNA Computing …, DNA Computing … – part 2). But this is perhaps the first time it has all come together to explain the architecture of DNA and potentially how it all works together to define a human.

Project ENCODE results

It seems like there were at least four major results from the project.

Junk DNA is actually programming for protein production in a cell. Scientists previously estimated that <3% of human DNA’s over 3 billion base pairs encode for proteins. Recent ENCODE results seem to indicate that at least 9% of this human DNA and potentially, as much as 50% provide regulation for when to use those protein encoding DNA.
Regulation DNA undergoes a lot of evolutionary drift. That is it seems to be heavily modified across species. For instance, protein encoding genes seem to be fairly static and differ very little between species. On the the other hand, regulating DNA varies widely between these very same species. One downside to all this evolutionary variation is that regulatory DNA also seems to be the location for many inherited diseases.
Project ENCODE has further narrowed the “Known Unknowns” of human DNA. For instance, about 80% of human DNA is transcribed by RNA. Which means on top of the <3% protein encoding DNA and ~9-50% regulation DNA already identified, there is another 68 to 27% of DNA that do something important to help cells transform DNA into life giving proteins. What that residual DNA does is TBD and is subject for the next phase of the ENCODE project (see below).
There are cell specific regulation DNA. That is there are regulation DNA that are specifically activated if it’s bone cell, skin cell, liver cell, etc. Such cell specific regulatory DNA helps to generate the cells necessary to create each of our organs and regulate their functions. I suppose this was a foregone conclusion but it’s proven now

There are promoter regulatory DNA which are located ahead and in close proximity to the proteins that are being encoded and enhancer/inhibitor regulatory DNA which are located a far DNA distance away from the proteins they regulate.

I believe it seems that we are seeing two different evolutionary time frames being represented in the promoter vs. enhancer/inhibitor regulatory DNA. Whereas promoter DNA seem closely associated with protein encoding DNA, the enhancer DNA seems more like patches or hacks that fixed problems in the original promoter-protein encoding DNA sequences, sort of like patch Tuesday DNA that fixes problems with the original regulation activity.

While I am excited about Project ENCODE results. I find the big science/big data aspects somewhat more interesting.

Genome Big Science/Big Data at work

Some stats from the ENCODE Project:

Almost 1650 experiments on around 180 cell types were conducted to generate data for the ENCODE project. All told almost 12,000 files were analyzed from these experiments.
15TB of data were used in the project
ENCODE project internal Wiki had 18.5K page edits and almost 250K page views.

With this much work going on around the world, data quality control was a necessary, ongoing consideration. It took about half way into the project before they figured out how to define and assess data quality from experiments. What emerged from this was a set of published data standards (see data quality UCSC website) used to determine if experimental data were to be accepted or rejected as input to the project. In the end the retrospectively applied the data quality standards to the earlier experiments and had to jettison some that were scientifically important but exhibited low data quality.

There was a separation between the data generation team (experimenters) and the data analysis team. The data quality guidelines represented a key criteria that governed these two team interactions.

Apparently the real analysis began when they started layering the base level experiments on top of one another. This layering activity led to researchers further identifying the interactions and associations between regulatory DNA and protein encoding DNA.

All the data from the ENCODE project has been released and are available to anyone interested. They also have provided a search and browser capability for the data. All this can be found on the top UCSC website. Further, from this same site one can download the software tools used to analyze, browse and search the data if necessary.

This multi-year project had an interesting management team that created a “spine of leadership”. This team consisted of a few leading scientists and a few full time scientifically aware project officers that held the project together, pushed it along and over time delivered the results.

There were also a set of elaborate rules that were crafted so that all the institutions, researchers and management could interact without friction. This included rules guiding data quality (discussed above), codes of conduct, data release process, etc.

What no Hadoop?

What I didn’t find was any details on the backend server, network or storage used by the project or the generic data analysis tools. I suspect Hadoop, MapReduce, HBase, etc. were somehow involved but could find no reference to this.

I expected with the different experiments and wide variety of data fusion going on that there would be some MapReduce scripting that would transcribe the data so it could be further analyzed by other project tools. Alas, I didn’t find any information about these tools in the 30+ research papers that were published in the last week or so.

It looks like the genomic analysis tools used in the ENCODE project are all open source. They useh the OpenHelix project deliverables. But even a search of the project didn’t reveal any hadoop references.

~~~~

The ENCODE pilot project (2003-2007) cost ~$53M, the full ENCODE project’s recent results cost somewhere around $130M and they are now looking to the next stage of the ENCODE project estimated to cost ~$123M. Of course there are 1000s of more human cell types that need to be examined and ~30% more DNA that needs to be figured out. But this all seems relatively straight forward now that the ENCODE project has laid out an architectural framework for human DNA.

Anyone out there that knows more about the data processing/data analytics side of the ENCODE project please drop me a line. I would love to hear more about it or you can always comment here.

Comments?

Image: From Project Encode, Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)