Read a couple of articles this week Inching closer to a DNA-based file system in ArsTechnica and DNA storage gets random access in IEEE Spectrum. Both of these seem to be citing an article in Nature, Random access in large-scale DNA storage (paywall).
We’ve known for some time now that we can encode data into DNA strings (see my DNA as storage … and Genomic informatics takes off posts).
However, accessing DNA data has been sequential and reading and writing DNA data has been glacial. Researchers have started to attack the sequentiality of DNA data access. The prize, DNA can store 215PB of data in one gram and DNA data can conceivably last millions of years.
Researchers at Microsoft and the University of Washington have come up with a solution to the sequential access limitation. They have used polymerase chain reaction (PCR) primers as a unique identifier for files. They can construct a complementary PCR primer that can be used to extract just DNA segments that match this primer and amplify (replicate) all DNA sequences matching this primer tag that exist in the cell.
DNA data format
The researchers used a Reed-Solomon (R-S) erasure coding mechanism for data protection and encode the DNA data into many DNA strings, each with multiple (metadata) tags on them. One of tags is the PCR primer tag header, another tag indicates the position of the DNA data segment in the file and an end of data tag that is the same PCR primer tag.
The PCR primer tag was used as sort of a file address. They could configure a complementary PCR tag to match the primer tag of the file they wanted to access and then use the PCR process to replicate (amplify) only those DNA segments that matched the searched for primer tag.
Apparently the researchers chunk file data into a block of 150 base pairs. As there are 2 complementary base pairs, I assume one bit to one base pair mapping. As such, 150 base pairs or bits of data per segment means ~18 bytes of data per segment. Presumably this is to allow for more efficient/effective encoding of data into DNA strings.
DNA strings don’t work well with replicated sequences of base pairs, such as all zeros. So the researchers created a random sequence of 150 base pairs and XOR the file DNA data with this random sequence to determine the actual DNA sequence to use to encode the data. Reading the DNA data back they need to XOR the data segment with the random string again to reconstruct the actual file data segment.
Not clear how PCR replicated DNA segments are isolated and where they are originally decoded (with a read head). But presumably once you have thousands to millions of copies of a DNA segment, it’s pretty straightforward to decode them.
Once decoded and XORed, they use the R-S erasure coding scheme to ensure that the all the DNA data segments represent the actual data that was encoded in them. They can then use the position of the DNA data segment tag to indicate how to put the file data back together again.
What’s missing?
I am assuming the cellular data storage system has multiple distinct cells of data, which are clustered together into some sort of organism.
Each cell in the cellular data storage system would hold unique file data and could be extracted and a file read out individually from the cell and then the cell could be placed back in the organism. Cells of data could be replicated within an organism or to other organisms.
To be a true storage system, I would think we need to add:
- DNA data parity – inside each DNA data segment, every eighth base pair would be a parity for the eight preceding base pairs, used to indicate when a particular base pair in eight has mutated.
- DNA data segment (block) and file checksums – standard data checksums, used to verify and correct for double and triple base pair (bit) corruption in DNA data segments and in the whole file.
- Cell directory – used to indicate the unique Cell ID of the cell, a file [name] to PCR primer tag mapping table, a version of DNA file metadata tags, a version of the DNA file XOR string, a DNA file data R-S version/level, the DNA file length or number of DNA data segments, the DNA data creation data time stamp, the DNA last access date-time stamp,and DNA data modification data-time stamp (these last two could be omited)
- Organism directory – used to indicate unique organism ID, organism metadata version number, organism unique cell count, unique cell ID to file list mapping, cell ID creation data-time stamp and cell ID replication count.
The problem with an organism cell-ID file list is that this could be quite long. It might be better to somehow indicate a range or list of ranges of PCR primer tags that are in the cell-ID. I can see other alternatives using a segmented organism directory or indirect organism cell to file lists b-tree, which could hold file name lists to cell-ID mapping.
It’s unclear whether DNA data storage should support a multi-level hierarchy, like file system directories structures or a flat hierarchy like object storage data, which just has buckets of objects data. Considering the cellular structure of DNA data it appears to me more like buckets and the glacial access seems to be more useful to archive systems. So I would lean to a flat hierarchy and an object storage structure.
Is DNA data is WORM or modifiable? Given the effort required to encode and create DNA data segment storage, it would seem it’s more WORM like than modifiable storage.
How will the DNA data storage system persist or be kept alive, if that’s the right word for it. There must be some standard internal cell mechanisms to maintain its existence. Perhaps, the researchers have just inserted file data DNA into a standard cell as sort of junk DNA.
If this were the case, you’d almost want to create a separate, data nucleus inside a cell, that would just hold file data and wouldn’t interfere with normal cellular operations.
But doesn’t the PCR primer tag approach lend itself better to a key-value store data base?
Photo Credit(s): Cell structure National Cancer Institute
Guide to Open VMS file applications
Unix Inodes CSE410 Washington.edu
Key Value Databases, Wikipedia By Clescop – Own work, CC BY-SA 4.0, Link
I was at Dell EMC World2017 last week and although most of the news was on Dell’s new 14th generation server and Dell-EMC integration progress, Wednesday’s keynote was devoted to storage and non-server infrastructure news.
Dell EMC Data Domain Cloud DR (DDCDR) is a new capability that enables DD to backup to AWS S3 object storage and when needed restart the virtual machines within AWS.
Once the OVA is installed, it will read the changed data and will segment, encrypt, and compress the backup data and then send this and the backup metadata to AWS S3 objects. Avamar/DD policies can be established to control how many daily backup copies are to be saved to S3 object storage. There’s no need for Data Domain or Avamar to run in AWS.
When there’s a problem at the primary data center, an admin can click on a Avamar GUI button and have the Cloud DR server, uncompress, decrypt, rehydrate and restore the backup data into EBS volumes, translate the VMware VM image to an AMI image and then restarts the AMI on an AWS virtual server (EC2) with its data on EBS volume storage. The Cloud DR server will use the backup metadata to select the AWS EC2 instance with the proper CPU and RAM needed to run the application. Once this completes, the VM is running standalone, in an AWS EC2 instance. Presumably, you have to have EC2 and EBS storage volumes resources available under your AWS domain to be able to install the application and restore its data.
For simplicity purposes, the user can control almost all of the required functionality for DDCDR from the Avamar GUI alone. But in case of a site outage, the user can initiate the application DR from a portal supplied by the Cloud DR server utility.
There was much more infrastructure news at Dell EMC World2017. I’ll discuss more details on their new storage offerings in my upcoming Storage Intelligence newsletter, due out the end of this month. If your interested in receiving your own copy of my newsletter, checkout the signup button in the upper right of this page.
Earlier this week I attended Hitachi Summit 2016 along with a number of other analysts and Hitachi executives where Hitachi discussed their current and ongoing focus on the IoT (Internet of Things) business.
Hitachi has been in the OT (Operational [industrial] Technology) business for over a century now. Hitachi has also had a very successful and ongoing IT business (Hitachi Data Systems) for decades now. Their main competitors in this IoT business are GE and Siemans but neither have the extensive history in IT that Hitachi has had. But both are working hard to catchup.
For one example of what Hitachi is doing in IoT, they have recently won a 27.5 year Rail-as-a-Service contract to upgrade, ticket, maintain and manage all new trains for UK Rail. This entails upgrading all train rolling stock, provide upgraded rail signaling, traffic management systems, depot and station equipment and ticketing services for all of UK Rail.
The success and profitability of this Hitachi service offering hinges on their ability to provide more cost efficient rail transport. A key capability they plan to deliver is predictive maintenance.
Hitachi said their new trains capture 48K data items and generate over ~25GB/train/day. All this data, will be fed into their new Hitachi Insight Group Lumada platform which includes Pentaho, HSDP (Hitachi Streaming Data Platform) and their Content Analytics to analyze train data and determine how best to keep the trains running. Behind all this analytical power will no doubt be HDS HCP object store used to keep track of all the train sensor data and other information, Hitachi UCP servers to process it all, and other Hitachi software and hardware to glue it all together.
As you may recall, Scality is an object storage solution that came out of the telecom, consumer networking industry to provide Google/Facebook like storage services to other customers.
Scality open sourced their S3 driver just to make this process easier. Now, one could just download their S3server driver (available from
Firing up the S3server on my Mac
NetApp announced a new version of their object storage solution, the
Somewhere during Bycast’s journey they developed support for tape archives and information lifecycle management (ILM) for objects. The previous generation, StorageGrid 10.2 had a number of features, including:
But customers complained StorageGRID was too complex to install and update which required too much hand holding by NetApp professional services. StorageGRID Webscale 10.3 was targeted to address these deficiencies. Some of the features in StorageGrid 10.3, include:
ILM policy change predictions/modeling, so that admins can now see how changes to ILM policies will impact StorageGRID.
Apparently, service providers are adopting object storage to provide competition to AWS, Azure and Google cloud storage for backup and storage archives as well as for DR as a service. Also, many media and other customers managing massive data repositories are turning to object storage to support their multi-site, very large file libraries. And as more solution vendors support S3 object protocols for data access and archive, something like StorageGRID can become their onsite-offsite storage alternative.
We talked with