Random access, DNA object storage system

Read a couple of articles this week Inching closer to a DNA-based file system in ArsTechnica and DNA storage gets random access in IEEE Spectrum. Both of these seem to be citing an article in Nature, Random access in large-scale DNA storage (paywall).

We’ve known for some time now that we can encode data into DNA strings (see my DNA as storage … and Genomic informatics takes off posts).

However, accessing DNA data has been sequential and reading and writing DNA data has been glacial. Researchers have started to attack the sequentiality of DNA data access. The prize, DNA can store 215PB of data in one gram and DNA data can conceivably last millions of years.

Researchers at Microsoft and the University of Washington have come up with a solution to the sequential access limitation. They have used polymerase chain reaction (PCR) primers as a unique identifier for files. They can construct a complementary PCR primer that can be used to extract just DNA segments that match this primer and amplify (replicate) all DNA sequences matching this primer tag that exist in the cell.

DNA data format

The researchers used a Reed-Solomon (R-S) erasure coding mechanism for data protection and encode the DNA data into many DNA strings, each with multiple (metadata) tags on them. One of tags is the PCR primer tag header, another tag indicates the position of the DNA data segment in the file and an end of data tag that is the same PCR primer tag.

The PCR primer tag was used as sort of a file address. They could configure a complementary PCR tag to match the primer tag of the file they wanted to access and then use the PCR process to replicate (amplify) only those DNA segments that matched the searched for primer tag.

Apparently the researchers chunk file data into a block of 150 base pairs. As there are 2 complementary base pairs, I assume one bit to one base pair mapping. As such, 150 base pairs or bits of data per segment means ~18 bytes of data per segment. Presumably this is to allow for more efficient/effective encoding of data into DNA strings.

DNA strings don’t work well with replicated sequences of base pairs, such as all zeros. So the researchers created a random sequence of 150 base pairs and XOR the file DNA data with this random sequence to determine the actual DNA sequence to use to encode the data. Reading the DNA data back they need to XOR the data segment with the random string again to reconstruct the actual file data segment.

Not clear how PCR replicated DNA segments are isolated and where they are originally decoded (with a read head). But presumably once you have thousands to millions of copies of a DNA segment,  it’s pretty straightforward to decode them.

Once decoded and XORed, they use the R-S erasure coding scheme to ensure that the all the DNA data segments represent the actual data that was encoded in them. They can then use the position of the DNA data segment tag to indicate how to put the file data back together again.

What’s missing?

I am assuming the cellular data storage system has multiple distinct cells of data, which are clustered together into some sort of organism.

Each cell in the cellular data storage system would hold unique file data and could be extracted and a file read out individually from the cell and then the cell could be placed back in the organism. Cells of data could be replicated within an organism or to other organisms.

To be a true storage system, I would think we need to add:

  • DNA data parity – inside each DNA data segment, every eighth base pair would be a parity for the eight preceding base pairs, used to indicate when a particular base pair in eight has mutated.
  • DNA data segment (block) and file checksums –  standard data checksums, used to verify and correct for double and triple base pair (bit) corruption in DNA data segments and in the whole file.
  • Cell directory – used to indicate the unique Cell ID of the cell, a file [name] to PCR primer tag mapping table, a version of DNA file metadata tags, a version of the DNA file XOR string, a DNA file data R-S version/level, the DNA file length or number of DNA data segments, the DNA data creation data time stamp, the DNA last access date-time stamp,and DNA data modification data-time stamp (these last two could be omited)
  • Organism directory – used to indicate unique organism ID, organism metadata version number, organism unique cell count,  unique cell ID to file list mapping, cell ID creation data-time stamp and cell ID replication count.

The problem with an organism cell-ID file list is that this could be quite long. It might be better to somehow indicate a range or list of ranges of PCR primer tags that are in the cell-ID. I can see other alternatives using a segmented organism directory or indirect organism cell to file lists b-tree, which could hold file name lists to cell-ID mapping.

It’s unclear whether DNA data storage should support a multi-level hierarchy, like file system  directories structures or a flat hierarchy like object storage data, which just has buckets of objects data. Considering the cellular structure of DNA data it appears to me more like buckets and the glacial access seems to be more useful to archive systems. So I would lean to a flat hierarchy and an object storage structure.

Is DNA data is WORM or modifiable? Given the effort required to encode and create DNA data segment storage, it would seem it’s more WORM like than modifiable storage.

How will the DNA data storage system persist or be kept alive, if that’s the right word for it. There must be some standard internal cell mechanisms to maintain its existence. Perhaps, the researchers have just inserted file data DNA into a standard cell as sort of junk DNA.

If this were the case, you’d almost want to create a separate, data  nucleus inside a cell, that would just hold file data and wouldn’t interfere with normal cellular operations.

But doesn’t the PCR primer tag approach lend itself better to a  key-value store data base?

Photo Credit(s): Cell structure National Cancer Institute

Prentice Hall textbook

Guide to Open VMS file applications

Unix Inodes CSE410 Washington.edu

Key Value Databases, Wikipedia By ClescopOwn work, CC BY-SA 4.0, Link

Disaster recovery from VMware to AWS using Dell EMC Avamar & Data Domain

avI was at Dell EMC World2017 last week and although most of the news was on Dell’s new 14th generation server and Dell-EMC integration progress, Wednesday’s keynote was devoted to storage and non-server infrastructure news.

There was plenty of non-server news but one item that caught my attention was new functionality from Dell EMC Data Protection Division that used Avamar and Data Domain to provide disaster recovery for VMware VMs directly to AWS.

Data Domain (AWS) Cloud DR

Dell EMC Data Domain Cloud DR (DDCDR) is  a new capability that enables DD to backup to AWS S3 object storage and when needed restart the virtual machines within AWS.

DDCDR requires that a customer with Avamar backup and Data Domain (DD) storage install an OVA which deploys an “add-on” to their on-prem Avamar/DD system and install a lightweight VM (Cloud DR server) utility in their AWS domain.

Once the OVA is installed, it will read the changed data and will segment, encrypt, and compress the backup data and then send this and the backup metadata to AWS S3 objects. Avamar/DD policies can be established to control how many daily backup copies are to be saved to S3 object storage. There’s no need for Data Domain or Avamar to run in AWS.

When there’s a problem at the primary data center, an admin can click on a Avamar GUI button and have the Cloud DR server, uncompress, decrypt, rehydrate and restore the backup data into EBS volumes, translate the VMware VM image to an AMI image and then restarts the AMI on an AWS virtual server (EC2) with its data on EBS volume storage. The Cloud DR server will use the backup metadata to select the AWS EC2 instance with the proper CPU and RAM needed to run the application. Once this completes, the VM is running standalone, in an AWS EC2 instance. Presumably, you have to have EC2 and EBS storage volumes resources available under your AWS domain to be able to install the application and restore its data.

For simplicity purposes, the user can control almost all of the required functionality for DDCDR from the Avamar GUI alone. But in case of a site outage, the user can initiate the application DR from a portal supplied by the Cloud DR server utility.

There you have it, simplified, easy to use (AWS) Cloud DR for your VM applications all through Dell EMC Avamar, Data Domain storage and DDCDR. At the moment, it only works with AWS cloud but it’s likely to be available for other public clouds in the near future.

~~~~

There was much more infrastructure news at Dell EMC World2017. I’ll discuss more details on their new storage offerings in my upcoming Storage Intelligence newsletter, due out the end of this month. If your interested in receiving your own copy of my newsletter, checkout the signup button in the upper right of this page.

Comments?

[Edits were made for readability and technical accuracy after this post was published. Ed]

Hitachi and the coming IoT gold rush

img_7137Earlier this week I attended Hitachi Summit 2016 along with a number of other analysts and Hitachi executives where Hitachi discussed their current and ongoing focus on the IoT (Internet of Things) business.

We have discussed IoT before (see QoM1608: The coming IoT tsunami or not, Extremely low power transistors … new IoT applications). Analysts and companies predict  ~200B IoT devices by 2020 (my QoM prediction is 72.1B 0.7 probability). But in any case there’s a lot of IoT activity going to come online, very shortly. Hitachi is already active in IoT and if anything, wants it to grow, significantly.

Hitachi’s current IoT business

Hitachi is uniquely positioned to take on the IoT business over the coming decades, having a number of current businesses in industrial processes, transportation, energy production, water management, etc. Over time, all these industries and more are becoming much more data driven and smarter as IoT rolls out.

Some metrics indicating the scale of Hitachi’s current IoT business, include:

  • Hitachi is #79 in the Fortune Global 500;
  • Hitachi’s generated $5.4B (FY15) in IoT revenue;
  • Hitachi IoT R&D investment is $2.3B (over 3 years);
  • Hitachi has 15K customers Worldwide and 1400+ partners; and
  • Hitachi spends ~$3B in R&D annually and has 119K patents

img_7142Hitachi has been in the OT (Operational [industrial] Technology) business for over a century now. Hitachi has also had a very successful and ongoing IT business (Hitachi Data Systems) for decades now.  Their main competitors in this IoT business are GE and Siemans but neither have the extensive history in IT that Hitachi has had. But both are working hard to catchup.

Hitachi Rail-as-a-Service

img_7152For one example of what Hitachi is doing in IoT, they have recently won a 27.5 year Rail-as-a-Service contract to upgrade, ticket, maintain and manage all new trains for UK Rail.  This entails upgrading all train rolling stock, provide upgraded rail signaling, traffic management systems, depot and station equipment and ticketing services for all of UK Rail.

img_7153The success and profitability of this Hitachi service offering hinges on their ability to provide more cost efficient rail transport. A key capability they plan to deliver is predictive maintenance.

Today, in UK and most other major rail systems, train high availability is often supplied by using spare rolling stock, that’s pre-positioned and available to call into service, when needed. With Hitachi’s new predictive maintenance capabilities, the plan is to reduce, if not totally eliminate the need for spare rolling stock inventory and keep the new trains running 7X24.

img_7145Hitachi said their new trains capture 48K data items and generate over ~25GB/train/day. All this data, will be fed into their new Hitachi Insight Group Lumada platform which includes Pentaho, HSDP (Hitachi Streaming Data Platform) and their Content Analytics to analyze train data and determine how best to keep the trains running. Behind all this analytical power will no doubt be HDS HCP object store used to keep track of all the train sensor data and other information, Hitachi UCP servers to process it all, and other Hitachi software and hardware to glue it all together.

The new trains and services will be rolled out over time, but there’s a pretty impressive time table. For instance, Hitachi will add 120 new high speed trains to UK Rail by 2018.  About the only thing that Hitachi is not directly responsible for in this Rail-as-a-Service offering, is the communications network for the trains.

Hitachi other IoT offerings

Hitachi is actively seeking other customers for their Rail-as-a-service IoT service offering. But it doesn’t stop there, they would like to offer smart-water-as-a-service, smart-city-as-a-service, digital-energy-as-a-service, etc.

There’s almost nothing that Hitachi currently supplies as industrial products that they wouldn’t consider offering in an X-as-a-service solution. With HDS Lumada Analytics, HCP and HDS storage systems, Hitachi UCP converged infrastructure, Hitachi industrial products, and Hitachi consulting services, together they are primed to take over the IoT-industrial products/services market.

Welcome to the new Hitachi IoT world.

Comments?

Scality’s Open Source S3 Driver

img_6931
The view from Scality’s conference room

We were at Scality last week for Cloud Field Day 1 (CFD1) and one of the items they discussed was their open source S3 driver. (Videos available here).

Scality was on the 25th floor of a downtown San Francisco office tower. And the view outside the conference room was great. Giorgio Regni, CTO, Scality, said on the two days a year it wasn’t foggy out, you could even see Golden Gate Bridge from their conference room.

Scality

img_6912As you may recall, Scality is an object storage solution that came out of the telecom, consumer networking industry to provide Google/Facebook like storage services to other customers.

Scality RING is a software defined object storage that supports a full complement of interface legacy and advanced protocols including, NFS, CIGS/SMB, Linux FUSE, RESTful native, SWIFT, CDMI and Amazon Web Services (AWS) S3. Scality also supports replication and erasure coding based on object size.

RING 6.0 brings AWS IAM style authentication to Scality object storage. Scality pricing is based on usable storage and you bring your own hardware.

Giorgio also gave a session on the RING’s durability (reliability) which showed they support 13-9’s data availability. He flashed up the math on this but it was too fast for me to take down:)

Scality has been on the market since 2010 and has been having a lot of success lately, having grown 150% in revenue this past year. In the media and entertainment space, Scality has won a lot of business with their S3 support. But their other interface protocols are also very popular.

Why S3?

It looks as if AWS S3 is becoming the defacto standard for object storage. AWS S3 is the largest current repository of objects. As such, other vendors and solution providers now offer support for S3 services whenever they need an object/bulk storage tier behind their appliances/applications/solutions.

This has driven every object storage vendor to also offer S3 “compatible” services to entice these users to move to their object storage solution. In essence, the object storage industry, like it or not, is standardizing on S3 because everyone is using it.

But how can you tell if a vendor’s S3 solution is any good. You could always try it out to see if it worked properly with your S3 application, but that involves a lot of heavy lifting.

However, there is another way. Take an S3 Driver and run your application against that. Assuming your vendor supports all the functionality used in the S3 Driver, it should all work with the real object storage solution.

Open source S3 driver

img_6916Scality open sourced their S3 driver just to make this process easier. Now, one could just download their S3server driver (available from Scality’s GitHub) and start it up.

Scality’s S3 driver runs ontop of a Docker Engine so to run it on your desktop you would need to install Docker Toolbox for older Mac or Windows systems or run Docker for Mac or Docker for Windows for newer systems. (We also talked with Docker at CFD1).

img_6933Firing up the S3server on my Mac

I used Docker for Mac but I assume the terminal CLI is the same for both.Downloading and installing Docker for Mac was pretty straightforward.  Starting it up took just a double click on the Docker application, which generates a toolbar Docker icon. You do need to enter your login password to run Docker for Mac but once that was done, you have Docker running on your Mac.

Open up a terminal window and you have the full Docker CLI at your disposal. You can download the latest S3 Server from Scality’s Docker hub by executing  a pull command (docker pull scality/s3server), to fire it up, you need to define a new container (docker run -d –name s3server -p 8000:8000 scality/s3server) and then start it (docker start s3server).

It’s that simple to have a S3server running on your Mac. The toolbox approach for older Mac’s and PC’S is a bit more complicated but seems simple enough.

The data is stored in the container and persists until you stop/delete the container. However, there’s an option to store the data elsewhere as well.

I tried to use CyberDuck to load some objects into my Mac’s S3server but couldn’t get it to connect properly. I wrote up a ticket to the S3server community. It seemed to be talking to the right port, but maybe I needed to do an S3cmd to initialize the bucket first – I think.

[Update 2016Sep19: Turns out the S3 server getting started doc said you should download an S3 profile for Cyberduck. I didn’t do that originally because I had already been using S3 with Cyberduck. But did that just now and it now works just like it’s supposed to. My mistake]

~~~~

Anyways, it all seemed pretty straight forward to run S3server on my Mac. If I was an application developer, it would make a lot of sense to try S3 this way before I did anything on the real AWS S3. And some day, when I grew tired of paying AWS, I could always migrate to Scality RING S3 object storage – or at least that’s the idea.

Comments?

NetApp updates their StorageGRID Webscale solution

grid001NetApp announced a new version of their object storage solution, the StorageGRID WebScale 10.3.

At a former employer, I first talked with StorageGRID (Bycast at the time) a decade or so ago. At that time, they were focused on medical and healthcare verticals and had a RAIN (redundant array of independent nodes) storage solution.  It has come a long way.

StorageGRID Business is booming

On the call, NetApp announced they sold 50PB of StorageGRID in FY’16 with 20PB of that in the last quarter and also reported 270% Y/Y revenue growth, which means they are starting to gain some traction in the marketplace. Are we seeing an acceleration of object storage adoption?

As you may recall, StorageGRID comes in a software only solution that runs on just about any white box server with DAS or as two hardware appliances: the SG5612 (12 drive); and the SG5660 (60 drive) nodes. You can mix and match any appliance with any white box software only solution, they don’t have to have the same capacity or performance. But all nodes need network and controller/admin node(s) access.

StorageGRID past

grid002Somewhere during Bycast’s journey they developed support for tape archives and information lifecycle management (ILM) for objects. The previous generation, StorageGrid 10.2 had a number of features, including:

  • S3 cloud archive support that allowed objects to be migrated to AWS S3 as they were no longer actively accessed
  • NAS bridge support that allowed CIFS/SMB or NFS access to StorageGRID objects, which could also be read as S3 objects for easier migration to/from object storage;
  • Hierarchical erasure coding option that was optimized for efficiently storing large objects;
  • Node level erasure coding support that can be used to rebuild data for node drive failures, without having to go outside the node data retrieval;
  • Object byte-granular range read support that allowed users to read an object at any byte offset without requiring rebuild;
  • Support for OpenStack Swift API that made StorageGRID objects natively available to any OpenStack service; and
  • Software support for running as Docker containers or as a VM under VMware ESX, or OpenStack KVM that allowed StorageGRID software to run just about anywhere.

StorageGRID present and future

grid003But customers complained StorageGRID was too complex to install and update which required too much hand holding by NetApp professional services. StorageGRID Webscale 10.3 was targeted to address these deficiencies. Some of the features in StorageGrid 10.3, include:

  • Radically simplified, more modern UI, new dashboard and policy wizard/editor, so that it’s a lot easier to manage the StorageGRID. All features of the UI are also available via RESTfull API access and the UI is the same for white box, software only implementations as well as appliance configurations.
  • Simplified automated installation scripts, so that installations that used to take multiple steps, separate software installs and required professional services support, now use a full-solution software stack install, take only minutes and can be done by the customers alone;
  • S3 object versioning support, so that objects can have multiple versions, limited via the UI, if needed, but provide a snapshot-like capability for S3 data that protects against object accidental deletion.
  • grid004ILM policy change predictions/modeling, so that admins can now see how changes to ILM policies will impact StorageGRID.
  • Even more flexibility in DAS storage, so that future StorageGRID configurations can support 10TB drives and 6TB FIPS-140 drive encryption support, which adds to the current drive capacity and data security options already available in StorageGRID.

To top it all off, StorageGRID 10.3 improves performance for both small (30KB) and large (300MB) object get/puts.

  • Small S3 Load Data Router (LDR, 1-thread) object performance has improved ~4X for both PUTs and GETs; and
  • Large S3 LDR (1-thread) object performance has improved ~2X for PUTs and ~4X for GETs.

Object storage market heating up

grid005Apparently, service providers are adopting object storage to  provide competition to AWS, Azure and Google cloud storage for backup and storage archives as well as for DR as a service. Also, many media and other customers managing massive data repositories are turning to object storage to support their multi-site, very large file libraries.  And as more solution vendors support S3 object protocols for data access and archive, something like StorageGRID can become their onsite-offsite storage alternative.

And Amazon, Azure and Google are starting to realize that most enterprise customers are not going to leap to the cloud for everything they do. So, some sort of hybrid solution is needed for the long term. Having an on premises and off premises object storage solution that can also archive/migrate data to the cloud is a great hybrid alternative that takes enterprises one step closer to the cloud.

Comments?

Hedvig storage system, Docker support & data protection that spans data centers

Hedvig003We talked with Hedvig (@HedvigInc) at Storage Field Day 10 (SFD10), a month or so ago and had a detailed deep dive into their technology. (Check out the videos of their sessions here.)

Hedvig implements a software defined storage solution that runs on X86 or ARM processors and depends on a storage proxy operating in a hypervisor host (as a VM) and storage service nodes. Their proxy and the storage services can execute as separate VMs on the same host in a hyper-converged fashion or on different nodes as a separate storage cluster with hosts doing IO to the storage cluster.

Hedvig’s management team comes from hyper-scale environments (Amazon Dynamo/Facebook Cassandra) so they have lots of experience implementing distributed software defined storage at (hyper-)scale.
Continue reading “Hedvig storage system, Docker support & data protection that spans data centers”

BlockStack, a Bitcoin secured global name space for distributed storage

At USENIX ATC conference a couple of weeks ago there was a presentation by a number of researchers on their BlockStack global name space and storage system based on the blockchain based Bitcoin network. Their paper was titled “Blockstack: A global naming and storage system secured by blockchain” (see pg. 181-194, in USENIX ATC’16 proceedings).

Bitcoin blockchain simplified

Blockchain’s like Bitcoin have a number of interesting properties including completely distributed understanding of current state, based on hashing and an always appended to log of transactions.

Blockchain nodes all participate in validating the current block of transactions and some nodes (deemed “miners” in Bitcoin) supply new blocks of transactions for validation.

All blockchain transactions are sent to each node and blockchain software in the node timestamps the transaction and accumulates them in an ordered append log (the “block“) which is then hashed, and each new block contains a hash of the previous block (the “chain” in blockchain) in the blockchain.

The miner’s block is then compared against the non-miners node’s block (hashes are compared) and if equal then, everyone reaches consensus (agrees) that the transaction block is valid. Then the next miner supplies a new block of transactions, and the process repeats. (See wikipedia’s article for more info).

All blockchain transactions are owned by a cryptographic address. Each cryptographic address has a public and private key associated with it.
Continue reading “BlockStack, a Bitcoin secured global name space for distributed storage”

Exablox, bring your own disk storage

We talked with Exablox a month or so ago at Storage Field Day 10 (SFD10) and they discussed some of their unique storage solution and new software functionality. If you’re not familiar with Exablox they sell a OneBlox appliance with drive slots, but no data drives.

The OneBlox appliance provides a Linux based, scale-out, distributed object storage software with a file system in front of it. They support SMB and NFS access protocols and have inline deduplication, data compression and continuous snapshot capabilities. You supply the (SATA or SAS) drives, a bring your own drive (BYOD) storage offering.

Their OneSystem management solution is available on a subscription basis, which usually runs in the cloud as a web accessed service offering used to monitor and manage your Exablox cluster(s). However, for those customers that want it, OneSystem is also available as a Docker Container, where you can run it on any Docker compatible system.
Continue reading “Exablox, bring your own disk storage”