The Greybeards get off the beaten (enterprise) path this month, to see what lies ahead with a discussion on DNA storage. David Turek, CTO, Catalog DNA (@CatalogDNA) is a long time IBMer that had been focused on HPC systems at IBM but left and went to Catalog DNA to pursue the commercialization of DNA storage, an “emerging” technology. CatalogDNA is a company out of Boston that had recently closed a round of funding and are focused on bringing DNA storage out into the world of IT.
David was a pleasure to talk and has lot’s of knowledge on HPC and enterprise data center solutions. He also has a good grasp of what it will take to bring DNA storage to market. Keith has had some prior experience with DNA technologies in BioPharma so could talk in more detail about the technology and its ecosystem. [We’re trying out a new format, let us know what you think; The Eds.]
Ray has written about DNA storage in his RayOnStorage Blog, most recently in April of this year and May of last year. It’s been an ongoing blog topic of his for almost a decade now. When Ray was interviewed about the technology he thought it interesting but had serious obstacles with read and write latencies and throughput as well as the size of the storage device.
Well CatalogDNA seems to have got a good handle on write throughput and are seriously working on the rest.
However, DNA storage’- volumetric density was always of exceptional. Early on in the podcast, David mentioned that DNA storage was 6 orders of magnitude (1 million times) more dense in bytes/mm**3 than magnetic tape today. An LTO8 tape device stores 12TB (uncompressed) in a tape cartridge, 14.2 in**3 (230.3 cm**3) or roughly 845GB/in**3 (52GB/cm**3). One million times this, would be 12EB in the same volume.
The challenge with LTO8, disk or SSD storage today is at some point you have to move the data from one device to a more modern device. This could be every 3-5 years (for disk or SSD) or 25-30 years for tape. In either case, at some point IT would need to incur the cost and time to move the data. Not much of a problem for 100TB or so but when you start talking PB or EB of data, it can be a never ending task.
David mentioned Catalog uses “synthetic DNA” in their storage. This means the DNA it uses is designed to be incompatible with natural DNA such that it wouldn’t work in a cell. It has stops or other biological mechanisms to inhibit it’s use in nature. Yes it uses the same sugars, backbones, and other chemistry of biologically active DNA, but it has been specifically modified to inhibit its use by normal cellular machinery.
DNA storage has a number of unique capabilities :
- It can be made to last forever, by being dried out (dessicated) and encased in a crystal and takes 0 power/energy to be stored for eons.
- It can be cheaply and easily replicated, almost an infinite number of times, for only the cost of chemical feedstock, chemical interactions and energy. Yes, this may take time but the process scales up nicely. One could make 2 copies in first cycle, 4 in the 2nd, 8 in the 3rd, etc and by doing this it would only take 20 cycles to create a million copies. If each cycle takes 10 minutes, in 3:20, you could have a million copies of 1EB of data.
- It can be easily searched for target information. This involves fabricating a DNA search molecule and inserting it into the storage solution. Once there it would match up with the DNA segment that held your key. And of course, the search molecule and the data could be replicated to speed up any search process.
- We already mentioned the extreme density advantage above.
Speed of DNA storage access
David said they can already write Catalog DNA storage in MB/sec.
The process they use to write is like a conveyer belt which starts off with a polyethylene sheet (web actually). Somewhere, the digital data comes in, is chunked and transformed into DNA strand (25-50 base pairs) molecules or dots. The polyethylene sheet rolls into a machine that uses multiple 3D print heads to deposit dots (the DNA strand data chunks) at web points. This machine/process deposits 100K or more of these dots onto the web. The sheet then moves to the next stage where the DNA molecules are scraped off and drained into a solution. Then a wet process occurs which uses chemistry to make the DNA more readable and enables the separate DNA molecules to connect into a data strand. Then this data strand goes into another process where it gets reduced in volume and so that it is more stable.
If needed, one can add another step that dries out or desiccates the data strand into even a smaller volume which can then be embedded into a crystalline structure which could last for centuries.
David compared the DNA molecules (data chunks) to Legos, only they are the same pieces in a million different colors Each piece represents some segment of data bits/bytes. Using chemistry and proprietary IP each separate DNA molecule self organizes (connects) into a data strand, representing the information you want to store.
Reading DNA involves, off the shelf, DNA sequencers. The one Catalog currently uses is the Oxford NanoPore device, but there are others. David didn’t say how fast they could read DNA data. But current DNA reading devices destroy the data. So making replicas of the data would be required to read it.
David said their current write device is L shaped with one leg about 14’ (4.3m) long and the other about 12’ (3.7m) long with each leg being about 3’ (0.9m) wide.
Searching EB of data in minutes?!
DNA strands can be searched (matched) using a search molecule and inserting this into the storage solution (that holds the data strands). Such a molecule will find a place in the data that has a matching (DNA) data element and I believe attach itself to the data strand.
For example, lets say you had recorded all of a country’s emails for a month or so and you wanted to search them for the words, “bomb”, “terrorist”, “kill”, etc. One could create a set of search molecules, replicate them any number of times (depending on how quickly you wanted to search the data and how many matches you expected), and insert them into a data pool with multiple data strands that stored the email traffic.
After some time, you’d come back and your search would be done. You’d need to then extract the search hits, and read out the portion of the data strands (emails) that matched. I’m guessing extraction would involve some sort of (wet) chemical process or filtration.
State of Catalog DNA storage
David mentioned that as a publicity stunt they wrote the whole Wikipedia onto Catalog DNA storage. The whole Wikipedia fit into a cylinder about the height of a big knuckle on your hand and in a width smaller than a finger. The size of the whole Wikipedia, with complete edit history is 10TB uncompressed and if they stored all the edit versions plus its media such as images, videos, audio and other graphics, that would add another 23TB (as of end of 2014), so ~33TB uncompressed.
David believes in 18 months they could have a WORM (write once, read many times) data storage solution that could be deployed in customer data centers which would supply immense data repositories in relatively small solution containers.
CatalogDNA is currently in a number of PoCs with major corporations (not labs or universities) to show how DNA storage technology can be used to solve problems.
David believes that at some point they will be able to make compute engines entirely of DNA. At that point, one could have a combined compute and storage (HCI-like) DNA server using the same technology in a solution. And as mentioned previously, one could replicate from one DNA server & storage to a million DNA servers & storage in just 20 cycles. How’s that for scale out.
David Turek, CTO Catalog DNA
Dave Turek is Catalog’s Chief Technology Officer. He comes to Catalog from IBM where he held numerous executive positions in High Performance Computing and emerging technologies.
He was the development executive for the IBM SP program which produced the first commercially successful massively parallel system; he started IBM’s Linux Cluster business; launched an early offering in Cloud computing called Deep Computing Capacity on Demand; produced the Roadrunner system, the world’s first petascale computer; and was responsible for IBM’s exascale strategy which led to the deployment of the Summit and Sierra systems at Oak Ridge and Lawrence Livermore National Laboratories respectively.
David has been invited to testify to Congress on numerous occasions regarding the future of computing in the US and has helped establish technical collaborations with universities, businesses, and government agencies around the world.