Random access, DNA object storage system

Read a couple of articles this week: Inching closer to a DNA-based file system in Ars Technica and DNA storage gets random access in IEEE Spectrum. Both of these seem to be citing an article in Nature, Random access in large-scale DNA storage (paywall).

We’ve known for some time now that we can encode data into DNA strings (see my DNA as storage … and Genomic informatics takes off posts).

However, accessing DNA data has been sequential, and reading and writing DNA data has been glacially slow. Researchers have started to attack the sequentiality of DNA data access. The prize: DNA can store 215PB of data in one gram, and DNA data can conceivably last millions of years.

Researchers at Microsoft and the University of Washington have come up with a solution to the sequential access limitation. They use polymerase chain reaction (PCR) primers as unique identifiers for files. They can construct a complementary PCR primer that can be used to extract just the DNA segments that match this primer and amplify (replicate) all DNA sequences matching this primer tag that exist in the cell.

DNA data format

The researchers used a Reed-Solomon (R-S) erasure coding mechanism for data protection and encoded the DNA data into many DNA strings, each with multiple (metadata) tags on them. One of the tags is the PCR primer tag header, another indicates the position of the DNA data segment within the file, and an end-of-data tag repeats the same PCR primer tag.

The PCR primer tag is used as a sort of file address. They can construct a complementary PCR tag to match the primer tag of the file they want to access and then use the PCR process to replicate (amplify) only those DNA segments that match the searched-for primer tag.

Apparently the researchers chunk file data into blocks of 150 base pairs. As there are two complementary base pairs, I assume a one bit to one base pair mapping. As such, 150 base pairs, or 150 bits of data per segment, means ~18 bytes of data per segment. Presumably this is to allow for more efficient/effective encoding of data into DNA strings.

DNA strings don’t work well with repeated sequences of base pairs, such as all zeros. So the researchers created a random sequence of 150 base pairs and XOR the file’s DNA data with this random sequence to determine the actual DNA sequence used to encode the data. When reading the DNA data back, they XOR the data segment with the random string again to reconstruct the actual file data segment.
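To make that XOR randomization step concrete, here’s a minimal Python sketch. The 150-bit segment size comes from the description above; the fixed mask, the seed and the bit-list representation are assumptions for illustration (the actual mapping of bits onto A/C/G/T bases isn’t shown).

```python
import random

SEGMENT_BITS = 150                      # per-segment size reported above

# A fixed, randomly chosen mask shared by writer and reader (an assumption
# here; the paper presumably specifies how its random sequence is generated).
random.seed(42)
MASK = [random.randint(0, 1) for _ in range(SEGMENT_BITS)]

def randomize(segment_bits):
    """XOR a 150-bit data segment with the mask before encoding into bases."""
    return [b ^ m for b, m in zip(segment_bits, MASK)]

def derandomize(encoded_bits):
    """XOR with the same mask again to recover the original data bits."""
    return [b ^ m for b, m in zip(encoded_bits, MASK)]

# A run of all zeros (bad for DNA synthesis) becomes a pseudo-random pattern.
data = [0] * SEGMENT_BITS
encoded = randomize(data)
assert derandomize(encoded) == data
```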

It’s not clear how the PCR-replicated DNA segments are isolated and where they are actually decoded (with a read head). But presumably once you have thousands to millions of copies of a DNA segment, it’s pretty straightforward to decode them.

Once decoded and XORed, they use the R-S erasure coding scheme to verify that all the DNA data segments represent the actual data that was encoded in them. They can then use the position tag of each DNA data segment to put the file data back together again.
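Putting the read path together, a hypothetical reassembly step might look like the sketch below: filter the amplified segments by primer tag, order them by position tag and concatenate the payloads. The segment representation is made up for the example, and the R-S decode is only indicated by a comment since its parameters aren’t given in the articles.

```python
def read_file(segments, primer_tag):
    """Reassemble a file from decoded DNA segments.

    Each segment is a dict with 'primer', 'position' and 'payload' keys --
    a hypothetical representation of the tags described above.
    """
    matching = [s for s in segments if s["primer"] == primer_tag]
    matching.sort(key=lambda s: s["position"])
    # An R-S erasure decode would run here to detect/repair damaged segments
    # before concatenation; its parameters are not specified in the articles.
    return b"".join(s["payload"] for s in matching)

segments = [
    {"primer": "FILE-A", "position": 1, "payload": b" world"},
    {"primer": "FILE-B", "position": 0, "payload": b"other file"},
    {"primer": "FILE-A", "position": 0, "payload": b"hello"},
]
print(read_file(segments, "FILE-A"))   # b'hello world'
```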

What’s missing?

I am assuming the cellular data storage system has multiple distinct cells of data, which are clustered together into some sort of organism.

Each cell in the cellular data storage system would hold unique file data; a cell could be extracted, a file read out of it individually, and the cell then placed back into the organism. Cells of data could be replicated within an organism or to other organisms.

To be a true storage system, I would think we need to add:

  • DNA data parity – inside each DNA data segment, every eighth base pair would be a parity over the seven preceding base pairs, used to detect when one base pair in the group has mutated.
  • DNA data segment (block) and file checksums – standard data checksums, used to detect and correct double and triple base pair (bit) corruptions in DNA data segments and in the whole file.
  • Cell directory – used to record the unique cell ID of the cell, a file [name] to PCR primer tag mapping table, the version of the DNA file metadata tags, the version of the DNA file XOR string, the DNA file data R-S version/level, the DNA file length or number of DNA data segments, the DNA data creation date-time stamp, the DNA data last access date-time stamp, and the DNA data modification date-time stamp (these last two could be omitted).
  • Organism directory – used to record the unique organism ID, organism metadata version number, organism unique cell count, unique cell ID to file list mapping, cell ID creation date-time stamp and cell ID replication count. (A hypothetical sketch of these two directory records follows this list.)
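To be clear, none of these directories exist in the published work; the Python dataclasses below are just a hypothetical sketch of what the cell and organism directory records proposed above might contain.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class CellDirectory:
    """Hypothetical per-cell metadata record (not part of the published work)."""
    cell_id: str
    file_to_primer: Dict[str, str]        # file name -> PCR primer tag
    metadata_tag_version: int
    xor_string_version: int
    rs_version_level: str
    file_segment_counts: Dict[str, int]   # file name -> number of DNA segments
    created: datetime
    last_accessed: datetime = None        # optional, per the list above
    last_modified: datetime = None        # optional, per the list above

@dataclass
class OrganismDirectory:
    """Hypothetical per-organism metadata record."""
    organism_id: str
    metadata_version: int
    cell_count: int
    cell_to_files: Dict[str, List[str]]   # cell ID -> file name list
    cell_created: Dict[str, datetime]     # cell ID -> creation date-time
    cell_replica_count: Dict[str, int]    # cell ID -> replica count
```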

The problem with an organism cell-ID file list is that it could be quite long. It might be better to somehow indicate a range, or list of ranges, of PCR primer tags that reside in each cell ID. I can see other alternatives using a segmented organism directory, or an indirect organism cell-to-file-list b-tree, which could hold the file name list to cell-ID mapping.

It’s unclear whether DNA data storage should support a multi-level hierarchy, like file system directory structures, or a flat hierarchy, like object storage, which just has buckets of objects. Considering the cellular structure of DNA data, it appears to me more like buckets, and the glacial access seems most useful for archive systems. So I would lean toward a flat hierarchy and an object storage structure.

Is DNA data WORM or modifiable? Given the effort required to encode and create DNA data segment storage, it would seem it’s more WORM-like than modifiable storage.

How will the DNA data storage system persist or be kept alive, if that’s the right word for it? There must be some standard internal cell mechanisms to maintain its existence. Perhaps the researchers have just inserted file data DNA into a standard cell as a sort of junk DNA.

If this were the case, you’d almost want to create a separate data nucleus inside a cell that would just hold file data and wouldn’t interfere with normal cellular operations.

But doesn’t the PCR primer tag approach lend itself better to a key-value store database?

Photo Credit(s): Cell structure, National Cancer Institute; Prentice Hall textbook; Guide to OpenVMS File Applications; Unix inodes, CSE410, Washington.edu; Key-value databases, Wikipedia, by Clescop – own work, CC BY-SA 4.0

Domesticating data

Read an article the other day from MIT News (Taming Data) about a new system that scans all your tabular data and provides an easy way to query all this data from one system. The researchers call the system the Data Civilizer.

What does it do

Tabular data seems to be the one constant in corporate data (that, and for me, PowerPoint and Word docs). Most databases are tables of one form or another (some row based and some column based). Lots of operational data is in spreadsheets (tables by another name) of some type. And when I look over most IT/networking/storage management GUIs, tables (rows and columns) of data are the norm.

The Data Civilizer takes all this tabular data and analyzes it all, column by column, and calculates descriptive characterization statistics for each column.

Numerical data could be characterized by range, standard deviation, median/average, cardinality, etc. For textual data, a list of the words in the column by frequency might suffice. It also indexes every word in the tables it analyzes.

Armed with its statistical characterization of each column, the Data Civilizer can then generate a similarity index between any two columns of data across the tables it has analyzed. In that way it can connect data in one table with data in another.

Once it has a similarity matrix and has indexed every word in every table column it has analyzed, it can map the tabular data, showing which columns look similar to other columns. Then any arbitrary query for data can be executed on any table that contains similar data, returning the results of the query across the multiple tables it has analyzed.
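The articles don’t spell out the actual statistics or similarity measure the Data Civilizer uses, so the sketch below is only a toy version of the idea: profile each column, then score pairs of column profiles for overlap. The profile fields and the Jaccard/range-overlap scoring are assumptions for illustration.

```python
import statistics

def profile_column(values):
    """Compute a rough per-column profile: numeric stats or word frequencies."""
    try:
        nums = [float(v) for v in values]
        return {"kind": "numeric",
                "min": min(nums), "max": max(nums),
                "mean": statistics.mean(nums),
                "stdev": statistics.pstdev(nums),
                "cardinality": len(set(nums))}
    except ValueError:
        words = [str(v).lower() for v in values]
        return {"kind": "text", "tokens": set(words),
                "cardinality": len(set(words))}

def similarity(profile_a, profile_b):
    """Toy similarity score between two column profiles (0.0 to 1.0)."""
    if profile_a["kind"] != profile_b["kind"]:
        return 0.0
    if profile_a["kind"] == "text":
        a, b = profile_a["tokens"], profile_b["tokens"]
        return len(a & b) / len(a | b)      # Jaccard overlap of column values
    # For numeric columns, compare range overlap as a crude proxy.
    lo = max(profile_a["min"], profile_b["min"])
    hi = min(profile_a["max"], profile_b["max"])
    span = max(profile_a["max"], profile_b["max"]) - min(profile_a["min"], profile_b["min"])
    return max(0.0, (hi - lo) / span) if span else 1.0

col1 = profile_column(["Acme", "Globex", "Initech"])
col2 = profile_column(["initech", "acme", "Hooli"])
print(similarity(col1, col2))   # 0.5 -- the two columns share half their values
```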

Potential improvements

The researchers indicated that they currently don’t support every table data format. This may be a sizable task on its own.

In addition, statistical characterization or classification seems old school nowadays. Most new AI is moving off statistical analysis to more neural net types of classification. It’s unclear whether you could just feed all the tabular data to a deep learning neural net, but if the end game is to find similarities across disparate data sets, then neural nets are probably a better way to go. How you would combine this with brute force indexing of all tabular data words is another question.

~~~~

In the end, as I look at my company’s information, even most of my Word docs are organized in some sort of table, so cross-table queries could help me a lot. Let me know when it can handle Excel and Word docs and I’ll take another look.

Photo Credit(s): Linear system table representation 2 by Ronald O’Daniel; Glenda Sims by Glendathegood

 

NSA’s huge (YBs) new data center to turn on in 2013

 

National Security Agency seal

Ran across a story in Wired about the new NSA Utah data center today which is scheduled to be operational in September of 2013.

This new data center is intended to house copies of all communications intercepted by the NSA. We have talked about this data center before and how it’s going to store YBs of data (see my Yottabytes by 2015?! post).

One major problem with having a YB of communications intercepts is that you need to have multiple copies of it for protection in case of human or technical error.

Apparently, the NSA has a secondary data center in San Antonio to back up its Utah facility. That’s one copy. We also wrote another post on protecting and indexing all this data (see my Protecting the Yottabyte Archive post).

NSA data centers

The Utah facility has enough fuel onsite to power and cool the data center for 3 days. It has a special power station to supply the 65MW of power needed. There are two side-by-side raised-floor halls for servers, storage and switches, each with 25K square feet of floor space. That doesn’t include another 900K square feet of technical support and office space to secure and manage the data center.

In order to help collect and temporarily store all this information, the agency has apparently been undergoing a data center building boom, renovating and expanding its data centers throughout the country. The article discusses some of the other NSA information collection points/data centers, in Texas, Colorado, Georgia, Hawaii, Tennessee, and of course, Maryland.

New NSA super computers

In addition to the communications intercept storage, the article also talks about a special-purpose, decrypting supercomputer that the NSA has developed over the past decade, which will also be housed in the Utah data center. The NSA seems to have created a supercomputer that dwarfs the best Cray XT5 supercomputer clusters available today, which operate at 1.75 petaflops.

I suppose, what with all the encrypted traffic now being generated, the NSA would need some way to decrypt this information in order to understand it. I was under the impression that they were interested in the non-encrypted communications, but I guess the NSA is even more interested in encrypted traffic.

Decrypting old data

With all this data being stored, the thought is that data now encrypted with unbreakable AES-128, -192 or -256 encryption will eventually become decipherable. At that time, foreign government and other secret communications will all be readable.

By storing these secret communications now, they can scan this treasure trove for patterns that eventually emerge, and once found, such patterns will ultimately lead to decrypting the data. Now we know why they need YBs of storage.

So the NSA will at least know what was going on in the past. However, how soon they can move that up to real-time decryption of today’s communications is another question. But knowing the past may help in understanding what’s going on today.

~~~~

So be careful what you say today, even if it’s encrypted. Someone (the NSA and its peers around the world) will probably be listening in and, someday soon, will understand every word that’s been said.

Comments?

Hadoop – part 2

Hadoop Graphic (c) 2011 Silverton Consulting

(Sorry about the length).

In part 1 we discussed some of Hadoop’s core characteristics with respect to the Hadoop distributed file system (HDFS) and the MapReduce analytics engine. Now in part 2 we promised to discuss some of the other projects that have emerged to make Hadoop and specifically MapReduce even easier to use to analyze unstructured data.

Specifically, we have a set of tools which use Hadoop to construct database-like structures out of unstructured data. Namely,

  • Cassandra – which maps HDFS data into a database, but into a columnar-based sparse table structure rather than the more traditional relational database row form. Cassandra was written by Facebook for inbox search. Columnar databases support sparse data much more efficiently. Data access is via a Thrift-based API supporting many languages. Cassandra’s data model is based on columns, column families and column super-families. The datum for any column item is a three-value structure consisting of a name, the value of the item and a time stamp (a toy sketch of this data model follows the list). One nice thing about Cassandra is that one can tune it for any consistency model one requires, from no consistency to always consistent and points in between. Also, Cassandra is optimized for writes. Cassandra can be used as the Map portion of a MapReduce run.
  • Hbase – which also maps HDFS data into a database-like structure and provides Java API access to this DB. Hbase is useful for million-row tables with arbitrary column counts. Apparently Hbase is an outgrowth of Google’s Bigtable, which did much the same thing, only against the Google file system (GFS). In contrast to Hive below, Hbase doesn’t run on top of MapReduce; rather it replaces MapReduce, although it can be used as a source or target of MapReduce operations. Also, Hbase is somewhat tuned for random access read operations and, as such, can be used to support some transaction-oriented applications. Moreover, Hbase can run on HDFS or Amazon S3 infrastructure.
  • Hive – which maps a “simple SQL” (called QL) on top of a data warehouse built on Hadoop. Some of these queries may take a long time to execute, and as the HDFS data is unstructured, the map function must extract the data using a database-like schema into something approximating a relational database. Hive operates on top of Hadoop’s MapReduce function.
  • Hypertable – an open source c++ implementation of Google’s BigTable, only using HDFS rather than GFS. Actually, Hypertable can use any distributed file system and is another columnar database (like Cassandra above), but it only supports columns and column families. Hypertable supports both a client (c++) API and a Thrift API. Also, as Hypertable is written in c++, it is considered the most optimized of the Hadoop-oriented databases (although there is some debate here).
  • Pig – a dataflow processing (scripting) language built on top of Hadoop which supports a sort of database interpreter for HDFS in combination with interpretive analysis. Essentially, Pig takes the scripting language and emits a dataflow graph which is then used by MapReduce to analyze the data in HDFS. Pig supports both batch and interactive execution and can also be used through a Java API.
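To make the Cassandra-style data model described in the first bullet a bit more concrete, here’s a plain-Python toy of a column-family store where each datum is the name/value/timestamp triple mentioned above. It is not a Cassandra (or Hbase) client, just an in-memory illustration of the data model.

```python
import time
from collections import defaultdict

class ColumnFamilyStore:
    """Toy in-memory model of a columnar store: row key -> column family ->
    column name -> (value, timestamp). This only mimics the data model
    described above; it is not a client for any real database."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # Each datum is a (name, value, timestamp) triple; last write wins.
        self.rows[row_key][family][column] = (value, time.time())

    def get(self, row_key, family, column):
        value, _ts = self.rows[row_key][family][column]
        return value

    def get_family(self, row_key, family):
        # Sparse rows: only the columns actually written exist for this row.
        return {name: v for name, (v, _ts) in self.rows[row_key][family].items()}

store = ColumnFamilyStore()
store.put("user:42", "inbox_index", "term:hadoop", "msg-17,msg-93")
store.put("user:42", "profile", "name", "Ray")
print(store.get_family("user:42", "inbox_index"))
```

The sparseness falls out naturally: a row only carries the columns that were actually written to it, which is what makes the columnar form efficient for sparse data.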

Hadoop also supports special purpose tools used for very specialized analysis such as

  • Mahout – an Apache open source project which applies machine learning algorithms to HDFS data, providing classification, characterization, and other feature extraction. However, Mahout works on non-Hadoop clusters as well. Mahout supports 4 techniques: recommendation mining, clustering, classification, and frequent itemset mining. While Mahout uses the MapReduce framework of Hadoop, it does not appear that Mahout uses Hadoop MapReduce directly; rather it acts as a replacement for MapReduce focused on machine learning activities.
  • Hama – an Apache open source project which is used to perform parallel matrix and graph computations against Hadoop cluster data. The focus here is on scientific computation. Hama also supports non-Hadoop frameworks, including BSP and Dryad (DryadLINQ?). Hama operates on top of MapReduce and can take advantage of Hbase data structures.

There are other tools that have sprung up around Hadoop to make it easier to configure, test and use, namely

  • Chukwa – which is used for monitoring large distributed clusters of servers.
  • ZooKeeper – which is a cluster configuration tool and distributed serialization manager useful for building large clusters of Hadoop nodes.
  • MRUnit – which is used to unit test MapReduce programs without having to run them on the whole cluster.
  • Whirr – which extends HDFS to use cloud storage services. It’s unclear how well this would work with PBs of data to be processed, but maybe it can colocate the data and the compute activities in the same cloud data center.

As for who uses these tools: Facebook uses Hive and Cassandra, Yahoo uses Pig, and there are myriad users of the other projects as well. In most cases the company identified above developed the program source code originally and then contributed it to Apache for use in the Hadoop open source project. In addition, those companies continue to fix, support and enhance these packages as well.

5 killer apps for $0.10/TB/year

Biblioteca José Vasconcelos / Vasconcelos Library by * CliNKer * (from flickr) (cc)

Cloud storage keeps getting more viable and I see storage pricing going down considerably over time. All of which got me thinking about what could be done with a dime per TB per year of storage ($0.10/TB/yr). Now most cloud providers charge 10 cents or more per GB per month, so this is at least 12,000 times less expensive, but it seems inevitable at some point in time.

So here are my 5 killer apps for $0.10/TB/yr cloud storage:

  1. Photo record of life – something akin to glasses which would record a wide-angle, high-megapixel video record of everything I looked at, for every second of my waking life. I think a photo shot every second for 12hrs/day, 365days/yr would be about ~16M photos, and at 4MB per photo this would be about ~64TB per person per year. For my 4 person family this would cost ~$26/year for each year of family life, and for a 40 year family time span the last payment for this would be ~$1040, or an average payment of ~$520/year (see the cost calculator sketch after this list).
  2. Audio recording of life – something akin to an always-on bluetooth headset which would record an audio feed to go with the semi-video or photo record above. By being an always-on bluetooth headset it would automatically catch cell phone as well as spoken conversations, but it would need to plug into landlines as well. As discussed in my YB by 2015 archive post, one minute of MP3 audio recording takes up roughly a MB of storage. Let’s say I converse with someone ~33% of my waking day. So this would be about 4 hrs of MP3 audio/day, 365days/yr, or about 21TB per year per person. For my family this would cost ~$8.40/year for storage, and for a 40 year family life span my last payment would be ~$336, or an average of ~$168/yr.
  3. Home security cameras – with ethernet-based security cameras, it wouldn’t be hard to record 360-degree outside coverage as well as inside points-of-entry video. The quantities for the photo record of my life would suffice here as well, but one doesn’t need to retain the data for a whole year; perhaps a rolling 30 day record would suffice, though it would be recorded for 24 hours a day. Assuming 8 cameras outside and inside, this could be stored in about 10TB of storage per camera, or about 80TB of storage, or $8/year, and it would not increase over time.
  4. No more deletes/version everything – if storage were cheap enough we would never delete data. Normal data change activity is in the 5 to 10% per week range, but this does not account for retaining deleted data. So let’s say we would need to store an additional 20% of our primary/active data per week for deleted data. For a 1TB primary storage working set, a ~20% deletion rate per week would be ~10TB of deleted data per year per person, and for my family ~$4/yr, with my last yearly payment being ~$160. If we were to factor in data growth rates of ~20%/year, this would go up substantially, averaging ~$7.3k/yr over 40 years.
  5. Customized search engines – if storage AND bandwidth were cheap enough, it would be nice to have my own customized search engine. Such a capability would follow all my web clicks, spawning a search spider for every website I traverse, and provide customized “deep” searching for every web page I view. Such an index might take 50% of the size of a page, and on average my old website used ~18KB per page, so at 50% this index would require ~9KB per page. Assuming I look at ~250 web pages per business day, of which maybe ~170 are unique, and each unique page probably links to 2 more unique pages, which link to two more, which link to two more, … If we go 10 pages deep, then for 170 pages viewed with an average branching factor of 2, we would need to index ~174K pages/day, and for a year this would represent about 0.6TB of page index. For my household, a customized search engine would cost ~$0.25 of additional storage per year, and for 40 years my last payment would be ~$10.
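For anyone who wants to re-run the arithmetic above, here’s a small calculator for the payment model used in these estimates: data accumulates every year, and each year’s bill covers everything stored so far at a fixed $/TB/year. The function and numbers are just a sketch; for killer app #1 it lands close to the rounded figures quoted in that item.

```python
def yearly_payments(tb_added_per_year, price_per_tb_year=0.10, years=40):
    """Payment schedule when data accumulates and each year you pay for
    everything stored so far (the model used in the estimates above)."""
    payments = []
    stored_tb = 0.0
    for _ in range(years):
        stored_tb += tb_added_per_year
        payments.append(stored_tb * price_per_tb_year)
    return payments

# Killer app #1: ~64TB/person/year of photos for a family of four.
photo = yearly_payments(tb_added_per_year=64 * 4)
print(f"last payment ~${photo[-1]:.0f}, average ~${sum(photo)/len(photo):.0f}")
# -> last payment ~$1024, average ~$525
```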

I struggled to come up with ideas that would cost between $10 and $500 a year, as every other storage use I could think of came out at significantly less than $1/year for a family of four. This seems to say that there might be plenty of applications even at something under $10 per TB per year, still roughly 1/120th of current cloud storage costs.

Any other applications out there that could take advantage of a dime/TB/year?

Google vs. National Information Exchange Model

Information Exchange Package Documents (IEPD) lifecycle from www.niem.gov

Wouldn’t the national information exchange be better served by deferring the National Information Exchange Model (NIEM) and instead implementing some sort of Google-like search across federal, state, and municipal text data records? Most federal, state and local data resides in sophisticated databases using various information management tools, but such tools all seem to support ways to create a PDF, DOC, or other text output for their information records. Once in text form, such data could easily be indexed by Google or other search engines, and thus searched by any term in the text record.

Now this could never completely replace NIEM, e.g., it could never offer even “close-to” real-time information sharing. But true real-time sharing would be impossible even with NIEM. And whereas NIEM is still under discussion today (years after its initial draft) and will no doubt require even more time to fully implement, text-based search could be available today with minimal cost and effort.

What would be missing from a text based search scheme vs. NIEM:

  • “Near” real-time sharing of information
  • Security constraints on information being shared
  • Contextual information surrounding data records
  • Semantic information explaining data fields

Text based information sharing in operation

How would something like a Google-type text search work to share government information? As discussed above, government information management tools would need to convert data records into text. This could be a PDF, text file, DOC file, or PPT, and more formats could be supported in the future.

Once text versions of data records were available, they would need to be uploaded to a (federally hosted) special website where a search engine could scan and index them. Indexing such a repository would be no more complex than doing the same for the web today. Even so, it will take time to scan and index the data, and until that is done, searching the data will not be available. However, Google and others can scan web pages in seconds and often scan websites daily, so the delay may be as little as minutes to days after data upload.

Securing text based search data

Search security could be accomplished in any number of ways, e.g., with different websites or directories established at each security level. Assuming one used different websites, then Google or another search engine could be directed to search any security-level site at your level and below for the information you requested. This may take some effort to implement, but even today one can restrict a Google search to a set of websites. It’s conceivable that some script could be developed to invoke a search request based on your security level to restrict search results.
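As a trivial illustration of the restrict-search-to-a-set-of-websites idea, the sketch below builds a site-restricted query string from a user’s clearance level. The site names, level ordering and query wrapper are all made up for the example; the only real mechanism relied on is the search engines’ site: operator.

```python
# Hypothetical mapping of security levels to the upload sites for that level.
SITES_BY_LEVEL = {
    "public":    ["data-public.example.gov"],
    "sensitive": ["data-sensitive.example.gov"],
    "secret":    ["data-secret.example.gov"],
}
LEVEL_ORDER = ["public", "sensitive", "secret"]

def restricted_query(terms, clearance):
    """Build a search-engine query limited to sites at or below a clearance level."""
    allowed = LEVEL_ORDER[: LEVEL_ORDER.index(clearance) + 1]
    sites = [s for level in allowed for s in SITES_BY_LEVEL[level]]
    site_filter = " OR ".join(f"site:{s}" for s in sites)
    return f"{terms} ({site_filter})"

print(restricted_query("stolen vehicle report 2009", "sensitive"))
# stolen vehicle report 2009 (site:data-public.example.gov OR site:data-sensitive.example.gov)
```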

Gaining participation

Once the upload websites/repositories are up and running, getting federal, state and local government to place data into those repositories may take some persuasion. Federal funding can be used as one means to enforce compliance. Bootstrapping data loading into the searchable repository can help ensure initial usage, and once that is established, hopefully ease of access and search effectiveness can help ensure its continued use.

Interim path to NIEM

One loses all contextual and most semantic information when converting a database record into text format, but that can’t be helped. What one gains by doing this is an almost immediately searchable repository of information.

For example, Google can be licensed to operate on internal sites for a fair but high fee, and we’re sure Microsoft is willing to do the same for Bing/FAST. Setting up a website to do the uploads could take an hour or so using something like WordPress and a file management plugin like FileBase, but other alternatives exist.

Would this support the traffic for the entire nation’s information repository? Probably not. However, it would be a quick and easy proof of concept which could go a long way toward getting information exchange started. Nonetheless, I wouldn’t underestimate the speed and efficiency of WordPress, as it supports a number of highly active websites/blogs. Over time such a WordPress website could be optimized, if necessary, to support even higher performance.

As this takes off, perhaps the need for NIEM becomes less time sensitive, allowing it to take a more reasoned approach. Also, as the web and search engines become more semantically aware, perhaps the need for NIEM diminishes further. Even so, there may ultimately need to be something like NIEM to facilitate increased security, real-time search, and database context and semantics.

In the meantime, a more primitive textual search mechanism such as the one described above could be up and available within a day or so. True, it wouldn’t provide real-time search and wouldn’t provide everything NIEM could do, but it could provide viable, actionable information exchange today.

I am probably oversimplifying the complexity of providing true information sharing, but such a capability could go a long way toward helping integrate the governmental information sharing needed to support national security.

5 laws of unstructured data

Richard (Dick) Nafzger with Apollo data tape by Goddard Photo and Video (cc) (from flickr)

All data operates under a set of laws, but unstructured data suffers from these tendencies more than most. Although information technology has helped us create and manage data more easily, it hasn’t done much to minimize the problems these laws produce.

As such, I introduce here my 5 laws of unstructured data in the hopes that they may help us better understand the data we create.

Law 1: Unstructured data grows 50% per year

This has been a truism in the data center for as far back as I can remember. In the data center this growth is driven by business transactions, new applications and new products/services. On top of all that, corporate compliance often dictates that data be retained long after its usefulness has passed.

Nowadays, Law 1 is also true for the home user. Here it’s a combination of email and media. Not only are cameras moving from 6 to 9 megapixels and home video moving to high definition, there is also just a whole lot more media being created every day. And now social media seems to have doubled or tripled our outreach data creation above “normal email” alone.

Law 2: Unstructured data access frequency diminishes over time

Data created today is accessed frequently during its first 90 days of life and less often after that. Reasons for this decaying access pattern vary, but human memory has to play a significant part in it.

Furthermore, business transactions go through a life cycle, from initiation, to delivery and finally, to termination. During these transitions various unstructured data are created representing the transaction state. Such data may be examined at quarter end and possibly at year end, but may never see the light of day after that.

Law 3: Unsearchable data is lost data

Given Law 2’s data access decay and Law 1’s data growth, unsearchable data is, by definition, inaccessible data. It’s not hard to imagine how this plays out in the data center or at home.

For the data center, unstructured data mostly resides in user and application directories. I am constantly amazed that it’s easier to find data out on the web than it is to find data elsewhere in the data center. Moreover, e-discovery has become a major business segment in recent years by attempting to search unstructured corporate data.

As a Mac user, my home environment is searchable for any text string. However, my photo library is another matter. Finding a specific photo from a couple of years ago requires a sequential perusal of the iPhoto library and, as such, is seldom done.

Law 4: Unstructured data is copied often

Over a decade ago, a company I worked with sponsored a study to see how often data is copied. The numbers we came up with were impressive. A small but significant percentage of data is copied often; it’s not unusual to see 6-8 copies of such data. Some of this copying occurs when final documents are passed on, some comes from teamwork and other joint collaboration as working documents are reviewed, and some is just interesting information that deserves broader dissemination. As such, data copies can represent a significant portion of any data center’s storage.

I suppose data proliferation may not be as evident in the home, but our home would be an exception. Each of our Macs has a copy of every email account and copies of the best photos. In addition, with laptops and multiple desktops, most of our Macs have copies of each (adult) user’s work environment.

Law 5: Unstructured data manual classification schemes degrade over time

In the data center, one could easily classify any file data created and maintain a database of file metadata to facilitate access to it. But who has the discipline or spare time to update such a database whenever they create a file or document? While this may work for “official records”, the effort involved makes it unusable for everything else.

My favorite home example of this is, once again, our iPhoto library with its manual classification system using stars, e.g., I can assign anything from 0 to 5 stars to any photo. It used to be that after each camera import I would assign a star rating to each new photo. Nowadays, the only time I do this is once a year, and as such, it’s becoming more problematic and less useful. As we take more photographs each year, this becomes much more of a burden.

Not sure these 5 laws of unstructured data are mutually exclusive and completely exhaustive, but it’s a start. If anyone has any ideas on how to improve my unstructured data laws, feel free to comment below. In the meantime, as for structured data laws, …

7 grand challenges for the next storage century

Clock tower (4) by TJ Morris (cc) (from flickr)

I saw a recent IEEE Spectrum article on engineering’s grand challenges for the next century and thought something similar should be done for data storage. So this is a start:

  • Replace magnetic storage – most predictions show that magnetic disk storage has another 25 years, and magnetic tape another decade after that, before they run out of steam. Such end dates have been wrong before, but it is unlikely that we will be using disk or tape 50 years from now. Some sort of solid state device seems the most probable next evolution of storage. I doubt this will be NAND, considering its write endurance and other long-term reliability issues, but if such issues could be resolved, maybe it could replace magnetic storage.
  • 1000 year storage – paper can be printed today with non-acidic ink and retain its image for over 1,000 years. Nothing in data storage today can claim much more than 100-year longevity. The world needs data storage that lasts much longer than 100 years.
  • Zero energy storage – today SSD/NAND and rotating magnetic media consume energy constantly in order to be accessible. Ultimately, the world needs some sort of storage that only consumes energy when read or written; such storage would provide “online access with offline power consumption”.
  • Convergent fabrics running divergent protocols – whether it’s ethernet, infiniband, FC, or something new, all fabrics should be able to handle any and all storage (and datacenter) protocols. The internet has become so ubiquitous because it handles just about any protocol we throw at it. We need the same or something similar for datacenter fabrics.
  • Securing data – securing books or paper is relatively straightforward today; just throw them in a vault/safety deposit box. Securing data seems simple and yet is not widely practiced today. It doesn’t have to be that way. We need better, longer-lasting tools and methodologies to secure our data.
  • Public data repositories – libraries exist to provide access to the output of society in the form of books, magazines, papers and other printed artifacts. No such repository exists today for data. Society would be better served if there were library-like institutions that could store and provide access to data. Most of the issues here are legal, due to data ownership, but technological issues exist as well.
  • Associative accessed storage – sequential and random access have been around for over half a century now. Associative storage could complement these and be another approach, allowing storage to be retrieved by its content. We can kind of do this today by keywording and indexing data (a toy keyword-index sketch follows this list). Biological memory is accessed via associations or linkages to other concepts; once accessed, memories seem to be replayed almost sequentially from there. Something comparable to biological memory may be required to build more intelligent machines.
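The keywording-and-indexing stand-in mentioned in the last bullet looks roughly like the toy inverted index below: store items by their content words, then recall them by any combination of those words. It is only a software approximation of associative access, not a proposal for how such storage would actually be built.

```python
from collections import defaultdict

class KeywordIndex:
    """Toy inverted index: retrieve stored items by their content words,
    a crude software stand-in for associatively accessed storage."""

    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of item ids
        self.items = {}

    def store(self, item_id, text):
        self.items[item_id] = text
        for word in text.lower().split():
            self.postings[word].add(item_id)

    def recall(self, *cue_words):
        """Return items associated with all of the cue words."""
        sets = [self.postings[w.lower()] for w in cue_words]
        hits = set.intersection(*sets) if sets else set()
        return [self.items[i] for i in sorted(hits)]

idx = KeywordIndex()
idx.store("doc1", "clock tower photo from flickr")
idx.store("doc2", "grand challenges for storage")
print(idx.recall("storage", "challenges"))   # ['grand challenges for storage']
```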

Some of these are already being pursued, while others receive no interest today. Nonetheless, I believe they all deserve investigation if storage is to continue to serve its primary role for society: as a long-term storehouse for society’s culture, thoughts and deeds.

Comments?