Today's data and the 1000 year archive

Untitled (picture of a keypunch machine) by Marcin Wichary (cc) (from flickr)
Untitled (picture of a keypunch machine) by Marcin Wichary (cc) (from flickr)

Somewhere in my basement I have card boxes dating back to the 1970s and paper tape canisters dating back to the 1960s with basic, 360-assembly, COBOL, PL/1 programs on them. These could be reconstructed if needed, by reading the Hollerith encoding and typing them out into text files. Finding a compiler/assembler/interpreter to interpret and execute them is another matter. But, just knowing the logic may suffice to translate them into another readily compilable language of today. Hollerith is a data card format which is well known and well described. But what of the data being created today. How will we be able to read such data in 50 years let alone 500? That is the problem.

Vista de la Biblioteca Vasconcelos by Eneas (cc) (from flickr)
Vista de la Biblioteca Vasconcelos by Eneas (cc) (from flickr)

Civilization needs to come up with some way to keep information around for 1000 years or more. There are books relevant today (besides the Bible, Koran, and other sacred texts) that would alter the world as we know it if they were unable to be read 900 years ago. No doubt, data or information like this, being created today will survive to posterity, by virtue of its recognized importance to the world. But there are a few problems with this viewpoint:

  • Not all documents/books/information are recognized as important during their lifetime of readability
  • Some important information is actively suppressed and may never be published during a regime’s lifetime
  • Even seemingly “unimportant information” may have significance to future generations

From my perspective, knowing what’s important to the future needs to be left to future generations to decide.

Formats are the problem

Consider my blog posts, WordPress creates MySQL database entries for blog posts. Imagine deciphering MySQL database entries, 500 or 1000 years in the future and the problem becomes obvious. Of course, WordPress is open source, so this information could conceivable be readily interpretable by reading it’s source code.

I have written before about the forms that such long lived files can take but for now consider that some form of digital representation of a file (magnetic, optical, paper, etc.) can be constructed that lasts a millennia. Some data forms are easier to read than others (e.g., paper) but even paper can be encoded with bar codes that would be difficult to decipher without a key to their format.

The real problem becomes file or artifact formats. Who or what in 1000 years will be able to render a Jpeg file, able to display an old MS/Word file of 1995, or be able to read a WordPerfect file from 1985. Okay, a Jpeg is probably a bad example as it’s a standard format but, older Word and WordPerfect file formats constitute a lot of information today. Although there may be programs available to read them today, the likelihood that they will continue to do so in 50, let alone 500 years, is pretty slim.

The problem is that as applications evolve, from one version to another, formats change and developers have negative incentive to publicize these new file formats. Few developers today wants to supply competitors with easy access to convert files to a competitive format. Hence, as developers or applications go out of business, formats cease to be readable or convertable into anything that could be deciphered 50 years hence.

Solutions to disappearing formats

What’s missing, in my view, is a file format repository. Such a repository could be maintained by an adjunct of national patent trade offices (nPTOs). Just like todays patents, file formats once published, could be available for all to see, in multiple databases or print outs. Corporations or other entities that create applications with new file formats would be required to register their new file format with the local nPTO. Such a format description would be kept confidential as long as that application or its descendants continued to support that format or copyright time frames, whichever came first.

The form that a file format could take could be the subject of standards activities but in the mean time, anything that explains the various fields, records, and logical organization of a format, in a text file, would be a step in the right direction.

This brings up another viable solution to this problem – self defining file formats. Applications that use native XML as their file format essentially create a self defining file format. Such a file format could be potentially understood by any XML parser. And XML format, as a defined standard, are wide enough defined that they could conceivable be available to archivists of the year 3000. So I applaud Microsoft for using XML for their latest generation of Office file formats. Others, please take up the cause.

If such repositories existed today, people in the year 3010 could still be reading my blog entries and wonder why I wrote them…