Data of the world, lay down your chains

Prison Planet by AZRainman (cc) (from Flickr)

GitHub, that free open source repository of software, is taking on a new role, this time as a repository for municipal data sets. At least that’s what a recent article suggested (see Catch my Diff: GitHub’s New Feature Means Big Things for Open Data), after GitHub announced new changes in its .GeoJSON support (see Diffable, more customizable maps).

The article talks about the fact that maps in GitHub (using .GeoJSON data) can now be diffed, that is, you can see at a glance what changes have been made to them. In the example in the article (easier to see on GitHub) you can see how one Chicago congressional district has changed over time.

Unbeknownst to me, GitHub has started becoming a repository for geographical data. That is, any .GeoJSON data file can now be saved in a repository on GitHub and rendered as a map using desktop or web based tools. With the latest changes at GitHub, one can now see changes that are made to a .GeoJSON file as two or more views of a map or of the properties of map elements.

Of course all the other things one can do with GitHub repositories are also available, such as fork, pull, push, etc. All this functionality was developed to support software coding but applies equally well to .GeoJSON data files, because they look just like source code (really more like .XML, but close enough).
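Since GeoJSON is just structured text, ordinary line-oriented diff tools work on it like they do on source code. Here is a minimal sketch of such a file, generated with Python's standard json module; the district name and coordinates below are made up for illustration:

```python
import json

# A minimal (hypothetical) GeoJSON Feature -- plain text, so git can
# diff it line by line just like source code.
district = {
    "type": "Feature",
    "properties": {"name": "IL Congressional District 7"},
    "geometry": {
        "type": "Polygon",
        # Coordinates are [longitude, latitude] pairs; these are invented.
        "coordinates": [[
            [-87.8, 41.8], [-87.6, 41.8],
            [-87.6, 41.9], [-87.8, 41.9],
            [-87.8, 41.8],
        ]],
    },
}

# Pretty-printing with sorted keys keeps line-by-line diffs small and stable.
text = json.dumps(district, indent=2, sort_keys=True)
print(text.splitlines()[0])  # -> "{"
```

Each edit to a boundary touches only a few coordinate lines, which is exactly what makes GitHub's map diffs possible.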

So why maps as source code data?

Municipalities have started to use GitHub to host their Open Data initiatives. For example Digital Chicago has started converting some of their internal datasets into .GeoJSON data files and loading them up on GitHub for anyone to see, fork, modify, etc.

I was easily able to log in and fork one of the data sets. But there’s the little matter of submitting your committed changes back to the project owner before the original dataset can be modified.

Also, I was able to render the .GeoJSON data into a viewable map just by clicking on a committed file (I suppose this is a web service). The README file has instructions for doing this on your desktop, outside of a web browser, for R, Ruby and Python.

In any case, having the data online, editable and committable allows anyone with a GitHub account to augment the data to make it better and more comprehensive. And with the data now online, any application could make use of it to offer services based on that data.

I guess that’s what the Open Data movement is all about: make previously proprietary government data freely available in a standardized format, and add tools to view and modify it, in the hope that businesses find new ways to make use of it. As such, the data should become more visible and more useful to the world and to the cities that are supporting it.

If you want to learn more about Project Open Data, see the blog post from last year or the GitHub Project wiki pages.


Super long term archive

Read an article this past week in Scientific American about a new fused silica glass storage device from Hitachi Ltd., announced last September. The new media is recorded with lasers burning dots, which represent a binary 1, or leaving spaces, which represent a binary 0, on the media.

As can be seen in the photos above, the data can readily be read with a microscope, which makes it pretty easy for some future civilization to read the binary data. However, knowing how to decode that binary data into pictures, documents and text is another matter entirely.

We have discussed the format problem before in our Today’s data and the 1000 year archive and Digital Rosetta stone vs. 3D barcodes posts. And this new technology would compete with the currently available, long term archivable M-disc DVD technology from Millenniata, which we have also talked about before.

Semi-perpetual storage archive!!

Hitachi tested the new fused silica glass storage media at 1000°C for several hours, which they say indicates that it can survive several hundred million years without degradation. At this level it can provide a 300 million year storage archive (M-disc only claims 1000 years). They are calling their new storage device “semi-perpetual” storage. If hundreds of millions of years is semi-perpetual, I gotta wonder what perpetual storage might look like.

At CD recording density, with higher densities possible

They were able to achieve CD levels of recording density with a four layer approach. This amounts to about 40Mb/sqin. DVD technology is on the order of 330Mb/sqin and Blu-ray is ~15Gb/sqin, but neither of those technologies claims even a million year lifetime. Also, with the possibility of even more layers, the 40Mb/sqin could potentially double or quadruple.
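The layer scaling can be sanity-checked with a bit of arithmetic, using only the density figures quoted in this post:

```python
# Back-of-the-envelope check of the recording densities quoted above
# (all figures per square inch; Mb = megabits).
silica_4_layer_mb = 40        # Hitachi's 4-layer fused silica glass
dvd_mb = 330                  # DVD
bluray_mb = 15_000            # Blu-ray, ~15Gb/sqin

per_layer_mb = silica_4_layer_mb / 4          # density of a single layer
eight_layer_mb = per_layer_mb * 8             # doubling the layer count
print(per_layer_mb, eight_layer_mb)           # -> 10.0 80.0
print(round(bluray_mb / silica_4_layer_mb))   # -> 375 (Blu-ray's density edge)
```

So even with double the layers, the glass media trades several orders of magnitude of density for its extreme lifetime.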

But data formats change every few years nowadays

My problem with all this is the data format issue: we would need something like a digital Rosetta stone for every data format ever conceived in order to make this a practical digital storage device.

Alternatively, we could plan to use it more like an analogue storage device, with black and white or grey scale photographs of the information to be retained imprinted in the media. That way, a simple microscope could be used to see the photo image. I suppose color photographs could be implemented using a different plate per color, similar to four color magazine production processing. Text could be handled by just taking a black and white photo of a document and printing it in the media.

According to a post I read about the size of the collection at the Library of Congress, they currently have about 3PB of digital data in their collections, which in 650MB CD chunks would be about 4.6M CDs. So if there is an intent to copy this data onto the new semi-perpetual storage media for the year 300,002,012, we probably ought to start now.
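A quick back-of-the-envelope check of that CD count (using decimal petabytes and megabytes):

```python
# 3PB of Library of Congress digital data, chunked into 650MB CDs.
PB = 10**15
MB = 10**6

loc_bytes = 3 * PB
cd_bytes = 650 * MB

cds = loc_bytes / cd_bytes
print(f"{cds / 1e6:.1f} million CDs")  # -> "4.6 million CDs"
```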

Another tidbit to add to the discussion: at last month’s Hitachi Data Systems Influencers Summit, HDS was showing off some of their recent lab work, including an optical jukebox on display that they claimed would be used for long term archive. I get the feeling that maybe they plan to commercialize this technology soon. Stay tuned for more.



Image: website (c) 2012 Hitachi, Ltd.

The end of NAND is near, maybe…

In honor of today’s Flash Summit conference, I give my semi-annual amateur view of competing NAND technologies.

I was talking with a major storage vendor today and they said they were sampling sub-20nm NAND chips with P/E cycles of 300 with a data retention period under a week at room temperatures. With those specifications these chips almost can’t get out of the factory with any life left in them.

On the other hand, the only sub-20nm (19nm) NAND information I could find online was inside the new Toshiba THNSNF SSDs with toggle MLC NAND, which guarantee data retention of 3 months at 40°C. I could not find any published P/E cycle specifications for the NAND in their drive, but presumably this is at most equivalent to their prior generation 24nm NAND, or somewhere below that generation’s P/E cycles. (Of course, I couldn’t find P/E cycle specifications for that drive either, but similar technology in other drives seems to offer native 3000 P/E cycles.)

Intel-Micron, SanDisk and others have all recently announced 20nm MLC NAND chips with P/E cycles around 3K to 5K.

Nevertheless, as NAND chips go beyond their rated P/E cycle counts, NAND bit errors increase. With a more powerful ECC algorithm in SSDs and NAND controllers, one can still correct the data coming off the NAND chips. However, at some point beyond 24-bit ECC this probably becomes unsustainable. (See an interesting post by NexGen on ECC capabilities as NAND die sizes shrink.)

I'm not sure how to bridge the gap between the 3-5K P/E cycles here and the 300 P/E cycles being seen by the storage vendor above, but this may be a function of prototype vs. production technology, and possibly the chips had other characteristics the vendor was interested in.

But given the declining endurance of NAND below 20nm, some industry players are investigating other solid state storage technologies to replace NAND, e.g.,  MRAM, FeRAM, PCM and ReRAM all of which are current contenders, at least from a research perspective.

MRAM is currently available in small capacities from Everspin and elsewhere but hasn’t yet reached densities on the order of today’s NAND technologies.

ReRAM is starting to emerge in low power applications as a substitute for SRAM/DRAM, but it’s still early yet.

I haven’t heard much about FeRAM other than that last year researchers at Purdue invented a new non-destructive read FeRAM they call FeTRAM. Standard FeRAMs are already in commercial use, albeit in limited applications, from Ramtron and others, but density is still a hurdle and write performance is a problem.

Recently the PCM approach has heated up, as PCM technology is now commercially available from Micron. Yes, the technology has a long way to go to catch up with NAND densities (it’s available at 45nm technology), but it’s yet another start down a technology pathway: build volume, research ways to reduce cost, increase density and generally improve the technology. In the meantime, I hear it’s an order of magnitude faster than NAND.

Racetrack memory, a form of MRAM using wires to store multiple bits, isn’t standing still either. Last December, IBM announced they had demonstrated Racetrack memory chips in their labs. With this milestone IBM has shown how a complete Racetrack memory chip could be fabricated on a CMOS technology line.

However, in the same press release from IBM on recent research results, they announced a new technique to construct CMOS compatible graphene devices on a chip.  As we have previously reported, another approach to replacing standard NAND technology  uses graphene transistors to replace the storage layer of NAND flash.  Graphene NAND holds the promise of increasing density with much better endurance, retention and reliability than today’s NAND.

So as of today, NAND is still the king of solid state storage technologies but there are a number of princelings and other emerging pretenders, all vying for its throne of tomorrow.


Image: 20 nanometer NAND Flash chip by IntelFreePress

Million year optical disk

Read an article the other day about scientists creating an optical disk that would be readable in a million years or so. The article in Science Mag, titled A million-year hard disk, was intended as a way to warn people in the far future about dangers being created today.

A while back I wrote about a 1000 year archive which was predominantly about disappearing formats. At the time, I believed given the growth in data density that information could easily be copied and saved over time but the formats for that data would be long gone by the time someone tried to read it.

The million year optical disk eliminates the format problem by using pixelated images etched on media, which works just dandy if you happen to have a microscope handy.

Why would you need a million year disk

The problem is how do you warn people in the far future not to mess with radioactive waste deposits buried below. If the waste is radioactive for a million years, you need something around to tell people to keep away from it.

Stone markers last for a few thousand years at best but get overgrown and wear down in time. For instance, my grandmother’s tombstone in Northern Italy has already been worn down so much that it’s almost unreadable. And that’s not even 80 yrs old yet.

But a sapphire hard disk that could easily be read with any serviceable microscope might do the job.

How to create a million year disk

This new disk is similar to the old StorageTek 100K year optical tape. Both would depend on microscopic impressions, something like bits physically marked on media.

For the optical disk the bits are created by etching a sapphire platter with platinum. Apparently the prototype costs €25K but they’re hoping the prices go down with production.

There are actually two 20cm (7.9in) wide disks that are molecularly fused together, and each disk can store 40K miniaturized pages that can hold text or images. They are doing accelerated life testing on the sapphire disks, by bathing them in acid, to ensure a 10M year life for the media and its message.

Presumably the images are grey tone (or in this case platinum tone). If I assume 100Kbytes per page that’s about 4GB, something around a single layer DVD disk in a much larger form factor.
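That capacity estimate is easy to verify; note the 100KB per page figure is my assumption from above, not a Hitachi specification:

```python
# One sapphire disk: 40K miniaturized pages at an assumed ~100KB per page.
pages = 40_000
bytes_per_page = 100 * 10**3   # 100KB -- an assumption, not a spec

total_bytes = pages * bytes_per_page
print(total_bytes / 10**9)     # -> 4.0 (GB, about a single layer DVD)
```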

Why sapphire

It appears that sapphire is available from industrial processes and it seems impervious to wear that harms other material. But that’s what they are trying to prove.

It's unclear why they decided to “molecularly” fuse the two platters together. It seems to me this could easily be a weak link in the technology over the course of a dozen millennia or so. On the other hand, more storage is always a good thing.


In the end, creating dangers today that last millions of years requires some serious thought about how to warn future generations.

Image: Clock of the Long Now by Arenamontanus

e-pathology and data growth

Blue nevus (4 of 4) by euthman (cc) (From Flickr)

I was talking with another analyst the other day by the name of John Koller of Kai Consulting who specializes in the medical space and he was talking about the rise of electronic pathology (e-pathology).  I hadn’t heard about this one.

He said that just like radiology had done in the recent past, pathology investigations are moving to make use of digital formats.

What does that mean?

The biopsies taken today for cancer and disease diagnosis, which involve one or more specimens of tissue examined under a microscope, will now be digitized, and the digital files will be inspected instead of the original slide.

Apparently microscopic examinations typically use a 1×3 inch slide that can have the whole slide devoted to some tissue matter. To be able to do a pathological examination, one has to digitize the whole slide, under magnification, at various depths within the tissue. According to Koller, any tissue is essentially a 3D structure, and pathological exams must inspect different depths (slices) within this sample to form their diagnosis.

I was struck by the need for different slices of the same specimen. I hadn’t anticipated that, but whenever I look in a microscope, I am always adjusting the focus, showing different depths within the slide. So it makes sense: if you want to understand the pathology of a tissue sample, multiple views (or slices) at different depths are a necessity.

So what does a slide take in storage capacity?

Koller said an uncompressed, full slide will take about 300GB of space. However, with compression, and the fact that most often the slide is not completely used, a more typical space consumption would be on the order of 3 to 5GB per specimen.

As for volume, Koller indicated that a medium hospital facility (~300 beds) typically does around 30K radiological studies a year but does about 10X that in pathological studies. So at 300K pathological examinations done a year, we are talking about 900TB to 1.5PB of digitized specimen images a year for a mid-sized hospital.
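Working those figures through (30K radiology studies a year, 10X that in pathology, 3 to 5GB per compressed specimen) gives the annual volume directly:

```python
# Annual pathology image volume for a mid-sized (~300 bed) hospital,
# using the per-specimen figures quoted in this post.
radiology_per_year = 30_000
pathology_per_year = radiology_per_year * 10    # 300,000 exams/year

low_tb  = pathology_per_year * 3 / 1_000        # 3GB/specimen, GB -> TB
high_tb = pathology_per_year * 5 / 1_000        # 5GB/specimen
print(low_tb, high_tb)  # -> 900.0 1500.0 (TB per year)
```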

Why move to  e-pathology?

It can open up a whole myriad of telemedicine offerings similar to the radiological study services currently available around the globe. Today, non-electronic pathology involves sending specimens off to a local lab for examination by medical technicians under a microscope. But with e-pathology, the specimen gets digitized (where, the hospital, the lab, ?) and then the digital files can be sent anywhere around the world, wherever someone qualified is available to scrutinize them.


At a recent analyst event we were discussing big data, and aside from the analytics component and other markets, the vendor made mention that content archives are starting to explode. Given where e-pathology is heading, I can understand why.

It’s great to be in the storage business

The problems with digital audio archives

ldbell15 by Zyada (cc) (from Flickr)

A recent article in Rolling Stone (File Not Found: The Record Industry’s Digital Storage Crisis) laments the fact that digital recordings can go out of service due to format changes, plugin changes, and/or files not being readable (file not found).

In olden days, multi-track masters were recorded on audio tape and kept in vaults. Audio tape formats never seemed to change, or at least changed infrequently, and thus remained re-usable years or decades after being recorded. And the audio tape drives seemed to last forever.

Digital audio recordings on the other hand, are typically stored in book cases/file cabinets/drawers, on media that can easily become out-of-date technology (i.e., un-readable) and in digital formats that seem to change with every new version of software.

Consumer grade media doesn’t archive very well

The article talks about using hard drives for digital recordings and trying to read them decades after they were recorded. I would be surprised if they still spin up (due to stiction), let alone remain readable. But even if these were CDs or DVDs, the lifetime of consumer grade media is not that long, maybe a couple of years at best if treated well; if abused, by writing on them or by bad handling, it's considerably less than that.

Digital audio formats change frequently

The other problem with digital audio recordings is that formats go out of date. I am no expert, but let’s take Apple’s GarageBand as an example. I would be surprised if, 15 years down the line, a GarageBand session recorded today (2010) were readable/usable with GarageBand 2025, assuming that even existed. Sounds like a long time, but it’s probably nothing for popular music coming out today.

Solutions to digital audio media problems

Audio recordings must use archive grade media if they are to survive for longer than 18-36 months. I am aware of archive grade DVD disks but have never tested any, so cannot speak to their viability in this application. However, for an interesting discussion on archive quality CD/DVD media see How to choose CD/DVD archival media. But there are other alternatives.

Removable data center class archive media today includes magnetic tape, removable magnetic disks or removable MO disks.

  • Magnetic tape – LTO media vendors specify archive life on the order of 30 years, however this assumes a drive exists that can read the media.  The LTO consortium states that current generation drives will read back two generations (LTO-5 drive today reads LTO-4 and LTO-3 media) and write back one generation (LTO-5 drive can write on LTO-4 media [in LTO-4 format]).  With LTO generations coming every 2 years or so, it would only take 6 years for a LTO volume, recorded today to be unreadable by current drives.  Naturally, one could keep an old drive around but maintenance/service would no longer be available for it after a couple of years.  LTO drives are available from a number of vendors.
  • Magnetic disk – The RDX Storage Alliance claims a media archive life of 30 years but I wonder whether a RDX drive would exist that could read it and the other question is how archive life was validated. Today’s removable disk typically imitates a magnetic tape drive/format.  The most prominent removable disk vendor is ProStor Systems but there are others.
  • Magneto-optical (MO) media – Plasmon UDO claims a media life of 50+ years for their magneto-optical media. UDO has been used for years to record check images, medical information and other data. Nonetheless, recently UDO technology has not been able to keep up with other digital archive solutions and has gained a pretty bad rap for usability problems. However, they plan to release a new generation of the UDO product line in 2010, which may shake things up if it arrives and can address the usability issues.
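The LTO arithmetic in the first bullet can be sketched out: with a new generation roughly every 2 years and drives reading back only two generations, a cartridge written today falls out of the read-back window fairly quickly:

```python
# Sketch: how long until a tape written today can't be read by the
# then-current LTO drive generation, assuming a new generation every
# ~2 years and drives that read back only two prior generations.
years_per_generation = 2
read_back_generations = 2

# A cartridge becomes unreadable once current drives are 3+ generations newer.
unreadable_after = (read_back_generations + 1) * years_per_generation
print(unreadable_after)  # -> 6 (years)
```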

Finally, one could use non-removable, high density disk drives and migrate the audio data every 2-3 years to new generation disks.  This would keep the data readable and continuously accessible.  Modern storage systems with RAID and other advanced protection schemes can protect data from any single and potentially double drive failure but as drives age, their error rate goes up.  This is why the data needs to be moved to new disks periodically.  Naturally, this is more frequently than magnetic tape, but given disk drive usability and capacity gains, might make sense in certain applications.

As for removable USB sticks – unclear what the archive life is for these consumer devices but potentially some version that went after the archive market might make sense.  It would need to be robust, have a long archive life and be cheap enough to compete with all the above.  I just don’t see anything here yet.

Solutions to digital audio format problems

There needs to be an XML-like description of a master recording that reduces everything to a more self-defined level which describes the hierarchy of the recording, and provides object buckets for various audio tracks/assets.  Plugins that create special effects would need to convert their effects to something akin to a MPEG-like track that could be mixed with the other tracks, surrounded by meta-data describing where it starts, ends and other important info.
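As a sketch of what such a self-describing master format might look like, here is a toy example built with Python's standard XML library. Every element and attribute name below is invented for illustration and is not part of any real standard:

```python
import xml.etree.ElementTree as ET

# A hypothetical self-describing master recording: a hierarchy of tracks,
# each pointing at an audio asset, with rendered effects carrying their
# own metadata (start/end times) alongside the track they modify.
session = ET.Element("master_recording", title="Demo Session",
                     sample_rate="96000")
track = ET.SubElement(session, "track", name="lead_vocal", channels="1")
ET.SubElement(track, "asset", href="audio/lead_vocal.wav")
effect = ET.SubElement(track, "effect", kind="reverb",
                       start="0.0", end="183.5")
effect.text = "rendered to audio/lead_vocal_reverb.wav"

xml_text = ET.tostring(session, encoding="unicode")
print(xml_text[:18])  # -> "<master_recording "
```

Any future XML parser could walk this structure and recover the track hierarchy even if the original recording software is long gone, which is the whole point.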

Barring that, some form of standardization on a master recording format would work. Such a standard could be supported by all major recording tools and would allow a master recording to be exported and imported across software tools/versions. As this format evolved, migration/conversion products could be supplied to upgrade old formats to new ones.

Another approach is to have some repository for current master audio recording formats.  As software packages go out of date/business, their recording format could be stored in some “format repository”, funded by the recording industry and maintained in perpetuity.  Plug-in use would need to be documented similarly.  With a repository like this around and “some amount” of coding, no master recording need be lost to out-of-date software formats.

Nonetheless, if your audio archive needs to be migrated periodically, that would be a convenient time to upgrade the audio format as well.


I have written about these problems before in a more general sense (see Today’s data and the 1000 year archive) but the recording industry seems to be “leading edge” for these issues. When Producer T Bone Burnett testifies at a hearing that “Digital is a feeble storage medium” it’s time to step up and take action.

Digital storage is no more feeble than analog storage – they each have their strengths and weaknesses.  Analog storage has gone away because it couldn’t keep up with digital recording densities, pricing, and increased functionality.  Just because data is recorded digitally doesn’t mean it has to be impermanent, hard to read 15-35 years hence, or in formats that are no longer supported.  But it does take some careful thought on what storage media you use and on how you format your data.


Today's data and the 1000 year archive

Untitled (picture of a keypunch machine) by Marcin Wichary (cc) (from flickr)

Somewhere in my basement I have card boxes dating back to the 1970s and paper tape canisters dating back to the 1960s, with BASIC, 360 assembly, COBOL and PL/1 programs on them. These could be reconstructed if needed, by reading the Hollerith encoding and typing them out into text files. Finding a compiler/assembler/interpreter to interpret and execute them is another matter. But just knowing the logic may suffice to translate them into another readily compilable language of today. Hollerith is a data card format which is well known and well described. But what of the data being created today? How will we be able to read such data in 50 years, let alone 500? That is the problem.

Vista de la Biblioteca Vasconcelos by Eneas (cc) (from flickr)

Civilization needs to come up with some way to keep information around for 1000 years or more. There are books relevant today (besides the Bible, Koran, and other sacred texts) whose loss 900 years ago would have altered the world as we know it. No doubt, data or information like this being created today will survive to posterity, by virtue of its recognized importance to the world. But there are a few problems with this viewpoint:

  • Not all documents/books/information are recognized as important during their lifetime of readability
  • Some important information is actively suppressed and may never be published during a regime’s lifetime
  • Even seemingly “unimportant information” may have significance to future generations

From my perspective, knowing what’s important to the future needs to be left to future generations to decide.

Formats are the problem

Consider my blog posts: WordPress creates MySQL database entries for blog posts. Imagine deciphering MySQL database entries 500 or 1000 years in the future and the problem becomes obvious. Of course, WordPress is open source, so this information could conceivably be interpreted by reading its source code.

I have written before about the forms that such long lived files can take, but for now consider that some form of digital representation of a file (magnetic, optical, paper, etc.) can be constructed that lasts a millennium. Some data forms are easier to read than others (e.g., paper), but even paper can be encoded with bar codes that would be difficult to decipher without a key to their format.

The real problem becomes file or artifact formats. Who or what in 1000 years will be able to render a JPEG file, display an old MS Word file from 1995, or read a WordPerfect file from 1985? Okay, JPEG is probably a bad example as it’s a standard format, but older Word and WordPerfect file formats constitute a lot of information today. Although there may be programs available to read them today, the likelihood that they will continue to do so in 50, let alone 500, years is pretty slim.

The problem is that as applications evolve from one version to another, formats change, and developers have a negative incentive to publicize these new file formats. Few developers today want to supply competitors with easy access to convert files to a competitive format. Hence, as developers or applications go out of business, formats cease to be readable or convertible into anything that could be deciphered 50 years hence.

Solutions to disappearing formats

What’s missing, in my view, is a file format repository. Such a repository could be maintained by an adjunct of national patent and trade offices (nPTOs). Just like today's patents, file formats, once published, could be available for all to see, in multiple databases or print outs. Corporations or other entities that create applications with new file formats would be required to register their new file formats with the local nPTO. Such a format description would be kept confidential as long as that application or its descendants continued to support the format, or for copyright time frames, whichever came first.

The form that a file format could take could be the subject of standards activities but in the mean time, anything that explains the various fields, records, and logical organization of a format, in a text file, would be a step in the right direction.

This brings up another viable solution to this problem: self defining file formats. Applications that use native XML as their file format essentially create a self defining file format, one that could potentially be understood by any XML parser. And XML, as a defined standard, is widely enough documented that it could conceivably still be available to archivists of the year 3000. So I applaud Microsoft for using XML for their latest generation of Office file formats. Others, please take up the cause.
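The point about self-defining formats can be demonstrated with any generic XML parser. The document below is a made-up example, not a real Office file, but a parser that has never seen its schema can still recover its full structure:

```python
import xml.etree.ElementTree as ET

# A generic XML parser can walk a document it has never seen before:
# the tags themselves describe the structure, no external spec needed.
doc = """<document>
  <title>Quarterly Report</title>
  <section heading="Summary">
    <paragraph>Revenue grew 12% year over year.</paragraph>
  </section>
</document>"""

root = ET.fromstring(doc)
for elem in root.iter():
    print(elem.tag)  # prints document, title, section, paragraph in turn
```

This is exactly the property an archivist in the year 3000 would want: the file carries its own structural description along with its content.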

If such repositories existed today, people in the year 3010 could still be reading my blog entries and wonder why I wrote them…