The future of libraries

Vista de la Biblioteca Vasconcelos by Eneas (cc) (from flickr)
My recent post on an exabyte-a-day generated a comment that got me thinking. What we need in the world today is a universal deduped archive. Such an archive would be a repository for all information generated by the world, nation, state, etc., and would automatically deduplicate the data and back it up.
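
To make the dedupe idea concrete, here is a minimal sketch of a content-addressed store, the kind of structure such an archive would likely be built on (my illustration, not a specific design): chunks are keyed by their SHA-256 digest, so an identical chunk is stored only once no matter how many people submit it.

```python
import hashlib

class DedupedArchive:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self):
        self.chunks = {}   # SHA-256 digest -> chunk bytes
        self.files = {}    # file name -> ordered list of chunk digests

    def store(self, name, data, chunk_size=4096):
        digests = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)   # only never-before-seen chunks consume space
            digests.append(digest)
        self.files[name] = digests

    def restore(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])
```

If two users store the same attachment, the second copy costs only a list of digests; the 50X dedupe ratio discussed below assumes exactly this kind of cross-user redundancy.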

Such an archive could be a new form of the current library, keeping data both for future generations and for a nation’s population today. Data held in the library repository would need to have:

  • Iron-clad data security via some form of data-at-rest encryption. This is tricky, since we would want to dedupe all the data from everywhere yet still keep it encrypted.
  • Enforceable digital rights management that would grant authorized users access to data while barring unauthorized users from viewing it.
  • Easy accessibility that would give home consumers access to their data in an “always on” environment, from any internet-enabled location.
  • Dependable backups that would allow users to restore their own data.
  • A time-limited protection scheme whereby, after some number of years (say 60 or 100) without access or modification, data would revert to public, unsecured access for future research.
  • Government funding, akin to today’s publicly funded libraries, which serve those consumers who take the time to use their facilities.

I see this as another outgrowth of current libraries, which serve as repositories for today’s books, magazines, media, maps, and other published artifacts. In this case, however, most data would not be published during a person’s lifetime but would become public property sometime after that person dies.

Benefits to society and the individual

Of what use could such a data repository be? Once the data becomes publicly accessible:

  • Future historians could find out what life was really like, in detail never before available: what people watched and listened to, whom they wrote to and conversed with, and what they cared about in the 21st century, all by perusing the data feeds of that generation.
  • Future scientists could mine the data for insights into a generation, network links, and personal data consumption.
  • Future governments could mine the data looking for what people thought about a nation, its economy, politics, etc., to help create better government.

But mostly, we don’t know what future researchers could do with the data. If such a repository existed today for what people were thinking and doing 60 to 100 years ago, history would be much more person-derived than media-derived. Economists would have a much more accurate picture of the Great Depression’s effect on humankind. Medicine would have a much better picture of how the pollutants and lifestyles of yesterday impact the health of today.

Also, as more and more of society’s activity involves data, the detail available on a person’s life becomes ever more pervasive. Consider medical imaging: a repository of a person’s x-rays from birth to death could prove invaluable to the medicine of tomorrow.

While the data is still protected, people:

  • Would have a secure repository to store all their data, accessible from any internet-enabled location.
  • Would have an unlimited repository for their data, not unlike Time Machine on the Mac, which they could reach back into at any time to retrieve data.
  • Would have the potential to record even more information about their daily activities.
  • Would have a way to license their data feeds to researchers for a price, much as people register for Nielsen TV ratings or Alexa web tracking today.

Costs to society

The price society would pay could be minimized by appropriate storage and systems technology. If the data created by individuals (~87PB/day, from the above-mentioned post) could be deduped by a factor of 50X, it would amount to only ~1.7PB of unique data per day worldwide. Taking a nation’s portion of world GDP as a surrogate for the data it creates, the US, with 23.6% of the world’s ’08 GDP, creates ~0.4PB of deduped individual data per day, or ~150PB of data per year.
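
Restating that arithmetic as a quick sketch (the 50X dedupe ratio and the GDP surrogate are assumptions, not measurements):

```python
world_pb_per_day = 87      # individual data created worldwide, from the exabyte-a-day post
dedup_ratio = 50           # assumed 50X dedupe
us_gdp_share = 0.236       # US share of 2008 world GDP

unique_pb_per_day = world_pb_per_day / dedup_ratio   # ~1.7 PB/day worldwide
us_pb_per_day = unique_pb_per_day * us_gdp_share     # ~0.4 PB/day for the US
us_pb_per_year = us_pb_per_day * 365                 # ~150 PB/year
```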

Of course this would be split up by state or municipality, so the load on any one jurisdiction would be considerably smaller. But storing 150PB of data today would take 75,000 2TB drives and would cost about $15.8M in drive costs (a 2TB WD drive costs $210 on Amazon) in the US. This does not account for servers, backups, power, cooling, floorspace, administration, etc., so let’s triple it to incorporate these other costs. Storing all the data created by individuals in the US in 2009 would thus cost around $47.4M with today’s technology.
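
And the drive-cost math, under the same assumptions (2TB drives at $210 each, a 3X multiplier for everything that isn’t raw drives):

```python
us_pb_per_year = 150
drives = us_pb_per_year * 1000 / 2   # 2TB drives -> 75,000 drives
drive_cost = drives * 210            # $15.75M, ~$15.8M rounded
total_cost = drive_cost * 3          # $47.25M (≈$47.4M if you triple the rounded $15.8M)
```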

Also consider that this cost is being cut in half every 18 to 24 months, but counteracting that trend is significant growth (~50% per year) in the data individuals create and store. On an 18-month halving, cost per byte falls to about 63% each year, so even 50% data growth nets out to a slight annual decline; on a 24-month halving, the total cost actually creeps up slightly. Hence, by my calculations, the cost to store all this data roughly holds steady or declines slightly each year, depending on the speed of density increase and the average individual data growth rate.
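
One way to see why the answer is so sensitive: under the stated assumptions, cost per byte after one year is 0.5^(12/h) for a halving period of h months, and the data to store grows ~1.5X, so the net annual cost factor is the product of the two:

```python
def net_annual_cost_factor(halving_months, data_growth=1.5):
    price_per_byte = 0.5 ** (12 / halving_months)   # cost per byte after one year
    return price_per_byte * data_growth

print(net_annual_cost_factor(18))   # ~0.94 -> total cost falls ~5%/year
print(net_annual_cost_factor(24))   # ~1.06 -> total cost rises ~6%/year
```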

In any event, $47.4M is not a lot to spend to keep a nation’s worth of individual data. The benefits to today’s society would be considerable and future generations would have a treasure trove of data to analyze whenever the need presented itself.

Holding this back today is the obvious cost, but also all of the data security considerations. I believe the costs are manageable, at least at the state or municipal level. As for data security, simple data-at-rest encryption is one viable solution, although how to encrypt while still providing deduplication is a serious problem to be overcome. Enforceable digital rights, time-limited protection, and the other technological features could come with time.
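
One known approach to the encrypt-vs-dedupe tension is convergent encryption: derive the key from the content itself, so identical plaintext always encrypts to identical ciphertext and still dedupes. Here is a minimal sketch using the Python cryptography package (my illustration of the general technique, not a production design; convergent encryption has known weaknesses, such as letting an attacker confirm whether a specific known file is in the archive):

```python
import hashlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def convergent_encrypt(chunk: bytes) -> tuple[bytes, bytes]:
    # The AES-256 key is the SHA-256 of the chunk itself, so two users
    # encrypting the same chunk produce byte-identical ciphertext.
    key = hashlib.sha256(chunk).digest()
    # A fixed, key-derived IV is acceptable here because each key
    # encrypts exactly one plaintext.
    iv = hashlib.sha256(key).digest()[:16]
    enc = Cipher(algorithms.AES(key), modes.CTR(iv)).encryptor()
    return key, enc.update(chunk) + enc.finalize()

def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    iv = hashlib.sha256(key).digest()[:16]
    dec = Cipher(algorithms.AES(key), modes.CTR(iv)).decryptor()
    return dec.update(ciphertext) + dec.finalize()
```

The archive can then dedupe on the ciphertext (or its hash) without ever holding a decryption key; each user retains the per-chunk keys for whatever they stored.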