The future of libraries

Vista de la Biblioteca Vasconcelos by Eneas (cc) (from flickr)
Vista de la Biblioteca Vasconcelos by Eneas (cc) (from flickr)
My recent post on an exabyte-a-day generated a comment that got me thinking. What we need in the world today is a universal deduped archive. Such an archive would be a repository for all information generated by the world, nation, state, etc. and would automatically deduplicate the data and back it up.

Such an archive could be a new form of the current library. Keeping data for future generations and also for a nation’s population. Data held in the library repository would need to have:

  • Iron-clad data security via some form of data-at-rest encryption. This is a bit tricky since we would want to dedupe all the data from everywhere yet at the same time have the data be encrypted.
  • Enforceable digital rights management that would allow authorized users data access but unauthorized users would be restricted from viewing the information
  • Easy accessibility that would allow home consumers access to their data in an “always on” type of environment or access from any internet enabled location.
  • Dependable backups that would allow user restore of data.
  • Time limited protection scheme that after so many years (60 or 100) of data non-access/non-modification, the data would revert to public access/non-secured access for future research.
  • Government funding akin to today’s libraries that are publicly funded but serve those consumers that take the time to access their library facilities.

I see this as another outgrowth of current libraries which supports a repository for todays books, magazines, media, maps, and other published artifacts. However, in this case most data would not be published during a person’s lifetime but would become public property sometime after that person dies.

Benefits to society and the individual

Of what use could such a data repository be? Once the data becomes publicly accessible:

  • Future historians could find out what life was really like, in a detail never before available. Find out what people were watching/listening to, who people wrote to/conversed with, and what people cared about in the 21st century by perusing the data feeds of that generation.
  • Future scientists could mine the data for insights into a generation, network links, and personal data consumption.
  • Future governments could mine the data looking for what people thought about a nation, its economy, politics, etc., to help create better government.

But mostly, we don’t know what future researchers could do with the data. If such a repository existed today for what people were thinking and doing 60 to 100 years ago, history would be much more person derived rather than media derived. Economists would have a much more accurate picture of the great depression’s affect on humankind. Medicine would have a much better picture of how the pollutants and lifestyles of yesterday impact the health of today.

Also, as more and more of society’s activity involve data, the detail available on a person’s life becomes even more pervasive. Consider medical imaging, if you had a repository for a person’s x-rays from birth to death, this data could potentially be invaluable to the medicine of tomorrow.

While the data is still protected people

  • Would have a secure repository to store all their data, accessible from any internet enabled location
  • Would have an unlimited repository for their data storage not unlike timemachine on the Mac which they could go back to at anytime in the past to retrieve data.
  • Would have the potential to record even more information about their daily activities.
  • Would have a way to license their data feeds to researchers for a price sort of like registering for Nielsen TV or Alexa web tracking.

Costs to society

The price society would pay could be minimized by appropriate storage and systems technology. If in reality the data created by individuals (~87PB/day from the above mentioned post) could be deduped by a factor of 50X, this would account for only 1.7PB of unique data per day worldwide. If I take a nation’s portion of world GDP as a surrogate for data created by a nation, then for the US with 23.6% of the world’s ’08 GDP, creates ~0.4PB of individual deduped data per day or ~150PB of data per year.

Of course this would be split up by state or by municipality so the load on any one juristiction would be considerably smaller than this. But storing 150PB of data today would take 75K-2TB drives and would cost about ~$15.8M in drive costs (2TB WD drive costs $210 on Amazon) in the US. This does not account for servers, backups, power, cooling, floorspace, administration, etc but let’s triple this to incorporate these other costs. So to store all the data created by individuals in the US in 2009 would cost around $47.4M today with today’s technology.

Also consider that this cost is being cut in half every 18 to 24 months but counteracting that trend is a significant growth in data created/stored by individuals each year (~50%). Hence, by my calculations, the cost to store all this data is declining slightly every year depending on the speed of density increase and average individual data growth rate.

In any event, $47.4M is not a lot to spend to keep a nation’s worth of individual data. The benefits to today’s society would be considerable and future generations would have a treasure trove of data to analyze whenever the need presented itself.

Holding this back today is the obvious cost but also all of the data security considerations. I believe the costs are manageable, at least at the state or municipal level. As for the data security considerations, simple data-at-rest encryption is one viable solution. Although how to encrypt while still providing deduplication is a serious problem to be overcome. Enforceable digital rights, time limited protection, and the other technological features could come with time.

An Exabyte-a-day

snp microarray data by mararie (cc) (from flickr)
snp microarray data by mararie (cc) (from flickr)

At HPTechDay this week Jim Pownell, office of CTO, HP StorageWorks Division, reported on an IDC study that said this year the world is creating about an Exabyte of data each day.  An Exabyte (XB) is 10**18 bytes or 1000 PB of data.  Seems a bit high from my perspective.

Data creation by individuals

Population Growth and Income Level Chart by mattlemmon (cc) (from flickr)
Population Growth and Income Level Chart by mattlemmon (cc) (from flickr)

The US Census bureau estimates todays worldwide population at around 6.8 Billion people. Given that estimate, the XB/day number says that the average person is creating about 150MB/day.

Now I don’t know about you but we probably create that much data during our best week. That being said our family average over the last 3.5 years is more like 30.1MB/day. This average, over the last year, has been closer to 75.1MB/day (darn new digital camera).

If I take our 75.1 MB/day as a reasonable approximate average for our family and with 2 adults in our family, this would say each adult creates ~37.6MB of data per day.

Probably about 50% of todays world wide population probably has no access to create any data whatsoever. Of the remaining 50%, maybe 33% is at an age where data creation is insignificant. All this leaves about 2.3B people actively creating data at around 37.6MB/day. This would account for about 86.5PB of data creation a day.

Naturally, I would consider myself a power data creator but

  • We are not doing much with video production which takes creates gobs of data.
  • Also, my wife retains camera rights and I only take the occasional photo with my cell phone. So I wouldn’t say we are heavy into photography.

Nonetheless, 37.6MB/day on average seems exceptionally high, even for us.

Data creation by companies

However, that XB a day also accounts for corporate data generation as well as individuals. Hoovers, a US corporate database lists about 33M companies worldwide. These are probably the biggest 33M and no doubt creating lot’s of data each day.

Given the above that individuals probably account for 86.5PB/day, that leaves about ~913.5PB/day for the Hoover’s DB of 33M companies to create. By my calculations this would say each of these companies is generating about ~27.6GB/day. No doubt there are plenty of companies out there doing this each day but the average company generates 27.6GB a day?? I don’t think so.

Ok, my count of companies could be wildly off. Perhaps the 33M companies in Hoover’s DB represent only the top 20% of companies worldwide, which means that maybe there are another 132M smaller companies out there totaling 165M companies. Now the 913.5PB/day says the average company generates ~5.5GB/day. This still seems high to me, especially considering this is an average of all 165M companies world wide.

Most analysts predict data creation is growing by over 100% per year, so that XB/day number for this year will be 2XB/day next year.

Of course I have been looking at a new HD video camera for my birthday…

Sony_HDR-TG5V_Vanity350
Sony_HDR-TG5V_Vanity350

Chart of the month: SPC-1 LRT performance results

Chart of the Month: SPC-1 LRT(tm) performance resultsThe above chart shows the top 12 LRT(tm) (least response time) results for Storage Performance Council’s SPC-1 benchmark. The vertical axis is the LRT in milliseconds (msec.) for the top benchmark runs. As can be seen the two subsystems from TMS (RamSan400 and RamSan320) dominate this category with LRTs significantly less than 2.5msec. IBM DS8300 and it’s turbo cousin come in next followed by a slew of others.

The 1msec. barrier

Aside from the blistering LRT from the TMS systems one significant item in the chart above is that the two IBM DS8300 systems crack the <1msec. barrier using rotating media. Didn’t think I would ever see the day, of course this happened 3 or more years ago. Still it’s kind of interesting that there haven’t been more vendors with subsystems that can achieve this.

LRT is probably most useful for high cache hit workloads. For these workloads the data comes directly out of cache and the only thing between a server and it’s data is subsystem IO overhead, measured here as LRT.

Encryption cheap and fast?

The other interesting tidbit from the chart is that the DS5300 with full drive encryption (FDE), (drives which I believe come from Seagate) cracks into the top 12 at 1.8msec exactly equivalent with the IBM DS5300 without FDE. Now FDE from Seagate is a hardware drive encryption capability and might not be measurable at a subsystem level. Nonetheless, it shows that having data security need not reduce performance.

What is not shown in the above chart is that adding FDE to the base subsystem only cost an additional US$10K (base DS5300 listed at US$722K and FDE version at US$732K). Seems like a small price to pay for data security which in this case is simply turn it on, generate keys, and forget it.

FDE is a hard drive feature where the drive itself encrypts all data written and decrypts all data read to from a drive and requires a subsystem supplied drive key at power on/reset. In this way the data is never in plaintext on the drive itself. If the drive were taken out of the subsystem and attached to a drive tester all one would see is ciphertext. Similar capabilities have been available in enterprise and SMB tape drives is the past but to my knowledge the IBM DS5300 FDE is the first disk storage benchmark with drive encryption.

I believe the key manager for the DS5300 FDE is integrated within the subsystem. Most shops would need a separate, standalone key manager for more extensive data security. I believe the DS5300 can also interface with an standalone (IBM) key manager. In any event, it’s still an easy and simple step towards increased data security for a data center.

The full report on the latest SPC results will be up on my website later this week but if you want to get this information earlier and receive your own copy of our newsletter – email me at SubscribeNews@SilvertonConsulting.com?Subject=Subscribe_to_Newsletter.