3.3 Exabytes-a-day?!

Dans la nuit des images (Grand Palais) by dalbera (cc) (from flickr)

NetworkWorld today reported on an EMC-funded IDC study that said the world will create 1.2 Zettabytes (ZB, 10**21 bytes) of data in 2010. By my calculations this is 3.3 Exabytes a day (XB, 10**18 bytes), 2.3PB (10**15 bytes) a minute or 38TB (10**12 bytes) a second.  This seems high, and I talked about how we could get here in my Exabyte-a-day post last year.  But what interested me most was a statement that about 35% more information is created than can be stored.  I am not sure I understand this claim. (Deduplication, perhaps?)
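The conversion above is easy to check. Here is a minimal sketch, assuming a 365-day year and decimal (power-of-ten) units; the function name `rates` is only for illustration, and `XB` follows this post's abbreviation for exabyte:

```python
# Decimal storage units, in bytes. XB is this post's shorthand for exabyte.
ZB, XB, PB, TB = 10**21, 10**18, 10**15, 10**12

def rates(total_bytes, days=365):
    """Break an annual byte total into per-day, per-minute and per-second rates."""
    per_day = total_bytes / days          # assumes a 365-day year
    per_minute = per_day / (24 * 60)
    per_second = per_minute / 60
    return per_day, per_minute, per_second

per_day, per_minute, per_second = rates(1.2 * ZB)
print(f"{per_day / XB:.1f} XB/day")
print(f"{per_minute / PB:.1f} PB/minute")
print(f"{per_second / TB:.0f} TB/second")
```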

Aside from deduplication, what this must mean is that data is being created, sent across the Internet and never stored anywhere except while in flight, then discarded soon after.  I assume this data is associated with something like VoIP phone calls and video chats/conferences, only some portion of which is ever recorded and stored.  (Although that will soon no longer be true for audio; see my Yottabytes by 2015 post.)

But 35% would indicate that ~1 out of every 3 bytes of data is discarded shortly after creation.  IDC also expects this factor to grow, not shrink, “… to over 60% over the next few years.”  So 3 out of 5 bytes of data will only be available in real time, to be discarded thereafter.
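The phrase "35% more information is created than can be stored" can be read two ways, and the discarded share differs between them. The sketch below compares the two readings; which one IDC intended is not clear from the report as quoted here, so both are assumptions, and the function names are hypothetical:

```python
def share_if_fraction_of_created(pct):
    """Reading A: pct percent of all created data is never stored."""
    return pct / 100

def share_if_excess_over_stored(pct):
    """Reading B: created data exceeds storable data by pct percent,
    so the unstored share of what is created is pct / (100 + pct)."""
    return (pct / 100) / (1 + pct / 100)

for pct in (35, 60):
    a = share_if_fraction_of_created(pct)
    b = share_if_excess_over_stored(pct)
    print(f"{pct}%: reading A discards {a:.3f} of created data, reading B {b:.3f}")
```

Reading A matches the 1-in-3 and 3-in-5 figures used in this post. Under reading B the discarded share today would be closer to 1 in 4.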

Why this portion should be growing more rapidly than data being stored is hard to fathom. Again, video and voice over the Internet must be a significant part of the reason.

Storing voice data

I don’t know about most people, but I record only a few of my more important calls.  Also, these calls happen to be longer on average than my normal calls.  Does this mean that 35% of my call data volume is not stored?  Maybe.  All my business calls are done via the Internet nowadays, so this data is being created and shipped across the net, used while the call is occurring, but never stored other than in flight or by call participants.  So non-recorded calls easily qualify as data created but not stored.  Even so, while I may listen to maybe ~33% of the recorded calls afterwards, I ultimately overwrite all of them, keeping only the ones that fit on the recorder’s flash device.  Hence, in the end, even the voice data I do keep is only retained until I need storage to record more.

I am not sure how this is treated in the IDC study, but it seems to me to be yet another class of data; maybe call it transient data.  I can see similar transient data in company backups, log files, database dumps, etc.  Most of this data is stored for a limited time, only to be erased or recorded over in the end.  How IDC classified such data I cannot tell.

But will transient data grow?

As for video, I currently do no video conferencing, so I have no information on this.  But I am considering moving to another communication platform that supplies video chats and will make it less intrusive to record calls.  While demoing this new capability I have rapidly consumed over 200MB of storage for call recordings.  (I need to cap this in some way before it gets out of hand.)  In any case, I believe recording convenience should make such data more storable over time, not less.

So while I may agree that 1 out of 3 bytes of data created today is not stored, I definitely don’t think that over time that ratio will grow and certainly not to 60%.  My only caveat is that there is a limit to the amount of data the world can readily store at any one time and this will ultimately drive all of us to delete data we would rather keep.

But maybe all this just points to a more interesting question: how much data does the world create that is kept for a year, a decade, or a century?  But that will need to await another post…

An Exabyte-a-day

snp microarray data by mararie (cc) (from flickr)

At HPTechDay this week Jim Pownell, office of CTO, HP StorageWorks Division, reported on an IDC study that said this year the world is creating about an Exabyte of data each day.  An Exabyte (XB) is 10**18 bytes or 1000 PB of data.  Seems a bit high from my perspective.

Data creation by individuals

Population Growth and Income Level Chart by mattlemmon (cc) (from flickr)

The US Census Bureau estimates today’s worldwide population at around 6.8 billion people. Given that estimate, the XB/day number says that the average person is creating about 150MB/day.

Now I don’t know about you but we probably create that much data during our best week. That being said our family average over the last 3.5 years is more like 30.1MB/day. This average, over the last year, has been closer to 75.1MB/day (darn new digital camera).

If I take our 75.1 MB/day as a reasonable approximate average for our family and with 2 adults in our family, this would say each adult creates ~37.6MB of data per day.

Probably about 50% of today’s worldwide population has no means of creating any data whatsoever. Of the remaining 50%, maybe 33% is at an age where data creation is insignificant. All this leaves about 2.3B people actively creating data at around 37.6MB/day. This would account for about 86.5PB of data creation a day.
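The back-of-the-envelope numbers for individuals can be reproduced as follows. This is only a sketch of the estimate above; the 50% and 33% shares are this post's guesses rather than measured values, and the variable names are made up for illustration:

```python
MB, PB, XB = 10**6, 10**15, 10**18

world_population = 6.8e9            # US Census Bureau estimate cited above
per_person_mb = 1 * XB / world_population / MB   # if everyone shared 1 XB/day

family_mb_per_day = 75.1            # our family's average over the last year
per_adult_mb = family_mb_per_day / 2

share_with_access = 0.50            # guess: half the world can create data
share_too_young = 0.33              # guess: a third of those create very little
active_creators = world_population * share_with_access * (1 - share_too_young)

individual_pb_per_day = active_creators * per_adult_mb * MB / PB

print(f"{per_person_mb:.0f} MB/day per person at 1 XB/day")
print(f"{active_creators / 1e9:.2f}B active creators")
print(f"{individual_pb_per_day:.1f} PB/day from individuals")
```

Without intermediate rounding this comes to roughly 85.5PB/day; the 86.5PB figure in the text comes from rounding to 2.3B people and 37.6MB/day first. The difference does not change the argument.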

Naturally, I would consider myself a power data creator but

  • We are not doing much with video production, which creates gobs of data.
  • Also, my wife retains camera rights and I only take the occasional photo with my cell phone. So I wouldn’t say we are heavy into photography.

Nonetheless, 37.6MB/day on average seems exceptionally high, even for us.

Data creation by companies

However, that XB a day also accounts for corporate data generation as well as individuals. Hoover’s, a US corporate database, lists about 33M companies worldwide. These are probably the biggest 33M and are no doubt creating lots of data each day.

Given the above estimate that individuals probably account for 86.5PB/day, that leaves about 913.5PB/day for the Hoover’s DB of 33M companies to create. By my calculations this would say each of these companies is generating about 27.7GB/day. No doubt there are plenty of companies out there doing this each day, but the average company generates 27.7GB a day?? I don’t think so.

OK, my count of companies could be wildly off. Perhaps the 33M companies in Hoover’s DB represent only the top 20% of companies worldwide, which means that maybe there are another 132M smaller companies out there, for a total of 165M companies. Now the 913.5PB/day says the average company generates ~5.5GB/day. This still seems high to me, especially considering this is an average of all 165M companies worldwide.
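The per-company figures follow from dividing what is left of the exabyte among the companies. A short sketch, taking the 86.5PB/day individual estimate as given and treating the "top 20%" share as the pure guess that it is:

```python
PB, GB = 10**15, 10**9

total_pb_per_day = 1000.0           # one XB a day, expressed in PB
individual_pb_per_day = 86.5        # this post's estimate for individuals
corporate_pb_per_day = total_pb_per_day - individual_pb_per_day

def gb_per_company(company_count):
    """Average daily data per company if companies create all corporate data."""
    return corporate_pb_per_day * PB / company_count / GB

hoovers_companies = 33e6                     # companies listed in Hoover's DB
all_companies = hoovers_companies / 0.20     # guess: Hoover's covers the top 20%

print(f"{corporate_pb_per_day:.1f} PB/day left for companies")
print(f"{gb_per_company(hoovers_companies):.1f} GB/day across 33M companies")
print(f"{gb_per_company(all_companies):.1f} GB/day across 165M companies")
```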

Most analysts predict data creation is growing by over 100% per year, so that XB/day number for this year will be 2XB/day next year.

Of course I have been looking at a new HD video camera for my birthday…

Sony_HDR-TG5V_Vanity350