3.3 Exabytes-a-day?!

Dans la nuit des images (Grand Palais) by dalbera (cc) (from flickr)

NetworkWorld today reported on an EMC-funded IDC study that said the world will create 1.2 Zettabytes (ZB, 10**21 bytes) of data in 2010. By my calculations this is 3.3 Exabytes-a-day (EB, 10**18 bytes), 2.3PB (10**15 bytes) a minute or 38TB (10**12 bytes) a second. This seems high, although I talked about how we could get here in last year's Exabyte-a-day post. But what interested me most was a statement that about 35% more information is created than can be stored. I'm not sure I understand this claim. (Deduplication perhaps?)
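For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes decimal (SI) units and a 365-day year; the variable names are mine, not IDC's.

```python
# Decimal (SI) byte units, as used in the post.
ZB = 10**21
EB = 10**18
PB = 10**15
TB = 10**12

# IDC figure quoted above: 1.2 ZB created in 2010 (assuming a 365-day year).
bytes_per_year = 1.2 * ZB
bytes_per_day = bytes_per_year / 365
bytes_per_minute = bytes_per_day / (24 * 60)
bytes_per_second = bytes_per_minute / 60

print(f"{bytes_per_day / EB:.1f} EB/day")        # 3.3 EB/day
print(f"{bytes_per_minute / PB:.1f} PB/minute")  # 2.3 PB/minute
print(f"{bytes_per_second / TB:.0f} TB/second")  # 38 TB/second
```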

Aside from deduplication, what this must mean is that data is being created, sent across the Internet and never stored anywhere except while in flight, only to be discarded soon after. I assume this data is associated with something like VoIP phone calls and video chats/conferences, only some portion of which is ever recorded and stored. (Although that will soon no longer be true for audio, see my Yottabytes by 2015 post).

But 35% would indicate that roughly 1 out of every 3 bytes of data is discarded shortly after creation. IDC also expects this factor to grow, not shrink, and “… to over 60% over the next few years.” So 3 out of 5 bytes of data would be available only in real time, to be discarded thereafter.
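One caution on the numbers: "35% more created than can be stored" can be read two ways, and the two readings give different discard ratios. The sketch below works through both. It is only my arithmetic, not anything from the IDC study, and the function name is made up for illustration.

```python
def unstored_share_if_excess(excess):
    """Share of created data left unstored if creation exceeds storage by `excess`.

    Example: created = 1.35 * stored  ->  unstored / created = 0.35 / 1.35.
    """
    return excess / (1 + excess)


for pct in (0.35, 0.60):
    # Reading A: the percentage is the excess of created over stored data.
    share_a = unstored_share_if_excess(pct)
    # Reading B: the percentage is directly the share of created data not stored.
    share_b = pct
    print(f"{pct:.0%}: reading A -> {share_a:.1%} unstored, reading B -> {share_b:.1%} unstored")
```

Under reading B, 35% is roughly 1 byte in 3 and 60% is 3 bytes in 5, which is how I am treating the figures here. Under reading A the discarded share would be noticeably smaller.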

Why this portion should be growing more rapidly than the data being stored is hard to fathom. Again, video and voice over the Internet must be a significant part of the reason.

Storing voice data

I don’t know about most people but I record only a few of my more important calls. Also, these calls happen to be longer on average than my normal calls. Does this mean that 35% of my call data volume is not stored? Maybe. All my business calls are done via the Internet nowadays, so this data is being created and shipped across the net, used while the call is occurring but never stored other than in flight or by call participants. So non-recorded calls easily qualify as data created but not stored. Even so, while I may listen to ~33% of the recorded calls afterwards, I ultimately overwrite all of them, keeping only the ones that fit on the recorder’s flash device. Hence, in the end even the voice data I do keep is only retained until I need storage to record more.

I'm not sure how this is treated in the IDC study but it seems to me to be yet another class of data, maybe call it transient data. I can see similar transient data in company backups, log files, database dumps, etc. Most of this data is stored for a limited time, only to be erased or recorded over in the end. How IDC classified such data I cannot tell.

But will transient data grow?

As for video, I currently do no video conferencing so have no information on this. But I am considering moving to another communication platform that supplies video chats and will make it less intrusive to record calls. While demoing this new capability I have rapidly consumed over 200MB of storage for call recordings. (I need to cap this some way before it gets out of hand). In any case, I believe recording convenience should make such data more store-able over time, not less.

So while I may agree that 1 out of 3 bytes of data created today is not stored, I definitely don’t think that over time that ratio will grow and certainly not to 60%.  My only caveat is that there is a limit to the amount of data the world can readily store at any one time and this will ultimately drive all of us to delete data we would rather keep.

But maybe all this just points to a more interesting question: how much data does the world create that is kept for a year, a decade, or a century? That will need to await another post…