I don’t know about you, but a 4TB disk drive for a desktop seems like about as much as I could ever use. But when I look seriously at my desktop environment, my CAGR for storage (revealed as fully compressed TAR files) is ~61% year over year. At that rate, I will need a 4TB drive for backup purposes in about 7 years, and if I assume a 2X compression rate, then a 4TB desktop drive will be needed in ~5.5 years, since at 61% a year the data doubles roughly every 1.5 years (darn music, movies, photos, …). And we are not heavy digital media consumers; others who shoot and edit their own video probably use orders of magnitude more storage.
Hard to believe, but given current trends it is inevitable: a 4TB disk drive will become a necessity for us within the next 6 years.
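For anyone who wants to run the same projection on their own numbers, here is a minimal sketch of the compounding arithmetic. It assumes the ~61% rate holds constant, which is a simplification, and the function names are mine, purely for illustration.

```python
import math

def doubling_time_years(annual_growth: float) -> float:
    """Years for a quantity to double at a constant annual growth rate (0.61 means 61%)."""
    return math.log(2) / math.log(1 + annual_growth)

def years_to_grow(factor: float, annual_growth: float) -> float:
    """Years needed to grow by `factor` at a constant annual growth rate."""
    return math.log(factor) / math.log(1 + annual_growth)

GROWTH = 0.61  # ~61% year over year, the rate quoted in the text

print(f"doubling time at 61%: {doubling_time_years(GROWTH):.2f} years")
print(f"growth over 7 years:  {(1 + GROWTH) ** 7:.1f}x")
print(f"years to grow 10x:    {years_to_grow(10, GROWTH):.2f}")
```

At 61% a year the doubling time comes out to roughly 1.5 years, so a factor of two in size (a 2X compression ratio, say) moves a projected date by about that much.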
I was at another conference the other day where someone showed a chart that said the world will create 35ZB (10**21) of data and content in 2020 from 800EB (10**18) in 2009.
Every time I see something like this I cringe. Yes, lots of data is being created today, but what does that tell us about corporate data growth? Not much, I’d wager.
That being said, I have a couple of questions I would ask of the people who estimated this:
How much is personal data and how much is corporate data?
Did you factor in how entertainment data growth rates will change over time?
These two questions are crucial.
Entertainment dominates data growth
Just as personal entertainment is becoming the major consumer of national bandwidth (see study [requires login]), it’s clear to me that the majority of the data being created today is for personal consumption/entertainment – video, music, and image files.
I look at my own office: our corporate data (office files, PDFs, text, etc.) represents ~14% of the data we keep. Images, music, video, and audio take up the remainder of our data footprint. Is this data growing? Yes, faster than I would like, but the corporate data is only averaging ~30% YoY growth while overall data growth for our shop is averaging ~116% YoY. [As I interrupt this activity to load up another 3.3GB of photos and videos from our camera]
Moreover, although some media content is of significant external interest to select companies today (media and entertainment, social media photo/video sharing sites, mapping/satellite, healthcare, etc.), most corporations don’t deal with lots of video, music or audio data. Thus, I personally see 30% as a more realistic growth rate for corporate data than 116%.
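As a rough cross-check on how much of that 116% comes from media, here is a small sketch that backs out the growth rate of everything other than corporate data. It assumes the ~14% corporate share applies at the start of the year and that both rates are simple one-year rates; neither assumption is stated in my numbers above, and the function name is just illustrative.

```python
def implied_rest_growth(total_growth: float, slice_share: float, slice_growth: float) -> float:
    """Growth rate of everything outside one slice of the data.

    total_growth: overall annual growth (1.16 means 116%)
    slice_share:  the slice's share of total data at the start of the year (assumption)
    slice_growth: the slice's own annual growth (0.30 means 30%)
    """
    total_after = 1 + total_growth                 # total size after a year, relative to 1.0 today
    slice_after = slice_share * (1 + slice_growth) # the slice's size after a year
    rest_after = total_after - slice_after         # what is left must be the rest
    return rest_after / (1 - slice_share) - 1

media_growth = implied_rest_growth(total_growth=1.16, slice_share=0.14, slice_growth=0.30)
print(f"implied growth of non-corporate data: {media_growth:.0%}")
```

Under those assumptions the media side of our shop would be growing at about 130% a year, more than four times the corporate rate.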
Will entertainment data growth flatten?
Will we see a drop in entertainment data growth rates over time? Undoubtedly.
Two factors will reduce the growth of this data.
What happens to entertainment data recording formats? I believe media recording formats are starting to level out. I think the issue here is one of fidelity to nature, in terms of how closely a digital representation matches reality as we perceive it. For example, most digital projection systems in movie theaters today use ~2 to 8TB per feature-length motion picture, which seems to indicate that at some point further gains in fidelity (or more pixels per frame) may not be worth it. Similar limits will ultimately slow the growth of other media encoding formats.
When will all the people that can create content be doing so? Recent data indicates that more than 2B people will be on the internet this year, or ~28% of the world’s population. But at some point we must reach saturation on internet penetration, and when that happens data growth rates should also start to level out. Let’s say, for argument’s sake, that 800EB in 2009 was correct, and let’s assume there were 1.5B internet users in 2009. Then 1B internet users correlates to a data and content footprint of about 533EB, or ~0.5TB per internet user, which seems high but certainly doable.
Once these two factors level off, we should see world data and content growth rates plummet. Nonetheless, internet user population growth could be driving data growth rates for some time to come.
The scary part is that the 35ZB represents only a ~41% compound annual growth rate over the period against the baseline 2009 data and content creation levels.
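Both the per-user footprint and the ~41% figure are easy to check. The sketch below uses only the numbers quoted above (800EB in 2009, 35ZB in 2020, and my assumed 1.5B internet users in 2009); the variable and function names are illustrative.

```python
EB = 10**18  # exabyte
ZB = 10**21  # zettabyte
TB = 10**12  # terabyte

data_2009 = 800 * EB
data_2020 = 35 * ZB
users_2009 = 1.5e9  # assumed number of internet users in 2009, as in the text

def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1

per_user_tb = data_2009 / users_2009 / TB
growth = cagr(data_2009, data_2020, 2020 - 2009)

print(f"footprint per internet user in 2009: {per_user_tb:.2f} TB")
print(f"implied growth 2009 to 2020:         {growth:.0%} per year")
```

The 35ZB figure is about 44 times the 2009 baseline, which spread over 11 years of compounding works out to roughly 41% a year.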
But I must assume this estimate doesn’t consider much growth in digital creators of content, otherwise these numbers should go up substantially. In the last week, I ran across someone who said there would be 6B internet users by the end of the decade (can’t seem to recall where, but it was a TEDx video). I find that a little hard to believe but this was based on the assumption that most people will have smart phones with cellular data plans by that time. If that be the case, 35ZB seems awfully short of the mark.
A previous post blows this discussion completely away with just one application (see Yottabytes by 2015 for the NSA; a yottabyte (YB) is 10**24 bytes of data), and I had already discussed an Exabyte-a-day and 3.3 Exabytes-a-day in prior posts. [Note, those YB by 2015 are all audio (phone) recordings, but if we start using Skype Video, FaceTime and other video communications technologies, can Nonabytes (10**27) be far behind… BOOM!]
I started out thinking that 35ZB by 2020 wasn’t pertinent to corporate considerations and figured things had to flatten out, then convinced myself that it wasn’t large enough to accommodate internet user growth, and then finally recalled prior posts that put all this into even more perspective.
All data operates under a set of laws, but unstructured data suffers from these tendencies more than most. Although information technology has helped us create and manage data more easily, it hasn’t done much to minimize the problems these laws produce.
As such, I introduce here my 5 laws of unstructured data in the hopes that they may help us better understand the data we create.
Law 1: Unstructured data grows 50% per year
This has been a truism in the data center for as far back as I can remember. In the data center this is driven by business transactions, new applications and new products/services. On top of all that, corporate compliance often dictates that data be retained long after its usefulness has passed.
Nowadays, Law 1 is true for the home user as well. Here it’s a combination of email and media. Not only are cameras moving from 6 to 9 megapixels, but home video is moving to high definition and there is just a whole lot more media being created every day. Also, social media now seems to have doubled or tripled our outreach data creation above “normal email” alone.
Law 2: Unstructured data access frequency diminishes over time
Data created today is accessed frequently during its first 90 days of life and then less often after that. Reasons for this decaying access pattern vary, but human memory has to play a significant part in this.
Furthermore, business transactions encounter a life cycle from initiation, to delivery and finally, to termination. During these transitions various unstructured data are created representing the transaction state. Such data may be examined at quarter end and possibly at year end but may never see the light of day after that.
Law 3: Unsearchable data is lost data
Given Law 2’s data access decay and Law 1’s data growth, unsearchable data is, by definition, inaccessible data. It’s not hard to imagine how this plays out in the data center or home.
For the data center, unstructured data mostly resides in user and application directories. I am constantly amazed that it’s easier to find data out on the web than it is to find data elsewhere in the data center. Moreover, E-discovery has become a major business segment in recent years by attempting to search unstructured corporate data.
As a Mac user my home environment is searchable for any text string. However, my photo library is another matter. Finding a specific photo from a couple of years ago is a sequential perusal of iPhoto’s library and as such, is seldom done.
Law 4: Unstructured data is copied often
Over a decade ago, a company I worked with sponsored a study to see how often data is copied. The numbers we came up with were impressive. A small but significant percentage of data is copied often; it’s not unusual to see 6-8 copies of such data. Some of this copying occurs when final documents are passed on, some comes from teamwork and other joint collaboration as working documents are reviewed, and some is just interesting information that deserves broader dissemination. As such, data copies can represent a significant portion of any data center’s storage.
I suppose data proliferation may not be as evident in the home, but our home would be an exception. Each of our Macs has a copy of every email account and copies of the best photos. In addition, with laptops and multiple desktops, most Macs have copies of each (adult) user’s work environment.
Law 5: Unstructured data manual classification schemes degrade over time
In the data center, one could easily classify any file data created and maintain a database of file meta-data to facilitate access to file data. But who has the discipline or spare time to update such a database whenever they create a file or document? While this may work for “official records”, the effort involved makes it unusable for everything else.
My favorite home example of this is, once again, our iPhoto library with its manual classification system using stars, i.e., I can assign anything from 0 to 5 stars to any photo. It used to be that after each camera import, I would assign a star rating to each new photo. Nowadays, I only do this once a year, and as such, it’s becoming more problematic and less useful. As we take more photographs each year, this becomes much more of a burden.
I’m not sure these 5 laws of unstructured data are mutually exclusive and completely exhaustive, but it’s a start. If anyone has any ideas on how to improve my unstructured data laws, feel free to comment below. In the meantime, as for structured data laws, …
My recent post on an exabyte-a-day generated a comment that got me thinking. What we need in the world today is a universal deduped archive. Such an archive would be a repository for all information generated by the world, nation, state, etc. and would automatically deduplicate the data and back it up.
Such an archive could be a new form of the current library, keeping data for future generations and also for a nation’s population. Data held in the library repository would need to have:
Iron-clad data security via some form of data-at-rest encryption. This is a bit tricky since we would want to dedupe all the data from everywhere yet at the same time have the data be encrypted.
Enforceable digital rights management that would allow authorized users data access but restrict unauthorized users from viewing the information.
Easy accessibility that would allow home consumers access to their data in an “always on” type of environment or access from any internet enabled location.
Dependable backups that would allow user restore of data.
Time limited protection scheme that after so many years (60 or 100) of data non-access/non-modification, the data would revert to public access/non-secured access for future research.
Government funding akin to today’s libraries that are publicly funded but serve those consumers that take the time to access their library facilities.
I see this as another outgrowth of current libraries, which support a repository for today’s books, magazines, media, maps, and other published artifacts. However, in this case most data would not be published during a person’s lifetime but would become public property sometime after that person dies.
Benefits to society and the individual
Of what use could such a data repository be? Once the data becomes publicly accessible:
Future historians could find out what life was really like, in a detail never before available. Find out what people were watching/listening to, who people wrote to/conversed with, and what people cared about in the 21st century by perusing the data feeds of that generation.
Future scientists could mine the data for insights into a generation, network links, and personal data consumption.
Future governments could mine the data looking for what people thought about a nation, its economy, politics, etc., to help create better government.
But mostly, we don’t know what future researchers could do with the data. If such a repository existed today for what people were thinking and doing 60 to 100 years ago, history would be much more person derived rather than media derived. Economists would have a much more accurate picture of the Great Depression’s effect on humankind. Medicine would have a much better picture of how the pollutants and lifestyles of yesterday impact the health of today.
Also, as more and more of society’s activities involve data, the detail available on a person’s life becomes even more pervasive. Consider medical imaging: if you had a repository of a person’s x-rays from birth to death, this data could potentially be invaluable to the medicine of tomorrow.
While the data is still protected, people:
Would have a secure repository to store all their data, accessible from any internet enabled location
Would have an unlimited repository for their data storage, not unlike Time Machine on the Mac, which they could go back to at any time to retrieve data from the past.
Would have the potential to record even more information about their daily activities.
Would have a way to license their data feeds to researchers for a price sort of like registering for Nielsen TV or Alexa web tracking.
Costs to society
The price society would pay could be minimized by appropriate storage and systems technology. If in reality the data created by individuals (~87PB/day from the above-mentioned post) could be deduped by a factor of 50X, this would amount to only ~1.7PB of unique data per day worldwide. If I take a nation’s portion of world GDP as a surrogate for the data created by that nation, then the US, with 23.6% of the world’s ’08 GDP, creates ~0.4PB of individual deduped data per day, or ~150PB of data per year.
Of course this would be split up by state or by municipality, so the load on any one jurisdiction would be considerably smaller than this. But storing 150PB of data today would take 75K 2TB drives and would cost ~$15.8M in drive costs in the US (a 2TB WD drive costs $210 on Amazon). This does not account for servers, backups, power, cooling, floorspace, administration, etc., but let’s triple this to incorporate these other costs. So to store all the data created by individuals in the US in 2009 would cost around $47.4M today with today’s technology.
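The chain of estimates above is simple enough to lay out as a short script. This is only a sketch of my back-of-the-envelope math: every input is one of the rough figures from the text, and the 3X overhead multiplier in particular is a guess rather than a measured number.

```python
import math

PB = 10**15
TB = 10**12

individual_pb_per_day = 87   # worldwide data created by individuals, from the earlier post
dedup_ratio = 50             # assumed 50X deduplication
us_gdp_share = 0.236         # US share of world GDP in 2008, used as a proxy for data share
drive_tb = 2                 # capacity of one drive
drive_price = 210            # dollars per 2TB drive
overhead_multiplier = 3      # rough guess covering servers, backups, power, cooling, etc.

unique_world_pb_per_day = individual_pb_per_day / dedup_ratio
us_pb_per_day = unique_world_pb_per_day * us_gdp_share
us_pb_per_year = us_pb_per_day * 365

drives_needed = math.ceil(us_pb_per_year * PB / (drive_tb * TB))
drive_cost = drives_needed * drive_price
total_cost = drive_cost * overhead_multiplier

print(f"unique data worldwide: {unique_world_pb_per_day:.2f} PB/day")
print(f"US deduped data:       {us_pb_per_day:.2f} PB/day, {us_pb_per_year:.0f} PB/year")
print(f"drives needed:         {drives_needed:,}")
print(f"drive cost:            ${drive_cost / 1e6:.1f}M")
print(f"total with overhead:   ${total_cost / 1e6:.1f}M")
```

Run without intermediate rounding, the totals land within a fraction of a percent of the figures quoted above; the small differences come from my rounding 150PB and $15.8M along the way.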
Also consider that this cost is being cut in half every 18 to 24 months, but counteracting that trend is significant growth in data created/stored by individuals each year (~50%). Hence, by my calculations, the cost to store all this data declines slightly every year at the faster end of that range, depending on the speed of density increase and the average individual data growth rate.
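To see how the two trends net out, here is a small sketch that combines a price-halving period with an annual data growth rate. It assumes both trends are smooth and constant, which real drive prices certainly are not, and the function name is only for illustration.

```python
def yearly_cost_factor(halving_months: float, data_growth: float) -> float:
    """Ratio of next year's storage bill to this year's.

    halving_months: months for the cost per TB to fall by half
    data_growth:    annual growth in data stored (0.50 means 50%)
    """
    price_factor = 0.5 ** (12 / halving_months)  # how far cost per TB falls in one year
    return price_factor * (1 + data_growth)

for months in (18, 24):
    factor = yearly_cost_factor(months, 0.50)
    print(f"cost halving every {months} months: yearly bill changes by {factor - 1:+.1%}")
```

With 50% data growth, the bill shrinks by a few percent a year if cost per TB halves every 18 months and grows by a few percent if it only halves every 24, so the outcome really does hinge on how fast density improves.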
In any event, $47.4M is not a lot to spend to keep a nation’s worth of individual data. The benefits to today’s society would be considerable and future generations would have a treasure trove of data to analyze whenever the need presented itself.
Holding this back today is the obvious cost, but also all of the data security considerations. I believe the costs are manageable, at least at the state or municipal level. As for the data security considerations, simple data-at-rest encryption is one viable solution, although how to encrypt while still providing deduplication is a serious problem to be overcome. Enforceable digital rights, time limited protection, and the other technological features could come with time.
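The deduplication half of the problem can be sketched with plain content hashing. What follows is a toy in-memory store, not a design for the archive itself: the class and method names are hypothetical, it dedupes whole files rather than blocks, and it deliberately leaves out encryption.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}           # content hash -> stored bytes
        self._refs: dict[tuple[str, str], str] = {}  # (owner, name) -> content hash

    def put(self, owner: str, name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)         # store the bytes only if unseen
        self._refs[(owner, name)] = digest
        return digest

    def get(self, owner: str, name: str) -> bytes:
        return self._blobs[self._refs[(owner, name)]]

    def logical_bytes(self) -> int:
        """Bytes stored as the owners see them, counting every copy."""
        return sum(len(self._blobs[d]) for d in self._refs.values())

    def physical_bytes(self) -> int:
        """Bytes actually kept after deduplication."""
        return sum(len(b) for b in self._blobs.values())

store = DedupStore()
song = b"the same popular song " * 1000
store.put("alice", "song.mp3", song)
store.put("bob", "favorite.mp3", song)               # duplicate content, different owner
store.put("alice", "diary.txt", b"my private diary")

print(f"logical:  {store.logical_bytes()} bytes")
print(f"physical: {store.physical_bytes()} bytes")
```

This also shows why encryption is the tricky part: if Alice and Bob each encrypt the song with their own key, the stored bytes differ and the hashes no longer match. One approach that has been studied is deriving the encryption key from the content itself (often called convergent encryption), which restores deduplication at some cost in privacy; I have not attempted it here.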
At HPTechDay this week Jim Pownell, office of CTO, HP StorageWorks Division, reported on an IDC study that said this year the world is creating about an Exabyte of data each day. An Exabyte (XB) is 10**18 bytes or 1000 PB of data. Seems a bit high from my perspective.
Now I don’t know about you but we probably create that much data during our best week. That being said our family average over the last 3.5 years is more like 30.1MB/day. This average, over the last year, has been closer to 75.1MB/day (darn new digital camera).
If I take our 75.1 MB/day as a reasonable approximate average for our family and with 2 adults in our family, this would say each adult creates ~37.6MB of data per day.
Probably about 50% of today’s worldwide population has no access to create any data whatsoever. Of the remaining 50%, maybe 33% is at an age where data creation is insignificant. All this leaves about 2.3B people actively creating data at around 37.6MB/day. This would account for about 86.5PB of data creation a day.
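Here is that estimate as a few lines of arithmetic. It is only a sketch built on my own family’s numbers and the 2.3B active creators guessed at above, so treat the result as an order-of-magnitude figure at best.

```python
MB = 10**6
PB = 10**15

family_mb_per_day = 75.1   # our family's average over the last year
adults_in_family = 2
active_creators = 2.3e9    # rough count of people actively creating data, from the text

per_adult_mb_per_day = family_mb_per_day / adults_in_family
individual_pb_per_day = active_creators * per_adult_mb_per_day * MB / PB

print(f"per adult:   {per_adult_mb_per_day:.2f} MB/day")
print(f"individuals: {individual_pb_per_day:.1f} PB/day worldwide")
```

Without rounding the per-adult figure up to 37.6MB the total comes out a hair lower than 86.5PB/day, but the conclusion is the same: individuals account for less than a tenth of an exabyte a day.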
Naturally, I would consider myself a power data creator but
We are not doing much with video production, which creates gobs of data.
Also, my wife retains camera rights and I only take the occasional photo with my cell phone. So I wouldn’t say we are heavy into photography.
Nonetheless, 37.6MB/day on average seems exceptionally high, even for us.
Given the above, that individuals probably account for 86.5PB/day, that leaves ~913.5PB/day for the Hoover’s DB of 33M companies to create. By my calculations this would say each of these companies is generating ~27.7GB/day. No doubt there are plenty of companies out there doing this each day, but the average company generates 27.7GB a day?? I don’t think so.
OK, my count of companies could be wildly off. Perhaps the 33M companies in Hoover’s DB represent only the top 20% of companies worldwide, which means that maybe there are another 132M smaller companies out there, totaling 165M companies. Now the 913.5PB/day says the average company generates ~5.5GB/day. This still seems high to me, especially considering this is an average of all 165M companies worldwide.
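The per-company averages follow directly from the numbers above. This sketch simply divides what is left of the exabyte a day by each of my two guesses at the number of companies; the function name is illustrative.

```python
PB = 10**15
GB = 10**9

world_pb_per_day = 1000        # one exabyte a day, per the IDC figure
individual_pb_per_day = 86.5   # my estimate for individuals
corporate_pb_per_day = world_pb_per_day - individual_pb_per_day

def per_company_gb(corporate_pb: float, companies: float) -> float:
    """Average data created per company per day, in GB."""
    return corporate_pb * PB / companies / GB

hoovers_companies = 33e6                    # companies in Hoover's DB
all_companies = hoovers_companies / 0.20    # if Hoover's covers only the top 20%

print(f"left for companies:   {corporate_pb_per_day:.1f} PB/day")
print(f"with 33M companies:   {per_company_gb(corporate_pb_per_day, hoovers_companies):.1f} GB/day each")
print(f"with 165M companies:  {per_company_gb(corporate_pb_per_day, all_companies):.1f} GB/day each")
```

Either way the average lands in the gigabytes-per-day range for every company on the planet, which is what makes the exabyte-a-day number hard for me to swallow.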
Most analysts predict data creation is growing by over 100% per year, so that XB/day number for this year will be 2XB/day next year.
Of course I have been looking at a new HD video camera for my birthday…