5 killer apps for $0.10/TB/year

Biblioteca José Vasconcelos / Vasconcelos Library by * CliNKer * (from flickr) (cc)

Cloud storage keeps getting more viable and I see storage pricing going down considerably over time. All of which got me thinking about what could be done with storage at a dime per TB per year ($0.10/TB/yr). Most cloud providers today charge 10 cents or more per GB per month, which works out to roughly $1,200/TB/yr, so a dime per TB per year is at least 12,000 times less expensive. But it seems inevitable at some point in time.

So here are my 5 killer apps for $0.10/TB/yr cloud storage (with a back-of-the-envelope cost sketch after the list):

  1. Photo record of life – something akin to glasses which would record a wide-angle, high-megapixel photo of everything I looked at, every second of my waking life. At one photo per second for 12hrs/day, 365days/yr, this would be ~16M photos per year, and at 4MB per photo about ~64TB per person per year. For my 4 person family this would cost ~$26/year for each year of family life and, since the data accumulates, over a 40 year family time span the last payment would be ~$1040, an average payment of ~$520/year.
  2. Audio recording of life – something akin to an always-on bluetooth headset which would record an audio feed to go with the photo record above. Being always on, it would automatically catch cell phone as well as spoken conversations, but it would need to plug into landlines as well. As discussed in my YB by 2015 archive post, one minute of MP3 audio recording takes up roughly a MB of storage. Let's say I converse with someone ~33% of my waking day, about 4 hrs of MP3 audio/day for 365days/yr, or roughly 88GB (~0.09TB) per person per year. For my family this would cost under $0.04/year for storage, and over a 40 year family life span my last payment would be ~$1.40, an average of ~$0.70/yr.
  3. Home security cameras – with Ethernet-based security cameras, it wouldn't be hard to record 360 degree outside coverage as well as inside points of entry on video. The photo-record quantities above would apply here too, except recording 24 hours a day while retaining only a rolling 30 day window rather than a whole year. At one 4MB shot per second, 24hrs/day for 30 days, that's about 10TB of storage per camera. Assuming 8 cameras outside and inside, this comes to about 80TB of storage, or $8/year, and it would not increase over time.
  4. No more deletes/version everything – if storage were cheap enough we would never delete data. Normal data change activity runs at a 5 to 10% per week rate, but this does not account for retaining deleted data. So let's say we would need to store an additional 20% of our primary/active data per week for deleted data. For a 1TB primary storage working set, a ~20% deletion rate per week would be ~10TB of deleted data per year per person; for my family that's ~$4/yr, and my last yearly payment would be ~$160. If we were to factor in data growth rates of ~20%/year, this would go up substantially, averaging ~$7.3k/yr over 40 years.
  5. Customized search engines – if storage AND bandwidth were cheap enough it would be nice to have my own customized search engine. Such a capability would follow all my web clicks, spawning a search spider for every website I traverse, and provide customized “deep” searching for every web page I view. Such an index might take 50% of the size of a page, and on average my old website used ~18KB per page, so at 50% this index would require ~9KB per page. Assume I look at ~250 web pages per business day, of which maybe ~170 are unique, and each unique page links to 2 more unique pages, which link to 2 more, and so on. If we go 10 pages deep with an average branching factor of 2, then for 170 pages viewed we would need to index ~174K pages/day, which for a year would represent about 0.6TB of page index. For my household, a customized search engine would cost ~$0.25 of additional storage per year, and for 40 years my last payment would be $10.
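
To check these estimates, here's a minimal back-of-the-envelope sketch in Python. Every input (photo size, MP3 rate, branching factor, etc.) is a rough guess carried over from the list above, not a measured value, and the price is the hypothetical $0.10/TB/yr:

```python
# Back-of-the-envelope costs at $0.10/TB/yr for the 5 apps above.
# All inputs are the rough guesses from the list, not measured values.

PRICE = 0.10          # $/TB/yr (the hypothetical price)
FAMILY = 4            # people
YEARS = 40            # family time span
MB_PER_TB = 1e6       # decimal TB, close enough for estimates

# 1. Photo record: 1 shot/sec, 12 hr/day, 365 days/yr, 4 MB/photo
photo_tb = 12 * 3600 * 365 * 4 / MB_PER_TB        # ~63 TB/person/yr

# 2. Audio record: 1 MB/min of MP3, 4 hr/day, 365 days/yr
audio_tb = 4 * 60 * 365 * 1 / MB_PER_TB           # ~0.09 TB/person/yr

# 3. Security cameras: 8 cameras, 1 shot/sec, 24 hr/day, 30-day window
camera_tb = 8 * 24 * 3600 * 30 * 4 / MB_PER_TB    # ~83 TB, steady state

# 4. Never delete: 20%/week of a 1 TB working set kept as deleted data
deleted_tb = 0.20 * 1.0 * 52                      # ~10.4 TB/person/yr

# 5. Search index: 170 unique pages/day, branching 2, 10 levels deep,
#    9 KB of index per page
pages_per_day = 170 * (2**10 - 1)                 # ~174K pages/day
index_tb = pages_per_day * 9e-3 * 365 / MB_PER_TB # ~0.57 TB/person/yr

for name, tb in [("photos", photo_tb), ("audio", audio_tb),
                 ("deletes", deleted_tb), ("index", index_tb)]:
    new_cost = FAMILY * tb * PRICE     # cost of each new year of data
    print(f"{name:8s} ~{tb:6.2f} TB/person/yr  "
          f"${new_cost:6.2f}/yr new, ${new_cost * YEARS:8.2f} in year {YEARS}")

print(f"cameras  ~{camera_tb:.0f} TB rolling window, "
      f"${camera_tb * PRICE:.2f}/yr flat")
```

The year-40 figures printed here land close to the "last payment" numbers in the list (rounding aside), since each year's data accumulates at a fixed per-year storage cost while the camera footage stays flat in its rolling window.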

I struggled to come up with ideas that would cost between $10 and $500 a year, as every storage use other than the photo record came out significantly less than $1/year for a family of four. This seems to say there might be plenty of viable applications even at $1 per TB per year, still 1,200X less than current cloud storage costs.

Any other applications out there that could take advantage of a dime/TB/year?

5 laws of unstructured data

Richard (Dick) Nafzger with Apollo data tape by Goddard Photo and Video (cc) (from flickr)

All data operates under a set of laws, but unstructured data suffers from these tendencies more than most. Although information technology has made data easier to create and manage, it hasn't done much to minimize the problems these laws produce.

As such, I introduce here my 5 laws of unstructured data in the hopes that they may help us better understand the data we create.

Law 1: Unstructured data grows 50% per year

This has been a truism in the data center for as far back as I can remember. In the data center this growth is driven by business transactions, new applications and new products/services. On top of all that, corporate compliance often dictates that data be retained long after its usefulness has passed.

Nowadays, Law 1 holds for the home user as well. Here it's a combination of email and media. Not only are cameras moving from 6 to 9 megapixels and home video moving to high definition, there is just a whole lot more media being created every day. On top of that, social media seems to have doubled or tripled our outreach data creation beyond “normal email” alone.
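
To see what Law 1 implies, here's a quick sketch of how 50%/yr growth compounds; the 10TB starting point is just a hypothetical example:

```python
# Law 1 illustrated: 50%/yr growth compounds quickly.
# The 10 TB starting point is a hypothetical example.
size_tb = 10.0
for year in range(1, 11):
    size_tb *= 1.5
    print(f"year {year:2d}: {size_tb:7.1f} TB")
# After 10 years: 10 TB * 1.5**10 ~= 577 TB, almost a 58x increase.
# At 50%/yr, data doubles every log(2)/log(1.5) ~= 1.7 years.
```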

Law 2: Unstructured data access frequency diminishes over time

Data created today is accessed frequently during its first 90 days of life and then less often after that. Reasons for this decaying access pattern vary, but human memory has to play a significant part in it.

Furthermore, business transactions follow a life cycle from initiation, to delivery and, finally, to termination. During these transitions various unstructured data are created representing the transaction state. Such data may be examined at quarter end and possibly at year end, but may never see the light of day after that.

Law 3: Unsearchable data is lost data

Given Law 2's data access decay and Law 1's data growth, unsearchable data is, by definition, inaccessible data. It's not hard to imagine how this plays out in the data center or home.

For the data center, unstructured data mostly resides in user and application directories. I am constantly amazed that it's easier to find data out on the web than it is to find data elsewhere in the data center. Moreover, e-discovery has become a major business segment in recent years by attempting to search unstructured corporate data.

As a Mac user, my home environment is searchable for any text string. However, my photo library is another matter. Finding a specific photo from a couple of years ago requires a sequential perusal of iPhoto's library and, as such, is seldom done.

Law 4: Unstructured data is copied often

Over a decade ago, a company I worked with sponsored a study to see how often data is copied. The numbers we came up with were impressive. A small but significant percentage of data is copied often; it's not unusual to see 6-8 copies of such data. Some of this copying occurs when final documents are passed on, some comes from teamwork and other joint collaboration as working documents are reviewed, and some is just interesting information that deserves broader dissemination. As such, data copies can represent a significant portion of any data center's storage.

I suppose data proliferation may not be as evident in the home, but our home would be an exception. Each of our Macs has a copy of every email account and copies of the best photos. In addition, with laptops and multiple desktops, most of our Macs have copies of each (adult) user's work environment.
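
A rough sketch of how copying inflates storage; note the 20% copied fraction below is my own hypothetical stand-in for the study's "small but significant" percentage:

```python
# Law 4 illustrated: effective storage when part of the data is copied.
# The 20% copied fraction is a hypothetical assumption; the study only
# found that a small but significant share of data sees 6-8 copies.
primary_tb = 1.0         # primary data set size
copied_fraction = 0.20   # assumed share of data that gets copied
copies = 7               # midpoint of the 6-8 copies observed

total_tb = (primary_tb * (1 - copied_fraction)
            + primary_tb * copied_fraction * copies)
print(f"{total_tb:.1f} TB stored per {primary_tb:.0f} TB of primary data")
# -> 2.2 TB, i.e., copies alone more than double the footprint.
```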

Law 5: Unstructured data manual classification schemes degrade over time

In the data center, one could easily classify any file data created and maintain a database of file meta-data to facilitate access. But who has the discipline or spare time to update such a database whenever they create a file or document? While this may work for “official records”, the effort involved makes it unusable for everything else.

My favorite home example of this is, once again, our iPhoto library with its manual classification system using stars, e.g., I can assign anything from 0 to 5 stars to any photo. It used to be that after each camera import, I would assign a star rating to each new photo. Nowadays I only do this once a year, and as such, it's becoming more problematic and less useful. As we take more photographs each year, this becomes much more of a burden.

Not sure these 5 laws of unstructured data are mutually exclusive and completely exhaustive, but it's a start. If anyone has any ideas on how to improve my unstructured data laws, feel free to comment below. In the meantime, as for structured data laws, …