Domesticating data

4111674475_76be20e180_zRead an article the other day from MIT News (Taming Data) about a new system that scans all your tabular data and provides an easy way to query all this data from one system. The researchers call the system the Data Civilizer.

What does it do

Tabular data seems to be the one constant in corporate data (that and for me PowerPoint and Word docs). Most data bases are tables of one form or another (some row and some column based). Lots of operational data is in spreadsheets (tables by another name) of some type.  And when I look over most IT/Networking/Storage management GUIs, tables (rows and columns) of data are the norm.

156788318_628fb0e4dc_oThe Data Civilizer takes all this tabular data and analyzes it all, column by column, and calculates descriptive characterization statistics for each column.

Numerical data could be characterized by range, standard deviation, median/average, cardinality etc. For textual data a list of words in the column by frequency might suffice. It also indexes every  word in the tables it analyzes.

Armed with its statistical characterization of each column, the Data Civilizer can then generate a similarity index between any two columns of data across the tables it has analyzed. In that way it can connect data in one table with data in another.

Once it has a similarity matrix and has indexed all the words in every table column it has analyzed, it can then map the tabular data, showing which columns look similar to other columns. Then any arbitrary query for data, can be executed on any table that contains similar data supplying the results of the query across the multiple tables it has analyzed.

Potential improvements

The researchers indicated that they currently don’t support every table data format. This may be a sizable task on its own.

In addition statistical characterization or classification seems old school nowadays. Most new AI is moving off statistical analysis to more neural net types of classification. Unclear if you could just feed all the tabular data to a deep learning neural net, but if the end game is to find similarities across disparate data sets, then neural nets are probably a better way to go. How you would combine this with brute force indexing of all tabular data words is another question.


In the end as I look at my company’s information, even most of my Word docs are organized in some sort of table, so cross table queries could help me a lot. Let me know when it can handle Excel and Word docs and I’ll take another look.

Photo Credit(s): Linear system table representation 2 by Ronald O’ Daniel

Glenda Sims by Glendathegood


Hadoop – part 1

Hadoop Logo (from website)
Hadoop Logo (from website)

BIGData is creating quite a storm around IT these days and at the bottom of big data is an Apache open source project called Hadoop.

In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).

What is Hadoop and why is it important

Ok, lot’s of money, time and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?

Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most BIGData analytic activities because it bundle two sets of functionality most needed to deal with large unstructured datasets. Specifically,

  • Distributed file system – Hadoop’s Distributed File System (HDFS) operates using one (or two) meta-data servers (NameNode) with any number of data server nodes (DataNode).  These are essentially software packages running on server hardware which supports file system services using a client file access API. HDFS supports a WORM file system, that splits file data up into segments (64-128MB each), distributes segments to data-nodes, stored on local (or networked) storage and keeps track of everything.  HDFS provides a global name space for its file system, uses replication to insure data availability, and provides widely distributed access to file data.
  • MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (map function) which can then be later sorted, analysed and summarized (reduce function) into some worthy output.  MapReduce uses the HDFS to access file segments and to store reduced results.  MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes which execute the map and reduce processes requested. These activities execute on the same nodes that are used for the HDFS.

Hadoop explained

It just so happens that HDFS is optimized for sequential read access and can handle files that are 100TBs or larger. By throwing more data nodes at the problem, throughput can be scaled up enormously so that a 100TB file could literally be read in minutes to hours rather than weeks.

Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of  data onto these same DataNodes, thus distributing the processing of the data, reducing time to process this data to minutes.

HDFS can also be “rack aware” and as such, will try to allocate file segment replicas to different racks where possible.  In addition, replication can be defined on a file basis and normally uses 3 replicas of each file segment.

Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides.  In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.

MapReduce programming is somewhat new, unique, and complex and was an outgrowth of Google’s MapReduce process.  As such, there have been a number of other Apache open source projects that have sprouted up namely, Cassandra, Chukya, Hbase, HiveMahout, and Pig to name just a few that provide easier ways to automatically generate MapReduce programs.  I will try to provide more information on these and other popular projects in a follow on post.

Hadoop fault tolerance

When an HDFS node fails, the NameNode detects a missing heartbeat signal and once detected, the NameNode will recreate all the missing replicas that resided on the failed DataNode.

Similarly, the MapReduce JobTracker node can detect that a TaskTracker has failed and allocate this work to another node in the cluster.  In this fashion, work will complete even in the face of node failures.

Hadoop distributions, support and who uses them

Alas, as in any open source project having a distribution that can be trusted and supported can take much of the risk out of using them and Hadoop is no exception. Probably the most popular distribution comes from Cloudera which contains all the above named projects and more and provides support.  Recently EMC announced that they will supply their own distribution and support of Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters but use other distributions (mostly Cloudera).

As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo including EBay Facebook, Google and Twitter to highlight just a few popular ones. There is a list on the Apache’s Hadoop website which provides more detail if interested.  The list indicates some of the Hadoop configurations and shows anywhere from a 18 node cluster to over 4500 nodes with multiple PBs of data storage.  Most of the big players are also active participants in the various open source projects around Hadoop and much of the code came from these organizations.


I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was.  This is my first attempt – hopefully, more to follow.


EMCWorld day 2

Day 2 saw releases for new VMAX  and VPLEX capabilities hinted at yesterday in Joe’s keynote. Namely,

VMAX announcements

VMAX now supports

  • Native FCoE with 10GbE support now VMAX supports directly FCoE, 10GbE iSCSI and SRDF
  • Enhanced Federated Live Migration supports other multi-pathing software, specifically it now adds MPIO to PowerPath and soon to come more multi-pathing solutions
  • Support for RSA’s external key management (RSA DPM) for their internal VMAX data security/encryption capability.

It was mentioned more than once that the latest Enginuity release 5875 is being adopted at almost 4x the rate of the prior generation code.  The latest release came out earlier this year and provided a number of key enhancements to VMAX capabilities not the least of which was sub-LUN migration across up to 3 storage tiers called FAST VP.

Another item of interest was that FAST VP was driving a lot of flash sales.  It seems its leading to another level of flash adoption. According to EMC they feel that almost 80-90% of customers can get by with 3% of their capacity in flash and still gain all the benefits of flash performance at significantly less cost.

VPLEX announcements

VPLEX announcements included:

  • VPLEX Geo – a new asynchronous VPLEX cluster-to-cluster communications methodology which can have the alternate active VPLEX cluster up to 50msec latency away
  • VPLEX Witness –  a virtual machine which provides adjudication between the two VPLEX clusters just in case the two clusters had some sort of communications breakdown.  Witness can run anywhere with access to both VPLEX clusters and is intended to be outside the two fault domains where the VPLEX clusters reside.
  • VPLEX new hardware – using the latest Intel microprocessors,
  • VPLEX now supports NetApp ALUA storage – the latest generation of NetApp storage.
  • VPLEX now supports thin-to-thin volume migration- previously VPLEX had to re-inflate thinly provisioned volumes but with this release there is no need to re-inflate prior to migration.


The new Geo product in conjuncton with VMware and Hyper V allows for quick migration of VMs across distances that support up to 50msec of latency.  There are some current limitations with respect to specific VMware VM migration types that can be supported but Microsoft Hyper-V Live Migration support is readily available at full 50msec latencies.  Note,  we are not talking about distance here but latency as the limiting factor to how far the VPLEX clusters can be apart.

Recall that VPLEX has three distinct use cases:

  • Infrastructure availability which proides fault tolerance for your storage and system infrastructure
  • Application and data mobility which means that applications can move from data center to data center and still access the same data/LUNs from both sites.  VPLEX maintains cache and storage coherency across the two clusters automatically.
  • Distributed data collaboration which means that data can be shared and accessed across vast distances. I have discussed this extensively in my post on Data-at-a-Distance (DaaD) post, VPLEX surfaces at EMCWorld.

Geo is the third product version for VPLEX, from VPLEX Local that supports within data center virtualization, to Vplex Metro which supports two VPLEX clusters which are up to 10msec latency away which generally is up to metropolitan wide distances apart, and Geo which moves to asynchronous cache coherence technologies. Finally coming sometime later is VPLEX Global which eliminates the restriction of two VPLEX clusters or data centers and can support 3-way or more VPLEX clusters.

Along with Geo, EMC showed some new partnerships such as with SilverPeak, Cienna and others used to reduce bandwidth requirements and cost for their Geo asynchronous solution.  Also announced and at the show were some new VPLEX partnerships with Quantum StorNext and others which addresses DaaD solutions

Other announcements today

  • Cloud tiering appliance – The new appliance is a renewed RainFinity solution which provides policy based migration to and from the cloud for unstructured data. Presumably the user identifies file aging criteria which can be used to trigger cloud migration for Atmos supported cloud storage.  Also the new appliance can support archiving file data to the Data Domain Archiver product.
  • Google enterprise search connector to VNX – Showing a Google search appliance (GSA) to index VNX stored data. Thus bringing enterprise class and scaleable search capabilities for VNX storage.

A bunch of other announcements today at EMCWorld but these seemed most important to me.


5 killer apps for $0.10/TB/year

iblioteca José Vasconcelos / Vasconcelos Library by * CliNKer * (from flickr) (cc)
iblioteca José Vasconcelos / Vasconcelos Library by * CliNKer * (from flickr) (cc)

Cloud storage keeps getting more viable and I see storage pricing going down considerably over time.  All of which got me thinking what could be done with a dime per TB per year storage ($0.10/TB/yr).  Now most cloud providers charge 10 cents or more per GB per month so this is at least 12,000 times less expensive but it’s inevitable at some point in time.

So here are my 5 killer apps for $0.10/TB/yr cloud storage:

  1. Photo record of life – something akin to glasses which would record a wide angle, high mega-pixel video record of everything I looked at, for every second of my waking life.  I think at a photo shot every second for 12hrs/day 365days/yr would be about ~16M photos and at 4 MB per photo this would be about ~64TB per person year.  For my 4 person family this would cost ~$26/year for each year of family life and for a 40 year family time span, the last payment for this would be ~$1040 or an average payment of $520/year.
  2. Audio recording of life – something akin to a always on bluetooth headset which would record an audio feed to go with the semi-video or photo record above.  By being an always on bluetooth headset it would automatically catch cell phone as well as  spoken conversations but it would need to plug to landlines as well.  As discussed in my YB by 2015 archive post, one minute of MP3 audio recording takes up roughly a MB of storage.  Lets say I converse with someone ~33% of my waking day.  So this would be about 4 hrs of MP3 audio/day 365days/yr or about 21TB per year per person.  For my family this would cost or ~$8.40/year for storage and for a 40 year family life span my last payment would be ~$336 or an average of $168/yr.
  3. Home security cameras – with ethernet based security cameras, it wouldn’t be hard to record a 360 degree outside as well as inside points of entry coverage video.  The quantities for the photo record of my life would suffice for here as well but one doesn’t need to retain the data for a whole year perhaps a rolling 30 day record would suffice but it would be recorded for 24 hours. Assuming 8 cameras outside and inside,  this could be stored in about 10TB of storage per camera, or  about 80TB of storage or $8/year but would not increase over time.
  4. No more deletes/version everything – if storage were cheap enough we would never delete data.  Normal data change activity is in the 5 to 10% per week rate, but this does not account for duplicating deleted data.  So let’s say we would need to store an additional 20% of your primary/active data per week for deleted data.  For a 1TB primary storage working set, a ~20% deletion rate per week would be 10TB of deleted data per year per person and for my family ~$4/yr and my last yearly payment would be ~$160.  If we were to factor in data growth rates of ~20%/year, this would go up substantially averaging ~$7.3k/yr over 40 years.
  5. Customized search engines – if storage AND bandwidth were cheap enough it would be nice to have my own customized search engine. Such a capability would follow all my web clicks, spawning a search spider for every website I traverse and provide customized “deep” searching for every web page I view.   Such an index might take 50% of the size of a page and on average my old website used ~18KB per page, so at 50% this index would require 9KB. Assuming, I look at ~250 web pages per business day of which maybe ~170 are unique and each unique page probably links to 2 more unique pages, which links to two more, which links to two more, … If we go 10 pages deep, then for 170 pages viewed, an average branching factor of 2,  we would need to index ~174K pages/day and for a year, this would represent about represent about 0.6TB of page index.  For my household, a customized search engine would cost  ~$0.25 of additional storage per year and for 40 years my last payment would be $10.

I struggled with coming with ideas that would cost between $10 and $500 a year as every other storage use came out significantly less than $1/year for a family of four.  This seems to say that there might be plenty of applications in the range of under a $10 per TB per year, still 1200X current cloud storage costs.

Any other applications out there that could take  advantage of a dime/TB/year?

Google vs. National Information Exchange Model

Information Exchange Package Documents (IEPI) lifecycle from
Information Exchange Package Documents (IEPI) lifecycle from

Wouldn’t the National information exchange be better served by deferring the National Information Exchange Model (NIEM) and instead implementing some sort of Google-like search of federal, state, and municipal text data records.  Most federal, state and local data resides in sophisticated databases using their information management tools but such tools all seem to support ways to create a PDF, DOC, or other text output for their information records.   Once in text form, such data could easily be indexed by Google or other search engines, and thus, searched by any term in the text record.

Now this could never completely replace NIEM, e.g., it could never offer even “close-to” real-time information sharing.  But true real-time sharing would be impossible even with NIEM.  And whereas NIEM is still under discussion today (years after its initial draft) and will no doubt require even more time to fully  implement, text based search could be available today with minimal cost and effort.

What would be missing from a text based search scheme vs. NIEM:

  • “Near” realtime sharing of information
  • Security constraints on information being shared
  • Contextual information surrounding data records,
  • Semantic information explaining data fields

Text based information sharing in operation

How would something like a Google type text search work to share government information.  As discussed above government information management tools would need to convert data records into text.  This could be a PDF, text file, DOC file, PPT, and more formats could be supported in the future.

Once text versions of data records were available, it would need to be uploaded to a (federally hosted) special website where a search engine could scan and index it.  Indexing such a repository would be no more complex than doing the same for the web today.  Even so it will take time to scan and index the data.  Until this is done, searching the data will not be available.  However, Google and others can scan web pages in seconds and often scan websites daily so the delay may be as little as minutes to days after data upload.

Securing text based search data

Search security could be accomplished in any number of ways, e.g., with different levels of websites or directories established at each security level.   Assuming one used different websites then Google or another search engine could be directed to search any security level site at your level and below for information you requested. This may take some effort to implement but even today one can restrict a Google search to a set of websites.  It’s conceivable that some script could be developed to invoke a search request based on your security level to restrict search results.

Gaining participation

Once the upload websites/repositories are up and running, getting federal, state and local government to place data into those repositories may take some persuasion.  Federal funding can be used as one means to enforce compliance.  Bootstrapping data loading into the searchable repository can help insure initial usage and once that is established hopefully, ease of access and search effectiveness, can help insure it’s continued use.

Interim path to NIEM

One loses all contextual and most semantic information when converting a database record into text format but that can’t be helped.   What one gains by doing this is an almost immediate searchable repository of information.

For example, Google can be licensed to operate on internal sites for a fair but high fee and we’re sure Microsoft is willing to do the same for Bing/Fast.  Setting up a website to do the uploads can take an hour or so by using something like WordPress and file management plugins like FileBase but other alternatives exist.

Would this support the traffic for the entire nation’s information repository, probably not.  However, it would be an quick and easy proof of concept which could go a long way to getting information exchange started. Nonetheless, I wouldn’t underestimate the speed and efficiency of WordPress as it supports a number of highly active websites/blogs.  Over time such a WordPress website could be optimized, if necessary, to support even higher performance.

As this takes off, perhaps the need for NIEM becomes less time sensitive and will allow it to take a more reasoned approach.  Also as the web and search engines start to become more semantically aware perhaps the need for NIEM becomes less so.  Even so, there may ultimately need to be something like NIEM to facilitate increased security, real-time search, database context and semantics.

In the mean time, a more primitive textual search mechanism such as described above could be up and available for download within a day or so. True, it wouldn’t provide real time search, wouldn’t provide everything NIEM could do, but it could provide viable, actionable information exchange today.

I am probably over simplifying the complexity to provide true information sharing but such a capability could go a long way to help integrate governmental information sharing needed to support national security.

5 laws of unstructured data

Richard (Dick) Nafzger with Apollo data tape by Goddard Photo and Video (cc) (from flickr)
Richard (Dick) Nafzger with Apollo data tape by Goddard Photo and Video (cc) (from flickr)

All data operates under a set of laws but unstructured data suffers from these tendencies more than most of all. Although, information technology has helped us to create and manage data easier, it hasn’t done much to minimize the problems these laws produce.

As such, I introduce here my 5 laws of unstructured data in the hopes that they may help us better understand the data we create.

Law 1: Unstructured data grows 50% per year

This has been a truism in the data center for as far back as I can remember. In the data center this is driven by business transactions, new applications and new products/services. On top of all that corporate compliance often dictate that data be retained long after it’s usefulness has passed.

Nowadays, Law 1 is also true for the home user as well. Here it’s a combination of email and media. Not only are cameras moving from 6 to 9 megapixels, home video is moving to high definition and there is just a whole lot more media being created everyday. Also, now social media seems to have doubled or tripled our outreach data creation above “normal email” alone.

Law 2: Unstructured data access frequency diminishes over time

Data created today is accessed frequently during it’s first 90 days of life and then less often after that. Reasons for this decaying access pattern vary, but human memory has to play a significant part in this.

Furthermore, business transactions encounter a life cycle from initiation, to delivery and finally, to termination. During these transitions various unstructured data are created representing the transaction state. Such data may be examined at quarter end and possibly at year end but may never see the light of day after that.

Law 3: Unsearchable data is lost data

Given Law 2’s data access decay and Law 1’s data growth, unsearchable data is by definition, inaccessible data. It’s not hard to imagine how this plays out in the data center or home.

For the data center, unstructured data mostly resides in user and application directories. I am constantly amazed that it’s easier to find data out on the web than it is to find data elsewhere in the data center. Moreover, E-discovery has become a major business segment in recent years by attempting to search unstructured corporate data.

As a Mac user my home environment is searchable for any text string. However, my photo library is another matter. Finding a specific photo from a couple of years ago is a sequential perusal of iPhoto’s library and as such, is seldom done.

Law 4: Unstructured data is copied often

Over a decade ago, a company I worked with sponsored a study to see how often data is copied. The numbers we came up with were impressive. A small but significant % of data is copied often, it’s not unusual to see 6-8 copies of such data. Some of this copying occurs when final documents are passed on, some comes from teamwork and other joint collaboration as working documents are reviewed and some is just interesting information that deserves broader dissemination. As such, data copies can represent a significant portion of any data center’s storage.

I suppose data proliferation may not be as evident in the home but our home would be an exception. Each of our Macs has a copy of all email account and have copies of the best photos. In addition, with laptops and multiple desktops, most Mac’s have copies of each (adult) user’s work environment,

Law 5: Unstructured data manual classification schemes degrade over time

In the data center, one could easily classify any file data created and maintain a database of file meta-data to facilitate access to file data. But who has the discipline or spare time to update such a database whenever they create a file or document. While this may work for “official records”, the effort involved makes it unusable for everything else.

My favorite home example of this is once again, our iPhoto library with it’s manual classification system using stars, e.g., I can assign anything from 0 to 5 stars to any photo. Used to be that after each camera import, I would assign a star rating to each new photo. Nowadays, the only time I do this is once a year and as such, it’s becoming more problematic and less useful. As we take more photographs each year this becomes much more of a burden.

Not sure these 5 laws of unstructured data are mutually exclusive and completely exhaustive but it’s a start. If anyone has any ideas on how to improve my unstructured data laws, feel free to comment below. In the mean time, as for structured data laws, …

Backup is for (E)discovery too

Electronic Discovery Reference Model (from
Electronic Discovery Reference Model (from

There has been lot’s of talk in twitterverse and elsewhere on how “backup is used for restore and archive is for e-discovery”, but I beg to differ.

If one were to take the time to review the EDRM (Electronic Discovery Reference Model) and analyze what happens during actual e-discovery processes, one would see that nothing is outside the domain of court discovery requests. Backups have and always will hold discoverable data just as online and user desktop/laptop storage do. In contrast, archives are not necessarily a primary source of discoverable data.

In my view, any data not in archive, by definition is online or on user desktop/laptop storage. Once online, data is most likely being backed up periodically and will show up in backups long before it’s moved to archive. Data deletions and other modifications can often be reconstructed from backups much better than from archive (with the possible exception of records management systems). Also, reconstructing data proliferation, such as who had a copy of what data when, is often crucial to court proceedings and normally, can only be reconstructed from backups.

Archives have a number of purposes but primarily it’s to move data that doesn’t change off company storage and out of its backup stream. Another popular reason for archive is to be used to satisfy compliance regimens that require companies to hold data for periods of time, such as mandated by SEC, HIPPA, SOX, and others. For example, SEC brokerage records must be held long after an account goes inactive, HIPPA health records must be held long after a hospital visit, SOX requires corporate records to be held long after corporate transactions transpire. Such records are more for compliance and/or customer back-history request purposes than e-discovery but here again any data stored by the corporation is discoverable.

So I believe it’s wrong to say that Backup is only for restore and archive is only for discovery. Information, anywhere within a company is discoverable. However, I would venture to say that a majority of e-discovery data comes from backups rather than elsewhere.

Now, as for using backups for restore,…

7 grand challenges for the next storage century

Clock tower (4) by TJ Morris (cc) (from flickr)
Clock tower (4) by TJ Morris (cc) (from flickr)

I saw a recent IEEE Spectrum article on engineering’s grand challenges for the next century and thought something similar should be done for data storage. So this is a start:

  • Replace magnetic storage – most predictions show that magnetic disk storage has another 25 years and magnetic tape another decade after that before they run out of steam. Such end-dates have been wrong before but it is unlikely that we will be using disk or tape 50 years from now. Some sort of solid state device seems most probable as the next evolution of storage. I doubt this will be NAND considering its write endurance and other long-term reliability issues but if such issues could be re-solved maybe it could replace magnetic storage.
  • 1000 year storage – paper can be printed today with non-acidic based ink and retain its image for over a 1000 years. Nothing in data storage today can claim much more than a 100 year longevity. The world needs data storage that lasts much longer than 100 years.
  • Zero energy storage – today SSD/NAND and rotating magnetic media consume energy constantly in order to be accessible. Ultimately, the world needs some sort of storage that only consumes energy when read or written or such storage would provide “online access with offline power consumption”.
  • Convergent fabrics running divergent protocols – whether it’s ethernet, infiniband, FC, or something new, all fabrics should be able to handle any and all storage (and datacenter) protocols. The internet has become so ubiquitous becauset it handles just about any protocol we throw at it. We need the same or something similar for datacenter fabrics.
  • Securing data – securing books or paper is relatively straightforward today, just throw them in a vault/safety deposit box. Securing data seems simple but yet is not widely used today. It doesn’t have to be that way. We need better, more long lasting tools and methodology to secure our data.
  • Public data repositories – libraries exist to provide access to the output of society in the form of books, magazines, papers and other printed artifacts. No such repository exists today for data. Society would be better served if we could store and retrieve data if there were library like institutions could store data. Most of these issues are legal due to data ownership but technological issues exist here as well.
  • Associative accessed storage – Sequential and random access have been around for over half a century now. Associative storage could complement these and be another approach allowing storage to be retrieved by its content. We can kind of do this today by keywording and indexing data. Biological memory is accessed associations or linkages to other concepts, once accessed memory seem almost sequentially accessed from there. Something comparable to biological memory may be required to build more intelligent machines.

Some of these are already being pursued and yet others receive no interest today. Nonetheless, I believe they all deserve investigation, if storage is to continue to serve its primary role to society, as a long term storehouse for society’s culture, thoughts and deeds.