There’s been an ongoing debate in the analyst community about the advantages of software-only innovation vs. hardware-software innovation (see the Commodity hardware loses again and Commodity hardware always loses posts). Here is another example, where two separate companies have turned to hardware innovation to take storage to the next level.
These two arrays seem to be going after opposite ends of the storage market: the 5U EMC DSSD D5 targets both structured and unstructured data that needs ultra-high-speed IO access times (<100µsec), while the 4U Pure Storage FlashBlade targets more general-purpose unstructured data. And yet the two have many similarities, at least superficially.
In part 1 we discussed some of Hadoop’s core characteristics with respect to the Hadoop distributed file system (HDFS) and the MapReduce analytics engine. In part 2, as promised, we discuss some of the other projects that have emerged to make Hadoop, and specifically MapReduce, even easier to use for analyzing unstructured data.
Specifically, we have a set of tools which use Hadoop to construct a database-like structure out of unstructured data, namely:
Cassandra – which maps HDFS data into a database, but into a column-based sparse table structure rather than the more traditional relational database row form. Cassandra was written by Facebook for Inbox Search. Columnar databases support sparse data much more efficiently. Data access is via a Thrift-based API supporting many languages. Cassandra’s data model is based on columns, column families and column super-families. The datum for any column item is a three-value structure consisting of a name, the value of the item and a timestamp. One nice thing about Cassandra is that one can tune it for any consistency model one requires, from no consistency to always consistent and points in between. Cassandra is also optimized for writes, and can be used as the Map portion of a MapReduce run.
HBase – which also maps HDFS data into a database-like structure and provides Java API access to this DB. HBase is useful for million-row tables with arbitrary column counts. Apparently HBase is an outgrowth of Google’s Bigtable, which did much the same thing only against the Google file system (GFS). In contrast to Hive below, HBase doesn’t run on top of MapReduce; rather, it replaces MapReduce, although it can be used as a source or target of MapReduce operations. Also, HBase is somewhat tuned for random-access read operations and, as such, can be used to support some transaction-oriented applications. Moreover, HBase can run on HDFS or Amazon S3 infrastructure.
Hive – which maps a “simple SQL” (called QL) on top of a data warehouse built on Hadoop. Some of these queries may take a long time to execute, and as the HDFS data is unstructured, the map function must use a database-like schema to extract the data into something approximating a relational database. Hive operates on top of Hadoop’s MapReduce function.
Hypertable – is an open source project which is a C++ implementation modeled on Google’s BigTable, only using HDFS rather than GFS. Actually, Hypertable can use any distributed file system. It is another columnar database (like Cassandra above) but only supports columns and column families. Hypertable supports both a client (C++) API and a Thrift API. Because it is written in C++, Hypertable is considered the most optimized of the Hadoop-oriented databases (although there is some debate here).
Pig – is a dataflow processing (scripting) language built on top of Hadoop which supports a sort of database interpreter for HDFS in combination with an interpretive analysis. Essentially, Pig takes a script written in this language and emits a dataflow graph, which is then used by MapReduce to analyze the data in HDFS. Pig supports both batch and interactive execution and can also be used through a Java API.
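To make the column model described for Cassandra above a little more concrete, here is a minimal Python sketch of a sparse, column-oriented table in which every datum is a (name, value, timestamp) triple. The class names `Column` and `ColumnFamily`, the `insert`/`get` methods and the sample data are all hypothetical and chosen for illustration; this is not Cassandra’s Thrift API, just one way to picture the structure.

```python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Column:
    """One datum: a name, a value and a timestamp."""
    name: str
    value: str
    timestamp: float


class ColumnFamily:
    """Hypothetical in-memory stand-in for a column family.

    Rows are sparse: each row key maps only to the columns actually
    stored for it, so two rows need not share any column names.
    """

    def __init__(self, name):
        self.name = name
        self.rows = {}  # row key -> {column name -> Column}

    def insert(self, row_key, column_name, value, timestamp=None):
        # Stamp the datum with the current time unless one is supplied.
        ts = time.time() if timestamp is None else timestamp
        row = self.rows.setdefault(row_key, {})
        row[column_name] = Column(column_name, value, ts)

    def get(self, row_key, column_name):
        # A column that was never stored for this row is simply absent.
        return self.rows.get(row_key, {}).get(column_name)


users = ColumnFamily("users")
users.insert("alice", "email", "alice@example.com", timestamp=1.0)
users.insert("alice", "city", "Denver", timestamp=2.0)
users.insert("bob", "phone", "555-0100", timestamp=3.0)

print(users.get("alice", "city"))
print(users.get("bob", "city"))
```

Note that the row for "bob" holds a single column and takes no space for the columns it lacks, which is the sparseness advantage over a fixed relational row.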
Hadoop also supports special-purpose tools used for very specialized analysis, such as:
Mahout – an Apache open source project which applies machine learning algorithms to HDFS data, providing classification, characterization, and other feature extraction. However, Mahout works on non-Hadoop clusters as well. Mahout supports 4 techniques: recommendation mining, clustering, classification, and frequent itemset mining. While Mahout uses the MapReduce framework of Hadoop, it does not appear that Mahout uses Hadoop MapReduce directly; rather, it seems to be a replacement for MapReduce focused on machine learning activities.
Hama – an Apache open source project which is used to perform parallel matrix and graph computations against Hadoop cluster data. The focus here is on scientific computation. Hama also supports non-Hadoop frameworks including BSP and Dryad (DryadLINQ?). Hama operates on top of MapReduce and can take advantage of HBase data structures.
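As a rough illustration of how a matrix computation can be phrased as map and reduce steps, here is a small Python sketch of a sparse matrix-vector product. It runs in a single process and is not Hama’s API; the function names `map_entry`, `reduce_row` and `mat_vec`, and the (row, col, value) storage format, are assumptions made for the example.

```python
from collections import defaultdict


def map_entry(row, col, value, vector):
    """Map step: turn one matrix entry into a (row, partial product) pair."""
    return (row, value * vector[col])


def reduce_row(partials):
    """Reduce step: sum the partial products that share a row index."""
    return sum(partials)


def mat_vec(entries, vector):
    """Multiply a sparse matrix, given as (row, col, value) triples, by a vector."""
    grouped = defaultdict(list)  # stands in for the shuffle: group by row key
    for row, col, value in entries:
        key, partial = map_entry(row, col, value, vector)
        grouped[key].append(partial)
    return {row: reduce_row(parts) for row, parts in grouped.items()}


# The 2x2 matrix [[1, 2], [0, 3]]; the zero entry is simply omitted.
entries = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
result = mat_vec(entries, [10, 20])
print(result)
```

Each map call needs only one matrix entry plus the vector, so on a real cluster the entries could be spread across nodes and only the per-row partial sums would need to be brought together.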
There are other tools that have sprung up around Hadoop to make it easier to configure, test and use, namely:
Chukwa – which is used for monitoring large distributed clusters of servers.
ZooKeeper – which is a cluster configuration tool and distributed synchronization service useful for building large clusters of Hadoop nodes.
MRUnit – which is used to unit test MapReduce programs without having to run them on the whole cluster.
Whirr – which extends HDFS to use cloud storage services. It’s unclear how well this would work with PBs of data to be processed, but maybe it can colocate the data and the compute activities in the same cloud data center.
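The idea behind unit testing a MapReduce program without a cluster can be sketched in a few lines of Python. MRUnit itself is a Java library and its API is not shown here; the `run_local` driver and the word-count functions below are hypothetical stand-ins that only illustrate the approach of exercising the map and reduce logic in-process on a tiny input.

```python
from collections import defaultdict


def word_count_map(line):
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def word_count_reduce(word, counts):
    """Reduce step: total the counts collected for one word."""
    return (word, sum(counts))


def run_local(mapper, reducer, records):
    """Hypothetical in-process driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


def test_word_count():
    lines = ["Hadoop maps", "hadoop reduces"]
    expected = {"hadoop": 2, "maps": 1, "reduces": 1}
    assert run_local(word_count_map, word_count_reduce, lines) == expected


test_word_count()
```

Because the mapper and reducer are plain functions, a logic error shows up in seconds on a laptop rather than after a long job on the full cluster.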
As for who uses these tools, Facebook uses Hive and Cassandra, Yahoo uses Pig, Baidu uses Hypertable and there are myriad users of the other projects as well. In most cases the company identified in the previous list originally developed the program source code and then contributed it to Apache for use in the Hadoop open source project. In addition, those companies continue to fix, support and enhance these packages.
Ran across a web posting yesterday providing information on a University of Illinois summer program in Data Science. I had never encountered the term before so I was intrigued. When I first saw the article I immediately thought of data analytics but data science should be much broader than that.
What exactly is a data scientist? I suppose someone who studies what can be learned from data but also what happens throughout data lifecycles.
Data science is like biology
I look to biology for an example. A biologist studies all sorts of activity and interactions, from what happens in a single-cell organism to the plant and animal kingdoms. They create taxonomies which organize all biological entities, past and present. They study current and past food webs, ecosystems, and species. They work in an environment of scientific study where results are openly discussed and repeatable. In peer-reviewed journals, they document everything from how a cell interacts within an organism, to how an organism interacts with its ecosystem, to whole ecosystem lifecycles. I fondly remember my biology class in high school talking about DNA, the life of a cell, biological taxonomy and dissection.
Where are these counterparts in Data Science? Not sure but for starters let’s call someone who does data science an informatist.
What constitutes a data ecosystem in data science? Perhaps an informatist would study the IT infrastructure(s) where a datum is created, stored, and analyzed. Such infrastructure (especially with cloud) may span data centers, companies, and even the whole world. Nonetheless, migratory birds can cover large distances, across multiple ecosystems and are still valid subjects for biologists.
So where a datum exists, where and when it’s moved throughout its lifecycle, and how it interacts with other data are proper subjects for data ecosystem study. I suppose my life’s study of storage could properly be called the study of data ecosystems.
Next, what’s a reasonable way for an informatist to organize data like a biological taxonomy with domain, kingdom, phylum, class, order, family, genus, and species (see Wikipedia)? It seems to me that the applications that create and access the data represent a rational way to organize it. However, my first thought on this was structured or unstructured data as the defining first-level breakdown (maybe Phylum). Order could be general application type such as email, ERP, office documents, etc. Family could be application domain, genus could be application version and species could be application data type. So something like an Exchange 2010 email would be Order=EMAILus, Family=EXCHANGius, Genus=E2010ius, and Species=MESSAGius.
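The classification scheme above can be written down as a small data structure. The following Python sketch is only an illustration: the `DataTaxon` class and its `lineage` method are hypothetical names, and the phylum is left unset for the email example because the structured/unstructured assignment is not settled in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataTaxon:
    """Hypothetical record of where a kind of data sits in the taxonomy."""
    order: str                    # general application type, e.g. email
    family: str                   # application domain
    genus: str                    # application version
    species: str                  # application data type
    phylum: Optional[str] = None  # structured vs. unstructured, if assigned

    def lineage(self):
        """Render the assigned ranks from broadest to narrowest."""
        ranks = [
            ("Phylum", self.phylum),
            ("Order", self.order),
            ("Family", self.family),
            ("Genus", self.genus),
            ("Species", self.species),
        ]
        return " > ".join(f"{rank}={name}" for rank, name in ranks if name is not None)


# The Exchange 2010 email example from the text.
exchange_email = DataTaxon(
    order="EMAILus",
    family="EXCHANGius",
    genus="E2010ius",
    species="MESSAGius",
)
print(exchange_email.lineage())
```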
I think the higher classifications, kingdom and domain, need to cover more ground. The Kingdom level could consider things such as oral history, handcopied manuscripts, movable type printed documents, IT, etc. Maybe Domain would be such things as biological domain, information domain, physical domain, etc. Although where oral-h
When first thinking of higher taxonomical designations I immediately went into O/S but now I think of an O/S as part of the ecological niche where data temporarily resides.
I could go on; there are probably hundreds if not thousands of other characteristics of data science that need to be discussed – data lifecycle, the data cell, information use webs, etc.
Another surprise is how well the study of biology fits the study of data science. Counterparts to biology seem to exist everywhere I look. At some deep level, biology is information, wet-ware perhaps, but information nonetheless. It seems to me that the use of biology to guide our elaboration of data science can be very useful.