To try to partition this space just a bit, there is unstructured data analysis and structured data analysis. Hadoop is used to analyze un-structured data (although Hadoop is used to parse and structure the data).
On the other hand, for structured data there are a number of other options currently available. Namely:
EMC Greenplum – a relational database that is available in a software only as well as now as a hardware appliance. Greenplum supports both row or column oriented data structuring and has support for policy based data placement across multiple storage tiers. There is a packaged solution that consists of Greenplum software and a Hadoop distribution running on a GreenPlum appliance.
HP Vertica – a column oriented, relational database that is available currently in a software only distribution. Vertica supports aggressive data compression and provides high throughput query performance. They were early supporters of Hadoop integration providing Hadoop MapReduce and Pig API connectors to provide Hadoop access to data in Vertica databases and job scheduling integration.
IBM Netezza – a relational database system that is based on proprietary hardware analysis engine configured in a blade system. Netezza is the second oldest solution on this list (see Teradata for the oldest). Since the acquisition by IBM, Netezza now provides their highest performing solution on IBM blade hardware but all of their systems depend on purpose built, FPGA chips designed to perform high speed queries across relational data. Netezza has a number of partners and/or homegrown solutions that provide specialized analysis for specific verticals such as retail, telcom, finserv, and others. Also, Netezza provides tight integration with various Oracle functionality but there doesn’t appear to be much direct integration with Hadoop on thier website.
ParAccel – a column based, relational database that is available in a software only solution. ParAccel offers a number of storage deployment options including an all in-memory database, DAS database or SSD database. In addition, ParAccel offers a Blended Scan approach providing a two tier database structure with DAS and SAN storage. There appears to be some integration with Hadoop indicating that data stored in HDFS and structured by MapReduce can be loaded and analyzed by ParAccel.
Teradata – a relational databases that is based on a proprietary purpose built appliance hardware. Teradata recently came out with an all SSD, solution which provides very high performance for database queries. The company was started in 1979 and has been very successful in retail, telcom and finserv verticals and offer a number of special purpose applications supporting data analysis for these and other verticals. There appears to be some integration with Hadoop but it’s not prominent on their website.
Probably missing a few other solutions but these appear to be the main ones at the moment.
In any case both Hadoop and most of it’s software-only, structured competition are based on a massively parrallelized/share nothing set of linux servers. The two hardware based solutions listed above (Teradata and Netezza) also operate in a massive parallel processing mode to load and analyze data. Such solutions provide scale-out performance at a reasonable cost to support very large databases (PB of data).
Now that EMC owns Greenplum and HP owns Vertica, we are likely to see more appliance based packaging options for both of these offerings. EMC has taken the lead here and have already announced Greenplum specific appliance packages.
One lingering question about these solutions is why don’t customers use current traditional database systems (Oracle, DB2, Postgres, MySQL) to do this analysis. The answer seems to lie in the fact that these traditional solutions are not massively parallelized. Thus, doing this analysis on TB or PB of data would take a too long. Moreover, the cost to support data analysis with traditional database solutions over PB of data would be prohibitive. For these reasons and the fact that compute power has become so cheap nowadays, structured data analytics for large databases has migrated to these special purpose, massively parallelized solutions.
In part 1 we discussed some of Hadoop’s core characteristics with respect to the Hadoop distributed file system (HDFS) and the MapReduce analytics engine. Now in part 2 we promised to discuss some of the other projects that have emerged to make Hadoop and specifically MapReduce even easier to use to analyze unstructured data.
Specifically, we have a set of tools which use Hadoop to construct a database like out of unstructured data. Namely,
Casandra – which maps HDFS data into a database but into a columnar based sparse table structure rather than the more traditional relational database row form. Cassandra was written by Facebook for Mbox search. Columnar databases support a sparse data much more efficiently. Data access is via a Thrift based API supporting many languages. Casandra’s data model is based on column, column families and column super-families. The datum for any column item is a three value structure and consists of a name, value of item and a time stamp. One nice thing about Cassandra is that one can tune it for any consistency model one requires, from no consistency to always consistent and points inbetween. Also Casandra is optimized for writes. Cassandra can be used as the Map portion of a MapReduce run.
Hbase – which also maps HDFS data into a database like structure and provides Java API access to this DB. Hbase is useful for million row tables with arbitrary column counts. Apparently Hbase is an outgrowth of Google’s Bigtable which did much the same thing only against the Google file system (GFS). In contrast to Hive below Hbase doesn’t run on top of MapReduce rather it replaces MapReduce, however it can be used as a source or target of MapReduce operations. Also, Hbase is somewhat tuned for random access read operations and as such, can be used to support some transaction oriented applications. Moreover, Hbase can run on HDFS or Amazon S3 infrastructure.
Hive – which maps a” simple SQL” (called QL) ontop of a data warehouse built on Hadoop. Some of these queries may take a long time to execute and as the HDFS data is unstructured the map function must extract the data using a database like schema into something approximating a relational database. Hive operates ontop of Hadoop’s MapReduce function.
Hypertable – is a Google open source project which is a c++ implementation of BigTable only using HDFS rather than GFS . Actually Hypertable can use any distributed file systemand and is another columnar database (like Cassandra above) but only supports columns and column families. Hypertable supports both a client (c++) and Thrift API. Also Hypertable is written in c++ and is considered the most optimized of the Hadoop oriented databases (although there is some debate here).
Pig – is a dataflow processing (scripting) language built ontop of Hadoop which supports a sort of database interpreter for HDFS in combination with an interpretive analysis. Essentially, Pig uses the scripting language and emits a dataflow graph which is then used by MapReduce to analyze the data in HDFS. Pig supports both batch and interactive execution but can also be used through a Java API.
Hadoop also supports special purpose tools used for very specialized analysis such as
Mahout – an Apache open source project which applies machine learning algorithms to HDFS data providing classification, characterization, and other feature extraction. However, Mahout works on non-Hadoop clusters as well. Mahout supports 4 techniques: recommendation mining, clustering, classification, and itemset machine learning functions. While Mahout uses the MapReduce framework of Hadoop, it doesnot appear that Mahout uses Hadoop MapReduce directly but is rather a replacement for MapReduce focused on machine learning activities.
Hama – an Apache open source project which is used to perform paralleled matrix and graph computations against Hadoop cluster data. The focus here is on scientific computation. Hama also supports non-Hadoop frameworks including BSP and Dryad (DryadLINQ?). Hama operates ontop of MapReduce and can take advantage of Hbase data structures.
There are other tools that have sprung up around Hadoop to make it easier to configure, test and use, namely
Chukwa – which is used for monitoring large distributed clusters of servers.
ZooKeeper – which is a cluster configuration tool and distributed serialization manager useful to build large clusters of Hadoop nodes.
MRunit – which is used to unit test MapReduce programs without having to test it on the whole cluster.
Whirr – which extends HDFS to use cloud storage services, unclear how well this would work with PBs of data to be processed but maybe it can colocate the data and the compute activities into the same cloud data center.
As for who uses these tools, Facebook uses Hive and Cassandra, Yahoo uses Pig, Google uses Hypertable and there are myriad users of the other projects as well. In most cases the company identified in the previous list developed the program source code originally, and then contributed it to the Apache for use in the Hadoop open source project. In addition, those companies continue to fix, support and enhance these packages as well.