BIGData is creating quite a storm around IT these days and at the bottom of big data is an Apache open source project called Hadoop.
In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).
What is Hadoop and why is it important?
Ok, lots of money, time and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?
Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most BIGData analytic activities because it bundles the two sets of functionality most needed to deal with large unstructured datasets. Specifically,
- Distributed file system – Hadoop’s Distributed File System (HDFS) operates using one (or two) metadata servers (NameNode) with any number of data server nodes (DataNode). These are essentially software packages running on server hardware that support file system services through a client file access API. HDFS provides a WORM (write once, read many) file system that splits file data up into segments (64-128MB each), distributes the segments to DataNodes where they are stored on local (or networked) storage, and keeps track of everything. HDFS provides a global name space for its file system, uses replication to ensure data availability, and provides widely distributed access to file data.
- MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (map function) which can then be sorted, analyzed and summarized (reduce function) into some useful output. MapReduce uses HDFS to access file segments and to store reduced results. MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes, which execute the map and reduce processes requested. These activities execute on the same nodes that are used for HDFS.
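To make the map and reduce steps a bit more concrete, here is a minimal single-machine sketch in plain Python that counts words. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) are my own for illustration, and a real job would run many map and reduce tasks in parallel across TaskTracker nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: parse one line of unstructured text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all the values collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    intermediate = []
    for line in lines:
        intermediate.extend(map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

The point of the split is that each call to map_phase only needs its own line and each call to reduce_phase only needs the values for its own key, which is what lets the work be spread over many nodes.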
It just so happens that HDFS is optimized for sequential read access and can handle files that are 100TBs or larger. By throwing more data nodes at the problem, throughput can be scaled up enormously so that a 100TB file could literally be read in minutes to hours rather than weeks.
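A rough back-of-the-envelope calculation shows where the "minutes to hours rather than weeks" comes from. The sketch below assumes each DataNode can sustain 100MB/s of sequential reads and that the file is spread evenly so all nodes read in parallel. Both are my assumptions for illustration, not measured Hadoop figures.

```python
def read_time_seconds(file_bytes, nodes, node_mb_per_s=100.0):
    """Ideal time to read a file when `nodes` DataNodes each stream an equal share.

    node_mb_per_s is an assumed per-node sequential read rate (decimal MB).
    """
    per_node_bytes_per_s = node_mb_per_s * 1_000_000
    return file_bytes / (nodes * per_node_bytes_per_s)


FILE_BYTES = 100 * 10**12  # 100TB, using decimal terabytes

if __name__ == "__main__":
    for nodes in (1, 100, 1000):
        hours = read_time_seconds(FILE_BYTES, nodes) / 3600
        print(f"{nodes:>5} nodes: {hours:,.1f} hours")
```

Under those assumptions a single node would need about 11.6 days, 100 nodes a little under 3 hours, and 1,000 nodes roughly 17 minutes. Real clusters will not scale this cleanly, but the shape of the result is the same.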
Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of data onto these same DataNodes, thus distributing the processing of the data, reducing time to process this data to minutes.
HDFS can also be “rack aware” and as such, will try to allocate file segment replicas to different racks where possible. In addition, replication can be defined on a file basis and normally uses 3 replicas of each file segment.
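The idea of splitting a file into segments and spreading the replicas across racks can be sketched as follows. This is a deliberately simplified placement policy of my own, not the algorithm HDFS actually uses, and the rack and node names are hypothetical.

```python
import math
from itertools import zip_longest


def split_into_segments(file_size_mb, segment_mb=64):
    """Return the size in MB of each segment a file would be split into."""
    count = math.ceil(file_size_mb / segment_mb)
    sizes = [segment_mb] * count
    if count:
        # The last segment holds whatever is left over.
        sizes[-1] = file_size_mb - segment_mb * (count - 1)
    return sizes


def place_replicas(segment_index, nodes_by_rack, replicas=3):
    """Choose DataNodes for one segment, using as many different racks as possible."""
    racks = sorted(nodes_by_rack)
    # Rotate the rack order per segment so the first replica is not always on the same rack.
    start = segment_index % len(racks)
    rotated = racks[start:] + racks[:start]
    # Take one node from every rack before taking a second node from any rack.
    interleaved = [
        node
        for tier in zip_longest(*(nodes_by_rack[rack] for rack in rotated))
        for node in tier
        if node is not None
    ]
    return interleaved[:replicas]


if __name__ == "__main__":
    cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}  # hypothetical layout
    for index, size in enumerate(split_into_segments(200)):
        print(index, size, place_replicas(index, cluster))
```

With only two racks, three replicas cannot all sit on different racks, so the sketch settles for two replicas on one rack and one on the other. This matches the "where possible" caveat above.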
Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides. In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.
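A scheduler that honors data locality might choose a node along these lines. This is only a sketch of the preference order (same node, then same rack, then anywhere), with names I made up for illustration. It is not the JobTracker's actual logic.

```python
def pick_task_node(replica_nodes, idle_nodes, rack_of):
    """Pick an idle node for a map task, preferring nodes close to the data.

    replica_nodes: nodes holding a replica of the segment to process
    idle_nodes:    nodes currently free to run a task, in preference order
    rack_of:       mapping of node name -> rack name
    """
    # Best case: an idle node already holds a replica, so no network transfer is needed.
    for node in idle_nodes:
        if node in replica_nodes:
            return node, "node-local"
    # Next best: an idle node on the same rack as some replica.
    replica_racks = {rack_of[node] for node in replica_nodes}
    for node in idle_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # Otherwise take any idle node and read the data across racks.
    if idle_nodes:
        return idle_nodes[0], "remote"
    return None, "none"


if __name__ == "__main__":
    racks = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}  # hypothetical cluster
    print(pick_task_node(["dn1", "dn3"], ["dn2", "dn3"], racks))
```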
MapReduce programming is somewhat new, unique, and complex and was an outgrowth of Google’s MapReduce process. As such, a number of other Apache open source projects have sprouted up around it, including Cassandra, Chukwa, HBase, Hive, Mahout, and Pig, that provide easier ways to work with Hadoop, in some cases by automatically generating MapReduce programs. I will try to provide more information on these and other popular projects in a follow-on post.
Hadoop fault tolerance
When an HDFS DataNode fails, the NameNode detects its missing heartbeat signal and then re-creates, on other DataNodes, all the replicas that resided on the failed node.
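The heartbeat check and replica repair can be pictured with a small sketch like the one below. The 30 second timeout, the data structures and the function names are all assumptions of mine for illustration. The real NameNode keeps far more state and uses its own timing.

```python
def find_dead_nodes(last_heartbeat, now, timeout_s=30.0):
    """Return nodes whose most recent heartbeat is older than timeout_s seconds."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout_s)


def re_replicate(segment_locations, dead_nodes, live_nodes, replicas=3):
    """Return a new segment -> nodes map with lost replicas re-created on live nodes."""
    dead = set(dead_nodes)
    repaired = {}
    for segment, nodes in segment_locations.items():
        survivors = [node for node in nodes if node not in dead]
        spares = [node for node in live_nodes if node not in survivors and node not in dead]
        # A new replica is copied from a surviving one, so with no survivors nothing can be repaired.
        needed = max(replicas - len(survivors), 0) if survivors else 0
        repaired[segment] = survivors + spares[:needed]
    return repaired


if __name__ == "__main__":
    heartbeats = {"dn1": 100.0, "dn2": 60.0, "dn3": 99.0, "dn4": 98.0}  # seconds
    locations = {"seg0": ["dn1", "dn2", "dn3"], "seg1": ["dn1", "dn3", "dn4"]}
    dead = find_dead_nodes(heartbeats, now=100.0)
    live = [node for node in heartbeats if node not in dead]
    print(dead, re_replicate(locations, dead, live))
```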
Similarly, the MapReduce JobTracker node can detect that a TaskTracker has failed and allocate this work to another node in the cluster. In this fashion, work will complete even in the face of node failures.
Hadoop distributions, support and who uses them
As with any open source project, having a distribution that can be trusted and supported takes much of the risk out of using it, and Hadoop is no exception. Probably the most popular distribution comes from Cloudera, which contains all the above-named projects and more, and comes with support. Recently EMC announced that it will supply its own distribution and support of Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters but use other distributions (mostly Cloudera).
As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo, including eBay, Facebook, Google and Twitter, to highlight just a few popular ones. There is a list on Apache’s Hadoop website which provides more detail if you are interested. The list indicates some of the Hadoop configurations and shows anything from an 18-node cluster to over 4500 nodes with multiple PBs of data storage. Most of the big players are also active participants in the various open source projects around Hadoop, and much of the code came from these organizations.
I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was. This is my first attempt – hopefully, more to follow.