Hadoop – part 1

Hadoop Logo (from http://hadoop.apache.org website)

Big data is creating quite a storm around IT these days, and at the bottom of big data is an Apache open source project called Hadoop.

In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).

What is Hadoop and why is it important?

OK, lots of money, time and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?

Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most big data analytic activities, because it bundles the two sets of functionality most needed to deal with large unstructured datasets. Specifically,

  • Distributed file system – The Hadoop Distributed File System (HDFS) operates with one (or two) metadata servers (NameNodes) and any number of data server nodes (DataNodes).  These are essentially software packages running on server hardware that provide file system services through a client file access API. HDFS is a write-once (WORM) file system that splits file data into large blocks (typically 64-128MB each), distributes those blocks across DataNodes on local (or networked) storage, and keeps track of where everything resides.  HDFS provides a global namespace for its file system, uses replication to ensure data availability, and provides widely distributed access to file data.
  • MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (the map function), which can then be sorted, analyzed and summarized (the reduce function) into some worthy output (a minimal sketch follows this list).  MapReduce uses HDFS to access file blocks and to store reduced results.  MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes, which execute the map and reduce tasks requested. These activities execute on the same nodes that are used for HDFS.
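
To make those two pieces concrete, here is a minimal sketch of a MapReduce job in Java, the canonical word count example adapted from the Apache Hadoop tutorial. Treat it as illustrative rather than production code; the input and output arguments are simply whatever HDFS directories you point it at.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: parse each line of unstructured text into (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregate counts on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The JobTracker splits the input into one map task per file block, tries to run each task on a DataNode that holds that block, and then routes the sorted (word, count) pairs to the reduce tasks, which write their results back into HDFS.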

Hadoop explained

It just so happens that HDFS is optimized for sequential read access and can handle files of 100TB or larger. By throwing more DataNodes at the problem, throughput can be scaled up enormously, so that a 100TB file could literally be read in minutes to hours rather than weeks.
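
As a rough sanity check (my numbers, not anything from the Hadoop documentation): a single server streaming a 100TB file at 100MB/sec would need about a million seconds, or roughly 12 days; spread that same file over 1,000 DataNodes each streaming its local blocks at 100MB/sec and the scan takes around 1,000 seconds, under 20 minutes.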

Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of data across these same DataNodes, thus distributing the computation as well and reducing the time needed to process the data to minutes.

HDFS can also be “rack aware” and, as such, will try to allocate block replicas to different racks where possible.  In addition, replication can be set on a per-file basis and defaults to 3 replicas of each file block.
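
As an illustration (my own sketch, with a made-up file path), the replication factor of an individual HDFS file can be changed through Hadoop’s FileSystem API, or equivalently from the shell with hadoop fs -setrep:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BumpReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    // Raise this file's replication from the cluster default (normally 3) to 5.
    fs.setReplication(new Path("/data/important.log"), (short) 5);
    fs.close();
  }
}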

Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides.  In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.

MapReduce programming is somewhat new, unique, and complex, and is an outgrowth of Google’s MapReduce framework.  As such, a number of other Apache open source projects have sprouted up around it, namely Cassandra, Chukwa, HBase, Hive, Mahout, and Pig to name just a few, several of which provide easier ways to generate MapReduce programs automatically.  I will try to provide more information on these and other popular projects in a follow-on post.

Hadoop fault tolerance

When an HDFS DataNode fails, the NameNode detects the missing heartbeat signal and, once it does, re-creates on the surviving DataNodes all the replicas that resided on the failed node.

Similarly, the MapReduce JobTracker can detect that a TaskTracker node has failed and reallocate its work to another node in the cluster.  In this fashion, jobs complete even in the face of node failures.

Hadoop distributions, support and who uses them

As with any open source project, having a distribution that can be trusted and supported takes much of the risk out of using it, and Hadoop is no exception. Probably the most popular distribution comes from Cloudera, which packages all the projects named above and more, and provides support.  Recently EMC announced that it will supply its own distribution and support for Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters, but use other distributions (mostly Cloudera).

As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo, including eBay, Facebook, Google and Twitter to highlight just a few popular ones. There is a list on the Apache Hadoop website which provides more detail if you are interested.  The list describes some of the Hadoop configurations in use, ranging from an 18-node cluster to over 4,500 nodes with multiple PBs of data storage.  Most of the big players are also active participants in the various open source projects around Hadoop, and much of the code came from these organizations.

—-

I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was.  This is my first attempt – hopefully, more to follow.

Comments?

What’s wrong with tape?

StorageTek Automated Cartridge System by brewbooks (cc) (from Flickr)

Was on a conference call today with Oracle’s marketing discussing their tape business.  Fred Moore (of Horison Information Systems) was on the call and mentioned something that surprised me: what’s missing in open and distributed systems is a standalone mechanism to stack multiple volumes onto a single tape cartridge.

The advantages of tape are significant, namely:

  • Low power utilization for offline or nearline storage
  • Cheap media, drives, and automation systems
  • Good sequential throughput
  • Good cartridge density

But most of these advantages fade when cartridge capacity utilization drops.  One way to increase cartridge capacity utilization is to stack multiple tape volumes on a single cartridge.
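
To put some rough numbers on it (mine, not Fred’s): if a 1TB cartridge costs, say, $50, then fully used it stores data at about $0.05/GB; write just one 50GB volume to it and leave the rest empty, and the effective media cost jumps twenty-fold to about $1/GB, while the library slot it occupies is 95% wasted. Stacking several small volumes onto that one cartridge recovers most of that lost capacity.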

Mainframes (like System z) have had cartridge stacking since the late ’90s.  Such capabilities came about because of the ever-increasing cartridge capacities then becoming available. Advance a decade and the problem still exists: Oracle’s StorageTek T10000 has a 1TB cartridge capacity and LTO-5 supports 1.5TB per cartridge, both uncompressed.  Nonetheless, open and distributed systems still have no tape stacking capability.

Although I agree with Fred that volume stacking is missing in open systems, do they really need such a thing?  Currently, open systems seem to use tape for backups, archive data and the occasional batch run.  Automated hierarchical storage management can readily fill up tape cartridges by holding data movement to tape until enough data is ready to be moved.  Backups, by their very nature, create large sequential streams of data, which should result in high capacity utilization except for the last tape in a series.  That leaves only the occasional batch run over large datasets or files.

I believe most batch processing today already takes place on the mainframe, leaving relatively little for open or distributed systems.  There are certainly some verticals that do lots of batch processing, for example banks and telcos.  But most heavy batch users grew up in the heyday of the mainframe and are still using mainframes today.

Condor notwithstanding, open and distributed systems have never had the sophisticated batch processing capabilities readily available on the mainframe. As such, my guess is that new companies which need batch processing start on open systems and, as their batch workloads grow, move those applications to the mainframe.

So the real question becomes: how do we increase open systems batch processing?  I don’t think a tape volume stacking capability solves that problem.

Given all the above, I see tape use in open systems being relegated to backup and archive, and used less and less for any other activity.

What do you think?