Hadoop – part 1

Hadoop Logo (from http://hadoop.apache.org website)

BIGData is creating quite a storm around IT these days, and at the bottom of big data is an Apache open source project called Hadoop.

In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).

What is Hadoop and why is it important?

OK, lots of money, time, and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?

Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most BIGData analytic activities, because it bundles two sets of functionality most needed to deal with large unstructured datasets. Specifically:

  • Distributed file system – Hadoop’s Distributed File System (HDFS) operates using one (or two) metadata servers (NameNodes) with any number of data server nodes (DataNodes).  These are essentially software packages running on server hardware that provide file system services through a client file access API. HDFS is a WORM (write once, read many) file system that splits file data into segments (64-128MB each), distributes the segments to DataNodes for storage on local (or networked) disks, and keeps track of everything.  HDFS provides a global namespace for its file system, uses replication to ensure data availability, and provides widely distributed access to file data.
  • MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (the map function), which can then be sorted, analyzed, and summarized (the reduce function) into some worthy output.  MapReduce uses HDFS to access file segments and to store reduced results.  MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes, which execute the map and reduce processes requested. These activities execute on the same nodes that are used for HDFS.
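The canonical MapReduce example is word count. The single-process Python sketch below only mimics the map and reduce steps; a real Hadoop job implements Mapper and Reducer classes (typically in Java) that run in parallel across TaskTracker nodes:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(segment):
    # Map: parse an unstructured text segment into (key, value) pairs
    return [(word.lower(), 1) for word in segment.split()]

def reduce_phase(pairs):
    # Shuffle/sort: group pairs by key; reduce: sum each group's values
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

# Each "segment" stands in for a 64-128MB HDFS file segment
segments = ["big data big storm", "data nodes store data"]
mapped = [pair for seg in segments for pair in map_phase(seg)]
counts = reduce_phase(mapped)
# counts == {'big': 2, 'data': 3, 'nodes': 1, 'store': 1, 'storm': 1}
```

In a real cluster, each map_phase call would run on the DataNode holding that segment, and the framework would handle the sort/shuffle between map and reduce.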

Hadoop explained

It just so happens that HDFS is optimized for sequential read access and can handle files of 100TB or larger. By throwing more DataNodes at the problem, throughput can be scaled up enormously, so that a 100TB file could literally be read in minutes to hours rather than weeks.
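Back-of-the-envelope arithmetic shows why. Assuming (purely for illustration) ~100MB/s of sequential read throughput per DataNode:

```python
file_size_mb = 100 * 1_000_000     # 100TB expressed in MB (decimal)
node_throughput_mb_s = 100         # assumed per-node sequential read rate

single_node_hours = file_size_mb / node_throughput_mb_s / 3600
# ~278 hours, i.e. over 11 days to scan the file from a single node

cluster_nodes = 1000
cluster_minutes = single_node_hours * 60 / cluster_nodes
# ~17 minutes when the file's segments are read in parallel
```

The per-node rate is an assumption, not a measured figure, but the shape of the result holds: parallel segment reads turn a multi-day scan into minutes.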

Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of data onto these same DataNodes, thus distributing the processing and reducing the time needed to process the data to minutes.

HDFS can also be “rack aware” and, as such, will try to allocate file segment replicas to different racks where possible.  In addition, replication can be defined on a per-file basis and normally uses three replicas of each file segment.
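A simplified sketch of that placement policy follows; the topology map and node names are made up, and real HDFS resolves racks via a configurable topology script rather than a hard-coded dictionary:

```python
def place_replicas(writer_node, topology, replication=3):
    # Rough sketch of HDFS's default policy: first replica on the
    # writer's node, remaining replicas on nodes in a different rack,
    # so the file segment survives the loss of an entire rack
    racks = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = racks[writer_node]
    remote_rack = next(r for r in topology if r != local_rack)
    return [writer_node] + topology[remote_rack][:replication - 1]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
placement = place_replicas("dn1", topology)
# ["dn1", "dn3", "dn4"] -- replicas span both racks
```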

Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides.  In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.

MapReduce programming is somewhat new, unique, and complex, and was an outgrowth of Google’s MapReduce process.  As such, a number of other Apache open source projects have sprouted up, namely Cassandra, Chukwa, HBase, Hive, Mahout, and Pig to name just a few, that provide easier ways to work with Hadoop data, including automatically generating MapReduce programs.  I will try to provide more information on these and other popular projects in a follow-on post.

Hadoop fault tolerance

When an HDFS DataNode fails, the NameNode detects the missing heartbeat signal and then recreates, from surviving copies, all the replicas that resided on the failed DataNode.

Similarly, the MapReduce JobTracker can detect that a TaskTracker has failed and reallocate its work to another node in the cluster.  In this fashion, work completes even in the face of node failures.
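The NameNode side of that recovery can be sketched as a scan for under-replicated blocks once a heartbeat times out (the node and block names below are invented for illustration):

```python
def blocks_to_rereplicate(block_map, live_nodes, replication=3):
    # After a DataNode's heartbeat times out, find every block whose
    # count of replicas on live nodes has dropped below the target
    shortfall = {}
    for block, holders in block_map.items():
        live = holders & live_nodes
        if len(live) < replication:
            shortfall[block] = replication - len(live)
    return shortfall

block_map = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
live_nodes = {"dn2", "dn3", "dn4"}      # dn1 has stopped heart-beating
todo = blocks_to_rereplicate(block_map, live_nodes)
# {'blk_1': 1}: one new copy of blk_1 is copied from a surviving replica
```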

Hadoop distributions, support and who uses them

As with any open source project, having a distribution that can be trusted and supported takes much of the risk out of adoption, and Hadoop is no exception. Probably the most popular distribution comes from Cloudera, which packages all the above-named projects and more, and provides support.  Recently, EMC announced that it will supply its own distribution and support of Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters, mostly using the Cloudera distribution.

As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo, including eBay, Facebook, Google, and Twitter to highlight just a few popular ones. There is a list on the Apache Hadoop website which provides more detail if you’re interested.  The list indicates some of the Hadoop configurations, showing anywhere from an 18-node cluster to over 4,500 nodes with multiple PBs of data storage.  Most of the big players are also active participants in the various open source projects around Hadoop, and much of the code came from these organizations.

—-

I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was.  This is my first attempt – hopefully, more to follow.

Comments?

Real-time data analytics from customer interactions

the ghosts in the machine by MelvinSchlubman (cc) (From Flickr)

At a recent EMC product launch in New York, there was a customer question and answer session for industry analysts with four of EMC’s leading edge customers. One customer, Marco Pacelli, was the CEO of ClickFox, a company providing real-time data analytics to retailers, telecoms, banks and other high transaction volume companies.

Interactions vs. transactions

Marco was very interesting, mostly because at first I didn’t understand what his company was doing or how they were doing it.  He made the statement that for every transaction (customer activity that generates revenue) companies encounter (and there are millions of them), there can be literally 10 to 100 distinct customer interactions.  And it’s the information in these interactions that can most help companies maximize transaction revenue, volume, and/or throughput.

Tracking and tracing through all these interactions in real time, to try to make sense of the customer interaction sphere, is a new and emerging discipline.  Apparently, ClickFox makes extensive use of Greenplum, one of EMC’s recent acquisitions, to do all this, but I was more interested in what they were trying to achieve than in the products used to accomplish it.

Banking interactions

For example, it seems that the websites, bank tellers, ATMs, and myriad other devices one uses to interact with a bank are all capable of recording any interaction or action we perform. What ClickFox seems to do is track customer interactions across all these mechanisms, trace what transpired leading up to any transaction, and determine how it could be done better. The fact that most banking interactions are authenticated to one account, regardless of origin, makes tracking interactions across all facets of customer activity possible.

By doing this, ClickFox can tell companies how to generate more transactions, faster.  If a bank can somehow change their interactions with a customer across websites, bank tellers, ATM machines, phone banking and any other touchpoint, so that more transactions can be done with less trouble, it can be worth lots of money.

How all that data is aggregated and sent offsite or processed onsite is yet another side to this problem but ClickFox is able to do all this with the help of GreenPlum database appliances.  Moreover, ClickFox can host interaction data and perform analytics at their own secure site(s) or perform their analysis on customer premises depending on company preference.

—-

Marco’s closing comments were something like this: the days of offloading information to a data warehouse, asking a question, and waiting weeks for an answer are over; the time when a company can optimize its customer interactions using data just gathered, across every touchpoint it supports, is upon us.

How all this works for non-authenticated interactions was another mystery to me.  Marco indicated in later discussions that it was possible to identify patterns of behavior that led to transactions and that this could be used instead to help trace customer interactions across company touchpoints for similar types of analyses!?  Sounds like AI on top of database machines…

Comments?

Data Science!!

perspective by anomalous4 (cc) (from Flickr)

Ran across a web posting yesterday providing information on a University of Illinois summer program in Data Science.  I had never encountered the term before so I was intrigued.  When I first saw the article I immediately thought of data analytics but data science should be much broader than that.

What exactly is a data scientist?  I suppose someone who studies what can be learned from data but also what happens throughout data lifecycles.

Data science is like biology

I look to biology for an example.  A biologist studies all sorts of activities and interactions, from what happens in a single-cell organism, to plants, to the animal kingdom.  They create taxonomies that organize all biological entities, past and present.  They study current and past food webs, ecosystems, and species.  They work in an environment of scientific study where results are openly discussed and repeatable.   In peer-reviewed journals, they document everything from how a cell interacts within an organism, to how an organism interacts with its ecosystem, to whole ecosystem lifecycles.  I fondly remember my high school biology class talking about DNA, the life of a cell, biological taxonomy, and dissection.

Where are these counterparts in Data Science?  Not sure but for starters let’s call someone who does data science an informatist.

Data ecosystems

What constitutes a data ecosystem in data science?  Perhaps an informatist would study the IT infrastructure(s) where a datum is created, stored, and analyzed.  Such infrastructure (especially with cloud) may span data centers, companies, and even the whole world.  Then again, migratory birds cover large distances across multiple ecosystems and are still valid subjects for biologists.

So where a datum exists, where and when it’s moved throughout its lifecycle, and how it interacts with other datums is a proper subject for data ecosystem study.  I suppose my life’s study of storage could properly be called the study of data ecosystems.

Data taxonomy

Next, what’s a reasonable way for an informatist to organize data, akin to a biological taxonomy with domain, kingdom, phylum, class, order, family, genus, and species (see Wikipedia)?  It seems to me that the applications that create and access the data represent a rational way to organize it.  My first thought was that structured vs. unstructured data makes a defining first-level breakdown (maybe phylum).  Order could be the general application type, such as email, ERP, office documents, etc. Family could be the application domain, genus could be the application version, and species could be the application data type.  So something like an Exchange 2010 email would be Order=EMAILus, Family=EXCHANGius, Genus=E2010ius, and Species=MESSAGius.
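That breakdown could be captured in a tiny record type. Everything here, from the field names to the rank values, is just my own illustration of the idea, not an established scheme:

```python
from dataclasses import dataclass

@dataclass
class DataTaxon:
    # Ranks loosely borrowed from biological taxonomy
    phylum: str    # structured vs. unstructured
    order: str     # general application type
    family: str    # application domain
    genus: str     # application version
    species: str   # application data type

exchange_email = DataTaxon(
    phylum="unstructured",
    order="EMAILus",
    family="EXCHANGius",
    genus="E2010ius",
    species="MESSAGius",
)
```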

I think higher classifications such as kingdom and domain need to consider things such as oral history, hand-copied manuscripts, movable-type printed documents, IT, etc., at the kingdom level.  Maybe domain would be such things as the biological domain, information domain, physical domain, etc.

When first thinking of higher taxonomical designations I immediately went to the O/S, but now I think of an O/S as part of the ecological niche where data temporarily resides.

—-

I could go on, there are probably hundreds if not thousands of other characteristics of data science that need to be discussed – data lifecycle, the data cell, information use webs, etc.

Another surprise is how well the study of biology fits the study of data science.  Counterparts to biology seem to exist everywhere I look.  At some deep level, biology is information, wet-ware perhaps, but information nonetheless.  It seems to me that the use of biology to guide our elaboration of data science can be very useful.

Comments?