Archeology meets Big Data

Polynya off the Antarctic Coast by NASA Earth Observatory (cc) (From Flickr)
Polynya off the Antarctic Coast by NASA Earth Observatory (cc) (From Flickr)

Read an article yesterday about the use of LIDAR (light detection and ranging, Wikipedia) to map the residues of an pre-columbian civilization in Central America, the little know Purepecha empire, peers of the Aztecs.

The original study (seeLIDAR at Angamuco) cited in the piece above was a result of the Legacies of Resilience project sponsored by Colorado State University (CSU) and goes into some detail about the data processing and archeological use of the LIDAR maps.

Why LIDAR?

LIDAR sends a laser pulse from an airplane/satellite to the ground and measures how long it takes to reflect back to the receiver. With that information and “some” data processing, these measurements can be converted to an X, Y, & Z coordinate system or detailed map of the ground.

The archeologists in the study used LIDAR to create a detailed map of the empire’s main city at a resolution of +/- 0.25m (~10in). They mapped about ~207 square kilometers (80 square miles) at this level of detail. In 4 days of airplane LIDAR mapping, they were able to gather more information about the area then they were able to accumulate over 25 years of field work. Seems like digital archeology was just born.

So how much data?

I wanted to find out just how much data this was but neither the article or the study told me anything about the size of the LIDAR map. However, assuming this is a flat area, which it wasn’t, and assuming the +/-.25m resolution represents a point every 625sqcm, then the area being mapped above should represent a minimum of ~3.3 billion points of a LIDAR point cloud.

Another paper I found (see Evaluation of MapReduce for Gridding LIDAR Data) said that a LIDAR “grid point” (containing X, Y & Z coordinates) takes 52 bytes of data.

Given the above I estimate the 207sqkm LIDAR grid point cloud represents a minimum of ~172GB of data. There are LIDAR compression tools available, but even at 50% reduction, it’s still 85GB for 210sqkm.

My understanding is that the raw LIDAR data would be even bigger than this and the study applied a number of filters against the LIDAR map data to extract different types of features which of course would take even more space. And that’s just one ancient city complex.

With all the above the size of LIDAR raw data, grid point fields, and multiple filtered views is approaching significance (in storage terms). Moving and processing all this data must also be a problem. As evidence, the flights for the LIDAR runs over Angamuco, Mexico occurred in January 2011 and they were able to analyze the data sometime that summer, ~6 months late. Seems a bit long from my perspective maybe the data processing/analysis could use some help.

Indiana Jones meets Hadoop

That was the main subject of the second paper mentioned above done by researchers at the San Diego Supercomputing Center (SDSC). They essentially did a benchmark comparing MapReduce/Hadoop running on a relatively small cluster of 4 to 8 commodity nodes against an HPC cluster (running 28-Sun x4600M2 servers, using 8 processor, quad core nodes, with anywhere from 256 GB to 512GB [only on 8 nodes] of DRAM running a C++ implementation of the algorithm.

The results of their benchmarks were that the HPC cluster beat the Hadoop cluster only when all of the LIDAR data could fit in memory (on a DRAM per core basis), after that the Hadoop cluster performed just as well in elapsed wall clock time. Of course from a cost perspective the Hadoop cluster was much more economical.

The 8-node, Hadoop cluster was able to “grid” a 150M LIDAR derived point cloud at the 0.25m resolution in just a bit over 10 minutes. Now this processing step is just one of the many steps in LIDAR data analysis but it’s probably indicative of similar activity occurring earlier and later down the (data) line.

~~~~

Let’s see 172GB per 207sqkm, the earth surface is 510Msqkm, says a similar resolution LIDAR grid point cloud of the entire earth’s surface would be about 0.5EB (Exabyte, 10**18 bytes). It’s just great to be in the storage business.

 

One day with HDS

HDS CEO Jack Domme shares the company’s vision and strategy with Influencer Summit attendees #HDSday by HDScorp
HDS CEO Jack Domme shares the company’s vision and strategy with Influencer Summit attendees #HDSday by HDScorp

Attended #HDSday yesterday in San Jose.  Listened to what seemed like the majority of the executive team. The festivities were MCed by Asim Zaheer, VP Corp and Product Marketing, a long time friend and employee, that came to HDS with the acquisition of Archivas five or so years ago.   Some highlights of the day’s sessions are included below.

The first presenter was Jack Domme, HDS CEO, and his message was that there is a new, more aggressive HDS, focused on executing and growing the business.

Jack said there will be almost a half a ZB by 2015 and ~80% of that will be unstructured data.  HDS firmly believes that much of this growing body of  data today lives in silos, locked into application environments and can’t become truly information until it can be liberated from this box.  Getting information out of the unstructured data is one of the key problems facing the IT industry.

To that end, Jack talked about the three clouds appearing on the horizon:

  • infrastructure cloud – cloud as we know and love it today where infrastructure services can be paid for on a per use basis, where data and applications move seemlessly across various infrastructural boundaries.
  • content cloud – this is somewhat new but here we take on the governance, analytics and management of the millions to billions pieces of content using the infrastructure cloud as a basic service.
  • information cloud – the end game, where any and all data streams can be analyzed in real time to provide information and insight to the business.

Jack mentioned the example of when Japan had their earthquake earlier this year they automatically stopped all the trains operating in the country to prevent further injury and accidents, until they could assess the extent of track damage.  Now this was a specialized example in a narrow vertical but the idea is that the information cloud does that sort of real-time analysis of data streaming in all the time.

For much of the rest of the day the executive team filled out the details that surrounded Jack’s talk.

For example Randy DeMont, Executive VP & GM Global Sales, Services and Support talked about the new, more focused sales team. On that has moved to concentrate on better opportunities and expanded to take on new verticals/new emerging markets.

Then Brian Householder, SVP WW Marketing and Business Development got up and talked about some of the key drivers to their growth:

  • Current economic climate has everyone doing more with less.  Hitachi VSP and storage virtualization is a unique position to be able to obtain more value out of current assets, not a rip and replace strategy.  With VSP one layers better management on top of your current infrastructure, that helps get more done with the same equipment.
  • Focus on the channel and verticals are starting to pay off.  More than 50% of HDS revenues now come from indirect channels.  Also, healthcare and life sciences are starting to emerge as a crucial vertical for HDS.
  • Scaleability of their storage solutions is significant. Used to be a PB was a good sized data center but these days we are starting to talk about multiple PBs and even much more.  I think earlier Jack mentioned that in the next couple of years HDS will see their first 1EB customer.

Mark Mike Gustafson,  SVP & GM NAS (former CEO BlueArc) got up and talked about the long and significant partnership between the two companies regarding their HNAS product.  He mentioned that ~30% of BlueArc’s revenue came from HDS.  He also talked about some of the verticals that BlueArc had done well in such as eDiscovery and Media and Entertainment.  Now these verticals will become new focus areas for HDS storage as well.

John Mansfield, SVP Global Solutions Strategy and Developmentcame up and talked about the successes they have had in the product arena.  Apparently they have over 2000 VSPs intsalled, (announced just a year ago), and over 50% of the new systems are going in with virtualization. When asked later what has led to the acceleration in virtualization adoption, the consensus view was that server virtualization and in general, doing more with less (storage efficiency) were driving increased use of this capability.

Hicham Abdessamad, SVP, Global Services got up and talked about what has been happening in the services end of the business.  Apparently there has been a serious shift in HDS services revenue stream from break fix over to professional services (PS).  Such service offerings now include taking over customer data center infrastructure and leasing it back to the customer at a monthly fee.   Hicham re-iterated that ~68% of all IT initiatives fail, while 44% of those that succeed are completed over time and/or over budget.  HDS is providing professional services to help turn this around.  His main problem is finding experienced personnel to help deliver these services.

After this there was a Q&A panel of John Mansfield’s team, Roberto Bassilio, VP Storage Platforms and Product Management, Sean Moser,  VP Software Products, and Scan Putegnat, VP File and Content Services, CME.  There were a number of questions one of which was on the floods in Thailand and their impact on HDS’s business.

Apparently, the flood problems are causing supply disruptions in the consumer end of the drive market and are not having serious repercussions for their enterprise customers. But they did mention that they were nudging customers to purchase the right form factor (LFF?) disk drives while the supply problems work themselves out.

Also, there was some indication that HDS would be going after more SSD and/or NAND flash capabilities similar to other major vendors in their space. But there was no clarification of when or exactly what they would be doing.

After lunch the GMs of all the Geographic regions around the globe got up and talked about how they were doing in their particular arena.

  • Jeff Henry, SVP &GM Americas talked about their success in the F500 and some of the emerging markets in Latin America.  In fact, they have been so successful in Brazil, they had to split the country into two regions.
  • Niels Svenningsen, SVP&GM EMAE talked about the emerging markets in his area of the globe, primarily eastern Europe, Russia and Africa. He mentioned that many believe Africa will be the next area to take off like Asia did in the last couple of decades of last century.  Apparently there are a Billion people in Africa today.
  • Kevin Eggleston, SVP&GM APAC, talked about the high rate of server and storage virtualization, the explosive growth and heavy adoption of Cloud pay as you go services. His major growth areas were India and China.

The rest of the afternoon was NDA presentations on future roadmap items.

—-

All in all a good overview of HDS’s business over the past couple of quarters and their vision for tomorrow.  It was a long day and there was probably more than I could absorb in the time we had together.

Comments?

 

Big data and eMedicine combine to improve healthcare

fix_me by ~! (cc) (from Flickr)
fix_me by ~! (cc) (from Flickr)

We have talked before ePathology and data growth, but Technology Review recently reported that researchers at Stanford University have used Electronic Medical Records (EMR) from multiple medical institutions to identify a new harmful drug interaction. Apparently, they found that when patients take Paxil (a depressant) and Pravachol (a cholresterol reducer) together, the drugs interact to raise blood sugar similar to what diabetics have.

Data analytics to the rescue

The researchers started out looking for new drug interactions which could result in conditions seen by diabetics. Their initial study showed a strong signal that taking both Paxil and Pravachol could be a problem.

Their study used FDA Adverse Event Reports (AERs) data that hospitals and medical care institutions record.  Originally, the researchers at Stanford’s Biomedical Informatics group used AERs available at Stanford University School of Medicine but found that although they had a clear signal that there could be a problem, they didn’t have sufficient data to statistically prove the combined drug interaction.

They then went out to Harvard Medical School and Vanderbilt University and asked that to access their AERs to add to their data.  With the combined data, the researchers were now able to clearly see and statistically prove the adverse interactions between the two drugs.

But how did they analyze the data?

I could find no information about what tools the biomedical informatics researchers used to analyze the set of AERs they amassed,  but it wouldn’t surprise me to find out that Hadoop played a part in this activity.  It would seem to be a natural fit to use Hadoop and MapReduce to aggregate the AERs together into a semi-structured data set and reduce this data set to extract the AERs which matched their interaction profile.

Then again, it’s entirely possible that they used a standard database analytics tool to do the work.  After all, we were only talking about a 100 to 200K records or so.

Nonetheless, the Technology Review article stated that some large hospitals and medical institutions using EMR are starting to have database analysts (maybe data scientists) on staff to mine their record data and electronic information to help improve healthcare.

Although EMR was originally envisioned as a way to keep better track of individual patients, when a single patient’s data is combined with 1000s more patients one creates something entirely different, something that can be mined to extract information.  Such a data repository can be used to ask questions about healthcare inconceivable before.

—-

Digitized medical imagery (X-Rays, MRIs, & CAT scans), E-pathology and now EMR are together giving rise to a new form of electronic medicine or E-Medicine.  With everything being digitized, securely accessed and amenable to big data analytics medical care as we know is about to undergo a paradigm shift.

Big data and eMedicine combined together are about to change healthcare for the better.

The sensor cloud comes home

We thought the advent of smart power meters would be the killer app for building the sensor cloud in the home.  But, this week Honeywell announced a new smart thermostat that attaches to the Internet and uses Opower’s cloud service to record and analyze home heating and cooling demand.  Looks to be an even better bet.

9/11 Memorial renderings, aerial view (c) 9/11 Memorial.org (from their website)
9/11 Memorial renderings, aerial view (c) 9/11 Memorial.org (from their website)

Just this past week, on a NPR NOVA telecast: Engineering Ground Zero on building the 9/11 memorial in NYC, it was mentioned that all the trees planted in the memorial had individual sensors to measure soil chemistry, dampness, and other tree health indicators. Yes, even trees are getting on the sensor cloud.

And of course the buildings going up at Ground Zero are all smart buildings as well, containing sensors embedded in the structure, the infrastructure, and anywhere else that matters.

But what does this mean in terms of data

Data requirements will explode as the smart home and other sensor clouds build out.  For example, even if a smart thermostat only issues a message every 15 minutes and the message is only 256 bytes, the data from the 130 million households in the US alone would be an additional ~3.2TB/day.  And that’s just one sensor per household.

If you add the smart power meter, lawn sensor, intrusion/fire/chemical sensor, and god forbid, the refrigerator and freezer product sensors to the mix that’s another another 16TB/day of incoming data.

And that’s just assuming a 256 byte payload per sensor every 15 minutes.  The intrusion sensors could easily be a combination of multiple, real time exterior video feeds as well as multi-point intrusion/motion/fire/chemical sensors which would generate much, much more data.

But we have smart roads/bridges, smart cars/trucks, smart skyscrapers, smart port facilities, smart railroads, smart boats/ferries, etc. to come.   I could go on but the list seems long enouch already.  Each of these could generate another ~19TB/day data stream, if not more.  Some of these infrastructure entities/devices are much more complex than a house and there are a lot more cars on the road than houses in the US.

It’s great to be in the (cloud) storage business

All that data has to be stored somewhere and that place is going to be the cloud.  The Honeywell smart thermostat uses Opower’s cloud storage and computing infrastructure specifically designed to support better power management for heating and cooling the home.  Following this approach, it’s certainly feasible that more cloud services would come online to support each of the smart entities discussed above.

Naturally, using this data to provide real time understanding of the infrastructure they monitor will require big data analytics. Hadoop, and it’s counterparts are the only platforms around today that are up to this task.

—-

So cloud computing, cloud storage, and big data analytics have yet another part to play. This time in the upcoming sensor cloud that will envelope the world and all of it’s infrastructure.

Welcome to the future, it’s almost here already.

Comments?

 

 

NewSQL and the curse of Old SQL database systems

database 2 by Tim Morgan (cc) (from Flickr)
database 2 by Tim Morgan (cc) (from Flickr)

There was some twitter traffic yesterday on how Facebook was locked into using MySQL (see article here) and as such, was having to shard their MySQL database across 1000s of database partitions and memcached servers in order to keep up with the processing load.

The article indicated  that this was painful, costly and time consuming. Also they said Facebook would be better served moving to something else. One answer was to replace MySQL with recently emerging, NewSQL database technology.

One problem with old SQL database systems is they were never architected to scale beyond a single server.  As such, multi-server transactional operations was always a short-term fix to the underlying system, not a design goal. Sharding emerged as one way to distribute the data across multiple RDBMS servers.

What’s sharding?

Relational database tables are sharded by partitioning them via a key.  By hashing this key one can partition a busy table across a number of servers and use the hash function to lookup where to process/access table data.   An alternative to hashing is to use a search lookup function to determine which server has the table data you need and process it there.

In any case, sharding causes a number of new problems. Namely,

  • Cross-shard joins – anytime you need data from more than one shard server you lose the advantages of distributing data across nodes. Thus, cross-shard joins need to be avoided to retain performance.
  • Load balancing shards – to spread workload you need to split the data by processing activity.  But, knowing ahead of time what the table processing will look like is hard and one weeks processing may vary considerably from the next weeks load. As such, it’s hard to load balance shard servers.
  • Non-consistent shards – by spreading transactions across multiple database servers and partitions, transactional consistency can no longer be guaranteed.  While for some applications this may not be a concern, traditional RDBMS activity is consistent.

These are just some of the issues with sharding and I am certain there are more.

What about Hadoop projects and its alternatives?

One possibility is to use Hadoop and its distributed database solutions.  However, Hadoop systems were not intended to be used for transaction processing. Nonetheless, Cassandra and HyperTable (see my post on Hadoop – Part 2) can be used for transaction processing and at least Casandra can be tailored to any consistency level. But both Cassandra and HyperTable are not really meant to support high throughput, consistent transaction processing.

Also, the other, non-Hadoop distributed database solutions support data analytics and most are not positioned as transaction processing systems (see Big Data – Part 3).  Although Teradata might be considered the lone exception here and can be a very capable transaction oriented database system in addition to its data warehouse operations. But it’s probably not widely distributed or scaleable above a certain threshold.

The problems with most of the Hadoop and non-Hadoop systems above mainly revolve around the lack of support for ACID transactions, i.e., atomic, consistent, isolated, and durable transaction processing. In fact, most of the above solutions relax one or more of these characteristics to provide a scaleable transaction processing model.

NewSQL to the rescue

There are some new emerging database systems that are designed from the ground up to operate in distributed environments called “NewSQL” databases.  Specifically,

  • Clustrix – is a MySQL compatible replacement, delivered as a hardware appliance that can be distributed across a number of nodes that retains fully ACID transaction compliance.
  • GenieDB – is a NoSQL and SQL based layered database that is consistent (atomic), available and partition tolerant (CAP) but not fully ACID compliant, offers a MySQL and popular content management systems plugins that allow MySQL and/or CMSs to execute using GenieDB clusters with minimal modification.
  • NimbusDB – is a client-cloud based SQL service which distributes copies of data across multiple nodes and offers a majority of SQL99 standard services.
  • VoltDB – is a fully SQL compatible, ACID compliant, distributed, in-memory database system offered as a software only solution executing on 64bit CentOS system but is compatible with any POSIX-compliant, 64bit Linux platform.
  • Xeround –  is a cloud based, MySQL compatible replacement delivered as a (Amazon, Rackspace and others) service offering that provides ACID compliant transaction processing across distributed nodes.

I might be missing some, but these seem to be the main ones today.  All the above seem to take a different tack to offer distributed SQL services.  Some of the above relax ACID compliance in order to offer distributed services. But for all of them distributed scale out performance is key and they all offer purpose built, distributed transactional relational database services.

—–

RDBMS technology has evolved over the last century and have had at least ~35 years of running major transactional systems. But todays hardware architecture together with web scale performance requirements stretch these systems beyond their original design envelope.  As such, NewSQL database systems have emerged to replace old SQL technology, with a new, intrinsically distributed system architecture providing high performing, scaleable transactional database services for today and the foreseeable future.

Comments?

Big data – part 3

Linkedin maps data visualization by luc legay (cc) (from Flickr)
Linkedin maps data visualization by luc legay (cc) (from Flickr)

I have renamed this series to “Big data” because it’s no longer just about Hadoop (see Hadoop – part 1 & Hadoop – part 2 posts).

To try to partition this space just a bit, there is unstructured data analysis and structured data analysis. Hadoop is used to analyze un-structured data (although Hadoop is used to parse and structure the data).

On the other hand, for structured data there are a number of other options currently available. Namely:

  • EMC Greenplum – a relational database that is available in a software only as well as now as a hardware appliance. Greenplum supports both row or column oriented data structuring and has support for policy based data placement across multiple storage tiers. There is a packaged solution that consists of Greenplum software and a Hadoop distribution running on a GreenPlum appliance.
  • HP Vertica – a column oriented, relational database that is available currently in a software only distribution. Vertica supports aggressive data compression and provides high throughput query performance. They were early supporters of Hadoop integration providing Hadoop MapReduce and Pig API connectors to provide Hadoop access to data in Vertica databases and job scheduling integration.
  • IBM Netezza – a relational database system that is based on proprietary hardware analysis engine configured in a blade system. Netezza is the second oldest solution on this list (see Teradata for the oldest). Since the acquisition by IBM, Netezza now provides their highest performing solution on IBM blade hardware but all of their systems depend on purpose built, FPGA chips designed to perform high speed queries across relational data. Netezza has a number of partners and/or homegrown solutions that provide specialized analysis for specific verticals such as retail, telcom, finserv, and others. Also, Netezza provides tight integration with various Oracle functionality but there doesn’t appear to be much direct integration with Hadoop on thier website.
  • ParAccel – a column based, relational database that is available in a software only solution. ParAccel offers a number of storage deployment options including an all in-memory database, DAS database or SSD database. In addition, ParAccel offers a Blended Scan approach providing a two tier database structure with DAS and SAN storage. There appears to be some integration with Hadoop indicating that data stored in HDFS and structured by MapReduce can be loaded and analyzed by ParAccel.
  • Teradata – a relational databases that is based on a proprietary purpose built appliance hardware. Teradata recently came out with an all SSD, solution which provides very high performance for database queries. The company was started in 1979 and has been very successful in retail, telcom and finserv verticals and offer a number of special purpose applications supporting data analysis for these and other verticals. There appears to be some integration with Hadoop but it’s not prominent on their website.

Probably missing a few other solutions but these appear to be the main ones at the moment.

In any case both Hadoop and most of it’s software-only, structured competition are based on a massively parrallelized/share nothing set of linux servers. The two hardware based solutions listed above (Teradata and Netezza) also operate in a massive parallel processing mode to load and analyze data. Such solutions provide scale-out performance at a reasonable cost to support very large databases (PB of data).

Now that EMC owns Greenplum and HP owns Vertica, we are likely to see more appliance based packaging options for both of these offerings. EMC has taken the lead here and have already announced Greenplum specific appliance packages.

—-

One lingering question about these solutions is why don’t customers use current traditional database systems (Oracle, DB2, Postgres, MySQL) to do this analysis. The answer seems to lie in the fact that these traditional solutions are not massively parallelized. Thus, doing this analysis on TB or PB of data would take a too long. Moreover, the cost to support data analysis with traditional database solutions over PB of data would be prohibitive. For these reasons and the fact that compute power has become so cheap nowadays, structured data analytics for large databases has migrated to these special purpose, massively parallelized solutions.

Comments?

Hadoop – part 2

Hadoop Graphic (c) 2011 Silverton Consulting
Hadoop Graphic (c) 2011 Silverton Consulting

(Sorry about the length).

In part 1 we discussed some of Hadoop’s core characteristics with respect to the Hadoop distributed file system (HDFS) and the MapReduce analytics engine. Now in part 2 we promised to discuss some of the other projects that have emerged to make Hadoop and specifically MapReduce even easier to use to analyze unstructured data.

Specifically, we have a set of tools which use Hadoop to construct a database like out of unstructured data.  Namely,

  • Casandra – which maps HDFS data into a database but into a columnar based sparse table structure rather than the more traditional relational database row form. Cassandra was written by Facebook for Mbox search. Columnar databases support a sparse data much more efficiently.  Data access is via a Thrift based API supporting many languages.  Casandra’s data model is based on column, column families and column super-families. The datum for any column item is a three value structure and consists of a name, value of item and a time stamp.  One nice thing about Cassandra is that one can tune it for any consistency model one requires, from no consistency to always consistent and points inbetween.  Also Casandra is optimized for writes.  Cassandra can be used as the Map portion of a MapReduce run.
  • Hbase – which also maps HDFS data into a database like structure and provides Java API access to this DB.  Hbase is useful for million row tables with arbitrary column counts. Apparently Hbase is an outgrowth of Google’s Bigtable which did much the same thing only against the Google file system (GFS).  In contrast to Hive below Hbase doesn’t run on top of MapReduce rather it replaces MapReduce, however it can be used as a source or target of MapReduce operations.  Also, Hbase is somewhat tuned for random access read operations and as such, can be used to support some transaction oriented applications.  Moreover, Hbase can run on HDFS or Amazon S3 infrastructure.
  • Hive – which maps a” simple SQL” (called QL) ontop of a data warehouse built on Hadoop.  Some of these queries may take a long time to execute and as the HDFS data is unstructured the map function must extract the data using a database like schema into something approximating a relational database. Hive operates ontop of Hadoop’s MapReduce function.
  • Hypertable – is a Google open source project which is a  c++ implementation of BigTable only using HDFS rather than GFS .  Actually Hypertable can use any distributed file systemand and is another columnar database (like Cassandra above) but only supports columns and column families.   Hypertable supports both a client (c++) and Thrift API.  Also Hypertable is written in c++ and is considered the most optimized of the Hadoop oriented databases (although there is some debate here).
  • Pig – is a dataflow processing (scripting) language built ontop of Hadoop which supports a sort of database interpreter for HDFS  in combination with an interpretive analysis.  Essentially, Pig uses the scripting language and emits a dataflow graph which is then used by MapReduce to analyze the data in HDFS.  Pig supports both batch and interactive execution but can also be used through a Java API.

Hadoop also supports special purpose tools used for very specialized analysis such as

  • Mahout – an Apache open source project which applies machine learning algorithms to HDFS data providing classification, characterization, and other feature extraction.  However, Mahout works on non-Hadoop clusters as well.  Mahout supports 4 techniques: recommendation mining, clustering, classification, and itemset machine learning functions.  While Mahout uses the MapReduce framework of Hadoop, it doesnot appear that Mahout uses Hadoop MapReduce directly but is rather a replacement for MapReduce focused on machine learning activities.
  • Hama – an Apache open source project which is used to perform paralleled matrix and graph computations against Hadoop cluster data.  The focus here is on scientific computation.  Hama also supports non-Hadoop frameworks including BSP and Dryad (DryadLINQ?). Hama operates ontop of MapReduce and can take advantage of Hbase data structures.

There are other tools that have sprung up around Hadoop to make it easier to configure, test and use, namely

  • Chukwa – which is used for monitoring large distributed clusters of servers.
  • ZooKeeper – which is a cluster configuration tool  and distributed serialization manager useful to build large clusters of Hadoop nodes.
  • MRunit – which is used to unit test MapReduce programs without having to test it on the whole cluster.
  • Whirr – which extends HDFS to use cloud storage services, unclear how well this would work with PBs of data to be processed but maybe it can colocate the data and the compute activities into the same cloud data center.

As for who uses these tools, Facebook uses Hive and Cassandra, Yahoo uses Pig, Google uses Hypertable and there are myriad users of the other projects as well.  In most cases the company identified in the previous list developed the program source code originally, and then contributed it to the Apache for use in the Hadoop open source project. In addition, those companies continue to fix, support and enhance these packages as well.

Hadoop – part 1

Hadoop Logo (from http://hadoop.apache.org website)
Hadoop Logo (from http://hadoop.apache.org website)

BIGData is creating quite a storm around IT these days and at the bottom of big data is an Apache open source project called Hadoop.

In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).

What is Hadoop and why is it important

Ok, lot’s of money, time and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?

Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most BIGData analytic activities because it bundle two sets of functionality most needed to deal with large unstructured datasets. Specifically,

  • Distributed file system – Hadoop’s Distributed File System (HDFS) operates using one (or two) meta-data servers (NameNode) with any number of data server nodes (DataNode).  These are essentially software packages running on server hardware which supports file system services using a client file access API. HDFS supports a WORM file system, that splits file data up into segments (64-128MB each), distributes segments to data-nodes, stored on local (or networked) storage and keeps track of everything.  HDFS provides a global name space for its file system, uses replication to insure data availability, and provides widely distributed access to file data.
  • MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (map function) which can then be later sorted, analysed and summarized (reduce function) into some worthy output.  MapReduce uses the HDFS to access file segments and to store reduced results.  MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes which execute the map and reduce processes requested. These activities execute on the same nodes that are used for the HDFS.

Hadoop explained

It just so happens that HDFS is optimized for sequential read access and can handle files that are 100TBs or larger. By throwing more data nodes at the problem, throughput can be scaled up enormously so that a 100TB file could literally be read in minutes to hours rather than weeks.

Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of  data onto these same DataNodes, thus distributing the processing of the data, reducing time to process this data to minutes.

HDFS can also be “rack aware” and as such, will try to allocate file segment replicas to different racks where possible.  In addition, replication can be defined on a file basis and normally uses 3 replicas of each file segment.

Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides.  In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.

MapReduce programming is somewhat new, unique, and complex and was an outgrowth of Google’s MapReduce process.  As such, there have been a number of other Apache open source projects that have sprouted up namely, Cassandra, Chukya, Hbase, HiveMahout, and Pig to name just a few that provide easier ways to automatically generate MapReduce programs.  I will try to provide more information on these and other popular projects in a follow on post.

Hadoop fault tolerance

When an HDFS node fails, the NameNode detects a missing heartbeat signal and once detected, the NameNode will recreate all the missing replicas that resided on the failed DataNode.

Similarly, the MapReduce JobTracker node can detect that a TaskTracker has failed and allocate this work to another node in the cluster.  In this fashion, work will complete even in the face of node failures.

Hadoop distributions, support and who uses them

Alas, as in any open source project having a distribution that can be trusted and supported can take much of the risk out of using them and Hadoop is no exception. Probably the most popular distribution comes from Cloudera which contains all the above named projects and more and provides support.  Recently EMC announced that they will supply their own distribution and support of Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters but use other distributions (mostly Cloudera).

As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo including EBay Facebook, Google and Twitter to highlight just a few popular ones. There is a list on the Apache’s Hadoop website which provides more detail if interested.  The list indicates some of the Hadoop configurations and shows anywhere from a 18 node cluster to over 4500 nodes with multiple PBs of data storage.  Most of the big players are also active participants in the various open source projects around Hadoop and much of the code came from these organizations.

—-

I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was.  This is my first attempt – hopefully, more to follow.

Comments?