Intel acquires InfiniBand fabric technology from Qlogic

InfiniBand interconnected Isilon packaging by ChrisDag (cc) (from Flickr)

Intel announced today that they are going to acquire the InfiniBand (IB) fabric technology business from Qlogic.

From many analysts’ perspective, IB is one of the few technologies out there that can efficiently interconnect a cluster of commodity servers into a supercomputing system.

What’s InfiniBand?

Recall that IB is one of the three reigning data center fabric technologies available today, alongside 10GbE and 16 Gb/s FC.  IB is currently available in DDR, QDR and FDR modes of operation, that is 5Gb/s, 10Gb/s or 14Gb/s per single lane, respectively (see the IB trade association (IBTA) technology update).  Systems can aggregate multiple IB lanes in units of 4 or 12 paths (see the wikipedia IB article), such that an IB QDRx4 link supports 40Gb/s and an IB FDRx4 link currently supports 56Gb/s.
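
For anyone who wants to check the lane math, here's a quick back-of-the-envelope sketch. The per-lane rates and lane groupings are the IBTA figures cited above; the rest is simple multiplication.

```python
# Aggregate IB bandwidth = per-lane rate x number of lanes.
per_lane_gbps = {"DDR": 5, "QDR": 10, "FDR": 14}

for mode, rate in per_lane_gbps.items():
    for lanes in (1, 4, 12):
        print(f"{mode} x{lanes}: {rate * lanes} Gb/s")

# QDR x4 -> 40 Gb/s and FDR x4 -> 56 Gb/s, matching the figures above.
```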

The IBTA pitch cited above shows that IB is the most widely used interconnect among the top supercomputing systems and the most power-efficient interconnect available (although how that’s calculated is not described).

Where else does IB make sense?

One thing IB has going for it is low latency, through the use of RDMA or remote direct memory access.  That same report says that an SSD directly connected over FC takes about 45 μsec to do a read, whereas the same SSD directly connected over IB using RDMA takes only about 26 μsec.

However, RDMA technology is now also coming out on 10GbE through RDMA over Converged Ethernet (RoCE, pronounced “rocky”).  But the IBTA claims that IB RDMA has a 0.6 μsec latency while RoCE comes in at 1.3 μsec.  Although at these speeds a 0.7 μsec difference doesn’t seem like a big thing, it more than doubles the latency.
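
Putting those two claimed numbers side by side (the figures are the IBTA's, the arithmetic is mine):

```python
# Interconnect latencies as claimed in the IBTA material cited above.
ib_rdma_usec = 0.6   # IB RDMA
roce_usec = 1.3      # RDMA over Converged Ethernet (RoCE)

print(f"difference: {roce_usec - ib_rdma_usec:.1f} usec")        # ~0.7 usec
print(f"ratio:      {roce_usec / ib_rdma_usec:.1f}x IB RDMA")    # ~2.2x
```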

Nonetheless, Intel’s purchase is an interesting play.  I know that Intel is focused on supporting an ExaFLOP HPC computing environment by 2018 (see their release).  But IB is already a pretty active technology in the HPC community and doesn’t seem to need their support.

In addition, IB has been gradually making inroads into enterprise data centers via storage products like the Oracle Exadata Storage Server, which uses 40 Gb/s IB QDRx4 interconnects.  There are a number of other storage products that use IB as well, from EMC Isilon, SGI, Voltaire, and others.

Of course, where IB can mostly be found today is in computer-to-computer interconnects, and just about every server vendor out there today, including Dell, HP, IBM, and Oracle, supports IB interconnects on at least some of their products.

Who’s left standing?

With Qlogic out, I guess this leaves Cisco (de-emphasized lately), Flextronics, Mellanox, and Intel as the only companies that supply IB switches. Mellanox, Intel (from Qlogic) and Voltaire supply the HCA (host channel adapter) cards which provide the server interface to the switched IB network.

It’s probably a logical choice for Intel to go after this technology, both to keep it moving forward and because they want to be seriously involved in the networking business.

IB use in Big Data?

On the other hand, Hadoop and other big data applications could conceivably make use of IB speeds, and as these are mainly vast clusters of commodity systems, IB would be a logical choice.

There is some interesting research on the advantages of IB in HDFS (Hadoop) environments out of Ohio State University (see Can high performance interconnects boost Hadoop distributed file system performance).  This research essentially says that Hadoop HDFS can perform much better when you combine IB running IPoIB (IP over IB, see the OpenFabrics Alliance article) with SSDs, but that SSDs alone do not provide as much benefit.   (Although my reading of the performance charts seems to indicate it’s not that much better than 10GbE with TOE?)

It’s possible other big data analytics engines are considering IB as well.  It would seem to be a logical choice if you had even more control over the software stack.

~~~~

Comments?


Hadoop – part 2

Hadoop Graphic (c) 2011 Silverton Consulting

(Sorry about the length).

In part 1 we discussed some of Hadoop’s core characteristics with respect to the Hadoop distributed file system (HDFS) and the MapReduce analytics engine. Now in part 2 we promised to discuss some of the other projects that have emerged to make Hadoop and specifically MapReduce even easier to use to analyze unstructured data.

Specifically, we have a set of tools which use Hadoop to construct database-like structures out of unstructured data.  Namely,

  • Cassandra – which maps HDFS data into a database, but into a columnar based sparse table structure rather than the more traditional relational database row form. Cassandra was written by Facebook for its inbox search. Columnar databases support sparse data much more efficiently.  Data access is via a Thrift based API supporting many languages.  Cassandra’s data model is based on columns, column families and column super-families. The datum for any column item is a three value structure consisting of a name, the value of the item and a time stamp (a minimal illustrative sketch of this data model appears after this list).  One nice thing about Cassandra is that one can tune it for any consistency model one requires, from no consistency to always consistent and points in between.  Also, Cassandra is optimized for writes.  Cassandra can be used as the Map portion of a MapReduce run.
  • Hbase – which also maps HDFS data into a database like structure and provides Java API access to this DB.  Hbase is useful for million-row tables with arbitrary column counts. Apparently Hbase is an outgrowth of Google’s Bigtable, which did much the same thing only against the Google file system (GFS).  In contrast to Hive below, Hbase doesn’t run on top of MapReduce; rather it replaces MapReduce, although it can be used as a source or target of MapReduce operations.  Also, Hbase is somewhat tuned for random access read operations and as such can be used to support some transaction oriented applications.  Moreover, Hbase can run on HDFS or Amazon S3 infrastructure.
  • Hive – which maps a “simple SQL” (called QL) on top of a data warehouse built on Hadoop.  Some of these queries may take a long time to execute, and as the HDFS data is unstructured, the map function must extract the data using a database-like schema into something approximating a relational database. Hive operates on top of Hadoop’s MapReduce function.
  • Hypertable – an open source, C++ implementation of Google’s BigTable, only using HDFS rather than GFS.  Actually, Hypertable can use any distributed file system.  It is another columnar database (like Cassandra above) but only supports columns and column families.  Hypertable supports both a native (C++) API and a Thrift API.  Being written in C++, Hypertable is considered the most optimized of the Hadoop oriented databases (although there is some debate here).
  • Pig – a dataflow processing (scripting) language built on top of Hadoop which supports a sort of database interpreter for HDFS combined with interpretive analysis.  Essentially, a Pig script is compiled into a dataflow graph which is then used by MapReduce to analyze the data in HDFS.  Pig supports both batch and interactive execution and can also be used through a Java API.
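
To make Cassandra’s column model a bit more concrete, here is a minimal, illustrative sketch using plain Python dictionaries rather than the actual Thrift API; the row key and column names ("user:1234", "email", and so on) are made up for the example.

```python
import time

# Each Cassandra column is a three-value structure: (name, value, timestamp).
def make_column(name, value):
    return {"name": name, "value": value, "timestamp": time.time()}

# Columns are grouped into a column family, keyed by row key -- a hypothetical
# "Users" column family with a single row is shown here.
users_column_family = {
    "user:1234": {
        "email": make_column("email", "jane@example.com"),
        "last_login": make_column("last_login", "2012-01-23"),
    }
}

# A write simply records a newer (name, value, timestamp) triple, which is one
# reason Cassandra is optimized for writes; a read returns the latest value.
print(users_column_family["user:1234"]["email"]["value"])
```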

Hadoop also supports special-purpose tools for more specialized analysis, such as

  • Mahout – an Apache open source project which applies machine learning algorithms to HDFS data, providing classification, characterization, and other feature extraction.  However, Mahout works on non-Hadoop clusters as well.  Mahout supports four techniques: recommendation mining, clustering, classification, and frequent itemset mining.  While Mahout uses the MapReduce framework of Hadoop, it does not appear that Mahout uses Hadoop MapReduce directly, but is rather a replacement for MapReduce focused on machine learning activities.
  • Hama – an Apache open source project which is used to perform parallel matrix and graph computations against Hadoop cluster data.  The focus here is on scientific computation.  Hama also supports non-Hadoop frameworks including BSP and Dryad (DryadLINQ?). Hama operates on top of MapReduce and can take advantage of Hbase data structures.

There are other tools that have sprung up around Hadoop to make it easier to configure, test and use, namely

  • Chukwa – which is used for monitoring large distributed clusters of servers.
  • ZooKeeper – which is a cluster configuration tool and distributed synchronization (coordination) manager useful for building large clusters of Hadoop nodes.
  • MRUnit – which is used to unit test MapReduce programs without having to test them on the whole cluster.
  • Whirr – which extends HDFS to use cloud storage services.  It’s unclear how well this would work with PBs of data to be processed, but maybe it can co-locate the data and the compute activities in the same cloud data center.

As for who uses these tools, Facebook uses Hive and Cassandra, Yahoo uses Pig, Google uses Hypertable, and there are myriad users of the other projects as well.  In most cases the company identified above developed the program source code originally and then contributed it to Apache for use in the Hadoop open source project. In addition, those companies continue to fix, support and enhance these packages.

Hadoop – part 1

Hadoop Logo (from http://hadoop.apache.org website)

BIGData is creating quite a storm around IT these days and at the bottom of big data is an Apache open source project called Hadoop.

In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).

What is Hadoop and why is it important

Ok, lots of money, time and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?

Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most BIGData analytic activities because it bundles two sets of functionality most needed to deal with large unstructured datasets. Specifically,

  • Distributed file system – Hadoop’s Distributed File System (HDFS) operates using one (or two) meta-data servers (NameNode) and any number of data server nodes (DataNode).  These are essentially software packages running on server hardware which together supply file system services through a client file access API. HDFS supports a WORM file system that splits file data up into segments (64-128MB each), distributes the segments to DataNodes to be stored on local (or networked) storage, and keeps track of everything.  HDFS provides a global name space for its file system, uses replication to ensure data availability, and provides widely distributed access to file data.
  • MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (map function) which can then be sorted, analyzed and summarized (reduce function) into some worthy output (a minimal sketch of the two steps appears after this list).  MapReduce uses HDFS to access file segments and to store reduced results.  MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes, which execute the map and reduce processes requested. These activities execute on the same nodes that are used for HDFS.
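
To give a feel for the two steps, here is a minimal word-count sketch in the spirit of Hadoop Streaming, where the map and reduce logic are simple scripts that read lines and emit key/value pairs. It is an illustration of the programming model only, not production Hadoop code, and the sample input line is made up.

```python
import sys
from itertools import groupby
from operator import itemgetter

# Map step: parse unstructured text into (key, value) pairs -- here (word, 1).
def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

# Reduce step: summarize all values seen for each key -- here a word count.
def reducer(pairs):
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Locally simulate what Hadoop does at scale: map each input line,
    # shuffle/sort by key, then reduce.  Under Hadoop Streaming the same
    # logic would read stdin and write tab-separated key/value lines,
    # with HDFS supplying the file segments to each TaskTracker.
    if sys.stdin.isatty():
        lines = ["the quick brown fox jumps over the lazy dog the end"]
    else:
        lines = sys.stdin.read().splitlines()
    for word, count in reducer(mapper(lines)):
        print(f"{word}\t{count}")
```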

Hadoop explained

It just so happens that HDFS is optimized for sequential read access and can handle files that are 100TB or larger. By throwing more DataNodes at the problem, throughput can be scaled up enormously, so that a 100TB file could be read in minutes to hours rather than weeks.
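
As a rough, hypothetical illustration of that claim (the 100 MB/s per-node sequential read rate is my assumption, not a figure from Hadoop):

```python
# Time to read a 100TB file at an assumed ~100 MB/s of sequential read
# throughput per DataNode, as the node count grows.
file_mb = 100 * 1024 * 1024       # 100TB expressed in MB
per_node_mb_per_sec = 100         # assumed per-node throughput

for nodes in (1, 100, 1000):
    hours = file_mb / (per_node_mb_per_sec * nodes) / 3600
    print(f"{nodes:5d} node(s): ~{hours:7.1f} hours")
# Roughly 12 days on one node versus well under an hour across 1,000 nodes.
```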

Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of data onto these same DataNodes, thus distributing the processing of the data and reducing the time to process it to minutes.

HDFS can also be “rack aware” and, as such, will try to allocate file segment replicas to different racks where possible.  In addition, replication can be defined on a per-file basis and normally uses 3 replicas of each file segment.

Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides.  In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.
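
A purely illustrative sketch of that locality idea follows; the segment names, node names, and scheduling policy below are made up to show the concept, not how the JobTracker is actually implemented.

```python
# Where each file segment's replicas live, and which nodes have a free task slot.
segment_locations = {"seg-a": {"node-1", "node-2"}, "seg-b": {"node-2", "node-3"}}
idle_nodes = {"node-1", "node-3"}

def pick_node_for(segment):
    # Prefer an idle node that already holds the segment locally; only fall
    # back to a remote read (any idle node) when no local copy is available.
    local = segment_locations[segment] & idle_nodes
    return next(iter(local)) if local else next(iter(idle_nodes))

print(pick_node_for("seg-a"))  # node-1 already holds seg-a
print(pick_node_for("seg-b"))  # node-3 already holds seg-b
```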

MapReduce programming is somewhat new, unique, and complex, and was an outgrowth of Google’s MapReduce process.  As such, a number of other Apache open source projects have sprouted up, namely Cassandra, Chukwa, Hbase, Hive, Mahout, and Pig, to name just a few, that provide easier ways to automatically generate MapReduce programs.  I will try to provide more information on these and other popular projects in a follow-on post.

Hadoop fault tolerance

When an HDFS DataNode fails, the NameNode detects a missing heartbeat signal, and once detected, the NameNode recreates all the missing replicas that resided on the failed DataNode.

Similarly, the MapReduce JobTracker node can detect that a TaskTracker has failed and allocate this work to another node in the cluster.  In this fashion, work will complete even in the face of node failures.
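
Here is a small, purely illustrative sketch of the heartbeat bookkeeping behind this; the timeout value and data structures are invented for the example and are not how the NameNode or JobTracker are actually implemented.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; an assumed value for illustration only

# Last heartbeat seen per DataNode, and which file-segment replicas each holds.
last_heartbeat = {"datanode-1": time.time(), "datanode-2": time.time() - 120}
replicas_on_node = {"datanode-1": ["seg-a", "seg-b"], "datanode-2": ["seg-a", "seg-c"]}

def failed_nodes(now):
    """Nodes whose heartbeat is older than the timeout are presumed dead."""
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

for node in failed_nodes(time.time()):
    # In HDFS the NameNode would schedule re-replication of these segments
    # onto surviving DataNodes so that the replica count is restored.
    print(f"{node} missed its heartbeat; re-replicate {replicas_on_node[node]}")
```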

Hadoop distributions, support and who uses them

Alas, as with any open source project, having a distribution that can be trusted and supported takes much of the risk out of using it, and Hadoop is no exception. Probably the most popular distribution comes from Cloudera, which contains all the above named projects and more, and provides support.  Recently EMC announced that they will supply their own distribution and support of Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters but use other distributions (mostly Cloudera).

As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo, including eBay, Facebook, Google and Twitter, to highlight just a few popular ones. There is a list on Apache’s Hadoop website which provides more detail if interested.  The list indicates some of the Hadoop configurations and shows anywhere from an 18-node cluster to over 4500 nodes with multiple PBs of data storage.  Most of the big players are also active participants in the various open source projects around Hadoop, and much of the code came from these organizations.

—-

I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was.  This is my first attempt – hopefully, more to follow.

Comments?