SCISPC110527-004 (c) 2011 Silverton Consulting, Inc., All Rights Reserved
The above chart is from our May Storage Intelligence newsletter dispatch on system performance and shows the latest Storage Performance Council SPC-1 benchmark results in a scatter plot with IO/sec [or IOPS(tm)] on the vertical axis and number of disk drives on the horizontal axis. We have tried to remove all results that used NAND flash as a cache or SSDs. Also, this displays only results below $100/GB.
One negative view of benchmarks such as SPC-1 is that published results are almost entirely due to the hardware thrown at them, or in this case, the number of disk drives (or SSDs) in the system configuration. An R**2 of 0.93 shows a pretty good correlation of IOPS performance against disk drive count and would seem to bear this view out, but that is an incorrect interpretation of the results.
Just look at the wide variation beyond the 500 disk drive count versus below it, where there are only a few outliers and a much narrower variance. As such, we would have to say that at some point (below 500 drives), most storage systems seem to attain a reasonable rate of IOPS as a function of the number of spindles present, but after that point the relationship starts to break down. There are certainly storage systems at the over-500-drive level that perform much better than average for their drive configuration and some that perform worse.
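For readers who want to reproduce this kind of analysis, here is a minimal sketch of fitting IOPS against drive count and computing the R**2; the (drive count, IOPS) pairs are made-up placeholders, not actual SPC-1 submissions.

```python
import numpy as np

# Hypothetical (disk drive count, SPC-1 IOPS) pairs -- placeholders only,
# not actual SPC-1 results.
drives = np.array([48, 96, 192, 384, 768, 1152, 2048])
iops = np.array([9_500, 21_000, 40_000, 85_000, 150_000, 300_000, 380_000])

# Least-squares linear fit of IOPS against drive count
slope, intercept = np.polyfit(drives, iops, 1)
predicted = slope * drives + intercept

# R**2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((iops - predicted) ** 2)
ss_tot = np.sum((iops - iops.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"IOPS ~= {slope:.1f} * drives + {intercept:.1f}, R**2 = {r_squared:.2f}")
```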
For example, consider the triangle formed by the three best performing (IOPS) results on this chart. The one at 300K IOPS with ~1150 disk drives is from Huawei Symantec and is their 8-node Oceanspace S8100 storage system, whereas the other system with similar IOPS performance at ~315K IOPS used ~2050 disk drives and is a 4-node IBM SVC (5.1) system with DS8700 backend storage. In contrast, the highest performer on this chart at ~380K IOPS also had ~2050 disk drives and is a 6-node IBM SVC (5.1) with DS8700 backend storage.
Given the above analysis there seems to be much more to system performance than merely disk drive count, at least at the over 500 disk count level.
—-
The full performance dispatch will be up on our website after the middle of next month but if you are interested in viewing this today, please sign up for our free monthly newsletter (see subscription request, above right) or subscribe by email and we’ll send you the current issue. If you need more analysis of SAN storage performance, please consider purchasing SCI’s SAN Storage Briefing.
As always, we welcome all constructive suggestions on how to improve any of our storage performance analyses.
Linkedin maps data visualization by luc legay (cc) (from Flickr)
I have renamed this series to “Big data” because it’s no longer just about Hadoop (see Hadoop – part 1 & Hadoop – part 2 posts).
To try to partition this space just a bit, there is unstructured data analysis and structured data analysis. Hadoop is used to analyze unstructured data (although in the process Hadoop parses and structures that data).
On the other hand, for structured data there are a number of other options currently available. Namely:
EMC Greenplum – a relational database that is available as software-only as well as, now, a hardware appliance. Greenplum supports both row- and column-oriented data structuring and has support for policy-based data placement across multiple storage tiers. There is a packaged solution that consists of Greenplum software and a Hadoop distribution running on a Greenplum appliance.
HP Vertica – a column-oriented relational database that is currently available in a software-only distribution. Vertica supports aggressive data compression and provides high-throughput query performance. They were early supporters of Hadoop integration, providing Hadoop MapReduce and Pig connectors for access to data in Vertica databases as well as job scheduling integration.
IBM Netezza – a relational database system based on a proprietary hardware analysis engine configured in a blade system. Netezza is the second oldest solution on this list (see Teradata for the oldest). Since the acquisition by IBM, Netezza now provides their highest performing solution on IBM blade hardware, but all of their systems depend on purpose-built FPGA chips designed to perform high-speed queries across relational data. Netezza has a number of partners and/or homegrown solutions that provide specialized analysis for specific verticals such as retail, telecom, finserv, and others. Also, Netezza provides tight integration with various Oracle functionality, but there doesn’t appear to be much direct integration with Hadoop on their website.
ParAccel – a column-based relational database available as a software-only solution. ParAccel offers a number of storage deployment options including an all in-memory database, a DAS database or an SSD database. In addition, ParAccel offers a Blended Scan approach providing a two-tier database structure with DAS and SAN storage. There appears to be some integration with Hadoop, indicating that data stored in HDFS and structured by MapReduce can be loaded and analyzed by ParAccel.
Teradata – a relational database based on proprietary, purpose-built appliance hardware. Teradata recently came out with an all-SSD solution which provides very high performance for database queries. The company was started in 1979, has been very successful in retail, telecom and finserv verticals, and offers a number of special-purpose applications supporting data analysis for these and other verticals. There appears to be some integration with Hadoop but it’s not prominent on their website.
Probably missing a few other solutions but these appear to be the main ones at the moment.
In any case, both Hadoop and most of its software-only, structured competition are based on a massively parallelized, shared-nothing set of Linux servers. The two hardware-based solutions listed above (Teradata and Netezza) also operate in a massively parallel processing mode to load and analyze data. Such solutions provide scale-out performance at a reasonable cost to support very large databases (PBs of data).
Now that EMC owns Greenplum and HP owns Vertica, we are likely to see more appliance-based packaging options for both of these offerings. EMC has taken the lead here and has already announced Greenplum-specific appliance packages.
—-
One lingering question about these solutions is why customers don’t use current traditional database systems (Oracle, DB2, Postgres, MySQL) to do this analysis. The answer seems to lie in the fact that these traditional solutions are not massively parallelized. Thus, doing this analysis on TB or PB of data would take too long. Moreover, the cost to support data analysis with traditional database solutions over PB of data would be prohibitive. For these reasons, and the fact that compute power has become so cheap nowadays, structured data analytics for large databases has migrated to these special-purpose, massively parallelized solutions.
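As a rough illustration of the shared-nothing idea, the sketch below fans a scan out over partitions of an invented table and combines the partial results at the end; real MPP databases distribute the partitions across separate servers rather than local processes.

```python
from concurrent.futures import ProcessPoolExecutor

def scan_partition(rows):
    """Each 'node' aggregates only its own partition of the table."""
    return sum(r["sales"] for r in rows if r["region"] == "west")

if __name__ == "__main__":
    # Invented table: 1M rows split into 8 shared-nothing partitions.
    data = [{"region": "west" if i % 3 else "east", "sales": i}
            for i in range(1_000_000)]
    partitions = [data[i::8] for i in range(8)]   # stand-in for hash partitioning

    with ProcessPoolExecutor(max_workers=8) as pool:
        total = sum(pool.map(scan_partition, partitions))  # combine partial results
    print(total)
```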
SolidFire SF3010 node (c) 2011 SolidFire (from their website)
I was talking with a local start-up called SolidFire the other day with an interesting twist on SSD storage. They are targeting cloud service providers with a scale-out, cluster-based SSD iSCSI storage system. Apparently a portion of their team had come from LeftHand (now owned by HP), another local storage company, and the rest came from Rackspace, a national cloud service provider.
The hardware
Their storage system is a scale-out cluster of storage nodes that can range from 3 to a theoretical maximum of 100 nodes (validated node count?). Each node comes equipped with two 2.4GHz, 6-core Intel processors and ten 300GB SSDs for a total of 3TB of raw storage per node. Also, each node has 8GB of non-volatile DRAM for write buffering and 72GB of read cache.
The system also uses two 10GbE links for host-to-storage IO and inter-cluster communications and supports iSCSI LUNs. There are another two 1GbE links used for management communications.
SolidFire states that they can sustain 50K IO/sec per node. (This looks conservative from my viewpoint, but they didn’t state any specific R:W ratio or block size for this performance number.)
The software
They are targeting cloud service providers, and as such the management interface was designed from the start as a RESTful API, but they also have a web GUI built on top of that API. Cloud service providers will automate whatever they can, and having a RESTful API seems like the right choice.
QoS and data reliability
The cluster supports 100K iSCSI LUNs and each LUN can have a QoS SLA associated with it. According to SolidFire one can specify a minimum/maximum/burst level for IOPS and a maximum or burst level for throughput at a LUN granularity.
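To give a feel for what driving such a system through a RESTful API might look like, here is a hypothetical sketch of setting per-LUN QoS; the endpoint, field names and credentials are illustrative assumptions, not SolidFire's actual API.

```python
import requests

# Hypothetical REST call illustrating per-LUN QoS settings as described above;
# endpoint, field names and credentials are assumptions, not SolidFire's API.
API = "https://cluster.example.com/api/v1"

qos = {
    "minIOPS": 1_000,        # guaranteed floor
    "maxIOPS": 5_000,        # sustained ceiling
    "burstIOPS": 10_000,     # short-term burst allowance
    "maxThroughputMBps": 200,
}

resp = requests.post(f"{API}/volumes/1234/qos", json=qos,
                     auth=("admin", "password"), timeout=10)
resp.raise_for_status()
print(resp.json())
```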
With LUN-based QoS, one can divide cluster performance into many levels of support for multiple customers of a cloud provider. Given these unique QoS capabilities, it should be relatively easy for cloud providers to support multiple customers on the same storage, providing very fine-grained multi-tenancy capabilities.
This could potentially lead to system over-commitment, but presumably they have some way to detect when over-commitment is near and to prevent it from occurring.
Data reliability is supplied through replication across nodes, which they call Helix(tm) data protection. In this way, if an SSD or node fails, it’s relatively easy to reconstruct the lost data onto another node’s SSD storage, which is probably why the minimum number of nodes per cluster is set at 3.
Storage efficiency
Aside from the QoS capabilities, the other interesting twist from a customer perspective is that they are trying to price an all-SSD storage solution at the $/GB of normal enterprise disk storage. They believe their node with 3TB raw SSD storage supports 12TB of “effective” data storage.
They are able to do this by offering the storage efficiency features of enterprise storage in an all-SSD configuration. Specifically, they provide:
Thin provisioned storage – which allows physical storage to be over-subscribed and shared across multiple LUNs when space hasn’t been completely written.
Data compression – which searches for underlying redundancy in a chunk of data and compresses it out of the storage.
Data deduplication – which searches multiple blocks and multiple LUNs to see what data is duplicated and eliminates duplicative data across blocks and LUNs.
Space efficient snapshot and cloning – which allow users to take point-in-time copies that consume little space, useful for backups and test-dev requirements.
Tape data compression gets anywhere from 2:1 to 3:1 reduction in storage space for typical data loads. Whether SolidFire’s system can reach these numbers is another question. However, tape uses hardware compression, and the traditional problem with software data compression is that it takes lots of processing power and/or time to perform it well. As such, SolidFire has configured their node hardware to dedicate roughly a CPU core to each physical drive (two 6-core processors for 10 SSDs in a node).
Deduplication savings are somewhat trickier to predict but ultimately depend on the data being stored in a system and the algorithm used to deduplicate it. For user home directories, typical deduplication levels of 25-40% are readily attainable. SolidFire stated that their deduplication algorithm is their own patented design and uses a small fixed block approach.
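SolidFire's patented algorithm isn't public, but a minimal sketch of small fixed-block deduplication, assuming a 4KB block size and SHA-256 fingerprints, shows the general idea.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed small fixed block; the real block size isn't public

def dedupe(data: bytes):
    """Split data into fixed-size blocks and store only the unique ones."""
    store = {}    # fingerprint -> block contents
    recipe = []   # ordered fingerprints needed to rebuild the original data
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)
        recipe.append(fp)
    return store, recipe

sample = b"A" * 65536 + b"B" * 65536 + b"A" * 65536   # highly redundant data
store, recipe = dedupe(sample)
print(f"logical blocks: {len(recipe)}, unique blocks stored: {len(store)}")
print(f"dedup ratio: {len(recipe) / len(store):.0f}:1")
```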
The savings from thin provisioning ultimately depend on how much physical data is actually consumed on a storage LUN, but in typical environments one can save 10-30% of physical storage by pooling non-written or free storage across all the LUNs configured on a storage system.
Space savings from point-in-time copies like snapshots and clones depend on data change rates and how long it’s been since a copy was made. But with space-efficient copies and a short period of existence (used for backups or temporary copies in test-development environments), such copies should take little physical storage.
Whether all of this can create a 4:1 multiplier for raw to effective data storage is another question, but they also have an eScanner tool which can estimate the savings one can achieve in one’s own data center. Apparently the eScanner can be used by anyone to scan real customer LUNs, and it will compute how much SolidFire storage would be required to support the scanned volumes.
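To see how a roughly 4:1 raw-to-effective multiplier might come together, here is a back-of-the-envelope sketch; the individual savings factors are assumptions drawn from the ranges discussed above, not SolidFire-supplied figures. Nudging any one factor a little higher reaches the 12TB effective figure they quote.

```python
# Back-of-the-envelope raw-to-effective capacity estimate. The savings factors
# below are assumptions drawn from the ranges discussed above, not SolidFire's.
raw_tb = 3.0                  # raw SSD capacity per node

compression_ratio = 2.0       # 2:1, low end of the 2:1-3:1 range
dedup_savings = 0.30          # ~30% of blocks eliminated as duplicates
thin_prov_savings = 0.20      # ~20% of allocated-but-unwritten space reclaimed

effective_tb = raw_tb * compression_ratio
effective_tb /= (1 - dedup_savings)        # dedup stretches remaining capacity
effective_tb /= (1 - thin_prov_savings)    # so does thin provisioning

print(f"effective capacity ~= {effective_tb:.1f} TB "
      f"({effective_tb / raw_tb:.1f}:1 multiplier)")
```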
—–
There are a few items left on their current road map to be delivered later, namely remote replication or mirroring. But for now this looks to be a pretty complete package of iSCSI storage functionality.
SolidFire is currently signing up customers for Early Access but plans to go GA sometime around the end of the year. No pricing was disclosed at this time.
I was at SNIA’s BoD meeting the other week and stated my belief that SSDs will ultimately lead to the commoditization of storage. By that I meant that it would be relatively easy to configure enough SSD hardware to create a 100K IO/sec or 1GB/sec system without having to manage 1000 disk drives. Lo and behold, SolidFire comes out the next week. Of course, I said this would happen over the next decade – so I am off by 9.99 years…
I was talking the other day with another analyst, John Koller of Kai Consulting, who specializes in the medical space, and he was talking about the rise of electronic pathology (e-pathology). I hadn’t heard about this one.
He said that just like radiology had done in the recent past, pathology investigations are moving to make use of digital formats.
What does that mean?
The biopsies taken today for cancer and disease diagnosis, which involve one or more specimens of tissue examined under a microscope, will now be digitized, and the digital files will be inspected instead of the original slides.
Apparently microscopic examinations typically use a 1×3 inch slide that can have the whole slide devoted to some tissue matter. To be able to do a pathological examination, one has to digitize the whole slide, under magnification, at various depths within the tissue. According to Koller, any tissue is essentially a 3D structure and pathological exams must inspect different depths (slices) within this sample to form their diagnosis.
I was struck by the need for different slices of the same specimen. I hadn’t anticipated that but whenever I look in a microscope, I am always adjusting the focal length, showing different depths within the slide. So it makes sense, if you want to understand the pathology of a tissue sample, multiple views (or slices) at different depths are a necessity.
So what does a slide take in storage capacity?
Koller said an uncompressed, full slide will take about 300GB of space. However, with compression and the fact that most often the slide is not completely used, a more typical space consumption would be on the order of 3 to 5GB per specimen.
As for volume, Koller indicated that a medium hospital facility (~300 beds) typically does around 30K radiological studies a year but does about 10X that in pathological studies. So at 300K pathological examinations done a year and 3 to 5GB per specimen, we are talking about something like 0.9 to 1.5PB of digitized specimen images a year for a mid-sized hospital.
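A quick sanity check of those volumes, using the per-specimen sizes quoted above:

```python
# Rough annual storage estimate for digitized pathology at a mid-sized
# (~300 bed) hospital, using the figures quoted above.
radiology_studies_per_year = 30_000
pathology_multiplier = 10                 # ~10X the radiology volume
gb_per_specimen = (3, 5)                  # compressed, partially used slides

exams = radiology_studies_per_year * pathology_multiplier
low_tb = exams * gb_per_specimen[0] / 1000
high_tb = exams * gb_per_specimen[1] / 1000
print(f"{exams:,} exams/year -> {low_tb:,.0f} to {high_tb:,.0f} TB per year")
```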
Why move to e-pathology?
It can open up a myriad of telemedicine offerings similar to the radiological study services currently available around the globe. Today, non-electronic pathology involves sending specimens off to a local lab for examination by medical technicians under a microscope. But with e-pathology, the specimen gets digitized (where: the hospital, the lab?) and then the digital files can be sent anywhere around the world, wherever someone is qualified and available to scrutinize them.
—–
At a recent analyst event we were discussing big data and, aside from the analytics component and other markets, the vendor mentioned that content archives are starting to explode. Given where e-pathology is heading, I can understand why.
In part 1 we discussed some of Hadoop’s core characteristics with respect to the Hadoop distributed file system (HDFS) and the MapReduce analytics engine. Now in part 2 we promised to discuss some of the other projects that have emerged to make Hadoop and specifically MapReduce even easier to use to analyze unstructured data.
Specifically, we have a set of tools which use Hadoop to construct database-like structures out of unstructured data. Namely:
Cassandra – which maps HDFS data into a database, but into a columnar-based, sparse table structure rather than the more traditional relational database row form. Cassandra was written by Facebook for inbox search. Columnar databases support sparse data much more efficiently. Data access is via a Thrift-based API supporting many languages. Cassandra’s data model is based on columns, column families and super column families (a toy sketch of this data model follows after this list). The datum for any column item is a three-value structure consisting of a name, the item’s value and a timestamp. One nice thing about Cassandra is that one can tune it for any consistency model one requires, from no consistency to always consistent and points in between. Also, Cassandra is optimized for writes. Cassandra can be used as the Map portion of a MapReduce run.
HBase – which also maps HDFS data into a database-like structure and provides Java API access to this DB. HBase is useful for million-row tables with arbitrary column counts. Apparently HBase is an outgrowth of Google’s Bigtable, which did much the same thing only against the Google file system (GFS). In contrast to Hive (below), HBase doesn’t run on top of MapReduce, rather it replaces it; however, it can be used as a source or target of MapReduce operations. Also, HBase is somewhat tuned for random access read operations and as such can be used to support some transaction-oriented applications. Moreover, HBase can run on HDFS or Amazon S3 infrastructure.
Hive – which maps a “simple SQL” (called QL) on top of a data warehouse built on Hadoop. Some of these queries may take a long time to execute and, as the HDFS data is unstructured, the map function must extract the data using a database-like schema into something approximating a relational database. Hive operates on top of Hadoop’s MapReduce function.
Hypertable – an open source project which is a C++ implementation of Google’s Bigtable, only using HDFS rather than GFS. Actually Hypertable can use any distributed file system and is another columnar database (like Cassandra above) but only supports columns and column families. Hypertable supports both a client (C++) and Thrift API. Also, Hypertable is written in C++ and is considered the most optimized of the Hadoop-oriented databases (although there is some debate here).
Pig – a dataflow processing (scripting) language built on top of Hadoop which supports a sort of database interpreter for HDFS in combination with an interpretive analysis. Essentially, Pig uses the scripting language and emits a dataflow graph which is then used by MapReduce to analyze the data in HDFS. Pig supports both batch and interactive execution but can also be used through a Java API.
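To make Cassandra's column and column-family model (mentioned above) a little more concrete, here is a toy sketch of the three-value column structure; it mimics the shape of the data model only and is not the actual Cassandra (Thrift) API.

```python
import time

# Toy illustration of the column / column-family model described above;
# this mimics the structure only and is not the actual Cassandra API.
def column(name, value):
    """A column is a (name, value, timestamp) triple."""
    return {"name": name, "value": value, "timestamp": time.time()}

# A column family maps a row key to a sparse set of columns --
# different rows may carry completely different columns.
users = {
    "user:1001": {"name": column("name", "alice"),
                  "email": column("email", "alice@example.com")},
    "user:1002": {"name": column("name", "bob"),
                  "city": column("city", "Denver")},   # no email column at all
}

print(users["user:1001"]["email"]["value"])
```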
Hadoop also supports special purpose tools used for very specialized analysis such as
Mahout – an Apache open source project which applies machine learning algorithms to HDFS data providing classification, characterization, and other feature extraction. However, Mahout works on non-Hadoop clusters as well. Mahout supports 4 techniques: recommendation mining, clustering, classification, and frequent itemset mining. While Mahout uses the MapReduce framework of Hadoop, it does not appear that Mahout uses Hadoop MapReduce directly but is rather a replacement for MapReduce focused on machine learning activities.
Hama – an Apache open source project which is used to perform parallelized matrix and graph computations against Hadoop cluster data. The focus here is on scientific computation. Hama also supports non-Hadoop frameworks including BSP and Dryad (DryadLINQ?). Hama operates on top of MapReduce and can take advantage of HBase data structures.
There are other tools that have sprung up around Hadoop to make it easier to configure, test and use, namely
Chukwa – which is used for monitoring large distributed clusters of servers.
ZooKeeper – which is a cluster configuration tool and distributed serialization manager useful to build large clusters of Hadoop nodes.
MRUnit – which is used to unit test MapReduce programs without having to test them on the whole cluster.
Whirr – which supplies libraries for running Hadoop clusters on cloud infrastructure services; it’s unclear how well this would work with PBs of data to be processed, but maybe it can co-locate the data and the compute activities in the same cloud data center.
As for who uses these tools, Facebook uses Hive and Cassandra, Yahoo uses Pig, Google uses Hypertable, and there are myriad users of the other projects as well. In most cases the company identified in the previous list developed the program source code originally and then contributed it to Apache for use in the Hadoop open source project. In addition, those companies continue to fix, support and enhance these packages as well.
BIGData is creating quite a storm around IT these days and at the bottom of big data is an Apache open source project called Hadoop.
In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).
What is Hadoop and why is it important
OK, lots of money, time and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?
Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most BIGData analytic activities because it bundles two sets of functionality most needed to deal with large unstructured datasets. Specifically,
Distributed file system – Hadoop’s Distributed File System (HDFS) operates using one (or two) meta-data servers (NameNode) with any number of data server nodes (DataNode). These are essentially software packages running on server hardware which support file system services using a client file access API. HDFS is a WORM file system that splits file data up into segments (64-128MB each), distributes the segments to DataNodes where they are stored on local (or networked) storage, and keeps track of everything. HDFS provides a global namespace for its file system, uses replication to insure data availability, and provides widely distributed access to file data.
MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (map function) which can then be sorted, analyzed and summarized (reduce function) into some worthy output (see the toy sketch below). MapReduce uses HDFS to access file segments and to store reduced results. MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes, which execute the map and reduce processes requested. These activities execute on the same nodes that are used for the HDFS.
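As a toy example of the map and reduce steps, the sketch below counts words in a few lines of text. It runs standalone here; under Hadoop Streaming the map and reduce functions would be separate scripts reading stdin and writing stdout, with HDFS and the JobTracker handling data access and distribution.

```python
from itertools import groupby

def map_step(lines):
    """Map: parse unstructured text into (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_step(pairs):
    """Reduce: sum the counts for each word (input must be sorted by word)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]
shuffled = sorted(map_step(lines))          # stand-in for Hadoop's sort/shuffle
for word, count in reduce_step(shuffled):
    print(word, count)
```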
Hadoop explained
It just so happens that HDFS is optimized for sequential read access and can handle files that are 100TBs or larger. By throwing more data nodes at the problem, throughput can be scaled up enormously so that a 100TB file could literally be read in minutes to hours rather than weeks.
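Some rough arithmetic shows why; the per-node sequential read rate below is an assumed figure, not an HDFS specification.

```python
# Time to read a 100TB file as a function of how many DataNodes stream it
# in parallel. The 100MB/s per-node rate is an assumption for illustration.
file_tb = 100
node_throughput_mb_s = 100

file_mb = file_tb * 1_000_000
for nodes in (1, 100, 1000):
    hours = file_mb / (node_throughput_mb_s * nodes) / 3600
    print(f"{nodes:>5} node(s): {hours:8.1f} hours")
```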
Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of data onto these same DataNodes, thus distributing the processing of the data, reducing time to process this data to minutes.
HDFS can also be “rack aware” and as such, will try to allocate file segment replicas to different racks where possible. In addition, replication can be defined on a file basis and normally uses 3 replicas of each file segment.
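A much simplified sketch of rack-aware placement for the default 3 replicas follows: one copy on the writer's rack and the other two on different nodes of a single remote rack. HDFS's real placement policy has additional rules, so treat this as illustrative only.

```python
import random

# Simplified rack-aware placement for 3 replicas of one file segment.
def place_replicas(nodes_by_rack, local_rack, replicas=3):
    placement = [random.choice(nodes_by_rack[local_rack])]        # 1st: local rack
    remote_rack = random.choice([r for r in nodes_by_rack if r != local_rack])
    placement += random.sample(nodes_by_rack[remote_rack], replicas - 1)  # 2nd/3rd
    return placement

cluster = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}
print(place_replicas(cluster, local_rack="rack1"))
```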
Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides. In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.
MapReduce programming is somewhat new, unique, and complex and was an outgrowth of Google’s MapReduce process. As such, there have been a number of other Apache open source projects that have sprouted up, namely Cassandra, Chukwa, HBase, Hive, Mahout, and Pig to name just a few, that provide easier ways to automatically generate MapReduce programs. I will try to provide more information on these and other popular projects in a follow on post.
Hadoop fault tolerance
When an HDFS node fails, the NameNode detects a missing heartbeat signal and once detected, the NameNode will recreate all the missing replicas that resided on the failed DataNode.
Similarly, the MapReduce JobTracker node can detect that a TaskTracker has failed and allocate this work to another node in the cluster. In this fashion, work will complete even in the face of node failures.
Hadoop distributions, support and who uses them
As in any open source project, having a distribution that can be trusted and supported takes much of the risk out of using it, and Hadoop is no exception. Probably the most popular distribution comes from Cloudera, which contains all the above named projects and more and provides support. Recently EMC announced that they will supply their own distribution and support of Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters but use other distributions (mostly Cloudera).
As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo, including eBay, Facebook, Google and Twitter to highlight just a few popular ones. There is a list on Apache’s Hadoop website which provides more detail if interested. The list indicates some of the Hadoop configurations and shows anywhere from an 18-node cluster to over 4500 nodes with multiple PBs of data storage. Most of the big players are also active participants in the various open source projects around Hadoop, and much of the code came from these organizations.
—-
I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was. This is my first attempt – hopefully, more to follow.