Archeology meets Big Data

Polynya off the Antarctic Coast by NASA Earth Observatory (cc) (From Flickr)
Polynya off the Antarctic Coast by NASA Earth Observatory (cc) (From Flickr)

Read an article yesterday about the use of LIDAR (light detection and ranging, Wikipedia) to map the residues of an pre-columbian civilization in Central America, the little know Purepecha empire, peers of the Aztecs.

The original study (seeLIDAR at Angamuco) cited in the piece above was a result of the Legacies of Resilience project sponsored by Colorado State University (CSU) and goes into some detail about the data processing and archeological use of the LIDAR maps.


LIDAR sends a laser pulse from an airplane/satellite to the ground and measures how long it takes to reflect back to the receiver. With that information and “some” data processing, these measurements can be converted to an X, Y, & Z coordinate system or detailed map of the ground.

The archeologists in the study used LIDAR to create a detailed map of the empire’s main city at a resolution of +/- 0.25m (~10in). They mapped about ~207 square kilometers (80 square miles) at this level of detail. In 4 days of airplane LIDAR mapping, they were able to gather more information about the area then they were able to accumulate over 25 years of field work. Seems like digital archeology was just born.

So how much data?

I wanted to find out just how much data this was but neither the article or the study told me anything about the size of the LIDAR map. However, assuming this is a flat area, which it wasn’t, and assuming the +/-.25m resolution represents a point every 625sqcm, then the area being mapped above should represent a minimum of ~3.3 billion points of a LIDAR point cloud.

Another paper I found (see Evaluation of MapReduce for Gridding LIDAR Data) said that a LIDAR “grid point” (containing X, Y & Z coordinates) takes 52 bytes of data.

Given the above I estimate the 207sqkm LIDAR grid point cloud represents a minimum of ~172GB of data. There are LIDAR compression tools available, but even at 50% reduction, it’s still 85GB for 210sqkm.

My understanding is that the raw LIDAR data would be even bigger than this and the study applied a number of filters against the LIDAR map data to extract different types of features which of course would take even more space. And that’s just one ancient city complex.

With all the above the size of LIDAR raw data, grid point fields, and multiple filtered views is approaching significance (in storage terms). Moving and processing all this data must also be a problem. As evidence, the flights for the LIDAR runs over Angamuco, Mexico occurred in January 2011 and they were able to analyze the data sometime that summer, ~6 months late. Seems a bit long from my perspective maybe the data processing/analysis could use some help.

Indiana Jones meets Hadoop

That was the main subject of the second paper mentioned above done by researchers at the San Diego Supercomputing Center (SDSC). They essentially did a benchmark comparing MapReduce/Hadoop running on a relatively small cluster of 4 to 8 commodity nodes against an HPC cluster (running 28-Sun x4600M2 servers, using 8 processor, quad core nodes, with anywhere from 256 GB to 512GB [only on 8 nodes] of DRAM running a C++ implementation of the algorithm.

The results of their benchmarks were that the HPC cluster beat the Hadoop cluster only when all of the LIDAR data could fit in memory (on a DRAM per core basis), after that the Hadoop cluster performed just as well in elapsed wall clock time. Of course from a cost perspective the Hadoop cluster was much more economical.

The 8-node, Hadoop cluster was able to “grid” a 150M LIDAR derived point cloud at the 0.25m resolution in just a bit over 10 minutes. Now this processing step is just one of the many steps in LIDAR data analysis but it’s probably indicative of similar activity occurring earlier and later down the (data) line.


Let’s see 172GB per 207sqkm, the earth surface is 510Msqkm, says a similar resolution LIDAR grid point cloud of the entire earth’s surface would be about 0.5EB (Exabyte, 10**18 bytes). It’s just great to be in the storage business.


A day and a half with HP Storage

A photo of bloggers and HP personnel waiting to go on the lab tour
Bloggers and HP people waiting to tour lab

[long post 945 wds] HP held their (annual?) HP Tech Days in Fort Collins, Colorado this last week. We had presentations from a number of HP product managers and got to meet a number of new and old bloggers there.

In attendance from the blogosphere were: Alastair Cooke (@DemitasseNZ), Brian Knudtson (@bknudtson), Howard Marks (@DeepStorageNet), John Obeto (@JohnObeto), Jeff Powers (@Geekazine), Rich Schandler (@recklessop), Derek Schauland (@webjunkie), Justin Vashisht (@3cVGuy), and Matt Vogt (@MattVogt).

Craig Nunes VP of Marketing, HP Storage got up and led off the day’s discussion talking about recent results. HP disk storage is up 11% for the quarter, 3par is growing by triple digit growth (QoQ maybe YoY?) and channel sales are growing by 10%.  HP storage is gaining market share, grew 3% for the quarter.  Also, HP is #2 is shipped backup appliances (1H11).  The current focus for HP storage is in three areas:

  • Invest in established platforms, MSA and EVA (with a 100K customers)
  • Invest in converged storage aimed at new data centers, 3PAR, Lefthand, IBRIX and StoreOnce.
  • Invest in converged systems knocking down barriers between servers, storage and networking with Virtual Systems.

Craig spent most of his time talking about converged storage. HP’s converged storage includes:

  • built in autonomic storage automating operations with one pain of glass and an orchestration layer on top to oversee everything.
  • scale out storage providing simpler ways to grow storage.
  • built on standardized platforms using off the shelf server platform technology

Craig ended up discussing HP’s Virtual System, their response to VCE’s Vblock, NetApp’s FlexPod and Dell’s vStart Bundle.   HP’s Virtual System was announced earlier last year and has been doing well in the market.

Brad Katz, Product Manager got up next and talked about Lefthand storage solutions.  Lefthand’s portfolio now ranges from the Virtual Storage Appliance (VSA) all the way up to a P4800 SAN storage blade with P4300 and P4500 rackmountable storage systems between those two.   Lefthand systems provide a clustered, scale-out IP/SAN and NAS storage.   Cluster data is striped across all disks in all storage nodes.

The VSA runs as a virtual machine and utilizes any ESX  (direct or SAN attached) storage.  The P4800 operates as a storage blade in an HP blade server and uses storage in the blade system.  The two rackmount systems P4300 and P4500 connect to SAS attached, external disk shelves.

HP's Steve Johnson, at the front of the room discussing slide on StoreOnce
Steve Johnson on StoreOnce

Steve Johnson and Mat Jacoby talked next about the StoreOnce deduplicating backup appliance product line.  StoreOnce is an HP R&D Labs home grown, deduplication technology which provides balanced ingest-restore rates and memory efficient deduplication.  The current product line spans D2D25xx, D2D41xx, D2D43xx and the recently announced, B6200 backup storage blade.

StoreOnce use a variable block, 4K chunksize and a sparse index which saves on server memory size which both lead to great deduplication rates.   Most deduplication functionality is memory intensive making it hard to scale without increasing memory or using different dedupe engines across a product line.  StoreOnce’s sparse indexing fixed that issue and as such, can use the same deduplication engine across their entire product line.

HP's JR (Jim Richardson) at the front of the room discussing 3PAR's advantages
JR talking about 3PAR advantages

Jim Richardson or JR, a 3PAR SE from the start, got up and discussed 3PAR.  Early on, 3PAR brought to the market three characteristics that differentiated it from other enterprise storage products:

  • Multi-tennancy – today’s cloud service providers and just about anyone running enterprise storage needs to support mixed workloads on shared storage. 3PAR’s ASIC allows data to be placed on any storage node and be serviced at direct access speeds to better support these multi-application environments. 
  • Thin provisioning – although certainly not the first to support thin provisioning (Iceberg was the first), 3PAR did much to popularize it.  Once again the ASIC provides automated support for thin provisioning.  
  • Autonomic functionality – optimization of storage performance across nodes and tiers of storage was also helped by their ASIC’s ability to transfer data without involving processor interaction.  Also 3PAR, tried to take the drudgery out of administration by automatically wide striping and making provisioning easier.

Jim Hankins and Chris Duffy came up next and talked about the X9000 IBRIX storage system.  Ibrix has intrinsic scale out NAS support and provides automatic failover across dual processing nodes called couplets. The B6200 backup system (see above) is based on Ibrix technology.  Ibrix supports a 15PB single name space that is segmented across cluster couplets.  Ibrix also comes in a gateway configuration using shared SAN storage behind it.

A picture of a X5000 without skins, and a couple of CRUs taken out
HP X5000 NAS system

Robert Thompson got up and talked about the X5000 Windows Server WSS based NAS product.  It is the industry’s first two node file system with active/active clustering in a box.  As the product runs Windows Server, one can run Anti-Virus or other server applications directly on the storage and is customer maintainable. Robert pulled out every replaceable unit in the system.  Apparently the E5000, HP Storage’s Exchange Appliance is also based on the same hardware.   The two servers in the storage system are clustered together using MSCS.

A photo of an intelligent data center floor tile with remotely controlled mechanical louvres to control air flow.
HPer showing off intelligent floor tiles

In the afternoon we went on a lab tour and got to see some of HP’s storage and data center cooling technology on display.

On the second day, Mike Koponen got up and discussed HP’s Virtual System (or Vblock competitor) and Aboubacar Diare gave some of his opinions on VMware VAAI & VASA integration from his testing perspective.  Finally, Calvin Zito wrapped up the two day event and everyone (except me and a few others) went on a brewery tour.


All in all, we had a good time with HP.  Too bad, I didn’t get to go on the New Belgium Brewery tour, perhaps next time.




SSD news roundup

The NexGen n5 Storage System (c) 2011 NexGen Storage, All Rights Reserved
The NexGen n5 Storage System (c) 2011 NexGen Storage, All Rights Reserved

NexGen comes out of stealth

NexGen Storage a local storage company came out of stealth today and is also generally available.  Their storage system has been in beta since April 2011 and is in use by a number of customers today.

Their product uses DRAM caching, PCIe NAND flash, and nearline SAS drives to provide guaranteed QoS for LUN I/O.  The system can provision IOP rate, bandwidth and (possibly) latency over a set of configured LUNs.    Such provisioning can change using policy management on a time basis to support time-based tiering. Also, one can prioritize how important the QoS is for a LUN so that it could be guaranteed or could be sacrificed to support performance for other storage system LUNs.

The NexGen storage provides a multi-tiered hybrid storage system that supports 10GBE iSCSI, and uses MLC NAND PCIe card  to boost performance for SAS nearline drives.  NexGen also supports data deduplication which is done during off-peak times to reduce data footprint.

DRAM replacing Disk!?

In a report by ARS Technica, a research group out of Stanford is attempting to gang together server DRAM to create a networked storage system.  There have been a number of attempts to use DRAM as a storage system in the past but the Stanford group is going after it in a different way by aggregating together DRAM across a gaggle of servers.  They are using standard disks or SSDs for backup purposes because DRAM is, of course, a volatile storage device but the intent is to keep all in memory to speed up performance.

I was at SNW USA a couple of weeks ago talking to a Taiwanese company that was offering a DRAM storage accelerator device which also used DRAM as a storage service. Of course, Texas Memory Systems and others have had DRAM based storage for a while now. The cost for such devices was always pretty high but the performance was commensurate.

In contrast, the Stanford group is trying to use commodity hardware (servers) with copious amounts of DRAM, to create a storage system.  The article seems to imply that the system could take advantage of unused DRAM, sitting around your server farm. But, I find it hard to believe that.  Most virtualized server environments today are running lean on memory and there shouldn’t be a lot of excess DRAM capacity hanging around.

The other achilles heel of the Stanford DRAM storage is that it is highly dependent on low latency networking.  Although Infiniband probably qualifies as low latency, it’s not low latency enough to support this systems IO workloads. As such, they believe they need even lower latency networking than Infiniband to make it work well.

OCZ ups the IOP rate on their RevoDrive3 Max series PCIe NAND storage

Speaking of PCIe NAND flash, OCZ just announced speedier storage, upping the random read IO rates up to 245K from the 230K IOPS offered in their previous PCIe NAND storage.  Unclear what they did to boost this but, it’s entirely possible that they have optimized their NAND controller to support more random reads.

OCZ announces they will ship TLC SSD storage in 2012

OCZ’s been busy.  Now that the enterprise is moving to adopt MLC and eMLC SSD storage, it seems time to introduce TLC (3-bits/cell) SSDs.  With TLC, the price should come down a bit more (see chart in article), but the endurance should also suffer significantly.  I suppose with the capacities available with TLC and enough over provisioning OCZ can make a storage device that would be reliable enough for certain applications at a more reasonable cost.

I never thought I would see MLC in enterprise storage so, I suppose at some point even TLC makes sense, but I would be even more hesitant to jump on this bandwagon for awhile yet.

Solid Fire obtains more funding

Early last week Solid Fire, another local SSD startup obtained $25M in additional funding.  Solid Fire, an all SSD storage system company,  is still technically in beta but expect general availability near the end of the year.   We haven’t talked about them before in RayOnStorage but they are focusing on cloud service providers with an all SSD solution which includes deduplication.  I promise to talk about them some more when they reach GA.

LaCIE introduces a Little Big Disk, a Thunderbolt SSD

Finally, in the highend consumer space, LaCie just released a new SSD which attaches to servers/desktops using the new Apple-Intel Thunderbolt IO interface.  Given the expense (~$900) for 128GB SSD, it seems a bit much but if you absolutely have to have the performance this may be the only way to go.



Well that’s about all I could find on SSD and DRAM storage announcements. However, I am sure I missed a couple so if you know one I should have mentioned please comment.

Pure Storage surfaces

1 controller X 1 storage shelf (c) 2011 Pure Storage (from their website)
1 controller X 1 storage shelf (c) 2011 Pure Storage (from their website)

We were talking with Pure Storage last week, another SSD startup which just emerged out of stealth mode today.  Somewhat like SolidFire which we discussed a month or so ago, Pure Storage uses only SSDs to provide primary storage.  In this case, they are supporting a FC front end, with an all SSDs backend, and implementing internal data deduplication and compression, to try to address the needs of enterprise tier 1 storage.

Pure Storage is in final beta testing with their product and plan to GA sometime around the end of the year.

Pure Storage hardware

Their system is built around MLC SSDs which are available from many vendors but with a strategic investment from Samsung, currently use that vendor’s storage.  As we know, MLC has write endurance limitations but Pure Storage was built from the ground up knowing they were going to use this technology and have built their IP to counteract these issues.

The system is available in one or two controller configurations, with an Infiniband interconnect between the controllers, 6Gbps SAS backend, 48GB of DRAM per controller for caching purposes, and NV-RAM for power outages.  Each controller has 12-cores supplied by 2-Intel Xeon processor chips.

With the first release they are limiting the controllers to one or two (HA option) but their storage system is capable of clustering together many more, maybe even up to 8-controllers using the Infiniband back end.

Each storage shelf provides 5.5TB of raw storage using 2.5″ 256GB MLC SSDs.  It looks like each controller can handle up to 2-storage shelfs with the HA (dual controller option) supporting 4 drive shelfs for up to 22TB of raw storage.

Pure Storage Performance

Although these numbers are not independently verified, the company says a single controller (with 1-storage shelf) they can do 200K sustained 4K random read IOPS, 2GB/sec bandwidth, 140K sustained write IOPS, or 500MB/s of write bandwidth.  A dual controller system (with 2-storage shelfs) can achieve 300K random read IOPS, 3GB/sec bandwidth, 180K write IOPS or 1GB/sec of write bandwidth.  They also claim that they can do all this IO with an under 1 msec. latency.

One of the things they pride themselves on is consistent performance.  They have built their storage such that they can deliver this consistent performance even under load conditions.

Given the amount of SSDs in their system this isn’t screaming performance but is certainly up there with many enterprise class systems sporting over 1000 disks.  The random write performance is not bad considering this is MLC.  On the other hand the sequential write bandwidth is probably their weakest spec and reflects their use of MLC flash.

Purity software

One key to Pure Storage (and SolidFire for that matter) is their use of inline data compression and deduplication. By using these techniques and basing their system storage on MLC, Pure Storage believes they can close the price gap between disk and SSD storage systems.

The problems with data reduction technologies is that not all environments can benefit from them and they both require lots of CPU power to perform well.  Pure Storage believes they have the horsepower (with 12 cores per controller) to support these services and are focusing their sales activities on those (VMware, Oracle, and SQL server) environments which have historically proven to be good candidates for data reduction.

In addition, they perform a lot of optimizations in their backend data layout to prolong the life of MLC storage. Specifically, they use a write chunk size that matches the underlying MLC SSDs page width so as not to waste endurance with partial data writes.  Also they migrate old data to new locations occasionally to maintain “data freshness” which can be a problem with MLC storage if the data is not touched often enough.  Probably other stuff as well, but essentially they are tuning their backend use to optimize endurance and performance of their SSD storage.

Furthermore, they have created a new RAID 3D scheme which provides an adaptive parity scheme based on the number of available drives that protects against any dual SSD failure.  They provide triple parity, dual parity for drive failures and another parity for unrecoverable bit errors within a data payload.  In most cases, a failed drive will not induce an immediate rebuild but rather a reconfiguration of data and parity to accommodate the failing drive and rebuild it onto new drives over time.

At the moment, they don’t have snapshots or data replication but they said these capabilities are on their roadmap for future delivery.


In the mean time, all SSD storage systems seem to be coming out of the wood work. We mentioned SolidFire, but WhipTail is another one and I am sure there are plenty more in stealth waiting for the right moment to emerge.

I was at a conference about two months ago where I predicted that all SSD systems would be coming out with little of the engineering development of storage systems of yore. Based on the performance available from a single SSD, one wouldn’t need 100s of SSDs to generate 100K IOPS or more.  Pure Storage is doing this level of IO with only 22 MLC SSDs and a high-end, but essentially off-the-shelf controller.

Just imagine what one could do if you threw some custom hardware at it…


SolidFire supplies scale-out SSD storage for cloud service providers

SolidFire SF3010 node (c) 2011 SolidFire (from their website)
SolidFire SF3010 node (c) 2011 SolidFire (from their website)

I was talking with a local start up called SolidFire the other day with an interesting twist on SSD storage.  They were targeting cloud service providers with a scale-out, cluster based SSD iSCSI storage system.  Apparently a portion of their team had come from Lefthand (now owned by HP) another local storage company and the rest came from Rackspace, a national cloud service provider.

The hardware

Their storage system is a scale-out cluster of storage nodes that can range from 3 to a theoretical maximum of 100 nodes (validated node count ?). Each node comes equipped with 2-2.4GHz, 6-core Intel processors and 10-300GB SSDs for a total of 3TB raw storage per node.  Also they have 8GB of non-volatile DRAM for write buffering and 72GB read cache resident on each node.

The system also uses 2-10GbE links for host to storage IO and inter-cluster communications and support iSCSI LUNs.  There are another 2-1GigE links used for management communications.

SolidFire states that they can sustain 50K IO/sec per node. (This looks conservative from my viewpoint but didn’t state any specific R:W ratio or block size for this performance number.)

The software

They are targeting cloud service providers and as such the management interface was designed from the start as a RESTful API but they also have a web GUI built out of their API.  Cloud service providers will automate whatever they can and having a RESTful API seems like the right choice.

QoS and data reliability

The cluster supports 100K iSCSI LUNs and each LUN can have a QoS SLA associated with it.  According to SolidFire one can specify a minimum/maximum/burst level for IOPS and a maximum or burst level for throughput at a LUN granularity.

With LUN based QoS, one can divide cluster performance into many levels of support for multiple customers of a cloud provider.  Given these unique QoS capabilities it should be relatively easy for cloud providers to support multiple customers on the same storage providing very fine grained multi-tennancy capabilities.

This could potentially lead to system over commitment, but presumably they have some way to ascertain over commitment is near and not allowing this to occur.

Data reliability is supplied through replication across nodes which they call Helix(tm) data protection.  In this way if an SSD or node fails, it’s relatively easy to reconstruct the lost data onto another node’s SSD storage.  Which is probably why the minimum number of nodes per cluster is set at 3.

Storage efficiency

Aside from the QoS capabilities, the other interesting twist from a customer perspective is that they are trying to price an all-SSD storage solution at the $/GB of normal enterprise disk storage. They believe their node with 3TB raw SSD storage supports 12TB of “effective” data storage.

They are able to do this by offering storage efficiency features of enterprise storage using an all SSD configuration. Specifically they provide,

  • Thin provisioned storage – which allows physical storage to be over subscribed and used to support multiple LUNs when space hasn’t completely been written over.
  • Data compression – which searches for underlying redundancy in a chunk of data and compresses it out of the storage.
  • Data deduplication – which searches multiple blocks and multiple LUNs to see what data is duplicated and eliminates duplicative data across blocks and LUNs.
  • Space efficient snapshot and cloning – which allows users to take point-in-time copies which consume little space useful for backups and test-dev requirements.

Tape data compression gets anywhere from 2:1 to 3:1 reduction in storage space for typical data loads. Whether SolidFire’s system can reach these numbers is another question.  However, tape uses hardware compression and the traditional problem with software data compression is that it takes lots of processing power and/or time to perform it well.  As such, SolidFire has configured their node hardware to dedicate a CPU core to each physical disk drive (2-6 core processors for 10 SSDs in a node).

Deduplication savings are somewhat trickier to predict but ultimately depends on the data being stored in a system and the algorithm used to deduplicate it.  For user home directories, typical deduplication levels of 25-40% are readily attainable.  SolidFire stated that their deduplication algorithm is their own patented design and uses a small fixed block approach.

The savings from thin provisioning ultimately depends on how much physical data is actually consumed on a storage LUN but in typical environments can save 10-30% of physical storage by pooling non-written or free storage across all the LUNs configured on a storage system.

Space savings from point-in-time copies like snapshots and clones depends on data change rates and how long it’s been since a copy was made.  But, with space efficient copies and a short period of existence, (used for backups or temporary copies in test-development environments) such copies should take little physical storage.

Whether all of this can create a 4:1 multiplier for raw to effective data storage is another question but they also have a eScanner tool which can estimate savings one can achieve in their data center. Apparently the eScanner can be used by anyone to scan real customer LUNs and it will compute how much SolidFire storage will be required to support the scanned volumes.


There are a few items left on their current road map to be delivered later, namely remote replication or mirroring. But for now this looks to be a pretty complete package of iSCSI storage functionality.

SolidFire is currently signing up customers for Early Access but plan to go GA sometime around the end of the year. No pricing was disclosed at this time.

I was at SNIA’s BoD meeting the other week and stated my belief that SSDs will ultimately lead to the commoditization of storage.  By that I meant that it would be relatively easy to configure enough SSD hardware to create a 100K IO/sec  or 1GB/sec system without having to manage 1000 disk drives.  Lo and behold, SolidFire comes out the next week.  Of course, I said this would happen over the next decade – so I am off by a 9.99 years…


Hadoop – part 2

Hadoop Graphic (c) 2011 Silverton Consulting
Hadoop Graphic (c) 2011 Silverton Consulting

(Sorry about the length).

In part 1 we discussed some of Hadoop’s core characteristics with respect to the Hadoop distributed file system (HDFS) and the MapReduce analytics engine. Now in part 2 we promised to discuss some of the other projects that have emerged to make Hadoop and specifically MapReduce even easier to use to analyze unstructured data.

Specifically, we have a set of tools which use Hadoop to construct a database like out of unstructured data.  Namely,

  • Casandra – which maps HDFS data into a database but into a columnar based sparse table structure rather than the more traditional relational database row form. Cassandra was written by Facebook for Mbox search. Columnar databases support a sparse data much more efficiently.  Data access is via a Thrift based API supporting many languages.  Casandra’s data model is based on column, column families and column super-families. The datum for any column item is a three value structure and consists of a name, value of item and a time stamp.  One nice thing about Cassandra is that one can tune it for any consistency model one requires, from no consistency to always consistent and points inbetween.  Also Casandra is optimized for writes.  Cassandra can be used as the Map portion of a MapReduce run.
  • Hbase – which also maps HDFS data into a database like structure and provides Java API access to this DB.  Hbase is useful for million row tables with arbitrary column counts. Apparently Hbase is an outgrowth of Google’s Bigtable which did much the same thing only against the Google file system (GFS).  In contrast to Hive below Hbase doesn’t run on top of MapReduce rather it replaces MapReduce, however it can be used as a source or target of MapReduce operations.  Also, Hbase is somewhat tuned for random access read operations and as such, can be used to support some transaction oriented applications.  Moreover, Hbase can run on HDFS or Amazon S3 infrastructure.
  • Hive – which maps a” simple SQL” (called QL) ontop of a data warehouse built on Hadoop.  Some of these queries may take a long time to execute and as the HDFS data is unstructured the map function must extract the data using a database like schema into something approximating a relational database. Hive operates ontop of Hadoop’s MapReduce function.
  • Hypertable – is a Google open source project which is a  c++ implementation of BigTable only using HDFS rather than GFS .  Actually Hypertable can use any distributed file systemand and is another columnar database (like Cassandra above) but only supports columns and column families.   Hypertable supports both a client (c++) and Thrift API.  Also Hypertable is written in c++ and is considered the most optimized of the Hadoop oriented databases (although there is some debate here).
  • Pig – is a dataflow processing (scripting) language built ontop of Hadoop which supports a sort of database interpreter for HDFS  in combination with an interpretive analysis.  Essentially, Pig uses the scripting language and emits a dataflow graph which is then used by MapReduce to analyze the data in HDFS.  Pig supports both batch and interactive execution but can also be used through a Java API.

Hadoop also supports special purpose tools used for very specialized analysis such as

  • Mahout – an Apache open source project which applies machine learning algorithms to HDFS data providing classification, characterization, and other feature extraction.  However, Mahout works on non-Hadoop clusters as well.  Mahout supports 4 techniques: recommendation mining, clustering, classification, and itemset machine learning functions.  While Mahout uses the MapReduce framework of Hadoop, it doesnot appear that Mahout uses Hadoop MapReduce directly but is rather a replacement for MapReduce focused on machine learning activities.
  • Hama – an Apache open source project which is used to perform paralleled matrix and graph computations against Hadoop cluster data.  The focus here is on scientific computation.  Hama also supports non-Hadoop frameworks including BSP and Dryad (DryadLINQ?). Hama operates ontop of MapReduce and can take advantage of Hbase data structures.

There are other tools that have sprung up around Hadoop to make it easier to configure, test and use, namely

  • Chukwa – which is used for monitoring large distributed clusters of servers.
  • ZooKeeper – which is a cluster configuration tool  and distributed serialization manager useful to build large clusters of Hadoop nodes.
  • MRunit – which is used to unit test MapReduce programs without having to test it on the whole cluster.
  • Whirr – which extends HDFS to use cloud storage services, unclear how well this would work with PBs of data to be processed but maybe it can colocate the data and the compute activities into the same cloud data center.

As for who uses these tools, Facebook uses Hive and Cassandra, Yahoo uses Pig, Google uses Hypertable and there are myriad users of the other projects as well.  In most cases the company identified in the previous list developed the program source code originally, and then contributed it to the Apache for use in the Hadoop open source project. In addition, those companies continue to fix, support and enhance these packages as well.

Tape vs. Disk, the saga continues

Inside a (Spectra Logic) T950 library by ChrisDag (cc) (from Flickr)
Inside a (Spectra Logic) T950 library by ChrisDag (cc) (from Flickr)

Was on a call late last month where Oracle introduced their latest generation T1000C tape system (media and drive) holding 5TB native (uncompressed) capacity. In the last 6 months I have been hearing about the coming of a 3TB SATA disk drive from Hitachi GST and others. And last month, EMC announced a new Data Domain Archiver, a disk only archive appliance (see my post on EMC Data Domain products enter the archive market).

Oracle assures me that tape density is keeping up if not gaining on disk density trends and capacity. But density or capacity are not the only issues causing data to move off of tape in today’s enterprise data centers.

“Dedupe Rulz”

A problem with the data density trends discussion is that it’s one dimensional (well literally it’s 2 dimensional). With data compression, disk or tape systems can easily double the density on a piece of media. But with data deduplication, the multiples start becoming more like 5X to 30X depending on frequency of full backups or duplicated data. And number’s like those dwarf any discussion of density ratios and as such, get’s everyone’s attention.

I can remember talking to an avowed tape enginerr, years ago and he was describing deduplication technology at the VTL level as being architecturally inpure and inefficient. From his perspective it needed to be done much earlier in the data flow. But what they failed to see was the ability of VTL deduplication to be plug-compatible with the tape systems of that time. Such ease of adoption allowed deduplication systems to build a beach-head and economies of scale. From there such systems have no been able to move up stream, into earlier stages of the backup data flow.

Nowadays, what with Avamar, Symantec Pure Disk and others, source level deduplication, or close by source level deduplication is a reality. But all this came about because they were able to offer 30X the density on a piece of backup storage.

Tape’s next step

Tape could easily fight back. All that would be needed is some system in front of a tape library that provided deduplication capabilities not just to the disk media but the tape media as well. This way the 30X density over non-deduplicated storage could follow through all the way to the tape media.

In the past, this made little sense because a deduplicated tape would require potentially multiple volumes in order to restore a particular set of data. However, with today’s 5TB of data on a tape, maybe this doesn’t have to be the case anymore. In addition, by having a deduplication system in front of the tape library, it could support most of the immediate data restore activity while data restored from tape was sort of like pulling something out of an archive and as such, might take longer to perform. In any event, with LTO’s multi-partitioning and the other enterprise class tapes having multiple domains, creating a structure with meta-data partition and a data partition is easier than ever.

“Got Dedupe”

There are plenty of places, that today’s tape vendors can obtain deduplication capabilities. Permabit offers Dedupe code for OEM applications for those that have no dedupe systems today. FalconStor, Sepaton and others offer deduplication systems that can be OEMed. IBM, HP, and Quantum already have tape libraries and their own dedupe systems available today all of which can readily support a deduplicating front-end to their tape libraries, if they don’t already.

Where “Tape Rulz”

There are places where data deduplication doesn’t work very well today, mainly rich media, physics, biopharm and other non-compressible big-data applications. For these situations, tape still has a home but for the rest of the data center world today, deduplication is taking over, if it hasn’t already. The sooner tape get’s on the deduplication bandwagon the better for the IT industry.


Of course there are other problems hurting tape today. I know of at least one large conglomerate that has moved all backup off tape altogether, even data which doesn’t deduplicate well (see my previous Oracle RMAN posts). And at least another rich media conglomerate that is considering the very same move. For now, tape has a safe harbor in big science, but it won’t last long.