Big data – part 3

Linkedin maps data visualization by luc legay (cc) (from Flickr)

I have renamed this series to “Big data” because it’s no longer just about Hadoop (see Hadoop – part 1 & Hadoop – part 2 posts).

To partition this space a bit, there is unstructured data analysis and structured data analysis. Hadoop is mostly used to analyze unstructured data, although part of that work is using Hadoop to parse the data and give it structure.

On the other hand, for structured data there are a number of other options currently available. Namely:

  • EMC Greenplum – a relational database that is available as software only and now also as a hardware appliance. Greenplum supports both row- and column-oriented data structuring and has support for policy-based data placement across multiple storage tiers. There is a packaged solution that consists of Greenplum software and a Hadoop distribution running on a Greenplum appliance.
  • HP Vertica – a column-oriented relational database that is currently available in a software-only distribution. Vertica supports aggressive data compression and provides high-throughput query performance. They were early supporters of Hadoop integration, providing Hadoop MapReduce and Pig API connectors that give Hadoop access to data in Vertica databases, as well as job scheduling integration.
  • IBM Netezza – a relational database system that is based on a proprietary hardware analysis engine configured in a blade system. Netezza is the second oldest solution on this list (see Teradata for the oldest). Since the acquisition by IBM, Netezza provides their highest performing solution on IBM blade hardware, but all of their systems depend on purpose-built FPGA chips designed to perform high-speed queries across relational data. Netezza has a number of partner and/or homegrown solutions that provide specialized analysis for specific verticals such as retail, telecom, finserv, and others. Also, Netezza provides tight integration with various Oracle functionality, but there doesn’t appear to be much direct integration with Hadoop on their website.
  • ParAccel – a column-based relational database that is available as a software-only solution. ParAccel offers a number of storage deployment options, including an all in-memory database, a DAS database or an SSD database. In addition, ParAccel offers a Blended Scan approach providing a two-tier database structure with DAS and SAN storage. There appears to be some integration with Hadoop, indicating that data stored in HDFS and structured by MapReduce can be loaded and analyzed by ParAccel.
  • Teradata – a relational database that is based on proprietary, purpose-built appliance hardware. Teradata recently came out with an all-SSD solution which provides very high performance for database queries. The company was started in 1979, has been very successful in the retail, telecom and finserv verticals, and offers a number of special purpose applications supporting data analysis for these and other verticals. There appears to be some integration with Hadoop but it’s not prominent on their website.

I am probably missing a few other solutions, but these appear to be the main ones at the moment.
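Several of the products above are described as row oriented or column oriented. To make that distinction concrete, here is a minimal Python sketch. The table, field names and helper functions are all made up for illustration and do not reflect how any of these vendors actually store data.

```python
# Hypothetical toy table; names and values are invented for illustration.
rows = [
    {"store": "A", "item": "widget", "amount": 10.0},
    {"store": "B", "item": "gadget", "amount": 25.5},
    {"store": "A", "item": "gadget", "amount": 4.5},
]


def to_columns(rows):
    """Convert a row-oriented table (list of dicts) into a
    column-oriented one (dict of lists)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_oriented(rows, field):
    # Every whole row is visited even though only one field is needed.
    return sum(row[field] for row in rows)


def sum_column_oriented(columns, field):
    # Only the single column the query needs is read.
    return sum(columns[field])


columns = to_columns(rows)
```

Both sum functions return the same total. The difference is in how much data has to be touched: an analytic query over one field of a wide table only needs that one list in the column layout, and a list of similar values also tends to compress well.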

In any case, both Hadoop and most of its software-only, structured competition are based on a massively parallel, shared-nothing set of Linux servers. The two hardware-based solutions listed above (Teradata and Netezza) also operate in a massively parallel processing mode to load and analyze data. Such solutions provide scale-out performance at a reasonable cost to support very large databases (PB of data).
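As a rough illustration of what shared nothing means, here is a small Python sketch. It assumes hash partitioning by key, and it uses threads as stand-ins for separate servers, so it only shows the shape of the idea. The function names are hypothetical and not taken from any of these products.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def partition(records, node_count):
    """Hash-partition (key, value) records so each node owns a disjoint slice."""
    nodes = [[] for _ in range(node_count)]
    for key, value in records:
        # crc32 gives a stable hash, unlike Python's per-process str hash.
        nodes[crc32(key.encode()) % node_count].append((key, value))
    return nodes


def local_totals(node_records):
    """Work done on one node, using only that node's own data."""
    totals = {}
    for key, value in node_records:
        totals[key] = totals.get(key, 0) + value
    return totals


def parallel_totals(records, node_count=4):
    nodes = partition(records, node_count)
    # Threads stand in for independent servers in this sketch.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_totals, nodes))
    merged = {}
    for partial in partials:
        # A key lives on exactly one node, so partial results never collide.
        merged.update(partial)
    return merged
```

Calling `parallel_totals([("retail", 5), ("telecom", 7), ("retail", 3)])` gives the per-key totals. Each node works only on its own slice, and the only coordination is the final merge, which is why adding servers adds capacity.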

Now that EMC owns Greenplum and HP owns Vertica, we are likely to see more appliance-based packaging options for both of these offerings. EMC has taken the lead here and has already announced Greenplum-specific appliance packages.

—-

One lingering question about these solutions is why customers don’t use current traditional database systems (Oracle, DB2, Postgres, MySQL) to do this analysis. The answer seems to lie in the fact that these traditional solutions are not massively parallelized. Thus, doing this analysis on TB or PB of data would take too long. Moreover, the cost to support data analysis with traditional database solutions over PB of data would be prohibitive. For these reasons and the fact that compute power has become so cheap nowadays, structured data analytics for large databases has migrated to these special purpose, massively parallelized solutions.
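A back-of-the-envelope calculation shows why "too long" is no exaggeration. The sketch below assumes a scan rate of 1 GB/s per server, a number I picked purely for illustration, and assumes throughput scales perfectly with the number of servers, which real systems only approximate.

```python
PETABYTE = 10**15                # bytes
ASSUMED_NODE_RATE = 10**9        # bytes/second per server (illustrative guess)


def scan_seconds(data_bytes, node_count, bytes_per_sec_per_node=ASSUMED_NODE_RATE):
    """Idealized time to scan the data once, assuming perfect linear scaling."""
    return data_bytes / (node_count * bytes_per_sec_per_node)


single_server = scan_seconds(PETABYTE, node_count=1)
hundred_servers = scan_seconds(PETABYTE, node_count=100)
```

Under those assumptions, one server needs 1,000,000 seconds (about 11.6 days) for a single pass over 1 PB, while 100 servers working in parallel need 10,000 seconds (under 3 hours).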

Comments?

Database appliances!?

The Sun Oracle Database Machine by Oracle OpenWorld San Francisco 2009 (cc) (from Flickr)

I was talking with Oracle the other day about their Exadata database system.  They have achieved a lot of success with this product.  All of which got me wondering whether database-specific storage ever makes sense.  I suppose the ultimate arbiter of “making sense” is commercial viability, and Oracle and others have certainly proven this, but from a technologist’s perspective I still wonder.

As I understand it, the Exadata system combines database servers and storage servers in one rack (with extensions to other racks).  They use an InfiniBand interconnect between the database and storage servers and have a proprietary storage access protocol between the two.

With their proprietary protocol they can provide hints to the storage servers as to what’s coming next and how to manage the database data which make the Exadata system a screamer of a database machine.  Such hints can speed up database query processing, more efficiently store database structures, and overall speed up Oracle database activity.  Given all that it makes sense to a lot of customers.
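To give a feel for why passing hints to storage can matter, here is a generic Python sketch of one kind of hint: telling the storage server which rows the query actually wants, so it can filter before shipping data. This is not Oracle’s protocol, which I have not seen. The class and method names are hypothetical.

```python
class StorageServer:
    """Hypothetical storage server that holds rows as dicts."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_sent = 0  # how many rows crossed the wire

    def read_all(self):
        # Plain read: ship every row and let the database server filter.
        self.rows_sent += len(self._rows)
        return list(self._rows)

    def read_filtered(self, predicate):
        # Hinted read: filter at the storage server, ship only the matches.
        matches = [row for row in self._rows if predicate(row)]
        self.rows_sent += len(matches)
        return matches


def is_west(row):
    return row["region"] == "west"


# Invented data: one row in ten belongs to the "west" region.
rows = [{"id": i, "region": "west" if i % 10 == 0 else "east"} for i in range(1000)]

plain = StorageServer(rows)
hinted = StorageServer(rows)

result_plain = [row for row in plain.read_all() if is_west(row)]
result_hinted = hinted.read_filtered(is_west)
```

Both approaches return the same rows, but the plain read moves all 1000 rows between storage and database server while the hinted read moves only the 100 that match. On a real system that difference shows up as interconnect bandwidth and database server CPU.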

Now, there are other systems which compete with Exadata like Teradata and Netezza (am I missing anyone?) that also support onboard database servers and storage servers.  I don’t know much about these products but they all seem targeted at data warehousing and analytics applications similar to Exadata but perhaps more specialized.

  • As far as I can tell, Teradata has been around for years, since they were spun out of NCR (or AT&T), and has enjoyed tremendous success.  The last annual report I can find for them shows their ’09 revenue around $772M with net income of $254M.
  • Netezza started in 2000 and seems to be doing OK in the database appliance market given their youth.  Their last annual report for ’10 showed revenue of ~$191M and net income of $4.2M.  Perhaps not doing as well as Teradata but certainly commercially viable.

The only reason database appliances or machines exist is to speed up database processing.  If they can do that then they seem able to build a market place for themselves.

Database to storage interface standards

The key question from a storage analyst’s perspective is: shouldn’t there be some sort of standards committee, like SNIA or others, that works to define a standard protocol between database servers and storage that can be adopted by other storage vendors?  I understand the advantage that proprietary interfaces can supply to an enlightened vendor’s equipment, but there are more database vendors out there than just Oracle, Teradata and Netezza, and there are (at least for the moment) many more storage vendors out there as well.

A decade or so ago, when I was with another storage company, we created a proprietary interface for backup activity.  It sold OK, but in the end it didn’t sell enough to be worthwhile for either the backup or storage company to continue the approach.  At the time we were looking to support another proprietary interface for sorting but couldn’t seem to justify it.

Proprietary interfaces tend to lock customers in, and most customers will only accept lock-in if there is a significant advantage to your functionality.  But customer lock-in can lull vendors into not investing R&D funding in the latest technology, and over time this effect will cause the vendor to lose any advantage they previously enjoyed.

It seems to me that the more successful companies (with the possible exception of Apple) tend to focus on opening up their interfaces rather than closing them down.  By doing so they introduce more competition which serves their customers better, in the long run.

I am not saying that if Oracle standardized or published their database server to storage server interface there would be a lot of storage vendors going after that market.  But the high revenues in this market, as evidenced by Teradata and Netezza, would certainly interest a few select storage vendors.  Now, not all of Teradata’s or Netezza’s revenues derive from pure storage sales, but I would wager a significant part do.

Nevertheless, a standard database storage protocol could readily be defined by existing database vendors in conjunction with SNIA.  Once defined, I believe some storage vendors would adopt this protocol along with every other storage protocol (iSCSI, FCoE, FC, FCIP, CIFS, NFS, etc.). Once that occurs, customers across the board would benefit from the increased competition and break away from the current customer lock-in with today’s database appliances.

Any significant success in the market from early storage vendor adopters of this protocol would certainly interest other vendors, inducing a cycle of increased adoption, higher competition, and better functionality.  In the end, database customers worldwide will benefit from the increased price performance available in the open market.  And in the end that makes a lot more sense to me than the database appliances of today.

As to why Apple has excelled within a closed system environment, that will need to wait for a future post.