There’s been an ongoing debate in the analyst community about the advantages of software-only innovation vs. hardware-software innovation (see the Commodity hardware loses again and Commodity hardware always loses posts). Here is another example where two separate companies have turned to hardware innovation to take storage innovation to the next level.
These two arrays seem to be going after opposite ends of the storage market: the 5U DSSD D5 is going after both structured and unstructured data that needs ultra-high-speed IO access times (<100µsec), while the 4U FlashBlade is going after more general-purpose unstructured data. And yet the two have many similarities, at least superficially.
To try to partition this space just a bit, there is unstructured data analysis and structured data analysis. Hadoop is used to analyze unstructured data (although Hadoop is also used to parse and structure the data).
On the other hand, for structured data there are a number of other options currently available. Namely:
EMC Greenplum – a relational database that is available as software only and, now, as a hardware appliance. Greenplum supports both row- and column-oriented data structuring and has support for policy-based data placement across multiple storage tiers. There is a packaged solution that consists of Greenplum software and a Hadoop distribution running on a Greenplum appliance.
HP Vertica – a column-oriented, relational database that is currently available in a software-only distribution. Vertica supports aggressive data compression and provides high-throughput query performance. They were early supporters of Hadoop integration, offering Hadoop MapReduce and Pig API connectors that give Hadoop access to data in Vertica databases, as well as job scheduling integration.
IBM Netezza – a relational database system that is based on a proprietary hardware analysis engine configured in a blade system. Netezza is the second oldest solution on this list (see Teradata for the oldest). Since the acquisition by IBM, Netezza provides its highest performing solution on IBM blade hardware, but all of its systems depend on purpose-built FPGA chips designed to perform high-speed queries across relational data. Netezza has a number of partners and/or homegrown solutions that provide specialized analysis for specific verticals such as retail, telecom, finserv, and others. Also, Netezza provides tight integration with various Oracle functionality, but there doesn’t appear to be much direct integration with Hadoop on their website.
ParAccel – a column based, relational database that is available in a software only solution. ParAccel offers a number of storage deployment options including an all in-memory database, DAS database or SSD database. In addition, ParAccel offers a Blended Scan approach providing a two tier database structure with DAS and SAN storage. There appears to be some integration with Hadoop indicating that data stored in HDFS and structured by MapReduce can be loaded and analyzed by ParAccel.
Teradata – a relational database that is based on proprietary, purpose-built appliance hardware. Teradata recently came out with an all-SSD solution which provides very high performance for database queries. The company was started in 1979, has been very successful in the retail, telecom and finserv verticals, and offers a number of special-purpose applications supporting data analysis for these and other verticals. There appears to be some integration with Hadoop but it’s not prominent on their website.
Probably missing a few other solutions but these appear to be the main ones at the moment.
In any case, both Hadoop and most of its software-only, structured competition are based on a massively parallelized, shared-nothing set of Linux servers. The two hardware-based solutions listed above (Teradata and Netezza) also operate in a massively parallel processing mode to load and analyze data. Such solutions provide scale-out performance at a reasonable cost to support very large databases (PB of data).
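To make the shared-nothing idea a bit more concrete, here is a minimal Python sketch of a partitioned aggregation. It is only an illustration, not how any of the products above are implemented: the function names (`partition`, `local_sum`, `parallel_sum`) and the sample data are made up, and threads stand in for what would be separate Linux servers, each with its own storage.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_nodes):
    """Assign every (key, value) row to exactly one node by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        # crc32 gives a stable hash, so a given key always lands on the same node.
        node_id = zlib.crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id].append((key, value))
    return nodes


def local_sum(node_rows):
    """Work done by a single node, using only the rows it owns."""
    totals = defaultdict(int)
    for key, value in node_rows:
        totals[key] += value
    return dict(totals)


def parallel_sum(rows, num_nodes=4):
    """Run the per-node sums concurrently, then merge the partial results."""
    nodes = partition(rows, num_nodes)
    # Assumption: threads stand in for independent servers in this sketch.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = {}
    for partial in partials:
        # Keys are disjoint across nodes because rows were partitioned by key.
        merged.update(partial)
    return merged


# Hypothetical sample data: (vertical, amount) pairs.
sales = [("retail", 10), ("telecom", 5), ("retail", 7), ("finserv", 3), ("telecom", 1)]
totals = parallel_sum(sales)
print(totals)
```

Because each node only ever touches its own partition, adding nodes adds both capacity and scan bandwidth, which is where the scale-out behavior comes from.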
Now that EMC owns Greenplum and HP owns Vertica, we are likely to see more appliance-based packaging options for both of these offerings. EMC has taken the lead here and has already announced Greenplum-specific appliance packages.
One lingering question about these solutions is why customers don’t use current traditional database systems (Oracle, DB2, Postgres, MySQL) to do this analysis. The answer seems to lie in the fact that these traditional solutions are not massively parallelized. Thus, doing this analysis on TB or PB of data would take too long. Moreover, the cost to support data analysis with traditional database solutions over PB of data would be prohibitive. For these reasons, and the fact that compute power has become so cheap nowadays, structured data analytics for large databases has migrated to these special-purpose, massively parallelized solutions.
Ran across a web posting yesterday providing information on a University of Illinois summer program in Data Science. I had never encountered the term before so I was intrigued. When I first saw the article I immediately thought of data analytics but data science should be much broader than that.
What exactly is a data scientist? I suppose someone who studies what can be learned from data but also what happens throughout data lifecycles.
Data science is like biology
I look to biology for an example. A biologist studies all sorts of activity/interactions, from what happens in a single-cell organism to the plant and animal kingdoms. They create taxonomies which organize all biological entities, past and present. They study current and past food webs, ecosystems, and species. They work in an environment of scientific study where results are openly discussed and repeatable. In peer-reviewed journals, they document everything from how a cell interacts within an organism, to how an organism interacts with its ecosystem, to whole ecosystem lifecycles. I fondly remember my biology class in high school talking about DNA, the life of a cell, biological taxonomy and dissection.
Where are these counterparts in Data Science? Not sure but for starters let’s call someone who does data science an informatist.
What constitutes a data ecosystem in data science? Perhaps an informatist would study the IT infrastructure(s) where a datum is created, stored, and analyzed. Such infrastructure (especially with cloud) may span data centers, companies, and even the whole world. Nonetheless, migratory birds can cover large distances, across multiple ecosystems and are still valid subjects for biologists.
So where a datum exists, where/when it’s moved throughout its lifecycle, and how it interacts with other datums is a proper subject for data ecosystem study. I suppose my life’s study of storage could properly be called the study of data ecosystems.
Next, what’s a reasonable way for an informatist to organize data like a biological taxonomy with domain, kingdom, phylum, class, order, family, genus, and species (see Wikipedia)? Seems to me that the applications that create and access the data represent a rational way to organize data. However, my first thought on this was structured or unstructured data as the defining first-level breakdown (maybe Phylum). Order could be general application type such as email, ERP, office documents, etc. Family could be application domain, genus could be application version and species could be application data type. So something like an Exchange 2010 email would be Order=EMAILus, Family=EXCHANGius, Genus=E2010ius, and Species=MESSAGius.
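As a rough sketch of how such a classification might be written down, here is a small Python example. The class name `DataTaxon`, the helper `deepest_shared_rank`, and the second species `CALENDARius` are invented for illustration; only the Exchange 2010 message values come from the example above, and the Phylum and higher levels are left out because they are still tentative.

```python
from dataclasses import dataclass

# Ranks from broadest to narrowest, following the breakdown in the text.
RANKS = ("order", "family", "genus", "species")


@dataclass(frozen=True)
class DataTaxon:
    order: str    # general application type, e.g. email
    family: str   # application domain
    genus: str    # application version
    species: str  # application data type

    def lineage(self):
        """Return the classification as a single readable string."""
        return " > ".join(getattr(self, rank) for rank in RANKS)


def deepest_shared_rank(a, b):
    """Return the narrowest rank at which two taxa still agree, or None."""
    shared = None
    for rank in RANKS:
        if getattr(a, rank) != getattr(b, rank):
            break
        shared = rank
    return shared


# The Exchange 2010 email example from the text.
message = DataTaxon("EMAILus", "EXCHANGius", "E2010ius", "MESSAGius")
# Hypothetical sibling species, made up for this sketch.
calendar = DataTaxon("EMAILus", "EXCHANGius", "E2010ius", "CALENDARius")

print(message.lineage())
print(deepest_shared_rank(message, calendar))
```

Two data types that share a genus here were created by the same version of the same application, which is about as close as data relatives get under this scheme.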
I think higher classifications such as kingdom and domain need to consider things such as oral history, handcopied manuscripts, movable type printed documents, IT, etc., at the Kingdom level. Maybe Domain would be such things as biological domain, information domain, physical domain, etc. Although where oral-h
When first thinking of higher taxonomical designations I immediately went into O/S but now I think of an O/S as part of the ecological niche where data temporarily resides.
I could go on. There are probably hundreds, if not thousands, of other characteristics of data science that need to be discussed: data lifecycle, the data cell, information use webs, etc.
Another surprise is how well the study of biology fits the study of data science. Counterparts to biology seem to exist everywhere I look. At some deep level, biology is information, wet-ware perhaps, but information nonetheless. It seems to me that the use of biology to guide our elaboration of data science can be very useful.