Releasing social and mobile data as a public good

I have been reading a book recently, called Uncharted: Big data as a lens on human culture by Erez Aiden and Jean-Baptiste Michel that discusses the use of Google’s Ngram search engine which counts phrases (Ngrams) used in all the books they have digitized. Ngram phrases are charted against other Ngrams and plotted in real time.

It’s an interesting concept and one example they use is “United States are” vs. “United States is” a 3-Ngram which shows that the singular version of the phrase which has often been attributed to emerge immediately after the Civil War actually was in use prior to the Civil War and really didn’t take off until 1880’s, 15 years after the end of the Civil War.

I haven’t finished the book yet but it got me to thinking. The authors petitioned Google to gain access to the Ngram data which led to their original research. But then they convinced Google after their original research time was up to release the information to the general public. Great for them but it’s only a one time event and happened to work this time with luck and persistance.

The world needs more data

But there’s plenty of other information or data out there where we could use to learn an awful lot about human social interaction and other attributes about the world that are buried away in corporate databases. Yes, sometimes this information is made public (like Google), or made available for specific research (see my post on using mobile phone data to understand people mobility in an urban environment) but these are special situations. Once the research is over, the data is typically no longer available to the general public and getting future or past data outside the research boundaries requires yet another research proposal.

And yet books and magazines are universally available for a fair price to anyone and are available in most research libraries as a general public good for free.  Why should electronic data be any different?

Social and mobile dta as a public good

What I would propose is that the Library of Congress and other research libraries around the world have access to all corporate data that documents interaction between humans, humans and the environment, humanity and society, etc.  This data would be freely available to anyone with library access and could be used to provide information for research activities that have yet to be envisioned.

Hopefully all of this data would be released, free of charge (or for some nominal fee) to these institutions after some period of time has elapsed. For example, if we were talking about Twitter feeds, Facebook feeds, Instagram feeds, etc. the data would be provided from say 7 years back on a reoccurring yearly or quarterly basis. Not sure if the delay time should be 7, 10 or 15 years, but after some judicious period of time, the data would be released and made publicly available.

There are a number of other considerations:

  • Anonymity – somehow any information about a person’s identity, actual location, or other potentially identifying characteristics would need to be removed from all the data.  I realize this may reduce the value of the data to future researchers but it must be done. I also realize that this may not be an easy thing to accomplish and that is why the data could potentially be sold for a fair price to research libraries. Perhaps after 35 to 100 years or so the identifying information could be re-incorporated into the original data set but I think this highly unlikely.
  • Accessibility – somehow the data would need to have an easily accessible and understandable description that would enable any researcher to understand the underlying format of the data. This description should probably be in XML format or some other universal description language. At a minimum this would need to include meta-data descriptions of the structure of the data, with all the tables, rows and fields completely described. This could be in SQL format or just XML but needs to be made available. Also the data release itself would then need to be available in a database or in flat file formats that could be uploaded by the research libraries and then accessed by researchers. I would expect that this would use some sort of open source database/file service tools such as MySQL or other database engines. These database’s represent the counterpart to book shelves in today’s libraries and has to be universally accessible and forever available.
  • Identifyability – somehow the data releases would need to be universally identifiable, not unlike the ISBN scheme currently in use for books and magazines and ISRC scheme used for recordings. This would allow researchers to uniquely refer to any data set that is used to underpin their research. This would also allow the world’s research libraries to insure that they purchase and maintain all the data that becomes available by using some sort of master worldwide catalog that would hold pointers to all this data that is currently being held in research institutions. Such a catalog entry would represent additional meta-data for the data release and would represent a counterpart to a online library card catalog.
  • Legality – somehow any data release would need to respect any local Data Privacy and Protection laws of the country where the data resides. This could potentially limit the data that is generated in one country, say Germany to be held in that country only. I would think this could be easily accomplished as long as that country would be willing to host all its data in its research institutions.

I am probably forgetting a dozen more considerations but this covers most of it.

How to get companies to release their data

One that quickly comes to mind is how to compel companies to release their data in a timely fashion. I believe that data such as this is inherently valuable to a company but that its corporate value starts to diminish over time and after some time goes to 0.

However, the value to the world of such data encounters an inverse curve. That is, the longer away we are from a specific time period when that data was created, the more value it has for future research endeavors. Just consider what current researchers do with letters, books and magazine articles from the past when they are researching a specific time period in history.

But we need to act now. We are already over 7 years into the Facebook era and mobile phones have been around for decades now. We have probably already lost much of the mobile phone tracking information from the 80’s, 90’s, 00’s and may already be losing the data from the early ’10’s. Some social networks have already risen and gone into a long eclipse where historical data is probably their lowest concern. There is nothing that compels organizations to keep this data around, today.

Types of data to release

Obviously, any social networking data, mobile phone data, or email/chat/texting data should all be available to the world after 7 or more years.  Also the private photo libraries, video feeds, audio recordings, etc. should also be released if not already readily available. Less clear to me are utility data, such as smart power meter readings, water consumption readings, traffic tollway activity, etc.

I would say that one standard to use might be if there is any current research activity based on private, corporate data, then that data should ultimately become available to the world. The downside to this is that companies may be more reluctant to grant such research if this is a criteria to release data.

But maybe the researchers themselves should be able to submit requests for data releases and that way it wouldn’t matter if the companies declined or not.

There is no way, anyone could possibly identify all the data that future researchers would need. So I would err on the side to be more inclusive rather than less inclusive in identifying classes of data to be released.

The dawn of Psychohistory

The Uncharted book above seems to me to represent a first step to realizing a science of Psychohistory as envisioned in Asimov’s Foundation Trilogy. It’s unclear whether this will ever be a true, quantified scientific endeavor but with appropriate data releases, readily available for research, perhaps someday in the future we can help create the science of Psychohistory. In the mean time, through the use of judicious, periodic data releases and appropriate research, we can certainly better understand how the world works and just maybe, improve its internal workings for everyone on the planet.

Comments?

Picture Credit(s): Amazon and Wikipedia 

Cheap phones + big data = better world

Big data visualization, Facebook friend connections, Data science
Facebook friend carrousel by antjeverena (cc) (from flickr)

Read an article today in MIT Technical Review website (Big data from cheap phones) that shows how cheap phones, call detail records (CDRs) and other phone logs can be used to help fight disease and help understand disaster impacts.

Cheap phones generate big data

In one example, researchers took cell phone data from Kenya and used it to plot people movements throughout the country. What they were looking for is people who frequented malaria disease hot spots so that they could try to intervene in the transmission of this disease. Researchers discovered one region (cell tower) that had many people that were frequenting a particular bad location for malaria.  It turned out the region they identified had a large plantation with many migrant workers. These workers moved around a lot.  In order to reduce the transmission of the disease public health authorities could target this region to use more bed nets or try to reduce infestation at source of the disease.  In either case, people mobility was easier to see with cell phone data than actually putting people on the ground and counting where people go or come from.

In another example, researchers took cell phone data from Haiti before and after the earthquake and were able to calculate how many people were in the region hardest hit by the earthquake.  They were also able to identify how many people left the region and where the went to.  As a follow on to this, researchers were able to in real time show how many people had fled the cholera epidemic.

Gaining access to cheap phone data

Most of this call detail record data is limited to specific researchers for very specialized activities requested by the host countries. But recently  Orange released 2.5 billion cell phone call and text data records for five million customers they have in Ivory Coast that occurred during five months time.  They released the data to the public under some specific restrictions in order to see what data scientists could do with it. The papers detailing their activities will be published at a MIT Data for Development conference.

~~~~

Big data’s contribution to a better world is just beginning but from what we see here there’s real value in data that already exists, if only the data were made more widely available.

Comments?

Big science/big data ENCODE project decodes “Junk DNA”

Project ENCODE (ENCyclopedia of DNA Elements) results were recently announced. The ENCODE project was done by a consortium of over 400 researchers from 32 institutions and has deciphered the functionality of so called Junk DNA in the human genome. They have determined that junk DNA is actually used to regulate gene expression.  Or junk DNA is really on-off switches for protein encoding DNA.  ENCODE project results were published by Nature,  Scientific American, New York Times and others.

The paper in Nature ENCODE Explained is probably the best introduction to  the project. But probably the best resource on the project computational aspects comes from these papers at Nature, The making of ENCODE lessons for BIG data projects by Ewan Birney and ENCODE: the human encyclopedia by Brendan Maher.

I have been following the Bioinformatics/DNA scene for some time now. (Please see Genome Informatics …, DITS, Codons, & Chromozones …, DNA Computing …, DNA Computing … – part 2).  But this is perhaps the first time it has all come together to explain the architecture of DNA and potentially how it all works together to define a human.

Project ENCODE results

It seems like there were at least four major results from the project.

  • Junk DNA is actually programming for protein production in a cell.  Scientists previously estimated that <3% of human DNA’s over 3 billion base pairs encode for proteins.  Recent ENCODE results seem to indicate that at least 9% of this human DNA and potentially, as much as 50% provide regulation for when to use those protein encoding DNA.
  • Regulation DNA undergoes a lot of evolutionary drift. That is it seems to be heavily modified across species. For instance, protein encoding genes seem to be fairly static and differ very little between species. On the the other hand, regulating DNA varies widely between these very same species.  One downside to all this evolutionary variation is that regulatory DNA also seems to be the location for many inherited diseases.
  • Project ENCODE has further narrowed the “Known Unknowns” of human DNA.  For instance, about 80% of human DNA is transcribed by RNA. Which means on top of the <3% protein encoding DNA and ~9-50% regulation DNA already identified, there is another 68 to 27% of DNA that do something important to help cells transform DNA into life giving proteins. What that residual DNA does is TBD and is subject for the next phase of the ENCODE project (see below).
  • There are cell specific regulation DNA.  That is there are regulation DNA that are specifically activated if it’s bone cell, skin cell, liver cell, etc.  Such cell specific regulatory DNA helps to generate the cells necessary to create each of our organs and regulate their functions.  I suppose this was a foregone conclusion but it’s proven now

There are promoter regulatory DNA which are located ahead and in close proximity to the proteins that are being encoded and enhancer/inhibitor regulatory DNA which are located a far DNA distance away from the proteins they regulate.

I believe it seems that we are seeing two different evolutionary time frames being represented in the promoter vs. enhancer/inhibitor regulatory DNA.  Whereas promoter DNA seem closely associated with protein encoding DNA, the enhancer DNA seems more like patches or hacks that fixed problems in the original promoter-protein encoding DNA sequences, sort of like patch Tuesday DNA that fixes problems with the original regulation activity.

While I am excited about Project ENCODE results. I find the big science/big data aspects somewhat more interesting.

Genome Big Science/Big Data at work

Some stats from the ENCODE Project:

  • Almost 1650 experiments on around 180 cell types were conducted to generate data for the ENCODE project.   All told almost 12,000 files were analyzed from these experiments.
  • 15TB of data were used in the project
  • ENCODE project internal Wiki had 18.5K page edits and almost 250K page views.

With this much work going on around the world, data quality control was a necessary, ongoing consideration.   It took about half way into the project before they figured out  how to define and assess data quality from experiments.   What emerged from this was a set of published data standards (see data quality UCSC website) used to determine if experimental data were to be accepted or rejected as input to the project.  In the end the retrospectively applied the data quality standards to the earlier experiments and had to jettison some that were scientifically important but exhibited low data quality.

There was a separation between the data generation team (experimenters) and the data analysis team.  The data quality guidelines represented a key criteria that governed these two team interactions.

Apparently the real analysis began when they started layering the base level experiments on top of one another.  This layering activity led to researchers further identifying the interactions and associations between regulatory DNA and protein encoding DNA.

All the data from the ENCODE project has been released and are available to anyone interested. They also have provided a search and browser capability for the data. All this can be found on the top UCSC website. Further, from this same site one can download the software tools used to analyze, browse and search the data if necessary.

This multi-year project had an interesting management team that created a “spine of leadership”.  This team consisted of a few leading scientists and a few full time scientifically aware project officers that held the project together, pushed it along and over time delivered the results.

There were also a set of elaborate rules that were crafted so that all the institutions, researchers and management could interact without friction.  This included rules guiding data quality (discussed above), codes of conduct, data release process, etc.

What no Hadoop?

What I didn’t find was any details on the backend server, network or storage used by the project or the generic data analysis tools.  I suspect Hadoop, MapReduce, HBase, etc. were somehow involved but could find no reference to this.

I expected with the different experiments and wide variety of data fusion going on that there would be some MapReduce scripting that would transcribe the data so it could be further analyzed by other project tools.  Alas, I didn’t find any information about these tools in the 30+ research papers that were published in the last week or so.

It looks like the genomic analysis tools used in the ENCODE project are all open source. They useh the OpenHelix project deliverables.  But even a search of the project didn’t reveal any hadoop references.

~~~~

The ENCODE pilot project (2003-2007) cost ~$53M, the full ENCODE project’s recent results cost somewhere around $130M and they are now looking to the next stage of the ENCODE project estimated to cost ~$123M.  Of course there are 1000s of more human cell types that need to be examined and ~30% more DNA that needs to be figured out. But this all seems relatively straight forward now that the ENCODE project has laid out an architectural framework for human DNA.

Anyone out there that knows more about the data processing/data analytics side of the ENCODE project please drop me a line.  I would love to hear more about it or you can always comment here.

Comments?

Image: From Project Encode, Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

How has IBM research changed?

20111207-204420.jpg
IBM Neuromorphic Chip (from Wired story)

What does Watson, Neuromorphic chips and race track memory have in common. They have all emerged out of IBM research labs.

I have been wondering for some time now how it is that a company known for it’s cutting edge research but lack of product breakthrough has transformed itself into an innovation machine.

There has been a sea change in the research at IBM that is behind the recent productization of tecnology.

Talking the past couple of days with various IBMers at STGs Smarter Computing Forum, I have formulate a preliminary hypothesis.

At first I heard that there was a change in the way research is reviewed for product potential. Nowadays, it almost takes a business case for research projects to be approved and funded. And the business case needs to contain a plan as to how it will eventually reach profitability for any project.

In the past it was often said that IBM invented a lot of technology but productized only a little of it. Much of their technology would emerge in other peoples products and IBM would not recieve anything for their efforts (other than some belated recognition for their research contribution).

Nowadays, its more likely that research not productized by IBM is at least licensed from them after they have patented the crucial technologies that underpin the advance. But it’s just as likely if it has something to do with IT, the project will end up as a product.

One executive at STG sees three phases to IBM research spanning the last 50 years or so.

Phase I The ivory tower:

IBM research during the Ivory Tower Era looked a lot like research universities but without the tenure of true professorships. Much of the research of this era was in materials and pure mathematics.

I suppose one example of this period was Mandlebrot and fractals. It probably had a lot of applications but little of them ended up in IBM products and mostly it advanced the theory and practice of pure mathematics/systems science.

Such research had little to do with the problems of IT or IBM’s customers. The fact that it created pretty pictures and a way of seeing nature in a different light was an advance to mankind but it didn’t have much if any of an impact to IBM’s bottom line.

Phase II Joint project teams

In IBM research’s phase II, the decision process on which research to move forward on now had people from not just IBM research but also product division people. At least now there could be a discussion across IBM’s various divisions on how the technology could enhance customer outcomes. I am certain profitability wasn’t often discussed but at least it was no longer purposefully ignored.

I suppose over time these discussions became more grounded in fact and business cases rather than just the belief in the value of the research for research sake. Technological roadmaps and projects were now looked at from how well they could impact customer outcomes and how such technology enabled new products and solutions to come to market.

Phase III Researchers and product people intermingle

The final step in IBM transformation of research involved the human element. People started moving around.

Researchers were assigned to the field and to product groups and product people were brought into the research organization. By doing this, ideas could cross fertilize, applications could be envisioned and the last finishing touches needed by new technology could be envisioned, funded and implemented. This probably led to the most productive transition of researchers into product developers.

On the flip side when researchers returned back from their multi-year product/field assignments they brought a new found appreciation of problems encountered in the real world. That combined with their in depth understanding of where technology could go helped show the path that could take research projects into new more fruitful (at least to IBM customers) arenas. This movement of people provided the final piece in grounding research in areas that could solve customer problems.

In the end, many research projects at IBM may fail but if they succeed they have the potential to make change IT as we know it.

I heard today that there were 700 to 800 projects in IBM research today if any of them have the potential we see in the products shown today like Watson in Healthcare and Neuromorphic chips, exciting times are ahead.

Big data and eMedicine combine to improve healthcare

fix_me by ~! (cc) (from Flickr)
fix_me by ~! (cc) (from Flickr)

We have talked before ePathology and data growth, but Technology Review recently reported that researchers at Stanford University have used Electronic Medical Records (EMR) from multiple medical institutions to identify a new harmful drug interaction. Apparently, they found that when patients take Paxil (a depressant) and Pravachol (a cholresterol reducer) together, the drugs interact to raise blood sugar similar to what diabetics have.

Data analytics to the rescue

The researchers started out looking for new drug interactions which could result in conditions seen by diabetics. Their initial study showed a strong signal that taking both Paxil and Pravachol could be a problem.

Their study used FDA Adverse Event Reports (AERs) data that hospitals and medical care institutions record.  Originally, the researchers at Stanford’s Biomedical Informatics group used AERs available at Stanford University School of Medicine but found that although they had a clear signal that there could be a problem, they didn’t have sufficient data to statistically prove the combined drug interaction.

They then went out to Harvard Medical School and Vanderbilt University and asked that to access their AERs to add to their data.  With the combined data, the researchers were now able to clearly see and statistically prove the adverse interactions between the two drugs.

But how did they analyze the data?

I could find no information about what tools the biomedical informatics researchers used to analyze the set of AERs they amassed,  but it wouldn’t surprise me to find out that Hadoop played a part in this activity.  It would seem to be a natural fit to use Hadoop and MapReduce to aggregate the AERs together into a semi-structured data set and reduce this data set to extract the AERs which matched their interaction profile.

Then again, it’s entirely possible that they used a standard database analytics tool to do the work.  After all, we were only talking about a 100 to 200K records or so.

Nonetheless, the Technology Review article stated that some large hospitals and medical institutions using EMR are starting to have database analysts (maybe data scientists) on staff to mine their record data and electronic information to help improve healthcare.

Although EMR was originally envisioned as a way to keep better track of individual patients, when a single patient’s data is combined with 1000s more patients one creates something entirely different, something that can be mined to extract information.  Such a data repository can be used to ask questions about healthcare inconceivable before.

—-

Digitized medical imagery (X-Rays, MRIs, & CAT scans), E-pathology and now EMR are together giving rise to a new form of electronic medicine or E-Medicine.  With everything being digitized, securely accessed and amenable to big data analytics medical care as we know is about to undergo a paradigm shift.

Big data and eMedicine combined together are about to change healthcare for the better.

Big data – part 3

Linkedin maps data visualization by luc legay (cc) (from Flickr)
Linkedin maps data visualization by luc legay (cc) (from Flickr)

I have renamed this series to “Big data” because it’s no longer just about Hadoop (see Hadoop – part 1 & Hadoop – part 2 posts).

To try to partition this space just a bit, there is unstructured data analysis and structured data analysis. Hadoop is used to analyze un-structured data (although Hadoop is used to parse and structure the data).

On the other hand, for structured data there are a number of other options currently available. Namely:

  • EMC Greenplum – a relational database that is available in a software only as well as now as a hardware appliance. Greenplum supports both row or column oriented data structuring and has support for policy based data placement across multiple storage tiers. There is a packaged solution that consists of Greenplum software and a Hadoop distribution running on a GreenPlum appliance.
  • HP Vertica – a column oriented, relational database that is available currently in a software only distribution. Vertica supports aggressive data compression and provides high throughput query performance. They were early supporters of Hadoop integration providing Hadoop MapReduce and Pig API connectors to provide Hadoop access to data in Vertica databases and job scheduling integration.
  • IBM Netezza – a relational database system that is based on proprietary hardware analysis engine configured in a blade system. Netezza is the second oldest solution on this list (see Teradata for the oldest). Since the acquisition by IBM, Netezza now provides their highest performing solution on IBM blade hardware but all of their systems depend on purpose built, FPGA chips designed to perform high speed queries across relational data. Netezza has a number of partners and/or homegrown solutions that provide specialized analysis for specific verticals such as retail, telcom, finserv, and others. Also, Netezza provides tight integration with various Oracle functionality but there doesn’t appear to be much direct integration with Hadoop on thier website.
  • ParAccel – a column based, relational database that is available in a software only solution. ParAccel offers a number of storage deployment options including an all in-memory database, DAS database or SSD database. In addition, ParAccel offers a Blended Scan approach providing a two tier database structure with DAS and SAN storage. There appears to be some integration with Hadoop indicating that data stored in HDFS and structured by MapReduce can be loaded and analyzed by ParAccel.
  • Teradata – a relational databases that is based on a proprietary purpose built appliance hardware. Teradata recently came out with an all SSD, solution which provides very high performance for database queries. The company was started in 1979 and has been very successful in retail, telcom and finserv verticals and offer a number of special purpose applications supporting data analysis for these and other verticals. There appears to be some integration with Hadoop but it’s not prominent on their website.

Probably missing a few other solutions but these appear to be the main ones at the moment.

In any case both Hadoop and most of it’s software-only, structured competition are based on a massively parrallelized/share nothing set of linux servers. The two hardware based solutions listed above (Teradata and Netezza) also operate in a massive parallel processing mode to load and analyze data. Such solutions provide scale-out performance at a reasonable cost to support very large databases (PB of data).

Now that EMC owns Greenplum and HP owns Vertica, we are likely to see more appliance based packaging options for both of these offerings. EMC has taken the lead here and have already announced Greenplum specific appliance packages.

—-

One lingering question about these solutions is why don’t customers use current traditional database systems (Oracle, DB2, Postgres, MySQL) to do this analysis. The answer seems to lie in the fact that these traditional solutions are not massively parallelized. Thus, doing this analysis on TB or PB of data would take a too long. Moreover, the cost to support data analysis with traditional database solutions over PB of data would be prohibitive. For these reasons and the fact that compute power has become so cheap nowadays, structured data analytics for large databases has migrated to these special purpose, massively parallelized solutions.

Comments?

Hadoop – part 2

Hadoop Graphic (c) 2011 Silverton Consulting
Hadoop Graphic (c) 2011 Silverton Consulting

(Sorry about the length).

In part 1 we discussed some of Hadoop’s core characteristics with respect to the Hadoop distributed file system (HDFS) and the MapReduce analytics engine. Now in part 2 we promised to discuss some of the other projects that have emerged to make Hadoop and specifically MapReduce even easier to use to analyze unstructured data.

Specifically, we have a set of tools which use Hadoop to construct a database like out of unstructured data.  Namely,

  • Casandra – which maps HDFS data into a database but into a columnar based sparse table structure rather than the more traditional relational database row form. Cassandra was written by Facebook for Mbox search. Columnar databases support a sparse data much more efficiently.  Data access is via a Thrift based API supporting many languages.  Casandra’s data model is based on column, column families and column super-families. The datum for any column item is a three value structure and consists of a name, value of item and a time stamp.  One nice thing about Cassandra is that one can tune it for any consistency model one requires, from no consistency to always consistent and points inbetween.  Also Casandra is optimized for writes.  Cassandra can be used as the Map portion of a MapReduce run.
  • Hbase – which also maps HDFS data into a database like structure and provides Java API access to this DB.  Hbase is useful for million row tables with arbitrary column counts. Apparently Hbase is an outgrowth of Google’s Bigtable which did much the same thing only against the Google file system (GFS).  In contrast to Hive below Hbase doesn’t run on top of MapReduce rather it replaces MapReduce, however it can be used as a source or target of MapReduce operations.  Also, Hbase is somewhat tuned for random access read operations and as such, can be used to support some transaction oriented applications.  Moreover, Hbase can run on HDFS or Amazon S3 infrastructure.
  • Hive – which maps a” simple SQL” (called QL) ontop of a data warehouse built on Hadoop.  Some of these queries may take a long time to execute and as the HDFS data is unstructured the map function must extract the data using a database like schema into something approximating a relational database. Hive operates ontop of Hadoop’s MapReduce function.
  • Hypertable – is a Google open source project which is a  c++ implementation of BigTable only using HDFS rather than GFS .  Actually Hypertable can use any distributed file systemand and is another columnar database (like Cassandra above) but only supports columns and column families.   Hypertable supports both a client (c++) and Thrift API.  Also Hypertable is written in c++ and is considered the most optimized of the Hadoop oriented databases (although there is some debate here).
  • Pig – is a dataflow processing (scripting) language built ontop of Hadoop which supports a sort of database interpreter for HDFS  in combination with an interpretive analysis.  Essentially, Pig uses the scripting language and emits a dataflow graph which is then used by MapReduce to analyze the data in HDFS.  Pig supports both batch and interactive execution but can also be used through a Java API.

Hadoop also supports special purpose tools used for very specialized analysis such as

  • Mahout – an Apache open source project which applies machine learning algorithms to HDFS data providing classification, characterization, and other feature extraction.  However, Mahout works on non-Hadoop clusters as well.  Mahout supports 4 techniques: recommendation mining, clustering, classification, and itemset machine learning functions.  While Mahout uses the MapReduce framework of Hadoop, it doesnot appear that Mahout uses Hadoop MapReduce directly but is rather a replacement for MapReduce focused on machine learning activities.
  • Hama – an Apache open source project which is used to perform paralleled matrix and graph computations against Hadoop cluster data.  The focus here is on scientific computation.  Hama also supports non-Hadoop frameworks including BSP and Dryad (DryadLINQ?). Hama operates ontop of MapReduce and can take advantage of Hbase data structures.

There are other tools that have sprung up around Hadoop to make it easier to configure, test and use, namely

  • Chukwa – which is used for monitoring large distributed clusters of servers.
  • ZooKeeper – which is a cluster configuration tool  and distributed serialization manager useful to build large clusters of Hadoop nodes.
  • MRunit – which is used to unit test MapReduce programs without having to test it on the whole cluster.
  • Whirr – which extends HDFS to use cloud storage services, unclear how well this would work with PBs of data to be processed but maybe it can colocate the data and the compute activities into the same cloud data center.

As for who uses these tools, Facebook uses Hive and Cassandra, Yahoo uses Pig, Google uses Hypertable and there are myriad users of the other projects as well.  In most cases the company identified in the previous list developed the program source code originally, and then contributed it to the Apache for use in the Hadoop open source project. In addition, those companies continue to fix, support and enhance these packages as well.