Stanford Data Lab students hit the ground running…

Read an article (Students confront the messiness of data) today about Stanford’s Data Lab  and how their students are trained to cleanup and analyze real world data.

The Data Lab teaches two courses the Data Challenge Lab course and the Data Impact Lab course. The Challenge Lab is an introductory course in data gathering, cleanup and analysis. The Impact Lab is where advanced students tackle real world, high impact problems through data analysis.

Data Challenge Lab

Their Data Challenge Lab course is a 10 week course with no pre-requisites that teaches students how to analyze real world data to solve problems.

Their are no lectures. You’re given project datasets and the tools to manipulate, visualize and analyze the data. Your goal is to master the tools, cleanup the data and gather insights from the data. Professors are there to provide one on one help so you can step through the data provided and understand how to use the tools.

In the information provided on their website there were no references and no information about the specific tools used in the Data Challenge Lab to manipulate, visualize and analyze the data. From an outsiders’ viewpoint it would be great to have a list of references or even websites describing the tools being used and maybe the datasets that are accessed.

Data Impact Lab

The Data Impact lab course is an independent study course, whose only pre-req is the Data Challenge Lab.

Here students are joined into interdisplinary teams with practitioner partners to tackle ongoing, real world problems with their new data analysis capabilities.

There is no set time frame for the course and it is a non-credit activity. But here students help to solve real world problems.

Current projects in the Impact lab include:

  • The California Poverty Project  to create an interactive map of poverty in California to supply geographic guidance to aid agencies helping the poor
  • The Zambia Malaria Project to create an interactive map of malarial infestation to help NGOs and other agencies target remediation activity.

Previous Impact Lab projects include: the Poverty Alleviation Project to provide a multi-dimensional index of poverty status for areas in Kenya so that NGOs can use these maps to target randomized experiments in poverty eradication and the Data Journalism Project to bring data analysis tools to breaking stories and other journalistic endeavors.


Courses like these should be much more widely available. It’s almost the analog to the scientific method, only for the 21st century.

Science has gotten to a point, these days, where data analysis is a core discipline that everyone should know how to do. Maybe it doesn’t have to involve Hadoop but rudimentary data analysis, manipulation, and visualization needs to be in everyone’s tool box.

Data 101 anyone?

Photo Credit(s): Big_Data_Prob | KamiPhuc;

Southbound traffic speeds on Masonic avenue on different dates | Eric Fisher;

Unlucky Haiti (1981-2010) | Jer Thorp;

Bristol Cycling Level by Wards 2011 | Sam Saunders

Domesticating data

4111674475_76be20e180_zRead an article the other day from MIT News (Taming Data) about a new system that scans all your tabular data and provides an easy way to query all this data from one system. The researchers call the system the Data Civilizer.

What does it do

Tabular data seems to be the one constant in corporate data (that and for me PowerPoint and Word docs). Most data bases are tables of one form or another (some row and some column based). Lots of operational data is in spreadsheets (tables by another name) of some type.  And when I look over most IT/Networking/Storage management GUIs, tables (rows and columns) of data are the norm.

156788318_628fb0e4dc_oThe Data Civilizer takes all this tabular data and analyzes it all, column by column, and calculates descriptive characterization statistics for each column.

Numerical data could be characterized by range, standard deviation, median/average, cardinality etc. For textual data a list of words in the column by frequency might suffice. It also indexes every  word in the tables it analyzes.

Armed with its statistical characterization of each column, the Data Civilizer can then generate a similarity index between any two columns of data across the tables it has analyzed. In that way it can connect data in one table with data in another.

Once it has a similarity matrix and has indexed all the words in every table column it has analyzed, it can then map the tabular data, showing which columns look similar to other columns. Then any arbitrary query for data, can be executed on any table that contains similar data supplying the results of the query across the multiple tables it has analyzed.

Potential improvements

The researchers indicated that they currently don’t support every table data format. This may be a sizable task on its own.

In addition statistical characterization or classification seems old school nowadays. Most new AI is moving off statistical analysis to more neural net types of classification. Unclear if you could just feed all the tabular data to a deep learning neural net, but if the end game is to find similarities across disparate data sets, then neural nets are probably a better way to go. How you would combine this with brute force indexing of all tabular data words is another question.


In the end as I look at my company’s information, even most of my Word docs are organized in some sort of table, so cross table queries could help me a lot. Let me know when it can handle Excel and Word docs and I’ll take another look.

Photo Credit(s): Linear system table representation 2 by Ronald O’ Daniel

Glenda Sims by Glendathegood


Big open data leads to citizen science

Read an article the other day in ScienceLine about the Astronomical Data Explosion.  It appears that as international observatories start to open up their archives and their astronomical data to anyone and anybody, people are starting to do useful science with it.

Hunting for planets

The story talked about a pair of amateur astronomers who were looking through Kepler telescope data which had recently been put online (see to find anomalies that signal the possibility of a planet.  They saw a diming of a particular star’s brightness and then saw it again 132 days later. At that point they brought it to the attention of real scientists who later discovered that what they found was a 4 star solar system which they labeled Tatooine.

It seems with all the latest astronomical observations coming in from Kepler, the Sloan Digital Sky Survey and Hubble observatories are generating a deluge of data. And although all this data is being subjected to intense scrutiny by professional astronomers, they can’t do everything they want to do with it.

Consequently, in astronomy today we now have come to a new world of abundant data but not enough resources to do all the science that can be done.  This is where the citizen or amateur scientist enters the picture. Using standard web accessible tools they are able to subject the data to many more eyes each looking for whatever interest spurs them on and as such, can often contribute real science from their efforts.

Citizen science platforms

It turns out is one of a number of similar websites put up by Zooniverse to support citizen science in astronomy, biology, nature, climate and humanities. Their latest project is to classify animal found in snapshots taken on the Serengheti (see

Of course crowdsourced scientific activity like this has been going on for a long time now with Boinc projects like SETI@Home screen savers that sifted through radio signals searching for extra-terestial signals. But that made use of the extra desktop compute cycles people were waisting with screen savers.


In contrast, Zooniverse started with the GalaxyZoo project (original retired site here). They put Hubble telescope images online and asked for amateur astronomers to classify the type of galaxies found in the images.

GalaxyZoo had modest aspirations at first but when they put the Hubble images online their servers were overwhelmed with the response and had to be beefed up considerably to deal with the traffic.  Overtime, they were able to get literally millions of galaxy classifications. Now they want more, and the recent incarnation of GalaxyZoo has put the brightest 250K galaxies online and they are asking for even finer, more detailed classifications of them.

Today’s Zooniverse projects are taking advantage of recent large and expanding data repositories plus newer data visualization tools to help employ human analysis to their data.  Automated tools are not yet sophisticated enough to classify images as well as a human can.

One criteria for Zooniverse projects is to have a massive amount of data which needs to be classified.  In this way, science is once again returning to it’s amateur roots but this time guided by professionals.  Together we can do more than what either could do apart.


I suppose it was only a matter of time before science got inundated with more data than they could process effectively.  Having the ability to put all this data online, parcel it out to concerned citizens and ask them to help understand/classify it has brought a new dawn to citizen science.


Photo credits:
Twin Suns on Mos Espa by Stéfan
BONIC running SETI@Home by Keng Susumpow
Galaxy Group Stephan’s Quintet by HubbleColor {Zolt}

Backup is for (E)discovery too

Electronic Discovery Reference Model (from
Electronic Discovery Reference Model (from

There has been lot’s of talk in twitterverse and elsewhere on how “backup is used for restore and archive is for e-discovery”, but I beg to differ.

If one were to take the time to review the EDRM (Electronic Discovery Reference Model) and analyze what happens during actual e-discovery processes, one would see that nothing is outside the domain of court discovery requests. Backups have and always will hold discoverable data just as online and user desktop/laptop storage do. In contrast, archives are not necessarily a primary source of discoverable data.

In my view, any data not in archive, by definition is online or on user desktop/laptop storage. Once online, data is most likely being backed up periodically and will show up in backups long before it’s moved to archive. Data deletions and other modifications can often be reconstructed from backups much better than from archive (with the possible exception of records management systems). Also, reconstructing data proliferation, such as who had a copy of what data when, is often crucial to court proceedings and normally, can only be reconstructed from backups.

Archives have a number of purposes but primarily it’s to move data that doesn’t change off company storage and out of its backup stream. Another popular reason for archive is to be used to satisfy compliance regimens that require companies to hold data for periods of time, such as mandated by SEC, HIPPA, SOX, and others. For example, SEC brokerage records must be held long after an account goes inactive, HIPPA health records must be held long after a hospital visit, SOX requires corporate records to be held long after corporate transactions transpire. Such records are more for compliance and/or customer back-history request purposes than e-discovery but here again any data stored by the corporation is discoverable.

So I believe it’s wrong to say that Backup is only for restore and archive is only for discovery. Information, anywhere within a company is discoverable. However, I would venture to say that a majority of e-discovery data comes from backups rather than elsewhere.

Now, as for using backups for restore,…