Big open data leads to citizen science

Read an article the other day in ScienceLine about the Astronomical Data Explosion.  It appears that as international observatories start to open up their archives and astronomical data to anyone and everyone, people are starting to do useful science with it.

Hunting for planets

The story talked about a pair of amateur astronomers who were looking through Kepler telescope data that had recently been put online (see PlanetHunters.org), searching for anomalies that could signal the presence of a planet.  They saw a dimming of a particular star’s brightness and then saw it again 132 days later. At that point they brought it to the attention of professional astronomers, who later determined that what they had found was a planet in a four-star system, which they nicknamed Tatooine.

It seems the latest astronomical observations coming in from the Kepler, Sloan Digital Sky Survey and Hubble observatories are generating a deluge of data. And although all this data is being subjected to intense scrutiny by professional astronomers, they can’t do everything they would like to do with it.

Consequently, astronomy today has entered a new world of abundant data but not enough resources to do all the science that could be done.  This is where the citizen or amateur scientist enters the picture. Using standard, web-accessible tools, they can put the data in front of many more pairs of eyes, each looking for whatever interests them, and in doing so can often contribute real science.

Citizen science platforms

It turns out PlanetHunters.org is one of a number of similar websites put up by Zooniverse to support citizen science in astronomy, biology, nature, climate and the humanities. Their latest project is to classify animals found in snapshots taken on the Serengeti (see SnapshotSerengeti.org).

Of course, crowdsourced scientific activity like this has been going on for a long time now, with BOINC projects like the SETI@home screen saver that sifted through radio signals searching for extraterrestrial signals. But that made use of the spare desktop compute cycles people were wasting on screen savers.


In contrast, Zooniverse started with the GalaxyZoo project (original retired site here). They put Hubble telescope images online and asked amateur astronomers to classify the types of galaxies found in the images.

GalaxyZoo had modest aspirations at first, but when they put the Hubble images online their servers were overwhelmed with the response and had to be beefed up considerably to deal with the traffic.  Over time, they were able to collect literally millions of galaxy classifications. Now they want more: the latest incarnation of GalaxyZoo has put the brightest 250K galaxies online and is asking for even finer, more detailed classifications of them.

Today’s Zooniverse projects are taking advantage of recent large and expanding data repositories, plus newer data visualization tools, to apply human analysis to their data.  Automated tools are not yet sophisticated enough to classify images as well as a human can.

One criterion for Zooniverse projects is a massive amount of data that needs to be classified.  In this way, science is once again returning to its amateur roots, but this time guided by professionals.  Together we can do more than either could do apart.

~~~~

I suppose it was only a matter of time before scientists got inundated with more data than they could process effectively.  Having the ability to put all this data online, parcel it out to concerned citizens and ask them to help understand/classify it has brought a new dawn for citizen science.

Comments?

Photo credits:
Twin Suns on Mos Espa by Stéfan
BOINC running SETI@home by Keng Susumpow
Galaxy Group Stephan’s Quintet by HubbleColor {Zolt}

Free P2P-Cloud Storage and Computing Services?

FFT_graph from Seti@home

What would happen if somebody came up with a peer-to-peer cloud (P2P-Cloud) storage or computing service?  I see this as:

  • Operating a little like Napster/Gnutella, where many people come together and share out their storage/computing resources.
  • It could operate in a centralized or decentralized fashion.
  • It would allow access to data/computing resources from anywhere on the internet.

Everyone joining the P2P-Cloud would need to set aside computing and/or storage resources they were willing to devote to the cloud.  By doing so, they would gain access to an equivalent amount (minus overhead) of other nodes’ computing and storage resources to use as they see fit.

P2P-Cloud Storage

For cloud storage the P2P-Cloud would create a common cloud data repository spread across all nodes in the network:

  • Data would be distributed across the network in such a way that it could be reconstructed within a reasonable time frame and could survive a reasonable number of node outages without loss of data (a toy sketch of this follows the list).
  • Data would be encrypted before being sent to the cloud, rendering it unreadable without the key.
  • Data would NOT necessarily be shared, but would be hosted on other users’ systems.
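To make the pre-upload encryption and redundancy ideas concrete, here is a toy Python sketch.  All function names are mine, and the XOR-keystream “cipher” and single parity shard are deliberately simplistic stand-ins for a real cipher and real erasure coding:

import functools
import hashlib
import operator

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Placeholder 'encryption': XOR the data with a SHA-256 keystream.
    A real client would use a vetted cipher; this only illustrates that
    data gets scrambled *before* it ever leaves the owner's node."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))

def shard_with_parity(blob: bytes, n_data: int = 4):
    """Split a blob into n_data shards plus one XOR parity shard, so the
    loss of any single shard (i.e., one node outage) is survivable."""
    size = -(-len(blob) // n_data)            # ceiling division
    blob = blob.ljust(size * n_data, b"\0")   # pad to an even shard size
    shards = [blob[i * size:(i + 1) * size] for i in range(n_data)]
    parity = bytes(functools.reduce(operator.xor, col) for col in zip(*shards))
    return shards, parity

def rebuild_missing(shards, parity, missing_index):
    """Reconstruct one lost shard by XOR-ing the parity with the survivors."""
    survivors = [s for i, s in enumerate(shards) if i != missing_index]
    return bytes(functools.reduce(operator.xor, col)
                 for col in zip(parity, *survivors))

# Example: encrypt locally, spread 4 data shards + 1 parity shard across
# 5 peer nodes, then recover after one of those nodes drops off the network.
key = b"owner-only secret key"
encrypted = toy_encrypt(b"quarterly backup image ...", key)
shards, parity = shard_with_parity(encrypted)
assert rebuild_missing(shards, parity, missing_index=2) == shards[2]

A production scheme would tolerate more than one lost shard at once, but the flow is the same: encrypt locally, shard with redundancy, scatter the shards across peer nodes.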

As such, if I were to offer up 100GB of storage to the P2P-Cloud, I would get back up to 100GB (less overhead) of protected storage elsewhere in the cloud to use as I see fit.  Some percentage of this would be lost to administration (say 1-3%) and to redundancy protection (say ~25%), but the remaining ~72GB of off-site storage could be very useful for DR purposes.
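As a back-of-the-envelope check on that arithmetic (the 3% administration and 25% redundancy figures are just the rough guesses above, not measured values):

def usable_storage(offered_gb, admin_pct=0.03, redundancy_pct=0.25):
    """Rough usable off-site capacity for a given contribution; the
    overhead percentages are guesses from the text, not measurements."""
    return offered_gb * (1 - admin_pct - redundancy_pct)

print(round(usable_storage(100), 1))   # -> 72.0 GB of protected, off-site storage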

P2P-Cloud storage would provide a reliable, secure, distributed file repository that could be easily accessed from any internet location.  At a minimum, the service would be free and equivalent to what someone supplies (less overhead) to the P2P-Cloud Storage service.  If storage needs exceeded your commitment, more cloud storage could be provided at a modest cost to the consumer.  Such fees would be shared by all the participants offering excess [= offered – (consumed + overhead)] storage to the cloud.
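The bracketed excess formula could be tracked per participant with something as simple as the following (function and variable names are illustrative only):

def excess_storage(offered_gb, consumed_gb, overhead_gb):
    """excess = offered - (consumed + overhead); a participant's share of
    any fees would be proportional to this figure (illustrative only)."""
    return max(0.0, offered_gb - (consumed_gb + overhead_gb))

print(excess_storage(100, 50, 28))     # -> 22 GB eligible for a fee share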

P2P-Cloud Computing

Cloud computing is definitely more complex, but generally follows the SETI@home/BOINC model (a rough sketch of the work-unit bookkeeping follows the list):

  • P2P-Cloud computing suppliers would agree to run something like a “new screensaver” which would perform computation while generating a viable screensaver display.
  • Whenever the screensaver was invoked, it would resume execution on the last assigned processing unit.  Intermediate work results would need to be saved, and when a unit completed, the answer would be sent to the requester and a new processing unit assigned.
  • Processing units would be assigned by the P2P-Cloud computing consumer and would be timeout-able and re-assignable at will.
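A rough Python sketch of that consumer-side bookkeeping, loosely modeled on the BOINC-style flow described above (the class and field names are hypothetical, not an existing API):

import time
from dataclasses import dataclass

@dataclass
class ProcessingUnit:
    """One slice of a consumer's job handed out to a supplier node."""
    unit_id: str
    payload: bytes                    # code + data shipped to the node
    assigned_to: str | None = None
    assigned_at: float | None = None
    checkpoint: bytes | None = None   # saved intermediate work results
    result: bytes | None = None

class Coordinator:
    """Consumer-side bookkeeping: assign units, accept checkpoints and
    results, and time out / reassign units whose node has gone quiet."""

    def __init__(self, timeout_s: float = 3600.0):
        self.timeout_s = timeout_s
        self.units: dict[str, ProcessingUnit] = {}

    def assign(self, unit: ProcessingUnit, node_id: str) -> ProcessingUnit:
        unit.assigned_to, unit.assigned_at = node_id, time.time()
        self.units[unit.unit_id] = unit
        return unit

    def save_checkpoint(self, unit_id: str, partial: bytes) -> None:
        # Called when the supplier's "screensaver" pauses mid-computation.
        self.units[unit_id].checkpoint = partial

    def complete(self, unit_id: str, result: bytes) -> None:
        unit = self.units[unit_id]
        unit.result, unit.assigned_to = result, None

    def reap_stale(self) -> list[ProcessingUnit]:
        """Units not finished within the timeout become assignable again
        (possibly resuming from their last saved checkpoint)."""
        now = time.time()
        stale = [u for u in self.units.values()
                 if u.result is None and u.assigned_at is not None
                 and now - u.assigned_at > self.timeout_s]
        for u in stale:
            u.assigned_to, u.assigned_at = None, None
        return stale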

Computing users won’t gain much if the computing time they consume is <= the computing time they offer (less overhead).  However, time-shifting may be worth something, i.e., computing time now might be more valuable than computing time tonight, which may offer a slight margin of value to help get this off the ground.  As such, P2P-Cloud computing suppliers would need to be able to specify when their computing resources are mostly available, along with their type, quality and quantity.
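A supplier’s declaration of availability, type, quality and quantity might look something like the following sketch (field names are mine, and the overnight-window handling is just one possible convention):

from dataclasses import dataclass

@dataclass
class ComputeOffer:
    """A supplier's declaration of what it will contribute and when.
    Field names are illustrative, not a defined protocol."""
    node_id: str
    cores: int                # quantity
    cpu_class: str            # type, e.g. "x86-64"
    benchmark_score: float    # quality, e.g. a BOINC-style benchmark figure
    available_from_hour: int  # local hour the node is usually idle (0-23)
    available_to_hour: int

    def available_at(self, hour: int) -> bool:
        """True if the offer window covers the given local hour,
        including overnight windows such as 22:00-06:00."""
        lo, hi = self.available_from_hour, self.available_to_hour
        return lo <= hour < hi if lo <= hi else (hour >= lo or hour < hi)

# e.g. a desktop that's idle overnight:
offer = ComputeOffer("node-42", cores=4, cpu_class="x86-64",
                     benchmark_score=1500.0,
                     available_from_hour=22, available_to_hour=6)
assert offer.available_at(2) and not offer.available_at(12)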

It’s unclear how to secure the processing unit, and this makes legal issues more prevalent.  That may not be much of a problem, as a complex distributed computing task makes little sense in isolation. But the (il-)legality of some data processing activities could conceivably put the provider in a precarious position. (Somebody from the legal profession would need to clarify all this, but I would think that some Amazon EC2-like licensing might offer a safe harbor here.)

P2P-Cloud computing services wouldn’t necessarily be amenable to more normal, non-distributed or linear computing tasks, but one could view these as just a primitive form of distributed computing task.  In either case, any data needed for the computation would have to be sent along with the computing software to be run on a distributed node.  Whether it’s worth the effort is something for the users to debate.

BOINC can provide a useful model here.  Also, the Condor(R) project at U. of Wisconsin/Madison can provide a similar framework for scheduling the work of a “less distributed” computing task model.  In my mind, both types of services ultimately need to be provided.

To attract compute servers, SETI@home and similar BOINC projects rely on people’s desire to do good deeds.  As such, if you can make your computing task do something of value to most users, then maybe that’s enough; in that case, I would suggest signing up as a BOINC project. For the rest of us, doing more mundane data processing, just offering our compute services to the P2P-Cloud will have to suffice.

Starting up the P2P-Cloud

Bootstrapping the P2P-Cloud might take some effort, but once going it should be self-sustaining (assuming no centralized infrastructure).  I envision an open-source solution, taking off from the work done on Napster & Gnutella and/or BOINC & Condor.

I believe the P2P-Cloud Storage service would be the easiest to get started.  BOINC and SETI@home (see the list of active BOINC projects) have been around a lot longer than cloud storage, but their existence suggests that, with the right incentives, even the P2P-Cloud Computing service can make sense.