Where should IoT data be processed – part 2

I wrote a post a while back on Where should IoT data be processed – part 1. We will get back to that post in a moment, but recently I read an article (How big data forced the hunt for ET intelligence to evolve) which mentioned that, after 20 years, SETI@home was shutting down.

SETI@home was a crowdsourced computational network that took snippets of radio spectrum and sent them to thousands of home computers to be analyzed during idle compute time; once processed, the analysis was sent back to SETI@home. It was one of the first projects to use a crowdsourced approach to data processing. The data was collected at a radio telescope, sent to SETI@home, and distributed from there.

6 Factors for IoT data processing

In that post I talked about 6 factors that should help determine where data is processed. Those 6 factors are:

  • Data size, a measure of the amount (GBs, TBs or PBs) of data being generated at an IoT node.
  • Data pipe availability, which is all about the networking bandwidth available at the IoT node. If there is only some sort of low-bandwidth network access, then it probably makes sense to process the data more locally and send only the results of that processing up the stack.
  • Processing criticality, which indicates how important the processing of the data is. If the processing could save a life, then maybe it should be done as close as possible to where the data is generated. If the data processing is less critical, it could perhaps be done at other nodes in an IoT network.
  • Processing time and infrastructure cost, which is all about what sort of computational resources are required to perform the processing and how much they would cost. If processing the data requires multiple passes, multi-core CPUs or GPUs, then moving the data off the IoT node and onto a more capable server could make sense.
  • Compliance, governance and archive requirements, which covered the potential need for all data to be available for regulatory audits; such data may need to reside at a central location anyway, so why not perform the processing there.
  • Data information funnel, which suggested that an IoT network should be configured in layers and that each layer in the stack should probably be responsible for some portion of the data processing needed by the overall system, if nothing more than compressing the information before it is sent elsewhere.

Now that I review the list, the last factor, the data information funnel, really should be a function of the other factors rather than a separate factor.

In that blog post I promised to follow it up with some examples of this logic applied to real-world problems. SETI is the first one I've seen in the literature.

SETI’s IoT processing problem

[Photo: closeup front view of one 6-meter offset-Gregorian antenna of the Allen Telescope Array, the radio telescope built by the University of California at Berkeley for combined radio astronomy and SETI research; the first phase of 42 antennas was completed in 2007, with 350 planned, covering 0.5 to 11.2 GHz.]

The SETI researchers found that “The telescopes are now capable of producing so much data that it’s not possible to get that volume of data out to volunteers,” and that “The discovery space is in these massive, massive data streams. And it’s just not efficient to distribute many terabits per second out to volunteers all over the world. It’s more efficient for that data processing to happen at the actual observatory.”

So they moved the data processing for the SETI IoT network from being distributed out to home computers throughout the world to being done at the source (the telescope) where the data was originally generated.

This decision seems to rely on a couple of the factors above, namely data pipe availability and data size. They had to move processing because no pipes existed to send terabits per second of data out to thousands of home computers. And processing time and infrastructure cost have come down so much that it was simply easier to do the processing onsite.

It doesn’t seem like processing criticality or compliance-governance-archive had any bearing on the decision.

So there’s the first example that seems to fit well into our data processing framework.

~~~~

We ought to be able to come up with a formula that uses all these factors and comes up with a yes or no as to whether to process the data on the node or not.
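To make that concrete, here is a minimal sketch in Python of what such a formula might look like. The inputs map to the factors listed above, but every weight, scale and the 0.5 threshold are assumptions I've made up for illustration, not a calibrated model.

```python
# Sketch of a yes/no "process at the IoT node?" formula built from the factors
# above. The weights, scaling and the 0.5 threshold are illustrative assumptions.

def process_at_node(data_size_tb_per_day: float,   # data generated at the node
                    uplink_gbps: float,            # pipe available at the node
                    criticality: float,            # 0 = don't care .. 1 = life-safety
                    local_cost_ratio: float,       # local processing cost / central cost
                    must_archive_centrally: bool   # compliance forces data central anyway
                    ) -> bool:
    """Return True if processing should stay at (or near) the IoT node."""
    score = 0.0
    # Big data plus a small pipe pushes processing toward the node.
    score += 0.4 * min(1.0, data_size_tb_per_day / (uplink_gbps * 10.0 + 1e-9))
    # Critical processing should stay close to where the data is generated.
    score += 0.3 * criticality
    # Cheap local compute (ratio < 1) also favors the node.
    score += 0.3 * max(0.0, 1.0 - local_cost_ratio)
    # If regulations force the raw data to a central archive anyway, discount
    # the benefit of processing it locally.
    if must_archive_centrally:
        score *= 0.5
    return score >= 0.5

# SETI-like case: huge data stream, modest uplink, low criticality, onsite
# compute now roughly half the cost of shipping it out -> process at the telescope.
print(process_at_node(data_size_tb_per_day=500, uplink_gbps=1.0, criticality=0.1,
                      local_cost_ratio=0.5, must_archive_centrally=False))   # True
```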


Free P2P-Cloud Storage and Computing Services?

[Image: FFT graph from SETI@home]

What would happen if somebody came up with a peer-to-peer cloud (P2P-Cloud) storage or computing service? I see this as:

  • Operating a little like Napster/Gnutella, where many people come together and share out their storage/computing resources.
  • It could operate in a centralized or decentralized fashion.
  • It would allow access to data/computing resources from anywhere on the internet.

Everyone joining the P2P-Cloud would need to set aside computing and/or storage resources they were willing to devote to the cloud. By doing so, they would gain access to an equivalent amount (minus overhead) of other nodes' computing and storage resources to use as they see fit.

P2P-Cloud Storage

For cloud storage, the P2P-Cloud would create a common cloud data repository spread across all nodes in the network (a toy sketch of chunk placement follows the list):

  • Data would be distributed across the network in such a way as to allow reconstruction within any reasonable time frame and to tolerate any reasonable number of node outages without loss of data.
  • Data would be encrypted before being sent to the cloud, rendering it unreadable without the key.
  • Data would NOT necessarily be shared, but would be hosted on other users' systems.
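As a rough illustration of how those three points might fit together, here is a toy Python sketch of a node encrypting a file locally, splitting the ciphertext into chunks, and placing each chunk on several peers. It assumes the third-party cryptography package for encryption; the chunk size, replication factor and peer names are made-up illustrations, not a real protocol.

```python
# Toy sketch: encrypt locally, split into chunks, and place each chunk on
# several peers so any single peer outage loses nothing. The chunk size,
# replication factor and peer names are illustrative assumptions.
import hashlib
from cryptography.fernet import Fernet   # pip install cryptography

CHUNK_SIZE = 1 << 20   # 1 MiB chunks
REPLICAS = 3           # each chunk lives on 3 different peers

def place_file(data: bytes, peers: list, key: bytes) -> dict:
    """Return a placement map {chunk_id: [peer, ...]} for the encrypted data."""
    ciphertext = Fernet(key).encrypt(data)     # peers only ever see ciphertext
    chunks = [ciphertext[i:i + CHUNK_SIZE]
              for i in range(0, len(ciphertext), CHUNK_SIZE)]
    placement = {}
    for n, chunk in enumerate(chunks):
        chunk_id = hashlib.sha256(chunk).hexdigest()
        # Spread replicas across distinct peers, round-robin style.
        placement[chunk_id] = [peers[(n + r) % len(peers)] for r in range(REPLICAS)]
    return placement

key = Fernet.generate_key()                    # stays with the data owner
print(place_file(b"backup bytes" * 100000,
                 ["peerA", "peerB", "peerC", "peerD"], key))
```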

As such, if I were to offer up 100GB of storage to the P2P-Cloud, I would get at least 100GB (less overhead) of protected storage elsewhere on the cloud to use as I see fit. Some percentage of this would be lost to administration (say 1-3%) and to redundancy protection (say ~25%), but the remaining ~72GB of off-site storage could be very useful for DR purposes.

P2P-Cloud storage would provide a reliable, secure, distributed file repository that could be easily accessible from any internet location. At a minimum, the service would be free and equivalent to what someone supplies (less overhead) to the P2P-Cloud Storage service. If storage needs exceeded your commitment, more cloud storage could be provided at a modest cost to the consumer. Such fees would be shared by all the participants offering excess [= offered – (consumed + overhead)] storage to the cloud.
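To make the overhead arithmetic and the excess formula concrete, here is a tiny sketch; the 3% admin and 25% redundancy figures are the rough guesses from the paragraphs above, not measured numbers.

```python
# Usable off-site storage after overhead, plus the "excess" capacity that
# could earn a share of fees. Overhead percentages are the text's own guesses.
def usable_storage_gb(offered_gb, admin_pct=0.03, redundancy_pct=0.25):
    return offered_gb * (1.0 - admin_pct - redundancy_pct)

def excess_storage_gb(offered_gb, consumed_gb, admin_pct=0.03, redundancy_pct=0.25):
    # excess = offered - (consumed + overhead); only excess capacity earns fees
    overhead_gb = offered_gb * (admin_pct + redundancy_pct)
    return offered_gb - (consumed_gb + overhead_gb)

print(usable_storage_gb(100))       # ~72 GB, matching the example above
print(excess_storage_gb(100, 50))   # ~22 GB of spare capacity others could buy
```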

P2P-Cloud Computing

Cloud computing is definitely more complex, but generally follows the SETI@home/BOINC model (a rough scheduling sketch follows the list):

  • P2P-Cloud computing suppliers would agree to run something like a “new screensaver” which would perform computation while displaying a viable screensaver.
  • Whenever the screensaver was invoked, it would resume execution of the last assigned processing unit. Intermediate work results would need to be saved and, when a unit completed, the answer would be sent to the requester and a new processing unit assigned.
  • Processing units would be assigned by the P2P-Cloud computing consumer, would be timeout-able, and would be re-assignable at will.
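Here is a rough sketch of the consumer-side bookkeeping implied by that list: hand a processing unit to a supplier, time it out if no result arrives, and reassign it to another node. The class names, fields and the six-hour timeout are illustrative assumptions, not any real BOINC or Condor API.

```python
# Rough sketch of consumer-side work-unit scheduling: assign a unit to a
# supplier, time it out if no result arrives, and hand it to another node.
import time
from dataclasses import dataclass
from typing import Optional

TIMEOUT_SECS = 6 * 3600      # assumed deadline before a unit is reassigned

@dataclass
class WorkUnit:
    unit_id: int
    payload: bytes                       # the data + code the supplier needs
    assigned_to: Optional[str] = None
    assigned_at: Optional[float] = None
    result: Optional[bytes] = None

class Scheduler:
    def __init__(self, units):
        self.units = {u.unit_id: u for u in units}

    def assign(self, supplier: str) -> Optional[WorkUnit]:
        """Hand the supplier an unassigned or timed-out unit, if any remain."""
        now = time.time()
        for u in self.units.values():
            timed_out = u.assigned_at is not None and now - u.assigned_at > TIMEOUT_SECS
            if u.result is None and (u.assigned_to is None or timed_out):
                u.assigned_to, u.assigned_at = supplier, now
                return u
        return None

    def complete(self, unit_id: int, result: bytes) -> None:
        """Record a finished unit so it is never handed out again."""
        self.units[unit_id].result = result

sched = Scheduler([WorkUnit(i, payload=b"...") for i in range(3)])
unit = sched.assign("screensaver-node-42")   # supplier asks for work
sched.complete(unit.unit_id, b"answer")      # supplier returns its result
```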

Computing users won’t gain much if the computing time they consume is <= the computing time they offer (less overhead). However, computing time offset may be worth something, i.e., computing time now might be more valuable than computing time tonight, which may offer a slight margin of value to help get this off the ground. As such, P2P-Cloud computing suppliers would need to be able to specify when computing resources might be mostly available, along with their type, quality and quantity.

It’s unclear how to secure the processing unit, and this makes legal issues more prevalent. That may not be much of a problem, as a complex distributed computing task makes little sense in isolation. But the (il)legality of some data processing activities could conceivably put the provider in a precarious position. (Somebody from the legal profession would need to clarify all this, but I would think that some “Amazon EC2”-like licensing might offer safe harbor here.)

P2P-Cloud computing services wouldn’t necessarily be amenable to more normal, non-distributed or linear computing tasks, but one could view these as just a primitive version of distributed computing tasks. In either case, any data needed for computation would need to be sent along with the computing software to be run on a distributed node. Whether it’s worth the effort is something for users to debate.

BOINC can provide a useful model here.  Also, the Condor(R) project at U. of Wisconsin/Madison can provide a similar framework for scheduling the work of a “less distributed” computing task model.  In my mind, both types of services ultimately need to be provided.

To attract more compute servers, SETI@home and similar BOINC projects rely on the appeal of doing good deeds. As such, if you can make your computing task do something of value to most users, then maybe that’s enough; in that case, I would suggest joining up as a BOINC project. For the rest of us, doing more mundane data processing, just offering our compute services to the P2P-Cloud will have to suffice.

Starting up the P2P-Cloud

Bootstrapping the P2P-Cloud might take some effort, but once going it should be self-sustaining (assuming no centralized infrastructure). I envision an open source solution, taking off from the work done on Napster & Gnutella and/or BOINC & Condor.

I believe the P2P-Cloud Storage service would be the easiest to get started. BOINC and SETI@home (see the list of active BOINC projects) have been around a lot longer than cloud storage, but their existence suggests that, with the right incentives, even the P2P-Cloud Computing service can make sense.