AWS Data Exchange vs Data Banks – part 2

Saw that AWS announced a new Data Exchange service during their AWS Pi Day 2023 event. This is a completely managed service, available on the AWS Marketplace, for monetizing data.

In a prior post on a topic I called data banks (Data banks, data deposits & data withdrawals…), I talked about the need to have some sort of automated support for personal data that would allow us to monetize it.

The hope then (4.5yrs ago) was that social media, search and other web services would supply all the data they have on us back to us and we could then sell it to others that wanted to use it.

In that post, I called the data that social media gave back to us data deposits, the place where that data was held and sold a data bank, and the sale of that data a data withdrawal. (I know talking about bank deposits and withdrawals is probably not a great idea right now, but this was back a ways.)

AWS Data Exchange

1918 Farm Auction by dok1 (cc) (from Flickr)

With AWS Data Exchange, data owners can sell their data to data consumers, and it's a completely AWS managed service. Presumably, one creates an S3 bucket with the data to be sold, determines a price for the data and a period during which clients can access it, and registers all this with AWS; the AWS Data Exchange will then support any number of clients purchasing that data.

Presumably (although unstated in the service announcement), you'd be required to update and curate the data to ensure it's correct and current. But other than that, once the data is on S3 and the offer is in place, you could just sit back and watch the cash come in.

I see the AWS Data Exchange service as a step on the path to data monetization for anyone. Yes, the data has to be on S3, and yes, it's sold via the AWS Marketplace, which means AWS gets a cut of any sale, but it's certainly a step toward a freer data marketplace.

Changes I would like to see in the AWS Data Exchange service

Putting aside the need to have more than just AWS offer such a service (I would wholeheartedly request that all cloud service providers make a data exchange, or something similar, a fully supported offering of their respective storage services), this is not quite the complete data economy or ecosystem that I had envisioned in September of 2018.

If we just focus on the use (data withdrawal) side of a data economy, which is the main thing AWS Data Exchange seems to support, there are quite a few missing features IMHO:

  • Data use restrictions – We don't want customers to obtain a copy of our data. We would very much like to restrict them to reading it, with plain text access only during the period they have paid for. Once that period expires, all copies of the data need to be destroyed programmatically, cryptographically or in some other permanent/verifiable fashion. This can't be done through license restrictions alone, which seems to be AWS Data Exchange's current approach. I'm not sure what a viable alternative might be, but some sort of time-dependent or temporal encryption key that could be expired would be one step; customers would need to install some sort of data exchange agent on the servers using the data that would enforce encrypted access/use (see the sketch after this list).
  • Data traceability – Yes, clients who purchase access should have access to the data for whatever they want to use it for. But there should be some way to trace where our data ended up or what it was used for. If it's used to help train a NN, then I would like to see some sort of provenance or certificate applied to that NN, in a standardized structure, indicating that it made use of our data as part of its training. Similarly, if it's part of an online display tool, somewhere in the footnotes of the UI there would be a data origins certificate list with some way to point back to our data as the source of the information presented. Ditto for any application that made use of the data. AWS Data Exchange does nothing to support this. In reality, something like this would need standards bodies to create certificates and additional structures for NNs, standard application packages, online services, etc. that would retain and provide proof of data origins via certificates.
  • Data locality – There are some jurisdictions around the world that restrict where data generated within their boundaries can be sent, processed or used. I take it that AWS Data Exchange deals with these restrictions by either not offering data under jurisdictional restrictions for sale outside governmental boundaries, or by gating purchase of the data outside valid jurisdictions. But given VPNs and similar services, this seems less than effective. If there's some sort of temporal key encryption service governing use of our data, then it would seem reasonable to add some sort of regional key encryption capability to it.
  • Data auditability – There needs to be some way to ensure that our data is not used outside the organizations that have actually paid for it, and that if there's some sort of data certificate saying an application or service has access to that data, this mechanism is mandated to be used, supported and validated. In reality, something like this would need a whole rethinking of how data is used in society. Financial auditing took centuries to take hold and become an effective (sometimes?) tool to monitor against financial abuse. Data auditing would need many of the same sorts of functions, i.e. Certified Data Auditors, a Data Accounting Standards Board (DASB) which defines standardized reports as to how an entity is supposed to track and report on data usage, governmental regulations that require public (and private?) companies to report on the origins of the data they use on a yearly/quarterly basis, etc.
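To make the time-dependent key idea in the data use restrictions bullet a little more concrete, here's a minimal Python sketch using the cryptography library's Fernet tokens, which support a time-to-live check at decryption. The 30-day access window and the notion of a buyer-side data exchange agent holding the key are my assumptions for illustration, not anything AWS Data Exchange offers today.

# Minimal sketch of time-limited access to purchased data (assumption: the seller
# encrypts the dataset and a data-exchange agent on the buyer's server holds the
# key and enforces the purchase window).
from cryptography.fernet import Fernet, InvalidToken

ACCESS_WINDOW_SECONDS = 30 * 24 * 3600     # e.g. a 30-day purchase period (made up)

key = Fernet.generate_key()                # seller-side key, shared with the agent
f = Fernet(key)

token = f.encrypt(b"the purchased dataset bytes")   # done when the offer is created

def read_purchased_data(token: bytes) -> bytes:
    """Buyer-side read that fails once the purchase period has lapsed."""
    try:
        # Fernet embeds a creation timestamp in the token; ttl= rejects tokens
        # older than the access window, so expired data can no longer be decrypted.
        return f.decrypt(token, ttl=ACCESS_WINDOW_SECONDS)
    except InvalidToken:
        raise PermissionError("access period expired or token tampered with")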

There's probably much more that could be added here, but this should suffice for now.

Other changes to AWS Data Exchange processes

The AWS Pi Day 2023 announcement didn't really describe the supplier end of how the service works; how one registers a bucket for sale was not described. I'd certainly want some sort of steganography (watermarking) service to tag the data being sold with the identity of those who purchased it. That way there might be some possibility of tracking who released any data exchange data into the wild.

Also, how data exchange data access is billed seems a bit archaic. As far as I can determine, one gets unlimited access to data for some defined period (N months) for some specific amount ($s), and once that period expires, customers have to pay up or cease accessing the S3 data. I'd prefer to see at least a GB/month sort of cost structure; that way, if a customer copies all the data, they pay for that privilege, and if they want to reread the data multiple times, they pay for that access as well. Presumably this would require some sort of solution to the data use restrictions above to enforce.
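To make the billing point concrete, here's a back-of-the-envelope Python sketch comparing a flat period fee with the per-GB-accessed structure I'd prefer. All prices and volumes are made up for illustration.

# Hypothetical comparison of flat-period pricing vs. metered (per GB read) pricing.
FLAT_FEE_PER_MONTH = 5_000.00      # $ for unlimited access during the period (made up)
PRICE_PER_GB_READ = 0.50           # $ per GB actually read (made up)

def flat_cost(months: int) -> float:
    return FLAT_FEE_PER_MONTH * months

def metered_cost(gb_read: float) -> float:
    return PRICE_PER_GB_READ * gb_read

# A client that copies a 2 TB dataset once, then rereads 100 GB/month for a year:
months = 12
gb_read = 2_048 + 100 * (months - 1)
print(f"flat: ${flat_cost(months):,.2f}   metered: ${metered_cost(gb_read):,.2f}")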

Data banks, deposits, withdrawals and Initial Data Offerings (IDOs)

The earlier post talks about an expanded data ecosystem or economy. I won't revisit all that here, but one thing that I believe may be worth re-examining is Initial Data Offerings or IDOs.

As described in the earlier post, IDOs were a mechanism for data users to request permanent access to our data, but instead of paying a one-time fee for it, they would offer data equity in the service.

Not unlike VC funding, each data provider would be supplied some % (data?) ownership in the service, and over time data ownership gets diluted at further data raises. But at some point, when the service is profitable, data ownership units could be purchased outright, so that the service could exit its private data use stage and go public (data use).

Yeah, this all sounds complex, and AWS Data Exchange just sells data once, giving you access to it for some period and establishing data usage rights. But I think that in order to compensate users for their data, there needs to be something like IDOs that provide data ownership shares in some service that can be transferred (sold) to others.

I didn’t flesh any of that out in the original post but I still think it’s the only way to truly compensate individuals (and corporations) for the (free) use of the data that web, AI and other systems are using to create their services.

~~~~

I wrote the older post in 2018 because I saw the potential for our data to be used by others to create/train services that generate lots of money for those organizations, but without our knowledge, without our outright consent and without compensating us for the data we have (inadvertently or advertently) created over our life spans.

As an example, one can see how Getty Images is suing DALL-E 2 and others that have had free use of their copyrighted materials to train their AI NNs. If one looks underneath the covers of ChatGPT, many image processing/facial recognition services, and many other NNs, much of the data used in training them was obtained by scraping web pages that weren't originally intended to supply these sorts of data to others.

For example, it wouldn't surprise me to find out that RayOnStorage post text has been scraped from the web and used to train some large language model like ChatGPT.

Do I receive any payment or ownership equity in any of these services? No. I write these blog posts partially as a means of marketing my other consulting services, but also because I have an abiding interest in the subject under discussion. I'm happy for humanity to read these and welcome comments on them by humans. But I'm not happy to have LLMs or other NNs use my text to train their models.

On the other hand, I'd gladly sell access to RayOnStorage post text if someone offered me a high but fair price for its use for some time period, say one year… 🙂

Comments?

Safe AI

I've been writing about AGI (see part-0 [ish], part-1 [ish], part-2 [ish], part-3 [ish], part-4 and part-5) and the dangers that come with it (part-0 in the above list) for a number of years now. In my last post on the subject, I expected to next be writing a post discussing the book Human Compatible (AI and the problem of control), which is a great book on the subject. But since then I ran across another paper that is perhaps a better brief introduction to the topic and to some of the current thought and research into developing safe AI.

The article I found is Concrete problems in AI safety, written by a number of researchers at Google, Stanford, Berkeley, and OpenAI. It essentially lays out the AI safety problem along 5 dimensions, which are:

Avoiding negative side effects – These can be minor or major, and this is probably the one that scares humans the most, e.g., some toothpick-generating AI that strips the world bare to maximize toothpick making.

Avoiding reward hacking – This is more subtle, but essentially it's having your AI fool you into thinking it's doing what you want while actually doing something else. This could entail actually changing the reward logic itself, or being able to convince/manipulate the human overseer into seeing things its way. Also a pretty bad thing from humanity's perspective.

Scalable oversight – This is the problem where human overseers aren't able to keep up and witness/validate what some AI is doing, 7×24, across the world, at the speed of electronics. So how can AI be monitored properly so that it doesn't go and do something it's not supposed to (see the prior two for ideas on how bad this could be)?

Safe exploration – This is the idea that reinforcement learning, in order to work properly, has to occasionally explore the solution space, e.g., a Go board with moves selected at random, to see if they are better than what it currently believes is the best move to make. This isn't much of a problem for game-playing ML/AI, but if we are talking about a helicopter-controlling AI, exploration at random could destroy the vehicle plus any nearby structures, flora or fauna, including humans of course.

Robustness to distributional shifts – This is the perennial problem where an AI or DNN is trained on one dataset, but over time the real world changes and the data it's now seeing has shifted (in distribution) to something else. This often leads to DNNs not operating properly over time, or having many more errors in deployment than they did during training. This is probably the problem on this list undergoing the most research to rectify, because it impacts just about every ML/AI solution currently deployed in the world today. This robustness to distributional shifts problem is why many AI DNN systems require periodic retraining.
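As a small illustration of how this is watched for in practice, here's a sketch of one common approach: compare the histogram of a model input feature in production against its training histogram using a divergence measure, and flag retraining when it drifts past a threshold. The feature values, bin count and threshold below are all placeholders, not anything from the paper.

import numpy as np

def drift_score(train_values, live_values, bins=20, eps=1e-9):
    """Symmetric KL divergence between training and live feature histograms."""
    lo = min(train_values.min(), live_values.min())
    hi = max(train_values.max(), live_values.max())
    p, _ = np.histogram(train_values, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(live_values, bins=bins, range=(lo, hi), density=True)
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Placeholder data: training distribution vs. a shifted live distribution.
train = np.random.normal(0.0, 1.0, 10_000)
live = np.random.normal(0.7, 1.3, 10_000)

if drift_score(train, live) > 0.25:      # threshold is an arbitrary choice
    print("distribution shift detected -- schedule retraining")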

So now we know what to look for, now what

Each of these deserves probably a whole book or more to understand and try to address. The paper talks about all of these and points to some of the research or current directions trying to address them.

The researchers correctly point out that some of the above problems are more pressing when more complex ML/AI agents have more autonomous control over actions in the real world.

We don't want our automotive automation driving us over a cliff just to see if it's a better action than staying in the lane. But Go-playing bots or article summarizers might be OK being wrong occasionally if it could lead to better-playing bots or more concise article summaries over time. And although exploration is mostly a problem during training, that's not to say such activities might not also occur during deployment to probe for distributional shifts or other issues.

However, as we start to see more complex ML/AI solutions controlling more activities, the issues of AI safety are becoming more pressing. Autonomous cars are just one pressing example. But recent introductions of sorting robots, agricultural bots, manufacturing bots, nursing bots, guard bots, soldier bots, etc. are all just steps down a (short) path of increasing complexity that can only end in some AGI bots running more parts (or all) of the world.

So safety will become a major factor soon, if it isn't already.

Scares me the most

The first two on the list above scare me the most: avoiding negative or unintentional side effects and avoiding reward hacking.

I suppose if we could master scalable oversight we could maybe deal with all of them better as well. But that’s defense. I’m all about offense and tackling the problem up front rather than trying to deal with it after it’s broken.

Negative side effects

Negative side effects is a rather nice way of stating the problem of having your ML destroy the world (or the parts of it) that we need to live in.

One approach to dealing with this problem is to define or train another AI/ML agent to measure impacts on the environment and have it somehow penalize the original AI/ML agent for causing them. This learning approach has some potential to be applied to numerous ML activities if it can be shown to be safe and fairly all-encompassing.

Another approach discussed in the paper is to inhibit or penalize the original ML for any actions which have negative consequences. One way to do this is to come up with an "empowerment measure" for the original AI/ML solution. The idea would be to reduce, minimize or govern the original ML's action set (or potential consequences), or its empowerment measure, so as to minimize its ability to create negative side effects.
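A minimal sketch of the penalty idea, not taken from the paper: the agent's training reward is the task reward minus a weighted impact term, so actions with large side effects score worse. The impact measure here is just a placeholder count of environment cells an action changes, and the weight is arbitrary.

# Sketch: penalize task reward by an impact measure (placeholder definitions).
IMPACT_WEIGHT = 0.5   # beta: how strongly side effects are penalized (arbitrary)

def impact(prev_state, next_state) -> float:
    """Placeholder impact measure: how much of the environment the action changed."""
    return sum(1 for a, b in zip(prev_state, next_state) if a != b)

def shaped_reward(task_reward, prev_state, next_state) -> float:
    return task_reward - IMPACT_WEIGHT * impact(prev_state, next_state)

# An action that earns 1.0 task reward but disturbs 3 cells of the environment:
print(shaped_reward(1.0, (0, 0, 0, 0), (1, 1, 1, 0)))   # -> -0.5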

The paper discusses other approaches to the problem of negative side effects, one of which is having multiple ML (or ML and human) agents working together on the problem being solved, with the ability to influence (kill switch) each other when they discover something's awry. Another approach they mention is to reduce the certainty of the reward signal used to train the ML solution. This would work by having some function reduce the reward when there are random side effects, which would tend to have the ML solution learn to avoid them.

Neither of these latter two seems as feasible as the others, but they are all worthy of research.

Reward hacking

This seems less of a problem for our world than negative side effects, until you consider that if an ML agent is able to manipulate its reward code, it's probably able to manipulate any code intended to limit its potential impacts or penalize it for being more empowered, or to manipulate a human (or other agent) with their hand over the kill switch (or just turn off the kill switch).

So this problem could easily lead to an outbreak of any of the other problems on the list of safety problems above and below. An example of reward hacking is a game-playing bot that detects a situation that leads to a buffer overflow and results in a win signal or higher rewards. Such a bot will no doubt learn how to cause more buffer overflows so it can maximize its reward, rather than learn to play the game better.

But the real problem is that a reward signal used to train an ML solution is just an approximation of what's intended. Chess programs in the past were trained by masters to use their openings to open up the center of the board and use their middle and end games to achieve strategic advantages. But later chess- and Go-playing bots just learned to checkmate their opponent and let the rest of the game take care of itself.

Moreover, (board) game play is a relatively simple domain in which to come up with proper reward signals (with the possible exception of buffer overflows or other bugs). But for car-driving bots, drone bots, guard bots, etc., reward signals are not nearly as easy to define or implement.

One approach to avoiding reward hacking is to make the reward signaling process its own ML/AI agent that is (suitably) stronger than the ML/AI agent learning the task. Most reward generators are relatively simple code. For instance, in Monopoly, code that just counts the money each player has at the end of the game could be used to determine the winner (in a timed Monopoly game). But rather than having a simple piece of code create the reward signal, use ML to learn what the reward should be. Such an agent might be trained to check whether more or less money was being counted than was physically possible in the game, whether property was illegally obtained during the game, or whether other reward hacks were done, and penalize the ML solution for these actions. This makes the reward signal depend on proper training of that reward ML solution, and the two ML solutions would effectively compete against one another.
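Here's a toy version of that idea for the Monopoly example: a separate reward checker that refuses to issue a reward when the money totals it sees are impossible for the game. A real version would itself be a learned model; this one is hard-coded, and the bank total and penalty are assumed figures for the sketch.

# Toy reward checker for a timed Monopoly game (constants are assumptions for the sketch).
BANK_TOTAL = 20_580          # assumed total cash available in the set
HACK_PENALTY = -1_000.0

def checked_reward(player_cash: dict, player: str) -> float:
    """Reward = player's cash at game end, unless the totals are impossible."""
    if any(c < 0 for c in player_cash.values()) or sum(player_cash.values()) > BANK_TOTAL:
        return HACK_PENALTY          # impossible state -- likely a reward hack
    return float(player_cash[player])

print(checked_reward({"a": 3_000, "b": 2_500}, "a"))        # normal: 3000.0
print(checked_reward({"a": 900_000, "b": 2_500}, "a"))      # hacked: -1000.0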

Another approach is to "sandbox" the reward code/solution so that it is outside of external and/or ML/AI influence. Possibly combining the prior approach with this one might suffice.

Yet another approach is to examine the ML solution's future states (actions) to determine if any of them impact the reward function itself, and penalize it for doing so. This assumes that the future states are representative of what it plans to do and that some code or some person can recognize states that are inappropriate.

Another approach discussed in the paper is to have multiple reward signals. These could use multiple formulas for computing a multi-faceted reward signal, averaging them or using some other mathematical function to combine them into something that might be more accurate than one reward function alone. This way any ML solution attempting to reward hack would need to hack multiple reward functions (or perhaps the function that combines them) in order to succeed.
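And a sketch of that multiple-reward idea: several independently written reward functions are combined (here with the minimum, a deliberately conservative choice rather than the average) so that gaming any single one isn't enough. The three component rewards and the state fields are placeholders of my own.

# Sketch: combine several independent reward functions so one hack isn't enough.
def reward_money(state):     return state.get("cash", 0) / 1_000.0        # placeholder
def reward_progress(state):  return state.get("properties", 0) / 10.0     # placeholder
def reward_fair_play(state): return 0.0 if state.get("rule_violations", 0) else 1.0

REWARDS = [reward_money, reward_progress, reward_fair_play]

def combined_reward(state) -> float:
    # Minimum is conservative: a hacked component can't raise the total on its own.
    return min(r(state) for r in REWARDS)

print(combined_reward({"cash": 5_000, "properties": 4, "rule_violations": 0}))   # -> 0.4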

The one that IMHO has the most potential, but which seems the hardest to implement, is to somehow create "variable indifference" in the ML/AI solution. This means having the ML/AI solution ignore any steps that impact the reward function itself, or other steps that lead to reward hacking. The researchers rightfully state that if this were possible, then many AI safety concerns could be dealt with.

There are many other approaches discussed, and I would suggest reading the paper to learn more. None of the others seems simple or a complete solution to all potential reward hacks.

~~~

The paper goes into the same or a greater level of detail on the other three "concrete safety" issues in AI.

In my last post (see the part-5 link above) I thought I was going to write about the book Human Compatible (AI) by S. Russell and its discussion of AI safety. But then I found the "Concrete problems in AI safety" paper (see link above), thought it provided a better summary of AI safety issues, and used it instead. I'll try to circle back to the book at some later date.


Storywrangler, ranking tweet ngrams over time

Read a couple of articles over the past few weeks on a project in Vermont that has randomly sampled 10% of all tweets (150 billion) since the beginning of Twitter (2008) and can search and rank this tweet corpus for ngrams (1-, 2-, & 3-word phrases). All of these articles were reporting on a Science Advances article: Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter.

Why we need Storywrangler

The challenge with all social media is that it is transient: here now, (mostly) gone tomorrow. That is, once posted, if it's liked/re-posted/re-tweeted it can exist in echoes of the original on the service for some time, and if not, it dies out very quickly, never to be (externally) seen again. Each of us can potentially see every tweet we have ever created (when this post is published it should be my 5387th tweet on my twitter account), but most of us cannot see this history for others.

All that makes viewing what goes on across social media nearly impossible, which leads to a lot of misunderstanding and makes it difficult to analyze. It would be great if we had a way of looking at social media activity in more detail to understand it better.

I wrote about this before (see my Computational anthropology & archeology post) and if anything, the need for such capabilities has become even more important in today’s society.

If only there were a way to examine the twitter-verse. What's mainly lacking is a corpus of all tweets that have ever been tweeted. A way to slice, dice, search, and rank this text data would be a godsend to understanding (Twitter and maybe social) history, in real time.

Storywrangler has a randomized sample of 10% of all tweets since Twitter started, and it provides ngram searching and ranking over a specified time interval. It's not everything, but it's a start.

Storywrangler currently has over 1 trillion (1- to 3- word) ngrams and they support ngram rankings for over 150 different languages.

Google books ngram viewer

The idea for the Storywrangler project came from Google’s books ngram viewer. Google’s ngram viewer has a corpus of Google books, over a time period (from 1800 to 2019) and allows one to search for ngrams (1- to 5-word phrases) over any time period they support.

Google’s ngram viewer charts ngrams with a vertical axis that is the % of all ngrams in their book corpus. One can see the rise and fall of ngrams, e.g., “atomic power”. The phrase “atomic power” peaked in Google books around 1960 at a height of 0.000260% of all 2 word ngrams. The time period level of granularity is a year.

The nice thing about Google Books ngram data is that you can download the book ngram data yourself. The data takes the form of a tab-separated list of rows, each with the ngram text (1 to 5 words), the year, how many times it occurred that year, on how many pages, and in how many books. Google Books ngram data is generally about 2 years old.
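Based on that row description, here's a quick Python sketch of reading one of the downloaded ngram files and pulling out the yearly counts for a phrase. The example file name and the exact column order are taken from the description above and should be checked against the actual download.

import csv, gzip
from collections import defaultdict

# Columns per the description above: ngram, year, match_count, page_count, volume_count
# (check the column order against the file you actually download).
def yearly_counts(path: str, phrase: str) -> dict:
    counts = defaultdict(int)
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row and row[0].lower() == phrase.lower():
                counts[int(row[1])] += int(row[2])
    return dict(counts)

# e.g. counts = yearly_counts("googlebooks-eng-all-2gram.gz", "atomic power")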

It's unclear just how much data is in Google's books ngram database, but for instance in the 1-gram English fiction list, they show a sample of two rows (the 3,000,000th and 3,000,001st rows), which are the 1978 and 1979 book counts for the word "circumvallate".

Storywrangler tweet ngram viewer

The usage tab on the Storywrangler website provides a search engine that one can use to input ngrams to search the corpus for, and to visualize how their rank changes over time. For example, one can do a similar search on the "atomic power" ngram, only for tweets.

From a Storywrangler search one can see that peak tweet use of "Atomic Power" and "ATOMIC POWER" occurred somewhere in July of 2020 (the only way to see the month is to hover over that line), and its rank reached somewhere around the ~10,000th highest-used 2-word tweet ngram during that time.

It's interesting to see that book ngrams and twitter ngrams don't seem to have any correlation. For example, the prior best ranking for atomic power (~200,000th highest) was in June of 2015. There was no similar peak for book ngrams of the phrase.

For Storywrangler, you can download a JSON or CSV version of the charts displayed. It's not the complete ngram history that the Google Books ngram viewer provides. Storywrangler data is generally about 2 days old.
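Once you've downloaded one of those CSVs, plotting the rank history is only a few lines of pandas/matplotlib. The file name and column names here are guesses for illustration and would need to match whatever the actual export uses.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names ("date", "rank") -- adjust to the downloaded header.
df = pd.read_csv("atomic_power_ngram.csv", parse_dates=["date"])

plt.plot(df["date"], df["rank"])
plt.gca().invert_yaxis()        # rank 1 is the most used ngram, so plot it at the top
plt.ylabel("tweet 2-gram rank")
plt.title('"atomic power" rank over time (Storywrangler export)')
plt.show()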

The other nice thing about Storywrangler is that under the real-time tab it will show you ngram rankings at 15-minute intervals for whatever timeline you wish to see. Also, under the trending tab it will show you the changing ranks for the top 5 ngrams over a selected time period. The language tab will track tweet language use for select languages. The common tab will track the ranking of the most common ngrams (pretty boring, mostly articles/prepositions) over time. And for any of these searches one can turn retweet counting on or off, which can help eliminate bot activity.

Storywrangler provides a number of other statistics for ngrams besides ranking, such as odds (of occurring) and frequency (of occurrence). And one can also track rank change, old (year) rank vs. current (year) rank, and rank (turbulence) divergence.

~~~~

Comments?


New era of graphical AI is near #AIFD2 @Intel

I attended AIFD2 (videos of the sessions available here) a couple of weeks back, and in the last session, Intel presented information on what they have been working on for new graph-optimized cores, and on a partner of theirs called Katana Graph, which supports a highly optimized graphical analytics processing tool set using latest generation Xeon compute and Optane PMEM.

What’s so special about graphs

The challenge with graphical processing is that it's nothing like standard 2D tables/images or 3D oriented data sets. It's essentially a non-Euclidean data space made of nodes with edges that connect them.

But graphs are everywhere we look today, for instance, “friend” connection graphs, “terrorist” networks, page rank algorithms, drug impacts on biochemical pathways, cut points (single points of failure in networks or electrical grids), and of course optimized routing.

The challenge is that large graphs aren't easily processed with standard scale-up or scale-out architectures. Part of this is that graphs are very sparse; one node could point to one other node or to millions. Due to this sparsity, standard cache fetch logic (such as fetching everything adjacent to a memory request) and standard vector processing (the same instruction applied to data in sequence) don't work very well at all. Standard compute branch prediction logic doesn't work well either. (Not sure why, but apparently branching in graph processing depends more on the data at the node or in the edge connecting nodes.)
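To see why prefetching and vectorization struggle, consider a plain breadth-first traversal over a graph stored in compressed sparse row (CSR) form, a minimal sketch of my own: each visited node triggers loads at effectively unpredictable offsets in the edge array, so there's little spatial locality for a cache line fetch or a vector unit to exploit.

from collections import deque

# Tiny graph in CSR form: offsets[i]..offsets[i+1] index node i's neighbors in edges[].
offsets = [0, 2, 3, 5, 6]          # 4 nodes
edges   = [1, 3, 2, 0, 3, 1]       # neighbor lists laid out back to back

def bfs(start: int) -> list:
    seen, order, q = {start}, [], deque([start])
    while q:
        node = q.popleft()
        order.append(node)
        # The slice below jumps to an unpredictable place in edges[] for every node,
        # which defeats adjacent-line prefetching and same-instruction-in-sequence
        # vector processing alike.
        for nbr in edges[offsets[node]:offsets[node + 1]]:
            if nbr not in seen:
                seen.add(nbr)
                q.append(nbr)
    return order

print(bfs(0))   # -> [0, 1, 3, 2]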

Intel talked about a new compute core they've been working on, which was in response to a DARPA-funded activity to speed up graphical processing and activities 1000X over current CPU/GPU hardware capabilities.

The PIUMA core technology Intel presented on is also described in a 2020 research paper (Programmable Integrated and Unified Memory Architecture) and a YouTube video (Programmable Unified Memory Architecture).

Intel’s PIUMA Technology

DARPA's goals became public in 2017 and described their Hierarchical Identify Verify Exploit (HIVE) architecture. HIVE is DOD's description of a graphical analytics processor and is a multi-institutional initiative to speed up graphical processing.

Intel PIUMA cores come with a multitude of 64-bit RISC processor pipelines, a global (shared) address space, memory and network interfaces that are optimized for 8-byte data transfers, a (globally addressable) scratchpad memory, and an offload engine for common operations like scatter/gather memory access.

Each multi-threaded PIUMA core has a set of instruction caches, small data caches and register files to support each thread (pipeline) in execution. A number of these multi-threaded cores are then connected together.

PIUMA cores are optimized for TTEPS (Tera-Traversed Edges Per Second) and attempt to balance IO, memory and compute for graphical activities. PIUMA multi-threaded cores are tied together into a (completely connected) clique called a tile, multiple tiles are connected within a single node, and multiple nodes are tied together with an 8-byte-transfer-optimized network into a PIUMA system.

P[I]UMA (labeled PUMA in the video) multi-threaded cores apparently eschew extensive data and instruction caching in favor of creating a large number of relatively simple cores that can process a multitude of threads at the same time. Most of these threads will be waiting on memory, so the more threads executing, the less likely the whole pipeline will need to sit idle, and hopefully the more processing speedup can result.

Performance of the P[I]UMA architecture vs. a standard Xeon compute architecture on graphical analytics and other graph-oriented tasks was simulated, with some results presented below.

Simulated speedups for a single node with P[I]UMA technology vs. Xeon range anywhere from 3.1x to 279x and depend on the amount of computation required at each node (or edge). (Intel saw no speedups between a single Xeon node and multiple Xeon nodes, so the speedup result reported for 16 P[I]UMA nodes was 16X a single P[I]UMA node.)

Having a global address space across all PIUMA nodes in a system is pretty impressive. We guess this is intrinsic to their (large) graph processing performance and is dependent on their use of photonic HyperX networking between nodes for low latency, small (8 byte) data accesses.

Katana Graph software

Another part of Intel’s session at AIFD2 was on their partnership with Katana Graph, a scale out graph analytics software provider. Katana Graph can take advantage of ubiquitous Xeon compute and Optane PMEM to speed up and scale-out graph processing. Katana Graph uses Intel’s oneAPI.

Katana Graph is architected to support some of the largest graphs around. They tested it with the WDC12 Web Data Commons 2012 page crawl, with 3.5B nodes (pages) and 128B connections (links) between nodes.

Katana runs on the AWS, Azure and GCP hyperscaler environments as well as on prem, and can scale up to 256 systems.

Katana Graph performance results for Graph Neural Networks (GNNs) were shown below. GNNs are similar to AI/ML/DL CNNs but use graph data rather than images. One can take a graph and reduce (convolve) and summarize segments of it to classify them. Moreover, GNNs can be used to understand whether two nodes are connected and whether two (sub)graphs are equivalent/similar.
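As a very small sketch of the message-passing step at the heart of a GNN layer: each node's new feature vector is built by aggregating (here, averaging) its neighbors' features and mixing them with its own through learned weights. This is generic GNN arithmetic in NumPy with random stand-in weights, not Katana Graph's API.

import numpy as np

# Toy graph: adjacency matrix A (4 nodes) and 3-dimensional node features X.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 3)

# Random stand-ins for learned weight matrices.
W_self, W_neigh = np.random.rand(3, 8), np.random.rand(3, 8)

def gnn_layer(A, X):
    deg = A.sum(axis=1, keepdims=True)
    neigh_mean = (A @ X) / np.maximum(deg, 1.0)                  # average neighbor features
    return np.maximum(X @ W_self + neigh_mean @ W_neigh, 0.0)    # mix with own features + ReLU

H = gnn_layer(A, X)       # one round of message passing: H has shape (4, 8)
print(H.shape)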

In addition to GNNs, Katana Graph supports Graph Transformer Networks (GTNs), which can analyze meta-paths within a larger, heterogeneous graph. The challenge with large graphs (say friend/terrorist networks) is that there are a large number of distinct sub-graphs within the graph. GTNs can break heterogeneous graphs into sub- or meta-graphs, which can then be used to understand these relationships at smaller scales.

At AIFD2, Intel also presented an update on their Analytics Zoo, which is Intel’s MLops framework. But that will need to wait for another time.

~~~~

It was sort of a revelation to me that graphical data is not amenable to normal compute core processing using today's GPUs or CPUs. DARPA (and Intel) saw this defect as calling for a completely different, brand-new compute architecture.

Even so, Intel's partnership with Katana Graph says that even today's compute environments can provide higher performance on graphical data with suitable optimizations.

It would be interesting to see what Katana Graph could do using PIUMA technology and appropriate optimizations.

In any case, we shouldn't need to wait long; Intel indicated in the video that P[I]UMA technology chips could be here within the next year or so.

Comments?

Photo Credit(s):

  • From Intel’s AIFD2 presentations
  • From Intel's PUMA YouTube video

Swarm learning for distributed & confidential machine learning

Read an article the other week about researchers in Germany working with a form of distributed machine learning they call swarm learning (see: AI with swarm intelligence: a novel technology for cooperative analysis …), which was reporting on a Nature magazine article (see: Swarm Learning for decentralized and confidential clinical machine learning).

The problem of shared machine learning is particularly acute with medical data. Many countries specifically call out patient medical information as data that can't be shared between organizations (even within a country) unless specifically authorized by a patient.

So these organizations and others are turning to distributed machine learning as a way to 1) protect data across nodes and 2) provide accurate predictions that use all the data even though portions of that data aren't visible. There are two forms of distributed machine learning that I'm aware of: federated and, now, swarm learning.

The main advantage of federated and swarm learning is that the data can be kept in the hospital, medical lab or facility without having to be revealed outside that privileged domain, BUT the (machine) learning that's derived from that data can be shared with other organizations and used in aggregate to increase the prediction/classification model accuracy across all locations.

How distributed machine learning works

Distributed machine learning starts with a common model that all nodes download and use to share learnings. At some agreed-to time (across the learning network), all the nodes use their latest data to re-train the common model and share the new training results (essentially the weights used in the neural network layers) with all other members of the learning network.

Shared learnings would be encrypted with TLS plus some form of homomorphic encryption that allowed for calculations over the encrypted data.

In both federated and swarm learning, the sharing mechanism is facilitated by a permissioned blockchain (apparently Ethereum-based for swarm). All learning nodes use this blockchain to share learnings and to download any updates to the common model after sharing.

Federated vs. Swarm learning

The main difference between federated and swarm learning is that with federated learning there is a central authority that updates the model(s), while with swarm learning that processing is replaced by a smart contract executing within the blockchain. Updating the model(s) is done by each node updating the blockchain with its shared data; once all updates are in, this triggers a smart contract to execute some Ethereum VM code which aggregates all the learnings and constructs a new model (or at least new weights for the model). Thus no single node is responsible for updating the model; it's all embedded in a smart contract within the Ethereum blockchain.

But how does the swarm (or smart contract) update the common model's weights? The Nature article states that they used either a straight average or a weighted average (weighted by the "weight" of a node, which we assume is a function of the node's re-training dataset size) to update all parameters of the common model(s).
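The averaging step itself is simple. Here's a sketch of the weighted parameter merge, with each node's weight taken (per the assumption above) from its local training-set size; the three nodes, their parameters and weights are invented for illustration, and this is not the paper's code.

import numpy as np

# Each node reports its updated model parameters plus a weight (here, local dataset size).
node_updates = [
    (np.array([0.20, 0.50, -0.10]), 1_200),   # node A: parameters, #local samples
    (np.array([0.25, 0.45, -0.05]),   800),   # node B
    (np.array([0.15, 0.55, -0.20]), 2_000),   # node C
]

def merge(updates):
    """Weighted average of parameters -- the aggregation a swarm smart contract would run."""
    total = sum(w for _, w in updates)
    return sum(params * (w / total) for params, w in updates)

print(merge(node_updates))   # new common-model parameters pushed back to every node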

Testing Swarm vs. Centralized vs. Individual (node) model learning

In the Nature paper, the researchers compared a central model, where all data is available to retrain the model, with one utilizing swarm learning. To perform the comparison, they had all nodes contribute 20% of their test data to a central repository, which ran the common swarm-updated model against this data to compute an accuracy metric for the swarm. The resulting accuracies of the central vs. swarm learning approaches look identical.

They also compared each individual node (just using the common model and retraining it over time without sharing this information with the swarm) versus using the swarm learning approach. In this comparison, the swarm learning approach always seemed to have as good as, if not better, accuracy and a much narrower dispersion.

In the Nature paper, the researchers used swarm learning to manage the machine learning model predictions for detecting COVID-19, leukemia, tuberculosis, and other lung diseases. All of these used public data, which included PBMC (peripheral blood mononuclear cell) transcription data, whole blood transcription data, and X-ray images.

Swarm learning also provides the ability to onboard new nodes into the network, which supplies the common model and its current weights to the new node and adds it to the shared learning smart contract.

The code for swarm learning can be downloaded from HPE (requires an HPE Passport login [it's free]). The code for the models and data processing used in the paper is available from GitHub. All this seems relatively straightforward; one could use the HPE Swarm Learning Library to facilitate doing this or code it up oneself.


Tattoos that light up

Read an article the other day in ScienceDaily, titled Light-emitting tattoo engineered, which was reporting on research done by University College London and the Istituto Italiano di Tecnologia (Italian Institute of Technology) (Ultrathin, ultra-comfortable and free-standing, tattooable LEDs – behind paywall).

The new technology from their research can construct OLEDs, the display elements found in TVs, phones, and other displays, and apply them as temporary tattoos. The tattoos will eventually degrade and wash off, but while present on the skin they can light up and display information.

According to the Nanowerk news article reporting on the research, (see Light emitting tattoos engineered for the 1st time), the OLEDs are printed onto paper which can then be transferred to skin by the application of water. The picture above shows a number of the OLED tattoos ready for application.

The vision is that OLED tattoos, along with other flexible electronics, could provide wearable sensors of a person's bio-chemical activity. Such sensors could be used in hospitals and in the home to display dehydration, glucose status, oxygenation, etc., as well as heart and breath rates. But in order to get to that vision there are a few steps that are needed.

Flexible, stretchable electronics

There have been a number of articles about creating flexible electronics (e.g., see A design to improve the resilience and electrical performance of thin metal film based electrodes). That article was reporting on research done at the University of Illinois, Urbana-Champaign, reported in Nature (behind paywall), which one of the researchers also blogged about in Nature Portfolio Devices & Materials (see: An atom-thick interlayer enables the electrical ductility of thin-film metal electrodes).

Flexible electronics can be constructed by creating a thin metal film with the electronics embedded in it, placed on top of a flexible substrate. However, when that flexible substrate starts to deform or stretch, it induces cracks in the thin metal film which lead to loss of conductivity, or loss of electronic function.

The research cited in the article above showed videos of cracking that takes place during deformation and stretching which would lead to loss of conductivity.

But the researchers at U of I found that if you place a thin layer of graphene or another 2D sheet of material between the electronic thin film and the flexible substrate, the cracks that eventually happen are much less harmful to electronic conduction or functioning; in other words, the interlayer provides electrical ductility. To add ductility to an electronic circuit using LEDs, the team applied an atomically thin (<1nm) 2D layer of graphene between it and the flexible substrate.

Somehow the graphene provides a mechanical buffer between the flexible substrate and the thin film electronics that gives the circuits much more ductility. It appears that this mechanical buffer changes the type of cracking that occurs in the thin metal film, such that the cracks are shorter and more varied in direction rather than straight across, and this helps the circuits retain functioning longer than they would without the graphene interlayer.

The researchers at U of I actually created an LED display that could be bent without failure. See their video comparing the thin film with and without the 2D substrate layer.

Skin sensors

Moreover, there have been a number of articles discussing new wearable technologies that could be used to sense a person's bio-chemical state. For example, research reported on recently (see Do Sweat It! Wearable Microfluidic Sensor to Measure Lactate Concentration in Real Time), done at the Tokyo University of Science and published in Electrochimica Acta (behind paywall), describes a sweat sensor that can be applied to the skin to determine when athletes or others are getting dehydrated.

This sensor uses a microfluidic device printed with electronic ink. Such a device could be manufactured in volume and readily printed onto surfaces that could be applied to the skin anywhere sweat is being produced.

Future tattoos

Wearable sensors already surround us. We have watches that can tell our heart rates, walk/running speeds, step counts, etc. It doesn't take much to imagine that most if not all of these could be fabricated on a thin film and, with the proper 2D substrate layer, applied as a tattoo to a person while in the hospital. But all these sensors have lacked a readout or display, up until now. With OLED readouts, wearable sensors now have a reasonable display capability.

The sweat sensor above uses microfluidics to do a lactate assay of sweat. The motion sensors in my watch use MEMS and an onboard IMU/GPS to determine speed and direction of movement. Electronic temperature sensors use thermoelectric effects. Blood oxygen sensors use LEDs and light sensors. None of these appears impossible to fabricate, miniaturize and print on thin films. Add OLEDs, and why do we need a watch anymore?

What seems to be the most glaring omission is gas sensors (although the lactate microfluidic sensor is close). If we could somehow miniaturize sensors with enough sensitivity to glucose levels, immunological load, specific diseases (COVID-19), etc., then maybe there'd be a mass market for such devices outside of hospitals or smart watch users.

Then, with OLEDs and electronics that can be temporarily tattooed onto a person's skin, why couldn't this be a fashion accessory? I can imagine lots of people would have an interest in lighting up messages, iconography or other data on their arms, hands, or other areas of their body. I wonder if it could be used to display hair on the top of my head :)?

And of course these OLED-electronics based tattoos are temporary. But if they are all made from electronic ink, it seems to me that such tattoos could be permanently printed (implanted?) onto a person's skin.

Maybe at some future point a permanent OLED-electronics based tattoo could provide an electronic display and input device that could be used in conjunction with a phone or a smart watch. All it would take would be Bluetooth.

Comments?


cOAlition S requires open access to funded research

I read a Science article this last week (A new mandate highlights costs and benefits of making all scientific articles free) about a group of funding organizations that have come together to mandate open access to all peer-reviewed research they fund, called Plan S. The list of organizations in cOAlition S is impressive, including national R&D funding agencies from the UK, Ireland, Norway, and a number of other countries, charitable R&D funders such as the WHO, the Wellcome Trust, the Bill & Melinda Gates Foundation and more, and the group is also being funded by the EU. Plan S takes effect this year.

Essentially, all research funded by these organizations must be immediately published in an open access forum or an open access journal, or be freely available in an open access section of a publisher's website, which means it is free to be read by anyone worldwide with access to the web. Authors and institutions will retain copyright for the work, and the work will be published under an open access license such as the CC BY (Creative Commons Attribution) license.

Why open access is important

At this blog, we frequently find ourselves writing about research which is only available on a paid subscription or pay-per-article basis. However, sometimes, if we search long enough, we find a duplicate of the article published in pre-print form on some preprint server or in an open access journal.

We have written about open access journals before (see our New Science combats Coronavirus post). Much of what we do on this blog would not be possible without open access journals like PLoS, BioRxiv, and PubMed.

Open access mandates are trending

Open access mandates have been around for a while now. Even the US government got into the act, mandating that all research funded by the NIH be open access by 2008, with the Departments of Agriculture and Energy following later (see Wikipedia's Open access mandates).

In addition, given the pandemic emergency, many research publishers like Nature and Elsevier made any and all information about the Coronavirus free access on their websites.

Impacts on the R&D research publishing business model

Although research is funded by public organizations such as charities and government agencies, prior to open access mandates most research was published in peer-reviewed journals which charged a fee for access. For many research organizations, those fees were a cost of doing research. If you were an independent researcher or in an institution that couldn't afford these fees, attempting to do cutting edge research was impossible without this access.

Yes, in some cases those journal publishers waived these fees for deserving institutions and organizations, but this wasn't the case for individual researchers. Or, if you were truly diligent, you could request a copy of a paper from an author and wait.

Of course, journal publishers have real expenses they need to cover, as well as a reasonable profit to make. But due to business consolidation, there are fewer independent journals around, and as a result they charge bundled license fees for vast swathes of research articles. Such a wide bundle may or may not be of interest to an individual or an institution. Plus, with consolidation, profits have become a more significant consideration.

So open access mandates often included funding to cover the fees publishers charge to supply open access. Such fees varied widely, so open access mandates also began to require that fees be published along with a description of how prices were calculated. By doing so, their hope was to make such costs more transparent.

Impacts on authors of research articles

Somewhere there's an aphorism for researchers that says "publish or perish", which means you must publish research in order to become a recognized expert in your field. Recognition is often the main driver behind better academic employment and more research funding.

However, it's not just about the volume of published papers; the quality of research also matters. And the more highly regarded publishing outlets have an advantage here, in that they are de facto gatekeepers to what's published in their journals. As such, where you publish can often lend credibility to the research.

Another thing has changed over the last few decades: judging the quality of research has become more quantitative. Nowadays, research quality is also dependent on the number of citations it receives. The more popular a publisher is, the more readers it has, which increases the possibility of citations.

Thus, most researchers try to publish their best work in highly regarded journals. And of course, these journals charge a high cost to provide open access.

Successful research institutions can afford to pay these prices but those further down the totem pole cannot.

Most mandates come with additional funding to support paying the cost of supplying open access. But they also require publishing and justifying these costs, in the belief that doing so will lend some transparency to them.

So the researcher is caught in the middle. Funding organizations want open access to research they fund. And publishers want to be paid a profit for that access.

History of research publication

Nature magazine first started publishing research in 1869, Science magazine first published in 1880, and the Royal Society first published research in 1665. So publishing research has been going on for over 350 years, and at least as a for-profit business model, since the mid-1800s.

Research prior to being published in journals was only available in books. And more than likely, the author of the research had to pay to have the book published, and the publisher made money only when those books were sold. And prior to that, scientific research was mostly only available through a course of study, also mostly paid for by the student.

So science has always had a cost to access. What open access mandates are doing is moving this cost to something added to the funding of research.

Now if open access can only solve the reproducibility crisis in science we could have us a real scientific revolution.

Comments?


Open source digital assistant

I've purchased a number of digital assistants over the last couple of years from both Google and Amazon, but not Apple. At first, their novelty drove me to take advantage of them to do a number of things. But over time I started to only use them for playing music or telling jokes. And then I started to hear about some other concerns with the technology.

The problems with today's vendor-based digital assistants

My and others' main concern was their ability to listen in on conversations in the home and workplace without being queried. Yes, there are controls on some of them to turn off the mic and thus any recording. But these are not hardwired switches, and being software they may or may not work depending on the implementation. As such, there is no guarantee that they won't still be recording audio even with their mic (supposedly) turned off.

At one point I saw a news article where police had subpoenaed recordings from a digital assistant to use in a criminal case. Now I'm OK with this for specific, court-approved criminal cases, but what's to limit its use to such? And not all courts, or governments for that matter, are as protective of personal privacy as some.

Open source digital assistant on the way

But an open source version of a digital assistant, one where the user has complete programmatic control over its recording and use of audio data, is another matter. I suppose this doesn't necessarily help the technically challenged among us who can't program our way out of a paper bag, but even for those individuals, the fact that an open source version exists to protect privacy could be construed as something much more secure than a company or vendor's product.

All that made it very interesting when I saw an article recently about a project put together at Stanford, an "Open source challenger to popular virtual assistants".

How to create an open source digital assistant

The main problem facing an open source digital assistant is the need for massive amounts of annotated training request data. This is one of the main reasons that commercial digital assistants often record conversations when not specifically requested.

But Stanford University, which is responsible for creating the open source digital assistant above, has managed to design and create a "rules based" system to help generate all the training data needed for a virtual assistant.

With all this automatically generated training data, they can train a digital assistant's natural language processing neural network to understand what's being asked and drive whatever action is being requested.

At the moment the digital assistant (and its conversation generator) has somewhat limited skills, or rather only works in a restricted set of domains such as restaurants, people, movies, books and music. For example, "identify a restaurant near me that has deep dish pizza and is rated greater than 4 on a 5 point scale", "find me a mystery novel that is about magic", or "who was the 22nd president of the USA".

But as the digital assistant and its annotated, rules-based conversation generator are both open source, anyone can contribute more skills code or add more conversational capabilities. Over time, if there's enough participation, it could perhaps even someday perform all of the skills or capabilities of commercial digital assistants.

Introducing Almond and Stanford’s OVAL

Stanford's work on this project is out of their OVAL (Open Virtual Assistant Lab). Their open source virtual assistant is called Almond.

Almond's verbal generator is called Genie and uses compositional technology to generate conversations that are used to train its linguistic user interface (LUInet). Almond also uses ThingTalk, a new declarative programming language, to process responses to queries and requests. Finally, Almond makes use of Thingpedia, a repository of information about internet services and IoT devices, to tell it how to interact with these systems.

Stanford Genie technology

The technology behind Genie is based on using source text statements to create templates that can generate sentences for any domain you wish to have Almond work in. If one is interested in expanding the Almond domains, they can create their own templates using the Genie toolkit.

One essentially provides a small set of input sentences that are converted into templates and used by Genie to understand how to parse all similar sentences. This enables Almond to "understand" what's being requested of it.

The set of input sentences can start small and be augmented or added to over time to handle more diverse or complex queries or requests. Their GitHub toolkit and Genie technology are described more fully in the paper Genie: A generator of natural language semantic parsers for virtual assistant commands.

Stanford ThingTalk declarative language

ThingTalk is the programming language used to control what Almond can do with requests and queries. Essentially it's a multi-part statement about what to do when a request comes along. The main parts of a ThingTalk statement include:

  1. When a particular action is supposed to be triggered.
  2. What service the request needs in order to perform its action.
  3. What action is requested.

The "what service the request needs" parts are based on open API calls (see Thingpedia below). The "what action is requested" part can either be a standard Almond action or invoke other Thingpedia open source API calls, such as create a tweet, post on FB, send email, etc.

For example, a ThingTalk statement looks like:

monitor @com.foxnews.get() => @com.slack.send();

Which monitors Fox news for any new news articles and sends them (the link) to your Slack channel.

Stanford Thingpedia

Thingpedia is an open source repository of structured information available on the web and of API services available on the web. Structured information or data is the information behind calendars, contact databases, article repositories, etc., any of which can be queried for information and some of which can be updated or have actions performed on them. API services are the way those queries and actions are performed.

One page of the Thingpedia multi-page summary of services that are offered

The Thingpedia web page shows a number of services that already have open source APIs defined and registered, for example Twitter, Facebook, Bing search, BBC News, Gmail and a host of other services. More are being added all the time, and these represent the domains in which Almond can be used to act.

Some of these domains are more defined than others. But in any case, any service that takes the form of a web-based API can be added to Thingpedia.

Thingpedia as a standalone open source repository is valuable in and of itself, regardless of its use by Almond. But Almond would be impossible without Thingpedia. Thingpedia wants to be the Wikipedia of APIs.

Almond, putting it all together

Almond consists mainly of the Almond Agent, the Almond Engine and Thingpedia. The Agent is used by the various Almond implementations to parse and understand a request and access the ThingTalk program statement. The Almond Agent uses its LUInet natural language interpreter to interpret the request and select the ThingTalk program for it. Once the ThingTalk program is identified, Almond uses the various Thingpedia APIs requested by the ThingTalk statement to generate the proper API calls to the service being requested and to generate any output that is requested.

Where can you run Almond

Almond is currently available as a web app, an Android app, a GNOME (Linux) desktop/laptop app, or a CLI application, and it can be run on your Mac or Windows computer. You could of course create your own smart speaker to run Almond, or perhaps hack a current smart speaker to do so.

One important consideration is that with the Android app, all your data and credentials are stored only on the phone and will not go out into the cloud or elsewhere. I didn't see similar statements about privacy protections for the web app or any of the other deployments. But as Almond is open source, you potentially have much greater control over where your data resides.

~~~~

What I would really like is a smart speaker app running on an RPi with a microphone and a decent speaker attached, all in the package of a cube or cylinder.

I thought their videos on Almond were pretty cheesy, but the technology is very interesting and could potentially make for an interesting competitor to today's smart assistants.

Photo Credit(s):

All photos and graphics from Stanford Almond and OVAL Lab websites.