Safe AI

I’ve been writing about AGI (see part-0 [ish], part-1 [ish], part-2 [ish], part-3ish, part-4 and part-5) and the dangers that come with it (part-0 in the above list) for a number of years now. In my last post on the subject, I expected my next one to discuss the book Human Compatible: AI and the Problem of Control, which is a great book on the subject. But since then I ran across a paper that is perhaps a better brief introduction to the topic and to some of the current thought and research on developing safe AI.

The paper I found is Concrete Problems in AI Safety, written by a number of researchers at Google, Stanford, Berkeley, and OpenAI. It essentially lays out the AI safety problem in 5 dimensions:

Avoiding negative side effects – these can be minor or major, and this is probably the one thing that scares humans the most: some toothpick-generating AI that strips the world to maximize toothpick making.

Avoiding reward hacking – this is more subtle, but essentially it’s having your AI fool you into thinking it’s doing what you want while it’s doing something else. This could range from actually changing the reward logic itself to convincing/manipulating the human overseer into seeing things its way. Also a pretty bad thing from humanity’s perspective.

Scalable oversight – this is the problem where human overseers aren’t able to keep up and witness/validate what some AI is doing, 7×24, across the world, at the speed of electronics. So how can AI be monitored properly so that it doesn’t go and do something it’s not supposed to (see the prior two for ideas on how bad this could be)?

Safe exploration – this is the idea that reinforcement learning, in order to work properly, has to occasionally explore a solution space, e.g. a Go board with moves selected at random, to see if they are better than what it currently believes is the best move to make. This isn’t much of a problem for game playing ML/AI, but if we are talking about helicopter controlling AI, exploration at random could destroy the vehicle plus any nearby structures, flora or fauna, including humans of course.

Robustness to distributional shifts – this is the perennial problem where AI or DNNs are trained on one dataset, but over time the real world changes and the data they are now seeing has shifted (in distribution) to something else. This often leads to DNNs not operating properly over time or having many more errors in deployment than they did during training. This is probably the one problem in this list that is undergoing more research to try to rectify than any of the others, because it impacts just about every ML/AI solution currently deployed in the world today. This robustness to distributional shifts problem is why many AI DNN systems require periodic retraining.

So now we know what to look for, now what

Each of these deserves probably a whole book or more to understand and try to address. The paper talks about all of these and points to some of the research or current directions trying to address them.

The researchers correctly point out that some of the above problems are more pressing when more complex ML/AI agents have more autonomous control over actions in the real world.

We don’t want our automotive automation driving us over a cliff just to see if it’s a better action than staying in the lane. But Go playing bots or article summarizers might be ok to be wrong occasionally if it could lead to better playing bots/more concise article summaries over time. And although exploration is mostly a problem during training, it’s not to say that such activities might not also occur during deployment to probe for distributional shifts or other issues.

However, as we start to see more complex ML/AI solutions controlling more activities, the issues of AI safety are becoming more pressing. Autonomous cars are just one pressing example. But recent introductions of sorting robots, agricultural bots, manufacturing bots, nursing bots, guard bots, soldier bots, etc. are all just steps down a (short) path of increasing complexity that can only end in some AGI bots running more parts (or all) of the world.

So safety will become a major factor soon, if it’s not already

Scares me the most

The first two on the list above scare me the most: avoiding negative or unintentional side effects, and reward hacking.

I suppose if we could master scalable oversight we could maybe deal with all of them better as well. But that’s defense. I’m all about offense and tackling the problem up front rather than trying to deal with it after it’s broken.

Negative side effects

Negative side effects is a rather nice way of stating the problem of having your ML destroy the world (or parts of it) that we need to live.

One approach to dealing with this problem is to define or train another AI/ML agent to measure impacts on the environment and have it somehow penalize the original AI/ML for causing them. The learning approach has some potential to be applied to numerous ML activities if it can be shown to be safe and fairly all encompassing.

Another approach discussed in the paper is to inhibit or penalize the original ML for any actions which have negative consequences. One way to do this is to come up with an “empowerment measure” for the original AI/ML solution. The idea would be to reduce, minimize or govern the original ML’s action set (or potential consequences), as captured by that empowerment measure, so as to minimize its ability to create negative side effects.

The paper discusses other approaches to the problem of negative side effects, one of which is having multiple ML (or ML and human) agents working on the problem it’s trying to solve together and having the ability to influence (kill switch) each other when they discover something’s awry. And the other approach they mention is to reduce the certainty of the reward signal used to train the ML solution. This would work by having some function that would reduce the reward if there are random side effects, which would tend to have the ML solution learn to avoid these.

Neither of these latter two seems as feasible as the others, but they are all worthy of research.
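The impact-penalty idea above can be sketched in a few lines. This is only an illustration under loud assumptions: the state dictionaries, the `protected` feature list and the `penalty_weight` value are all hypothetical, and the impact measure here is hand-coded rather than learned by a second agent as the paper suggests.

```python
def side_effect_impact(before, after, protected):
    """Hypothetical impact measure: total absolute change in protected features."""
    return sum(abs(after[name] - before[name]) for name in protected)


def penalized_reward(task_reward, before, after, protected, penalty_weight):
    """Task reward minus a weighted penalty for changing protected features."""
    impact = side_effect_impact(before, after, protected)
    return task_reward - penalty_weight * impact


# Toy toothpick example (all numbers invented for illustration).
before = {"trees": 100.0, "toothpicks": 0.0}
careful = {"trees": 99.0, "toothpicks": 50.0}    # uses one tree
greedy = {"trees": 0.0, "toothpicks": 500.0}     # strips the forest

careful_reward = penalized_reward(50.0, before, careful, ("trees",), 10.0)
greedy_reward = penalized_reward(500.0, before, greedy, ("trees",), 10.0)
```

With this (assumed) weight the careful policy scores higher than the greedy one, even though the greedy policy makes ten times as many toothpicks. Picking the protected features and the weight is of course the hard part.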

Reward hacking

This seems less of a problem to our world than negative side effects until you consider that if an ML agent is able to manipulate its reward code, it’s probably able to manipulate any code intending to limit potential impacts, penalize it for being more empowered or manipulate a human (or other agent) with its hand over the kill switch (or just turn off the kill switch).

So this problem could easily lead to a break out of any of the other problems present on the list of safety problems above and below. An example of reward hacking is a game playing bot that detects a situation that leads to a buffer overflow and results in a win signal or higher rewards. Such a bot will no doubt learn how to cause more buffer overflows so it can maximize its reward rather than learn to play the game better.

But the real problem is that a reward signal used to train a ML solution is just an approximation of what’s intended. Chess programs in the past were trained by masters to use their opening to open up the center of the board and use their middle and end game to achieve strategic advantages. But later chess and go playing bots just learned to checkmate their opponent and let the rest of the game take care of itself.

Moreover, (board) game play is a relatively simple domain in which to come up with proper reward signals (with the possible exception of buffer overflows or other bugs). But for car driving bots, drone bots, guard bots, etc., reward signals are not nearly as easy to define or implement.

One approach to avoid reward hacking is to make the reward signaling process its own ML/AI agent that is (suitably) stronger than the ML/AI agent learning the task. Most reward generators are relatively simple code. For instance in Monopoly, one that just counts the money that each player has at the end of the game could be used to determine the winner (in a timed Monopoly game). But rather than having a simple piece of code create the reward signal, one could use ML to learn what the reward should be. Such an agent might be trained to check whether more or less money was being counted than was physically possible in the game, whether property was illegally obtained during the game, or whether other reward hacks were done, and to penalize the ML solution for these actions. These would all make the reward signal depend on proper training of that reward ML solution. And the two ML solutions would effectively compete against one another.

Another approach is to “sandbox” the reward code/solution so that it is outside of external and/or ML/AI influence. Possibly combining the prior approach with this one might suffice.

Yet another approach is to examine the ML solution’s future states (actions) to determine if any of them impact the reward function itself and penalize it for doing this. This assumes that the future states are representative of what it plans to do and that some code or some person can recognize states that are inappropriate.

Another approach discussed in the paper is to have multiple reward signals. These could use multiple formulas for computing the multi-faceted reward signal and averaging them or using some other mathematical function to combine them into something that might be more accurate than one reward function alone. This way any ML solution reward hacking would need to hack multiple reward functions (or perhaps the function that combines them) in order to succeed.
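A minimal sketch of the multiple-reward-signal idea, using the Monopoly example from earlier. The episode fields (`reported_cash`, `ledger_cash`) and both reward functions are hypothetical; the point is only that the choice of combining function matters when one signal has been hacked.

```python
from statistics import mean


def combined_reward(episode, reward_fns, combine=mean):
    """Evaluate several independent reward functions and combine the results."""
    return combine([fn(episode) for fn in reward_fns])


# Two hypothetical ways of scoring the same Monopoly game.
def reward_from_reported_cash(episode):
    return float(episode["reported_cash"])   # what the agent claims it holds


def reward_from_ledger(episode):
    return float(episode["ledger_cash"])     # recomputed from the transaction log


reward_fns = [reward_from_reported_cash, reward_from_ledger]

# An episode where the agent has inflated only the first signal.
hacked = {"reported_cash": 1_000_000, "ledger_cash": 1_500}

averaged = combined_reward(hacked, reward_fns)                # still inflated
pessimistic = combined_reward(hacked, reward_fns, combine=min)  # bounded by the honest signal
```

Averaging dilutes a single hacked signal but does not remove it, whereas taking the minimum forces the agent to hack every signal at once. Which combiner is appropriate is a design choice, not something the paper prescribes here.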

The one IMHO that has the most potential but which seems the hardest to implement is to somehow create “variable indifference” in the ML/AI solution. This means having the ML/AI solution ignore any steps that impact the reward function itself or other steps that lead to reward hacking. The researchers rightfully state that if this were possible then many of the AI safety concerns could be dealt with.

There are many other approaches discussed and I would suggest reading the paper to learn more. None of the others seems simple or a complete solution to all potential reward hacks.

~~~

The paper goes into the same or more level of detail with the other three “concrete safety” issues in AI.

In my last post (see part 5 link above) I thought I was going to write about the discussion of AI safety in S. Russell’s book Human Compatible (AI). But then I found the “Concrete problems in AI safety” paper (see link above), thought it provided a better summary of AI safety issues, and used it instead. I’ll try to circle back to the book at some later date.

Is AGI just a question of scale now – AGI part-5

Read two articles over the past month or so. The more recent one was an Economist article (AI enters the industrial age, paywall) and the other was A generalist agent (from Deepmind). The Deepmind article was all about the training of Gato, a new transformer deep learning model trained to perform well on 600 separate task arenas from image captioning, to Atari games, to robotic pick and place tasks.

And then there was this one tweet from Nando de Freitas, research director at Deepmind:

Someone’s opinion article. My opinion: It’s all about scale now! The Game is Over! It’s about making these models bigger, safer, compute efficient, faster at sampling, smarter memory, more modalities, INNOVATIVE DATA, on/offline, … 1/N

I take this to mean that AGI is just a matter of more scale. Deepmind and others see the way to attain AGI as just a matter of throwing more servers, GPUs and data at training the model.

We have discussed AGI in the past (see part-0 [ish], part-1 [ish], part-2 [ish], part-3ish and part-4 blog posts [We apologize, only started numbering them at 3ish]). But this tweet is possibly the first time we have someone in the know, saying they see a way to attain AGI.

Transformer models

It’s instructive, from my perspective, that Gato is a deep learning transformer model. The other big NLP models have all been transformer models as well.

Gato (from Deepmind), SWITCH Transformer (from Google), GPT-3 (from OpenAI), GPT-J (from EleutherAI), OPT (from Meta), and Wu Dao 2.0 (from the Beijing Academy of Artificial Intelligence) are all trained on more and more text and image data scraped from the web, Wikipedia and other databases.

Wikipedia says transformer models are an outgrowth of RNN and LSTM models that use attention vectors on text. Attention vectors encode, into a vector (matrix), all textual symbols (words) prior to the latest textual symbol. Each new symbol encountered creates another vector with all prior symbols plus the latest word. These vectors would then be used to train RNN models using all vectors to generate output.

The problem with RNN and LSTM models is that they’re impossible to parallelize. You always need to wait until you have encountered all symbols in a text component (sentence, paragraph, document) before you can begin to train.

Instead of encoding these attention vectors as each symbol is encountered, transformer models encode all symbols at the same time, in parallel, and then feed these vectors into a DNN to assign attention weights to each symbol vector. This allows for complete parallelism, which also reduces the elapsed time needed to train transformer models.
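To make the parallelism point concrete, here is a stripped-down sketch of scaled dot-product self-attention in plain Python. It is the textbook formulation, not a reproduction of any model named in this post, and it deliberately leaves out the learned query/key/value projections, multiple heads and positional encodings that real transformers use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs = []
    # Each pass of this loop depends only on the inputs, never on another
    # symbol's output, so every symbol could be processed at the same time.
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


# Three toy 2-dimensional symbol embeddings (values invented for illustration).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(embeddings)
```

Contrast this with an RNN, where the hidden state for symbol n cannot be computed until symbol n-1 has been processed.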

And transformer models allowed for a large increase in DNN parameters (I read these as roughly the weights connecting DNN nodes, summed over all the layers in a model). GATO has 1.2B parameters, GPT-3 has 175B parameters, and SWITCH Transformer is reported to have 7X more parameters than GPT-3.

Estimates for how much it cost to train GPT-3 range anywhere from $10M-20M USD.

AGI will be here in 10 to 20 yrs at this rate

So it takes ~$15M to train a 175B parameter transformer model, and Google has already done SWITCH, which has 7-10X (~1.5T) the number of GPT-3 parameters. It seems to be an arms race.

If we assume it costs ~$65M (~2X efficiency gain since GPT-3 training) to train SWITCH, we can create some bounds as to how much it will cost to train an AGI model.

By the way, the number of synapses in the human brain is approximately 1000T (See Basic NN of the brain, …). If we assume that DNN parameters are equivalent to human synapses (a BIG IF), we probably need to get to over a 1000T parameter model before we reach true AGI.

So my guess is that any AGI model lies somewhere between 650X and 6,500X the parameters of SWITCH, or roughly 1Q to 10Q model parameters.

If we assume current technology is used to do the training, this would cost $40B to $400B. Of course, GPUs are not standing still and NVIDIA’s Hopper (introduced in 2022) is at least 2.5X faster than their previous gen A100 GPU (introduced in 2020). So if we waited 10 years or so, we might be able to reduce this cost by a factor of 100X, and in 20 years maybe by 10,000X, or back to roughly where SWITCH is today.
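The back-of-envelope arithmetic above can be written out as follows. It assumes training cost scales linearly with parameter count, which is a simplification, and it takes the 100X and 10,000X hardware gains at face value. The constant and function names are mine.

```python
# Figures quoted in this post.
GPT3_PARAMS = 175e9        # GPT-3 parameter count
GPT3_COST_USD = 15e6       # midpoint of the $10M-20M estimate
SWITCH_PARAMS = 1.5e12     # ~1.5T parameters
EFFICIENCY_GAIN = 2.0      # assumed ~2X efficiency gain since GPT-3 training


def scaled_cost(params, ref_params, ref_cost, efficiency=1.0):
    """Scale a reference training cost linearly by parameter count."""
    return ref_cost * (params / ref_params) / efficiency


# Comes out near $64M, which the post rounds to ~$65M.
switch_cost = scaled_cost(SWITCH_PARAMS, GPT3_PARAMS, GPT3_COST_USD, EFFICIENCY_GAIN)


def agi_cost(multiplier, hardware_speedup=1.0, base_cost=65e6):
    """Cost of a model `multiplier` times the size of SWITCH, after hardware gains."""
    return base_cost * multiplier / hardware_speedup


if __name__ == "__main__":
    print(f"SWITCH estimate: ${switch_cost / 1e6:.0f}M")
    for mult in (650, 6500):
        for label, speedup in (("today", 1), ("+10 yrs", 100), ("+20 yrs", 10_000)):
            print(f"{mult}X SWITCH, {label}: ${agi_cost(mult, speedup) / 1e9:.3f}B")
```

At 650X and 6,500X this gives about $42B and $423B with today's hardware, matching the $40B to $400B range, and about $4M to $42M after a 10,000X hardware improvement.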

So in the next 20 years most large tech firms should be able to create their own AGI models. In the next 10 years most governments should be able to train their own AGI models. And as of today, a select few world powers could train one, if they wanted to.

Where they get the additional data to train these models (I assume that data counts would go up linearly with parameter counts) may be another concern. However, I’m sure if you’re willing to spend $40B on AGI model training, spending a few $B more on data acquisition shouldn’t be a problem.

~~~~

At the end of the Deepmind article on Gato, it talks about the need for AGI safety in terms of developing preference learning, uncertainty modeling and value alignment. The footnote for this idea is the book, Human Compatible (AI) by S. Russell.

Preference learning is a mechanism for AGI to learn the “true” preference of a task it’s been given. For instance, if given the task to create toothpicks, it should realize the true preference is to not destroy the world in the process of making toothpicks.

Uncertainty modeling seems to be about having AI assume it doesn’t really understand what the task at hand truly is. This way there’s some sort of (AGI) humility when it comes to any task. Such that the AGI model would be willing to be turned off, if it’s doing something wrong. And that decision is made by humans.

Deepmind has an earlier paper on value alignment. But I see this as the ability of AGI to model human universal values (if such a thing exists) such as the sanctity of human life, the need for the sustainability of the planet’s ecosystem, all humans are created equal, all humans have the right to life, liberty and the pursuit of happiness, etc.

I can see a future post is needed soon on Human Compatible (AI).

Living forever – the end of evolution part-3

Read an article yesterday on researchers who had been studying various mammals, trying to determine the number of DNA mutations they accumulate by about the time they die. The researchers found that mole rats die after about 800 mutations; see the Nature article Somatic mutation rates scale with lifespan across mammals and the Telegraph article reporting on the research, Mystery of why humans die around 80 may finally be solved.

Similarly, at around 3500 mutations humans die, at around 3000 mutations dogs die and at around 1500 mutations mice die. But the really interesting thing is that DNA mutation rates and mammal lifespan are highly (negatively) correlated. That is, higher mutation rates lead to mammals with shorter life spans.

C. Linear regression of somatic substitution burden (corrected for analysable genome size) on individual age for dog, human, mouse and naked mole-rat samples. Samples from the same individual are shown in the same colour. Regression was performed using mean mutation burdens per individual. Shaded areas indicate 95% confidence intervals of the regression line. A shows microscopic images of sample mammalian cells and the DNA strands examined and B shows the distribution of different types of DNA mutations (substitutions or indels [insertion/deletions of DNA]).

The Telegraph article seems to imply that at 800 mutations all mammals die. But the Nature Article clearly indicates that death is at different mutation counts for each different type of mammal.

Such research shows one way to live forever. We have talked about similar topics in the distant past, see …-the end of evolution part 1 & part 2.

But in any case it turns out that one of the leading factors that explains the average age of a mammal at death is its DNA mutation rate. Again, mammals with lower DNA mutation rates live longer on average and mammals with higher DNA mutation rates live shorter lives on average.

Moral of the story

If you want to live longer, reduce your DNA mutation rates.

c, Zero-intercept LME regression of somatic mutation rate on inverse lifespan (1/lifespan), presented on the scale of untransformed lifespan (axis). For simplicity, the axis shows mean mutation rates per species, although rates per crypt were used in the regression. The darker shaded area indicates 95% CI of the regression line, and the lighter shaded area marks a twofold deviation from the line. Point estimate and 95% CI of the regression slope (k), FVE and range of end-of-lifespan burden are indicated.

All astronauts are subject to significant forms of cosmic radiation which can’t help but accelerate DNA mutations. So one would have to say that the risk of being an astronaut is that you will die younger.

Moon and Martian colonists will also have the same problem. People traveling, living and working there will have an increased risk of dying young. And of course anyone that works around radiation has the same risk.

Note, the mutation counts/mutation rates that seem to govern life span are averages. Some individuals have lower mutation rates than their species average and some (no doubt) have higher rates. These should have longer and shorter lives on average, respectively.

Given this variability in DNA mutation rates, I would propose that space agencies use the astronaut’s/colonist’s DNA mutation rate as one selection criterion, so that humans who have lower than average DNA mutation rates have a higher priority of being selected to become astronauts/extra-earth colonists. Using this research, and by assaying astronauts for their DNA mutation counts as they come back to earth, one could theoretically determine the impact on their average life span.

In addition, most life extension research is focused on rejuvenating cellular or organism functionality, mainly through the use of young blood, other select nutrients, stem cells that target specific organs, etc. For example, see MIT Scientists Say They’ve Invented a Treatment That Reverses Hearing Loss, which involves taking human cells, transforming them into stem cells (at a certain maturity) and injecting them into the ear drum.

Living forever

In prior posts on this topic (see parts 1 &2 linked above) we suggested that with DNA computation and DNA storage (see or listen rather, to our GBoS podcast with CTO of Catalog) now becoming viable, one could potentially come up with a DNA program that could

  • Store an individual’s DNA using some very reliable and long lived coding fashion (inside a cell or external to the cell) and
  • Craft a DNA program that could periodically be activated (cellular crontab) to access the stored DNA for the individual (in the cell would be easiest) and use this copy to replace/correct any DNA mutation throughout an individual’s cells.

And we would need a very reliable and correct copy of that person’s DNA (using SHA256 hashing, CRCs, ECC, parity and every other way to ensure the DNA as captured is stored correctly forever). And the earlier we obtained the DNA copy for an individual human, the better.
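On the digital side, the integrity checking mentioned above is straightforward. The sketch below treats a DNA sequence as a plain ASCII string of bases (an assumption made purely for illustration) and fingerprints it with SHA-256 and CRC32 from Python’s standard library. Note that hashes and CRCs only detect corruption; actually repairing it would need an error-correcting code or redundant copies, which this sketch does not attempt.

```python
import hashlib
import zlib


def fingerprint(sequence):
    """Return integrity fingerprints for a DNA sequence stored as text."""
    data = sequence.encode("ascii")
    return {
        "length": len(data),
        "crc32": zlib.crc32(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def is_intact(sequence, reference):
    """True if the sequence still matches the fingerprints taken at capture time."""
    return fingerprint(sequence) == reference


# Toy example: one substituted base is enough to fail the check.
captured = "ACGTACGTTAGC"
reference = fingerprint(captured)
mutated = "ACGTACGATAGC"

intact_ok = is_intact(captured, reference)
mutation_detected = not is_intact(mutated, reference)
```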

Also, we would need a copy of the program (and probably the DNA) to be present in every cell in a human for this to work effectively.

However, if we could capture a good copy of a person’s DNA early in their life we could, perhaps, sometime later, incorporate DNA code/program into the individual to use this copy and sweep through a person’s body (at that point in time) and correct any mutations that have accumulated to date. Ultimately, one could schedule this activity to occur like an annual checkup.

So yeah, life extension research can continue along the lines they are going and you can have a bunch of point solutions for cellular/organism malfunction OR it can focus on correctly copying and storing DNA forever and creating a DNA program that can correct DNA defects in every individual cell, using the stored DNA.

End of evolution

Yes, mammals, and that means any human, could live forever this way. But it would signify the start of the end of evolution for the human species. That is, whenever we captured their DNA copy, from that point on evolution (by mutating DNA) of that individual and any offspring of that individual could no longer take place. And if enough humans do this, throughout their lifespan, it means the end of evolution for humanity as a species.

This assumes that evolution (which is natural variation driven by genetic mutation & survival of the fittest) requires DNA variation (essentially mutation) to drive the species forward.

~~~~

So my guess is that either we can live forever and stagnate as a species OR live normal lifespans and evolve as a species into something better over time. I believe nature has made its choice.

The surprising thing is that we are at a point in humanity’s existence where we can conceive of doing away with this natural process, evolution, forever.

Deepmind does code – part 1: the data

1st, let me express my and my fellow coders’/programmers’ disappointment that Deepmind would take on coding. There are many other white collar work domains that need to be conquered before coding.

2nd, let me apologize for the lack of blog posts lately, all I can say is, business is picking up.

Saw an article over the last couple of weeks on Deepmind creating AlphaCode, an artificial intelligence coding application, which they used to enter coding contests and which achieved an estimated 1238 rating, ranking on average within the top 54% of code contest participants.

I can’t recall where I first saw the news, but Deepmind has a pretty decent blog post on AlphaCode and they have published a pre-print of their research paper on AlphaCode as well. I plan on discussing AlphaCode in detail over a couple of posts. This will be the first installment, on where they got the data to train their models.

AlphaCode is a transformer-based language model (see: Wikipedia: Transformer (machine learning model) article) that translates a code competition problem statement into code, that is, a program that, when executed, can solve the problem statement. In order to train AlphaCode, Deepmind first needed to obtain lots of source code.

It’s all about the (training) data

The first step in Deep Learning model generation is gathering data to train the model. Now where would Google’s Deepmind go to gather coding data – well GitHub, a public repository of all things software, of course.

They used GitHub data to pre-train their model(s) but also scraped code (problem statements & test cases) from published code contests to fine-tune their model.

Deepmind has released their fine-tuning CodeContests training data for AlphaCode on GitHub, so as to support other organizations in creating AI models for coding.

GitHub source to the (pre-training) rescue

There are a couple of problems with using GitHub source code for training:

  • GitHub code is in any source code language the author feels most appropriate to use.
  • GitHub code is not guaranteed to work correctly.
  • GitHub code is not guaranteed to be completed code.
  • GitHub code represents a wide range of coding skill.
  • GitHub code doesn’t always come with a problem statement.

But the use of GitHub in their pre-training data set is intended to give their transformer-based language model some capability to understand (learn) what coding is all about, what a proper syntax would be, what a proper coding sequence would be, etc.

The AlphaCode team took a snapshot of selected GitHub source repos. This meant they only scraped repos that contained C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript languages. They also dropped from the pre-training data any source files larger than 1MB or that had any lines longer than 1000 characters. This was done to avoid using any machine generated code. They also stripped all the white space out of the selected source code files and compared them to eliminate all duplicated code.
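The filtering and de-duplication steps described above could look something like the sketch below. The thresholds come from this post; everything else (function names, using a SHA-256 of the whitespace-stripped text as the duplicate key, keeping the first copy seen) is my assumption and not Deepmind’s actual pipeline.

```python
import hashlib

MAX_FILE_BYTES = 1_000_000   # drop files larger than ~1MB
MAX_LINE_CHARS = 1000        # drop files with any line longer than this


def keep_file(source):
    """Apply the size and line-length filters meant to exclude generated code."""
    if len(source.encode("utf-8")) > MAX_FILE_BYTES:
        return False
    return all(len(line) <= MAX_LINE_CHARS for line in source.splitlines())


def whitespace_key(source):
    """Hash of the file with all whitespace removed, used to spot duplicates."""
    stripped = "".join(source.split())
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def filter_and_dedup(files):
    """Return the files that pass the filters, keeping the first of each duplicate."""
    seen = set()
    kept = []
    for source in files:
        if not keep_file(source):
            continue
        key = whitespace_key(source)
        if key in seen:
            continue
        seen.add(key)
        kept.append(source)
    return kept


sample = ["x = 1\n", "x=1", "y = " + "1" * 2000, "z = 2\n"]
cleaned = filter_and_dedup(sample)
```

In the sample, the second file is dropped as a whitespace-only variant of the first and the third is dropped for its 2000+ character line.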

Their final pre-training dataset was 715GB of data over 86 million source files.

Although it’s unstated, we would guess that the AlphaCode team used the GitHub repo’s README.md file as a surrogate for the solution description. It’s unclear what else could have been used, unless they generated it automatically by extracting semantic content from, or generating a summarization of, the README.md files.

Excerpt from Deepmind’s competitive code contest source code & problem statements README.md file

The (pre-)training data can be used to train a transformer-based language model. These are used today to provide language translation. In AlphaCode’s case they wanted to create a transformer-based code model that translates a specification of a coding problem into source code to solve that problem.

For language translation models, they use text files that are in different languages but represent the same law or information and, notably, are human-generated translations.

One challenge with using internet-scraped data for training is that it can easily contain actual solutions, verbatim, for the problems the model is trying to solve. In order to avoid copying these solutions entirely, they decided to split their data into a training set, validation set and test set on a time basis. This way the training data used source code/problem statements only from a period of time prior to the validation set. Ditto for the training-validation data with the test data.

To show that this approach (using a time point to split the data) worked, they trained a 1B parameter AlphaCode transformer on two different training-validation datasets: the “random” split, where the validation data was selected at random (the normal approach to selecting validation data), and the “temporal” split, where the validation data only came from some time after the training data. The 1B AlphaCode transformer was able to properly code 0.8% of the problems using a 13K sample of 86M source files/problem statements on the random split, but 0% on the temporal split.
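A temporal split is simple to express in code. In this sketch the `published` field and the two cutoff dates are hypothetical; the only thing taken from the text is that every training item predates every validation item, which in turn predates every test item.

```python
from datetime import date


def temporal_split(items, train_end, valid_end):
    """Split items by publication date into train, validation and test sets."""
    train = [x for x in items if x["published"] < train_end]
    valid = [x for x in items if train_end <= x["published"] < valid_end]
    test = [x for x in items if x["published"] >= valid_end]
    return train, valid, test


# Toy problem records with invented dates.
problems = [
    {"name": "A", "published": date(2019, 5, 1)},
    {"name": "B", "published": date(2021, 8, 15)},
    {"name": "C", "published": date(2021, 2, 3)},
    {"name": "D", "published": date(2022, 1, 10)},
]

train, valid, test = temporal_split(problems, date(2021, 1, 1), date(2021, 10, 1))
```

A random split would instead shuffle the records and cut by count, which lets solutions to validation problems leak in through later training data.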

So much for pre-training, let’s discuss fine tuning

AlphaCode was going to get nowhere with a 0% solve rate (ok this was based on a 13K sample and only a 1B parameter model) but they realized that Git code was only going to get them so far. (ok conjecture on my part)

So fine-tuning beyond pre-training (Git derived) data was needed. So the AlphaCode team turned to code competition source code/problem statement data.

Most code contests publish source code submissions as well as the problem statements and sample test cases. By scraping these, Deepmind was able to obtain a very well annotated dataset they could use to fine-tune their AlphaCode transformer model.

They again used a temporal split for training/validation/test data. But they were also able to add metadata to their data that indicated whether the code solved the problem statement.

Code competitions also publish tests for the problem statement. Having the tests, a human can use them to validate whether their code at least works against the tests. Code contests also have a set of more (sophisticated) hidden tests that they use internally to validate code submissions.

This test data will become important later on in the model’s operation, which will be discussed in a future post, but suffice it to say that AlphaCode uses the public tests (and mutations of these) to validate AlphaCode generated source code before submitting it.
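As a rough illustration of validating generated code against public tests, the sketch below treats each candidate solution as a Python callable and each test as an (arguments, expected result) pair. Both representations are assumptions of mine; a real system would run untrusted programs in a sandbox against input/output files.

```python
def passes_public_tests(candidate, tests):
    """True if the candidate returns the expected result for every public test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crashing candidate counts as a failure.
            return False
    return True


def filter_candidates(candidates, tests):
    """Keep only the candidates that pass all public tests."""
    return [c for c in candidates if passes_public_tests(c, tests)]


# Toy problem: "return the sum of two integers", with three generated attempts.
def attempt_add(a, b):
    return a + b


def attempt_subtract(a, b):
    return a - b


def attempt_crash(a, b):
    return a / 0


public_tests = [((1, 2), 3), ((0, 0), 0)]
survivors = filter_candidates([attempt_add, attempt_subtract, attempt_crash], public_tests)
```

Only the first attempt survives: the second passes the (0, 0) test by luck but fails the (1, 2) test, and the third raises an exception.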

This fine-tuning dataset is available in the GitHub repo (linked to above) that Deepmind has created/curated for others to work with.

Another nicety of this fine-tuning data is they have proper, human created, problem statements to work from rather than README.md surrogates.

In part-2 we plan to describe the transformer-based model that was created for AlphaCode and at some point, discuss how they used testing in their code submissions.

Once again, all my information comes from Deepmind’s pre-print on their AlphaCode project (linked to above).

Any comments, please don’t hesitate to let me know.

AI navigation goes with the flow

Read an article the other day (Engineers Teach AI to Navigate Ocean with Minimal Energy) about a simulated robot that was trained to navigate 2D turbulent water flow to travel between locations. They used a combination of reinforcement learning with a DNN derived policy. The article was reporting on a Nature Communications open access paper (Learning efficient navigation in vortical flow fields).

The team was attempting to create an autonomous probe that could navigate the ocean and other large bodies of water to gather information. I believe ultimately the intent was to provide the navigational smarts for a submersible that could navigate terrestrial and non-terrestrial oceans.

One of the biggest challenges for probes like this is to be able to navigate turbulent flow without needing a lot of propulsive power or using a lot of computational power. They said that any probe that could propel itself faster than the current could easily travel wherever it wanted, but the real problem was to go somewhere with lower powered submersibles. As a result, they set their probe to swim at a constant speed of 80% of the overall simulated water flow speed.

Even that was relatively feasible if you had unlimited computational power to train and inference with but trying to do this on something that could fit in a small submersible was a significant challenge. NLP models today have millions of parameters and take hours to train with multiple GPU/CPU cores in operation and lots of memory Inferencing using these NLP models also takes a lot of processing power.

The researchers targeted the computational power to something significantly smaller and wished to train and perform real time inferencing on the same hardware. They chose a “Teensy 4.0 micro-controller” board for their computational engine which costs under $20, had ~2MB of flash memory and fit in a space smaller than 1.5″x1.0″ (38.1mm X 25.4mm).

The simulation setup

The team started training their probe in turbulent flow with a cylinder in a constant flow that generated downstream vortices, rotating in opposite directions. These vortices would travel from left to right in the simulated flow field. In order for the navigation logic to traverse this vortical flow, they randomly selected start and end locations on different sides.

The AI model they trained and used for inferencing was a combination of reinforcement learning (with an interesting multi-factor reward signal) and a policy using a trained deep neural network. They called this approach Deep RL.

For reinforcement learning, they used a reward signal that was a function of three variables: the time it took, the difference in distance to target and a success bonus if the probe reached the target. The time variable was a penalty and was the duration of the swim activity. Distance to target was how much the Euclidean distance between the current probe location and the target location had changed over time. The bonus was only applied when the probe was in close proximity to the target location. The researchers indicated the reward signal could be used to optimize for other values such as energy to complete the trip, surface area traversed, wear and tear on propellers, etc.
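As a rough sketch of how a three-part reward like this might be computed for one swim step (the function name, the success radius, the bonus size and the equal weighting of the terms are my own assumptions, not values from the paper):

```python
import math


def step_reward(prev_pos, new_pos, target, dt,
                success_radius=0.1, success_bonus=10.0):
    """Reward for one swim step: time penalty + progress toward target + bonus.

    success_radius and success_bonus are illustrative placeholders.
    """
    prev_dist = math.dist(prev_pos, target)
    new_dist = math.dist(new_pos, target)

    reward = -dt                      # time penalty: every step costs its duration
    reward += prev_dist - new_dist    # positive when the probe moved closer
    if new_dist <= success_radius:    # bonus only when close to the target
        reward += success_bonus
    return reward
```

Swapping the time penalty for an energy or propeller-wear term would be one way to optimize for the other values the researchers mention.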

For the reinforcement learning state information, they supplied the relative location of the probe and the target [Difference(Probe x,y, Target x,y)], and whatever sensor data was being tested (e.g., for the velocity sensor equipped probe, the local velocity of the water at the probe’s location).

They trained the DNN policy using the state information (probe start and end location, local velocity/vorticity sensor data) to predict the swim angle used to navigate to the target. The DNN policy used 2 internal layers with 64 nodes each.
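To give a feel for how small such a policy network is, here is a minimal, untrained sketch in plain Python with two hidden layers of 64 nodes each. The tanh activations, the weight initialization, the 4-value state layout (relative x, relative y, local flow u, local flow v) and all function names are my assumptions for illustration; the paper’s actual network and training procedure may differ.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create (weights, biases) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, inputs, activation):
    """Apply one dense layer followed by an activation function."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def make_policy(state_size, hidden=64, seed=0):
    """Two hidden layers of `hidden` nodes, then a single output node."""
    rng = random.Random(seed)
    return [make_layer(state_size, hidden, rng),
            make_layer(hidden, hidden, rng),
            make_layer(hidden, 1, rng)]


def swim_angle(policy, state):
    """Map a state vector to a swim angle in radians, within (-pi, pi)."""
    h = dense(policy[0], state, math.tanh)
    h = dense(policy[1], h, math.tanh)
    out = dense(policy[2], h, math.tanh)[0]
    return math.pi * out


# Hypothetical state: relative target position plus local water velocity.
policy = make_policy(state_size=4)
angle = swim_angle(policy, [1.0, -0.5, 0.2, 0.1])
```

A network this size has only a few thousand weights, which is consistent with wanting to train and run it on a micro-controller.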

They benchmarked the Deep RL solution with local velocity sensing against a number of different approaches: a naive approach that always swam in the direction of the target; a flow-blind approach that had no sensors but used feedback from its location changes to train with; a vorticity sensor approach which sensed the vorticity of the local water flow; and a complete knowledge approach (not shown above) that had information on the actual flow at every location in the 2D simulation.
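To see why the naive approach struggles when the probe is slower than the water, here is a minimal sketch of one time step of a naive swimmer. It assumes simple kinematics (probe velocity = local flow velocity + swim velocity), which is my simplification, and the function name is hypothetical.

```python
import math


def naive_step(pos, target, flow_velocity, swim_speed, dt):
    """Advance the probe one step while always swimming straight at the target."""
    angle = math.atan2(target[1] - pos[1], target[0] - pos[0])
    vx = flow_velocity[0] + swim_speed * math.cos(angle)
    vy = flow_velocity[1] + swim_speed * math.sin(angle)
    return (pos[0] + vx * dt, pos[1] + vy * dt)


# Target is directly upstream; the probe swims at 80% of the flow speed.
new_pos = naive_step(pos=(0.0, 0.0), target=(-1.0, 0.0),
                     flow_velocity=(1.0, 0.0), swim_speed=0.8, dt=1.0)
# new_pos[0] is positive: the probe is swept downstream, away from the target.
```

Heading straight upstream, the naive swimmer loses ground on every step, so getting anywhere requires exploiting the vortices rather than fighting the mean flow.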

It turned out that of the first four (naive, flow-blind, vorticity sensor and velocity sensor) the velocity sensor configured robot had the highest success rate (“near 100%”).

That simulated probe was then measured against the complete flow knowledge version. The complete knowledge version had faster trip speeds, but only 18-39% faster (on the examples shown in the paper). However, the knowledge required to implement this algorithm would not be feasible in a real ocean probe.

More to be done

They tried the probe’s Deep RL navigation algorithm on a different simulated flow configuration, a double gyre flow field (sort of like 2 circular flows side by side but going in opposite directions).

The previously trained (on cylinder vortical flow) Deep RL navigation algorithm only had a ~4% success rate with the double gyre flow. However, after training the Deep RL navigation algorithm on the double gyre flow, it was able to achieve an 87% success rate.

So with sufficient re-training it appears that the simulated probe’s navigation Deep RL could handle different types of 2D water flow.

The next question is how well their Deep RL can handle real 3D water flows, such as tidal flows, up-down swells, long term currents, surface wind-wave effects, etc. It’s probable that any navigation for real world flows would need to have a multitude of Deep RL trained algorithms to handle each and every flow encountered in real oceans.

However, the fact that training and inferencing could be done on the same small hardware indicates that the Deep RL could possibly be deployed in any flow, let it train on the local flow conditions until success is reached and then let it loose, until it starts failing again. Training each time would take a lot of propulsive power but may be suitable for some probes.

The researchers have 3D printed a submersible with a Teensy microcontroller and an Arduino controller board, with propellers surrounding it so it can swim in any 3D direction. They have also constructed a water tank for real life testing of their Deep RL navigation algorithms.

Picture credit(s):

BEHAVIOR, an in-home robot, benchmark

As my readers probably already know, I’m a long time benchmark geek. So when I recently read an article out of Stanford (AI Experts Establish the “North Star” for Domestic Robotics Field) where a research team there developed a new robotic benchmark, I was interested. The new robotics benchmark is called BEHAVIOR and is documented in an arXiv.org article (see: BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and ecOlogical enviRonments). It essentially uses real world data to identify domestic work activities that any robot would need to perform in a home.

The problems with robot benchmarks

The problems with benchmarks are multi-faceted:

  • How realistic are the workloads used to evaluate the systems being measured?
  • How accurate are the metrics used to rank and judge benchmark submissions?
  • How costly/complex is it to run a benchmark?
  • How are submissions audited and are they reproducible?
  • Where are benchmark results reported and are they public?

And of course robotics brings in its own issues that make benchmarking more difficult:

  • What sensors does the robot have to understand how to complete tasks?
  • What manipulators does the robot have to perform the tasks required of it?
  • Do the robots move in the environment and if so, how do the robots move?
  • Does the robot perform the task in the real world or in a simulated environment?

And of course, when using a simulated environment, how realistic is it?

BEHAVIOR with iGibson (see below) seems to answer many of these concerns for in-home robot benchmarking.

What is BEHAVIOR?

First, BEHAVIOR’s home making tasks were selected from an American Time Use Survey maintained by the USA Bureau of Labor Statistics which identifies tasks Americans perform in their homes. With BEHAVIOR 1.0 there are 100 tasks ranging from building a fruit basket to cleaning a toilet, and just about everything in between. I didn’t see any cooking or mixing drinks tasks but maybe those will be added.

Second, BEHAVIOR uses a predicate logic, called BDDL (BEHAVIOR Domain Definition Language) to define initial conditions for tasks such as tables, chairs, books, etc located in the room, where objects need to be placed, and successful completion goals or what task completion should look like.

BEHAVIOR uses 15 different rooms or scenes in their benchmark, such as a kitchen, garage, study, etc. Each of the 100 tasks is performed in a specific room.

BEHAVIOR incorporates 1217 different objects in 391 categories. Once initial conditions are defined for a task, BEHAVIOR essentially randomly selects different objects for the task and randomly locates them throughout the room.

In order to run the benchmark, one could conceivably create a real room, with all the objects and have them placed according to BEHAVIOR BDDL’s randomly assigned locations with a robot physically present in the room and have it perform the assigned task OR one could use a simulation engine and have the robot run the task in the simulation environment, with simulated room, objects and robot.

It appears as if BEHAVIOR could operate in any robotics simulation environment but has been currently implemented in Stanford’s open source robotics simulation engine called iGibson 2.0 (see: iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks and iGibson 2.0 website). iGibson uses the Bullet real time physics engine for realistic physical environment simulation.

A robot operating within iGibson is provided a 3D rendering of the room and objects as images or LIDAR sensor scans. It can then identify the objects that it needs to manipulate to perform the tasks. One can define the robot’s simulated sensors and manipulators in iGibson 2.0. It’s written in Python, is open source (GitHub Repo) and can be installed to run on Linux (Ubuntu 16.04), Windows (10) or Mac (10.15) systems.

Finally, BEHAVIOR uses a set of metrics to determine how well a robot has performed its assigned task. Their first metric is a success score, defined as the fraction of goal conditions satisfied by the robot performing the task, such as the number of dishes properly cleaned and placed in the drying rack divided by the total number of dishes for a “washing dishes” task. Their second is a set of efficiency metrics, like time to complete a task, sum total of object distance moved during the task, how well objects are arranged at task completion (is the toilet seat down…), etc.
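The success score is simple enough to sketch in a few lines. The function name and the dictionary layout below are my own illustration, not BEHAVIOR’s actual code or BDDL syntax:

```python
def success_score(goal_conditions):
    """Fraction of goal conditions satisfied when the task ends.

    goal_conditions maps a (hypothetical) condition name to True/False.
    """
    if not goal_conditions:
        raise ValueError("a task needs at least one goal condition")
    satisfied = sum(1 for ok in goal_conditions.values() if ok)
    return satisfied / len(goal_conditions)


# A made-up "washing dishes" task with four dishes, one left dirty.
washing_dishes = {
    "plate_1_clean_in_rack": True,
    "plate_2_clean_in_rack": True,
    "bowl_1_clean_in_rack": True,
    "cup_1_clean_in_rack": False,
}
score = success_score(washing_dishes)  # 3 of 4 conditions met
```

A partial score like this lets a benchmark distinguish a robot that nearly finishes a task from one that never gets started.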

Another feature of iGibson 2.0 is that it offers the ability to record a human (in VR) doing a task in its simulated environment. So if your robotic system is able to learn by example, then iGibson could be used to provide training data for an activity.

~~~~

A couple of additions to the BEHAVIOR benchmark/iGibson simulation environment that I would like to see:

  • There ought to be a way to construct a house/apartment where multiple rooms are arranged in a hierarchy, i.e., rooms associated with floors with connections using hallways, doors, stairs, etc. between them. This way one could conceivably define a set of homes/apartments (let’s say 5) that a robot would perform its tasks in.
  • They need a task list to drive robot activities. Assume that there’s some amount of time let’s say 8-12 hours that a robot is active and construct a series of tasks that need to be accomplished during that period.
  • Robots should be placed in the rooms/apartments/homes at random with random orientation and then they would have to navigate through rooms/passageways to the rooms to perform the tasks.
  • They need to add pet/human avatars in the rooms throughout a home. These would represent real time obstacles to task completion/navigation as well as add more tasks associated with caring for pets/humans.
  • They need the ability to add non-home rooms that could encompass factory floors, emergency response debris fields, grocery stores, etc. and their own unique set of tasks for each of these so that it could be used as a benchmark for more than just domestic robots.

Aside from the above additions to BEHAVIOR/iGibson 2.0, there’s the question of the organization that manages the benchmark and submissions. There needs to be a website/place to publish benchmark results for a robot AND a mechanism to audit results for accuracy to ensure fair play.

Typically this would be associated with an organization responsible for publishing and auditing submissions as well as guiding further development of BEHAVIOR/iGibson 2.0. BEHAVIOR 1.0 is not the end but it’s a great start at providing realistic tasks that any domestic robot would need to perform.

Benchmarks have always aided the development and assessment of new technologies. Having an in-home robot benchmark like BEHAVIOR makes getting domestic robots that do what we want them to do a more likely possibility someday.

There’s a new benchmark in town and it signals the dawning of the domestic robot age.

Photo Credit(s):

NASA’s journey to the cloud – part 1

Read an article the other day, NASA Turns to the Cloud for Help With Next-Generation Earth Missions, about how NASA had started to migrate all their data to the cloud and intended to store all new data there as well. The hope is that researchers would no longer need to download NASA data but rather could access it directly using cloud compute resources.

It turns out that newer earth science satellites are generating so much data that hosting it all is becoming a challenge and, with the quantities being discussed, researchers downloading the data to perform research in their own environments may need days to do so.

Until recently, earth science data has been hosted on and downloadable from NASA, ESA and other space organization sites. For example, see NASA’s GHRC DAAC (Global Hydrometeorology Resource Center Distributed Active Archive Center), ESA EarthOnline, JAXA GPM website, etc. Generally one could download a time series of data from any of their prior and current earth/planetary science missions without too much trouble.

The Land Processes Distributed Active Archive Center (LP DAAC) archives and distributes Global Forest Cover Change (GFCC) data products through the NASA Making Earth System Data Records for Use in Research Environments (MEaSUREs) (https://earthdata.nasa.gov/community/community-data-system-programs/measures-projects) Program….

But NASA’s newest earth science satellites will be generating lots of data. For instance, the SWOT (Surface Water and Ocean Topography) mission data load will be 20TB/day and the NISAR (NASA-ISRO Synthetic Aperture Radar) mission data load will be 80TB/day. And it’s only getting worse as more missions with newer instruments come online.
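To put those daily rates in perspective, here’s some back-of-the-envelope arithmetic. The decimal units (1 PB = 1000 TB) and the 1 Gbps researcher link are my assumptions, and the function names are just for illustration:

```python
TB_PER_PB = 1000  # decimal units assumed


def petabytes_per_year(tb_per_day):
    """Yearly data volume in PB for a mission producing tb_per_day."""
    return tb_per_day * 365 / TB_PER_PB


def download_days(terabytes, link_gbps):
    """Days needed to download `terabytes` over a fully used link."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 86_400


swot_pb = petabytes_per_year(20)    # about 7.3 PB/year
nisar_pb = petabytes_per_year(80)   # about 29.2 PB/year

# One day of NISAR data over an assumed, fully saturated 1 Gbps link.
one_day_of_nisar = download_days(80, link_gbps=1.0)  # a bit over 7 days
```

So under these assumptions, a researcher on a 1 Gbps link would fall further behind every day just trying to keep up with a single mission’s output.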

NASA estimates that, over time, they will store 247PB of data in their EarthData Cloud. At the moment, they have already migrated some of their Earth Science data (all of ASF [Alaska Satellite Facility] DAAC and some of PO.DAAC [Physical Oceanography]) to AWS (us-west-2) and over time all of it will migrate there.

NASA will eat any egress charges for EOSDIS data and is also paying any and all hosting fees to store the data in AWS. It’s unclear whether they are using standard S3 or S3 Intelligent-Tiering. And presumably they are using S3 replication to ensure they don’t lose DAAC data in the cloud, but I don’t see any evidence of that in the literature I’ve read. Of course, replication would double the storage costs for their 247PB of DAAC data.
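The effect of replication on the bill is easy to sketch. Note that the per-GB rate below is a made-up placeholder for illustration only, not a quoted AWS price, and real bills would also depend on storage tier, request charges and any negotiated discounts:

```python
GB_PER_PB = 1_000_000  # decimal units assumed


def monthly_storage_cost(petabytes, usd_per_gb_month, copies=1):
    """Monthly object storage cost in USD for `copies` replicas of the data."""
    return petabytes * GB_PER_PB * usd_per_gb_month * copies


PLACEHOLDER_RATE = 0.02  # hypothetical USD per GB-month, NOT an actual AWS price

single_copy = monthly_storage_cost(247, PLACEHOLDER_RATE)
replicated = monthly_storage_cost(247, PLACEHOLDER_RATE, copies=2)
```

Whatever the real rate turns out to be, the replicated figure is exactly twice the single-copy figure, and at 247PB either one is a large recurring charge.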

Access to all this data is available to anyone with an EarthData login. There you can register for a profile to access NASA earth sciences data.

NASA’s EarthData also offers a number of AWS cloud based services to help one access this data:

  • EarthData search – filtered search facility to access NASA EarthData by platform (e.g. satellite), instrument (e.g. camera/visual data), organization (e.g. NASA/JPL), etc.
  • EarthData Common Metadata Repository – API driven metadata repository that “catalogs all data and service metadata records for NASA’s EOSDIS (Earth Observing System Data and Information System) system” data, that can be accessed by anyone, which includes programmatic access to EarthData search.
  • EarthData Harmony – EarthData Jupyter notebook examples and API documentation for performing research on earth science data in the EarthData cloud.
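As an illustration of what programmatic access to the metadata repository might look like, here is a small sketch that only builds a search URL and does not contact any service. The endpoint path and the `keyword` and `page_size` parameters reflect my understanding of the public CMR search API and should be checked against NASA’s CMR documentation; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed root of the public CMR search API; verify against NASA's docs.
CMR_SEARCH_ROOT = "https://cmr.earthdata.nasa.gov/search"


def cmr_collection_search_url(keyword, page_size=10):
    """Build a CMR collection search URL for a free-text keyword."""
    query = urlencode({"keyword": keyword, "page_size": page_size})
    return f"{CMR_SEARCH_ROOT}/collections.json?{query}"


url = cmr_collection_search_url("sea surface height", page_size=5)
```

The resulting URL could then be fetched with any HTTP client, from a laptop or from an EC2 instance sitting next to the data.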

One reason to move EOSDIS DAAC data to the cloud is to allow researchers to not have to download data to run their analysis. By using in-cloud EC2 compute instances, they can run their research in AWS with direct, high speed access to the EarthData.

Of course, the researcher would need to purchase their EC2 compute facility directly from AWS. NASA publishes a sort of AWS pricing primer for researchers to use AWS EC2 compute to do research directly on the data in the cloud. Also NASA offers a series of tutorials on how to use the AWS cloud for doing research on NASA DAAC data.

Where to from here?

I find this all somewhat discouraging. Yes it’s the Gov’t but one needs to wonder what the overall costs of hosting NASA DAAC data on the AWS cloud will be over the long haul. Most organizations use the cloud to prototype and scale up services but once these services have stabilized, they migrate them back to onprem/CoLo infrastructure. See for example, Dropbox’s move away from the [AWS] cloud for ~600PB of data.

I get it, the public cloud allows for nearly infinite data scalability. But cloud storage is not cheap, especially when you are talking about 100s of PBs. And in today’s world, with a whole bunch of open source solutions for object storage and services, one can recreate almost any cloud service in one’s own data center, at a much lower price.

Sure it will still take IT infrastructure and personnel to put it all together. But NASA doesn’t seem to be lacking in infrastructure or IT personnel. Even if you are enamored with AWS services and software infrastructure, one can always run AWS Outposts in your data centers. And DAAC services seem to be pretty stable over time. Yes new satellites will generate more data, but the data load is understood and very predictable. So one should be able to anticipate all this and have infrastructure in place to deal with it.

Yes, having the ability to run analysis in the cloud directly on the data also sitting in the cloud is useful, especially not having to download TBs of data. But these costs can also be significant and they are borne by the researcher, not NASA.

Another gripe is why use AWS alone? The other cloud providers all have similar object storage and compute capabilities. It seems wiser to me to set up the EarthData service such that different DAACs reside in different clouds. This would be more complex and harder to administer and use but I believe in the long run would lead to better, more effective services at a more reasonable price.

Going to the cloud doesn’t have to be a one way endeavor. After using the cloud for a while, NASA should have a better idea of the costs of doing so and at that time understand better what it can and cannot afford to do on its own.

It will be interesting to see what ESA, JAXA, CERN and other big science organizations do as they are all in the same bind, data seems to be growing unbounded.

Picture Credit(s):

For AGI, is reward enough – part 4

Last May, an article came out of DeepMind research titled Reward is enough. It was published in an artificial intelligence journal but PDFs of it are available free of charge.

The article argues that, according to DeepMind researchers, using reinforcement learning and an appropriate reward signal is sufficient to attain AGI (artificial general intelligence). We have written about the perils and pitfalls of AGI before (see Existential event risks [-part-0], NVIDIA Triton GMI, a step to far [-part-1], The Myth of AGI [-part-2], and Towards a better AGI – part 3ish). (Sorry, I only started numbering them after part 3ish.)

My last post on AGI inclined towards the belief that AGI was not possible without combining deduction, induction and abduction (probabilistic reasoning) together and that any such AGI was a distant dream at best.

Then I read the Reward is Enough article and it implied that they saw a realistic roadmap towards achieving AGI based solely on reward signals and Reinforcement Learning (wikipedia article on Reinforcement Learning). To read the article was disheartening at best. After the article came out, I made it a hobby to understand everything I could about Reinforcement Learning to understand whether what they are talking about is feasible or not.

Reinforcement learning, explained

Let’s just say that the text book, Reinforcement Learning, is not the easiest read I’ve seen. But I gave it a shot and although I’m nowhere near finished (lost somewhere in chapter 4), I’ve come away with a better appreciation of reinforcement learning.

The premise of Reinforcement Learning, as I understand it, is to construct a program that performs a sequence of steps based on the state or environment the program is working on, records that sequence and tags or values that sequence with a reward signal (i.e., +1 for good job, -1 for bad, etc.). Depending on whether the steps are finite (i.e., always end) or infinite (never end), the reward tagging could be cumulative (finite steps) or discounted (infinite steps).
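The difference between cumulative and discounted reward can be sketched in a few lines. The discount factor (usually written as gamma) of 0.9 below is an arbitrary value I picked for illustration:

```python
def cumulative_return(rewards):
    """Total reward for a finite sequence of steps."""
    return sum(rewards)


def discounted_return(rewards, gamma=0.9):
    """Reward where a step k moves into the future is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


episode = [1, -1, 1]
total = cumulative_return(episode)                     # 1 - 1 + 1
discounted = discounted_return([1, 1, 1], gamma=0.5)   # 1 + 0.5 + 0.25
```

With gamma below 1, rewards far in the future shrink toward zero, which is why a never-ending task can still be given a finite value.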

The record of the program’s sequence of steps would include the state or the environment and the next step that was taken. One keeps doing this until the program completes the task or, if infinite, until the discounted reward signal is minuscule enough to not matter anymore.

Once you have a log or record of the state, the step taken in that state and the reward for that step, you have a policy that can be used to take better steps. Over time, with sufficient state-step-reward sequences, one can build a policy that would work very well for the problem at hand.
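Here is a minimal sketch of turning such a log into a lookup-table policy: for each state, pick the step with the best average reward seen so far. This is a deliberately simplified tabular illustration of the idea as I understand it, not an algorithm taken from the text book, and the names are my own.

```python
from collections import defaultdict


def build_policy(log):
    """Build {state: best step} from (state, step, reward) records."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for state, step, reward in log:
        totals[state][step] += reward
        counts[state][step] += 1

    policy = {}
    for state, steps in totals.items():
        # Choose the step with the highest average recorded reward.
        policy[state] = max(steps, key=lambda s: steps[s] / counts[state][s])
    return policy


log = [
    ("s0", "left", 1.0),
    ("s0", "right", -1.0),
    ("s0", "left", 0.0),
    ("s1", "right", 1.0),
]
policy = build_policy(log)
```

For any problem with a large number of states, a table like this quickly becomes impractical, which is where neural network policies come in.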

Reinforcement learning, a chess playing example

Let’s say you want to create a chess playing program using reinforcement learning. If a sequence of moves ends the game, you can tag each move in that sequence with a reward (say +1 for wins, 0 for draws and -1 for losing), perhaps discounted by the number of moves it took to win. The “sequence of steps” would include the game board and the move chosen by the program for that board position.

Figure 2: Comparison with specialized programs. (A) Tournament evaluation of AlphaZero in chess, shogi, and Go in matches against respectively Stockfish, Elmo, and the previously published version of AlphaGo Zero (AG0) that was trained for 3 days. In the top bar, AlphaZero plays white; in the bottom bar AlphaZero plays black. Each bar shows the results from AlphaZero’s perspective: win (‘W’, green), draw (‘D’, grey), loss (‘L’, red). (B) Scalability of AlphaZero with thinking time, compared to Stockfish and Elmo. Stockfish and Elmo always receive full time (3 hours per game plus 15 seconds per move), time for AlphaZero is scaled down as indicated. (C) Extra evaluations of AlphaZero in chess against the most recent version of Stockfish at the time of writing, and against Stockfish with a strong opening book. Extra evaluations of AlphaZero in shogi were carried out against another strong shogi program Aperyqhapaq at full time controls and against Elmo under 2017 CSA world championship time controls (10 minutes per game plus 10 seconds per move). (D) Average result of chess matches starting from different opening positions: either common human positions, or the 2016 TCEC world championship opening positions. Average result of shogi matches starting from common human positions. CSA world championship games start from the initial board position.

Suppose your policy incorporates enough chess move sequences. When the program encounters a board position from one of these in a game, it checks the move recorded for that position: if that move won, select it; if it lost, select another valid move at random. If the program runs across a board position it’s never seen before, it chooses a valid move at random.
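That move selection rule can be sketched as follows. The board keys and move strings are placeholders, and this is only an illustration of the simple lookup rule described above, not how AlphaZero actually chooses moves:

```python
import random


def choose_move(policy, board, valid_moves, rng=random):
    """Pick a move using a {board: (move, reward)} lookup-table policy."""
    record = policy.get(board)
    if record is None:
        # Board position never seen before: pick any valid move.
        return rng.choice(valid_moves)

    move, reward = record
    if reward > 0 and move in valid_moves:
        # The recorded move won last time: play it again.
        return move

    # The recorded move lost (or drew): try a different valid move.
    others = [m for m in valid_moves if m != move]
    return rng.choice(others or valid_moves)


# Hypothetical policy entries keyed by an opaque board identifier.
policy = {"board_1": ("e4", +1), "board_2": ("f3", -1)}
move = choose_move(policy, "board_1", ["e4", "d4"])
```

Over many games, the random choices act as exploration while the lookups reuse what has already been found to win.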

Do this enough times and you can build a winning white-playing chess policy. Doing something similar for a black-playing program would build a winning black-playing chess policy.

The researchers at DeepMind explain their AlphaZero program which plays chess, shogi, and Go in another research article, A general reinforcement learning algorithm that masters chess, shogi and Go through self-play.

Reinforcement learning and AGI

So now what does all that have to do with creating AGI. The premise of the paper is that by using rewards and reinforcement learning, one could program a policy for any domain that one encounters in the world.

For example, using the above chart, we could construct reinforcement learning programs that mimic perception (object classification/detection) abilities, memory (image/verbal/emotional/?) abilities, motor control abilities, etc. Each subsystem could be trained to solve its own arena. And over time, if we built up enough of these subsystems, one could somehow construct an AGI system of subsystems that would match human levels of intelligence.

The paper’s main hypothesis is “(Reward is enough) Intelligence, and its associated abilities, can be understood as subserving the maximization of reward by an agent acting in its environment.”

Given where I am today, I agree with the hypothesis. But the crux of the problem is in the details. Yes, for a game of multiple players where a reward signal of some type can be computed, a reinforcement learning program can be crafted that plays better than any human. But this is only because one can create programs that can play that game and programs that understand whether the game is won or lost, and use all this to improve the game playing policy over time and game iterations.

Do rewards and reinforcement learning provide a roadmap to AGI?

To use reinforcement learning to achieve AGI implies that:

  • One can identify all the arenas required for (human) intelligence
  • One can compute a proper reward signal for each arena involved in (human) intelligence,
  • One can programmatically compute appropriate steps to take to solve that arena’s activity,
  • One can save a sequence of state-steps taken to solve that arena’s problem, and
  • One can run sequences of steps enough times to produce a good policy for that arena.

There are a number of potential difficulties in the above. For instance, what’s the state the program operates in?

Consider a human, which has 500K(?) pressure, pain, cold, & heat sensors throughout the exterior and interior of the body, two eyes, ears, & nostrils, one tongue, two balance sensors, tired, anxious, hunger, sadness, happiness, and pleasure signals, and 600 muscles actuating the position of five fingers/hand, toes/foot, two eyes, ears, feet, legs, hands, and arms, one head and torso. Such a “body state” becomes quite complex. Any state that records all this would be quite large. Ok it’s just data, just throw more storage at the problem – my kind of problem.

The compute power to create good policies for each subsystem would also be substantial and in the end determining the correct reward signal would be non-trivial for each and every subsystem. Yet, all it takes is money, time and effort and all this could be accomplished.

So, yes, given all the above creating an AGI, that matches human levels of intelligence, using reinforcement learning techniques and rewards is certainly possible. But given all the state information, action possibilities and reward signals inherent in a human interacting in the world today, any human level AGI, would seem unfeasible in the next year or so.

One item of interest: DeepMind researchers have recently created MuZero, which learns how to play Go, Chess, Shogi and Atari games without any pre-programmed knowledge of the games (that is, how to play the game, how to determine if the game is won or lost, etc.). It managed to come up with its own internal reward signal for each game and determined what the proper moves were for each game. This seems to combine a deep learning neural network with reinforcement learning techniques to craft a reward signal and valid move policies.

Alternatives to full AGI

But who says you need AGI for something that might be useful to us? Let’s say you just want to construct an intelligent oracle that understands all human generated knowledge and science and can answer any question posed to it, with the only response capabilities being audio, video, images and text.

Even an intelligent oracle such as the above would need an extremely large state. Such a state would include all human and machine generated information at some point in time. And any reward signal needed to generate a good oracle policy would need to be very sophisticated, as it would need to determine whether the oracle’s answer was good or not. And of course the steps to take to answer a query are uncountable: first there’s understanding the query, next searching out and examining every piece of information in the state space for relevance, and finally using all that information to answer the question.

I’m probably missing a few steps in the above, and it almost makes creating a human level AGI seem easier.

Perhaps the MuZero techniques might have an answer to some or all of the above.

~~~~

Yes, reinforcement learning is a valid roadmap to achieving AGI, but can it be done today – no. Tomorrow, perhaps.

Photo credit(s):