AI navigation goes with the flow

Read an article the other day (Engineers Teach AI to Navigate Ocean with Minimal Energy) about a simulated robot that was trained to navigate 2D turbulent water flow to travel between locations. They used a combination of reinforcement learning with a DNN-derived policy. The article was reporting on a Nature Communications open access paper (Learning efficient navigation in vortical flow fields).

The team was attempting to create an autonomous probe that could navigate the ocean and other large bodies of water to gather information. I believe ultimately the intent was to provide the navigational smarts for a submersible that could navigate terrestrial and non-terrestrial oceans.

One of the biggest challenges for probes like this is to navigate turbulent flow without needing a lot of propulsive power or a lot of computational power. They said that any probe that could propel itself faster than the current could easily travel wherever it wanted, but the real problem was getting somewhere with lower-powered submersibles. As a result, they set their probe to swim at a constant speed of 80% of the overall simulated water flow.

Even that would be relatively feasible if you had unlimited computational power to train and inference with, but doing it on something that could fit in a small submersible was a significant challenge. NLP models today have millions of parameters and take hours to train with multiple GPU/CPU cores in operation and lots of memory. Inferencing with these NLP models also takes a lot of processing power.

The researchers targeted something with significantly less computational power and wanted to train and perform real-time inferencing on the same hardware. They chose a “Teensy 4.0 micro-controller” board for their computational engine, which costs under $20, has ~2MB of flash memory and fits in a space smaller than 1.5″x1.0″ (38.1mm X 25.4mm).

The simulation setup

The team started their probe’s turbulent flow training with a cylinder in a constant flow that generated downstream vortices rotating in opposite directions. These vortices would travel from left to right in the simulated flow field. To force the navigation logic to traverse this vortical flow, they randomly selected start and end locations on different sides of it.

The AI model they trained and used for inferencing was a combination of reinforcement learning (with an interesting multi-factor reward signal) and a policy using a trained deep neural network. They called this approach Deep RL.

For reinforcement learning, they used a reward signal that was a function of three variables: the time it took, the change in distance to the target, and a success bonus if the probe reached the target. The time variable was a penalty and was the duration of the swim activity. Distance to target was how much the Euclidean distance between the current probe location and the target location had changed over time. The bonus was only applied when the probe was in close proximity to the target location. The researchers indicated the reward signal could be used to optimize for other values such as energy to complete the trip, surface area traversed, wear and tear on propellers, etc.
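To make the shape of that reward concrete, here is a minimal Python sketch of such a multi-factor reward signal. The weights, time-step handling and success radius are my own illustrative assumptions, not values from the paper:

```python
import numpy as np

def step_reward(prev_pos, cur_pos, target, dt,
                time_penalty=1.0, success_radius=0.1, success_bonus=10.0):
    """Sketch of a multi-factor reward: a time penalty per step, a term for
    progress made toward the target, and a bonus on arrival.
    The weights and success radius are illustrative, not the paper's."""
    prev_dist = np.linalg.norm(np.asarray(target) - np.asarray(prev_pos))
    cur_dist = np.linalg.norm(np.asarray(target) - np.asarray(cur_pos))
    reward = -time_penalty * dt        # penalize the duration of the swim
    reward += prev_dist - cur_dist     # reward any reduction in distance to target
    if cur_dist < success_radius:      # bonus only in close proximity to the target
        reward += success_bonus
    return reward
```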

For the reinforcement learning state information, they supplied the relative location of the probe and the target [Difference(Probe x,y, Target x,y)], and whatever sensor data was being tested (e.g., for the velocity-sensor-equipped probe, the local velocity of the water at the probe’s location).

They trained the DNN policy using the state information (probe start and end location, local velocity/vorticity sensor data) to predict the swim angle used to navigate to the target. The DNN policy used 2 internal layers with 64 nodes each.
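For concreteness, a policy network of that shape (two hidden layers of 64 nodes, mapping the relative target location plus local sensor reading to a swim angle) might be sketched in PyTorch as below. The exact input layout and the output scaling to an angle are my assumptions, not the paper’s implementation:

```python
import math
import torch
import torch.nn as nn

class SwimPolicy(nn.Module):
    """Sketch of a 2x64 policy network: state -> swim angle."""
    def __init__(self, state_dim=4):
        # state: (dx, dy) to target plus local (u, v) flow velocity -- my assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),     # squash output to [-1, 1]
        )

    def forward(self, state):
        return self.net(state) * math.pi     # scale to a swim angle in [-pi, pi]

# usage: angle = SwimPolicy()(torch.tensor([[0.5, -0.2, 0.1, 0.0]]))
```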

They benchmarked the Deep RL solution with local velocity sensing against a number of different approaches: a naive approach that always swam in the direction of the target, a flow-blind approach that had no sensors but used feedback from its location changes to train with, a vorticity sensor approach which sensed the vorticity of the local water flow, and a complete knowledge approach that had information on the actual flow at every location in the 2D simulation.
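For reference, the naive baseline is essentially a one-liner; a sketch of my own (not the paper’s code) would be:

```python
import math

def naive_swim_angle(probe_xy, target_xy):
    """Naive baseline: always point straight at the target, ignoring the flow."""
    dx = target_xy[0] - probe_xy[0]
    dy = target_xy[1] - probe_xy[1]
    return math.atan2(dy, dx)
```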

It turned out that of the first four (naive, flow-blind, vorticity sensor and velocity sensor) the velocity sensor configured robot had the highest success rate (“near 100%”).

That simulated probe was then measured against the complete flow knowledge version. The complete knowledge version had faster trip speeds, but only 18-39% faster (on the examples shown in the paper). However, the knowledge required to implement this algorithm would not be feasible in a real ocean probe.

More to be done

They tried the probe’s Deep RL navigation algorithm on a different simulated flow configuration, a double gyre flow field (sort of like 2 circular flows side by side but rotating in opposite directions).

The previously trained (on cylinder vortical flow) Deep RL navigation algorithm only had a ~4% success rate with the double gyre flow. However, after training the Deep RL navigation algorithm on the double gyre flow, it was able to achieve an 87% success rate.

So with sufficient re-training it appears that the simulated probe’s navigation Deep RL could handle different types of 2D water flow.

The next question is how well their Deep RL can handle real 3D water flows, such as tidal flows, up-down swells, long term currents, surface wind-wave effects, etc. It’s probable that any navigation for real world flows would need a multitude of Deep RL-trained algorithms to handle each and every flow encountered in real oceans.

However, the fact that training and inferencing could be done on the same small hardware indicates that the Deep RL could possibly be deployed in any flow: let it train on the local flow conditions until it succeeds, then let it loose until it starts failing again. Training each time would take a lot of propulsive power but may be suitable for some probes.

The researchers have 3D printed a submersible with a Teensy microcontroller and an Arduino controller board, with propellers surrounding it so it can swim in any 3D direction. They have also constructed a water tank for use in real-life testing of their Deep RL navigation algorithms.


DeepMind takes on poker & Scotland Yard

Read an article the other day (DeepMind makes bet on AI system that can play poker, chess, Go, and more) about a new DeepMind game playing program that used a new approach to taking on perfect and imperfect information games with the same algorithms.

As you may recall, DeepMind’s prior game playing programs, AlphaZero and MuZero, played the perfect information games chess, shogi, & Go and achieved top rankings in all of them. These were all based on reinforcement learning and advanced search.

Perfect information games have no hidden information, that is, all the information needed to play a game is visible to all players (see wikipedia Perfect information article). Imperfect information games have private or hidden information, only visible to one or a select set of players. In card playing, any card that’s not shown to all players would represent hidden information. The other difference in imperfect games is that players attempt to keep their private information hidden as long as possible.

The latest DeepMind paper (see: Player of Games Arxiv paper) discusses a new approach to automated game playing that works for both perfect and imperfect information games. DeepMind’s latest game playing program is called Player of Games (PoG).

As many may know, Texas hold’em is a form of poker where everyone is dealt two cards down and five cards are dealt up that everyone shares (see: Texas hold’em and Betting in poker articles on wikipedia). Betting happens after the two down cards are dealt, after the next 3 up cards (called the “flop”) are dealt, then after each of the remaining 2 up cards are dealt. Players select any of the (2 down and 5 up) cards to create the best 5 card poker hand. Betting is based on a blind (a sort of minimal bet). PoG plays as a single player, performing all the betting as well as card playing for Texas hold’em. No-limit betting means there’s no limit (maximum) on the amount of a bet in the game.

Scotland Yard is a board game where detectives chase down a criminal (Mr. X) on the run across the city of London (see wikipedia Scotland Yard (board game) article). Detectives each get 23 transportation tickets for taxis (11), buses (8), and underground trains (4). The game takes place on a board layout of London and starts with each detective and the criminal selecting a card with their hidden position on the board. The criminal gets (not quite, but almost) an unlimited number of transportation tickets plus 5 (in the USA) universal tickets (which can be used to take ferries as well as any other form of transport). Every time the criminal moves (except when using universal tickets), he reveals his form of transportation. And 5 times during the game the criminal also reveals his current location. The detective that finds the criminal wins.

I assume all my readers know how to play chess and Go (or at least understand them).

While MuZero and AlphaZero used reinforcement learning for training and sophisticated search for in game play, PoG needed to do something different due to the imperfect (or hidden) information present in the hold’em and Scotland Yard games.

How PoG is different

In imperfect information games, it’s important to hide private information. In poker, when I got a great hand, I raised my betting levels extensively. But this often caused my opponents to withdraw from betting unless they had a great hand as well. I sometimes think that if I were to bet more consistently, and only bet big at the last betting round when I have a good hand, I might win more $. No doubt that’s why I don’t play poker anymore.

Like AlphaZero and MuZero, PoG also uses reinforcement learning through self-play but adds something they call Counterfactual Regret (CFR) Minimization to their game trees.

In addition to normally selecting and computing a value (reward) for the optimal move as in reinforcement learning, PoG uses CFR minimization to compute values (rewards) for all moves not taken at every stage in a game, for each player. As such, PoG computes possible rewards for the optimal move at a stage (step, move) in a game plus the values for all the regret (counterfactual or other) moves for all players. CFR minimization attempts to minimize the regret move values and maximize the optimal move values at each move, for each player in a game.
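The regret-matching update at the heart of generic CFR can be sketched in a few lines of Python. This is the standard textbook form, not PoG’s deep, depth-limited variant:

```python
import numpy as np

def regret_matching_strategy(cumulative_regret):
    """Turn accumulated positive regrets into action probabilities."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(positive), 1.0 / len(positive))  # uniform if no positive regret

def update_regrets(cumulative_regret, action_values, strategy):
    """Regret per action = that action's value minus the value of the
    current mixed strategy; accumulated over training iterations."""
    strategy_value = np.dot(strategy, action_values)
    return cumulative_regret + (action_values - strategy_value)
```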

CFR minimization is used during training for a game in self-play as well as during actual game play to generate sub-trees from wherever the game happens to be. PoG uses a depth limited CFR minimization to generate game sub-trees during game play which helps to reduce the time it takes to determine the best move for all players. Read the ArXiv paper to learn more.

The challenge with this approach is that it will never be as good as pure reinforcement learning + advanced search for perfect games, such as chess and Go. For example, below we show Exploitability ratings for various PoG training levels for Leduc Poker and Scotland Yard. Exploitability is one way to measure how well the player is playing. Lower is better.

Perfect play (in an imperfect information game) would have an Exploitability of 0. The charts show that the more training done the better the game play by PoG for (Leduc) poker and Scotland Yard. (Leduc poker is a simplified poker game with 6 cards and limited betting).

On the other hand, for perfect games the results were ok, but not stellar. Stockfish is the current non-reinforcement learning, chess playing champion. GnuGo and Pachi are non-reinforcement learning, Go playing programs. In the tables below, they use a relative ranking based on a 0 baseline for chess (Stockfish with 1 thread and 100msec think time) and Go (GnuGo). Higher is better.

~~~~

So yes, PoG can do well in imperfect information games with decent training, and okay (but much, much better than I and probably the vast majority of humans) in perfect information games.

Why concern ourselves with imperfect games? The world is chock full of imperfect information games. They seem to occur everywhere: military strategy, sport play, finance, etc. In fact, perfect games are the exception in real world situations. Thus, any advance to play multiple imperfect information games better is yet another small step towards AGI.


For AGI, is reward enough – part 4

Last May, an article came out of DeepMind research titled Reward is enough. It was published in an artificial intelligence journal but PDFs of it are available free of charge.

The article points out that according to DeepMind researchers, using reinforcement learning and an appropriate reward signal is sufficient to attain AGI (artificial general intelligence). We have written about the perils and pitfalls of AGI before (see Existential event risks [-part-0], NVIDIA Triton GMI, a step to far [-part-1], The Myth of AGI [-part-2], and Towards a better AGI – part 3ish). (Sorry, I only started numbering them after part 3ish.)

My last post on AGI inclined towards the belief that AGI was not possible without combining deduction, induction and abduction (probabilistic reasoning) together and that any such AGI was a distant dream at best.

Then I read the Reward is Enough article and it implied that they saw a realistic roadmap towards achieving AGI based solely on reward signals and Reinforcement Learning (wikipedia article on Reinforcement Learning). Reading the article was disheartening at best. After the article came out, I made it a hobby to understand everything I could about Reinforcement Learning, to learn whether what they are talking about is feasible or not.

Reinforcement learning, explained

Let’s just say that the textbook, Reinforcement Learning, is not the easiest read I’ve seen. But I gave it a shot and although I’m nowhere near finished (lost somewhere in chapter 4), I’ve come away with a better appreciation of reinforcement learning.

The premise of Reinforcement Learning, as I understand it, is to construct a program that performs a sequence of steps based on the state or environment the program is working in, records that sequence, and tags or values that sequence with a reward signal (e.g., +1 for a good job, -1 for bad, etc.). Depending on whether the sequence of steps is finite (always ends) or infinite (never ends), the reward tagging could be cumulative (finite steps) or discounted (infinite steps).
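In code, the two flavors of reward tagging (the return) look roughly like this standard textbook computation, with gamma as the discount factor:

```python
def cumulative_return(rewards):
    """Finite (episodic) case: just sum the rewards of the sequence."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    """Continuing (infinite) case: weight each reward by gamma**k so that
    rewards far in the future eventually become negligible."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```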

The record of the program’s sequence of steps would include the state or the environment and the next step that was taken. This continues until the program completes the task or, if infinite, until the discounted reward signal becomes minuscule enough not to matter anymore.

Once you have a log or record of the state, the step taken in that state and the reward for that step, you have the makings of a policy used to take better steps. Over time, with sufficient state-step-reward sequences, one can build a policy that works very well for the problem at hand.

Reinforcement learning, a chess playing example

Let’s say you want to create a chess playing program using reinforcement learning. If a sequence of moves ends the game, you can tag each move in that sequence with a reward (say +1 for a win, 0 for a draw and -1 for a loss), perhaps discounted by the number of moves it took to win. The “sequence of steps” would include the game board and the move chosen by the program for that board position.

[Figure 2 from the AlphaZero paper, Comparison with specialized programs: tournament results of AlphaZero vs. Stockfish (chess), Elmo (shogi) and a previously published AlphaGo Zero (Go); AlphaZero’s scalability with thinking time compared to Stockfish and Elmo; extra evaluations against newer Stockfish versions, Aperyqhapaq and Elmo under various time controls; and average results from common human and TCEC/CSA world championship opening positions.]

If your policy incorporates enough chess move sequences and the program encounters one of those board positions in a game, then if the recorded move won, select that move; if it lost, select another valid move at random. If the program runs across a board position it’s never seen before, choose a valid move at random.

Do this enough times and you can build a winning white-playing chess policy. Doing something similar for a black-playing program would build a winning black-playing chess policy.
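A toy version of that record-and-lookup policy might look like the following sketch. The board encoding, the scoring and the random fallback are simplifications of my own, not how AlphaZero actually works:

```python
import random
from collections import defaultdict

class LookupPolicy:
    """Toy tabular policy: remember how each (position, move) pair scored,
    prefer remembered winning moves, otherwise pick a random legal move."""
    def __init__(self):
        self.value = defaultdict(float)   # (position, move) -> accumulated reward

    def record_game(self, moves, outcome):
        """moves: list of (position, move) pairs; outcome: +1 win, 0 draw, -1 loss."""
        for position, move in moves:
            self.value[(position, move)] += outcome

    def choose(self, position, legal_moves):
        scored = [(self.value.get((position, move), 0.0), move) for move in legal_moves]
        best_value, best_move = max(scored, key=lambda pair: pair[0])
        if best_value > 0:                 # a remembered winning move
            return best_move
        return random.choice(legal_moves)  # unseen or losing position: random valid move
```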

The researchers at DeepMind explain their AlphaZero program which plays chess, shogi, and Go in another research article, A general reinforcement learning algorithm that masters chess, shogi and Go through self-play.

Reinforcement learning and AGI

So now what does all that have to do with creating AGI? The premise of the paper is that by using rewards and reinforcement learning, one could program a policy for any domain that one encounters in the world.

For example, one could construct reinforcement learning programs that mimicked perception (object classification/detection) abilities, memory (image/verbal/emotional/?) abilities, motor control abilities, etc. Each subsystem could be trained to solve the arena it was needed for. And over time, if we built up enough of these subsystems, one could somehow construct an AGI system of subsystems that would match human levels of intelligence.

The paper’s main hypothesis is “(Reward is enough) Intelligence, and its associated abilities, can be understood as subserving the maximization of reward by an agent acting in its environment.”

Given where I am today, I agree with the hypothesis. But the crux of the problem is in the details. Yes, for a multi-player game where a reward signal of some type can be computed, a reinforcement learning program can be crafted that plays better than any human. But this is only because one can create programs that can play that game, create programs that understand whether the game is won or lost, and use all this to improve the game-playing policy over time and game iterations.

Do rewards and reinforcement learning provide a roadmap to AGI

To use reinforcement learning to achieve AGI implies that

  • One can identify all the arenas required for (human) intelligence
  • One can compute a proper reward signal for each arena involved in (human) intelligence,
  • One can programmatically compute appropriate steps to take to solve that arena’s activity,
  • One can save a sequence of state-steps taken to solve that arena’s problem, and
  • One can run sequences of steps enough times to produce a good policy for that arena.

There are a number of potential difficulties in the above. For instance, what’s the state the program operates in?

Consider a human, which has 500K(?) pressure, pain, cold, & heat sensors throughout the exterior and interior of the body; two eyes, ears, & nostrils; one tongue; two balance sensors; tired, anxious, hunger, sadness, happiness, and pleasure signals; and 600 muscles actuating the position of five fingers/hand and toes/foot, two eyes, ears, feet, legs, hands, and arms, and one head and torso. Such a “body state” becomes quite complex, and any state that records all this would be quite large. Ok, it’s just data, just throw more storage at the problem – my kind of problem.

The compute power to create good policies for each subsystem would also be substantial and in the end determining the correct reward signal would be non-trivial for each and every subsystem. Yet, all it takes is money, time and effort and all this could be accomplished.

So yes, given all the above, creating an AGI that matches human levels of intelligence using reinforcement learning techniques and rewards is certainly possible. But given all the state information, action possibilities and reward signals inherent in a human interacting in the world today, any human-level AGI would seem infeasible in the next year or so.

One item of interest: DeepMind researchers recently created MuZero, which learns how to play Go, Chess, Shogi and Atari games without any pre-programmed knowledge of the games (that is, how to play the game, how to determine if the game is won or lost, etc.). It managed to come up with its own internal reward signal for each game and determined what the proper moves were for each game. This seemed to combine a deep learning neural network together with reinforcement learning techniques to craft a reward signal and valid move policies.

Alternatives to full AGI

But who says you need AGI for something that might be useful to us? Let’s say you just want to construct an intelligent oracle that understood all human generated knowledge and science and could answer any question posed to it, with the only response capabilities being audio, video, images and text.

Even an intelligent oracle such as the above would need an extremely large state. Such a state would include all human and machine generated information at some point in time. And any reward signal needed to generate a good oracle policy would need to be very sophisticated: it would need to determine whether the oracle’s answer was good or not. And of course the steps to take to answer a query are uncountable: first there’s understanding the query, next searching out and examining every piece of information in the state space for relevance, and finally using all that information to answer the question.

I’m probably missing a few steps in the above, and it almost makes creating a human level AGI seem easier.

Perhaps the MuZero techniques might have an answer to some or all of the above.

~~~~

Yes, reinforcement learning is a valid roadmap to achieving AGI, but can it be done today – no. Tomorrow, perhaps.
