Essentially, they have trained two neural networks (NNs) at the same time and computed an adaptation coefficient matrix (with linear multipliers to compensate for wind speed). The first NN is trained to learn the wind-invariant flight characteristics of a drone, and the second is trained to predict the class of wind the drone is flying in (or wind index). These two, plus the adaptation control matrix coefficients, are used to predict the resultant force on the drone flying in wind.
The data to train the two NNs and compute the adaptation matrix coefficients come from Caltech wind tunnel results with their custom-built drone (essentially an RPi4 added to a pretty standard drone) flying random trajectories under different static wind conditions.
The two NNs and the adaptation control matrix functionality run on a Raspberry Pi 4 (RPi4) that's added to a drone they custom built as the test vehicle. The two NNs and the adaptation control tracking are used in the P-I-D (proportional-integral-derivative) controller for drone path prediction. The two Neural-Fly NNs plus the adaptation functionality effectively replace the residual force prediction portion of the integral section of the P-I-D controller.
The wind-invariant neural net has 5 layers with relatively few parameters per layer. The wind class prediction neural network has 3 layers and even fewer parameters. Together, these two NNs plus the adaptation coefficients provide real-time resultant force predictions on the drone, which can be fed into the drone controller to control drone flight. All of this is accomplished, in real time, on an RPi4.
The adaptation factor matrix is also learned during the training of the two NNs. And this is what's used in the NF-Constant results below. But the key is that the linear factors (adaptation matrix) are updated (periodically) during actual drone flight by sampling the measured and predicted forces on the drone. The adaptation matrix coefficients are updated using a least-squares estimation fit.
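To make that adaptation step a bit more concrete, here is a minimal numpy sketch of the kind of least-squares refit described above. The variable names and the batch-update form are our own illustration of the idea, not the team's actual implementation.

```python
import numpy as np

# Hypothetical illustration of the in-flight adaptation update.
# phi:        wind-invariant NN outputs ("basis" features) for recent samples, shape (n_samples, n_features)
# f_measured: measured residual forces for those samples, shape (n_samples, 3)
# The adaptation matrix A maps features to residual force: f_hat = phi @ A

def update_adaptation_matrix(phi, f_measured):
    """Refit the linear adaptation coefficients with a least-squares estimate."""
    A, _, _, _ = np.linalg.lstsq(phi, f_measured, rcond=None)
    return A  # shape (n_features, 3)

# During flight: periodically collect (phi, measured force) pairs, refresh A,
# then predict the residual force for the controller as f_hat = phi_now @ A.
```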
In the report's supplemental information, the team compared a couple of state-of-the-art adaptation approaches to the problem of drone flight in wind. In the above chart, the upper section is the x-axis wind effect and the lower portion is the z-axis wind effect; f (grey) is the actual force in that direction and f-hat (red) is the predicted force. The first column represents results from a normal integral controller. The next two columns are state-of-the-art approaches (INDI and L1, see paper references) to force prediction using adaptive control. If you look closely at these two columns and compare the force prediction (in red) with the actual force (in grey), the force prediction always lags behind the actual force.
The next three columns show Neural-Fly-Constant (Neural-Fly with a constant adaptive control matrix, not updated during flight), Neural-Fly-Transfer (using the NNs trained on one drone and applying their adaptation to another drone in flight) and Neural-Fly. Neural-Fly-Constant also shows a lag between the predicted force and the actual force, but Neural-Fly-Transfer and Neural-Fly reduce this lag considerably.
The measurement for drone flight accuracy is tracking positional error, that is, the difference between the drone's desired position and its actual position over a number of trajectories. As shown in the chart, tracking error decreased from 5.6cm to ~4cm at a wind speed of 4.2m/s (15.1km/h or ~9.4mph). Tracking error increases for wind speeds that were not used in training and for NF-Transfer, but at all wind speeds the tracking error is better with Neural-Fly than with any other approach.
Pretty impressive results from just using an RPi4.
[The Eds. would like to thank the Caltech team and especially Mike O'Connell for kindly answering our many questions on Neural-Fly.]
For AI/ML DNNs, we often witness supposedly well-trained DNN models that do very well classifying examples of data similar to their training data but fail miserably on data that's outside their training data.
Mathematicians call this attribute robustness and can measure it for a mapping function using a Lipschitz constant. One can consider this a measure of the variability of a mapping from one set to another; in the case of DNNs, lack of robustness means classifications fail on relatively minor changes to the input data.
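A rough way to see this empirically (our own toy sketch, not from the paper): sample small perturbations of an input and look at how much the model's output moves relative to how much the input moved. The larger that ratio, the less robust the mapping.

```python
import numpy as np

def empirical_lipschitz(f, x, eps=1e-2, n_trials=1000, rng=None):
    """Crude lower bound on the Lipschitz constant of f around input x:
    max over random small perturbations of ||f(x+d) - f(x)|| / ||d||."""
    rng = rng or np.random.default_rng(0)
    fx = f(x)
    worst = 0.0
    for _ in range(n_trials):
        d = rng.normal(scale=eps, size=x.shape)
        ratio = np.linalg.norm(f(x + d) - fx) / np.linalg.norm(d)
        worst = max(worst, ratio)
    return worst

# A large value means tiny input changes can swing the output wildly -- the lack
# of robustness that shows up as misclassifying slightly perturbed images.
```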
Most serious AI researchers have empirically discovered that bigger DNNs work better and are more robust than smaller networks. There’s been somewhat of a conundrum as to why DNNs need to get bigger to properly generalize.
Universal Law of Robustness
What the researchers have proved is that in order to achieve some arbitrary level of robustness for a mapping function like a DNN, one needs many more parameters than the number of training data elements would indicate.
For example, with the MNIST handwritten digit classification problem, models with 10**5 to 10**6 parameters are required to achieve 90% and 95% accuracy, respectively. But the MNIST training data is 60K examples (~10**4). Why should an MNIST DNN classification model need more than 10**4 parameters to achieve 100% accuracy?
From what we all learned in high school maths, to solve a function with N variables one needs N equations. This would lead one to believe that MNIST DNNs (essentially solving classification equations) should only need 60K or 10**4 parameters. But real DNNs to solve MNIST need more than that.
Looking at it in 2D: if one has a point A (x,y) that maps to another point B (x,y), one should only need to know one of the points and the slope of the line that connects them, or two parameters: point A (or B) and the line's slope.
Now with MNIST data that maps handwritten digits to one of 10 digits, we have essentially 10 possibilities being mapped from 60K samples. At best, we should need to know the 60K initial points in this image data space and their slope to the 10 digits they represent. Again, something approaching 60K pairs of parameters: one for the image point and one for the slope. But why doesn't an MNIST model with 60K parameters achieve 100% accuracy?
I won't claim to understand the math, but what the researchers seem to be saying is that in order to have a relatively smooth mapping from the image space to the digit space one needs 10**4 parameters times the dimensionality of the data. In this case, for MNIST, the dimensionality of the data is related to the image size: 28x28, 0..255 grey-scale pixel images. The image space alone would be on the order of 10**5. So multiplying this by the size of the training data, the researchers estimate that the number of parameters should be ~10**9 to be 100% accurate.
However, the researchers say that the data dimensionality of the MNIST images is probably not 10**5 (how they concluded this is not evident). As such, they believe one shouldn't need 10**9 parameters to reach 100% proper classification. They say it's probably 1 or 2 orders of magnitude less, because not all of the image data space is populated. So if we use 10**3 as an estimate of the effective data dimensionality, they would estimate that one would need a 10**7 parameter DNN to reach 100% accuracy on MNIST data.
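As a quick back-of-the-envelope check of the arithmetic above (our reading of the argument, not the paper's exact calculation), a minimal Python sketch:

```python
# Universal-law-style estimate: parameters ~ (training examples) x (effective data dimensionality)
n_train = 60_000                  # MNIST training examples (~10**4)

d_high = 100_000                  # if the effective data dimensionality were ~10**5
d_low = 1_000                     # the ~10**3 "not all of image space is populated" guess

print(f"high estimate: ~{n_train * d_high:.0e} parameters")   # order 10**9
print(f"low  estimate: ~{n_train * d_low:.0e} parameters")     # order 10**7
```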
The author’s MNIST model achieved a 99.2% accuracy after training for 15 Epochs, batch size=5. Although 688K parameters is not quite 10**6 parameters, it’s close. Unclear why one would need another factor of 10, but getting that extra 0.8% accuracy (to 100%) can be very difficult to achieve for any DNN model.
Another example, OpenAI’s GPT-3 NLP model
And OpenAI's GPT-3 NLP model has 175B parameters. Their previous version, GPT-2, only had 1.5B parameters, and they say that GPT-4 will have over 100T parameters. The chart above shows accuracy stats for 3 versions of the GPT-3 model: one with 175B, one with 13B and another with 1.3B parameters.
According to OpenAI's GPT-3 description, it can complete "almost any English language task" (text in ==> text out). This includes writing articles from a few prompts and text summarization.
GPT-3 was trained on almost 500B tokens (from web crawls to Wikipedia dumps). Each token probably represents an English word or word phrase. According to the universal law, 175B parameters would not be sufficient, which is probably why GPT-3 in the above chart didn't reach 70% accuracy.
It would probably need at least another 3 orders of magnitude to get there, or 175T parameters. Maybe with GPT-4, I can have it start writing my blog posts.
I don’t know about you, but I’m going to need more GPUs for my (home) AI lab.
1st, let me express my and my fellow coders'/programmers' disappointment that DeepMind would take on coding. There are many other white-collar work domains that need to be conquered before coding.
2nd, let me apologize for the lack of blog posts lately, all I can say is, business is picking up.
Saw an article over the last couple of weeks on DeepMind creating AlphaCode, an artificial intelligence coding application, which they used to enter coding contests and achieved an average 1238 rating, or better than 54% of code contest participants.
AlphaCode is a transformer-based language model (see: Wikipedia: Transformer (machine learning model) article) that translates a code competition problem statement into code, i.e., a program that, when executed, solves the problem statement. In order to train AlphaCode, DeepMind first needed to obtain lots of source code.
It’s all about the (training) data
The first step in Deep Learning model generation is gathering data to train the model. Now where would Google's DeepMind go to gather coding data – well GitHub, a public repository of all things software, of course.
They used GitHub data to pre-train their model(s) but also scraped code (problem statements & test cases) from published code contests to fine-tune their model.
There are a couple of problems with using GitHub source code for training:
GitHub code is written in whatever source code language the author felt most appropriate to use.
GitHub code is not guaranteed to work correctly.
GitHub code is not guaranteed to be completed code.
GitHub code represents a wide range of coding skill.
GitHub code doesn’t always come with a problem statement.
But the use of GitHub in their pre-training data set is intended to give their transformer-based language model some capability to understand (learn) what coding is all about, what a proper syntax would be, what a proper coding sequence would be, etc.
Their final pre-training dataset was 715GB of data over 86 million source files.
Although unstated, we would guess that the AlphaCode team used the GitHub repo's README.md file as a surrogate for the solution description. It's unclear what else could have been used, unless they generated a description automatically by extracting semantic content from, or summarizing, the README.md files.
The (pre-)training data can be used to train a transformer-based language model. These models are used today to provide language translation. In AlphaCode's case, they wanted to create a code transformer-based model that translates a specification of a coding problem into source code that solves that problem.
For language translation models, they use text files in different languages that represent the same law or information, and which, notably, are human-generated translations.
One challenge with using internet-scraped data for training is that it can easily contain actual, verbatim solutions for the problems the model is trying to solve. In order to avoid copying these solutions entirely, they decided to split their data into a training set, validation set and test set on a time basis. This way, the training data used source code/problem statements only from a period of time prior to the validation set. Ditto for the training+validation data with respect to the test data.
To show that this approach (using a time point to split the data) worked, they trained a 1B parameter AlphaCode transformer on two different training-validation datasets: one where the validation data was selected at random (the normal approach to selecting validation data), the "random" split, and the other where the validation data occurred only after the training data in time, the "temporal" split. The 1B AlphaCode transformer was able to properly code 0.8% of the problems using a 13K sample of the 86M source files/problem statements on the random split, but only 0% on the temporal split.
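A minimal sketch of the two splitting strategies (illustrative only; the record structure and field names are made up):

```python
import random

def random_split(records, frac=0.9, seed=0):
    """Standard approach: shuffle, then split."""
    recs = records[:]
    random.Random(seed).shuffle(recs)
    cut = int(len(recs) * frac)
    return recs[:cut], recs[cut:]

def temporal_split(records, cutoff_date):
    """AlphaCode-style approach: everything before the cutoff is training,
    everything after is validation, so solutions published alongside
    validation problems cannot leak into training."""
    train = [r for r in records if r["date"] < cutoff_date]
    valid = [r for r in records if r["date"] >= cutoff_date]
    return train, valid
```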
So much for pre-training, let's discuss fine-tuning
AlphaCode was going to get nowhere with a 0% solve rate (ok, this was based on a 13K sample and only a 1B parameter model), but they realized that GitHub code was only going to get them so far (ok, conjecture on my part).
So fine-tuning beyond the (GitHub-derived) pre-training data was needed, and the AlphaCode team turned to code competition source code/problem statement data.
Most code contests publish source code submissions as well as the problem statements and sample test cases. By scraping these, DeepMind was able to attain a very well annotated dataset they could use to fine-tune their AlphaCode transformer model.
They again used a temporal split for training/validation/test data. But they were also able to add metadata to their data that indicated whether the code solved the problem statement.
Code competitions also publish tests for the problem statement; a human can use these to validate whether their code at least passes the published tests. Code contests also have a set of more sophisticated, hidden tests that they use internally to validate code submissions.
This test data will become important later on in the model's operation, which will be discussed in a future post, but suffice it to say that AlphaCode uses the public tests (and mutations of these) to validate AlphaCode-generated source code before submitting it.
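A hedged sketch of how public tests could be used to screen generated programs before submission. The function names are hypothetical, the sketch assumes Python submissions, and the actual AlphaCode filtering pipeline is considerably more involved:

```python
import subprocess

def passes_public_tests(source_path, tests, timeout=5):
    """Run a candidate program against each (stdin, expected stdout) pair and
    keep it only if every test passes."""
    for test_input, expected in tests:
        try:
            result = subprocess.run(
                ["python3", source_path], input=test_input, text=True,
                capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        if result.stdout.strip() != expected.strip():
            return False
    return True

# candidates = [c for c in generated_sources if passes_public_tests(c, public_tests)]
```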
This fine-tuning dataset is available in the GitHub repo (linked to above) that DeepMind has created/curated for others to work with.
Another nicety of this fine-tuning data is they have proper, human created, problem statements to work from rather than README.md surrogates.
In part-2 we plan to describe the transformer-based model that was created for AlphaCode and at some point, discuss how they used testing in their code submissions.
Once again, all my information comes from DeepMind's pre-print on their AlphaCode project (linked to above).
Any comments, please don’t hesitate to let me know.
The team was attempting to create an autonomous probe that could navigate the ocean and other large bodies of water to gather information. I believe ultimately the intent was to provide the navigational smarts for a submersible that could navigate terrestrial and non-terrestrial oceans.
One of the biggest challenges for probes like this is to be able to navigate turbulent flow without needing a lot of propulsive power or a lot of computational power. They said that any probe that could propel itself faster than the current could easily travel wherever it wanted; the real problem was to get somewhere with a lower-powered submersible. As a result, they set their probe to swim at a constant speed of 80% of the overall simulated water flow.
Even that would be relatively feasible with unlimited computational power for training and inferencing, but trying to do this on something that could fit in a small submersible was a significant challenge. NLP models today have millions of parameters and take hours to train with multiple GPU/CPU cores in operation and lots of memory. Inferencing with these NLP models also takes a lot of processing power.
The researchers targeted something with significantly less computational power and wished to train and perform real-time inferencing on the same hardware. They chose a Teensy 4.0 microcontroller board for their computational engine, which costs under $20, has ~2MB of flash memory, and fits in a space smaller than 1.5″x1.0″ (38.1mm x 25.4mm).
The simulation setup
The team started the probe's turbulent-flow training with a cylinder in a constant flow that generated downstream vortices rotating in opposite directions. These vortices would travel from left to right in the simulated flow field. In order for the navigation logic to traverse this vortical flow, they randomly selected start and end locations on different sides.
The AI model they trained and used for inferencing was a combination of reinforcement learning (with an interesting multi-factor reward signal) and a policy using a trained deep neural network. They called this approach Deep RL.
For reinforcement learning, they used a reward signal that was a function of three variables: the time it took, the change in distance to the target, and a success bonus if the probe reached the target. The time variable was a penalty equal to the duration of the swim activity. Distance to target measured how much the Euclidean distance between the current probe location and the target location had changed over time. The bonus was only applied when the probe was in close proximity to the target location. The researchers indicated the reward signal could be used to optimize for other values such as energy to complete the trip, surface area traversed, wear and tear on propellers, etc.
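Here is a rough sketch of a reward signal with those three terms. The coefficients, the bonus radius and the per-step form are our own placeholders, not the paper's actual values:

```python
def step_reward(prev_dist, curr_dist, dt, bonus_radius=1.0,
                time_penalty=0.1, success_bonus=10.0):
    """Reward = progress toward the target, minus a time penalty,
    plus a bonus once the probe gets close enough to the target."""
    r = prev_dist - curr_dist        # positive when the probe closes on the target
    r -= time_penalty * dt           # every second of swimming costs something
    if curr_dist < bonus_radius:
        r += success_bonus           # success bonus near the target
    return r
```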
For the reinforcement learning state information, they supplied the probe-to-target relative location [Difference(Probe x,y, Target x,y)] and whatever sensor data was being tested (e.g., for the velocity-sensor-equipped probe, the local velocity of the water at the probe's location).
They trained the DNN policy using the state information (probe start and end location, local velocity/vorticity sensor data) to predict the swim angle used to navigate to the target. The DNN policy used 2 internal layers with 64 nodes each.
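A minimal PyTorch sketch of that policy network. The two hidden layers of 64 units match the description above; the input/output sizes (relative target position plus a 2D velocity reading in, one swim angle out) and the activations are our assumptions:

```python
import torch
import torch.nn as nn

class SwimPolicy(nn.Module):
    """Two hidden layers of 64 units each; exact I/O shapes and activations are assumed."""
    def __init__(self, state_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),        # swim angle (radians)
        )

    def forward(self, state):
        return self.net(state)

# state = [dx_to_target, dy_to_target, local_vx, local_vy]
policy = SwimPolicy()
angle = policy(torch.tensor([[1.0, -0.5, 0.2, 0.1]]))
```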
They benchmarked the Deep RL solution with local velocity sensing against a number of different approaches: one naive approach that always swam in the direction of the target, one flow-blind approach that had no sensors but trained using feedback from its location changes, one vorticity-sensor approach which sensed the vorticity of the local water flow, and one complete-knowledge approach (not shown above) that had information on the actual flow at every location in the 2D simulation.
It turned out that of the first four (naive, flow-blind, vorticity sensor and velocity sensor) the velocity sensor configured robot had the highest success rate (“near 100%”).
That simulated probe was then measured against the complete flow knowledge version. The complete knowledge version had faster trip speeds, but only 18-39% faster (on the examples shown in the paper). However, the knowledge required to implement this algorithm would not be feasible in a real ocean probe.
More to be done
They tried the probe's Deep RL navigation algorithm on a different simulated flow configuration, a double-gyre flow field (sort of like 2 circular flows side by side but rotating in opposite directions).
The previously trained (on cylinder vortical flow) Deep RL navigation algorithm only had a ~4% success rate with the double gyre flow. However, after training the Deep RL navigation algorithm on the double gyre flow, it was able to achieve an 87% success rate.
So with sufficient re-training it appears that the simulated probe’s navigation Deep RL could handle different types of 2D water flow.
The next question is how well their Deep RL can handle real 3D water flows, such as tidal flows, up-down swells, long-term currents, surface wind-wave effects, etc. It's probable that any navigation for real-world flows would need a multitude of Deep RL trained algorithms to handle each and every flow encountered in real oceans.
However, the fact that training and inferencing could be done on the same small hardware indicates that the Deep RL could possibly be deployed in any flow, let it train on the local flow conditions until success is reached and then let it loose, until it starts failing again. Training each time would take a lot of propulsive power but may be suitable for some probes.
The researchers have 3D printed a submersible with a Teensy microcontroller and an Arduino controller board, with propellers surrounding it so it can swim in any 3D direction. They have also constructed a water tank for use in real-life testing of their Deep RL navigation algorithms.
As you may recall, DeepMind's prior game-playing programs, AlphaZero and MuZero, played the perfect information games chess, shogi, & Go and achieved top rankings in all of them. These were all based on reinforcement learning and advanced search.
Perfect information games have no hidden information, that is, all the information needed to play a game is visible to all players (see the Wikipedia Perfect information article). Imperfect information games have private or hidden information, visible only to one or a select set of players. In card playing, any card that's not shown to all players represents hidden information. The other difference in imperfect information games is that players attempt to keep their private information hidden as long as possible.
The latest DeepMind paper (see: Player of Games Arxiv paper) discusses a new approach to automated game playing that works for both perfect and imperfect information games. DeepMind’s latest game playing program is called Player of Games (PoG).
As many may know, Texas hold’em is a form of poker where everyone is dealt two cards down and five cards are dealt up, that everyone shares (see: Texas hold’em and Betting in poker articles on wikipedia). Betting happens after the two down cards are dealt, after the next 3 up cards (called the “flop”) are dealt, then after each of the remaining 2 up cards are dealt. Players select any of the (2 down and 5 up) cards to create the best 5 card poker hand. Betting is based on a blind (sort of minimal bet). PoG plays as a single player, performing all the betting as well as card playing for Texas hold’em. No limit betting says there’s no limit (maximum) to the amount of a bet in the game.
Scotland Yard is a board game where detectives chase down a criminal (Mr. X) on the run across the city of London (see the Wikipedia Scotland Yard (board game) article). Detectives each get 23 transportation tickets for taxis (11), buses (8), and underground trains (4). The game takes place on a board layout of London and starts with each detective and the criminal selecting a card with their hidden position on the board. The criminal gets (not quite, but almost) an unlimited amount of transportation tickets plus 5 (in the USA version) universal tickets (which can be used to take ferries as well as any other form of transport). Every time the criminal moves (except when using universal tickets), he reveals his form of transportation. And 5 times during the game the criminal also reveals his current location. The detective that finds the criminal wins.
I assume all my readers know how to play chess and Go (or at least understand them).
While MuZero and AlphaZero used reinforcement learning for training and sophisticated search for in-game play, PoG needed to do something different due to the imperfect (or hidden) information present in hold'em and Scotland Yard.
How PoG is different
In imperfect information games, it's important to hide private information. In poker, when I got a great hand I would raise my betting extensively, but this often caused my opponents to withdraw from betting unless they had a great hand as well. I sometimes think that if I were to bet more consistently, and only bet big in the last betting round when I have a good hand, I might win more $. No doubt that's why I don't play poker anymore.
Like AlphaZero and MuZero, PoG also uses reinforcement learning through self game play but adds something they call Counterfactual Regret (CFR) Minimization to their game trees.
In addition to normally selecting and computing a value (reward) for the optimal move, as in reinforcement learning, PoG uses CFR minimization to compute values (rewards) for all moves not taken at every stage in a game, for each player. As such, PoG computes possible rewards for the optimal move at a stage (step, move) in a game plus the values for all the regret (counterfactual or other) moves for all players. CFR minimization attempts to minimize the regret values and maximize the optimal move values at each move, for each player in a game.
CFR minimization is used during training for a game in self-play as well as during actual game play to generate sub-trees from wherever the game happens to be. PoG uses a depth limited CFR minimization to generate game sub-trees during game play which helps to reduce the time it takes to determine the best move for all players. Read the ArXiv paper to learn more.
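The core "regret" idea can be illustrated with the standard regret-matching update (a textbook sketch, not PoG's actual depth-limited CFR implementation): actions whose counterfactual value exceeded what the current strategy earned accumulate positive regret, and the next strategy puts more probability on them.

```python
import numpy as np

def regret_matching(action_values, regrets):
    """One regret-matching step at a single decision point.
    action_values: counterfactual value of each available action
    regrets:       running regret sums for each action (updated in place)"""
    positive = np.maximum(regrets, 0)
    if positive.sum() > 0:
        strategy = positive / positive.sum()
    else:
        strategy = np.full(len(regrets), 1.0 / len(regrets))   # uniform fallback
    expected = np.dot(strategy, action_values)   # value of playing the current strategy
    regrets += action_values - expected          # regret for not having played each action
    return strategy
```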
The challenge with this approach is that it will never be as good as pure reinforcement learning + advanced search for perfect information games, such as chess and Go. For example, below we show exploitability ratings for various PoG training levels for Leduc poker and Scotland Yard. Exploitability is one way to measure how well a player is playing; lower is better.
Perfect play (in an imperfect information game) would have an Exploitability of 0. The charts show that the more training done the better the game play by PoG for (Leduc) poker and Scotland Yard. (Leduc poker is a simplified poker game with 6 cards and limited betting).
On the other hand, for perfect information games the results were ok, but not stellar. Stockfish is the current non-reinforcement-learning chess playing champion. GnuGo and Pachi are non-reinforcement-learning Go playing programs. In the tables below, they use a relative ranking based on a 0 baseline for chess (Stockfish with 1 thread and 100msec think time) and Go (GnuGo). Higher is better.
So, yes, PoG can do well in imperfect information games with decent training, and ok (but much, much better than I and probably the vast majority of humans) in perfect information games.
Why concern ourselves with imperfect information games? The world is chock full of them. They seem to occur everywhere: military strategy, sports play, finance, etc. In fact, perfect information games are the exception in real-world situations. Thus, any advance in playing multiple imperfect information games better is yet another small step towards AGI.
How realistic are the workloads used to evaluate the systems being measured?
How accurate are the metrics used to rank and judge benchmark submissions?
How costly/complex is it to run a benchmark?
How are submissions audited and are they reproducible?
Where are benchmark results reported and are they public?
And of course robotics brings in its own issues that make benchmarking more difficult:
What sensors does the robot have to understand how to complete tasks?
What manipulators does the robot have to perform the tasks required of it?
Do the robots move in the environment and if so, how do the robots move?
Does the robot perform the task in the real world or in a simulated environment?
And of course, when using a simulated environment, how realistic is it?
BEHAVIOR with iGibson (see below) seems to answer many of these concerns for in-home robot benchmarking.
What is BEHAVIOR?
First, BEHAVIOR’s home making tasks were selected from an American Time Use Survey maintained by the USA Bureau of Labor Statistics which identifies tasks Americans perform in their homes. With BEHAVIOR 1.0 there are 100 tasks ranging from building a fruit basket to cleaning a toilet, and just about everything in between. I didn’t see any cooking or mixing drinks tasks but maybe those will be added.
Second, BEHAVIOR uses a predicate logic, called BDDL (BEHAVIOR Domain Definition Language), to define the initial conditions for a task, such as the tables, chairs, books, etc. located in the room and where objects need to be placed, and the successful completion goals, or what task completion should look like.
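For flavor, here's a hypothetical task definition sketched as Python data rather than actual BDDL syntax. The predicate and object names are ours, but the shape of it (objects, initial conditions, goal conditions) mirrors what a BDDL definition expresses:

```python
# Illustrative only -- not real BDDL. A task is an initial state plus goal predicates.
clean_dishes_task = {
    "objects": {"dish": 4, "sink": 1, "drying_rack": 1},
    "initial": [
        ("ontop", "dish_1", "countertop"),
        ("ontop", "dish_2", "countertop"),
        ("not", ("clean", "dish_1")),
        ("not", ("clean", "dish_2")),
    ],
    "goal": [
        # every dish should end up clean and on the drying rack
        ("forall", "dish", ("and", ("clean", "dish"), ("ontop", "dish", "drying_rack"))),
    ],
}
```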
BEHAVIOR uses 15 different rooms or scenes in their benchmark, such as a kitchen, garage, study, etc. Each of the 100 tasks are performed in a specific room.
BEHAVIOR incorporates 1217 different objects in 391 categories. Once initial conditions are defined for a task, BEHAVIOR essentially randomly selects different objects for the task and randomly locates them throughout the room.
In order to run the benchmark, one could conceivably create a real room with all the objects placed according to BEHAVIOR BDDL's randomly assigned locations, put a robot physically in the room, and have it perform the assigned task, OR one could use a simulation engine and have the robot run the task in the simulated environment, with a simulated room, objects and robot.
A robot operating within iGibson is provided a 3D rendering of the room and objects as images or LIDAR sensor scans. It can then identify the objects that it needs to manipulate to perform its tasks. One can define the robot's simulated sensors and manipulators in iGibson 2.0, which is written in Python, is open source (GitHub Repo) and can be installed to run on (Ubuntu 16.04) Linux, Windows (10) or Mac (10.15) systems.
Finally, BEHAVIOR uses a set of metrics to determine how well a robot has performed its assigned task. The first metric is a success score, defined as the fraction of goal conditions satisfied by the robot performing the task, such as the number of dishes properly cleaned and placed in the drying rack divided by the total number of dishes for a "washing dishes" task. The second is a set of efficiency metrics, like time to complete a task, sum total of object distance moved during the task, how well objects are arranged at task completion (is the toilet seat down…), etc.
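The success score metric is straightforward to express; here's a sketch under the definition given above (the helper names are ours):

```python
def success_score(goal_conditions, is_satisfied):
    """Fraction of goal conditions the robot satisfied, e.g. dishes actually
    cleaned and racked divided by total dishes for a 'washing dishes' task."""
    met = sum(1 for cond in goal_conditions if is_satisfied(cond))
    return met / len(goal_conditions)
```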
Another feature of iGibson 2.0 is that it offers the ability to record a human (in VR) doing a task in its simulated environment. So if your robotic system is able to learn by example, then iGibson could be used to provide training data for an activity.
A couple of additions to the BEHAVIOR benchmark/iGibson simulation environment that I would like to see:
There ought to be a way to construct a house/apartment where multiple rooms are arranged in a hierarchy, i.e., rooms associated with floors, with connections between them using hallways, doors, stairs, etc. This way one could conceivably define a set of homes/apartments (let's say 5) that a robot would perform its tasks in.
They need a task list to drive robot activities. Assume that there's some amount of time, let's say 8-12 hours, that a robot is active, and construct a series of tasks that need to be accomplished during that period.
Robots should be placed in the rooms/apartments/homes at random with random orientation and then they would have to navigate through rooms/passageways to the rooms to perform the tasks.
They need to add pet/human avatars in the rooms throughout a home. These would represent real time obstacles to task completion/navigation as well as add more tasks associated with caring for pets/humans.
They need the ability to add non-home rooms that could encompass factory floors, emergency response debris fields, grocery stores, etc. and their own unique set of tasks for each of these so that it could be used as a benchmark for more than just domestic robots.
Aside from the above additions to BEHAVIOR/iGibson 2.0, there's the question of the organization that manages the benchmark and submissions. There needs to be a website/place to publish benchmark results for a robot AND a mechanism to audit results for accuracy to ensure fair play.
Typically this would be associated with an organization responsible for publishing and auditing submissions as well as guide further development of BEHAVIOR/iGibson 2.0. BEHAVIOR 1.0 is not the end but it’s a great start at providing realistic tasks that any domestic robot would need to perform.
Benchmarks have always aided the development and assessment of new technologies. Having an in-home robot benchmark like BEHAVIOR makes getting domestic robots that do what we want them to do a more likely possibility someday.
There’s a new benchmark in town and it signals the dawning of the domestic robot age.
Read an article the other week about how DeepMind (at Google) is approaching the training of robotics using simulation, reinforcement learning, elastic weights, knowledge distillation and progressive learning.
It seems relatively easy to train a robot to handle some task like grabbing or walking, but doing so can take an awfully long time. Say you want to train a robot to grab something and put it someplace. You can have it start out making some random movements of its arm, wrist and fingers (if it has such things) and then use reinforcement learning to help it improve its movements over time.
But if each grab attempt takes 10 seconds, using reinforcement learning may take 10,000 attempts before it starts to make any significant progress and perhaps another 20,000-50,000 more to get expert at it. Let’s see 60K *10 seconds is 10,000 minutes or ~170 hours. And that’s just one object pick and place. But then maybe you would like to grab different parts and maybe place them in different locations. All these combinations start adding up.
And of course doing 1000s of movements will wear out gears, motors, mechanisms, etc. If only this could all be done in electronic simulation. Then, assuming the simulations are accurate enough, the whole thing could be done in a matter of hours without wearing anything out. Enter robot simulators such as NVIDIA Isaac Sim and OpenAI Roboschool/PyBullet.
But the problems with simulation are …
Simulations are getting more accurate, but at some point their accuracy defeats their purpose because the real world is always noisy, windy and not as deterministic as any simulation. One researcher said you could conceivably have a two-armed robot trained to throw all of a cell phone's components up into the air and have them all land in their proper places, in their proper orientations. But in the real world this could never actually happen, or if it did, it could only happen once.
Weather researchers have been dealing with this problem in spades for a long time. There appears to be a fundamental limit to how far in advance we can predict weather and it’s due to the accuracy with which sensors operate and the complexity of feedback loops between the atmosphere, oceans, landforms, etc. So at a fundamental level, simulations can never be completely accurate. But they can be better.
Today’s weather simulations we see on TV/radio use models that average a number of distinct simulations, where sensor information has been slightly and randomly modified. Something similar could be done for robotic simulation environments, to make them more realistic.
But there are other problems with training robots to do lots of tasks.
Forget me not…
AI deep learning and reinforcement learning algorithms are great when charged with learning a single task, but having them learn multiple tasks is much harder to do. Each task requires its own deep neural network (DNN), and if you train a DNN on one task and then try to train it on another task, it forgets all the learnings from the original task. Researchers call this catastrophic forgetting.
One way researchers have dealt with this problem is to effectively freeze certain DNN nodes so that their weights cannot change during subsequent training rounds, while leaving others flexible or changeable. One can see this when one trains an image recognition DNN to classify different objects by importing a well-trained object classifier, freezing all of its layers except the top one or two, and then training those layers to classify the new objects.
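In PyTorch, freezing looks something like this (a generic transfer-learning sketch, not any particular paper's code):

```python
import torch.nn as nn
from torchvision import models

# Load a pretrained image classifier and freeze everything...
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# ...except a new top layer, which is trained to classify the new objects.
model.fc = nn.Linear(model.fc.in_features, 10)   # new, trainable head for 10 classes
```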
Layer freezing works well, but you have effectively changed the DNN to forget the original object classification training and replaced it with a new one. One refinement of this approach is to have multiple passes of training; after each one, certain nodes and connections (of importance to that particular task) are selectively frozen. This works well for a limited number of different tasks, but over time all nodes become frozen, which means that no more learning can take place. Researchers call this approach to the catastrophic forgetting problem elastic weights.
One way to get around the all-nodes-frozen issue of elastic weights is to have multiple NNs: one which is trained on a specific task and whose weights are then frozen, and a second DNN that exists alongside it with its own freshly initialized set of weights but which uses the original DNN's activations as part of its inputs. This effectively includes and incorporates all the previously learned knowledge into the new, combined DNN. This is called Progressive Neural Networks.
In this fashion one progressive DNN can be sequentially trained on any number of tasks each of which ends up providing input to all subsequent task training activity. Such a progressive network never forgets and can use previously learned knowledge on new tasks.
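A stripped-down two-column sketch of the progressive idea. The layer sizes and the single lateral connection are our simplification of the actual Progressive Neural Networks design:

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    """One task-specific column: a small MLP."""
    def __init__(self, in_dim=8, hidden=64, out_dim=4):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h1 = torch.relu(self.l1(x))
        return self.out(torch.relu(self.l2(h1)))

class ProgressiveColumn(nn.Module):
    """Column for task 2: its own weights, plus a lateral input taken
    from the frozen task-1 column's first hidden layer."""
    def __init__(self, frozen, in_dim=8, hidden=64, out_dim=4):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():
            p.requires_grad = False                    # task-1 knowledge kept intact
        self.l1 = nn.Linear(in_dim, hidden)
        self.l2 = nn.Linear(hidden + hidden, hidden)   # own features + lateral features
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h1_old = torch.relu(self.frozen.l1(x))         # reuse task-1 features
        h1_new = torch.relu(self.l1(x))
        h2 = torch.relu(self.l2(torch.cat([h1_new, h1_old], dim=-1)))
        return self.out(h2)

task2_net = ProgressiveColumn(frozen=Column())   # only task-2 weights get trained
```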
The problem with progressive DNNs is a proliferation of DNN columns, one for each trained task. However, there are a couple of approaches to shrinking an ensemble of DNNs, like the one progressive training creates, into something simpler and just as effective. One way is to perturb weights in DNN nodes and see how model prediction accuracy is impacted on all of its tasks. If accuracy isn't impacted that much, then that node and all its connections can be deleted from the model with minimal impact on model accuracy.
Another approach is to use one DNN to train another, sort of like a teacher and student. This is called knowledge distillation, where one DNN is a large network (the teacher) and a smaller (student) network is trained to mimic the teacher DNN and achieve similar accuracy. This is done by training the smaller student network to match the predictions/classifications of the larger one.
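The heart of the distillation training loop is "match the teacher's soft outputs." A generic sketch (temperature value and loss mixing are our choices, not any particular paper's recipe):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between the softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

# In the training loop: compute teacher_logits with torch.no_grad(), compute
# student_logits on the same batch, and back-propagate distillation_loss
# (often mixed with the ordinary cross-entropy loss on the true labels).
```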
Google researchers have shown that knowledge distillation works best when the gap in the sizes of the two networks (teacher and student) isn't that large. They addressed this problem by introducing an intermediate step (called a teacher's assistant, or TA). They train the TA first, then use the TA to train the student.
In the above graphic, when using a teacher of size 110 and a student of size 8, the resulting accuracy suffers, but if one uses an intermediate DNN of size 20, the resultant accuracy of the student is much closer to the teacher's.
So with realistic simulation we can train a robot to do any specific task, all using only compute resources. And using progressive DNN training, a robot could conceivably be trained to do any number of tasks. And with appropriate knowledge distillation one can reduce the DNN from progressive training into something much smaller (<10%) than the original DNN.
Want a personal robot that can clean up around your place, do the wash, cook your food and do anything else needed. You know what to do.
My last post on AGI inclined towards the belief that AGI was not possible without combining deduction, induction and abduction (probabilistic reasoning) together and that any such AGI was a distant dream at best.
Then I read the Reward is Enough article, and it implied that they saw a realistic roadmap towards achieving AGI based solely on reward signals and reinforcement learning (Wikipedia article on Reinforcement Learning). Reading the article was disheartening at best. After the article came out, I made it a hobby to learn everything I could about reinforcement learning, to understand whether what they are talking about is feasible or not.
Reinforcement learning, explained
Let's just say that the textbook, Reinforcement Learning, is not the easiest read I've seen. But I gave it a shot, and although I'm nowhere near finished (lost somewhere in chapter 4), I've come away with a better appreciation of reinforcement learning.
The premise of reinforcement learning, as I understand it, is to construct a program that performs a sequence of steps based on the state or environment the program is working in, records that sequence, and tags or values that sequence with a reward signal (i.e., +1 for a good job, -1 for bad, etc.). Depending on whether the steps are finite (the task always ends) or infinite (it never ends), the reward tagging can be cumulative (finite steps) or discounted (infinite steps).
The record of the program's sequence of steps would include the state or environment and the next step that was taken, continuing until the program completes the task or, if infinite, until the discounted reward signal is minuscule enough not to matter anymore.
Once you have a log or record of the state, the step taken in that state and the reward for that step, you have a policy that can be used to take better steps. Over time, with sufficient state-step-reward sequences, one can build a policy that works very well for the problem at hand.
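In its simplest tabular form, "building a policy from the log" can be sketched like this. It's our own toy illustration of the idea, not how any production RL system stores its policy:

```python
from collections import defaultdict

# Log entries are (state, step, reward-tag) triples gathered over many runs.
value = defaultdict(float)    # running value estimate for each (state, step) pair
count = defaultdict(int)

def record(state, step, reward):
    """Average the reward tags seen for taking this step in this state."""
    key = (state, step)
    count[key] += 1
    value[key] += (reward - value[key]) / count[key]

def best_step(state, candidate_steps):
    """The policy: pick the candidate step with the highest recorded value."""
    return max(candidate_steps, key=lambda s: value[(state, s)])
```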
Reinforcement learning, a chess playing example
Let’s say you want to create a chess playing program using reinforcement learning. If a sequence of moves ends the game, you can tag each move in that sequence with a reward (say +1 for wins, 0 for draws and -1 for losing), perhaps discounted by the number of moves it took to win. The “sequence of steps” would include the game board and the move chosen by the program for that board position.
If your policy incorporates enough chess move sequences and the program encounters a known board position in a game, then if the recorded move won, select that move; if it lost, select another valid move at random. If the program runs across a board position it's never seen before, choose a valid move at random.
Do this enough times and you can build a winning white-playing chess policy. Doing something similar for a black-playing program would build a winning black-playing chess policy.
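The lookup-or-random behavior described above amounts to something like this (a toy sketch; the board encoding and the `legal_moves` helper are hypothetical):

```python
import random

# winning_moves maps a board position to a move that previously led to a win.
def choose_move(board, winning_moves, legal_moves):
    """Play a recorded winning move if we have one for this position,
    otherwise fall back to a random legal move."""
    move = winning_moves.get(board)
    if move is not None and move in legal_moves(board):
        return move
    return random.choice(legal_moves(board))
```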
So now what does all that have to do with creating AGI? The premise of the paper is that by using rewards and reinforcement learning, one could program a policy for any domain that one encounters in the world.
For example, using the above chart, we could construct reinforcement learning programs that mimic perception (object classification/detection) abilities, memory (image/verbal/emotional/?) abilities, motor control abilities, etc. Each subsystem could be trained to solve its particular arena. And over time, if we built up enough of these subsystems, one could somehow construct an AGI system of subsystems that would match human levels of intelligence.
The paper’s main hypothesis is “(Reward is enough) Intelligence, and its associated abilities, can be understood as subserving the maximization of reward by an agent acting in its environment.”
Given where I am today, I agree with the hypothesis. But the crux of the problem is in the details. Yes, for a multi-player game where a reward signal of some type can be computed, a reinforcement learning program can be crafted that plays better than any human. But this is only because one can create programs that can play that game, create programs that understand whether the game is won or lost, and use all this to improve the game-playing policy over time and game iterations.
Do rewards and reinforcement learning provide a roadmap to AGI?
To use reinforcement learning to achieve AGI implies that
One can identify all the arenas required for (human) intelligence
One can compute a proper reward signal for each arena involved in (human) intelligence,
One can programmatically compute appropriate steps to take to solve that arena’s activity,
One can save a sequence of state-steps taken to solve that arena’s problem, and
One can run sequences of steps enough times to produce a good policy for that arena.
There are a number of potential difficulties in the above. For instance, what's the state the program operates in?
Consider a human, which has 500K(?) pressure, pain, cold, & heat sensors throughout the exterior and interior of the body; two eyes, ears, & nostrils; one tongue; two balance sensors; tired, anxious, hunger, sadness, happiness, and pleasure signals; and 600 muscles actuating the position of five fingers per hand and toes per foot, two eyes, ears, feet, legs, hands, and arms, and one head and torso. Such a "body state" becomes quite complex, and any state that records all this would be quite large. Ok, it's just data, just throw more storage at the problem – my kind of problem.
The compute power to create good policies for each subsystem would also be substantial and in the end determining the correct reward signal would be non-trivial for each and every subsystem. Yet, all it takes is money, time and effort and all this could be accomplished.
So, yes, given all the above, creating an AGI that matches human levels of intelligence using reinforcement learning techniques and rewards is certainly possible. But given all the state information, action possibilities and reward signals inherent in a human interacting with the world today, any human-level AGI would seem infeasible in the next year or so.
One item of interest: recently, DeepMind researchers created MuZero, which learns how to play Go, chess, shogi and Atari games without any pre-programmed knowledge of the games (that is, how to play the game, how to determine if the game is won or lost, etc.). It managed to come up with its own internal reward signal for each game and determined what the proper moves were for each game. This seems to combine a deep learning neural network together with reinforcement learning techniques to craft a reward signal and valid move policies.
Alternatives to full AGI
But who says you need AGI for something that might be useful to us? Let's say you just want to construct an intelligent oracle that understands all human-generated knowledge and science and can answer any question posed to it, with its only response capabilities being audio, video, images and text.
Even an intelligent oracle such as the above would need an extremely large state. Such a state would include all human- and machine-generated information at some point in time. And any reward signal needed to generate a good oracle policy would need to be very sophisticated: it would need to determine whether the oracle's answer was good or not. And of course the steps to take to answer a query are uncountable: first there's understanding the query, next searching out and examining every piece of information in the state space for relevance, and finally using all that information to answer the question.
I’m probably missing a few steps in the above, and it almost makes creating a human level AGI seem easier.
Perhaps the MuZero techniques might have an answer to some or all of the above.
Yes, reinforcement learning is a valid roadmap to achieving AGI, but can it be done today – no. Tomorrow, perhaps.