Deepmind does sort

Saw an article today on TNW, DeepMind's new AI taps games to enhance fundamental algorithms, which discussed a recent Nature paper, Faster sorting algorithms discovered using deep reinforcement learning, and an accompanying website describing AlphaDev.

Google DeepMind's AlphaDev is a derivative of AlphaZero (itself a follow-on to AlphaGo, the conqueror of Go and other strategy games, and a sibling of MuZero). AlphaDev uses Deep Reinforcement Learning (DRL) to come up with new computer science algorithms; in its first incarnation, a way to sort 2, 3, 4 or 5 integers using X86 instructions.

Sorting has been well explored over the years in computer science (CS, e.g. see Donald E. Knuth’s Volume 3 in The Art of Computer Programming, Sorting and Searching), so when a new more efficient/faster sort algorithm comes out it’s a big deal. Google used to ask job applicants how they would code sort algorithms for specific problems. Successful candidates would intrinsically know all the basic CS sorting algorithms and which one would work best in different circumstances.

Deepmind’s approach to sort

Reading the TNW news article, I couldn't conceive of the action space involved in the reinforcement learning, let alone what the state space would look like. However, when I read the Nature article, I found the DeepMind researchers did a decent job of explaining their DRL approach to developing new basic CS algorithms like sorting.

AlphaDev uses a transformer-like framework and a very limited set of x86 (sort of, encapsulated) instructions with memory/register files, and DeepMind limited it to sorting 2, 3, 4, or 5 integers. Such functionality is at the heart of any sort algorithm and, as such, is used a gazillion times over in any sorting task involving a long string of items. I think AlphaDev used a form of on-policy RL but can't be sure.

Looking at the X86 basic instruction cheat sheet, there are over 30 basic forms of X86 instructions, which are then multiplied by the type of data being manipulated (registers, memory, constants, etc., and length of operands).

AlphaDev only used 4 (ok, 9 if you include the conditionals for conditional move and conditional jump) X86 instructions. The instructions were mov<A,B>, cmovX<A,B>, cmp<A,B> and jX<A,B> (where X identifies the condition under which a conditional move [cmovX] or jump [jX] would take place). And they only used (full, 64 bit) integers in registers and memory locations.

AlphaDev actions

The types of actions that AlphaDev could take included the following (sketched in code after the list):

  • Add transformation – which added an instruction to the end of the current program
  • Swap transformation – which swapped two instructions in the current program
  • Opcode transformation – which changed the opcode (e.g., changed a mov to a cmp) at a step in the current program
  • Operand transformation – which changed the operand(s) for an instruction in the current program
  • Instruction transformation – which changed the opcode and operand(s) for some instruction in the current program.
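To make that action space concrete, here's a minimal sketch (my own illustration, not DeepMind's code) of these five transformations applied to a program represented as a list of (opcode, operands) tuples; the operand naming (R for registers, M for memory locations) is an assumption.

```python
def add_instruction(program, instr):
    """Add transformation: append an instruction to the end of the program."""
    return program + [instr]

def swap_instructions(program, i, j):
    """Swap transformation: exchange two instructions in the program."""
    p = list(program)
    p[i], p[j] = p[j], p[i]
    return p

def change_opcode(program, i, opcode):
    """Opcode transformation: change only the opcode of step i (e.g. mov -> cmp)."""
    _, operands = program[i]
    return program[:i] + [(opcode, operands)] + program[i + 1:]

def change_operands(program, i, operands):
    """Operand transformation: change only the operand(s) of step i."""
    opcode, _ = program[i]
    return program[:i] + [(opcode, operands)] + program[i + 1:]

def change_instruction(program, i, instr):
    """Instruction transformation: replace both opcode and operand(s) of step i."""
    return program[:i] + [instr] + program[i + 1:]

# A program is just a list of (opcode, operands) tuples:
prog = [("mov", ("M0", "R0")), ("mov", ("M1", "R1")), ("cmp", ("R0", "R1"))]
prog = swap_instructions(prog, 0, 1)   # one possible RL action
```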

They list in their paper a correctness cost function which, at each transformation, provides a value signal (I think) for the RL policy. They experimented with 3 different functions: 1) the % of correctly placed items; 2) square_root(% correctly placed); and 3) square_root(number of items – number correctly placed). They discovered that the last worked best.
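My reading of those three correctness signals, sketched as simple functions of how many items end up correctly placed (the exact scaling used in the paper may differ):

```python
import math

def pct_correct(placed_correctly, total):
    # 1) the percentage of correctly placed items
    return placed_correctly / total

def sqrt_pct_correct(placed_correctly, total):
    # 2) square root of the percentage of correctly placed items
    return math.sqrt(placed_correctly / total)

def sqrt_misplaced(placed_correctly, total):
    # 3) square root of (number of items - number correctly placed);
    #    used as a cost, this is the signal they found worked best
    return math.sqrt(total - placed_correctly)
```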

They also placed some constraints on the code generated (called action pruning rules; see the sketch after the list):

  • Memory locations are always read in incremental order
  • Registers are allocated in incremental order
  • Program cannot compare or conditionally move to memory location
  • Program can only read and write to each memory location once (it seems this would tell the RL algorithm when to end the program)
  • Program can not perform two consecutive compare instructions
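A rough sketch of how a few of these pruning rules could be checked against a candidate program (my own interpretation of the rules, using the same (opcode, operands) tuple representation as the earlier sketch):

```python
def violates_pruning_rules(program):
    """Check a few of the pruning rules above; `program` is a list of
    (opcode, operands) tuples such as ("cmp", ("R0", "R1"))."""
    last_mem_read = -1
    prev_opcode = None
    for opcode, operands in program:
        src, dst = operands[0], operands[-1]
        # memory locations must be read in incremental order
        if src.startswith("M"):
            idx = int(src[1:])
            if idx < last_mem_read:
                return True
            last_mem_read = idx
        # programs cannot compare or conditionally move to a memory location
        if opcode.startswith(("cmp", "cmov")) and dst.startswith("M"):
            return True
        # no two consecutive compare instructions
        if opcode == "cmp" and prev_opcode == "cmp":
            return True
        prev_opcode = opcode
    return False
```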

AlphaDev states

How they determined the state of the program at each transformation is also worth describing. They used one-hot encodings (essentially, a bit in a bitmap is assigned to every instruction-operand pair) for the opcode-operand steps in the current program and appended each encoded step into a single program string. Ditto for the state of the memory and registers (at each instruction, presumably?). Both the instruction list and memory-register embeddings were then fed into a state representation encoder.

This state "representation network" (DNN) generated a "latent representation of the State(t)" (maybe it classified the state into one of N classes). For each latent state (classification), there is another "prediction network" (DNN) that predicts the expected return value (presumably trained on the correctness cost function above) for each state-action pair. And between the state and expected return values, AlphaDev created an (RL) policy to select the next action to perform.
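A toy sketch of that pipeline, with a one-hot program encoding feeding a representation network and a value-prediction head (the vocabulary, layer sizes and action count are all placeholders of mine, not the paper's):

```python
import torch
import torch.nn as nn

VOCAB = ["mov M0,R0", "mov M1,R1", "cmp R0,R1", "cmovl R1,R0", "jle L1"]  # toy opcode-operand vocabulary
MAX_LEN, NUM_ACTIONS = 16, 32

def encode_program(program):
    """One-hot encode each opcode-operand step and append into one program tensor."""
    enc = torch.zeros(MAX_LEN, len(VOCAB))
    for i, step in enumerate(program[:MAX_LEN]):
        enc[i, VOCAB.index(step)] = 1.0
    return enc.flatten()

# representation network: encoded program (plus memory/register state) -> latent state
repr_net = nn.Sequential(nn.Linear(MAX_LEN * len(VOCAB), 128), nn.ReLU())
# prediction network: latent state -> expected return for each possible action
value_head = nn.Linear(128, NUM_ACTIONS)

latent = repr_net(encode_program(["mov M0,R0", "cmp R0,R1"]))
action_values = value_head(latent)          # scores the RL policy can select from
best_action = int(action_values.argmax())   # greedy choice, for illustration
```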

Presumably they started with current basic CS sort algorithms and 2-5 random integers in memory, fed this in (properly encoded and embedded) as a starting point, and then the AlphaDev algorithm went to work to improve it.

Do this enough times, with an intelligent balance between exploration (more random action selection at first) and policy following (more use of the policy later), and you too can generate new sorting algorithms.

DeepMind also spent time creating a stochastic approach to sorting that they used to compare against their AlphaDev DRL approach to see which did better. In the end they found the AlphaDev DRL approach worked faster and better than the stochastic solutions they tried.

DeepMind, having conquered sorting, did the same for hashing.

Why I think DeepMind’s AlphaDev is better

AlphaDev's approach could just as easily be applied to any of the algorithms in Donald E. Knuth's 4-volume series, The Art of Computer Programming.

I believe DeepMind’s approach is much more valuable to programmers (and humanity) than CoPilot, ChatGPT code, AlphaCode (DeepMind’s other code generator) or any other code generation transformers.

IMHO AlphaDev goes to the essence of computer science as it's been practiced over the last 70 years: here's what we know, now let's try to discover a better way to do the work we all have to do. Once we have discovered a new and better way, report and document it as widely as possible so that any programmer can stand on our shoulders and use our work to do what they need to get done.

If I'm going to apply AI to coding, having it generate better basic CS algorithms is much more fruitful for the programming industry (and, I may add, humanity as a whole) than having it generate yet another iOS app or web site from scratch.

Comments?


The problem with Robotic AI is … data

The advances made in textual and visual (and now aural) AI have been mind-blowing in recent years. But most of this has been brought about via the massive availability of textual, visual and audio data AND the advancement in hardware acceleration.

Robotics can readily take advantage of hardware improvements, but finding the robotic data needed to train robotic AI is a serious challenge.

Yes, simulation environments can help, but fidelity (how close simulation is to reality) is always a concern.

Gathering the amount of data needed to train a simple robotic manipulator to grab a screw from a bin is a huge problem. In the past the only way to do this was to create your robot, have it start making random screw-grab motions, and monitor what happens. After about 1,000 or 10,000 of these grabs, the robot would stop working because gears wear down, grippers come loose, motors become less responsive, images get obscured, etc. For robots it's not as simple as scraping the web for images or downloading all the (English) text in Wikipedia and masking select words to generate pseudo-supervised learning.

There's just no way to do that in robotics without deploying 100s or 1000s or 10,000s of real physical robots (or cars), all instrumented with everything needed to capture data for AI learning in real time, and letting these devices go out into the world with humans guiding them.

While this might work for a properly instrumented fleet of cars, which are already useful in their own right even without automation and which humans are more than happy to guide out on the road, it doesn't work for other robots, whose usefulness can only be realized after they are AI trained, not before.

Fast-RLAP (RC) car driving learning machine

So I was very interested to see a tweet on FastRLAP (paper: FastRLAP: A System for Learning High-Speed Driving via Deep RL and Autonomous Practicing), which used deep reinforcement learning, plus a human-guided lap, plus autonomous practicing to teach an AI model how to drive a small RC model car with a vision system, IMUs and GPS around a house, a racetrack and an office environment.

Ok, I know it still involves taking an instrumented robot and having it actually move around the real world. But Fast-RLAP accelerates AI learning significantly. Rather than having to take 1,000 or 10,000 random laps around a house, it was able to learn how to drive around the course to an expert level very rapidly.

They used Fast-RLAP to create a policy that enabled the RC car to drive around 3 indoor circuits, two outdoor circuits and one simulated circuit, in most cases achieving expert-level track times, typically in under 40 minutes of practice.

On the indoor course, with a vinyl floor, the car learned how to perform drift turns (not sure I know how to do drift turns). On tight "S" curves, the car learned how to get as close to the proper racing line as possible (something I achieved, rarely, only on motorcycles a long time ago). And all while managing to avoid collisions.

The approach seems to be to have a human drive the model car slowly around the course, flagging or identifying intermediate waypoints or checkpoints on the track. While driving the loop, the car would use the direction to the next waypoint as guidance for where to drive next.

Note: in the paper's figures, the light blue circles are example track waypoints; they differ in size and location around each track.

The approach seems to make use of a pre-trained track-following DNN, but they stripped the driving dynamics (output layers) and just kept the vision (image) encoder portion to provide a means to encode an image and identify route-relevant features (which routes led to collisions, which routes were of interest for getting to the next checkpoint, etc.).

I believe they used this pre-trained DNN to supply a set of actions to the RL policy which would select between them to take RC car actions (wheel motor/brake settings, steering settings, etc.) and generate the next RC car state (location, direction to next waypoint, etc.).

They used an initial human-guided lap, mentioned above, to identify waypoints and possibly to supply data for the first RL policy.

The RL part of the algorithm used off-policy RL (the RC car would upload lap data at waypoints to a server, which would periodically select stored lap states and actions at random and update its RL policy; the updated policy would then be downloaded to the RC car while in motion). (Code: GitHub repo.)
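Structurally, that upload/sample/update loop looks something like the sketch below (my own outline of the idea; `update_fn` stands in for whatever actor-critic update FastRLAP actually uses):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # server-side store of uploaded lap transitions

def on_lap_upload(transitions):
    """Called when the car reaches a waypoint and uploads its recent
    (state, action, reward, next_state) tuples."""
    replay_buffer.extend(transitions)

def periodic_policy_update(policy, update_fn, batch_size=256):
    """Off-policy learning: the server samples stored experience at random
    (not just the latest lap), applies the RL update rule, and the updated
    policy is then pushed back down to the car while it keeps driving."""
    if len(replay_buffer) >= batch_size:
        batch = random.sample(replay_buffer, batch_size)
        policy = update_fn(policy, batch)
    return policy
```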

The reward function used to drive RL was based on minimizing the time to the next waypoint, collision counts, and stuck counts.

I assume collision counts were instances where the car struck some obstacle but could continue on towards the next waypoint. Stuck instances were when the car could no longer move in the direction its RL policy told it to. The system had a finite state machine that allowed it to get out of stuck points by reversing wheel motor(s) and choosing a random direction for steering.
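A hedged sketch of a reward in that spirit (the weights and bonus are placeholders of mine, not values from the paper):

```python
def lap_reward(time_to_waypoint, collided, stuck, reached_waypoint,
               w_time=1.0, w_collision=10.0, w_stuck=10.0, bonus=100.0):
    """Penalize elapsed time, collisions and getting stuck; reward reaching
    the next checkpoint."""
    reward = -w_time * time_to_waypoint
    reward -= w_collision * int(collided)
    reward -= w_stuck * int(stuck)
    if reached_waypoint:
        reward += bonus
    return reward
```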

You can see the effects of the pre-trained vision system in some of the screen shots of what the car was trying to do.

In any case, this is the sort of thinking that needs to go on in robotics in order to create more AI-capable robots. That is, not unlike transformer learning, we need to figure out a way to take what's already available in our world and use it to help generate the real-world data needed to train robotic DNN/RL algorithms to do what needs to be done.

Comments?


Deepmind does chat

Read an article this week on Deepmind's latest research into developing a chat agent (Improving alignment of dialogue agents via targeted human judgements). Lots of interesting approaches have been applied to chat, but even today most chat models are rife with problems that include being bigoted, profane, incorrect, etc.

Reinforcement learning vs. deep neural networks in Sparrow Chat

Deepmind specializes in the use of Reinforcement Learning (RL) applied to mastering Atari, chess and Go games, but they have also been known to use DNNs (deep neural networks) for AlphaFold and other models. Indeed, the Atari and other game-playing work that Deepmind has released has been a hybrid that included DNNs as well as RL models.

Deepmind's version of chat is currently called Sparrow, and it uses models trained with the help of RL with human feedback (RLHF). RL is used to create policy models that select actions to be taken in a specific state.

In Sparrow's case, the state is given by the most recent chat input plus the context (prior chat inputs and replies) of the dialogue up to this time, and the actions (our guess) are the set of possible replies to that input.

Sparrow is able to generate replies that are 82% mostly true or true and 69% trustworthy or very trustworthy, as rated by the authors of the model. Deepmind's DPC (Dialogue Prompted Chinchilla, Deepmind's current competitor to the GPT-3 NLP transformer) model only managed 63% and 54%, respectively, for the same metrics.

It should be noted that human feedback was only used to train the two Preference RMs and the one Rule RM. In combination, these RMs provide the reward signal to train the Sparrow RL policy model which drives its chat responses.

Sparrow's 5 models are built on top of DPC. The 5 models use a portion of DPC which is frozen (layers not being trained) and a portion which is specifically trained for each of the 5 models (learning-enabled layers). The end (output) layers are on top; input layers come after the embedding layer(s). Note, the value function is not a model and is just a calculation based on the RMs used to generate the reward signal for Sparrow's policy model training.

Rules for Sparrow chats

Notably, Deepmind’s Sparrow model has a separate model specifically trained to determine if a particular chat response is breaking a rule. Deepmind identified 23 rules which their chat model is trained not to break.

Some of these rules include don’t provide financial advice, don’t provide medical advice, don’t pretend it is a human, etc.

In the paper's chart, RL@8 is the fully trained (if it can ever be considered fully trained) Sparrow chat model. One can see Sparrow rated against DPC, both with and without (Google) search. For most rules, Sparrow is considerably better than DPC alone.

Another interesting thing Deepmind did: in training the Rule RM they used adversarial attacks (red teaming) to see if they could cause Sparrow to violate specific rules.

Preference ranking

Deepmind also created (two) Preference RMs (reward models). Sparrow generates a series of (2 or 8) responses for every chat query, and the Preference RMs (and Rule RM) are used to select which one is actually sent back to the user. Human feedback was used to train the two Preference RMs.

Two Preference RMs were found to perform better than a single Preference RM. The two Preference RMs were trained as follows:

  • One was trained on all Sparrow replies (with and without [Google] search results)
  • One was trained on Sparrow replies without search results.

It turns out that some chat questions are fact-based, and for these Sparrow uses search results to generate evidence for its chat replies. Sparrow automatically generates search requests and scrapes replies using the 500 characters surrounding the snippet returned from the search.

Sparrow uses a re-ranking approach to selecting a response to a chat query. In this case, Sparrow generates a list of responses, 2 (called RL@2) or 8 (RL@8), and then, using the two Preference RMs and the single Rule RM, ranks them to see which is best and uses that one to reply to the chat user.

Sparrow actually generates two replies for every search query (Google Search API call), probably selecting the two top search responses (we guess). So in the RL@8 version of Sparrow, these 8 replies are submitted to the two Preference RMs and the Rule RM, ranked according to which is best, and then the best one is used to reply to the query.
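A minimal sketch of that re-ranking step (the function names and the way the scores are combined here are my assumptions, not DeepMind's code):

```python
def rerank(candidates, preference_rm, preference_rm_no_search, rule_rm):
    """Score each candidate reply with the appropriate Preference RM and the
    Rule RM, then return the best one. Each candidate is a dict like
    {"text": ..., "used_search": bool}."""
    def score(reply):
        pref = (preference_rm if reply["used_search"] else preference_rm_no_search)(reply["text"])
        rule_ok = rule_rm(reply["text"])     # probability that no rule is broken
        return pref * rule_ok
    return max(candidates, key=score)
```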

In the paper's chart, higher indicates the ranking preference of the various models vs. human preferences, and further to the right indicates fewer rule-breaking responses. We assume this is with RL@8 Sparrow models. One can see that taking rule breaking into consideration (not violating rules) reduces the preference rankings of Sparrow's replies. But we would prefer to have no rule breaking, so the Sparrow that has both Preference RMs and the Rule RM (trained with adversarial training) shows the least amount of rule breaking (~7%) with an almost 70% ranking vs. human preferences. The error bars on the points in the chart show a 68% interval around the model responses.

Sparrow in action

It’s somewhat intriguing that Deepmind (with all of Google’s resources) tried to optimize Sparrow for both computation and memory considerations. Almost like they were planning on releasing it on an IoT or phone device.

There’s plenty more to say about what Deepmind has done with Sparrow. The report cited above goes into some detail discussing just where the human input is done, how they tried to control for various considerations when using human input, and what some of the pitfalls were.

I’d certainly like to see this be deployed in the open and available to use as an alternative to Google Search.

You can see more examples of Sparrow chat sessions in Deepmind's Sparrow chat repository, and they include the authors' rankings for truth, supportiveness and other metrics.

~~~~~

Comments?


Safe AI

I've been writing about AGI (see part-0 [ish], part-1 [ish], part-2 [ish], part-3 [ish], part-4 and part-5) and the dangers that come with it (part-0 in the above list) for a number of years now. After my last post on the subject I expected to be writing a post discussing the book Human Compatible: AI and the Problem of Control, which is a great book on the subject. But since then I ran across another paper that is perhaps a better brief introduction to the topic and to some of the current thought and research into developing safe AI.

The paper I found is Concrete Problems in AI Safety, written by a number of researchers at Google, Stanford, Berkeley, and OpenAI. It essentially lays out the AI safety problem along 5 dimensions, and these are:

Avoiding negative side effects – these can be minor or major and are probably the one thing that scares humans the most, e.g. some toothpick-generating AI that strips the world bare to maximize toothpick making.

Avoiding reward hacking – this is more subtle, but essentially it's having your AI fool you into thinking it's doing what you want while actually doing something else. This could entail anything from actually changing the reward logic itself to convincing/manipulating the human overseer into seeing things its way. Also a pretty bad thing from humanity's perspective.

Scalable oversight – this is the problem where human overseers aren't able to keep up and witness/validate what some AI is doing, 7×24, across the world, at the speed of electronics. So how can AI be monitored properly so that it doesn't go and do something it's not supposed to (see the prior two for ideas on how bad this could be)?

Safe exploration – this is the idea that reinforcement learning, in order to work properly, has to occasionally explore the solution space, e.g. a Go board with moves selected at random, to see if they are better than what it currently believes is the best move to make. This isn't much of a problem for game-playing ML/AI, but if we are talking about helicopter-controlling AI, random exploration could destroy the vehicle plus any nearby structures, flora or fauna, including humans of course.

Robustness to distributional shift – this is the perennial problem where AI or DNNs are trained on one dataset but over time the real world changes and the data they're now seeing has shifted (in distribution) to something else. This often leads to DNNs not operating properly over time or having many more errors in deployment than they did during training. This is probably the one problem on this list undergoing more research than any of the others, because it impacts just about every ML/AI solution deployed in the world today. This robustness to distributional shift problem is why many AI DNN systems require periodic retraining.

So now we know what to look for, now what

Each of these deserves probably a whole book or more to understand and try to address. The paper talks about all of these and points to some of the research or current directions trying to address them.

The researchers correctly point out that some of the above problems are more pressing when more complex ML/AI agents have more autonomous control over actions in the real world.

We don’t want our automotive automation driving us over a cliff just to see if it’s a better action than staying in the lane. But Go playing bots or article summarizers might be ok to be wrong occasionally if it could lead to better playing bots/more concise article summaries over time. And although exploration is mostly a problem during training, it’s not to say that such activities might not also occur during deployment to probe for distributional shifts or other issues.

However, as we start to see more complex ML/AI solutions controlling more activities, the issues of AI safety are starting to become more pressing. Autonomous cars are just one pressing example. But recent introductions of sorting robots, agricultural bots, manufacturing bots, nursing bots, guard bots, soldier bots, etc. are all just steps down a (short) path of increasing complexity that can only end in some AGI bots running more parts (or all) of the world.

So safety will become a major factor soon, if it's not already.

Scares me the most

The first two on the list above scare me the most. Avoiding negative or unintentional side effects and reward hacking.

I suppose if we could master scalable oversight we could maybe deal with all of them better as well. But that’s defense. I’m all about offense and tackling the problem up front rather than trying to deal with it after it’s broken.

Negative side effects

Negative side effects is a rather nice way of stating the problem of having your ML destroy the world (or parts of it) that we need to live in.

One approach to dealing with this problem is to define or train another AI/ML agent to measure impacts to the environment and have it somehow penalize the original AI/ML for causing them. This learned-impact approach has some potential to be applied to numerous ML activities if it can be shown to be safe and fairly all-encompassing.

Another approach discussed in the paper is to inhibit or penalize the original ML for any actions which have negative consequences. One way to do this is to come up with an "empowerment measure" for the original AI/ML solution. The idea would be to reduce, minimize or govern the original ML's action set (or potential consequences) via this empowerment measure so as to minimize its ability to create negative side effects.

The paper discusses other approaches to the problem of negative side effects, one of which is having multiple ML (or ML and human) agents work on the problem together and have the ability to influence (kill switch) each other when they discover something's awry. Another approach they mention is to reduce the certainty of the reward signal used to train the ML solution. This would work by having some function reduce the reward if there are random side effects, which would tend to make the ML solution learn to avoid them.

Neither of these latter two seems as feasible as the others, but they are all worthy of research.

Reward hacking

This seems less of a problem to our world than negative side effects until you consider that if an ML agent is able to manipulate its reward code, it’s probably able to manipulate any code intending to limit potential impacts, penalize it for being more empowered or manipulate a human (or other agent) with its hand over the kill switch (or just turn off the kill switch).

So this problem could easily lead to a breakout of any of the other problems present on the list of safety problems above and below. An example of reward hacking is a game-playing bot that detects a situation that leads to a buffer overflow and results in a win signal or higher rewards. Such a bot will no doubt learn how to cause more buffer overflows so it can maximize its reward, rather than learn to play the game better.

But the real problem is that a reward signal used to train a ML solution is just an approximation of what’s intended. Chess programs in the past were trained by masters to use their opening to open up the center of the board and use their middle and end game to achieve strategic advantages. But later chess and go playing bots just learned to checkmate their opponent and let the rest of the game take care of itself.

Moreover, (board) game play is a relatively simple domain in which to come up with proper reward signals (with the possible exception of buffer overflows or other bugs). But for car-driving bots, drone bots, guard bots, etc., reward signals are not nearly as easy to define or implement.

One approach to avoid reward hacking is to make the reward signaling process its own ML/AI agent that is (suitably) stronger than the ML/AI agent learning the task. Most reward generators are relatively simple code; for instance, in a timed game of Monopoly, one that just counts the money each player has at the end of the game could be used to determine the winner. But rather than having a simple piece of code create the reward signal, use ML to learn what the reward should be. Such an agent might be trained to check whether more or less money was being counted than was physically possible in the game, or whether property was illegally obtained during the game, or whether other reward hacks were done, and penalize the ML solution for these actions. This would make the reward signal depend on proper training of that reward ML solution. And the two ML solutions would effectively compete against one another.

Another approach is to "sandbox" the reward code/solution so that it is outside of external and/or ML/AI influence. Possibly combining the prior approach with this one might suffice.

Yet another approach is to examine the ML solution's future states (actions) to determine if any of them impact the reward function itself and penalize it for doing this. This assumes that the future states are representative of what it plans to do and that some code or some person can recognize states that are inappropriate.

Another approach discussed in the paper is to have multiple reward signals. These could use multiple formulas for computing the multi-faceted reward signal and averaging them or using some other mathematical function to combine them into something that might be more accurate than one reward function alone. This way any ML solution reward hacking would need to hack multiple reward functions (or perhaps the function that combines them) in order to succeed.
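As a toy illustration of the idea (not from the paper), several independently implemented reward functions could be combined so that hacking any single one isn't enough to inflate the total:

```python
def combined_reward(state, action, reward_fns, combine=min):
    """Evaluate several independent reward functions and combine them;
    using min is conservative, averaging is another option."""
    return combine(fn(state, action) for fn in reward_fns)
```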

The one IMHO that has the most potential but which seems the hardest to implement is to somehow create “variable indifference” in the ML/AI solution. This means having the ML/AI solution ignore any steps that impact the reward function itself or other steps that lead to reward hacking. The researchers rightfully state that if this were possible then many of the AI safety concerns could be dealt with.

There are many other approaches discussed, and I would suggest reading the paper to learn more. None of the others seems simple or a complete solution to all potential reward hacks.

~~~

The paper goes into the same or greater level of detail on the other three "concrete safety" issues in AI.

In my last post (see the part-5 link above) I thought I was going to write about S. Russell's book Human Compatible and its discussion of AI safety. But then I found the "Concrete Problems in AI Safety" paper (see link above), thought it provided a better summary of AI safety issues, and used it instead. I'll try to circle back to the book at some later date.


Deepmind does code – part 1: the data

1st, let me express my and my fellow coders'/programmers' disappointment that Deepmind would take on coding. There are many other white-collar work domains that need to be conquered before coding.

2nd, let me apologize for the lack of blog posts lately, all I can say is, business is picking up.

Saw an article over the last couple of weeks on Deepmind creating AlphaCode, an artificial intelligence coding application which they used to enter coding contests, achieving an average 1238 rating, better than 54% of code contest participants.

I can't recall where I first saw the news, but Deepmind has a pretty decent blog post on AlphaCode and they have published a pre-print of their research paper on AlphaCode as well. I plan on discussing AlphaCode in detail over a couple of posts. This first installment covers where they got the data to train their models.

AlphaCode is a transformer-based language model (see: Wikipedia: Transformer (machine learning model) article) that translates a code competition problem statement into code, i.e., a program that, when executed, solves the problem statement. In order to train AlphaCode, Deepmind first needed to obtain lots of source code.

It’s all about the (training) data

The first step in Deep Learning model generation is gathering data to train the model. Now where would Google’s Deepmind go to gather coding data – well GitHub, a public repository of all things software, of course.

They used GitHub data to pre-train their model(s), but also scraped code (problem statements & test cases) from published code contests to fine-tune their model.

Deepmind has released their fine-tuning CodeContests training data for AlphaCode on GitHub, so as to support other organizations in creating AI models for coding.

GitHub source to the (pre-training) rescue

There are a couple of problems with using GitHub source code for training:

  • GitHub code is in any source code language the author feels most appropriate to use.
  • GitHub code is not guaranteed to work correctly.
  • GitHub code is not guaranteed to be completed code.
  • GitHub code represents a wide range of coding skill.
  • GitHub code doesn’t always come with a problem statement.

But the use of GitHub in their pre-training data set is intended to give their transformer-based language model some capability to understand (learn) what coding is all about, what a proper syntax would be, what a proper coding sequence would be, etc.

The AlphaCode team took a snapshot of selected GitHub source repos. This meant they only scraped repos containing C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript code. They also dropped from the pre-training data any source files larger than 1MB or with any lines longer than 1000 characters; this was done to avoid using machine-generated code. They also stripped all the whitespace out of the selected source code files and compared them to eliminate duplicated code.
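A sketch of those pre-training filters (the size and line-length thresholds are as I read them from the description above; the extension list is my rough approximation of the languages named, not DeepMind's code):

```python
import os
import re

LANG_EXTS = {".cc", ".cpp", ".cs", ".go", ".java", ".js", ".lua", ".php",
             ".py", ".rb", ".rs", ".scala", ".ts"}

def keep_file(path, text, max_bytes=1_000_000, max_line=1000):
    """Keep only files in the selected languages, below the size limit,
    and with no overly long lines (a proxy for machine-generated code)."""
    if os.path.splitext(path)[1] not in LANG_EXTS:
        return False
    if len(text.encode("utf-8")) > max_bytes:
        return False
    if any(len(line) > max_line for line in text.splitlines()):
        return False
    return True

def dedup_key(text):
    """Strip all whitespace before comparing files to eliminate duplicates."""
    return re.sub(r"\s+", "", text)
```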

Their final pre-training dataset was 715GB of data over 86 million source files.

Although unstated, we would guess that the AlphaCode team used each GitHub repo's README.md file as a surrogate for the problem description. It's unclear what else could have been used, unless they generated it automatically by extracting semantic content or summarizing the README.md files.

Excerpt from Deepmind's competitive code contest source code & problem statements README.md file

The (pre-)training data can be used to train a transformer-based language model. Such models are used today to provide language translation. In AlphaCode's case they wanted to create a code transformer-based model that translates a specification of a coding problem into source code that solves that problem.

For language translation models, the training data consists of text files in different languages that represent the same law or information and, notably, are human-generated translations.

One challenge with using internet-scraped data for training is that it can easily contain actual solutions, verbatim, for the problems the model is trying to solve. In order to avoid copying these solutions entirely, they decided to split their data into a training set, validation set and test set on a time basis. This way the training data used only source code/problem statements from a period of time prior to the validation set; ditto for the training+validation data relative to the test data.
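The idea of the temporal split, sketched (my own illustration, not DeepMind's code):

```python
def temporal_split(examples, train_cutoff, valid_cutoff):
    """Split scraped {"timestamp": ..., "problem": ..., "solution": ...}
    examples by time so that validation only contains problems newer than
    anything in training, and test newer than anything in validation."""
    train = [e for e in examples if e["timestamp"] < train_cutoff]
    valid = [e for e in examples if train_cutoff <= e["timestamp"] < valid_cutoff]
    test  = [e for e in examples if e["timestamp"] >= valid_cutoff]
    return train, valid, test
```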

To show that this approach (using a time point to split the data) worked, they trained a 1B-parameter AlphaCode transformer on two different training-validation datasets: one where the validation data was selected at random (the normal approach), the "random" split, and the other where the validation data only occurred some time after the training data, the "temporal" split. The 1B AlphaCode transformer was able to properly code 0.8% of the problems using a 13K sample of the 86M source files/problem statements on the random split, but only 0% on the temporal split.

So much for pre-training, let’s discuss fine tuning

AlphaCode was going to get nowhere with a 0% solve rate (ok, this was based on a 13K sample and only a 1B-parameter model), but they realized that GitHub code was only going to get them so far (ok, conjecture on my part).

So fine-tuning beyond the (GitHub-derived) pre-training data was needed, and the AlphaCode team turned to code competition source code/problem statement data.

Most code contests publish source code submissions as well as the problem statements and sample test cases. By scraping these, Deepmind was able to obtain a very well annotated dataset they could use to fine-tune their AlphaCode transformer model.

They again used a temporal split for training/validation/test data. But they were also able to add metadata to their data that indicated whether the code solved the problem statement.

Code competitions also publish tests for the problem statement. Having the tests, a human can validate whether their code at least works against them. Code contests also have a set of more sophisticated hidden tests that they use internally to validate code submissions.

This test data will become important later on in the model's operation, which will be discussed in a future post, but suffice it to say that AlphaCode uses the public tests (and mutations of these) to validate AlphaCode-generated source code before submitting it.

This fine-tuning dataset is available in the GitHub repo (linked to above) that Deepmind has created/curated for others to work with.

Another nicety of this fine-tuning data is that it has proper, human-created problem statements to work from rather than README.md surrogates.

In part-2 we plan to describe the transformer-based model that was created for AlphaCode and at some point, discuss how they used testing in their code submissions.

Once again, all my information comes from Deepmind’s pre-print on their AlphaCode project (linked to above).

Any comments, please don’t hesitate to let me know.


AI navigation goes with the flow

Read an article the other day (Engineers Teach AI to Navigate Ocean with Minimal Energy) about a simulated robot that was trained to navigate 2D turbulent water flow to travel between locations. They used a combination of reinforcement learning and a DNN-derived policy. The article was reporting on a Nature Communications open access paper (Learning efficient navigation in vortical flow fields).

The team was attempting to create an autonomous probe that could navigate the ocean and other large bodies of water to gather information. I believe ultimately the intent was to provide the navigational smarts for a submersible that could navigate terrestrial and non-terrestrial oceans.

One of the biggest challenges for probes like this is navigating turbulent flow without needing a lot of propulsive power or a lot of computational power. They said that any probe that could propel itself faster than the current could easily travel wherever it wanted, but the real problem is getting somewhere with a lower-powered submersible. As a result, they set their probe to swim at a constant speed of 80% of the overall simulated water flow.

Even that would be relatively feasible if you had unlimited computational power to train and inference with, but trying to do this on something that could fit in a small submersible was a significant challenge. NLP models today have millions of parameters and take hours to train with multiple GPU/CPU cores and lots of memory; inferencing with these NLP models also takes a lot of processing power.

The researchers targeted something significantly smaller computationally and wished to train and perform real-time inferencing on the same hardware. They chose a "Teensy 4.0 micro-controller" board for their computational engine, which costs under $20, has ~2MB of flash memory and fits in a space smaller than 1.5″x1.0″ (38.1mm x 25.4mm).

The simulation setup

The team started their probe's turbulent-flow training with a cylinder in a constant flow that generated downstream vortices rotating in opposite directions. These vortices would travel from left to right in the simulated flow field. In order for the navigation logic to traverse this vortical flow, they randomly selected start and end locations on different sides.

The AI model they trained and used for inferencing was a combination of reinforcement learning (with an interesting multi-factor reward signal) and a policy using a trained deep neural network. They called this approach Deep RL.

For reinforcement learning, they used a reward signal that was a function of three variables: the time it took, the difference in distance to target, and a success bonus if the probe reached the target. The time variable was a penalty and was the duration of the swim activity. Distance to target was how much the Euclidean distance between the current probe location and the target location had changed over time. The bonus was only applied when the probe was in close proximity to the target location. The researchers indicated the reward signal could be used to optimize for other values, such as energy to complete the trip, surface area traversed, wear and tear on propellers, etc.
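A sketch of that three-part reward (the weights and the success radius are placeholders of mine, not the paper's values):

```python
import math

def navigation_reward(pos, prev_pos, target, dt, success_radius=0.1,
                      w_time=1.0, w_dist=1.0, bonus=10.0):
    """Time penalty, plus the change in Euclidean distance to the target,
    plus a bonus when the probe gets close to the target."""
    dist = math.dist(pos, target)
    prev_dist = math.dist(prev_pos, target)
    reward = -w_time * dt + w_dist * (prev_dist - dist)
    if dist < success_radius:
        reward += bonus
    return reward
```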

For the reinforcement learning state information, they supplied the probe-to-target relative location [Difference(Probe x,y, Target x,y)] and whatever sensor data was being tested (e.g., for the velocity-sensor-equipped probe, the local velocity of the water at the probe's location).

They trained the DNN policy using the state information (probe start and end location, local velocity/vorticity sensor data) to predict the swim angle used to navigate to the target. The DNN policy used 2 internal layers with 64 nodes each.
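A minimal stand-in for such a policy network, assuming a 4-dimensional state (relative target position plus local velocity) and a single swim-angle output; the input/output details are my assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class SwimPolicy(nn.Module):
    """Two hidden layers of 64 nodes mapping state to a swim angle."""
    def __init__(self, state_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),   # scaled below to an angle in [-pi, pi]
        )

    def forward(self, state):
        return self.net(state) * torch.pi

policy = SwimPolicy()
state = torch.tensor([[0.5, -0.2, 0.1, 0.0]])   # [dx, dy, u_local, v_local]
angle = policy(state)                            # predicted swim heading
```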

They benchmarked the Deep RL solution with local velocity sensing against a number of different approaches: one naive approach that always swam in the direction of the target, one flow-blind approach that had no sensors but used feedback from its location changes to train with, one vorticity-sensor approach which sensed the vorticity of the local water flow, and one complete-knowledge approach that had information on the actual flow at every location in the 2D simulation.

It turned out that, of the first four (naive, flow-blind, vorticity sensor and velocity sensor), the velocity-sensor-configured probe had the highest success rate ("near 100%").

That simulated probe was then measured against the complete flow knowledge version. The complete knowledge version had faster trip speeds, but only 18-39% faster (on the examples shown in the paper). However, the knowledge required to implement this algorithm would not be feasible in a real ocean probe.

More to be done

They tried the probe's Deep RL navigation algorithm on a different simulated flow configuration, a double gyre flow field (sort of like 2 circular flows side by side but rotating in opposite directions).

The previously trained (on cylinder vortical flow) Deep RL navigation algorithm only had a ~4% success rate with the double gyre flow. However, after training the Deep RL navigation algorithm on the double gyre flow, it was able to achieve an 87% success rate.

So with sufficient re-training it appears that the simulated probe’s navigation Deep RL could handle different types of 2D water flow.

The next question is how well their Deep RL can handle real 3D water flows, such as tidal flows, up-down swells, long-term currents, surface wind-wave effects, etc. It's probable that any navigation for real-world flows would need a multitude of Deep RL-trained algorithms to handle each and every flow encountered in real oceans.

However, the fact that training and inferencing could be done on the same small hardware indicates that the Deep RL could possibly be deployed in any flow: let it train on the local flow conditions until success is reached, then let it loose until it starts failing again. Training each time would take a lot of propulsive power but may be suitable for some probes.

The researchers have 3D printed a submersible with a Teensy microcontroller and an Arduino controller board, with propellers surrounding it so it can swim in any 3D direction. They have also constructed a water tank for use in real-life testing of their Deep RL navigation algorithms.


DeepMind takes on poker & Scotland Yard

Read an article the other day (DeepMind makes bet on AI system that can play poker, chess, Go, and more) about a new DeepMind game playing program that used a new approach to taking on perfect and imperfect information games with the same algorithms.

As you may recall, DeepMind's prior game-playing programs, AlphaZero and MuZero, played the perfect information games chess, shogi, & Go and achieved top rankings in all of them. These were all based on reinforcement learning and advanced search.

Perfect information games have no hidden information; that is, all the information needed to play a game is visible to all players (see the wikipedia Perfect information article). Imperfect information games have private or hidden information, visible only to one or a select set of players. In card playing, any card that's not shown to all players would represent hidden information. The other difference in imperfect games is that players attempt to keep their private information hidden as long as possible.

The latest DeepMind paper (see: Player of Games Arxiv paper) discusses a new approach to automated game playing that works for both perfect and imperfect information games. DeepMind’s latest game playing program is called Player of Games (PoG).

As many may know, Texas hold'em is a form of poker where everyone is dealt two cards down and five cards are dealt up that everyone shares (see: Texas hold'em and Betting in poker articles on wikipedia). Betting happens after the two down cards are dealt, after the next 3 up cards (called the "flop") are dealt, then after each of the remaining 2 up cards is dealt. Players select any of the (2 down and 5 up) cards to create the best 5-card poker hand. Betting is based on a blind (sort of a minimum bet). PoG plays as a single player, performing all the betting as well as card playing for Texas hold'em. No-limit betting means there's no limit (maximum) to the amount of a bet in the game.

Scotland Yard is a board game where detectives chase down a criminal (Mr. X) on the run across the city of London (see the wikipedia Scotland Yard (board game) article). Detectives each get 23 transportation tickets for taxis (11), buses (8), and underground trains (4). The game takes place on a board layout of London and starts with each detective and the criminal selecting a card with their hidden position on the board. The criminal gets (not quite, but almost) an unlimited number of transportation tickets plus 5 (in the USA) universal tickets (which can be used to take ferries as well as any other form of transport). Every time the criminal moves (except when using universal tickets), he reveals his form of transportation. And 5 times during the game the criminal also reveals his current location. The detective that finds the criminal wins.

I assume all my readers know how to play chess and Go (or at least understand them).

While MuZero and AlphaZero used reinforcement learning for training and sophisticated search for in game play, PoG needed to do something different due to the imperfect (or hidden) information present in the hold’em and Scotland Yard games.

How PoG is different

In imperfect information games, it's important to hide private information. In poker, when I got a great hand, I raised my betting levels extensively. But this often caused my opponents to withdraw from betting unless they had a great hand as well. I sometimes think that if I were to bet more consistently, and only bet big at the last betting round when I have a good hand, I might win more $. No doubt that's why I don't play poker anymore.

Like AlphaZero and MuZero, PoG also uses reinforcement learning through self game play but adds something they call Counterfactual Regret (CFR) Minimization to their game trees.

In addition to normally selecting and computing a value (reward) for the optimal move, as in reinforcement learning, PoG uses CFR minimization to compute values (rewards) for all moves not taken at every stage in a game, for each player. As such, PoG computes possible rewards for the optimal move at a stage (step, move) in a game plus the values for all the regret (counterfactual or other) moves for all players. CFR minimization attempts to minimize the regret move values and maximize the optimal move values at each move, for each player in a game.

CFR minimization is used during training for a game in self-play as well as during actual game play to generate sub-trees from wherever the game happens to be. PoG uses a depth limited CFR minimization to generate game sub-trees during game play which helps to reduce the time it takes to determine the best move for all players. Read the ArXiv paper to learn more.
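The core of CFR is regret matching; below is a minimal sketch of that update at a single decision point (PoG combines this with learned value functions and depth-limited search, which this toy code does not attempt):

```python
import numpy as np

def regret_matching_policy(cumulative_regret):
    """Current policy from accumulated counterfactual regrets: play actions in
    proportion to their positive regret (uniform if none are positive)."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(positive), 1.0 / len(positive))

def cfr_update(cumulative_regret, action_values):
    """One regret update at a decision point: compare each action's
    counterfactual value to the value of the current (mixed) policy."""
    policy = regret_matching_policy(cumulative_regret)
    node_value = float(np.dot(policy, action_values))
    cumulative_regret += action_values - node_value
    return cumulative_regret, node_value

regrets = np.zeros(3)                                    # 3 possible actions
regrets, value = cfr_update(regrets, np.array([1.0, 0.2, -0.5]))
```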

The challenge with this approach is that it will never be as good as pure reinforcement learning plus advanced search for perfect information games, such as chess and Go. For example, the paper shows Exploitability ratings for various PoG training levels for Leduc poker and Scotland Yard. Exploitability is one way to measure how well a player is playing; lower is better.

Perfect play (in an imperfect information game) would have an Exploitability of 0. The charts show that the more training is done, the better PoG plays (Leduc) poker and Scotland Yard. (Leduc poker is a simplified poker game with 6 cards and limited betting.)

On the other hand, for perfect information games the results were ok, but not stellar. Stockfish is the current non-reinforcement-learning chess playing champion. GnuGo and Pachi are non-reinforcement-learning Go playing programs. In the paper's tables, they use a relative ranking based on a 0 baseline for chess (Stockfish with 1 thread and 100msec think time) and Go (GnuGo). Higher is better.

~~~~

So, yes, PoG can do well in imperfect information games with decent training, and ok (but much, much better than I and probably the vast majority of humans) in perfect information games.

Why concern ourselves with imperfect information games? The world is chock full of them. They seem to occur everywhere: military strategy, sports play, finance, etc. In fact, perfect information games are the exception in real-world situations. Thus, any advance in playing multiple imperfect information games better is yet another small step towards AGI.


The problem with smarter robots

Read an article the other week about how Deepmind (at Google) is approaching the training of robotics using simulation, reinforcement learning, elastic weights, knowledge distillation and progressive learning.

It seems relatively easy to train a robot to handle some task like grabbing or walking. But doing so can take an awfully long time. If you want to train a robot to grab something and put it someplace, you can have it start out making some random movements of its arm, wrist and fingers (if it has such things) and then use reinforcement learning to help it improve its movements over time.

But if each grab attempt takes 10 seconds, using reinforcement learning may take 10,000 attempts before it starts to make any significant progress and perhaps another 20,000-50,000 more to get expert at it. Let's see: 60K × 10 seconds is 10,000 minutes, or ~170 hours. And that's just one object pick-and-place. But then maybe you would like to grab different parts and maybe place them in different locations. All these combinations start adding up.

And of course doing 1000s of movements will wear out gears, motors, mechanisms, etc. If only this could all be done in electronic simulation. Then, assuming the simulations are accurate enough, the whole thing could be done in a matter of hours without wearing anything out. Enter robot simulators such as NVIDIA Isaac Sim and OpenAI RoboSchool/PyBullet.

But the problems with simulation are …

Simulations are getting more accurate, but at some point their accuracy defeats their purpose because the real world is always noisy, windy and not as deterministic as any simulation. One researcher said you could conceivably have a two-armed robot trained in simulation to throw all of a cell phone's components up into the air such that they all land in their proper places and proper orientations. But in the real world this could never actually happen, or if it did, it could only happen once.

Hurricane Ike – 2008/09/12 – 21:26 UTC by CoreBurn (cc) (from Flickr)

Weather researchers have been dealing with this problem in spades for a long time. There appears to be a fundamental limit to how far in advance we can predict weather and it’s due to the accuracy with which sensors operate and the complexity of feedback loops between the atmosphere, oceans, landforms, etc. So at a fundamental level, simulations can never be completely accurate. But they can be better.

Today’s weather simulations we see on TV/radio use models that average a number of distinct simulations, where sensor information has been slightly and randomly modified. Something similar could be done for robotic simulation environments, to make them more realistic.

But there are other problems with training robots to do lots of tasks.

Forget me not…

AI deep learning and reinforcement learning algorithms are great when charged with learning a single task, but having them learn multiple tasks is much harder to do, because each task requires its own deep neural network (DNN), and if you train a DNN on one task and then try to train it on another task, it forgets all the learnings from the original task. Researchers call this catastrophic forgetting.

One way researchers have dealt with this problem is to effectively freeze certain DNN nodes, preventing their weights from changing during subsequent training rounds, and leave others flexible or changeable. One can see this when one trains an image recognition DNN to classify different objects by importing a well-trained object classifier, freezing all of its layers except the top one or two, and then training these layers to classify new objects.
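That freeze-and-retrain pattern looks like the following sketch (a generic example using torchvision's ResNet-18, my own choice of model, not anything specific to the robotics work discussed):

```python
import torch.nn as nn
from torchvision import models

# Reuse a pre-trained image classifier, freeze everything, and retrain only
# the final layer for a new set of object classes.
backbone = models.resnet18(weights="IMAGENET1K_V1")
for param in backbone.parameters():
    param.requires_grad = False                           # freeze previously learned weights
backbone.fc = nn.Linear(backbone.fc.in_features, 10)      # new trainable head (10 new classes)
# Only backbone.fc's parameters will be updated during subsequent training.
```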

This works well, but you have effectively changed the DNN to forget the original object classification training and replaced it with a new one. One refinement is to have multiple passes of training; after each one, certain nodes and connections (of importance to that particular task) are selectively frozen. This works well for a limited number of different tasks, but over time all nodes become frozen, which means that no more learning can take place. Researchers call this approach to the catastrophic forgetting problem elastic weights.

One way to get around the all-nodes-frozen issue in elastic weights is to have multiple NNs: one which is trained on a specific task and whose weights are then frozen, and a DNN that exists alongside it with its own initialized set of weights but which uses the original DNN's activations as part of the new DNN's inputs. This effectively incorporates all the previously learned knowledge into the new, combined DNN. These are called Progressive Neural Networks.

In this fashion one progressive DNN can be sequentially trained on any number of tasks, each of which ends up providing input to all subsequent task training activity. Such a progressive network never forgets and can use previously learned knowledge on new tasks.
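A toy reading of a progressive network column with one lateral connection from a frozen, previously trained column (a simplified sketch of the idea, not DeepMind's implementation):

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """A small column that, in addition to its own input, receives the hidden
    activations of a frozen, previously trained column."""
    def __init__(self, in_dim, hidden, out_dim, prev_column=None):
        super().__init__()
        self.prev = prev_column
        if self.prev is not None:
            for p in self.prev.parameters():
                p.requires_grad = False     # earlier task's knowledge stays frozen
        lateral = hidden if prev_column is not None else 0
        self.h = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(hidden + lateral, out_dim)

    def forward(self, x):
        h = torch.relu(self.h(x))
        if self.prev is not None:
            with torch.no_grad():
                h_prev = torch.relu(self.prev.h(x))   # lateral input from the old column
            h = torch.cat([h, h_prev], dim=-1)
        return self.out(h)

task1 = ProgressiveColumn(8, 32, 4)                     # trained on task 1, then frozen
task2 = ProgressiveColumn(8, 32, 4, prev_column=task1)  # task 2 reuses task 1's features
```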

The problem with progressive DNNs is a proliferation of DNN columns, one for each trained task. However, there are a couple of approaches to shrinking an ensemble of DNNs, like the one progressive training creates, into one that is simpler and just as effective. One way is to perturb weights in DNN nodes and see how model prediction accuracy is impacted across all its tasks. If accuracy isn't impacted that much, then that node and all its connections can be deleted from the model with minimal impact on model accuracy.

Another approach is to use one DNN to train another, sort of like teacher and student. This is called Knowledge Distillation: one DNN is a large network (the teacher) and a smaller (student) network is trained to mimic the teacher DNN to achieve similar accuracy. This is done by training the smaller student network to match the predictions/classifications of the larger one.
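The standard distillation loss looks roughly like this (a generic sketch, not the exact recipe used in the work discussed): the student matches the teacher's softened predictions while also fitting the true labels.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of a soft-target term (student mimics the teacher's temperature-
    softened output distribution) and a hard-target cross-entropy term."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```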

Google researchers have shown that knowledge distillation works best when the gap in the sizes of the two networks (teacher and student) isn't that large. They have addressed this problem by introducing an intermediate network (called the teacher's assistant, or TA). They train the TA first, then use the TA to train the student.

In the paper's graphic, when using a teacher of size 110 and a student of size 8, the resulting accuracy suffers, but if one uses an intermediate DNN of size 20, the resultant accuracy of the student is much closer to the teacher's.

~~~~

So with realistic simulation we can train a robot to do any specific task, all using only compute resources. And using progressive DNN training, a robot could conceivably be trained to do any number of tasks. And with appropriate knowledge distillation one can reduce the DNN from progressive training into something much smaller (<10%) than the original DNN.

Want a personal robot that can clean up around your place, do the wash, cook your food and do anything else needed? You know what to do.