Deepmind does sort

Saw an article today on TNW, DeepMind's new AI taps games to enhance fundamental algorithms, which discussed a recent Nature paper, Faster sorting algorithms discovered using deep reinforcement learning, and an accompanying website describing AlphaDev.

Google DeepMind's AlphaDev is a derivative of AlphaZero (itself a follow-on to AlphaGo, the conqueror of Go and other strategy games, and a sibling of MuZero). AlphaDev uses Deep Reinforcement Learning (DRL) to come up with new computer science algorithms; in its first incarnation, a way to sort 2, 3, 4 or 5 integers using X86 instructions.

Sorting has been well explored over the years in computer science (CS, e.g. see Donald E. Knuth’s Volume 3 in The Art of Computer Programming, Sorting and Searching), so when a new more efficient/faster sort algorithm comes out it’s a big deal. Google used to ask job applicants how they would code sort algorithms for specific problems. Successful candidates would intrinsically know all the basic CS sorting algorithms and which one would work best in different circumstances.

Deepmind’s approach to sort

Reading the TNW news article, I couldn't conceive of the action space involved in the reinforcement learning, let alone what the state space would look like. However, when I read the Nature article, I found the DeepMind researchers did a decent job of explaining their DRL approach to developing new basic CS algorithms like sorting.

AlphaDev uses a transformer-like framework and a very limited set of x86 (sort of, encapsulated) instructions with memory/register files, and DeepMind limited it to sorting 2, 3, 4, or 5 integers. Such functionality is at the heart of any sort algorithm and, as such, is used a gazillion times over in any sorting task involving a long string of items. I think AlphaDev used a form of on-policy RL but can't be sure.

Looking at the X86 basic instruction cheat sheet, there are over 30 basic forms of X86 instructions, which are then multiplied by the type of data being manipulated (registers, memory, constants, etc., and length of operands).

AlphaDev only used 4 (ok, 9 if you include the conditionals for conditional move and conditional jump) X86 instructions. The instructions were mov<A,B>, cmovX<A,B>, cmp<A,B> and jX<A,B> (where X identifies the condition under which a conditional move [cmovX] or jump [jX] would take place). And they only used (full, 64 bit) integers in registers and memory locations.

AlphaDev actions

The types of actions that AlphaDev could take included the following (sketched in code after the list):

  • Add transformation – which added an instruction to the end of the current program
  • Swap transformation – which swapped two instructions in the current program
  • Opcode transformation – which changed the opcode (e.g., changed a mov to a cmp) at a step in the current program
  • Operand transformation – which changed the operand(s) for an instruction in the current program
  • Instruction transformation – which changed the opcode and operand(s) for some instruction in the current program.
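To make that action space concrete, here's a minimal sketch (my own illustration, not DeepMind's code) of these five transformations applied to a program represented as a list of (opcode, operands) tuples; the operand naming (R for registers, M for memory locations) is an assumption.

```python
def add_instruction(program, instr):
    """Add transformation: append an instruction to the end of the program."""
    return program + [instr]

def swap_instructions(program, i, j):
    """Swap transformation: exchange two instructions in the program."""
    p = list(program)
    p[i], p[j] = p[j], p[i]
    return p

def change_opcode(program, i, opcode):
    """Opcode transformation: change only the opcode of step i (e.g. mov -> cmp)."""
    _, operands = program[i]
    return program[:i] + [(opcode, operands)] + program[i + 1:]

def change_operands(program, i, operands):
    """Operand transformation: change only the operand(s) of step i."""
    opcode, _ = program[i]
    return program[:i] + [(opcode, operands)] + program[i + 1:]

def change_instruction(program, i, instr):
    """Instruction transformation: replace both opcode and operand(s) of step i."""
    return program[:i] + [instr] + program[i + 1:]

# A program is just a list of (opcode, operands) tuples:
prog = [("mov", ("M0", "R0")), ("mov", ("M1", "R1")), ("cmp", ("R0", "R1"))]
prog = swap_instructions(prog, 0, 1)   # one possible RL action
```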

They list in their paper a correctness cost function which, at each transformation, provides a value signal (I think) for the RL policy. They experimented with 3 different functions: 1) the % of correctly placed items; 2) square_root(% correctly placed); and 3) square_root(number of items – number correctly placed). They discovered that the last worked best.
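My reading of those three correctness signals, sketched as simple functions of how many items end up correctly placed (the exact scaling used in the paper may differ):

```python
import math

def pct_correct(placed_correctly, total):
    # 1) the percentage of correctly placed items
    return placed_correctly / total

def sqrt_pct_correct(placed_correctly, total):
    # 2) square root of the percentage of correctly placed items
    return math.sqrt(placed_correctly / total)

def sqrt_misplaced(placed_correctly, total):
    # 3) square root of (number of items - number correctly placed);
    #    used as a cost, this is the signal they found worked best
    return math.sqrt(total - placed_correctly)
```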

They also placed some constraints on the code generated (called action pruning rules; see the sketch after the list):

  • Memory locations are always read in incremental order
  • Registers are allocated in incremental order
  • Program cannot compare or conditionally move to memory location
  • Program can only read and write to each memory location once (it seems this would tell the RL algorithm when to end the program)
  • Program can not perform two consecutive compare instructions
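A rough sketch of how a few of these pruning rules could be checked against a candidate program (my own interpretation of the rules, using the same (opcode, operands) tuple representation as the earlier sketch):

```python
def violates_pruning_rules(program):
    """Check a few of the pruning rules above; `program` is a list of
    (opcode, operands) tuples such as ("cmp", ("R0", "R1"))."""
    last_mem_read = -1
    prev_opcode = None
    for opcode, operands in program:
        src, dst = operands[0], operands[-1]
        # memory locations must be read in incremental order
        if src.startswith("M"):
            idx = int(src[1:])
            if idx < last_mem_read:
                return True
            last_mem_read = idx
        # programs cannot compare or conditionally move to a memory location
        if opcode.startswith(("cmp", "cmov")) and dst.startswith("M"):
            return True
        # no two consecutive compare instructions
        if opcode == "cmp" and prev_opcode == "cmp":
            return True
        prev_opcode = opcode
    return False
```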

AlphaDev states

How they determined the state of the program at each transformation is also worth describing. They used one-hot encodings (essentially, a bit in a bitmap is assigned to every instruction-operand pair) for the opcode-operand steps in the current program and appended each encoded step into a single program string. Ditto for the state of the memory and registers (at each instruction, presumably?). Both the instruction list and memory-register embeddings were then fed into a state representation encoder.

This state "representation network" (DNN) generated a "latent representation of the State(t)" (maybe it classified the state into one of N classes). For each latent state (classification), there is another "prediction network" (DNN) that predicts the expected return value (presumably trained on the correctness cost function above) for each state-action pair. And between the state and expected return values, AlphaDev created an (RL) policy to select the next action to perform.
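A toy sketch of that pipeline, with a one-hot program encoding feeding a representation network and a value-prediction head (the vocabulary, layer sizes and action count are all placeholders of mine, not the paper's):

```python
import torch
import torch.nn as nn

VOCAB = ["mov M0,R0", "mov M1,R1", "cmp R0,R1", "cmovl R1,R0", "jle L1"]  # toy opcode-operand vocabulary
MAX_LEN, NUM_ACTIONS = 16, 32

def encode_program(program):
    """One-hot encode each opcode-operand step and append into one program tensor."""
    enc = torch.zeros(MAX_LEN, len(VOCAB))
    for i, step in enumerate(program[:MAX_LEN]):
        enc[i, VOCAB.index(step)] = 1.0
    return enc.flatten()

# representation network: encoded program (plus memory/register state) -> latent state
repr_net = nn.Sequential(nn.Linear(MAX_LEN * len(VOCAB), 128), nn.ReLU())
# prediction network: latent state -> expected return for each possible action
value_head = nn.Linear(128, NUM_ACTIONS)

latent = repr_net(encode_program(["mov M0,R0", "cmp R0,R1"]))
action_values = value_head(latent)          # scores the RL policy can select from
best_action = int(action_values.argmax())   # greedy choice, for illustration
```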

Presumably they started with current basic CS sort algorithms and 2-5 random integers in memory, fed this in (properly encoded and embedded) as a starting point, and then the AlphaDev algorithm went to work to improve it.

Do this enough times, with an intelligent balance between exploration (more random action selection at first) and policy following (more use of the policy later), and you too can generate new sorting algorithms.

DeepMind also spent time creating a stochastic approach to sorting that they used to compare against their AlphaDev DRL approach to see which did better. In the end they found the AlphaDev DRL approach worked faster and better than the stochastic solutions they tried.

DeepMind, having conquered sorting, did the same for hashing.

Why I think DeepMind’s AlphaDev is better

AlphaDev's approach could just as easily be applied to any of the algorithms in Donald E. Knuth's 4-volume series, The Art of Computer Programming.

I believe DeepMind’s approach is much more valuable to programmers (and humanity) than CoPilot, ChatGPT code, AlphaCode (DeepMind’s other code generator) or any other code generation transformers.

IMHO AlphaDev goes to the essence of computer science as it's been practiced over the last 70 years: here's what we know, now let's try to discover a better way to do the work we all have to do. Once we have discovered a new and better way, report and document it as widely as possible so that any programmer can stand on our shoulders and use our work to do what they need to get done.

If I'm going to apply AI to coding, having it generate better basic CS algorithms is much more fruitful for the programming industry (and, I may add, humanity as a whole) than having it generate yet another iOS app or web site from scratch.

Comments?


The problem with Robotic AI is … data

The advances made in textual and visual (and now aural) AI have been mind-blowing in recent years. But most of this has been brought about via the massive availability of textual, visual and audio data AND the advancement in hardware acceleration.

Robotics can readily take advantage of hardware improvements, but finding the robotic data needed to train robotic AI is a serious challenge.

Yes, simulation environments can help, but fidelity (how close simulation is to reality) is always a concern.

Gathering the amount of data needed to train a simple robotic manipulator to grab a screw from a bin is a huge problem. In the past the only way to do this was to create your robot, have it start making random screw-grab motions, and monitor what happens. After about 1,000 or 10,000 of these grabs, the robot would stop working because gears wear down, grippers come loose, motors become less responsive, images get obscured, etc. For robots it's not as simple as scraping the web for images or downloading all the (English) text in Wikipedia and masking select words to generate pseudo-supervised learning.

There's just no way to do that in robotics without deploying 100s or 1000s or 10,000s of real physical robots (or cars), all instrumented with everything needed to capture data for AI learning in real time, and letting these devices go out into the world with humans guiding them.

While this might work for a properly instrumented fleet of cars, which are already useful in their own right even without automation and which humans are more than happy to guide out on the road, it doesn't work for other robots, whose usefulness can only be realized after they are AI trained, not before.

Fast-RLAP (RC) car driving learning machine

So I was very interested to see a tweet on FastRLAP (paper: FastRLAP: A System for Learning High-Speed Driving via Deep RL and Autonomous Practicing), which used deep reinforcement learning, plus a human-guided lap, plus autonomous practicing to teach an AI model how to drive a small RC model car with a vision system, IMUs and GPS around a house, a racetrack and an office environment.

Ok, I know it still involves taking an instrumented robot and having it actually move around the real world. But Fast-RLAP accelerates AI learning significantly. Rather than having to take 1,000 or 10,000 random laps around a house, it was able to learn how to drive around the course to an expert level very rapidly.

They used Fast-RLAP to create a policy that enabled the RC car to drive around 3 indoor circuits, two outdoor circuits and one simulated circuit, in most cases achieving expert-level track times, typically in under 40 minutes of practice.

On the indoor course, with a vinyl floor, the car learned how to perform drift turns (not sure I know how to do drift turns). On tight "S" curves, the car learned how to get as close to the proper racing line as possible (something I achieved, rarely, only on motorcycles a long time ago). And all while managing to avoid collisions.

The approach seems to be to have a human drive the model car slowly around the course, flagging or identifying intermediate waypoints or checkpoints on the track. While driving the loop, the car would use the direction to the next waypoint as guidance for where to drive next.

Note: in the paper's figures, the light blue circles are example track waypoints; they differ in size and location around each track.

The approach seems to make use of a pre-trained track-following DNN, but they stripped the driving dynamics (output layers) and just kept the vision (image) encoder portion to provide a means to encode an image and identify route-relevant features (which routes led to collisions, which routes were of interest for getting to the next checkpoint, etc.).

I believe they used this pre-trained DNN to supply a set of actions to the RL policy which would select between them to take RC car actions (wheel motor/brake settings, steering settings, etc.) and generate the next RC car state (location, direction to next waypoint, etc.).

They used an initial human-guided lap, mentioned above, to identify waypoints and possibly to supply data for the first RL policy.

The RL part of the algorithm used off-policy RL (the RC car would upload lap data at waypoints to a server, which would periodically select stored lap states and actions at random and update its RL policy; the updated policy would then be downloaded to the RC car while in motion). (Code: GitHub repo.)
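Structurally, that upload/sample/update loop looks something like the sketch below (my own outline of the idea; `update_fn` stands in for whatever actor-critic update FastRLAP actually uses):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # server-side store of uploaded lap transitions

def on_lap_upload(transitions):
    """Called when the car reaches a waypoint and uploads its recent
    (state, action, reward, next_state) tuples."""
    replay_buffer.extend(transitions)

def periodic_policy_update(policy, update_fn, batch_size=256):
    """Off-policy learning: the server samples stored experience at random
    (not just the latest lap), applies the RL update rule, and the updated
    policy is then pushed back down to the car while it keeps driving."""
    if len(replay_buffer) >= batch_size:
        batch = random.sample(replay_buffer, batch_size)
        policy = update_fn(policy, batch)
    return policy
```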

The reward function used to drive RL was based on minimizing the time to the next waypoint, collision counts, and stuck counts.

I assume collision counts were instances where the car struck some obstacle but could continue on towards the next waypoint. Stuck instances were when the car could no longer move in the direction its RL policy told it to. The system had a finite state machine that allowed it to get out of stuck points by reversing wheel motor(s) and choosing a random direction for steering.
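A hedged sketch of a reward in that spirit (the weights and bonus are placeholders of mine, not values from the paper):

```python
def lap_reward(time_to_waypoint, collided, stuck, reached_waypoint,
               w_time=1.0, w_collision=10.0, w_stuck=10.0, bonus=100.0):
    """Penalize elapsed time, collisions and getting stuck; reward reaching
    the next checkpoint."""
    reward = -w_time * time_to_waypoint
    reward -= w_collision * int(collided)
    reward -= w_stuck * int(stuck)
    if reached_waypoint:
        reward += bonus
    return reward
```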

You can see the effects of the pre-trained vision system in some of the screen shots of what the car was trying to do.

In any case, this is the sort of thinking that needs to go on in robotics in order to create more AI-capable robots. That is, not unlike transformer learning, we need to figure out a way to take what's already available in our world and use it to help generate the real-world data needed to train robotic DNN/RL algorithms to do what needs to be done.

Comments?


Deepmind does chat

Read an article this week on Deepmind's latest research into developing a chat agent (Improving alignment of dialogue agents via targeted human judgements). Lots of interesting approaches have been applied to chat, but even today most chat models are rife with problems that include being bigoted, profane, incorrect, etc.

Reinforcement learning vs. deep neural networks in Sparrow Chat

Deepmind specializes in the use of Reinforcement Learning (RL) applied to mastering Atari, chess and Go games, but they have also been known to use DNNs (deep neural networks) for AlphaFold and other models. Indeed, the Atari and other game-playing work that Deepmind has released has been a hybrid that included DNNs as well as RL models.

Deepmind's version of chat is currently called Sparrow, and it uses models trained with the help of RL with human feedback (RLHF). RL is used to create policy models that select actions to be taken in a specific state.

In Sparrow's case, the state is given by the most recent chat input plus the context (prior chat inputs and replies) of the dialogue up to this time, and the actions (our guess) are the set of possible replies to that input.

Sparrow is able to generate replies that are 82% mostly true or true and 69% trustworthy or very trustworthy, as rated by the authors of the model. Deepmind's DPC (Dialogue Prompted Chinchilla, Deepmind's current competitor to the GPT-3 NLP transformer) model only managed 63% and 54%, respectively, for the same metrics.

It should be noted that human feedback was only used to train the two Preference RMs and the one Rule RM. In combination, these RMs provide the reward signal to train the Sparrow RL policy model which drives its chat responses.

Sparrow's 5 models are built on top of DPC. The 5 models use a portion of DPC which is frozen (layers not being trained) and a portion which is specifically trained for each of the 5 models (learning-enabled layers). The end (output) layers are on top; input layers come after the embedding layer(s). Note, the value function is not a model and is just a calculation based on the RMs used to generate the reward signal for Sparrow's policy model training.

Rules for Sparrow chats

Notably, Deepmind’s Sparrow model has a separate model specifically trained to determine if a particular chat response is breaking a rule. Deepmind identified 23 rules which their chat model is trained not to break.

Some of these rules include don’t provide financial advice, don’t provide medical advice, don’t pretend it is a human, etc.

In the paper's chart, RL@8 is the fully trained (if it can ever be considered fully trained) Sparrow chat model. One can see Sparrow rated against DPC, both with and without (Google) search. For most rules, Sparrow is considerably better than DPC alone.

Another interesting thing Deepmind did: in training the Rule RM they used adversarial attacks (red teaming) to see if they could cause Sparrow to violate specific rules.

Preference ranking

Deepmind also created (two) Preference RMs (reward models). Sparrow generates a series of (2 or 8) responses for every chat query, and the Preference RMs (and Rule RM) are used to select which one is actually sent back to the user. Human feedback was used to train the two Preference RMs.

Two Preference RMs were found to perform better than a single Preference RM. The two Preference RMs were trained as follows:

  • One was trained on all Sparrow replies (with and without [Google] search results)
  • One was trained on Sparrow replies without search results.

It turns out that some chat questions are fact-based, and for these Sparrow uses search results to generate evidence for its chat replies. Sparrow automatically generates search requests and scrapes replies using the 500 characters surrounding the snippet returned from the search.

Sparrow uses a re-ranking approach to selecting a response to a chat query. In this case, Sparrow generates a list of responses, 2 (called RL@2) or 8 (RL@8), and then, using the two Preference RMs and the single Rule RM, ranks them to see which is best and uses that one to reply to the chat user.

Sparrow actually generates two replies for every search query (Google Search API call), probably selecting the two top search responses (we guess). So in the RL@8 version of Sparrow, these 8 replies are submitted to the two Preference RMs and the Rule RM, ranked according to which is best, and then the best one is used to reply to the query.
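A minimal sketch of that re-ranking step (the function names and the way the scores are combined here are my assumptions, not DeepMind's code):

```python
def rerank(candidates, preference_rm, preference_rm_no_search, rule_rm):
    """Score each candidate reply with the appropriate Preference RM and the
    Rule RM, then return the best one. Each candidate is a dict like
    {"text": ..., "used_search": bool}."""
    def score(reply):
        pref = (preference_rm if reply["used_search"] else preference_rm_no_search)(reply["text"])
        rule_ok = rule_rm(reply["text"])     # probability that no rule is broken
        return pref * rule_ok
    return max(candidates, key=score)
```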

In the paper's chart, higher indicates the ranking preference of the various models vs. human preferences, and further to the right indicates fewer rule-breaking responses. We assume this is with RL@8 Sparrow models. One can see that taking rule breaking into consideration (not violating rules) reduces the preference rankings of Sparrow's replies. But we would prefer to have no rule breaking, so the Sparrow that has both Preference RMs and the Rule RM (trained with adversarial training) shows the least amount of rule breaking (~7%) with an almost 70% ranking vs. human preferences. The error bars on the points in the chart show a 68% interval around the model responses.

Sparrow in action

It’s somewhat intriguing that Deepmind (with all of Google’s resources) tried to optimize Sparrow for both computation and memory considerations. Almost like they were planning on releasing it on an IoT or phone device.

There’s plenty more to say about what Deepmind has done with Sparrow. The report cited above goes into some detail discussing just where the human input is done, how they tried to control for various considerations when using human input, and what some of the pitfalls were.

I’d certainly like to see this be deployed in the open and available to use as an alternative to Google Search.

You can see more examples of Sparrow chat sessions in Deepmind's Sparrow chat repository, and they include the authors' rankings for truth, supportiveness and other metrics.

~~~~~

Comments?


Safe AI

I've been writing about AGI (see part-0 [ish], part-1 [ish], part-2 [ish], part-3 [ish], part-4 and part-5) and the dangers that come with it (part-0 in the above list) for a number of years now. After my last post on the subject I expected to be writing a post discussing the book Human Compatible: AI and the Problem of Control, which is a great book on the subject. But since then I ran across another paper that is perhaps a better brief introduction to the topic and to some of the current thought and research into developing safe AI.

The paper I found is Concrete Problems in AI Safety, written by a number of researchers at Google, Stanford, Berkeley, and OpenAI. It essentially lays out the AI safety problem along 5 dimensions, and these are:

Avoiding negative side effects – these can be minor or major and are probably the one thing that scares humans the most, e.g. some toothpick-generating AI that strips the world bare to maximize toothpick making.

Avoiding reward hacking – this is more subtle, but essentially it's having your AI fool you into thinking it's doing what you want while actually doing something else. This could entail anything from actually changing the reward logic itself to convincing/manipulating the human overseer into seeing things its way. Also a pretty bad thing from humanity's perspective.

Scalable oversight – this is the problem where human overseers aren't able to keep up and witness/validate what some AI is doing, 7×24, across the world, at the speed of electronics. So how can AI be monitored properly so that it doesn't go and do something it's not supposed to (see the prior two for ideas on how bad this could be)?

Safe exploration – this is the idea that reinforcement learning, in order to work properly, has to occasionally explore the solution space, e.g. a Go board with moves selected at random, to see if they are better than what it currently believes is the best move to make. This isn't much of a problem for game-playing ML/AI, but if we are talking about helicopter-controlling AI, random exploration could destroy the vehicle plus any nearby structures, flora or fauna, including humans of course.

Robustness to distributional shift – this is the perennial problem where AI or DNNs are trained on one dataset but over time the real world changes and the data they're now seeing has shifted (in distribution) to something else. This often leads to DNNs not operating properly over time or having many more errors in deployment than they did during training. This is probably the one problem on this list undergoing more research than any of the others, because it impacts just about every ML/AI solution deployed in the world today. This robustness to distributional shift problem is why many AI DNN systems require periodic retraining.

So now we know what to look for, now what

Each of these deserves probably a whole book or more to understand and try to address. The paper talks about all of these and points to some of the research or current directions trying to address them.

The researchers correctly point out that some of the above problems are more pressing when more complex ML/AI agents have more autonomous control over actions in the real world.

We don’t want our automotive automation driving us over a cliff just to see if it’s a better action than staying in the lane. But Go playing bots or article summarizers might be ok to be wrong occasionally if it could lead to better playing bots/more concise article summaries over time. And although exploration is mostly a problem during training, it’s not to say that such activities might not also occur during deployment to probe for distributional shifts or other issues.

However, as we start to see more complex ML/AI solutions controlling more activities, the issues of AI safety are starting to become more pressing. Autonomous cars are just one pressing example. But recent introductions of sorting robots, agricultural bots, manufacturing bots, nursing bots, guard bots, soldier bots, etc. are all just steps down a (short) path of increasing complexity that can only end in some AGI bots running more parts (or all) of the world.

So safety will become a major factor soon, if it's not already.

Scares me the most

The first two on the list above scare me the most. Avoiding negative or unintentional side effects and reward hacking.

I suppose if we could master scalable oversight we could maybe deal with all of them better as well. But that’s defense. I’m all about offense and tackling the problem up front rather than trying to deal with it after it’s broken.

Negative side effects

Negative side effects is a rather nice way of stating the problem of having your ML destroy the world (or parts of it) that we need to live in.

One approach to dealing with this problem is to define or train another AI/ML agent to measure impacts to the environment and have it somehow penalize the original AI/ML for causing them. This learned-impact approach has some potential to be applied to numerous ML activities if it can be shown to be safe and fairly all-encompassing.

Another approach discussed in the paper is to inhibit or penalize the original ML for any actions which have negative consequences. One way to do this is to come up with an "empowerment measure" for the original AI/ML solution. The idea would be to reduce, minimize or govern the original ML's action set (or potential consequences) via this empowerment measure so as to minimize its ability to create negative side effects.

The paper discusses other approaches to the problem of negative side effects, one of which is having multiple ML (or ML and human) agents work on the problem together and have the ability to influence (kill switch) each other when they discover something's awry. Another approach they mention is to reduce the certainty of the reward signal used to train the ML solution. This would work by having some function reduce the reward if there are random side effects, which would tend to make the ML solution learn to avoid them.

Neither of these latter two seems as feasible as the others, but they are all worthy of research.

Reward hacking

This seems less of a problem to our world than negative side effects until you consider that if an ML agent is able to manipulate its reward code, it’s probably able to manipulate any code intending to limit potential impacts, penalize it for being more empowered or manipulate a human (or other agent) with its hand over the kill switch (or just turn off the kill switch).

So this problem could easily lead to a breakout of any of the other problems present on the list of safety problems above and below. An example of reward hacking is a game-playing bot that detects a situation that leads to a buffer overflow and results in a win signal or higher rewards. Such a bot will no doubt learn how to cause more buffer overflows so it can maximize its reward, rather than learn to play the game better.

But the real problem is that a reward signal used to train a ML solution is just an approximation of what’s intended. Chess programs in the past were trained by masters to use their opening to open up the center of the board and use their middle and end game to achieve strategic advantages. But later chess and go playing bots just learned to checkmate their opponent and let the rest of the game take care of itself.

Moreover, (board) game play is a relatively simple domain in which to come up with proper reward signals (with the possible exception of buffer overflows or other bugs). But for car-driving bots, drone bots, guard bots, etc., reward signals are not nearly as easy to define or implement.

One approach to avoid reward hacking is to make the reward signaling process its own ML/AI agent that is (suitably) stronger than the ML/AI agent learning the task. Most reward generators are relatively simple code; for instance, in a timed game of Monopoly, one that just counts the money each player has at the end of the game could be used to determine the winner. But rather than having a simple piece of code create the reward signal, use ML to learn what the reward should be. Such an agent might be trained to check whether more or less money was being counted than was physically possible in the game, or whether property was illegally obtained during the game, or whether other reward hacks were done, and penalize the ML solution for these actions. This would make the reward signal depend on proper training of that reward ML solution. And the two ML solutions would effectively compete against one another.

Another approach is to "sandbox" the reward code/solution so that it is outside of external and/or ML/AI influence. Possibly combining the prior approach with this one might suffice.

Yet another approach is to examine the ML solution's future states (actions) to determine if any of them impact the reward function itself and penalize it for doing this. This assumes that the future states are representative of what it plans to do and that some code or some person can recognize states that are inappropriate.

Another approach discussed in the paper is to have multiple reward signals. These could use multiple formulas for computing the multi-faceted reward signal and averaging them or using some other mathematical function to combine them into something that might be more accurate than one reward function alone. This way any ML solution reward hacking would need to hack multiple reward functions (or perhaps the function that combines them) in order to succeed.
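As a toy illustration of the idea (not from the paper), several independently implemented reward functions could be combined so that hacking any single one isn't enough to inflate the total:

```python
def combined_reward(state, action, reward_fns, combine=min):
    """Evaluate several independent reward functions and combine them;
    using min is conservative, averaging is another option."""
    return combine(fn(state, action) for fn in reward_fns)
```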

The one IMHO that has the most potential but which seems the hardest to implement is to somehow create “variable indifference” in the ML/AI solution. This means having the ML/AI solution ignore any steps that impact the reward function itself or other steps that lead to reward hacking. The researchers rightfully state that if this were possible then many of the AI safety concerns could be dealt with.

There are many other approaches discussed, and I would suggest reading the paper to learn more. None of the others seems simple or a complete solution to all potential reward hacks.

~~~

The paper goes into the same or greater level of detail on the other three "concrete safety" issues in AI.

In my last post (see the part-5 link above) I thought I was going to write about S. Russell's book Human Compatible and its discussion of AI safety. But then I found the "Concrete Problems in AI Safety" paper (see link above), thought it provided a better summary of AI safety issues, and used it instead. I'll try to circle back to the book at some later date.


Deepmind does code – part 1: the data

1st, let me express my and my fellow coders'/programmers' disappointment that Deepmind would take on coding. There are many other white-collar work domains that need to be conquered before coding.

2nd, let me apologize for the lack of blog posts lately, all I can say is, business is picking up.

Saw an article over the last couple of weeks on Deepmind creating AlphaCode, an artificial intelligence coding application which they used to enter coding contests, achieving an average 1238 rating, better than 54% of code contest participants.

I can't recall where I first saw the news, but Deepmind has a pretty decent blog post on AlphaCode and they have published a pre-print of their research paper on AlphaCode as well. I plan on discussing AlphaCode in detail over a couple of posts. This first installment covers where they got the data to train their models.

AlphaCode is a transformer-based language model (see: Wikipedia: Transformer (machine learning model) article) that translates a code competition problem statement into code, i.e., a program that, when executed, solves the problem statement. In order to train AlphaCode, Deepmind first needed to obtain lots of source code.

It’s all about the (training) data

The first step in Deep Learning model generation is gathering data to train the model. Now where would Google’s Deepmind go to gather coding data – well GitHub, a public repository of all things software, of course.

They used GitHub data to pre-train their model(s), but also scraped code (problem statements & test cases) from published code contests to fine-tune their model.

Deepmind has released their fine-tuning CodeContests training data for AlphaCode on GitHub, so as to support other organizations in creating AI models for coding.

GitHub source to the (pre-training) rescue

There are a couple of problems with using GitHub source code for training:

  • GitHub code is in any source code language the author feels most appropriate to use.
  • GitHub code is not guaranteed to work correctly.
  • GitHub code is not guaranteed to be completed code.
  • GitHub code represents a wide range of coding skill.
  • GitHub code doesn’t always come with a problem statement.

But the use of GitHub in their pre-training data set is intended to give their transformer-based language model some capability to understand (learn) what coding is all about, what a proper syntax would be, what a proper coding sequence would be, etc.

The AlphaCode team took a snapshot of selected GitHub source repos. This meant they only scraped repos containing C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript code. They also dropped from the pre-training data any source files larger than 1MB or with any lines longer than 1000 characters; this was done to avoid using machine-generated code. They also stripped all the whitespace out of the selected source code files and compared them to eliminate duplicated code.
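A sketch of those pre-training filters (the size and line-length thresholds are as I read them from the description above; the extension list is my rough approximation of the languages named, not DeepMind's code):

```python
import os
import re

LANG_EXTS = {".cc", ".cpp", ".cs", ".go", ".java", ".js", ".lua", ".php",
             ".py", ".rb", ".rs", ".scala", ".ts"}

def keep_file(path, text, max_bytes=1_000_000, max_line=1000):
    """Keep only files in the selected languages, below the size limit,
    and with no overly long lines (a proxy for machine-generated code)."""
    if os.path.splitext(path)[1] not in LANG_EXTS:
        return False
    if len(text.encode("utf-8")) > max_bytes:
        return False
    if any(len(line) > max_line for line in text.splitlines()):
        return False
    return True

def dedup_key(text):
    """Strip all whitespace before comparing files to eliminate duplicates."""
    return re.sub(r"\s+", "", text)
```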

Their final pre-training dataset was 715GB of data over 86 million source files.

Although unstated, we would guess that the AlphaCode team used each GitHub repo's README.md file as a surrogate for the problem description. It's unclear what else could have been used, unless they generated it automatically by extracting semantic content or summarizing the README.md files.

Excerpt from Deepmind's competitive code contest source code & problem statements README.md file

The (pre-)training data can be used to train a transformer-based language model. Such models are used today to provide language translation. In AlphaCode's case they wanted to create a code transformer-based model that translates a specification of a coding problem into source code that solves that problem.

For language translation models, the training data consists of text files in different languages that represent the same law or information and, notably, are human-generated translations.

One challenge with using internet-scraped data for training is that it can easily contain actual solutions, verbatim, for the problems the model is trying to solve. In order to avoid copying these solutions entirely, they decided to split their data into a training set, validation set and test set on a time basis. This way the training data used only source code/problem statements from a period of time prior to the validation set; ditto for the training+validation data relative to the test data.
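The idea of the temporal split, sketched (my own illustration, not DeepMind's code):

```python
def temporal_split(examples, train_cutoff, valid_cutoff):
    """Split scraped {"timestamp": ..., "problem": ..., "solution": ...}
    examples by time so that validation only contains problems newer than
    anything in training, and test newer than anything in validation."""
    train = [e for e in examples if e["timestamp"] < train_cutoff]
    valid = [e for e in examples if train_cutoff <= e["timestamp"] < valid_cutoff]
    test  = [e for e in examples if e["timestamp"] >= valid_cutoff]
    return train, valid, test
```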

To show that this approach (using a time point to split the data) worked, they trained a 1B-parameter AlphaCode transformer on two different training-validation datasets: one where the validation data was selected at random (the normal approach), the "random" split, and the other where the validation data only occurred some time after the training data, the "temporal" split. The 1B AlphaCode transformer was able to properly code 0.8% of the problems using a 13K sample of the 86M source files/problem statements on the random split, but only 0% on the temporal split.

So much for pre-training, let’s discuss fine tuning

AlphaCode was going to get nowhere with a 0% solve rate (ok, this was based on a 13K sample and only a 1B-parameter model), but they realized that GitHub code was only going to get them so far (ok, conjecture on my part).

So fine-tuning beyond the (GitHub-derived) pre-training data was needed, and the AlphaCode team turned to code competition source code/problem statement data.

Most code contests publish source code submissions as well as the problem statements and sample test cases. By scraping these, Deepmind was able to obtain a very well annotated dataset they could use to fine-tune their AlphaCode transformer model.

They again used a temporal split for training/validation/test data. But they were also able to add metadata to their data that indicated whether the code solved the problem statement.

Code competitions also publish tests for the problem statement. Having the tests, a human can validate whether their code at least works against them. Code contests also have a set of more sophisticated hidden tests that they use internally to validate code submissions.

This test data will become important later on in the model's operation, which will be discussed in a future post, but suffice it to say that AlphaCode uses the public tests (and mutations of these) to validate AlphaCode-generated source code before submitting it.

This fine-tuning dataset is available in the GitHub repo (linked to above) that Deepmind has created/curated for others to work with.

Another nicety of this fine-tuning data is that it has proper, human-created problem statements to work from rather than README.md surrogates.

In part-2 we plan to describe the transformer-based model that was created for AlphaCode and at some point, discuss how they used testing in their code submissions.

Once again, all my information comes from Deepmind’s pre-print on their AlphaCode project (linked to above).

Any comments, please don’t hesitate to let me know.


AI navigation goes with the flow

Read an article the other day (Engineers Teach AI to Navigate Ocean with Minimal Energy) about a simulated robot that was trained to navigate 2D turbulent water flow to travel between locations. They used a combination of reinforcement learning and a DNN-derived policy. The article was reporting on a Nature Communications open access paper (Learning efficient navigation in vortical flow fields).

The team was attempting to create an autonomous probe that could navigate the ocean and other large bodies of water to gather information. I believe ultimately the intent was to provide the navigational smarts for a submersible that could navigate terrestrial and non-terrestrial oceans.

One of the biggest challenges for probes like this is navigating turbulent flow without needing a lot of propulsive power or a lot of computational power. They said that any probe that could propel itself faster than the current could easily travel wherever it wanted, but the real problem is getting somewhere with a lower-powered submersible. As a result, they set their probe to swim at a constant speed of 80% of the overall simulated water flow.

Even that would be relatively feasible if you had unlimited computational power to train and inference with, but trying to do this on something that could fit in a small submersible was a significant challenge. NLP models today have millions of parameters and take hours to train with multiple GPU/CPU cores and lots of memory; inferencing with these NLP models also takes a lot of processing power.

The researchers targeted something significantly smaller computationally and wished to train and perform real-time inferencing on the same hardware. They chose a "Teensy 4.0 micro-controller" board for their computational engine, which costs under $20, has ~2MB of flash memory and fits in a space smaller than 1.5″x1.0″ (38.1mm x 25.4mm).

The simulation setup

The team started their probe's turbulent-flow training with a cylinder in a constant flow that generated downstream vortices rotating in opposite directions. These vortices would travel from left to right in the simulated flow field. In order for the navigation logic to traverse this vortical flow, they randomly selected start and end locations on different sides.

The AI model they trained and used for inferencing was a combination of reinforcement learning (with an interesting multi-factor reward signal) and a policy using a trained deep neural network. They called this approach Deep RL.

For reinforcement learning, they used a reward signal that was a function of three variables: the time it took, the difference in distance to target, and a success bonus if the probe reached the target. The time variable was a penalty and was the duration of the swim activity. Distance to target was how much the Euclidean distance between the current probe location and the target location had changed over time. The bonus was only applied when the probe was in close proximity to the target location. The researchers indicated the reward signal could be used to optimize for other values, such as energy to complete the trip, surface area traversed, wear and tear on propellers, etc.
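A sketch of that three-part reward (the weights and the success radius are placeholders of mine, not the paper's values):

```python
import math

def navigation_reward(pos, prev_pos, target, dt, success_radius=0.1,
                      w_time=1.0, w_dist=1.0, bonus=10.0):
    """Time penalty, plus the change in Euclidean distance to the target,
    plus a bonus when the probe gets close to the target."""
    dist = math.dist(pos, target)
    prev_dist = math.dist(prev_pos, target)
    reward = -w_time * dt + w_dist * (prev_dist - dist)
    if dist < success_radius:
        reward += bonus
    return reward
```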

For the reinforcement learning state information, they supplied the probe-to-target relative location [Difference(Probe x,y, Target x,y)] and whatever sensor data was being tested (e.g., for the velocity-sensor-equipped probe, the local velocity of the water at the probe's location).

They trained the DNN policy using the state information (probe start and end location, local velocity/vorticity sensor data) to predict the swim angle used to navigate to the target. The DNN policy used 2 internal layers with 64 nodes each.
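A minimal stand-in for such a policy network, assuming a 4-dimensional state (relative target position plus local velocity) and a single swim-angle output; the input/output details are my assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class SwimPolicy(nn.Module):
    """Two hidden layers of 64 nodes mapping state to a swim angle."""
    def __init__(self, state_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),   # scaled below to an angle in [-pi, pi]
        )

    def forward(self, state):
        return self.net(state) * torch.pi

policy = SwimPolicy()
state = torch.tensor([[0.5, -0.2, 0.1, 0.0]])   # [dx, dy, u_local, v_local]
angle = policy(state)                            # predicted swim heading
```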

They benchmarked the Deep RL solution with local velocity sensing against a number of different approaches: one naive approach that always swam in the direction of the target, one flow-blind approach that had no sensors but used feedback from its location changes to train with, one vorticity-sensor approach which sensed the vorticity of the local water flow, and one complete-knowledge approach that had information on the actual flow at every location in the 2D simulation.

It turned out that, of the first four (naive, flow-blind, vorticity sensor and velocity sensor), the velocity-sensor-configured probe had the highest success rate ("near 100%").

That simulated probe was then measured against the complete flow knowledge version. The complete knowledge version had faster trip speeds, but only 18-39% faster (on the examples shown in the paper). However, the knowledge required to implement this algorithm would not be feasible in a real ocean probe.

More to be done

They tried the probe's Deep RL navigation algorithm on a different simulated flow configuration, a double gyre flow field (sort of like 2 circular flows side by side but rotating in opposite directions).

The previously trained (on cylinder vortical flow) Deep RL navigation algorithm only had a ~4% success rate with the double gyre flow. However, after training the Deep RL navigation algorithm on the double gyre flow, it was able to achieve an 87% success rate.

So with sufficient re-training it appears that the simulated probe’s navigation Deep RL could handle different types of 2D water flow.

The next question is how well their Deep RL can handle real 3D water flows, such as tidal flows, up-down swells, long-term currents, surface wind-wave effects, etc. It's probable that any navigation for real-world flows would need a multitude of Deep RL-trained algorithms to handle each and every flow encountered in real oceans.

However, the fact that training and inferencing could be done on the same small hardware indicates that the Deep RL could possibly be deployed in any flow: let it train on the local flow conditions until success is reached, then let it loose until it starts failing again. Training each time would take a lot of propulsive power but may be suitable for some probes.

The researchers have 3D printed a submersible with a Teensy microcontroller and an Arduino controller board, with propellers surrounding it so it can swim in any 3D direction. They have also constructed a water tank for use in real-life testing of their Deep RL navigation algorithms.


DeepMind takes on poker & Scotland Yard

Read an article the other day (DeepMind makes bet on AI system that can play poker, chess, Go, and more) about a new DeepMind game playing program that used a new approach to taking on perfect and imperfect information games with the same algorithms.

As you may recall, DeepMind's prior game-playing programs, AlphaZero and MuZero, played the perfect information games chess, shogi, & Go and achieved top rankings in all of them. These were all based on reinforcement learning and advanced search.

Perfect information games have no hidden information; that is, all the information needed to play a game is visible to all players (see the wikipedia Perfect information article). Imperfect information games have private or hidden information, visible only to one or a select set of players. In card playing, any card that's not shown to all players would represent hidden information. The other difference in imperfect games is that players attempt to keep their private information hidden as long as possible.

The latest DeepMind paper (see: Player of Games Arxiv paper) discusses a new approach to automated game playing that works for both perfect and imperfect information games. DeepMind’s latest game playing program is called Player of Games (PoG).

As many may know, Texas hold'em is a form of poker where everyone is dealt two cards down and five cards are dealt up that everyone shares (see: Texas hold'em and Betting in poker articles on wikipedia). Betting happens after the two down cards are dealt, after the next 3 up cards (called the "flop") are dealt, then after each of the remaining 2 up cards is dealt. Players select any of the (2 down and 5 up) cards to create the best 5-card poker hand. Betting is based on a blind (sort of a minimum bet). PoG plays as a single player, performing all the betting as well as card playing for Texas hold'em. No-limit betting means there's no limit (maximum) to the amount of a bet in the game.

Scotland Yard is a board game where detectives chase down a criminal (Mr. X) on the run across the city of London (see the wikipedia Scotland Yard (board game) article). Detectives each get 23 transportation tickets for taxis (11), buses (8), and underground trains (4). The game takes place on a board layout of London and starts with each detective and the criminal selecting a card with their hidden position on the board. The criminal gets (not quite, but almost) an unlimited number of transportation tickets plus 5 (in the USA) universal tickets (which can be used to take ferries as well as any other form of transport). Every time the criminal moves (except when using universal tickets), he reveals his form of transportation. And 5 times during the game the criminal also reveals his current location. The detective that finds the criminal wins.

I assume all my readers know how to play chess and Go (or at least understand them).

While MuZero and AlphaZero used reinforcement learning for training and sophisticated search for in game play, PoG needed to do something different due to the imperfect (or hidden) information present in the hold’em and Scotland Yard games.

How PoG is different

In imperfect information games, it's important to hide private information. In poker, when I got a great hand, I raised my betting levels extensively. But this often caused my opponents to withdraw from betting unless they had a great hand as well. I sometimes think that if I were to bet more consistently, and only bet big at the last betting round when I have a good hand, I might win more $. No doubt that's why I don't play poker anymore.

Like AlphaZero and MuZero, PoG also uses reinforcement learning through self game play but adds something they call Counterfactual Regret (CFR) Minimization to their game trees.

In addition to normally selecting and computing a value (reward) for the optimal move, as in reinforcement learning, PoG uses CFR minimization to compute values (rewards) for all moves not taken at every stage in a game, for each player. As such, PoG computes possible rewards for the optimal move at a stage (step, move) in a game plus the values for all the regret (counterfactual or other) moves for all players. CFR minimization attempts to minimize the regret move values and maximize the optimal move values at each move, for each player in a game.

CFR minimization is used during training for a game in self-play as well as during actual game play to generate sub-trees from wherever the game happens to be. PoG uses a depth limited CFR minimization to generate game sub-trees during game play which helps to reduce the time it takes to determine the best move for all players. Read the ArXiv paper to learn more.
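The core of CFR is regret matching; below is a minimal sketch of that update at a single decision point (PoG combines this with learned value functions and depth-limited search, which this toy code does not attempt):

```python
import numpy as np

def regret_matching_policy(cumulative_regret):
    """Current policy from accumulated counterfactual regrets: play actions in
    proportion to their positive regret (uniform if none are positive)."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(positive), 1.0 / len(positive))

def cfr_update(cumulative_regret, action_values):
    """One regret update at a decision point: compare each action's
    counterfactual value to the value of the current (mixed) policy."""
    policy = regret_matching_policy(cumulative_regret)
    node_value = float(np.dot(policy, action_values))
    cumulative_regret += action_values - node_value
    return cumulative_regret, node_value

regrets = np.zeros(3)                                    # 3 possible actions
regrets, value = cfr_update(regrets, np.array([1.0, 0.2, -0.5]))
```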

The challenge with this approach is that it will never be as good as pure reinforcement learning plus advanced search for perfect information games, such as chess and Go. For example, the paper shows Exploitability ratings for various PoG training levels for Leduc poker and Scotland Yard. Exploitability is one way to measure how well a player is playing; lower is better.

Perfect play (in an imperfect information game) would have an Exploitability of 0. The charts show that the more training is done, the better PoG plays (Leduc) poker and Scotland Yard. (Leduc poker is a simplified poker game with 6 cards and limited betting.)

On the other hand, for perfect information games the results were ok, but not stellar. Stockfish is the current non-reinforcement-learning chess playing champion. GnuGo and Pachi are non-reinforcement-learning Go playing programs. In the paper's tables, they use a relative ranking based on a 0 baseline for chess (Stockfish with 1 thread and 100msec think time) and Go (GnuGo). Higher is better.

~~~~

So, yes, PoG can do well in imperfect information games with decent training, and ok (but much, much better than I and probably the vast majority of humans) in perfect information games.

Why concern ourselves with imperfect information games? The world is chock full of them. They seem to occur everywhere: military strategy, sports play, finance, etc. In fact, perfect information games are the exception in real-world situations. Thus, any advance in playing multiple imperfect information games better is yet another small step towards AGI.


The problem with smarter robots

Read an article the other week about how Deepmind (at Google) is approaching the training of robotics using simulation, reinforcement learning, elastic weights, knowledge distillation and progressive learning.

It seems relatively easy to train a robot to handle some task like grabbing or walking. But doing so can take an awfully long time. If you want to train a robot to grab something and put it someplace, you can have it start out making some random movements of its arm, wrist and fingers (if it has such things) and then use reinforcement learning to help it improve its movements over time.

But if each grab attempt takes 10 seconds, using reinforcement learning may take 10,000 attempts before it starts to make any significant progress and perhaps another 20,000-50,000 more to get expert at it. Let's see: 60K × 10 seconds is 10,000 minutes, or ~170 hours. And that's just one object pick-and-place. But then maybe you would like to grab different parts and maybe place them in different locations. All these combinations start adding up.

And of course doing 1000s of movements will wear out gears, motors, mechanisms, etc. If only this could all be done in electronic simulation. Then, assuming the simulations are accurate enough, the whole thing could be done in a matter of hours without wearing anything out. Enter robot simulators such as NVIDIA Isaac Sim and OpenAI RoboSchool/PyBullet.

But the problems with simulation are …

Simulations are getting more accurate, but at some point their accuracy defeats their purpose because the real world is always noisy, windy and not as deterministic as any simulation. One researcher said you could conceivably have a two-armed robot trained in simulation to throw all of a cell phone's components up into the air such that they all land in their proper places and proper orientations. But in the real world this could never actually happen, or if it did, it could only happen once.

Hurricane Ike – 2008/09/12 – 21:26 UTC by CoreBurn (cc) (from Flickr)

Weather researchers have been dealing with this problem in spades for a long time. There appears to be a fundamental limit to how far in advance we can predict weather and it’s due to the accuracy with which sensors operate and the complexity of feedback loops between the atmosphere, oceans, landforms, etc. So at a fundamental level, simulations can never be completely accurate. But they can be better.

Today’s weather simulations we see on TV/radio use models that average a number of distinct simulations, where sensor information has been slightly and randomly modified. Something similar could be done for robotic simulation environments, to make them more realistic.

But there are other problems with training robots to do lots of tasks.

Forget me not…

AI deep learning and reinforcement learning algorithms are great when charged with learning a single task, but having them learn multiple tasks is much harder to do, because each task requires its own deep neural network (DNN), and if you train a DNN on one task and then try to train it on another task, it forgets all the learnings from the original task. Researchers call this catastrophic forgetting.

One way researchers have dealt with this problem is to effectively freeze certain DNN nodes, preventing their weights from changing during subsequent training rounds, and leave others flexible or changeable. One can see this when one trains an image recognition DNN to classify different objects by importing a well-trained object classifier, freezing all of its layers except the top one or two, and then training these layers to classify new objects.
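That freeze-and-retrain pattern looks like the following sketch (a generic example using torchvision's ResNet-18, my own choice of model, not anything specific to the robotics work discussed):

```python
import torch.nn as nn
from torchvision import models

# Reuse a pre-trained image classifier, freeze everything, and retrain only
# the final layer for a new set of object classes.
backbone = models.resnet18(weights="IMAGENET1K_V1")
for param in backbone.parameters():
    param.requires_grad = False                           # freeze previously learned weights
backbone.fc = nn.Linear(backbone.fc.in_features, 10)      # new trainable head (10 new classes)
# Only backbone.fc's parameters will be updated during subsequent training.
```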

This works well, but you have effectively changed the DNN to forget the original object classification training and replaced it with a new one. One refinement is to have multiple passes of training; after each one, certain nodes and connections (of importance to that particular task) are selectively frozen. This works well for a limited number of different tasks, but over time all nodes become frozen, which means that no more learning can take place. Researchers call this approach to the catastrophic forgetting problem elastic weights.

One way to get around the all-nodes-frozen issue in elastic weights is to have multiple NNs: one which is trained on a specific task and whose weights are then frozen, and a DNN that exists alongside it with its own initialized set of weights but which uses the original DNN's activations as part of the new DNN's inputs. This effectively incorporates all the previously learned knowledge into the new, combined DNN. These are called Progressive Neural Networks.

In this fashion one progressive DNN can be sequentially trained on any number of tasks, each of which ends up providing input to all subsequent task training activity. Such a progressive network never forgets and can use previously learned knowledge on new tasks.
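A toy reading of a progressive network column with one lateral connection from a frozen, previously trained column (a simplified sketch of the idea, not DeepMind's implementation):

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """A small column that, in addition to its own input, receives the hidden
    activations of a frozen, previously trained column."""
    def __init__(self, in_dim, hidden, out_dim, prev_column=None):
        super().__init__()
        self.prev = prev_column
        if self.prev is not None:
            for p in self.prev.parameters():
                p.requires_grad = False     # earlier task's knowledge stays frozen
        lateral = hidden if prev_column is not None else 0
        self.h = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(hidden + lateral, out_dim)

    def forward(self, x):
        h = torch.relu(self.h(x))
        if self.prev is not None:
            with torch.no_grad():
                h_prev = torch.relu(self.prev.h(x))   # lateral input from the old column
            h = torch.cat([h, h_prev], dim=-1)
        return self.out(h)

task1 = ProgressiveColumn(8, 32, 4)                     # trained on task 1, then frozen
task2 = ProgressiveColumn(8, 32, 4, prev_column=task1)  # task 2 reuses task 1's features
```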

The problem with progressive DNNs is a proliferation of DNN columns, one for each trained task. However, there are a couple of approaches to shrinking an ensemble of DNNs, like the one progressive training creates, into one that is simpler and just as effective. One way is to perturb weights in DNN nodes and see how model prediction accuracy is impacted across all its tasks. If accuracy isn't impacted that much, then that node and all its connections can be deleted from the model with minimal impact on model accuracy.

Another approach is to use one DNN to train another, sort of like teacher and student. This is called Knowledge Distillation: one DNN is a large network (the teacher) and a smaller (student) network is trained to mimic the teacher DNN to achieve similar accuracy. This is done by training the smaller student network to match the predictions/classifications of the larger one.
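The standard distillation loss looks roughly like this (a generic sketch, not the exact recipe used in the work discussed): the student matches the teacher's softened predictions while also fitting the true labels.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of a soft-target term (student mimics the teacher's temperature-
    softened output distribution) and a hard-target cross-entropy term."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```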

Google researchers have shown that knowledge distillation works best when the gap in the sizes of the two networks (teacher and student) isn't that large. They have addressed this problem by introducing an intermediate network (called the teacher's assistant, or TA). They train the TA first, then use the TA to train the student.

In the paper's graphic, when using a teacher of size 110 and a student of size 8, the resulting accuracy suffers, but if one uses an intermediate DNN of size 20, the resultant accuracy of the student is much closer to the teacher's.

~~~~

So with realistic simulation we can train a robot to do any specific task, all using only compute resources. And using progressive DNN training, a robot could conceivably be trained to do any number of tasks. And with appropriate knowledge distillation one can reduce the DNN from progressive training into something much smaller (<10%) than the original DNN.

Want a personal robot that can clean up around your place, do the wash, cook your food and do anything else needed? You know what to do.