I was perusing Deepmind’s mountain of research today and ran across an article on their Gato agent (A Generalist Agent abstract, paper pdf). These days, with Llama 2, GPT-4, and all the other LLMs doing code, chatbots, image generation, etc., it seems generalist agents are everywhere. But that’s not quite right.
Gato can not only generate text from prompts, but can also control a robot arm for pick-and-place, caption images, navigate in 3D, play Atari and other (shooter) video games, etc., all with the exact same model architecture and the exact same NN weights, with no transfer learning required.
Same weights/same model is very unusual for generalist agents. Historically, generalist agents were trained separately on each domain, and each resulting model had distinct weights even if they shared the same model architecture. For Deepmind to train Gato and use the same model/same weights across multiple domains is a significant advance.
Gato has achieved significant success in multiple domains (see chart below). Complete success is still a bit out of reach, but they are making progress.

For instance, in the chart one can see that there are over 200 tasks in the DM Lab arena that the model is trained to perform, and Gato’s mean performance for ~180 of them is above a (100%) expert level. I believe DM Lab stands for Deepmind Lab, which is described as a (multiplayer, first-person shooter) 3D video game built on top of Quake III Arena.
Deepmind stated that the mean for each task in any domain was taken over 50 distinct iterations of the same task. Gato performs, on average, 450 out of 604 “control” tasks at better than 50% of human expert level. Please note, Gato does a lot more than just “control tasks”.
Model size and RT robotic control
One thing I found interesting is that they kept the model size down to 1.2B parameters so that it can perform real-time inferencing when controlling robot arms. Over time, as hardware speed increases, they believe they should be able to train larger models and still retain real-time control. But at the moment, a 1.2B-parameter model can still provide real-time inferencing.

In order to understand model size vs. expertise, they trained 3 different model sizes on the same data: 79M, 364M, and 1.2B parameters. As can be seen in the above chart, the models did suffer in performance as they got smaller. (It’s unclear to me what “Tokens Processed” on the X axis actually means, other than the length of data trained with.) However, it seems to imply that, with similar data, bigger models performed better, and the largest did 10 to 20% better than the smallest model trained on the same data streams.
Examples of Gato in action
The robot they used to train for was a “Sawyer robot arm with 3-DoF cartesian velocity control, an additional DoF for velocity, and a discrete gripper action.” It seemed a very flexible robot arm of the kind used in standard factory environments. One robot task was to stack plastic blocks of different styles and colors.

Deepmind says that Gato provides rudimentary dialogue generation and picture captioning capabilities. Looking at the chat streams presented, it seems more than rudimentary to me.

Deepmind did try the (smaller) model on some tasks it was not originally trained on, and it seemed to perform well after “fine-tuning” on the task. In most cases, fine-tuning the original model with just “same domain” (task-specific) data achieved results similar to what Gato achieved when trained from scratch on all the data used in the original model PLUS that specific domain’s data.
Data and tokenization used to train Gato
Deepmind is known for their leading-edge research in RL, but Gato’s deep neural net model is trained entirely with supervised learning using transformer techniques. While text-based transformer learning is pervasive in LLMs today, vast web-scale data sets on 3D shooter gaming, robotic block stacking, image captioning, and the rest aren’t nearly as widely available. Below they list the data sets Deepmind used to train Gato.

One key to how they could train a single transformer NN model to do all this is that they normalized ALL the different types of data above into flat arrays of tokens.
- Text was encoded into one of 32K subwords, each represented by an integer from 0 to 32K. Text is presented to the model in word order.
- Images were transformed into 16×16 pixel patches in raster order. Each pixel is normalized to [-1, 1].
- Other discrete values (e.g., Atari button pushes) are flattened into sequences of integers and presented to the model in row-major order.
- Continuous values (e.g., robot arm joint torques) are first flattened into sequences of floats in row-major order, then mu-law encoded into the range [-1, 1], and then discretized into one of 1024 bins.
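The continuous-value step above can be sketched in a few lines of numpy. This is a minimal illustration, not Deepmind's code; the constants (mu=100, M=256, 1024 bins) are the ones reported in the paper, and the function name is mine.

```python
import numpy as np

def tokenize_continuous(x, mu=100, M=256, bins=1024):
    """Mu-law encode continuous values into [-1, 1], then discretize into integer bins."""
    # Mu-law companding: compresses large magnitudes, keeps resolution near zero
    y = np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(M * mu + 1.0)
    y = np.clip(y, -1.0, 1.0)
    # Map [-1, 1] uniformly onto integer bins 0..bins-1
    return np.floor((y + 1.0) / 2.0 * (bins - 1)).astype(int)

# e.g., a flattened sequence of joint torques becomes a sequence of integer tokens
torques = np.array([0.0, 0.5, -2.3, 10.0])
tokens = tokenize_continuous(torques)
```

Because every modality ends up as integers like these, one transformer vocabulary can cover them all.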
After tokenization, the data streams are converted into embeddings. Much more information on the tokenization and embedding process used in the model is available in the paper.
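As a rough illustration of the embedding step: each integer token indexes a row of a learned lookup table. The sizes below and the random table are purely illustrative (a trained model learns these weights), and per the paper, image patches are handled differently, going through a small ResNet-style embedding instead.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 33024, 64  # illustrative: ~32K text subwords + 1024 discrete bins
embedding_table = rng.standard_normal((vocab_size, d_model)) * 0.02

def embed(token_ids):
    # Each integer token selects one learned d_model-dimensional row
    return embedding_table[np.asarray(token_ids)]

seq = embed([17, 42, 99])  # shape (3, 64): three tokens, one vector each
```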
One can see the token count of the training data above. Like other LLMs, Gato’s transformer takes a token stream and is trained autoregressively: given the tokens seen so far, predict the next token in the sequence.
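The standard autoregressive (next-token) objective behind transformer models of this kind can be written out directly; this is a generic numpy sketch of the cross-entropy loss, not Gato's actual training code, and the function name is mine.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: (T, V) model outputs at each position; tokens: (T,) the stream itself.
    """
    targets = tokens[1:]     # each position is scored against the *next* token
    preds = logits[:-1]      # the last position has no target
    # numerically stable log-softmax
    z = preds - preds.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

With uniform (all-zero) logits over a vocabulary of size V, this loss is exactly log(V), which is a handy sanity check when implementing it.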
~~~~
The paper (see link above and below) has a lot more to say about the control and non-control domains and the data used in training/fine-tuning Gato, if you’re interested. They also have a lengthy section on risks and challenges present in models of this type.
My concern is that as generalist models become more pervasive, and as they are trained to work in more domains, the difference between a true AGI agent and a generalist agent starts to blur.
Something like Gato that can work in the real world (via robotics), perform meta analysis (as in Meta-World), play first-person shooter games, and analyze 2D and 3D images, all at near expert levels, and, oh, support real-time inferencing, seems not that far away from something that could be used as a killer robot in an army of the future. And this is just where Gato is today.
One thing I note is that the model is not being made generally available outside of Google Deepmind. And IMHO, for now, that is a good thing.
That is until some bad actor gets their hands on it….
Picture Credit(s):
All images, charts, and tables are from “A Generalist Agent” paper