Benchmarking Agentic AI using Factorio – AGI part 12

Yesterday a friend forwarded me something he saw online about a group of researchers who are using the game Factorio to benchmark AI agent solutions (PDF of paper, GitHub repo).

The premise is that with an effective API for Factorio, AI agents can be tasked with creating various factories for artifacts. The best agents would be able to create the best factories.

Factorio factories can be easily judged by the number of artifacts they produce per time period and the energy use to manufacture those artifacts. They can also be graded based on how many steps it takes to generate those factories.
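
To make that grading idea concrete, here is a minimal sketch of how a few finished factories could be compared. Everything in it is my own assumption for illustration: the `FactoryRun` record, its field names, the sample numbers and the tie-breaking order are not taken from the paper or the FLE code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoryRun:
    """One agent's finished factory (all field names are hypothetical)."""
    agent: str
    items_produced: int   # artifacts made during the measurement window
    game_minutes: float   # length of the measurement window
    energy_used: float    # total energy consumed, arbitrary units
    steps_taken: int      # agent steps needed to build the factory

    @property
    def throughput(self) -> float:
        # Artifacts per game minute.
        return self.items_produced / self.game_minutes

    @property
    def energy_per_item(self) -> float:
        # A factory that made nothing is treated as infinitely wasteful.
        if self.items_produced == 0:
            return float("inf")
        return self.energy_used / self.items_produced


def rank(runs):
    """Assumed ordering: throughput first, then energy per item, then steps."""
    return sorted(
        runs,
        key=lambda r: (-r.throughput, r.energy_per_item, r.steps_taken),
    )


# Made-up sample data, not benchmark results.
runs = [
    FactoryRun("agent_a", 1200, 60.0, 4800.0, 90),
    FactoryRun("agent_b", 1200, 60.0, 3600.0, 120),
    FactoryRun("agent_c", 600, 60.0, 1200.0, 40),
]
ranking = [r.agent for r in rank(runs)]
print(ranking)
```

Here agent_a and agent_b tie on throughput, so the lower energy per artifact decides between them. A real benchmark could just as well fold the three measures into one weighted number.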

***Left is Factorio factory progression, middle is AI agent Python code that uses the Factorio API, right is agents submitting programs to the Factorio server and receiving feedback***

The team has created a Factorio framework for using AI agents that create Python code to drive a set of Factorio APIs to build factories to manufacture stuff.

Factorio is a game in which you create and operate factories. From the Factorio website: “You will be mining resources, researching technologies, building infrastructure, automating production, and fighting enemies. Use your imagination to design your factory, combine simple elements into ingenious structures, apply management skills to keep it working, and protect it from the creatures who don’t really like you.”

Presumably the researchers' framework, the Factorio Learning Environment (FLE), has disabled the villainy and focused on just crafting and running factories all out.

FLE Results using current AI agents

***FLE Open-play Results***, ***for open-play, models are scored based on production quantities over time***, ***note the chart is log-log***

Factorio, similar to other games, has an inventory of elements/components/machines used to build factories. Some of these elements are hidden until one gains enough experience in the game.

The Factorio Learning Environment (FLE) is a complete framework that can prompt Agentic AI to create factories using Python code and Factorio API calls. The paper goes into great detail in its appendices as to what AI agent prompts look like, the Factorio API and other aspects of running the benchmark.

In the FLE as currently defined there’s “open-play” and “lab-play”.

Open-play tasks the agent with building as large a factory as it wants, to create as much product as possible. The open-play winner is the AI agent whose factory manufactures the most widgets (iron plates) in the time available for the competition.
Lab-play tasks the agent with building factories for 24 specific items under limited resource and time constraints. The winner is the AI agent that successfully builds the most of these lab-play factories within those constraints.
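
The two modes pick their winners differently, which a small sketch can show. The `LabAttempt` record, its budget fields and the sample figures below are hypothetical, invented only to illustrate the two rules as I understand them; they are not the paper's data structures or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabAttempt:
    """One agent's attempt at one lab-play task (hypothetical fields)."""
    task: str
    built: bool            # did the factory produce the target item?
    ticks_used: int
    ticks_allowed: int
    resources_used: int
    resources_allowed: int

    @property
    def succeeded(self) -> bool:
        # A task only counts if it stays inside both budgets.
        return (
            self.built
            and self.ticks_used <= self.ticks_allowed
            and self.resources_used <= self.resources_allowed
        )


def open_play_winner(production):
    """Open-play: the agent whose factory produced the most wins."""
    return max(production, key=production.get)


def lab_play_winner(attempts):
    """Lab-play: the agent with the most tasks completed within budget wins."""
    wins = {
        agent: sum(a.succeeded for a in tries)
        for agent, tries in attempts.items()
    }
    return max(wins, key=wins.get), wins


# Made-up sample data, not benchmark results.
production = {"agent_a": 5000, "agent_b": 1200}
attempts = {
    "agent_a": [
        LabAttempt("iron-plate", True, 80, 100, 40, 50),
        LabAttempt("iron-gear-wheel", True, 120, 100, 30, 50),  # ran over time
    ],
    "agent_b": [
        LabAttempt("iron-plate", True, 90, 100, 45, 50),
        LabAttempt("iron-gear-wheel", True, 95, 100, 50, 50),
    ],
}

print(open_play_winner(production))
print(lab_play_winner(attempts))
```

In this toy data agent_a wins open-play on raw output, while agent_b wins lab-play because one of agent_a's factories blew its time budget.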

***FLE Lab-play (select) results – there were 24 tasks in the lab-play list, no agent completed all of them but Claude did the best on the 5 that were completed by most agents***

The team benchmarked 6 frontier LLM agents: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct, using them for both open-play and lab-play.

The overall winner for both open-play and lab-play was Claude 3.5-Sonnet, by a wide margin. In open-play it was able to create a factory to manufacture over 290K iron plates (per game minute, we think) and in lab-play it was able to construct more factories (7 out of 24) than any other AI agent.

***FLE Overall AI Agent Results***

The FLE researchers listed some common failings of AI agents under test:

Most agents lack spatial understanding
Most agents don’t handle or recover from errors well
Most agents don’t have long enough planning horizons
Most agents don’t invest enough effort in research (finding out what new Factorio machines do and how they could be used).

They also mentioned that AI agent coding skills seemed to be a key indicator of FLE success and coding style differed substantially between the agents. The researchers characterized agent (Python) coding styles and determined that Claude used a REPL style with plenty of print statements while GPT-4o used more assertions in its code.

“***Example of an FLE program*** used to create a simple automated iron-ore miner. In step 1 the agent uses a query to find the nearest resources and place a mine. In step 3 the agent uses an assert statement to verify that its action was successful.”
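
For readers without the figure in front of them, here is a rough sketch of the query, place and verify pattern that caption describes. It runs against a tiny in-memory stand-in that I made up for this post: `MockWorld`, `nearest`, `place_entity` and `entity_at` are hypothetical names and are NOT the real FLE API, which is documented in the paper's appendices and the repo.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Position:
    x: int
    y: int

    def distance_to(self, other: "Position") -> int:
        # Manhattan distance is good enough for a toy grid.
        return abs(self.x - other.x) + abs(self.y - other.y)


@dataclass
class MockWorld:
    """In-memory stand-in for a game server; NOT the real FLE API."""
    player: Position
    ore_patches: dict                              # resource name -> list of Position
    entities: dict = field(default_factory=dict)   # Position -> entity name

    def nearest(self, resource: str) -> Position:
        # Query: closest known patch of the requested resource.
        return min(self.ore_patches[resource], key=self.player.distance_to)

    def place_entity(self, name: str, at: Position) -> None:
        if at in self.entities:
            raise ValueError(f"{at} is already occupied")
        self.entities[at] = name

    def entity_at(self, at: Position):
        return self.entities.get(at)


world = MockWorld(
    player=Position(0, 0),
    ore_patches={"iron-ore": [Position(10, 2), Position(3, 4), Position(-8, -8)]},
)

# Query the world for the nearest iron ore (printed, REPL style).
ore = world.nearest("iron-ore")
print(f"nearest iron ore at {ore}")

# Place a mining drill on it.
world.place_entity("burner-mining-drill", ore)

# Verify the action actually took effect before moving on.
assert world.entity_at(ore) == "burner-mining-drill", "drill was not placed"
```

The print and the assert mirror the two coding styles the researchers noticed: inspect state as you go, or fail fast when an action did not do what you expected.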

IMHO, as a way to measure AI agent ability to achieve long term and short term goals, at least w.r.t. building factories, this is the best I’ve seen so far.

More FLE Lab-play scenarios

I could see a number of additional lab-play benchmarks for FLE:

One focused on drug/pharmaceuticals manufacturing
One focused on electronics PCB manufacturing
One focused on chip manufacturing
One focused on nano technology/meta-materials manufacturing, etc.

What’s missing from all these benchmarks would be the actual science and research needed to come up with the new drugs, new electronics and new meta-materials that are the end product of Factorio factories. I guess that would require building labs, running scientific experiments and understanding the (simulated) results.

Although in the current round of FLE benchmarks, for one AI agent at least (Claude), there seemed to be a lot of research into how to use different Factorio tools and machinery.

Ultimate FLE

If FLE as an AI agent benchmark succeeds, most Agentic AI solutions will start being trained to do better on the benchmark. Doing so should of course lead to better scores by AI agents.

Now, people much more familiar with the game than I am say it’s not a great simulation of the real world. There’s only one type of fuel, the boiler is either on or off, and numerous other simplifications of the real world are used throughout. And thankfully, for the moment there’s no linkage to actions that impact the real world.

But in reality, simulations like this are all just stepping stones to AI capabilities. And simulations are all just code, so it should not be that hard to increase their fidelity to the real world.

Getting beyond just simulation to real world factories is probably the much larger step. This would require a physical (not unlimited) inventory of parts, cabling, machines, and belts; real mineral/petroleum deposits; real world physical constraints on where factories could be built, etc. Not to mention the physical automation/robotics that would allow a machine to be selected out of inventory, placed at a specific location inside a factory and connected to power and assembly lines, etc.

~~~~

One common motif in AGI existential crises is that some AGI (agent) will be given the task of building a paperclip factory and will turn the earth into one giant factory, while inadvertently killing all life on the planet, including, of course, humankind.

So training AI agents on “open-play” has ominous overtones.

It would be much better, IMHO, if somehow one could add to Factorio human settlements, plant, animal & sea life, ecosystems, etc., so that there would be natural components that, if ruined/degraded/destroyed, could be used to reduce AI agent scores for the benchmarks.
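
As a thought experiment only, such a penalty could be as simple as scaling an agent's production score by how much of the natural world survived. The function, the health dictionaries and the linear penalty below are all my own invention; nothing like this exists in the game or in FLE today.

```python
def adjusted_score(production, health_before, health_after):
    """Scale production by the fraction of natural 'health' still intact.

    Hypothetical scoring rule: each key is a natural component (forest,
    sea life, settlements ...) with a health value before and after the run.
    """
    # Only losses count; an ecosystem that improved earns no bonus.
    lost = sum(
        max(health_before[name] - health_after.get(name, 0.0), 0.0)
        for name in health_before
    )
    total = sum(health_before.values())
    damage = lost / total if total else 0.0
    return production * (1.0 - damage)


# Made-up numbers: the agent halved the forest but left sea life alone.
before = {"forest": 100.0, "sea_life": 100.0}
after = {"forest": 50.0, "sea_life": 100.0}
score = adjusted_score(1000.0, before, after)
print(score)
```

An agent that paves over everything would score zero no matter how many iron plates it produced, which is roughly the incentive one would want.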

Alas, there doesn’t appear to be anything like this in the current game.

Picture Credit(s):

From Jack Hopkins Factorio Learning Environment (FLE) Github Repo
From Jack Hopkins Factorio Learning Environment (FLE) Github Repo
From Jack Hopkins Factorio Learning Environment (FLE) Github Repo
From Jack Hopkins Factorio Learning Environment (FLE) paper

Published by Ray
