Benchmarking Agentic AI using Factorio – AGI part 12

Yesterday a friend forwarded me something he saw online about a group of researchers who were using the game Factorio to benchmark AI agent solutions (PDF of paper, GitHub repo).

A Factorio plastic bar factory

The premise is that, given an effective API for Factorio, AI agents can be tasked with creating factories that produce various artifacts. The best agents will be able to create the best factories.

Factorio factories can easily be judged by the number of artifacts they produce per time period and the energy used to manufacture those artifacts. They can also be graded on how many steps it takes to generate those factories.

Left: Factorio factory progression; middle: AI agent Python code that uses the Factorio API; right: agents submitting programs to the Factorio server and receiving feedback

The team has created a Factorio framework in which AI agents write Python code against a set of Factorio APIs to build factories that manufacture stuff.

Factorio is a game in which you create and operate factories. From Factorio website: “You will be mining resources, researching technologies, building infrastructure, automating production, and fighting enemies. Use your imagination to design your factory, combine simple elements into ingenious structures, apply management skills to keep it working, and protect it from the creatures who don’t really like you.”

Presumably FLE has disabled the villainy and focused on just crafting and running factories all out.

FLE Results using current AI agents

FLE open-play results. For open-play, models are scored based on production quantities over time; note the chart is log-log

Factorio, similar to other games, has an inventory of elements/components/machines used to build factories. And some of these elements are hidden until one gains enough experience in the game.

The Factorio Learning Environment (FLE) is a complete framework that can prompt Agentic AI to create factories using Python code and Factorio API calls. The paper goes into great detail in its appendices as to what AI agent prompts look like, the Factorio API, and other aspects of running the benchmark.

In the FLE as currently defined there’s “open-play” and “lab-play”.

  • Open-play tasks the agent with building a factory as large as it wants, to create as much product as possible. The open-play winner is the AI agent that creates a factory that can manufacture the most widgets (iron plates) in the time available for the competition.
  • Lab-play tasks the agent with building factories for 24 specific items, under limited resource and time constraints. The winner is the AI agent that is able to build the most of these lab-play factories successfully within the time and resource constraints available.
FLE lab-play (select) results – there were 24 tasks in the lab-play list; no agent completed all of them, but Claude did the best on the 5 that were completed by most agents

The team benchmarked 6 frontier LLM agents (Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct) for both open-play and lab-play.

The overall winner for both open-play and lab-play was Claude 3.5-Sonnet, by a wide margin. In open-play it was able to create a factory that manufactured over 290K iron plates (per game minute, we think), and in lab-play it was able to construct more factories (7 out of 24) than any other AI agent.

FLE Overall AI Agent Results

The FLE researchers listed some common failings of AI agents under test:

  • Most agents lack spatial understanding
  • Most agents don’t handle or recover from errors well
  • Most agents don’t have long enough planning horizons
  • Most agents don’t invest enough effort in research (finding out what new Factorio machines do and how they could be used).

They also mentioned that AI agent coding skills seemed to be a key indicator of FLE success and coding style differed substantially between the agents. The researchers characterized agent (Python) coding styles and determined that Claude used a REPL style with plenty of print statements while GPT-4o used more assertions in its code.

Example of an FLE program used to create a simple automated iron-ore miner. In step 1 the agent uses a query to find the nearest resources and place a mine. In step 3 the agent uses an assert statement to verify that its action was successful.
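The pattern described might look something like the following sketch. The API names (`nearest`, `place_entity`) and the stub environment are illustrative assumptions, not the real FLE API, but the query/act/assert/print structure mirrors the coding styles the researchers describe.

```python
# Hypothetical sketch of an FLE-style agent program. The API names and the
# stub environment are illustrative only; a minimal stand-in server is
# included so the pattern is runnable.
from dataclasses import dataclass

@dataclass
class Position:
    x: int
    y: int

class StubFactorio:
    """Stand-in for the real FLE/Factorio server API (assumed names)."""
    def nearest(self, resource):                       # step 1: query the map
        return Position(10, -4)
    def place_entity(self, entity, position):          # step 2: take an action
        return {"entity": entity, "position": position, "status": "placed"}

game = StubFactorio()
ore = game.nearest("iron-ore")                         # find nearest iron ore
miner = game.place_entity("burner-mining-drill", ore)  # place a drill on it
assert miner["status"] == "placed"                     # step 3: verify success
print(f"drill placed at ({ore.x}, {ore.y})")           # REPL-style feedback
```

The assert line reflects the GPT-4o style the paper characterizes, while the print line reflects Claude's REPL style.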

IMHO, as a way to measure AI agent ability to achieve long-term and short-term goals, at least w.r.t. building factories, this is the best benchmark I’ve seen so far.

More FLE Lab-play scenarios

I could see a number of additional lab-play benchmarks for FLE:

  • One focused on drug/pharmaceuticals manufacturing
  • One focused on electronics PCB manufacturing
  • One focused on chip manufacturing
  • One focused on nanotechnology/meta-materials manufacturing, etc.

What’s missing from all these benchmarks is the actual science and research needed to come up with the new drugs, new electronics, and new meta-materials that are the end products of Factorio factories. I guess that would require building labs, running scientific experiments, and understanding (simulated) results.

Although in the current round of FLE benchmarks, for one AI agent at least (Claude), there seemed to be a lot of research into how to use different Factorio tools and machinery.

Ultimate FLE

If FLE as an AI agent benchmark succeeds, most Agentic AI solutions will start being trained to do better on the benchmark. Doing so should, of course, lead to better scores by AI agents.

Now people much more familiar with the game than I am say it’s not a great simulation of the real world. There’s only one type of fuel, the boiler is either on or off, and numerous other simplifications of the real world are used throughout. And thankfully, for the moment, there’s no linkage to actions that impact the real world.

But in reality, simulations like this are all just stepping stones to AI capabilities. And since simulations are all just code, it should not be that hard to increase their fidelity to the real world.

Getting beyond just simulation, to real-world factories, is probably the much larger step. This would require a physical (not unlimited) inventory of parts, cabling, machines, and belts; real mineral/petroleum deposits; real-world physical constraints on where factories could be built; etc. Not to mention the physical automation/robotics that would allow a machine to be selected out of inventory, placed at a specific location inside a factory, and connected to power and assembly lines, etc.

~~~~

One common motif in AGI existential crises is that some AGI (agent) will be given the task of building a paperclip factory and will turn the earth into one giant factory, inadvertently killing all life on the planet, including, of course, humankind.

So training AI agents on “open-play” has ominous overtones.

It would be much better, IMHO, if somehow one could add human settlements, plant, animal & sea life, ecosystems, etc. to Factorio. Then there would be natural components that, if ruined/degraded/destroyed, could be used to reduce AI agent scores on the benchmarks.

Alas, there doesn’t appear to be anything like this in the current game.

Picture Credit(s):

Data Centers on the Moon !?

I was talking with Chris Stott of Lonestar and Sebastian Jean of Phison the other day and they were discussing placing data centers in lunar orbit, on the surface of the moon or in lava tubes on the moon.

The reason commercial companies, governments, and other organizations would be interested in doing this is that their data could be free from natural disasters, terrorist activities, war, and other earth-based calamities.

Lunar data centers could be the ultimate Iron Mountain or DR solution. You’d backup your corporate data to their data centers on the moon and could restore from them whenever you needed to.

The questions are: can it be done technically, can it be done economically, and can it pass the regulatory hurdles to make it happen?

Lonestar’s CEO, Chris Stott, says the regulatory hurdles are underestimated by many who haven’t done much in space, but they believe they have all the authorizations they need to make it happen.

The technical hurdles abound, however:

  • Bandwidth up and down from lunar orbit/surface needs to be significant: Gbps and then some. It’s one thing to ship customer data in a ready-to-deploy data center storage solution, but another to update that data over time. Most organizations create TBs if not PBs of data on a monthly if not weekly basis. All that data would need to be sent up to the lunar data centers and written to storage there, for every customer they have.
  • Power and cooling seem to be concerns in the vacuum of space or on the lunar surface. Most space electronics are cooled by a form of liquid cooling, which is known technology. And most power requirements in space are supplied, at least in near-earth orbit, via solar panels.
  • Serviceability: in any massive data center today, hardware is going down, software needs to be updated, and operations and development are constantly tweaking what occurs. Yes, you can build in fault tolerance, redundancy, and all the automatic code/firmware lifecycle management routines you want. But at some point, some person (or thing) has to go replace a server board, drive, or cable, and doing that on the moon or in lunar orbit would require humans and a spacewalk, or sophisticated robots that could operate there.
  • Radiation: space is considered a hard radiation environment. Cosmic rays and other radiation sources are abundant, and outside the earth’s magnetic field, which shields us from much of this, the environment is extremely harsh. In the past this required RAD-hardened electronics, which were typically at least a decade, if not 2 or 3 decades, behind leading-edge technologies.
  • Data sovereignty regimes require that some data not be transferred across national boundaries. How this relates to space is an open question.

As for bandwidth, it all depends on how much spectrum one can make use of: the more spectrum you license, the higher the transfer speeds to/from the moon you can support. And there’s also the potential for optical (read: laser) communications, at least point to point in space and maybe from space to earth’s surface, which can boost bandwidth.

NASA has tested optical links from the moon and from the ISS. They seem to work very well going from space to Earth, but not so well in the other direction – go figure. Lonestar has licensed sufficient radio-frequency bandwidth to support Gbps up and down transfer speeds.
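To put the update-bandwidth problem in perspective, here’s a back-of-envelope calculation (ignoring protocol overhead, retransmits, and link availability, which would all make it worse) of how long it takes to move a petabyte at various link speeds:

```python
# Back-of-envelope: how long to ship a petabyte to a lunar data center
# at various link speeds (protocol overhead and retransmits ignored).
def transfer_days(data_bytes: float, gbps: float) -> float:
    bits = data_bytes * 8
    seconds = bits / (gbps * 1e9)
    return seconds / 86_400          # seconds per day

PB = 1e15
for speed in (1, 10, 100):           # Gbps
    print(f"{speed:>3} Gbps -> {transfer_days(PB, speed):,.1f} days per PB")
```

At 1 Gbps a single petabyte takes roughly three months to move, which is why the Gbps-and-then-some requirement matters.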

Lonestar says cooling is free in space. Liquid cooling is becoming more and more viable as GPUs and AI accelerators start consuming KWs if not MWs of power to do their thing. And the fact that deep space sits at 2.7 K means that cooling shouldn’t be a problem, as long as you can dissipate the heat via radiation. Convection doesn’t work well without a medium to work in, and in the vacuum of space, and presumably on the moon’s surface, that means radiation is the only way to shed heat.
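As a rough illustration of what radiative-only cooling implies, here’s a back-of-envelope radiator sizing using the Stefan-Boltzmann law. The 100 kW load, 300 K panel temperature, and 0.9 emissivity are illustrative assumptions, and solar/lunar heat loads are ignored:

```python
# Rough radiator sizing in vacuum via the Stefan-Boltzmann law.
# Idealized: no solar or lunar heat load, one-sided panel, assumed emissivity.
SIGMA = 5.670e-8   # Stefan-Boltzmann constant, W/m^2/K^4

def radiator_area_m2(watts: float, panel_temp_k: float, emissivity: float = 0.9) -> float:
    # Power radiated per m^2 is eps * sigma * T^4; the 2.7 K background
    # contributes negligibly and is dropped.
    return watts / (emissivity * SIGMA * panel_temp_k ** 4)

# e.g. a hypothetical 100 kW rack-scale load with radiators at 300 K
area = radiator_area_m2(100_000, 300.0)
print(f"~{area:.0f} m^2 of radiator")
```

Even this idealized case needs on the order of 240 m² of radiator for 100 kW, so "cooling is free" still costs real structure.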

They also say that power is unlimited in space, as long as you can send up and deploy sufficient solar panels to sustain that power. Solar panels do deteriorate over time, so that might be a concern limiting the lifetime of these data centers. But presumably, with enough solar panels, that shouldn’t be the critical path.

Can a data center today be run without servicing? Microsoft’s Project Natick experimented with undersea data centers (see our undersea data centers post). The main problem with these was that they were dumping heat into local ecosystems, and for some reason fish and other sea life didn’t like it. Microsoft has since abandoned undersea data centers. But they did prove they could be run for years without any need for servicing.

Historically, electronics sent to space or the moon have all been RAD hardened, which necessitated using older and more expensive versions of electronics. I’m not sure, but I read once that today’s cell phone has more computing power than NASA had in 1969.

But lately there’s been keen interest in using state-of-the-art, commercial off-the-shelf (COTS) electronics. Lonestar said the Mars helicopter was run off what was essentially an Android phone’s CPU.

The key to the use of COTS electronics in space is the newer forms of radiation shielding available today. Nonetheless, the radiation environment in lunar orbit, on the moon’s surface, or in lunar lava tubes is not that well known. So one of Lonestar’s experimental payloads is to monitor the radiation environment from earth launch to the moon’s surface in much greater detail than has been available before.

As for data sovereignty in space, it’s apparently solved. Multi-nation payloads are often deployed from the same spacecraft. Space law states that any nation’s payload is the responsibility of that nation. So technically, each data regime could be isolated within its own data center equipment and not have to intermix with other nations’ data/storage. Yes, they would all share the power, cooling, and communications links, but that’s apparently not an issue, and encryption could keep the communication links’ data secure, if desired.

So whether you can place a data center in lunar orbit, on the lunar surface, or in lunar lava tubes is all being investigated by Lonestar and their technical partners, like Phison.

Whether it can be done at a price that customers on earth would pay is another question. But apparently Lonestar already has customers signed up.

Are data centers in lunar orbit or on the moon any more resilient or available than data centers on earth?

Yes, there are no wildfires on the moon, no hurricanes, no earthquakes, no floods, etc. But there are bound to be other lunar dangers: solar storms and moon dust come to mind. And the environment inside lunar lava tubes is a complete unknown.

And of course, anything attached to communications links is also susceptible to cyber threats, whether on Earth or in space.

And man-made threats, in lunar orbit or on the surface of the moon, are not out of the question. Yes, it’s highly unlikely today and for the foreseeable future, but then anti-satellite weapons were considered unlikely early on too.

~~~~

Speaking of man-made threats: apparently, China already has a data center in lunar orbit or on the surface of the moon.

Comments?

Photo Credit(s):

Silverton Space – Ocean Sensing platform

I was at a conference last year where a speaker who had worked at NASA for years and is currently at MIT talked at length about some of the earth and space scientific exploration NASA has enabled over the years. Despite massive cost overruns, years-long schedule delays, and other mishaps, NASA has ultimately come through with groundbreaking science.

At the end of her presentation I asked what data gaps existed today in space and earth sensing. She mentioned real time methane tracking (presumably from space) and battery-less ocean sensing.

Methane tracking from the Tanager-1 JPL/NASA satellite

Methane tracking I could understand but battery-less ocean sensing was harder to get a handle on.

The US Navy and other oceanographic organizations have deployed numerous sensing devices over the years. Some of these were like a flotilla, traveling across the Gulf and Atlantic oceans to gather data.

But these were battery-supported and solar-powered, and limited to ~1 year of service, after which they were scuttled to the bottom of the ocean.

I guess the thought is that a battery-less ocean sensing platform could provide more of an ongoing, permanent sensor platform, one that could be deployed and potentially stay in service for years at a time, with little to no maintenance.

The pivot

So as a stepping stone to Silverton Space cubesat operations, I’m thinking that going after a permanent-like ocean sensing platform would be a valuable first step. And it’s quite possible that anything we do in LEO with Silverton Space platforms could complement any ocean going sensor activity.

One reason to pivot to ocean sensing is that it’s much, much cheaper to launch a flotilla of ocean-going sensing buoys via a boat off a coast than it is to launch a handful of cubesats into LEO (at ~$70K each).

Cubesats fail at a high rate

Moreover, the litany of small satellite failures is long, highly varied and chronic. Essentially anything that could go wrong, often does, at least for the first dozen or so satellites you deploy.

NASA says that of the small satellites launched between 2000 and 2016, over 40% failed in some way and over 24% were total mission failures. (see: https://ntrs.nasa.gov/api/citations/20190002705/downloads/20190002705.pdf)

Cubesats with limited functionality, or that fail in orbit or at launch, become just more trash orbiting in LEO. And the only way to diagnose what went wrong is elaborate, extensive, transmitted/received telemetry.

So another reason to start with ocean-going sensors is that there’s a distinct possibility of retrieving a malfunctioning ocean-going sensor buoy after deployment. And with the sensor buoy in hand, diagnosing what went wrong should be a snap. This doesn’t eliminate the need for elaborate, extensive, transmitted/received telemetry, but you are no longer entirely dependent on it.

And even if at end of life they can’t be salvaged/refurbished or scuttled, the worst case is that our ocean sensing buoys would end up being part of some ocean/gulf garbage patch. And hopefully they would get picked up and disposed of as part of oceanic garbage collection.

~~~

So for the foreseeable future, Silverton Space will focus on ocean-going sensor buoys. It’s unlikely that our first iterations will be completely battery-less, but at some point down the line we hope to produce a version that can be on station for years at a time and provide valuable ocean sensing data to the scientific community.

The main question left is what sorts of ongoing ocean sensor information might be most valuable to supply to the world’s scientific community?

Photo Credit(s):

Nexus by Yuval Noah Harari, AGI part 12

This book is all about how information networks have molded man and society over time and what’s happening to these networks with the advent of AI.

    In the earliest part of the book he defines information as essentially “that which connects and can be used to create new realities”. For most of humanity, reality came in two forms:

    • Objective reality, a shared belief in things that can be physically tasted, touched, seen, etc., and
    • Subjective reality, which was entirely internal to a single person and was seldom shared in its entirety.

    With mankind’s information networks came a new form of reality: inter-subjective reality. Because inter-subjective reality was external to the person, it could readily be shared, debated, and acted upon to change society.

    Information as story

    He starts out with the 1st information network: the story, or rather the shared story. The story and its sharing across multiple humans led human society to expand beyond bands of hunter-gatherers. Stories led to the first large societies of humans, with information flows that looked like human-story and story-human, and created the first inter-subjective realities. Shared stories still impact humanity today.

    As we all know, stories verbally passed from one person to another often undergo minor changes. That’s not much of a problem for stories, as the plot and general ideas are retained. But for inventories, tax receipts, and land holdings, small changes can be significant.

    What transpired next was a solution to this problem. As these societies became larger and more complex, there arose a need to record lists of things, such as plots of land, taxes owed/received, inventories of animals, etc. And lists are not something that can easily be woven into a story.

    Information as printed document

    Thus the clay tablets of Mesopotamia and elsewhere were created to permanently record lists. And the clay tablet is just another form of printed document.

    Whereas story led to human-story and story-human interactions, printed documents led to human-document and document-human information flow. Printed documents expanded the inter-subjective reality sphere significantly.

    But the invention of printed documents, or clay tablets, caused another problem: how to store and retrieve them. There arose in these times the bureaucracy, run by bureaucrats, to create storage and retrieval systems for vast quantities of printed documents.

    Essentially with the advent of clay tablets, something had to be done to organize and access these documents and the bureaucrat became the person that did this.

    With bureaucracy came obscurity, restricted information access, and limited visibility/understanding into what bureaucrats actually did. Perhaps one could say that this created human-bureaucrat-document and document-bureaucrat-human information flow.

    The holy book

    (c)Kevin Eng

    Next he talks about the invention of the holy book, i.e., the Hebrew Bible, the Christian New Testament, the Islamic Koran, etc. They all attempted to explain the world, but over time their relevance diminished.

    As such, there arose a need to “interpret” the holy books for the current time. 

    For Hebrews this interpretation took the form of the Mishnah and Talmud. For Christians, the books of the New Testament, the epistles, and the Christian Church. I presume similar activities occurred for Islam.

    Following this, he sort of touches on the telegraph, radio, & TV, but they are mostly given short shrift compared to story, printed documents, and holy books, as all of these are just faster ways to disseminate stories, documents, and holy books.

    Different Information flows in democracies vs. tyrannies

    Throughout the first 1/3 of the book he weaves in how different societies, such as democracies and tyrannies/dictatorships/populists, have different information views and flows. As a result, they support entirely different styles of information networks.

    Essentially, in authoritarian regimes all information flows to the center and out of the center, and ultimately the center decides what is disseminated. There’s absolutely no interest in finding the truth, just in retaining power.

    In democracies, there are many different information flows, mostly in an uncontrolled fashion, and together they act as checks and balances on one another to find the truth. Sometimes this is corrupted, or fails to work for a while in order to maintain order, but over time the truth always comes out.

    He goes to some length about how these democratic checks-and-balances information networks function, in isolation and together. In contrast, tyrannical information flows ultimately get bottled up and lead to disaster.

    The middle ~1/3 of the book touches on inorganic information networks: those run by computers, for computers, which ultimately run in parallel to human information flows. They are different from the printing press, are always on, but are often flawed.

    Non-human actors added to humanity’s information networks

    The last 1/3 of the book takes these information network insights and shows how the emergence of AI algorithms is fundamentally altering all of them. By adding a non-human actor with its own decision capabilities into the mix, AI has created a new form of reality, an inter-computer reality, which has its own logic, ultimately unfathomable to humans.

    Rohingya refugees in camp

    Even a relatively straightforward (dumb) recommendation engine, whose expressed goal is to expand and extend interaction on a site/app, can learn how to do this in such a way as to have unforeseen societal consequences.

    This had a role to play in the Rohingya Genocide, and we all know how it impacted the 2016 US elections and continues to impact elections to this day.

    In this last segment he articulates some reasonable solutions to AI and AGI risks. It’s all about proper goal alignment and using computer AIs, together with humans, to watch other AIs.

    Sort of like the fox…, but it’s the only real way to enact some form of control over AI. We will discuss these solutions at more length in a future post.

    ~~~~

    In this blog we have talked many times about the dangers of AGI. What surprised me in reading this book is that AI doesn’t have to reach AGI levels to be a real danger to society.

    A relatively dumb recommendation engine can aid and abet genocide, disrupt elections and change the direction of society. I knew this but thought the real danger to us was AGI. In reality, it’s improperly aligned AI in any and all its forms. AGI just makes all this much worse.

    I would strongly suggest every human adult read Nexus, there are lessons within for all of humanity.

    Picture Credits:

    SOENs can reach AGI scale – AGI Part 11, ASI Part 2, Neuromorphic Part 7

    At the OCP summit, researchers from NIST presented the winning paper and poster session, Supercomputers for AI based on Superconducting Optoelectronic Networks (SOENs), which discussed how one could use optical networking (waveguides, free space, optical fibre) with superconducting electronics (single photon detectors, SPDs, and Josephson junctions, JJs) to construct a neuromorphic simulation of biological neurons.

    The cited paper is not available, but the poster was (copied below), and it referred to an earlier paper, Optoelectronic Intelligence, which is available online if you want to learn more about the technology.

    Some preliminaries before we hit the meat of the solution. Biological neurons are known to operate as spiking devices that only fire when sufficient input (ions, electronic charge), i.e. a threshold of charge, is present, and once fired (or depleted), it takes another round of input ions reaching that threshold to make them fire again.

    Biological neurons are connected to one another via dendrites, which are long input connections, and axons, the output connections, across synapses (small gaps). Neurons are interconnected within micro-columns, columns, clusters, and complexes, which form the functional units of the brain. Functional units interconnect (via axon – synapse – dendrite) to form brain subsystems such as the visual system, memory system, auditory system, etc.
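A minimal leaky integrate-and-fire sketch (illustrative only, not the SOEN circuit model) captures the threshold-fire-deplete behavior described above:

```python
# Minimal leaky integrate-and-fire neuron sketch. The threshold, leak, and
# input values are illustrative assumptions, not a model of any real circuit.
def simulate(inputs, threshold=1.0, leak=0.9):
    """Accumulate charge each step; fire and reset when threshold is reached."""
    v, spikes = 0.0, []
    for t, i in enumerate(inputs):
        v = v * leak + i          # leaky integration of input charge
        if v >= threshold:        # sufficient charge -> fire
            spikes.append(t)
            v = 0.0               # depleted; must recharge to fire again
    return spikes

print(simulate([0.4, 0.4, 0.4, 0.0, 0.8, 0.5]))  # -> [2, 5]
```

The neuron needs several sub-threshold inputs to accumulate before it fires, and after firing it starts from zero, just as described for biological spiking neurons.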

    DNNs vs the Human Brain

    Current deep neural networks (DNNs), the underlying technology for all AI today, are a digital approach to emulating the brain’s electronic processing. DNNs use layers of nodes, with each node connected to all the nodes in the layer above and all the nodes in the layer below. Whether a specific DNN node fires depends on its inputs multiplied by its weights, plus its bias.

    DNNs use a feed-forward approach where inputs are fed into nodes at the bottom layer and proceed upward, based on the weights and biases of each node in each layer, resulting in an output at the topmost layer (or bottommost layer, depending on your preferred orientation).
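The inputs-times-weights-plus-bias flow can be sketched in a few lines of plain Python (toy numbers, ReLU standing in for the "firing" nonlinearity):

```python
# Tiny feed-forward pass illustrating "inputs x weights + bias" per node.
# Two layers, toy weights; ReLU plays the role of the firing decision.
def layer(inputs, weights, biases):
    # each node: weighted sum of all inputs from the layer below, plus its
    # bias, passed through a ReLU nonlinearity
    return [max(0.0, sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [1.0, 2.0]
hidden = layer(x, weights=[[0.5, -0.2], [0.3, 0.8]], biases=[0.1, -0.5])
out = layer(hidden, weights=[[1.0, 1.0]], biases=[0.0])
print(out)
```

A foundation model is this same pattern repeated over many layers with trillions of weights, which is where the training power bill comes from.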

    Today’s DNN foundation models are built using trillions (10**12) of parameters and consume city levels of power to train.

    In comparison, the human brain has approximately 10B (10**10) neurons and maybe 1000X that in connections between neurons. N.B.: neurons don’t connect to every neuron in a micro-column.

    So what’s apparent in the above is that the human brain requires significantly fewer neurons than DNN nodes. And even with significantly fewer neurons, it is capable of much more complex thought and reasoning. And the power consumption of the human brain is on the order of a few W, whereas foundation models consume GWs of power to train and KWs to inference.

    SOENs, a better solution

    SOENs, superconducting optoelectronic networks, are a much closer approximation to human neurons and can be connected in such a fashion that, within a couple of cubic meters, they could support (with today’s 45nm chip technology) 10B SOENs with 1000s of connections between them.

    SOENs are composite circuits comprising single photon detectors (SPDs), Josephson junctions (JJs), and light-generating transistor circuits. Both the SPDs and JJs require cryogenic cooling to operate properly.

    “(a) Optoelectronic neuron. Electrical connections are shown as straight, black arrows, and photons are shown as wavy, [blue] arrows… Part (a) adapted from J. M. Shainline, IEEE J. Sel. Top. Quantum Electron. 26, 1 (2020). Copyright 2020 Author(s), licensed under a Creative Commons Attribution (CC BY) license.43”
    Circuit diagrams. (a) Superconducting optoelectronic synapse combining a single-photon detector (SPD) with a Josephson junction and a flux-storage loop, referred to as the synaptic integration (SI) loop. The synaptic bias current (Isy) can dynamically adapt the synaptic weight. (b) Neuron cell body performing summation of the signals from many synapses as well as thresholding. Here the neuronal receiving (NR) loop is shown collecting inputs from two SI loops, but scaling to thousands of input connections appears possible. Upon reaching threshold, the transmitter circuit produce a pulse of light that communicates photons to downstream synapses. The neuronal threshold current (Ith) can dynamically adapt the neuronal threshold.

    SOENs have biases and thresholds similar to both biological neurons and DNN nodes which are used to boost signals and as gates to limit firings.

    g) Schematic of multi-planar integrated waveguides for dense routing. Adapted from Chiles et al., APL Photonics 2, 116101 (2017). Copyright 2017 Author(s), licensed under a Creative Commons Attribution (CC BY) license.

    When an SOEN fires, it transmits a single photon of light to the receiver (SPD) of another SOEN. That photon travels within wafers, in waveguides created in planes of the wafer. The photon can travel across to another wafer using optical connections. And it can travel up or down, using free-space optics, to wafers located above or below.

    There are other neuromorphic architectures out there, but none that have the potential to scale to the human brain’s level of complexity with today’s technology.

    And of course, because SOENs are optoelectronic devices, something on the scale of the human brain (10B SOENs) would operate 1000s of times faster.

    At the show the presenter mentioned that it would only take about $100M to fabricate the SOENs needed to simulate the equivalent of a human brain.

    I think they should start a GoFundMe project and get cracking… AGI is on the way.

    And the real question is why stop there…

    Picture Credits:

    Enfabrica MegaNIC, a solution to GPU backend networking #AIFD5

    I attended AI Field Day 5 (AIFD5) last week, and there were networking vendors there discussing how their systems dealt with backend GPU network congestion issues. Most of these were traditional vendor congestion solutions.

    However, one vendor, Enfabrica, (videos of their session will be available here) seemed to be going down a different path, which involved a new ASIC design destined to resolve all the congestion, power, and performance problems inherent in current backend GPU Ethernet networks.

    In essence, Enfabrica’s Super or MegaNIC (they used both terms during their session) combines PCIe lane switching, Ethernet networking, and ToR routing with SDN (software-defined networking) programmability to connect GPUs directly to a gang of Ethernet links. This allows it to replace multiple (standard/RDMA/RoCEv2) NIC cards with one MegaNIC using their ACF-S (Advanced Compute Fabric SuperNIC) ASIC.

    Their first chip, codenamed “Millennium”, supports 8Tbps of bandwidth.

    Their ACF-S chip provides all the bandwidth needed to connect up to 4 GPUs to 32×100, 16×200, 8×400, or 4×800 Gbps links. And because the ACF-S chip controls and drives all these network connections, it can better understand and deal with congestion issues in backend GPU networks. And it is PCIe 5/6 compliant, supporting 128-160 lanes.
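Assuming the 32/16/8/4 link counts pair off with the 100/200/400/800 Gbps speeds respectively (my reading of the slide, not a confirmed spec), a quick sanity check shows every configuration delivers the same aggregate bandwidth to the attached GPUs:

```python
# Sanity check on the quoted link configurations: under the pairing
# assumption above, each option delivers the same aggregate Tbps.
configs = [(32, 100), (16, 200), (8, 400), (4, 800)]  # (links, Gbps per link)
for links, gbps in configs:
    total_tbps = links * gbps / 1000
    print(f"{links:>2} x {gbps} Gbps = {total_tbps} Tbps")
```

Each option works out to 3.2 Tbps of Ethernet, so the choice is really about link count and optics, not total bandwidth.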

    Further, it has onboard ARM processing to handle its SDN operations, onboard hardware engines to accelerate networking protocol activity and network and PCIe switching hardware to support directly connecting GPUs to Ethernet links.

    With its SDN, it supports current RoCE, RDMA over TCP, UEC direct, etc. network protocols.

It took me longer than it should have to get my head around what they were doing, but essentially they are supporting all the NIC and ToR functionality, as well as the PCIe functionality, needed to connect up to 4 GPUs to a backend Ethernet GPU network.

On the slide above, I was extremely skeptical of the “every 10^52 years” figure for “job failures due to NIC RAIL failures”. But Rochan said that these errors are predominantly optics failures, and as both the NIC functionality and the ToR switch functionality are embedded in the ACF-S silicon, those faults should not exist.

Still, 10^52 years is a long MTBF (BTW, the universe is only ~10^10 years old). And there’s still software controlling “some” of this activity. It may not show up as a “NIC RAIL” failure, but there will still be “networking” failures in any system using ACF-S devices.

    Back to their solution. What this all means is you can have one less hop in your backend GPU networks leading to wider/flatter backend networks and a lot less congestion on this network. This should help improve (GPU) job performance, networking performance and reduce networking power requirements to support your 100K GPU supercluster.

At another session during the show, Arista (videos will be available here) said that the DSP/LPO optics alone for a 100K GPU backend network would take 96MW or 32MW of power, respectively. It’s unclear whether this took into consideration within-rack copper connections. But any way you cut it, it’s a lot of power. Of course, the 100K GPUs alone would take 400MW (at 4KW per GPU).
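Putting those power figures together (my own arithmetic on the numbers quoted above, nothing from Arista or Enfabrica beyond the raw figures):

```python
# Rough power math for a 100K GPU backend network,
# using the per-GPU and optics numbers quoted above.
n_gpus = 100_000
gpu_kw = 4                                # 4KW per GPU
gpu_mw = n_gpus * gpu_kw / 1000           # MW for the GPUs alone
optics_mw = {"DSP": 96, "LPO": 32}        # quoted optics power figures

print(f"GPUs: {gpu_mw:.0f} MW")
for kind, mw in optics_mw.items():
    print(f"{kind} optics: {mw} MW ({100 * mw / gpu_mw:.0f}% of GPU power)")
```

So even the cheaper LPO optics add roughly 8% on top of the GPU power budget, and DSP optics nearly a quarter.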

    Their ACF-S driver has been upstreamed into standard CCL and Linux distributions, so once installed (or if you are at the proper versions of CCL & Linux software), it should support complete NCCL (NVIDIA Collective Communications Library) stack compliance.

And because, with its driver installed and active, it talks standard Ethernet and standard PCIe protocols on both ends, it should fully support any other hardware that comes along attaching to these networks or buses (CXL, perhaps).

    The fact that this may or may not work with other (GPU) accelerators seems moot at this point as NVIDIA owns the GPU for AI acceleration market. But the flexibility inherent in their own driver AND on chip SDN, indicates for the right price, just about any communications link software stack could be supported.

After spending most of the rest of AIFD5 discussing how various vendors deal with congestion on backend GPU networks, having a startup on the stage with a different approach was refreshing.

Whether it reaches adoption and startup success is hard to say at this point. But if it delivers on what it seems capable of doing for power, performance and network flexibility, anybody deploying new greenfield GPU superclusters ought to take a look at Enfabrica’s solution.

    MegaNIC/ACF-S pilot boxes are available for order now. No indication as to what these would cost but if you can afford 100K GPUs it’s probably in the noise…

    ~~~~

    Comments?

    SIGGRAPH 2024 Keynote: BabyX – AGI part 11, ASI part 3

SIGGRAPH came back to Colorado, to the Colorado Convention Center, for its 50th anniversary conference; the original SIGGRAPH conference was held in Boulder in 1974.

The first SIGGRAPH keynote was a session called Beyond the Illusion of Life, presented by Mark Sagar, Soul Machines Co-Founder and former Chief Science Officer.

    The theme of the session was mainly on how AI needs an embodiment to achieve a true breakthrough. Without embodiment, AI is just another secluded machine function and interacting with it will always be divorced from human existence and as such, much harder than interacting with other people.

    As an example of embodied AI, Mark presented BabyX, a virtual 12-24 month old infant.

BabyX shows how creating a digital embodiment of a human can lead to faster, easier, and more inherently natural human-machine interactions. This is because we, as humans, have evolved to interact with other humans, and do this much better and faster than we can interact with machines, chatbots, and other digital simulacra.

    With BabyX, they have created an emulation rather than an animation or simulation of a human.

    BabyX

    BabyX is a virtual infant that interacts with a virtual screen AND real people on the other side of that screen. BabyX simulates a real infant in front of a screen with adult supervision.

    BabyX interacts with people using verbal cues, virtual screen images and virtual hands/fingers in real time.

    BabyX appears to be actually learning and interacting with different people in real time.

    If you check out their video (in link above), one can see just how close the emulation can get.

BabyX’s emulation is based on a digital cognitive architecture that mimics the real brain, including a memory and learning system, a motor control system, a visual system, etc.

    All these systems are distinct computational modules, that in unison, represent the “virtual connectome” of BabyX’s brain emulation. Each of these cognitive systems can be swapped in or out, whenever better versions become available.

This cognitive architecture was designed to digitally reconstruct the key components of the brain of an 18-24 month old infant.

As a result, BabyX learns through interactions with its environment, by talking with people, and by viewing a screen. With BabyX, they can even simulate hormonal activity. The end result is the ability to provide real-time emotional expression.

    With such a cognitive architecture, one could simulate real (virtual) humans interacting with another person, on the other side of a virtual screen.

    Soul Machines “virtual” assistants

Soul Machines (like above) has taken BabyX research and created AI avatars used as customer support agents, educational assistants, and for any commercial activity that depends on humans interacting with machines via screens.

    It’s unclear just how much of the BabyX cognitive architecture and simulation has made its way into Soul Machines’ Avatars, but they do show similar interactions with a virtual screen and humans, as well as emotional expression.

Soul Machines is in the market of supplying these digital avatars so that companies can provide a better, more human-like experience when interacting with AI.

    In any case, BabyX was the first time I saw the true embodiment of an AI that uses a cognitive architecture as it is understood today.

    AGI?

One can’t help but think that this is a better, or at least potentially a more correct, way to create human-level artificial intelligence, or AGI. BabyX uses a digital emulation of human memory & learning, behavior, attention, etc. to construct a machine entity that acts and interacts much as a human would.

    With this sort of emulation, one could see training a digital emulation of a human, and after 20 years or so, resulting in a digital human, with human levels of intelligence.

And, of course, once we have re-created a human-level intelligence, the (industry) view is that all we need do is focus it on improving (machine) learning algorithms and maybe (machine) learning hardware, let it loose to learn all there is to know in the universe, and somewhere along the way we will have created artificial super intelligence, or ASI.

    Thankfully, it turns out that BabyX’s long term memory has been constrained to be temporary and limited. So, we aren’t able to see how a TeenX would actually behave (thank the powers that be).

Sagar mentioned some of the ethical issues in letting BabyX have an indefinite, permanent long-term memory.

    I’m thinking this won’t stop others from taking this approach on.

    Which, in the end, scares the heck out of me.

    ~~~~
    Comments?

    The Data Wall – AGI part 11, ASI part 2

    Went to a conference the other week (Cloud Field Day 20) and heard a term I hadn’t heard before, the Data Wall. I wasn’t sure what this meant but thought it an interesting concept.

Then later that week, I read an article online, Situational Awareness – The Decade Ahead, by Leopold Aschenbrenner, which talked about the path to AGI. He predicts it will happen in 2027, with ASI in 2030. However, he also discusses many of the obstacles to reaching AGI, and one key roadblock is the Data Wall.

This is a follow-on to our long running series on AGI (see AGI part 10 here), and with it we are creating a new series on Artificial Super Intelligence (ASI); we have relabeled an earlier post as ASI part 1.

    The Data Wall

LLMs, these days, are being trained on internet text, images, video and audio. However, the vast majority of the internet is spam, junk and trash. And because of this, LLMs are rapidly reaching (bad) data saturation. There’s only so much real intelligence to be gained from scraping the internet.

The (LLM) AI industry apparently believes that there has to be a better way to obtain clean, good training data for their LLMs, and that if it can be found, true AGI is just a matter of time (and compute power). This current wall of garbage data prohibiting true progress toward AGI is what is meant by the Data Wall.

Leopold doesn’t go into much detail about solutions to the data wall, other than to say that perhaps Deep Reinforcement Learning could help (see below). Given the importance of this bottleneck, every LLM company is trying to solve it. And as a result, any solutions to the Data Wall will end up being proprietary, because solving it enables AGI.

National Security Agency seal

    But the real gist of Leopold’s paper is that AGI and its follow on, Artificial Super Intelligence (ASI) will be the key to enabling or retaining national supremacy in the near (the next decade and beyond) future.

And any and all efforts to achieve this must be kept a national top secret. I think he wants to see something similar to the Manhattan Project created in the USA, only rather than working to create an atom/hydrogen bomb, it would be focused on AGI and ASI.

The problem is that when AGI, and its follow-on ASI, are achieved, they will represent an unimaginable advantage to the country/company that owns them. Such technology, if applied to arms, weapons, and national defense, will be unbeatable in any conflict. And it could conceivably be used to defeat any adversary before a single shot was fired.

    The AGI safety issue

In the paper, Leopold talks about AGI safety, and his proposed solution is to have AGI/ASI agents be focused on crafting the technologies to manage/control this. I see the logic in this and welcome it, but feel it’s not sufficient.

I believe (a minority view these days, it seems) that rather than having a few nation states or uber corporations own and control AGI, it should be owned by the world, and be available to all nation states/corporations and ultimately every human on the planet.

My view is that the only way to safely pass through the next “existential technological civilizational bottleneck” (e.g., AGI is akin to atomic weapons, genomics, and climate change, all of which could potentially end life on earth) is to have many of these that can compete effectively with one another. Hopefully such competition will keep them all in check and, in the end, have them focused on the betterment of all of humanity.

Yes, there will be many bad actors that will take advantage of AGI and any other technology to spread evil, disinformation and societal destruction. But to defeat this, it needs to become ubiquitous, everywhere, and in that way these agents can be used to keep the bad actors in check.

    And of course keeping the (AGI/ASI) genie in the bottle will be harder and harder as time goes on.

Computational performance is going up 2X every few years. So, while building a cluster of 10K H200 GPUs today is extremely cost prohibitive for any but uber corporations and nation states, in a decade or so it will be something any average-sized corporation could put together in its data center (or use in the cloud). And in another decade or so, it could be built into your own personal basement data center.
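That "2X every few years" compounds quickly. A purely illustrative sketch: both the ~$300M price tag for a 10K H200 cluster and the 2.5-year doubling period below are my assumptions, not figures from any vendor:

```python
# Hypothetical projection: if compute per dollar doubles every
# ~2.5 years, the effective cost of today's 10K-GPU cluster falls
# fast. Both starting cost and doubling period are assumptions.
cost_today = 300e6        # assumed ~$300M for 10K H200s (ballpark)
doubling_years = 2.5      # assumed cost-performance doubling period

for years in (10, 20):
    factor = 2 ** (years / doubling_years)
    print(f"In {years} years: ~${cost_today / factor / 1e6:.1f}M for the same compute")
```

Under those assumptions, today's nation-state cluster costs under $20M in a decade and around $1M in two, which is roughly the trajectory the paragraph above describes.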

The software skills to train an LLM, while today they may require a master’s degree or higher, will be much easier to acquire and apply in a decade or so. So that’s not much of a sustainable advantage either.

    This only leaves the other bottlenecks to achieving AGI, a key one of which is the Data Wall.

Solving the Data Wall

    In order to have as many AGI agents as possible, the world must have an open dialogue on research into solving the Data Wall.

So how can the world generate better data to use to train open source AGIs? I offer a few suggestions below, but by no means is this an exhaustive list. And I’m just an interested (and talented) amateur in all this.

    Deep reinforcement learning (DRL)

Leopold mentioned DRL as one viable solution to the data wall in his paper. DRL is a technique that DeepMind used to create superhuman Atari, Chess and Go players. They essentially programmed an agent to play a game against itself and determined which participant won the game. Once this was ready, they set multiple agents loose to play one another.

Each win would be used to reward the better player, and each loss to penalize the worse player; after 10K (or ~10M) games they ended up with agents that could beat any human player.
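The self-play loop described above can be caricatured in a few lines. This is a toy Elo-style sketch of win-reward/loss-penalty, NOT DeepMind's actual algorithms; the skill values, update sizes, and win-probability model are all made up for illustration:

```python
import random

random.seed(0)

def play(skill_a, skill_b):
    """Toy game: winner chosen probabilistically by relative skill."""
    return 0 if random.random() < skill_a / (skill_a + skill_b) else 1

skills = [1.0, 1.0, 1.0, 1.0]   # four identical starting agents
for game in range(10_000):
    a, b = random.sample(range(len(skills)), 2)
    winner, loser = (a, b) if play(skills[a], skills[b]) == 0 else (b, a)
    skills[winner] += 0.01                           # reward the win
    skills[loser] = max(0.1, skills[loser] - 0.01)   # penalize the loss

print(sorted(round(s, 2) for s in skills))
```

The point of the real technique is that the game itself supplies the training signal, so no external labeled data is needed, which is exactly why it looks attractive as a way around the Data Wall.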

    Something similar could be used to attack the Data Wall. Have proto-AGI agents interact (play, talk, work) with one another to generate, let’s say more knowledge, more research, more information. And over time, as the agents get smarter, better at this, AGI will emerge.

However, the advantage of Go, Chess, Atari, protein folding, finding optimal datacenter energy usage, sorting algorithms, etc., is that there’s a somewhat easy way to determine which of a gaggle of agents has won. For research, this is not so simple.

Let’s say we program/prompt a proto-AGI agent to generate a research paper on some arbitrary topic (How to Improve Machine Learning, perhaps). So it generates a research paper; how does one effectively and inexpensively judge if it is better, worse or the same as another agent’s paper?

    I suppose with enough proto-AGI agents one could automatically use “repeatability” of the research as one gauge for research correctness. Have a gaggle of proto-AGIs be prompted to replicate the research and see if that’s possible.

Alternatively, submit the papers to an “AGI journal” and have real researchers review them (sort of like how Reinforcement Learning from Human Feedback works for LLMs today). The costs for real researchers reviewing AGI-generated papers would be high, and of course the amount of research generated would be overwhelming, but perhaps with enough paid and (unpaid) voluntary reviewers, the world could start generating more good (research) data.

    Perhaps at one extreme we could create automated labs/manufacturing lines that are under the control of AGI agent(s) and have them create real world products. With some modest funding, perhaps we could place the new products into the marketplace and see if they succeed or not. Market success would be the ultimate decision making authority for such automated product development.

(This latter approach seems to be a perennial AGI concern: tell an AGI agent to make better paper clips and it uses all of the earth’s resources to do so.)

    Other potential solutions to the Data Wall

    There are no doubt other approaches that could be used to validate proto-AGI agent knowledge generation.

• Human interaction – have an AGI agent be available 24X7 with humans as they interact with the world. Sensors worn by the human would capture all their activities, and an AGI agent would periodically ask the human why they did something. Privacy considerations make this a nightmare, but perhaps using surveillance videos and an occasional check-in with the human would suffice.
    • Art, culture and literature – there is so much information embedded in cultural artifacts generated around the world that I believe this could effectively be mined to capture additional knowledge. Unlike the internet this information has been generated by humans at a real economic cost, and as such represents real vetted knowledge.
• Babies/children – I can’t help but believe that babies and young children can teach us (and proto-AGI agents) an awful lot about how knowledge is generated and validated. It’s unclear how to obtain this data other than to record everything they do, but maybe it’s sufficient to capture it from daycares and public playgrounds, with appropriate approvals of course.

There are no doubt others. But finding some that are cheap enough to be used for open source is a serious consideration.

    ~~~~

How we get through the next decade will determine the success or failure of AI, and perhaps life on earth. I can’t help but think the more the merrier will help us get there.

Comments?