What R/R tracks can tell us about AI deployment

I saw a chart of worldwide railroad (R/R) track the other night in a history class I’m taking. No idea where it was sourced from, but I found it more intriguing than the discussion going on.

There’s an awful lot of R/R track in Europe, the Eastern US, India, Japan, and Eastern China, and not much elsewhere. There are vast spaces of emptiness in northern Canada, northeastern Russia, northern and central Africa, northern and southern South America, northern and central Australia, and others.

The question that comes to mind is why all the open space. Yes, mountainous regions could present a problem, but the Alps didn’t seem to inhibit R/R track laying in Europe. Tundra and deserts may be a problem, but South America, upper Canada, Russia, and other regions outside Africa don’t fit that pattern. Population density might explain it, but Africa and China don’t fit that pattern either.

And then I thought of a technological change during the 20th century that made R/R less necessary to economic development: the advance of the automobile, tractor trailers, highways/roadways, etc. But these didn’t really take off until after the 1950s. Arguably there were at least 100 years of R/R dominance in transportation between 1850 and 1950.

Many of the open spaces (e.g., Africa, South America, Asia) were actively fought over, and attempts were made to develop them throughout the 19th century. But they still never got the density of railways that the advanced economies had. And what explains the difference between the less dense and the higher density portions of the USA? Or the discrepancy between India’s track density and the lack of density in mainland China?

Re the US, I can only think that subsidies ran dry after a while, which curtailed R/R track construction. But the vast majority of track laid in the Eastern US was not (IMHO) subsidized by the government, so why is it so dense there?

I suppose timing could account for some of this variance. R/R track was laid to support transport of goods and people, and the relative sparsity of population in the Western US (at least during 1850-1950) may have had an impact on the amount of R/R track laid down.

I believe two main factors combine to dictate how that R/R track map looks today:

  • The availability of capital for infrastructure development
  • The economic need to support/improve transport to market for industrial goods and agriculture

Claude Sonnet 4.6-created map of long-haul fibre connections around the world, using the prompt “Can you find or create a world map, showing where current long haul fibre links exist today”

But none of that tells us why China’s interior doesn’t have a dense R/R track network. My guess is that although population, industry, and (agricultural) production were high in China during 1850-1950, capital and a centralized authority to protect property were missing at the time.

Great Britain had the money (during 1850-1925, at least) and used it to develop the R/R network in India, but why didn’t it do the same in Australia? Probably because the need wasn’t as great: Australia didn’t have the population, agricultural production, or industrial production of India.

What does all this mean for AI?

R/R technology was economically essential for much of the 100 years between 1850 and 1950. If we assume that AI will fill a similarly essential economic niche (2020-?), we should see similar developments driving how AI is deployed.

Ultimately, we should see AI data centers deployed mainly in support of industrial and agricultural production and services, in areas where capital is available and can be deployed. AI will likely not be adopted or deployed as much in areas that are less advanced economically, mainly because of the lack of capital and of legal infrastructure to protect it.

IMHO, AI and AI data center deployments will probably look similar to the R/R track map above, with some minor changes. They will follow the money, the economic need, and the legal structures needed to support them.

Today’s world is awash in capital searching for investment, so capital shouldn’t be a limiting factor. And legal infrastructure protecting property is nearly universal throughout the world these days.

However, economic activity and the need to support it are widely variable and dispersed in today’s world. Some regions have migrated away from manufacturing to services, others have undergone a serious manufacturing build-out, but all need agriculture to sustain their populations.

All that may keep the AI deployment map from exactly matching the R/R track map above. As a result, we may see a broader spread for AI deployments than for the R/R track of yesterday.

I believe the main AI data center deployments will be along the coasts of the USA, with lots in the Eastern states and some in the Midwest; lots in Europe; lots in China, India, Japan, Korea, Israel, Taiwan, and maybe Australia/NZ; with spots in Southeast Asia, Africa, and South America.

Claude Sonnet 4.6-created map using the prompt “now can you find or create a similar map showing the current and proposed AI data centers as dots on a world map”

I suppose similar maps could be used to display electricity generation and transmission, telephone lines, and long-haul fibre connections (I tried that above, but it wasn’t as useful as I thought). If I’m correct, they should all look similar to the above, with minor changes based on when the technology was economically essential.

Comments?

Hammerspace and the Open Flash Platform at #AIIFD3

I was at AI Infrastructure Field Day 3 (AIIFD3) last week in CA, and Hammerspace presented (videos here). Molly and Floyd talked about their solution and some of their recent MLCommons performance results, but Kurt discussed the Open Flash Platform (OFP) Consortium, announced last July, which they and their partners have been working on.

OFP currently has 6 partners, ranging from Hammerspace (storage software supplier), SK Hynix (NAND and SSDs), and the Linux Foundation to end users (Los Alamos National Labs), computational storage suppliers (ScaleFlux), and AI solution providers (Xsight).

As I understand it, the OFP is pushing to become a standard adopted by the Open Compute Project (OCP).

OFP is an attempt to redefine NAS as we know it. Hammerspace has been on this journey for a long time with their software-only solution, but technology is now at a place where it’s time to tackle hardware changes to NAS that would enable even better performance and throughput for AI and other data-intensive workloads.

Some of the technology changes driving the need for a different approach to NAS storage:

  • NAND capacities are going through the roof; accessing all that capacity in an effective and performant way requires a re-architecting of the storage stack.
  • Compute is becoming more widespread and ubiquitous. Everything seems to have more and more compute capability, which is causing a rethink of how to take advantage of all this ubiquitous compute to better address IT (and AI) performance needs.
  • AI bandwidth and performance requirements are extreme and are only becoming more so.
  • Power has become a limiting factor in many AI deployments.

Hammerspace has addressed much of this from a software perspective with their Linux and NFS standards efforts to implement Parallel File System (PFS) and Flex Files support in the Linux kernel and in NFSv4.2. PFS and FlexFiles allow Hammerspace to offer very high file bandwidth and data mobility that can’t be supplied any other way.

So it’s time to see what can be done in hardware to make this even better. Enter OFP.

OFP, NAS storage reborn

The idea is to come up with a new packaging of an NFS (v3) server that’s all storage, with high amounts of networking and just enough compute to serve the storage. Effectively, they are putting a DPU (a compute-intensive networking card) with an 800Gbps Ethernet connection in front of a train (or toboggan) of NVMe SSDs and calling this a sled.

Their first version, using U.2 NVMe SSDs, offers 1PB of capacity with 800Gbps of networking in a 3.5″ X 1.75″ form factor. They would load an NFSv3, Linux-based storage server onto the DPU, have it run that along with the networking stack (and more), and give it access to all this storage capacity in what is essentially an NFSv3 (relatively dumb) storage sled.

Package 6 of these together with a couple of power supplies and now you have 6PB of raw capacity in 1RU, with 4.8Tbps of bandwidth, consuming 0.6 kW of power (presumably this is power consumption at idle).

You will no doubt note that the sled, as configured above, does not allow for hot (or even cold) drive replacement. So when drives fail, the NFSv3 code would need to recover from the failures and take the drives out of service, so that over time the sled could still be used even though some SSDs have failed.

In the future, moving from U.2 SSDs to E2(E) NVMe SSDs in the storage sled quadruples the capacity while staying in the same power envelope and supplying the same bandwidth. Again, the SSDs are not intended to be (hot or cold) swappable, so drive failure would need to be handled by software. With E2(E) SSDs in a sled and 6 of these sleds in 1RU, one would have 24PB of storage capacity.
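
To make the rack-level arithmetic concrete, here’s a quick back-of-the-envelope sketch using the per-sled figures cited above (1PB and 800Gbps per U.2 sled, 6 sleds per RU, 4X capacity from E2(E)):

```python
# Back-of-the-envelope OFP rack-unit math, using the per-sled figures cited above.
U2_SLED_CAPACITY_PB = 1        # 1 PB per U.2 NVMe sled
SLED_BANDWIDTH_GBPS = 800      # 800 Gbps Ethernet per sled (DPU front end)
SLEDS_PER_RU = 6               # 6 sleds plus power supplies per 1RU enclosure
E2_CAPACITY_MULTIPLIER = 4     # E2(E) SSDs quadruple per-sled capacity

ru_capacity_pb = U2_SLED_CAPACITY_PB * SLEDS_PER_RU             # 6 PB raw per RU
ru_bandwidth_tbps = SLED_BANDWIDTH_GBPS * SLEDS_PER_RU / 1000   # 4.8 Tbps per RU
e2_ru_capacity_pb = ru_capacity_pb * E2_CAPACITY_MULTIPLIER     # 24 PB per RU with E2(E)

print(f"U.2 sleds: {ru_capacity_pb} PB and {ru_bandwidth_tbps} Tbps per 1RU")
print(f"E2(E) sleds: {e2_ru_capacity_pb} PB per 1RU at the same power and bandwidth")
```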

Presumably, OFP sleds could be hot-swapped when enough SSDs in a sled fail.

And of course QLC capacities are not standing still, so another doubling of these capacities could easily be possible within the next couple of years (imagine 48PB in a single RU; boggles the mind).

The NAS software one runs in the OFP sled could be any NFSv3 server software, but Hammerspace has their own, called DSX. And when you combine DSX servers with lots of capacity and lots of networking bandwidth, Hammerspace’s NFSv4.2 PFS and FlexFiles can really fly.

And with the power and space efficiency, as well as the extreme bandwidth available, it could be a winning formula for AI environments, in contrast to scale-out NAS, the current alternative.

~~~~

But it seems to me that any organization (hyperscalers, are you listening?) with intense storage capacity and storage bandwidth needs would be very interested in the OFP for their own environment.

Comments?

The curse of Scale & AGI

For the past half decade or more, each new generation of foundation models has become significantly (10X or more) larger in parameters than the last, the presumption being that more parameters will always lead to better models, better inferences, more users, etc. This has been primarily driven by compute scaling: more compute thrown at training results in bigger models.

But the problem is that at some point any process reaches saturation, or a point of diminishing returns, where throwing more (of anything) at it yields only marginal improvement, or at least improvement not commensurate with the additional cost. It’s unclear if we are there yet with foundation models, but my guess is we are reaching that point rapidly.
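
As a toy illustration only (assuming loss follows a power law in training compute, with a made-up exponent rather than any measured frontier-model value), each additional 10X of compute buys a smaller absolute improvement than the last:

```python
# Toy illustration only: if loss follows a power law in training compute, each
# additional 10X of compute buys a smaller absolute improvement than the last.
# The exponent is made up for the sketch, not a measured frontier-model value.
ALPHA = 0.05

def loss(compute_flops):
    return compute_flops ** -ALPHA

for exp in range(20, 26):             # compute budgets from 1e20 to 1e25 FLOPs
    c = 10.0 ** exp
    gain = loss(c / 10) - loss(c)     # improvement from the most recent 10X
    print(f"1e{exp} FLOPs: loss={loss(c):.4f}, gain from last 10X={gain:.4f}")
```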

It’s interesting that ChatGPT-5 seems to have the same number of parameters as ChatGPT-4 (~1.8T).

Not being an active user of foundation models, I can’t really tell if …-5 is much better than …-4, but the consensus seems to be that models are not improving as much as they used to.

There are probably a number of reasons why this could be the case. The data wall, for one. The power and cooling cost of exponentially increasing AI model size is impacting not just training costs but inferencing costs as well. But the end of the scaling advantage may be another.

Don’t get me wrong: if it weren’t for compute scaling, we wouldn’t have the AI we have today. NN training processes were invented in the 1950s, but researchers didn’t have the compute power to use them at the time. It wasn’t until this century that computation caught up.

As more compute power became available, those old compute-bound techniques proved to be the linchpin for DNN training, and we are still riding that curve today, up to a point.

It’s just that speeding up and doing the same old DNN training will lose effectiveness at some point, if not today, then tomorrow.

I’ve seen it myself in some rudimentary models I have trained. At some point adding nodes, layers, training epochs, etc., just doesn’t always result in better models. They often get worse.
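
The effect is easy to reproduce on a toy problem. Here’s a minimal sketch (numpy only, synthetic data, nothing to do with any real foundation model) where adding capacity keeps improving the training fit while the held-out error eventually stops improving and often gets worse:

```python
import numpy as np

# Toy demonstration with synthetic data: more capacity (higher polynomial degree)
# keeps lowering training error, but past some point the held-out error typically
# stops improving and often gets worse.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(0, 0.2, x.size)   # noisy target
x_tr, y_tr = x[::2], y[::2]                      # train on every other point
x_te, y_te = x[1::2], y[1::2]                    # hold out the rest

for degree in (1, 3, 5, 9, 17):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, held-out MSE {test_mse:.3f}")
```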

AGI

And getting AGI right, I believe, will require us to take a different tack than current foundation model DNN training. Call it a hunch. But one can see glimmers of this in the fact that AGI always seems to be just a few years away.

For safety reasons, for planetary climate reasons, and because scale is not getting us there anymore, I strongly believe we need to rethink our approach to foundation model training in order to achieve AGI.

I’m no expert, but I think what needs to change is more use of (deep) reinforcement learning (DRL), not just the reinforcement learning from human feedback (RLHF) used today for fine-tuning foundation models. This would mean using DRL much earlier and more comprehensively, in all phases of foundation model training.

Yes, DRL also consumes compute infrastructure and more “training episodes” for DRL can often lead to better model outcomes, but not always.

DRL training for AGI models

For any reinforcement learning to work, one needs a reward signal that can be used to guide how the DRL model is optimized. So the real challenge in using more DRL for foundation model training is what (or who) supplies that reward signal for an action taken by the DRL model.

Historically, for games, reward signals came from the game environment (or model); for robotic motion, they can come from physics simulators or from movement in the real world.
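
To show where the reward signal sits in the loop, here’s a minimal tabular Q-learning sketch against a toy environment; the ToyEnv class is a made-up stand-in for the game engine or physics simulator that owns the reward:

```python
import random

class ToyEnv:
    """Made-up stand-in for the game engine or physics simulator that owns the reward."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Move left (action 0) or right (action 1) along a line; reaching state 5 ends the episode.
        self.state = max(0, self.state + (1 if action == 1 else -1))
        done = self.state == 5
        reward = 1.0 if done else 0.0   # the environment, not the agent, decides what gets rewarded
        return self.state, reward, done

# Tabular Q-learning: the agent only ever sees (state, reward, done) coming back from the environment.
q = [[0.0, 0.0] for _ in range(6)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1
env = ToyEnv()
for _ in range(200):
    state, done = env.reset(), False
    while not done:
        explore = random.random() < epsilon or q[state][0] == q[state][1]
        action = random.randrange(2) if explore else q[state].index(max(q[state]))
        next_state, reward, done = env.step(action)
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state

print("Greedy action per state:", [row.index(max(row)) for row in q[:5]])
```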

But any reward signal for AGI foundation models would need much more sophistication than the above.

The easy answer is to create world simulation models. Something that could simulate how the world (in total) would react to an action (or inference) of the foundation model.

But that’s not easy: world simulation models at the fidelity needed to support DRL for AGI foundation models don’t exist, and few if any researchers (AFAIK) are working on getting us there.

But there are some rudimentary baby steps that already exist. Physics engines (or models of real world physical processes) have existed for a long time now and would no doubt be the core of any world simulation model. Nature simulation models exist at least for climate and weather and these could also be incorporated into any world model.

What’s missing would be

  • Geophysical world simulations that would model how the physical earth would react to actions taken by an AGI model. I’m aware of many petroleum earth-based simulations, ditto for plate tectonics, wind, and water movement, but these would all need to be combined into something that provides entire-world geophysical reactions to model actions.
  • Biospherical world simulations that would model (at least at some level) how the biological natural world (animals, plants, fungi, microbes, etc.) would react to actions. Weather models may have some of this, at least with respect to carbon cycles, which span human-natural boundaries, but we would need a lot more.
  • Psychological world simulations, or something that would simulate how a person, and how a population of humans, would react to actions taken by a model. I am unaware of anything available at this level except for a simulation of a baby I saw at SigGraph a couple of years ago. There would need to be a lot more work here to get this up to a level that could support AGI training.
  • Sociological-political world simulations, or something that would model how human society across the world would react to model actions. Again, some of these exist, at an even more rudimentary level than financial or weather modeling, and a lot of work would be needed to get them to the level of fidelity needed for AGI training.
  • Financial-business world simulations that would determine the financial reactions to model actions. Some of these exist for national economies, but they would need to be broadened to the world at large and to much finer resolution/granularity to be suitable for AGI foundation model training.

I am certainly missing one or more critical models that may be needed for true world simulation, but these could provide a start. They would need to be combined, of course, in some fashion.

And determining the various reward weights would be non-trivial. It seems to me that each of these simulations could emit multiple reward signals for any action, and combining them all would be a challenge in itself. But those are parameter optimizations which, once we have the world models working in unison, we can tweak at will.
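
A minimal sketch of what combining those per-simulation rewards into a single training signal might look like; the signal names and weights below are purely illustrative, not a proposal:

```python
# Illustrative only: collapse per-simulation reward signals into one scalar for DRL.
# The signal names and weights are made up for this sketch, not a real design.
REWARD_WEIGHTS = {
    "geophysical": 0.2,
    "biospherical": 0.2,
    "psychological": 0.2,
    "sociopolitical": 0.2,
    "financial": 0.2,
}

def combined_reward(signals):
    """Weighted sum of per-world-simulation rewards, each assumed to lie in [-1, 1]."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in signals.items())

# An action that pays off financially but damages the biosphere nets out slightly negative.
print(combined_reward({
    "geophysical": 0.0, "biospherical": -0.8, "psychological": 0.1,
    "sociopolitical": 0.0, "financial": 0.5,
}))
```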

Then there’s the “action space” for an AGI model. For games and robotic motion, the actions are well defined and finite. For an AGI model, it would seem that the actions are potentially infinite. Even if we limited it to a single domain such as tokenized text strings, the magnitude of such an action space would be 10K-10M X anything tried before with DRL. But I still believe it’s doable.

Once we had such a model together, with a decent reward function and some way to categorize/grasp the near-infinite actions that could be taken by an AGI, DRL could be used to train an AGI.

Of course this may take a few “billion or trillion” actions/training episodes to get something worthwhile out of it.

But maybe after something like that (or 10M X that), we could create a safe and effective AGI.

~~~~

Comments?

Photo Credit(s):

  • OCP Summit 2024, AMD Hardware Optimizations for power efficient AI, presentation slide
  • Thomas Jefferson National Accelerator Facility (Jefferson Lab), flickr photo
  • SigGraph 2024, Beyond the illusion of life, Keynote presentation slide

AlphaEvolve, DeepMind’s latest intelligence pipeline

I read an article the other day from Ars Technica on AlphaEvolve (Google Deepmind creates .. AI that can invent…), published after Google announced and released their AlphaEvolve website and paper.

Essentially, they have created a pipeline of AI agents (using GeminiFlash and GeminiPro) that applies genetic/evolutionary techniques to evolve code, for anything really that can be transformed into code and that has code-based evaluation techniques to tell whether it has been improved or solved.

Genetic evolution of code has been tried before; essentially it uses various combinatorial techniques (splitting, adding, subtracting, etc.) to modify the code under evolution. The challenge with any such technique is that much of the evolved code is garbage, so you have to have some method to evaluate (quickly?) whether the new code is better or worse than the old code.

That’s where the evaluation code comes into play. It effectively executes the new code and determines a score (which could be a scalar or a vector) that AlphaEvolve can use to tell whether it’s on the right track or not. You can also have multiple evaluation functions. As an example, you could ask an LLM whether the code is simpler/cleaner/easier to understand; that way you could task AlphaEvolve not only with improving the code’s functionality but also with creating simpler/cleaner/easier-to-understand code.
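
The paper describes evaluation functions only in general terms, but conceptually they are just “run the candidate, return named scores.” Here’s a hedged sketch for a toy task (speeding up a sort routine); the function shape and names are my own illustration, not AlphaEvolve’s actual interface:

```python
import random
import time

def evaluate(candidate_sort):
    """Score an evolved sorting routine: correctness is a gate, speed is the objective.
    The dict-of-named-scores return shape is my own illustration, not AlphaEvolve's API."""
    data = [random.random() for _ in range(5000)]
    start = time.perf_counter()
    result = candidate_sort(list(data))
    elapsed = time.perf_counter() - start
    correct = result == sorted(data)
    return {
        "correct": 1.0 if correct else 0.0,          # wrong output scores zero
        "speed": 1.0 / elapsed if correct else 0.0,  # higher is better, only if correct
    }

# Scoring the starting solution (Python's built-in sorted) gives the baseline to beat.
print(evaluate(sorted))
```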

AlphaEvolve uses GeminiFlash to generate a multitude of code variations and, when that approach loses steam (no longer improving much), it invokes GeminiPro to look at the code in depth and determine strategies to make it better.

As discussed above, to use AlphaEvolve you need to supply infrastructure (compute, storage, networking), one or more evaluation algorithms/prompts (in any coding language you choose), and a starting solution (again, in any coding language you want).

As part of its process, AlphaEvolve uses a database to record all code modification attempts and their evaluation scores. This database can be used to retrieve prior modifications and take off from there again.

Results

AlphaEvolve has been tasked with historical math problems that involve geometric constructions, as well as with computing algorithm improvements and full-stack coding improvements.

For instance, the paper discusses how AlphaEvolve improved Google’s (Borg) compute scheduling algorithm, recovering roughly 0.7% of fleet-wide compute capacity across Google’s data centers.

It also found a kernel improvement that sped up Gemini training, and a simpler logic footprint for a TPU chip function.

It found a faster algorithm for 4X4 complex matrix multiplication. It found a better solution to the kissing number problem in 11 dimensions (a geometric construction). And it tackled probably 50 or more mathematical problems, coding algorithm improvements, etc.

It didn’t improve or solve everything it was tasked with, but it did manage improvements or new solutions for ~20% or so of the starting solutions it was given.

How to use it

The nice thing about AlphaEvolve is that one can have it work with a whole code repo and have it evolve only selected sections of code in that repo. All the code to be improved is marked with

#EVOLVE-BLOCK START and
#EVOLVE-BLOCK END.

This would be embedded in the starting solution. Presumably this would be in any comment format for the coding language being used.

And it’s important to note that the starting solution could be very rudimentary, and with the proper evaluation algorithms could still be used to solve or improve any algorithm.

For example, suppose you were interested in optimizing a factory production line by picking which component/finished product to manufacture next, and you had, let’s say, some sort of coded factory simulation with some way to examine the factory and evaluate whether it’s working well or not.

Your rudimentary starting algorithm could pick at random from the set of products/components currently needed, and use as evaluation the throughput of your factory, the utilization of bottleneck machinery, energy consumption, or any other easily codeable evaluation metric of interest, in isolation or in combination (each of which could make use of your factory simulation to come up with evaluation score(s)). Surround the random selection code with #EVOLVE-BLOCK START and #EVOLVE-BLOCK END and let AlphaEvolve come up with a new selection algorithm for your factory.
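
Here’s roughly what that rudimentary starting solution might look like. Only the #EVOLVE-BLOCK markers come from the paper; the factory simulator interface and all the names are hypothetical, for illustration only:

```python
import random

# Hypothetical starting solution for the factory example above. Only the
# #EVOLVE-BLOCK START / #EVOLVE-BLOCK END markers come from the AlphaEvolve paper;
# the simulator interface and names below are assumptions for illustration.

def pick_next_product(needed_products, factory_state):
    #EVOLVE-BLOCK START
    # Rudimentary policy: pick at random from whatever is currently needed.
    # (factory_state is available for smarter, evolved policies.)
    # AlphaEvolve would only be allowed to rewrite the code between these markers.
    return random.choice(needed_products)
    #EVOLVE-BLOCK END

def evaluate(simulate_factory, needed_products):
    """Run the (hypothetical) factory simulation with the current policy and return
    the metrics AlphaEvolve would try to improve."""
    result = simulate_factory(policy=pick_next_product, products=needed_products)
    return {
        "throughput": result.units_finished,                       # maximize
        "bottleneck_utilization": result.bottleneck_utilization,   # maximize
        "energy_per_unit": result.energy_used / max(result.units_finished, 1),  # minimize
    }
```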

After seeing a number (10-100-1000) of iterations of new, graded selection algorithms, you could change your evaluation/grading algorithms and start over from where you left off to get something even more sophisticated.

Deepmind has created a GitHub Jupyter notebook with some of AlphaEvolve’s mathematical solutions/improvements in case you want to see more.

They also have an AlphaEvolve early signup site in case you’re interested in trying it out.

~~~~

If I were Deepmind, I could think of probably 10K things to do with AlphaEvolve. I might rank all the functions in GeminiPro/GeminiFlash inference and training by frequency count and take the top 20% of these functions through the AlphaEvolve pipeline. Ditto for Google Cloud services, Google search, Adwords, etc.

But that would be just the start…


Reward is all you need – part 2, AGI part 12, ASI part 3

I read an article today about how current LLM technology is running out of steam as it approaches the equivalent of all current human knowledge. The article is Welcome to the Era of Experience. Apparently it’s a preprint of a chapter in an upcoming book from MIT, Designing an Intelligence. One of the authors is well known for his research in reinforcement learning and is a co-author of the textbook Reinforcement Learning: An Introduction.

Sometime back, before ChatGPT came out, there was a paper on “reward is enough” (see post: For AGI, is reward enough), which proposed that reinforcement learning with proper reward signals was sufficient to reach AGI.

Since then, attention has become the prominent road to AGI and is evident in all the LLM activity to date (see ArXiv paper: Attention is all you need).

This new paper (and presumably the book) suggests that the current AI training technology, focused on attention (to current human knowledge), will ultimately reach an impasse, a human wall if you will. Whenever it attains human levels of intelligence, the Humanity Wall, it will be unable to proceed any farther. At that point, it will track human knowledge generation but go no further.

Now, from my perspective, something like this is inherently safer than having something that can surpass human intelligence. But putting my reservations aside, the new paper on the Era of Experience shows a potential road map of sorts to achieving super-human intelligence.

Era of attention

In the case of transformers (current LLM technology), we have billion-parameter models trained to learn what the next token in a sequence should be. There are ancillary steps that handle, for instance, the tokenization of text streams and the embedding of each token (a multi-dimensional location for each portion of a word in a paragraph, for instance). These embeddings encode textual semantics and context, as well as the word part being analyzed, into a string of numbers for each token: essentially, a multi-dimensional address in textual semantic space.
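
A minimal sketch of that idea (made-up vocabulary and random embedding values, purely to show token ids becoming coordinates in a semantic space):

```python
import numpy as np

# Made-up vocabulary and random embedding values, purely to illustrate the idea:
# tokenization turns text into token ids, and an embedding table turns each id
# into a multi-dimensional coordinate ("address") in semantic space.
vocab = {"the": 0, "cat": 1, "sat": 2}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))   # 8-dimensional embeddings

token_ids = [vocab[word] for word in "the cat sat".split()]  # text -> token ids
vectors = embedding_table[token_ids]                         # ids -> semantic coordinates
print(vectors.shape)  # (3, 8): one 8-dimensional "address" per token
```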

But the big, billion+ parameter models were all essentially trained to predict what the next text token would be based on the current context. Similarly, graphical generation models went from text tokens to predicting the diffusion pixels of an image and other visual artifacts.

But pretty much all of this was based on the underlying training approach outlined in Attention Is All You Need.

The Era of Experience paper suggests that this training approach will ultimately run out of steam, and all of these models will hit the Humanity Wall, where they reach the equivalent of all human knowledge but are unable to proceed past that point.

Era of Games and Proofs

In an online reinforcement learning course I took during Covid, level 1 of the course had us code a reinforcement learning algorithm to play pong. Mind you, this ended up taking me much longer to get right than I had anticipated. But in the end it was essentially training a deep neural network as a value function (predicting whether a move was going to win or lose) to decide which direction to move the paddle, based on the ball’s current position and velocity.

For this reinforcement learning algorithm, the reward was simply 0 if the game continued, +1 if you won the game, and -1 if you lost (the ball went past your paddle).
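
In code, that reward signal is about as simple as it gets (a sketch, not the course’s actual code):

```python
def pong_reward(game_over, agent_won):
    """The reward scheme described above: 0 while play continues, +1 for a win, -1 for a loss."""
    if not game_over:
        return 0
    return 1 if agent_won else -1

print(pong_reward(False, False), pong_reward(True, True), pong_reward(True, False))  # 0 1 -1
```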

The authors discuss DeepMind’s Alpha-Proof (more of an explanation of the technology) and Alpha-Geometry2 (also described on the same page) as examples of super-human thinking capabilities, albeit only in the domain of mathematical proofs. Alpha-Proof and Alpha-Geometry2 together achieved a silver-medal level score at the prestigious International Mathematical Olympiad.

Alpha-Proof & Alpha-Geometry2 depend on Lean, a formal mathematical description language (something like coding for mathematics). So a proof request would be converted to Lean code, and Alpha-Proof and Alpha-Geometry2 would then search for proofs that can be formally verified.

Alpha-Proof was originally trained on the sum total of human-generated mathematical proofs, but then used reinforcement learning to generate hundreds of millions more proofs and trained on those, reaching the level of a super-human mathematical proof generator.

Alpha-Proof is an example of deploying AlphaZero RL technology in different domains. AlphaZero already conquered chess, shogi, and Go with super-human skill.

These achieved super-human levels of skill because human knowledge was essentially dropped out of the training loop (very early on), and from then on the algorithm trained itself on self-generated data (game play, mathematical proofs), using a game simulator and reward signal(s) to determine when play was good or bad.

Era of Experience

But the Era of Experience takes reward signals to a whole other level.

Essentially, in order to create super-human intelligence using RL, the reward function needs to become yet another deep neural network (or two). And it needs to be trained in a fashion that understands how the world, environment, humans, flora, fauna, etc. react to what a (super-human) agent is doing.

It’s unclear how you tokenize (encode) all those real-world experience signals into something a DNN could be trained on, but my guess is their book will delve into some of these topics.

But in addition to the multi-faceted reward DNN(s), in order to do effective RL one also needs a (high-fidelity) real-world simulator. This would be used much like internal game play in traditional game-playing RL algorithms, so that the super-human agent could generate 100 million agentic scenarios in simulation and determine whether they were successful or not, long before it ever attempted activities in the real world.

So there you have it: tokenization for LLM DNNs (both diffusion and text-based agentic LLM DNNs), some sort of multi-faceted reward DNNs (taking input from real and simulated world experience), and multi-faceted world simulator DNNs.

Once you have all that together, with sufficient time and processing power, and after some 100 million or so generated actions in the simulated world, you should have a super-human agent that you can unleash on the real world.

~~~~

You may wish to constrain your new super-human intelligent agent early on, to make sure the world simulation has true fidelity with the real world we live in. But after a suitable safety checkout period, one should have a super-human intelligence agent ready to take over all human thought, societal advancement, scientific research, etc.

Sound like fun!!?


Benchmarking Agentic AI using Factorio – AGI part 12

Yesterday a friend forwarded me something he saw online about a group of researchers who are using the game Factorio to benchmark AI agent solutions (PDF of paper, Github repo).

A Factorio plastic bar factory

The premise is that with an effective API for Factorio, AI agents can be tasked with creating various factories for artifacts. The best agents would be able to create the best factories.

Factorio factories can be easily judged by the number of artifacts they produce per time period and the energy used to manufacture those artifacts. They can also be graded on how many steps it takes to build those factories.
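
An illustrative scoring function along those axes might look like the following; this is my own sketch, not FLE’s actual metric:

```python
def factory_score(items_produced, minutes, energy_kj, build_steps):
    """Illustrative grading along the axes described above; not FLE's actual metric."""
    return {
        "items_per_minute": items_produced / minutes,              # higher is better
        "energy_kj_per_item": energy_kj / max(items_produced, 1),  # lower is better
        "build_steps": build_steps,                                # lower is better
    }

# A hypothetical run: 1,200 plastic bars over 10 game-minutes.
print(factory_score(items_produced=1200, minutes=10, energy_kj=54000, build_steps=85))
```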

Left: Factorio factory progression; middle: AI agent Python code that uses the Factorio API; right: agents submitting programs to the Factorio server and receiving feedback

The team has created a Factorio framework for using AI agents that create Python code to drive a set of Factorio APIs to build factories to manufacture stuff.

Factorio is a game in which you create and operate factories. From Factorio website: “You will be mining resources, researching technologies, building infrastructure, automating production, and fighting enemies. Use your imagination to design your factory, combine simple elements into ingenious structures, apply management skills to keep it working, and protect it from the creatures who don’t really like you.”

Presumably FLE (the Factorio Learning Environment, described below) has disabled the enemies and focused on just crafting and running factories all out.

FLE Results using current AI agents

FLE open-play results – for open-play, models are scored on production quantities over time; note the chart is log-log

Factorio, like other games, has an inventory of elements/components/machines used to build factories. And some of these elements are hidden until one gains enough experience in the game.

The Factorio Learning Environment (FLE) is a complete framework that can prompt agentic AI to create factories using Python code and Factorio API calls. The paper goes into great detail in its appendices as to what the AI agent prompts look like, the Factorio API, and other aspects of running the benchmark.

In the FLE as currently defined there’s “open-play” and “lab-play”.

  • Open-play tasks the agent with building a factory as large as it wants, to create as much product as possible. The open-play winner is the AI agent whose factory manufactures the most widgets (iron plates) in the time available for the competition.
  • Lab-play tasks the agent with building factories for 24 specific items under limited resource and time constraints, and the winner is the AI agent able to build the most of these lab-play factories successfully within those constraints.

FLE lab-play (select) results – there were 24 tasks in the lab-play list; no agent completed all of them, but Claude did the best on the 5 that were completed by most agents

The team benchmarked 6 frontier LLM agents: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct, using them for both open-play and lab-play.

The overall winner for both open-play and lab-play was Claude 3.5-Sonnet, by a wide margin. In open-play it was able to create a factory that manufactured over 290K iron plates (per game minute, we think), and in lab-play it was able to construct more factories (7 out of 24) than any other AI agent.

FLE Overall AI Agent Results

The FLE researchers listed some common failings of AI agents under test:

  • Most agents lack spatial understanding
  • Most agents don’t handle or recover from errors well
  • Most agents don’t have long enough planning horizons
  • Most agents don’t invest enough effort in research (finding out what new Factorio machines do and how they could be used).

They also mentioned that AI agent coding skills seemed to be a key indicator of FLE success and coding style differed substantially between the agents. The researchers characterized agent (Python) coding styles and determined that Claude used a REPL style with plenty of print statements while GPT-4o used more assertions in its code.

Example of an FLE program used to create a simple automated iron-ore miner. In step 1 the agent uses a query to find the nearest resources and place a mine. In step 3 the agent uses an assert statement to verify that its action was successful.

IMHO, as a way to measure AI agent ability to achieve long term and short term goals, at least w.r.t. building factories, this is the best I’ve seen so far.

More FLE Lab-play scenarios

I could see a number of additional lab-play benchmarks for FLE:

  • One focused on drug/pharmaceuticals manufacturing
  • One focused on electronics PCB manufacturing
  • One focused on chip manufacturing
  • One focused on nano technology/meta-materials manufacturing, etc.

What’s missing from all these benchmarks is the actual science and research needed to come up with the new drugs, new electronics, and new meta-materials that would be the end products of Factorio-style factories. I guess that would require building labs, running scientific experiments, and understanding (simulated) results.

Although in the current round of FLE benchmarks, for one AI agent at least (Claude), there seemed to be a lot of research into how to use different Factorio tools and machinery.

Ultimate FLE

If FLE succeeds as an AI agent benchmark, most agentic AI solutions will start being trained to do better on it. Doing so should of course lead to better scores by AI agents.

Now, people much more familiar with the game than I am say it’s not a great simulation of the real world. There’s only one type of fuel, the boiler is either on or off, and numerous other simplifications of the real world are used throughout. And thankfully, for the moment, there’s no linkage to actions that impact the real world.

But in reality, simulations like this are all just stepping stones to AI capabilities. And simulations are all just code; it should not be that hard to increase their fidelity to the real world.

Getting beyond simulation to real-world factories is probably the much larger step. This would require a physical (not unlimited) inventory of parts, cabling, machines, and belts; real mineral/petroleum deposits; real-world physical constraints on where factories could be built; etc. Not to mention the physical automation/robotics that would allow a machine to be selected out of inventory, placed at a specific location inside a factory, and connected to power and assembly lines.

~~~~

One common motif in AGI existential crises is that some AGI (agent) will be given the task of building a paperclip factory and turn the earth into one giant factory, inadvertently killing all life on the planet, including, of course, humankind.

So training AI agents on “open-play” has ominous overtones.

It would be much better, IMHO, if one could somehow add human settlements, plant, animal and sea life, ecosystems, etc. to Factorio, so that there would be natural components that, if ruined/degraded/destroyed, would reduce AI agent scores on the benchmarks.

Alas, there doesn’t appear to be anything like this in the current game.


Data Centers on the Moon !?

I was talking with Chris Stott of Lonestar and Sebastian Jean of Phison the other day and they were discussing placing data centers in lunar orbit, on the surface of the moon or in lava tubes on the moon.

The reason commercial companies, governments, and other organizations would be interested in doing this is that their data could be kept free from natural disasters, terrorist activities, war, and other earth-based calamities.

Lunar data centers could be the ultimate Iron Mountain or DR solution. You’d back up your corporate data to their data centers on the moon and could restore from them whenever you needed to.

The questions are: can it be done technically, can it be done economically, and can it clear the regulatory hurdles to make it happen?

Lonestar’s CEO, Chris Stott, says the regulatory hurdles are underestimated by many who haven’t done much in space, but Lonestar believes it has all the authorizations it needs to make it happen.

The technical hurdles abound, however:

  • Bandwidth up and down from lunar orbit/surface needs to be significant: Gbps and then some. It’s one thing to ship customer data in a ready-to-deploy data center storage solution, but another to update that data over time. Most organizations create TBs if not PBs of data on a monthly if not weekly basis. All that data would need to be sent up to the lunar data centers and written to storage there, for every customer they have.
  • Power and cooling seem to be a concern in the vacuum of space or on the lunar surface. Most space electronics are cooled by a form of liquid cooling, which is known technology. And most power requirements in space are supplied (at least in near-earth orbit) via solar panels.
  • Serviceability: in any massive data center today, hardware is going down, software needs to be updated, and operations and development are constantly tweaking what occurs. Yes, you can build in fault tolerance, redundancy, and all the automatic code/firmware lifecycle management routines you want. But at some point, some person (or thing) has to go replace a server board, drive, or cable, and doing that on the moon or in lunar orbit would require humans and a space walk, or sophisticated robots that could operate there.
  • Radiation: space is considered a hard radiation environment. Cosmic rays and other radiation sources are abundant, and outside the earth’s magnetic field, which shields us from much of this, the environment is extremely harsh. In the past this required RAD-hardened electronics, which typically were at least a decade, if not 2 or 3 decades, behind leading-edge technologies.
  • Data sovereignty regimes require that some data not be transferred across national boundaries. How this relates to space is the question.

As for bandwidth, it all depends on how much spectrum one can make use of: the more spectrum you license, the higher the transfer speeds to/from the moon you can support. And there’s also the potential for optical (read: laser) communications, at least from point to point in space and maybe from space to earth’s surface, that can boost bandwidth.

NASA has tested optical links from the moon and from the ISS. They seem to work very well going from space to Earth, but not so well in the other direction – go figure. Lonestar has licensed sufficient radio frequency bandwidth to support Gbps up and down transfer speeds.
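
To put the bandwidth hurdle in perspective, here’s some rough transfer-time arithmetic (ignoring protocol overhead, retransmits, and link availability windows):

```python
# Rough transfer-time arithmetic for shipping backup data to a lunar data center.
# Ignores protocol overhead, retransmits, and link availability windows.
def transfer_days(data_terabytes, link_gbps):
    bits = data_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 86400

for tb, gbps in [(100, 1), (1000, 1), (1000, 10)]:
    print(f"{tb:>5} TB at {gbps:>2} Gbps ≈ {transfer_days(tb, gbps):.1f} days")
```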

Lonestar says cooling is free in space. Liquid cooling is becoming more and more viable as GPUs and AI accelerators start consuming kW if not MW of power to do their thing. And the fact that deep space sits at 2.7 K means that cooling shouldn’t be a problem, as long as you can dissipate the heat via radiation. Convection doesn’t work without a medium to work in, and in the vacuum of space, and presumably on the moon’s surface, that means radiation is essentially the only way to shed heat.
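
A rough Stefan-Boltzmann estimate of how much radiator area that implies (ignoring incoming solar flux and lunar-surface IR, both of which make things worse; the emissivity and radiator temperature are assumptions, not Lonestar figures):

```python
# Rough radiator sizing via Stefan-Boltzmann: P = emissivity * sigma * area * T^4.
# Ignores incoming solar and lunar-surface IR loading; emissivity and radiator
# temperature below are assumptions, not Lonestar figures.
SIGMA = 5.67e-8        # W / (m^2 * K^4)
EMISSIVITY = 0.9       # assumed radiator coating emissivity
RADIATOR_TEMP_K = 300  # assumed radiator surface temperature

watts_per_m2 = EMISSIVITY * SIGMA * RADIATOR_TEMP_K ** 4   # ~413 W/m^2
for load_kw in (1, 100, 1000):
    area_m2 = load_kw * 1000 / watts_per_m2
    print(f"{load_kw:>5} kW of IT load -> ~{area_m2:,.0f} m^2 of radiator at {RADIATOR_TEMP_K} K")
```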

They also say that power is unlimited in space, as long as you can send up and deploy sufficient solar panels to sustain that power. Solar panels do deteriorate over time, so that might be a concern limiting the lifetime of these data centers. But presumably, with enough solar panels, that shouldn’t be the critical path.

Can a data center today be run without servicing? Microsoft’s Project Natick experimented with undersea data centers (see our undersea data centers post). The main problem with these was that they were dumping heat into local ecosystems, and for some reason fish and other sea life didn’t like it. Microsoft has since abandoned undersea data centers. But they did prove that a data center could be run for years without any need for servicing.

Historically, electronics sent to space or the moon have all been RAD hardened, which necessitated using older and more expensive versions of electronics. I’m not sure, but I read once that today’s cell phone has more computing power than all of NASA had in 1969.

But lately there’s been keen interest in using state-of-the-art, commercial off-the-shelf (COTS) electronics. Lonestar said the Mars helicopter ran off what was essentially an Android phone’s CPU.

The key to the use of COTS electronics in space is the newer forms of radiation shielding available today. Nonetheless, the radiation environment in lunar orbit, on the moon’s surface, or in lunar lava tubes is not that well known. So one of Lonestar’s experimental payloads is to monitor the radiation environment from earth launch to the moon’s surface in much greater detail than what’s been available before.

As for data sovereignty in space, it’s apparently solved. Multi-nation payloads are often deployed from the same spacecraft, and space law states that any nation’s payload is the responsibility of that nation. So technically, each data regime could be isolated within its own data center equipment and not have to intermix with other nations’ data/storage. Yes, they would all share power, cooling, and communications links, but that’s apparently not an issue, and encryption could keep the communication links’ data secure, if desired.

So whether you can place a data center in lunar orbit, on the lunar surface, or in lunar lava tubes is all being investigated by Lonestar and their technical partners like Phison.

Whether it can be done at a price that customers on earth would pay is another question. But apparently Lonestar already has customers signed up.

Are data centers in lunar orbit or on the moon any more resilient or available than data centers on earth?

Yes, there are no wildfires on the moon, no hurricanes, no earthquakes, no floods, etc. But there are bound to be other lunar-based dangers; solar storms and moon dust come to mind. And the environment inside lunar lava tubes is a complete unknown.

And of course anything attached to communications links is also susceptible to cyber threats, whether on earth or in space.

And man-made threats, in lunar orbit or on the surface of the moon, are not out of the question. Yes, it’s highly unlikely today and for the foreseeable future, but then anti-satellite weapons were considered unlikely early on, too.

~~~~

Speaking of man-made threats, apparently China already has a data center in lunar orbit or on the surface of the moon.

Comments?


Nexus by Yuval Noah Harari, AGI part 12

This book is all about how information networks have molded man and society over time, and what’s happening to these networks with the advent of AI.

In the earliest part of the book he defines information as essentially “that which connects and can be used to create new realities”. For most of human history, reality came in two forms:

  • Objective reality, a shared belief in things that can be physically tasted, touched, seen, etc., and
  • Subjective reality, which was entirely internal to a single person and seldom shared in its entirety.

With mankind’s information networks came a new form of reality: inter-subjective reality. As inter-subjective reality was external to the person, it could readily be shared, debated, and acted upon to change society.

Information as story

He starts out with the first information network: the story, or rather the shared story. The story and its sharing across multiple humans let human society expand beyond bands of hunter-gatherers. Stories led to the first large societies of humans; the information flow looked like human-story and story-human, and it created the first inter-subjective realities. Shared stories still impact humanity today.

As we all know, stories passed verbally from one person to another often undergo minor changes. That’s not much of a problem for stories, as the plot and general ideas are retained. But for inventories, tax receipts, and land holdings, small changes can be significant.

What transpired next was a solution to this problem. As these societies became larger and more complex, there arose a need to record lists of things, such as plots of land, taxes owed/received, inventories of animals, etc. And lists are not something that can easily be woven into a story.

Information as printed document

Thus the clay tablets of Mesopotamia and elsewhere were created to permanently record lists. And the clay tablet is just an early form of the printed document.

Whereas the story led to human-story and story-human interactions, printed documents led to human-document and document-human information flow. Printed documents expanded the inter-subjective reality sphere significantly.

But the invention of printed documents (or clay tablets) caused another problem: how to store and retrieve them. There arose in these times the bureaucracy, run by bureaucrats, to create storage and retrieval systems for vast quantities of printed documents.

Essentially, with the advent of clay tablets, something had to be done to organize and access these documents, and the bureaucrat became the person who did this.

With bureaucracy came obscurity, restricted information access, and limited visibility/understanding into what bureaucrats actually did. Perhaps one could say that this created human-bureaucrat-document and document-bureaucrat-human information flow.

The holy book

Next he talks about the invention of the holy book, i.e. the Hebrew Bible, the Christian New Testament, the Koran of Islam, etc. They all attempted to explain the world, but over time their relevance diminished.

As such, there arose a need to “interpret” the holy books for the current time.

For the Hebrews, this interpretation took the form of the Mishnah and Talmud; for Christians, the books of the New Testament, the epistles, and the Christian Church. I presume similar activities occurred for Islam.

Following this, he touches on the telegraph, radio, and TV, but they are mostly given short shrift compared to the story, printed documents, and holy books, as all of these are just faster ways to disseminate stories, documents, and holy books.

Different information flows in democracies vs. tyrannies

Throughout the first third of the book, he weaves in how different societies, such as democracies and tyrannies/dictatorships/populist regimes, have different information views and flows. As a result, they support entirely different styles of information networks.

Essentially, in authoritarian regimes all information flows into and out of the center, and ultimately the center decides what is disseminated. There’s absolutely no interest in finding the truth, just in retaining power.

In democracies, there are many different information flows, operating in a mostly uncontrolled fashion, and together they act as checks and balances on one another to find the truth. Sometimes this is corrupted, or fails to work for a while in order to maintain order, but over time the truth always comes out.

He goes to some length describing how these democratic checks-and-balances information networks function, in isolation and together. In contrast, tyrannical information flows ultimately get bottled up and lead to disaster.

The middle ~1/3 of the book touches on inorganic information networks: those run by computers, for computers, which ultimately run in parallel to human information flows. They are different from the printing press; they are always on, but often flawed.

Non-human actors added to humanity’s information networks

The last 1/3 of the book takes these information network insights and shows how the emergence of AI algorithms is fundamentally altering all of them. By adding a non-human actor with its own decision capabilities into the mix, AI has created a new form of reality, an inter-computer reality, which has its own logic, ultimately unfathomable to humans.

Rohingya refugees in camp

Even a relatively straightforward (dumb) recommendation engine, whose expressed goal is to expand and extend interaction on a site/app, can learn how to do this in such a way as to have unforeseen societal consequences.

This had a role to play in the Rohingya Genocide, and we all know how it impacted the 2016 US elections and continues to impact elections to this day.

In this last segment he articulates some reasonable solutions to AI and AGI risks. It’s all about proper goal alignment and using computer AIs, together with humans, to watch other AIs.

Sort of like the fox…, but it’s the only real way to enact some form of control over AI. We will discuss these solutions at more length in a future post.

~~~~

In this blog we have talked many times about the dangers of AGI. What surprised me in reading this book is that AI doesn’t have to reach AGI levels to be a real danger to society.

A relatively dumb recommendation engine can aid and abet genocide, disrupt elections, and change the direction of society. I knew this, but thought the real danger to us was AGI. In reality, the danger is improperly aligned AI in any and all its forms. AGI just makes all this much worse.

I would strongly suggest every adult read Nexus; there are lessons within for all of humanity.
