Cognitive computing – Silverton Consulting

Context, tokens, KV stores & storage, Solidigm presents at #AIFD8

Posted on May 29, 2026 by Ray in AI storage needs, MLOps, NVMe storage, RAG-LLM, SSD storage, System effectiveness

Solidigm presented (video here) at AIFD8 this month, and as part of their presentation they spent time dissecting what happens to a prompt, how token growth happens, and where storage can help speed up prompt processing.

The token count explosion

It all starts with a simple prompt, something like “run a benchmark against a drive”, maybe a 12 token prompt, but when it actually gets processed it can balloon into something much larger. As an LLM processes the prompt it goes through a number of steps: building context, calling tools, obtaining and interpreting results, persisting knowledge and, finally, responding to the prompt.

Digging a level deeper, here’s what the token counts look like during prompt processing. The first step is to understand the environment of the prompt: rules, safety requirements and the methodology at its disposal. Then there’s retrieval activity that gathers the information needed to actually process and perform the prompt, followed by identifying the tools and APIs needed. Once the LLM has all that, it plans out the steps needed to actually perform the prompt; tool results are generated, interpreted and fed back into LLM processing to determine the next step. At some point, prompt processing completes and the prompt reply is sent back to the issuer.

As one can see in the above, the prompt itself was minuscule in token count in the grand scheme of the activity needed to process it. And this is how just one (albeit complex) ~12 token prompt can grow into a 42K token context.

Inferencing and Time To First Token

Inferencing consists of two phases:

PreFill phase – the processing that takes the context token stream and converts it into a KV (Key:Value) store which the LLM can use for subsequent processing so it doesn’t have to go back to the token context. PreFill ends up with a fully populated KV store representing all the tokens in the current context, and generates the first token in the LLM response to the prompt.
Decode – all subsequent processing needed to generate the rest of the prompt response. It uses that KV store to underpin its processing as it generates any more tokens needed to answer the prompt.
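As a rough illustration of the two phases above, here is a toy single-head attention loop in plain Python. The embedding function, the vocabulary size and the next-token rule are all made up for illustration; a real LLM uses learned weights across many layers and heads. What it does show is that PreFill populates one KV entry per context token and emits the first response token, while Decode appends a single KV entry per generated token and reuses everything already stored.

```python
import math

def embed(token_id, dim=4):
    # Hypothetical deterministic "embedding"; a real model uses learned weights.
    return [math.sin(token_id * (i + 1)) for i in range(dim)]

def attend(query, keys, values):
    # Scaled dot-product attention of one query over every cached key/value.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def next_token(vector, vocab=50):
    # Toy rule for turning an attention output into a token id.
    return int(abs(sum(vector)) * 1000) % vocab

def prefill(context):
    # Build one K and one V entry per context token, then emit the first token.
    cache = {"k": [], "v": []}
    for token in context:
        e = embed(token)
        cache["k"].append(e)  # toy simplification: key == value == embedding
        cache["v"].append(e)
    out = attend(embed(context[-1]), cache["k"], cache["v"])
    return cache, next_token(out)

def decode(cache, first_token, steps):
    # Each step adds one KV entry and attends over the whole cache;
    # the context tokens are never reprocessed.
    tokens = [first_token]
    for _ in range(steps):
        e = embed(tokens[-1])
        cache["k"].append(e)
        cache["v"].append(e)
        tokens.append(next_token(attend(e, cache["k"], cache["v"])))
    return tokens

context = [5, 17, 3, 42]
cache, first = prefill(context)
tokens = decode(cache, first, steps=3)
print(len(cache["k"]), "KV entries after", len(tokens), "response tokens")
```

Note that the cache grows by one entry per token in both phases, which is why longer contexts and longer responses both translate directly into larger KV stores.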

Solidigm went on to describe how these activities impact the Time To First Token (TTFT), or how long it takes from the time the prompt is issued until the LLM responds with the first word (token) of the prompt response.

(Although in Solidigm’s chart they show Decode in the TTFT path. I believe this is incorrect, as PreFill generates the first token. Nonetheless, there is a portion of PreFill that “decodes” the first token of the prompt response, and I assume that’s what they are showing here. Of course I could be mistaken.)

Storage can impact both the time it takes to assemble context tokens and to perform PreFill.

While storage can matter a lot during context assembly (lots of potential IO activity reading files, RAGs and other documents), storage’s impact on PreFill is less widely known. That is, until you understand how prompt processing can be held up for KV store recalculation (going back to context tokens and rebuilding some or all of the current KV store for the prompt).

Increasing context, leads to more tokens, leads to larger KV stores, all of which impacts TTFT

Although it’s only conjecture on my part, the biggest portion of the Tprefill above seems to be calculating and converting context/memory tokens into KV elements stored in the prompt’s KV store. KV stores are used during downstream prompt processing because they can be easily accessed and each KV item represents interpreted token information in a form the LLM can use directly.

And what’s not evident in the above TTFT decomposition chart is that tool use generates even more tokens, in the form of tool results, all of which need to be processed into more KV store elements in order to determine what to do next.

What happens to large KV stores during prompt processing

If there is a single GPU running a single prompt it’s possible, depending on model and HBM size, that it will run out of GPU HBM memory and offload or move some portion of its KV (store) cache to CPU memory. But if that GPU is processing 100s to 1000s of prompts concurrently, even CPU memory may not be large enough to hold every KV cache segment that no longer fits in GPU HBM. And of course most enterprise AI servers hold anywhere from 4 to 10 GPUs, each running 100s to 1000s of prompts concurrently.
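To put rough numbers on why KV stores overflow HBM, here is a back-of-the-envelope sizing sketch. The formula (one key and one value vector per token, per layer, per KV head) is the usual way to size a transformer KV cache, but the model shape below is a hypothetical one chosen for illustration, not any particular product, and the 42K token context is the figure quoted earlier in the post.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one prompt: a key and a value vector
    per token, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model shape (an assumption): 32 layers, 8 KV heads of
# dimension 128, 16-bit values.
MODEL = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **MODEL)
per_prompt = kv_cache_bytes(42_000, **MODEL)   # 42K token context
per_gpu = per_prompt * 1_000                   # 1000 concurrent prompts

print(f"per token    : {per_token / 1024:.0f} KiB")
print(f"per prompt   : {per_prompt / 1e9:.1f} GB")
print(f"1000 prompts : {per_gpu / 1e12:.1f} TB")
```

Under these assumptions a single 42K token prompt needs about 5.5 GB of KV cache, and a thousand of them need about 5.5 TB, well beyond the HBM on any single GPU and beyond typical CPU memory as well.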

KV cache offload is where fast storage can significantly speed up prompt processing

There’s an obvious tradeoff here with respect to KV stores. One can always go back to the Prefill phase, reread all the tokens in current context and recompute the KV store or one can offload KV store segments to memory, local storage or network storage and later retrieve the already computed KV store from wherever it ended up.

The tradeoff is how long it takes to recompute vs. how long it takes to do the data transfers to offload and retrieve the KV cache segments. Larger contexts increase KV store size, which leads to more need to offload or jettison KV store segments when running out of GPU HBM space. Both approaches, caching KV segments to memory or storage and jettisoning KV store segments and reconstituting them later, add time to TTFT. The question is which is faster.
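The recompute-versus-reload question can be framed as a simple comparison of two times. The sketch below is a first-order model only: it ignores queuing, write-out cost and partial reuse, and every number in it (PreFill throughput, storage read bandwidth, KV bytes per token) is a hypothetical placeholder rather than a measured or vendor-supplied figure.

```python
def recompute_seconds(tokens, prefill_tokens_per_sec):
    # Time to rerun PreFill over the context and rebuild the KV store.
    return tokens / prefill_tokens_per_sec

def reload_seconds(kv_bytes, read_bytes_per_sec):
    # Time to read an already computed KV store back from memory or storage.
    return kv_bytes / read_bytes_per_sec

def cheaper_path(tokens, kv_bytes_per_token, prefill_tokens_per_sec,
                 read_bytes_per_sec):
    recompute = recompute_seconds(tokens, prefill_tokens_per_sec)
    reload = reload_seconds(tokens * kv_bytes_per_token, read_bytes_per_sec)
    choice = "reload" if reload < recompute else "recompute"
    return choice, recompute, reload

# Hypothetical inputs: 42K token context, 128 KiB of KV per token,
# 10,000 tokens/s of PreFill throughput, 10 GB/s of read bandwidth.
choice, recompute, reload = cheaper_path(
    tokens=42_000,
    kv_bytes_per_token=128 * 1024,
    prefill_tokens_per_sec=10_000,
    read_bytes_per_sec=10e9,
)
print(f"{choice}: recompute {recompute:.2f}s vs reload {reload:.2f}s")
```

With these placeholder numbers reloading wins (about 0.55s vs 4.2s), but dropping the read bandwidth to 1 GB/s flips the answer, which is the point: the faster the storage, the more often offload beats recomputation.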

One can see how this becomes ever more of an issue as prompts’ token counts (& KV elements) skyrocket, and as more prompts run concurrently on the same GPU(s) in a single server.

Obviously local, large SSDs with very fast random read would be ideal for KV cache offload activity, which has the KV cache segment written out once (and extended as prompt processing adds context) but read back multiple times. That is a great application for large capacity, fast read NVMe SSDs which, I must say, are Solidigm’s forte.

NVIDIA and others have started to add KV cache offloading to their inferencing stacks. As they do, large, fast NVMe SSD activity during AI prompt processing will become one of the critical factors in TTFT.

In the meantime, if anyone has any large, fast NVMe SSDs they don’t need anymore, please let me know. 🙂

Hammerspace and the Open Flash Platform at #AIIFD3

Posted on September 19, 2025 by Ray in AI storage needs, Ethernet, File Storage, Storage density, Storage performance, Strategic Inflection Points

Was at AI Infrastructure Field Day 3 (AIIFD3) last week in CA and Hammerspace presented (videos here). Molly and Floyd talked about their solution and some of their recent MLCommons performance results, but Kurt discussed the Open Flash Platform (OFP) Consortium, announced last July, which they and partners have been working on.

OFP currently has 6 partners, including Hammerspace (storage software supplier), SK Hynix (NAND and SSDs) and the Linux Foundation, as well as end users (Los Alamos National Labs), computational storage (ScaleFlux) and AI solution providers (Xsight).

As I understand it, the OFP is pushing to become a standard adopted by the Open Compute Project (OCP).

OFP is an attempt to redefine NAS as we know it. Hammerspace has been on this journey for a long time with their software only solution but technology is now at a place where it’s time to tackle hardware changes to NAS that would enable even better performance and throughput for AI and other data intensive workloads.

Some of the technology changes driving the need for a different approach to NAS storage:

NAND capacities are going through the roof. Accessing all that capacity in an effective and performant way requires a re-architecting of the storage stack.
Compute is becoming more widespread and ubiquitous. Everything seems to have more and more compute capability, which is causing a rethink as to how to take advantage of all this ubiquitous compute to better address IT (and AI) performance needs.
AI bandwidth and performance requirements are extreme and are only becoming more so.
Power has become a limiting factor in many AI deployments.

Hammerspace has addressed much of this from a software perspective with their Linux standards efforts to implement Parallel File System and Flex Files in the Linux kernel and in NFS standards as NFSv4.2. PFS and FlexFiles allow Hammerspace to offer very high file bandwidth and data mobility that can’t be supplied any other way.

So it’s time to see what can be done in hardware to make this even better. Enter OFP.

OFP, NAS storage reborn

The idea is to come up with a new packaging of an NFS (v3) server that’s all storage, with high amounts of networking and enough compute to serve the storage. Effectively they are putting a DPU (computationally intensive networking card) with 1-800Gbps Ethernet connection in front of a train (or toboggan) of NVMe SSDs and calling this a sled.

Their first version, using U.2 NVMe SSDs, offers 1PB of capacity with 800Gbps of networking in a 3.5″ X 1.75″ form factor. They would load an NFSv3 Linux based storage server onto the DPU, run it there along with the networking stack (and more), and give it access to all this storage capacity in what is essentially an NFSv3 (relatively dumb) storage sled.

Package 6 of these together with a couple of power supplies and now you have 6PB raw capacity in 1RU, with 4.8Tbps of bandwidth, consuming 0.6 kW of power (presumably this is power consumption at idle).

You will no doubt note that the sled, as configured above, does not allow for hot (or even cold) drive replacement. So when drives fail, the NFSv3 code would need to recover from the failure and take those drives out of service, so that over time the sled could still be used even though some SSDs have failed.

In the future, moving from U.2 SSDs to E2(E) NVMe SSDs in the storage sled quadruples the capacity while staying in the same power envelope and supplying the same bandwidth. Again the SSDs are not intended to be (hot or cold) swappable, so drive failure would need to be handled by software. With E2(E) SSDs in a sled and 6 of these in a 1RU, one would have 24PB of storage capacity.

Presumably, OFP sleds themselves could be hot swappable, to be replaced when enough SSDs in a sled fail.

And of course QLC capacities are not standing still so another doubling of these capacities could easily be possible within the next couple of years (imagine 48PB in a single RU, boggles the mind).

The NAS software one runs in the OFP SLED could be any NFSv3 server software but Hammerspace has their own, called DSX. And when you combine DSX servers with lots of capacity and lots of networking bandwidth, Hammerspace’s NFSv4.2 PFS and FlexFiles can really fly.

And with the power and space efficiency as well as extreme bandwidth available, it could be a winning formula for the AI environments, in contrast to scale-out NAS which is the current alternative.

~~~~

But it seems to me any organization (hyperscalers, are you listening?) with intense storage capacity and storage bandwidth needs would be very interested in the OFP for their own environment.

Comments?

The curse of Scale & AGI

Posted on August 13, 2025 by Ray in AGI, Reinforcement Learning, Strategic Inflection Points

For the past half decade or more, each new generation of foundation models has been significantly (10X or more) larger in parameters than the last. The presumption is that more parameters will always lead to better models, better inferences, more users, etc. This has been primarily driven by compute scaling: more compute thrown at training results in bigger models.

But the problem is that at some point any process reaches saturation, or a point of diminishing returns, where throwing more (of anything) at it yields only marginal improvement, or at least improvement not commensurate with the additional cost. It’s unclear if we are there yet with foundation models, but my guess is we are approaching it rapidly.

It’s interesting that ChatGPT-5 seems to have the same number of parameters as ChatGPT-4 (~1.8T).

Not being an active user of foundation models, I can’t really tell if …-5 is much better than …-4, but consensus seems to be that they are not improving as much as they used to.

There are probably a number of reasons why this could be the case. The data wall for one. The power and cooling cost of exponentially increasing AI model size is impacting not just training costs but inferencing costs as well. But the end of the scaling advantage may be another.

Don’t get me wrong, if it wasn’t for compute scaling we wouldn’t have the AI we have today. NN training processes were invented in the 50s of the last century, but they didn’t have the compute power to use them at the time. It wasn’t until this century that computation caught up.

As more compute power became available, those old compute bound techniques proved to be the lynchpin for DNN training and we are still riding that curve today, up to a point.

It’s just that speeding up and doing the same old DNN training will lose effectiveness at some point, if not today, then tomorrow.

I’ve seen it myself in some rudimentary models I have trained. At some point adding nodes, layers, training epochs, etc., just doesn’t always result in better models. They often get worse.

AGI

And AGI, I believe, will require us to take a different tack than current foundational model DNN training to get right. Call it a hunch. But one can see glimmers of this in the fact that AGI is always just years away.

In order to achieve AGI, for safety reasons, for planetary climate reasons, and because scale is not getting us there anymore, I strongly believe we need to rethink our approach to foundation model training.

I’m no expert but I think what needs to change is more use of (deep) reinforcement learning (DRL), not just the reinforcement learning from human feedback (RLHF) used today for fine tuning foundation models. This would mean using DRL much earlier and more comprehensively, in all phases of foundational model training.

Yes, DRL also consumes compute infrastructure and more “training episodes” for DRL can often lead to better model outcomes, but not always.

DRL training for AGI models

For any reinforcement learning to work, one needs a reward signal that can be used to signal how to optimize the DRL model. So, the real challenge in the use of more DRL for foundation model training is what (or who) supplies that reward signal from some action taken by the DRL model.

Historically, for games, reward signals came from the game environment (or model); for robotic motion, they can come from physics simulators or movement in the real world.

But any reward signal for AGI foundation models would need much more sophistication than the above.

The easy answer is to create world simulation models. Something that could simulate how the world (in total) would react to an action (or inference) of the foundation model.

But that’s not easy. World simulation models at the fidelity needed to support DRL for AGI foundation models don’t exist, and few if any researchers (AFAIK) are working on getting us there.

But there are some rudimentary baby steps that already exist. Physics engines (or models of real world physical processes) have existed for a long time now and would no doubt be the core of any world simulation model. Nature simulation models exist at least for climate and weather and these could also be incorporated into any world model.

What’s missing would be

Geophysical world simulations that would model how the world would react to actions taken by an AGI model. I’m aware of many petroleum earth based simulations, and ditto for plate tectonics, wind, and water movement, but these would all need to be combined into something that provides entire-world geophysical reactions to model actions.
Biospherical world simulations that would model (at least at some level) how the (biological, i.e. animal, plant, fungi, microbe, etc.) natural world would react to actions. Weather models may have some of this, at least with respect to carbon cycles which span human-natural boundaries but we would need a lot more.
Psychological world simulations, or something that would simulate how a person and how a population of humans would react to actions taken by a model. I am unaware of anything available at this level except for a simulation of a baby I saw at SigGraph a couple of years ago. There would need to be a lot more work here to get this up to a level to support AGI training.
Sociological-Political world simulations or something that would model how human society across the world would react to model actions. Again some of these exist, at an even more rudimentary level than financial or weather modeling, and we would need a lot of work to get them to a level of fidelity needed for AGI training.
Financial-Business world simulations that would determine the financial reactions to model actions. Some of these exist for national economies, but they would need to be broadened to the world at large and to much finer resolution and granularity to be suitable to support AGI foundational model training.

I am certainly missing one or more critical models that may be needed for true world simulations, but these could provide a start. They would need to be combined, of course, in some fashion.

And determining the various reward weights would be non-trivial. It seems to me that each of these simulations could have multiple reward signals for any action. Combining them all may be non-trivial. But those are parameter optimizations, which once we have world models working in unison we can tweak at will.

Then there’s the “action space” for an AGI model. For games and robotic motion, the actions are well defined and finite. For an AGI model, it would seem that the actions are potentially infinite. Even if we limited it to a single domain such as tokenized text strings, the magnitude of such actions would be 10K-10M X anything tried before with DRL. But I still believe it’s doable.

Once we had such a model together, with a decent reward function and had some way to categorize/grasp the infinite actions that could be taken by an AGI, DRL could be used to train an AGI.

Of course this may take a few “billion or trillion” actions/training episodes to get something worthwhile out of it.

But maybe after something like (or 10M X) that we could create a safe and effective AGI.

~~~~

Comments?

Photo Credit(s):

OCP Summit 2024, AMD Hardware Optimizations for power efficient AI, presentation slide
Thomas Jefferson National Accelerator Facility (Jefferson Lab), flickr photo
SigGraph 2024, Beyond the illusion of life, Keynote presentation slide

AGI, SuperIntelligence and “The Last Man”

Posted on May 30, 2025 by Ray in AGI, AI Agents, Cognitive computing, Executive leadership

Nietzsche wrote about the last man in Thus Spoke Zarathustra (see Last Man wikipedia article). There’s much to dislike about Nietzsche’s writing but every once in a while there are gems to be found. (Sorry for the sexist statement, it’s not me, blame Nietzsche).

In Zarathustra, Nietzsche talks of the Last Man with contempt. They no longer struggle in their daily life. They no longer create. They have an easy life filled with leisure and entertainment and no work to speak of.

From AGI to SuperIntelligence

I’ve discussed AGI many times before (I think we are up to AGI part 12, so this would be part 13, and ASI (Artificial SuperIntelligence) part 3, so this would be part 4, but I’m thinking numbering them is not helping anymore). I’ve covered how to get there, the existential risk of getting there, and many other facets of the risks and rewards of AGI. (Ok, less on the rewards…).

I’ve also discussed Artificial SuperIntelligence (ASI). This is what we believe can be attained after AGI. If one were to use AGI to improve AI training algorithms, AI hardware and AI inferencing, and to generate massive amounts of new scientific research/political research/economic research, etc., one could use the new data, the better training, inferencing, and AI hardware to create an ASI agent.

The big debate in the industry is how fast can one go from AGI to ASI. I don’t believe there’s any debate in the industry that SuperIntelligence can be obtained eventually.

There are those that believe

It will take many years, 3-5-10(?), to attain SuperIntelligence because of all the infrastructure that had to be put in place to create current LLMs, and the view that AGI will need much more. Thus, build out is years away. If that’s the case, it will take more years of infrastructure production, acquisition and data center build out to be ready to train SuperIntelligence after attaining AGI.
It will take just a few years, 1-2-3(?), to achieve SuperIntelligence after AGI. This is because one could use AGI to improve the AI training & inferencing algorithms and drastically increase the utilization of current AI hardware, such that there may be no need for any additional hardware to reach SuperIntelligence. Then the prime determinant of the time it takes to achieve SuperIntelligence is how fast AGI(s) can generate the new scientific, medical, sociological, etc. research needed to train SuperIntelligence.

Yes, much scientific, et al research requires experimentation in the real world, (although much can now be done in simulation). But even physical experimentation is being rapidly automated today.

So the time it takes to generate sufficient research to create enough data to train an ASI may be very short. Just consider how fast LLM agents can generate code today to get a feel for what they could do tomorrow for research.

Maybe regulatory bodies could slow this down. But my bet would be that regulatory artifices would turn out to be ineffectual. At best they will drive AGI-ASI training/deployment activity underground which may delay it a couple of years while organizations build up the AI training infrastructure in hiding.

The one serious bottleneck may be AI data center’s power requirements. But if rogue states can build centrifuges to enrich radioactive materials, intercontinental missiles, biological warfare agents, etc., they can certainly steal/buy/find a way to duplicate AI data center infrastructure components.

Regulatory regimens, at worst, would be completely ignored by state actors and all large commercial enterprises. The first mover advantages of AGI and ASI are too large for any organization to ignore.

What happens when SuperIntelligence is reached

I see one of two possibilities for how the achievement of AGI and SuperIntelligence plays out with respect to humanity:

Humankind Utopia – AGI & ASI agents can do anything that humans can do and do it better, faster, and more efficiently. The question remains what would be left for humanity to do when this is reached. Alright, at the moment, LLM agents are mostly limited to working in the digital domain. But with robotics coming online over the next decade, this will change to add more real world domains to whatever AGI-ASI agents can do.
Humankind Hell – AGI & ASI agents determine that humanity is a pestilence to the Earth and start to cut it back to something that’s less consumptive of Earth resources. Again, although AI agents are restricted to the digital domain today, that won’t last for long, especially as AGI & ASI agents go live. So robots with ASI agents would be the worst aggressors in the history of the world, and with the tools at their disposal, they could easily create biological, chemical and other weapons of mass destruction to deploy against humanity.

SuperIntelligence risk and rewards

It’s been obvious to me, SciFi authors and some select AI researchers that there is a sizable risk that a SuperIntelligence, once unleashed, will eliminate, severely restrict or enslave humanity resulting in Humanity’s Hell.

On the other extreme are many corporate CEO/CTOs and other AI researchers who believe that SuperIntelligence will be a Godsend to humankind. Once it arrives and is deployed, humanity will no longer have to do any work it does not want to do. All work will be handed off to robots and their ASI agents, which will perform it at greater speed, with higher quality and at lower cost than can conceivably be done today.

What seems to be happening today with current AI agents is that some white collar work is becoming easier to perform, if not totally eliminated. CEOs see this as an opportunity to reduce workforce size. For example, some CEOs are eliminating HR organizations with the belief that LLM chatbots together with a much smaller group can handle all of what HR was doing before.

And of course as AI agents become more sophisticated this will ensure more workforce reductions. And once AI agents are embodied in robotics, blue collar workforce will also be at risk.

Human Utopia and “The Last Man”

Nietzsche was writing in the late 1800s, when technology and automation were just starting to make a difference in the world of work. But the industrial revolution was in full steam and had already had significant impact on the work force.

Nietzsche believed that further industrialization, if it continued (which of course it has), would result in the Last Man.

The Last Man is at the point where technology and automation have taken over all tasks, trades and work, and where the Last Man has no real duties they need to perform other than to consume goods and services provided by automation. For the Last Man, being wealthy or poor no longer has any consequences, as they can have anything they could possibly desire.

To Nietzsche, the Last Man is anathema. He believes that true humanity requires struggle, striving and advancement. Once the Last Man is achieved, all these will no longer matter, no longer be a part of humanity’s existence and no longer impact one’s lifestyle.

When humanity no longer has to struggle, strive and advance, humanity will lose the very essence that makes humanity human. We will, over time, lose the ability and desire to do any of that, as it all becomes the purview of AGI-ASI.

The Last Man is coming already

Example 1: The Ethiopian Airlines Flight 409 disaster in 2010 (see wikipedia article) is one example in a very technical domain. As I understand it, the flight had just departed Beirut for Addis Ababa when it went into a stall, the pilots did the wrong thing to get out of it and they spiraled into the sea.

The pilot was the most experienced pilot in the airline (logged over 10K flight hrs). The co-pilot was much less experienced. Getting out of a “stall” is rudimentary to flying. Exiting a stall is one of the important skills taught to all pilots, and they need to demonstrate they can get out of a stall before they get their pilot licenses.

The “problem” had been brewing for a while. Ever since aircraft auto-pilots came into service, real live pilots did less and less real flying of airplanes. As a result, these two pilots forgot how to get out of a stall and it caused the accident.

Example 2: Self-driving technology has been rapidly improving over the last decade or so. We often become dependent on its capabilities and when there’s some sort of failure it can be disastrous because we have lost many of our most important driving skills.

In my case, we have a relatively dumb car with what they call “smart cruise control”. You can set it to a speed and the vehicle will retain that speed unless a vehicle in front of you is going slower, then it will slow down to maintain some set distance behind that vehicle.

We were driving along and a truck cut into our lane. This truck had a very high backend profile with no structures where normal vehicles would protrude until you got to its tires. Well the smart cruise control didn’t detect its existence until we were almost underneath the truck bed. We tried to brake but it took too many seconds to get that done and in the end we had to go off the road to save ourselves. We had lost our emergency braking skills and situational awareness skills. Nowadays we don’t drive with cruise control on as much.

A multitude of examples exist that show AI and automation have led to humans becoming less skilled at some activity. And when AI automation doesn’t work properly, bad things happen, because we no longer know how to react properly.

The Last Man, here today, gone tomorrow.

So imagine a life where you are born with everything you could possibly need to succeed. You are educated by the very best automated personal tutors. You are provided an (Amazon and Walmart) X 1000, with unlimited credit. You grow up with everyone else having just the same life as you because all of you have no work to do and have infinite sums and have infinite products to consume.

Life in such a utopia would from some perspective be almost Godlike. But if you take the perspective that humanity needs struggle, needs challenges, needs to strive to better themselves at every stage, such a life would be a disaster.

And that’s what Humanity’s Utopia would look like. Definitely better than Humanity’s Hell but in the end, not sure the difference matters as much.

~~~

I just don’t really see any path forward that’s good for humanity where AGI and SuperIntelligence exists.

Stopping AI development here, today, seems idiotic; going where we seem to be going seems insane.

Comments?

Picture Credit(s):

Friedrich Nietzsche by Friedrich Hermann Hartmann
ChatGPT logo by User:Random837 – Own work (imitated from File:ChatGPT-Logo-2022.svg), Copyrighted free use
Ethiopian Airline plane by Alastair T. Gardiner, CC BY-SA 4.0

AlphaEvolve, DeepMind’s latest intelligence pipeline

Posted on May 21, 2025 by Ray in AI Agents, Artificial Intelligence, Cognitive computing, Strategic Inflection Points

Read an article the other day from ArsTechnica on AlphaEvolve (Google Deepmind creates .. AI that can invent…), published after Google announced and released their AlphaEvolve website and paper.

Essentially they have created a pipeline of AI agents (using GeminiFlash and GeminiPro) that applies genetic/evolutionary techniques to evolve code, or really anything that can be expressed as code, in order to improve or solve something that has code based evaluation techniques.

Genetic evolution of code has been tried before and essentially it uses various combinatorial (splitting, adding, subtracting, etc.) techniques to modify code under evolution. The challenge with any such techniques is that much of the evolutionary code is garbage so you have to have some method to evaluate (quickly?) whether the new code is better or worse than the old code.

That’s where the evaluation code comes into play. It effectively executes the new code and determines a score (could be a scalar or vector) that AlphaEvolve can use to determine if it’s on the right track or not. Also you can have multiple evaluation functions. And as an example you could have some LLM be asked whether the code is simpler/cleaner/easier to understand. That way you could task AlphaEvolve to not only improve the code functionality but also create simpler/cleaner/easier to understand code.

AlphaEvolve uses GeminiFlash to generate a multitude of code variations and when that approach loses steam (no longer improving much) it invokes GeminiPro to look at the code in depth to determine strategies to make it better.

As discussed above, to use AlphaEvolve you need to supply infrastructure (compute, storage, networking), one or more evaluation algorithms/prompts (in any coding language you choose) and a starting solution (again in any coding language you want).

As part of AlphaEvolve’s process it uses a database to record all code modification attempts and their evaluation scores. This database can be used to retrieve prior modifications and take off from there again.
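The overall loop described above (generate variations, score them with an evaluation function, record every attempt in a database, restart from earlier attempts) can be sketched as a toy evolutionary search. To be clear about the assumptions: this is not AlphaEvolve’s implementation. A list of numbers stands in for the program being evolved, random mutation stands in for the LLM-proposed code changes, and the evaluator, function names and parameters are all hypothetical.

```python
import random

def evaluate(candidate):
    # Hypothetical evaluator: higher is better, best when every value is 3.0.
    return -sum((x - 3.0) ** 2 for x in candidate)

def mutate(candidate, rng):
    # Stand-in for an LLM proposing a modification: nudge one value at random.
    child = list(candidate)
    index = rng.randrange(len(child))
    child[index] += rng.uniform(-1.0, 1.0)
    return child

def evolve(start, generations=200, children=4, seed=0):
    rng = random.Random(seed)
    # The "database" records every attempt together with its score.
    database = [(evaluate(start), start)]
    for _ in range(generations):
        # Retrieve one of the best prior attempts and take off from there.
        top = sorted(database, key=lambda record: record[0], reverse=True)[:5]
        score, parent = rng.choice(top)
        for _ in range(children):
            child = mutate(parent, rng)
            database.append((evaluate(child), child))
    best = max(database, key=lambda record: record[0])
    return best, database

best, database = evolve([0.0, 0.0, 0.0])
print(f"best score {best[0]:.4f} after {len(database)} recorded attempts")
```

Most mutations score worse than their parent, which mirrors the observation above that much evolved code is garbage; the evaluator plus the database is what lets the search keep only what helps.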

Results

AlphaEvolve has been tasked with historical math problems that involve geometric constructions, with improving computing algorithms and with full stack coding improvements.

For instance the paper discusses how AlphaEvolve improved Google’s (Borg) compute scheduling algorithm, which recovered on average about 0.7% of compute resources across Google’s data centers.

It also found a kernel improvement which led to Gemini training speedup. It found a simpler logic footprint for a TPU chip function.

It found a faster algorithm for multiplying 4x4 complex-valued matrices. It improved the best known lower bound for the kissing number problem in 11 dimensions (a geometric construction). And it made progress on probably 50 or more mathematical problems, coding algorithm improvements, etc.

It didn’t improve or solve everything it was tasked to do but it did manage to improve on or solve ~20% or so of the starting solutions it was given.

How to use it

The nice thing about AlphaEvolve is that one can have it work with a whole code repo and have it only evolve a set of sections of code in that repo. All the code to be improved is marked with

#EVOLVE-BLOCK START and
#EVOLVE-BLOCK END.

This would be embedded in the starting solution. Presumably this would be in any comment format for the coding language being used.

And it’s important to note that the starting solution could be very rudimentary, and with the proper evaluation algorithms could still be used to solve or improve any algorithm.

For example, suppose you were interested in optimizing a factory production line by picking which component/finished product to manufacture next, and you had, let’s say, some sort of coded factory simulation with some way to examine the factory and evaluate whether it’s working well or not.

Your rudimentary starting algorithm could pick at random from the set of products/components that are currently needed. As evaluation you could use the throughput of your factory, utilization of bottleneck machinery, energy consumption or any other easily code-able evaluation metric of interest, in isolation or in combination (all of which could make use of your factory simulation to come up with evaluation score(s)). Surround the random selection code in #EVOLVE-BLOCK START and #EVOLVE-BLOCK END and let AlphaEvolve come up with a new selection algorithm for your factory.
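As a rough sketch of what such a starting solution and evaluation function might look like in Python: the factory simulation below is a toy stand-in, the product names, build times and energy costs are invented, and the function names (pick_next_product, simulate_factory, evaluate) are hypothetical. The block markers are written the way this post writes them; check the AlphaEvolve paper for the exact syntax.

```python
import random


def pick_next_product(needed, rng):
    """Starting solution: choose which needed product to manufacture next."""
    #EVOLVE-BLOCK START
    choice = rng.choice(sorted(needed))  # rudimentary: pick at random
    #EVOLVE-BLOCK END
    return choice


def simulate_factory(picker, steps=100, seed=0):
    """Toy factory simulation (hypothetical numbers) driven by a picker function."""
    rng = random.Random(seed)
    build_time = {"gear": 1, "motor": 3, "frame": 2}         # time units per item
    energy_cost = {"gear": 1.0, "motor": 4.0, "frame": 2.5}  # energy per item
    clock, units, energy = 0, 0, 0.0
    while clock < steps:
        product = picker(set(build_time), rng)
        clock += build_time[product]
        units += 1
        energy += energy_cost[product]
    return {"throughput": units / clock, "energy_per_unit": energy / units}


def evaluate(picker):
    """Evaluation function: a vector of scores where higher is better."""
    stats = simulate_factory(picker)
    return (stats["throughput"], -stats["energy_per_unit"])


if __name__ == "__main__":
    print(evaluate(pick_next_product))
```

Only the code between the markers would be up for evolution; the simulation and the evaluation function stay fixed, so scores from different candidate selection algorithms remain comparable.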

After seeing a couple of (10-100-1000) iterations of new graded selection algorithms you could change your evaluation grading algorithms and start over from where you left off to get something even more sophisticated.

DeepMind has created a GitHub Jupyter notebook with some of AlphaEvolve’s mathematical solutions/improvements in case you want to see more.

They also have an AlphaEvolve early signup site in case you’re interested in trying it out.

~~~~

If I were DeepMind, I could think of probably 10K things to do with AlphaEvolve. I might rank all the functions in GeminiPro/GeminiFlash inference and training by frequency count and take the top 20% of these functions through the AlphaEvolve pipeline. Ditto for Google Cloud services, Google search, Adwords, etc.

But that would be just the start…

….

Photo/Graphic Credit(s):

From DeepMind’s AlphaEvolve Paper
From DeepMind’s AlphaEvolve website
From DeepMind’s AlphaEvolve Paper
From DeepMind’s AlphaEvolve website

Reward is all you need – part 2, AGI part 12, ASI part 3

Posted on April 18, 2025 by Ray in AGI, ASI, Cognitive computing, Reinforcement Learning, Strategic Inflection Points

Read an article today about how current LLM technology is running out of steam as it approaches the equivalent of all current human knowledge. The article is Welcome to the Age of Experience. Apparently it’s a preprint of a chapter in an upcoming book from MIT, Designing an Intelligence. One of the authors is well known for his research in reinforcement learning and is a co-author of the textbook, Reinforcement Learning: An Introduction.

Sometime back before ChatGPT came out there was a paper on reward is enough (see post: For AGI, is reward enough). And at the time it proposed that reinforcement learning with proper reward signals was sufficient to reach AGI.

Since then, attention has become the prominent road to AGI and is evident in all the LLM activity to date (see ArXiv paper: Attention is all you need).

This new paper (and presumably book) suggests that the current AI training technology focused on attention (to current human knowledge) will ultimately reach an impasse, a human wall if you will. Whenever it attains human levels of AGI, or the Humanity Wall, it will be unable to proceed any farther. And at that point, it will track human knowledge generation but go no further.

Now, from my perspective something like this is inherently safer than having something that can surpass human intelligence. But putting my reservations aside, the new paper on the Era of Experience shows a potential road map of sorts to achieve super human intelligence.

Era of attention

In the case of transformers (current LLM technology) they have billion parameter models based on learning what the next token in a sequence should be. There are ancillary models that determine, for instance, tokenization of text streams (multi-dimensional locations for each portion of a word in a paragraph, for instance). Tokenization encodes textual semantics and context, as well as the word part being analyzed, into a string of numbers for each token. Essentially, a multi-dimensional address in textual semantic space.

But the big, billion+ parameter models were all essentially trained to predict what the next text token would be based on current context. Similarly, for graphical generation models it went from text tokens to predicting the diffusion pixels of a graphic and other visual artifacts.

But pretty much all of this was based on the underlying technology training approach as outlined in attention is all you need.

The Era of Experience paper suggests that this training approach will ultimately run out of steam. And all of these models will hit the Humanity Wall, where they reach the equivalent of all human knowledge but will be unable to proceed past that point.

Era of Games and Proofs

In an online course I took during Covid on reinforcement learning, level 1 of the course ended up having us code a Reinforcement Learning algorithm to play pong. Mind you this ended up taking me much longer to get right than I had anticipated. But in the end this was essentially training a deep neural network as a value function (predicting whether a move was going to win or lose) to decide which direction to move the paddle based on the ball’s current position and velocity.

For this reinforcement learning algorithm reward was simply 0, if you continued the game, +1 if you won the game, and -1, if you lost (the ball went past your paddle).
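A minimal sketch of that reward signal follows, assuming (my assumption, not the course's code) that the agent's paddle guards the right edge of the court. The second function shows one common way such a sparse end-of-point reward is spread back over earlier moves so a value function has something to learn from; the discount factor gamma is an illustrative choice.

```python
def pong_reward(ball_x, left_edge, right_edge):
    """Return +1 on a win, -1 on a loss, and 0 while play continues.

    Assumes the agent's paddle guards the right edge of the court.
    """
    if ball_x > right_edge:   # ball went past our paddle
        return -1
    if ball_x < left_edge:    # ball went past the opponent's paddle
        return 1
    return 0


def discounted_returns(rewards, gamma=0.99):
    """Propagate rewards backwards so earlier steps get discounted credit."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


if __name__ == "__main__":
    # Three steps of play that end in a win, with gamma = 0.5 for easy arithmetic.
    print(discounted_returns([0, 0, 1], gamma=0.5))  # [0.25, 0.5, 1.0]
```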

The authors discuss DeepMind’s “Alpha-Proof” (more of an explanation of the technology) and Alpha-Geometry2 (also described on the same page) as examples of super-human thinking capabilities, though only in the domain of mathematical proofs. Together, Alpha-Proof and Alpha-Geometry2 achieved a prestigious International Mathematical Olympiad silver medal level of performance.

Alpha-Proof & Alpha-Geometry2 depend on LEAN, a formal mathematical description language (similar to coding for mathematics). So a proof request would be converted to LEAN code and then Alpha-Proof and Alpha-Geometry2

Alpha-Proof was originally trained on the sum total of all human generated mathematical proofs but then used reinforcement learning to generate hundreds of millions more proofs and trained on those, to reach the level of a superhuman mathematical proof generator.

Alpha-Proof is an example of deploying Alpha-Zero RL technologies to different domains. Alpha-Zero already conquered Chess, Shogi and Go with super-human skill.

These achieved super-human levels of skill because human (knowledge) was essentially dropped out of the training loop (very early on) and from then on the algorithm trained itself on self-generated data (game play, mathematical proofs), using a game simulator and reward signal(s) to determine when plays were good or bad.

Era of Experience

But the Era of Experience takes reward signals to a whole other level.

Essentially in order to create super human intelligence using RL, the reward function needs to become yet another Deep Neural Network or two. And it needs to be trained in a fashion which understands how the world, environment, humans, flora, fauna, etc. react to what a (super human) agent is doing.

Unclear how you tokenize (encode) all those real world, experience signals into something a DNN could be trained on but my guess is their book will delve into some of these topics.

But in addition to the multi-faceted reward DNN(s), in order to do effective RL, one also needs a (high fidelity) real world simulator. This would be used the way internal game play is used in traditional game playing RL algorithms, so that the super human agent could generate 100 million agentic scenarios in simulation to determine if they were successful or not, long before it ever attempted activities in the real world.

So there you have it: tokenization for LLM DNNs, diffusion and text based agentic LLM DNNs, some sort of multi-faceted Reward DNNs (taking input from real and simulated world experience) and multi-faceted World simulator DNNs.

Once you have all that together, and with sufficient time and processing power, and after some 100 million or so generated actions in the simulated world, you should have a super human agent that you can unleash on the real world.

~~~~

You may wish to constrain your new super human intelligent agent early on to make sure the world simulation has true fidelity with the real world we live in. But after a suitable safety checkout period, one should have a super human intelligence agent ready to take over all human thought, society advancement, scientific research, etc.

Sound like fun!!?

Photo/Graphic Credit(s):

From Welcome to the new Era of Experience paper
From DeepMind’s Alpha-Proof webpage.
From DeepMind’s Alpha-Proof webpage.

Benchmarking Agentic AI using Factorio – AGI part 12

Posted on March 13, 2025 by Ray in AGI, AI Agents, Artificial Intelligence, Cognitive computing, Strategic Inflection Points

Yesterday a friend forwarded me something he saw online about a group of researchers who were using the game, Factorio, to benchmark AI Agent solutions (PDF of paper, Github repo).

The premise is that with an effective API for Factorio, AI agents can be tasked with creating various factories for artifacts. The best agents would be able to create the best factories.

Factorio factories can be easily judged by the number of artifacts they produce per time period and the energy use to manufacture those artifacts. They can also be graded based on how many steps it takes to generate those factories.

***Left is Factorio factory progression, middle is AI agent Python code that uses Factorio API, Right is agents submitting programs to Factorio server and receiving feedback***

The team has created a Factorio framework for using AI agents that create Python code to drive a set of Factorio APIs to build factories to manufacture stuff.

Factorio is a game in which you create and operate factories. From Factorio website: “You will be mining resources, researching technologies, building infrastructure, automating production, and fighting enemies. Use your imagination to design your factory, combine simple elements into ingenious structures, apply management skills to keep it working, and protect it from the creatures who don’t really like you.”

Presumably FLE has disabled the villainy and focused on just crafting and running factories all out.

FLE Results using current AI agents

***FLE Open-play Results***, ***for open-play, models are scored based on production quantities over time***, ***note the chart is log-log***

Factorio, similar to other games, has an inventory of elements/components/machines used to build factories. And some of these elements are hidden until one gains enough experience in the game.

The Factorio Learning Environment (FLE) is a complete framework that can prompt Agentic AI to create factories using Python code and Factorio API calls. The paper goes into great detail in its appendices as to what AI agent prompts look like, the Factorio API and other aspects of running the benchmark.

In the FLE as currently defined there’s “open-play” and “lab-play”.

Open-play tasks the agent with building as large a factory as it wants, to create as much product as possible. The open-play winner is the AI agent that creates a factory that can manufacture the most widgets (iron plates) in the time available for the competition.
Lab-play tasks the agent with building factories for 24 specific items, under limited resource and time constraints. The winner is the AI agent that successfully builds the most of these lab-play factories within the time and resource constraints available.

***FLE Lab-play (select) results – there were 24 tasks in the lab-play list, no agent completed all of them but Claude did the best on the 5 that were completed by most agents***

The team benchmarked 6 frontier LLM agents: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct, using them for both open-play and lab-play.

The overall winner for both open-play and lab-play was Claude 3.5-Sonnet, by a wide margin. In open-play it was able to create a factory to manufacture over 290K iron plates (per game minute, we think) and in lab-play it was able to construct more factories (7 out of 24) than the other AI agents.

***FLE Overall AI Agent Results***

The FLE researchers listed some common failings of AI agents under test:

Most agents lack spatial understanding
Most agents don’t handle or recover from errors well
Most agents don’t have long enough planning horizons
Most agents don’t invest enough effort in research (finding out what new Factorio machines do and how they could be used).

They also mentioned that AI agent coding skills seemed to be a key indicator of FLE success and coding style differed substantially between the agents. The researchers characterized agent (Python) coding styles and determined that Claude used a REPL style with plenty of print statements while GPT-4o used more assertions in its code.

“***Example of an FLE program*** used to create a simple
automated iron-ore miner. In step 1 the agent uses a query to find
the nearest resources and place a mine. In step 3 the agent uses an
assert statement to verify that its action was successful.”
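Since the figure itself isn't reproduced here, the following is a rough, self-contained sketch of the style of program the caption describes. It does not use the real FLE API: StubFactoryGame and all of its methods (nearest, move_to, place_entity, get_entity) are hypothetical stand-ins made up so the example runs on its own. It shows both the print-heavy REPL style attributed to Claude and the assertion style attributed to GPT-4o.

```python
class StubFactoryGame:
    """Hypothetical stand-in for a game API; NOT the real FLE interface."""

    def __init__(self):
        self.resources = {"iron-ore": [(12, 4), (3, 5), (40, 40)]}
        self.entities = {}
        self.player = (0, 0)

    def nearest(self, resource):
        """Return the closest known patch of the given resource."""
        px, py = self.player
        return min(self.resources[resource],
                   key=lambda p: (p[0] - px) ** 2 + (p[1] - py) ** 2)

    def move_to(self, position):
        self.player = position

    def place_entity(self, kind, position):
        self.entities[position] = kind
        return position

    def get_entity(self, position):
        return self.entities.get(position)


def build_miner(game):
    # Step 1: query the game for the nearest iron ore.
    ore = game.nearest("iron-ore")
    print(f"nearest iron ore at {ore}")  # REPL style: print and inspect

    # Step 2: walk there and place a mining drill on the ore.
    game.move_to(ore)
    placed = game.place_entity("mining-drill", ore)

    # Step 3: assertion style: verify the action actually succeeded.
    assert game.get_entity(placed) == "mining-drill", "drill was not placed"
    return placed


if __name__ == "__main__":
    build_miner(StubFactoryGame())
```

Either style gives the agent feedback it can act on in its next program: printed output to read back, or an assertion failure that pinpoints which step went wrong.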

IMHO, as a way to measure AI agent ability to achieve long term and short term goals, at least w.r.t. building factories, this is the best I’ve seen so far.

More FLE Lab-play scenarios

I could see a number of additional lab-play benchmarks for FLE:

One focused on drug/pharmaceuticals manufacturing
One focused on electronics PCB manufacturing
One focused on chip manufacturing
One focused on nano technology/meta-materials manufacturing, etc.

What’s missing from all these benchmarks would be the actual science and research needed to come up with new drugs, new electronics, new meta-materials, that are the end product of Factorio factories. I guess that would need to be building of labs, running scientific experiments and understanding (simulated) results.

Although in the current round of FLE benchmarks, for one AI agent at least (Claude), there seemed to be a lot of research into how to use different Factorio tools and machinery.

Ultimate FLE

If FLE as an AI agent benchmark succeeds, most Agentic AI solutions will start being trained to do better on the benchmark. Doing so should of course lead to better scores by AI agents.

Now people much more familiar with the game than I, say it’s not a great simulation of the real world. There’s only one type of fuel and the boiler is either on or off and numerous other simplifications of the real world are used throughout. And thankfully, for the moment there’s no linkage to actions that impact the real world.

But in reality, simulations like this are all just stepping stones to AI capabilities. And simulations are all just code, so it should not be that hard to increase their fidelity to the real world.

Getting beyond just simulation, to real world factories is probably the much larger step. This would require physical (not unlimited) inventory of parts, cabling, machines, and belts; real mineral/petroleum deposits; real world physical constraints on where factories could be built, etc. Not to mention the physical automation/robotics that would allow a machine to be selected out of inventory, placed at a specific location inside a factory and connected to power and assembly lines, etc.

~~~~

One common motif in AGI existential crises is that some AGI (agent) will be given the task to build a paperclip factory and will turn the earth into one giant factory, while inadvertently killing all life on the planet, including of course, humankind.

So training AI agents on “open-play” has ominous overtones.

It would be much better, IMHO, if somehow one could add to Factorio human settlements, plant, animal & sea life, ecosystems, etc. So that there would be natural components that if ruined/degraded/destroyed, could be used to reduce AI agent scores for the benchmarks.

Alas, there doesn’t appear to be anything like this in the current game.

Picture Credit(s):

From Jack Hopkins Factorio Learning Environment (FLE) Github Repo
From Jack Hopkins Factorio Learning Environment (FLE) Github Repo
From Jack Hopkins Factorio Learning Environment (FLE) Github Repo
From Jack Hopkins Factorio Learning Environment (FLE) paper

Nexus by Yuval Noah Harari, AGI part 12

Posted on November 22, 2024 by Ray in AGI, Artificial Intelligence, Strategic Inflection Points

This book is all about how information networks have molded man and society over time and what’s happening to these networks with the advent of AI.

In the earliest part of the book he defines information as essentially “that which connects and can be used to create new realities”. For most of humanity, reality came in two forms

Objective reality which was a shared belief in things that can be physically tasted, touched, seen, etc. and
Subjective reality which was entirely internal to a single person which was seldom shared in its entirety.

With mankind’s information networks came a new form of reality, the Inter-subjective reality. As inter-subjective reality was external to the person, it could readily be shared, debated and acted upon to change society.

Information as story

He starts out with the 1st information network, the story or rather the shared story. The story and its sharing across multiple humans led human society to expand beyond the bands of hunter gatherers. Stories led to the first large societies of humans and the information flow looked like human-story and story-human and created the first inter-subjective realities. Shared stories still impact humanity today.

As we all know stories verbally passed from one to another often undergo minor changes. Not much of a problem for stories as the plot and general ideas are retained. But for inventories, tax receipts, land holdings, small changes can be significant.

What transpired next was a solution to this problem. As these societies became larger and more complex there arose a need to record lists of things, such as plots of land, taxes owed/received, inventories of animals, etc. And lists are not something that can easily be woven into a story.

Information as printed document

Thus clay tablets of Mesopotamia and elsewhere were created to permanently record lists. But the clay tablet is just another form of printed document.

Whereas story led to human-story and story-human interactions, printed documents led to human-document and document-human information flow. Printed documents expanded the inter-subjective reality sphere significantly.

But the invention of printed documents or clay tablets caused another problem – how to store and retrieve them. There arose in these times, the bureaucracy run by bureaucrats to create storage and retrieval systems for vast quantities of printed documents.

Essentially with the advent of clay tablets, something had to be done to organize and access these documents and the bureaucrat became the person that did this.

With bureaucracy came obscurity, restricted information access, and limited visibility/understanding into what bureaucrats actually did. Perhaps one could say that this created human-bureaucrat-document and document-bureaucrat-human information flow.

The holy book

Next he talks about the invention of the holy book, i.e., the Hebrew Bible, the Christian New Testament, the Islamic Koran, etc. They all attempted to explain the world, but over time their relevance diminished.

As such, there arose a need to “interpret” the holy books for the current time.

For Hebrews this interpretation took the form of the Mishnah and Talmud. For Christians, the books of the New Testament, the epistles and the Christian Church. I presume similar activities occurred for Islam.

Following this, he sort of touches on the telegraph, radio, & TV but they are mostly given short shrift as compared to story, printed documents and holy books, as all these are just faster ways to disseminate stories, documents and holy books.

Different Information flows in democracies vs. tyrannies

Throughout the first 1/3 of the book he weaves in how different societies such as democracies and tyrannies/dictatorships/populists have different information views and flows. As a result, they support entirely different styles of information networks.

Essentially, in authoritarian regimes all information flows to the center and flows out of the center and ultimately the center decides what is disseminated. There’s absolutely no interest in finding the truth, just in retaining power.

In democracies, there are many different information flows in mostly an uncontrolled fashion and together they act as checks and balances on one another to find the truth. Sometimes this is corrupted or fails to work for a while to maintain order, but over time the truth always comes out.

He goes into some length how these democratic checks and balances information networks function in isolation and together. In contrast, tyrannical information flows ultimately get bottled up and lead to disaster.

The middle ~1/3 of the book touches on inorganic information networks. Those run by computers for computers and ultimately run in parallel to human information flows. They are different from the printing press, are always on, but are often flawed.

Non-human actors added to humanity’s information networks

The last 1/3 of the book takes these information network insights and shows how the emergence of AI algorithms is fundamentally altering all of them. By adding a non-human actor with its own decision capabilities into the mix, AI has created a new form of reality, an inter-computer reality, which has its own logic, ultimately unfathomable to humans.

Even a relatively straightforward (dumb) recommendation engine, whose expressed goal is to expand and extend interaction on a site/app, can learn how to do this in such a way as to have unforeseen societal consequences.

This had a role to play in the Rohingya Genocide, and we all know how it impacted the 2016 US elections and continues to impact elections to this day.

In this last segment he has articulated some reasonable solutions to AI and AGI risks. It’s all about proper goal alignment and using computer AIs together with humans to watch other AIs.

Sort of like the fox…, but it’s the only real way to enact some form of control over AI. We will discuss these solutions at more length in a future post.

~~~~

In this blog we have talked many times about the dangers of AGI. What surprised me in reading this book is that AI doesn’t have to reach AGI levels to be a real danger to society.

A relatively dumb recommendation engine can aid and abet genocide, disrupt elections and change the direction of society. I knew this but thought the real danger to us was AGI. In reality, it’s improperly aligned AI in any and all its forms. AGI just makes all this much worse.

I would strongly suggest every human adult read Nexus, there are lessons within for all of humanity.

Picture Credits:

Photo of book cover.
By Unknown artist – Jastrow (2006), Public Domain, https://commons.wikimedia.org/w/index.php?curid=730728
By NYC Wanderer (Kevin Eng) – originally posted to Flickr as Gutenberg Bible, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=9914015
By Zlatica Hoke (VOA) – Screenshot from the source video by Voice of America, Public Domain, https://commons.wikimedia.org/w/index.php?curid=66794875