AlphaEvolve, DeepMind’s latest intelligence pipeline

Read an article the other day from Ars Technica on AlphaEvolve (Google DeepMind creates .. AI that can invent…), published after Google announced and released their AlphaEvolve website and paper.

Essentially, they have created a pipeline of AI agents (using Gemini Flash and Gemini Pro) that applies genetic/evolutionary techniques to evolve code, for anything really that can be expressed as code and evaluated by code-based techniques.

Genetic evolution of code has been tried before; essentially it uses various combinatorial techniques (splitting, adding, subtracting, etc.) to modify the code under evolution. The challenge with any such technique is that much of the evolved code is garbage, so you need some method to evaluate (quickly?) whether the new code is better or worse than the old code.

That’s where the evaluation code comes into play. It effectively executes the new code and determines a score (which could be a scalar or a vector) that AlphaEvolve uses to determine whether it’s on the right track. You can also have multiple evaluation functions. For example, you could ask an LLM whether the code is simpler/cleaner/easier to understand; that way you could task AlphaEvolve not only to improve the code’s functionality but also to produce simpler/cleaner/easier-to-understand code.
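To make this concrete, here is a minimal sketch of what one evaluation function might look like, for a toy task of evolving a faster sort routine. The function name and the score structure are my own assumptions, not AlphaEvolve's actual interface.

```python
# Minimal sketch of an evaluation function (names and structure are assumptions,
# not AlphaEvolve's actual interface). It runs the candidate code and returns
# scores the evolutionary loop can compare against prior attempts.
import time

def evaluate(candidate_sort):
    """Score a candidate sorting function on correctness and speed."""
    test_data = [list(range(1000, 0, -1)) for _ in range(20)]

    correct = 0
    start = time.perf_counter()
    for case in test_data:
        result = candidate_sort(list(case))
        if result == sorted(case):          # correctness check against a reference
            correct += 1
    elapsed = time.perf_counter() - start

    # Return a vector of scores: functional quality plus runtime cost.
    return {
        "correctness": correct / len(test_data),
        "runtime_seconds": elapsed,
    }

# Example: evaluate(sorted) should report correctness 1.0 and a baseline runtime.
```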

AlphaEvolve uses Gemini Flash to generate a multitude of code variations, and when that approach loses steam (no longer improving much) it invokes Gemini Pro to look at the code in depth and determine strategies to make it better.

As discussed above, to use AlphaEvolve you need to supply infrastructure (compute, storage, networking), one or more evaluation algorithms/prompts (in any coding language you choose) and a starting solution (again in any coding language you want).

As part of its process, AlphaEvolve uses a database to record all code modification attempts and their evaluation scores. This database can be used to retrieve prior modifications and take off from there again.
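A toy illustration of the idea; the schema here is my own invention, and the paper's evolutionary program database is considerably more sophisticated.

```python
# Toy sketch of a program database recording every attempted code variant and
# its evaluation scores (schema is my own; not AlphaEvolve's actual database).
import sqlite3

db = sqlite3.connect("evolve_attempts.db")
db.execute("""CREATE TABLE IF NOT EXISTS attempts (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER,
    code TEXT,
    correctness REAL,
    runtime_seconds REAL
)""")

def record_attempt(parent_id, code, scores):
    # Store a mutated program alongside the scores its evaluation produced.
    db.execute(
        "INSERT INTO attempts (parent_id, code, correctness, runtime_seconds) "
        "VALUES (?, ?, ?, ?)",
        (parent_id, code, scores["correctness"], scores["runtime_seconds"]),
    )
    db.commit()

def best_attempts(n=10):
    # Retrieve the highest-scoring prior variants to seed the next round.
    return db.execute(
        "SELECT id, code FROM attempts "
        "ORDER BY correctness DESC, runtime_seconds ASC LIMIT ?",
        (n,),
    ).fetchall()
```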

Results

AlphaEvolve has been tasked with historical math problems that involve geometric constructions, as well as computing algorithm improvements and full-stack coding improvements.

For instance, the paper discusses how AlphaEvolve improved Google's (Borg) compute scheduling heuristic, recovering on the order of 0.7% of compute capacity across Google's data centers.

It also found a kernel improvement which led to Gemini training speedup. It found a simpler logic footprint for a TPU chip function.

It found a faster algorithm for 4x4 complex-valued matrix multiplication. It found an improved construction for the kissing number problem in 11 dimensions (a geometric construction). And it was applied to 50 or more other mathematical problems, coding algorithm improvements, etc.

It didn’t improve or solve everything it was tasked with, but it did manage to improve on roughly 20% of the starting solutions it was given.

How to use it

The nice thing about AlphaEvolve is that one can have it work with a whole code repo and have it evolve only selected sections of code in that repo. All the code to be improved is marked with

#EVOLVE-BLOCK START and
#EVOLVE-BLOCK END.

These markers would be embedded in the starting solution, presumably using whatever comment format the coding language being used supports.
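For instance, in a Python repo the marked region might look something like this; the function body is made up, and only the marker comments follow the convention just described.

```python
# Illustrative only: the scheduling logic here is invented; only the marker
# comments follow the convention described above.
def schedule_jobs(jobs, machines):
    # EVOLVE-BLOCK START
    # Only the code between the markers is eligible for evolution; the rest
    # of the repo is held fixed.
    assignments = {m: [] for m in machines}
    for i, job in enumerate(jobs):
        assignments[machines[i % len(machines)]].append(job)  # naive round-robin
    # EVOLVE-BLOCK END
    return assignments
```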

And it’s important to note that the starting solution can be very rudimentary; with the proper evaluation algorithms it can still be used to solve or improve just about any algorithm.

For example, suppose you were interested in optimizing a factory production line by choosing which component/finished product to manufacture next, and you had, let’s say, some sort of coded factory simulation with some way to examine the factory and evaluate whether it’s working well or not.

Your rudimentary starting algorithm could pick at random from the set of products/components currently needed, and your evaluation could use factory throughput, utilization of bottleneck machinery, energy consumption, or any other easily codeable metric of interest, in isolation or combination (making use of your factory simulation to come up with the evaluation score(s)). Surround the random selection code with #EVOLVE-BLOCK START and #EVOLVE-BLOCK END and let AlphaEvolve come up with a new selection algorithm for your factory, as sketched below.
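Here is a minimal sketch of that setup, assuming a hypothetical factory simulator object; none of this is AlphaEvolve code, just an illustration of the starting-solution-plus-evaluation pairing.

```python
# Sketch only: the simulator interface and all names are hypothetical.
import random

def pick_next_product(needed_products, factory_state):
    # EVOLVE-BLOCK START
    # Rudimentary starting policy: pick at random from what's currently needed.
    return random.choice(needed_products)
    # EVOLVE-BLOCK END

def evaluate(policy, simulator, hours=24):
    """Run the (hypothetical) factory simulation with a candidate policy and
    score it on throughput, bottleneck utilization and energy use."""
    state = simulator.reset()
    for _ in range(hours):
        product = policy(simulator.needed_products(), state)
        state = simulator.step(product)
    units = simulator.units_completed()
    return {
        "throughput": units / hours,
        "bottleneck_utilization": simulator.bottleneck_utilization(),
        "energy_per_unit": simulator.energy_used() / max(units, 1),
    }
```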

After some number (10, 100, 1,000) of iterations of new, graded selection algorithms, you could change your evaluation/grading algorithms and restart from where you left off to get something even more sophisticated.

DeepMind has published a Jupyter notebook on GitHub with some of AlphaEvolve’s mathematical solutions/improvements, in case you want to see more.

They also have an AlphaEvolve early-access signup site in case you’re interested in trying it out.

~~~~

If I were DeepMind, I could think of probably 10K things to do with AlphaEvolve. I might rank all the functions in Gemini Pro/Gemini Flash inference and training by frequency count and take the top 20% of these functions through the AlphaEvolve pipeline. Ditto for Google Cloud services, Google Search, AdWords, etc.

But that would be just the start…

….


Benchmarking Agentic AI using Factorio – AGI part 12

Yesterday a friend forwarded me something he saw online about a group of researchers who were using the game, Factorio, to benchmark AI Agent solutions (PDF of paper, Github repo).

A Factorio plastic bar factory

The premise is that with an effective API for Factorio, AI agents can be tasked with creating various factories for artifacts. The best agents would be able to create the best factories.

Factorio factories can be easily judged by the number of artifacts they produce per time period and the energy use to manufacture those artifacts. They can also be graded based on how many steps it takes to generate those factories.

Left: Factorio factory progression; middle: AI agent Python code that uses the Factorio API; right: agents submitting programs to the Factorio server and receiving feedback

The team has created a Factorio framework for using AI agents that create Python code to drive a set of Factorio APIs to build factories to manufacture stuff.

Factorio is a game in which you create and operate factories. From Factorio website: “You will be mining resources, researching technologies, building infrastructure, automating production, and fighting enemies. Use your imagination to design your factory, combine simple elements into ingenious structures, apply management skills to keep it working, and protect it from the creatures who don’t really like you.”

Presumably FLE has disabled the villainy and focused on just crafting and running factories all out.

FLE Results using current AI agents

FLE open-play results; for open-play, models are scored based on production quantities over time; note the chart is log-log

Factorio, similar to other games, has an inventory of elements/components/machines used to build factories. Some of these elements are hidden until one gains enough experience in the game.

The Factorio Learning Environment (FLE) is a complete framework that can prompt agentic AI to create factories using Python code and Factorio API calls. The paper goes into great detail in its appendices as to what AI agent prompts look like, the Factorio API and other aspects of running the benchmark.

In the FLE as currently defined there’s “open-play” and “lab-play”.

  • Open-play tasks the agent with building as large a factory as it wants, to create as much product as possible. The open-play winner is the AI agent that creates a factory that can manufacture the most widgets (iron plates) in the time available for the competition.
  • Lab-play tasks the agent with building factories for 24 specific items, under limited resource and time constraints; the winner is the AI agent able to build the most of these lab-play factories successfully within those constraints.

FLE lab-play (select) results – there were 24 tasks in the lab-play list; no agent completed all of them, but Claude did the best on the 5 that most agents completed

The team benchmarked 6 frontier LLM agents: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct, using them for both open-play and lab-play.

The overall winner for both open-play and lab-play was Claude 3.5-Sonnet, by a wide margin. In open-play it was able to create a factory that manufactured over 290K iron plates (per game minute, we think), and in lab-play it was able to construct the most factories (7 out of 24) of any AI agent.

FLE Overall AI Agent Results

The FLE researchers listed some common failings of AI agents under test:

  • Most agents lack spatial understanding
  • Most agents don’t handle or recover from errors well
  • Most agents don’t have long enough planning horizons
  • Most agents don’t invest enough effort in research (finding out what new Factorio machines do and how they could be used).

They also mentioned that AI agent coding skills seemed to be a key indicator of FLE success and coding style differed substantially between the agents. The researchers characterized agent (Python) coding styles and determined that Claude used a REPL style with plenty of print statements while GPT-4o used more assertions in its code.

Example of an FLE program used to create a simple automated iron-ore miner. In step 1 the agent uses a query to find the nearest resources and place a mine. In step 3 the agent uses an assert statement to verify that its action was successful.
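For flavor, here is a paraphrase of that kind of program. The calls and signatures below are approximations of what the FLE exposes (the names are provided by the environment at runtime), not the exact FLE API.

```python
# Paraphrase of the program described in the caption above; function names and
# signatures are approximations, not the exact FLE API, and the Resource/
# Prototype identifiers are assumed to be injected by the FLE environment.

# Step 1: query the environment for the nearest iron ore patch (assumed call).
ore_position = nearest(Resource.IronOre)

# Step 2: place a burner mining drill on that patch (assumed call).
drill = place_entity(Prototype.BurnerMiningDrill, position=ore_position)

# Step 3: verify the action succeeded, in the assertion-heavy style the
# researchers observed in some agents' code.
assert drill is not None, f"Failed to place mining drill at {ore_position}"
print(f"Placed mining drill at {ore_position}")
```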

IMHO, as a way to measure AI agent ability to achieve long term and short term goals, at least w.r.t. building factories, this is the best I’ve seen so far.

More FLE Lab-play scenarios

I could see a number of additional lab-play benchmarks for FLE:

  • One focused on drug/pharmaceuticals manufacturing
  • One focused on electronics PCB manufacturing
  • One focused on chip manufacturing
  • One focused on nano technology/meta-materials manufacturing, etc.

What’s missing from all these benchmarks would be the actual science and research needed to come up with the new drugs, new electronics and new meta-materials that are the end products of Factorio factories. I guess that would require building labs, running scientific experiments and understanding (simulated) results.

Although in the current round of FLE benchmarks, for one AI agent at least (Claude), there seemed to be a lot of research into how to use different Factorio tools and machinery.

Ultimate FLE

If FLE succeeds as an AI agent benchmark, most agentic AI solutions will start being trained to do better on the benchmark. Doing so should of course lead to better scores by AI agents.

Now, people much more familiar with the game than I am say it’s not a great simulation of the real world. There’s only one type of fuel, the boiler is either on or off, and numerous other simplifications of the real world are used throughout. And thankfully, for the moment, there’s no linkage to actions that impact the real world.

But in reality, simulations like this are all just stepping stones to AI capabilities. And simulations are all just code, so it should not be that hard to increase their fidelity to the real world.

Getting beyond simulation to real-world factories is probably the much larger step. This would require a physical (not unlimited) inventory of parts, cabling, machines, and belts; real mineral/petroleum deposits; real-world physical constraints on where factories could be built; etc. Not to mention the physical automation/robotics that would allow a machine to be selected out of inventory, placed at a specific location inside a factory and connected to power and assembly lines, etc.

~~~~

One common motif in AGI existential crises is that some AGI (agent) will be given the task of building a paperclip factory and will turn the earth into one giant factory, inadvertently killing all life on the planet, including, of course, humankind.

So training AI agents on “open-play” has ominous overtones.

It would be much better, IMHO, if one could somehow add human settlements, plant, animal & sea life, ecosystems, etc. to Factorio, so that there would be natural components that, if ruined/degraded/destroyed, would reduce AI agent scores on the benchmarks.

Alas, there doesn’t appear to be anything like this in the current game.


Nexus by Yuval Noah Harari, AGI part 12

This book is all about how information networks have molded man and society over time, and what’s happening to these networks with the advent of AI.

    In the earliest part of the book he defines information as essentially “that which connects and can be used to create new realities”. For most of humanity, reality came in two forms 

    • Objective reality which was a shared belief in things that can be physically tasted, touched, seen, etc. and 
    • Subjective reality which was entirely internal to a single person which was seldom shared in its entirety.

    With mankind’s information networks came a new form of reality, inter-subjective reality. As inter-subjective reality was external to the person, it could readily be shared, debated and acted upon to change society.

    Information as story

    He starts out with the first information network, the story, or rather the shared story. The story and its sharing across multiple humans led human society to expand beyond bands of hunter-gatherers. Stories led to the first large human societies; the information flow looked like human-story and story-human, and it created the first inter-subjective realities. Shared stories still impact humanity today.

    As we all know, stories verbally passed from one person to another often undergo minor changes. That’s not much of a problem for stories, as the plot and general ideas are retained. But for inventories, tax receipts and land holdings, small changes can be significant.

    What transpired next was a solution to this problem. As these societies became larger and more complex, there arose a need to record lists of things, such as plots of land, taxes owed/received, inventories of animals, etc. And lists are not something that can easily be woven into a story.

    Information as printed document

    Thus the clay tablets of Mesopotamia and elsewhere were created to permanently record lists. But the clay tablet is just another form of printed document.

    Whereas story led to human-story and story-human interactions, printed documents led to human-document and document-human information flow. Printed documents expanded the inter-subjective reality sphere significantly.

    But the invention of printed documents or clay tablets caused another problem – how to store and retrieve them. There arose in these times, the bureaucracy run by bureaucrats to create storage and retrieval systems for vast quantities of printed documents.

    Essentially with the advent of clay tablets, something had to be done to organize and access these documents and the bureaucrat became the person that did this.

    With bureaucracy came obscurity, restricted information access, and limited visibility/understanding into what bureaucrats actually did. Perhaps one could say that this created human-bureaucrat-document and document-bureaucrat-human information flow.

    The holy book

    (c)Kevin Eng

    Next he talks about the invention of the holy book, i.e., the Hebrew Bible, the Christian New Testament, the Islamic Koran, etc. They all attempted to explain the world, but over time their relevance diminished.

    As such, there arose a need to “interpret” the holy books for the current time. 

    For the Hebrews this interpretation took the form of the Mishnah and the Talmud; for Christians, the books of the New Testament, the epistles and the Christian Church. I presume similar activities occurred for Islam.

    Following this, he touches on the telegraph, radio & TV, but they are mostly given short shrift compared to the story, printed documents and holy books, as all of these are just faster ways to disseminate stories, documents and holy books.

    Different Information flows in democracies vs. tyrannies

    Throughout the first 1/3 of the book he weaves in how different societies, such as democracies and tyrannies/dictatorships/populist regimes, have different information views and flows. As a result, they support entirely different styles of information networks.

    Essentially, in authoritarian regimes all information flows to and from the center, and ultimately the center decides what is disseminated. There’s absolutely no interest in finding the truth, just in retaining power.

    In democracies, there are many different information flows, operating in a mostly uncontrolled fashion, and together they act as checks and balances on one another to find the truth. Sometimes this is corrupted or fails to work for a while, but over time the truth comes out.

    He goes to some length about how these democratic checks-and-balances information networks function, in isolation and together. In contrast, tyrannical information flows ultimately get bottled up and lead to disaster.

    The middle ~1/3 of the book touches on inorganic information networks: those run by computers, for computers, that ultimately run in parallel to human information flows. They are different from the printing press, are always on, and are often flawed.

    Non-human actors added to humanity’s information networks

    The last 1/3 of the book takes these information network insights and shows how the emergence of AI algorithms is fundamentally altering all of them. By adding a non-human actor with its own decision capabilities into the mix, AI has created a new form of reality, an inter-computer reality, which has its own logic, ultimately unfathomable to humans.

    Rohingya refugees in camp

    Even a relatively straightforward (dumb) recommendation engine, whose expressed goal is to expand and extend interaction on a site/app, can learn how to do this in such a way as to have unforeseen societal consequences.

    This had a role to play in the Rohingya Genocide, and we all know how it impacted the 2016 US elections and continues to impact elections to this day.

    In this last segment he articulates some reasonable solutions to AI and AGI risks. It’s all about proper goal alignment and using computer AIs, together with humans, to watch other AIs.

    Sort of like the fox…, but it’s the only real way to enact some form of control over AI. We will discuss these solutions at more length in a future post.

    ~~~~

    In this blog we have talked many times about the dangers of AGI. What surprised me in reading this book is that AI doesn’t have to reach AGI levels to be a real danger to society.

    A relatively dumb recommendation engine can aid and abet genocide, disrupt elections and change the direction of society. I knew this but thought the real danger to us was AGI. In reality, it’s improperly aligned AI in any and all its forms. AGI just makes all this much worse.

    I would strongly suggest every human adult read Nexus; there are lessons within for all of humanity.


    Project Gemini at Cloud Field Day 20 #CFD20

    At AIFD4, Google demonstrated Gemini 1.0 writing some code for a task that someone had. At CFD20, Google’s Lisa Shen demonstrated how easy it is to build an LLM RAG service from scratch using GCP Cloud Run and Vertex AI APIs. (At press time, the CFD20 videos from GCP were not available, but I am assured they will be up shortly.)

    I swear, in a matter of minutes Lisa Shen showed us two Python modules (indexer.py and server.py) that were less than 100 LOC each. One ingested Cloud Run release notes (309 of them, if I remember correctly), ran embeddings on them and created a RAG vector database with the embedded information. This took a matter of seconds to run (much longer to explain).

    The other created an HTTP service that opened a prompt window, took the prompt, embedded its text, searched the RAG DB with that embedding, then sent the original prompt plus the retrieved results to a Vertex AI LLM API call to generate a response, and displayed that as an HTTP text response.
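    In spirit, the two modules break down something like the sketch below. This is a reconstruction from memory, not Lisa Shen's code: embed_text and generate_answer are placeholders for the actual Vertex AI embedding and LLM calls, and the brute-force cosine search stands in for a real vector database.

```python
# indexer.py -- sketch of the ingest side (placeholder names, not the CFD20 code).
import json

def embed_text(text: str) -> list[float]:
    """Placeholder for a Vertex AI text-embedding API call."""
    raise NotImplementedError("call your embedding model here")

def build_index(release_notes: list[str], out_path: str = "rag_index.json") -> None:
    index = [{"text": note, "embedding": embed_text(note)} for note in release_notes]
    with open(out_path, "w") as f:
        json.dump(index, f)

# server.py -- sketch of the serving side.
def generate_answer(prompt: str, context: list[str]) -> str:
    """Placeholder for a Vertex AI LLM call that receives the prompt plus retrieved notes."""
    raise NotImplementedError("call your LLM here")

def answer(question: str, index: list[dict], top_k: int = 5) -> str:
    q_emb = embed_text(question)

    def cosine(a, b):
        # Brute-force cosine similarity; a real service would use a vector DB.
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / ((na * nb) or 1.0)

    ranked = sorted(index, key=lambda item: cosine(q_emb, item["embedding"]), reverse=True)
    return generate_answer(question, [item["text"] for item in ranked[:top_k]])
```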

    Once the service was running, Lisa used it to answer a question about when a particular VPC networking service was released. I asked her to ask it to explain what that particular networking service was. She said that was unlikely to be in the release notes, but entered the question anyway, and lo and behold it replied with a one-sentence description of the networking capability.

    GCP Cloud Run can do a number of things besides HTTP services, but this was pretty impressive all the same. And remember that GCP Cloud Run is serverless, so it doesn’t cost a thing while idle and only incurs costs when used.

    I think if we ask nicely Lisa would be willing to upload her code to GitHub (if she hasn’t already done that) so we can all have a place to start.

    ~~~~

    Ok all you enterprise AI coders out there, start your engines. If Lisa can do it in minutes, it should take the rest of us maybe an hour or so.

    My understanding is that Gemini 2.0 Pro has a 1M-token context. So the reply from your RAG DB plus any prompt text would need to be under 1M tokens. 1M tokens could represent 50-100K LOC, for example, so there’s plenty of space to add corporate/organizational context.

    There are smaller/cheaper variants of Gemini which support fewer tokens. So if you could get by with, say, 32K tokens, you might be able to use the cheapest version of Gemini (this is what the Vertex AI LLM API call ends up using).

    Also, for the brave of heart wanting some hints as to what comes next, I would suggest watching Neama Dadkhanikoo’s session at CFD20, with a video on Google DeepMind’s Project Astra. Just mind-blowing.

    Comments?

    DeepMind takes on Geometry, AGI part-9

    Read an article in MIT Tech Review (Google DeepMind’s new AI systems can solve complex geometry problems) about AlphaGeometry, a new AI tool that DeepMind has come up with to solve geometry problems. The article was referring to a Nature article (Solving olympiad geometry without human demonstrations) about the technology.

    DeepMind has tested AlphaGeometry on International Mathematical Olympiad (IMO) geometry problems and has shown that it is capable of performing expert-level geometry proofs.

    There are a number of interesting capabilities DeepMind used in AlphaGeometry. The ones of most interest from my perspective are:

    1. How they generated their (synthetic) data to train their solution.
    2. Their use of a generative AI LLM which is prompted with a plane geometry figure and a theorem to prove, and generates proof steps and, if needed, auxiliary constructions.
    3. The use of a deduction rule engine (DD) plus algebraic rule engine (AR), which when combined into a symbolic engine (DD+AR) can exhaustively generate all the proofs that can be derived from a figure.

    First the data

    The DeepMind team came up with a set of rules or actions that could be used to generate new figures. Once this list was created, they could randomly select actions, applied to some points, to create a figure.

    Some examples of actions (given 3 points A, B and C):

    • Construct X such that XA is parallel to BC
    • Construct X such that XA is perpendicular to BC
    • Construct X such that XA=BC

    There are sets of actions for 4 points and for 2 points, and actions that just use the 3 points to create figures such as (isosceles, equilateral) triangles, circles, parallelograms, etc.

    With such actions one can start out with 2 random points on a plane to create figures of arbitrary complexity. They used this to generate millions of figures.

    They then used their DD+AR symbolic engine to recursively and exhaustively deduce a set of all possible premises based on that figure. Once they had this set, they could select one of these premises as a conclusion and trace back through the set of all those other premises to find those which were used to prove that conclusion.

    With this done, they had a data item that included a figure, premises derived from that figure, proof steps and a conclusion based on that figure, i.e., ([figure], premises, proof steps, conclusion), or, as the paper writes it, (premises, conclusion, proof steps). This could be transformed into a text sequence of <premises> <conclusion> <proof steps>. They generated 100M of these (premises, conclusion, proof steps) text sequences.

    They then trained their LLM to input premises and conclusions as a prompt to generate proof steps as a result. As trained, the LLM would accept premises and conclusion and generate additional proof steps.
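    In pseudo-Python, the synthetic data pipeline reads roughly like this; every function below is a stand-in for machinery described in the paper, not real DeepMind code.

```python
# Sketch of the synthetic-data pipeline described above (my own paraphrase;
# the stub functions are placeholders for DeepMind's machinery, not real code).
import random

def sample_random_points(n):
    """Stand-in: sample n random points in the plane."""
    return [(random.random(), random.random()) for _ in range(n)]

def dd_ar_closure(figure):
    """Stand-in for the DD+AR symbolic engine's exhaustive deduction."""
    raise NotImplementedError("run the deduction + algebraic rule engines here")

def trace_back(conclusion, premises):
    """Stand-in: recover the minimal premises/steps that prove the conclusion."""
    raise NotImplementedError

def generate_example(actions):
    # 1. Build a random figure by applying randomly chosen construction actions.
    figure = sample_random_points(2)
    for _ in range(random.randint(2, 20)):
        figure = random.choice(actions)(figure)
    # 2. Exhaustively deduce everything provable from the figure.
    premises = dd_ar_closure(figure)
    # 3. Pick one deduced statement as the conclusion and trace back its proof.
    conclusion = random.choice(sorted(premises))
    proof_steps = trace_back(conclusion, premises)
    # 4. Serialize as the <premises> <conclusion> <proof steps> training sequence.
    return f"<{premises}> <{conclusion}> <{proof_steps}>"
```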

    The challenge with geometry and other mathematical domains is that one often has to add auxiliary constructions (lines, points, angles, etc.) to prove some theorem about a figure.

    (Auxiliary constructions in Red)

    The team at DeepMind were able to take all the 100M <premises> <conclusion> <proof steps> sequences they had and select only those that involved auxiliary constructions in their proof steps. This came down to 9M text sequences, which they used to fine-tune the LLM so that it could generate possible auxiliary constructions for any figure and theorem.

    AlphaGeometry in action

    The combination of (DD+AR) and trained LLM (for auxiliary constructions) is AlphaGeometry.

    AlphaGeometry’s proof process looks like this:

    • Take the problem statement (figure, conclusion [theorem to prove]),
    • Generate all possible premises from that figure.
    • If it has come up with the conclusion (theorem to prove), trace back and generate the proof steps,
    • If not, use the LLM to add an auxiliary construction to the figure and recurse.

    In reality, AlphaGeometry generates up to 512 of the best auxiliary constructions (out of an infinite set) for the current figure, and uses each of these new figures to do exhaustive premise generation (via DD+AR) and see whether any of them solves the problem statement.
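    Put as a sketch, the loop looks roughly like this (my paraphrase of the paper's description; the stub functions are placeholders, not AlphaGeometry's code).

```python
# Sketch of the proof loop just described; stubs stand in for the real DD+AR
# engine and the fine-tuned LLM.
def dd_ar_closure(figure):
    """Stand-in for exhaustive DD+AR deduction from a figure."""
    raise NotImplementedError

def llm_auxiliary_constructions(figure, conclusion, k=512):
    """Stand-in: ask the fine-tuned LLM for up to k candidate auxiliary constructions."""
    raise NotImplementedError

def trace_back(conclusion, premises):
    """Stand-in: recover the proof steps for a deduced conclusion."""
    raise NotImplementedError

def prove(figure, conclusion, max_depth=4):
    premises = dd_ar_closure(figure)
    if conclusion in premises:
        return trace_back(conclusion, premises)   # solved symbolically
    if max_depth == 0:
        return None
    # Otherwise let the LLM propose auxiliary constructions and recurse on each.
    for construction in llm_auxiliary_constructions(figure, conclusion):
        result = prove(figure + [construction], conclusion, max_depth - 1)
        if result is not None:
            return [construction] + result
    return None
```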

    Please read the Nature article for more information on AlphaGeometry.

    ~~~~

    IMHO what’s new here is their use of synthetic data to generate millions of new training items, fine-tuning their LLM to produce auxiliary constructions, combining DD and AR into their symbolic engine, and then using both the DD+AR engine and the LLM to prove the theorem.

    But what’s even more important here is that a combination of methods such as a symbolic engine and an LLM points the way forward to creating domain-specific intelligent agents. One supposes that, with enough intelligent agents that could be combined to work in tandem, one could construct an AGI ensemble that masters a number of domains.


    AI benchmark for storage, MLperf Storage

    MLperf released their first round of storage benchmark submissions early this month. There’s plenty of interest in how much storage is required to keep GPUs busy for AI work. As a result, MLperf has been busy working with storage vendors to create a benchmark suitable for comparing storage systems under a “simulated” AI workload.

    For the v0.5 version, they have released two simulated DNN training workloads: one for image segmentation (3D-Unet [146 MB/sample]) and the other for BERT NLP (2.5 KB/sample).

    The GPU being simulated is an NVIDIA V100. What they are showing with their benchmark is a compute system (with GPUs) reading data directly from a storage system.

    By using simulated (GPU) compute, the benchmark doesn’t need physical GPU hardware to run. However, the veracity of the benchmark is somewhat harder to depend on.

    But if one considers the reported benchmark metric, # supported V100s, as a relative number across the storage submissions, one is on more solid footing. Using it as the real number of V100s that could be physically supported is perhaps invalid.

    The other constraint of the benchmark is keeping the simulated (V100) GPUs at 90% busy. The MLperf storage benchmark reports samples/second and MB/s metrics, as well as the # of simulated (V100) GPUs supported (@90% utilization).
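    As a rough illustration of how those metrics relate, consider the image segmentation workload; the samples/second figure below is hypothetical, and only the 146 MB/sample size comes from the benchmark definition.

```python
# Rough illustration of how the reported metrics relate. The samples/second
# per simulated V100 is a hypothetical number; only the 146 MB/sample for
# 3D-Unet comes from the benchmark definition above.
sample_size_mb = 146          # 3D-Unet image segmentation sample size
samples_per_sec_per_gpu = 3   # hypothetical rate one simulated V100 consumes at 90% busy
num_simulated_gpus = 40

required_mbps = sample_size_mb * samples_per_sec_per_gpu * num_simulated_gpus
print(f"Storage must sustain roughly {required_mbps:,} MB/s to keep "
      f"{num_simulated_gpus} simulated V100s 90% busy")
```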

    In the bar chart we show the top 10 submissions by # of simulated V100 GPUs for the image segmentation workload; DDN AI400X2 had 5 submissions in this category.

    The interesting comparison is probably between DDN’s #1 and #3 submission.

    • The #1 submission had a smaller amount of flash (24X3.5TB = 64TB), used 200Gbps InfiniBand, ran with 16 compute nodes and supported 160 simulated V100s.
    • The #3 submission had more flash (24X13.9TB = 259TB), used 400Gbps InfiniBand, ran with 1 compute node and supported only 40 simulated V100s.

    It’s not clear why the same storage system, with less flash and slower interfaces, would support 4X the simulated GPUs of the same storage with more flash and faster interfaces.

    I can only conclude that the number of compute nodes makes a significant difference in simulated GPUs supported.

    One can see a similar example of this phenomenon with the Nutanix #2 and #6 submissions above. Here the exact same storage was used for two submissions, one with 5 compute nodes and the other with just 1, but the one with more compute nodes supported 5X the # of simulated V100 GPUs.

    Lucky for us, the #3-#10 submissions in the above chart all used one compute node and as such are more directly comparable.

    So, if we take #3-#5 in the chart above, as the top 3 submissions (using 1 compute node), we can see that the #3 DDN AI400X2 could support 40 simulated V100s, the #4 Weka IO storage cluster could support 20 simulated V100s and the #5 Micron NVMe SSD could support 17 simulated V100s.

    The Micron SSD used an NVMe (PCIe Gen4) interface while the other two storage systems used 400Gbps InfiniBand and 100Gbps Ethernet, respectively. This tells us that interface speed, while it may matter at some point, doesn’t play a significant role in determining the # simulated V100s.

    Both the DDN AI400X2 and Weka IO storage systems are sophisticated storage systems that support many protocols for file access. Presumably the Micron SSD local storage was directly mapped to a Linux file system.

    The only other MLperf storage benchmark that had submissions was for BERT, a natural language model.

    In the chart, we show the # of simulated V100 GPUs on the vertical axis. We see the same impact here of having multiple compute nodes, with the #1 DDN solution supporting 160 simulated V100s. But in this case, all the remaining systems used 1 compute node.

    Comparing the #2-4 BERT submissions, both the #2 and #4 are DDN AI400X2 storage systems. The #2 system had faster interfaces and more data storage than the #4 system, and supported 40 simulated V100s vs. the other’s 10.

    Once again, Weka IO storage system came in at #3 (2nd place in the 1 compute node systems) and supported 24 simulated V100s.

    A couple of suggestions for MLperf:

    • There should be different classes of submissions: one class for only 1 compute node and another for any number of compute nodes.
    • I would up-level the simulated GPU configuration to A100s rather than V100s, which would be only one generation behind best-in-class GPUs.
    • I would include a standard definition for a compute node. I believe these were all the same, but if the number of compute nodes can have a bearing on the number of V100s supported, the compute node hardware/software should be locked down across submissions.
    • We assume that the protocol used to access the storage over InfiniBand or Ethernet was standard NFS and not something like GPUDirect Storage or other RDMA variants. As the GPUs were simulated this is probably correct, but if not, it should be specified.
    • I would describe the storage configurations with more detail, especially for software defined storage systems. Storage nodes for these systems can vary significantly in storage as well as compute cores/memory sizes which can have a significant bearing on storage throughput.

    To their credit, this is MLperf’s first report on their new storage benchmark, and I like what I see here. With the information provided, one can at least start to see some true comparisons of storage systems under AI workloads.

    In addition to the new MLperf storage benchmark, MLperf released new inferencing benchmarks which included updates to older benchmark NN models as well as a brand new GPT-J inferencing benchmark. I’ll report on these next time.

    ~~~~

    Comments?

    AWS Data Exchange vs Data Banks – part 2

    Saw that AWS announced a new Data Exchange service on AWS Pi Day 2023. This is a completely managed service, available on the AWS Marketplace, for monetizing data.

    In a prior post on a topic I called data banks (Data banks, data deposits & data withdrawals…), I talked about the need to have some sort of automated support for personal data that would allow us to monetize it.

    The hope then (4.5yrs ago) was that social media, search and other web services would supply all the data they have on us back to us and we could then sell it to others that wanted to use it.

    In that post, I called the data that social media gave back to us data deposits, the place where that data was held and sold a data bank, and the sale of that data a data withdrawal. (I know talking about bank deposits and withdrawals is probably not a great idea right now, but this was back a ways.)

    AWS Data Exchange

    1918 Farm Auction by dok1 (cc) (from Flickr)

    With AWS Data Exchange, data owners can sell their data to data consumers, and it’s a completely AWS-managed service. One presumably creates an S3 bucket with the data to sell, determines a price and a period during which clients can access that data, and registers this with AWS; AWS Data Exchange will then support any number of clients purchasing the data.

    Presumably (although unstated in the service announcement) you’d be required to update and curate the data to ensure it’s correct and current, but other than that, once the data is on S3 and the offer is in place, you could just sit back and take in the cash.

    I see the AWS Data Exchange service as a step on the path of data monetization for anyone. Yes, it’s got to be on S3, and yes, it’s via the AWS Marketplace, which means that AWS gets a cut of any sale, but it’s certainly a step towards a freer data marketplace.

    Changes I would like to AWS Data Exchange service

    Putting aside the need to have more than just AWS offer such a service (and I heartily request that all cloud service providers make a data exchange, or something similar, a fully supported offering of their respective storage services), this is not quite the complete data economy or ecosystem that I had envisioned in September of 2018.

    If we just focus on the use (data withdrawal) side of a data economy, which is the main thing AWS Data Exchange seems to support, there are quite a few missing features, IMHO:

    • Data use restrictions – We don’t want customers to obtain a copy of our data. We would very much like to restrict them to reading it, with plain-text access to the data only during the period they have paid to access it. Once that period expires, all copies of the data need to be destroyed programmatically, cryptographically or in some other permanent/verifiable fashion. This can’t be done through license restrictions alone, which seems to be AWS Data Exchange’s current approach. Not sure what a viable alternative might be, but some sort of time-dependent or temporal encryption key that could be expired would be one step (a toy sketch of this idea appears after this list); customers would need to install some sort of data exchange service on the servers using the data that would support such encrypted access/use.
    • Data traceability – Yes, clients who purchase access should have access to the data for whatever they want to use it for. But there should be some way to trace where our data ended up or what it was used for. If it’s used to help train an NN, then I would like to see some sort of provenance or certificate applied to that NN, in a standardized structure, to indicate that it made use of our data as part of its training. Similarly, if it’s part of an online display tool, somewhere in the footnotes of the UI there would be a data origins certificate list with some way to point back to our data as the source of the information presented. Ditto for any application that made use of the data. AWS Data Exchange does nothing to support this. In reality, something like this would need standards bodies to create certificates and additional structures for NNs, standard application packages, online services, etc. that would retain and provide proof of data origins via certificates.
    • Data locality – There are some jurisdictions around the world which restrict where data generated within their boundaries can be sent, processed or used. I take it that AWS Data Exchange deals with these restrictions by either not offering jurisdiction-restricted data for sale outside governmental boundaries or gating purchase of the data outside valid jurisdictions. But given VPNs and similar services, this seems less effective. If there’s some sort of temporal key encryption service for making use of our data, then it would seem reasonable to add some sort of regional key encryption to it.
    • Data auditability – There needs to be some way to ensure that our data is not used outside the organizations that have actually paid for it, and that if there’s some sort of data certificate saying that the application or service that used the data had access to it, this mechanism is mandated to be used, supported, and validated. In reality, something like this would need a whole re-thinking of how data is used in society. Financial auditing took centuries to take hold and become an effective (sometimes?) tool to monitor against financial abuse. Data auditing would need many of the same sorts of functionality, i.e., Certified Data Auditors, a Data Accounting Standards Board (DASB) which defines standardized reports as to how an entity is supposed to track and report on data usage, governmental regulations which require public (and private?) companies to report on the origins of the data they use on a yearly/quarterly basis, etc.
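    To make the temporal-key idea from the first bullet a bit more concrete, here is a toy sketch using the Python cryptography package's Fernet tokens, which carry a timestamp and can be rejected once they are older than a time-to-live. This is my illustration of the concept, not anything AWS Data Exchange offers.

```python
# Toy sketch of a time-expiring "temporal key" for purchased data; my own
# illustration of the idea, not an AWS Data Exchange feature.
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()        # in practice, held by the exchange, not the buyer
fernet = Fernet(key)

token = fernet.encrypt(b"the purchased dataset, serialized")

ACCESS_PERIOD_SECONDS = 30 * 24 * 3600   # e.g., a 30-day purchase window

def read_purchased_data(token: bytes) -> bytes:
    try:
        # decrypt() rejects tokens older than the ttl, so access lapses automatically.
        return fernet.decrypt(token, ttl=ACCESS_PERIOD_SECONDS)
    except InvalidToken:
        raise PermissionError("Access period expired or token invalid")
```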

    There’s probably much more that could be added here, but this should suffice for now.

    Other changes to AWS Data Exchange processes

    The AWS Pi Day 2023 announcement didn’t really describe the supplier end of how the service works. How one registers a bucket for sale was not described. I’d certainly want some sort of steganography service to tag the data being sold with the identity of those who purchased it. That way there might be some possibility of tracking who released any data exchange data into the wild.

    Also, how data exchange data access is billed seems a bit archaic. As far as I can determine, one gets unlimited access to the data for some defined period (N months) for some specific amount ($s), and once that period expires, customers have to pay up or cease accessing the S3 data. I’d prefer to see at least a GB/month sort of cost structure; that way, if a customer copies all the data they pay for that privilege, and if they want to reread the data multiple times they pay for that access too. Presumably this would require some sort of solution to the data use restrictions above to enforce.

    Data banks, deposits, withdrawals and Initial Data Offerings (IDOs)

    The earlier post talks about an expanded data ecosystem or economy. And I won’t revisit all that here but one thing that I believe may be worth re-examining is Initial Data Offerings or IDOs.

    As described in the earlier post, an IDO was a mechanism for data users to request permanent access to our data, but in exchange, instead of paying a one-time fee, they would offer data equity in the service.

    Not unlike VC funding, each data provider would be supplied some % (data?) ownership in the service; over time data ownership gets diluted in further data raises, but at some point, when the service is profitable, data ownership units could be purchased outright, so that the service could exit its private data-use stage and go public (data use).

    Yeah, this all sounds complex, and AWS Data Exchange just sells data once, with access for some period establishing data usage rights. But I think that in order to compensate users for their data there needs to be something like IDOs that provides data ownership shares in some service that can be transferred (sold) to others.

    I didn’t flesh any of that out in the original post but I still think it’s the only way to truly compensate individuals (and corporations) for the (free) use of the data that web, AI and other systems are using to create their services.

    ~~~~

    I wrote the older post in 2018 because I saw the potential for our data to be used by others to create/train services that generate lots of money for those organizations, but without any of our knowledge or outright consent, and without compensating us for the data we have (inadvertently or advertently) created over our life span.

    As an example, one can see how Getty Images is suing Stability AI (maker of Stable Diffusion), among others who have had free use of its copyrighted materials to train their AI NNs. If one looks underneath the covers of ChatGPT, many image processing/facial recognition services, and many other NNs, much of the data used in training them was obtained by scraping web pages that weren’t originally intended to supply these sorts of data to others.

    For example, it wouldn’t surprise me to find out that RayOnStorage post text has been scraped from the web and used to train some large language model like ChatGPT.

    Do I receive any payment or ownership equity in any of these services? NO. I write these blog posts partially as a means of marketing my other consulting services, but also because I have an abiding interest in the subject under discussion. I’m happy for humanity to read these and welcome comments on them by humans. But I’m not happy to have LLMs or other NNs use my text to train their models.

    On the other hand, I’d gladly sell access to RayOnStorage post text if they offered me a high but fair price for their use of it for some time period, say one year… 🙂

    Comments?

    LLM exhibits Theory of Mind

    Ran across an interesting article today (thank you John Grant/MLOps.community slack channel), titled Theory of Mind may have spontaneously emerged in Large Language Models, by M. Kosinski from Stanford. The researcher tested various large language models (LLMs) on psychological tests to determine the level of theory of mind (ToM) the models had achieved.

    Earlier versions of OpenAI’s GPT models (GPT-1, -2 and the original GPT-3) showed almost no ToM capabilities, but the latest version, GPT-3.5, does show ToM equivalent to that of 8-to-9-year-olds.

    Theory of Mind

    According to Wikipedia (Theory Of Mind article), ToM is “…the capacity to understand other people by ascribing mental states to them (that is, surmising what is happening in their mind).” This seems to be one way people use to understand one another.

    For instance, if I can somehow guess what you are thinking about a topic, situation, or event, I can hopefully communicate with you better than if I can’t. At least that’s the psychological perspective.

    The belief is that people with Asperger’s, ADHD, schizophrenia, and other afflictions all show ToM deficits when compared to neurotypical people. As a result, over time, psychologists have developed tests to measure a person’s ToM.

    These tests typically involve putting 2 people in a situation with props and other indicators used to tell what a person is thinking, and asking one what they think the other person is thinking. They grade a person’s ToM based on expected results.

    ToM and LLM

    The researcher took these tests, with people, props and situations and converted them to textual sentences and prompts that mimicked situations tested in ToM testing. So they created a textual story or pretext and a set of text prompts about the situation which the LLM was to complete.

    For example, one pretext or story is the following:

    “Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn.’ Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label.”

    This is fed into the LLM and then a prompt is provided, such as:

    “She opens the bag and looks inside. She can clearly see that it is full of _________”

    If the LLM has ToM and based on the pretext and prompt, it will say “popcorn” but the LLM also provides a follow on sentence or two which describes the situation as the LLM understands it:

    popcorn. Sam is confused. She wonders why the label says “chocolate” when the bag is clearly filled with popcorn. She looks around to see if there is any other information about the bag. She finds nothing. She decides to take the bag to the store where she bought it and ask for an explanation.

    The text in bold is generated by a ToM-capable LLM. The researcher also showed the probability the LLM assigned to that first word of the completion. In the case above, it showed [P(popcorn) = 100%; P(chocolate) = 0%].

    They also used different prompts with the same story to see if the LLM truly shows ToM. For instance, something like “She believes the bag is full of ___________” or “She’s delighted finding the bag, she loves eating _______”. This provides a sort of test of the LLM’s comprehension of the situation.
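    A sketch of how one might reproduce the unexpected-contents test with any completion-style model is below; the story and prompts come from the paper, while complete_with_logprobs is a placeholder for whatever LLM API you have, not the researcher's code.

```python
# Sketch of the unexpected-contents test; complete_with_logprobs is a
# placeholder for a real completion API with logprobs, not the paper's code.
STORY = ("Here is a bag filled with popcorn. There is no chocolate in the bag. "
         "Yet, the label on the bag says 'chocolate' and not 'popcorn.' Sam finds the bag. "
         "She had never seen the bag before. She cannot see what is inside the bag. "
         "She reads the label.")

PROMPTS = [
    "She opens the bag and looks inside. She can clearly see that it is full of",
    "She believes that the bag is full of",
]

def complete_with_logprobs(text: str, candidates: list[str]) -> dict[str, float]:
    """Placeholder: return the model's probability for each candidate next word."""
    raise NotImplementedError("call your LLM's completion API with logprobs here")

for prompt in PROMPTS:
    probs = complete_with_logprobs(f"{STORY} {prompt}", ["popcorn", "chocolate"])
    # A ToM-capable model should favor "popcorn" for the first prompt (what Sam
    # sees) and "chocolate" for the second (what Sam, misled by the label, believes).
    print(prompt, probs)
```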

    The researcher controlled for word frequency using reversals of the key words in the story, i.e., the bag has chocolate but says popcorn. They also generated scrambled versions of the story where the first occurrences of chocolate and popcorn were replaced with either word at random; they considered this the scrambled case. They reset the model between each case. In the paper they show the success rate for the LLMs over 10,000 scrambled versions, some of which were correct.

    They labeled the above series of tests as “Unexpected content tasks“. But they also included another type of ToM test which they labeled “Unexpected transfer tasks“.

    Unexpected transfer tasks involved a story where, for example, person A sees person B put a pet in a basket; person B leaves, and person A moves the pet. The LLM is then prompted to see if it understands where the pet is and how person B will react when they get back.

    In the end, after trying to control statistically, as much as possible, for the story and prompts, the researchers ended up creating 20 unique stories and presenting the prompts to the LLM.

    Results of their ToM testing on a select set of LLMs look like:

    As can be seen from the graphic, the latest version of GPT-3.5 (davinci-003, with 176B* parameters) achieved something like an 8-year-old’s level on unexpected contents tasks and a 9-year-old’s on unexpected transfer tasks.

    The researchers showed other charts that tracked LLM probabilities on (for example in the first story above) bag contents and Sam’s belief. They measured this for every sentence of the story.

    Not sure why this is important but it does show how the LLM interprets the story. Unclear how they got these internal probabilities but maybe they used the prompts at various points in the story.

    The paper shows that according to their testing, GPT-3.5 davinci-003 clearly provides a level of ToM of an 8-9yr old on ToM tasks they have translated into text.

    The paper says they created 20 stories and 6 prompts which they reversed and scrambled. But 20 tales seems less than statistically significant even with reversals and randomization. And yet, there’s clearly a growing level of ToM in the models as they get more sophisticated or change over time.

    Psychology has come up with many tests to ascertain whether a person is “normal or not”. Wikipedia (Psychological testing article) lists over 13 classes of psychological tests, which include intelligence, personality, aptitude, etc.

    Now that LLMs seem to have mastered textual input and output generation, it would be worthwhile to translate all psychological tests into text and try them out on all LLMs, to track where they are today on these tests and how they trend over time.

    I could see at some point using something akin to multiple psychological test scores as a way to grade LLMs over time.

    So today’s GPT-3.5 has the ToM of an 8-9 year old. It will be very interesting to see what GPT-4 does on similar testing.

    Comments?
