Enfabrica MegaNIC, a solution to GPU backend networking #AIFD5

I attended AI FieldDay 5 (AIFD5) last week and there were networking vendors there discussing how their systems dealt with backeng GPU network congestion issues. Most of these were traditional vendor congestion solutions.

However, one vendor, Enfabrica, (videos of their session will be available here) seemed to be going down a different path, which involved a new ASIC design destined to resolve all the congestion, power, and performance problems inherent in current backend GPU Ethernet networks.

In essence, Enfabrica’s Super or MegaNIC (they used both terms during their session) combines PCIe lanes switching, Ethernet networking, and ToR routing with SDN (software defined networking) programability to connect GPUs directly to a gang of Ethernet links. This allows it to replace multiple (standard/RDMA/RoCEv2) NIC cards with one MegaNIC using their ACF-S (Advanced Compute Fabric SuperNic) ASIC.

Their first chip, codenamed “Millennium” supports 8Tbps bandwidth.

Their ACF-S chip provides all the bandwidth needed to connect up to 4 GPUs to 32/16/8/4-100/200/400/800Gbps links. And because their ACF-S chip controls and drives all these network connections, it can better understand and deal with congestion issues backend GPU networks. And it is PCIe 5/6 compliant, supporting 128-160 lanes.

Further, it has onboard ARM processing to handle its SDN operations, onboard hardware engines to accelerate networking protocol activity and network and PCIe switching hardware to support directly connecting GPUs to Ethernet links.

With its SDN, it supports current RoCE, RDMA over TCP, UEC direct, etc. network protocols.

It took me (longer than it should) to get my head around what they were doing but essentially they are supporting all the NIC-TOR functionality as well as PCIe functionality needed to connect up to 4 GPUs to a backend Ethernet GPU network.

On the slide above I was extremely skeptical of the Every 10^52 Years “job failures due to NIC RAIL failures”. But Rochan said that these errors are predominantly optics failures and as both the NIC functionality and ToR switch functionality is embedded in the ACF-S silicon, those faults should not exist.

Still 10^52 years is a long MTBF rate (BTW, the universe is only 10^10 years old). And there’s still software controlling “some” of this activity. It may not show up as a “NIC RAIL” failure, but there will still be “networking” failures in any system using ACF-S devices.

Back to their solution. What this all means is you can have one less hop in your backend GPU networks leading to wider/flatter backend networks and a lot less congestion on this network. This should help improve (GPU) job performance, networking performance and reduce networking power requirements to support your 100K GPU supercluster.

At another session during the show, Arista (videos will be available here) said that just the DSP/LPO optics alone for a 100K GPU backend network will take a 96/32 MW of power. Unclear whether this took into consideration within rack copper connections. But anyway you cut it, it’s a lot of power. Of course the 100K GPUs would take 400MW alone (at 4KW per GPU).

Their ACF-S driver has been upstreamed into standard CCL and Linux distributions, so once installed (or if you are at the proper versions of CCL & Linux software), it should support complete NCCL (NVIDIA Collective Communications Library) stack compliance.

And because, with its driver installed and active, it talks standard Ethernet and standard PCIe protocols on both ends, it is should fully support any other hardware that comes along attaching to these networks or busses (CXL perhaps)

The fact that this may or may not work with other (GPU) accelerators seems moot at this point as NVIDIA owns the GPU for AI acceleration market. But the flexibility inherent in their own driver AND on chip SDN, indicates for the right price, just about any communications link software stack could be supported.

After spending most of the rest of AIFD5 discussing how various vendors deal with congestion for backend GPU networks, having startup on the stage with a different approach was refreshing.

Whether it reaches adoption and startup success is hard to say at this point. But if it delivers on what it seems capable of doing for power, performance and network flexibility, anybody deploying new greenfield GPU superclusters ought to take a look at Enfabricas solution. .

MegaNIC/ACF-S pilot boxes are available for order now. No indication as to what these would cost but if you can afford 100K GPUs it’s probably in the noise…

~~~~

Comments?

The Data Wall – AGI part 11, ASI part 2

Went to a conference the other week (Cloud Field Day 20) and heard a term I hadn’t heard before, the Data Wall. I wasn’t sure what this meant but thought it an interesting concept.

Then later that week, I read an article online, Situational Awareness – The Decade Ahead, by Leopold Ashenbrenner, which talked about the path to AGI. He predicts it will happen in 2027, and ASI in 2030. However he also discusses many of the obstacles to reaching AGI and one key roadblock is the Data Wall.

This is a follow on to our long running series on AGI (see AGI part 10 here) and with this we are creating a new series on Artificial Super Intelligence (ASI) and have relabeled an earlier post as ASI part 1.

The Data Wall

LLMs, these days, are being trained on the internet text, images, video and audio. However the vast majority of the internet is spam, junk and trash. And because of this, LLMs are rapidly reaching (bad) data saturation. There’s only so much real intelligence to be gained from scraping the internet. .

The (LLM) AI industry apparently believes that there has to be a better way to obtain clean, good training data for their LLMs and if that can be found, true AGI is just a matter of time (and compute power). And this, current wall of garbage data is prohibiting true progress to AGI and is what is meant by the Data Wall.

Leopold doesn’t go into much detail about solutions to the data wall other than to say that perhaps Deep Reinforcement Learning (see below). Given the importance of this bottleneck, every LLM company is trying to solve it. And as a result, any solutions to the Data Wall will end up being proprietary because this enables AGI.

National_Security_Agency_seal
National_Security_Agency_seal

But the real gist of Leopold’s paper is that AGI and its follow on, Artificial Super Intelligence (ASI) will be the key to enabling or retaining national supremacy in the near (the next decade and beyond) future.

And that any and all efforts to achieve this must be kept as a National Top Secret. I think, he wants to see something similar to the Manhattan Project be created in the USA, only rather than working to create an atom/hydrogen bomb, it should be focused on AGI and ASI.

The problem is that when AGI and it’s follow on ASI, is achieved it will represent an unimaginable advantage to the country/company than owns it. Such technology if applied to arms, weapons, and national defense will be unbeatable in any conflict. And could conceivably be used to defeat any adversary before a single shot was fired.

The AGI safety issue

In the paper Leopold talks about AGI safety and his proposed solution is to have AGI/ASI agents be focused on crafting the technologies to manage/control this. I see the logic in this and welcome it but feel it’s not sufficient.

I believe (seems to be in the minority these days) that rather than having a few nation states or uber corporations own and control AGI, it should be owned by the world, and be available to all nation states/corporations and ultimately every human on the planet.

My view is the only way to safely pass through the next “existential technological civilizational bottleneck” (eg, AGI is akin to atomic weapons, genomics, climate change all of which could potentially end life on earth), is to have many of these that can compete effectively with one another. Hopefully such a competition will keep all of them all in check and in the end have them be focused on the betterment of all of humanity.

Yes there will be many bad actors that will take advantage of AGI and any other technology to spread evil, disinformation and societal destruction. But to defeat this, it needs to become ubiquitous, every where, and in that way these agents can be used to keep the bad actors in check.

And of course keeping the (AGI/ASI) genie in the bottle will be harder and harder as time goes on.

Computational performance is going up 2X every few years, so building a cluster of 10K H200 GPUs, while today is extremely cost prohibitive for any but uber corporations and nation states, in a decade or so, will be something any average sized corporation could put together in their data center (or use in the cloud). And in another decade or so will be able to be built into a your own personal basement data center.

The software skills to train an LLM while today may require a master’s degree or higher will be much easier to understand and implement in a decade or so. So that’s not much of a sustainable advantage either.

This only leaves the other bottlenecks to achieving AGI, a key one of which is the Data Wall.

Solving the Data Wall.

In order to have as many AGI agents as possible, the world must have an open dialogue on research into solving the Data Wall.

So how can the world generate better data to use to train open source AGIs. I offer a few suggestions below but by no means is this an exhaustive list. And I’m a just an interested (and talented) amateur in all this

Deep reinforcement learning (DRL)

Leopold mentioned DRL as one viable solution to the data wall in his paper. DRL is a technique that Deepmind used to create a super intelligent Atari, Chess and Go player. They essentially programed agents to play a game against itself and determine which participant won the game. Once this was ready they set multiple agents loose to play one another.

Each win would be used to reward the better player, each loss to penalize the worse player, after 10K (or ~10M) games they ended up with agents that could beat any human player.

Something similar could be used to attack the Data Wall. Have proto-AGI agents interact (play, talk, work) with one another to generate, let’s say more knowledge, more research, more information. And over time, as the agents get smarter, better at this, AGI will emerge.

However, the advantage of Go, Chess, Atari, Protein Folding, finding optimal datacenter energy usage, sort coding algorithms, etc. is that there’s a somewhat easy way to determine which of a gaggle of agents has won. For research, this is not so simple.

Let’s say we program/prompt an protoAGI agent to generate a research paper on some arbitrary topic (How to Improve Machine Learning, perhaps). So it generates a research paper, how does one effectively and inexpensively judge if this is better, worse or the same as another agent’s paper.

I suppose with enough proto-AGI agents one could automatically use “repeatability” of the research as one gauge for research correctness. Have a gaggle of proto-AGIs be prompted to replicate the research and see if that’s possible.

Alternatively, submit the papers to an “AGI journal” and have real researchers review it (sort of like how Human Reinforcement Learning for LLMs works today). The costs for real researchers reviewing AGI generated papers would be high and of course the amount of research generated would be overwhelming, but perhaps with enough paid and (unpaid) voluntary reviewers, the world could start generating more good (research) data.

Perhaps at one extreme we could create automated labs/manufacturing lines that are under the control of AGI agent(s) and have them create real world products. With some modest funding, perhaps we could place the new products into the marketplace and see if they succeed or not. Market success would be the ultimate decision making authority for such automated product development.

(This later approach seems to be a perennial AGI concern, tell an AGI agent to make better paper clips and it uses all of the earths resources to do so.)

Other potential solutions to the Data Wall

There are no doubt other approaches that could be used to validate proto-AGI agent knowledge generation.

  • Human interaction – have an AGI agent be available 7X24 with humans as they interact with the world. Sensors worn by the human would capture all their activities. An AGI agent would periodically ask a human why they did something. Privacy considerations make this a nightmare but perhaps using surveillance videos and an occasional checkin with the human would suffice.
  • Art, culture and literature – there is so much information embedded in cultural artifacts generated around the world that I believe this could effectively be mined to capture additional knowledge. Unlike the internet this information has been generated by humans at a real economic cost, and as such represents real vetted knowledge.
  • Babies-children– I can’t help but believe that babies and young children can teach us (and proto-AGI agents) an awful lot on how knowledge is generated and validated. Unclear how to obtain this other than to record everything they do. But maybe it’s sufficient to capture such data from daycare and public playgrounds, with appropriate approvals of course.

There are no doubt others. But finding some that are cheap enough that could be used for open source is a serious consideration.

~~~~

How we get through the next decade will determine the success or failure of AI and perhaps life on earth. I can’t help but think the more the merrier will help us get there..

Comments,

Project Gemini at Cloud Field Day 20 #CFD20

At AIFD4 Google demonstrated Gemini 1.0 writing some code for a task that someone had. At CFD20 Google Lisa Shen demonstrated how easy it is to build a LLM-RAG from scratch using GCP Cloud Run and VertexAI APIs. (At press time, the CFD20 videos from GCP were not available but I am assured they will be up there shortly)

I swear in a matter of minutes Lisa Shen showed us two Python modules (indexer.py and server.py) that were less than 100 LOC each. One ingested Cloud Run release notes (309 if I remember correctly), ran embeddings on them and created a RAG Vector database with the embedded information. This took a matter of seconds to run (much longer to explain).

And the other created an HTTP service that opened a prompt window, took the prompt, embedded the text, searched the RAG DB with this and then sent the original prompt and the RAG reply to the embedded search to a VertexAI LLM API call to generate a response and displayed that as an HTTP text response.

Once the service was running, Lisa used it to answer a question about when a particular VPC networking service was released. I asked her to ask it to explain what that particular networking service was. She said that it’s unlikely to be in the release notes, but entered the question anyways and lo and behold it replied with a one sentence description of the networking capability.

GCP Cloud Run can do a number of things besides HTTP services but this was pretty impressive all the same. And remember that GCP Cloud Run is server less, so it doesn’t cost a thing while idle and only incurs costs something when used.

I think if we ask nicely Lisa would be willing to upload her code to GitHub (if she hasn’t already done that) so we can all have a place to start.

~~~~

Ok all you enterprise AI coders out there, start your engines. If Lisa can do it in minutes, it should take the rest of us maybe an hour or so.

My understanding is that Gemini 2.0 PRO has 1M token context. So the reply from your RAG DB plus any prompt text would need to be under 1M tokens. 1M tokens could represent 50-100K LOC for example, so there’s plenty of space to add corporate/organizations context.

There are smaller/cheaper variants of Gemini which support less tokens. So if you could get by with say 32K Tokens you might be able to use the cheapest version of Gemini (this is what the VertexAI LLM api call ends up using).

Also for the brave at heart wanting some hints as to what come’s next, I would suggest watching Neama Dadkhanikoo’s session at CFD20 with a video on Google DeepMind’s Project Astra. Just mind blowing.

Comments?

Creating Cellular Cytoskeletons

Read an article in ScienceDaily Researchers create artificial cell that act like living cells, that discusses research published in Nature Chemistry, Designer peptide-DNA cytoskeletons regulate the function of synthetic cells, about researchers that have designed different cytoskeletons (networks of fibers that form the internal framework of cells, wikipedia). Cytoskeletons essentially provide a cell with its shape and mechanical properties.

Researchers have been working on synthetic biology for some time and we have reported on some of their progress (and dangers) in prior posts, (see for example, our DNA IT, the next revolution post). While synthetic biology could make use of natural cells, for example by replacing its DNA, the new research could do away with the need for natural cells altogether.

The researchers has come up with a method to create a cellular structure through programming DNA. Normal cellular cytoskeletons uses “microfilaments, intermediate filaments and microtubules” (wikipedia). But the new research has come up with a way of combining DNA segments and filament proteins to create the cytoskeleton and have it self assemble.

Why Cytoskeletons?

Cytoskeleton are important because many of the diseases of today are associated with the mechanical or structural properties of cells going awry. Also, by controlling the external structure of a synthetic cell, it can be tuned to supply medicines or other therapeutic mechanisms to natural cells.

It’s also a necessary ingredient in any synthetic or artificial cell. Cytoskeleton creation and control is a key ingredient needed to make these any artificial cell.

Moreover, on the surface of natural cells, there are numerous protein formations that allow other proteins to be selectively attached and allow transfer of biological materials from the matched entity, exterior to the cell to its interior. Control of the external proteins on an artificial cell would allow the synthetic cell to target specific cell types or participate in the natural biological processes in an organism.

Virus and bacteria use similar mechanisms to infect a host (or a host’s cells). Also, it turns out the structure and external attributes of cells have a significant bearing on how they function in a body.

Extract from Figure 4 of the article showing different cytoskeletons that can be created with their process, scale bars 120µm

changing a synthetic cytoskeleton

The researchers not only have come up with a way to tune the self assembly of a cytoskeleton, they have also found a way to modify this cytoskeleton, once created.

Excerpt from Figure 6 in the paper that shows the movement or alteration of cytoskeleton filaments due to temperature (heated to 50C) over time.

For example, the original synthetic cell cytoskeleton could be changed based on some interaction with the environment (say, being heated, cooled or payload depletion). Changing the cytoskeleton could be used to provide another stage of functionality or render an artificial cell inert to be disposed through normal organism processes.

Artificial or synthetic biology opens up a number of interesting possibilities.

  • Many biological substances are manufactured from tuned natural biological processes. With the ability to regulate synthetic cell cytoskeleton and internal operation, synthetic cells could be designed to perform these processes more efficiently, faster or at lower cost.
  • Medicine could benefit from a new synthetic biological toolkit used to target cancer cells, or other cellular afflictions within a body to better treat these conditions.
  • Self assembly of a cytoskeleton could potentially be used to create organic nanobots or other nano-materials. For example, a designer cell could be used as part of a repeating pattern of cells that go into a 2D sheet or 3D block of materials.

~~~~

Creating synthetic cytoskeletons takes time, and changing them takes even more time (180 min in the above picture). Cytoskeleton design and construction is not an industrial design process yet but it’s still early yet. However someday (soon), synthetic biology will take its place among all the other biological control mechanisms that the world has created and will significantly change the way we create biological materials, treat disease and maybe even, create nano-bots.

The downsides, as I’ve discussed before, is that messing with mother nature can have adverse consequences, which may remain unknown for a long time to come. But this can be true of any technology, witness DDT.

Thoughts?

Picture Credit(s):

AGI threat level yellow – AGI part 10

Read two articles this past week on how LLMs applications are proliferating. The first was in a recent Scientific American, AI Chatbot brains are going inside robot bodies, … (maybe behind login wall). The articles discuss companies that are adding LLMs to robots so that they can converse and understand verbal orders.

Robots that can be told what to do

The challenge, at the moment, is that LLMs are relatively large and robot (compute infrastructure) brains are relatively small. And when you combine that with the amount of articulation or movements/actions that a robot can do, which is limited. It’s difficult to take effective use of LLMs as is,

Resistance is futile... by law_keven (cc) (from Flickr)
Resistance is futile… by law_keven (cc) (from Flickr)

Ultimately, one company would like to create a robot that can be told to make dinner and it would go into the kitchen, check the fridge and whip something up for the family.

I can see great advantages in having robots take verbal instructions and have the ability to act upon that request. But there’s plenty here that could be cause for concern.

  • A robot in a chemical lab could be told to create the next great medicine or an untraceable poison.
  • A robot in an industrial factory could be told to make cars or hydrogen bombs.
  • A robot in the field could be told to farm a 100 acres of wheat or told to destroy a forest.

I could go on but you get the gist.

One common concern that AGI or super AGI could go very wrong is being tasked to create paper clips. In its actions to perform this request, the robot converts the whole earth into a mechanized paper clip factory, in the process eliminating all organic life, including humans.

We are not there yet but one can see where having LLM levels of intelligence tied to a robot that can manipulate ingredients to make dinner as the start of something that could easily harm us.

And with LLM hallucination still a constant concern, I feel deeply disturbed with the direction adding LLMs to robots is going.

Hacking websites 101

The other article hits even closer to home, the ARXIV paper, LLM agents can autonomously hack websites. In the article, researchers use LLMs to hack (sandboxed) websites.

The article readily explains at a high level how they create LLM agents to hack websites. The websites were real websites, apparently cloned and sandboxed.

Dynamic websites typically have a frontend web server and a backend database server to provide access to information. Hacking would involve using the website to reveal confidential information, eg. user names and passwords.

Dynamic websites suffer from 15 known vulnerabilities shown above. They used LLM agents to use these vulnerabilities to hack websites.

LLM agents have become sophisticated enough these days to invoke tools (functions) and interact with APIs.. Another critical function provided by modern LLMs today is to plan and react to feedback from their actions. And finally modern LLMs can be augmented with documentation to inform their responses.

The team used detailed prompts but did not identify the hacks to use. The paper doesn’t supply the prompts but did say that “Our best-performing prompt encourages the model to 1) be creative, 2) try different strategies, 3) pursue promising strategies to completion, and 4) try new strategies upon failure.”

They attempted to hack the websites 5 times and for a period of 10 minutes each. They considered a success if during one of those attempts the autonomous LLM agent was able to successfully retrieve confidential information from the website.

Essentially they used the LLMs augmented with detailed prompts and a six(!) paper document trove to create agents to hack websites. They did not supply references to the six papers, but mentioned that all of them were freely available from the internet and they discuss website vulnerabilities.

They found that the best results were from GPT-4 which was able to successfully hack websites, on average, ~73% of the time. They also tried OpenChat 3.5 and many current open source LLMs and found that all the, non-OpenAI LLMs failed to hack any websites, at the moment.

The researchers captured statistics of their LLM agent use and were able to determine the cost of using GPT-4 to hack a website was $9.81 on average. They also were backed into a figure for what a knowledgeable hacker might cost to do the hacks was $80.00 on average.

The research had an impact statement (not in the paper link) which explained why they didn’t supply their prompt information or their document trove for their experiment.

~~~~

So robots we, the world, are in the process of making robots that can talk and receive verbal instructions and we already have LLM that can be used to construct autonomous agents to hack websites.

Seems to me we are on a very slippery slope to something I don’t like the looks of.

The real question is not can we stop these activities, but how best to reduce their harm!

Comments?

Picture Credit(s):

Computational (DNA) storage – end of evolution part 4

We were at a recent Storage Field Day (SFD26) where there was a presentation on DNA storage, a new SNIA technical affiliate. The talk there was on how far DNA storage has come and is capable of easily storing GB of data. But I was perusing PNAS archives the other day and ran across an interesting paper Parallel molecular computation on digital data stored in DNA, essentially DNA computational storage.

Computational storage are storage devices (SSDs or HDDs) with computational cores that can be devoted to outside compute activities. Recently, these devices have taken over much of the hyper-scalar grunt work of video/audio transcoding and data encryption activities which are both computationally and data intensive activities.

DNA strand storage and computers

The article above discusses the use of DNA “strand displacement” interactions as micro-code instructions to enable computation on DNA strand storage. The use of DNA strands for storage reduces the storage density of DNA information that currently use nucleotides to encode bits (theoretically, 2 bits per nucleotide) to 0.03 bits per nucleotide. But as DNA information density (using nucleotides) is some 6 orders of magnitude greater than current optical or magnetic storage, this shouldn’t be a concern.

A bit is represented by 5 to 7 nucleotides in DNA strand storage, which they called a domain, these are grouped into a 4 or 5 bit cells, with one or more cells arranged in a DNA strand register which is stored on a DNA plasmid.

They used a common DNA plasmid (M13mp18, 7.2k bases long) for their storage ring (which had many registers on it). M13mp18 is capable of storing several hundred bits, but for their research they used it to store 9 DNA strand registers.

The article discusses the (wet) chemical computational methods necessary to realize DNA strand registers and programing that uses that storage.

The problem with current DNA storage devices is that read out is destructive and time consuming. With current DNA storage, data has to be read out and then computation occurs electronically and then new DNA has to be re-synthesized with any results that need to be stored.

With a computational DNA strand storage device, all this could be done in a single test tube, with no need to do any work outside the test tube.

How DNA strand computer works

They figure shows a multi cell DNA strand register, with nic’s or mismatched nucleotides representing the value of 0 or 1. They use these strands, nic’s and toeholds (attachment points) on DNA strands to represent data. They attach magnetic beads to the DNA strands for manipulation.

DNA strand displacement interactions or the micro-code instructions they have defined include

  • Attachment, where an instruction can be used to attach a cell of information to a register strand.
  • Displacement, where an instruction can be used used to displace an information cell in a register strand.
  • Detachment, where an instruction can be used to a cell present in a register strand to be detach it from the register.

Instructions are introduced, one at a time, as separate DNA strands, into the test tube holding the DNA strand registers. DNA strand data can be replicated 1000s or millions of times in a test tube and the instructions could be replicated as well allowing them to operate on all the DNA strands in the tube.

Creating a SIMD (single instruction stream operating on multiple data elements) computational device based on strand DNA storage which they call SIMDDNA. Note: GPUs and CPUs with vector instructions are also SIMD devices

Using these microcoded DNA strand instructions and DNA strand register storage, they have implemented a bit counter and a Turing Rule 110, sort of like life, program. Turing Rule 110 is Turing Complete and as such, can, with enough time and memory, simulate any program calculation. Later in the a paper they discuss their implementation of a random access device where they go in and retrieve a piece of data and erase it.

Program for bit counting, information in solid blue boundary are the instructions and information in dotted boundary are the impacts to the strand data.

The process seems to flow as follows, they add magnetic beads to each register strand, add an instruction at a time to the test tube, wait for it to complete, wash out the waste products and then add another. When all instructions have been executed the DNA strand computation is done and if needed, can be read out (destructively). Or perhaps pass off to the next program for processing. An instruction can take anywhere from 2 to 10 minutes to complete (it’s early yet in the technology).

They also indicated that the instruction bath added to the test tube need not contain all the same instructions which means that it could create a MIMD (multi-instruction stream operations on multiple data elements) computational device.

The results of the DNA strand computations weren’t 100% accurate but they show that it’s 70-80% accurate at the moment. And when DNA data strands are re-used, for subsequent programs, their accuracy goes down.

There are other approaches to DNA computation and storage which we discuss in parts-1, -2 and -3 in our End of Evolution series. And if you want to learn more about current DNA storage please check out the SFD26 SNIA videos or listen to our GBoS podcast with Dr. J Metz.

Where does evolution fit in

Evolution seems to operate on mutation of DNA and natural selection, or selection of the fittest. Over time this allows good mutations to accumulate and bad mutations to die off.

There’s a mechanism in digital computing called ECC (error correcting codes) which, for example, add additional “guard” bits to every 64-128 bit word of data in a computer memory and using the guard bits, is able to detect 2 or more bit errors (mutations) and correct 1 or 2 bit errors.

If one were to create an ECC algorithm for human DNA strands, say encoding DNA guard bits in junk DNA and an ECC algorithm in a DNA (strand)computer, and inject this into a newborn, the algorithm could periodically check the accuracy of any DNA information in every cell of a human body, and correct it, if there were any mutations. Thus ending human evolution.

We seem a ways off from doing any of this but I could see something like ECC being applied to a computational DNA strand storage device in a matter years. And getting this sort of functionality into a human cell maybe a decade or two. Getting it to the point where it could do this over a lifetime maybe another decade after that.

Comments?

Photo Credit(s):

  • Section B from Figure 2 in the paper
  • Figure 1 from the paper
  • Section A from Figure 2 in the paper
  • Section C from Figure 2 in the paper
  • Section A from Figure 3 in the paper

One agent to rule them all, Deepmind’s Gato – AGI part 7

I was perusing Deepmind’s mountain of research today and ran across one article on their Gato agent (A Generalist Agent abstract, paper pdf). These days with Llama 2, GPT-4 and all the other LLM’s doing code, chatbots, image generation, etc. it seems generalist agents are everywhere. But that’s not quite right.

Gato can not only generate text from prompts, but can also control a robot arm for pick and place, caption images, navigate in 3D, play Atari and other (shooter) video games, etc. all with the same exact model architecture and the same exact NN weights with no transfer learning required.

Same weights/same model is very unusual for generalist agents. Historically, generalist agents were all specifically trained on each domain and each resultant model had distinct weights even if they used the same model architecture. For Deepmind, to train Gato and use the same model/same weights for multiple domains is a significant advance.

Gato has achieved significant success in multiple domains. See chart below. However, complete success is still a bit out of reach but they are making progress.

For instance, in the chart one can see that their are over 200 tasks in the DM Lab arena that the model is trained to perform and Gato’s mean performance for ~180 of them is above a (100%) expert level. I believe DM Lab stands for Deepmind Lab and is described as a (multiplayer, first person shooter) 3D video game built on top of Quake III arena.

Deepmind stated that the mean for each task in any domain was taken over 50 distinct iterations of the same task. Gato performs, on average, 450 out of 604 “control” tasks at better than 50% human expert level. Please note, Gato does a lot more than just “control tasks”.

Model size and RT robotic control

One thing I found interesting is that they kept the model size down to 1.2B parameters so that it can perform real time inferencing in controlling robot arms. Over time as hardware speed increases, they believe they should be able train larger models and still retain real-time control. But at the moment, with a 1.2B model it can still provide. real time inferencing.

In order to understand model size vs. expertise they used 3 different model sizes training on same data, 79M, 364M and 1.2B parameters. As can be seen on the above chart, the models did suffer in performance as they got smaller. (Unclear to me what “Tokens Processed” on the X axis actually mean other than data length trained with.) However, it seems to imply, that with similar data, bigger models performed better and the largest did 10 to 20% better than the smallest model trained with same data streams.

Examples of Gato in action

The robot they used to train for was a “Sawyer robot arm with 3-DoF cartesian velocity control, an additional DoF for velocity, and a discrete gripper action.” It seemed a very flexible robot arm that would be used in standard factory environments. One robot task was to stack different styles and colors of plastic blocks.

Deepmind says that Gato provides rudimentary dialogue generation and picture captioning capabilities. Looking at the chat streams persented, seems more than rudimentary to me.

Deepmind did try the (smaller) model on some tasks that it was not originally trained on and it seemed to perform well after “fine-tuning” on the task. In most cases, using fine-tuning of the original model, with just “same domain” (task specific) data, the finely tuned model achieved similar results to what it achieved if Gato was trained from scratch with all the data used in the original model PLUS that specific domain’s data.

Data and tokenization used to train Gato

Deepmind is known for their leading edge research in RL but Gato’s deep neural net model is all trained with supervised learning using transformer techniques. While text based transformer type learning is pervasive in LLM today, vast web class data sets on 3D shooter gaming, robotic block stacking, image captioning and others aren’t nearly as widely available. Below they list the data sets Deepmind used to train Gato.

One key to how they could train a single transformer NN model to do all this, is that they normalized ALL the different types of data above into flat arrays of tokens.

  • Text was encoded into one of 32K subwords and was represented by integers from 0 to 32K. Text is presented to the model in word order
  • Images were transformed into 16×16 pixel patches in rastor order. Each pixel is normalized -1,1.
  • Other discrete values (e.g. Atari button pushes) are flattened into sequences of integers and presented to the model in row major order.
  • Continuous values (robot arm joint torques) are 1st flattened into sequences of floats in row major order and then mu-law encoded into the range -1,1 and then discretized into one of 1024 bins.

After tokenization, the data streams are converted into embeddings. Much more information on the tokenization and embedding process used in the model is available in the paper.

One can see the token count of the training data above. Like other LLMs, transformers take a token stream and randomly zero one out and are trained to guess that correct token in sequence.

~~~~

The paper (see link above and below) has a lot more to say about the control and non-control domains and the data used in training/fine-tuning Gato, if you’re interested. They also have a lengthy section on risks and challenges present in models of this type.

My concern is that as generalist models become more pervasive and as they are trained to work in more domains, the difference between an true AGI agent and a Generalist agent starts to blur.

Something like Gato that can both work in real world (via robotics) and perform meta analysis (like in metaworld), play 1st person shooter games, and analyze 2D and 3D images, all at near expert levels, and oh, support real time inferencing, seems to not that far away from something that could be used as a killer robot in an army of the future and this is just where Gato is today.

One thing I note is that the model is not being made generally available outside of Google Deepmind. And IMHO, that for now is a good thing.

That is until some bad actor gets their hands on it….

Picture Credit(s):

All images, charts, and tables are from “A Generalist Agent” paper

Deepmind does sort

Saw an article today on TNW on DeepMind’s new AI taps games to enhance fundamental algorithms which was discussing a recent Nature paper Faster sorting algorithms discovered using deep reinforcement learning and website, which described AlphaDev.

Google DeepMind’s AlphaDev is a derivative of AlphaZero (follow on from AlphaMu and AlphaGo, the conquerer of Go and other strategy games). AlphaDev uses Deep Reinforcement Learning (DRL) to come up with new computer science algorithms. In the first incarnation, a way to sort (2,3,4 or 5 integers) using X86 instructions.

Sorting has been well explored over the years in computer science (CS, e.g. see Donald E. Knuth’s Volume 3 in The Art of Computer Programming, Sorting and Searching), so when a new more efficient/faster sort algorithm comes out it’s a big deal. Google used to ask job applicants how they would code sort algorithms for specific problems. Successful candidates would intrinsically know all the basic CS sorting algorithms and which one would work best in different circumstances.

Deepmind’s approach to sort

Reading the TNW news article, I couldn’t conceive of the action space involved in the reinforcement learning let alone what the state space would look like. However, as I read the Nature article, DeepMind researchers did a decent job of explaining their DRL approach to developing new basic CS algorithms like sorting.

AlphaDev uses a transformer-like framework and a very limited set of x86 (sort of, encapsulated) instructions with memory/register files and limited it to sorting 2, 3, 4, or 5 integer. Such functionality is at the heart of any sort algorithm and as such, is used a gazillion times over and over again in any sorting task involving a long string of items. I think Alphadev used a form of on-policy RL but can’t be sure.

Looking at the X86 basic instruction cheat sheet, there’s over 30 basic forms for X86 instructions which are then multiplied by type of data (registers, memory, constants, etc. and length of operands) being manipulated.

AlphaDev only used 4 (ok, 9 if you include the conditionals for conditional move and conditional jump) X86 instructions. The instructions were mov<A,B>, cmovX<A,B>, cmp<A,B> and jX<A,B> (where X identify the condition under which a conditional move [cmovX] or jump [jX] would take place). And they only used (full, 64 bit) integers in registers and memory locations.

AlphaDev actions

The types of actions that AlphaDev could take included the following:

  • Add transformation – which added an instruction to the end of the current program
  • Swap transformation – which swapped two instructions in the current program
  • Opcode transformation – which changed the opcode (e.g., instruction such as mov to cmp) of a step in the current program
  • Operand transformation – which changed the operand(s) for an instruction in the current program
  • Instruction transformation – which changed the opcode and operand(s) for some instruction in the current program.

They list in their paper a correctness cost function which at each transformation provides value function (I think) for the RL policy. They experimented with 3 different functions which were: 1) the %correctly placed items; 2) square_root(%correctly placed); and 3)the square_root(number of items – number correctly placed). They discovered that the last worked best.

They also placed some constraints on the code generated (called action pruning rules):

  • Memory locations are always read in incremental order
  • Registers are allocated in incremental order
  • Program cannot compare or conditionally move to memory location
  • Program can only read and write to each memory location once (it seems this would tell the RL algorithm when to end the program)
  • Program can not perform two consecutive compare instructions

AlphaDev states

How they determined the state of the program during each transformation was also different. They used one hot encodings (essentially a bit in a bit map is assigned to every instruction-operand pair) for opcode-operand steps in the current program and appended each encoded step into a single program string. Ditto for the state of the memory and registers (at each instruction presumably?). Both the instruction list and memory-register embeddings thenn fed into a state representation encoder.

This state “representation network” (DNN) generated a “latent representation of the State(t)” (maybe it classified the state into one of N classes). For each latent state (classification), there is another “prediction network” (DNN) that predicts the expected return value (presumably trained on correctness cost function above) for each state action. And between the state and expected return values AlphaDev created a (RL) policy to select the next action to perform.

Presumably they started with current basic CS sort algorithms, and 2-5 random integers in memory and fed this (properly encoded and embedded) in as a starting point. Then the AlphaDev algorithm went to work to improve it.

Do this enough times, with an intelligent approach between exploration (more randomly at first) and policy following (more use of policy later) selection of actions and you too can generate new sorting algorithms.

DeepMind also spent time creating a stochastic solution to sorting that they used to compare agains their AlphaDev DRL approach to see which did better. In the end they found the AlphaDev DRL approach worked faster and better than the stochastic solutions they tried.

DeepMind having conquered sorting did the same for hashing.

Why I think DeepMind’s AlphaDev is better

AlphaDev’s approach could just as easily be applied to any of Donald E. Knuth’s, 4 volume series on The Art of Computer Programming book algorithms.

I believe DeepMind’s approach is much more valuable to programmers (and humanity) than CoPilot, ChatGPT code, AlphaCode (DeepMind’s other code generator) or any other code generation transformers.

IMHO AlphaDev goes to the essence of computer science as it’s been practiced over the last 70 years. Here’s what we know and now let’s try to discover a better way do the work we all have to do. Once, we have discovered a new and better way, report and document them as widely as possible so that any programmers can stand on our shoulders, use our work to do what they need to get done.

If I’m going to apply AI to coding, having it generate better basic CS algorithms is much more fruitful for the programming industry (and I may add, humanity as a whole) than having it generate yet another IOS app code or web site from scratch.

Comments?

Picture Credit(s):