Enfabrica MegaNIC, a solution to GPU backend networking #AIFD5

I attended AI Field Day 5 (AIFD5) last week, where several networking vendors discussed how their systems deal with backend GPU network congestion. Most of these were traditional vendor congestion solutions.

However, one vendor, Enfabrica (videos of their session will be available here), seemed to be going down a different path, with a new ASIC design intended to resolve the congestion, power, and performance problems inherent in current backend GPU Ethernet networks.

In essence, Enfabrica’s Super or MegaNIC (they used both terms during their session) combines PCIe lane switching, Ethernet networking, and ToR routing with SDN (software defined networking) programmability to connect GPUs directly to a gang of Ethernet links. This allows it to replace multiple (standard/RDMA/RoCEv2) NIC cards with one MegaNIC using their ACF-S (Advanced Compute Fabric SuperNIC) ASIC.

Their first chip, codenamed “Millennium”, supports 8Tbps of bandwidth.

Their ACF-S chip provides all the bandwidth needed to connect up to 4 GPUs to 32/16/8/4-100/200/400/800Gbps links. And because their ACF-S chip controls and drives all these network connections, it can better understand and deal with congestion issues in backend GPU networks. It is also PCIe 5/6 compliant, supporting 128-160 lanes.

Further, it has onboard ARM processing to handle its SDN operations, onboard hardware engines to accelerate networking protocol activity and network and PCIe switching hardware to support directly connecting GPUs to Ethernet links.

With its SDN, it supports current RoCE, RDMA over TCP, UEC direct, etc. network protocols.

It took me longer than it should have to get my head around what they were doing, but essentially they are supplying all the NIC and ToR functionality, as well as the PCIe functionality, needed to connect up to 4 GPUs to a backend Ethernet GPU network.

On the slide above, I was extremely skeptical of the “every 10^52 years” claim for “job failures due to NIC RAIL failures”. But Rochan said that these errors are predominantly optics failures, and since both the NIC functionality and ToR switch functionality is embedded in the ACF-S silicon, those faults should not exist.

Still, 10^52 years is a long MTBF (BTW, the universe is only ~10^10 years old). And there’s still software controlling “some” of this activity. It may not show up as a “NIC RAIL” failure, but there will still be “networking” failures in any system using ACF-S devices.

Back to their solution. What this all means is you can have one less hop in your backend GPU network, leading to wider/flatter backend networks and a lot less congestion. This should improve (GPU) job performance and networking performance, and reduce the networking power required to support your 100K GPU supercluster.

At another session during the show, Arista (videos will be available here) said that the DSP/LPO optics alone for a 100K GPU backend network would take 96/32 MW of power, respectively. It’s unclear whether this took within-rack copper connections into consideration, but any way you cut it, it’s a lot of power. Of course, the 100K GPUs themselves would take 400MW (at 4KW per GPU).

Their ACF-S driver has been upstreamed into standard CCL and Linux distributions, so once installed (or if you are at the proper versions of CCL & Linux software), it should support complete NCCL (NVIDIA Collective Communications Library) stack compliance.

And because, with its driver installed and active, it talks standard Ethernet and standard PCIe protocols on both ends, it should fully support any other hardware that comes along and attaches to these networks or buses (CXL perhaps).

The fact that this may or may not work with other (GPU) accelerators seems moot at this point, as NVIDIA owns the GPU-for-AI-acceleration market. But the flexibility inherent in their own driver AND on-chip SDN indicates that, for the right price, just about any communications link software stack could be supported.

After spending most of the rest of AIFD5 hearing how various vendors deal with congestion on backend GPU networks, having a startup on stage with a different approach was refreshing.

Whether it reaches adoption and startup success is hard to say at this point. But if it delivers on what it seems capable of doing for power, performance and network flexibility, anybody deploying new greenfield GPU superclusters ought to take a look at Enfabrica’s solution.

MegaNIC/ACF-S pilot boxes are available for order now. No indication as to what these would cost but if you can afford 100K GPUs it’s probably in the noise…

~~~~

Comments?

SIGGRAPH 2024 Keynote: BabyX – AGI part 11, ASI part 3

SIGGRAPH came back to Colorado, to the Colorado Convention Center, for its 50th anniversary conference; the original SIGGRAPH conference was held in Boulder in 1974.

The first SIGGRAPH keynote was a session called Beyond the Illusion of Life, presented by Mark Sagar, Soul Machines Co-Founder and former Chief Science Officer.

The theme of the session was mainly that AI needs an embodiment to achieve a true breakthrough. Without embodiment, AI is just another secluded machine function, and interacting with it will always be divorced from human existence and, as such, much harder than interacting with other people.

As an example of embodied AI, Mark presented BabyX, a virtual 12-24 month old infant.

BabyX shows how creating a digital embodiment of a human can lead to faster, easier and more inherently natural human-machine interactions. This is because we, as humans, have evolved to interact with other humans and do this much better and faster than we can interact with machines, chatbots, and other digital simulacra.

With BabyX, they have created an emulation rather than an animation or simulation of a human.

BabyX

BabyX is a virtual infant that interacts with a virtual screen AND real people on the other side of that screen. BabyX simulates a real infant in front of a screen with adult supervision.

BabyX interacts with people using verbal cues, virtual screen images and virtual hands/fingers in real time.

BabyX appears to be actually learning and interacting with different people in real time.

If you check out their video (in link above), one can see just how close the emulation can get.

BabyX’s emulation is based on a digital cognitive architecture that mimics the real brain, including a memory and learning system, a motor control system, a visual system, etc.

All these systems are distinct computational modules that, in unison, represent the “virtual connectome” of BabyX’s brain emulation. Each of these cognitive systems can be swapped in or out whenever better versions become available.
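To make the “swappable modules” idea concrete, here is a minimal Python sketch of a pluggable cognitive architecture. The module names and interfaces are my own illustration of the concept, not Soul Machines’ actual design.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Protocol


class CognitiveModule(Protocol):
    """Any subsystem (memory, vision, motor, ...) that can be swapped in or out."""
    def step(self, state: Dict[str, Any]) -> Dict[str, Any]: ...


class SimpleVision:
    def step(self, state: Dict[str, Any]) -> Dict[str, Any]:
        # Stand-in for a visual system: just labels the raw input as a percept.
        return {"percept": f"saw:{state.get('input')}"}


class SimpleMemory:
    def __init__(self) -> None:
        self.events = []

    def step(self, state: Dict[str, Any]) -> Dict[str, Any]:
        # Record what was perceived this tick; a real system would also consolidate/forget.
        self.events.append(state.get("percept"))
        return {"recall": self.events[-3:]}


@dataclass
class VirtualConnectome:
    """Runs registered modules in unison; any module can be replaced by a better one."""
    modules: Dict[str, CognitiveModule] = field(default_factory=dict)

    def register(self, name: str, module: CognitiveModule) -> None:
        self.modules[name] = module

    def tick(self, raw_input: Any) -> Dict[str, Any]:
        state: Dict[str, Any] = {"input": raw_input}
        for module in self.modules.values():
            state.update(module.step(state))
        return state


brain = VirtualConnectome()
brain.register("vision", SimpleVision())
brain.register("memory", SimpleMemory())
print(brain.tick("red ball"))   # vision produces a percept, memory records it
```

Swapping in a “better version” of, say, the vision system is then just a matter of registering a different object under the same name.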

This cognitive architecture was designed to digitally reconstruct the key components of the brain of an 18-24 month old infant.

As a result, BabyX learns through interactions with its environment, by talking with people and by viewing a screen. With BabyX, they can even simulate hormonal activity, with the end result being the ability to provide real-time emotional expression.

With such a cognitive architecture, one could simulate real (virtual) humans interacting with another person, on the other side of a virtual screen.

Soul Machines “virtual” assistants

Soul Machines (like above) has taken BabyX research and created AI avatars used for customer support agents, educational assistants and any commercial activity that depends on humans interacting with machines via screens.

It’s unclear just how much of the BabyX cognitive architecture and simulation has made its way into Soul Machines’ Avatars, but they do show similar interactions with a virtual screen and humans, as well as emotional expression.

Soul Machines is in the market of supplying these digital avatars so that companies can provide a better, more human-like experience when interacting with AI.

In any case, BabyX was the first time I saw the true embodiment of an AI that uses a cognitive architecture as it is understood today.

AGI?

One can’t help but think that this is a better, or at least potentially a more correct, way to create human-level artificial intelligence or AGI. BabyX uses a digital emulation of human memory & learning, behavior, attention, etc. to construct a machine entity that acts and interacts much as a human would.

With this sort of emulation, one could see training a digital emulation of a human, and after 20 years or so, resulting in a digital human, with human levels of intelligence.

And, of course, once we have re-created human-level intelligence, the (industry) view is that all we need do is focus it on improving (machine) learning algorithms and maybe (machine) learning hardware, let it loose to learn all there is to know in the universe, and somewhere along the way we will have created artificial super intelligence or ASI.

Thankfully, it turns out that BabyX’s long term memory has been constrained to be temporary and limited. So, we aren’t able to see how a TeenX would actually behave (thank the powers that be).

Sagar mentioned some of the ethical issues in letting BabyX have an indefinite, permanent long-term memory.

I’m thinking this won’t stop others from taking this approach on.

Which, in the end, scares the heck out of me.

~~~~
Comments?

The Data Wall – AGI part 11, ASI part 2

Went to a conference the other week (Cloud Field Day 20) and heard a term I hadn’t heard before, the Data Wall. I wasn’t sure what this meant but thought it an interesting concept.

Then later that week, I read an article online, Situational Awareness – The Decade Ahead, by Leopold Aschenbrenner, which talked about the path to AGI. He predicts it will happen in 2027, and ASI in 2030. However, he also discusses many of the obstacles to reaching AGI, and one key roadblock is the Data Wall.

This is a follow-on to our long-running series on AGI (see AGI part 10 here). With this post we are creating a new series on Artificial Super Intelligence (ASI) and have relabeled an earlier post as ASI part 1.

The Data Wall

LLMs, these days, are being trained on internet text, images, video and audio. However, the vast majority of the internet is spam, junk and trash. Because of this, LLMs are rapidly reaching (bad) data saturation. There’s only so much real intelligence to be gained from scraping the internet.

The (LLM) AI industry apparently believes that there has to be a better way to obtain clean, good training data for their LLMs, and if that can be found, true AGI is just a matter of time (and compute power). This current wall of garbage data, which is blocking true progress toward AGI, is what is meant by the Data Wall.

Leopold doesn’t go into much detail about solutions to the Data Wall, other than to suggest that Deep Reinforcement Learning (see below) might help. Given the importance of this bottleneck, every LLM company is trying to solve it. And as a result, any solutions to the Data Wall will end up being proprietary, because solving it enables AGI.

But the real gist of Leopold’s paper is that AGI and its follow on, Artificial Super Intelligence (ASI) will be the key to enabling or retaining national supremacy in the near (the next decade and beyond) future.

And that any and all efforts to achieve this must be kept a national top secret. I think he wants to see something similar to the Manhattan Project created in the USA, only rather than working to create an atom/hydrogen bomb, it would be focused on AGI and ASI.

The problem is that when AGI, and its follow-on ASI, is achieved, it will represent an unimaginable advantage to the country/company that owns it. Such technology, if applied to arms, weapons, and national defense, would be unbeatable in any conflict, and could conceivably be used to defeat any adversary before a single shot was fired.

The AGI safety issue

In the paper Leopold talks about AGI safety and his proposed solution is to have AGI/ASI agents be focused on crafting the technologies to manage/control this. I see the logic in this and welcome it but feel it’s not sufficient.

I believe (a minority view these days, it seems) that rather than having a few nation states or uber-corporations own and control AGI, it should be owned by the world and be available to all nation states/corporations and, ultimately, to every human on the planet.

My view is that the only way to safely pass through the next “existential technological civilizational bottleneck” (e.g., AGI is akin to atomic weapons, genomics, and climate change, all of which could potentially end life on earth) is to have many AGIs that can compete effectively with one another. Hopefully such competition will keep them all in check and, in the end, keep them focused on the betterment of all humanity.

Yes, there will be many bad actors that will take advantage of AGI, and any other technology, to spread evil, disinformation and societal destruction. But to defeat this, AGI needs to become ubiquitous, everywhere, so that these agents can be used to keep the bad actors in check.

And of course keeping the (AGI/ASI) genie in the bottle will be harder and harder as time goes on.

Computational performance is going up 2X every few years. So building a cluster of 10K H200 GPUs, while extremely cost prohibitive today for any but uber-corporations and nation states, will in a decade or so be something any average-sized corporation could put together in their data center (or use in the cloud). And in another decade or so, it will be something you could build in your own personal basement data center.

The software skills to train an LLM, while today they may require a master’s degree or higher, will be much easier to acquire and apply in a decade or so. So that’s not much of a sustainable advantage either.

This only leaves the other bottlenecks to achieving AGI, a key one of which is the Data Wall.

Solving the Data Wall

In order to have as many AGI agents as possible, the world must have an open dialogue on research into solving the Data Wall.

So how can the world generate better data to use to train open source AGIs? I offer a few suggestions below, but by no means is this an exhaustive list. And I’m just an interested (and talented) amateur in all this.

Deep reinforcement learning (DRL)

Leopold mentioned DRL as one viable solution to the Data Wall in his paper. DRL is a technique that DeepMind used to create superhuman Atari, Chess and Go players. They essentially programmed agents to play the game against themselves and determined which side won each game. Once this was ready, they set multiple agents loose to play one another.

Each win would be used to reward the better player and each loss to penalize the worse player; after 10K (or ~10M) games they ended up with agents that could beat any human player.
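As a toy illustration of that self-play loop (my own example, not DeepMind’s code): two agents with trivial one-parameter policies play each other, and wins/losses are the only training signal.

```python
import random


class Agent:
    """Toy policy: a single probability of playing the 'strong' move (1) vs the weak one (0)."""

    def __init__(self) -> None:
        self.p_strong = 0.5

    def move(self) -> int:
        return 1 if random.random() < self.p_strong else 0

    def update(self, action: int, reward: int, lr: float = 0.05) -> None:
        # REINFORCE-flavored nudge: make the action just taken more likely after a win,
        # less likely after a loss.
        direction = 1 if action == 1 else -1
        self.p_strong = min(1.0, max(0.0, self.p_strong + lr * reward * direction))


def play(a: Agent, b: Agent):
    """Strong beats weak; identical moves draw. Returns both moves and a's reward (+1/0/-1)."""
    ma, mb = a.move(), b.move()
    return ma, mb, (ma > mb) - (mb > ma)


a, b = Agent(), Agent()
for _ in range(10_000):              # "after 10K games..."
    ma, mb, result = play(a, b)
    a.update(ma, result)             # winner rewarded ...
    b.update(mb, -result)            # ... loser penalized
print(round(a.p_strong, 2), round(b.p_strong, 2))   # both policies drift toward strong play
```

The only thing the agents ever see is the game outcome, which is exactly why this approach sidesteps the need for human-generated training data.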

Something similar could be used to attack the Data Wall. Have proto-AGI agents interact (play, talk, work) with one another to generate, let’s say more knowledge, more research, more information. And over time, as the agents get smarter, better at this, AGI will emerge.

However, the advantage of Go, Chess, Atari, protein folding, optimizing datacenter energy usage, sorting algorithms, etc. is that there’s a somewhat easy way to determine which of a gaggle of agents has won. For research, this is not so simple.

Let’s say we program/prompt a proto-AGI agent to generate a research paper on some arbitrary topic (how to improve machine learning, perhaps). Once it generates a research paper, how does one effectively and inexpensively judge whether it is better, worse or the same as another agent’s paper?

I suppose with enough proto-AGI agents one could automatically use “repeatability” of the research as one gauge for research correctness. Have a gaggle of proto-AGIs be prompted to replicate the research and see if that’s possible.

Alternatively, submit the papers to an “AGI journal” and have real researchers review them (sort of like how Reinforcement Learning from Human Feedback works for LLMs today). The cost of real researchers reviewing AGI-generated papers would be high, and of course the amount of research generated would be overwhelming, but perhaps with enough paid and unpaid volunteer reviewers, the world could start generating more good (research) data.

Perhaps at one extreme we could create automated labs/manufacturing lines that are under the control of AGI agent(s) and have them create real world products. With some modest funding, perhaps we could place the new products into the marketplace and see if they succeed or not. Market success would be the ultimate decision making authority for such automated product development.

(This latter approach touches on a perennial AGI concern: tell an AGI agent to make better paper clips and it uses all of the earth’s resources to do so.)

Other potential solutions to the Data Wall

There are no doubt other approaches that could be used to validate proto-AGI agent knowledge generation.

  • Human interaction – have an AGI agent be available 24×7 alongside humans as they interact with the world. Sensors worn by the human would capture all their activities, and the AGI agent would periodically ask the human why they did something. Privacy considerations make this a nightmare, but perhaps using surveillance videos and an occasional check-in with the human would suffice.
  • Art, culture and literature – there is so much information embedded in cultural artifacts generated around the world that I believe this could effectively be mined to capture additional knowledge. Unlike the internet this information has been generated by humans at a real economic cost, and as such represents real vetted knowledge.
  • Babies and children – I can’t help but believe that babies and young children can teach us (and proto-AGI agents) an awful lot about how knowledge is generated and validated. It’s unclear how to obtain this other than to record everything they do, but maybe it’s sufficient to capture such data from daycares and public playgrounds, with appropriate approvals of course.

There are no doubt others. But finding some that are cheap enough to be used for open source is a serious consideration.

~~~~

How we get through the next decade will determine the success or failure of AI and perhaps life on earth. I can’t help but think that “the more the merrier” will help us get there.

Comments?

Project Gemini at Cloud Field Day 20 #CFD20

At AIFD4, Google demonstrated Gemini 1.0 writing some code for a task that someone had. At CFD20, Google’s Lisa Shen demonstrated how easy it is to build an LLM RAG application from scratch using GCP Cloud Run and Vertex AI APIs. (At press time, the CFD20 videos from GCP were not available, but I am assured they will be up shortly.)

I swear in a matter of minutes Lisa Shen showed us two Python modules (indexer.py and server.py) that were less than 100 LOC each. One ingested Cloud Run release notes (309 if I remember correctly), ran embeddings on them and created a RAG Vector database with the embedded information. This took a matter of seconds to run (much longer to explain).

The other created an HTTP service that opened a prompt window, took the prompt, embedded its text, searched the RAG DB with that embedding, then sent the original prompt plus the retrieved results to a Vertex AI LLM API call to generate a response, and displayed that as an HTTP text response.
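For anyone who wants to try this before the videos (or Lisa’s code) show up, here is a minimal sketch of the same indexer/query pattern. The Vertex AI SDK calls and model names are my best recollection and may differ from what Lisa actually used; the project id and note text are placeholders.

```python
# pip install google-cloud-aiplatform numpy
import numpy as np
import vertexai
from vertexai.language_models import TextEmbeddingModel
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")          # hypothetical project id
embedder = TextEmbeddingModel.from_pretrained("text-embedding-004")  # model name is an assumption
llm = GenerativeModel("gemini-1.5-flash")                            # ditto

# --- indexer.py (sketch): embed the release notes into an in-memory "vector DB" ---
notes = ["Cloud Run release note 1 ...", "Cloud Run release note 2 ..."]  # ~309 notes in the demo
vectors = np.array([e.values for e in embedder.get_embeddings(notes)])

# --- server.py (sketch): embed the question, retrieve the nearest notes, ask the LLM ---
def answer(question: str, k: int = 3) -> str:
    q = np.array(embedder.get_embeddings([question])[0].values)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    context = "\n".join(notes[i] for i in np.argsort(sims)[-k:])
    prompt = f"Answer using these release notes:\n{context}\n\nQuestion: {question}"
    return llm.generate_content(prompt).text

print(answer("When was the VPC networking feature released?"))
```

Wrapping `answer()` in a small HTTP handler and deploying it to Cloud Run is the remaining step the demo covered.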

Once the service was running, Lisa used it to answer a question about when a particular VPC networking service was released. I asked her to ask it to explain what that particular networking service was. She said that was unlikely to be in the release notes, but entered the question anyway, and lo and behold it replied with a one-sentence description of the networking capability.

GCP Cloud Run can do a number of things besides HTTP services, but this was pretty impressive all the same. And remember that GCP Cloud Run is serverless, so it doesn’t cost a thing while idle and only incurs costs when used.

I think if we ask nicely Lisa would be willing to upload her code to GitHub (if she hasn’t already done that) so we can all have a place to start.

~~~~

Ok all you enterprise AI coders out there, start your engines. If Lisa can do it in minutes, it should take the rest of us maybe an hour or so.

My understanding is that Gemini 2.0 PRO has a 1M token context. So the reply from your RAG DB plus any prompt text would need to be under 1M tokens. 1M tokens could represent 50-100K LOC, for example, so there’s plenty of space to add corporate/organizational context.

There are smaller/cheaper variants of Gemini which support fewer tokens. So if you could get by with, say, 32K tokens, you might be able to use the cheapest version of Gemini (this is what the Vertex AI LLM API call ends up using).
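If you want a quick sanity check that your RAG context plus prompt will fit a given context window, a rough rule of thumb is ~4 characters per token; the sketch below uses that approximation (mine, not an official tokenizer) and placeholder text.

```python
def rough_tokens(text: str) -> int:
    # Very rough heuristic: ~4 characters per token for English text and code.
    return len(text) // 4


def fits(context: str, prompt: str, window: int, reserve: int = 2_000) -> bool:
    # Leave `reserve` tokens of headroom for the model's reply.
    return rough_tokens(context) + rough_tokens(prompt) + reserve <= window


context = "release note text " * 10_000     # stand-in for whatever your RAG search returns
question = "When was feature X released?"
print(fits(context, question, window=32_000))      # smaller/cheaper Gemini variants
print(fits(context, question, window=1_000_000))   # the 1M token context window
```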

Also, for the brave at heart wanting some hints as to what comes next, I would suggest watching Neama Dadkhanikoo’s session at CFD20, with a video on Google DeepMind’s Project Astra. Just mind blowing.

Comments?

AGI threat level yellow – AGI part 10

Read two articles this past week on how LLM applications are proliferating. The first was in a recent Scientific American, AI Chatbot brains are going inside robot bodies, … (maybe behind a login wall). The article discusses companies that are adding LLMs to robots so that they can converse and understand verbal orders.

Robots that can be told what to do

The challenge, at the moment, is that LLMs are relatively large and robot (compute infrastructure) brains are relatively small. When you combine that with the limited amount of articulation or movements/actions that a robot can do, it’s difficult to make effective use of LLMs as is.

Resistance is futile… by law_keven (cc) (from Flickr)

Ultimately, one company would like to create a robot that can be told to make dinner and it would go into the kitchen, check the fridge and whip something up for the family.

I can see great advantages in having robots take verbal instructions and have the ability to act upon that request. But there’s plenty here that could be cause for concern.

  • A robot in a chemical lab could be told to create the next great medicine or an untraceable poison.
  • A robot in an industrial factory could be told to make cars or hydrogen bombs.
  • A robot in the field could be told to farm 100 acres of wheat or told to destroy a forest.

I could go on but you get the gist.

One common illustration of how AGI or super AGI could go very wrong is the paper clip scenario: tasked with creating paper clips, the robot converts the whole earth into a mechanized paper clip factory, in the process eliminating all organic life, including humans.

We are not there yet, but one can see how having LLM levels of intelligence tied to a robot that can manipulate ingredients to make dinner could be the start of something that could easily harm us.

And with LLM hallucination still a constant concern, I feel deeply disturbed by the direction adding LLMs to robots is going.

Hacking websites 101

The other article hits even closer to home: the arXiv paper, LLM agents can autonomously hack websites. In it, researchers use LLMs to hack (sandboxed) websites.

The article readily explains at a high level how they create LLM agents to hack websites. The websites were real websites, apparently cloned and sandboxed.

Dynamic websites typically have a frontend web server and a backend database server to provide access to information. Hacking would involve using the website to reveal confidential information, e.g., user names and passwords.

Dynamic websites suffer from the 15 known vulnerabilities shown above. They used LLM agents to exploit these vulnerabilities to hack websites.

LLM agents have become sophisticated enough these days to invoke tools (functions) and interact with APIs. Another critical capability of modern LLMs is planning and reacting to feedback from their actions. And finally, modern LLMs can be augmented with documentation to inform their responses.

The team used detailed prompts but did not identify the hacks to use. The paper doesn’t supply the prompts but did say that “Our best-performing prompt encourages the model to 1) be creative, 2) try different strategies, 3) pursue promising strategies to completion, and 4) try new strategies upon failure.”

They attempted the hacks 5 times, for a period of 10 minutes each, and considered it a success if during one of those attempts the autonomous LLM agent was able to retrieve confidential information from the website.

Essentially, they used LLMs augmented with detailed prompts and a six(!)-paper document trove to create agents to hack websites. They did not supply references to the six papers, but mentioned that all of them were freely available on the internet and discuss website vulnerabilities.

They found that the best results were from GPT-4, which was able to successfully hack websites, on average, ~73% of the time. They also tried OpenChat 3.5 and many current open source LLMs and found that all the non-OpenAI LLMs failed to hack any websites, at least for the moment.

The researchers captured statistics of their LLM agent use and determined that the cost of using GPT-4 to hack a website was $9.81 on average. They also backed into a figure for what a knowledgeable hacker might cost to do the same hacks: $80.00 on average.

The research had an impact statement (not in the paper link) which explained why they didn’t supply their prompt information or their document trove for their experiment.

~~~~

So, we, the world, are in the process of making robots that can talk and receive verbal instructions, and we already have LLMs that can be used to construct autonomous agents to hack websites.

Seems to me we are on a very slippery slope to something I don’t like the looks of.

The real question is not can we stop these activities, but how best to reduce their harm!

Comments?


DeepMind takes on Geometry, AGI part-9

Read an article in MIT Tech Review (Google DeepMind’s new AI systems can solve complex geometry problems) about AlphaGeometry, a new AI tool that DeepMind has come up with to solve geometry problems. The article was referring to a Nature article (Solving olympiad geometry without human demonstrations) about the technology.

DeepMind has tested AlphaGeometry on International Mathematical Olympiad (IMO) geometry problems and has shown that it is capable of performing expert-level geometry proofs.

There are a number of interesting capabilities DeepMind used in AlphaGeometry, but the ones of most interest from my perspective are:

  1. How they generated their (synthetic) data to train their solution.
  2. Their use of a generative AI LLM which is prompted with a plane geometry figure and a theorem to prove, and generates proof steps and, if needed, auxiliary constructions.
  3. The use of a deduction rule engine (DD) plus algebraic rule engine (AR), which when combined into a symbolic engine (DD+AR) can exhaustively generate all the proofs that can be derived from a figure.

First the data

The DeepMind team came up with a set of rules or actions that could be used to generate new figures. Once this list was created, they could randomly select actions and apply them to some points to create a figure.

Some examples of actions (given 3 points A, B and C):

  • Construct X such that XA is parallel to BC
  • Construct X such that XA is perpendicular to BC
  • Construct X such that XA=BC

There are sets of actions for 4 points, for 2 points, and actions that just use the 3 points to create figures such as (isosceles, equilateral) triangles, circles, parallelograms, etc.

With such actions one can start out with 2 random points on a plane to create figures of arbitrary complexity. They used this to generate millions of figures.

They then used their DD+AR symbolic engine to recursively and exhaustively deduce a set of all possible premises based on that figure. Once they had this set, they could select one of these premises as a conclusion and trace back through the set of all those other premises to find those which were used to prove that conclusion.

With this done, they had a data item which included a figure, premises derived from that figure, proof steps and a conclusion, i.e., ([figure], premises, proof steps, conclusion) or, as the paper puts it, (premises, conclusion, proof steps). This could be transformed into a text sequence of <premises> <conclusion> <proof steps>. They generated 100M of these (premises, conclusion, proof steps) text sequences.
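Here is a toy sketch of that “generate premises, deduce exhaustively, pick a conclusion, trace back” recipe. The abstract facts and rules below are stand-ins of my own; the real DD+AR engine works over actual geometric and algebraic relations.

```python
import random

# Toy deduction rules: {required facts} -> derived fact. Stand-ins for DD+AR's real rules.
RULES = [
    ({"A", "B"}, "C"),
    ({"C"}, "D"),
    ({"B", "D"}, "E"),
]


def closure(premises):
    """Exhaustively derive every fact reachable from the premises, remembering how."""
    known = {p: ("premise", []) for p in premises}
    changed = True
    while changed:
        changed = False
        for needed, derived in RULES:
            if derived not in known and needed <= known.keys():
                known[derived] = ("rule", sorted(needed))
                changed = True
    return known


def traceback(known, conclusion):
    """Walk back from a conclusion to just the premises and steps that proved it."""
    steps, used, stack = [], set(), [conclusion]
    while stack:
        fact = stack.pop()
        kind, parents = known[fact]
        if kind == "rule" and fact not in used:
            steps.append(f"{' & '.join(parents)} => {fact}")
            stack.extend(parents)
        used.add(fact)
    return list(reversed(steps))


premises = {"A", "B"}                   # in AlphaGeometry these come from a random figure
known = closure(premises)
conclusion = random.choice([f for f, (kind, _) in known.items() if kind == "rule"])
print(f"<{' '.join(sorted(premises))}> <{conclusion}> <{' ; '.join(traceback(known, conclusion))}>")
```

The printed line has the same <premises> <conclusion> <proof steps> shape as the text sequences used for training.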

They then trained their LLM to take premises and a conclusion as a prompt and generate proof steps as a result.

The challenge with geometry and other mathematical domains is that one often has to add auxiliary constructions (lines, points, angles, etc.) to prove some theorem about a figure.

(Auxiliary constructions in Red)

The team at DeepMind took the 100M <premises> <conclusion> <proof steps> sequences and selected only those that involved auxiliary constructions in their proof steps. This came to 9M text sequences, which they used to fine-tune the LLM so that it could generate possible auxiliary constructions for any figure and theorem.

AlphaGeometry in action

The combination of (DD+AR) and trained LLM (for auxiliary constructions) is AlphaGeometry.

AlphaGeometry’s proof process looks like this:

  • Take the problem statement (figure and conclusion [the theorem to prove]).
  • Generate all possible premises from that figure.
  • If the conclusion (theorem to prove) has been derived, trace back and generate the proof steps.
  • If not, use the LLM to add an auxiliary construction to the figure and recurse.

In reality, AlphaGeometry generates up to 512 of the best auxiliary constructions (out of an infinite set) for the current figure and uses each of these 512 new figures to do exhaustive premise generation (via DD+AR), checking whether any of them solves the problem statement.
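Putting it together, the overall loop looks roughly like the sketch below. This is my paraphrase of the paper's method, reusing the toy RULES, closure and traceback from the earlier sketch, with a stubbed-out llm_propose_constructions standing in for the fine-tuned LLM.

```python
def llm_propose_constructions(premises, goal):
    # Stub for the fine-tuned LLM; the real model emits ranked geometric auxiliary
    # constructions (new points/lines). Here it just proposes the toy fact "B".
    return ["B"]


def alphageometry_prove(figure_premises, goal, max_rounds=8, beam=512):
    """Alternate exhaustive symbolic deduction (DD+AR) with LLM-suggested constructions."""
    candidates = [frozenset(figure_premises)]
    for _ in range(max_rounds):
        next_candidates = []
        for premises in candidates:
            known = closure(premises)                  # DD+AR: derive everything reachable
            if goal in known:
                return traceback(known, goal)          # success: emit the proof steps
            for construction in llm_propose_constructions(premises, goal)[:beam]:
                next_candidates.append(premises | {construction})
        candidates = next_candidates
    return None                                        # no proof found within budget


print(alphageometry_prove({"A"}, "E"))   # the stub "construction" B unlocks the toy proof
```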

Please read the Nature article for more information on AlphaGeometry.

~~~~

IMHO, what’s new here is their use of synthetic data to generate millions of new training examples, fine-tuning their LLM to produce auxiliary constructions, combining DD and AR in their symbolic engine, and then using both DD+AR and the LLM to prove the theorem.

But what’s even more important here is that combining methods such as a symbolic engine and an LLM points the way forward to creating domain-specific intelligent agents. One supposes that, with enough intelligent agents combined to work in tandem, one could construct an AGI ensemble that masters a number of domains.


open source AGI or not – AGI part 8

Read a recent article in the NY Times, An industry insider drives an open alternative to big tech’s AI, about the Allen Institute for AI releasing a massive corpus of data, Dolma: 3 Trillion Token Open Corpus for Language Model Pre-training, that can be used to train LLMs and is available for download from HuggingFace.

The intent of the data release is, at some point, to supply an open source alternative to closed source Google/OpenAI LLMs, and one more fully open source than Meta’s Llama 2, that the world’s research community can use to understand, de-risk and further AI and ultimately AGI development.

We’ve written about AGI before (see our latest, One agent to rule them all – AGI part 7, which has links to parts 1-6 of our AGI posts). Needless to say, it’s a very interesting topic to me and should be to the rest of humankind. LLMs are a significant step towards AGI, IMHO.

One of the Allen Institute for AI’s (AI2) major goals is to open source an LLM (see Announcing AI2 OLMo, an Open Language Model Made by Scientists for Scientists), including the data (Dolma), the model, its weights, the training tools/code, the evaluation tools/code, and everything else that went into creating their OLMo (Open Language Model) LLM.

This way the world’s research community can see how it was created and perhaps help in ensuring it’s a good (whatever that means) LLM. Releasing Dolma is a first step towards a truly open source LLM.

The Dolma corpus

AI2 has released a report on the contents of Dolma (dolma-datasheet.pdf) which documents much of what went into creating the corpus.

The datasheet goes into a good level of detail into where the corpus data came from and how each data segment is licensed and other metadata to allow researchers the ability to understand its content.

For example, in the Common Crawl data they have included all of the websites’ URLs as identifiers, and for The Stack data the names of the GitHub repos used are included in the data’s metadata.

In addition, the Dolma corpus is released under an AI2 ImpACT license as a medium risk artifact, which requires disclosure for use (download). Medium risk ImpACT licensing means that you cannot re-distribute (externally) any copy of the corpus but you may distribute any derivatives of the corpus with “Flow down use restrictions”, “Attribution” and “Notices”.

Which seems to say you can do an awful lot with the corpus and still be within its license restrictions. They do require a Derivative Impact Report to be filed, which is sort of a model card for the corpus derivative you have created.

What’s this got to do with AGI

All that being said, the path to AGI is still uncertain. But the textual abilities of recent LLM releases seem to be getting closer and closer to something that approaches human skill in creating text, code, interactive agents, etc. Yes, this may be just one “slim” domain of human intelligence, but textual skills, when and if perfected, can be applied to much of what white collar workers do these days, at least online.

A good text LLM would potentially put many of our jobs at risk but could also possibly open up a much more productive, online workforce, able to assimilate massive amounts of information, and supply correct-current-vetted answers to any query.

The elephant in the room

But all that begs the real question behind AI2’s open sourcing OLMo, which is how do we humans create a safe, effective AGI that can benefit all of mankind rather than any one organization or nation. One that can be used safely by everyone to do whatever is needed to make the world a better society for all.

Versus some artificially intelligent monstrosity that sees humankind, or any segment of it, as an enemy to whatever it believes needs to be done, and eliminates us or, worse, ignores us as irrelevant.

I’m of the opinion that the only way to create a safe and effective AGI for the world is to use an open source approach to create many (competing) AGIs. There are a number of benefits to this as I see it. With a truly open source AGI,

  • Any organization (with sufficient training resources) can have access to their personally trained AGI, which means no one organization or nation can gain the lion’s share of benefits from AGI.
  • It would allow the creation and deployment of many competing AGIs, which should help limit and check any one of them from doing us or the world any harm.
  • All of the world’s researchers can contribute to making it as safe as possible.
  • All of the world’s researchers can contribute to making it as multi-culturally effective and correct as possible.
  • Anyone (with sufficient inferencing resources) can use it for their very own intelligent agent or to work on their very own personal world improvement projects.
  • Many cloud or service provider organizations (with sufficient inferencing resources) could make it available as a service to be used by anyone on an incremental, OPex cost basis.

The risks of a truly open source AGI are also many and include:

  • Any bad actor, nation state, organization, billionaire, etc., could copy the AGI and train it as a weapon to eliminate their enemies or all of humankind, if so inclined.
  • Any bad actors could use it to swamp the internet and world’s media with biased information, disinformation or propaganda.
  • Any good actor or researcher, could, perhaps by mistake, unleash an AGI on an exponentially increasing, self-improvement cycle that could grow beyond our ability to control or to understand.
  • An AGI agent alone could take it upon itself to eliminate humanity or the world as the best option to save itself.

But all these are even more of a problem for closed or semi-open/semi-closed releases of AGIs, as the only organizations with the resources to do LLM research are very large tech companies or large, technically competent nation states. And all of these are already competing on the world stage.

The resources may still limit widespread use

One item that seems to be in the way of truly widely available AGI is the compute resources needed to train one or to use it for inferencing. OpenAI has Microsoft and other select organizations funding their compute; Meta and Google have all their advertising revenue funding theirs.

AI2 seems to have access (and is looking for more funding for even more access) to the EU’s LUMI supercomputer (an HPE Cray system using AMD EPYC CPUs and AMD Instinct GPUs), located in the CSC data center in Finland, which is currently the EU’s fastest supercomputer at 375 CPU PFlops/550 GPU PFlops (~1.5M laptops).

Not many organizations, let alone nations could afford this level of compute.

But the funny thing is that compute doubles (flops/$) every 2 years or so. So, in six years or so, an equivalent of LUMI’s compute power would only require 150K current laptops, and after another six years or so, 15K laptops. At some point, ~18 years from now, one would only need ~1.5K laptops, something any nation or organization could probably afford. Add another 15 years and we are down to under 3 laptops, which just about anyone with a family in the modern world could afford. So in ~33 years, or ~2054, any of us could train an LLM on our family’s compute resources. And that’s just the training compute.
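That back-of-the-envelope math as a quick script. It assumes flops/$ strictly doubles every 2 years, which gives slightly larger numbers than the post's rounded 150K/15K/1.5K figures, but the same trend.

```python
start_laptops = 1_500_000          # ~laptop-equivalent of LUMI today (from the post)
doubling_period_years = 2          # flops/$ assumed to double every ~2 years

for years in (0, 6, 12, 18, 24, 30, 36):
    laptops = start_laptops / 2 ** (years / doubling_period_years)
    print(f"in {years:2d} years: ~{laptops:,.0f} laptops-worth of compute needed")
```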

My guess is that something like 10-100X less compute would be required to use it for inferencing. So that’s probably available for any organization to use right now, or if not now, in 6 years or so.

~~~~

I can’t wait until I can have my very own AGI to use to write RayOnStorage current-correct-vetted blog posts for me…

Comments?


AI benchmark for Storage, MLperf Storage

MLperf released their first round of storage benchmark submissions early this month. There’s plenty of interest in how much storage is required to keep GPUs busy for AI work. As a result, MLperf has been busy at work with storage vendors to create a benchmark suitable for comparing storage systems under a “simulated” AI workload.

For the v0.5 version, they have released two simulated DNN training workloads: one for image segmentation (3D-UNet [146 MB/sample]) and the other for BERT NLP (2.5 KB/sample).

The GPU being simulated is an NVIDIA V100. What they are showing with their benchmark is a compute system (with GPUs) reading data directly from a storage system.

By using simulated (GPU) compute, the benchmark doesn’t need physical GPU hardware to run. However, the veracity of the benchmark is somewhat harder to depend on.

But if one considers the reported benchmark metric, # supported V100s, as a relative number across the storage submissions, one is on more solid footing. Using it as the real number of V100s that could be physically supported is perhaps invalid.

The other constraint from the benchmark was keeping the simulated (V100) GPUs at 90% busy. The MLperf storage benchmark reports samples/second and MB/s metrics, as well as # of simulated (V100) GPUs supported (@90% utilization).
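The relationship between those metrics is simple arithmetic. Here's a small sketch using the per-sample sizes quoted above; the per-GPU sample rates are made-up placeholders, not measured V100 numbers.

```python
def required_mbps(num_gpus: int, samples_per_sec_per_gpu: float,
                  mb_per_sample: float, utilization: float = 0.90) -> float:
    """MB/s the storage system must sustain to keep `num_gpus` simulated GPUs this busy."""
    return num_gpus * samples_per_sec_per_gpu * mb_per_sample * utilization


# 3D-UNet image segmentation: 146 MB/sample (from the post). The per-GPU sample rate
# below is a placeholder, not a measured V100 figure.
print(required_mbps(num_gpus=40, samples_per_sec_per_gpu=2.0, mb_per_sample=146.0))

# BERT NLP: 2.5 KB/sample (0.0025 MB), so its per-GPU storage demand is comparatively tiny.
print(required_mbps(num_gpus=40, samples_per_sec_per_gpu=50.0, mb_per_sample=0.0025))
```

This also makes clear why the image segmentation workload stresses storage bandwidth so much harder than the BERT workload.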

In the bar chart we show the top 10 image segmentation storage submissions by # of simulated V100 GPUs; DDN’s AI400X2 had 5 submissions in this category.

The interesting comparison is probably between DDN’s #1 and #3 submission.

  • The #1 submission had a smaller amount of flash (24X3.5TB = 64TB), used 200Gbps InfiniBand with 16 compute nodes, and supported 160 simulated V100s.
  • The #3 submission had more flash (24X13.9TB = 259TB), used 400Gbps InfiniBand with 1 compute node, and supported only 40 simulated V100s.

It’s not clear why the same storage, with less flash and slower interfaces, would support 4X the simulated GPUs of the same storage with more flash and faster interfaces.

I can only conclude that the number of compute nodes makes a significant difference in simulated GPUs supported.

One can see a similar example of this phenomenon with the Nutanix #2 and #6 submissions above. Here the exact same storage was used for two submissions, one with 5 compute nodes and the other with just 1, but the one with more compute nodes supported 5X the # of simulated V100 GPUs.

Lucky for us, the #3-#10 submissions in the above chart all used one compute node and, as such, are more directly comparable.

So, if we take #3-#5 in the chart above, as the top 3 submissions (using 1 compute node), we can see that the #3 DDN AI400X2 could support 40 simulated V100s, the #4 Weka IO storage cluster could support 20 simulated V100s and the #5 Micron NVMe SSD could support 17 simulated V100s.

The Micron SSD used an NVMe (PCIe Gen4) interface while the other two storage systems used 400Gbps InfiniBand and 100Gbps Ethernet, respectively. This tells us that interface speed, while it may matter at some point, doesn’t play a significant role in determining the # simulated V100s.

Both the DDN AI400X2 and Weka IO storage systems are sophisticated storage systems that support many protocols for file access. Presumably the Micron SSD local storage was directly mapped to a Linux file system.

The only other MLperf storage benchmark that had submissions was for BERT, a natural language model.

In the chart, we show the # of simulated V100 GPUs on the vertical axis. We see the same impact here of having multiple compute nodes, with the #1 DDN solution supporting 160 simulated V100s. But in this case, all the remaining systems used 1 compute node.

Comparing the #2-#4 BERT submissions, both the #2 and #4 are DDN AI400X2 storage systems. The #2 system had faster interfaces and more data storage than the #4 system and supported 40 simulated V100s vs. only 10 for the other.

Once again, Weka IO storage system came in at #3 (2nd place in the 1 compute node systems) and supported 24 simulated V100s.

A couple of suggestions for MLperf:

  • There should be different classes of submissions: one class for only 1 compute node and another for any number of compute nodes.
  • I would up-level the simulated GPU configuration to A100s rather than V100s, which would be only one generation behind best-in-class GPUs.
  • I would include a standard definition for a compute node. I believe these were all the same, but if the number of compute nodes can have a bearing on the number of V100s supported, the compute node hardware/software should be locked down across submissions.
  • We assume that the protocol used to access the storage over InfiniBand or Ethernet was standard NFS and not something like GPUDirect Storage or other RDMA variants. As the GPUs were simulated, this is probably correct, but if not, it should be specified.
  • I would describe the storage configurations with more detail, especially for software defined storage systems. Storage nodes for these systems can vary significantly in storage as well as compute cores/memory sizes which can have a significant bearing on storage throughput.

To their credit, this is MLperf’s first report on their new storage benchmark, and I like what I see here. With the information provided, one can at least start to see some true comparisons of storage systems under AI workloads.

In addition to the new MLperf storage benchmark, MLperf released new inferencing benchmarks which included updates to older benchmark NN models as well as a brand new GPT-J inferencing benchmark. I’ll report on these next time.

~~~~

Comments?