I was at a conference last year where one of the speakers had worked at NASA for years and was currently at MIT. She talked at length about some of the earth and space science that NASA has enabled over the years. Despite massive cost overruns, years-long schedule delays and other mishaps, NASA has ultimately come through with groundbreaking science.
At the end of her presentation I asked what data gaps existed today in space and earth sensing. She mentioned real time methane tracking (presumably from space) and battery-less ocean sensing.
Methane tracking from the Tanager-1 JPL/NASA satellite
Methane tracking I could understand but battery-less ocean sensing was harder to get a handle on.
The US Navy and other oceanographic organizations have deployed numerous sensing devices over the years, some of them in flotillas that traveled across the Gulf and Atlantic oceans gathering data.
But these were battery-supported and solar-powered, and limited to ~1 year of service, after which they were scuttled to the bottom of the ocean.
I take the thought to be that a battery-less ocean sensing platform could provide more of an ongoing, permanent sensor platform, one that could be deployed and potentially remain in service for years at a time, with little to no maintenance.
The pivot
So as a stepping stone to Silverton Space cubesat operations, I’m thinking that going after a more permanent ocean sensing platform would be a valuable first step. And it’s quite possible that anything we do in LEO with Silverton Space platforms could complement any ocean-going sensor activity.
One reason to pivot to ocean sensing is that it’s much, much cheaper to launch a flotilla of ocean-going sensing buoys from a boat off a coast than it is to launch a handful of cubesats into LEO (at ~$70K each).
Cubesats fail at a high rate
Moreover, the litany of small satellite failures is long, highly varied and chronic. Essentially anything that could go wrong, often does, at least for the first dozen or so satellites you deploy.
Cubesats with limited functionality, or that fail in orbit or at launch, become just more trash orbiting in LEO. And the only way to diagnose what went wrong is elaborate, extensive, transmitted/received telemetry.
So another reason to start with ocean-going sensors is that there’s a distinct possibility of retrieving a malfunctioning ocean-going sensor buoy after deployment. And with the sensor buoy in hand, diagnosing what went wrong should be a snap. This doesn’t eliminate the need for elaborate, extensive, transmitted/received telemetry, but you are no longer entirely dependent on it.
And even if, at end of life, they can’t be salvaged/refurbished or scuttled, the worst case is that our ocean sensing buoys would end up being part of some ocean/gulf garbage patch, and hopefully would get picked up and disposed of as part of oceanic garbage collection.
~~~~
So for the foreseeable future, Silverton Space will focus on ocean-going sensor buoys. It’s unlikely that our first iterations will be completely battery-less, but at some point down the line we hope to produce a version that can be on station for years at a time and provide valuable ocean sensing data to the scientific community.
The main question left is: what sorts of ongoing ocean sensor information might be most valuable to supply to the world’s scientific community?
I attended AI Field Day 5 (AIFD5) last week, and there were networking vendors there discussing how their systems deal with backend GPU network congestion issues. Most of these were traditional vendor congestion solutions.
However, one vendor, Enfabrica, (videos of their session will be available here) seemed to be going down a different path, which involved a new ASIC design destined to resolve all the congestion, power, and performance problems inherent in current backend GPU Ethernet networks.
In essence, Enfabrica’s Super or MegaNIC (they used both terms during their session) combines PCIe lane switching, Ethernet networking, and ToR routing with SDN (software defined networking) programmability to connect GPUs directly to a gang of Ethernet links. This allows it to replace multiple (standard/RDMA/RoCEv2) NIC cards with one MegaNIC using their ACF-S (Advanced Compute Fabric SuperNIC) ASIC.
Their first chip, codenamed “Millennium”, supports 8Tbps of bandwidth.
Their ACF-S chip provides all the bandwidth needed to connect up to 4 GPUs to 32/16/8/4 links running at 100/200/400/800Gbps, respectively. And because the ACF-S chip controls and drives all of these network connections, it can better understand and deal with congestion issues in backend GPU networks. It is also PCIe 5/6 compliant, supporting 128-160 lanes.
Further, it has onboard ARM processing to handle its SDN operations, onboard hardware engines to accelerate networking protocol activity and network and PCIe switching hardware to support directly connecting GPUs to Ethernet links.
With its SDN, it supports current RoCE, RDMA over TCP, UEC direct, etc. network protocols.
It took me longer than it should have to get my head around what they were doing, but essentially they are supporting all the NIC-ToR functionality, as well as the PCIe functionality, needed to connect up to 4 GPUs to a backend Ethernet GPU network.
On the slide above, I was extremely skeptical of the “every 10^52 years” figure for job failures due to NIC RAIL failures. But Rochan said that these errors are predominantly optics failures, and as both the NIC functionality and the ToR switch functionality are embedded in the ACF-S silicon, those faults should not exist.
Still, 10^52 years is a long MTBF (BTW, the universe is only ~10^10 years old). And there’s still software controlling “some” of this activity. It may not show up as a “NIC RAIL” failure, but there will still be “networking” failures in any system using ACF-S devices.
Back to their solution. What this all means is that you can have one less hop in your backend GPU network, leading to wider/flatter backend networks and a lot less congestion. This should help improve (GPU) job performance and networking performance, and reduce the networking power required to support your 100K GPU supercluster.
At another session during the show, Arista (videos will be available here) said that the DSP/LPO optics alone for a 100K GPU backend network would take 96/32 MW of power, respectively. It’s unclear whether this took into consideration within-rack copper connections. But any way you cut it, it’s a lot of power. Of course, the 100K GPUs alone would take 400MW (at 4KW per GPU).
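Just to put those numbers side by side, here is the back-of-envelope arithmetic (my own arithmetic, using the per-GPU and optics figures quoted above):

```python
# Rough power budget for a 100K GPU backend network (my arithmetic,
# using the per-GPU and optics figures quoted in the sessions above).
gpus = 100_000
gpu_power_kw = 4.0                       # ~4 KW per GPU, as cited above
gpu_power_mw = gpus * gpu_power_kw / 1000
optics_mw_dsp = 96                       # DSP-based optics estimate from the session
optics_mw_lpo = 32                       # LPO optics estimate from the session

print(f"GPUs alone:          {gpu_power_mw:,.0f} MW")
print(f"DSP optics overhead: {optics_mw_dsp} MW (~{optics_mw_dsp/gpu_power_mw:.0%} of GPU power)")
print(f"LPO optics overhead: {optics_mw_lpo} MW (~{optics_mw_lpo/gpu_power_mw:.0%} of GPU power)")
```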
Their ACF-S driver has been upstreamed into standard CCL and Linux distributions, so once installed (or if you are at the proper versions of CCL & Linux software), it should support complete NCCL (NVIDIA Collective Communications Library) stack compliance.
And because, with its driver installed and active, it talks standard Ethernet and standard PCIe protocols on both ends, it should fully support any other hardware that comes along attaching to these networks or busses (CXL perhaps).
The fact that this may or may not work with other (GPU) accelerators seems moot at this point, as NVIDIA owns the GPU-for-AI-acceleration market. But the flexibility inherent in their own driver AND on-chip SDN indicates that, for the right price, just about any communications link software stack could be supported.
After spending most of the rest of AIFD5 discussing how various vendors deal with congestion on backend GPU networks, having a startup on the stage with a different approach was refreshing.
Whether it reaches adoption and startup success is hard to say at this point. But if it delivers on what it seems capable of doing for power, performance and network flexibility, anybody deploying new greenfield GPU superclusters ought to take a look at Enfabrica’s solution.
MegaNIC/ACF-S pilot boxes are available for order now. No indication as to what these would cost but if you can afford 100K GPUs it’s probably in the noise…
At AIFD4, Google demonstrated Gemini 1.0 writing some code for a task that someone had. At CFD20, Google’s Lisa Shen demonstrated how easy it is to build an LLM-RAG app from scratch using GCP Cloud Run and VertexAI APIs. (At press time, the CFD20 videos from GCP were not available, but I am assured they will be up there shortly.)
I swear, in a matter of minutes, Lisa Shen showed us two Python modules (indexer.py and server.py) that were less than 100 LOC each. One ingested Cloud Run release notes (309 of them, if I remember correctly), ran embeddings on them and created a RAG vector database with the embedded information. This took a matter of seconds to run (and much longer to explain).
The other stood up an HTTP service that opened a prompt window, took the prompt, embedded the text, searched the RAG DB with that embedding, then sent the original prompt plus the retrieved results to a VertexAI LLM API call to generate a response, which it displayed as an HTTP text response.
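I don’t have Lisa’s code, but the shape of the two modules is easy to sketch. Below, embed_text() and generate() are hypothetical stand-ins for the VertexAI embedding and LLM API calls she used, and retrieval is just brute-force cosine similarity; this is a sketch of the pattern, not her implementation.

```python
# Minimal sketch of the indexer/server RAG flow described above (not Lisa's code).
# embed_text() and generate() are stand-ins; swap in real Vertex AI calls in practice.
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Stand-in embedding: a toy bag-of-characters vector (replace with a real model)."""
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1
    return v / (np.linalg.norm(v) + 1e-9)

def generate(prompt: str) -> str:
    """Stand-in LLM call (replace with a real Vertex AI / Gemini API call)."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

# --- indexer.py equivalent: embed each release note into an in-memory vector "DB"
def build_index(release_notes: list[str]) -> np.ndarray:
    return np.vstack([embed_text(note) for note in release_notes])

# --- server.py equivalent: embed the question, retrieve top-k notes, ask the LLM
def answer(question: str, notes: list[str], vectors: np.ndarray, k: int = 5) -> str:
    q = embed_text(question)
    sims = vectors @ q                     # cosine similarity (vectors are normalized)
    context = "\n".join(notes[i] for i in np.argsort(sims)[-k:][::-1])
    prompt = f"Answer using these release notes:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

notes = ["v1.2: Added VPC egress settings.", "v1.3: Cloud Run jobs GA."]
print(answer("When did VPC egress settings ship?", notes, build_index(notes), k=1))
```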
Once the service was running, Lisa used it to answer a question about when a particular VPC networking service was released. I asked her to ask it to explain what that particular networking service was. She said that it was unlikely to be in the release notes, but entered the question anyway, and lo and behold it replied with a one-sentence description of the networking capability.
GCP Cloud Run can do a number of things besides HTTP services, but this was pretty impressive all the same. And remember that GCP Cloud Run is serverless, so it doesn’t cost a thing while idle and only incurs costs when used.
I think if we ask nicely Lisa would be willing to upload her code to GitHub (if she hasn’t already done that) so we can all have a place to start.
~~~~
Ok all you enterprise AI coders out there, start your engines. If Lisa can do it in minutes, it should take the rest of us maybe an hour or so.
My understanding is that Gemini 2.0 Pro has a 1M token context window. So the reply from your RAG DB plus any prompt text would need to be under 1M tokens. 1M tokens could represent 50-100K LOC, for example, so there’s plenty of space to add corporate/organizational context.
There are smaller/cheaper variants of Gemini which support fewer tokens. So if you could get by with, say, 32K tokens, you might be able to use the cheapest version of Gemini (this is what the VertexAI LLM API call ends up using).
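A quick sanity check on that 50-100K LOC figure, assuming roughly 4 characters per token and 40-80 characters per line of code (my rough assumptions, not Google’s numbers):

```python
# Rough token-budget arithmetic (assumptions: ~4 chars/token, 40-80 chars/LOC).
context_tokens = 1_000_000
chars_per_token = 4
chars_per_loc_low, chars_per_loc_high = 40, 80

loc_high = context_tokens * chars_per_token // chars_per_loc_low   # ~100K LOC
loc_low = context_tokens * chars_per_token // chars_per_loc_high   # ~50K LOC
print(f"~{loc_low:,} to {loc_high:,} LOC fit in a 1M token context")
```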
Also, for the brave at heart wanting some hints as to what comes next, I would suggest watching Neama Dadkhanikoo’s session at CFD20 with a video on Google DeepMind’s Project Astra. Just mind blowing.
Google DeepMind’s AlphaDev is a derivative of AlphaZero (a follow-on from AlphaMu and AlphaGo, the conqueror of Go and other strategy games). AlphaDev uses Deep Reinforcement Learning (DRL) to come up with new computer science algorithms. In the first incarnation, that meant a way to sort 2, 3, 4 or 5 integers using X86 instructions.
Sorting has been well explored over the years in computer science (CS, e.g. see Donald E. Knuth’s Volume 3 in The Art of Computer Programming, Sorting and Searching), so when a new more efficient/faster sort algorithm comes out it’s a big deal. Google used to ask job applicants how they would code sort algorithms for specific problems. Successful candidates would intrinsically know all the basic CS sorting algorithms and which one would work best in different circumstances.
Deepmind’s approach to sort
Reading the TNW news article, I couldn’t conceive of the action space involved in the reinforcement learning let alone what the state space would look like. However, as I read the Nature article, DeepMind researchers did a decent job of explaining their DRL approach to developing new basic CS algorithms like sorting.
AlphaDev uses a transformer-like framework and a very limited set of (sort of encapsulated) x86 instructions with memory/register files, and it is limited to sorting 2, 3, 4, or 5 integers. Such functionality is at the heart of any sort algorithm and, as such, is used a gazillion times over in any sorting task involving a long string of items. I think AlphaDev used a form of on-policy RL, but I can’t be sure.
Looking at the X86 basic instruction cheat sheet, there are over 30 basic forms of X86 instructions, which are then multiplied by the type of data being manipulated (registers, memory, constants, etc., and the length of the operands).
AlphaDev only used 4 (OK, 9 if you include the conditionals for conditional move and conditional jump) X86 instructions. The instructions were mov<A,B>, cmovX<A,B>, cmp<A,B> and jX<A,B> (where X identifies the condition under which a conditional move [cmovX] or jump [jX] would take place). And they only used (full, 64-bit) integers in registers and memory locations.
AlphaDev actions
The types of actions that AlphaDev could take included the following:
Add transformation – which added an instruction to the end of the current program
Swap transformation – which swapped two instructions in the current program
Opcode transformation – which changed the opcode (e.g., instruction such as mov to cmp) of a step in the current program
Operand transformation – which changed the operand(s) for an instruction in the current program
Instruction transformation – which changed the opcode and operand(s) for some instruction in the current program.
In their paper, they list a correctness cost function which, at each transformation, provides a value (I think) for the RL policy. They experimented with 3 different functions: 1) the % of correctly placed items; 2) square_root(% correctly placed); and 3) square_root(number of items – number correctly placed). They discovered that the last worked best.
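As I read it, the three candidate correctness functions look something like the sketch below (my reading of the paper, not DeepMind’s code):

```python
# Sketch of the three correctness variants described above
# (my reading of the paper, not DeepMind's code).
import math

def correctly_placed(output: list[int]) -> int:
    """Count items already sitting where a fully sorted output would put them."""
    return sum(o == s for o, s in zip(output, sorted(output)))

def variant_percent(output):          # 1) % of correctly placed items
    return correctly_placed(output) / len(output)

def variant_sqrt_percent(output):     # 2) sqrt(% correctly placed)
    return math.sqrt(correctly_placed(output) / len(output))

def variant_sqrt_misplaced(output):   # 3) sqrt(items - correctly placed); worked best
    return math.sqrt(len(output) - correctly_placed(output))

print(variant_percent([1, 3, 2]), variant_sqrt_misplaced([1, 3, 2]))
```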
They also placed some constraints on the code generated (called action pruning rules):
Memory locations are always read in incremental order
Registers are allocated in incremental order
Program cannot compare or conditionally move to memory location
Program can only read and write to each memory location once (it seems this would tell the RL algorithm when to end the program)
Program can not perform two consecutive compare instructions
AlphaDev states
How they determined the state of the program at each transformation was also different. They used one-hot encodings (essentially, a bit in a bit map is assigned to every instruction-operand pair) for the opcode-operand steps in the current program and appended each encoded step into a single program string. Ditto for the state of the memory and registers (at each instruction, presumably?). Both the instruction list and the memory-register embeddings were then fed into a state representation encoder.
This state “representation network” (DNN) generated a “latent representation of the State(t)” (maybe it classified the state into one of N classes). For each latent state (classification), there is another “prediction network” (DNN) that predicts the expected return value (presumably trained on the correctness cost function above) for each state-action. And between the state and the expected return values, AlphaDev created an (RL) policy to select the next action to perform.
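A toy version of the one-hot program encoding described above might look like the following. It is purely illustrative: the opcode and operand names are mine, and the real AlphaDev vocabulary and encoder are far richer.

```python
# Toy one-hot encoding of an assembly program (illustrative only; the real
# AlphaDev encoder and instruction vocabulary are far richer than this).
import numpy as np

OPCODES = ["mov", "cmovg", "cmp"]               # a subset of the instructions above
OPERANDS = ["R0", "R1", "R2", "M0", "M1", "M2"] # hypothetical register/memory names

# Vocabulary of every (opcode, dest, src) triple we allow; one bit per entry.
VOCAB = [(op, a, b) for op in OPCODES for a in OPERANDS for b in OPERANDS]
INDEX = {step: i for i, step in enumerate(VOCAB)}

def encode_program(program: list[tuple[str, str, str]]) -> np.ndarray:
    """One row per program step, one bit set per (opcode, operand, operand) triple."""
    enc = np.zeros((len(program), len(VOCAB)), dtype=np.int8)
    for row, step in enumerate(program):
        enc[row, INDEX[step]] = 1
    return enc

print(encode_program([("mov", "R0", "M0"), ("cmp", "R0", "R1")]).shape)
```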
Presumably they started with current basic CS sort algorithms, and 2-5 random integers in memory and fed this (properly encoded and embedded) in as a starting point. Then the AlphaDev algorithm went to work to improve it.
Do this enough times, with an intelligent balance between exploration (more random selection at first) and policy following (more use of the policy later) when choosing actions, and you too can generate new sorting algorithms.
DeepMind also spent time creating a stochastic solution to sorting that they could compare against their AlphaDev DRL approach to see which did better. In the end, they found the AlphaDev DRL approach worked faster and better than the stochastic solutions they tried.
DeepMind, having conquered sorting, did the same for hashing.
Why I think DeepMind’s AlphaDev is better
AlphaDev’s approach could just as easily be applied to any of Donald E. Knuth’s, 4 volume series on The Art of Computer Programming book algorithms.
I believe DeepMind’s approach is much more valuable to programmers (and humanity) than CoPilot, ChatGPT code, AlphaCode (DeepMind’s other code generator) or any other code generation transformers.
IMHO, AlphaDev goes to the essence of computer science as it’s been practiced over the last 70 years: here’s what we know, now let’s try to discover a better way to do the work we all have to do. And once we have discovered a new and better way, report and document it as widely as possible, so that any programmer can stand on our shoulders and use our work to do what they need to get done.
If I’m going to apply AI to coding, having it generate better basic CS algorithms is much more fruitful for the programming industry (and, I may add, humanity as a whole) than having it generate yet another iOS app or web site from scratch.
We had a relatively long discussion yesterday amongst a bunch of independent analysts, and one topic that came up was my thesis that enterprise IT is being hollowed out by two forces pulling in opposite directions on their apps. Those forces are the cloud and the edge.
The siren call of the cloud for business units, developers and modern apps has been present for a long time now. And their call is more omnipresent than Odysseus ever had to deal with.
The cloud’s allure is primarily low-cost, instant infrastructure that just works, an overflowing box of software solutions and tools, locations close to most major metropolitan areas, and the extreme ease of starting up.
If your app ever hopes to scale to meet customer demand, where else can you go? If your data can literally come in from anywhere, it usually lands on the cloud. And if you need modern solutions, tools, frameworks or just about anything the software world can create, there’s nowhere with more of it than the cloud.
Pre-cloud, all those apps would have run in the enterprise or wouldn’t have run at all. And all that data would have been funneled back into the enterprise.
Not today. The cloud has it all, and its siren call is getting louder every day, ever ready to satisfy every IT desire anyone could possibly have, except for the edge.
The Edge, last bastion for onsite infrastructure
The edge emerged over the last decade or so, kind of in stealth mode. Yes, there were always pockets of edge with unique compute or storage needs, video surveillance, for example, has been around forever, but the real acceleration of edge deployments started over the last decade or so as compute and storage prices came down drastically.
These days, the data being generated at the edge is staggering, and the compute requirements that go along with all that data are all over the place, from a few ARM/RISC-V cores to a server farm.
For instance, CERN’s LHC creates a PB of data every second of operation (see IEEE Spectrum article, ML shaking up particle physics too). But they don’t store all that. So they use extensive compute (and ML) to try to only store interesting events.
Seismic ships roam the seas taking images of underground structures, generating gobs of data, some of which is processed on ship and the rest elsewhere. A friend of mine creates RPi enabled devices that measure tank liquid levels deployed in the field.
More recently, smart cars are like data centers on tires, rolling across roads around the world and generating more data than you can even imagine. 5G towers are data centers on top of buildings, in farmland, and in cell towers dotting the highways of today. All off the beaten path, and all places where no data center has ever gone before.
In olden days, much less processing would have been done at the edge and more in an enterprise data center. But nowadays, with the advent of relatively cheap computing and storage, data can be pre-processed, compressed and tagged at the edge, and then sent elsewhere for further processing (mostly to the cloud, of course).
IT Vendors at the crossroads
And what does the hollowing out of enterprise data centers mean for IT server and storage vendors? Mostly, danger lies ahead. Enterprise IT hardware spend will stop growing, if it hasn’t already, and over time will shrink dramatically. It may be hard to see this today, but it’s only a matter of time.
Certainly, all these vendors can become more cloud-like on prem, offering compute and storage as a service with various payment options that make it easier to consume. And storage vendors can take advantage of their installed base by providing software versions of their systems running in the cloud, allowing easier migration and onboarding to the cloud. The server vendors have no such option. I see all of the above as more of a defensive, delaying or holding action.
This is not to say that enterprise data centers will go away. Just like mainframe and tape before them, on-prem data centers will exist forever, but they will be relegated to smaller and smaller niche markets that won’t grow anymore, and only as long as vendor(s) continue to upgrade the technology AND there’s profit to be made.
It’s just that the astronomical growth that’s been happening ever since the middle of last century won’t happen in enterprise hardware anymore.
Long term life for enterprise vendors will be hard(er)
Over the long haul, some server vendors may be able to pivot to the edge. But the diversity of compute hardware there will make it difficult to generate enough volume to make a decent profit. That’s not to say there will be zero profit there, just less. So, when I see a Dell or HPE server under the hood of my next smart car or inside the guts of my next drone, then and only then will I see a path forward (or sustained revenue growth) for these guys.
For enterprise storage vendors, future prospects look bleaker in comparison. Despite the data generation and growth at the edge, I don’t see much of a role for them there. The enterprise-class features and functionality they have spent decades creating and nurturing aren’t valued as much in the cloud, nor are they presently needed at the edge. Maybe I’m missing something here, but I just don’t see a long-term play for them in the cloud or at the edge.
~~~~
For the record, all this is conjecture on my part. But I have always believed that if you follow where new apps are being created, there you will find a market ready to explode. And where apps are no longer being created, there you will see a market in the throes of a slow death.
Read two articles over the past month or so. The more recent one was an Economist article (AI enters the industrial age, paywall) and the other was A generalist agent (from Deepmind). The Deepmind article was all about the training of Gato, a new transformer deep learning model trained to perform well on 600 separate task arenas from image captioning, to Atari games, to robotic pick and place tasks.
And then there was this one tweet from Nando de Freitas, research director at Deepmind:
“Someone’s opinion article. My opinion: It’s all about scale now! The Game is Over! It’s about making these models bigger, safer, compute efficient, faster at sampling, smarter memory, more modalities, INNOVATIVE DATA, on/offline, … 1/N“
I take this to mean that AGI is just a matter of more scale. Deepmind and others see the way to attain AGI as just a matter of throwing more servers, GPUs and data at training the model.
We have discussed AGI in the past (see part-0 [ish], part-1 [ish], part-2 [ish], part-3ish and part-4 blog posts [We apologize, only started numbering them at 3ish]). But this tweet is possibly the first time we have someone in the know, saying they see a way to attain AGI.
Transformer models
It’s instructive, from my perspective, that Gato is a deep learning transformer model. The other big NLP models have all been transformer models as well.
Gato (from Deepmind), SWITCH Transformer (from Google), GPT-3/GPT-J (from OpenAI and EleutherAI, respectively), OPT (from Meta), and Wu Dao 2.0 (from China’s latest supercomputer) are all trained on more and more text and image data scraped from the web, Wikipedia and other databases.
Wikipedia says transformer models are an outgrowth of RNN and LSTM models that use attention vectors on text. Attention vectors encode, into a vector (matrix), all textual symbols (words) prior to the latest textual symbol. Each new symbol encountered creates another vector with all prior symbols plus the latest word. These vectors are then used to train RNN models, using all of the vectors to generate output.
The problem with RNN and LSTM models is that they are impossible to parallelize. You always need to wait until you have encountered all the symbols in a text component (sentence, paragraph, document) before you can begin to train.
Instead of encoding these attention vectors as each symbol is encountered, transformer models encode all symbols at the same time, in parallel, and then feed these vectors into a DNN to assign attention weights to each symbol vector. This allows for complete parallelism, which reduces the computational load and the elapsed time to train transformer models.
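At its core, that parallel attention step is just a couple of matrix multiplies over all token vectors at once. A minimal, single-head sketch (no masking or multi-head machinery):

```python
# Minimal single-head scaled dot-product attention over all tokens in parallel
# (illustrative sketch; real transformers add multiple heads, masking, etc.).
import numpy as np

def attention(X, Wq, Wk, Wv):
    """X: (tokens, d_model); Wq/Wk/Wv: (d_model, d_head) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the token axis
    return weights @ V                               # weighted mix of value vectors

X = np.random.randn(8, 16)                           # 8 tokens, 16-dim embeddings
Wq, Wk, Wv = (np.random.randn(16, 16) for _ in range(3))
out = attention(X, Wq, Wk, Wv)                       # (8, 16), computed in one shot
```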
And transformer models allowed for a large increase in DNN parameters (I read these as DNN nodes per layer X the number of layers in a model). Gato has 1.2B parameters, GPT-3 has 175B parameters, and SWITCH Transformer is reported to have 7X more parameters than GPT-3.
Estimates for how much it cost to train GPT-3 range anywhere from $10M-20M USD.
AGI will be here in 10 to 20 yrs at this rate
So if it takes ~$15M to train a 175B parameter transformer model, and Google has already done SWITCH, which has 7-10X (~1.5T) the number of GPT-3 parameters, it seems to be an arms race.
If we assume it costs ~$65M (with a ~2X efficiency gain since GPT-3 training) to train SWITCH, we can create some bounds as to how much it will cost to train an AGI model.
By the way, the number of synapses in the human brain is approximately 1000T (See Basic NN of the brain, …). If we assume that DNN nodes are equivalent to human synapses (a BIG IF), we probably need to get to over 1000T parameter model before we reach true AGI.
So my guess is that any AGI model lies somewhere between 650X to 6,500X parameters beyond SWITCH or between 1.5Q to 15Q model parameters.
If we assume current technology to do the training, this would cost $40B to $400B. Of course, GPUs are not standing still; NVIDIA’s Hopper (introduced in 2022) is at least 2.5X faster than their previous-gen A100 GPU (introduced in 2020). So if we waited 10 years or so, we might be able to reduce this cost by a factor of 100X, and in 20 years maybe by 10,000X, or back to roughly where SWITCH is today.
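Pulling that back-of-envelope arithmetic into one place (all of these numbers are the rough assumptions stated above, not measurements):

```python
# Back-of-envelope AGI training cost estimate, reproducing the rough assumptions above.
switch_train_cost = 65e6              # assumed ~$65M to train SWITCH (~1.5T parameters)
scale_low, scale_high = 650, 6_500    # 650X-6,500X more parameters than SWITCH

cost_low = switch_train_cost * scale_low          # ~$42B
cost_high = switch_train_cost * scale_high        # ~$420B

# GPU progress: assume ~2.5X per generation, one generation every ~2 years.
gen_speedup, years_per_gen = 2.5, 2
speedup_10yr = gen_speedup ** (10 / years_per_gen)   # ~100X in 10 years
speedup_20yr = gen_speedup ** (20 / years_per_gen)   # ~10,000X in 20 years

print(f"Today:     ${cost_low/1e9:.0f}B - ${cost_high/1e9:.0f}B")
print(f"In 10 yrs: ${cost_low/speedup_10yr/1e9:.2f}B - ${cost_high/speedup_10yr/1e9:.2f}B")
print(f"In 20 yrs: ${cost_low/speedup_20yr/1e6:.0f}M - ${cost_high/speedup_20yr/1e6:.0f}M")
```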
So in the next 20 years most large tech firms should be able to create their own AGI models. In the next 10 years most governments should be able to train their own AGI models. And as of today, a select few world powers could train one, if they wanted to.
Where they get the additional data to train these models (I assume that data counts would go up linearly with parameter counts) may be another concern. However, I’m sure if you’re willing to spend $40B on AGI model training, spending a few $B more on data acquisition shouldn’t be a problem.
~~~~
At the end of the Deepmind article on Gato, it talks about the need for AGI safety in terms of developing preference learning, uncertainty modeling and value alignment. The footnote for this idea is the book, Human Compatible (AI) by S. Russell.
Preference learning is a mechanism for AGI to learn the “true” preference of a task it’s been given. For instance, if given the task to create toothpicks, it should realize the true preference is to not destroy the world in the process of making toothpicks.
Uncertainty modeling seems to be about having AI assume it doesn’t really understand what the task at hand truly is. This way there’s some sort of (AGI) humility when it comes to any task. Such that the AGI model would be willing to be turned off, if it’s doing something wrong. And that decision is made by humans.
Deepmind has an earlier paper on value alignment. But I see this as the ability of AGI to model human universal values (if such a thing exists) such as the sanctity of human life, the need for the sustainability of the planet’s ecosystem, all humans are created equal, all humans have the right to life, liberty and the pursuit of happiness, etc.
I can see a future post is needed soon on Human Compatible (AI).
Last May, an article came out of DeepMind research titled Reward is enough. It was published in an artificial intelligence journal but PDFs of it are available free of charge.
My last post on AGI inclined towards the belief that AGI was not possible without combining deduction, induction and abduction (probabilistic reasoning) together and that any such AGI was a distant dream at best.
Then I read the Reward is Enough article, and it implied that they saw a realistic roadmap towards achieving AGI based solely on reward signals and reinforcement learning (wikipedia article on Reinforcement Learning). Reading the article was disheartening, at best. After it came out, I made it a hobby to understand everything I could about reinforcement learning, to figure out whether what they are talking about is feasible or not.
Reinforcement learning, explained
Let’s just say that the textbook, Reinforcement Learning, is not the easiest read I’ve seen. But I gave it a shot, and although I’m nowhere near finished (lost somewhere in chapter 4), I’ve come away with a better appreciation of reinforcement learning.
The premise of reinforcement learning, as I understand it, is to construct a program that performs a sequence of steps based on the state or environment the program is working in, records that sequence, and tags or values that sequence with a reward signal (e.g., +1 for a good job, -1 for bad, etc.). Depending on whether the steps are finite (always end) or infinite (never end), the reward tagging could be cumulative (finite steps) or discounted (infinite steps).
The record of the program’s sequence of steps would include the state or environment and the next step that was taken. This goes on until the program completes the task or, if infinite, until the discounted reward signal becomes minuscule enough to not matter anymore.
Once you have a log or record of the state, the step taken in that state, and the reward for that step, you have a policy you can use to take better steps. Over time, with sufficient state-step-reward sequences, one can build a policy that works very well for the problem at hand.
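A toy version of that state-step-reward bookkeeping might look like the following. It is a simple Monte Carlo average of logged returns, purely illustrative and not any particular textbook algorithm.

```python
# Toy state-step-reward bookkeeping for building a policy from logged episodes
# (a simple Monte Carlo average of returns; illustrative only).
from collections import defaultdict
import random

returns = defaultdict(list)          # (state, action) -> list of observed returns

def record_episode(steps, reward, gamma=1.0):
    """steps: [(state, action), ...] from one play-through; reward: final tag (+1/0/-1)."""
    g = reward
    for state, action in reversed(steps):
        returns[(state, action)].append(g)
        g *= gamma                    # earlier steps get discounted more when gamma < 1

def policy(state, legal_actions):
    """Pick the action with the best average logged return; random if unseen."""
    scored = [(sum(returns[(state, a)]) / len(returns[(state, a)]), a)
              for a in legal_actions if returns[(state, a)]]
    return max(scored, key=lambda t: t[0])[1] if scored else random.choice(legal_actions)

record_episode([("start", "left"), ("mid", "up")], reward=+1)
print(policy("start", ["left", "right"]))
```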
Reinforcement learning, a chess playing example
Let’s say you want to create a chess playing program using reinforcement learning. If a sequence of moves ends the game, you can tag each move in that sequence with a reward (say +1 for wins, 0 for draws and -1 for losing), perhaps discounted by the number of moves it took to win. The “sequence of steps” would include the game board and the move chosen by the program for that board position.
Figure 2 (from the AlphaZero paper): Comparison with specialized programs. (A) Tournament results of AlphaZero against Stockfish (chess), Elmo (shogi) and a previously published AlphaGo Zero (Go). (B) Scalability of AlphaZero with thinking time, compared to Stockfish and Elmo. (C) Extra evaluations against the most recent Stockfish and against Stockfish with a strong opening book. (D) Average results of matches starting from different opening positions.
If your policy incorporates enough winning chess move sequences and the program encounters one of these board positions in a game: if the recorded move won, select that move; if it lost, select another valid move at random. If the program runs across a board position it’s never seen before, choose a valid move at random.
Do this enough times and you can build a winning white-playing chess policy. Doing something similar for a black-playing program would build a winning black-playing chess policy.
So now, what does all that have to do with creating AGI? The premise of the paper is that, by using rewards and reinforcement learning, one could program a policy for any domain that one encounters in the world.
For example, using the above chart, we could construct reinforcement learning programs that mimic perception (object classification/detection) abilities, memory (image/verbal/emotional/?) abilities, motor control abilities, etc. Each subsystem could be trained to solve the arena it’s needed for. And over time, if we built up enough of these subsystems, one could somehow construct an AGI system of subsystems that would match human levels of intelligence.
The paper’s main hypothesis is “(Reward is enough) Intelligence, and its associated abilities, can be understood as subserving the maximization of reward by an agent acting in its environment.”
Given where I am today, I agree with the hypothesis. But the crux of the problem is in the details. Yes, for a game with multiple players, where a reward signal of some type can be computed, a reinforcement learning program can be crafted that plays better than any human. But this is only because one can create programs that can play that game and programs that understand whether the game is won or lost, and then use all this to improve the game-playing policy over time and game iterations.
Do rewards and reinforcement learning provide a roadmap to AGI?
To use reinforcement learning to achieve AGI implies that
One can identify all the arenas required for (human) intelligence
One can compute a proper reward signal for each arena involved in (human) intelligence,
One can programmatically compute appropriate steps to take to solve that arena’s activity,
One can save a sequence of state-steps taken to solve that arena’s problem, and
One can run sequences of steps enough times to produce a good policy for that arena.
There are a number of potential difficulties in the above. For instance, what’s the state the program operates in?
A human has ~500K(?) pressure, pain, cold and heat sensors throughout the exterior and interior of the body; two eyes, ears and nostrils; one tongue; two balance sensors; tired, anxious, hunger, sadness, happiness and pleasure signals; and ~600 muscles actuating the position of five fingers per hand and five toes per foot, two eyes, ears, feet, legs, hands and arms, and one head and torso. Such a “body state” becomes quite complex, and any state that records all of it would be quite large. OK, it’s just data, just throw more storage at the problem – my kind of problem.
The compute power to create good policies for each subsystem would also be substantial, and determining the correct reward signal for each and every subsystem would be non-trivial. Yet, all it takes is money, time and effort, and all this could be accomplished.
So, yes, given all the above, creating an AGI that matches human levels of intelligence using reinforcement learning techniques and rewards is certainly possible. But given all the state information, action possibilities and reward signals inherent in a human interacting with the world today, any human-level AGI would seem unfeasible in the next year or so.
One item of interest: DeepMind researchers have recently created MuZero, which learns how to play Go, chess, shogi and Atari games without any pre-programmed knowledge of the games (that is, how to play the game, how to determine if the game is won or lost, etc.). It managed to come up with its own internal reward signal for each game and determined what the proper moves were for each game. This seems to combine a deep learning neural network with reinforcement learning techniques to craft a reward signal and valid move policies.
Alternatives to full AGI
But who says you need AGI for something that might be useful to us? Let’s say you just want to construct an intelligent oracle that understood all human-generated knowledge and science and could answer any question posed to it, with its only response capabilities being audio, video, images and text.
Even an intelligent oracle such as the above would need an extremely large state. Such a state would include all human- and machine-generated information at some point in time. And any reward signal needed to generate a good oracle policy would need to be very sophisticated; it would need to determine whether the oracle’s answer was good or not. And of course the steps to take to answer a query are uncountable: first there’s understanding the query, next searching out and examining every piece of information in the state space for relevance, and finally using all that information to answer the question.
I’m probably missing a few steps in the above, and it almost makes creating a human level AGI seem easier.
Perhaps the MuZero techniques might have an answer to some or all of the above.
~~~~
Yes, reinforcement learning is a valid roadmap to achieving AGI, but can it be done today? No. Tomorrow, perhaps.
I attended AIFD2 (videos of their sessions available here) a couple of weeks back, and for the last session Intel presented information on what they have been working on for new graph-optimized cores, and on a partner of theirs, Katana Graph, which supports a highly optimized graph analytics processing tool set using latest-generation Xeon compute and Optane PMEM.
What’s so special about graphs
The challenge with graph processing is that it’s nothing like standard 2D tables/images or 3D-oriented data sets. A graph is essentially a non-Euclidean data space made up of nodes with edges that connect them.
But graphs are everywhere we look today, for instance, “friend” connection graphs, “terrorist” networks, page rank algorithms, drug impacts on biochemical pathways, cut points (single points of failure in networks or electrical grids), and of course optimized routing.
The challenge is that large graphs aren’t easily processed with standard scale-up or scale-out architectures. Part of this is that graphs are very sparse: one node could point to one other node or to millions. Due to this sparsity, standard data cache fetch logic (such as fetching everything adjacent to a memory request) and standard vector processing (the same instructions applied to data in sequence) don’t work very well at all. Standard compute branch prediction logic doesn’t work either. (Not sure why, but apparently branching for graph processing depends more on the data at the node or in the edge connecting nodes.)
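To make the sparsity problem concrete, here is what a typical compressed-sparse-row (CSR) neighbor walk looks like; each hop is an indirect, data-dependent load, which is exactly what caches and vector units are bad at (my toy example, not Intel’s code):

```python
# CSR (compressed sparse row) neighbor traversal: the indirect, data-dependent
# loads through "indices" and "values" are what defeat caches and vector units.
import numpy as np

# Tiny 4-node example graph: 0->1, 0->3, 1->2, 3->0.
indptr  = np.array([0, 2, 3, 3, 4])   # node i's edges live in indices[indptr[i]:indptr[i+1]]
indices = np.array([1, 3, 2, 0])      # neighbor IDs; scattered across memory in a big graph
values  = np.random.rand(4)           # some per-node property (e.g., a PageRank score)

def gather_neighbor_values(node: int) -> np.ndarray:
    nbrs = indices[indptr[node]:indptr[node + 1]]   # could be 1 neighbor or millions
    return values[nbrs]                             # irregular, cache-unfriendly gather

print(gather_neighbor_values(0))      # values of nodes 1 and 3
```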
Intel talked about a new compute core they’ve been working on, which was in response to a DARPA-funded activity to speed up graph processing and related activities 1000X over current CPU/GPU hardware capabilities.
DARPA’s goals became public in 2017 and described their Hierarchical Identify Verify Exploit (HIVE) architecture. HIVE is DOD’s description of a graph analytics processor and is a multi-institutional initiative to speed up graph processing.
Intel PIUMA cores come with a multitude of 64-bit RISC processor pipelines with a global (shared) address space, memory and network interfaces that are optimized for 8 byte data transfers, a (globally addressed) scratchpad memory and an offload engine for common operations like scatter/gather memory access.
Each multi-thread PIUMA core has a set of instruction caches, small data caches and register files to support each thread (pipeline) in execution, and a number of these multi-thread cores are connected together.
PIUMA cores are optimized for TTEPS (Tera-Traversed Edges Per Second) and attempt to balance IO, memory and compute for graph activities. PIUMA multi-thread cores are tied together into a (completely connected) clique within a tile, multiple tiles are connected within a single node, and multiple nodes are tied together, over a network optimized for 8-byte transfers, into a PIUMA system.
P[I]UMA (labeled PUMA in the video) multi-thread cores apparently eschew extensive data and instruction caching in favor of a large number of relatively simple cores that can process a multitude of threads at the same time. Most of these threads will be waiting on memory, so the more threads executing, the less likely the whole pipeline will sit idle, and hopefully the more processing speedup will result.
The performance of the P[I]UMA architecture vs. a standard Xeon compute architecture on graph analytics and other graph-oriented tasks was simulated, with some results presented below.
Simulated speedups for a single node with P[I]UMA technology vs. Xeon range anywhere from 3.1X to 279X, depending on the amount of computation required at each node (or edge). (Intel saw no speedups between a single Xeon node and multiple Xeon nodes, so the speedup result for 16 P[I]UMA nodes was 16X that of a single P[I]UMA node.)
Having a global address space across all PIUMA nodes in a system is pretty impressive. We guess this is intrinsic to their (large) graph processing performance and depends on their use of photonic HyperX networking between nodes for low-latency, small (8-byte) data access.
Katana Graph software
Another part of Intel’s session at AIFD2 was on their partnership with Katana Graph, a scale out graph analytics software provider. Katana Graph can take advantage of ubiquitous Xeon compute and Optane PMEM to speed up and scale-out graph processing. Katana Graph uses Intel’s oneAPI.
Katana Graph is architected to support some of the largest graphs around. They tested it with the WDC12 (Web Data Commons 2012) page crawl, with 3.5B nodes (pages) and 128B connections (links) between nodes.
Katana runs on the AWS, Azure and GCP hyperscaler environments as well as on prem, and can scale out to 256 systems.
Katana Graph performance results for Graph Neural Networks (GNNs) are shown below. GNNs are similar to AI/ML/DL CNNs but use graph data rather than images. One can take a graph and reduce (convolve) and summarize segments of it in order to classify them. Moreover, GNNs can be used to understand whether two nodes are connected and whether two (sub)graphs are equivalent/similar.
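The graph “convolution” a GNN layer performs boils down to averaging neighbor features through the adjacency matrix and then applying a learned transform. A minimal sketch (illustrative only, not Katana Graph’s implementation):

```python
# Minimal graph-convolution layer: aggregate neighbor features via the adjacency
# matrix, then apply a learned linear transform (illustrative, not Katana's code).
import numpy as np

def gcn_layer(A, X, W):
    """A: (n, n) adjacency; X: (n, d_in) node features; W: (d_in, d_out) weights."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    H = (A_hat / deg) @ X                        # average each node with its neighbors
    return np.maximum(H @ W, 0)                  # linear transform + ReLU

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # 3-node path graph
X = np.random.randn(3, 4)                                      # 4 features per node
W = np.random.randn(4, 2)
print(gcn_layer(A, X, W))                                      # (3, 2) node embeddings
```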
In addition to GNNs, Katana Graph supports Graph Transformer Networks (GTNs), which can analyze meta-paths within a larger, heterogeneous graph. The challenge with large graphs (say friend/terrorist networks) is that there are a large number of distinct sub-graphs within the graph. GTNs can break heterogeneous graphs into sub- or meta-graphs, which can then be used to understand these relationships at smaller scales.
At AIFD2, Intel also presented an update on their Analytics Zoo, which is Intel’s MLops framework. But that will need to wait for another time.
~~~~
It was sort of a revelation to me that graph data is not amenable to normal compute core processing using today’s GPUs or CPUs. DARPA (and Intel) saw this deficiency as calling for a completely different, brand new compute architecture.
Even so, Intel’s partnership with Katana Graph shows that even today’s compute environments can provide higher performance on graph data with suitable optimizations.
It would be interesting to see what Katana Graph could do using PIUMA technology and appropriate optimizations.
In any case, we shouldn’t need to wait long, Intel indicated in the video that P[I]UMA Technology chips could be here within the next year or so.