System effectiveness – Silverton Consulting

Context, tokens, KV stores & storage, Solidigm presents at #AIFD8

Posted on May 29, 2026 by Ray in AI storage needs, MLOps, NVMe storage, RAG-LLM, SSD storage, System effectiveness

Solidigm presented (video here) at AIFD8 this month and as part of their presentation they spent time dissecting what happens to a prompt, how token growth happens, and where storage can help speed up prompt processing.

The token count explosion

It all starts with a simple prompt, something like “run a benchmark against a drive”, maybe a 12 token prompt, but when it actually gets processed it can balloon into something much larger. As an LLM processes the prompt it goes through a number of steps: building context, calling tools, obtaining and interpreting results, persisting knowledge and finally, responding to the prompt.

Digging a level deeper, here’s what the token counts look like during prompt processing. The first step is to understand the environment of the prompt: rules, safety requirements and the methodology at its disposal. Then there’s retrieval activity that gathers information needed to actually process and perform the prompt, followed by identifying the tools and APIs needed to process the prompt. At some point, when the LLM has all that, it plans out the steps needed to actually perform the prompt; tool results are generated, interpreted and fed back to LLM processing to determine the next step. Eventually, prompt processing completes and the prompt reply is sent back to the issuer.

As one can see in the above, the prompt itself was minuscule in token counts in the vast scheme of activity needed to process the prompt. And this is just how one (albeit complex), ~12 token prompt can grow into a 42K token context.

Inferencing and Time To First Token

Inferencing consists of two phases:

PreFill phase – which is the processing that goes on to take the context token stream and convert it into a KV (Key:Value) store which the LLM can use for subsequent processing so it doesn’t have to go back to the token context. PreFill ends up with a fully populated KV store representing all the tokens in the current context, and generates the first token in the LLM response to the prompt.
Decode – which is all subsequent processing needed to generate the rest of the prompt response. It uses that KV store to underpin its processing to generate any more tokens needed to answer the prompt.
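To make the two phases concrete, here is a toy sketch in plain Python. It is not how any real inferencing stack is written: the `project` function is a hypothetical stand-in for a model's learned key and value projections, there is a single attention head, and "tokens" are just small vectors. What it does show is that PreFill touches every context token once to build the KV store, while each Decode step only adds one new entry and reuses the rest.

```python
import math

def attend(query, keys, values):
    """Single-head softmax attention of one query over the cached keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def project(token_vec):
    """Hypothetical stand-in for learned K and V projections (identity and 2x)."""
    return list(token_vec), [2.0 * x for x in token_vec]

def prefill(context_vecs):
    """Build the KV store for every context token, then emit the first output."""
    cache = {"keys": [], "values": []}
    for vec in context_vecs:
        key, value = project(vec)
        cache["keys"].append(key)
        cache["values"].append(value)
    first = attend(context_vecs[-1], cache["keys"], cache["values"])
    return cache, first

def decode_step(cache, new_vec):
    """Add one token's K/V to the store and attend; old tokens are not reprocessed."""
    key, value = project(new_vec)
    cache["keys"].append(key)
    cache["values"].append(value)
    return attend(new_vec, cache["keys"], cache["values"])

context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy "context tokens"
cache, first = prefill(context)                  # KV store now holds 3 entries
second = decode_step(cache, first)               # KV store grows to 4 entries
print(len(cache["keys"]))
```

If the KV store is lost or evicted, the only way to get it back in this sketch is to call `prefill` again over the whole context, which is the recalculation cost discussed in this post.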

Solidigm went on to describe how these activities impact the Time To First Token (TTFT), or how long it takes from the time the prompt is issued until the LLM responds with the first word (token) of the prompt response.

(Although in Solidigm’s chart they show Decode in the TTFT path. I believe this is incorrect as PreFill generates the first token. Nonetheless, there is a portion of PreFill that “decodes” the prompt response first token and I assume that’s what they are showing here. Of course I could be mistaken.)

Storage can impact both the time it takes to assemble context tokens and to perform PreFill.

While storage can matter a lot during context assembly (lots of potential IO activity reading files, RAGs and other documents), storage’s impact on PreFill is less widely known. That is until you understand how prompt processing can be held up for KV store recalculation (going back to context tokens and rebuilding some or all of the current KV store for the prompt).

Increasing context, leads to more tokens, leads to larger KV stores, all of which impacts TTFT

Although it’s only conjecture on my part, the biggest portion of the Tprefill above seems to be calculating and converting context/memory tokens into KV elements stored in the prompt’s KV store. KV stores are used during downstream prompt processing because they can be easily accessed and each KV item represents interpreted token information in a fashion easily used by the LLM.

And what’s not evident in the above TTFT decomposition chart is that tool use generates even more tokens, as tool results, all of which need to be processed into more KV store elements in order to determine what to do next.

What happens to large KV stores during prompt processing

If there is a single GPU running a single prompt it’s possible, depending on model and HBM size, that it will run out of GPU HBM memory and offload or move some portion of its KV (store) cache to CPU memory. But if that GPU is processing 100s to 1000s of prompts concurrently, even CPU memory may not be large enough to hold every KV cache segment that no longer fits in GPU HBM. And of course most enterprise AI servers hold anywhere from 4 to 10 GPUs, each running 100s to 1000s of prompts concurrently.
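A back-of-the-envelope sizing helps show why HBM and CPU memory run out. The sketch below uses the commonly cited estimate of one key vector and one value vector per layer, per KV head, per token. The model shape (32 layers, 8 KV heads, 128-dimension heads, 16-bit values) is a hypothetical example of mine, not something from Solidigm’s presentation, and real models will differ.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one prompt: a key and a value vector
    per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **shape)
per_prompt = kv_cache_bytes(42_000, **shape)   # the 42K token context above
concurrent_prompts = 1000

print(per_token)                                          # 131072 bytes (128 KiB)
print(round(per_prompt / 1e9, 2))                         # GB for one 42K token prompt
print(round(per_prompt * concurrent_prompts / 1e12, 2))   # TB for 1000 such prompts
```

Under those assumptions a single 42K token prompt needs roughly 5.5 GB of KV cache, and 1000 concurrent prompts of that size would need roughly 5.5 TB, well beyond the HBM of a single GPU.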

KV cache offload is where fast storage can significantly speed up prompt processing

There’s an obvious tradeoff here with respect to KV stores. One can always go back to the Prefill phase, reread all the tokens in current context and recompute the KV store or one can offload KV store segments to memory, local storage or network storage and later retrieve the already computed KV store from wherever it ended up.

The tradeoff is how long it takes to recompute vs. doing the data transfers to offload and retrieve the KV cache segments. Larger contexts increase KV store size, which leads to more need to offload or jettison KV store segments when running out of GPU HBM space. Both approaches, KV caching to memory-storage and jettisoning KV store segments and reconstituting them, add time to TTFT. The question is which is faster.
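The “which is faster” question can be framed as a simple comparison of two times. In the sketch below every number is a made-up placeholder (KV bytes per token, PreFill throughput and SSD read bandwidth are all assumptions, not measurements), so only the shape of the comparison matters, not the result.

```python
def recompute_seconds(tokens, prefill_tokens_per_sec):
    """Time to rebuild the KV store by rerunning PreFill over the context."""
    return tokens / prefill_tokens_per_sec

def reload_seconds(cache_bytes, read_bytes_per_sec):
    """Time to read an already computed KV store back from storage."""
    return cache_bytes / read_bytes_per_sec

tokens = 42_000
kv_bytes_per_token = 131_072     # assumption: 128 KiB of KV store per token
prefill_rate = 10_000            # assumption: tokens/second of PreFill compute
ssd_read_rate = 10e9             # assumption: 10 GB/s of sequential read

recompute = recompute_seconds(tokens, prefill_rate)
reload = reload_seconds(tokens * kv_bytes_per_token, ssd_read_rate)

print(round(recompute, 2), round(reload, 2))
print("reload wins" if reload < recompute else "recompute wins")
```

With these placeholder rates the reload takes about 0.55 seconds against 4.2 seconds to recompute, but a faster GPU or a slower storage path can flip the answer, which is exactly the tradeoff described above.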

One can see how this becomes ever more of an issue as prompt token counts (& KV elements) skyrocket, and as more prompts run concurrently on the same GPU(s) in a single server.

Obviously local, large SSDs with very fast random read would be ideal for KV cache offload activity, which has the KV cache segment written out once (and extended as prompt processing adds context) but read back multiple times. Which is a great application for large capacity, fast read NVMe SSDs which, I must say, are Solidigm’s forte.

NVIDIA and others have started to add KV cache offloading to their inferencing stacks. As they do, large, fast NVMe SSD activity during AI prompt processing will become one of the critical factors in TTFT.

In the meantime, if anyone has any large, fast NVMe SSDs they don’t need anymore, please let me know. 🙂

Silverton Space – Ocean Sensing platform

Posted on December 18, 2024 by Ray in Ocean Sensing, Space, System effectiveness, Visionary leadershp

I was at a conference last year and there was a speaker there who had worked at NASA for years and was currently at MIT. She talked at length about some of the earth and space scientific exploration that NASA has enabled over the years. Despite massive cost overruns, years long schedule delays and other mishaps, NASA has ultimately come through with groundbreaking science.

At the end of her presentation I asked what data gaps existed today in space and earth sensing. She mentioned real time methane tracking (presumably from space) and battery-less ocean sensing.

Methane track from Tanager-1 JPL/NASA satellite

Methane tracking I could understand but battery-less ocean sensing was harder to get a handle on.

US Navy and other oceanographic organizations have deployed numerous sensing devices over the years. Some of these were deployed as a flotilla, which traveled across the Gulf and Atlantic ocean to gather data.

But these were battery supported, solar powered, and limited to ~1 year of service after which they were scuttled to the bottom of the ocean.

I guess the thought is that a battery-less ocean sensing platform could provide more of an ongoing, permanent sensor platform, one that could be deployed and potentially be in service for years at a time, with little to no maintenance.

The pivot

So as a stepping stone to Silverton Space cubesat operations, I’m thinking that going after a permanent-like ocean sensing platform would be a valuable first step. And it’s quite possible that anything we do in LEO with Silverton Space platforms could complement any ocean going sensor activity.

One reason to pivot to ocean sensing is that it’s much much cheaper to launch a flotilla of ocean going sensing buoys via a boat off a coast than it is to launch a handful of cubesats into LEO (@~$70K each).

Cubesats fail at a high rate

Moreover, the litany of small satellite failures is long, highly varied and chronic. Essentially anything that could go wrong, often does, at least for the first dozen or so satellites you deploy.

NASA says that of the small satellites launched between 2000 and 2016 over 40% failed in some way and over 24% were total mission failures. (see: https://ntrs.nasa.gov/api/citations/20190002705/downloads/20190002705.pdf)

Cubesats with limited functionality, or that fail in orbit or to launch, become just more trash orbiting in LEO. And the only way to diagnose what went wrong is elaborate, extensive and transmitted/received telemetry.

So another reason to start with ocean going sensors is that there’s a distinct possibility of retrieving a malfunctioning ocean going sensor buoy after deployment. And with sensor buoy in hand, diagnosing what went wrong should be a snap. This doesn’t eliminate the need for elaborate, extensive and transmitted/received telemetry but you are no longer entirely dependent on it.

And even if at end of life they can’t be salvaged/refurbished or scuttled, the worst case is that our ocean sensing buoys would end up being part of some ocean/gulf garbage patch, and hopefully would get picked up and disposed of as part of oceanic garbage collection.

~~~

So for the foreseeable future, Silverton Space will focus on ocean going sensor buoys. It’s unlikely that our first iterations will be completely battery-less but at some point down the line, we hope to produce a version that can be on station for years at a time and provide valuable ocean sensing data to the scientific community.

The main question left, is what sorts of ongoing, ocean sensor information might be most valuable to supply to the world’s scientific community?

Photo Credit(s):

From JPL/NASA Tanager-1 press release
From DARPA Ocean of Things website
From National Geographic Article on The Great Pacific Garbage Patch

Enfabrica MegaNIC, a solution to GPU backend networking #AIFD5

Posted on September 17, 2024 by Ray in Cognitive computing, Ethernet, Networking, Software Defined Network, Strategic Inflection Points, System effectiveness, Visionary leadershp

I attended AI FieldDay 5 (AIFD5) last week and there were networking vendors there discussing how their systems dealt with backend GPU network congestion issues. Most of these were traditional vendor congestion solutions.

However, one vendor, Enfabrica, (videos of their session will be available here) seemed to be going down a different path, which involved a new ASIC design destined to resolve all the congestion, power, and performance problems inherent in current backend GPU Ethernet networks.

In essence, Enfabrica’s Super or MegaNIC (they used both terms during their session) combines PCIe lane switching, Ethernet networking, and ToR routing with SDN (software defined networking) programmability to connect GPUs directly to a gang of Ethernet links. This allows it to replace multiple (standard/RDMA/RoCEv2) NIC cards with one MegaNIC using their ACF-S (Advanced Compute Fabric SuperNic) ASIC.

Their first chip, codenamed “Millennium” supports 8Tbps bandwidth.

Their ACF-S chip provides all the bandwidth needed to connect up to 4 GPUs to 32/16/8/4-100/200/400/800Gbps links. And because their ACF-S chip controls and drives all these network connections, it can better understand and deal with congestion issues in backend GPU networks. And it is PCIe 5/6 compliant, supporting 128-160 lanes.
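Reading “32/16/8/4-100/200/400/800Gbps” as positional pairs (32 links at 100Gbps, 16 at 200Gbps and so on, which is my assumption about the notation), a quick check shows every option adds up to the same aggregate Ethernet bandwidth:

```python
# (link count, Gbps per link), paired positionally; the pairing is an assumption.
link_options = [(32, 100), (16, 200), (8, 400), (4, 800)]

totals_gbps = [count * gbps for count, gbps in link_options]
for (count, gbps), total in zip(link_options, totals_gbps):
    print(f"{count} x {gbps}Gbps = {total / 1000} Tbps")
```

Each configuration works out to 3.2 Tbps of Ethernet links, so the choice is about how many ports you want, not how much total bandwidth.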

Further, it has onboard ARM processing to handle its SDN operations, onboard hardware engines to accelerate networking protocol activity and network and PCIe switching hardware to support directly connecting GPUs to Ethernet links.

With its SDN, it supports current RoCE, RDMA over TCP, UEC direct, etc. network protocols.

It took me longer than it should have to get my head around what they were doing but essentially they are supporting all the NIC-ToR functionality as well as PCIe functionality needed to connect up to 4 GPUs to a backend Ethernet GPU network.

On the slide above I was extremely skeptical of the Every 10^52 Years “job failures due to NIC RAIL failures”. But Rochan said that these errors are predominantly optics failures and as both the NIC functionality and ToR switch functionality is embedded in the ACF-S silicon, those faults should not exist.

Still 10^52 years is a long MTBF rate (BTW, the universe is only 10^10 years old). And there’s still software controlling “some” of this activity. It may not show up as a “NIC RAIL” failure, but there will still be “networking” failures in any system using ACF-S devices.

Back to their solution. What this all means is you can have one less hop in your backend GPU networks leading to wider/flatter backend networks and a lot less congestion on this network. This should help improve (GPU) job performance, networking performance and reduce networking power requirements to support your 100K GPU supercluster.

At another session during the show, Arista (videos will be available here) said that just the DSP/LPO optics alone for a 100K GPU backend network will take 96/32 MW of power. Unclear whether this took into consideration within-rack copper connections. But any way you cut it, it’s a lot of power. Of course the 100K GPUs would take 400MW alone (at 4KW per GPU).
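To put those power numbers side by side, here is the arithmetic as a small sketch. It assumes the “96/32 MW” figures map to DSP and LPO optics respectively, which is how I read the slash notation.

```python
gpus = 100_000
kw_per_gpu = 4

gpu_mw = gpus * kw_per_gpu / 1000          # total GPU power in MW
optics_mw = {"DSP": 96, "LPO": 32}         # assumed pairing of the 96/32 MW figures

# Optics power expressed as a fraction of the GPU power budget.
optics_share = {kind: mw / gpu_mw for kind, mw in optics_mw.items()}

print(gpu_mw)
for kind, share in optics_share.items():
    print(f"{kind} optics: {share:.0%} of GPU power")
```

So on these numbers, optics alone would add somewhere between 8% (LPO) and 24% (DSP) on top of the GPU power draw.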

Their ACF-S driver has been upstreamed into standard CCL and Linux distributions, so once installed (or if you are at the proper versions of CCL & Linux software), it should support complete NCCL (NVIDIA Collective Communications Library) stack compliance.

And because, with its driver installed and active, it talks standard Ethernet and standard PCIe protocols on both ends, it should fully support any other hardware that comes along attaching to these networks or busses (CXL perhaps).

The fact that this may or may not work with other (GPU) accelerators seems moot at this point as NVIDIA owns the GPU for AI acceleration market. But the flexibility inherent in their own driver AND on chip SDN, indicates for the right price, just about any communications link software stack could be supported.

After spending most of the rest of AIFD5 discussing how various vendors deal with congestion for backend GPU networks, having a startup on the stage with a different approach was refreshing.

Whether it reaches adoption and startup success is hard to say at this point. But if it delivers on what it seems capable of doing for power, performance and network flexibility, anybody deploying new greenfield GPU superclusters ought to take a look at Enfabrica’s solution.

MegaNIC/ACF-S pilot boxes are available for order now. No indication as to what these would cost but if you can afford 100K GPUs it’s probably in the noise…

~~~~

Comments?

One agent to rule them all, Deepmind’s Gato – AGI part 7

Posted on August 30, 2023 by Ray in AGI, Deep Learning, Scenario planning, Strategic Inflection Points, Strategic planning, System effectiveness

I was perusing Deepmind’s mountain of research today and ran across one article on their Gato agent (A Generalist Agent abstract, paper pdf). These days with Llama 2, GPT-4 and all the other LLMs doing code, chatbots, image generation, etc. it seems generalist agents are everywhere. But that’s not quite right.

Gato can not only generate text from prompts, but can also control a robot arm for pick and place, caption images, navigate in 3D, play Atari and other (shooter) video games, etc. all with the same exact model architecture and the same exact NN weights with no transfer learning required.

Same weights/same model is very unusual for generalist agents. Historically, generalist agents were all specifically trained on each domain and each resultant model had distinct weights even if they used the same model architecture. For Deepmind, to train Gato and use the same model/same weights for multiple domains is a significant advance.

Gato has achieved significant success in multiple domains. See chart below. However, complete success is still a bit out of reach but they are making progress.

For instance, in the chart one can see that there are over 200 tasks in the DM Lab arena that the model is trained to perform and Gato’s mean performance for ~180 of them is above a (100%) expert level. I believe DM Lab stands for Deepmind Lab and is described as a (multiplayer, first person shooter) 3D video game built on top of Quake III arena.

Deepmind stated that the mean for each task in any domain was taken over 50 distinct iterations of the same task. Gato performs, on average, 450 out of 604 “control” tasks at better than 50% human expert level. Please note, Gato does a lot more than just “control tasks”.

Model size and RT robotic control

One thing I found interesting is that they kept the model size down to 1.2B parameters so that it can perform real time inferencing in controlling robot arms. Over time as hardware speed increases, they believe they should be able to train larger models and still retain real-time control. But at the moment, with a 1.2B model it can still provide real time inferencing.

In order to understand model size vs. expertise they used 3 different model sizes trained on the same data: 79M, 364M and 1.2B parameters. As can be seen on the above chart, the models did suffer in performance as they got smaller. (Unclear to me what “Tokens Processed” on the X axis actually means other than data length trained with.) However, it seems to imply that with similar data, bigger models performed better and the largest did 10 to 20% better than the smallest model trained with the same data streams.

Examples of Gato in action

The robot they used to train for was a “Sawyer robot arm with 3-DoF cartesian velocity control, an additional DoF for velocity, and a discrete gripper action.” It seemed a very flexible robot arm that would be used in standard factory environments. One robot task was to stack different styles and colors of plastic blocks.

Deepmind says that Gato provides rudimentary dialogue generation and picture captioning capabilities. Looking at the chat streams presented, it seems more than rudimentary to me.

Deepmind did try the (smaller) model on some tasks that it was not originally trained on and it seemed to perform well after “fine-tuning” on the task. In most cases, using fine-tuning of the original model, with just “same domain” (task specific) data, the finely tuned model achieved similar results to what it achieved if Gato was trained from scratch with all the data used in the original model PLUS that specific domain’s data.

Data and tokenization used to train Gato

Deepmind is known for their leading edge research in RL but Gato’s deep neural net model is all trained with supervised learning using transformer techniques. While text based transformer type learning is pervasive in LLM today, vast web class data sets on 3D shooter gaming, robotic block stacking, image captioning and others aren’t nearly as widely available. Below they list the data sets Deepmind used to train Gato.

One key to how they could train a single transformer NN model to do all this, is that they normalized ALL the different types of data above into flat arrays of tokens.

Text was encoded into one of 32K subwords and was represented by integers from 0 to 32K. Text is presented to the model in word order.
Images were transformed into 16×16 pixel patches in raster order. Each pixel is normalized to the range [-1,1].
Other discrete values (e.g. Atari button pushes) are flattened into sequences of integers and presented to the model in row major order.
Continuous values (robot arm joint torques) are first flattened into sequences of floats in row major order, then mu-law encoded into the range [-1,1] and then discretized into one of 1024 bins.
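The continuous-value step in the list above can be sketched in a few lines. This is my reconstruction, not Deepmind’s code: the mu-law constants (mu=100, M=256) are what I recall from the paper and should be treated as assumptions, as should the exact way values are mapped onto the 1024 uniform bins.

```python
import math

def mu_law(x, mu=100.0, m=256.0):
    """Mu-law compand a float; constants are assumed, see the note above."""
    magnitude = math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0)
    return math.copysign(magnitude, x)

def discretize(value, bins=1024):
    """Clip to [-1, 1], then map onto one of `bins` uniform-width bins."""
    clipped = max(-1.0, min(1.0, value))
    index = int((clipped + 1.0) / 2.0 * bins)
    return min(index, bins - 1)     # the value 1.0 falls in the last bin

def tokenize_continuous(values):
    """Flat list of floats (e.g. joint torques) -> list of bin indices."""
    return [discretize(mu_law(v)) for v in values]

torques = [0.0, 0.5, -0.5, 256.0, -256.0]   # made-up example values
print(tokenize_continuous(torques))
```

The companding step is what lets small values near zero keep fine resolution while large values are squeezed toward the ends of the range before binning.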

After tokenization, the data streams are converted into embeddings. Much more information on the tokenization and embedding process used in the model is available in the paper.

One can see the token count of the training data above. Like other LLMs, transformers take a token stream, randomly zero one token out and are trained to guess the correct token in sequence.

~~~~

The paper (see link above and below) has a lot more to say about the control and non-control domains and the data used in training/fine-tuning Gato, if you’re interested. They also have a lengthy section on risks and challenges present in models of this type.

My concern is that as generalist models become more pervasive and as they are trained to work in more domains, the difference between a true AGI agent and a Generalist agent starts to blur.

Something like Gato that can both work in the real world (via robotics) and perform meta analysis (like in metaworld), play 1st person shooter games, and analyze 2D and 3D images, all at near expert levels, and oh, support real time inferencing, seems not that far away from something that could be used as a killer robot in an army of the future, and this is just where Gato is today.

One thing I note is that the model is not being made generally available outside of Google Deepmind. And IMHO, that for now is a good thing.

That is until some bad actor gets their hands on it….

Picture Credit(s):

All images, charts, and tables are from “A Generalist Agent” paper

MLperf results show H100 v A100 and v Habana Gaudi2 GPUs

Posted on July 12, 2023 by Ray in Cognitive computing, Deep Learning, Machine Learning, System effectiveness

MLCommons recently released new MLperf data center training results. The headline for the release was that they added new GPT-3 data center training results, but what I found more interesting was that there was a plethora of H100 and A100 results on the same training runs, which allowed me to compare the two NVIDIA GPUs in performance.

For example, in ResNet 50 (Image recognition) model training there were a number of H100 and A100 results from Dell. Two of which used the same Intel CPU counts and same H100/A100 GPU counts.

Above we show the top 10 ResNet 50 results and if you examine the #6 submission, it’s a Dell result with 4 Intel Platinum CPUs and 16 NVIDIA H100-SXM5-80GB GPUs which trained ResNet 50 model in 7.8 minutes.

What’s not on that chart is another Dell submission (#16) that also had 4 Intel Platinum CPUs but used 16 NVIDIA A100-SXM-80GB GPUs, which trained the same model in 14.4 minutes.

For ResNet 50 then the H100 is 1.8X faster than a similarly configured A100.

We show above results for Image Segmentation model training top 10. In this case there were two similar Dell submissions, at #3 and #4, in the top 10. These had similar hardware configuration but used H100 or A100 GPUs

These two Dell Image Segmentation (3D-Unet) model training result submissions, at 7.6 minutes and 11.0 minutes respectively, mean that for Image Segmentation the H100 is 1.4X faster than the A100.

Finally, for DLRM Recommendation engine training results, there were two other Dell submissions (#5 & #7) that used 2 Intel Platinum CPUs and 8 (H100-SXM5-80GB and A100-SXM-80GB) GPUs and trained in 4.3 and 8.4 minutes, respectively. This says that for DLRM model training the H100 is 2.0X faster than the A100.

There were other comparisons (that didn’t attain top training results) with 2 Intel Platinum CPUs and 8 (H100 and A100) GPUs for other model results, which show the H100 is anywhere from 1.7X faster to 2.1X faster.

Unclear why the H100 GPUs perform relatively better with fewer GPUs in the configuration but there may be some additional overhead involved in supporting more CPUs and GPUs which reduces their relative performance.

As a result, recent MLperf data center training results show that for 4 CPUs and 16 (H100 or A100) GPUs the H100 performed 1.4X to 1.8X faster than the A100, and for 2 CPUs and 8 (H100 or A100) GPUs the H100 performed 1.7X to 2.1X faster than the A100.
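For anyone who wants to check the speedup figures, they are simply the ratio of the A100 training time to the H100 training time for each pair of Dell submissions quoted above:

```python
def speedup(slower_minutes, faster_minutes):
    """How many times faster the second result is than the first."""
    return slower_minutes / faster_minutes

# (A100 minutes, H100 minutes) for the Dell submission pairs cited in this post.
dell_pairs = {
    "ResNet 50": (14.4, 7.8),
    "3D-Unet": (11.0, 7.6),
    "DLRM": (8.4, 4.3),
}

speedups = {model: round(speedup(a100, h100), 1)
            for model, (a100, h100) in dell_pairs.items()}
print(speedups)
```

This gives 1.8X for ResNet 50, 1.4X for 3D-Unet and 2.0X for DLRM, matching the figures in the text.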

There was one other interesting GPU comparison shown in recent MLperf results, that between the NVIDIA H100-SXM5-80GB and the Intel Habana Gaudi2 GPU. In this case the submissions involved different vendors (Dell and Intel) and different AI frameworks (NGC MXNet 23.04, NGC Pytorch 23.04 and NGC HugeCTR 23.04 for the H100, and PyTorch 1.13.1a0 for the Habana Gaudi2). For both submissions they used 2 Intel Platinum CPUs and 8 (H100 or Habana Gaudi2) GPUs.

Again, none of these (H100 vs Habana Gaudi2 GPU) results appear in the top result charts we show here.

For ResNet 50, the H100 GPU trained ResNet 50 in 13.5 min and the Habana Gaudi2 GPU trained ResNet 50 in 16.5 min. This would say the H100 is 1.2X faster than the Habana Gaudi2 GPU.

In addition, both of these submissions also trained against the image segmentation model. The H100 trained the image segmentation model in 12.2 minutes while the Habana Gaudi2 trained in 20.5 minutes. This would say that the H100 is 1.7X faster than the Habana Gaudi2 GPU.

As a result, recent MLperf data center training results show the NVIDIA H100-SXM5-80GB is 1.2X to 1.7X faster than the Intel Habana Gaudi2 GPU on the 2 different model training results with similar hardware configurations.

Finally, MLperf results for GPT-3 are brand new for this release, so we present them below.

There were only 4 (on prem) submissions for GPT-3 in this round. The #1 NVIDIA submission with 192 CPUs and 768 H100-SXM5-80GB GPUs trained in 44.8 minutes while the #4 Intel submission with 64 CPUs and 256 Habana Gaudi2 GPUs trained in 442.6 min.

It’s less certain whether we should compare GPU speeds here as 1) the comparison (#1 to #3 and #2 to #4) used 1/2 the hardware and 2) the software frameworks were very dissimilar, the (#1 & #2) NVIDIA H100 GPT-3 submissions used the NVIDIA NeMo software framework and the Intel (#3 AND #4) submissions used PyTorch 1.13.1a0. Not sure what NVIDIA NeMo is derived from but it doesn’t seem to be being used in any other model training run for MLperf other than GPT-3.
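With those caveats in mind, one rough way to look at the #1 and #4 GPT-3 results is GPU-minutes (GPU count times training time). This is my own normalization, not an MLperf metric, and it assumes training scales linearly with GPU count, which it generally does not. Treat it as an indication only.

```python
def gpu_minutes(gpu_count, training_minutes):
    """Crude normalization: total GPU time consumed by a training run."""
    return gpu_count * training_minutes

h100_run = gpu_minutes(768, 44.8)      # #1 NVIDIA submission
gaudi2_run = gpu_minutes(256, 442.6)   # #4 Intel submission

wall_clock_ratio = 442.6 / 44.8
gpu_minute_ratio = gaudi2_run / h100_run

print(round(wall_clock_ratio, 1))   # raw training time ratio
print(round(gpu_minute_ratio, 1))   # ratio after normalizing for GPU count
```

The raw training time gap is about 9.9X, but normalized for GPU count it shrinks to about 3.3X, and that still ignores the very different software frameworks.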

Comments?

Deepmind does sort

Posted on June 13, 2023 by Ray in Cognitive computing, Deep Learning, Reinforcement Learning, Strategic Inflection Points, System effectiveness, Visionary leadershp, Visionary organizations

Saw an article today on TNW on DeepMind’s new AI taps games to enhance fundamental algorithms which was discussing a recent Nature paper Faster sorting algorithms discovered using deep reinforcement learning and website, which described AlphaDev.

Google DeepMind’s AlphaDev is a derivative of AlphaZero (from the same family as MuZero and AlphaGo, the conqueror of Go and other strategy games). AlphaDev uses Deep Reinforcement Learning (DRL) to come up with new computer science algorithms. In the first incarnation, a way to sort (2, 3, 4 or 5 integers) using X86 instructions.

Sorting has been well explored over the years in computer science (CS, e.g. see Donald E. Knuth’s Volume 3 in The Art of Computer Programming, Sorting and Searching), so when a new more efficient/faster sort algorithm comes out it’s a big deal. Google used to ask job applicants how they would code sort algorithms for specific problems. Successful candidates would intrinsically know all the basic CS sorting algorithms and which one would work best in different circumstances.

Deepmind’s approach to sort

Reading the TNW news article, I couldn’t conceive of the action space involved in the reinforcement learning let alone what the state space would look like. However, as I read the Nature article, DeepMind researchers did a decent job of explaining their DRL approach to developing new basic CS algorithms like sorting.

AlphaDev uses a transformer-like framework and a very limited set of x86 (sort of, encapsulated) instructions with memory/register files and limited it to sorting 2, 3, 4, or 5 integers. Such functionality is at the heart of any sort algorithm and as such, is used a gazillion times over and over again in any sorting task involving a long string of items. I think AlphaDev used a form of on-policy RL but can’t be sure.

Looking at the X86 basic instruction cheat sheet, there’s over 30 basic forms for X86 instructions which are then multiplied by type of data (registers, memory, constants, etc. and length of operands) being manipulated.

AlphaDev only used 4 (ok, 9 if you include the conditionals for conditional move and conditional jump) X86 instructions. The instructions were mov<A,B>, cmovX<A,B>, cmp<A,B> and jX<A,B> (where X identify the condition under which a conditional move [cmovX] or jump [jX] would take place). And they only used (full, 64 bit) integers in registers and memory locations.
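To get a feel for how much can be done with just those instructions, here is a toy interpreter and a hand-written branchless sort of 2 integers. It is an illustration of mine, not a program AlphaDev discovered: only mov, cmp and the “greater than” form of cmov are modeled, operands are (kind, index) pairs for memory or registers, and I assume mov<A,B> copies A into B.

```python
def run(program, memory):
    """Execute a toy mov/cmp/cmovg program; returns the final memory contents."""
    mem = list(memory)
    regs = {}
    greater = False                      # flag set by cmp, read by cmovg

    def read(loc):
        kind, index = loc
        return mem[index] if kind == "M" else regs[index]

    def write(loc, value):
        kind, index = loc
        if kind == "M":
            mem[index] = value
        else:
            regs[index] = value

    for opcode, a, b in program:
        if opcode == "mov":              # mov<A,B>: copy A into B
            write(b, read(a))
        elif opcode == "cmp":            # cmp<A,B>: remember whether A > B
            greater = read(a) > read(b)
        elif opcode == "cmovg":          # cmovg<A,B>: copy A into B if flag set
            if greater:
                write(b, read(a))
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return mem

M0, M1 = ("M", 0), ("M", 1)
R0, R1, R2 = ("R", 0), ("R", 1), ("R", 2)

SORT2 = [
    ("mov", M0, R0),      # load first value
    ("mov", M1, R1),      # load second value
    ("mov", R0, R2),      # keep a copy of the first value
    ("cmp", R0, R1),      # flag = first > second
    ("cmovg", R1, R0),    # if so, R0 becomes the smaller value
    ("cmovg", R2, R1),    # and R1 becomes the larger value
    ("mov", R0, M0),      # store the minimum
    ("mov", R1, M1),      # store the maximum
]

print(run(SORT2, [5, 3]))
```

Shaving even one instruction off a sequence like this is the kind of improvement AlphaDev is searching for.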

AlphaDev actions

The types of actions that AlphaDev could take included the following:

Add transformation – which added an instruction to the end of the current program
Swap transformation – which swapped two instructions in the current program
Opcode transformation – which changed the opcode (e.g., instruction such as mov to cmp) of a step in the current program
Operand transformation – which changed the operand(s) for an instruction in the current program
Instruction transformation – which changed the opcode and operand(s) for some instruction in the current program.
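The five transformations above can be sketched as pure functions over a program held as a list of (opcode, operand A, operand B) tuples. The representation and function names are hypothetical, chosen for illustration, and not taken from the paper.

```python
def add(program, instruction):
    """Add transformation: append an instruction to the end of the program."""
    return program + [instruction]

def swap(program, i, j):
    """Swap transformation: exchange the instructions at positions i and j."""
    new = list(program)
    new[i], new[j] = new[j], new[i]
    return new

def change_opcode(program, i, opcode):
    """Opcode transformation: keep the operands, replace the opcode."""
    new = list(program)
    _, a, b = new[i]
    new[i] = (opcode, a, b)
    return new

def change_operands(program, i, a, b):
    """Operand transformation: keep the opcode, replace the operands."""
    new = list(program)
    opcode, _, _ = new[i]
    new[i] = (opcode, a, b)
    return new

def change_instruction(program, i, instruction):
    """Instruction transformation: replace opcode and operands together."""
    new = list(program)
    new[i] = instruction
    return new

base = [("mov", "M0", "R0"), ("cmp", "R0", "R1")]
print(add(base, ("mov", "R0", "M0")))
print(swap(base, 0, 1))
print(change_opcode(base, 0, "cmp"))
```

Each function returns a new list and leaves the input untouched, which makes it easy to explore several candidate edits from the same starting program.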

They list in their paper a correctness cost function which at each transformation provides a value function (I think) for the RL policy. They experimented with 3 different functions which were: 1) the % correctly placed items; 2) square_root(% correctly placed); and 3) the square_root(number of items – number correctly placed). They discovered that the last worked best.
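Written out as code, the three candidate functions look like the following. This is my reading of the description above: “correctly placed” is taken to mean an output item that matches the sorted result at the same position, and the percentage is expressed as a fraction between 0 and 1.

```python
import math

def correctly_placed(output, expected):
    """Count positions where the program's output matches the sorted result."""
    return sum(1 for got, want in zip(output, expected) if got == want)

def fraction_correct(output, expected):
    """Option 1: fraction of items in the right place."""
    return correctly_placed(output, expected) / len(expected)

def sqrt_fraction_correct(output, expected):
    """Option 2: square root of the fraction in the right place."""
    return math.sqrt(fraction_correct(output, expected))

def sqrt_misplaced(output, expected):
    """Option 3: square root of the number of items not in the right place."""
    return math.sqrt(len(expected) - correctly_placed(output, expected))

output = [1, 3, 2, 4]                  # a program that got two items wrong
expected = sorted(output)

print(fraction_correct(output, expected))
print(round(sqrt_fraction_correct(output, expected), 3))
print(round(sqrt_misplaced(output, expected), 3))
```

Note that the first two go up as the program improves while the third goes down to zero when the output is fully sorted, so it behaves as a cost rather than a score.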

They also placed some constraints on the code generated (called action pruning rules):

Memory locations are always read in incremental order
Registers are allocated in incremental order
Program cannot compare or conditionally move to memory location
Program can only read and write to each memory location once (it seems this would tell the RL algorithm when to end the program)
Program can not perform two consecutive compare instructions
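Rules like these are cheap to check before a candidate program is ever run. The sketch below covers only two of them (no compare or conditional move to memory, and no two consecutive compares) and reflects my interpretation of the wording: a cmp with any memory operand, or a cmov whose destination is memory, counts as a violation. Operand names such as "M0" and "R0" are a hypothetical encoding.

```python
def violates_pruning(program):
    """Return True if the program breaks either of the two rules checked here."""
    previous_opcode = None
    for opcode, a, b in program:
        if opcode == "cmp":
            if previous_opcode == "cmp":
                return True                      # two consecutive compares
            if a.startswith("M") or b.startswith("M"):
                return True                      # compare touching memory
        if opcode.startswith("cmov") and b.startswith("M"):
            return True                          # conditional move into memory
        previous_opcode = opcode
    return False

ok_program = [("mov", "M0", "R0"), ("cmp", "R0", "R1"), ("cmovg", "R1", "R0")]
bad_program = [("cmp", "R0", "R1"), ("cmp", "R1", "R2")]

print(violates_pruning(ok_program), violates_pruning(bad_program))
```

Pruning of this sort shrinks the action space the RL agent has to explore, since whole families of candidate edits can be rejected without evaluating them.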

AlphaDev states

How they determined the state of the program during each transformation was also different. They used one hot encodings (essentially a bit in a bit map is assigned to every instruction-operand pair) for opcode-operand steps in the current program and appended each encoded step into a single program string. Ditto for the state of the memory and registers (at each instruction presumably?). Both the instruction list and memory-register embeddings are then fed into a state representation encoder.

This state “representation network” (DNN) generated a “latent representation of the State(t)” (maybe it classified the state into one of N classes). For each latent state (classification), there is another “prediction network” (DNN) that predicts the expected return value (presumably trained on correctness cost function above) for each state action. And between the state and expected return values AlphaDev created a (RL) policy to select the next action to perform.

Presumably they started with current basic CS sort algorithms, and 2-5 random integers in memory and fed this (properly encoded and embedded) in as a starting point. Then the AlphaDev algorithm went to work to improve it.

Do this enough times, with an intelligent approach between exploration (more randomly at first) and policy following (more use of policy later) selection of actions and you too can generate new sorting algorithms.

DeepMind also spent time creating a stochastic solution to sorting that they used to compare against their AlphaDev DRL approach to see which did better. In the end they found the AlphaDev DRL approach worked faster and better than the stochastic solutions they tried.

DeepMind having conquered sorting did the same for hashing.

Why I think DeepMind’s AlphaDev is better

AlphaDev’s approach could just as easily be applied to any of the algorithms in Donald E. Knuth’s 4 volume series, The Art of Computer Programming.

I believe DeepMind’s approach is much more valuable to programmers (and humanity) than CoPilot, ChatGPT code, AlphaCode (DeepMind’s other code generator) or any other code generation transformers.

IMHO AlphaDev goes to the essence of computer science as it’s been practiced over the last 70 years. Here’s what we know, now let’s try to discover a better way to do the work we all have to do. Once we have discovered a new and better way, report and document it as widely as possible so that any programmer can stand on our shoulders and use our work to do what they need to get done.

If I’m going to apply AI to coding, having it generate better basic CS algorithms is much more fruitful for the programming industry (and I may add, humanity as a whole) than having it generate yet another iOS app or web site from scratch.

Comments?

Picture Credit(s):

All graphics in this post have been taken from the Nature article and its appendices, see: Faster sorting algorithms discovered using deep reinforcement learning

Steam Locomotive lessons for disk vs. SSD

Posted on March 31, 2023 by Ray in Data density, Market dynamics, Strategic Inflection Points, System effectiveness

Read a PHYS ORG article on Extinction of Steam Locomotives derails assumption about biological evolution… which was reporting on a Royal Society research paper The end of the line: competitive exclusion & the extinction… that looked at the historical record of steam locomotives since their inception in the early 19th century until their demise in the mid 20th century. Reading the article it seems to me to have a wider applicability than just to evolutionary extinction dynamics and in fact similar analysis could reveal some secrets of technological extinction.

Steam locomotives

During its 150 years of production, many competitive technologies emerged starting with electric locomotives, followed by automobiles & trucks and finally, the diesel locomotive.

The researchers selected a single metric to track the evolution (or fitness) of the steam locomotive called tractive effort (TE) or the weight a steam locomotive could move. Early on, steam locomotives hauled both passengers and freight. The researchers included automobiles and trucks as competitive technologies because they do offer a way to move people and freight. The diesel locomotive was a more obvious competitor.

The dark line is a linear regression trend line on the wavy mean TE line, the boxes are the interquartile (25%-75%) range, the line within the boxes the median TE value, and the shaded areas 95% confidence interval for trend line of the steam locomotives TE that were produced that year. Raw data from Locobase, a steam locomotives database

One can see three phases in the graph. During the red phase, from 1829-1881, there was unencumbered growth of TE for steam locomotives. But in 1881, electric locomotives were introduced, which corresponds to the start of the blue phase, and after WW II the black phase led to the demise of steam.

Here (in the blue phase) we see a phenomenon often seen with the introduction of competitive technologies, there seems to be an increase in innovation as the multiple technologies duke it out in the ecosystem.

Automobiles and trucks were introduced in 1901 but they don’t seem to impact steam locomotive TE. Possibly this is because the passenger and freight volume hauled by cars and trucks wasn’t that significant. Or maybe their impact was more on the distances hauled.

In 1925 diesel locomotives were introduced. Again we don’t see an immediate change in trend values but over time this seemed to be the death knell of the steam locomotive.

The researchers identified four aspects to the tracking of inter-species competition:

A functional trait within the competitive species can be identified and tracked. For the steam locomotive this was TE.
Direct competitors for the species can be identified that coexist within spatial, temporal and resource requirements. For the steam locomotive, these were autos/trucks and electric/diesel locomotives.
A complete time series for the species/clade (group of related organisms) can be identified. This was supplied by Locobase.
Non-competitive factors don’t apply or are irrelevant. There are plenty here, including most of the items listed on their chart.

From locomotives to storage

I’m not saying that disk is akin to steam locomotives while flash is akin to diesel but maybe. For example one could consider storage capacity as similar to locomotive TE. There’s a plethora of other factors that one could track over time but this one factor was relevant at the start and is still relevant today. What we in the industry lack is any true tracking of capacities produced between the birth of the disk drive in 1956 (according to Wikipedia’s History of hard disk drives article) and today.

But I’d venture to say the mean capacity has been trending up and the variance in that capacity has been static for years (based on more platter counts rather than anything else).

There are plenty of other factors that could be tracked for example areal density or $/GB.

Here’s a chart, comparing areal (2D) density growth of flash, disk and tape media between 2008 and 2018. Note both this chart and the following charts are Log charts.

Over the last 5 years NAND has gone 3D. Current NAND chips in production have 300+ layers. Disks went 3D back in the 1960s or earlier. And of course tape has always been 3D, as it’s a ribbon wrapped around reels within a cartridge.

So areal density plays a critical role but it’s only 2 of the 3 dimensions that determine capacity. The areal density crossover point between HDD and NAND in 2013 seems significant to me and perhaps to the history of disk.

Here’s another chart showing the history of $/GB of these technologies

In this chart they are comparing price/GB of the various technologies (presumably the most economical available during that year). HDDs between 2008-2010 were on a 40%/year reduction trend in $/GB, then flat lined and now appear to be on a 20%/year reduction trend. Flash during 2008-2017 was on a 25% reduction in $/GB, which flatlined in 2018. LTO Tape had been on a 25%/year reduction from 2008 through 2014 and since then has been on an 11% reduction.
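The arithmetic behind these %/year trends is just compounding. Here's a rough Python sketch. The function names are my own, and the starting $/GB prices in the example are placeholders I made up for illustration, not numbers from the chart. Only the 25% and 20% rates come from the text above, and I'm treating both as per-year rates.

```python
def project_price(start_price, annual_reduction, years):
    """Price per GB after compounding a fixed yearly reduction (0.20 means 20%/year)."""
    return start_price * (1.0 - annual_reduction) ** years


def years_to_crossover(price_a, rate_a, price_b, rate_b, max_years=100):
    """First whole year in which technology A is no more expensive than B, else None."""
    for year in range(max_years + 1):
        a = project_price(price_a, rate_a, year)
        b = project_price(price_b, rate_b, year)
        if a <= b:
            return year
    return None


# Placeholder starting prices (assumptions): flash at $0.10/GB, disk at $0.02/GB.
crossover = years_to_crossover(0.10, 0.25, 0.02, 0.20)
```

With rates only 5 points apart, a 5X price gap takes a long time to close, which is why the "big if" on trends continuing matters so much.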

If these $/GB trends continue, a big if, flash will overcome disk in $/GB and tape over time.

But here’s something on just capacity which seems closer to the TE chart for steam locomotives.

There’s some dispute regarding this chart as it only reflects drives available for retail and drives with higher capacities were not always available there. Nonetheless it shows a couple of interesting items. Early on, up to ~1990, drive capacities were relatively stagnant. From 1995-2010 there was a significant increase in drive capacity and since 2010, drive capacities seem to have stopped increasing as much. We presume the number of x’s for a typical year shows different drive capacities available for retail sales, sort of similar to the box plots on the TE chart above.

SSDs were first created in the early 90’s, but the first 1TB SSD came out around 2010. Since then the number of disk drives offered for retail (as depicted by Xs on the chart each year) seems to have declined and their range in capacity (other than ~2016) seems to have declined significantly.

If I take the lessons from the Steam Locomotive to heart here, one would have to say that the HDD has been forced to adapt to a smaller market than it had prior to 2010. And if areal density trends are any indication, it would seem that R&D efforts to increase capacity have declined or we have reached some physical barrier with today’s media-head technologies. Although such physical barriers have always been surpassed after new technologies emerged.

What we really need is something akin to the Locobase for disk drives. That would track all disk drives sold during each year and that way we can truly see something similar to the chart tracking TE for steam locomotives. And this would allow us to see if the end of HDD is nigh or not.

Final thoughts on technology Extinction dynamics

The Royal Society research had a lot to say about the dynamics of technology competition. And they had other charts in their report but I found this one very interesting.

This shows an abstract analysis of Steam Locomotive data. They identify 3 zones of technology life. The safe zone where the technology has no direct competition. The danger zone where competition has emerged but has not conquered all of the technology’s niches. And the extinction zone where competing technology has entered every niche in which the original technology existed.
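The three zones boil down to a simple set comparison between the niches a technology occupies and the niches its competitors have entered. Here's a small Python sketch of that reading. The function name and the niche labels are my own shorthand (assumptions), not anything defined in the Royal Society paper.

```python
def classify_zone(niches, competitor_niches):
    """Classify a technology by how many of its niches competitors have entered."""
    niches = set(niches)
    contested = niches & set(competitor_niches)
    if not contested:
        return "safe"        # no direct competition in any niche
    if contested == niches:
        return "extinction"  # competitors present in every niche
    return "danger"          # competition has emerged, but not everywhere


# Hypothetical niche labels for enterprise disk.
disk_niches = {"high perf/low cap", "medium perf/medium cap", "low perf/high cap"}

zone_before_flash = classify_zone(disk_niches, set())
zone_partial = classify_zone(disk_niches, {"high perf/low cap"})
zone_all = classify_zone(disk_niches, disk_niches)
```

Note that the zone only says where competitors are present, it says nothing about how much share they have taken in each niche.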

In the late 90s, enterprise disk supported high performance/low capacity, medium performance/medium capacity and low performance/high capacity drives. Since then, SSDs have pretty much conquered the high performance/low capacity disk segment. And with the advent of QLC and PLC (4 and 5 bits per cell) using multi-layer NAND chips, SSDs seem poised to conquer the low performance/high capacity niche. And there are plenty of SSDs using MLC/TLC (2 or 3 bits per cell) with multi-layer NAND to attack the medium performance/medium capacity disk market.

There were also very small disk drives at one point which seem to have been overtaken by M.2 flash.

On the other hand, just over 95% of all disk and flash storage capacity being produced today is disk capacity. So even though disk is clearly in the extinction zone with respect to flash storage, it seems to still be doing well.

It would be wonderful to have a similar analysis done on transistors vs vacuum tubes, jet vs propeller propulsion, CRT vs. LED screens, etc. Maybe at some point with enough studies we could have a theory of technological extinction that can better explain the dynamics impacting the storage and other industries today.

Comments?

Photo Credit(s):

Chart and caption from The end of the line: competitive exclusion and the extinction of historical entities paper, Figure 1
Chart from CERN Storage Market Technology and Markets Status and Evolution presentation, slide 11
Chart from CERN Storage Market Technology and Markets Status and Evolution presentation, slide 11
Hard disk capacity between 1980 and present (2011), based on for-retail products. For data, data source, and discussion, see Talk page on Commons. Hankwang 17:00, 2 March 2008 (UTC), update 20:38, 18 September 2011 (UTC).
Chart and caption from The end of the line: competitive exclusion and the extinction of historical entities paper, Figure 3

BEHAVIOR, an in-home robot, benchmark

Posted on November 19, 2021 by Ray in Artificial Intelligence, Machine Learning, R&D measures, Robots, Scenario planning, Strategic Inflection Points, System effectiveness

As my readers probably already know, I’m a long time benchmark geek. So when I recently read an article out of Stanford (AI Experts Establish the “North Star” for Domestic Robotics Field) where a research team there developed a new robotic benchmark, I was interested. The new robotics benchmark is called BEHAVIOR which was documented in an ARXIV.org article (see: BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and ecOlogical enviRonments). It essentially uses real world data to identify domestic work activities that any robot would need to perform in a home.

The problems with robot benchmarks

The problems with benchmarks are multi-faceted:

How realistic are the workloads used to evaluate the systems being measured?
How accurate are the metrics used to rank and judge benchmark submissions?
How costly/complex is it to run a benchmark?
How are submissions audited and are they reproducible?
Where are benchmark results reported and are they public?

And of course robotics brings in its own issues that make benchmarking more difficult:

What sensors does the robot have to understand how to complete tasks?
What manipulators does the robot have to perform the tasks required of it?
Do the robots move in the environment and if so, how do the robots move?
Does the robot perform the task in the real world or in a simulated environment?

And of course, when using a simulated environment, how realistic is it?

BEHAVIOR with iGibson (see below) seems to answer many of these concerns for in-home robot benchmarking.

What is BEHAVIOR?

First, BEHAVIOR’s home making tasks were selected from an American Time Use Survey maintained by the USA Bureau of Labor Statistics which identifies tasks Americans perform in their homes. With BEHAVIOR 1.0 there are 100 tasks ranging from building a fruit basket to cleaning a toilet, and just about everything in between. I didn’t see any cooking or mixing drinks tasks but maybe those will be added.

Second, BEHAVIOR uses a predicate logic, called BDDL (BEHAVIOR Domain Definition Language) to define initial conditions for tasks such as tables, chairs, books, etc located in the room, where objects need to be placed, and successful completion goals or what task completion should look like.

BEHAVIOR uses 15 different rooms or scenes in their benchmark, such as a kitchen, garage, study, etc. Each of the 100 tasks are performed in a specific room.

BEHAVIOR incorporates 1217 different objects in 391 categories. Once initial conditions are defined for a task, BEHAVIOR essentially randomly selects different objects for the task and randomly locates them throughout the room.
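As a rough illustration of that random selection and placement, here's a Python sketch. Everything in it is an assumption on my part: the catalog, the category and model names, the rectangular room and the instantiate_task function are all hypothetical. The real benchmark samples against the task's BDDL initial conditions (for example, a book on a table), which this sketch ignores.

```python
import random


def instantiate_task(required_categories, catalog, room_size, rng):
    """Pick one object model per required category and drop it at a random (x, y)."""
    width, depth = room_size
    placements = []
    for category in required_categories:
        model = rng.choice(catalog[category])        # random object within the category
        position = (rng.uniform(0.0, width), rng.uniform(0.0, depth))
        placements.append((category, model, position))
    return placements


# Hypothetical mini catalog: category -> available object models.
catalog = {
    "apple": ["apple_01", "apple_02"],
    "basket": ["basket_wicker", "basket_wire", "basket_plastic"],
}
placements = instantiate_task(["apple", "apple", "basket"], catalog, (4.0, 3.0),
                              random.Random(42))
```

Running the same task with a different seed gives a different set of objects in different places, which is what keeps a robot from memorizing a single room layout.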

In order to run the benchmark, one could conceivably create a real room, with all the objects and have them placed according to BEHAVIOR BDDL’s randomly assigned locations with a robot physically present in the room and have it perform the assigned task OR one could use a simulation engine and have the robot run the task in the simulation environment, with simulated room, objects and robot.

It appears as if BEHAVIOR could operate in any robotics simulation environment but has been currently implemented in Stanford’s open source robotics simulation engine called iGibson 2.0 (see: iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks and iGibson 2.0 website). iGibson uses the Bullet real time physics engine for realistic physical environment simulation.

A robot operating within iGibson is provided a 3D rendering of the room and objects in images or LIDAR sensor scans. It can then identify the objects that it needs to manipulate to perform the tasks. One can define the robot’s simulated sensors and manipulators in iGibson 2.0. It’s written in Python, is open source (GitHub Repo) and can be installed to run on (Ubuntu 16.04) Linux, Windows (10) or Mac (10.15) systems.

Finally, BEHAVIOR uses a set of metrics to determine how well a robot has performed its assigned task. Their first metric is success score defined as the fraction of goal conditions satisfied by the robot performing the task. Such as the number of dishes properly cleaned and placed in the drying rack divided by the total number of dishes for a “washing dishes” task. And their second metric is a set of efficiency metrics, like time to complete a task, sum total of object distance moved during the task, how well objects are arranged at task completion (is the toilet seat down…), etc.
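The success score is easy to sketch in Python. This is only my reading of the definition above, with goal conditions modeled as plain Python predicates over a dictionary state. The names success_score and dish_done, and the state layout, are hypothetical and not BEHAVIOR's or BDDL's actual API.

```python
def success_score(goal_conditions, state):
    """Fraction of goal conditions satisfied by the final state."""
    if not goal_conditions:
        raise ValueError("a task needs at least one goal condition")
    satisfied = sum(1 for condition in goal_conditions if condition(state))
    return satisfied / len(goal_conditions)


def dish_done(name):
    """Build a goal condition: the named dish is clean and in the drying rack."""
    def check(state):
        dish = state[name]
        return dish["clean"] and dish["location"] == "drying_rack"
    return check


# Hypothetical final state for a "washing dishes" task with four dishes.
final_state = {
    "plate_1": {"clean": True, "location": "drying_rack"},
    "plate_2": {"clean": True, "location": "drying_rack"},
    "bowl_1": {"clean": True, "location": "drying_rack"},
    "cup_1": {"clean": False, "location": "sink"},
}
goals = [dish_done(name) for name in final_state]
score = success_score(goals, final_state)
```

Here three of the four dishes meet their goal condition, so the robot gets partial credit rather than a simple pass/fail.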

Another feature of iGibson 2.0 is that it offers the ability to record a human (in VR) doing a task in its simulated environment. So if your robotic system is able to learn by example, then iGibson could be used to provide training data for an activity.

~~~~

A couple of additions to the BEHAVIOR benchmark/iGibson simulation environment that I would like to see:

There ought to be a way to construct a house/apartment where multiple rooms are arranged in a hierarchy, i.e., rooms associated with floors with connections using hallways, doors, stairs, etc. between them. This way one could conceivably define a set of homes/apartments (let’s say 5) that a robot would perform its tasks in.
They need a task list to drive robot activities. Assume that there’s some amount of time let’s say 8-12 hours that a robot is active and construct a series of tasks that need to be accomplished during that period.
Robots should be placed in the rooms/apartments/homes at random with random orientation and then they would have to navigate through rooms/passageways to the rooms to perform the tasks.
They need to add pet/human avatars in the rooms throughout a home. These would represent real time obstacles to task completion/navigation as well as add more tasks associated with caring for pets/humans.
They need the ability to add non-home rooms that could encompass factory floors, emergency response debris fields, grocery stores, etc. and their own unique set of tasks for each of these so that it could be used as a benchmark for more than just domestic robots.

Aside from the above additions to BEHAVIOR/iGibson 2.0, there’s the question of the organization that manages the benchmark and submissions. There needs to be a website/place to publish benchmark results for a robot AND a mechanism to audit results for accuracy to ensure fair play.

Typically this would be associated with an organization responsible for publishing and auditing submissions as well as guiding further development of BEHAVIOR/iGibson 2.0. BEHAVIOR 1.0 is not the end but it’s a great start at providing realistic tasks that any domestic robot would need to perform.

Benchmarks have always aided the development and assessment of new technologies. Having an in-home robot benchmark like BEHAVIOR makes getting domestic robots that do what we want them to do a more likely possibility someday.

There’s a new benchmark in town and it signals the dawning of the domestic robot age.

Photo Credit(s):