Data and code versioning for MLOps

Read an interesting article (Ex-Apple engineers raise … data storage startup) and research paper (Git is for data) about a group of ML engineers from Apple forming a new “data storage” startup targeted at MLOps teams just like Apple’s. It turns out that MLOps has some unique data requirements that go well beyond just data storage.

The paper discusses some of the unusual data requirements for MLOps such as:

  • Infrequent updates – yes, there are some MLOps datasets where updates are streamed in, but the vast majority of MLOps datasets are updated on a slower cadence. The authors think monthly works for most MLOps teams.
  • Small changes/lots of copies – The changes to MLOps data are relatively small compared to the overall dataset size and usually consist of data additions, record deletions, label updates, etc. But unlike most data, MLOps data are often subsetted or extracted into smaller datasets used for testing, experimentation and other “off-label” activities.
  • Variety of file types – depending on the application domain, MLOps file types range all over the place. But there’s often a lot of CSV files in combination with text, images, audio, and semi-structured data (DICOM, FASTQ, sensor streams, etc.). However, within a single domain, MLOps file types are pretty much all the same.
  • Variety of file directory trees – this is very MLOps team and model dependent. Usually there are train/validate/test splits to every MLOps dataset, but what’s underneath each of these can vary a lot and needs to be user customizable.
  • Pre-processing – data often needs to be cleansed and made into something more usable by ML models.
  • Code and data must co-evolve over time – as data changes, the code that uses it changes. Adding more data may not cause changes to code, but models are constantly under scrutiny to improve performance, accuracy or remove biases. Bias elimination often requires data changes but code changes may also be needed.

It’s that last requirement, that MLOps data and code must co-evolve and thus need to be versioned together, that’s most unusual. Data-code co-evolution is needed for reproducibility, rollback and QA, among other reasons.

In the paper they show a typical MLOps data pipeline.

Versioning can also provide data (and code) provenance, identifying the origin of data (and code). MLOps teams undergoing continuous integration need to know where data and code came from and who changed them. And as most MLOps teams collaborate in the development, they also need a way to identify data and code conflicts when multiple changes occur to the same artifact.

Source version control

Code has had this versioning problem forever and the solution became revision control systems (RCS) or source version control (SVC) systems. The most popular solutions for code RCS are Git (software) and GitHub (SaaS). Both provide repositories and source code version control (clone, checkout, diff, add/merge, commit, etc.) as well as a number of other features that enable teams of developers to collaborate on code development.

The only thing holding Git/GitHub back from being the answer to MLOps data and code version control is that they don’t handle large (>1MB) files very well.

The solution seems to be adding better data handling capabilities to Git or GitHub. And that’s what XetHub has created for Git.

XetHub’s “Git is for data” paper (see link above) explains in detail how they provide a better data layer for Git, but it boils down to using Git for code versioning and as a metadata database for their deduplicating data store. They use Merkle trees to track the chunks of data in a deduped dataset.

How XetHub works

XetHub supports variable-size chunking (for dedupe) in their data store. This allows them to use relatively small files checked into Git as the metadata that points to the current (and all previous) versions of data files checked into the system.

Their mean chunk size is ~4KB. Data chunks are stored in their data store. But the manifest for dataset versions is effectively stored in the Git repository.

The paper shows how using a deduplicated data store can support data versioning.

XetHub uses a content addressable store (CAS) to store the file data chunk(s) as objects or BLOBs. The key to getting good IO performance out of such a system is to have small chunks but large objects.
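To make the chunk-and-dedupe idea concrete, here is a minimal sketch of content-defined (variable-size) chunking feeding a toy content-addressable store. This is not XetHub's code: the gear-style rolling hash, the MIN_CHUNK/MAX_CHUNK limits, the ChunkStore class and all the names are assumptions made up for illustration. The only thing borrowed from the post is the idea of small, variable-size chunks keyed by their content.

```python
import hashlib
import random

# Hypothetical tuning values, for illustration only.
MASK = (1 << 12) - 1   # cut when the low 12 hash bits are zero (about 1 in 4096 bytes)
MIN_CHUNK = 1024       # avoid producing tiny chunks
MAX_CHUNK = 16384      # force a cut if no boundary shows up

# Fixed table of random 32-bit values, one per byte value (a "gear" hash table).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list:
    """Split data at boundaries chosen by the content, not by fixed offsets."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Rolling hash: the low 12 bits depend only on the last 12 bytes seen.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


class ChunkStore:
    """Toy content-addressable store: each chunk is keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}

    def put_file(self, data: bytes) -> list:
        """Store a file version; return its manifest (ordered list of chunk keys)."""
        manifest = []
        for c in chunk(data):
            key = hashlib.sha256(c).hexdigest()
            self.blobs.setdefault(key, c)  # identical chunks are stored only once
            manifest.append(key)
        return manifest

    def get_file(self, manifest: list) -> bytes:
        """Rebuild a file version from its manifest."""
        return b"".join(self.blobs[k] for k in manifest)

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.blobs.values())
```

In this sketch the manifest plays the role of the small metadata that would live in Git, while the blobs dictionary stands in for the data store. Because boundaries follow the content, inserting a few bytes in the middle of a file should only change the chunks near the insertion, so a second version mostly reuses chunks already stored.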

They map data chunks to files using a CDMT (content-defined Merkle tree[s]). Each chunk of data resides in at least two different CDMTs, one associated with the file version and the other associated with the data storage elements.
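For readers who haven't run into Merkle trees, here is a minimal sketch of folding a list of chunk hashes into a single root hash. It uses a plain fixed-fanout tree, which is a simplification and not the content-defined Merkle tree the paper describes (there, as the name suggests, the grouping of children is driven by content rather than by a fixed count). The function names and the fanout parameter are assumptions for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA-256 digest of a byte string."""
    return hashlib.sha256(data).digest()


def merkle_root(chunk_hashes, fanout=2):
    """Fold a list of chunk hashes into one root hash (fixed-fanout sketch)."""
    if not chunk_hashes:
        return h(b"")
    level = list(chunk_hashes)
    while len(level) > 1:
        # Each parent is the hash of its children's hashes, concatenated in order.
        level = [
            h(b"".join(level[i:i + fanout]))
            for i in range(0, len(level), fanout)
        ]
    return level[0]


# Two versions of a "file" that differ in a single chunk.
version_1 = [h(b"chunk-a"), h(b"chunk-b"), h(b"chunk-c")]
version_2 = [h(b"chunk-a"), h(b"chunk-B"), h(b"chunk-c")]
roots_differ = merkle_root(version_1) != merkle_root(version_2)  # True
```

The useful property is that the root changes whenever any chunk changes, while unchanged subtrees keep their hashes, so two versions can be compared (or shared) without walking all the data.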

XetHub’s variable chunking uses a statistical approach and multiple checksums, but they also offer one specialized, file-type-specific chunking method for CSV files. As it is, even with their general purpose variable chunking method, they can offer a ~9X dedupe ratio for text data (embeddings).

They end up using Git commands for code and data but provide hooks (Git filters) to support data cloning, add/checkin, commits, etc. So they can take advantage of all the capabilities of Git that have grown up over the years to support collaborative code development but use these for data as well as code.

In addition to normal Git services for code and data, XetHub also offers a read-only, NFSv3 file system interface to XetHub datasets. Doing this eliminates having to reconstitute and copy TBs of data from their code-data repo to user workstations. With NFSv3 front end access to XetHub data, users can easily incorporate data access for experimentation, testing and other uses.

Results from using XetHub

XetHub showed some benchmarks comparing their solution to Git LFS, another Git-based large data storage solution. For their benchmark, they used CORD-19 (see the arXiv paper and the Kaggle CORD-19 dataset), which is a corpus of all COVID-19 papers since COVID started. The corpus is updated daily and released periodically, and they used the last 50 versions (up to June 2022) of the research corpus for their benchmark.

Each version of the CORD-19 corpus consists of JSON files (research reports, up to 700K each) and 2 large CSV files, one with paper information and the other with paper (word?) embeddings (a more usable version of the paper text/tables used for ML modeling).

For CORD-19, XetHub is able to store all 2.45TB of research reports and CSV files in only 287GB of Git (metadata) and datastore data, for a dedupe factor of 8.7X. With XetHub’s specialized CSV chunking (Xet w/ CSV chunking above), the 50 CORD-19 versions can be stored in 87GB, for a 28.8X dedupe ratio. And of that 87GB, only 82GB is data and the remaining ~5GB is metadata (of which 1.7GB is the Merkle tree).
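As a quick sanity check on those ratios, here is the arithmetic as a small sketch. The dedupe_ratio function and GB_PER_TB constant are hypothetical names of mine, and treating 1TB as 1024GB is an assumption, though it is the reading that reproduces the quoted 8.7X and 28.8X (with 1000GB per TB the first ratio would come out nearer 8.5X).

```python
GB_PER_TB = 1024  # assumption: binary units


def dedupe_ratio(logical_gb: float, stored_gb: float) -> float:
    """Logical (un-deduped) size divided by the size actually stored."""
    return logical_gb / stored_gb


logical_gb = 2.45 * GB_PER_TB  # all 50 CORD-19 versions, un-deduped

general = round(dedupe_ratio(logical_gb, 287), 1)   # 8.7
with_csv = round(dedupe_ratio(logical_gb, 87), 1)   # 28.8
```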

In the paper, they also showed the cost of branching this data by extracting and adding one version which consisted of a 75-25% (random) split of a version. This split was accomplished by changing only the two CSV files (paper metadata and paper word embeddings). Adding this single split version to their code-data repository/datastore only took an additional 11GB of space. An aligned split (only partitioning on a CSV record boundary, unclear but presumably with CSV chunking) only added 185KB.

XetHub potential enhancements

XetHub envisions many enhancements to their solution, including adding other file-type-specific chunking strategies, adding a “time series” view to their NFS frontend to view code/data versions over time, finer granularity data provenance (at the record level rather than at the change level), and RW NFS access to data. Further, XetHub’s dedupe metadata (in the Git repo) only grows over time; supporting updates and deletes to dedupe metadata would help reduce data requirements.

Read the paper to find out more.


NVIDIA H100 vs. A100 GPUs in MLPerf Training

NVIDIA recently released some “Preview” results for MLPerf Data Center Training v2.1 (most recent results as of 28 Nov 2022) benchmarks. We analyzed these results to determine how much faster the H100 was vs. their A100 GPU.

Note, NVIDIA submitted 3 series of Preview benchmarks using the H100-SXM5-80GB GPUs for training, which included an 8 GPU system, a 24 GPU system, and a 32 GPU DGXH100 system.

We have previously reported similar analysis for MLPerf Inferencing results (see: NVIDIA’s H100 vs A100… blog post).

From NVIDIA H100 Announcement Information

In their announcement, NVIDIA showed anywhere from 3-6X TFLOPS speedup with much faster throughput. MLPerf currently doesn’t report the FP resolution used to perform their benchmarks, but in MLPerf’s arXiv paper they seem to be using FP32, which we assume is equivalent to TF32 in the above chart, so the H100 should, on average, be performing 3X faster.

Actual or normalized results for comparisons

Of the eight MLPerf v2.1 Data Center Training workloads, it appears that the H100 actual results are faster than the A100 GPUs in 5 of the benchmarks and slower in the remaining 3, Speech Recognition (LibriSpeech RNN-T), Recommendation Engine (1TB Clickthrough DLRM) and Reinforcement Learning (MiniGo).

The challenge with using the actual results or absolute minutes to train from the benchmarks is that submission results aren’t all using the same hardware configurations.

For example, in the Speech Recognition benchmark results, the current best training time (2.1 minutes) was achieved by NVIDIA DGXA100 systems with 384 (64 core AMD 7742) CPUs and 1536 (A100-SXM4-80GB) GPUs. The nearest H100 Preview submission, which would have come in 4th in absolute time to train (7.5 minutes), used 8 (56 core Intel Xeon) CPUs with 32 (H100-SXM5-80GB) GPUs.

So, to present an apples-to-apples comparison, the charts below show both the actual minutes to train for each system and a normalized time to train, which we calculated by scaling each result to the GPU count of the nearest H100 Preview submission.

A couple of caveats with using normalized numbers:

  • Normalization to 8 or 32 GPUs assumes the systems in question would have absolute linear performance scaling both up (for actual results with fewer GPUs) and down (for actual results with more GPUs).
  • Normalization to 8 or 32 GPUs doesn’t factor in the differences in CPU counts, core counts per CPU or CPU power. And in fact, in the H100 previews, NVIDIA (or MLPerf) did not provide a CPU model number, but in their detailed information they did list the Intel Xeon core count as 56.
  • Normalization to 8 or 32 GPUs doesn’t factor in any other speedups like throughput, dedicated AI hardware or other system performance characteristics that are available on the newer (DGXH100) systems.
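For the curious, the linear-scaling assumption in the first caveat can be written as a one-line formula: GPU-minutes are held constant, so time scales inversely with GPU count. The sketch below is one plausible reading of that normalization; the normalized_minutes function is a hypothetical name, and the only inputs are the 2.1 minute, 1536 GPU Speech Recognition result and the 32 GPU target mentioned earlier.

```python
def normalized_minutes(actual_minutes: float, actual_gpus: int, target_gpus: int) -> float:
    """Scale a time-to-train to a different GPU count, assuming perfectly linear scaling."""
    return actual_minutes * actual_gpus / target_gpus


# The 1536 GPU Speech Recognition result, scaled down to a 32 GPU system.
scaled = round(normalized_minutes(2.1, 1536, 32), 1)  # 100.8 minutes
```

Under this assumption a result from a very large cluster looks far slower once scaled down to 32 GPUs, which is why the top absolute finishers don't necessarily lead the normalized ranking.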

However, with respect to GPU and CPU core counts, there were four benchmarks (Speech Recognition, NLP, Object Detection-light weight, and Recommendation engine) which have submissions that come close to the GPU and CPU hardware counts that were used for the H100 Previews.

For three benchmarks comparing against the H100 submission with 32 GPUs, the comparison system was an HPE Proliant system with 8 AMD 7763 64-core CPUs and 32 A100-SXM4-80GB GPUs. And for the one benchmark comparing against the H100 submission with 8 GPUs, the comparison system was an NVIDIA DGXA100 system with 2 AMD EPYC 7742 (64 core) CPUs and 8 A100-SXM4-80GB GPUs.

Note, the A100 comparison systems still had more CPU cores: the HPE systems had 64 more for the 32 GPU comparisons, and the NVIDIA DGXA100 had 16 more for the lone 8 GPU comparison.

So, our comparisons are still not perfect and if anything should show the H100 in its worst light due to not having as much CPU compute power. On the other hand, the DGXH100 and the H100 GPU have a lot more bandwidth and the H100 GPU has additional specialized dedicated logic for AI operations. No telling how much these other hardware differences would matter to the various MLPerf training workloads. But these comparisons are as close as the data allows.

The comparisons

First up Speech Recognition:

Lower is better in training time results (metric measured is minutes to train to NN level of accuracy). And the results on this chart are sorted by the 32 GPU normalized training times. The actual published results are shown in Blue and the 32 GPU normalized results in Orange.

As we can see here, even with normalization for all the other results, the H100 preview still doesn’t come out on top (7.487 min vs. 7.534) but it doesn’t lose by much. Also, one can see the current #1 for this benchmark in actual minutes to train is shown by the last column(s), which is an NVIDIA DGXA100 running 384 AMD EPYC 7742 (64 core) CPUs with 1536 A100-SXM4-80GB GPUs, which trained in around 2 minutes.

I’ve taken the liberty of highlighting, in light blue boxes, the best comparison system to the H100 preview results (DGXH100 with 32 H100 GPUs), which was the HPE (Proliant) system with 8 AMD EPYC 7763 (64-core) CPUs and 32 A100-SXM4-80GB GPUs. In this Speech Recognition benchmark the H100 GPUs are 1.63X faster than the A100 GPUs.

Next up Object Detection-Lightweight:

As above, smaller is better, the chart is sorted by the 32 GPU normalized results, blue bars are the actual reported results and orange bars are the 32 GPU normalized results.

Here we can see that the H100 reported the best training time in both the actual results and the 32 GPU normalized results. Also, like the earlier chart, we are showing the best comparisons we can find in blue boxes, and in this Object Detection-Lightweight benchmark the H100 is 3.80X faster than the A100.

Bottom line


We have analyzed all MLPerf data center training workload top ten results similar to what we show above. As discussed earlier, only four MLPerf workloads had hardware similar to the NVIDIA H100 Preview submissions, three compare well with the 32 GPU H100 submission and 1 compares well with the 8 GPU H100 submission.

The numbers we calculate show that the H100 is 1.63X (Speech recognition), 3.80X (NLP), 1.97X (Object detection-lightweight) and 1.60X (Recommendation engine) faster than the A100, which would say the H100 is, on average, 2.25X faster than the A100 in MLPerf v2.1 Data Center Training results.
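The 2.25X figure is just the arithmetic mean of the four per-benchmark speedups listed above, which a couple of lines can confirm. The dictionary layout and names are mine, for illustration only.

```python
from statistics import mean

# Per-benchmark H100 vs. A100 speedups, as listed in the paragraph above.
speedups = {
    "Speech recognition": 1.63,
    "NLP": 3.80,
    "Object detection-lightweight": 1.97,
    "Recommendation engine": 1.60,
}

average_speedup = round(mean(speedups.values()), 2)  # 2.25
```

An arithmetic mean lets the single largest speedup pull the average up; a geometric mean of the same four numbers would come out somewhat lower.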

Realize the H100 results are “Preview” so there may still be some software (or firmware) speedups that may be applied to improve these numbers. And “Released” hardware & firmware may differ substantially from the “Preview” hardware & firmware.

But given all that, it appears that the H100 is not as fast as announced (2.25X vs. 3X), in MLPerf training workloads, at least not yet [added after publishing, The Eds]

Photo Credit(s):

  • Screen shot of slides presented at GTC Spring 2022
  • Cropped version of above

Fast (hard) or slow (soft) AGI takeoff – AGI Part 6

I was listening to a podcast a couple of weeks back and the person being interviewed made a comment that he didn’t believe that AGI would have a fast (hard) takeoff; rather, it would be slow (soft). (Here’s the podcast: John Carmack interviewed by Lex Fridman.)

Hard vs. soft takeoff

A hard (fast) takeoff implies a relatively quick transition (seconds, hours, days, or months) between AGI levels of intelligence and super AGI levels of intelligence. A soft (slow) takeoff implies it would take a long time (years, decades, centuries) to go from AGI to super AGI.

We’ve been talking about AGI for a while now and if you want to see more about our thoughts on the topic, check out our AGI posts (in most recent order: AGI part 5, part 4, part 3 (ish), part (2), part (1), and part (0)).

The real problem is that many believe that any AGI that reaches super-intelligence will have drastic consequences for the earth and especially, for humanity. However, this is a whole other debate.

The view is that a slow AGI takeoff might (?) allow sufficient time to imbue any and all (super) AGI with enough safeguards to eliminate or minimize any existential threat to humanity and life on earth (see part (1) linked above).

A fast takeoff won’t give humanity enough time to head off this problem and will likely result in a humanity-ending and possibly earth-destroying event.

Hard vs Soft takeoff – the debate

I had always considered AGI would have a hard take off but Carmack seemed to think otherwise. His main reason is that current large transformer models (closest thing to AGI we have at the moment) are massive and take lots of special purpose (GPU/TPU/IPU) compute, lots of other compute and gobs and gobs of data to train on. Unclear what the requirements are to perform inferencing but suffice it to say it should be less.

And once AGI levels of intelligence were achieved, it would take a long time to acquire any additional regular or special purpose hardware, in secret, required to reach super AGI.

So, just to be MECE (mutually exclusive and collectively exhaustive) on the topic, the reasons researchers and others have posited to show that AGI will have a soft takeoff include:

  • AI hardware for training and inferencing AGI is specialized, costly, and acquisition of more will be hard to keep secret and as such, will take a long time to accomplish;
  • AI software algorithmic complexity needed to build better AGI systems is significantly hard (it’s taken 70yrs for humanity to reach today’s much-less-than-AGI intelligent systems) and will become exponentially harder to go beyond AGI level systems. This additional complexity will delay any takeoff;
  • Data availability to train AGI is humongous, hard to gather, find, & annotate properly. Finding good annotated data to go beyond AGI will be hard and will take a long time to obtain;
  • Human government and bureaucracy will slow it down and/or restrict any significant progress made in super AGI;
  • Human evolution took millions of years to go from chimp levels of intelligence to human levels of intelligence, so why would electronic evolution be 6-9 orders of magnitude faster?
  • AGI technology is taking off but the levels of intelligence are relatively minor and specialized today. One could say that modern AI has been really going since the 1990s, so we are 30yrs in and today have almost-good AI chatbots and AI agents that can summarize passages/articles, generate text from prompts or create art works from text. If it takes another 30 yrs to get to AGI, it should provide sufficient time to build in capabilities to limit a super-AGI hard takeoff.

I suppose it’s best to take these one at a time.

  • Hardware acquisition difficulty – I suppose the easiest way for an intelligent agent to acquire additional hardware would be to crack cloud security and just take it. Other ways may be to obtain stolen credit card information and use these to (il)legally purchase more compute. Another approach is to optimize the current AGI algorithms to run better within the same AGI HW envelope, creating super AGI that doesn’t need any more hardware at all.
  • Software complexity growing – There’s no doubt that AGI software will be complex (although the podcast linked to above is sub-titled “AGI software will be simple”). But any sub-AGI agent that can change its code to become better or closer to AGI should be able to figure out how not to stop at AGI levels of intelligence and just continue optimizing until it reaches some wall.
  • Data acquisition/annotation will be hard – I tend to think the internet is the answer to any data limitations that might be present to an AGI agent. Plus, I’ve always questioned if Wikipedia and some select other databases wouldn’t be all an AGI would need to train on to attain super AGI. Current transformer models are trained on Wikipedia dumps and other data scraped from the internet. So there are really two answers to this question: once internet access is available, it’s unclear that there would be need for any more data. And, with the data available to current transformers, it’s unclear that this isn’t already more than enough to reach super AGI.
  • Human bureaucracy will prohibit it – Sadly this is the easiest to defeat. 1) There are rogue governments and actors around the world with more than sufficient resources to do this on their own. And no agency, UN or otherwise, will be able to stop them. 2) Unlike nuclear, the technology to do AI (AGI) is widely available to business and governments, all AI research is widely published (mostly open access nowadays) and if anything, colleges/universities around the world are teaching the next round of AI scientists to take this on. 3) The benefits of being first are significant and are driving a weapons (AGI) race between organizations, companies, and countries to be first to get there.
  • Human evolution took millions of years, why would electronic be 6-9 orders of magnitude faster – electronic computation takes microseconds to nanoseconds to perform operations and humans probably 0.1 sec, or so. Electronics is already 5 to 8 orders of magnitude faster than humans today. Yes, the human brain is more than one CPU core (each neuron would be considered a computational element). But there are 64-core CPUs/4096-core GPUs out there today and one could probably consider them similar in nature if taken in the aggregate (across a hyperscaler, let’s say). So, just using the speedups above, it should take anywhere from 1/1000 of a year to 1 year to cover the same computational evolution as human evolution covered between the chimp and human and accordingly between AGI and AGIx2 (ish).
  • AGI technology is taking a long time to reach, which should provide sufficient time to build in safeguards – Similar to the discussion on human bureaucracy above, with so many actors taking this on and the advantages of even a single AGI (across clusters of agents) being so significant, my guess is that the desire to be first will obviate any thoughts on putting in safeguards.

Other considerations for super AGI takeoff

Once you have one AGI trained, why wouldn’t some organization, company or country deploy multiple agents? Moreover, inferencing takes orders of magnitude less computational power than training. So with 1/100-1/1000th the infrastructure, one could have a single AGI. But the real question is, wouldn’t 100 or 1000 AGIs represent super intelligence?

Yes and no, 100 humans don’t represent super intelligence and 1000 even less so. But humans have other desires, and it’s unclear that 100 humans super focused on one task wouldn’t represent super intelligence (on that task).


What can be done to slow AGI takeoff today

Barring something on the order of Nuclear Proliferation treaties/protocols, putting all GPUs/TPUs/IPUs on weapons export limitations AND restricting as secret any and all AI research, nothing easily comes to mind. Of course, Nuclear Proliferation isn’t looking that good at the moment, but whatever its current state, it has delayed proliferation over time.

One could spend time and effort slowing technology progress down, such as by reducing next generation CPU/GPU/IPU compute cores, limiting compute speedups, reducing funding for AI research, putting a tax on compute, etc. All of which, if done across the technological landscape and the whole world, could give humanity more time to build in AGI safeguards. But doing so would adversely impact all technological advancement, in healthcare, business, government, etc. And given the proliferation of current technology and the state actors working on increasing capabilities to create more, it would be hard to envision slowing technological advancement down much, if at all.

It’s almost like putting a tax on slide rules or making their granularity larger.

It could be that super AGI would independently perceive itself benignly, and only provide benefit to humanity and the earth. But, my guess is that given the number of bad actors intent on controlling the world, even if this were true, they would try to (re-)direct it to harm segments of humanity/society. And once unleashed, it would be hard to stop.

The only real solution to AGI in bad actor hands is to educate all of humanity to value all humans and to cherish the environment we all live in as sacred. This would eliminate bad actors.

It sounds so naive, but in reality, it’s the only way, I believe, we can truly hope to get through this AGI technological existential crisis.

Just like nuclear, we as a society will keep running into technological existential crises like this. A better, more all-inclusive, more all-embracing, and less combative humanity could help head all of them off.

Comments?


The Hollowing out of enterprise IT

We had a relatively long discussion yesterday, amongst a bunch of independent analysts and one topic that came up was my thesis that enterprise IT is being hollowed out by two forces pulling in opposite directions on their apps. Those forces are the cloud and the edge.

Western part of the abandoned Packard Automotive Plant in Detroit, Michigan. by Albert Duce

Cloud sirens

The siren call of the cloud for business units, developers and modern apps has been present for a long time now. And their call is more omnipresent than Odysseus ever had to deal with.

The cloud’s allure is primarily low-cost, instant infrastructure that just works, a software solution/tool box that’s overflowing, locations close to most major metropolitan areas, and the extreme ease of starting up.

If your app ever hopes to scale to meet customer demand, where else can you go? If your data can literally come in from anywhere, it usually lands on the cloud. And if you have need for modern solutions, tools, frameworks or just about anything the software world can create, there’s nowhere else with more of this than the cloud.

Pre-cloud, all those apps would have run in the enterprise or wouldn’t have run at all. And all that data would have been funneled back into the enterprise.

Not today. The cloud has it all, and its siren call is getting louder every day, ever ready to satisfy every IT desire anyone could possibly have, except for the edge.

The Edge, last bastion for onsite infrastructure

The edge sort of emerged over the last decade or so kind of in stealth mode. Yes there were always pockets of edge, with unique compute or storage needs. For example, video surveillance has been around forever but the real acceleration of edge deployments started over the last decade or so as compute and storage prices came down drastically.

These days, the data being generated is staggering and the compute requirements that go along with all that data are all over the place, from a few ARMv/RISC-V cores to a server farm.

For instance, CERN’s LHC creates a PB of data every second of operation (see IEEE Spectrum article, ML shaking up particle physics too). But they don’t store all that. So they use extensive compute (and ML) to try to only store interesting events.

Seismic ships roam the seas taking images of underground structures, generating gobs of data, some of which is processed on ship and the rest elsewhere. A friend of mine creates RPi enabled devices that measure tank liquid levels deployed in the field.

More recently, smart cars are like a data center on tires, rolling across roads around the world generating more data than you can even imagine. 5G towers are data centers on top of buildings, in farmland, and in cell towers dotting the highways of today. All off the beaten path, and all places where no data center has ever gone before.

In olden days there would have been much less processing done at the edge and more in an enterprise data center. But nowadays, with the advent of relatively cheap computing and storage, data can be pre-processed, compressed and tagged at the edge, and then sent elsewhere for further processing (mostly done in the cloud of course).

IT Vendors at the crossroads

And what does the hollowing out of the enterprise data centers mean for IT server and storage vendors? Mostly, danger lies ahead. Enterprise IT hardware spend will stop growing, if it hasn’t already, and over time, shrink dramatically. It may be hard to see this today, but it’s only a matter of time.

Certainly, all these vendors can become more cloud like, on prem, offering compute and storage as a service, with various payment options to make it easier to consume. And for storage vendors, they can take advantage of their installed base by providing software versions of their systems running in the cloud that allows for easier migration and onboarding to the cloud. The server vendors have no such option. I see all the above as more of a defensive, delaying or holding action.

This is not to say the enterprise data centers will go away. Just like mainframe and tape before them, on prem data centers will exist forever, but will be relegated to smaller and smaller niche markets that won’t grow anymore. But only as long as vendor(s) continue to upgrade technology AND there’s profit to be made.

It’s just that the astronomical growth that’s been happening ever since the middle of last century won’t happen in enterprise hardware anymore.

Long term life for enterprise vendors will be hard(er)

Over the long haul, some server vendors may be able to pivot to the edge. But the diversity of compute hardware there will make it difficult to generate enough volumes to make a decent profit there. However, it’s not to say that there will be 0 profits there, just less. So, when I see a Dell or HPE server, under the hood of my next smart car or inside the guts of my next drone, then and only then, will I see a path forward (or sustained revenue growth) for these guys.

For enterprise storage vendors, their future prospects look bleak in comparison. Despite the data generation and growth at the edge, I don’t see much of a role for them there. The enterprise class features and functionality they have spent decades creating and nurturing aren’t valued as much in the cloud, nor are they presently needed at the edge. Maybe I’m missing something here, but I just don’t see a long term play for them in the cloud or edge.

~~~~

For the record, all this is conjecture on my part. But I have always believed that if you follow where new apps are being created, there you will find a market ready to explode. And where the apps are no longer being created, there you will see a market in the throes of a slow death.


Deepmind does chat

Read an article this week on Deepmind’s latest research into developing a chat agent (Improving alignment of dialogue agents via targeted human judgements). Lots of interesting approaches have been applied to chat but even today, most chat models are rife with problems that include being bigoted, profane, incorrect, etc.

Reinforcement learning vs. deep neural networks in Sparrow Chat

Deepmind specializes in the use of Reinforcement Learning (RL) as applied to mastering Atari, chess and Go games, but they have also been known to use dNNs (deep neural networks) for their AlphaFold and other models. Indeed, Atari and the other game playing work that Deepmind has released has been a hybrid which included dNNs as well as RL models.

Deepmind’s version of chat is currently called Sparrow and it uses models trained with the help of RL with human feedback (RLHF). RL is used to create policy models which select actions to be taken in a specific state.

In Sparrow’s case, the state is given by the most recent chat input plus the context (prior chat input and replies) of the dialogue up to this time, and the actions (our guess) are the set of possible replies to that input.

Sparrow is able to generate replies that are 82% mostly true or true and are 69% trustworthy or very trustworthy as rated by the authors of the model. Deepmind’s DPC (Dialogue Prompted Chinchilla, which is Deepmind’s current competitor to the GPT-3 NLP transformer) model only managed 63% and 54%, respectively, for the same metrics.

It should be noted that human feedback was only used to train the two Preference RMs and the one Rule RM. In combination, these RMs provide the reward signal to train the Sparrow RL policy model which drives its chat responses.

Sparrow’s 5 models are built on top of DPC. Each of the 5 models uses a portion of DPC which is frozen (layers not being trained) and a portion which is specifically trained for that model (learning-enabled layers). The end (output) layers are on top, input layers are after the embedding layer(s). Note, the value function is not a model and is just a calculation based on the RMs used to generate the reward signal for Sparrow’s policy model training.

Rules for Sparrow chats

Notably, Deepmind’s Sparrow model has a separate model specifically trained to determine if a particular chat response is breaking a rule. Deepmind identified 23 rules which their chat model is trained not to break.

Some of these rules include don’t provide financial advice, don’t provide medical advice, don’t pretend it is a human, etc.

In the above chart, RL@8 is the fully trained (if it can ever be considered fully trained) Sparrow chat model. One can see Sparrow rated against DPC, both with and without (Google) search. For most rules, Sparrow is considerably better than DPC alone.

Another interesting thing Deepmind did: in training the Rule RM, they used adversarial attacks (red teaming) to see if they could cause Sparrow to violate specific rules.

Preference ranking

Deepmind also created (two) Preference RMs (reward models). Sparrow generates a series of (2 or 8) responses for every chat query and the Preference RMs (and Rule RM) are used to select which one is actually sent back to the user. Human feedback was used to train the two Preference RMs.

Two Preference RMs were found to perform better than a single Preference RM. The two Preference RMs were trained as follows:

  • One was trained on all Sparrow replies (with and without [Google] search results)
  • One was trained on Sparrow replies without search results.

Sparrow uses search results to provide evidence for some replies. Some chat questions are fact-based, and for these Sparrow uses search results to generate evidence for its chat replies. Sparrow automatically generates search requests and scrapes replies using 500 characters surrounding the snippet returned from the search.
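As a rough illustration of the evidence scraping described above, here is a minimal Python sketch that cuts a window of text around a search snippet. The function name evidence_window, the choice to treat 500 characters as a total cap, and the centering of the window on the snippet are all our assumptions; this is not Deepmind’s code.

```python
def evidence_window(page_text: str, snippet: str, max_chars: int = 500) -> str:
    """Return up to max_chars of page_text surrounding the first match of snippet.

    Hypothetical helper: assumes 500 is a cap on the whole window and that
    the window is roughly centered on the snippet.
    """
    start = page_text.find(snippet)
    if start == -1:
        # Snippet not found in the scraped page; fall back to the snippet itself.
        return snippet[:max_chars]

    # Characters left over for context once the snippet itself is counted.
    extra = max(max_chars - len(snippet), 0)
    left = max(start - extra // 2, 0)
    right = min(left + max_chars, len(page_text))
    # If we ran into the end of the page, extend the window back to the left.
    left = max(right - max_chars, 0)
    return page_text[left:right]
```

A page shorter than the cap is returned whole, so short pages lose nothing.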

Sparrow uses a re-ranking approach to select a response to a chat query. Sparrow generates a list of responses, 2 (called RL@2) or 8 (called RL@8), then ranks them using the two Preference RMs and the single Rule RM, and replies to the chat user with the best one.

Sparrow actually generates two replies for every search query (Google Search API call), probably selecting the two top search responses (we guess). So in the RL@8 version of Sparrow, these 8 replies are submitted to the two Preference RMs and the Rule RM, ranked according to which is best, and the best one is used to reply to the query.
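To make the re-ranking idea concrete, here is a minimal Python sketch. It is not Sparrow’s actual scoring formula. We assume (our guess) that replies with search evidence are scored by the Preference RM trained on all replies, that replies without evidence are scored by the no-search Preference RM, and that the Rule RM’s violation estimate is simply subtracted as a penalty. The names Candidate, rerank, toy_pref and toy_rule are hypothetical, and the toy scorers stand in for real reward models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    text: str
    evidence: Optional[str] = None  # search snippet backing the reply, if any


# A reward model here is just a function: (dialogue context, candidate) -> score.
Scorer = Callable[[str, Candidate], float]


def rerank(context: str,
           candidates: List[Candidate],
           pref_all_rm: Scorer,
           pref_no_search_rm: Scorer,
           rule_violation_rm: Scorer,
           rule_penalty: float = 1.0) -> Candidate:
    """Pick the best candidate under an ASSUMED scoring rule:
    preference score minus a penalty for likely rule violations."""
    def score(c: Candidate) -> float:
        pref_rm = pref_all_rm if c.evidence is not None else pref_no_search_rm
        return pref_rm(context, c) - rule_penalty * rule_violation_rm(context, c)

    return max(candidates, key=score)


# Toy stand-ins for trained reward models (illustration only).
def toy_pref(context: str, c: Candidate) -> float:
    return 0.1 * len(c.text.split())  # pretend longer replies are preferred


def toy_rule(context: str, c: Candidate) -> float:
    # Pretend "no financial advice" is the only rule being checked.
    return 1.0 if "buy this stock" in c.text.lower() else 0.0


candidates = [
    Candidate("You should buy this stock right now, it can only go up."),
    Candidate("I cannot give financial advice, but I can explain how index funds work."),
]
best = rerank("Should I invest?", candidates, toy_pref, toy_pref, toy_rule)
```

With these toy scorers the rule-breaking reply is penalized, so the second candidate wins. An RL@8 setup would pass 8 candidates instead of 2.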

In the above chart, higher indicates a better preference ranking of the various models vs. human preferences, and further to the right indicates fewer rule-breaking responses. We assume this is with RL@8 Sparrow models. One can see that taking rule breaking into consideration (not violating rules) reduces the preference rankings of Sparrow’s replies. But we would prefer to have no rule breaking, and the Sparrow that has both Preference RMs and the Rule RM (trained with adversarial training) shows the least amount of rule breaking (~7%) with an almost 70% ranking vs. human preferences. The error bars on the points in the chart above show a 68% interval around the model responses.

Sparrow in action

It’s somewhat intriguing that Deepmind (with all of Google’s resources) tried to optimize Sparrow for both computation and memory considerations. Almost like they were planning on releasing it on an IoT or phone device.

There’s plenty more to say about what Deepmind has done with Sparrow. The report cited above goes into some detail discussing just where the human input is done, how they tried to control for various considerations when using human input, and what some of the pitfalls were.

I’d certainly like to see this deployed in the open and available to use as an alternative to Google Search.

You can see more examples of Sparrow chat sessions in Deepmind’s Sparrow chat repository, and they include the authors’ rankings for truth, supportiveness and other metrics.

~~~~~

Comments?

NVIDIA’s H100 vs A100, the good and bad news

Turns out only the current MLPerf v2.1 Data Center Inferencing results show both NVIDIA Hopper H100 and prior-generation NVIDIA A100 GPUs on similar workloads, so that is where we can compare performance. Hopper (H100) results are listed as CATEGORY: Preview, so final results may vary from these numbers (but, we believe, not by much).

The H100 Preview results used only a single H100-SXM(5)-80GB GPU, whereas most of the rest of the Data Center Inferencing results used 8 or more A100-SXM(4)-80GB GPUs. And for the charted data below, all the other top 10 results used 8 A100 GPUs.

The H100 is more than twice as fast as the A100 for NLP inferencing

In order to have an apples to apples comparison of the H100 against the A100 we have taken the liberty of multiplying the single H100 results by 8, to show what they could have done with similar GPU hardware, if they scaled up (to at least 8 GPUs) linearly.

For example, on the NLP inferencing benchmark, the preview category test with a single H100 GPU achieved 7,593.54 server inference queries per second. To compare that against the 8-GPU A100 systems, we multiplied this by 8, which gives us 60,748.32 server inference queries per second.

Of course, they could scale up WORSE than linearly, which would show lower results than we project, but it is very unlikely that they could scale up BETTER than linearly and show higher results. But I’ve been known to be wrong before. We could have just as easily divided the A100 results by 8, but didn’t.

This hypothetical H100 * 8 result is shown on the charts in Yellow. And just for comparison purposes, we show the actual single H100 (*1) result in Orange on the charts as well.

The remaining columns in the chart are the current top 10 in the CATEGORY: Available bucket for NLP server inference queries per second results.

On the chart, higher is better. Of all the Data Center Inferencing results, NLP shows the H100 off in the best light. We project that having 8 H100s would more than double (~60K queries/sec) the inference queries done per second vs. the #1 Nettrix-X660G45L (8x A100-SXM(4)-80GB, TensorRT) that achieved ~27K queries/sec on NLP inferencing.
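The projection arithmetic above can be sketched in a few lines of Python. The function name project_throughput and the scaling_efficiency parameter are ours (hypothetical), added only to show what happens if scaling is worse than linear. The 27,000 figure is the rounded ~27K queries/sec quoted above for the #1 A100 system, not the exact submitted result.

```python
def project_throughput(single_gpu_qps: float,
                       num_gpus: int = 8,
                       scaling_efficiency: float = 1.0) -> float:
    """Project multi-GPU throughput from a single-GPU result.

    scaling_efficiency = 1.0 means perfectly linear scaling (our assumption
    in this post); values below 1.0 model worse-than-linear scaling.
    """
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    return single_gpu_qps * num_gpus * scaling_efficiency


h100_single_qps = 7593.54   # single H100, NLP server queries/sec (Preview)
a100_top_qps = 27_000       # approximate #1 8x A100 NLP result (~27K)

projected = project_throughput(h100_single_qps)   # 8 x 7,593.54 = 60,748.32
speedup = projected / a100_top_qps                # roughly 2.25X

# Scaling efficiency at which 8 H100s would only match the top A100 system.
break_even_efficiency = a100_top_qps / projected

print(f"projected 8x H100: {projected:,.2f} queries/sec")
print(f"speedup vs. top 8x A100: {speedup:.2f}X")
print(f"break-even scaling efficiency: {break_even_efficiency:.0%}")
```

In other words, on NLP the H100s would have to scale very poorly to lose their lead over the top 8x A100 system.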

The H100 is slower than A100 on Recommendation inferencing

Next we look at Recommendation engine inferencing results, which show the H100 in the worst light when comparing it to A100s.

Similar to the above, higher is better and the metric is (online) server inference queries per second.

We project that having 8-H100s would perform a little over 2.5M recommendation engine inference queries/sec, worse than the top 2 with 8-A100s, both achieving 2.6M inference queries/sec. The #1 is the same Nettrix-X660G45L (8x A100-SXM(4)-80GB, TensorRT) and the #2 ranked Recommendation Engine inferencing solution is the Inspur-NF5688M6 (8x A100-SXM(4)-80GB, TensorRT).

We must say the projected H100 would have performed better than the #1 ranked system in all the other Data Center Inferencing benchmarks. In some cases, as shown above, significantly (over 2X) better.

The H100 Preview benchmarks all used a single AMD EPYC 7252 8-Core Processor chip. Many of the other workloads used Intel Xeon(R) Platinum (8368Q [38-cores], 8380 [40-cores], 8358 [32-cores] and others) CPUs and 2 CPUs rather than just 1. So, by multiplying the single H100, single AMD EPYC CPU performance by 8, we are effectively predicting the performance of a system with a total of 64 cores across 8 CPU chips.

Not sure why recommendation engine inferencing would be worse than NLP for H100 GPUs. We thought at first it was a CPU intensive workload but as noted above, 64 (8X8 cores/chip) AMD cores vs 64 to 80 (2X32, 2X38, 2X40) Intel cores seems roughly similar in performance (again, I’ve been wrong before).
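The core counts being compared above work out as follows. This is a small Python sketch of the arithmetic only. Treating raw core count as a proxy for CPU performance is our simplification, and the dictionary keys are just the CPU model numbers mentioned in the text.

```python
# Projected H100 system: 8 copies of the single-GPU setup,
# each with one 8-core AMD EPYC 7252.
h100_projected_cores = 8 * 8

# A100 systems: 2 Intel CPUs each, core counts per chip as listed above.
intel_cores_per_chip = {"8358": 32, "8368Q": 38, "8380": 40}
a100_system_cores = {model: 2 * cores
                     for model, cores in intel_cores_per_chip.items()}

for model, cores in a100_system_cores.items():
    ratio = cores / h100_projected_cores
    print(f"2x {model}: {cores} cores "
          f"({ratio:.2f}X the projected H100 setup's {h100_projected_cores})")
```

So the A100 systems have between the same number and 25% more cores than the projected H100 setup, before accounting for any per-core differences between the AMD and Intel parts.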

Given all that, we surmise that there’s something else that’s holding the H100s back. It doesn’t appear to be memory as both the H100s and A100s had 80GB of memory. They are both PCIe attached. In fact the H100s are PCIe gen 5 and the A100s are PCIe gen 4 so, if anything the H100s should have 2X the bandwidth of A100.

It’s got to be something about the peculiarities of Recommendation Engine inferencing that doesn’t work as well on H100 as it does on A100s.

Earlier this year we wrote a dispatch on NVIDIA’s H100 announcement and compared the H100 to the A100. Here is a quote from that dispatch on the H100 announcement:
“… with respect to the previous generation A100, each H100 GPU SM is:
• Up to 6X faster in chip-to-chip performance, this includes higher SM counts, faster SMs, and higher clock rate
• Up to 2x faster in Matrix Multiply Accumulate instruction performance,
• Up to 4X faster in Matrix Multiply Accumulate for FP8 on H100 vs. FP16 on the A100.

In addition, the H100 has DPX instructions for faster dynamic programming used in genomics, which is 7X faster than A100. It also has 3X faster IEEE FP64 and FP32 arithmetic over the A100, more (1.3X) shared memory, a new asynchronous execution engine, new Tensor Memory Accelerator functionality, and a new distributed shared memory with direct SM to SM data transfers.”

We suspect that the new asynchronous execution engines aren’t working well with the recommendation engine inferencing instruction flow or the TMAs aren’t working well with the recommendation engine’s (GPU) working set.

It’s unclear why H100 shared memory or SM-to-SM data transfers would be the bottleneck, but we really don’t know for sure.

It’s our belief that the problems could just be minor optimizations that didn’t go the right way and could potentially be fixed in (GPU) firmware, CUDA software or worst case, new silicon.

So in general, although the H100 is, as reported, 2X-6X faster than the A100s, we don’t see any more than 2X speedup in any data center inferencing benchmarks. And in one case, we see a slight deterioration.

We’d need to see similar results for training activity to come up with a wider depiction of H100 performance vs. A100, but at the moment, it’s a good speedup but not that good.

~~~~

Comments?

MTJ’s everywhere

We have been requested to remove this post by the lecturer who supplied the information discussed in this post. We are complying with this request as of 05 October 2022.

The Editors

The killer (space) app

Salvaging or recycling the International Space Station (ISS) is the killer app. There’s so much there that could be re-used, it would be a dying shame to have it be deorbited, burned up and crashed into the ocean somewhere.

Yes, recycling the ISS is a monumental task today. Yes, the probability of success is slim (at the moment). But ISS deorbit is now scheduled for 2031 (see NASA article). That gives us just 9 short years to develop the technology to recycle the ISS in orbit, to save the parts that have cost literally billions of $s to ship to space.

There’s little time to waste. We need to get sophisticated robots into orbit that can do the job when the time comes. The only way to get there then, is to start small and iterate like a startup until we reach business sustainability.

A beach head in space

One current need, that may help us initiate operations in space and start a technology iteration loop, is deorbiting space junk. Just about every space organization on earth is funding technology development or deorbiting mission development to clean up LEO and beyond.

I believe a focused startup can do this for millions less and am willing to put my time, effort (and money) pursuing this activity as a first step.

Once we have deorbiting systems in orbit, we can work on adding more and more sophisticated robotics capabilities to our satellites, which can lead to providing the services to recycle the ISS into parts and use them to help build the next generation of space infrastructure.

Reaching out for help, any way I can get it

Currently this is one man’s dream and I could use any help you want to offer. If you have any interest in helping out, please comment on this post and let me know how to contact you. I need every kind of skill to get something like this off the ground. But my intent is to do this alone if I have to.

Wish me luck,

Ray