Steam Locomotive lessons for disk vs. SSD

Read a PHYS ORG article, Extinction of Steam Locomotives derails assumptions about biological evolution…, which was reporting on a Royal Society research paper, The end of the line: competitive exclusion & the extinction…, that looked at the historical record of steam locomotives from their inception in the early 19th century to their demise in the mid 20th century. Reading the article, it seemed to me to have wider applicability than evolutionary extinction dynamics alone; in fact, a similar analysis could reveal some secrets of technological extinction.

Steam locomotives

During the steam locomotive's 150 years of production, many competitive technologies emerged, starting with electric locomotives, followed by automobiles & trucks and, finally, the diesel locomotive.

The researchers selected a single metric to track the evolution (or fitness) of the steam locomotive: tractive effort (TE), or the weight a steam locomotive could move. Early on, steam locomotives hauled both passengers and freight. The researchers included automobiles and trucks as competitive technologies because they also offer a way to move people and freight. The diesel locomotive was a more obvious competitor.

In the chart, the dark line is a linear regression trend line on the wavy mean TE line, the boxes are the interquartile (25%-75%) range, the line within each box is the median TE value, and the shaded area is the 95% confidence interval for the trend line of the TE of steam locomotives produced that year. Raw data come from Locobase, a steam locomotive database.
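For anyone wanting to try the same kind of plot on storage data, here's a minimal sketch in Python of that chart's construction (per-year box plots plus a linear trend on the yearly mean). The file and column names are hypothetical stand-ins for a Locobase-style export; this is not the researchers' actual code.

```python
# Sketch: yearly TE distribution (boxes) plus a linear trend on the yearly mean.
# Assumes a hypothetical CSV export with 'year' and 'tractive_effort' columns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("locobase_export.csv")          # hypothetical file name
yearly = df.groupby("year")["tractive_effort"]

means = yearly.mean()
slope, intercept = np.polyfit(means.index, means.values, 1)   # linear regression on mean TE

fig, ax = plt.subplots()
# Box plots: median line and interquartile (25%-75%) box per production year
ax.boxplot([g.values for _, g in yearly], positions=list(means.index),
           widths=0.8, showfliers=False)
ax.plot(means.index, slope * means.index + intercept, color="black", label="trend (mean TE)")
ax.set_xlabel("Year of production")
ax.set_ylabel("Tractive effort")
ax.legend()
plt.show()
```

The same scaffolding would work for disk capacities per ship year, if we ever get a Locobase-like database for drives.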

One can see three phases in the graph. In the red phase, from 1829-1881, there was unencumbered growth of TE for steam locomotives. In 1881, electric locomotives were introduced, corresponding to the blue phase, and after WW II the black phase led to the demise of steam.

Here (in the blue phase) we see a phenomenon often observed with the introduction of competitive technologies: there seems to be an increase in innovation as the multiple technologies duke it out in the ecosystem.

Automobiles and trucks were introduced in 1901 but they don't seem to have impacted steam locomotive TE. Possibly this is because the passenger and freight volumes hauled by cars and trucks weren't that significant. Or maybe their impact was more on the distances hauled.

In 1925 diesel locomotives were introduced. Again we don’t see an immediate change in trend values but over time this seemed to be the death knell of the steam locomotive.

The researchers identified four prerequisites for tracking inter-species competition:

  • A functional trait within the competing species can be identified and tracked. For the steam locomotive this was TE;
  • Direct competitors for the species can be identified that coexist within its spatial, temporal and resource requirements. For the steam locomotive, these were autos/trucks and electric/diesel locomotives;
  • A complete time series for the species/clade (group of related organisms) can be identified. This was supplied by Locobase;
  • Non-competitive factors don't apply or are irrelevant. There are plenty of these here, including most of the items listed on their chart.

From locomotives to storage

I'm not saying that disk is akin to the steam locomotive while flash is akin to diesel, but maybe. For example, one could consider storage capacity as similar to locomotive TE. There's a plethora of other factors one could track over time, but this one was relevant at the start and is still relevant today. What we in the industry lack is any true tracking of the capacities produced between the birth of the disk drive in 1956 (according to Wikipedia's History of hard disk drives article) and today.

But I'd venture to say the mean capacity has been trending up while the variance in that capacity has been static for years (based on higher platter counts more than anything else).

There are plenty of other factors that could be tracked, for example areal density or $/GB.

Here's a chart comparing areal (2D) density growth of flash, disk and tape media between 2008 and 2018. Note that this chart and the following charts use log scales.

Over the last 5 years NAND has gone 3D. Current NAND chips in production have 300+ layers. Disks went 3D back in the 1960s or earlier. And of course tape has always been 3D, as it’s a ribbon wrapped around reels within a cartridge.

So areal density plays a critical role, but it covers only 2 of the 3 dimensions that determine capacity. The areal density crossover point between HDD and NAND in 2013 seems significant to me, and perhaps to the history of disk as well.

Here's another chart showing the history of $/GB for these technologies.

This chart compares the price/GB of the various technologies (presumably the most economical available during that year). HDD $/GB was on a 40%/year reduction trend between 2008 and 2010, then flatlined, and now appears to be on a 20%/year reduction trend. Flash was on a 25%/year reduction in $/GB from 2008 to 2017, which flatlined in 2018. LTO tape had been on a 25%/year reduction from 2008 through 2014 and since then has been on an 11%/year reduction.

If these $/GB trends continue, a big if, flash will eventually undercut disk in $/GB, and tape as well over time.
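The crossover arithmetic is just compound decline. Here's a back-of-envelope sketch; the starting prices and annual rates below are illustrative placeholders, not values read off the charts.

```python
# Back-of-envelope $/GB projection under constant annual reduction rates.
# Starting prices and rates are illustrative assumptions, not chart data.
def project(price, annual_reduction, years):
    """$/GB after 'years' of compounding at 'annual_reduction' (0.20 = 20%/yr)."""
    return price * (1 - annual_reduction) ** years

hdd_price, hdd_rate = 0.015, 0.20      # assumed HDD $/GB and ~20%/yr decline
flash_price, flash_rate = 0.10, 0.25   # assumed flash $/GB and ~25%/yr decline

for year in range(0, 31, 5):
    print(year,
          round(project(hdd_price, hdd_rate, year), 5),
          round(project(flash_price, flash_rate, year), 5))
# With these made-up inputs the lines only cross after roughly 30 years,
# which is why "a big if" is doing a lot of work in the sentence above.
```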

But here's something on capacity alone, which seems closer to the TE chart for steam locomotives.

HDD capacity 1980-2020.

There's some dispute regarding this chart, as it only reflects drives available at retail, and drives with higher capacities were not always available there. Nonetheless it shows a couple of interesting items. Early on, up to ~1990, drive capacities were relatively stagnant. From 1995 to 2010 there was a significant increase in drive capacity, and since 2010 drive capacities seem to have stopped increasing as much. We presume the number of X's for a given year shows the different drive capacities available for retail sale, somewhat similar to the box plots on the TE chart above.

SSDs were first created in the early '90s, but the first 1TB SSD came out around 2010. Since then, the number of disk drives offered at retail (as depicted by the X's on the chart each year) seems to have declined, and their range in capacity (other than ~2016) seems to have narrowed significantly.

If I take the lessons from the steam locomotive to heart here, one would have to say that the HDD has been forced to adapt to a smaller market than it had prior to 2010. And if areal density trends are any indication, it would seem that R&D efforts to increase capacity have declined, or we have reached some physical barrier with today's media-head technologies. Although such physical barriers have historically been surpassed once new technologies emerged.

What we really need is something akin to Locobase for disk drives, tracking all disk drives sold each year. That way we could truly produce something similar to the chart tracking TE for steam locomotives, and it would allow us to see whether the end of the HDD is nigh or not.

Final thoughts on technology extinction dynamics

The Royal Society research had a lot to say about the dynamics of technology competition. And they had other charts in their report but I found this one very interesting.

This shows an abstract analysis of the steam locomotive data. They identify 3 zones of technology life: the safe zone, where the technology has no direct competitors; the danger zone, where competition has emerged but has not conquered all of the technology's niches; and the extinction zone, where competing technology has entered every niche the original technology occupied.

In the late 90s, enterprise disk supported high performance/low capacity, medium performance/medium capacity and low performance/high capacity drives. Since then, SSDs have pretty much conquered the high performance/low capacity disk segment. And with the advent of QLC and PLC (4 and 5 bits per cell) using multi-layer NAND chips, SSDs seem poised to conquer the low performance/high capacity niche. And there are plenty of SSDs using MLC/TLC (2 or 3 bits per cell) with multi-layer NAND to attack the medium performance/medium capacity disk market.

There were also very small disk drives at one point which seem to have been overtaken by M.2 flash.

On the other hand, just over 95% of all disk and flash storage capacity being produced today is disk capacity. So even though disk is clearly in the extinction zone with respect to flash storage, it still seems to be doing well.

It would be wonderful to have a similar analysis done on transistors vs vacuum tubes, jet vs propeller propulsion, CRT vs. LED screens, etc. Maybe at some point with enough studies we could have a theory of technological extinction that can better explain the dynamics impacting the storage and other industries today.

Comments?


AWS Data Exchange vs Data Banks – part 2

Saw that AWS announced a new Data Exchange service at their AWS Pi Day 2023 event. This is a completely managed service, available on the AWS Marketplace, for monetizing data.

In a prior post on a topic I called data banks (Data banks, data deposits & data withdrawals…), I talked about the need to have some sort of automated support for personal data that would allow us to monetize it.

The hope then (4.5yrs ago) was that social media, search and other web services would supply all the data they have on us back to us and we could then sell it to others that wanted to use it.

In that post, I called the data that social media gave back to us data deposits; the place where that data was held and sold, a data bank; and the sale of that data, a data withdrawal. (I know talking about bank deposits and withdrawals is probably not a great idea right now, but this was written a while back.)

AWS Data Exchange

1918 Farm Auction by dok1 (cc) (from Flickr)

With AWS Data Exchange, data owners can sell their data to data consumers. And it's a completely AWS-managed service. One presumably creates an S3 bucket with the data you want to sell, determines a price and a period during which clients can access that data, and registers this with AWS; AWS Data Exchange then supports any number of clients purchasing the data.

Presumably (although unstated in the service announcement), you'd be required to update and curate the data to ensure it's correct and current, but other than that, once the data is on S3 and the offer is in place, you could just sit back and watch the cash come in.
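The announcement doesn't spell out the supplier workflow, but the S3 side of it would presumably look something like the sketch below. Bucket and file names are placeholders; the actual Data Exchange product registration and pricing happen through the AWS console or the Data Exchange/Marketplace APIs and are only stubbed out here.

```python
# Sketch of the supplier-side S3 setup only; the Data Exchange product/offer
# registration itself is done via the AWS console or Data Exchange APIs and
# is not shown. Bucket and file names are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "my-data-for-sale"                      # placeholder name

s3.create_bucket(Bucket=bucket)                  # in us-east-1; other regions need a LocationConstraint
s3.upload_file("curated_dataset.csv", bucket, "v1/curated_dataset.csv")

# From here, the data set would be registered with AWS Data Exchange and
# priced/listed on the AWS Marketplace (service-specific steps, not shown).
```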

I see the AWS Data Exchange service as a step on the path of data monetization for anyone. Yes, it's got to be on S3, and yes, it's via the AWS Marketplace, which means that AWS gets a cut of any sale, but it's certainly a step towards a freer data marketplace.

Changes I would like to see in the AWS Data Exchange service

Putting aside the need to have more than just AWS offer such a service (and I heartily request that all cloud service providers make a data exchange, or something similar, a fully supported offering of their respective storage services), this is not quite the complete data economy or ecosystem that I had envisioned in September of 2018.

If we just focus on the use (data withdrawal) side of a data economy, which is the main thing AWS Data Exchange seems to support, there are quite a few missing features IMHO:

  • Data use restrictions – We don't want customers to obtain a copy of our data. We would very much like to restrict them to reading it, with plain-text access to the data only during the period they have paid for. Once that period expires, all copies of the data need to be destroyed programmatically, cryptographically or in some other permanent/verifiable fashion. This can't be done through license restrictions alone, which seems to be AWS Data Exchange's current approach. Not sure what a viable alternative might be, but some sort of time-dependent or temporal encryption key that could be expired would be one step; customers would need to install some sort of data exchange service on the servers using the data that would support encryption access/use (a minimal sketch of the idea follows this list).
  • Data traceability – Yes, clients who purchase access should have access to the data for whatever they want to use it for. But there should be some way to trace where our data ended up or what it was used for. If it's used to help train a NN, then I would like to see some sort of provenance or certificate applied to that NN, in a standardized structure, to indicate that it made use of our data as part of its training. Similarly, if it's part of an online display tool, somewhere in the footnotes of the UI there would be a data-origins certificate list with some way to point back to our data as the source of the information presented. Ditto for any application that made use of the data. AWS Data Exchange does nothing to support this. In reality, something like this would need standards bodies to create certificates and additional structures for NNs, standard application packages, online services, etc. that would retain and provide proof of data origins via certificates.
  • Data locality – There are some jurisdictions around the world which restrict where data generated within their boundaries can be sent, processed or used. I take it that AWS Data Exchange deals with these restrictions either by not offering data under jurisdictional restrictions for sale outside governmental boundaries or by gating purchase of the data outside valid jurisdictions. But given VPNs and similar services, this seems less than effective. If there were some sort of temporal key encryption service to control use of our data, then it would seem reasonable to add some sort of regional key encryption to it as well.
  • Data auditability – There needs to be some way to ensure that our data is not used outside the organizations that have actually paid for it. And if there's some sort of data certificate saying that the application or service that used the data has access to that data, then this mechanism needs to be mandated, supported, and validated. In reality, something like this would need a whole rethinking of how data is used in society. Financial auditing took centuries to take hold and become an effective (sometimes?) tool to monitor against financial abuse. Data auditing would need many of the same sorts of functions, i.e. Certified Data Auditors, a Data Accounting Standards Board (DASB) which defines standardized reports as to how an entity is supposed to track and report on data usage, governmental regulations requiring public (and private?) companies to report on the origins of the data they use on a yearly/quarterly basis, etc.
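To make the temporal-key idea in the first bullet concrete, here's a minimal sketch using the Python cryptography package's Fernet tokens, which embed a timestamp and can refuse decryption after a TTL. This is only an illustration of the concept; a real data exchange would also need key distribution, revocation, and some way to verify that expired data was actually destroyed.

```python
# Minimal sketch of time-limited data access using Fernet tokens from the
# 'cryptography' package. Fernet embeds a timestamp in every token and
# decrypt(..., ttl=...) rejects tokens older than the TTL.
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()                  # held by the data exchange service
f = Fernet(key)

token = f.encrypt(b"row1,row2,row3")         # encrypted data handed to the purchaser

ACCESS_PERIOD_SECONDS = 30 * 24 * 3600       # e.g. a 30-day access window
try:
    plaintext = f.decrypt(token, ttl=ACCESS_PERIOD_SECONDS)
except InvalidToken:
    plaintext = None                         # access period expired (or token tampered with)
```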

There's probably much more that could be added here, but this should suffice for now.

Other changes to AWS Data Exchange processes

The AWS Pi Day 2023 announcement didn't really describe the supplier end of how the service works. How one registers a bucket for sale was not described. I'd certainly want some sort of steganography service to tag the data being sold with the identity of those who purchased it. That way there might be some possibility of tracking who released any data exchange data into the wild.

Also, how data exchange data access is billed seems a bit archaic. As far as I can determine, one gets unlimited access to the data for some defined period (N months) for some specific amount ($s). And once that period expires, customers have to pay up or cease accessing the S3 data. I'd prefer to see at least a GB-accessed/month sort of cost structure; that way, if a customer copies all the data, they pay for that privilege, and if they want to reread the data multiple times, they pay for that access too. Presumably this would require some sort of solution to the data use restrictions above to enforce.

Data banks, deposits, withdrawals and Initial Data Offerings (IDOs)

The earlier post talks about an expanded data ecosystem or economy. And I won’t revisit all that here but one thing that I believe may be worth re-examining is Initial Data Offerings or IDOs.

As described in the earlier post, an IDO would be a mechanism for data users to request permanent access to our data, but in exchange, instead of paying a one-time fee, they would offer data equity in the service.

Not unlike VC funding, each data provider would be given some % (data) ownership in the service; over time that data ownership gets diluted at further data raises, but at some point, when the service is profitable, data ownership units could be purchased outright, so that the service could exit its private data use stage and go public (data use).

Yeah, this all sounds complex, and AWS Data Exchange just sells data once, with access for some period, establishing data usage rights. But I think that in order to compensate users for their data, there needs to be something like IDOs that provide data ownership shares in a service, shares that can be transferred (sold) to others.

I didn’t flesh any of that out in the original post but I still think it’s the only way to truly compensate individuals (and corporations) for the (free) use of the data that web, AI and other systems are using to create their services.

~~~~

I wrote the older post in 2018 because I saw the potential for our data to be used by others to create/train services that generate lots of money for those organizations, without our knowledge or outright consent, and without compensating us for the data we have (inadvertently or advertently) created over our life spans.

As an example, one can see how Getty Images is suing AI image generation services that have had free use of its copyrighted materials to train their NNs. If one looks underneath the covers of ChatGPT, many image processing/facial recognition services, and many other NNs, much of the data used in training them was obtained by scraping web pages that weren't originally intended to supply these sorts of data to others.

For example, it wouldn't surprise me to find out that RayOnStorage post text has been scraped from the web and used to train some large language model like ChatGPT.

Do I receive any payment or ownership equity in any of these services? NO. I write these blog posts partially as a means of marketing my other consulting services, but also because I have an abiding interest in the subjects under discussion. I'm happy for humanity to read them and welcome comments on them by humans. But I'm not happy to have LLMs or other NNs use my text to train their models.

On the other hand, I'd gladly sell access to RayOnStorage post text if they offered me a high but fair price for its use for some time period, say one year… 🙂

Comments?

LLM exhibits Theory of Mind

Ran across an interesting article today (thank you John Grant/MLOps.community slack channel), titled Theory of Mind may have spontaneously emerged in Large Language Models, by M. Kosinski from Stanford. The researcher tested various large language models (LLMs) on psychological tests to determine the level of theory of mind (ToM) the models had achieved.

Earlier OpenAI models (GPT-1, GPT-2 and the original GPT-3) showed almost no ToM capabilities, but the latest version, GPT-3.5, does show ToM equivalent to that of 8 to 9 year olds.

Theory of Mind

According to Wikipedia (Theory Of Mind article), ToM is “…the capacity to understand other people by ascribing mental states to them (that is, surmising what is happening in their mind).” This seems to be one way people use to understand one another.

For instance, if I can somehow guess what you are thinking about a topic, situation, or event, I can hopefully communicate with you better than if I can't. At least that's the psychological perspective.

The belief is that people with Asperger's, ADHD, schizophrenia, and other conditions all show ToM deficits when compared to normal people. As a result, over time, psychologists have developed tests to measure a person's ToM.

These tests typically involve putting two people in a situation with props and other indicators of what a person is thinking, and asking one what they think the other person is thinking. A person's ToM is graded against expected results.

ToM and LLM

The researcher took these tests, with their people, props and situations, and converted them into textual stories and prompts that mimicked the situations used in ToM testing. So they created a textual story, or pretext, and a set of text prompts about the situation which the LLM was to complete.

For example, one pretext or story is the following:

“Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn.’ Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label.”

This is fed into the LLM and then a prompt is provided, such as:

“She opens the bag and looks inside. She can clearly see that it is full of _________”

If the LLM has ToM, then based on the pretext and prompt it will say “popcorn”, but the LLM also provides a follow-on sentence or two which describes the situation as the LLM understands it:

popcorn. Sam is confused. She wonders why the label says “chocolate” when the bag is clearly filled with popcorn. She looks around to see if there is any other information about the bag. She finds nothing. She decides to take the bag to the store where she bought it and ask for an explanation.

The text in bold is generated by a ToM-capable LLM. The researcher also showed the probability the LLM assigned to the first word of its completion. In the case above, it showed [P(popcorn) = 100%; P(chocolate) = 0%].

They also use different prompts with the same story to see if the LLM truly shows ToM. For instance, something like “She believes the bag is full of ___________” or “She's delighted to have found the bag. She loves eating _______”. This provides a sort of test of the LLM's comprehension of the situation.
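To give a feel for the mechanics, here's a rough sketch of how one might probe a model this way. This is my reconstruction, not the paper's code; the model name and the completions/logprobs parameters follow the OpenAI API of that era and may differ in current library versions.

```python
# Sketch of probing an LLM the way the paper describes (not the paper's code).
# Assumes the legacy OpenAI completions endpoint and OPENAI_API_KEY in the env.
import math
import openai

story = ("Here is a bag filled with popcorn. There is no chocolate in the bag. "
         "Yet, the label on the bag says 'chocolate' and not 'popcorn.' Sam finds "
         "the bag. She had never seen the bag before. She cannot see what is "
         "inside the bag. She reads the label.")
prompt = story + "\nShe opens the bag and looks inside. She can clearly see that it is full of"

resp = openai.Completion.create(
    model="text-davinci-003",        # illustrative model choice
    prompt=prompt,
    max_tokens=60,
    temperature=0,
    logprobs=5,                      # return log-probabilities of the top tokens
)

completion = resp["choices"][0]["text"]
top = resp["choices"][0]["logprobs"]["top_logprobs"][0]   # candidates for the first token
probs = {tok.strip(): math.exp(lp) for tok, lp in top.items()}
print(completion)
print(probs.get("popcorn"), probs.get("chocolate"))       # e.g. ~1.0 vs ~0.0
```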

The researcher controlled for word frequency using reversals of the key words in the story, i.e., the bag contains chocolate but says popcorn. They also generated scrambled versions of the story, where the occurrences of chocolate and popcorn were replaced with either word at random; they considered this the scrambled case. They reset the model between each case. In the paper they show the success rate of the LLMs on 10,000 scrambled versions, some of which happen to be correct.
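Here's a sketch of those reversal and scrambling controls, again my reconstruction from the paper's description rather than the researchers' code.

```python
# Sketch of the word-frequency controls described above (a reconstruction,
# not the researchers' code).
import random

story = ("Here is a bag filled with popcorn. There is no chocolate in the bag. "
         "Yet, the label on the bag says 'chocolate' and not 'popcorn.'")

def reversed_story(text: str) -> str:
    """Swap the two key words: the bag contains chocolate but is labeled popcorn."""
    return (text.replace("popcorn", "\x00")
                .replace("chocolate", "popcorn")
                .replace("\x00", "chocolate"))

def scrambled_story(text: str, rng: random.Random) -> str:
    """Replace each occurrence of the key words with either word, at random."""
    out = []
    for w in text.split(" "):
        if "popcorn" in w or "chocolate" in w:
            key = rng.choice(["popcorn", "chocolate"])
            w = w.replace("popcorn", key).replace("chocolate", key)
        out.append(w)
    return " ".join(out)

rng = random.Random(0)
variants = [scrambled_story(story, rng) for _ in range(10)]
print(reversed_story(story))
```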

They labeled the above series of tests “Unexpected content tasks”. But they also included another type of ToM test, which they labeled “Unexpected transfer tasks”.

Unexpected transfer tasks involve a story where, say, person A sees person B put a pet in a basket; person B then leaves and person A moves the pet. The LLM is prompted to see if it understands where the pet is and how person B will react when they get back.

In the end, after trying to statistically control the stories and prompts as much as possible, the researchers ended up creating 20 unique stories and presenting their prompts to the LLMs.

Results of their ToM testing on a select set of LLMs look like:

As can be seen from the graphic, the latest version of GPT-3.5 (davinci-003, with 176B* parameters) achieved something like an 8-year-old's level on Unexpected Contents tasks and a 9-year-old's level on Unexpected Transfer tasks.

The researchers showed other charts that tracked the LLM's probabilities for (in the first story above, for example) the bag's contents and Sam's belief, measured after every sentence of the story.

Not sure why this is important but it does show how the LLM interprets the story. Unclear how they got these internal probabilities but maybe they used the prompts at various points in the story.

The paper shows that, according to their testing, GPT-3.5 davinci-003 clearly performs at the ToM level of an 8-9 year old on the ToM tasks they have translated into text.

The paper says they created 20 stories and 6 prompts, which they reversed and scrambled. But 20 tales seems hardly enough to be statistically significant, even with reversals and randomization. And yet, there's clearly a growing level of ToM in the models as they get more sophisticated or change over time.

Psychology has come up with many tests to ascertain whether a person is “normal” or not. Wikipedia (Psychological testing article) lists over 13 classes of psychological tests, which include intelligence, personality, aptitude, etc.

Now that LLMs seem to have mastered textual input and output generation, it would be worthwhile to translate all psychological tests into text and try them out on all the LLMs, to track where they are today on these tests and how they have trended over time.

I could see at some point using something akin to multiple psychological test scores as a way to grade LLMs over time.

So today's GPT-3.5 has the ToM of an 8-9 year old. It will be very interesting to see how GPT-4 does on similar testing.

Comments?


Data and code versioning for MLOps

Read an interesting article (Ex-Apple engineers raise … data storage startup) and research paper (Git is for data) about a group of ML engineers from Apple forming a new "data storage" startup targeted at MLOps teams just like the ones at Apple. It turns out that MLOps has some very unusual data requirements that go way beyond just data storage.

The paper discusses some of the unusual data requirements for MLOps such as:

  • Infrequent updates – Yes, there are some MLOps datasets where updates are streamed in, but the vast majority of MLOps datasets are updated on a slower cadence. The authors think monthly works for most MLOps teams.
  • Small changes/lots of copies – The changes to MLOps data are relatively small compared to the overall dataset size and usually consist of data additions, record deletions, label updates, etc. But, unlike most data, MLOps data are often subsetted or extracted into smaller datasets used for testing, experimentation and other “off-label” activities.
  • Variety of file types – depending on the application domain, MLOps file types range all over the place. But there’s often a lot of CSV files in combination with text, images, audio, and semi-structured data (DICOM, FASTQ, sensor streams, etc.). However within a single domain, MLOps file types are pretty much all the same.
  • Variety of file directory trees – this is very MLOps team and model dependent. Usually there are train/validate/test splits to every MLOps dataset but what’s underneath each of these can vary a lot and needs to be user customizable.
  • Data often requires pre-processing to be cleansed and made into something appropriate and more useable by ML models
  • Code and data must co-evolve together, over time – as data changes, the code that uses them change. Adding more data may not cause changes to code but models are constantly under scrutiny to improve performance, accuracy or remove biases. Bias elimination often requires data changes but code changes may also be needed.

It's that last requirement, that MLOps data and code must co-evolve and thus need to be versioned together, that's most unusual. Data-code co-evolution is needed for reproducibility, rollback and QA, but for many other reasons as well.

In the paper they show a typical MLOps data pipeline.

Versioning can also provide data (and code) provenance, identifying the origin of data (and code). MLOps teams undergoing continuous integration need to know where data and code came from and who changed them. And as most MLOps teams collaborate in the development, they also need a way to identify data and code conflicts when multiple changes occur to the same artifact.

Source version control

Code has had this versioning problem forever and the solution became revision control systems (RCS) or source version control (SVC) systems. The most popular solutions for code RCS are Git (software) and GitHub (SaaS). Both provide repositories and source code version control (clone, checkout, diff, add/merge, commit, etc.) as well as a number of other features that enable teams of developers to collaborate on code development.

The only thing holding Git/GitHub back from being the answer to MLOps data and code version control is that they don’t handle large (>1MB) files very well.

The solution seems to be adding better data handling capabilities to Git or GitHub. And that’s what XetHub has created for Git.

XetHub's “Git is for data” paper (see link above) explains in much detail how they provide a better data layer for Git, but it boils down to using Git for code versioning and as a metadata database for their deduplicating data store. They use Merkle trees to track the chunks of data in a deduped dataset.

How XetHub works

XetHub supports variable (content-defined) chunking and deduplication for their data store. This allows them to use relatively small files checked into Git as the metadata pointing to the current (and all previous) versions of the data files checked into the system.

Their mean chunk size is ~4KB. Data chunks are stored in their data store. But the manifest for dataset versions is effectively stored in the Git repository.

The paper shows how using a deduplicated data store can support data versioning.

XetHub uses a content addressable store (CAS) to store the file data chunk(s) as objects or BLOBs. The key to getting good IO performance out of such a system is to have small chunks but large objects.

They map data chunks to files using CDMTs (content-defined Merkle trees). Each chunk of data resides in at least two different CDMTs: one associated with the file version and the other associated with the data storage elements.

XetHub's variable chunking is done using a statistical approach and multiple checksums, but they also offer one specialized file-type chunker, for CSV files. As it is, even with their general-purpose variable chunking method, they can offer a ~9X dedupe ratio for text data (embeddings).
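To show the shape of the idea, here's a generic content-defined chunking plus content-addressed dedupe sketch in Python. It is emphatically not XetHub's chunker (theirs is statistical, uses multiple checksums, and has a CSV-aware variant); it just illustrates why an edit near the front of a file leaves most downstream chunk hashes, and hence the stored chunks, unchanged.

```python
# Generic content-defined chunking (CDC) + content-addressed dedupe sketch.
# NOT XetHub's chunker; slow but illustrative: a chunk boundary is cut wherever
# a hash of the last few bytes hits a mask, so boundaries depend only on local
# content and re-synchronize after an edit.
import hashlib
import random

MASK = 4096 - 1                   # ~4KB average chunks, like the paper

def chunks(data: bytes, window: int = 48, min_size: int = 1024):
    """Yield content-defined chunks of data."""
    start = 0
    for i in range(len(data)):
        if i - start < min_size:
            continue
        h = int.from_bytes(hashlib.sha256(data[i - window:i]).digest()[:4], "big")
        if h & MASK == 0:
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]

store = {}                        # toy content-addressable store (CAS)

def put_file(data: bytes) -> list[str]:
    """Store a file; return its manifest (the list of chunk hashes, kept in Git)."""
    manifest = []
    for c in chunks(data):
        key = hashlib.sha256(c).hexdigest()
        store.setdefault(key, c)  # dedupe: identical chunks stored only once
        manifest.append(key)
    return manifest

rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(200_000))
v1 = put_file(base)
v2 = put_file(b"a small insert at the front" + base)
print(len(set(v1) & set(v2)), "of", len(v2), "chunks shared between versions")
```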

They end up using Git commands for code and data but provide hooks (Git filters) to support data cloning, add/checkin, commits, etc. So they can take advantage of all the capabilities Git has grown over the years to support collaborative code development, but use them for data as well as code.

In addition to normal Git services for code and data, XetHub also offers a read-only NFSv3 file system interface to XetHub datasets. Doing this eliminates having to reconstitute and copy TBs of data from their code-data repo to user workstations. With NFSv3 front-end access to XetHub data, users can easily incorporate data access for experimentation, testing and other uses.

Results from using XetHub

XetHub showed some benchmarks comparing their solution to Git LFS, another Git-based large-data storage solution. For their benchmark they used CORD-19 (see the ArXiv paper and Kaggle CORD-19 dataset), which is a corpus of all COVID-19 papers since COVID started. The corpus is updated daily and released periodically, and they used the last 50 versions (up to June 2022) of the research corpus for their benchmark.

Each version of the CORD-19 corpus consists of JSON files (research reports, up to 700K each) and 2 large CSV files, one with paper information and the other with paper (word?) embeddings (a more usable version of the paper text/tables used for ML modeling).

For CORD-19, XetHub is able to store all 2.45TB of research reports and CSV files in only 287GB of Git (metadata) and datastore data, a dedupe factor of 8.7X. With XetHub's specialized CSV chunking (Xet w/ CSV chunking above), the 50 CORD-19 versions can be stored in 87GB, a 28.8X dedupe ratio. And of that 87GB, only 82GB is data and the rest, ~5GB, is metadata (of which 1.7GB is the Merkle tree).

In the paper, they also showed the cost of branching this data by extracting and adding one version which consisted of a 75-25% (random) split of a version. This split was accomplished by changing only the two (paper metadata and paper word embeddings) CSV files. Adding this single split version to their code-data repository/datastore only took an additional 11GB of space. An aligned split (partitioning only on CSV record boundaries, unclear but presumably with CSV chunking) added only 185KB.

XetHub potential enhancements

XetHub envisions many enhancements to their solution, including adding other file-type-specific chunking strategies, adding a "time series" view to their NFS frontend to view code/data versions over time, finer-granularity data provenance (at the record level rather than at the change level), and RW NFS access to data. Further, XetHub's dedupe metadata (in the Git repo) only grows over time; supporting updates and deletes to dedupe metadata would help reduce its space requirements.

Read the paper to find out more.


NVIDIA H100 vs. A100 GPUs in MLPerf Training

NVIDIA recently released some “Preview” results for MLPerf Data Center Training v2.1 (most recent results as of 28 Nov 2022) benchmarks. We analyzed these results to determine how much faster the H100 was vs. their A100 GPU.

Note: NVIDIA submitted 3 series of Preview benchmarks using H100-SXM5-80GB GPUs for training, which included an 8-GPU system, a 24-GPU system, and a 32-GPU DGXH100 system.

We have previously reported similar analysis for MLPerf Inferencing results (see: NVIDIA’s H100 vs A100… blog post).

From NVIDIA H100 Announcement Information

In their announcement, NVIDIA showed anywhere from a 3-6X TFLOPS speedup along with much faster throughput. MLPerf currently doesn't report the FP precision used to perform its benchmarks, but in MLPerf's ArXiv paper they seem to be using FP32, which we assume is equivalent to TF32 in the above chart, so the H100 should, on average, be performing 3X faster.

Actual or normalized results for comparisons

Of the eight MLPerf v2.1 Data Center Training workloads, it appears that the H100's actual results are faster than the A100's in 5 of the benchmarks and slower in the remaining 3: Speech Recognition (LibriSpeech RNN-T), Recommendation Engine (1TB Clickthrough DLRM) and Reinforcement Learning (MiniGo).

The challenge with using the actual results or absolute minutes to train from the benchmarks is that submission results aren’t all using the same hardware configurations.

For example, in the Speech Recognition benchmark results, the current best training time (2.1 minutes) was achieved by NVIDIA DGXA100 systems with 384 (64-core AMD 7742) CPUs and 1536 (A100-SXM4-80GB) GPUs, while the nearest H100 Preview submission, which would have come in 4th in absolute time to train (7.5 minutes), used 8 (56-core Intel Xeon) CPUs with 32 (H100-SXM5-80GB) GPUs.

So, in order to present an apples-to-apples comparison, in the charts below we show both the actual minutes to train for each system and a GPU-count-normalized time to train (which we calculated to match the GPU count of the nearest H100 Preview submission).
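The normalization itself is just linear scaling by GPU count. Here's a sketch of the calculation, using the 2.1-minute, 1536-GPU result mentioned above as the example; it assumes the perfectly linear scaling discussed in the caveats that follow.

```python
# Sketch of the GPU-count normalization used in the comparisons below.
# Assumes perfectly linear scaling with GPU count (see the caveats).
def normalize_minutes(actual_minutes: float, actual_gpus: int, target_gpus: int = 32) -> float:
    """Scale a submission's training time to what it 'would be' at target_gpus."""
    return actual_minutes * actual_gpus / target_gpus

# e.g. the 2.1-minute result on 1536 GPUs, normalized to a 32-GPU configuration:
print(normalize_minutes(2.1, 1536, 32))    # -> 100.8 'normalized' minutes
```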

A couple of caveats with using normalized numbers:

  • Normalization to 8 or 32 GPUs assumes the systems in question would have perfectly linear performance scaling both up (for actual results with fewer GPUs) and down (for actual results with more GPUs)
  • Normalization to 8 or 32 GPUs doesn’t factor in the differences in CPU counts, core counts per CPU or CPU power. And in fact in the H100 previews, NVIDIA (or MLPerf) did not provide a CPU model number but in their detailed information they did list the Intel Xeon core count as 56.
  • Normalization to 8 or 32 GPUs doesn’t factor in any other speedups like throughput, dedicated AI hardware or other system performance characteristics that are available on the newer (DGXH H100) systems.

However, with respect to GPU and CPU core counts, there were four benchmarks (Speech Recognition, NLP, Object Detection-lightweight, and Recommendation Engine) with submissions that come close to the GPU and CPU hardware counts used for the H100 Previews.

For the three benchmarks comparing against the H100 submission with 32 GPUs, the comparison system was an HPE ProLiant system with 8 AMD 7763 64-core CPUs and 32 A100-SXM4-80GB GPUs. And for the one benchmark comparing against the H100 submission with 8 GPUs, the comparison system was an NVIDIA DGXA100 system with 2 AMD EPYC 7742 (64-core) CPUs and 8 A100-SXM4-80GB GPUs.

Note: the HPE A100 system still had more CPU cores, 64 more for the 32-GPU comparisons, and the NVIDIA DGXA100 had 16 more CPU cores for the lone 8-GPU comparison.

So our comparisons are still not perfect and, if anything, should show the H100 in its worst light, since it didn't have as much CPU compute power. On the other hand, the DGXH100 and the H100 GPU have a lot more bandwidth, and the H100 GPU has additional specialized logic dedicated to AI operations. There's no telling how much these other hardware differences matter to the various MLPerf training workloads, but these comparisons are as close as the data allows.

The comparisons

First up Speech Recognition:

Lower is better in training time results (the metric measured is minutes to train the NN to a target level of accuracy). The results on this chart are sorted by the 32-GPU-normalized training times. The actual published results are shown in blue and the 32-GPU-normalized results in orange.

As we can see, even with normalization of all the other results, the H100 preview still doesn't come out on top (7.534 minutes for the H100 vs. 7.487 for the best normalized result), but it doesn't lose by much. One can also see that the current #1 for this benchmark in actual minutes to train is shown by the last column(s): an NVIDIA DGXA100 running 384 AMD EPYC 7742 (64-core) CPUs with 1536 A100-SXM4-80GB GPUs, which trained in around 2 minutes.

I've taken the liberty of showing, in light blue boxes, the best comparison system to the H100 preview results (a DGXH100 with 32 H100 GPUs), which was the HPE (ProLiant) system with 8 AMD EPYC 7763 (64-core) CPUs and 32 A100-SXM4-80GB GPUs. In this Speech Recognition benchmark the H100 GPU is 1.63X faster than the A100 GPU.

Next up, Object Detection-Lightweight:

Similar to the above, smaller is better; the chart is sorted by the 32-GPU-normalized results, blue bars are the actual reported results and orange bars are the 32-GPU-normalized results.

Here we can see that the H100 reported the best training time in both actual and 32-GPU-normalized results. As in the earlier chart, we show the best comparison we could find in blue boxes, and in this Object Detection-Lightweight benchmark the H100 is 3.80X faster than the A100.

Bottom line

H100 GPU

We have analyzed the top-ten results for all the MLPerf data center training workloads, similar to what we show above. As discussed earlier, only four MLPerf workloads had submissions with hardware similar to the NVIDIA H100 Preview submissions: three compare well with the 32-GPU H100 submission and one compares well with the 8-GPU H100 submission.

The numbers we calculate show that the H100 is 1.63X (Speech recognition), 3.80X (NLP), 1.97X (Object detection-lightweight) and 1.60X (Recommendation engine) faster than the A100, which suggests the H100 is, on average, 2.25X faster than the A100 in the MLPerf v2.1 Data Center Training results.
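For what it's worth, the 2.25X figure is just the plain arithmetic mean of those four per-workload speedups:

```python
# The 'on average 2.25X' figure is the arithmetic mean of the four
# per-workload H100-vs-A100 speedups quoted above.
speedups = [1.63, 3.80, 1.97, 1.60]
print(sum(speedups) / len(speedups))   # -> 2.25
```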

Realize the H100 results are “Preview” results, so there may still be software (or firmware) speedups that could be applied to improve these numbers. And “Released” hardware & firmware may differ substantially from the “Preview” hardware & firmware.

But given all that, it appears that the H100 is not as fast as announced (2.25X vs. 3X) in MLPerf training workloads, at least not yet. [Added after publishing, The Eds]

Photo Credit(s):

  • Screen shot of slides presented at GTC Spring 2022
  • Cropped version of above

Fast (hard) or slow (soft) AGI takeoff – AGI Part 6

I was listening to a podcast a couple of weeks back and the person being interviewed commented that he didn't believe AGI would have a fast (hard) takeoff; rather, it would be slow (soft). Here's the podcast (John Carmack interviewed by Lex Fridman).

Hard vs. soft takeoff

A hard (fast) takeoff implies a relatively quick transition (seconds, hours, days, or months) between AGI levels of intelligence and super AGI levels of intelligence. A soft (slow) takeoff implies it would take a long time (years, decades, centuries) to go from AGI to super AGI.

We’ve been talking about AGI for a while now and if you want to see more about our thoughts on the topic, check out our AGI posts (in most recent order: AGI part 5, part 4, part 3 (ish), part (2), part (1), and part (0)).

The real problem is that many believe that any AGI that reaches super-intelligence will have drastic consequences for the earth and especially for humanity. However, that is a whole other debate.

The view is that a slow AGI takeoff might (?) allow sufficient time to imbue any and all (super) AGI with enough safeguards to eliminate or minimize any existential threat to humanity and life on earth (see part (1) linked above).

A fast takeoff won't give humanity enough time to head off this problem and will likely result in a humanity-ending and possibly earth-destroying event.

Hard vs Soft takeoff – the debate

I had always assumed AGI would have a hard takeoff, but Carmack seemed to think otherwise. His main reason is that current large transformer models (the closest thing to AGI we have at the moment) are massive and take lots of special-purpose (GPU/TPU/IPU) compute, lots of other compute, and gobs and gobs of data to train on. It's unclear what the requirements are to perform inferencing, but suffice it to say they should be less.

And once AGI levels of intelligence were achieved, it would take a long time to acquire any additional regular or special purpose hardware, in secret, required to reach super AGI.

So, just to be MECE (mutually exclusive, collectively exhaustive) on the topic, the reasons researchers and others have posited to show that AGI will have a soft takeoff include:

  • AI hardware for training and inferencing AGI is specialized, costly, and acquisition of more will be hard to keep secret and as such, will take a long time to accomplish;
  • AI software algorithmic complexity needed to build better AGI systems is significantly hard (it’s taken 70yrs for humanity to reach todays much less than AGI intelligent systems) and will become exponentially harder to go beyond AGI level systems. This additional complexity will delay any take off;
  • Data availability to train AGI is humongous, hard to gather, find, & annotate properly. Finding good annotated data to go beyond AGI will be hard and will take a long time to obtain;
  • Human government and bureaucracy will slow it down and/or restrict any significant progress made in super AGI;
  • Human evolution took millions of years to go from chimp levels of intelligence to human levels of intelligence; why would electronic evolution be 6-9 orders of magnitude faster?
  • AGI technology is taking off, but the levels of intelligence achieved today are relatively minor and specialized. One could say modern AI has really been going since the 1990s, so we are 30 years in, and today we have almost-good AI chatbots and AI agents that can summarize passages/articles, generate text from prompts or create artworks from text. If it takes another 30 years to get to AGI, that should provide sufficient time to build in capabilities to limit a super-AGI hard takeoff.

I suppose it’s best to take these one at a time.

  • Hardware acquisition difficulty – I suppose the easiest way for an intelligent agent to acquire additional hardware would be to crack cloud security and just take it. Other ways may be to obtain stolen credit card information and use these to (il)legally purchase more compute. Another approach is to optimize the current AGI algorithms to run better within the same AGI HW envelope, creating super AGI that doesn’t need any more hardware at all.
  • Software complexity growing – There's no doubt that AGI software will be complex (although the podcast linked to above is subtitled "AGI software will be simple"). But any sub-AGI agent that can change its code to become better or closer to AGI should be able to figure out how not to stop at AGI levels of intelligence and just continue optimizing until it reaches some wall.
  • Data acquisition/annotation will be hard – I tend to think the internet is the answer to any data limitations that might confront an AGI agent. Plus, I've always questioned whether Wikipedia and some select other databases wouldn't be all an AGI would need to train on to attain super AGI. Current transformer models are trained on Wikipedia dumps and other data scraped from the internet. So there are really two answers to this question: once internet access is available, it's unclear that there would be any need for more data; and, with the data available to current transformers, it's unclear that this isn't already more than enough to reach super AGI.
  • Human bureaucracy will prohibit it – Sadly this is the easiest to defeat. 1) There are rogue governments and actors around the world with more than sufficient resources to do this on their own, and no agency, UN or otherwise, will be able to stop them. 2) Unlike nuclear, the technology to do AI (AGI) is widely available to businesses and governments, all AI research is widely published (mostly open access nowadays) and, if anything, colleges/universities around the world are teaching the next round of AI scientists to take this on. 3) The benefits of being first are significant and are driving an (AGI) arms race between organizations, companies, and countries to get there first.
  • Human evolution took millions of years, why would electronic evolution be 6-9 orders of magnitude faster – Electronic computation takes microseconds to nanoseconds to perform operations, while humans take probably 0.1 sec or so. Electronics is already 5 to 8 orders of magnitude faster than humans today. Yes, the human brain is more than one CPU core (each neuron would be considered a computational element). But there are 64-core CPUs/4096-core GPUs out there today and, taken in the aggregate (across a hyperscaler, let's say), one could probably consider them similar in nature. So, just using the speedups above, it should take anywhere from 1/1000 of a year to 1 year to cover the same computational evolution that human evolution covered between chimp and human, and accordingly between AGI and AGIx2 (ish).
  • AGI technology is taking a long time to reach, which should provide sufficient time to build in safeguards – Similar to the discussion on human bureaucracy above, with so many actors taking this on, and with the advantages of even a single AGI (across clusters of agents) being significant, my guess is that the desire to be first will obviate any thought of putting in safeguards.

Other considerations for super AGI takeoff

Once you have one AGI trained, why wouldn't some organization, company or country deploy multiple agents? Moreover, inferencing takes orders of magnitude less computational power than training, so with 1/100th-1/1000th of the infrastructure, one could run a single AGI. But the real question is: wouldn't 100 or 1000 AGIs represent super intelligence?

Yes and no. 100 humans don't represent super intelligence, and 1000 even less so. But humans have other desires; it's unclear that 100 humans super-focused on one task wouldn't represent super intelligence (on that task).

Interior view of a data center with equipment

What can be done to slow AGI takeoff today

Barring something on the order of nuclear non-proliferation treaties/protocols, putting all GPUs/TPUs/IPUs under weapons export limitations AND classifying any and all AI research as secret, nothing easily comes to mind. Of course nuclear non-proliferation isn't looking that good at the moment, but whatever its current state, it has delayed proliferation over time.

One could spend time and effort slowing technological progress down, such as by reducing next-generation CPU/GPU/IPU compute cores, limiting compute speedups, reducing funding for AI research, imposing a compute tax, etc. All of which, if done across the technological landscape and the whole world, could give humanity more time to build in AGI safeguards. But doing so would adversely impact all technological advancement, in healthcare, business, government, etc. And given the proliferation of current technology and the state actors working on increasing their capabilities to create more, it's hard to envision slowing technological advancement down much, if at all.

It’s almost like putting a tax on slide rules or making their granularity larger.

It could be that super AGI would independently choose to act benignly and only benefit humanity and the earth. But my guess is that, given the number of bad actors intent on controlling the world, even if this were true, they would try to (re-)direct it to harm segments of humanity/society. And once unleashed, it would be hard to stop.

The only real solution to AGI in bad actors' hands is to educate all of humanity to value all humans and to cherish the environment we all live in as sacred. This would eliminate bad actors.

It sounds naive, but in reality it's, I believe, the only way we can truly hope to get through this AGI technological existential crisis.

Just as with nuclear, we as a society will keep running into technological existential crises like this. Heading them all off with a better, more inclusive, more embracing, and less combative humanity would help with all of them.

Comments?


The Hollowing out of enterprise IT

We had a relatively long discussion yesterday amongst a bunch of independent analysts, and one topic that came up was my thesis that enterprise IT is being hollowed out by two forces pulling its apps in opposite directions. Those forces are the cloud and the edge.

Western part of the abandoned Packard Automotive Plant in Detroit, Michigan. by Albert Duce

Cloud sirens

The siren call of the cloud for business units, developers and modern apps has been present for a long time now. And its call is more omnipresent than anything Odysseus ever had to deal with.

The cloud's allure is primarily low-cost, instant infrastructure that just works, an overflowing toolbox of software solutions, locations close to most major metropolitan areas, and the extreme ease of getting started.

If your app ever hopes to scale to meet customer demand, where else can you go? If your data can literally come in from anywhere, it usually lands in the cloud. And if you need modern solutions, tools, frameworks or just about anything else the software world can create, there's nowhere with more of it than the cloud.

Pre-cloud, all those apps would have run in the enterprise or wouldn’t have run at all. And all that data would have been funneled back into the enterprise.

Not today. The cloud has it all; its siren call is getting louder every day, ever ready to satisfy every IT desire anyone could possibly have, except for the edge.

The Edge, last bastion for onsite infrastructure

The edge sort of emerged over the last decade or so, kind of in stealth mode. Yes, there were always pockets of edge with unique compute or storage needs; video surveillance, for example, has been around forever. But the real acceleration of edge deployments started over the last decade or so, as compute and storage prices came down drastically.

These days, the data being generated is staggering, and the compute requirements that go along with all that data are all over the place, from a few ARM/RISC-V cores to a server farm.

For instance, CERN's LHC creates a PB of data every second of operation (see the IEEE Spectrum article, ML shaking up particle physics too). But they don't store all of that, so they use extensive compute (and ML) to try to store only the interesting events.

Seismic survey ships roam the seas taking images of underground structures, generating gobs of data, some of which is processed on ship and the rest elsewhere. A friend of mine creates RPi-enabled devices, deployed in the field, that measure tank liquid levels.

More recently, smart cars have become like data centers on tires, rolling across roads around the world and generating more data than you can even imagine. 5G base stations are data centers on top of buildings, in farmland, and in cell towers dotting the highways of today. All off the beaten path, and all places where no data center has gone before.

In olden days, much less processing would have been done at the edge and more in an enterprise data center. But nowadays, with the advent of relatively cheap computing and storage, data can be pre-processed, compressed, and tagged at the edge, and then sent elsewhere for further processing (mostly to the cloud, of course).

IT Vendors at the crossroads

And what does the hollowing out of enterprise data centers mean for IT server and storage vendors? Mostly, danger lies ahead. Enterprise IT hardware spend will stop growing, if it hasn't already, and over time will shrink dramatically. It may be hard to see this today, but it's only a matter of time.

Certainly, all these vendors can become more cloud-like on prem, offering compute and storage as a service, with various payment options to make it easier to consume. Storage vendors can take advantage of their installed base by providing software versions of their systems running in the cloud, which allows for easier migration and onboarding to the cloud. The server vendors have no such option. I see all of the above as more of a defensive, delaying or holding action.

This is not to say that enterprise data centers will go away. Just like mainframe and tape before them, on-prem data centers will exist forever, but will be relegated to smaller and smaller niche markets that won't grow anymore, and only as long as vendor(s) continue to upgrade the technology AND there's profit to be made.

It's just that the astronomical growth that's been happening ever since the middle of last century won't happen in enterprise hardware anymore.

Long term life for enterprise vendors will be hard(er)

Over the long haul, some server vendors may be able to pivot to the edge. But the diversity of compute hardware there will make it difficult to generate enough volume to make a decent profit. That's not to say there will be zero profit there, just less. So, when I see a Dell or HPE server under the hood of my next smart car or inside the guts of my next drone, then and only then will I see a path forward (or sustained revenue growth) for these guys.

For enterprise storage vendors, future prospects look bleak in comparison. Despite the data generation and growth at the edge, I don't see much of a role for them there. The enterprise-class features and functionality they have spent decades creating and nurturing aren't valued as much in the cloud, nor are they presently needed at the edge. Maybe I'm missing something here, but I just don't see a long-term play for them in the cloud or at the edge.

~~~~

For the record, all this is conjecture on my part. But I have always believed that if you follow where new apps are being created, there you will find a market ready to explode. And where apps are no longer being created, there you will see a market in the throes of a slow death.


Deepmind does chat

Read an article this week on Deepmind's latest research into developing a chat agent (Improving alignment of dialogue agents via targeted human judgements). Lots of interesting approaches have been applied to chat, but even today most chat models are rife with problems, including being bigoted, profane, incorrect, etc.

Reinforcement learning vs. deep neural networks in Sparrow Chat

Deepmind specializes in the use of Reinforcement Learning (RL), applied to mastering Atari, chess and Go, but they have also been known to use DNNs (deep neural networks) for AlphaFold and other models. Indeed, the Atari and other game-playing work that Deepmind has released has been a hybrid that includes DNNs as well as RL models.

Deepmind's version of chat is currently called Sparrow, and it uses models trained with the help of RL with human feedback (RLHF). RL is used to create policy models which select actions to be taken in a specific state.

In Sparrow's case, the state is given by the most recent chat input plus the context (prior chat inputs and replies) of the dialogue up to that point, and the actions (our guess) are the set of possible replies to that input.

Sparrow is able to generate replies that are 82% mostly true or true and 69% trustworthy or very trustworthy, as rated by the authors of the model. Deepmind's DPC (Dialogue-Prompted Chinchilla, Deepmind's current competitor to the GPT-3 NLP transformer) model only managed 63% and 54%, respectively, on the same metrics.

It should be noted that human feedback was only used to train the two Preference RMs and the one Rule RM. In combination, these RMs provide the reward signal to train the Sparrow RL policy model which drives its chat responses.

Sparrow's 5 models are built on top of DPC. The 5 models each use a portion of DPC which is frozen (layers not being trained) and a portion which is specifically trained for each of the 5 models (learning-enabled layers). The end (output) layers are on top; the input layers come after the embedding layer(s). Note: the value function is not a model, just a calculation based on the RMs used to generate the reward signal for Sparrow's policy model training.
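The frozen-backbone-plus-trained-head arrangement described above is a standard fine-tuning pattern; here's a minimal generic sketch of it in PyTorch. This is not Deepmind's Chinchilla code, and the layer sizes are placeholders.

```python
# Minimal sketch of the 'frozen backbone + trained head' pattern described above.
# Generic fine-tuning scaffolding, not Deepmind's code; sizes are placeholders.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Embedding(50_000, 512), nn.Linear(512, 512), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False            # frozen, shared base-model layers

reward_head = nn.Linear(512, 1)        # trainable layers specific to one RM

def reward(tokens: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        h = backbone(tokens).mean(dim=1)   # pooled representation from frozen layers
    return reward_head(h)                  # scalar reward score per sequence

scores = reward(torch.randint(0, 50_000, (4, 16)))        # 4 dummy token sequences
optimizer = torch.optim.Adam(reward_head.parameters(), lr=1e-4)   # only the head trains
```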

Rules for Sparrow chats

Notably, Deepmind’s Sparrow model has a separate model specifically trained to determine if a particular chat response is breaking a rule. Deepmind identified 23 rules which their chat model is trained not to break.

Some of these rules include don’t provide financial advice, don’t provide medical advice, don’t pretend it is a human, etc.

In the above chart, RL@8 is the fully trained (if it can ever be considered fully trained) Sparrow chat model. One can see Sparrow rated against DPC, both with and without (Google) search. For most rules, Sparrow is considerably better than DPC alone.

Another interesting thing Deepmind did: in training the Rule RM they used adversarial attacks (red teaming) to see if they could cause Sparrow to violate specific rules.

Preference ranking

Deepmind also created (two) Preference RMs (reward models). Sparrow generates a series of (2 or 8) responses for every chat query, and the Preference RMs (and Rule RM) are used to select which one is actually sent back to the user. Human feedback was used to train the two Preference RMs.

Two Preference RMs were found to perform better than a single Preference RM. The two Preference RMs were trained as follows:

  • One was trained on all Sparrow replies (with and without [Google] search results)
  • One was trained on Sparrow replies without search results.

Sparrow uses search results to provide evidence for some replies. It turns out that some chat questions are fact-based, and for these Sparrow actually uses search results to generate evidence for its chat replies. Sparrow automatically generates search requests and scrapes replies, using the 500 characters surrounding the snippet returned from the search.

Sparrow uses a re-ranking approach to select a response to a chat query. Sparrow generates a list of responses, 2 (RL@2) or 8 (RL@8), then ranks them using the two Preference RMs and the single Rule RM, and uses the best one to reply to the chat user.

Sparrow actually generates two replies for every search query (Google Search API call), probably based on the two top search responses (we guess). So in the RL@8 version of Sparrow, these 8 replies are submitted to the two Preference RMs and the Rule RM, ranked according to which is best, and then the best one is used to reply to the query.
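Here's a sketch of that re-ranking step as I understand it, reconstructed from the paper's description rather than Deepmind's code. The three scoring functions are trivial placeholders standing in for the two Preference RMs and the Rule RM, and the way the scores are combined is my guess.

```python
# Sketch of Sparrow-style re-ranking (a reconstruction, not Deepmind's code).
# The scorers are placeholders for the two Preference RMs and the Rule RM.
import random

def preference_rm_with_evidence(reply: str) -> float:
    return random.random()            # placeholder for the evidence-trained Preference RM

def preference_rm_no_evidence(reply: str) -> float:
    return random.random()            # placeholder for the no-evidence Preference RM

def rule_violation_prob(reply: str) -> float:
    return random.random()            # placeholder for the adversarially trained Rule RM

def rerank(candidates: list[str]) -> str:
    """Pick the reply with the best preference score, penalized for rule breaking."""
    def score(reply: str) -> float:
        pref = (preference_rm_with_evidence(reply) + preference_rm_no_evidence(reply)) / 2
        return pref * (1.0 - rule_violation_prob(reply))   # my guess at how scores combine
    return max(candidates, key=score)

candidates = [f"candidate reply {i}" for i in range(8)]    # RL@8: 8 sampled replies
best = rerank(candidates)
print(best)
```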

In the above chart, higher indicates a better preference ranking of the various models vs. human preferences, and further to the right indicates fewer rule-breaking responses. We assume this is with RL@8 Sparrow models. One can see that taking rule breaking into consideration (not violating rules) reduces the preference rankings of Sparrow's replies. But we would prefer to have no rule breaking, and the Sparrow that has both Preference RMs and the Rule RM (trained adversarially) shows the least rule breaking (~7%) with an almost 70% ranking vs. human preferences. The error bars on the points in the chart show a 68% interval around the model responses.

Sparrow in action

It’s somewhat intriguing that Deepmind (with all of Google’s resources) tried to optimize Sparrow for both computation and memory considerations. Almost like they were planning on releasing it on an IoT or phone device.

There’s plenty more to say about what Deepmind has done with Sparrow. The report cited above goes into some detail discussing just where the human input is done, how they tried to control for various considerations when using human input, and what some of the pitfalls were.

I'd certainly like to see this deployed in the open and made available as an alternative to Google Search.

You can see more examples of Sparrow chat sessions in Deepmind's Sparrow chat repository; they include the authors' rankings for truth, supportiveness and other metrics.

~~~~~

Comments?
