NVIDIA’s H100 vs A100, the good and bad news

It turns out that only the current MLPerf v2.1 Data Center Inferencing results include both NVIDIA's new Hopper H100 and the prior-generation A100 GPUs on similar workloads, so we can compare their performance. The Hopper (H100) results are listed as CATEGORY: Preview, so final results may vary from these numbers (but, we believe, not by much).

The H100 Preview results used only a single H100-SXM(5)-80GB GPU, whereas most of the other Data Center Inferencing submissions used 8 or more A100-SXM(4)-80GB GPUs. In fact, all the other top 10 results in the charted data below used 8 A100 GPUs.

The H100 is more than twice as fast as the A100 for NLP inferencing

In order to have an apples-to-apples comparison of the H100 against the A100, we have taken the liberty of multiplying the single-H100 results by 8, to show what an H100 system could have done with comparable GPU hardware, had it scaled up (to at least 8 GPUs) linearly.

For example, on the NLP inferencing benchmark, the preview category test with a single H100 GPU achieved 7,593.54 server inference queries per second. To compare that GPU workload against the 8-GPU A100 submissions, we multiplied this by 8, which gives us 60,748.32 server inference queries per second.

Of course, an H100 system could scale up WORSE than linearly, which would yield lower results than we project, but it is very unlikely that it could scale up BETTER than linearly and show higher results. Then again, I've been known to be wrong before. We could just as easily have divided the A100 results by 8, but didn't.
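Here's a minimal sketch of that projection in Python. The single-GPU number is the one quoted above from the Preview submission; the scaling-efficiency parameter is our own addition (not part of the MLPerf results), and the post's projection simply assumes an efficiency of 1.0, i.e. perfectly linear scaling.

```python
# Minimal sketch of the linear-scaling projection used above.
# Assumption: an 8-GPU H100 system would scale the single-GPU result linearly;
# real systems may well scale worse, which is what scaling_efficiency models.

def project_multi_gpu(single_gpu_qps: float, num_gpus: int = 8,
                      scaling_efficiency: float = 1.0) -> float:
    """Project multi-GPU throughput from a single-GPU MLPerf server result."""
    return single_gpu_qps * num_gpus * scaling_efficiency

h100_nlp_qps = 7_593.54  # single H100 NLP (server) result from the Preview submission
print(project_multi_gpu(h100_nlp_qps))           # 60748.32 with perfectly linear scaling
print(project_multi_gpu(h100_nlp_qps, 8, 0.9))   # ~54673.5 if scaling is only 90% efficient
```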

This hypothetical H100 * 8 result is shown on the charts in Yellow. And just for comparison purposes, we show the actual single H100 (*1) result in Orange on the charts as well.

The remaining columns in the chart are the current top 10 results in the CATEGORY: Available bucket for NLP server inference queries per second.

On the chart, higher is better. Of all the Data Center Inferencing results, NLP shows the H100 off in the best light. We project that 8 H100s would more than double the inference queries done per second (~60K queries/sec) vs. the #1 Nettrix-X660G45L (8x A100-SXM4-80GB, TensorRT), which achieved ~27K queries/sec on NLP inferencing.

The H100 is slower than the A100 on Recommendation inferencing

Next we look at the Recommendation engine inferencing results, which show the H100 in the worst light when compared to the A100s.

Similar to the above, higher is better and the metric is (online) server inference queries per second.

We project that 8 H100s would perform a little over 2.5M recommendation engine inference queries/sec, worse than the top two 8-A100 systems, both of which achieved ~2.6M inference queries/sec. The #1 is the same Nettrix-X660G45L (8x A100-SXM(4)-80GB, TensorRT) and the #2 ranked Recommendation Engine inferencing solution is the Inspur-NF5688M6 (8x A100-SXM(4)-80GB, TensorRT).
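For what it's worth, here's the rough size of that gap. The ~2.5M and ~2.6M figures below are the approximate values cited above, not the exact MLPerf submission numbers, so treat the result as back-of-the-envelope only.

```python
# Rough comparison of projected 8x H100 DLRM throughput vs. the published
# 8x A100 results, using the approximate figures cited in the text.

projected_8x_h100_qps = 2.5e6   # ~2.5M queries/sec, projected with linear scaling
top_8x_a100_qps = 2.6e6         # ~2.6M queries/sec (Nettrix and Inspur 8x A100 systems)

gap = (top_8x_a100_qps - projected_8x_h100_qps) / top_8x_a100_qps
print(f"Projected 8x H100 trails the 8x A100 leaders by roughly {gap:.0%}")  # ~4%
```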

We must say that the projected 8-H100 system would have performed better than the #1 ranked system in all the other Data Center Inferencing benchmarks. In some cases, as shown above, significantly (over 2X) better.

The H100 Preview benchmarks all used a single AMD EPYC 7252 8-core processor. Many of the other submissions used Intel Xeon Platinum CPUs (8368Q [38-core], 8380 [40-core], 8358 [32-core] and others), and two CPUs rather than just one. So, by multiplying the single-H100, single-AMD-EPYC performance by 8, we are effectively projecting the performance of a configuration with 8 CPU chips and 64 cores in total.

We're not sure why recommendation engine inferencing would be worse than NLP for H100 GPUs. We thought at first it was a CPU-intensive workload, but as noted above, 64 (8 x 8 cores/chip) AMD cores vs. 64 to 80 (2x32, 2x38, 2x40) Intel cores seems roughly similar in performance (again, I've been wrong before).

Given all that, we surmise that something else is holding the H100s back. It doesn't appear to be memory, as both the H100s and A100s have 80GB. Both are PCIe attached to their hosts; in fact, the H100s use PCIe Gen 5 and the A100s PCIe Gen 4, so if anything the H100s should have 2X the host bandwidth of the A100s.
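As a quick sanity check on that 2X host-bandwidth claim, here's the theoretical per-direction bandwidth of a x16 link for each generation. This is our own arithmetic from the PCIe signaling rates, not anything reported in the MLPerf submissions.

```python
# Theoretical per-direction bandwidth of a x16 PCIe link.
# Gen 4 runs 16 GT/s per lane, Gen 5 runs 32 GT/s per lane, both with 128b/130b encoding.

def x16_bandwidth_gbs(gigatransfers_per_sec: float, lanes: int = 16) -> float:
    """Theoretical per-direction PCIe bandwidth in GB/s."""
    encoding = 128 / 130                      # 128b/130b line-encoding overhead
    bits_per_sec = gigatransfers_per_sec * 1e9 * encoding * lanes
    return bits_per_sec / 8 / 1e9

print(f"PCIe Gen 4 x16: ~{x16_bandwidth_gbs(16):.1f} GB/s per direction")  # ~31.5 GB/s
print(f"PCIe Gen 5 x16: ~{x16_bandwidth_gbs(32):.1f} GB/s per direction")  # ~63.0 GB/s
```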

It’s got to be something about the peculiarities of Recommendation Engine inferencing that doesn’t work as well on H100 as it does on A100s.

Earlier this year we wrote a dispatch on NVIDIA’s H100 announcement and compared the H100 to the A100. Here is a quote from that dispatch on the H100 announcement:
“… with respect to the previous generation A100, each H100 GPU SM is:
• Up to 6X faster in chip-to-chip performance, this includes higher SM counts, faster SMs, and higher clock rate
• Up to 2x faster in Matrix Multiply Accumulate instruction performance,
• Up to 4X faster in Matrix Multiply Accumulate for FP8 on H100 vs. FP16 on the A100.

In addition, the H100 has DPX instructions for faster dynamic programming used in genomics, which is 7X faster than A100. It also has 3X faster IEEE FP64 and FP32 arithmetic over the A100, more (1.3X) shared memory, a new asynchronous execution engine, new Tensor Memory Accelerator functionality, and a new distributed shared memory with direct SM to SM data transfers."

We suspect that the new asynchronous execution engines aren’t working well with the recommendation engine inferencing instruction flow or the TMAs aren’t working well with the recommendation engine’s (GPU) working set.

It's unclear why H100 shared memory or SM-to-SM data transfers would be the bottleneck, but we really don't know for sure.

It's our belief that the problems could just be minor optimizations that didn't go the right way and could potentially be fixed in (GPU) firmware, CUDA software or, worst case, new silicon.

So, in general, although the H100 is reported to be 2X-6X faster than the A100, we don't see more than a 2X speedup in any Data Center Inferencing benchmark. And in one case, we see a slight deterioration.

We'd need to see similar results for training activity to come up with a wider depiction of H100 vs. A100 performance, but at the moment it's a good, but not that good, speedup.

~~~~

Comments?

Picture/Graphic Credit(s):

AI ML DL hardware performance results from MLPerf

We read an article a couple of weeks back in IEEE Spectrum, New Records for AI Training, which discussed recent MLPerf v0.7 performance results. The article mentioned that MLPerf performance on its benchmarks has increased by ~2.7X in the last year alone.

The MLPerf organization was started back in 2018 to supply machine learning workload performance results, somewhat like what SPEC and TPC did for NFS and transaction processing. The MLPerf organization documented their philosophy in a paper.


As far as I can tell, MLPerf is the only benchmark currently available to show hardware system performance on AI training and inferencing. Below we report on MLPerf training results.

MLPerf also reports on both closed and open division benchmark results. Closed division submissions all use the same software algorithms for each workload, so one can compare workload performance across different hardware systems. Open division results can make use of any algorithm to achieve the desired results on the problem set. We report on MLPerf closed division results below.

Current MLPerf v0.7 (open and closed division) training results are available online (on GitHub) and are summarized in a training results page on their web site.

MLPerf v0.7 workload changes

The MLPerf team added a few new workloads and upped the game on another benchmark for v0.7:

  • Recommendation DLRM: a replacement for the recommendation workload used in MLPerf v0.6; it comes from Facebook and provides more parallelism in training for recommendations.
  • Wikipedia BERT: an addition to the workloads used in MLPerf v0.6; it is a new natural language processing (NLP) front end, trained on Wikipedia, which is used with other language processing capabilities.
  • Go MiniGo: an enhancement of the MLPerf v0.6 MiniGo accuracy requirements; it uses reinforcement learning to learn to play Go. For v0.7, they now use a full-sized, 19x19 Go board and upped the requirement to a 50% win rate.

MiniGo Results

A couple of items of note for the MiniGo results. There are essentially 3 different architectures represented in the above: NVIDIA DGX series (DGX A100, DGX-2H, DGX-1), Google TPUs (v4 and v3) and Intel (8 server nodes with Cooper Lake-6 CPUs).

Google TPUs are considered internal and are only available to Google, its hardware partners, or on GCP. Although MLPerf includes GCP TPU system results for other workloads, there were none submitted for MiniGo.

The Intel system is a preview of their latest-gen Cooper Lake chips, which may not be commercially available yet. On the other hand, all the NVIDIA systems are commercially available and can be deployed in your data center today.

As one can see in the above, NVIDIA systems swept the first 3 positions on our Top 10 MiniGo chart. A DGX A100 came in at #1, reaching a 50% win rate at MiniGo in a mere 17 seconds using 448 CPUs and 1,792 A100 GPUs. Coming in at #2, at 30 seconds, was another DGX A100 system using 64 CPUs and 256 A100 GPUs. And at #3, at 35 seconds, was a DGX-2H system using 64 CPUs and 512 V100 GPUs.

Next, at #4 with 151 seconds, was a Google TPU system with 64 TPUv4 accelerators (it's unclear how many CPUs, if any, were used; the results show 0). Note that an 8-node Intel server cluster with 32 CPUs (4/node) using the latest-gen Cooper Lake (-6) CPUs came in at #7, taking 409 seconds to achieve the training result.
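For a quick sense of the spread, here's a small sketch that tabulates the MiniGo times mentioned above and computes how much longer each system took relative to the #1 DGX A100. The times and accelerator counts are the ones cited in the text; the systems at ranks 5, 6 and 8-10 aren't listed here, so they're omitted.

```python
# MiniGo time-to-train results cited above: (rank, system, accelerators, seconds).
minigo_results = [
    (1, "DGX A100",             "1792x A100 GPUs",       17),
    (2, "DGX A100",             "256x A100 GPUs",        30),
    (3, "DGX-2H",               "512x V100 GPUs",        35),
    (4, "Google TPU system",    "64x TPUv4",            151),
    (7, "8-node Intel cluster", "32x Cooper Lake CPUs", 409),
]

best = minigo_results[0][3]  # the #1 time, 17 seconds
for rank, system, accel, secs in minigo_results:
    print(f"#{rank} {system:<22} {accel:<22} {secs:>4}s  ({secs / best:.1f}x the #1 time)")
```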

There are 6 other MLPerf workloads, including the DLRM and BERT workloads mentioned above. Each of these deserves its own discussion of top ten results. Alas, they will need to wait for another time; I will cover all of them in future posts.

~~~~

Nowadays, with much of IT turning to AI ML DL to provide critical services, it's more important than ever to understand what can and can't be done with available hardware. The fact that one can train a model to play decent Go in 17 seconds on a large DGX A100 cluster, and in under 7 minutes on an 8-node, leading-edge Intel server cluster, is pretty impressive.

Despite MLPerf's best efforts, it's still tough to compare ML performance across systems when there's so much diversity in the underlying hardware, especially in GPU, TPU and CPU counts. IMHO, it would be very useful to have a single-GPU, -TPU, or -CPU system submission requirement for each workload. That way one could compare how well each hardware element performs the workload in isolation.

Nonetheless, the MLPerf suite of benchmarks provides a great first step in understanding what today’s hardware can accomplish in ML training (and inferencing).

Comments?

Photo Credits: