NVIDIA’s H100 vs A100, the good and bad news

It turns out that only the current MLPerf v2.1 Data Center Inferencing results show both NVIDIA's Hopper H100 and the prior-generation NVIDIA A100 GPUs on similar workloads, so we can compare their performance. The Hopper (H100) results are listed as CATEGORY: Preview, so final results may vary from these numbers (but, we believe, not by much).

The H100 Preview results used only a single H100-SXM5-80GB GPU, whereas most of the other Data Center Inferencing results used 8 or more A100-SXM4-80GB GPUs. Indeed, in the charted data below, all of the other top 10 results used 8 A100 GPUs.

The H100 is more than twice as fast as the A100 for NLP inferencing

In order to have an apples-to-apples comparison of the H100 against the A100, we have taken the liberty of multiplying the single-H100 results by 8, to show what an 8-GPU H100 system could have done had it scaled up linearly.

For example, on the NLP inferencing benchmark, the Preview-category test with a single H100 GPU achieved 7,593.54 server inference queries per second. To compare that GPU workload against the 8-GPU A100 systems, we multiplied this by 8, which gives us 60,748.32 server inference queries per second.

Of course, the H100s could scale up WORSE than linearly, which would yield lower results than we project, but it is very unlikely that they could scale up BETTER than linearly and show higher results. (But I've been known to be wrong before.) We could just as easily have divided the A100 results by 8, but didn't.
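For those who want the (admittedly simple) math spelled out, here's a minimal sketch of our projection in Python. The per-GPU number is the MLPerf v2.1 Preview result cited above; the function name is ours, and the perfectly-linear-scaling assumption is, as we said, optimistic:

```python
# A minimal sketch of the linear-scaling projection used above.
# The single-GPU figure is the MLPerf v2.1 Preview result; multiplying
# by 8 assumes perfect linear scaling, which is optimistic.

H100_SINGLE_GPU_QPS = 7_593.54  # NLP server inference queries/sec, 1x H100

def project_linear(single_gpu_qps: float, gpu_count: int = 8) -> float:
    """Project multi-GPU throughput assuming perfectly linear scaling."""
    return single_gpu_qps * gpu_count

print(f"Projected 8x H100: {project_linear(H100_SINGLE_GPU_QPS):,.2f} queries/sec")
# Projected 8x H100: 60,748.32 queries/sec
```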

This hypothetical H100×8 result is shown on the charts in yellow. And just for comparison purposes, we show the actual single-H100 (×1) result in orange on the charts as well.

The remaining columns in the chart are the current top 10 in the CATEGORY: Available bucket for NLP server inference queries per second.

On the chart, higher is better. Of all the Data Center Inferencing results, NLP shows the H100 off in the best light. We project that 8 H100s would more than double (~60K queries/sec) the inference queries per second vs. the #1 Nettrix X660G45L (8× A100-SXM4-80GB, TensorRT), which achieved ~27K queries/sec on NLP inferencing.

The H100 is slower than the A100 on Recommendation inferencing

Next we look at the Recommendation engine inferencing results, which show the H100 in the worst light when compared to the A100s.

As above, higher is better, and the metric is (online) server inference queries per second.

We project that 8 H100s would perform a little over 2.5M recommendation engine inference queries/sec, worse than the top two 8-A100 systems, which both achieved ~2.6M inference queries/sec. The #1 is the same Nettrix X660G45L (8× A100-SXM4-80GB, TensorRT), and the #2-ranked Recommendation Engine inferencing solution is the Inspur NF5688M6 (8× A100-SXM4-80GB, TensorRT).

We must say that the projected 8-H100 system would have performed better than the top (#1-ranked) system in all the other Data Center Inferencing benchmarks. In some cases, as shown above, significantly (over 2X) better.

The H100 Preview benchmarks all used a single AMD EPYC 7252 8-core processor. Many of the other submissions used Intel Xeon Platinum CPUs (8368Q [38 cores], 8380 [40 cores], 8358 [32 cores], and others), and two CPUs rather than just one. So, by multiplying the single-H100, single-EPYC performance by 8, we are effectively predicting the performance of an 8-chip, 64-core CPU complement.

We're not sure why recommendation engine inferencing would be worse than NLP for H100 GPUs. We thought at first it was a CPU-intensive workload, but as noted above, 64 AMD cores (8 chips × 8 cores) vs. 64 to 80 Intel cores (2×32, 2×38, 2×40) seems roughly similar in performance (again, I've been wrong before).
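To make the core-count comparison explicit, here's a trivial sketch (the variable names are ours; core counts are from the MLPerf system descriptions cited above):

```python
# Back-of-the-envelope CPU core totals behind the comparison above.

amd_projected_cores = 8 * 8  # 8x EPYC 7252 chips (projected) x 8 cores each = 64

# Two-socket Intel Xeon Platinum configurations used by other submissions
intel_core_totals = {
    "Xeon Platinum 8358": 2 * 32,   # 64 cores
    "Xeon Platinum 8368Q": 2 * 38,  # 76 cores
    "Xeon Platinum 8380": 2 * 40,   # 80 cores
}

print(f"Projected AMD cores: {amd_projected_cores}")  # 64
print(f"Intel core totals:   {intel_core_totals}")    # 64 to 80
```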

Given all that, we surmise that something else is holding the H100s back. It doesn't appear to be memory, as both the H100s and A100s had 80GB. They are both PCIe attached; in fact, the H100s are PCIe gen 5 and the A100s are PCIe gen 4, so, if anything, the H100s should have 2X the bandwidth of the A100s.
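As a sanity check on that 2X claim, here's a rough bandwidth calculation from the standard PCIe link rates (16 GT/s/lane for gen 4, 32 GT/s/lane for gen 5, both with 128b/130b encoding); the helper function is ours and ignores protocol overhead beyond encoding:

```python
# Rough per-direction PCIe x16 bandwidth, GB/s.

def pcie_x16_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
    # GT/s x lanes x 128b/130b encoding efficiency / 8 bits per byte
    return gt_per_s * lanes * (128 / 130) / 8

print(f"PCIe gen 4 x16: ~{pcie_x16_gb_per_s(16):.1f} GB/s")  # ~31.5 GB/s
print(f"PCIe gen 5 x16: ~{pcie_x16_gb_per_s(32):.1f} GB/s")  # ~63.0 GB/s
```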

It’s got to be something about the peculiarities of Recommendation Engine inferencing that doesn’t work as well on H100 as it does on A100s.

Earlier this year we wrote a dispatch on NVIDIA's H100 announcement that compared the H100 to the A100. Here is a quote from that dispatch:
“… with respect to the previous generation A100, each H100 GPU SM is:
• Up to 6X faster in chip-to-chip performance, this includes higher SM counts, faster SMs, and higher clock rate
• Up to 2X faster in Matrix Multiply Accumulate instruction performance
• Up to 4X faster in Matrix Multiply Accumulate for FP8 on H100 vs. FP16 on the A100.

In addition, the H100 has DPX instructions for faster dynamic programming used in genomics, which is 7X faster than the A100. It also has 3X faster IEEE FP64 and FP32 arithmetic over the A100, more (1.3X) shared memory, a new asynchronous execution engine, new Tensor Memory Accelerator (TMA) functionality, and a new distributed shared memory with direct SM-to-SM data transfers.”

We suspect that either the new asynchronous execution engines aren't working well with the recommendation engine inferencing instruction flow, or the TMAs aren't working well with the recommendation engine's (GPU) working set.

It's unclear why H100 shared memory or SM-to-SM data transfers would be the bottleneck, but we really don't know for sure.

It's our belief that the problems could just be minor optimizations that didn't go the right way, and could potentially be fixed in (GPU) firmware or CUDA software, or, worst case, new silicon.

So, in general, although the H100 is, as reported, 2X-6X faster than the A100, we don't see more than a 2X speedup in any Data Center Inferencing benchmark. And in one case, we see a slight deterioration.

We'd need to see similar results for training activity to come up with a wider depiction of H100 vs. A100 performance, but at the moment, it's a good, but not that good, speedup.

~~~~

Comments?

Picture/Graphic Credit(s):