NVIDIA H100 vs. A100 GPUs in MLPerf Training

NVIDIA recently released some “Preview” results for the MLPerf Data Center Training v2.1 benchmarks (the most recent results as of 28 Nov 2022). We analyzed these results to determine how much faster the H100 is than NVIDIA’s A100 GPU.

Note, NVIDIA submitted 3 series of Preview benchmarks using H100-SXM5-80GB GPUs for training: an 8 GPU system, a 24 GPU system, and a 32 GPU DGXH100 system.

We have previously reported similar analysis for MLPerf Inferencing results (see: NVIDIA’s H100 vs A100… blog post).

From NVIDIA H100 Announcement Information

In their announcement, NVIDIA showed a 3-6X TFLOPS speedup, along with much higher throughput. MLPerf currently doesn’t report the floating-point precision used to run its benchmarks, but MLPerf’s arXiv paper suggests they use FP32, which we assume is equivalent to TF32 in the above chart. On that basis, the H100 should, on average, perform 3X faster.

Actual or normalized results for comparisons

Of the eight MLPerf v2.1 Data Center Training workloads, the H100 actual results are faster than the A100 results in 5 benchmarks and slower in the remaining 3: Speech Recognition (LibriSpeech RNN-T), Recommendation Engine (1TB Clickthrough DLRM) and Reinforcement Learning (MiniGo).

The challenge with using the actual results or absolute minutes to train from the benchmarks is that submission results aren’t all using the same hardware configurations.

For example, in the Speech Recognition benchmark results, the current best training time (2.1 minutes) was achieved by NVIDIA DGXA100 systems with 384 (64 core AMD 7742) CPUs and 1536 (A100-SXM4-80GB) GPUs. The nearest H100 Preview submission, which would have come in 4th in absolute time to train (7.5 minutes), used 8 (56 core Intel Xeon) CPUs with 32 (H100-SXM5-80GB) GPUs.

So, to present an apples-to-apples comparison, the charts below show both the actual minutes to train for each system and the time to train normalized to the GPU count of the nearest H100 Preview submission (the normalized values are our own calculation).

A few caveats with using normalized numbers:

  • Normalization to 8 or 32 GPUs assumes the systems in question would scale perfectly linearly, both up (for actual results with fewer GPUs) and down (for actual results with more GPUs).
  • Normalization to 8 or 32 GPUs doesn’t factor in differences in CPU counts, core counts per CPU or CPU power. In fact, for the H100 previews, NVIDIA (or MLPerf) did not provide a CPU model number, although the detailed information did list the Intel Xeon core count as 56.
  • Normalization to 8 or 32 GPUs doesn’t factor in any other speedups, such as throughput, dedicated AI hardware or other system performance characteristics that are available on the newer (DGXH100) systems.
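To make the normalization concrete, here is a minimal Python sketch of the linear-scaling assumption from the first caveat. The function name `normalize_minutes` is ours, for illustration only, and the formula (actual minutes multiplied by the ratio of actual to target GPU count) is our reading of what linear scaling implies, not a method defined by MLPerf. The input is the Speech Recognition leader quoted above (2.1 minutes on 1536 GPUs).

```python
def normalize_minutes(actual_minutes: float, actual_gpus: int, target_gpus: int) -> float:
    """Scale a time-to-train result to a target GPU count.

    Assumes perfectly linear scaling: N times the GPUs means 1/N the time.
    Real systems rarely scale this cleanly, so treat the output as a rough guide.
    """
    return actual_minutes * actual_gpus / target_gpus


# Speech Recognition leader from the text: 2.1 minutes on 1536 A100 GPUs,
# normalized down to the 32 GPUs used by the nearest H100 Preview submission.
leader_normalized = normalize_minutes(2.1, 1536, 32)
print(f"Normalized to 32 GPUs: {leader_normalized:.1f} minutes")

# A submission that already uses 32 GPUs is unchanged by normalization.
h100_normalized = normalize_minutes(7.5, 32, 32)
print(f"H100 Preview (32 GPUs): {h100_normalized:.1f} minutes")
```

Under this assumption, a result from a very large system gets much slower once scaled down to 32 GPUs, which is why the chart order changes between actual and normalized results.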

However, with respect to GPU and CPU core counts, there were four benchmarks (Speech Recognition, NLP, Object Detection-light weight, and Recommendation engine) which have submissions that come close to the GPU and CPU hardware counts that were used for the H100 Previews.

For the three benchmarks comparing against the H100 submission with 32 GPUs, the comparison system was an HPE Proliant system with 8 AMD 7763 64-core CPUs and 32 A100-SXM4-80GB GPUs. For the one benchmark comparing against the H100 submission with 8 GPUs, the comparison system was an NVIDIA DGXA100 system with 2 AMD EPYC 7742 (64 core) CPUs and 8 A100-SXM4-80GB GPUs.

Note, the A100 comparison systems still had more CPU cores than the H100 systems: the HPE system had 64 more for the 32 GPU comparisons, and the NVIDIA DGXA100 had 16 more for the lone 8 GPU comparison.
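The core-count gaps can be checked with a little arithmetic, sketched here in Python. The CPU counts and cores per CPU for the HPE, DGXA100 and 32 GPU H100 systems come from the text above. The CPU count for the 8 GPU H100 system is not stated in this post, so the value of 2 below is our assumption, chosen because it is consistent with the 16 core difference.

```python
def total_cores(cpus: int, cores_per_cpu: int) -> int:
    """Total CPU cores in a system."""
    return cpus * cores_per_cpu


# 32 GPU comparison: HPE Proliant (8 x 64-core AMD) vs. H100 Preview (8 x 56-core Xeon)
hpe_a100_cores = total_cores(8, 64)
h100_32gpu_cores = total_cores(8, 56)

# 8 GPU comparison: DGXA100 (2 x 64-core AMD) vs. H100 Preview
dgx_a100_cores = total_cores(2, 64)
# ASSUMPTION: 2 CPUs in the 8 GPU H100 system (not stated in the post)
h100_8gpu_cores = total_cores(2, 56)

print("32 GPU comparison, extra A100 cores:", hpe_a100_cores - h100_32gpu_cores)
print("8 GPU comparison, extra A100 cores:", dgx_a100_cores - h100_8gpu_cores)
```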

So, our comparisons are still not perfect and, if anything, should show the H100 in its worst light because the H100 systems have less CPU compute power. On the other hand, the DGXH100 and the H100 GPU have a lot more bandwidth, and the H100 GPU has additional dedicated logic for AI operations. There’s no telling how much these other hardware differences matter to the various MLPerf training workloads. But these comparisons are as close as the data allows.

The comparisons

First up, Speech Recognition:

Lower is better in training time results (the metric is minutes to train to a target level of accuracy). The results on this chart are sorted by the 32 GPU normalized training times. The actual published results are shown in blue and the 32 GPU normalized results in orange.

As we can see here, even after normalizing all the other results, the H100 preview still doesn’t come out on top, but it doesn’t lose by much (7.487 min vs. 7.534 min). One can also see that the current #1 for this benchmark in actual minutes to train is shown by the last column(s): an NVIDIA DGXA100 running 384 AMD EPYC 7742 (64 core) CPUs with 1536 A100-SXM4-80GB GPUs, which trained in around 2 minutes.

I’ve taken the liberty of marking with light blue boxes the best comparison system for the H100 preview results (DGXH100 with 32 H100 GPUs), which was the HPE (Proliant) system with 8 AMD EPYC 7763 (64-core) CPUs and 32 A100-SXM4-80GB GPUs. In this Speech Recognition benchmark, the H100 GPUs are 1.63X faster than the A100 GPUs.

Next up, Object Detection-Lightweight:

As above, lower is better and the chart is sorted by the 32 GPU normalized results. Blue bars are the actual reported results and orange bars are the 32 GPU normalized results.

Here we can see that the H100 reported the best training time both in actual results and in 32 GPU normalized results. As in the earlier chart, the best comparisons we could find are shown in blue boxes, and in this Object Detection-Lightweight benchmark the H100 is 3.80X faster than the A100.

Bottom line

We have analyzed the top ten results for all MLPerf data center training workloads in the same way as shown above. As discussed earlier, only four MLPerf workloads had submissions with hardware similar to the NVIDIA H100 Preview submissions: three compare well with the 32 GPU H100 submission and one compares well with the 8 GPU H100 submission.

The numbers we calculated show that the H100 is 1.63X (Speech recognition), 3.80X (NLP), 1.97X (Object detection-lightweight) and 1.60X (Recommendation engine) faster than the A100, which suggests the H100 is, on average, 2.25X faster than the A100 in MLPerf v2.1 Data Center Training results.
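For readers who want to check the average, here is a short Python sketch using the four speedups quoted above. It takes a simple arithmetic mean, which matches the 2.25X figure, and compares it with the roughly 3X expected from the announcement. The variable names are ours, for illustration only.

```python
from statistics import mean

# Speedups of H100 over A100 as quoted in the text
speedups = {
    "Speech recognition": 1.63,
    "NLP": 3.80,
    "Object detection-lightweight": 1.97,
    "Recommendation engine": 1.60,
}

average_speedup = mean(speedups.values())
announced_speedup = 3.0  # expected speedup discussed earlier in the post

print(f"Average measured speedup: {average_speedup:.2f}X")
print(f"Fraction of announced speedup: {average_speedup / announced_speedup:.0%}")
```

With only four data points, one of them well above the rest, the mean is sensitive to which benchmarks happen to have comparable submissions.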

Realize the H100 results are “Preview”, so there may still be some software (or firmware) speedups that could improve these numbers. And “Released” hardware & firmware may differ substantially from the “Preview” hardware & firmware.

But given all that, it appears that the H100 is not as fast as announced (2.25X vs. 3X) in MLPerf training workloads, at least not yet [added after publishing, The Eds].

Photo Credit(s):

  • Screen shot of slides presented at GTC Spring 2022
  • Cropped version of above
