MLPerf results show H100 vs. A100 and vs. Habana Gaudi2 GPUs

MLCommons recently released new MLPerf data center training results. The headline for the release was the addition of new GPT-3 data center training results, but what I found more interesting was the plethora of H100 and A100 results on the same training runs, which allowed me to compare the performance of the two NVIDIA GPUs.

For example, in ResNet 50 (image recognition) model training there were a number of H100 and A100 results from Dell, two of which used the same Intel CPU counts and the same H100/A100 GPU counts.

Above we show the top 10 ResNet 50 results. If you examine the #6 submission, it’s a Dell result with 4 Intel Platinum CPUs and 16 NVIDIA H100-SXM5-80GB GPUs, which trained the ResNet 50 model in 7.8 minutes.

What’s not on that chart is another Dell submission (#16) that also had 4 Intel Platinum CPUs but used 16 NVIDIA A100-SXM-80GB GPUs, which trained the same model in 14.4 minutes.

For ResNet 50, then, the H100 is 1.8X faster than a similarly configured A100.

Above we show the top 10 Image Segmentation model training results. In this case there were two similar Dell submissions, at #3 and #4, in the top 10. These had similar hardware configurations, but one used H100 GPUs and the other A100 GPUs.

These two Dell Image Segmentation (3D-UNet) model training submissions, at 7.6 minutes and 11.0 minutes respectively, mean that for Image Segmentation the H100 is 1.4X faster than the A100.

Finally, for DLRM Recommendation engine training results, there were two other Dell submissions (#5 & #7) that used 2 Intel Platinum CPUs and 8 GPUs, one with H100-SXM5-80GB and the other with A100-SXM-80GB GPUs, and trained in 4.3 and 8.4 minutes, respectively. This says that for DLRM model training the H100 is 2.0X faster than the A100.

There were other comparisons (that didn’t attain top training results) with 2 Intel Platinum CPUs and 8 (H100 or A100) GPUs for other models, which show the H100 is anywhere from 1.7X to 2.1X faster.

It’s unclear why the H100 GPUs perform relatively better with fewer GPUs in the configuration, but there may be some additional overhead involved in supporting more CPUs and GPUs which reduces their relative performance.

As a result, we can report that recent MLPerf data center training results show that with 4 CPUs and 16 (H100 or A100) GPUs the H100 performed 1.4X to 1.8X faster than the A100, and with 2 CPUs and 8 (H100 or A100) GPUs the H100 performed 1.7X to 2.1X faster than the A100.
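To make the arithmetic explicit, here’s a quick Python sketch that recomputes these H100 vs. A100 speedups from the training times cited above (minutes taken straight from the Dell submissions):

```python
# Recompute H100 vs A100 speedups from the Dell MLPerf training times cited above.
# Speedup = A100 training time / H100 training time (both in minutes).

dell_results = {
    # model: (H100 minutes, A100 minutes)
    "ResNet 50":                    (7.8, 14.4),   # 4 CPUs, 16 GPUs
    "Image Segmentation (3D-UNet)": (7.6, 11.0),   # 4 CPUs, 16 GPUs
    "DLRM Recommendation":          (4.3, 8.4),    # 2 CPUs, 8 GPUs
}

for model, (h100_min, a100_min) in dell_results.items():
    speedup = a100_min / h100_min
    print(f"{model}: H100 is {speedup:.1f}X faster than A100")

# Output:
# ResNet 50: H100 is 1.8X faster than A100
# Image Segmentation (3D-UNet): H100 is 1.4X faster than A100
# DLRM Recommendation: H100 is 2.0X faster than A100
```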

There was one other interesting GPU comparison shown in recent MLPerf results, that between the NVIDIA H100-SXM5-80GB and the Intel Habana Gaudi2 GPU. In this case the submissions involved different vendors (Dell and Intel) and different AI frameworks (NGC MXNet 23.04, NGC PyTorch 23.04, and NGC HugeCTR 23.04 for the H100, and PyTorch 1.13.1a0 for the Habana Gaudi2). Both submissions used 2 Intel Platinum CPUs and 8 (H100 or Habana Gaudi2) GPUs.

Again, none of these (H100 vs Habana Gaudi2 GPU) results appear in the top result charts we show here.

For ResNet 50, the H100 GPU trained the model in 13.5 minutes and the Habana Gaudi2 GPU trained it in 16.5 minutes. This would say the H100 is 1.2X faster than the Habana Gaudi2 GPU.

In addition, both of these submissions also trained the Image Segmentation model. The H100 trained it in 12.2 minutes while the Habana Gaudi2 trained it in 20.5 minutes. This would say that the H100 is 1.7X faster than the Habana Gaudi2 GPU.

As a result, recent MLPerf data center training results show the NVIDIA H100-SXM5-80GB is 1.2X to 1.7X faster than the Intel Habana Gaudi2 GPU on the two different model training results with similar hardware configurations.
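The same ratio calculation, applied to the H100 vs. Habana Gaudi2 submissions above:

```python
# H100 vs Habana Gaudi2 speedups (2 CPUs, 8 accelerators each; times in minutes).
resnet50_speedup = 16.5 / 13.5   # ~1.2X (H100: 13.5 min, Gaudi2: 16.5 min)
unet3d_speedup   = 20.5 / 12.2   # ~1.7X (H100: 12.2 min, Gaudi2: 20.5 min)
print(f"ResNet 50: {resnet50_speedup:.1f}X, Image Segmentation: {unet3d_speedup:.1f}X")
```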

Finally, MLPerf results for GPT-3 are brand new for this release, so we present them below.

There were only 4 (on-prem) submissions for GPT-3 in this round. The #1 NVIDIA submission with 192 CPUs and 768 H100-SXM5-80GB GPUs trained in 44.8 minutes, while the #4 Intel submission with 64 CPUs and 256 Habana Gaudi2 GPUs trained in 442.6 minutes.

It’s less certain whether we should compare GPU speeds here as 1) the comparisons (#1 to #3 and #2 to #4) used half the hardware and 2) the software frameworks were very dissimilar: the NVIDIA H100 GPT-3 submissions (#1 & #2) used the NVIDIA NeMo software framework while the Intel submissions (#3 & #4) used PyTorch 1.13.1a0. Not sure what NVIDIA NeMo is derived from, but it doesn’t seem to be used in any other MLPerf model training run besides GPT-3.
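For what it’s worth, a purely illustrative back-of-envelope on the #1 and #4 GPT-3 submissions, in raw wall clock and in GPU-minutes, is sketched below. It assumes perfectly linear scaling with accelerator count, which real training runs don’t achieve, so it shouldn’t be read as a head-to-head GPU speed claim:

```python
# Naive back-of-envelope for the GPT-3 submissions (#1 NVIDIA vs #4 Intel).
# CAUTION: this assumes perfectly linear scaling with accelerator count,
# which real training runs don't achieve, so treat it as illustrative only.

h100_gpus, h100_minutes     = 768, 44.8    # NVIDIA #1 submission
gaudi2_gpus, gaudi2_minutes = 256, 442.6   # Intel #4 submission

raw_ratio = gaudi2_minutes / h100_minutes   # ~9.9X wall-clock difference
gpu_ratio = h100_gpus / gaudi2_gpus         # 3X more accelerators in the H100 run
per_gpu_ratio = (gaudi2_gpus * gaudi2_minutes) / (h100_gpus * h100_minutes)  # ~3.3X in GPU-minutes

print(f"Wall clock: {raw_ratio:.1f}X, GPU count: {gpu_ratio:.0f}X, GPU-minutes: {per_gpu_ratio:.1f}X")
```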

Comments?