MLPerf v1.1 DC training performance as of 17Dec2021

This Storage Intelligence (StorInt™) dispatch covers the MLPerf™ v1.1 series of AI-ML-DL model training and inferencing benchmarks. This report focuses on training activity for the Data Center computing environment. Workloads for this benchmark include image classification, medical image segmentation, object detection (lightweight and heavyweight), speech recognition, natural language processing (NLP), recommendation and reinforcement learning algorithms. The main MLPerf data center training metric we use is time (minutes) to achieve required training accuracy. MLPerf v1.1 added a cloud category for its DC training benchmarks, but there were not enough results for our purposes, and they were all on Microsoft Azure.

MLPerf v1.1 DC training “available on premises” benchmark results

We start our discussion with data center image classification training results in Figure 1.

Figure 1 Top 10 MLPerf v1.1 Data Center image classification training results

For the first time, we can see two competitors to the NVIDIA GPU chips and boards that have dominated in the past. Yes, the #1 and #2 ranked systems here were NVIDIA DGXA100 systems with 4320 and 1024 NVIDIA A100-SXM4-80GB GPU chips and 1080 and 256 AMD EPYC 7742 CPUs, respectively. But the #3 ranked system used 256 (Intel) Habana Gaudi Tensor Processing Core (TPC) 2.0 training/inferencing boards, which include on-chip RoCE engines for easier/cheaper connectivity, with 128 Intel Xeon Platinum 8380 CPUs. And ranked #4 is a system that used 256 Graphcore IPU GC200 boards with 32 AMD EPYC 7742 CPUs.

In the image classification workload above we see:

  • NVIDIA A100-SXM4-80GB chips used in the #1, 2 & 5 ranked systems
  • Intel Habana Gaudi TPC boards used in the #3, 6 & 9 ranked systems
  • Graphcore IPU GC200 boards used in the #4, 7 & 8 ranked systems
  • NVIDIA A100-PCIE-80GB boards used in the #10 ranked system

If we compare the Gaudi TPC system (#6) against the Graphcore IPU system (#7), both of which used 128 boards, we find that the Gaudi TPC solution took 5.4 minutes and the Graphcore IPU solution took 5.7 minutes to train to the required accuracy in image classification. This would say that the Graphcore IPU 128-board system is 5.3% slower than the Gaudi TPC 128-board system.

To compare the NVIDIA boards to the Gaudi and Graphcore boards, we must consider the #10 ranked system, which used 32 NVIDIA A100-PCIE-80GB boards and trained in 10.6 minutes. However, the #8 and #9 ranked systems used 64 IPU and TPC boards, respectively. If we assume that a 32-board system would take 2X as long to train as a 64-board system (probably a worst-case assumption), then the 32-board Graphcore IPU system should train in 17.0 minutes and the 32-board Gaudi TPC system in 19.0 minutes. Given these assumptions, the Graphcore IPU 32-board system is 38% slower and the Gaudi TPC 32-board system is 44% slower than the NVIDIA A100-PCIE-80GB 32-board system.
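Purely to make the arithmetic explicit, here is a small Python sketch (variable names are ours) of the worst-case 2X extrapolation and the slowdown percentages above. Note that the "percent slower" figures throughout this report are computed relative to the slower system's training time:

```python
# Worst-case extrapolation: assume a 32-board system takes 2X as long
# to train to required accuracy as the same vendor's 64-board system.
t_nvidia_32 = 10.6           # measured: 32x A100-PCIE-80GB (minutes)
t_ipu_64, t_tpc_64 = 8.5, 9.5   # measured 64-board results (minutes)

t_ipu_32 = 2 * t_ipu_64      # 17.0 min, extrapolated
t_tpc_32 = 2 * t_tpc_64      # 19.0 min, extrapolated

def pct_slower(t_slow, t_fast):
    """Percent slower, relative to the slower system's time."""
    return round(100 * (t_slow - t_fast) / t_slow)

print(pct_slower(t_ipu_32, t_nvidia_32))  # 38
print(pct_slower(t_tpc_32, t_nvidia_32))  # 44
```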

All that remains is to compare the NVIDIA A100-SXM4-80GB chips to the Gaudi TPC and Graphcore IPU board systems. If we use the 64 chip/board systems (the #5, 8 & 9 ranked systems) as our comparison, the NVIDIA A100-SXM4-80GB 64-chip system took 4.5 minutes to train, while the Graphcore IPU 64-board system took 8.5 minutes and the Gaudi TPC 64-board system took 9.5 minutes to train to the required accuracy. This says that the Graphcore IPU 64-board system was 47% slower and the Gaudi TPC 64-board system was 53% slower than the NVIDIA A100-SXM4-80GB 64-chip system.

Note, none of the above comparisons consider CPU chip counts. AMD EPYC 7742 counts range from 1080 (#1) down to 8 (#8); Intel Xeon Platinum 8380 counts (systems #3, 6 & 9) range from 128 (#3) down to 32 (#9); and the #10 system used 16 Intel Xeon 6338 CPUs.

In Figure 2, we show Data Center Natural Language Processing training results.

Figure 2 Top 10 MLPerf v1.1 Data Center Natural Language Processing training results

In Figure 2, we again see NVIDIA A100-SXM4-80GB systems ranked alongside Graphcore IPU systems and Gaudi TPC systems. For NLP processing, #1-3 were all NVIDIA DGXA100 systems with 4320, 1080 and 64 A100-SXM4-80GB GPU chips, respectively. The #4 and #5 ranked systems used 128 and 64 Graphcore IPU boards, and the #6 and #10 ranked systems used 64 and 32 Gaudi TPC boards, respectively. There were no A100-PCIE-80GB GPUs in the top 10 NLP training ranked systems.

So, to compare NVIDIA GPU chips against Graphcore IPU and Gaudi TPC boards, all we need to do is look at #3 (NVIDIA with 64 GPU chips), #5 (Graphcore with 64 IPU boards) and #6 (Gaudi with 64 TPC boards), which took 3.0, 10.6 and 11.9 minutes, respectively, to train to required NLP accuracy. This means that the Graphcore IPU 64-board system was 72% slower and the Gaudi TPC 64-board system was 75% slower than the NVIDIA A100-SXM4-80GB 64-chip system in training NLP to the required accuracy.

In addition, this indicates that the Gaudi TPC 64-board system was 11% slower than the Graphcore IPU 64-board system in training to the required NLP accuracy.
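As a quick sanity check on the NLP figures, the same "percent slower relative to the slower system's time" calculation reproduces all three percentages (again, variable names are ours):

```python
def pct_slower(t_slow, t_fast):
    """Percent slower, relative to the slower system's time."""
    return round(100 * (t_slow - t_fast) / t_slow)

# Minutes to required NLP training accuracy for the 64 chip/board systems
t_nvidia, t_ipu, t_tpc = 3.0, 10.6, 11.9

print(pct_slower(t_ipu, t_nvidia))  # 72: Graphcore IPU vs NVIDIA
print(pct_slower(t_tpc, t_nvidia))  # 75: Gaudi TPC vs NVIDIA
print(pct_slower(t_tpc, t_ipu))     # 11: Gaudi TPC vs Graphcore IPU
```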

Again, the above NLP comparisons don’t consider the varying CPU chip counts.

Significance

From our perspective, it’s great to see some competition for NVIDIA GPUs in the MLPerf top 10 rankings. The fact that the NVIDIA A100-SXM4 chips and A100-PCIE boards are still the fastest (notwithstanding CPU counts) in image classification and NLP data center training workloads is fine by us. Note, there’s no information on pricing for these systems, which would add a whole new dimension to our comparisons.

As always, suggestions on how to improve any of our performance analyses are welcomed. 

[This system/storage performance report was originally sent out to our newsletter subscribers in December of 2021. If you would like to receive this information via email please consider signing up for our free monthly newsletter (see subscription request, above right) and we will send our current issue along with download instructions for this and other reports. Dispatches are posted to our website generally a month or more after they are sent to our subscribers.]

Silverton Consulting, Inc., is a U.S.-based Storage, Strategy & Systems consulting firm offering products and services to the data storage community
