This Storage Intelligence (StorInt™) dispatch covers the MLPerf™ series of AI-ML-DL benchmarks (an HPC suite plus model training and model inferencing benchmarks), a first for us. We will try to report on the HPC suite in a future performance report, but for this report we focus on the MLPerf training and inferencing benchmarks. Reference versions of all models used in MLPerf benchmarks are available on GitHub (https://github.com/mlcommons).
We start our discussion with MLPerf model training benchmark results.
MLPerf v0.7 training benchmark results
The MLPerf v0.7 training series of benchmarks takes standard AI deep learning models and trains them on various hardware & software configurations until they reach a certain level of quality (say 75% object classification accuracy, a 50% win rate, etc.) using a standard data set and model architecture.
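To make the methodology concrete, here is a minimal, hypothetical sketch of the train-to-target approach: training runs until the model reaches a quality target, and the reported result is the elapsed wall-clock time. The model and learning curve below are toy stand-ins, not any real MLPerf workload.

```python
import time

# Quality target; MLPerf's actual targets vary by model (e.g., ~75.9%
# top-1 accuracy for ImageNet-ResNet in v0.7).
TARGET_ACCURACY = 0.759

def train_one_epoch(epoch: int) -> float:
    """Toy stand-in for a real training epoch; returns simulated accuracy."""
    return min(0.10 + 0.07 * epoch, 0.99)

def train_to_target(target: float = TARGET_ACCURACY, max_epochs: int = 100):
    """Train until the target quality is reached; return epochs and wall time."""
    start = time.time()
    for epoch in range(1, max_epochs + 1):
        if train_one_epoch(epoch) >= target:
            return epoch, time.time() - start
    raise RuntimeError("did not reach target quality")

epochs, seconds = train_to_target()
print(f"reached {TARGET_ACCURACY:.1%} in {epochs} epochs ({seconds:.3f}s)")
```

Because the metric is time-to-quality rather than raw throughput, submissions are free to trade batch size, precision, and GPU count against convergence behavior, which is what makes the scaling comparisons below interesting.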
There are 8 separate training models in use, including image classification (ImageNet-ResNet), object detection (COCO-SSD & COCO-Mask R-CNN), translation (WMT English-German GNMT & Transformer), NLP (Wikipedia-BERT), recommendation (1TB Clickthrough-DLRM) and reinforcement learning (MiniGo).
Also for training performance, MLPerf has 4 different categories of submissions: in the cloud, on-prem, preview and research. We report solely on the on-prem results.
We start our discussion with the image classification benchmark results (ImageNet-ResNet), in Figure 1.
Figure 1 Top 10 MLPerf v0.7 training ImageNet-ResNet results
In Figure 1 (lower is better for training time), NVIDIA solutions took the top 4 slots, and the top 3 of those used A100 GPUs. The #1 submission from NVIDIA trained in 0.8 minutes and the #10 submission from Inspur trained in 33.4 minutes.
There are many variables in the configurations of the top 10 systems in Figure 1. However, we can take software out of the picture, as all the ImageNet-ResNet submissions used Apache MXNet modeling software.
We have summarized some hardware configuration variables in the table.
In the table, we shaded the two rows where the submissions had 8 GPUs/CPU core, all of the rest of the top 10 used 4 GPUs/CPU core. When we examine the table, a couple of performance insights emerge:
• The A100s’ training performance doesn’t scale linearly. For example, with 8 A100s the #10-ranked submission was able to train in 2002 seconds. The #1 submission, with 1840 A100s (230X more GPUs than the #10 submission), was able to train in 46 seconds (only 43X faster), whereas the #3 solution, with 768 A100s (96X more GPUs), was able to train in 64 seconds (only 31X faster). We conclude that something other than the CPU core-to-GPU ratio is the bottleneck for these configurations.
• The V100s’ training performance also doesn’t scale linearly, but it scales better than the A100s’. For example, with 256 V100s, the #8 submission was able to train in 237 seconds. The #5 submission, with 1280 V100s (5X more GPUs), was able to train in 101 seconds (2.3X faster), whereas the #7 submission, with 512 V100s (2X more GPUs), was able to train in 141 seconds (1.7X faster). Again, we must conclude that the CPU core-to-GPU ratio is not the bottleneck for these systems.
• The Intel® Xeon Platinum 8174 @3.1GHz seems easily capable of handling 8 V100 GPUs/CPU core. But here too, the V100s’ training performance didn’t scale linearly.
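The scaling arithmetic in the bullets above can be checked with a few lines of Python; the GPU counts and training times come straight from the numbers quoted in the text.

```python
def scaling_efficiency(base_gpus, base_secs, gpus, secs):
    """Compare a larger submission to a baseline: return the GPU ratio,
    the actual speedup, and the scaling efficiency (speedup / GPU ratio)."""
    gpu_ratio = gpus / base_gpus
    speedup = base_secs / secs
    return gpu_ratio, speedup, speedup / gpu_ratio

# A100s: #10 (8 GPUs, 2002s) vs. #1 (1840 GPUs, 46s)
ratio, speedup, eff = scaling_efficiency(8, 2002, 1840, 46)
print(f"A100: {ratio:.0f}X GPUs -> {speedup:.1f}X speedup ({eff:.0%} efficient)")

# V100s: #8 (256 GPUs, 237s) vs. #5 (1280 GPUs, 101s)
ratio, speedup, eff = scaling_efficiency(256, 237, 1280, 101)
print(f"V100: {ratio:.0f}X GPUs -> {speedup:.1f}X speedup ({eff:.0%} efficient)")
```

At roughly 19% efficiency for the largest A100 run versus about 47% for the 5X V100 comparison, the arithmetic bears out the bullets: both scale sub-linearly, with the V100 configurations scaling somewhat better.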
In Figure 2 we present training results for object detection AI modeling using the COCO-SSD algorithm.
Figure 2 MLPerf v0.7 training COCO-SSD performance results
In Figure 2, once again NVIDIA took the top 4 rankings, but used both V100s (#1 & #3) and A100s (#2 & #4). The #1 system from NVIDIA trained in under 4 minutes while the #10 submission from Dell EMC took 25.4 minutes. As for software, this top 10 was a mix of PyTorch (#1-5 & #7-8), with the rest of the submissions using Apache MXNet.
Similar to our treatment above we show some hardware configuration parameters for the top 10 submissions in the table.
Again, we have shaded the rows for submissions that used 8 GPUs/CPU core. In the table we can see some more items of interest:
• The V100 with the Intel Xeon Gold 6148 training performance scales linearly. For example, the #10 submission had 8 V100s and trained in 1522 seconds, while the #6 submission, with 16 V100s (2X more GPUs), trained in 746 seconds (over 2X faster).
• The V100 with the Intel Xeon Platinum 8174 training performance does not scale linearly. For example, the #3 submission had 16 V100s and trained in 579 seconds, while the #1 submission, with 64 V100s (4X more GPUs), trained in 236 seconds (2.5X faster). Something about the faster processor has introduced a different bottleneck limiting performance scaling for the V100.
• The A100 with AMD EPYC 7742 training performance comes very close to scaling linearly. For example, the #4 and #5 submissions each had 8 A100s and trained in ~621 seconds, while the #2 submission, with 16 A100s (2X more GPUs), trained in 341 seconds (~1.8X faster).
Next, we turn to MLPerf inferencing results for the data center.
MLPerf v0.7 inferencing data center benchmark results
Similar to our discussion above, the MLPerf v0.7 inferencing workloads are executed on various sets of hardware & software using standard models that have previously been trained (to a preset level of accuracy) and a standard example dataset. Inferencing benchmark results must meet a latency target as well as a quality (accuracy) target to be a valid submission. Most inferencing benchmark submissions provide metrics in both server mode (real-time queries) and offline mode (batched inferencing).
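A hypothetical sketch of the server-versus-offline distinction (the real measurement is done by MLPerf’s LoadGen harness; the latency bound, percentile, and query latencies below are made up for illustration):

```python
def server_mode_valid(latencies_ms, bound_ms, required_frac=0.99):
    """Server mode: a run is valid only if enough queries (e.g., 99%)
    complete within the per-query latency bound."""
    within = sum(1 for lat in latencies_ms if lat <= bound_ms)
    return within / len(latencies_ms) >= required_frac

def offline_ips(num_inferences, elapsed_secs):
    """Offline mode: per-query latency is ignored; batched throughput
    (inferences per second, ips) is all that matters."""
    return num_inferences / elapsed_secs

# 99 of 100 simulated queries finish within a 15ms bound -> valid
latencies = [10.0] * 99 + [50.0]
print(server_mode_valid(latencies, bound_ms=15.0))   # True
print(offline_ips(1_000_000, 4.0))                   # 250000.0 ips
```

This is why server-mode numbers are typically lower than offline numbers for the same hardware: the latency constraint forces the system to serve queries as they arrive rather than batching them optimally.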
There are four sets of inferencing benchmarks: data center, edge, mobile, & tiny. We focus here on data center inferencing and plan to cover the others in a later report.
For inferencing results, MLPerf has 3 categories: available, preview and research. We report only on available results below.
In Figure 3 we present the top 10 MLPerf v0.7 inferencing data center image classification results.
Figure 3 Top 10 MLPerf v0.7 data center image classification performance results
In Figure 3, higher is better (more inferences per second [ips]). NVIDIA systems weren’t quite as dominant here as in the training results above, achieving only the #2, #4 & #9 rankings. And Inspur (#1) did quite well here at 262.3K ips, with close to equivalent hardware as NVIDIA (#2) at 255.1K ips and QCT (#3) at 220.1K ips. The software in use for all of the top 10 above was TensorRT & CUDA.
Similar to the discussions above, configuration information details are provided in the table.
Note, the GPU/CPU core counts in the above table range from 2 to 10.5. A couple of interesting points emerge when examining the table.
• The A100-PCIe inferencing performance seems to almost scale linearly. The #9-10 submissions with 4 A100-PCIes each were able to achieve ~109.3K ips while the #3 submission with 10 A100-PCIes (2.5X more GPUs) was able to perform 220.1K ips (~2.0X more).
• The A100-PCIe does about 5X more ips than the T4 (see #9-10 vs. #7-8 in the table).
• The A100-PCIe does about 1.8X more ips than the RTX 6000, (see #3 vs. #6).
Next, we report on MLPerf v0.7 data center inferencing results for recommendation engines (1TB Clickthrough, DLRM), in Figure 4.
Figure 4 Top 10 MLPerf v0.7 data center inferencing recommendation performance results
Inspur came in again at #1 at 2.1M ips on the recommendation algorithm but NVIDIA almost tied it at only 1.5K ips lower. Then there’s a substantial drop off with Dell EMC at 777.9K ips.
We supply the hardware configuration details in the table.
Again, the GPU per CPU count varies widely, from 1 to 10.5. Similar to the discussion above, we can see:
• The A100-PCIe with AMD EPYC 7742 ips performance seems to almost scale linearly with 2X more GPUs offering 1.7X the inferences per second (see #9-10 vs #4).
• The T4 with various Intel Xeon CPUs scales close to linearly in ips performance, with #7-8 using 16 T4s and running about 551.8K ips (~34.5K inferences per second per GPU), while #6 with 20 T4s runs about 600.3K ips (~30.0K inferences per second per GPU).
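The per-GPU normalization in the T4 bullet is just total ips divided by GPU count, using the throughput figures quoted above:

```python
def ips_per_gpu(total_ips, num_gpus):
    """Normalize a submission's throughput to a per-GPU figure."""
    return total_ips / num_gpus

print(ips_per_gpu(551_800, 16))  # #7-8: ~34.5K ips per T4
print(ips_per_gpu(600_300, 20))  # #6:   ~30.0K ips per T4
```

Normalizing this way is what lets differently sized submissions be compared at all; the modest drop in per-GPU throughput from 16 to 20 T4s is why we call the scaling close to, rather than exactly, linear.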
AI-ML-DL training and inferencing workloads will become increasingly important as time moves on. So it’s great that MLPerf is providing benchmarks that can be used to understand system performance on a standardized set of DL modeling work.
It’s surprising to us that for some training models we can throw almost 2K GPUs at the work and still see benefits, whereas others need only 1/30th as many GPUs. We suppose this is primarily due to the complexity of the models being trained. On the other hand, decent inferencing ips levels can be attained with 20 or fewer GPUs for these models.
But one thing is certain: NVIDIA GPUs are the go-to accelerator for AI-ML-DL training and data center inferencing on on-prem systems.
This is only our first performance report analyzing MLPerf submissions. There is so much variability in the hardware configurations of these submissions that we were almost lost in the weeds. However, concentrating on CPU and GPU counts along with CPU and GPU types did provide some clarity in the end.
We hope to discuss different training and inferencing workloads the next time we report on MLPerf results. If we have missed something or introduced an error in any of our analysis, please let us know and we will be glad to fix it.
[This storage performance dispatch was originally sent out to our newsletter subscribers in December of 2020. If you would like to receive this information via email, please consider signing up for our free monthly newsletter (see subscription request, above right) and we will send our current issue along with download instructions for this and other reports. Dispatches are posted to our website at least a quarter after they are sent to our subscribers.]
Silverton Consulting, Inc., is a U.S.-based Storage, Strategy & Systems consulting firm offering products and services to the data storage community.