This Storage Intelligence (StorInt™) dispatch covers the MLperf™ v1.1 series of AI-ML-DL model training and inferencing benchmarks. This report focuses on inferencing activity for Edge computing environments and workloads, which consist of image classification, object detection (small and large), medical image segmentation, speech-to-text and natural language processing model inferencing. The main MLperf inferencing metric is single stream latency, measured in msec.
MLperf v1.1 Edge inferencing “closed-available” benchmark results
We start our discussion with the edge medical image segmentation inferencing benchmark results in Figure 1.
Figure 1 Top 10 MLperf v1.1 Edge Medical Image Segmentation inferencing results
In Figure 1, systems with a single NVIDIA A100-SXM-80GB GPU chip took the top 4 spots (#1-4 on the chart), differing by only ~0.5 msec in single stream latency. The next 2 systems (#5-6) used NVIDIA A100-PCIe-80GB GPU boards and came in, on average, about 8 msec, or 50%, slower than the SXM chip systems. The last 4 placed systems (#7, 8, 9 & 10) all used A100-PCIe-40GB cards (although #9&10 don’t indicate GPU memory size, we assume 40GB), and the last 2 ranked systems (#9&10) had 2 A100 GPU cards rather than the 1 used by all the other submissions.
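The "about 8 msec, or 50%, slower" comparison above is a relative-slowdown calculation. A minimal Python sketch follows. The two latency values are hypothetical placeholders chosen only to be consistent with that statement; they are not the published MLperf results, and the function name is ours.

```python
def relative_slowdown(baseline_ms, other_ms):
    """Return (absolute gap in msec, gap as a percentage of the baseline latency)."""
    gap_ms = other_ms - baseline_ms
    return gap_ms, 100.0 * gap_ms / baseline_ms


# Hypothetical placeholder latencies (NOT the published results), picked so that
# the gap matches the "about 8 msec, or 50%" described in the text.
sxm_ms = 16.0   # assumed single stream latency for an A100-SXM system
pcie_ms = 24.0  # assumed single stream latency for an A100-PCIe system

gap_ms, gap_pct = relative_slowdown(sxm_ms, pcie_ms)
print(f"{gap_ms:.1f} msec slower, {gap_pct:.0f}% over the baseline")
```

Note that the percentage is taken relative to the faster (baseline) system. Dividing by the slower system's latency instead would give a smaller figure, so it helps to state which convention is in use.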
Besides packaging, the main difference between the A100-SXM and A100-PCIe is power consumption: the SXM chip can run up to 400W while the PCIe board runs up to 250W. The higher power budget probably also means the SXM chip performs faster.
All submissions on medical image segmentation ran the same TensorRT 8.0.1 and CUDA 11.3 software. All submissions except #2, 9 and 10 used dual AMD EPYC 7742 CPUs, while #2 used dual Intel Xeon Platinum 8358 CPUs and #9&10 used dual Intel Xeon Gold 6258R CPUs.
It’s unclear why systems using 2 A100-PCIe(-40GB) GPUs, as #9&10 did, are slower than a system using one A100-PCIe-40GB GPU card. However, the last two did use older Intel Xeon Gold 6258R CPUs, which could explain the anomaly.
In Figure 2, we show Edge Speech to Text inferencing results.
Figure 2 Top 10 MLperf v1.1 Edge Speech to Text inferencing results
Speech-to-text inferencing results show a much different ranking for these same systems. For instance, the first 3 systems (#1-3) all used Intel CPUs: #1 used dual Intel Xeon Platinum 8358 CPUs and #2-3 used dual Intel Xeon Gold 6258R CPUs.
Moreover, the GPU type didn’t matter as much here as in medical imaging. #1 used a single A100-SXM-80GB GPU whereas #2-3 used 2 A100-PCIe(-40GB?) GPUs. #4 & 5 used A100-SXM-80GB GPU chips, while #6 used an A100-PCIe-40GB GPU card and #7 used an A100-PCIe-80GB card.
The last 3 ranked systems (#8, 9 & 10) used lower-end NVIDIA GPUs, i.e. A30 (#8) and A10 (#9-10) GPUs. #8 used dual AMD EPYC 7742 CPUs while the other two used older Intel CPUs.
Given all the above, it seems to us that speech-to-text RNN-T inferencing is more CPU bound than GPU bound. And for some reason, Intel CPUs perform better than AMD CPUs on this activity.
We have examined all the other edge inferencing workloads, and most of their results appear similar to medical imaging, as they all seem to be GPU-driven activities. The only exception we found was speech-to-text.
It’s interesting to see that inferencing response times, at ~24 msec, seem equivalent to disk storage latencies. We don’t have any hypothesis as to why this is. Inferencing is almost entirely computational (ok, data transfer too) and as such, should be much, much faster than storage access. And with GPUs having 1,000 or more cores to throw at this work, response times in the 24 msec range say there’s an ocean-liner load of computation going on.
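To put the "ocean-liner load of computation" in rough numbers, here is a back-of-envelope Python sketch of how many floating-point operations fit into a single 24 msec response. It assumes NVIDIA's published peak FP32 rate for the A100 (19.5 TFLOPS, without tensor cores) and 100% utilization. Both are our assumptions for illustration only: real inferencing runs well below peak, and submissions may use lower-precision tensor core paths with different peak rates. The function and constant names are ours.

```python
def ops_in_window(peak_tflops, latency_ms, utilization=1.0):
    """Upper bound on floating-point operations executed within one latency window."""
    return peak_tflops * 1e12 * (latency_ms / 1e3) * utilization


# Assumption: NVIDIA's published peak FP32 (non-tensor-core) rate for the A100.
A100_FP32_PEAK_TFLOPS = 19.5
# The ~24 msec response time discussed in the text.
LATENCY_MS = 24.0

budget = ops_in_window(A100_FP32_PEAK_TFLOPS, LATENCY_MS)
print(f"{budget:.2e} floating-point operations at 100% utilization")
```

Under these assumptions the ceiling is on the order of hundreds of billions of operations per query. This is only an upper bound on what the GPU could do in that window, not a measurement of what any of the models actually executes.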
As always, suggestions on how to improve any of our performance analyses are welcomed.
[This system/storage performance report was originally sent out to our newsletter subscribers in September of 2021. If you would like to receive this information via email, please consider signing up for our free monthly newsletter (see subscription request, above right) and we will send our current issue along with download instructions for this and other reports. Dispatches are posted to our website generally a month or more after they are sent to our subscribers.]
Silverton Consulting, Inc., is a U.S.-based Storage, Strategy & Systems consulting firm offering products and services to the data storage community.
 All MLperf inferencing and training results are available at https://mlcommons.org/en/ as of 09/27/2021
 Since most of these systems were NVIDIA (#1 & 4-8) with the rest INSPUR (#2, 9, &10) and all used NVIDIA GPUs, naming the vendor and system was less useful than calling them by their ranking.