AI/ML/DL hardware performance results from MLPerf

I read an article a couple of weeks back in IEEE Spectrum, New Records for AI Training, which discussed the recent MLPerf v0.7 performance results. The article mentioned that MLPerf benchmark performance has increased by ~2.7X in the last year alone.

The MLPerf organization was started back in 2018 to supply machine learning workload performance results, somewhat like what SPEC and TPC did for NFS file serving and transaction processing. The MLPerf organization documented its philosophy in a paper.

As far as I can tell, MLPerf is the only benchmark currently available to show hardware system performance on AI training and inferencing. Below we report on MLPerf training results.

MLPerf also reports both closed and open division benchmark results. Closed division submissions all use the same software algorithms for each workload, so one can compare workload performance across different hardware systems. Open division submissions can use any algorithm to achieve the desired results on the problem set. We report on MLPerf closed division results below.

Current MLPerf v0.7 (open and closed division) training results are available online (on GitHub) and are summarized in a training results page on their web site.

MLPerf v0.7 workload changes

The MLPerf team added a few new workloads and tightened the requirements of another benchmark for v0.7:

  • Recommendation DLRM: replaces the recommendation workload used in MLPerf v0.6; the new model comes from Facebook and provides more parallelism in recommendation training (a toy sketch of this style of model appears just after this list).
  • Wikipedia BERT: an addition over what was used in MLPerf v0.6, a new natural language processing (NLP) front end, trained on Wikipedia, that is used together with other language processing capabilities.
  • Go MiniGo: an enhancement of the MLPerf v0.6 MiniGo accuracy requirements; it uses reinforcement learning to learn to play Go. For v0.7, the workload now uses a full-sized, 19X19 Go board and the win-rate requirement was raised to 50%.
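
To make the DLRM entry a bit more concrete, here is a toy sketch (in PyTorch, my choice here, not the MLPerf reference implementation) of the basic DLRM pattern: embedding tables for categorical features, a bottom MLP for dense features, pairwise dot-product feature interactions, and a top MLP producing a click probability. All names, layer sizes and feature counts below are made up for illustration.

```python
# Toy DLRM-style recommendation model: embeddings + MLPs + pairwise interactions.
# Not the MLPerf reference code; sizes and names are illustrative only.
import torch
import torch.nn as nn

class ToyDLRM(nn.Module):
    def __init__(self, num_dense=13, embed_dim=16,
                 sparse_cardinalities=(1000, 1000, 1000)):
        super().__init__()
        # One embedding table per sparse (categorical) feature
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, embed_dim) for card in sparse_cardinalities
        )
        # Bottom MLP maps dense features into the same embedding space
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(),
            nn.Linear(64, embed_dim), nn.ReLU(),
        )
        # Top MLP turns the interacted features into a click probability
        n_vectors = len(sparse_cardinalities) + 1        # sparse vectors + dense vector
        n_pairs = n_vectors * (n_vectors - 1) // 2       # pairwise dot products
        self.top_mlp = nn.Sequential(
            nn.Linear(embed_dim + n_pairs, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, dense, sparse):
        # dense: (batch, num_dense) floats; sparse: (batch, num_sparse) integer ids
        d = self.bottom_mlp(dense)
        vectors = [d] + [emb(sparse[:, i]) for i, emb in enumerate(self.embeddings)]
        stacked = torch.stack(vectors, dim=1)            # (batch, n_vectors, embed_dim)
        # Pairwise dot-product interactions between all feature vectors
        interactions = torch.bmm(stacked, stacked.transpose(1, 2))
        idx_i, idx_j = torch.triu_indices(stacked.size(1), stacked.size(1), offset=1)
        pairs = interactions[:, idx_i, idx_j]            # (batch, n_pairs)
        return torch.sigmoid(self.top_mlp(torch.cat([d, pairs], dim=1)))

model = ToyDLRM()
dense = torch.rand(8, 13)
sparse = torch.randint(0, 1000, (8, 3))
print(model(dense, sparse).shape)   # torch.Size([8, 1])
```

The embedding tables are what make recommendation workloads so memory- and communication-hungry at scale, which is where the extra training parallelism in the new workload matters.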

MiniGo Results

A couple of items of note for the MiniGo results. There are essentially 3 different architectures represented in the above: NVIDIA DGX series (DGX A100, DGX-2H, DGX-1), Google TPUs (V4 and V3) and Intel (8 server nodes with Cooper Lake-6 CPUs).

Google TPUs are considered internal and are only available to Google, its hardware partners or on GCP. Although MLPerf includes GCP TPU system results for other workloads, there were none submitted for MiniGo.

The Intel system is a preview of their latest-gen Cooper Lake chips, which may not be commercially available yet. On the other hand, all NVIDIA systems are commercially available and can be deployed in your data center today.

As one can see in the above, NVIDIA systems swept the first 3 positions on our Top 10 MiniGo chart. A DGX A100 came in at #1, reaching a 50% win rate at MiniGo in a mere 17 seconds using 448 CPUs and 1792 A100 GPUs. Coming in at #2 at 30 seconds was another DGX A100 using 64 CPUs and 256 A100 GPUs. And at #3 at 35 seconds was a DGX-2H using 64 CPUs and 512 V100 GPUs.

Next at #4 at 151 seconds was a Google TPU system with 64 TPUv4 accelerators (unclear how many CPUs, if any, were used; the results show 0). Note that an 8-node Intel server with 32 CPUs (4/node) using the latest-gen Cooper Lake (-6) CPU came in at #7, taking 409 seconds to achieve the training result.

There are 6 other MLPerf workloads, including the DLRM and BERT workloads mentioned above. Each of these deserves its own discussion of top ten results. Alas, they will have to wait for another time; I will cover all of them in future posts.

~~~~

Nowadays, with much of IT turning to AI/ML/DL to provide critical services, it’s more important than ever to understand what can and can’t be done with available hardware. The fact that one can train a model to play decent Go in 17 seconds on a large DGX A100 cluster and in under 7 minutes on an 8-node, leading-edge Intel server cluster is pretty impressive.

Despite MLPerf’s best efforts, it’s still tough to compare ML performance across systems when there’s so much diversity in the underlying hardware, especially in GPU, TPU and CPU counts. IMHO, it would be very useful to have a single GPU, TPU or CPU system submission requirement for each workload. That way one could compare how well each hardware element can perform the workload in isolation.
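
As a rough illustration of why a per-accelerator view would help, here's a back-of-the-envelope calculation using the MiniGo numbers quoted above. "Accelerator-seconds" is my own crude metric, not anything MLPerf reports, and it ignores CPUs and scaling overheads entirely.

```python
# Back-of-the-envelope "accelerator-seconds" for the MiniGo results quoted above.
# NOT an official MLPerf metric; just a crude way to see why raw time-to-train
# is hard to compare across very different accelerator counts.
results = {
    "DGX A100 (#1)": {"seconds": 17,  "accelerators": 1792},
    "DGX A100 (#2)": {"seconds": 30,  "accelerators": 256},
    "DGX-2H   (#3)": {"seconds": 35,  "accelerators": 512},
    "TPUv4    (#4)": {"seconds": 151, "accelerators": 64},
}

for name, r in results.items():
    acc_seconds = r["seconds"] * r["accelerators"]
    print(f"{name}: {r['seconds']:>4}s x {r['accelerators']:>4} = {acc_seconds:>6} accelerator-seconds")
```

By this crude measure the 256-GPU DGX A100 run used the fewest accelerator-seconds of the four, even though it finished behind the 1792-GPU run on raw time-to-train, which is exactly the kind of distinction a fixed-configuration submission would surface directly.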

Nonetheless, the MLPerf suite of benchmarks provides a great first step in understanding what today’s hardware can accomplish in ML training (and inferencing).

Comments?
