AI ML DL hardware performance results from MLPerf

Read an article a couple of weeks back from IEEE Spectrum, New Records for AI Training which discussed recent MLPerf v0.7 performance results. The article mentioned that MLPerf performance on its benchmarks has increased by ~2.7X in the last year alone.

The MLPerf organization was started back in 2018 to supply machine learning workload performance results, somewhat like what SPEC and TPC did for NFS and transaction processing. The organization documented its benchmarking philosophy in a paper.

As far as I can tell, MLPerf is the only benchmark currently available to show hardware system performance on AI training and inferencing. Below we report on MLPerf training results.

MLPerf also reports on both closed and open division benchmark results. Closed division submissions all use the same software algorithms for each workload, so one can compare workload performance across different hardware systems. Open division results can make use of any algorithm to achieve the desired results on the problem set. We report on MLPerf closed division results below.

Current MLPerf v0.7 (open and closed division) training results are available online (on GitHub) and are summarized in a training results page on their web site.

MLPerf v0.7 workload changes

The MLPerf team added a few new workloads and upped the game of another benchmark for v0.7:

  • Recommendation DLRM: a replacement for the recommendation benchmark used in MLPerf v0.6; it comes from Facebook and provides more parallelism in training for recommendations.
  • Wikipedia BERT: an addition to what was used in MLPerf v0.6; it is a new natural language processing (NLP) front end, trained on Wikipedia, which is used with other language processing capabilities.
  • Go MiniGo: an enhancement to the MLPerf v0.6 MiniGo accuracy requirements; it uses reinforcement learning to learn to play Go. For v0.7, they now use a full sized, 19×19 Go board and upped the win rate requirement to 50%.

MiniGo Results

A couple of items of note for the MiniGo results. There are essentially 3 different architectures represented in the above: NVIDIA DGX series (DGX A100, DGX-2H, DGX-1), Google TPUs (V4 and V3) and Intel (8 server nodes with Cooper Lake-6 CPUs).

Google TPUs are considered internal and are only available to Google, its hardware partners or on GCP. Although MLPerf includes GCP TPU system results for other workloads, there were none submitted for MiniGo.

The Intel system is a preview of their latest gen Cooper Lake chips, which may not be commercially available yet. On the other hand, all NVIDIA systems are commercially available and can be deployed in your data center today.

As one can see in the above, NVIDIA systems swept the first 3 positions on our Top 10 MiniGo chart. A DGX A100 came in at #1, reaching a 50% win rate at MiniGo in a mere 17 seconds using 448 CPUs and 1792 A100 GPUs. Coming in at #2 at 30 seconds was another DGX A100 using 64 CPUs and 256 A100 GPUs. And at #3 at 35 seconds was a DGX-2H using 64 CPUs and 512 V100 GPUs.

Next, at #4 at 151 seconds, was a Google TPU system with 64 TPUv4 accelerators (unclear how many CPUs, if any, are used; the results show 0). Note, an 8-node Intel server cluster with 32 CPUs (4/node) using the latest gen Cooper Lake (-6) CPUs came in at #7, taking 409 seconds to achieve the training result.

There are 6 other MLPerf workloads, including the DLRM and BERT workloads mentioned above. Each of these deserves its own discussion of top ten results. Alas, they will need to wait for another time; I will cover all of them in future posts.

~~~~

Nowadays, with much of IT turning to AI ML DL to provide critical services, it’s more important than ever to understand what can and can’t be done with available hardware. The fact that one can train a model to play decent Go in 17 seconds on a large DGX A100 cluster and under 7 minutes on an 8-node, leading edge, Intel server cluster is pretty impressive.

Despite MLPerf’s best efforts, it’s still tough to compare ML performance across systems when there’s so much diversity in the underlying hardware, especially in GPU, TPU and CPU counts. IMHO, it would be very useful to have a single GPU, TPU or CPU system submission requirement for each workload. That way one could compare how well each hardware element can perform the workload in isolation.

Nonetheless, the MLPerf suite of benchmarks provides a great first step in understanding what today’s hardware can accomplish in ML training (and inferencing).

Comments?

Photo Credits:

Hybrid digital training-analog inferencing AI

Read an article from IBM Research, Iso-accuracy DL inferencing with in-memory computing, the other day that referred to an article in Nature, Accurate DNN inferencing using computational PCM (phase change memory, a memristive technology), which discussed using a hybrid digital-analog computational approach to DNN (deep neural network) training-inferencing AI systems. It’s important to note that the PCM device is both a storage device and a computational device, thus performing two functions in one circuit.

In the past, we have seen PCM circuitry used in neuromorphic AI. The use of PCM here is not that (see our Are neuromorphic chips a dead end? post).

Hybrid digital-analog AI has the potential to be more energy efficient and use a smaller footprint than digital AI alone. Presumably, the new approach is focused on edge devices for IoT and other energy or space limited AI deployments.

What’s different in Hybrid digital-analog AI

As researchers began examining the use of analog circuitry in AI deployments, the nature of analog technology led to inaccuracy and underperformance in DNN inferencing. This was because of the “non-idealities” of analog circuitry. In other words, analog electronics has intrinsic characteristics (noise, drift, limited precision) that make it difficult to reproduce digital logic and digital exactitude in analog circuitry.

The caption for Figure 1 in the article runs to great length, but to summarize: (a) is the model for an image classification DNN, with fewer inputs and outputs so that it can ultimately fit on a 512×512 PCM array; (b) shows how noise is injected during the forward propagation phase of DNN training, and how the DNN weights are flattened into a 2D matrix and programmed into the PCM device using differential conductance, with additional normalization circuitry.

As a result, the researchers had to come up with some slight modifications to the typical DNN training and inferencing process to improve analog PCM inferencing. Those changes involve:

  • Injecting noise during DNN neural network training, so that the resultant DNN model becomes more noise resistant;
  • Flattening the resultant DNN model from 3D to 2D so that neural network node weights can be implemented as differential conductances in the analog PCM circuitry;
  • Normalizing each internal DNN layer’s outputs before they are input to the next layer in the model.

Analog devices are intrinsically noisier than digital devices, so DNN noise sensitivity had to be reduced. During normal DNN training there is both a forward pass of inputs to generate outputs and a backward propagation pass (to adjust node weights) to fit the model to the required outputs. The researchers found that by injecting noise during the forward pass they were able to create a more noise resistant DNN.
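To make the idea concrete, here’s a minimal PyTorch-style sketch of noise injection during the forward pass. The noise level and the exact noise model (Gaussian, scaled by the layer’s largest weight magnitude) are assumptions for illustration, not the paper’s precise recipe.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer that injects Gaussian noise into its outputs during
    training (forward pass only), so the learned weights become more
    tolerant of analog/PCM conductance noise at inferencing time."""
    def __init__(self, in_features, out_features, noise_std=0.05):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.noise_std = noise_std      # assumed noise level, for illustration

    def forward(self, x):
        out = self.linear(x)
        if self.training:
            # additive noise scaled by the largest weight magnitude,
            # a stand-in for PCM conductance fluctuations
            w_max = self.linear.weight.abs().max()
            out = out + torch.randn_like(out) * self.noise_std * w_max
        return out
```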

Differential conductance uses the difference between the conductances of two circuits. So a single node weight is mapped to two different circuit conductance values in the PCM device. By using differential conductance, the PCM device’s inherent noisiness has less effect on DNN node propagation.
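A tiny sketch of the weight-to-conductance mapping might look like the following; the full-scale conductance value and the normalization scheme are assumptions for illustration, not the device’s actual programming procedure.

```python
import numpy as np

def weights_to_differential_conductance(w, g_max=25.0):
    """Map a flattened weight matrix onto pairs of non-negative PCM
    conductances so that w ~ (g_plus - g_minus) / scale.  g_max is an
    assumed full-scale device conductance (arbitrary units)."""
    w = np.asarray(w, dtype=float)
    scale = g_max / np.abs(w).max()            # full-scale normalization
    g_plus = np.where(w > 0,  w * scale, 0.0)  # positive weights on one device
    g_minus = np.where(w < 0, -w * scale, 0.0) # negative weights on the other
    return g_plus, g_minus, scale

def differential_conductance_to_weights(g_plus, g_minus, scale):
    # common-mode disturbances affecting both devices partially cancel
    # in the difference, which is what reduces the effective noise
    return (g_plus - g_minus) / scale
```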

In addition, each layer’s outputs are normalized via additional circuitry before being used as input for the next layer in the model. This has the effect of counteracting PCM circuitry drift over time (see below).

Hybrid AI results

The researchers modeled their new approach and also performed some physical testing of a digital-analog DNN, using CIFAR-10 image data and the ResNet-32 DNN model. The process began with an already trained DNN, which was then retrained while injecting noise during forward pass processing. The resultant DNN was then modeled and programmed into a PCM circuit for implementation testing.

Part D of Figure 4 shows the Baseline, which represents a completely digital implementation using FP32 multiplication logic; Experiment, which represents actual use of the PCM device with a global drift calibration performed on each layer before inferencing; and Model, which represents their digital model of the PCM device and its expected accuracy. The blue band is one standard deviation on the modeled result.

One challenge with any memristive device is that over time its functionality can drift. The researchers implemented a global drift calibration or normalization circuitry to counteract this. One can see evidence of drift in the experimental results between ~20 and ~60 seconds into testing. During this interval PCM inferencing accuracy dropped from 93.8% to 93.2%, but then stayed there for the remainder of the experiment (~28 hrs). The baseline noted in the chart used digital FP32 arithmetic for inferencing and achieved ~93.9% for the duration of the test.

Certainly not as accurate as the baseline all-digital implementation, but implementing the DNN inferencing model in PCM and only losing 0.7% accuracy seems more than offset by the clear gain in energy efficiency and footprint reduction.

While the simplistic global drift calibration (GDC) worked fairly well during testing, the researchers developed another, adaptive approach (adaptive batch normalization statistics, or AdaBS) that uses a calibration image set drawn from the training data: at idle times, these images are fed through the PCM device to calculate an average error used to adjust the PCM circuitry. As modeled and tested, the AdaBS approach increased accuracy and retained it (at least in modeling) over longer time frames.
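Here’s a rough sketch of what an AdaBS-style recalibration pass could look like in PyTorch; the calibration-set handling and the batch-norm reset policy are my assumptions, not the paper’s exact procedure.

```python
import torch

@torch.no_grad()
def adabs_recalibrate(model, calibration_loader):
    """At idle time, run a held-out calibration set through the (drifted)
    model and re-estimate each BatchNorm layer's running mean/variance so
    the normalization tracks the drifted activations."""
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.reset_running_stats()
            m.momentum = None          # use a cumulative moving average
    model.train()                      # BN collects statistics only in train mode
    for images, _ in calibration_loader:
        model(images)                  # forward passes only; no weight updates
    model.eval()
```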

The researchers were also able to show that implementing part (first and last layers) of the DNN model in digital FP32 and the rest in PCM improved inferencing accuracy even more.

~~~~

As shown above, a hybrid digital-analog PCM AI deployment can provide similar accuracy (at least for CIFAR-10/ResNet-32 image recognition) to an all-digital DNN model, while the efficiencies of the PCM analog circuitry allow for a more energy efficient DNN deployment.

Photo Credit(s):

Artistic AI

Read a couple of articles in the past few weeks: one on OpenAI’s Jukebox, and another on computer generated art in Art in America, (artistically) Creative AI poses problems to art criticism. Both of these discuss how AI is starting to have an impact on music and the arts.

I can recall, almost back to when I was in college (a very long time ago), when we were talking about computer generated artwork. The creative AI article talks some about the history of computer art, which in those days used computers to generate random patterns, some of which would be considered art.

AI painting

More recent attempts at AI-created artwork use deep learning neural networks together with generative adversarial networks (GANs). These involve essentially two different neural networks.

  • The first is an Art deep neural network (Art DNN) discriminator (a classification neural network) that is trained on an art genre such as classical, medieval, modern art paintings, etc. This Art DNN is used to grade a new piece of art on how well it conforms to the genre it has been trained on. For example, an Art DNN could be trained on Monet’s body of work, and then it would be able to grade any new art on how well it conforms to Monet’s style.
  • The second is an Art GAN generator, which is used to generate (random) artworks that can then be fed to the Art DNN to determine if they’re any good. This feedback is then used to modify the Art GAN so it generates a better match over time.

The use of these two types of networks has proved very useful in current AI game playing as well as in many other DNN applications that don’t start with a classified data set.
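For readers who want to see the mechanics, here’s a minimal GAN sketch: the discriminator plays the role of the Art DNN grader and the generator plays the Art GAN role described above. Layer sizes and training details are illustrative assumptions, not any particular researcher’s setup.

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())          # "Art GAN" generator
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())             # "Art DNN" discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):                   # real_images: (batch, img_dim)
    batch = real_images.size(0)
    # 1) train the discriminator to grade real vs. generated art
    fake = G(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) train the generator to fool the discriminator
    fake = G(torch.randn(batch, latent_dim))
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```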

However, in this case, a human artist does perform useful additional work during the process. An artist selects the paintings to be used to train the Art DNN. And the artist is active in tweaking/tuning the Art GAN to generate the (random) artwork that approximates the targeted artist.

And it’s in these two roles that there is a place for a (human) artist in creative art generation activities.

AI music

Using AI to generate songs is a bit more complex and requires at least 3 different DNNs to generate the music and another couple for the lyrics:

  • First, a song tokenizer DNN, which is a trained DNN used to compress an artist’s songs into, for lack of a better word, musical phrases or tokens. That way they could take the raw audio of an artist’s song and split it up into tokens, each of which has a value from 0 to 2047. They actually compress (encode) the artist’s songs at 3 different resolutions, which apparently lose some information at each level but retain musical attributes such as pitch, timbre and volume.
  • A second, musical token generator DNN, which is trained to generate musical tokens in the same distribution as a selected artist’s work. This is used to generate a sequence of musical tokens that matches an artist’s musical style. They use a technique based on sparse transformers that can generate (long) sequences of tokens based on a training dataset.
  • A third, song de-tokenizer DNN, which is trained to take the generated musical tokens (in the three resolutions) and convert them into musical compositions.

These three pretty much constitute the bulk of the work for AI to generate song music. They use data augmented with information from LyricWiki, which has the lyrics for 600K recorded songs in English. LyricWiki also has song metadata which includes the artist, the genre, keywords associated with the song, etc. When training the music generator, they add the artist’s name and genre information so that the musical token generator DNN can construct a song specific to an artist and a genre.

The lyrics take another couple of steps. They have the lyrics for every recorded song of an artist from LyricWiki. They use a number of techniques to generate the lyrics for each song and to time the lyrics to the music, including a lexical text generator trained on the artist’s lyrics. Suggest you check out the explanation on OpenAI Jukebox’s website to learn more.

As part of the music generation process, the models learn how to classify songs to a genre. They have taken the body of work for a number of artists and placed them in genre categories which you can see below.

The OpenAI Jukebox website has a number of examples on its home page as well as a complete catalog behind it. The catalog has over 7,000 songs under a number of genres, from Acoustic to Rock and everything in between, in the fashion of a number of artists in each genre, both with and without lyrics. For the (100%) blues category they have over 75 songs similar to artists from B.B. King to Taj Mahal, including songs similar to Fats Domino, Muddy Waters, Johnny Winter and more.

OpenAI Jukebox calls the songs “re-renditions” of the artist, and the process of adding lyrics to the songs lyric conditioning.

Source code for the song generator DNNs is available on GitHub. You can use the code to train on your own music and have it generate songs in your own musical style.

The songs sound OK but not great. The tokenizer/de-tokenizer process results in noise in the generated music. I suppose finer time-resolution tokenizing might reduce this somewhat, but maybe not.

~~~~

The AI song generator is OK, but they need more work on the lyrics and on reducing the noise. The fact that they have generated so many re-renditions means to me the process at this point is completely automated.

I’m also impressed with the AI painter. Yes, there’s human interaction involved (atm), but it does generate some interesting pictures that follow in the style of a targeted artist. I really wanted to see a Picasso generated painting or even a Jackson Pollock generated painting. Now that would be interesting.

So now we have AI song generators and AI painting generators but there’s a lot more to artworks than paintings and songs, such as sculpture, photography, videography, etc. It seems that many of the above approaches to painting and music could be applied to some of these as well.

And then there are plays, fiction and non-fiction works. The songs are ~3 minutes in length and the lyrics are not very long, so anything longer may represent a serious hurdle for any AI generator. For now, these are still safe.

Photo credits:

OFA DNNs, cutting the carbon out of AI

Read an article (Reducing the carbon footprint of AI… in Science Daily) the other day about a new approach to reducing the energy demands of AI deep neural net (DNN) training and inferencing. The article was reporting on a similar piece in MIT News, but both were discussing a technique originally outlined in an ICLR 2020 (Int. Conf. on Learning Representations) paper, Once-for-all: Train one network & specialize it for efficient deployment.

The problem stems from the amount of energy it takes to train a DNN and use it for inferencing. In most cases, training and (more importantly) inferencing can take place on many different computational environments, from IoT devices, to cars, to HPC super clusters and everything in between. In order to create DNN inferencing algorithms for use in all these environments, one would have to train a different DNN for each. Moreover, if you’re doing image recognition applications, resolution levels matter; each resolution level would represent yet another set of DNNs that would need to be trained.

The authors of the paper suggest there’s a better approach. Train one large OFA (once-for-all) DNN that covers the finest resolution and largest neural net required, in such a way that smaller sub-nets can be extracted and deployed for less weighty computational and lower resolution deployments.

The authors contend the OFA approach takes less overall computation (and energy) to create and deploy than training multiple times for each possible resolution and deployment environment. It does take more energy to train than training a few (4-7 judging by the chart) DNNs, but that can be amortized over a vastly larger set of deployments.

OFA DNN explained

Essentially the approach is to train one large (OFA) DNN, with sub-nets that can be used by themselves. The OFA DNN sub-nets have been optimized for different deployment dimensions such as DNN model width, depth and kernel size as well as resolution levels.

While DNN width is simply the number of numeric weights in each layer, and DNN depth is the number of layers, kernel size is not as well known. Kernels are the convolution filters used in convolutional neural networks (ConvNets) to detect features; for example, in human faces these could be mouths, noses, eyes, etc., and kernel size is the spatial extent of those filters. All these dimensions plus resolution levels are used to identify all possible deployment options for an OFA DNN.

OFA secrets

One key to OFA’s success is that any model (sub-network) selected actually shares the weights of all of its larger brethren. That way all the (sub-network) models can be represented by the same DNN, just by selecting the dimensions of interest for your application. If you were to create each and every DNN separately, the number would be on the order of 10**19 DNNs for the example cited in the paper, with depth using {2,3,4} layers, width using {3,4,6} and kernel sizes over 25 different resolution levels.

In order to do something like OFA naively, one would need to train for each different objective (once for each different resolution, depth, width and kernel size). But rather than doing that, OFA uses an approach which attempts to shrink all dimensions at the same time and then fine tunes that subset’s NN weights for accuracy. They call this approach progressive shrinking.

Progressive shrinking, training for different dimensions

Essentially they train first with the largest value for each dimension (the complete DNN) and then, in subsequent training epochs, reduce one or more dimensions required for the various deployments and train just that subset. But these subsequent training passes always use the pre-trained larger DNN weights. As they gradually pick off and train for every possible deployment dimension, the process modifies just those weights in that configuration. This way the weights of the largest DNN are optimized for all the smaller dimensions required. And as a result, one can extract a (defined) sub-net with the dimensions needed for your inferencing deployments.

They use a couple of tricks when training the subsets. For example, when training for smaller kernel sizes, they use the center-most kernel weights and transform them using a transformation matrix to improve accuracy with the smaller kernels. When training for smaller depths, they use the first layers in the DNN and ignore any layers lower in the model. When training for smaller widths, they sort each layer’s channels by highest weight, thus ensuring they retain those parameters that provide the most sensitivity.
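A rough sketch of these selection tricks follows, assuming conv weights of shape (out_channels, in_channels, k, k). These are illustrative selections only; the real method also applies learned kernel transformation matrices when shrinking kernels.

```python
import torch

def center_crop_kernel(weight, small_k):
    """Smaller kernel: take the center small_k x small_k patch of a k x k kernel."""
    k = weight.shape[-1]
    start = (k - small_k) // 2
    return weight[:, :, start:start + small_k, start:start + small_k]

def keep_first_layers(layers, depth):
    """Smaller depth: keep the first `depth` layers, skip the rest."""
    return layers[:depth]

def keep_most_important_channels(weight, n_keep):
    """Smaller width: keep the output channels with the largest L1 weight norm."""
    importance = weight.abs().sum(dim=(1, 2, 3))          # one score per output channel
    idx = torch.argsort(importance, descending=True)[:n_keep]
    return weight[idx]
```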

It’s sort of like multiple video encodings in a single file. Rather than having a separate file for every video encoding format (MPEG-2, MPEG-4, HEVC, etc.), you have one file with all encoding formats embedded within it. If, for example, you needed MPEG-4, you could just extract those elements of the video file representing that encoding.

OFA DNN results

In order to do OFA, one must identify, ahead of time, all the potential inferencing deployments (depth, width, kernel sizes) and resolution levels to support. But in the end, you have a one size fits all trained DNN whose sub-nets can be selected and deployed for any of the pre-specified deployments.

The authors have shown (see table and figure above) that OFA beats (in energy consumed and accuracy level) other State of the Art (SOTA) and Neural (network) Architectural Search (NAS) approaches to training multiple DNNs.

The report goes on to discuss how OFA could be optimized to support different latency (inferencing response time) requirements as well as diverse hardware architectures (CPU, GPU, FPGA, etc.).

~~~~

When I first heard of OFA DNNs, I thought we were on the road to artificial general intelligence, but this is much more specialized than that. It’s unclear to me how many AI DNNs have enough different deployment environments to warrant the use of OFA, but with the proliferation of AI DNNs for IoT, automobiles, robots, etc., there will soon come a time when OFA DNNs and their competition become much more important.

Comments?

Photo Credit(s):

Weight Agnostic Neural Networks (WANNs)

Read an article the other day (Neural Networks Can Drive Without [weight] Learning) about a new form of deep learning neural network (NN) that is not dependent on the weights assigned to network nodes. The new NN is called WANN (Weight Agnostic NN). There’s also a scientific paper (on Github, Weight Agnostic Neural Networks) that describes WANNs in more detail.

How WANNs differ from normal NN

If I understand them properly, WANNs are trained, but instead of assigning weights during training, WANN network architectures (nodes and connections) are modified and optimized to perform well against the training data.

Indeed, most NNs start out by assigning random weights to all network nodes, and then these weights are adjusted through the training cycle until the NN performs well on the training data. But NNs such as these have a structure (# nodes/layer, # layers, connectivity type, etc.), defined by the researcher, that is stable and unchanging during a training-validation cycle. If the NN model is not accurate enough, the researcher has two choices: find better data or change the model’s structure. WANNs start and end with changing the model’s structure.

With WANNs, they start out with a set of NN architectures (#nodes/layer, #layers, connection types, etc.). Each NN architecture is evaluated against the training data with a single shared randomized weight. That shared weight is altered (randomly) for each evaluation pass and the model is evaluated for accuracy.

At the end of a WANN training pass you have a set of evaluation metrics for each model structure. The resultant WANNs are then ordered by performance and complexity. The highest performing networks are then used to create a new population (set) of WANN architectures to be tested, and the process iterates from there. This would presumably continue until you have reached a plateau of accuracy statistics across a number of shared randomized weights. And this would be the WANN model used for the application.
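A toy sketch of that evaluation step is below, using a fixed topology scored with one shared weight at a time. The task (XOR) and the topology are pure illustration; a real WANN search also mutates the topologies between rounds.

```python
import numpy as np

def forward(x, shared_w):
    # every connection carries the same shared weight value
    h = np.tanh(x @ (shared_w * np.ones((2, 3))))      # 2 inputs -> 3 hidden nodes
    return np.tanh(h @ (shared_w * np.ones((3, 1))))   # 3 hidden -> 1 output

def evaluate(shared_weights=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)
    scores = []
    for w in shared_weights:
        pred = forward(X, w) > 0.5
        scores.append((pred == y).mean())
    # a WANN search ranks this topology by its mean (and best) score,
    # then breeds new topologies from the best performers
    return np.mean(scores), np.max(scores)

print(evaluate())
```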

Why WANN?

For a normal NN, each node weight would be adjusted automatically and independently at the end of each training batch. There would, of course, be a large number of batches, causing each weight in the NN nodes to be altered (via floating point arithmetic). So the amount of floating point math would be on the order of #nodes × #layers × #training batches × #training passes (epochs).

WANNs avoid this inner loop math altogether. Instead, they would need to test a model on a number of shared random weights. This would presumably be done after a complete training pass (each epoch). And even if you had the same number of WANN models as nodes in a normal NN, the computation would be much less: something on the order of #models × #epochs (each training pass [or epoch] could conceivably test a different shared random weight).
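A back-of-the-envelope comparison using the two formulas above (all counts are made-up illustrative numbers, not measured values):

```python
# Conventional NN: one weight adjustment per weight, per batch, per epoch
nodes_per_layer, layers, batches, epochs = 1_000, 10, 500, 20
normal_nn_updates = nodes_per_layer * layers * batches * epochs   # 100,000,000

# WANN search: one evaluation per model, per shared weight tested
wann_models, shared_weights_tested = 1_000, 20
wann_evaluations = wann_models * shared_weights_tested            # 20,000

print(normal_nn_updates, wann_evaluations)
```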

Another advantage of WANNs is that they result in simpler, less complex NN models (# nodes, # layers, # of connections, etc.) than normal DL NNs. Simpler NN models could be very useful for IoT applications, where computational power and storage are limited.

The main disadvantage of WANNs is that they aren’t as accurate as normal (weight adjusted) NNs. However, once you have a WANN, you can always elect to re-train it in the normal fashion, by adjusting weights, to gain more accuracy. And doing so would likely bring it much closer to a more complex NN model that was trained from the start by adjusting weights.

WANNs are more like nature

Humans and other mammals (probably avian, aquatic, etc. species as well) seem to be born with certain innate abilities (visual, perceptive, mobility) and with certain habits such as nursing, facial mimicking, hunger-feeding, etc. Presumably these innate abilities and habits are hardwired neuron networks that don’t depend on environmental learning, something that they are all born with.

Conceivably, WANNs could be considered similar to these hardwired (unlearned) neuron networks. WANNs could be used in a similar fashion to embed certain innate habits and abilities into robots or other automation that could be further trained through their interactions with their environment.

~~~~

The GitHub paper has an online WANN model widget with a slider where you can alter a shared random weight and see its impact on the operation of the widget. Playing with this, the only weight that seems to have a significant impact on the actions of the widget is zero…

Photo Credit(s): “Neural Connections In the Human Brain” by Image Editor is licensed under CC BY-NC-ND 2.0 

Supercomputing 2019 (SC19) conference

I was at SC19 last week and as always there was lots to see on the expo floor and at the show in general. Two expo booths that I thought were especially interesting were:

  • Zapata Computing – a quantum computing programming-for-hire outfit, and
  • Cerebras – a new AI wafer scale accelerator chip that sported 400K+ cores in a single package.

Zapata Computing, quantum coding for hire

We’ve been on a sort of quantum thread this past month or so (e.g., see our Quantum computing – part 2 and part 1, The race for quantum supremacy posts). Zapata Computing was at the edge of the exhibit floor in a small booth, pretty much just one guy (Michael Warren) and some handouts. The booth must have had something about quantum computing on it, because I stopped by.

Warren said they have ~20 PhDs from around the world working for them and provide quantum coding for hire. Zapata works with organizations to either get them up to speed on quantum programming or write quantum programs themselves under contract for clients and help run them on quantum computers.

Zapata’s quantum algorithms are designed to run on any type of quantum computer, such as ion trap, superconducting qubit, quantum annealer, etc. They also work with Microsoft Azure Quantum, IBM Q, Rigetti, and Honeywell systems to run quantum programs for customers. Notably missing from this list was Google. Honeywell is new to me but seems active in quantum computing.

Zapata has its own Orquestra quantum toolkit. We have discussed quantum software development kits like IBM Q’s Qiskit previously, but Microsoft has its own QDK and Rigetti has its Forest SDK. So, presumably, Orquestra front-ends these other development kits. Couldn’t find anything on Honeywell, but it’s likely they have their own development kit as well or make use of others.
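For a sense of what writing to one of these SDKs looks like, here’s a minimal Qiskit circuit. Presumably a front end like Orquestra would generate and dispatch circuits like this to whichever vendor backend a client has access to, but that workflow is my assumption, not Zapata’s documented behavior.

```python
from qiskit import QuantumCircuit

# A two-qubit "Bell state" circuit: the hello world of quantum SDKs.
bell = QuantumCircuit(2, 2)
bell.h(0)                      # put qubit 0 into superposition
bell.cx(0, 1)                  # entangle qubit 0 with qubit 1
bell.measure([0, 1], [0, 1])   # read both qubits out into classical bits
print(bell.draw())             # hand this to a simulator or hardware backend to run
```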

In talking with Warren at the show, Zapata is working to come up with a quantum computing cloud, which could be used to run quantum code on any of these quantum computers with the click of a button. Warren made it sound like this was coming out soon.

Some of the Zapata Computing quantum programs they have developed for clients include: logistic simulations, materials design, chemistry simulations, etc.

Warren didn’t mention the cost of running on quantum computers but he said that some companies are more forthright with pricing than others. It seemed Rigetti had a published price list to use their systems but others seemed to want to negotiate price on a per use basis.

It seems only a matter of time before quantum computing becomes just like GPUs. Just another computational accelerator that works well for some workloads but not others. Zapata Computing and Orquestra are just steps along this path.

Cerebras

AI accelerator chips have also been a hot topic for us (see our posts on Google TPU, GraphCore’s system, and the Mythic and Syntiant AI accelerators). But none, with the possible exception of GraphCore, has taken this to quite the same level as Cerebras.

Cerebras offers a wafer scale chip that is embedded into their CS-1 system. The chip has 400K cores, 18GB of (very fast) SRAM (memory), 100Pb/sec (peta-bits or 10**15 bits per second) of bandwidth and draws ~20kW. Their CS-1 system fits in a standard rack taking up 15U of space.

The on-chip fabric is called SWARM which supports a 2D mesh. The SWARM mesh is entirely configurable, to support optimal neural network connectivity. I assume this means that any core can talk directly (with 0 hops) to any other core on the chip through a configuration setup.

The high speed on-chip SRAM supports up to 9PB/sec of memory bandwidth and can be accessed in a single clock cycle. They call the cores Sparse Linear Algebra Compute (SLAC) cores and say that they are optimized to support ML-DL computations, which we assume means floating point arithmetic.

Although you can’t really see the (wafer scale) chip in the picture above, it’s located in the section between the copper plate and the copper heat sink, starting at the copper line between the two. The CS-1 consumes a lot of power, and much of its design goes to providing proper cooling. One can view some of that on the left side of the picture above.

As for software, the Cerebras CS-1 supports TensorFlow and PyTorch as well as standard C++. Their Cerebras Software Platform stack consists of two layers: the Cerebras Intermediate Representation and the Cerebras Graph Compiler (CGC), which feeds their Cerebras Wafer Scale Engine (WSE). The CGC maps neural network nodes to cores on the WSE and probably configures SWARM to provide NN core-to-core connectivity.

It’s great to see hardware innovation again. There was a time when everyone thought that software alone was going to kill off hardware innovation. But the facts are that both need to innovate to take computing forward. Cerebras didn’t tell me a PetaFlop rate for their system, but my guess is it would beat out the 2PFlop GraphCore2 (GC2) system; then again, it’s only a matter of time before GC3 comes out. That being said, what could be beyond wafer scale integration?

~~~~

I enjoy going to SC19 for all the leading edge technology on display. They have some very interesting cooling solutions that I don’t ever see anywhere else. And the student competition is fun: teams of students running HPC workloads around the clock, on donated equipment, from Monday evening until Wednesday evening, with spurious fault injection (by SC19) to see how they and their systems react to the faults while continuing to perform the work needed.

For every SC conference, they create an SCinet to support the show. This year it supported Tb/sec of bandwidth and the WiFi for the floor and conference. All the equipment and time that goes into creating SCinet is donated.

Unfortunately, I didn’t get a chance to go to the keynotes or plenary sessions. I did attend one workshop on container use in HPC and it was completely beyond me. Next year’s SC20 will be in Atlanta.

Photo Credit(s):

Cambrian Explosion of AI DL app’s in industry and the world

I was at the NetApp Insight conference last week and recorded a podcast (see: GreyBeards Podcast) on what NetApp is doing in the AI DL (Deep Learning) space. On the podcast, we talked about a number of verticals that were deploying AI DL right now and using it to improve outcomes.

It was only in 2012 that AI DL broke out and pretty much conquered the speech recognition contest by improving recognition accuracy by leaps and bounds. Prior to that, improvements had been very small and incremental at best. Here we are, just 7 years later, and AI DL models are proliferating across industry and every other sector of the world economy.

DL applications in the real world

At the show, we talked about AI DL models being used in healthcare (radiological image analysis, cell counts for infection assessments), automotive (self driving cars), financial services (fraud detection), and retail (predicting how make up would look on someone).

And earlier this year, at HPE Discover, they discussed a new technique to share the benefits of training data while still keeping that data private. In this case, they use blockchain technology to publish and share DL neural network model weights and other hyper parameters trained for some real world purpose.

Customers download and use the model in their day to day activities, but record the data that their model analyzes and its predictions. They use this data to update (re-train) their DL neural net. They then publish their new neural net model weights and other parameters to all the other customers. Each customer of the model does the same, updating (re-training) their DL neural net.

At some point an owner or global model arbitrator takes all these individual model updates, aggregates the neural net weights into a new neural net model, and publishes the new model. Then the process starts over again. In this way, training data is never revealed and is kept secure and private, but the DL model updates that result from re-training the model on that secured private data are available to every customer.
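A minimal sketch of that aggregation step is below, assuming simple unweighted federated averaging; the actual HPE/blockchain scheme may weight or validate customer updates differently.

```python
import torch

def aggregate(customer_state_dicts):
    """Average the weight updates published by each customer into a new
    global model state (one tensor-wise mean per parameter name)."""
    global_state = {}
    for name in customer_state_dicts[0]:
        global_state[name] = torch.stack(
            [sd[name].float() for sd in customer_state_dicts]).mean(dim=0)
    return global_state

# each customer would then load the new global weights and keep re-training:
# model.load_state_dict(aggregate(published_updates))
```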

Recently, there’s been a slew of articles across many different organizations that show how AI DL is being adopted to work in different areas:

And that’s just a sample of the papers on AI DL activity from the last few weeks.

Next Steps

All it takes is data that can be quantified and classified. With data and classifications in hand, anyone can train a DL model that performs that classification. It doesn’t require GPU farms; decent CPUs are up to the task for TBs of data.

But if you want better prediction/classification accuracy, you will need more data, which means longer AI DL training runs. So at some point, maybe at >100TB of data, or if you use AI DL training a lot, you may want that GPU farm.

The Deep Learning with Python book (my favorite) has a number of examples, such as sentiment analysis of text, median real estate price prediction, and generating text that looks like an author’s work, with maybe a dozen more that one can use to understand AI DL technology. It’s not rocket science; I believe any qualified programmer could do it with some serious study.
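For a flavor of how approachable this is, here’s a minimal Keras sketch in the spirit of the book’s IMDB sentiment example; the hyper-parameters here are assumptions, see the book for the full treatment.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def vectorize(seqs, dim=10000):
    """Multi-hot encode each review into a fixed-length 0/1 vector."""
    out = np.zeros((len(seqs), dim))
    for i, s in enumerate(seqs):
        out[i, s] = 1.0
    return out

(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=10000)

model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability the review is positive
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(vectorize(x_train), y_train, epochs=4, batch_size=512,
          validation_data=(vectorize(x_test), y_test))
```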

So the real question is: what are you doing with your data to make use of AI DL models now?

I suppose the other question ought to be, how can you collect more data and classification information, to train more AI DL models?

~~~~

It’s great to be in the storage business.

Photo Credit(s):

Quantum computing NNs

As many who have been following our blog know, AI, Machine Learning (ML) and Deep Learning (DL) (e.g., see our Learning machine learning – part 3, Industrial revolution deep learning, NVIDIA’s 3U supercomputer, and AI reaches a crossroads posts) have become much more mainstream, and the AI community has anointed DL as the best approach for pattern recognition, classification and prediction, though it has applicability beyond that.

One problem with DL has been its energy costs. There have been some approaches to address this, but none have been entirely successful just yet (e.g., see our Intel new DL Boost, New GraphCore GC2 chips, and AI processing at the edge posts). At one time neuromorphic hardware was the answer, but I’ve become disillusioned with that technology over time (see our Are neuromorphic chips a dead end? post).

This past week we learned of a whole new approach, something called a Quantum Convolutional NN or QCNN (see PhysOrg Introducing QCNN, pre-print of Quantum CNNs, presentation deck on QCNNs, Nature QCNN paper paywall).

Some of you may not know that convolutional neural networks (ConvNets) are the latest in a long line of DL architectures focused on recognizing patterns and classifying data in sequence. DL ConvNets can be used to recognize speech, classify photo segments, analyze ticker tapes, etc.

But why quantum computing

First off, quantum computing (QC) is a new, leading edge technology targeted at solving very hard (NP-complete, wikipedia) problems, like cracking public key encryption keys, solving the traveling salesperson problem and assembling an optimal Bitcoin block (see List of NP complete problems, wikipedia).

QC utilizes quantum mechanical properties of the universe to solve these problems without resorting to brute force searches, such as going down every path in the traveling salesperson problem (see our QC programming and QC at our doorsteps posts).

At the moment, IBM, Google, Intel and others are all working on QC and trying to scale it up by increasing the number of qubits (quantum bits) their systems support. The more qubits, the more quantum storage you have, and the more sophisticated NP-complete problems one can solve. Current qubit counts include 72 qubits for Google, 42 for Intel, and 50 for IBM. Apparently not all qubits are alike, and they don’t last very long, ~100 microseconds (see Timeline of QC, wikipedia).

What’s a QCNN?

What’s new is the use of quantum computing circuits to create ConvNets. Essentially the researchers have created a way to apply AI DL (ConvNet) techniques to quantum computing data (qubits).

Apparently there are QC [qubit] phases that need to be recognized, and what better way to do that than with DL ConvNets. The only problem is that performing DL on QC data with today’s tools would require reading out the quantum state into digital form (itself a pattern recognition problem), converting it to digital data, and then processing it via CPU/GPU DL ConvNets, a classic chicken-or-egg problem. But with QCNNs, one has a DL ConvNet entirely implemented in QC.

DL ConvNets are typically optimized for a specific problem, varying layer counts, nodes/layer, node connectivity, etc. QCNNs match this and also come in various sizes. Above is a QCNN circuit, optimized to recognize (the boundary between?) symmetry-protected topological (SPT) phases (see the pre-print article).

I won’t go into the QC technology used in any detail (as I barely understand it), but the researchers have come up with a way to map DL ConvNets onto QC circuitry. Assuming this all works, one can then use QC to perform DL pattern recognition on qubit data.
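To give a flavor of what such a mapping could look like with today’s circuit tools, here’s a hedged Qiskit sketch of a “quantum convolution” layer: the same parameterized two-qubit unitary applied to neighboring qubit pairs, the quantum analog of sliding one kernel across an input. The gate choices and parameter counts are my assumptions, not the circuit from the paper.

```python
from qiskit import QuantumCircuit
from qiskit.circuit import Parameter

def conv_layer(n_qubits, prefix):
    """One 'quantum convolution' layer: the same 3-parameter two-qubit
    unitary on each neighboring qubit pair (translation invariance,
    like sliding a single kernel across the input)."""
    theta = [Parameter(f"{prefix}_{i}") for i in range(3)]
    layer = QuantumCircuit(n_qubits)
    for q in range(0, n_qubits - 1, 2):
        layer.ry(theta[0], q)
        layer.ry(theta[1], q + 1)
        layer.cx(q, q + 1)
        layer.rz(theta[2], q + 1)
    return layer

qcnn = conv_layer(8, "c1")   # pooling layers (measure + conditional ops) would follow
print(qcnn.draw())
```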

~~~~

Comments?

Photo Credits: