New GraphCore GC2 chips with 2PFlop performance in a Dell Server

I was at the Dell EMC Analyst Summit this past week, and at the show they had a series of sessions describing some of Dell's venture capital investments. One of the sessions was about GraphCore, a UK chip design firm that's working on a new AI chip.


Their new GC2 chip is now out and available for customers to use. The new chip offers unprecedented performance for AI NN computations.

Hardware

GraphCore's new Colossus GC2 chip holds 1216 IPU-Cores™. Each IPU delivers 100GFlops and is capable of running 7 threads. The GC2 chip includes 300MB of on-chip memory, with an aggregate of 30TB/s of memory bandwidth. Each IPU supports low precision floating point arithmetic with completely parallel/concurrent execution. The GC2 chip has 23.6B transistors.

Each GC2 chip supports 80 IPU-Links™ to connect to other GC2 chips, with 2.5Tb/s of chip-to-chip bandwidth. Further, the chip includes a PCIe Gen 4 x16 link (31.5GB/s) to host processors. And each chip supports up to 8TB/s of on-chip IPU-Exchange™ bandwidth for IPU-to-IPU communications within the chip.

The GC2 chip is available on a PCIe accelerator board that includes 2 GC2 chips. It's also available in a Dell server configuration with 8 of their PCIe accelerator boards. In the server, with 2 GC2 chips per board, that works out to ~19.5K IPUs and ~2.0PFlops of IPU processing power in total.
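
As a sanity check on those totals, here's the back-of-the-envelope arithmetic (my own math from the figures above, not GraphCore's or Dell's published numbers):

```python
# Back-of-the-envelope check of the Dell server's aggregate IPU numbers
# (my arithmetic from the figures quoted above, not a published spec).

IPUS_PER_CHIP = 1216          # IPU-Cores per GC2 chip
GFLOPS_PER_IPU = 100          # per-IPU throughput quoted above
CHIPS_PER_BOARD = 2           # GC2 chips per PCIe accelerator board
BOARDS_PER_SERVER = 8         # accelerator boards per Dell server

total_ipus = IPUS_PER_CHIP * CHIPS_PER_BOARD * BOARDS_PER_SERVER
total_pflops = total_ipus * GFLOPS_PER_IPU / 1e6   # GFlops -> PFlops

print(f"IPUs per server:   {total_ipus:,}")        # 19,456 (~19.5K)
print(f"PFlops per server: {total_pflops:.2f}")    # ~1.95 (~2.0 PFlops)
```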

Software

GC2 IPUs support GraphCore's Poplar® software and APIs, which allow users to code in many of their favorite AI frameworks, such as PyTorch and TensorFlow.
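
To give a feel for what that framework-level coding looks like, here's a minimal sketch using stock PyTorch/torchvision. Under Poplar the same model graph would be compiled for the IPUs, but I haven't seen GraphCore's API, so that step is left out and this runs on the CPU as-is:

```python
# Sketch of the "bring your favorite framework" workflow: a standard
# ResNet-50 defined with stock PyTorch/torchvision. Under GraphCore's Poplar
# stack the same graph would be compiled for the IPUs; that IPU-specific
# compile/deploy step is omitted here since I haven't seen that API.
import torch
from torchvision.models import resnet50

model = resnet50()                            # ordinary framework-level model
images = torch.randn(4, 3, 224, 224)          # a dummy batch of 4 RGB images
with torch.no_grad():
    logits = model(images)
print(logits.shape)                           # torch.Size([4, 1000])
```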

At the NIPS 2017 conference, GraphCore showed ResNet-50, DeepBench LSTM RNN, and DeepVoice WaveNet performance benchmark results for their GC2 accelerator cards.

The chart above shows DeepBench LSTM RNN runs comparing their GC2 accelerator card against an Nvidia P100 GPU board (longer is better).

DeepBench is a set of workloads that mimic or simulate typical deep neural net operations and is used to compare NN hardware systems. The chart above compares DeepBench RNN inference operations on the GC2 accelerator card vs. Nvidia P100 cards at three levels of response time (<2msec, <5msec and <7msec).
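
For a rough idea of the kind of measurement involved, here's a sketch that times single-inference forward passes through an LSTM and checks them against a latency budget. The layer sizes and the 5msec budget are illustrative choices of mine, not DeepBench's actual configurations or methodology:

```python
# Rough sketch of a DeepBench-style RNN inference measurement: time single
# forward passes through an LSTM and count how many fit a latency budget.
# Sizes and the 5 msec budget are illustrative, not DeepBench's actual setup.
import time
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=256, hidden_size=1024, num_layers=1, batch_first=True)
lstm.eval()
x = torch.randn(1, 50, 256)         # batch of 1, 50 timesteps, 256 features

latencies = []
with torch.no_grad():
    for _ in range(100):
        start = time.perf_counter()
        lstm(x)
        latencies.append(time.perf_counter() - start)

budget = 0.005                      # 5 msec latency budget
met = sum(1 for t in latencies if t <= budget)
print(f"median latency: {sorted(latencies)[50]*1e3:.2f} ms, "
      f"{met}/100 runs within {budget*1e3:.0f} ms")
```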

As can be seen in the chart, the GraphCore GC2 accelerator card performed significantly better (from 182X to 242X) than the Nvidia accelerator card executing NN inferencing at <5msec and <7msec latency, and was able to perform ~42K inferences at <2msec latency, a level the Nvidia P100 couldn't reach at all.

~~~~
The GC2 chip, accelerator card and the Dell EMC servers that run them look to be a significant advance in AI NN computations. We didn't see any technical specs for the server, but we assume it comes in a 4U configuration and uses less power than 8 GPUs.

However, at the moment, the servers are sold out. We have no information on the GC2 accelerator cards, but our guess is that they are sold out as well, and probably ditto for the chips. Dell didn't quote us any pricing on the servers, so it's hard to know whether we could afford one, even if they weren't sold out.

Who wouldn’t want to own a 4U server with 2PFlops performance for their AI apps?

Comments?

Photo Credit(s): Photos taken during Dell EMC Analyst Summit GraphCore presentation

Photos from GraphCore NIPS 2017 presentations

Industrial revolutions, deep learning & NVIDIA’s 3U AI super computer @ FMS 2017

I was at Flash Memory Summit this past week and, besides the fire on the exhibit floor, there was an interesting keynote by Andy Steinbach, PhD, from NVIDIA on "Deep Learning: Extracting Maximum Knowledge from Big Data using Big Compute". The title was a bit much, but his session was great.

2012: the dawn of the 4th industrial revolution

Steinbach started off describing AI, machine learning and deep learning as another industrial revolution, similar to the emergence of steam engines, mass production and automation of production, all of which have changed the world for the better.

Steinbach said that AI has been gestating for 50 years now, but in 2012 there was a step change in its capabilities.

Prior to 2012, hand-coded AI image recognition algorithms were able to achieve about a 74% image recognition level, but in 2012 a deep learning algorithm achieved almost 85%, a jump of more than 10 points in a single year.

And since then it's been on a linear trend of improvement, such that by 2015 deep learning algorithms were better than humans at image recognition. Similar step-function improvements were seen in speech recognition around 2012.

What drove the improvement?

Machine and deep learning here depend on (convolutional) neural networks: layers of connected nodes. There is typically an input layer, an output layer, and some number of internal layers in a network. The connection weights between nodes control the response of the network.
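
To make that picture concrete, here's a toy sketch (my own example, not from the talk) of a network with an input layer, two internal layers and an output layer, showing the weight matrices that connect them:

```python
# A toy sketch of the "layers of connected nodes" picture: an input layer,
# two internal (hidden) layers and an output layer, where the learned weight
# matrices between layers determine the network's response.
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),   # input layer -> hidden layer 1
    nn.Linear(128, 64), nn.ReLU(),    # hidden layer 1 -> hidden layer 2
    nn.Linear(64, 10),                # hidden layer 2 -> output layer
)

for name, param in net.named_parameters():
    print(name, tuple(param.shape))   # e.g. "0.weight (128, 784)": the
                                      # connection weights between two layers
```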

Today's image recognition convolutional networks can have ~10 layers and billions of parameters, take ~30 Exaflops of compute to train on 10M images, and can take days to weeks to train.
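
A quick back-of-the-envelope shows why ~30 Exaflops translates into days to weeks of training time; the sustained throughput figures below are my own illustrative assumptions, not numbers from the talk:

```python
# Back-of-the-envelope: why ~30 Exaflops of training compute means days to
# weeks of wall-clock time. The sustained throughput figures are my own
# illustrative assumptions, not figures from the presentation.
TRAINING_FLOPS = 30e18                      # ~30 Exaflops quoted above

for sustained_tflops in (5, 20, 100):       # assumed sustained TFlops/s
    seconds = TRAINING_FLOPS / (sustained_tflops * 1e12)
    print(f"{sustained_tflops:>4} TFlops sustained -> {seconds/86400:.1f} days")
```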

Image recognition convolutional neural networks end up modeling the human visual cortex, which has neurons that recognize edges and other specialized characteristics of a visual field.

The other thing that happened was that convolutional neural nets were translated to execute on GPUs in 2011. Neural networks had been around in AI since almost the very beginning, but their computational complexity made them impossible to use effectively until recently. GPUs with 1000s of cores, all able to perform double precision floating point operations, made these networks much more feasible.

Deep learning training of a network takes place through optimization of the node connection weights. This is done via the back propagation algorithm, which was invented in the 1980s. Back propagation typically depends on "supervised learning", which adjusts the weights of the connections between nodes to come closer to the correct answer, like recognizing Sarah in an image.
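
Here's a minimal sketch of that supervised, back propagation training loop; the data and labels are synthetic, purely to show the mechanics:

```python
# Minimal sketch of supervised learning with back propagation: the loss
# measures how far the network is from the correct labels, and each backward
# pass nudges the connection weights toward a better answer. Data and labels
# here are random, purely to show the mechanics.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 100)              # 64 synthetic training examples
y = torch.randint(0, 2, (64,))        # their "correct answers" (labels)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()                   # back propagation of the error
    optimizer.step()                  # adjust the connection weights
print(f"final training loss: {loss.item():.3f}")
```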

Deep learning today

Steinbach showed multiple examples of deep learning algorithms such as:

  • A mortgage prepayment predictor system, which takes information about a mortgagee, location, and other data and predicts whether they will pre-pay their mortgage.
  • A car automation image recognition system, which recognizes people, cars, lanes, road surfaces, obstacles and just about anything else in front of a car traveling down a road.
  • An X-ray diagnostic system that can diagnose diseases present in people from their X-ray images.

As far as I know all these algorithms use supervised learning and back propagation to train a convolutional network.

Steinbach did show an example of "unsupervised learning", which was essentially fed a bunch of images and did clustering analysis on them. I'm not sure what the back propagation tried to optimize, but the system was used to cluster the images in the set; it was able to identify one cluster of just military aircraft images out of the data.
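
I don't know what Steinbach's unsupervised example actually optimized, but one common recipe that produces this kind of result is to embed each image with a neural net and then cluster the embeddings. Here's a sketch under that assumption, with random vectors standing in for the per-image features:

```python
# Sketch of one common unsupervised recipe (an assumption, not necessarily
# what the demo did): embed each image as a feature vector, then cluster the
# vectors. Random vectors stand in for the per-image features here.
import torch
from sklearn.cluster import KMeans

embeddings = torch.randn(1000, 512).numpy()   # stand-in for per-image features
clusters = KMeans(n_clusters=10, n_init=10).fit_predict(embeddings)
print(clusters[:20])                           # cluster label for each "image"
```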

The other advantage of convolutional neural networks is that they can be reused. For example, the X-ray diagnostic system above used an image recognition neural net as a starting point and then trained it further on a supervised set of X-rays with doctor-provided diagnoses.
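
Here's a hedged sketch of that reuse pattern (often called transfer learning); the diagnostic class count and the X-ray batch below are placeholders, not details from the talk:

```python
# Sketch of the reuse pattern described above (transfer learning): start from
# an image recognition network (in practice, pre-trained weights would be
# loaded), replace its output layer with one sized for the new diagnostic
# classes, and fine-tune on labeled X-rays. Class count and data are placeholders.
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50()                              # image-recognition net as a starting point
model.fc = nn.Linear(model.fc.in_features, 5)   # 5 hypothetical diagnosis classes

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
xray_batch = torch.randn(8, 3, 224, 224)        # placeholder labeled X-ray images
labels = torch.randint(0, 5, (8,))              # placeholder doctor-provided diagnoses

loss = nn.CrossEntropyLoss()(model(xray_batch), labels)
loss.backward()                                 # one fine-tuning step
optimizer.step()
```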

Another advantage of deep learning is that it can handle very high dimensionality. Traditional mathematical optimization algorithms can handle relatively few dimensions, but with deep learning the number of input dimensions, the number of nodes in each layer, and the number of layers in your network are only limited by computational power.

NVIDIA's DGX, a deep learning super computer

At the end of Steinbach's talk he mentioned the DGX appliance designed by NVIDIA for AI research.

The appliance has 8 state of the art NVIDIA GPUs, connected over high speed NVLink, with anywhere from ~29K to ~41K cores depending on the GPUs selected, and is capable of 170 to 960 TFlops (FP16).

Steinbach said this single 3U appliance would have been rated the number one supercomputer in 2004, beating out a building full of servers. If you were to connect 13 (I think) DGXs together, you would qualify for the list of the top 500 supercomputers in the world.

~~~~

Comments?

Photo credit(s): Steinbach’s “Deep Learning: Extracting Maximum Knowledge from Big Data using Big Compute” presentation at FMS 2017.