Industrial revolutions, deep learning & NVIDIA’s 3U AI super computer @ FMS 2017

I was at Flash Memory Summit this past week and besides the fire on the exhibit floor, there was a interesting keynote by Andy Steinbach, PhD from NVIDIA on “Deep Learning: Extracting Maximum Knowledge from Big Data using Big Compute”.  The title was a bit much but his session was great.

2012 the dawn of the 4th industrial revolution

Steinbach started off describing AI, machine learning and deep learning as another industrial revolution, similar to the emergence of steam engines, mass production and automation of production. All of which have changed the world for the better.

Steinbach said that AI is been gestating for 50 years now but in 2012 there was a step change in it’s capabilities.

Prior to 2012 hand coded AI image recognition algorithms were able to achieve about a 74%  image recognition level but in 2012, a deep learning algorithm achieved almost 85%, in one year.

And since then it’s been on a linear trend of improvements such that in 2015, current deep learning algorithms are better than human image recognition. Similar step function improvements were seen in speech recognition as well around 2012.

What drove the improvement?

Machine and deep learning depend on convolutional neural networks. These are layers of connected nodes. There are typically an input layer and output layer and N number of internal layers in a network. The connection weights between nodes control the response of the network.

Todays image recognition convolutional networks can have ~10 layers, billions of parameters, take ~30 Exaflops to train, using 10M images and took days to weeks to train.

Image recognition covolutional neural networks end up modeling the human visual cortex which has neurons to recognize edges and other specialized characteristics of a visual field.

The other thing that happened was that convolutional neural nets were translated to execute on GPUs in 2011. Neural networks had been around in AI since almost the very beginning but their computational complexity made them impossible to use effectively until recently. GPUs with 1000s of cores all able to double precision floating point operations made these networks now much more feasible.

Deep learning training of a network takes place through optimization of the node connections weights. This is done via a back propagation algorithm that was invented in the 1980’s.  Back propagation typically depends on “supervised learning” which adjust the weights of the connections between nodes to come closer to the correct answer, like recognizing Sarah in an image.

Deep learning today

Steinbach showed multiple examples of deep learning algorithms such as:

  • Mortgage prepayment predictor system which takes information about a mortgagee, location, and other data and predicts whether they will pre-pay their mortgage.
  • Car automation image recognition system which recognizes people, cars, lanes, road surfaces, obstacles and just about anything else in front of a car traveling a road.
  • X-ray diagnostic system that can diagnose diseases present in people from the X-ray images.

As far as I know all these algorithms use supervised learning and back propagation to train a convolutional network.

Steinbach did show an example of “un-supervised learning” which essentially was fed a bunch of images and did clustering analysis on them.  Not sure what the back propagation tried to optimize but the system was used to cluster the images in the set. It was able to identify one cluster of just military aircraft images out of the data.

The other advantage of convolutional neural networks is that they can be reused. E.g. the X-ray diagnostic system above used an image recognition neural net as a starting point and then ran it against a supervised set of X-rays with doctor provided diagnoses.

Another advantage of deep learning is that it can handle any number of dimensions. Mathematical optimization algorithms can handle a relatively few dimensions but deep learning can handle any number of dimensions.  The number of input dimensions, the number of nodes in each layer and number of layers in your network are only limited by computational power.

NVIDIA’s DGX a deep learning super computer

At the end of Stienbach’s talk he mentioned the DGX appliance designed by NVIDIA for AI research.

The appliance has 8 state of the art NVIDIA GPUs, connected over a high speed NVLink with anywhere from ~29K to ~41K cores depending on GPU selected, and is capable of 170 to 960 Flops (FP16).

Steinbach said this single 3u appliance would have been rated the number one supercomputer in 2004 beating out a building full of servers. If you were to connect 13 (I think) DGX’s together, you would qualify to be on the top 500 super computers in the world.

~~~~

Comments?

Photo credit(s): Steinbach’s “Deep Learning: Extracting Maximum Knowledge from Big Data using Big Compute” presentation at FMS 2017.

Axellio, next gen, IO intensive server for RT analytics by X-IO Technologies

We were at X-IO Technologies last week for SFD13 in Colorado Springs talking with the team and they showed us their new IO and storage intensive server, the Axellio. They want to sell Axellio to customers that need extreme IOPS, very high bandwidth, and large storage requirements. Videos of X-IO’s sessions at SFD13 are available here.

The hardware

Axellio comes in 2U appliance with two server nodes. Each server supports  2 sockets of Intel E5-26xx v4 CPUs (4 sockets total) supporting from 16 to 88 cores. Each server node can be configured with up to 1TB of DRAM or it also supports NVDIMMs.

There are two key differentiators to Axellio:

  1. The FabricExpress™, a PCIe based interconnect which allows both server nodes to access dual-ported,  2.5″ NVMe SSDs; and
  2. Dense drive trays, the Axellio supports up to 72 (6 trays with 12 drives each) 2.5″ NVMe SSDs offering up to 460TB of raw NVMe flash using 6.4TB NVMe SSDs. Higher capacity NVMe SSDS available soon will increase Axellio capacity to 1PB of raw NVMe flash.

They also probably spent a lot of time on packaging, cooling and power in order to make Axellio a reliable solution for edge computing. We asked if it was NEBs compliant and they told us not yet but they are working on it.

Axellio can also be configured to replace 2 drive trays with 2 processor offload modules such as 2x Intel Phi CPU extensions for parallel compute, 2X Nvidia K2 GPU modules for high end video or VDI processing or 2X Nvidia P100 Tesla modules for machine learning processing. Probably anything that fits into Axellio’s power, cooling and PCIe bus lane limitations would also probably work here.

At the frontend of the appliance there are 1x16PCIe lanes of server retained for networking that can support off the shelf NICs/HCAs/HBAs with HHHL or FHHL cards for Ethernet, Infiniband or FC access to the Axellio. This provides up to 2x100GbE per server node of network access.

Performance of Axellio

With Axellio using all NVMe SSDs, we expect high IO performance. Further, they are measuring IO performance from internal to the CPUs on the Axellio server nodes. X-IO says the Axellio can hit >12Million IO/sec with at 35µsec latencies with 72 NVMe SSDs.

Lab testing detailed in the chart above shows IO rates for an Axellio appliance with 48 NVMe SSDs. With that configuration the Axellio can do 7.8M 4KB random write IOPS at 90µsec average response times and 8.6M 4KB random read IOPS at 164µsec latencies. Don’t know why reads would take longer than writes in Axellio, but they are doing 10% more of them.

Furthermore, the difference between read and write IOP rates aren’t close to what we have seen with other AFAs. Typically, maximum write IOPs are much less than read IOPs. Why Axellio’s read and write IOP rates are so close to one another (~10%) is a significant mystery.

As for IO bandwitdh, Axellio it supports up to 60GB/sec sustained and in the 48 drive lax testing it generated 30.5GB/sec for random 4KB writes and 33.7GB/sec for random 4KB reads. Again much closer together than what we have seen for other AFAs.

Also noteworthy, given PCIe’s bi-directional capabilities, X-IO said that there’s no reason that the system couldn’t be doing a mixed IO workload of both random reads and writes at similar rates. Although, they didn’t present any test data to substantiate that claim.

Markets for Axellio

They really didn’t talk about the software for Axellio. We would guess this is up to the customer/vertical that uses it.

Aside from the obvious use case as a X-IO’s next generation ISE storage appliance, Axellio could easily be used as an edge processor for a massive fabric of IoT devices, analytics processor for large RT streaming data, and deep packet capture and analysis processing for cyber security/intelligence gathering, etc. X-IO seems to be focusing their current efforts on attacking these verticals and others with similar processing requirements.

X-IO Technologies’ sessions at SFD13

Other sessions at X-IO include: Richard Lary, CTO X-IO Technologies gave a very interesting presentation on an mathematically optimized way to do data dedupe (caution some math involved); Bill Miller, CEO X-IO Technologies presented on edge computing’s new requirements and Gavin McLaughlin, Strategy & Communications talked about X-IO’s history and new approach to take the company into more profitable business.

Again all the videos are available online (see link above). We were very impressed with Richard’s dedupe session and haven’t heard as much about bloom filters, since Andy Warfield, CTO and Co-founder Coho Data, talked at SFD8.

For more information, other SFD13 blogger posts on X-IO’s sessions:

Full Disclosure

X-IO paid for our presence at their sessions and they provided each blogger a shirt, lunch and a USB stick with their presentations on it.

 

Google releases new Cloud TPU & Machine Learning supercomputer in the cloud

Last year about this time Google released their 1st generation TPU chip to the world (see my TPU and HW vs. SW … post for more info).

This year they are releasing a new version of their hardware called the Cloud TPU chip and making it available in a cluster on their Google Cloud.  Cloud TPU is in Alpha testing now. As I understand it, access to the Cloud TPU will eventually be free to researchers who promise to freely publish their research and at a price for everyone else.

What’s different between TPU v1 and Cloud TPU v2

The differences between version 1 and 2 mostly seem to be tied to training Machine Learning Models.

TPU v1 didn’t have any real ability to train machine learning (ML) models. It was a relatively dumb (8 bit ALU) chip but if you had say a ML model already created to do something like understand speech, you could load that model into the TPU v1 board and have it be executed very fast. The TPU v1 chip board was also placed on a separate PCIe board (I think), connected to normal x86 CPUs  as sort of a CPU accelerator. The advantage of TPU v1 over GPUs or normal X86 CPUs was mostly in power consumption and speed of ML model execution.

Cloud TPU v2 looks to be a standalone multi-processor device, that’s connected to others via what looks like Ethernet connections. One thing that Google seems to be highlighting is the Cloud TPU’s floating point performance. A Cloud TPU device (board) is capable of 180 TeraFlops (trillion or 10^12 floating point operations per second). A 64 Cloud TPU device pod can theoretically execute 11.5 PetaFlops (10^15 FLops).

TPU v1 had no floating point capabilities whatsoever. So Cloud TPU is intended to speed up the training part of ML models which requires extensive floating point calculations. Presumably, they have also improved the ML model execution processing in Cloud TPU vs. TPU V1 as well. More information on their Cloud TPU chips is available here.

So how do you code a TPU?

Both TPU v1 and Cloud TPU are programmed by Google’s open source TensorFlow. TensorFlow is a set of software libraries to facilitate numerical computation via data flow graph programming.

Apparently with data flow programming you have many nodes and many more connections between them. When a connection is fired between nodes it transfers a multi-dimensional matrix (tensor) to the node. I guess the node takes this multidimensional array does some (floating point) calculations on this data and then determines which of its outgoing connections to fire and how to alter the tensor to send to across those connections.

Apparently, TensorFlow works with X86 servers, GPU chips, TPU v1 or Cloud TPU. Google TensorFlow 1.2.0 is now available. Google says that TensorFlow is in use in over 6000 open source projects. TensorFlow uses Python and 1.2.0 runs on Linux, Mac, & Windows. More information on TensorFlow can be found here.

So where can I get some Cloud TPUs

Google is releasing their new Cloud TPU in the TensorFlow Research Cloud (TFRC). The TFRC has 1000 Cloud TPU devices connected together which can be used by any organization to train machine learning algorithms and execute machine learning algorithms.

I signed up (here) to be an alpha tester. During the signup process the site asked me: what hardware (GPUs, CPUs) and platforms I was currently using to training my ML models; how long does my ML model take to train; how large a training (data) set do I use (ranging from 10GB to >1PB) as well as other ML model oriented questions. I guess there trying to understand what the market requirements are outside of Google’s own use.

Google’s been using more ML and other AI technologies in many of their products and this will no doubt accelerate with the introduction of the Cloud TPU. Making it available to others is an interesting play but this would be one way to amortize the cost of creating the chip. Another way would be to sell the Cloud TPU directly to businesses, government agencies, non government agencies, etc.

I have no real idea what I am going to do with alpha access to the TFRC but I was thinking maybe I could feed it all my blog posts and train a ML model to start writing blog post for me. If anyone has any other ideas, please let me know.

Comments?

Photo credit(s): From Google’s website on the new Cloud TPU