Compressing information through the information bottleneck during deep learning

Read an article in Quanta Magazine (New theory cracks open the black box of deep learning) about a talk (see 18: Information Theory of Deep Learning, YouTube video) done a month or so ago given by Professor Naftali (Tali) Tishby on his theory that all deep learning convolutional neural networks (CNN) exhibit an “information bottleneck” during deep learning. This information bottleneck results in compressing the information present, in for example, an image and only working with the relevant information.

The Professor and his researchers used a simple AI problem (like recognizing a dog) and trained a deep learning CNN to perform this task. At the start of the training process the CNN nodes at the top were all connected to the next layer, and those were all connected to the next layer and so on until you got to the output layer.

Essentially, the researchers found that during the deep learning process, the CNN went from recognizing all features of an image to over time just recognizing (processing?) only the relevant features of an image when successfully trained.

Limits of deep learning CNNs

In his talk the Professor identifies two modes of operations of a deep learning CNN: the encoder layers and decoder layers. The encoder function identifies relevant information in the input and the decoder function takes this relevant information and maps this to an output.

This view results in two statistics that can characterize any deep learning CNN:

  • Sample complexity which refers to the the mutual information inside the last hidden layer of the encoder function, and
  • Accuracy or generalization error, which refers to the mutual information inside the last hidden layer of the decoder function.

Where mutual information is defined as how much of the uncertainty of an input is removed when you have an output that is based on that input. (See the talk for a more formal explanation).

The professor states that any complex deep learning CNN can be characterized by these two statistics where sample complexity determines the number of samples required and accuracy determines the precision by which the deep learning CNN can properly interpret those samples. The deep black line in the chart represents the limits of accuracy achievable at some number of training events, with some number of hidden layers and some sample set.

What happens during deep learning

Moreover, the professor shows an interesting characteristic of all CNNs is that they converge over time in accuracy and that convergence differs based mostly on the number of layers, sample size and training count used.

In the chart, the top row show 3 CNNs with different amounts of training data (5%, 40% and 80% of total). The chart shows the end result and trace of learning within the CNN over the same number of epochs (training cycles). More training data generates more accurate results.

The Professor views those epochs after the farthest right traces (where the trace essentially starts moving up and to the left in the chart), the compression phase of deep learning.

Statistics of deep learning process

The professor goes on to characterize the deep learning  process by calculating the mean and variance of each layers connection weights.

In the chart he shows an standard “eiffel tower” neural network, with 6 hidden layers, each with less neurons (nodes)  than the previous layer (12 nodes, 10 nodes, 7 nodes, etc.). And what he plots is the average weights and variance between layers (red lines are average and variance of the weights for arcs[connections] between nodes in layer 1 to nodes in layer 2, blue lines the mean and variance of weights for arcs between layer 2 and 3, purple lines the mean and variance of weights for arcs between layer 3 and 4, etc.).

He shows that at the start of training the (randomly assigned) weights for each layer have a normalized mean which is higher than its normalized variance. He calls this phase as high signal to noise (I would say the opposite, its low signal to noise, more noise than signal). But as training proceeds (over more epochs), there comes a point where the layer mean drops below its variance and the signal to noise ratio changes dramatically. After that point the mean weights and variance of the group of layers start to diverge or move apart.

The phase (epochs) after the line where the weights means are lower than its variance, he calls the Compression phase of the deep layer CNN training.

The Professor suggests that every complex deep learning CNN looks the same during training if you perform the calculations. The professor shows charts like this for other deep learning CNNs used on different problems and they all exhibit some point where their means are lower than their weights after which means and variances between layers starts to differentiate.

Do layer counts and sample size matter?


It turns out that the more hidden layers you have, the sooner (less training) you need to begin the compression phase. This chart shows the same problem, with different hidden layer counts. One can see in the traces, that not only is accuracy improved with more layers but it also more quickly reaches the compression phase.

Using his sample complexity and accuracy statistics, the Professor has also shown that their are limits to the amount of accuracy to any deep learning CNN based on the function of layer counts, sample size and training event counts.

~~~~

As far as I know, The Professor and his team are the first to try to characterize and understand what happens during deep learning. In doing so, he has shown that the number of layers and the number of samples can be used to predict the speed of learning. And ultimately how accurate any deep learning CNN can be.

Comments?

Industrial revolutions, deep learning & NVIDIA’s 3U AI super computer @ FMS 2017

I was at Flash Memory Summit this past week and besides the fire on the exhibit floor, there was a interesting keynote by Andy Steinbach, PhD from NVIDIA on “Deep Learning: Extracting Maximum Knowledge from Big Data using Big Compute”.  The title was a bit much but his session was great.

2012 the dawn of the 4th industrial revolution

Steinbach started off describing AI, machine learning and deep learning as another industrial revolution, similar to the emergence of steam engines, mass production and automation of production. All of which have changed the world for the better.

Steinbach said that AI is been gestating for 50 years now but in 2012 there was a step change in it’s capabilities.

Prior to 2012 hand coded AI image recognition algorithms were able to achieve about a 74%  image recognition level but in 2012, a deep learning algorithm achieved almost 85%, in one year.

And since then it’s been on a linear trend of improvements such that in 2015, current deep learning algorithms are better than human image recognition. Similar step function improvements were seen in speech recognition as well around 2012.

What drove the improvement?

Machine and deep learning depend on convolutional neural networks. These are layers of connected nodes. There are typically an input layer and output layer and N number of internal layers in a network. The connection weights between nodes control the response of the network.

Todays image recognition convolutional networks can have ~10 layers, billions of parameters, take ~30 Exaflops to train, using 10M images and took days to weeks to train.

Image recognition covolutional neural networks end up modeling the human visual cortex which has neurons to recognize edges and other specialized characteristics of a visual field.

The other thing that happened was that convolutional neural nets were translated to execute on GPUs in 2011. Neural networks had been around in AI since almost the very beginning but their computational complexity made them impossible to use effectively until recently. GPUs with 1000s of cores all able to double precision floating point operations made these networks now much more feasible.

Deep learning training of a network takes place through optimization of the node connections weights. This is done via a back propagation algorithm that was invented in the 1980’s.  Back propagation typically depends on “supervised learning” which adjust the weights of the connections between nodes to come closer to the correct answer, like recognizing Sarah in an image.

Deep learning today

Steinbach showed multiple examples of deep learning algorithms such as:

  • Mortgage prepayment predictor system which takes information about a mortgagee, location, and other data and predicts whether they will pre-pay their mortgage.
  • Car automation image recognition system which recognizes people, cars, lanes, road surfaces, obstacles and just about anything else in front of a car traveling a road.
  • X-ray diagnostic system that can diagnose diseases present in people from the X-ray images.

As far as I know all these algorithms use supervised learning and back propagation to train a convolutional network.

Steinbach did show an example of “un-supervised learning” which essentially was fed a bunch of images and did clustering analysis on them.  Not sure what the back propagation tried to optimize but the system was used to cluster the images in the set. It was able to identify one cluster of just military aircraft images out of the data.

The other advantage of convolutional neural networks is that they can be reused. E.g. the X-ray diagnostic system above used an image recognition neural net as a starting point and then ran it against a supervised set of X-rays with doctor provided diagnoses.

Another advantage of deep learning is that it can handle any number of dimensions. Mathematical optimization algorithms can handle a relatively few dimensions but deep learning can handle any number of dimensions.  The number of input dimensions, the number of nodes in each layer and number of layers in your network are only limited by computational power.

NVIDIA’s DGX a deep learning super computer

At the end of Stienbach’s talk he mentioned the DGX appliance designed by NVIDIA for AI research.

The appliance has 8 state of the art NVIDIA GPUs, connected over a high speed NVLink with anywhere from ~29K to ~41K cores depending on GPU selected, and is capable of 170 to 960 Flops (FP16).

Steinbach said this single 3u appliance would have been rated the number one supercomputer in 2004 beating out a building full of servers. If you were to connect 13 (I think) DGX’s together, you would qualify to be on the top 500 super computers in the world.

~~~~

Comments?

Photo credit(s): Steinbach’s “Deep Learning: Extracting Maximum Knowledge from Big Data using Big Compute” presentation at FMS 2017.

Axellio, next gen, IO intensive server for RT analytics by X-IO Technologies

We were at X-IO Technologies last week for SFD13 in Colorado Springs talking with the team and they showed us their new IO and storage intensive server, the Axellio. They want to sell Axellio to customers that need extreme IOPS, very high bandwidth, and large storage requirements. Videos of X-IO’s sessions at SFD13 are available here.

The hardware

Axellio comes in 2U appliance with two server nodes. Each server supports  2 sockets of Intel E5-26xx v4 CPUs (4 sockets total) supporting from 16 to 88 cores. Each server node can be configured with up to 1TB of DRAM or it also supports NVDIMMs.

There are two key differentiators to Axellio:

  1. The FabricExpress™, a PCIe based interconnect which allows both server nodes to access dual-ported,  2.5″ NVMe SSDs; and
  2. Dense drive trays, the Axellio supports up to 72 (6 trays with 12 drives each) 2.5″ NVMe SSDs offering up to 460TB of raw NVMe flash using 6.4TB NVMe SSDs. Higher capacity NVMe SSDS available soon will increase Axellio capacity to 1PB of raw NVMe flash.

They also probably spent a lot of time on packaging, cooling and power in order to make Axellio a reliable solution for edge computing. We asked if it was NEBs compliant and they told us not yet but they are working on it.

Axellio can also be configured to replace 2 drive trays with 2 processor offload modules such as 2x Intel Phi CPU extensions for parallel compute, 2X Nvidia K2 GPU modules for high end video or VDI processing or 2X Nvidia P100 Tesla modules for machine learning processing. Probably anything that fits into Axellio’s power, cooling and PCIe bus lane limitations would also probably work here.

At the frontend of the appliance there are 1x16PCIe lanes of server retained for networking that can support off the shelf NICs/HCAs/HBAs with HHHL or FHHL cards for Ethernet, Infiniband or FC access to the Axellio. This provides up to 2x100GbE per server node of network access.

Performance of Axellio

With Axellio using all NVMe SSDs, we expect high IO performance. Further, they are measuring IO performance from internal to the CPUs on the Axellio server nodes. X-IO says the Axellio can hit >12Million IO/sec with at 35µsec latencies with 72 NVMe SSDs.

Lab testing detailed in the chart above shows IO rates for an Axellio appliance with 48 NVMe SSDs. With that configuration the Axellio can do 7.8M 4KB random write IOPS at 90µsec average response times and 8.6M 4KB random read IOPS at 164µsec latencies. Don’t know why reads would take longer than writes in Axellio, but they are doing 10% more of them.

Furthermore, the difference between read and write IOP rates aren’t close to what we have seen with other AFAs. Typically, maximum write IOPs are much less than read IOPs. Why Axellio’s read and write IOP rates are so close to one another (~10%) is a significant mystery.

As for IO bandwitdh, Axellio it supports up to 60GB/sec sustained and in the 48 drive lax testing it generated 30.5GB/sec for random 4KB writes and 33.7GB/sec for random 4KB reads. Again much closer together than what we have seen for other AFAs.

Also noteworthy, given PCIe’s bi-directional capabilities, X-IO said that there’s no reason that the system couldn’t be doing a mixed IO workload of both random reads and writes at similar rates. Although, they didn’t present any test data to substantiate that claim.

Markets for Axellio

They really didn’t talk about the software for Axellio. We would guess this is up to the customer/vertical that uses it.

Aside from the obvious use case as a X-IO’s next generation ISE storage appliance, Axellio could easily be used as an edge processor for a massive fabric of IoT devices, analytics processor for large RT streaming data, and deep packet capture and analysis processing for cyber security/intelligence gathering, etc. X-IO seems to be focusing their current efforts on attacking these verticals and others with similar processing requirements.

X-IO Technologies’ sessions at SFD13

Other sessions at X-IO include: Richard Lary, CTO X-IO Technologies gave a very interesting presentation on an mathematically optimized way to do data dedupe (caution some math involved); Bill Miller, CEO X-IO Technologies presented on edge computing’s new requirements and Gavin McLaughlin, Strategy & Communications talked about X-IO’s history and new approach to take the company into more profitable business.

Again all the videos are available online (see link above). We were very impressed with Richard’s dedupe session and haven’t heard as much about bloom filters, since Andy Warfield, CTO and Co-founder Coho Data, talked at SFD8.

For more information, other SFD13 blogger posts on X-IO’s sessions:

Full Disclosure

X-IO paid for our presence at their sessions and they provided each blogger a shirt, lunch and a USB stick with their presentations on it.

 

Google releases new Cloud TPU & Machine Learning supercomputer in the cloud

Last year about this time Google released their 1st generation TPU chip to the world (see my TPU and HW vs. SW … post for more info).

This year they are releasing a new version of their hardware called the Cloud TPU chip and making it available in a cluster on their Google Cloud.  Cloud TPU is in Alpha testing now. As I understand it, access to the Cloud TPU will eventually be free to researchers who promise to freely publish their research and at a price for everyone else.

What’s different between TPU v1 and Cloud TPU v2

The differences between version 1 and 2 mostly seem to be tied to training Machine Learning Models.

TPU v1 didn’t have any real ability to train machine learning (ML) models. It was a relatively dumb (8 bit ALU) chip but if you had say a ML model already created to do something like understand speech, you could load that model into the TPU v1 board and have it be executed very fast. The TPU v1 chip board was also placed on a separate PCIe board (I think), connected to normal x86 CPUs  as sort of a CPU accelerator. The advantage of TPU v1 over GPUs or normal X86 CPUs was mostly in power consumption and speed of ML model execution.

Cloud TPU v2 looks to be a standalone multi-processor device, that’s connected to others via what looks like Ethernet connections. One thing that Google seems to be highlighting is the Cloud TPU’s floating point performance. A Cloud TPU device (board) is capable of 180 TeraFlops (trillion or 10^12 floating point operations per second). A 64 Cloud TPU device pod can theoretically execute 11.5 PetaFlops (10^15 FLops).

TPU v1 had no floating point capabilities whatsoever. So Cloud TPU is intended to speed up the training part of ML models which requires extensive floating point calculations. Presumably, they have also improved the ML model execution processing in Cloud TPU vs. TPU V1 as well. More information on their Cloud TPU chips is available here.

So how do you code a TPU?

Both TPU v1 and Cloud TPU are programmed by Google’s open source TensorFlow. TensorFlow is a set of software libraries to facilitate numerical computation via data flow graph programming.

Apparently with data flow programming you have many nodes and many more connections between them. When a connection is fired between nodes it transfers a multi-dimensional matrix (tensor) to the node. I guess the node takes this multidimensional array does some (floating point) calculations on this data and then determines which of its outgoing connections to fire and how to alter the tensor to send to across those connections.

Apparently, TensorFlow works with X86 servers, GPU chips, TPU v1 or Cloud TPU. Google TensorFlow 1.2.0 is now available. Google says that TensorFlow is in use in over 6000 open source projects. TensorFlow uses Python and 1.2.0 runs on Linux, Mac, & Windows. More information on TensorFlow can be found here.

So where can I get some Cloud TPUs

Google is releasing their new Cloud TPU in the TensorFlow Research Cloud (TFRC). The TFRC has 1000 Cloud TPU devices connected together which can be used by any organization to train machine learning algorithms and execute machine learning algorithms.

I signed up (here) to be an alpha tester. During the signup process the site asked me: what hardware (GPUs, CPUs) and platforms I was currently using to training my ML models; how long does my ML model take to train; how large a training (data) set do I use (ranging from 10GB to >1PB) as well as other ML model oriented questions. I guess there trying to understand what the market requirements are outside of Google’s own use.

Google’s been using more ML and other AI technologies in many of their products and this will no doubt accelerate with the introduction of the Cloud TPU. Making it available to others is an interesting play but this would be one way to amortize the cost of creating the chip. Another way would be to sell the Cloud TPU directly to businesses, government agencies, non government agencies, etc.

I have no real idea what I am going to do with alpha access to the TFRC but I was thinking maybe I could feed it all my blog posts and train a ML model to start writing blog post for me. If anyone has any other ideas, please let me know.

Comments?

Photo credit(s): From Google’s website on the new Cloud TPU