Compressing information through the information bottleneck during deep learning

Read an article in Quanta Magazine (New theory cracks open the black box of deep learning) about a talk (see 18: Information Theory of Deep Learning, YouTube video) done a month or so ago given by Professor Naftali (Tali) Tishby on his theory that all deep learning convolutional neural networks (CNN) exhibit an “information bottleneck” during deep learning. This information bottleneck results in compressing the information present, in for example, an image and only working with the relevant information.

The Professor and his researchers used a simple AI problem (like recognizing a dog) and trained a deep learning CNN to perform this task. At the start of the training process the CNN nodes at the top were all connected to the next layer, and those were all connected to the next layer and so on until you got to the output layer.

Essentially, the researchers found that during the deep learning process, the CNN went from recognizing all features of an image to over time just recognizing (processing?) only the relevant features of an image when successfully trained.

Limits of deep learning CNNs

In his talk the Professor identifies two modes of operations of a deep learning CNN: the encoder layers and decoder layers. The encoder function identifies relevant information in the input and the decoder function takes this relevant information and maps this to an output.

This view results in two statistics that can characterize any deep learning CNN:

  • Sample complexity which refers to the the mutual information inside the last hidden layer of the encoder function, and
  • Accuracy or generalization error, which refers to the mutual information inside the last hidden layer of the decoder function.

Where mutual information is defined as how much of the uncertainty of an input is removed when you have an output that is based on that input. (See the talk for a more formal explanation).

The professor states that any complex deep learning CNN can be characterized by these two statistics where sample complexity determines the number of samples required and accuracy determines the precision by which the deep learning CNN can properly interpret those samples. The deep black line in the chart represents the limits of accuracy achievable at some number of training events, with some number of hidden layers and some sample set.

What happens during deep learning

Moreover, the professor shows an interesting characteristic of all CNNs is that they converge over time in accuracy and that convergence differs based mostly on the number of layers, sample size and training count used.

In the chart, the top row show 3 CNNs with different amounts of training data (5%, 40% and 80% of total). The chart shows the end result and trace of learning within the CNN over the same number of epochs (training cycles). More training data generates more accurate results.

The Professor views those epochs after the farthest right traces (where the trace essentially starts moving up and to the left in the chart), the compression phase of deep learning.

Statistics of deep learning process

The professor goes on to characterize the deep learning  process by calculating the mean and variance of each layers connection weights.

In the chart he shows an standard “eiffel tower” neural network, with 6 hidden layers, each with less neurons (nodes)  than the previous layer (12 nodes, 10 nodes, 7 nodes, etc.). And what he plots is the average weights and variance between layers (red lines are average and variance of the weights for arcs[connections] between nodes in layer 1 to nodes in layer 2, blue lines the mean and variance of weights for arcs between layer 2 and 3, purple lines the mean and variance of weights for arcs between layer 3 and 4, etc.).

He shows that at the start of training the (randomly assigned) weights for each layer have a normalized mean which is higher than its normalized variance. He calls this phase as high signal to noise (I would say the opposite, its low signal to noise, more noise than signal). But as training proceeds (over more epochs), there comes a point where the layer mean drops below its variance and the signal to noise ratio changes dramatically. After that point the mean weights and variance of the group of layers start to diverge or move apart.

The phase (epochs) after the line where the weights means are lower than its variance, he calls the Compression phase of the deep layer CNN training.

The Professor suggests that every complex deep learning CNN looks the same during training if you perform the calculations. The professor shows charts like this for other deep learning CNNs used on different problems and they all exhibit some point where their means are lower than their weights after which means and variances between layers starts to differentiate.

Do layer counts and sample size matter?

It turns out that the more hidden layers you have, the sooner (less training) you need to begin the compression phase. This chart shows the same problem, with different hidden layer counts. One can see in the traces, that not only is accuracy improved with more layers but it also more quickly reaches the compression phase.

Using his sample complexity and accuracy statistics, the Professor has also shown that their are limits to the amount of accuracy to any deep learning CNN based on the function of layer counts, sample size and training event counts.


As far as I know, The Professor and his team are the first to try to characterize and understand what happens during deep learning. In doing so, he has shown that the number of layers and the number of samples can be used to predict the speed of learning. And ultimately how accurate any deep learning CNN can be.


Hyperloop One in Colorado?

Read a couple of articles last week (TechCrunch, ArsTechnica & Denver Post) about Colorado becoming a winner in the Hyperloop One Global Challenge. The Colorado Department of Transportation (DoT) have joined with Hyperloop One to commission a study on Hyperloop transportation across the front range, from Cheyenne, WY to Pueblo, CO.

There’s been talk forever about adding a passenger train in Colorado from Fort Collins to Pueblo but every time they look at it they can’t make the economics work. How’s this different?

Transportation and the Queen city of the Prairie

Transportation has always been important to Denver. It was the Denver Pacific railroad from Denver to Cheyenne that first linked Denver to the rest of the nation. But even before that there was a stage coach line (Leavenworth & Pike’s Peak Express) that went through Denver to reduce travel time. Denver is currently the largest city within 500 miles and the second only to Phoenix as the most populus city in the mountain west.

Denver International Airport is a major hub and the world’s sixth busiest airport. Denver is a cross road for major north-south and east-west highways through the mountain west. Both the BNSF and Union Pacific railroads serve Denver and Denver is one of the major stops on the Amtrak  passenger train from San Francisco to Chicago.

Why Hyperloop?

Hyperloop can provide much faster travel, even faster than airplanes. Hyperloop can go up to 760 mph (1200 km/h) and should average 600 mph (970 km/h) from point to point

Further, it could potentially require less security.  Hyperloop can go above or below ground. But in either case a terrorist act shouldn’t be as harmful as one on a plane thats traveling at 20 to 30,000 feet in the air.

And because it can go above or below ground it could potentially make use current transportation right of way corridors for building its tubes. Although to go west, it’s going to need a new tunnel or two through the mountains.

Stops along the way

The proposed hyperloop track will bring it through Greeley and as far west as Vail. For a total of 360 miles. Cheyenne to Pueblo have about 10 urban centers between and west of them (Cheyenne, Fort Collins, Greely, Longmont-Boulder, Denver, Denver Tech Center [DTC], West [Denver] metro, Silverthorne/Dillon, Vail, Colorado Springs and Pueblo).

Cheyenne to Pueblo is is 213 miles apart and ~3.5 hr drive with Denver at about the 1/2 way point. With Hyperloop, Denver to either location should take ~10 minutes without stops and the total trip, Cheyenne to Pueblo should be ~21 minutes.

Yes but is there any demand

I would think the way to get a handle on any potential market is to examine airline traffic between these cities. Airplanes can travel at close to these speeds and the costs are public.

But today there’s not much airline traffic between Cheyenne, Denver and Pueblo.  Flights to Vail are mostly seasonal. I could only find one flight from Denver to Cheyenne over a week, one flight between Cheyenne and Pueblo, and 16 flights between Denver and Pueblo. The airplanes used on these trips only holds 9 passengers, so maybe that would amount to a maximum of 162 air travelers a week.

The other approach to estimating potential passengers is to use highway traffic between these destinations. Yes the interstate (I25) from Cheyenne through Denver to Pueblo is constantly busy and needs another lane or two in each direction to handle peak travel. And travel to Vail is very busy during weekends. But how many of these people would be willing to forego a car and travel by Hyperloop?

I travel on tollroads to get to the Denver Airport and it’s a lot faster then traveling non-tollroad highways. But the cost for me is a business expense and it’s not that frequent. These days there’s not much traffic on my tollroad corridor and at rush hour, there’s very few times where one has to slow down. But there are plenty of people coming to the airport each day from the NorthWest and SouthEast Denver suburbs that could use these tollroads but don’t.

And what can you do in Pueblo, Cheyenne or Denver for that matter without a car. It depends on where you end up. The current stops in Denver include the Denver International Airport, DTC, or West Metro (Golden?). Denver, Golden, Boulder, Vail, Greeley and Fort Collins all have compact downtowns with decent transportation. But for the rest of the stops along the way, you will probably want access to a car to get anywhere. There’s always Uber and Left and worst case renting a car.

So maybe Hyperloop would compete for all air travel and some portion of the car travel between along the Cheyenne to Denver to Pueblo. It just may not be large enough.

Other alternative routes

Why stop at Cheyenne, what about Jackson WY or Billings MT? And why Pueblo what about Sante Fe and Albuquerque in NM. And you could conceivably go down to Brownsville, TX and extend up to Calgary and Edmonton in Alberta, Canada, if it made sense. I suppose it’s a question of how many people for what distance.

I would think that going east-west would be more profitable. Say Kansas City to Salt Lake City with Denver in between. With this corridor: 1) the distances are longer (Kansas to Salt Lake is 910 mi [~1465 km]); 2) the metropolitan areas are much larger; and 3) the air travel between them is more popular.

There are currently 10 winners for Hyperloop One’s Global Challenge Contest.  The other routes in the USA include Texas (Dallas, Houston & San Antonio), Florida (Miami to Orlando), & the midwest (Chicago IL to Columbus OH to Pittsburgh PA). But there are others in Canada and Mexico in North America and more in Europe and India.

Hyperloop One will “commit meaningful business and engineering resources and work closely with each of the winning teams/routes to determine their commercial viability.” All this means that each of the winners will be examined professionally to see if it makes economic sense.

Of the 10 winners, Colorado’s route has the least population, almost by a factor of 2. Not sure why we are even in contention, but maybe it’s the ease of building the tubes that makes us a good candidate.

In any case, the public-private partnership has begun to work on the feasibility study.


Photo Credit(s): 7 hyperloop facts Elon Musk would love us to know, Detechter

Take a ride on Hyperloop…, Daily Mail


Mesosphere, Kubernetes and the coming container orchestration consensus

Read a story this past week in TechCrunch, Mesosphere adds Kubernetes support, about how Mesosphere with their own container orchestration software (called Marathon) will now support Google Kubernetes clusters and container orchestration services.

Mesosphere uses their own DC/OS (data center/operating system) to provide service discovery, resource management and networking for container cluster deployments across multiple machines.

DC/OS sounds similar to Kubo discussed in last week’s post, VMworld2017 forecast, cloudy with high chance of containers. Although Kubo was an open source development led by Pivotal to run Kubernetes clusters.

Kubernetes (and Docker) wins

This is indicative of the impact Kubernetes cluster operations is having on the container space.For now, the only holdout in container orchestration without Kubernetes is Docker with their Docker Swarm Engine.

Why add Kubernetes when Mesosphere already had a great container cluster orchestration service? It seems as the container market is maturing, more and more applications are being developed for Kubernetes clusters rather than other container orchestration software.

Although Mesosphere is the current leader in container orchestration both in containers run and revenue (according to their CEO), the move to Kubernetes clusters is likely to accelerate their market adoption/revenues and ultimately help keep them in the lead.

Marathon still lives on

It turns out that Marathon also orchestrates non-container application deployments.

Marathon can also support statefull apps like database machines with persistent storage (unlike Docker containers, stateless apps). These are closer to more typical enterprise applications. This is probably why Mesosphere has done so well up to now.
Marathon also supports both Docker and Mesos containers. Mesos containers depend on Apache Mesos, a specially developed distributed system’s kernel based on Linux for containers.

So Mesosphere will continue to fund development and support for Marathon, even while it rolls out Kubernetes. This will allow them to continue to support their customer base and move them forward into the Kubernetes age.


I see an eventual need for both stateless and statefull apps in the enterprise data center. And that might just be Mesosphere’s key value proposition – the ability to support apps of the future (containers-stateless) and apps of today (statefull) within the same DC/OS.

Picture credit(s): Enormous container ship by Ruth Hartnup

VMworld2017’s forecast, cloudy with a high chance of containers

Attended VMworld2017 this past week in Vegas and aside from all the parties there was a lot of news, mostly for public cloud users.

In talking with analysts and others at the show it seems like VMware has recently discovered that they can’t fight the cloud, so they better join them. Early this year VMware divested itself of its vCloud Air Business to OVH, which removed their owned competition to the cloud. Now, VMware’s on a different tack, figuring out how to best work with today’s public cloud providers and implementing this.

Last year VMware announced an agreement with IBM and to supply vCloud Air services on IBM’s SoftLayer public cloud. This year, VMware ramps up other public cloud offerings with VMware Cloud on AWS and PKS (Pivotal Container Services) on vSphere.

First up, VMware on the (AWS) cloud

You may recall that earlier this year VMware showed a tech preview of vSphere running in AWS. At VMworld2017 they took off the wraps on this service and made it real. At first it’s only available in AWS US WEST region but they plan to roll it out to the rest of US soon and rest of the world after that.

VMware Cloud on AWS is vSphere, vCenter, NSX, and vSAN running ontop of AWS Elastic cloud services. Essentially, any VM that you run onprem, can be run on AWS, using VMware Cloud on AWS.

The AWS EC2 machines you run VMware on are BIG – 2 CPU, 36 cores (72 hyper threads) with 512GiB of memory and a local (SSD) cache of 3.6TB/10.7TB raw capacity. VMware Cloud on AWS requires four EC2 instances to run. No information about the networking capabilities but I assume HIGH SPEED.

The cost for the service is high but you are paying for 7x24x365 AWS EC2 services. For a 3 year “reservation”, it will cost $109.4K/host. That comes out to be about $3K/month/host for 36 months. VMware claims that on a 3 year TCO basis this would be cheaper than running an equivalent configuration onprem.

You can also contract for VMware Cloud on AWS on an hourly basis. You do have to have a VMware login and VMware credits (?) to do so. It’s certainly not as simple as just having a credit card and an AWS login. But the costs for this are $8.361/hour/host. This seems awfully high but there’s no direct comparison to other EC2 machine configurations. Although there is an EC2 X1.16 with 64 vCPUs (hyper thread equivalents), 976GiB DRAM and 1-1920 (GiB) SSD that lists for $6.669/hour – close, but not a complete match.

You are running a VMware service on AWS so the billing is done through VMware. And any data you move in or out of the cloud will be billed (through VMware) at whatever AWS would charge for the data egress/import.

It seems that if you “connect” your VMware Cloud on AWS to your onprem   vSphere cluster (through stretched layer 2 NSX networking and ? other means) you can vMotion VMs from onprem to AWS and back again. There is a behind the scenes Storage vMotion that also happens to get the data to AWS so that the VMs can operate properly.

VMware vCenter offers a dashboard of sorts to tell admins whether a particular VM is a good candidate to move to AWS or not. This is based on the VM’s connections to other VMs and maybe the amount of data that would need to moved.

Next, (PKS) containers and more (GCP) cloud

VMware together with Pivotal and Google Cloud announced a tech preview of the Pivotal Container Service (PKS) on vSphere. The new service implements Pivotal Kubo, or Kubernetes container orchestration with Bosh HA infrastructure management ontop of vSphere. PKS also comes with Harbor a secure, enterprise class container registry from VMware

This would allow a development team to develop a container micro-services application, completely within a VMware environment and to run it under vSphere. This seems tailor made to cloud developers.

Kubernetes has worker and master nodes and each which would run as a VM on vSphere. Inside worker nodes, Kubernetes runs Pods which have one or more tightly connected container(s) which enclose an application and share context.

I was talking with the vSphere team and they had been spending a lot of time making vSphere native services available to PKS. This means that you can use NSX networking and vSAN, VVOLs or VMDK storage for your container (persistent) storage.

Not exactly sure where DevOps fits into PKS on vSphere but my assumption is that you could run PuppetChef or if your up to the challenge, vRA to automate application roll out.

There was specific talk of having PKS run on AWS, probably within VMware Cloud on AWS in the future.

Of course, PKS containers that run on vSphere are completely compatible with GKE (Google Container Engine) which runs on Google Cloud Platform

No information on VMware PKS pricing as of yet.

Where lies Photon and VIC (VMware Integrated Containers)

You may recall that VMware announced Photon last year which was a open source container framework and Photon OS which was an OS for Photon containers. This still exists as an open source project and is still being developed but there was nary a word about Photon this year.

VIC still exists. VIC can support running a container as a VM but is not a real container orchestration engine. Yes you could potentially run Docker Swarm as VM or a number of containers as separate VMs under VI, but this is not the same as having a fully integrated container orchestration and management service layer in vSphere. That’s where PKS fits in.


Although timelines weren’t discussed there were a number of discussions that led me to believe that VMware on AWS would be rolled out to other public cloud provider (read Azure and GCP). And how long it would take to be rolled out to other AWS regions around the world was not discussed.  VMware Cloud would really make sense to run on GCP, but Azure might be a bit of a stretch.

Similarly, PKS seems already heading for VMware Cloud on AWS and is already available in native form as GKE on GCP. But Azure already has a native Kubernetes Container Service. And there was no discussion as to whether PKS would be made available on IBM Softlayer or OVH vCloud Air.

Stay tuned more to come as VMware finds its true path to the cloud.

Research reveals ~liquid nitrogen temperature molecular magnets with 100X denser storage

Must be on a materials science binge these days. I read another article this week in on “Major leap towards data storage at the molecular level” reporting on a Nature article “Molecular magnetic hysteresis at 60K“, where researchers from University of Manchester, led by Dr David Mills and Dr Nicholas Chilton from the School of Chemistry, have come up with a new material that provides molecular level magnetics at almost liquid nitrogen temperatures.

Previously, molecular magnets only operated at from 4 to 14K (degrees Kelvin) from research done over the last 25 years or so, but this new  research shows similar effects operating at ~60K or close to liquid nitrogen temperatures. Nitrogen freezes at 63K and boils at ~77K, and I would guess, is liquid somewhere between those temperatures.

What new material

The new material, “hexa-tert-butyldysprosocenium complex—[Dy(Cpttt)2][B(C6F5)4], with Cpttt = {C5H2tBu3-1,2,4} and tBu = C(CH3)3“, dysprosocenium for short was designed (?) by the researchers at Manchester and was shown to exhibit magnetism at the molecular level at 60K.

The storage effect is hysteresis, which is a materials ability to remember the last (magnetic/electrical/?) field it was exposed to and the magnetic field is measured in oersteds.

The researchers claim the new material provides magnetic hysteresis at a sweep level of 22 oersteds. Not sure what “sweep level of 22 oersteds” means but I assume a molecule of the material is magnetized with a field strength of 22 oersteds and retains this magnetic field over time.

Reports of disk’s death, have been greatly exaggerated

While there seems to be no end in sight for the densities of flash storage these days with 3D NAND (see my 3D NAND, how high can it go post or listen to our GBoS FMS2017 wrap-up with Jim Handy podcast), the disk industry lives on.

Disk industry researchers have been investigating HAMR, ([laser] heat assisted magnetic recording, see my Disk density hits new record … post) for some time now to increase disk storage density. But to my knowledge HAMR has not come out in any generally available disk device on the market yet. HAMR was supposed to provide the next big increase in disk storage densities.

Maybe they should be looking at CAMMR, or cold assisted magnetic molecular recording (heard it here, 1st).

According to Dr Chilton using the new material at 60K in a disk device would increase capacity by 100X. Western Digital just announced a 20TB MyBook Duo disk system for desktop storage and backup. With this new material, at 100X current densities, we could have 2PB Mybook Duo storage system on your desktop.

That should keep my ever increasing video-photo-music library in fine shape and everything else backed up for a little while longer.


Photo Credit(s): Molecular magnetic hysteresis at 60K, Nature article


Materials science rescues civilization, again

Read a bunch of articles this past week from MIT Technology Review, How materials science will determine the future of human civilization, from Stanford University, New ultra thin semiconductor materials…, and Wired, This battery breakthrough could change everything.

The message varied a bit between articles but there was an underlying theme to all of them. Materials science was taking off, unlike it ever has before. Let’s take them on, one by one, last in first out.

New battery materials

I have not reported on new battery structures or materials in the past but it seems that every week or so I run across another article or two on the latest battery technology that will change everything. Yet this one just might do that.

I am no material scientist but Bill Joy has been investing in a company, Ionic Materials, for a while now (both in his job as a VC partner and as in independent invested) that has been working on a solid battery material that could be used to create rechargeable batteries.

The problems with Li(thium)-Ion batteries today are that they are a safety risk (lithium is a highly flammable liquid) and they use an awful lot of a relatively scarce mineral (lithium is mined in Chile, Argentina, Australia, China and other countries with little mined in USA). Electric cars would not be possible today with Li-On batteries.

Ionic Materials claim to have designed a solid polymer electrolyte that can combine the properties of familiar, ultra-safe alkaline batteries we use everyday and the recharge ability of  Li-Ion batteries used in phones and cars today. This would make a cheap, safe rechargeable battery that could work anywhere. The polymer just happens to also be fire retardant.

The historic problems with alkaline, essentially zinc and manganese dioxide is that they can’t be recharged too many times before they short out. But with the new polymer these batteries could essentially be recharged for as many times as Li-Ion today.

Currently, the new material doesn’t have as many recharge cycles as they want but they are working on it. Joy calls the material ional.

New semiconductor materials

Moore’s law will eventually cease. It’s only a question of time and materials.

Silicon is increasingly looking old in the tooth. As researchers shrink silicon devices down to atomic scales, they start to breakdown and stop functioning.

The advantages of silicon are that it is extremely scaleable (shrinkable) and easy to rust. Silicon rust or silicon dioxide was very important because it is used as an insulator. As an insulating layer, it could be patterned just like the silicon circuits themselves. That way everything (circuits, gates, switches and insulators) could all use the same, elemental material.

A couple of Stanford researchers, Eric Pop and Michal Mleczko, a electrical engineering professor and a post doc researcher, have discovered two new materials that may just take Moore’s law into a couple of more chip generations. They wrote about these new materials in their paper in Science Advances.

The new materials: hafnium diselenide and zirconium diselenide have many similar properties to silicon. One is that they can be easily made to scale. But devices made with the new materials still function at smaller geometries, at just three atoms thick (0.67nm) and also consume happen less power.

That’s good but they also rust better. When the new materials rust, they form a high-K insulating material. With silicon, high-K insulators required additional materials/processing and more than just simple silicon rust anymore. And the new materials also match Silicon’s band gap.

Apparently the next step with these new materials is to create electrical contacts. And I am sure as any new material, introduced to chip fabrication will take quite awhile to solver all the technical hurdles. But it’s comforting to know that Moore’s law will be around another decade or two to keep us humming away.

New multiferric materials

But just maybe the endgame in chip fabrication materials and possibly many other domains seems to be new materials coming out of ETH Zurich Switzerland.

There a researcher, Nicola Saldi,n has described a new sort of material that has both ferro-electric and ferro-magnetic properties.

Spaldin starts her paper off by discussing how civilization evolved mainly due to materials science.

Way in the past, fibers and rosin allowed humans to attach stone blades and other material to poles/arrows/axhandles to hunt  and farm better. Later, the discovery of smelting and basic metallurgy led to the casting of bronze in the bronze age and later iron, that could also be hammered, led to the iron age.  The discovery of the electron led to the vacuum tube. Pure silicon came out during World War II and led to silicon transistors and the chip fabrication technology we have today

Spaldin talks about the other major problem with silicon, it consumes lots of energy. At current trends, almost half of all worldwide energy production will be used to power silicon electronics in a couple of decades.

Spaldin’s solution to the  energy consumption problem is multiferric materials. These materials offer both ferro-electric and ferro-magnetic properties in the same materials.

Historically, materials were either ferro-electric or ferro-magnetic but never both. However, Spaldin discovered there was nothing in nature prohibiting the two from co-existing in the same material. Then she and her compatriots designed new multiferric materials that could do just that.

As I understand it, ferro-electric material allow electrons to form chemical structures which create electrical dipoles or electronic fields. Similarly, ferro-magnetic materials allow chemical structures to create magnetic dipoles or magnetic fields.

That is multiferric materials can be used to create both magnetic and electronic fields. And the surprising part was that the boundaries between multiferric magnetic fields (domains) form nano-scale, conducting channels which can be moved around using electrical fields.

Seems to me that if this were all possible and one could fabricate a substrate using multi-ferrics and write (program) any electronic circuit  you want just by creating a precise magnetic and electrical field ontop of it. And with todays disk and tape devices, precise magnetic fields are readily available for circular and linear materials. And it would seem just as easy to use multi multiferric material for persistent data storage.

Spaldin goes on to say that replacing magnetic fields in todays magnetism centric information/storage industry with electrical fields should lead to  reduced energy consumption.

Welcome to the Multiferric age.

Photo Credit(s): Battery Recycling by Heather Kennedy;

AMD Quad Core backside by Don Scansen;  and

Magnetic Field – 14 by Windell Oskay

Industrial revolutions, deep learning & NVIDIA’s 3U AI super computer @ FMS 2017

I was at Flash Memory Summit this past week and besides the fire on the exhibit floor, there was a interesting keynote by Andy Steinbach, PhD from NVIDIA on “Deep Learning: Extracting Maximum Knowledge from Big Data using Big Compute”.  The title was a bit much but his session was great.

2012 the dawn of the 4th industrial revolution

Steinbach started off describing AI, machine learning and deep learning as another industrial revolution, similar to the emergence of steam engines, mass production and automation of production. All of which have changed the world for the better.

Steinbach said that AI is been gestating for 50 years now but in 2012 there was a step change in it’s capabilities.

Prior to 2012 hand coded AI image recognition algorithms were able to achieve about a 74%  image recognition level but in 2012, a deep learning algorithm achieved almost 85%, in one year.

And since then it’s been on a linear trend of improvements such that in 2015, current deep learning algorithms are better than human image recognition. Similar step function improvements were seen in speech recognition as well around 2012.

What drove the improvement?

Machine and deep learning depend on convolutional neural networks. These are layers of connected nodes. There are typically an input layer and output layer and N number of internal layers in a network. The connection weights between nodes control the response of the network.

Todays image recognition convolutional networks can have ~10 layers, billions of parameters, take ~30 Exaflops to train, using 10M images and took days to weeks to train.

Image recognition covolutional neural networks end up modeling the human visual cortex which has neurons to recognize edges and other specialized characteristics of a visual field.

The other thing that happened was that convolutional neural nets were translated to execute on GPUs in 2011. Neural networks had been around in AI since almost the very beginning but their computational complexity made them impossible to use effectively until recently. GPUs with 1000s of cores all able to double precision floating point operations made these networks now much more feasible.

Deep learning training of a network takes place through optimization of the node connections weights. This is done via a back propagation algorithm that was invented in the 1980’s.  Back propagation typically depends on “supervised learning” which adjust the weights of the connections between nodes to come closer to the correct answer, like recognizing Sarah in an image.

Deep learning today

Steinbach showed multiple examples of deep learning algorithms such as:

  • Mortgage prepayment predictor system which takes information about a mortgagee, location, and other data and predicts whether they will pre-pay their mortgage.
  • Car automation image recognition system which recognizes people, cars, lanes, road surfaces, obstacles and just about anything else in front of a car traveling a road.
  • X-ray diagnostic system that can diagnose diseases present in people from the X-ray images.

As far as I know all these algorithms use supervised learning and back propagation to train a convolutional network.

Steinbach did show an example of “un-supervised learning” which essentially was fed a bunch of images and did clustering analysis on them.  Not sure what the back propagation tried to optimize but the system was used to cluster the images in the set. It was able to identify one cluster of just military aircraft images out of the data.

The other advantage of convolutional neural networks is that they can be reused. E.g. the X-ray diagnostic system above used an image recognition neural net as a starting point and then ran it against a supervised set of X-rays with doctor provided diagnoses.

Another advantage of deep learning is that it can handle any number of dimensions. Mathematical optimization algorithms can handle a relatively few dimensions but deep learning can handle any number of dimensions.  The number of input dimensions, the number of nodes in each layer and number of layers in your network are only limited by computational power.

NVIDIA’s DGX a deep learning super computer

At the end of Stienbach’s talk he mentioned the DGX appliance designed by NVIDIA for AI research.

The appliance has 8 state of the art NVIDIA GPUs, connected over a high speed NVLink with anywhere from ~29K to ~41K cores depending on GPU selected, and is capable of 170 to 960 Flops (FP16).

Steinbach said this single 3u appliance would have been rated the number one supercomputer in 2004 beating out a building full of servers. If you were to connect 13 (I think) DGX’s together, you would qualify to be on the top 500 super computers in the world.



Photo credit(s): Steinbach’s “Deep Learning: Extracting Maximum Knowledge from Big Data using Big Compute” presentation at FMS 2017.

Old world AI, Checkers, and The Champion

Read an article in The Atlantic this week (How checkers was solved) on Jonathan Schaeffer, the man who solved checkers, and his quest to beat Marion Tinsley, The Champion.

But first some personal history, while I was at university (back in the early 70’s) and first learned how to code in real (Fortran, 360/Assembler, IBM PL/I, Cobol) languages, one independent project I worked on was a checkers playing program. It made use of advanced alpha-beta search optimizations, board analysis routines and move trees.

These were the days of punched card decks and JCL, submitting programs to run as a batch job and getting results hours to days later. For one semester, I won the honor of consuming the most CPU time of any person in the school. I still have the card deck someplace but it may be hard to find a card reader, let alone a PL/I compiler/DOS system to run it.

In any case, better men than I have taken up the checkers challenge over time. And Schaeffer had made it his life’s work to conquer checkers and did it with his program, Chinook.

In my day checkers was a young kid and old person game. It was simple enough to learn but devilishly hard to master. My program got to look about 3.5 moves ahead, Schaeffer’s later program, used during an early match, was looking 16 moves ahead and was improved from there.

Besting The Champion

From the 50s through the early 90s there was one man who was the undisputed Champion of Checkers and that was Tinsley. Although he lost a few games during his time to other men, he never lost a match.

The article talks about how Schaeffer improved Chinook over time and at one time it had beaten Tinsley in two games but still lost the match. With a later version, it beat Tinsley a couple of times and then Tinsley fell ill and had to leave the game, later dying and forfeiting the match.

But even after Tinsley’s death, Schaeffer kept on improving Chinook.

Early on Schaeffer had a checkers endgame database and an opening database that were computed by Chinook as optimal move sequences from valid openings (professional checkers has a set of 3 move openings that players select at random and the game takes off from there) and endgames (positions with limited number’s of pieces to the end of the game).

These opening and endgame databases were stored for later retrieval during a game. This way if a game fell into a set opening or endgame the program could just follow the optimal play that was already computed.

Solving checkers

As computing power increased, Chinook’s end game database started earlier in the game with more pieces on the board and his opening database started working towards later into the game, following opening moves farther into the mid game.

When Schaeffer’s program solved checkers, essentially his opening database and his endgame database met in the middle of the game. And at that point he had the solution to every checkers position/game that could ever be.

AI vs. humans today

AI has changed to a different way of operating over time. When I was coding my checkers program, it was search trees/optimizations and board analysis. In fact, in 1996 IBM Deep Blue used variants of these techniques to beat Garry Kasparov, then World Chess Champion.

Today’s machine learning is less about search algorithms, game analyses, and game (or logic) databases and more about neural nets, machine learning and reinforcement learning.

New AI finally conquered Go only a couple of years ago, a game that’s very much more complex than checkers or chess. But in 2017 Google (Deepmind) AlphaGo didn’t use search trees and board analyses, it used neural nets, machine learning and reinforcement learning to beat Ke Jie, the then World #1 ranked Go Master.

Welcome to the new world of AI.

Photo Credit(s):