Better autonomous drone flying with Neural-Fly

Read an article the other day on Neural-Fly (see: Rapid adaptation of deep learning teaches drones to survive any weather) based on research out of CalTech documented in a paper in Science Robotics (see: Neural-Fly enables rapid learning for agile flight in strong winds, behind paywall).

Essentially, they trained two neural networks (NNs) at the same time and computed an adaptation coefficient matrix (a set of linear multipliers that compensate for wind speed). The first NN is trained to learn the wind-invariant flight characteristics of a drone in wind and the second is trained to predict the class of wind the drone is flying in (or wind index). These two, plus the adaptation matrix coefficients, are used to predict the resultant force on the drone flying in wind.
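
To make the structure concrete, here's a minimal PyTorch sketch of the pieces just described. It is not the CalTech team's code: the state encoding, layer widths and the exact wiring between the two NNs are assumptions; only the overall shape (a wind-invariant feature NN, a wind-class NN, and a small linear adaptation matrix mapping features to a predicted residual force) follows the description.

```python
# Hedged sketch only -- layer sizes, state encoding and wiring are assumptions.
import torch
import torch.nn as nn

class WindInvariantNet(nn.Module):          # first NN: wind-invariant features
    def __init__(self, state_dim=11, feat_dim=3):
        super().__init__()
        self.net = nn.Sequential(           # 5 small layers, per the description below
            nn.Linear(state_dim, 50), nn.ReLU(),
            nn.Linear(50, 60), nn.ReLU(),
            nn.Linear(60, 50), nn.ReLU(),
            nn.Linear(50, 25), nn.ReLU(),
            nn.Linear(25, feat_dim))
    def forward(self, x):
        return self.net(x)

class WindClassifier(nn.Module):            # second NN: predicts the wind class/index
    def __init__(self, feat_dim=3, n_wind_classes=6):
        super().__init__()
        self.net = nn.Sequential(           # 3 even smaller layers
            nn.Linear(feat_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, n_wind_classes))
    def forward(self, phi_x):
        return self.net(phi_x)

phi = WindInvariantNet()
a = torch.zeros(3, 3)                       # adaptation coefficients, updated in flight
state = torch.randn(1, 11)                  # drone state (attitude, velocity, motors, ...)
f_hat = (a @ phi(state).T).T                # predicted residual aerodynamic force (x, y, z)
```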

In a CalTech article on the research (see: Rapid Adaptation of Deep Learning Teaches Drones to Survive Any Weather) at the bottom is a YouTube video that shows how well the drone can fly in various wind conditions (up to 27mph).

The data to train the two NNs and compute the adaptation matrix coefficients come from CalTech wind tunnel results with their custom built drone (essentially an RPi4 added to a pretty standard drone) doing random trajectories under different static wind conditions.

The two NNs and the adaptation control matrix functionality run on a Raspberry Pi 4 (RPi4) that's added to a drone they custom built as the test vehicle. The two NNs and the adaptation control tracking are used in the P-I-D (proportional-integral-derivative) controller for drone path prediction. The Neural-Fly NNs plus the adaptation functionality effectively replace the residual force prediction portion of the Integral section of the P-I-D controller.

The wind-invariant neural net has 5 layers with relatively few parameters per layer. The wind class prediction neural network has 3 layers and even fewer parameters. Together, these two NNs plus the adaptation coefficients provide real-time resultant force predictions for the drone, which can be fed into the drone controller to control its flight. All of this is accomplished, in real time, on an RPi4.

The adaptation factor matrix is also learned during training of the two NNs, and this is what's used in the NF-Constant results below. But the key is that the linear factors (the adaptation matrix) are updated (periodically) during actual drone flight by sampling the measured and predicted forces on the drone. The adaptation matrix coefficients are updated using a least-squares estimation fit.
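
A hedged sketch of what that least-squares fit might look like, assuming NumPy and a batch of recent (feature, measured-force) samples; the in-flight implementation presumably updates the coefficients recursively rather than re-solving from scratch, but the idea is the same.

```python
# Illustrative batch least-squares update for the adaptation matrix "a".
import numpy as np

def update_adaptation_matrix(phi_samples, force_samples):
    """phi_samples:   (N, k) wind-invariant features from the first NN
       force_samples: (N, 3) measured residual aerodynamic forces
       returns a:     (3, k) coefficients minimizing ||f - phi @ a.T||^2"""
    a_T, *_ = np.linalg.lstsq(phi_samples, force_samples, rcond=None)
    return a_T.T

phi_batch = np.random.randn(50, 3)     # e.g., the last 50 feature samples
f_batch = np.random.randn(50, 3)       # matching measured residual forces
a = update_adaptation_matrix(phi_batch, f_batch)
print(a.shape)                         # (3, 3)
```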

In the report's supplementary information, the team compared a couple of state-of-the-art adaptation approaches to the problem of drone flight in wind. In the above chart, the upper section shows the x-axis wind effect and the lower portion the z-axis wind effect; f (grey) is the actual force in that direction and f-hat (red) is the predicted force. The first column represents results from a normal integral controller. The next two columns are state-of-the-art adaptive control approaches (INDI and L1, see paper references) to force prediction. If you look closely at these two columns and compare the force prediction (in red) with the actual force (in grey), the prediction always lags behind the actual force.

The next three columns show Neural-Fly-Constant (Neural-Fly with a constant adaptive control matrix, not updated during flight), Neural-Fly-Transfer (using the NNs trained on one drone and adapting them to another drone in flight) and Neural-Fly. Neural-Fly-Constant also shows a lag between the predicted force and the actual force. But Neural-Fly-Transfer and Neural-Fly reduce this lag considerably.

The measurement for drone flight accuracy is tracking position error, that is, the difference between the drone's desired position and its actual position over a number of trajectories. As shown in the chart, tracking error decreased from 5.6cm to ~4cm at a wind speed of 4.2m/s (15.1km/h or 9.3mph). Tracking error increases for wind speeds that were not used in training and for NF-Transfer, but at all wind speeds the tracking error is better with Neural-Fly than with any other approach.

Pretty impressive results from just using an RPi4.

[The Eds. would like to thank the CalTech team and especially Mike O’Connell for kindly answering our many questions on Neural-Fly.]

Picture Credit(s):

Go big or go home for robust DNNs

Read a recent article, Computer Scientists Prove Why Bigger NNs Do Better, discussing research that proved a Universal Law of Robustness via Isoperimetry. This speaks to the perturbability of AI deep learning neural networks (DNNs) and how to reduce it, but it also applies to many other solutions to diverse multi-dimensional data problems.

Mathematical Robustness

For AI/ML DNNs, we often witness supposedly well-trained DNN models that do very well classifying examples similar to their training data but fail miserably on data that's outside their training data.

Mathematicians call this attribute robustness and measure it for a mapping function using a Lipschitz constant. One can consider this a measure of how much the mapping from one set to another can vary; in the case of DNNs, a lack of robustness means classifications fail on relatively minor changes to the input data.
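
For reference, the Lipschitz condition can be written as:

```latex
% Lipschitz constant L of a mapping f: the smallest L such that
\| f(x) - f(y) \| \;\le\; L \, \| x - y \| \qquad \text{for all inputs } x, y.
% A small L means small input perturbations can only cause small output changes --
% the formal sense in which a robust classifier is "smooth".
```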

Most serious AI researchers have empirically discovered that bigger DNNs work better and are more robust than smaller networks. There’s been somewhat of a conundrum as to why DNNs need to get bigger to properly generalize.

Universal Law of Robustness

What the researchers have proved is that in order to achieve some arbitrary level of robustness for a mapping function like a DNN, one needs many more parameters than the number of training data elements would suggest.

For example, with the MNIST handwritten digit classification problem, models with 10**5 to 10**6 parameters are required to achieve 90% and 95% accuracy, respectively. But the MNIST training data is 60K examples (~10**4). Why should an MNIST DNN classification model need more than 10**4 parameters to achieve 100% accuracy?

Author’s MNIST model with 688K parameters

From what we all learned in high school maths, to solve a function with N variables one needs N equations. This would lead one to believe that MNIST DNNs (essentially solving classification equations) should only need about 60K (~10**4) parameters. But real DNNs that solve MNIST need more than that.

Looking at it in 2D: if one has a point A at (x,y) that maps to another point B at (x,y), one should only need to know one of the points and the slope of the line that connects them, i.e., two parameters: point A (or B) and the line's slope.

Now with MNIST data that maps handwritten digits to one of 10 digits, we have essentially 10 possibilities being mapped from 60K samples. At best, we should need to know the 60K initial points in this image data space and their slope to the 10 digits they represent. Again, something that approaches 60K pairs of parameters: one for the image point and one for the slope. But why doesn't an MNIST model with 60K parameters achieve 100% accuracy?

I won't claim to understand the math, but what the researchers seem to be saying is that in order to have a relatively smooth mapping from the image space to the digit space, one has to have 10**4 parameters times the dimensionality of the data. In this case, for MNIST, the dimensionality of the data is related to the image size: 28x28 pixels with 0..255 grey-scale values. The image space alone would be on the order of 10**5. So, multiplying this by the size of the training data, the researchers estimate that the number of parameters should be ~10**9 to be 100% accurate.
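
Our paraphrase of the law (see the paper for the precise statement):

```latex
% Universal Law of Robustness, paraphrased: to interpolate n training points of
% effective dimension d while keeping the mapping smooth (O(1) Lipschitz), a model
% needs roughly
p \;\gtrsim\; n \cdot d \quad \text{parameters;}
% with fewer parameters, any interpolating model has a Lipschitz constant of at
% least about \sqrt{n d / p}. For MNIST, n \sim 10^4 (the 60K examples above) and
% d \sim 10^5 gives p \sim 10^9, the estimate above.
```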

Although, the researchers say that the data dimensionality of the MNIST images is probably not 10**5 (how they concluded this is not evident). As such, they believe one shouldn't need 10**9 parameters to reach 100% proper classifications. They say it's probably 1 or 2 orders of magnitude less, because not all of the image data space is populated. So if we use 10**3 as an estimate of the effective data dimensionality, they would estimate that one would need a 10**7-parameter DNN to reach 100% accuracy on MNIST data.

The author's MNIST model achieved 99.2% accuracy after training for 15 epochs with a batch size of 5. Although 688K parameters is not quite 10**6 parameters, it's close. It's unclear why one would need another factor of 10, but getting that extra 0.8% accuracy (to 100%) can be very difficult to achieve for any DNN model.
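
The author's exact 688K-parameter model isn't shown, but a hypothetical dense PyTorch classifier in the same ballpark (~680K parameters) might look like this:

```python
# Hedged illustration only -- a made-up dense MNIST classifier of similar size,
# not the author's actual model.
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),                      # 28x28 image -> 784 inputs
    nn.Linear(784, 800), nn.ReLU(),    # 784*800 + 800 = 628,000 params
    nn.Linear(800, 64), nn.ReLU(),     # 800*64  + 64  =  51,264 params
    nn.Linear(64, 10),                 # 64*10   + 10  =     650 params
)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)                        # 679,914 -- close to, but still short of, 10**6
```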

Another example, OpenAI’s GPT-3 NLP model

And OpenAI's GPT-3 NLP model has 175B parameters. Their previous version, GPT-2, only had 1.5B parameters, and they say that GPT-4 will have over 100T parameters. The chart above shows accuracy stats for 3 versions of the GPT-3 model, one with 175B, one with 13B and another with 1.3B parameters.

According to OpenAI’s GPT-3 description, it can complete “almost any english language task” (text in ==> text out). This includes writing articles from a few prompts and text summarization.

GPT-3 was trained on almost 500B tokens (from web crawls to Wikipedia dumps). Each token probably represents an English word or word fragment. According to the universal law, 175B parameters would not be sufficient. That's probably why GPT-3 in the above chart didn't reach 70% accuracy.

It would probably need at least another 3 orders of magnitude to get there, or 175T parameters. Maybe with GPT-4, I can have it start writing my blog posts.

I don’t know about you, but I’m going to need more GPUs for my (home) AI lab.

Photo Credit(s)

Deepmind does code – part 1: the data

1st, let me express my and my fellow coders/programmers disappointment that Deepmind would take on coding. There are many other white collar work domains that need to be conquered before coding.

2nd, let me apologize for the lack of blog posts lately, all I can say is, business is picking up.

Saw an article over the last couple of weeks on Deepmind creating AlphaCode an artificial intelligence coding application which they used to enter coding contests and achieved an average 1238 rating or better than 54% of code contest participants.

I can't recall where I first saw the news, but Deepmind has a pretty decent blog post on AlphaCode and they have published a pre-print of their research paper on AlphaCode as well. I plan on discussing AlphaCode in detail over a couple of posts. This will be the first installment, on where they got the data to train their models.

AlphaCode is a transformer-based language model (see: Wikipedia: Transformer (machine learning model) article) that translates a code competition problem statement into code, i.e., a program that, when executed, solves the problem statement. In order to train AlphaCode, Deepmind first needed to obtain lots of source code.

It’s all about the (training) data

The first step in Deep Learning model generation is gathering data to train the model. Now where would Google’s Deepmind go to gather coding data – well GitHub, a public repository of all things software, of course.

They used GitHub data to pre-train their model(s), but also scraped code, problem statements and test cases from published code contests to fine-tune their model.

Deepmind has released their fine-tuning CodeContests training data for AlphaCode on GitHub, so as to support other organizations in creating AI models for coding.

GitHub source to the (pre-training) rescue

There are a couple of problems with using GitHub source code for training:

  • GitHub code is in any source code language the author feels most appropriate to use.
  • GitHub code is not guaranteed to work correctly.
  • GitHub code is not guaranteed to be completed code.
  • GitHub code represents a wide range of coding skill.
  • GitHub code doesn’t always come with a problem statement.

But the use of GitHub in their pre-training data set is intended to give their transformer-based language model some capability to understand (learn) what coding is all about, what a proper syntax would be, what a proper coding sequence would be, etc.

The AlphaCode team took a snapshot of selected GitHub source repos. This meant they only scraped repos that contained C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript languages. They also dropped from the pre-training data any source files larger than 1MB or with any lines longer than 1000 characters. This was done to avoid using machine-generated code. They also stripped all the whitespace out of the selected source files and compared them to eliminate duplicated code.
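
A rough sketch of that kind of filtering and dedup (not DeepMind's actual pipeline; the extension list, thresholds and tiny example corpus here are just illustrations of the rules described above):

```python
import hashlib

ALLOWED_EXTS = (".cpp", ".cc", ".cs", ".go", ".java", ".js", ".lua", ".php",
                ".py", ".rb", ".rs", ".scala", ".ts")

def keep_file(path, text):
    """Apply the language, size and line-length filters described above."""
    if not path.endswith(ALLOWED_EXTS):
        return False
    if len(text.encode("utf-8")) > 1_000_000:                # drop files over 1MB
        return False
    if any(len(line) > 1000 for line in text.splitlines()):  # drop very long lines
        return False                                         # (likely machine generated)
    return True

def dedup_key(text):
    """Hash the whitespace-stripped source so near-identical copies collapse."""
    return hashlib.sha256("".join(text.split()).encode("utf-8")).hexdigest()

corpus = [("a.py", "print(1)\n"), ("b.py", "print( 1 )\n"), ("c.txt", "not code")]
seen, dataset = set(), []
for path, text in corpus:
    key = dedup_key(text)
    if keep_file(path, text) and key not in seen:
        seen.add(key)
        dataset.append((path, text))
print([p for p, _ in dataset])        # ['a.py'] -- b.py deduped, c.txt filtered out
```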

Their final pre-training dataset was 715GB of data over 86 million source files.

Although, unstated, we would guess that the AlphaCode team used the GitHub repo’s README.md file as a surrogate for the solution description. Unclear what else could have been used unless they generated it automatically from extracting semantic content or generating a summarization of the README.md files.

Excerpt from Deepmind's competitive code contest source code & problem statements README.md file

The (pre-)training data can be used to train a transformer-based language model. Such models are used today to provide language translation. In AlphaCode's case, they wanted to create a code transformer-based model that translates a specification of a coding problem into source code that solves that problem.

For language translation models, training uses pairs of text files in different languages that represent the same information (say, the same law) and, notably, are human-generated translations.

One challenge with using internet-scraped data for training is that it can easily contain actual solutions, verbatim, for the problems the model is trying to solve. To avoid simply copying these solutions, they decided to split their data into training, validation and test sets on a time basis. This way the training data used source code/problem statements only from a period of time prior to the validation set. Ditto for the training+validation data relative to the test data.
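
In code, a temporal split is just a cutoff-date filter; a minimal sketch (field names and cutoff dates here are made up, not DeepMind's):

```python
from datetime import date

def temporal_split(problems, valid_cutoff, test_cutoff):
    """problems: dicts with a 'release_date' key (datetime.date)."""
    train = [p for p in problems if p["release_date"] < valid_cutoff]
    valid = [p for p in problems if valid_cutoff <= p["release_date"] < test_cutoff]
    test  = [p for p in problems if p["release_date"] >= test_cutoff]
    return train, valid, test

problems = [{"name": "warmup", "release_date": date(2020, 5, 1)},
            {"name": "mid",    "release_date": date(2021, 7, 15)},
            {"name": "recent", "release_date": date(2021, 10, 1)}]
train, valid, test = temporal_split(problems,
                                    valid_cutoff=date(2021, 7, 1),
                                    test_cutoff=date(2021, 9, 1))
```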

To show that this approach (using a time point to split the data) mattered, they trained a 1B-parameter AlphaCode transformer on two different training-validation datasets: one where the validation data was selected at random (the normal approach), the "random" split, and the other where the validation data only occurred some time after the training data, the "temporal" split. The 1B AlphaCode transformer was able to properly code 0.8% of the problems using a 13K sample of the 86M source files/problem statements on the random split, but 0% on the temporal split.

So much for pre-training, let’s discuss fine tuning

AlphaCode was going to get nowhere with a 0% solve rate (ok this was based on a 13K sample and only a 1B parameter model) but they realized that Git code was only going to get them so far. (ok conjecture on my part)

So fine-tuning beyond the (GitHub-derived) pre-training data was needed, and the AlphaCode team turned to code competition source code/problem statement data.

Most code contests publish source code submissions as well as the problem statements and sample test cases. By scraping these, Deepmind was able to obtain a very well annotated dataset they could use to fine-tune their AlphaCode transformer model.

They again used a temporal split for training/validation/test data. But they were also able to add metadata to their data that indicated whether the code solved the problem statement.

Code competitions also publish tests for the problem statement. Having the tests, a human can use them to validate whether their code at least works. Code contests also have a set of more sophisticated, hidden tests that they use internally to validate code submissions.

This test data will become important later on in the model's operation, which will be discussed in a future post, but suffice it to say that AlphaCode uses the public tests (and mutations of these) to validate AlphaCode-generated source code before submitting it.

This fine-tuning dataset is available in the GitHub repo (linked to above) that Deepmind has created/curated for others to work with.

Another nicety of this fine-tuning data is they have proper, human created, problem statements to work from rather than README.md surrogates.

In part-2 we plan to describe the transformer-based model that was created for AlphaCode and at some point, discuss how they used testing in their code submissions.

Once again, all my information comes from Deepmind’s pre-print on their AlphaCode project (linked to above).

Any comments, please don’t hesitate to let me know.

Photo Credits:

AI navigation goes with the flow

Read an article the other day (Engineers Teach AI to Navigate Ocean with Minimal Energy) about a simulated robot that was trained to navigate 2D turbulent water flow to travel between locations. They used a combination of reinforcement learning and a DNN-derived policy. The article was reporting on a Nature Communications open access paper (Learning efficient navigation in vortical flow fields).

The team was attempting to create an autonomous probe that could navigate the ocean and other large bodies of water to gather information. I believe ultimately the intent was to provide the navigational smarts for a submersible that could navigate terrestrial and non-terrestrial oceans.

One of the biggest challenges for probes like this is to be able to navigate turbulent flow without needing a lot of propulsive power or a lot of computational power. They said that any probe that can propel itself faster than the current can easily travel wherever it wants; the real problem is to get somewhere with lower-powered submersibles. As a result, they set their probe to swim at a constant speed of 80% of the overall simulated water flow.

Even that would be relatively feasible if you had unlimited computational power to train and inference with, but trying to do this on something that could fit in a small submersible was a significant challenge. NLP models today have millions of parameters and take hours to train with multiple GPU/CPU cores in operation and lots of memory. Inferencing using these NLP models also takes a lot of processing power.

The researchers targeted something with significantly less computational power and wished to train and perform real-time inferencing on the same hardware. They chose a Teensy 4.0 microcontroller board for their computational engine, which costs under $20, has ~2MB of flash memory and fits in a space smaller than 1.5″x1.0″ (38.1mm x 25.4mm).

The simulation setup

The team started their probe's turbulent-flow training with a cylinder in a constant flow that generated downstream vortices spinning in opposite directions. These vortices would travel from left to right in the simulated flow field. In order for the navigation logic to traverse this vortical flow, they randomly selected start and end locations on different sides.

The AI model they trained and used for inferencing was a combination of reinforcement learning (with an interesting multi-factor reward signal) and a policy using a trained deep neural network. They called this approach Deep RL.

For reinforcement learning, they used a reward signal that was a function of three variables: the time it took, the change in distance to the target, and a success bonus if the probe reached the target. The time variable was a penalty and was the duration of the swim activity. Distance to target was how much the Euclidean distance between the current probe location and the target location had changed over time. The bonus was only applied when the probe was in close proximity to the target location. The researchers indicated the reward signal could be adjusted to optimize for other values such as energy to complete the trip, surface area traversed, wear and tear on propellers, etc.
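
A hedged sketch of such a three-part reward; the weights, time-step handling and "close proximity" radius below are assumptions, not the paper's values.

```python
import math

def reward(prev_pos, cur_pos, target, dt, arrival_radius=0.1,
           time_weight=1.0, dist_weight=10.0, bonus=200.0):
    prev_d = math.dist(prev_pos, target)
    cur_d = math.dist(cur_pos, target)
    r = -time_weight * dt                 # penalty: every second of swimming costs
    r += dist_weight * (prev_d - cur_d)   # reward progress toward the target
    if cur_d < arrival_radius:            # success bonus near the target
        r += bonus
    return r

r = reward(prev_pos=(0.0, 0.0), cur_pos=(0.1, 0.0), target=(1.0, 0.0), dt=0.1)
```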

For the reinforcement learning state information, they supplied the probe-to-target relative location [Difference(Probe x,y, Target x,y)] and whatever sensor data was being tested (e.g., for the velocity-sensor-equipped probe, the local velocity of the water at the probe's location).

They trained the DNN policy using the state information (probe start and end location, local velocity/vorticity sensor data) to predict the swim angle used to navigate to the target. The DNN policy used 2 internal layers with 64 nodes each.
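
A hedged PyTorch sketch of such a policy network: two hidden layers of 64 nodes, taking the relative target position plus a local sensor reading and producing a swim angle. The exact input encoding and output parameterization are assumptions.

```python
import torch
import torch.nn as nn

class SwimPolicy(nn.Module):
    def __init__(self, n_inputs=4):       # e.g., (dx, dy) to target + (u, v) local velocity
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1))              # swim heading angle
    def forward(self, state):
        return torch.pi * torch.tanh(self.net(state))   # bound the angle to (-pi, pi)

policy = SwimPolicy()
angle = policy(torch.tensor([[0.5, -0.2, 0.1, 0.0]]))
```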

They benchmarked the Deep RL solution with local velocity sensing against a number of different approaches: a naive approach that always swam in the direction of the target, a flow-blind approach that had no sensors but used feedback from its location changes to train with, a vorticity-sensor approach which sensed the vorticity of the local water flow, and a complete-knowledge approach (not shown above) that had information on the actual flow at every location in the 2D simulation.

It turned out that of the first four (naive, flow-blind, vorticity sensor and velocity sensor) the velocity sensor configured robot had the highest success rate (“near 100%”).

That simulated probe was then measured against the complete flow knowledge version. The complete knowledge version had faster trip speeds, but only 18-39% faster (on the examples shown in the paper). However, the knowledge required to implement this algorithm would not be feasible in a real ocean probe.

More to be done

They tried the probe's Deep RL navigation algorithm on a different simulated flow configuration, a double gyre flow field (sort of like 2 circular flows side by side but going in opposite directions).

The Deep RL navigation algorithm previously trained on the cylinder vortical flow only had a ~4% success rate on the double gyre flow. However, after retraining the Deep RL navigation algorithm on the double gyre flow, it was able to achieve an 87% success rate.

So with sufficient re-training it appears that the simulated probe’s navigation Deep RL could handle different types of 2D water flow.

The next question is how well their Deep RL can handle real 3D water flows, such as tidal flows, up-down swells, long-term currents, surface wind-wave effects, etc. It's probable that any navigation for real-world flows would need a multitude of Deep RL-trained algorithms to handle each and every flow encountered in real oceans.

However, the fact that training and inferencing could be done on the same small hardware indicates that the Deep RL could possibly be deployed in any flow, let it train on the local flow conditions until success is reached and then let it loose, until it starts failing again. Training each time would take a lot of propulsive power but may be suitable for some probes.

The researchers have 3D printed a submersible with a Teensy microcontroller and an Arduino controller board, with propellers surrounding it so it can swim in any 3D direction. They have also constructed a water tank for use in real-life testing of their Deep RL navigation algorithms.

Picture credit(s):

The problem with smarter robots

Read an article the other week about how Deepmind (at Google) is approaching the training of robotics using simulation, reinforcement learning, elastic weights, knowledge distillation and progressive learning.

It seems relatively easy to train a robot to handle some task like grabbing or walking. But doing so can take an awfully long time. Say you want to train a robot to grab something and put it someplace: you can have it start out making random movements of its arm, wrist and fingers (if it has such things) and then use reinforcement learning to help it improve its movements over time.

But if each grab attempt takes 10 seconds, using reinforcement learning may take 10,000 attempts before it starts to make any significant progress and perhaps another 20,000-50,000 more to get expert at it. Let's see: 60K x 10 seconds is 10,000 minutes, or ~170 hours. And that's just one object pick-and-place. But then maybe you would like to grab different parts and maybe place them in different locations. All these combinations start adding up.

And of course doing 1000s of movements will wear out gears, motors, mechanisms, etc. If only this could all be done in simulation. Then, assuming the simulations are accurate enough, the whole thing could be done in a matter of hours without wearing anything out. Enter robot simulators such as NVIDIA Isaac Sim and OpenAI RoboSchool/PyBullet.

But the problems with simulation are …

Simulations are getting more accurate, but at some point their accuracy defeats their purpose, because the real world is always noisy, windy and not as deterministic as any simulation. One researcher said you could conceivably train a two-armed robot to throw all of a cell phone's components up into the air so that they all land in their proper places and proper orientations. But in the real world this could never actually happen, or if it did, it would only happen once.

Hurricane Ike – 2008/09/12 – 21:26 UTC by CoreBurn (cc) (from Flickr)

Weather researchers have been dealing with this problem in spades for a long time. There appears to be a fundamental limit to how far in advance we can predict weather and it’s due to the accuracy with which sensors operate and the complexity of feedback loops between the atmosphere, oceans, landforms, etc. So at a fundamental level, simulations can never be completely accurate. But they can be better.

Today’s weather simulations we see on TV/radio use models that average a number of distinct simulations, where sensor information has been slightly and randomly modified. Something similar could be done for robotic simulation environments, to make them more realistic.

But there are other problems with training robots to do lots of tasks.

Forget me not…

AI deep learning and reinforcement learning algorithms are great when charged with learning a single task, but having them learn multiple tasks is much harder to do, because each task requires its own deep neural network (DNN), and if you train a DNN on one task and then try to train it on another task, it forgets all the learnings from the original task. Researchers call this catastrophic forgetting.

One way researchers have dealt with this problem is to effectively freeze certain DNN nodes from having their weights changed during subsequent training rounds and leave others flexible or changeable. One can see this when one trains an image recognition DNN to classify different objects by importing a well-trained object classifier, freezing all of its layers except the top one or two, and then training those layers to classify new objects.
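
In PyTorch, that freeze-all-but-the-top-layers approach looks roughly like this (the choice of ResNet-18 and 5 new classes is purely illustrative):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")     # a well-trained object classifier
for param in model.parameters():                     # freeze every pre-trained weight
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 5)        # new trainable head for 5 new classes
# Only the new head's parameters get gradient updates during training:
trainable = [p for p in model.parameters() if p.requires_grad]
```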

This works well, but you have effectively made the DNN forget the original object classification training and replaced it with a new one. One refinement is to have multiple passes of training; after each one, certain nodes and connections (those important to that particular task) are selectively frozen. This works well for a limited number of tasks, but over time all nodes become frozen, which means no more learning can take place. Researchers call this approach to the catastrophic forgetting problem elastic weights.

One way to get around the all-nodes-frozen issue in elastic weights is to have multiple NNs: one which is trained on a specific task and whose weights are then frozen, and a second DNN that exists alongside it with its own freshly initialized set of weights, but which uses the original DNN's outputs as part of its inputs. This effectively incorporates all the previously learned knowledge into the new, combined DNN. This approach is called Progressive Neural Networks.

In this fashion one progressive DNN can be sequentially trained on any number of tasks each of which ends up providing input to all subsequent task training activity. Such a progressive network never forgets and can use previously learned knowledge on new tasks.
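
A hedged, two-column sketch of that idea: column 1 is trained on task 1 and frozen; column 2 has its own weights but also consumes column 1's hidden activations via a lateral connection. Layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    def __init__(self, in_dim, hid, out_dim, lateral_dim=0):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hid)
        self.l2 = nn.Linear(hid + lateral_dim, out_dim)   # room for lateral input
    def forward(self, x, lateral=None):
        h = torch.relu(self.l1(x))                        # this column's hidden features
        z = h if lateral is None else torch.cat([h, lateral], dim=-1)
        return self.l2(z), h

col1 = Column(in_dim=10, hid=32, out_dim=4)               # trained on task 1, then frozen
for p in col1.parameters():
    p.requires_grad = False

col2 = Column(in_dim=10, hid=32, out_dim=6, lateral_dim=32)   # new task, new weights

x = torch.randn(8, 10)
_, h1 = col1(x)                 # frozen column still provides its learned features
out2, _ = col2(x, lateral=h1)   # column 2 uses its own features plus column 1's
```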

The problem with progressive DNNs is a proliferation of DNN columns, one for each trained task. However, there are a couple of approaches to shrinking the kind of DNN ensemble that progressive training creates into one that is simpler and just as effective. One way is to perturb the weights in DNN nodes and see how model prediction accuracy is impacted across all its tasks. If accuracy isn't impacted much, then that node and all its connections can be deleted from the model with minimal impact on model accuracy.

Another approach is to use one DNN to train another, sort of like a teacher and student. This is called Knowledge Distillation: one DNN is a large network (the teacher) and a smaller (student) network is trained to mimic the teacher DNN and achieve similar accuracy. This is done by training the smaller student network to match the predictions/classifications of the larger one.
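
A minimal sketch of a typical distillation loss (the temperature and loss weighting here are common choices, not values from the cited work): the student matches the teacher's softened output distribution while still learning from the true labels.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: student mimics the teacher's softened probabilities
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: student still learns from the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```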

Google researchers have shown that knowledge distillation works best when the gap in the sizes of the two networks (teacher and student) isn't that large. They address this by introducing an intermediate network (called a teacher's assistant, or TA). They train the TA first and then use the TA to train the student.

In the above graphic, when using a teacher of size 110 and a student of size 8, the resulting accuracy suffers; but if one uses an intermediate DNN of size 20, the resultant accuracy of the student is much closer to the teacher's.

~~~~

So with realistic simulation we can train a robot to do any specific task, all using only compute resources. And using progressive DNN training, a robot could conceivably be trained to do any number of tasks. And with appropriate knowledge distillation one can reduce the DNN from progressive training into something much smaller (<10%) than the original DNN.

Want a personal robot that can clean up around your place, do the wash, cook your food and do anything else needed. You know what to do.

Photonic [AI] computing seeing the light of day – part 2

Read an interesting article in Analytics India Magazine (MIT Researchers Make New Chips That Work On Light) about a startup out of MIT focused on using photonics for AI/ML/DL activities. Not exactly neuromorphic chips, but using analog photonic interactions to perform the computationally intensive operations required by today's deep neural net training.

We’ve written about photonics computing before ( see Photonic computing seeing the light of day [-part 1]). That post was about spin outs from Princeton and MIT back in 2019. We showed a bit more on how photonics can perform multiplication and other computations with less power.

The article (noted above) talked about LightIntelligence, an MIT spinout/ startup that’s been around since ~2017, but there’s another company in the same space, also out of MIT called LightMatter that just announced early access to their hardware system.

The CEOs of both companies collaborated on a paper (#1&2 authors of the 10 author paper) written back in 2017 on Deep Learning with Coherent Nanophotonic Circuits. This seemed to be the event that launched both companies.

LightMatter just received $80M in Series B funding (bringing total funding to $113M) last month and LightIntelligence seems to have $40M in total funding. So both have decent funding, but LightMatter seems further ahead in funding and product technology.

LightMatter

LightMatter Envise Photonics-RISC AI processing chip

LightMatter Envise AI chip uses standard RISC electronic cores together with Photo Arithmetic Units for accelerated AI computations. Each Envise chip has 500MB of SRAM for large models, offers 400Gbps chip to chip interconnect fabric, 256 RISC cores, a Graph processor, 294 photonic arithmetic units and PCIe 4.0 connectivity.

LightMatter has just announced early access for their Envise AI photonics server. It's a 4U AI server with 16 Envise chips, 2 AMD EPYC CPUs, a (16x400Gbps=) 6.4Tbps optical fabric for inter-chip communications, 1TB of DDR4 DRAM, 3TB of NVMe SSD, and support for 2 200GbE SmartNICs for outside communications.

Envise also offers Idiom Software that interfaces with standard AI frameworks to transform models for photonics computing on Envise hardware. Developers select Envise hardware to run their AI models on, and Idiom automatically re-compiles (IdCompile) the model into more parallelized photonics operations. Idiom also has a model profiler (IdProfiler) to help debug and visualize photonic models in operation (training or inferencing?) on Envise hardware. Idiom also offers an AI model library (IdML) which provides a PyTorch frontend to help compress and quantize a standard set of AI models.

LightMatter also announced their Passage optical interconnect chip, which supplies a 100Tbps optical switch for photonics, CPU or GPU processing. It's huge, 8″x8″, and built on a 5nm/7nm node process. Passage can connect up to 48 photonics, CPU or GPU chips that are built on top of it (one can see the space for each of these 48 [sub-]chips on the chip). LightMatter states that 40 Passage (photonic/optical) lanes are the width of one optical fibre. Passage chips are sampling now.

LightMatter Passage photonics-transistor chip (carrier) that provides a photonics programmable interconnect for inter-[photonics-electronic-]chip communications.

LightIntelligence

They don't appear to be announcing any specific hardware just yet, but they are at work creating the world's largest integrated photonics processing system. LightIntelligence has published a number of research papers focused on photonic approaches to CNNs, RNNs/LSTMs/GRUs, recurrent Ising machines, statistical computing, and invisibility cloaking.

Turns out the processing needed to provide invisibility cloaking is very compute-intensive and, as it's all pixels, photonics offers serious speedups (for invisibility, see the Nature article, behind paywall).

Photonic Recurrent Ising Sampler (PRIS)

LightIntelligence did produce a prototype photonics processor in 2019. And they believe they will have de-risked 80-90% of their photonics technology by year-end 2021.

If I had to guess, it would appear as if LightIntelligence is trying to re-imagine deep learning taking a predominately all photonics approach.

Why photonics for AI DL

It turns out that one can use the interaction/interference between two light beams to perform matrix multiplication and other computations a lot faster, and with a lot less power, than standard RISC (or CISC) electronic processor architectures. Typical GPUs draw 400W each, and multi-GPU training runs are commonplace today.
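
Our rough paraphrase of one common scheme in this line of work for doing a matrix multiply with light:

```latex
% Factor the weight matrix by singular value decomposition:
M \;=\; U \, \Sigma \, V^{\dagger}
% The unitaries U and V^{\dagger} can be realized as meshes of Mach-Zehnder
% interferometers and the diagonal \Sigma as per-channel attenuation/gain, so the
% product M x is computed as light propagates through the mesh, with little energy
% spent beyond generating and detecting the light.
```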

The research documented in the (Deep learning using nanophotonics) paper was based on using an optical FPGA which we have talked about before (See Photonics or Optical FPGAs on the horizon) to prototype the technology back in 2017.

Can photonics change the technology underpinning AI or computing?

If by using photonics, one could speed up AI inferencing by 3-5X and do it with 5-6X less power, you might have a market. These are LightMatter Envise performance numbers on ResNet50 with ImageNet and BERT-Base with SQUAD v1.1 against NVIDIA DGX-A100 (state of the art) AI processing system.

The challenge to changing the technology behind multi-million/billion/trillion dollar industry is that it’s not sufficient to offer a product better than the competition. One has to offer a technology that’s better enough to fund the building of a new (multi-million/billion/trillion dollar) ecosystem surrounding that technology. In order to do that it’s got to be orders of magnitude faster/lower power/better so that commercial customers adopt it en masse.

I like where LightMatter is going with their Passage chip. But their Envise server doesn’t seem fast enough to give them enough traction to build a photonics ecosystem or to fund Envise 2, 3, 4, etc. to change the industry.

The 2017 (Deep learning using nanophotonics) paper predicted that an all optical/photonics implementation of CNN would use 3 orders of magnitude less power for small models and that advantage would only go up for larger models (not counting power for data movement, photo detectors, etc.). Now if that’s truly feasible and maybe it takes a more photonics intensive processor to get there, then photonics technology could truly transform the AI or for that matter the computing industry.

But the other thing that LightIntelligence and LightMatter may be counting on is the slowdown in Moore’s law which may inhibit further advances in electronics processing power. Whether the silicon industry is ready to throw in the towel yet on Moore’s law is TBD.

Comments?

Photo Credit(s):

New era of graphical AI is near #AIFD2 @Intel

I attended AIFD2 (videos of the sessions available here) a couple of weeks back, and for the last session Intel presented information on the new graph-optimized cores they have been working on and on a partner of theirs, Katana Graph, which supplies a highly optimized graph analytics processing tool set using latest-generation Xeon compute and Optane PMEM.

What’s so special about graphs

The challenge with graph processing is that it's nothing like standard 2D tables/images or 3D-oriented data sets. A graph is essentially a non-Euclidean data space of nodes with edges that connect them.

But graphs are everywhere we look today, for instance, “friend” connection graphs, “terrorist” networks, page rank algorithms, drug impacts on biochemical pathways, cut points (single points of failure in networks or electrical grids), and of course optimized routing.

The challenge is that large graphs aren't easily processed with standard scale-up or scale-out architectures. Part of this is that graphs are very sparse: one node could point to one other node or to millions. Due to this sparsity, standard cache prefetch logic (such as fetching everything adjacent to a memory request) and standard vector processing (the same instruction applied to data in sequence) don't work very well at all. Standard branch prediction logic doesn't work either. (Not sure why, but apparently branching for graph processing depends more on the data at the node or in the edge connecting nodes.)
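
A small sketch of why traversal defeats caches and vector units: in a compressed-sparse-row (CSR) adjacency structure, each node's neighbor list is a different, unpredictable set of indices, so every hop is effectively a scattered small fetch rather than a streaming, vectorizable access. (The graph below is made up for illustration.)

```python
import numpy as np

# CSR layout: indptr[i]:indptr[i+1] slices the neighbor ids of node i
indptr  = np.array([0, 2, 3, 6, 6])          # 4 nodes
indices = np.array([1, 3, 0, 0, 1, 2])       # neighbor ids, in no useful memory order

def bfs(start, n_nodes):
    visited = np.zeros(n_nodes, dtype=bool)
    frontier = [start]
    visited[start] = True
    while frontier:
        nxt = []
        for u in frontier:
            for v in indices[indptr[u]:indptr[u + 1]]:   # scattered, data-dependent reads
                if not visited[v]:
                    visited[v] = True
                    nxt.append(v)
        frontier = nxt
    return visited

print(bfs(0, 4))                              # [ True  True False  True]
```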

Intel talked about a new compute core they've been working on, which was in response to a DARPA-funded activity to speed up graph processing 1000X over current CPU/GPU hardware capabilities.

Intel presented their PIUMA core technology, which is also described in a 2020 research paper (Programmable Integrated and Unified Memory Architecture) and a YouTube video (Programmable Unified Memory Architecture).

Intel’s PIUMA Technology

DARPA's goals became public in 2017 and described their Hierarchical Identify Verify Exploit (HIVE) architecture. HIVE is DoD's description of a graph analytics processor and is a multi-institutional initiative to speed up graph processing.

Intel PIUMA cores come with a multitude of 64-bit RISC processor pipelines with a global (shared) address space, memory and network interfaces that are optimized for 8 byte data transfers, a (globally addressed) scratchpad memory and an offload engine for common operations like scatter/gather memory access.

Each multi-thread PIUMA core has a set of instruction caches, small data caches and register files to support each thread (pipeline) in execution. And a PIUMA core has a number of multi-thread cores that are connected together.

PIUMA cores are optimized for TTEPS (tera traversed edges per second) and attempt to balance IO, memory and compute for graph activities. PIUMA multi-thread cores are tied together into a (completely connected) clique called a tile, multiple tiles are connected within a single node, and multiple nodes are tied together with an 8-byte-transfer-optimized network into a PIUMA system.

P[I]UMA (labeled PUMA in the video) multi-thread cores apparently eschew extensive data and instruction caching in favor of a large number of relatively simple cores that can process a multitude of threads at the same time. Most of these threads will be waiting on memory, so the more threads executing, the less likely the whole pipeline will be idle, and hopefully the more processing speedup can result.

Performance of the P[I]UMA architecture vs. a standard Xeon compute architecture on graph analytics and other graph-oriented tasks was simulated, with some results presented below.

Simulated speedups for a single node with P[I]UMA technology vs. Xeon range anywhere from 3.1x to 279x and depend on the amount of computation required at each node (or edge). (Intel saw no speedup going from a single Xeon node to multiple Xeon nodes, while the speedup for 16 P[I]UMA nodes was 16X a single P[I]UMA node.)

Having a global address space across all PIUMA nodes in a system is pretty impressive. We guess this is intrinsic to their (large) graph processing performance and is dependent on their use of photonics HyperX networking between nodes for low latency, small (8 byte) data access.

Katana Graph software

Another part of Intel’s session at AIFD2 was on their partnership with Katana Graph, a scale out graph analytics software provider. Katana Graph can take advantage of ubiquitous Xeon compute and Optane PMEM to speed up and scale-out graph processing. Katana Graph uses Intel’s oneAPI.

Katana Graph is architected to support some of the largest graphs around. They tested it with the WDC12 (Web Data Commons 2012) page crawl, with 3.5B nodes (pages) and 128B connections (links) between nodes.

Katana runs on the AWS, Azure and GCP hyperscaler environments as well as on-prem, and can scale up to 256 systems.

Katana Graph performance results for Graph Neural Networks (GNNs) are shown below. GNNs are similar to AI/ML/DL CNNs but use graph data rather than images. One can take a graph and reduce (convolve) and summarize segments of it to classify them. Moreover, GNNs can be used to determine whether two nodes are connected and whether two (sub)graphs are equivalent/similar.
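
A hedged, minimal sketch of the building block of such GNNs, a single graph-convolution step: each node's new feature vector is an aggregate of its neighbors' features pushed through a small learned transform. This is plain PyTorch for illustration, not Katana's implementation.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
    def forward(self, x, adj):
        # adj: (N, N) normalized adjacency (with self-loops); x: (N, in_dim)
        return torch.relu(self.lin(adj @ x))    # aggregate neighbors, then transform

n = 5
adj = torch.eye(n)
adj[0, 1] = adj[1, 0] = 1.0                     # one edge between nodes 0 and 1
adj = adj / adj.sum(dim=1, keepdim=True)        # crude row normalization
x = torch.randn(n, 16)                          # initial node features
layer = GraphConvLayer(16, 8)
h = layer(x, adj)                               # (5, 8) node embeddings
```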

In addition to GNNs, Katana Graph supports Graph Transformer Networks (GTNs), which can analyze meta-paths within a larger, heterogeneous graph. The challenge with large graphs (say friend/terrorist networks) is that there are a large number of distinct sub-graphs within the graph. GTNs can break heterogeneous graphs into sub- or meta-graphs, which can then be used to understand these relationships at smaller scales.

At AIFD2, Intel also presented an update on their Analytics Zoo, which is Intel’s MLops framework. But that will need to wait for another time.

~~~~

It was sort of a revelation to me that graphical data was not amenable to normal compute core processing using today’s GPUs or CPUs. DARPA (and Intel) saw this defect as a need for a completely different, brand new compute architecture.

Even so, Intel's partnership with Katana Graph shows that even today's compute environments can provide higher performance on graph data with suitable optimizations.

It would be interesting to see what Katana Graph could do using PIUMA technology and appropriate optimizations.

In any case, we shouldn't need to wait long; Intel indicated in the video that P[I]UMA technology chips could be here within the next year or so.

Comments?

Photo Credit(s):

  • From Intel’s AIFD2 presentations
  • From Intel's PUMA YouTube video

Swarm learning for distributed & confidential machine learning

Read an article the other week about researchers in Germany working with a form of distributed machine learning they called swarm learning (see: AI with swarm intelligence: a novel technology for cooperative analysis …) which was reporting on a Nature magazine article (see: Swarm Learning for decentralized and confidential clinical machine learning).

The problem of shared machine learning is particularly acute with medical data. Many countries specifically call out patient medical information as data that can't be shared between organizations (even within a country) unless specifically authorized by a patient.

So these organizations and others are turning to distributed machine learning as a way to 1) protect data at each node and 2) provide accurate predictions that use all the data even though portions of that data aren't visible. There are two forms of distributed machine learning that I'm aware of: federated and now swarm learning.

The main advantage of federated and swarm learning is that the data can be kept in the hospital, medical lab or facility without being revealed outside that privileged domain, BUT the [machine] learning derived from that data can be shared with other organizations and used in aggregate to increase the prediction/classification model accuracy across all locations.

How distributed machine learning works

Distributed machine learning starts with a common model that all nodes download and use to share learnings. At some agreed-to time (across the learning network), all the nodes use their latest data to re-train the common model and share new training results (essentially the weights used in the neural network layers) with all other members of the learning network.

Shared learnings would be encrypted with TLS plus some form of homomorphic encryption that allowed for calculations over the encrypted data.

In both federated and swarm learning, the sharing mechanism was facilitated by a privileged blockchain (apparently Ethereum for swarm). All learning nodes would use this blockchain to share learnings and download any updates to the common model after sharing.

Federated vs. Swarm learning

The main difference between federated and swarm learning is that with federated learning there is a central authority that updates the model(s), while with swarm learning that processing is replaced by a smart contract executing within the blockchain. Updating the model(s) is done by each node posting its shared data to the blockchain; once all updates are in, a smart contract is triggered to execute Ethereum VM code which aggregates all the learnings and constructs a new model (or at least new weights for the model). Thus no single node is responsible for updating the model; it's all embedded in a smart contract within the Ethereum blockchain.

But how does the swarm (or smart contract) update the common model's weights? The Nature article states that they used either a straight average or a weighted average (weighted by the "weight" of a node [we assume this is a function of the node's re-training dataset size]) to update all parameters of the common model(s).
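
A hedged sketch of that (weighted-)average merge, the kind of computation the smart contract would trigger; this is plain Python for illustration, not the HPE Swarm Learning code or an Ethereum contract.

```python
def merge_models(node_weights, node_sizes):
    """node_weights: list of dicts {param_name: list-of-floats}, one per node
       node_sizes:   per-node training-set sizes, used as merge weights"""
    total = sum(node_sizes)
    merged = {}
    for name in node_weights[0]:
        merged[name] = [
            sum(w[name][i] * s for w, s in zip(node_weights, node_sizes)) / total
            for i in range(len(node_weights[0][name]))
        ]
    return merged

# Two nodes with a tiny 2-parameter "model"; node B has twice the data of node A
merged = merge_models(
    [{"w": [0.2, 1.0]}, {"w": [0.8, 0.4]}],
    node_sizes=[100, 200])
print(merged)     # {'w': [0.6, 0.6]} -- node B's update counts twice as much
```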

Testing Swarm vs. Centralized vs. Individual (node) model learning

In the Nature paper, the researchers compared a central model, where all data is available to retrain the model, with one utilizing swarm learning. To perform the comparison, they had all nodes contribute 20% of their test data to a central repository, which ran the common swarm-updated model against this data to compute an accuracy metric for the swarm. The resulting accuracies of the central vs. swarm learning comparison looked identical.

They also compared each individual node (just using the common model and then retraining it over time without sharing the learnings with the swarm) versus using the swarm learning approach. In this comparison the swarm learning approach always seemed to have as good if not better accuracy and a much narrower dispersion.

In the Nature paper, the researchers used swarm learning to manage the machine learning model predictions for detecting COVID-19, leukemia, tuberculosis, and other lung diseases. All of these used public data, which included PBMC (peripheral blood mononuclear cell) transcription data, whole blood transcription data, and X-ray images.

Swarm learning also provides the ability to onboard new nodes into the network: the swarm supplies the common model and its current weights to the new node and adds it to the shared-learning smart contract.

The code for swarm learning can be downloaded from HPE (requires an HPE Passport login [it's free]). The code for the models and data processing used in the paper is available from GitHub. All this seems relatively straightforward; one could use the HPE Swarm Learning Library to facilitate doing this or code it up oneself.

Photo Credit(s):