The myth of AGI

Sorry, I seem to be on an AGI bent this month…

Read an article the other day about a new book (The Myth of Artificial Intelligence, by Erik J. Larson) that explains how the present direction of AI-ML-DL is very unlikely to achieve artificial general intelligence (AGI). Amazon and others offer a short preview of the book, which is where most of this discussion comes from.

Types of (human) reasoning

Near as I can tell (I don’t have the book), the book discusses the three types of reasoning that exist in human intellect, i.e., deduction, induction and abduction.

  • Deduction uses formal logic (or its equivalents) to derive facts or theorems from basic principles.
  • Induction uses a multitude of samples and constructs general principles from the analysis of them.
  • Abduction uses a set of probabilistic assertions plus formal logic to come up with a probabilistic principle.

Deduction is most famously observed in geometry and arithmetic proofs and was most evident in the early years of AI through its use of expert systems. The challenge with expert systems is that the real world is vastly more complex than any geometrical or arithmetical artifice that humankind can produce.

Expert systems became champions of checkers, chess and some other games, but in the end were not easily generalizable beyond a few restricted (gaming and medical) domains.

Induction is presently all the rage and represents what machine learning and deep neural networks (DNN) are doing with all that training data and resultant classification inferencing.

Today we have DNNs that can classify the objects in an image, can learn to play any game on the planet better than humans, and can even safely drive a car down the road.

The current AI world view is that this form of reasoning, DNN induction, if taken to its extreme, will ultimately result in some level of AGI, or human-equivalent levels of intelligence in a system. The author of the book begs to differ.

Abduction is less well known or discussed in rational circles. It’s essentially what any human does when presented with real world examples/experiences to derive an understanding (or principle) of what happened.

For example, a plate full of cookies last night becomes an almost empty plate of crumbs and two cookies. So what happened? Your son woke up early, consumed most if not all of them, and left for work. This is a probabilistic (most likely) inference, but it has a high probability of being true.
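
To make this concrete, here’s a minimal sketch of abduction framed as Bayesian “inference to the most likely explanation”. All the hypotheses and probabilities below are invented for illustration:

# Toy sketch of abduction as inference to the most likely explanation,
# framed as simple Bayesian hypothesis selection. All hypotheses and
# probabilities are made up for illustration only.

# Prior belief in each explanation for the missing cookies
priors = {"son ate them": 0.6, "dog got them": 0.3, "sweet-toothed burglar": 0.1}

# Likelihood of the evidence (crumbs, two cookies left, son gone early)
# under each hypothesis
likelihoods = {"son ate them": 0.9, "dog got them": 0.4, "sweet-toothed burglar": 0.05}

# Bayes rule (unnormalized), then normalize so probabilities sum to 1
posterior = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(posterior.values())
posterior = {h: p / total for h, p in posterior.items()}

best = max(posterior, key=posterior.get)
print(posterior)                          # 'son ate them' comes out ~0.81
print("Most likely explanation:", best)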

Any AGI will need all forms of reasoning

The challenge is that AI has already been through the deduction phase, with the rise of expert systems, which crashed and burned because of the cost and time required to produce an exhaustive and correct expert system. AI is currently in the induction phase, via DNN training, which seems to be far more generalizable and successfully usable across many different domains. But no one is talking seriously about doing abduction in AI (anymore).

The author claims (again, I have not read the book) that any AGI will require as much abduction as induction (and perhaps deduction as well), and therefore AGI is not inevitable on our current, DNN (induction) intensive path.

Previous and current attempts at abductive reasoning

Some may recall fuzzy logic as one of the avenues taken after expert systems seemed to fail at successful, realistic inferencing around the end of the last century. Fuzzy logic was a way of bringing probabilities into deduction, not unlike abduction as defined above. With fuzzy logic, each assertion or base assumption was given a probabilistic value (of being true) and the final derivation was assigned some level of probability of being true.

The Wikipedia article has definitions for fuzzy logic AND, OR and NOT, which of course would allow any system to make these assertions (see the sketch below). But fuzzy logic (like expert systems above) suffered from the inability to exhaustively cover all examples in a real world situation.
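
For reference, the classic (Zadeh) fuzzy operators are trivial to state in code: AND is a minimum, OR a maximum, and NOT a complement. The example truth values are mine:

# Classic (Zadeh) fuzzy logic operators: truth values are reals in [0,1]
# rather than booleans.

def fuzzy_and(a: float, b: float) -> float:
    return min(a, b)

def fuzzy_or(a: float, b: float) -> float:
    return max(a, b)

def fuzzy_not(a: float) -> float:
    return 1.0 - a

# Example: "the plate is mostly empty" (0.8) AND "it's early morning" (0.6)
print(fuzzy_and(0.8, 0.6))   # 0.6
print(fuzzy_or(0.8, 0.6))    # 0.8
print(fuzzy_not(0.8))        # ~0.2 (modulo float rounding)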

Furthermore, the (funny) thing about DNNs is that they are much more probabilistic than they appear. If one examines the classification outputs of any DNN, it is extremely rare to see boolean (true or false), yes or no answers. Mostly one sees a series of probabilities assigned to each classification bucket.

DNN systems hide these probabilities by just selecting the maximum (or minimum) probability generated as the final classification. This is entirely an artifact of needing some discrete output (a classification selection). But DNN (internal) results are always probabilistic values.
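
A sketch of what this looks like in practice: the last layer of a classifier emits a probability per class (via softmax), and only the argmax is surfaced as “the” answer. The logit values below are made up:

import numpy as np

# The final layer of a typical DNN classifier produces a probability per
# class (via softmax); only the argmax is reported as the classification.

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.1, 0.3, 5.6, 1.0])         # raw scores from the last layer
probs = softmax(logits)                          # ~[0.03, 0.005, 0.96, 0.01]
print(probs)
print("selected class:", int(np.argmax(probs)))  # the discrete output we see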

So although pure induction doesn’t include probabilities, DNN induction as practiced today in AI systems uses probabilistic reasoning in every layer of a DNN and in its final results.

What else may be missing from AI to allow AGI to be developed

Personally, I think AGI requires not just the reasoning approaches above, but also a more workable, general purpose planning solution. I’ve tried to identify whether some researchers are using DNNs to provide general purpose planning solutions but have yet to find any (in publicly available research). This is probably the one place where expert (or fuzzy control) systems still shine. But again they are hard to generalize and prove almost impossible to make completely exhaustive.

Nonetheless, in the end, I think all the above just proves that there are a number of distinct reasoning and other (planning) techniques that may need to come together to provide AGI. As any of us can attest, all of these different approaches are available within any human intellect.

And if we assume that any AGI will need to follow the human approach to intelligence (not a given), they will all need to be stitched together, combined and brought to bear to realize AGI.

But at present, with all the focus on DNN/induction, we as AI researchers are not making much progress on these other techniques or on combining them into a single system.

And for that I am happy. I would be very pleased to have AGI arrive later rather than sooner. Because for the life of me, AGI scares the s&#t out of me.

Mostly because I don’t see any real way to control AGI once unleashed. That, and given the diversity of motives around this world, I don’t see any realistic mechanism to instill a universal and firm (unalterable) belief in the sanctity of human and other life, the dependence this life has on our environment/biosphere, and the rule of law needed to maintain peace across humankind (and I’m probably missing a half dozen more things we would want any AGI to adhere to).

Maybe, if I saw more effort on how we as a species can come up with universal views on these and other topics, and some way of instilling a system of programs with these unalterable beliefs and AGI controls based on them, I’d be less fearful of AGI emerging.

Lacking that, any way of delaying its emergence is fine by me.

Comments?

AI inferencing using light alone

Researchers at UCLA have taken a trained DL neural network and implemented it as a series of passive, optical-only, 3D printed diffraction gratings to perform Fashion-MNIST object classification. They did the same with MNIST handwritten digit and ImageNet DL neural network classifiers.

[Figure: Experimental testing of 3D-printed D2NNs. (A, B) Final designs of the five layers (L1, L2, …, L5) of the handwritten digit classifier, fashion product classifier and imager D2NNs, with an illustration of the corresponding 3D-printed D2NN. (C, D) Schematic and photo of the experimental terahertz setup: an amplifier-multiplier chain generated continuous-wave radiation at 0.4 THz, and a mixer-amplifier-multiplier chain handled detection at the output plane of the network.]

See the article on SlashGear, 3D printed all-optical diffractive deep learning neural network…. The research article is only available on the Optical Society of America’s website/magazine (see Residual D2NN: training diffractive deep neural networks via learnable light shortcuts, behind a hard paywall). However, I did find a follow on article on arXiv (see Analysis of Diffractive Optical Neural Networks and Their Integration with Electronic Neural Networks) that discussed how to integrate D2NN approaches with an electronic NN to create a hybrid inference engine. And an earlier Science article (see All-optical machine learning using diffractive deep neural networks) that was available, which described earlier versions of D2NN technology for MNIST digit classification, Fashion-MNIST classification and ImageNet object classification.

How does it work

Apparently the researchers trained a normal (electronic) deep learning neural network on MNIST, Fashion-MNIST and ImageNet, and then converted the resultant trained NNs into a set of multiple diffraction grids. They did some computer simulation of the D2NN and, once satisfied it worked and achieved decent accuracy, 3D printed the diffraction plates.

[Figure: All-optical D2NN-based classifiers, based on spatially and temporally coherent illumination and linear optical materials/layers. (a-e) D2NN setup and final 5-layer, phase-only design for handwritten digit (MNIST) classification, with input encoded in the amplitude channel of the input plane, plus output-plane intensity patterns for MSE-based and softmax-cross-entropy (SCE)-based designs. (f-j) The same for fashion product (Fashion-MNIST) classification, with input encoded in the phase channel; λ refers to the illumination source wavelength. The input plane carries the input object or its data, which can also be projected onto this plane by another optical imaging system or a lens.]

In their D2NN, they start with coherent (laser) light in the THz spectrum, use it to illuminate the input plane (I assume an image of the object/digit/fashion accessory), and pass this through multiple plates of diffraction grids onto a THz detector, where the illuminated spot indicates the classification.

The Science article has a supplementary materials download that shows how the researchers converted NN weights into a diffraction grating. Essentially each pixel on the diffraction grating either transmits, refracts, or reflects a light path, and this represents the connections between layers. It’s unclear whether the 5 or 6 plates used in the D2NN correspond to the NN layers, but it’s certainly possible.
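
Here’s my own toy sketch of the general idea, not the researchers’ actual model (theirs is in the supplementary material): treat the coherent light as a complex field, let each printed layer apply a per-pixel phase shift, and approximate free-space propagation between layers with a crude Fourier-domain kernel:

import numpy as np

# Toy model of a D2NN forward pass (my sketch, not the researchers' model):
# coherent light is a complex-valued field, each 3D printed layer applies a
# learned per-pixel phase shift, and propagation between layers is
# approximated with a quadratic Fourier-domain phase kernel.

N = 200            # pixels per side of each diffraction layer
NUM_LAYERS = 5

rng = np.random.default_rng(0)
# Per-pixel phase shifts stand in for the trained "weights" of each layer
phase_masks = [rng.uniform(0, 2 * np.pi, (N, N)) for _ in range(NUM_LAYERS)]

def propagate(field):
    # Stand-in for free-space propagation (angular spectrum style); the 50.0
    # is an arbitrary distance term chosen purely for illustration.
    fx = np.fft.fftfreq(N)
    FX, FY = np.meshgrid(fx, fx)
    kernel = np.exp(-1j * np.pi * (FX ** 2 + FY ** 2) * 50.0)
    return np.fft.ifft2(np.fft.fft2(field) * kernel)

field = np.ones((N, N), dtype=complex)   # stand-in for the illuminated input
for mask in phase_masks:
    field = propagate(field * np.exp(1j * mask))

intensity = np.abs(field) ** 2           # what the THz detector measures
print("brightest detector spot:", np.unravel_index(intensity.argmax(), intensity.shape))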

And for the life of me I can’t understand what they mean by “Residual D2NN”, other than perhaps using a trained residual NN and converting it to a D2NN.

Some advantages of D2NN

3D printing diffraction gratings means anyone/any lab could do this. The 3D printers they used had a spatial accuracy of 600 dpi, with 0.1mm accuracy, almost consumer grade 3D printers. In any case, being able to print these in a matter of hours, while not as easy as changing an all digital NN, seems like an easy way to try out the approach.

For example, for the MNIST digit classifier they used a pixel size of 400µm, and each diffraction layer they created was equivalent to 200×200 neural weights. Which means the 5 layer D2NN could handle about 0.2M neural weights, completely connected to one another, or (200×200)² × 5 = 8B connections in the MNIST D2NN. In the image classifier, each diffraction layer had 300×300 neural weights. So D2NNs seem to scale very well.
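
The back-of-the-envelope arithmetic behind those figures:

# Back-of-the-envelope check on the D2NN sizes quoted above.
pixels_per_layer = 200 * 200             # 40,000 "neurons" per layer
layers = 5
weights_total = pixels_per_layer * layers
print(weights_total)                      # 200,000 -> the ~0.2M figure

# If every pixel of one layer can illuminate every pixel of the next,
# each layer is effectively fully connected:
connections = pixels_per_layer ** 2 * layers
print(f"{connections:.1e}")               # 8.0e+09 -> the 8B figure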

Being an all passive optical device, the system operates entirely in parallel. That is, the researchers indicated that the D2NN devices operate at the speed of light and would perform the inferencing activity in the time it takes a camera to capture the image.

Also the device uses very little energy (I assume just the energy for the THz generator, the input plane detector and the THz detector at the end).

And the researchers also claimed the device was cheap to manufacture; it could be created for less than $50. (Unclear if this included all the electronics or just the D2NN diffraction gratings and holder.) And once you have locked into a D2NN that you want to use, it could be manufactured in volume very cheaply (sort of like stamping out CD platters). Finally, the number of neural network nodes and layers can be scaled up to a large number of layers and nodes per layer while still fitting on the diffraction gratings. In contrast, all-electronic NNs require more compute power as you scale up network layers and nodes per layer.

The other article (on arXiv) talked about potentially using a hybrid optical-electronic DNN approach, with some layers being D2NN and others being purely digital (electronic). Such a system could be used where some portion of the NN is more stable/more compute intensive than the rest, and where the final output classification layer(s) is more changeable and much smaller/less compute intensive. Such a hybrid system could make use of the best of the all-optical D2NN to efficiently and quickly compress the input space, and then have the electronic final classification layer provide the final classification step.

The Oracle

Combine a handful of D2NNs into a device that accepts speech input and provides speech output, add say an offline copy of Wikipedia, Google Books, etc. with a search engine that could be used to retrieve responses to questions asked, and you would have an oracle device. You would ask a question and the device would respond with the best answer it could find (in its databases).

If this could be made out of all passive optical components and use natural sunlight/electronic illumination to perform its functionality, such an all optical, question-to-answer oracle would be very useful to the populations of the world. And it could be manufactured in volume very cheaply and would cost almost nothing to operate.

A couple of other tweaks: if we could collapse the multiple grating D2NNs into a single multi-layer plate/platter and make these replaceable in the device, that would allow the oracle’s information base to be updated periodically.

Then if we could embed such a device into a Long Now Clock that would reflect sunlight onto the disk every solstice or equinox, we could have a quarterly oracle device that could last for 1000s of years, providing answers to queries one day every quarter. And that would be quite the oracle…

New DRAM can be layered on top of CPU cores

At the last IEDM (IEEE International Electron Devices Meeting), there were two sessions devoted to a new type of DRAM cell that consists of 2 transistors and no capacitors (2T0C) and that can be built in layers on top of a microprocessor without disturbing the microprocessor silicon. I couldn’t access the actual research (behind paywalls), but one of the research groups was from Belgium (IMEC) and the other from the US (Notre Dame and R.I.T.). This was written up in a couple of teaser articles in the tech press (see the IEEE Spectrum tech talk article).

DRAM today is built using 1 transistor and 1 capacitor (1T1C). And it appears that capacitors and the logic used for microprocessors aren’t very compatible. As such, most DRAM lives outside the CPU (or microprocessor core) chip and is attached over a memory bus.

[Figure: New 2T0C DRAM bit cell. Data is written by applying current to the WBL and WWL, and bits are read by seeing if a current can pass through the RWL/RBL.]

Memory buses have gotten faster in order to allow faster access to DRAM, but this too is starting to reach fundamental physical limits, and DRAM memory sizes aren’t scaling like they used to.

Wouldn’t it be nice if there were a new type of DRAM that could be easlly built closer or even layered on top of a CPU chip, with faster direct access from/to CPU cores. through inter chip electronics.

Oxide based 2T0C DRAM

DRAM was designed from the start with 1T1C so that it could hold a charge. With a charge in place it could be read out quickly and refreshed periodically without much of a problem.

The researchers found that at certain sizes (and with proper dopants) small transistors can also hold a (small) charge without needing any capacitor.

By optimizing the chemistry used to produce those transistors, they were able to make 2T0C cells hold memory values. And given the fabrication ease of these new transistors, they can easily be built on top of CPU cores, at a low enough temperature so as not to disturb the CPU core logic.

But, given these characteristics, the new 2T0C DRAM can also be built up in layers, just like 3D NAND and unlike current DRAM technologies.

Today 3D NAND is being built at over 64 layers, with NAND flash roadmaps showing double or quadruple that number of layers on the horizon. Researchers presenting at IEDM were able to fabricate an 8 layer 2T0C DRAM on top of a microprocessor and provide direct, lightning fast access to it.

The other thing about the new DRAM technology is that it doesn’t need to be refreshed as often. Current DRAM must be refreshed every 64 msec. This new 2T0C technology has a much longer retention time, currently only needing a refresh every 400s, and much longer retention times are technically feasible.

Some examples of processing needing more memory:

  • AI/ML and the memory wall – Deep learning models are getting so big that memory size is starting to become a limiting factor in AI model effectiveness. And this is just with DRAM today. Optane and other SCM can start to address some of this problem, but the problem doesn’t go away; AI DL models are just getting more complex. I recently read an article where Google trained a trillion parameter language model.
  • In memory databases – SAP HANA is just one example, but there are other startups as well as traditional database providers that are starting to use huge amounts of memory to process data at lightning fast speeds. Data only seems to grow, not shrink.

Yes, Optane and other SCM can solve some of these problems today. But having a 3D scalable DRAM memory that can be built right on top of CPU cores, with longer hold times and faster direct access, could be a game changer.

It’s unclear whether we will see all DRAM move to the new 2T0C format, but if it can scale well in the Z direction has better access times, and longer retention, it’s unclear why this wouldn’t displace all current 1T1C DRAM over time. However, given the $Bs of R&D spend on new and current DRAM 1T1C fabrication technology, it’s going to be a tough and long battle.

Now if the new 2T0C DRAM could only move from 1 bit per cell to multiple bits per cell, like SLC to MLC NAND, the battle would heat up considerably.

Is hardware innovation accelerating – hardware vs. software innovation (round 6)

There’s something happening to the IT industry, that maybe has not happened in a couple of decades or so but hardware innovation is back. We’ve been covering bits and pieces of it in our hardware vs software innovation series (see Open source ASiCs – HW vs. SW innovation [round 5] post).

Hardware innovation never really went away; Intel, AMD, Apple and others have always worked on new compute chips, and DRAM and NAND have taken giant leaps over the last two decades. But these were all major hardware suppliers. Special purpose chips, non-CPU compute engines, and hardware accelerators had been relegated to the dustbins of history as the CPU giants kept assimilating their functionality into the next round of CPU chips.

And then something happened. It kind of made sense for GPUs to be their own electronics, as their SIMD architectures are intrinsically different from the SISD, standard von Neumann architectures of X86 and ARM CPUs.

But for some reason it didn’t stop there. We first started seeing inklings of new hardware innovation in the AI space, with a number of special purpose DL NN accelerators coming online over the last 5 years or so (see Google TPU, SC20-Cerebras, GraphCore GC2 IPU chip, AI at the Edge Mythic and Syntiant IPU chips, and neuromorphic chips from BrainChip, Intel, IBM, and others). Again, one could look at these as taking the SIMD model of GPUs in a slightly different direction. It’s probably one reason GPUs were so useful for AI-ML-DL, but further accelerations were now possible.

But it hasn’t stopped there either. In the last year or so we have seen SPUs (Nebulon Storage), DPUs (Fungible, NVIDIA Networking, others), and computational storage (NGD Systems, ScaleFlux Storage, others) all come online and become available to the enterprise. And most of these are for more normal workload environments, i.e., not AI-ML-DL workloads,

I thought at first these were just FPGAs implementing different logic, but now I understand that many of these include ASICs as well. Most incorporate a standard von Neumann CPU (mostly ARM) along with special purpose hardware to speed up certain types of processing (such as low latency data transfer, encryption, compression, etc.).

What happened?

It’s pretty easy to understand why non-von Neumann computing architectures should come about. Witness all those new AI-ML-DL chips that have become available. And why these would be implemented outside the normal X86-ARM CPU environment.

But SPUs, DPUs and computational storage all have typical von Neumann CPUs (mostly ARM) as well as other special purpose logic on them.

Why?

I believe there are a few reasons, but the main two are that Moore’s law (halving the size of transistors every 2 years, effectively doubling transistor counts in the same area) is slowing down, and Dennard scaling (as you reduce the size of transistors, their power consumption goes down and speed goes up) has almost stopped. Both of these have caused major CPU chip manufacturers to focus on adding cores to boost performance rather than adding more transistors to the same core to increase functionality.

This hasn’t stopped adding instruction functionality to each CPU, but it has slowed considerably. And single (core) processor speeds (GHz) have reached a plateau.

But what it has stopped is having the real estate available on a CPU chip to absorb lots of additional hardware functionality, which had been the case since the 1980s.

I was talking with a friend who used to work on math co-processors, like the 8087, 80287 & 80387, that performed floating point arithmetic. But after the 486, floating point logic was completely integrated into the CPU chip itself, killing off the co-processor business.

Hardware design is getting easier & chip fabrication is becoming a commodity

We wrote a post a couple of weeks back about an open foundry (see the HW vs. SW innovation round 5 post noted above) that will take a hardware design and manufacture the ASICs for you for free (or at little cost). This says the tool chain to perform chip design is becoming more standardized and much less complex. Does this mean it takes less than 18 months to create an ASIC? I don’t know, but it seems so.

But the really interesting aspect of this is that world class foundries are now available outside the major CPU developers. And these foundries, for a fair but high price, would be glad to fabricate a thousand or a million chips for you.

Yes, your basic state of the art fab probably costs $12B plus these days. But all that has meant is that A) they will take any chip design and manufacture it, B) they need to keep factory volume up by manufacturing chips in order to amortize the fab’s high price, and C) they have to keep their technology competitive or chip manufacturing will go elsewhere.

So chip fabrication is not quite a commodity. But there are enough state of the art fabs in existence to make it seem so.

But it’s also physics

The extremely low latencies available with NVMe storage and higher speed networking (100GbE & above) demand a lot more processing power to keep up. And just the physics of how long it takes to transfer data across a distance (aka racks) is starting to consume too much overhead and impact other work that could be done.

When we start measuring IO latencies in under 50 microseconds, there’s just not a lot of CPU instructions and task switching that can go on anymore. Yes, you could devote a whole core or two to this process and keep up with it. But wouldn’t the data center be better served keeping those cores busy with normal work, and offloading that low-latency, realtime (like) work to a hardware accelerator that could be executing on the network rather than behind a NIC?
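
Some rough arithmetic behind that claim, with an assumed 3GHz core clock and an assumed context switch cost of a few microseconds:

# Rough arithmetic: how many cycles does a core have during a 50us IO?
clock_ghz = 3.0                           # assumed core clock
io_latency_us = 50
cycles_available = clock_ghz * 1e9 * io_latency_us * 1e-6
print(f"{cycles_available:,.0f} cycles")  # 150,000 cycles

# A context switch is often quoted at a few microseconds, i.e. a meaningful
# fraction of the whole IO budget -- hence the appeal of offloading to an
# accelerator sitting on the network path.
switch_us = 3                             # assumed context switch cost
print(f"one switch burns ~{switch_us / io_latency_us:.0%} of the budget")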

So realtime processing has become faster. Or rather, the time available to execute the CPU instructions that switch tasks and process realtime data, while still keeping up with faster line speeds, is becoming shorter.

So that explains DPUs, smart NICs, & SPUs. What about the other hardware accelerator cards?

  • AI-ML-DL is becoming such an important, data AND compute intensive workload that, just like GPUs before them, TPUs & IPUs are becoming a necessary evil if we want to service those workloads effectively and expeditiously.
  • Computational storage is becoming more widespread because although data compression can easily be done at the CPU, it can be done faster (less data needs to be transferred back and forth) at the smart drive.

My guess we haven’t seen the end of this at all. When you open up the possibility of having a long term business model, focused on hardware accelerators there would seem to be a lot of stuff that needs to be done and could be done faster and more effectively outside the core CPU.

There was a point over the last decade when software was destined to “eat the world”. I got a lot of flack for saying that was BS and that hardware innovation was really eating the world. Now that hardware innovation’s back, it seems to be a little of both.

Comments?

Ok, maybe neuromorphic chips aren’t a deadend – Neuromorphic Part 6

Those of you who follow my blog will no doubt recall that I pronounced neuromorphic chips dead (see our Are neuromorphic chips a deadend blog post). Not because the hardware technology wasn’t improving or good enough, but because software support for the technology was sorely lacking, making it extremely complex or nigh impossible to program and use.

And in the meantime, GPUs, TPUs and other more “normal” neural network hardware and accelerators were all able to utilize standard, easy to use, mostly open source AI DL frameworks. And all this hardware was steadily improving, coming out regularly with more power and performance, with no end in sight.

But then I attended AIFD1 (AI Field Day 1), where at one of the sessions Anil Mankar, COO & Co-Founder of a company named BrainChip Inc. (see video of their talk), presented yet another neuromorphic chip, called the AKIDA Neural Processor. The current generation of the technology is available in their AKD 1000 SoC chip, focused on IoT solutions. But they had created a software development environment that allows one to take standard TensorFlow trained neural network models and deploy them on their hardware. And that got my interest.

BrainChip’s AKIDA AKD 1000 hardware AND software

Their AI DL neuromorphic chip is made up of Event Domain Neural Processing Units (NPUs). AKIDA technology is focused on low power, sensor like applications. They claim to save power by only consuming power (or running) when an event takes place. They are also able to save on memory requirements by using 1, 2 or 4 bits (vs. 8, 16, 32 or more bits) for model weights/activations (see the quantization sketch below).
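
A minimal sketch of why 4-bit weights shrink a model roughly 8x vs. float32, using plain uniform quantization (AKIDA’s actual CNN2SNN conversion is surely more sophisticated than this):

import numpy as np

# Plain uniform quantization of weights to 4 bits: illustrates the memory
# saving, not BrainChip's actual conversion method.

def quantize(weights: np.ndarray, bits: int = 4):
    levels = 2 ** bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / levels
    codes = np.round((weights - w_min) / scale).astype(np.uint8)  # values 0..15
    return codes, scale, w_min

def dequantize(codes, scale, w_min):
    return codes * scale + w_min

w = np.random.default_rng(1).normal(size=1000).astype(np.float32)
codes, scale, w_min = quantize(w, bits=4)
w_hat = dequantize(codes, scale, w_min)
print("max error:", np.abs(w - w_hat).max())           # bounded by ~scale/2
print("bytes fp32:", w.nbytes, "vs 4-bit:", len(codes) // 2)  # two codes pack per byte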

Their hardware seems to run spiking neural networks (SNNs, see our blog post on another chip technology using SNNs). In their SDK, they have a CNN2SNN tool that can take any (TensorFlow) trained CNN model and convert it to an SNN, which can then run on their AKIDA technology.

They also have an AKIDA Model Zoo with a handful of pre-trained CNN type models that have already been converted to run on their technology, and they provide a tutorial on the technology. Mankar said that if you understand how to use TensorFlow Keras today to construct and train your models, it shouldn’t be too hard to understand how to use their tools to do what you want.

Their chip hardware is available today on a separate PCIe card, an M.2 form factor card, or as a chip. Finally, they also license their AKIDA IP to other chip designers.

AKIDA AKD 1000 performance

At AIFD1, Mankar showed statistics on the performance and accuracy attained using their chip vs. standard 32 bit floating point CNN implementations.

As discussed above, their processor uses 1-4 bits for weight quantization and as such loses some accuracy. But as you can see, it’s a matter of one to a few percent vs. the same models using a 32 bit floating point CNN implementation.

Because of their smaller weights, AKIDA uses less memory and less bandwidth to update models vs. models using larger weights.

As shown in the chart, the memory required for the 8-bit deep learning algorithm (DLA) versions of the models was all significantly larger than the memory requirements for the AKIDA solution. For one algorithm, AKIDA required only ~1/2 the memory size of the 8-bit DLA version of the model.

Mankar also provided information on the number of calculations required per inference using AKIDA vs. 8-bit DLAs.

Just to set the stage, MMACs/Inference is the (matrix) multiply-accumulate operations required to perform a single inference with the selected CNN model. ImageNet (1000), ImageNette (20) and Visual Wake Word models are all standard CNN models that have been pre-trained on vast repositories of data and can run in many hardware environments. The non-AKIDA solutions above were all running an 8-bit DLA CNN model. Activity regularization is a method of reducing the learning rate and weights used during training, shrinking the weight changes during training to reduce model overfit.
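
For the curious, here’s how MACs per inference are typically counted for a single convolutional layer; the layer dimensions below are just an example, not any of the benchmarked models:

# Counting multiply-accumulates (MACs) for one conv layer: every output
# position does (kernel_h * kernel_w * in_channels) MACs per output channel.

def conv_macs(out_h, out_w, out_ch, k_h, k_w, in_ch):
    return out_h * out_w * out_ch * (k_h * k_w * in_ch)

# e.g. a 3x3 conv, 64 -> 64 channels, on a 56x56 feature map:
macs = conv_macs(56, 56, 64, 3, 3, 64)
print(f"{macs / 1e6:.0f} MMACs for this one layer")  # ~116 MMACs

# Summing this over every layer of a CNN gives an MMACs/inference figure
# like those compared in the chart.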

He also showed some comparisons of their technology vs. Intel’s Loihi hardware. Loihi is another neuromorphic chip, whose original introduction prompted me to write the “Are neuromorphic chips a deadend” post (link above). Unfortunately, I didn’t capture any of these charts, but from my recollection, they showed that AKIDA technology used slightly less power than Loihi technology in all their comparisons.

AKIDA technology demo

In their live, on camera demo, they used a previously downloaded VGG16 (if I recall correctly) trained CNN model. Offline, they had replaced the last classification layer with a (blank, untrained) dense network, converted this to an SNN, and downloaded it onto one of their boards. They had developed an application that used this board with a camera to perform more CNN training or CNN image inferencing (classification).

They first (one-shot) trained their board’s model to recognize the background of what the camera was seeing, and then proceeded to perform (one-shot) trainings to classify toys of tigers, elephants and cars. All of this was completed in real time during the demo. They were able to verify the training took, using pictures of tigers, elephants and cars, and to classify all the toys in different orientations as well as a different toy car.

The AIFD1 crowd (a tough one) said they had seen all this before, but would be really interested to see if the chip could distinguish between different cars (one a toy race car and the other a toy police car). On camera, they were able to re-train their CNN to distinguish between (toy) car 1 and car 2 and classify properly between the two of them. They had one or two instances where the CNN model was confused, but they were able to re-train it to recognize the toy car and place it into the correct classification (using two-shot[?] learning).

At AIFD1, Mankar also presented detailed, real world data on how they were able to perform keyword spotting, person detection, E-nose classification, E-tongue classification, and auditory (E-ear?) classification in embedded sensor systems.

AKIDA technology limitations

At the moment, their chip doesn’t support neural networks that use memory, such as LSTMs or RNNs, but it seems to work fine for any CNN, as was shown multiple times in the data they presented and in their demo.

We were really impressed with their software stack, liked what we saw of their hardware/IP, and enjoyed their demo and its one-shot learning. Check out their videos (link above) for more information on them.

Photo Credit(s): all charts are from BrainChip Inc’s website or were presented at their AIFD1 session

Open source digital assistant

I’ve come by and purchased a number of digital assistants over the last couple of years from both Google and Amazon but not Apple. At first their novelty drove me to take advantage of them to do a number of things. But over time I started to only use them for music playing or jokes. But then I started to hear about some other concerns with the technology.

The problems with today’s vendor based, digital assistants

My (and others’) main concern was their ability to listen in on conversations in the home and workplace without being queried. Yes, there are controls on some of them to turn off the mic and thus any recordings. But these are not hardwired switches; as software, they may or may not work depending on the implementation. As such, there is no guarantee that they won’t still be recording audio feeds even with their mic (supposedly) turned off.

At one point I saw a news article where police had subpoenaed recordings of a digital assistant to use in a criminal case. Now I’m ok with use of this for specific, court approved criminal cases, but what’s to limit its use to such cases? And not all courts, or governments for that matter, are as protective of personal privacy as some.

Open source digital assistant on the way

But an open source version of a digital assistant, one where the user has complete programmatic control over its recording and use of audio data, is another matter. I suppose this doesn’t necessarily help the technically challenged among us who can’t program our way out of a paper bag. But even for those individuals, the fact that an open source version exists to protect privacy could be construed as something much more secure than a company or vendor’s product.

All that made it very interesting when I recently saw an article about a project put together at Stanford, an “Open source challenger to popular virtual assistants”.

How to create an open source digital assistant

The main problem facing an open source digital assistant is the need for massive amounts of annotated training request data. This is one of the main reasons that commercial digital assistants often record conversations when not specifically requested.

But Stanford University, which is responsible for creating the open source digital assistant above, has managed to design and create a “rules based” system to generate all the training data needed for a virtual assistant.

All this automatically generated training data can then be used to train the digital assistant’s natural language processing neural network to understand what’s being asked and drive whatever action is being requested.

At the moment the digital assistant (and its conversation generator) has somewhat limited skills, or rather only works in a restricted set of domains such as restaurants, people, movies, books and music. For example, “identify a restaurant near me that has deep dish pizza and is rated greater than 4 on a 5 point scale”, “find me a mystery novel that is about magic”, or “who was the 22nd president of the USA”.

But as the digital assistant and its annotated, rules based conversation generator are both open source, anyone can contribute more skills code or add more conversational capabilities. Over time, if there’s enough participation, it could perhaps someday perform all of the skills or capabilities of commercial digital assistants.

Introducing Almond and Stanford’s OVAL

Stanford’s work on this project comes out of their OVAL (Open Virtual Assistant Lab). Their open source virtual assistant is called Almond.

Almond’s verbal generator is called Genie and uses compositional technology to generate conversations that are used to train their linguistic user interface (LUInet). Almond also uses ThingTalk a new declaritive program language to process responses to queries and requests. Finally, Almond makes use of Thingpedia, a repository of information about internet services and IoT devices to tell it how to interact with these systems.

Stanford Genie technology

The technology behind Genie is based on using source text statements to create templates that can generate sentences for any domain you wish to have Almond work in. If you are interested in expanding the Almond domains, you can create your own templates using the Genie toolkit.

One essentially provides a small set of input sentences that are converted into templates and used by Genie to understand how to parse all similar sentences. This enables Almond to “understand” what’s being requested of it (a toy illustration of the idea appears below).
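
A toy illustration of the template idea (my own simplification; Genie’s real template language is far richer, see their paper and GitHub toolkit):

import itertools

# Toy template-driven training data generation: expand every template
# against every slot combination to synthesize training sentences for a
# natural language parser. Templates and slot values are invented examples.

templates = [
    "find me a {cuisine} restaurant near {place}",
    "is there a {cuisine} restaurant in {place}",
]
slots = {
    "cuisine": ["pizza", "thai", "vegan"],
    "place": ["downtown", "the airport"],
}

for tmpl in templates:
    for cuisine, place in itertools.product(slots["cuisine"], slots["place"]):
        print(tmpl.format(cuisine=cuisine, place=place))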

The set of input sentences can start small and be augmented over time to handle more diverse or complex queries or requests. Their GitHub toolkit and Genie technology are described more fully in the paper Genie: A generator of natural language semantic parsers for virtual assistant commands.

Stanford ThingTalk declarative language

ThingTalk is the programming language used to control what Almond can do with requests and queries. Essentially it’s a multi-part statement about what to do when a request comes along. The main parts of a ThingTalk statement include:

  1. When a particular action is supposed to be triggered.
  2. What service the request needs in order to perform its action.
  3. What action is requested.

The “what service the request needs” part is based on open API calls (see Thingpedia below). The “what action is requested” part can either use standard Almond actions or invoke other Thingpedia open source API calls, such as create a tweet, post on FB, send email, etc.

For example, a ThingTalk statement looks like:

monitor @com.foxnews.get() => @com.slack.send();

This monitors Fox News for any new news articles and sends them (the links) to your Slack channel.

Stanford Thingpedia

Thingpedia is an open source repository of structured information available on the web and of API services available on the web. Structured information or data is the information behind calendars, contact databases, article repositories, etc., any of which can be queried for information, and some of which can be updated or have actions performed on them. API services are the way those queries and actions are performed.

[Figure: One page of the Thingpedia multi-page summary of services offered.]

The Thingpedia web page shows a number of services that already have open source APIs defined and registered, for example Twitter, Facebook, Bing search, BBC news, Gmail and a host of other services. More are being added all the time, and these represent the domains that Almond can act upon.

Some of these domains are more defined than others. But in any case, any service that takes the form of a web based API can be added to Thingpedia.

Thingpedia as a standalone open source repository is valuable in and of itself, regardless of its use by Almond. But Almond would be impossible without Thingpedia. Thingpedia wants to be the Wikipedia of APIs.

Almond, putting it all together

Almond consists mainly of the Almond Agent, Engine and Thingpedia. The Agent is used by the various Almond implementations to parse and understand a request and access the corresponding ThingTalk program statement. The Almond Agent uses its LUInet natural language interpreter to interpret the request and select the ThingTalk program for it. Once the ThingTalk program is identified, Almond uses the various Thingpedia APIs named in the ThingTalk statement to generate the proper API calls to the service being requested and to generate any output requested.

Where can you run Almond

Almond is currently available as a web app, an Android app, a Gnome (Linux) desktop/laptop app, or a CLI application, and can be run on your Mac or Windows computer. You could of course create your own smart speaker to run Almond, or perhaps hack a current smart speaker to do so.

One important consideration is that with the Android app, all your data and credentials are stored only on the phone, and will not go out into the cloud or elsewhere. I didn’t see similar statements about privacy protections for the web app or any of the other deployments. But as Almond is open source, you potentially have much greater control over where your data resides.

~~~~

What I would really like is a smart speaker app running on an RPi with a microphone and a decent speaker attached, all in the package of a cube or cylinder.

I thought their videos on Almond were pretty cheesy, but the technology is very interesting and could potentially make for an interesting competitor to today’s smart speakers.

Photo Credit(s):

All photos and graphics from Stanford Almond and OVAL Lab websites.

AI ML DL hardware performance results from MLPerf

Read an article a couple of weeks back from IEEE Spectrum, New Records for AI Training, which discussed the recent MLPerf v0.7 performance results. The article mentioned that performance on the MLPerf benchmarks has increased by ~2.7X in the last year alone.

The MLPerf organization was started back in 2018 to supply machine learning workload performance results, somewhat like what SPEC and TPC did for NFS and transaction processing. The MLPerf organization documented their philosophy in a paper.

As far as I can tell, MLPerf is the only benchmark currently available to show hardware system performance on AI training and inferencing. Below we report on MLPerf training results.

MLPerf also reports on both closed and open division benchmark results. Closed division submissions all use the same software algorithms for each workload, so one can compare workload performance across different hardware systems. Open division results can make use of any algorithm to achieve the desired results on the problem set. We report on MLPerf closed division results below.

Current MLPerf v0.7 (open and closed division) training results are available online (on GitHub) and are summarized in a training results page on their web site.

MLPerf v0.7 workload changes

The MLPerf team added a few new workloads and upped the game of another benchmark for v0.7:

  • Recommendation DLRM: a replacement for what was used in MLPerf v0.6, from Facebook, providing more parallelism in training for recommendations.
  • Wikipedia BERT: an addition to what was used in MLPerf v0.6; a new natural language processing (NLP) frontend, trained on Wikipedia, which is used with other language processing capabilities.
  • Go MiniGo: an enhancement of the MLPerf v0.6 MiniGo accuracy requirements, using reinforcement learning to learn to play Go. For v0.7, they now use a full sized, 19X19 Go board and upped the win rate requirement to 50%.

MiniGo Results

A couple of items of note for the MiniGo results. There are essentially 3 different architectures represented: NVIDIA DGX series (DGX A100, DGX-2H, DGX-1), Google TPUs (v4 and v3) and Intel (8 server nodes with Cooper Lake-6 CPUs).

Google TPUs are considered internal and are only available to Google, its hardware partners, or on GCP. Although MLPerf includes GCP TPU system results for other workloads, there were none submitted for MiniGo.

The Intel system is a preview of their latest gen Cooper Lake chips, which may not be commercially available yet. On the other hand, all NVIDIA systems are commercially available and can be deployed in your data center today.

As one can see above, NVIDIA systems swept the first 3 positions on our Top 10 MiniGo chart. A DGX A100 came in at #1, reaching a 50% win rate at MiniGo in a mere 17 seconds using 448 CPUs and 1792 A100 GPUs. Coming in at #2, at 30 seconds, was another DGX A100 using 64 CPUs and 256 A100 GPUs. And at #3, at 35 seconds, was a DGX-2H using 64 CPUs and 512 V100 GPUs.

Next, at #4 at 151 seconds, was a Google TPU system with 64 TPUv4 accelerators (unclear how many CPUs, if any, are used; results show 0). Note, the 8-node Intel server with 32 of the latest gen Cooper Lake (-6) CPUs (4/node) came in at #7, taking 409 seconds to achieve the training results.

There are 6 other MLPerf workloads, including DLRM and BERT mentioned above. Each of these deserves its own discussion of top ten results. Alas, they will need to wait for another time; I will cover them in future posts.

~~~~

Nowadays, with much of IT turning to AI ML DL to provide critical services, it’s more important than ever to understand what can and can’t be done with available hardware. The fact that one can train a model to play decent Go in 17 seconds on a large DGX A100 cluster, and in under 7 minutes on an 8-node, leading edge Intel server cluster, is pretty impressive.

Despite MLPerf’s best efforts, it’s still tough to compare ML performance across systems when there’s so much diversity in the underlying hardware, especially in GPU, TPU and CPU counts. IMHO, it would be very useful to have a single GPU , TPU or CPU system submission requirement for each workload. That way one could compare how well each hardware element can perform the workload in isolation.

Nonetheless, the MLPerf suite of benchmarks provides a great first step in understanding what today’s hardware can accomplish in ML training (and inferencing).

Comments?

Hybrid digital training-analog inferencing AI

Read an article from IBM Research, Iso-accuracy DL inferencing with in-memory computing, the other day, which referred to an article in Nature, Accurate DNN inferencing using computational PCM (phase change memory, a memristive technology), discussing a hybrid digital-analog computational approach to DNN (deep neural network) training-inferencing AI systems. It’s important to note that the PCM device is both a storage device and a computational device, performing two functions in one circuit.

In the past, we have seen PCM circuitry used in neuromorphic AI. The use of PCM here is not that (see our Are neuromorphic chips a deadend? post).

Hybrid digital-analog AI has the potential to be more energy efficient and use a smaller footprint than digital AI alone. Presumably, the new approach is focused on edge devices for IoT and other energy or space limited AI deployments.

What’s different in hybrid digital-analog AI

As researchers began examining the use of analog circuitry in AI deployments, the nature of analog technology led to inaccuracy and underperformance in DNN inferencing. This was because of the “non-idealities” of analog circuitry. In other words, analog electronics has intrinsic characteristics that make it difficult to model digital logic, and digital exactitude is difficult to implement precisely in analog circuitry.

The caption for Figure 1 in the article runs to great length, but to summarize: (a) is the DNN model for an image classification DNN, with fewer inputs and outputs so that it can ultimately fit on a 512×512 PCM array; (b) shows how noise is injected during the forward propagation phase of DNN training, and how the DNN weights are flattened into a 2D matrix and programmed into the PCM device using differential conductance, with additional normalization circuitry.

As a result, the researchers had to come up with some slight modifications to the typical DNN training and inferencing process to improve analog PCM inferencing. Those changes involve:

  • Injecting noise during DNN training, so that the resultant DNN model becomes more noise resistant;
  • Flattening the resultant DNN model from 3D to 2D, so that neural network node weights can be implemented as differential conductance in the analog PCM circuitry; and
  • Normalizing each internal DNN layer’s outputs before input to the next layer in the model.

Analog devices are intrinsically noisier than digital devices, so DNN noise sensitivity had to be reduced. During normal DNN training there is both a forward pass of inputs to generate outputs and a backward propagation pass (to adjust node weights) to fit the model to the required outputs. The researchers found that by injecting noise during the forward pass, they were able to create a more noise resistant DNN.
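
A minimal sketch of the noise injection idea: perturb the weights with Gaussian noise on the forward pass only, so the trained model tolerates analog imprecision. The noise scale here is an arbitrary stand-in for measured PCM noise:

import numpy as np

# Noise-injected forward pass for one linear+ReLU layer. During training,
# weights are perturbed per pass; at inference (or in the backward pass
# setup), the clean weights are used. The 0.02 noise scale is illustrative.

def forward(x, W, training=False, noise_std=0.02):
    W_eff = W
    if training:
        # noise proportional to the largest weight magnitude, fresh each pass
        W_eff = W + np.random.normal(0, noise_std * np.abs(W).max(), W.shape)
    return np.maximum(0.0, x @ W_eff)   # linear layer + ReLU

x = np.random.default_rng(2).normal(size=(8, 16))
W = np.random.default_rng(3).normal(size=(16, 4))
out_clean = forward(x, W, training=False)
out_noisy = forward(x, W, training=True)
print(np.abs(out_clean - out_noisy).mean())  # small but nonzero perturbation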

Differential conductance uses the difference between the conductance of two circuits. So a single node weight is mapped to two different circuit conductance values in the PCM device. By using differential conductance, the PCM device’s inherent noisiness has less impact on DNN node propagation.
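
A sketch of the differential mapping, with illustrative values rather than real device conductances:

import numpy as np

# One signed weight w is stored as the difference of two non-negative
# conductances, w ~ (G_plus - G_minus). Noise or drift common to both
# paired cells partially cancels when the circuit senses the difference.

def to_differential(w: np.ndarray):
    G_plus = np.maximum(w, 0.0)    # positive part on one PCM cell
    G_minus = np.maximum(-w, 0.0)  # negative part on the paired cell
    return G_plus, G_minus

def read_weight(G_plus, G_minus):
    return G_plus - G_minus        # the sensed difference recovers w

w = np.array([0.7, -0.3, 0.0, -1.2])
Gp, Gm = to_differential(w)
print(read_weight(Gp, Gm))         # recovers [ 0.7 -0.3  0.  -1.2]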

In addition, each layer’s outputs are normalized via additional circuitry before being used as input to the next layer in the model. This has the effect of counteracting PCM circuitry drift over time (see below).

Hybrid AI results

The researchers modeled their new approach and also performed some physical testing of a digital-analog DNN, using CIFAR-10 image data and the ResNet-32 DNN model. The process began with an already trained DNN, which was then retrained while injecting noise during forward pass processing. The resultant DNN was then modeled and programmed into a PCM circuit for implementation testing.

Part D of Figure 4 shows three results: Baseline, which represents a completely digital implementation using FP32 multiplication logic; Experiment, which represents the actual use of the PCM device with a global drift calibration performed on each layer before inferencing; and Model, which represents their digital model of the PCM device and its expected accuracy. The blue band is one standard deviation on the modeled result.

One challenge with any memristive device is that over time its functionality can drift. The researchers implemented global drift calibration or normalization circuitry to counteract this. One can see evidence of drift in the experimental results between ~20 and ~60 seconds into testing. During this interval, PCM inferencing accuracy dropped from 93.8% to 93.2%, but it then stayed there for the remainder of the experiment (~28 hrs). The baseline noted in the chart used digital FP32 arithmetic for inferencing and achieved ~93.9% for the duration of the test.

It’s certainly not as accurate as the baseline all digital implementation, but implementing the DNN inferencing model in PCM and only losing 0.7% accuracy seems more than offset by the clear gain in energy and footprint reduction.

While the simplistic global drift calibration (GDC) worked fairly well during testing, the researchers also developed an adaptive approach (adaptive batch normalization statistics, or AdaBS), using a calibration image set (from the training data): at idle times, these images are fed through the PCM device to calculate an average error, which is then used to adjust the PCM circuitry. As modeled and tested, the AdaBS approach increased accuracy and retained it (at least in modeling) over longer time frames.

The researchers were also able to show that implementing part of the DNN model (the first and last layers) in digital FP32 and the rest in PCM improved inferencing accuracy even more.

~~~~

As shown above, a hybrid digital-analog PCM AI deployment can provide similar accuracy (at least for CIFAR-10/ResNet-32 image recognition) to an all digital DNN model, while the efficiencies of the PCM analog circuitry allow for a more energy efficient DNN deployment.
