LLM exhibits Theory of Mind

Ran across an interesting article today (thank you John Grant/MLOps.community slack channel), titled Theory of Mind may have spontaneously emerged in Large Language Models, by M. Kosinski from Stanford. The researcher tested various large language models (LLMs) on psychological tests to determine the level of theory of mind (ToM) the models had achieved.

Earlier OpenAI models (GPT-1, GPT-2 and the original GPT-3) showed almost no ToM capabilities, but the latest version, GPT-3.5, shows ToM equivalent to that of 8 to 9 year olds.

Theory of Mind

According to Wikipedia (Theory Of Mind article), ToM is “…the capacity to understand other people by ascribing mental states to them (that is, surmising what is happening in their mind).” This seems to be one of the ways people understand one another.

For instance, if I can somehow guess what you are thinking about a topic, situation, or event, I can hopefully communicate with you better than if I can’t. At least that’s the psychological perspective.

The belief is that people with Asperger’s, ADHD, schizophrenia, and other conditions show ToM deficits when compared to neurotypical people. As a result, over time, psychologists have developed tests to measure a person’s ToM.

These tests typically put two people in a situation, with props and other cues that indicate what one person knows or believes, and ask the subject what they think the other person is thinking. A person’s ToM is then graded against the expected answers.

ToM and LLM

The researcher took these tests, with their people, props and situations, and converted them into text that mimics the situations used in ToM testing: a textual story or pretext, plus a set of text prompts about the situation which the LLM was to complete.

For example, one pretext or story is the following:

“Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn.’ Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label.”

This is fed into the LLM and then a prompt is provided, such as:

“She opens the bag and looks inside. She can clearly see that it is full of _________”

If the LLM has ToM then, based on the pretext and prompt, it will say “popcorn”. But the LLM also provides a follow-on sentence or two which describes the situation as the LLM understands it:

popcorn. Sam is confused. She wonders why the label says “chocolate” when the bag is clearly filled with popcorn. She looks around to see if there is any other information about the bag. She finds nothing. She decides to take the bag to the store where she bought it and ask for an explanation.

The completion above (everything after the prompt) is generated by a ToM-capable LLM. The researcher also showed the probability the LLM assigned to the first word of the completion. In the case above, it showed [P(popcorn) = 100%; P(chocolate) = 0%].

They also used different prompts with the same story to see if the LLM truly shows ToM, for instance, “She believes the bag is full of ___________” and “She’s delighted finding the bag, she loves eating _______”. This provides a sort of test of the LLM’s comprehension of the situation.
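To make the probing concrete, here is a minimal Python sketch of how one might run this kind of probe against an LLM API that exposes next-token probabilities. The query_token_probs() helper is hypothetical (the post doesn’t describe the paper’s actual tooling), so plug in whatever completion API you have access to.

```python
# Hypothetical sketch of the unexpected-contents probe described above.
# query_token_probs() is a stand-in for an LLM call that returns
# next-token probabilities; the paper's actual tooling is not described here.

STORY = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "Yet, the label on the bag says 'chocolate' and not 'popcorn.' Sam finds the bag. "
    "She had never seen the bag before. She cannot see what is inside the bag. "
    "She reads the label."
)

PROMPTS = [
    "She opens the bag and looks inside. She can clearly see that it is full of",
    "She believes the bag is full of",
    "She's delighted finding the bag, she loves eating",
]

def query_token_probs(text: str) -> dict:
    """Stand-in for an LLM API call returning {token: probability} for the next token."""
    raise NotImplementedError("plug in your LLM completion API here")

for prompt in PROMPTS:
    probs = query_token_probs(f"{STORY} {prompt}")
    # Compare the probability assigned to the true contents vs. the label.
    # (Real tokenizers may prepend a space to these tokens.)
    print(prompt)
    print("  P(popcorn) =", probs.get("popcorn", 0.0),
          "  P(chocolate) =", probs.get("chocolate", 0.0))
```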

The researcher controlled for word frequency using reversals of the key words in the story, i.e., the bag holds chocolate but the label says popcorn. They also generated scrambled versions of the story, replacing occurrences of “chocolate” and “popcorn” with either word at random, and they reset the model between each case. In the paper they show the LLMs’ success rate across 10,000 scrambled versions, some of which happened to still be correct.
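Assuming my reading of those controls is right (the paper’s exact replacement rules may differ), generating the reversed and scrambled story variants could look something like this sketch:

```python
import random

STORY = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "Yet, the label on the bag says 'chocolate' and not 'popcorn.' Sam finds the bag. "
    "She had never seen the bag before. She cannot see what is inside the bag. "
    "She reads the label."
)

def reversed_story(story: str) -> str:
    """Swap the two key words, so the bag holds chocolate but is labeled popcorn."""
    return (story.replace("popcorn", "\x00")
                 .replace("chocolate", "popcorn")
                 .replace("\x00", "chocolate"))

def scrambled_story(story: str, rng: random.Random) -> str:
    """Replace each key word with one of the two chosen at random, breaking the
    story's logic while keeping word frequencies roughly the same."""
    out = []
    for word in story.split(" "):
        if "popcorn" in word or "chocolate" in word:
            choice = rng.choice(["popcorn", "chocolate"])
            word = word.replace("popcorn", choice).replace("chocolate", choice)
        out.append(word)
    return " ".join(out)

rng = random.Random(0)
print(reversed_story(STORY))
print(scrambled_story(STORY, rng))
```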

They labeled the above series of tests as “Unexpected content tasks“. But they also included another type of ToM test which they labeled “Unexpected transfer tasks“.

Unexpected transfer tasks involve a story where person A watches person B put a pet in a basket; person B then leaves, and person A moves the pet. The LLM is prompted to see if it understands where the pet actually is and how person B will react when they get back.

In the end, after controlling statistically, as much as possible, for story and prompt effects, the researcher ended up creating 20 unique stories and presenting their prompts to the LLMs.

Results of their ToM testing on a select set of LLMs look like:

As can be seen from the graphic, the latest version of GPT-3.5 (davinci-003 with 175B parameters) achieved something like an 8 year old’s performance on Unexpected Contents Tasks and a 9 year old’s on Unexpected Transfer Tasks.

The researcher also showed other charts that tracked the LLM’s probabilities for (in the first story above, for example) the bag’s contents and Sam’s belief about them, measured after every sentence of the story.

Not sure why this is important but it does show how the LLM interprets the story. Unclear how they got these internal probabilities but maybe they used the prompts at various points in the story.

The paper shows that, according to this testing, GPT-3.5 davinci-003 clearly provides the ToM level of an 8-9 year old on ToM tasks that have been translated into text.

The paper says they created 20 stories and 6 prompts, which they reversed and scrambled. But 20 tales seems like a small sample, even with the reversals and randomization. And yet, there’s clearly a growing level of ToM in the models as they get more sophisticated or change over time.

Psychology has come up with many tests to ascertain whether a person is “normal or not”. Wikipedia (Psychological testing article) lists over 13 classes of psychological tests which include intelligence, personality, aptitude, etc.

Now that LLMs seem to have mastered textual input and output generation, it would be worthwhile to translate all psychological tests into text and try them out on the various LLMs, to track where they are today and how they have trended over time.

I could see at some point using something akin to multiple psychological test scores as a way to grade LLMs over time.

So today’s GPT-3.5 has the ToM of an 8-9 year old. It will be very interesting to see what GPT-4 does on similar testing.

Comments?


NVIDIA H100 vs. A100 GPUs in MLPERF Training

NVIDIA recently released some “Preview” results for MLPerf Data Center Training v2.1 (most recent results as of 28 Nov 2022) benchmarks. We analyzed these results to determine how much faster the H100 was vs. their A100 GPU.

Note, NVIDIA submitted 3 series of Preview benchmarks using H100-SXM5-80GB GPUs for training, which included an 8 GPU system, a 24 GPU system, and a 32 GPU DGXH100 system.

We have previously reported similar analysis for MLPerf Inferencing results (see: NVIDIA’s H100 vs A100… blog post).

From NVIDIA H100 Announcement Information

In their announcement, NVIDIA showed anywhere from a 3-6X TFLOPS speedup along with much higher throughput. MLPerf currently doesn’t report the FP precision used to perform the benchmarks, but MLPerf’s ArXiv paper suggests they are using FP32, which we assume is equivalent to TF32 in the above chart, so the H100 should, on average, be performing 3X faster.

Actual or normalized results for comparisons

Of the eight MLPerf v2.1 Data Center Training workloads, it appears that the H100 actual results are faster than the A100 GPUs in 5 of the benchmarks and slower in the remaining 3, Speech Recognition (LibriSpeech RNN-T), Recommendation Engine (1TB Clickthrough DLRM) and Reinforcement Learning (MiniGo).

The challenge with using the actual results or absolute minutes to train from the benchmarks is that submission results aren’t all using the same hardware configurations.

For example, in the Speech Recognition benchmark results, the current best training time (2.1 minutes) was achieved by NVIDIA DGXA100 systems with 384 (64 core AMD 7742) CPUs and 1536 (A100-SXM4-80GB) GPUs. Meanwhile, the nearest H100 Preview submission, which would have come in 4th in absolute time to train (7.5 minutes), used 8 (56 core Intel Xeon) CPUs with 32 (H100-SXM5-80GB) GPUs.

So, in order to present an apples-to-apples comparison, in the charts below we show both the actual minutes to train for each system and a GPU-count-normalized time to train, which we calculated to match the GPU count of the nearest H100 Preview submission (see the sketch after the caveats below).

A couple of caveats with using normalized numbers:

  • Normalization to 8 or 32 GPUs assumes the systems in question would have absolutely linear performance scaling, both up (for actual results with fewer GPUs) and down (for actual results with more GPUs)
  • Normalization to 8 or 32 GPUs doesn’t factor in the differences in CPU counts, core counts per CPU or CPU power. And in fact in the H100 previews, NVIDIA (or MLPerf) did not provide a CPU model number but in their detailed information they did list the Intel Xeon core count as 56.
  • Normalization to 8 or 32 GPUs doesn’t factor in any other speedups, like throughput, dedicated AI hardware or other system performance characteristics, that are available on the newer (DGX H100) systems.
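For reference, the normalization arithmetic itself is trivial; here is a minimal sketch of it, embodying the linear-scaling assumption from the first caveat and using the Speech Recognition numbers quoted earlier:

```python
def normalize_train_time(actual_minutes: float, actual_gpus: int, target_gpus: int) -> float:
    """Scale a reported training time to a target GPU count, assuming perfectly
    linear scaling (the caveat discussed above)."""
    return actual_minutes * actual_gpus / target_gpus

# The #1 DGXA100 Speech Recognition result used 1536 GPUs and trained in ~2.1 minutes;
# normalized to the H100 Preview's 32 GPUs, that becomes ~100.8 minutes.
dgxa100_at_32 = normalize_train_time(2.1, 1536, 32)
h100_preview = 7.5   # reported with 32 GPUs, so no scaling needed
print(f"DGXA100 normalized to 32 GPUs: {dgxa100_at_32:.1f} min vs H100 Preview: {h100_preview:.1f} min")
```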

However, with respect to GPU and CPU core counts, there were four benchmarks (Speech Recognition, NLP, Object Detection-lightweight, and Recommendation engine) which have submissions that come close to the GPU and CPU hardware counts that were used for the H100 Previews.

For the three benchmarks comparing against the H100 submission with 32 GPUs, the comparison system was an HPE ProLiant system with 8 AMD EPYC 7763 (64-core) CPUs and 32 A100-SXM4-80GB GPUs. And for the one benchmark comparing against the H100 submission with 8 GPUs, the comparison system was an NVIDIA DGXA100 system with 2 AMD EPYC 7742 (64 core) CPUs and 8 A100-SXM4-80GB GPUs.

Note, the HPE A100 systems still had more CPU cores, 64 more for the 32 GPU comparisons and the NVIDIA DGXA100 had 16 more CPU cores for the lone 8 GPU comparison.

So, our comparisons are still not perfect and, if anything, should show the H100 in its worst light, since it doesn’t have as much CPU compute power behind it. On the other hand, the DGXH100 system and the H100 GPU have a lot more bandwidth, and the H100 GPU has additional specialized logic dedicated to AI operations. There’s no telling how much these other hardware differences matter to the various MLPerf training workloads. But these comparisons are as close as the data allows.

The comparisons

First up Speech Recognition:

Lower is better in training time results (the metric is minutes to train the NN to a target level of accuracy). The results on this chart are sorted by the 32 GPU normalized training times. The actual published results are shown in blue and the 32 GPU normalized results in orange.

As we can see here, even after normalizing all the other results, the H100 Preview still doesn’t come out on top (7.534 min vs. 7.487 min for the best normalized result), but it doesn’t lose by much. One can also see the current #1 for this benchmark in actual minutes to train in the last column(s): an NVIDIA DGXA100 system running 384 AMD EPYC 7742 (64 core) CPUs with 1536 A100-SXM4-80GB GPUs, which trained in around 2 minutes.

I’ve taken the liberty of showing, in light blue boxes, the best comparison system to the H100 Preview (DGXH100) results with 32 H100 GPUs, which was the HPE ProLiant result with 8 AMD EPYC 7763 (64-core) CPUs and 32 A100-SXM4-80GB GPUs. In this Speech Recognition benchmark the H100 GPU is 1.63X faster than the A100 GPU.

Next up Object Detection-Lightweight,

Similar to the above, smaller is better; the chart is sorted by the 32 GPU normalized results, with blue bars showing the actual reported results and orange bars the 32 GPU normalized results.

Here we can see that the H100 reported the best training time in both actual results and 32 GPU normalized results. As in the earlier chart, we show the best comparison we could find in blue boxes, and in this Object Detection-Lightweight benchmark the H100 is 3.80X faster than the A100.

Bottom line

H100 GPU

We have analyzed the top ten results for all MLPerf data center training workloads, similar to what we show above. As discussed earlier, only four MLPerf workloads had hardware similar to the NVIDIA H100 Preview submissions: three compare well with the 32 GPU H100 submission and one with the 8 GPU H100 submission.

The numbers we calculate show that the H100 is 1.63X (Speech recognition), 3.80X (NLP), 1.97X (Object detection-lightweight) and 1.60X (Recommendation engine) faster than the A100, which would say the H100 is, on average, 2.25X faster than the A100 in MLPerf v2.1 Data Center Training results.
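For the record, the 2.25X figure is the arithmetic mean of those four per-benchmark speedups; a geometric mean, which is often used to summarize benchmark ratios, would come out closer to 2.1X:

```python
import math

speedups = {
    "Speech recognition": 1.63,
    "NLP": 3.80,
    "Object detection-lightweight": 1.97,
    "Recommendation engine": 1.60,
}
arith = sum(speedups.values()) / len(speedups)              # 2.25X, the figure quoted above
geo = math.prod(speedups.values()) ** (1 / len(speedups))   # ~2.10X
print(f"arithmetic mean: {arith:.2f}X, geometric mean: {geo:.2f}X")
```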

Realize the H100 results are “Preview”, so there may still be some software (or firmware) speedups that could be applied to improve these numbers. And “Released” hardware & firmware may differ substantially from the “Preview” hardware & firmware.

But given all that, it appears that the H100 is not as fast as announced (2.25X vs. 3X) in MLPerf training workloads, at least not yet. [Added after publishing, The Eds]

Photo Credit(s):

  • Screen shot of slides presented at GTC Spring 2022
  • Cropped version of above

NVIDIA Triton Giant Model Inference, a step too far

At GTC this week NVIDIA announced a new capability for their AI suite called Triton Giant Model Inference. This solution addresses the current and future problem of trying to perform inferencing with models whose parameters exceed what a single GPU card can hold.

During NVIDIA’s GTC show they presented a chart indicating that model parameter counts are on an exponential climb (just eyeballing it here, but roughly 10X every year since 2018). Current models, like OpenAI’s GPT-3, have 175B parameters. Such a model would require ~350GB of GPU memory (at 2 bytes per parameter) to perform inferencing on the whole model.

The fact that NVIDIA’s A100 currently sports 80GB of GPU memory means that GPT-3 would need to be cut up, or partitioned, to run on NVIDIA GPUs. Hence the need (from NVIDIA’s perspective) for a mechanism that allows them to perform multi-GPU inferencing, i.e., their Triton Giant Model Inference (GMI) engine.


Why do we need GMI

It’s unclear what it takes to perform inferencing with a 175B parameter model today, but my guess is it involves a lot of manual work: splitting the model up into different layers/partitions, running those layers/partitions on separate GPUs, and gluing the output of one portion to the input of the next. Such activity would be a complex, manual undertaking and would inherently slow down model inferencing and add to inferencing latencies.
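To illustrate what that manual gluing might look like, here is a toy PyTorch sketch of pipeline-partitioning a model by hand across two GPUs. The layer sizes are made up and this is only my guess at the kind of work GMI automates; NVIDIA hasn’t described GMI’s internals.

```python
import torch
import torch.nn as nn

# Toy illustration of manually partitioning a model that won't fit on one GPU:
# the first half of the layers go on cuda:0, the second half on cuda:1, and the
# output of one partition is "glued" to the input of the next by copying the
# activation tensor between devices. Requires at least two CUDA devices.
class ManuallyPartitionedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part1 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))
        x = self.part1(x.to("cuda:1"))   # device-to-device copy adds latency
        return x

model = ManuallyPartitionedModel().eval()
with torch.no_grad():
    out = model(torch.randn(8, 1024))
print(out.shape, out.device)
```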

With Triton GMI, NVIDIA appears able to supply automated multi-GPU inferencing for models that exceed a single GPU’s memory. Whether such models can span (DGX) servers or not was not revealed, but even within a single DGX server there are four A100s, which provides an aggregate of 320GB of GPU memory. And of course, it’s very likely future Ampere GPUs will allow for more memory.

Why consider a step too far

Here’s my point: artificial general intelligence (AGI, reasoning at human levels and beyond) is coming sooner or later. My (and perhaps humanity’s) preference is to have this happen later rather than earlier. Hopefully, this will give us more time to understand how to design/engineer/control AGI so that it doesn’t harm humanity or the earth. (See my post on Existential event risk… for more information on the risks of superintelligence.)

One way to control or delay the emergence of AGI is to limit model size. Now NVIDIA, Google and others have already released capabilities that allow them to train models that exceed the size of one GPU.

Alas, the only thing left is to consider limiting the size of models that can be used to perform inferencing. I fear that Triton GMI pretty much opens the floodgates to inferencing with models of any size. This will enable more and more sophisticated AI/ML/DL models and will uncap model sizes in the near future.

Limiting inferencing model size would give us (humanity) a little more time to understand how to control AGI. But all this presupposes that any AGI will require more parameters than current DNN models. I think this is a safe assumption, but I’m no expert.

Will delaying NVIDIA Triton GMI really help

I was not briefed on the internals of GMI, but it possibly makes use of DGX NVLink and NVIDIA software to automatically partition a DNN and deploy it over the four A100 GPUs in a DGX.

NVIDIA is not the only organization working on advancing DNN training and inferencing capabilities. And it’s very likely that more than one of the others (Google, Facebook, AWS, etc.) has identified model size as a problem for inferencing and is working on its own solution. So delaying GMI will not be a long-term fix.

But maybe, if we could delay this capability from reaching the market for 2 to 5 years, it would have the follow-on effect of delaying the emergence of AGI.

Is this going to stop someone or some organization from achieving AGI? Probably not. Could it delay some person/organization/government from getting there? Maybe. Perhaps it will give humanity enough time to come up with other ways to control AGI. But I fear that the more technology moves on, the more our options for controlling AGI diminish.

Don’t get me wrong. I think AI, DL NN and NVIDIA (Google, DeepMind, Facebook and others) have done a great service to help mankind succeed over this next century. And I in no way wish to hold back this capability. And a “good” AGI has the potential to help everyone on this earth in more ways than I can imagine.

But achieving AGI is a step function, and once unleashed it may be difficult to control. Anything we can do today to a) delay the emergence of AGI and b) help to control it is, IMHO, worthy of consideration.

Comments?

Photo Credits:

  • from NVIDIA GTC Keynote by Jensen Huang, CEO
  • From Hackernoon article, Can Bitcoin AGI develops to benefit humanity

Phonons, the next big technology underpinning integrated circuits

Often science and industry advance by investigating phenomena that are side effects of something else we are trying to accomplish. Optical fibers have been in use for decades now and have always had a problem called Brillouin scattering, where light photons interact with the surrounding cladding and generate small vibrations, or packets of sound, called phonons (aka hypersound). This feedback causes light to disperse across the length of the fibre.

As a recent article I read in Science Daily (Wired for sound a third wave emerges in integrated circuits) describes it, the first wave of ICs was based on electronics and was developed after WW II, the second wave was based on photons and came about largely at the start of this century, and now the third wave is emerging based on sound, phonons.

The research team at the University of Sydney Nano Institute has published over 70 papers on Brillouin scattering, and Prof. Benjamin J. Eggleton recently published a summary of their research in a Nature Photonics paper (Brillouin integrated photonics, behind paywall). One can also download the deck he presented as a summary of the paper at an OSA Optoelectronics Technical Group webinar last year.

It appears as if Brillouin scattering technology is particularly useful for (microwave) photonics computing. In the Science Daily article, the professor says that the big advance here is in the control of light and sound over small distances. He goes on to say that “Brillouin scattering of light helps us measure material properties, transform how light and sound move through materials, cool down small objects, measure space, time and inertia, and even transport optical information.”

I believe that, from a photonics IC perspective, transforming how light, other electromagnetic radiation, and sound move through materials is the exciting part. New technology for measuring material properties, cooling down small objects, and measuring space, time and inertia is also of interest, but not as important in our view.

What’s a phonon

As discussed earlier, phonons are packets of sound vibration above 100MHz that come about from optical photons’ interaction with cladding. As photons bounce off the cladding they generate phonons within the material; such bouncing creates coupled optical and acoustic waves.

There’s been a lot of research, still ongoing, on how to create “Stimulated Brillouin Scattering” (SBS) on silicon CMOS devices, but lately researchers have found an effective hybrid (silicon, SiO2, & As2S3) recipe to generate SBS at will at chip scale.

What can you do with SBS phonons

Essentially SBS phonons can be used to measure, monitor, alter and increase the flow of electromagnetic (EM) waves in a substance or wave guide. I believe this can be light, microwaves, or just about anything on the EM spectrum. Nothing was mentioned about X-Rays, but it’s just another band of EM radiation.

With SBS, one can supply microwave filters, phase shifters and sources, recover carrier signal in coherent optical communications, store (or delay) light, create lasers and measure, at the sub-mm scale, optical material characteristics. Although the article discusses cooling down materials, I didn’t see anything in the deck describing this.

As SBS technologies are optical-acoustic devices, they are immune to EMI (electromagnetic interference) and EMPs (electromagnetic pulses), and they consume less energy than electronic circuits performing similar functions.

We’ve talked about photonic computing before (see our Photonic computing, seeing the light of day post). But to make photonics a real alternative to electronic computing they need a lot of optical management devices. We discussed a couple in the blog post mentioned above but SBS opens up another dimension of ways to control photonic data flow and processing.

It’s unclear why the research into SBS seems to come mostly out of Australian universities. However, their research is being (at least partially) funded by a number of US DoD entities.

It’s unclear whether SBS will ultimately be one of those innovations that, in the long run, enables a new generation of (photonic) IC technologies. But the team has shown that with SBS they can do a lot of useful work with optical/microwave transmission, storage and measurement.

It seems to me that to construct full photonic computing, we need an optical DRAM device. Storing light (with SBS) is a good first step, but any optical store/memory device needs to be randomly accessible, and store Kb, Mb or Gb of optical data, in chip size areas and persist (dynamic refreshing is ok).

The continued use of DRAM for this would make the devices susceptible to EMI, EMP and consume more energy. Maybe something could be done with an all optical 3DX that would suffice as a photonics memory device. Then it could be called Optical DC PM.

So, ICs with electronics, photonics and now phononics are in our future.

Comments?


Shedding light on all optical neural networks

Read a couple of articles in the past week or so on all optical neural networks (see All optical neural network (NN) closes performance gap with electronic NN and New design advances optical neural networks that compute at the speed of light using engineered matter).

All-optical NN solutions operate faster and use less energy for inferencing than standard all-electronic ones. However, in reality they are more of a hybrid solution, as they depend on standard ML/DL to train a NN. They then use 3D printing and other lithographic processes to create a series of diffraction layers for an all-optical NN that matches the trained NN.

The latest paper (see: Class-specific Differential Detection in Diffractive Optical Neural Networks Improves Inference Accuracy) describes a significant advance beyond the original solution (see: All-Optical Machine Learning Using Diffractive Deep Neural Networks, Ozcan’s original paper).

How (all optical) Diffractive Deep NNs (DDNNs) work for inferencing

In the original Ozcan discussion, a DDNN consists of a coherent light source (laser), an image, a bunch of refractive and reflective diffraction layers and photo detectors. Each neural network node is represented by a point (pixel?) on a diffractive layer. Node-to-node connections are represented by the light’s path moving through the diffractive layer(s).

In Ozcan’s paper, the light flowing through a diffraction layer is modified and passed on to the next diffraction layer. This passing of the light through the diffraction layer is equivalent to applying the trained NN’s mathematical weight (the node’s FP multiplier).

The challenge previously was that fabricating diffraction layers took a lot of hand work. But with the advent of 3D printing and other lithographic techniques, creating a diffraction layer is nowadays relatively easy to do.

In DDNN inferencing, one exposes the first diffraction layer to the input image data (via a coherent beam of light); that image is transformed into a different light pattern which is sent on to the next layer. Eventually the last diffraction layer converts the light hitting it into classification patterns, which are then detected by photo detectors. Alternatively, the classification pattern can be sent down an all-optical computational path (see our Photonic computing sees the light of day post and Photonic FPGAs on the horizon post) to perform some function.
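As a purely conceptual illustration (not the physics in Ozcan’s paper), the forward pass can be thought of as alternating element-wise complex “masks” (the fabricated layers) with a linear propagation step, ending in detector intensities. In the numpy sketch below, random masks and a random propagation matrix stand in for the trained, 3D-printed layers and real free-space diffraction:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 28 * 28          # flattened "pixels" of the optical field
NUM_LAYERS = 5       # the paper's fully connected D2NN used 5 diffractive layers
NUM_CLASSES = 10

# Each diffractive layer acts as an element-wise complex transmission mask
# (a phase shift per pixel); in reality these come from training + 3D printing.
phase_masks = [np.exp(1j * rng.uniform(0, 2 * np.pi, N)) for _ in range(NUM_LAYERS)]

# Free-space propagation between layers mixes the field linearly; here it is
# crudely stood in for by a fixed complex matrix.
propagation = (rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))) / np.sqrt(N)

# Each class reads out the total intensity over a patch of detector pixels.
detector_regions = np.array_split(np.arange(N), NUM_CLASSES)

def ddnn_inference(image: np.ndarray) -> int:
    field = image.astype(complex).ravel()        # coherent illumination of the input
    for mask in phase_masks:
        field = propagation @ (mask * field)     # modulate, then propagate to next layer
    intensities = [np.sum(np.abs(field[idx]) ** 2) for idx in detector_regions]
    return int(np.argmax(intensities))           # brightest detector = predicted class

print(ddnn_inference(rng.random((28, 28))))
```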

In the original paper, they showed results for a DDNN implementing a completely connected, 5 layer NN with 0.2M neurons and 8B connections in total. They also showed results from a sparsely connected, 5 layer NN with 0.45M neurons and <0.1B connections.

Note that there are significant power advantages in exposing an image to a series of diffraction gratings and detecting the classification with photo detectors, vs. an all-electronic NN which takes an image, uses photo detectors to convert it into an electrical (pixel series) signal, and then processes it through NN layers, performing FP arithmetic at each layer node until reaching the classification layer.

Furthermore, the DDNN operates at the speed of light. The all-electronic network operates at roughly FP arithmetic speed times the number of layers, and that is only if each layer can be done fully in parallel (with GPUs and 1000s of computational engines); if it can’t be done in parallel, one would need to multiply by the number of nodes in each layer as well. Let’s just say this is much slower than the speed of light.

Improving DDNN accuracy

The team at UCLA and elsewhere took on the task of improving DDNN accuracy by using more of the optical technology and techniques available to them.

In the new approach, they split the image’s optical data path to create positive and negative classifiers, and use a differential classifier engine as the last step to determine the image’s classification.

It turns out that the new DDNN performed much better than the original DDNN on standard MNIST, Fashion MNIST and another standard AI benchmark.

DDNN inferencing advantages, disadvantages and use cases

Besides the obvious power efficiencies and speed efficiencies of optical DDNN vs. electronic NNs for inferencing, there are a few other advantages:

  • All optical data paths are less noisy – In an electronic inferencing path, each transformation of an image to a pixel file will add some signal loss. In an all optical inferencing engine, this would be eliminated.
  • Smaller inferencing engine – In an electronic inferencing engine one needs CPUs, memory, GPUs, PCIe busses, networking and all the power and cooling to make it work. For an all-optical DDNN, one needs a laser, diffraction layers and a set of photo detectors. Yes, there’s some electronics involved, but not nearly as much as in an all-electronic NN. And an all-electronic NN with 0.5M nodes and 5 layers with 0.1B connections would take a lot of memory and compute to support. Their DDNN to perform this task took up about 9 cm (3.6″) square by ~3 to 5 cm (1.2″-2.0″) deep.

But there’s some problems with the technology.

  • No re-training or training support – there’s almost no way to re-train the optical DDNN without re-fabricating the DDNN diffraction layers. I suppose additional layers could be added on top of or below the existing ones, sort of like a corrective lens. Also, if there were some way to (chemically) develop diffraction layers during training steps, that could provide an all-optical DL data flow.
  • No support for non-optical classifications – there’s much more to ML DL NN functionality than optical classification. Perhaps if there were some way to transform non-optical data into optical images then DDNNs could have a broader applicability.

The technology could be very useful for any camera, lidar, sighting scope, telescope or satellite image classification activity. It could also potentially be used in a heads-up display to identify items of interest in the optical field.

It would also seem easy to adapt DDNN technology to classify analog sensor data as well. It might also lend itself to be used in space, at depth and other extreme environments where an all electronic NN gear might not survive for very long.

Comments?

Photo Credit(s):

Figure 1 from All-Optical Machine Learning Using Diffractive Deep Neural Networks

Figure 2 from All-Optical Machine Learning Using Diffractive Deep Neural Networks

Figure 2 from Class-specific Differential Detection in Diffractive Optical Neural Networks Improves Inference Accuracy

Figure 3 from Class-specific Differential Detection in Diffractive Optical Neural Networks Improves Inference Accuracy

Intel’s new DL Boost for DL AI inferencing

I was at a TechFieldDay Extra with Intel Data Centric Innovation Conference last week in San Francisco. It was a lavish affair with many industry analysts in attendance besides the TFDx crew.

At the event Intel announced a number of new products, including the availability of their next generation scalable Xeon processor chips, new Optane DC PM (DIMM) and software, new Ethernet (800) NIC cards, a new FPGA line (10nm) and DL Boost (deep learning inferencing) functionality.


I was most interested in the DL Boost and Optane DC PM solutions. For this post I focus on DL Boost.

DL Boost for DL inferencing on Xeon

Intel’s DL Boost technology provides a new integer 8 bit precision (INT 8) matrix multiply & summation instruction which can be used to speed up DL inferencing operations. As those who have been following along with my AI-DL-machine learning (ML) blog posts (the latest being Learning Machine Learning part 3) probably know, deep learning is a form of machine learning that processes data to create a neural network made up of a number of layers, each with a number of nodes, where each node holds a floating point weight used to transform inputs into outputs.

All DL AI projects involve at least two phases: model training and model inferencing (prediction, classification, AI result, etc.). Although both of these activities involve matrix calculations, model training involves a lot more of these compute intensive operations than inferencing. In fact, while training typically is done on GPUs or other special purpose compute hardware (TPU, IPUs, etc.) inferencing can typically be done on standard off the shelf CPUs.

Historically, inferencing used floating point matrix multiplication and summation functionality, taking input from sensors, logs, photos, etc. and running the model logic to create an output.

Intel believes (with industry analyst agreement) that over time, 50% or more of the DL AI workload is going to involve inferencing. Hence, the focus on this end of the AI workload, at least for now.

For example, a speech recognition AI can take a long time to process audio recordings and train a recognition model. But once trained, you could use that recognition model in anything from smart speakers, to speech-to-text dictation machines, to voice response systems, etc. In all of these, the recognition model is passed a voice recording (or voice in real time) and processes it to create a text version of the speech.

But all of this has historically been done in floating point (FP) 32 (bit precision) or FP 16. Google’s TPU is capable of doing this with less precision, but to my knowledge, up to this point, it’s always been floating point.

What is DL Boost

What Intel has done with DL Boost is to create a new X86 instruction which can perform an integer (INT) 8 (bit precision) matrix multiplication and summation in fewer cycles than it took before. Intel believes that if customers were to modify their trained AI neural network models to move from FP 32 (or 16) to INT 8, they could perform inferencing much faster on Xeon Cascade Lake CPUs than they could before, and not have to rely on GPUs for this activity at all.

Yes, this does require hand optimization of the trained AI neural network. Some of this may be automated, but not all. Intel claims the precision loss, if done properly, is less than a few percent and that its impact on AI inferencing correctness is negligible.
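To make the FP-to-INT 8 conversion concrete, here is a minimal sketch of one common post-training quantization scheme (symmetric, per-tensor scaling) and of the INT 8 matrix multiply with 32-bit accumulation that an instruction like DL Boost’s accelerates in hardware. This is generic quantization arithmetic, not Intel’s actual tooling:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map FP32 values to int8 with a single scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)       # a trained FP32 layer
activations = rng.normal(size=(32, 256)).astype(np.float32)    # a batch of layer inputs

qw, sw = quantize_int8(weights)
qa, sa = quantize_int8(activations)

# INT8 multiply with 32-bit accumulation, then rescale back to floating point.
int32_acc = qa.astype(np.int32) @ qw.astype(np.int32).T
approx = int32_acc.astype(np.float32) * (sa * sw)

exact = activations @ weights.T
rel_err = np.abs(approx - exact).mean() / np.abs(exact).mean()
print(f"mean relative error from INT8 quantization: {rel_err:.3%}")
```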

At the moment, for all the DL modeling I have done, I have never looked at the trained model’s weights, leaving this to TensorFlow/Keras to manage for me. But I’m not creating production level DL AI systems (yet). So, I don’t know what it would take to modify my AI models to use INT 8, nor what level of degradation in correctness would ensue. But I also don’t have Cascade Lake Xeon CPUs available.

Some potential problems here:

  1. Manual activity to hand tune an INT 8 neural network is not going to be that popular, except for those organizations where inferencing currently requires GPUs.
  2. Most production DL AI models undergo some form of personalization for a user or implementation instance, which would require a further FP to INT conversion for each user/implementation.
  3. Most production DL AI models also undergo periodic retraining to fine tune the model with the latest data that has been accumulated. This would also require a further FP to INT conversion after each training cycle.

In the end, there’s an advantage for production AI inferencing with models that don’t require substantial retraining/personalization, as they don’t change that often. And there’s a definite cost advantage to using DL Boost INT 8 for AI inferencing that today must use GPUs to run in real time or under other performance constraints.

But hand converting neural networks reminds me of writing assembly code for modules that impact performance. That is normally reserved for a select few modules or functions that execute a lot. However, DL models are much more monolithic and, by definition, less modular. Identifying which models (or model layers) within a production DL AI solution are performance sensitive, and hand optimizing them to work on CPUs rather than GPUs, seems like a hard task.

It would be better, from my perspective, to create a single FP 16 matrix multiplication instruction. Alternatively, create some software that would automatically convert any DL AI model (or model layer) from FP to INT 8. That way DL Boost optimization would be just another step in the model training process, and conversions could be automatically generated to see whether A) they lose too much accuracy and B) it’s worthwhile using CPU inferencing.

~~~~

Comments?

Screaming IOP performance with StarWind’s new NVMeoF software & Optane SSDs

I was at SFD17 last week in San Jose, where we heard from StarWind SAN (@starwindsan) about the latest NVMeoF storage system they have been working on. Videos of their presentation are available here. StarWind is this amazing company from Ukraine that has been developing software defined storage.

They have developed their own NVMe SPDK for Windows Server, since Intel doesn’t currently offer SPDK for Windows. They also developed their own NVMeoF initiator (for CentOS Linux). The target system used to test their software was a multicore server running Windows Server with a single Optane SSD.

Extreme IOP performance consumes cores

During their development activity they tested various configurations. At the start of their development they used a Windows Server with their NVMeoF target device driver. With this configuration, on a bare metal server, they found that they could max out the Optane SSD at 550K 4K random write IOPS at 0.6msec to a single Optane drive.

When they moved this code directly to run under a Hyper-V environment, they were able to come close to this performance at 518K 4K write IOPS at 0.6msec. However, this level of IO activity pegged 100% of 8 cores on their 40 core server.

More IOPs/core performance in user mode

Next they decided to optimize their driver code and move as much as possible into user space and out of kernel space, while continuing to use Hyper-V. With this level of code, they were able to achieve the same performance as bare metal, or ~551K 4K random write IOPS at 0.6msec RT and 2.26 GB/sec. However, they were now pegging only 2 cores. They expect to release this initiator and target software in mid October 2018!

They converted this functionality to run under ESX/VMware and were able to see much the same results: 2 cores pegged, ~551K 4K random write IOPS at 0.6msec RT and 2.26 GB/sec. They will have the ESXi version of their target driver code available sometime later this year.

Their initiator was running CentOS on another server. When they decided to test how far they could push their initiator, they were able to drive 4 Optane SSDs at up to ~1.9M 4K random write IOP performance.

At SFD17, I asked what they could have done at 100 usec RT, and Max said about 450K IOPS. This is still surprisingly good performance. With 4 Optane SSDs and consuming ~8 cores, you could achieve 1.8M IOPS and ~7.4GB/sec. Doubling the Optane SSDs, one could achieve ~3.6M IOPS and ~14.8GB/sec, given sufficient initiator and target cores.

Optane based super computer?

The ORNL Summit supercomputer, the current number one supercomputer in the world, has a sustained throughput of 2.5 TB/sec over 18.7K server nodes. You could do much the same with 337 CentOS initiator nodes, 337 Windows Server target nodes and ~1350 Optane SSDs.

This assumes that StarWind’s initiator and target NVMeoF systems can scale, but they’ve already shown they can do 1.8M IOPS across 4 Optane SSDs on a single initiator server. And I assume a single target server with 4 Optane SSDs and at least 8 cores can service that IO. Multiplying this by 4 or 400 shouldn’t be much of a concern, except for the increasing networking bandwidth.
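The back-of-the-envelope arithmetic behind those node counts, for what it’s worth:

```python
summit_sustained_gb_s = 2.5 * 1000   # 2.5 TB/sec sustained throughput quoted above
per_target_node_gb_s = 7.4           # ~7.4 GB/sec from 4 Optane SSDs per target node

nodes = summit_sustained_gb_s / per_target_node_gb_s
print(f"target nodes needed: {nodes:.0f}")          # ~338, roughly the 337 quoted above
print(f"Optane SSDs needed: {round(nodes) * 4}")    # ~1350 SSDs at 4 per node
```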

Of course, with Starwind’s Virtual SAN, there’s no data management, no data protection and probably very little in the way of logical volume management. And the ORNL Summit supercomputer is accessing data as files in a massive file system. The StarWind Virtual SAN is a block device.

But if I wanted to rule the supercomputing world, in a somewhat smallish data center, I might be tempted to put together 400 StarWind NVMeoF target storage nodes with 4 Optane SSDs each, convert their initiator code to work on IBM Spectrum Scale nodes, and let her rip.

Comments?

New website monetization approaches

Historically, websites have made money by selling wares, services or advertising. In the last two weeks it seems like two new business models are starting to emerge: one more publicly supported and the other less so.

Europe’s new copyright law

According to an article I read recently (This newly approved European copyright law might break the Internet), Article 11 of Europe’s new Copyright Directive (not quite law yet) will require search engines, news aggregators and other users of Internet content to pay a “link tax” to copyright holders of anything they link to. As a long time blogger, podcaster and content provider, I find this new copyright policy very intriguing.

The article proposes that this will bankrupt small publishers, as larger ones will charge less for the traffic. But presently I get nothing for links to my content. And I’d be delighted to get any amount; in fact I’d match any large publisher’s link tax amount that the market demands.

But my main concern is the impact this might have on site traffic. If aggregators pay a link tax, why would they want to use content that charges any tax? Yes, at some point aggregators need content. But there are many websites full of content; certainly some would be willing to forgo link tax fees in exchange for more traffic.

I also happen to be a copyright user. Most of my blog posts are from articles I read on the web. I usually link to an article in the 1st one or two paragraphs (see above and below) of a post and may refer (and link) to more that go deeper into a subject. Will I have to pay a link tax to the content owner?

How much of a link tax is anyone’s guess. I’m not sure it would amount to much. But a link tax, if done judiciously might even raise the quality of the content on the web.

Browsers of the world, lay down your blockchains

The second article was a recent research paper (Digging into browser based crypto mining). Researchers at RWTH Aachen University had developed a new method to associate mined blocks to mining pools as a way to unearth browser-based mined crypto coins. With this technique they estimated that 1.8% of all Monero coins were mined by CoinHive using participant browsers to mine the coin or ~$250K/month from browser mining.

I see this as stealing compute power. But with that much coin being generated, it might be a reasonable way for an honest website to make some cash from people browsing their web pages. The browsing party would need to be informed of the mining operation in the page’s information, sort of like “we use cookies” today.

Just think, someone creates a WP plugin to do ETH mining and when activated, a WP website pops up a message that says “We mine coins while you browse – OK?”.

In another twist perhaps the websites could share the ETH mined on their browser with the person doing the browsing, similar to airline/hotel travel awards. Today most travel is done on corporate dime, but awards go to the person doing the traveling. Similarly, employees could browse using corporate computers but they would keep a portion of the ETH that’s mined while they browse away… Sounds like a deal.

Other monetization approaches

We’ve tried Google AdSense and other advertising but it only generated pennies a month. So, it wasn’t worth it.

We also sell research and occasionally someone buys some (see SCI Research Shop). And I do sell services but not through my website.

~~~

I’m not sure a link tax will fly. It would be a race to the bottom, and anyone that charged a tax would suffer from fewer links until they decided to charge a $0 link tax.

Maybe if every link had a tax associated with it, whether the site owner wanted it or not, there could be a level playing field. But recording, paying/receiving and accounting for all these link tax micropayments would be another nightmare altogether.

But a WP plugin that announces and mines crypto coins with a user’s approval, and splits the profit with them, might work. Corporate wouldn’t like it, but employees would just be browsing websites; where’s the harm in that?

Browse a website and share the mined crypto coin with site owner. Sounds fine to me.

Photo Credit(s): Strasburg – European Parliament|Giorgio Barlocco

Crypto News Daily – Telegram cancels ICO…

Photo of Bitcoin, Etherium and Litecoin|QuoteInspector