Photonic [AI] computing seeing the light of day – part 2

Read an interesting article in Analytics India Magazine (MIT Researchers Make New Chips That Work On Light) about a startup out of MIT focused on using photonics for AI/ML/DL activities. Not exactly neuromorphic chips, but using analog photonics interactions to perform computational intensive operations required by todays deep neural net training.

We’ve written about photonics computing before ( see Photonic computing seeing the light of day [-part 1]). That post was about spin outs from Princeton and MIT back in 2019. We showed a bit more on how photonics can perform multiplication and other computations with less power.

The article (noted above) talked about LightIntelligence, an MIT spinout/ startup that’s been around since ~2017, but there’s another company in the same space, also out of MIT called LightMatter that just announced early access to their hardware system.

The CEOs of both companies collaborated on a paper (#1&2 authors of the 10 author paper) written back in 2017 on Deep Learning with Coherent Nanophotonic Circuits. This seemed to be the event that launched both companies.

LightMatter just received $80M in Series B funding ( bringing total funding to $113M) last month and LightIntelligence seems to have $40M in total funding So both have decent funding but, LightMatter seems further ahead in funding and product technology.

LightMatter

LightMatter Envise Photonics-RISC AI processing chip

LightMatter Envise AI chip uses standard RISC electronic cores together with Photo Arithmetic Units for accelerated AI computations. Each Envise chip has 500MB of SRAM for large models, offers 400Gbps chip to chip interconnect fabric, 256 RISC cores, a Graph processor, 294 photonic arithmetic units and PCIe 4.0 connectivity.

LightMatter has just announced early access for their Envise AI photonics server. It’s an 4U, AI server with 16 Envise chips, 2 AMD EPYC CPUs, (16×400=)6.4Tpbs optical fabric for inter-chip communications, 1TB of DDR4 DRAM, 3TB of NVMe SSD and supports 2-200GbE SmartNICs for outside communications.

Envise also offers Idiom Software that interfaces with standard AI frameworks to transform models for photonics computing to use Envise hardware . Developers select Envise hardware to run their AI models on and Idiom automatically re-compiles (IdCompile) their model into more parallelized, photonics operations. Idiom also has a model profiler (IdProfiler) to help debug and visualize photonic models in operation (training or inferencing?) on Envise hardware. Idiom also offers an AI model library (IdML) which provides a PyTorch frontend to help compress and quantize a standard set of AI models.

LightMatter also announced their Passage optical interconnect chip that supplies 100Tbps optical switch for photonics, CPU or GPU processing. It’s huge, 8″x8″ and built on 5nm/7nm node process. Passage can connect up to 48 photonics, CPU or GPU chips that are built onto of it (one can see the space for each of these 48 [sub-]chips on the chip). LightMatter states that 40 Passage (photonic/optical) lanes are the width of one optical fibre. Passage chips are sampling now.

LightMatter Passage photonics-transistor chip (carrier) that provides a photonics programmable interconnect for inter-[photonics-electronic-]chip communications.

LightIntelligence

They don’t appear to be announcing any specific hardware just yet but they are at work in creating the world largest integrated photonics processing system. But LightIntelligence have published a number of research papers focused on photonic approaches to CNNs, RNNs/LSTMs/GRUs, Recurrent ISING machines, statistical computing, and invisibility cloaking.

Turns out the processing power needed to provide invisibility cloaking is very intensive and as its all pixels, photonics offers serious speedups (for invisibility, see Nature article, behind paywall).

Photonics Recurrent ISLING Sampler (PRIS)

LightIntelligence did produce a prototype photonics processor in 2019. And they believe the will have de-risked 80-90% of their photonics technology by year end 2021.

If I had to guess, it would appear as if LightIntelligence is trying to re-imagine deep learning taking a predominately all photonics approach.

Why photonics for AI DL

It turns out that one can use the interaction/interference between two light beams to perform matrix multiplication and other computations a lot faster, with a lot less power than using standard RISC (or CISC) electronic processor architectures. Typical GPUs run 400W each and multi-GPU training activities are commonplace today.

The research documented in the (Deep learning using nanophotonics) paper was based on using an optical FPGA which we have talked about before (See Photonics or Optical FPGAs on the horizon) to prototype the technology back in 2017.

Can photonics change the technology underpinning AI or computing?

If by using photonics, one could speed up AI inferencing by 3-5X and do it with 5-6X less power, you might have a market. These are LightMatter Envise performance numbers on ResNet50 with ImageNet and BERT-Base with SQUAD v1.1 against NVIDIA DGX-A100 (state of the art) AI processing system.

The challenge to changing the technology behind multi-million/billion/trillion dollar industry is that it’s not sufficient to offer a product better than the competition. One has to offer a technology that’s better enough to fund the building of a new (multi-million/billion/trillion dollar) ecosystem surrounding that technology. In order to do that it’s got to be orders of magnitude faster/lower power/better so that commercial customers adopt it en masse.

I like where LightMatter is going with their Passage chip. But their Envise server doesn’t seem fast enough to give them enough traction to build a photonics ecosystem or to fund Envise 2, 3, 4, etc. to change the industry.

The 2017 (Deep learning using nanophotonics) paper predicted that an all optical/photonics implementation of CNN would use 3 orders of magnitude less power for small models and that advantage would only go up for larger models (not counting power for data movement, photo detectors, etc.). Now if that’s truly feasible and maybe it takes a more photonics intensive processor to get there, then photonics technology could truly transform the AI or for that matter the computing industry.

But the other thing that LightIntelligence and LightMatter may be counting on is the slowdown in Moore’s law which may inhibit further advances in electronics processing power. Whether the silicon industry is ready to throw in the towel yet on Moore’s law is TBD.

Comments?

Photo Credit(s):

AI inferencing using light alone

Researchers at UCLA have taken a trained DL neural network and implemented it into a series of passive optical only, 3D printed diffraction gratings to perform fashion MNIST object classification. And did the same with a MNIST handwritten digit and ImageNet DL neural network classifiers.

But first please take our new poll:

Experimental testing of 3D-printed D2NNs.(A and B) After the training phase, the final designs of five different layers (L1, L2, …, L5) of the handwritten digit classifier, fashion product classifier, and the imager D2NNs are shown. To the right of the network layers, an illustration of the corresponding 3D-printed D2NN is shown. (C and D) Schematic (C) and photo (D) of the experimental terahertz setup. An amplifier-multiplier chain was used to generate continuous-wave radiation at 0.4 THz, and a mixer-amplifier-multiplier chain was used for the detection at the output plane of the network. RF, radio frequency; f, frequency.

See the article on SlashGear, 3D printed all-optical diffractive deep learning neural network…. The research article is only available on Optical Society of America’s website/magazine (see Residual D2NN: training diffractive deep neural networks via learnable light shortcuts behind hard paywall). However, I did find a follow on article on ArchivX (see Analysis of Diffractive Optical Neural Networks and Their Integration with Electronic Neural Networks) that discussed how to integrate D2NN approaches with an electronic NN to create a hybrid inference engine. And another earlier Science article (see All-optical machine learning using diffractive deep neural networks) that was available which described earlier versions of D2NN technology for MNIST digit classification, fashion MNIST classification and ImageNet object classification.

How does it work

Apparently the researchers trained a normal (electronic based) deep learning neural network on the MNIST, Fashion MNIST and ImageNet and then converted the resultant trained NNs into a set of multiple diffraction grids. They did some computer simulation of the D2NN and once satisfied it worked and achieved decent accuracy, 3D printed the diffraction plates.

All-optical D2NN-based classifiers. These D2NN designs were based on spatially and temporally coherent illumination and linear optical materials/layers. (a) D2NN setup for the task of classification of handwritten digits (MNIST), where the input information is encoded in the amplitude channel of the input plane. (b) Final design of a 5-layer, phase-only classifier for handwritten digits. (c) Amplitude distribution at the input plane for a test sample (digit ‘0’). (d-e) Intensity patterns at the output plane for the input in (c); (d) is for MSE-based, and (e) is softmax- cross-entropy (SCE)-based designs. (f) D2NN setup for the task of classification of fashion products (Fashion-MNIST), where the input information is encoded in the phase channel of the input plane. (g) Same as (b), except for fashion product dataset. (h) Phase distribution at the input plane for a test sample. (i-j) Same as (d) and (e) for the input in (h),  refers to the illumination source wavelength. Input plane represents the plane of the input object or its data, which can also be generated by another optical imaging system or a lens, projecting an image of the object data onto this plane.

In their D2NN, they start with coherent (laser) light in the THz spectrum, used this to illuminate the input plane (I assume an image of the object/digit/fashion accessory) and passed this through multiple plates of diffraction grids onto THz detector which was used to detect the illuminated spot that indicated the classification.

The article in science has a supplementary materials download that show how the researchers converted NN weights into a diffraction grating. Essentially each pixel on the diffraction grating either transmits, refracts, or reflects a light path. And this represents the connections between layers. It’s unclear whether the 5 or 6 plates used in the D2NN correspond to the NN layers but it’s certainly possible.

And to the life of me I can’t understand what they mean by “Residual D2NN”, other than if it means using a trained (residual) NN and converting this to D2NN.

Some advantages of D2NN

3D printing diffraction gratings means anyone/lab could do this. The 3D printers they used had a spatial accuracy of 600 dpi, with 0.1mm accuracy, almost consumer grade 3D printers. In any case, being able to print these in a matter of hours, while not as easy as changing an all digital NN, seems like an easy way to try out the approach.

For example, for the MNIST digit classifier they used a pixel size of 400um and each diffraction layer they created was equivalent to 200X200 neural weights. Which means that 5 layer D2NN could handle about 0.2M neural weights which were completely connected to one another. This meant they could have (200×200)**2*5=8B connections in the MNIST D2NN. In the image classifier, each diffraction layer had 300×300 neural weights. So D2NN’s seem to scale very well.

Being an all passive optical device, the system is operates entirely in parallel, That is, the researchers indicated that the D2NN devices operate at the speed of light and would perform the inferencing activity in the time it takes a camera to capture the image.

Also the device uses very little energy (I assume just the energy for the THz generator, the input plane detector and the THz detector at the end.

And the researchers also claimed the device was cheap to manufacture, it could be created for less than $50. (Unclear if this included all the electronics or just the D2NN diffraction gratings and holder). And once you have locked into a D2NN that you wanted to use, could be manufactured in volume, very cheaply (sort of like stamping out CD platters). Finally, the number of neural network nodes and layers can be scaled up to a large number of layers and nodes per layer while still fitting on the diffraction gratings. In contrast, all electronic NN require more compute power as you scale up network layers and nodes per layer.

The other article (ArchivX) talked about potentially using a hybrid optical-electronic DNN approach with some layers being D2NN and others being purely digital (electronics). Such a system could potentially be used where some portion of the NN was more stable/more compute intensive than others and where the final output classification layer(s) was more changeable and much smaller/less compute intensive. Such a hybrid system could make use of the best of of the all optical D2NN to efficiently and quickly compress the input space and then have the electronic final classification layer provide the final classification step.

The Oracle

Combining a handful of D2NNs into a device that accepts speech input and provides speech output with the addition of say an offline copy of Wikipedia, Google Books etc. with a search engine that could be used to retrieve responses to questions asked would create an oracle device. Where you would ask a question and the device would respond with the best answer it could find (in it’s databases).

If this could be made out of an all passive optical components and use natural sunlight/electronic illumination to perform it’s functionality, such an all optical, question to answer oracle would be very useful to the populations of the world. And could be manufactured in volume very cheaply and would cost almost nothing to operate.

A couple of other tweaks, if we could collapse the multiple grating D2NNs into a single multi-layer plate/platter and make these replaceable in the device that would allow the oracle’s information base to be updated periodically.

Then if we could embed such a device into a Long Now Clock that would reflect sunlight onto the disk every Solstice, or Equinox, then we could have a quarterly oracle device that could last for 1000 of years. That would provide answers to queries one day every quarter. And that would be quite the oracle…

Photo credit(s):

Ok, maybe neuromorphic chips aren’t a deadend

Those of you who followe my blog will no doubt recall that I pronounced neuromorphic chips dead (see our Are neuromorphic chips a deadend blog post). Not because the hardware technology wasn’t improving or good enough, but because software support for the technology was sorely lacking and it was extremely complex or nigh impossible to program and use.

But first please take our new poll:

And, in the meantime GPUs, TPUs and other more “normal” neural network hardware and accelerators, all were able to utilize standard, easy to use, mostly open source, AI DL frameworks. And all this hardware was steadily improving, coming out regularly with more power and performance, with no end in sight.

But then I attended AIFD1 (AI Field Day 1) and at one of the sessions, Anil Mankar, COO & Co-Founder of a company named BrainChip Inc, (see video of their talk) presented yet another neuromorphic chip, called the AKIDA Neural Processor. Their current generation of the technology is available in their AKD 1000 SoC chip, focused on IoT solutions. But they had created a a software development environment that allowed one to use standard TensorFlow neural network trained models and deploy these on their hardware. And that got my interest.

BrainChip’s AKIDA AKD 1000 hardware AND software

Their AI DL nueromoryhic chip is made app of Event Domain Neural Processing Units (NPUs). AKIDA technology is focused on low power, sensor like applications. They claim to save power by only consumuing power (or is running) when an event takes place. They are also able to save on memory requirements by using 1, 2 or 4 bits (vs. 8, 16, 32 or more bits) for model weights/activations

Their hardware seems to run spiking neural networks (SNN, see our blog post on another chip technology using SNNs). In their SDK, they have a CNN2SNN tool that could take a any (TensorFlow) trained CNN model and convert it to a SNN, that could then run on their AKIDA tecnology.

They also have an AKIDA Model Zoo with a handful of pre-trained CNN type models that have already been converted to run on their technology. They also provide a tutorial on their technology. Mankar, said that if you understand how to use TensorFlow Keras today, to construct and train your models, it shouldn’t be too hard to understand how to use their tools to do what you want.

Their chip hardware is available today on a separate PCIe card, M.2 form factor card. or as a chip. Finally, they also license their AKIDA IP to other chip designers.

AKIDA AKD 1000 performance

At the AIFD1 Mankar showed statistics on the performance and accuracy attained using their chip vs. using standard 32 bit floating point CNN implementations.

As discussed above, their processor uses 1-4 bits for weight quantization and as such loses some accuracy but as you can see it’s a matter of one to a few percent vs. these same models using a 32bit floating point CNN implementation.

Because of their smaller weights, AKIDA uses less memory and less bandwidth to update models vs. models using larger weights.

As shown in the chart the the memory required for the 8-bit deep learning algorithms (DLAs) were all significantly larger than the memory requirements for the AKIDA solution. For one algorithm, they required ~1/2 the memory size of the 8-bit DLA version of the model.

Mankar also provided information on the amount of calculations required per inference using AKIDA vs. 8-bit DLAs.

Just to set the stage, MMACs/Inference is (matrix or multiple) multiplications and accumulations required to perform a single inference with the selected CNN model. ImageNet (1000), ImageNette (20) and Visual Wake Word models are all standard CNN models, that have pre-trained on vast repositories of data, that can run in many hardware environments. The non-AKIDA solutions above were all running using an 8-bit DLA CNN model. Activity regularization is a method of reducing the learning rate and weights used during training that shrinks the weight changes during training to reduce model overfit.

He also showed some comparisons of their technology vs. Intel’s LoiHi hardware. LoiHi is another neuromorphic chip, whose original introduction prompted me to write the “Are neuromorphic chips a deadend” post (link above). Unfortunately, I didn’t capture any of these charts, but from my recollection, they showed that AKIDA technology used slightly less power than LoiHi technology in all their comparisons.

AKIDA technology demo

In their live, on camera, demo, they used a previously downloaded VGG16 (if I recall correctly) CNN trained model. Offline they had replaced the last classification layer with a (blank, untrained) dense network and they converted this to a SNN and downloaded onto one of their boards. They had developed an application that used this board with a camera to perform more CNN training or CNN image inferencing (classification).

They first (one-shot) trained their board’s model to recognize the background of what the camera was seeing and then proceeded to perform (one-shot) trainings to classify toys of tigers, elephants and cars. All these were completed in real time in the demo. They were able to verify the training took using pictures of tigers, elephants and cars as well as classify all the toys in different orientations and a different toy car

The AIFD1 (a tuff) crowd, said had seen all this before but would be really interested to see if their chip could distinguish between different cars (one a toy race car and the other a toy police car). On camera, they were able to re-train their CNN to distinguish between (toy) car 1 and car 2 to classify properly between the two of them. They had one or two instances where their CNN model was confused, but they were able to re-train it to recognize the toy car and place it into the correct classification (using two-shot[?] learning).

At AIFD1, Mankar also presented detailed, real world data on how they were able to perform Keyword spotting, person detection, E-nose classification, E-tongue classification, and auditory (E-ear?) classification in embedded sensor systems.

AKIDA technology limitations

At the moment, their chip doesn’t support neural networks that use memory such as LSTM or RNN’s but it seems to work fine for any CNN, which was shown multiple times in the data they presented and in their demo.

We were really impressed with their software stack, liked what we saw of their hardware/IP, and enjoyed their demo and its one-shot learning. Check out their videos (link above) for more information on them.

Photo Credit(s): all charts are from BrainChip Inc’s website or were presented at their AIFD1 session

Undersea datacenter in our future?

Read about Microsoft’s Project Natick Phase 2 this past week. Microsoft submerged a steel encased tube filled with servers, storage and compute for 2 years in the UK and just took it out of the water this past July. We’ve written before about underwater and in space data centers (see our IT in space post)

Project Natick’s Phase 2 underwater data center had 12 racks with 864 servers and 27.5PB of disk storage and was connected to the nearby Orkney island’s power grid (250Kw) and networking infrastructure. The Orkney’s islands are located off the NE coast of the Scotland and its power grid is 100% renewable, using tidal, solar and wind power. During the data center test, Orkney was able was able to power the data center, the islands and still provide power back to the Scottish power grid.

More reliable underwater

According to early reports, the servers in the underwater data center had 1/8th the failures that a control data center, on land, had. Microsoft attributes the enhanced server reliability to the use of a 100% Nitrogen (at 1 atmosphere pressure) rather than normal air and the lack of any humans to jostle the equipment/disturb the environment.

It’s also likely that the temperature variability present in a normal, on the surface of the earth, data center was measurably less than for a data center on the sea floor. If this were true, that could also help explain its better reliability.

Why underwater?

It’s all about cooling modern servers (and storage). According to NREL ( USA National Renewable Energy Lab), most data centers operate at 1.8 PUE (power use efficiency) that is, using 180% of the power required for the servers, storage and networking equipment. The other 80% is used mainly for cooling electronics, but also includes lighting, HVAC, and other essential services for humans. NREL says that high efficiency data centers can achieve a PUE of 1.2.

PUE for Project Natick Phase 2 data center was reported to be 1.07. The only additional electricity needed would probably be power for cooling.

Cooling for the Project Natick Phase 2 data center used seawater pumped through the back of server racks. The data center was placed on the seafloor at 35m (117ft) deep.

It kind looked like a submarine. According to Microsoft, the data center was contracted for, built and deployed in under 90 days. The intent was to have the data center be smaller than a standard ISO shipping container. The data center was driven ontop of an 18 wheeler, from where it was built to the Orkney Island, including ferry crossings. It was placed on a triangular support, towed out to see and deposited on the seafloor.

While 864 servers and 27.5PB of storage seem like a lot to most of us, for Microsoft Azure it’s too small to be used as a regional zone. But for (large) edge deployments. something this size or (10X) smaller might be just the thing.

Microsoft notes that 1/2 the world’s population lives within 200km (120mi) of the ocean. So there’s a ready supply of people and businesses that could take advantage of any underwater data center.

And of course, such a structure when laid on the bottom of the ocean floor, could create an artificial reef (if left in place long enough). Artificial reefs have been made out of ocean oil rigs, sunken war ships and large chunks of steel/concrete. So a underwater data center could do so just as well. And maybe the heating coming from the data center cooling pumps would foster even more coral life.

Microsoft plans Project Natick Phase 3 to be a full Azure AZ that will be deployed underwater which will include about 12 Phase 2 datacenter pressurized units.

Photo Credits:

New PCM could supply 36PB of memory to CPUs

Read an article this past week on how quantum geometry can enable a new form of PCM (phase change memory) that is based on stacks of metallic layers (SciTech Daily article: Berry curvature memory: quantum geometry enables information storage in metallic layers), That article referred to a Nature article (Berry curvature memory through electrically driven stacking transitions) behind a paywall but I found a pre-print of it, Berry curvature memory through electrically driven stacking transitions.

Figure 1| Signatures of two different electrically-driven phase transitions in WTe2. a, Side view (b–c plane) of unit cell showing possible stacking orders in WTe2 (monoclinic 1T’, polar orthorhombic Td,↑ or Td,↓) and schematics of their Berry curvature distributions in momentum space. The spontaneous polarization and the Berry curvature dipole are labelled as P and D, respectively. The yellow spheres refer to W atoms while the black spheres represent Te atoms. b, Schematic of dual-gate h-BN capped WTe2 evice. c, Electrical conductance G with rectangular-shape hysteresis (labeled as Type I) induced by external doping at 80 K. Pure doping was applied following Vt/dt = Vb/db under a scan sequence indicated by black arrows. d, Electrical conductance G with butterfly-shape switching (labeled as Type II) driven by electric field at 80 K. Pure E field was applied following -Vt/dt = Vb/db under a scan sequence indicated by black arrows. Positive E⊥ is defined along +c axis. Based on the distinct hysteresis observations in c and d, two different phase transitions can be induced by different gating configurations.

The number one challenge in IT today,is that data just keeps growing. 2+ Exabytes today and much more tomorrow.

All that information takes storage, bandwidth and ultimately some form of computation to take advantage of it. While computation, bandwidth, and storage density all keep going up, at some point the energy required to read, write, transmit and compute over all these Exabytes of data will become a significant burden to the world.

PCM and other forms of NVM such as Intel’s Optane PMEM, have brought a step change in how much data can be stored close to server CPUs today. And as, Optane PMEM doesn’t require refresh, it has also reduced the energy required to store and sustain that data over DRAM. I have no doubt that density, energy consumption and performance will continue to improve for these devices over the coming years, if not decades.

In the mean time, researchers are actively pursuing different classes of material that could replace or improve on PCM with even less power, better performance and higher densities. Berry Curvature Memory is the first I’ve seen that has several significant advantages over PCM today.

Berry Curvature Memory (BCM)

I spent some time trying to gain an understanding of Berry Curvatures.. As much as I can gather it’s a quantum-mechanical geometric effect that quantifies the topological characteristics of the entanglement of electrons in a crystal. Suffice it to say, it’s something that can be measured as a elecro-magnetic field that provides phase transitions (on-off) in a metallic crystal at the topological level. 

In the case of BCM, they used three to five atomically thin, mono-layers of  WTe2 (Tungsten Ditelluride), a Type II  Weyl semi-metal that exhibits super conductivity, high magneto-resistance, and the ability to alter interlayer sliding through the use of terahertz (Thz) radiation. 

It appears that by using BCM in a memory, 

Fig. 4| Layer-parity selective Berry curvature memory behavior in Td,↑ to Td,↓ stacking transition. a,
The nonlinear Hall effect measurement schematics. An applied current flow along the a axis results in the generation of nonlinear Hall voltage along the b axis, proportional to the Berry curvature dipole strength at the Fermi level. b, Quadratic amplitude of nonlinear transverse voltage at 2ω as a function of longitudinal current at ω. c, d, Electric field dependent longitudinal conductance (upper figure) and nonlinear Hall signal (lower figure) in trilayer WTe2 and four-layer WTe2 respectively. Though similar butterfly-shape hysteresis in longitudinal conductance are observed, the sign of the nonlinear Hall signal was observed to be reversed in the trilayer while maintaining unchanged in the four-layer crystal. Because the nonlinear Hall signal (V⊥,2ω / (V//,ω)2 ) is proportional to Berry curvature dipole strength, it indicates the flipping of Berry curvature dipole only occurs in trilayer. e, Schematics of layer-parity selective symmetry operations effectively transforming Td,↑ to Td,↓. The interlayer sliding transition between these two ferroelectric stackings is equivalent to an inversion operation in odd layer while a mirror operation respect to the ab plane in even layer. f, g, Calculated Berry curvature Ωc distribution in 2D Brillouin zone at the Fermi level for Td,↑ and Td,↓ in trilayer and four-layer WTe2. The symmetry operation analysis and first principle calculations confirm Berry curvature and its dipole sign reversal in trilayer while invariant in four-layer, leading to the observed layer-parity selective nonlinear Hall memory behavior.
  • To alter a memory cell takes “a few meV/unit cell, two orders of magnitude less than conventional bond rearrangement in phase change materials” (PCM). Which in laymen’s terms says it takes 100X less energy to change a bit than PCM.
  • To alter a memory cell it uses terahertz radiation (Thz) this uses pulses of light or other electromagnetic radiation whose wavelength is on the order of picoseconds or less to change a memory cell. This is 1000X faster than other PCM that exist today.
  • To construct a BCM memory cell takes between 13 and 16  atoms of W and Te2 constructed of 3 to 5 layers of atomically thin, WTe2 semi-metal.

While it’s hard to see in the figure above, the way this memory works is that the inner layer slides left to right with respect to the picture and it’s this realignment of atoms between the three or five layers that give rise to the changes in the Berry Curvature phase space or provide on-off switching.

To get from the lab to product is a long road but the fact that it has density, energy and speed advantages measured in multiple orders of magnitude certainly bode well for it’s potential to disrupt current PCM technologies.

Potential problems with BCM

Nonetheless, even though it exhibits superior performance characteritics with respect to PCM, there are a number of possible issues that could limit it’s use.

One concern (on my part) is that the inner-layer sliding may induce some sort of fatigue. Although, I’ve heard that mechanical fatigue at the atomic level is not nearly as much of a concern as one sees in (> atomic scale and) larger structures. I must assume this would induce some stress and as such, limit the (Write cycles) endurance of BCM.

Another possible concern is how to shrink size of the Thz radiation required to only write a small area of the material. Yes one memory cell can be measured bi the width of 3 atoms, but the next question is how far away do I need to place the next memory cell. The laser used in BCM focused down to ~1.5 μm. At this size it’s 1,000X bigger than the BCM memory cell width (~1.5 nm).

Yet another potential problem is that current BCM must be embedded in a continuous flow of liquid nitrogen (@80K). Unclear how much of a requirement this temperature is for BCM to function. But there are no computers nowadays that require this level of cooling.

Figure 3| Td,↑ to Td,↓ stacking transitions with preserved crystal orientation in Type II hysteresis. a,
in-situ SHG intensity evolution in Type II phase transition, driven by a pure E field sweep on a four-layer and a five-layer Td-WTe2 devices (indicated by the arrows). Both show butterfly-shape SHG intensity hysteresis responses as a signature of ferroelectric switching between upward and downward polarization phases. The intensity minima at turning points in four-layer and five-layer crystals show significant difference in magnitude, consistent with the layer dependent SHG contrast in 1T’ stacking. This suggests changes in stacking structures take place during the Type II phase transition, which may involve 1T’ stacking as the intermediate state. b, Raman spectra of both interlayer and intralayer vibrations of fully poled upward and downward polarization phases in the 5L sample, showing nearly identical characteristic phonons of polar Td crystals. c, SHG intensity of fully poled upward and downward polarization phases as a function of analyzer polarization angle, with fixed incident polarization along p direction (or b axis). Both the polarization patterns and lobe orientations of these two phases are almost the same and can be well fitted based on the second order susceptibility matrix of Pm space group (Supplementary Information Section I). These observations reveal the transition between Td,↑ and Td,↓ stacking orders is the origin of
Type II phase transition, through which the crystal orientations are preserved.

Finally, from my perspective, can such a memory can be stacked vertically, with a higher number of layers. Yes there are three to five layers of the WTe2 used in BCM but can you put another three to five layers on top of that, and then another. Although the researchers used three, four and five layer configurations, it appears that although it changed the amplitude of the Berry Curvature effect, it didn’t seem to add more states to the transition.. If we were to more layers of WTe2 would we be able to discern say 16 different states (like QLC NAND today).

~~~~

So there’s a ways to go to productize BCM. But, aside from eliminating the low-temperature requirements, everything else looks pretty doable, at least to me.

I think it would open up a whole new dimension of applications, if we had say 60TB of memory to compute with, don’t you think?

Comments?

[Updated the title from 60TB to PB to 36PB as I understood how much memory PMEM can provide today…, the Eds.]

Photo Credit(s):

Software defined power grid

Read an article this past week in IEEE Spectrum (The Software Defined Power Grid is here) about a company that has been implementing software defined power grids throughout USA and the world to better integrate and utilize renewable energy alongside conventional power generation equipment.

Moreover, within the last year or so, Tesla has installed a Virtual Power Plant (VPP) using residential solar and grid scale batteries to better manage the electrical grid of South Australia (see Tesla’s Australian VPP propped up grid during coal outage). VPP use to offset power outages would necessitate something like a software defined power grid.

Software defined power grid

Not sure if there’s a real definition somewhere but from our perspective, a software defined power grid is one where power generation and control is all done through the use of programatic automation. The human operator still exists to monitor and override when something goes wrong but they are not involved in the moment to moment control of which power is saved vs. fed into the grid.

About a decade ago, we wrote a post about smart power meters (Smart metering’s data storage appetite) discussing the implementation of smart meters for home owners that had some capabilities to help monitor and control power use. But although that technology still exists, the software defined power grid has moved on.

The IEEE Spectrum article talks about a phasor measurement units (PMUs) that are already installed throughout most power grids. It turns out that most PMUs are capable of transmitting phasor power status at 60 times a second granularity and each status report is time stamped with high accuracy, GPS synchronized time.

On the other hand, most power grids today use SCADAs (supervisory control and data acquisition) to monitor and manage the power grid. But SCADAs only send data every 2-4 seconds. PMU’s are also installed in most power grids, but their information is not as important as SCADA to the monitoring, management and control of most (non-software defined) power grids.

One software defined power grid

PXiSE, the company in the IEEE Spectrum article, implemented their first demonstration project in Hawaii. That power grid had reached the limit of wind and solar power that it could support with human management. The company took their time and implemented a digital simulation of the power grid. But with the simulation in hand, battery storage and a off the shelf PC, the company was able to manage the grids power generation mix in real time with complete automation.

After that success, the company next turned to a micro-grid (building level power) with electronic vehicles, battery and solar power. Their software defined power grid reduced peak electricity demand within the building, saving significant money. With that success the company took their software defined power grid on the road to South Korea, Chile, Mexico and a number of other locations the world.

Tesla’s VPP

The Tesla VPP in South Australia, is planned to consists of up to 50K houses with solar PV panels and 13.5Kwh of batteries, able to deliver up to 250Mw of power generation and 650Mwh of power storage.

At the present time, the system has ~1000 house systems installed but even with that limited generation and storage capability it has already been called upon at least twice to compensate for coal generation power outage. To manage each and every household, they’d need something akin to the smart meters mentioned above in conjunction with a plethora of PMUs.

Puerto Rico’s power grid problems and solutions

There was an article not so long ago about the disruption to Puerto Rico’s power grid caused by Hurricanes Irma and Maria in IEEE Spectrum (Rebuilding Puerto Rico’s Power Grid: The Inside Story) and a subsequent article on making Puerto Rico’s power grid more resilient to hurricanes and other natural disasters (How to harden Puerto Rico’s power grid). The later article talked about creating micro grids, community PV and battery storage that could be disconnected from the main grid in times of disaster but also used to distribute power generation throughout the island.

Although the researchers didn’t call for the software defined power grid, it is our understanding that something similar would be an outstanding addition to their work there.

~~~~

As the use of renewables goes up and the price of batteries decreases while their capabilities go up over time, more and more power grids will need to become software defined. In the end, more software defined power grids with increasing renewables power generation and storage will make any power grid, more resilient and more fault tolerant.

Photo Credit(s):

Hybrid digital training-analog inferencing AI

Read an article from IBM Research, Iso-accuracy DL inferencing with in-memory computing, the other day that referred to an article in Nature, Accurate DNN inferencing using computational PCM (phase change memory or memresistive technology) which discussed using a hybrid digital-analog computational approach to DNN (deep neural network) training-inferencing AI systems. It’s important to note that the PCM device is both a storage device and a computational device, thus performing two functions in one circuit.

In the past, we have seenPCM circuitry used in neuromorphic AI. The use of PCM here is not that (see our Are neuromorphic chips a dead end? post).

Hybrid digital-analog AI has the potential to be more energy efficient and use a smaller footprint than digital AI alone. Presumably, the new approach is focused on edge devices for IoT and other energy or space limited AI deployments.

Whats different in Hybrid digital-analog AI

As researchers began examining the use of analog circuitry for use in AI deployments, the nature of analog technology led to inaccuracy and under performance in DNN inferencing. This was because of the “non-idealities” of analog circuitry. In other words, analog electronics has some intrinsic capabilities that induce some difficulties when modeling digital logic and digital exactitude is difficult to implement precisely in analog circuitry.

The caption for Figure 1 in the article runs to great length but to summarize (a) is the DNN model for an image classification DNN with fewer inputs and outputs so that it can ultimately fit on a PCM array of 512×512; (b) shows how noise is injected during the forward propagation phase of the DNN training and how the DNN weights are flattened into a 2D matrix and are programmed into the PCM device using differential conductance with additional normalization circuitry

As a result, the researchers had to come up with some slight modifications to the typical DNN training and inferencing process to improve analog PCM inferencing. Those changes involve:

  • Injecting noise during DNN neural network training, so that the resultant DNN model becomes more noise resistant;
  • Flattening the resultant DNN model from 3D to 2D so that neural network node weights can be implementing as differential conductance in the analog PCM circuitry.
  • Normalizing the internal DNN layer outputs before input to the next layer in the model

Analog devices are intrinsically more noisy than digital devices, so DNN noise sensitivity had to be reduced. During normal DNN training there is both forward pass of inputs to generate outputs and a backward propagation pass (to adjust node weights) to fit the model to the required outputs. The researchers found that by injecting noise during the forward pass they were able to create a more noise resistant DNN.

Differential conductance uses the difference between the conductance of two circuits. So a single node weight is mapped to two different circuit conductance values in the PCM device. By using differential conductance, the PCM devices inherent noisiness can be reduced from the DNN node propagation.

In addition, each layer’s outputs are normalized via additional circuitry before being used as input for the next layer in the model. This has the affect of counteracting PCM circuitry drift over time (see below).

Hybrid AI results

The researchers modeled their new approach and also performed some physical testing of a digital-analog DNN. Using CIFAR-10 image data and the ResNet-32 DNN model. The process began with an already trained DNN which was then retrained while injecting noise during forward pass processing. The resultant DNN was then modeled and programed into a PCM circuit for implementation testing.

Part D of Figure 4 shows the Baseline which represents a completely digital implementation using FP32 multiplication logic; Experiment which represents the actual use of the PCM device with a global drift calibration performed on each layer before inferencing; Mode which represents theira digital model of the PCM device and its expected accuracy. Blue band is one standard-deviation on the modeled result.

One challenge with any memristive device is that over time its functionality can drift. The researchers implemented a global drift calibration or normalization circuitry to counteract this. One can see evidence of drift in experimental results between ~20sec and ~60 seconds into testing. During this interval PCM inferencing accuracy dropped from 93.8% to 93.2% but then stayed there for the remainder of the experiment (~28 hrs). The baseline noted in the chart used digital FP32 arithmetic for infererenci and achieved ~93.9% for the duration of the test.

Certainly not as accurate as the baseline all digital implementation, but implementing DNN inferencing model in PCM and only losing 0.7% accuracy seems more than offset by the clear gain in energy and footprint reduction.

While the simplistic global drift calibration (GDC) worked fairly well during testing, the researchers developed another adaptive (batch normalization statistical [AdaBS]) approach, using a calibration image set (from the training data) and at idle times, feed these through the PCM device to calculate an average error used to adjust the PCM circuitry. As modeled and tested, the AdaBS approach increased accuracy and retained (at least modeling showed) accuracy over longer time frames.

The researchers were also able to show that implementing part (first and last layers) of the DNN model in digital FP32 and the rest in PCM improved inferencing accuracy even more.

~~~~

As shown above, a hybrid digital-analog PCM AI deployment can provide similar accuracy (at least for CIFAR-10/ResNet-24 image recognition) to an all digital DNN model but due to the efficiencies of the PCM analog circuitry allowed for a more energy efficient DNN deployment.

Photo Credit(s):

OFA DNNs, cutting the carbon out of AI

Read an article (Reducing the carbon footprint of AI… in Science Daily) the other day about a new approach to reducing the energy demands for AI deep neural net (DNN) training and inferencing. The article was reporting on a similar piece in MIT News but both were discussing a technique original outlined in a ICLR 2020 (Int. Conf. on Learning Representations) paper, Once-for-all: Train one network & specialize it for efficient deployment.

The problem stems from the amount of energy it takes to train a DNN and use it for inferencing. In most cases, training and (more importantly) inferencing can take place on many different computational environments, from IOT devices, to cars, to HPC super clusters and everything in between. In order to create DNN inferencing algorithms for use in all these environments, one would have to train a different DNN for each. Moreover, if you’re doing image recognition applications, resolution levels matter. Resolution levels would represent a whole set of more required DNNs that would need to be trained.

The authors of the paper suggest there’s a better approach. Train one large OFA (once-for-all) DNN, that covers the finest resolution and largest neural net required in such a way that smaller, sub-nets could be extracted and deployed for less weighty computational and lower resolution deployments.

The authors contend the OFA approach takes less overall computation (and energy) to create and deploy than training multiple times for each possible resolution and deployment environment. It does take more energy to train than training a few (4-7 judging by the chart) DNNs, but that can be amortized over a vastly larger set of deployments.

OFA DNN explained

Essentially the approach is to train one large (OFA) DNN, with sub-nets that can be used by themselves. The OFA DNN sub-nets have been optimized for different deployment dimensions such as DNN model width, depth and kernel size as well as resolution levels.

While DNN width is purely the number of numeric weights in each layer, and DNN depth is the number of layers, Kernel size is not as well known. Kernels were introduced in convolutional neural networks (CovNets) to identify the number of features that are to be recognized. For example, in human faces these could be mouths, noses, eyes, etc. All these dimensions + resolution levels are used to identify all possible deployment options for an OFA DNN.

OFA secrets

One key to the OFA success is that any model (sub-network) selected actually shares the weights of all of its larger brethren. That way all the (sub-network) models can be represented by the same DNN and just selecting the dimensions of interest for your application. If you were to create each and every DNN, the number would be on the order of 10**19 DNNs for the example cited in the paper with depth using {2,3,4) layers, width using {3,4,6} and kernel sizes over 25 different resolution levels.

In order to do something like OFA, one would need to train for different objectives (once for each different resolution, depth, width and kernel size). But rather than doing that, OFA uses an approach which attempts to shrink all dimensions at the same time and then fine tunes that subsets NN weights for accuracy. They call this approach progressive shrinking.

Progressive shrinking, training for different dimensions

Essentially they train first with the largest value for each dimension (the complete DNN) and then in subsequent training epochs reduce one or more dimensions required for the various deployments and just train that subset. But these subsequent training passes always use the pre-trained larger DNN weights. As they gradually pick off and train for every possible deployment dimension, the process modifies just those weights in that configuration. This way the weights of the largest DNN are optimized for all the smaller dimensions required. And as a result, one can extract a (defined) subnet with the dimensions needed for your inferencing deployments.

They use a couple of tricks when training the subsets. For example, when training for smaller kernel sizes, they use the center most kernels and transform their weights using a transformation matrix to improve accuracy with less kernels. When training for smaller depths, they use the first layers in the DNN and ignore any layers lower in the model. Training for smaller widths, they sort each layer for the highest weights, thus ensuring they retain those parameters that provide the most sensitivity.

It’s sort of like multiple video encodings in a single file. Rather than having a separate file for every video encoding format (Mpeg 2, Mpeg 4, HVEC, etc.), you have one file, with all encoding formats embedded within it. If for example you needed Mpeg-4, one could just extract those elements of the video file representing that encoding level

OFA DNN results

In order to do OFA, one must identify, ahead of time, all the potential inferencing deployments (depth, width, kernel sizes) and resolution levels to support. But in the end, you have a one size fits all trained DNN whose sub-nets can be selected and deployed for any of the pre-specified deployments.

The authors have shown (see table and figure above) that OFA beats (in energy consumed and accuracy level) other State of the Art (SOTA) and Neural (network) Architectural Search (NAS) approaches to training multiple DNNs.

The report goes on to discuss how OFA could be optimized to support different latency (inferencing response time) requirements as well as diverse hardware architectures (CPU, GPU, FPGA, etc.).

~~~~

When I first heard of OFA DNN, I thought we were on the road to artificial general intelligence but this is much more specialized than that. It’s unclear to me how many AI DNNs have enough different deployment environments to warrant the use of OFA but with the proliferation of AI DNNs for IoT, automobiles, robots, etc. their will come a time soon where OFA DNNs and its competition will become much more important.

Comments

Photo Credit(s):