Weight Agnostic Neural Networks (WANNs)

Read an article the other day (Neural Networks Can Drive Without [weight] Learning) about a new form of deep learning neural network (NN) that is not dependent on the weights assigned to network nodes. The new NN is called WANN (Weight Agnostic NN). There’s also a scientific paper (on Github, Weight Agnostic Neural Networks) that describes WANNs in more detail.

How WANNs differ from normal NNs

If I understand them properly, WANNs are trained, but instead of assigning weights during training, WANN network architectures (nodes and connections) are modified and optimized to perform well against the training data.

Indeed, most NNs start out with random weights assigned to all network nodes, and then these weights are adjusted through the training cycle until the NN performs well on the training data. But NNs such as these have a structure (# nodes/layer, # layers, connectivity type, etc.), defined by the researcher, that is stable and unchanging during a training-validation cycle. If the NN model is not accurate enough, the researcher has two choices: find better data or change the model’s structure. WANNs start and end with changing the model’s structure.

With WANNs, you start out with a set of NN architectures (# nodes/layer, # layers, connection types, etc.). Each NN architecture is evaluated against the training data with a single shared randomized weight. That shared weight is altered (randomly) for a training pass and the model is evaluated for accuracy.

At the end of a WANN training pass you have a set of evaluation metrics for each model structure. The resultant WANNs are then ranked by performance and complexity. The highest performing networks are then used to create a new population (set) of WANN architectures to be tested, and the process iterates from there. This would presumably continue until accuracy plateaus across a number of shared randomized weights. And this would be the WANN model used for the application.
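Here’s a rough sketch of that search loop as I understand it. The evaluate(), complexity() and mutate() helpers are hypothetical stand-ins for the paper’s actual evaluation, ranking and architecture-mutation operators, so treat this as pseudocode in Python clothing:

```python
import random

def wann_search(initial_population, data, generations=50,
                shared_weights=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    # evaluate(arch, w, data), arch.complexity() and arch.mutate() are hypothetical
    # stand-ins for the paper's evaluation, ranking and mutation operators.
    population = list(initial_population)
    best = None
    for _ in range(generations):
        scored = []
        for arch in population:
            # Every connection in the architecture uses the same shared weight
            scores = [evaluate(arch, w, data) for w in shared_weights]
            mean_score = sum(scores) / len(scores)
            scored.append((mean_score, -arch.complexity(), arch))
        # Rank by mean performance across shared weights, then by simplicity
        scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
        best = scored[0][2]
        # Breed the next population from the best performers
        parents = [arch for _, _, arch in scored[:max(1, len(scored) // 4)]]
        population = [random.choice(parents).mutate() for _ in population]
    return best
```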

Why WANN?

For a normal NN, each node weight would be adjusted automatically and independently at the end of each training batch. There would, of course, be a large number of batches, causing each weight in the NN nodes to be altered (via floating point arithmetic). So the math works out to roughly #nodes × #layers × # of training batches × # of training passes (or epochs) worth of floating point weight updates.

WANNs avoid this inner loop math altogether. Instead they would need to test a model on a number of shared random weights. This would presumably be done after a complete training pass (each epoch). And even if you had the same number of WANN models as nodes in a normal NN, the computations would be much less, something on the order of #models × # of epochs (each training pass [or epoch] could conceivably test a different shared random weight).
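Plugging some made-up numbers into those two back-of-the-envelope estimates (all figures purely illustrative) shows the scale of the difference:

```python
# Back-of-the-envelope comparison using the rough estimates above (illustrative numbers only)
nodes_per_layer, layers = 1_000, 10
batches_per_epoch, epochs = 1_000, 100

normal_nn_weight_updates = nodes_per_layer * layers * batches_per_epoch * epochs
wann_models = nodes_per_layer * layers      # even if there were as many models as nodes
wann_evaluations = wann_models * epochs     # one shared-weight test per model per epoch

print(f"normal NN weight updates: {normal_nn_weight_updates:.1e}")  # ~1.0e+09
print(f"WANN model evaluations:   {wann_evaluations:.1e}")          # ~1.0e+06
```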

Another advantage of WANNs is that they result in simpler, less complex NN models (# nodes, # layers, # of connections, etc.) than normal DL NNs. Simpler NN models could be very useful for IoT applications, where computational power and storage are limited.

The main disadvantage of WANNs is that they aren’t as accurate as normally trained (weight adjusted) NNs. However, once you have a WANN, you can always elect to re-train it in the normal fashion by adjusting weights to gain more accuracy. And doing so would likely get you much closer to the accuracy of a more complex NN model that was trained from the start by altering weights.

WANNs are more like nature

Humans and other mammals (probably avian, aquatic, etc. species as well) seem to be born with certain innate abilities (visual, perceptive, mobility) and with certain habits such as nursing, facial mimicking, hunger-feeding, etc. Presumably these innate abilities and habits are hardwired neuron networks that don’t depend on environmental learning. Something that they are all born with.

Conceivably, WANNs could be considered similar to these hardwired (unlearned) neuron networks. WANNs could be used in a similar fashion to embed certain innate habits and abilities into robots or other automation, which could then be further trained through their interactions with their environment.


The Github paper has an online WANN model widget with a slider where you can alter a shared random weight and see its impact on the operation of the widget. Playing with this, the only weight that seems to have a significant impact on the actions of the widget is zero…

Photo Credit(s): “Neural Connections In the Human Brain” by Image Editor is licensed under CC BY-NC-ND 2.0 

Supercomputing 2019 (SC19) conference

I was at SC19 last week and as always there was lots to see on the expo floor and at the show in general. Two expo booths that I thought were especially interesting were:

  • Zapata Computing – a quantum computing programming-for-hire outfit, and
  • Cerebras – a new AI wafer scale accelerator chip that sported 400K+ cores in a single package.

Zapata Computing, quantum coding for hire

We’ve been on a sort of quantum thread this past month or so (e.g., see our Quantum computing – part 2 and part 1, The race for quantum supremacy posts). Zapata Computing was at the edge of the exhibit floor in a small booth, pretty much just one guy (Michael Warren) and some handouts. The booth must have said something about quantum computing, because I stopped by.

Warren said they have ~20 PhDs from around the world working for them and provide quantum coding for hire. Zapata works with organizations to either get them up to speed on quantum programming or write quantum programs themselves under contract for clients and help run them on quantum computers.

Zapata’s quantum algorithms are designed to run on any type of quantum computer such as ion trap, superconducting qubit, quantum annealers, etc. They also work with Microsoft Azure Quantum, IBM Q, Rigetti, and Honeywell systems to run quantum programs for customers. Notably missing from this list was Google. Honeywell is new to me but seems active in quantum computing.

Zapata has their own Orquestra quantum toolkit. We have discussed quantum software development kits like IBM Q Qiskit previously, but Microsoft has their own QDK and Rigetti has its Forest SDK. So, presumably, Orquestra front ends these other development kits. Couldn’t find anything on Honeywell but it’s likely they have their own development kit as well or make use of others.
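For readers who haven’t seen one, quantum SDK programs are surprisingly compact. Here’s a minimal, generic Qiskit sketch of a 2-qubit Bell-state circuit, shown only to illustrate the kind of circuit-level code these toolkits (and presumably Orquestra, on top of them) deal in; it is not Orquestra code:

```python
from qiskit import QuantumCircuit

# Build a 2-qubit Bell-state circuit: superposition on qubit 0, then entangle qubit 1
qc = QuantumCircuit(2, 2)
qc.h(0)                      # Hadamard puts qubit 0 into superposition
qc.cx(0, 1)                  # CNOT entangles qubit 1 with qubit 0
qc.measure([0, 1], [0, 1])   # measure both qubits into classical bits
print(qc.draw())             # ASCII diagram of the circuit
```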

In talking to Warren at the show, Zapata is working to come up with a quantum computing cloud, which could be used to run quantum code on any of these quantum computers with the click of a button. Warren sounded like this was coming out soon.

Some of the Zapata Computing quantum programs they have developed for clients include logistics simulations, materials design, chemistry simulations, etc.

Warren didn’t mention the cost of running on quantum computers but he said that some companies are more forthright with pricing than others. It seemed Rigetti had a published price list to use their systems but others seemed to want to negotiate price on a per use basis.

It seems only a matter of time before quantum computing becomes just like GPUs. Just another computational accelerator that works well for some workloads but not others. Zapata Computing and Orquestra are just steps along this path.

Cerebras

AI accelerator chips have also been a hot topic for us (see our posts on Google TPU, GraphCore’s system, and the Mythic’s and Syntiant’s AI accelerators). But none, with the possible exception of GraphCore, has taken this to quite the same level as Cerebras.

Cerebras offers a wafer scale chip that is embedded into their CS-1 system. The chip has 400K cores, 18GB of (very fast) SRAM (memory), 100Pb/sec (peta-bits or 10**15 bits per second) of bandwidth and draws ~20kW. Their CS-1 system fits in a standard rack taking up 15U of space.

The on-chip fabric is called SWARM and supports a 2D mesh. The SWARM mesh is entirely configurable, to support optimal neural network connectivity. I assume this means the mesh routing can be configured so that any core can communicate with any other core on the chip.

The high speed on-chip SRAM supports up to 9PB/sec of memory bandwidth and can be accessed in a single clock cycle. They call the cores Sparse Linear Algebra Compute (SLAC) cores and say that they are optimized to support ML-DL computations, which we assume means floating point arithmetic.

Although you can’t really see the (wafer scale) chip in the picture above, it’s located in the section between the copper plate and the copper heat sink and starts at the copper line between the two. CS-1 consumes a lot of power and much of its design is devoted to providing proper cooling. One can view some of that on the left side of the picture above.

As for software, Cerebras CS-1 supports TensorFlow and PyTorch as well as standard C++. Their Cerebras Software Platform stack consists of two layers: the Cerebras Intermediate Representation and the Cerebras Graph Compiler (CGC), which feeds their Cerebras Wafer Scale Engine (WSE). The CGC maps neural network nodes to cores on the WSE and probably configures SWARM to provide NN core to NN core connectivity.

It’s great to see hardware innovation again. There was a time when everyone thought that software alone was going to kill off hardware innovation. But the fact is that both need to innovate to take computing forward. Cerebras didn’t tell me a PetaFlop rate for their system, but my guess is it would beat out the 2PFlop GraphCore2 (GC2) system, and it’s only a matter of time before GC3 comes out. That being said, what could be beyond wafer scale integration?

~~~~

I enjoy going to SC19 for all the leading edge technology on display. They have some very interesting cooling solutions that I don’t ever see anywhere else. And the student competition is fun: teams of students running HPC workloads around the clock, on donated equipment, from Monday evening until Wednesday evening, with spurious faults injected (by SC19) to see how they and their systems react and continue to perform the work needed.

For every SC conference, they create an SCinet to support the show. This year it supported Tb/sec of bandwidth and the WiFi for the floor and conference. All the equipment and time that goes into creating SCinet is donated.

Unfortunately, I didn’t get a chance to go to keynotes or plenary sessions. I did attend one workshop on container use in HPC and it was completely beyond me. Next year’s SC20 will be in Atlanta.

Photo Credit(s):

Cambrian Explosion of AI DL apps in industry and the world

I was at the NetApp Insight conference last week and recorded a podcast (see: GreyBeards Podcast) on what NetApp is doing in the AI DL (Deep Learning) space. On the podcast, we talked about a number of verticals that were deploying AI DL right now and using it to improve outcomes.

It was only in 2012 that AI DL broke out and pretty much conquered the speech recognition contest by improving recognition accuracy by leaps and bounds. Prior to that, improvements had been very small and incremental at best. Here we are, just 7 years later, and AI DL models are proliferating across industry and every other sector of the world economy.

DL applications in the real world

At the show, we talked about AI DL models being used in healthcare (radiological image analysis, cell counts for infection assessments), automotive (self driving cars), financial services (fraud detection), and retail (predicting how makeup would look on someone).

And earlier this year, at HPE Discover, they discussed a new technique to share the benefits of training data while still keeping that data private. In this case, they use blockchain technology to publish and share DL neural network model weights and other hyper-parameters trained for some real world purpose.

Customers download and use the model in their day to day activities, but record the data that their model analyzes and its predictions. They use this data to update (re-train) their DL neural net. They then publish their new neural net model weights and other parameters to all the other customers. Each customer of the model does the same, updating (re-training) their DL neural net.

At some point an owner or global model arbitrator takes all these individual model updates, aggregates the neural net weights into a new neural net model, and publishes the new model. And then the process starts over again. In this way, training data is never revealed and is kept secure and private, but the DL model updates that result from re-training the model on secured private data are available to every customer.
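A minimal sketch of what that aggregation step might look like, assuming simple per-layer averaging with numpy (the actual scheme HPE described, and any blockchain bookkeeping, aren’t shown):

```python
import numpy as np

def aggregate_weights(published_weight_sets):
    """Average per-layer weights published by each customer into a new global model."""
    n_layers = len(published_weight_sets[0])
    return [np.mean([ws[i] for ws in published_weight_sets], axis=0)
            for i in range(n_layers)]

# Hypothetical usage: three customers each publish two layers of re-trained weights
customers = [[np.random.randn(4, 8), np.random.randn(8, 2)] for _ in range(3)]
global_weights = aggregate_weights(customers)
```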

Recently, there’s been a slew of articles across many different organizations that show how AI DL is being adopted to work in different areas:

And that’s just a sample of the last few weeks of AI DL activity.

Next Steps

All it takes is data that can be quantified and classified. With data and classifications in hand, anyone can train a DL model that performs that classification. It doesn’t require GPU farms; decent CPUs are up to the task for TB of data.

But if you want better prediction/classification accuracy, you will need more data, which means longer AI DL training runs. So at some point, maybe at >100TB of data, or if you use AI DL training a lot, you may want that GPU farm.

The Deep Learning with Python book (my favorite) has a number of examples, such as sentiment analysis of text, median real estate pricing predictions, and generating text that looks like an author’s work, with maybe a dozen more that one can use to understand AI DL technology. But it’s not rocket science; I believe any qualified programmer could do it, with some serious study.
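As an illustration of how approachable this is, here’s a minimal sketch along the lines of the book’s IMDB sentiment analysis example (multi-hot encoded reviews fed to a small dense network). The hyper-parameters are illustrative, not tuned:

```python
import numpy as np
from tensorflow import keras

# Load the IMDB reviews, keeping only the 10,000 most frequent words
(train_x, train_y), _ = keras.datasets.imdb.load_data(num_words=10000)

def multi_hot(sequences, dim=10000):
    out = np.zeros((len(sequences), dim))
    for i, seq in enumerate(sequences):
        out[i, seq] = 1.0            # mark each word index present in the review
    return out

x = multi_hot(train_x)
y = np.asarray(train_y).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(10000,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # positive vs. negative review
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=4, batch_size=512, validation_split=0.2)
```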

So the real question is: what are you doing with your data to make use of AI DL models now?

I suppose the other question ought to be, how can you collect more data and classification information, to train more AI DL models?

~~~~

It’s great to be in the storage business.

Photo Credit(s):

Quantum computing NNs

As many who have been following our blog know, AI, Machine Learning (ML) and Deep Learning (DL) (e.g., see our Learning machine learning – part 3, Industrial revolution deep learning, NVIDIA’s 3U supercomputer, and AI reaches a crossroads posts) have become much more mainstream, and AI has anointed DL as the best approach for pattern recognition, classification, and prediction, though it has applicability beyond that.

One problem with DL has been its energy costs. There have been some approaches to address this, but none have been entirely successful just yet (e.g., see our Intel new DL Boost, New GraphCore GC2 chips, and AI processing at the edge posts). At one time neuromorphic hardware was the answer, but I’ve become disillusioned with that technology over time (see our Are neuromorphic chips a dead end post).

This past week we learned of a whole new approach, something called a Quantum Convolutional NN or QCNN (see PhysOrg Introducing QCNN, pre-print of Quantum CNNs, presentation deck on QCNNs, Nature QCNN paper paywall).

Some of you may not know that convolutional neural networks (ConvNets) are the latest in a long line of DL architectures focused on pattern recognition and data classification. DL ConvNets can be used to recognize speech, classify photo segments, analyze ticker tapes, etc.

But why quantum computing

First off, quantum computing (QC) is a new leading edge technology targeted to solving very hard (NP Complete, wikipedia) problems, like cracking Public Key encryption keys, solving the traveling salesperson problem and assembling an optimum Bitcoin block problem (see List of NP complete problems, wikipedia).

QC utilizes quantum mechanical properties of the universe to solve these problems without resorting to brute force searches, such as, going down every path in the traveling salesmen problem (see our QC programming and QC at our doorsteps posts).

At the moment, IBM, Google, Intel and others are all working on QC and trying to scale it up by increasing the number of qubits (quantum bits) their systems support. The more qubits, the more quantum storage you have, and the more sophisticated the NP complete problems one can solve. Current qubit counts include: 72 qubits for Google, 42 for Intel, and 50 for IBM. Apparently not all qubits are alike, and they don’t last very long, ~100 microseconds (see Timeline of QC, wikipedia).

What’s a QCNN?

What’s new is the use of quantum computing circuits to create ConvNets. Essentially the researchers have created a way to apply AI DL (ConvNet) techniques to quantum computing data (qubits).

Apparently there are QC [qubit] phases that need to be recognized, and what better way to do that than to use DL ConvNets. The only problem is that performing DL on QC data with today’s tools would require reading out the phase (a pattern recognition problem), converting it to digital data, and then processing it via CPU/GPU DL ConvNets, a classic chicken or egg problem. But with QCNNs, one has a DL ConvNet implemented entirely in QC.

DL ConvNets are typically optimized for a specific problem, varying layer counts, nodes/layer, node connectivity, etc. QCNNs match this and also come in various sizes. Above is a QCNN circuit, optimized to recognize the phase (joining?) of two sets of symmetry-protected topological (SPT) states (see pre-print article).

I won’t go into the QC technology used in any detail (as I barely understand it), but the researchers have come up with a way to map DL ConvNets into QC circuitry. Assuming this all works, one can then use QC to perform DL pattern recognition on qubit data.

~~~~

Comments?

Photo Credits:

Shedding light on all optical neural networks

Read a couple of articles in the past week or so on all optical neural networks (see All optical neural network (NN) closes performance gap with electronic NN and New design advances optical neural networks that compute at the speed of light using engineered matter).

All optical NN solutions operate faster and use less energy for inferencing than standard all electronic ones. However, in reality they are more of a hybrid solution, as they depend on the use of standard ML DL to train a NN. They then use 3D printing and other lithographic processes to create a series of diffraction layers for an all optical NN that matches the trained NN.

The latest paper (see: Class-specific Differential Detection in Diffractive Optical Neural Networks Improves Inference Accuracy) describes a significant advance beyond the original solution (see: All-Optical Machine Learning Using Diffractive Deep Neural Networks, Ozcan’s original paper).

How (all optical) Diffractive Deep NNs (DDNNs) work for inferencing

In the original Ozcan discussion, a DDNN consists of a coherent light source (laser), an image, a bunch of refractive and reflective diffraction layers, and photo detectors. Each neural network node is represented by a point (pixel?) on a diffractive layer. Node to node connections are represented by light paths moving through the diffractive layer(s).

In Ozcan’s paper, the light flowing through each diffraction layer is modified and passed on to the next diffraction layer. This modification of the light as it passes through the diffraction layer is the optical equivalent of the node’s multiplicative coefficient (the FP weight) in the trained NN.

The challenge previously had been fabricating the diffraction layers, which took a lot of hand work. But with the advent of 3D printing and other lithographic techniques, creating a diffraction layer is nowadays relatively easy to do.

In DDNN inferencing, one exposes (via a coherent beam of light) the first diffraction layer to the input image data; that image is transformed into a different light pattern, which is sent on to the next layer. At some point the last diffraction layer converts the light hitting it into classification patterns, which are then detected by photo detectors. Alternatively, the classification pattern can be sent down an all optical computational path (see our Photonic computing sees the light of day post and Photonic FPGAs on the horizon post) to perform some further function.
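I can’t reproduce the authors’ simulations, but here’s a toy numpy sketch of the idea: free-space propagation between layers (the standard angular spectrum method), with each trained layer applying a phase shift to the light, and photo detectors measuring intensity at the end. The function names and parameters are mine, and the phase patterns are assumed to come from the electronic training step:

```python
import numpy as np

def propagate(field, distance, wavelength, pixel_size):
    """Free-space propagation between layers via the angular spectrum method (toy model)."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pixel_size)
    FX, FY = np.meshgrid(fx, fx)
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    # Transfer function of free space; evanescent components are dropped
    H = np.where(arg > 0,
                 np.exp(2j * np.pi * (distance / wavelength) * np.sqrt(np.maximum(arg, 0.0))),
                 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * H)

def ddnn_inference(input_image, phase_layers, distance, wavelength, pixel_size):
    """Pass a coherent field through a stack of (already trained) phase-only layers."""
    field = input_image.astype(complex)          # the input image modulates the laser beam
    for phase in phase_layers:
        field = propagate(field, distance, wavelength, pixel_size)
        field = field * np.exp(1j * phase)       # each layer shifts the phase of the light
    field = propagate(field, distance, wavelength, pixel_size)
    return np.abs(field) ** 2                    # photo detectors measure intensity
```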

In the original paper, they showed results of a DDNN for a completely connected, 5 layer NN, with 0.2M neurons and 8B connections in total. They also showed results from a sparsely connected, 5 layer NN, with 0.45M neurons and <0.1B connections.

Note that there are significant power advantages in exposing an image to a series of diffraction gratings and detecting the classification using a photo detector vs. an all electronic NN, which takes an image, uses photo detectors to convert it into an electrical (pixel series) signal, and then processes it through NN layers, performing FP arithmetic at each layer node until one reaches the classification layer.

Furthermore, the DDNN operates at the speed of light. The all electronic network operates at roughly FP arithmetic speed times the number of layers, and that is only if each layer can be done entirely in parallel (with GPUs and 1000s of computational engines). If it can’t be done in parallel, one would need to multiply by another factor of the number of nodes in each layer. Let’s just say this is much slower than the speed of light.

Improving DDNN accuracy

The team at UCLA and elsewhere took on the task to improve DDNN accuracy by using more of the optical technology and techniques available to them.

In the new approach they split the image optical data path to create a positive and a negative classifier, and use a differential classifier engine as the last step to determine the image’s classification.

It turns out that the new DDNN performed much better than the original DDNN on standard MNIST, Fashion MNIST and another standard AI benchmark.

DDNN inferencing advantages, disadvantages and use cases

Besides the obvious power efficiencies and speed efficiencies of optical DDNN vs. electronic NNs for inferencing, there are a few other advantages:

  • All optical data paths are less noisy – In an electronic inferencing path, each transformation of an image to a pixel file will add some signal loss. In an all optical inferencing engine, this would be eliminated.
  • Smaller inferencing engine – In an electronic inferencing engine one needs CPUs, memory, GPUs, PCIe busses, networking and all the power and cooling to make it work. For an all optical DDNN, one needs a laser, diffraction layers and a set of photo detectors. Yes, there’s some electronics involved, but not nearly as much as an all electronic NN. And an all electronic NN with 0.5M nodes and 5 layers with 0.1B connections would take a lot of memory and compute to support. Their DDNN to perform this task took up about 9 cm (3.6″) square by ~3 to 5 cm (1.2″-2.0″) deep.

But there’s some problems with the technology.

  • No re-training or training support – there’s almost no way to re-train the optical DDNN without re-fabricating the DDNN diffraction layers. I suppose additional layers could be added on top of or below the bottom layers, sort of like a corrective lens. Also, if there were some way to (chemically) develop diffraction layers during training steps, then it could provide an all optical DL data flow.
  • No support for non-optical classifications – there’s much more to ML DL NN functionality than optical classification. Perhaps if there were some way to transform non-optical data into optical images then DDNNs could have a broader applicability.

The technology could be very useful in any camera, lidar, sighting scope, telescope image and satellite image classification activities. It could also potentially be used in heads up displays to identify items of interest in the optical field.

It would also seem easy to adapt DDNN technology to classify analog sensor data as well. It might also lend itself to use in space, at depth, and in other extreme environments where all electronic NN gear might not survive for very long.

Comments?

Photo Credit(s):

Figure 1 from All-Optical Machine Learning Using Diffractive Deep Neural Networks

Figure 2 from All-Optical Machine Learning Using Diffractive Deep Neural Networks

Figure 2 from Class-specific Differential Detection in Diffractive Optical Neural Networks Improves Inference Accuracy

Figure 3 from Class-specific Differential Detection in Diffractive Optical Neural Networks Improves Inference Accuracy

Where should IoT data be processed – part 1

I was at FlashMemorySummit 2019 (FMS2019) this week and there was a lot of talk about computational storage (see our GBoS podcast with Scott Shadley, NGD Systems). There was also a lot of discussion about IoT and the need for data processing done at the edge (or in near-edge computing centers/edge clouds).

At the show, I was talking with Tom Leyden of Excelero and he mentioned there was a real need for some insight on how to determine where IoT data should be processed.

For our discussion let’s assume a multi-layered IoT architecture, with 1000s of sensors at the edge, 100s of near-edge processing/multiplexing stations, and 1 to 3 core data center or cloud regions. Data comes in from the sensors, is sent to near-edge processing/multiplexing and then to the core data center/cloud.

Data size

Dans la nuit des images (Grand Palais) by dalbera (cc) (from flickr)

When deciding where to process data, one key aspect is the size of the data. Think GB or TB, but in today’s world it can be PB as well. This lone parameter has multiple impacts and can affect many other considerations, such as the cost and time to transfer the data, cost of data storage, amount of time to process the data, etc. All of these sub-factors depend on the size of the data to be processed.

Data size can be the largest single determinant of where to process the data. If we are talking about GB of data, it could probably be processed anywhere from the sensor edge, to the near-edge station, to the core. But if we are talking about TB, the processing requirements and time go up substantially and are unlikely to be available at the sensor edge, and may not be available at the near-edge station. And PB take this up to a whole other level and may require processing only at the core due to the infrastructure requirements.
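As a toy way of expressing that size-based rule of thumb in code (the thresholds here are mine and purely illustrative):

```python
def processing_tier(data_size_tb: float) -> str:
    """Toy rule of thumb from the data-size discussion; thresholds are illustrative only."""
    if data_size_tb < 0.1:       # GB range: could be processed anywhere
        return "sensor edge, near-edge station or core"
    if data_size_tb < 100:       # TB range: too much for most sensors
        return "near-edge station or core"
    return "core data center / cloud"   # PB range: core infrastructure only

print(processing_tier(0.004))    # a few GB
print(processing_tier(4))        # a few TB
print(processing_tier(4000))     # PB scale
```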

Processing criticality

Human or machine safety may depend on quick processing of sensor data, e.g., in a self-driving car, on a factory floor, for flood gauges, etc. In these cases, some amount of data processing (sufficient to insure human/machine safety) needs to be done at the lowest point in the hierarchy that has the processing power to perform this activity.

This could be in the self-driving car or the factory automation that controls a mechanism. Similar situations would probably apply to robots and autopilots. Anywhere an IoT sensor array is used to control an entity that could jeopardize human life or machine safety, safety-level processing needs to be done at the lowest level in the hierarchy.

If processing doesn’t involve safety, then it could potentially be done at the near-edge stations or at the core.

Processing time and infrastructure requirements

Although we talked about this in data size above, infrastructure requirements must also play a part in where data is processed. Yes sensors are getting more intelligent and the same goes for near-edge stations. But if you’re processing the data multiple times, say for deep learning, it’s probably better to do this where there’s a bunch of GPUs and some way of keeping the data pipeline running efficiently. The same applies to any data analytics that distributes workloads and data across a gaggle of CPU cores, storage devices, network nodes, etc.

There’s also an efficiency component to this. Computational storage is all about how some workloads can be better accomplished at the storage layer. But the concept applies throughout the hierarchy. Given the infrastructure requirements to process the data, there’s probably one place where it makes the most sense to do this. If it takes 100 CPU cores to process the data in a timely fashion, it’s probably not going to be done at the sensor level.

Data information funnel

We make the assumption that raw data comes in through sensors, and more processed data is sent to higher layers. This would mean at a minimum, some sort of data compression/compaction would need to be done at each layer below the core.

We were at a conference a while back where they talked about updating deep learning neural networks. It’s possible that each near-edge station could perform a mini deep learning training cycle and share its learning with the core periodically, which could then send this information back down to the lowest level to be used (see our Swarm Intelligence @ #HPEDiscover post).

All this means that there’s a minimal level of processing of the data that needs to go on throughout the hierarchy between access point connections.

Pipe availability

binary data flow

The availability of a networking access point may also have some bearing on where data is processed. For example, a self driving car could generate TB of data a day, but access to a high speed, inexpensive data pipe to send that data may be limited to a service bay and/or a garage connection.

So some processing may need to be done between access point connections. This will need to take place at lower levels. That way, there would be no need to send the data while the car is out on the road; rather, it could be sent whenever it’s attached to an access point.
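To put some rough numbers on why the pipe matters, here’s a quick transfer-time calculation for a day’s worth of self driving car data (the link speeds are just illustrative):

```python
# Rough transfer-time arithmetic for a day's worth of self-driving car data (illustrative)
data_tb = 4
for name, gbps in [("cellular (~0.05 Gb/s)", 0.05),
                   ("1 Gb/s garage link", 1.0),
                   ("10 Gb/s service bay", 10.0)]:
    hours = data_tb * 8_000 / gbps / 3_600   # TB -> gigabits, then seconds -> hours
    print(f"{name}: ~{hours:,.1f} hours to move {data_tb} TB")
```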

Compliance/archive requirements

Any sensor data probably needs to be stored for a long time and as such will need access to a long term archive. Depending on the extent of this data, it may help dictate where processing is done. That is, if all the raw data needs to be held, then maybe the processing of that data can be deferred until it’s already at the core and on its way to archive.

However, any safety oriented data processing needs to be done at the lowest level and may need to be reprocessed higher up in the hierarchy. This would be done to insure proper safety decisions were made. And needless to say, all this data would need to be held.

~~~~

I started this post with 40 or more factors but that was overkill. In the above, I tried to summarize the 6 critical factors which I would use to determine where IoT data should be processed.

My intent is, in a part 2 to this post, to work through some examples. If there’s any one example that you feel may be instructive, please let me know.

Also, if there are other factors that you would use to determine where to process IoT data, let me know.

Improving floating point

Read a post this week on Reddit pointing to an article from The Next Platform (New approach could sink floating point computation). It was all about changing the IEEE floating point format to something better called posits, which were designed by noted computer architect John Gustafson, et al. (see their paper Beating floating point at its own game: Posit arithmetic for more info).

The problems with standard floating point have been known since the format was first defined by the IEEE in 1985. As you may recall, an IEEE 754 floating point number has three parts: a sign, an exponent and a mantissa (fraction or significand part). The sign makes the number negative or positive, and the (biased) exponent can represent negative powers of two, so both very small and very large magnitudes can be encoded.

IEEE defined floating point numbers

The IEEE 754 standard defines the following formats (see Floating-point arithmetic, for more info):

  • Half precision floating point (added in 2008) has 1 sign bit (for the significand or mantissa), 5 exponent bits (covering roughly 2**-14 to 2**+15) and 10 significand bits, for a total of 16 bits.
  • Single precision floating point has 1 sign bit, 8 exponent bits (2**-126 to 2**+127) and 23 significand bits, for a total of 32 bits.
  • Double precision floating point has 1 sign bit, 11 exponent bits (2**-1022 to 2**+1023) and 52 significand bits, for a total of 64 bits.
  • Quadruple precision floating point has 1 sign bit, 15 exponent bits (2**-16382 to 2**+16383) and 112 significand bits, for a total of 128 bits.

I believe Half precision was introduced to help speed up AI deep learning training and inferencing.
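You can sanity check the first three of these from Python with numpy’s finfo:

```python
import numpy as np

# Quick check of the half, single and double precision formats described above
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: {info.bits} bits total, "
          f"{info.nmant} stored significand bits, "
          f"largest finite value {info.max:.3e}")
```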

Some problems with the IEEE standard include: it supports -0 and +0, which have different representations, as well as -∞ and +∞, and it can represent a number of unique Not-a-Numbers or NaNs, which are illegal floating point numbers. So when performing IEEE standard floating point arithmetic, one needs to check whether a result is a NaN, which would make it an illegal result, and must be wary when comparing numbers: -0 and +0 have different bit patterns yet compare equal, while, sigh, a NaN never compares equal to anything, not even itself.
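A quick Python illustration of those comparison quirks:

```python
import math

nan = float("nan")
print(nan == nan)                      # False: a NaN never equals anything, even itself
print(-0.0 == 0.0)                     # True, even though the bit patterns differ
print(math.copysign(1.0, -0.0))        # -1.0: the sign of negative zero is still there
print(float("-inf") == float("inf"))   # False: the two infinities are distinct
print(math.isnan(nan))                 # True: the safe way to test for NaN
```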

Posits to the rescue

It’s all a bit technical (read the paper to find out), but posits don’t support -0 and +0, just 0, and there’s no -∞ or +∞ in posits either, just ∞. Posits also allow for a variable number of exponent bits (which are encoded into regime scale factor bits [whose value is determined by a useed factor] and exponent scale factor bits), which means that the number of significand bits can also vary.
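To make the encoding a little more concrete, here’s a rough sketch of decoding a small posit (8 bits, es = 1) in Python, following the sign / regime / exponent / fraction recipe from the paper. It’s illustrative only, not a reference implementation:

```python
def decode_posit(p, nbits=8, es=1):
    """Decode an unsigned integer holding an nbits-wide posit into a float (sketch)."""
    if p == 0:
        return 0.0
    if p == 1 << (nbits - 1):          # the single 1000...0 pattern
        return float("inf")            # posits have just one (unsigned) infinity
    sign = p >> (nbits - 1)
    if sign:
        p = (-p) & ((1 << nbits) - 1)  # two's-complement negate, then decode the magnitude
    bits = format(p, f"0{nbits}b")[1:] # drop the sign bit, keep the rest as a bit string

    # Regime: a run of identical bits, terminated by the opposite bit (or the end)
    first = bits[0]
    run = len(bits) - len(bits.lstrip(first))
    k = run - 1 if first == "1" else -run

    tail = bits[run + 1:]              # skip the regime and its terminating bit
    e = int(tail[:es].ljust(es, "0") or "0", 2)   # exponent bits, zero-padded if truncated
    frac = tail[es:]
    f = int(frac, 2) / (1 << len(frac)) if frac else 0.0

    useed = 2 ** (2 ** es)             # the regime's scale factor
    value = useed ** k * 2 ** e * (1 + f)
    return -value if sign else value

print(decode_posit(0b01000000))        # 1.0
print(decode_posit(0b01100000))        # 4.0
print(decode_posit(0b01111111))        # 4096.0 (maxpos for 8-bit, es = 1)
```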

So, with a 32 bit, single precision posit, the numeric range represented can be quite a bit larger than with single precision floating point. Indeed, with the approach put forward by Gustafson, a 32 bit posit has more numeric range than a single precision IEEE 754 float and about half as much range as a double precision IEEE floating point number, while only using 32 bits.

Presently, there are no commercial hardware implementations of posits, but there’s a lot of interest. Mostly because the same number of bits can represent a lot more numeric range than equivalently sized IEEE 754 floats. And for HPC environments, AI deep learning applications, scientific computing, etc., having more numeric range (or precision) in less space means they can jam more data into the same storage, transfer more data over the same networking bandwidth, and save more numbers in limited amounts of DRAM.

Although commercial implementations do not exist, there have been some FPGA simulations of posit floating point arithmetic. Those simulations have shown it to be more energy efficient than IEEE 754 floating point arithmetic for the same number of bits. So, you can add better energy efficiency to the advantages of posit arithmetic.

Is it any wonder that HPC/big science (weather prediction, Square Kilometer Array, energy simulations, etc.) and many AI hardware accelerator chip designers are examining posits as a potential way to boost precision, reduce storage/memory footprint and reduce energy consumption?

~~~~

Yet, standards have a way of persisting. Just look at how long the QWERTY keyboard has lasted. It was originally designed in the 1870s to slow down typing and reduce jamming, when typewriters were mechanical devices. But ever since 1934, when the DVORAK keyboard was patented, there have been much better layouts for keyboards. And there’s no arguing that the DVORAK keyboard is better for typing on non-mechanical typewriters. Yet today, I know of no computer vendor that ships DVORAK labeled keyboards. Once a standard becomes set, it’s very hard to dislodge.

Comments?

Photo Credit(s):

Swarm intelligence at #HPEDiscover

I attended HPE Discover conference in Vegas this past week and among all their product announcements, there was a panel discussion on something called Swarm Intelligence. But it was really about collaborative learning.

Swarm Intelligence at HPE is a way for multiple organizations/edge devices to train a model collaboratively. They end up using their own data to train local models but then share their models (actually model node weights) with one another.

In this fashion, if say one hospital specializes in the detection and treatment of pneumonia and another in TB, they could both train a shared model on their respective sets of data. But during training, they share their model weights between them and, after some number of training iterations, end up with a single model that supports detection of both.

How does swarm intelligence work?

To make swarm intelligence work:

  1. All parties have to reach consensus on model hyper-parameters, i.e., type of model (CNN, RNN, LSTM, etc.), number of nodes per layer, number of layers, levels of connections between nodes, etc. So there’s a single model architecture to be trained across all the organizations.
  2. All organization training data needs to be the same type (e.g., X-rays).
  3. After each model training session, all model weights have to be shared with the other participants.
  4. All organizations have to decide on the method used to merge or combine the model weights (e.g., averaging).

In the end, after N training epochs, the combined model would essentially be cross-trained on each organization’s data. But no one shared any data!

Why attempt swarm intelligence

HPE believes swarm intelligence would be a way to avoid having to transmit all that edge data to a central repository, but there are other advantages:

  • A combined model could be trained on more data than any single organization could provide.
  • A combined model would have less organizational bias.

There’s one other possibility, but it’s unclear whether this is legally valid or not: a combined model could, in effect, be trained on data that it didn’t have legal access to.

One problem with the edge is the vast amount of data there

It turns out that a self driving car could generate 4TB of data per day of driving. Moving 4TB a day from all the cars in, say, a major metropolitan area (4 million people with ~1 million cars, of which 20% are on the road each week day) could represent as much as 200K × 4TB, or ~800PB of data/day.

There is not enough bandwidth in a fully 5G world to move that amount of data each day wirelessly and probably not enough bandwidth to move that amount of data over wire.

But if each car were to train its own (self-driving) model each day on its own data and then share that trained model of say 1024 nodes by 1024 layers, it would represent ~1M node weights (floating point numbers), or roughly 4MB of data at 4 bytes per weight. Done effectively, one could have a city’s worth of training data to train your self driving car models.
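Putting some rough (and purely illustrative) numbers on the raw data vs. shared weights comparison above:

```python
# Back-of-the-envelope numbers from the discussion above (all figures illustrative)
cars_on_road = 200_000
raw_data_per_car_tb = 4
raw_total_pb = cars_on_road * raw_data_per_car_tb / 1_000       # ~800 PB/day of raw sensor data

node_weights = 1024 * 1024                                      # ~1M weights per shared model
model_update_mb = node_weights * 4 / 1e6                        # ~4 MB at 4 bytes (FP32) per weight
shared_total_tb = cars_on_road * model_update_mb / 1e6          # ~0.8 TB/day of weight updates

print(f"raw sensor data:      ~{raw_total_pb:,.0f} PB/day")
print(f"shared model updates: ~{shared_total_tb:,.1f} TB/day")
```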

The allure of swarm intelligence/collaborative learning is high. It seems a small cost to reach consensus on the model hyper-parameters, collaborative learning methodology and synchronized training epochs, to create a model trained on multiple organizations’/edge devices’ training data.

HPE discussed using private blockchains to coordinate the sharing of model training across organizations or edge devices and to use the blockchain to compensate organizations for the use of their trained models. Certainly this could work well with edge devices, but it seems an unnecessary complication for collaborating organizations.

Nonetheless, swarm intelligence may just be one way to address some of the serious problems with deep learning today.

Photo Credit(s): “Starling Flock” by Mike Legend is licensed under CC BY-NC-ND 2.0 

“Artificial Intelligence & AI & Machine Learning” by mikemacmarketing is licensed under CC BY 2.0 

“Geese in v-formation, Walberswick” by stephengg is licensed under CC BY-NC-ND 2.0