Data compression – Silverton Consulting

Is hardware innovation accelerating – hardware vs. software innovation (round 6)

Posted on December 17, 2020 (updated April 8, 2021) by Ray in Artificial Intelligence, Brain emulation, Cognitive computing, Data compression, Data QoS, Data transmission, Deep Learning, Ethernet, Infiniband, Market dynamics, Neuromorphic, NVMe, Processing performance, Strategic Inflection Points

Something is happening in the IT industry that hasn’t happened in a couple of decades or so: hardware innovation is back. We’ve been covering bits and pieces of it in our hardware vs. software innovation series (see Open source ASICs – HW vs. SW innovation [round 5] post).

Hardware innovation never really went away. Intel, AMD, Apple and others have always worked on new compute chips, and DRAM and NAND have also taken giant leaps over the last two decades. These were all major hardware suppliers. But special purpose chips, non-CPU compute engines, and hardware accelerators had been relegated to the dustbin of history as the CPU giants kept assimilating their functionality into the next round of CPU chips.

And then something happened. It kind of made sense for GPUs to be their own electronics, as these were SIMD architectures, intrinsically different from the SISD, standard von Neumann X86 and ARM CPU architectures.

But for some reason it didn’t stop there. We first started seeing some inklings of new hardware innovation in the AI space, with a number of special purpose DL NN accelerators coming online over the last 5 years or so (see Google TPU, SC20-Cerebras, GraphCore GC2 IPU chip, AI at the Edge Mythic and Syntiant’s IPU chips, and neuromorphic chips from BrainChip, Intel, IBM, others). Again, one could look at these as taking the SIMD model of GPUs in a slightly different direction. It’s probably one reason that GPUs were so useful for AI-ML-DL, but further acceleration was now possible.

But it hasn’t stopped there either. In the last year or so we have seen SPUs (Nebulon Storage), DPUs (Fungible, NVIDIA Networking, others), and computational storage (NGD Systems, ScaleFlux Storage, others) all come online and become available to the enterprise. And most of these are for more normal workload environments, i.e., not AI-ML-DL workloads.

I thought at first these were just FPGAs implementing different logic but now I understand that many of these include ASICs as well. Most of these incorporate a standard von Neumann CPU (mostly ARM) along with special purpose hardware to speed up certain types of processing (such as low latency data transfer, encryption, compression, etc.).

What happened?

It’s pretty easy to understand why non-von Neumann computing architectures should come about. Witness all those new AI-ML-DL chips that have become available. And why these would be implemented outside the normal X86-ARM CPU environment.

But SPUs, DPUs and computational storage all have typical von Neumann CPUs (mostly ARM) as well as other special purpose logic on them.

Why?

I believe there are a few reasons, but the main two are that Moore’s law (every 2 years halving the size of transistors, effectively doubling transistor counts in the same area) is slowing down and Dennard scaling (as you reduce the size of transistors, their power consumption goes down and speed goes up) has almost stopped. Both of these have caused major CPU chip manufacturers to focus on adding cores to boost performance rather than just adding more transistors to the same core to increase functionality.

This hasn’t stopped adding instruction functionality to each CPU, but it has slowed considerably. And single (core) processor speeds (GHz) have reached a plateau.

But what it has stopped is having the real estate available on a CPU chip to absorb lots of additional hardware functionality, which had been the case since the 1980’s.

I was talking with a friend who used to work on math co-processors, like the 8087, 80287, & 80387, that performed floating point arithmetic. But after the 486, floating point logic was completely integrated into the CPU chip itself, killing off the co-processor business.

Hardware design is getting easier & chip fabrication is becoming a commodity

We wrote a post a couple of weeks back talking about an open foundry (see HW vs. SW innovation round 5 noted above) that would take a hardware design and manufacture the ASICs for you for free (or at little cost). This says that the tool chain to perform chip design is becoming more standardized and much less complex. Does this mean that it takes less than 18 months to create an ASIC? I don’t know, but it seems so.

But the really interesting aspect of this is that world class foundries are now available outside the major CPU developers. And these foundries, for a fair but high price, would be glad to fabricate a thousand or a million chips for you.

Yes, your basic state of the art fab probably costs $12B plus these days. But all that has meant is that A) they will take any chip design and manufacture it, B) they need to keep the factory volume up by manufacturing chips in order to amortize the fab’s high price and C) they have to keep their technology competitive or chip manufacturing will go elsewhere.

So chip fabrication is not quite a commodity. But there are enough state of the art fabs in existence to make it seem so.

But it’s also physics

The extremely low latencies that are available with NVMe storage and higher speed networking (100GbE & above) are demanding a lot more processing power to keep up with. And just the physics of how long it takes to transfer data across a distance (e.g., across racks) is starting to consume too much overhead and impact other work that could be done.

When we start measuring IO latencies in under 50 microseconds, there’s just not a lot of CPU instructions and task switching that can go on anymore. Yes, you could devote a whole core or two to this process and keep up with it. But wouldn’t the data center be better served keeping that core busy with normal work and offloading that low-latency, realtime(-like) work to a hardware accelerator that could be executing on the network rather than behind a NIC?
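To put that 50 microsecond figure in perspective, here is a rough back-of-the-envelope sketch. The 3 GHz clock and the 2 microsecond cost of a task switch are my assumptions for illustration, not measurements of any particular CPU or OS.

```python
def cycles_in(latency_us: float, clock_ghz: float) -> float:
    """Clock cycles one core gets in latency_us microseconds (1 us at 1 GHz = 1,000 cycles)."""
    return latency_us * clock_ghz * 1_000

io_budget = cycles_in(50, 3.0)     # cycles available per 50 usec IO (assumed 3 GHz core)
task_switch = cycles_in(2, 3.0)    # assumed cost of a single task switch

print(f"{io_budget:,.0f} cycles per IO")
print(f"{task_switch / io_budget:.0%} of that budget gone per task switch")
```

Under those assumptions a core has about 150,000 cycles per IO, and every task switch eats a few percent of it before any real work gets done.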

So real-time processing has had to become faster. Or rather, the time available to execute the CPU instructions needed to switch tasks and process data in real time, in order to keep up with faster line speeds, is becoming shorter.

So that explains DPUs, smart NICs & SPUs. What about the other hardware accelerator cards?

AI-ML-DL is becoming such an important and data AND compute intensive workload that just like GPUs before them, TPUs & IPUs are becoming a necessary evil if we want to service those workloads effectively and expeditiously.
Computational storage is becoming more widespread because, although data compression can easily be done at the CPU, it can be done faster (less data needs to be transferred back and forth) at the smart drive.
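A small sketch of the "less data transferred back and forth" argument for compression. It only counts bytes crossing the drive interface when data already sitting on a drive gets compressed. The sample data is made up, and the "smart drive" here is hypothetical: it stands for any device that can run the compression where the data lives.

```python
import zlib

# Made-up, highly repetitive log data; real data will compress differently.
raw = b"2020-12-17 sensor=42 status=OK temp=21.5\n" * 10_000
compressed = zlib.compress(raw, 6)

# Compress at the host CPU: read the raw data from the drive, then write
# the compressed result back to it.
host_side_bytes = len(raw) + len(compressed)

# Compress at the (hypothetical) smart drive: the data never leaves the
# drive, only a command does, so no data bytes cross the interface.
drive_side_bytes = 0

print(f"raw {len(raw):,} bytes, compressed {len(compressed):,} bytes")
print(f"host-side compression moves {host_side_bytes:,} bytes")
print(f"drive-side compression moves {drive_side_bytes:,} bytes")
```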

My guess is we haven’t seen the end of this at all. When you open up the possibility of having a long term business model focused on hardware accelerators, there would seem to be a lot of stuff that needs to be done and could be done faster and more effectively outside the core CPU.

There was a point over the last decade where software was destined to “eat the world”. I get a lot of flak for saying that was BS and that hardware innovation is really eating the world. Now that hardware innovation’s back, it seems to be a little of both.

Comments?

Photo Credits:

Cerebras chip, Cerebras (see SC20 post)
Mythic architecture, Mythic computing (see AI at the edge post)
TPU2-iot, Google (see TPU post)
130nm layouts (see Open source ASICs post)
Moore’s law chart – wikipedia, By Max Roser – https://ourworldindata.org/uploads/2019/05/Transistor-Count-over-time-to-2018.png, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=79751151

Where should IoT data be processed – part 1

Posted on August 13, 2019 by Ray in Data analytics, Data compression, Data efficiency, Data growth, Data transmission, Decision making, Deep Learning, Distributed computing, Internet of Things, Mobile computing, Networking, Neural network, Robots, Storage, System effectiveness

I was at FlashMemorySummit 2019 (FMS2019) this week and there was a lot of talk about computational storage (see our GBoS podcast with Scott Shadley, NGD Systems). There was also a lot of discussion about IoT and the need for data processing done at the edge (or in near-edge computing centers/edge clouds).

At the show, I was talking with Tom Leyden of Excelero and he mentioned there was a real need for some insight on how to determine where IoT data should be processed.

For our discussion let’s assume a multi-layered IoT architecture, with 1000s of sensors at the edge, 100s of near-edge processing/multiplexing stations, and 1 to 3 core data center or cloud regions. Data comes in from the sensors, is sent to near-edge processing/multiplexing and then to the core data center/cloud.

Data size

Dans la nuit des images (Grand Palais) by dalbera (cc) (from flickr)

When deciding where to process data, one key aspect is the size of the data. This is typically in GB or TB but, given today’s world, can be PB as well. This lone parameter has multiple impacts and can affect many other considerations, such as the cost and time to transfer the data, cost of data storage, amount of time to process the data, etc. All of these sub-factors depend on the size of the data to be processed.

Data size can be the largest single determinant of where to process the data. If we are talking about GB of data, it could probably be processed anywhere from the sensor edge, to near-edge station, to core. But if we are talking about TB the processing requirements and time go up substantially and are unlikely to be available at the sensor edge, and may not be available at the near-edge station. And PB take this up to a whole other level and may require processing only at the core due to the infrastructure requirements.
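To see how quickly data size dominates, here is a quick sketch of transfer times. The 1 Gb/s link speed is just an assumption for illustration, and the calculation ignores protocol overhead and contention.

```python
def transfer_hours(size_bytes: float, link_gbps: float) -> float:
    """Hours to move size_bytes over a link of link_gbps gigabits per second."""
    seconds = size_bytes * 8 / (link_gbps * 1e9)
    return seconds / 3600

for label, size in [("1 GB", 1e9), ("1 TB", 1e12), ("1 PB", 1e15)]:
    print(f"{label}: {transfer_hours(size, 1.0):,.3f} hours at 1 Gb/s")
```

At that assumed speed a GB moves in about 8 seconds, a TB takes over 2 hours and a PB takes roughly 3 months, which is why PB-scale data tends to get processed wherever it already sits.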

Processing criticality

Human or machine safety may depend on quick processing of sensor data, e.g., in a self-driving car, on a factory floor, in flood gauges, etc. In these cases, some amount of data processing (sufficient to ensure human/machine safety) needs to be done at the lowest point in the hierarchy that has the processing power to perform this activity.

This could be in the self-driving car or the factory automation that controls a mechanism. Similar situations would probably apply for any robots and auto pilots. Anywhere an IoT sensor array is used to control an entity that could jeopardize the life of human(s) or the safety of machines, safety level processing would need to be done at the lowest level in the hierarchy.

If processing doesn’t involve safety, then it could potentially be done at the near-edge stations or at the core.

Processing time and infrastructure requirements

Although we talked about this in data size above, infrastructure requirements must also play a part in where data is processed. Yes sensors are getting more intelligent and the same goes for near-edge stations. But if you’re processing the data multiple times, say for deep learning, it’s probably better to do this where there’s a bunch of GPUs and some way of keeping the data pipeline running efficiently. The same applies to any data analytics that distributes workloads and data across a gaggle of CPU cores, storage devices, network nodes, etc.

There’s also an efficiency component to this. Computational storage is all about how some workloads can better be accomplished at the storage layer. But the concept applies throughout the hierarchy. Given the infrastructure requirements to process the data, there’s probably one place where it makes the most sense to do this. If it takes 100 CPU cores to process the data in a timely fashion, it’s probably not going to be done at the sensor level.

Data information funnel

We make the assumption that raw data comes in through sensors, and more processed data is sent to higher layers. This would mean at a minimum, some sort of data compression/compaction would need to be done at each layer below the core.

We were at a conference a while back where they talked about updating deep learning neural networks. It’s possible that each near-edge station could perform a mini-deep learning training cycle and share its learning with the core periodically, which could then send this information back down to the lowest level to be used (see our Swarm Intelligence @ #HPEDiscover post).

All this means that there’s a minimal level of processing of the data that needs to go on throughout the hierarchy between access point connections.

Pipe availability

The availability of a networking access point may also have some bearing on where data is processed. For example, a self driving car could generate TB of data a day, but access to a high speed, inexpensive data pipe to send that data may be limited to a service bay and/or a garage connection.

So some processing may need to be done between access point connections. This will need to take place at lower levels. That way, there would be no need to send the data while the car is out on the road but rather it could be sent whenever it’s attached to an access point.

Compliance/archive requirements

Any sensor data probably needs to be stored for a long time and as such will need access to a long term archive. Depending on the extent of this data, it may help dictate where processing is done. That is, if all the raw data needs to be held, then maybe the processing of that data can be deferred until it’s already at the core and on its way to archive.

However, any safety oriented data processing needs to be done at the lowest level and may need to be reprocessed higher up in the hierarchy. This would be done to ensure proper safety decisions were made. And needless to say, all this data would need to be held.

~~~~

I started this post with 40 or more factors but that was overkill. In the above, I tried to summarize the 6 critical factors which I would use to determine where IoT data should be processed.

My intent is to work through some examples in a part 2 to this post. If there’s any one example that you feel may be instructive, please let me know.

Also, if there are other factors that you would use to determine where to process IoT data, let me know.

Improving floating point

Posted on July 10, 2019 (updated April 8, 2021) by Ray in Artificial Intelligence, Data compression, Data precision, Deep Learning, Energy efficiency, Processing performance, Strategic Inflection Points

Read a post this week on Reddit pointing to an article from The Next Platform (New approach could sink floating point computation). It was all about changing the IEEE floating point format to something better called posits, which were designed by noted computer architect John Gustafson, et al. (see their paper Beating floating point at its own game: Posit arithmetic, for more info).

The problems with standard floating point have been known since it was first defined, in 1985, by the IEEE. As you may recall, an IEEE 754 floating point number has three parts: a sign, an exponent and a mantissa (fraction or significand part). The sign bit makes the number as a whole positive or negative, and the exponent, which is stored with a bias, can be negative as well.

IEEE defined floating point numbers

The IEEE 754 standard defines the following formats (see Floating-point arithmetic, for more info):

Half precision floating point (added in 2008) has 1 sign bit (for the significand or mantissa), 5 exponent bits (indicating 2**-14 to 2**+15 for normal numbers) and 10 significand bits for a total of 16 bits.
Single precision floating point has 1 sign bit, 8 exponent bits (indicating 2**-126 to 2**+127) and 23 significand bits for a total of 32 bits.
Double precision floating point has 1 sign bit, 11 exponent bits (2**-1022 to 2**+1023) and 52 significand bits for a total of 64 bits.
Quadruple precision floating point has 1 sign bit, 15 exponent bits (2**-16,382 to 2**+16,383) and 112 significand bits for a total of 128 bits.
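To make the single precision layout above concrete, here is a minimal sketch that pulls the three fields out of a number using Python's standard struct module. The function name is mine and is only for illustration.

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Split x, stored as IEEE 754 single precision, into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 stored significand bits
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(-6.5)   # -6.5 = -1.625 * 2**2
print(sign, exponent - 127, 1 + fraction / 2**23)  # 1 2 1.625
```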

I believe Half precision was introduced to help speed up AI deep learning training and inferencing.

Some problems with the IEEE standard include: it supports -0 and +0, which have different representations, as well as -∞ and +∞, and it reserves a large number of bit patterns for Not-a-Numbers or NaNs, which mark invalid results. So when performing IEEE standard floating point arithmetic, one needs to check whether a result is a NaN, which would make it an invalid result, and must be wary when comparing values such as -0, +0, -∞, +∞ and NaN because, sigh, they don’t behave like ordinary numbers: -0 and +0 compare as equal despite having different bit patterns, and a NaN is not equal to anything, itself included.
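These special values are easy to see for yourself. Python floats are IEEE 754 double precision, so a few lines with the standard math and struct modules show them (the helper name is mine):

```python
import math
import struct

def bits64(x: float) -> str:
    """Hex of the 64-bit IEEE 754 pattern behind a Python float."""
    return struct.pack(">d", x).hex()

print(bits64(0.0), bits64(-0.0))         # 0000000000000000 8000000000000000
print(0.0 == -0.0)                       # True, despite the different bits

nan = float("nan")
print(nan == nan)                        # False, a NaN never equals anything
print(math.isnan(math.inf - math.inf))   # True, an invalid operation yields a NaN
print(math.inf == -math.inf)             # False
```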

Posits to the rescue

It’s all a bit technical (read the paper to find out) but posits don’t support -0 and +0, just 0 and there’s no -∞ or +∞ in posits either, just ∞. Posits also allow for a variable number of exponent bits (which are encoded into Regime scale factor bits [whose value is determined by a useed factor] and Exponent scale factor bits) which means that the number of significand bits can also vary.

So, with a 32 bit, single precision posit, the number range represented can be quite a bit larger than single precision floating point. Indeed, with the approach put forward by Gustafson, a single 32 bit posit has more numeric range than a single precision IEEE 754 float and about half as much range as a double precision IEEE floating point number, but only uses 32 bits.
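Here is a rough sketch of where that range comes from. As I read the paper, useed = 2**(2**es), where es is the number of exponent bits, and the largest posit is useed**(nbits - 2), with the smallest positive posit being its reciprocal. The es values below are parameters I picked to illustrate; check the paper for what the authors recommend at each size.

```python
def posit_max_exp2(nbits: int, es: int) -> int:
    """Power of two of the largest posit: maxpos = useed**(nbits - 2), useed = 2**(2**es)."""
    return (nbits - 2) * (2 ** es)

FLOAT32_MAX_EXP2 = 127   # largest normal IEEE 754 single precision exponent

for es in (2, 3):        # assumed exponent sizes, for illustration only
    e = posit_max_exp2(32, es)
    print(f"32-bit posit, es={es}: about 2**-{e} to 2**+{e}")
print(f"32-bit IEEE float: normal numbers up to about 2**+{FLOAT32_MAX_EXP2}")
```

Note how sensitive the range is to es: each extra exponent bit doubles the span of powers of two, at the cost of significand bits.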

Presently, there are no commercial hardware implementations of posits, but there’s a lot of interest, mostly because the same number of bits can represent a lot more numeric range than equivalently sized IEEE 754 floats. And for HPC environments, AI deep learning applications, scientific computing, etc., having more numeric range (or precision) in less space means they can jam more data in the same storage, transfer more data over the same networking bandwidth and save more numbers in limited amounts of DRAM.

Although commercial implementations do not exist, there have been some FPGA simulations of posit floating point arithmetic. Those simulations have shown it to be more energy efficient than IEEE 754 floating point arithmetic for the same number of bits. So, you need to add better energy efficiency to the advantages of posit arithmetic.

Is it any wonder that HPC/big science (weather prediction, Square Kilometer Array, energy simulations, etc.) and many AI hardware accelerator chip designers are examining posits as a potential way to boost precision, reduce storage/memory footprint and reduce energy consumption?

~~~~

Yet, standards have a way of persisting. Just look at how long the QWERTY keyboard has lasted. It was originally designed in the 1870’s to slow down typing and reduce jamming, when typewriters were mechanical devices. But ever since 1936, when the Dvorak keyboard was patented, there have been much better layouts for keyboards. And there’s no arguing that the Dvorak keyboard is better for typing on non-mechanical typewriters. Yet today, I know of no computer vendor that ships Dvorak labeled keyboards. Once a standard becomes set, it’s very hard to dislodge.

Comments?

Photo Credit(s):

From Geek for Geeks IEEE Standard 754 Floating Point Numbers article
Figure 5 from Beating Floating Point… paper, by J. Gustafson et al
Figure 1 from Beating Floating Point… paper, by J. Gustafson et al
Figure 4 from Beating Floating Point… paper, by J. Gustafson et al
Figure 7 from Beating Floating Point… paper, by J. Gustafson et al

MIT’s new Navion chip for better Nano drone navigation

Posted on June 21, 2018 by Ray in Artificial Intelligence, Data compression, Drones, Strategic Inflection Points, Visionary leadershp

Read an article this week in Science Daily (Chip upgrade helps bee-sized drones navigate) about a recent chip created by MIT, called Navion, that reduces size and power consumption for electronics used in drone navigation. The chip is also documented on MIT’s Navion project homepage and in a technical paper describing the new VIO (Visual-Inertial Odometry) Navion chip.

The Navion chip can perform inertial measurement at 52 kHz as well as process video streams of 752×480 stereo images at 171 frames per second, in a 20 mm² package consuming only 24 mW of power. The chip was fabricated on a 65nm CMOS process line.

Navion is the result of a collaborative design process which optimized electronics required to perform drone navigation processing. By placing all the memory required for inertial measurement and image analysis and all the processing hardware on the same chip, they have substantially reduced power consumption and space requirements for drone navigation.

Navion architecture

Navion uses a state of the art, non-linear factor graph optimization algorithm to navigate in space. It doesn’t sound like DL neural net image recognition but more like a statistical/probabilistic approach to image mapping and place estimation. The chip uses image compression, two stage memory, and sparse linear solver memory to reduce image processing memory requirements from 3.5MB to less than 1MB.

The chip uses 3 inputs, two images (right & left image) and an IMU (inertial measurement unit) sensor, and has one (complex) output: its estimate of the current state of where it is on the map.

Navion processing creates and maintains a 3D map using stereo images and provides navigational support to move through that space. According to the paper, the Navion chip updates the state(s) and sparse 3D map at a KF (keyframe) rate of between 16 and 90 fps. Navion also offers configuration options to maximize accuracy, throughput or energy efficiency.

Navion compares well to other navigation electronics

The table shows comparisons of the Navion chip against other traditional navigational systems that use Xeon, ARM or FPGA chips. As far as I can tell it’s either much better or at least on a par with these other larger, more complex, power hungry systems.

Nano drones are coming to our space, sooner than anyone expects.

Comments?

Huawei presents OceanStor architecture at SFD15

Posted on May 21, 2018 by Ray in Block Storage, Clustered storage, Data compression, Data reduction, IOPS, LRT, NVMe storage, SPC-1, SSD storage, Storage architecture, Storage Features, Storage performance

At Storage Field Day 15 (SFD15) we had a few sessions with Huawei on some of their latest storage technology. One of the sessions I was particularly interested in was an architectural deep dive on OceanStor Dorado (enterprise class, block storage) with Chun Liu (see video here).

Their latest OceanStor Dorado 18000F storage system, due out soon, can scale up to 16 controllers in a cluster, supporting all flash storage configurations. The new Dorado 18000F block storage system supports inline compression and deduplication for data reduction.

The latest SPC-1 performance showed 800K IOPS at 500usec response time with dedupe and inline compression turned on. Although, it’s unclear whether SPC-1 data is deduplicable or compressible. So this may have hurt them with no corresponding advantage in capacity or cost.

System architecture

Chun had one chart that said historically, as you add storage system features, you often lose 70-80% of performance. However, with their implementation using shards of metadata/other data structures and not using (as much) serialization, they have managed to add features without serious performance impact. In fact with the latest architecture, using RAID-TP (3 parity), inline compression, inline deduplication and metro cluster, they lose only about 20% of their baseline system performance. Although if the metro cluster they’re using is synchronous replication, it must not be that far away.

They have a pretty standard protocol layer at the top, with replication, snapshot and LUN management below that and a cache layer next. Then it gets interesting: they have a distributed object router layer, with deduplication/compression and metadata management underneath that, and then the data layout, with infrastructure (backend) at the bottom and inter-cluster communications that span the cluster of controllers. Every enclosure has 2 controllers and inter-cluster communications are over switched PCIe. SSDs can be NVMe or SAS.

IO without serialization

They support a log structured file system on the back end, but not just one log. Their internal architecture is a shared-nothing approach which shards metadata, fingerprint databases, logs, and other data. Each of these shards is assigned a CPU core/thread affinity and, as long as nothing goes wrong, the storage code operates on shards with no serialization required.

To maximize IO performance they use a lightweight thread (LWT) compute model that’s non-preemptive. They partition all data structures into fine shards, and each metadata shard is assigned a core/thread affinity. That way they can share nothing across compute threads, resulting in lock-free execution. The LWT runs beginning to end, without preemption, to complete any data updates required and minimize any contention.
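Here is a minimal sketch of the shard ownership idea, and it is not Huawei's code: every shard's table is touched by exactly one worker, so the table itself needs no lock, and each update runs to completion before the next one starts. All the names are hypothetical. Python threads are not pinned to cores, and Python's queue does take a lock internally, so this only illustrates the shared-nothing routing, not real lock-free performance.

```python
import queue
import threading
import zlib

NUM_SHARDS = 4  # hypothetical; think one shard per core/thread

class Shard:
    """Metadata owned by exactly one worker thread, so no lock guards self.table."""

    def __init__(self) -> None:
        self.inbox: queue.Queue = queue.Queue()
        self.table: dict[str, bytes] = {}
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self) -> None:
        while True:
            item = self.inbox.get()
            if item is None:           # shutdown marker
                break
            key, value = item
            self.table[key] = value    # only this thread ever writes here

shards = [Shard() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    """Stable mapping of a key to the shard (and worker) that owns it."""
    return zlib.crc32(key.encode()) % NUM_SHARDS

def put(key: str, value: bytes) -> None:
    shards[shard_for(key)].inbox.put((key, value))

for i in range(100):
    put(f"lun0:block{i}", b"metadata")
for s in shards:                       # drain and stop the workers
    s.inbox.put(None)
    s.worker.join()
print(sum(len(s.table) for s in shards))  # 100
```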

IO flow

Write flow: the system receives data in cache, mirrors it to the adjacent controller’s cache and then responds back to the host. Controller cache is battery backed up, non-volatile storage.

The cache data is then compressed and, with deduplication active, fingerprinted. Data fingerprints are used to determine which fingerprint database shard (and subsequent core/thread) to route the data to for further processing. They also compare any matched fingerprinted data to the unique data already stored, because of their “weak” fingerprint hash. If the data is unique, it’s routed to the LUN mapping shard (and subsequent core/thread) to calculate a physical address to write the data. Sometime later the data is routed to RAID aggregation and written out to backend SSDs.
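The weak fingerprint plus compare step can be sketched in a few lines. This is my own illustration of the general technique, not Huawei's implementation: CRC32 stands in for whatever weak hash they actually use, and the class and method names are made up.

```python
import zlib

class DedupStore:
    """Toy dedup index: a weak fingerprint finds candidates, a byte compare confirms."""

    def __init__(self) -> None:
        self.by_fingerprint: dict[int, list[int]] = {}  # fingerprint -> block ids
        self.blocks: list[bytes] = []                   # block id -> unique data

    def write(self, data: bytes) -> int:
        """Return the id of the block holding data, storing it only if it is new."""
        fp = zlib.crc32(data)                  # deliberately weak stand-in fingerprint
        for block_id in self.by_fingerprint.get(fp, []):
            if self.blocks[block_id] == data:  # full compare guards against collisions
                return block_id                # duplicate, nothing new is stored
        self.blocks.append(data)
        block_id = len(self.blocks) - 1
        self.by_fingerprint.setdefault(fp, []).append(block_id)
        return block_id

store = DedupStore()
first = store.write(b"A" * 4096)
second = store.write(b"A" * 4096)   # same data, same block id
third = store.write(b"B" * 4096)    # different data, new block
print(first, second, third, len(store.blocks))  # 0 0 1 2
```

The trade-off is that a weak hash is cheap to compute, but every fingerprint match costs a read and compare of the stored data before it can be trusted.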

Read flow: when the request comes in, they check the LUN map shard (core/thread) and if it’s pointing to a fingerprint index they know it’s a deduped block and then read that data to respond to the read request.

Other optimizations

They have some specially designed, optimized code paths. For example, standard RAID TP algorithms perform RAID protection at 2.3GB/s or 4.5GB/s but Huawei OceanStor Dorado 18000F can perform triple RAID calculations at 6.5GB/s. Similarly, standard LZ4 data compression algorithms can compress data at ~507MB/s (on email) but Huawei’s data compression algorithm can perform compression (on email) at ~979MB/s. Ditto for CRC16 (used to check block integrity). Traditional CRC16 algorithms operate at ~2.3GB/s but Huawei can sustain ~7.2GB/s.

For data on SSDs, they identify data with a short life span (quickly overwritten) and try to coalesce this short lived data onto its own flash pages. That way all the data in a short life span flash page gets freed up together, and the page can then be overwritten without having to move old, non-deleted (long lived) data to new blocks. They claim to have reduced write amplification (non-new data block writes) by 60% this way.

Also, LUNs can be configured as throughput optimized or IOPS optimized. It’s unclear how, but it probably has something to do with cache management and backend layout.

~~~~

Overall, I was impressed with their capabilities to reduce serialization bottlenecks. Back in the old days, when I was looking for how to optimize code, we always seemed to be spending 30-50% of CPU compute spinning on locks, waiting to obtain a lock before the system could continue the code execution.

It never occurred to me we didn’t have to use locks at all.

For more information, please read these other SFD15 blogger posts on Huawei:

Dorado – All about Speed – Storage Gaga, Chin-Fah Heoh (@StorageGaga)
Huawei – Probably Not What You Expected, Dan Firth (@PenguinPunk)

Compressing information through the information bottleneck during deep learning

Posted on September 23, 2017 (updated April 8, 2021) by Ray in Artificial Intelligence, Data compression, Machine Learning, Neural network, System effectiveness

Read an article in Quanta Magazine (New theory cracks open the black box of deep learning) about a talk (see 18: Information Theory of Deep Learning, YouTube video) given a month or so ago by Professor Naftali (Tali) Tishby on his theory that all deep learning convolutional neural networks (CNN) exhibit an “information bottleneck” during deep learning. This information bottleneck results in compressing the information present in, for example, an image and only working with the relevant information.

The Professor and his researchers used a simple AI problem (like recognizing a dog) and trained a deep learning CNN to perform this task. At the start of the training process the CNN nodes at the top were all connected to the next layer, and those were all connected to the next layer and so on until you got to the output layer.

Essentially, the researchers found that during the deep learning process, the CNN went from recognizing all features of an image to over time just recognizing (processing?) only the relevant features of an image when successfully trained.

Limits of deep learning CNNs

In his talk the Professor identifies two modes of operations of a deep learning CNN: the encoder layers and decoder layers. The encoder function identifies relevant information in the input and the decoder function takes this relevant information and maps this to an output.

This view results in two statistics that can characterize any deep learning CNN:

Sample complexity, which refers to the mutual information inside the last hidden layer of the encoder function, and
Accuracy or generalization error, which refers to the mutual information inside the last hidden layer of the decoder function.

Where mutual information is defined as how much of the uncertainty of an input is removed when you have an output that is based on that input. (See the talk for a more formal explanation).
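For readers who want something more concrete, the standard textbook definition of mutual information between two discrete variables can be computed directly from their joint probabilities. This sketch shows only that general definition, not the Professor's estimation procedure for network layers, and the function name is mine.

```python
import math

def mutual_information(joint: list[list[float]]) -> float:
    """I(X;Y) in bits, where joint[x][y] is the probability of seeing x and y together."""
    px = [sum(row) for row in joint]          # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi

# Output fully determined by a fair coin-flip input: 1 bit of uncertainty removed.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
# Output independent of the input: nothing learned.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```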

The professor states that any complex deep learning CNN can be characterized by these two statistics where sample complexity determines the number of samples required and accuracy determines the precision by which the deep learning CNN can properly interpret those samples. The deep black line in the chart represents the limits of accuracy achievable at some number of training events, with some number of hidden layers and some sample set.

What happens during deep learning

Moreover, the professor shows an interesting characteristic of all CNNs is that they converge over time in accuracy and that convergence differs based mostly on the number of layers, sample size and training count used.

In the chart, the top row shows 3 CNNs with different amounts of training data (5%, 40% and 80% of total). The chart shows the end result and trace of learning within the CNN over the same number of epochs (training cycles). More training data generates more accurate results.

The Professor views those epochs after the farthest right traces (where the trace essentially starts moving up and to the left in the chart) as the compression phase of deep learning.

Statistics of deep learning process

The professor goes on to characterize the deep learning process by calculating the mean and variance of each layer’s connection weights.

In the chart he shows a standard “eiffel tower” neural network, with 6 hidden layers, each with fewer neurons (nodes) than the previous layer (12 nodes, 10 nodes, 7 nodes, etc.). And what he plots is the average weights and variance between layers (red lines are the mean and variance of the weights for arcs [connections] between nodes in layer 1 and nodes in layer 2, blue lines the mean and variance of weights for arcs between layer 2 and 3, purple lines the mean and variance of weights for arcs between layer 3 and 4, etc.).

He shows that at the start of training the (randomly assigned) weights for each layer have a normalized mean which is higher than its normalized variance. He calls this the high signal to noise phase (I would say the opposite, it’s low signal to noise, more noise than signal). But as training proceeds (over more epochs), there comes a point where the layer mean drops below its variance and the signal to noise ratio changes dramatically. After that point the mean weights and variance of the group of layers start to diverge or move apart.

The phase (epochs) after the line where the weights means are lower than its variance, he calls the Compression phase of the deep layer CNN training.
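As a rough illustration of the per-layer statistics described above, here is a minimal Python sketch. The function names are hypothetical, and the test "absolute mean below variance" is only my reading of the chart; the post does not say how the professor normalizes the values, so raw weights are used here.

```python
from statistics import fmean, pvariance


def layer_weight_stats(layer_weights):
    """Return (mean, variance) of the connection weights between two layers."""
    mean = fmean(layer_weights)
    variance = pvariance(layer_weights, mu=mean)
    return mean, variance


def mean_below_variance(layer_weights):
    """Assumed marker for the compression phase: |mean| has dropped below variance."""
    mean, variance = layer_weight_stats(layer_weights)
    return abs(mean) < variance


# Made-up weights: tightly clustered (mean dominates) vs. widely spread (variance dominates)
clustered = [0.5, 0.5, 0.5, 0.6]
spread = [-1.0, 1.0, -1.0, 1.2]

print(mean_below_variance(clustered))
print(mean_below_variance(spread))
```

Running this over the weights of each layer pair after every epoch would give traces like the ones in the chart, one mean line and one variance line per layer pair.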

The Professor suggests that every complex deep learning CNN looks the same during training if you perform the calculations. The professor shows charts like this for other deep learning CNNs used on different problems and they all exhibit some point where their means are lower than their weights after which means and variances between layers starts to differentiate.

Do layer counts and sample size matter?

It turns out that the more hidden layers you have, the sooner (with less training) you begin the compression phase. This chart shows the same problem with different hidden layer counts. One can see in the traces that not only is accuracy improved with more layers, but the network also reaches the compression phase more quickly.

Using his sample complexity and accuracy statistics, the Professor has also shown that there are limits to the accuracy of any deep learning CNN as a function of layer counts, sample size and training event counts.

~~~~

As far as I know, the Professor and his team are the first to try to characterize and understand what happens during deep learning. In doing so, he has shown that the number of layers and the number of samples can be used to predict the speed of learning and, ultimately, how accurate any deep learning CNN can be.

Comments?

Dreaming of SCM but living with NVDIMMs…

Posted on December 8, 2016 (updated December 16, 2016) by Ray in Block Storage, Data compression, File Storage, storage class memory

Last month's GreyBeards on Storage podcast was with Rob Peglar, CTO and Sr. VP of Symbolic IO. Most of the discussion was on their new storage product but what also got my interest is that they are developing their storage system using NVDIMM technologies.

In the past I would have called NVDIMMs NonVolatile RAM but with the latest incarnation it’s all packaged up in a single DIMM and has both NAND and DRAM on board. It looks a lot like 3D XPoint but without the wait.

The first time I saw similar technology was at SFD5 with Diablo Technologies and SanDisk, a Western Digital company (videos here and here). At that time they were calling them UltraDIMM and memory class storage. UltraDIMMs had an onboard SSD and DRAM and they provided a sort of virtual memory (paged) access to the substantial (SSD) storage behind the DRAM page(s). I wrote two blog posts about UltraDIMM and MCS (called MCS, UltraDIMM and memory IO, the new path ahead part1 and part2).

NVDIMM defined

NVDIMMs are available today from Micron, Crucial, NetList, Viking, and probably others. With today’s NVDIMM there is no large SSD (as in UltraDIMMs), just backing flash, and the complete storage capacity is available from the DRAM in the NVDIMM. At power reset, the NVDIMM acts somewhat like virtual memory, paging in data from the flash until all the data is in DRAM.

NVDIMM hardware includes control logic, DRAM, NAND and SuperCAPs/Batteries together in one DIMM. DRAM is used for normal memory traffic but in the case of a power outage, the data from DRAM is offloaded onto the NAND in the NVDIMM, using the SuperCAP/Battery to hold up the DRAM memory just long enough to transfer it to flash.

The problem with good, old DRAM is that it is volatile, which means when power is gone so is your data. With NVDIMMs (3D XPoint and other new non-volatile storage class memories also share this characteristic), when power goes away your data is still available and persists across power outages.

For example, Micron offers an 8GB, JEDEC DDR4 compliant, 288-pin NVDIMM that has 8GB of DRAM and 16GB of SLC flash in a single DIMM. Depending on part, it has 14.9-16.2GB/s of bandwidth and 1866-2400 MT/s (million memory transfers/second). Roughly translating that to IOPS, with ~17GB/sec and an 8KB block size, the device should be able to do ~2.1 MIO/s (million IO operations per second [never thought I would need an acronym for that]).
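The back-of-the-envelope conversion above can be checked with a few lines of Python. This is just the arithmetic, assuming every byte of bandwidth goes to 8KB transfers with no protocol or software overhead; the function name is made up for illustration.

```python
def iops_from_bandwidth(bandwidth_bytes_per_sec, block_size_bytes):
    """Upper bound on IO operations/sec if all bandwidth is spent on fixed-size blocks."""
    return bandwidth_bytes_per_sec / block_size_bytes


BANDWIDTH = 17e9  # ~17GB/sec, the rounded figure used in the text

# The text does not say whether 8KB means 8,000 or 8,192 bytes, so try both
for block_size in (8_000, 8_192):
    miops = iops_from_bandwidth(BANDWIDTH, block_size) / 1e6
    print(f"{block_size} byte blocks: {miops:.1f} MIO/s")
```

Either block size definition lands at roughly 2.1 MIO/s, which is where the figure in the text comes from.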

Another thing that makes NVDIMMs unique in the storage world is that they are byte addressable.

Hardware – check, Software?

SNIA has an NVM Programming (NVMP) Technical Working Group (TWG), which has been working to help adoption of the new technology. In addition to the NVMP TWG, there’s pmem.io, SanDisk’s NVMFS (2013 FMS paper, formerly known as DirectFS) and Intel’s pmfs (persistent memory file system) GitHub repository. I couldn’t find any GitHub repository for NVMFS, but both pmem.io and pmfs are well along the development path for Linux.

The TWG identified a three-pronged approach to NVDIMM adoption: crawl, walk, run (see pmem.io blog post for more info).

The Crawl approach uses standard block and file system drivers on Linux to talk to an NVDIMM driver. This way has the benefit of being well tested, well known and widely available (except for the NVDIMM driver). The downside is that you have a full block IO or file IO stack in front of a device that can potentially do 2.1 MIO/s, and that stack is likely to add a lot of overhead, reducing this potential significantly.
The Walk approach uses a persistent memory file system (pmfs?) to directly access the NVDIMM storage using memory mapped IO. The advantage here is that there’s absolutely no kernel code active during an NVDIMM data access. But building a file system or block store up around this may require some application level code.
The Run approach wasn’t described well in the blog post but it seems like SanDisk’s NVMFS approach, which uses both standard NVMe SSDs and non-volatile memory to build a hybrid (NVDIMM-SSD) file system.
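To give a feel for the memory mapped IO used in the Walk approach, here is a minimal Python sketch that updates a few bytes in place through a mapping instead of issuing read/write calls. It runs against an ordinary temporary file, so the page cache is still involved; my assumption is that on a persistent memory file system the same kind of mapping would land directly on the NVDIMM. The function name and demo file are hypothetical.

```python
import mmap
import os
import tempfile


def update_bytes_in_place(path, offset, data):
    """Overwrite len(data) bytes at offset through a memory mapping of the file."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the file must already be non-empty
        with mmap.mmap(f.fileno(), 0) as region:
            region[offset:offset + len(data)] = data  # byte-addressable store
            region.flush()  # ask the OS to push the change to the backing store


# Demo against a 4KB scratch file standing in for persistent memory
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"\0" * 4096)
os.close(fd)

update_bytes_in_place(demo_path, 100, b"NVDIMM")

with open(demo_path, "rb") as f:
    f.seek(100)
    readback = f.read(6)
os.remove(demo_path)
print(readback)
```

Note the update touches just 6 bytes rather than a whole block, which is the byte addressability mentioned earlier.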

Symbolic IO as another run approach?

Symbolic IO computationally defined storage is intended to make use of NVDIMM technology and in the Store [update 12/16/16] appliance version has SSD storage as well in a hybrid NVDIMM-SSD run-like solution. The appliance has a full version of Linux SymCE which doesn’t use a file system or the PMEM library to access the data, it’s just byte addressable storage ~~with a PMEM file system embedded within~~ [update 12/16/16]. This means that applications can use standard Linux file APIs to (directly) reference NVDIMM and the backend SSD storage.

It’s computationally defined because they use compute power to symbolically transform the data, reducing the data footprint in NVDIMM and subsequently in the SSD backing tier. Check out the podcast to learn more.

I came away from the podcast thinking that NVDIMMs are more prevalent than I thought. So, that’s what prompted this post.

Comments?

Photo Credit(s): UltraDIMM photo taken by Ray at SFD5, Architecture picture from pmem.io blog post.

Hedvig storage system, Docker support & data protection that spans data centers

Posted on July 23, 2016 (updated August 1, 2016) by Ray in Block Storage, Cloud storage, Data compression, data protection, Disk storage, File Storage, Object storage

We talked with Hedvig (@HedvigInc) at Storage Field Day 10 (SFD10), a month or so ago and had a detailed deep dive into their technology. (Check out the videos of their sessions here.)

Hedvig implements a software defined storage solution that runs on X86 or ARM processors and depends on a storage proxy operating in a hypervisor host (as a VM) and storage service nodes. Their proxy and the storage services can execute as separate VMs on the same host in a hyper-converged fashion or on different nodes as a separate storage cluster with hosts doing IO to the storage cluster.

Hedvig’s management team comes from hyper-scale environments (Amazon Dynamo/Facebook Cassandra) so they have lots of experience implementing distributed software defined storage at (hyper-)scale.