Compressing information through the information bottleneck during deep learning

Read an article in Quanta Magazine (New theory cracks open the black box of deep learning) about a talk (see 18: Information Theory of Deep Learning, YouTube video) done a month or so ago given by Professor Naftali (Tali) Tishby on his theory that all deep learning convolutional neural networks (CNN) exhibit an “information bottleneck” during deep learning. This information bottleneck results in compressing the information present, in for example, an image and only working with the relevant information.

The Professor and his researchers used a simple AI problem (like recognizing a dog) and trained a deep learning CNN to perform this task. At the start of the training process the CNN nodes at the top were all connected to the next layer, and those were all connected to the next layer and so on until you got to the output layer.

Essentially, the researchers found that during the deep learning process, the CNN went from recognizing all features of an image to over time just recognizing (processing?) only the relevant features of an image when successfully trained.

Limits of deep learning CNNs

In his talk the Professor identifies two modes of operations of a deep learning CNN: the encoder layers and decoder layers. The encoder function identifies relevant information in the input and the decoder function takes this relevant information and maps this to an output.

This view results in two statistics that can characterize any deep learning CNN:

  • Sample complexity which refers to the the mutual information inside the last hidden layer of the encoder function, and
  • Accuracy or generalization error, which refers to the mutual information inside the last hidden layer of the decoder function.

Where mutual information is defined as how much of the uncertainty of an input is removed when you have an output that is based on that input. (See the talk for a more formal explanation).

The professor states that any complex deep learning CNN can be characterized by these two statistics where sample complexity determines the number of samples required and accuracy determines the precision by which the deep learning CNN can properly interpret those samples. The deep black line in the chart represents the limits of accuracy achievable at some number of training events, with some number of hidden layers and some sample set.

What happens during deep learning

Moreover, the professor shows an interesting characteristic of all CNNs is that they converge over time in accuracy and that convergence differs based mostly on the number of layers, sample size and training count used.

In the chart, the top row show 3 CNNs with different amounts of training data (5%, 40% and 80% of total). The chart shows the end result and trace of learning within the CNN over the same number of epochs (training cycles). More training data generates more accurate results.

The Professor views those epochs after the farthest right traces (where the trace essentially starts moving up and to the left in the chart), the compression phase of deep learning.

Statistics of deep learning process

The professor goes on to characterize the deep learning  process by calculating the mean and variance of each layers connection weights.

In the chart he shows an standard “eiffel tower” neural network, with 6 hidden layers, each with less neurons (nodes)  than the previous layer (12 nodes, 10 nodes, 7 nodes, etc.). And what he plots is the average weights and variance between layers (red lines are average and variance of the weights for arcs[connections] between nodes in layer 1 to nodes in layer 2, blue lines the mean and variance of weights for arcs between layer 2 and 3, purple lines the mean and variance of weights for arcs between layer 3 and 4, etc.).

He shows that at the start of training the (randomly assigned) weights for each layer have a normalized mean which is higher than its normalized variance. He calls this phase as high signal to noise (I would say the opposite, its low signal to noise, more noise than signal). But as training proceeds (over more epochs), there comes a point where the layer mean drops below its variance and the signal to noise ratio changes dramatically. After that point the mean weights and variance of the group of layers start to diverge or move apart.

The phase (epochs) after the line where the weights means are lower than its variance, he calls the Compression phase of the deep layer CNN training.

The Professor suggests that every complex deep learning CNN looks the same during training if you perform the calculations. The professor shows charts like this for other deep learning CNNs used on different problems and they all exhibit some point where their means are lower than their weights after which means and variances between layers starts to differentiate.

Do layer counts and sample size matter?

It turns out that the more hidden layers you have, the sooner (less training) you need to begin the compression phase. This chart shows the same problem, with different hidden layer counts. One can see in the traces, that not only is accuracy improved with more layers but it also more quickly reaches the compression phase.

Using his sample complexity and accuracy statistics, the Professor has also shown that their are limits to the amount of accuracy to any deep learning CNN based on the function of layer counts, sample size and training event counts.


As far as I know, The Professor and his team are the first to try to characterize and understand what happens during deep learning. In doing so, he has shown that the number of layers and the number of samples can be used to predict the speed of learning. And ultimately how accurate any deep learning CNN can be.


Dreaming of SCM but living with NVDIMMs…

Last months GreyBeards on Storage podcast was with Rob Peglar, CTO and Sr. VP of Symbolic IO. Most of the discussion was on their new storage product but what also got my interest is that they are developing their storage system using NVDIMM technologies.

In the past I would have called NVDIMMs NonVolatile RAM but with the latest incarnation it’s all packaged up in a single DIMM and has both NAND and DRAM on board. It looks a lot like 3D XPoint but without the wait.

IMG_2338The first time I saw similar technology was at SFD5 with Diablo Technologies and SANdisk, a Western Digital company (videos here and here). At that time they were calling them UltraDIMM and memory class storage. ULTRADIMMs had an onboard SSD and DRAM and they  provided a sort of virtual memory (paged) access to the substantial (SSD) storage behind the DRAM page(s). I  wrote two blog posts about UltraDIMM and MCS (called MCS, UltraDIMM and memory IO, the new path ahead part1 and part2).


NVDIMM defined

NVDIMMs are currently available today from Micron, Crucial, NetList, Viking, and probably others. With today’s NVDIMM there is no large SSD (like ULTRADIMMs, just backing flash) and the complete storage capacity is available from the DRAM in the NVDIMM. At power reset, the NVDIMM sort of acts like virtual memory paging in data from the flash until all the data is in DRAM.

NVDIMM hardware includes control logic, DRAM, NAND and SuperCAPs/Batteries together in one DIMM. DRAM is used for normal memory traffic but in the case of a power outage, the data from DRAM is offloaded onto the NAND in the NVDIMM using the SuperCAP/Battery to hold up the DRAM memory just long enough to transfer it to flash..

Th problem with good, old DRAM is that it is volatile, which means when power is gone so is your data. With NVDIMMs (3D XPoint and other new non-volatile storage class memories also share this characteristic), when power goes away your data is still available and persists across power outages.

For example, Micron offers an 8GB, JEDEC DDR4 compliant, 288-pin NVDIMM that has 8GB of DRAM and 16GB of SLC flash in a single DIMM. Depending on part, it has 14.9-16.2GB/s of bandwidth and 1866-2400 MT/s (million memory transfers/second). Roughly translating MT/s to IOPS, says with ~17GB/sec and at an 8KB block size, the device should be able to do ~2.1 MIO/s (million IO operations per second [never thought I would need an acronym for that]).

Another thing that makes NVDIMMs unique in the storage world is that they are byte addressable.

Hardware – check, Software?

SNIA has a NVM Programming (NVMP) Technical Working Group (TWG), which has been working to help adoption of the new technology. In addition to the NVMP TWG, there’s, SANdisk’s NVMFS (2013 FMS paper, formerly known as DirectFS) and Intel’s pmfs (persistent memory file system) GitHub repository.  Couldn’t find any GitHub for NVMFS but both and pmfs are well along the development path for Linux.

swarchThe TWG identified a three prong approach to NVDIMM adoption:  crawl, walk, run (see blog post for more info).

  • The Crawl approach uses standard block and file system drivers on Linux to talk to a NVDIMM driver. This way has the benefit of being well tested, well known and widely available (except for the NVDIMM driver). The downside is that you have a full block IO or file IO stack in front of a device that can potentially do 2.1 MIO/s and it is likely to cause a lot of overhead reducing this potential significantly.
  • The Walk approach uses a persistent memory file system (pmfs?) to directly access the NVDIMM storage using memory mapped IO. The advantage here is that there’s absolutely no kernel code active during a NVDIMM data access. But building a file system or block store up around this may require some application level code.
  • The Run approach wasn’t described well in the blog post but it seems like SANdisk’s NVMFS approach which uses both standard NVMe SSDs and non-volatile memory to build a hybrid (NVDIMM-SSD) file system.

Symbolic IO as another run approach?

Symbolic IO computationally defined storage is intended to make use of NVDIMM technology and in the Store [update 12/16/16] appliance version has SSD storage as well in a hybrid NVDIMM-SSD run-like solution. The appliance has a full version of Linux SymCE which doesn’t use a file system or the PMEM library to access the data, it’s just byte addressable storage  with a PMEM file system embedded within [update 12/16/16]. This means that applications can use standard Linux file APIs to (directly) reference NVDIMM and the backend SSD storage.

It’s computationally defined because they use compute power to symbolically transform the data reducing data footprint in NVDIMM and subsequently in the SSD backing tier. Checkout the podcast to learn more

I came away from the podcast thinking that NVDIMMs are more prevalent than I thought. So, that’s what prompted this post.


Photo Credit(s): UltraDIMM photo taken by Ray at SFD5, Architecture picture from blog post


Hedvig storage system, Docker support & data protection that spans data centers

Hedvig003We talked with Hedvig (@HedvigInc) at Storage Field Day 10 (SFD10), a month or so ago and had a detailed deep dive into their technology. (Check out the videos of their sessions here.)

Hedvig implements a software defined storage solution that runs on X86 or ARM processors and depends on a storage proxy operating in a hypervisor host (as a VM) and storage service nodes. Their proxy and the storage services can execute as separate VMs on the same host in a hyper-converged fashion or on different nodes as a separate storage cluster with hosts doing IO to the storage cluster.

Hedvig’s management team comes from hyper-scale environments (Amazon Dynamo/Facebook Cassandra) so they have lots of experience implementing distributed software defined storage at (hyper-)scale.
Continue reading “Hedvig storage system, Docker support & data protection that spans data centers”

Rubrik has a better idea for VMware backup

Cluster nodesRubrik has been around since January 2014 and just GA’d in April of last year. They recently presented at  TechFieldDay 10 (TFD10, videos here) with Chris Wahl, Technical Evangelist, Arvin “Nitro” Nithrakashyap, Co-Founder and Bipul Sinha, Co-Founder, in attendance.

I have known Chris Wahl since November of 2013, from our time together on Storage Field Day 4 (SFD4). Howard and I (the “Greybeards”) also interviewed Chris Wahl for Rubrik on a Greybeards on Storage podcast.
Continue reading “Rubrik has a better idea for VMware backup”

Springpath SDS springs forth

Springpath presented at SFD7 and has a new Software Defined Storage (SDS) that attempts to provide the richness of enterprise storage in a SDS solution running on commodity hardware. I would encourage you to watch the SFD7 video stream if you want to learn more about them.

HALO software

Their core storage architecture is called HALO which stands for Hardware Agnostic Log-structured Object store. We have discussed log-structured file systems before. They are essentially a sequential file that can be randomly accessed (read) but are sequentially written. Springpath HALO was written from scratch, operates in user space and unlike many SDS solutions, has no dependencies on Linux file systems.

HALO supports both data deduplication and compression to reduce storage footprint. The other unusual feature  is that they support both blade servers and standalone (rack) servers as storage/compute nodes.

Tiers of storage

Each storage node can optionally have SSDs as a persistent cache, holding write data and metadata log. Storage nodes can also hold disk drives used as a persistent final tier of storage. For blade servers, with limited drive slots, one can configure blades as part of a caching tier by using SSDs or PCIe Flash.

All data is written to the (replicated) caching tier before the host is signaled the operation is complete. Write data is destaged from the caching tier to capacity tier over time, as the caching tier fills up. Data reduction (compression/deduplication) is done at destage.

The caching tier also holds read cached data that is frequently read. The caching tier also has a non-persistent segment in server RAM.

Write data is distributed across caching nodes via a hashing mechanism which allocates portions of an address space across nodes. But during cache destage, the data can be independently spread and replicated across any capacity node, based on node free space available.  This is made possible by their file system meta-data information.

The capacity tier is split up into data and a meta-data partitions. Meta-data is also present in the caching tier. Data is deduplicated and compressed at destage, but when read back into cache it’s de-compressed only. Both capacity tier and caching tier nodes can have different capacities.

HALO has some specific optimizations for flash writing which includes always writing a full SSD/NAND page and using TRIM commands to free up flash pages that are no longer being used.

HALO SDS packaging under different Hypervisors

In Linux & OpenStack environments they run the whole storage stack in Docker containers primarily for image management/deployment, including rolling upgrade management.

In VMware and HyperVM, Springpath runs as a VM and uses direct path IO to access the storage. For VMware Springpath looks like an NFSv3 datastore with VAAI and VVOL support. In Hyper-V Springpath’s SDS is an SMB storage device.

For KVM its an NFS storage, for OpenStack one can use NFS or they have a CINDER plugin for volume support.

The nice thing about Springpath is you can build a cluster of storage nodes that consists of VMware, HyperV and bare metal Linux nodes that supports all of them. (Does this mean it’s multi protocol, supporting SMB for Hyper-V, NFSv3 for VMware?)

HALO internals

Springpath supports (mostly) file, block (via Cinder driver) and object access protocols. Backend caching and capacity tier all uses a log structured file structure internally to stripe data across all the capacity and caching nodes.  Data compression works very well with log structured file systems.

All customer data is supported internally as objects. HALO has a write-log which is spread across their caching tier and a capacity-log which is spread across the capacity tier.

Data is automatically re-balanced across nodes when new nodes are added or old nodes deleted from the cluster.

Data is protected via replication. The system uses a minimum of 3 SSD nodes and 3 drive (capacity) nodes but these can reside on the same servers to be fully operational. However, the replication factor can be configured to be less than 3 if you’re willing to live with the potential loss of data.

Their system supports both snapshots (2**64 times/object) and storage clones for test dev and backup requirements.

Springpath seems to have quite a lot of functionality for a SDS. Although, native FC & iSCSI support is lacking. For a file based, SDS for hypbervisors, it seems to have a lot of the bases covered.


Other SFD7 blogger posts on Springpath:

Picture credit(s): Architectural layout (from 

New deduplication solutions from Sepaton and NEC

In the last few weeks both Sepaton and NEC have announced new data deduplication appliance hardware and for Sepaton at least, new functionality. Both of these vendors compete against solutions from EMC Data Domain, IBM ProtectTier, HP StoreOnce and others.

Sepaton v7.0 Enterprise Data Protection

From Sepaton’s point of view data growth is exploding, with little increase in organizational budgets and system environments are becoming more complex with data risks expanding, not shrinking. In order to address these challenges Sepaton has introduced a new version of their hardware appliance with new functionality to help address the rising data risks.

Their new S2100-ES3 Series 2925 Enterprise Data Protection Platform with latest Sepaton software now supports up to 80 TB/hour of cluster data ingest (presumably with Symantec OST) and up to 2.0 PB of raw storage in an 8-node cluster. The new appliances support 4-8Gbps FC and 2-10GbE host ports per node, based on HP DL380p Gen8 servers with Intel Xeon E5-2690 processors, 8 core, dual 2.9Ghz CPU, 128 GB DRAM and a new high performance compression card from EXAR. With the bigger capacity and faster throughput, enterprise customers can now support large backup data streams with fewer appliances, reducing complexity and maintenance/licensing fees. S2100-ES3 Platforms can scale from 2 to 8 nodes in a single cluster.

The new appliance supports data-at-rest encryption for customer data security as well as data compression, both of which are hardware based, so there is no performance penalty. Also, data encryption is an optional licensed feature and uses OASIS KMIP 1.0/1.1 to integrate with RSA, Thales and other KMIP compliant, enterprise key management solutions.

NEC HYDRAstor Gen 4

With Gen4 HYDRAstor introduces a new Hybrid Node which contains both the logic for accelerator nodes and capacity for storage nodes, in one 2U rackmounted server. Before the hybrid node similar capacity and accessibility would have required 4U of rack space, 2U for the accelerator node and another 2U for the storage node.

The HS8-4000 HN supports 4.9TB/hr standard or 5.6TB/hr per node with NetBackup OST IO express ingest rates and 12-4TB, 3.5in SATA drives, with up to 48TB of raw capacity. They have also introduced an HS8-4000 SN which just consists of the 48TB of additional storage capacity. Gen4 is the first use of 4TB drives we have seen anywhere and quadruples raw capacity per node over the Gen3 storage nodes. HYDRAstor clusters can scale from 2- to 165-nodes and performance scales linearly with the number of cluster nodes.

With the new HS8-4000 systems, maximum capacity for a 165 node cluster is now 7.9PB raw and supports up to 920.7 TB/hr (almost a PB/hr, need to recalibrate my units) with an all 165-HS8-4000 HN node cluster. Of course, how many customers need a PB/hr of backup ingest is another question. Let alone, 7.9PB of raw storage which of course gets deduplicated to an effective capacity of over 100PBs of backup data (or 0.1EB, units change again).

NEC has also introduced a new low end appliance the HS3-410 for remote/branch office environments that has a 3.2TB/hr ingest with up to 24TB of raw storage. This is only available as a single node system.

Maybe Facebook could use a 0.1EB backup repository?

Image: Intel Team Inside Facebook Data Center by IntelFreePress


Genome informatics takes off at 100GB/human

All is One, the I-ching and Genome case by TheAlieness (cc) (from flickr)
All is One, the I-ching and Genome case by TheAlieness (cc) (from flickr)

Read a recent article (actually a series of charts and text) on MIT Technical Review called Bases to Bytes which discusses how the costs of having your DNA sequenced is dropping faster than Moore’s law and how storing a person’s DNA data now takes ~100GB.

Apparently Nature magazine says ~30,000 genomes have been sequenced (not counting biotech sequenced genomes), representing ~3PB of data.

Why it takes 100GB

At the moment DNA sequencing is not doing any compression, no deduplication nor any other storage efficiency tools to reduce this capacity footprint.  The 3.2Billion DNA base pairs each would take a minimum of 2 bits to store which should be ~800MB but for some reason more information about each base is saved (for future needs?) and they often re-sequence the DNA multiple times just to be sure (replica’s?).  All this seems to add up  to needing 100GB of data for a typical DNA sequencing output.

How they go from 0.8GB to 100GB with more info on each base pair and multiple copies or 125X the original data requirement is beyond me.

However, we have written about DNA informatics before (see our Dits, codons & chromozones – the storage of life post).  In that post I estimated that human DNA would need ~64GB of storage, almost right on.  (Although there was a math error somewhere in that analysis. Let’s see, 1B codons each with 64 possibilities [needing 6 bits] should require 6Bbits or ~750MB of storage, close enough).

Dedupe to the rescue

But in my view some deduplication should help.  Not clear if it’s at the Codon level or at some higher organizational level (chromosome, protein, ?)  but a “codon-differential” deduplication algorithm might just do the trick and take DNA capacity requirements down to size.  In fact with all the replication in junk DNA, it starts to looks more and more like backup sets already.

I am sure any of my Deduplication friends in the industry such as EMC Data Domain, HP StoreOnce, NetApp, SEPATON, and others would be happy to give it some thought if adequate funding were to follow.  But with this much storage at stake, some of them may take it on just to go after the storage requirements.

Gosh with a 50:1 deduplication ratio, maybe we could get a human DNA sequence down to 2GB.  Then it would only take 14EB to sequence the worlds 7B population today.

Now if we could just sequence the human microbiome with metagenomic analysis of the microbiological communities of organisms that live upon, within and around all of us.  Then we might have the answer to everything biologically we wanted to know about some person.

What we could do with all this information is another matter.


IBMEdge2012 Day 1 part A – new Ultradrawer, integrated real time compression and more

Brian got up and talked about SmarterStorage that is coming out of STG. A couple of items of specific interest included:

  • IBM’s new ultra drawer – apparently server side all flash array, unclear if this is a shared storage device but would think so.
  • EasyTier cross domain support – In conjunction with the Ultra drawer rollout Brian mentioned that EasyTier would support multi-domain (outside just shared storage tiering?!)
  • Virtual Storage Center – a new administration capability targeted to private and public cloud services which supports a storage service catalog and more self-service provisioning.
  • Real time data compression – for SVC and Storwize V7000 a new (integrated) storage data compression capability based on LZ compression for real time primary active data compression on storage. This will be integrated with other IBM storage over time.


That’s about it from the keynote other than the electronic strings music was awesome…