Data reduction – Silverton Consulting

Huawei presents OceanStor architecture at SFD15

Posted on May 21, 2018May 21, 2018 by Ray in Block Storage, Clustered storage, Data compression, Data reduction, IOPS, LRT, NVMe storage, SPC-1, SSD storage, Storage architecture, Storage Features, Storage performance

At Storage Field Day 15 (SFD15) we had a few sessions with Huawei, on some of their latest storage technology. One of the sessions I was particularly interested in was, OceanStor Dorado (enterprise class, block storage), an architectural deep dive with Chun Liu, (see video here).

Their latest OceanStor Dorado 18000F storage system, due out soon, can scale up to 16 controllers in a cluster, supporting all flash storage configurations. The new Dorado 18000F block storage system supports inline compression and deduplication for data reduction.

The latest SPC-1 performance showed 800K IOPS at 500usec response time with dedupe and inline compression turned on. Although, it’s unclear whether SPC-1 data is deduplicable or compressible. So this may have hurt them with no corresponding advantage in capacity or cost.

System architecture

Chun had one chart that said historically as you add storage system features you often lose 70-80% performance. However, with their implementation using shards of metadata/other data structures and not using (as much) serialization, they have managed to add features without serious performance impact. In fact with the latest architecture, using RAID-TP (3 parity), inline compression, inline deduplication and metro cluster, they lose only about 20% of their baseline system performance. Although, if the metro cluster their using is synchronous replication, it must not be that far away.

They have a pretty standard protocol layer at the top, replication, snapshot and LUN management below that with a cache layer next. Then it gets interesting, they have a distributed object router layer, with deduplication/compression and metadata management underneath that and then the data layout. With infrastructure (backend) at the bottom and inter-cluster communications that span the cluster of controllers. Every enclosure has 2 controllers and inter-cluster communications is over switched PCIe. SSDs can be NVMe or SAS.

IO without serialization

They support a log structured file system on the back end but not just one log. Their internal architecture is a share nothing approach which shards metadata, fingerprint data bases, logs, and other data. Each of these shards is assigned with CPU core/thread affinity and as long as, nothing goes wrong, the storage code operates on shards with no serialization required.

To maximize IO performance they use a lightweight thread (LWT) compute model, that’s non-preemptive. They partition all data structures into fine shards, such that within each shard. Each metadata shard’s is assigned to have a core/thread affinity. That way they can share nothing across compute threads resulting in lock free execution. The LWT runs beginning to end, without preemption, to complete any data updates required and minimize any contention.

IO flow

Write flow: the system receives data in cache, mirrors it to the adjacent controllers cache and then responds back to the host. Controller cache is battery backed up, non volatile storage.

The cache data is then compressed and with deduplication active, fingerprinted. Data fingerprints are used to determine which fingerprint database shard (and subsequent core/thread) to route the data to for further processing. They also compare any matched fingerprinted data to the unique data already stored, because of their “weak” fingerprint hash. If the data is unique, it’s routed the LUN mapping shard (and subsequent core/thread) to calculate a physical address to write the data. Sometime later the data is routed to RAID aggregation and written out to backend SSDs.

Read flow: when the request comes, they check the LUN map shard (core/thread) and if it’s pointing to a fingerprint index they know it’s deduped block and then read that data to respond to the read request.

Other optimizations

They have some specially, designed, optimized code paths. For example, standard RAID TP algorithms perform RAID protection at 2.3GB/sec or 4.5GB/s but Huawei OceanStor Dorada 18000F can perform triple RAID calculations at 6.5GB/s. Similarly, standard LZ4 data compression algorithms can compress data at ~507MB/sec (on email) but Huawei’s data compression algorithm can perform compression (on email) at ~979MB/s. Ditto for CRC16 (used to check block integrity). Traditional CRC16 algorithms operate at ~2.3GB/sec but Hauwei can sustain ~7.2GB/s.

For data on SSDs, they identify data with a short life span (quickly overwritten) and try to coalesce this short lived data onto their own flash pages. That way all the data in a short life span flash page get’s freed up together, which can then be overwritten, without having to move old, non-deleted (long lived) data to new blocks. They claim to have reduced write amplification (non-new data block writes) by 60% this way.

Also LUNs can be configured as throughput optimized or IOPs optimized. Unclear how, but it probably has something to do with cache management and backend layout.

~~~~

Overall, I was impressed with their capabilities to reduce serialization bottlenecks. Back in the old days, when I was looking for how to optimize code, we always seemed to be spending 30-50% of CPU compute spinning on locks, waiting to obtain a lock before the system could continue the code execution.

It never occurred to me we didn’t have to use locks at all.

For more information, please read these other SFD15 blogger posts on Huawei:

Dorado – All about Speed – Storage Gaga, Chin-Fah Heoh (@StorageGaga)
Huawei – Probably Not What You Expected, Dan Firth (@PenguinPunk)

Axellio, next gen, IO intensive server for RT analytics by X-IO Technologies

Posted on June 22, 2017 by Ray in Block Storage, Data reduction, Distributed computing, Machine Learning, NVMe, NVMe storage, PCIe, Storage performance, Strategic Inflection Points, Strategic planning

We were at X-IO Technologies last week for SFD13 in Colorado Springs talking with the team and they showed us their new IO and storage intensive server, the Axellio. They want to sell Axellio to customers that need extreme IOPS, very high bandwidth, and large storage requirements. Videos of X-IO’s sessions at SFD13 are available here.

The hardware

Axellio comes in 2U appliance with two server nodes. Each server supports 2 sockets of Intel E5-26xx v4 CPUs (4 sockets total) supporting from 16 to 88 cores. Each server node can be configured with up to 1TB of DRAM or it also supports NVDIMMs.

There are two key differentiators to Axellio:

The FabricExpress™, a PCIe based interconnect which allows both server nodes to access dual-ported, 2.5″ NVMe SSDs; and
Dense drive trays, the Axellio supports up to 72 (6 trays with 12 drives each) 2.5″ NVMe SSDs offering up to 460TB of raw NVMe flash using 6.4TB NVMe SSDs. Higher capacity NVMe SSDS available soon will increase Axellio capacity to 1PB of raw NVMe flash.

They also probably spent a lot of time on packaging, cooling and power in order to make Axellio a reliable solution for edge computing. We asked if it was NEBs compliant and they told us not yet but they are working on it.

Axellio can also be configured to replace 2 drive trays with 2 processor offload modules such as 2x Intel Phi CPU extensions for parallel compute, 2X Nvidia K2 GPU modules for high end video or VDI processing or 2X Nvidia P100 Tesla modules for machine learning processing. Probably anything that fits into Axellio’s power, cooling and PCIe bus lane limitations would also probably work here.

At the frontend of the appliance there are 1x16PCIe lanes of server retained for networking that can support off the shelf NICs/HCAs/HBAs with HHHL or FHHL cards for Ethernet, Infiniband or FC access to the Axellio. This provides up to 2x100GbE per server node of network access.

Performance of Axellio

With Axellio using all NVMe SSDs, we expect high IO performance. Further, they are measuring IO performance from internal to the CPUs on the Axellio server nodes. X-IO says the Axellio can hit >12Million IO/sec with at 35µsec latencies with 72 NVMe SSDs.

Lab testing detailed in the chart above shows IO rates for an Axellio appliance with 48 NVMe SSDs. With that configuration the Axellio can do 7.8M 4KB random write IOPS at 90µsec average response times and 8.6M 4KB random read IOPS at 164µsec latencies. Don’t know why reads would take longer than writes in Axellio, but they are doing 10% more of them.

Furthermore, the difference between read and write IOP rates aren’t close to what we have seen with other AFAs. Typically, maximum write IOPs are much less than read IOPs. Why Axellio’s read and write IOP rates are so close to one another (~10%) is a significant mystery.

As for IO bandwitdh, Axellio it supports up to 60GB/sec sustained and in the 48 drive lax testing it generated 30.5GB/sec for random 4KB writes and 33.7GB/sec for random 4KB reads. Again much closer together than what we have seen for other AFAs.

Also noteworthy, given PCIe’s bi-directional capabilities, X-IO said that there’s no reason that the system couldn’t be doing a mixed IO workload of both random reads and writes at similar rates. Although, they didn’t present any test data to substantiate that claim.

Markets for Axellio

They really didn’t talk about the software for Axellio. We would guess this is up to the customer/vertical that uses it.

Aside from the obvious use case as a X-IO’s next generation ISE storage appliance, Axellio could easily be used as an edge processor for a massive fabric of IoT devices, analytics processor for large RT streaming data, and deep packet capture and analysis processing for cyber security/intelligence gathering, etc. X-IO seems to be focusing their current efforts on attacking these verticals and others with similar processing requirements.

X-IO Technologies’ sessions at SFD13

Other sessions at X-IO include: Richard Lary, CTO X-IO Technologies gave a very interesting presentation on an mathematically optimized way to do data dedupe (caution some math involved); Bill Miller, CEO X-IO Technologies presented on edge computing’s new requirements and Gavin McLaughlin, Strategy & Communications talked about X-IO’s history and new approach to take the company into more profitable business.

Again all the videos are available online (see link above). We were very impressed with Richard’s dedupe session and haven’t heard as much about bloom filters, since Andy Warfield, CTO and Co-founder Coho Data, talked at SFD8.

For more information, other SFD13 blogger posts on X-IO’s sessions:

SFD13 primer – X-IO Axellio Edge Computing Platform by Max Mortillaro (@Darkkavenger)
X-IO Technology – a #SFD13 preview by Mike Preston (@mwpreston)

Full Disclosure

X-IO paid for our presence at their sessions and they provided each blogger a shirt, lunch and a USB stick with their presentations on it.

Exablox, bring your own disk storage

Posted on July 8, 2016July 8, 2016 by Ray in CIFS/SMB, Clustered storage, Data reduction, Ethernet, File Storage, NFS, Object storage

We talked with Exablox a month or so ago at Storage Field Day 10 (SFD10) and they discussed some of their unique storage solution and new software functionality. If you’re not familiar with Exablox they sell a OneBlox appliance with drive slots, but no data drives.

The OneBlox appliance provides a Linux based, scale-out, distributed object storage software with a file system in front of it. They support SMB and NFS access protocols and have inline deduplication, data compression and continuous snapshot capabilities. You supply the (SATA or SAS) drives, a bring your own drive (BYOD) storage offering.

Their OneSystem management solution is available on a subscription basis, which usually runs in the cloud as a web accessed service offering used to monitor and manage your Exablox cluster(s). However, for those customers that want it, OneSystem is also available as a Docker Container, where you can run it on any Docker compatible system.
Continue reading “Exablox, bring your own disk storage” →

A tale of two AFAs: EMC DSSD D5 & Pure Storage FlashBlade

Posted on March 17, 2016May 17, 2016 by Ray in Block Storage, data protection, Data reduction, data services, Ethernet, File Storage, PCIe, SSD storage

There’s been an ongoing debate in the analyst community about the advantages of software only innovation vs. hardware-software innovation (see Commodity hardware loses again and Commodity hardware always loses posts). Here is another example where two separate companies have turned to hardware innovation to take storage innovation to the next level.

DSSD D5 and FlashBlade

Within the last couple of weeks, two radically different AFAs were introduced. One by perennial heavyweight EMC with their new DSSD D5 rack scale flash system and the other by relatively new comer Pure Storage with their new FlashBlade storage system.

These two arrays seem to be going after opposite ends of the storage market: the 5U DSSD D5 is going after both structured and unstructured data that needs ultra high speed IO access (<100µsec) times and the 4U FlashBlade going after more general purpose unstructured data. And yet the two have have many similarities at least superficially.
Continue reading “A tale of two AFAs: EMC DSSD D5 & Pure Storage FlashBlade” →

Rubrik has a better idea for VMware backup

Posted on February 16, 2016May 17, 2016 by Ray in Data compression, Data reduction, Distributed computing, NFS, Object storage, Storage Backup

Cluster nodes Rubrik has been around since January 2014 and just GA’d in April of last year. They recently presented at TechFieldDay 10 (TFD10, videos here) with Chris Wahl, Technical Evangelist, Arvin “Nitro” Nithrakashyap, Co-Founder and Bipul Sinha, Co-Founder, in attendance.

I have known Chris Wahl since November of 2013, from our time together on Storage Field Day 4 (SFD4). Howard and I (the “Greybeards”) also interviewed Chris Wahl for Rubrik on a Greybeards on Storage podcast.
Continue reading “Rubrik has a better idea for VMware backup” →

Springpath SDS springs forth

Posted on June 8, 2015June 8, 2015 by Ray in Clustered storage, Data compression, Data reduction, Disk storage, File Storage, SSD storage, Storage, Strategic Inflection Points

Springpath presented at SFD7 and has a new Software Defined Storage (SDS) that attempts to provide the richness of enterprise storage in a SDS solution running on commodity hardware. I would encourage you to watch the SFD7 video stream if you want to learn more about them.

HALO software

Their core storage architecture is called HALO which stands for Hardware Agnostic Log-structured Object store. We have discussed log-structured file systems before. They are essentially a sequential file that can be randomly accessed (read) but are sequentially written. Springpath HALO was written from scratch, operates in user space and unlike many SDS solutions, has no dependencies on Linux file systems.

HALO supports both data deduplication and compression to reduce storage footprint. The other unusual feature is that they support both blade servers and standalone (rack) servers as storage/compute nodes.

Tiers of storage

Each storage node can optionally have SSDs as a persistent cache, holding write data and metadata log. Storage nodes can also hold disk drives used as a persistent final tier of storage. For blade servers, with limited drive slots, one can configure blades as part of a caching tier by using SSDs or PCIe Flash.

All data is written to the (replicated) caching tier before the host is signaled the operation is complete. Write data is destaged from the caching tier to capacity tier over time, as the caching tier fills up. Data reduction (compression/deduplication) is done at destage.

The caching tier also holds read cached data that is frequently read. The caching tier also has a non-persistent segment in server RAM.

Write data is distributed across caching nodes via a hashing mechanism which allocates portions of an address space across nodes. But during cache destage, the data can be independently spread and replicated across any capacity node, based on node free space available. This is made possible by their file system meta-data information.

The capacity tier is split up into data and a meta-data partitions. Meta-data is also present in the caching tier. Data is deduplicated and compressed at destage, but when read back into cache it’s de-compressed only. Both capacity tier and caching tier nodes can have different capacities.

HALO has some specific optimizations for flash writing which includes always writing a full SSD/NAND page and using TRIM commands to free up flash pages that are no longer being used.

HALO SDS packaging under different Hypervisors

In Linux & OpenStack environments they run the whole storage stack in Docker containers primarily for image management/deployment, including rolling upgrade management.

In VMware and HyperVM, Springpath runs as a VM and uses direct path IO to access the storage. For VMware Springpath looks like an NFSv3 datastore with VAAI and VVOL support. In Hyper-V Springpath’s SDS is an SMB storage device.

For KVM its an NFS storage, for OpenStack one can use NFS or they have a CINDER plugin for volume support.

The nice thing about Springpath is you can build a cluster of storage nodes that consists of VMware, HyperV and bare metal Linux nodes that supports all of them. (Does this mean it’s multi protocol, supporting SMB for Hyper-V, NFSv3 for VMware?)

HALO internals

Springpath supports (mostly) file, block (via Cinder driver) and object access protocols. Backend caching and capacity tier all uses a log structured file structure internally to stripe data across all the capacity and caching nodes. Data compression works very well with log structured file systems.

All customer data is supported internally as objects. HALO has a write-log which is spread across their caching tier and a capacity-log which is spread across the capacity tier.

Data is automatically re-balanced across nodes when new nodes are added or old nodes deleted from the cluster.

Data is protected via replication. The system uses a minimum of 3 SSD nodes and 3 drive (capacity) nodes but these can reside on the same servers to be fully operational. However, the replication factor can be configured to be less than 3 if you’re willing to live with the potential loss of data.

Their system supports both snapshots (2**64 times/object) and storage clones for test dev and backup requirements.

Springpath seems to have quite a lot of functionality for a SDS. Although, native FC & iSCSI support is lacking. For a file based, SDS for hypbervisors, it seems to have a lot of the bases covered.

Comments?

“… would consume nearly half the world’s digital storage capacity.”

Posted on January 28, 2014 by Ray in Data reduction, Disk storage, Neuron connection mapping, Optical storage, Storage, System effectiveness, Tape storage

A recent National Geographic article on recent research into the brain (February 2014) said something which I find intriguing. “Producing an image of an entire human brain at the same resolution [as a mouse brain] would consume nearly half of the world’s current digital storage capacity.”

They were imaging slices of a mouse brain with an electron microscope, in slices one millimeter square, at a micron in depth, representing just a thousand cubic microns per image. Such a scan of the full mouse brain would require 450,000 TB (0.45 EB, exabyte=10E18 bytes) of storage for the images.

Getting an equivalent resolution image of a single human brain would require 1.3 billion TB (or 1.3 ZB, zettabyte=10E21 bytes). They went on to say that the world’s digital storage was just 2.7 billion TB (or 2.7 ZB), which is where they came up with the “… nearly half the world’s digital storage capacity.”

So how much digital storage is there in the world today

Setting aside the need for such a detailed map for the moment. Let’s talk about the world’s digital storage.

Tape – I don’t have much information about the enterprise tape capacity currently available in IBM TS1120/TS1130 or Oracle T10000C/B/A but a relatively recent article indicated that the 225 millionth LTO cartridge was shipped sometime in 3Q13 which represented a capacity of 90,000 PB (or 90 EB, exabyte=10E18 bytes) of storage capacity
Disk – Although I couldn’t find a reasonable estimate of installed disk capacity, IDC reported that 2012 disk capacity shipments were 20EB and through 3Q13 there had been 24.3EB shipped. It’s probably safe to assume that capacity shipments were ~8.3EB or more in 4Q13 so we have shipped ~32.5EB of disk capacity in 2013. One estimate of worldwide disk storage capacity (also provided by IDC) is that we are doubling worldwide disk storage capacity every two years so one estimate of installed disk capacity as of the end of 4Q13 is something on the order of 113.6EB of disk storage.

I won’t delve into optical storage as that’ s even more difficult to get a handle on but my guess is it’s not quite to the level of LTO digital storage so maybe another 90EB there for a total of ~0.3ZB of digital storage in disks, LTO tape and optical.

However, back in February of 2010, researchers reported in Science that the world’s information storage capacity was 2.0 ZB of storage. Also, last October IDC reported that the US alone had a digital storage capacity of 2.6 ZB and that the US had somewhere between 24 to 40% of the world’s storage. Let’s use 33%, for simplicity sake, this would put world’s digital capacity at around 7.8ZB of storage according to IDC.

Thankfully, a human brain scan at the resolutions above would take only a sixth of the world’s digital storage based on my estimates.

But, we really need to talk about data reduction techniques

I think we need to start discussing some form of data reduction, data compression/fractal compression or even graphical encoding. For example, with appropriate software and compute power the neural scans could be encoded at appropriate levels of detail into a graphical representation. Hopefully, this should be many orders of magnitude less storage intensive. So maybe only 1/600th to 1/60,000 of all the world’s digital storage

Another approach might be to use a form of fractal compression similar to that done in motion pictures/photographic images. Perhaps, I am being naive but it seems to me that there ought to be some form of fractal encoding of neural branching. Most of nature’s branching structures have an underlying fractal basis and I see nothing in neural anatomy that would show me it’s any different.

Of course, I am not a neural biologist, but I am a storage expert and there’s got to be a way to reduce this data load somehow.

Comments?

Photo Credit: Microscopic embryonic mouse brain (DAPI, GFP) by Joseph Elsbernd

Genome informatics takes off at 100GB/human

Posted on June 29, 2012June 29, 2012 by Ray in Data compression, Data reduction, Information economy, Storage Features, Strategic Inflection Points, Uncategorized

All is One, the I-ching and Genome case by TheAlieness (cc) (from flickr)

Read a recent article (actually a series of charts and text) on MIT Technical Review called Bases to Bytes which discusses how the costs of having your DNA sequenced is dropping faster than Moore’s law and how storing a person’s DNA data now takes ~100GB.

Apparently Nature magazine says ~30,000 genomes have been sequenced (not counting biotech sequenced genomes), representing ~3PB of data.

Why it takes 100GB

At the moment DNA sequencing is not doing any compression, no deduplication nor any other storage efficiency tools to reduce this capacity footprint. The 3.2Billion DNA base pairs each would take a minimum of 2 bits to store which should be ~800MB but for some reason more information about each base is saved (for future needs?) and they often re-sequence the DNA multiple times just to be sure (replica’s?). All this seems to add up to needing 100GB of data for a typical DNA sequencing output.

How they go from 0.8GB to 100GB with more info on each base pair and multiple copies or 125X the original data requirement is beyond me.

However, we have written about DNA informatics before (see our Dits, codons & chromozones – the storage of life post). In that post I estimated that human DNA would need ~64GB of storage, almost right on. (Although there was a math error somewhere in that analysis. Let’s see, 1B codons each with 64 possibilities [needing 6 bits] should require 6Bbits or ~750MB of storage, close enough).

Dedupe to the rescue

But in my view some deduplication should help. Not clear if it’s at the Codon level or at some higher organizational level (chromosome, protein, ?) but a “codon-differential” deduplication algorithm might just do the trick and take DNA capacity requirements down to size. In fact with all the replication in junk DNA, it starts to looks more and more like backup sets already.

I am sure any of my Deduplication friends in the industry such as EMC Data Domain, HP StoreOnce, NetApp, SEPATON, and others would be happy to give it some thought if adequate funding were to follow. But with this much storage at stake, some of them may take it on just to go after the storage requirements.

Gosh with a 50:1 deduplication ratio, maybe we could get a human DNA sequence down to 2GB. Then it would only take 14EB to sequence the worlds 7B population today.

Now if we could just sequence the human microbiome with metagenomic analysis of the microbiological communities of organisms that live upon, within and around all of us. Then we might have the answer to everything biologically we wanted to know about some person.

What we could do with all this information is another matter.

Comments?