Clustered storage – Silverton Consulting

Huawei presents OceanStor architecture at SFD15

Posted on May 21, 2018May 21, 2018 by Ray in Block Storage, Clustered storage, Data compression, Data reduction, IOPS, LRT, NVMe storage, SPC-1, SSD storage, Storage architecture, Storage Features, Storage performance

At Storage Field Day 15 (SFD15) we had a few sessions with Huawei, on some of their latest storage technology. One of the sessions I was particularly interested in was, OceanStor Dorado (enterprise class, block storage), an architectural deep dive with Chun Liu, (see video here).

Their latest OceanStor Dorado 18000F storage system, due out soon, can scale up to 16 controllers in a cluster, supporting all flash storage configurations. The new Dorado 18000F block storage system supports inline compression and deduplication for data reduction.

The latest SPC-1 performance showed 800K IOPS at 500usec response time with dedupe and inline compression turned on. Although, it’s unclear whether SPC-1 data is deduplicable or compressible. So this may have hurt them with no corresponding advantage in capacity or cost.

System architecture

Chun had one chart that said historically as you add storage system features you often lose 70-80% performance. However, with their implementation using shards of metadata/other data structures and not using (as much) serialization, they have managed to add features without serious performance impact. In fact with the latest architecture, using RAID-TP (3 parity), inline compression, inline deduplication and metro cluster, they lose only about 20% of their baseline system performance. Although, if the metro cluster their using is synchronous replication, it must not be that far away.

They have a pretty standard protocol layer at the top, replication, snapshot and LUN management below that with a cache layer next. Then it gets interesting, they have a distributed object router layer, with deduplication/compression and metadata management underneath that and then the data layout. With infrastructure (backend) at the bottom and inter-cluster communications that span the cluster of controllers. Every enclosure has 2 controllers and inter-cluster communications is over switched PCIe. SSDs can be NVMe or SAS.

IO without serialization

They support a log structured file system on the back end but not just one log. Their internal architecture is a share nothing approach which shards metadata, fingerprint data bases, logs, and other data. Each of these shards is assigned with CPU core/thread affinity and as long as, nothing goes wrong, the storage code operates on shards with no serialization required.

To maximize IO performance they use a lightweight thread (LWT) compute model, that’s non-preemptive. They partition all data structures into fine shards, such that within each shard. Each metadata shard’s is assigned to have a core/thread affinity. That way they can share nothing across compute threads resulting in lock free execution. The LWT runs beginning to end, without preemption, to complete any data updates required and minimize any contention.

IO flow

Write flow: the system receives data in cache, mirrors it to the adjacent controllers cache and then responds back to the host. Controller cache is battery backed up, non volatile storage.

The cache data is then compressed and with deduplication active, fingerprinted. Data fingerprints are used to determine which fingerprint database shard (and subsequent core/thread) to route the data to for further processing. They also compare any matched fingerprinted data to the unique data already stored, because of their “weak” fingerprint hash. If the data is unique, it’s routed the LUN mapping shard (and subsequent core/thread) to calculate a physical address to write the data. Sometime later the data is routed to RAID aggregation and written out to backend SSDs.

Read flow: when the request comes, they check the LUN map shard (core/thread) and if it’s pointing to a fingerprint index they know it’s deduped block and then read that data to respond to the read request.

Other optimizations

They have some specially, designed, optimized code paths. For example, standard RAID TP algorithms perform RAID protection at 2.3GB/sec or 4.5GB/s but Huawei OceanStor Dorada 18000F can perform triple RAID calculations at 6.5GB/s. Similarly, standard LZ4 data compression algorithms can compress data at ~507MB/sec (on email) but Huawei’s data compression algorithm can perform compression (on email) at ~979MB/s. Ditto for CRC16 (used to check block integrity). Traditional CRC16 algorithms operate at ~2.3GB/sec but Hauwei can sustain ~7.2GB/s.

For data on SSDs, they identify data with a short life span (quickly overwritten) and try to coalesce this short lived data onto their own flash pages. That way all the data in a short life span flash page get’s freed up together, which can then be overwritten, without having to move old, non-deleted (long lived) data to new blocks. They claim to have reduced write amplification (non-new data block writes) by 60% this way.

Also LUNs can be configured as throughput optimized or IOPs optimized. Unclear how, but it probably has something to do with cache management and backend layout.

~~~~

Overall, I was impressed with their capabilities to reduce serialization bottlenecks. Back in the old days, when I was looking for how to optimize code, we always seemed to be spending 30-50% of CPU compute spinning on locks, waiting to obtain a lock before the system could continue the code execution.

It never occurred to me we didn’t have to use locks at all.

For more information, please read these other SFD15 blogger posts on Huawei:

Dorado – All about Speed – Storage Gaga, Chin-Fah Heoh (@StorageGaga)
Huawei – Probably Not What You Expected, Dan Firth (@PenguinPunk)

NetApp’s new NVMeoF/FC AFF & Cloud Data Volumes for every cloud

Posted on May 14, 2018May 14, 2018 by Ray in Block Storage, Cloud services, Cloud storage, Clustered storage, FC, File Storage, NVMe, SSD storage, Storage, Storage performance, Strategic Inflection Points

We attended a NetApp analyst event in their CA HQ last week and they had some interesting announcements as well other information to share. 1st up new faster ONTAP storage.

NVMeoF AFF

NetApp announced this week that their latest generation AFF (All Flash FAS) systems will support FC NVMeoF. We asked if this was just for NVMe SSDs or did it apply to all AFF media. The answer was it’s just another host interface which the customer can license for NVMe SSDs (available only on AFF F800) or SAS SSDs (A700S, A700, and A300). The only AFF not supporting the new host interface is their lowend AFF A220.

As for which NVMeoF, they only support FC at the moment, and it’s our belief that the FC NVMeoF spec is most well defined these days and the FC switch hardware (Brocade-Broadcom since Gen 5, now shipping Gen 6, Cisco not sure) already has NVMeoF support.

NetApp also mentioned support for 100GbE (A800 & A700S only) and 32Gbs FC hardware (all AFF systems but A220). So, presumably they offer NVMeoF for both 32Gbps and 16Gbps FC.

No word on when this will be available for Ethernet FCoE or iSCSI (iNVMe?) but with all the major storage vendors bar one, moving to NVMe SSDs it’s only a matter of time before they also support Ethernet NVMeoF.

As for AFF NVMeoF performance, the answer wasn’t entirely satisfactory. The indication was that the interface reduced response time by 10 usecs or so for NVMe SSDs over SAS SSDs. But I didn’t see any other performance information to substantiate that.

We did see on their AFF d atasheet that with NVMe SSDs and NVMeoF FC, the AFF A800 response time was sub 200usec with throughput of 300GB/s (in a 24 node cluster, 12 HA pairs). This means they add only about 100usec for ONTAP data services, a decent trade off from our perspective. Later in their datasheet they say the A800 is capable of 1.3M IOPS and sub-500usec latencies. Unsure why they quoted both numbers.

Cloud Data Volumes

NetApp is taking storage to the cloud. They just announced that NetApp Cloud Data Volumes will be available as a native service under Google Cloud Platform (GCP). NetApp Cloud Data Volume is a storage-as-a-service offering that provides on demand ONTAP file services in the cloud.

For GCP, both Google and NetApp will be offering the service. Dianne Green, GCP VP said Cloud Data Volumes are a bit like Kubernetes, disruption without disrupting. Customers can easily migrate their onprem file based applications to the cloud without having to worry about the performance of their data or data protection for that matter.

Getting the data there is another matter, but NetApp has other services like CloudSync and someday (maybe for Cloud Data Volumes), SnapMirror, which can help customers move data to and from the cloud.

Currently Cloud Data Volumes are in public preview as an Microsoft Azure Enterprise NFS (and SMB) service. It’s also in beta (I think) in AWS marketplace. And availability on GCP is still restricted. There’s a lot of emphasis at NetApp events on Cloud Data Volumes given its current status on public cloud providers but we think they are trying to gain some experience before they roll it out to the rest of the world.

However, Jean English, NetApp CMO mentioned that NetApp’s Cloud Data Service business unit has over 1800 customers and currently supports a multi-PB storage footprint in various clouds. Note, this is not just Cloud Data Volumes but comprises all NetApp Cloud Data Services, which includes ONTAP Cloud, NPS, CloudSync, AltaVault, etc. Nonetheless, it’s an impressive indicator of just how far they have come in applying their storage magic to the public cloud in a short time. The hyperscalers (read public cloud providers) say NetApp is 2 or more years ahead of all the other competition and from what we can see, it’s true.

One of the key differentiators between NetApp Cloud Data Volumes and ONTAP Cloud is performance SLAs. Cloud Data Volume customers can select and purchase a specified performance SLA. We believe it comes at three levels and is normally purchased on a pay as you go, consumption based, service offering. However, it’s also available to be billed periodically, other purchase options may be available as well.

When asked what storage was behind the service, the only thing NetApp would confirm was that it was ONTAP storage, present in public cloud data centers in various regions. So Cloud Data Volumes is available in only specific regions but I would expect that to expand over time.

Data Visualization Center

They also christened their new Data Visualization Center (DVC) and we had a multi-course meal at the Bistro at the center. The DVC had a wrap around, 1.5 floor tall screen which showed some of NetApp customer success stories. Inside the screen was a more immersive setting and there was plenty of VR equipment in work spaces alongside customer conference rooms.

Full Disclosure: NetApp paid for all our travel, hotel and food during the analyst event and gave us all Google Home Minis as going away presents and NetApp is a long time customer of my firm.

Random access, DNA object storage system

Posted on February 24, 2018 by Ray in Clustered storage, data access, Data density, Data index, Data longevity, data protection, File Storage, Object storage, Strategic Inflection Points, Visionary leadershp

Read a couple of articles this week Inching closer to a DNA-based file system in ArsTechnica and DNA storage gets random access in IEEE Spectrum. Both of these seem to be citing an article in Nature, Random access in large-scale DNA storage (paywall).

We’ve known for some time now that we can encode data into DNA strings (see my DNA as storage … and Genomic informatics takes off posts).

However, accessing DNA data has been sequential and reading and writing DNA data has been glacial. Researchers have started to attack the sequentiality of DNA data access. The prize, DNA can store 215PB of data in one gram and DNA data can conceivably last millions of years.

Researchers at Microsoft and the University of Washington have come up with a solution to the sequential access limitation. They have used polymerase chain reaction (PCR) primers as a unique identifier for files. They can construct a complementary PCR primer that can be used to extract just DNA segments that match this primer and amplify (replicate) all DNA sequences matching this primer tag that exist in the cell.

DNA data format

The researchers used a Reed-Solomon (R-S) erasure coding mechanism for data protection and encode the DNA data into many DNA strings, each with multiple (metadata) tags on them. One of tags is the PCR primer tag header, another tag indicates the position of the DNA data segment in the file and an end of data tag that is the same PCR primer tag.

The PCR primer tag was used as sort of a file address. They could configure a complementary PCR tag to match the primer tag of the file they wanted to access and then use the PCR process to replicate (amplify) only those DNA segments that matched the searched for primer tag.

Apparently the researchers chunk file data into a block of 150 base pairs. As there are 2 complementary base pairs, I assume one bit to one base pair mapping. As such, 150 base pairs or bits of data per segment means ~18 bytes of data per segment. Presumably this is to allow for more efficient/effective encoding of data into DNA strings.

DNA strings don’t work well with replicated sequences of base pairs, such as all zeros. So the researchers created a random sequence of 150 base pairs and XOR the file DNA data with this random sequence to determine the actual DNA sequence to use to encode the data. Reading the DNA data back they need to XOR the data segment with the random string again to reconstruct the actual file data segment.

Not clear how PCR replicated DNA segments are isolated and where they are originally decoded (with a read head). But presumably once you have thousands to millions of copies of a DNA segment, it’s pretty straightforward to decode them.

Once decoded and XORed, they use the R-S erasure coding scheme to ensure that the all the DNA data segments represent the actual data that was encoded in them. They can then use the position of the DNA data segment tag to indicate how to put the file data back together again.

What’s missing?

I am assuming the cellular data storage system has multiple distinct cells of data, which are clustered together into some sort of organism.

Each cell in the cellular data storage system would hold unique file data and could be extracted and a file read out individually from the cell and then the cell could be placed back in the organism. Cells of data could be replicated within an organism or to other organisms.

To be a true storage system, I would think we need to add:

DNA data parity – inside each DNA data segment, every eighth base pair would be a parity for the eight preceding base pairs, used to indicate when a particular base pair in eight has mutated.
DNA data segment (block) and file checksums – standard data checksums, used to verify and correct for double and triple base pair (bit) corruption in DNA data segments and in the whole file.
Cell directory – used to indicate the unique Cell ID of the cell, a file [name] to PCR primer tag mapping table, a version of DNA file metadata tags, a version of the DNA file XOR string, a DNA file data R-S version/level, the DNA file length or number of DNA data segments, the DNA data creation data time stamp, the DNA last access date-time stamp,and DNA data modification data-time stamp (these last two could be omited)
Organism directory – used to indicate unique organism ID, organism metadata version number, organism unique cell count, unique cell ID to file list mapping, cell ID creation data-time stamp and cell ID replication count.

The problem with an organism cell-ID file list is that this could be quite long. It might be better to somehow indicate a range or list of ranges of PCR primer tags that are in the cell-ID. I can see other alternatives using a segmented organism directory or indirect organism cell to file lists b-tree, which could hold file name lists to cell-ID mapping.

It’s unclear whether DNA data storage should support a multi-level hierarchy, like file system directories structures or a flat hierarchy like object storage data, which just has buckets of objects data. Considering the cellular structure of DNA data it appears to me more like buckets and the glacial access seems to be more useful to archive systems. So I would lean to a flat hierarchy and an object storage structure.

Is DNA data is WORM or modifiable? Given the effort required to encode and create DNA data segment storage, it would seem it’s more WORM like than modifiable storage.

How will the DNA data storage system persist or be kept alive, if that’s the right word for it. There must be some standard internal cell mechanisms to maintain its existence. Perhaps, the researchers have just inserted file data DNA into a standard cell as sort of junk DNA.

If this were the case, you’d almost want to create a separate, data nucleus inside a cell, that would just hold file data and wouldn’t interfere with normal cellular operations.

But doesn’t the PCR primer tag approach lend itself better to a key-value store data base?

Photo Credit(s): Cell structure National Cancer Institute

Prentice Hall textbook

Guide to Open VMS file applications

Unix Inodes CSE410 Washington.edu

Key Value Databases, Wikipedia By Clescop – Own work, CC BY-SA 4.0, Link

Scratch file use in HPC @ORNL, a statistical analysis

Posted on November 17, 2017 by Ray in Clustered storage, data access, File Storage, Infiniband, R&D measures, System effectiveness

Attended SC17 (Supercomputing Conference) this past week and I received a copy of the accompanying research proceedings. There are a number of interesting papers in the research and I came across one, Scientific User Behavior and Data Sharing Trends in a Peta Scale File System by Seung-Hwan Lim, et al from Oak Ridge National Laboratory (ORNL) and the use of files at the Oak Ridge Leadership Computing Facility (OLCF) which was very interesting.

The paper statistically describes the use of a Scratch files in a multi PB file system (Lustre) at OLCF from January 2015 to August 2016. The OLCF supports over 32PB of storage, has a peak aggregate of over 1TB/s and Spider II (current Lustre file system) consists of 288 Lustre Object Storage Servers, all interconnected and connected to all the supercomputing cluster of servers via an InfiniBand network. Spider II supports all scratch storage requirements for active/queued jobs for the Titan (#4 in Top 500 [super computer clusters worldwide] list) and other clusters at ORNL.

ORNL uses an HPSS (High Performance Storage System) archive for permanent storage but uses the Spider II file system for all scratch files generated and used during supercomputing applications. ORNL is expecting Spider III (2018-2023) to host 10 billion files.

Scratch files are purged from Spider II after 90 days of no access.The paper is based on metadata analysis captured during scratch purging process for 500 days of access.

The paper displays a number of statistics and metrics on the use of Spider II:

Less than 3% of projects have a directory depth >15, the maximum directory depth was recorded at 432, with most projects having a shallow (<10) directory depth.
A project typically has 10X the files that a specific researcher has and a median file count/researcher is 2000 files with a median project having 20,000 files.
Storage system performance is actively managed by many projects. For instance, 20 out of 35 science domains manually managed their Lustre cluster configuration to improve throughput.
File count continues to grow and reached a peak of 1B files during the time being analyzed.
On average only 3% of files were accessed readonly, 10% of files updated (read-write) and 76% of files were untouched during a week period. However, median and maximum file age was 138 and 214 days respectively, which means that these scratch files can continue to be accessed over the course of 200+ days.

There was more information in the paper but one item missing is statistics on scratch file size distribution a concern.

Nonetheless, in paints an interesting picture of scratch file use in HPC application/supercluster environments today.

Comments?

There’s a new cluster filesystem on the block, Elastifile

Posted on April 14, 2017April 14, 2017 by Ray in Cloud storage, Clustered storage, Distributed computing, Ethernet, File Storage, NVMe, NVMe storage, SSD storage

At SFD12 last month we talked with the team from Elastifile. They are a new startup out of Israel working on a better cluster file system.

Elastifile was designed to support 1000s of nodes, 100,000 of users/client and 1000s of data containers (file systems/mount points), together with an infinite (64 bit) number of files and directories and up to Exabytes (10**18) in capacity. They also offer a 100% SSD file store capability. I encourage you to view the videos of their presentations at SFD12 to learn more.

Elastifile features

Elastifile supports data compression and optionally deduplication with NAND/Flash (e. g., low-/high-endurance) storage tiering, cloud storage tiering and multi-site storage. They also provide NFSv3/v4, SMB, AWS S3 and HDFS as native access protocols for their file storage.

They also offer non-disruptive hardware/software upgrades, n-way (2- or 3-way) data and metadata redundancy, self-healing capabilities, snapshots, and synchronous/asynchronous data replication or mirroring. Further, they provide multi-tenancy and QoS support.

Elastifile can be used in hyper converged mode as well as a dedicated storage server mode. For backend storage, they support heterogeneous, physical (block, I think?) storage systems as well as direct access storage in cluster nodes

Internals matter

Elastifile’s architecture supports accessor, owner and data nodes. But these can all be colocated on the same server or segregated across different servers.

Owner nodes, own all the metadata objects for a file or directory and caches the metadata working set in i’s memory. Ownership file or directory metadata may change in the case of hardware failures.

Elastifile supports a dynamic write data path, which means they determine, in real time, where to write file data rather than having the data locations identified before hand. They call this distributed write anywhere semantics.

Notably they don’t do data caching (with NVMe it doesn’t make sense) however, as noted above, they do use metadata caching

Internally, Elastifile uses variable length objects for both file data and metadata.

File data is composed of three object types: a file metadata (FileMD) object, mapping data objects, and file data objects. FileMD’s hold the normal file metadata (name, file size, create, access & modify ToDs, etc.) as well as pointing to all the Mapping Object (OIDs). Mapping objects exist for each 0.5MB of file data and consist of a 128 element table, each element mapping 4KB of file address space to a data object (OID). Each data object holds the 4KB of compressed file data and journal log entries.
Director metadata is composed of directory metadata (DirMD) object and Directory listing objects. Directory listing objects maps file/directory names to FileMD or DirMD OIDs. Directory listing objects are accessed via an extensible hash table and contain a list of filenames/directory names within the directory

The Elastifile software architecture consists of three layers:

A protocol layer which terminates file system access protocols and translates requests into internal requests. The hashing and data compression of file data occur at this level.
A metadata layer which provides file system/directory name mapping to objects for owned files/directories and maintains file/directory metadata updates/journals/checkpoints.
A data layer which provides transaction consistency and a n-way redundant persistent data storage for (file or metadata) objects.

Metadata operations are persisted via journaled transactions and which are distributed across the cluster. For instance the journal entries for a mapping data object updates are written to the same file data object (OID) as the actual file data, the 4KB compressed data object.

There’s plenty of discussion on how they manage consistency for their metadata across cluster nodes. Elastifile invented and use Bizur, a key-value consensus based DB. Their chief architect Ezra Hoch (@EzraHoch) did a blog post and paper on Bizur for more information

~~~~

New file systems generally take many years to mature and get out into the market, cluster file systems even longer. Elastifile started in 2013, by some very smart engineers, is already on the market, just 4 years later. That’s impressive enough, but with their list of advanced functionality plus cloud storage tiering and multi-site operations all shipping in the current product is mind-blowing.

One lingering question is, does a market exist for another cluster file system? All flash is interesting but most of the current CFS’s do this and ship this today. Cloud storage tiering is interesting and a long term need but some CFSs already have this and others are no doubt implementing it as we speak. CFS’s use of objects for internal data and metadata management is not new and may make internals cleaner but don’t really provide a lot of customer benefit.

Exascale raw capacity, support for 100K users, 1000s of nodes, 1000s of file systems and an infinite # of files/directories is interesting. But most CFSs claim this level of support already, although this is more aspirational for some. And proving support at this scale is difficult, if not impossible.

On the other hand, Bizur is really neat. Its primary benefit is during recovery from hardware failures. For a CFS with 1000s of nodes, failures likely occur quite often. So Bizur’s advantage here may pay significant customer dividends.

Is that enough to to market a new CFS?

To see what other SFD12 bloggers have written on Elastifile, please see:

Andrew Mauro’s (@Andrew_Mauro) post, Elastifile launches cross-cloud data fabric
Adam Bergh’s (@AJBergh) post recap of SFD12 – day 1
Chin-Fat Heoh’s (@StorageGaga) post The engineering of Elastifile

Exablox, bring your own disk storage

Posted on July 8, 2016July 8, 2016 by Ray in CIFS/SMB, Clustered storage, Data reduction, Ethernet, File Storage, NFS, Object storage

We talked with Exablox a month or so ago at Storage Field Day 10 (SFD10) and they discussed some of their unique storage solution and new software functionality. If you’re not familiar with Exablox they sell a OneBlox appliance with drive slots, but no data drives.

The OneBlox appliance provides a Linux based, scale-out, distributed object storage software with a file system in front of it. They support SMB and NFS access protocols and have inline deduplication, data compression and continuous snapshot capabilities. You supply the (SATA or SAS) drives, a bring your own drive (BYOD) storage offering.

Their OneSystem management solution is available on a subscription basis, which usually runs in the cloud as a web accessed service offering used to monitor and manage your Exablox cluster(s). However, for those customers that want it, OneSystem is also available as a Docker Container, where you can run it on any Docker compatible system.
Continue reading “Exablox, bring your own disk storage” →

Faster Docker initialization through Slacker snapshots & NFS storage

Posted on May 6, 2016May 17, 2016 by Ray in Clustered storage, Distributed computing, File Storage, Storage Features, Storage performance, System effectiveness

Just got back from EMCWorld2016 this week but on the way there and back I was perusing the FAST’16 papers. One of the papers I read (see Slacker: Fast Distribution with Lazy Docker Containers, p. 181) discussed performance problems with initializing Docker container micro-services and how they could be solved using persistent, intelligent NFS storage.

It appears that Docker container initialization spends a lot of time provisioning and initializing a local file system for each container. Docker containers typically make use of an AUFS (Another Union File System) storage driver which makes use of another file system (like ext4) as its underlying storage which has beneath it either DAS or external storage.

When using persistent and intelligent NFS storage, Docker can take advantage of storage system snapshots and cloning to improve container initialization significantly. In the paper, the researchers used Tintri as the underlying persistent, enterprise class NFS storage but I believe the functionality that’s taken advantage of is available with most enterprise class NAS systems and as such, is readily available with other storage subsystems.
Continue reading “Faster Docker initialization through Slacker snapshots & NFS storage” →

(QoM16-002): Will Intel Omni-Path GA in scale out enterprise storage by February 2016 – NO 0.91 probability

Posted on February 29, 2016May 17, 2016 by Ray in Clustered storage, Ethernet, Forecasting, Infiniband, Omni-Path, QoM 2016, QoW 2015, SSD storage

Question of the month (QoM for February is: Will Intel Omni-Path (Architecture, OPA) GA in scale out enterprise storage by February 2016?

In this forecast enterprise storage are the major and startup vendors supplying storage to data center customers.

What is OPA?

OPA is Intel’s replacement for InfiniBand and starts out at 100Gbps. It’s intended more for high performance computing (HPC), to be used as an inter-cluster server interconnect or next generation fabric. Intel says it “will maintain consistency and compatibility with existing Intel True Scale Fabric and InfiniBand APIs by working through the open source OpenFabrics Alliance (OFA) software stack on leading Linux* distribution releases”. Seems like Intel is making it as easy as possible for vendors to adopt the technology.
Continue reading “(QoM16-002): Will Intel Omni-Path GA in scale out enterprise storage by February 2016 – NO 0.91 probability” →