At Storage Field Day 15 (SFD15) we had a few sessions with Huawei on some of their latest storage technology. One of the sessions I was particularly interested in was OceanStor Dorado (enterprise-class, block storage), an architectural deep dive with Chun Liu (see video here).
Their latest OceanStor Dorado 18000F storage system, due out soon, can scale up to 16 controllers in a cluster and supports all-flash storage configurations. The new Dorado 18000F block storage system supports inline compression and deduplication for data reduction.
Their latest SPC-1 benchmark run showed 800K IOPS at 500µsec response time with inline deduplication and compression turned on. Although it's unclear whether SPC-1 data is deduplicable or compressible, so running with data reduction on may have hurt their performance with no corresponding advantage in capacity or cost.
Chun had one chart showing that, historically, as you add storage system features you often lose 70-80% of performance. However, by sharding metadata and other data structures and avoiding (most) serialization, they have managed to add features without serious performance impact. In fact, with the latest architecture, running RAID-TP (3 parity), inline compression, inline deduplication and metro cluster together, they lose only about 20% of baseline system performance. Although, if the metro cluster they're using is synchronous replication, the other site must not be that far away.
They have a pretty standard protocol layer at the top, with replication, snapshot and LUN management below that, and a cache layer next. Then it gets interesting: they have a distributed object router layer, with deduplication/compression and metadata management underneath that, and then the data layout layer. The infrastructure (backend) layer sits at the bottom, with inter-cluster communications spanning the cluster of controllers. Every enclosure has 2 controllers and inter-cluster communication is over switched PCIe. SSDs can be NVMe or SAS.
IO without serialization
They support a log-structured file system on the back end, but not just one log. Their internal architecture is a share-nothing approach which shards metadata, fingerprint databases, logs, and other data structures. Each of these shards is assigned CPU core/thread affinity and, as long as nothing goes wrong, the storage code operates on shards with no serialization required.
To maximize IO performance they use a lightweight thread (LWT) compute model that's non-preemptive. They partition all data structures into fine-grained shards, and each shard is assigned core/thread affinity. That way compute threads share nothing, resulting in lock-free execution. An LWT runs beginning to end, without preemption, to complete any data updates required and minimize contention.
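The shard-ownership idea can be sketched in a few lines. This is a hypothetical illustration, not Huawei's code: the shard count, names and hash routing are all my assumptions. The point is that because each shard is owned by exactly one core/thread, updates within a shard need no locks; routing by hash replaces serialization.

```python
NUM_SHARDS = 4  # assumption; the real system presumably partitions far more finely

shards = [dict() for _ in range(NUM_SHARDS)]  # per-shard metadata stores

def shard_for(key: str) -> int:
    """Route a key to its owning shard (and hence its owning core/thread)."""
    return hash(key) % NUM_SHARDS

def update_metadata(key: str, value) -> None:
    # In the real design this would run as a non-preemptive lightweight
    # thread on the core that owns the shard; here we just touch the
    # right dict, which no other "thread" would ever access.
    shards[shard_for(key)][key] = value

update_metadata("lun0:block42", {"phys": 0x1000})
assert shards[shard_for("lun0:block42")]["lun0:block42"]["phys"] == 0x1000
```

Because the same key always hashes to the same shard, two updates to the same metadata can never race across cores, which is what makes the lock-free execution possible.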
Write flow: the system receives data in cache, mirrors it to the adjacent controller's cache and then responds back to the host. Controller cache is battery-backed, non-volatile storage.
The cache data is then compressed and, with deduplication active, fingerprinted. Data fingerprints are used to determine which fingerprint database shard (and corresponding core/thread) to route the data to for further processing. Because their fingerprint hash is "weak", they also byte-compare any matched data against the unique data already stored. If the data is unique, it's routed to the LUN mapping shard (and corresponding core/thread) to calculate a physical address to write the data. Sometime later the data is routed to RAID aggregation and written out to backend SSDs.
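That fingerprint-then-verify step can be sketched as follows. Everything here is an assumption for illustration: I use adler32 as a stand-in "weak" hash and zlib as a stand-in compressor, and the shard/store structures are invented. What it shows is the flow described above: fingerprint the compressed data, route to the fingerprint-database shard, and byte-compare before declaring a dedupe hit.

```python
import zlib

NUM_FP_SHARDS = 4  # assumption
fp_shards = [dict() for _ in range(NUM_FP_SHARDS)]  # fingerprint -> stored bytes

def fingerprint(data: bytes) -> int:
    # Stand-in "weak" hash; collisions are possible, hence the verify step.
    return zlib.adler32(data)

def dedupe_write(data: bytes) -> str:
    compressed = zlib.compress(data)
    fp = fingerprint(compressed)
    shard = fp_shards[fp % NUM_FP_SHARDS]  # routes to that shard's core/thread
    if fp in shard and shard[fp] == compressed:  # byte-compare: weak hash can collide
        return "dedup-hit"   # real system would just bump a reference count
    shard[fp] = compressed   # unique data: would be routed on to LUN mapping
    return "unique"

assert dedupe_write(b"A" * 4096) == "unique"
assert dedupe_write(b"A" * 4096) == "dedup-hit"
```

Note that routing by fingerprint, like routing metadata by key, keeps each fingerprint-database shard owned by a single core/thread, so the dedupe lookup itself needs no locking.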
Read flow: when a read request comes in, they check the LUN map shard (core/thread) and, if the entry points to a fingerprint index, they know it's a deduped block and read that data to respond to the request.
They have some specially designed, optimized code paths. For example, standard RAID-TP algorithms perform RAID protection at 2.3GB/s or 4.5GB/s, but the Huawei OceanStor Dorado 18000F can perform triple-parity RAID calculations at 6.5GB/s. Similarly, standard LZ4 data compression can compress data at ~507MB/s (on email data) but Huawei's compression algorithm can perform compression (on email data) at ~979MB/s. Ditto for CRC16 (used to check block integrity): traditional CRC16 algorithms operate at ~2.3GB/s but Huawei can sustain ~7.2GB/s.
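For reference, the CRC16 integrity check itself is simple; what Huawei claims to have optimized is its throughput. A minimal sketch using Python's stdlib CRC-16/CCITT (`binascii.crc_hqx`) shows the check, not their implementation; the function names here are mine.

```python
import binascii

def protect(block: bytes) -> tuple:
    """Return the block together with its CRC16, stored alongside it."""
    return block, binascii.crc_hqx(block, 0)

def verify(block: bytes, crc: int) -> bool:
    """Recompute the CRC16 on read and compare against the stored value."""
    return binascii.crc_hqx(block, 0) == crc

block, crc = protect(b"\x00" * 4096)
assert verify(block, crc)
assert not verify(b"\x01" + b"\x00" * 4095, crc)  # corruption is detected
```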
For data on SSDs, they identify data with a short life span (quickly overwritten) and try to coalesce this short-lived data onto its own flash pages. That way all the data in a short-life-span flash page gets freed up together, and the page can be erased and overwritten without having to move old, non-deleted (long-lived) data to new blocks. They claim to have reduced write amplification (non-new data block writes) by 60% this way.
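The segregation idea can be shown with a toy placement routine. The lifetime predictor and page grouping below are placeholders of my own; the point is only that when short-lived blocks share pages, those pages can die (and be erased) together, with no long-lived data to relocate.

```python
from collections import defaultdict

# lifetime class -> block ids placed on that group of flash pages
pages = defaultdict(list)

def place_write(block_id: int, predicted_short_lived: bool) -> str:
    """Route a write to the page group matching its predicted lifetime."""
    cls = "short" if predicted_short_lived else "long"
    pages[cls].append(block_id)
    return cls

place_write(1, True)
place_write(2, True)
place_write(3, False)
# When blocks 1 and 2 are overwritten, the "short" page group holds no
# still-valid data, so it can be erased with zero relocation writes --
# the write amplification the coalescing is meant to avoid.
assert pages["short"] == [1, 2]
```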
Also, LUNs can be configured as throughput optimized or IOPS optimized. It's unclear how, but it probably has something to do with cache management and backend layout.
Overall, I was impressed with their ability to reduce serialization bottlenecks. Back in the old days, when I was looking for ways to optimize code, we always seemed to be spending 30-50% of CPU cycles spinning on locks, waiting to obtain a lock before code execution could continue.
It never occurred to me we didn’t have to use locks at all.
For more information, please read these other SFD15 blogger posts on Huawei: