Hedvig storage system, Docker support & data protection that spans data centers

We talked with Hedvig (@HedvigInc) at Storage Field Day 10 (SFD10) a month or so ago and had a detailed deep dive into their technology. (Check out the videos of their sessions here.)

Hedvig implements a software defined storage solution that runs on x86 or ARM processors and depends on a storage proxy operating in a hypervisor host (as a VM) plus storage service nodes. The proxy and the storage services can execute as separate VMs on the same host in a hyper-converged fashion, or on different nodes as a separate storage cluster with hosts doing IO to the storage cluster.

Hedvig’s management team comes from hyper-scale environments (Amazon Dynamo/Facebook Cassandra) so they have lots of experience implementing distributed software defined storage at (hyper-)scale.

Hedvig functionality

Hedvig supports data deduplication, data compression, snapshots, clones and data replication/protection.

Deduplication, compression and replication can be specified separately or together on a virtual disk (vdisk) basis. Dedupe, compression and replication cannot be changed on a vdisk once it's created, but you can snapshot and clone the vdisk and, at clone creation, modify its dedupe, compression and replication settings.
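Since these settings are locked in at vdisk creation, changing them amounts to a snapshot-clone-and-cut-over workflow. Here's a toy model of that immutability (the class and function are my own invention, not Hedvig's API):

```python
from dataclasses import dataclass, replace

# Toy model: per-vdisk dedupe/compression/replication settings are
# fixed at creation (frozen), but a clone can pick new values.
# Purely illustrative; not Hedvig's actual API.

@dataclass(frozen=True)              # frozen: settings can't change in place
class Vdisk:
    name: str
    dedupe: bool
    compression: bool
    replication_factor: int

def clone_with_policy(vdisk, **new_settings):
    """Clone a vdisk, overriding policy settings at clone creation."""
    return replace(vdisk, name=vdisk.name + "-clone", **new_settings)

src = Vdisk("vol1", dedupe=False, compression=True, replication_factor=3)
dst = clone_with_policy(src, dedupe=True)   # clone gets dedupe enabled
```

The original vdisk keeps its settings; only the clone carries the new policy.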

Deduplication is on a fixed block basis and is global across all vdisks in the system that have dedupe enabled.

Hedvig storage proxies talk iSCSI or NFS protocols to VMs, and the proxy service uses RESTful APIs to talk to the storage nodes. Vdisks are essentially iSCSI LUNs or NFS file shares.

Hedvig storage services can also support native object storage access, using the RESTful APIs directly without the need for a storage proxy.

Hedvig’s data protection can span data centers

Hedvig data protection uses replication. This is specified per vdisk by a replication factor (RF) that can range from 1 (no data replication) to 6 (six copies of all data blocks, written to six separate storage services). Vdisk replication can also be:

  • Agnostic replication, which means that data will be replicated across storage service nodes to the extent requested, with no knowledge of which data center or rack the storage nodes are located in.
  • Rack aware replication, which means that Hedvig will ensure that data replicas are distributed to storage service nodes located in different racks.
  • Data center aware replication, which means that Hedvig will ensure that data replicas are distributed to storage service nodes located in different data centers.
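The three policies differ only in which failure domain replicas get spread across. Here's a toy placement sketch (the node layout and selection logic are my own illustration, not Hedvig's code):

```python
# Toy replica-placement sketch for agnostic / rack-aware / DC-aware
# policies. Node layout and selection logic are illustrative only.

def place_replicas(nodes, rf, policy):
    """Pick rf nodes, spreading replicas across the requested failure domain.

    nodes: list of dicts like {"id": "n1", "rack": "r1", "dc": "east"}
    policy: "agnostic", "rack", or "dc"
    """
    key = {"agnostic": lambda n: n["id"],   # every node is its own domain
           "rack":     lambda n: n["rack"],
           "dc":       lambda n: n["dc"]}[policy]
    chosen, used_domains = [], set()
    for node in nodes:
        if key(node) not in used_domains:   # one replica per domain
            chosen.append(node["id"])
            used_domains.add(key(node))
        if len(chosen) == rf:
            return chosen
    raise ValueError(f"not enough distinct {policy} domains for RF={rf}")

nodes = [{"id": "n1", "rack": "r1", "dc": "east"},
         {"id": "n2", "rack": "r1", "dc": "east"},
         {"id": "n3", "rack": "r2", "dc": "east"},
         {"id": "n4", "rack": "r3", "dc": "west"}]
```

With this layout, rack-aware RF=3 skips the second node in rack r1 and lands replicas on n1, n3 and n4.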

For data center aware replication, customers can specify which data centers host replicas. So if you have two nearby data centers and two others farther away, then for RF=3 admins can specify that the two nearby data centers and one of the farther away ones be used for replication.

Hedvig supports a quorum write, which ensures that data is written to RF/2+1 nodes synchronously (writes won't be acknowledged to the host until a majority of the replicas have been written successfully). The remaining replicas are written asynchronously and tracked to ensure they complete successfully. This way, with data center aware replication and RF=3, a host write is acknowledged when the first two writes complete successfully, while the third write can complete sometime later, asynchronously.
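The quorum size works out to RF/2+1 with integer division, i.e. a strict majority. Here's a sketch of the ack logic as I understand it (illustrative only):

```python
# Sketch of the quorum-write rule as described: acknowledge the host
# once floor(RF/2) + 1 replicas are durable; the remaining replicas
# complete asynchronously and are tracked. Illustrative only.

def quorum_size(rf):
    return rf // 2 + 1                 # e.g. RF=3 -> 2, RF=6 -> 4

def write_with_quorum(replica_results, rf):
    """replica_results: booleans in completion order (True = replica OK).
    Returns (host_acked, replicas_left_to_finish_async)."""
    needed = quorum_size(rf)
    synced = sum(1 for ok in replica_results[:needed] if ok)
    return synced == needed, rf - needed
```

For RF=3 the host is acked after two successful replica writes, with one replica trailing asynchronously.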

Traditionally, cross data center data protection (mirroring) has been distinct from, and specified separately from, within-data-center protection (RAID levels, replication counts). Hedvig doesn't share this view.

(Storage) containers, pools and services

Data in a vdisk is chunked into (16GB) containers, and Hedvig supports two different block sizes for a vdisk: 512B or 4KB. Containers are written to and accessed from storage pools (sets of 3 disks) on a storage server.
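With fixed 16GB containers, mapping a vdisk offset to a container is straightforward integer arithmetic. A quick sketch (my illustration of the chunking, not Hedvig's code):

```python
# Sketch: map a vdisk byte offset to its 16GB container and to a
# block within that container (512B or 4KB block sizes). My
# illustration of the chunking arithmetic, not Hedvig's code.

CONTAINER_SIZE = 16 * 2**30            # 16GB containers

def locate(offset, block_size=4096):
    """Return (container_index, block_index_within_container)."""
    container = offset // CONTAINER_SIZE
    block = (offset % CONTAINER_SIZE) // block_size
    return container, block

# A write at byte offset 48GB + 8KB lands in container 3, block 2.
```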

Each storage service node runs two processes. When a storage proxy writes to a block in a container for the first time, it first talks with a metadata (Pages) process running on the storage nodes, to allocate a new container and decide which nodes to replicate to. The storage proxy then writes the data to a data (Hblock) process on one of these replication nodes, which then writes it to the other replica storage services.

The Pages metadata process runs globally across all the storage service nodes and maintains the global state of the Hedvig cluster (across data centers). The Hblock process runs locally on each node, but writes replicas to other nodes as needed for data protection.
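Putting the two processes together, the first-write flow might look roughly like this (all class and method names are my own illustration of the described flow, not Hedvig's software):

```python
# Sketch of the first-write path as described: the proxy asks Pages to
# allocate a container and pick replica nodes, then sends the data to
# one Hblock process, which forwards it to the other replicas.
# All names are my own illustration.

class Pages:
    """Toy metadata service: allocates containers, picks replica nodes."""
    def __init__(self, nodes):
        self.nodes, self.next_container = nodes, 0
    def allocate(self, rf):
        cid = self.next_container
        self.next_container += 1
        return cid, self.nodes[:rf]          # naive replica choice

class Hblock:
    """Toy data service: stores blocks, forwards to replica peers."""
    def __init__(self, name):
        self.name, self.store = name, {}
    def write(self, cid, data, peers=()):
        self.store[cid] = data
        for peer in peers:                   # replicate to the other nodes
            peer.write(cid, data)

nodes = [Hblock(f"n{i}") for i in range(3)]
pages = Pages(nodes)
cid, replicas = pages.allocate(rf=3)         # proxy asks Pages first
replicas[0].write(cid, b"block-0", peers=replicas[1:])  # then writes Hblock
```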

We didn’t get a lot of info on how the Pages process works across nodes and especially across data centers, supplying the global state for all Hedvig cluster nodes. I guess this is some of their secret sauce used to implement Hedvig.

Hedvig & Docker (Containers)

Hedvig talked about their Docker plugin, which supports Docker Engines and containers and implements the Docker Volume API. The Docker UCP (Universal Control Plane) or CLI sends a volume request to the Docker Engine, which talks to the Hedvig Docker plugin, which then talks to the storage proxy to define a volume (through the Pages metadata process running in the storage cluster).

When a Docker Container starts, the volume is mounted as a file share and file IO can take place to the volume through Hedvig’s storage proxy. If the container moves, IO can continue by re-mounting the share for the container on another Docker Engine with IO going through another Hedvig storage proxy. The storage proxy runs in a container on Docker Engine hosts.

Caching in Hedvig

Hedvig supports three different caches at the Proxy service:

  • MetaCache, in DRAM on the host/server (or SSDs if available), holds container addresses for blocks written through this host's proxy service. Each proxy has a MetaCache; I believe it's 32GB in DRAM. MetaCaches do not persist, and are recreated after a crash from the Pages processes.
  • BlockCache is a read-only block cache residing in SSD only and can be enabled for each disk. This cache is optional and only uses SSDs available in the host running the proxy service.
  • DedupeCache, in DRAM or SSD, is enabled only if vdisks are deduplicated and holds additional dedupe metadata like block hashes and dedupe log addresses.

MetaCache helps alleviate Pages traffic for data and volumes that have been written to before. BlockCaches can supply high performance for blocks that have been seen before. The DedupeCache helps speed up the global deduplication process.
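Pulling the three together, a read through the proxy might consult the caches in roughly this order (my reconstruction of the flow, not Hedvig's actual code):

```python
# Sketch of a proxy read consulting MetaCache (block -> container
# address) and BlockCache (block data), falling back to the Pages
# metadata service and an Hblock data node on misses.
# My reconstruction of the flow, not Hedvig's actual lookup path.

def proxy_read(block_id, meta_cache, block_cache, pages, hblock):
    addr = meta_cache.get(block_id)
    if addr is None:                   # MetaCache miss: ask Pages
        addr = pages[block_id]
        meta_cache[block_id] = addr    # cache the container address
    data = block_cache.get(block_id)
    if data is None:                   # BlockCache miss: read from Hblock
        data = hblock[addr]
        block_cache[block_id] = data
    return data

pages = {"b1": "c7"}        # metadata: block b1 lives in container c7
hblock = {"c7": b"hello"}   # data node holding container c7
meta_cache, block_cache = {}, {}
data = proxy_read("b1", meta_cache, block_cache, pages, hblock)
```

After the first read both caches are warm, so a repeat read never touches Pages or Hblock at all.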

Hedvig in operations

Hedvig clusters can non-disruptively grow or shrink and the Pages service will automatically re-distribute replicas to remaining nodes (if shrinking) or use new nodes to hold replicas of new containers (if growing).

I assume you inform Hedvig which data center and racks the storage service nodes are located in, but it's possible it could discover this automatically.

Each proxy service monitors IO activity to the storage nodes and will automatically adjust its IO activity to favor nodes that are performing better.

The Pages process also monitors node IO activity and can migrate data replicas around the cluster to balance out workloads.

As Hedvig is software defined storage and is data center aware, you could potentially have a multi-data center system that spans AWS, Azure and Google Cloud Platform, replicating data between them… How's that for cloud storage!

With appropriate BlockCache and DedupeCache, Hedvig can generate 90K IOPS per proxy server, which would scale as you increase hosts. But this all assumes that the data working sets fit in their respective caches on the proxy server.


The technical deep dive is in the 2nd video (see link above) and is pretty informative, if you're looking for more information. They showed some demos of their storage in the following videos as well.

I was impressed with the functionality and the unique view of cross data center data protection services. I probed pretty hard on their caching logic but as far as I can tell it all makes sense, assuming a single proxy is writing to a vdisk. Shared writers won’t work in Hedvig’s scheme.

Photo Credit(s): Screenshots from their video deep dive session at SFD10 & Docker explanation slide provided by Hedvig


BlockStack, a Bitcoin secured global name space for distributed storage

At the USENIX ATC conference a couple of weeks ago, there was a presentation by a number of researchers on their BlockStack global name space and storage system, built on the blockchain-based Bitcoin network. Their paper was titled "Blockstack: A global naming and storage system secured by blockchain" (see pp. 181-194 in the USENIX ATC'16 proceedings).

Bitcoin blockchain simplified

Blockchains like Bitcoin have a number of interesting properties, including a completely distributed understanding of current state, based on hashing and an append-only log of transactions.

Blockchain nodes all participate in validating the current block of transactions and some nodes (deemed “miners” in Bitcoin) supply new blocks of transactions for validation.

All blockchain transactions are sent to each node; the blockchain software in the node timestamps each transaction and accumulates them in an ordered append log (the "block"), which is then hashed. Each new block contains a hash of the previous block (the "chain" in blockchain).

The miner's block is then compared against the non-miner nodes' blocks (hashes are compared) and, if equal, everyone reaches consensus (agrees) that the transaction block is valid. Then the next miner supplies a new block of transactions, and the process repeats. (See Wikipedia's article for more info.)
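The append-and-hash mechanics are easy to sketch: each block carries the hash of its predecessor, so rewriting any earlier block invalidates every later link. A minimal illustration (my toy code, nothing like real Bitcoin internals):

```python
import hashlib, json

# Minimal hash-chain sketch: each block records the hash of its
# predecessor, so tampering with any earlier block breaks every
# later link. Illustrative only; real Bitcoin blocks carry far
# more structure (proof of work, Merkle trees, etc.).

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, transactions):
    prev = block_hash(chain[-1]) if chain else "0" * 64  # genesis marker
    chain.append({"prev_hash": prev, "txs": transactions})

def verify(chain):
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = []
append_block(chain, ["alice->bob: 1 BTC"])
append_block(chain, ["bob->carol: 2 BTC"])
ok_before = verify(chain)                     # links all match
chain[0]["txs"] = ["alice->mallory: 1 BTC"]   # tamper with an early block
ok_after = verify(chain)                      # downstream hash now broken
```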

All blockchain transactions are owned by a cryptographic address. Each cryptographic address has a public and private key associated with it.

Exablox, bring your own disk storage

We talked with Exablox a month or so ago at Storage Field Day 10 (SFD10), and they discussed their unique storage solution and new software functionality. If you're not familiar with Exablox, they sell a OneBlox appliance with drive slots but no data drives.

The OneBlox appliance provides Linux based, scale-out, distributed object storage software with a file system in front of it. They support SMB and NFS access protocols and have inline deduplication, data compression and continuous snapshot capabilities. You supply the (SATA or SAS) drives, making it a bring your own drive (BYOD) storage offering.

Their OneSystem management solution is available on a subscription basis, which usually runs in the cloud as a web accessed service offering used to monitor and manage your Exablox cluster(s). However, for those customers that want it, OneSystem is also available as a Docker Container, where you can run it on any Docker compatible system.

Surprises in disk reliability from Microsoft’s “free cooled” datacenters

At Usenix ATC'16 last week, there was a "best of the rest" session which repeated selected papers presented at FAST'16 earlier this year. One that caught my interest discussed disk reliability in free cooled data centers at Microsoft (Environmental conditions and disk reliability in free-cooled datacenters, see pp. 53-66).

The paper discusses disk reliability at 9 different Microsoft datacenters, covering over 1M drives over the course of 1.5 to 4 years, versus how the datacenters were cooled.

Testing filesystems for CPU core scalability

I attended the HotStorage'16 and Usenix ATC'16 conferences this past week, and there was a paper presented at ATC titled "Understanding Manycore Scalability of File Systems" (see p. 71 in the PDF) by Changwoo Min and others at the Georgia Institute of Technology. This team of researchers set out to understand the bottlenecks in typical file systems as they scaled from 1 to 80 (or more) CPU cores on the same server.

FxMark, a new scalability benchmark

They created a new benchmark to probe CPU core scalability, which they called FxMark (source code available at FxMark), consisting of 19 "micro benchmarks" stressing specific scalability scenarios and three application level benchmarks representing popular file system activities.

The application benchmarks in FxMark included: a standard mail server (Exim), a NoSQL DB (RocksDB) and a standard user file server (DBENCH).

In the micro benchmarks, they stressed 7 different components of file systems: 1) path name resolution; 2) page cache for buffered IO; 3) inode management; 4) disk block management; 5) file offset to disk block mapping; 6) directory management; and 7) consistency guarantee mechanisms.

Pure Storage FlashBlade well positioned for next generation storage

Sometimes, long after I listen to a vendor's discussion, I come away wondering why they do what they do. Oftentimes, it passes but after a recent session with Pure Storage at SFD10, it lingered.

Why engineer storage hardware?

In the last week or so, executives at Hitachi mentioned that they plan to reduce hardware R&D activities for their high end storage. There was much confusion over what it all meant, but from what I hear they are ahead now, and maybe it makes more sense to do less hardware and more software for their next generation high end storage. We have talked about hardware vs. software innovation a lot (see recent post: TPU and hardware vs. software innovation [round 3]).

Has triple parity Raid time come?

Data center with hard drives

Back at SFD10 a couple of weeks ago, while visiting with Nimble Storage, they mentioned that their latest all flash storage array was going to support triple-parity RAID.

And last week at a NetApp-SolidFire analyst event, someone mentioned that the new ONTAP 9 supports triple parity RAID-TEC™ for larger SSDs. Also heard at the meeting was that a 15.3TB SSD would take on the order of 12 hours to rebuild.

Need for better protection

When Nimble discussed the need for triple parity RAID they mentioned the report from Google I talked about recently (see my Surprises from 4 years of SSD experience at Google post). In that post, the main surprise was the amount of read errors they had seen from the SSDs they deployed throughout their data center.

I think the need for triple-parity RAID with larger (15TB+) SSDs will become more common over time. There's no reason to think that the SSD vendors will stop at 15TB. And if it takes 12 hours to rebuild a 15TB drive, I think it's probably something like ~30 hours to rebuild a 30TB one, which is just a generation or two away.
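A quick back-of-envelope on that rebuild scaling, assuming the rebuild rate stays fixed as drives grow (a simplification; bigger drives may well rebuild slower):

```python
# Back-of-envelope rebuild math: a 15.3TB SSD rebuilding in 12 hours
# implies roughly 354 MB/s. Assuming the rate stays fixed, rebuild
# time scales linearly with capacity. Crude estimate; real rebuilds
# contend with host IO and other overheads.

REBUILD_RATE_MB_S = 15.3e12 / (12 * 3600) / 1e6   # ~354 MB/s implied

def rebuild_hours(capacity_tb, rate_mb_s=REBUILD_RATE_MB_S):
    return capacity_tb * 1e12 / (rate_mb_s * 1e6) / 3600

# rebuild_hours(15.3) -> 12.0; rebuild_hours(30) -> ~23.5 hours.
```

My ~30 hour ballpark sits a bit above this linear estimate, which seems reasonable once rebuild overheads and competing host IO are factored in.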

A read error on one SSD in a RAID group during an SSD rebuild can be masked by having dual parity. A read error on two SSDs can only be masked by having triple parity RAID.

Surprises in flash storage IO distributions from 1 month of Nimble Storage customer base

We were at Nimble Storage (videos of their sessions) for Storage Field Day 10 (SFD10) last week and they presented some interesting IO statistics from data analysis across their 7500 customer install base using InfoSight.

As I understand it, the data are from all customers that have maintenance and are currently connected to InfoSight, their SaaS service solution for Nimble Storage. The data represents all IO over the course of a single month across the customer base. Nimble wrote a white paper summarizing their high level analysis, called Busting the myth of storage block size.