Hedvig storage system, Docker support & data protection that spans data centers

We talked with Hedvig (@HedvigInc) at Storage Field Day 10 (SFD10) a month or so ago and had a detailed deep dive into their technology. (Check out the videos of their sessions here.)

Hedvig implements a software defined storage solution that runs on x86 or ARM processors and depends on two components: a storage proxy operating in a hypervisor host (as a VM) and storage service nodes. The proxy and the storage services can execute as separate VMs on the same host in a hyper-converged fashion, or on different nodes as a separate storage cluster with hosts doing IO to the storage cluster.

Hedvig’s management team comes from hyper-scale environments (Amazon Dynamo/Facebook Cassandra) so they have lots of experience implementing distributed software defined storage at (hyper-)scale.

Hedvig functionality

Hedvig supports data deduplication, data compression, snapshots, clones and data replication/protection.

Deduplication, compression and replication can be specified separately or together on a virtual disk (vdisk) basis. Dedupe, compression or replication cannot be changed on a vdisk once created, but you can snapshot and clone the vdisk and, at clone creation, modify the dedupe, compression and replication settings.

Deduplication is on a fixed block basis and is global across all vdisks in the system that have dedupe enabled.
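
To make the idea of global, fixed block dedupe concrete, here is a minimal sketch. It is not Hedvig's implementation: the 4KB block size, the SHA-256 fingerprint and the `DedupeStore` class are all assumptions chosen for illustration. It only shows that identical blocks written to different vdisks end up stored once.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size in bytes (illustration only)


class DedupeStore:
    """Toy global dedupe store shared by all dedupe-enabled vdisks (hypothetical)."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes, each unique block stored once
        self.vdisks = {}  # vdisk name -> ordered list of block fingerprints

    def write(self, vdisk, data):
        """Split data into fixed-size blocks and store only blocks not seen before."""
        refs = self.vdisks.setdefault(vdisk, [])
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fingerprint, block)  # no-op if already present
            refs.append(fingerprint)

    def read(self, vdisk):
        """Reassemble a vdisk's contents from its fingerprint list."""
        return b"".join(self.blocks[f] for f in self.vdisks[vdisk])


store = DedupeStore()
store.write("vdisk-a", b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE)
store.write("vdisk-b", b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE)
print(len(store.blocks))  # 3 unique blocks stored for 4 blocks written
```

The "A" block is shared between the two vdisks, which is what "global" buys you over per-vdisk dedupe.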

Hedvig storage proxies talk iSCSI or NFS protocols to VMs, and the proxy service uses RESTful APIs to talk to the storage nodes. Vdisks are essentially iSCSI LUNs or NFS file shares.

Hedvig storage services can also support native object storage access by using the RESTful APIs directly, without the need for a storage proxy.

Hedvig’s data protection can span data centers

Hedvig data protection uses replication. This is specified per vdisk by a replication factor (RF) that can be anywhere from 1 (no data replication) to 6 (6 copies of all data blocks written to 6 separate storage services). Vdisk replication can also be:

  • Agnostic replication, which means that data will be replicated across storage service nodes to the extent requested with no knowledge of which data center or rack the storage nodes are located.
  • Rack aware replication, which means that Hedvig will ensure that data replicas are distributed to storage service nodes that are located in different racks.
  • Data center aware replication, which means that Hedvig will ensure that data replicas are distributed to storage service nodes that are in different data centers.

For data center aware replication, customers can specify which data centers host replicas. So if you have two nearby data centers and two others that are farther away, for RF=3 admins can specify that the two nearby data centers and one of the farther ones be used for replication.
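
The three replication policies can be sketched as a choice of "failure domain" that no two replicas may share. This is my own illustration, not Hedvig's placement logic: the `Node` type, the `place_replicas` function and the policy names are hypothetical, and the sketch does not model the admin's ability to name specific data centers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    datacenter: str


def place_replicas(nodes, rf, policy="agnostic"):
    """Pick rf nodes so that no two share a failure domain under the given policy."""
    if not 1 <= rf <= 6:
        raise ValueError("RF must be between 1 and 6")
    if policy == "agnostic":
        domain = lambda n: n.name                  # any distinct nodes will do
    elif policy == "rack":
        domain = lambda n: (n.datacenter, n.rack)  # one replica per rack
    elif policy == "datacenter":
        domain = lambda n: n.datacenter            # one replica per data center
    else:
        raise ValueError(f"unknown policy: {policy}")

    chosen, used = [], set()
    for node in nodes:
        if domain(node) not in used:
            chosen.append(node)
            used.add(domain(node))
        if len(chosen) == rf:
            return chosen
    raise ValueError("not enough distinct failure domains for this RF")


nodes = [
    Node("n1", "r1", "dc-east"),
    Node("n2", "r1", "dc-east"),
    Node("n3", "r2", "dc-east"),
    Node("n4", "r1", "dc-west"),
    Node("n5", "r1", "dc-far"),
]
for policy in ("agnostic", "rack", "datacenter"):
    print(policy, [n.name for n in place_replicas(nodes, 3, policy)])
```

With these five made-up nodes, the agnostic policy is free to put all three replicas in one data center, while the data center aware policy is forced to use all three.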

Hedvig supports a quorum write which ensures that data is written to RF/2+1 nodes (rounding RF/2 down) in a synchronous fashion; writes won’t be acknowledged to the host until a majority of the replicas have been written successfully. The remaining replicas are written asynchronously and tracked to ensure they complete successfully. This way, if you have data center aware replication with RF=3, a host write will be acknowledged when the first two writes succeed, while the third write can complete sometime later, asynchronously.
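
The quorum arithmetic is easy to tabulate. The snippet below is just the RF/2+1 formula from the session, with RF/2 rounded down; the function name `quorum_size` is mine.

```python
def quorum_size(rf):
    """Number of replicas that must be written synchronously: floor(RF/2) + 1."""
    if not 1 <= rf <= 6:
        raise ValueError("RF must be between 1 and 6")
    return rf // 2 + 1


for rf in range(1, 7):
    sync = quorum_size(rf)
    print(f"RF={rf}: {sync} synchronous, {rf - sync} asynchronous")
```

For RF=3 this gives 2 synchronous writes and 1 asynchronous write, matching the example above. Note that for RF=2 both writes are synchronous.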

Traditionally, cross data center data protection (mirroring) has been distinct, and separately specified, from within data center protection (RAID levels, replication counts). Hedvig doesn’t share this view.

(Storage) containers, pools and services

Data in a vdisk is chunked into 16GB Containers, and Hedvig supports two block sizes for a vdisk: 512 bytes or 4KB. Containers are written to and accessed from Storage Pools (sets of 3 disks) on a storage server.
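
As a quick worked example of the chunking, here is how a byte offset in a vdisk would map to a container and a block within it. This is a sketch under my own assumptions: that 16GB means 16 x 2^30 bytes and that containers are laid out contiguously by offset. The `locate` function is hypothetical.

```python
CONTAINER_SIZE = 16 * 1024**3  # 16GB per container (assuming binary gigabytes)
BLOCK_SIZES = (512, 4096)      # the two vdisk block sizes mentioned in the session


def locate(byte_offset, block_size=4096):
    """Map a vdisk byte offset to (container index, block index within container)."""
    if block_size not in BLOCK_SIZES:
        raise ValueError("block size must be 512 or 4096 bytes")
    if byte_offset < 0:
        raise ValueError("offset must be non-negative")
    container = byte_offset // CONTAINER_SIZE
    block = (byte_offset % CONTAINER_SIZE) // block_size
    return container, block


print(locate(0))                          # (0, 0)
print(locate(CONTAINER_SIZE + 8192))      # (1, 2)
print(CONTAINER_SIZE // 4096)             # 4194304 blocks per container at 4KB
print(CONTAINER_SIZE // 512)              # 33554432 blocks per container at 512 bytes
```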

Each storage service node runs two processes. When a storage proxy writes to a block in a container for the first time, it talks first with a metadata (Pages) process running on the storage nodes, to allocate a new container and decide which nodes to replicate to. The storage proxy then writes the data to a data (Hblock) process on one of these replication nodes, which then writes it to the other replica storage services.

Their Pages metadata process runs globally across all the storage service nodes and maintains the global state of the Hedvig cluster (across data centers). The Hblock process runs locally on each node but writes replicas to other nodes as needed for data protection.

We didn’t get a lot of info on how the Pages process works across nodes and especially across data centers, supplying the global state for all Hedvig cluster nodes. I guess this is some of their secret sauce used to implement Hedvig.

Hedvig & Docker (Containers)

Hedvig talked about their Docker plugin, which supports Docker Engines and Containers and implements the Docker Volume API. The Docker UCP (universal control plane) or CLI sends a volume request to Docker Engine, which talks to the Hedvig Docker plugin, which then talks to the storage proxy to define a volume (through the Pages metadata process running in the storage cluster).

When a Docker Container starts, the volume is mounted as a file share and file IO can take place to the volume through Hedvig’s storage proxy. If the container moves, IO can continue by re-mounting the share for the container on another Docker Engine with IO going through another Hedvig storage proxy. The storage proxy runs in a container on Docker Engine hosts.

Caching in Hedvig

Hedvig supports three different caches at the Proxy service:

  • MetaCache, in DRAM on the host/server or on SSDs if available, holds container addresses for blocks written through this host’s proxy service. Each Proxy has a MetaCache; I believe it’s 32GB in DRAM. MetaCaches do not persist and are recreated from the Pages processes after a crash.
  • BlockCache is a read only block cache residing in SSD only and can be enabled for each disk. This cache is optional and only uses SSDs available in the host running the Proxy service.
  • DedupeCache in DRAM or SSD, enabled only if vdisks are deduplicated and holds additional dedupe metadata like block hashes and dedupe log addresses.

MetaCache helps to alleviate Pages traffic for data and volumes that have been written to before. BlockCaches can supply high performance for blocks that have been seen before. The DedupeCache helps speed up the global deduplication process.

Hedvig in operations

Hedvig clusters can non-disruptively grow or shrink and the Pages service will automatically re-distribute replicas to remaining nodes (if shrinking) or use new nodes to hold replicas of new containers (if growing).

I assume you inform Hedvig what datacenter and racks the storage service nodes are located in, but it’s possible it could discover this automatically.

Each proxy service monitors IO activity to the storage nodes and will automatically adjust its IO activity to favor nodes that are performing better.
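
Hedvig didn't say how the proxy decides which nodes are "performing better", so the following is purely my guess at one simple way to do it: keep an exponentially weighted moving average of observed latency per node and send IO to the replica with the lowest average. The `LatencyTracker` class, the smoothing factor and the node names are all hypothetical.

```python
class LatencyTracker:
    """Tracks a moving average of IO latency per storage node (hypothetical)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha  # weight given to the newest sample (assumed value)
        self.avg_ms = {}    # node name -> smoothed latency in milliseconds

    def record(self, node, latency_ms):
        """Fold a new latency sample into the node's moving average."""
        previous = self.avg_ms.get(node)
        if previous is None:
            self.avg_ms[node] = latency_ms
        else:
            self.avg_ms[node] = self.alpha * latency_ms + (1 - self.alpha) * previous

    def preferred(self, replicas):
        """Return the replica with the lowest average latency.

        Nodes with no samples yet count as 0.0 so they get tried first.
        """
        return min(replicas, key=lambda n: self.avg_ms.get(n, 0.0))


tracker = LatencyTracker()
tracker.record("node-a", 2.0)
tracker.record("node-b", 10.0)
tracker.record("node-a", 4.0)
print(tracker.preferred(["node-a", "node-b"]))  # node-a

tracker.record("node-a", 100.0)  # node-a has a slow IO
print(tracker.preferred(["node-a", "node-b"]))  # node-b
```

A moving average lets one slow IO shift traffic away from a node without forgetting its history entirely, so the node can win traffic back once it recovers.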

The Pages process also monitors node IO activity and can migrate data replicas around the cluster to balance out workloads.

As Hedvig is software defined storage and is data center aware, you could potentially have a multi-data center system that spans AWS, Azure and Google Cloud Platform, replicating data between them… How’s that for cloud storage!

With appropriate BlockCache and DedupeCache, Hedvig can generate 90K IOPS per proxy server, which would scale as you increase hosts. But this all assumes that data working sets reside in their respective caches on the proxy server.

~~~~

The technical deep dive is in the 2nd video (see link above) and is pretty informative, if you’re looking for more information. They showed some demos of their storage in the following videos as well.

I was impressed with the functionality and the unique view of cross data center data protection services. I probed pretty hard on their caching logic but as far as I can tell it all makes sense, assuming a single proxy is writing to a vdisk. Shared writers won’t work in Hedvig’s scheme.

Photo Credit(s): Screenshots from their video deep dive session at SFD10 & Docker explanation slide provided by Hedvig