Coho Data, hyperloglog and the quest for IO performance

We were at SFD6 last month, and Coho Data’s CTO & Co-Founder, Andy Warfield, got up to tell us what’s happening at Coho. (We also met with Andy at SFD4; check out the video links to learn more.)

What’s new at Coho Data

Coho Data has been shipping GA product for about 3 quarters. It is a simple-to-use, scale-out, hybrid (SSD & disk) storage system for VMware NFS datastores. Coho Data storage uses Software Defined Networking (SDN) switches to perform faster networking handoffs and optimize data flow across storage nodes. They use standard servers and an SDN switch, and can scale from two nodes (micro-arrays) to lots (100 or more?).

Version 2.0 will add remote asynchronous replication and API enhancements. We won’t discuss the update any further, but if you want your storage to tweet its messages/alerts, check it out. Thank Chris Wahl when you start seeing storage system tweets pollute your Twitter feed.

The highlight of the session was Andy’s discussion of HyperLogLog, a new approach to understanding customer workloads.

HyperLogLog

Coho Data was designed from the start using Microsoft IO traces (1 week of MSR Cambridge datacenter block IO traces, available at the SNIA IO Trace repository). [details corrected later, ed.] But Coho also recorded developer desktop IO activity for a year, amounting to 7.6B IOs and multi-TBs of data. I just got a call looking for some file activity tracing, so everybody in storage could use more IO traces. But detailed IO traces take up CPU cycles and lots of space. HyperLogLogs can solve a portion of this.

Before we go there, a little background. With a Bloom filter, you can tell whether a block has been referenced or not. In a Bloom filter you hash a key, term or whatever multiple times and then OR the results into separate bitfields, one per hash. Bloom filters have a small possibility of a false positive (block-id present in filter but not really in the IO stream) but no possibility of a false negative (block-id NOT present in filter but really in the IO stream). However, Bloom filters tell us nothing about how frequently blocks were read.
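To make the Bloom filter idea concrete, here is a minimal Python sketch of the scheme as described above: each key is hashed several times and each hash sets one bit in its own bitfield. The class name, the bitfield size, the number of hashes and the use of salted SHA-256 are all my own assumptions for illustration, not anything Coho Data showed.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter with one bitfield per hash function (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits        # assumed size of each bitfield
        self.num_hashes = num_hashes    # assumed number of hash functions
        # Each bitfield is stored as a Python int used as a bit array.
        self.fields = [0] * num_hashes

    def _positions(self, key):
        # Derive one bit position per hash by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        # OR a single bit into each bitfield.
        for i, pos in self._positions(key):
            self.fields[i] |= 1 << pos

    def __contains__(self, key):
        # "Present" only if every bitfield has the matching bit set.
        return all((self.fields[i] >> pos) & 1 for i, pos in self._positions(key))


seen = BloomFilter()
for block_id in range(100):      # pretend these block ids appeared in the IO stream
    seen.add(block_id)

print(42 in seen)       # a block that was referenced always tests as present
print(100000 in seen)   # an unreferenced block is almost always absent
```

A referenced block can never be reported as absent, because its bits were all set when it was added. An unreferenced block can occasionally look present if other blocks happened to set all of its bits, which is the false positive case.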

With a HyperLogLog, one can approximate (within ~2%) how many distinct blocks were referenced. By capturing multiple HyperLogLog snapshots over time and comparing how the distinct-block counts grow, one can estimate how often blocks are re-referenced during application processing. Each HyperLogLog only occupies ~2KB, so recording one per hour takes ~50KB/day. The math is beyond me but there’s plenty of info online.
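For the curious, here is a rough Python sketch of a HyperLogLog counter. It estimates how many distinct block ids it has seen, no matter how often each one repeats. This follows the textbook algorithm (leading-zero ranks, harmonic mean, linear counting for small counts), not Coho Data’s code. The class name, the SHA-256 hash and the choice of 2048 one-byte registers are my assumptions. That register count does line up with the numbers above: 2048 bytes is ~2KB, and the standard error of 1.04/sqrt(2048) is about 2.3%.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog cardinality estimator (illustrative sketch)."""

    def __init__(self, p=11):
        self.p = p                          # assumed: 11 index bits
        self.m = 1 << p                     # 2048 registers, one byte each
        self.registers = bytearray(self.m)
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take 64 hash bits: the top p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Harmonic-mean estimate over all registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for block_id in range(50000):
    for _ in range(3):              # repeated references do not change the estimate
        hll.add(block_id)

print(round(hll.count()))           # an estimate of the 50,000 distinct blocks
```

Re-adding a block never changes a register, which is why repeats do not inflate the count. Taking one of these per hour and watching how the estimates grow is the kind of over-time comparison described above.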

HyperLogLog functionality will be included in a future Coho Data update. Coho Data will be implementing what they call “Counter Stacks”, which make use of HyperLogLogs (see Jake Wires’ Usenix session video/PDF). Once present, Coho Data will save HyperLogLog counter stack data, analyze it, and use it to better characterize customer IO, with the goal of better optimizing their storage system to actual workloads.

For more info please see other SFD6 blogger posts on Coho Data:

~~~~

Now, if someone could just develop a super efficient algorithm/storage structure to record block sequences, I think we’d have this licked.

Disclosure statement: I have done work for Coho Data over the last year.

Picture credits: (Lego) Me holding a Coho (Data) Salmon 🙂