Coho Data, hyperloglog and the quest for IO performance

We were at SFD6 last month, and Coho Data's CTO & Co-Founder, Andy Warfield, got up to tell us what's happening at Coho. (We also met with Andy at SFD4; check out the video links to learn more.)

What’s new at Coho Data

Coho Data has been shipping GA product for about three quarters. It is a simple-to-use, scale-out, hybrid (SSD & disk) storage system for VMware NFS datastores. Coho Data storage uses Software Defined Networking (SDN) switches to perform faster networking handoffs and to optimize data flow across storage nodes. They use standard servers and an SDN switch, and can scale from two nodes (micro-arrays) to lots (100 or more?).

Version 2.0 will add remote asynchronous replication and API enhancements. We won't discuss the update further, but if you want your storage to tweet its messages/alerts, check it out. Thank Chris Wahl when storage system tweets start polluting your Twitter feed.

The highlight of the session was Andy's discussion of HyperLogLog, a new approach to understanding customer workloads.

HyperLogLog

Coho Data was designed from the start using Microsoft IO traces (one week of MSR Cambridge datacenter block IO traces, available at the SNIA IO Trace repository). But Coho also recorded Linux developer desktop IO activity for a year, amounting to ~7.6B IOs and multiple TBs of data. I just got a call looking for some file activity tracing, so everybody in storage could use more IO traces. But detailed IO traces take up CPU cycles and lots of space. HyperLogLogs can solve a portion of this.

Before we go there, a little background. With a Bloom filter, you can tell whether a block has been referenced or not. You hash a key (a block-id, term or whatever) with multiple hash functions and OR the results into the filter's bit array. Bloom filters have a small possibility of a false positive (block-id present in the filter but not really in the IO stream) but no possibility of a false negative (block-id not present in the filter but really in the IO stream). However, a Bloom filter tells us nothing about how frequently blocks were read.
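To make the Bloom filter idea concrete, here's a minimal Python sketch. This is just my illustration of the technique, not anything from Coho Data: it uses a single bit array and derives multiple bit positions by salting one hash function with an index, and the class name, sizes and hash choice are all assumptions for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch -- illustrative only, not Coho Data's code."""
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, block_id):
        for pos in self._positions(block_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, block_id):
        # Can return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(block_id))

bf = BloomFilter()
bf.add("lba:0x1a2b3c")
print("lba:0x1a2b3c" in bf)    # True
print("lba:0xdeadbeef" in bf)  # almost certainly False
```

The asymmetry shows up in `__contains__`: bits set by other keys can collide (false positive), but a block that was added can never map to an unset bit (no false negative).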

With a HyperLogLog, one can approximate (to within ~2%) how many unique blocks were referenced. By capturing multiple HyperLogLog snapshots over time, one can estimate block access frequency during application processing. Each HyperLogLog sketch only occupies ~2KB, so recording one per hour takes ~50KB/day. The math is beyond me but there's plenty of info online (e.g. here).
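For the curious, here's a toy HyperLogLog in Python, again just my illustration and not Coho Data's code. With p=11 there are 2,048 one-byte registers (~2KB, matching the footprint above), and the raw estimate is typically within a couple of percent; real implementations add bias and small/large-range corrections that I've omitted.

```python
import hashlib

class HyperLogLog:
    """Tiny HyperLogLog sketch (p=11 -> 2048 registers, ~2KB) -- illustrative only."""
    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, block_id):
        x = int(hashlib.sha1(str(block_id).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        j = x >> (64 - self.p)                      # first p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)          # remaining bits
        rank = (64 - self.p) - w.bit_length() + 1   # leading zeros + 1
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        # Raw HLL estimate; production code adds range corrections.
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z

hll = HyperLogLog()
for lba in range(100000):
    hll.add(lba)
print(round(hll.count()))  # roughly 100000, typically within a few percent
```

Note that a single sketch estimates how many distinct blocks were touched; it's comparing sketches taken at different times that lets you reason about how often blocks get re-referenced.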

HyperLogLog functionality will be included in a future Coho Data update. Specifically, Coho Data will be implementing what they call "Counter Stacks", which make use of HyperLogLogs (see Jake Wires' USENIX session video/PDF). Once present, Coho Data will save Counter Stack data, analyze it, and use it to better characterize customer IO, with the goal of better optimizing their storage system to actual workloads.
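Here's a much-simplified sketch of the counter-stack idea as I understand it, not Coho Data's implementation: start a new HyperLogLog every interval, feed every block reference to all live sketches, and compare their counts to see how much of recent IO is re-referencing older blocks. It reuses the HyperLogLog class from the previous sketch; the interval size and method names are made up for illustration.

```python
class CounterStack:
    """Conceptual counter-stack sketch -- my own simplification, not Coho's code."""
    def __init__(self, interval=100000):
        self.interval = interval
        self.sketches = []      # one HyperLogLog per started interval
        self.seen = 0

    def record(self, block_id):
        # Start a fresh sketch at each interval boundary.
        if self.seen % self.interval == 0:
            self.sketches.append(HyperLogLog())   # HyperLogLog class from the sketch above
        self.seen += 1
        # Every reference is fed to all live sketches.
        for hll in self.sketches:
            hll.add(block_id)

    def unique_since_each_interval(self):
        # The j-th entry estimates unique blocks touched since interval j started;
        # differences between adjacent entries hint at how much of the workload is reuse.
        return [round(h.count()) for h in self.sketches]
```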

For more info, please see other SFD6 blogger posts on Coho Data.

~~~~

Now if someone could just develop a super-efficient algorithm/storage structure to record block sequences, I think we'd have this licked.

Disclosure statement: I have done work for Coho Data over the last year.

Picture credits: (Lego) Me holding a Coho (Data) Salmon 🙂

Proximal Data, server SSD caching software

I attended Storage Field Day 4 (SFD4) about a month ago now and had a chance to visit with Rory Bolt, CEO/Founder of Proximal Data, a new server-side caching software solution. Last month the GreyBeards (Howard Marks and I) talked with Satyam Vaghani, Co-founder and CTO of PernixData, another server-side caching solution. You can find that podcast here. But this post is about Proximal Data. These guys could use some better marketing, but when you spend 90% of your funding on engineers this is what you get.

Proximal Data doesn't believe in agent software, because it takes a long time to deploy and could potentially disrupt IT operations when being installed. In contrast, Proximal Data installs their AutoCache software into the hypervisor as a VIB (vSphere Installation Bundle). There was some discussion at SFD4 on whether installing the VIB would be disruptive or not to customer operations. Not being a VMware expert I won't comment on the results of that discussion, but if you want to find out more I suggest viewing the SFD4 video of Proximal Data's presentation.

Of course, being at the hypervisor layer gives them IO activity information at the VM level, and they can use this to control their caching software at VM granularity. In addition, by executing at the hypervisor layer, AutoCache doesn't require any guest-OS-specific functionality or hooks. Another nice thing about executing at the hypervisor level is that they can cache RDM devices.

To use AutoCache you will need one or more PCIe or DAS SSDs in your ESXi server. Once the SSD is in place and the AutoCache software is installed and activated, you will need to partition or dedicate the device to Proximal Data's AutoCache.

AutoCache is managed as a virtual appliance with a web server GUI. With the networking set up and the AutoCache VIB installed, you can access their operator panels via a tab in vCenter. Once the software is installed you don't have to use their GUI ever again.

AutoCache read caching algorithms

Not every read IO for a VM being cached is brought into AutoCache's SSD cache. They are trying to ensure that cached data will be referenced again. As such, they typically wait for two reads before the data is placed into cache.
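Here's a sketch of what a "wait for the second read" admission policy could look like, under my own assumptions (Proximal Data didn't describe their data structures): remember first-time readers in a side set and only admit a block to the SSD cache when it is read again.

```python
from collections import OrderedDict

class TwoTouchReadCache:
    """Sketch of a second-read admission policy -- not Proximal Data's code."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.cache = OrderedDict()   # block_id -> data, maintained in LRU order
        self.seen_once = set()       # blocks read once but not yet admitted

    def read(self, block_id, fetch_from_backend):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)          # cache hit: refresh LRU position
            return self.cache[block_id]
        data = fetch_from_backend(block_id)           # miss: go to shared storage
        if block_id in self.seen_once:                # second read -> admit to SSD cache
            self.seen_once.discard(block_id)
            self.cache[block_id] = data
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)        # evict LRU block
        else:
            self.seen_once.add(block_id)              # first read: remember, don't cache
        return data
```

In practice the "seen once" set would itself be bounded (or be something like a Bloom filter), but that detail is omitted here.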

They support two different read caching algorithms, referred to during the presentation as Algorithm A and Algorithm B. (They really need some marketing – Turbo Boost and Extreme Boost sound better to me.) I'm not sure they ever described the differences between the two, but the fact that they have multiple caching algorithms speaks to some sophistication. They also maintain a "ghost data list". Ghost data is data whose metadata is still in cache, but whose actual data is no longer in cache.

When a miss occurs, they determine whether the data would have been a hit in the ghost data, or under Algorithm A or Algorithm B had either been active on the VM. If it would have been a hit in the ghost data, then in general you probably need more SSD caching space on this ESXi server for the VMs being cached. If it would have been a hit under Algorithm A or B, you should probably be using that algorithm for this VM's IO.
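Conceptually, the miss accounting might look something like the sketch below. This is my own guess at the bookkeeping; since Proximal Data never described what Algorithms A and B actually do, the shadow caches here are just placeholder objects that support membership tests.

```python
class MissClassifier:
    """Sketch of ghost-list / shadow-cache miss accounting -- details are assumptions."""
    def __init__(self, ghost_metadata, shadow_alg_a, shadow_alg_b):
        self.ghost = ghost_metadata      # block_ids whose data was evicted but metadata kept
        self.shadow_a = shadow_alg_a     # simulated cache contents under Algorithm A
        self.shadow_b = shadow_alg_b     # simulated cache contents under Algorithm B
        self.stats = {"ghost_hits": 0, "alg_a_hits": 0, "alg_b_hits": 0}

    def on_miss(self, block_id):
        if block_id in self.ghost:
            self.stats["ghost_hits"] += 1     # would have hit with more SSD space
        if block_id in self.shadow_a:
            self.stats["alg_a_hits"] += 1     # would have hit under Algorithm A
        if block_id in self.shadow_b:
            self.stats["alg_b_hits"] += 1     # would have hit under Algorithm B

    def recommendation(self):
        if self.stats["ghost_hits"] > max(self.stats["alg_a_hits"], self.stats["alg_b_hits"]):
            return "add SSD cache capacity for this VM"
        if self.stats["alg_a_hits"] >= self.stats["alg_b_hits"]:
            return "switch to Algorithm A"
        return "switch to Algorithm B"
```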

Another approach AutoCache supports is called "Glimmer IO". I liken this to sequential read-ahead: AutoCache keeps track, on a VM basis, of all the IO being performed and tries to determine whether it is sequential or random. If the VM is doing sequential IO, AutoCache can start reading ahead of where the VM is currently reading. By doing so, they can stage the data in cache before the VM needs/reads it. According to Rory, there are policies which can be set on a VM basis to limit how much read-ahead is performed. I assume there are policies associated with the use of Algorithm A and B on a VM basis as well, but they didn't go into this.
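Here's a rough sketch of per-VM sequential detection with bounded read-ahead, illustrating the Glimmer IO idea; the threshold and read-ahead limit are invented knobs standing in for whatever per-VM policies Proximal Data actually exposes.

```python
class SequentialDetector:
    """Sketch of per-VM sequential-stream detection with bounded read-ahead."""
    def __init__(self, threshold=4, max_readahead_blocks=64):
        self.last_lba = None
        self.run_length = 0
        self.threshold = threshold                  # consecutive blocks before prefetching
        self.max_readahead = max_readahead_blocks   # per-VM policy limit (assumed knob)

    def on_read(self, lba):
        # Count how long the current run of consecutive block addresses is.
        if self.last_lba is not None and lba == self.last_lba + 1:
            self.run_length += 1
        else:
            self.run_length = 0
        self.last_lba = lba
        if self.run_length >= self.threshold:
            # Stream looks sequential: return the next blocks to prefetch into SSD cache.
            return list(range(lba + 1, lba + 1 + self.max_readahead))
        return []   # random-looking IO: no read-ahead
```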

AutoCache cache warmup for vMotion

The other nice thing AutoCache does is provide a cache warm-up on the target ESXi server when moving VMs via vMotion. This is done by registering for the vMotion API and trapping vMotion requests. Once they detect that a VM is being moved, they send the VM's AutoCache metadata over to the target host, at which time the target system's AutoCache can start to fill its cache from the shared storage. Not a bad approach from my perspective. The amount of data that needs to be moved is minimal, and you get the AutoCache code running on the target machine to start preloading blocks that were in cache on the source host. They also mentioned that once they have copied the metadata over to the target host, they can free up (invalidate) all the space in the source host's cache that was being held by the VM being moved.
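Reduced to its essence, the warm-up flow might look like the sketch below (my simplification, not the actual vMotion API integration): only block IDs cross the wire, and the target host refills its cache from the shared datastore.

```python
def warm_target_cache(vm_cache_metadata, target_cache, read_from_shared_storage):
    """Sketch of the vMotion warm-up idea -- illustrative only, not Proximal's code.

    vm_cache_metadata: block IDs the source host had cached for the migrating VM.
    target_cache: dict-like SSD cache on the target host.
    read_from_shared_storage: callable that fetches a block from the datastore.
    """
    # The bulk data never crosses hosts; the target re-reads it from shared storage.
    for block_id in vm_cache_metadata:
        target_cache[block_id] = read_from_shared_storage(block_id)
    # Once the metadata is handed off, the source host can invalidate those
    # blocks and reuse the SSD space they occupied.
```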

Proximal Data for Hyper-V

At SFD4, Rory mentioned that a Hyper-V version of AutoCache was coming out shortly. And although they specifically indicated that write-back caching was not a great idea (in contrast to Satyam and PernixData), there is the potential for them to look at implementing this as well over time.

The product is sold through resellers, distributors and OEMs. They claim support for any flash device, although they have an approved HCL.

Current pricing is $1000 for the AutoCache software to support an SSD cache of 500GB or less. From what we see in enterprise storage systems, having a cache of 2-5% of your total backend storage is about right. (But see my VM working set inflection points and SSD caching post for another side of this.) So a 500GB SSD cache should be able to support 10-25TB of backend data if all goes well.

~~~~

After the podcast on PernixData's clustered, write-back caching software, Proximal Data didn't seem as complex or useful. But there is a place for read-only caching. The fact that they can help warm the target host's cache for a vMotion is a great feature if you plan on doing a lot of VM movement in your shop. The fact that they have distinct support for multiple cache algorithms, understand sequential detect, and have some way of telling you that you could use more SSD caching is also good in my mind.

Comments?

Photo: 20-nanometer NAND flash chip, IntelFreePress’ photostream