Coho Data, hyperloglog and the quest for IO performance

We were at SFD6 last month, and Coho Data’s CTO & Co-Founder, Andy Warfield, got up to tell us what’s happening at Coho. (We also met with Andy at SFD4; check out the video links to learn more.)

What’s new at Coho Data

Coho Data has been shipping GA product for about three quarters. The product is a simple-to-use, scale-out, hybrid (SSD & disk) storage system for VMware NFS datastores. Coho Data storage uses Software Defined Networking (SDN) switches to perform faster networking handoffs and optimize data flow across storage nodes. They use standard servers and an SDN switch, and can scale from two nodes (micro-arrays) to lots (100 or more?).

Version 2.0 will add remote asynchronous replication and API enhancements. We won’t discuss the update any further, but if you want your storage to tweet its messages/alerts, check it out. Thank Chris Wahl when you start seeing storage system tweets pollute your Twitter feed.

The highlight of the session was Andy’s discussion of HyperLogLog, a new approach to understanding customer workloads.

HyperLogLog

Coho Data was designed from the start using Microsoft IO traces (one week of MSR Cambridge datacenter block IO traces, available at the SNIA IO Trace repository). But Coho also recorded developer desktop IO activity for a year, amounting to ~7.6B IOs and multi-TBs of data. [Corrected later, ed.] I just got a call looking for some file activity tracing, so everybody in storage could use more IO traces. But detailed IO traces take up CPU cycles and lots of space. HyperLogLogs can solve a portion of this.

Before we go there, a little background. With a Bloom filter you can tell whether a block has been referenced or not. In a Bloom filter you hash a key, term or whatever multiple times and then OR the results into separate bitfields, one per hash. Bloom filters have a small possibility of a false positive (block-id present in filter but not really in the IO stream) but no possibility of a false negative (block-id NOT present in filter when it really was in the IO stream). However, Bloom filters tell us nothing about how frequently blocks were read.
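To make the Bloom filter description concrete, here is a minimal Python sketch of the "one bitfield per hash" layout described above. The class name, the field size and the number of hashes are my own illustrative choices, and salting SHA-256 with the hash index is just one convenient way to get several hash functions. None of this is taken from Coho Data.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter with one bitfield per hash function."""

    def __init__(self, bits_per_field=1024, num_hashes=4):
        self.bits_per_field = bits_per_field
        self.num_hashes = num_hashes
        # Each bitfield is a Python int used as a set of bits.
        self.fields = [0] * num_hashes

    def _positions(self, key):
        # One bit position per hash; the hash index acts as a salt.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.bits_per_field

    def add(self, key):
        # OR the key's bit into every bitfield.
        for i, pos in self._positions(key):
            self.fields[i] |= 1 << pos

    def __contains__(self, key):
        # Present only if the key's bit is set in every bitfield.
        return all((self.fields[i] >> pos) & 1 for i, pos in self._positions(key))


bf = BloomFilter()
for block_id in (10, 42, 99):   # hypothetical block-ids from an IO stream
    bf.add(block_id)

print(42 in bf)   # True: a block that was added is always reported as present
```

A block that was never added will usually report False, but can occasionally report True. That is the false positive case. Either way, the filter says nothing about how many times block 42 was read.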

With a HyperLogLog, one can approximate (within ~2%) how many distinct blocks were referenced. By capturing multiple HyperLogLog snapshots over time, one can see how quickly new blocks show up versus how often previously seen blocks are re-referenced during application processing. Each HyperLogLog only occupies ~2KB, so recording one/hour takes ~50KB/day. The math is beyond me but there’s plenty of info online (e.g. here).
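For readers who want to see the idea rather than the math, below is a rough Python sketch of a textbook HyperLogLog. It estimates the number of distinct block-ids seen, however often each one repeats. The class and the choice of 2,048 registers are my assumptions, picked because that many registers gives a standard error of about 2.3% and fits in roughly 2KB at one byte per register. It is not Coho Data's implementation.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog sketch: estimates the count of distinct keys."""

    def __init__(self, p=11):
        self.p = p                  # number of index bits (assumed value)
        self.m = 1 << p             # 2**11 = 2048 registers
        self.registers = [0] * self.m

    def add(self, key):
        # 64-bit hash of the key.
        h = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for _ in range(3):                  # replay the same 20,000 block-ids three times
    for block_id in range(20000):
        hll.add(block_id)

print(round(hll.count()))           # close to 20000, even though 60000 IOs were seen
```

Taking one such sketch per hour and comparing successive estimates shows how many blocks are new to the workload in each interval, without storing the trace itself.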

HyperLogLog functionality will be included in a future Coho Data release, as part of what they call “Counter Stacks”, which make use of HyperLogLogs (see Jake Wires’ USENIX session video/PDF). Once present, Coho Data will save HyperLogLog counter stack data, analyze it, and use it to better characterize customer IO, with the goal of better optimizing their storage system for actual workloads.


~~~~

Now if someone could just develop a super efficient algorithm/storage structure to record block sequences, I think we’d have this licked.

Disclosure statement: I have done work for Coho Data over the last year.

Picture credits: (Lego) Me holding a Coho (Data) Salmon 🙂

VM working set inflection points & SSD caching – chart-of-the-month

Attended SNW USA a couple of weeks ago and talked with Irfan Ahmad, Founder/CTO of CloudPhysics, a new Management-as-a-Service offering for VMware. He took out a chart that I found very interesting, which I reproduce below as my Chart of the Month for October.

© 2013 CloudPhysics, Inc., All Rights Reserved

Above is a plot of a typical OLTP-like application’s IO activity fed into CloudPhysics’ SSD caching model. (I believe this is a read-only SSD cache, although they have write-back and write-through SSD caching models as well.)

On the horizontal axis is SSD cache size in MB, which ranges from 0MB to 3,500MB. On the left vertical axis is the % of application IO activity that is cache hits. On the right vertical axis is the amount of data that comes out of cache in MB, which ranges from 0MB to 18,000MB.

The IO trace was for a 24-hour period and shows how much of the application’s IO workload could be captured and converted to (SSD) cache hits given a certain sized cache.

The other thing that would have been interesting to know is the size of the OLTP database being used by the application. It could easily be 18GB or TBs in size; we don’t see that here.

Analyzing the chart

First, in the mainframe era (we’re still there, aren’t we?), the rule of thumb was that doubling cache size should increase cache hit rate by 10%.

Second, I don’t understand why at 0MB of cache the cache hit rate is ~25%. From my perspective, at 0MB of cache the hit rate should be 0%. Seems like a bug in the model but, that aside, the rest of the curve is very interesting.

Somewhere around 500MB of cache there is a step function where cache hit rate goes from ~30% to ~50%. This is probably some sort of DB index that now fits in cache, so accesses to it have become cache hits.

As for the rule of thumb, going from 500MB to 1000MB doesn’t seem to do much; maybe it increases the cache hit ratio by a few %. And doubling it again (to 2000MB) only seems to get you another percent or two of cache hit rate.

But moving to the 2300MB size cache gets you over 80% cache hit rate. I would have to say the rule of thumb doesn’t work well for this workload.

Not sure what the step up really represents from the OLTP workload perspective but at 80% cache hit, most of the database tables that are accessed more frequently must reside now in cache. Prior to this cache size (<2300MB) all of those tables apparently just didn’t fit in cache, thus, as one was being accessed and moved into cache, another was being pushed out of cache causing a read miss the next time it was accessed. After this cache size (>=2300MB), all these frequently accessed tables could now remain in cache, resulting in the ~80% cache hit rate seen on the chart.
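The thrashing behavior described above is easy to reproduce with a toy model. The following Python sketch assumes a plain LRU cache and a synthetic workload that cycles through 100 "hot" blocks. Both are my assumptions, as I don’t know what replacement policy the CloudPhysics model uses or what the real trace looks like.

```python
from collections import OrderedDict


def lru_hit_rate(trace, cache_blocks):
    """Replay a list of block-ids through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict the least recently used block
    return hits / len(trace)


# Synthetic workload: scan the same 100 blocks, 50 times over.
trace = [block for _ in range(50) for block in range(100)]

for size in (0, 50, 99, 100):
    print(size, lru_hit_rate(trace, size))
# 0 0.0
# 50 0.0
# 99 0.0
# 100 0.98
```

One block short of the working set, every access pushes out the block that will be needed next, so the hit rate stays at 0%. Once the whole working set fits, only the initial warm-up misses remain. Real workloads are messier, but the step up in the chart has the same flavor. Note also that a cache of size 0 gives a 0% hit rate here.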

Irfan said that they do not display the chart in the CloudPhysics solution but rather display the inflection points. That is, their solution would say something like: at 500MB of SSD the traced application should see a ~50% cache hit rate, and at 2300MB of SSD the application should generate ~80% cache hits. This nets it out for the customer but hides the curve above and the underlying complexity.
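I don’t know how CloudPhysics actually picks its inflection points, but one simple way to net out a curve like this is to report only the cache sizes where the hit rate jumps by more than some threshold. In the Python sketch below, the function name, the 10-point threshold and the sample curve are all mine. The curve values are loosely based on the numbers mentioned in this post (the 400MB and 3,500MB points are made up for illustration) and are not CloudPhysics data.

```python
def inflection_points(curve, min_jump=10.0):
    """Return (cache_mb, hit_pct) samples where hit rate jumps by >= min_jump points."""
    points = []
    for (_, prev_hit), (size, hit) in zip(curve, curve[1:]):
        if hit - prev_hit >= min_jump:
            points.append((size, hit))
    return points


# (cache size in MB, cache hit %): illustrative values only.
curve = [(0, 25), (400, 30), (500, 50), (1000, 53), (2000, 55), (2300, 80), (3500, 82)]

print(inflection_points(curve))   # [(500, 50), (2300, 80)]
```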

Caching models & application working sets …

With CloudPhysics’ SSD trace simulation Card (caching model) and the ongoing lightweight IO trace collection (IO tracing) available with their service, any VM’s working set can be understood at this fine level of granularity. The advantage of CloudPhysics is that, with these tools, one could determine the optimum cache size required to generate a given level of cache hits.

I would add some cautions to the above:

  • The results shown here are based on a CloudPhysics SSD caching model. Not all SSDs cache in the same way, and there can be quite a lot of sophistication in caching algorithms (having worked on a few in my time). So although the results from this model may show the hit rate for a simplistic SSD cache, it could easily under- or overestimate real cache hit rates, perhaps by a significant amount. The only way to validate CloudPhysics’ SSD simulation model is to put a physical cache in at the appropriate size and measure the VM’s cache hit rate.
  • Real caching algorithms have a number of internal parameters which can impact cache hit rates, not the least of which is the size of the IO block being cached. This can be (commonly) fixed or (rarely) variable in length. But there are plenty of others which can adversely impact cache hit rates for differing workloads.
  • Real caches have a warm-up period. During this time the cache is filling up with tracks which may never be referenced again. Some warm-up periods take minutes, while some I have seen take weeks or longer. The simulation is for 24 hours only; it’s unclear how the hit rate would be impacted if the trace/simulation ran for longer or shorter periods.
  • Caching IO activity can introduce positive (or negative) feedback into any application’s IO stream. If, without a cache, an index IO took, let’s say, 10 msec to complete and now, with an appropriately sized cache, it takes 10 μsec, the application’s users are going to complete more transactions, faster. As this takes place, database IO activity will change from what it looked like without any caching. Even the non-cache hits should see some speedup, because the amount of IO issued to the backend storage is reduced substantially. At some point this all reaches some sort of stasis and we have an ongoing cache hit rate. But the key is that it’s unlikely to exactly match the cache hit rate predicted from a trace and a model. The point is that adding cache to any application environment has effects which are chaotic in nature and inherently difficult to model.

Nonetheless, I like what I see here. I believe it would be useful to understand a bit more about the CloudPhysics caching model’s algorithm, the size of the application’s database being traced here, and how well their predictions actually matched up to physical caches at the sizes recommended.

… the bottom line

Given what I know about caching in the real world, my suggestion is to take the cache sizes recommended here as a bottom end estimate and the cache hit predictions as a top end estimate of what could be obtained with real SSD caches.  I would increase the cache size recommendations somewhat and expect something less than the cache hits they predicted.

In any case, having application (even VM) IO traces like this that can be accessed and used to drive caching simulation models should be a great boon to storage developers everywhere. I can only hope that server-side SSD and caching storage vendors supply their own proprietary cache model cards to run alongside CloudPhysics Cards, so that potential customers could use their application traces with the vendor cards to predict what the vendor’s hardware can do for an application.


~~~~

Comments?

Image:  Chart courtesy of and use approved by CloudPhysics