VM working set inflection points & SSD caching – chart-of-the-month

Attended SNW USA a couple of weeks ago and talked with Irfan Ahmad, Founder/CTO of CloudPhysics, which provides a new Management-as-a-Service offering for VMware. He showed me a chart I found very interesting, which I reproduce below as my Chart of the Month for October.

© 2013 CloudPhysics, Inc., All Rights Reserved

Above is a plot of a typical OLTP-like application's IO activity fed into CloudPhysics' SSD caching model. (I believe this is a read-only SSD cache, although they have write-back and write-through SSD caching models as well.)

On the horizontal axis is SSD cache size in MB, ranging from 0MB to 3,500MB. On the left vertical axis is the percentage of application IO activity that results in cache hits. On the right vertical axis is the amount of data that comes out of cache in MB, which ranges from 0MB to 18,000MB.

The IO trace covered a 24-hour period and shows how much of the application's IO workload could be captured and converted to (SSD) cache hits for a given cache size.
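For readers who want to see the mechanics, here's a minimal sketch of how a trace-driven cache simulation like this works. This is not CloudPhysics' model (their algorithm isn't public); it simply replays a captured trace of block addresses against a plain LRU cache of a given size and counts hits. The 4KB block size and the LRU policy are assumptions for illustration.

```python
from collections import OrderedDict

def lru_hit_rate(trace_blocks, cache_size_blocks):
    """Replay a sequence of block addresses against an LRU cache of
    cache_size_blocks entries and return the fraction that were hits."""
    cache = OrderedDict()              # ordered least- to most-recently used
    hits = 0
    for blk in trace_blocks:
        if blk in cache:
            hits += 1
            cache.move_to_end(blk)     # refresh recency on a hit
        else:
            cache[blk] = True
            if len(cache) > cache_size_blocks:
                cache.popitem(last=False)   # evict the least-recently-used block
    return hits / len(trace_blocks) if trace_blocks else 0.0

def hit_rate_curve(trace_blocks, sizes_mb, block_kb=4):
    """Sweep candidate cache sizes (in MB, assuming 4KB cache blocks) to
    build a hit-rate-vs-cache-size curve from one captured IO trace."""
    return {mb: lru_hit_rate(trace_blocks, mb * 1024 // block_kb)
            for mb in sizes_mb}
```

Even a toy model like this produces a step in the curve once the cache grows large enough to hold the workload's hot blocks, which is exactly the behavior discussed below.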

The other thing that would have been interesting to know is the size of the OLTP database being used by the application; it could easily be 18GB or TBs in size, but we can't tell that from the chart.

Analyzing the chart

First, in the mainframe era (we're still there, aren't we?), the rule of thumb was that doubling cache size should increase the cache hit rate by 10%.

Second, I don't understand why at 0MB of cache the cache hit rate is ~25%. From my perspective, at 0MB of cache the hit rate should be 0%. That seems like a bug in the model, but that aside, the rest of the curve is very interesting.

Somewhere around 500MB of cache there is a step function where the cache hit rate goes from ~30% to ~50%. This is probably some sort of DB index that now fits in cache, so accesses to it have become cache hits.

As for the rule of thumb, going from 500MB to 1000MB doesn't seem to do much, perhaps increasing the cache hit rate by a few percent. And doubling it again (to 2000MB) only seems to get you another percent or two of cache hits.

But moving to a 2300MB cache gets you over an 80% cache hit rate. I would have to say the rule of thumb doesn't work well for this workload.

Not sure what the step up really represents from the OLTP workload's perspective, but at an 80% cache hit rate most of the more frequently accessed database tables must now reside in cache. Below this cache size (<2300MB), all of those tables apparently just didn't fit in cache; as one was being accessed and moved into cache, another was being pushed out, causing a read miss the next time it was accessed. At or above this cache size (>=2300MB), all of these frequently accessed tables can remain in cache, resulting in the ~80% cache hit rate seen on the chart.

Irfan said that they do not display the chart in the CloudPhysics solution but rather display the inflection points. That is, their solution would say something like: at 500MB of SSD the traced application should see a ~50% cache hit rate, and at 2300MB of SSD it should generate ~80% cache hits. This nets it out for the customer but hides the curve above and the underlying complexity.
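CloudPhysics hasn't said how they pick those inflection points, but one simple way to net out a curve like this is to report only the cache sizes where the hit rate jumps by more than some threshold. A rough sketch, operating on the hypothetical hit-rate curve from the earlier sketch:

```python
def inflection_points(curve, min_jump=0.10):
    """Given {cache_size_mb: hit_rate}, return the (size, rate) points
    where the hit rate jumps by at least min_jump over the previous size."""
    points, prev_rate = [], None
    for size in sorted(curve):
        rate = curve[size]
        if prev_rate is not None and rate - prev_rate >= min_jump:
            points.append((size, rate))
        prev_rate = rate
    return points

# e.g. inflection_points(hit_rate_curve(trace, range(100, 3501, 100)))
# might net out to something like [(500, 0.5), (2300, 0.8)] for the
# workload charted above.
```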

Caching models & application working sets …

With CloudPhysics' SSD trace simulation Card (caching model) and the ongoing lightweight IO trace collection (IO tracing) available with their service, any VM's working set can be understood at this fine a level of granularity. The advantage of CloudPhysics is that, with these tools, one could determine the optimum cache size required to generate a given level of cache hits.
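And if the question is posed the other way around (a target hit rate rather than a list of knees), the same simulated curve can be searched directly. Another small sketch, again assuming the hypothetical curve format used above:

```python
def smallest_cache_for_target(curve, target_hit_rate):
    """Given {cache_size_mb: simulated_hit_rate}, return the smallest
    cache size whose hit rate meets the target, or None if none does."""
    for size in sorted(curve):
        if curve[size] >= target_hit_rate:
            return size
    return None

# e.g. smallest_cache_for_target(hit_rate_curve(trace, range(100, 3501, 100)), 0.80)
# would answer "how much SSD do I need for ~80% cache hits?"
```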

I would add some cautions to the above:

  • The results shown here are based on a CloudPhysics SSD caching model. Not all SSDs cache in the same way, and there can be quite a lot of sophistication in caching algorithms (having worked on a few in my time). So although this may show the hit rate for a simplistic SSD cache, it could easily under- or over-estimate real cache hit rates, perhaps by a significant amount. The only way to validate CloudPhysics' SSD simulation model is to put a physical cache in at the appropriate size and measure the VM's cache hit rate.
  • Real caching algorithms have a number of internal parameters which can impact cache hit rates, not the least of which is the size of the IO block being cached. This can be (commonly) fixed or (rarely) variable in length. But there are plenty of other parameters which can adversely impact cache hit rates for differing workloads as well.
  • Real caches have a warm-up period. During this time the cache is filling up with tracks which may never be referenced again. Some warm-up periods take minutes, while some I have seen take weeks or longer. The simulation is for 24 hours only; it's unclear how the hit rate would be impacted if the trace/simulation covered a longer or shorter period (a rough illustration of the warm-up effect follows this list).
  • Caching IO activity can introduce a positive (or negative) feedback into any application's IO stream. If, without a cache, an index IO took, let's say, 10 msec to complete and now, with an appropriately sized cache, it takes 10 μseconds to complete, the application users are going to complete more transactions, faster. As this takes place, database IO activity will change from what it looked like without any caching. Also, even the non-cache hits should see some speedup, because the amount of IO issued to the backend storage is reduced substantially. At some point this all reaches some sort of stasis and we have an ongoing cache hit rate. But the key is that it's unlikely to be an exact match to the cache hit rate a trace and model would predict. The point is that adding cache to any application environment has effects which are chaotic in nature and inherently difficult to model.
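To illustrate the warm-up caution above, the hit rate can be computed per time window rather than over the whole trace; early windows start from a cold cache and show depressed hit rates. A rough sketch along the same lines as the LRU simulator above (the window size is an arbitrary assumption):

```python
from collections import OrderedDict

def windowed_hit_rates(trace_blocks, cache_size_blocks, window=10_000):
    """Return a hit rate per window of IOs so the warm-up effect is
    visible: early windows run against a cold (or colder) cache."""
    cache = OrderedDict()
    rates, hits, seen = [], 0, 0
    for blk in trace_blocks:
        if blk in cache:
            hits += 1
            cache.move_to_end(blk)
        else:
            cache[blk] = True
            if len(cache) > cache_size_blocks:
                cache.popitem(last=False)
        seen += 1
        if seen == window:
            rates.append(hits / window)   # hit rate for this window only
            hits, seen = 0, 0
    return rates
```

If the last few windows of a 24-hour trace are still climbing, the steady-state hit rate is probably higher than the whole-trace average; conversely, for workloads whose working set shifts over days or weeks, a 24-hour trace may overstate it.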

Nonetheless, I like what I see here. I believe it would be useful to understand a bit more about the algorithm behind CloudPhysics' caching model, the size of the application database being traced here, and how well their predictions actually match up to physical caches at the sizes recommended.

… the bottom line

Given what I know about caching in the real world, my suggestion is to take the cache sizes recommended here as a bottom end estimate and the cache hit predictions as a top end estimate of what could be obtained with real SSD caches.  I would increase the cache size recommendations somewhat and expect something less than the cache hits they predicted.

In any case, having application (even VM) IO traces like this that can be accessed and used to drive caching simulation models should be a great boon to storage developers everywhere. I can only hope that server-side SSD and caching storage vendors supply their own proprietary cache model cards alongside the CloudPhysics Cards, so that potential customers could use their application traces with the vendor cards to predict what that hardware can do for an application.

If you want to learn more about block storage performance from SMB to enterprise-class SAN storage systems, please check out our SAN Buying Guide, available for purchase on our website. Also, we report each month on storage performance results from SPC, SPECsfs, and ESRP in our free newsletter. If you would like to subscribe, please use the signup form above right.

~~~~

Comments?

Image:  Chart courtesy of and use approved by CloudPhysics

Storage throughput vs. IO response time and why it matters

Fighter Jets at CNE by lifecreation (cc) (from Flickr)

Lost in much of the discussions on storage system performance is the need for both throughput and response time measurements.

  • By IO throughput I generally mean data transfer speed in megabytes per second (MB/s or MBPS); however, another definition of throughput is IO operations per second (IO/s or IOPS). I prefer the MB/s designation for storage system throughput because it's very complementary with respect to response time, whereas IO/s can often be confounded with response time (the quick arithmetic sketch after this list shows how these metrics relate). Nevertheless, both metrics qualify as storage system throughput.
  • By IO response time I mean the time it takes a storage system to perform an IO operation from start to finish, usually measured in milliseconds, although lately some subsystems have dropped below the 1msec threshold. (See my last year's post on SPC LRT results for information on some top response time results.)
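Here's the quick arithmetic sketch mentioned above relating the two throughput metrics and response time. The transfer size and queue depth are made-up numbers for illustration, not measurements from any benchmark:

```python
# MB/s is just IOPS times the average transfer size, so the two throughput
# metrics are interchangeable only if you know the transfer size -- which is
# why an IOPS figure alone can be ambiguous.
iops = 50_000                    # assumed IO operations per second
xfer_kb = 8                      # assumed average transfer size per IO (KB)
mb_per_sec = iops * xfer_kb / 1024
print(f"{mb_per_sec:.0f} MB/s")  # ~391 MB/s under these assumptions

# Little's Law ties IOPS to response time through concurrency:
# IOPS = outstanding IOs / average response time.  At the same 50,000 IOPS
# with 16 IOs outstanding, the implied average response time is:
outstanding = 16
resp_time_ms = outstanding / iops * 1000
print(f"{resp_time_ms:.2f} ms")  # 0.32 ms under these assumptions
```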

Benchmark measurements of response time and throughput

Both Standard Performance Evaluation Corporation’s SPECsfs2008 and Storage Performance Council’s SPC-1 provide response time measurements although they measure substantially different quantities.  The problem with SPECsfs2008’s measurement of ORT (overall response time) is that it’s calculated as a mean across the whole benchmark run rather than a strict measurement of least response time at low file request rates.  I believe any response time metric should measure the minimum response time achievable from a storage system although I can understand SPECsfs2008’s point of view.

On the other hand, SPC-1's measurement of LRT (least response time) is just what I would like to see in a response time measurement. SPC-1 provides the time it takes to complete an IO operation at very low request rates.

With regard to throughput, once again SPECsfs2008's measurement leaves something to be desired, as it's strictly a measurement of NFS or CIFS operations per second. Of course, this includes a number (>40%) of non-data-transfer requests as well as data transfers, so it confounds any measurement of how much data can be transferred per second. But, from their perspective, a file system needs to do more than just read and write data, which is why they mix these other requests in with their measurement of NAS throughput.
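To see why mixing operation types confounds a data-transfer measurement, here's a rough back-of-the-envelope sketch. The op-mix fraction and average transfer size are assumptions for illustration, not the actual SPECsfs2008 mix:

```python
# If a large share of the benchmark's operations move no data, the same
# ops/sec figure can correspond to very different MB/s depending on the mix
# and on the average transfer size of the data-moving operations.
ops_per_sec = 100_000            # reported NFS/CIFS operations per second
data_op_fraction = 0.6           # assumed: ~60% of ops actually move data
avg_xfer_kb = 16                 # assumed average read/write transfer size (KB)
mb_per_sec = ops_per_sec * data_op_fraction * avg_xfer_kb / 1024
print(f"{mb_per_sec:.0f} MB/s")  # ~938 MB/s under these assumptions
```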

Storage Performance Council's SPC-1 reports throughput results as IOPS and provides no direct measure of MB/s unless one looks at their SPC-2 benchmark results. SPC-2 reports a direct measure of MBPS, which is an average of three different data-intensive workloads: large file access, video-on-demand, and a large database query workload.

Why response time and throughput matter

Historically, we used to say that OLTP (online transaction processing) performance was entirely dependent on response time – the better the storage system's response time, the better your OLTP systems performed. Nowadays it's a bit more complex, as some of today's database queries can depend as much on sequential database transfers (or throughput) as on individual IO response time. Nonetheless, I feel that there is still a large component of response-time-critical workloads out there that perform much better with shorter response times.

On the other hand, high throughput has its growing gaggle of adherents as well.  When it comes to high sequential data transfer workloads such as data warehouse queries, video or audio editing/download or large file data transfers, throughput as measured by MB/s reigns supreme – higher MB/s can lead to much faster workloads.

The only question that remains is who needs higher throughput as measured by IO/s rather than MB/s.  I would contend that mixed workloads which contain components of random as well as sequential IOs and typically smaller data transfers can benefit from high IO/s storage systems.  The only confounding matter is that these workloads obviously benefit from better response times as well.   That’s why throughput as measured by IO/s is a much more difficult number to understand than any pure MB/s numbers.

—-

Now there is a contingent of performance gurus today that believe that IO response times no longer matter.  In fact if one looks at SPC-1 results, it takes some effort to find its LRT measurement.  It’s not included in the summary report.

Also, in the post mentioned above there appears to be a definite bifurcation of storage subsystems with respect to response time, i.e., some subsystems are focused on response time while others are not.  I would have liked to see some more of the top enterprise storage subsystems represented in the top LRT subsystems but alas, they are missing.

1954 French Grand Prix - Those Were The Days by Nigel Smuckatelli (cc) (from Flickr)

Call me old-fashioned, but I feel that response time represents a very important performance measure, orthogonal to the throughput of any storage subsystem, and as such it should be much more widely disseminated than it is today.

For example, there is a substantive difference between a fighter jet's or race car's top speed and its maneuverability. I would compare top speed to storage throughput and maneuverability to IO response time. Perhaps this doesn't matter as much for a jet liner or family car, but it can matter a lot in the right domain.

Now do you want your storage subsystem to be a jet fighter or a jet liner – you decide.