VMware VSAN 6.0 all-flash & hybrid IO performance at SFD7

We visited with VMware’s VSAN team during last Storage Field Day (SFD7, session available here). The presentation was wide ranging but the last two segments dealt with recent changes to VSAN and at the end provided some performance results for both a hybrid VSAN and an all-Flash VSAN.

Some new features in VSAN 6.0 include:

  • More scaleability, up to 64 hosts in a cluster and up to 200VMs per host
  • New higher performance snapshots & clones
  • Rack awareness for better availability
  • Hardware based checksum for T10 DIF (data integrity feature)
  • Support for blade servers with JBODs
  • All-flash configurations
  • Higher IO performance

Even in the all-flash configuration there are two tiers of storage a write cache tier and a capacity tier of SSDs. These are configured with two different classes of SSDs (high endurance/low-capacity and low-endurance/high capacity).

At the end of the session Christos Karamanolis (@Xtosk), Principal Architect for VSAN showed us some performance charts on VSAN 6.0 hybrid and all-flash configurations.

Hybrid VSAN performance

On the chart we see two plots showing IOmeter performance as VSAN scales across multiple nodes (hosts), on the left  we have  a 100% Read workload and on the right a 70%Read:30%Write workload.

The hybrid VSAN configuration has 4-10Krpm disks and 1-400GB SSD on each host and ranges from 8 to 64 hosts. The bars on the chart show IOmeter IOPS and the line shows the average response time (or IO latency) for each VSAN host configuration. I am not a big fan of IOmeter, as it’s an overly simplified, but that’s what VMware used.

The results show that in a 100% read case the hybrid, 64 host VSAN 6.0 cluster was able to sustain ~3.8M IOPS or over 60K IOPS per host.  or the mixed 70:30 R:W workload VSAN 6.0 was able to sustain ~992K IOPs or ~15.5K IOPS per host.

We see a pretty drastic IOPs degradation (~1/4 the 100% read performance) in the plot on the right, when they added write activity to the mix. But with VSAN’s mirrored data protection each VM write represents at least two VSAN backend writes and at a 70:30 IOmeter R:W this would be ~694K IOPS read and ~298K IOPS write frontend IOs but with mirroring this represents 595K writes to the backend storage.

Then of course, there’s destage activity (data written to SSDs need  to be read off SSD and written to HDD) which also multiplies internal IO operations for every external write IOP. Lets say all that activity multiplies each external write by 6 (3 for each mirror: 1 to the write cache SSD, 1 to read it back and 1 to write to HDD) and we multiply that times the ~298K external write IOPS, it would add up to about a total of ~1.8M write derived IOPS  and ~0.7M read derived IOPS or a total of ~2.5M IOPS but this is still far away from the 3.5M IOPS for 100% read activity. I am probably missing another IO or two in the write path (maybe Virtual to physical mapping structures need to be updated) or have failed to account for more inter-cluster IO activity in support of the writes.

In addition, we see the IO latency was roughly flat across the 100% Read workload at ~2.25msec. and got slightly worse over the 70:30 R:W workload, ranging from ~2.5msec. at 4 hosts to a little over 3.0msec. with 64 hosts. Not sure why this got worse but hosts are scaled up it could induce more inter-cluster overhead.

Rays-pix37

In the chart to the right, we can see similar performance data for systems with one or two disk-groups. The message here is that with two disk groups on a host (2X the disk and SSD resources per host) one can potentially double the performance of the systems, to 116K IOPS/host on 100% read and 31K IOPS/host on a 70:30 R:W workload.

All-flash VSAN performance

Rays-pix41

Here we can see performance data for an 8-host, all-flash VSAN configuration. In this case the chart on the left was a single “disk” group and the chart on the right was a dual disk group, all-flash configuration on each of the 8-hosts. The hosts were configured with 1-400GB and 3-800GB SSDs per disk group.

The various bars on the charts represent different VM working set sizes, 100, 200, 400 & 600GB for the single disk group chart and 100, 200, 400, 800 & 1200GB for dual disk group configurations. For the dual disk group, the 1200GB working set size is much bigger than a cache tier on each host.

The chart text is a bit confusing: the title of each plot says 70% read but the text under the two plots says 100% read. I must assume these were 70:30 R:W workloads. If we just look at the 8 hosts, using a 400GB VM working set size, all-flash VSAN 6.0 single disk group cluster was able to achieve ~37.5K IOPS/host and with two disk groups, the all-flash VSAN 6.0  was able to achieve ~68.75K IOPS/host at the 400GB working set size. Both doubling the hybrid performance.

Response times degrade for both the single and dual disk groups as we increase the working set sizes. It’s pretty hard to see on the two charts but it seems to range from 1.8msec to 2.2msec for the single disk group and 1.8msec to 2.5 msec for the dual disk group. The two charts are somewhat misleading because they didn’t use the exact same working group sizes for the two performance runs but just taking the 100|200|400GB working set sizes, for the single disk group it looks like the latency went from ~1.8msec. to ~2.0msec and for the dual disk group from ~1.8msec to ~2.3msec. Why the higher degradation for the dual disk group is anyone’s guess.

The other thing that doesn’t make much sense is that as you increase the working set size the number of IOPS goes down, worse for the dual disk group than the single. Once again taking just the 100|200|400GB working group sizes this ranges from ~350K IOPS to ~300K IOPS (~15% drop) for the single disk group and ~700K IOPS to ~550K IOPS (~22% drop) for the dual disk group.

Increasing working group sizes should cause additional backend IO as the cache effectivity should be proportionately less as you increase working set size. Which I think goes a long way to explain the degradation in IOPS as you increase working set size. But I would have thought the degradation would have been a proportionally similar for both the single and dual disk groups. The fact that the dual disk group did 7% worse seems to indicate more overhead associated with dual disk groups than single disk groups or perhaps, they were running up against some host controller limits (a single controller supporting both disk groups).

~~~~

At the time (3 months ago) this was the first world-wide look at all-flash VSAN 6.0 performance. The charts are a bit more visible in the video than in my photos (?) and if you want to just see and hear Christos’s performance discussion check out ~11 minutes into the final video segment.

For more information you can also read these other SFD7 blogger posts on VMware’s session: