Boot storms and VDI storage

We were having a discussion about virtual desktop infrastructure (VDI) environments the other day and the topic came around to boot storms.

When VDI first started coming out, many storage companies were concerned about the effect boot storms, shutdown storms, AV scan storms, etc. would have on system performance. As such, they were keen to demonstrate how well their systems stood up against boot storms and other unnatural IO activity in support of VDI environments.

But at last year's SFD4 and during subsequent discussions around the storage round table, no one mentioned boot storms as a concern anymore. Nowadays the issue is more that VMs in general create a sort of IO mixer, and discerning IO patterns in VM environments is nigh impossible without insight into each VM's IO workload in isolation.

Why is it that boot storms are no longer a concern?

It seems that a couple of things have emerged as more VDI implementations have been put in place. For example, not everyone in a company boots up on Monday morning at 8am, which spreads any potential boot storm over a much longer period of time than anticipated in boot storm simulations.

Also, it turns out that a lot of people never actually shut down their desktops, so the need to boot is drastically reduced for them, to perhaps once/week, once/month or once/bluescreen. And although I don't know this for a fact, someone mentioned that VMware View has a parameter that can disable end-user shutdowns and instead put VDI instances into suspended animation (I would say sleep, but that seems to be a Mac term).

Another case in point: VMware View Planner doesn't simulate or even measure virtual desktop boot-up activity. It seems that simulating boot storms is no longer considered a reasonable way to measure how effectively storage systems can handle VMware View implementations.

On the other hand, I am aware of at least one other VDI benchmarking tool that makes a point of simulating boot storms and other similarly extreme workload activities.

So are boot storms no longer an issue for storage systems in VDI implementations, or not?

Photo Credits: Storm cloud, Duncan, Oklahoma by chascar

SPECsfs2008 results: NFS throughput vs. flash size – Chart of the Month

[Scatter plot: SPECsfs2008 NFS throughput vs. flash capacity]

The above chart was sent out in our December newsletter and represents yet another attempt to understand how flash/SSD use is impacting storage system performance. This chart's interesting twist is to try to categorize the use of flash in hybrid (disk-SSD) systems vs. flash-only/all-flash storage systems.

First, we categorize SSD/Flash-only systems (blue diamonds on the chart) as any storage system that has as much or more flash storage capacity than SPECsfs2008 exported file system capacity. While not entirely exact (there is one system that has ~99% of its exported capacity in flash), it is a reasonable approximation. Any other system that has some flash identified in its configuration is considered a Hybrid SSD&Disks system (red boxes on the chart).

Next, we plot each system's NFS throughput on the vertical axis and its flash capacity (in GB) on the horizontal axis. Then we fit a linear regression to each set of data.
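To make the methodology concrete, here's a rough sketch (in Python, with invented names; not SCI's actual analysis code) of the categorization and regression just described, assuming each submission has been reduced to an (exported capacity GB, flash GB, NFS throughput) tuple:

```python
# Hypothetical reconstruction of the chart's methodology: split systems
# into flash-only vs. hybrid, then fit a line to each group.
import numpy as np

def regressions(results):
    # results: list of (exported_gb, flash_gb, nfs_throughput) tuples
    flash_only = [(f, t) for (exp, f, t) in results if f >= exp]
    hybrid     = [(f, t) for (exp, f, t) in results if 0 < f < exp]
    fits = {}
    for name, points in (("flash-only", flash_only), ("hybrid", hybrid)):
        x, y = zip(*points)
        slope, intercept = np.polyfit(x, y, 1)  # NFS throughput per flash GB
        r2 = np.corrcoef(x, y)[0, 1] ** 2       # regression coefficient
        fits[name] = (slope, intercept, r2)
    return fits
```

The slope of each fit is what the chart is really comparing: NFS throughput gained per GB of flash in each class of system.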

What troubles me about this chart is that hybrid systems are getting much more NFS throughput performance out of their flash capacity than flash-only systems. One would think that flash-only systems would generate more throughput per flash GB than hybrid systems, given the slow access times of disk. But the data shows otherwise?!

We understand that NFS throughput operations are mostly metadata file calls rather than data transfers, so one would think that the relatively short random IOPS would favor flash-only systems. But that's not what the data shows.

What the data seems to tell me is that judicious use of flash and disk storage in combination can be better than either alone, or at least than flash alone. So maybe those short random IOPS should be served out of SSD, while the relatively longer, more sequential-like data accesses (which represent only 28% of the operations that constitute NFS throughput) are served out of disk. And as the metadata for file systems is relatively small in capacity, this can be supported with a small amount of SSD, leveraging that minimal flash capacity for the greater good (or more NFS throughput).

I would be remiss if I didn't mention that there are relatively few (7) flash-only systems in the SPECsfs2008 benchmarks and that the regression coefficient is very poor (R**2 = ~0.14), which means this could change substantially with more flash-only submissions. However, it's looking pretty flat from my perspective, and it would take an awful lot of flash-only systems showing much higher NFS throughput per flash GB to make a difference in the regression equation.

Nonetheless, I am beginning to see a pattern here: SSD/flash is good for some things and disk continues to be good for others. Smart storage system developers would do well to recognize this fact. Also, as a side note, I am beginning to see some rationale for why there aren't more flash-only SPECsfs2008 results.

Comments?

~~~~

The complete SPECsfs2008 performance report went out in SCI's December 2013 newsletter. A copy of the report will be posted on our dispatches page sometime this quarter (if all goes well). However, you can get the latest storage performance analysis now and subscribe to future free newsletters by using the signup form above right.

Even more performance information and ChampionCharts for NFS and CIFS/SMB storage systems are also available in SCI's NAS Buying Guide, available for purchase from our website.

As always, we welcome any suggestions or comments on how to improve our SPECsfs2008  performance reports or any of our other storage performance analyses.

 

Proximal Data, server SSD caching software

I attended Storage Field Day 4 (SFD4) about a month ago now and had a chance to visit with Rory Bolt, CEO/Founder of Proximal Data, a new server-side caching software solution. Last month the GreyBeards (Howard Marks and I) talked with Satyam Vaghani, Co-founder and CTO of PernixData, another server-side caching solution; you can find that podcast here. But this post is about Proximal Data. These guys could use some better marketing, but when you spend 90% of your funding on engineers this is what you get.

Proximal Data doesn't believe in agent software because it takes a long time to deploy and could potentially disrupt IT operations during installation. In contrast, Proximal Data installs their AutoCache solution software into the hypervisor as a VIB (vSphere Installation Bundle). There was some discussion at SFD4 on whether installing the VIB would be disruptive to customer operations. Not being a VMware expert, I won't comment on the outcome of that discussion, but if you want to find out more I suggest viewing the SFD4 video of Proximal Data's presentation.

Of course, operating at the hypervisor layer gives them IO activity information at the VM level, which they can use to control their caching software at VM granularity. In addition, by executing at the hypervisor layer, AutoCache doesn't require any guest-OS-specific functionality or hooks. Another nice thing about executing at the hypervisor level is that they can cache RDM devices.

To use AutoCache you will need one or more PCIe or DAS SSDs in your ESXi server. Once the SSD is installed and you have installed/activated the AutoCache software, you will need to partition or dedicate the device to Proximal Data's AutoCache.

AutoCache is managed as a virtual appliance with a web server GUI. With the networking set up and the AutoCache VIB installed, you can access their operator panels via a tab in vCenter. Once the software is installed you don't have to use their GUI ever again.

AutoCache read caching algorithms

Not every read IO for a VM being cached is brought into AutoCache's SSD cache. They are trying to ensure that cached data will be referenced again. As such, they typically wait for two reads before the data is placed into cache.
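As a rough illustration (my own reconstruction, not Proximal Data's code), a "read it twice before caching" admission policy might look something like this:

```python
# Hypothetical two-read admission filter: a block is only admitted to
# the SSD cache on its second read, filtering out one-touch data.
class TwoReadAdmission:
    def __init__(self):
        self.seen_once = set()  # blocks read once but not yet cached
        self.cache = {}         # block -> data (stand-in for the SSD cache)

    def read(self, block, backend_read):
        if block in self.cache:
            return self.cache[block]        # cache hit
        data = backend_read(block)          # miss: fetch from shared storage
        if block in self.seen_once:
            self.cache[block] = data        # second read: admit to cache
            self.seen_once.discard(block)
        else:
            self.seen_once.add(block)       # first read: remember, don't cache
        return data
```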

They support two different read caching algorithms, referred to during the presentation as Algorithm A and Algorithm B. (They really need some marketing – Turbo Boost and Extreme Boost sound better to me.) I'm not sure they ever described the differences between the two, but the fact that they have multiple caching algorithms speaks to some sophistication. They also maintain a "Ghost data list". Ghost data is data whose metadata is still in cache, but whose actual data is no longer in cache.

When a miss occurs, they determine whether the data would have been a hit in the Ghost data list, under Algorithm A, or under Algorithm B, had each been active for that VM. If it would have been a hit in Ghost data, then in general you probably need more SSD caching space on that ESXi server for the VMs being cached. If it would have been a hit under Algorithm A or B, that VM's IO should probably be using that algorithm.
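Here's a hedged sketch of how such ghost-data bookkeeping could work in an LRU-style cache; evicted blocks keep a metadata-only entry so a later miss on them can be counted as "would have hit with more cache". The structure and names are my assumptions:

```python
# Hypothetical LRU cache with a ghost list: metadata survives eviction
# so misses can be classified as capacity misses.
from collections import OrderedDict

class GhostedLRU:
    def __init__(self, capacity, ghost_capacity):
        self.capacity, self.ghost_capacity = capacity, ghost_capacity
        self.cache = OrderedDict()  # resident blocks, in LRU order
        self.ghost = OrderedDict()  # metadata-only entries for evicted blocks
        self.hits = self.misses = self.ghost_hits = 0

    def access(self, block):
        if block in self.cache:
            self.cache.move_to_end(block)       # refresh LRU position
            self.hits += 1
            return
        self.misses += 1
        if block in self.ghost:                 # metadata still known:
            self.ghost_hits += 1                # "would have hit with more SSD"
            del self.ghost[block]
        self.cache[block] = True
        if len(self.cache) > self.capacity:
            evicted, _ = self.cache.popitem(last=False)
            self.ghost[evicted] = True          # keep metadata after eviction
            if len(self.ghost) > self.ghost_capacity:
                self.ghost.popitem(last=False)
```

In a scheme like this, a high ratio of ghost hits to total misses would be exactly the signal that the ESXi server could use more SSD cache capacity.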

Another feature AutoCache supports is called "Glimmer IO". I liken this to sequential read-ahead: AutoCache keeps track, on a VM basis, of all the IO being performed and tries to determine whether it is sequential or random. If the VM is doing sequential IO, AutoCache can start reading ahead of where the VM is currently reading. By doing so, it can stage the data in cache before the VM needs/reads it. According to Rory there are policies, settable on a VM basis, to limit how much read-ahead is performed. I assume there are policies associated with the use of Algorithm A and B on a VM basis as well, but they didn't go into this.
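A minimal sketch of per-VM sequential detection with bounded read-ahead, in the spirit of what was described; the threshold and read-ahead cap are made-up policy parameters, not Proximal Data's:

```python
# Hypothetical per-VM sequential detector: after N consecutive blocks,
# return a bounded list of blocks to prefetch into the SSD cache.
class SequentialDetector:
    def __init__(self, threshold=4, max_readahead=64):
        self.last_block = None
        self.run_length = 0
        self.threshold = threshold          # sequential reads before prefetching
        self.max_readahead = max_readahead  # per-VM policy cap on read-ahead

    def on_read(self, block):
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1            # still sequential
        else:
            self.run_length = 0             # random access resets the run
        self.last_block = block
        if self.run_length >= self.threshold:
            # stage the next blocks in cache before the VM asks for them
            return list(range(block + 1, block + 1 + self.max_readahead))
        return []                           # no prefetch
```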

AutoCache cache warmup for vMotion

The other nice thing AutoCache does is provide cache warmup on the target ESXi server when moving VMs via vMotion. This is done by registering for the vMotion API and trapping vMotion requests. Once they detect that a VM is being moved, they send the VM's AutoCache metadata over to the target host, at which point the target system's AutoCache can start to fill its cache from shared storage. Not a bad approach from my perspective: the amount of data that needs to be moved is minimal, and you get the AutoCache code running in the target machine to start preloading blocks that were in cache on the source host. They also mentioned that once they have copied the metadata over to the target host, they can free up (invalidate) all the space in the source host's cache that was being held by the VM being moved.
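Conceptually, the warmup handoff might look like the sketch below: only the cache metadata (the list of hot blocks) crosses the wire, and the target refills its own cache from shared storage. The names are invented for illustration; this is not the vMotion API or Proximal Data's code:

```python
# Hypothetical vMotion cache-warmup flow, using plain dicts as stand-ins
# for the source/target caches and hosts.
def on_vmotion_detected(vm_id, source_cache, target_host):
    hot_blocks = list(source_cache.get(vm_id, {}))  # metadata only: tiny vs. the data
    target_host["warmup_list"] = (vm_id, hot_blocks)
    source_cache.pop(vm_id, None)                   # free (invalidate) source-side cache

def warm_target_cache(target_host, target_cache, shared_storage_read):
    vm_id, hot_blocks = target_host["warmup_list"]
    for block in hot_blocks:                        # background refill from shared storage
        target_cache.setdefault(vm_id, {})[block] = shared_storage_read(vm_id, block)
```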

Proximal Data for Hyper-V

At SFD4, Rory mentioned that a Hyper-V version of AutoCache was coming out shortly. And although they specifically indicated that write-back caching was not a great idea (in contrast to Satyam and PernixData), there is potential for them to look at implementing it as well over time.

The product is sold through resellers, distributors and OEMs.  They claim support for any flash device although they have an approved HCL.

Current pricing is $1000 for the AutoCache software to support an SSD cache of 500GB or less. From what we see in enterprise storage systems, having a cache of 2-5% of your total backend storage is about right. (But see my VM working set inflection points and SSD caching post for another side of this.) So a 500GB SSD cache should be able to support 10-25TB of backend data if all goes well.
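A quick back-of-the-envelope check of that sizing rule of thumb:

```python
# If a cache should be 2-5% of total backend storage, a 500GB SSD cache
# supports roughly this much backend data:
cache_gb = 500
for pct in (0.02, 0.05):
    print(f"at {pct:.0%}: ~{cache_gb / pct / 1000:.0f}TB of backend data")
# at 2%: ~25TB of backend data
# at 5%: ~10TB of backend data
```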

~~~~

After the podcast on PernixData's clustered, write-back caching software, Proximal Data didn't seem as complex or as broadly useful. But there is a place for read-only caching. The fact that they can help warm the target host's cache for a vMotion is a great feature if you plan on doing a lot of VM movement in your shop. The fact that they have distinct support for multiple cache algorithms, understand sequential detect, and have some way of telling you when you could use more SSD caching is also good in my mind.

Comments?

Photo: 20-nanometer NAND flash chip, IntelFreePress’ photostream

 

 

VM working set inflection points & SSD caching – chart-of-the-month

Attended SNW USA a couple of weeks ago and talked with Irfan Ahmad, Founder/CTO of CloudPhysics, a new Management-as-a-Service offering for VMware. He showed me a chart which I found very interesting and which I reproduce below as my Chart of the Month for October.

© 2013 CloudPhysics, Inc., All Rights Reserved

Above is a plot of a typical OLTP-like application's IO activity fed into CloudPhysics' SSD caching model. (I believe this is a read-only SSD cache, although they have write-back and write-through SSD caching models as well.)

On the horizontal axis is SSD cache size in MB, ranging from 0MB to 3,500MB. On the left vertical axis is the percentage of application IO activity that results in cache hits. On the right vertical axis is the amount of data that comes out of cache in MB, which ranges from 0MB to 18,000MB.

The IO trace was for a 24-hour period and shows how much of the application's IO workload could be captured and converted to (SSD) cache hits given a certain sized cache.
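I assume the model does something conceptually similar to the sketch below: replay the captured trace through a simulated cache at a sweep of sizes and record the hit rate at each size, which is what yields a curve like the one above. This is a bare-bones LRU stand-in, not CloudPhysics' actual (and surely more sophisticated) model:

```python
# Replay a block trace through an LRU cache of a given size (in blocks)
# and report the fraction of accesses that hit.
from collections import OrderedDict

def hit_rate(trace, cache_blocks):
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # LRU: refresh on hit
        else:
            cache[block] = True             # miss: bring block into cache
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(trace)

# Sweep cache sizes (4KB blocks, so 256 blocks per MB) to draw the curve:
# curve = [(mb, hit_rate(trace, mb * 256)) for mb in range(100, 3600, 100)]
```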

The other thing that would have been interesting to know is the size of the OLTP database being used by the application; it could easily be 18GB or TBs in size, but we can't see that here.

Analyzing the chart

First, in the mainframe era (we're still there, aren't we?), the rule of thumb was that doubling cache size should increase the cache hit rate by 10%.

Second, I don't understand why at 0MB of cache the cache hit rate is ~25%. From my perspective, at 0MB of cache the hit rate should be 0%. Seems like a bug in the model, but that aside, the rest of the curve is very interesting.

Somewhere around 500MB of cache there is a step function where the cache hit rate goes from ~30% to ~50%. This is probably some sort of DB index that has been moved into cache, whose accesses have now become cache hits.

As for the rule of thumb, going from 500MB to 1000MB doesn't seem to do much, maybe increasing the cache hit ratio by a few percent. And doubling it again (to 2000MB) only seems to get you another percent or two.

But moving to a 2300MB cache gets you over an 80% cache hit rate. I would have to say the rule of thumb doesn't work well for this workload.

Not sure what the step up really represents from the OLTP workload perspective, but at an 80% cache hit rate, most of the more frequently accessed database tables must now reside in cache. Below this cache size (<2300MB), all of those tables apparently just didn't fit in cache; as one was being accessed and moved into cache, another was being pushed out, causing a read miss the next time it was accessed. At or above this cache size (>=2300MB), all these frequently accessed tables can remain in cache, resulting in the ~80% cache hit rate seen on the chart.

Irfan said that they do not display this chart in the CloudPhysics solution but rather display the inflection points. That is, their solution would say something like: at 500MB of SSD the traced application should see a ~50% cache hit rate, and at 2300MB of SSD the application should generate ~80% cache hits. This nets it out for the customer but hides the curve above and the underlying complexity.

Caching models & application working sets …

With CloudPhysics' SSD trace simulation Card (caching model) and the ongoing lightweight IO trace collection available with their service, any VM's working set can be understood at this fine level of granularity. The advantage of CloudPhysics is that with these tools, one can determine the optimum sized cache required to generate a given level of cache hits.

I would add some cautions to the above:

  • The results shown here are based on a CloudPhysics SSD caching model. Not all SSDs cache in the same way, and there can be quite a lot of sophistication in caching algorithms (having worked on a few in my time). So although this may show the hit rate for a simplistic SSD cache, it could easily under- or over-estimate real cache hit rates, perhaps by a significant amount. The only way to validate CloudPhysics' SSD simulation model is to put a physical cache in at the appropriate size and measure the VM's cache hit rate.
  • Real caching algorithms have a number of internal parameters which can impact cache hit rates, not the least of which is the size of the IO block being cached. This can be (commonly) fixed or (rarely) variable in length. But there are plenty of other parameters which can adversely impact cache hit rates for differing workloads as well.
  • Real caches have a warm-up period. During this time the cache is filling up with tracks which may never be referenced again. Some warm-up periods take minutes, while some I have seen take weeks or longer. The simulation here covers 24 hours only; it's unclear how the hit rate would be impacted if the trace/simulation covered a longer or shorter period (see the sketch after this list).
  • Caching IO activity can introduce positive (or negative) feedback into an application's IO stream. If, without a cache, an index IO took, let's say, 10 msec to complete, and with an appropriately sized cache it takes 10 μsec, the application users are going to complete more transactions, faster. As this takes place, the database IO activity will change from what it looked like without any caching. Also, even the non-cache-hits should see some speedup, because the amount of IO issued to the backend storage is reduced substantially. At some point this all reaches some sort of stasis and we have an ongoing cache hit rate. But the key point is that it's unlikely to exactly match what a trace-and-model prediction says. Adding cache to any application environment has effects which are chaotic in nature and inherently difficult to model.
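On the warm-up point in particular, one simple mitigation in a trace-driven model is to fill the cache on the first part of the trace and score hits only on the remainder. Something like this sketch, which is my assumption of how one might do it, not CloudPhysics' method:

```python
# Warm the cache on the first part of the trace; measure hit rate only
# on the post-warm-up portion.
from collections import OrderedDict

def steady_state_hit_rate(trace, cache_blocks, warmup_fraction=0.5):
    split = int(len(trace) * warmup_fraction)
    cache, hits = OrderedDict(), 0
    for i, block in enumerate(trace):
        hit = block in cache
        if hit:
            cache.move_to_end(block)        # refresh LRU position
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict least recently used
        if i >= split and hit:              # only score hits after warm-up
            hits += 1
    return hits / max(1, len(trace) - split)
```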

Nonetheless, I like what I see here. I believe it would be useful to understand a bit more about the CloudPhysics caching model's algorithm, the size of the application database being traced here, and how well their predictions actually match up to physical caches at the sizes recommended.

… the bottom line

Given what I know about caching in the real world, my suggestion is to take the cache sizes recommended here as a bottom-end estimate and the cache hit predictions as a top-end estimate of what could be obtained with real SSD caches. I would increase the cache size recommendations somewhat and expect something less than the cache hits they predict.

In any case, having application (even VM) IO traces like this that can be accessed and used to drive caching simulation models should be a great boon to storage developers everywhere. I can only hope that server-side SSD and caching storage vendors supply their own proprietary cache model cards to go with CloudPhysics' Cards, so that potential customers could use their application traces with the vendor cards to predict what the hardware can do for an application.

If you want to learn more about block storage performance from SMB to enterprise class SAN storage systems, please check out our SAN Buying Guide, available for purchase on our website. Also, we report each month on storage performance results from SPC, SPECsfs, and ESRP in our free newsletter. If you would like to subscribe to this, please use the signup form above right.

~~~~

Comments?

Image:  Chart courtesy of and use approved by CloudPhysics

HP Tech Day – StoreServ Flash Optimizations

Attended HP Tech Field Day late last month in Disneyland. Must say the venue was the best ever for HP, and getting in on the Nth Generation Conference was a plus. Sorry it has taken so long for me to get around to writing about it.

We spent a day going over HP's new converged storage, software-defined storage and other storage topics. HP has segmented Software Defined Data Center (SDDC) storage requirements into cost-optimized Software Defined Storage and SLA-optimized Service Refined Storage. Under Software Defined Storage they talked about their StoreVirtual product line, an outgrowth of the LeftHand Networks VSA first introduced in 2007. This June, they extended SDS to include their StoreOnce VSA product to go after SMB and ROBO backup storage requirements.

We also discussed some of HP's work to integrate current HP block storage into OpenStack Cinder, as well as the integrations they plan for file and object storage.

However, what I mostly want to discuss in this post is the session on how HP StoreServ 3PAR has optimized their storage system for flash.

They showed an SPC-1 chart depicting various storage systems' IOPS levels and response times as they ramped from 10% to 100% of their IOPS rate. StoreServ 3PAR's latest entry showed a considerable band of IOPS (25K to over 250K) all within a sub-msec response time range, which was pretty impressive since, at the time, no other storage system seemed able to do this across its whole range of IOPS. (A more recent SPC-1 result from HDS, an all-flash VSP with Hitachi Accelerated Flash, was also able to accomplish this [sub-msec response time throughout the whole benchmark], only in their case it reached over 600K IOPS – read about this in our latest performance report in our newsletter; sign up above right.) Among the flash optimizations they discussed were:

  • Adaptive Read – As I understood it, this changes the size of backend reads to match the size requested by the front end. For disk systems, one often sees that a host read of, say, 4KB causes a read of 16KB from the backend, on the assumption that the host will request additional data after the block read off of disk, and because 90% of the time spent doing a disk read goes to getting the head to the correct track; once there, it takes almost no extra effort to read more data. With flash, however, there is no effort to get to the proper location to read a block of data, and as such there is no advantage to reading more data than the host requests: if the host comes back for more, one can immediately read from the flash again.
  • Adaptive Write – Similar to adaptive read, adaptive write writes only the changed data to flash. So if a host writes a 4KB block, then only 4KB is written to flash. This doesn't help much for RAID 5 because of parity updates, but for RAID 1 (mirroring) it saves on flash writes, which ultimately lengthens flash life.
  • Adaptive Offload (destage) – This changes the frequency of destaging or flushing cache depending on the level of write activity (a sketch of this idea follows the list). Slower destaging allows written (dirty) data to accumulate in cache when there's not much write activity going on, which means that for RAID 5 parity may not need to be updated as often, as one could potentially accumulate a whole stripe's worth of data in cache. In low-activity situations such destaging could occur every 200 msec, whereas with high write activity destaging could occur as often as every 3 msec.
  • Multi-tenant IO processing – For disk drives doing sequential reads, one wants the largest stripes possible (due to the head positioning penalty), but for SSDs one wants the smallest stripe sizes possible. The other problem with large stripe sizes is that devices stay busy for the duration of the longer IO while performing the stripe writes (and reads). StoreServ modified the stripe size for SSDs to be 32KB so that other IO activity doesn't have to wait as long for its turn in the (IO device) queue. The other advantage shows up during SSD rebuilds: with a 32KB stripe size one can intersperse more IO activity on the devices involved in the rebuild without impacting rebuild performance.
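To make the Adaptive Offload idea concrete, here's a hypothetical interpolation between the two destage intervals quoted above. HP didn't describe their actual policy, so the linear scaling and the dirty-fraction trigger are my guesses:

```python
# Hypothetical adaptive destage policy: scale the cache-flush interval
# between 200 msec (idle) and 3 msec (write-heavy) based on how much of
# the cache holds dirty (unwritten) data.
def destage_interval_msec(dirty_fraction, slow=200.0, fast=3.0):
    dirty_fraction = min(max(dirty_fraction, 0.0), 1.0)
    # low write activity -> long interval, letting full RAID-5 stripes accumulate
    # heavy write activity -> flush nearly as often as every 3 msec
    return slow - (slow - fast) * dirty_fraction

# destage_interval_msec(0.05) -> ~190 msec; destage_interval_msec(0.95) -> ~13 msec
```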

Of course, the other major advantage that HP StoreServ's 3PAR architecture provides for flash is its intrinsic wide striping across a storage pool. This way all the SSDs can be used optimally and equally to service customer IOs.

I am certain there were other optimizations HP made to support SSDs in StoreServ storage, but these are the ones they were willing to talk publicly about.

No mention of when Memristor SSDs will be available, but stay tuned; HP let slip that sooner or later Memristor storage will be in HP storage and servers.

Comments?

Photo Credits: (c) 2013 Silverton Consulting, Inc

Has latency become the key metric? SPC-1 LRT results – chart of the month

I was at EMCworld a couple of months back, where they were showing off a preview of the next version of VNX storage, which was trying to achieve a million IOPS with under a millisecond of latency. Then I attended NetApp's analyst summit, where the discussion at their flash seminar was about how latency was changing the landscape of data storage and how flash latencies were going to enable totally new applications.

One executive at NetApp mentioned that IOPS was never the real problem. As an example, he cited one large oil & gas firm whose peak demand was just 35K IOPS.

Also, there was some discussion at NetApp of trying to come up with a way of segmenting customer applications by latency requirements. Aside from high-frequency trading, online payment processing and a few other high-performance database activities, there wasn't a lot that could easily be identified/quantified today.

IO latencies have been coming down for years now. Sophisticated disk-only storage systems have been lowering latencies for a decade or more. But since the introduction of SSDs it's been a whole new ballgame. For proof, all one has to do is examine the top 10 SPC-1 LRT (least response time, measured with workloads at 10% of peak activity) results.

[Figure: Top 10 SPC-1 LRT results – SSD system response times]

 

In looking over the top 10 SPC-1 LRT benchmarks (see figure above) one can see a general pattern. These systems mostly use SSD or flash storage, except for the TMS-400, TMS 320 (IBM FlashSystems) and Kaminario's K2-D, which primarily use DRAM for storage, with backup storage behind it.

Hybrid disk-flash systems seem to start at an LRT of around 0.9 msec (not on the chart above). These can be found from DotHill, NetApp, and IBM.

Similarly, you almost have to get to as "slow" as 0.93 msec before you can find any disk-only storage systems. But most disk-only storage comes in with a latency of 1 msec or more. Between 1 and 2 msec LRT we see storage from EMC, HDS, HP, Fujitsu, IBM, NetApp and others.

There was a time when the storage world was convinced that to get really good response times you had to have a purpose-built storage system like TMS or Kaminario, or stripped-down functionality like IBM's Power 595. But it seems that the general-purpose HDS HUS, IBM Storwize, and even Huawei OceanStor are all capable of providing excellent latencies with all-SSD storage behind them. And all seem to perform at least in the same ballpark as the purpose-built TMS RamSan-620 SSD storage system. These general-purpose storage systems have just about every advanced feature imaginable, with the exception of mainframe attach.

It seems nowadays that there is a trifurcation of latency results going on, based on underlying storage:

  • DRAM-only systems at ~0.4 msec down to ~0.1 msec.
  • SSD/flash-only storage at 0.7 msec down to 0.2 msec.
  • Disk-only storage at 0.93 msec and above.

The hybrid storage systems are attempting to mix the economics of disk with the speed of flash, and seem to be contending with all these single-technology storage solutions.

It's a new IO latency world today. SSD-only storage systems are now available from every major storage vendor, and many of them are showing pretty impressive latencies. Now, with fully functional storage latency below 0.5 msec, what's the next hurdle for IT?

Comments?

Image: EAB 2006 by TMWolf

 


Windows Server 2012 R2 storage changes announced at TechEd

Microsoft TechEd USA is this week, and they announced a number of changes to the storage services that come with Windows Server 2012 R2:

  • Azure DRaaS – Microsoft is attempting to democratize DR by supporting a new DR-as-a-Service (DRaaS). They now have an Azure service that operates in conjunction with Windows Server 2012 R2 to provide orchestration and automation for DR site failover and failback to/from remote sites. Windows Server 2012 R2 uses Hyper-V Replica to replicate data across to the other site. Azure DRaaS supports DR plans (scripts) that identify groups of Hyper-V VMs to be brought up and their sequencing: VMs within a script group are brought up in parallel, but different groups are brought up in sequence. You can have multiple DR plans; just select the one to execute. You must have access to Azure to use this service. Azure DR plans can pause for manual activities and can invoke PowerShell scripts for more fine-tuned control. There's also quite a lot of setup that must be done, e.g., configuring Hyper-V hosts, VMs and networking at both primary and secondary locations. Network IP injection is done by mapping primary to secondary site IP addresses. Azure DRaaS really just provides the orchestration of failover or failback activity. Moreover, it looks like Azure DRaaS is going to be offered by service providers as well as used by private companies. Currently, Azure DRaaS has no support for SAN/NAS replication, but they are working with vendors to supply an SRM-like API to provide this.
  • Hyper-V Replica changes – Replica support has been changed from a single fixed asynchronous replication interval (5 minutes) to a choice of three intervals: 15 seconds, 5 minutes, or 30 minutes.
  • Storage Spaces Automatic Tiering – With SSDs and regular rotating disks in your DAS (or JBOD) configuration, Windows Server 2012 R2 supports automatic storage tiering. At Spaces configuration time one dedicates a certain portion of SSD storage to tiering. A scheduled Windows Server 2012 task then scans the previous period's file activity and identifies which file segments (1MB in size) should be on SSD and which should not. Over time, file segments are moved to the appropriate tier and performance should improve. This only applies to file data, and files can be pinned to a particular tier for more fine-grained control. (A sketch of this sort of segment-heat scan follows this list.)
  • Storage Spaces Write-Back cache – Another alternative is to dedicate a certain portion of the SSDs in a Space to write caching. When enabled, writes to a Space will be cached first in SSD and then destaged out to rotating disk. This should speed up write performance. Both write-back cache and storage tiering can be enabled for the same Space, but your SSD storage must be partitioned between the two. Something about funneling all write activity to SSDs just doesn't make sense to me?!
  • Storage Spaces dual parity – Spaces previously supported mirrored storage and single parity, but now also offers dual parity for DAS. It's sort of like RAID 6 in protection, though they didn't mention the word RAID at all. Spaces dual parity does have a write penalty (parity updates), and Microsoft suggests using it only for archive or read-heavy IO.
  • SMB3.1 performance improvements of ~50% – SMB has been on a roll lately and R2 is no exception. Microsoft indicated that SMB Direct using a RAM disk as backend storage can sustain up to a million 8KB IOPS. Also, with an all-flash JBOD using mirrored Spaces as backend storage, SMB3.1 can sustain ~600K IOPS. Presumably these were all read IOPS.
  • SMB3.1 logging improvements – Changes were made to SMB3.1 event logging to try to eliminate the need for detailed tracing to support debugging. This is an ongoing activity, but one which is starting to bear fruit.
  • SMB3.1 CSV performance rebalancing – Now as one adds cluster nodes,  Cluster Shared Volume (CSV) control nodes will spread out across new nodes in order to balance CSV IO across the whole cluster.
  • SMB1 stack can (finally) be fully removed – If you are running Windows Server 2012 R2, you no longer need to install the SMB1 stack; it can be completely removed. Of course, if you have some down-level servers or clients you may want to keep SMB1 around a bit longer, but it's no longer required for Server 2012 R2.
  • Hyper-V Live Migration changes – Live Migration can now take advantage of SMB Direct and SMB3's support of RDMA/RoCE to radically speed up data center live migration. Also, Live Migration can now optionally compress the data on the current Hyper-V host, send the compressed data across the LAN, and decompress it at the target host. So with R2 you have three options for VM Live Migration: traditional, SMB Direct, or compressed.
  • Hyper-V IO limits – Hyper-V hosts can now limit the amount of IOPS consumed by each VM. This can be controlled hierarchically, providing increased flexibility. For example, one can identify a group of VMs and set an IO limit for the whole group, but each individual VM can also have its own IO limit, and the group limit can be smaller than the sum of the individual VM limits.
  • Hyper-V supports VSS backup for Linux VMs – Windows Server 2012 R2 has now added support for non-application consistent VSS backups for Linux VMs.
  • Hyper-V Replica Cascade Replication – In Windows Server 2012, Hyper-V replicas could be copied from one data center to another. Now with R2, those replicas at a secondary site can be copied to a third site, cascading the replication from the first to the second and then to the third data center, each with its own replication schedule.
  • Hyper-V VHDX file resizing – With Windows Server 2012 R2 VHDX file sizes can now be increased or reduced for both data and boot volumes.
  • Hyper-V backup changes – In previous generations of Windows Server, Hyper-V backups took two distinct snapshots, one instantaneous and the other at quiesce time, which were then merged to create a "crash consistent" backup. With R2, VM backups take only a single snapshot, reducing overhead and increasing backup throughput substantially.
  • NVMe support – Windows Server 2012 R2 now ships with a Non-Volatile Memory Express (NVMe) driver for PCIe flash storage. R2's new NVMe driver has been tuned for low latency and high bandwidth and can be used with non-clustered Storage Spaces to improve write performance (in a Spaces write-back cache?).
  • CSV memory read-cache – Windows Server 2012 R2 can be configured to set aside some host memory for a CSV read cache.  This is different than the Spaces Write-Back cache.  CSV caching would operate in conjunction with any other caching done at the host OS or elsewhere.
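As a concrete illustration of the Storage Spaces tiering scan mentioned above, here is a hedged sketch of a segment-heat pass: count accesses per 1MB file segment over the scan period, then assign the hottest segments to SSD up to the tier's capacity, honoring pinned files. This is my own reconstruction, not Microsoft's implementation:

```python
# Hypothetical segment-heat tiering plan: hottest 1MB segments go to the
# SSD tier, everything else to rotating disk.
from collections import Counter

def plan_tiering(access_log, ssd_capacity_segments, pinned_to_ssd=()):
    heat = Counter(access_log)                    # access_log: iterable of segment ids
    plan = {seg: "ssd" for seg in pinned_to_ssd}  # pinned segments stay on SSD
    budget = ssd_capacity_segments - len(plan)
    for seg, _count in heat.most_common():        # hottest segments first
        if seg in plan:
            continue
        if budget > 0:
            plan[seg] = "ssd"
            budget -= 1
        else:
            plan[seg] = "hdd"
    return plan

# e.g. plan_tiering([7, 7, 7, 3, 3, 9], ssd_capacity_segments=2)
#      -> {7: 'ssd', 3: 'ssd', 9: 'hdd'}
```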

That’s about it. Some of the MVPs had a preview of R2 up in Redmond, but all of this was to be announced in TechEd, New Orleans, this week.

~~~~

Image: Microsoft TechEd by BetsyWeber





EMCworld 2013 Day 3

Rich Napolitano, President of the Unified Storage Division, got up and showed some technology demonstrations of what they had working in their labs, with some of his long-time engineers up on the stage to run them.

  • First up was a dual-controller system with dual processors per controller and 8-core processing chips (32 cores in all) running against an all-SSD backend. The configuration was only up for a short time, but it appeared to be 96 SSDs, i.e., an all-flash VNX array. They used Iometer with random 8KB IOs to drive almost 975K IOPS at sub-msec response time, and hit 1M IOPS at just slightly above 1 msec response time. You could see the processor utilization of the 32 cores going up as the workload reached higher levels. I couldn't see precisely, but all the cores were running at ~70-80% busy at the 1M IOPS level, and it seemed like system performance was entering the knee of the curve.
  • Next up was the new VNX data app store demonstration, similar to the iPhone and Android app stores. EMC has identified a select set of apps that can run directly on VNX hardware. The current demonstration had two versions of anti-virus, RecoverPoint Virtual Appliance (vRPA), (v?)VPLEX, CloudAccess and MySQL server. The engineers showed how AV software could be installed and running on the VNX, as well as how vRPA could be installed and provide onboard replication services.
  • Then they demonstrated a VNX virtual appliance (vVNX?) running on a white-box server, which I think was running ESX. In this case, vVNX was running with onboard DAS storage but had all the advanced functionality of VNX.
  • Finally, they showed a vVNX running in a cloud services environment. Not sure if this was VMware vCloud or some other compute cloud, but Rich stated that they will support many clouds. With vVNX running in the cloud, accessing storage behind the compute engine, it's unclear what the performance would be and how one would access the storage (file or iSCSI no doubt), but it did open up new possibilities as to where one could run VNX services.

It's readily apparent that the next iteration of VNX software is focused on taking advantage of multi-core processing (called MCx) to boost storage system performance, on providing a virtualized environment within the VNX engine to run specialized data services, and on supplying new vVNX functionality which can be deployed just about anywhere you would want.

That's all for the public sessions; I spent much of the rest of the day in NDA sessions.

I had a good time at EMCworld 2013, seeing old friends again and meeting new ones and thank EMC for inviting me.  For information on previous days at EMCworld 2013 please see my Day 1 and Day 2 posts.