Surprises in disk reliability from Microsoft’s “free cooled” datacenters

At Usenix ATC’16 last week, there was a “best of the rest” session which repeated selected papers presented at FAST’16 earlier this year. One that caught my interest discussed disk reliability in free-cooled datacenters at Microsoft (Environmental conditions and disk reliability in free-cooled datacenters, see pp. 53-66).

The paper examines disk reliability for over 1M drives across 9 different Microsoft datacenters, over periods of 1.5 to 4 years, as a function of how those datacenters were cooled.

Windows Server 2012 R2 storage changes announced at TechEd

Microsoft TechEd USA is this week and they announced a number of changes to the storage services that come with Windows Server 2012 R2:

  • Azure DRaaS – Microsoft is attempting to democratize DR by supporting a new DR-as-a-Service (DRaaS). They now have an Azure service that operates in conjunction with Windows Server 2012 R2, providing orchestration and automation for DR site failover and failback to/from remote sites. Windows Server 2012 R2 uses Hyper-V Replica to replicate data across to the other site. Azure DRaaS supports DR plans (scripts) which identify groups of Hyper-V VMs to be brought up and their sequencing. VMs within a script group are brought up in parallel, but different groups are brought up in sequence (see the first sketch following this list). You can have multiple DR plans; just select the one to execute. You must have access to Azure to use this service. Azure DR plans can pause for manual activities and can invoke PowerShell scripts for more fine-tuned control. There’s also quite a lot of setup that must be done, e.g., configuring Hyper-V hosts, VMs and networking at both primary and secondary locations. Network IP injection is done via mapping primary-site to secondary-site IP addresses. Azure DRaaS really just provides the orchestration of failover or failback activity. Moreover, it looks like Azure DRaaS is going to be offered by service providers as well as used by private companies. Currently, Azure DRaaS has no support for SAN/NAS replication, but Microsoft is working with vendors to supply an SRM-like API to provide this.
  • Hyper-V Replica changes – Replica support has been changed from a single, fixed asynchronous replication interval (5 minutes) to a choice of three intervals: 15 seconds, 5 minutes, or 30 minutes.
  • Storage Spaces automatic tiering – With SSDs and regular rotating disks in your DAS (or JBOD) configuration, Windows Server 2012 R2 supports automatic storage tiering. At Spaces configuration time, one dedicates a portion of SSD storage to tiering. A scheduled Windows Server 2012 R2 task then scans the previous period’s file activity to identify which file segments (1MB in size) should be on SSD and which should not. Over time, file segments are moved to the appropriate tier and performance should improve (see the second sketch following this list). This only applies to file data, and files can be pinned to a particular tier for more fine-grained control.
  • Storage Spaces write-back cache – Another alternative is to dedicate a portion of the SSDs in a Space to write caching. When enabled, writes to a Space are cached first in SSD and then destaged out to rotating disk. This should speed up write performance. Both the write-back cache and storage tiering can be enabled for the same Space, but your SSD storage must be partitioned between the two. Something about funneling all write activity to SSDs just doesn’t make sense to me?!
  • Storage Spaces dual parity – Spaces previously supported mirrored storage and single parity, but now also offers dual parity for DAS. It’s RAID6-like in protection, though they didn’t mention the word RAID at all. Spaces dual parity does have a write penalty (parity update), and Microsoft suggests using it only for archive or read-heavy IO workloads.
  • SMB3.1 performance improvements of ~50% – SMB has been on a roll lately and R2 is no exception. Microsoft indicated that SMB Direct using a RAM disk as backend storage can sustain up to a million 8KB IOPS. Also, with an all-flash JBOD using mirrored Spaces as backend storage, SMB3.1 can sustain ~600K IOPS. Presumably these were all read IOPS.
  • SMB3.1 logging improvements – Changes were made to SMB3.1 event logging to try to eliminate the need for detailed tracing to support debugging. This is an ongoing activity, but one which is starting to bear fruit.
  • SMB3.1 CSV performance rebalancing – Now as one adds cluster nodes, Cluster Shared Volume (CSV) control nodes spread out across the new nodes to balance CSV IO across the whole cluster.
  • SMB1 stack can be (finally) fully removed – If you are running Windows Server 2012 R2, you no longer need to install the SMB1 stack; it can be completely removed. Of course, if you have some down-level servers or clients you may want to keep SMB1 around a bit longer, but it’s no longer required for Server 2012 R2.
  • Hyper-V Live Migration changes – Live Migration can now take advantage of SMB Direct and SMB3’s support of RDMA/RoCE to radically speed up data center live migration. Also, Live Migration can now optionally compress the data on the current Hyper-V host, send the compressed data across the LAN, and decompress it at the target host. So with R2 you have three options for VM Live Migration: traditional, SMB Direct, or compressed.
  • Hyper-V IO limits – Hyper-V hosts can now limit the amount of IOPS consumed by each VM. This can be controlled hierarchically, providing increased flexibility. For example, one can identify a group of VMs and set an IO limit for the whole group, but each individual VM can also have its own IO limit, and the group limit can be smaller than the sum of the individual VM limits (see the third sketch following this list).
  • Hyper-V supports VSS backup for Linux VMs – Windows Server 2012 R2 now adds support for non-application-consistent VSS backups of Linux VMs.
  • Hyper-V Replica cascade replication – In Windows Server 2012, Hyper-V replicas could be copied from one data center to another. Now with R2, replicas at a secondary site can be copied to a third site, cascading the replication from the first to the second and then to the third data center, each hop with its own replication schedule.
  • Hyper-V VHDX file resizing – With Windows Server 2012 R2, VHDX files can now be increased or reduced in size for both data and boot volumes.
  • Hyper-V backup changes – In previous generations of Windows Server, Hyper-V backups took two distinct snapshots, one instantaneous and one at quiesce time, which were then merged to create a “crash consistent” backup. With R2, VM backups take only a single snapshot, reducing overhead and increasing backup throughput substantially.
  • NVMe support – Windows Server 2012 R2 now ships with a Non-Volatile Memory Express (NVMe) driver for PCIe flash storage. R2’s new NVMe driver has been tuned for low latency and high bandwidth and can be used with non-clustered Storage Spaces to improve write performance (in a Spaces write-back cache?).
  • CSV memory read-cache – Windows Server 2012 R2 can be configured to set aside some host memory for a CSV read cache. This is different from the Spaces write-back cache. CSV caching operates in conjunction with any other caching done at the host OS or elsewhere.
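
To make the DR-plan sequencing above concrete, here’s the first sketch: VMs within a script group start in parallel, while groups start strictly in sequence. Everything in it — group contents, VM names, the fail_over_vm function — is hypothetical and purely illustrative, not the actual Azure DRaaS API.

```python
# Illustrative sketch of DR-plan ordering: VMs within a script group
# start in parallel; groups start strictly in sequence. All names and
# the fail_over_vm function are hypothetical -- not the Azure DRaaS API.
from concurrent.futures import ThreadPoolExecutor
import time

def fail_over_vm(vm: str) -> None:
    """Stand-in for bringing up a replica VM at the secondary site."""
    print(f"starting {vm}")
    time.sleep(0.1)            # simulate the bring-up work
    print(f"{vm} is up")

dr_plan = [                    # ordered list of script groups
    ["dc-vm", "dns-vm"],       # group 1: infrastructure first
    ["sql-vm1", "sql-vm2"],    # group 2: databases next
    ["app-vm1", "app-vm2"],    # group 3: app tier last
]

for group in dr_plan:                   # groups run in sequence
    with ThreadPoolExecutor() as pool:  # VMs in a group run in parallel
        list(pool.map(fail_over_vm, group))
```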
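
The second sketch covers Storage Spaces tiering. Conceptually, the scheduled task amounts to ranking 1MB file segments by the last period’s access activity and filling the SSD tier with the hottest ones. The data structures below are my own illustration, not Microsoft’s actual heat-map implementation.

```python
# Rough sketch of Spaces-style tier placement: rank 1MB file segments
# by last period's access count and fill the SSD tier with the hottest.
# Data structures are illustrative only -- not Microsoft's heat map.
SEGMENT_MB = 1

def plan_tiers(heat_map: dict[str, int], ssd_capacity_mb: int):
    """heat_map maps a segment id to its access count last period."""
    ranked = sorted(heat_map, key=heat_map.get, reverse=True)
    ssd_slots = ssd_capacity_mb // SEGMENT_MB
    return ranked[:ssd_slots], ranked[ssd_slots:]   # (ssd, hdd) lists

ssd, hdd = plan_tiers({"f1:0": 90, "f1:1": 2, "f2:0": 47},
                      ssd_capacity_mb=2)
print("SSD tier:", ssd)   # ['f1:0', 'f2:0'] -- the hottest segments
print("HDD tier:", hdd)   # ['f1:1'] -- everything else
```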
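
And the third sketch, on hierarchical IO limits: an IO is admitted only if both the VM’s own cap and its group’s cap have headroom, which is why a 1,000 IOPS group cap can bind even when each of three VMs is individually allowed 500. The numbers and admission logic are mine, not Hyper-V’s implementation.

```python
# Sketch of a hierarchical IOPS admission check: an IO is admitted only
# if both the VM's own limit and its group's limit have headroom.
# Limits and logic are my own illustration, not Hyper-V's.
vm_limit = {"vm1": 500, "vm2": 500, "vm3": 500}  # per-VM IOPS caps
group_limit = 1000            # group cap < sum of member caps (1500)
vm_used = {vm: 0 for vm in vm_limit}             # IOs issued this interval

def admit(vm: str) -> bool:
    if vm_used[vm] >= vm_limit[vm]:
        return False          # VM is over its own cap
    if sum(vm_used.values()) >= group_limit:
        return False          # the group as a whole is over its cap
    vm_used[vm] += 1
    return True

# Any one VM can reach its 500-IOPS cap, but the three together can
# never exceed 1,000 IOPS in the same interval.
```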

That’s about it. Some of the MVPs had a preview of R2 up in Redmond, but all of this was to be announced at TechEd in New Orleans this week.

~~~~

Image: Microsoft TechEd by BetsyWeber





Fall SNWUSA wrap-up

Attended SNWUSA this week in San Jose. It’s hard to see the show gradually change when you attend each one, but it does seem that end-user content and attendance are increasing proportionally. This should bode well for future SNWs. There have always been a good number of end users at the show, but in the past the bulk of the attendees were from storage vendors.

Another large storage vendor dropped their sponsorship: HDS no longer sponsors the show, and the last large vendor still standing is HP. Some of this is cyclical; perhaps the large vendors will come back for the spring show, next year in Orlando, FL. But EMC, NetApp and IBM seem to have pretty much dropped sponsorship for at least the last couple of shows.

SSD startup of the show

Skyhawk hardware (c) 2012 Skyera, all rights reserved (from their website)

The best new SSD startup at the show had to be Skyera: a 48TB raw flash, dual controller system supporting the iSCSI block protocol and using real commercial-grade MLC. The team at Skyera seems to be all ex-SandForce executives and technical people.

Skyera’s team has designed a 1U box called the Skyhawk, with a phalanx of NAND chips, their own controller(s) and other logic as well. They support software compression and deduplication, as well as specially designed RAID logic that they claim reduces extraneous writes (write amplification) to something just over 1, with RAID 6, dual-drive-failure-equivalent protection.
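
For context on why “just over 1” is impressive: a conventional RAID-6 small write rewrites the data strip plus both P and Q parity, roughly 3 physical writes per user write, whereas a log-structured layout doing only full-stripe writes approaches (k+2)/k writes per user write across k data strips. A quick back-of-envelope (my arithmetic, not Skyera’s published numbers):

```python
# Back-of-envelope write amplification for RAID-6 style protection.
# My arithmetic, not Skyera's published figures.

# Conventional RAID-6 small write: read-modify-write of the data strip
# plus P and Q parity = 3 physical writes (and 3 reads) per user write.
small_write_amp = 3

# Log-structured full-stripe writes over k data strips + 2 parity strips:
def full_stripe_amp(k: int) -> float:
    return (k + 2) / k

for k in (8, 16, 32):
    print(f"k={k:2d} data strips -> write amp {full_stripe_amp(k):.2f}")
# k=32 gives ~1.06 -- "just over 1", consistent with Skyera's claim.
```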

Skyera’s underlying belief is that just as consumer HDAs took over from the big monster 14″ and 11″ disk drives in the ’90s, sooner or later commercial NAND will take over from eMLC and SLC. And if one elects to stay with eMLC and SLC technology, one is destined to be one to two technology nodes behind. That is, commercial MLC (in USB sticks, SD cards, etc.) is currently manufactured with 19nm technology, while eMLC and SLC NAND technology is back at 24 or 25nm. But 80-90% of the NAND market is being driven by commercial MLC NAND. Skyera launched this past August.

Coming in second place was Arkologic, an all-flash NAS box using SSD drives from multiple vendors. In their case, a fully populated rack holds about 192TB (raw?) with an active-passive controller configuration. The main concern I have with this product is that all their metadata is held in UPS-backed DRAM (??); they have up to 128GB of DRAM in the controller.

Arkologic’s main differentiation is supporting QoS on a file system basis, plus some connection with a NIC vendor that can provide end-to-end QoS. The other thing they have is a new RAID-AS, specially designed for flash.

I just hope their UPS is pretty hefty and they don’t sell it someplace where power is very flaky, because when that UPS gives out, kiss your data goodbye: your metadata is held nowhere else – at least that’s what they told me.

Cloud storage startup of the show

There was more cloud stuff going on at the show; I talked to at least three or four cloud gateway providers. But the cloud startup of the show had to be Egnyte. They supply storage services that span cloud storage and on-premises storage, with an in-band or out-of-band solution, and provide file synchronization services for file sharing across multiple locations. They have some hooks into NetApp and other major storage vendor products that allow them to be out-of-band for those environments, but they would need to be in-band for other storage systems. It seems an interesting solution that, if successful, may help accelerate the adoption of cloud storage in the enterprise, as it makes it transparent whether the storage you access is local or in the cloud. How they deal with the response time differences is another question.

Different idea startup of the show

The new technology showplace had a bunch of vendors, some I had never heard of before, but one that caught my eye was Actifio. They were at VMworld but I never got time to stop by. They seem to be taking another shot at storage virtualization. Only in this case, rather than focusing on non-disruptive file migration, they are taking on the task of doing a better job of point-in-time copies for iSCSI and FC attached storage.

I assume they sit in the middle of the data path in order to do this, and they seem to be using copy-on-write technology for point-in-time snapshots. Not sure where this fits, but I suspect SME and maybe up to mid-range.
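
For those unfamiliar with the technique, copy-on-write preserves a block’s original contents the first time it’s overwritten after a snapshot, so the snapshot remains a consistent point-in-time view. A toy sketch of the idea (purely illustrative, not Actifio’s implementation):

```python
# Toy copy-on-write volume: the first overwrite of a block after a
# snapshot saves the old contents, keeping the snapshot point-in-time.
# Purely illustrative -- not Actifio's implementation.
class CowVolume:
    def __init__(self):
        self.blocks: dict[int, bytes] = {}        # live data by LBA
        self.snapshots: list[dict[int, bytes]] = []

    def snapshot(self) -> int:
        self.snapshots.append({})   # empty: nothing has diverged yet
        return len(self.snapshots) - 1

    def write(self, lba: int, data: bytes) -> None:
        for snap in self.snapshots:
            if lba not in snap:     # first overwrite since that snap
                snap[lba] = self.blocks.get(lba, b"")
        self.blocks[lba] = data

    def read_snapshot(self, snap_id: int, lba: int) -> bytes:
        snap = self.snapshots[snap_id]
        return snap.get(lba, self.blocks.get(lba, b""))

vol = CowVolume()
vol.write(0, b"v1")
s = vol.snapshot()
vol.write(0, b"v2")                 # triggers the copy of b"v1"
print(vol.read_snapshot(s, 0))      # b'v1' -- the point-in-time view
```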

Most enterprise vendors solved these problems a long time ago, but at the low end it’s a little more variable. I wish them luck: most customers use snapshots if their storage has them, but those that don’t seem unable to understand what they are missing. And then there’s the matter of being in the data path?!

~~~~

If there was a hybrid startup at the show, I must have missed them. Did talk with Nimble Storage and they seem to be firing on all cylinders. Maybe someday we can do a deep dive on their technology. Tintri was there as well in the new technology showcase; we talked with them earlier this year at Storage Tech Field Day.

The big news at the show was Microsoft purchasing StorSimple, a cloud storage gateway/cache. Apparently StorSimple did a majority of their business with Microsoft’s Azure cloud storage, so the deal seemed to make sense to everyone.

The SNIA suite was hopping as usual and the venue seemed to work well, although I would say the exhibit floor and lab area were a bit too big. But everything else seemed to work out fine.

On Wednesday, the CIO from Dish talked about what it took to completely transform their IT environment from a management and leadership perspective. It seemed like an awfully big risk, but they were able to pull it off.

All in all, SNW is still a great show to learn about storage technology, at least from an end-user perspective. I just wish more of the large vendors would return, but alas, that seems to be a dream for now.

New file system capacity tool – Microsoft’s FSCT

Filing System by BinaryApe (cc) (from Flickr)

Jose Barreto blogged about a recent Microsoft report on File Server Capacity Tool (FSCT) results (blog here, report here). As you may know, FSCT is a free tool from Microsoft, released in September 2009, that verifies an SMB (CIFS) and/or SMB2 storage server configuration.

FSCT can be used by anyone to verify that an SMB/SMB2 file server configuration can adequately support a particular number of users doing typical Microsoft Office/Windows Explorer work with home folders.

Jetstress for SMB file systems?

FSCT reminds me a little of Microsoft’s Jetstress tool used in the Exchange Solution Reviewed Program (ESRP), which I have discussed extensively in prior blog posts (search my blog) and other reports (search my website). Essentially, FSCT has a simulated “home folder” workload which can be dialed up or down by the number of users selected. As such, it can be used to validate any NAS system which supports the SMB/SMB2 or CIFS protocol.

Both Jetstress and FSCT are capacity verification tools. However, I look at all such tools as a way of measuring system performance for a solution environment, and FSCT is no exception.
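
Conceptually, a capacity verification run boils down to: simulate the workload at some user count, check whether the server keeps up, and search for the largest count that still passes. A sketch of that search (the passes_at function is a hypothetical stand-in for an actual FSCT run):

```python
# Conceptual sketch of capacity verification: binary-search the largest
# user count at which the simulated home-folder workload still passes.
# passes_at is a hypothetical stand-in for an actual FSCT run.
def passes_at(users: int) -> bool:
    """Pretend test: say the server keeps up through 11,300 users."""
    return users <= 11_300

def max_supported_users(lo: int = 1, hi: int = 50_000) -> int:
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if passes_at(mid):
            lo = mid          # workload still sustained; try higher
        else:
            hi = mid - 1      # server overloaded; back off
    return lo

print(max_supported_users())  # 11300
```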

Microsoft FSCT results

In Jose’s post on the report, he discusses performance for five different storage server configurations running anywhere from 4,500 to 23,000 active home directory users, employing white box servers running Windows (Storage) Server 2008 and 2008 R2 with various server hardware and SAS disk configurations.

Network throughput ranged from 114 to 650 MB/sec. Certainly respectable numbers, though somewhat orthogonal to the NFS and CIFS throughput (operations/second) reported by SPECsfs2008. It’s unclear if FSCT reports activity in operations/second.

Microsoft’s FSCT reports did not specifically state what the throughput was other than at the scenario level. I assume the network throughput that Jose reported was extracted concurrently with the test run, from something akin to Perfmon. FSCT seems to only report performance or throughput as the number of home folder scenarios sustainable per second and the number of users. Perhaps there is an easy way to convert user scenarios to network throughput?
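
One quick sanity check using the figures from Jose’s post: dividing the reported network throughput by user count gives the implied per-user bandwidth, and the two endpoints of the range come out remarkably close.

```python
# Implied per-user bandwidth from the figures in Jose's post:
# 114 MB/s at 4,500 users and 650 MB/s at 23,000 users.
for users, mb_per_s in ((4_500, 114), (23_000, 650)):
    per_user_kb = mb_per_s * 1024 / users
    print(f"{users:6,d} users @ {mb_per_s} MB/s -> "
          f"{per_user_kb:.0f} KB/s per user")
# Both configurations land near ~26-29 KB/s per user, so user counts
# and network throughput appear roughly proportional.
```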

While the results for the file server runs look interesting, I always want more. For whatever reason, I have lately become enamored with ESRP’s log playback results (see my latest ESRP blog post), and it’s not clear whether FSCT reports anything similar. Something like simulated file server backup performance would suffice from my perspective.

—-

Despite that, another performance tool is always of interest and I am sure my readers will want to take a look as well.  The current FSCT tester can be downloaded here.

Not sure whether Microsoft will be posting vendor results for FSCT similar to what they do for Jetstress via ESRP, but that would be a great next step. Getting the vendors onboard is another problem entirely. SPECsfs2008 took almost a year to get the first 12 (NFS) submissions, and today, almost 9 months later, there are still only ~40 NFS and ~20 CIFS submissions.

Comments?

R&D effectiveness

A recent Gizmodo blog post compared a decade of R&D at Sony, Microsoft and Apple. There were some interesting charts, but mostly it showed that R&D as a percent of revenue fluctuates from year to year, and that R&D spend has been rising for all three companies (although at different rates).

R&D Effectiveness, (C) 2010 Silverton Consulting, All Rights Reserved

Overall, on a percentage-of-revenue basis, Microsoft wins, spending ~15% of revenue on R&D over the past decade; Apple loses, spending only ~4%; and Sony is right in the middle, spending ~7%. Yet viewed by impact on corporate revenue, R&D spending had significantly different effects on each company than pure % R&D spending would indicate.

How can one measure R&D effectiveness?

  • Number of patents – this is often used as an indicator, but it’s unclear how it correlates with business success. Patents can be licensed, but only if they prove important to other companies. However, patent counts can be gauged early on, during the R&D activities, rather than much later when a product reaches the market.
  • Number of projects – by projects we mean an idea from research taken into development. Such projects may or may not make it out to market. At one level this can be a leading indicator of “research” effectiveness, as it means an idea was deemed at least of commercial interest. A problem with this is that not all projects get released to the market or become commercially viable.
  • Number of products – by products, we mean something sold to customers. At least such a measure reflects that the total R&D effort was deemed worthy enough to take to market. How successful such a product is remains to be determined.
  • Revenue of products – product revenue seems easy enough, but it can often be hard to allocate properly. Looking at the iPhone, do we count just handset revenues, or include application and cell service revenues? But assuming one can properly allocate revenue sources to R&D efforts, one can come up with a revenue-per-R&D-spending ratio (see the sketch following this list). The main problem with revenue-from-R&D ratios is that all the other non-R&D factors confound them, e.g., marketing, manufacturing, competition, etc.
  • Profitability of products – product profitability is even messier than revenue when it comes to confounding factors. But ultimately, profitability of R&D efforts may be the best factor to use, as any R&D that’s truly effective should generate the most profits.
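
As promised above, here’s a minimal sketch of the revenue-based measure: revenue growth over the period divided by cumulative R&D spend. The dollar figures below are placeholders, not the actual Gizmodo data.

```python
# Minimal sketch of a revenue-based R&D effectiveness ratio: revenue
# growth over the decade divided by cumulative R&D spend. The dollar
# figures below are placeholders, not the Gizmodo data.
def rd_effectiveness(rev_start: float, rev_end: float,
                     cumulative_rd: float) -> float:
    """Revenue dollars gained per R&D dollar spent."""
    return (rev_end - rev_start) / cumulative_rd

# Hypothetical example: revenue grows from $10B to $60B on $25B of R&D,
# i.e. $2 of new revenue per R&D dollar -- confounded, of course, by
# marketing, manufacturing, competition and everything else.
print(f"{rd_effectiveness(10e9, 60e9, 25e9):.1f}x")  # 2.0x
```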

There are probably other R&D effectiveness factors that could be considered but these will suffice for now.

How did they do?

Returning to the Gizmodo discussion, their post didn’t include any patent counts, project counts (only visible internally), product counts, or profitability measures, but they did show revenue for each company. From a purely revenue-impact standpoint, one would have to say that Apple’s R&D was a clear winner, with Microsoft a clear second. Granted, Apple started from considerably smaller revenue than Sony or Microsoft, but Apple’s ~$14B of revenue in 2005 was only small in comparison to other giants. We all know the success of the iPhone and iPod, but they also stumbled with the Apple TV.

Why did they do so well?

What then makes Apple do so well? We have talked before about an elusive quality we called visionary leadership. Certainly Bill Gates is as technically astute as Steve Jobs, and there can be no denying that their respective marketing machines are evenly matched. But both Microsoft and Apple were certainly led by more technical individuals than Sony over the last decade. Both Microsoft and Apple have had significant revenue increases over the past ten years that parallel one another, while Sony, in comparison, has remained relatively flat.

I would say the Microsoft and Apple results show that “visionary leadership” has a technical component that can’t be denied. Moreover, I think that if one looked at Sony under Akio Morita, HP under Bill Hewlett and Dave Packard, or many other large companies today, one could conclude that technical excellence is a significant component of visionary leadership. All these companies’ highest revenue growth came under leadership with significant technical knowledge. There’s more to visionary leadership than technicality alone, but it seems at least foundational.

I still owe a post on just what constitutes visionary leadership, but I seem to be surrounding it rather than attacking it directly.

NetApp and Microsoft ink 3 year agreement

Microsoft logo

NetApp Logo

This week, NetApp and Microsoft announced a new agreement that increases collaboration and integration across both their product technology and sales & marketing activities. Specifically, the new agreement covers better integration of:

  • Microsoft’s virtualized infrastructure with NetApp storage. This currently consists of new PRO Packs for System Center to support NetApp storage, and SnapManager for Hyper-V. Better integration like this between Microsoft virtualization and NetApp storage should help ease administration and maximize utilization of customer storage in Hyper-V environments.
  • Microsoft Exchange, SQL and Office SharePoint servers with NetApp storage. This includes new support for Exchange 2010 in NetApp SnapManager, fully providing deduplication and replication of Exchange data. Further integration along these lines should lead to even better customer utilization of storage in these environments in the near future.
  • Microsoft’s dynamic data center initiative and NetApp storage. This includes better testing and compatibility for using NetApp storage under Microsoft’s Dynamic Data Center Toolkit for Enterprise (DDTK-E) for private clouds. Better integration and support between Microsoft’s dynamic data center and NetApp storage should help customers build more storage-efficient cloud infrastructure.
  • Microsoft and NetApp joint go-to-market activities. This will result in Microsoft and NetApp joint channel partners being better able to offer joint solutions to their customers; also, customers should see better collaboration and partnership between NetApp and Microsoft sales and support.

This agreement should be considered a deepening of NetApp and Microsoft’s ongoing alliance. How this will actually change either company’s product functionality was left unsaid, but one should see more of NetApp’s advanced storage features become available directly from Microsoft systems, and tighter integration of NetApp storage services with Microsoft systems should make it easier for customers to use storage more efficiently. Exactly when these activities will result in enhanced NetApp or Microsoft functionality was not stated and is no doubt the subject of ongoing discussions between the two teams. Nonetheless, this agreement seems to say that NetApp and Microsoft are taking their alliance to the next level.

Full disclosure: I am currently working on a contract with NetApp on another aspect of their storage but am doing no work with Microsoft.