RoCE – Silverton Consulting

QoM1610: Will NVMe over Fabric GA in enterprise AFA by Oct’2017

Posted on October 16, 2016 by Ray in Block Storage, CIFS/SMB, Ethernet, FC, FCoE, File Storage, Forecasting, Infiniband, iSCSI, Networking, NFS, NVMe, NVMe storage, Omni-Path, QoM 2016, RDMA, RoCE, SSD storage, Storage performance, Strategic Inflection Points

NVMe over fabric (NVMeoF) was a hot topic at Flash Memory Summit last August. Facebook and others were showing off their JBOF (see my Facebook moving to JBOF post) but there were plenty of other NVMeoF offerings at the show.

NVMeoF hardware availability

When Brocade announced their Gen6 Switches they made a point of saying that both their Gen5 and Gen6 switches currently support NVMeoF protocols. In addition to Brocade’s support, in Dec 2015 Qlogic announced support for NVMeoF for select HBAs. Also, as of July 2016, Emulex announced support for NVMeoF in their HBAs.

From an Ethernet perspective, Qlogic has a NVMe Direct NIC which supports NVMe protocol offload for iSCSI. But even without NVMe Direct, Ethernet 40GbE & 100GbE with iWARP, RoCEv1-v2, iSCSI SER, or iSCSI RDMA all could readily support NVMeoF on Ethernet. The nice thing about NVMeoF for Ethernet is not only do you get support for iSCSI & FCoE, but CIFS/SMB and NFS as well.

InfiniBand and Omni-Path Architecture already support native RDMA, so they should already support NVMeoF.

So hardware/firmware is already available for any enterprise AFA customer to want NVMeoF for their data center storage.

NVMeoF Software

Intel claims that ~90% of the software driver functionality of NVMe is the same for NVMeoF. The primary differences between the two seem to be the NVMeoY discovery and queueing mechanisms.

There are two fabric methods that can be used to implement NVMeoF data and command transfers: capsule mode where NVMe commands and data are encapsulated in normal fabric packets or fabric dependent mode where drivers make use of native fabric memory transfer mechanisms (RDMA, …) to transfer commands and data.

A (Linux) host driver for NVMeoF is currently available from Seagate. And as a result, support for NVMeoF for Linux is currently under development, and not far from release in the next Kernel (I think). (Mellanox has a tutorial on how to compile a Linux kernel with NVMeoF driver support).

With Linux coming out, Microsoft Windows and VMware can’t be far behind. However, I could find nothing online, aside from base NVMe support, for either platform.

NVMeoF target support is another matter but with NICs/HBAs & switch hardware/firmware and drivers presently available, proprietary storage system target drivers are just a matter of time.

Boot support is a major concern. I could find no information on BIOS support for booting off of a NVMeoF AFA. Arguably, one may not need boot support for NVMeoF AFAs as they are probably not a viable target for storing App code or OS software.

From what I could tell, normal fabric multi-pathing support should work fine with NVMeoF. This should allow for HA NVMeoF storage, a critical requirement for enterprise AFA storage systems these days.

NVMeoF advantages/disadvantages

Chelsio and others have shown that NVMeoF adds ~8μsec of additional overhead beyond native NVMe SSDs, which if true would warrant implementation on all NVMe AFAs. This may or may not impact max IOPS depending on scale-ability of NVMeoF.

For instance, servers (PCIe bus hardware) typically limit the number of private NVMe SSDs to 255 or less. With an NVMeoF, one could potentially have 1000s of shared NVMe SSDs accessible to a single server. With this scale, one could have a single server attached to a scale-out NVMeoF AFA (cluster) that could supply ~4X the IOPS that a single server could perform using private NVMe storage.

Base level NVMe SSD support and protocol stacks are starting to be available for most flash vendors and operating systems such as, Linux, FreeBSD, VMware, Windows, and Solaris. If Intel’s claim of 90% common software between NVMe and NVMeoF drivers is true, then it should be a relatively easy development project to provide host NVMeoF drivers.

The need for special Ethernet hardware that supports RDMA may delay some storage vendors from implementing NVMeoF AFAs quickly. The lack of BIOS boot support may be a minor irritant in comparison.

NVMeoF forecast

AFA storage systems, as far as I can tell, are all about selling high IOPS and very-low latency IOs. It would seem that NVMeoF would offer early adopter AFA storage vendors a significant performance advantage over slower paced competition.

In previous QoM/QoW posts we have established that there are about 13 new enterprise storage systems that come out each year. Probably 80% of these will be AFA, given the current market environment.

Of the 10.4 AFA systems coming out over the next year, ~20% of these systems pride themselves on being the lowest latency solutions in the market, and thus command high margins. One would think these systems would be the first to adopt NVMeoF. But, most of these systems have their own, proprietary flash modules and do not use standard (NVMe) SSDs and can use their own proprietary interface to their proprietary flash storage. This will delay any implementation for them until they can convert their flash storage to NVMe which may take some time.

On the other hand, most (70%) of the other AFA systems, that currently use SAS/SATA SSDs, could boost their IOP counts and drastically reduce their IO response times, by implementing NVMe SSDs and NVMeoF. But converting SAS/SATA backends to NVMe will take time and effort.

But, there are a select few (~10%) of AFA systems, that already use NVMe SSDs in their AFAs, and for these few, they would seem to have a fast track towards implementing NVMeoF. The fact that NVMeoF is supported over all fabrics and all storage interface protocols make it even easier.

Moreover, NVMeoF has been under discussion since the summer of 2015, which tells me that astute AFA vendors have already had 18+ months to develop it. With NVMeoF host drivers & hardware available since Dec. 2015, means hardware and software exist to test and validate against.

I believe that NVMeoF will be GA’d within the next 12 months by at least one enterprise AFA system. So my QoM1610 forecast for NVMeoF is YES, with a 0.83 probability.

Comments?

Windows Server 2012 R2 storage changes announced at TechEd

Posted on June 6, 2013September 8, 2013 by Ray in CIFS/SMB, Cloud storage, DAS, desktop virtualization, Ethernet, File Storage, RDMA, RoCE, Server virtualization, SSD storage, Storage performance

Microsoft TechEd USA is this week and they announced a number of changes to the storage services that come with Windows Server 2012 R2

Azure DRaaS – Microsoft is attempting to democratize DR by supporting a new DR-as-a-Service (DRaaS). They now have an Azure service that operates in conjunction with Windows Server 2012 R2 that provides orchestration and automation for DR site failover and fail back to/from remote sites. Windows Server 2012 R2 uses Hyper-V Replica to replicate data across to the other site. Azure DRaaS supports DR plans (scripts) to identify groups of Hyper-V VMs which need to be brought up and their sequencing. VMs within a script group are brought up in parallel but different groups are brought up in sequence. You can have multiple DR plans, just select the one to execute. You must have access to Azure to use this service. Azure DR plans can pause for manual activities and have the ability to invoke PowerShell scripts for more fine tuned control. There’s also quite a lot of setup that must be done, e.g. configure Hyper-V hosts, VMs and networking at both primary and secondary locations. Network IP injection is done via mapping primary to secondary site IP addresses. The Azzure DRaaS really just provides the orchestration of failover or fallback activity. Moreover, it looks like Azure DRaaS is going to be offered by service providers as well as private companies. Currently, Azure’s DRaaS has no support for SAN/NAS replication but they are working with vendors to supply an SRM-like API to provide this.
Hyper-V Replica changes – Replica support has been changed from a single fixed asynchronous replication interval (5 minutes) to being able to select one of 3 intervals: 15 seconds; 5 minutes; or 30 minutes.
Storage Spaces Automatic Tiering – With SSDs and regular rotating disk in your DAS (or JBOD) configuration , Windows Server 2012 R2 supports automatic storage tiering. At Spaces configuration time one dedicates a certain portion of SSD storage to tiering. There is a scheduled Windows Server 2012 task which is then used to scan the previous periods file activity and identify which file segments (=1MB in size) that should be on SSD and which should not. Then over time file segments are moved to an appropriate tier and then, performance should improve. This only applies to file data and files can be pinned to a particular tier for more fine grained control.
Storage Spaces Write-Back cache – Another alternative is to dedicate a certain portion of SSDs in a Space to write caching. When enabled, writes to a Space will be cached first in SSD and then destaged out to rotating disk. This should speed up write performance. Both write back cache and storage tiering can be enabled for the same Space. But your SSD storage must be partitioned between the two. Something about funneling all write activity to SSDs just doesn’t make sense to me?!
Storage Spaces dual parity – Spaces previously supported mirrored storage and single parity but now also offers dual parity for DAS. Sort of like RAID6 in protection but they didn’t mention the word RAID at all. Spaces dual parity does have a write penalty (parity update) and Microsoft suggests using it only for archive or heavy read IO.
SMB3.1 performance improvements of ~50% – SMB has been on a roll lately and R2 is no exception. Microsoft indicated that SMB direct using a RAM DISK as backend storage can sustain up to a million 8KB IOPS. Also, with an all-flash JBOD, using a mirrored Spaces for backend storage, SMB3.1 can sustain ~600K IOPS. Presumably these were all read IOPS.
SMB3.1 logging improvements – Changes were made to SMB3.1 event logging to try to eliminate the need for detail tracing to support debug. This is an ongoing activity but one which is starting to bear fruit.
SMB3.1 CSV performance rebalancing – Now as one adds cluster nodes, Cluster Shared Volume (CSV) control nodes will spread out across new nodes in order to balance CSV IO across the whole cluster.
SMB1 stack can be (finally) fully removed – If you are running Windows Server 2012, you no longer need to install the SMB1 stack. It can be completely removed. Of course, if you have some downlevel servers or clients you may want to keep SMB1 around a bit longer but it’s no longer required for Server 2012 R2.
Hyper-V Live Migration changes – Live migration can now take advantage of SMB direct and its SMB3 support of RDMA/RoCE to radically speed up data center live migration. Also, Live Migration can now optionally compress the data on the current Hyper-V host, send compressed data across the LAN and then decompress it at target host. So with R2 you have three options to perform VM Live Migration traditional, SMB direct or compressed.
Hyper-V IO limits – Hyper-V hosts can now limit the amount of IOPS consumed by each VM. This can be hierarchically controlled providing increased flexibility. For example one can identify a group of VMs and have a IO limit for the whole group, but each individual VM can also have an IO limit, and the group limit can be smaller than the sum of the individual VM limits.
Hyper-V supports VSS backup for Linux VMs – Windows Server 2012 R2 has now added support for non-application consistent VSS backups for Linux VMs.
Hyper-V Replica Cascade Replication – In Windows Server 2012, Hyper V replicas could be copied from one data center to another. But now with R2 those replicas at a secondary site can be copied to a third, cascading the replication from the first to the second and then the third data center, each with their own replication schedule.
Hyper-V VHDX file resizing – With Windows Server 2012 R2 VHDX file sizes can now be increased or reduced for both data and boot volumes.
Hyper-V backup changes – In previous generations of Windows Server, Hyper-V backups took two distinct snapshots, one instantaneously and the other at quiesce time and then the two were merged together to create a “crash consistent” backup. But with R2, VM backups only take a single snapshot reducing overhead and increasing backup throughput substantially.
NVME support – Windows Server 2012 R2 now ships with a Non-Volatile Memory Express (NVME) driver for PCIe flash storage. R2’s new NVME driver has been tuned for low latency and high bandwidth and can be used for non-clustered storage spaces to improve write performance (in a Spaces write-back cache?).
CSV memory read-cache – Windows Server 2012 R2 can be configured to set aside some host memory for a CSV read cache. This is different than the Spaces Write-Back cache. CSV caching would operate in conjunction with any other caching done at the host OS or elsewhere.

That’s about it. Some of the MVPs had a preview of R2 up in Redmond, but all of this was to be announced in TechEd, New Orleans, this week.

~~~~

Image: Microsoft TechEd by BetsyWeber

SMB2.2 (CIFS) screams over InfiniBand

Posted on March 27, 2012 by Ray in Disk storage, Distributed computing, File Storage, Infiniband, Networking, RDMA, RoCE, Storage architecture, Storage performance, System effectiveness

Microsoft MVP Summit 2010 by David McCarter (cc) (From Flickr)

I missed the MVP summit last month in Redmond, but I heard there was some more discussion of the Server Message Block v2.2 (SMB2.2, also known previously as CIFS) coming in Windows Server (R) 8.

The big news is SMB2.2 now supports RDMA and can use InfiniBand (announced at SNIA Developer Conference last fall). It also supports RDMA over Ethernet via RoCE (see my Intel buys Qlogic’s Infiniband post) and iWARP.

SMB2.2 over InfiniBand performance

As reported last fall at the SNIA Developer Conference SMB2.2 using RDMA over InfiniBand reached over 3.7GB/sec with no server configuration changes using two QDR cards and 160K IOPs (the IOPs are from an SQLIO run using 8KB IOs, not SPECsfs2008). The pre-beta, SMB2.2 code was running on commodity server hardware using 32Gbps InfiniBand links. I couldn’t find any performance numbers with ROCE or iWARP but I would suspect running on 10GbE these would be much slower than InfiniBand.

Hints are that performance gets even better with the released versions of the code coming out in Windows Server 8.

SMB2.2 gets even faster than NFS

We have noted in the past that SMB (CIFS) on average, shows better throughput (IOPS) performance than NFS in SPECsfs2008 results (for example, see our latest Chart-of-the-Month post on SPECsfs results). However, those results were all at best SMB2 or even SMB1 results, and commonly using Ethernet links.

NFS already supports InfiniBand but I am unsure whether it makes use of RDMA. Nevertheless, the significant speed up shown here for SMB2.2 will potentially take SPECsfs2008 SMB2.2 performance up to a whole new level.

Why InfiniBand?

As you may recall, InfinBand is primarily deployed as a server to server interface and used extensively in the past for high performance computing environments. However nowadays, we find storage clusters, such as EMC Isilon, HP X9000 (Ibrix), IBM XIV and others using InfiniBand for their inter-node communications. The use of InfiniBand in these storage clusters is probably due primarily to its superior latency over Ethernet.

But InfiniBand has another advantage, fast data throughput, when using RDMA it can transfer data faster than almost any other networking protocol alive today. SMB2.2 takes advantage of this throughput boost by using RDMA only for large blocks of data and avoiding it for smaller blocks of data. Not sure what the cutoff is, but this would certainly help in large SQL database queries, disk copies, and any other large file data transfer operations.

Of course with 56Gbps FDR InfiniBand available today and faster transfer rates coming (see IBTA roadmap), there appears to be every reason to believe that superior throughput performance will continue at least for the foreseeable future. Better latency is also certain to be retained as well

Now that Intel’s pushing it, Mellanox continuing to push Infiniband and storage cluster’s using it more frequently, we may start to see more storage protocols supporting it.

We thought that FC only had Ethernet to worry about, with SMB2.2 moving to InfiniBand, NFS already supporting it, can a fully functional FCoIB be far behind?

Intel acquires InfiniBand fabric technology from Qlogic

Posted on January 24, 2012April 10, 2012 by Ray in Clustered storage, Distributed computing, Ethernet, Infiniband, Information economy, Networking, RDMA, RoCE, Storage performance, Strategic Inflection Points

Isilon Packaging by ChrisDag (cc) (from Flickr)”]Intel announced today that they are going to acquire the InfiniBand (IB) fabric technology business from Qlogic.

From many analyst’s perspective, IB is one of the only technologies out there that can efficiently interconnect a cluster of commodity servers into a supercomputing system.

What’s InfiniBand?

Recall that IB is one of three reigning data center fabric technologies available today which include 10GbE, and 16 Gb/s FC. IB is currently available in DDR, QDR and FDR modes of operation, that is 5Gb/s, 10Gb/s or 14Gb/s, respectively per single lane, according to the IB update (see IB trade association (IBTA) technology update). Systems can aggregate multiple IB lanes in units of 4 or 12 paths (see wikipedia IB article), such that an IB QDRx4 supports 40Gb/s and a IB FDRx4 currently supports 56Gb/s.

The IBTA pitch cited above showed that IB is the most widely used interface for the top supercomputing systems and supports the most power efficient interconnect available (although how that’s calculated is not described).

Where else does IB make sense?

One thing IB has going for it is low latency through the use of RDMA or remote direct memory access. That same report says that an SSD directly connected through a FC takes about ~45 μsec to do a read whereas the same SSD directly connected through IB using RDMA would only take ~26 μsec.

However, RDMA technology is now also coming out on 10GbE through RDMA over Converged Ethernet (RoCE, pronounced “rocky”). But ITBA claims that IB RDMA has a 0.6 μsec latency and the RoCE has a 1.3 μsec. Although at these speed, 0.7 μsec doesn’t seem to be a big thing, it doubles the latency.

Nonetheless, Intel’s purchase is an interesting play. I know that Intel is focusing on supporting an ExaFLOP HPC computing environment by 2018 (see their release). But IB is already a pretty active technology in the HPC community already and doesn’t seem to need their support.

In addition, IB has been gradually making inroads into enterprise data centers via storage products like the Oracle Exadata Storage Server using the 40 Gb/s IB QDRx4 interconnects. There are a number of other storage products out that use IB as well from EMC Isilon, SGI, Voltaire, and others.

Of course where IB can mostly be found today is in computer to computer interconnects and just about every server vendor out today, including Dell, HP, IBM, and Oracle support IB interconnects on at least some of their products.

Whose left standing?

With Qlogic out I guess this leaves Cisco (de-emphasized lately), Flextronix, Mellanox, and Intel as the only companies that supply IB switches. Mellanox, Intel (from Qlogic) and Voltaire supply the HCA (host channel adapter) cards which provide the server interface to the switched IB network.

Probably a logical choice for Intel to go after some of this technology just to keep it moving forward and if they want to be seriously involved in the network business.

IB use in Big Data?

On the other hand, it’s possible that Hadoop and other big data applications could conceivably make use of IB speeds and as these are mainly vast clusters of commodity systems it would be a logical choice.

There is some interesting research on the advantages of IB in HDFS (Hadoop) system environments (see Can high performance interconnects boost Hadoop distributed file system performance) out of Ohio State University. This research essentially says that Hadoop HDFS can perform much better when you combine IB with IPoIB (IP over IB, see OpenFabrics Alliance article) and SSDs. But SSDs alone do not provide as much benefit. (Although my reading of the performance charts seems to indicate it’s not that much better than 10GbE with TOE?).

It’s possible other Big data analytics engines are considering using IB as well. It would seem to be a logical choice if you had even more control over the software stack.

~~~~

Comments?