IO Virtualization comes out

Snakes in a plane by richardmasoner [from flickr (cc)]
Prior to last week’s VMworld, I had never heard of IO virtualization products – storage virtualization, yes, but never IO virtualization. Then at the show I met with two vendors of IO virtualization products, Aprius and Virtensys.

IO virtualization takes the HBAs/CNAs/NICs that would normally be plugged into each tower server and consolidates them into a top-of-rack box that shares these IO cards among the servers. The top-of-rack box is connected to each of the tower servers by extending each server’s PCI-express bus.

Each individual server believes it has a local HBA/CNA/NIC card and acts accordingly. The top-of-rack box handles the mapping of each server to its portion of the shared HBA/CNA/NIC cards. All this reminds me of server virtualization, which uses software to share server processor, memory and IO resources across multiple applications – but with one significant difference.

How IO virtualization works

Aprius depends on the new SRIOV (Single Root I/O virtualization [requires login]) standards. I am no PCI-express expert but what this seems to do is allow an HBA/CNA/NIC PCI-express card to be a shared resource among a number of virtual servers executing within a single physical server. What Aprius has done is sort of a “P2V in reverse”, allowing a number of physical servers to share the same PCI-express HBA/CNA/NIC card in the top-of-rack box.

Virtensys says its solution does not depend on the SRIOV standards to provide IO virtualization. As such, it’s not clear what’s different, but its top-of-rack box could conceivably share the hardware via software magic.

From an FC and HBA perspective, a number of questions arise as to how all this works.

  • Does the top-of-rack box need to be powered on and booted up first?
  • How are FC zoning and LUN masking supported in a shared environment?

Similar networking questions should arise especially when one considers iSCSI boot capabilities.

Economics of IO virtualization

But the real question is one of economics. My lab owner friends tell me that a CNA costs about $800/port these days. When one considers that 4-8 servers could share each of these ports with IO virtualization, the economics become clearer. With a typical configuration of 6 servers:

  • For a non-IO virtualized solution, each server would have 2 CNA ports at a minimum so this would cost you $1600/server or $9600.
  • For an IO virtualized solution, each server requires a PCI-extender, costing about $50/server or $300 total, plus at least one dual-ported CNA (in the top-of-rack box) costing $1600, plus the cost of the top-of-rack box itself.

If the IO virtualization box costs less than $7.7K it would be economical. But IO virtualization providers also claim another savings: fewer switch ports need to be purchased because there are fewer physical network links. It’s unclear to me what a 10GbE port with FCoE support costs these days, but my guess is ~2X what a CNA port costs, or another $1600/port – for the 6-server, dual-ported configuration that’s ~$19.2K. Thus, the top-of-rack solution could cost almost $27K and still be more economical (ignoring the few switch ports the shared CNA would still need). When using IO virtualization to reduce HBAs and NICs as well, the top-of-rack solution could be even more economical.
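The back-of-envelope arithmetic above can be sketched as follows; all figures are the rough 2009 estimates used in the text, not actual vendor pricing:

```python
# Break-even price for an IO virtualization top-of-rack box, using the
# per-port estimates from the text. All figures are rough assumptions.

SERVERS = 6
CNA_PORT = 800          # $/CNA port (lab owner estimate)
SWITCH_PORT = 1600      # $/10GbE FCoE switch port (guessed at ~2X CNA)
PCIE_EXTENDER = 50      # $/server for a PCI-express extender card

# Conventional: each server has a dual-ported CNA plus two switch ports
conventional_cna = SERVERS * 2 * CNA_PORT              # $9,600
conventional_switch = SERVERS * 2 * SWITCH_PORT        # $19,200

# IO virtualized: one extender per server plus one shared dual-ported CNA
iov_fixed = SERVERS * PCIE_EXTENDER + 2 * CNA_PORT     # $1,900

# Break-even price for the top-of-rack box, with and without counting
# the switch ports no longer needed:
breakeven_cna_only = conventional_cna - iov_fixed                 # $7,700
breakeven_with_switch = breakeven_cna_only + conventional_switch  # $26,900

print(breakeven_cna_only, breakeven_with_switch)
```

This ignores the couple of switch ports the shared CNA itself still needs, which would shave a few thousand dollars off the second figure.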

Although the economics may favor IO virtualization – at the moment – time is running out. CNA, HBA and NIC ports are coming down in price as vendors ramp up production. The same factors will reduce switch port costs as well. Thus, the savings gained from sharing CNAs, HBAs and NICs across multiple servers will diminish over time. Also, the move to FCoE will eliminate separate HBAs and NICs, replacing them with just CNAs, so there will be even fewer ports to amortize.

Moreover, PCI-express extender cards will probably never achieve volumes similar to HBAs, NICs, or CNAs, so extender card pricing should remain relatively flat. In contrast, any top-of-rack solution will share in the overall technology trends reducing server pricing, so the relative advantages of IO virtualization boxes over top-of-rack switches should be a wash.

The critical question for the IO virtualization vendors is whether they can support a high enough fan-in (physical servers per top-of-rack box) to justify the additional capital and operational expense of their solution. And whether they can keep ahead of the pricing trends of their competition (top-of-rack switch ports and server CNA ports).

On one hand, as CNAs, HBAs, and NICs become faster and more powerful, no single application can consume all the throughput being made available. On the other hand, server virtualization now runs more applications on each physical server, amortizing port hardware over more and more applications.

Does IO virtualization make sense today, with HBAs at 8GFC and NICs and CNAs at 10GbE, and would it make sense in the future with converged networks? It all depends on port costs. As port costs go down, these products will eventually be squeezed.

The significant difference between server and IO virtualization is that IO virtualization doesn’t reduce hardware footprint – one top-of-rack IO virtualization appliance replaces a top-of-rack switch, and the server PCI-express slots used by CNAs/HBAs/NICs are now used by PCI-extender cards. In contrast, server virtualization reduced hardware footprint and costs from the start. The fact that IO virtualization doesn’t reduce hardware footprint may doom this product.

VMworld and long distance Vmotion

Moving a VM from one data center to another

In all the blog posts/tweets about VMworld this week I didn’t see much about long distance Vmotion. At Cisco’s booth there was a presentation on how they partnered with VMware to perform Vmotion across 200 (simulated) miles.

I can’t recall when I first heard about this capability, but many of us had heard about it before. What was new was that Cisco wasn’t the only one talking about it. I met with a company called NetEx whose HyperIP product was being used to perform long distance Vmotion between sites over 2000 miles apart – and they had at least three sites actually running their systems doing this. Now I am sure you won’t find NetEx on VMware’s long HCL, but what they have managed to do is impressive.

As I understand it, they have an optimized appliance (also available as a virtual [VM] appliance) that terminates the TCP session (used by Vmotion) at the primary site and transfers the data payload using their own UDP protocol to the target appliance, which reconstitutes the TCP session and sends it back up the stack as if everything were local. According to NetEx CEO Craig Gust, their product typically achieves a data payload efficiency of around 90%, compared to around 30% for standard TCP/IP, which automatically gives them a 3X advantage (although he claimed a 6X speed or distance advantage, I can’t seem to follow the logic).
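The 3X figure falls straight out of the payload-efficiency ratio. A minimal sketch, assuming a hypothetical 800Mb/s link and the efficiency figures quoted above:

```python
# Effective throughput (goodput) from payload efficiency on the same
# physical link. The 90%/30% figures are the ones quoted above; the 6X
# claim would need effects this simple ratio doesn't capture.

LINK_MBPS = 800                 # hypothetical WAN link speed

def effective_mbps(link_mbps, payload_efficiency):
    """Goodput: the fraction of raw link bandwidth carrying real data."""
    return link_mbps * payload_efficiency

tcp = effective_mbps(LINK_MBPS, 0.30)    # standard TCP/IP over distance
netex = effective_mbps(LINK_MBPS, 0.90)  # NetEx HyperIP claim

print(netex / tcp)   # 3.0 -- the 3X advantage
```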

How all this works with vCenter, DRS and HA I can only fathom, but my guess is that long distance Vmotion appears to VMware as a local Vmotion. This way DRS and/or HA can control it all. How the networking is set up to support this is beyond me.

Nevertheless, all of this proves that it’s not just one high-end networking company coming away with a proof of concept anymore – at least two companies exist, one of which has customers doing this today.

The Storage problem

In any event, accessing the storage at the remote site is another problem. It’s one thing to transfer server memory and state information over 10-1000 miles, it’s quite another to transfer TBs of data storage over the same distance. The Cisco team suggested some alternatives to handle the storage side of long distance Vmotion:

  • Let the storage stay in the original location. This would be supported by having the VM in the remote site access the storage across a network
  • Move the storage via long distance Storage Vmotion. The problem here is that transferring a TB of data (even at 90% payload efficiency on an 800Mb/s link) would take hours. And 800Mb/s networking isn’t cheap.
  • Replicate the storage via active-passive replication. Here the storage subsystem(s) concurrently replicate the data from the primary site to the secondary site
  • Replicate the storage via active-active replication where both the primary and secondary site replicate data to one another and any write to either location is replicated to the other
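To put the “hours” claim for long distance Storage Vmotion in perspective, here is a quick sanity check, assuming the same 90%-efficient 800Mb/s link:

```python
# How long does it take to move a TB over an 800Mb/s link at 90%
# payload efficiency? A sanity check of the "hours" claim above.

def transfer_hours(terabytes, link_mbps, payload_efficiency):
    bits = terabytes * 8e12                        # decimal TB -> bits
    goodput = link_mbps * 1e6 * payload_efficiency # usable bits/sec
    return bits / goodput / 3600

print(transfer_hours(1, 800, 0.90))   # ~3.1 hours per TB
```

Multiply by a VM’s actual storage footprint and the transfer window grows accordingly.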

Now I have to admit, active-active replication, where the same LUN or file system is replicated in both directions and updated at both locations simultaneously, seems to me unobtainium, but I can be convinced otherwise. Nevertheless, the other approaches exist today and effectively deal with the issue, albeit with commensurate increases in expense.

The Networking problem

So now that we have the storage problem solved, what about the networking problem? When a VM is Vmotioned to another ESX server it retains its IP addressing so as to retain all its current network connections. Cisco has some techniques here whereby they can extend the VLAN (or subnet) from the primary site to the secondary site and leave the VM with the same network IP address as at the primary site. Cisco has a couple of different ways to extend the VLAN, optimized for HA, load balancing, scalability or protocol isolation and broadcast avoidance (all of which are described further in their white paper on the subject). Cisco did mention that their VLAN extension technology currently would not support sites greater than 500 miles apart.

Presumably NetEx’s product solves all this by leaving the IP addresses/TCP port at the primary site and just transferring the data to the secondary site. In any event multiple solutions to the networking problem exist as well.

Now that long distance Vmotion can be accomplished, is it a DR tool, a mobility tool, a load balancing tool, or all of the above? That will need to wait for another post.

What’s happening with MRAM?

16Mb MRAM chips from Everspin

At the recent Flash Memory Summit there were a few announcements that show continued development of MRAM technology which can substitute for NAND or DRAM, has unlimited write cycles and is magnetism based. My interest in MRAM stems from its potential use as a substitute storage technology for today’s SSDs that use SLC and MLC NAND flash memory with much more limited write cycles.

MRAM has the potential to replace NAND in SSD technology because of its write speed (current prototypes write at 400MHz, i.e., a few nanoseconds) with the potential to go up to 1GHz. At 400MHz, MRAM is already much faster than today’s NAND. And with no write limits, MRAM technology should be very appealing to most SSD vendors.

The problem with MRAM

The only problem is that current MRAM chips use 150nm chip design technology whereas today’s NAND ICs use 32nm chip design technology. All this means that current MRAM chips hold about 1/1000th the memory capacity of today’s NAND chips (16Mb MRAM from Everspin vs 16Gb NAND from multiple vendors). MRAM has to get on the same (chip) design node as NAND to make a significant play for storage intensive applications.

It’s encouraging that somebody at least is starting to manufacture MRAM chips rather than the technology remaining in lab prototypes. From my perspective, it can only get better from here…

SSD vs Drive energy use

Hard Disk by Jeff Kubina

Recently, the Storage Performance Council (SPC) introduced a new benchmark series, SPC-1C/E, which provides detailed energy usage for storage subsystems. So far there have been only two published submissions in this category, but we look forward to seeing more in the future. The two submissions are for an IBM SSD subsystem and a Seagate Savvio (10Krpm) SAS attached storage subsystem.

My only issue with the SPC-1C/E reports is that they focus on a value of nominal energy consumption rather than reporting peak and idle energy usage. I understand that this is probably closer to what an actual data center would see as energy cost but it buries some intrinsic energy use profile differences.

SSD vs Drive power profile differences

The two current SPC-1C/E submissions show a ~9.6% difference between peak and nominal energy use for rotating media storage; similar results for the SSD storage show a difference of ~1.7%. Comparing peak versus idle periods, the difference for rotating media is ~28.5% and for SSD, ~2.8%.

So, the upside for SSDs is that you can drive them as hard as you want and it will cost only a little more energy. The downside is that leaving them idle costs almost as much as driving them at peak IO rates.

Rotating media storage seems to have a much more responsive power profile. Drive them hard and it will consume more power, leave them idle and it consumes less power.
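To make the two profiles concrete, here is a sketch with hypothetical wattages chosen only to reproduce the percentage deltas reported above; they are not the actual SPC-1C/E figures:

```python
# Hypothetical wattages (normalized to nominal = 100W) chosen only to
# reproduce the percentage deltas reported above -- not actual SPC data.

def pct_delta(a, b):
    """Percent difference of a relative to b."""
    return (a - b) / b * 100

disk = {"peak": 109.6, "nominal": 100.0, "idle": 85.3}
ssd  = {"peak": 101.7, "nominal": 100.0, "idle": 98.9}

for name, p in (("disk", disk), ("ssd", ssd)):
    print(name,
          round(pct_delta(p["peak"], p["nominal"]), 1),  # peak vs nominal
          round(pct_delta(p["peak"], p["idle"]), 1))     # peak vs idle
```

The flat SSD profile means idle SSDs burn nearly peak power; the slanted disk profile means power tracks workload.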

Data center view of storage power

Now these differences might not seem significant, but given the amount of storage in most shops they could represent significant cost differentials. Although SSD storage consumes less power, its energy use profile is significantly flatter than rotating media’s, and it will always consume close to that level of power (when powered on). On the other hand, rotating media consumes more power on average, but its power profile is more slanted than SSDs’ and at peak workload it consumes much more power than when idle.

Usually, it’s unwise to generalize from two results. However, everything I know says that these differences in their respective power profiles should persist across other storage subsystem results. As more results are submitted it should be easy to verify whether I am right.

Why virtualize now?

HP servers at School of Electrical Engineering, University of Belgrade, by lilit
I suppose it’s obvious to most analysts why server virtualization is such a hot topic these days. Most IT shops today purchase servers that are way overpowered and could easily execute multiple applications. These overpowered servers are wasted running single applications and would easily run multiple applications if only an operating system could run them together without interference.

Enter virtualization: hypervisors can run multiple applications concurrently, and sometimes simultaneously, on the same hardware server without compromising application execution integrity. Multiple virtual machine applications execute on a single server under a hypervisor that isolates the applications from one another. Thus, they all execute together on the same hardware without impacting each other.

But why doesn’t the O/S do this?

Most computer purists would ask why not just run the multiple applications under the same operating system. But the operating systems that run servers nowadays weren’t designed to run multiple applications together and, as such, weren’t designed to isolate them properly.

Virtualization hypervisors have had a clean slate on which to execute and isolate multiple applications. Thus, virtualization is taking over the data center floor. As new servers come in, old servers are retired and the applications that used to run on them are consolidated onto fewer and fewer physical servers.

Why now?

Current hardware trends dictate that each new generation of server has more processing power and oftentimes, more processing elements than previous generations. Today’s applications are getting more sophisticated but even with added sophistication, they do not come close to taking advantage of all the processing power now available. Hence, virtualization wins.

What seems to be happening nowadays is that while data centers started out consolidating tier 3 applications through virtualization, now they are starting to consolidate tier 2 applications and tier 1 apps are not far down this path. But, tier 2 and 1 applications require more dedicated services, more processing power, more deterministic execution times and thus, require more sophisticated virtualization hypervisors.

As such, VMware and others are responding by providing more hypervisor sophistication, e.g., more ways to dedicate and split up the processing, networking and storage available to the physical server for virtual machine or application dedicated use. Thus they are preparing themselves for a point in the not too distant future when tier 1 applications run with all the comforts of a dedicated server environment but actually execute alongside other VMs in a single physical server.

VMware vSphere

We can see the start of this trend with the latest offering from VMware, vSphere. This product now supports more processing hardware, more networking options and stronger storage support. vSphere can also dedicate more processing elements to virtual machines. Such new features make it easier to support tier 2 applications today and tier 1 applications sometime in the future.

ESRP results 1K and under mailboxes – chart of the month

Top 10 ESRP database transfers/sec

As described more fully in last month’s SCI newsletter, to the left is a chart depicting Exchange Solution Reviewed Program (ESRP) results for up to 1000 mailboxes in the database reads and writes per second category. This top 10 chart is dominated by HP’s new MSA 2000fc G2 product.

Microsoft will tell you that ESRP is not to be used to compare one storage vendor against another but more as a proof of concept to show how some storage can support a given email workload. The nice thing about ESRP, from my perspective, is that it represents a realistic storage workload rather than the more synthetic workloads offered by the other benchmarks.

What does over 3000 Exchange database operations per second mean to the normal IT shop or email user? It should mean more emails per hour can be sent/received with less hardware. It should mean a higher capacity to service email clients. It should mean a happier IT staff.

But does it mean happier end-users?

I would show my other chart from this latest dispatch, the one with read latency on it, but that would be two charts. Anyway, what the top 10 read latency chart would show is that EMC CLARiiON dominates, with the overall lowest latency and the top 9 positions held by various versions of CLARiiON and replication alternatives reported in ESRP results. The 9 CLARiiON subsystems had read latencies of around 8-11 msec. The one CLARiiON on the chart above (CX3-20, #7 in the top 10) had a read latency of around 9 msec. and a write latency of 5 msec. In contrast, the HP MSA had a read latency of 16 msec. with a write latency of 5 msec. – very interesting.

What this says is that database transfers per second are more of a throughput measure: even though a single database operation may take almost ~2X longer (16 vs. 9 msec.), the MSA can still perform more database transfer operations per second due to concurrency. Almost makes sense.
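The throughput-versus-latency tradeoff here is just Little’s law: IO/s equals outstanding IOs divided by per-IO latency. A sketch, with concurrency values that are purely hypothetical, chosen only to illustrate:

```python
# Little's law: throughput (IO/s) = outstanding IOs / latency (s).
# With ~2X the latency, a subsystem needs ~2X the concurrency just to
# match the other's transfer rate. Concurrency values are hypothetical.

def iops(outstanding_ios, latency_ms):
    return outstanding_ios / (latency_ms / 1000.0)

clariion = iops(outstanding_ios=28, latency_ms=9)   # ~3,111 IO/s
msa      = iops(outstanding_ios=56, latency_ms=16)  # 3,500 IO/s

print(round(clariion), round(msa))
```

So a higher-latency box can still top a transfers/sec chart if it sustains enough IOs in flight.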

Are vendors different?

This probably says something about the focus of the two storage vendors’ engineering groups – EMC CLARiiON on getting data to you the fastest and HP MSA on getting the most data through the system. It might also speak to what the vendors’ ESRP teams were trying to show. In any case, EMC’s CLARiiON and HP’s MSA have very different performance profiles.

Which vendor’s storage product makes the best sense for your Exchange servers? That’s a more significant question.

The full report will be up on my website later this week but if you want to get this information earlier and receive your own copy of our newsletter – just subscribe by emailing us.

What's holding back the cloud?

Cloud whisps by turtlemom4bacon

Steve Duplessie’s recent post on how the lack of scarcity will be a game changer got me thinking. Free is good, but the simplicity of the user/administrative interface is worth paying for. And it’s that simplicity that pays off for me.

Ease of use

I agree wholeheartedly with Steve about what and where people should spend their time today. Tweetdeck, the Mac, and the iPhone are three key examples that make my business life easier (most of the time).

  • TweetDeck allows me to filter whom I am following while still giving me access to any and all of them.
  • The Mac leaves me much more time to do what needs to be done and allows me to spend less time on non-essential stuff.
  • The iPhone has thousands of apps which make my idle time that much more productive.

Nobody would say any of these things was easy to create, and for most of them (TweetDeck is free at the moment) I pay a premium. All these products contain significant complexity in order to offer the simple user and administrative interfaces they supply.

The iPhone is probably closest to the cloud from my perspective. But it performs poorly (compared to broadband) and service (AT&T?) is spotty. These are nuisances in a cell phone which can be lived with. If this were my only work platform, they would be deadly.

Now the cloud may be easy to use because it removes the administrative burden, but that’s only one facet of using it. I assume using most cloud services is as easy as signing up on the web and then recoding applications to use the cloud provider’s designated API. This doesn’t sound easy to me. (Full disclosure: I am not a current cloud user and thus cannot speak to its ease of use.)

Storm clouds

However, today the cloud is not there for other reasons – availability concerns, security concerns, performance issues, etc. All these are inhibitors today and need to be resolved before the cloud can reach the mainstream or maybe be my platform of choice. Also, I have talked before on some other issues with the cloud.

Aside from those inhibitors, the other main problem with the cloud is the lack of applications I need to do business today. Google Apps and MS Office over the net are interesting but not sufficient. I’m not sure what is sufficient, and that would depend on your line of business, but server and desktop platforms had the same problem when they started out. Servers and desktops evolved over time from killer apps to providing all needed application support. The cloud will no doubt follow, over time.

In the end, the cloud needs to both grow up and evolve to host my business model, and I would presume many others as well. Personally I don’t care if my data & apps are hosted on the cloud or hosted on office machines. What matters to me are security, reliability, availability, and usability. When the cloud can support me in the same way that the Mac can, then who hosts my applications will be a purely economic decision.

The cloud and net are just not there yet.

STEC’s MLC enterprise SSD

So many choices by Robert S. Donovan

I haven’t seen much of a specification for STEC’s new enterprise MLC SSD but it should be interesting. So far everything I have seen seems to indicate that it’s a pure MLC drive with no SLC NAND. This is difficult for me to believe but could easily be cleared up by STEC or their specifications. Most likely it’s a hybrid SLC-MLC drive similar, at least from the NAND technology perspective, to FusionIO’s SSD drive.

MLC write endurance issue

My difficulty with a pure MLC enterprise drive is the write endurance factor.  MLC NAND can only endure around 10,000 erase/program passes before it starts losing data.  With a hybrid SLC-MLC design one could have the heavy write data go to SLC NAND which has a 100,000 erase/program pass lifecycle and have the less heavy write data go to MLC.  Sort of like a storage subsystem “fast write” which writes to cache first and then destages to disk but in this case the destage may never happen if the data is written often enough.

The only flaw in this argument is that as SSD drives get bigger (STEC’s drive supports up to 800GB) this becomes less of an issue. With more raw storage, the fact that a small portion of the data is very actively written gets swamped by the fact that there is plenty of storage to hold that data. As such, when one NAND cell gets close to its lifetime, another, younger cell can be used instead. This process is called wear leveling. STEC’s current SLC Zeus drive already has sophisticated wear leveling to deal with this sort of problem on SLC SSDs, and doing this for MLC just means having larger tables to work with.

I guess at some point, with multi-TB drives, the fact that MLC cannot sustain more than 10,000 erase/program passes becomes moot. There just isn’t that much actively written data out there in an enterprise shop: when you amortize the highly written data as a percentage of a drive, the more drive capacity, the smaller the active data percentage becomes. As such, as SSD drive capacities get larger this becomes less of an issue. I figure at 800GB the active data proportion might still be high enough to cause a problem, but it may not be an issue at all.

Of course, with MLC it’s also cheaper to over provision NAND storage to help with write endurance. For an 800GB MLC SSD, you could easily add another 160GB (20% over provisioning) fairly cheaply. Over provisioning thus allows the drive to sustain an overall write endurance much higher than an individual NAND cell’s write endurance.
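A rough sketch of how capacity and over provisioning translate into drive-level endurance, assuming idealized perfect wear leveling and ignoring write amplification (both generous assumptions):

```python
# Drive-level write endurance under (idealized) perfect wear leveling:
# total bytes writable = raw capacity x per-cell erase/program cycles.
# Real wear leveling and write amplification lower this; the numbers
# are illustrative only, using the endurance figures cited above.

def total_writes_pb(user_gb, overprovision_pct, cycles):
    raw_gb = user_gb * (1 + overprovision_pct / 100.0)
    return raw_gb * cycles / 1e6     # GB written -> PB written

mlc = total_writes_pb(800, 20, 10_000)    # ~9.6 PB over drive life
slc = total_writes_pb(800, 0, 100_000)    # ~80 PB over drive life
print(mlc, slc)
```

Even at 10,000 cycles, an 800GB MLC drive could absorb petabytes of writes, which is why bigger drives make the endurance gap matter less.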

Another solution to the write endurance problem is to increase the power of ECC to handle write failures. This would probably take some additional engineering and may or may not be in the latest STEC MLC drive but it would make sense.

MLC performance

The other issue with MLC NAND is that it has slower read and erase/program cycle times. These are still orders of magnitude faster than standard disk but slower than SLC NAND. For enterprise applications, SLC SSDs are blistering fast and are often performance limited by the subsystem they are attached to. So the fact that MLC SSDs are somewhat slower than SLC SSDs may not even be perceived by enterprise shops.

MLC performance is slower because it takes longer to read a cell with multiple bits in it than one with a single bit. MLC, in one technology I am aware of, encodes 2 bits in the voltage that is programmed into or read out from a cell, e.g., VoltageA = "00", VoltageB = "01", VoltageC = "10", and VoltageD = "11". This gets more complex with 3 or more bits per cell but the logic holds. With multiple voltages, determining which voltage level is present is more complex for MLC and hence takes longer to perform.
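A toy model of the multi-level sensing described above; the voltage values and thresholds are made up for illustration, and actual devices use vendor-specific levels and gray coding:

```python
# Toy model of 2-bit-per-cell MLC sensing: four voltage levels map to
# four bit patterns. Voltage values/thresholds here are invented; the
# point is that resolving 1 of 4 levels takes more comparisons than
# SLC's single threshold, which is one reason MLC reads are slower.

LEVELS = {0: "00", 1: "01", 2: "10", 3: "11"}   # voltage level -> bits

def read_cell(voltage, thresholds=(1.0, 2.0, 3.0)):
    """Compare against multiple thresholds to resolve one of 4 levels."""
    level = sum(voltage >= t for t in thresholds)
    return LEVELS[level]

# SLC needs 1 comparison per cell; 2-bit MLC needs up to 3.
print(read_cell(0.5), read_cell(1.5), read_cell(2.5), read_cell(3.5))
```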

In the end I would expect STEC’s latest drive to be some sort of SLC-MLC hybrid, but I could be wrong. It’s certainly possible that STEC has gone with just MLC and beefed up the capacity, over provisioning, ECC, and wear leveling algorithms to handle its lack of write endurance.

MLC takes over the world

But the major issue with using MLC in SSDs is that MLC technology is driving the NAND market. All those items in the photo above are most probably using MLC NAND, if not today then certainly tomorrow. As such, the consumer market will be driving MLC NAND manufacturing volumes way above anything the SLC market requires. Such volumes will ultimately make it unaffordable to manufacture/use any other type of NAND, namely SLC in most applications, including SSDs.

So sooner or later all SSDs will be using only MLC NAND technology. I guess the sooner we all learn to live with that the better for all of us.