VMware disaster recovery

Thunderstorms over Alexandria, VA by mehul.antani (cc) (from Flickr)

I did an article a while ago for TechTarget on Virtual (machine) Disaster Recovery, discussing what was then the latest version of VMware Site Recovery Manager (SRM) v1.0 and some of its capabilities.

Well, it's been a couple of years since that came out, so I thought it an appropriate time to discuss some updates to that product and other facilities that bear on virtual machine disaster recovery today.

SRM to the rescue

Recall that VMware's SRM is essentially a run book automation tool for system failover.  Using SRM, an administrator defines the physical and logical mapping between a primary site configuration (the protected site, in SRM parlance) of virtual machines, networking, and data stores and a secondary site (the recovery site, to SRM) configuration.

Once this mapping is complete, the administrator then creates recovery scripts (recovery plans to SRM) which take the recovery site in a step-by-step fashion from an "inactive" to an "active" state.  With the recovery scripts in hand, data replication can then be activated and monitoring (using storage replication adaptors, SRAs to SRM) can begin.  Once all that is ready and operating, SRM can provide one-button failover to the recovery site.
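To make the run book idea concrete, here is a minimal Python sketch of a recovery plan as an ordered list of steps driven by the protected/recovery site mappings. All names here (sites, VMs, data stores, functions) are hypothetical illustrations of the concept, not SRM's actual API.

```python
# Hypothetical sketch of SRM-style run book automation: a recovery
# plan is an ordered list of steps that move the recovery site from
# "inactive" to "active". Names are illustrative, not SRM's API.

protected_site = {
    "vms": ["web01", "db01"],
    "network": "vlan-100",
    "datastore": "fc-lun-7",
}

recovery_site = {
    "vms": ["web01-r", "db01-r"],
    "network": "vlan-200",
    "datastore": "fc-lun-7-replica",
}

def recovery_plan(protected, recovery):
    """Yield run book steps in failover order."""
    yield f"stop replication to {recovery['datastore']}"
    yield f"promote {recovery['datastore']} to read/write"
    yield f"attach recovery network {recovery['network']}"
    for vm in recovery["vms"]:
        yield f"power on {vm}"

def failover(protected, recovery):
    """One-button failover: materialize and run each step in sequence."""
    return list(recovery_plan(protected, recovery))

for step in failover(protected_site, recovery_site):
    print(step)
```

The point of keeping the plan declarative is that SRM can replay the same steps for a non-disruptive test as for a real failover.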

SRM v4.1 supports the following:

  • NFS data stores can now be protected as well as iSCSI and FC LUN data stores.  Recall that a VM data store (essentially a virtual machine device or drive letter) can be hosted on VMFS over LUNs or as NFS files.  NFS data stores have recently become more popular with the proliferation of virtual machines under vSphere 4.1.
  • Raw device mapping (RDM) LUNs can now be protected. Recall that RDM is another way for performance-sensitive VMs to access devices directly, eliminating the need for a data store and its hypervisor I/O overhead.
  • Shared recovery sites are now supported. One recovery site can now support multiple protected sites, so a single secondary site can support failover from multiple primary sites.
  • Role-based access security is now supported for recovery scripts and other SRM administration activities. In this way fine-grained security roles can be defined that protect against unauthorized use of SRM capabilities.
  • Recovery site alerting is now supported. SRM now fully monitors recovery site activity and can report on and alert operations staff when problems occur which may impact failover to the recovery site.
  • SRM test and actual failover can now be initiated and monitored directly from vCenter Server. This provides the vCenter administrator significant control over SRM activities.
  • SRM automated testing can now use storage snapshots.  One advantage of SRM is the ability to automate DR testing, which can be done onsite using local equipment. Snapshots eliminate the need for storage replication in local DR tests.

There were many other minor enhancements to SRM since v1.0 but these seem the major ones to me.

The only things lacking seem to be some form of automated failback and three-way failover.  I'll talk about three-way failover later.

But without automated failback, the site administrator must reconfigure the two sites, reversing the designation of protected and recovery sites, re-mirror the data in the opposite direction, and recreate recovery scripts to automate bringing the primary site back up.

However, failback is likely not as time sensitive as failover and could very well be a scheduled activity, taking place over a much longer time period. The reconfiguration could be scripted ahead of time or handled in a more manual fashion.

Other DR capabilities

At last year's EMCWorld, VPLEX was announced, which provides for a federation of data centers, or as I called it at the time, Data-at-a-Distance (DaaD).  DaaD together with VMware's vMotion could provide a level of disaster avoidance (see my post on VPLEX surfaces at EMCWorld) previously unattainable.

No doubt cluster services from Microsoft Cluster Server (MSCS), Symantec Veritas Cluster Services (VCS), and others have also been updated.  In some (mainframe) cluster services, N-way or cascaded failover is starting to be supported.  For example, a 3-way DR scenario has a primary site synchronously replicated to a secondary site, which is in turn asynchronously replicated to a third site.  If the region containing the primary and secondary sites is impacted by a disaster, the tertiary site can be brought online. Such capabilities are not yet available for virtual machine DR, but it's only a matter of time.
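The cascaded 3-way scenario can be sketched in a few lines of Python: the synchronous leg updates primary and secondary before the write is acknowledged, while the asynchronous leg queues writes for later delivery to the tertiary site. This is purely illustrative and not any vendor's implementation.

```python
# Illustrative sketch of 3-way cascaded replication: writes land
# synchronously on primary and secondary, and are forwarded
# asynchronously from the secondary to the tertiary site.
from collections import deque

class Site:
    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block id -> data

class CascadedReplication:
    def __init__(self, primary, secondary, tertiary):
        self.primary, self.secondary, self.tertiary = primary, secondary, tertiary
        self.async_queue = deque()   # writes awaiting async forwarding

    def write(self, block, data):
        # Synchronous leg: primary AND secondary updated before the ack.
        self.primary.blocks[block] = data
        self.secondary.blocks[block] = data
        # Asynchronous leg: queued for later transmission to tertiary.
        self.async_queue.append((block, data))
        return "ack"   # host sees the write complete here

    def drain_async(self):
        # Background task applies queued writes at the tertiary site.
        while self.async_queue:
            block, data = self.async_queue.popleft()
            self.tertiary.blocks[block] = data

repl = CascadedReplication(Site("primary"), Site("secondary"), Site("tertiary"))
repl.write(1, "payload")
print(repl.tertiary.blocks)   # {} -- tertiary lags until the queue drains
repl.drain_async()
print(repl.tertiary.blocks)   # {1: 'payload'}
```

The lag in the async queue is exactly why, after a regional disaster, the tertiary site may be slightly behind but still consistent enough to bring online.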

—–

Disaster recovery technologies are not standing still and VMware SRM is no exception. I am sure that a couple of years from now SRM will be even more capable and other storage vendors will provide DaaD capabilities to rival VPLEX.  What the cluster services folks will be doing by then, I can't even imagine.

Comments?

 

Caching DaaD for federated data centers

Internet Splat Map by jurvetson (cc) (from flickr)

Today, I attended a webinar where Pat Gelsinger, President of Information Infrastructure at EMC, discussed their concept for a new product based on the Yotta Yotta technology they acquired a few years back.  Yotta Yotta's product was a distributed, coherent caching appliance with FC front-end ports, an internal InfiniBand appliance network, and both FC and WAN back-end links.

What one did with Yotta Yotta nodes was place them in front of your block storage, connect them together via InfiniBand locally and via a WAN technology (of your choice, then) remotely, and then access any data behind the appliances from any attached location.  They also provided very quick transfer of bulk data between remote nodes. So their technology allowed very rapid data transmission over standard WAN interfaces/distances and provided a distributed cache across those very same distances to the data behind the appliances.

I like caching appliances as much as anyone, but they became prominent only in the late '70s and early '80s, mostly because caching technology was hard to do with the storage subsystems of the day, and they went away a long time ago.  Nowadays you can barely purchase a lone disk drive without a cache in it.  So what's different?

Introducing DaaD

Today we have SSDs and much cheaper processing power.  I wrote about new caching appliances like DataRam's XcelaSAN in a Cache appliances rise from the dead post I did after last year's SNW.  But EMC's going after a slightly broader domain – the world.  The caching appliance that EMC is discussing is really intended to support distributed data access, or as I like to call it, Data-at-a-Distance (DaaD).

How can this work?  Data is stored on subsystems at various locations around the world.  A DaaD appliance is inserted in front of each of these and connected over the WAN. Some or all of that data is then re-configured (at the block or, more likely, LUN level) to be accessible at a distance from each DaaD data center.  As each data center reads and writes data from/to its remote brethren, some portion of that data is cached locally in the DaaD appliance and the rest is only available by going to the remote site (with considerably higher latency).

This works moderately well for well-behaved, read-intensive workloads where 80% of the I/O goes to 20% of the data (most of which is cached locally).  But block writes present a particularly nasty problem, as any data write has to be propagated to all cached copies before it can be acknowledged.
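A quick back-of-envelope calculation shows why the read case is tolerable. The latency figures here (1 ms for a local cache hit, 50 ms for a remote miss over the WAN) are assumed round numbers for illustration only:

```python
# Effective read latency for a DaaD cache under the 80/20 pattern
# described above. Latencies are illustrative assumptions, not
# measured numbers: 1 ms local cache hit, 50 ms remote WAN miss.

local_ms, remote_ms = 1.0, 50.0
hit_rate = 0.80   # 80% of reads land on the locally cached 20% of data

effective_ms = hit_rate * local_ms + (1 - hit_rate) * remote_ms
print(f"{effective_ms:.1f} ms average read latency")   # 10.8 ms
```

Even with a high hit rate, the occasional 50 ms miss dominates the average, which is why writes (which must touch every remote copy) are so much worse.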

It's possible write propagation could be done by invalidating the data in cache (so any subsequent read would need to re-access the data from the original host).  Nevertheless, to even know which DaaD nodes hold a cached copy of a particular block, one needs to maintain a dictionary of all globally identifiable blocks held in any DaaD cache node at every moment in time.  Any such table would change often and would need to be updated very carefully – deadlock-free, and atomically with non-failable transactions – therein lies one of the technological hurdles.  Doing this quickly, without impacting performance, is another.
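Here is a minimal sketch of that invalidate-on-write approach, assuming a single global directory that maps each block to the set of nodes caching it. It deliberately ignores the hard parts just described (atomic, deadlock-free directory updates across the WAN), and every name in it is hypothetical:

```python
# Minimal sketch of invalidate-on-write with a global cache directory:
# a write removes every remote cached copy of the block before it is
# acknowledged, so later remote reads re-fetch the new data.

class Directory:
    def __init__(self):
        self.holders = {}   # block id -> set of node names caching it

class DaaDNode:
    def __init__(self, name, directory, backing_store):
        self.name, self.directory, self.store = name, directory, backing_store
        self.cache = {}

    def read(self, block):
        if block in self.cache:            # local cache hit
            return self.cache[block]
        data = self.store[block]           # miss: go to backing storage
        self.cache[block] = data
        self.directory.holders.setdefault(block, set()).add(self.name)
        return data

    def write(self, block, data):
        # Invalidate every other node's cached copy before acknowledging.
        for holder in self.directory.holders.get(block, set()) - {self.name}:
            nodes[holder].cache.pop(block, None)
        self.directory.holders[block] = {self.name}
        self.cache[block] = data
        self.store[block] = data

store = {42: "old"}
directory = Directory()
nodes = {n: DaaDNode(n, directory, store) for n in ("nyc", "lon")}

nodes["lon"].read(42)          # lon now caches block 42
nodes["nyc"].write(42, "new")  # invalidates lon's cached copy
print(nodes["lon"].read(42))   # lon misses and re-fetches: "new"
```

In a real system the directory itself would be distributed and every update would need the careful, transactional handling described above; this single-dictionary version exists only to show the bookkeeping involved.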

So, simple enough: EMC takes Yotta Yotta's technology, updates it for today's processors, networking, and storage, and releases it as a data center federation enabler. What one can do with a federated data center is another question – it involves vMotion, and must be the subject of a future post …