VMware disaster recovery

Thunderstorms over Alexandria, VA by mehul.antani (cc) (from Flickr)
Thunderstorms over Alexandria, VA by mehul.antani (cc) (from Flickr)

I did an article awhile ago for TechTarget on Virtual (machine) Disaster Recovery and discussed what was then the latest version of VMware Site Recovery Manager (SRM) v1.0 and some of it’s capabilities.

Well its been a couple of years since that came out and I thought it would be an appropriate time to discuss some updates to that product and other facilities that bear on virtual machine disaster recovery of today.

SRM to the rescue

Recall that VMware’s SRM is essentially a run book automation tool for system failover.  Using SRM, an administrator defines the physical and logical mapping between a primary site configuration of (protected site in SRM parlance) virtual machines, networking, and data stores and a secondary site (recovery site to SRM) configuration.

Once this mapping is complete, the administrator then creates recovery scripts (recovery plans to SRM) which take the recovery site in a step-by-step fashion from an “inactive” to an “active” state.  With the recovery scripts in hand, data replication can then be activated and monitoring (using storage replication adaptors, SRAs to SRM) can begin.  Once all that was ready and operating, SRM can provide one button failover to the recovery site.

SRM v4.1 supports the following:

  • NFS data stores can now be protected as well as iSCSI and FC LUN data stores.  Recall that a VMFS  (essentially a virtual machine device or drive letter) or a VM data store can be hosted on LUNs or as NFS files.  NFS data stores have recently become more popular with the proliferation of virtual machines under vSphere 4.1.
  • Raw device mode (RDM) LUNs can now be protected. Recall that RDM is another way to access devices directly for performance sensitive VMs eliminating the need to use a data store and  hyper-visor IO overhead.
  • Shared recovery sites are now supported. As such, one recovery site can now support multiple protected sites.  In this way a single secondary site can support failover from multiple primary sites.
  • Role based access security is now supported for recovery scripts and other SRM administration activities. In this way fine grained security roles can be defined that allow protection over unauthorized use of SRM capabilities.
  • Recovery site alerting is now supported. SRM now fully monitors recovery site activity and can report on and alert operations staff when problems occur which may impact failover to the recovery site.
  • SRM test and actual failover can now be initiated and monitored directly from vCenter serve. This provides the vCenter administrator significant control over SRM activities.
  • SRM automated testing can now use storage snapshots.  One advantage of SRM is the ability to automate DR testing which can be done onsite using local equipment. Snapshots eliminates the need for storage replication in local DR tests.

There were many other minor enhancements to SRM since v1.0 but these seem the major ones to me.

The only things lacking seem to be some form of automated failback and three way failover.  I’ll talk about 3-way failover later.

But without automated failback, the site administrator must reconfigure the two sites and reverse the designation of protected and recovery sites, re-mirror the data in the opposite direction and recreate recovery scripts to automate bringing the primary site back up.

However, failback is likely not to be as time sensitive as failover and could very well be a scheduled activity, taking place over a much longer time period. This can, of course all be handled automatically by SRM or be done in a more manual fashion.

Other DR capabilities

At last year’s EMCWorld VPLEX was announced which provided for a federation of data centers or as I called it at the time Data-at-a-Distance (DaaD).  DaaD together with VMware’s Vmotion could provide a level of  disaster avoidance (see my post on VPLEX surfaces at EMCWorld) previously unattainable.

No doubt cluster services from Microsoft Cluster Server (MSCS), Symantec Veritas Cluster Services (VCS)  and others have also been updated.  In some (mainframe) cluster services, N-way or cascaded failover is starting to be supported.  For example, a 3 way DR scenario has a primary site synchronously replicated to a secondary site which is asynchronously replicated to a third site.  If the region where the primary and secondary site is impacted by a disaster, the tertiary site can be brought online. Such capabilities are not yet available for virtual machine DR but it’s only a matter of time.

—–

Disaster recovery technologies are not standing still and VMware SRM is no exception. I am sure a couple of years from now SRM will be even more capable and other storage vendors will provide DaaD capabilities to rival VPLEX.   What the cluster services folks will be doing by that time I can’t even imagine.

Comments?