Deskchecking BC/DR plans

Hurricane Ike - 2008/09/12 - 21:26 UTC by CoreBurn (cc) (from Flickr)

There was quite a lot of Twitter traffic this Wednesday on DR/BC, all documented under #sanchat, the tweetchat sponsored by Compellent. In that discussion I mentioned a presentation I did a couple of years ago for StorageDecisions/Chicago on Successful Disaster Recovery Testing, where I discussed some of the techniques companies use to provide disaster recovery and how they validate them.

For those shops with the luxury of an owned or contracted “hot-site” or “warm-site”, DR testing should be an ongoing, periodic activity. In that presentation I suggested testing DR plans at least once a year, and more often if feasible. In this case a test is a “simulated disaster declaration” where operations are temporarily moved to an alternate site. I know of one European organization that tested its DR plans every week, but it owned the hot-site and its normal operations were split across the two sites.

For organizations that have “cold-sites” or no sites, the choices for DR testing are much more limited. In these situations, I recommended a way to deskcheck or walkthru a BC/DR plan that doesn’t involve any hardware testing. This is like a code or design inspection, but applied to a BC/DR plan.

How to perform a BC/DR plan deskcheck/walkthru

In a BC/DR plan deskcheck there are a few roles, namely a leader, a BC/DR plan owner, a recorder,  and participants.  The BC/DR deskcheck process looks something like:

  1. Before the deskcheck, the leader identifies walkthru team members from operations, servers, storage, networking, voice, web, applications, etc.; circulates the current BC/DR plan to all team members; and establishes the meeting date-times.
  2. The leader decides which failure scenario will be used to test the DR/BC plan.  This can be driven by the highest probability or use some form of equivalence testing. (In equivalence testing one collapses the potential failure scenarios into a select set which have similar impacts.)
  3. In the pre-deskcheck meeting,  the leader discusses the roles of the team members and identifies the failure scenario to be tested.  IT staff and other participants are to determine the correctness of the DR/BC plan “from their perspective”.  Every team member is supposed to read the BC/DR plan before the deskcheck/walkthru meeting to identify problems with it ahead of time.
  4. At the deskcheck/walkthru meeting, the leader starts the session by describing the failure scenario and stating which, if any, data center, telecom and transport facilities are available, the state of the alternate site, and the current whereabouts of IT staff. This establishes the preconditions for the BC/DR simulation. Team members should concur with this analysis or come to consensus on the scenario’s impact on facilities, telecom, transport and staffing.
  5. Next, the owner of the plan describes the first or next step in detail, identifying all actions taken and their impact on the alternate site. Participants then determine whether the step performs the actions as stated. Also,
    1. Participants discuss how long the step takes to complete, to place everything on the same time track. For instance, at
      1. T0: it’s 7pm on a Wednesday, a fire-flood-building collapse occurs, knocks out the main data center, all online portals are down, all application users are offline, …, luckily operations personnel are evacuated and their injuries are slight.
      2. T1: Head of operations is contacted and declares a disaster; activates the disaster site; calls up the DR team to get on a conference call ASAP, …
      3. T2: Head of operations requests backups be sent to the alternate site; personnel are contacted and told to travel to the DR site; contracts for servers, storage and other facilities at the DR site are activated; …
    2. The recorder pays particular attention to any problems brought up during the discussion, ties them to the plan step, identifies the originator of the issue, and discusses its impact. Don’t try to solve the problems; just record them and their impact.
    3. The leader or their designee maintains an official plan timeline in real time. This timeline can be kept on a whiteboard or in an Excel/Visio chart displayed for all to see. The timeline can be kept as a formal record of the walkthru, along with the problem list and the BC/DR plan.
    4. This step is iterated for every step in the BC/DR plan until the plan is completed.
  6. At the end, the recorder lists all the problems encountered and provides a copy to the plan owner.
  7. The team decides if another deskcheck review is warranted on this failure scenario (depends on the number and severity of the problems identified).
  8. When the owner of the plan has resolved all the issues, he or she reissues the plan to everyone that was at the meeting.
  9. If another deskcheck is warranted, the leader issues another meeting call.

This can take anywhere from half a day to a couple of days. BUT deskchecking your BC/DR plan can be significantly less costly than any actual test.  Nevertheless, a deskcheck cannot replace an actual BC/DR plan simulation test on real hardware/software.
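
The timeline and problem list kept during the walkthru could just as easily live in a small script as on a whiteboard. Here is a minimal Python sketch, assuming each plan step starts when the previous one ends. The class names, step names, date and durations are all hypothetical and only illustrate the bookkeeping:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Problem:
    step: str         # plan step the problem is tied to
    originator: str   # who raised the issue
    description: str
    impact: str


@dataclass
class WalkthruLog:
    start: datetime                       # T0, when the failure scenario begins
    timeline: list = field(default_factory=list)
    problems: list = field(default_factory=list)

    def add_step(self, name, duration):
        """Append a plan step; it begins when the previous step ends."""
        begins = self.timeline[-1][2] if self.timeline else self.start
        self.timeline.append((name, begins, begins + duration))

    def record_problem(self, step, originator, description, impact):
        """Record only; resolving problems is left to the plan owner."""
        self.problems.append(Problem(step, originator, description, impact))

    def elapsed(self):
        """Simulated time from T0 to the end of the last step."""
        end = self.timeline[-1][2] if self.timeline else self.start
        return end - self.start


# Illustrative values only: 7pm on a Wednesday, with made-up step durations.
log = WalkthruLog(start=datetime(2010, 1, 6, 19, 0))
log.add_step("T1: declare disaster, activate DR site", timedelta(hours=1))
log.add_step("T2: request backups, dispatch personnel", timedelta(hours=4))
log.record_problem(
    step="T2: request backups, dispatch personnel",
    originator="storage admin",
    description="Backup courier is not reachable after hours",
    impact="Backups arrive late at the alternate site",
)

for name, begins, ends in log.timeline:
    print(f"{begins:%a %H:%M} to {ends:%a %H:%M}  {name}")
print("Elapsed since T0:", log.elapsed())
```

Keeping the problems tied to step names makes it easy for the recorder to hand the plan owner a list sorted by plan step at the end of the meeting.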

Some other hints from code and design inspections

  • For code or design inspections, a checklist of high probability errors is used to identify and familiarize everyone with these errors.  Checklists can focus participant review on the most probable errors. The leader can discuss these most likely errors at the pre-deskcheck meeting.
  • Also, problems are given severities, like major or minor problems.  For example,  a BC/DR plan “minor” problem might be an inadequate duration estimate for an activity.  A “major” problem might be a mission critical app not coming up after a disaster.
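
Severities also give the team a simple way to decide whether another deskcheck is warranted. Below is a minimal Python sketch, assuming hypothetical thresholds: any major problem, or more than five minor ones, triggers a repeat. Neither the function name nor the cutoffs come from the presentation, so pick values that suit your shop:

```python
from collections import Counter


def needs_another_deskcheck(severities, max_major=0, max_minor=5):
    """Return True if the problem counts exceed the (assumed) thresholds."""
    counts = Counter(severities)
    return counts["major"] > max_major or counts["minor"] > max_minor


# Severities as the recorder might list them at the end of the meeting.
print(needs_another_deskcheck(["minor", "major", "minor"]))
print(needs_another_deskcheck(["minor", "minor"]))
```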

So that’s what a BC/DR plan deskcheck would look like. If you did a BC/DR plan deskcheck once a quarter, you would probably be doing better than most.  And if, on top of that, you did a yearly full scale DR simulation on real hardware, you would be considered well prepared in my view.  What do you think?