Correlated risk

Aerial view of damage to Wakuya, Japan following earthquake. by Official U.S. Navy... (cc) (from Flickr)

What’s the chance that

  • an earthquake at sea could knock out primary power and generate a tsunami that would also knock out the backup generators for a nuclear power plant's emergency cooling equipment (1 in 40 yrs),
  • an overextended, speculative market segment would collapse and cause widespread ruin, taking down both equity and bond markets and forcing hundreds of financial institutions under (1 in 77 yrs), or
  • a hurricane would destroy flood barriers and flood your home, your office and the place you store your backups (?)

All of these represent correlated risks that, prior to the actual event, were deemed very improbable. But high improbability doesn't mean an event will never happen.

Correlated risk defined

A correlated risk is the risk of a subsequent disaster or event occurring after a primary event or catastrophe has occurred. In the case of natural disasters, any event that is generated as a consequence of an originating event's occurrence is a correlated event and, as such, carries a correlated risk.

I once worked for a major company that kept its disaster recovery backups in an underground basement on the same campus as its headquarters. This seemed risky, as any event that took out the campus could potentially damage that basement, and all the associated backup tapes, as well.

How to understand your correlated risk

It seems to me pretty straightforward to understand correlated risk within the framework of a business continuity or disaster recovery (BC or DR) plan. In one column, list all possible primary accidents, calamities, disasters, etc., man-made or natural; in another column, list the other accidents, calamities and disasters that could be generated by each primary event.

Then recurse on this process, generating the correlated events associated with each primary or previously identified correlated event, until you exhaust all possible chains of catastrophes stemming from the primary disaster. In a third column, list the potential scope (distance or area impacted) and outcome (what damage could be expected) of every event in the first two columns. In a fourth column, list the best-guess probability of each primary and/or correlated event occurring.

In the end, you should have an exhaustive list of things you should be preparing for. Now rank the events in probability order and tackle them from highest to lowest probability (a small sketch of this appears below). There is some cutoff point that everyone reaches depending on their risk tolerance; at some point, dealing with all the disasters that could potentially occur becomes too costly. But where that cutoff falls depends entirely on risk tolerance. For instance, a nuclear plant can tolerate far less risk than your average corporate environment, so its cutoff must sit much further down the probability list.
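To make those columns concrete, here is a minimal sketch in Python of how one might enumerate the chains of correlated events, rank them by probability, and apply a cutoff; the event names, scopes, probabilities and tolerance value are all made-up illustrations, not real estimates.

```python
# A minimal sketch of the four columns described above; event names, scopes,
# probabilities and the tolerance cutoff are made up for illustration only.
from dataclasses import dataclass, field

@dataclass
class RiskEvent:
    name: str                  # columns 1 and 2: a primary or correlated event
    scope: str                 # column 3: area impacted / expected damage
    annual_probability: float  # column 4: best-guess chance per year
    correlated: list = field(default_factory=list)  # events this one can trigger

def flatten(event, chain=()):
    """Recurse through correlated events, yielding every chain of catastrophes."""
    chain = chain + (event.name,)
    yield " -> ".join(chain), event
    for follow_on in event.correlated:
        yield from flatten(follow_on, chain)

# Illustrative entries only -- not real probabilities.
primaries = [
    RiskEvent("building fire", "office only", 1 / 50),
    RiskEvent("wildfire", "office and possibly bank vault", 1 / 500,
              correlated=[RiskEvent("loss of power/transport/comms",
                                    "5+ mile radius", 1 / 500)]),
    RiskEvent("flood", "office, possibly bank", 1 / 100),
]

risk_table = [row for primary in primaries for row in flatten(primary)]

# Rank highest to lowest probability, then cut off at your risk tolerance.
risk_table.sort(key=lambda row: row[1].annual_probability, reverse=True)
TOLERANCE = 1 / 250   # hypothetical cutoff: ignore anything rarer than 1 in 250 yrs
to_prepare_for = [(chain, event.annual_probability)
                  for chain, event in risk_table
                  if event.annual_probability >= TOLERANCE]
print(to_prepare_for)
```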

With that in place, you have a start on a BC and/or DR plan. Now all you need to determine is your risk tolerance level and how to handle the primary and correlated risks that fall within that level.

A correlated risk analysis

Take Silverton Consulting as an example. I take daily incremental backups stored on a local hard disk, weekly “partial backups” (critical business files only) to removable media also stored locally in the office, and monthly full backups stored in a safety deposit box in a vault in the basement of a bank within five miles of the office.

If I just look at natural events:

  • My first and most likely natural event is a building fire – in this case the scope of the event would be limited to the building, which would take out both the local hard disk incrementals and the weekly partial backups, but the safety deposit box of monthly fulls would still be accessible.
  • A possible correlated event, as well as another primary event, could be a wildfire – in this case, potentially both the office and the bank could be consumed and all backups would be lost. The fact that the bank is 5 miles away, has its own fire suppression system, and keeps my backups in its basement only reduces the probability of a wildfire impacting both locations; it doesn't eliminate it.
  • Another possible event correlated to any wildfire would be the loss of power, transport, and communication services – the fact that the bank is only 5 miles away indicates that if the primary office loses these services, it's highly probable that the bank would lose them as well. Access to the bank vault backups under these circumstances would be delayed at best, at least until such services could be restored. Had I been using a cloud provider backup service (which I am considering), I couldn't access my data until communication services were restored or until I had moved far enough away to regain access to those services. With the roads and other transport out, this would take some time.
  • The next most likely natural event is a flood. Our location is within a 100-year flood plain, so a serious flood that would take out the office is possible once every 100 yrs. I would like to say that our bank is outside our flood plain, but I just don't know yet. But I promise to find out.
  • A correlated event to a flood is a loss of power, transport and communication services. The scope and consequences of this catastrophe are similar to those discussed above.
  • Next most likely natural event is tornado, …
  • Next most likely natural event is earthquake, …
  • Next most likely natural event is volcano eruption,

… and the list goes on. Of course, these are just natural disasters; one would need to consider man-made catastrophes as well.

In any event, all of these have a distinct, non-zero probability. One can come up with some estimate of the probability of such primary and correlated events through research and/or other means.
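One simple way to put a number on a correlated event is to multiply the probability of the primary event by the conditional probability of the follow-on event, given that the primary has occurred. The figures below are purely hypothetical, for illustration only:

```python
# Hypothetical figures, for illustration only -- not estimates for any real site.
p_wildfire_hits_office = 1 / 500          # primary event, chance per year
p_bank_also_lost_given_wildfire = 0.10    # correlated event, given the primary

# Chance per year of losing the office backups AND the bank-vault fulls together
p_lose_all_backups = p_wildfire_hits_office * p_bank_also_lost_given_wildfire
print(f"{p_lose_all_backups:.4%} per year, i.e. roughly 1 in {1 / p_lose_all_backups:.0f} years")
```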

For instance, I get a fortnightly email from the University of Colorado's Natural Hazards Center which occasionally provides some insight into these probabilities. Your corporation's insurance companies can potentially provide some guidance on these probabilities as well.

What is risk tolerance?

But at some point, only the company can determine its risk tolerance. I believe risk tolerance to be some combination of the money one is willing to invest and one's ability to invest it in mitigating risks. For example, let's say my company makes $10M a year in revenues. Given the importance of IT to my corporation's activities, a reasonable risk tolerance in dollar terms might be somewhere between 0.1% and 1.0% of revenues, or $10K to $100K. I must say I am probably spending more than that percentage of SCI revenues on my current DR activities, such as they are, but I include weekly and monthly backups in these costs (most would not include these activities in pure DR spending).
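As a quick check on that arithmetic (using the same hypothetical revenue figure):

```python
revenue = 10_000_000                     # $10M/yr, hypothetical
low, high = 0.001 * revenue, 0.010 * revenue
print(f"DR/BC budget band: ${low:,.0f} to ${high:,.0f}")   # $10,000 to $100,000
```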

—-

So as the disaster in Japan continues, let us pray that it works out well in the end for all parties.  But also let’s use this time to re-examine our risk tolerance and disaster recovery plans with respect to correlated risks.  Hopefully, we will all do better next time.

Comments?

Deskchecking BC/DR plans

Hurricane Ike - 2008/09/12 - 21:26 UTC by CoreBurn (cc) (from Flickr)

There was quite a lot of Twitter traffic this Wednesday on DR/BC, all documented in the #sanchat tweetchat sponsored by Compellent. In that discussion I mentioned a presentation I did a couple of years ago for StorageDecisions/Chicago on Successful Disaster Recovery Testing, where I discussed some of the techniques companies use to provide disaster recovery and how they validate these activities.

For those shops with the luxury of an owned or contracted-for “hot-site” or “warm-site”, DR testing should be an ongoing, periodic activity. In that presentation I suggested testing DR plans at least once a year, but more often if feasible. In this case a test is a “simulated disaster declaration” where operations are temporarily moved to an alternate site. I know of one European organization which tested their DR plans every week, but they owned the hot-site and their normal operations were split across the two sites.

For organizations that have “cold-sites” or no sites at all, the choices for DR testing are much more limited. In these situations, I recommended a way to deskcheck or walk through a BC/DR plan which doesn't involve any hardware testing. This is like a code or design inspection, but applied to a BC/DR plan.

How to perform a BC/DR plan deskcheck/walkthru

In a BC/DR plan deskcheck there are a few roles, namely a leader, a BC/DR plan owner, a recorder, and participants. The BC/DR deskcheck process looks something like this:

  1. Before the deskcheck, the leader identifies walkthru team members from operations, servers, storage, networking, voice, web, applications, etc.; circulates the current BC/DR plan to all team members; and establishes the meeting date-times.
  2. The leader decides which failure scenario will be used to test the DR/BC plan. This can be driven by the highest-probability scenario or by some form of equivalence testing. (In equivalence testing, one collapses the potential failure scenarios into a select set that has similar impacts.)
  3. In the pre-deskcheck meeting, the leader discusses the roles of the team members and identifies the failure scenario to be tested. IT staff and other participants are to determine the correctness of the DR/BC plan “from their perspective”. Every team member is expected to read the BC/DR plan before the deskcheck/walkthru meeting to identify problems with it ahead of time.
  4. At the deskcheck/walkthru meeting, the leader starts the session by describing the failure scenario and stating what, if any, data center, telecom and transport facilities are available, the state of the alternate site, and the current whereabouts of IT staff, establishing the preconditions for the BC/DR simulation. Team members should concur with this analysis or come to consensus on the scenario's impact on facilities, telecom, transport and staffing.
  5. Next, the owner of the plan describes the first or next step in detail, identifying all actions taken and their impact on the alternate site. Participants then determine whether the step performs the actions as stated. Also,
    1. Participants discuss the duration for the step to complete, to place everything on the same time track. For instance:
      1. T0: It's 7pm on a Wednesday; a fire/flood/building collapse occurs and knocks out the main data center; all online portals are down; all application users are offline; …; luckily, operations personnel are evacuated and their injuries are slight.
      2. T1: The head of operations is contacted and declares a disaster; activates the disaster site; calls the DR team to get on a conference call ASAP; …
      3. T2: The head of operations requests that backups be sent to the alternate site; personnel are contacted and told to travel to the DR site; contracts for servers, storage and other facilities at the DR site are activated; …
    2. The recorder pays particular attention to any problems brought up during the discussion, ties them to the plan step, identifies the originator of each issue, and notes its impact. Don't try to solve the problems; just record them and their impact.
    3. The leader or their designee maintains an official plan timeline in real time. This timeline can be kept on a whiteboard or in an Excel/Visio chart displayed for all to see (a minimal sketch of such a timeline and problem log follows this list). The timeline documentation can be kept as a formal record of the walkthru, along with the problem list and the BC/DR plan.
    4. This step is iterated for every step in the BC/DR plan until the plan is completed.
  6. At the end, the recorder lists all the problems encountered and provides a copy to the plan owner.
  7. The team decides whether another deskcheck review is warranted for this failure scenario (depending on the number and severity of the problems identified).
  8. When the owner of the plan has resolved all the issues, he or she reissues the plan to everyone who was at the meeting.
  9. If another deskcheck is warranted, the leader issues another meeting call.
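For teams that want something a bit more structured than a whiteboard, here is a minimal sketch, in Python with hypothetical field names and made-up entries, of the kind of timeline and problem log the leader and recorder might keep during the walkthru:

```python
# A minimal sketch of the recorder's and leader's artifacts during a deskcheck.
# Field names and sample entries are hypothetical; adapt them to your own plan.
from dataclasses import dataclass

@dataclass
class TimelineEntry:
    label: str            # e.g. "T1"
    plan_step: int        # which step of the BC/DR plan this point covers
    description: str      # what happens at this point in the simulation
    elapsed_hours: float  # cumulative time since the disaster occurred

@dataclass
class Problem:
    plan_step: int        # plan step the issue is tied to
    raised_by: str        # originator of the issue
    severity: str         # "major" or "minor"
    impact: str           # recorded impact; no solutions during the walkthru

timeline = [
    TimelineEntry("T0", 1, "7pm Wednesday: main data center lost", 0.0),
    TimelineEntry("T1", 2, "Head of operations declares a disaster", 1.0),
    TimelineEntry("T2", 3, "Backups requested; staff travel to DR site", 4.0),
]

problems = [
    Problem(3, "storage admin", "minor", "backup courier time underestimated"),
    Problem(5, "apps team", "major", "mission critical app has no restart doc"),
]

# At the end of the meeting, the recorder hands the problem list to the plan owner.
for p in sorted(problems, key=lambda p: p.severity):  # "major" sorts before "minor"
    print(f"step {p.plan_step} [{p.severity}] from {p.raised_by}: {p.impact}")
```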

This can take anywhere from half a day to a couple of days. BUT deskchecking your BC/DR plan can be significantly less costly than any actual test.  Nevertheless, a deskcheck cannot replace an actual BC/DR plan simulation test on real hardware/software.

Some other hints from code and design inspections

  • For code or design inspections, a checklist of high-probability errors is used to identify and familiarize everyone with those errors. A similar checklist can focus participants' review on the most probable problems in a BC/DR plan; the leader can discuss these most likely errors at the pre-deskcheck meeting.
  • Also, problems are given severities, such as major or minor. For example, a BC/DR plan “minor” problem might be an inadequate duration estimate for an activity. A “major” problem might be a mission-critical app not coming up after a disaster.

So that's what a BC/DR plan deskcheck would look like. If you did a BC/DR plan deskcheck once a quarter, you would probably be doing better than most. And if, on top of that, you did a yearly full-scale DR simulation on real hardware, you would be considered well prepared in my view. What do you think?