Improving disaster recovery with a single word

Nick Cavalancia

As a kid, I used to watch an animated short called The Adventures of Letterman, in which our hero, Letterman, would save the day by taking the single letter from his shirt and using it to replace a letter in a word (for example, turning “night” into “light”). In those stories, a single letter made all the difference.

When it comes to disaster recovery, it’s a single word.

The most critical part of disaster recovery (DR) shouldn’t be the recovery; it should be the testing of the recovery. After all, if you’ve properly tested your recovery, the actual process should go smoothly, right? Not necessarily. Recovery testing is often an isolated exercise that takes into account only the server, application, or data set to be recovered, and ignores any related systems that may affect your recovery.

And when it comes to meeting a specific recovery time objective (RTO), you can’t afford to have something go wrong during an actual recovery.

So, how can you ensure you have the recovery process detailed down to the last checkbox? Simple: be the Letterman of DR, and replace the word “testing” with the word “practice.”

Practice Makes Perfect

When you shift this one word and actually practice your recovery, the focus becomes less about just making sure the recovery process works and more about ensuring you know everything about it – what will work the first time, which services or servers need to be restarted, which data subsets will require additional restores, and how the recovery and the other systems around it affect each other.

So what’s involved with practicing?

  • More than just a restore – If your testing stops right after the restore, keep going. What steps come next? Restarting services? Rebooting a server? Follow the process through until you have a functional server.
  • More than just one application – Consider building a lab environment that mimics production as closely as possible, and test any interdependencies between the recovered application or server and other services on the network.
  • More than just one time – Practice implies repetition, so expect to do this far more often than a one-off test. That takes a significant time commitment, but depending on the criticality of the system in question, it may be well justified.
  • More than just a documented process – The recovery process also needs to plan for contingencies, which will show themselves as you practice recovery over time and as the configuration of your environment changes. Each problem that arises during practice becomes one more documented possibility that will make the actual recovery that much smoother.
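If you want to make practice runs repeatable, one option is to script the drill itself. The Python sketch below is only an illustration of the points above: it walks through every step past the restore, keeps going when a step fails so each problem gets recorded, and compares the total time against your RTO. The names Step, DrillResult, and run_drill are hypothetical, and the step actions are placeholders you would replace with your own restore, restart, and dependency checks.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One recovery step (hypothetical structure for this sketch)."""
    name: str
    action: Callable[[], bool]  # should return True when the step succeeded


@dataclass
class DrillResult:
    elapsed_seconds: float
    failed_steps: List[str] = field(default_factory=list)

    def met_rto(self, rto_seconds: float) -> bool:
        # A drill only counts if every step worked AND it finished in time.
        return not self.failed_steps and self.elapsed_seconds <= rto_seconds


def run_drill(steps: List[Step],
              clock: Callable[[], float] = time.monotonic) -> DrillResult:
    """Run every step in order, recording failures instead of stopping."""
    start = clock()
    failed = []
    for step in steps:
        try:
            ok = step.action()
        except Exception:
            ok = False  # an exception is treated as a failed step
        if not ok:
            failed.append(step.name)  # one more contingency to document
    return DrillResult(elapsed_seconds=clock() - start, failed_steps=failed)


if __name__ == "__main__":
    # Placeholder actions; real ones would call your own tooling.
    drill = [
        Step("restore data", lambda: True),
        Step("restart services", lambda: True),
        Step("check dependent systems", lambda: True),
    ]
    result = run_drill(drill)
    print("Failed steps:", result.failed_steps)
    print("Met a 4-hour RTO:", result.met_rto(4 * 60 * 60))
```

Running a script like this on a schedule, and adding each newly discovered failure to your documentation, is one way to turn a one-off test into regular practice.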

When it comes time to put that practice into execution, your actual recovery is far less likely to fail, your documentation is stronger, and you have the best chance of meeting your RTO.
