Preventing 3 common Disaster Recovery scenarios

The success of a disaster recovery strategy is often judged by whether it is actually implemented. Although no prevention method is 100% foolproof, avoiding risk and preparing proactively are essential elements of the disaster recovery process.

On the flip side, despite all the measures we take to avoid a disaster, we must assume that one will happen. Having this mindset will help shape our decisions when planning for IT disaster recovery scenarios.

Here we look at prevention and recovery methods for three such common scenarios.

Scenario 1: The OS becomes corrupt

This scenario considers a situation where the operating system of one of your servers becomes corrupt, but the underlying data is still there. This might be caused by a failed Windows update, malware, or a non-graceful shutdown.

How do I prevent this?

Always test Windows updates or new software prior to deployment. This should ideally be done on a test system built from the same image as the production server. Antivirus applications should also be kept up to date on every node in your network.

Be sure to restrict access to the server room to prevent malicious or accidental modifications to the physical aspects of the servers. Have a UPS in place to cater for power outages and handle graceful shutdowns.

How do I recover from this?

Use a standby image. The idea here is to restore from a snapshot of your system prior to when the OS became corrupt.
Perform a bare-metal restore. Once the OS is back up and running, use a backup application that supports delta recovery to quickly bring the underlying data back online.

Scenario 2: The hard drive dies

This scenario considers a situation where the hard drive of one of your servers suddenly fails. This might be caused by a broken RAID set, overheating, or a mechanical fault.

How do I prevent this?

The occasional hardware fault is an accepted fact of life with modern IT equipment. However, there are things we can do to limit the impact of a failed drive. These include having the correct RAID configuration in place and/or a replication mechanism with auto-failover. Keeping IT equipment cool (e.g. using air conditioning) is also vital: if equipment operates in a sub-optimal physical environment, the risk of overheating increases.

How do I recover from this?

Use a backup solution that has Continuous Recovery and preconfigure a virtual machine (VM) to be on standby. When the production machine goes down, all you need to do is press play on the preconfigured VM.
Perform a bare-metal restore from local storage to new hardware.
Recover from the cloud using your MSP’s infrastructure, Azure, or AWS.
Restore from a mountable VHD. If you previously created a standby image of your computer, you can quickly open it in your virtual environment using Hyper-V or VirtualBox.
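
Knowing when to "press play" on a standby VM is easier if something is watching the production machine. The following is a minimal sketch, assuming a plain TCP reachability probe is an adequate health signal for your server. The function names, the host and port, and the retry thresholds are all hypothetical, and actually starting the VM is left to whatever your hypervisor or backup solution provides.

```python
import socket
import time


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def should_fail_over(host: str, port: int, attempts: int = 3,
                     delay: float = 1.0) -> bool:
    """Return True only if every probe fails, so one blip is not treated as an outage."""
    for attempt in range(attempts):
        if is_reachable(host, port):
            return False
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next probe
    return True
```

Requiring several consecutive failures is a deliberate choice here: starting a standby VM while production is still alive can cause more trouble than a short delay in failing over.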

Scenario 3: The roof caves in

This scenario considers a situation where physical damage occurs to the building or room where your data resides. This is usually caused by a natural (environmental) disaster: hurricanes, tornadoes, floods, and excess snowfall are just some examples.

How do I prevent this?

This is probably the toughest scenario to prevent. ‘Acts of God’ (as defined by insurance companies) are unpredictable and can happen without any forewarning. The only sure-fire way to lessen the impact of a physical disaster is to have a replicated copy of the data and IT environment in a completely separate geographical location (e.g. in the cloud, at another regional office, or at your MSP).
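
A replicated copy only helps if it is reasonably current. As a rough sketch, assuming the off-site replica is visible as a directory (for example a mounted share) and that file modification times reflect when data was last replicated, the check below reports whether the newest file is within an acceptable age. The function names and the `max_age_seconds` threshold are hypothetical and should be set to however much data loss your business can tolerate.

```python
import time
from pathlib import Path
from typing import Optional


def newest_file_age_seconds(replica_dir: str,
                            now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the most recently modified file, or None if there are no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime
              for p in Path(replica_dir).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def replica_is_fresh(replica_dir: str, max_age_seconds: float,
                     now: Optional[float] = None) -> bool:
    """Return True if the replica contains a file newer than the allowed age."""
    age = newest_file_age_seconds(replica_dir, now)
    # An empty replica is never considered fresh.
    return age is not None and age <= max_age_seconds
```

A scheduled job could call `replica_is_fresh` and raise an alert when it returns `False`, so a stalled replication is noticed before the copy is actually needed.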

How do I recover from this?

Recover from the cloud using your MSP’s infrastructure, Azure, or AWS.
Use a backup solution that has Continuous Recovery and preconfigure a virtual machine (VM) in a remote location to be on standby. When the production machines go down, all you need to do is press play on the preconfigured VM.

Note: To keep your business going, your Business Continuity Plan (BCP) should include your company’s strategy for how people can work from remote locations (e.g. their home).

Conclusion

I’ve always believed in the old adage of “prevention is better than cure”. This holds true for disaster recovery scenarios as well. Being proactive is a better approach than being reactive. However, when we are dealing with unpredictable situations beyond our control, we also need to ensure we have the people, processes and methods in place to react as quickly as possible and help bring things back to normal.

With this in mind, it would be wise to have a system in place that allows for speedy and reliable recovery of data. If you do this and run routine restore tests, you will be best placed to minimize the impact of any IT disaster scenario that you are faced with.
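
A routine restore test can be as simple as restoring to a scratch location and confirming that the files match what was backed up. The sketch below is one possible approach, assuming both the reference data and the restored copy are available as directory trees. It compares SHA-256 digests file by file, and the function names are illustrative rather than part of any particular backup product.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    base = Path(root)
    return {p.relative_to(base).as_posix(): file_digest(p)
            for p in base.rglob("*") if p.is_file()}


def verify_restore(source_root: str, restored_root: str) -> List[str]:
    """List files that are missing from, or differ in, the restored tree."""
    src = snapshot(source_root)
    dst = snapshot(restored_root)
    problems = []
    for rel in sorted(src):
        if rel not in dst:
            problems.append(f"missing: {rel}")
        elif src[rel] != dst[rel]:
            problems.append(f"changed: {rel}")
    return problems
```

An empty list means every reference file was restored intact. On a live system the source may have changed since the backup ran, so in practice it may be better to compare against digests recorded at backup time.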

By Andrew Tabona
