Disaster Recovery: Designing for Resilience

Disaster Recovery (DR) is the capability to restore systems to operation after a catastrophic failure — data centre outage, ransomware attack, accidental deletion, or major infrastructure failure. Good DR requires planning, investment in redundancy, and regular testing — not just backup files.

Disaster Scenarios to Plan For

  • Cloud region failure (rare but possible — AWS, Azure, and GCP have all experienced regional outages)
  • Ransomware encrypting all accessible data including online backups
  • Accidental deletion of critical database or configuration
  • Cloud account compromise or suspension
  • Critical third-party service failure

DR Architecture Approaches

  • Backup and restore: Regular backups stored offsite. Simplest and cheapest. RTO (recovery time objective, the target time to restore service) is hours to days. Appropriate for non-critical systems.
  • Pilot light: Core infrastructure pre-provisioned in a secondary region in a minimal state, with data replication running continuously. RTO is hours, because the pilot light must be scaled up to full capacity before it can take traffic.
  • Warm standby: Secondary environment running at reduced scale, constantly synchronised. RTO in minutes.
  • Multi-site active-active: Full-scale deployment in multiple regions simultaneously. RTO near-zero. Highest cost.
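
As a rough illustration of how these approaches trade cost against recovery time, the sketch below picks the cheapest approach whose typical RTO fits a given target. The names (Strategy, STRATEGIES, cheapest_strategy) and the specific RTO figures are assumptions chosen only to match the ranges described above; they are not measured values and should be replaced with numbers from your own DR tests.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta  # illustrative assumption, not a measured figure


# Ordered from cheapest to most expensive, following the list above.
STRATEGIES = [
    Strategy("backup and restore", timedelta(hours=24)),
    Strategy("pilot light", timedelta(hours=4)),
    Strategy("warm standby", timedelta(minutes=15)),
    Strategy("multi-site active-active", timedelta(minutes=1)),
]


def cheapest_strategy(rto_target: timedelta) -> Strategy:
    """Return the cheapest strategy whose typical RTO meets the target.

    If the target is stricter than every listed strategy, fall back to
    the last (most resilient, most expensive) one.
    """
    for strategy in STRATEGIES:
        if strategy.typical_rto <= rto_target:
            return strategy
    return STRATEGIES[-1]


if __name__ == "__main__":
    for target in (timedelta(days=2), timedelta(hours=6), timedelta(minutes=30)):
        print(target, "->", cheapest_strategy(target).name)
```

In practice the choice also depends on RPO, budget, and compliance requirements, so a lookup like this is a starting point for discussion rather than a decision procedure.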

DR Testing

A DR plan that has never been tested is not a reliable DR plan. We conduct regular DR tests: simulating failure scenarios and measuring actual RTO and RPO (recovery point objective, the maximum tolerable window of data loss) against targets. Testing reveals gaps in documentation, automation failures, and data replication issues before they matter.
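
A minimal sketch of how a DR test result might be compared against targets follows. It assumes three timestamps are recorded during the test: when the simulated failure was triggered, the timestamp of the newest data present after recovery, and when service was confirmed healthy. The class and function names (DrTestResult, evaluate) are hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestResult:
    failure_at: datetime          # when the simulated failure was triggered
    last_replicated_at: datetime  # newest data present after recovery
    restored_at: datetime         # when service was confirmed healthy again

    @property
    def actual_rto(self) -> timedelta:
        # Time the service was unavailable.
        return self.restored_at - self.failure_at

    @property
    def actual_rpo(self) -> timedelta:
        # Window of data written before the failure but lost in recovery.
        return self.failure_at - self.last_replicated_at


def evaluate(result: DrTestResult,
             rto_target: timedelta,
             rpo_target: timedelta) -> list[str]:
    """Return a list of gaps; an empty list means both targets were met."""
    gaps = []
    if result.actual_rto > rto_target:
        gaps.append(f"RTO missed: {result.actual_rto} > {rto_target}")
    if result.actual_rpo > rpo_target:
        gaps.append(f"RPO missed: {result.actual_rpo} > {rpo_target}")
    return gaps


if __name__ == "__main__":
    result = DrTestResult(
        failure_at=datetime(2024, 1, 10, 9, 0),
        last_replicated_at=datetime(2024, 1, 10, 8, 50),
        restored_at=datetime(2024, 1, 10, 9, 45),
    )
    # Example targets (assumed): 30 minutes RTO, 15 minutes RPO.
    for gap in evaluate(result, timedelta(minutes=30), timedelta(minutes=15)):
        print(gap)
```

Recording these figures for every test makes it possible to see whether recovery performance is improving or drifting over time.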