Disaster Recovery: Designing for Resilience
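Disaster Recovery (DR) is the capability to restore systems to operation after a catastrophic failure such as a data centre outage, ransomware attack, accidental deletion, or major infrastructure failure. Good DR requires planning, investment in redundancy, and regular testing, not just backup files. Two targets frame every DR design: the Recovery Time Objective (RTO), the maximum acceptable time to restore service, and the Recovery Point Objective (RPO), the maximum acceptable window of data loss.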
Disaster Scenarios to Plan For
- Cloud region failure (rare but possible — AWS, Azure, and GCP have all experienced regional outages)
- Ransomware encrypting all accessible data, including online backups
- Accidental deletion of a critical database or configuration
- Cloud account compromise or suspension
- Critical third-party service failure
DR Architecture Approaches
- Backup and restore: Regular backups stored offsite. Simplest and cheapest. RTO is hours to days. Appropriate for non-critical systems.
- Pilot light: Core infrastructure pre-provisioned in a secondary region in a minimal state, with data replication running continuously. RTO is in hours, because the minimal environment must be scaled up to full capacity at failover.
- Warm standby: Secondary environment running at reduced scale and kept constantly synchronised. RTO is in minutes.
- Multi-site active-active: Full-scale deployment running in multiple regions simultaneously. RTO is near zero. Highest cost.
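The trade-off between recovery time and cost in the list above can be sketched as a small selection routine. This is only an illustration: the concrete RTO figures and the relative cost ranks are assumptions chosen to match the rough "hours to days / hours / minutes / near zero" ordering, and the names `Approach`, `APPROACHES`, and `cheapest_approach` are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Approach:
    name: str
    typical_rto: timedelta  # illustrative assumption, not a measured or vendor figure
    relative_cost: int      # 1 = cheapest, 4 = most expensive (assumed ranking)


# Assumed figures that follow the ordering of the four approaches described above.
APPROACHES = [
    Approach("backup and restore", timedelta(hours=24), 1),
    Approach("pilot light", timedelta(hours=4), 2),
    Approach("warm standby", timedelta(minutes=15), 3),
    Approach("multi-site active-active", timedelta(minutes=1), 4),
]


def cheapest_approach(target_rto: timedelta) -> Approach:
    """Return the lowest-cost approach whose typical RTO meets the target."""
    candidates = [a for a in APPROACHES if a.typical_rto <= target_rto]
    if not candidates:
        # No approach meets the target; fall back to the most capable (costliest) one.
        return max(APPROACHES, key=lambda a: a.relative_cost)
    return min(candidates, key=lambda a: a.relative_cost)


if __name__ == "__main__":
    for target in (timedelta(days=2), timedelta(hours=6), timedelta(minutes=30)):
        print(target, "->", cheapest_approach(target).name)
```

In practice the choice would also weigh the RPO target, compliance requirements, and budget; the sketch only captures the idea that a tighter RTO pushes a system toward a more expensive approach.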
DR Testing
A DR plan that has never been tested is not a reliable DR plan. We conduct regular DR tests, simulating failure scenarios and measuring the actual recovery time and data loss against RTO and RPO targets. Testing reveals gaps in documentation, automation failures, and data replication issues before a real incident exposes them.
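As a minimal sketch of how a test's outcome might be scored, the snippet below derives the achieved recovery time and data-loss window from three timestamps recorded during a drill and compares them with the targets. The class `DrTestResult`, the function `evaluate`, and the timestamps are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrTestResult:
    failure_at: datetime           # when the simulated failure was triggered
    last_recoverable_at: datetime  # timestamp of the newest data that survived
    restored_at: datetime          # when service was confirmed working again

    @property
    def actual_rto(self) -> timedelta:
        # Time the service was unavailable.
        return self.restored_at - self.failure_at

    @property
    def actual_rpo(self) -> timedelta:
        # Window of data written before the failure that could not be recovered.
        return self.failure_at - self.last_recoverable_at


def evaluate(result: DrTestResult, target_rto: timedelta, target_rpo: timedelta) -> dict:
    """Compare achieved recovery time and data loss against the targets."""
    return {
        "actual_rto": result.actual_rto,
        "actual_rpo": result.actual_rpo,
        "rto_met": result.actual_rto <= target_rto,
        "rpo_met": result.actual_rpo <= target_rpo,
    }


if __name__ == "__main__":
    # Hypothetical drill: failure at 12:00, newest surviving data from 11:50,
    # service restored at 12:45.
    drill = DrTestResult(
        failure_at=datetime(2024, 1, 1, 12, 0),
        last_recoverable_at=datetime(2024, 1, 1, 11, 50),
        restored_at=datetime(2024, 1, 1, 12, 45),
    )
    print(evaluate(drill, target_rto=timedelta(minutes=30), target_rpo=timedelta(minutes=15)))
```

Recording these timestamps for every drill makes it possible to track whether recovery performance is improving or drifting away from the targets over time.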