Day 4: AWS Solutions Architect Professional Prep – Disaster recovery
Today’s lesson is all about how to create a digital safety net. A technical failure, a fire, or even a simple human error: any of these could cause a digital system to stop working, with dire consequences for a business. Every minute an e-commerce site is down during peak shopping season means lost revenue and unhappy customers. Losing even a few seconds of data could cost a financial trading platform millions.
What digital insurance policy, or plan, can a business design to get back up and running as quickly as possible while losing the minimum amount of data? In today’s lesson, I examined the following fundamental questions about Disaster Recovery (DR):
1. What are the two measurements that dictate our entire recovery strategy?
2. What is the fastest, most resilient recovery plan, and when is it necessary?
3. If we need to save money, what is the simplest, low-cost recovery plan?
4. How do we find a balance between speed and cost?
1. What are the two measurements that dictate our entire recovery strategy?
There are two golden rules of recovery, or two key questions to ask when things go wrong:
How quickly can we get the service back up and running?
How much data are we willing to lose?
These two questions map to the two fundamental metrics of DR: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Recovery Time Objective (RTO): RTO is the maximum time the business can tolerate the service being down. If your car breaks down, the RTO is how long you can afford to wait for the tow truck to arrive and get you back on the road. For a major shopping website during a sale, the RTO might be 5 minutes: you fix it almost instantly, or you lose money. By contrast, for a system that only archives old files, 24 hours might be fine.
- Recovery Point Objective (RPO): RPO is the maximum amount of data loss the business can tolerate, measured as the time between the last recovery point (such as a successful backup) and the incident. If you are typing a paper and the power goes out, you lose everything since the last time you saved. If your RPO is 20 minutes, you need to save at least every 20 minutes, and you might lose up to 20 minutes of work. A financial trading system needs a “near-zero” RPO, meaning it can afford to lose almost no transactions, so real-time replication is required. For analytics data, you might be able to tolerate an RPO of 4 hours.
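To make the two metrics concrete, here is a minimal Python sketch that measures what actually happened in one incident and compares it with the targets. The function names and the timeline are made up for illustration; they are not part of any AWS tool.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_recovery_point, incident_at, restored_at):
    """Return (downtime, data_loss_window) for a single incident."""
    downtime = restored_at - incident_at           # compare against the RTO
    data_loss = incident_at - last_recovery_point  # compare against the RPO
    return downtime, data_loss


def meets_objectives(downtime, data_loss, rto, rpo):
    """True only if both the time target and the data-loss target were met."""
    return downtime <= rto and data_loss <= rpo


# Hypothetical timeline: last backup at 09:40, outage at 10:00, service back at 10:25.
last_backup = datetime(2024, 11, 29, 9, 40)
incident = datetime(2024, 11, 29, 10, 0)
restored = datetime(2024, 11, 29, 10, 25)

downtime, data_loss = recovery_metrics(last_backup, incident, restored)
print(downtime)   # 0:25:00
print(data_loss)  # 0:20:00
print(meets_objectives(downtime, data_loss,
                       rto=timedelta(minutes=30),
                       rpo=timedelta(minutes=20)))  # True
```

The same incident would miss a 5-minute RTO, which is why the targets have to be agreed with the business before a strategy is chosen.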
2. What is the fastest, most resilient recovery plan, and when is it necessary?
There are four fundamental DR strategies. Just as when choosing a car, ask yourself: do you need a Ferrari (fastest, most expensive), or will a dependable commuter car (slowest, cheapest) suffice?
The fastest plan is the Multi-Region Hot Site strategy. You might call this the Identical Twin Factory, in which you have two full factories running at the same time. This requires running a complete, full-scale, mirrored copy of your entire application in a second region, potentially in an Active-Active setup where both regions handle traffic simultaneously. Data is copied between the two sites in near real time using services like Aurora Global Database or DynamoDB Global Tables.
This is the gold standard for resiliency and also the most expensive option. The payoff for that investment is near-instantaneous recovery.
RTO/RPO: This high-cost strategy provides an RTO of seconds to minutes (via automated failover) and a near-zero RPO.
This strategy is best for mission-critical systems, such as financial trading platforms or high-traffic e-commerce during peak seasons.
3. If we need to save money, what is the simplest, low-cost recovery plan?
The Backup & Restore strategy is the simplest and lowest-cost approach. You might call it the Archive Approach: storing copies in a safe deposit box.
It relies entirely on automated backups of data (using services like AWS Backup or S3 Cross-Region Replication) stored cheaply in S3, including archival storage classes such as S3 Glacier Deep Archive. Nothing is kept running in the backup region, so if the primary region fails, you build new infrastructure from scratch and then load the data back in.
RTO/RPO: The cost is lowest, but the RTO is the longest, ranging from 8 to 24+ hours. The RPO is also large, typically 1 to 24 hours, depending on how often backups occur.
This strategy is best for non-critical systems, such as monthly financial reports or development environments.
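A rough back-of-the-envelope sketch shows where those hours come from. All numbers below (provisioning time, data size, restore throughput, validation time) are assumptions chosen for illustration, not AWS figures.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval):
    """A failure just before the next backup loses almost one full interval."""
    return backup_interval


def estimated_rto(provision_time, data_gb, restore_gb_per_hour,
                  validation_time=timedelta(0)):
    """Rebuild the infrastructure, then restore the data, then validate."""
    restore_time = timedelta(hours=data_gb / restore_gb_per_hour)
    return provision_time + restore_time + validation_time


# Hypothetical workload: 3 TB of data restored at 500 GB per hour.
rto = estimated_rto(provision_time=timedelta(hours=2),
                    data_gb=3000,
                    restore_gb_per_hour=500,
                    validation_time=timedelta(hours=1))
print(rto)                                  # 9:00:00
print(worst_case_rpo(timedelta(hours=12)))  # 12:00:00
```

With backups every 12 hours, the worst-case data loss is close to 12 hours, so shortening the backup interval is the main lever for improving the RPO of this strategy.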
4. How do we find a balance between speed and cost?
To achieve better speed without paying the highest cost, we can choose between two intermediate strategies:
Pilot Light: This can be called the minimal engine approach, in which you keep the engine block assembled. This strategy keeps only the “bare minimum” running in the backup region: the core components like database replicas and critical data replication. It is a low-to-medium cost strategy. During a disaster, you provision and scale up the application servers and load balancers around that ready core. This offers a reasonable RTO of 30 minutes to 2 hours and an RPO of 5–15 minutes. This is best for important systems that can tolerate up to a couple of hours of downtime but only minutes of data loss.
Warm Standby: This can be called the reduced kitchen approach, in which you have a smaller but ready-to-use second location. With this strategy, you maintain a fully functional, scaled-down replica of your production environment (perhaps 25–50% capacity) in the DR region. This is a medium-to-high cost strategy where the DR region is always running at reduced capacity. The application is functional and ready to take traffic immediately, only requiring a scale-up to handle the full load. This achieves a fast RTO of 5–30 minutes and a strong RPO of 1–5 minutes. This is best for critical e-commerce platforms that cannot endure extended downtime.
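The four strategies can be lined up as a simple lookup. The sketch below picks the cheapest strategy whose best-case figures, taken from the optimistic end of the ranges quoted in this lesson, still satisfy the required RTO and RPO. Treating Hot Site as zero is an approximation of "seconds" and "near-zero", and a real decision would also weigh budget, compliance, and how well the failover has been tested.

```python
from datetime import timedelta

# (name, best-case RTO, best-case RPO), ordered from cheapest to most expensive.
# Figures are the optimistic ends of the ranges quoted in this lesson.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=8), timedelta(hours=1)),
    ("Pilot Light", timedelta(minutes=30), timedelta(minutes=5)),
    ("Warm Standby", timedelta(minutes=5), timedelta(minutes=1)),
    ("Hot Site", timedelta(0), timedelta(0)),  # "seconds" and "near-zero"
]


def cheapest_strategy(required_rto, required_rpo):
    """Return the first (cheapest) strategy that can meet both targets."""
    for name, best_rto, best_rpo in STRATEGIES:
        if best_rto <= required_rto and best_rpo <= required_rpo:
            return name
    raise ValueError("RTO and RPO targets must not be negative")


print(cheapest_strategy(timedelta(hours=24), timedelta(hours=4)))       # Backup & Restore
print(cheapest_strategy(timedelta(hours=1), timedelta(minutes=10)))     # Pilot Light
print(cheapest_strategy(timedelta(minutes=10), timedelta(minutes=2)))   # Warm Standby
print(cheapest_strategy(timedelta(minutes=2), timedelta(seconds=30)))   # Hot Site
```

Ordering the list by cost is the design choice that matters here: the loop stops at the first match, so you never pay for more resilience than the targets require.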
The key services that help with DR are:
S3 (Storage): Great for backing up files, especially using Cross-Region Replication (CRR) to automatically copy files to another region.
RDS/Aurora (Databases): You use cross-region read replicas or the super-fast Aurora Global Database to keep a live copy of your data in the backup region.
Route 53 (“The Traffic Cop”): This is what directs users. It uses health checks (checking whether your application is alive) and, with failover routing configured, automatically reroutes traffic to the healthy region during a failure.
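As a rough illustration of the Route 53 piece, the sketch below builds a pair of failover records (one PRIMARY, one SECONDARY) in the shape that the Route 53 record-change API expects. The domain, IP addresses, and health check ID are placeholders, and the helper name is made up for this example. The code only builds a dictionary and does not call AWS.

```python
def failover_change_batch(domain, primary_ip, secondary_ip,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 change batch with PRIMARY and SECONDARY failover records."""

    def record(role, ip, health_check_id=None):
        record_set = {
            "Name": domain,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # a short TTL lets clients pick up a failover sooner
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            # Route 53 stops answering with this record when the check fails.
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "DR failover records (illustrative)",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


# Placeholder values: documentation-range IPs and a made-up health check ID.
batch = failover_change_batch("app.example.com", "192.0.2.10", "198.51.100.10",
                              primary_health_check_id="hc-primary-example")
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
# ['PRIMARY', 'SECONDARY']
```

In a real account, a dictionary like this would be passed as the ChangeBatch argument to boto3's change_resource_record_sets call for the hosted zone, after creating the health check itself.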
To sum it all up, business criticality is the key driver of Disaster Recovery strategy. At one extreme, for mission-critical applications requiring an RTO of less than 5 minutes and an RPO of less than 1 minute, the recommended strategy is Hot Site. At the other, for non-critical systems that can tolerate an RTO of up to 24 hours and an RPO of up to 4 hours, Backup & Restore is the low-cost solution. Simply put, don’t build a Ferrari when a Toyota will do, but don’t ride a bicycle on the highway either.
See you tomorrow!
