Back to Portfolio
Eweka — Blueprint Series 10 min read

Multi-Cloud Disaster Recovery — Automated AWS + Azure Failover with Route 53 Health Checks

AWS S3 CloudFront Azure Blob Storage Route 53 Terraform ACM DNS Failover
Multi-Cloud DR Architecture Diagram

Full architecture: Route 53 health-check-based failover between AWS CloudFront + S3 (primary) and Azure Blob Storage (standby), provisioned from a single Terraform configuration

A working multi-cloud disaster recovery system that automatically fails over from AWS to Azure when the primary goes down — no manual intervention, no custom scripts, no third-party tools. Just Route 53 health checks, a failover routing policy, and a single Terraform configuration that deploys to both clouds simultaneously.

Stack

AWS S3CloudFrontACM Route 53Azure Blob StorageTerraform NamecheapDNS Failover

The Problem

Most static site hosting setups have a single point of failure: one cloud provider. If that provider has an outage — even a regional one — your site goes down and there is nothing you can do except wait. The standard DR advice is "use multiple regions," but that still leaves you dependent on a single cloud. A true DR system needs a second cloud entirely.

I wanted to build a system where AWS going down does not mean the site goes down.

What This System Does

The site runs on two clouds simultaneously. AWS S3 + CloudFront is the primary. Azure Blob Storage is the standby. Route 53 monitors both with health checks every 30 seconds. When the primary fails, Route 53 automatically starts returning the Azure endpoint. When AWS recovers, traffic returns automatically. No manual intervention at any step.

1
Terraform deploys to both clouds

A single terraform apply provisions S3, CloudFront, Azure Blob Storage, and all DNS records simultaneously from one codebase.

2
Route 53 monitors both endpoints

Two HTTPS health checks ping the primary (CloudFront) and secondary (Azure Blob) every 30 seconds with a failure threshold of 3.

3
Failover routing handles the switch

A primary A-record alias points to CloudFront. A secondary CNAME points to Azure. When the health check fails, Route 53 stops returning the primary record automatically.

4
DNS propagation completes the handoff

With a 300-second TTL, the effective RTO is ~90 seconds detection time plus up to 5 minutes for cached resolvers to expire.

5
AWS recovery is also automatic

When the S3 bucket policy is restored and the health check passes again, Route 53 returns traffic to the primary CloudFront distribution without any manual action.

Building the System — 4 Phases

Phase 1 — AWS Primary: S3 + CloudFront

The primary hosting layer uses S3 for static website hosting with CloudFront as the CDN in front of it. The S3 bucket is configured with static website hosting enabled, public read access via a bucket policy, and lifecycle rules to prevent accidental deletion. CloudFront handles TLS termination using an ACM certificate for the custom domain, with HTTP-to-HTTPS redirection configured on the listener.

All website files — HTML, CSS, JavaScript, and image assets — are uploaded to S3 using Terraform's aws_s3_object resource with a for_each loop over the assets directory. Adding new files to the website only requires dropping them into the folder and running terraform apply.

Phase 2 — Azure Standby: Blob Storage Static Website

The standby site runs on Azure Blob Storage with the static website feature enabled. A resource group, storage account (StorageV2, LRS replication), and the $web container are provisioned via Terraform's azurerm provider. The same website files are uploaded to Azure using azurerm_storage_blob with a content-type lookup map that assigns the correct MIME type based on file extension.

The key design decision: both providers serve identical content from the same local source directory. One terraform apply deploys to both AWS and Azure simultaneously, eliminating any sync drift between primary and standby.

Phase 3 — DNS Failover with Route 53

Route 53 is the core of the DR mechanism. Two HTTPS health checks monitor the primary (CloudFront) and secondary (Azure Blob) endpoints, pinging every 30 seconds with a failure threshold of 3. The failover routing policy uses an A-record alias for the primary (pointing to the CloudFront distribution) and a CNAME for the secondary (pointing to the Azure Blob static website URL).

When the primary health check fails, Route 53 stops returning the CloudFront record and starts returning the Azure endpoint. DNS TTL is set to 300 seconds, so the effective RTO is the detection time (~90s) plus DNS propagation time (~5 minutes in the worst case for cached resolvers).

Phase 4 — Domain, TLS, and Verification

A custom domain was purchased from Namecheap (~$1) and the nameservers were pointed to Route 53's hosted zone. ACM provides the TLS certificate for CloudFront, validated via DNS. The entire DNS chain — domain registration → Route 53 hosted zone → health checks → failover records → CloudFront/Azure endpoints — was verified using DNS propagation checkers and manual failover testing.

Testing Failover

To validate the DR mechanism, I simulated an AWS outage by restricting the S3 bucket's public access policy, which causes CloudFront to return errors. Within ~90 seconds, Route 53's health check marked the primary as unhealthy and began resolving the domain to the Azure endpoint. The site remained functional throughout — users were served from Azure Blob Storage without any manual intervention.

Restoring the S3 bucket policy caused the health check to pass again, and traffic automatically returned to the primary CloudFront distribution.

Known Limitations and Production Path

Azure HTTPS with custom domains

Azure Blob Storage's static website feature does not natively support HTTPS with custom domains. In this setup, the Azure failover endpoint serves over HTTP, which produces a browser "Not secure" warning. In production, the fix is to place Azure CDN in front of the Blob Storage endpoint with a managed TLS certificate — the same pattern AWS uses with CloudFront + ACM.

Static content only

This DR pattern works for static assets served from object storage. For dynamic applications with databases and server-side logic, the pattern would need to extend to include database replication (e.g., RDS cross-region read replicas or Cosmos DB multi-region writes) and compute failover (e.g., ECS/AKS with weighted routing).

The Hardest Problems I Solved

Terraform multi-provider state management

Running AWS and Azure providers in a single Terraform configuration requires careful credential isolation. I used separate .tfvars files for each provider and added *.tfvars to .gitignore to prevent credential leakage. The alternative — separate state files per provider — would have broken the single-apply deployment model that keeps both sides in sync.

Content-type mapping for Azure Blob uploads

Azure Blob Storage defaults all uploads to application/octet-stream if you don't set the content type explicitly. This breaks CSS and JS loading in the browser. Fixed with a lookup() map in Terraform that assigns the correct MIME type based on file extension for every uploaded asset.

DNS propagation timing during failover

Route 53 detects the outage in ~90 seconds, but DNS resolvers around the world cache records based on TTL. With a 300-second TTL, some users continued hitting the failed primary for up to 5 minutes after failover. In production, lowering the TTL to 60 seconds would reduce this window but increase DNS query costs — a trade-off worth documenting.

Key Takeaways

Multi-cloud IaC works from a single codebase. A single terraform apply provisions and deploys to both AWS and Azure simultaneously. The key is using separate provider blocks with isolated credentials and a shared source directory for the website files.
Route 53 health checks are the DR engine. The entire failover mechanism is just two health checks and a failover routing policy. No custom scripts, no Lambda functions, no third-party tools. The simplicity is the strength — fewer moving parts means fewer things that can break during an actual outage.
RTO is detection time plus DNS TTL. You can tune detection speed (check interval × failure threshold) and propagation speed (TTL value) independently. Lower TTL means faster failover but more DNS queries. For a static site with near-zero operating cost, a 5-minute RTO is acceptable.
Treat security groups and bucket policies as contracts. Define access boundaries before launching any resource. Most "can't reach the site" issues come down to either port rules or name resolution — not the application itself.
Document what you'd do differently in production. Acknowledging the Azure HTTPS gap and explaining the CDN fix shows you understand the difference between a working demo and a production-grade system.