The Resilience Requirement
Single-region architectures fail when the region fails. AWS us-east-1 outages have taken down major services. Multi-region is not optional for systems with SLAs above 99.9%. For those systems, the question is not whether to go multi-region, but how.
The SLA drives the architecture. A 99.9% SLA allows about 8.76 hours of downtime per year, which is achievable in a single region with good operational practices. A 99.99% SLA allows about 52.6 minutes per year and requires multi-region active-passive. A 99.999% SLA allows about 5.3 minutes per year and requires multi-region active-active. Each additional nine cuts the downtime budget by a factor of ten and raises cost and complexity sharply.
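The downtime budgets above follow directly from the SLA percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(sla_percent: float) -> float:
    """Return the yearly downtime allowance, in minutes, for an availability SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(sla)
    print(f"{sla}% SLA: {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Running the loop shows how each extra nine divides the budget by ten, which is why the failover time you can tolerate shrinks from hours to minutes.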
Active-Active: The Gold Standard
Traffic routes to all regions simultaneously. If one fails, traffic shifts to the others automatically, and users ideally experience no downtime. The trade-off is data consistency: active-active requires conflict resolution for writes that happen in multiple regions simultaneously.
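One common way to resolve concurrent writes is last-writer-wins. The sketch below is an illustration of that technique, not a recommendation for every data type; the `VersionedValue` type and `resolve` function are hypothetical names, and the approach assumes region clocks are roughly synchronized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedValue:
    value: str
    timestamp_ms: int  # writer's wall-clock time; assumes roughly synchronized clocks
    region: str        # deterministic tie-breaker so every region picks the same winner


def resolve(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Last-writer-wins: the higher timestamp wins, the region name breaks ties."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


us_write = VersionedValue("dark-mode", 1000, "us-east-1")
eu_write = VersionedValue("light-mode", 1005, "eu-west-1")

# Both regions reach the same answer regardless of argument order.
assert resolve(us_write, eu_write) == resolve(eu_write, us_write)
```

Last-writer-wins silently discards the losing write. That is usually tolerable for something like a user preference, but not for data such as account balances, which need a merge strategy that preserves both updates.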
The traffic routing mechanism is critical. DNS-based routing (Route 53, Cloudflare) directs users to the nearest healthy region. Health checks monitor each region and remove failed regions from the pool. Global load balancers provide Anycast routing that automatically routes around failures.
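The routing decision itself is simple to state: among the regions that currently pass health checks, pick the one with the lowest latency to the user. A minimal sketch, where `pick_region` and the latency figures are hypothetical and a managed DNS or load-balancing service would make this decision for you in practice:

```python
from typing import Optional


def pick_region(latency_ms: dict[str, float], healthy: set[str]) -> Optional[str]:
    """Return the lowest-latency region that is currently healthy, or None if none is."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Illustrative latencies from one user's vantage point.
latencies = {"us-east-1": 20.0, "eu-west-1": 90.0, "ap-southeast-1": 180.0}

print(pick_region(latencies, {"us-east-1", "eu-west-1", "ap-southeast-1"}))
print(pick_region(latencies, {"eu-west-1", "ap-southeast-1"}))  # us-east-1 removed from the pool
```

When a region drops out of the healthy set, its users are sent to the next-nearest region, so that region must have the spare capacity to absorb them.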
The operational complexity is significant. Active-active requires: real-time data replication, conflict resolution logic, traffic routing with health checks, and monitoring across all regions. We have seen teams underestimate this complexity by an order of magnitude. The architecture that looks simple on a whiteboard requires months of engineering to implement correctly.
Active-Passive: The Pragmatic Choice
One region handles all traffic; the other stands by. When the primary fails, traffic fails over to the secondary. Simpler data consistency, lower cost, but measurable downtime during failover. The RTO depends on how quickly you can redirect traffic and warm up caches.
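The failover mechanism is the critical component. DNS failover updates DNS records to point to the secondary region, and clients pick up the change only as their cached records expire, so the record TTL bounds how quickly traffic moves. Health check-based failover is faster because it removes the manual step: failed health checks trigger automatic failover within minutes.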
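Health check-based failover usually waits for several consecutive failures before switching, so that a single dropped probe does not trigger a failover. The sketch below shows that logic under stated assumptions: the `FailoverMonitor` class, its threshold of three, and the choice to make failback manual are all illustrative.

```python
class FailoverMonitor:
    """Tracks primary health checks and decides which region should serve traffic."""

    def __init__(self, primary: str, secondary: str, failure_threshold: int = 3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active = primary

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result for the primary and return the active region."""
        if self.active == self.secondary:
            # Already failed over; failing back is left as a manual decision in this sketch.
            return self.active
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.secondary
        return self.active


monitor = FailoverMonitor("us-east-1", "us-west-2")
for result in (True, False, False, False):
    active = monitor.record_check(result)
print(active)  # the third consecutive failure moves traffic to the secondary
```

Detection time is roughly the threshold multiplied by the check interval: three failures at a 30-second interval means about 90 seconds before failover starts, and any DNS TTL adds to that.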
Data replication for active-passive is simpler than for active-active. Asynchronous replication streams writes from primary to secondary with a delay of seconds to minutes. The RPO is the replication lag at the moment of failure: zero for synchronous replication, because every acknowledged write already exists in both regions, and typically anywhere from a few seconds to 5 minutes for asynchronous replication. For most applications, asynchronous replication with a 5-minute RPO is sufficient.
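To make the RPO concrete, it helps to translate replication lag into the number of writes that would be lost if the primary failed right now. A rough sketch, assuming you can read the commit time of the latest write on the primary and of the latest write applied on the secondary (all function names are hypothetical):

```python
def replication_lag_seconds(primary_commit_ts: float, replica_applied_ts: float) -> float:
    """Lag between the newest primary write and the newest write applied on the replica."""
    return max(0.0, primary_commit_ts - replica_applied_ts)


def writes_at_risk(lag_seconds: float, writes_per_second: float) -> float:
    """Approximate number of writes that exist only on the primary."""
    return lag_seconds * writes_per_second


def meets_rpo(lag_seconds: float, rpo_target_seconds: float) -> bool:
    """True if losing the primary now would stay within the RPO target."""
    return lag_seconds <= rpo_target_seconds


lag = replication_lag_seconds(primary_commit_ts=100.0, replica_applied_ts=97.5)
print(lag, writes_at_risk(lag, writes_per_second=200), meets_rpo(lag, rpo_target_seconds=300))
```

Alerting when `meets_rpo` turns false is a cheap way to find out that replication has fallen behind before a failover forces the question.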
Data Replication Strategies
Synchronous replication guarantees zero RPO but adds latency. Asynchronous replication has near-zero latency impact but non-zero RPO. The choice depends on data criticality: financial transactions need synchronous; user preferences can tolerate asynchronous.
Synchronous replication writes data to both regions before acknowledging the write to the client. The added write latency is at least one round trip between regions: 50-100ms for nearby regions, 100-200ms for distant regions. This latency is unacceptable for most user-facing writes but acceptable for background processes and for writes where losing data is worse than waiting. We use synchronous replication only for financial transactions and critical compliance data.
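The "acknowledge only after both regions have the data" rule can be illustrated with a toy in-memory model. This is a sketch of the idea only: `Region` and `synchronous_write` are hypothetical, and a real system would rely on its database's replication protocol rather than the naive rollback shown here.

```python
class Region:
    """Toy stand-in for a regional datastore."""

    def __init__(self, name: str, available: bool = True):
        self.name = name
        self.available = available
        self.data: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def synchronous_write(primary: Region, secondary: Region, key: str, value: str) -> bool:
    """Acknowledge (return True) only if both regions stored the value."""
    had_key = key in primary.data
    previous = primary.data.get(key)
    try:
        primary.write(key, value)
        secondary.write(key, value)  # the client waits for this cross-region round trip
    except ConnectionError:
        # Undo the primary write so the regions never disagree on acknowledged data.
        if had_key:
            primary.data[key] = previous
        else:
            primary.data.pop(key, None)
        return False
    return True


primary, secondary = Region("us-east-1"), Region("us-west-2")
print(synchronous_write(primary, secondary, "txn-1", "debit 100"))
secondary.available = False
print(synchronous_write(primary, secondary, "txn-2", "debit 50"))
```

The second call shows the hidden cost beyond latency: while the secondary is unreachable, synchronous writes cannot be acknowledged at all, so the write path depends on both regions being available.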
Our Recommendation
Start with active-passive. In our experience it provides roughly 90% of the resilience benefit at about half the complexity. Move to active-active only when your SLA demands it and your data architecture supports it.
Untested failover procedures fail when needed. We recommend quarterly failover drills: simulate a regional failure, verify traffic shifts, verify data consistency, and verify application functionality.
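A drill is easier to repeat every quarter if the verification steps are scripted. The sketch below shows one possible shape for such a runner; `run_drill` and the placeholder checks are hypothetical, and each lambda would be replaced by a real probe against your own systems.

```python
from typing import Callable


def run_drill(steps: list[tuple[str, Callable[[], bool]]]) -> dict[str, bool]:
    """Run every verification step and record pass/fail; an exception counts as a failure."""
    results: dict[str, bool] = {}
    for name, check in steps:
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


# Placeholder checks for illustration; each would call real tooling.
steps = [
    ("traffic shifted to secondary", lambda: True),
    ("data consistent across regions", lambda: True),
    ("application smoke tests pass", lambda: False),
]

report = run_drill(steps)
failed = [name for name, passed in report.items() if not passed]
print("drill passed" if not failed else f"drill failed: {failed}")
```

Running all steps instead of stopping at the first failure gives a complete picture from a single drill, which matters when each drill is expensive to schedule.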
The cost of multi-region is significant: roughly double the infrastructure, double the data transfer, plus the operational overhead of managing multiple regions. We typically see an 80-120% cost increase for active-passive and 150-200% for active-active. The business case must justify this cost through SLA requirements, regulatory requirements, or competitive differentiation.
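The percentage ranges above can be turned into a quick budget estimate. A small sketch using those ranges, with a hypothetical single-region monthly spend of 10,000 as the input:

```python
# Typical cost increases over single-region, in percent, as quoted above.
COST_INCREASE_PERCENT = {
    "active-passive": (80, 120),
    "active-active": (150, 200),
}


def estimate_cost_range(single_region_cost: float, topology: str) -> tuple[float, float]:
    """Return the (low, high) estimated total cost for a multi-region topology."""
    low, high = COST_INCREASE_PERCENT[topology]
    return (
        single_region_cost * (100 + low) / 100,
        single_region_cost * (100 + high) / 100,
    )


for topology in COST_INCREASE_PERCENT:
    low, high = estimate_cost_range(10_000, topology)
    print(f"{topology}: {low:,.0f} to {high:,.0f} per month")
```

These are planning figures only; actual costs depend heavily on data transfer volume and on how much standby capacity is kept warm.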