Multi-Region Architecture for Resilience

The Resilience Requirement

Single-region architectures fail when the region fails. AWS us-east-1 outages have taken down major services. For systems with availability SLAs above 99.9%, multi-region is not optional. The question is not whether to go multi-region, but how.

Cost of Failure: A financial services client calculated that a 4-hour outage would cost £2 million in lost transactions, £500,000 in regulatory fines, and immeasurable reputational damage. Multi-region architecture is insurance against this cost.

The SLA drives the architecture. A 99.9% SLA allows about 8.8 hours of downtime per year, which is achievable in a single region with good operational practices. 99.99% allows about 52 minutes per year and requires multi-region active-passive. 99.999% allows about 5 minutes per year and requires multi-region active-active. Each additional nine cuts the downtime budget by a factor of ten and raises cost and complexity sharply.
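The downtime budgets above follow from simple arithmetic. The sketch below assumes a 365-day year; the function name is ours, chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% SLA -> {minutes:.1f} minutes ({minutes / 60:.2f} hours) per year")
```

Running the same calculation per month rather than per year is often more useful, because many SLAs are measured over a monthly window.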

Active-Active: The Gold Standard

Traffic routes to all regions simultaneously. If one fails, traffic shifts to the others automatically. Users experience no downtime. The trade-off: data consistency. Active-active requires conflict resolution for writes that happen in multiple regions simultaneously.

The traffic routing mechanism is critical. DNS-based routing (Route 53, Cloudflare) directs users to the nearest healthy region. Health checks monitor each region and remove failed regions from the pool. Global load balancers provide Anycast routing that automatically routes around failures.
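The routing decision itself is simple to state: among the regions passing health checks, pick the closest. A minimal sketch, assuming latency measurements and health status are already available; the Region type and pick_region function are hypothetical names and not part of Route 53 or Cloudflare.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    latency_ms: float  # measured latency from the user to this region
    healthy: bool      # result of the most recent health check


def pick_region(regions: List[Region]) -> Region:
    """Return the lowest-latency region that is currently passing health checks."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=lambda r: r.latency_ms)


regions = [
    Region("eu-west", latency_ms=20.0, healthy=False),  # nearest, but failing checks
    Region("us-east", latency_ms=90.0, healthy=True),
    Region("ap-south", latency_ms=140.0, healthy=True),
]
print(pick_region(regions).name)
```

In a managed service this logic lives in the DNS or load-balancing layer; the sketch only shows what that layer is deciding.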

Data Consistency Is the Hard Problem: When two users in different regions update the same record simultaneously, which update wins? Last-write-wins is simplest but silently discards the losing write. CRDTs (conflict-free replicated data types) are mathematically sound but application-specific. We typically use last-write-wins for non-critical data and CRDTs for critical data.
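To make the contrast concrete, here is a minimal sketch of both approaches. The last-write-wins register keeps one value and drops the other; the grow-only counter (one of the simplest CRDTs) merges without losing any increment. Class names, fields, and the region-name tie-breaker are illustrative assumptions, not a production design.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class LWWRegister:
    value: str
    timestamp: float
    region: str  # tie-breaker so every region picks the same winner

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # The later write wins; the earlier write is discarded entirely.
        return max(self, other, key=lambda r: (r.timestamp, r.region))


@dataclass
class GCounter:
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, region: str, amount: int = 1) -> None:
        # Each region only ever increases its own slot.
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        # Per-region maximum: commutative, associative, and idempotent.
        regions = self.counts.keys() | other.counts.keys()
        return GCounter({r: max(self.counts.get(r, 0), other.counts.get(r, 0)) for r in regions})

    @property
    def value(self) -> int:
        return sum(self.counts.values())


london = LWWRegister("address A", timestamp=10.0, region="eu-west")
new_york = LWWRegister("address B", timestamp=12.0, region="us-east")
print(london.merge(new_york).value)  # the London update is lost

eu, us = GCounter(), GCounter()
eu.increment("eu-west", 2)
us.increment("us-east", 3)
print(eu.merge(us).value)  # both regions' increments survive
```

Last-write-wins also depends on clocks that are reasonably synchronised across regions, which is another reason to reserve it for non-critical data.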

The operational complexity is significant. Active-active requires: real-time data replication, conflict resolution logic, traffic routing with health checks, and monitoring across all regions. We have seen teams underestimate this complexity by an order of magnitude. The architecture that looks simple on a whiteboard requires months of engineering to implement correctly.

Active-Passive: The Pragmatic Choice

One region handles all traffic; the other stands by. When the primary fails, traffic fails over to the secondary. Simpler data consistency, lower cost, but measurable downtime during failover. The RTO (recovery time objective) depends on how quickly you can redirect traffic and warm up caches.

The failover mechanism is the critical component. DNS failover updates DNS records to point to the secondary region; clients pick up the change only as cached records expire, so the record TTL sets a floor on failover time. Health check-based failover is faster: failed health checks trigger automatic failover within minutes.
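A common way to keep health check-based failover from reacting to a single blip is to require several consecutive failures before switching. A minimal sketch, assuming a threshold of three failed checks and manual failback; both choices, and the FailoverMonitor name, are our assumptions for illustration.

```python
class FailoverMonitor:
    """Switch to the secondary after N consecutive failed primary health checks."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold  # assumed value, tune per system
        self.consecutive_failures = 0
        self.active = "primary"

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health check result and return the region that should serve traffic."""
        if self.active == "secondary":
            return self.active  # failback is a deliberate manual step in this sketch
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "secondary"
        return self.active


monitor = FailoverMonitor()
for result in (True, False, False, False):
    print(monitor.record_check(result))
```

The detection time is roughly the check interval multiplied by the threshold, so both values feed directly into the RTO.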

Cache Warm-Up Is the Hidden Delay: When traffic shifts to the secondary region, caches are cold. Database queries that were cached in the primary now hit the database directly, causing load spikes. Without cache warm-up, failover can take 10-30 minutes; with it, failover takes 2-5 minutes.
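Warm-up can be as simple as loading a known list of hot keys into the secondary's cache before traffic arrives. A minimal sketch, assuming the hot-key list is exported from the primary and that a plain dictionary stands in for the cache; warm_cache and load_from_db are hypothetical names.

```python
from typing import Callable, Dict, Iterable


def warm_cache(
    hot_keys: Iterable[str],
    load_from_db: Callable[[str], str],
    cache: Dict[str, str],
) -> int:
    """Populate the cache for hot keys that are missing; return how many were loaded."""
    loaded = 0
    for key in hot_keys:
        if key not in cache:
            cache[key] = load_from_db(key)
            loaded += 1
    return loaded


# Stand-in for the secondary region's database.
database = {"product:1": "kettle", "product:2": "toaster", "product:3": "blender"}
secondary_cache: Dict[str, str] = {"product:1": "kettle"}

count = warm_cache(["product:1", "product:2", "product:3"], database.__getitem__, secondary_cache)
print(f"loaded {count} keys, cache now holds {len(secondary_cache)}")
```

Running this on a schedule against the standby, rather than only at failover time, keeps the load spike off the critical path.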

Data replication for active-passive is simpler than for active-active. Asynchronous replication streams writes from primary to secondary with a delay of seconds to minutes. The RPO (recovery point objective) is bounded by the replication lag: effectively zero for synchronous replication, typically 1-5 minutes for asynchronous. For most applications, asynchronous replication with a 5-minute RPO is sufficient.
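Because the RPO is only as good as the current replication lag, it is worth checking the lag against the target continuously. A minimal sketch, assuming you can read the timestamp of the latest commit on the primary and the latest applied change on the secondary; the function names are ours, and the 300-second default mirrors the 5-minute RPO mentioned above.

```python
def replication_lag_seconds(primary_commit_ts: float, secondary_applied_ts: float) -> float:
    """Return how far the secondary is behind the primary, never negative."""
    return max(0.0, primary_commit_ts - secondary_applied_ts)


def within_rpo(lag_seconds: float, rpo_seconds: float = 300.0) -> bool:
    """Return True if losing the primary now would stay inside the RPO."""
    return lag_seconds <= rpo_seconds


lag = replication_lag_seconds(primary_commit_ts=1_000.0, secondary_applied_ts=940.0)
print(f"lag={lag:.0f}s within_rpo={within_rpo(lag)}")
```

An alert on within_rpo returning False is a cheap early warning that a failover at that moment would lose more data than the business agreed to.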

Data Replication Strategies

Synchronous replication guarantees zero RPO but adds latency. Asynchronous replication has near-zero latency impact but non-zero RPO. The choice depends on data criticality: financial transactions need synchronous; user preferences can tolerate asynchronous.

Synchronous replication writes data to both regions before acknowledging the write to the client. This adds at least one inter-region round trip to every write: 50-100ms for nearby regions, 100-200ms for distant regions. That latency is unacceptable for most user-facing writes but acceptable for background processes, so we use synchronous replication only for financial transactions and critical compliance data.
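The latency cost can be estimated with a simple model: a synchronous write is acknowledged only after the local commit plus one inter-region round trip, while an asynchronous write is acknowledged after the local commit alone. This is a deliberately simplified sketch; the function name and the 5ms local commit time are assumptions.

```python
def acknowledged_write_latency_ms(
    local_commit_ms: float,
    inter_region_rtt_ms: float,
    synchronous: bool,
) -> float:
    """Estimate the latency the client sees before a write is acknowledged."""
    if synchronous:
        # The remote region must confirm before the client gets an answer.
        return local_commit_ms + inter_region_rtt_ms
    # Asynchronous: replication happens after the acknowledgement.
    return local_commit_ms


LOCAL_COMMIT_MS = 5.0  # assumed local database commit time

for label, rtt in (("nearby regions", 75.0), ("distant regions", 150.0)):
    sync = acknowledged_write_latency_ms(LOCAL_COMMIT_MS, rtt, synchronous=True)
    async_ = acknowledged_write_latency_ms(LOCAL_COMMIT_MS, rtt, synchronous=False)
    print(f"{label}: sync={sync:.0f}ms async={async_:.0f}ms")
```

Real systems add serialisation, queueing, and retry costs on top, so measured figures will be higher than this model suggests.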

Cross-Region Read Replicas: The primary region handles writes; secondary regions handle reads. Read replicas are asynchronous, so reads may see slightly stale data. This is acceptable for 80% of read-heavy workloads: product catalogues, user profiles, analytics.
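The routing rule for this pattern fits in a few lines: writes always go to the primary, and reads go to a replica in the user's region when one exists. A minimal sketch; the route function and the region names are hypothetical.

```python
from typing import List


def route(operation: str, user_region: str, primary: str, replicas: List[str]) -> str:
    """Return the region that should serve this operation."""
    if operation == "write":
        return primary  # single writer avoids cross-region write conflicts
    if user_region in replicas:
        return user_region  # local read, possibly slightly stale
    return primary  # no local replica: fall back to the primary


PRIMARY = "eu-west"
REPLICAS = ["us-east", "ap-south"]

print(route("read", "us-east", PRIMARY, REPLICAS))
print(route("write", "us-east", PRIMARY, REPLICAS))
```

Reads that must see the user's own latest write (for example, immediately after a profile update) may need to be pinned to the primary for a short window.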

Our Recommendation

Start with active-passive. It provides 90% of the resilience benefit at 50% of the complexity. Move to active-active only when your SLA demands it and your data architecture supports it.

Untested failover procedures fail when needed. We recommend quarterly failover drills: simulate a regional failure, verify traffic shifts, verify data consistency, and verify application functionality.
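A drill is easier to repeat every quarter if the verification steps are scripted. A minimal sketch of a drill runner, assuming each verification is a function that returns True on success; the check names map to the steps above, and the lambdas are placeholders for real probes.

```python
from typing import Callable, Dict


def run_failover_drill(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every check and record pass or fail; a crashing check counts as a failure."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


# Placeholder checks: replace each lambda with a real probe against the secondary.
drill = {
    "traffic shifted to secondary": lambda: True,
    "data consistent after failover": lambda: True,
    "application smoke tests pass": lambda: False,
}

results = run_failover_drill(drill)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
print("drill passed" if all(results.values()) else "drill failed")
```

Keeping the results from each drill makes it possible to see whether failover time and failure counts are trending in the right direction.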

The cost of multi-region is significant: double the infrastructure, double the data transfer, and operational overhead for managing multiple regions. We typically see an 80-120% cost increase for active-passive, and 150-200% for active-active. The business case must justify this cost through SLA requirements, regulatory requirements, or competitive differentiation.
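Applying those ranges to a current single-region bill gives a quick first estimate for the business case. A minimal sketch using the percentages quoted above; the function name and the example monthly cost are illustrative assumptions.

```python
from typing import Tuple

# Typical cost increases quoted above, as fractions of the single-region cost.
COST_INCREASE = {
    "active-passive": (0.80, 1.20),
    "active-active": (1.50, 2.00),
}


def multi_region_cost_range(single_region_cost: float, topology: str) -> Tuple[float, float]:
    """Return the (low, high) estimated total cost for the given topology."""
    low, high = COST_INCREASE[topology]
    return single_region_cost * (1 + low), single_region_cost * (1 + high)


MONTHLY_COST = 100_000.0  # hypothetical single-region monthly spend

for topology in COST_INCREASE:
    low, high = multi_region_cost_range(MONTHLY_COST, topology)
    print(f"{topology}: {low:,.0f} to {high:,.0f} per month")
```

Comparing that range with the expected cost of an outage is the core of the business case.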
