Cloud Migration Without the Meltdown

Why Migrations Fail

Cloud migrations tend to fail in predictable ways: underestimating data transfer costs, ignoring legacy dependencies, treating lift-and-shift as a strategy, and lacking rollback plans. The successful ones treat migration as product delivery, not as an infrastructure project.

The most expensive migration is the one you have to do twice.

The most common failure pattern we see is the "big bang" migration. An organisation decides to move everything at once, underestimates the complexity, hits unexpected issues, and ends up with a partially migrated system that is less reliable than the original. We have seen companies spend 18 months and £2 million on a migration that could have been completed in 6 months with a phased approach.

Anti-Pattern — Lift-and-Shift as Strategy: Moving VMs to the cloud without re-architecting simply moves operational burden. You pay for inefficiency you could have eliminated. A VM costing £200/month on-premises might cost £800/month in the cloud because it runs 24/7 on a fixed instance. The cloud rewards dynamic scaling; lift-and-shift ignores it.

The third major failure is neglecting data gravity. Large datasets are expensive and slow to move. We have seen migrations stall for months because data transfer bandwidth was insufficient. Data gravity pulls compute toward it: moving compute without its data means every query crosses the link between environments, which adds latency, cost, and complexity.
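To put rough numbers on the bandwidth problem, the sketch below estimates how long a bulk transfer would take. The function name, the 500 TB dataset, the 1 Gbps link, and the 70% usable-bandwidth figure are all illustrative assumptions, not figures from a real engagement.

```python
def transfer_days(dataset_tb: float, link_mbps: float, utilisation: float = 0.7) -> float:
    """Estimate the days needed to move dataset_tb terabytes over a link_mbps link.

    utilisation is the assumed share of the link the transfer can actually use.
    """
    if dataset_tb < 0 or link_mbps <= 0 or not 0 < utilisation <= 1:
        raise ValueError("dataset size, link speed, and utilisation must be valid")
    bits = dataset_tb * 1e12 * 8                   # decimal terabytes to bits
    effective_bps = link_mbps * 1e6 * utilisation  # usable bits per second
    return bits / effective_bps / 86_400           # seconds to days


# Hypothetical estate: 500 TB over a 1 Gbps link at 70% utilisation.
print(f"{transfer_days(500, 1000):.1f} days")
```

Under these assumptions the copy takes a little over two months (about 66 days), before allowing for retries, verification, or the changes that accumulate at the source while the copy runs.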

The Six-Phase Framework

1. Discovery and Assessment

Map every application, dependency, data flow, and integration. Classify each workload as retire, retain, rehost, replatform, refactor, or rebuild. In our experience, most organisations discover around 30% more dependencies than they expected. This phase typically takes 4-6 weeks for a medium-sized estate.

The 6R framework provides a structured approach: Retire (shut down unused systems), Retain (keep on-premises), Rehost (lift-and-shift), Replatform (move with minor optimisation), Refactor (re-architect for cloud-native), and Rebuild (start from scratch).

Key Finding: We typically find that 20% of applications can be retired, 10% should be rebuilt, and the remainder split between Rehost, Replatform, and Refactor. The real value comes from Retire and Rebuild — not Rehost.

Dependency mapping is the most critical activity. We use automated tools (AWS Application Discovery Service, Azure Migrate, or CloudQuery) to scan the estate and build dependency graphs. Then we validate with application teams, who often know about "shadow dependencies" that the tools miss.

2. Landing Zone Design

Build the cloud foundation before migrating workloads. Network topology, identity management, security baselines, and cost controls must be in place. The landing zone is your production environment — treat it with the same rigour.

The landing zone comprises several layers: network (VPCs, subnets, routing, DNS), identity (SSO, RBAC, service accounts), security (firewalls, encryption, logging, compliance), and operations (monitoring, alerting, backup, DR).

Common Mistake: We have seen migrations fail because the landing zone was an afterthought. Teams ended up with inconsistent security policies, overlapping IP ranges, and untraceable costs.
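Overlapping IP ranges are cheap to detect before any network is built. The following is a minimal sketch using Python's standard ipaddress module; the network names and CIDR blocks are made up for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of network names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        # Only compare like with like: IPv4 against IPv4, IPv6 against IPv6.
        if net_a.version == net_b.version and net_a.overlaps(net_b)
    ]


# Hypothetical address plan: the production VPC sits inside the on-premises range.
plan = {
    "on-premises": "10.0.0.0/16",
    "prod-vpc": "10.0.128.0/20",
    "dev-vpc": "10.1.0.0/16",
}
print(find_overlaps(plan))
```

Running a check like this against the proposed address plan, including every on-premises range that will be routed to the cloud, is one way to catch clashes during design rather than during cutover.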

Cost controls are particularly important. Cloud bills can spiral without governance. We implement budget alerts at 50%, 80%, and 100% of allocated spend, with automatic escalation. Resource tagging policies attribute costs to teams, projects, and environments.
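The alerting and attribution logic is simple enough to sketch. The example below assumes spend data is already available as plain numbers and dictionaries; the function names, the "team" tag key, and the sample resources are hypothetical, and in practice the provider's own budgeting service would normally raise the alerts.

```python
from collections import defaultdict

# Alert thresholds from the text, as fractions of allocated spend.
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)


def crossed_thresholds(spend: float, budget: float, thresholds=ALERT_THRESHOLDS) -> list[float]:
    """Return every threshold that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in thresholds if used >= t]


def spend_by_tag(resources: list[dict], tag: str = "team") -> dict[str, float]:
    """Sum monthly cost per tag value; untagged resources are grouped together."""
    totals: dict[str, float] = defaultdict(float)
    for resource in resources:
        owner = resource.get("tags", {}).get(tag, "untagged")
        totals[owner] += resource["monthly_cost"]
    return dict(totals)


# Hypothetical month-to-date figures.
print(crossed_thresholds(spend=8_500, budget=10_000))
print(spend_by_tag([
    {"monthly_cost": 100.0, "tags": {"team": "payments"}},
    {"monthly_cost": 50.0, "tags": {"team": "payments"}},
    {"monthly_cost": 75.0, "tags": {}},
]))
```

The "untagged" bucket is deliberate: it makes untraceable spend visible as a line item instead of letting it disappear into the total.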

3. Pilot Migration

Select 2-3 non-critical applications that represent your estate's diversity. Migrate them completely, including monitoring, backup, and DR. Document every issue. The pilot reveals the real complexity that assessments miss.

The pilot is not a proof of concept; it is a rehearsal. It should use the same tools, processes, and runbooks that the main migration will use.

We recommend selecting pilots that cover different complexity profiles: one simple lift-and-shift (low risk), one replatforming effort (medium risk), and one refactoring project (high risk). The simple pilot builds confidence; the hard pilot reveals true complexity.

4. Wave Planning

Group applications by dependency, risk, and business criticality. Never migrate a system before its dependencies. We typically plan 4-6 week waves with 2-week buffers between them. Rushing waves together causes cascade failures.

The wave structure should reflect the dependency graph. Start with foundational services (databases, message queues, shared libraries), then move to business services, then user-facing applications. A service that moves before its database sends every query across the link between environments until the database follows.
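Ordering by dependency is a topological sort. As a minimal sketch, the example below uses graphlib from the Python standard library (3.9+) to group systems into waves; the system names are invented, and a real plan would also weigh risk and business criticality, which this does not model.

```python
from graphlib import TopologicalSorter


def plan_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group systems into waves so that each one migrates after its dependencies.

    depends_on maps each system to the set of systems it depends on.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies in earlier waves.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical estate.
estate = {
    "orders-db": set(),
    "message-queue": set(),
    "order-service": {"orders-db", "message-queue"},
    "web-frontend": {"order-service"},
}
for number, wave in enumerate(plan_waves(estate), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

A circular dependency surfaces as an error at planning time, which is usually a sign that the systems involved need to move together in one wave or be decoupled first.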

Wave Readiness System: Green = all prerequisites met, ready to migrate. Amber = some prerequisites pending, scheduled for next wave. Red = significant blockers, needs remediation. Each wave should have at most 20% amber items.
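The readiness rule can be expressed as a simple gate. This sketch reads the 20% limit as a share of the items assigned to a wave, and it assumes that any red item blocks the wave outright; both are our interpretation for illustration, and the application names are hypothetical.

```python
from enum import Enum


class Readiness(Enum):
    GREEN = "green"  # all prerequisites met
    AMBER = "amber"  # some prerequisites pending
    RED = "red"      # significant blockers

MAX_AMBER_SHARE = 0.20  # limit taken from the text


def wave_can_proceed(statuses: dict[str, Readiness]) -> bool:
    """Return True if the wave has no red items and at most 20% amber items."""
    if not statuses:
        return False
    values = list(statuses.values())
    if Readiness.RED in values:  # assumption: one red item blocks the wave
        return False
    amber_share = values.count(Readiness.AMBER) / len(values)
    return amber_share <= MAX_AMBER_SHARE


# Hypothetical wave: one amber item out of five is exactly at the limit.
wave = {
    "billing": Readiness.GREEN,
    "catalogue": Readiness.GREEN,
    "search": Readiness.GREEN,
    "reporting": Readiness.GREEN,
    "notifications": Readiness.AMBER,
}
print(wave_can_proceed(wave))
```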

5. Execution

Each migration needs: a runbook, a rollback procedure, a communication plan, and a war room. Runbooks should be tested in a staging environment that mirrors production exactly. Untested runbooks fail at 3 AM.

The runbook should be a step-by-step guide that a competent engineer can follow without domain expertise. Every step should have an owner, a duration estimate, and a verification method.
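One way to enforce that structure is to hold the runbook as data rather than as free text. The sketch below is an assumption about how that might look, not a description of a specific tool; the class, the field names, and the sample step are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunbookStep:
    description: str
    owner: str                 # the person or team accountable for the step
    estimated_minutes: int     # duration estimate
    action: Callable[[], None]
    verify: Callable[[], bool]  # returns True only if the step demonstrably worked


def total_minutes(steps: list[RunbookStep]) -> int:
    """Sum the duration estimates to check the plan fits the migration window."""
    return sum(step.estimated_minutes for step in steps)


def execute(steps: list[RunbookStep]) -> tuple[bool, str]:
    """Run steps in order and stop at the first one that fails verification."""
    for number, step in enumerate(steps, start=1):
        step.action()
        if not step.verify():
            return False, f"step {number} failed verification: {step.description}"
    return True, "all steps verified"


# Hypothetical single-step runbook with stand-in action and check.
state = {"writes_enabled": True}
steps = [
    RunbookStep(
        description="Put source database into read-only mode",
        owner="dba-team",
        estimated_minutes=10,
        action=lambda: state.update(writes_enabled=False),
        verify=lambda: state["writes_enabled"] is False,
    ),
]
print(total_minutes(steps), execute(steps))
```

Stopping at the first failed verification gives the war room a precise point at which to decide between fixing forward and rolling back.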

Critical: Communication during migration is essential. Stakeholders need to know what is happening, when it will complete, and whether there are issues. Use a standard cadence: start notice, hourly updates, completion notice, post-migration summary.

6. Optimisation

Post-migration, 40-60% of cloud spend is typically waste. Rightsize instances, evaluate reserved capacity, and implement auto-scaling. Optimisation should start 30 days after migration, not six months later.

The optimisation phase has three stages. First, rightsizing: analyse CloudWatch, Azure Monitor, or Google Cloud Operations metrics to identify over-provisioned instances. An instance averaging 15% CPU utilisation is over-provisioned; downsize it or consolidate workloads. Second, reserved capacity: for predictable workloads, commit to 1- or 3-year reserved instances for 40-60% savings. Third, auto-scaling: implement dynamic scaling for variable workloads, with scale-up triggers at 70% CPU and scale-down triggers at 30%.
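The three stages reduce to a few small rules. The sketch below encodes the thresholds quoted above; treating 15% average CPU as the cut-off for over-provisioning is our reading of the example in the text, and the function names and the £800 figure are illustrative.

```python
def is_over_provisioned(avg_cpu_percent: float, threshold: float = 15.0) -> bool:
    """Flag an instance whose average CPU is at or below the threshold."""
    return avg_cpu_percent <= threshold


def reserved_monthly_cost(on_demand_monthly: float, discount: float) -> float:
    """Monthly cost after a reserved-capacity discount given as a fraction."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be a fraction between 0 and 1")
    return on_demand_monthly * (1 - discount)


def scaling_decision(cpu_percent: float, scale_up_at: float = 70.0,
                     scale_down_at: float = 30.0) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' for the current CPU reading."""
    if cpu_percent >= scale_up_at:
        return "scale_up"
    if cpu_percent <= scale_down_at:
        return "scale_down"
    return "hold"  # the band between the triggers prevents rapid flip-flopping


# Hypothetical workload costing 800 per month on demand.
low = reserved_monthly_cost(800, 0.60)
high = reserved_monthly_cost(800, 0.40)
print(f"Reserved cost range: {low:.0f} to {high:.0f} per month")
print(is_over_provisioned(15), scaling_decision(85), scaling_decision(50))
```

The wide gap between the two scaling triggers is intentional: if scale-up and scale-down thresholds sit close together, a workload hovering near them adds and removes capacity repeatedly.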

We typically schedule optimisation reviews at 30, 60, and 90 days post-migration. The 30-day review focuses on obvious waste: oversized instances, unused storage, and unattached resources. The 60-day review analyses usage patterns and implements reserved capacity. The 90-day review evaluates auto-scaling policies and refines them based on observed behaviour.

The Retain Decision

Not everything should move. Some systems are too risky, too tightly coupled to physical hardware, or too close to end-of-life to justify migration cost. Be explicit about what stays and why.

The retain decision is not a failure. It is a strategic choice.

Case Study — Mainframe API Layer: A financial services client had a core banking system on a mainframe. Migration estimate: £5 million and 24 months. Instead, they retained the mainframe and built a modern API layer around it. Delivered in 8 months for £1.2 million. The mainframe still runs, but it is now a backend system, not a constraint.

Rollback Strategy

Every migration must have a rollback plan that can execute in under 30 minutes. This means maintaining data synchronisation between old and new environments during the migration window, and having DNS cutover tested and ready.

The rollback plan is not a theoretical document. It is a tested procedure that runs in under 30 minutes from the first alert to full service restoration.

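DNS Cutover: Use low TTL values (60 seconds) during migration windows to enable rapid switching. Lower the TTL ahead of the window, at least one full period of the previous TTL before cutover, so that resolvers are no longer caching the older, longer-lived record. The cutover should be a single command: a script that updates the DNS record and verifies propagation. Test both cutover and cutback in staging before the migration.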
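The shape of such a script is sketched below. Because the update call depends on the DNS provider, both update_record and resolve are passed in as functions; they are hypothetical placeholders, as are the IP addresses, and a check against a single resolver is only a proxy for propagation.

```python
import time
from typing import Callable


def switch_dns(
    update_record: Callable[[str], None],  # placeholder for the provider's update call
    resolve: Callable[[], str],            # placeholder that returns the currently resolved IP
    target_ip: str,
    attempts: int = 24,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Point the record at target_ip, then poll until it resolves there.

    Returns False if the new address is not visible after all attempts,
    which should trigger investigation rather than a silent retry.
    """
    update_record(target_ip)
    for _ in range(attempts):
        if resolve() == target_ip:
            return True
        sleep(poll_seconds)
    return False


# Stand-in DNS record so the flow can be rehearsed without a real provider.
record = {"ip": "192.0.2.10"}           # old environment (documentation address)
new_ip, old_ip = "198.51.100.20", "192.0.2.10"

def set_ip(ip: str) -> None:
    record["ip"] = ip

def get_ip() -> str:
    return record["ip"]

print("cutover:", switch_dns(set_ip, get_ip, new_ip, sleep=lambda s: None))
print("cutback:", switch_dns(set_ip, get_ip, old_ip, sleep=lambda s: None))
```

Cutback is the same function called with the old address, which keeps the rollback path identical to the path that was tested going forward.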

Our Recommendation

Start with assessment, not migration. Spend time understanding dependencies. Pilot aggressively. Plan waves conservatively. And always have a tested rollback path.

Do not migrate alone. Cloud providers offer migration programmes, funding, and expertise. Partners like us provide frameworks, tooling, and experience. The cost of professional help is a fraction of the cost of a failed migration.

The organisations that succeed treat cloud migration as a strategic transformation, not a technical relocation. They invest in discovery, build robust landing zones, run disciplined pilots, and optimise continuously after migration.