Cloud · 8 min read · April 2026

Cloud Cost Audit Template

A systematic approach to identifying waste, right-sizing instances, and building a FinOps culture.

The 40% Rule

Most organisations waste 40-60% of their cloud spend. Not through negligence, but through lack of visibility. Resources are provisioned for peak load, left running out of hours, and forgotten after projects end. A systematic audit finds this waste and eliminates it.

An instance running at 15% CPU utilisation looks busy but is actually idle. A storage volume attached to a terminated instance is invisible in the console but still charged. A development environment left running outside working hours can add £200 per month for compute no one is using.

The root cause is organisational, not technical. Engineers provision resources for their immediate need and do not revisit them. Project teams disband, but their resources live on. Budget owners see aggregate spend but not individual resource utilisation. No one is responsible for resource lifecycle management. FinOps fixes this by making cost visible and assigning ownership.

Phase 1: Visibility

Before you optimise, you must see. Tag everything: environment, team, project, cost centre. Use cost allocation tags consistently. Set up billing alerts at 50%, 80%, and 100% of budget. Without visibility, optimisation is guesswork.

Tagging is the foundation of cost visibility. Every resource must have: Environment (production, staging, development), Team (engineering, data science, operations), Project (the specific initiative), and Cost Centre (the budget owner). We enforce tagging through policy-as-code: Terraform policies, AWS Organizations SCPs, or Azure Policy. Resources without required tags are automatically flagged, and after a grace period, terminated.
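As a sketch of that enforcement loop (the tag keys come from the text; the seven-day grace period is an assumption, and real enforcement would run through Terraform policies, SCPs, or Azure Policy rather than this standalone function):

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"Environment", "Team", "Project", "CostCentre"}
GRACE_PERIOD = timedelta(days=7)  # assumed grace period; set per policy

def missing_tags(tags: dict) -> set:
    """Required tag keys absent from a resource's tag set."""
    return REQUIRED_TAGS - tags.keys()

def compliance_action(tags: dict, created: datetime, now: datetime) -> str:
    """Flag untagged resources; terminate once the grace period lapses."""
    if not missing_tags(tags):
        return "compliant"
    return "terminate" if now - created > GRACE_PERIOD else "flag"
```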

Billing Alert Ladder: 50% = review needed. 80% = action required. 100% = emergency - stop non-essential resources, escalate to leadership, freeze new provisioning.
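The ladder above reduces to a simple threshold function; the budget-fraction cut-offs come straight from the text:

```python
def billing_alert(spend: float, budget: float) -> str:
    """Map month-to-date spend to the billing alert ladder."""
    ratio = spend / budget
    if ratio >= 1.0:
        return "emergency: stop non-essential resources, escalate, freeze provisioning"
    if ratio >= 0.8:
        return "action required"
    if ratio >= 0.5:
        return "review needed"
    return "ok"
```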

Cost dashboards provide ongoing visibility. We use native tools (AWS Cost Explorer, Azure Cost Management, GCP Cost Console) for high-level views, and third-party tools (CloudHealth, Datadog, Spot.io) for detailed analysis. The key metrics: total spend by service, by team, and by project; spend trend over time; and cost per transaction or user (unit economics).

Phase 2: Right-Sizing

Analyse actual CPU and memory utilisation over 30 days. Rightsize instances to match observed peaks, not theoretical maximums. We typically find 30% of instances are oversized by two or more instance classes. Downsize them.

Right-sizing requires data. We collect CPU, memory, disk, and network metrics from CloudWatch, Azure Monitor, or Google Cloud Monitoring. The analysis covers 30 days to capture weekly patterns, monthly cycles, and any anomalies. We look at p95 and p99 utilisation, not average.

Average Is Misleading: Average utilisation of 30% might hide peaks of 90% that require larger instances. Always use p95 and p99 utilisation for right-sizing decisions.
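A minimal illustration of the point, using a synthetic utilisation trace and a nearest-rank percentile:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n) of the sorted data."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Synthetic hourly CPU trace: idle most of the time, short daily bursts to 90%
cpu = [10] * 90 + [90] * 10
mean_util = sum(cpu) / len(cpu)   # 18.0 -- would suggest aggressive downsizing
p95_util = percentile(cpu, 95)    # 90   -- the peak the instance must absorb
```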

The right-sizing process: identify instances with sustained low utilisation (under 20% CPU or memory for 30 days), identify instances with bursty patterns that could use burstable instance types, identify instances that could be replaced by serverless functions or container services, and identify instances that are idle (zero utilisation for 7+ days). Each category has a different remediation: downsize, switch instance type, replatform, or terminate.
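The categories above can be sketched as a decision function. The thresholds follow the text; the `bursty` flag is assumed to be pre-computed from the metrics, and replatforming candidates need qualitative review, so the sketch leaves them as "keep":

```python
def remediation(p95_cpu: float, p95_mem: float, idle_days: int, bursty: bool) -> str:
    """Map an instance's utilisation profile to a remediation category."""
    if idle_days >= 7:
        return "terminate"                 # zero utilisation for 7+ days
    if bursty:
        return "switch to burstable type"  # bursty pattern, low baseline
    if p95_cpu < 20 and p95_mem < 20:
        return "downsize"                  # sustained low utilisation
    return "keep"                          # review separately for replatforming
```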

Savings Example: A typical medium-sized environment (50 instances) has 15 instances that can be downsized, saving £2,000-£5,000 per month. The analysis takes 2-3 days; the savings are permanent.

Phase 3: Scheduling

Development and test environments do not need to run 24/7. Implement automated start/stop schedules based on working hours. For global teams, stagger schedules by region. We typically see 20-30% savings from scheduling alone.

The simplest scheduling pattern: start development environments at 8 AM and stop them at 7 PM, Monday to Friday. That is 55 running hours out of 168 in a week, saving roughly two-thirds of development environment costs. Test environments can follow the same pattern, or run only during test execution windows.
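A sketch of the start/stop decision and the weekly running fraction it implies (the hours are the ones from the text; timezone handling is omitted):

```python
from datetime import datetime

START_HOUR, STOP_HOUR = 8, 19  # 8 AM to 7 PM
WORKDAYS = range(0, 5)         # Monday=0 .. Friday=4

def should_run(now: datetime) -> bool:
    """True when a development environment should be up under this schedule."""
    return now.weekday() in WORKDAYS and START_HOUR <= now.hour < STOP_HOUR

# Running fraction: 11 hours x 5 days out of 168 -- roughly two-thirds saved
running_fraction = (STOP_HOUR - START_HOUR) * 5 / (24 * 7)
```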

Advanced Scheduling: Dynamic scheduling based on CI/CD pipeline activity - start when a build triggers, stop 30 minutes after the last activity. Predictive scheduling learns historical patterns automatically. Savings increase from 20% to 30-40%.
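One evaluation tick of the dynamic pattern might look like this (a sketch, assuming the scheduler polls periodically and receives build events and a last-activity timestamp from the CI system):

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(minutes=30)  # stop 30 minutes after the last activity

def ci_env_action(running: bool, build_triggered: bool,
                  last_activity: datetime, now: datetime) -> str:
    """Start on a build trigger; stop once the idle timeout elapses."""
    if build_triggered and not running:
        return "start"
    if running and now - last_activity >= IDLE_TIMEOUT:
        return "stop"
    return "no-op"
```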

The exception: environments used for overnight batch jobs, global teams, or 24/7 testing. These should not be scheduled off but should be right-sized to their actual load. We review scheduling policies monthly to catch exceptions.

Phase 4: Storage Optimisation

Storage is often the largest hidden cost. Move infrequently accessed data to cheaper tiers. Delete orphaned snapshots and unused volumes. Implement lifecycle policies that transition data to archive storage automatically.

An EBS volume costs £0.10 per GB per month. A 1TB volume costs £1,200 per year. Most organisations have hundreds of volumes, many unused or oversized.

The storage optimisation process: identify unattached volumes and delete them after 7 days of non-attachment (with snapshot backup), identify idle volumes and downsize them, identify infrequently accessed data and transition to cheaper tiers (S3 Infrequent Access, Glacier, or Azure Cool Blob), and implement lifecycle policies that automatically transition data based on age and access patterns.
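The decision tree above can be sketched per volume. The 7-day unattached threshold is from the text; the 30- and 90-day access thresholds are illustrative assumptions:

```python
def storage_action(attached: bool, days_unattached: int, days_since_access: int) -> str:
    """One volume's path through the storage optimisation decision tree."""
    if not attached:
        # snapshot for safety, then delete after a week unattached
        return "snapshot-and-delete" if days_unattached >= 7 else "monitor"
    if days_since_access >= 90:
        return "transition to archive tier"            # e.g. Glacier
    if days_since_access >= 30:
        return "transition to infrequent-access tier"  # e.g. S3 IA, Azure Cool
    return "keep on standard tier"
```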

Snapshot Creep: A daily snapshot policy with 30-day retention creates 3,000 snapshots across 100 volumes. At £0.05 per GB-month, a 500GB snapshot costs £25 per month. 3,000 snapshots cost £75,000 per month. Implement tiered retention: daily for 7 days, weekly for 4 weeks, monthly for 12 months, then delete.
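The tiered retention rule can be sketched as a keep/delete predicate. Here the "weekly" tier is approximated by keeping Monday snapshots and the "monthly" tier by keeping first-of-month snapshots; real backup tooling would select the oldest snapshot in each window instead:

```python
from datetime import date

def keep_snapshot(taken: date, today: date) -> bool:
    """Tiered retention: daily for 7 days, weekly for 4 weeks,
    monthly for 12 months, then delete."""
    age = (today - taken).days
    if age <= 7:
        return True                 # daily tier
    if age <= 28:
        return taken.weekday() == 0 # weekly tier: keep Monday snapshots
    if age <= 365:
        return taken.day == 1       # monthly tier: keep first-of-month
    return False
```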

Phase 5: Reserved Capacity

For predictable baseline workloads, reserved instances or savings plans reduce costs by 30-60%. The key is committing only to predictable usage. Overcommitting wastes money; undercommitting leaves savings on the table.

Reserved instances require a 1- or 3-year commitment. The savings are substantial: 30-40% for 1-year, and 50-60% for 3-year with all-upfront payment. The risk is committing to usage that changes. We recommend starting with 1-year reservations for the most stable workloads (production databases, core services), then expanding as usage patterns become clearer.

Savings Plans vs Reserved Instances: Savings plans are more flexible — they apply to a family of instance types and commit to a dollar amount per hour. We prefer savings plans for variable environments and reserved instances for stable, predictable usage.

The analysis process: identify instances that run 24/7 with stable utilisation, calculate the baseline usage (minimum observed over 30 days), reserve that baseline, and leave the variable portion on-demand. This provides the best of both: savings on predictable usage, flexibility on variable usage.
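The baseline calculation above can be sketched numerically. The 35% discount is an assumed 1-year figure (within the 30-40% range quoted earlier), and the usage list stands in for hourly instance counts observed over the analysis window:

```python
def reservation_plan(hourly_usage: list[float], ri_discount: float = 0.35):
    """Reserve the observed baseline; leave the variable portion on-demand.
    Returns (baseline instances to reserve, blended fraction of cost saved)."""
    baseline = min(hourly_usage)                   # minimum observed usage
    total_hours = sum(hourly_usage)                # total instance-hours
    reserved_hours = baseline * len(hourly_usage)  # hours covered by the reservation
    saving = reserved_hours * ri_discount / total_hours
    return baseline, round(saving, 3)
```

With a trace that never drops below 4 instances, the plan reserves 4 and leaves the peaks on-demand, so the blended saving is the discount scaled by the reserved share of usage.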

Our Recommendation

Run this audit quarterly. Cloud drift is real — resources accumulate, teams change, and usage patterns shift. A quarterly 2-day audit consistently finds 15-25% savings in mature environments.

When engineers see cost as a feature, not an afterthought, waste reduces organically. When teams are accountable for their cloud spend, they optimise proactively.

Start with visibility. You cannot optimise what you cannot see. Then right-size, schedule, optimise storage, and reserve capacity. Each phase builds on the previous, creating compounding savings. The organisations that succeed treat cost optimisation as a continuous practice, not a one-time project.

Voodoo AI Engineering Team

We have reduced cloud spend by 40-60% for enterprise clients.
