The Scale Problem
Kubernetes works brilliantly at small scale. At large scale, it becomes a distributed systems problem of its own. We have operated 50+ production clusters across financial services, healthcare, and technology clients. Here is what we have learned.
Kubernetes is not a platform; it is a platform for building platforms. At scale, you need to build operational tooling, monitoring, and automation on top of Kubernetes.
The problems at scale are not the ones you encounter in tutorials. Resource contention in the control plane, etcd performance degradation, network policy complexity, and storage latency spikes become the limiting factors. We have seen clusters with 500 nodes where pod scheduling takes 30 seconds because the scheduler is overwhelmed.
Cluster Sizing
We recommend 100-200 nodes per cluster as the sweet spot. Larger clusters hit control plane bottlenecks, slower etcd performance, and a wider blast radius. Smaller clusters multiply operational overhead. The right size also depends on workload homogeneity: similar workloads co-locate better than diverse ones.
The blast radius consideration is equally important. A cluster-wide failure affects hundreds of nodes. We limit cluster size to contain blast radius: if a cluster fails, we want it to affect at most 20% of total capacity. This means a minimum of 5 clusters for most production estates, with workloads spread across them.
Multi-Tenancy Patterns
Three approaches to multi-tenancy: namespace isolation (soft), virtual clusters (medium), and physical cluster separation (hard). We use namespace isolation for dev/test, virtual clusters for team boundaries, and physical separation for regulatory requirements or blast radius containment.
Namespace isolation uses Kubernetes RBAC, NetworkPolicies, and ResourceQuotas to separate tenants within a cluster. It is lightweight but porous. A privileged container escape or kernel vulnerability can compromise the entire cluster. We use namespace isolation only for non-production environments where blast radius is acceptable.
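A namespace-isolated tenant combines those primitives directly. As a minimal sketch (the namespace `team-a` and the quota values are illustrative, not our standard limits), a ResourceQuota caps the tenant's footprint and a default-deny NetworkPolicy blocks traffic until flows are allowed explicitly:

```yaml
# Cap the tenant's aggregate resource usage (illustrative limits).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"
---
# Deny all ingress by default; the tenant then allows specific flows.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}      # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress
```

Namespace-scoped RBAC RoleBindings complete the picture. None of this stops a kernel-level escape, which is exactly why we keep the pattern to non-production.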
Physical cluster separation provides the strongest isolation. Each tenant gets dedicated control plane and worker nodes. This is required for regulatory compliance (PCI-DSS, HIPAA) and for critical workloads where blast radius must be minimised. The cost is significant: each cluster needs dedicated control plane nodes, monitoring, and operational overhead. We use physical separation sparingly, only where required.
Observability at Scale
Standard Prometheus + Grafana does not scale to 50 clusters. We use Thanos for global query aggregation and Grafana Mimir for long-term metrics storage. Logs ship to a central Loki cluster via Fluent Bit. The key insight: collect everything, but query selectively.
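Global querying works by pointing a single Thanos Query instance at a store endpoint in each cluster. A sketch of the container spec (the endpoint hostnames and image tag are placeholders; older Thanos releases use `--store` where recent ones prefer `--endpoint`):

```yaml
# Thanos Query fanning out across per-cluster Thanos sidecars (sketch).
containers:
  - name: thanos-query
    image: quay.io/thanos/thanos:v0.34.0
    args:
      - query
      - --http-address=0.0.0.0:9090
      # One endpoint per cluster's Thanos sidecar or store gateway.
      - --endpoint=thanos-sidecar.cluster-a.example:10901
      - --endpoint=thanos-sidecar.cluster-b.example:10901
```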
Logging at scale requires a different approach. Fluent Bit runs as a DaemonSet on each node, collecting container logs and shipping them to a central Loki cluster. Loki indexes only the log metadata (labels), not the full log content, which makes it efficient for high-volume log ingestion.
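The node-level shipping described above reduces to a Fluent Bit output section, shown here inside a ConfigMap fragment (the Loki host, cluster label, and namespace are assumptions for illustration):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  output-loki.conf: |
    [OUTPUT]
        Name    loki
        Match   kube.*
        Host    loki-gateway.observability.svc
        Port    3100
        # Only these labels are indexed by Loki; log bodies are not.
        Labels  job=fluent-bit, cluster=prod-eu-1
```

Keeping the label set small and static is the point: every extra label value multiplies Loki's index cardinality.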
At 50 clusters with 10,000 pods, the volume of metrics and logs is overwhelming. We build dashboards for the 20 metrics that matter: API server latency, etcd latency, pod scheduling time, node resource utilisation, and error rates. Everything else is available on demand but not visible by default.
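The metrics that matter can be precomputed as Prometheus recording rules so the default dashboards stay cheap to render. A sketch for two of them (the rule names follow our convention and are not standard; the underlying metric names are standard control plane metrics):

```yaml
groups:
  - name: control-plane-slos
    rules:
      # p99 API server request latency, broken down by verb.
      - record: apiserver:request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))
      # p99 etcd WAL fsync latency: an early warning of etcd degradation.
      - record: etcd:wal_fsync_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))
```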
GitOps Workflows
ArgoCD manages all cluster state. Applications, configurations, and policies live in Git. Changes are pull requests, not kubectl commands. This provides audit trails, rollback capability, and code review for infrastructure changes. The deployment pipeline is: PR → Review → Merge → ArgoCD Sync → Validation.
The GitOps pattern is: the desired state of the cluster is defined in Git, and an automated agent (ArgoCD, Flux) continuously reconciles the actual state with the desired state. If someone modifies the cluster directly with kubectl, the agent reverts the change. This makes the cluster immutable: the only way to change it is through Git.
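The reconciliation loop described above maps onto an ArgoCD Application with automated sync and self-heal enabled. A sketch (the repository URL, paths, and names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config.git
    targetRevision: main
    path: apps/payments-api
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert out-of-band kubectl changes
```

`selfHeal: true` is what makes the cluster effectively immutable: direct kubectl edits are reconciled away on the next sync loop.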
The validation stage is critical. After ArgoCD syncs an application, we run automated health checks: pod readiness, service endpoints, ingress accessibility, and business metrics. If any check fails, the pipeline reverts to the previous Git revision and ArgoCD syncs the cluster back to it. This provides automated rollback without human intervention.
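One way to wire validation into the sync itself is an ArgoCD PostSync hook: a Job that runs the checks, and whose failure marks the whole sync as failed. A minimal sketch (the image and health endpoint are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: post-sync-smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke-test
          image: curlimages/curl:8.6.0
          # Fail the Job (and the sync) on any non-2xx response.
          args: ["-fsS", "http://payments-api.payments.svc/healthz"]
```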
Our Recommendation
Start with namespace isolation and a single monitoring stack. Add virtual clusters when teams need more autonomy. Use physical separation only for compliance or critical blast radius containment. Invest in GitOps early — it pays dividends at every scale.
Start small. One cluster, one team, one application. Establish the patterns, then scale. The patterns that work at 1 cluster work at 50; the patterns that fail at 1 cluster fail catastrophically at 50.
Kubernetes at scale is not about the technology; it is about the operational discipline. Cluster sizing, multi-tenancy, observability, and GitOps are not features to enable; they are practices to establish. The organisations that succeed invest in operational excellence from the start, not as an afterthought.