Most companies running Kubernetes are overpaying. Not by a little - by 30-60% in many cases. The reason is almost never that Kubernetes is inherently expensive. It is that the default configurations prioritize availability over efficiency, and no one went back and tuned them.

Here are the tactics that actually move the needle.

The Root Cause: Overprovisioning at Every Layer

Kubernetes costs compound because overprovisioning happens at multiple levels simultaneously:

  1. Nodes are larger than required by workloads
  2. Pod resource requests are set too high (pods ask for more than they use)
  3. Replicas are too many for actual traffic levels
  4. Storage is allocated and forgotten
  5. Load balancers are created and orphaned

Each layer of overprovisioning multiplies with the others. A pod that requests 2 CPUs but uses 0.3 (15% of its request), running at 5 replicas when 2 would suffice (40% of replicas doing useful work), on a node that is 30% utilized - compound those ratios and you are getting roughly 10% of what you are paying for.

Set Accurate Resource Requests and Limits

Resource requests tell the scheduler how much CPU and memory to reserve for a pod. If your requests are wrong, your utilization numbers are meaningless.

The workflow to fix this:

# Get actual usage data
kubectl top pods --all-namespaces --sort-by=cpu

# For a specific deployment over time, use Prometheus query:
# avg(rate(container_cpu_usage_seconds_total{container="my-app"}[5m]))

The general rule: set requests to the 75th percentile of actual usage, set limits to the 95th percentile. This allows headroom for spikes while accurately reflecting typical usage for scheduling purposes.
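Applied to a hypothetical service whose measured usage is roughly 400m CPU / 300Mi memory at the 75th percentile and 800m / 450Mi at the 95th, the rule translates into a resources stanza like this (names and numbers are illustrative, not measured values):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app  # illustrative, matching the examples in this article
spec:
  template:
    spec:
      containers:
      - name: my-app
        resources:
          requests:
            cpu: "400m"      # ~75th percentile of observed CPU usage
            memory: "300Mi"  # ~75th percentile of observed memory usage
          limits:
            cpu: "800m"      # ~95th percentile: headroom for spikes
            memory: "450Mi"  # ~95th percentile
```

One caveat worth knowing: exceeding a CPU limit throttles the container, but exceeding a memory limit OOM-kills it, so some teams give memory limits extra headroom beyond the 95th percentile.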

VPA (Vertical Pod Autoscaler) in recommendation mode can automate this analysis:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Recommendation only, no automatic updates

After a few days, kubectl describe vpa my-app-vpa shows recommended request values based on actual usage.
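The recommendation appears in the VPA object's status. The shape looks roughly like the following (values are illustrative):

```yaml
# Fragment of `kubectl describe vpa` / VPA status output (illustrative values)
recommendation:
  containerRecommendations:
  - containerName: my-app
    lowerBound:            # below this, the pod is likely under-resourced
      cpu: 250m
      memory: 262144k
    target:                # VPA's suggested request values
      cpu: 410m
      memory: 314572k
    upperBound:            # above this, you are likely overprovisioned
      cpu: 1100m
      memory: 524288k
```

The target values map directly onto the requests you would set in the Deployment spec.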

Autoscaling That Actually Works

HPA (Horizontal Pod Autoscaler) scales your pods. Cluster Autoscaler scales your nodes. Both need to be configured correctly to work together.

The common HPA mistake is scaling on CPU only. CPU is a lagging indicator for many workloads. A web server that queues requests does not show high CPU until it is already in trouble. Consider scaling on custom metrics - queue depth, request latency, active connections - depending on your workload.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2     # example bounds; tune to your traffic
  maxReplicas: 10
  metrics:
  - type: Pods       # requires a custom metrics adapter (e.g. Prometheus Adapter)
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

For Cluster Autoscaler, tune --scale-down-utilization-threshold (0.5 by default: nodes whose requested resources fall below 50% of capacity become scale-down candidates, so raising it makes scale-down more aggressive) and set --scale-down-delay-after-add (10 minutes by default) to something reasonable for your workload startup time. The defaults are conservative and leave underutilized nodes running longer than necessary.
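These flags are passed to the cluster-autoscaler container itself. A sketch of the relevant fragment of its Deployment spec, with starting-point values that are illustrative rather than universal recommendations:

```yaml
# Fragment of the cluster-autoscaler Deployment; flag values are examples
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # match your cluster version
  command:
  - ./cluster-autoscaler
  - --scale-down-utilization-threshold=0.6  # nodes below 60% of requested capacity become candidates
  - --scale-down-delay-after-add=5m         # default is 10m; shorten if your pods start quickly
  - --scale-down-unneeded-time=5m           # how long a node must stay unneeded before removal
```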

Spot and Preemptible Instances

This is the single highest-leverage cost reduction available. Spot instances (AWS) and Spot VMs (GCP, formerly called preemptible VMs) typically run at a 60-80% discount compared to on-demand pricing.

The tradeoff is that cloud providers can reclaim them with 2 minutes notice (AWS) or 30 seconds (GCP). This sounds scary. In practice, workloads designed for this run reliably because:

  • Kubernetes automatically reschedules evicted pods
  • Node pools typically spread spot capacity across multiple instance types and zones, so simultaneous eviction of every node is unlikely
  • Graceful shutdown handling with PodDisruptionBudgets ensures clean workload migration
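A PodDisruptionBudget caps how many replicas a voluntary disruption - including a spot node being drained by an interruption handler - can take down at once. A minimal sketch, with illustrative names and labels:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2        # never let voluntary disruptions drop below 2 ready pods
  selector:
    matchLabels:
      app: my-app        # must match the labels on the protected pods
```

Note that a PDB protects against drains and evictions, not against the instance disappearing outright, which is why the 2-minute AWS interruption notice matters: node termination handlers use that window to drain gracefully.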

The correct architecture: run stateless workloads on spot nodes, run stateful workloads (databases, persistent queues) on on-demand nodes. Use node affinity or taints to steer this. A preferred affinity expresses a soft preference for spot capacity, so the scheduler falls back to on-demand when spot is unavailable. Note that the label key is provider-specific - EKS uses eks.amazonaws.com/capacityType, GKE uses cloud.google.com/gke-spot:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: kubernetes.io/node-type  # illustrative; use your provider's spot label
          operator: In
          values: ["spot"]
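Preferred affinity only steers scheduling. For a hard guarantee that nothing else lands on spot capacity, taint the spot nodes and add a matching toleration only to workloads allowed to run there (the taint key here is illustrative):

```yaml
# Taint applied to each spot node, e.g.:
#   kubectl taint nodes <node-name> spot=true:NoSchedule
# Pods WITHOUT the toleration below can never be scheduled onto these nodes.
tolerations:
- key: "spot"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
```

Combining the taint (keeps everything else off spot) with the preferred affinity (pulls tolerant workloads onto spot) gives the cleanest separation.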

Namespace-Level Resource Quotas

Without quotas, any team can create any number of pods with any resource requests. This is how clusters become expensive.

ResourceQuotas set hard limits per namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
    services.loadbalancers: "2"

LimitRanges set defaults and constraints on individual containers:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:
      cpu: "100m"
      memory: "64Mi"
    type: Container

This ensures every container has resource requests even if the developer forgot to set them.

Cleaning Up Orphaned Resources

Old deployments, orphaned PVCs, forgotten LoadBalancer services - these accumulate silently and cost money.

| Resource | Monthly cost if forgotten | How to find |
|---|---|---|
| LoadBalancer service | $15-30 | kubectl get svc -A --field-selector spec.type=LoadBalancer |
| PersistentVolumeClaim | $5-50+ | kubectl get pvc -A |
| Idle node pools | $100-500+ | Check cloud console utilization |
| Unused container images | Minimal (storage) | Registry cleanup policies |

Automate cleanup with scheduled jobs or tools like kube-janitor that delete resources based on annotations after a TTL.
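kube-janitor, for example, deletes resources that carry a TTL annotation once the TTL expires. A sketch of how a temporary environment would opt in (resource name illustrative; check the kube-janitor docs for the exact annotation semantics):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-test-env
  annotations:
    janitor/ttl: "7d"  # kube-janitor deletes this resource 7 days after creation
```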

The Cost Visibility Problem

You cannot optimize what you cannot see. Most teams lack per-namespace or per-team cost visibility.

Kubecost and OpenCost (the open-source version) allocate cluster costs to namespaces, deployments, and labels. This is the foundation for accountability:

  • Which team is spending the most?
  • Which workloads are most inefficient?
  • What would it cost to increase replicas of a specific service?

Without this visibility, optimization is guesswork.

Realistic Savings Expectations

| Tactic | Typical savings |
|---|---|
| Right-sizing resource requests | 20-35% |
| Spot instances for stateless workloads | 30-50% of compute |
| Proper autoscaling | 10-25% |
| Cleaning orphaned resources | 5-15% |
| Combined | 40-60% of original spend |

The combined effect is significant. A $50,000/month cluster bill can realistically reach $25,000-30,000 with disciplined implementation of these tactics.

Bottom Line

Kubernetes clusters overspend by default because the safe defaults prioritize availability over efficiency. The fixes - accurate resource requests, autoscaling tuned to real metrics, spot instances for stateless workloads, namespace quotas, and cost visibility tooling - are not complicated. They require measurement, iteration, and organizational commitment to treating cloud costs as an engineering metric. The savings from doing this correctly are large enough to justify significant engineering investment.