The difference between teams that get paged at 3 AM and teams that fix things before users notice is usually not better hardware or more experienced engineers. It is instrumentation that catches problems when they are small.

Most Prometheus setups monitor “is it broken now?” The useful question is “is it trending toward broken?” Here are the metrics and alert patterns that answer that.

The Four Golden Signals (and Their Prediction Variants)

Google SRE’s four golden signals are a starting point, not an ending point. The prediction variants are what actually give you advance warning.

1. Error Rate with Trend Detection

Standard alert: error rate above threshold. Predictive alert: error rate increasing faster than baseline.

# Standard - fires when broken
- alert: HighErrorRate
  expr: |
    rate(http_requests_total{status=~"5.."}[5m]) /
    rate(http_requests_total[5m]) > 0.01

# Predictive - fires when trending up 30 minutes before threshold
- alert: ErrorRateIncreasing
  expr: |
    rate(http_requests_total{status=~"5.."}[5m]) /
    rate(http_requests_total[5m])
    > 3 * (
      rate(http_requests_total{status=~"5.."}[30m]) /
      rate(http_requests_total[30m])
    )
    AND
    rate(http_requests_total{status=~"5.."}[5m]) /
    rate(http_requests_total[5m]) > 0.001
  annotations:
    summary: "Error rate is 3x higher than 30-minute baseline"

This fires when your error rate spikes relative to your recent baseline, even if it has not crossed an absolute threshold. A service going from 0.01% errors to 0.05% errors is a significant change that absolute thresholds often miss.
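Both expressions repeat the same ratio twice. A pair of recording rules can precompute it at each window, which keeps the alert expression short and cheap to evaluate. The rule names below follow the common level:metric:operations naming convention but are illustrative, not standard:

```yaml
# Recording rules - precompute the error ratio at both windows.
groups:
  - name: error-ratio
    rules:
      - record: job:http_errors:ratio5m
        expr: |
          rate(http_requests_total{status=~"5.."}[5m]) /
          rate(http_requests_total[5m])
      - record: job:http_errors:ratio30m
        expr: |
          rate(http_requests_total{status=~"5.."}[30m]) /
          rate(http_requests_total[30m])

# The predictive alert then reduces to:
#   expr: |
#     job:http_errors:ratio5m > 3 * job:http_errors:ratio30m
#     and job:http_errors:ratio5m > 0.001
```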

2. Saturation - Memory Pressure Before OOM

Waiting until a service OOM-kills is too late. Memory pressure shows up in metrics before the crash:

# JVM heap usage trending toward GC pressure
- alert: JvmHeapSaturation
  expr: |
    jvm_memory_used_bytes{area="heap"} /
    jvm_memory_max_bytes{area="heap"} > 0.80
  for: 10m
  annotations:
    summary: "JVM heap above 80% for 10 minutes - GC pressure likely"

# Rate of heap growth (will it OOM?)
- alert: HeapGrowthRate
  expr: |
    predict_linear(
      jvm_memory_used_bytes{area="heap"}[30m],
      3600  # Project 1 hour forward
    ) > jvm_memory_max_bytes{area="heap"}
  annotations:
    summary: "At current growth rate, heap will be exhausted in under 1 hour"

predict_linear fits a least-squares line to a metric's recent samples and projects it forward in time. A “will you run out of memory in N seconds?” alert is far more actionable than “memory is high.”
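The same regression can be inverted into a “seconds of headroom left” number for dashboards. deriv() uses the same least-squares fit as predict_linear, so the two stay consistent. A sketch, as a recording rule against the JVM metrics above (the rule name is illustrative):

```yaml
# Seconds of heap headroom at the current growth rate.
# Only meaningful while the slope is positive; negative
# values mean the heap is currently shrinking.
- record: jvm:heap_seconds_to_exhaustion
  expr: |
    (jvm_memory_max_bytes{area="heap"} - jvm_memory_used_bytes{area="heap"})
    /
    deriv(jvm_memory_used_bytes{area="heap"}[30m])
```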

3. Connection Pool Exhaustion

Database connection pools are one of the most common, and most predictable, sources of outages. The pool runs out, queries queue, latency spikes, circuit breakers open, and the service goes down.

# Pool approaching exhaustion
- alert: DbConnectionPoolSaturation
  expr: |
    db_connection_pool_active /
    db_connection_pool_max > 0.80
  for: 5m

# Connections queuing - imminent pool exhaustion
- alert: DbConnectionPoolWaiting
  expr: db_connection_pool_pending > 0
  for: 2m
  annotations:
    summary: "Queries waiting for database connections - pool saturated"

Any non-zero pending count is an emergency: new queries are already queuing for connections, latency is already degraded, and you have minutes before cascading failure.
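The pool metric names above are generic placeholders. If you run HikariCP with Micrometer, the equivalents are the hikaricp_connections_* gauges. This assumes Micrometer's default naming; verify the exact names against your /metrics endpoint:

```yaml
# Same alerts against HikariCP's Micrometer metrics.
- alert: HikariPoolSaturation
  expr: |
    hikaricp_connections_active /
    hikaricp_connections_max > 0.80
  for: 5m

- alert: HikariPoolWaiting
  expr: hikaricp_connections_pending > 0
  for: 2m
```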

4. Latency Distribution Shifts

Average latency hides problems. P99 latency shifts are early warning signals.

# P99 above threshold
- alert: HighP99Latency
  expr: |
    histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
    > 2.0

# P99 diverging from P50 - indicates a slow tail
- alert: LatencyDistributionDegrading
  expr: |
    histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
    /
    histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
    > 10
  annotations:
    summary: "P99 latency is 10x P50 - slow outlier requests detected"

A P99/P50 ratio above 10 means the slowest 1% of requests take at least 10x longer than the median request. This usually points to a specific code path, a database query, or an external dependency with intermittent slowness.
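One caveat with the queries above: histogram_quantile computes one quantile per label set, so without aggregation you get a separate P99 for every instance, path, and method combination. Summing the buckets first, keeping only the labels you care about plus le, yields a service-level quantile:

```yaml
# Service-level P99/P50 ratio, aggregated across instances.
- alert: LatencyDistributionDegrading
  expr: |
    histogram_quantile(0.99,
      sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
    /
    histogram_quantile(0.50,
      sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
    > 10
```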

Disk Exhaustion with Time Projection

In the worst cases, running out of disk space fails services silently. Prometheus retains enough history to predict it:

- alert: DiskSpaceRunningOut
  expr: |
    predict_linear(
      node_filesystem_avail_bytes[6h],
      4 * 3600  # Project 4 hours forward
    ) < 0
  annotations:
    summary: "Disk will be full in under 4 hours at current write rate"
    runbook: "https://wiki.example.com/runbooks/disk-exhaustion"

The 6-hour window smooths out write bursts. The 4-hour projection gives you time to act.
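In practice node_filesystem_avail_bytes also matches tmpfs and other pseudo-filesystems, whose projections are noise. Filtering by fstype keeps the alert on real disks; the label values below are typical node_exporter output, so check what your exporter actually reports:

```yaml
- alert: DiskSpaceRunningOut
  expr: |
    predict_linear(
      node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}[6h],
      4 * 3600  # Project 4 hours forward
    ) < 0
  annotations:
    summary: "Disk will be full in under 4 hours at current write rate"
    runbook: "https://wiki.example.com/runbooks/disk-exhaustion"
```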

The Alert That Catches Everything Else: Burn Rate

Alerting on individual metrics is whack-a-mole. Burn rate alerts over multiple windows catch problems based on how fast you are consuming your error budget:

# Fast burn - fires quickly for severe problems
- alert: SLOBurnRateFast
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[1h]) /
      rate(http_requests_total[1h])
    ) > 14.4 * 0.001  # 14.4x burn rate against 0.1% error budget

# Slow burn - catches gradual degradation
- alert: SLOBurnRateSlow
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[6h]) /
      rate(http_requests_total[6h])
    ) > 6 * 0.001  # 6x burn rate
  annotations:
    summary: "Gradual error rate elevation will exhaust SLO budget"

The burn rate model (from Google’s SRE book) pages you based on how quickly you are consuming your reliability budget over different windows. A 1-hour window catches fast burns. A 6-hour window catches slow degradations that individual metric alerts miss.
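The full pattern from Google's SRE Workbook pairs each long window with a short one: the long window proves the burn is severe, the short window proves it is still happening, so the alert also resolves quickly once the problem is fixed. A sketch of the fast-burn pair:

```yaml
# Multiwindow fast burn: 1h window for severity, 5m window
# so the alert stops firing promptly after recovery.
- alert: SLOBurnRateFast
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[1h]) /
      rate(http_requests_total[1h])
    ) > 14.4 * 0.001
    and
    (
      rate(http_requests_total{status=~"5.."}[5m]) /
      rate(http_requests_total[5m])
    ) > 14.4 * 0.001
```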

Alert Hygiene: Why Your Alerts Are Being Ignored

Predictive alerts only work if your team responds to them. The most common failure mode: alert fatigue from alerts that fire too often for non-actionable reasons.

Rules for good alerts:

  1. Every alert must have a runbook - If you cannot write a runbook for the alert, the alert should not exist
  2. Alerts must be actionable immediately - The recipient must be able to do something right now
  3. Use for: clauses - A metric that spikes for 30 seconds is rarely worth a page. Require sustained conditions
  4. Page only P0/P1 conditions - Everything else goes to Slack or a ticket queue
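Rule 4 is enforced in Alertmanager, not in the alerting rules themselves: tag each alert with a severity label and route on it. A minimal sketch; the receiver names are placeholders for your own integrations:

```yaml
# In each alerting rule:
labels:
  severity: page   # or: ticket

# In alertmanager.yml:
route:
  receiver: slack-default      # non-paging alerts land here
  routes:
    - matchers:
        - severity="page"
      receiver: pagerduty      # only severity=page wakes someone
```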

Bottom Line

The gap between reactive and predictive monitoring is mostly about using metrics that trend toward problems rather than metrics that report them after they happen. predict_linear for disk and memory, connection pool pending metrics, P99/P50 ratio divergence, and burn rate alerts catch the most common outage precursors.

The total configuration for all of these is under 100 lines of Prometheus alerting rules. The ROI of that configuration - measured in incidents prevented and engineers not paged at 3 AM - is difficult to overstate.