1. Requirements & Scope (5 min)

Functional Requirements

  1. Ingest metrics from thousands of services (CPU, memory, request latency, error rate, custom business metrics)
  2. Store metrics with high granularity (per-second) for recent data, lower granularity (per-minute, per-hour) for historical data
  3. Query metrics: time-range aggregations (avg, sum, p50, p95, p99, max, min)
  4. Real-time dashboards with auto-refresh (< 30 second data freshness)
  5. Alerting: trigger alerts when metrics cross thresholds (e.g., p99 latency > 500ms for 5 minutes)

Non-Functional Requirements

  • Availability: 99.9% for ingestion (losing metrics is acceptable during brief outages — we don’t want to lose all data, but a few seconds of data loss during failover is tolerable). 99.99% for querying (dashboards must be up during incidents — that’s when they’re needed most).
  • Latency: Ingestion → queryable in < 30 seconds. Dashboard queries: simple queries < 500ms, complex aggregations < 5 seconds.
  • Scale: 10M metric data points/sec ingestion. Up to 3 years retention, with granularity reduced as data ages (exact tiers in Deep Dive 2).
  • Durability: Metrics data should survive individual node failures. Total loss acceptable only for catastrophic events.

2. Estimation (3 min)

Ingestion

  • 10M data points/sec
  • Each data point: metric_name (100B) + tags (200B) + timestamp (8B) + value (8B) = ~316 bytes → round to 300 bytes
  • 10M × 300 bytes = 3GB/sec ingestion throughput
  • Per day: ~260TB

Storage

  • Raw data at 1-second granularity: 260TB/day
  • With compression (time-series data compresses well): ~10:1 → 26TB/day
  • 1 year at full 1-second granularity: ~9.5PB compressed (an upper bound — the retention policy in Deep Dive 2 keeps raw data far shorter)
  • Rollup data (1-min, 1-hour granularity): ~1% of raw → additional ~100TB/year
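The arithmetic above can be checked in a few lines (decimal units, 10:1 compression as assumed):

```python
# Back-of-envelope check for the ingestion and storage numbers above.
points_per_sec = 10_000_000
bytes_per_point = 300                          # ~316B rounded down

ingest_bps = points_per_sec * bytes_per_point  # bytes/sec
per_day_tb = ingest_bps * 86_400 / 1e12        # TB/day (decimal units)

compressed_per_day = per_day_tb / 10           # ~10:1 compression
one_year_pb = compressed_per_day * 365 / 1000  # PB for 1 year of raw data

print(f"{ingest_bps / 1e9:.1f} GB/s, {per_day_tb:.0f} TB/day raw")
print(f"{compressed_per_day:.0f} TB/day compressed, {one_year_pb:.1f} PB/year")
```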

Query patterns

  • Dashboard queries: typically 1-6 hour time ranges, 5-10 metrics
  • Alert evaluation: thousands of alert rules evaluated every 60 seconds
  • Ad-hoc queries: arbitrary time ranges, complex aggregations

3. API Design (3 min)

Ingestion API

POST /api/v1/metrics
  Body: [
    {
      "metric": "http.request.latency",
      "tags": { "service": "api-gateway", "endpoint": "/v1/users", "method": "GET", "region": "us-east-1" },
      "timestamp": 1708632000,
      "value": 45.2
    },
    {
      "metric": "http.request.count",
      "tags": { "service": "api-gateway", "status": "200" },
      "timestamp": 1708632000,
      "value": 1
    }
  ]
  Response 202 Accepted

// StatsD-compatible push (tag syntax follows the DogStatsD extension)
UDP /statsd (fire-and-forget, lossy but fast)
  Payload: "http.request.latency:45.2|ms|#service:api-gateway,endpoint:/v1/users"
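A minimal parser for the DogStatsD-style payload shown above (illustrative sketch, not a full implementation of the protocol):

```python
def parse_statsd(line: str) -> dict:
    """Parse a DogStatsD-style line: "name:value|type|#tag:val,tag:val"."""
    name, rest = line.split(":", 1)
    fields = rest.split("|")
    value = float(fields[0])
    metric_type = fields[1]                    # ms, c, g, ...
    tags = {}
    for field in fields[2:]:
        if field.startswith("#"):
            for pair in field[1:].split(","):
                k, _, v = pair.partition(":")  # tag values may contain "/"
                tags[k] = v
    return {"metric": name, "value": value, "type": metric_type, "tags": tags}

print(parse_statsd("http.request.latency:45.2|ms|#service:api-gateway,endpoint:/v1/users"))
```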

Query API

POST /api/v1/query
  Body: {
    "metric": "http.request.latency",
    "aggregation": "p99",
    "tags": { "service": "api-gateway" },
    "from": "2024-02-22T12:00:00Z",
    "to": "2024-02-22T18:00:00Z",
    "interval": "1m"       // rollup interval
  }
  Response 200: {
    "series": [
      { "timestamp": 1708603200, "value": 42.3 },
      { "timestamp": 1708603260, "value": 45.1 },
      ...
    ]
  }

// Multi-metric query (for dashboards)
POST /api/v1/query/batch
  Body: {
    "queries": [
      { "metric": "cpu.usage", "aggregation": "avg", "tags": {...}, ... },
      { "metric": "memory.usage", "aggregation": "max", "tags": {...}, ... }
    ]
  }

Alert API

POST /api/v1/alerts
  Body: {
    "name": "High API Latency",
    "metric": "http.request.latency",
    "aggregation": "p99",
    "tags": { "service": "api-gateway" },
    "condition": "above",
    "threshold": 500,
    "duration": "5m",
    "channels": ["slack:#oncall", "pagerduty:team-infra"]
  }

4. Data Model (3 min)

Time-Series Storage (custom or TSDB like ClickHouse/TimescaleDB)

Conceptual schema:
  metric_name    | string
  tag_set        | map<string, string>  (sorted by key, so series hashing is deterministic)
  timestamp      | int64 (Unix epoch seconds)
  value          | float64

Physical storage (column-oriented):
  Series ID = hash(metric_name + sorted_tags)

  Series metadata table:
    series_id    (PK) | bigint
    metric_name        | string
    tags               | map<string, string>

  Data points table (partitioned by time):
    series_id          | bigint
    timestamp          | int64
    value              | float64
    Partition key: (series_id, time_bucket)
    Clustering: timestamp ASC

Key design choice: Column-oriented storage. Time-series queries almost always read a specific metric over a time range (columnar scans). Row-oriented storage would waste I/O reading irrelevant columns.
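The series-ID assignment can be sketched as a deterministic hash over the metric name plus tags sorted by key (SHA-256 truncated to 64 bits is an assumption; any stable hash works):

```python
import hashlib

def series_id(metric_name: str, tags: dict) -> int:
    """Deterministic 64-bit series ID: hash of the metric name plus its
    tags in sorted key order, so tag ordering never changes the ID."""
    canonical = metric_name + "," + ",".join(
        f"{k}={tags[k]}" for k in sorted(tags)
    )
    digest = hashlib.sha256(canonical.encode()).digest()
    return int.from_bytes(digest[:8], "big")

a = series_id("http.request.latency", {"service": "api-gateway", "region": "us-east-1"})
b = series_id("http.request.latency", {"region": "us-east-1", "service": "api-gateway"})
assert a == b  # tag order doesn't matter
```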

Rollup Tables

  rollup_1m:  (series_id, minute_timestamp) → (avg, min, max, sum, count, p50, p95, p99)
  rollup_1h:  (series_id, hour_timestamp) → (avg, min, max, sum, count, p50, p95, p99)
  rollup_1d:  (series_id, day_timestamp) → (avg, min, max, sum, count, p50, p95, p99)

5. High-Level Design (12 min)

Ingestion Path

Application → Metrics Agent (local, per-host)
  → Pre-aggregate locally (10-second flush interval)
  → Batch send to Ingestion Service

Ingestion Service (fleet of stateless servers):
  → Validate and normalize metrics
  → Assign series_id = hash(metric_name + sorted_tags)
  → Write to Kafka (partitioned by series_id)

Kafka → Storage Writers (consumer group):
  → Batch writes to Time-Series DB (TSDB)
  → Write-ahead log for durability
  → In-memory buffer → flush to disk when full or after 10 seconds
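The agent's pre-aggregation step might look like this minimal sketch (class and key format are illustrative):

```python
from collections import defaultdict

class AgentBuffer:
    """Host-local pre-aggregation sketch: fold raw points into one
    (count, sum, min, max) record per series per flush window, so the
    agent sends one record per series every 10s instead of every point."""

    def __init__(self):
        self.window = defaultdict(lambda: [0, 0.0, float("inf"), float("-inf")])

    def record(self, series_key: str, value: float) -> None:
        agg = self.window[series_key]
        agg[0] += 1                      # count
        agg[1] += value                  # sum
        agg[2] = min(agg[2], value)      # min
        agg[3] = max(agg[3], value)      # max

    def flush(self) -> dict:
        """Called every flush interval (10 seconds in the design above)."""
        out = {k: tuple(v) for k, v in self.window.items()}
        self.window.clear()
        return out

buf = AgentBuffer()
for v in (45.2, 12.0, 99.9):
    buf.record("http.request.latency|service:api-gateway", v)
batch = buf.flush()
```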

Query Path

Dashboard → Query Service
  → Parse query, determine time range
  → Route to appropriate storage tier:
    → Last 6 hours: hot storage (in-memory + recent disk)
    → 6 hours to 7 days old: warm storage (SSD)
    → Older: cold storage (HDD / object storage)
  → For aggregation queries spanning multiple series:
    → Scatter to relevant TSDB nodes (each holds a shard)
    → Each node performs local aggregation
    → Gather results, merge, return
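Tier routing by query age can be sketched as a small pure function (boundaries taken from the tiers above):

```python
from datetime import datetime, timedelta, timezone

def route_tier(query_from: datetime, now: datetime) -> str:
    """Pick a storage tier based on the oldest timestamp the query touches.
    Boundaries follow the tiers above: hot <= 6h, warm <= 7d, else cold."""
    age = now - query_from
    if age <= timedelta(hours=6):
        return "hot"
    if age <= timedelta(days=7):
        return "warm"
    return "cold"

now = datetime(2024, 2, 22, 18, 0, tzinfo=timezone.utc)
print(route_tier(now - timedelta(hours=3), now))   # hot
print(route_tier(now - timedelta(days=30), now))   # cold
```

A query spanning tiers (e.g., last 10 days) would be split at the boundaries and the partial results merged.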

Alert Evaluation

Alert Manager:
  → Every 60 seconds, for each active alert rule:
    → Execute the query (via Query Service)
    → Compare result to threshold
    → If threshold exceeded for >= duration:
      → Fire alert to configured channels (Slack, PagerDuty, email)
    → Track alert state: OK → PENDING → FIRING → RESOLVED

Components

  1. Metrics Agent: Lightweight daemon on every host. Pre-aggregates locally.
  2. Ingestion Service: Receives metrics from agents, writes to Kafka.
  3. Kafka: Durable buffer between ingestion and storage. Handles burst traffic.
  4. Storage Writers: Consume from Kafka, write to TSDB.
  5. TSDB Cluster: Distributed time-series database (custom or ClickHouse/VictoriaMetrics).
  6. Query Service: Handles metric queries, scatter-gather across TSDB nodes.
  7. Rollup Service: Periodically aggregates raw data into lower-granularity rollups.
  8. Alert Manager: Evaluates alert rules, manages alert lifecycle, sends notifications.
  9. Object Storage (S3): Cold storage for historical data.

6. Deep Dives (15 min)

Deep Dive 1: Time-Series Storage Engine Design

Write-optimized LSM-tree approach:

Time-series data is append-only (no updates, no deletes until TTL). This is perfect for LSM-tree storage:

  1. Write path:

    • Incoming data points go to an in-memory buffer (MemTable), sorted by (series_id, timestamp)
    • When buffer reaches threshold (e.g., 64MB): flush to disk as an immutable SSTable
    • Background compaction merges SSTables
  2. Compression:

    • Delta-of-delta encoding for timestamps: sequential timestamps differ by a constant (e.g., 1 second). Store the delta of deltas → mostly 0, compresses to 1-2 bits per timestamp.
    • XOR encoding for values: consecutive metric values are often similar. XOR of consecutive values has many leading/trailing zeros → compress to 1-2 bytes per value.
    • Overall: 16 bytes per data point (8B timestamp + 8B value) → compressed to ~2 bytes. 8:1 compression ratio.
    • This is the Gorilla compression scheme (Facebook’s time-series DB paper).
  3. Partitioning:

    • Partition by (series_id_range, time_block)
    • Time blocks: 2-hour blocks for recent data, 24-hour blocks for older data
    • This ensures queries for a specific series + time range hit a minimal number of partitions
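A quick sketch of why the two encodings work (the bit-packing itself is omitted; this just shows that the values to be encoded are mostly zero):

```python
import struct

def delta_of_delta(timestamps):
    """Second-order deltas. For a fixed scrape interval these are almost
    all zero, which is why they compress to a bit or two each."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_bits(a: float, b: float) -> int:
    """XOR of two float64 bit patterns; identical consecutive values
    XOR to 0, and similar values leave long runs of zero bits."""
    (ai,) = struct.unpack(">Q", struct.pack(">d", a))
    (bi,) = struct.unpack(">Q", struct.pack(">d", b))
    return ai ^ bi

ts = [1708632000 + i for i in range(10)]   # perfectly regular 1s series
print(delta_of_delta(ts))                  # -> [0, 0, 0, 0, 0, 0, 0, 0]
print(xor_bits(45.2, 45.2))                # -> 0
```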

Cardinality challenge:

  • “Cardinality” = number of unique series (unique combinations of metric_name × tag values)
  • If tags include user_id (millions of values), cardinality explodes → millions of series
  • Solution: enforce tag cardinality limits. High-cardinality identifiers (user IDs, request IDs) belong in logs or traces, not in metric tags.
  • Example: BAD: tags: { user_id: "123" } → millions of series. GOOD: report as a histogram/distribution without per-user series.
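The explosion is multiplicative — worst-case series count is the product of per-tag cardinalities (the numbers below are illustrative, not from the estimates above):

```python
import math

# Assumed per-tag cardinalities for a single metric.
tag_cardinality = {"service": 200, "endpoint": 50, "region": 5, "status": 10}

series_per_metric = math.prod(tag_cardinality.values())
print(series_per_metric)                 # -> 500000 series for ONE metric

# Add a user_id tag with 1M distinct values and the index blows up:
print(series_per_metric * 1_000_000)     # -> 500000000000
```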

Deep Dive 2: Rollup and Retention

Why rollups?

  • Querying 1 year of per-second data for one metric: 31.5M data points. Even compressed, that’s slow.
  • Rollup to 1-minute granularity: 525K data points. 60x fewer.
  • Rollup to 1-hour granularity: 8,760 data points. 3,600x fewer.

Rollup process:

Rollup Service (runs continuously):
  → Read raw data points for a completed time window (e.g., past minute)
  → For each series, compute: avg, min, max, sum, count
  → For percentiles (p50, p95, p99): use t-digest or DDSketch (mergeable sketches)
  → Write rollup results to rollup tables
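The non-percentile part of the rollup is a straightforward fold into minute buckets (sketch; the t-digest handling for percentiles is omitted):

```python
from collections import defaultdict

def rollup_1m(points):
    """Fold raw (timestamp, value) points into per-minute
    (avg, min, max, sum, count) records for one series."""
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % 60].append(v)  # align to minute boundary
    return {
        minute: {"avg": sum(vs) / len(vs), "min": min(vs), "max": max(vs),
                 "sum": sum(vs), "count": len(vs)}
        for minute, vs in buckets.items()
    }

print(rollup_1m([(0, 1.0), (30, 3.0), (61, 5.0)]))
```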

Percentile aggregation challenge:

  • You can’t compute exact p99 from pre-aggregated data (averaging p99s is mathematically wrong)
  • Solution: t-digest (or DDSketch) — a compact, mergeable data structure that approximates percentiles
  • Each 1-minute rollup stores a serialized t-digest (~200 bytes)
  • To query p99 over 1 hour: deserialize 60 t-digests, merge them, query p99 from the merged t-digest
  • Accuracy: within 1% for p99, which is sufficient for monitoring
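A two-bucket example makes the problem concrete (medians are used for simplicity; the same failure mode applies to p99):

```python
import statistics

b1 = [1.0] * 9     # quiet minute: 9 fast requests
b2 = [100.0]       # spike minute: 1 slow request

# Naive: average the per-minute medians -- weights both minutes equally.
avg_of_medians = (statistics.median(b1) + statistics.median(b2)) / 2

# Correct: compute the percentile over the merged population.
true_median = statistics.median(b1 + b2)

print(avg_of_medians)   # -> 50.5  (wildly wrong)
print(true_median)      # -> 1.0   (correct)
```

Mergeable sketches like t-digest behave like the merged-population case while staying compact.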

Retention policy:

Raw data (1s granularity): 7 days  → hot storage (SSD)
1-minute rollups: 90 days          → warm storage (SSD)
1-hour rollups: 1 year             → warm storage (HDD)
1-day rollups: 3 years             → cold storage (S3)

Downsampling pipeline: Background job runs continuously, processing the oldest data not yet rolled up. After rollup is confirmed, raw data outside the retention window is deleted.

Deep Dive 3: Alerting System

Requirements:

  • Evaluate thousands of alert rules every 60 seconds
  • No missed alerts (high recall)
  • Minimal false alerts (high precision)
  • Alert deduplication and grouping

Architecture:

Alert Manager:
  → Load all active alert rules (cached, refreshed every 5 min)
  → Every 60 seconds:
    → Partition rules across Alert Evaluator workers
    → Each worker evaluates its rules:
      → Execute query (using Query Service)
      → Apply condition (above/below threshold)
      → Check duration requirement (threshold exceeded for >= N minutes)
      → Update alert state machine:
        OK → (threshold exceeded) → PENDING
        PENDING → (duration met) → FIRING → send notification
        FIRING → (threshold no longer exceeded) → RESOLVED → send resolution

Alert state machine:

OK → PENDING: threshold breached (start counting duration)
PENDING → FIRING: threshold continuously breached for >= duration
PENDING → OK: threshold recovered before duration met (false alarm)
FIRING → RESOLVED: threshold recovered while alert was active
RESOLVED → OK: after cooldown period (prevents flapping)
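The transitions above, as a minimal sketch (the function and parameter names are assumptions):

```python
from enum import Enum

class AlertState(Enum):
    OK = "OK"
    PENDING = "PENDING"
    FIRING = "FIRING"
    RESOLVED = "RESOLVED"

def step(state, breached, breach_secs=0, duration_secs=300, cooled_down=False):
    """One 60-second evaluation tick of the state machine above.
    breach_secs: how long the threshold has been continuously breached."""
    if state is AlertState.OK and breached:
        return AlertState.PENDING
    if state is AlertState.PENDING:
        if not breached:
            return AlertState.OK                   # false alarm, recover
        if breach_secs >= duration_secs:
            return AlertState.FIRING               # notification fires here
    if state is AlertState.FIRING and not breached:
        return AlertState.RESOLVED                 # resolution notice here
    if state is AlertState.RESOLVED and cooled_down:
        return AlertState.OK                       # cooldown prevents flapping
    return state
```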

Notification channels:

  • Slack, PagerDuty, email, webhook
  • Deduplication: don’t re-fire the same alert if it’s already FIRING
  • Grouping: if 50 hosts have the same alert firing, group into one notification
  • Escalation: if not acknowledged within 15 minutes, escalate to next on-call

High availability:

  • Multiple Alert Manager replicas
  • Leader election (via Redis or ZooKeeper) — only leader evaluates rules
  • If leader dies, another replica becomes leader within seconds
  • Alternatively: all replicas evaluate all rules, but deduplicate notifications using a distributed lock

7. Extensions (2 min)

  • Distributed tracing integration: Correlate metrics with traces. When an alert fires for high latency, automatically link to example traces from that time window.
  • Anomaly detection: Instead of static thresholds, use ML-based anomaly detection. Train on historical patterns, alert on deviations. Handles seasonal patterns (weekend vs weekday) automatically.
  • Custom dashboards: Drag-and-drop dashboard builder (like Grafana). Store dashboard definitions as JSON. Support variables (e.g., dropdown to select service/region).
  • SLO tracking: Define SLIs (Service Level Indicators) and SLOs (Service Level Objectives). Track error budget burn rate. Alert when burn rate exceeds threshold.
  • Log correlation: Link metrics to logs. When CPU spikes, show corresponding log lines from that host/service.
  • Multi-tenancy: Support multiple teams/orgs with isolated data and separate quotas. Tenant-aware routing and query isolation.