1. Requirements & Scope (5 min)
Functional Requirements
- Ingest metrics from thousands of services (CPU, memory, request latency, error rate, custom business metrics)
- Store metrics with high granularity (per-second) for recent data, lower granularity (per-minute, per-hour) for historical data
- Query metrics: time-range aggregations (avg, sum, p50, p95, p99, max, min)
- Real-time dashboards with auto-refresh (< 30 second data freshness)
- Alerting: trigger alerts when metrics cross thresholds (e.g., p99 latency > 500ms for 5 minutes)
Non-Functional Requirements
- Availability: 99.9% for ingestion (losing metrics is acceptable during brief outages — we don’t want to lose all data, but a few seconds of data loss during failover is tolerable). 99.99% for querying (dashboards must be up during incidents — that’s when they’re needed most).
- Latency: Ingestion → queryable in < 30 seconds. Dashboard queries: simple queries < 500ms, complex aggregations < 5 seconds.
- Scale: 10M data points/sec ingestion. 1 year retention at fine granularity (raw recent data plus rollups), 3 years at reduced granularity.
- Durability: Metrics data should survive individual node failures. Total loss acceptable only for catastrophic events.
2. Estimation (3 min)
Ingestion
- 10M data points/sec
- Each data point: metric_name (100B) + tags (200B) + timestamp (8B) + value (8B) = ~316 bytes → round to 300 bytes
- 10M × 300 bytes = 3GB/sec ingestion throughput
- Per day: ~260TB
Storage
- Raw data at 1-second granularity: 260TB/day
- With compression (time-series data compresses well): ~10:1 → 26TB/day
- 1 year retention: ~9.5PB compressed
- Rollup data (1-min, 1-hour granularity): ~1% of raw → additional ~100TB/year
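The ingestion and storage figures above can be sanity-checked in a few lines (decimal units, 86,400 seconds/day):

```python
# Back-of-envelope check of the estimation figures above.
points_per_sec = 10_000_000
bytes_per_point = 300        # ~316B (name + tags + timestamp + value), rounded down
compression = 10             # 10:1, typical for time-series data

ingest_bps = points_per_sec * bytes_per_point      # 3 GB/sec
raw_per_day = ingest_bps * 86_400                  # ~260 TB/day
compressed_per_day = raw_per_day / compression     # ~26 TB/day
one_year = compressed_per_day * 365                # ~9.5 PB

print(f"{ingest_bps / 1e9:.1f} GB/s, {raw_per_day / 1e12:.0f} TB/day raw, "
      f"{one_year / 1e15:.1f} PB/year compressed")
```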
Query patterns
- Dashboard queries: typically 1-6 hour time ranges, 5-10 metrics
- Alert evaluation: thousands of alert rules evaluated every 60 seconds
- Ad-hoc queries: arbitrary time ranges, complex aggregations
3. API Design (3 min)
Ingestion API
POST /api/v1/metrics
Body: [
  {
    "metric": "http.request.latency",
    "tags": { "service": "api-gateway", "endpoint": "/v1/users", "method": "GET", "region": "us-east-1" },
    "timestamp": 1708632000,
    "value": 45.2
  },
  {
    "metric": "http.request.count",
    "tags": { "service": "api-gateway", "status": "200" },
    "timestamp": 1708632000,
    "value": 1
  }
]
Response: 202 Accepted
// StatsD/Prometheus-compatible push
UDP :8125 (StatsD-style line protocol; fire-and-forget, lossy but fast)
Payload: "http.request.latency:45.2|ms|#service:api-gateway,endpoint:/v1/users"
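A minimal parser for the tagged StatsD line shown above (a sketch only: real agents also handle sample rates and multi-metric packets):

```python
def parse_statsd(line: str) -> dict:
    """Parse one tagged StatsD-style line: name:value|type|#k:v,k:v"""
    name_value, metric_type, *rest = line.split("|")
    name, value = name_value.split(":", 1)
    tags = {}
    if rest and rest[0].startswith("#"):
        tags = dict(kv.split(":", 1) for kv in rest[0][1:].split(","))
    return {"metric": name, "value": float(value),
            "type": metric_type, "tags": tags}

sample = parse_statsd(
    "http.request.latency:45.2|ms|#service:api-gateway,endpoint:/v1/users"
)
# sample["tags"] == {"service": "api-gateway", "endpoint": "/v1/users"}
```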
Query API
POST /api/v1/query
Body: {
  "metric": "http.request.latency",
  "aggregation": "p99",
  "tags": { "service": "api-gateway" },
  "from": "2024-02-22T12:00:00Z",
  "to": "2024-02-22T18:00:00Z",
  "interval": "1m" // rollup interval
}
Response 200: {
  "series": [
    { "timestamp": 1708603200, "value": 42.3 },
    { "timestamp": 1708603260, "value": 45.1 },
    ...
  ]
}
// Multi-metric query (for dashboards)
POST /api/v1/query/batch
Body: {
  "queries": [
    { "metric": "cpu.usage", "aggregation": "avg", "tags": {...}, ... },
    { "metric": "memory.usage", "aggregation": "max", "tags": {...}, ... }
  ]
}
Alert API
POST /api/v1/alerts
Body: {
  "name": "High API Latency",
  "metric": "http.request.latency",
  "aggregation": "p99",
  "tags": { "service": "api-gateway" },
  "condition": "above",
  "threshold": 500,
  "duration": "5m",
  "channels": ["slack:#oncall", "pagerduty:team-infra"]
}
4. Data Model (3 min)
Time-Series Storage (custom or TSDB like ClickHouse/TimescaleDB)
Conceptual schema:
metric_name | string
tag_set | map<string, string> (sorted, so the series hash is deterministic)
timestamp | int64 (Unix epoch seconds)
value | float64
Physical storage (column-oriented):
Series ID = hash(metric_name + sorted_tags)
Series metadata table:
series_id (PK) | bigint
metric_name | string
tags | map<string, string>
Data points table (partitioned by time):
series_id | bigint
timestamp | int64
value | float64
Partition key: (series_id, time_bucket)
Clustering: timestamp ASC
Key design choice: Column-oriented storage. Time-series queries almost always scan a single metric's values over a time range, which maps to fast sequential columnar reads. Row-oriented storage would waste I/O reading irrelevant columns.
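The series ID derivation can be sketched as follows; blake2b is an illustrative choice of stable hash, not a mandated one (Python's built-in hash() is salted per process and would break cross-node routing):

```python
import hashlib

def series_id(metric_name: str, tags: dict) -> int:
    """Deterministic 64-bit series ID from metric name + sorted tags.

    Sorting tags means {"a": "1", "b": "2"} and {"b": "2", "a": "1"}
    map to the same series regardless of how the agent emitted them.
    """
    canonical = metric_name + "|" + ",".join(
        f"{k}={v}" for k, v in sorted(tags.items())
    )
    digest = hashlib.blake2b(canonical.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

a = series_id("http.request.latency", {"service": "api-gateway", "region": "us-east-1"})
b = series_id("http.request.latency", {"region": "us-east-1", "service": "api-gateway"})
assert a == b  # tag order does not matter
```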
Rollup Tables
rollup_1m: (series_id, minute_timestamp) → (avg, min, max, sum, count, p50, p95, p99)
rollup_1h: (series_id, hour_timestamp) → (avg, min, max, sum, count, p50, p95, p99)
rollup_1d: (series_id, day_timestamp) → (avg, min, max, sum, count, p50, p95, p99)
5. High-Level Design (12 min)
Ingestion Path
Application → Metrics Agent (local, per-host)
→ Pre-aggregate locally (10-second flush interval)
→ Batch send to Ingestion Service
Ingestion Service (fleet of stateless servers):
→ Validate and normalize metrics
→ Assign series_id = hash(metric_name + sorted_tags)
→ Write to Kafka (partitioned by series_id)
Kafka → Storage Writers (consumer group):
→ Batch writes to Time-Series DB (TSDB)
→ Write-ahead log for durability
→ In-memory buffer → flush to disk when full or after 10 seconds
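The size-or-age flush rule for the storage writers' in-memory buffer might look like the sketch below. The 10-second age matches the text above; the class name, batch size, and single-threaded design are illustrative assumptions:

```python
import time

class BatchBuffer:
    """Illustrative storage-writer buffer: flush when the batch is full
    or the oldest buffered point has waited max_age seconds."""

    def __init__(self, flush_fn, max_points=10_000, max_age=10.0):
        self.flush_fn = flush_fn      # e.g. a batched TSDB write
        self.max_points = max_points
        self.max_age = max_age
        self.points = []
        self.oldest = None            # monotonic time of first buffered point

    def add(self, point):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.points.append(point)
        if (len(self.points) >= self.max_points
                or time.monotonic() - self.oldest >= self.max_age):
            self.flush()

    def flush(self):
        if self.points:
            self.flush_fn(self.points)
            self.points, self.oldest = [], None
```

A production writer would also flush on a timer (so a quiet stream is not held past max_age) and persist to the write-ahead log before advancing Kafka offsets.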
Query Path
Dashboard → Query Service
→ Parse query, determine time range
→ Route to appropriate storage tier:
→ Last 6 hours: hot storage (in-memory + recent disk)
→ 6 hours to 7 days ago: warm storage (SSD)
→ Older: cold storage (HDD / object storage)
→ For aggregation queries spanning multiple series:
→ Scatter to relevant TSDB nodes (each holds a shard)
→ Each node performs local aggregation
→ Gather results, merge, return
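Local-aggregation-then-merge only works when the per-node partial results are mergeable. A sketch for avg, where each shard returns per-bucket (sum, count) pairs rather than finished averages (an avg-of-avgs merge would be wrong when shards hold unequal point counts):

```python
from collections import defaultdict

def merge_avg(shard_results):
    """Merge per-shard partial aggregates {ts: (sum, count)} into a
    global per-bucket average, computed exactly after the merge."""
    totals = defaultdict(lambda: [0.0, 0])
    for shard in shard_results:
        for ts, (s, c) in shard.items():
            totals[ts][0] += s
            totals[ts][1] += c
    return {ts: s / c for ts, (s, c) in totals.items()}

merged = merge_avg([
    {1708603200: (100.0, 2)},   # shard A: two points averaging 50
    {1708603200: (300.0, 1)},   # shard B: one point at 300
])
# merged[1708603200] is 400/3 ≈ 133.3, not (50 + 300) / 2 = 175
```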
Alert Evaluation
Alert Manager:
→ Every 60 seconds, for each active alert rule:
→ Execute the query (via Query Service)
→ Compare result to threshold
→ If threshold exceeded for >= duration:
→ Fire alert to configured channels (Slack, PagerDuty, email)
→ Track alert state: OK → PENDING → FIRING → RESOLVED
Components
- Metrics Agent: Lightweight daemon on every host. Pre-aggregates locally.
- Ingestion Service: Receives metrics from agents, writes to Kafka.
- Kafka: Durable buffer between ingestion and storage. Handles burst traffic.
- Storage Writers: Consume from Kafka, write to TSDB.
- TSDB Cluster: Distributed time-series database (custom or ClickHouse/VictoriaMetrics).
- Query Service: Handles metric queries, scatter-gather across TSDB nodes.
- Rollup Service: Periodically aggregates raw data into lower-granularity rollups.
- Alert Manager: Evaluates alert rules, manages alert lifecycle, sends notifications.
- Object Storage (S3): Cold storage for historical data.
6. Deep Dives (15 min)
Deep Dive 1: Time-Series Storage Engine Design
Write-optimized LSM-tree approach:
Time-series data is append-only (no updates, no deletes until TTL). This is perfect for LSM-tree storage:
Write path:
- Incoming data points go to an in-memory buffer (MemTable), sorted by (series_id, timestamp)
- When buffer reaches threshold (e.g., 64MB): flush to disk as an immutable SSTable
- Background compaction merges SSTables
Compression:
- Delta-of-delta encoding for timestamps: sequential timestamps differ by a constant (e.g., 1 second). Store the delta of deltas → mostly 0, compresses to 1-2 bits per timestamp.
- XOR encoding for values: consecutive metric values are often similar. XOR of consecutive values has many leading/trailing zeros → compress to 1-2 bytes per value.
- Overall: 16 bytes per data point (8B timestamp + 8B value) → compressed to ~2 bytes. 8:1 compression ratio.
- This is the Gorilla compression scheme (Facebook’s time-series DB paper).
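The delta-of-delta idea is easy to demonstrate. This sketch computes the second-order deltas that the bit-level Gorilla encoding then packs (the actual bit packing is omitted):

```python
def delta_of_delta(timestamps: list[int]) -> list[int]:
    """Second-order deltas of a timestamp sequence. For a regular
    1-second series these are almost all zero, which is what the
    Gorilla bit encoding exploits (~1-2 bits per timestamp)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# A 1-second series where one sample arrives a second late:
print(delta_of_delta([1000, 1001, 1002, 1004, 1005]))  # [0, 1, -1]
```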
Partitioning:
- Partition by (series_id_range, time_block)
- Time blocks: 2-hour blocks for recent data, 24-hour blocks for older data
- This ensures queries for a specific series + time range hit a minimal number of partitions
Cardinality challenge:
- “Cardinality” = number of unique series (unique combinations of metric_name × tag values)
- If tags include user_id (millions of values), cardinality explodes → millions of series
- Solution: enforce tag cardinality limits. High-cardinality identifiers belong in the metric value or in logs, not in tags.
- Example: BAD: tags: { user_id: "123" } → millions of series. GOOD: report as a histogram/distribution without per-user series.
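An ingestion-time guard for this rule might look like the sketch below; the blocked keys and the tag-count limit are illustrative assumptions, not values from this design:

```python
# Hypothetical ingestion-time cardinality guard.
HIGH_CARDINALITY_KEYS = {"user_id", "request_id", "session_id"}
MAX_TAGS = 20

def validate_tags(tags: dict) -> None:
    """Reject tag sets that would explode series cardinality."""
    bad = HIGH_CARDINALITY_KEYS & tags.keys()
    if bad:
        raise ValueError(f"high-cardinality tag(s) not allowed: {sorted(bad)}")
    if len(tags) > MAX_TAGS:
        raise ValueError(f"too many tags ({len(tags)} > {MAX_TAGS})")

validate_tags({"service": "api-gateway", "region": "us-east-1"})  # OK
```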
Deep Dive 2: Rollup and Retention
Why rollups?
- Querying 1 year of per-second data for one metric: 31.5M data points. Even compressed, that’s slow.
- Rollup to 1-minute granularity: 525K data points. 60x fewer.
- Rollup to 1-hour granularity: 8,760 data points. 3,600x fewer.
Rollup process:
Rollup Service (runs continuously):
→ Read raw data points for a completed time window (e.g., past minute)
→ For each series, compute: avg, min, max, sum, count
→ For percentiles (p50, p95, p99): use t-digest or DDSketch (mergeable sketches)
→ Write rollup results to rollup tables
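The non-percentile part of this step is a straightforward group-by. A sketch with 1-minute buckets (percentile sketches omitted; in practice a serialized t-digest would be stored alongside these fields):

```python
from collections import defaultdict

def rollup_1m(points):
    """Group raw (timestamp, value) points into 1-minute buckets and
    compute the basic mergeable aggregates for each bucket."""
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % 60].append(v)   # floor to minute boundary
    return {
        minute: {"avg": sum(vs) / len(vs), "min": min(vs),
                 "max": max(vs), "sum": sum(vs), "count": len(vs)}
        for minute, vs in buckets.items()
    }
```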
Percentile aggregation challenge:
- You can’t compute exact p99 from pre-aggregated data (averaging p99s is mathematically wrong)
- Solution: t-digest (or DDSketch) — a compact, mergeable data structure that approximates percentiles
- Each 1-minute rollup stores a serialized t-digest (~200 bytes)
- To query p99 over 1 hour: deserialize 60 t-digests, merge them, query p99 from the merged t-digest
- Accuracy: within 1% for p99, which is sufficient for monitoring
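Why averaging p99s fails is easy to show: seed one spiky minute among 60 calm ones, and the average of per-minute p99s drastically underestimates the true p99. The sort-based p99 below is a stand-in for a t-digest in this demo:

```python
import random

def p99(xs):
    """Exact p99 by sorting (stand-in for a mergeable sketch)."""
    xs = sorted(xs)
    return xs[int(0.99 * (len(xs) - 1))]

random.seed(42)
# 60 one-minute windows of latency samples; minute 30 has an incident.
windows = [[random.gauss(50, 5) for _ in range(1000)] for _ in range(60)]
windows[30] = [random.gauss(500, 50) for _ in range(1000)]

true_p99 = p99([x for w in windows for x in w])   # computed over raw data
avg_of_p99s = sum(p99(w) for w in windows) / 60   # "average the rollups"

# The averaged figure smears the incident across calm minutes and
# reports a far smaller p99 than the data actually shows.
print(f"true p99: {true_p99:.0f}, avg of per-minute p99s: {avg_of_p99s:.0f}")
```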
Retention policy:
Raw data (1s granularity): 7 days → hot storage (SSD)
1-minute rollups: 90 days → warm storage (SSD)
1-hour rollups: 1 year → cold storage (HDD)
1-day rollups: 3 years → cold storage (S3)
Downsampling pipeline: Background job runs continuously, processing the oldest un-rolled-up data. After rollup is confirmed, raw data outside retention window is deleted.
Deep Dive 3: Alerting System
Requirements:
- Evaluate thousands of alert rules every 60 seconds
- No missed alerts (high recall)
- Minimal false alerts (high precision)
- Alert deduplication and grouping
Architecture:
Alert Manager:
→ Load all active alert rules (cached, refreshed every 5 min)
→ Every 60 seconds:
→ Partition rules across Alert Evaluator workers
→ Each worker evaluates its rules:
→ Execute query (using Query Service)
→ Apply condition (above/below threshold)
→ Check duration requirement (threshold exceeded for >= N minutes)
→ Update alert state machine:
OK → (threshold exceeded) → PENDING
PENDING → (duration met) → FIRING → send notification
FIRING → (threshold no longer exceeded) → RESOLVED → send resolution
Alert state machine:
OK → PENDING: threshold breached (start counting duration)
PENDING → FIRING: threshold continuously breached for >= duration
PENDING → OK: threshold recovered before duration met (false alarm)
FIRING → RESOLVED: threshold recovered while alert was active
RESOLVED → OK: after cooldown period (prevents flapping)
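The state machine above as a minimal sketch. Evaluations are assumed to run once per 60-second tick, `duration` counts consecutive breached evaluations, and the cooldown is simplified to a single clean tick:

```python
OK, PENDING, FIRING, RESOLVED = "OK", "PENDING", "FIRING", "RESOLVED"

class AlertState:
    """Per-rule alert state machine; notification side effects are
    marked with comments rather than implemented."""

    def __init__(self, duration: int):
        self.duration = duration      # consecutive breaches needed to fire
        self.state = OK
        self.breached_for = 0

    def evaluate(self, breached: bool) -> str:
        if breached:
            self.breached_for += 1
            if self.state in (OK, RESOLVED):
                self.state = PENDING
            if self.state == PENDING and self.breached_for >= self.duration:
                self.state = FIRING          # -> notify configured channels
        else:
            self.breached_for = 0
            if self.state == FIRING:
                self.state = RESOLVED        # -> send resolution notice
            elif self.state == PENDING:
                self.state = OK              # recovered before duration: no alert
            elif self.state == RESOLVED:
                self.state = OK              # cooldown simplified to one clean tick
        return self.state

alert = AlertState(duration=5)
for _ in range(4):
    alert.evaluate(True)                     # stays PENDING: duration not met
assert alert.evaluate(True) == FIRING        # 5th consecutive breach
assert alert.evaluate(False) == RESOLVED
```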
Notification channels:
- Slack, PagerDuty, email, webhook
- Deduplication: don’t re-fire the same alert if it’s already FIRING
- Grouping: if 50 hosts have the same alert firing, group into one notification
- Escalation: if not acknowledged within 15 minutes, escalate to next on-call
High availability:
- Multiple Alert Manager replicas
- Leader election (via Redis or ZooKeeper) — only leader evaluates rules
- If leader dies, another replica becomes leader within seconds
- Alternatively: all replicas evaluate all rules, but deduplicate notifications using a distributed lock
7. Extensions (2 min)
- Distributed tracing integration: Correlate metrics with traces. When an alert fires for high latency, automatically link to example traces from that time window.
- Anomaly detection: Instead of static thresholds, use ML-based anomaly detection. Train on historical patterns, alert on deviations. Handles seasonal patterns (weekend vs weekday) automatically.
- Custom dashboards: Drag-and-drop dashboard builder (like Grafana). Store dashboard definitions as JSON. Support variables (e.g., dropdown to select service/region).
- SLO tracking: Define SLIs (Service Level Indicators) and SLOs (Service Level Objectives). Track error budget burn rate. Alert when burn rate exceeds threshold.
- Log correlation: Link metrics to logs. When CPU spikes, show corresponding log lines from that host/service.
- Multi-tenancy: Support multiple teams/orgs with isolated data and separate quotas. Tenant-aware routing and query isolation.