1. Requirements & Scope (5 min)
Functional Requirements
- Collect spans from every service in the request path and assemble them into end-to-end traces with parent-child relationships
- Propagate trace context (trace ID, span ID, sampling decision) across service boundaries via HTTP headers, gRPC metadata, and message queues
- Support configurable sampling strategies (head-based probabilistic, tail-based on error/latency, always-on for debug)
- Store traces and provide query capabilities: search by trace ID, service name, operation, duration, tags, and time range
- Generate service dependency graphs and latency breakdowns from trace data
Non-Functional Requirements
- Availability: 99.9% for collection pipeline (trace loss during outage is acceptable but not data corruption). Query system: 99.95%.
- Latency: Near-zero overhead on the critical path. Span reporting must be async and non-blocking (< 0.1ms per span in the application process).
- Consistency: Traces can be eventually consistent (30-second delay from span emission to queryability is fine).
- Scale: 1,000 microservices, 500K requests/sec, average 10 spans per trace = 5M spans/sec ingestion.
- Durability: Retain full traces for 7 days, sampled traces for 30 days, aggregated service metrics for 1 year.
2. Estimation (3 min)
Traffic
- Spans ingested: 5M spans/sec (500K traces/sec × 10 spans/trace average)
- With 10% head-based sampling: 500K spans/sec stored (reduces storage 10x while preserving statistical accuracy)
- Tail-based sampling (errors/slow): adds ~50K spans/sec (all error traces and p99 latency traces kept at 100%)
- Total stored: ~550K spans/sec
Storage
- Per span: span_id (16B) + trace_id (16B) + parent_id (16B) + service (32B) + operation (64B) + start_time (8B) + duration (8B) + tags (128B) + logs (256B) = ~544 bytes average
- Raw spans (7 days): 550K/sec × 86,400 × 7 × 544B ≈ 181TB
- With columnar compression (5x): ~36TB
- Sampled traces (30 days, 1% of traffic = 50K spans/sec): ~70TB raw, ~14TB compressed
- Service metrics (1 year): aggregated data, ~100GB
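A quick sanity check of the arithmetic above (a throwaway sketch; the constants are taken directly from these estimates):

```python
# Back-of-envelope storage math for the retention estimates above.
SPANS_PER_SEC = 550_000   # stored span rate after sampling
BYTES_PER_SPAN = 544      # raw, uncompressed span size
COMPRESSION = 5           # conservative columnar compression ratio

def retention_tb(days, spans_per_sec=SPANS_PER_SEC):
    """Raw terabytes needed to retain `days` worth of spans."""
    return spans_per_sec * 86_400 * days * BYTES_PER_SPAN / 1e12

raw_7d = retention_tb(7)
compressed_7d = raw_7d / COMPRESSION
print(f"7-day raw: {raw_7d:.0f} TB, compressed: {compressed_7d:.0f} TB")
# → 7-day raw: 181 TB, compressed: 36 TB
```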
Key Insight
This is a high-throughput write pipeline with relatively infrequent reads. The core challenges are: (1) ingesting 5M spans/sec with near-zero application overhead, (2) tail-based sampling which requires holding spans in memory until the trace completes, and (3) efficiently querying traces by multiple dimensions.
3. API Design (3 min)
Span Reporting (SDK → Agent/Collector)
POST /v1/spans (batch)
Body (protobuf or Thrift on the wire for efficiency; shown here as JSON for readability):
{
  "spans": [
    {
      "trace_id": "abc123def456...",
      "span_id": "span789...",
      "parent_span_id": "span456...", // null for root span
      "operation_name": "GET /api/users",
      "service_name": "user-service",
      "start_time_us": 1708632060000000, // microseconds
      "duration_us": 15200,
      "tags": {
        "http.method": "GET",
        "http.status_code": "200",
        "db.type": "postgresql",
        "error": false
      },
      "logs": [
        { "timestamp_us": 1708632060005000, "fields": { "event": "cache_miss" } }
      ]
    }
    // ... batch of 50-100 spans
  ]
}
Response 202: { "accepted": true }
Trace Context Propagation (HTTP Headers)
// W3C Trace Context standard
traceparent: 00-abc123def456...-span789...-01
tracestate: vendor=value
// B3 format (Zipkin)
X-B3-TraceId: abc123def456...
X-B3-SpanId: span789...
X-B3-ParentSpanId: span456...
X-B3-Sampled: 1
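The W3C traceparent header above has a fixed `00-<trace-id>-<parent-id>-<flags>` layout, so inject/extract is simple string handling. A minimal sketch (a real SDK would use an OpenTelemetry propagator):

```python
import re

# version 00: 32-hex trace ID, 16-hex span ID, 2-hex flags (bit 0 = sampled)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def extract(headers):
    """Return (trace_id, parent_span_id, sampled) or None if absent/malformed."""
    m = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)

def inject(headers, trace_id, span_id, sampled):
    """Write the outgoing traceparent header for a child call."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

h = {}
inject(h, "a" * 32, "b" * 16, sampled=True)
ctx = extract(h)  # ("aaaa…", "bbbb…", True)
```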
Query API
GET /v1/traces/{trace_id}
Response: { "trace_id": "abc123...", "spans": [...], "service_count": 5, "depth": 4,
"duration_us": 234000, "root_service": "api-gateway" }
GET /v1/traces?service=user-service&operation=GET+/api/users
&min_duration=1000ms&max_duration=5000ms&tag=error:true
&start=1708600000&end=1708632000&limit=20
Response: { "traces": [...], "total": 142 }
GET /v1/services/dependencies
Response: { "nodes": ["api-gateway", "user-service", "db"],
"edges": [{ "from": "api-gateway", "to": "user-service", "call_count": 50000, "avg_latency_ms": 15 }] }
Key Decisions
- Batch span reporting (SDK buffers spans locally, flushes every 1 second or 100 spans) to minimize network overhead
- Protobuf encoding (3-5x smaller than JSON for span data)
- 202 Accepted (fire-and-forget) — SDK never blocks on span delivery
- W3C Trace Context as primary propagation format for interoperability
4. Data Model (3 min)
Spans (ClickHouse — columnar, optimized for time-range + filter queries)
Table: spans
trace_id | FixedString(32) -- hex-encoded 128-bit ID
span_id | FixedString(16)
parent_span_id | FixedString(16) -- '' for root span
service_name | LowCardinality(String)
operation_name | LowCardinality(String)
start_time | DateTime64(6) -- microsecond precision
duration_us | UInt64
status_code | UInt16
is_error | UInt8
tags | Map(String, String)
logs | Array(Tuple(DateTime64(6), Map(String, String)))
-- Indices
ORDER BY (service_name, start_time, trace_id)
PARTITION BY toDate(start_time)
-- Secondary index on trace_id for single-trace lookup
Service Dependency Graph (materialized from spans)
Table: service_edges (materialized view, updated every 5 minutes)
source_service | LowCardinality(String)
dest_service | LowCardinality(String)
operation | LowCardinality(String)
time_bucket | DateTime -- 5-minute buckets
call_count | UInt64
error_count | UInt64
p50_latency_us | UInt64
p95_latency_us | UInt64
p99_latency_us | UInt64
Trace Index (Elasticsearch — for flexible tag-based search)
Document per trace (not per span):
{
"trace_id": "abc123...",
"root_service": "api-gateway",
"root_operation": "POST /api/orders",
"services": ["api-gateway", "order-service", "payment-service", "db"],
"duration_us": 234000,
"span_count": 12,
"has_error": true,
"start_time": "2024-02-22T18:01:00Z",
"tags": { "user_id": "u123", "http.status": "500", "error.type": "TimeoutException" }
}
Why ClickHouse + Elasticsearch?
- ClickHouse for span storage: columnar compression (10-15x), excellent time-range query performance, handles 550K inserts/sec easily
- Elasticsearch for trace discovery: flexible full-text and tag-based search across trace summaries
- Trace lookup by ID: query ClickHouse directly (index on trace_id)
- Trace search by tags/service: query Elasticsearch for matching trace IDs, then fetch spans from ClickHouse
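The two-phase search path above can be sketched as follows (the `es_search` and `ch_query` callables are hypothetical stand-ins for real Elasticsearch and ClickHouse clients, injected here to keep the sketch self-contained):

```python
def search_traces(es_search, ch_query, service, tag_filters, limit=20):
    """Phase 1: find matching trace IDs in the Elasticsearch trace index.
    Phase 2: hydrate full span lists from ClickHouse."""
    es_query = {
        "bool": {
            "filter": [{"term": {"services": service}}]
            + [{"term": {f"tags.{k}": v}} for k, v in tag_filters.items()]
        }
    }
    hits = es_search(index="trace-summaries", query=es_query, size=limit)
    trace_ids = [h["trace_id"] for h in hits]
    if not trace_ids:
        return []
    # One batched span fetch; the bloom-filter index on trace_id keeps this cheap.
    spans = ch_query(
        "SELECT * FROM spans WHERE trace_id IN %(ids)s", {"ids": tuple(trace_ids)}
    )
    by_trace = {}
    for s in spans:
        by_trace.setdefault(s["trace_id"], []).append(s)
    return [{"trace_id": t, "spans": by_trace.get(t, [])} for t in trace_ids]
```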
5. High-Level Design (12 min)
Architecture
Application Services (1000 microservices)
[Tracing SDK] → buffer spans in memory (ring buffer, 1000 spans)
→ flush every 1s or when buffer full
→ Local Agent (sidecar, one per host)
→ batches spans from all services on the host
→ applies head-based sampling decision
→ forwards to Collector via gRPC
Collector Fleet (stateless, 100 instances):
→ Receives span batches from Agents
→ Validates and normalizes spans
→ Publishes to Kafka (spans-raw topic, 256 partitions, keyed by trace_id)
Kafka (spans-raw, 256 partitions):
→ Consumer Group 1: Span Writer → ClickHouse (batch inserts)
→ Consumer Group 2: Trace Indexer → build trace summary → Elasticsearch
→ Consumer Group 3: Tail-Based Sampler → holds spans in memory, decides on trace completion
→ Consumer Group 4: Service Graph Builder → compute edges → materialized view
Query Service:
→ API for trace lookup, search, service graph
→ Reads from ClickHouse (spans) + Elasticsearch (trace index)
→ Serves UI (Jaeger UI or Grafana Tempo)
Tail-Based Sampling Service:
→ Holds partial traces in memory (keyed by trace_id)
→ Waits for trace completion (no new spans for 30s)
→ Decision: keep if (has_error OR duration > p99 threshold OR debug flag)
→ If keep: write all spans to Kafka (spans-sampled topic)
→ If drop: discard
Trace Context Propagation Flow
Client → API Gateway (creates root span, generates trace_id)
Header: traceparent: 00-{trace_id}-{span_id_A}-01
API Gateway → User Service (propagates trace context)
Header: traceparent: 00-{trace_id}-{span_id_B}-01
(span_id_B is child of span_id_A)
User Service → PostgreSQL (creates DB span)
(no HTTP header, SDK auto-instruments DB client)
Parent: span_id_B
User Service → Cache Service (propagates via HTTP)
Header: traceparent: 00-{trace_id}-{span_id_C}-01
Each service:
1. Extract trace context from incoming request
2. Create local span (child of incoming span)
3. Inject trace context into outgoing requests
4. Report span to local Agent (async, non-blocking)
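The four steps above, sketched as per-service middleware (the `downstream_call` and `report_async` hooks are illustrative; a real SDK would do this via OpenTelemetry instrumentation):

```python
import secrets
import time

def handle_request(headers, downstream_call, report_async):
    """Steps 1-4: extract context, open a child span, inject context into the
    outgoing call, and hand the finished span to the local agent."""
    # 1. Extract incoming trace context (or start a new trace at the edge).
    parts = headers.get("traceparent", "").split("-")
    trace_id = parts[1] if len(parts) == 4 else secrets.token_hex(16)
    parent_id = parts[2] if len(parts) == 4 else ""
    # 2. Create the local span as a child of the incoming span.
    span_id = secrets.token_hex(8)
    start_us = time.time_ns() // 1_000
    # 3. Inject our own context into the outgoing request.
    out_headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    result = downstream_call(out_headers)
    # 4. Report the finished span asynchronously (fire-and-forget to the agent).
    report_async({
        "trace_id": trace_id, "span_id": span_id, "parent_span_id": parent_id,
        "start_time_us": start_us,
        "duration_us": time.time_ns() // 1_000 - start_us,
    })
    return result
```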
Components
- Tracing SDK: Language-specific library (Java, Go, Python, Node.js). Auto-instruments HTTP clients, DB drivers, message queue producers/consumers. Async span reporting via ring buffer.
- Local Agent: Sidecar daemon per host. Receives spans from all services on the host. Applies sampling. Batches and compresses before forwarding to Collector.
- Collector: Stateless gRPC servers. Validates spans, enriches with service metadata, publishes to Kafka. Auto-scales based on Kafka consumer lag.
- Kafka: Durable buffer. Keyed by trace_id so all spans of a trace go to the same partition (important for tail-based sampling).
- Span Writer: Batch-inserts spans into ClickHouse. Uses async inserts with 10,000 rows per batch for throughput.
- Trace Indexer: Builds one Elasticsearch document per trace (aggregated from all spans). Updated as new spans arrive.
- Tail-Based Sampler: Stateful service that holds partial traces in memory. Memory budget: 10GB per instance; across 20 instances the 30-second buffer window needs ~4GB each (see Deep Dive 1), leaving headroom for partition skew.
- Query Service: Translates user queries into ClickHouse SQL + Elasticsearch queries. Merges results for the UI.
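The Trace Indexer's job (one Elasticsearch document per trace, matching the schema in Section 4) reduces to a small aggregation over a trace's spans. A sketch, assuming the span field names from the API section:

```python
def summarize_trace(spans):
    """Collapse all spans of one trace into a single searchable document."""
    # Root span: empty parent_span_id; fall back to the first span if the
    # root was dropped or hasn't arrived yet.
    root = next((s for s in spans if not s.get("parent_span_id")), spans[0])
    merged_tags = {}
    for s in spans:
        merged_tags.update(s.get("tags", {}))
    return {
        "trace_id": root["trace_id"],
        "root_service": root["service_name"],
        "root_operation": root["operation_name"],
        "services": sorted({s["service_name"] for s in spans}),
        "duration_us": root["duration_us"],
        "span_count": len(spans),
        "has_error": any(s.get("tags", {}).get("error") == "true" for s in spans),
        "tags": merged_tags,
    }
```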
6. Deep Dives (15 min)
Deep Dive 1: Sampling Strategies — Head-Based vs. Tail-Based
Head-based sampling (simple, lossy):
At the root span (entry point), decide: sample this trace? (e.g., 10% probability)
→ Decision propagated to ALL downstream services via trace context header
→ sampled=1: all services report spans for this trace
→ sampled=0: no service reports spans (zero overhead)
Pros:
- Simple to implement (decision at the edge)
- Predictable load (exactly 10% of traces stored)
- No additional infrastructure needed
Cons:
- Might miss the ONE interesting trace (the error, the slow request)
- 10% sampling means 90% of errors go unrecorded
- Statistical accuracy requires high sample rate for rare events
Tail-based sampling (complex, complete for interesting traces):
ALL spans are reported (sampled=1 for everything)
→ Tail Sampler collects all spans per trace
→ After trace is "complete" (no new spans for 30s):
Decision:
- If any span has error=true → KEEP (100% of error traces)
- If root span duration > p99 threshold → KEEP (all slow traces)
- If trace has debug flag → KEEP
- Otherwise → apply probabilistic sampling (keep 5%)
Implementation challenge:
- Must buffer ALL spans in memory until trace completes
- 5M spans/sec × 30s buffer window × 544 bytes = 81.6GB memory needed
- Solution: partition by trace_id across 20 sampler instances
→ 4GB per instance, feasible
Hybrid approach (recommended):
1. Head-based: 10% probabilistic sampling (always active)
2. Tail-based: additionally keep 100% of:
- Error traces (any span with error=true)
- Slow traces (duration > p99 for that service+operation)
- Debug traces (explicit trace_id flagged for debugging)
3. Guaranteed: always keep traces for specific operations
(e.g., payments, user signups — business-critical flows)
Result: ~15% of traces stored, but 100% of interesting ones
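The hybrid keep/drop decision reduces to a short predicate, evaluated once the trace is complete. A sketch, assuming the tag and field names from the API section; `p99_threshold_us` would be looked up per service+operation, and `CRITICAL_OPERATIONS` is an illustrative allow-list:

```python
import random

CRITICAL_OPERATIONS = {"POST /api/payments", "POST /api/signup"}  # illustrative

def keep_trace(spans, p99_threshold_us, base_rate=0.05):
    """Tail-based decision, run once the trace is quiet for 30s."""
    root = next((s for s in spans if not s.get("parent_span_id")), spans[0])
    if any(s.get("tags", {}).get("error") == "true" for s in spans):
        return True                               # keep 100% of error traces
    if root["duration_us"] > p99_threshold_us:
        return True                               # keep all slow traces
    if root.get("tags", {}).get("debug") == "true":
        return True                               # explicitly flagged for debug
    if root.get("operation_name") in CRITICAL_OPERATIONS:
        return True                               # business-critical flows
    return random.random() < base_rate            # probabilistic remainder
```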
Sampling configuration:
sampling:
default_strategy: probabilistic
default_rate: 0.10 # 10% head-based
tail_based:
enabled: true
buffer_duration: 30s
policies:
- type: error
sample_rate: 1.0 # keep all errors
- type: latency
threshold_ms: 5000
sample_rate: 1.0 # keep all > 5s traces
- type: string_attribute
key: "debug"
values: ["true"]
sample_rate: 1.0
per_service_overrides:
payment-service:
default_rate: 1.0 # 100% sampling for payments
Deep Dive 2: Trace Assembly and Storage Optimization
The problem: A single trace’s spans arrive from different services at different times. The root span may arrive after child spans. We need to assemble a complete trace for querying.
Span storage in ClickHouse:
-- Optimized table engine for time-series span data
CREATE TABLE spans (
trace_id FixedString(32),
span_id FixedString(16),
parent_span_id FixedString(16),
service_name LowCardinality(String),
operation_name LowCardinality(String),
start_time DateTime64(6),
duration_us UInt64,
status_code UInt16,
is_error UInt8,
tags Map(String, String),
logs String -- JSON encoded
) ENGINE = MergeTree()
PARTITION BY toDate(start_time)
ORDER BY (service_name, start_time, trace_id)
TTL start_time + INTERVAL 7 DAY DELETE
SETTINGS index_granularity = 8192;
-- Secondary index for trace lookup
ALTER TABLE spans ADD INDEX idx_trace_id trace_id TYPE bloom_filter GRANULARITY 4;
Query patterns and optimization:
- Single trace lookup (by trace_id):
SELECT * FROM spans WHERE trace_id = 'abc123...'
-- Uses bloom filter index on trace_id
-- Returns all spans for trace assembly
-- Client-side: reconstruct tree from parent_span_id relationships
- Search by service + time range:
SELECT trace_id, min(start_time), max(start_time + duration_us),
groupArray(service_name), countIf(is_error = 1)
FROM spans
WHERE service_name = 'payment-service'
AND start_time BETWEEN '2024-02-22 17:00:00' AND '2024-02-22 18:00:00'
GROUP BY trace_id
HAVING countIf(is_error = 1) > 0
ORDER BY min(start_time) DESC
LIMIT 20
- Latency analysis (p50/p95/p99 by operation):
SELECT operation_name,
quantile(0.5)(duration_us) as p50,
quantile(0.95)(duration_us) as p95,
quantile(0.99)(duration_us) as p99,
count() as call_count
FROM spans
WHERE service_name = 'user-service'
AND start_time >= now() - INTERVAL 1 HOUR
GROUP BY operation_name
ORDER BY p99 DESC
Trace assembly on read:
- Spans are stored flat (no pre-assembled traces)
- Query service fetches all spans for a trace_id, builds the tree in memory
- Root span: parent_span_id is empty
- Children: parent_span_id points to parent
- Orphan spans (parent arrived late or was dropped): shown as detached subtrees in the UI
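The read-path assembly above is a single pass over the flat span list; orphans whose parent never arrived surface as extra roots (detached subtrees). A minimal sketch:

```python
def build_trace_tree(spans):
    """Return the list of root spans, each with a populated `children` list.
    Orphans (parent missing or dropped) appear as additional roots."""
    by_id = {s["span_id"]: {**s, "children": []} for s in spans}
    roots = []
    for node in by_id.values():
        parent = by_id.get(node["parent_span_id"])
        if parent is not None:
            parent["children"].append(node)
        else:
            roots.append(node)      # true root ('' parent) or orphan
    for node in by_id.values():     # deterministic rendering order for the UI
        node["children"].sort(key=lambda c: c["start_time_us"])
    return roots
```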
Compression results:
- ClickHouse LZ4 compression on span data: 8-12x compression ratio
- LowCardinality for service_name and operation_name: dictionary encoding, 10-50x for these columns
- Net result: 544 bytes/span raw → ~60 bytes/span stored
Deep Dive 3: Service Dependency Graph and Latency Analysis
The problem: With 1,000 microservices, understanding which service calls which (and how fast) is critical for debugging and capacity planning.
Building the dependency graph:
For each span with a parent:
edge = (parent.service_name → child.service_name, child.operation_name)
Record: call_count, error_count, latency histogram
Materialized view in ClickHouse (updated every 5 minutes):
CREATE MATERIALIZED VIEW service_edges
ENGINE = AggregatingMergeTree()  -- quantileState emits aggregate states, which need an aggregating engine
ORDER BY (source_service, dest_service, operation, time_bucket)
AS SELECT
-- Join span with its parent to get the edge
parent.service_name AS source_service,
child.service_name AS dest_service,
child.operation_name AS operation,
toStartOfFiveMinutes(child.start_time) AS time_bucket,
count() AS call_count,
countIf(child.is_error = 1) AS error_count,
quantileState(0.50)(child.duration_us) AS p50_state,
quantileState(0.95)(child.duration_us) AS p95_state,
quantileState(0.99)(child.duration_us) AS p99_state
FROM spans AS child
INNER JOIN spans AS parent
ON child.parent_span_id = parent.span_id
AND child.trace_id = parent.trace_id
WHERE child.parent_span_id != ''
GROUP BY source_service, dest_service, operation, time_bucket
Critical path analysis:
Given a slow trace, identify the critical path:
1. Build span tree from trace data
2. For each span, compute "self time" = duration - sum(child durations)
3. Critical path = longest path from root to leaf (the chain of spans
that determines total trace duration)
4. Highlight spans on critical path in red in the UI
5. For each critical path span, show self-time vs. child-time breakdown
Example:
API Gateway (total: 500ms)
→ User Service (total: 200ms, self: 25ms)
→ PostgreSQL (total: 170ms, self: 170ms) ← BOTTLENECK
→ Cache Service (total: 5ms, self: 5ms)
Critical path: API Gateway → User Service → PostgreSQL
Root cause: slow database query (170ms)
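Steps 2 and 3 above can be sketched directly over the assembled span tree (assumes each node carries `operation_name`, `duration_us`, and a `children` list; treating the slowest child as the path continuation is a simplification that assumes sequential children):

```python
def self_time_us(node):
    """Step 2: span duration minus time spent in children."""
    return node["duration_us"] - sum(c["duration_us"] for c in node["children"])

def critical_path(node):
    """Step 3: the root-to-leaf chain that dominates total trace duration,
    following the slowest child at each level."""
    path = [node["operation_name"]]
    if node["children"]:
        slowest = max(node["children"], key=lambda c: c["duration_us"])
        path += critical_path(slowest)
    return path
```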
Anomaly detection on service graph:
- Compare current 5-minute latency percentiles against rolling 24-hour baseline
- If p99 latency for service edge (A → B) exceeds 2x the baseline → flag as anomalous
- Propagate anomaly upstream: if B is slow, A and everything upstream is likely affected
- Auto-generate alert: “payment-service → database p99 latency increased 3x (from 50ms to 150ms) starting 18:05 UTC”
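A minimal version of the baseline comparison (the 2x factor and the alert wording follow the rules above; the edge maps are assumed to carry p99 latencies in microseconds, keyed by (source, dest)):

```python
def find_anomalous_edges(current, baseline, factor=2.0):
    """Flag service edges whose current p99 exceeds `factor` x the
    rolling 24-hour baseline p99."""
    alerts = []
    for edge, p99_now in current.items():
        p99_base = baseline.get(edge)
        if p99_base and p99_now > factor * p99_base:
            src, dst = edge
            alerts.append(
                f"{src} -> {dst} p99 latency increased "
                f"{p99_now / p99_base:.1f}x "
                f"(from {p99_base / 1000:.0f}ms to {p99_now / 1000:.0f}ms)"
            )
    return alerts
```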
7. Extensions (2 min)
- Continuous profiling integration: Attach CPU/memory profiles to slow spans automatically. When a span exceeds p99 latency, trigger a 10-second CPU profile on that service instance and link it to the trace.
- Trace-based testing: Record production traces, replay them in staging to verify new deployments handle the same request patterns correctly. Detect regressions before they hit production.
- Cost attribution: Use trace data to attribute infrastructure costs to teams/services. If team A’s service makes 60% of the calls to the shared database, charge 60% of the database cost to team A.
- Real-time streaming analytics: Instead of batch-building the dependency graph, use Flink to compute live service health scores from the span stream. Power a real-time “service weather map” dashboard.
- Distributed context beyond tracing: Propagate feature flags, experiment IDs, and tenant IDs in the trace context. Enable per-request routing, A/B test correlation, and multi-tenant debugging from the same infrastructure.