1. Requirements & Scope (5 min)

Functional Requirements

  1. Collect spans from every service in the request path and assemble them into end-to-end traces with parent-child relationships
  2. Propagate trace context (trace ID, span ID, sampling decision) across service boundaries via HTTP headers, gRPC metadata, and message queues
  3. Support configurable sampling strategies (head-based probabilistic, tail-based on error/latency, always-on for debug)
  4. Store traces and provide query capabilities: search by trace ID, service name, operation, duration, tags, and time range
  5. Generate service dependency graphs and latency breakdowns from trace data

Non-Functional Requirements

  • Availability: 99.9% for collection pipeline (trace loss during outage is acceptable but not data corruption). Query system: 99.95%.
  • Latency: Near-zero overhead on the critical path. Span reporting must be async and non-blocking (< 0.1ms per span in the application process).
  • Consistency: Traces can be eventually consistent (30-second delay from span emission to queryability is fine).
  • Scale: 1,000 microservices, 500K requests/sec, average 10 spans per trace = 5M spans/sec ingestion.
  • Durability: Retain full traces for 7 days, sampled traces for 30 days, aggregated service metrics for 1 year.

2. Estimation (3 min)

Traffic

  • Spans ingested: 5M spans/sec (500K traces/sec × 10 spans/trace average)
  • With 10% head-based sampling: 500K spans/sec stored (reduces storage 10x while preserving statistical accuracy)
  • Tail-based sampling (errors/slow): adds ~50K spans/sec (all error traces and p99 latency traces kept at 100%)
  • Total stored: ~550K spans/sec

Storage

  • Per span: span_id (16B) + trace_id (16B) + parent_id (16B) + service (32B) + operation (64B) + start_time (8B) + duration (8B) + tags (128B) + logs (256B) = ~544 bytes average
  • Raw spans (7 days): 550K/sec × 86,400 × 7 × 544B ≈ 181TB
  • With columnar compression (5x): ~36TB
  • Sampled traces (30 days, 1% sample): ~2TB compressed
  • Service metrics (1 year): aggregated data, ~100GB
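The retention math above can be rechecked in a few lines (the 544-byte span size and the 5x compression ratio are the assumptions stated above):

```python
# Back-of-envelope storage estimate for the span pipeline.
SPANS_PER_SEC = 550_000        # stored after head + tail sampling
BYTES_PER_SPAN = 544           # average, per the field breakdown above
SECONDS_PER_DAY = 86_400
RETENTION_DAYS = 7
COMPRESSION_RATIO = 5          # assumed columnar compression

raw_bytes = SPANS_PER_SEC * SECONDS_PER_DAY * RETENTION_DAYS * BYTES_PER_SPAN
raw_tb = raw_bytes / 1e12
compressed_tb = raw_tb / COMPRESSION_RATIO
print(f"raw: {raw_tb:.0f} TB, compressed: {compressed_tb:.0f} TB")
```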

Key Insight

This is a high-throughput write pipeline with relatively infrequent reads. The core challenges are: (1) ingesting 5M spans/sec with near-zero application overhead, (2) tail-based sampling which requires holding spans in memory until the trace completes, and (3) efficiently querying traces by multiple dimensions.


3. API Design (3 min)

Span Reporting (SDK → Agent/Collector)

POST /v1/spans (batch)
  Body (protobuf or Thrift for efficiency):
  {
    "spans": [
      {
        "trace_id": "abc123def456...",
        "span_id": "span789...",
        "parent_span_id": "span456...",     // null for root span
        "operation_name": "GET /api/users",
        "service_name": "user-service",
        "start_time_us": 1708632060000000,  // microseconds
        "duration_us": 15200,
        "tags": {
          "http.method": "GET",
          "http.status_code": "200",
          "db.type": "postgresql",
          "error": false
        },
        "logs": [
          { "timestamp_us": 1708632060005000, "fields": { "event": "cache_miss" } }
        ]
      },
      // ... batch of 50-100 spans
    ]
  }
  Response 202: { "accepted": true }

Trace Context Propagation (HTTP Headers)

// W3C Trace Context standard
traceparent: 00-abc123def456...-span789...-01
tracestate: vendor=value

// B3 format (Zipkin)
X-B3-TraceId: abc123def456...
X-B3-SpanId: span789...
X-B3-ParentSpanId: span456...
X-B3-Sampled: 1
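A minimal sketch of parsing and building the W3C traceparent header (field layout per the spec: version-traceid-parentid-flags; the helper names are illustrative, not a real SDK API):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,         # 32 hex chars (128-bit trace ID)
        "parent_span_id": parent_id,  # 16 hex chars (64-bit span ID)
        "sampled": int(flags, 16) & 0x01 == 1,  # low bit = sampled flag
    }

def make_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the outgoing traceparent for a child request."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

A downstream service reuses the trace_id, records parent_span_id as its parent, and injects its own span_id into outgoing requests.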

Query API

GET /v1/traces/{trace_id}
  Response: { "trace_id": "abc123...", "spans": [...], "service_count": 5, "depth": 4,
              "duration_us": 234000, "root_service": "api-gateway" }

GET /v1/traces?service=user-service&operation=GET+/api/users
    &min_duration=1000ms&max_duration=5000ms&tag=error:true
    &start=1708600000&end=1708632000&limit=20
  Response: { "traces": [...], "total": 142 }

GET /v1/services/dependencies
  Response: { "nodes": ["api-gateway", "user-service", "db"],
              "edges": [{ "from": "api-gateway", "to": "user-service", "call_count": 50000, "avg_latency_ms": 15 }] }

Key Decisions

  • Batch span reporting (SDK buffers spans locally, flushes every 1 second or 100 spans) to minimize network overhead
  • Protobuf encoding (3-5x smaller than JSON for span data)
  • 202 Accepted (fire-and-forget) — SDK never blocks on span delivery
  • W3C Trace Context as primary propagation format for interoperability
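The batching decision above can be sketched as a bounded buffer that flushes on size or age (illustrative; `send_batch` stands in for the real transport, and a production SDK would hand the flush to a background thread):

```python
import threading
import time
from collections import deque

class SpanBuffer:
    """Buffers finished spans; flushes when max_batch spans accumulate or
    flush_interval seconds elapse. Overflow drops the oldest spans, which
    keeps the app non-blocking (trace loss is acceptable, corruption is not)."""

    def __init__(self, send_batch, max_batch=100, max_buffer=1000, flush_interval=1.0):
        self._send = send_batch
        self._buf = deque(maxlen=max_buffer)  # overflow silently drops oldest
        self._max_batch = max_batch
        self._interval = flush_interval
        self._lock = threading.Lock()
        self._last_flush = time.monotonic()

    def report(self, span: dict):
        with self._lock:
            self._buf.append(span)
            due = (len(self._buf) >= self._max_batch
                   or time.monotonic() - self._last_flush >= self._interval)
            batch = list(self._buf) if due else None
            if due:
                self._buf.clear()
                self._last_flush = time.monotonic()
        if batch:
            self._send(batch)  # real SDK: enqueue to a background sender
```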

4. Data Model (3 min)

Spans (ClickHouse — columnar, optimized for time-range + filter queries)

Table: spans
  trace_id         | FixedString(32)   -- hex-encoded 128-bit ID
  span_id          | FixedString(16)
  parent_span_id   | FixedString(16)   -- '' for root span
  service_name     | LowCardinality(String)
  operation_name   | LowCardinality(String)
  start_time       | DateTime64(6)     -- microsecond precision
  duration_us      | UInt64
  status_code      | UInt16
  is_error         | UInt8
  tags             | Map(String, String)
  logs             | Array(Tuple(DateTime64(6), Map(String, String)))

  -- Indices
  ORDER BY (service_name, start_time, trace_id)
  PARTITION BY toDate(start_time)
  -- Secondary index on trace_id for single-trace lookup

Service Dependency Graph (materialized from spans)

Table: service_edges (materialized view, 5-minute buckets)
  source_service   | LowCardinality(String)
  dest_service     | LowCardinality(String)
  operation        | LowCardinality(String)
  time_bucket      | DateTime
  call_count       | UInt64
  error_count      | UInt64
  p50_latency_us   | UInt64
  p95_latency_us   | UInt64
  p99_latency_us   | UInt64

Trace Summaries (Elasticsearch — one document per trace for search)

Document per trace (not per span):
{
  "trace_id": "abc123...",
  "root_service": "api-gateway",
  "root_operation": "POST /api/orders",
  "services": ["api-gateway", "order-service", "payment-service", "db"],
  "duration_us": 234000,
  "span_count": 12,
  "has_error": true,
  "start_time": "2024-02-22T18:01:00Z",
  "tags": { "user_id": "u123", "http.status": "500", "error.type": "TimeoutException" }
}

Why ClickHouse + Elasticsearch?

  • ClickHouse for span storage: columnar compression (10-15x), excellent time-range query performance, handles 550K inserts/sec easily
  • Elasticsearch for trace discovery: flexible full-text and tag-based search across trace summaries
  • Trace lookup by ID: query ClickHouse directly (index on trace_id)
  • Trace search by tags/service: query Elasticsearch for matching trace IDs, then fetch spans from ClickHouse

5. High-Level Design (12 min)

Architecture

Application Services (1000 microservices)
  [Tracing SDK] → buffer spans in memory (ring buffer, 1000 spans)
    → flush every 1s or when buffer full
      → Local Agent (sidecar, one per host)
        → batches spans from all services on the host
        → applies head-based sampling decision
        → forwards to Collector via gRPC

Collector Fleet (stateless, 100 instances):
  → Receives span batches from Agents
  → Validates and normalizes spans
  → Publishes to Kafka (spans-raw topic, 256 partitions, keyed by trace_id)

Kafka (spans-raw, 256 partitions):
  → Consumer Group 1: Span Writer → ClickHouse (batch inserts)
  → Consumer Group 2: Trace Indexer → build trace summary → Elasticsearch
  → Consumer Group 3: Tail-Based Sampler → holds spans in memory, decides on trace completion
  → Consumer Group 4: Service Graph Builder → compute edges → materialized view

Query Service:
  → API for trace lookup, search, service graph
  → Reads from ClickHouse (spans) + Elasticsearch (trace index)
  → Serves UI (Jaeger UI or Grafana Tempo)

Tail-Based Sampling Service:
  → Holds partial traces in memory (keyed by trace_id)
  → Waits for trace completion (no new spans for 30s)
  → Decision: keep if (has_error OR duration > p99 threshold OR debug flag)
  → If keep: write all spans to Kafka (spans-sampled topic)
  → If drop: discard

Trace Context Propagation Flow

Client → API Gateway (creates root span, generates trace_id)
  Header: traceparent: 00-{trace_id}-{span_id_A}-01

API Gateway → User Service (propagates trace context)
  Header: traceparent: 00-{trace_id}-{span_id_B}-01
  (span_id_B is child of span_id_A)

User Service → PostgreSQL (creates DB span)
  (no HTTP header, SDK auto-instruments DB client)
  Parent: span_id_B

User Service → Cache Service (propagates via HTTP)
  Header: traceparent: 00-{trace_id}-{span_id_C}-01

Each service:
  1. Extract trace context from incoming request
  2. Create local span (child of incoming span)
  3. Inject trace context into outgoing requests
  4. Report span to local Agent (async, non-blocking)

Components

  1. Tracing SDK: Language-specific library (Java, Go, Python, Node.js). Auto-instruments HTTP clients, DB drivers, message queue producers/consumers. Async span reporting via ring buffer.
  2. Local Agent: Sidecar daemon per host. Receives spans from all services on the host. Applies sampling. Batches and compresses before forwarding to Collector.
  3. Collector: Stateless gRPC servers. Validates spans, enriches with service metadata, publishes to Kafka. Auto-scales based on Kafka consumer lag.
  4. Kafka: Durable buffer. Keyed by trace_id so all spans of a trace go to the same partition (important for tail-based sampling).
  5. Span Writer: Batch-inserts spans into ClickHouse. Uses async inserts with 10,000 rows per batch for throughput.
  6. Trace Indexer: Builds one Elasticsearch document per trace (aggregated from all spans). Updated as new spans arrive.
  7. Tail-Based Sampler: Stateful service that holds partial traces in memory until the 30-second completion window elapses. Partitioned by trace_id across ~20 instances, ~4GB each (~82GB fleet-wide for the full buffer window; see Deep Dive 1).
  8. Query Service: Translates user queries into ClickHouse SQL + Elasticsearch queries. Merges results for the UI.

6. Deep Dives (15 min)

Deep Dive 1: Sampling Strategies — Head-Based vs. Tail-Based

Head-based sampling (simple, lossy):

At the root span (entry point), decide: sample this trace? (e.g., 10% probability)
  → Decision propagated to ALL downstream services via trace context header
  → sampled=1: all services report spans for this trace
  → sampled=0: no service reports spans (zero overhead)

Pros:
  - Simple to implement (decision at the edge)
  - Predictable load (exactly 10% of traces stored)
  - No additional infrastructure needed

Cons:
  - Might miss the ONE interesting trace (the error, the slow request)
  - 10% sampling means 90% of errors go unrecorded
  - Statistical accuracy requires high sample rate for rare events

Tail-based sampling (complex, complete for interesting traces):

ALL spans are reported (sampled=1 for everything)
  → Tail Sampler collects all spans per trace
  → After trace is "complete" (no new spans for 30s):
    Decision:
      - If any span has error=true → KEEP (100% of error traces)
      - If root span duration > p99 threshold → KEEP (all slow traces)
      - If trace has debug flag → KEEP
      - Otherwise → apply probabilistic sampling (keep 5%)
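The decision rules above as a sketch (field names assumed from the span schema; the 5% keep rate matches the policy list):

```python
import random

def tail_sampling_decision(spans, p99_threshold_us, keep_rate=0.05, rng=random.random):
    """Decide whether to keep a completed trace. spans: list of span dicts."""
    if any(s.get("is_error") for s in spans):
        return True                                    # keep 100% of error traces
    # Root span: the one with no parent (fall back to first span if orphaned).
    root = next((s for s in spans if not s.get("parent_span_id")), spans[0])
    if root["duration_us"] > p99_threshold_us:
        return True                                    # keep all slow traces
    if any(s.get("tags", {}).get("debug") == "true" for s in spans):
        return True                                    # explicit debug flag
    return rng() < keep_rate                           # probabilistic tail keep
```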

Implementation challenge:
  - Must buffer ALL spans in memory until trace completes
  - 5M spans/sec × 30s buffer window × 544 bytes = 81.6GB memory needed
  - Solution: partition by trace_id across 20 sampler instances
    → 4GB per instance, feasible

Hybrid approach (recommended):

1. Head-based: 10% probabilistic sampling (always active)
2. Tail-based: additionally keep 100% of:
   - Error traces (any span with error=true)
   - Slow traces (duration > p99 for that service+operation)
   - Debug traces (explicit trace_id flagged for debugging)
3. Guaranteed: always keep traces for specific operations
   (e.g., payments, user signups — business-critical flows)

Result: ~15% of traces stored, but 100% of interesting ones

Sampling configuration:

sampling:
  default_strategy: probabilistic
  default_rate: 0.10  # 10% head-based

  tail_based:
    enabled: true
    buffer_duration: 30s
    policies:
      - type: error
        sample_rate: 1.0  # keep all errors
      - type: latency
        threshold_ms: 5000
        sample_rate: 1.0  # keep all > 5s traces
      - type: string_attribute
        key: "debug"
        values: ["true"]
        sample_rate: 1.0

  per_service_overrides:
    payment-service:
      default_rate: 1.0  # 100% sampling for payments

Deep Dive 2: Trace Assembly and Storage Optimization

The problem: A single trace’s spans arrive from different services at different times. The root span may arrive after child spans. We need to assemble a complete trace for querying.

Span storage in ClickHouse:

-- Optimized table engine for time-series span data
CREATE TABLE spans (
    trace_id      FixedString(32),
    span_id       FixedString(16),
    parent_span_id FixedString(16),
    service_name  LowCardinality(String),
    operation_name LowCardinality(String),
    start_time    DateTime64(6),
    duration_us   UInt64,
    status_code   UInt16,
    is_error      UInt8,
    tags          Map(String, String),
    logs          String  -- JSON encoded
) ENGINE = MergeTree()
PARTITION BY toDate(start_time)
ORDER BY (service_name, start_time, trace_id)
TTL start_time + INTERVAL 7 DAY DELETE
SETTINGS index_granularity = 8192;

-- Secondary index for trace lookup
ALTER TABLE spans ADD INDEX idx_trace_id trace_id TYPE bloom_filter GRANULARITY 4;

Query patterns and optimization:

  1. Single trace lookup (by trace_id):
SELECT * FROM spans WHERE trace_id = 'abc123...'
-- Uses bloom filter index on trace_id
-- Returns all spans for trace assembly
-- Client-side: reconstruct tree from parent_span_id relationships
  2. Search by service + time range:
SELECT trace_id, min(start_time), max(start_time + duration_us),
       groupArray(service_name), countIf(is_error = 1)
FROM spans
WHERE service_name = 'payment-service'
  AND start_time BETWEEN '2024-02-22 17:00:00' AND '2024-02-22 18:00:00'
GROUP BY trace_id
HAVING countIf(is_error = 1) > 0
ORDER BY min(start_time) DESC
LIMIT 20
  3. Latency analysis (p50/p95/p99 by operation):
SELECT operation_name,
       quantile(0.5)(duration_us) as p50,
       quantile(0.95)(duration_us) as p95,
       quantile(0.99)(duration_us) as p99,
       count() as call_count
FROM spans
WHERE service_name = 'user-service'
  AND start_time >= now() - INTERVAL 1 HOUR
GROUP BY operation_name
ORDER BY p99 DESC

Trace assembly on read:

  • Spans are stored flat (no pre-assembled traces)
  • Query service fetches all spans for a trace_id, builds the tree in memory
  • Root span: parent_span_id is empty
  • Children: parent_span_id points to parent
  • Orphan spans (parent arrived late or was dropped): shown as detached subtrees in the UI
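The assembly logic above, including the orphan handling, as a sketch (spans are dicts shaped like the ClickHouse rows):

```python
from collections import defaultdict

def assemble_trace(spans):
    """Group flat spans into a forest: the real root plus any detached
    subtrees whose parent span was dropped or arrived late."""
    by_id = {s["span_id"]: s for s in spans}
    children = defaultdict(list)
    roots = []
    for s in spans:
        parent = s.get("parent_span_id") or ""
        if parent and parent in by_id:
            children[parent].append(s)
        else:
            roots.append(s)  # true root (parent == '') or orphan
    for kids in children.values():
        kids.sort(key=lambda s: s["start_time_us"])  # chronological siblings
    return roots, children
```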

Compression results:

  • ClickHouse LZ4 compression on span data: 8-12x compression ratio
  • LowCardinality for service_name and operation_name: dictionary encoding, 10-50x for these columns
  • Net result: 544 bytes/span raw → ~60 bytes/span stored

Deep Dive 3: Service Dependency Graph and Latency Analysis

The problem: With 1,000 microservices, understanding which service calls which (and how fast) is critical for debugging and capacity planning.

Building the dependency graph:

For each span with a parent:
  edge = (parent.service_name → child.service_name, child.operation_name)
  Record: call_count, error_count, latency histogram

Materialized view in ClickHouse (refreshed every 5 minutes — in practice as a scheduled batch query, since ClickHouse materialized views only trigger on inserts to the leftmost table and cannot maintain a self-join incrementally):
CREATE MATERIALIZED VIEW service_edges
ENGINE = AggregatingMergeTree()  -- quantileState columns require aggregate-state parts
ORDER BY (source_service, dest_service, operation, time_bucket)
AS SELECT
    -- Join span with its parent to get the edge
    parent.service_name AS source_service,
    child.service_name AS dest_service,
    child.operation_name AS operation,
    toStartOfFiveMinutes(child.start_time) AS time_bucket,
    count() AS call_count,
    countIf(child.is_error = 1) AS error_count,
    quantileState(0.50)(child.duration_us) AS p50_state,
    quantileState(0.95)(child.duration_us) AS p95_state,
    quantileState(0.99)(child.duration_us) AS p99_state
FROM spans AS child
INNER JOIN spans AS parent
    ON child.parent_span_id = parent.span_id
    AND child.trace_id = parent.trace_id
WHERE child.parent_span_id != ''
GROUP BY source_service, dest_service, operation, time_bucket

Critical path analysis:

Given a slow trace, identify the critical path:
1. Build span tree from trace data
2. For each span, compute "self time" = duration - sum(child durations)
3. Critical path = longest path from root to leaf (the chain of spans
   that determines total trace duration)
4. Highlight spans on critical path in red in the UI
5. For each critical path span, show self-time vs. child-time breakdown

Example:
  API Gateway (total: 500ms)
    → User Service (total: 200ms, self: 30ms)
      → PostgreSQL (total: 170ms, self: 170ms)  ← BOTTLENECK
    → Cache Service (total: 5ms, self: 5ms)
  Critical path: API Gateway → User Service → PostgreSQL
  Root cause: slow database query (170ms)
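Steps 2–3 above can be sketched like this (assumes the tree is already assembled into a parent→children map; following the slowest child at each level is a simplification that ignores overlapping parallel children):

```python
def self_time(span, children):
    """Self time = own duration minus time spent in direct children."""
    kids = children.get(span["span_id"], [])
    return span["duration_us"] - sum(c["duration_us"] for c in kids)

def critical_path(root, children):
    """Follow the slowest child at each level: the chain of spans that
    bounds total trace duration."""
    path = [root]
    span = root
    while children.get(span["span_id"]):
        span = max(children[span["span_id"]], key=lambda c: c["duration_us"])
        path.append(span)
    return path
```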

Anomaly detection on service graph:

  • Compare current 5-minute latency percentiles against rolling 24-hour baseline
  • If p99 latency for service edge (A → B) exceeds 2x the baseline → flag as anomalous
  • Propagate anomaly upstream: if B is slow, A and everything upstream is likely affected
  • Auto-generate alert: “payment-service → database p99 latency increased 3x (from 50ms to 150ms) starting 18:05 UTC”
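The baseline comparison in the first two bullets, sketched (the 2x multiplier and the mean-of-prior-buckets baseline are the assumptions stated above):

```python
def is_anomalous(current_p99_us, baseline_p99s_us, multiplier=2.0):
    """Flag a service edge when the latest 5-minute p99 exceeds
    multiplier x the rolling 24h baseline (mean of prior buckets)."""
    if not baseline_p99s_us:
        return False  # no history yet; don't alert on a cold start
    baseline = sum(baseline_p99s_us) / len(baseline_p99s_us)
    return current_p99_us > multiplier * baseline
```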

7. Extensions (2 min)

  • Continuous profiling integration: Attach CPU/memory profiles to slow spans automatically. When a span exceeds p99 latency, trigger a 10-second CPU profile on that service instance and link it to the trace.
  • Trace-based testing: Record production traces, replay them in staging to verify new deployments handle the same request patterns correctly. Detect regressions before they hit production.
  • Cost attribution: Use trace data to attribute infrastructure costs to teams/services. If team A’s service makes 60% of the calls to the shared database, charge 60% of the database cost to team A.
  • Real-time streaming analytics: Instead of batch-building the dependency graph, use Flink to compute live service health scores from the span stream. Power a real-time “service weather map” dashboard.
  • Distributed context beyond tracing: Propagate feature flags, experiment IDs, and tenant IDs in the trace context. Enable per-request routing, A/B test correlation, and multi-tenant debugging from the same infrastructure.