1. Requirements & Scope (5 min)

Functional Requirements

  1. Log Ingestion — Collect structured and unstructured logs from 50K+ servers, containers, and serverless functions. Support push-based (agents forward logs) and pull-based (system scrapes endpoints) collection models.
  2. Full-Text Search & Field-Based Filtering — Users can search logs by arbitrary keywords (full-text), filter by structured fields (service name, log level, host, correlation ID), and scope by time range. Queries return results within seconds even over terabytes of data.
  3. Live Tail (Real-Time Streaming) — Provide a tail -f equivalent where engineers can subscribe to a live stream of logs from a specific service, host, or filter expression. Latency from log emission to live tail display should be under 5 seconds.
  4. Alerting & Pattern Detection — Define alert rules on log patterns (e.g., “more than 100 ERROR logs from payment-service in 5 minutes”). Support anomaly detection on log volume and error rates. Route alerts to PagerDuty, Slack, email.
  5. Retention & Lifecycle Management — Configure per-tenant or per-service retention policies (e.g., 7 days hot, 30 days warm, 1 year cold, 7 years frozen for compliance). Automatic tiered storage migration and deletion.

Non-Functional Requirements

  • Availability: 99.99% uptime for ingestion (logs must never be dropped). Search can tolerate brief degradation (99.9%).
  • Latency: Ingestion-to-searchable latency < 10 seconds (p99). Search queries return in < 5 seconds for recent data (last 24h), < 30 seconds for historical data.
  • Consistency: Eventual consistency is acceptable. A log written at time T may not appear in search results for up to 10 seconds. No strict ordering guarantees across services.
  • Scale: 50,000 servers, 2 million log lines/second sustained ingestion, 10M lines/sec burst. 1 PB of searchable hot/warm data. 10 PB in cold/frozen archival.
  • Durability: Zero log loss once acknowledged by the ingestion pipeline. Replicated storage with 99.999999999% (11 nines) durability for archived data.

2. Estimation (3 min)

Write Path (Ingestion)

Metric                                 Value
Servers                                50,000
Avg log lines per server per second    40
Sustained ingestion rate               2M lines/sec
Burst ingestion rate (10x spikes)      10M lines/sec
Avg log line size (JSON structured)    500 bytes
Sustained throughput                   1 GB/sec = 3.6 TB/hour = ~86 TB/day
Compression ratio (zstd)               ~10:1
Compressed daily storage               ~8.6 TB/day

Read Path (Search)

Metric                                 Value
Concurrent search users                ~500
Search QPS                             ~200 queries/sec
Live tail subscriptions                ~1,000 concurrent streams
Avg index data scanned per query       1–10 GB

Storage Tiers

Tier                                    Retention   Raw Data    Compressed
Hot (SSD, fully indexed)                3 days      258 TB      ~26 TB
Warm (HDD, fully indexed)               30 days     2.58 PB     ~258 TB
Cold (object storage, partial index)    1 year      ~31 PB      ~3.1 PB
Frozen (object storage, metadata only)  7 years     archived    ~20 PB

Kafka Sizing

  • 2M messages/sec × 500 bytes = 1 GB/sec.
  • With replication factor 3, Kafka needs ~3 GB/sec disk write throughput.
  • 24-hour retention in Kafka (buffer for downstream failures): 86 TB × 3 replicas = 258 TB Kafka storage.
  • ~50 Kafka brokers (each handling ~60 MB/sec write throughput).
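The sizing above is simple arithmetic; a short script makes it checkable. All inputs are the assumptions stated in this section, and the Kafka storage figure comes out at ~259 TB (the text rounds 86 TB/day down to 258 TB total):

```python
# Back-of-envelope check of the ingestion and Kafka sizing numbers above.
SERVERS = 50_000
LINES_PER_SERVER_PER_SEC = 40
LINE_SIZE_BYTES = 500
REPLICATION_FACTOR = 3
BROKERS = 50

sustained_lines = SERVERS * LINES_PER_SERVER_PER_SEC          # 2M lines/sec
sustained_bytes = sustained_lines * LINE_SIZE_BYTES           # 1 GB/sec
daily_tb = sustained_bytes * 86_400 / 1e12                    # ~86 TB/day raw
kafka_write = sustained_bytes * REPLICATION_FACTOR            # ~3 GB/sec to disk
kafka_storage_tb = daily_tb * REPLICATION_FACTOR              # ~259 TB for 24h buffer
per_broker_mb = kafka_write / BROKERS / 1e6                   # ~60 MB/sec per broker

print(f"{sustained_lines:,} lines/sec = {sustained_bytes/1e9:.1f} GB/sec")
print(f"{daily_tb:.1f} TB/day raw, {kafka_storage_tb:.0f} TB Kafka storage")
print(f"{per_broker_mb:.0f} MB/sec write throughput per broker")
```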

3. API Design (3 min)

# --- Ingestion APIs ---

# Batch log ingestion (used by agents)
POST /v1/logs/ingest
Headers: X-Tenant-ID, X-API-Key, Content-Encoding: zstd
Body: {
  "logs": [
    {
      "timestamp": "2025-07-15T10:23:45.123Z",
      "level": "ERROR",
      "service": "payment-service",
      "host": "prod-web-042",
      "trace_id": "abc123def456",
      "span_id": "span-789",
      "message": "Failed to process payment",
      "metadata": { "user_id": "u-123", "order_id": "ord-456", "error_code": "TIMEOUT" }
    }
  ]
}
Response: 202 Accepted { "ingested": 150, "failed": 0 }

# --- Search APIs ---

# Full-text and field-based search
POST /v1/logs/search
Body: {
  "query": "Failed to process payment",
  "filters": {
    "service": "payment-service",
    "level": ["ERROR", "WARN"],
    "time_range": { "from": "2025-07-15T10:00:00Z", "to": "2025-07-15T11:00:00Z" }
  },
  "sort": "timestamp:desc",
  "limit": 100,
  "cursor": "eyJsYXN0X3RzIjoxNjg..."
}
Response: { "hits": [...], "total": 4521, "next_cursor": "..." }

# Live tail — WebSocket
WS /v1/logs/tail?service=payment-service&level=ERROR
→ Server pushes matching log lines in real time

# Aggregation query (for dashboards)
POST /v1/logs/aggregate
Body: {
  "filters": { "service": "payment-service", "time_range": { "last": "1h" } },
  "group_by": ["level"],
  "interval": "1m",
  "metric": "count"
}

# --- Alert APIs ---

POST /v1/alerts/rules
Body: {
  "name": "Payment errors spike",
  "query": "level:ERROR AND service:payment-service",
  "condition": { "threshold": 100, "window": "5m", "operator": ">" },
  "actions": [{ "type": "pagerduty", "severity": "critical" }]
}

Key Decisions

  • Batch ingestion over single-line writes — amortizes network overhead, enables compression. Agents batch locally (every 1–5 seconds or 1000 lines, whichever comes first).
  • Cursor-based pagination instead of offset-based — handles the append-heavy, time-sorted nature of log data without expensive deep-page queries.
  • WebSocket for live tail — HTTP long-polling would waste connections. WebSocket allows server-push with low latency. Each subscription is a filtered Kafka consumer under the hood.
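The first key decision — flush every 1–5 seconds or 1000 lines, whichever comes first — can be sketched as a small agent-side buffer. Names (`LogBatcher`, `flush_fn`) are illustrative, not from any real agent's API:

```python
import time

class LogBatcher:
    """Flush when the batch reaches max_lines OR max_age_s elapses,
    whichever comes first -- the agent-side policy described above."""

    def __init__(self, flush_fn, max_lines=1000, max_age_s=5.0, clock=time.monotonic):
        self.flush_fn = flush_fn      # e.g. compress + POST /v1/logs/ingest
        self.max_lines = max_lines
        self.max_age_s = max_age_s
        self.clock = clock
        self.buffer = []
        self.first_ts = None

    def add(self, line):
        if self.first_ts is None:
            self.first_ts = self.clock()
        self.buffer.append(line)
        if len(self.buffer) >= self.max_lines:
            self.flush()

    def tick(self):
        # Called periodically by the agent's event loop.
        if self.buffer and self.clock() - self.first_ts >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
            self.first_ts = None
```

In a real agent the file offset would only advance after `flush_fn` is acknowledged, matching the durability discussion in the deep dives.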

4. Data Model (3 min)

Log Document (Elasticsearch / ClickHouse)

Field      Type                                       Indexed              Notes
id         UUID                                       Primary key          Generated at ingestion, used for dedup
timestamp  DateTime64(ms)                             Yes (sort key)       Nanosecond source precision; indexed at ms granularity
tenant_id  String                                     Yes (partition key)  Tenant isolation
service    String (keyword)                           Yes                  Exact match, not analyzed
host       String (keyword)                           Yes                  Server hostname
level      Enum (TRACE/DEBUG/INFO/WARN/ERROR/FATAL)   Yes                  Stored as uint8 internally
trace_id   String (keyword)                           Yes                  For distributed tracing correlation
span_id    String (keyword)                           Yes                  Links to specific trace span
message    Text (analyzed)                            Yes (full-text)      Inverted index for search
metadata   JSON / Map(String, String)                 Selective            Dynamic fields, selectively indexed
_raw       String                                     No                   Original log line, stored but not indexed

Index Partitioning Strategy

Index naming: logs-{tenant_id}-{YYYY.MM.DD}-{shard_number}

Example: logs-payment-team-2025.07.15-003

Daily indices → easy to delete/archive entire days
Tenant prefix → physical isolation for noisy neighbors
Shard number → distribute within a day (target 30-50 GB per shard)

Why Elasticsearch + ClickHouse Hybrid?

Engine         Use Case                                        Strength
Elasticsearch  Full-text search, live tail, ad-hoc queries     Inverted index excels at arbitrary text search; near-real-time indexing (refresh every 1s)
ClickHouse     Aggregations, dashboards, analytics, alerting   Columnar storage gives 10-100x faster GROUP BY, COUNT, and time-series aggregations; excellent compression
S3 / GCS       Cold & frozen storage                           11 nines durability; ~$0.023/GB/month vs ~$0.10/GB for SSD; Parquet format for occasional queries

Both engines are populated from the same Kafka topics. Elasticsearch handles interactive search. ClickHouse handles dashboard queries and alert rule evaluation. Cold data is written to object storage in compressed Parquet with only metadata indexed in a lightweight catalog (e.g., Hive Metastore or Iceberg).


5. High-Level Design (12 min)

System Architecture

┌──────────────────────────────────────────────────────────────────────────────┐
│                              LOG PRODUCERS                                   │
│  [App Servers] [Containers] [Lambda/Serverless] [Network Devices] [CDN]     │
│       │              │              │                  │             │        │
│       ▼              ▼              ▼                  ▼             ▼        │
│  ┌─────────┐   ┌─────────┐   ┌──────────┐     ┌─────────┐   ┌─────────┐   │
│  │Filebeat │   │Fluentd  │   │  SDK /   │     │ Syslog  │   │Filebeat │   │
│ Agent   │   │ Agent   │   │ Direct   │     │Forwarder│   │ Agent   │   │
│  └────┬────┘   └────┬────┘   └────┬─────┘     └────┬────┘   └────┬────┘   │
│       │              │             │                 │             │         │
└───────┼──────────────┼─────────────┼─────────────────┼─────────────┼────────┘
        │              │             │                 │             │
        ▼              ▼             ▼                 ▼             ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                        INGESTION GATEWAY (L7 Load Balancer)                  │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────┐              │
│  │  Gateway Fleet (stateless, auto-scaling)                   │              │
│  │  - Authentication (API key validation)                     │              │
│  │  - Tenant rate limiting (token bucket per tenant)          │              │
│  │  - Schema validation & enrichment (add receive_ts, geo)    │              │
│  │  - Decompression (zstd) → re-serialization (Protobuf)     │              │
│  │  - Backpressure signaling (HTTP 429 / TCP flow control)    │              │
│  └────────────────────┬───────────────────────────────────────┘              │
│                       │                                                      │
└───────────────────────┼──────────────────────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                            KAFKA CLUSTER                                     │
│                                                                              │
│  Topic: logs-raw (partitioned by hash(tenant_id + service))                 │
│  ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐                  │
│  │ P0   │ P1   │ P2   │ P3   │ ...  │ P247 │ P248 │ P249 │  (250 partitions)│
│  └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘                  │
│  Replication factor: 3  |  Retention: 24h  |  Compression: zstd             │
│                                                                              │
│  Topic: logs-enriched (after processing pipeline)                           │
│  Topic: logs-alerts (matched alert patterns, routed to alerting)            │
└────────┬────────────────────┬──────────────────────┬─────────────────────────┘
         │                    │                      │
         ▼                    ▼                      ▼
┌─────────────────┐  ┌─────────────────┐   ┌─────────────────────┐
│ STREAM PROCESSOR │  │ STREAM PROCESSOR│   │ STREAM PROCESSOR    │
│ (ES Indexer)     │  │ (CH Loader)     │   │ (Alert Evaluator)   │
│                  │  │                 │   │                     │
│ - Parse & enrich │  │ - Batch inserts │   │ - Sliding window    │
│ - Bulk index to  │  │   (64K rows per │   │   aggregations      │
│   Elasticsearch  │  │    batch)       │   │ - Pattern matching  │
│ - Manage index   │  │ - Write to CH   │   │ - Anomaly detection │
│   rotation       │  │   MergeTree     │   │ - Publish to alerts │
│ - Handle retries │  │   tables        │   │   topic             │
└────────┬────────┘  └────────┬────────┘   └──────────┬──────────┘
         │                    │                       │
         ▼                    ▼                       ▼
┌─────────────────┐  ┌─────────────────┐   ┌─────────────────────┐
│  ELASTICSEARCH   │  │  CLICKHOUSE     │   │  ALERT MANAGER      │
│  CLUSTER         │  │  CLUSTER        │   │                     │
│                  │  │                 │   │ - Dedup alerts      │
│ Hot: 30 nodes    │  │ 20 nodes        │   │ - Route: PagerDuty, │
│   (NVMe SSD)    │  │ (SSD + HDD)     │   │   Slack, email      │
│ Warm: 50 nodes   │  │                 │   │ - Silence/snooze    │
│   (HDD)         │  │ ReplicatedMerge │   │ - Escalation chains │
│                  │  │ Tree engine     │   │                     │
│ Daily index      │  │ TTL-based       │   └─────────────────────┘
│ rotation + ILM   │  │ retention       │
└────────┬────────┘  └─────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                        TIERED STORAGE & LIFECYCLE                            │
│                                                                              │
│  HOT (0-3 days)  →  WARM (3-30 days)  →  COLD (30d-1yr)  →  FROZEN (1-7yr)│
│  NVMe SSD            HDD                  S3/GCS (Parquet)   S3 Glacier     │
│  Full index          Full index           Partial index       Metadata only  │
│  3 replicas          1 replica            Erasure coded       Erasure coded  │
│                                                                              │
│  ILM Policy: auto-migrate based on index age                                │
│  Compliance: immutable writes, audit trail, encryption at rest              │
└──────────────────────────────────────────────────────────────────────────────┘

Component Responsibilities

  1. Log Collection Agents (Filebeat / Fluentd) — Deployed as sidecars or DaemonSets. Read from files, stdout, or syslog. Buffer locally (disk-backed queue) to survive brief network outages. Compress and batch before sending to gateway. Support structured (JSON) and unstructured (plaintext with grok parsing) formats.

  2. Ingestion Gateway — Stateless HTTP/gRPC endpoints behind an L7 load balancer. Handles auth, rate limiting, validation, and enrichment. Crucially, applies backpressure: if Kafka producers see high latency, the gateway returns HTTP 429 to agents, which back off and buffer locally. This prevents cascade failures.

  3. Kafka — The central durability buffer. Decouples ingestion from processing. The 24-hour retention means if Elasticsearch is down for maintenance, no logs are lost — consumers replay from offset. Partitioning by hash(tenant_id + service) ensures logs from the same service land on the same partition, enabling ordered processing within a service.

  4. Stream Processors — Three independent consumer groups:

    • ES Indexer: Consumes from logs-raw, enriches (GeoIP, hostname resolution), and bulk-indexes into Elasticsearch. Uses batches of 5,000 documents or 5-second windows.
    • ClickHouse Loader: Consumes the same topic, writes batches of 64K rows into ClickHouse MergeTree tables for analytical queries.
    • Alert Evaluator: Runs sliding window aggregations (tumbling 1-minute windows with Flink/Kafka Streams). Matches alert rules and publishes to logs-alerts topic.
  5. Elasticsearch Cluster — Powers full-text search, live tail, and ad-hoc exploration. Daily indices with ILM (Index Lifecycle Management) policies. Hot nodes use NVMe SSDs for fast indexing and search. Warm nodes use HDDs for cost efficiency on older data.

  6. ClickHouse Cluster — Powers dashboards, aggregation queries, and alert rule evaluation. The columnar MergeTree engine compresses log data 15-20x and runs analytical queries 10-100x faster than Elasticsearch for GROUP BY and COUNT operations.

  7. Alert Manager — Consumes from logs-alerts, deduplicates (same alert within a suppression window), and routes to notification channels. Supports escalation chains and silence windows.

  8. Tiered Storage Manager — Background service that migrates data between tiers based on ILM policies. Hot → Warm is an Elasticsearch node migration. Warm → Cold exports to Parquet in object storage. Cold → Frozen moves to archival storage with only metadata retained for discovery.
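The Alert Evaluator's windowed threshold check (e.g., "more than 100 ERROR logs in 5 minutes", evaluated over 1-minute buckets) reduces to a small piece of state per rule and key. This is a simplified sketch; class and method names are illustrative, and a real Flink/Kafka Streams job would checkpoint this state:

```python
from collections import defaultdict, deque

class ThresholdRule:
    """Count matching logs per key (e.g. tenant+service) over a sliding
    window of fixed-size buckets -- a simplified Alert Evaluator rule."""

    def __init__(self, threshold=100, window_s=300, bucket_s=60):
        self.threshold = threshold
        self.bucket_s = bucket_s
        self.n_buckets = window_s // bucket_s
        self.buckets = defaultdict(deque)   # key -> deque[(bucket_id, count)]

    def observe(self, key, ts):
        """Record one matching log at unix time ts; return True if the
        rule fires (window count exceeds the threshold)."""
        bucket_id = int(ts) // self.bucket_s
        dq = self.buckets[key]
        if dq and dq[-1][0] == bucket_id:
            dq[-1] = (bucket_id, dq[-1][1] + 1)
        else:
            dq.append((bucket_id, 1))
        # Evict buckets that fell out of the window.
        while dq and dq[0][0] <= bucket_id - self.n_buckets:
            dq.popleft()
        return sum(c for _, c in dq) > self.threshold
```

A firing rule would publish an event to the logs-alerts topic for the Alert Manager to dedup and route.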


6. Deep Dives (15 min)

Deep Dive 1: Handling 2M+ Lines/Sec Ingestion with Backpressure

The hardest engineering challenge is reliable ingestion at 2M lines/sec sustained with 10M lines/sec bursts, without losing a single log and without cascading failures.

The Backpressure Chain:

Agent ←── Gateway ←── Kafka ←── Stream Processor ←── Elasticsearch

Backpressure flows RIGHT to LEFT. Each component signals the upstream
when it cannot keep up.

Layer 1: Agent-side buffering. Filebeat and Fluentd maintain a disk-backed queue (configurable, typically 1-5 GB). When the gateway returns 429 or the connection fails, agents stop reading from source files and buffer on disk. The file offset (registry in Filebeat) is only advanced after acknowledgment. This is the last line of defense — as long as disk space holds, no logs are lost.

Layer 2: Gateway rate limiting. Each tenant gets a token bucket rate limiter (e.g., payment-team: 100K lines/sec). This prevents a single runaway service from saturating the pipeline. The gateway maintains a Kafka producer pool. If Kafka producer buffers exceed 80% capacity, the gateway starts rejecting with 429. Key design: the gateway is stateless — rate limit state is stored in a shared Redis cluster, with local caching (leaky bucket local + periodic sync) to avoid Redis becoming a bottleneck.
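A minimal token-bucket sketch of the per-tenant limiter described above. The Redis sync and local caching are elided, and the injectable clock exists only to make the sketch testable:

```python
import time

class TokenBucket:
    """Per-tenant token bucket: refill at `rate` tokens/sec up to a burst
    capacity. The gateway would keep one per tenant, periodically synced
    to the shared Redis cluster (elided here)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self, n=1):
        """Try to take n tokens (n = lines in the batch). False => HTTP 429."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

On a `False` the gateway returns 429, and the agent's disk-backed queue absorbs the rejected batch.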

Layer 3: Kafka as the shock absorber. Kafka is sized to handle 10x burst (10M lines/sec) for up to 30 minutes. With 250 partitions and 50 brokers, each broker handles ~200K messages/sec, well within Kafka’s capability (~800K messages/sec per broker on commodity hardware). The 24-hour retention means even a multi-hour downstream outage loses no data.

Layer 4: Consumer-side adaptive batching. The ES Indexer consumer group dynamically adjusts batch size based on Elasticsearch indexing latency. When ES responds slowly (p99 > 500ms), batch size grows from 5K to 50K documents and the flush interval extends from 5s to 30s. This trades latency for throughput, preventing consumer lag from growing without bound. If lag exceeds a threshold (e.g., 10 minutes), an autoscaler adds more consumer instances.
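One step of that adaptive policy might look like the function below. Only the 5K–50K and 5s–30s bounds and the 500 ms trigger come from the text; the doubling/halving factors are an assumption:

```python
def next_batch_params(p99_latency_ms, batch_size, flush_interval_s,
                      min_batch=5_000, max_batch=50_000,
                      min_flush=5.0, max_flush=30.0):
    """When Elasticsearch is slow (p99 > 500 ms), grow the batch and
    stretch the flush interval to trade freshness for throughput;
    shrink again once it recovers."""
    if p99_latency_ms > 500:
        batch_size = min(max_batch, batch_size * 2)
        flush_interval_s = min(max_flush, flush_interval_s * 2)
    else:
        batch_size = max(min_batch, batch_size // 2)
        flush_interval_s = max(min_flush, flush_interval_s / 2)
    return batch_size, flush_interval_s
```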

Burst handling math:

  • Burst: 10M lines/sec for 10 minutes = 6 billion logs = 3 TB raw data.
  • Kafka absorbs all of it (3 TB is well within the 258 TB buffered storage).
  • ES Indexer processes at ~3M lines/sec (with 30 consumer instances).
  • Drain time for the burst backlog: 6B / 3M = 2,000 seconds ≈ 33 minutes after burst ends.
  • During the drain, ingestion-to-searchable lag for the newest logs degrades from the usual ~10 seconds to as much as ~40 minutes; search over already-indexed data is unaffected. This is an accepted trade-off during 10x bursts — the < 10 s SLA applies to sustained load.

Failure mode: Elasticsearch cluster down for 2 hours.

  • Kafka retains all data (2M/sec × 7,200 sec = 14.4B logs = ~7.2 TB).
  • Well within 24-hour retention capacity.
  • After ES recovery, consumers replay from stored offset.
  • Full catchup time: 14.4B / 3M = 4,800 sec ≈ 80 minutes.
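Both sets of numbers above are easy to re-derive:

```python
# Verify the burst-drain and outage-catchup arithmetic above.
LINE_BYTES = 500
DRAIN_RATE = 3_000_000                         # ES Indexer lines/sec

# Burst: 10M lines/sec for 10 minutes, drained at 3M lines/sec.
burst_logs = 10_000_000 * 600                  # 6 billion logs
burst_tb = burst_logs * LINE_BYTES / 1e12      # 3 TB raw
drain_min = burst_logs / DRAIN_RATE / 60       # ~33 minutes

# Outage: Elasticsearch down 2 hours at the sustained 2M lines/sec.
backlog = 2_000_000 * 7_200                    # 14.4 billion logs
backlog_tb = backlog * LINE_BYTES / 1e12       # ~7.2 TB raw
catchup_min = backlog / DRAIN_RATE / 60        # 80 minutes
```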

Deep Dive 2: Petabyte-Scale Search and Index Management

At petabyte scale, a naive Elasticsearch deployment collapses. This deep dive covers how to make search work across massive data volumes.

Time-Based Index Partitioning:

logs-{tenant}-{date}-{shard}

Example index lifecycle:
Day 1: logs-payment-2025.07.15-{000..019}  → 20 primary shards, HOT tier
Day 4: ILM moves to WARM tier, force-merge to 5 segments per shard
Day 31: Snapshot to S3, delete from ES, register in cold catalog
Day 366: Move S3 objects to Glacier, retain metadata only (FROZEN)

Shard sizing strategy:

  • Target: 30–50 GB per shard (Elasticsearch sweet spot for search performance).
  • A tenant producing 10 TB/day needs ~200-300 shards per day.
  • A small tenant producing 10 GB/day needs 1 shard per day.
  • Dynamic shard allocation: the ingestion gateway tracks per-tenant volume. When creating tomorrow’s index template, it calculates the optimal shard count based on yesterday’s volume, with 20% headroom.
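The dynamic shard allocation step reduces to a one-liner. The 40 GB target is an assumption (midpoint of the 30–50 GB sweet spot); the 20% headroom comes from the text:

```python
import math

def shard_count(yesterday_bytes, target_shard_gb=40, headroom=0.2):
    """Shards for tomorrow's index template: yesterday's volume plus
    headroom, divided by the target shard size, floor of one shard."""
    projected = yesterday_bytes * (1 + headroom)
    return max(1, math.ceil(projected / (target_shard_gb * 1e9)))
```

This reproduces the examples above: a 10 TB/day tenant lands in the ~200–300 shard range, a 10 GB/day tenant gets a single shard.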

The problem with too many shards:

  • 50K shards across the cluster is manageable. 500K is not — each shard consumes ~10 MB heap on the master node.
  • Solution: small tenants share indices using a _routing field set to tenant_id. The routing ensures all of a tenant’s docs land on the same shard within a shared daily index. Larger tenants (> 1 TB/day) get dedicated indices.

Inverted Index Optimization:

  • Log messages are tokenized using a custom analyzer: lowercase, ASCII folding, split on whitespace and common delimiters, no stemming (logs aren’t natural language).
  • Keyword fields (service, host, level, trace_id) use keyword type — exact match, doc values for aggregations.
  • The message field uses text type with a position-aware inverted index for phrase queries (“connection refused from 10.0.1.42”).
  • metadata fields are dynamically mapped but with a strict limit (max 100 unique field names per index) to prevent mapping explosion.
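A rough Python stand-in for that custom analyzer. The exact delimiter set is an assumption — the text only specifies "whitespace and common delimiters" — and a production analyzer would live in the Elasticsearch index settings, not application code:

```python
import re
import unicodedata

# Whitespace plus punctuation common in machine logs; illustrative set.
_DELIMS = re.compile(r"[\s,;:=\"'()\[\]{}<>]+")

def analyze(message: str) -> list[str]:
    """Log-message analyzer sketch: lowercase, ASCII-fold, split on
    delimiters, no stemming (logs aren't natural language)."""
    folded = unicodedata.normalize("NFKD", message)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    return [t for t in _DELIMS.split(folded.lower()) if t]
```

Note that hyphens are deliberately not delimiters, so identifiers like `ord-456` survive as single tokens.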

Tiered Storage Query Routing:

User query: "error payment" last 7 days
                    │
                    ▼
            ┌──────────────┐
            │ Query Router  │
            └──────┬───────┘
                   │
    ┌──────────────┼──────────────┐
    ▼              ▼              ▼
 HOT (0-3d)    WARM (3-7d)    (skip COLD)
 20 shards     20 shards
 NVMe SSD      HDD
 ~200ms        ~800ms
                   │
                   ▼
          Merge & rank results
          Return top 100 by timestamp

For cold tier queries (rare, typically compliance or incident forensics), the system uses a searchable snapshots approach: the query router identifies relevant Parquet files in S3 using a metadata catalog (time range + tenant + service → file list), downloads the necessary column chunks on-demand to a local cache, and searches them. Cold queries are capped at 60-second timeout and limited to 5 concurrent cold queries to prevent resource starvation.

Compression and cost optimization:

  • Hot tier: LZ4 compression (fast decompression for real-time search). ~3:1 ratio.
  • Warm tier: zstd compression (better ratio, acceptable decompression speed). ~8:1 ratio.
  • Cold tier: zstd level 19 in Parquet files. ~15:1 ratio.
  • Debug log sampling: For high-volume services emitting > 100K DEBUG lines/sec, the gateway applies probabilistic sampling (keep 10% of DEBUG, 100% of INFO and above). The sampling rate is stored as metadata so dashboards can extrapolate counts.
  • Deduplication: The ingestion pipeline hashes (timestamp, host, message) and deduplicates within a 5-second window using a Bloom filter. Catches duplicate delivery from agent retries. Reduces storage by ~3-5% in practice.
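The dedup step can be sketched as follows. A pair of rotating sets stands in for the Bloom filter (exact membership instead of probabilistic, so no false positives), and window rotation is keyed off the log timestamp for simplicity:

```python
import hashlib

class WindowedDeduper:
    """Dedup within a short window, keyed by hash(timestamp, host, message).
    Two rotating sets approximate the Bloom-filter-per-window scheme."""

    def __init__(self, window_s=5):
        self.window_s = window_s
        self.current_bucket = None
        self.seen = [set(), set()]   # current window + previous window

    def is_duplicate(self, ts, host, message):
        bucket = int(ts) // self.window_s
        if bucket != self.current_bucket:
            self.seen = [set(), self.seen[0]]   # rotate windows
            self.current_bucket = bucket
        key = hashlib.sha1(f"{ts}|{host}|{message}".encode()).digest()
        if key in self.seen[0] or key in self.seen[1]:
            return True
        self.seen[0].add(key)
        return False
```

An agent retry resends the same (timestamp, host, message) triple, so it hashes to the same key and is dropped; a genuinely new log line passes through.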

Deep Dive 3: Multi-Tenancy, Access Control, and Noisy Neighbor Isolation

In a shared logging platform serving hundreds of teams, multi-tenancy is critical for both security (Team A must not see Team B’s logs) and reliability (Team A’s log flood must not degrade Team B’s search experience).

Tenant Isolation Model:

┌───────────────────────────────────────────────────┐
│              Ingestion Layer                       │
│  Per-tenant rate limits (token bucket in Redis)    │
│  payment-team: 200K lines/sec                     │
│  search-team:   50K lines/sec                     │
│  ml-platform:  500K lines/sec                     │
│                                                   │
│  Overage handling:                                │
│  - Soft limit: log warning, continue ingesting    │
│  - Hard limit (2x quota): drop DEBUG/TRACE only   │
│  - Emergency (5x): drop all except ERROR/FATAL    │
└───────────────────────────────────────────────────┘
                      │
                      ▼
┌───────────────────────────────────────────────────┐
│              Storage Layer                         │
│                                                   │
│  Large tenants: dedicated indices                 │
│    logs-payment-2025.07.15-{shard}                │
│                                                   │
│  Small tenants: shared indices with routing       │
│    logs-shared-small-2025.07.15 (routing=tenant)  │
│                                                   │
│  Physical isolation option (enterprise):          │
│    Dedicated ES node pools per tenant             │
└───────────────────────────────────────────────────┘
                      │
                      ▼
┌───────────────────────────────────────────────────┐
│              Query Layer                           │
│                                                   │
│  Every query automatically injected with:         │
│    filter: { tenant_id: <caller's tenant> }       │
│                                                   │
│  Query resource limits per tenant:                │
│  - Max concurrent queries: 20                     │
│  - Max query time range: 30 days (configurable)   │
│  - Max result set: 10,000 hits                    │
│  - Query timeout: 30 seconds                      │
│                                                   │
│  Query priority queues:                           │
│  - P0 (on-call incident): bypass limits           │
│  - P1 (normal interactive): standard limits       │
│  - P2 (batch/export): throttled, off-peak only    │
└───────────────────────────────────────────────────┘

Access Control:

  • Authentication: mTLS for agent-to-gateway. API keys or OAuth2 tokens for human users.
  • Authorization: RBAC at tenant level. Roles: viewer (search own team’s logs), admin (manage retention policies, alert rules), superadmin (cross-tenant search for SRE team).
  • Field-level redaction: PII fields (user email, IP address) can be marked as restricted. Only users with pii-reader role see the actual values; others see [REDACTED]. Implemented as a post-processing filter in the query layer, not in storage (preserving the ability to un-redact for authorized users).
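The field-level redaction filter is straightforward to sketch for flat result documents (nested metadata handling is elided; role and field names are illustrative):

```python
def redact(hit: dict, restricted_fields: set[str], caller_roles: set[str]) -> dict:
    """Post-processing redaction in the query layer: values of restricted
    (PII) fields are masked unless the caller holds the pii-reader role.
    Storage is untouched, so authorized users still see real values."""
    if "pii-reader" in caller_roles:
        return hit
    return {
        k: ("[REDACTED]" if k in restricted_fields else v)
        for k, v in hit.items()
    }
```

Applying this per hit in the query layer — rather than at write time — is what preserves the un-redact capability described above.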

Noisy Neighbor Prevention:

  • The core mechanism is the per-tenant ingestion rate limit described above.
  • On the query side, each tenant’s queries run in an isolated thread pool with bounded concurrency. A tenant running expensive queries cannot starve other tenants’ search capacity.
  • Elasticsearch uses search request priority: queries from tenants in an active incident (P0) get 3x the thread pool allocation. This is triggered by an integration with the incident management system.
  • Chargeback: each tenant’s ingestion volume and storage usage is metered. Monthly reports show cost attribution, incentivizing teams to reduce unnecessary logging (especially DEBUG in production).

7. Extensions (2 min)

  • Distributed Tracing Correlation — Enrich log documents with trace_id and span_id. Build a trace-to-logs bridge: clicking a trace span in Jaeger/Tempo opens the corresponding log lines. Requires a secondary index on trace_id with fast point lookups, achievable via ClickHouse’s ReplicatedMergeTree with an ORDER BY (trace_id, timestamp) table dedicated to trace correlation.

  • Log-Based Metrics (Logs-to-Metrics Pipeline) — Extract numeric metrics from log lines in the stream processor (e.g., “request completed in 342ms” → request_duration_ms histogram). Publish to Prometheus/Mimir. This reduces the need to search raw logs for performance analysis and is 100x cheaper than querying Elasticsearch for aggregation dashboards.

  • ML-Powered Anomaly Detection — Train a model on historical log patterns per service (log volume by level per 5-minute window). Detect anomalies: sudden drop in INFO logs (service crashed?), spike in WARN (degradation?), new ERROR message never seen before (new bug?). Use the ClickHouse aggregation engine to compute features, feed into an online ML model (e.g., Isolation Forest), and publish anomalies to the alert pipeline.

  • Cross-Region Replication & Active-Active — For global deployments, run independent logging clusters per region. A cross-region replicator asynchronously copies logs from each region to a central “global search” cluster (latency tolerance: 5 minutes). Users can query local (fast) or global (slower but complete). Use Kafka MirrorMaker 2 for cross-region topic replication.

  • Compliance & Immutability (WORM Storage) — For regulated industries (finance, healthcare), write logs to WORM (Write Once Read Many) storage. S3 Object Lock in Governance mode prevents deletion even by admins during the retention period. Maintain a cryptographic hash chain (Merkle tree) over daily log batches so tampering is detectable. Export compliance reports showing chain-of-custody for audit logs.