1. Requirements & Scope (5 min)
Functional Requirements
- Log Ingestion — Collect structured and unstructured logs from 50K+ servers, containers, and serverless functions. Support push-based (agents forward logs) and pull-based (system scrapes endpoints) collection models.
- Full-Text Search & Field-Based Filtering — Users can search logs by arbitrary keywords (full-text), filter by structured fields (service name, log level, host, correlation ID), and scope by time range. Queries return results within seconds even over terabytes of data.
- Live Tail (Real-Time Streaming) — Provide a `tail -f` equivalent where engineers can subscribe to a live stream of logs from a specific service, host, or filter expression. Latency from log emission to live tail display should be under 5 seconds.
- Alerting & Pattern Detection — Define alert rules on log patterns (e.g., “more than 100 ERROR logs from payment-service in 5 minutes”). Support anomaly detection on log volume and error rates. Route alerts to PagerDuty, Slack, email.
- Retention & Lifecycle Management — Configure per-tenant or per-service retention policies (e.g., 7 days hot, 30 days warm, 1 year cold, 7 years frozen for compliance). Automatic tiered storage migration and deletion.
Non-Functional Requirements
- Availability: 99.99% uptime for ingestion (logs must never be dropped). Search can tolerate brief degradation (99.9%).
- Latency: Ingestion-to-searchable latency < 10 seconds (p99). Search queries return in < 5 seconds for recent data (last 24h), < 30 seconds for historical data.
- Consistency: Eventual consistency is acceptable. A log written at time T may not appear in search results for up to 10 seconds. No strict ordering guarantees across services.
- Scale: 50,000 servers, 2 million log lines/second sustained ingestion, 10M lines/sec burst. 1 PB of searchable hot/warm data. 10 PB in cold/frozen archival.
- Durability: Zero log loss once acknowledged by the ingestion pipeline. Replicated storage with 99.999999999% (11 nines) durability for archived data.
2. Estimation (3 min)
Write Path (Ingestion)
| Metric | Value |
|---|---|
| Servers | 50,000 |
| Avg log lines per server per second | 40 |
| Sustained ingestion rate | 2M lines/sec |
| Burst ingestion rate (10x spikes) | 10M lines/sec |
| Avg log line size (JSON structured) | 500 bytes |
| Sustained throughput | 1 GB/sec = 3.6 TB/hour = 86 TB/day |
| Compression ratio (zstd) | ~10:1 |
| Compressed daily storage | ~8.6 TB/day |
Read Path (Search)
| Metric | Value |
|---|---|
| Concurrent search users | ~500 |
| Search QPS | ~200 queries/sec |
| Live tail subscriptions | ~1,000 concurrent streams |
| Avg query scans | 1–10 GB of index data per query |
Storage Tiers
| Tier | Retention | Raw Data | Compressed |
|---|---|---|---|
| Hot (SSD, fully indexed) | 3 days | 258 TB | ~26 TB |
| Warm (HDD, fully indexed) | 30 days | 2.58 PB | ~258 TB |
| Cold (object storage, partial index) | 1 year | ~31 PB | ~3.1 PB |
| Frozen (object storage, metadata only) | 7 years | archived | ~20 PB |
Kafka Sizing
- 2M messages/sec × 500 bytes = 1 GB/sec.
- With replication factor 3, Kafka needs ~3 GB/sec disk write throughput.
- 24-hour retention in Kafka (buffer for downstream failures): 86 TB × 3 replicas = 258 TB Kafka storage.
- ~50 Kafka brokers (each handling ~60 MB/sec write throughput).
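The sizing arithmetic above can be checked with a short back-of-envelope script; all inputs are the document's estimates, and the outputs match the figures quoted (up to rounding of 86.4 TB/day to 86 TB/day):

```python
# Back-of-envelope Kafka sizing from the estimates above.
LINES_PER_SEC = 2_000_000
LINE_BYTES = 500
REPLICATION = 3
RETENTION_HOURS = 24
BROKERS = 50

ingest_gb_per_sec = LINES_PER_SEC * LINE_BYTES / 1e9               # 1.0 GB/s
disk_write_gb_per_sec = ingest_gb_per_sec * REPLICATION            # 3.0 GB/s
# 24 h of replicated retention: ~259 TB (the text rounds 86.4 TB/day to 86)
retained_tb = ingest_gb_per_sec * 3600 * RETENTION_HOURS * REPLICATION / 1000
per_broker_mb_per_sec = disk_write_gb_per_sec * 1000 / BROKERS     # 60 MB/s

print(f"{ingest_gb_per_sec:.1f} GB/s ingest, {disk_write_gb_per_sec:.1f} GB/s disk writes")
print(f"~{retained_tb:.0f} TB Kafka storage, ~{per_broker_mb_per_sec:.0f} MB/s per broker")
```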
3. API Design (3 min)
# --- Ingestion APIs ---
# Batch log ingestion (used by agents)
POST /v1/logs/ingest
Headers: X-Tenant-ID, X-API-Key, Content-Encoding: zstd
Body: {
"logs": [
{
"timestamp": "2025-07-15T10:23:45.123Z",
"level": "ERROR",
"service": "payment-service",
"host": "prod-web-042",
"trace_id": "abc123def456",
"span_id": "span-789",
"message": "Failed to process payment",
"metadata": { "user_id": "u-123", "order_id": "ord-456", "error_code": "TIMEOUT" }
}
]
}
Response: 202 Accepted { "ingested": 150, "failed": 0 }
# --- Search APIs ---
# Full-text and field-based search
POST /v1/logs/search
Body: {
"query": "Failed to process payment",
"filters": {
"service": "payment-service",
"level": ["ERROR", "WARN"],
"time_range": { "from": "2025-07-15T10:00:00Z", "to": "2025-07-15T11:00:00Z" }
},
"sort": "timestamp:desc",
"limit": 100,
"cursor": "eyJsYXN0X3RzIjoxNjg..."
}
Response: { "hits": [...], "total": 4521, "next_cursor": "..." }
# Live tail — WebSocket
WS /v1/logs/tail?service=payment-service&level=ERROR
→ Server pushes matching log lines in real time
# Aggregation query (for dashboards)
POST /v1/logs/aggregate
Body: {
"filters": { "service": "payment-service", "time_range": { "last": "1h" } },
"group_by": ["level"],
"interval": "1m",
"metric": "count"
}
# --- Alert APIs ---
POST /v1/alerts/rules
Body: {
"name": "Payment errors spike",
"query": "level:ERROR AND service:payment-service",
"condition": { "threshold": 100, "window": "5m", "operator": ">" },
"actions": [{ "type": "pagerduty", "severity": "critical" }]
}
Key Decisions
- Batch ingestion over single-line writes — amortizes network overhead, enables compression. Agents batch locally (every 1–5 seconds or 1000 lines, whichever comes first).
- Cursor-based pagination instead of offset-based — handles the append-heavy, time-sorted nature of log data without expensive deep-page queries.
- WebSocket for live tail — HTTP long-polling would waste connections. WebSocket allows server-push with low latency. Each subscription is a filtered Kafka consumer under the hood.
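The agent-side batching policy (flush every 1000 lines or every few seconds, whichever comes first) can be sketched as follows; `LogBatcher` and its `send` callback are illustrative names, not part of any agent's real API:

```python
import time

class LogBatcher:
    """Agent-side batching: flush at max_lines lines or max_age seconds,
    whichever comes first. `send` is a caller-supplied callable that would
    POST the batch (compressed) to the ingestion gateway."""

    def __init__(self, send, max_lines=1000, max_age=5.0):
        self.send, self.max_lines, self.max_age = send, max_lines, max_age
        self.buf, self.first_ts = [], None

    def add(self, line):
        if self.first_ts is None:
            self.first_ts = time.monotonic()
        self.buf.append(line)
        if len(self.buf) >= self.max_lines:   # size-triggered flush
            self.flush()

    def tick(self):
        # Called periodically by the agent's event loop: age-triggered flush.
        if self.buf and time.monotonic() - self.first_ts >= self.max_age:
            self.flush()

    def flush(self):
        if self.buf:
            self.send(self.buf)               # e.g. POST /v1/logs/ingest
            self.buf, self.first_ts = [], None
```

A real agent would also only advance its file-offset registry after `send` is acknowledged, per the durability discussion later in the document.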
4. Data Model (3 min)
Log Document (Elasticsearch / ClickHouse)
| Field | Type | Indexed | Notes |
|---|---|---|---|
| `id` | UUID | Primary key | Generated at ingestion, used for dedup |
| `timestamp` | DateTime64(ms) | Yes (sort key) | Nanosecond precision stored, ms for indexing |
| `tenant_id` | String | Yes (partition key) | Tenant isolation |
| `service` | String (keyword) | Yes | Exact match, not analyzed |
| `host` | String (keyword) | Yes | Server hostname |
| `level` | Enum (TRACE/DEBUG/INFO/WARN/ERROR/FATAL) | Yes | Stored as uint8 internally |
| `trace_id` | String (keyword) | Yes | For distributed tracing correlation |
| `span_id` | String (keyword) | Yes | Links to specific trace span |
| `message` | Text (analyzed) | Yes (full-text) | Inverted index for search |
| `metadata` | JSON / Map(String, String) | Selective | Dynamic fields, selectively indexed |
| `_raw` | String | No | Original log line, stored but not indexed |
Index Partitioning Strategy
Index naming: `logs-{tenant_id}-{YYYY.MM.DD}-{shard_number}`
Example: `logs-payment-team-2025.07.15-003`
- Daily indices → easy to delete/archive entire days
- Tenant prefix → physical isolation for noisy neighbors
- Shard number → distribute within a day (target 30–50 GB per shard)
Why Elasticsearch + ClickHouse Hybrid?
| Engine | Use Case | Strength |
|---|---|---|
| Elasticsearch | Full-text search, live tail, ad-hoc queries | Inverted index excels at arbitrary text search. Near real-time indexing (refresh every 1s). |
| ClickHouse | Aggregations, dashboards, analytics, alerting | Columnar storage gives 10-100x faster GROUP BY, COUNT, and time-series aggregations. Excellent compression. |
| S3 / GCS | Cold & frozen storage | 11 nines durability, ~$0.023/GB/month vs $0.10/GB for SSD. Parquet format for occasional queries. |
Both engines are populated from the same Kafka topics. Elasticsearch handles interactive search. ClickHouse handles dashboard queries and alert rule evaluation. Cold data is written to object storage in compressed Parquet with only metadata indexed in a lightweight catalog (e.g., Hive Metastore or Iceberg).
5. High-Level Design (12 min)
System Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ LOG PRODUCERS │
│ [App Servers] [Containers] [Lambda/Serverless] [Network Devices] [CDN] │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌──────────┐ ┌─────────┐ ┌─────────┐ │
│ │Filebeat │ │Fluentd │ │ SDK / │ │ Syslog │ │Filebeat │ │
│ │ Agent │ │ Agent │ │ Direct │ │ Forwarder│ │ Agent │ │
│ └────┬────┘ └────┬────┘ └────┬─────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │ │
└───────┼──────────────┼─────────────┼─────────────────┼─────────────┼────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ INGESTION GATEWAY (L7 Load Balancer) │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Gateway Fleet (stateless, auto-scaling) │ │
│ │ - Authentication (API key validation) │ │
│ │ - Tenant rate limiting (token bucket per tenant) │ │
│ │ - Schema validation & enrichment (add receive_ts, geo) │ │
│ │ - Decompression (zstd) → re-serialization (Protobuf) │ │
│ │ - Backpressure signaling (HTTP 429 / TCP flow control) │ │
│ └────────────────────┬───────────────────────────────────────┘ │
│ │ │
└───────────────────────┼──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ KAFKA CLUSTER │
│ │
│ Topic: logs-raw (partitioned by hash(tenant_id + service)) │
│ ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐ │
│ │ P0 │ P1 │ P2 │ P3 │ ... │ P247 │ P248 │ P249 │ (250 partitions)│
│ └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘ │
│ Replication factor: 3 | Retention: 24h | Compression: zstd │
│ │
│ Topic: logs-enriched (after processing pipeline) │
│ Topic: logs-alerts (matched alert patterns, routed to alerting) │
└────────┬────────────────────┬──────────────────────┬─────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐
│ STREAM PROCESSOR │ │ STREAM PROCESSOR│ │ STREAM PROCESSOR │
│ (ES Indexer) │ │ (CH Loader) │ │ (Alert Evaluator) │
│ │ │ │ │ │
│ - Parse & enrich │ │ - Batch inserts │ │ - Sliding window │
│ - Bulk index to │ │ (64K rows per │ │ aggregations │
│ Elasticsearch │ │ batch) │ │ - Pattern matching │
│ - Manage index │ │ - Write to CH │ │ - Anomaly detection │
│ rotation │ │ MergeTree │ │ - Publish to alerts │
│ - Handle retries │ │ tables │ │ topic │
└────────┬────────┘ └────────┬────────┘ └──────────┬──────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐
│ ELASTICSEARCH │ │ CLICKHOUSE │ │ ALERT MANAGER │
│ CLUSTER │ │ CLUSTER │ │ │
│ │ │ │ │ - Dedup alerts │
│ Hot: 30 nodes │ │ 20 nodes │ │ - Route: PagerDuty, │
│ (NVMe SSD) │ │ (SSD + HDD) │ │ Slack, email │
│ Warm: 50 nodes │ │ │ │ - Silence/snooze │
│ (HDD) │ │ ReplicatedMerge │ │ - Escalation chains │
│ │ │ Tree engine │ │ │
│ Daily index │ │ TTL-based │ └─────────────────────┘
│ rotation + ILM │ │ retention │
└────────┬────────┘ └─────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ TIERED STORAGE & LIFECYCLE │
│ │
│ HOT (0-3 days) → WARM (3-30 days) → COLD (30d-1yr) → FROZEN (1-7yr)│
│ NVMe SSD HDD S3/GCS (Parquet) S3 Glacier │
│ Full index Full index Partial index Metadata only │
│ 3 replicas 1 replica Erasure coded Erasure coded │
│ │
│ ILM Policy: auto-migrate based on index age │
│ Compliance: immutable writes, audit trail, encryption at rest │
└──────────────────────────────────────────────────────────────────────────────┘
Component Responsibilities
- Log Collection Agents (Filebeat / Fluentd) — Deployed as sidecars or DaemonSets. Read from files, stdout, or syslog. Buffer locally (disk-backed queue) to survive brief network outages. Compress and batch before sending to gateway. Support structured (JSON) and unstructured (plaintext with grok parsing) formats.
- Ingestion Gateway — Stateless HTTP/gRPC endpoints behind an L7 load balancer. Handles auth, rate limiting, validation, and enrichment. Crucially, applies backpressure: if Kafka producers see high latency, the gateway returns HTTP 429 to agents, which back off and buffer locally. This prevents cascade failures.
- Kafka — The central durability buffer. Decouples ingestion from processing. The 24-hour retention means if Elasticsearch is down for maintenance, no logs are lost — consumers replay from offset. Partitioning by `hash(tenant_id + service)` ensures logs from the same service land on the same partition, enabling ordered processing within a service.
- Stream Processors — Three independent consumer groups:
  - ES Indexer: Consumes from `logs-raw`, enriches (GeoIP, hostname resolution), and bulk-indexes into Elasticsearch. Uses batches of 5,000 documents or 5-second windows.
  - ClickHouse Loader: Consumes the same topic, writes batches of 64K rows into ClickHouse MergeTree tables for analytical queries.
  - Alert Evaluator: Runs sliding window aggregations (tumbling 1-minute windows with Flink/Kafka Streams). Matches alert rules and publishes to the `logs-alerts` topic.
- Elasticsearch Cluster — Powers full-text search, live tail, and ad-hoc exploration. Daily indices with ILM (Index Lifecycle Management) policies. Hot nodes use NVMe SSDs for fast indexing and search. Warm nodes use HDDs for cost efficiency on older data.
- ClickHouse Cluster — Powers dashboards, aggregation queries, and alert rule evaluation. The columnar MergeTree engine compresses log data 15-20x and runs analytical queries 10-100x faster than Elasticsearch for GROUP BY and COUNT operations.
- Alert Manager — Consumes from `logs-alerts`, deduplicates (same alert within a suppression window), and routes to notification channels. Supports escalation chains and silence windows.
- Tiered Storage Manager — Background service that migrates data between tiers based on ILM policies. Hot → Warm is an Elasticsearch node migration. Warm → Cold exports to Parquet in object storage. Cold → Frozen moves to archival storage with only metadata retained for discovery.
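The Alert Evaluator's threshold check ("more than 100 ERROR logs from payment-service in 5 minutes") reduces to counting events per tumbling window per key. A minimal sketch, assuming in-memory state rather than Flink/Kafka Streams:

```python
from collections import defaultdict

def evaluate_threshold_rule(events, window_sec, threshold, key_fn):
    """Tumbling-window threshold evaluation.
    events: iterable of (epoch_ts, log_record).
    Returns {(window_index, key): count} for every window/key whose count
    exceeds `threshold`. Real deployments keep this state in the stream
    processor, not a plain dict."""
    counts = defaultdict(int)
    for ts, record in events:
        window = int(ts // window_sec)          # tumbling window index
        counts[(window, key_fn(record))] += 1
    return {k: c for k, c in counts.items() if c > threshold}
```

For example, 120 ERROR logs from one service inside a 5-minute (300 s) window fires the rule above with threshold 100.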
6. Deep Dives (15 min)
Deep Dive 1: Handling 2M+ Lines/Sec Ingestion with Backpressure
The hardest engineering challenge is reliable ingestion at 2M lines/sec sustained with 10M lines/sec bursts, without losing a single log and without cascading failures.
The Backpressure Chain:
Agent ←── Gateway ←── Kafka ←── Stream Processor ←── Elasticsearch
Backpressure flows RIGHT to LEFT. Each component signals the upstream
when it cannot keep up.
Layer 1: Agent-side buffering. Filebeat and Fluentd maintain a disk-backed queue (configurable, typically 1-5 GB). When the gateway returns 429 or the connection fails, agents stop reading from source files and buffer on disk. The file offset (registry in Filebeat) is only advanced after acknowledgment. This is the last line of defense — as long as disk space holds, no logs are lost.
Layer 2: Gateway rate limiting. Each tenant gets a token bucket rate limiter (e.g., payment-team: 100K lines/sec). This prevents a single runaway service from saturating the pipeline. The gateway maintains a Kafka producer pool. If Kafka producer buffers exceed 80% capacity, the gateway starts rejecting with 429. Key design: the gateway is stateless — rate limit state is stored in a shared Redis cluster, with local caching (leaky bucket local + periodic sync) to avoid Redis becoming a bottleneck.
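The per-tenant token bucket can be sketched as below; in the real design the bucket state lives in Redis with a local cache, but the refill logic is the same (class and parameter names are illustrative):

```python
import time

class TokenBucket:
    """Per-tenant token bucket: `rate` tokens/sec refill, `burst` capacity.
    A False return is what the gateway turns into HTTP 429."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self, n=1):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```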
Layer 3: Kafka as the shock absorber. Kafka is sized to handle 10x burst (10M lines/sec) for up to 30 minutes. With 250 partitions and 50 brokers, each broker handles ~200K messages/sec, well within Kafka’s capability (~800K messages/sec per broker on commodity hardware). The 24-hour retention means even a multi-hour downstream outage loses no data.
Layer 4: Consumer-side adaptive batching. The ES Indexer consumer group dynamically adjusts batch size based on Elasticsearch indexing latency. When ES responds slowly (p99 > 500ms), batch size increases from 5K to 50K documents and flush interval extends from 5s to 30s. This trades latency for throughput, preventing consumer lag from growing unboundedly. If lag exceeds a threshold (e.g., 10 minutes), an autoscaler adds more consumer instances.
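The adaptive-batching policy above (grow batches when Elasticsearch is slow, shrink when it recovers) can be expressed as a small pure function; the doubling/halving multipliers are an assumption, only the bounds come from the text:

```python
def next_batch_config(p99_ms, batch_size, flush_sec,
                      slow_ms=500, min_batch=5_000, max_batch=50_000,
                      min_flush=5, max_flush=30):
    """When ES p99 latency exceeds slow_ms, trade latency for throughput:
    double the batch size (up to 50K docs) and the flush interval (up to
    30 s). Otherwise decay back toward the 5K / 5 s defaults."""
    if p99_ms > slow_ms:
        batch_size = min(max_batch, batch_size * 2)
        flush_sec = min(max_flush, flush_sec * 2)
    else:
        batch_size = max(min_batch, batch_size // 2)
        flush_sec = max(min_flush, flush_sec // 2)
    return batch_size, flush_sec
```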
Burst handling math:
- Burst: 10M lines/sec for 10 minutes = 6 billion logs = 3 TB raw data.
- Kafka absorbs all of it (3 TB is well within the 258 TB buffered storage).
- ES Indexer processes at ~3M lines/sec (with 30 consumer instances).
- Drain time for the burst backlog: 6B / 3M = 2,000 seconds ≈ 33 minutes after burst ends.
- During drain, the ingestion-to-searchable lag for the newest logs grows from 10 seconds to roughly 40 minutes. Acceptable per our SLA.
Failure mode: Elasticsearch cluster down for 2 hours.
- Kafka retains all data (2M/sec × 7,200 sec = 14.4B logs = ~7.2 TB).
- Well within 24-hour retention capacity.
- After ES recovery, consumers replay from stored offset.
- Full catchup time: 14.4B / 3M = 4,800 sec ≈ 80 minutes.
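The catch-up math generalizes to a one-line formula. Note the 80-minute figure assumes ingest has stopped; if the 2M lines/sec keeps arriving during the drain, the effective drain rate is only 1M lines/sec, stretching catch-up to 240 minutes:

```python
def catchup_minutes(backlog_lines, consume_rate, ingest_rate=0):
    """Minutes to drain a Kafka backlog. While draining, new logs arrive at
    ingest_rate, so the effective drain rate is consume_rate - ingest_rate."""
    assert consume_rate > ingest_rate, "consumers must outpace ingest to catch up"
    return backlog_lines / (consume_rate - ingest_rate) / 60

# 2-hour ES outage: 14.4B-line backlog, consumers at 3M lines/sec.
print(catchup_minutes(14.4e9, 3e6))        # 80 min if ingest paused
print(catchup_minutes(14.4e9, 3e6, 2e6))   # 240 min with live ingest at 2M/s
```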
Deep Dive 2: Storage Engine & Index Partitioning for Petabyte-Scale Search
At petabyte scale, naive Elasticsearch deployment collapses. This deep dive covers how to make search work across massive data volumes.
Time-Based Index Partitioning:
logs-{tenant}-{date}-{shard}
Example index lifecycle:
Day 1: logs-payment-2025.07.15-{000..019} → 20 primary shards, HOT tier
Day 4: ILM moves to WARM tier, force-merge to 5 segments per shard
Day 31: Snapshot to S3, delete from ES, register in cold catalog
Day 366: Move S3 objects to Glacier, retain metadata only (FROZEN)
Shard sizing strategy:
- Target: 30–50 GB per shard (Elasticsearch sweet spot for search performance).
- A tenant producing 10 TB/day needs ~200-300 shards per day.
- A small tenant producing 10 GB/day needs 1 shard per day.
- Dynamic shard allocation: the ingestion gateway tracks per-tenant volume. When creating tomorrow’s index template, it calculates the optimal shard count based on yesterday’s volume, with 20% headroom.
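The dynamic shard-count calculation above reduces to a small function; the 40 GB target is an assumed midpoint of the 30–50 GB sweet spot:

```python
import math

def shard_count(yesterday_gb, target_gb=40, headroom=0.20):
    """Shards for tomorrow's daily index: yesterday's volume plus 20%
    headroom, divided by the per-shard target, with a floor of 1."""
    return max(1, math.ceil(yesterday_gb * (1 + headroom) / target_gb))
```

This reproduces the examples in the text: a 10 TB/day tenant gets 300 shards (in the ~200–300 range), a 10 GB/day tenant gets 1.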
The problem with too many shards:
- 50K shards across the cluster is manageable. 500K is not — each shard consumes ~10 MB heap on the master node.
- Solution: small tenants share indices using a `_routing` field set to `tenant_id`. The routing ensures all of a tenant’s docs land on the same shard within a shared daily index. Larger tenants (> 1 TB/day) get dedicated indices.
Inverted Index Optimization:
- Log messages are tokenized using a custom analyzer: lowercase, ASCII folding, split on whitespace and common delimiters, no stemming (logs aren’t natural language).
- Keyword fields (service, host, level, trace_id) use the `keyword` type — exact match, doc values for aggregations.
- The `message` field uses the `text` type with a position-aware inverted index for phrase queries (“connection refused from 10.0.1.42”).
- `metadata` fields are dynamically mapped but with a strict limit (max 100 unique field names per index) to prevent mapping explosion.
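The mapping implied by these bullets could look roughly like the sketch below, written as a Python dict mirroring the JSON body of an index-creation request. The analyzer name (`log_msg`) and exact settings keys are assumptions, not a tested production mapping:

```python
# Hypothetical index-creation body for the custom analyzer and field types
# described above (lowercase + ASCII folding, whitespace split, no stemming).
log_index = {
    "settings": {
        "index.mapping.total_fields.limit": 100,  # cap dynamic metadata fields
        "analysis": {
            "analyzer": {
                "log_msg": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        },
    },
    "mappings": {
        "properties": {
            "timestamp": {"type": "date"},
            "service": {"type": "keyword"},
            "host": {"type": "keyword"},
            "level": {"type": "keyword"},
            "trace_id": {"type": "keyword"},
            "message": {"type": "text", "analyzer": "log_msg"},
            "_raw": {"type": "text", "index": False},  # stored, not searchable
        }
    },
}
```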
Tiered Storage Query Routing:
User query: "error payment" last 7 days
│
▼
┌──────────────┐
│ Query Router │
└──────┬───────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
HOT (0-3d) WARM (3-7d) (skip COLD)
20 shards 20 shards
NVMe SSD HDD
~200ms ~800ms
│
▼
Merge & rank results
Return top 100 by timestamp
For cold tier queries (rare, typically compliance or incident forensics), the system uses a searchable snapshots approach: the query router identifies relevant Parquet files in S3 using a metadata catalog (time range + tenant + service → file list), downloads the necessary column chunks on-demand to a local cache, and searches them. Cold queries are capped at 60-second timeout and limited to 5 concurrent cold queries to prevent resource starvation.
Compression and cost optimization:
- Hot tier: LZ4 compression (fast decompression for real-time search). ~3:1 ratio.
- Warm tier: zstd compression (better ratio, acceptable decompression speed). ~8:1 ratio.
- Cold tier: zstd level 19 in Parquet files. ~15:1 ratio.
- Debug log sampling: For high-volume services emitting > 100K DEBUG lines/sec, the gateway applies probabilistic sampling (keep 10% of DEBUG, 100% of INFO and above). The sampling rate is stored as metadata so dashboards can extrapolate counts.
- Deduplication: The ingestion pipeline hashes `(timestamp, host, message)` and deduplicates within a 5-second window using a Bloom filter. Catches duplicate delivery from agent retries. Reduces storage by ~3-5% in practice.
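The windowed dedup logic can be sketched as below. A plain set stands in for the Bloom filter so the sketch stays exact; a production pipeline would accept the Bloom filter's fixed memory and small false-positive rate instead:

```python
import hashlib
import time
from collections import deque

class DedupWindow:
    """Drop logs whose (timestamp, host, message) hash was already seen
    within the last window_sec seconds. `now` is injectable for testing."""

    def __init__(self, window_sec=5.0, now=time.monotonic):
        self.window_sec, self.now = window_sec, now
        self.seen, self.order = set(), deque()    # hash set + expiry queue

    def is_duplicate(self, log):
        key = hashlib.sha1(
            f"{log['timestamp']}|{log['host']}|{log['message']}".encode()
        ).digest()
        t = self.now()
        # Expire hashes older than the window.
        while self.order and t - self.order[0][0] > self.window_sec:
            _, old = self.order.popleft()
            self.seen.discard(old)
        if key in self.seen:
            return True
        self.seen.add(key)
        self.order.append((t, key))
        return False
```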
Deep Dive 3: Multi-Tenancy, Access Control, and Noisy Neighbor Isolation
In a shared logging platform serving hundreds of teams, multi-tenancy is critical for both security (Team A must not see Team B’s logs) and reliability (Team A’s log flood must not degrade Team B’s search experience).
Tenant Isolation Model:
┌───────────────────────────────────────────────────┐
│ Ingestion Layer │
│ Per-tenant rate limits (token bucket in Redis) │
│ payment-team: 200K lines/sec │
│ search-team: 50K lines/sec │
│ ml-platform: 500K lines/sec │
│ │
│ Overage handling: │
│ - Soft limit: log warning, continue ingesting │
│ - Hard limit (2x quota): drop DEBUG/TRACE only │
│ - Emergency (5x): drop all except ERROR/FATAL │
└───────────────────────────────────────────────────┘
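The graduated overage handling in the box above maps cleanly to a function returning which log levels are still accepted at a given ingest rate:

```python
def accepted_levels(rate, quota):
    """Overage policy: within 2x quota, ingest everything (soft limit only
    logs a warning); between 2x and 5x, shed DEBUG/TRACE; beyond 5x, keep
    only ERROR/FATAL. Rates are lines/sec."""
    all_levels = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
    if rate <= 2 * quota:
        return all_levels                        # soft limit: warn, continue
    if rate <= 5 * quota:
        return all_levels - {"TRACE", "DEBUG"}   # hard limit: shed noise first
    return {"ERROR", "FATAL"}                    # emergency shedding
```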
│
▼
┌───────────────────────────────────────────────────┐
│ Storage Layer │
│ │
│ Large tenants: dedicated indices │
│ logs-payment-2025.07.15-{shard} │
│ │
│ Small tenants: shared indices with routing │
│ logs-shared-small-2025.07.15 (routing=tenant) │
│ │
│ Physical isolation option (enterprise): │
│ Dedicated ES node pools per tenant │
└───────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ Query Layer │
│ │
│ Every query automatically injected with: │
│ filter: { tenant_id: <caller's tenant> } │
│ │
│ Query resource limits per tenant: │
│ - Max concurrent queries: 20 │
│ - Max query time range: 30 days (configurable) │
│ - Max result set: 10,000 hits │
│ - Query timeout: 30 seconds │
│ │
│ Query priority queues: │
│ - P0 (on-call incident): bypass limits │
│ - P1 (normal interactive): standard limits │
│ - P2 (batch/export): throttled, off-peak only │
└───────────────────────────────────────────────────┘
Access Control:
- Authentication: mTLS for agent-to-gateway. API keys or OAuth2 tokens for human users.
- Authorization: RBAC at tenant level. Roles: `viewer` (search own team’s logs), `admin` (manage retention policies, alert rules), `superadmin` (cross-tenant search for SRE team).
- Field-level redaction: PII fields (user email, IP address) can be marked as `restricted`. Only users with the `pii-reader` role see the actual values; others see `[REDACTED]`. Implemented as a post-processing filter in the query layer, not in storage (preserving the ability to un-redact for authorized users).
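The query-layer redaction filter can be sketched as below; the hit shape (restricted values living under `metadata`) is an assumption for illustration, while the `pii-reader` role and `[REDACTED]` placeholder come from the text:

```python
def redact_hits(hits, restricted_fields, roles):
    """Post-process search hits: mask restricted metadata fields unless the
    caller holds the pii-reader role. Storage is never modified, so the
    same query can be re-run un-redacted by an authorized user."""
    if "pii-reader" in roles:
        return hits
    out = []
    for hit in hits:
        clean = dict(hit)                      # shallow copy; don't mutate input
        meta = dict(clean.get("metadata", {}))
        for field in restricted_fields:
            if field in meta:
                meta[field] = "[REDACTED]"
        clean["metadata"] = meta
        out.append(clean)
    return out
```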
Noisy Neighbor Prevention:
- The core mechanism is the per-tenant ingestion rate limit described above.
- On the query side, each tenant’s queries run in an isolated thread pool with bounded concurrency. A tenant running expensive queries cannot starve other tenants’ search capacity.
- Elasticsearch uses search request priority: queries from tenants in an active incident (P0) get 3x the thread pool allocation. This is triggered by an integration with the incident management system.
- Chargeback: each tenant’s ingestion volume and storage usage is metered. Monthly reports show cost attribution, incentivizing teams to reduce unnecessary logging (especially DEBUG in production).
7. Extensions (2 min)
- Distributed Tracing Correlation — Enrich log documents with trace_id and span_id. Build a trace-to-logs bridge: clicking a trace span in Jaeger/Tempo opens the corresponding log lines. Requires a secondary index on trace_id with fast point lookups, achievable via a ClickHouse `ReplicatedMergeTree` table with `ORDER BY (trace_id, timestamp)` dedicated to trace correlation.
- Log-Based Metrics (Logs-to-Metrics Pipeline) — Extract numeric metrics from log lines in the stream processor (e.g., “request completed in 342ms” → a `request_duration_ms` histogram). Publish to Prometheus/Mimir. This reduces the need to search raw logs for performance analysis and is roughly 100x cheaper than querying Elasticsearch for aggregation dashboards.
- ML-Powered Anomaly Detection — Train a model on historical log patterns per service (log volume by level per 5-minute window). Detect anomalies: a sudden drop in INFO logs (service crashed?), a spike in WARN (degradation?), a new ERROR message never seen before (new bug?). Use the ClickHouse aggregation engine to compute features, feed them into an online ML model (e.g., Isolation Forest), and publish anomalies to the alert pipeline.
- Cross-Region Replication & Active-Active — For global deployments, run independent logging clusters per region. A cross-region replicator asynchronously copies logs from each region to a central “global search” cluster (latency tolerance: 5 minutes). Users can query local (fast) or global (slower but complete). Use Kafka MirrorMaker 2 for cross-region topic replication.
- Compliance & Immutability (WORM Storage) — For regulated industries (finance, healthcare), write logs to WORM (Write Once Read Many) storage. S3 Object Lock in Compliance mode prevents deletion even by admins during the retention period. Maintain a cryptographic hash chain (Merkle tree) over daily log batches so tampering is detectable. Export compliance reports showing chain-of-custody for audit logs.