1. Requirements & Scope (5 min)
Functional Requirements
- Producers publish messages to named topics with optional partitioning keys
- Consumers subscribe to topics and read messages in order within each partition
- Support consumer groups — within a group, each message is delivered to exactly one consumer (load balancing); every subscribing group independently receives the full stream (broadcast)
- Messages are durably persisted for a configurable retention period (default 7 days)
- Support message replay — consumers can seek to any offset and re-consume
Non-Functional Requirements
- Availability: 99.99% — the messaging backbone cannot go down without cascading failures across the entire platform
- Latency: < 10ms end-to-end for p99 publish-to-consume (single datacenter). Throughput is prioritized over single-message latency.
- Consistency: Messages within a partition are strictly ordered. Across partitions, no ordering guarantee. At-least-once delivery by default; exactly-once semantics available with idempotent producers + transactional consumers.
- Scale: 1M+ messages/sec ingestion, 10TB+ daily throughput, 1000+ topics, 10,000+ partitions across the cluster
- Durability: Zero message loss for acknowledged writes. Replication factor of 3 minimum for production topics.
2. Estimation (3 min)
Traffic
- Write throughput: 1M messages/sec
- Average message size: 1KB
- Write bandwidth: 1M x 1KB = 1 GB/sec ingress
- Replication factor 3: 1 GB/sec x 3 = 3 GB/sec internal replication traffic
- Read throughput: Assume 5 consumer groups on average → 5 GB/sec egress
- Total network: ~9 GB/sec across the cluster
Storage
- Daily ingestion: 1 GB/sec x 86,400 = 86.4 TB/day (before replication)
- With replication factor 3: 259 TB/day
- 7-day retention: 259 TB/day x 7 ≈ 1.8 PB total (this already includes replication)
- Each broker: 12 x 4TB SSDs = 48TB per broker → need ~38 brokers for storage alone
- In practice: 50-60 brokers (headroom for CPU, network, rebalancing)
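The arithmetic above can be double-checked with a short script (decimal units, 1 KB = 1000 B; the constants are this section's stated assumptions):

```python
import math

# Back-of-envelope capacity model for the estimates above.
MSGS_PER_SEC = 1_000_000
MSG_BYTES = 1000
REPLICATION = 3
RETENTION_DAYS = 7
BROKER_TB = 48  # 12 x 4 TB SSDs per broker

ingress_gb_s = MSGS_PER_SEC * MSG_BYTES / 1e9            # 1.0 GB/s ingress
daily_tb = ingress_gb_s * 86_400 / 1e3                   # 86.4 TB/day raw
replicated_tb_day = daily_tb * REPLICATION               # ~259 TB/day
retained_pb = replicated_tb_day * RETENTION_DAYS / 1e3   # ~1.8 PB
brokers_for_storage = math.ceil(replicated_tb_day * RETENTION_DAYS / BROKER_TB)
```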
Partitions
- 1000 topics x average 10 partitions = 10,000 partitions
- Each partition is an append-only log, stored on disk as a sequence of segment files
- Partition leader handles all reads/writes → must balance leaders evenly across brokers
3. API Design (3 min)
Producer API
// Publish a message
POST /topics/{topic}/messages
Body: {
"key": "user_123", // optional partition key
"value": "<base64-encoded payload>",
"headers": {"trace_id": "abc"}, // optional metadata
"partition": null // null = use key hash; int = explicit partition
}
Response 200: {
"topic": "user-events",
"partition": 7,
"offset": 48291034,
"timestamp": 1708632060000
}
// Batch publish (preferred for throughput)
POST /topics/{topic}/messages/batch
Body: { "messages": [ ... ] }
Response 200: { "offsets": [ ... ] }
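A client might assemble the publish body like this (a sketch against the endpoint above; `build_publish_body` is a hypothetical helper, not part of any real client library):

```python
import base64
import json

def build_publish_body(key, payload: bytes, headers=None, partition=None) -> str:
    """Build the JSON body for POST /topics/{topic}/messages."""
    return json.dumps({
        "key": key,                                         # optional partition key
        "value": base64.b64encode(payload).decode("ascii"), # API requires base64 payload
        "headers": headers or {},
        "partition": partition,                             # None => route by key hash
    })

body = build_publish_body("user_123", b'{"event":"login"}', {"trace_id": "abc"})
```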
Consumer API
// Subscribe to topic(s) with a consumer group
POST /consumers/{group_id}/subscribe
Body: { "topics": ["user-events", "page-views"] }
// Poll for messages
GET /consumers/{group_id}/poll?max_records=500&timeout_ms=1000
Response 200: {
"messages": [
{"topic": "user-events", "partition": 7, "offset": 48291034, "key": "user_123", "value": "...", "timestamp": ...},
...
]
}
// Commit offsets (acknowledge processing)
POST /consumers/{group_id}/offsets
Body: { "offsets": {"user-events": {"7": 48291035}} }
// Seek to specific offset (replay)
POST /consumers/{group_id}/seek
Body: { "topic": "user-events", "partition": 7, "offset": 48000000 }
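The poll/commit contract above can be sketched as an at-least-once loop (`poll`, `process`, and `commit` are hypothetical stand-ins for the HTTP calls):

```python
# Process a batch, then commit offset + 1 per (topic, partition).
# Committing only after processing means a crash causes redelivery,
# never loss: at-least-once semantics.
def consume_batch(poll, process, commit):
    records = poll(max_records=500)
    for rec in records:
        process(rec)
    if records:
        offsets = {}
        for rec in records:
            offsets.setdefault(rec["topic"], {})[rec["partition"]] = rec["offset"] + 1
        commit(offsets)
```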
Key Decisions
- Batch publishing is the primary path — amortizes network round trips
- Consumer pull model (not push) — consumers control their own pace, natural backpressure
- Offsets are committed by consumers, not auto-acked — enables at-least-once and exactly-once semantics
4. Data Model (3 min)
Message (on-disk log format)
Record:
offset | int64 -- monotonically increasing per partition
timestamp | int64 -- producer-set or broker-set
key | bytes -- nullable, used for partitioning
value | bytes -- the payload
headers | map<str,str> -- metadata
crc32 | int32 -- integrity check
attributes | int8 -- compression codec, timestamp type
batch_offset | int32 -- offset within producer batch (for idempotency)
producer_id | int64 -- for exactly-once (idempotent producer)
producer_epoch | int16 -- for exactly-once (fencing)
sequence_num | int32 -- for exactly-once (dedup)
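A simplified packing of a few of these fields (illustrative only — the real Kafka record format uses varints and batch-level framing, and carries the headers, attributes, and producer fields as well):

```python
import struct
import zlib

def encode_record(offset: int, timestamp: int, key: bytes, value: bytes) -> bytes:
    """Pack offset/timestamp plus length-prefixed key and value, big-endian,
    with a leading crc32 over the record body as the integrity check."""
    body = struct.pack(">qq", offset, timestamp)   # int64 offset, int64 timestamp
    body += struct.pack(">i", len(key)) + key
    body += struct.pack(">i", len(value)) + value
    crc = zlib.crc32(body)                         # int32 integrity check
    return struct.pack(">I", crc) + body
```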
Topic Metadata (ZooKeeper/KRaft)
Topic:
topic_name | string (PK)
num_partitions | int
replication_factor | int
config_overrides | map -- retention.ms, segment.bytes, etc.
Partition Assignment:
topic_name | string
partition_id | int
leader_broker | int
isr | list<int> -- in-sync replicas
replicas | list<int> -- all assigned replicas
Consumer Group Offsets
__consumer_offsets (internal compacted topic):
Key: {group_id, topic, partition}
Value: {committed_offset, metadata, timestamp}
Why This Storage Model
- Append-only log on local disk — sequential writes are the fastest possible I/O pattern. HDDs do 200MB/s sequential, SSDs do 2GB/s+. No random seeks.
- No traditional database — the log IS the database. No B-tree maintenance; only a small sparse offset index per segment.
- Consumer offsets stored as a compacted topic — self-hosted, replicated, no external DB dependency.
- KRaft (replacing ZooKeeper) — metadata managed via Raft consensus among controller brokers, eliminating the ZK dependency.
5. High-Level Design (12 min)
Architecture
Producers (thousands)
│
├─ Producer 1 ──┐
├─ Producer 2 ──┤ (messages partitioned by key hash)
└─ Producer N ──┤
▼
┌──────────────────────────────────────┐
│ Kafka Cluster │
│ │
│ Broker 1 Broker 2 │
│ ┌────────────┐ ┌────────────┐ │
│ │ Topic-A P0 │ │ Topic-A P1 │ │
│ │ (Leader) │ │ (Leader) │ │
│ │ Topic-A P1 │ │ Topic-A P0 │ │
│ │ (Replica) │ │ (Replica) │ │
│ └────────────┘ └────────────┘ │
│ │
│ Broker 3 (Controller) │
│ ┌────────────┐ │
│ │ Topic-A P0 │ KRaft Metadata │
│ │ (Replica) │ (Raft consensus) │
│ │ Topic-A P1 │ │
│ │ (Replica) │ │
│ └────────────┘ │
└──────────────────────────────────────┘
│
▼
Consumer Groups
┌─────────────────────┐
│ Group "analytics" │
│ Consumer A → P0 │
│ Consumer B → P1 │
└─────────────────────┘
┌─────────────────────┐
│ Group "search-index"│
│ Consumer C → P0,P1 │
└─────────────────────┘
Write Path (Producer → Broker)
Producer
→ Serialize message (Avro/Protobuf/JSON)
→ Compute partition: hash(key) % num_partitions (or round-robin if no key)
→ Batch messages destined for same partition (linger.ms = 5ms, batch.size = 16KB)
→ Compress batch (LZ4/Snappy/Zstd)
→ Send to partition leader broker
→ Leader:
→ Append to local log segment (sequential write)
→ Wait for ISR replicas to fetch and acknowledge
→ When acks=all: respond to producer only after all ISR replicas confirm
→ When acks=1: respond after local write (faster, risk of data loss on leader crash)
→ Producer receives offset confirmation
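The partition-routing step above can be sketched as follows (zlib.crc32 stands in for Kafka's murmur2 hash; the round-robin counter is the keyless fallback):

```python
import itertools
import zlib

_round_robin = itertools.count()

def choose_partition(key, num_partitions):
    """Hash the key when present so equal keys land on the same partition;
    round-robin across partitions when there is no key."""
    if key is not None:
        return zlib.crc32(key.encode("utf-8")) % num_partitions
    return next(_round_robin) % num_partitions
```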
Read Path (Broker → Consumer)
Consumer
→ Poll leader broker for partition
→ Broker:
→ Read from page cache (hot data) or disk (cold data)
→ Use sendfile() / zero-copy: data goes directly from page cache → network socket
(no kernel→user→kernel copy — saves 2 memory copies and 2 context switches)
→ Respond with batch of messages
→ Consumer deserializes and processes
→ Consumer commits offset (async or sync)
Components
- Brokers (50-60): Store partition log segments, serve produce/fetch requests, replicate data. Each broker is a partition leader for some partitions and a follower for others.
- Controller (KRaft): 3-5 controller nodes running Raft consensus. Manages cluster metadata: topic creation, partition assignment, leader election, ISR management.
- Producers: Client libraries that batch, compress, and route messages to the correct partition leader.
- Consumers: Client libraries that poll partition leaders, track offsets, and handle rebalancing within consumer groups.
- Schema Registry: External service storing Avro/Protobuf schemas. Producers register schemas, consumers validate. Prevents breaking changes.
- Monitoring (JMX/Prometheus): Under-replicated partitions, consumer lag, broker throughput, request latency percentiles.
6. Deep Dives (15 min)
Deep Dive 1: Replication and ISR (In-Sync Replicas)
The problem: How do we ensure durability without sacrificing throughput? A majority quorum needs 2f+1 replicas just to tolerate f failures, so a large share of the replication bandwidth buys no extra fault tolerance.
Kafka’s ISR approach:
Partition P0: Leader = Broker 1, Replicas = [Broker 1, Broker 2, Broker 3]
ISR = [Broker 1, Broker 2, Broker 3] (initially all in sync)
Write flow:
1. Producer sends message to Leader (Broker 1)
2. Leader appends to local log, records LEO (Log End Offset) = 100
3. Followers (Broker 2, 3) fetch from leader (pull-based replication)
4. Each follower appends to local log, sends fetch response
5. Leader tracks each replica's LEO:
- Broker 1 LEO: 100
- Broker 2 LEO: 100
- Broker 3 LEO: 98 (lagging)
6. High-Water Mark (HWM) = min(LEO of all ISR members) = 98
- Only messages up to HWM are visible to consumers
- This prevents consumers from reading data that might be lost on leader failure
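The HWM rule is small enough to state in code (a sketch using the LEOs from the flow above — note how shrinking the ISR lets the HWM advance):

```python
# Consumers may read only offsets below min(LEO) across the current ISR.
def high_water_mark(leo_by_broker, isr):
    return min(leo_by_broker[b] for b in isr)

leos = {1: 100, 2: 100, 3: 98}   # Broker 3 is lagging
```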
ISR shrinking:
If Broker 3 falls behind by > replica.lag.time.max.ms (default 30s):
ISR shrinks: [Broker 1, Broker 2, Broker 3] → [Broker 1, Broker 2]
- Writes now only need ack from Broker 1 and Broker 2
- min.insync.replicas = 2 → writes still succeed
- If ISR drops below min.insync.replicas → producer gets NotEnoughReplicasException
- This is the safety valve: we refuse writes rather than risk data loss
Leader failure and election:
Broker 1 (leader) crashes:
1. Controller detects failure (session timeout)
2. Controller picks new leader from ISR: Broker 2
3. Broker 2 becomes leader, truncates any messages beyond the last committed HWM
4. Consumers and producers discover new leader via metadata refresh
5. Total failover time: ~5-10 seconds
Unclean leader election (all ISR dead):
- unclean.leader.election.enable = false (default): partition goes offline, no data loss
- unclean.leader.election.enable = true: out-of-sync replica becomes leader, some messages lost
- Trade-off: availability vs durability
Why ISR > Quorum:
- Majority quorum: RF=3 commits with 2 acks and tolerates only 1 failure; tolerating 2 failures requires RF=5.
- ISR: RF=3 with all replicas in sync commits with 3 acks and tolerates 2 failures. If one replica falls behind, the ISR shrinks and writes continue with 2 acks. Same replica count, stronger and more adaptive durability.
Deep Dive 2: Exactly-Once Semantics (EOS)
The problem: At-least-once delivery means duplicates when producers retry on timeout. At-most-once means data loss when producers don’t retry. We need exactly-once for financial transactions, deduplication pipelines, and stream processing.
Three layers to EOS:
Layer 1: Idempotent Producer (per-partition dedup)
Producer config: enable.idempotence = true
How it works:
1. Broker assigns each producer a unique ProducerID (PID)
2. Producer tracks a sequence number per partition, incrementing with each message
3. Broker maintains: Map<(PID, Partition) → last_sequence_number>
4. On receive:
- If seq == expected: accept, increment expected
- If seq < expected: duplicate, silently ack (idempotent)
- If seq > expected: out-of-order, reject (producer bug or PID reuse)
This eliminates duplicates from network retries within a single producer session.
Limitation: PID changes on producer restart → cannot dedup across restarts.
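The broker-side sequence check can be sketched as below (a simplification: real brokers track a window of recent batch sequence numbers per producer, not just one counter):

```python
class SequenceTracker:
    """Per-(PID, partition) dedup for the idempotent producer."""
    def __init__(self):
        self.expected = {}   # (pid, partition) -> next expected sequence

    def on_receive(self, pid, partition, seq):
        exp = self.expected.get((pid, partition), 0)
        if seq == exp:
            self.expected[(pid, partition)] = exp + 1
            return "accept"
        if seq < exp:
            return "duplicate"      # retry of an already-appended message: ack silently
        return "out_of_order"       # gap: reject, producer must handle the error
```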
Layer 2: Transactional Producer (cross-partition atomicity)
Producer config: transactional.id = "payment-processor-1"
How it works:
1. Producer registers transactional.id with Transaction Coordinator (a broker)
2. Coordinator assigns PID + epoch (epoch increments on restart → fences zombies)
Transaction flow:
producer.beginTransaction()
producer.send(topic="orders", key="order_1", value="...")
producer.send(topic="inventory", key="item_42", value="...")
producer.sendOffsetsToTransaction(consumer_offsets) // for consume-transform-produce
producer.commitTransaction()
Under the hood:
a. Producer → Coordinator: begin transaction
b. Producer → Partition leaders: send messages (marked as "transactional, uncommitted")
c. Producer → Coordinator: commit transaction
d. Coordinator → writes COMMIT marker to transaction log
e. Coordinator → Partition leaders: write COMMIT marker to each partition
f. Consumers with isolation.level=read_committed skip uncommitted messages
Layer 3: Consumer isolation
Consumer config: isolation.level = read_committed
Behavior:
- Consumer only sees messages up to the Last Stable Offset (LSO)
- LSO = min(HWM, earliest_open_transaction_offset)
- Uncommitted transactional messages are buffered but not delivered
- On COMMIT: messages become visible
- On ABORT: messages are skipped
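The LSO rule above, as a sketch (offsets of still-open transactions cap what a read_committed consumer may see):

```python
# Last Stable Offset: min of the HWM and the first offset of every
# transaction that has not yet committed or aborted.
def last_stable_offset(hwm, open_txn_first_offsets):
    return min([hwm] + list(open_txn_first_offsets))
```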
Deep Dive 3: Performance — Zero-Copy and Sequential I/O
Why Kafka is fast (not despite being on disk, but because of it):
1. Sequential I/O
Random disk write: ~100 IOPS (HDD), ~10,000 IOPS (SSD)
Sequential disk write: ~200 MB/s (HDD), ~2 GB/s (SSD)
Kafka only does sequential appends:
- Producer write → append to end of log segment file
- No random seeks, no in-place updates, no B-tree rebalancing
- OS page cache naturally handles hot data in RAM
2. Zero-Copy (sendfile syscall)
Traditional read + network send:
1. Disk → Kernel page cache (DMA)
2. Kernel page cache → User space buffer (CPU copy)
3. User space buffer → Kernel socket buffer (CPU copy)
4. Kernel socket buffer → NIC (DMA)
= 4 copies, 4 context switches (read() and send() each cross the user/kernel boundary twice)
Zero-copy with sendfile():
1. Disk → Kernel page cache (DMA)
2. Kernel page cache → NIC (DMA, with scatter-gather)
= 2 copies, 2 context switches (a single sendfile() call); data never enters user space
For Kafka consuming: data goes directly from page cache to network.
This is why Kafka consumer throughput can saturate the NIC (10 Gbps+).
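The fetch fast path can be demonstrated with Python's os.sendfile (Linux-only; an illustration, not how Kafka brokers are implemented — they use Java's FileChannel.transferTo, which calls sendfile underneath):

```python
import os

# Serve `count` bytes of a log segment starting at `offset` straight from
# the page cache to the socket; the bytes never enter user space.
def serve_segment(path, sock, offset, count):
    with open(path, "rb") as segment:
        sent = 0
        while sent < count:
            n = os.sendfile(sock.fileno(), segment.fileno(),
                            offset + sent, count - sent)
            if n == 0:
                break          # reader reached end of file
            sent += n
    return sent
```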
3. Batching and Compression
Producer batching:
- linger.ms = 5: wait up to 5ms to accumulate messages into a batch
- batch.size = 16KB: send when batch reaches 16KB
- One network round trip per batch, not per message
- Compression (LZ4/Zstd) applied per batch → better compression ratio than per-message
Result: 1M messages/sec with ~2000 network requests/sec (500 messages per batch)
4. Log Compaction (for changelog topics)
Normal retention: delete segments older than retention.ms
Log compaction: keep only the latest value per key
Use case: KTable / changelog / state store
Key: user_123, Value: {name: "Alice", city: "NYC"} offset 10
Key: user_123, Value: {name: "Alice", city: "London"} offset 50
Key: user_456, Value: {name: "Bob", city: "SF"} offset 30
After compaction:
Key: user_123, Value: {name: "Alice", city: "London"} offset 50
Key: user_456, Value: {name: "Bob", city: "SF"} offset 30
Tombstone: Key: user_123, Value: null → deletes the key after delete.retention.ms
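Compaction's keep-latest-per-key rule, as a sketch (simplified: real compaction retains tombstones for delete.retention.ms before dropping them, and rewrites segments rather than building a dict):

```python
# Replay the log keeping only the newest value per key; a None value is
# a tombstone that removes the key.
def compact(log):
    latest = {}
    for offset, key, value in log:
        if value is None:
            latest.pop(key, None)
        else:
            latest[key] = (offset, value)
    return latest

log = [
    (10, "user_123", {"city": "NYC"}),
    (30, "user_456", {"city": "SF"}),
    (50, "user_123", {"city": "London"}),
]
```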
7. Extensions (2 min)
- Kafka Streams / ksqlDB: Stream processing library that enables joins, aggregations, windowed computations, and stateful transformations directly on Kafka topics without a separate processing cluster (Spark/Flink).
- Tiered Storage: Move old log segments from local broker disks to object storage (S3/GCS). Enables infinite retention without scaling broker storage. Hot segments stay on local SSDs, cold segments served from S3.
- Multi-datacenter replication (MirrorMaker 2): Async replication across regions using source-topic → remote-topic mirroring. Handles offset translation so consumers can failover across DCs.
- Rack-aware replica assignment: Ensure replicas for the same partition are placed on different racks/AZs. Survives rack-level failures without data loss.
- Quotas and multi-tenancy: Per-client-id produce/consume byte-rate throttling. Prevents one noisy tenant from starving others on a shared cluster. Network and request quotas enforced at the broker level.