<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>System Design Solutions on Chirag Hasija</title>
    <link>https://chiraghasija.cc/designs/</link>
    <description>Recent content in System Design Solutions on Chirag Hasija</description>
    <image>
      <title>Chirag Hasija</title>
      <url>https://chiraghasija.cc/og-image.png</url>
      <link>https://chiraghasija.cc/og-image.png</link>
    </image>
    <generator>Hugo -- 0.155.3</generator>
    <language>en-us</language>
    <atom:link href="https://chiraghasija.cc/designs/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Design a Botnet Detection System</title>
      <link>https://chiraghasija.cc/designs/botnet-detection/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/botnet-detection/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Detect and classify botnet traffic in real-time — distinguish between legitimate users, known bots (scrapers, crawlers), and malicious botnets (DDoS, credential stuffing, spam)&lt;/li&gt;
&lt;li&gt;Maintain an IP reputation database that scores IPs from 0 (definitively malicious) to 100 (definitively legitimate), updated continuously based on observed behavior&lt;/li&gt;
&lt;li&gt;Support multiple detection methods: rate-based, behavioral fingerprinting, ML anomaly detection, honeypot-based, and collaborative intelligence (threat feeds)&lt;/li&gt;
&lt;li&gt;Automatically block or challenge (CAPTCHA) detected bots with configurable actions per threat level&lt;/li&gt;
&lt;li&gt;Provide a real-time dashboard showing attack patterns, blocked traffic, top threat sources, and detection accuracy metrics&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the detection system sits in the critical path. If it goes down, either all traffic is blocked (fail-closed) or all bots get through (fail-open). Neither is acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5ms decision time per request. Detection is in the hot path of every incoming request. Cannot add perceptible latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for IP reputation scores (a few seconds of propagation delay is fine). Real-time blocking decisions must use locally cached scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1M requests/sec. 10M unique IPs/day. 500K concurrent connections. Detection must scale linearly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;False Positive Rate:&lt;/strong&gt; &amp;lt; 0.1% — blocking legitimate users is worse than letting some bots through. False negatives (missed bots) are bad but recoverable. False positives lose customers.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1M requests/sec → each request needs a bot/not-bot decision&lt;/li&gt;
&lt;li&gt;1M feature extraction + scoring operations per second&lt;/li&gt;
&lt;li&gt;Assuming 5% of traffic is botnet → 50K malicious requests/sec to detect and block&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;ip-reputation-database&#34;&gt;IP Reputation Database&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M unique IPs observed per day&lt;/li&gt;
&lt;li&gt;Each IP record: ~500 bytes (IP, score, features, history, last seen, classification)&lt;/li&gt;
&lt;li&gt;Active IPs (last 24h): 10M × 500 bytes = &lt;strong&gt;5 GB&lt;/strong&gt; — fits in memory (Redis)&lt;/li&gt;
&lt;li&gt;Historical IPs (90-day window): 100M × 500 bytes = &lt;strong&gt;50 GB&lt;/strong&gt; — PostgreSQL with hot/cold tiering&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;feature-extraction&#34;&gt;Feature Extraction&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Per-request features: ~20 signals extracted in &amp;lt; 1ms
&lt;ul&gt;
&lt;li&gt;IP, User-Agent, headers, URL pattern, request timing, TLS fingerprint, geo&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Per-IP aggregated features (sliding window): ~50 signals updated in &amp;lt; 2ms
&lt;ul&gt;
&lt;li&gt;Request rate, endpoint diversity, timing regularity, error rate, session behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Feature vector size: ~200 bytes per request&lt;/li&gt;
&lt;/ul&gt;
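&lt;p&gt;The timing-regularity aggregate above is worth making concrete: bots tend to fire at metronome-like intervals, so the Shannon entropy of inter-request gaps is a cheap, discriminative feature. A minimal sketch, with an illustrative bucket size:&lt;/p&gt;

```python
import math
from collections import Counter

def timing_entropy(timestamps, bucket_ms=100):
    """Shannon entropy of inter-request gaps, bucketed to bucket_ms.

    Near-zero entropy means metronome-like traffic (a bot signal);
    human browsing produces irregular gaps and higher entropy.
    Bucket size and any downstream threshold are illustrative.
    """
    if len(timestamps) > 2:
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        buckets = Counter(round(g * 1000 / bucket_ms) for g in gaps)
        n = len(gaps)
        return -sum((c / n) * math.log2(c / n) for c in buckets.values())
    return 0.0

bot = timing_entropy([i * 0.5 for i in range(20)])  # exact 500ms cadence
human = timing_entropy([0, 0.4, 1.7, 2.1, 5.8, 6.0, 9.3, 12.2])
print(bot, human)  # bot entropy is 0.0; the human trace scores higher
```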
&lt;h3 id=&#34;ml-model&#34;&gt;ML Model&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Model: gradient-boosted decision tree (XGBoost/LightGBM)&lt;/li&gt;
&lt;li&gt;Inference time: &amp;lt; 1ms per request (100 trees, max depth 8)&lt;/li&gt;
&lt;li&gt;Model size: ~10 MB (loaded in memory on every detection node)&lt;/li&gt;
&lt;li&gt;Retraining: daily, on previous day&amp;rsquo;s labeled data&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;request-evaluation-api-internal-called-by-api-gatewaywaf&#34;&gt;Request Evaluation API (internal, called by API Gateway/WAF)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Evaluate a request for bot probability
POST /v1/evaluate
  Body: {
    &amp;#34;ip&amp;#34;: &amp;#34;203.0.113.42&amp;#34;,
    &amp;#34;user_agent&amp;#34;: &amp;#34;Mozilla/5.0 ...&amp;#34;,
    &amp;#34;url&amp;#34;: &amp;#34;/api/v1/login&amp;#34;,
    &amp;#34;method&amp;#34;: &amp;#34;POST&amp;#34;,
    &amp;#34;headers&amp;#34;: { ... },
    &amp;#34;tls_fingerprint&amp;#34;: &amp;#34;ja3_hash_abc&amp;#34;,
    &amp;#34;geo&amp;#34;: { &amp;#34;country&amp;#34;: &amp;#34;RU&amp;#34;, &amp;#34;asn&amp;#34;: 12345 },
    &amp;#34;session_id&amp;#34;: &amp;#34;sess_xyz&amp;#34;
  }
  Response 200: {
    &amp;#34;decision&amp;#34;: &amp;#34;block&amp;#34;,               // allow | challenge | throttle | block
    &amp;#34;confidence&amp;#34;: 0.94,
    &amp;#34;threat_type&amp;#34;: &amp;#34;credential_stuffing&amp;#34;,
    &amp;#34;ip_reputation&amp;#34;: 12,
    &amp;#34;signals&amp;#34;: [&amp;#34;high_request_rate&amp;#34;, &amp;#34;ja3_known_bot&amp;#34;, &amp;#34;geo_mismatch&amp;#34;],
    &amp;#34;latency_ms&amp;#34;: 2.3
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ip-reputation-api&#34;&gt;IP Reputation API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Query IP reputation
GET /v1/reputation/{ip}
  Response 200: {
    &amp;#34;ip&amp;#34;: &amp;#34;203.0.113.42&amp;#34;,
    &amp;#34;score&amp;#34;: 12,                        // 0-100, lower = more suspicious
    &amp;#34;classification&amp;#34;: &amp;#34;likely_bot&amp;#34;,
    &amp;#34;first_seen&amp;#34;: &amp;#34;2026-02-01T00:00:00Z&amp;#34;,
    &amp;#34;last_seen&amp;#34;: &amp;#34;2026-02-22T10:00:00Z&amp;#34;,
    &amp;#34;total_requests_24h&amp;#34;: 45000,
    &amp;#34;blocked_requests_24h&amp;#34;: 42000,
    &amp;#34;threat_types&amp;#34;: [&amp;#34;credential_stuffing&amp;#34;, &amp;#34;scraping&amp;#34;],
    &amp;#34;associated_asn&amp;#34;: { &amp;#34;number&amp;#34;: 12345, &amp;#34;name&amp;#34;: &amp;#34;Shady Hosting Inc.&amp;#34; }
  }

// Manually override IP reputation (analyst action)
PUT /v1/reputation/{ip}
  Body: { &amp;#34;score&amp;#34;: 100, &amp;#34;reason&amp;#34;: &amp;#34;False positive - legitimate partner API&amp;#34;, &amp;#34;ttl&amp;#34;: 86400 }

// Report malicious IP (collaborative intelligence)
POST /v1/reputation/report
  Body: { &amp;#34;ip&amp;#34;: &amp;#34;203.0.113.42&amp;#34;, &amp;#34;threat_type&amp;#34;: &amp;#34;ddos&amp;#34;, &amp;#34;evidence&amp;#34;: &amp;#34;...&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;configuration-api&#34;&gt;Configuration API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Set detection policy
PUT /v1/policies/{policy_id}
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;login_endpoint_protection&amp;#34;,
    &amp;#34;endpoint_pattern&amp;#34;: &amp;#34;/api/*/login&amp;#34;,
    &amp;#34;rules&amp;#34;: [
      { &amp;#34;type&amp;#34;: &amp;#34;rate_limit&amp;#34;, &amp;#34;threshold&amp;#34;: 10, &amp;#34;window&amp;#34;: &amp;#34;1m&amp;#34;, &amp;#34;action&amp;#34;: &amp;#34;challenge&amp;#34; },
      { &amp;#34;type&amp;#34;: &amp;#34;ml_score&amp;#34;, &amp;#34;threshold&amp;#34;: 0.8, &amp;#34;action&amp;#34;: &amp;#34;block&amp;#34; },
      { &amp;#34;type&amp;#34;: &amp;#34;geo_block&amp;#34;, &amp;#34;countries&amp;#34;: [&amp;#34;XX&amp;#34;], &amp;#34;action&amp;#34;: &amp;#34;block&amp;#34; }
    ]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ip-reputation-redis--hot-data--postgresql--full-history&#34;&gt;IP Reputation (Redis — hot data + PostgreSQL — full history)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Redis (real-time lookups)
Key: ip:{ip_address}
Type: Hash
Fields:
  score              | int (0-100)
  classification     | string
  request_count_1h   | int
  request_count_24h  | int
  block_count_24h    | int
  last_seen          | timestamp
  features           | bytes (serialized feature vector)
  threat_types       | string (comma-separated)
TTL: 24 hours (re-populate from PostgreSQL on cache miss)

// PostgreSQL (historical)
Table: ip_reputation
  ip_address       (PK) | inet
  score                  | smallint
  classification         | varchar(50)
  first_seen             | timestamp
  last_seen              | timestamp
  total_requests         | bigint
  total_blocks           | bigint
  threat_types           | text[]
  asn_number             | int
  country_code           | char(2)
  updated_at             | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;request-log-kafka--clickhouse-for-analytics&#34;&gt;Request Log (Kafka + ClickHouse for analytics)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Kafka topic: request_events (retained 7 days)
{
  &amp;#34;timestamp&amp;#34;: 1708632000,
  &amp;#34;ip&amp;#34;: &amp;#34;203.0.113.42&amp;#34;,
  &amp;#34;url&amp;#34;: &amp;#34;/api/v1/login&amp;#34;,
  &amp;#34;method&amp;#34;: &amp;#34;POST&amp;#34;,
  &amp;#34;user_agent&amp;#34;: &amp;#34;...&amp;#34;,
  &amp;#34;tls_fingerprint&amp;#34;: &amp;#34;ja3_abc&amp;#34;,
  &amp;#34;decision&amp;#34;: &amp;#34;block&amp;#34;,
  &amp;#34;ml_score&amp;#34;: 0.94,
  &amp;#34;signals&amp;#34;: [&amp;#34;high_rate&amp;#34;, &amp;#34;ja3_bot&amp;#34;],
  &amp;#34;latency_ms&amp;#34;: 2.3
}

// ClickHouse (aggregated analytics, 90-day retention)
Table: request_events
  timestamp         | DateTime
  ip                | IPv4
  url               | String
  decision          | Enum(&amp;#39;allow&amp;#39;,&amp;#39;challenge&amp;#39;,&amp;#39;throttle&amp;#39;,&amp;#39;block&amp;#39;)
  ml_score          | Float32
  threat_type       | String
  -- columnar storage, efficient for time-series aggregations
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;threat-intelligence-postgresql&#34;&gt;Threat Intelligence (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: threat_feeds
  source            | varchar(100)  -- e.g., &amp;#34;spamhaus&amp;#34;, &amp;#34;abuseipdb&amp;#34;
  ip_address        | inet
  threat_type       | varchar(50)
  confidence        | float
  reported_at       | timestamp
  expires_at        | timestamp

Table: ja3_fingerprints
  ja3_hash         (PK) | varchar(64)
  classification         | varchar(50)  -- &amp;#34;chrome_browser&amp;#34;, &amp;#34;python_requests&amp;#34;, &amp;#34;known_bot&amp;#34;
  confidence             | float
  source                 | varchar(100)
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;request-evaluation-flow&#34;&gt;Request Evaluation Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Incoming Request
  → Edge / API Gateway
    → Bot Detection Middleware (&amp;lt; 5ms total):
      → 1. Extract features (&amp;lt; 0.5ms):
           IP, User-Agent, URL, method, headers, TLS fingerprint (JA3)
      → 2. IP reputation lookup (&amp;lt; 0.5ms):
           Redis: HGET ip:{ip} score
           If score &amp;lt; 20 → fast-path BLOCK (skip ML)
           If score &amp;gt; 90 → fast-path ALLOW (skip ML)
      → 3. Rate check (&amp;lt; 0.5ms):
           Redis: INCR ip:{ip}:rate:{minute_bucket}
           If &amp;gt; threshold → flag as rate-exceeded
      → 4. ML scoring (&amp;lt; 1ms):
           Feed feature vector to in-process model
           Score = probability of being a bot (0.0 - 1.0)
      → 5. Decision engine (&amp;lt; 0.5ms):
           Combine: IP reputation + rate check + ML score + policy rules
           Output: allow / challenge / throttle / block
      → 6. Async: log request event to Kafka
    → If allowed: forward to backend
    → If challenged: return CAPTCHA page
    → If blocked: return 403
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;background-processing&#34;&gt;Background Processing&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Feature Aggregation Pipeline:
  Kafka (request_events) → Stream Processor (Flink/Kafka Streams):
    Per IP, compute sliding window aggregates:
      - Request rate (1min, 5min, 1hr windows)
      - Unique endpoints accessed
      - Error rate (4xx, 5xx responses)
      - Request timing entropy (regular intervals → bot)
      - Geographic consistency
    Write aggregated features to Redis: HSET ip:{ip} features {vector}
    Update IP reputation score based on new features

ML Model Training Pipeline (daily):
  ClickHouse (historical data) → Feature extraction → Label assignment:
    Labels from: manual reviews, honeypot catches, known-good traffic
  → Train XGBoost model → Validate on holdout set
  → If accuracy &amp;gt; threshold: deploy to all detection nodes
  → Canary: deploy to 5% of nodes first, monitor false positive rate
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Bot Detection Middleware:&lt;/strong&gt; Runs in-process on every API Gateway node. Performs feature extraction, reputation lookup, rate check, ML inference, and decision. Must be &amp;lt; 5ms total.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster:&lt;/strong&gt; IP reputation cache, rate counters, feature vectors. Co-located with API Gateway for &amp;lt; 1ms latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature Aggregation Service (Flink):&lt;/strong&gt; Consumes request events from Kafka. Computes per-IP behavioral features in sliding windows. Updates Redis with fresh feature vectors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reputation Scoring Service:&lt;/strong&gt; Periodically recalculates IP reputation scores based on aggregated features, threat feed data, and historical behavior. Writes to Redis + PostgreSQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ML Training Pipeline:&lt;/strong&gt; Daily batch training on labeled data. Model validation and canary deployment. Model versioning and A/B testing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Threat Intelligence Ingester:&lt;/strong&gt; Pulls from external feeds (Spamhaus, AbuseIPDB, etc.) hourly. Updates threat_feeds table and pre-populates IP reputation for known-bad IPs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Honeypot Service:&lt;/strong&gt; Invisible endpoints (hidden form fields, fake API paths) that no legitimate user would access. Any traffic to honeypots is definitively bot traffic → auto-label and block.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analyst Dashboard:&lt;/strong&gt; Real-time view of attack patterns, top blocked IPs, false positive reviews, detection accuracy metrics.&lt;/li&gt;
&lt;/ol&gt;
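&lt;p&gt;Component 1&amp;rsquo;s combination step can be sketched directly from the evaluation flow: the 20/90 reputation cut-offs and the 0.8 ML threshold appear above, while the middle-band actions are an illustrative policy rather than fixed rules:&lt;/p&gt;

```python
def decide(ip_reputation, ml_score, rate_exceeded, ml_block_threshold=0.8):
    """Combine signals into allow / challenge / throttle / block.

    Fast paths mirror the evaluation flow: very good or very bad
    reputation skips ML entirely. The middle band falls through to
    the model score and rate check; the per-band actions here are
    one illustrative policy, not a fixed rule set.
    """
    if ip_reputation > 90:           # fast-path ALLOW, skip ML
        return "allow"
    if 20 > ip_reputation:           # fast-path BLOCK, skip ML
        return "block"
    if ml_score >= ml_block_threshold:
        return "block"
    if rate_exceeded:
        return "throttle"
    if ml_score >= 0.5:              # uncertain: make the client prove itself
        return "challenge"
    return "allow"

print(decide(12, 0.94, True))    # block (reputation fast path)
print(decide(50, 0.6, False))    # challenge
```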
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Internet
  → CDN / Edge (DDoS mitigation at L3/L4 — Cloudflare/AWS Shield)
    → API Gateway with Bot Detection Middleware
      ├── Redis Cluster (reputation, rates, features)
      ├── In-process ML model (XGBoost)
      └── Decision Engine (policy rules)
    → Backend Services (if request allowed)

Async Pipeline:
  Kafka ← request events
    → Flink (feature aggregation) → Redis
    → ClickHouse (analytics storage)

Batch Pipeline:
  ClickHouse → ML Training → Model Store → Deploy to Gateway nodes
  Threat Feeds → Ingester → PostgreSQL + Redis
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-behavioral-fingerprinting-and-tls-analysis&#34;&gt;Deep Dive 1: Behavioral Fingerprinting and TLS Analysis&lt;/h3&gt;
&lt;p&gt;IP addresses alone are insufficient for bot detection. Sophisticated botnets use residential IP proxies, rotating through millions of IPs. Each IP has low request volume (below rate limits), making IP-based detection blind.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Cluster Health Monitoring System</title>
      <link>https://chiraghasija.cc/designs/cluster-health/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/cluster-health/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Monitor health of every node in a cluster (CPU, memory, disk, network, process status) via lightweight agents&lt;/li&gt;
&lt;li&gt;Detect node failures within seconds using heartbeat mechanism and classify failure type (node crash, network partition, disk failure, OOM)&lt;/li&gt;
&lt;li&gt;Define alerting rules (threshold-based and anomaly-based) and route alerts to on-call teams via multiple channels&lt;/li&gt;
&lt;li&gt;Provide a real-time dashboard showing cluster topology, node status, and aggregated metrics&lt;/li&gt;
&lt;li&gt;Support auto-remediation actions (restart service, drain node, scale up) triggered by specific failure patterns&lt;/li&gt;
&lt;/ol&gt;
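&lt;p&gt;The heartbeat rule in requirement 2 can be sketched with a simple missed-beat cutoff; the 5-second interval matches the estimation below, and declaring death after two missed beats is an assumption chosen to stay inside a 10-second detection budget while tolerating one dropped packet:&lt;/p&gt;

```python
import time

HEARTBEAT_INTERVAL = 5   # agents report every 5s (per the estimation)
MISSED_BEFORE_DEAD = 2   # assumed cutoff: 2 missed beats, a 10s budget

class FailureDetector:
    """Flag a node as dead once it misses consecutive heartbeats.

    Two missed 5s heartbeats keeps detection inside 10 seconds while
    a single dropped packet does not trigger a false positive.
    """

    def __init__(self):
        self.last_beat = {}

    def heartbeat(self, node_id, now=None):
        self.last_beat[node_id] = time.time() if now is None else now

    def dead_nodes(self, now=None):
        now = time.time() if now is None else now
        cutoff = HEARTBEAT_INTERVAL * MISSED_BEFORE_DEAD
        return sorted(n for n, ts in self.last_beat.items() if now - ts > cutoff)

fd = FailureDetector()
fd.heartbeat("node-a", now=100)
fd.heartbeat("node-b", now=100)
fd.heartbeat("node-a", now=105)
print(fd.dead_nodes(now=112))   # ['node-b'] missed the 105 and 110 beats
```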
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the monitoring system must be more available than the systems it monitors. A monitoring outage is a blind spot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Failure detection within 10 seconds. Dashboard data staleness &amp;lt; 30 seconds. Alert delivery &amp;lt; 60 seconds from event.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Heartbeat status must be accurate (no false positives for node death). Metrics can tolerate eventual consistency (30s lag acceptable).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100,000 nodes, each emitting ~50 metrics every 10 seconds = 500K metric samples/sec.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Metrics retained at full resolution for 7 days, downsampled (1-min aggregates) for 90 days, further downsampled (1-hour) for 2 years.&lt;/li&gt;
&lt;/ul&gt;
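&lt;p&gt;The durability tiers hinge on downsampling. A minimal sketch of the raw-to-1-minute rollup step; keeping min/max/avg rather than the average alone is an implementation choice that preserves the spikes threshold alerts care about:&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

def downsample(samples, bucket_s=60):
    """Roll raw (timestamp, value) samples up into per-bucket aggregates.

    This is the 7-day-raw to 1-minute-aggregate step. Each bucket keeps
    min, max, avg, and count, so a transient spike survives the rollup.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // bucket_s * bucket_s].append(value)
    return {
        start: {"min": min(vs), "max": max(vs), "avg": mean(vs), "count": len(vs)}
        for start, vs in sorted(buckets.items())
    }

raw = [(t, 40.0) for t in range(0, 60, 10)] + [(65, 95.0), (70, 41.0)]
rollup = downsample(raw)
print(rollup[0]["avg"])    # 40.0
print(rollup[60]["max"])   # 95.0, the spike survives downsampling
```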
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Nodes:&lt;/strong&gt; 100,000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metrics per node:&lt;/strong&gt; 50 metrics × 1 sample/10s = 5 samples/sec/node&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total ingestion:&lt;/strong&gt; 100K × 5 = &lt;strong&gt;500,000 samples/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heartbeats:&lt;/strong&gt; 100K nodes × 1 heartbeat/5s = &lt;strong&gt;20,000 heartbeats/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dashboard queries:&lt;/strong&gt; 500 concurrent operators × 1 query/5s = 100 QPS (but each query may scan millions of data points)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Per sample:&lt;/strong&gt; metric_name (8 bytes hashed) + value (8 bytes) + timestamp (8 bytes) + node_id (8 bytes) + tags (16 bytes) = ~48 bytes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Raw data (7 days):&lt;/strong&gt; 500K samples/sec × 86,400 sec/day × 7 days × 48 bytes = &lt;strong&gt;14.5TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1-min aggregates (90 days):&lt;/strong&gt; 100K nodes × 50 metrics × 1440 min/day × 90 days × 48 bytes = &lt;strong&gt;31TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1-hour aggregates (2 years):&lt;/strong&gt; 100K nodes × 50 metrics × 24 × 730 × 48 bytes = &lt;strong&gt;4.2TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total:&lt;/strong&gt; ~50TB active storage&lt;/li&gt;
&lt;/ul&gt;
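&lt;p&gt;The figures above can be reproduced directly from the stated inputs:&lt;/p&gt;

```python
SAMPLE_BYTES = 48
NODES, METRICS = 100_000, 50

raw_7d = 500_000 * 86_400 * 7 * SAMPLE_BYTES            # full resolution, 7 days
agg_1min = NODES * METRICS * 1_440 * 90 * SAMPLE_BYTES  # 90 days of 1-min rollups
agg_1hr = NODES * METRICS * 24 * 730 * SAMPLE_BYTES     # 2 years of 1-hr rollups

TB = 1e12
print(f"raw: {raw_7d / TB:.1f} TB")       # 14.5 TB
print(f"1-min: {agg_1min / TB:.1f} TB")   # 31.1 TB
print(f"1-hr: {agg_1hr / TB:.1f} TB")     # 4.2 TB
print(f"total: {(raw_7d + agg_1min + agg_1hr) / TB:.0f} TB")  # ~50 TB
```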
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;write-heavy time-series system&lt;/strong&gt; with a 5000:1 write-to-read ratio. The critical challenge is ingesting 500K samples/sec reliably while simultaneously running failure detection with &amp;lt; 10s latency. Storage must be optimized for time-range queries on specific metrics.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Concurrent Screen Limit System (Netflix)</title>
      <link>https://chiraghasija.cc/designs/screen-limit/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/screen-limit/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Enforce a maximum number of concurrent streams per account (e.g., Basic = 1, Standard = 2, Premium = 4)&lt;/li&gt;
&lt;li&gt;Track active streams in real-time via heartbeat-based session monitoring — a stream is &amp;ldquo;active&amp;rdquo; if a heartbeat was received within the last 60 seconds&lt;/li&gt;
&lt;li&gt;When the concurrent limit is reached, deny new stream requests with a clear error message (&amp;ldquo;Too many screens. Stop watching on another device to continue.&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Handle edge cases: device crashes (no explicit stop), network interruptions, and simultaneous stream starts from multiple devices (race conditions)&lt;/li&gt;
&lt;li&gt;Allow users to see active sessions and forcefully terminate a session from another device&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — blocking a paying customer from watching is worse than allowing one extra stream temporarily. Fail-open bias.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Stream start authorization &amp;lt; 100ms. Heartbeat processing &amp;lt; 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for stream count enforcement. Two devices simultaneously starting stream #3 on a 2-stream plan must not both succeed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 200M subscribers, 50M concurrent streams at peak, 100M heartbeats/minute (each stream sends a heartbeat every 30 seconds)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fault tolerance:&lt;/strong&gt; Single Redis node failure must not cause mass stream denials. Graceful degradation.&lt;/li&gt;
&lt;/ul&gt;
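&lt;p&gt;The strong-consistency requirement (two devices must not both win the last slot) comes down to making count-and-admit a single atomic step. In production this is typically a Redis Lua script, atomic because each shard executes scripts serially; the sketch below models that atomicity with an in-process lock, and all names are illustrative:&lt;/p&gt;

```python
import threading
import time

PLAN_LIMITS = {"basic": 1, "standard": 2, "premium": 4}
HEARTBEAT_TTL = 60  # a stream is live if it heartbeated within 60s

class StreamLimiter:
    """Check-and-register in one critical section, so two devices racing
    for the last slot cannot both succeed. The lock stands in for a
    Redis Lua script, which is atomic because a shard runs scripts
    serially; stale sessions (crashed devices) are purged before counting.
    """

    def __init__(self):
        self.sessions = {}   # account_id: {session_id: last_heartbeat}
        self.lock = threading.Lock()

    def try_start(self, account_id, session_id, plan, now=None):
        now = time.time() if now is None else now
        with self.lock:
            live = {
                sid: ts for sid, ts in self.sessions.get(account_id, {}).items()
                if HEARTBEAT_TTL >= now - ts   # drop crashed/stale devices
            }
            if len(live) >= PLAN_LIMITS[plan]:
                self.sessions[account_id] = live
                return False   # surfaces the "Too many screens" error
            live[session_id] = now
            self.sessions[account_id] = live
            return True

limiter = StreamLimiter()
print(limiter.try_start("acct1", "tv", "standard", now=1000))      # True
print(limiter.try_start("acct1", "phone", "standard", now=1010))   # True
print(limiter.try_start("acct1", "laptop", "standard", now=1020))  # False: at limit
print(limiter.try_start("acct1", "laptop", "standard", now=1100))  # True: tv and phone went stale
```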
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50M concurrent streams at peak&lt;/li&gt;
&lt;li&gt;Each stream sends heartbeat every 30 seconds → 50M / 30 = &lt;strong&gt;1.67M heartbeats/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stream starts: average stream duration = 45 minutes → 50M / 45 = &lt;strong&gt;~18.5K stream starts/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stream stops (explicit): ~70% of sessions end cleanly → &lt;strong&gt;~13K stops/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stream authorization checks (on start + periodic re-check every 5 min): &lt;strong&gt;~185K checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active session record: account_id (8B) + session_id (16B) + device_id (32B) + started_at (8B) + last_heartbeat (8B) + metadata (28B) = ~100 bytes&lt;/li&gt;
&lt;li&gt;50M concurrent sessions × 100 bytes = &lt;strong&gt;5 GB&lt;/strong&gt; — fits entirely in Redis&lt;/li&gt;
&lt;li&gt;Account → plan mapping: 200M accounts × 50 bytes = &lt;strong&gt;10 GB&lt;/strong&gt; — cached in Redis, source of truth in PostgreSQL&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;heartbeat-processing&#34;&gt;Heartbeat Processing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1.67M heartbeats/sec × 100 bytes each = &lt;strong&gt;167 MB/sec&lt;/strong&gt; of inbound traffic&lt;/li&gt;
&lt;li&gt;Each heartbeat updates a single key in Redis: O(1) operation&lt;/li&gt;
&lt;li&gt;Redis can handle 500K+ operations/sec per instance → need ~4 Redis shards minimum&lt;/li&gt;
&lt;li&gt;With 10 Redis shards: 167K operations/sec each — comfortable headroom&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;distributed counting problem with strong consistency requirements.&lt;/strong&gt; The core challenge is ensuring that the stream count for an account never exceeds the limit, even when multiple devices attempt to start streams simultaneously. This is complicated by the fact that playback clients can vanish without warning (crash, network loss) — we must use heartbeats to infer liveness rather than relying on explicit stop signals.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Credit Card Processing System</title>
      <link>https://chiraghasija.cc/designs/credit-card-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/credit-card-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Process credit card authorization requests in real-time — validate card, check funds, place a hold, return approve/decline&lt;/li&gt;
&lt;li&gt;Handle settlement/clearing: batch process authorized transactions at end of day, move funds from issuing bank to acquiring bank&lt;/li&gt;
&lt;li&gt;Support tokenization — replace sensitive card numbers with random tokens that cannot be reversed without access to the secure token vault, for PCI DSS compliance&lt;/li&gt;
&lt;li&gt;Detect and flag fraudulent transactions in real-time using rules and ML models before authorization&lt;/li&gt;
&lt;li&gt;Handle refunds, chargebacks, and multi-currency transactions with proper exchange rate management&lt;/li&gt;
&lt;/ol&gt;
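&lt;p&gt;Requirement 3&amp;rsquo;s tokenization can be sketched as a vault that maps each card number to pure randomness, so a token leaked outside the vault reveals nothing about the card. A dict stands in for the vault here; in practice it lives inside the PCI scope behind an HSM:&lt;/p&gt;

```python
import secrets

class TokenVault:
    """Map card numbers (PANs) to random tokens.

    The token carries no information about the PAN, so it cannot be
    reversed without the vault; only the vault can detokenize. The
    dict-backed storage is purely illustrative.
    """

    def __init__(self):
        self.pan_to_token = {}
        self.token_to_pan = {}

    def tokenize(self, pan):
        if pan in self.pan_to_token:   # stable token per card
            return self.pan_to_token[pan]
        token = "tok_" + secrets.token_hex(16)
        self.pan_to_token[pan] = token
        self.token_to_pan[token] = pan
        return token

    def detokenize(self, token):
        return self.token_to_pan[token]

vault = TokenVault()
t = vault.tokenize("4111111111111111")
print(vault.tokenize("4111111111111111") == t)   # True: idempotent per card
print(vault.detokenize(t) == "4111111111111111") # True
```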
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% for the authorization path — every second of downtime costs merchants millions in lost sales&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Authorization response &amp;lt; 300ms end-to-end (merchant → acquirer → card network → issuer → back). Our processing adds &amp;lt; 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for financial operations. Every transaction must be exactly-once. Duplicate charges are unacceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 150K transactions/sec peak (Black Friday). 5B transactions/day average. ~$90T annual payment volume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security:&lt;/strong&gt; PCI DSS Level 1 compliance. Card data encrypted at rest (AES-256) and in transit (TLS 1.3). HSM for key management.&lt;/li&gt;
&lt;/ul&gt;
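&lt;p&gt;The exactly-once requirement is usually enforced with idempotency keys: a retry carrying the same key gets the stored response back instead of creating a second charge. A minimal sketch, with an in-memory dict standing in for a durable store:&lt;/p&gt;

```python
import uuid

class AuthorizationProcessor:
    """Deduplicate authorizations by idempotency key.

    A retry (network timeout, client resend) with the same key returns
    the stored response instead of charging twice. The in-memory dict
    stands in for a durable store; names are illustrative.
    """

    def __init__(self):
        self.responses = {}
        self.charges = []

    def authorize(self, idempotency_key, card_token, amount_cents):
        if idempotency_key in self.responses:   # replay: no second charge
            return self.responses[idempotency_key]
        auth_id = str(uuid.uuid4())
        self.charges.append((card_token, amount_cents))
        response = {"auth_id": auth_id, "status": "approved", "amount": amount_cents}
        self.responses[idempotency_key] = response
        return response

proc = AuthorizationProcessor()
first = proc.authorize("key-123", "tok_abc", 5000)
retry = proc.authorize("key-123", "tok_abc", 5000)   # client retried after a timeout
print(first == retry)         # True: same response replayed
print(len(proc.charges))      # 1: exactly one hold placed
```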
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;5B transactions/day average = &lt;strong&gt;58K TPS average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak (Black Friday, flash sales): &lt;strong&gt;150K TPS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each authorization involves: fraud check + balance check + hold placement = 3-5 internal operations per transaction&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Transaction records: 5B/day × 500 bytes = 2.5TB/day, &lt;strong&gt;~900TB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Token vault: 2B unique cards × 200 bytes = 400GB (fits in memory with replication)&lt;/li&gt;
&lt;li&gt;Fraud model features: 5B/day × 1KB (computed features) = 5TB/day (stored for model training, purged after 90 days)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;financial-math&#34;&gt;Financial Math&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average transaction: $50&lt;/li&gt;
&lt;li&gt;Annual volume: 5B/day × 365 × $50 = &lt;strong&gt;$91T&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Interchange fee (~2%): $1.8T/year in fees flowing through the system&lt;/li&gt;
&lt;li&gt;A 1-second outage at peak: 150K transactions × $50 = &lt;strong&gt;$7.5M in potentially lost transactions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This system has &lt;strong&gt;zero tolerance for data loss or duplication&lt;/strong&gt;. A lost transaction means either the merchant doesn&amp;rsquo;t get paid or the customer gets charged without receiving goods. A duplicate means double-charging. Every operation must be idempotent and durable.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Cache (Memcached/Redis)</title>
      <link>https://chiraghasija.cc/designs/distributed-cache/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/distributed-cache/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Get/Set/Delete&lt;/strong&gt; — clients can store, retrieve, and remove key-value pairs with sub-millisecond latency&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TTL support&lt;/strong&gt; — every key can have an optional time-to-live; expired keys are automatically purged&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eviction policies&lt;/strong&gt; — when memory is full, evict keys according to a configurable policy (LRU, LFU, random)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Atomic operations&lt;/strong&gt; — support CAS (compare-and-swap), increment/decrement, and simple Lua-script-style transactions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cluster management&lt;/strong&gt; — automatically shard data across nodes, detect failures, and rebalance with minimal data movement&lt;/li&gt;
&lt;/ol&gt;
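&lt;p&gt;Requirements 1, 2, and 3 combine naturally in a single structure. A minimal single-node sketch using an ordered map for O(1) LRU bookkeeping; purging expired keys lazily on read is one implementation choice (the common alternative is a background expiry sweep):&lt;/p&gt;

```python
import time
from collections import OrderedDict

class LruTtlCache:
    """Get/Set with per-key TTL and LRU eviction.

    OrderedDict gives O(1) move-to-end, so the least recently used
    key always sits at the left end when capacity is exceeded.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # key: (value, expires_at or None)

    def set(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires = now + ttl if ttl is not None else None
        self.data[key] = (value, expires)
        self.data.move_to_end(key)
        while len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the LRU entry

    def get(self, key, now=None):
        now = time.time() if now is None else now
        if key not in self.data:
            return None
        value, expires = self.data[key]
        if expires is not None and now >= expires:
            del self.data[key]      # lazy purge of an expired key
            return None
        self.data.move_to_end(key)  # touch for LRU ordering
        return value

cache = LruTtlCache(capacity=2)
cache.set("a", 1, now=0)
cache.set("b", 2, ttl=10, now=0)
cache.get("a", now=1)           # touch "a" so "b" becomes the LRU entry
cache.set("c", 3, now=2)        # evicts "b"
print(cache.get("b", now=3))    # None: evicted
print(cache.get("a", now=3))    # 1
```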
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the cache is in the read hot-path; downtime causes a stampede on the backing store&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; p50 &amp;lt; 0.5ms, p99 &amp;lt; 2ms for single-key operations from within the same datacenter&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency across replicas is acceptable. For a single shard, reads-after-writes must be consistent (read-your-writes from the leader)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M+ ops/sec cluster-wide, 500TB+ aggregate memory across thousands of nodes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Best-effort. Cache is ephemeral by design, but optional AOF/snapshotting for warm restarts&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100M cache operations/sec (70% reads, 30% writes)&lt;/li&gt;
&lt;li&gt;Read QPS: &lt;strong&gt;70M/s&lt;/strong&gt;, Write QPS: &lt;strong&gt;30M/s&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Average key size: 64 bytes, average value size: 1 KB&lt;/li&gt;
&lt;li&gt;Average payload per op: ~1 KB → bandwidth: 100M × 1 KB = &lt;strong&gt;100 GB/s&lt;/strong&gt; cluster-wide&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500B unique keys in steady state (Zipf distribution — 20% of keys serve 80% of traffic)&lt;/li&gt;
&lt;li&gt;Average entry: 64 B (key) + 1 KB (value) + 48 B (metadata: TTL, flags, pointers) ≈ 1.1 KB&lt;/li&gt;
&lt;li&gt;Total memory: 500B × 1.1 KB = &lt;strong&gt;550 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With replication factor 2 (one replica per shard): &lt;strong&gt;1.1 PB&lt;/strong&gt; raw memory&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;node-count&#34;&gt;Node Count&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;If each node has 128 GB usable memory → 550 TB / 128 GB ≈ &lt;strong&gt;4,300 primary shards&lt;/strong&gt; + 4,300 replicas ≈ &lt;strong&gt;8,600 nodes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each node handles: 100M / 4,300 ≈ &lt;strong&gt;23K ops/sec&lt;/strong&gt; — very comfortable for a single Redis-like process&lt;/li&gt;
&lt;/ul&gt;
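&lt;p&gt;The arithmetic above is easy to fumble under time pressure; a small helper (all names illustrative) makes it repeatable:&lt;/p&gt;

```python
def cache_cluster_sizing(keys, entry_bytes, node_mem_bytes, ops_per_sec, replication=2):
    # Reproduce the back-of-envelope numbers above as checkable arithmetic.
    # All names here are illustrative, not part of the design.
    total_bytes = keys * entry_bytes
    primaries = -(-total_bytes // node_mem_bytes)   # ceiling division
    return {
        'total_tb': total_bytes / 1e12,
        'primary_shards': primaries,
        'total_nodes': primaries * replication,
        'ops_per_shard': ops_per_sec // primaries,
    }
```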
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;memory-bound, latency-critical&lt;/strong&gt; system. The hard problems are: (1) distributing keys evenly across thousands of shards, (2) handling hot keys that violate even distribution, and (3) preventing thundering herds when popular keys expire.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Job Scheduler</title>
      <link>https://chiraghasija.cc/designs/job-scheduler/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/job-scheduler/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Schedule jobs to run at a specific time, on a recurring cron schedule, or after a delay (e.g., &amp;ldquo;run in 30 minutes&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Execute jobs with at-least-once semantics — every scheduled job must run, even if a worker crashes mid-execution&lt;/li&gt;
&lt;li&gt;Support job dependencies (DAGs) — Job B runs only after Job A completes successfully&lt;/li&gt;
&lt;li&gt;Provide job priority levels (critical, high, normal, low) with priority-based queue ordering&lt;/li&gt;
&lt;li&gt;Support job lifecycle management: create, pause, resume, cancel, retry; expose job status, logs, and execution history&lt;/li&gt;
&lt;/ol&gt;
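&lt;p&gt;Requirements 1 and 4 (time-based dispatch with priority ordering) can be sketched with a min-heap keyed on (run_at, priority). This is an in-process illustration only, with assumed names; the rest of the design distributes exactly this responsibility:&lt;/p&gt;

```python
import heapq

class Scheduler:
    # Illustrative in-process sketch: jobs sit in a min-heap ordered by
    # (run_at, priority); due_jobs pops everything whose time has passed.
    PRIORITY = {'critical': 0, 'high': 1, 'normal': 2, 'low': 3}

    def __init__(self):
        self.heap = []
        self.seq = 0   # tie-breaker so heap comparisons never reach job_id

    def schedule(self, job_id, run_at, priority='normal'):
        heapq.heappush(self.heap, (run_at, self.PRIORITY[priority], self.seq, job_id))
        self.seq += 1

    def due_jobs(self, now):
        # Pop every job whose run_at has passed, in (time, priority) order.
        due = []
        while self.heap and now >= self.heap[0][0]:
            due.append(heapq.heappop(self.heap)[3])
        return due
```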
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — missed or delayed jobs can cost real money (billing runs, report generation, data pipelines)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Job dispatch within 1 second of scheduled time. Job completion depends on the job itself.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Exactly-once scheduling (no duplicate dispatches), at-least-once execution (idempotent jobs handle retries)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M scheduled jobs, 100K job executions per minute at peak, 10K concurrent running jobs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Job definitions and execution history must survive any single node failure. Zero job loss.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M total scheduled jobs (mix of one-time and recurring)&lt;/li&gt;
&lt;li&gt;Recurring jobs: 1M cron jobs, average frequency = every hour → 1M/3600 = &lt;strong&gt;~278 job dispatches/sec&lt;/strong&gt; from cron alone&lt;/li&gt;
&lt;li&gt;One-time jobs: 100K scheduled per hour → &lt;strong&gt;~28 dispatches/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 5× average → &lt;strong&gt;~1500 dispatches/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;API calls (CRUD + status checks): &lt;strong&gt;~5000 QPS&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Job definition: ~2 KB (name, schedule, payload, config, dependencies)&lt;/li&gt;
&lt;li&gt;10M jobs × 2 KB = &lt;strong&gt;20 GB&lt;/strong&gt; — fits in a single PostgreSQL instance&lt;/li&gt;
&lt;li&gt;Execution history: 100K executions/min × 1 KB per record = 100 MB/min = &lt;strong&gt;144 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Retention: 30 days of execution history = &lt;strong&gt;~4.3 TB&lt;/strong&gt; (needs archival strategy)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;worker-pool&#34;&gt;Worker Pool&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average job duration: 30 seconds&lt;/li&gt;
&lt;li&gt;100K executions/min ÷ 60 = 1,667 jobs/sec&lt;/li&gt;
&lt;li&gt;1,667 jobs/sec × 30 sec average = &lt;strong&gt;50,000 concurrent job slots needed&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With 16 slots per worker: &lt;strong&gt;~3,125 worker machines&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
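&lt;p&gt;The worker-pool math is an application of Little&amp;#39;s law: concurrency = arrival rate × average service time. A small helper (names are illustrative) keeps the estimate checkable:&lt;/p&gt;

```python
import math

def worker_fleet_size(executions_per_min, avg_job_seconds, slots_per_worker):
    # Little's law: concurrent jobs = arrival rate x average service time.
    concurrent_slots = executions_per_min * avg_job_seconds / 60
    return math.ceil(concurrent_slots / slots_per_worker)
```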
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The scheduler itself is not compute-heavy — the challenge is &lt;strong&gt;reliable, timely dispatch at scale with exactly-once semantics.&lt;/strong&gt; The workers that execute jobs are a separate scaling concern. The hardest problems are: (1) distributing scheduling responsibility without missing or duplicating jobs, (2) handling worker failures mid-execution, and (3) orchestrating DAG dependencies efficiently.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Key-Value Store</title>
      <link>https://chiraghasija.cc/designs/kv-store/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/kv-store/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;put(key, value)&lt;/code&gt; — store a key-value pair (upsert semantics)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;get(key)&lt;/code&gt; — retrieve the value for a given key&lt;/li&gt;
&lt;li&gt;&lt;code&gt;delete(key)&lt;/code&gt; — remove a key-value pair&lt;/li&gt;
&lt;li&gt;Support large value sizes (up to 1MB per value, keys up to 256 bytes)&lt;/li&gt;
&lt;li&gt;Automatic data partitioning across nodes with rebalancing on cluster resize&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the store must remain writable even during node failures and network partitions (AP system by default, tunable toward CP)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5ms p99 for both reads and writes (single datacenter)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Tunable — eventual consistency by default (R=1, W=1), strong consistency optional (R=W=quorum)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100TB+ total data, 500K+ operations/sec, hundreds of nodes, linear horizontal scaling&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; No data loss for acknowledged writes. Replication factor of 3 across failure domains.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Write QPS: 200K ops/sec&lt;/li&gt;
&lt;li&gt;Read QPS: 300K ops/sec (1.5:1 read-to-write ratio)&lt;/li&gt;
&lt;li&gt;Average value size: 10KB, average key size: 100 bytes&lt;/li&gt;
&lt;li&gt;Write bandwidth: 200K x 10KB = &lt;strong&gt;2 GB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Read bandwidth: 300K x 10KB = &lt;strong&gt;3 GB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100 billion key-value pairs in steady state&lt;/li&gt;
&lt;li&gt;Average size per pair: 10KB value + 100 bytes key + 200 bytes metadata = ~10.3KB&lt;/li&gt;
&lt;li&gt;Total data: 100B x 10.3KB = &lt;strong&gt;1 PB&lt;/strong&gt; before replication&lt;/li&gt;
&lt;li&gt;With RF=3: &lt;strong&gt;3 PB&lt;/strong&gt; total&lt;/li&gt;
&lt;li&gt;Per node (assuming 100 nodes): 30TB per node → 8 x 4TB SSDs each&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;memory&#34;&gt;Memory&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each node holds ~3B entries (30TB per node ÷ ~10.3KB per entry, replicas included)&lt;/li&gt;
&lt;li&gt;In-memory index (key → file offset): 100 bytes key + 16 bytes offset = 116 bytes per key&lt;/li&gt;
&lt;li&gt;Index per node: 3B x 116 bytes = &lt;strong&gt;~350 GB&lt;/strong&gt; — far too large for RAM&lt;/li&gt;
&lt;li&gt;Solution: Bloom filters + sparse index (see Deep Dive)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Write a key-value pair
PUT /kv/{key}
  Headers: {
    &amp;#34;X-Consistency&amp;#34;: &amp;#34;quorum&amp;#34;,      // optional: one, quorum, all
    &amp;#34;X-TTL&amp;#34;: 86400,                 // optional: TTL in seconds
    &amp;#34;If-Match&amp;#34;: &amp;#34;&amp;lt;vector_clock&amp;gt;&amp;#34;    // optional: conditional write (CAS)
  }
  Body: &amp;lt;raw bytes&amp;gt;
  Response 200: {
    &amp;#34;key&amp;#34;: &amp;#34;user:123:profile&amp;#34;,
    &amp;#34;version&amp;#34;: {&amp;#34;node1&amp;#34;: 3, &amp;#34;node2&amp;#34;: 1},  // vector clock
    &amp;#34;timestamp&amp;#34;: 1708632060000
  }

// Read a key
GET /kv/{key}
  Headers: {
    &amp;#34;X-Consistency&amp;#34;: &amp;#34;quorum&amp;#34;       // optional: one, quorum, all
  }
  Response 200: {
    &amp;#34;key&amp;#34;: &amp;#34;user:123:profile&amp;#34;,
    &amp;#34;value&amp;#34;: &amp;#34;&amp;lt;base64-encoded&amp;gt;&amp;#34;,
    &amp;#34;version&amp;#34;: {&amp;#34;node1&amp;#34;: 3, &amp;#34;node2&amp;#34;: 1},
    &amp;#34;timestamp&amp;#34;: 1708632060000
  }
  // If conflicting versions exist (eventual consistency):
  Response 200: {
    &amp;#34;key&amp;#34;: &amp;#34;user:123:profile&amp;#34;,
    &amp;#34;values&amp;#34;: [
      {&amp;#34;value&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;version&amp;#34;: {&amp;#34;node1&amp;#34;: 3}},
      {&amp;#34;value&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;version&amp;#34;: {&amp;#34;node2&amp;#34;: 2}}
    ],
    &amp;#34;conflict&amp;#34;: true
  }

// Delete a key
DELETE /kv/{key}
  Response 200: { &amp;#34;deleted&amp;#34;: true }
  // Internally: tombstone with TTL (not immediate physical delete)

// Scan keys by prefix
GET /kv?prefix=user:123:&amp;amp;limit=100
  Response 200: { &amp;#34;pairs&amp;#34;: [...], &amp;#34;cursor&amp;#34;: &amp;#34;...&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Client receives vector clock on write, passes it back on subsequent writes for conflict detection&lt;/li&gt;
&lt;li&gt;Conflicts surfaced to client (Dynamo-style) rather than hidden behind last-write-wins&lt;/li&gt;
&lt;li&gt;Delete uses tombstones — physical deletion happens during compaction&lt;/li&gt;
&lt;/ul&gt;
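&lt;p&gt;The conflict rule behind these decisions: clock A happened before clock B when no counter in A exceeds the matching counter in B; if each clock is ahead on some node, the writes are concurrent and both values are surfaced. A sketch under assumed names (not a real client API):&lt;/p&gt;

```python
def vc_compare(a, b):
    # Compare two vector clocks (dicts of node_id to counter).
    # Returns 'before', 'after', 'equal', or 'concurrent'.
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return 'concurrent'   # true conflict: surface both values to the client
    if a_ahead:
        return 'after'
    if b_ahead:
        return 'before'
    return 'equal'

def vc_merge(a, b):
    # Element-wise max, used after the client resolves a conflict.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```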
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;on-disk-storage-lsm-tree-per-node&#34;&gt;On-Disk Storage (LSM-Tree per node)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SSTable (Sorted String Table):
  ┌─────────────────────────────────┐
  │ Data Block 1 (4KB, sorted KVs)  │
  │ Data Block 2                    │
  │ ...                             │
  │ Data Block N                    │
  │ Index Block (key → block offset)│
  │ Bloom Filter Block              │
  │ Footer (offsets to index/filter)│
  └─────────────────────────────────┘

Each KV entry:
  key             | bytes (up to 256B)
  value           | bytes (up to 1MB)
  timestamp       | int64
  vector_clock    | map&amp;lt;node_id, counter&amp;gt;
  tombstone       | boolean
  ttl_expiry      | int64 (0 = no expiry)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;in-memory-structures-per-node&#34;&gt;In-Memory Structures (per node)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Memtable (Red-Black Tree or Skip List):
  - All recent writes, sorted by key
  - Flushed to SSTable when size exceeds 64MB

Write-Ahead Log (WAL):
  - Sequential append of every write
  - Replayed on crash recovery
  - Truncated after memtable flush

Bloom Filters:
  - One per SSTable, loaded in memory
  - 10 bits per key → ~1% false positive rate
  - Avoids reading SSTables that definitely don&amp;#39;t contain the key
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-lsm-tree-not-b-tree&#34;&gt;Why LSM-Tree (not B-Tree)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Write-optimized:&lt;/strong&gt; All writes are sequential appends (memtable → flush → SSTable). No random I/O on writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Reads may check multiple SSTables (L0, L1, &amp;hellip;, Ln). Mitigated by bloom filters, level compaction, and caching.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction:&lt;/strong&gt; Background process merges SSTables, removes tombstones, reduces read amplification.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Space amplification:&lt;/strong&gt; Leveled compaction keeps it at ~1.1x; size-tiered compaction can be ~2x but has better write throughput.&lt;/li&gt;
&lt;/ul&gt;
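&lt;p&gt;The 10-bits-per-key → ~1% figure quoted for the bloom filters follows from the standard formula: with b bits per key and the optimal k = b·ln 2 hash functions, the false-positive rate is (1 − e^(−k/b))^k ≈ 0.6185^b. A quick check (function name is illustrative):&lt;/p&gt;

```python
import math

def bloom_false_positive_rate(bits_per_key):
    # Optimal hash-function count k = bits_per_key * ln 2, rounded to an integer;
    # the resulting false-positive rate is (1 - e^(-k/b))^k, roughly 0.6185^b.
    k = max(1, round(bits_per_key * math.log(2)))
    return (1 - math.exp(-k / bits_per_key)) ** k
```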
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client SDK (consistent hashing + preference list)
  │
  │  Route to appropriate node(s)
  ▼
┌──────────────────────────────────────────────┐
│              Distributed KV Cluster           │
│                                               │
│  Node A (tokens: 0-90)                       │
│  ┌────────────────────────┐                  │
│  │ Request Handler        │                  │
│  │ ├─ Coordinator logic   │                  │
│  │ ├─ Replication manager │                  │
│  │ └─ Read repair         │                  │
│  │                        │                  │
│  │ Storage Engine         │                  │
│  │ ├─ Memtable (64MB)     │                  │
│  │ ├─ WAL                 │                  │
│  │ ├─ SSTables (L0..Ln)   │                  │
│  │ └─ Bloom filters       │                  │
│  └────────────────────────┘                  │
│                                               │
│  Node B (tokens: 91-180)   ...  Node F       │
│  (similar structure)                          │
│                                               │
│  Cross-cutting:                               │
│  ├─ Gossip Protocol (membership, failure)    │
│  ├─ Anti-Entropy (Merkle trees)              │
│  └─ Hinted Handoff (temporary failed nodes)  │
└──────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;write-path&#34;&gt;Write Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client
  → SDK hashes key → identifies coordinator node (owns token range)
  → Sends PUT to coordinator
  → Coordinator:
    1. Determine preference list: [Node A (primary), Node B (replica), Node C (replica)]
    2. Forward write to all 3 nodes in parallel
    3. Each node:
       a. Append to WAL (fsync for durability)
       b. Insert into Memtable
       c. Respond to coordinator
    4. Coordinator waits for W responses (configurable: W=1, W=2 quorum, W=3 all)
    5. Respond to client with vector clock
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;read-path&#34;&gt;Read Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client
  → SDK hashes key → identifies coordinator
  → Sends GET to coordinator
  → Coordinator:
    1. Send read to R nodes from preference list (R=1, R=2 quorum, R=3 all)
    2. Each node:
       a. Check Memtable → if found, return
       b. Check Bloom filters for each SSTable (L0 first, then L1, L2...)
       c. If bloom filter says &amp;#34;maybe present&amp;#34; → read SSTable index → read data block
       d. Return latest version (by vector clock)
    3. Coordinator:
       a. Compare versions from R responses
       b. If all match → return to client
       c. If conflict → return multiple values (client resolves) OR last-write-wins
       d. Trigger read repair (send latest version to stale nodes)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Nodes (100+):&lt;/strong&gt; Each node runs the storage engine (LSM-tree), handles coordinator logic, participates in gossip, and manages local replication.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistent Hash Ring:&lt;/strong&gt; 256 virtual nodes per physical node for balanced distribution. Token assignment stored in gossip state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gossip Protocol:&lt;/strong&gt; Every node pings 3 random nodes every second, exchanging membership state (heartbeats, token assignments, failure suspicions).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anti-Entropy Service:&lt;/strong&gt; Background process comparing Merkle trees between replica nodes to detect and repair divergence.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hinted Handoff Store:&lt;/strong&gt; When a target replica is down, the coordinator stores the write locally as a &amp;ldquo;hint&amp;rdquo; and replays it when the node recovers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction Manager:&lt;/strong&gt; Background threads running leveled or size-tiered compaction to merge SSTables and remove dead data.&lt;/li&gt;
&lt;/ol&gt;
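&lt;p&gt;Components 1 and 2 (virtual nodes on a hash ring, plus a preference list of distinct physical nodes) can be sketched briefly; the hash choice and class names here are assumptions:&lt;/p&gt;

```python
import bisect
import hashlib

class HashRing:
    # Illustrative consistent-hash ring: each physical node owns many virtual
    # tokens; a key belongs to the first n distinct nodes clockwise from its hash.
    def __init__(self, nodes, vnodes=256):
        self.nodes = list(nodes)
        self.ring = sorted(
            (self._hash(f'{node}:{i}'), node)
            for node in self.nodes for i in range(vnodes)
        )
        self.tokens = [token for token, _ in self.ring]

    @staticmethod
    def _hash(s):
        # 64-bit token from md5; the hash function is an assumption
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], 'big')

    def preference_list(self, key, n=3):
        n = min(n, len(self.nodes))
        i = bisect.bisect(self.tokens, self._hash(key))
        owners = []
        while n > len(owners):
            _, node = self.ring[i % len(self.ring)]
            if node not in owners:   # skip extra vnodes of a node already chosen
                owners.append(node)
            i += 1
        return owners
```

&lt;p&gt;Because each physical node owns many scattered tokens, adding a node steals a small slice from every existing node rather than splitting a single neighbor in half.&lt;/p&gt;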
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-consistent-hashing-and-data-partitioning&#34;&gt;Deep Dive 1: Consistent Hashing and Data Partitioning&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; How to distribute data across N nodes so that adding/removing a node only moves ~1/N of the data (not a full reshuffle)?&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Logging System (ELK/Splunk)</title>
      <link>https://chiraghasija.cc/designs/distributed-logging/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/distributed-logging/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Log Ingestion&lt;/strong&gt; — Collect structured and unstructured logs from 50K+ servers, containers, and serverless functions. Support push-based (agents forward logs) and pull-based (system scrapes endpoints) collection models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full-Text Search &amp;amp; Field-Based Filtering&lt;/strong&gt; — Users can search logs by arbitrary keywords (full-text), filter by structured fields (service name, log level, host, correlation ID), and scope by time range. Queries return results within seconds even over terabytes of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Live Tail (Real-Time Streaming)&lt;/strong&gt; — Provide a &lt;code&gt;tail -f&lt;/code&gt; equivalent where engineers can subscribe to a live stream of logs from a specific service, host, or filter expression. Latency from log emission to live tail display should be under 5 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alerting &amp;amp; Pattern Detection&lt;/strong&gt; — Define alert rules on log patterns (e.g., &amp;ldquo;more than 100 ERROR logs from payment-service in 5 minutes&amp;rdquo;). Support anomaly detection on log volume and error rates. Route alerts to PagerDuty, Slack, email.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retention &amp;amp; Lifecycle Management&lt;/strong&gt; — Configure per-tenant or per-service retention policies (e.g., 7 days hot, 30 days warm, 1 year cold, 7 years frozen for compliance). Automatic tiered storage migration and deletion.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability&lt;/strong&gt;: 99.99% uptime for ingestion (logs must never be dropped). Search can tolerate brief degradation (99.9%).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt;: Ingestion-to-searchable latency &amp;lt; 10 seconds (p99). Search queries return in &amp;lt; 5 seconds for recent data (last 24h), &amp;lt; 30 seconds for historical data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Eventual consistency is acceptable. A log written at time T may not appear in search results for up to 10 seconds. No strict ordering guarantees across services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt;: 50,000 servers, 2 million log lines/second sustained ingestion, 10M lines/sec burst. 1 PB of searchable hot/warm data. 10 PB in cold/frozen archival.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability&lt;/strong&gt;: Zero log loss once acknowledged by the ingestion pipeline. Replicated storage with 99.999999999% (11 nines) durability for archived data.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-path-ingestion&#34;&gt;Write Path (Ingestion)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Metric&lt;/th&gt;
          &lt;th&gt;Value&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Servers&lt;/td&gt;
          &lt;td&gt;50,000&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Avg log lines per server per second&lt;/td&gt;
          &lt;td&gt;40&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Sustained ingestion rate&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;2M lines/sec&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Burst ingestion rate (10x spikes)&lt;/td&gt;
          &lt;td&gt;10M lines/sec&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Avg log line size (JSON structured)&lt;/td&gt;
          &lt;td&gt;500 bytes&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Sustained throughput&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;1 GB/sec = 3.6 TB/hour = 86 TB/day&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Compression ratio (zstd)&lt;/td&gt;
          &lt;td&gt;~10:1&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Compressed daily storage&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;~8.6 TB/day&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;read-path-search&#34;&gt;Read Path (Search)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Metric&lt;/th&gt;
          &lt;th&gt;Value&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Concurrent search users&lt;/td&gt;
          &lt;td&gt;~500&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Search QPS&lt;/td&gt;
          &lt;td&gt;~200 queries/sec&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Live tail subscriptions&lt;/td&gt;
          &lt;td&gt;~1,000 concurrent streams&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Avg query scans&lt;/td&gt;
          &lt;td&gt;1–10 GB of index data per query&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;storage-tiers&#34;&gt;Storage Tiers&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Tier&lt;/th&gt;
          &lt;th&gt;Retention&lt;/th&gt;
          &lt;th&gt;Raw Data&lt;/th&gt;
          &lt;th&gt;Compressed&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Hot (SSD, fully indexed)&lt;/td&gt;
          &lt;td&gt;3 days&lt;/td&gt;
          &lt;td&gt;258 TB&lt;/td&gt;
          &lt;td&gt;~26 TB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Warm (HDD, fully indexed)&lt;/td&gt;
          &lt;td&gt;30 days&lt;/td&gt;
          &lt;td&gt;2.58 PB&lt;/td&gt;
          &lt;td&gt;~258 TB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Cold (object storage, partial index)&lt;/td&gt;
          &lt;td&gt;1 year&lt;/td&gt;
          &lt;td&gt;~31 PB&lt;/td&gt;
          &lt;td&gt;~3.1 PB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Frozen (object storage, metadata only)&lt;/td&gt;
          &lt;td&gt;7 years&lt;/td&gt;
          &lt;td&gt;archived&lt;/td&gt;
          &lt;td&gt;~20 PB&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;kafka-sizing&#34;&gt;Kafka Sizing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2M messages/sec × 500 bytes = 1 GB/sec.&lt;/li&gt;
&lt;li&gt;With replication factor 3, Kafka needs ~3 GB/sec disk write throughput.&lt;/li&gt;
&lt;li&gt;24-hour retention in Kafka (buffer for downstream failures): 86 TB × 3 replicas = 258 TB Kafka storage.&lt;/li&gt;
&lt;li&gt;~50 Kafka brokers (each handling ~60 MB/sec write throughput).&lt;/li&gt;
&lt;/ul&gt;
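&lt;p&gt;The sizing above can be restated as checkable arithmetic (parameter names are illustrative):&lt;/p&gt;

```python
import math

def kafka_sizing(msgs_per_sec, msg_bytes, replication=3, retention_hours=24,
                 broker_mb_per_sec=60):
    # Reproduce the Kafka sizing arithmetic above; names are illustrative.
    ingress_gb_s = msgs_per_sec * msg_bytes / 1e9
    disk_write_gb_s = ingress_gb_s * replication   # every byte written RF times
    storage_tb = ingress_gb_s * 3600 * retention_hours * replication / 1e3
    brokers = math.ceil(disk_write_gb_s * 1000 / broker_mb_per_sec)
    return ingress_gb_s, disk_write_gb_s, storage_tb, brokers
```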
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# --- Ingestion APIs ---

# Batch log ingestion (used by agents)
POST /v1/logs/ingest
Headers: X-Tenant-ID, X-API-Key, Content-Encoding: zstd
Body: {
  &amp;#34;logs&amp;#34;: [
    {
      &amp;#34;timestamp&amp;#34;: &amp;#34;2025-07-15T10:23:45.123Z&amp;#34;,
      &amp;#34;level&amp;#34;: &amp;#34;ERROR&amp;#34;,
      &amp;#34;service&amp;#34;: &amp;#34;payment-service&amp;#34;,
      &amp;#34;host&amp;#34;: &amp;#34;prod-web-042&amp;#34;,
      &amp;#34;trace_id&amp;#34;: &amp;#34;abc123def456&amp;#34;,
      &amp;#34;span_id&amp;#34;: &amp;#34;span-789&amp;#34;,
      &amp;#34;message&amp;#34;: &amp;#34;Failed to process payment&amp;#34;,
      &amp;#34;metadata&amp;#34;: { &amp;#34;user_id&amp;#34;: &amp;#34;u-123&amp;#34;, &amp;#34;order_id&amp;#34;: &amp;#34;ord-456&amp;#34;, &amp;#34;error_code&amp;#34;: &amp;#34;TIMEOUT&amp;#34; }
    }
  ]
}
Response: 202 Accepted { &amp;#34;ingested&amp;#34;: 150, &amp;#34;failed&amp;#34;: 0 }

# --- Search APIs ---

# Full-text and field-based search
POST /v1/logs/search
Body: {
  &amp;#34;query&amp;#34;: &amp;#34;Failed to process payment&amp;#34;,
  &amp;#34;filters&amp;#34;: {
    &amp;#34;service&amp;#34;: &amp;#34;payment-service&amp;#34;,
    &amp;#34;level&amp;#34;: [&amp;#34;ERROR&amp;#34;, &amp;#34;WARN&amp;#34;],
    &amp;#34;time_range&amp;#34;: { &amp;#34;from&amp;#34;: &amp;#34;2025-07-15T10:00:00Z&amp;#34;, &amp;#34;to&amp;#34;: &amp;#34;2025-07-15T11:00:00Z&amp;#34; }
  },
  &amp;#34;sort&amp;#34;: &amp;#34;timestamp:desc&amp;#34;,
  &amp;#34;limit&amp;#34;: 100,
  &amp;#34;cursor&amp;#34;: &amp;#34;eyJsYXN0X3RzIjoxNjg...&amp;#34;
}
Response: { &amp;#34;hits&amp;#34;: [...], &amp;#34;total&amp;#34;: 4521, &amp;#34;next_cursor&amp;#34;: &amp;#34;...&amp;#34; }

# Live tail — WebSocket
WS /v1/logs/tail?service=payment-service&amp;amp;level=ERROR
→ Server pushes matching log lines in real time

# Aggregation query (for dashboards)
POST /v1/logs/aggregate
Body: {
  &amp;#34;filters&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;payment-service&amp;#34;, &amp;#34;time_range&amp;#34;: { &amp;#34;last&amp;#34;: &amp;#34;1h&amp;#34; } },
  &amp;#34;group_by&amp;#34;: [&amp;#34;level&amp;#34;],
  &amp;#34;interval&amp;#34;: &amp;#34;1m&amp;#34;,
  &amp;#34;metric&amp;#34;: &amp;#34;count&amp;#34;
}

# --- Alert APIs ---

POST /v1/alerts/rules
Body: {
  &amp;#34;name&amp;#34;: &amp;#34;Payment errors spike&amp;#34;,
  &amp;#34;query&amp;#34;: &amp;#34;level:ERROR AND service:payment-service&amp;#34;,
  &amp;#34;condition&amp;#34;: { &amp;#34;threshold&amp;#34;: 100, &amp;#34;window&amp;#34;: &amp;#34;5m&amp;#34;, &amp;#34;operator&amp;#34;: &amp;#34;&amp;gt;&amp;#34; },
  &amp;#34;actions&amp;#34;: [{ &amp;#34;type&amp;#34;: &amp;#34;pagerduty&amp;#34;, &amp;#34;severity&amp;#34;: &amp;#34;critical&amp;#34; }]
}
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch ingestion&lt;/strong&gt; over single-line writes — amortizes network overhead, enables compression. Agents batch locally (every 1–5 seconds or 1000 lines, whichever comes first).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cursor-based pagination&lt;/strong&gt; instead of offset-based — handles the append-heavy, time-sorted nature of log data without expensive deep-page queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WebSocket for live tail&lt;/strong&gt; — HTTP long-polling would waste connections. WebSocket allows server-push with low latency. Each subscription is a filtered Kafka consumer under the hood.&lt;/li&gt;
&lt;/ul&gt;
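&lt;p&gt;The agent-side batching policy (flush every 1–5 seconds or 1000 lines, whichever comes first) is simple to sketch; the &lt;code&gt;LogBatcher&lt;/code&gt; name and callback shape are assumptions:&lt;/p&gt;

```python
import time

class LogBatcher:
    # Agent-side batching sketch: flush when the batch reaches max_lines
    # or when max_age_seconds has elapsed, whichever comes first.
    def __init__(self, send, max_lines=1000, max_age_seconds=5.0, clock=time.monotonic):
        self.send = send              # callable that ships a list of log lines
        self.max_lines = max_lines
        self.max_age = max_age_seconds
        self.clock = clock
        self.buf = []
        self.oldest = None            # timestamp of the oldest buffered line

    def add(self, line):
        if not self.buf:
            self.oldest = self.clock()
        self.buf.append(line)
        if len(self.buf) >= self.max_lines:
            self.flush()

    def tick(self):
        # Called periodically by the agent's timer loop.
        if self.buf and self.clock() - self.oldest >= self.max_age:
            self.flush()

    def flush(self):
        if self.buf:
            self.send(self.buf)
            self.buf = []
            self.oldest = None
```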
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;log-document-elasticsearch--clickhouse&#34;&gt;Log Document (Elasticsearch / ClickHouse)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Field&lt;/th&gt;
          &lt;th&gt;Type&lt;/th&gt;
          &lt;th&gt;Indexed&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;UUID&lt;/td&gt;
          &lt;td&gt;Primary key&lt;/td&gt;
          &lt;td&gt;Generated at ingestion, used for dedup&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;timestamp&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;DateTime64(ms)&lt;/td&gt;
          &lt;td&gt;Yes (sort key)&lt;/td&gt;
          &lt;td&gt;Nanosecond precision stored, ms for indexing&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;tenant_id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String&lt;/td&gt;
          &lt;td&gt;Yes (partition key)&lt;/td&gt;
          &lt;td&gt;Tenant isolation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;service&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String (keyword)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Exact match, not analyzed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;host&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String (keyword)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Server hostname&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;level&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Enum (TRACE/DEBUG/INFO/WARN/ERROR/FATAL)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Stored as uint8 internally&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;trace_id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String (keyword)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;For distributed tracing correlation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;span_id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String (keyword)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Links to specific trace span&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;message&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Text (analyzed)&lt;/td&gt;
          &lt;td&gt;Yes (full-text)&lt;/td&gt;
          &lt;td&gt;Inverted index for search&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;metadata&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;JSON / Map(String, String)&lt;/td&gt;
          &lt;td&gt;Selective&lt;/td&gt;
          &lt;td&gt;Dynamic fields, selectively indexed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;_raw&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String&lt;/td&gt;
          &lt;td&gt;No&lt;/td&gt;
          &lt;td&gt;Original log line, stored but not indexed&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;index-partitioning-strategy&#34;&gt;Index Partitioning Strategy&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Index naming: logs-{tenant_id}-{YYYY.MM.DD}-{shard_number}

Example: logs-payment-team-2025.07.15-003

Daily indices → easy to delete/archive entire days
Tenant prefix → physical isolation for noisy neighbors
Shard number → distribute within a day (target 30-50 GB per shard)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-elasticsearch--clickhouse-hybrid&#34;&gt;Why Elasticsearch + ClickHouse Hybrid?&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Engine&lt;/th&gt;
          &lt;th&gt;Use Case&lt;/th&gt;
          &lt;th&gt;Strength&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Elasticsearch&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Full-text search, live tail, ad-hoc queries&lt;/td&gt;
          &lt;td&gt;Inverted index excels at arbitrary text search. Near real-time indexing (refresh every 1s).&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Aggregations, dashboards, analytics, alerting&lt;/td&gt;
          &lt;td&gt;Columnar storage gives 10-100x faster GROUP BY, COUNT, and time-series aggregations. Excellent compression.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;S3 / GCS&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Cold &amp;amp; frozen storage&lt;/td&gt;
          &lt;td&gt;11 nines durability, ~$0.023/GB/month vs $0.10/GB for SSD. Parquet format for occasional queries.&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
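&lt;p&gt;The index-naming scheme from the partitioning section above can be wrapped in a small helper (a sketch; the function name is illustrative):&lt;/p&gt;

```python
from datetime import date

def index_name(tenant_id, day, shard):
    """Builds an index name per the scheme above:
    logs-{tenant_id}-{YYYY.MM.DD}-{shard_number}."""
    return f"logs-{tenant_id}-{day:%Y.%m.%d}-{shard:03d}"

print(index_name("payment-team", date(2025, 7, 15), 3))
# logs-payment-team-2025.07.15-003
```

&lt;p&gt;Expiring a whole day of a tenant&amp;#39;s logs then becomes a delete of the matching daily indices, which is what makes this layout cheap to archive.&lt;/p&gt;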
&lt;p&gt;Both engines are populated from the same Kafka topics. Elasticsearch handles interactive search. ClickHouse handles dashboard queries and alert rule evaluation. Cold data is written to object storage in compressed Parquet with only metadata indexed in a lightweight catalog (e.g., Hive Metastore or Iceberg).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Message Queue (RabbitMQ)</title>
      <link>https://chiraghasija.cc/designs/rabbitmq/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/rabbitmq/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Producers publish messages to named exchanges, which route messages to queues based on routing rules&lt;/li&gt;
&lt;li&gt;Consumers subscribe to queues and receive messages with acknowledgment-based delivery (message stays in queue until acked)&lt;/li&gt;
&lt;li&gt;Support multiple exchange types: direct (exact routing key match), topic (pattern matching), fanout (broadcast to all bound queues)&lt;/li&gt;
&lt;li&gt;Dead letter queues: messages that fail processing after N retries are moved to a DLQ for inspection&lt;/li&gt;
&lt;li&gt;Message persistence: critical messages survive broker restarts (durable queues + persistent messages)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the message queue is infrastructure that other services depend on. Downtime cascades.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5ms for message publish (broker acknowledges to producer). &amp;lt; 1ms for message delivery to a connected consumer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ordering:&lt;/strong&gt; Messages within a single queue are delivered in FIFO order. No ordering guarantees across queues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delivery guarantees:&lt;/strong&gt; At-least-once by default (ack-based). At-most-once available (auto-ack). Exactly-once achievable with idempotent consumers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100K messages/sec ingestion, 10,000 queues, 50,000 consumers, 10M messages in flight at any time&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Message publish rate: 100K messages/sec&lt;/li&gt;
&lt;li&gt;Average message size: 2KB (headers + body)&lt;/li&gt;
&lt;li&gt;Ingestion bandwidth: 100K x 2KB = &lt;strong&gt;200 MB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Consumer delivery rate: 100K messages/sec (steady state: publish rate = consume rate)&lt;/li&gt;
&lt;li&gt;With replication (RF=2): internal bandwidth = 200 MB/sec additional&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Messages in flight (unconsumed): 10M messages x 2KB = &lt;strong&gt;20 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak backlog (consumer down for 1 hour): 100K/sec x 3600 x 2KB = &lt;strong&gt;720 GB&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;This is the sizing case for disk: must handle 1-hour consumer outages without data loss&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Per broker (6 brokers, RF=2): ~7 GB in-flight + up to ~240 GB backlog per broker during incidents&lt;/li&gt;
&lt;li&gt;Disk: 500 GB SSD per broker with headroom&lt;/li&gt;
&lt;/ul&gt;
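&lt;p&gt;A quick arithmetic check of the sizing above (decimal units; the 6-broker count and RF=2 come from the estimates, the variable names are mine):&lt;/p&gt;

```python
MSG_RATE = 100_000      # messages/sec
MSG_SIZE = 2_000        # bytes (2 KB, decimal)
BROKERS = 6
RF = 2                  # replication factor

ingest_mb_s = MSG_RATE * MSG_SIZE / 1e6          # publish bandwidth, MB/sec
inflight_gb = 10_000_000 * MSG_SIZE / 1e9        # 10M unconsumed messages
backlog_gb = MSG_RATE * 3600 * MSG_SIZE / 1e9    # 1-hour consumer outage

# Each message is stored RF times, spread evenly across the brokers.
per_broker_gb = (inflight_gb + backlog_gb) * RF / BROKERS
print(ingest_mb_s, backlog_gb, round(per_broker_gb))  # 200.0 720.0 247
```

&lt;p&gt;Roughly 250 GB per broker in the worst case is what justifies 500 GB SSDs with about 2x headroom.&lt;/p&gt;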
&lt;h3 id=&#34;memory&#34;&gt;Memory&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Message metadata index: 10M messages x 100 bytes = &lt;strong&gt;1 GB&lt;/strong&gt; — fits easily in RAM&lt;/li&gt;
&lt;li&gt;Queue metadata: 10,000 queues x 1KB = &lt;strong&gt;10 MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Consumer connection state: 50,000 consumers x 2KB = &lt;strong&gt;100 MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Total memory per broker: ~8 GB for metadata + OS page cache for message data&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;publisher-api&#34;&gt;Publisher API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Declare an exchange
PUT /exchanges/{vhost}/{exchange_name}
  Body: {
    &amp;#34;type&amp;#34;: &amp;#34;topic&amp;#34;,                    // direct, topic, fanout, headers
    &amp;#34;durable&amp;#34;: true,                    // survive broker restart
    &amp;#34;auto_delete&amp;#34;: false                // delete when no queues bound
  }

// Publish a message
POST /exchanges/{vhost}/{exchange_name}/publish
  Body: {
    &amp;#34;routing_key&amp;#34;: &amp;#34;order.created&amp;#34;,
    &amp;#34;properties&amp;#34;: {
      &amp;#34;delivery_mode&amp;#34;: 2,              // 1=transient, 2=persistent
      &amp;#34;content_type&amp;#34;: &amp;#34;application/json&amp;#34;,
      &amp;#34;message_id&amp;#34;: &amp;#34;msg_abc123&amp;#34;,      // for deduplication
      &amp;#34;correlation_id&amp;#34;: &amp;#34;req_xyz&amp;#34;,     // for RPC-style patterns
      &amp;#34;expiration&amp;#34;: &amp;#34;60000&amp;#34;,           // TTL in milliseconds
      &amp;#34;headers&amp;#34;: {&amp;#34;x-retry-count&amp;#34;: 0}
    },
    &amp;#34;payload&amp;#34;: &amp;#34;{\&amp;#34;order_id\&amp;#34;: 12345, \&amp;#34;amount\&amp;#34;: 99.99}&amp;#34;
  }
  Response 200: { &amp;#34;routed&amp;#34;: true }     // message was routed to at least one queue
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;consumer-api-amqp-protocol-shown-as-pseudo-rest&#34;&gt;Consumer API (AMQP protocol, shown as pseudo-REST)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Declare a queue
PUT /queues/{vhost}/{queue_name}
  Body: {
    &amp;#34;durable&amp;#34;: true,
    &amp;#34;exclusive&amp;#34;: false,                // exclusive to this connection
    &amp;#34;auto_delete&amp;#34;: false,
    &amp;#34;arguments&amp;#34;: {
      &amp;#34;x-dead-letter-exchange&amp;#34;: &amp;#34;dlx&amp;#34;,
      &amp;#34;x-dead-letter-routing-key&amp;#34;: &amp;#34;dead.order&amp;#34;,
      &amp;#34;x-message-ttl&amp;#34;: 300000,        // 5 minutes
      &amp;#34;x-max-length&amp;#34;: 1000000,        // max 1M messages
      &amp;#34;x-max-priority&amp;#34;: 10            // enable priority queue
    }
  }

// Bind queue to exchange
POST /bindings/{vhost}/e/{exchange}/q/{queue}
  Body: { &amp;#34;routing_key&amp;#34;: &amp;#34;order.*&amp;#34; }   // pattern for topic exchange

// Consume messages (long-lived connection, push-based)
// In practice: AMQP channel with basic.consume
// Simplified:
GET /queues/{vhost}/{queue_name}/get?count=10&amp;amp;ack_mode=manual
  Response 200: {
    &amp;#34;messages&amp;#34;: [
      {
        &amp;#34;delivery_tag&amp;#34;: 1,             // unique per channel, used for ack/nack
        &amp;#34;exchange&amp;#34;: &amp;#34;orders&amp;#34;,
        &amp;#34;routing_key&amp;#34;: &amp;#34;order.created&amp;#34;,
        &amp;#34;properties&amp;#34;: {...},
        &amp;#34;payload&amp;#34;: &amp;#34;{\&amp;#34;order_id\&amp;#34;: 12345, ...}&amp;#34;,
        &amp;#34;redelivered&amp;#34;: false
      },
      ...
    ]
  }

// Acknowledge message (consumed successfully)
POST /queues/{vhost}/{queue_name}/ack
  Body: { &amp;#34;delivery_tag&amp;#34;: 1, &amp;#34;multiple&amp;#34;: false }

// Negative acknowledge (processing failed, requeue or dead-letter)
POST /queues/{vhost}/{queue_name}/nack
  Body: { &amp;#34;delivery_tag&amp;#34;: 1, &amp;#34;requeue&amp;#34;: false }  // false → route to DLQ
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Push-based delivery (broker pushes to consumers) — lower latency than polling, consumer prefetch controls flow&lt;/li&gt;
&lt;li&gt;Manual acknowledgment by default — messages are not removed until consumer confirms processing success&lt;/li&gt;
&lt;li&gt;DLQ routing on nack — failed messages automatically move to dead letter queue for debugging&lt;/li&gt;
&lt;/ul&gt;
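&lt;p&gt;A toy model of the acknowledgment semantics above (illustrative only, not the AMQP wire protocol): a delivered message is held as unacked until the consumer acks it, and a nack with requeue=false dead-letters it.&lt;/p&gt;

```python
from collections import deque

class ToyQueue:
    """Models the ack semantics above: a delivered message stays 'unacked'
    until the consumer acks; nack with requeue=False dead-letters it."""
    def __init__(self):
        self.ready = deque()     # messages waiting for delivery
        self.unacked = {}        # delivery_tag to message, in flight
        self.dlq = []            # dead-lettered messages
        self._tag = 0
    def publish(self, payload):
        self.ready.append(payload)
    def deliver(self):
        self._tag += 1
        self.unacked[self._tag] = self.ready.popleft()
        return self._tag         # delivery_tag, used for ack/nack
    def ack(self, tag):
        del self.unacked[tag]    # only now is the message safe to drop
    def nack(self, tag, requeue=False):
        msg = self.unacked.pop(tag)
        if requeue:
            self.ready.appendleft(msg)   # back at the head of the queue
        else:
            self.dlq.append(msg)         # route to the dead letter queue

q = ToyQueue()
q.publish("order-1"); q.publish("order-2")
q.ack(q.deliver())                  # consumed successfully
q.nack(q.deliver(), requeue=False)  # processing failed
print(q.dlq)                        # ['order-2']
```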
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;message-on-disk-and-in-flight&#34;&gt;Message (on-disk and in-flight)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Message:
  message_id       | uuid          -- publisher-assigned or broker-generated
  exchange         | string
  routing_key      | string
  body             | bytes         -- up to 128MB (configurable)
  properties:
    delivery_mode  | int (1 or 2)  -- transient vs persistent
    content_type   | string
    correlation_id | string
    reply_to       | string        -- for RPC pattern
    expiration     | string        -- TTL
    timestamp      | int64
    priority       | int (0-255)
    headers        | map&amp;lt;str,any&amp;gt;
  metadata:
    publish_seq    | int64         -- publisher confirm sequence number
    queue_position | int64         -- position in queue (for ordering)
    delivery_count | int           -- number of delivery attempts
    first_death_exchange | string  -- for DLQ: original exchange
    first_death_queue    | string  -- for DLQ: original queue
    first_death_reason   | string  -- rejected, expired, maxlen
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;queue-state-per-queue-in-memory--wal&#34;&gt;Queue State (per queue, in memory + WAL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Queue:
  name             | string (PK)
  vhost            | string
  durable          | boolean
  state:
    messages_ready      | int     -- messages waiting for delivery
    messages_unacked    | int     -- delivered but not yet acknowledged
    messages_total      | int
    consumers           | int     -- number of active consumers
    head_position       | int64   -- next message to deliver
    tail_position       | int64   -- where next publish lands

Message Index (in-memory):
  TreeMap&amp;lt;queue_position, MessageRef&amp;gt;
  MessageRef: {store_offset, size, expiry, priority}

  For priority queues: use a priority heap instead of a simple FIFO
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;persistent-message-store-per-broker-disk&#34;&gt;Persistent Message Store (per broker, disk)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Message Store (append-only segments):
  ┌──────────────────────────────┐
  │ Segment 1 (16MB)             │
  │  [msg1][msg2][msg3]...       │
  │ Segment 2 (16MB)             │
  │  [msg4][msg5]...             │
  └──────────────────────────────┘

  On publish (persistent msg):
    1. Append message body to current segment
    2. fsync (or batch fsync every 100ms for throughput)
    3. Add to queue&amp;#39;s in-memory index

  On acknowledge:
    1. Mark message as acked in index
    2. Segment compaction: when all messages in a segment are acked, delete segment

Bindings (in metadata store):
  Exchange → Queue mappings:
    (&amp;#34;orders&amp;#34;, &amp;#34;topic&amp;#34;) → [
      (&amp;#34;order-processing-queue&amp;#34;, &amp;#34;order.created&amp;#34;),
      (&amp;#34;analytics-queue&amp;#34;, &amp;#34;order.*&amp;#34;),
      (&amp;#34;audit-queue&amp;#34;, &amp;#34;#&amp;#34;)       // # matches everything in topic exchange
    ]
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-these-choices&#34;&gt;Why These Choices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Append-only segment files&lt;/strong&gt; — high write throughput, sequential I/O. Similar to Kafka&amp;rsquo;s log segments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In-memory index + on-disk messages&lt;/strong&gt; — index is small (100 bytes per message), messages can be large (KBs). Hot messages served from page cache, cold messages from disk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per-queue ordering&lt;/strong&gt; — each queue is an independent FIFO (or priority queue). No cross-queue coordination needed.&lt;/li&gt;
&lt;/ul&gt;
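&lt;p&gt;The append-then-compact lifecycle can be sketched as a toy in-memory store (real brokers append to disk segments and fsync; the class and method names are illustrative):&lt;/p&gt;

```python
class SegmentStore:
    """Toy append-only store: messages fill fixed-size segments, and a
    segment is deleted once every message in it has been acknowledged."""
    def __init__(self, segment_capacity=3):
        self.capacity = segment_capacity
        self.segments = []               # each segment maps msg_id to an acked flag
    def append(self, msg_id):
        if not self.segments or len(self.segments[-1]) == self.capacity:
            self.segments.append({})     # roll over to a new segment
        self.segments[-1][msg_id] = False
    def ack(self, msg_id):
        for seg in self.segments:
            if msg_id in seg:
                seg[msg_id] = True
        # compaction: drop any segment whose messages are all acked
        self.segments = [s for s in self.segments if not all(s.values())]

store = SegmentStore(segment_capacity=3)
for m in ["m1", "m2", "m3", "m4"]:
    store.append(m)                      # two segments: [m1,m2,m3] and [m4]
for m in ["m1", "m2", "m3"]:
    store.ack(m)                         # first segment fully acked, deleted
print(len(store.segments))               # 1
```

&lt;p&gt;Deleting whole segments instead of individual messages is why acks never trigger random-access disk writes.&lt;/p&gt;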
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Producers                              Consumers
  │                                       ▲
  │ AMQP / HTTP                           │ AMQP (push)
  ▼                                       │
┌──────────────────────────────────────────────────────┐
│                    Load Balancer                     │
│              (TCP passthrough, sticky)               │
└────────────────────────┬─────────────────────────────┘
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  Broker 1    │ │  Broker 2    │ │  Broker 3    │
│              │ │              │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Exchange │ │ │ │ Exchange │ │ │ │ Exchange │ │
│ │ Router   │ │ │ │ Router   │ │ │ │ Router   │ │
│ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │
│      │       │ │      │       │ │      │       │
│ ┌────▼─────┐ │ │ ┌────▼─────┐ │ │ ┌────▼─────┐ │
│ │ Queue    │ │ │ │ Queue    │ │ │ │ Queue    │ │
│ │ Manager  │ │ │ │ Manager  │ │ │ │ Manager  │ │
│ │          │ │ │ │          │ │ │ │          │ │
│ │ Q1 (lead)│ │ │ │ Q1 (mirr)│ │ │ │ Q4 (lead)│ │
│ │ Q2 (lead)│ │ │ │ Q3 (lead)│ │ │ │ Q5 (lead)│ │
│ │ Q3 (mirr)│ │ │ │ Q4 (mirr)│ │ │ │ Q2 (mirr)│ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
│              │ │              │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Message  │ │ │ │ Message  │ │ │ │ Message  │ │
│ │ Store    │ │ │ │ Store    │ │ │ │ Store    │ │
│ │ (Disk)   │ │ │ │ (Disk)   │ │ │ │ (Disk)   │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
└──────────────┘ └──────────────┘ └──────────────┘
        │                │                │
        └────────────────┼────────────────┘
                         │
              ┌──────────▼──────────┐
              │ Cluster Coordinator │
              │ (Raft / mnesia)     │
              │ - Queue leadership  │
              │ - Membership        │
              │ - Exchange/binding  │
              │   metadata          │
              └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;publish-flow&#34;&gt;Publish Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Producer → Broker (any broker via load balancer)
  │
  ▼
1. Receive AMQP basic.publish frame
2. Exchange Router:
   a. Look up exchange by name
   b. Get all bindings for this exchange
   c. Match routing_key against binding patterns:
      - Direct: exact match (routing_key == binding_key)
      - Topic: pattern match (&amp;#34;order.created&amp;#34; matches &amp;#34;order.*&amp;#34; and &amp;#34;#&amp;#34;)
      - Fanout: all bound queues (ignore routing key)
   d. Result: list of queues to route to

3. For each target queue:
   a. If queue leader is on THIS broker:
      → Append to local message store (disk write if persistent)
      → Add to queue&amp;#39;s in-memory index
      → Replicate to mirror brokers (sync or async based on config)
   b. If queue leader is on ANOTHER broker:
      → Forward message to that broker (internal cluster protocol)

4. Publisher confirm:
   → If publisher confirms enabled: send basic.ack to producer with sequence number
   → Only after message is persisted + replicated to mirrors (if ha-mode=all)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;consume-flow&#34;&gt;Consume Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Consumer → Broker (connects to any broker, but deliveries come from queue leader)
  │
  ▼
1. Consumer subscribes: basic.consume(queue=&amp;#34;order-processing&amp;#34;)
2. Broker locates queue leader:
   a. If leader is on THIS broker: deliver directly
   b. If leader is on another broker: proxy the channel to leader broker

3. Queue leader delivers messages:
   a. Check prefetch limit (QoS): consumer has capacity for more messages?
   b. If yes: pop next message from queue head
   c. Mark message as &amp;#34;unacked&amp;#34; (in flight to consumer)
   d. Send basic.deliver frame to consumer
   e. Start ack timeout timer (30 seconds default)

4. Consumer processes message:
   a. Success → basic.ack(delivery_tag)
      → Broker removes message from queue and disk
   b. Failure → basic.nack(delivery_tag, requeue=false)
      → Broker routes to dead letter exchange (if configured)
      → Or basic.nack(delivery_tag, requeue=true) → put back at head of queue

5. Timeout (consumer doesn&amp;#39;t ack within 30s):
   → Message requeued (redelivered flag set to true)
   → Delivered to same or different consumer
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Brokers (3-6):&lt;/strong&gt; Each broker handles connections, routing, queue management, and message storage. Queues are distributed across brokers — each queue has a &amp;ldquo;leader&amp;rdquo; broker.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exchange Router:&lt;/strong&gt; In-memory routing table. Evaluates routing rules per exchange type. Direct exchange: O(1) hash lookup. Topic exchange: O(B) where B = number of bindings (trie-based optimization available). Fanout: O(Q) where Q = bound queues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Queue Manager:&lt;/strong&gt; Manages queue state: ready messages, unacked messages, consumers, prefetch counters. One leader per queue, with mirrors on other brokers for HA.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Message Store:&lt;/strong&gt; Per-broker append-only segment files. Persistent messages fsync&amp;rsquo;d to disk. Segment compaction removes fully-acknowledged messages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cluster Coordinator:&lt;/strong&gt; Manages cluster membership, queue leader election, exchange/binding metadata. Uses Raft consensus (RabbitMQ 3.x used Mnesia/Erlang distribution).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Management UI/API:&lt;/strong&gt; HTTP API for monitoring: queue depths, message rates, consumer counts, connection status. Grafana dashboards for operations.&lt;/li&gt;
&lt;/ol&gt;
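&lt;p&gt;The topic-exchange matching rules from the publish flow (&lt;code&gt;*&lt;/code&gt; matches exactly one dot-separated word, &lt;code&gt;#&lt;/code&gt; matches zero or more) can be sketched as follows; a minimal illustration, not RabbitMQ&amp;#39;s trie-based implementation:&lt;/p&gt;

```python
def topic_match(pattern, key):
    """AMQP-style topic matching: '*' matches exactly one dot-separated
    word, '#' matches zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' absorbs zero or more words of the key
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return (p[0] == "*" or p[0] == k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), key.split("."))

# The bindings from the data-model section:
assert topic_match("order.*", "order.created")
assert topic_match("#", "order.created.v2")
assert not topic_match("order.*", "order.created.v2")  # '*' is one word only
```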
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-delivery-guarantees--at-least-once-at-most-once-exactly-once&#34;&gt;Deep Dive 1: Delivery Guarantees — At-Least-Once, At-Most-Once, Exactly-Once&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;At-most-once delivery:&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Metrics Logging and Aggregation System</title>
      <link>https://chiraghasija.cc/designs/metrics-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/metrics-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Ingest metrics from thousands of services (CPU, memory, request latency, error rate, custom business metrics)&lt;/li&gt;
&lt;li&gt;Store metrics with high granularity (per-second) for recent data, lower granularity (per-minute, per-hour) for historical data&lt;/li&gt;
&lt;li&gt;Query metrics: time-range aggregations (avg, sum, p50, p95, p99, max, min)&lt;/li&gt;
&lt;li&gt;Real-time dashboards with auto-refresh (&amp;lt; 30 second data freshness)&lt;/li&gt;
&lt;li&gt;Alerting: trigger alerts when metrics cross thresholds (e.g., p99 latency &amp;gt; 500ms for 5 minutes)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% for ingestion (losing metrics is acceptable during brief outages — we don&amp;rsquo;t want to lose all data, but a few seconds of data loss during failover is tolerable). 99.99% for querying (dashboards must be up during incidents — that&amp;rsquo;s when they&amp;rsquo;re needed most).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Ingestion → queryable in &amp;lt; 30 seconds. Dashboard queries: simple queries &amp;lt; 500ms, complex aggregations &amp;lt; 5 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M metrics data points/sec ingestion. 1 year retention at full granularity, 3 years at reduced granularity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Metrics data should survive individual node failures. Total loss acceptable only for catastrophic events.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ingestion&#34;&gt;Ingestion&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M data points/sec&lt;/li&gt;
&lt;li&gt;Each data point: metric_name (100B) + tags (200B) + timestamp (8B) + value (8B) = ~316 bytes → round to 300 bytes&lt;/li&gt;
&lt;li&gt;10M × 300 bytes = &lt;strong&gt;3GB/sec&lt;/strong&gt; ingestion throughput&lt;/li&gt;
&lt;li&gt;Per day: &lt;strong&gt;~260TB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Raw data at 1-second granularity: 260TB/day&lt;/li&gt;
&lt;li&gt;With compression (time-series data compresses well): ~10:1 → &lt;strong&gt;26TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;1 year retention: &lt;strong&gt;~9.5PB&lt;/strong&gt; compressed&lt;/li&gt;
&lt;li&gt;Rollup data (1-min, 1-hour granularity): ~1% of raw → additional ~100TB/year&lt;/li&gt;
&lt;/ul&gt;
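&lt;p&gt;A quick check of the ingestion and storage arithmetic above (decimal units; the 10:1 compression ratio is the assumption stated in the estimates, the variable names are mine):&lt;/p&gt;

```python
DPS = 10_000_000                 # data points per second
POINT_BYTES = 300                # rounded per-point size from above

ingest_gb_s = DPS * POINT_BYTES / 1e9        # 3.0 GB/sec
raw_tb_day = ingest_gb_s * 86_400 / 1e3      # ~259 TB/day raw
compressed_tb_day = raw_tb_day / 10          # ~26 TB/day at 10:1
year_pb = compressed_tb_day * 365 / 1e3      # ~9.5 PB for 1-year retention
print(round(raw_tb_day), round(compressed_tb_day), round(year_pb, 1))
```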
&lt;h3 id=&#34;query-patterns&#34;&gt;Query patterns&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Dashboard queries: typically 1-6 hour time ranges, 5-10 metrics&lt;/li&gt;
&lt;li&gt;Alert evaluation: thousands of alert rules evaluated every 60 seconds&lt;/li&gt;
&lt;li&gt;Ad-hoc queries: arbitrary time ranges, complex aggregations&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ingestion-api&#34;&gt;Ingestion API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/metrics
  Body: [
    {
      &amp;#34;metric&amp;#34;: &amp;#34;http.request.latency&amp;#34;,
      &amp;#34;tags&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;api-gateway&amp;#34;, &amp;#34;endpoint&amp;#34;: &amp;#34;/v1/users&amp;#34;, &amp;#34;method&amp;#34;: &amp;#34;GET&amp;#34;, &amp;#34;region&amp;#34;: &amp;#34;us-east-1&amp;#34; },
      &amp;#34;timestamp&amp;#34;: 1708632000,
      &amp;#34;value&amp;#34;: 45.2
    },
    {
      &amp;#34;metric&amp;#34;: &amp;#34;http.request.count&amp;#34;,
      &amp;#34;tags&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;api-gateway&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;200&amp;#34; },
      &amp;#34;timestamp&amp;#34;: 1708632000,
      &amp;#34;value&amp;#34;: 1
    }
  ]
  Response 202 Accepted

// StatsD/Prometheus-compatible push
UDP /statsd (fire-and-forget, lossy but fast)
  Payload: &amp;#34;http.request.latency:45.2|ms|#service:api-gateway,endpoint:/v1/users&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;query-api&#34;&gt;Query API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/query
  Body: {
    &amp;#34;metric&amp;#34;: &amp;#34;http.request.latency&amp;#34;,
    &amp;#34;aggregation&amp;#34;: &amp;#34;p99&amp;#34;,
    &amp;#34;tags&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;api-gateway&amp;#34; },
    &amp;#34;from&amp;#34;: &amp;#34;2026-02-22T12:00:00Z&amp;#34;,
    &amp;#34;to&amp;#34;: &amp;#34;2026-02-22T18:00:00Z&amp;#34;,
    &amp;#34;interval&amp;#34;: &amp;#34;1m&amp;#34;       // rollup interval
  }
  Response 200: {
    &amp;#34;series&amp;#34;: [
      { &amp;#34;timestamp&amp;#34;: 1708603200, &amp;#34;value&amp;#34;: 42.3 },
      { &amp;#34;timestamp&amp;#34;: 1708603260, &amp;#34;value&amp;#34;: 45.1 },
      ...
    ]
  }

// Multi-metric query (for dashboards)
POST /api/v1/query/batch
  Body: {
    &amp;#34;queries&amp;#34;: [
      { &amp;#34;metric&amp;#34;: &amp;#34;cpu.usage&amp;#34;, &amp;#34;aggregation&amp;#34;: &amp;#34;avg&amp;#34;, &amp;#34;tags&amp;#34;: {...}, ... },
      { &amp;#34;metric&amp;#34;: &amp;#34;memory.usage&amp;#34;, &amp;#34;aggregation&amp;#34;: &amp;#34;max&amp;#34;, &amp;#34;tags&amp;#34;: {...}, ... }
    ]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;alert-api&#34;&gt;Alert API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/alerts
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;High API Latency&amp;#34;,
    &amp;#34;metric&amp;#34;: &amp;#34;http.request.latency&amp;#34;,
    &amp;#34;aggregation&amp;#34;: &amp;#34;p99&amp;#34;,
    &amp;#34;tags&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;api-gateway&amp;#34; },
    &amp;#34;condition&amp;#34;: &amp;#34;above&amp;#34;,
    &amp;#34;threshold&amp;#34;: 500,
    &amp;#34;duration&amp;#34;: &amp;#34;5m&amp;#34;,
    &amp;#34;channels&amp;#34;: [&amp;#34;slack:#oncall&amp;#34;, &amp;#34;pagerduty:team-infra&amp;#34;]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;time-series-storage-custom-or-tsdb-like-clickhousetimescaledb&#34;&gt;Time-Series Storage (custom or TSDB like ClickHouse/TimescaleDB)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Conceptual schema:
  metric_name    | string
  tag_set        | map&amp;lt;string, string&amp;gt;  (sorted, for consistent hashing)
  timestamp      | int64 (Unix epoch seconds)
  value          | float64

Physical storage (column-oriented):
  Series ID = hash(metric_name + sorted_tags)

  Series metadata table:
    series_id    (PK) | bigint
    metric_name        | string
    tags               | map&amp;lt;string, string&amp;gt;

  Data points table (partitioned by time):
    series_id          | bigint
    timestamp          | int64
    value              | float64
    Partition key: (series_id, time_bucket)
    Clustering: timestamp ASC
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;Key design choice:&lt;/strong&gt; Column-oriented storage. Time-series queries almost always read a specific metric over a time range (columnar scans). Row-oriented storage would waste I/O reading irrelevant columns.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Stream Processing System (Kafka)</title>
      <link>https://chiraghasija.cc/designs/kafka/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/kafka/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Producers publish messages to named topics with optional partitioning keys&lt;/li&gt;
&lt;li&gt;Consumers subscribe to topics and read messages in order within each partition&lt;/li&gt;
&lt;li&gt;Support consumer groups — each message is delivered to exactly one consumer within a group (load balancing), but to all groups (broadcast)&lt;/li&gt;
&lt;li&gt;Messages are durably persisted for a configurable retention period (default 7 days)&lt;/li&gt;
&lt;li&gt;Support message replay — consumers can seek to any offset and re-consume&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the messaging backbone cannot go down without cascading failures across the entire platform&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 10ms end-to-end for p99 publish-to-consume (single datacenter). Throughput is prioritized over single-message latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Messages within a partition are strictly ordered. Across partitions, no ordering guarantee. At-least-once delivery by default; exactly-once semantics available with idempotent producers + transactional consumers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1M+ messages/sec ingestion, 10TB+ daily throughput, 1000+ topics, 10,000+ partitions across the cluster&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero message loss for acknowledged writes. Replication factor of 3 minimum for production topics.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Write throughput: 1M messages/sec&lt;/li&gt;
&lt;li&gt;Average message size: 1KB&lt;/li&gt;
&lt;li&gt;Write bandwidth: 1M x 1KB = &lt;strong&gt;1 GB/sec&lt;/strong&gt; ingress&lt;/li&gt;
&lt;li&gt;Replication factor 3: 1 GB/sec x 3 = &lt;strong&gt;3 GB/sec&lt;/strong&gt; internal replication traffic&lt;/li&gt;
&lt;li&gt;Read throughput: Assume 5 consumer groups on average → &lt;strong&gt;5 GB/sec&lt;/strong&gt; egress&lt;/li&gt;
&lt;li&gt;Total network: ~9 GB/sec across the cluster&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Daily ingestion: 1 GB/sec x 86,400 = &lt;strong&gt;86.4 TB/day&lt;/strong&gt; (before replication)&lt;/li&gt;
&lt;li&gt;With replication factor 3: &lt;strong&gt;259 TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;7-day retention: &lt;strong&gt;1.8 PB&lt;/strong&gt; raw storage&lt;/li&gt;
&lt;li&gt;Each broker: 12 x 4TB SSDs = 48TB per broker → need ~38 brokers for storage alone&lt;/li&gt;
&lt;li&gt;In practice: &lt;strong&gt;50-60 brokers&lt;/strong&gt; (headroom for CPU, network, rebalancing)&lt;/li&gt;
&lt;/ul&gt;
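&lt;p&gt;The storage and broker-count arithmetic above can be reproduced with a short back-of-envelope script (a sketch; the 12 × 4TB broker disk layout is the one assumed in the bullets above):&lt;/p&gt;

```python
# Back-of-envelope sizing for the messaging cluster, using the
# figures from the estimation section.
MSGS_PER_SEC = 1_000_000
MSG_SIZE_BYTES = 1_000          # 1 KB average message
REPLICATION_FACTOR = 3
RETENTION_DAYS = 7
BROKER_DISK_TB = 12 * 4         # 12 x 4TB SSDs per broker

ingress_gb_s = MSGS_PER_SEC * MSG_SIZE_BYTES / 1e9            # 1 GB/sec
daily_tb = ingress_gb_s * 86_400 / 1_000                      # 86.4 TB/day pre-replication
replicated_daily_tb = daily_tb * REPLICATION_FACTOR           # ~259 TB/day
retained_pb = replicated_daily_tb * RETENTION_DAYS / 1_000    # ~1.8 PB for 7 days
brokers_for_storage = retained_pb * 1_000 / BROKER_DISK_TB    # ~38 brokers

print(f"{daily_tb:.1f} TB/day, {retained_pb:.2f} PB retained, "
      f"~{brokers_for_storage:.0f} brokers for storage alone")
```

&lt;p&gt;In practice the broker count is padded well past the storage minimum, as noted above, to leave headroom for CPU, network, and rebalancing.&lt;/p&gt;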
&lt;h3 id=&#34;partitions&#34;&gt;Partitions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1000 topics x average 10 partitions = 10,000 partitions&lt;/li&gt;
&lt;li&gt;Each partition is an append-only log, stored on disk as a sequence of segment files&lt;/li&gt;
&lt;li&gt;Partition leader handles all reads/writes → must balance leaders evenly across brokers&lt;/li&gt;
&lt;/ul&gt;
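&lt;p&gt;Leader balancing can be illustrated with round-robin replica assignment in the spirit of Kafka&amp;rsquo;s default assignor (a simplified sketch, not the exact algorithm):&lt;/p&gt;

```python
def assign_replicas(num_partitions: int, num_brokers: int, rf: int = 3):
    """Round-robin leader placement, with followers on the next brokers
    in the ring. Returns {partition: [leader, follower, ...]} broker ids."""
    assignment = {}
    for p in range(num_partitions):
        leader = p % num_brokers
        assignment[p] = [(leader + i) % num_brokers for i in range(rf)]
    return assignment

assignment = assign_replicas(num_partitions=12, num_brokers=4)
leaders = [replicas[0] for replicas in assignment.values()]
# Each of the 4 brokers leads exactly 3 of the 12 partitions.
print({b: leaders.count(b) for b in range(4)})
```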
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;producer-api&#34;&gt;Producer API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Publish a message
POST /topics/{topic}/messages
  Body: {
    &amp;#34;key&amp;#34;: &amp;#34;user_123&amp;#34;,              // optional partition key
    &amp;#34;value&amp;#34;: &amp;#34;&amp;lt;base64-encoded payload&amp;gt;&amp;#34;,
    &amp;#34;headers&amp;#34;: {&amp;#34;trace_id&amp;#34;: &amp;#34;abc&amp;#34;}, // optional metadata
    &amp;#34;partition&amp;#34;: null                // null = use key hash; int = explicit partition
  }
  Response 200: {
    &amp;#34;topic&amp;#34;: &amp;#34;user-events&amp;#34;,
    &amp;#34;partition&amp;#34;: 7,
    &amp;#34;offset&amp;#34;: 48291034,
    &amp;#34;timestamp&amp;#34;: 1708632060000
  }

// Batch publish (preferred for throughput)
POST /topics/{topic}/messages/batch
  Body: { &amp;#34;messages&amp;#34;: [ ... ] }
  Response 200: { &amp;#34;offsets&amp;#34;: [ ... ] }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;consumer-api&#34;&gt;Consumer API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Subscribe to topic(s) with a consumer group
POST /consumers/{group_id}/subscribe
  Body: { &amp;#34;topics&amp;#34;: [&amp;#34;user-events&amp;#34;, &amp;#34;page-views&amp;#34;] }

// Poll for messages
GET /consumers/{group_id}/poll?max_records=500&amp;amp;timeout_ms=1000
  Response 200: {
    &amp;#34;messages&amp;#34;: [
      {&amp;#34;topic&amp;#34;: &amp;#34;user-events&amp;#34;, &amp;#34;partition&amp;#34;: 7, &amp;#34;offset&amp;#34;: 48291034, &amp;#34;key&amp;#34;: &amp;#34;user_123&amp;#34;, &amp;#34;value&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;timestamp&amp;#34;: ...},
      ...
    ]
  }

// Commit offsets (acknowledge processing)
POST /consumers/{group_id}/offsets
  Body: { &amp;#34;offsets&amp;#34;: {&amp;#34;user-events&amp;#34;: {&amp;#34;7&amp;#34;: 48291035}} }

// Seek to specific offset (replay)
POST /consumers/{group_id}/seek
  Body: { &amp;#34;topic&amp;#34;: &amp;#34;user-events&amp;#34;, &amp;#34;partition&amp;#34;: 7, &amp;#34;offset&amp;#34;: 48000000 }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Batch publishing is the primary path — amortizes network round trips&lt;/li&gt;
&lt;li&gt;Consumer pull model (not push) — consumers control their own pace, natural backpressure&lt;/li&gt;
&lt;li&gt;Offsets are committed by consumers, not auto-acked — enables at-least-once and exactly-once semantics&lt;/li&gt;
&lt;/ul&gt;
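&lt;p&gt;The batch-first publish path can be sketched as a per-partition buffer that flushes on a size or time threshold, mirroring the batch.size / linger.ms settings that appear in the write path below (a simplified sketch):&lt;/p&gt;

```python
import time

class BatchingProducer:
    """Buffers messages per partition and flushes a batch when it reaches
    batch_size bytes or linger_ms elapses (sketch of batch.size/linger.ms)."""
    def __init__(self, send, batch_size=16_384, linger_ms=5):
        self.send = send                    # callable(partition, [messages])
        self.batch_size = batch_size
        self.linger_s = linger_ms / 1000
        self.batches = {}                   # partition -> (first_ts, msgs, nbytes)

    def publish(self, partition, msg: bytes):
        ts, msgs, nbytes = self.batches.get(
            partition, (time.monotonic(), [], 0))
        msgs.append(msg)
        nbytes += len(msg)
        self.batches[partition] = (ts, msgs, nbytes)
        if nbytes >= self.batch_size or time.monotonic() - ts >= self.linger_s:
            self.flush(partition)

    def flush(self, partition):
        _, msgs, _ = self.batches.pop(partition, (0, [], 0))
        if msgs:
            self.send(partition, list(msgs))

sent = []
# linger stretched to 1s so this demo flushes purely on batch size.
producer = BatchingProducer(
    send=lambda part, msgs: sent.append((part, len(msgs))),
    batch_size=10, linger_ms=1000)
for _ in range(4):
    producer.publish(0, b"abc")     # 3 bytes each; the 4th crosses 10 bytes
print(sent)
```

&lt;p&gt;Flushing on whichever threshold trips first keeps tail latency bounded while still amortizing network round trips.&lt;/p&gt;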
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;message-on-disk-log-format&#34;&gt;Message (on-disk log format)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Record:
  offset          | int64       -- monotonically increasing per partition
  timestamp       | int64       -- producer-set or broker-set
  key             | bytes       -- nullable, used for partitioning
  value           | bytes       -- the payload
  headers         | map&amp;lt;str,str&amp;gt;-- metadata
  crc32           | int32       -- integrity check
  attributes      | int8        -- compression codec, timestamp type
  batch_offset    | int32       -- offset within producer batch (for idempotency)
  producer_id     | int64       -- for exactly-once (idempotent producer)
  producer_epoch  | int16       -- for exactly-once (fencing)
  sequence_num    | int32       -- for exactly-once (dedup)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;topic-metadata-zookeeperkraft&#34;&gt;Topic Metadata (ZooKeeper/KRaft)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Topic:
  topic_name      | string (PK)
  num_partitions  | int
  replication_factor | int
  config_overrides   | map       -- retention.ms, segment.bytes, etc.

Partition Assignment:
  topic_name      | string
  partition_id    | int
  leader_broker   | int
  isr             | list&amp;lt;int&amp;gt;   -- in-sync replicas
  replicas        | list&amp;lt;int&amp;gt;   -- all assigned replicas
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;consumer-group-offsets&#34;&gt;Consumer Group Offsets&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;__consumer_offsets (internal compacted topic):
  Key: {group_id, topic, partition}
  Value: {committed_offset, metadata, timestamp}
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-this-storage-model&#34;&gt;Why This Storage Model&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Append-only log on local disk&lt;/strong&gt; — sequential writes are the fastest possible I/O pattern. HDDs do 200MB/s sequential, SSDs do 2GB/s+. No random seeks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No traditional database&lt;/strong&gt; — the log IS the database. No index overhead, no B-tree maintenance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumer offsets stored as a compacted topic&lt;/strong&gt; — self-hosted, replicated, no external DB dependency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KRaft (replacing ZooKeeper)&lt;/strong&gt; — metadata managed via Raft consensus among controller brokers, eliminating the ZK dependency.&lt;/li&gt;
&lt;/ul&gt;
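&lt;p&gt;The log-as-database idea is easy to see in miniature: an in-memory sketch of one partition log where the offset is simply a position (real brokers use on-disk segment files plus a sparse offset index):&lt;/p&gt;

```python
class PartitionLog:
    """Minimal append-only partition log: the offset is the list index."""
    def __init__(self):
        self.records = []

    def append(self, key, value):
        offset = len(self.records)        # next monotonically increasing offset
        self.records.append((key, value))
        return offset

    def read(self, offset, max_records=100):
        """Reads never mutate the log, so any consumer can replay any range."""
        return self.records[offset:offset + max_records]

log = PartitionLog()
for i in range(5):
    log.append(f"user_{i}", f"event_{i}")
print(log.read(offset=3))   # replay from offset 3
```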
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Producers (thousands)
  │
  ├─ Producer 1 ──┐
  ├─ Producer 2 ──┤    (messages partitioned by key hash)
  └─ Producer N ──┤
                  ▼
         ┌──────────────────────────────────────┐
         │          Kafka Cluster                │
         │                                       │
         │  Broker 1         Broker 2            │
         │  ┌────────────┐   ┌────────────┐     │
         │  │ Topic-A P0 │   │ Topic-A P1 │     │
         │  │  (Leader)  │   │  (Leader)  │     │
         │  │ Topic-A P1 │   │ Topic-A P0 │     │
         │  │  (Replica) │   │  (Replica) │     │
         │  └────────────┘   └────────────┘     │
         │                                       │
         │  Broker 3 (Controller)                │
         │  ┌────────────┐                      │
         │  │ Topic-A P0 │  KRaft Metadata      │
         │  │  (Replica) │  (Raft consensus)    │
         │  │ Topic-A P1 │                      │
         │  │  (Replica) │                      │
         │  └────────────┘                      │
         └──────────────────────────────────────┘
                  │
                  ▼
         Consumer Groups
         ┌─────────────────────┐
         │ Group &amp;#34;analytics&amp;#34;   │
         │  Consumer A → P0    │
         │  Consumer B → P1    │
         └─────────────────────┘
         ┌─────────────────────┐
         │ Group &amp;#34;search-index&amp;#34;│
         │  Consumer C → P0,P1 │
         └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;write-path-producer--broker&#34;&gt;Write Path (Producer → Broker)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Producer
  → Serialize message (Avro/Protobuf/JSON)
  → Compute partition: hash(key) % num_partitions (or round-robin if no key)
  → Batch messages destined for same partition (linger.ms = 5ms, batch.size = 16KB)
  → Compress batch (LZ4/Snappy/Zstd)
  → Send to partition leader broker
  → Leader:
    → Append to local log segment (sequential write)
    → Wait for ISR replicas to fetch and acknowledge
    → When acks=all: respond to producer only after all ISR replicas confirm
    → When acks=1: respond after local write (faster, risk of data loss on leader crash)
  → Producer receives offset confirmation
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;read-path-broker--consumer&#34;&gt;Read Path (Broker → Consumer)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Consumer
  → Poll leader broker for partition
  → Broker:
    → Read from page cache (hot data) or disk (cold data)
    → Use sendfile() / zero-copy: data goes directly from page cache → network socket
      (no kernel→user→kernel copy — saves 2 memory copies and 2 context switches)
    → Respond with batch of messages
  → Consumer deserializes and processes
  → Consumer commits offset (async or sync)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Brokers (50-60):&lt;/strong&gt; Store partition log segments, serve produce/fetch requests, replicate data. Each broker is a partition leader for some partitions and a follower for others.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Controller (KRaft):&lt;/strong&gt; 3-5 controller nodes running Raft consensus. Manages cluster metadata: topic creation, partition assignment, leader election, ISR management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Producers:&lt;/strong&gt; Client libraries that batch, compress, and route messages to the correct partition leader.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumers:&lt;/strong&gt; Client libraries that poll partition leaders, track offsets, and handle rebalancing within consumer groups.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Registry:&lt;/strong&gt; External service storing Avro/Protobuf schemas. Producers register schemas, consumers validate. Prevents breaking changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring (JMX/Prometheus):&lt;/strong&gt; Under-replicated partitions, consumer lag, broker throughput, request latency percentiles.&lt;/li&gt;
&lt;/ol&gt;
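&lt;p&gt;The read path and the commit-after-processing decision combine into a pull loop that gives at-least-once delivery (a sketch; poll/process/commit are stand-ins for the client library API, not its real signatures):&lt;/p&gt;

```python
def consume_loop(poll, process, commit, committed_offset):
    """At-least-once consumption: process first, commit after.
    A crash between process and commit causes redelivery, never loss."""
    offset = committed_offset
    while True:
        batch = poll(offset, max_records=500)
        if not batch:
            break
        for record in batch:
            process(record)
        offset += len(batch)
        commit(offset)     # acknowledge only after processing succeeded
    return offset

messages = ["m0", "m1", "m2", "m3", "m4"]
seen, commits = [], []
final = consume_loop(
    poll=lambda off, max_records: messages[off:off + 2],  # toy poll, 2 at a time
    process=seen.append,
    commit=commits.append,
    committed_offset=0,
)
print(final, commits)
```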
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-replication-and-isr-in-sync-replicas&#34;&gt;Deep Dive 1: Replication and ISR (In-Sync Replicas)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; How do we ensure durability without sacrificing throughput? Traditional quorum (majority must ack) wastes 1/3 of write capacity.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed System Control Plane</title>
      <link>https://chiraghasija.cc/designs/control-plane/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/control-plane/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Service Discovery:&lt;/strong&gt; Services register themselves on startup and discover other services by name. Registry is always up-to-date with running instances.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration Management:&lt;/strong&gt; Centrally manage and distribute configuration to all services. Support versioning, rollback, and environment-specific overrides.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Health Checking:&lt;/strong&gt; Continuously monitor service health. Automatically remove unhealthy instances from the service registry. Support liveness and readiness probes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rolling Deployments:&lt;/strong&gt; Orchestrate zero-downtime deployments by gradually replacing old instances with new ones, with automatic rollback on failure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load Balancing Policy:&lt;/strong&gt; Define and enforce load balancing policies (round-robin, least-connections, weighted) and circuit breaking rules at the control plane level.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% — if the control plane goes down, services can&amp;rsquo;t discover each other, configs can&amp;rsquo;t update, and deployments halt. Data plane must continue to function independently during control plane outages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Service discovery lookups &amp;lt; 5ms. Config pushes reach all nodes within 30 seconds. Health check detection &amp;lt; 10 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Service registry must be strongly consistent (a deregistered service must never receive traffic). Config updates must be atomically applied (no partial config states).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 50K service instances across 500 services. 10 regions. 1M service discovery lookups/sec. 100K config reads/sec.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Tolerance:&lt;/strong&gt; Control plane must handle network partitions gracefully. The data plane (actual service-to-service traffic) must continue even if the control plane is unreachable.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;service-registry&#34;&gt;Service Registry&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500 services × 100 instances each = &lt;strong&gt;50K registered instances&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each registration: ~1KB (service name, host, port, metadata, health status, version)&lt;/li&gt;
&lt;li&gt;Total registry size: 50K × 1KB = &lt;strong&gt;50MB&lt;/strong&gt; — trivially fits in memory on every node&lt;/li&gt;
&lt;li&gt;Registration/deregistration events: ~10K/hour (deploys, scaling, failures)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;service-discovery&#34;&gt;Service Discovery&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50K instances, each resolving other services ~20 times/sec = &lt;strong&gt;1M lookups/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With local caching (refresh every 5-10 seconds): actual control plane queries = 50K instances / 5 sec = &lt;strong&gt;10K queries/sec&lt;/strong&gt; — very manageable&lt;/li&gt;
&lt;/ul&gt;
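&lt;p&gt;That 100× reduction (1M lookups/sec down to 10K control plane queries/sec) falls out of a client-side TTL cache, sketched below (names are illustrative):&lt;/p&gt;

```python
import time

class DiscoveryClient:
    """Caches service -> instance lists locally; queries the control
    plane only when an entry is older than ttl seconds."""
    def __init__(self, fetch, ttl=5.0):
        self.fetch = fetch          # callable(service) -> [instances]
        self.ttl = ttl
        self.cache = {}             # service -> (fetched_at, instances)

    def resolve(self, service):
        entry = self.cache.get(service)
        if entry is None or time.monotonic() - entry[0] > self.ttl:
            self.cache[service] = (time.monotonic(), self.fetch(service))
        return self.cache[service][1]

control_plane_calls = []
client = DiscoveryClient(
    fetch=lambda s: control_plane_calls.append(s) or ["10.0.0.1:8080"])
for _ in range(1000):               # 1000 lookups within one TTL window...
    client.resolve("payments")
print(len(control_plane_calls))     # ...cost a single control plane query
```

&lt;p&gt;The same cache is what lets the data plane keep routing from its last known snapshot when the control plane is unreachable.&lt;/p&gt;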
&lt;h3 id=&#34;health-checking&#34;&gt;Health Checking&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50K instances, health checked every 5 seconds = &lt;strong&gt;10K health checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each health check: ~200 bytes response&lt;/li&gt;
&lt;li&gt;Network: 10K × 200 bytes = 2MB/sec — trivial&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;configuration&#34;&gt;Configuration&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500 services × 5 config keys each = 2,500 config entries&lt;/li&gt;
&lt;li&gt;Total config data: 2,500 × 10KB average = &lt;strong&gt;25MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Config change events: ~50/day (most configs rarely change)&lt;/li&gt;
&lt;li&gt;Config reads: 50K instances poll every 30 seconds = &lt;strong&gt;1,700 reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The control plane is a &lt;strong&gt;low-throughput, high-availability&lt;/strong&gt; system. Data volumes are small (&amp;lt; 100MB total state). The hard problem is availability, consistency, and graceful degradation — not scale.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Tracing System (Jaeger/Zipkin)</title>
      <link>https://chiraghasija.cc/designs/distributed-tracing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/distributed-tracing/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Collect spans from every service in the request path and assemble them into end-to-end traces with parent-child relationships&lt;/li&gt;
&lt;li&gt;Propagate trace context (trace ID, span ID, sampling decision) across service boundaries via HTTP headers, gRPC metadata, and message queues&lt;/li&gt;
&lt;li&gt;Support configurable sampling strategies (head-based probabilistic, tail-based on error/latency, always-on for debug)&lt;/li&gt;
&lt;li&gt;Store traces and provide query capabilities: search by trace ID, service name, operation, duration, tags, and time range&lt;/li&gt;
&lt;li&gt;Generate service dependency graphs and latency breakdowns from trace data&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% for collection pipeline (trace loss during outage is acceptable but not data corruption). Query system: 99.95%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Zero observable overhead on the critical path. Span reporting must be async and non-blocking (&amp;lt; 0.1ms per span in the application process).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Traces can be eventually consistent (30-second delay from span emission to queryability is fine).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1,000 microservices, 500K requests/sec, average 10 spans per trace = &lt;strong&gt;5M spans/sec&lt;/strong&gt; ingestion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Retain full traces for 7 days, sampled traces for 30 days, aggregated service metrics for 1 year.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Spans ingested:&lt;/strong&gt; 5M spans/sec (500K traces/sec × 10 spans/trace average)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;With 10% head-based sampling:&lt;/strong&gt; 500K spans/sec stored (reduces storage 10x while preserving statistical accuracy)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tail-based sampling (errors/slow):&lt;/strong&gt; adds ~50K spans/sec (all error traces and p99 latency traces kept at 100%)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total stored:&lt;/strong&gt; ~550K spans/sec&lt;/li&gt;
&lt;/ul&gt;
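&lt;p&gt;Head-based sampling must be a deterministic function of the trace ID so every service in the path makes the same keep/drop decision; tail-based sampling instead inspects the finished trace. Both can be sketched as follows (thresholds are illustrative):&lt;/p&gt;

```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministic head-based sampling: hash the trace ID so every
    service in the request path reaches the same keep/drop decision."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return (h / 0xFFFFFFFF) < rate

def tail_keep(spans) -> bool:
    """Tail-based override: keep the whole trace if any span errored
    or the root span was slow (1s threshold is illustrative)."""
    return any(s.get("error") for s in spans) or spans[0]["duration_ms"] > 1000

# Head decision is made once at the edge and propagated downstream:
decisions = [head_sample(f"trace-{i}") for i in range(100_000)]
print(f"sampled {sum(decisions) / len(decisions):.1%}")   # close to 10%

# Tail decision requires buffering the full trace until it completes:
print(tail_keep([{"duration_ms": 1500}, {"duration_ms": 30, "error": True}]))
```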
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Per span:&lt;/strong&gt; span_id (16B) + trace_id (16B) + parent_id (16B) + service (32B) + operation (64B) + start_time (8B) + duration (8B) + tags (128B) + logs (256B) = ~544 bytes average&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Raw spans (7 days):&lt;/strong&gt; 550K/sec × 86,400 × 7 × 544B = &lt;strong&gt;~181TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;With columnar compression (5x):&lt;/strong&gt; ~36TB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sampled traces (30 days, 1% of all traces):&lt;/strong&gt; 50K spans/sec × 86,400 × 30 × 544B ≈ 70TB raw, &lt;strong&gt;~14TB&lt;/strong&gt; compressed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service metrics (1 year):&lt;/strong&gt; aggregated data, ~100GB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;high-throughput write pipeline&lt;/strong&gt; with relatively infrequent reads. The core challenges are: (1) ingesting 5M spans/sec with near-zero application overhead, (2) tail-based sampling which requires holding spans in memory until the trace completes, and (3) efficiently querying traces by multiple dimensions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Document Management System</title>
      <link>https://chiraghasija.cc/designs/document-management/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/document-management/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Upload, download, and organize files in a hierarchical folder structure with support for large files (up to 10GB) via chunked/resumable uploads&lt;/li&gt;
&lt;li&gt;Version control: maintain full version history of documents, allow reverting to any previous version, and show diffs for text-based files&lt;/li&gt;
&lt;li&gt;Fine-grained access control: per-document and per-folder permissions (owner, editor, viewer), shareable links with expiry, and organization-wide policies&lt;/li&gt;
&lt;li&gt;Full-text search across document contents, metadata, and tags. Support filters by file type, date, owner, and folder.&lt;/li&gt;
&lt;li&gt;Real-time collaborative editing for text documents (Google Docs-style) with conflict resolution and presence indicators&lt;/li&gt;
&lt;/ol&gt;
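&lt;p&gt;The chunked/resumable upload in requirement 1 can be sketched client-side: split the file, record per-chunk and whole-file hashes in a manifest, and on resume send only the chunk indexes the server reports missing (the 4MB chunk size is an assumption for illustration):&lt;/p&gt;

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024     # 4MB chunks (illustrative size)

def make_manifest(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file into chunks; record per-chunk and whole-file hashes."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return {
        "file_sha256": hashlib.sha256(data).hexdigest(),
        "chunks": [hashlib.sha256(c).hexdigest() for c in chunks],
    }, chunks

def resume_upload(chunks, received: set):
    """Return the chunk indexes the server has not yet acknowledged."""
    return [i for i in range(len(chunks)) if i not in received]

manifest, chunks = make_manifest(b"x" * (10 * 1024 * 1024))   # 10MB file
# After an interruption, the server reports chunks 0 and 1 stored:
print(len(chunks), resume_upload(chunks, received={0, 1}))
```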
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for reads (viewing/downloading). 99.9% for writes (uploading/editing). Acceptable to queue uploads during degraded states.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; File listing &amp;lt; 100ms. File download start (first byte) &amp;lt; 200ms. Search results &amp;lt; 500ms. Collaborative edit sync &amp;lt; 100ms (real-time feel).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines) — zero data loss. Documents are business-critical. Use replication + backups.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M documents totaling 50PB of storage. 10M users. 1M uploads/day. 10M downloads/day. 100K concurrent collaborative editing sessions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; Audit trail for all document access. Support for retention policies and legal holds. GDPR right to deletion.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M documents, average 100MB = &lt;strong&gt;50PB total storage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With 5 versions average per document, block-level deduplication keeps version overhead to ~3x: 50PB × 3 = &lt;strong&gt;150PB&lt;/strong&gt; logical storage; S3&amp;rsquo;s internal redundancy on top of that is managed by the provider&lt;/li&gt;
&lt;li&gt;Daily uploads: 1M files × 100MB average = &lt;strong&gt;100TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Search index: 500M documents × 5KB extracted text average = &lt;strong&gt;2.5TB&lt;/strong&gt; Elasticsearch index&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Uploads: 1M/day = &lt;strong&gt;12 uploads/sec average&lt;/strong&gt;, peak 100/sec&lt;/li&gt;
&lt;li&gt;Downloads: 10M/day = &lt;strong&gt;115 downloads/sec average&lt;/strong&gt;, peak 1K/sec&lt;/li&gt;
&lt;li&gt;Search: 5M queries/day = &lt;strong&gt;58 queries/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Collaborative editing: 100K concurrent sessions, each generating ~10 operations/sec = &lt;strong&gt;1M ops/sec&lt;/strong&gt; for the collaboration engine&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Uploads: 100TB/day = &lt;strong&gt;9.3 Gbps sustained&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Downloads: 10M × 100MB = 1PB/day = &lt;strong&gt;93 Gbps sustained&lt;/strong&gt; → CDN handles this&lt;/li&gt;
&lt;li&gt;Total: ~100 Gbps — significant but manageable with CDN offload&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;Storage cost dominates. At S3 Standard pricing ($0.023/GB/month), 50PB = &lt;strong&gt;$1.15M/month&lt;/strong&gt;. Tiering cold documents to S3 Glacier ($0.004/GB/month) reduces this to ~$250K/month. Storage tiering is a critical cost optimization, not just a nice-to-have.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Food Ordering System (DoorDash/UberEats)</title>
      <link>https://chiraghasija.cc/designs/food-ordering/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/food-ordering/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Customers can browse nearby restaurants, view menus with real-time item availability, and place orders&lt;/li&gt;
&lt;li&gt;Orders flow through a state machine: placed → confirmed by restaurant → preparing → ready for pickup → picked up → delivered&lt;/li&gt;
&lt;li&gt;Assign delivery drivers to orders based on proximity, current load, and estimated restaurant prep time&lt;/li&gt;
&lt;li&gt;Real-time tracking of driver location from restaurant to customer doorstep&lt;/li&gt;
&lt;li&gt;Estimate and display accurate ETAs at each stage (restaurant prep, driver pickup, delivery)&lt;/li&gt;
&lt;/ol&gt;
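&lt;p&gt;The order lifecycle in requirement 2 can be enforced with an explicit transition table that rejects illegal jumps (a sketch; the cancellation branch is an assumption, since the requirement lists only the happy path):&lt;/p&gt;

```python
TRANSITIONS = {
    "placed": {"confirmed", "cancelled"},
    "confirmed": {"preparing", "cancelled"},
    "preparing": {"ready_for_pickup"},
    "ready_for_pickup": {"picked_up"},
    "picked_up": {"delivered"},
    "delivered": set(),                 # terminal
    "cancelled": set(),                 # terminal
}

def advance(order, new_state):
    """Apply a state transition, rejecting anything not in the table."""
    if new_state not in TRANSITIONS[order["state"]]:
        raise ValueError(f"illegal transition {order['state']} -> {new_state}")
    order["state"] = new_state
    return order

order = {"id": "o1", "state": "placed"}
for s in ["confirmed", "preparing", "ready_for_pickup", "picked_up", "delivered"]:
    advance(order, s)
print(order["state"])
```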
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% during meal-time peaks (11 AM - 1 PM, 5 PM - 9 PM). A 10-minute outage during dinner rush can cost millions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Menu browsing &amp;lt; 100ms, order placement &amp;lt; 500ms, location updates &amp;lt; 1 second.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Order state must be strongly consistent (no duplicate orders, no lost payments). Menu prices can be eventually consistent (seconds-stale acceptable).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500K restaurants, 30M orders/month (~12 orders/sec average, peak 100/sec), 200K concurrent drivers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reliability:&lt;/strong&gt; Payment capture must be exactly-once. Driver assignment must avoid double-booking.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Menu browsing:&lt;/strong&gt; 5M DAU × 5 searches × 3 restaurant views = 75M page loads/day = ~870 QPS (peak 3x = 2,600 QPS)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 1M orders/day = ~12/sec average, peak (dinner rush) = 100/sec&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver location updates:&lt;/strong&gt; 200K drivers × 1 update/4s = &lt;strong&gt;50K updates/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order status updates:&lt;/strong&gt; 1M orders × 6 state transitions = 6M events/day = 70/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Restaurants + menus:&lt;/strong&gt; 500K restaurants × 50 items × 2KB = 50GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 30M/month × 3KB = 90GB/month, ~1TB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver locations:&lt;/strong&gt; Real-time only (Redis), ~200K × 64 bytes = 12.8MB in memory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order tracking history:&lt;/strong&gt; 30M/month × 100 location points × 32 bytes = 96GB/month&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;three-sided marketplace&lt;/strong&gt; (customers, restaurants, drivers) with a complex orchestration problem. The hardest challenge is ETA accuracy — it depends on restaurant prep time (variable, 5-45 minutes), driver travel time (traffic-dependent), and coordinating pickup timing so the driver arrives when food is ready (not 15 minutes early waiting, not 10 minutes late with cold food).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Hotel Booking System</title>
      <link>https://chiraghasija.cc/designs/hotel-booking/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/hotel-booking/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Search for hotels by location, dates, guests, and filters (price range, star rating, amenities) with availability-aware results&lt;/li&gt;
&lt;li&gt;View hotel details, room types, photos, reviews, and real-time pricing for selected dates&lt;/li&gt;
&lt;li&gt;Book a room with a hold-then-confirm workflow: hold inventory for 10 minutes while user enters payment, then confirm or release&lt;/li&gt;
&lt;li&gt;Prevent double-booking: two users cannot book the same room for overlapping dates, even under concurrent requests&lt;/li&gt;
&lt;li&gt;Support booking lifecycle: create, confirm, modify (change dates), cancel (with cancellation policy enforcement), and refund&lt;/li&gt;
&lt;/ol&gt;
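&lt;p&gt;Requirement 4 (no double-booking under concurrency) reduces to an atomic check-and-decrement per (room type, night); in a real database this is a conditional UPDATE or SELECT ... FOR UPDATE inside one transaction. An in-memory sketch with a lock standing in for the row lock:&lt;/p&gt;

```python
import threading

class NightlyInventory:
    """available[(room_type, night)] -> rooms left. The lock stands in
    for a database row lock / conditional UPDATE ... WHERE available > 0."""
    def __init__(self, available):
        self.available = dict(available)
        self.lock = threading.Lock()

    def hold(self, room_type, nights):
        """Atomically reserve one room for every night, or none at all."""
        with self.lock:
            if any(self.available.get((room_type, n), 0) < 1 for n in nights):
                return False
            for n in nights:
                self.available[(room_type, n)] -= 1
            return True

# Two concurrent users race for the last deluxe room on the same nights:
inv = NightlyInventory({("deluxe", "2024-06-01"): 1, ("deluxe", "2024-06-02"): 1})
results = []
threads = [threading.Thread(
    target=lambda: results.append(inv.hold("deluxe", ["2024-06-01", "2024-06-02"])))
    for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))    # exactly one hold succeeds
```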
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for search, 99.999% for booking (losing a confirmed booking is catastrophic)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Search results &amp;lt; 500ms. Booking confirmation &amp;lt; 2 seconds (includes payment).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for inventory. A room shown as available must actually be bookable. Overbooking must be prevented at the database level.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500K hotels, 50M rooms globally, 100M searches/day, 1M bookings/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Booking records and payment transactions must survive any failure. Zero data loss.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Search: 100M/day = &lt;strong&gt;~1,150 searches/sec average&lt;/strong&gt;, 5× peak = &lt;strong&gt;~5,750/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Hotel detail page views: 3× searches = &lt;strong&gt;~3,450/sec average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Booking attempts: 1M/day = &lt;strong&gt;~12 bookings/sec average&lt;/strong&gt;, peak = &lt;strong&gt;~60/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Booking holds created: 3× bookings (many abandoned) = &lt;strong&gt;~36 holds/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Price lookups: combined with search and detail = &lt;strong&gt;~10K/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Hotels: 500K × 5 KB = &lt;strong&gt;2.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Room types: 500K hotels × 5 room types × 1 KB = &lt;strong&gt;2.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Room inventory (per room per night): 50M rooms × 365 nights × 50 bytes = &lt;strong&gt;913 GB&lt;/strong&gt; (~1 TB)&lt;/li&gt;
&lt;li&gt;Bookings: 1M/day × 365 days × 1 KB = &lt;strong&gt;365 GB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Search index (Elasticsearch): hotel metadata + denormalized availability = &lt;strong&gt;~50 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;pricing&#34;&gt;Pricing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Price varies by date, room type, demand, and channel&lt;/li&gt;
&lt;li&gt;Rate table: 500K hotels × 5 room types × 365 days = &lt;strong&gt;~900M rate entries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each entry: ~30 bytes → &lt;strong&gt;27 GB&lt;/strong&gt; of rate data&lt;/li&gt;
&lt;li&gt;Cached in Redis for fast lookup&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is an &lt;strong&gt;inventory management + search&lt;/strong&gt; problem. The core challenge is maintaining a correct, real-time inventory count (rooms available per night) under concurrent booking pressure while also providing fast, filter-rich search results. The booking flow is a distributed transaction spanning inventory, payment, and confirmation.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Large File Download System</title>
      <link>https://chiraghasija.cc/designs/file-download/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/file-download/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can download large files (100MB to 50GB) reliably over unstable network connections with resume support&lt;/li&gt;
&lt;li&gt;Files are split into chunks; clients can download chunks in parallel and resume from the last completed chunk after interruption&lt;/li&gt;
&lt;li&gt;Integrity verification at both chunk level (per-chunk checksum) and file level (whole-file hash) to detect corruption&lt;/li&gt;
&lt;li&gt;Distribute files globally via CDN edge nodes to minimize download latency and maximize throughput&lt;/li&gt;
&lt;li&gt;Support bandwidth throttling per user/tier and fair-share scheduling when multiple users download simultaneously&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — users should almost always be able to start or resume a download. Brief outages are tolerable if resume works.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; First byte within 200ms. Download speed should saturate the user&amp;rsquo;s available bandwidth (no artificial bottleneck from our side, except throttling).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; File metadata (checksums, chunk manifest) must be strongly consistent. A user must never download a partially-updated file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M registered users, 500K concurrent downloads, 100PB total stored files, 5PB egress/month.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines) for stored files. No data loss ever.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Concurrent downloads:&lt;/strong&gt; 500K&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Average file size:&lt;/strong&gt; 2GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Average download speed:&lt;/strong&gt; 50 Mbps per user&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total bandwidth:&lt;/strong&gt; 500K × 50 Mbps = &lt;strong&gt;25 Tbps&lt;/strong&gt; peak egress (this is CDN-scale, not single-origin)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk requests:&lt;/strong&gt; 2GB file / 8MB chunk = 250 chunks per download. 500K downloads × 250 = 125M chunk requests (spread over download duration)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk request rate:&lt;/strong&gt; at 50 Mbps a user completes one 8 MB (64 Mbit) chunk every ~1.3 sec, so 500K downloads generate &lt;strong&gt;~390K chunk requests/sec&lt;/strong&gt; at peak&lt;/li&gt;
&lt;/ul&gt;
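&lt;p&gt;The per-download chunk math above assumes a chunk manifest the client can resume against. A minimal sketch with 8 MB chunks and SHA-256 checksums (helper names are illustrative):&lt;/p&gt;

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB, matching the estimate above

def chunk_manifest(file_bytes):
    """Split a file into fixed-size chunks with per-chunk SHA-256 checksums."""
    chunks = []
    for offset in range(0, len(file_bytes), CHUNK_SIZE):
        data = file_bytes[offset:offset + CHUNK_SIZE]
        chunks.append({
            "index": offset // CHUNK_SIZE,
            "offset": offset,
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"file_sha256": hashlib.sha256(file_bytes).hexdigest(),
            "chunks": chunks}

def resume_plan(manifest, completed_indexes):
    """Chunks still needed after an interrupted download."""
    done = set(completed_indexes)
    return [c["index"] for c in manifest["chunks"] if c["index"] not in done]
```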
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Total files:&lt;/strong&gt; 100PB (stored in object storage like S3)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk metadata:&lt;/strong&gt; 50M files × 250 chunks × 64 bytes = 800GB (fits in a database)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File manifests:&lt;/strong&gt; 50M files × 2KB = 100GB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;cost-insight&#34;&gt;Cost Insight&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Egress cost is dominant:&lt;/strong&gt; 5PB/month × $0.05/GB (CDN) = $250K/month&lt;/li&gt;
&lt;li&gt;CDN cache hit ratio is critical: 80% hit rate → origin serves only 1PB/month&lt;/li&gt;
&lt;li&gt;Popular files (top 1%) account for 80% of downloads (cache-friendly)&lt;/li&gt;
&lt;li&gt;Long-tail files need origin serving — optimize with regional origin replicas&lt;/li&gt;
&lt;/ul&gt;
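&lt;p&gt;The egress economics reduce to a two-line calculation; a sketch using the figures above (the $0.05/GB CDN rate and 80% hit ratio are the stated assumptions):&lt;/p&gt;

```python
# Assumed figures from the estimate above.
MONTHLY_EGRESS_PB = 5
CDN_COST_PER_GB = 0.05
CACHE_HIT_RATE = 0.80

# CDN bill: every delivered byte goes through the CDN.
cdn_cost_usd = MONTHLY_EGRESS_PB * 1_000_000 * CDN_COST_PER_GB  # ~$250K/month
# Origin only serves cache misses.
origin_egress_pb = MONTHLY_EGRESS_PB * (1 - CACHE_HIT_RATE)     # ~1 PB/month
```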
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is an &lt;strong&gt;egress-heavy, reliability-focused system&lt;/strong&gt;. The core challenges are: (1) reliable chunk-based downloads with resume over unreliable networks, (2) efficient CDN distribution to minimize origin egress, and (3) integrity guarantees so users never get a corrupted file.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Large-Scale Data Migration System</title>
      <link>https://chiraghasija.cc/designs/data-migration/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/data-migration/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Migrate data from a legacy datastore (e.g., MySQL) to a new datastore (e.g., PostgreSQL, DynamoDB, or a new schema in the same engine) with zero downtime&lt;/li&gt;
&lt;li&gt;Maintain full data consistency between source and destination throughout the migration — no data loss, no corruption&lt;/li&gt;
&lt;li&gt;Provide a validation framework that continuously compares source and destination data and reports discrepancies&lt;/li&gt;
&lt;li&gt;Support rollback at any stage — if the new system has issues, revert to the old system without data loss&lt;/li&gt;
&lt;li&gt;Handle schema transformations during migration (column renames, type changes, data enrichment, denormalization)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; Zero downtime. The system must remain fully operational throughout the entire migration, which may take days to weeks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; No user-facing latency increase during migration. Reads and writes continue at normal speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency. After migration completes, the new datastore must have 100.000% of the data. Not 99.99% — 100%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 50 TB of data, 500M rows across 200 tables. 10K writes/sec, 100K reads/sec. Migration must complete within 2 weeks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; The migration process itself must be resumable. If the migration pipeline crashes, it resumes from where it left off, not from scratch.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;backfill-phase&#34;&gt;Backfill Phase&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50 TB of data to migrate&lt;/li&gt;
&lt;li&gt;Network throughput (source → destination): 1 Gbps sustained = 125 MB/sec&lt;/li&gt;
&lt;li&gt;Time to backfill: 50 TB / 125 MB/sec = 400,000 sec ≈ &lt;strong&gt;4.6 days&lt;/strong&gt; at full throughput&lt;/li&gt;
&lt;li&gt;With throttling to avoid impacting production (50% utilization): &lt;strong&gt;~9 days&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
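&lt;p&gt;The backfill timeline above is worth parameterizing, since link speed and throttling dominate the schedule; a sketch under the stated assumptions:&lt;/p&gt;

```python
def backfill_days(data_tb, link_gbps=1.0, utilization=1.0):
    """Days to copy data_tb over a sustained link at a given utilization.

    1 Gbps = 125 MB/s; utilization models throttling to protect the
    production source (0.5 = use half the link).
    """
    bytes_per_sec = link_gbps * 125_000_000 * utilization
    seconds = data_tb * 1_000_000_000_000 / bytes_per_sec
    return seconds / 86_400
```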
&lt;h3 id=&#34;change-data-capture-cdc-during-backfill&#34;&gt;Change Data Capture (CDC) During Backfill&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10K writes/sec to source DB during backfill&lt;/li&gt;
&lt;li&gt;CDC event size: ~500 bytes average&lt;/li&gt;
&lt;li&gt;CDC throughput: 10K × 500 bytes = &lt;strong&gt;5 MB/sec&lt;/strong&gt; (easily handled by Kafka)&lt;/li&gt;
&lt;li&gt;Events generated during 9-day backfill: 10K/sec × 86400 × 9 = &lt;strong&gt;7.8 billion events&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;At 500 bytes each: &lt;strong&gt;3.9 TB&lt;/strong&gt; of CDC data (Kafka with 14-day retention can handle this)&lt;/li&gt;
&lt;/ul&gt;
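&lt;p&gt;Because Kafka delivers CDC events at-least-once and the pipeline must be resumable, applying an event has to be idempotent. A sketch of replay-safe apply keyed on the source log position (the event shape is hypothetical):&lt;/p&gt;

```python
def apply_cdc_event(dest, event):
    """Idempotently apply a CDC event to the destination (a dict here).

    Events may be replayed after a crash, so apply only if the event's
    source log position (lsn) is newer than what the row already reflects.
    Hypothetical event shape: {op, pk, row, lsn}.
    """
    pk, lsn = event["pk"], event["lsn"]
    current = dest.get(pk)
    if current is not None and current["lsn"] >= lsn:
        return False  # stale or duplicate event: skip
    if event["op"] == "delete":
        dest[pk] = {"lsn": lsn, "row": None, "deleted": True}
    else:  # insert and update are both upserts
        dest[pk] = {"lsn": lsn, "row": event["row"], "deleted": False}
    return True
```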
&lt;h3 id=&#34;validation-phase&#34;&gt;Validation Phase&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Compare 500M rows between source and destination&lt;/li&gt;
&lt;li&gt;Comparison rate: 50K rows/sec (read from both, hash, compare)&lt;/li&gt;
&lt;li&gt;Time: 500M / 50K = 10,000 sec ≈ &lt;strong&gt;2.8 hours&lt;/strong&gt; per full validation pass&lt;/li&gt;
&lt;li&gt;Run 3 passes for confidence: ~8 hours&lt;/li&gt;
&lt;/ul&gt;
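&lt;p&gt;The comparison pass above amounts to hashing each row canonically on both sides and diffing the digests (helper names are illustrative):&lt;/p&gt;

```python
import hashlib

def row_digest(row):
    """Stable digest of a row for source/destination comparison."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()

def diff_batch(source_rows, dest_rows):
    """Return PKs whose digests differ, or that exist on only one side."""
    src = {pk: row_digest(r) for pk, r in source_rows.items()}
    dst = {pk: row_digest(r) for pk, r in dest_rows.items()}
    keys = set(src) | set(dst)
    return sorted(pk for pk in keys if src.get(pk) != dst.get(pk))
```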
&lt;h3 id=&#34;total-timeline&#34;&gt;Total Timeline&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Backfill: 9 days&lt;/li&gt;
&lt;li&gt;CDC catch-up: 1-2 hours (processing backlog of changes accumulated during backfill)&lt;/li&gt;
&lt;li&gt;Validation: 1 day (3 passes + fixing discrepancies)&lt;/li&gt;
&lt;li&gt;Shadow reads: 2-3 days (comparison in production)&lt;/li&gt;
&lt;li&gt;Cutover: minutes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~2 weeks&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;p&gt;This is an internal infrastructure system, not a user-facing API. But it needs operational APIs for engineers to manage the migration.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Live Comments System (YouTube Live/Twitch Chat)</title>
      <link>https://chiraghasija.cc/designs/live-comments/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/live-comments/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can send text messages to a live stream&amp;rsquo;s chat in real-time (&amp;lt; 200ms delivery to other viewers)&lt;/li&gt;
&lt;li&gt;All viewers of a stream see messages in a consistent chronological order&lt;/li&gt;
&lt;li&gt;Support rate limiting per user (e.g., 1 message every 2 seconds) and configurable slow mode (streamer sets interval)&lt;/li&gt;
&lt;li&gt;Moderation tools: delete messages, ban users, assign moderators, auto-filter spam/profanity&lt;/li&gt;
&lt;li&gt;Persist chat history so users joining late can load recent messages (last 200 messages on join)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — chat can briefly degrade (delay messages) but should not go fully offline during a stream&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 200ms from send to display for 99th percentile viewers. For popular streams (100K+ viewers), &amp;lt; 500ms is acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Total ordering per stream is required: all viewers must see messages in the same order, and messages from the same user must appear in the order they were sent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100K concurrent streams. Top streams have 500K concurrent viewers. 50K messages/sec globally across all streams. Top streams receive 5,000 messages/sec.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Chat history persisted for at least 30 days for moderation review and replay.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100K concurrent live streams, 50M total concurrent viewers&lt;/li&gt;
&lt;li&gt;50,000 messages/sec globally (write)&lt;/li&gt;
&lt;li&gt;Fan-out: each message is delivered to all viewers of that stream&lt;/li&gt;
&lt;li&gt;Average stream: 500 viewers. Top stream: 500K viewers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fan-out volume:&lt;/strong&gt; 50,000 messages/sec × average 500 recipients = 25M message deliveries/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average message: 200 bytes (user_id, stream_id, text, timestamp, metadata)&lt;/li&gt;
&lt;li&gt;50,000 messages/sec × 200 bytes = 10 MB/sec = &lt;strong&gt;864 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;30-day retention = &lt;strong&gt;~26 TB&lt;/strong&gt; (compressible to ~5 TB with gzip)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;connection-state&#34;&gt;Connection State&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50M concurrent WebSocket connections&lt;/li&gt;
&lt;li&gt;Each connection: ~10 KB memory overhead (buffers, metadata)&lt;/li&gt;
&lt;li&gt;Total: 50M × 10 KB = &lt;strong&gt;500 GB&lt;/strong&gt; of connection memory&lt;/li&gt;
&lt;li&gt;At 500K connections per server = &lt;strong&gt;100 WebSocket servers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average message delivered: 200 bytes × 25M deliveries/sec = &lt;strong&gt;5 GB/sec&lt;/strong&gt; outbound bandwidth&lt;/li&gt;
&lt;li&gt;Top stream (500K viewers × 5,000 msg/sec): 200 bytes × 500K × 5,000 would be 500 GB/sec — impossible without batching&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Batch messages. Send 50 messages every 200ms = 10 KB per batch per viewer. 500K × 10 KB × 5/sec = 25 GB/sec. Still high — need edge-level fan-out.&lt;/li&gt;
&lt;/ul&gt;
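&lt;p&gt;The batching arithmetic above fits in one function; note that the 50-message cap implies sampling messages in very busy streams (parameter names are illustrative):&lt;/p&gt;

```python
def batch_bandwidth_gb_per_sec(viewers, msgs_per_sec, msg_bytes=200,
                               batch_window_ms=200, batch_cap=50):
    """Outbound bandwidth with time-window batching (framing overhead ignored).

    The cap means streams busier than cap * (1000 / window) msg/sec are
    sampled down rather than delivered in full.
    """
    batches_per_sec = 1000 / batch_window_ms
    msgs_per_batch = min(batch_cap, msgs_per_sec / batches_per_sec)
    batch_bytes = msgs_per_batch * msg_bytes
    return viewers * batches_per_sec * batch_bytes / 1e9
```

&lt;p&gt;With the top-stream numbers (500K viewers, 5,000 msg/sec) this reproduces the 25 GB/sec figure above.&lt;/p&gt;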
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-endpoints&#34;&gt;REST Endpoints&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Send a chat message
POST /v1/streams/{stream_id}/messages
  Headers: Authorization: Bearer &amp;lt;token&amp;gt;
  Body: {
    &amp;#34;text&amp;#34;: &amp;#34;Great play!&amp;#34;,
    &amp;#34;reply_to&amp;#34;: &amp;#34;msg_abc&amp;#34;              // optional, for threaded replies
  }
  Response 201: {
    &amp;#34;message_id&amp;#34;: &amp;#34;msg_xyz&amp;#34;,
    &amp;#34;timestamp&amp;#34;: 1708632000123
  }

// Load recent messages (on join)
GET /v1/streams/{stream_id}/messages?limit=200&amp;amp;before=&amp;lt;cursor&amp;gt;
  Response 200: {
    &amp;#34;messages&amp;#34;: [...],
    &amp;#34;cursor&amp;#34;: &amp;#34;msg_abc&amp;#34;
  }

// Delete a message (moderator)
DELETE /v1/streams/{stream_id}/messages/{message_id}
  Headers: Authorization: Bearer &amp;lt;moderator_token&amp;gt;

// Ban a user from chat
POST /v1/streams/{stream_id}/bans
  Body: { &amp;#34;user_id&amp;#34;: &amp;#34;user_123&amp;#34;, &amp;#34;duration&amp;#34;: 600 }  // 10 min timeout

// Set slow mode
PUT /v1/streams/{stream_id}/settings
  Body: { &amp;#34;slow_mode_seconds&amp;#34;: 5 }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;websocket-protocol&#34;&gt;WebSocket Protocol&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Client connects
WS /v1/streams/{stream_id}/chat?token=&amp;lt;auth_token&amp;gt;

// Server → Client: new messages (batched)
{
  &amp;#34;type&amp;#34;: &amp;#34;messages&amp;#34;,
  &amp;#34;data&amp;#34;: [
    { &amp;#34;id&amp;#34;: &amp;#34;msg_1&amp;#34;, &amp;#34;user&amp;#34;: &amp;#34;alice&amp;#34;, &amp;#34;text&amp;#34;: &amp;#34;Hello!&amp;#34;, &amp;#34;ts&amp;#34;: 1708632000123 },
    { &amp;#34;id&amp;#34;: &amp;#34;msg_2&amp;#34;, &amp;#34;user&amp;#34;: &amp;#34;bob&amp;#34;, &amp;#34;text&amp;#34;: &amp;#34;GG!&amp;#34;, &amp;#34;ts&amp;#34;: 1708632000456 }
  ]
}

// Server → Client: message deleted
{ &amp;#34;type&amp;#34;: &amp;#34;delete&amp;#34;, &amp;#34;message_id&amp;#34;: &amp;#34;msg_1&amp;#34; }

// Server → Client: user banned
{ &amp;#34;type&amp;#34;: &amp;#34;ban&amp;#34;, &amp;#34;user_id&amp;#34;: &amp;#34;user_123&amp;#34;, &amp;#34;duration&amp;#34;: 600 }

// Client → Server: heartbeat (every 30s)
{ &amp;#34;type&amp;#34;: &amp;#34;ping&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;messages-cassandra--partitioned-by-stream_id-clustered-by-timestamp&#34;&gt;Messages (Cassandra — partitioned by stream_id, clustered by timestamp)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: messages
  stream_id      (PK)  | varchar
  message_id     (CK)  | varchar       -- time-based UUID (sortable)
  user_id              | varchar
  text                 | varchar(500)
  reply_to             | varchar        -- nullable
  is_deleted           | boolean
  created_at           | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;stream-settings-redis-hash&#34;&gt;Stream Settings (Redis Hash)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: stream:{stream_id}:settings
Fields:
  slow_mode_seconds   | int (0 = disabled)
  subscribers_only    | boolean
  emote_only          | boolean
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;bans-redis-set-with-ttl--postgresql-for-permanent-bans&#34;&gt;Bans (Redis Sorted Set scored by expiry + PostgreSQL for permanent bans)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: stream:{stream_id}:bans
Type: Sorted Set
Member: user_id
Score: ban_expiry_timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;rate-limit-state-redis&#34;&gt;Rate Limit State (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: ratelimit:{stream_id}:{user_id}
Value: last_message_timestamp
TTL: slow_mode_seconds
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-cassandra-for-messages&#34;&gt;Why Cassandra for Messages?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Write-optimized (append-only, log-structured)&lt;/li&gt;
&lt;li&gt;Partitioned by stream_id — all messages for a stream are co-located&lt;/li&gt;
&lt;li&gt;Clustering by message_id (time-based UUID) gives natural chronological ordering&lt;/li&gt;
&lt;li&gt;Handles 50K writes/sec easily with horizontal scaling&lt;/li&gt;
&lt;li&gt;TTL support for automatic 30-day cleanup&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;why-redis-for-real-time-state&#34;&gt;Why Redis for Real-Time State?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Sub-millisecond lookups for rate limiting, ban checks, and settings&lt;/li&gt;
&lt;li&gt;Pub/Sub for cross-server message fan-out&lt;/li&gt;
&lt;li&gt;Sorted sets for efficient ban expiry checks&lt;/li&gt;
&lt;/ul&gt;
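&lt;p&gt;The slow-mode check itself is a single Redis operation (SET key NX EX seconds succeeds only when no unexpired key exists). A self-contained sketch of the same semantics using an in-memory map:&lt;/p&gt;

```python
import time

class SlowMode:
    """In-memory sketch of the Redis rate-limit key.

    A send is allowed only when the previous timestamp has aged out of
    the slow-mode window, mirroring SET key NX EX slow_mode_seconds.
    """
    def __init__(self, slow_mode_seconds):
        self.window = slow_mode_seconds
        self.last = {}  # (stream_id, user_id) -> last send timestamp

    def allow(self, stream_id, user_id, now=None):
        now = time.time() if now is None else now
        key = (stream_id, user_id)
        prev = self.last.get(key)
        if prev is not None and self.window > now - prev:
            return False  # still inside the slow-mode window
        self.last[key] = now
        return True
```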
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;message-send-flow&#34;&gt;Message Send Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client (sender)
  → WebSocket Connection Server (WS Server)
    → 1. Rate limit check (Redis: last message timestamp)
       If too soon → reject with &amp;#34;slow down&amp;#34; error
    → 2. Ban check (Redis: is user_id in ban set?)
       If banned → reject
    → 3. Spam/profanity filter (in-process rules + ML service)
       If spam → silently drop or shadow-ban
    → 4. Assign message_id (time-based UUID) + server timestamp
    → 5. Publish to Message Bus (Kafka topic: chat.{stream_id})
    → 6. Return ACK to sender

Message Bus (Kafka)
  → Chat Dispatcher Service:
    → 1. Persist message to Cassandra
    → 2. Determine which WS Servers hold viewers of this stream
       (lookup: Redis set stream:{stream_id}:servers)
    → 3. Publish to each WS Server via Redis Pub/Sub channel
       Channel: ws_server:{server_id}:messages

WS Server (receivers)
  → 1. Receive message from Redis Pub/Sub
  → 2. Buffer message (batch: up to 50 messages or 200ms, whichever first)
  → 3. Send batched messages to all local WebSocket connections for that stream
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;WebSocket Connection Servers:&lt;/strong&gt; Maintain persistent connections with clients. Stateful — each server knows which streams its connected clients are watching. Horizontally scaled (100+ servers). Sticky routing by stream_id hash to concentrate viewers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Message Bus (Kafka):&lt;/strong&gt; Durable message queue. One topic per popular stream, shared topics for smaller streams. Provides ordering guarantee within a partition (one stream = one partition).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chat Dispatcher Service:&lt;/strong&gt; Consumes from Kafka, persists to Cassandra, fans out to WS Servers. Stateless workers, scaled by partition count.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster:&lt;/strong&gt; Rate limiting, ban lists, stream settings, pub/sub for WS server fan-out, stream→server mapping.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cassandra Cluster:&lt;/strong&gt; Persistent chat history. Read path for &amp;ldquo;load recent messages&amp;rdquo; on viewer join.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spam/Moderation Service:&lt;/strong&gt; Real-time text classification. Regex rules (profanity word list) + ML model (spam detection). Called synchronously on message send (&amp;lt; 10ms budget).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream Registry:&lt;/strong&gt; Tracks which WS servers have viewers for each stream. Updated when connections are established/dropped.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;connection-management&#34;&gt;Connection Management&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Viewer joins stream:
  1. Client establishes WebSocket to assigned WS Server (via load balancer)
  2. WS Server registers: SADD stream:{stream_id}:servers {server_id}
  3. WS Server loads last 200 messages from Cassandra for initial hydration
  4. Client starts receiving live messages via WebSocket

Viewer leaves / disconnects:
  1. WS Server detects close/timeout
  2. Decrement local viewer count for stream
  3. If no more local viewers for stream:
     SREM stream:{stream_id}:servers {server_id}
     Unsubscribe from Redis Pub/Sub channel
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-fan-out-at-scale--handling-500k-viewers&#34;&gt;Deep Dive 1: Fan-Out at Scale — Handling 500K Viewers&lt;/h3&gt;
&lt;p&gt;The hardest problem: a top stream has 500K concurrent viewers receiving 5,000 messages/sec. Naive fan-out (send each message individually to 500K connections) is 2.5 billion message deliveries per second. Impossible.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Notification System</title>
      <link>https://chiraghasija.cc/designs/notification-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/notification-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Send notifications across multiple channels: push notifications (iOS/Android), SMS, email, and in-app&lt;/li&gt;
&lt;li&gt;Support user notification preferences — users can opt in/out per channel, per notification type (e.g., marketing vs transactional)&lt;/li&gt;
&lt;li&gt;Template-based notifications with variable substitution (e.g., &amp;ldquo;Hi {{name}}, your order {{order_id}} has shipped&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Rate limiting per user per channel — no user receives more than N notifications per hour (prevent notification fatigue)&lt;/li&gt;
&lt;li&gt;Track delivery status for each notification (sent, delivered, opened, clicked, bounced, failed) with retry on failure&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — transactional notifications (password resets, 2FA codes) are critical path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Transactional notifications delivered within 5 seconds of trigger. Marketing/batch notifications within 30 minutes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; At-least-once delivery. No notification should be silently dropped. Deduplication prevents sending the same notification twice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1B notifications/day (across all channels), 50K concurrent sends/sec at peak, 500M registered users&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Every notification request and its delivery status persisted. Full audit trail.&lt;/li&gt;
&lt;/ul&gt;
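&lt;p&gt;The at-least-once-plus-deduplication requirement is usually met with an idempotency key per (recipient, template, trigger event). A minimal sketch; in production the key set would live in Redis via SETNX with a TTL, and all names here are illustrative:&lt;/p&gt;

```python
import hashlib

def dedup_key(recipient_id, template_id, event_id):
    """Idempotency key: the same trigger event never produces the same
    notification twice, even if the producer retries."""
    raw = f"{recipient_id}:{template_id}:{event_id}"
    return hashlib.sha256(raw.encode()).hexdigest()

def send_once(sent_keys, key, send_fn):
    """Suppress duplicate sends: only the first occurrence of a key
    triggers send_fn (a set stands in for Redis SETNX here)."""
    if key in sent_keys:
        return False
    sent_keys.add(key)
    send_fn()
    return True
```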
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B notifications/day = &lt;strong&gt;~11,500 notifications/sec average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak (morning/evening, campaign blasts): 5× = &lt;strong&gt;~58,000 notifications/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Channel breakdown: Push 50% (500M), Email 30% (300M), In-app 15% (150M), SMS 5% (50M)&lt;/li&gt;
&lt;li&gt;API ingestion (trigger events): ~100K events/sec at peak (a single event can fan out to several recipients and channels, while many others are filtered out by user preferences)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Notification record: ~500 bytes (recipient, channel, template, payload, status, timestamps)&lt;/li&gt;
&lt;li&gt;1B/day × 500 bytes = &lt;strong&gt;500 GB/day&lt;/strong&gt;, 30-day retention = &lt;strong&gt;15 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;User preferences: 500M users × 200 bytes = &lt;strong&gt;100 GB&lt;/strong&gt; (fits in a single DB)&lt;/li&gt;
&lt;li&gt;Templates: ~10K templates × 5 KB = &lt;strong&gt;50 MB&lt;/strong&gt; (negligible)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;third-party-throughput&#34;&gt;Third-Party Throughput&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Push (APNs/FCM): Both support ~100K sends/sec with connection pooling&lt;/li&gt;
&lt;li&gt;Email (SES/SendGrid): SES supports 50K emails/sec at scale&lt;/li&gt;
&lt;li&gt;SMS (Twilio/SNS): ~1000 SMS/sec per account (need multiple accounts or regional providers for 50M/day)&lt;/li&gt;
&lt;li&gt;SMS is the bottleneck — 50M SMS/day ÷ 86400 = 579/sec average, 2900/sec peak. Need multiple provider accounts.&lt;/li&gt;
&lt;/ul&gt;
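&lt;p&gt;The SMS account count follows directly from the peak math above; a sketch under those assumptions:&lt;/p&gt;

```python
import math

def sms_accounts_needed(daily_sms, per_account_rps=1000, peak_factor=5):
    """Provider accounts needed to cover peak SMS throughput,
    assuming ~1000 SMS/sec per account and a 5x peak over the daily average."""
    peak_rps = daily_sms / 86_400 * peak_factor
    return math.ceil(peak_rps / per_account_rps)
```

&lt;p&gt;At these rates, 50M SMS/day works out to 3 provider accounts.&lt;/p&gt;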
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;high-throughput, multi-channel delivery pipeline&lt;/strong&gt; where the core challenges are: (1) routing to the right channel at the right time, (2) respecting user preferences and rate limits, (3) handling third-party provider failures gracefully, and (4) ensuring no notification is lost or duplicated.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Parts Compatibility System (PCPartPicker)</title>
      <link>https://chiraghasija.cc/designs/parts-compatibility/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/parts-compatibility/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users select PC components (CPU, motherboard, RAM, GPU, storage, PSU, case) and the system validates compatibility across all parts in real-time&lt;/li&gt;
&lt;li&gt;Compatibility rules engine checks socket matching (CPU ↔ motherboard), form factor (motherboard ↔ case), RAM type (DDR4 vs DDR5), power requirements (total wattage ↔ PSU), and physical clearances (GPU length, cooler height)&lt;/li&gt;
&lt;li&gt;Provide a power budget calculator that sums component TDPs and recommends minimum PSU wattage with appropriate headroom&lt;/li&gt;
&lt;li&gt;Detect and warn about performance bottlenecks (e.g., pairing a high-end GPU with a low-end CPU)&lt;/li&gt;
&lt;li&gt;Aggregate pricing from multiple retailers with real-time price tracking, price history, and price drop alerts&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — the site is informational/commerce, not life-critical. Brief outages are tolerable but costly (lost affiliate revenue).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Compatibility checks must complete in &amp;lt; 200ms as users add each component. Price data should be &amp;lt; 1 hour stale.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Compatibility rules must be strictly correct — a false &amp;ldquo;compatible&amp;rdquo; signal that leads to a bad purchase is unacceptable. False &amp;ldquo;incompatible&amp;rdquo; (overly conservative) is tolerable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500K product SKUs across 15 component categories. 10M monthly active users. 50K concurrent build sessions. 100M compatibility checks/day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; User builds must be saved reliably. Price history must be retained for 2+ years for trend analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;product-catalog&#34;&gt;Product Catalog&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K SKUs across 15 categories&lt;/li&gt;
&lt;li&gt;Average product record: 2 KB (specs, descriptions, images, links)&lt;/li&gt;
&lt;li&gt;Total catalog: 500K × 2 KB = &lt;strong&gt;1 GB&lt;/strong&gt; — easily fits in memory&lt;/li&gt;
&lt;li&gt;Compatibility attributes per product: ~500 bytes of structured spec data&lt;/li&gt;
&lt;li&gt;Compatibility data: 500K × 500 bytes = &lt;strong&gt;250 MB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;compatibility-checks&#34;&gt;Compatibility Checks&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100M checks/day = &lt;strong&gt;1,157 checks/sec&lt;/strong&gt; average, ~5,000/sec peak&lt;/li&gt;
&lt;li&gt;Each check: evaluate ~10-15 rules per component pair&lt;/li&gt;
&lt;li&gt;Average build has 7 components → adding 1 component checks against 6 others → ~60-90 rule evaluations&lt;/li&gt;
&lt;li&gt;At &amp;lt; 200ms budget and ~90 rules: &amp;lt; 2ms per rule&lt;/li&gt;
&lt;/ul&gt;
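&lt;p&gt;A few of the pairwise rules can be sketched to show why each evaluation is cheap (well under the 2 ms budget). Field names and the ~30% PSU headroom rule are illustrative:&lt;/p&gt;

```python
def check_build(parts):
    """Evaluate a handful of compatibility rules over a build.
    parts: {category: spec dict}; rules and fields are illustrative."""
    issues = []
    cpu, mobo = parts.get("cpu"), parts.get("motherboard")
    if cpu and mobo and cpu["socket"] != mobo["socket"]:
        issues.append(f"CPU socket {cpu['socket']} does not fit {mobo['socket']}")
    ram = parts.get("ram")
    if ram and mobo and ram["ddr_gen"] != mobo["ddr_gen"]:
        issues.append("RAM generation mismatch (e.g. DDR4 vs DDR5)")
    psu = parts.get("psu")
    if psu:
        tdp = sum(p.get("tdp_watts", 0) for p in parts.values())
        if tdp * 1.3 > psu["wattage"]:  # ~30% headroom rule (assumed)
            issues.append(f"PSU {psu['wattage']} W is under ~{int(tdp * 1.3)} W recommended")
    return {"status": "incompatible" if issues else "compatible", "issues": issues}
```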
&lt;h3 id=&#34;price-data&#34;&gt;Price Data&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K products × 10 retailers average = 5M price points&lt;/li&gt;
&lt;li&gt;Price scrape frequency: every 1 hour per retailer&lt;/li&gt;
&lt;li&gt;5M scrapes/hour = &lt;strong&gt;1,400 scrapes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Price history: 5M price points × 365 days × 24 samples/day × 10 bytes = &lt;strong&gt;~440 GB/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;user-builds&#34;&gt;User Builds&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M monthly users, ~30% create builds = 3M builds/month&lt;/li&gt;
&lt;li&gt;Average build: 500 bytes (7 component IDs + metadata)&lt;/li&gt;
&lt;li&gt;3M × 500 bytes = &lt;strong&gt;1.5 GB/month&lt;/strong&gt; — trivial storage&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-endpoints&#34;&gt;REST Endpoints&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create or update a build
POST /v1/builds
  Headers: Authorization: Bearer &amp;lt;token&amp;gt;  // optional for anonymous builds
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;My Gaming PC 2026&amp;#34;,
    &amp;#34;components&amp;#34;: {
      &amp;#34;cpu&amp;#34;: &amp;#34;sku_amd_9800x3d&amp;#34;,
      &amp;#34;motherboard&amp;#34;: &amp;#34;sku_asus_x870e&amp;#34;,
      &amp;#34;ram&amp;#34;: &amp;#34;sku_corsair_ddr5_6000_32gb&amp;#34;,
      &amp;#34;gpu&amp;#34;: &amp;#34;sku_nvidia_5080&amp;#34;,
      &amp;#34;storage&amp;#34;: &amp;#34;sku_samsung_990_pro_2tb&amp;#34;,
      &amp;#34;psu&amp;#34;: &amp;#34;sku_corsair_rm850x&amp;#34;,
      &amp;#34;case&amp;#34;: &amp;#34;sku_fractal_north&amp;#34;
    }
  }
  Response 200: {
    &amp;#34;build_id&amp;#34;: &amp;#34;build_abc123&amp;#34;,
    &amp;#34;compatibility&amp;#34;: {
      &amp;#34;status&amp;#34;: &amp;#34;compatible&amp;#34;,           // compatible | warning | incompatible
      &amp;#34;issues&amp;#34;: [],
      &amp;#34;warnings&amp;#34;: [
        {
          &amp;#34;type&amp;#34;: &amp;#34;bottleneck&amp;#34;,
          &amp;#34;severity&amp;#34;: &amp;#34;low&amp;#34;,
          &amp;#34;message&amp;#34;: &amp;#34;CPU may slightly bottleneck GPU at 1080p. Consider 1440p+ gaming.&amp;#34;
        }
      ],
      &amp;#34;power_budget&amp;#34;: {
        &amp;#34;estimated_tdp&amp;#34;: 450,
        &amp;#34;psu_wattage&amp;#34;: 850,
        &amp;#34;headroom_pct&amp;#34;: 47,
        &amp;#34;rating&amp;#34;: &amp;#34;excellent&amp;#34;
      }
    },
    &amp;#34;total_price&amp;#34;: 189499,              // $1,894.99 in cents
    &amp;#34;price_by_component&amp;#34;: { ... }
  }

// Check compatibility for adding a single component
POST /v1/builds/{build_id}/check
  Body: { &amp;#34;category&amp;#34;: &amp;#34;gpu&amp;#34;, &amp;#34;sku&amp;#34;: &amp;#34;sku_nvidia_5080&amp;#34; }
  Response 200: {
    &amp;#34;compatible&amp;#34;: true,
    &amp;#34;issues&amp;#34;: [],
    &amp;#34;warnings&amp;#34;: [...],
    &amp;#34;updated_power_budget&amp;#34;: { ... }
  }

// Search products with filters
GET /v1/products?category=gpu&amp;amp;socket=AM5&amp;amp;min_vram=12&amp;amp;max_price=80000&amp;amp;sort=price_asc
  Response 200: {
    &amp;#34;products&amp;#34;: [
      {
        &amp;#34;sku&amp;#34;: &amp;#34;sku_nvidia_5080&amp;#34;,
        &amp;#34;name&amp;#34;: &amp;#34;NVIDIA GeForce RTX 5080&amp;#34;,
        &amp;#34;specs&amp;#34;: { &amp;#34;vram_gb&amp;#34;: 16, &amp;#34;tdp_watts&amp;#34;: 360, &amp;#34;length_mm&amp;#34;: 304, ... },
        &amp;#34;compatible_with_build&amp;#34;: true,  // pre-filtered based on current build
        &amp;#34;prices&amp;#34;: [
          { &amp;#34;retailer&amp;#34;: &amp;#34;Amazon&amp;#34;, &amp;#34;price&amp;#34;: 99999, &amp;#34;url&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;in_stock&amp;#34;: true },
          { &amp;#34;retailer&amp;#34;: &amp;#34;Newegg&amp;#34;, &amp;#34;price&amp;#34;: 98999, &amp;#34;url&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;in_stock&amp;#34;: true }
        ]
      }
    ]
  }

// Get price history for a product
GET /v1/products/{sku}/price-history?period=90d
  Response 200: {
    &amp;#34;sku&amp;#34;: &amp;#34;sku_nvidia_5080&amp;#34;,
    &amp;#34;history&amp;#34;: [
      { &amp;#34;date&amp;#34;: &amp;#34;2026-02-22&amp;#34;, &amp;#34;min_price&amp;#34;: 98999, &amp;#34;avg_price&amp;#34;: 100499 },
      ...
    ]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;products-postgresql&#34;&gt;Products (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: products
  sku              (PK) | varchar(50)
  category               | enum(&amp;#39;cpu&amp;#39;,&amp;#39;motherboard&amp;#39;,&amp;#39;ram&amp;#39;,&amp;#39;gpu&amp;#39;,&amp;#39;storage&amp;#39;,&amp;#39;psu&amp;#39;,&amp;#39;case&amp;#39;,&amp;#39;cooler&amp;#39;,&amp;#39;monitor&amp;#39;,...)
  name                   | varchar(200)
  manufacturer           | varchar(100)
  specs                  | jsonb         -- category-specific specifications
  compatibility_attrs    | jsonb         -- extracted compatibility attributes
  image_url              | varchar(500)
  release_date           | date
  is_active              | boolean
  updated_at             | timestamp

-- Example specs for a CPU:
-- { &amp;#34;socket&amp;#34;: &amp;#34;AM5&amp;#34;, &amp;#34;cores&amp;#34;: 8, &amp;#34;threads&amp;#34;: 16, &amp;#34;base_clock_ghz&amp;#34;: 4.7,
--   &amp;#34;boost_clock_ghz&amp;#34;: 5.2, &amp;#34;tdp_watts&amp;#34;: 120, &amp;#34;ram_type&amp;#34;: &amp;#34;DDR5&amp;#34;,
--   &amp;#34;max_ram_speed_mhz&amp;#34;: 5200, &amp;#34;pcie_version&amp;#34;: &amp;#34;5.0&amp;#34;, &amp;#34;pcie_lanes&amp;#34;: 24,
--   &amp;#34;integrated_graphics&amp;#34;: false, &amp;#34;cooler_included&amp;#34;: false }

-- Example compatibility_attrs for a CPU:
-- { &amp;#34;socket&amp;#34;: &amp;#34;AM5&amp;#34;, &amp;#34;ram_type&amp;#34;: &amp;#34;DDR5&amp;#34;, &amp;#34;tdp_watts&amp;#34;: 120, &amp;#34;pcie_version&amp;#34;: &amp;#34;5.0&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;compatibility-rules-postgresql--in-memory-cache&#34;&gt;Compatibility Rules (PostgreSQL + in-memory cache)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: compatibility_rules
  rule_id          (PK) | int
  name                   | varchar(100)
  category_a             | varchar(20)   -- e.g., &amp;#34;cpu&amp;#34;
  category_b             | varchar(20)   -- e.g., &amp;#34;motherboard&amp;#34;
  rule_type              | enum(&amp;#39;must_match&amp;#39;, &amp;#39;must_fit&amp;#39;, &amp;#39;must_not_exceed&amp;#39;, &amp;#39;warning&amp;#39;)
  attribute_a            | varchar(50)   -- e.g., &amp;#34;socket&amp;#34;
  attribute_b            | varchar(50)   -- e.g., &amp;#34;socket&amp;#34;
  operator               | enum(&amp;#39;equals&amp;#39;, &amp;#39;gte&amp;#39;, &amp;#39;lte&amp;#39;, &amp;#39;contains&amp;#39;, &amp;#39;fits_in&amp;#39;, &amp;#39;custom&amp;#39;)
  custom_logic           | text          -- for complex rules (e.g., RAM slot count + DIMM count)
  severity               | enum(&amp;#39;error&amp;#39;, &amp;#39;warning&amp;#39;, &amp;#39;info&amp;#39;)
  message_template       | text          -- &amp;#34;CPU socket {a} does not match motherboard socket {b}&amp;#34;
  is_active              | boolean
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;user-builds-postgresql&#34;&gt;User Builds (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: builds
  build_id         (PK) | varchar(20)
  user_id          (FK) | varchar(20)  -- nullable for anonymous
  name                   | varchar(100)
  components             | jsonb        -- { &amp;#34;cpu&amp;#34;: &amp;#34;sku_xxx&amp;#34;, &amp;#34;gpu&amp;#34;: &amp;#34;sku_yyy&amp;#34;, ... }
  total_price            | int
  compatibility_status   | enum(&amp;#39;compatible&amp;#39;,&amp;#39;warning&amp;#39;,&amp;#39;incompatible&amp;#39;)
  is_public              | boolean
  created_at             | timestamp
  updated_at             | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;prices-timescaledb--postgresql-with-partitioning&#34;&gt;Prices (TimescaleDB / PostgreSQL with partitioning)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: prices
  sku              (FK) | varchar(50)
  retailer               | varchar(50)
  price                  | int           -- in cents
  in_stock               | boolean
  url                    | varchar(500)
  scraped_at             | timestamp
  -- Partitioned by month on scraped_at
  -- Index on (sku, retailer, scraped_at DESC)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-postgresql&#34;&gt;Why PostgreSQL?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Product catalog and compatibility rules need relational integrity&lt;/li&gt;
&lt;li&gt;JSONB for flexible specs (each category has different attributes)&lt;/li&gt;
&lt;li&gt;Compatibility rules are read-heavy, easily cached — PostgreSQL is fine as the source of truth&lt;/li&gt;
&lt;li&gt;Price data benefits from TimescaleDB extension for time-series queries (price history charts)&lt;/li&gt;
&lt;/ul&gt;
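The declarative rules table above can be driven by a small interpreter. A minimal sketch, assuming the column names from the `compatibility_rules` schema; the `OPERATORS` map and `evaluate_rule` helper are illustrative names, not part of the design:

```python
# Illustrative: evaluate one compatibility_rules row against two components'
# compatibility_attrs dicts (the JSONB column from the products table).
OPERATORS = {
    "equals":   lambda a, b: a == b,
    "gte":      lambda a, b: a >= b,
    "lte":      lambda a, b: b >= a,   # i.e. a is at most b
    "contains": lambda a, b: b in a,
}

def evaluate_rule(rule, comp_a_attrs, comp_b_attrs):
    """Return None if the rule passes, else a dict describing the issue."""
    a = comp_a_attrs.get(rule["attribute_a"])
    b = comp_b_attrs.get(rule["attribute_b"])
    if a is None or b is None:
        return None  # missing spec data: flag upstream rather than block the build
    if OPERATORS[rule["operator"]](a, b):
        return None
    return {
        "severity": rule["severity"],
        "message": rule["message_template"].format(a=a, b=b),
    }

socket_rule = {
    "attribute_a": "socket", "attribute_b": "socket",
    "operator": "equals", "severity": "error",
    "message_template": "CPU socket {a} does not match motherboard socket {b}",
}
cpu  = {"socket": "AM5", "ram_type": "DDR5", "tdp_watts": 120}
mobo = {"socket": "LGA1851"}
issue = evaluate_rule(socket_rule, cpu, mobo)
# issue["message"] is "CPU socket AM5 does not match motherboard socket LGA1851"
```

Because rules are data, adding a new check (say, cooler height vs. case clearance) is a row insert plus a cache refresh, not a code deploy.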
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;compatibility-check-flow&#34;&gt;Compatibility Check Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User adds a GPU to their build:
  → Frontend sends: POST /v1/builds/{id}/check { &amp;#34;category&amp;#34;: &amp;#34;gpu&amp;#34;, &amp;#34;sku&amp;#34;: &amp;#34;sku_5080&amp;#34; }
    → Build Service:
      1. Load current build components from cache/DB
      2. Load new component&amp;#39;s compatibility_attrs from product cache
      3. Call Compatibility Engine:
         For each existing component in the build:
           → Find applicable rules (gpu ↔ motherboard, gpu ↔ psu, gpu ↔ case, ...)
           → Evaluate each rule:
              Rule: gpu.pcie_version &amp;lt;= motherboard.pcie_version → OK
              Rule: gpu.length_mm &amp;lt;= case.max_gpu_length_mm → OK
              Rule: gpu.tdp_watts + ... &amp;lt;= psu.wattage × 0.8 → OK
              Rule: gpu.power_connectors available on PSU → OK
           → Collect all issues and warnings
      4. Calculate power budget:
         Sum TDP of all components → estimated_draw
         Headroom = (psu_wattage - estimated_draw) / psu_wattage × 100
      5. Check bottleneck (heuristic):
         CPU tier vs GPU tier (based on benchmark scores)
         If mismatch &amp;gt; threshold → warning
      6. Return result
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;product-search-with-compatibility-filtering&#34;&gt;Product Search with Compatibility Filtering&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User is on GPU selection page (already has CPU, motherboard, case, PSU chosen):
  → GET /v1/products?category=gpu&amp;amp;compatible_with=build_abc123

  → Product Service:
    1. Load all GPUs from product cache (filtered by user&amp;#39;s other criteria)
    2. For each GPU candidate:
       → Run compatibility check against existing build components
       → Mark as compatible/warning/incompatible
    3. Return only compatible + warning GPUs (hide incompatible by default)
    4. Sort by user preference (price, performance, popularity)

  Optimization: Pre-compute compatibility for common component pairs.
  For 1,000 CPUs × 500 motherboards = 500K pairs → precompute and cache.
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Build Service:&lt;/strong&gt; Manages user builds. Orchestrates compatibility checks. Stateless, horizontally scaled.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compatibility Engine:&lt;/strong&gt; Core rule evaluation logic. Loaded in-memory with all rules and product compatibility attributes. &amp;lt; 200ms per full build check.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product Service:&lt;/strong&gt; Manages the product catalog. Search, filter, sort. Elasticsearch for full-text search + faceted filtering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Price Aggregation Service:&lt;/strong&gt; Scrapes retailer prices. Stores price history. Detects price drops and sends alerts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product Cache (Redis):&lt;/strong&gt; All 500K products + compatibility attributes cached. 250 MB — fits easily. Refreshed on catalog updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rules Cache (in-memory):&lt;/strong&gt; All compatibility rules loaded into each service instance&amp;rsquo;s memory. Refreshed every 5 minutes from PostgreSQL.&lt;/li&gt;
&lt;/ol&gt;
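The power-budget step from the flow above is simple enough to sketch directly. The 0.8 PSU loading factor and headroom formula come from the flow; the rating thresholds here are assumptions for illustration:

```python
# Sketch of step 4 of the compatibility check: sum component TDPs,
# compute PSU headroom, and rate the result.
def power_budget(components, psu_wattage):
    draw = sum(c.get("tdp_watts", 0) for c in components)
    headroom_pct = round((psu_wattage - draw) / psu_wattage * 100)
    ok = psu_wattage * 0.8 >= draw          # total draw must stay within 80% of PSU
    # Rating cutoffs below are illustrative, not from the design
    rating = "excellent" if headroom_pct >= 40 else ("adequate" if ok else "insufficient")
    return {"estimated_tdp": draw, "psu_wattage": psu_wattage,
            "headroom_pct": headroom_pct, "rating": rating}

build = [{"tdp_watts": 120},   # CPU
         {"tdp_watts": 300},   # GPU
         {"tdp_watts": 30}]    # rough allowance for everything else
budget = power_budget(build, 850)
# matches the API example: 450W draw, 47% headroom, "excellent"
```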
&lt;h3 id=&#34;architecture-diagram&#34;&gt;Architecture Diagram&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Users (browser)
  → CDN (static assets, product images)
  → API Gateway → Load Balancer
    → Build Service
      → Compatibility Engine (in-process)
        → Product Cache (Redis)
        → Rules Cache (in-memory)
    → Product Service
      → Elasticsearch (search + faceted filters)
      → PostgreSQL (catalog source of truth)
    → Price Service
      → Price Scrapers (distributed workers)
      → TimescaleDB (price history)
      → Redis (current prices)

Background:
  Price Scrapers → Retailer websites/APIs → TimescaleDB + Redis
  Catalog Updater → Manufacturer feeds → PostgreSQL → Redis cache invalidation
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-the-compatibility-rules-engine&#34;&gt;Deep Dive 1: The Compatibility Rules Engine&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Rule categories and examples:&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Payment Payout System</title>
      <link>https://chiraghasija.cc/designs/seller-payments/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/seller-payments/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;question-statement&#34;&gt;Question Statement&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Design a payment payout system for a marketplace (like Amazon, Uber, Airbnb) that pays out sellers/drivers/hosts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Constraints:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Payment gateway takes ~1 minute to process a payment&lt;/li&gt;
&lt;li&gt;Fixed fee per transfer charged by the payment gateway&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Requirements (in order of importance):&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Audit log of all payments made out to sellers&lt;/li&gt;
&lt;li&gt;Sellers paid out in their preferred form of payment, as soon as possible&lt;/li&gt;
&lt;li&gt;Sellers can check the status of their payments, or see steps required to fix problems&lt;/li&gt;
&lt;li&gt;Keep payment gateway fees to a minimum&lt;/li&gt;
&lt;li&gt;No payments should be dropped&lt;/li&gt;
&lt;li&gt;No duplicate payments should be made out&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;#&lt;/th&gt;
          &lt;th&gt;Requirement&lt;/th&gt;
          &lt;th&gt;Details&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;FR1&lt;/td&gt;
          &lt;td&gt;Payment Processing&lt;/td&gt;
          &lt;td&gt;Process seller payouts in their preferred method (bank transfer, PayPal, UPI, etc.)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR2&lt;/td&gt;
          &lt;td&gt;Batching&lt;/td&gt;
          &lt;td&gt;Batch multiple small payments into fewer gateway calls to minimize fees&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR3&lt;/td&gt;
          &lt;td&gt;Idempotency&lt;/td&gt;
          &lt;td&gt;No duplicate payments — exactly-once execution guarantee&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR4&lt;/td&gt;
          &lt;td&gt;Audit Log&lt;/td&gt;
          &lt;td&gt;Immutable, append-only log of every payment state transition&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR5&lt;/td&gt;
          &lt;td&gt;Status Tracking&lt;/td&gt;
          &lt;td&gt;Real-time, queryable status for every payment with actionable error messages&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR6&lt;/td&gt;
          &lt;td&gt;Retry &amp;amp; Recovery&lt;/td&gt;
          &lt;td&gt;Failed payments are retried with backoff; stuck payments are recovered automatically&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR7&lt;/td&gt;
          &lt;td&gt;Settlement Tracking&lt;/td&gt;
          &lt;td&gt;Track whether money actually landed in seller&amp;rsquo;s bank (gateway success ≠ bank settlement)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR8&lt;/td&gt;
          &lt;td&gt;Reconciliation&lt;/td&gt;
          &lt;td&gt;Daily comparison of gateway records vs internal records to catch mismatches&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;#&lt;/th&gt;
          &lt;th&gt;Requirement&lt;/th&gt;
          &lt;th&gt;Target&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR1&lt;/td&gt;
          &lt;td&gt;Durability&lt;/td&gt;
          &lt;td&gt;Zero data loss. RPO = 0. Every payment survives any single system failure&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR2&lt;/td&gt;
          &lt;td&gt;Consistency&lt;/td&gt;
          &lt;td&gt;Strong consistency for payment state. No scenario where payment is charged but not recorded&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR3&lt;/td&gt;
          &lt;td&gt;Availability&lt;/td&gt;
          &lt;td&gt;99.99% for payment ingestion and status API&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR4&lt;/td&gt;
          &lt;td&gt;Ingestion Latency&lt;/td&gt;
          &lt;td&gt;&amp;lt; 200ms to accept and acknowledge a payment request&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR5&lt;/td&gt;
          &lt;td&gt;Status Lookup Latency&lt;/td&gt;
          &lt;td&gt;&amp;lt; 100ms&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR6&lt;/td&gt;
          &lt;td&gt;Execution Latency&lt;/td&gt;
          &lt;td&gt;Minutes are acceptable (batching + gateway processing time)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR7&lt;/td&gt;
          &lt;td&gt;Idempotency&lt;/td&gt;
          &lt;td&gt;Every payment has a globally unique idempotency key at every level&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
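FR3 and NFR7 hinge on the client-supplied idempotency key: a retried `POST /payments` must return the original payment, never create a second one. A minimal sketch, with an in-memory dict standing in for what would be a unique-constraint insert on the payments table (`PaymentStore` is a hypothetical name):

```python
# Idempotent ingestion sketch: the idempotency_key dedupes retried requests.
import uuid

class PaymentStore:
    def __init__(self):
        self._by_key = {}

    def ingest(self, idempotency_key, seller_id, amount):
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing            # duplicate request: same response, no new payment
        payment = {"payment_id": "p_" + uuid.uuid4().hex[:8],
                   "seller_id": seller_id, "amount": amount, "status": "PENDING"}
        self._by_key[idempotency_key] = payment
        return payment

store = PaymentStore()
first = store.ingest("order_12345_payout_v1", "s456", 250.00)
retry = store.ingest("order_12345_payout_v1", "s456", 250.00)
# exactly one payment exists; the retry got the original back
```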
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Active sellers:           1M
Payouts per day:          10M transactions
Ingestion rate:           10M / 86400 ≈ 115 payments/sec
Batch ratio (10:1):       ~12 gateway calls/sec
Audit log growth:         10M records/day → ~3.6B/year
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /payments                      → Initiate a payout
  Request:
    {
      idempotency_key: &amp;#34;order_12345_payout_v1&amp;#34;,   → client generates this
      seller_id: &amp;#34;s456&amp;#34;,
      amount: 250.00,
      currency: &amp;#34;USD&amp;#34;,
      payment_method: &amp;#34;bank_transfer&amp;#34;
    }
  Response: 202 Accepted
    {
      payment_id: &amp;#34;p789&amp;#34;,
      status: &amp;#34;PENDING&amp;#34;
    }

GET /payments/{payment_id}          → Check payment status
  Response:
    {
      payment_id: &amp;#34;p789&amp;#34;,
      status: &amp;#34;ACCEPTED&amp;#34;,
      amount: 250.00,
      batch_id: &amp;#34;b101&amp;#34;,
      failure_reason: null,
      action_required: null,
      estimated_settlement: &amp;#34;2026-02-26T14:00:00Z&amp;#34;
    }

GET /sellers/{seller_id}/payments   → Seller payment history
  Query params: ?status=FAILED&amp;amp;from=2026-01-01&amp;amp;to=2026-02-25
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-payment-state-machine&#34;&gt;4. Payment State Machine&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PENDING ───▶ BATCHED ───▶ SUBMITTED ───▶ ACCEPTED ───▶ SETTLED ✓
                                          │
                                          ├───▶ REVERSED (bank rejected after acceptance)
                                          │
                                          └───▶ RETURNED (settled but bounced back)

At any failed stage:
  FAILED ───▶ (re-enters as PENDING after fix or retry)
&lt;/code&gt;&lt;/pre&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Status&lt;/th&gt;
          &lt;th&gt;Meaning&lt;/th&gt;
          &lt;th&gt;Seller-Facing Message&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;PENDING&lt;/td&gt;
          &lt;td&gt;Payment received, waiting to be batched&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment received, queued for processing&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;BATCHED&lt;/td&gt;
          &lt;td&gt;Grouped with other payments for fee optimization&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment queued for processing&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;SUBMITTED&lt;/td&gt;
          &lt;td&gt;Batch sent to payment gateway&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment sent to payment processor&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;ACCEPTED&lt;/td&gt;
          &lt;td&gt;Gateway accepted, in transit to bank&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment in transit to your bank&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;SETTLED&lt;/td&gt;
          &lt;td&gt;Money confirmed in seller&amp;rsquo;s account&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment deposited in your account&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;REVERSED&lt;/td&gt;
          &lt;td&gt;Bank rejected the transfer&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment failed: [reason]. [action required]&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RETURNED&lt;/td&gt;
          &lt;td&gt;Settled but bounced back&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment returned by your bank. Please contact your bank.&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FAILED&lt;/td&gt;
          &lt;td&gt;Exhausted retries or unrecoverable error&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment failed: [reason]. [action required]&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
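The diagram and table above can be encoded as an explicit transition table, so an illegal jump (e.g. PENDING straight to SETTLED) is rejected before it can be recorded. A sketch; the exact edge set is illustrative (the diagram shows RETURNED branching from ACCEPTED, while the table describes it as settled-then-bounced, so both edges appear here):

```python
# Payment state machine as data: any transition not listed is rejected.
TRANSITIONS = {
    "PENDING":   {"BATCHED", "FAILED"},
    "BATCHED":   {"SUBMITTED", "FAILED"},
    "SUBMITTED": {"ACCEPTED", "FAILED"},
    "ACCEPTED":  {"SETTLED", "REVERSED", "RETURNED"},
    "SETTLED":   {"RETURNED"},         # settled money can still bounce back
    "FAILED":    {"PENDING"},          # re-enters after fix or retry
}

def advance(payment, new_status):
    if new_status not in TRANSITIONS.get(payment["status"], set()):
        raise ValueError(f"illegal transition {payment['status']} to {new_status}")
    payment["status"] = new_status     # in production: also append to the audit log
    return payment

p = {"status": "PENDING"}
for step in ("BATCHED", "SUBMITTED", "ACCEPTED", "SETTLED"):
    advance(p, step)                   # happy path walks the full chain
```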
&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Gateway returning &amp;ldquo;success&amp;rdquo; only means ACCEPTED, not SETTLED. True settlement confirmation comes hours/days later via webhooks or polling.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Peer-to-Peer File Sharing System (BitTorrent)</title>
      <link>https://chiraghasija.cc/designs/bittorrent/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/bittorrent/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can share files by creating a torrent (metadata file) and seeding (uploading) the actual content to peers&lt;/li&gt;
&lt;li&gt;Users can download files by obtaining a torrent file or magnet link, discovering peers who have the file, and downloading pieces from multiple peers simultaneously&lt;/li&gt;
&lt;li&gt;Support peer discovery through both centralized trackers and decentralized DHT (Distributed Hash Table)&lt;/li&gt;
&lt;li&gt;Ensure file integrity — every downloaded piece is verified against a cryptographic hash before being accepted&lt;/li&gt;
&lt;li&gt;Implement an incentive mechanism (tit-for-tat) — peers who upload more get faster downloads; freeloaders (leechers) get throttled&lt;/li&gt;
&lt;/ol&gt;
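The tit-for-tat mechanism in requirement 5 amounts to a periodic unchoke decision: reward the peers that uploaded to us the most, and reserve one random "optimistic" slot so new peers can prove themselves. A sketch; the slot counts are conventional client behavior, not fixed by the protocol:

```python
# Tit-for-tat unchoke selection sketch.
import random

def choose_unchoked(peers, regular_slots=3):
    """peers: dict of peer_id to bytes they uploaded to us in the last period."""
    by_contribution = sorted(peers, key=peers.get, reverse=True)
    unchoked = by_contribution[:regular_slots]      # best uploaders keep their slots
    choked = by_contribution[regular_slots:]
    if choked:
        unchoked.append(random.choice(choked))      # optimistic unchoke: give one a chance
    return set(unchoked)

swarm = {"peer_a": 9_000_000, "peer_b": 4_000_000, "peer_c": 100_000, "peer_d": 0}
active = choose_unchoked(swarm)
# top three uploaders are always unchoked; the freeloader only gets the optimistic slot
```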
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; The system must work even when the tracker is offline (DHT provides decentralized fallback). No single point of failure for content distribution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Peer discovery &amp;lt; 5 seconds. First bytes of download begin within 10 seconds of starting. Sustained throughput scales with the number of available peers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for peer lists (stale peers are tolerable — the client will simply fail to connect and try others). Strong consistency for file integrity (every piece must match its hash).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Support files from 1 MB to 100 GB. Swarms with 1 million simultaneous peers. Global DHT with 100 million nodes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Files remain available as long as at least one seeder exists. The more popular a file, the more available it becomes (self-scaling).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;file-distribution&#34;&gt;File Distribution&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A 4 GB file, split into 256 KB pieces = 16,384 pieces&lt;/li&gt;
&lt;li&gt;Each piece has a 20-byte SHA-1 hash for verification&lt;/li&gt;
&lt;li&gt;Torrent metadata (info dictionary): 16,384 × 20 bytes = &lt;strong&gt;320 KB&lt;/strong&gt; of hashes + metadata ≈ 350 KB total&lt;/li&gt;
&lt;/ul&gt;
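The integrity scheme behind these numbers can be sketched with Python's `hashlib`: split the file into 256 KB pieces, store each piece's 20-byte SHA-1 in the torrent metadata, and accept a downloaded piece only if its hash matches:

```python
# Piece hashing and verification sketch, piece size per the estimate above.
import hashlib

PIECE_SIZE = 256 * 1024

def piece_hashes(data):
    """20-byte SHA-1 per piece; this list is what goes in the torrent's info dict."""
    return [hashlib.sha1(data[i:i + PIECE_SIZE]).digest()
            for i in range(0, len(data), PIECE_SIZE)]

def verify_piece(index, piece, hashes):
    return hashlib.sha1(piece).digest() == hashes[index]

file_bytes = bytes(1_000_000)                 # stand-in for real content
hashes = piece_hashes(file_bytes)             # 4 pieces: ceil(1,000,000 / 262,144)
ok = verify_piece(0, file_bytes[:PIECE_SIZE], hashes)
bad = verify_piece(0, b"corrupted", hashes)
```

Verification happens per piece, so a corrupt or malicious peer costs at most one 256 KB re-download, never the whole file.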
&lt;h3 id=&#34;tracker-load&#34;&gt;Tracker Load&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1 million active torrents&lt;/li&gt;
&lt;li&gt;Average swarm size: 200 peers&lt;/li&gt;
&lt;li&gt;Each peer announces to the tracker every 30 minutes&lt;/li&gt;
&lt;li&gt;200M peers / 1800 seconds = &lt;strong&gt;111K announce requests/sec&lt;/strong&gt; to the tracker&lt;/li&gt;
&lt;li&gt;Each announce: ~200 bytes request, ~2 KB response (list of 50 peers)&lt;/li&gt;
&lt;li&gt;Bandwidth: 111K × 2 KB = &lt;strong&gt;222 MB/sec&lt;/strong&gt; tracker outbound&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;dht-scale&#34;&gt;DHT Scale&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100 million DHT nodes globally&lt;/li&gt;
&lt;li&gt;Routing table per node: 160 k-buckets × 8 entries = 1,280 peers ≈ &lt;strong&gt;50 KB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;DHT lookup: O(log N) hops. log2(100M) ≈ 27 hops, but k-bucket routing reduces this to ~4-6 hops in practice&lt;/li&gt;
&lt;li&gt;Each hop: ~50ms (UDP round trip) → &lt;strong&gt;200-300ms&lt;/strong&gt; total DHT lookup time&lt;/li&gt;
&lt;/ul&gt;
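The hop count above comes from Kademlia's XOR metric: each hop queries nodes whose IDs share a longer prefix with the target, roughly halving the remaining ID space. A sketch with small 8-bit IDs instead of the real 160-bit ones:

```python
# Kademlia XOR distance sketch (toy 8-bit IDs; BitTorrent's DHT uses 160-bit).
def xor_distance(node_id, target_id):
    return node_id ^ target_id

def closest(nodes, target_id, k=8):
    """One lookup step: return the k known nodes closest to the target."""
    return sorted(nodes, key=lambda n: xor_distance(n, target_id))[:k]

target = 0b1011_0000
known = [0b1011_0001, 0b0100_1111, 0b1010_0000, 0b1111_0000]
nearest = closest(known, target, k=2)
# the node differing only in the last bit (0b1011_0001) sorts first
```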
&lt;h3 id=&#34;download-throughput&#34;&gt;Download Throughput&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A well-seeded torrent with 100 peers, each contributing 1 MB/sec&lt;/li&gt;
&lt;li&gt;Theoretical max: 100 MB/sec aggregate; in practice the ceiling is the receiver&amp;rsquo;s own downlink, not the peer count&lt;/li&gt;
&lt;li&gt;Practical: 20-50 MB/sec (connection overhead, slow peers, churn)&lt;/li&gt;
&lt;li&gt;A 4 GB file at 30 MB/sec ≈ &lt;strong&gt;2.2 minutes&lt;/strong&gt; to download&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;p&gt;BitTorrent is a peer-to-peer protocol rather than a traditional client-server API, but it still defines three distinct interfaces.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Performance Metrics Collection System</title>
      <link>https://chiraghasija.cc/designs/performance-metrics/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/performance-metrics/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Collect performance metrics (CPU, memory, disk, network, custom application metrics) from thousands of servers and services&lt;/li&gt;
&lt;li&gt;Store time-series data with configurable retention (high-resolution for recent data, downsampled for historical)&lt;/li&gt;
&lt;li&gt;Support flexible querying: aggregate metrics across dimensions (host, service, region), with functions (avg, p99, sum, rate)&lt;/li&gt;
&lt;li&gt;Real-time alerting: trigger alerts when metrics cross thresholds (e.g., CPU &amp;gt; 90% for 5 minutes)&lt;/li&gt;
&lt;li&gt;Dashboard visualization: support building dashboards with graphs, heatmaps, and tables that auto-refresh&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — monitoring must be more reliable than what it monitors. A monitoring outage during a production incident is catastrophic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Metric ingestion: &amp;lt; 5 seconds end-to-end (from source to queryable). Dashboard queries: &amp;lt; 500ms for recent data (last 1 hour), &amp;lt; 2 seconds for historical data (last 30 days).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10,000 hosts, each emitting 500 metrics at 10-second intervals = 500K data points/sec ingestion. 10 million active time series. 10TB of metric data in storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Metric data must survive single-node failures. 30-day raw retention, 1-year downsampled retention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cardinality:&lt;/strong&gt; Handle high-cardinality labels gracefully (up to 1M unique label combinations per metric name without degrading query performance).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ingestion&#34;&gt;Ingestion&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10,000 hosts x 500 metrics x 1 sample/10 sec = &lt;strong&gt;500K samples/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each sample: metric_name (hashed to 8 bytes) + labels (8 bytes hash) + timestamp (8 bytes) + value (8 bytes) = ~32 bytes&lt;/li&gt;
&lt;li&gt;Ingestion bandwidth: 500K x 32 bytes = &lt;strong&gt;16 MB/sec&lt;/strong&gt; — modest, not a bottleneck&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Samples per day: 500K/sec x 86,400 = &lt;strong&gt;43.2 billion samples/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Raw storage per day: 43.2B x 16 bytes (timestamp + value, compressed) = &lt;strong&gt;691 GB/day&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;With compression (gorilla encoding): ~1.37 bytes per sample → &lt;strong&gt;59 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;30-day raw retention: 59 x 30 = &lt;strong&gt;1.8 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;1-year downsampled (5-min resolution): 59 GB/day / 30 (10s → 5min) x 365 = &lt;strong&gt;717 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Total: ~&lt;strong&gt;2.5 TB&lt;/strong&gt; — surprisingly small thanks to time-series compression&lt;/li&gt;
&lt;/ul&gt;
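The storage estimate above, reproduced end-to-end in decimal units (the 1.37 bytes/sample Gorilla figure and the 30x downsampling factor are the ones quoted above):

```python
# Storage arithmetic sketch for the figures above.
samples_per_sec = 10_000 * 500 / 10              # 500K samples/sec
per_day = samples_per_sec * 86_400               # 43.2 billion samples/day

raw_gb_day = per_day * 16 / 1e9                  # 16 bytes/sample uncompressed
gorilla_gb_day = per_day * 1.37 / 1e9            # ~1.37 bytes/sample with Gorilla

retained_30d_tb = gorilla_gb_day * 30 / 1e3      # 30-day raw retention
downsampled_year_gb = gorilla_gb_day / 30 * 365  # 10s to 5min is a 30x reduction

print(round(raw_gb_day))          # 691
print(round(gorilla_gb_day))      # 59
print(round(retained_30d_tb, 1))  # 1.8
```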
&lt;h3 id=&#34;queries&#34;&gt;Queries&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active dashboards: 1,000 dashboards, each with 10 panels, refreshing every 30 seconds&lt;/li&gt;
&lt;li&gt;Query QPS: 1,000 x 10 / 30 = &lt;strong&gt;~333 queries/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each query scans: ~1 hour of data for 100 time series = 100 x 360 samples = 36K samples per query&lt;/li&gt;
&lt;li&gt;Query throughput: 333 x 36K = &lt;strong&gt;12M samples scanned/sec&lt;/strong&gt; — well within SSD IOPS capacity&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ingestion-api&#34;&gt;Ingestion API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Push metrics (agent → collector)
POST /api/v1/write
  Content-Type: application/x-protobuf    // or application/json
  Body: {
    &amp;#34;timeseries&amp;#34;: [
      {
        &amp;#34;labels&amp;#34;: {&amp;#34;__name__&amp;#34;: &amp;#34;cpu_usage&amp;#34;, &amp;#34;host&amp;#34;: &amp;#34;web-01&amp;#34;, &amp;#34;region&amp;#34;: &amp;#34;us-east&amp;#34;, &amp;#34;service&amp;#34;: &amp;#34;api&amp;#34;},
        &amp;#34;samples&amp;#34;: [
          {&amp;#34;timestamp&amp;#34;: 1708632060, &amp;#34;value&amp;#34;: 73.5},
          {&amp;#34;timestamp&amp;#34;: 1708632070, &amp;#34;value&amp;#34;: 75.2}
        ]
      },
      {
        &amp;#34;labels&amp;#34;: {&amp;#34;__name__&amp;#34;: &amp;#34;http_request_duration_seconds&amp;#34;, &amp;#34;method&amp;#34;: &amp;#34;GET&amp;#34;, &amp;#34;path&amp;#34;: &amp;#34;/api/users&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;200&amp;#34;},
        &amp;#34;samples&amp;#34;: [
          {&amp;#34;timestamp&amp;#34;: 1708632060, &amp;#34;value&amp;#34;: 0.042}
        ]
      }
    ]
  }
  Response 200: { &amp;#34;samples_written&amp;#34;: 3 }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;query-api&#34;&gt;Query API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// PromQL-compatible query
GET /api/v1/query_range
  Params:
    query = avg(rate(http_request_duration_seconds{service=&amp;#34;api&amp;#34;}[5m])) by (path)
    start = 1708628460    // 1 hour ago
    end = 1708632060      // now
    step = 15             // 15-second resolution

  Response 200: {
    &amp;#34;result_type&amp;#34;: &amp;#34;matrix&amp;#34;,
    &amp;#34;result&amp;#34;: [
      {
        &amp;#34;labels&amp;#34;: {&amp;#34;path&amp;#34;: &amp;#34;/api/users&amp;#34;},
        &amp;#34;values&amp;#34;: [[1708628460, &amp;#34;0.035&amp;#34;], [1708628475, &amp;#34;0.038&amp;#34;], ...]
      },
      {
        &amp;#34;labels&amp;#34;: {&amp;#34;path&amp;#34;: &amp;#34;/api/orders&amp;#34;},
        &amp;#34;values&amp;#34;: [[1708628460, &amp;#34;0.122&amp;#34;], [1708628475, &amp;#34;0.118&amp;#34;], ...]
      }
    ]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;alerting-api&#34;&gt;Alerting API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create alert rule
POST /api/v1/alerts
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;HighCPU&amp;#34;,
    &amp;#34;expr&amp;#34;: &amp;#34;avg(cpu_usage{service=&amp;#39;api&amp;#39;}) by (host) &amp;gt; 90&amp;#34;,
    &amp;#34;for&amp;#34;: &amp;#34;5m&amp;#34;,                    // must be true for 5 minutes
    &amp;#34;severity&amp;#34;: &amp;#34;critical&amp;#34;,
    &amp;#34;annotations&amp;#34;: {
      &amp;#34;summary&amp;#34;: &amp;#34;Host {{ $labels.host }} CPU is {{ $value }}%&amp;#34;,
      &amp;#34;runbook&amp;#34;: &amp;#34;https://wiki/runbooks/high-cpu&amp;#34;
    },
    &amp;#34;notify&amp;#34;: [&amp;#34;pagerduty-oncall&amp;#34;, &amp;#34;slack-infra&amp;#34;]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Pull model (Prometheus-style) vs push model: we use &lt;strong&gt;push&lt;/strong&gt; — agents push to collectors. Better for ephemeral/auto-scaled instances that may disappear before being scraped.&lt;/li&gt;
&lt;li&gt;PromQL-compatible query language — industry standard, rich aggregation functions&lt;/li&gt;
&lt;li&gt;Protobuf for ingestion wire format — 50% smaller than JSON, faster serialization&lt;/li&gt;
&lt;/ul&gt;
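&lt;p&gt;As a sanity check on the write API above, here is a minimal Python sketch that batches samples into the request body. It uses JSON for readability (a production agent would use the protobuf wire format noted above), and the field names simply mirror the example payload:&lt;/p&gt;

```python
import json

def build_write_payload(series):
    """Batch samples into the JSON shape accepted by POST /api/v1/write.

    `series` is a list of (labels_dict, [(timestamp, value), ...]) pairs.
    """
    return {
        "series": [
            {
                "labels": labels,
                "samples": [{"timestamp": ts, "value": v} for ts, v in samples],
            }
            for labels, samples in series
        ]
    }

payload = build_write_payload([
    ({"__name__": "cpu_usage", "host": "web-01"},
     [(1708632060, 73.5), (1708632070, 75.2)]),
    ({"__name__": "http_request_duration_seconds", "method": "GET"},
     [(1708632060, 0.042)]),
])

# The server acknowledges with the total sample count across all series
total = sum(len(s["samples"]) for s in payload["series"])
print(total)  # 3 (matches the samples_written example response)
body = json.dumps(payload)  # what actually goes over the wire
```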
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;time-series-identification&#34;&gt;Time Series Identification&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;A time series is uniquely identified by its label set:
  {__name__=&amp;#34;http_requests_total&amp;#34;, method=&amp;#34;GET&amp;#34;, path=&amp;#34;/api/users&amp;#34;, status=&amp;#34;200&amp;#34;, host=&amp;#34;web-01&amp;#34;}

Internal representation:
  series_id  = hash(sorted_labels) → uint64
  This is the primary key for all operations
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;storage-format-custom-time-series-optimized&#34;&gt;Storage Format (custom time-series optimized)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;On-disk structure (per series, per time block):
  ┌───────────────────────────────────┐
  │ Block (2-hour time window)        │
  │ ┌───────────────────────────────┐ │
  │ │ Series 1: [timestamps][values]│ │
  │ │ Series 2: [timestamps][values]│ │
  │ │ ...                           │ │
  │ └───────────────────────────────┘ │
  │ Index: series_id → offset         │
  │ Label index: label → series_ids   │
  │ Tombstones (deletions)            │
  └───────────────────────────────────┘

Compression (Gorilla encoding):
  Timestamps: delta-of-delta encoding
    - Most deltas are 0 (fixed 10s interval) → 1 bit per timestamp
    - Occasional jitter → few extra bits
    - Average: 1-2 bits per timestamp

  Values: XOR encoding
    - Consecutive values of the same metric are similar
    - XOR of consecutive values has many leading/trailing zeros
    - Average: 1-2 bytes per value

  Result: ~1.37 bytes per sample (vs 16 bytes uncompressed) → 12x compression
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;label-index-inverted-index&#34;&gt;Label Index (inverted index)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Label to series mapping (for query filtering):
  &amp;#34;service=api&amp;#34;   → [series_1, series_2, series_5, series_99, ...]
  &amp;#34;host=web-01&amp;#34;   → [series_1, series_3, series_7, ...]
  &amp;#34;method=GET&amp;#34;    → [series_1, series_2, series_3, ...]

Query: http_requests{service=&amp;#34;api&amp;#34;, method=&amp;#34;GET&amp;#34;}
  → Intersect posting lists: [series_1, series_2, series_5, ...] ∩ [series_1, series_2, series_3, ...]
  → Result: [series_1, series_2]

Stored as: sorted arrays with bitmap intersection (Roaring bitmaps for cardinality &amp;gt; 10K)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;downsampled-data&#34;&gt;Downsampled Data&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: metrics_5min (5-minute aggregates)
  series_id       | uint64
  timestamp       | int64 (5-min aligned)
  min             | float64
  max             | float64
  sum             | float64
  count           | uint32

Computed by background job: raw data → 5-min aggregates → 1-hour aggregates → 1-day aggregates
Each level retains min/max/sum/count so any aggregation can be reconstructed.
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-this-design&#34;&gt;Why This Design&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Custom time-series format (not SQL)&lt;/strong&gt; — general-purpose relational databases are orders of magnitude less efficient for this workload due to per-row overhead, index maintenance, and the lack of columnar compression.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gorilla compression&lt;/strong&gt; — developed by Facebook for their monitoring system. 12x compression means 10TB logical data fits in ~800GB on disk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inverted label index&lt;/strong&gt; — enables fast multi-label query filtering without scanning all series. Critical for high-cardinality environments.&lt;/li&gt;
&lt;/ul&gt;
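&lt;p&gt;The compression figures above can be made concrete with a simplified sketch of both Gorilla tricks: delta-of-delta timestamps and XOR-ed values. The bit-bucket widths here are illustrative, not the exact table from the original Gorilla scheme:&lt;/p&gt;

```python
import struct

def delta_of_delta_bits(timestamps):
    """Count bits for Gorilla-style timestamp compression (simplified).

    A delta-of-delta of 0 costs a single bit; the other bucket widths
    are illustrative, not the exact table from the Gorilla scheme.
    """
    bits = 64  # first timestamp stored raw
    prev, prev_delta = timestamps[0], None
    for ts in timestamps[1:]:
        delta = ts - prev
        dod = delta if prev_delta is None else delta - prev_delta
        if dod == 0:
            bits += 1          # single '0' bit: the common case
        elif -63 <= dod <= 64:
            bits += 9          # control bits + small signed value
        else:
            bits += 36         # rare large jump
        prev, prev_delta = ts, delta
    return bits

def xor_stream(values):
    """XOR consecutive float64 bit patterns; similar values give mostly-zero XORs."""
    as_bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [a ^ b for a, b in zip(as_bits, as_bits[1:])]

ts = [1708632000 + 10 * i for i in range(100)]  # fixed 10s scrape interval
print(delta_of_delta_bits(ts))   # 171 bits for 100 timestamps vs 6400 raw
print(xor_stream([73.5, 73.5]))  # [0]: identical values XOR to zero
```

&lt;p&gt;With a fixed 10-second interval every delta-of-delta after the first is zero, so 100 timestamps cost 171 bits (about 0.2 bytes each) instead of 800 bytes raw.&lt;/p&gt;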
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Data Sources (10,000 hosts)
  │
  │  Each host runs a metrics agent (collectd/telegraf/custom)
  │  Agent collects: CPU, memory, disk, network, app-specific metrics
  │  Agent pushes every 10 seconds
  │
  ▼
┌─────────────────────────────────────────────────────┐
│                   Ingestion Layer                   │
│                                                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │ Collector 1 │  │ Collector 2 │  │ Collector N │  │
│  │ (stateless) │  │ (stateless) │  │ (stateless) │  │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  │
│         │                │                │         │
│         └────────────────┼────────────────┘         │
│                          │                          │
│                          ▼                          │
│                 ┌──────────────────┐                │
│                 │  Kafka Cluster   │                │
│                 │  (metrics topic, │                │
│                 │   32 partitions) │                │
│                 └────────┬─────────┘                │
└──────────────────────────┼──────────────────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
   ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
   │ Storage Node │ │ Storage Node │ │ Storage Node │
   │ (Ingester)   │ │ (Ingester)   │ │ (Ingester)   │
   │              │ │              │ │              │
   │ In-memory    │ │ In-memory    │ │ In-memory    │
   │ head block   │ │ head block   │ │ head block   │
   │ (last 2 hrs) │ │ (last 2 hrs) │ │ (last 2 hrs) │
   │      │       │ │      │       │ │      │       │
   │      ▼       │ │      ▼       │ │      ▼       │
   │ Persistent   │ │ Persistent   │ │ Persistent   │
   │ blocks (SSD) │ │ blocks (SSD) │ │ blocks (SSD) │
   └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
          │                │                │
          └────────────────┼────────────────┘
                           │
                           ▼
                  ┌──────────────────┐
                  │   Query Layer    │
                  │   (Queriers,     │
                  │   fan-out to     │
                  │   storage nodes) │
                  └────────┬─────────┘
                           │
                 ┌─────────┴─────────┐
                 ▼                   ▼
         ┌───────────────┐    ┌──────────────┐
         │ Alert Manager │    │ Dashboard    │
         │ (evaluates    │    │ (Grafana,    │
         │  rules every  │    │  custom UI)  │
         │  15 seconds)  │    │              │
         └───────────────┘    └──────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ingestion-flow&#34;&gt;Ingestion Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Agent (on host)
  → Collect metrics every 10 seconds
  → Batch into protobuf payload (~500 metrics = ~16KB)
  → Push to any Collector instance (round-robin via load balancer)

Collector (stateless)
  → Validate payload (label format, value range, cardinality check)
  → Route to correct Kafka partition based on hash(series_id) % 32
  → Respond 200 to agent

Kafka → Storage Node (Ingester)
  → Consume from assigned partitions
  → Append samples to in-memory &amp;#34;head block&amp;#34; (current 2-hour window)
  → WAL for crash recovery
  → Every 2 hours: flush head block to disk as compressed block
  → Replication: 2 ingesters consume same partition (RF=2)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;query-flow&#34;&gt;Query Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Dashboard query: avg(cpu_usage{service=&amp;#34;api&amp;#34;}) by (host) [last 1 hour]

Querier:
  1. Parse PromQL expression
  2. Label matching: which series match {service=&amp;#34;api&amp;#34;}?
     → Query label index on storage nodes → get list of series_ids
  3. Time range: last 1 hour → need head block (in-memory) + possibly 1 on-disk block
  4. Fan-out: send sub-queries to storage nodes that hold these series
  5. Each storage node:
     → Decompress relevant samples
     → Apply rate/avg functions locally
     → Return partial result
  6. Querier merges partial results
  7. Return to dashboard
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Metrics Agents (10,000):&lt;/strong&gt; Lightweight daemons on each host. Collect system metrics (CPU, memory, disk, network) and application-specific metrics (HTTP latency, queue depth). Push to collectors. Buffered locally for 1 hour if collectors are unreachable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collectors (10, stateless):&lt;/strong&gt; Receive pushed metrics, validate, route to Kafka. No state — purely a fan-in and routing layer. Horizontal scaling trivial.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka:&lt;/strong&gt; Durability buffer between collectors and storage. Handles burst traffic, provides replay capability. 32 partitions, 3x replication, 24-hour retention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage Nodes / Ingesters (6-10):&lt;/strong&gt; Each owns a subset of time series (by hash). In-memory head block for recent data (fast queries), compressed on-disk blocks for historical data. 2TB SSD per node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Queriers (stateless):&lt;/strong&gt; Parse queries, fan-out to storage nodes, merge results. Auto-scaled based on dashboard load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert Manager:&lt;/strong&gt; Evaluates alert rules every 15 seconds by running PromQL queries. Deduplication, grouping, silencing, routing to notification channels (PagerDuty, Slack, email).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Downsampler (background job):&lt;/strong&gt; Computes 5-min, 1-hour, and 1-day rollups. Runs continuously, processing blocks as they are flushed to disk.&lt;/li&gt;
&lt;/ol&gt;
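&lt;p&gt;The hash-based routing and head-block append described in the ingestion flow can be sketched as follows. The hash function and class shape are illustrative assumptions, not the actual implementation:&lt;/p&gt;

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 32
BLOCK_SECONDS = 2 * 3600  # head block covers a 2-hour window

def series_id(labels):
    """Stable uint64 id from the sorted label set (hash choice is illustrative)."""
    canon = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return int.from_bytes(hashlib.sha256(canon.encode()).digest()[:8], "big")

def partition_for(labels):
    """Collectors route each series to hash(series_id) % 32, per the ingestion flow."""
    return series_id(labels) % NUM_PARTITIONS

class HeadBlock:
    """In-memory append buffer for the current 2-hour window (WAL omitted)."""
    def __init__(self, start_ts):
        self.start_ts = start_ts
        self.samples = defaultdict(list)  # series_id -> [(ts, value)]

    def append(self, labels, ts, value):
        if not (self.start_ts <= ts < self.start_ts + BLOCK_SECONDS):
            raise ValueError("sample belongs to a different block")
        self.samples[series_id(labels)].append((ts, value))

labels = {"__name__": "cpu_usage", "host": "web-01"}
now = 1708632060
head = HeadBlock(start_ts=now - (now % BLOCK_SECONDS))
head.append(labels, now, 73.5)
print(0 <= partition_for(labels) < NUM_PARTITIONS)  # True
```

&lt;p&gt;Because every collector computes the same hash, all samples for one series land on the same Kafka partition and therefore the same pair of replicated ingesters.&lt;/p&gt;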
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-time-series-compression-gorilla-encoding&#34;&gt;Deep Dive 1: Time-Series Compression (Gorilla Encoding)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; At 500K samples/sec, naive storage (16 bytes per sample: 8 bytes timestamp + 8 bytes float64 value) would require 691 GB/day. We need at least 10x compression to keep costs manageable.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Photo Sharing Service (Flickr/Google Photos)</title>
      <link>https://chiraghasija.cc/designs/photo-sharing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/photo-sharing/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload photos (up to 50MB each) with the system generating multiple thumbnail sizes and extracting EXIF metadata automatically&lt;/li&gt;
&lt;li&gt;Organize photos into albums, add tags, and manage sharing permissions (private, shared with specific users, public)&lt;/li&gt;
&lt;li&gt;Search photos by metadata (date, location, camera), user-applied tags, and ML-generated labels (objects, faces, scenes)&lt;/li&gt;
&lt;li&gt;Browse photo feeds (own photos, shared albums, explore/discover) with infinite scroll and fast thumbnail loading&lt;/li&gt;
&lt;li&gt;Share photos and albums via links with configurable permissions (view-only, download allowed, password-protected)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for viewing photos (reads). 99.9% for uploads (briefly queuing uploads is acceptable).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Thumbnail loading &amp;lt; 100ms. Full-resolution photo &amp;lt; 500ms. Upload acknowledgment &amp;lt; 2 seconds (processing happens async).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Photo metadata must be strongly consistent (if you upload and refresh, your photo must appear). Search index can lag by 30 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M users, 5M daily active, 50M photo uploads/day, 500M photos viewed/day, 10PB total stored photos.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines) for original photos. Losing a user&amp;rsquo;s photo is unacceptable.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Uploads:&lt;/strong&gt; 50M photos/day = ~580 uploads/sec (peak 2x = 1,160/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Photo views (thumbnails):&lt;/strong&gt; 500M/day = ~5,800/sec (peak 3x = 17,400/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full-resolution views:&lt;/strong&gt; 50M/day = ~580/sec&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Search queries:&lt;/strong&gt; 5M DAU × 2 searches/day = 10M/day = ~115 QPS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upload bandwidth:&lt;/strong&gt; 580/sec × 5MB avg = &lt;strong&gt;2.9 GB/sec inbound&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Photos (originals):&lt;/strong&gt; 50M/day × 5MB avg × 365 days = &lt;strong&gt;91PB/year&lt;/strong&gt; raw (after deduplication: ~50PB/year effective)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thumbnails (4 sizes per photo):&lt;/strong&gt; 50M/day × 4 × 50KB avg = &lt;strong&gt;10TB/day&lt;/strong&gt; = 3.6PB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata:&lt;/strong&gt; 50M/day × 2KB = 100GB/day = 36TB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total stored:&lt;/strong&gt; ~10PB currently (growing 50PB/year)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;cost-insight&#34;&gt;Cost Insight&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Storage dominates cost: 10PB in S3 = ~$230K/month&lt;/li&gt;
&lt;li&gt;CDN egress for thumbnails: 500M views × 50KB = 25TB/day = ~$37K/month&lt;/li&gt;
&lt;li&gt;ML processing (image labeling): 50M photos × $0.001/photo = $50K/month&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;write-heavy, storage-intensive system&lt;/strong&gt;. The upload pipeline (ingest, process, store, index) is the most complex component. The read path is CDN-dominated (thumbnails served from edge). The most interesting engineering challenge is the image processing pipeline and ML-powered search.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Price Alert System (Google Flights)</title>
      <link>https://chiraghasija.cc/designs/price-alert/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/price-alert/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can set price alerts for a specific route (origin, destination, date range, cabin class) with a target price or &amp;ldquo;notify on any drop&amp;rdquo;&lt;/li&gt;
&lt;li&gt;System continuously monitors flight prices by polling airline APIs / aggregators and detects meaningful price changes&lt;/li&gt;
&lt;li&gt;Send notifications (email, push) when a tracked route&amp;rsquo;s price drops below the user&amp;rsquo;s threshold or changes significantly&lt;/li&gt;
&lt;li&gt;Show price history and trends for a route (price graph over last 30-90 days)&lt;/li&gt;
&lt;li&gt;Support alert management: list, edit, pause, delete alerts; set expiration (auto-delete after travel date)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% for alert creation/management; price monitoring can tolerate brief outages (users won&amp;rsquo;t notice a 5-minute gap in polling)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Alert creation &amp;lt; 200ms. Notification delivery within 15 minutes of a price change (not real-time — flights don&amp;rsquo;t change by the second).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; A user should never miss a significant price drop. False positives (alerting on a stale price) are worse than a 15-minute delay.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M active alerts across 50M users. 500K unique routes being monitored. 10M price checks/hour.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Airline API calls are expensive (rate-limited, sometimes paid). Minimize redundant polling.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;alerts&#34;&gt;Alerts&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50M users, average 2 active alerts each = &lt;strong&gt;100M active alerts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Alert creation/deletion: ~1M/day (relatively low write volume)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;price-monitoring&#34;&gt;Price Monitoring&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K unique routes being monitored&lt;/li&gt;
&lt;li&gt;Each route checked every 30 minutes = 500K × 48 checks/day = &lt;strong&gt;24M price checks/day ≈ 280 checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Popular routes (JFK→LAX) shared by 100K+ alerts; long-tail routes shared by &amp;lt; 10 alerts&lt;/li&gt;
&lt;li&gt;Key optimization: check routes, not individual alerts. 500K route checks serve 100M alerts.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;notifications&#34;&gt;Notifications&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Assume 5% of checks find a meaningful price change = 1.2M price changes/day&lt;/li&gt;
&lt;li&gt;Each change triggers alerts for all subscribed users. Average 200 users per route change = &lt;strong&gt;240M notifications/day ≈ 2,800/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 3-5x average during fare sales = ~10K notifications/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Alert records: 100M × 500 bytes = 50GB&lt;/li&gt;
&lt;li&gt;Price history: 500K routes × 365 days × 48 data points × 50 bytes = &lt;strong&gt;~440GB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Route metadata/cache: 500K × 2KB = 1GB&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;alert-management&#34;&gt;Alert Management&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /alerts
  Body: {
    &amp;#34;origin&amp;#34;: &amp;#34;JFK&amp;#34;,
    &amp;#34;destination&amp;#34;: &amp;#34;LAX&amp;#34;,
    &amp;#34;departure_date_start&amp;#34;: &amp;#34;2026-06-01&amp;#34;,
    &amp;#34;departure_date_end&amp;#34;: &amp;#34;2026-06-07&amp;#34;,
    &amp;#34;return_date_start&amp;#34;: &amp;#34;2026-06-08&amp;#34;,
    &amp;#34;return_date_end&amp;#34;: &amp;#34;2026-06-14&amp;#34;,
    &amp;#34;cabin_class&amp;#34;: &amp;#34;economy&amp;#34;,
    &amp;#34;max_price&amp;#34;: 250,                    // null = notify on any significant drop
    &amp;#34;notification_channels&amp;#34;: [&amp;#34;email&amp;#34;, &amp;#34;push&amp;#34;],
    &amp;#34;passengers&amp;#34;: 1
  }
  Response 201: { &amp;#34;alert_id&amp;#34;: &amp;#34;alt_abc123&amp;#34;, &amp;#34;current_price&amp;#34;: 312, ... }

GET /alerts?user_id=u_123&amp;amp;status=active
  Response: [list of alert objects with current prices]

DELETE /alerts/{alert_id}

PATCH /alerts/{alert_id}
  Body: { &amp;#34;max_price&amp;#34;: 200, &amp;#34;paused&amp;#34;: false }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;price-data&#34;&gt;Price Data&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;GET /routes/{origin}/{destination}/prices
  Query: departure_start, departure_end, cabin_class, lookback_days=30
  Response: {
    &amp;#34;current_price&amp;#34;: 312,
    &amp;#34;price_history&amp;#34;: [
      {&amp;#34;date&amp;#34;: &amp;#34;2026-02-20&amp;#34;, &amp;#34;min_price&amp;#34;: 318, &amp;#34;median_price&amp;#34;: 345},
      {&amp;#34;date&amp;#34;: &amp;#34;2026-02-21&amp;#34;, &amp;#34;min_price&amp;#34;: 312, &amp;#34;median_price&amp;#34;: 340},
      ...
    ],
    &amp;#34;trend&amp;#34;: &amp;#34;declining&amp;#34;,
    &amp;#34;typical_range&amp;#34;: {&amp;#34;low&amp;#34;: 280, &amp;#34;high&amp;#34;: 420}
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Alerts are defined on route+date_range, not specific flights. The system finds the cheapest option matching the criteria.&lt;/li&gt;
&lt;li&gt;Date ranges (not exact dates) because most travelers have flexibility — this dramatically increases alert match rates.&lt;/li&gt;
&lt;/ul&gt;
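&lt;p&gt;Because alerts are defined on routes rather than individual flights, many alerts can share one monitored route. A hypothetical sketch of deriving the shared key, following the &amp;#34;JFK-LAX-2026-06-economy&amp;#34; format used in the data model (the month-bucketing rule is an assumption):&lt;/p&gt;

```python
def route_key(origin, destination, departure_start, cabin_class):
    """Derive the shared monitoring key from an alert.

    Assumption: keys are bucketed by departure month, so alerts with
    nearby date ranges collapse onto the same monitored route.
    """
    year_month = departure_start[:7]  # 'YYYY-MM' from an ISO date string
    return f"{origin}-{destination}-{year_month}-{cabin_class}"

print(route_key("JFK", "LAX", "2026-06-01", "economy"))  # JFK-LAX-2026-06-economy
```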
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;alerts-postgresql&#34;&gt;Alerts (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: alerts
  alert_id        (PK) | uuid
  user_id         (FK) | uuid (indexed)
  route_key             | varchar(50) (indexed) -- &amp;#34;JFK-LAX-2026-06-economy&amp;#34;
  origin                | char(3)
  destination           | char(3)
  departure_start       | date
  departure_end         | date
  return_start          | date (nullable)
  return_end            | date (nullable)
  cabin_class           | enum(&amp;#39;economy&amp;#39;, &amp;#39;premium_economy&amp;#39;, &amp;#39;business&amp;#39;, &amp;#39;first&amp;#39;)
  max_price             | decimal (nullable)
  passengers            | int
  status                | enum(&amp;#39;active&amp;#39;, &amp;#39;paused&amp;#39;, &amp;#39;expired&amp;#39;, &amp;#39;triggered&amp;#39;)
  notification_channels | jsonb
  created_at            | timestamp
  expires_at            | timestamp -- auto-set to departure_start
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;monitored-routes-postgresql--redis-cache&#34;&gt;Monitored Routes (PostgreSQL + Redis cache)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: monitored_routes
  route_key       (PK) | varchar(50)
  origin                | char(3)
  destination           | char(3)
  date_range            | daterange
  cabin_class           | varchar(20)
  alert_count           | int -- denormalized, number of active alerts
  last_checked_at       | timestamp
  last_price            | decimal
  check_interval_min    | int -- 15 for popular, 60 for long-tail
  next_check_at         | timestamp (indexed)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;price-history-timescaledb-or-clickhouse&#34;&gt;Price History (TimescaleDB or ClickHouse)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: price_history
  route_key             | varchar(50)
  checked_at            | timestamp
  min_price             | decimal
  median_price          | decimal
  source                | varchar(50) -- airline, aggregator name
  flight_options        | jsonb -- top 3-5 cheapest options with details
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-these-choices&#34;&gt;Why These Choices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; for alerts and routes: strong consistency, complex queries (find all alerts for a route), transactional status updates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TimescaleDB&lt;/strong&gt; for price history: time-series optimized, automatic partitioning by time, efficient range queries for price graphs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis&lt;/strong&gt; for route check scheduling: sorted set with next_check_at as score — O(log N) to get the next routes to check&lt;/li&gt;
&lt;/ul&gt;
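&lt;p&gt;A minimal in-process stand-in for the Redis-based scheduler (production would use ZADD and ZRANGEBYSCORE with next_check_at as the score; the heap here just models the same ordering):&lt;/p&gt;

```python
import heapq

class RouteScheduler:
    """Stand-in for the Redis sorted set of routes keyed by next_check_at."""
    def __init__(self):
        self._heap = []  # (next_check_at, route_key)

    def schedule(self, route_key, next_check_at):
        heapq.heappush(self._heap, (next_check_at, route_key))

    def due(self, now, limit=100):
        """Pop up to `limit` routes whose next_check_at <= now."""
        out = []
        while self._heap and self._heap[0][0] <= now and len(out) < limit:
            out.append(heapq.heappop(self._heap)[1])
        return out

sched = RouteScheduler()
sched.schedule("JFK-LAX-2026-06-economy", 100)  # popular route, due soon
sched.schedule("BOS-SEA-2026-07-economy", 500)  # long-tail route, due later
print(sched.due(now=200))  # ['JFK-LAX-2026-06-economy']
```

&lt;p&gt;After a worker finishes a check, it re-schedules the route at now + check_interval_min, which is how popular routes end up polled more often than long-tail ones.&lt;/p&gt;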
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;price-monitoring-pipeline&#34;&gt;Price Monitoring Pipeline&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Scheduler (Redis Sorted Set: next_check_at)
  → every second: ZRANGEBYSCORE routes 0 {now} LIMIT 0 100
  → dispatch route checks to worker pool

Price Check Workers (horizontally scaled)
  → For each route:
    1. Call airline APIs / aggregator APIs (Google Flights, Skyscanner, etc.)
    2. Parse response → extract min price for the route criteria
    3. Compare with last_price in DB
    4. If significant change detected:
       → Write new price to price_history
       → Update last_price in monitored_routes
       → Publish PriceChangeEvent to Kafka
    5. Update next_check_at in scheduler

Kafka: PriceChangeEvent
  → Alert Matcher Service
    → Query: SELECT * FROM alerts WHERE route_key = ? AND status = &amp;#39;active&amp;#39;
    → For each matching alert:
       → If new_price &amp;lt;= max_price (or significant drop for &amp;#34;any drop&amp;#34; alerts):
         → Enqueue notification job

Notification Service
  → Read from notification queue
  → Deduplicate (don&amp;#39;t send same user same alert twice in 1 hour)
  → Render email/push template with price details
  → Send via email (SES) / push (FCM/APNs)
  → Update alert status if appropriate (mark as triggered)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Route Scheduler:&lt;/strong&gt; Redis sorted set of routes ordered by next_check_at. A cron-like process pops due routes and dispatches them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Price Check Workers:&lt;/strong&gt; Stateless workers that call external APIs. Auto-scale based on queue depth. Handle rate limits, retries, API key rotation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Route Consolidator:&lt;/strong&gt; Background job that merges overlapping date ranges across alerts into a minimal set of monitored routes. Runs on alert creation/deletion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert Matcher:&lt;/strong&gt; On price change, finds all alerts affected. Uses indexed route_key lookup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Service:&lt;/strong&gt; Manages delivery across channels. Handles batching (don&amp;rsquo;t send 10 alerts at once — batch into a digest), deduplication, and delivery tracking.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Price History Service:&lt;/strong&gt; Serves price graphs and trend analysis. Queries TimescaleDB with pre-computed daily aggregates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API Gateway:&lt;/strong&gt; Handles alert CRUD, price queries, user authentication.&lt;/li&gt;
&lt;/ol&gt;
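&lt;p&gt;The Route Consolidator above is essentially an interval merge per (origin, destination, cabin) group. A sketch under that assumption:&lt;/p&gt;

```python
from collections import defaultdict

def consolidate(alerts):
    """Merge overlapping (origin, destination, cabin) date ranges into routes.

    Each alert is (origin, destination, cabin, start, end) with ISO date
    strings, which compare correctly as strings. Classic interval merge.
    """
    groups = defaultdict(list)
    for origin, dest, cabin, start, end in alerts:
        groups[(origin, dest, cabin)].append((start, end))
    routes = []
    for (origin, dest, cabin), spans in groups.items():
        spans.sort()
        merged = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= merged[-1][1]:  # overlaps the previous span: extend it
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        routes += [(origin, dest, cabin, s, e) for s, e in merged]
    return routes

alerts = [
    ("JFK", "LAX", "economy",  "2026-06-01", "2026-06-07"),
    ("JFK", "LAX", "economy",  "2026-06-03", "2026-06-10"),
    ("JFK", "LAX", "business", "2026-06-01", "2026-06-07"),
]
routes = consolidate(alerts)
print(len(routes))  # 2: economy merged to Jun 1-10, business kept separate
```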
&lt;h3 id=&#34;route-consolidation-example&#34;&gt;Route Consolidation Example&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Alert 1: JFK→LAX, Jun 1-7, economy
Alert 2: JFK→LAX, Jun 3-10, economy
Alert 3: JFK→LAX, Jun 1-7, business

Monitored routes:
  Route A: JFK→LAX, Jun 1-10, economy (covers alerts 1 + 2)
  Route B: JFK→LAX, Jun 1-7, business (covers alert 3)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This reduces 100M alerts to ~500K monitored routes — a 200x reduction in API calls.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Real-Time Live Likes/Reactions System</title>
      <link>https://chiraghasija.cc/designs/live-likes/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/live-likes/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can send reactions (like, love, wow, haha, etc.) on a live stream or post, and all viewers see animated reactions in real-time&lt;/li&gt;
&lt;li&gt;Display an aggregated reaction count that updates live (e.g., &amp;ldquo;12.3K likes&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Each user can react multiple times (unlike a static like button — this is a live engagement feature, like Facebook Live or Instagram Live hearts)&lt;/li&gt;
&lt;li&gt;Reaction animations appear as floating icons on the viewer&amp;rsquo;s screen, reflecting the volume and type of reactions across all viewers&lt;/li&gt;
&lt;li&gt;Provide reaction rate metrics (reactions per second) for streamer dashboards and analytics&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — reactions are engagement features, not mission-critical. Brief delays are acceptable; total loss of reactions degrades the experience but does not break the product.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Reactions should appear on other viewers&amp;rsquo; screens within 1-2 seconds of being sent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is fine. Counts can be approximate. Showing &amp;ldquo;12.3K&amp;rdquo; when the true count is 12,347 is perfectly acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100K concurrent live streams. Top streams: 500K concurrent viewers, up to 50,000 reactions/sec per stream. Global: 5M reactions/sec across all streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Aggregate counts must be durable (total likes on a stream). Individual reaction events do not need permanent storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-traffic&#34;&gt;Write Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;5M reactions/sec globally&lt;/li&gt;
&lt;li&gt;Each reaction event: ~80 bytes (stream_id, user_id, reaction_type, timestamp)&lt;/li&gt;
&lt;li&gt;Inbound data rate: 5M × 80 bytes = &lt;strong&gt;400 MB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;fan-out&#34;&gt;Fan-Out&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each reaction is not individually broadcast — instead, reactions are aggregated into batches&lt;/li&gt;
&lt;li&gt;Every 500ms, each stream produces a batch summary: { &amp;ldquo;like&amp;rdquo;: 142, &amp;ldquo;love&amp;rdquo;: 37, &amp;ldquo;wow&amp;rdquo;: 12 }&lt;/li&gt;
&lt;li&gt;Batch per stream: ~200 bytes&lt;/li&gt;
&lt;li&gt;100K streams × 200 bytes × 2 batches/sec = &lt;strong&gt;40 MB/sec&lt;/strong&gt; fan-out from aggregation layer&lt;/li&gt;
&lt;li&gt;Each batch is pushed to all viewers of that stream&lt;/li&gt;
&lt;li&gt;Top stream: 500K viewers × 200 bytes × 2/sec = &lt;strong&gt;200 MB/sec&lt;/strong&gt; — manageable with edge fan-out&lt;/li&gt;
&lt;/ul&gt;
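&lt;p&gt;The 500ms batching that makes this fan-out tractable can be sketched as a per-stream counter that is flushed once per window (the class shape is illustrative):&lt;/p&gt;

```python
from collections import Counter, defaultdict

class ReactionAggregator:
    """Accumulate raw reactions and emit one batch summary per stream
    per flush window (500 ms in the design above)."""
    def __init__(self):
        self.pending = defaultdict(Counter)  # stream_id -> Counter of types

    def record(self, stream_id, reaction_type):
        self.pending[stream_id][reaction_type] += 1

    def flush(self):
        """Return {stream_id: {type: count}} and reset. These batches, not
        the individual events, are what gets pushed to viewers."""
        batches = {s: dict(c) for s, c in self.pending.items()}
        self.pending.clear()
        return batches

agg = ReactionAggregator()
for _ in range(142):
    agg.record("stream_abc", "like")
for _ in range(37):
    agg.record("stream_abc", "love")
print(agg.flush())  # {'stream_abc': {'like': 142, 'love': 37}}
```

&lt;p&gt;Whether 1 or 50,000 reactions arrive in a window, each viewer receives exactly one ~200-byte message per flush, which is what keeps the top-stream fan-out at a manageable 200 MB/sec.&lt;/p&gt;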
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Aggregate counts per stream: 100K streams × 6 reaction types × 8 bytes = &lt;strong&gt;~5 MB&lt;/strong&gt; in Redis&lt;/li&gt;
&lt;li&gt;Reaction event log (for analytics, retained 7 days): 5M/sec × 80 bytes × 86400 sec × 7 days = &lt;strong&gt;~240 TB/week&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Store sampled (1% sample = 2.4 TB/week) in a data lake for trend analysis, not the full firehose&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
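The 1% sample can be taken deterministically, so the decision needs no coordination and a given user's reactions are consistently in or out of the sample. A minimal sketch (hash-based sampling is an assumption here, not something the design above specifies):

```python
import hashlib

def in_sample(user_id: str) -> bool:
    """Deterministically keep ~1% of events, keyed by user_id, so the
    same user's reactions always land in (or out of) the sample."""
    digest = hashlib.md5(user_id.encode()).digest()
    # Interpret the first 8 bytes as an integer; one bucket out of 100.
    return int.from_bytes(digest[:8], "big") % 100 == 0

# Roughly 1% of a large population falls in the sample:
kept = sum(in_sample(f"user_{i}") for i in range(100_000))
```

Because the decision is a pure function of the event, any sampler instance makes the same choice, so the sampling tier can be scaled out statelessly.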
&lt;h3 id=&#34;count-storage&#34;&gt;Count Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Final aggregate counts per stream (permanent): negligible — one row per stream in the DB&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-endpoints&#34;&gt;REST Endpoints&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Send a reaction
POST /v1/streams/{stream_id}/reactions
  Headers: Authorization: Bearer &amp;lt;token&amp;gt;
  Body: {
    &amp;#34;type&amp;#34;: &amp;#34;like&amp;#34;                     // like, love, wow, haha, sad, angry
  }
  Response 202: { &amp;#34;status&amp;#34;: &amp;#34;accepted&amp;#34; }
  // 202 Accepted — fire-and-forget, no guarantee of individual delivery

// Get current reaction counts
GET /v1/streams/{stream_id}/reactions/counts
  Response 200: {
    &amp;#34;stream_id&amp;#34;: &amp;#34;stream_abc&amp;#34;,
    &amp;#34;counts&amp;#34;: {
      &amp;#34;like&amp;#34;: 1234567,
      &amp;#34;love&amp;#34;: 234567,
      &amp;#34;wow&amp;#34;: 45678,
      &amp;#34;haha&amp;#34;: 12345,
      &amp;#34;sad&amp;#34;: 1234,
      &amp;#34;angry&amp;#34;: 567
    },
    &amp;#34;rate&amp;#34;: {
      &amp;#34;total_per_second&amp;#34;: 4521,
      &amp;#34;by_type&amp;#34;: { &amp;#34;like&amp;#34;: 2100, &amp;#34;love&amp;#34;: 890, ... }
    }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;websocket-protocol&#34;&gt;WebSocket Protocol&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Client → Server: send a reaction (alternative to REST, lower overhead)
{ &amp;#34;type&amp;#34;: &amp;#34;reaction&amp;#34;, &amp;#34;reaction&amp;#34;: &amp;#34;like&amp;#34; }

// Server → Client: batched reaction update (every 500ms)
{
  &amp;#34;type&amp;#34;: &amp;#34;reaction_batch&amp;#34;,
  &amp;#34;window_ms&amp;#34;: 500,
  &amp;#34;counts&amp;#34;: {
    &amp;#34;like&amp;#34;: 142,
    &amp;#34;love&amp;#34;: 37,
    &amp;#34;wow&amp;#34;: 12,
    &amp;#34;haha&amp;#34;: 8,
    &amp;#34;sad&amp;#34;: 2,
    &amp;#34;angry&amp;#34;: 1
  },
  &amp;#34;total&amp;#34;: {
    &amp;#34;like&amp;#34;: 1234567,
    &amp;#34;love&amp;#34;: 234568,
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fire-and-forget writes:&lt;/strong&gt; Reactions return 202, not 201. We do not guarantee every reaction is counted. Losing 0.1% of reactions under extreme load is acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batched delivery:&lt;/strong&gt; Individual reactions are never pushed to clients. Instead, 500ms window summaries are pushed. For a top stream receiving 50K reactions/sec, this replaces 50K individual pushes with 2 batch messages per second, a reduction of four orders of magnitude.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Approximate counts in totals:&lt;/strong&gt; The total count uses eventual consistency. The per-window count (used for animations) is more important for UX.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;in-flight-reaction-aggregation-redis&#34;&gt;In-Flight Reaction Aggregation (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Per-window reaction counts (current 500ms window)
Key: stream:{stream_id}:reactions:current
Type: Hash
Fields: like → 142, love → 37, wow → 12, ...
TTL: 5 seconds (auto-cleanup)

// Running total counts
Key: stream:{stream_id}:reactions:total
Type: Hash
Fields: like → 1234567, love → 234567, ...
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;permanent-counts-postgresql--updated-periodically&#34;&gt;Permanent Counts (PostgreSQL — updated periodically)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: stream_reaction_counts
  stream_id      (PK) | varchar(20)
  like_count          | bigint
  love_count          | bigint
  wow_count           | bigint
  haha_count          | bigint
  sad_count           | bigint
  angry_count         | bigint
  updated_at          | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;rate-limit-state-redis&#34;&gt;Rate Limit State (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Per-user reaction rate limit (max 10 reactions per second per user)
Key: reaction_rl:{stream_id}:{user_id}
Type: String (counter)
TTL: 1 second
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-this-split&#34;&gt;Why This Split?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Redis for real-time:&lt;/strong&gt; All hot-path data lives in Redis. Hash HINCRBY is O(1) and handles millions of increments per second. The aggregation window resets every 500ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL for permanence:&lt;/strong&gt; The final count after a stream ends is checkpointed to PostgreSQL. This happens once per stream, not millions of times.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No per-event storage:&lt;/strong&gt; We do NOT store individual reaction events in a database. 5M events/sec is too expensive to persist individually. Instead, we aggregate in Redis and periodically flush summaries.&lt;/li&gt;
&lt;/ul&gt;
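The write-path split above can be sketched end to end. Plain dicts stand in for the Redis hashes and the 1-second rate-limit counter (HINCRBY / INCR with TTL); this is illustrative, not a production client:

```python
import time
from collections import defaultdict

class ReactionIngest:
    """In-memory stand-in for the Redis write path: rate-limit check,
    then increment the current-window and running-total hashes."""
    MAX_PER_SEC = 10  # per-user reaction rate limit

    def __init__(self):
        self.current = defaultdict(lambda: defaultdict(int))  # stream -> type -> count
        self.total = defaultdict(lambda: defaultdict(int))
        self.rl = defaultdict(int)  # (stream, user, second-bucket) -> count

    def react(self, stream_id, user_id, rtype, now=None):
        now = int(now if now is not None else time.time())
        key = (stream_id, user_id, now)        # 1-second bucket, like a TTL=1s counter
        self.rl[key] += 1
        if self.rl[key] > self.MAX_PER_SEC:
            return False                       # silently drop excess, no error
        self.current[stream_id][rtype] += 1    # HINCRBY ...:current
        self.total[stream_id][rtype] += 1      # HINCRBY ...:total
        return True

ingest = ReactionIngest()
accepted = sum(ingest.react("s1", "u1", "like", now=0) for _ in range(15))
# Only 10 of the 15 taps in the same second are counted.
```

The rate limiter and the aggregation are both O(1) per reaction, which is what keeps the hot path cheap at millions of events per second.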
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;reaction-write-path&#34;&gt;Reaction Write Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client taps &amp;#34;like&amp;#34; button
  → WebSocket Connection Server
    → 1. Rate limit check:
         INCR reaction_rl:{stream_id}:{user_id}
         If &amp;gt; 10 → silently drop (don&amp;#39;t error, just ignore excess)
    → 2. Forward to Reaction Aggregator:
         HINCRBY stream:{stream_id}:reactions:current like 1
         HINCRBY stream:{stream_id}:reactions:total like 1
    → 3. Done. No ACK needed to client.
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;reaction-broadcast-path&#34;&gt;Reaction Broadcast Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Reaction Broadcaster Service (per-stream timer):
  Every 500ms per active stream:
    → 1. HGETALL stream:{stream_id}:reactions:current
         Result: { &amp;#34;like&amp;#34;: 142, &amp;#34;love&amp;#34;: 37, &amp;#34;wow&amp;#34;: 12 }
    → 2. DEL stream:{stream_id}:reactions:current (reset window)
         (HGETALL + DEL is not atomic on its own; use a Lua script, or
          RENAME the key and read the copy. GETDEL applies only to strings, not hashes.)
    → 3. If total reactions in window &amp;gt; 0:
         Publish batch to fan-out layer:
         PUBLISH stream_reactions:{stream_id} {batch_json}
    → 4. If total reactions = 0 → skip (no broadcast, save bandwidth)

Fan-Out to Viewers:
  WebSocket Servers subscribe to Redis Pub/Sub: stream_reactions:{stream_id}
    → On batch received:
      → Push batch to all local WebSocket connections for that stream
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;WebSocket Connection Servers (100+ instances):&lt;/strong&gt; Handle client connections. Receive reactions via WebSocket. Forward to Redis for aggregation. Push batched updates to clients.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster:&lt;/strong&gt; Reaction aggregation (HINCRBY). Rate limiting. Pub/Sub for batch distribution. Stores running totals.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reaction Broadcaster Service:&lt;/strong&gt; Timer-based service, one logical timer per active stream. Reads current window, resets counter, publishes batch. Horizontally scaled — each instance owns a shard of streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Count Checkpoint Service:&lt;/strong&gt; Periodically (every 60 seconds) writes Redis totals to PostgreSQL. On stream end, performs a final checkpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics Sampler:&lt;/strong&gt; Taps into the reaction stream at 1% sample rate. Writes to Kafka for downstream analytics (trend detection, engagement scoring).&lt;/li&gt;
&lt;/ol&gt;
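The broadcaster's read-and-reset step, with empty windows skipped, might look like the following. A dict stands in for Redis; in real Redis the read + delete pair would be a Lua script so no increments are lost between the two commands:

```python
def drain_window(windows: dict, stream_id: str):
    """Atomically take and reset the current 500ms window for a stream.
    Returns None when there were no reactions (skip the broadcast)."""
    counts = windows.pop(stream_id, None)  # read + delete in one step
    if not counts or sum(counts.values()) == 0:
        return None                        # nothing to publish, save bandwidth
    return {"type": "reaction_batch", "window_ms": 500, "counts": counts}

windows = {"s1": {"like": 142, "love": 37, "wow": 12}, "s2": {}}
batch = drain_window(windows, "s1")   # batch with 191 reactions across 3 types
empty = drain_window(windows, "s2")   # None: no broadcast for a quiet stream
```

Skipping zero windows matters at this scale: most of the 100K streams are quiet most of the time, so the broadcaster only publishes for streams with activity.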
&lt;h3 id=&#34;architecture-diagram&#34;&gt;Architecture Diagram&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Clients (viewers)
  ↕ WebSocket
WebSocket Servers (100+)
  → HINCRBY → Redis Cluster
               ├── Current window counters
               ├── Running totals
               └── Pub/Sub channels

Reaction Broadcaster (sharded by stream)
  → Every 500ms: read + reset current window
  → PUBLISH batch to Redis Pub/Sub
  → WebSocket Servers receive and push to clients

Count Checkpoint Service
  → Every 60s: Redis totals → PostgreSQL

Analytics Sampler → Kafka → Data Warehouse
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;animation-rendering-client-side&#34;&gt;Animation Rendering (Client-Side)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;On receiving batch { &amp;#34;like&amp;#34;: 142, &amp;#34;love&amp;#34;: 37, &amp;#34;wow&amp;#34;: 12 }:
  1. Total reactions in window: 191
  2. Scale animation intensity:
     &amp;lt; 10 reactions → sparse floating icons
     10-100 → moderate stream
     100-1000 → dense stream with size variation
     &amp;gt; 1000 → &amp;#34;burst&amp;#34; mode with explosion effect
  3. For each reaction type, spawn proportional number of animated icons:
     like: 142/191 = 74% → 74% of animation slots are hearts
     love: 37/191 = 19% → 19% are love icons
  4. Randomize: position (x-axis), float speed, size, opacity
  5. Animate using requestAnimationFrame, recycle DOM nodes from a pool
  6. Cap at 60 visible animations simultaneously (performance)
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-high-write-throughput--handling-50k-reactionssec-per-stream&#34;&gt;Deep Dive 1: High Write Throughput — Handling 50K Reactions/Sec Per Stream&lt;/h3&gt;
&lt;p&gt;A top stream receives 50,000 reactions per second. Naively, each reaction is a separate Redis HINCRBY command. At 50K/sec for one stream, this is manageable for Redis (it handles 100K+ ops/sec). But across 100K streams with 5M total reactions/sec, we need to be smarter.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Real-Time Page Viewer Count System</title>
      <link>https://chiraghasija.cc/designs/page-viewers/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/page-viewers/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Display a real-time count of users currently viewing a specific page (e.g., &amp;ldquo;47 people viewing this page&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Count updates within 5 seconds of a viewer joining or leaving&lt;/li&gt;
&lt;li&gt;Track unique viewers — refreshing the page or opening multiple tabs from the same user should count as 1 viewer&lt;/li&gt;
&lt;li&gt;Support any page on the platform (product pages, articles, dashboards) — millions of distinct page IDs&lt;/li&gt;
&lt;li&gt;Provide an API to query current viewer count for any page (for analytics dashboards)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — viewer counts are informational, not business-critical. Showing a stale count briefly is acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Count updates pushed to viewers within 5 seconds. API queries return in &amp;lt; 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Approximate counts are fine. Off by 2-3 viewers is acceptable. Off by 50% is not.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M concurrent users across 5M distinct pages. Average page: 2 viewers. Hot pages (trending product, breaking news): 500K+ viewers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Viewer counts are ephemeral — no need to persist historical real-time counts. However, log peak counts for analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;connections&#34;&gt;Connections&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M concurrent WebSocket connections&lt;/li&gt;
&lt;li&gt;Each connection: ~10 KB memory → &lt;strong&gt;100 GB&lt;/strong&gt; total connection memory&lt;/li&gt;
&lt;li&gt;At 200K connections per server → &lt;strong&gt;50 WebSocket servers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;heartbeat-traffic&#34;&gt;Heartbeat Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each client sends a heartbeat every 30 seconds&lt;/li&gt;
&lt;li&gt;10M clients / 30 sec = &lt;strong&gt;333K heartbeats/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each heartbeat: ~100 bytes → 33 MB/sec inbound&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;count-update-traffic&#34;&gt;Count Update Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;When a viewer joins/leaves, update the count for that page&lt;/li&gt;
&lt;li&gt;Churn rate: ~5% of viewers change pages per minute → 500K join/leave events per minute → &lt;strong&gt;8,300 events/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each event triggers a count update pushed to all viewers of that page&lt;/li&gt;
&lt;li&gt;Average page: 2 viewers, hot pages: thousands&lt;/li&gt;
&lt;li&gt;Estimated push volume: ~50K count update messages/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active page viewer sets: 5M pages × ~100 bytes per entry × avg 2 viewers = &lt;strong&gt;1 GB&lt;/strong&gt; in Redis&lt;/li&gt;
&lt;li&gt;Hot pages with 500K viewers: a single sorted set with 500K members = ~50 MB. Manageable.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-endpoints&#34;&gt;REST Endpoints&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Get current viewer count for a page
GET /v1/pages/{page_id}/viewers/count
  Response 200: {
    &amp;#34;page_id&amp;#34;: &amp;#34;product_12345&amp;#34;,
    &amp;#34;viewer_count&amp;#34;: 47,
    &amp;#34;updated_at&amp;#34;: &amp;#34;2026-02-22T10:00:05Z&amp;#34;
  }

// Get viewer counts for multiple pages (batch)
POST /v1/pages/viewers/count
  Body: { &amp;#34;page_ids&amp;#34;: [&amp;#34;product_12345&amp;#34;, &amp;#34;article_678&amp;#34;, ...] }
  Response 200: {
    &amp;#34;counts&amp;#34;: {
      &amp;#34;product_12345&amp;#34;: 47,
      &amp;#34;article_678&amp;#34;: 1203
    }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;websocket-protocol&#34;&gt;WebSocket Protocol&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Client connects and subscribes to a page
WS /v1/pages/{page_id}/viewers

// Client → Server: heartbeat (every 30 seconds)
{ &amp;#34;type&amp;#34;: &amp;#34;heartbeat&amp;#34; }

// Server → Client: viewer count update
{ &amp;#34;type&amp;#34;: &amp;#34;viewer_count&amp;#34;, &amp;#34;count&amp;#34;: 48 }

// Client navigates to a different page
// → Close old WebSocket, open new one (or send subscribe/unsubscribe messages)
{ &amp;#34;type&amp;#34;: &amp;#34;subscribe&amp;#34;, &amp;#34;page_id&amp;#34;: &amp;#34;new_page_456&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;unsubscribe&amp;#34;, &amp;#34;page_id&amp;#34;: &amp;#34;product_12345&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use WebSocket for real-time push rather than polling (polling 10M clients every 5 seconds = 2M req/sec just for counts)&lt;/li&gt;
&lt;li&gt;Multiplexed WebSocket: single connection per client, subscribe/unsubscribe to pages as they navigate. Avoids reconnection overhead.&lt;/li&gt;
&lt;li&gt;Heartbeat is mandatory. If no heartbeat received in 90 seconds (3 missed), the server considers the viewer gone.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;active-viewer-set-redis&#34;&gt;Active Viewer Set (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Set of active viewers per page
Key: page:{page_id}:viewers
Type: Sorted Set
Member: viewer_id (user_id or session_id for anonymous users)
Score: last_heartbeat_timestamp

// Current count (cached, updated on join/leave)
Key: page:{page_id}:count
Type: String (integer)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;connection-registry-redis&#34;&gt;Connection Registry (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Which pages a connection is viewing
Key: conn:{connection_id}
Type: Hash
Fields:
  viewer_id      | varchar
  page_id        | varchar
  server_id      | varchar
  connected_at   | timestamp
  last_heartbeat | timestamp
TTL: 120 seconds (auto-cleanup if server crashes)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;unique-viewer-tracking-redis-hyperloglog&#34;&gt;Unique Viewer Tracking (Redis HyperLogLog)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Approximate unique viewers (for analytics, not real-time count)
Key: page:{page_id}:unique_viewers:{date}
Type: HyperLogLog
Operation: PFADD on each new viewer
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-redis&#34;&gt;Why Redis?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;All data is ephemeral (viewer presence is transient)&lt;/li&gt;
&lt;li&gt;Sub-millisecond reads/writes for count lookups and heartbeat updates&lt;/li&gt;
&lt;li&gt;Sorted sets enable efficient expiry scanning (remove viewers with old heartbeats)&lt;/li&gt;
&lt;li&gt;Pub/Sub for cross-server count change notifications&lt;/li&gt;
&lt;li&gt;HyperLogLog for memory-efficient unique counting (12 KB per counter regardless of cardinality)&lt;/li&gt;
&lt;/ul&gt;
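The presence pattern (dedupe on join, heartbeat refresh, stale reaping) can be modeled with a per-page map of viewer to last-heartbeat timestamp, standing in for the sorted set (ZADD score = timestamp, ZRANGEBYSCORE for reaping). A minimal sketch:

```python
STALE_AFTER = 90  # seconds without a heartbeat (3 missed intervals)

class PagePresence:
    def __init__(self):
        self.viewers = {}  # viewer_id -> last_heartbeat (the ZADD score)

    def join_or_heartbeat(self, viewer_id, now):
        """Returns True only for a genuinely new viewer (count changed).
        A refresh or a second tab just refreshes the timestamp."""
        is_new = viewer_id not in self.viewers  # the ZSCORE dedup check
        self.viewers[viewer_id] = now           # ZADD with score = now
        return is_new

    def reap(self, now):
        """Drop viewers whose heartbeat is older than 90s (ZRANGEBYSCORE)."""
        stale = [v for v, ts in self.viewers.items() if ts < now - STALE_AFTER]
        for v in stale:
            del self.viewers[v]
        return len(stale)

    @property
    def count(self):
        return len(self.viewers)

page = PagePresence()
page.join_or_heartbeat("alice", now=0)
page.join_or_heartbeat("alice", now=30)  # second tab: count stays at 1
page.join_or_heartbeat("bob", now=0)
page.reap(now=120)                       # bob missed 3 heartbeats, removed
```

Keying by viewer_id rather than connection_id is what makes multiple tabs from the same user count as one viewer.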
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;viewer-join-flow&#34;&gt;Viewer Join Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client opens page
  → WebSocket Connection Server
    → 1. Authenticate (extract user_id or generate session_id)
    → 2. Deduplicate: Check if this viewer_id already exists in page&amp;#39;s viewer set
         ZSCORE page:{page_id}:viewers {viewer_id}
         If exists → update heartbeat timestamp, do NOT increment count
         If new → ZADD page:{page_id}:viewers {now} {viewer_id}
                   INCR page:{page_id}:count
    → 3. Register connection: HSET conn:{connection_id} ...
    → 4. Publish count change: PUBLISH page_count:{page_id} {new_count}
    → 5. Return current count to client
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;heartbeat-flow&#34;&gt;Heartbeat Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client sends heartbeat every 30 seconds
  → WS Server receives heartbeat
    → ZADD page:{page_id}:viewers {now} {viewer_id}  (update score = timestamp)
    → EXPIRE conn:{connection_id} 120                  (refresh TTL)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;viewer-leave-flow&#34;&gt;Viewer Leave Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Case 1: Client navigates away (graceful close)
  → Client sends unsubscribe or closes WebSocket
  → WS Server:
    → ZREM page:{page_id}:viewers {viewer_id}
    → DECR page:{page_id}:count
    → PUBLISH page_count:{page_id} {new_count}
    → DEL conn:{connection_id}

Case 2: Client crashes / loses network (ungraceful)
  → No more heartbeats received
  → Heartbeat Reaper (background job):
    → Runs every 30 seconds
    → ZRANGEBYSCORE page:{page_id}:viewers -inf (now - 90)
    → Remove stale entries, decrement count
    → Publish updated count
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;count-broadcast-flow&#34;&gt;Count Broadcast Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Redis Pub/Sub channel: page_count:{page_id}
  → When count changes, PUBLISH to this channel
  → Each WS Server subscribes to channels for pages with local viewers
  → On receiving count update:
    → Push { &amp;#34;type&amp;#34;: &amp;#34;viewer_count&amp;#34;, &amp;#34;count&amp;#34;: N } to all local WebSocket
      connections viewing that page
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;WebSocket Connection Servers (50 servers):&lt;/strong&gt; Maintain persistent connections. Handle heartbeats, join/leave events. Subscribe to Redis Pub/Sub for count updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster (3 shards, replicated):&lt;/strong&gt; Stores viewer sets, counts, connection registry. Provides Pub/Sub for cross-server notifications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heartbeat Reaper Service:&lt;/strong&gt; Background job scanning for stale viewers. Runs on a schedule (every 30 seconds). Scans sorted sets for expired entries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Count Query API:&lt;/strong&gt; Stateless HTTP service for non-real-time queries. Reads directly from Redis. Used by analytics dashboards.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics Pipeline:&lt;/strong&gt; Kafka consumer logs peak counts, unique viewer counts (HyperLogLog reads) to a data warehouse for historical analysis.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;architecture-diagram&#34;&gt;Architecture Diagram&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Clients (10M)
  → Load Balancer (sticky by viewer_id hash)
    → WebSocket Servers (50 instances)
      → Redis Cluster
        ├── Viewer Sets (Sorted Sets)
        ├── Counts (Strings)
        ├── Connection Registry (Hashes)
        └── Pub/Sub (count change notifications)

Heartbeat Reaper (3 instances, leader-elected)
  → Scans Redis sorted sets for stale entries
  → Decrements counts and publishes updates

Analytics Pipeline
  → Periodically snapshots counts to Kafka → Data Warehouse
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-presence-detection--heartbeat-vs-websocket-state&#34;&gt;Deep Dive 1: Presence Detection — Heartbeat vs. WebSocket State&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Option A: Rely purely on WebSocket connection state&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Real-Time Stock Price System</title>
      <link>https://chiraghasija.cc/designs/stock-prices/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/stock-prices/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Ingest real-time market data feeds from multiple exchanges (NYSE, NASDAQ, etc.) and normalize into a unified price stream&lt;/li&gt;
&lt;li&gt;Fan out live price updates to millions of subscribers via WebSocket with sub-second latency&lt;/li&gt;
&lt;li&gt;Maintain and serve order book data (top-of-book and depth) for each symbol&lt;/li&gt;
&lt;li&gt;Compute and serve the National Best Bid and Offer (NBBO) by aggregating prices across exchanges&lt;/li&gt;
&lt;li&gt;Store and serve historical price data (OHLCV — Open, High, Low, Close, Volume) at multiple time resolutions and support price-change alerting&lt;/li&gt;
&lt;/ol&gt;
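For requirement 4, the NBBO aggregation itself is straightforward: the best bid is the highest bid and the best offer the lowest ask across venues. A minimal sketch with made-up quotes:

```python
def compute_nbbo(quotes):
    """quotes: list of (exchange, bid, ask) tuples for one symbol.
    Returns (best_bid, bid_venue, best_ask, ask_venue)."""
    best_bid = max(quotes, key=lambda q: q[1])  # highest bid wins
    best_ask = min(quotes, key=lambda q: q[2])  # lowest ask wins
    return best_bid[1], best_bid[0], best_ask[2], best_ask[0]

quotes = [
    ("NYSE",   189.97, 190.02),
    ("NASDAQ", 189.98, 190.03),
    ("ARCA",   189.96, 190.01),
]
bid, bid_venue, ask, ask_venue = compute_nbbo(quotes)
# NBBO: 189.98 (NASDAQ) x 190.01 (ARCA)
```

The hard part in practice is not this max/min but keeping per-exchange top-of-book state fresh within the 50ms accuracy budget, which is why the computation lives in-memory next to the feed handlers.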
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% during market hours (9:30 AM - 4:00 PM ET). Pre-market and after-hours: 99.9%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Tick-to-display &amp;lt; 100ms for retail users. Tick-to-internal-processing &amp;lt; 10ms. (Not targeting HFT microsecond latency — this is for retail/fintech platforms.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Price data must be ordered correctly (no out-of-order ticks shown to users). NBBO must be accurate within 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10,000 symbols, 1M price updates/sec during peak market hours, 5M concurrent WebSocket subscribers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; All tick data stored for regulatory compliance (7 years). Historical OHLCV data stored indefinitely.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inbound market data:&lt;/strong&gt; 1M ticks/sec during peak (10,000 symbols × 100 updates/sec for active symbols)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WebSocket subscribers:&lt;/strong&gt; 5M concurrent connections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fan-out:&lt;/strong&gt; Each tick for a popular symbol (AAPL, TSLA) may go to 500K subscribers → &lt;strong&gt;500M messages/sec&lt;/strong&gt; outbound at peak&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Historical data queries:&lt;/strong&gt; 10,000 QPS (chart data requests for different symbols/timeframes)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tick data (raw):&lt;/strong&gt; 1M ticks/sec × 64 bytes × 6.5 hours/day × 252 trading days = &lt;strong&gt;~377 TB/year&lt;/strong&gt; (≈1.5 TB per trading day, before compression)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OHLCV 1-min candles:&lt;/strong&gt; 10,000 symbols × 390 candles/day × 252 days × 48 bytes = &lt;strong&gt;47GB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OHLCV 1-day candles:&lt;/strong&gt; 10,000 symbols × 252 days × 48 bytes = &lt;strong&gt;121MB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order book snapshots (top 10 levels):&lt;/strong&gt; 10,000 symbols × 10 updates/sec × 200 bytes = 20MB/sec (stored in memory, snapshots persisted hourly)&lt;/li&gt;
&lt;/ul&gt;
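Since the raw-tick figure dominates the storage budget, it is worth sanity-checking the arithmetic:

```python
# Raw ticks: 1M ticks/sec x 64 bytes over a 6.5-hour session, 252 trading days/year
ticks_per_sec, tick_bytes = 1_000_000, 64
seconds_per_day = int(6.5 * 3600)                       # 23,400 trading seconds
per_day = ticks_per_sec * tick_bytes * seconds_per_day  # ~1.5 TB per trading day
per_year = per_day * 252                                # ~377 TB/year, pre-compression

# 1-min candles: 390 candles per symbol per trading day (6.5h x 60)
candles_per_year = 10_000 * 390 * 252 * 48              # ~47 GB/year
```

The two-orders-of-magnitude gap between raw ticks and candles is the argument for tiering: ticks go to cheap compressed cold storage for compliance, candles stay hot for chart queries.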
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;fan-out-heavy system&lt;/strong&gt;. Ingesting 1M ticks/sec is manageable, but fanning out each tick to potentially 500K subscribers creates a 500:1 amplification. The WebSocket fan-out layer is the hardest scaling challenge. Historical data is a classic time-series storage problem.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Stock Broker System (Zerodha/Robinhood)</title>
      <link>https://chiraghasija.cc/designs/stock-broker/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/stock-broker/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Order placement&lt;/strong&gt; — users can place market, limit, and stop-loss orders for equities, F&amp;amp;O, and ETFs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time portfolio tracking&lt;/strong&gt; — show holdings, positions, P&amp;amp;L (realized + unrealized), and margin utilization in real time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Market data feed&lt;/strong&gt; — stream live prices (LTP, bid/ask, depth) to clients via WebSocket&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order book management&lt;/strong&gt; — maintain per-user order book with status lifecycle (open → pending → executed → settled / rejected / cancelled)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Margin &amp;amp; risk management&lt;/strong&gt; — pre-trade risk checks: sufficient margin, position limits, circuit breaker enforcement, and exposure calculations&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% during market hours (9:15 AM – 3:30 PM IST). Every minute of downtime = real money lost for users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 10ms for order placement (broker-side), &amp;lt; 50ms for end-to-end order acknowledgement. Market data tick-to-display &amp;lt; 200ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Orders MUST be strongly consistent — a placed order must never be lost. Portfolio and P&amp;amp;L can be eventually consistent (~1s lag acceptable).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M registered users, 1M concurrent during market hours, 50K orders/sec peak (market open spike at 9:15 AM), 500K+ price ticks/sec from exchange feed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero tolerance for order loss. Every order must be audit-trailed with nanosecond-precision timestamps for regulatory compliance (SEBI/SEC).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 50K orders/sec peak (market open), ~5K orders/sec steady state during market hours&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Market data:&lt;/strong&gt; 5,000 instruments × 100 ticks/sec = 500K ticks/sec ingested from exchange&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WebSocket connections:&lt;/strong&gt; 1M concurrent users, each subscribed to ~10 instruments on average&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fan-out:&lt;/strong&gt; each tick fans out to subscribed users. Forwarding every raw tick would approach ~1B messages/sec (avg 2,000 subscribers per instrument × 500K ticks/sec); with ticks conflated per client to roughly one update every 2 seconds per subscribed instrument, peak outbound is ~5M messages/sec via WebSocket&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 50M orders/day × 500 bytes = &lt;strong&gt;25 GB/day&lt;/strong&gt; → 6 TB/year (must retain 7+ years for regulatory compliance)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trade history:&lt;/strong&gt; ~20M trades/day × 300 bytes = 6 GB/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Market data (OHLCV candles):&lt;/strong&gt; 5,000 instruments × 375 min × 100 bytes = 188 MB/day for 1-min candles. Tick-level data: 500K ticks/sec × 6.25 hrs × 50 bytes = &lt;strong&gt;~560 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User portfolios:&lt;/strong&gt; 10M users × 1 KB average = 10 GB (fits in memory for hot data)&lt;/li&gt;
&lt;/ul&gt;
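Zero order loss plus client retries implies logging to durable storage before acknowledging, keyed by an idempotency token. A minimal sketch: the in-memory list stands in for a real WAL, and `client_order_id` is an assumed client-supplied key, not part of any specific broker API:

```python
class OrderIntake:
    """Accept orders idempotently: persist to a WAL before acking, and
    return the same ack for a retried client_order_id."""
    def __init__(self):
        self.wal = []   # stand-in for a durable append-only log
        self.seen = {}  # client_order_id -> assigned order_id

    def place(self, client_order_id: str, order: dict) -> str:
        if client_order_id in self.seen:       # retry: do not re-append
            return self.seen[client_order_id]
        order_id = f"ord_{len(self.wal) + 1}"
        self.wal.append({"order_id": order_id, **order})  # durable BEFORE ack
        self.seen[client_order_id] = order_id
        return order_id

intake = OrderIntake()
a = intake.place("c1", {"symbol": "RELIANCE", "qty": 10, "side": "BUY"})
b = intake.place("c1", {"symbol": "RELIANCE", "qty": 10, "side": "BUY"})  # network retry
# a == b, and the WAL holds exactly one entry
```

Acking only after the append means a crash can duplicate a retry but never lose an acknowledged order, which matches the stated durability requirement.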
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;latency-critical, correctness-critical&lt;/strong&gt; system with a massive spike pattern. 9:15 AM (market open) sees 10–20× traffic surge within seconds. The order pipeline must be idempotent, durable (WAL + queue), and pre-warmed. Market data is a classic pub-sub fan-out problem — 500K ticks/sec → millions of WebSocket pushes.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Stock Price Alert System</title>
      <link>https://chiraghasija.cc/designs/stock-alert-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/stock-alert-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Create price alerts&lt;/strong&gt; — users set alerts on instruments with conditions: price above/below a threshold, or percentage change from a reference price (e.g., &amp;ldquo;alert me if RELIANCE drops 5% from ₹2500&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time matching&lt;/strong&gt; — alerts must be evaluated against every incoming price tick and trigger within seconds of the condition being met&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-channel notification&lt;/strong&gt; — triggered alerts are delivered via push notification, SMS, and/or email based on user preference&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert lifecycle management&lt;/strong&gt; — alerts have states: active → triggered → (snoozed | expired | deleted). Users can snooze (re-arm after cooldown), edit, or delete alerts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert dashboard&lt;/strong&gt; — users can list, filter, and manage all their alerts (active, triggered history, expired)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — missing an alert during a market crash is unacceptable. Degraded mode: delay delivery, but never lose a triggered alert.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5 seconds from price crossing the threshold to notification delivery (push). &amp;lt; 30 seconds for SMS/email.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M total alerts across 10M users. 500K price ticks/sec ingested from market data feed. Average 10 alerts per user.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; During volatile markets, up to 1M alerts could trigger within a 1-minute window (e.g., broad market crash).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Triggered alerts must be persisted before notification is attempted. At-least-once delivery guarantee for notifications.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Price ticks ingested:&lt;/strong&gt; 5,000 instruments × 100 ticks/sec = &lt;strong&gt;500K ticks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert evaluations:&lt;/strong&gt; Each tick must be checked against all active alerts for that instrument. Average 20K active alerts per instrument (100M alerts / 5,000 instruments) → &lt;strong&gt;500K × 20K = 10B comparisons/sec&lt;/strong&gt; (naive approach — this is why efficient matching is critical)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert triggers:&lt;/strong&gt; Normal day: ~500K alerts trigger/day. Volatile day: ~5M alerts trigger/day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert creation:&lt;/strong&gt; ~1M new alerts/day, ~500K deletions/day&lt;/li&gt;
&lt;/ul&gt;
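&lt;p&gt;The 10B comparisons/sec above is the naive cost; a per-instrument index sorted by trigger price brings each tick down to O(log N + K). A minimal sketch of a &amp;ldquo;price reaches X&amp;rdquo; index using Python&amp;rsquo;s &lt;code&gt;bisect&lt;/code&gt; (class and method names are illustrative, not from the design):&lt;/p&gt;

```python
import bisect

class InstrumentAlertIndex:
    """Per-instrument index of "alert when price reaches X" alerts,
    kept sorted by trigger price. A tick at price P fires every alert
    whose threshold is at or below P: one bisect finds the cut point
    in O(log N), and the K fired alerts are sliced off the front.
    "Price drops below X" alerts would use a mirrored index."""

    def __init__(self):
        self.thresholds = []  # sorted trigger prices
        self.alert_ids = []   # alert_ids[i] pairs with thresholds[i]

    def add(self, alert_id, threshold):
        i = bisect.bisect_left(self.thresholds, threshold)
        self.thresholds.insert(i, threshold)
        self.alert_ids.insert(i, alert_id)

    def on_tick(self, price):
        cut = bisect.bisect_right(self.thresholds, price)  # fired count
        fired = self.alert_ids[:cut]
        del self.thresholds[:cut], self.alert_ids[:cut]
        return fired

idx = InstrumentAlertIndex()
idx.add("a1", 2500.0)
idx.add("a2", 2400.0)
idx.add("a3", 2600.0)
print(idx.on_tick(2510.0))  # → ['a2', 'a1']
```

&lt;p&gt;The same query maps directly onto a Redis sorted set per instrument (&lt;code&gt;ZRANGEBYSCORE&lt;/code&gt; plus &lt;code&gt;ZREM&lt;/code&gt;), which is one way to share the index across matcher workers.&lt;/p&gt;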
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Alert records:&lt;/strong&gt; 100M alerts × 200 bytes = &lt;strong&gt;20 GB&lt;/strong&gt; — fits in memory for hot path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert history:&lt;/strong&gt; 500K triggers/day × 300 bytes = 150 MB/day → ~55 GB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification logs:&lt;/strong&gt; 1.5M notifications/day (3 channels avg per trigger) × 200 bytes = 300 MB/day&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The core challenge is &lt;strong&gt;efficient matching&lt;/strong&gt;: 500K ticks/sec against 100M alerts. Naive O(alerts_per_instrument) scan per tick is 10B comparisons/sec — too expensive. We need a data structure that answers &amp;ldquo;which alerts are triggered by price X?&amp;rdquo; in O(log N + K) where K is the number of triggered alerts. &lt;strong&gt;Sorted sets&lt;/strong&gt; (by trigger price) or &lt;strong&gt;interval trees&lt;/strong&gt; make this possible.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Surge Pricing System (Uber)</title>
      <link>https://chiraghasija.cc/designs/surge-pricing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/surge-pricing/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Compute a dynamic pricing multiplier for each geographic zone based on real-time supply (available drivers) and demand (ride requests)&lt;/li&gt;
&lt;li&gt;Divide the service area into geospatial zones and compute independent surge multipliers per zone&lt;/li&gt;
&lt;li&gt;Display the current surge multiplier to riders before they confirm a ride, with a price estimate&lt;/li&gt;
&lt;li&gt;Apply smoothing and dampening so surge prices don&amp;rsquo;t oscillate wildly (e.g., 1.0× → 3.5× → 1.2× within minutes)&lt;/li&gt;
&lt;li&gt;Enforce price caps and fairness rules (regulatory limits, max multiplier during emergencies, consistent pricing within a zone)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — surge pricing is in the critical path of every ride request. If it&amp;rsquo;s down, rides can&amp;rsquo;t be priced.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Surge multiplier lookup &amp;lt; 20ms per ride request. Surge recomputation runs every 30-60 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; All ride requests within the same zone at the same time should see the same surge multiplier. Eventual consistency across data centers is acceptable (&amp;lt; 5 second lag).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500 cities, 50K zones globally, 100K ride requests/sec at peak, 5M active drivers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Surge must reflect conditions no older than 60 seconds. Stale surge = mispricing = lost revenue or angry riders.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100K ride requests/sec at peak → each needs a surge lookup → &lt;strong&gt;100K reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Driver location updates: 5M drivers × 1 update every 4 seconds = &lt;strong&gt;1.25M location updates/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Surge recomputation: 50K zones × 1 recomputation/minute = &lt;strong&gt;~833 zone recomputations/sec&lt;/strong&gt; (lightweight)&lt;/li&gt;
&lt;li&gt;Ride request events (for demand counting): &lt;strong&gt;100K events/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Zone definitions: 50K zones × 1 KB (polygon or H3 index) = &lt;strong&gt;50 MB&lt;/strong&gt; (fits in memory)&lt;/li&gt;
&lt;li&gt;Current surge state: 50K zones × 100 bytes (multiplier, supply, demand, timestamp) = &lt;strong&gt;5 MB&lt;/strong&gt; (fits in Redis)&lt;/li&gt;
&lt;li&gt;Historical surge data (for analytics): 50K zones × 1 record/min × 60 min × 24 hr = 72M records/day × 200 bytes = &lt;strong&gt;~14 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;compute&#34;&gt;Compute&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Per zone recomputation: count supply (drivers in zone), count demand (requests in zone in last 2 min), compute ratio, apply formula&lt;/li&gt;
&lt;li&gt;Simple arithmetic — CPU is trivial&lt;/li&gt;
&lt;li&gt;The bottleneck is &lt;strong&gt;ingesting and aggregating 1.25M location updates/sec&lt;/strong&gt; to determine supply per zone in real-time&lt;/li&gt;
&lt;/ul&gt;
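&lt;p&gt;The recomputation itself is simple arithmetic; the subtlety is the smoothing called out in the requirements. A toy per-zone formula with clamping and exponential smoothing (every name and constant here is an illustrative assumption, not Uber&amp;rsquo;s actual formula):&lt;/p&gt;

```python
def surge_multiplier(demand, supply, prev_multiplier, alpha=0.3, max_mult=3.0):
    """Toy per-zone surge: raw demand/supply ratio, clamped to
    [1.0, max_mult], then exponentially smoothed against the previous
    value so the multiplier cannot whipsaw between recomputes.
    All names and constants are illustrative assumptions."""
    raw = demand / max(supply, 1)          # avoid divide-by-zero in empty zones
    target = min(max(raw, 1.0), max_mult)  # clamp to [1.0, max_mult]
    smoothed = alpha * target + (1 - alpha) * prev_multiplier
    return round(smoothed, 1)              # riders see one decimal place

# Demand spike, then collapse: the multiplier ramps instead of jumping.
m = 1.0
for demand, supply in [(30, 10), (30, 10), (5, 10)]:
    m = surge_multiplier(demand, supply, m)
    print(m)  # prints 1.6, then 2.0, then 1.7
```

&lt;p&gt;The smoothing factor trades responsiveness for stability; the price-cap and fairness rules from the requirements would be applied after this step.&lt;/p&gt;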
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;real-time geospatial aggregation&lt;/strong&gt; problem. The hard parts are: (1) efficiently mapping millions of driver locations to zones every few seconds, (2) computing stable surge multipliers that respond to demand without oscillating, and (3) ensuring riders and drivers see consistent prices during a ride.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a System to Identify Top-K Shared Articles</title>
      <link>https://chiraghasija.cc/designs/top-k-articles/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/top-k-articles/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Track every article share event across the platform (social shares, link copies, email shares)&lt;/li&gt;
&lt;li&gt;Return the top-K most shared articles for multiple time windows: last 1 minute, last 1 hour, last 24 hours&lt;/li&gt;
&lt;li&gt;Support real-time updates — the top-K list refreshes within seconds of share events&lt;/li&gt;
&lt;li&gt;Provide both global top-K and per-category top-K (e.g., top sports articles, top tech articles)&lt;/li&gt;
&lt;li&gt;Expose an API for clients to query current trending articles with share counts&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the trending list is a high-visibility feature; downtime is immediately noticeable&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 50ms for top-K queries. Share event ingestion can tolerate up to 5 seconds of delay before reflecting in results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Approximate counts are acceptable. If an article has 10,000 shares, reporting 9,950 is fine. Rankings may be slightly stale by a few seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100K share events/sec at peak (viral events, breaking news). 10M+ unique articles in the system. Top-K queries at 50K QPS.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Share events must not be lost (feed into analytics). Top-K results are recomputable from raw events.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Share events: 100K/sec peak, 30K/sec average&lt;/li&gt;
&lt;li&gt;Daily share events: 30K × 86,400 = &lt;strong&gt;2.6 billion/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Top-K query QPS: 50K/sec (served from cache, cheap)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each share event: article_id (8 bytes) + user_id (8 bytes) + timestamp (8 bytes) + type (1 byte) + category (2 bytes) = ~30 bytes&lt;/li&gt;
&lt;li&gt;Daily raw events: 2.6B × 30 bytes = &lt;strong&gt;78 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;30-day retention for raw events: &lt;strong&gt;2.3 TB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;counting-infrastructure&#34;&gt;Counting Infrastructure&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Count-Min Sketch for approximate counting:
&lt;ul&gt;
&lt;li&gt;4 hash functions × 1M counters each = 4M counters&lt;/li&gt;
&lt;li&gt;Each counter: 4 bytes → &lt;strong&gt;16 MB&lt;/strong&gt; per time window&lt;/li&gt;
&lt;li&gt;3 time windows (1min, 1hr, 24hr) → &lt;strong&gt;48 MB&lt;/strong&gt; total&lt;/li&gt;
&lt;li&gt;Fits entirely in memory on a single node&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
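&lt;p&gt;To make the sizing concrete, here is a minimal Count-Min Sketch (width scaled down from the 1M counters in the estimate; the hash construction is an illustrative choice):&lt;/p&gt;

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: depth rows of width counters. Each
    key hashes to one counter per row; the estimate is the minimum
    across rows, so collisions can only over-count, never under-count."""

    def __init__(self, width=1000, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One independent hash per row, derived by salting blake2b.
        h = hashlib.blake2b(key.encode(), salt=str(row).encode())
        return int.from_bytes(h.digest()[:8], "big") % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += count

    def estimate(self, key):
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))

cms = CountMinSketch()
cms.add("article:42", 9950)
print(cms.estimate("article:42"))  # → 9950 (exact here: nothing else collides yet)
```

&lt;p&gt;A min-heap of the current top-K candidates sits next to the sketch: on each event, re-estimate the item and update the heap if its count exceeds the heap minimum.&lt;/p&gt;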
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The core challenge is &lt;strong&gt;not storage&lt;/strong&gt; — it is maintaining accurate, real-time rankings over sliding time windows at 100K events/sec. The Count-Min Sketch + min-heap approach keeps this in O(log K) per event with minimal memory.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Top-K System (Heavy Hitters)</title>
      <link>https://chiraghasija.cc/designs/top-k-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/top-k-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Track the top-K most frequent items (heavy hitters) across a stream of events in real-time (e.g., top 100 trending hashtags, most searched queries, most purchased products)&lt;/li&gt;
&lt;li&gt;Support time-windowed queries — top-K in the last 1 minute, 1 hour, 1 day&lt;/li&gt;
&lt;li&gt;Provide approximate counts for each item in the top-K list, with bounded error guarantees&lt;/li&gt;
&lt;li&gt;Support multiple independent top-K lists (per category, per region, global)&lt;/li&gt;
&lt;li&gt;Allow querying both the current top-K snapshot and historical top-K at any past timestamp&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the system is used for real-time dashboards and recommendation feeds&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Event processing &amp;lt; 10ms. Top-K query response &amp;lt; 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Approximate counts are acceptable (within 0.1% of true count). The top-K list may briefly lag by a few seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1M events/sec ingestion, 100K unique items per time window, top-K where K ≤ 1000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory efficiency:&lt;/strong&gt; Must not grow linearly with the number of unique items. A stream with 100M unique items should still use bounded memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Event stream: 1M events/sec (e.g., search queries, clicks, purchases)&lt;/li&gt;
&lt;li&gt;Each event: item_id (8 bytes) + timestamp (8 bytes) + metadata (32 bytes) = ~48 bytes&lt;/li&gt;
&lt;li&gt;Ingestion bandwidth: 1M × 48B = &lt;strong&gt;48 MB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Query QPS: ~10K queries/sec for top-K reads (cached aggressively)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;memory-for-exact-counting&#34;&gt;Memory for Exact Counting&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100M unique items per day × 16 bytes (8-byte item_id + 8-byte count) = &lt;strong&gt;1.6 GB&lt;/strong&gt; — feasible for a daily total, but maintaining sliding windows at 1-minute granularity is harder&lt;/li&gt;
&lt;li&gt;100M items × 1440 minutes × 16 bytes = &lt;strong&gt;2.3 TB&lt;/strong&gt; — impossible for exact per-minute counts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;memory-for-approximate-counting&#34;&gt;Memory for Approximate Counting&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Count-Min Sketch:&lt;/strong&gt; 4 hash functions × 1M counters × 4 bytes = &lt;strong&gt;16 MB&lt;/strong&gt; per time window&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Space-Saving (top-K tracker):&lt;/strong&gt; K=1000 items × (8 bytes key + 8 bytes count + 8 bytes error) = &lt;strong&gt;24 KB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per-minute windows for 24 hours:&lt;/strong&gt; 1440 windows × 16 MB = &lt;strong&gt;23 GB&lt;/strong&gt; — manageable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;With exponential decay (approximate sliding window):&lt;/strong&gt; Single sketch, no windowing needed = &lt;strong&gt;16 MB total&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
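&lt;p&gt;The Space-Saving tracker is small enough to show in full. A minimal sketch (a production version would back the minimum lookup with a heap or the stream-summary structure instead of a linear scan):&lt;/p&gt;

```python
class SpaceSaving:
    """Space-Saving top-K tracker with m counters. A new item that
    arrives when all m counters are occupied evicts the minimum
    counter and inherits its count as its error bound, so every
    stored count over-counts by at most the stored error."""

    def __init__(self, m):
        self.m = m
        self.counts = {}  # item: (count, error)

    def add(self, item):
        if item in self.counts:
            c, e = self.counts[item]
            self.counts[item] = (c + 1, e)
        elif len(self.counts) != self.m:  # still room; size never exceeds m
            self.counts[item] = (1, 0)
        else:
            victim = min(self.counts, key=lambda k: self.counts[k][0])
            c_min, _ = self.counts.pop(victim)
            self.counts[item] = (c_min + 1, c_min)

    def top_k(self, k):
        ranked = sorted(self.counts.items(), key=lambda kv: -kv[1][0])
        return [(item, c) for item, (c, _) in ranked[:k]]

ss = SpaceSaving(m=3)
for item in ["a", "a", "b", "a", "c", "d", "a", "b"]:
    ss.add(item)
print(ss.top_k(2))  # → [('a', 4), ('d', 2)]
```

&lt;p&gt;With m on the order of 1/ε counters, any item whose true frequency exceeds εN is guaranteed to be present in the summary.&lt;/p&gt;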
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The core challenge is the &lt;strong&gt;memory vs accuracy&lt;/strong&gt; trade-off. Exact counting requires O(N) memory where N is the number of unique items. Probabilistic data structures give ε-approximate answers in sublinear memory: Count-Min Sketch in O((1/ε) log(1/δ)) counters, Space-Saving in O(1/ε) counters. For top-K, we need both frequency estimation AND identification of the top items.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a URL Shortening Service (TinyURL)</title>
      <link>https://chiraghasija.cc/designs/url-shortener/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/url-shortener/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Given a long URL, generate a short, unique URL&lt;/li&gt;
&lt;li&gt;Given a short URL, redirect to the original long URL&lt;/li&gt;
&lt;li&gt;Users can optionally set a custom alias&lt;/li&gt;
&lt;li&gt;Links expire after a configurable TTL (default: 5 years)&lt;/li&gt;
&lt;li&gt;Analytics: track click count per short URL&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — redirects must always work; this is on the critical path of every click&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Redirect in &amp;lt; 10ms at p99 (just a lookup + 301)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is fine for analytics. Strong consistency for URL creation (no duplicate short codes)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M new URLs/day, 10:1 read-to-write ratio → 1B redirects/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; URLs must not be lost — a broken short link is permanent reputation damage&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-url-creation&#34;&gt;Write (URL creation)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100M URLs/day ÷ 100K sec/day = &lt;strong&gt;~1,000 writes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 5x → &lt;strong&gt;5,000 writes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;read-redirects&#34;&gt;Read (redirects)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B redirects/day ÷ 100K = &lt;strong&gt;~10,000 reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;50,000 reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each record: short code (7 bytes) + long URL (avg 200 bytes) + metadata (50 bytes) ≈ 250 bytes&lt;/li&gt;
&lt;li&gt;100M/day × 365 × 5 years = 182.5B records&lt;/li&gt;
&lt;li&gt;182.5B × 250 bytes = &lt;strong&gt;~45 TB&lt;/strong&gt; over 5 years&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;short-code-space&#34;&gt;Short code space&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Base62 (a-z, A-Z, 0-9), 7 characters = 62^7 = &lt;strong&gt;3.5 trillion&lt;/strong&gt; unique codes — more than enough&lt;/li&gt;
&lt;/ul&gt;
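&lt;p&gt;Base62 encoding of a monotonically increasing counter is a few lines; a sketch (the fixed 7-character width matches the code-space estimate above, and the alphabet ordering is an arbitrary choice):&lt;/p&gt;

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n, width=7):
    """Encode a non-negative counter as a zero-padded 7-char code."""
    chars = []
    while n:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars)).rjust(width, "0")

def base62_decode(code):
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

code = base62_encode(123456789)
print(code, base62_decode(code))  # → 008m0Kx 123456789
```

&lt;p&gt;The counter would come from a distributed ID generator (or pre-allocated key ranges) so that two app servers never encode the same number.&lt;/p&gt;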
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/shorten
  Headers: Authorization: Bearer &amp;lt;api_key&amp;gt;
  Body: {
    &amp;#34;long_url&amp;#34;: &amp;#34;https://example.com/very/long/path&amp;#34;,
    &amp;#34;custom_alias&amp;#34;: &amp;#34;my-link&amp;#34;,     // optional
    &amp;#34;ttl_days&amp;#34;: 365                // optional, default 1825
  }
  Response 201: {
    &amp;#34;short_url&amp;#34;: &amp;#34;https://tiny.url/aB3kX9p&amp;#34;,
    &amp;#34;short_code&amp;#34;: &amp;#34;aB3kX9p&amp;#34;,
    &amp;#34;expires_at&amp;#34;: &amp;#34;2031-02-22T00:00:00Z&amp;#34;
  }

GET /{short_code}
  Response 301: Location: https://example.com/very/long/path
  Response 404: { &amp;#34;error&amp;#34;: &amp;#34;URL not found or expired&amp;#34; }

GET /api/v1/stats/{short_code}
  Headers: Authorization: Bearer &amp;lt;api_key&amp;gt;
  Response 200: {
    &amp;#34;short_code&amp;#34;: &amp;#34;aB3kX9p&amp;#34;,
    &amp;#34;long_url&amp;#34;: &amp;#34;https://example.com/...&amp;#34;,
    &amp;#34;total_clicks&amp;#34;: 142857,
    &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T00:00:00Z&amp;#34;,
    &amp;#34;expires_at&amp;#34;: &amp;#34;2031-02-22T00:00:00Z&amp;#34;
  }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Weather Forecasting Service</title>
      <link>https://chiraghasija.cc/designs/weather-service/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/weather-service/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Provide current weather conditions and 7-day forecasts for any location worldwide (by coordinates, city name, or ZIP code)&lt;/li&gt;
&lt;li&gt;Ingest and process data from multiple sources: weather stations (100K+), satellites, radar, and third-party NWS/ECMWF model data&lt;/li&gt;
&lt;li&gt;Support geospatial queries: &amp;ldquo;weather at 40.7128,-74.0060&amp;rdquo; and reverse geocoding: &amp;ldquo;weather in New York, NY&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Severe weather alert system: tornado warnings, flood alerts, heat advisories — push notifications to affected users within minutes&lt;/li&gt;
&lt;li&gt;Provide historical weather data warehouse for trend analysis, agriculture, insurance, and research use cases&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for the API. Weather data is safety-critical — aviation, maritime, emergency services depend on it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Current conditions API &amp;lt; 100ms. Forecast API &amp;lt; 200ms. Alert delivery &amp;lt; 2 minutes from NWS issuance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Current conditions updated every 5-15 minutes. Forecasts updated every 6 hours (aligned with model runs). Alerts delivered in real-time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1B API requests/day across 10M registered developers. 50K concurrent data ingestion streams. 100TB historical data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accuracy:&lt;/strong&gt; Forecast accuracy comparable to top providers. Temperature within ±2°F for 24-hour forecasts, ±5°F for 7-day.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;api-traffic&#34;&gt;API Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B requests/day = &lt;strong&gt;11.5K requests/sec average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 5x during severe weather events = &lt;strong&gt;~60K requests/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Breakdown: 60% current conditions, 30% forecasts, 10% historical/alerts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;data-ingestion&#34;&gt;Data Ingestion&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Weather stations: 100K stations reporting every 5-15 minutes = ~400K–1.2M observations/hour&lt;/li&gt;
&lt;li&gt;Satellite data: 10 satellite feeds, each producing ~50GB/day = &lt;strong&gt;500GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Radar: 200 radar stations, 5-minute sweeps (288/day), each ~10MB = ~58K images/day = &lt;strong&gt;~600GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;NWS/ECMWF model output: 4 model runs/day × 50GB each = &lt;strong&gt;200GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Current conditions cache: 100K stations × 2KB = 200MB — fits entirely in Redis&lt;/li&gt;
&lt;li&gt;Forecast grid data: global 0.25-degree grid = 1,440 × 720 = 1M grid points × 7 days × 24 hours × 100 bytes = &lt;strong&gt;17GB&lt;/strong&gt; per model run → cached in memory&lt;/li&gt;
&lt;li&gt;Historical data: 100K stations × 365 days × 288 observations/day (every 5 min) × 200 bytes = &lt;strong&gt;2TB/year station data&lt;/strong&gt; + satellite/radar archives&lt;/li&gt;
&lt;li&gt;Total historical warehouse: &lt;strong&gt;~100TB&lt;/strong&gt; (5 years of all sources)&lt;/li&gt;
&lt;/ul&gt;
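&lt;p&gt;The 0.25-degree grid above implies a simple coordinate-to-cell mapping for forecast lookups. A sketch (the grid origin and row-major layout are assumptions for illustration):&lt;/p&gt;

```python
def grid_index(lat, lon, resolution=0.25):
    """Map a coordinate to its cell in a global 0.25-degree grid
    (1440 columns by 720 rows, matching the 1M-point estimate).
    Row 0 sits at latitude -90, column 0 at longitude -180."""
    rows = int(180 / resolution)   # 720
    cols = int(360 / resolution)   # 1440
    row = int((lat + 90.0) / resolution)
    col = int((lon + 180.0) / resolution)
    row = min(row, rows - 1)       # clamp the lat = +90 edge case
    col = col % cols               # wrap longitude across the date line
    return row * cols + col

# New York City: 40.7128 N, -74.0060 W
print(grid_index(40.7128, -74.0060))  # → 752103
```

&lt;p&gt;Every request for the same city hits the same cell, which is what makes the 95%+ cache hit rate achievable: cache by grid index, not by raw coordinates.&lt;/p&gt;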
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The read pattern is highly cacheable — millions of users in New York all get the same weather. The cache hit rate should be &amp;gt; 95%. The hard problem is ingesting, processing, and gridding heterogeneous data sources into a unified model.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Web Crawler</title>
      <link>https://chiraghasija.cc/designs/web-crawler/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/web-crawler/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Crawl the web starting from a seed set of URLs, discovering and following links to new pages&lt;/li&gt;
&lt;li&gt;Download and store the HTML content of each page for downstream indexing/processing&lt;/li&gt;
&lt;li&gt;Respect robots.txt directives and crawl-delay policies for every domain&lt;/li&gt;
&lt;li&gt;Detect and avoid duplicate URLs (normalization) and duplicate content (near-dedup)&lt;/li&gt;
&lt;li&gt;Support prioritized crawling — important/fresh pages crawled more frequently than stale/low-quality pages&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — the crawler should run continuously. Brief outages are acceptable (we just resume from where we left off).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; Crawl 1 billion pages/day (~12,000 pages/sec sustained). Scale to 5 billion pages total in the index.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Not a real-time system. End-to-end latency from URL discovery to content storage can be minutes to hours.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Politeness:&lt;/strong&gt; Never overload a single web server. Maximum 1 request/second per domain by default, respect Crawl-Delay.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Robustness:&lt;/strong&gt; Handle spider traps, malformed HTML, infinite URL spaces (calendars, session IDs), and adversarial pages gracefully.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;throughput&#34;&gt;Throughput&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Target: 1 billion pages/day&lt;/li&gt;
&lt;li&gt;1B / 86,400 = &lt;strong&gt;~12,000 pages/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Average page size: 100KB (HTML + headers)&lt;/li&gt;
&lt;li&gt;Download bandwidth: 12,000 × 100KB = &lt;strong&gt;1.2 GB/sec&lt;/strong&gt; = ~10 Gbps&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;5 billion pages in the index&lt;/li&gt;
&lt;li&gt;Average compressed page: 30KB (HTML compresses ~3:1)&lt;/li&gt;
&lt;li&gt;Total content storage: 5B × 30KB = &lt;strong&gt;150 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;URL frontier (URLs to crawl): 10 billion URLs × 200 bytes = &lt;strong&gt;2 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;URL seen set (deduplication): 50 billion URLs × 8 bytes (fingerprint) = &lt;strong&gt;400 GB&lt;/strong&gt; — fits in distributed memory&lt;/li&gt;
&lt;/ul&gt;
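&lt;p&gt;The 8-byte seen-set fingerprint only works if URLs are normalized first; otherwise the same page is crawled many times under trivially different URLs. A sketch (the normalization rules here are deliberately minimal and illustrative):&lt;/p&gt;

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def url_fingerprint(url):
    """Normalize a URL, then reduce it to an 8-byte fingerprint for
    the seen set (50B URLs at 8 bytes each = 400 GB, per the estimate).
    Normalization: lowercase scheme and host, drop the fragment,
    strip default ports, remove a trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    for default in (":80", ":443"):
        if host.endswith(default):
            host = host[: -len(default)]
    path = parts.path.rstrip("/") or "/"
    normalized = urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
    return int.from_bytes(hashlib.blake2b(normalized.encode()).digest()[:8], "big")

a = url_fingerprint("HTTP://Example.com:80/news/")
b = url_fingerprint("http://example.com/news")
print(a == b)  # → True
```

&lt;p&gt;Real crawlers add many more rules (query-parameter sorting, session-ID stripping), each a trade-off between collapsing duplicates and merging genuinely distinct pages.&lt;/p&gt;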
&lt;h3 id=&#34;dns&#34;&gt;DNS&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Unique domains: ~200 million&lt;/li&gt;
&lt;li&gt;DNS resolution: must cache aggressively. At 12,000 pages/sec, we cannot do a fresh DNS lookup for each.&lt;/li&gt;
&lt;li&gt;DNS cache: 200M domains × 100 bytes = &lt;strong&gt;20 GB&lt;/strong&gt; — fits in memory&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;machines&#34;&gt;Machines&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each crawler worker: ~200 concurrent connections, 500 pages/sec per worker&lt;/li&gt;
&lt;li&gt;12,000 / 500 = &lt;strong&gt;~24 crawler workers&lt;/strong&gt; (plus headroom → 40 workers)&lt;/li&gt;
&lt;li&gt;Each worker: 32 cores, 64GB RAM, 10 Gbps NIC&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;p&gt;The web crawler is not a user-facing API service — it is an internal batch processing system. However, it has internal control and data interfaces.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Wire Transfer System</title>
      <link>https://chiraghasija.cc/designs/wire-transfer/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/wire-transfer/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Transfer money between accounts within the same bank (internal) and across banks (external via SWIFT/ACH)&lt;/li&gt;
&lt;li&gt;Every transaction must follow double-entry bookkeeping — debit one account, credit another, with a full audit trail&lt;/li&gt;
&lt;li&gt;Support idempotent transfers — retrying the same request must not duplicate the transfer&lt;/li&gt;
&lt;li&gt;Provide real-time transaction status tracking (pending, processing, completed, failed, reversed)&lt;/li&gt;
&lt;li&gt;Enforce compliance checks (AML/KYC screening, sanctions list) before processing any transfer&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — financial systems cannot afford extended downtime. Scheduled maintenance windows are acceptable (e.g., 2am-3am) but unplanned outages are catastrophic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Internal transfers &amp;lt; 500ms end-to-end. External transfers (cross-bank) may take seconds to hours depending on the rail (ACH = batch, SWIFT = near-real-time).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency is mandatory. Money cannot be created or destroyed. Every debit must have a matching credit. We sacrifice availability for consistency (CP system).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10,000 transfers/sec at peak. $50B daily volume. 500M accounts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero data loss. Every transaction must be persisted to durable storage with replication before acknowledgment.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10,000 transfers/sec peak, ~3,000 avg&lt;/li&gt;
&lt;li&gt;Each transfer involves: 1 write to create the transfer, 2 writes to update account balances (debit + credit), 1 write to the ledger&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;~40,000 DB writes/sec peak&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Read traffic (balance checks, transaction history): ~50,000 reads/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M accounts × 500 bytes (account metadata + balance) = &lt;strong&gt;250 GB&lt;/strong&gt; for accounts&lt;/li&gt;
&lt;li&gt;300M transfers/day × 365 days × 1 KB per transfer = &lt;strong&gt;~110 TB/year&lt;/strong&gt; for transaction history&lt;/li&gt;
&lt;li&gt;Ledger entries: 600M/day (2 per transfer) × 500 bytes = &lt;strong&gt;~110 TB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Total: ~250 TB/year growing. Need partitioning and archival strategy.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;money-math&#34;&gt;Money Math&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;All monetary amounts stored as &lt;strong&gt;integers in the smallest currency unit&lt;/strong&gt; (cents for USD, pence for GBP)&lt;/li&gt;
&lt;li&gt;Never use floating point. $100.50 is stored as 10050 cents.&lt;/li&gt;
&lt;li&gt;Maximum transfer size: 64-bit integer → $92 quadrillion in cents. More than enough.&lt;/li&gt;
&lt;/ul&gt;
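&lt;p&gt;A sketch of the integer-cents rule at the API boundary: parse display amounts with &lt;code&gt;Decimal&lt;/code&gt;, never &lt;code&gt;float&lt;/code&gt;, and reject anything finer than the currency exponent (function name and error handling are illustrative):&lt;/p&gt;

```python
from decimal import Decimal

def to_minor_units(amount_str, exponent=2):
    """Parse a display amount such as "100.50" into integer minor
    units (cents for USD, pence for GBP). Decimal parses the string
    exactly, so floats never enter the pipeline; amounts finer than
    the currency exponent are rejected instead of silently rounded."""
    scaled = Decimal(amount_str).scaleb(exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError("more precision than the currency allows")
    return int(scaled)

print(to_minor_units("100.50"))                         # → 10050
print(to_minor_units("0.07") + to_minor_units("0.03"))  # → 10 (exact; the float sum 0.07 + 0.03 is not 0.1)
```

&lt;p&gt;All downstream arithmetic (balances, ledger entries) then stays in plain integer addition and subtraction, which is exact by construction.&lt;/p&gt;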
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Initiate a transfer
POST /v1/transfers
  Headers: Idempotency-Key: &amp;#34;uuid-abc-123&amp;#34;
  Body: {
    &amp;#34;from_account_id&amp;#34;: &amp;#34;acc_sender_001&amp;#34;,
    &amp;#34;to_account_id&amp;#34;: &amp;#34;acc_receiver_002&amp;#34;,
    &amp;#34;amount&amp;#34;: 10050,                    // $100.50 in cents
    &amp;#34;currency&amp;#34;: &amp;#34;USD&amp;#34;,
    &amp;#34;reference&amp;#34;: &amp;#34;Invoice #4521&amp;#34;,
    &amp;#34;transfer_type&amp;#34;: &amp;#34;internal&amp;#34;         // or &amp;#34;ach&amp;#34;, &amp;#34;swift&amp;#34;
  }
  Response 201: {
    &amp;#34;transfer_id&amp;#34;: &amp;#34;txn_xyz_789&amp;#34;,
    &amp;#34;status&amp;#34;: &amp;#34;pending&amp;#34;,
    &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T10:00:00Z&amp;#34;
  }

// Get transfer status
GET /v1/transfers/{transfer_id}
  Response 200: {
    &amp;#34;transfer_id&amp;#34;: &amp;#34;txn_xyz_789&amp;#34;,
    &amp;#34;status&amp;#34;: &amp;#34;completed&amp;#34;,              // pending | processing | completed | failed | reversed
    &amp;#34;from_account_id&amp;#34;: &amp;#34;acc_sender_001&amp;#34;,
    &amp;#34;to_account_id&amp;#34;: &amp;#34;acc_receiver_002&amp;#34;,
    &amp;#34;amount&amp;#34;: 10050,
    &amp;#34;currency&amp;#34;: &amp;#34;USD&amp;#34;,
    &amp;#34;compliance_status&amp;#34;: &amp;#34;cleared&amp;#34;,
    &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T10:00:00Z&amp;#34;,
    &amp;#34;completed_at&amp;#34;: &amp;#34;2026-02-22T10:00:02Z&amp;#34;
  }

// Get account balance
GET /v1/accounts/{account_id}/balance
  Response 200: {
    &amp;#34;account_id&amp;#34;: &amp;#34;acc_sender_001&amp;#34;,
    &amp;#34;available_balance&amp;#34;: 5000000,       // $50,000.00
    &amp;#34;pending_balance&amp;#34;: 4989950,         // after pending debit
    &amp;#34;currency&amp;#34;: &amp;#34;USD&amp;#34;
  }

// Get transaction history
GET /v1/accounts/{account_id}/transactions?limit=50&amp;amp;cursor=xxx
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Idempotency-Key header&lt;/strong&gt; is mandatory on POST. The server stores the key and returns the same response on retry.&lt;/li&gt;
&lt;li&gt;Amounts are always integers in the smallest currency unit. The API never accepts floats.&lt;/li&gt;
&lt;li&gt;Transfer creation is asynchronous — returns &lt;code&gt;pending&lt;/code&gt; immediately. Client polls or receives webhook for completion.&lt;/li&gt;
&lt;/ul&gt;
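The idempotency decision above can be sketched in-process (the real system keeps keys in Redis backed by PostgreSQL with a 24-hour TTL; the store and function names here are illustrative):

```python
import uuid

# Illustrative in-memory idempotency store; production uses
# Redis + PostgreSQL with a 24h TTL as described above.
_idempotency_store: dict[str, dict] = {}

def create_transfer(idempotency_key: str, body: dict) -> dict:
    # A retry with the same key returns the original response,
    # so a double-submitted POST cannot move money twice.
    if idempotency_key in _idempotency_store:
        return _idempotency_store[idempotency_key]
    response = {
        "transfer_id": f"txn_{uuid.uuid4().hex[:8]}",
        "status": "pending",
    }
    _idempotency_store[idempotency_key] = response
    return response

first = create_transfer("uuid-abc-123", {"amount": 10050})
retry = create_transfer("uuid-abc-123", {"amount": 10050})
assert first == retry  # same transfer_id, no duplicate transfer
```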
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;accounts-table-postgresql--sharded-by-account_id&#34;&gt;Accounts Table (PostgreSQL — sharded by account_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: accounts
  account_id        (PK)  | varchar(20)
  user_id           (FK)  | varchar(20)
  balance           | bigint          -- available balance in cents
  pending_balance   | bigint          -- balance after pending holds
  currency          | char(3)
  status            | enum(&amp;#39;active&amp;#39;, &amp;#39;frozen&amp;#39;, &amp;#39;closed&amp;#39;)
  created_at        | timestamp
  updated_at        | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;transfers-table-postgresql--sharded-by-transfer_id&#34;&gt;Transfers Table (PostgreSQL — sharded by transfer_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: transfers
  transfer_id       (PK)  | varchar(20)
  idempotency_key   (UQ)  | varchar(64)
  from_account_id   (FK)  | varchar(20)
  to_account_id     (FK)  | varchar(20)
  amount            | bigint
  currency          | char(3)
  transfer_type     | enum(&amp;#39;internal&amp;#39;, &amp;#39;ach&amp;#39;, &amp;#39;swift&amp;#39;)
  status            | enum(&amp;#39;pending&amp;#39;, &amp;#39;processing&amp;#39;, &amp;#39;completed&amp;#39;, &amp;#39;failed&amp;#39;, &amp;#39;reversed&amp;#39;)
  compliance_status | enum(&amp;#39;pending&amp;#39;, &amp;#39;cleared&amp;#39;, &amp;#39;flagged&amp;#39;, &amp;#39;blocked&amp;#39;)
  reference         | varchar(200)
  created_at        | timestamp
  completed_at      | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ledger-table-append-only--the-source-of-truth&#34;&gt;Ledger Table (Append-Only — the source of truth)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: ledger_entries
  entry_id          (PK)  | bigint (auto-increment)
  transfer_id       (FK)  | varchar(20)
  account_id        (FK)  | varchar(20)
  entry_type        | enum(&amp;#39;debit&amp;#39;, &amp;#39;credit&amp;#39;)
  amount            | bigint
  balance_after     | bigint          -- running balance snapshot
  created_at        | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;idempotency-store-redis--postgresql&#34;&gt;Idempotency Store (Redis + PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: idempotency_keys
  idempotency_key   (PK)  | varchar(64)
  transfer_id       | varchar(20)
  response_body     | jsonb
  created_at        | timestamp
  expires_at        | timestamp       -- TTL: 24 hours
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-postgresql&#34;&gt;Why PostgreSQL?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;ACID transactions are non-negotiable for financial systems&lt;/li&gt;
&lt;li&gt;Row-level locking for concurrent balance updates&lt;/li&gt;
&lt;li&gt;Serializable isolation level for critical transfer logic&lt;/li&gt;
&lt;li&gt;Rich constraint system (CHECK balance &amp;gt;= 0, foreign keys)&lt;/li&gt;
&lt;li&gt;Proven reliability in banking — this is not the place for eventual consistency&lt;/li&gt;
&lt;/ul&gt;
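The CHECK constraint mentioned above is a last-line safety net: even if application code forgets to validate, the database rejects an overdraft. A small demonstration (using SQLite for portability; the design itself specifies PostgreSQL):

```python
import sqlite3

# Sketch of the "CHECK (balance >= 0)" safety net.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id TEXT PRIMARY KEY,
        balance    INTEGER NOT NULL CHECK (balance >= 0)
    )
""")
conn.execute("INSERT INTO accounts VALUES ('acc_sender_001', 10050)")

# An overdraft violates the constraint even when application
# logic forgets to check the balance first.
try:
    conn.execute(
        "UPDATE accounts SET balance = balance - 20000 "
        "WHERE account_id = 'acc_sender_001'"
    )
    overdraft_blocked = False
except sqlite3.IntegrityError:
    overdraft_blocked = True

assert overdraft_blocked  # the UPDATE was rejected, balance untouched
```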
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;transfer-flow-internal&#34;&gt;Transfer Flow (Internal)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client
  → API Gateway (auth, rate limiting)
    → Transfer Service
      → 1. Validate idempotency key (Redis lookup)
         If exists → return cached response
      → 2. Create transfer record (status = pending)
      → 3. Compliance check (AML/KYC/sanctions screening)
         If flagged → status = blocked, notify compliance team
      → 4. Execute transfer (single DB transaction):
         BEGIN TRANSACTION (SERIALIZABLE)
           SELECT balance FROM accounts WHERE account_id = sender FOR UPDATE
           IF balance &amp;lt; amount → ROLLBACK, return insufficient funds
           UPDATE accounts SET balance = balance - amount WHERE account_id = sender
           UPDATE accounts SET balance = balance + amount WHERE account_id = receiver
           INSERT INTO ledger_entries (debit for sender)
           INSERT INTO ledger_entries (credit for receiver)
           UPDATE transfers SET status = &amp;#39;completed&amp;#39;
         COMMIT
      → 5. Store idempotency key → response mapping
      → 6. Send notifications (async via message queue)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;transfer-flow-cross-bank-via-achswift&#34;&gt;Transfer Flow (Cross-Bank via ACH/SWIFT)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client
  → API Gateway → Transfer Service
    → 1-3. Same as above (validate, create, compliance)
    → 4. Debit sender&amp;#39;s account + create hold
    → 5. Submit to Payment Rail:
         ACH: Batch file submitted to ACH operator (Nacha format)
              → Processed in batch windows (next business day)
         SWIFT: MT103 message to correspondent bank
              → Near-real-time via SWIFT network
    → 6. Await confirmation from external bank
         → Payment Rail Adapter (listens for responses)
           → On success: credit receiver&amp;#39;s account, update transfer status
           → On failure: release hold, reverse debit, update status
    → 7. Reconciliation job validates all external transfers daily
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;API Gateway:&lt;/strong&gt; Authentication, rate limiting, TLS termination. All traffic over mTLS internally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transfer Service:&lt;/strong&gt; Core business logic. Stateless, horizontally scaled. Orchestrates the transfer lifecycle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance Service:&lt;/strong&gt; Screens transfers against sanctions lists (OFAC, EU), runs AML rules (large amounts, velocity checks, geographic risk scoring). Calls external providers (e.g., Refinitiv, Dow Jones) for PEP/sanctions screening.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ledger Database (PostgreSQL):&lt;/strong&gt; Sharded by account_id. Primary + synchronous replica for zero data loss. Append-only ledger table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Payment Rail Adapters:&lt;/strong&gt; Separate services for ACH, SWIFT, FedWire. Handle protocol-specific formatting and communication.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Service:&lt;/strong&gt; Sends emails, push notifications, webhooks on transfer completion/failure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reconciliation Engine:&lt;/strong&gt; Batch job that runs daily. Compares internal ledger against bank statements and external rail confirmations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idempotency Store (Redis):&lt;/strong&gt; Fast lookup for duplicate detection. Backed by PostgreSQL for durability.&lt;/li&gt;
&lt;/ol&gt;
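The internal transfer transaction (step 4 of the flow above) can be sketched end to end. This is a simplified single-threaded, in-memory model, not the real SQL: concurrency safety comes from SERIALIZABLE isolation and row locks in the actual design.

```python
# Simplified model of step 4's atomic transfer: a debit, a credit,
# and two ledger entries succeed or fail together.
accounts = {"acc_sender_001": 5_000_000, "acc_receiver_002": 0}
ledger: list[dict] = []

def execute_transfer(transfer_id: str, sender: str, receiver: str,
                     amount: int) -> str:
    if accounts[sender] < amount:
        return "failed"  # maps to ROLLBACK: nothing was mutated
    accounts[sender] -= amount
    accounts[receiver] += amount
    ledger.append({"transfer_id": transfer_id, "account_id": sender,
                   "entry_type": "debit", "amount": amount,
                   "balance_after": accounts[sender]})
    ledger.append({"transfer_id": transfer_id, "account_id": receiver,
                   "entry_type": "credit", "amount": amount,
                   "balance_after": accounts[receiver]})
    return "completed"

assert execute_transfer("txn_xyz_789", "acc_sender_001",
                        "acc_receiver_002", 10050) == "completed"
# Double-entry invariant: total debits always equal total credits.
debits = sum(e["amount"] for e in ledger if e["entry_type"] == "debit")
credits = sum(e["amount"] for e in ledger if e["entry_type"] == "credit")
assert debits == credits
```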
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-double-entry-bookkeeping--acid-guarantees&#34;&gt;Deep Dive 1: Double-Entry Bookkeeping &amp;amp; ACID Guarantees&lt;/h3&gt;
&lt;p&gt;Every financial movement must create exactly two ledger entries: a debit from one account and a credit to another. The sum of all debits must always equal the sum of all credits (the fundamental accounting equation).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Amazon Cart Management Service</title>
      <link>https://chiraghasija.cc/designs/cart-service/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/cart-service/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Add, remove, and update quantity of items in the cart (CRUD operations with real-time inventory validation)&lt;/li&gt;
&lt;li&gt;Support both guest carts (cookie/session-based) and authenticated user carts, with automatic cart merging when a guest logs in&lt;/li&gt;
&lt;li&gt;Persist carts durably across sessions, devices, and app restarts — a user who adds an item on mobile must see it on desktop&lt;/li&gt;
&lt;li&gt;Handle price and availability changes while items sit in the cart — show current price, flag out-of-stock items, and surface price-change notifications&lt;/li&gt;
&lt;li&gt;Transition the cart atomically to checkout — reserve inventory, lock prices, and create an order in a single coordinated operation&lt;/li&gt;
&lt;/ol&gt;
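Requirement 5's inventory reservation hinges on a conditional decrement, so two concurrent checkouts cannot both claim the last unit. A minimal sketch of that check (illustrative names; production performs this as one atomic database or Redis operation):

```python
# Conditional decrement: reject the reservation instead of
# overselling when stock is insufficient.
inventory = {"item_42": 1}

def reserve(item_id: str, qty: int) -> bool:
    available = inventory.get(item_id, 0)
    if available < qty:
        return False  # second checkout sees no stock
    inventory[item_id] = available - qty
    return True

assert reserve("item_42", 1) is True    # first checkout wins
assert reserve("item_42", 1) is False   # last unit already claimed
```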
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% uptime. Cart unavailability directly equals lost revenue. Even 1 minute of cart downtime during Prime Day can cost millions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Add-to-cart &amp;lt; 100ms p99. Cart read (render page) &amp;lt; 50ms p99. These must hold during 10x traffic spikes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for cart reads (stale by at most 1-2 seconds). Strong consistency for checkout transition (inventory decrement must be atomic).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 300M+ active users, 500M+ carts (including guest and abandoned), ~1,750 add-to-cart operations/sec average, ~17,500/sec peak during Prime Day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero cart data loss. A cart is a purchase intent — losing a cart with 15 items a user spent 30 minutes curating is unacceptable.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active users: 300M monthly, ~50M daily&lt;/li&gt;
&lt;li&gt;Add/remove/update operations: 50M DAU × 3 cart ops/day = &lt;strong&gt;150M writes/day = ~1,750/sec average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak (Prime Day, 10x): &lt;strong&gt;~17,500 writes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Cart reads (page loads, mini-cart renders): 5× writes = &lt;strong&gt;~8,750/sec average, 87,500/sec peak&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Checkout transitions: 10M orders/day = &lt;strong&gt;~115/sec average, 1,150/sec peak&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Cart record: user_id + metadata = ~200 bytes&lt;/li&gt;
&lt;li&gt;Cart item: item_id + seller_id + quantity + price_at_add + timestamp = ~150 bytes&lt;/li&gt;
&lt;li&gt;Average cart: 5 items → 200 + (5 × 150) = &lt;strong&gt;~950 bytes per cart&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;500M carts × 950 bytes = &lt;strong&gt;~475 GB total cart data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Including indexes, replicas, and overhead: &lt;strong&gt;~2 TB provisioned storage&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;inventory-checks&#34;&gt;Inventory Checks&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Every add-to-cart triggers an inventory check: 1,750/sec average&lt;/li&gt;
&lt;li&gt;Every cart page load validates inventory for all items: 8,750/sec × 5 items = &lt;strong&gt;43,750 inventory lookups/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;~440K inventory lookups/sec&lt;/strong&gt; — must be served from cache, not direct DB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;guest-cart-volume&#34;&gt;Guest Cart Volume&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;~40% of carts are guest carts (no login)&lt;/li&gt;
&lt;li&gt;200M guest carts with average TTL of 30 days&lt;/li&gt;
&lt;li&gt;Cart merge events on login: ~5M/day (guest converts to authenticated)&lt;/li&gt;
&lt;/ul&gt;
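The guest-to-authenticated merge on login can be sketched as follows. Summing quantities for duplicate items is one illustrative policy; teams also choose max() or prompt the user:

```python
# Merge a guest cart into the authenticated user's cart so that
# no curated item is silently lost on login.
def merge_carts(user_cart: dict[str, int],
                guest_cart: dict[str, int]) -> dict[str, int]:
    merged = dict(user_cart)
    for item_id, qty in guest_cart.items():
        merged[item_id] = merged.get(item_id, 0) + qty
    return merged

user = {"book_1": 1}
guest = {"book_1": 1, "usb_c_cable": 2}
assert merge_carts(user, guest) == {"book_1": 2, "usb_c_cable": 2}
```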
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;hot-path, high-availability storage problem&lt;/strong&gt; with an inventory coordination challenge. The cart is on the critical path of every purchase. The hard problems are: (1) keeping cart data durable without sacrificing read latency, (2) handling inventory races at checkout, and (3) merging guest and authenticated carts without losing items.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an A/B Testing Platform</title>
      <link>https://chiraghasija.cc/designs/ab-testing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/ab-testing/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Create and manage experiments with multiple variants (A/B/n) and traffic allocation percentages&lt;/li&gt;
&lt;li&gt;Assign users deterministically to experiment variants (same user always sees the same variant)&lt;/li&gt;
&lt;li&gt;Support mutual exclusion (user in experiment X cannot be in experiment Y) and layering (independent experiments can run simultaneously)&lt;/li&gt;
&lt;li&gt;Collect and compute metrics (conversion rate, revenue, engagement) with statistical significance testing&lt;/li&gt;
&lt;li&gt;Provide a dashboard showing experiment results, confidence intervals, sample sizes, and guardrail metric alerts&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for the assignment service — if it goes down, every feature behind a flag breaks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5ms for variant assignment — it&amp;rsquo;s in the critical path of page renders and API calls&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Assignment must be deterministic and sticky. A user must always see the same variant for the lifetime of an experiment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M daily active users, 50K+ concurrent experiments, 100B+ assignment checks/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; No event loss for exposure and metric events — statistical validity depends on complete data&lt;/li&gt;
&lt;/ul&gt;
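Deterministic, sticky assignment is typically a pure hash of the (experiment_id, user_id) pair, which is why it needs no database lookup and easily meets the latency budget. A sketch (bucket count and variant split are illustrative):

```python
import hashlib

# Hash (experiment_id, user_id) into one of 10,000 buckets, then
# map contiguous bucket ranges to variants. Same inputs always
# produce the same bucket, so assignment is sticky by construction.
def assign(experiment_id: str, user_id: str,
           variants: list[tuple[str, float]]) -> str:
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    cumulative = 0.0
    for name, fraction in variants:
        cumulative += fraction * 10_000
        if bucket < cumulative:
            return name
    return variants[-1][0]

split = [("control", 0.5), ("treatment", 0.5)]
first = assign("exp_checkout_button", "user_123", split)
# Determinism: repeated checks never flip a user's variant.
assert first == assign("exp_checkout_button", "user_123", split)
```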
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M DAU, average 20 page views/day = 10B page views/day&lt;/li&gt;
&lt;li&gt;Each page view checks ~10 experiments = &lt;strong&gt;100B assignment checks/day ≈ 1.15M checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Exposure logging: ~10B exposure events/day (one per experiment per user per session)&lt;/li&gt;
&lt;li&gt;Metric events: ~50B events/day (clicks, conversions, revenue, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Experiment configs: 50K experiments × 5KB = 250MB — trivially fits in memory&lt;/li&gt;
&lt;li&gt;Exposure events: 10B/day × 100 bytes = 1TB/day&lt;/li&gt;
&lt;li&gt;Metric events: 50B/day × 150 bytes = 7.5TB/day&lt;/li&gt;
&lt;li&gt;Total raw event storage: &lt;strong&gt;~8.5TB/day, ~3PB/year&lt;/strong&gt; → needs columnar storage (data warehouse)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;compute&#34;&gt;Compute&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Statistical analysis: for each experiment, aggregate metrics across millions of users. A single experiment with 10M users and 5 metrics requires scanning ~50M rows. Running 50K experiments → batch compute pipeline (Spark/Presto), not real-time.&lt;/li&gt;
&lt;li&gt;Pre-aggregation per experiment per day reduces warehouse scans dramatically.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;Assignment is a &lt;strong&gt;latency-critical, read-heavy&lt;/strong&gt; problem (solved by hashing, no database needed). Analysis is a &lt;strong&gt;compute-heavy, batch&lt;/strong&gt; problem (solved by a data pipeline). These are two very different subsystems.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an Advertising System (Google Ads)</title>
      <link>https://chiraghasija.cc/designs/ads-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/ads-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Advertisers create campaigns with targeting criteria (demographics, interests, keywords, geography, device type), set budgets (daily/total), and bid on ad placements (CPC, CPM, CPA)&lt;/li&gt;
&lt;li&gt;When a user visits a page, the ad serving system runs a real-time auction among eligible ads, selects the winner(s), and renders the ad within 100ms&lt;/li&gt;
&lt;li&gt;Track ad events (impressions, clicks, conversions) with accurate attribution and provide real-time reporting to advertisers&lt;/li&gt;
&lt;li&gt;Implement budget pacing — spend the daily budget evenly throughout the day rather than exhausting it in the first hour&lt;/li&gt;
&lt;li&gt;Detect and filter fraudulent clicks (bot clicks, click farms, competitor clicking) before charging advertisers&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — every failed ad request is lost revenue. At ~$55M daily revenue (see Revenue Math), each minute of downtime costs ~$40K.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 100ms end-to-end from ad request to ad response. The ad auction must complete within 50ms to leave time for network and rendering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Financial data (budgets, billing) must be strongly consistent. Ad serving can tolerate eventual consistency for targeting data (a few minutes of propagation is fine).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M ad requests/sec globally. 10M active ad campaigns. 1 billion ad impressions/day. 50M click events/day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Every click and impression must be logged durably — this is billing data. Zero data loss for financial events.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ad-serving-traffic&#34;&gt;Ad Serving Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M ad requests/sec&lt;/li&gt;
&lt;li&gt;Each request: evaluate ~1000 candidate ads → filter to ~50 eligible → score and rank → select top 3-5&lt;/li&gt;
&lt;li&gt;Each ad request: ~2 KB (user context, page context, device info)&lt;/li&gt;
&lt;li&gt;Response: ~5 KB (ad creative URLs, tracking pixels, metadata)&lt;/li&gt;
&lt;li&gt;Bandwidth: 10M × 7 KB = &lt;strong&gt;70 GB/sec&lt;/strong&gt; — distributed across 50+ edge PoPs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;auction-computation&#34;&gt;Auction Computation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M auctions/sec&lt;/li&gt;
&lt;li&gt;Each auction evaluates ~50 eligible ads with ML click prediction&lt;/li&gt;
&lt;li&gt;ML inference: ~0.1ms per ad × 50 ads = &lt;strong&gt;5ms&lt;/strong&gt; per auction (batched inference)&lt;/li&gt;
&lt;li&gt;Total ML compute: 10M × 50 = &lt;strong&gt;500M inferences/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Requires: ~5,000 GPU/TPU instances for ML serving&lt;/li&gt;
&lt;/ul&gt;
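The per-auction scoring and second-price pricing estimated here (and detailed in Section 5) can be sketched as follows. The candidate list and pCTR values are made up for illustration; in production pCTR comes from the click-prediction model:

```python
# Rank candidates by eCPM = bid × pCTR × 1000, then charge the
# winner just enough to beat the runner-up (second-price auction).
# Bids are in cents per click.
candidates = [
    {"ad_id": "ad_001", "bid": 200, "pctr": 0.030},
    {"ad_id": "ad_002", "bid": 150, "pctr": 0.050},
    {"ad_id": "ad_003", "bid": 300, "pctr": 0.010},
]
for c in candidates:
    c["ecpm"] = c["bid"] * c["pctr"] * 1000

ranked = sorted(candidates, key=lambda c: c["ecpm"], reverse=True)
winner, runner_up = ranked[0], ranked[1]
# Convert the runner-up's eCPM back into the winner's CPC charge:
# the minimum per-click price that still beats second place.
price_to_charge = runner_up["ecpm"] / (winner["pctr"] * 1000)

assert winner["ad_id"] == "ad_002"       # 150 × 0.05 × 1000 = 7500 eCPM
assert price_to_charge <= winner["bid"]  # never pay more than your own bid
```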
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Event logging: 1B impressions/day × 500 bytes = &lt;strong&gt;500 GB/day&lt;/strong&gt; impressions&lt;/li&gt;
&lt;li&gt;50M clicks/day × 200 bytes = &lt;strong&gt;10 GB/day&lt;/strong&gt; clicks&lt;/li&gt;
&lt;li&gt;Total event storage: &lt;strong&gt;~200 TB/year&lt;/strong&gt; (retained for 2 years)&lt;/li&gt;
&lt;li&gt;Ad campaigns: 10M campaigns × 5 KB = &lt;strong&gt;50 GB&lt;/strong&gt; — fits in memory&lt;/li&gt;
&lt;li&gt;User profiles (targeting data): 2B users × 2 KB = &lt;strong&gt;4 TB&lt;/strong&gt; — distributed cache&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;revenue-math&#34;&gt;Revenue Math&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average CPM (cost per 1000 impressions): $5&lt;/li&gt;
&lt;li&gt;1B impressions/day × $5/1000 = &lt;strong&gt;$5M/day&lt;/strong&gt; from display&lt;/li&gt;
&lt;li&gt;Average CPC (cost per click): $1&lt;/li&gt;
&lt;li&gt;50M clicks/day × $1 = &lt;strong&gt;$50M/day&lt;/strong&gt; from search/click ads&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;~$55M/day&lt;/strong&gt; ≈ &lt;strong&gt;$20B/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;advertiser-facing-apis&#34;&gt;Advertiser-Facing APIs&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create a campaign
POST /v1/campaigns
  Headers: Authorization: Bearer &amp;lt;advertiser_token&amp;gt;
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;Summer Sale 2026&amp;#34;,
    &amp;#34;objective&amp;#34;: &amp;#34;clicks&amp;#34;,              // impressions | clicks | conversions
    &amp;#34;daily_budget&amp;#34;: 500000,             // $5,000.00 in cents
    &amp;#34;total_budget&amp;#34;: 15000000,           // $150,000.00
    &amp;#34;bid_strategy&amp;#34;: &amp;#34;manual_cpc&amp;#34;,       // manual_cpc | target_cpa | maximize_clicks
    &amp;#34;max_bid&amp;#34;: 200,                     // $2.00 max CPC
    &amp;#34;targeting&amp;#34;: {
      &amp;#34;geo&amp;#34;: [&amp;#34;US&amp;#34;, &amp;#34;CA&amp;#34;],
      &amp;#34;age_range&amp;#34;: [25, 54],
      &amp;#34;interests&amp;#34;: [&amp;#34;technology&amp;#34;, &amp;#34;gaming&amp;#34;],
      &amp;#34;keywords&amp;#34;: [&amp;#34;gaming laptop&amp;#34;, &amp;#34;best GPU 2026&amp;#34;],
      &amp;#34;devices&amp;#34;: [&amp;#34;desktop&amp;#34;, &amp;#34;mobile&amp;#34;],
      &amp;#34;time_of_day&amp;#34;: { &amp;#34;start&amp;#34;: 8, &amp;#34;end&amp;#34;: 22, &amp;#34;timezone&amp;#34;: &amp;#34;America/New_York&amp;#34; }
    },
    &amp;#34;creatives&amp;#34;: [
      { &amp;#34;type&amp;#34;: &amp;#34;banner&amp;#34;, &amp;#34;size&amp;#34;: &amp;#34;300x250&amp;#34;, &amp;#34;image_url&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;landing_url&amp;#34;: &amp;#34;...&amp;#34; },
      { &amp;#34;type&amp;#34;: &amp;#34;text&amp;#34;, &amp;#34;headline&amp;#34;: &amp;#34;50% Off Gaming Laptops&amp;#34;, &amp;#34;description&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;landing_url&amp;#34;: &amp;#34;...&amp;#34; }
    ],
    &amp;#34;start_date&amp;#34;: &amp;#34;2026-06-01&amp;#34;,
    &amp;#34;end_date&amp;#34;: &amp;#34;2026-08-31&amp;#34;
  }
  Response 201: { &amp;#34;campaign_id&amp;#34;: &amp;#34;camp_xyz&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;pending_review&amp;#34; }

// Get campaign performance
GET /v1/campaigns/{campaign_id}/metrics?date_range=last_7d&amp;amp;granularity=daily
  Response 200: {
    &amp;#34;campaign_id&amp;#34;: &amp;#34;camp_xyz&amp;#34;,
    &amp;#34;metrics&amp;#34;: {
      &amp;#34;impressions&amp;#34;: 1250000,
      &amp;#34;clicks&amp;#34;: 37500,
      &amp;#34;ctr&amp;#34;: 0.03,                      // 3% click-through rate
      &amp;#34;conversions&amp;#34;: 1125,
      &amp;#34;cpa&amp;#34;: 4444,                       // $44.44 cost per acquisition
      &amp;#34;spend&amp;#34;: 5000000,                 // $50,000.00
      &amp;#34;remaining_budget&amp;#34;: 10000000
    },
    &amp;#34;daily_breakdown&amp;#34;: [...]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ad-serving-api-internal-called-by-publisher-pages&#34;&gt;Ad Serving API (internal, called by publisher pages)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Request ads for a page
GET /v1/ads?slots=3&amp;amp;format=banner_300x250,banner_728x90
    &amp;amp;page_url=example.com/tech/reviews
    &amp;amp;user_id=anon_abc                   // hashed, for targeting
    &amp;amp;device=mobile&amp;amp;geo=US-CA
  Response 200 (&amp;lt; 100ms): {
    &amp;#34;ads&amp;#34;: [
      {
        &amp;#34;ad_id&amp;#34;: &amp;#34;ad_001&amp;#34;,
        &amp;#34;creative_url&amp;#34;: &amp;#34;https://cdn.ads.com/banner_001.jpg&amp;#34;,
        &amp;#34;landing_url&amp;#34;: &amp;#34;https://advertiser.com/sale?utm_source=...&amp;#34;,
        &amp;#34;impression_url&amp;#34;: &amp;#34;https://track.ads.com/imp?id=ad_001&amp;amp;...&amp;#34;,
        &amp;#34;click_url&amp;#34;: &amp;#34;https://track.ads.com/click?id=ad_001&amp;amp;...&amp;#34;,
        &amp;#34;bid_price&amp;#34;: 150                // $1.50 CPM, for publisher revenue share
      }
    ],
    &amp;#34;auction_id&amp;#34;: &amp;#34;auc_12345&amp;#34;
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;event-tracking-api&#34;&gt;Event Tracking API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Impression beacon (fired when ad is rendered)
GET /v1/track/impression?ad_id=ad_001&amp;amp;auction_id=auc_12345&amp;amp;ts=1708632000
  Response 204 (no content, fire-and-forget)

// Click redirect (user clicks ad)
GET /v1/track/click?ad_id=ad_001&amp;amp;auction_id=auc_12345
  Response 302: Redirect to landing_url
  (logs click event before redirect)

// Conversion postback (from advertiser&amp;#39;s server)
POST /v1/track/conversion
  Body: { &amp;#34;campaign_id&amp;#34;: &amp;#34;camp_xyz&amp;#34;, &amp;#34;conversion_id&amp;#34;: &amp;#34;conv_001&amp;#34;, &amp;#34;value&amp;#34;: 9999 }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;campaigns-postgresql--sharded-by-advertiser_id&#34;&gt;Campaigns (PostgreSQL — sharded by advertiser_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: campaigns
  campaign_id      (PK) | varchar(20)
  advertiser_id    (FK) | varchar(20)
  name                   | varchar(200)
  objective              | enum(&amp;#39;impressions&amp;#39;,&amp;#39;clicks&amp;#39;,&amp;#39;conversions&amp;#39;)
  daily_budget           | int           -- cents
  total_budget           | int
  spent_total            | int           -- running total spend
  bid_strategy           | varchar(30)
  max_bid                | int
  targeting              | jsonb
  status                 | enum(&amp;#39;draft&amp;#39;,&amp;#39;pending_review&amp;#39;,&amp;#39;active&amp;#39;,&amp;#39;paused&amp;#39;,&amp;#39;completed&amp;#39;,&amp;#39;rejected&amp;#39;)
  start_date             | date
  end_date               | date
  created_at             | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ad-creatives-postgresql&#34;&gt;Ad Creatives (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: creatives
  creative_id      (PK) | varchar(20)
  campaign_id      (FK) | varchar(20)
  type                   | enum(&amp;#39;banner&amp;#39;,&amp;#39;text&amp;#39;,&amp;#39;video&amp;#39;,&amp;#39;native&amp;#39;)
  size                   | varchar(20)
  content                | jsonb         -- image_url, headline, description, etc.
  landing_url            | varchar(500)
  status                 | enum(&amp;#39;pending_review&amp;#39;,&amp;#39;approved&amp;#39;,&amp;#39;rejected&amp;#39;)
  quality_score          | float         -- ML-predicted relevance/quality (0-1)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ad-index-in-memory-rebuilt-every-few-minutes&#34;&gt;Ad Index (in-memory, rebuilt every few minutes)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Inverted index for fast candidate selection during auction
// Loaded into each ad server&amp;#39;s memory

keyword_index: Map&amp;lt;keyword, List&amp;lt;campaign_id&amp;gt;&amp;gt;
geo_index: Map&amp;lt;geo_code, List&amp;lt;campaign_id&amp;gt;&amp;gt;
interest_index: Map&amp;lt;interest, List&amp;lt;campaign_id&amp;gt;&amp;gt;
device_index: Map&amp;lt;device_type, List&amp;lt;campaign_id&amp;gt;&amp;gt;

// Each entry: campaign_id + max_bid + remaining_budget + quality_score
// Total size: ~10M campaigns × 100 bytes = 1 GB per ad server (fits in memory)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;event-log-kafka--clickhouse&#34;&gt;Event Log (Kafka + ClickHouse)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Kafka topics: ad_impressions, ad_clicks, ad_conversions
// Retained 7 days in Kafka

// ClickHouse (analytical queries, 2-year retention)
Table: events
  event_type       | Enum(&amp;#39;impression&amp;#39;,&amp;#39;click&amp;#39;,&amp;#39;conversion&amp;#39;)
  event_id         | String
  ad_id            | String
  campaign_id      | String
  advertiser_id    | String
  user_id          | String (hashed)
  auction_id       | String
  timestamp        | DateTime
  bid_price        | UInt32
  geo              | String
  device           | String
  page_url         | String
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;budget-tracking-redis&#34;&gt;Budget Tracking (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: budget:{campaign_id}:daily:{date}
Type: Hash
Fields:
  spent     | int (cents, incremented on each billable event)
  limit     | int (daily budget)
  pacing    | float (target spend rate per hour)

Key: budget:{campaign_id}:total
Type: String (remaining total budget in cents)
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
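The Redis structures above back a check-and-increment on the serving path. An in-memory sketch of that logic (the real system performs it atomically, e.g. via HINCRBY plus a compare, or a Lua script; the dict here is only a stand-in):

```python
# In-memory stand-in for the Redis daily-budget hash.
daily_budget = {"spent": 0, "limit": 500_000}  # cents, $5,000/day

def try_charge(amount_cents: int) -> bool:
    # Skip the ad rather than overspend the advertiser's budget.
    if daily_budget["spent"] + amount_cents > daily_budget["limit"]:
        return False
    daily_budget["spent"] += amount_cents
    return True

assert try_charge(499_900) is True
assert try_charge(200) is False          # would exceed the daily limit
assert daily_budget["spent"] == 499_900  # the failed charge left no trace
```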
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;ad-serving-pipeline--100ms-total&#34;&gt;Ad Serving Pipeline (&amp;lt; 100ms total)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User visits a web page
  → Publisher&amp;#39;s ad tag (JavaScript) fires ad request
    → CDN/Edge → Ad Server (closest PoP):

      Step 1: Parse Request (&amp;lt; 1ms)
        Extract: user_id, page context, device, geo, ad slot sizes

      Step 2: User Profile Lookup (&amp;lt; 5ms)
        Redis/Memcached: get user interests, demographics, behavior segments
        If cache miss → use contextual signals only (page content, keywords)

      Step 3: Candidate Selection (&amp;lt; 5ms)
        Query in-memory ad index:
          geo_index[US-CA] ∩ device_index[mobile] ∩ interest_index[gaming]
          → ~1000 candidate campaigns
        Filter: active status, within date range, has remaining budget, creative approved
          → ~200 eligible campaigns
        Filter: frequency capping (user hasn&amp;#39;t seen this ad &amp;gt; 3 times today)
          → ~150 final candidates

      Step 4: Bid Calculation + Click Prediction (&amp;lt; 20ms)
        For each candidate (batched ML inference):
          pCTR = click_prediction_model(user_features, ad_features, context_features)
          eCPM = bid × pCTR × 1000        // expected revenue per 1000 impressions
        Sort by eCPM descending

      Step 5: Auction (&amp;lt; 2ms)
        Run generalized second-price auction:
          Winner pays the minimum bid needed to beat the second-place ad
          price_per_click = second_place_eCPM / (winner_pCTR × 1000)
        Select top 3-5 ads for the available slots

      Step 6: Budget Check (&amp;lt; 2ms)
        Redis: INCRBY budget:{campaign_id}:daily:{date} {charge_amount}
        If new_total &amp;gt; daily_budget → DECRBY to roll back, then skip this ad and try the next candidate
        If total_budget exhausted → skip

      Step 7: Response (&amp;lt; 1ms)
        Return ad creatives, tracking URLs, metadata
        Total: ~35ms server-side, ~100ms with network

  → Browser renders ads, fires impression beacons
  → On click: redirect through click tracker → landing page
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;event-processing-pipeline&#34;&gt;Event Processing Pipeline&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Impression/Click Events:
  → Event Collector (edge servers, receive tracking pixels/redirects)
    → Kafka (durable event stream)
      → Stream Processor (Flink):
        1. Fraud Detection: filter invalid clicks (see Deep Dive 3)
        2. Deduplication: same impression/click ID within 5-min window
        3. Attribution: match clicks to impressions, conversions to clicks
        4. Budget Update: INCRBY in Redis for real-time budget tracking
        5. Write to ClickHouse for reporting
      → ClickHouse (analytical queries)
      → Billing Service (aggregate billable events → generate invoices)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Ad Servers (50+ edge PoPs, 100s of instances):&lt;/strong&gt; Handle ad requests. Run the full auction pipeline in-process. Stateless except for cached ad index and user profiles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ad Index Builder:&lt;/strong&gt; Reads campaigns from PostgreSQL, builds in-memory inverted indexes, pushes to ad servers every 2-5 minutes. Ensures ad servers have fresh campaign data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Click Prediction Service (ML):&lt;/strong&gt; Serves the pCTR model. Deployed on GPU instances. Batched inference for throughput. Model retrained daily on click/no-click data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Profile Service (Redis/Memcached):&lt;/strong&gt; Stores user interest segments, demographics, behavioral signals. Updated by a profile pipeline that processes user activity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Event Collectors:&lt;/strong&gt; Edge servers that receive impression beacons and click redirects. Ultra-low latency (&amp;lt; 10ms). Write to Kafka.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream Processor (Flink):&lt;/strong&gt; Real-time event processing: fraud detection, deduplication, attribution, budget updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Budget Pacing Service:&lt;/strong&gt; Calculates target spend rate per campaign per hour. Adjusts bid multipliers to pace spending evenly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fraud Detection Engine:&lt;/strong&gt; ML + rules-based click fraud detection. Filters invalid traffic before billing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reporting Service:&lt;/strong&gt; Reads from ClickHouse. Serves advertiser dashboards with near-real-time metrics (&amp;lt; 5 min delay).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Campaign Management Service:&lt;/strong&gt; CRUD for campaigns, creatives, targeting. Creative review (automated + manual).&lt;/li&gt;
&lt;/ol&gt;
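&lt;p&gt;The eCPM ranking and second-price charge from Steps 4-5 can be sketched as follows (a minimal Python sketch; the dict field names and hard-coded pCTR values are illustrative, and real pCTR comes from the click-prediction model):&lt;/p&gt;

```python
def run_auction(candidates):
    """Rank ads by eCPM, then price each winner with a generalized
    second-price (GSP) rule: pay just enough to beat the next ad."""
    for ad in candidates:
        # expected revenue per 1000 impressions
        ad["ecpm"] = ad["bid"] * ad["pctr"] * 1000
    ranked = sorted(candidates, key=lambda a: a["ecpm"], reverse=True)
    results = []
    for i, ad in enumerate(ranked):
        if i + 1 < len(ranked):
            # minimum per-click bid that still beats the next-ranked ad
            price = ranked[i + 1]["ecpm"] / (ad["pctr"] * 1000)
        else:
            price = ad.get("floor", 0.01)  # reserve price for the last slot
        results.append((ad["ad_id"], round(price, 4)))
    return results

ads = [
    {"ad_id": "a", "bid": 2.00, "pctr": 0.01},  # eCPM 20
    {"ad_id": "b", "bid": 1.00, "pctr": 0.03},  # eCPM 30
    {"ad_id": "c", "bid": 0.50, "pctr": 0.02},  # eCPM 10
]
print(run_auction(ads))  # [('b', 0.6667), ('a', 1.0), ('c', 0.01)]
```

&lt;p&gt;Note that the ad with the lower bid but higher pCTR ranks first, which is exactly the point of quality-weighted auctions.&lt;/p&gt;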
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-the-ad-auction-mechanism&#34;&gt;Deep Dive 1: The Ad Auction Mechanism&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why not a simple highest-bidder-wins auction?&lt;/strong&gt;
If we only ranked ads by bid price, advertisers with deep pockets would always win, regardless of ad quality. Users would see irrelevant, low-quality ads. Click rates would drop. Publishers would earn less long-term.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an API Rate Limiter</title>
      <link>https://chiraghasija.cc/designs/rate-limiter/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/rate-limiter/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Limit the number of API requests a client can make within a time window&lt;/li&gt;
&lt;li&gt;Support multiple rate limiting rules (per user, per IP, per API endpoint, per service)&lt;/li&gt;
&lt;li&gt;Return appropriate response when rate limit is exceeded (429 Too Many Requests)&lt;/li&gt;
&lt;li&gt;Include rate limit headers in every response (remaining quota, reset time)&lt;/li&gt;
&lt;li&gt;Support different limiting algorithms (configurable per rule)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% — if the rate limiter goes down, we either block all traffic (fail-closed) or allow all traffic (fail-open). Both are bad. It must be more available than the services it protects.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 1ms overhead per request — the rate limiter is in the hot path of every API call&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Approximate accuracy is acceptable. If the limit is 100 req/min, allowing 105 is fine. Allowing 1000 is not.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Must handle 10M+ requests/sec across all services&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed:&lt;/strong&gt; Must work correctly across multiple API gateway instances (not just per-node limiting)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M requests/sec across all API endpoints&lt;/li&gt;
&lt;li&gt;Each request → 1 rate limit check (read + conditional write to counter)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;10M rate limit operations/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active rate limit keys: assume 50M unique clients (user_id or IP)&lt;/li&gt;
&lt;li&gt;Each key: ~100 bytes (key + counter + window metadata)&lt;/li&gt;
&lt;li&gt;Total: 50M × 100 bytes = &lt;strong&gt;5GB&lt;/strong&gt; — fits entirely in memory (Redis)&lt;/li&gt;
&lt;/ul&gt;
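&lt;p&gt;As a minimal sketch of one candidate algorithm, a sliding-window-log limiter looks like this (in-process Python for illustration only; the production design keeps this state in Redis, e.g. a ZSET per client, so all gateway instances share one counter):&lt;/p&gt;

```python
import time

class SlidingWindowLimiter:
    """Sliding-window-log rate limiter. Stores a timestamp per request
    and counts only the timestamps inside the trailing window."""

    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window = window_sec
        self.hits = {}  # client_id -> list of request timestamps

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        # drop timestamps that have aged out of the window
        log = [t for t in self.hits.get(client_id, []) if t > now - self.window]
        if len(log) >= self.limit:
            self.hits[client_id] = log
            return False  # over limit: caller returns 429
        log.append(now)
        self.hits[client_id] = log
        return True

limiter = SlidingWindowLimiter(limit=3, window_sec=60)
print([limiter.allow("u1", now=100 + i) for i in range(5)])
# [True, True, True, False, False]
```

&lt;p&gt;The log variant is exact but costs memory per request; token bucket or sliding-window counters trade a little accuracy for O(1) state, which matches the &amp;ldquo;105 is fine, 1000 is not&amp;rdquo; consistency budget above.&lt;/p&gt;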
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;latency-critical, memory-bound&lt;/strong&gt; system. All state lives in Redis. The challenge is distributed correctness (accurate counting across multiple gateway instances) at extreme throughput with &amp;lt; 1ms latency.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an Authentication and Authorization System</title>
      <link>https://chiraghasija.cc/designs/auth-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/auth-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; Support email/password login, OAuth 2.0/OIDC (Google, GitHub, etc.), and multi-factor authentication (TOTP, SMS, WebAuthn/passkeys)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session Management:&lt;/strong&gt; Issue, validate, refresh, and revoke sessions. Support &amp;ldquo;remember me&amp;rdquo; (long-lived) and &amp;ldquo;sign out everywhere&amp;rdquo; (revoke all sessions).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authorization:&lt;/strong&gt; Role-Based Access Control (RBAC) with hierarchical roles and Attribute-Based Access Control (ABAC) for fine-grained policies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single Sign-On (SSO):&lt;/strong&gt; Act as an identity provider (IdP) — users authenticate once and access multiple applications without re-authenticating&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Account security:&lt;/strong&gt; Rate limit login attempts, detect credential stuffing, support password reset flows, and enforce password policies&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% — if auth is down, no user can log into any application. This is the single most critical shared service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Login &amp;lt; 500ms (includes password hashing). Token validation &amp;lt; 5ms (must be in the hot path of every API call). Authorization check &amp;lt; 10ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security:&lt;/strong&gt; Passwords hashed with Argon2id (memory-hard). Tokens encrypted in transit (TLS 1.3) and at rest. No plaintext secrets in logs. PII encrypted at rest.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M registered users. 50M daily logins. 10B token validations/day (every API call). 1B authorization checks/day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; GDPR (right to delete, data portability), SOC 2, support for data residency requirements.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;authentication-traffic&#34;&gt;Authentication Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50M logins/day = &lt;strong&gt;580 logins/sec average&lt;/strong&gt;, 5K/sec peak (morning login surge)&lt;/li&gt;
&lt;li&gt;Each login: password hash verification (Argon2id: ~250ms CPU per attempt) + session creation&lt;/li&gt;
&lt;li&gt;CPU: 580 logins/sec × 250ms = &lt;strong&gt;145 CPU-seconds/sec&lt;/strong&gt; → need ~150 CPU cores just for password hashing&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;token-validation&#34;&gt;Token Validation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10B validations/day = &lt;strong&gt;115K validations/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;JWT validation: verify signature + check expiry + decode claims = &lt;strong&gt;&amp;lt; 0.1ms&lt;/strong&gt; (no network call)&lt;/li&gt;
&lt;li&gt;With opaque tokens: Redis lookup = &lt;strong&gt;~1ms&lt;/strong&gt; per validation&lt;/li&gt;
&lt;/ul&gt;
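&lt;p&gt;The &amp;ldquo;no network call&amp;rdquo; property of JWT validation can be shown with a stdlib-only HS256 sketch (a toy, not a production JWT library; real deployments would use an asymmetric algorithm with rotating keys and a vetted library):&lt;/p&gt;

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

def validate_jwt(token: str, secret: bytes, now=None):
    """Local validation: recompute the HMAC, compare, check expiry.
    No network round trip, which is what keeps the 115K/sec path cheap."""
    header, payload, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        return None  # signature mismatch: tampered token or wrong key
    pad = "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload + pad))
    now = time.time() if now is None else now
    if claims.get("exp", 0) <= now:
        return None  # expired
    return claims

secret = b"demo-secret"
tok = sign_jwt({"sub": "user42", "exp": 2_000_000_000}, secret)
print(validate_jwt(tok, secret)["sub"])  # user42
print(validate_jwt(tok, b"wrong-key"))   # None
```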
&lt;h3 id=&#34;authorization&#34;&gt;Authorization&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B checks/day = &lt;strong&gt;11.5K checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each check: look up user&amp;rsquo;s roles/permissions, evaluate policy rules&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;User accounts: 500M × 2KB = &lt;strong&gt;1TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Sessions: 200M active sessions × 500 bytes = &lt;strong&gt;100GB&lt;/strong&gt; (fits in Redis)&lt;/li&gt;
&lt;li&gt;Roles/permissions: 1000 roles × 100 permissions = 100K entries — trivial&lt;/li&gt;
&lt;li&gt;Audit logs: 50M logins/day × 500 bytes = 25GB/day, &lt;strong&gt;9TB/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;Token validation is the hottest path (115K/sec). It MUST be local (no network call) → JWT with local signature verification. Password hashing is CPU-intensive → dedicated worker pool with rate limiting. Authorization checks need low latency → cache policies locally, evaluate in-process.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an ETA Estimation System</title>
      <link>https://chiraghasija.cc/designs/eta-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/eta-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Given an origin and destination, compute the estimated time of arrival (ETA) accounting for current traffic conditions&lt;/li&gt;
&lt;li&gt;Support multiple routing options (fastest, shortest, avoid tolls, avoid highways) and return ETAs for each&lt;/li&gt;
&lt;li&gt;Incorporate real-time traffic data (GPS probe data from active drivers, incident reports, road closures) to adjust ETAs continuously&lt;/li&gt;
&lt;li&gt;Provide ETA updates during an active trip (re-estimate as conditions change or the driver deviates from the route)&lt;/li&gt;
&lt;li&gt;Return confidence intervals with ETAs (e.g., &amp;ldquo;18-24 minutes&amp;rdquo; not just &amp;ldquo;21 minutes&amp;rdquo;) to communicate uncertainty&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — ETA is shown on every ride request, every search, every navigation step. It&amp;rsquo;s the most-queried service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; ETA computation &amp;lt; 200ms for a single origin-destination pair. Batch ETA (nearby drivers to rider) &amp;lt; 500ms for 50 candidates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accuracy:&lt;/strong&gt; Mean Absolute Error (MAE) &amp;lt; 15% of actual trip time. P90 error &amp;lt; 25%. Accuracy matters more than precision.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500K ETA requests/sec at peak (every map interaction, every ride match, every navigation step). Road network: 50M road segments globally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Traffic data reflected in ETAs within 2 minutes of observation. Stale traffic = wrong ETAs = bad routing.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K ETA requests/sec at peak&lt;/li&gt;
&lt;li&gt;Each ETA request requires a graph search over a road network subgraph&lt;/li&gt;
&lt;li&gt;Average route: ~200 road segments explored (varies by distance)&lt;/li&gt;
&lt;li&gt;Total road segments processed: 500K × 200 = &lt;strong&gt;100M segment lookups/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;road-network-size&#34;&gt;Road Network Size&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Global road network: ~50M road segments, ~40M intersections&lt;/li&gt;
&lt;li&gt;Graph representation: each segment = (from_node, to_node, distance, base_travel_time, current_speed)&lt;/li&gt;
&lt;li&gt;Storage: 50M segments × 100 bytes = &lt;strong&gt;5 GB&lt;/strong&gt; — fits in memory on a single server&lt;/li&gt;
&lt;li&gt;With precomputed contraction hierarchy: add ~10 GB for shortcut edges → &lt;strong&gt;15 GB total&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;traffic-data&#34;&gt;Traffic Data&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GPS probes from active drivers: 5M drivers × 1 update/4 seconds = &lt;strong&gt;1.25M probe points/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each probe: (lat, lng, speed, heading, timestamp) = ~40 bytes&lt;/li&gt;
&lt;li&gt;Probe ingestion bandwidth: 1.25M × 40B = &lt;strong&gt;50 MB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Map-matched probes per segment per minute: varies (highways: hundreds, residential: 0-5)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;ml-model-inference&#34;&gt;ML Model Inference&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;If using ML for ETA prediction (not just graph routing):
&lt;ul&gt;
&lt;li&gt;Feature extraction: ~50 features per route (distance, segment speeds, time of day, weather, etc.)&lt;/li&gt;
&lt;li&gt;Model inference: ~5ms per prediction (on GPU) or ~20ms (on CPU)&lt;/li&gt;
&lt;li&gt;At 500K requests/sec × 5ms per inference: need 2500 GPU-seconds/sec → &lt;strong&gt;~250 GPU instances&lt;/strong&gt; (assuming ~10 inferences batched concurrently per GPU)&lt;/li&gt;
&lt;li&gt;Or: precompute and cache ETAs for common routes; ML inference only for cache misses&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
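&lt;p&gt;The core idea, a shortest path over segments weighted by current rather than free-flow travel time, can be sketched with plain Dijkstra (toy Python; the node names and speeds are made up, and production would layer Contraction Hierarchies on top to meet the 200ms SLA):&lt;/p&gt;

```python
import heapq

def eta_seconds(graph, origin, dest):
    """Dijkstra where each edge weight is length / current_speed.
    Refreshing current_speed from probe data updates ETAs without
    rebuilding the graph."""
    dist = {origin: 0.0}
    heap = [(0.0, origin)]
    while heap:
        t, node = heapq.heappop(heap)
        if node == dest:
            return t
        if t > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, length_m, speed_mps in graph.get(node, []):
            nt = t + length_m / speed_mps  # live speed, not free-flow
            if nt < dist.get(nxt, float("inf")):
                dist[nxt] = nt
                heapq.heappush(heap, (nt, nxt))
    return None  # unreachable

# toy network: node -> [(to_node, length_m, current_speed_m_per_s)]
graph = {
    "A": [("B", 1000, 10), ("C", 500, 5)],
    "B": [("D", 1000, 10)],
    "C": [("D", 500, 2)],  # congested segment
}
print(eta_seconds(graph, "A", "D"))  # 200.0, via A-B-D
```

&lt;p&gt;With free-flow speeds the A-C-D path might win; the live weight on the congested C-D segment is what steers the route (and the ETA) around traffic.&lt;/p&gt;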
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;graph search problem enhanced by real-time data and ML.&lt;/strong&gt; Pure Dijkstra on 50M segments is too slow for 200ms SLA. We need hierarchical precomputation (Contraction Hierarchies or similar) to reduce search space by 100-1000×, combined with real-time traffic edge weights to keep ETAs accurate.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an External Sorting System</title>
      <link>https://chiraghasija.cc/designs/external-sort/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/external-sort/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Sort a dataset that is significantly larger than available RAM (e.g., 10TB of data with 64GB RAM)&lt;/li&gt;
&lt;li&gt;Support configurable sort keys (single column, composite keys, ascending/descending)&lt;/li&gt;
&lt;li&gt;Produce a single sorted output file (or set of sorted partitions for distributed consumers)&lt;/li&gt;
&lt;li&gt;Support pluggable input/output formats (CSV, Parquet, binary records)&lt;/li&gt;
&lt;li&gt;Provide progress reporting and the ability to resume after failures (checkpointing)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; Batch system — not always-on, but must complete within an SLA (e.g., 10TB sorted within 4 hours)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Optimize for total wall-clock time, not per-record latency. I/O throughput is the bottleneck.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Output must be perfectly sorted. No approximate or lossy sorting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Handle datasets from 1GB to 100TB. Single-machine for up to ~5TB, distributed (MapReduce-style) beyond that.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Intermediate state (sorted chunks) persisted to disk. If the process crashes, restart from the last completed chunk, not from scratch.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;single-machine-scenario&#34;&gt;Single Machine Scenario&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Dataset: 1TB, RAM: 64GB (usable for sort: ~50GB after OS and buffers)&lt;/li&gt;
&lt;li&gt;Record size: 100 bytes average, sort key: 20 bytes&lt;/li&gt;
&lt;li&gt;Total records: 1TB / 100B = &lt;strong&gt;10 billion records&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Number of sorted chunks: 1TB / 50GB = &lt;strong&gt;20 chunks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each chunk: 50GB, ~500M records, sorted in-memory using quicksort/timsort&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;io-analysis&#34;&gt;I/O Analysis&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Disk sequential read/write: 500 MB/s (NVMe SSD) or 200 MB/s (HDD)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 (chunk creation):&lt;/strong&gt; Read 1TB + write 1TB of sorted chunks = 2TB I/O
&lt;ul&gt;
&lt;li&gt;SSD: 2TB / 500 MB/s = ~67 minutes&lt;/li&gt;
&lt;li&gt;HDD: 2TB / 200 MB/s = ~167 minutes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 (K-way merge):&lt;/strong&gt; Read 1TB of sorted chunks + write 1TB final output = 2TB I/O
&lt;ul&gt;
&lt;li&gt;Same time as Phase 1&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total:&lt;/strong&gt; ~2.2 hours SSD, ~5.5 hours HDD (I/O dominated)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;distributed-scenario-mapreduce&#34;&gt;Distributed Scenario (MapReduce)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Dataset: 100TB across HDFS&lt;/li&gt;
&lt;li&gt;1000 mappers, each sorts 100GB locally&lt;/li&gt;
&lt;li&gt;Shuffle phase: each reducer pulls its key range from all mappers&lt;/li&gt;
&lt;li&gt;100 reducers, each merge-sorts 1TB from 1000 sorted inputs&lt;/li&gt;
&lt;li&gt;Network: each reducer pulls 1TB; at an effective ~1Gbps of sustained pull throughput per reducer (a 10Gbps NIC shared under shuffle contention) that is ~8,000 sec → ~2.2 hour shuffle with pipelining&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;External sort is &lt;strong&gt;I/O bound&lt;/strong&gt;, not CPU bound. Every optimization must target reducing I/O passes, maximizing sequential I/O, and minimizing random access. The CPU cost of comparisons is negligible compared to disk read/write time.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an IoC Container (Dependency Injection Framework)</title>
      <link>https://chiraghasija.cc/designs/ioc-container/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/ioc-container/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Register services with their implementations (interface → concrete class mapping) using explicit registration or auto-discovery (annotations/attributes)&lt;/li&gt;
&lt;li&gt;Support three lifecycle scopes: &lt;strong&gt;Singleton&lt;/strong&gt; (one instance per container), &lt;strong&gt;Transient&lt;/strong&gt; (new instance per request), and &lt;strong&gt;Scoped&lt;/strong&gt; (one instance per scope, e.g., per HTTP request)&lt;/li&gt;
&lt;li&gt;Automatically resolve dependency graphs — if Service A depends on B and C, and B depends on D, resolve the full chain recursively&lt;/li&gt;
&lt;li&gt;Detect circular dependencies at registration time (not at resolve time) and provide clear error messages showing the cycle&lt;/li&gt;
&lt;li&gt;Support named/keyed registrations, factory functions, lazy initialization, and optional dependencies&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Performance:&lt;/strong&gt; Singleton resolution &amp;lt; 100ns (cached lookup). Transient resolution &amp;lt; 1μs for a 5-deep dependency chain. The container is called millions of times — it must be near-zero overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory:&lt;/strong&gt; Container metadata (registrations, dependency graph) &amp;lt; 10MB for 10K registered services. No memory leaks from scoped instances.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thread Safety:&lt;/strong&gt; All resolve operations must be thread-safe. Singleton creation must be exactly-once (no double initialization).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developer Experience:&lt;/strong&gt; Clear error messages. Fail-fast on misconfiguration. Support for debugging tools (dependency graph visualization, unused registration detection).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensibility:&lt;/strong&gt; Plugin architecture for interceptors (AOP), decorators, and custom lifetime managers.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;scale-typical-large-application&#34;&gt;Scale (typical large application)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Registered services:&lt;/strong&gt; 500-5,000 in a large enterprise app&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependency depth:&lt;/strong&gt; Average 3-5 levels, max 10-15 levels&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution frequency:&lt;/strong&gt; Web app handling 10K requests/sec, each request resolves 20-50 services = &lt;strong&gt;200K-500K resolutions/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scoped instances:&lt;/strong&gt; Per HTTP request, ~30 scoped instances created and disposed&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;memory&#34;&gt;Memory&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Registration metadata: 5,000 services × 500 bytes (type info, lifetime, dependencies) = &lt;strong&gt;2.5MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Singleton cache: 500 singletons × 1KB average = &lt;strong&gt;500KB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Scoped instances per request: 30 × 1KB = 30KB (short-lived, GC&amp;rsquo;d after request)&lt;/li&gt;
&lt;li&gt;Compiled resolution plans (pre-computed): 5,000 × 200 bytes = &lt;strong&gt;1MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Total container overhead: &lt;strong&gt;~5MB&lt;/strong&gt; — negligible&lt;/li&gt;
&lt;/ul&gt;
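&lt;p&gt;A minimal resolve-time sketch in Python (hypothetical API; real containers precompile resolution plans, and the design above moves the cycle check to registration time, but the DFS is the same):&lt;/p&gt;

```python
class Container:
    """Toy constructor-injection container: explicit registration,
    singleton vs transient lifetimes, cycle detection during the
    recursive resolve walk."""

    def __init__(self):
        self._regs = {}        # key -> (factory, deps, singleton?)
        self._singletons = {}  # singleton instance cache

    def register(self, key, factory, deps=(), singleton=False):
        self._regs[key] = (factory, deps, singleton)

    def resolve(self, key, _stack=()):
        if key in _stack:
            # the stack is the cycle path, so the error can show it
            raise ValueError("circular dependency: " + " -> ".join(_stack + (key,)))
        if key in self._singletons:
            return self._singletons[key]  # fast path: cached lookup
        factory, deps, singleton = self._regs[key]
        args = [self.resolve(d, _stack + (key,)) for d in deps]
        instance = factory(*args)
        if singleton:
            self._singletons[key] = instance
        return instance

c = Container()
c.register("config", lambda: {"dsn": "sqlite://"}, singleton=True)
c.register("db", lambda cfg: f"DB({cfg['dsn']})", deps=("config",))
c.register("repo", lambda db: f"Repo({db})", deps=("db",))
print(c.resolve("repo"))  # Repo(DB(sqlite://))
```

&lt;p&gt;Replacing the recursive walk with a pre-built plan per key is the optimization the Key Insight below describes: the graph traversal happens once, not per resolve.&lt;/p&gt;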
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is fundamentally a &lt;strong&gt;graph problem&lt;/strong&gt; (dependency graph resolution) combined with a &lt;strong&gt;caching problem&lt;/strong&gt; (singleton/scoped instance reuse). The critical optimization is pre-computing resolution plans at registration time so that resolve-time is just executing a pre-built plan — no graph traversal needed in the hot path.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an On-Call Rotation and Alerting System (PagerDuty)</title>
      <link>https://chiraghasija.cc/designs/on-call-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/on-call-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Define on-call rotation schedules (weekly, daily, custom) with multiple layers (primary, secondary, manager) and automatic handoffs&lt;/li&gt;
&lt;li&gt;Route incoming alerts through escalation policies: try primary on-call → wait N minutes → escalate to secondary → wait → escalate to manager&lt;/li&gt;
&lt;li&gt;Deliver alerts via multiple channels (push notification, SMS, phone call) with configurable per-user preferences and retry logic&lt;/li&gt;
&lt;li&gt;Track acknowledgment, resolution, and incident lifecycle (triggered → acknowledged → resolved) with timestamps and responder actions&lt;/li&gt;
&lt;li&gt;Support schedule overrides (swap shifts, temporary coverage) and fatigue prevention (quiet hours, max alerts/hour limits)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% — this system pages humans during outages. If PagerDuty is down when production is down, no one gets woken up. It must be the most reliable system in the organization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Alert delivery within 30 seconds of trigger. Phone call initiation within 60 seconds. Escalation timing accurate to within 5 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; An alert must be delivered to exactly one on-call person at a time (no missed alerts, no duplicate pages for the same incident at the same escalation level). Acknowledgment must be strongly consistent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10,000 teams, 100,000 users, 500K alerts/day, 50K concurrent active incidents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Complete audit trail of every alert, delivery attempt, acknowledgment, and escalation. Zero tolerance for lost alerts.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Incoming alerts:&lt;/strong&gt; 500K/day = ~6 alerts/sec (peak 10x during widespread outage = 60/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delivery attempts per alert:&lt;/strong&gt; avg 2.5 (primary gets push + SMS, sometimes escalates to secondary)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total deliveries:&lt;/strong&gt; 500K × 2.5 = &lt;strong&gt;1.25M delivery attempts/day&lt;/strong&gt; = ~15/sec&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phone calls:&lt;/strong&gt; ~10% of alerts escalate to phone call = 50K calls/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schedule lookups:&lt;/strong&gt; Every alert → resolve current on-call → 500K lookups/day (cached)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API requests (schedule management, incident updates):&lt;/strong&gt; ~200K/day&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Incidents:&lt;/strong&gt; 500K/day × 5KB (full incident record with timeline) = 2.5GB/day = ~900GB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schedules:&lt;/strong&gt; 10,000 teams × 10KB = 100MB (tiny, mostly static)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit log:&lt;/strong&gt; 1.25M delivery events/day × 500 bytes = 625MB/day = ~225GB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User preferences:&lt;/strong&gt; 100K users × 1KB = 100MB&lt;/li&gt;
&lt;/ul&gt;
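&lt;p&gt;The escalation flow from requirement 2 can be sketched as a pure function that expands a policy into concrete page times (illustrative Python; the real system persists these as durable timers that fire unless the incident is acknowledged first):&lt;/p&gt;

```python
def escalation_plan(policy, trigger_ts):
    """Expand an escalation policy into (fire_at_seconds, target)
    pairs, starting at the trigger time. Each level waits wait_min
    minutes before handing off to the next."""
    plan, t = [], trigger_ts
    for level in policy:
        plan.append((t, level["target"]))
        t += level["wait_min"] * 60
    return plan

policy = [
    {"target": "primary",   "wait_min": 5},
    {"target": "secondary", "wait_min": 10},
    {"target": "manager",   "wait_min": 0},
]
print(escalation_plan(policy, trigger_ts=0))
# [(0, 'primary'), (300, 'secondary'), (900, 'manager')]
```

&lt;p&gt;The hard part is not this arithmetic but making the timers survive crashes and cancel exactly once on acknowledgment, which is why the plan must be stored durably rather than held in process memory.&lt;/p&gt;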
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;reliability-critical orchestration system&lt;/strong&gt;. The data volume is modest, but the reliability requirements are extreme. The system must work when everything else is broken (datacenter fires, DNS outages, cloud region failures). The core challenge is guaranteeing alert delivery within tight time windows with multiple fallback channels.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an Online Marketplace (Etsy/Amazon Marketplace)</title>
      <link>https://chiraghasija.cc/designs/marketplace/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/marketplace/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Sellers can onboard, list products with images/descriptions/pricing, and manage inventory&lt;/li&gt;
&lt;li&gt;Buyers can search/browse products, add to cart, and place orders with payment&lt;/li&gt;
&lt;li&gt;Platform handles payment splitting — holds funds in escrow, pays seller after delivery confirmation&lt;/li&gt;
&lt;li&gt;Buyers and sellers can leave reviews/ratings on completed transactions&lt;/li&gt;
&lt;li&gt;Platform detects and prevents fraudulent listings, fake reviews, and payment fraud&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — downtime directly means lost revenue for both platform and sellers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Search results &amp;lt; 200ms, product page load &amp;lt; 100ms, checkout &amp;lt; 500ms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Inventory must be strongly consistent (no overselling). Order/payment state must be exactly-once. Reviews can be eventually consistent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 50M products, 10M daily active buyers, 500K active sellers, 2M orders/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero tolerance for lost orders or payment records. All financial data replicated across regions.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Browse/Search:&lt;/strong&gt; 10M DAU × 20 searches/day = 200M searches/day = ~2,300 QPS (peak 3x = 7,000 QPS)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product page views:&lt;/strong&gt; 10M DAU × 30 views/day = 300M/day = ~3,500 QPS (peak 10,000 QPS)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 2M orders/day = ~23 orders/sec (peak 5x during flash sales = 115/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Listings:&lt;/strong&gt; 500K sellers × 2 updates/day = 1M write ops/day = ~12 QPS&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Product catalog:&lt;/strong&gt; 50M products × 5KB metadata = 250GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product images:&lt;/strong&gt; 50M products × 5 images × 500KB = 125TB (object storage)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 2M/day × 2KB × 365 days × 3 years = ~4TB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reviews:&lt;/strong&gt; 50M reviews × 1KB = 50GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User profiles:&lt;/strong&gt; (10M buyers + 500K sellers) × 2KB = ~21GB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;read-heavy system&lt;/strong&gt; (100:1 read-to-write ratio for catalog browsing). The hard problems are search relevance at scale, inventory consistency during concurrent purchases, and payment orchestration with escrow.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Dropbox / Google Drive</title>
      <link>https://chiraghasija.cc/designs/dropbox/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/dropbox/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload, download, and delete files from any device&lt;/li&gt;
&lt;li&gt;Files automatically sync across all connected devices&lt;/li&gt;
&lt;li&gt;Users can share files/folders with other users (view/edit permissions)&lt;/li&gt;
&lt;li&gt;File versioning — users can view and restore previous versions&lt;/li&gt;
&lt;li&gt;Offline support — changes made offline sync when connectivity resumes&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — users depend on this for critical documents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines) — losing a user&amp;rsquo;s file is unacceptable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Small file sync &amp;lt; 5 seconds end-to-end between devices. Large file upload should show progress and be resumable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for file metadata (rename, move, delete must be immediately reflected). Eventual consistency acceptable for sync propagation to other devices (within seconds).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M users, 100M DAU, average user stores 10GB, peak uploads 10M files/hour&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bandwidth efficiency:&lt;/strong&gt; Only transfer changed parts of files (delta sync)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M users × 10GB avg = &lt;strong&gt;5 exabytes (5,000PB)&lt;/strong&gt; total storage&lt;/li&gt;
&lt;li&gt;This is the defining constraint — everything revolves around efficient storage&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M file uploads/hour ÷ 3600 = &lt;strong&gt;~2,800 uploads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Average file: 500KB → &lt;strong&gt;1.4GB/sec&lt;/strong&gt; upload bandwidth&lt;/li&gt;
&lt;li&gt;Sync events (metadata changes): 10x file uploads = &lt;strong&gt;28,000 events/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;metadata&#34;&gt;Metadata&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average user: 2,000 files → 500M × 2,000 = &lt;strong&gt;1 trillion&lt;/strong&gt; file metadata records&lt;/li&gt;
&lt;li&gt;Each record: ~500 bytes → &lt;strong&gt;500TB&lt;/strong&gt; metadata&lt;/li&gt;
&lt;/ul&gt;
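The delta-sync requirement above reduces to content hashing: split each file into chunks, hash them, and upload only the chunks whose hash changed. A minimal fixed-size-chunk sketch (chunk size is illustrative; production systems often use content-defined chunking so an insert near the start does not shift every later chunk boundary):

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int = 4 * 1024 * 1024):
    """SHA-256 per fixed-size chunk (4MB default is illustrative)."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def changed_chunks(old: bytes, new: bytes, chunk_size: int = 4 * 1024 * 1024):
    """Indices of chunks in `new` that must be uploaded."""
    old_h = chunk_hashes(old, chunk_size)
    new_h = chunk_hashes(new, chunk_size)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or old_h[i] != h]

old = b"aaaa" + b"bbbb" + b"cccc"
new = b"aaaa" + b"BBBB" + b"cccc"
print(changed_chunks(old, new, chunk_size=4))  # [1] -- only the middle chunk re-uploads
```

Chunk hashes also enable cross-user deduplication: if another user already uploaded an identical chunk, the server can skip the transfer entirely.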
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// File operations
POST /api/v1/files/upload
  Headers: Content-Range: bytes 0-1048575/5242880  // chunked upload
  Body: &amp;lt;binary chunk&amp;gt;
  Response 200: { &amp;#34;upload_id&amp;#34;: &amp;#34;up_123&amp;#34;, &amp;#34;next_offset&amp;#34;: 1048576 }

POST /api/v1/files/upload/complete
  Body: { &amp;#34;upload_id&amp;#34;: &amp;#34;up_123&amp;#34;, &amp;#34;filename&amp;#34;: &amp;#34;doc.pdf&amp;#34;, &amp;#34;parent_id&amp;#34;: &amp;#34;folder_abc&amp;#34; }
  Response 201: { &amp;#34;file_id&amp;#34;: &amp;#34;f_xyz&amp;#34;, &amp;#34;version&amp;#34;: 1 }

GET /api/v1/files/{file_id}/download
  Response 200: redirect to pre-signed S3 URL

GET /api/v1/files/{file_id}/versions
  Response 200: [{ &amp;#34;version&amp;#34;: 3, &amp;#34;size&amp;#34;: 524288, &amp;#34;modified_at&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;modified_by&amp;#34;: &amp;#34;...&amp;#34; }]

POST /api/v1/files/{file_id}/restore?version=2

// Sync
GET /api/v1/sync/changes?cursor={cursor}
  Response 200: {
    &amp;#34;changes&amp;#34;: [
      { &amp;#34;type&amp;#34;: &amp;#34;create&amp;#34;, &amp;#34;file_id&amp;#34;: &amp;#34;f_xyz&amp;#34;, &amp;#34;path&amp;#34;: &amp;#34;/docs/notes.md&amp;#34;, ... },
      { &amp;#34;type&amp;#34;: &amp;#34;modify&amp;#34;, &amp;#34;file_id&amp;#34;: &amp;#34;f_abc&amp;#34;, &amp;#34;version&amp;#34;: 3, ... },
      { &amp;#34;type&amp;#34;: &amp;#34;delete&amp;#34;, &amp;#34;file_id&amp;#34;: &amp;#34;f_def&amp;#34;, ... }
    ],
    &amp;#34;cursor&amp;#34;: &amp;#34;c_next_123&amp;#34;,
    &amp;#34;has_more&amp;#34;: false
  }

// Sharing
POST /api/v1/files/{file_id}/share
  Body: { &amp;#34;user_email&amp;#34;: &amp;#34;bob@example.com&amp;#34;, &amp;#34;permission&amp;#34;: &amp;#34;edit&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Facebook Messenger / WhatsApp</title>
      <link>https://chiraghasija.cc/designs/messenger/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/messenger/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;1-on-1 messaging between users (text messages)&lt;/li&gt;
&lt;li&gt;Group messaging (up to 256 members)&lt;/li&gt;
&lt;li&gt;Message delivery status: sent, delivered, read&lt;/li&gt;
&lt;li&gt;Online/offline presence indicators&lt;/li&gt;
&lt;li&gt;Message history — persistent, accessible from any device&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — messaging is real-time communication; downtime is immediately felt&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Message delivery &amp;lt; 200ms end-to-end for online recipients&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Messages must be ordered correctly within a conversation. No message loss. Effectively exactly-once delivery: at-least-once transport plus server-side deduplication, so users never see duplicate messages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 2B registered users, 500M DAU, 100B messages/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Messages are persistent — stored forever (or until user deletes)&lt;/li&gt;
&lt;/ul&gt;
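The exactly-once requirement is achieved in practice with at-least-once delivery plus deduplication on a client-generated message id: the client retries until acknowledged, and the server recognizes retries. A minimal sketch with illustrative names:

```python
import itertools

_seq = itertools.count(1)
_seen = {}  # client_message_id -> server-assigned message_id

def send_message(client_message_id: str, content: str) -> str:
    """Idempotent send: a retry with the same client id returns the
    original server id instead of storing a duplicate."""
    if client_message_id in _seen:
        return _seen[client_message_id]
    server_id = f"m_{next(_seq)}"
    # ...persist (server_id, content) to the message store here...
    _seen[client_message_id] = server_id
    return server_id

first = send_message("cm_uuid_123", "Hello!")
retry = send_message("cm_uuid_123", "Hello!")  # network retry after timeout
print(first == retry)  # True -- no duplicate message stored
```

In a real deployment the dedup map lives in the message store itself (a unique index on the client id), not in process memory.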
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100B messages/day ÷ 100K sec/day = &lt;strong&gt;1M messages/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 3x → &lt;strong&gt;3M messages/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;This is extremely high write throughput&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average message: 100 bytes (text + metadata)&lt;/li&gt;
&lt;li&gt;100B messages/day × 100 bytes = &lt;strong&gt;10TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Per year: &lt;strong&gt;~3.6PB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With media (images, voice notes): 10x → &lt;strong&gt;~36PB/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;connections&#34;&gt;Connections&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M DAU with persistent WebSocket connections&lt;/li&gt;
&lt;li&gt;Average user online 4 hours/day → at any time ~83M concurrent connections&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;~150M concurrent WebSocket connections&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-for-non-realtime-operations&#34;&gt;REST (for non-realtime operations)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/messages/send
  Body: {
    &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;,
    &amp;#34;content&amp;#34;: &amp;#34;Hello!&amp;#34;,
    &amp;#34;type&amp;#34;: &amp;#34;text&amp;#34;,
    &amp;#34;client_message_id&amp;#34;: &amp;#34;cm_uuid_123&amp;#34;  // idempotency key
  }
  Response 201: {
    &amp;#34;message_id&amp;#34;: &amp;#34;m_server_456&amp;#34;,
    &amp;#34;timestamp&amp;#34;: &amp;#34;2026-02-22T18:00:00.123Z&amp;#34;
  }

GET /api/v1/conversations
  Response 200: [
    {
      &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;,
      &amp;#34;type&amp;#34;: &amp;#34;1on1&amp;#34;,
      &amp;#34;participants&amp;#34;: [&amp;#34;u_1&amp;#34;, &amp;#34;u_2&amp;#34;],
      &amp;#34;last_message&amp;#34;: { &amp;#34;content&amp;#34;: &amp;#34;Hello!&amp;#34;, &amp;#34;timestamp&amp;#34;: &amp;#34;...&amp;#34; },
      &amp;#34;unread_count&amp;#34;: 3
    }
  ]

GET /api/v1/conversations/{conv_id}/messages?cursor={cursor}&amp;amp;limit=50
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;websocket-for-real-time&#34;&gt;WebSocket (for real-time)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Client → Server
{ &amp;#34;type&amp;#34;: &amp;#34;message&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hello!&amp;#34;, &amp;#34;client_id&amp;#34;: &amp;#34;cm_uuid_123&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;typing&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;ack&amp;#34;, &amp;#34;message_id&amp;#34;: &amp;#34;m_456&amp;#34; }        // delivery receipt
{ &amp;#34;type&amp;#34;: &amp;#34;read&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;up_to&amp;#34;: &amp;#34;m_456&amp;#34; }  // read receipt

// Server → Client
{ &amp;#34;type&amp;#34;: &amp;#34;message&amp;#34;, &amp;#34;message_id&amp;#34;: &amp;#34;m_456&amp;#34;, &amp;#34;from&amp;#34;: &amp;#34;u_2&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hi!&amp;#34;, &amp;#34;timestamp&amp;#34;: &amp;#34;...&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;delivered&amp;#34;, &amp;#34;message_id&amp;#34;: &amp;#34;m_123&amp;#34; }   // your message was delivered
{ &amp;#34;type&amp;#34;: &amp;#34;read&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;by&amp;#34;: &amp;#34;u_2&amp;#34;, &amp;#34;up_to&amp;#34;: &amp;#34;m_456&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;typing&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;user&amp;#34;: &amp;#34;u_2&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;presence&amp;#34;, &amp;#34;user&amp;#34;: &amp;#34;u_2&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;online&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Facebook&#39;s Like/Reaction System</title>
      <link>https://chiraghasija.cc/designs/facebook-likes/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/facebook-likes/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can like/react to a post (like, love, haha, wow, sad, angry) — one reaction per user per post, toggleable (tap again to remove)&lt;/li&gt;
&lt;li&gt;Display the total reaction count on every post and a breakdown by reaction type&lt;/li&gt;
&lt;li&gt;Show &amp;ldquo;who reacted&amp;rdquo; — a paginated list of users who reacted to a post (with their reaction type)&lt;/li&gt;
&lt;li&gt;Notify the post author when their post receives reactions (batched, not per-reaction)&lt;/li&gt;
&lt;li&gt;Support reactions on posts, comments, messages, and other content types (polymorphic)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the like button is on every post in the News Feed. Downtime affects all users immediately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Like action &amp;lt; 100ms (write). Displaying count &amp;lt; 50ms (read). Counts can be slightly stale (eventual consistency acceptable for counts).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; 500K likes/sec sustained (2B DAU, average 20 likes/day = 40B likes/day). Peak: 2M likes/sec during viral events.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Toggle semantics must be strongly consistent per user — if I tap like twice, the second tap must undo the first. Counts can be eventually consistent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500B total reactions stored. 10B+ posts with at least one reaction.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-traffic&#34;&gt;Write Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2B DAU, average 20 reactions/day = 40B reactions/day&lt;/li&gt;
&lt;li&gt;Average: &lt;strong&gt;460K writes/sec&lt;/strong&gt;, peak: &lt;strong&gt;2M writes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each write: check if user already reacted → upsert/delete reaction → update counter&lt;/li&gt;
&lt;/ul&gt;
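The write path above (check existing reaction, then upsert/delete, then adjust the counter) can be sketched in memory; in production the reaction map is a sharded store and the counters live in a cache:

```python
reactions = {}  # (user_id, post_id) -> reaction_type  (sharded store in reality)
counts = {}     # (post_id, reaction_type) -> int      (denormalized counters)

def react(user_id, post_id, rtype):
    key = (user_id, post_id)
    prev = reactions.get(key)
    if prev is not None:
        counts[(post_id, prev)] -= 1     # undo the previous reaction's count
    if prev == rtype:
        del reactions[key]               # same reaction tapped twice -> removed
    else:
        reactions[key] = rtype           # new or changed reaction
        counts[(post_id, rtype)] = counts.get((post_id, rtype), 0) + 1

react("u1", "p1", "like")
react("u1", "p1", "love")  # changes like -> love
react("u1", "p1", "love")  # toggles love off
print(counts)  # {('p1', 'like'): 0, ('p1', 'love'): 0}
```

The toggle must be strongly consistent per (user, post), which is why the reaction record, not the counter, is the source of truth.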
&lt;h3 id=&#34;read-traffic&#34;&gt;Read Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Every post view shows reaction count. 2B DAU × 200 posts viewed/day = 400B post views/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;4.6M reads/sec&lt;/strong&gt; for reaction counts (but counts are cached on the post object — not a separate query)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Who reacted&amp;rdquo; list: viewed much less frequently. ~1B queries/day = 11K reads/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Reaction records: 500B reactions × 30 bytes (user_id, post_id, reaction_type, timestamp) = &lt;strong&gt;15TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Reaction counts: 10B posts × 50 bytes (6 counters + total) = &lt;strong&gt;500GB&lt;/strong&gt; (easily fits in cache)&lt;/li&gt;
&lt;li&gt;Active reaction count cache: top 1B posts × 50 bytes = &lt;strong&gt;50GB&lt;/strong&gt; Redis&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;Reads outnumber writes 10:1 for counts. The &amp;ldquo;like count&amp;rdquo; is shown on every post impression but only changes when someone reacts. This is a perfect case for &lt;strong&gt;denormalized counters&lt;/strong&gt; — pre-compute and cache the count rather than counting rows at query time.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Facebook&#39;s News Feed</title>
      <link>https://chiraghasija.cc/designs/news-feed/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/news-feed/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users see a personalized feed of posts from friends, pages, and groups they follow&lt;/li&gt;
&lt;li&gt;Feed is ranked by relevance (not chronological) — a scoring model determines post order&lt;/li&gt;
&lt;li&gt;Support multiple content types: text, images, videos, links, live streams, stories&lt;/li&gt;
&lt;li&gt;New posts appear in followers&amp;rsquo; feeds within seconds (near real-time)&lt;/li&gt;
&lt;li&gt;Infinite scroll with cursor-based pagination — load more posts as user scrolls&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the news feed IS the product. Any downtime is a headline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 200ms to render the first page of feed. Subsequent pages &amp;lt; 100ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is fine. A post appearing 5 seconds late in some feeds is acceptable. But a post should never permanently disappear from a feed it belongs in.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 2 billion DAU, 500M posts/day, each user has ~500 friends on average. Feed generation for 2B users at peak load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; New posts from close friends should appear within 5-10 seconds. Posts from pages/groups can tolerate 30-60 seconds.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2 billion DAU, each opens feed ~10 times/day&lt;/li&gt;
&lt;li&gt;Feed requests: 2B × 10 = &lt;strong&gt;20 billion feed loads/day&lt;/strong&gt; = &lt;strong&gt;230K QPS average&lt;/strong&gt;, &lt;strong&gt;500K QPS peak&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;New posts: 500M/day = &lt;strong&gt;5,800 posts/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each post fans out to ~500 followers (average friends list)&lt;/li&gt;
&lt;li&gt;Fan-out writes: 5,800 × 500 = &lt;strong&gt;2.9 million feed inserts/sec&lt;/strong&gt; (if fan-out on write)&lt;/li&gt;
&lt;/ul&gt;
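The hybrid resolution of that write-vs-read trade-off can be sketched: fan out on write for ordinary users, but keep high-follower authors' posts in their own list and merge them at read time. The threshold and names here are illustrative:

```python
FANOUT_THRESHOLD = 10_000  # illustrative follower-count cutoff

feeds = {}        # user_id -> precomputed list of post_ids (newest first)
celeb_posts = {}  # author_id -> their recent post_ids (pulled at read time)

def publish(author_id, post_id, followers):
    if len(followers) > FANOUT_THRESHOLD:
        celeb_posts.setdefault(author_id, []).insert(0, post_id)
    else:
        for f in followers:                   # fan-out on write
            feeds.setdefault(f, []).insert(0, post_id)

def load_feed(user_id, followed_celebs):
    merged = list(feeds.get(user_id, []))
    for celeb in followed_celebs:             # fan-out on read, celebs only
        merged = celeb_posts.get(celeb, []) + merged
    return merged  # the ranking model re-orders this candidate set
```

This caps fan-out write amplification (a 100M-follower post triggers one write, not 100M) while keeping reads cheap for the common case.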
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Feed cache per user: store 500 most recent post IDs&lt;/li&gt;
&lt;li&gt;500 post IDs x 8 bytes = 4KB per user&lt;/li&gt;
&lt;li&gt;2B users x 4KB = &lt;strong&gt;8 TB&lt;/strong&gt; for feed cache — fits in a Redis cluster&lt;/li&gt;
&lt;li&gt;Post content storage: 500M posts/day x 5KB avg = &lt;strong&gt;2.5 TB/day&lt;/strong&gt; → standard DB/object store&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The core trade-off is &lt;strong&gt;fan-out on write&lt;/strong&gt; (pre-compute feeds, 2.9M writes/sec) vs &lt;strong&gt;fan-out on read&lt;/strong&gt; (compute feed at request time, 500K QPS each querying 500 friends). Neither extreme works alone — the answer is a &lt;strong&gt;hybrid approach&lt;/strong&gt; based on follower count.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design FASTag (Electronic Toll Collection System)</title>
      <link>https://chiraghasija.cc/designs/fastag/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/fastag/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Vehicles with RFID-based FASTag pass through toll plazas without stopping — the system reads the tag, identifies the vehicle, and deducts the toll amount in real time&lt;/li&gt;
&lt;li&gt;Support prepaid wallet accounts linked to FASTag. Users can recharge via UPI, net banking, credit/debit cards, or auto-top-up&lt;/li&gt;
&lt;li&gt;Classify vehicles (car, LCV, bus, truck, multi-axle) automatically using RFID tag metadata and optionally ANPR (Automatic Number Plate Recognition) for verification&lt;/li&gt;
&lt;li&gt;Handle edge cases: insufficient balance (let pass with negative balance up to a threshold, or deny entry), cloned/blacklisted tags, expired tags, tag-less vehicles (fallback to ANPR + manual toll)&lt;/li&gt;
&lt;li&gt;Generate trip receipts, monthly statements, and provide real-time balance/transaction history via mobile app and SMS alerts&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — toll plazas operate 24/7. Even 1 minute of downtime causes massive traffic jams at ~800+ plazas nationwide.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 300ms end-to-end from RFID scan to barrier lift. Vehicles pass at 30 km/h through dedicated FASTag lanes — the window for processing is ~2 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for balance deduction. We cannot deduct the same balance twice or allow a transaction to silently fail and still charge.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; India has ~8 crore (80M) FASTags issued. ~1 crore (10M) toll transactions/day across 800+ plazas with 4000+ lanes. Peak: ~3x average during festivals/holidays.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Every transaction must be recorded. Financial data — zero loss. Full audit trail for regulatory compliance (NHAI, NPCI).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M transactions/day = ~115 TPS average&lt;/li&gt;
&lt;li&gt;Peak (festival season, 3x): &lt;strong&gt;~350 TPS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each transaction involves: RFID read → tag lookup → balance check → debit → receipt → barrier signal = 6 operations&lt;/li&gt;
&lt;li&gt;Effective peak internal ops: &lt;strong&gt;~2100 ops/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Read-heavy on tag lookup (every vehicle approaching triggers a read), write-heavy on transactions&lt;/li&gt;
&lt;/ul&gt;
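The six-step pipeline above must be idempotent end-to-end: if a plaza retries after a timeout, the tag must not be debited twice. A minimal sketch of the debit step, including the negative-balance grace from the requirements (threshold and names are illustrative):

```python
balances = {"TAG_1": 100}   # prepaid wallet balance, in rupees
processed = set()           # (tag_id, read_id) pairs already debited
GRACE = 150                 # illustrative negative-balance threshold

def debit(tag_id: str, read_id: str, toll: int) -> str:
    key = (tag_id, read_id)
    if key in processed:
        return "pass (duplicate read, no second debit)"
    if balances.get(tag_id, 0) - toll < -GRACE:
        return "deny"                 # beyond the grace threshold
    processed.add(key)
    balances[tag_id] -= toll
    return "pass"

print(debit("TAG_1", "read_001", 120))  # pass (balance -20, within grace)
print(debit("TAG_1", "read_001", 120))  # pass (duplicate read, no second debit)
print(debit("TAG_1", "read_002", 200))  # deny (-20 - 200 < -150)
```

In production the dedup key would be persisted with the transaction record so retries are recognized even across a server failover.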
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;80M FASTag accounts: ~500 bytes each (tag ID, vehicle info, owner, balance, status) = &lt;strong&gt;40GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;10M transactions/day × 365 days × 200 bytes = &lt;strong&gt;730GB/year&lt;/strong&gt; of transaction logs&lt;/li&gt;
&lt;li&gt;5 years retention (regulatory): &lt;strong&gt;~3.7TB&lt;/strong&gt; of historical transactions&lt;/li&gt;
&lt;li&gt;Hot data (accounts + last 30 days txns): ~100GB — fits comfortably in-memory cache + SSD&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each RFID read event from plaza: ~200 bytes&lt;/li&gt;
&lt;li&gt;Each transaction response to plaza: ~300 bytes&lt;/li&gt;
&lt;li&gt;Peak: 350 × 500 bytes = ~175KB/s — negligible bandwidth&lt;/li&gt;
&lt;li&gt;The bottleneck is &lt;strong&gt;latency&lt;/strong&gt; (sub-300ms), not throughput&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;latency-critical financial transaction system&lt;/strong&gt; with strong consistency requirements. The scale is modest (350 TPS peak is not extreme), but the latency budget is brutal (300ms including network to remote plazas) and the consequences of failure are physical (traffic jams, accidents). The design must prioritize reliability, fast failover, and degraded-mode operation.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Google Analytics</title>
      <link>https://chiraghasija.cc/designs/google-analytics/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/google-analytics/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Collect client-side events (page views, clicks, custom events) from millions of websites via a lightweight JavaScript SDK&lt;/li&gt;
&lt;li&gt;Provide real-time dashboards showing active users, page views per second, and top pages (within ~30 seconds of event)&lt;/li&gt;
&lt;li&gt;Support batch analytics queries — daily/weekly/monthly reports on sessions, funnels, bounce rate, conversion paths&lt;/li&gt;
&lt;li&gt;Sessionize raw events into user sessions with configurable timeout (default 30 minutes of inactivity)&lt;/li&gt;
&lt;li&gt;Count unique visitors accurately across time ranges (daily, weekly, monthly) with deduplication&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for the ingestion pipeline — dropping events is unacceptable for paying customers. Dashboard reads can tolerate brief degradation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Event ingestion &amp;lt; 50ms (client-perceived). Real-time dashboard data within 30 seconds of event. Batch reports within minutes of scheduled time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is acceptable. Real-time dashboards are approximate. Batch reports must be accurate to within 1%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M tracked websites, 1M events/sec ingestion, 500TB+ of raw event data per year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero event loss once acknowledged. Raw events retained for 2 years, aggregated data retained indefinitely.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M websites tracked, average 10 page views/sec per active site&lt;/li&gt;
&lt;li&gt;Not all sites active simultaneously — assume 100K sites active at peak&lt;/li&gt;
&lt;li&gt;Peak ingestion: 100K sites × 10 events/sec = &lt;strong&gt;1M events/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Average ingestion: ~300K events/sec&lt;/li&gt;
&lt;li&gt;Read QPS (dashboard queries): ~50K/sec (most are cached)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average event payload: ~500 bytes (URL, timestamp, user agent, referrer, custom dimensions, session cookie)&lt;/li&gt;
&lt;li&gt;Daily raw events: 300K/sec × 86,400 sec × 500 bytes = &lt;strong&gt;~13TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Yearly raw events: &lt;strong&gt;~4.7PB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Aggregated rollups (hourly/daily per site per dimension): ~1% of raw = &lt;strong&gt;~50TB/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;unique-visitor-counting&#34;&gt;Unique Visitor Counting&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M sites, each with up to 100M monthly unique visitors&lt;/li&gt;
&lt;li&gt;Exact counting: 10M sites × 100M visitors × 8 bytes = 8PB (impossible)&lt;/li&gt;
&lt;li&gt;HyperLogLog: 10M sites × 12KB per HLL = &lt;strong&gt;120GB&lt;/strong&gt; for monthly uniques — fits in memory&lt;/li&gt;
&lt;/ul&gt;
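HyperLogLog gets away with ~12KB per site because each distinct visitor hash touches at most one small register, and cardinality is recovered from the register maxima. A compact, self-contained sketch (p=14 gives 16,384 registers, matching the ~12KB sizing above, with roughly 1% standard error):

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counting; p=14 -> 2^14 registers, ~1% error."""

    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # small-range correction
            return int(self.m * math.log(self.m / zeros))
        return int(raw)

hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"visitor-{i}")
print(hll.count())  # approximately 50,000
```

HLL sketches also merge by taking element-wise register maxima, which is what makes per-day sketches roll up into weekly and monthly uniques.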
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;write-heavy, read-light&lt;/strong&gt; system. Ingestion throughput and storage cost dominate. The core challenge is building a pipeline that can ingest 1M events/sec, make data queryable in real-time, and run efficient batch aggregations without bankrupting the storage budget.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Google Calendar</title>
      <link>https://chiraghasija.cc/designs/google-calendar/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/google-calendar/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Create, read, update, and delete events with title, description, time range, location, and attendees&lt;/li&gt;
&lt;li&gt;Support recurring events with complex patterns (every weekday, first Monday of each month, every 2 weeks)&lt;/li&gt;
&lt;li&gt;Handle timezone conversions correctly — events display in the user&amp;rsquo;s local timezone regardless of where they were created&lt;/li&gt;
&lt;li&gt;Detect scheduling conflicts when creating or accepting events&lt;/li&gt;
&lt;li&gt;Send notifications and reminders (email, push) at configurable times before events&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — calendar outages directly cause missed meetings and lost productivity across entire organizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 200ms to render a week view (fetch all events for 7 days). Event creation &amp;lt; 300ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for the event owner&amp;rsquo;s view — after creating an event, the owner must immediately see it. Eventual consistency (&amp;lt; 5 seconds) acceptable for other attendees seeing the event appear.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M active users, average 20 events/week per user, 50K event writes/sec peak, 200K calendar view reads/sec peak&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sync:&lt;/strong&gt; Events must sync reliably across all devices (web, mobile, desktop) within 5 seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Event writes (create/update/delete): 50K/sec peak, 20K/sec average&lt;/li&gt;
&lt;li&gt;Calendar view reads: 200K/sec peak (Monday mornings are the spike)&lt;/li&gt;
&lt;li&gt;Notification triggers: 100K/sec (reminders firing across timezones)&lt;/li&gt;
&lt;li&gt;Recurring event expansion: done at query time for future dates, pre-expanded for past 30 days&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M users × 20 events/week × 52 weeks = 520 billion events/year&lt;/li&gt;
&lt;li&gt;But most events are non-recurring single events. Average active events per user: ~200 (upcoming + recent)&lt;/li&gt;
&lt;li&gt;500M users × 200 events × 500 bytes = &lt;strong&gt;50 TB&lt;/strong&gt; for active event data&lt;/li&gt;
&lt;li&gt;Recurring event storage: only the RRULE is stored (not expanded instances)
&lt;ul&gt;
&lt;li&gt;500M users × 10 recurring events avg × 200 bytes = &lt;strong&gt;1 TB&lt;/strong&gt; for recurring patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Total: ~&lt;strong&gt;51 TB&lt;/strong&gt; for core event data&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;calendar-views&#34;&gt;Calendar Views&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Week view: fetch ~20-30 events per user (single-occurrence + expanded recurring)&lt;/li&gt;
&lt;li&gt;Month view: fetch ~80-120 events per user&lt;/li&gt;
&lt;li&gt;Most queries are time-range queries on a single user&amp;rsquo;s calendar → excellent for sharding by user_id&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create an event
POST /calendars/{calendar_id}/events
  Body: {
    &amp;#34;title&amp;#34;: &amp;#34;Sprint Planning&amp;#34;,
    &amp;#34;description&amp;#34;: &amp;#34;Q1 sprint planning meeting&amp;#34;,
    &amp;#34;start&amp;#34;: &amp;#34;2026-02-23T09:00:00&amp;#34;,
    &amp;#34;end&amp;#34;: &amp;#34;2026-02-23T10:00:00&amp;#34;,
    &amp;#34;timezone&amp;#34;: &amp;#34;America/New_York&amp;#34;,
    &amp;#34;location&amp;#34;: &amp;#34;Conference Room A / https://meet.google.com/xyz&amp;#34;,
    &amp;#34;attendees&amp;#34;: [
      {&amp;#34;email&amp;#34;: &amp;#34;alice@company.com&amp;#34;, &amp;#34;optional&amp;#34;: false},
      {&amp;#34;email&amp;#34;: &amp;#34;bob@company.com&amp;#34;, &amp;#34;optional&amp;#34;: true}
    ],
    &amp;#34;recurrence&amp;#34;: &amp;#34;RRULE:FREQ=WEEKLY;BYDAY=MO;COUNT=12&amp;#34;,
    &amp;#34;reminders&amp;#34;: [
      {&amp;#34;method&amp;#34;: &amp;#34;popup&amp;#34;, &amp;#34;minutes&amp;#34;: 10},
      {&amp;#34;method&amp;#34;: &amp;#34;email&amp;#34;, &amp;#34;minutes&amp;#34;: 30}
    ],
    &amp;#34;visibility&amp;#34;: &amp;#34;default&amp;#34;,
    &amp;#34;color_id&amp;#34;: 5
  }
  Response 201: {
    &amp;#34;event_id&amp;#34;: &amp;#34;evt_abc123&amp;#34;,
    &amp;#34;calendar_id&amp;#34;: &amp;#34;cal_user456&amp;#34;,
    &amp;#34;html_link&amp;#34;: &amp;#34;https://calendar.google.com/event?eid=abc123&amp;#34;,
    &amp;#34;created&amp;#34;: &amp;#34;2026-02-22T14:30:00Z&amp;#34;,
    &amp;#34;updated&amp;#34;: &amp;#34;2026-02-22T14:30:00Z&amp;#34;,
    ...
  }

// Get events in a time range (calendar view)
GET /calendars/{calendar_id}/events?timeMin=2026-02-23T00:00:00Z&amp;amp;timeMax=2026-03-02T00:00:00Z&amp;amp;timezone=America/New_York&amp;amp;singleEvents=true
  Response 200: {
    &amp;#34;events&amp;#34;: [
      {
        &amp;#34;event_id&amp;#34;: &amp;#34;evt_abc123&amp;#34;,
        &amp;#34;title&amp;#34;: &amp;#34;Sprint Planning&amp;#34;,
        &amp;#34;start&amp;#34;: {&amp;#34;dateTime&amp;#34;: &amp;#34;2026-02-23T09:00:00-05:00&amp;#34;, &amp;#34;timezone&amp;#34;: &amp;#34;America/New_York&amp;#34;},
        &amp;#34;end&amp;#34;: {&amp;#34;dateTime&amp;#34;: &amp;#34;2026-02-23T10:00:00-05:00&amp;#34;, &amp;#34;timezone&amp;#34;: &amp;#34;America/New_York&amp;#34;},
        &amp;#34;recurring_event_id&amp;#34;: &amp;#34;evt_abc123&amp;#34;,    // parent recurring event
        &amp;#34;original_start&amp;#34;: &amp;#34;2026-02-23T09:00:00-05:00&amp;#34;,
        &amp;#34;attendees&amp;#34;: [...],
        &amp;#34;status&amp;#34;: &amp;#34;confirmed&amp;#34;
      },
      ...
    ]
  }

// Update a single instance of a recurring event
PUT /calendars/{calendar_id}/events/{event_id}?instance=2026-03-02T09:00:00-05:00
  Body: { &amp;#34;start&amp;#34;: &amp;#34;2026-03-02T10:00:00&amp;#34;, &amp;#34;end&amp;#34;: &amp;#34;2026-03-02T11:00:00&amp;#34; }

// Check free/busy time for scheduling
POST /freeBusy
  Body: {
    &amp;#34;timeMin&amp;#34;: &amp;#34;2026-02-24T08:00:00Z&amp;#34;,
    &amp;#34;timeMax&amp;#34;: &amp;#34;2026-02-24T18:00:00Z&amp;#34;,
    &amp;#34;items&amp;#34;: [
      {&amp;#34;id&amp;#34;: &amp;#34;alice@company.com&amp;#34;},
      {&amp;#34;id&amp;#34;: &amp;#34;bob@company.com&amp;#34;}
    ]
  }
  Response 200: {
    &amp;#34;calendars&amp;#34;: {
      &amp;#34;alice@company.com&amp;#34;: {
        &amp;#34;busy&amp;#34;: [
          {&amp;#34;start&amp;#34;: &amp;#34;2026-02-24T09:00:00Z&amp;#34;, &amp;#34;end&amp;#34;: &amp;#34;2026-02-24T10:00:00Z&amp;#34;},
          {&amp;#34;start&amp;#34;: &amp;#34;2026-02-24T14:00:00Z&amp;#34;, &amp;#34;end&amp;#34;: &amp;#34;2026-02-24T15:00:00Z&amp;#34;}
        ]
      },
      ...
    }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;singleEvents=true&lt;/code&gt; tells the API to expand recurring events into individual instances — critical for calendar rendering&lt;/li&gt;
&lt;li&gt;Recurring event modifications are tracked as &amp;ldquo;exception instances&amp;rdquo; linked to the parent event&lt;/li&gt;
&lt;li&gt;Free/busy API is a separate endpoint optimized for multi-user scheduling (returns only busy slots, not event details — respects privacy)&lt;/li&gt;
&lt;/ul&gt;
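&lt;p&gt;The free/busy response above returns coalesced busy slots per attendee. A minimal sketch of that merge step (the function name and tuple representation are illustrative, not part of the API):&lt;/p&gt;

```python
def merge_busy(slots):
    """Coalesce overlapping or back-to-back (start, end) busy slots.
    Works for any comparable endpoints, e.g. datetimes or ISO-8601 strings."""
    merged = []
    for start, end in sorted(slots):
        if merged and merged[-1][1] >= start:
            # Overlaps or touches the previous slot: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]
```

&lt;p&gt;Sorting first makes the merge a single linear pass. Returning only merged ranges also supports the privacy point above: callers learn when someone is busy, not how many individual events they have.&lt;/p&gt;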
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;events-table-mysqlpostgresql-sharded-by-owner_user_id&#34;&gt;Events Table (MySQL/PostgreSQL, sharded by owner_user_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: events
  event_id            (PK) | bigint (Snowflake ID)
  calendar_id              | bigint
  owner_user_id            | bigint (shard key)
  title                    | varchar(500)
  description              | text
  start_time               | timestamp with timezone
  end_time                 | timestamp with timezone
  start_timezone           | varchar(50)    -- e.g., &amp;#34;America/New_York&amp;#34;
  end_timezone             | varchar(50)
  location                 | varchar(500)
  is_all_day               | boolean
  recurrence_rule          | varchar(500)   -- RRULE string, null if non-recurring
  recurrence_end           | timestamp      -- when the recurrence stops
  visibility               | enum(default, public, private)
  status                   | enum(confirmed, tentative, cancelled)
  color_id                 | tinyint
  created_at               | timestamp
  updated_at               | timestamp

Indexes:
  (calendar_id, start_time)  -- primary query: events in a time range for a calendar
  (owner_user_id)            -- shard key
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;recurring-event-exceptions-table&#34;&gt;Recurring Event Exceptions Table&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: event_exceptions
  exception_id        (PK) | bigint
  parent_event_id          | bigint (FK → events)
  original_start_time      | timestamp   -- which instance is being modified
  is_cancelled             | boolean     -- true if this instance is deleted
  modified_title           | varchar(500)
  modified_start_time      | timestamp
  modified_end_time        | timestamp
  modified_location        | varchar(500)
  -- other overridden fields (null = use parent&amp;#39;s value)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;attendees-table&#34;&gt;Attendees Table&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: event_attendees
  event_id                 | bigint (FK → events)
  user_id                  | bigint
  email                    | varchar(200)
  response_status          | enum(needsAction, accepted, declined, tentative)
  is_optional              | boolean
  is_organizer             | boolean
  event_start_time         | timestamp   -- denormalized from events, for the index below

Index: (user_id, event_start_time)  -- for attendee&amp;#39;s calendar view
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;reminders-table&#34;&gt;Reminders Table&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: reminders
  reminder_id         (PK) | bigint
  event_id                 | bigint
  user_id                  | bigint
  method                   | enum(popup, email, sms)
  minutes_before           | int
  trigger_time             | timestamp   -- precomputed: event.start - minutes_before
  is_sent                  | boolean

Index: (trigger_time, is_sent)  -- for the reminder scheduler to efficiently find due reminders
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-these-choices&#34;&gt;Why These Choices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sharded MySQL by owner_user_id&lt;/strong&gt; — most queries are &amp;ldquo;my events this week&amp;rdquo; which hits a single shard. Cross-user queries (free/busy) require scatter-gather but are less frequent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RRULE stored as string, expanded at query time&lt;/strong&gt; — storing every instance of &amp;ldquo;every weekday forever&amp;rdquo; would be infinite. RRULE is compact and standards-based (RFC 5545).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exception table for recurring modifications&lt;/strong&gt; — cleanly separates the recurring pattern from per-instance changes. No need to duplicate the entire event for each modified instance.&lt;/li&gt;
&lt;/ul&gt;
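&lt;p&gt;To make &amp;ldquo;expanded at query time&amp;rdquo; concrete, here is a deliberately minimal sketch that handles only FREQ=WEEKLY with a COUNT; a production expander would use a full RFC 5545 implementation (for example python-dateutil&amp;rsquo;s rrule module) to cover BYDAY, INTERVAL, EXDATE, and the rest:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def expand_weekly(dtstart, count, window_start, window_end):
    """Expand a FREQ=WEEKLY;COUNT=count recurrence into the concrete
    instance start times that fall inside the requested view window."""
    instances = []
    for i in range(count):
        occurrence = dtstart + timedelta(weeks=i)
        if occurrence >= window_start and window_end >= occurrence:
            instances.append(occurrence)
    return instances
```

&lt;p&gt;For the Sprint Planning example (every Monday 9 AM starting 2026-02-23, 12 weeks), expanding against a two-week window yields just the two Monday instances, while the database still holds a single row.&lt;/p&gt;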
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client (Web / Mobile / Desktop)
  │
  │  REST API / WebSocket (for real-time sync)
  ▼
┌───────────────┐
│  API Gateway  │  (auth, rate limiting, routing)
└───────┬───────┘
        │
        ├──────────────────┬───────────────────┬──────────────────┐
        ▼                  ▼                   ▼                  ▼
┌───────────────┐ ┌─────────────────┐ ┌───────────────┐ ┌───────────────┐
│ Event Service │ │ Recurring Event │ │ Notification  │ │ Free/Busy     │
│ (CRUD for     │ │ Expander        │ │ Service       │ │ Service       │
│  events)      │ │ (expands RRULE  │ │ (reminders,   │ │ (scheduling   │
│               │ │  into instances)│ │  invites)     │ │  queries)     │
└───────┬───────┘ └────────┬────────┘ └───────┬───────┘ └───────────────┘
        │                  │                   │
        ▼                  ▼                   ▼
┌────────────────────────────────────────────────────────┐
│                   Shared Data Layer                    │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Events DB    │  │ Attendees DB │  │ Reminders DB │  │
│  │ (MySQL,      │  │ (MySQL,      │  │ (MySQL,      │  │
│  │  sharded by  │  │  sharded by  │  │  sharded by  │  │
│  │  owner_id)   │  │  user_id)    │  │  trigger_time│  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐                    │
│  │ Cache (Redis)│  │ Sync Queue   │                    │
│  │ (user&amp;#39;s week │  │ (Kafka, for  │                    │
│  │  view cache) │  │  cross-device│                    │
│  │              │  │  sync)       │                    │
│  └──────────────┘  └──────────────┘                    │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;event-creation-flow&#34;&gt;Event Creation Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User creates &amp;#34;Sprint Planning, every Monday 9-10 AM ET, 12 weeks&amp;#34;
  │
  ▼
Event Service:
  1. Validate input (times, timezone, RRULE syntax)
  2. Store event with recurrence_rule = &amp;#34;RRULE:FREQ=WEEKLY;BYDAY=MO;COUNT=12&amp;#34;
     (single row in events table, NOT 12 rows)
  3. Store attendees in event_attendees table
  4. Compute reminders for next 30 days of instances:
     → Expand RRULE for next 30 days → 4-5 instances
     → For each: insert into reminders table with precomputed trigger_time
  5. Publish event to Kafka &amp;#34;event-changes&amp;#34; topic
  │
  ├─→ Notification Service:
  │     → Send email invitations to all attendees
  │     → Each attendee&amp;#39;s calendar view cache is invalidated
  │
  └─→ Sync Service:
        → Push real-time update to all owner&amp;#39;s connected devices (WebSocket)
        → Push update to attendees&amp;#39; devices
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;calendar-view-week-flow&#34;&gt;Calendar View (Week) Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User opens calendar for week of Feb 23 - Mar 1
  │
  ▼
API Gateway → Event Service:
  1. Check cache: Redis key &amp;#34;cal_view:{user_id}:2026-W09&amp;#34;
     → Cache hit (80% of the time): return cached events, done
  2. Cache miss:
     a. Query events table:
        SELECT * FROM events
        WHERE calendar_id = ?
          AND ((start_time BETWEEN ? AND ?)            -- non-recurring events in range
               OR (recurrence_rule IS NOT NULL
                   AND (recurrence_end IS NULL OR recurrence_end >= ?)))  -- recurring events still active in the range
     b. For recurring events: expand RRULE to find instances in this week
     c. Join with event_exceptions: apply per-instance modifications, remove cancelled instances
     d. Query attendees table: get events where this user is an attendee
     e. Merge owner&amp;#39;s events + attendee events
     f. Sort by start_time
     g. Cache result in Redis (TTL: 5 minutes)
  3. Return to client
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Event Service:&lt;/strong&gt; Core CRUD. Handles event creation, updates, deletion. Sharded by owner_user_id for write locality.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recurring Event Expander:&lt;/strong&gt; Library/service that takes an RRULE + time range and produces concrete event instances. Uses RFC 5545-compliant parser. Handles complex rules like &amp;ldquo;last Thursday of every month&amp;rdquo; or &amp;ldquo;every 3rd day, excluding weekends.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Service:&lt;/strong&gt; Processes reminder triggers and attendee invitations. Polls the reminders table every 10 seconds for due reminders. Sends via email, push notification, or in-app popup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Free/Busy Service:&lt;/strong&gt; Optimized for multi-user scheduling queries. Maintains a denormalized &amp;ldquo;busy slots&amp;rdquo; table per user (pre-computed from events). Returns only time ranges, no event details.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sync Service:&lt;/strong&gt; Real-time sync across devices. Uses WebSocket connections for push updates. When an event changes, publishes to Kafka, which fans out to all connected devices of affected users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache Layer (Redis):&lt;/strong&gt; Caches calendar views per user per week/month. Invalidated on event create/update/delete. Hit rate: ~80% (users repeatedly view the same week).&lt;/li&gt;
&lt;/ol&gt;
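&lt;p&gt;The week-view cache key from the flow above (&lt;code&gt;cal_view:{user_id}:2026-W09&lt;/code&gt;) can be derived from any date in the requested range via the ISO calendar, so every day of a week maps to the same entry. A small sketch (the function name is illustrative):&lt;/p&gt;

```python
from datetime import date

def week_view_cache_key(user_id, day):
    """Redis key for a user's cached week view, e.g. 'cal_view:42:2026-W09'.
    Uses the ISO year/week so all days of one week share a single entry."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"cal_view:{user_id}:{iso_year}-W{iso_week:02d}"
```

&lt;p&gt;Invalidation then deletes exactly the keys for the weeks an event touches; a recurring event can touch many weeks, which is where the short 5-minute TTL earns its keep as a backstop.&lt;/p&gt;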
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-recurring-events-and-rrule&#34;&gt;Deep Dive 1: Recurring Events and RRULE&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Recurring events are the hardest part of a calendar system. &amp;ldquo;Every weekday&amp;rdquo; generates ~260 instances/year. &amp;ldquo;Every day forever&amp;rdquo; is infinite. We cannot store every instance. But users can modify individual instances (move Tuesday&amp;rsquo;s meeting to Wednesday, cancel one occurrence).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Google Docs (Collaborative Document Editor)</title>
      <link>https://chiraghasija.cc/designs/google-docs/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/google-docs/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can create, edit, and delete documents with rich-text formatting (bold, italic, headings, lists, tables, images)&lt;/li&gt;
&lt;li&gt;Multiple users can simultaneously edit the same document in real time, seeing each other&amp;rsquo;s changes within ~200ms&lt;/li&gt;
&lt;li&gt;Each collaborator&amp;rsquo;s cursor position and selection is visible to all other editors (presence awareness)&lt;/li&gt;
&lt;li&gt;Full version history — users can view, name, and restore any previous version of the document&lt;/li&gt;
&lt;li&gt;Sharing and permissions model: owner, editor, commenter, viewer roles with link-sharing and per-user ACLs&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — documents are a user&amp;rsquo;s primary work artifact. Downtime during working hours is extremely costly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 200ms for local edit acknowledgment; &amp;lt; 500ms for remote edit propagation to other collaborators&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for document state, but operations must converge — all users see the identical document after quiescence. No lost edits, ever.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1B documents total, 10M DAU, up to 100 concurrent editors per document, peak 500K concurrent editing sessions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero data loss. Every keystroke must be persisted. Point-in-time recovery for any document.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M DAU, average 3 editing sessions/day, average session 20 minutes&lt;/li&gt;
&lt;li&gt;Average 2 operations/second per active user (character insert, delete, format change)&lt;/li&gt;
&lt;li&gt;Concurrent active editing sessions at peak: ~500K (10M DAU x 3 sessions x 20 min works out to ~417K concurrent on average; peak per the scale requirement above)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write QPS:&lt;/strong&gt; 500K sessions x 2 ops/sec = &lt;strong&gt;1M operations/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read QPS:&lt;/strong&gt; Document opens: 10M x 3 = 30M/day = ~350 reads/sec (bursty, 3x peak = ~1K/sec). Presence/cursor updates: same as write QPS.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B documents, average 50KB per document (plain text + formatting metadata)&lt;/li&gt;
&lt;li&gt;Document content: 1B x 50KB = &lt;strong&gt;50TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Operation log (for version history): assume 500 ops per document per day, 10 bytes per op average, retained for 1 year
&lt;ul&gt;
&lt;li&gt;Active documents per day: 30M. 30M docs x 500 ops x 10 bytes x 365 days = &lt;strong&gt;~55TB/year&lt;/strong&gt; of operation logs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;~105TB&lt;/strong&gt; primary storage, replicated 3x = ~315TB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;write-heavy, real-time collaboration&lt;/strong&gt; system. The core challenge is not storage or throughput — it&amp;rsquo;s ensuring that concurrent edits from multiple users converge to the same document state without conflicts or lost updates. The algorithm choice (OT vs CRDT) dominates the design.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Google Drive (Cloud Storage &amp; Collaboration)</title>
      <link>https://chiraghasija.cc/designs/google-drive/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/google-drive/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload, download, and delete files of any type and size (up to 5TB per file). Uploads support chunked and resumable upload protocols so large files survive network interruptions.&lt;/li&gt;
&lt;li&gt;Files sync automatically across all of a user&amp;rsquo;s devices. When a file is modified on one device, all other devices reflect the change within seconds.&lt;/li&gt;
&lt;li&gt;Sharing and permissions: users can share files/folders with specific people (viewer, commenter, editor) or generate shareable links with configurable access levels. Support organizational domains (anyone at company X can view).&lt;/li&gt;
&lt;li&gt;Version history: every edit creates a new version. Users can view and restore previous versions (up to 100 versions, or 30 days, whichever comes first).&lt;/li&gt;
&lt;li&gt;Full-text search across file names, contents (for supported formats: docs, PDFs, images via OCR), and metadata (owner, type, modified date).&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for file access (download). Upload can tolerate slightly lower availability (resumable uploads mask brief outages).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Metadata operations (list, search, share): &amp;lt; 200ms. Small file download (&amp;lt; 1MB): &amp;lt; 500ms from edge CDN. Upload acknowledgment: &amp;lt; 100ms per chunk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for metadata (permissions, ownership, folder structure). Eventual consistency for file content propagation to other devices is acceptable (target: &amp;lt; 10 seconds).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 2B users total, 500M MAU, 100M DAU. 2 trillion files stored. 10B API requests/day. 5PB of new data uploaded per day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines). Files must never be lost. This is the most critical non-functional requirement.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10B API requests/day = ~116K requests/sec average&lt;/li&gt;
&lt;li&gt;Breakdown: 60% metadata reads (list, search), 25% downloads, 15% uploads&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read QPS:&lt;/strong&gt; ~70K metadata reads/sec + ~29K downloads/sec&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write QPS:&lt;/strong&gt; ~17K uploads/sec&lt;/li&gt;
&lt;li&gt;Peak: 3x average = &lt;strong&gt;~350K requests/sec total&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Total files: 2 trillion&lt;/li&gt;
&lt;li&gt;Average file size: 2.5MB (skewed: many small docs, fewer large videos)&lt;/li&gt;
&lt;li&gt;Total storage: 2T x 2.5MB = &lt;strong&gt;5 exabytes (5,000 PB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;New data: 5PB/day → 1.8EB/year&lt;/li&gt;
&lt;li&gt;With 3x replication: &lt;strong&gt;15EB&lt;/strong&gt; raw storage&lt;/li&gt;
&lt;li&gt;With deduplication (estimated 30% duplicate data): effective ~10.5EB&lt;/li&gt;
&lt;/ul&gt;
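&lt;p&gt;The ~30% deduplication estimate above presumes content-addressed storage. One common mechanism, sketched here as an assumption (the source does not say how dedup works), is to split files into fixed-size chunks keyed by their hash, so identical chunks are stored once:&lt;/p&gt;

```python
import hashlib

def chunk_ids(data, chunk_size=4 * 1024 * 1024):
    """Split file bytes into fixed-size chunks and return each chunk's
    SHA-256 digest. Chunks with the same digest are stored only once."""
    ids = []
    for off in range(0, len(data), chunk_size):
        ids.append(hashlib.sha256(data[off:off + chunk_size]).hexdigest())
    return ids
```

&lt;p&gt;A file manifest then stores the ordered chunk ids; re-uploading an unchanged file, or one sharing chunks with a file already stored, costs only metadata. Real systems often prefer content-defined chunking over fixed offsets so an insertion does not shift every later chunk boundary.&lt;/p&gt;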
&lt;h3 id=&#34;metadata&#34;&gt;Metadata&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2 trillion files x 1KB metadata per file = &lt;strong&gt;2PB&lt;/strong&gt; of metadata&lt;/li&gt;
&lt;li&gt;This must be in a fast, queryable database — not blob storage&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Downloads: 29K/sec x 2.5MB average = &lt;strong&gt;72.5GB/sec = 580Gbps&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Uploads: 17K/sec x 2.5MB average = &lt;strong&gt;42.5GB/sec = 340Gbps&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;CDN absorbs most download traffic (cache-hit ratio ~85%), so origin bandwidth: ~87Gbps&lt;/li&gt;
&lt;/ul&gt;
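&lt;p&gt;The bandwidth figures follow mechanically from the request rates and the 2.5MB average file size; a quick arithmetic check using the numbers above (variable names are illustrative):&lt;/p&gt;

```python
# Bandwidth sanity check for the estimates above.
avg_file_mb = 2.5
download_qps = 29_000
upload_qps = 17_000
cdn_hit_ratio = 0.85

download_gbps = download_qps * avg_file_mb / 1000 * 8   # GB/s -> Gbps
upload_gbps = upload_qps * avg_file_mb / 1000 * 8
origin_gbps = download_gbps * (1 - cdn_hit_ratio)       # CDN absorbs 85%

print(download_gbps, upload_gbps, round(origin_gbps))   # 580.0 340.0 87
```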
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;storage-dominant system&lt;/strong&gt; at planetary scale. The core challenges are: (1) storing exabytes of data durably and cost-efficiently, (2) syncing file changes across devices with minimal bandwidth and latency, and (3) making 2 trillion files searchable. The metadata layer is essentially a distributed file system.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Instagram</title>
      <link>https://chiraghasija.cc/designs/instagram/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/instagram/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload photos with captions&lt;/li&gt;
&lt;li&gt;Users can follow/unfollow other users&lt;/li&gt;
&lt;li&gt;Users can view a personalized news feed (photos from people they follow)&lt;/li&gt;
&lt;li&gt;Users can like and comment on photos&lt;/li&gt;
&lt;li&gt;Users can view any user&amp;rsquo;s profile (grid of their photos)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — social feeds being down is immediately visible to millions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Feed load &amp;lt; 300ms at p99, photo upload acknowledgment &amp;lt; 2s&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for feed (2-5 seconds stale is fine). Strong consistency for uploads (post → refresh → must see it)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M DAU, 50M photo uploads/day, average user views feed 10 times/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage:&lt;/strong&gt; Photos are large; need cost-efficient media storage&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Uploads:&lt;/strong&gt; 50M/day ÷ 100K = &lt;strong&gt;500 writes/sec&lt;/strong&gt;, peak &lt;strong&gt;2,500/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feed reads:&lt;/strong&gt; 500M × 10/day = 5B reads/day ÷ 100K = &lt;strong&gt;50,000 reads/sec&lt;/strong&gt;, peak &lt;strong&gt;250,000/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read-to-write ratio:&lt;/strong&gt; 100:1 — extremely read-heavy&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average photo: 2MB original, store 4 sizes (thumbnail 50KB, small 200KB, medium 500KB, large 2MB) = ~2.75MB total per photo&lt;/li&gt;
&lt;li&gt;50M photos/day × 2.75MB = &lt;strong&gt;137TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Per year: &lt;strong&gt;~50PB&lt;/strong&gt; — this is a massive storage system&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Feed: 10 photos per load × 500KB avg = 5MB per feed load&lt;/li&gt;
&lt;li&gt;50,000 feeds/sec × 5MB = &lt;strong&gt;250GB/sec&lt;/strong&gt; average read bandwidth; at the 250,000/sec peak, &lt;strong&gt;1.25TB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;CDN is absolutely critical&lt;/li&gt;
&lt;/ul&gt;
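&lt;p&gt;A quick check of the feed-bandwidth arithmetic above (variable names are illustrative):&lt;/p&gt;

```python
# Feed read-bandwidth check using the estimates above.
photos_per_load = 10
avg_photo_kb = 500                 # medium rendition
avg_feed_loads_per_sec = 50_000
peak_feed_loads_per_sec = 250_000

load_mb = photos_per_load * avg_photo_kb / 1000               # 5 MB per feed load
avg_gb_per_sec = avg_feed_loads_per_sec * load_mb / 1000      # 250 GB/sec average
peak_gb_per_sec = peak_feed_loads_per_sec * load_mb / 1000    # 1250 GB/sec peak
```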
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/photos
  Content-Type: multipart/form-data
  Body: {
    &amp;#34;photo&amp;#34;: &amp;lt;binary&amp;gt;,
    &amp;#34;caption&amp;#34;: &amp;#34;Sunset vibes&amp;#34;,
    &amp;#34;location&amp;#34;: &amp;#34;Mumbai, India&amp;#34;     // optional
  }
  Response 201: {
    &amp;#34;photo_id&amp;#34;: &amp;#34;p_abc123&amp;#34;,
    &amp;#34;urls&amp;#34;: {
      &amp;#34;thumbnail&amp;#34;: &amp;#34;https://cdn.ig.com/thumb/p_abc123.jpg&amp;#34;,
      &amp;#34;full&amp;#34;: &amp;#34;https://cdn.ig.com/full/p_abc123.jpg&amp;#34;
    }
  }

GET /api/v1/feed?cursor={cursor}&amp;amp;limit=10
  Response 200: {
    &amp;#34;photos&amp;#34;: [
      {
        &amp;#34;photo_id&amp;#34;: &amp;#34;p_abc123&amp;#34;,
        &amp;#34;user&amp;#34;: { &amp;#34;id&amp;#34;: &amp;#34;u_1&amp;#34;, &amp;#34;username&amp;#34;: &amp;#34;chirag&amp;#34;, &amp;#34;avatar&amp;#34;: &amp;#34;...&amp;#34; },
        &amp;#34;caption&amp;#34;: &amp;#34;Sunset vibes&amp;#34;,
        &amp;#34;photo_url&amp;#34;: &amp;#34;https://cdn.ig.com/med/p_abc123.jpg&amp;#34;,
        &amp;#34;like_count&amp;#34;: 2847,
        &amp;#34;comment_count&amp;#34;: 43,
        &amp;#34;liked_by_me&amp;#34;: true,
        &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T18:00:00Z&amp;#34;
      }
    ],
    &amp;#34;next_cursor&amp;#34;: &amp;#34;ts_1708632000&amp;#34;
  }

POST /api/v1/photos/{photo_id}/like
DELETE /api/v1/photos/{photo_id}/like

POST /api/v1/users/{user_id}/follow
DELETE /api/v1/users/{user_id}/follow

GET /api/v1/users/{user_id}/photos?cursor={cursor}&amp;amp;limit=30
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Pastebin</title>
      <link>https://chiraghasija.cc/designs/pastebin/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/pastebin/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can create a paste (text content, up to 10MB)&lt;/li&gt;
&lt;li&gt;Each paste gets a unique, shareable URL&lt;/li&gt;
&lt;li&gt;Pastes can be public or private (unlisted — only accessible via URL)&lt;/li&gt;
&lt;li&gt;Pastes can have an optional expiration (10 min, 1 hour, 1 day, 1 week, never)&lt;/li&gt;
&lt;li&gt;Syntax highlighting for code pastes (client-side, not a backend concern)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — reads must be highly available; writes can tolerate brief degradation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Paste retrieval &amp;lt; 200ms at p99&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for writes (create → immediately readable). Eventual consistency acceptable for metadata like view counts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 5M new pastes/day, 50M reads/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage:&lt;/strong&gt; Most pastes are small (&amp;lt; 50KB), but we support up to 10MB&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Writes: 5M/day ÷ 100K = &lt;strong&gt;50 writes/sec&lt;/strong&gt;, peak &lt;strong&gt;250/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Reads: 50M/day ÷ 100K = &lt;strong&gt;500 reads/sec&lt;/strong&gt;, peak &lt;strong&gt;2,500/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average paste: 10KB (most are small code snippets)&lt;/li&gt;
&lt;li&gt;5M/day × 10KB = 50GB/day&lt;/li&gt;
&lt;li&gt;Per year: &lt;strong&gt;~18TB&lt;/strong&gt; of paste content&lt;/li&gt;
&lt;li&gt;Over 5 years: &lt;strong&gt;~90TB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Read: 500 reads/sec × 10KB = &lt;strong&gt;5MB/sec&lt;/strong&gt; average&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;25MB/sec&lt;/strong&gt; — manageable&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/pastes
  Body: {
    &amp;#34;content&amp;#34;: &amp;#34;def hello():\n    print(&amp;#39;world&amp;#39;)&amp;#34;,
    &amp;#34;title&amp;#34;: &amp;#34;My Snippet&amp;#34;,          // optional
    &amp;#34;language&amp;#34;: &amp;#34;python&amp;#34;,            // optional, for syntax highlighting
    &amp;#34;expiration&amp;#34;: &amp;#34;1d&amp;#34;,              // optional: 10m, 1h, 1d, 1w, never
    &amp;#34;visibility&amp;#34;: &amp;#34;unlisted&amp;#34;         // public or unlisted
  }
  Response 201: {
    &amp;#34;id&amp;#34;: &amp;#34;aB3kX9p&amp;#34;,
    &amp;#34;url&amp;#34;: &amp;#34;https://paste.example.com/aB3kX9p&amp;#34;,
    &amp;#34;raw_url&amp;#34;: &amp;#34;https://paste.example.com/raw/aB3kX9p&amp;#34;,
    &amp;#34;expires_at&amp;#34;: &amp;#34;2026-02-23T12:00:00Z&amp;#34;
  }

GET /api/v1/pastes/{id}
  Response 200: {
    &amp;#34;id&amp;#34;: &amp;#34;aB3kX9p&amp;#34;,
    &amp;#34;title&amp;#34;: &amp;#34;My Snippet&amp;#34;,
    &amp;#34;content&amp;#34;: &amp;#34;def hello():\n    print(&amp;#39;world&amp;#39;)&amp;#34;,
    &amp;#34;language&amp;#34;: &amp;#34;python&amp;#34;,
    &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T12:00:00Z&amp;#34;,
    &amp;#34;expires_at&amp;#34;: &amp;#34;2026-02-23T12:00:00Z&amp;#34;,
    &amp;#34;view_count&amp;#34;: 42
  }

GET /raw/{id}
  Response 200: (plain text content, no JSON wrapper)
  Content-Type: text/plain

GET /api/v1/pastes/recent?limit=20&amp;amp;cursor={cursor}
  Response 200: { &amp;#34;pastes&amp;#34;: [...], &amp;#34;next_cursor&amp;#34;: &amp;#34;...&amp;#34; }
  // Only returns public pastes
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;metadata-store-sql-postgresql&#34;&gt;Metadata Store: SQL (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: pastes
  id           (PK)   | char(7), Base62 encoded
  title                | varchar(200), nullable
  language             | varchar(50), nullable
  visibility           | enum(&amp;#39;public&amp;#39;, &amp;#39;unlisted&amp;#39;)
  content_key          | varchar(100)  -- S3 object key
  content_size         | int
  created_at           | timestamptz
  expires_at           | timestamptz, nullable
  view_count           | bigint, default 0
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;content-store-object-storage-s3&#34;&gt;Content Store: Object Storage (S3)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;pastes/{id}&lt;/code&gt; → raw text content&lt;/li&gt;
&lt;li&gt;Content is stored separately because metadata queries (list recent, check expiry) shouldn&amp;rsquo;t load paste content, S3 scales storage cheaply to petabytes, and object storage enables CDN integration for reads&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;why-sql-for-metadata&#34;&gt;Why SQL for metadata?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Need to query &lt;code&gt;WHERE visibility = &#39;public&#39; ORDER BY created_at DESC&lt;/code&gt; for the recent pastes feed&lt;/li&gt;
&lt;li&gt;Need to query &lt;code&gt;WHERE expires_at &amp;lt; NOW()&lt;/code&gt; for cleanup&lt;/li&gt;
&lt;li&gt;5M inserts/day (~50/sec) is well within PostgreSQL&amp;rsquo;s capacity with proper indexing&lt;/li&gt;
&lt;/ul&gt;
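&lt;p&gt;A minimal sketch of the two metadata queries and the indexes that serve them (SQLite stands in for PostgreSQL here; the index names are illustrative assumptions):&lt;/p&gt;

```python
import sqlite3

# SQLite stands in for PostgreSQL; schema trimmed to the queried columns.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pastes (
    id TEXT PRIMARY KEY,
    visibility TEXT,
    created_at TEXT,
    expires_at TEXT)""")

# Composite index for the public feed, single-column index for cleanup
# (index names are illustrative).
conn.execute("CREATE INDEX idx_public_recent ON pastes (visibility, created_at)")
conn.execute("CREATE INDEX idx_expiry ON pastes (expires_at)")

conn.executemany("INSERT INTO pastes VALUES (?, ?, ?, ?)", [
    ("aB3kX9p", "public", "2026-02-22", "2026-02-23"),
    ("zQ1mW4r", "unlisted", "2026-02-21", None),
])

# Recent public pastes feed
recent = conn.execute(
    "SELECT id FROM pastes WHERE visibility = 'public' "
    "ORDER BY created_at DESC LIMIT 20").fetchall()

# Expired pastes for the cleanup worker ('now' passed as a parameter)
expired = conn.execute(
    "SELECT id FROM pastes WHERE expires_at IS NOT NULL AND expires_at < ?",
    ("2026-02-24",)).fetchall()
```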
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-path&#34;&gt;Write Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → Load Balancer → Paste Service
  → Generate unique ID (Snowflake → Base62)
  → Upload content to S3 (key: pastes/{id})
  → Write metadata to PostgreSQL
  → Return paste URL
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;read-path&#34;&gt;Read Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → CDN (CloudFront/Cloudflare)
  Cache hit → Return content
  Cache miss → Load Balancer → Paste Service
    → Read metadata from PostgreSQL (or Redis cache)
    → Check expiry → if expired, return 404
    → Fetch content from S3 (or Redis if cached)
    → Increment view count async (Kafka → consumer)
    → Return response + cache at CDN
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Paste Service&lt;/strong&gt;: Stateless application servers handling create/read&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; (primary + replica): Metadata store&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;S3&lt;/strong&gt;: Content store — cheap, durable, scales to petabytes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis&lt;/strong&gt;: Cache hot pastes (metadata + content for small pastes &amp;lt; 100KB)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CDN&lt;/strong&gt;: Cache popular pastes at edge; raw endpoint is especially CDN-friendly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka&lt;/strong&gt;: Async view count updates + expiry event stream&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cleanup Worker&lt;/strong&gt;: Periodic job to delete expired pastes from S3 and PostgreSQL&lt;/li&gt;
&lt;/ol&gt;
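&lt;p&gt;The ID step in the write path (Snowflake → Base62) can be sketched as follows; the digit alphabet ordering is an assumption:&lt;/p&gt;

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    """Encode a Snowflake-style integer ID in Base62."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

# Seven Base62 characters cover 62**7 ≈ 3.5 trillion IDs; at 5M pastes/day
# that keyspace lasts for centuries, so char(7) is a comfortable choice.
```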
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-storage-architecture-s3-vs-database&#34;&gt;Deep Dive 1: Storage Architecture (S3 vs Database)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why not store content in PostgreSQL?&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Twitter for Millions of Users</title>
      <link>https://chiraghasija.cc/designs/twitter/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/twitter/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can post tweets (280 chars, text only for scope)&lt;/li&gt;
&lt;li&gt;Users can follow/unfollow other users&lt;/li&gt;
&lt;li&gt;Users can view their home timeline (tweets from followed users)&lt;/li&gt;
&lt;li&gt;Users can like and retweet&lt;/li&gt;
&lt;li&gt;Users can search for tweets&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — timeline is the core product&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Timeline load &amp;lt; 200ms at p99, tweet post &amp;lt; 500ms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for timeline (a few seconds stale is fine). Read-your-writes consistency for the author&amp;rsquo;s own tweets (post → refresh → see it).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M DAU, 200M tweets/day. Average user follows 200 people and loads timeline 10 times/day.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;writes&#34;&gt;Writes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;200M tweets/day ÷ 100K = &lt;strong&gt;2,000 tweets/sec&lt;/strong&gt;, peak &lt;strong&gt;10,000/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;1B likes/day → &lt;strong&gt;10,000 likes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;reads-timeline&#34;&gt;Reads (Timeline)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M DAU × 10 loads/day = 5B loads/day ÷ 100K = &lt;strong&gt;50,000 timeline reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;250,000/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Read-to-write ratio: 25:1&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Tweet: 280 chars × 2 bytes + metadata ~300 bytes = ~860 bytes → round to 1KB&lt;/li&gt;
&lt;li&gt;200M/day × 1KB = 200GB/day → &lt;strong&gt;73TB/year&lt;/strong&gt; for tweets alone&lt;/li&gt;
&lt;li&gt;User metadata, follows, likes: additional ~20TB/year&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;fan-out-calculation&#34;&gt;Fan-out calculation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average user has 200 followers&lt;/li&gt;
&lt;li&gt;200M tweets/day × 200 followers = &lt;strong&gt;40B fan-out writes/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Celebrity with 50M followers posting → 50M cache writes for one tweet&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/tweets
  Body: { &amp;#34;content&amp;#34;: &amp;#34;Hello world!&amp;#34;, &amp;#34;reply_to&amp;#34;: &amp;#34;t_abc&amp;#34; }  // reply_to optional
  Response 201: { &amp;#34;tweet_id&amp;#34;: &amp;#34;t_xyz&amp;#34;, &amp;#34;created_at&amp;#34;: &amp;#34;...&amp;#34; }

GET /api/v1/timeline?cursor={cursor}&amp;amp;limit=20
  Response 200: {
    &amp;#34;tweets&amp;#34;: [
      {
        &amp;#34;tweet_id&amp;#34;: &amp;#34;t_xyz&amp;#34;,
        &amp;#34;user&amp;#34;: { &amp;#34;id&amp;#34;: &amp;#34;u_1&amp;#34;, &amp;#34;username&amp;#34;: &amp;#34;chirag&amp;#34;, &amp;#34;display_name&amp;#34;: &amp;#34;Chirag&amp;#34;, &amp;#34;avatar&amp;#34;: &amp;#34;...&amp;#34; },
        &amp;#34;content&amp;#34;: &amp;#34;Hello world!&amp;#34;,
        &amp;#34;like_count&amp;#34;: 42,
        &amp;#34;retweet_count&amp;#34;: 7,
        &amp;#34;reply_count&amp;#34;: 3,
        &amp;#34;liked_by_me&amp;#34;: false,
        &amp;#34;retweeted_by_me&amp;#34;: false,
        &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T18:00:00Z&amp;#34;
      }
    ],
    &amp;#34;next_cursor&amp;#34;: &amp;#34;1708632000_t_abc&amp;#34;
  }

POST /api/v1/tweets/{tweet_id}/like
DELETE /api/v1/tweets/{tweet_id}/like

POST /api/v1/tweets/{tweet_id}/retweet
DELETE /api/v1/tweets/{tweet_id}/retweet

POST /api/v1/users/{user_id}/follow
DELETE /api/v1/users/{user_id}/follow

GET /api/v1/search?q={query}&amp;amp;cursor={cursor}
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;user--social-graph-postgresql-sharded-by-user_id&#34;&gt;User &amp;amp; Social Graph (PostgreSQL, sharded by user_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: users
  user_id       (PK) | bigint
  username             | varchar(15), unique
  display_name         | varchar(50)
  bio                  | varchar(160)
  follower_count       | int (denormalized)
  following_count      | int (denormalized)

Table: follows
  follower_id          | bigint
  followee_id          | bigint
  created_at           | timestamptz
  PK: (follower_id, followee_id)
  Index: (followee_id)  -- for &amp;#34;who follows me&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;tweets-cassandra-or-dynamodb&#34;&gt;Tweets (Cassandra or DynamoDB)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: tweets
  tweet_id      (PK) | bigint (Snowflake ID)
  user_id              | bigint
  content              | text
  reply_to             | bigint, nullable
  like_count           | int (denormalized, eventually consistent)
  retweet_count        | int
  created_at           | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;timeline-cache-redis&#34;&gt;Timeline Cache (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: timeline:{user_id}
Type: Sorted Set
Members: tweet_ids, scored by created_at timestamp
Max size: 800 entries (older tweets evicted)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;likes-cassandra&#34;&gt;Likes (Cassandra)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: likes
  tweet_id  (partition key) | bigint
  user_id   (clustering key)| bigint
  created_at                | timestamp

Table: user_likes  (reverse index for &amp;#34;tweets I liked&amp;#34;)
  user_id   (partition key) | bigint
  tweet_id  (clustering key)| bigint
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;tweet-post-path&#34;&gt;Tweet Post Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → Load Balancer → Tweet Service
  → Validate (length, content policy)
  → Write to Cassandra (tweets table)
  → Return success to client (fast ack)
  → Publish to Kafka: tweet_events topic
  → Fan-out Service consumes from Kafka:
    → Fetch follower list for the author
    → For each follower:
      → ZADD timeline:{follower_id} {timestamp} {tweet_id} in Redis
      → ZREMRANGEBYRANK to keep max 800 entries
    → For celebrities (&amp;gt; 10K followers): skip fan-out (pull model)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;timeline-read-path&#34;&gt;Timeline Read Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → Load Balancer → Timeline Service
  → Read timeline:{user_id} from Redis
    → ZREVRANGE with cursor-based pagination
  → For each tweet_id, batch fetch tweet data:
    → Redis cache first (hot tweets)
    → Cassandra for cache misses
  → Merge with celebrity tweets (pull):
    → Get list of celebrities the user follows
    → Fetch their recent tweets
    → Merge + sort by timestamp
  → Hydrate: add user info, liked_by_me, retweeted_by_me
  → Return assembled timeline
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;search-path&#34;&gt;Search Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → Search Service → Elasticsearch
  → Full-text search on tweet content
  → Filter by time range, user, engagement metrics
  → Return results ranked by relevance + recency
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Tweet Service&lt;/strong&gt;: Write path — validates and stores tweets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timeline Service&lt;/strong&gt;: Read path — assembles personalized timelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fan-out Service&lt;/strong&gt;: Async — distributes tweets to followers&amp;rsquo; cached timelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Search Service&lt;/strong&gt;: Elasticsearch-backed tweet search&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cassandra&lt;/strong&gt;: Tweet storage, likes, retweets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt;: User profiles, social graph (follows)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster&lt;/strong&gt;: Timeline caches, hot tweet cache, user data cache&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka&lt;/strong&gt;: Event stream — tweet events, like events, follow events&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CDN&lt;/strong&gt;: Cache user avatars, media attachments&lt;/li&gt;
&lt;/ol&gt;
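&lt;p&gt;The hybrid push/pull flow above can be simulated in memory (a minimal sketch: plain Python lists stand in for Redis sorted sets, so ZADD and ZREMRANGEBYRANK become append, sort, and trim):&lt;/p&gt;

```python
import heapq

CELEBRITY_CUTOFF = 10_000   # follower count above which we skip fan-out
TIMELINE_MAX = 800          # entries kept per cached timeline

timelines = {}              # follower_id -> list of (timestamp, tweet_id)
celebrity_tweets = {}       # celebrity user_id -> list of (timestamp, tweet_id)

def post_tweet(author, followers, tweet_id, ts):
    """Push to follower timelines, or store for pull if the author is a celebrity."""
    if len(followers) >= CELEBRITY_CUTOFF:
        celebrity_tweets.setdefault(author, []).append((ts, tweet_id))
        return
    for f in followers:
        tl = timelines.setdefault(f, [])
        tl.append((ts, tweet_id))
        tl.sort()
        del tl[:-TIMELINE_MAX]   # keep only the newest 800, like ZREMRANGEBYRANK

def read_timeline(user, followed_celebs, limit=20):
    """Merge the pushed timeline with pulled celebrity tweets, newest first."""
    pushed = timelines.get(user, [])
    pulled = [t for c in followed_celebs for t in celebrity_tweets.get(c, [])]
    return heapq.nlargest(limit, pushed + pulled)
```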
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-fan-out-strategy-the-celebrity-problem&#34;&gt;Deep Dive 1: Fan-out Strategy (The Celebrity Problem)&lt;/h3&gt;
&lt;p&gt;This is THE defining problem of Twitter&amp;rsquo;s architecture.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Twitter Trending Topics</title>
      <link>https://chiraghasija.cc/designs/twitter-trending/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/twitter-trending/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Detect trending topics&lt;/strong&gt; — Identify hashtags, phrases, and named entities that are experiencing an abnormal surge in tweet volume (velocity over absolute volume)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geographic trending&lt;/strong&gt; — Compute separate trend lists for global, per-country, and per-city levels (~500 geographic regions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Personalized trends&lt;/strong&gt; — Show a mix of global/local trends and topics relevant to the user&amp;rsquo;s interests (who they follow, what they engage with)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trend context&lt;/strong&gt; — For each trending topic, display explanatory context: &amp;ldquo;Trending because of [event]&amp;rdquo;, tweet count, related tweets, and a category (Sports, Politics, Entertainment, etc.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trend timeline&lt;/strong&gt; — Show how a topic&amp;rsquo;s volume changed over time (sparkline), when it started trending, and whether it&amp;rsquo;s rising, peaking, or declining&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the trends sidebar is visible on every Twitter page. Downtime is highly visible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Trend list must load in &amp;lt; 200ms. Trend detection lag: a topic should appear in the trending list within 2-5 minutes of its surge beginning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Trends update every 1-2 minutes. Stale trends (showing yesterday&amp;rsquo;s event) severely degrade trust.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Twitter processes ~500M tweets/day (6,000 tweets/sec average, 15,000/sec peak). Each tweet may contain 0-5 hashtagged or extractable topics. That&amp;rsquo;s up to 75,000 topic events/sec at peak.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spam resilience:&lt;/strong&gt; Coordinated bot campaigns must not be able to artificially push a topic to trending. False positives are worse than false negatives (a fake trend damages trust).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Tweets/day: 500M → &lt;strong&gt;6,000 tweets/sec&lt;/strong&gt; avg, &lt;strong&gt;15,000/sec&lt;/strong&gt; peak&lt;/li&gt;
&lt;li&gt;Topics extracted per tweet: ~2 (hashtags + NLP-extracted phrases) → &lt;strong&gt;12,000 topic events/sec&lt;/strong&gt; avg, &lt;strong&gt;30,000/sec&lt;/strong&gt; peak&lt;/li&gt;
&lt;li&gt;Trend reads: Every Twitter page load fetches trends. 300M DAU × 10 page loads/day = 3B trend reads/day → &lt;strong&gt;35,000 reads/sec&lt;/strong&gt; avg, &lt;strong&gt;100,000 reads/sec&lt;/strong&gt; peak&lt;/li&gt;
&lt;li&gt;Geographic granularity: ~500 regions × trend list refresh every 60 seconds = &lt;strong&gt;500 trend computations/minute&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-time counters:&lt;/strong&gt; Sliding window counts for ~5M unique topics across 500 regions
&lt;ul&gt;
&lt;li&gt;Each topic-region pair: topic_hash (8B) + count (8B) + window metadata (32B) ≈ 48B&lt;/li&gt;
&lt;li&gt;5M topics × 500 regions × 48B = &lt;strong&gt;120 GB&lt;/strong&gt; — fits in memory (Redis cluster)&lt;/li&gt;
&lt;li&gt;In practice, most topics only trend in 1-3 regions, so effective storage is much less&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Historical trends:&lt;/strong&gt; Archive of what trended, when, and why
&lt;ul&gt;
&lt;li&gt;500 regions × 30 trends × 1440 minutes/day × 365 days = 7.9B records/year&lt;/li&gt;
&lt;li&gt;Each record: ~200B → &lt;strong&gt;1.6 TB/year&lt;/strong&gt; — manageable in a columnar store (ClickHouse)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Count-Min Sketch:&lt;/strong&gt; For approximate frequency counting of all topics
&lt;ul&gt;
&lt;li&gt;5 hash functions × 1M counters × 4 bytes = &lt;strong&gt;20 MB per sketch&lt;/strong&gt; per region&lt;/li&gt;
&lt;li&gt;500 regions × 20 MB = &lt;strong&gt;10 GB&lt;/strong&gt; total — trivially small&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
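&lt;p&gt;The Count-Min Sketch sized above works as follows (a minimal sketch, shrunk to a 1K-counter width for illustration; md5 stands in for the production hash functions):&lt;/p&gt;

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: may overestimate, never underestimates."""

    def __init__(self, rows=5, width=1000):
        self.rows = rows
        self.width = width
        self.table = [[0] * width for _ in range(rows)]

    def _buckets(self, topic):
        # One independent-ish hash per row, derived by salting with the row index.
        for row in range(self.rows):
            h = hashlib.md5(f"{row}:{topic}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, topic, count=1):
        for row, col in self._buckets(topic):
            self.table[row][col] += count

    def estimate(self, topic):
        # The true count is at most the minimum across rows.
        return min(self.table[row][col] for row, col in self._buckets(topic))
```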
&lt;h3 id=&#34;compute&#34;&gt;Compute&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Topic extraction (NLP): 15,000 tweets/sec × ~1ms per tweet = &lt;strong&gt;15 CPU-seconds/sec&lt;/strong&gt; → 15 cores dedicated to NLP extraction&lt;/li&gt;
&lt;li&gt;Trend scoring: 500 regions × every 60 seconds, score 10K candidate topics per region = 500 × 10K / 60 ≈ &lt;strong&gt;83K scoring operations/sec&lt;/strong&gt; — lightweight math, &amp;lt; 1 core&lt;/li&gt;
&lt;/ul&gt;
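&lt;p&gt;The per-topic scoring is lightweight because it reduces to a ratio of recent rate to baseline rate (velocity over volume); a minimal score sketch, with the smoothing constant as an assumed tuning knob:&lt;/p&gt;

```python
def trend_score(recent_counts, baseline_counts, smoothing=10.0):
    """How much a topic's recent per-minute rate exceeds its trailing baseline.

    recent_counts:   per-minute counts for the last few minutes
    baseline_counts: per-minute counts over a longer trailing window
    """
    recent_rate = sum(recent_counts) / max(len(recent_counts), 1)
    baseline_rate = sum(baseline_counts) / max(len(baseline_counts), 1)
    # Smoothing keeps brand-new topics with near-zero baselines from dominating.
    return recent_rate / (baseline_rate + smoothing)

# A topic spiking from ~2/min to ~120/min outranks a big topic holding steady:
spike = trend_score([110, 120, 130], [2] * 60)
steady = trend_score([1000, 1000, 1000], [1000] * 60)
```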
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;trending-topics-api&#34;&gt;Trending Topics API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Get trending topics for a user (personalized + geographic)
GET /api/v1/trends
  Params:
    user_id (optional)          // for personalized trends
    woeid (optional)            // &amp;#34;Where On Earth ID&amp;#34; — geographic region. 1 = global
    count (default 30)          // number of trends to return
  → 200 {
    &amp;#34;as_of&amp;#34;: &amp;#34;2024-03-15T14:32:00Z&amp;#34;,
    &amp;#34;location&amp;#34;: { &amp;#34;name&amp;#34;: &amp;#34;San Francisco&amp;#34;, &amp;#34;woeid&amp;#34;: 2487956 },
    &amp;#34;trends&amp;#34;: [
      {
        &amp;#34;name&amp;#34;: &amp;#34;#SuperBowl&amp;#34;,
        &amp;#34;query&amp;#34;: &amp;#34;%23SuperBowl&amp;#34;,
        &amp;#34;url&amp;#34;: &amp;#34;https://twitter.com/search?q=%23SuperBowl&amp;#34;,
        &amp;#34;tweet_volume_24h&amp;#34;: 2340000,
        &amp;#34;tweet_volume_1h&amp;#34;: 185000,
        &amp;#34;category&amp;#34;: &amp;#34;Sports&amp;#34;,
        &amp;#34;context&amp;#34;: &amp;#34;The Super Bowl is happening tonight at Allegiant Stadium&amp;#34;,
        &amp;#34;started_trending_at&amp;#34;: &amp;#34;2024-03-15T10:00:00Z&amp;#34;,
        &amp;#34;trend_type&amp;#34;: &amp;#34;breaking&amp;#34;,         // breaking | sustained | recurring
        &amp;#34;sparkline&amp;#34;: [12, 15, 45, 120, 185],  // hourly volumes (last 5h)
        &amp;#34;promoted&amp;#34;: false
      },
      ...
    ]
  }

// Get available trending locations
GET /api/v1/trends/available
  → 200 { &amp;#34;locations&amp;#34;: [{ &amp;#34;name&amp;#34;: &amp;#34;Worldwide&amp;#34;, &amp;#34;woeid&amp;#34;: 1 }, ...] }

// Get trend details (for &amp;#34;Why is X trending&amp;#34; page)
GET /api/v1/trends/{topic}/details
  Params: woeid
  → 200 {
    &amp;#34;topic&amp;#34;: &amp;#34;#SuperBowl&amp;#34;,
    &amp;#34;summary&amp;#34;: &amp;#34;The Super Bowl LVIII between...&amp;#34;,
    &amp;#34;related_topics&amp;#34;: [&amp;#34;#Chiefs&amp;#34;, &amp;#34;#49ers&amp;#34;, &amp;#34;#HalftimeShow&amp;#34;],
    &amp;#34;top_tweets&amp;#34;: [...],
    &amp;#34;volume_timeseries&amp;#34;: [...],          // minute-by-minute for last 24h
    &amp;#34;demographics&amp;#34;: { &amp;#34;top_geos&amp;#34;: [...], &amp;#34;top_age_groups&amp;#34;: [...] }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;internal-apis&#34;&gt;Internal APIs&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Topic ingestion (called by tweet processing pipeline)
POST /internal/topics/ingest
  Body: {
    &amp;#34;tweet_id&amp;#34;: &amp;#34;...&amp;#34;,
    &amp;#34;topics&amp;#34;: [&amp;#34;#SuperBowl&amp;#34;, &amp;#34;Chiefs&amp;#34;, &amp;#34;halftime show&amp;#34;],
    &amp;#34;user_id&amp;#34;: &amp;#34;...&amp;#34;,
    &amp;#34;geo&amp;#34;: { &amp;#34;country&amp;#34;: &amp;#34;US&amp;#34;, &amp;#34;city&amp;#34;: &amp;#34;San Francisco&amp;#34; },
    &amp;#34;timestamp&amp;#34;: &amp;#34;...&amp;#34;
  }

// Admin: suppress a trend (safety/trust intervention)
POST /internal/trends/suppress
  Body: { &amp;#34;topic&amp;#34;: &amp;#34;#ScamCoin&amp;#34;, &amp;#34;reason&amp;#34;: &amp;#34;coordinated_inauthentic&amp;#34;, &amp;#34;duration_hours&amp;#34;: 24 }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;WOEID (Where On Earth ID):&lt;/strong&gt; Using Yahoo&amp;rsquo;s WOEID system for geographic identifiers — it&amp;rsquo;s a well-established standard for location hierarchy (city → state → country → world).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sparkline data inline:&lt;/strong&gt; Including a small array of recent hourly volumes allows the client to render a trend line without a separate API call.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trend type classification:&lt;/strong&gt; Distinguishing &amp;ldquo;breaking&amp;rdquo; (sudden spike), &amp;ldquo;sustained&amp;rdquo; (high volume over hours), and &amp;ldquo;recurring&amp;rdquo; (regularly trends at this time, like #MondayMotivation) helps the UI prioritize display.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;topic-counts--sliding-window-redis&#34;&gt;Topic Counts — Sliding Window (Redis)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Key Pattern&lt;/th&gt;
          &lt;th&gt;Type&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;topic:{hash}:window:{region}:{minute_bucket}&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;INT&lt;/td&gt;
          &lt;td&gt;Tweet count in this 1-minute bucket&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;topic:{hash}:meta&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Hash&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;{name, category, first_seen, is_hashtag}&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;trend:list:{woeid}&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Sorted Set&lt;/td&gt;
          &lt;td&gt;Current trends, scored by trend_score. Top 30 = the trend list&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;trend:suppress:{topic_hash}&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String&lt;/td&gt;
          &lt;td&gt;TTL-based suppression flag&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Sliding Window Design:&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Typeahead Suggestion / Autocomplete</title>
      <link>https://chiraghasija.cc/designs/typeahead/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/typeahead/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;As the user types in a search box, show top 5-10 suggestions matching the prefix&lt;/li&gt;
&lt;li&gt;Suggestions should be ranked by popularity (search frequency)&lt;/li&gt;
&lt;li&gt;Suggestions update as new search trends emerge (near-real-time)&lt;/li&gt;
&lt;li&gt;Support personalization (user&amp;rsquo;s own search history weighted higher)&lt;/li&gt;
&lt;li&gt;Handle typos/fuzzy matching (optional stretch goal)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — broken autocomplete degrades the entire search experience&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Suggestions must appear in &amp;lt; 50ms at p99 (users expect instant response as they type)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is fine — if a new trending search takes a few minutes to appear in suggestions, that&amp;rsquo;s acceptable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10B search queries/day, suggestions requested on every keystroke → 50B+ suggestion requests/day (average query = 5 keystrokes)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50B suggestion requests/day ÷ 100K = &lt;strong&gt;500,000 requests/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 3x → &lt;strong&gt;1.5M requests/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;This is an extremely high-QPS system — latency and throughput are everything&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;data-size&#34;&gt;Data Size&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Vocabulary: ~5M unique search phrases (with frequency counts)&lt;/li&gt;
&lt;li&gt;Each phrase: avg 30 chars + frequency count + metadata = ~100 bytes&lt;/li&gt;
&lt;li&gt;Total vocabulary: 5M × 100 bytes = &lt;strong&gt;500MB&lt;/strong&gt; — fits entirely in memory&lt;/li&gt;
&lt;/ul&gt;
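&lt;p&gt;Since the whole vocabulary fits in memory, even a sorted list with binary search gives fast prefix lookup. A baseline sketch (the sample vocabulary is invented; real systems typically precompute top-k suggestions per prefix):&lt;/p&gt;

```python
import bisect

# (phrase, frequency) pairs, kept sorted by phrase for binary search.
VOCAB = sorted([
    ("system design", 9000),
    ("system of a down", 7000),
    ("systemd", 5000),
    ("sysadmin", 3000),
])
PHRASES = [p for p, _ in VOCAB]

def suggest(prefix, k=5):
    """Top-k phrases with the given prefix, ranked by search frequency."""
    lo = bisect.bisect_left(PHRASES, prefix)
    hi = bisect.bisect_left(PHRASES, prefix + "\uffff")  # end of the prefix range
    matches = list(VOCAB[lo:hi])
    matches.sort(key=lambda pf: -pf[1])
    return [p for p, _ in matches[:k]]
```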
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;memory-bound, latency-critical&lt;/strong&gt; system. The entire dataset fits in RAM. The challenge is serving 1.5M QPS at &amp;lt; 50ms consistently.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Uber&#39;s Driver-Rider Matching System</title>
      <link>https://chiraghasija.cc/designs/uber-matching/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/uber-matching/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Rider requests a ride (pickup location, destination) and the system finds the best available driver within seconds&lt;/li&gt;
&lt;li&gt;Drivers continuously report their GPS location; the system maintains a real-time view of all available drivers&lt;/li&gt;
&lt;li&gt;Matching algorithm considers distance, ETA, driver rating, vehicle type, and supply-demand balance&lt;/li&gt;
&lt;li&gt;Support ride types: standard, pool/shared, premium, XL — each with different matching criteria&lt;/li&gt;
&lt;li&gt;Handle surge pricing zones and dynamically adjust pricing based on supply-demand ratio&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — riders cannot request rides if matching is down. Revenue loss is immediate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Match a rider to a driver within 3-5 seconds. Location updates processed within 1 second.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; A driver must never be matched to two riders simultaneously (strong consistency on driver state). Ride pricing can be eventually consistent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 5M concurrent drivers, 20M rides/day, ~1.25M location updates/sec globally (drivers report every 4 seconds).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Driver location must be accurate within 5 seconds for meaningful ETA calculations.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Location updates:&lt;/strong&gt; 5M drivers × 1 update/4 seconds = &lt;strong&gt;1.25M updates/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ride requests:&lt;/strong&gt; 20M rides/day = ~230 rides/sec (peak 5x = 1,150/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Match queries:&lt;/strong&gt; Each ride request queries nearby drivers → geospatial index lookup (1,150 QPS at peak)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ETA calculations:&lt;/strong&gt; Per match attempt, compute ETA for ~10 candidate drivers = 11,500 ETA computations/sec at peak&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Driver locations:&lt;/strong&gt; 5M drivers × 64 bytes (id + lat/lng + timestamp + status) = 320MB (fits in memory)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ride history:&lt;/strong&gt; 20M rides/day × 1KB × 365 = 7.3TB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geospatial index:&lt;/strong&gt; In-memory spatial index of 5M points = ~500MB&lt;/li&gt;
&lt;/ul&gt;
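&lt;p&gt;The in-memory spatial index can be sketched as fixed-size grid buckets (the cell size here is an illustrative assumption; production systems typically use geohash, S2, or H3 cells):&lt;/p&gt;

```python
from collections import defaultdict

CELL_DEG = 0.01                 # roughly 1km cells at the equator (illustrative)

cells = defaultdict(set)        # (cell_x, cell_y) -> set of driver ids
driver_pos = {}                 # driver id -> (lat, lng)

def cell_of(lat, lng):
    return (int(lat / CELL_DEG), int(lng / CELL_DEG))

def update_location(driver, lat, lng):
    """Move a driver between grid cells; O(1) work per GPS ping."""
    if driver in driver_pos:
        cells[cell_of(*driver_pos[driver])].discard(driver)
    driver_pos[driver] = (lat, lng)
    cells[cell_of(lat, lng)].add(driver)

def nearby(lat, lng):
    """Candidate drivers in the rider's cell and its 8 neighbors."""
    cx, cy = cell_of(lat, lng)
    found = set()
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            found |= cells[(cx + dx, cy + dy)]
    return found
```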
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;real-time geospatial matching problem&lt;/strong&gt;. The core challenge is maintaining an up-to-date spatial index of 5M moving points with 1.25M updates/sec, then efficiently querying it to find optimal matches. The matching itself is a constrained optimization problem (minimize total wait time across all concurrent requests, not just each individual one).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Yelp (Proximity Service)</title>
      <link>https://chiraghasija.cc/designs/yelp/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/yelp/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users search for businesses near a given location (latitude, longitude) within a specified radius&lt;/li&gt;
&lt;li&gt;Results are ranked by a combination of relevance, distance, and user ratings&lt;/li&gt;
&lt;li&gt;Support filtering by business category (restaurants, hotels, gas stations), price range, open now, and rating threshold&lt;/li&gt;
&lt;li&gt;Display business details: name, address, photos, hours, reviews, ratings&lt;/li&gt;
&lt;li&gt;Business owners can add/update their business listing information&lt;/li&gt;
&lt;/ol&gt;
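&lt;p&gt;The blended ranking in requirement 2 can be sketched as a weighted score over relevance, proximity, and rating. The weights and the distance-decay shape below are illustrative assumptions; a real system learns them from engagement data.&lt;/p&gt;

```python
def rank_score(relevance, distance_m, rating, weights=(0.5, 0.3, 0.2)):
    # Illustrative weights, not tuned values.
    proximity = 1.0 / (1.0 + distance_m / 500.0)  # decays past ~500 m
    w_rel, w_prox, w_rating = weights
    return w_rel * relevance + w_prox * proximity + w_rating * (rating / 5.0)

# A close, well-rated match can outrank a more relevant but distant one.
results = [
    {"name": "A", "relevance": 0.9, "distance_m": 1200, "rating": 4.5},
    {"name": "B", "relevance": 0.7, "distance_m": 100,  "rating": 4.8},
]
ranked = sorted(
    results,
    key=lambda b: rank_score(b["relevance"], b["distance_m"], b["rating"]),
    reverse=True)
```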
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — search is the core product. Users expect instant results when they open the app.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 100ms for proximity search queries (p99). Business detail pages &amp;lt; 200ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for business data updates. A new restaurant appearing in search within 1 hour is acceptable. But search results must be consistent within a single query (no partial results).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 200M businesses worldwide, 100K search QPS at peak, 500M DAU&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Location accuracy:&lt;/strong&gt; Results within a 500m radius should be nearly exhaustive (recall &amp;gt; 99%). Results at larger radii (5km+) can prioritize relevance over exhaustiveness.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Search QPS: 100K peak, 40K average&lt;/li&gt;
&lt;li&gt;Business detail views: 200K QPS (users click into results)&lt;/li&gt;
&lt;li&gt;Business writes (new/updated listings): 1K/sec (low — read-heavy system)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;200M businesses x 2KB per business (name, address, coordinates, category, hours, ratings) = &lt;strong&gt;400 GB&lt;/strong&gt; for core business data&lt;/li&gt;
&lt;li&gt;Photos: 200M businesses x 5 photos avg x 200KB = &lt;strong&gt;200 TB&lt;/strong&gt; (CDN-served)&lt;/li&gt;
&lt;li&gt;Reviews: 2 billion reviews x 500 bytes = &lt;strong&gt;1 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Geospatial index: 200M businesses x 50 bytes (coordinates + geohash + pointers) = &lt;strong&gt;10 GB&lt;/strong&gt; — fits entirely in memory&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The geospatial index (10GB) is small enough to fit in memory on a single machine, but we need to replicate it across multiple read replicas for 100K QPS. The real challenge is not storage — it is efficient spatial querying and ranking at low latency.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design YouTube / Netflix</title>
      <link>https://chiraghasija.cc/designs/youtube/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/youtube/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload videos&lt;/li&gt;
&lt;li&gt;Users can stream/watch videos (adaptive bitrate)&lt;/li&gt;
&lt;li&gt;Users can search for videos&lt;/li&gt;
&lt;li&gt;Personalized home feed (recommended videos)&lt;/li&gt;
&lt;li&gt;Video metadata: title, description, view count, likes, comments&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — video playback must be rock-solid&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Video playback start &amp;lt; 2 seconds. Search results &amp;lt; 300ms. Home feed &amp;lt; 500ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; View counts and likes can be eventually consistent (seconds of delay acceptable). Video availability after upload: within minutes (transcoding pipeline).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 2B MAU, 1B videos watched/day, 500K video uploads/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bandwidth:&lt;/strong&gt; This is a bandwidth-dominated system — video streaming is 80%+ of internet traffic&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K uploads/day × avg 5 minutes × 10MB/min (original) = 25TB/day raw uploads&lt;/li&gt;
&lt;li&gt;After transcoding (5 resolutions × 3 codecs): ~5x storage multiplier = &lt;strong&gt;125TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Per year: &lt;strong&gt;~45PB&lt;/strong&gt; — massive storage system&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B video views/day, avg 5 min watch time, avg bitrate 3Mbps&lt;/li&gt;
&lt;li&gt;Concurrent viewers (assume 10% of the 2B user base streaming at peak): 200M concurrent&lt;/li&gt;
&lt;li&gt;200M × 3Mbps = &lt;strong&gt;600Tbps&lt;/strong&gt; peak bandwidth&lt;/li&gt;
&lt;li&gt;Even with CDN (95%+ cached), origin bandwidth: &lt;strong&gt;30Tbps&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
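&lt;p&gt;Restating the peak-bandwidth arithmetic explicitly; the 10% concurrency and the 95% CDN hit rate are the assumptions stated above:&lt;/p&gt;

```python
users = 2_000_000_000            # 2B users
concurrent = users // 10         # assume 10% streaming at peak
bitrate_bps = 3_000_000          # 3 Mbps average stream

peak_bps = concurrent * bitrate_bps   # total egress at peak
origin_bps = peak_bps * 5 // 100      # the 5% of requests that miss the CDN

TBPS = 10 ** 12
print(peak_bps // TBPS, origin_bps // TBPS)  # 600 30
```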
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Upload: 500K/day ÷ 100K = &lt;strong&gt;5 uploads/sec&lt;/strong&gt; (low, but each is large and long-running)&lt;/li&gt;
&lt;li&gt;Video plays: 1B/day ÷ 100K = &lt;strong&gt;10,000 play starts/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Search: assume 500M/day = &lt;strong&gt;5,000 searches/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Upload flow (chunked, resumable)
POST /api/v1/videos/upload/init
  Body: { &amp;#34;title&amp;#34;: &amp;#34;My Video&amp;#34;, &amp;#34;description&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;filename&amp;#34;: &amp;#34;video.mp4&amp;#34; }
  Response 200: { &amp;#34;upload_id&amp;#34;: &amp;#34;up_123&amp;#34;, &amp;#34;upload_url&amp;#34;: &amp;#34;https://upload.yt.com/up_123&amp;#34; }

PUT /upload/{upload_id}
  Headers: Content-Range: bytes 0-5242879/*
  Body: &amp;lt;binary chunk&amp;gt;
  Response 308: { &amp;#34;next_offset&amp;#34;: 5242880 }

POST /api/v1/videos/upload/{upload_id}/complete
  Response 202: { &amp;#34;video_id&amp;#34;: &amp;#34;v_abc&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;processing&amp;#34; }

// Playback
GET /api/v1/videos/{video_id}
  Response 200: {
    &amp;#34;video_id&amp;#34;: &amp;#34;v_abc&amp;#34;,
    &amp;#34;title&amp;#34;: &amp;#34;My Video&amp;#34;,
    &amp;#34;description&amp;#34;: &amp;#34;...&amp;#34;,
    &amp;#34;channel&amp;#34;: { &amp;#34;id&amp;#34;: &amp;#34;c_1&amp;#34;, &amp;#34;name&amp;#34;: &amp;#34;TechChannel&amp;#34;, &amp;#34;subscriber_count&amp;#34;: 1000000 },
    &amp;#34;view_count&amp;#34;: 1500000,
    &amp;#34;like_count&amp;#34;: 50000,
    &amp;#34;duration&amp;#34;: 300,
    &amp;#34;stream_urls&amp;#34;: {
      &amp;#34;dash&amp;#34;: &amp;#34;https://cdn.yt.com/v_abc/manifest.mpd&amp;#34;,
      &amp;#34;hls&amp;#34;: &amp;#34;https://cdn.yt.com/v_abc/master.m3u8&amp;#34;
    },
    &amp;#34;thumbnails&amp;#34;: { &amp;#34;default&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;high&amp;#34;: &amp;#34;...&amp;#34; },
    &amp;#34;published_at&amp;#34;: &amp;#34;2026-02-22T12:00:00Z&amp;#34;
  }

GET /api/v1/feed?cursor={cursor}&amp;amp;limit=20
GET /api/v1/search?q={query}&amp;amp;cursor={cursor}

POST /api/v1/videos/{video_id}/like
POST /api/v1/videos/{video_id}/view   // fire-and-forget, for analytics
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design YouTube Surveys (In-Video Polls)</title>
      <link>https://chiraghasija.cc/designs/youtube-surveys/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/youtube-surveys/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Create polls&lt;/strong&gt; — Creators/advertisers can define survey questions (multiple choice, single select, rating scale) and attach them to specific timestamps in a video&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Render polls mid-roll&lt;/strong&gt; — Display a non-intrusive overlay poll at the configured timestamp during video playback, without pausing or blocking the video&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collect votes&lt;/strong&gt; — Record user responses in real time, enforce one vote per user per poll, and allow changing a vote before the poll closes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time results&lt;/strong&gt; — Show live vote counts / percentages to the user after they vote (instant feedback)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics dashboard&lt;/strong&gt; — Provide creators with detailed poll analytics: response rate, demographic breakdown, completion funnel, and A/B test results for different poll placements&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — a poll failing to render is a missed data collection opportunity, but not as critical as video playback itself&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Poll UI must render within 200ms of the trigger timestamp. Vote submission must ACK within 100ms (perceived instant).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Votes must be counted exactly once. Read-after-write consistency for a user seeing their own vote. Aggregate counts can be eventually consistent (1-2 second delay is fine).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; YouTube has 800M daily active viewers, 500M hours of video watched/day. If 5% of videos carry polls and 60% of those viewers see the poll → ~192M poll impressions/day → &lt;strong&gt;~6,700 poll renders/sec&lt;/strong&gt;, &lt;strong&gt;~2,000 votes/sec&lt;/strong&gt; at peak.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Every vote must be durably stored. Zero data loss on votes.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Daily active viewers: 800M&lt;/li&gt;
&lt;li&gt;Videos watched per viewer: ~8/day → 6.4B video views/day&lt;/li&gt;
&lt;li&gt;Videos with polls: 5% → 320M poll-eligible views/day&lt;/li&gt;
&lt;li&gt;Poll impression rate (viewer sees the poll): 60% → &lt;strong&gt;192M poll impressions/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Vote rate (viewer actually votes): 30% of impressions → &lt;strong&gt;57.6M votes/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vote QPS:&lt;/strong&gt; 57.6M / 86400 ≈ &lt;strong&gt;670 votes/sec&lt;/strong&gt; (average) × 3 (peak multiplier) ≈ &lt;strong&gt;2,000 votes/sec&lt;/strong&gt; (peak)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poll render QPS:&lt;/strong&gt; 192M / 86400 ≈ 2,200 renders/sec (average) × 3 ≈ &lt;strong&gt;6,700 renders/sec&lt;/strong&gt; (peak)&lt;/li&gt;
&lt;/ul&gt;
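&lt;p&gt;The funnel above as integer arithmetic; note that the 3× multiplier is what turns the ~670/sec average into the ~2,000/sec peak:&lt;/p&gt;

```python
dau = 800_000_000
views = dau * 8                      # 6.4B video views/day
eligible = views * 5 // 100          # 5% of views are on poll-carrying videos
impressions = eligible * 60 // 100   # 60% impression rate
votes = impressions * 30 // 100      # 30% vote rate

avg_vote_qps = votes // 86_400       # 666, i.e. ~670/sec average
peak_vote_qps = avg_vote_qps * 3     # ~2,000/sec with a 3x peak multiplier
```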
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Poll definitions:&lt;/strong&gt; 10M active polls × 2 KB (question, options, targeting rules, schedule) = &lt;strong&gt;20 GB&lt;/strong&gt; — easily fits in a relational DB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Votes:&lt;/strong&gt; 57.6M votes/day × 365 days × 3 years retention = 63B votes
&lt;ul&gt;
&lt;li&gt;Each vote: poll_id (8B) + user_id (8B) + option_id (4B) + timestamp (8B) + metadata (32B) ≈ 60 bytes&lt;/li&gt;
&lt;li&gt;Total: 63B votes × 60 bytes = &lt;strong&gt;3.78 TB&lt;/strong&gt; — manageable with partitioned storage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregated counts:&lt;/strong&gt; Per-poll, per-option counters. 10M polls × 5 options × 16B = &lt;strong&gt;800 MB&lt;/strong&gt; — trivially small, lives in Redis&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Poll render payload: ~5 KB (question text, options, styling, targeting metadata)&lt;/li&gt;
&lt;li&gt;6,700 renders/sec × 5 KB = &lt;strong&gt;33.5 MB/s&lt;/strong&gt; — negligible compared to video streaming bandwidth&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;creator-facing-apis&#34;&gt;Creator-Facing APIs&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create a poll attached to a video
POST /api/v1/videos/{video_id}/polls
  Body: {
    &amp;#34;question&amp;#34;: &amp;#34;What feature should we build next?&amp;#34;,
    &amp;#34;type&amp;#34;: &amp;#34;single_select&amp;#34;,           // single_select | multi_select | rating
    &amp;#34;options&amp;#34;: [&amp;#34;Dark mode&amp;#34;, &amp;#34;Offline support&amp;#34;, &amp;#34;AI search&amp;#34;, &amp;#34;Better perf&amp;#34;],
    &amp;#34;trigger_time_sec&amp;#34;: 145,           // show at 2:25 in the video
    &amp;#34;display_duration_sec&amp;#34;: 15,        // auto-dismiss after 15s
    &amp;#34;targeting&amp;#34;: {
      &amp;#34;geo&amp;#34;: [&amp;#34;US&amp;#34;, &amp;#34;CA&amp;#34;, &amp;#34;GB&amp;#34;],
      &amp;#34;demographics&amp;#34;: { &amp;#34;age_min&amp;#34;: 18, &amp;#34;age_max&amp;#34;: 45 },
      &amp;#34;sample_pct&amp;#34;: 10                 // only show to 10% of viewers (A/B test)
    },
    &amp;#34;close_after_hours&amp;#34;: 168           // stop accepting votes after 7 days
  }
  → 201 { &amp;#34;poll_id&amp;#34;: &amp;#34;p_abc123&amp;#34;, ... }

// Get poll analytics
GET /api/v1/polls/{poll_id}/analytics
  → 200 {
    &amp;#34;impressions&amp;#34;: 145230,
    &amp;#34;votes&amp;#34;: 43120,
    &amp;#34;response_rate&amp;#34;: 0.297,
    &amp;#34;results&amp;#34;: [
      { &amp;#34;option&amp;#34;: &amp;#34;Dark mode&amp;#34;, &amp;#34;votes&amp;#34;: 18200, &amp;#34;pct&amp;#34;: 42.2 },
      { &amp;#34;option&amp;#34;: &amp;#34;AI search&amp;#34;, &amp;#34;votes&amp;#34;: 12500, &amp;#34;pct&amp;#34;: 29.0 },
      ...
    ],
    &amp;#34;demographics&amp;#34;: { ... },
    &amp;#34;ab_test&amp;#34;: { &amp;#34;variant_a_response_rate&amp;#34;: 0.31, &amp;#34;variant_b_response_rate&amp;#34;: 0.26 }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;viewer-facing-apis&#34;&gt;Viewer-Facing APIs&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Fetch polls for a video (called when video starts playing)
GET /api/v1/videos/{video_id}/polls?viewer_id={uid}
  → 200 {
    &amp;#34;polls&amp;#34;: [
      {
        &amp;#34;poll_id&amp;#34;: &amp;#34;p_abc123&amp;#34;,
        &amp;#34;trigger_time_sec&amp;#34;: 145,
        &amp;#34;question&amp;#34;: &amp;#34;What feature should we build next?&amp;#34;,
        &amp;#34;options&amp;#34;: [...],
        &amp;#34;user_vote&amp;#34;: null              // or option_id if already voted
      }
    ]
  }

// Submit a vote
POST /api/v1/polls/{poll_id}/vote
  Body: { &amp;#34;option_id&amp;#34;: &amp;#34;opt_2&amp;#34;, &amp;#34;viewer_id&amp;#34;: &amp;#34;u_xyz&amp;#34; }
  → 200 { &amp;#34;results&amp;#34;: { &amp;#34;opt_1&amp;#34;: 42.2, &amp;#34;opt_2&amp;#34;: 29.0, ... }, &amp;#34;total_votes&amp;#34;: 43121 }

// Change vote (idempotent PUT)
PUT /api/v1/polls/{poll_id}/vote
  Body: { &amp;#34;option_id&amp;#34;: &amp;#34;opt_3&amp;#34;, &amp;#34;viewer_id&amp;#34;: &amp;#34;u_xyz&amp;#34; }
  → 200 { &amp;#34;results&amp;#34;: { ... } }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pre-fetch polls at video start:&lt;/strong&gt; The player fetches all polls for the video when playback begins. This avoids a network request at the exact trigger timestamp (which could cause a visible delay).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vote response includes live results:&lt;/strong&gt; After voting, the user immediately sees percentages. This is the &amp;ldquo;social proof&amp;rdquo; hook that drives engagement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Targeting evaluated client-side:&lt;/strong&gt; The server sends all eligible polls plus targeting rules. The client-side SDK evaluates targeting (geo, demographics, A/B bucket) to avoid an extra server round-trip at trigger time.&lt;/li&gt;
&lt;/ul&gt;
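&lt;p&gt;The client-side targeting decision can be sketched as below. The hash-bucket sampling is an assumed implementation of sample_pct (chosen so a viewer lands in the same A/B cohort on every rewatch), and the field names simply follow the API sketch above.&lt;/p&gt;

```python
from zlib import crc32

def should_render(poll, viewer):
    """Client-side targeting check; field names follow the API sketch."""
    t = poll.get("targeting", {})
    if "geo" in t and viewer["geo"] not in t["geo"]:
        return False
    # Deterministic sampling: hash the (viewer, poll) pair into a 0-99
    # bucket, so the same viewer stays in the same cohort for this poll.
    pct = t.get("sample_pct", 100)
    bucket = crc32(f'{viewer["id"]}:{poll["poll_id"]}'.encode()) % 100
    return pct > bucket
```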
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;polls-table-postgresql--strong-consistency-for-definitions&#34;&gt;Polls Table (PostgreSQL — strong consistency for definitions)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Column&lt;/th&gt;
          &lt;th&gt;Type&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;poll_id&lt;/td&gt;
          &lt;td&gt;UUID (PK)&lt;/td&gt;
          &lt;td&gt;Globally unique&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;video_id&lt;/td&gt;
          &lt;td&gt;VARCHAR(11)&lt;/td&gt;
          &lt;td&gt;YouTube video ID, indexed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;creator_id&lt;/td&gt;
          &lt;td&gt;BIGINT&lt;/td&gt;
          &lt;td&gt;FK to creator accounts&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;question&lt;/td&gt;
          &lt;td&gt;TEXT&lt;/td&gt;
          &lt;td&gt;Poll question text&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;type&lt;/td&gt;
          &lt;td&gt;ENUM&lt;/td&gt;
          &lt;td&gt;single_select, multi_select, rating&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;options&lt;/td&gt;
          &lt;td&gt;JSONB&lt;/td&gt;
          &lt;td&gt;Array of {id, text, display_order}&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;trigger_time_sec&lt;/td&gt;
          &lt;td&gt;INT&lt;/td&gt;
          &lt;td&gt;Seconds into the video&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;display_duration_sec&lt;/td&gt;
          &lt;td&gt;INT&lt;/td&gt;
          &lt;td&gt;How long to show the overlay&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;targeting&lt;/td&gt;
          &lt;td&gt;JSONB&lt;/td&gt;
          &lt;td&gt;Geo, demographics, sample_pct, etc.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;status&lt;/td&gt;
          &lt;td&gt;ENUM&lt;/td&gt;
          &lt;td&gt;draft, active, paused, closed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;close_at&lt;/td&gt;
          &lt;td&gt;TIMESTAMP&lt;/td&gt;
          &lt;td&gt;When to stop accepting votes&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;created_at&lt;/td&gt;
          &lt;td&gt;TIMESTAMP&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;votes-table-cassandra--high-write-throughput-partitioned-by-poll_id&#34;&gt;Votes Table (Cassandra — high write throughput, partitioned by poll_id)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Column&lt;/th&gt;
          &lt;th&gt;Type&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;poll_id&lt;/td&gt;
          &lt;td&gt;UUID (partition key)&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;viewer_id&lt;/td&gt;
          &lt;td&gt;BIGINT (clustering key)&lt;/td&gt;
          &lt;td&gt;Ensures uniqueness: one vote per user per poll&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;option_id&lt;/td&gt;
          &lt;td&gt;UUID&lt;/td&gt;
          &lt;td&gt;Which option they chose&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;voted_at&lt;/td&gt;
          &lt;td&gt;TIMESTAMP&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;metadata&lt;/td&gt;
          &lt;td&gt;MAP&amp;lt;TEXT,TEXT&amp;gt;&lt;/td&gt;
          &lt;td&gt;Device, geo, referrer, A/B variant&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
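&lt;p&gt;How the (poll_id, viewer_id) key enforces one vote per user, keeps re-submits idempotent, and moves the aggregate tally on a vote change can be shown with an in-memory stand-in (Python for illustration only, not the actual storage code):&lt;/p&gt;

```python
from collections import defaultdict

# Stand-ins for the votes table (one row per (poll_id, viewer_id), as in
# the Cassandra schema above) and the Redis per-option counters.
votes = {}
counts = defaultdict(lambda: defaultdict(int))

def cast_vote(poll_id, viewer_id, option_id):
    key = (poll_id, viewer_id)
    previous = votes.get(key)
    if previous == option_id:
        return dict(counts[poll_id])     # idempotent re-submit: no double count
    if previous is not None:
        counts[poll_id][previous] -= 1   # vote change: move the tally over
    votes[key] = option_id
    counts[poll_id][option_id] += 1
    return dict(counts[poll_id])
```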
&lt;p&gt;&lt;strong&gt;Why Cassandra for votes?&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
