<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>System Design Solutions on Chirag Hasija</title>
    <link>https://chiraghasija.cc/designs/</link>
    <description>Recent content in System Design Solutions on Chirag Hasija</description>
    <image>
      <title>Chirag Hasija</title>
      <url>https://chiraghasija.cc/og-image.png</url>
      <link>https://chiraghasija.cc/og-image.png</link>
    </image>
    <generator>Hugo -- 0.155.3</generator>
    <language>en-us</language>
    <atom:link href="https://chiraghasija.cc/designs/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Design a Botnet Detection System</title>
      <link>https://chiraghasija.cc/designs/botnet-detection/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/botnet-detection/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Detect and classify botnet traffic in real-time — distinguish between legitimate users, known bots (scrapers, crawlers), and malicious botnets (DDoS, credential stuffing, spam)&lt;/li&gt;
&lt;li&gt;Maintain an IP reputation database that scores IPs from 0 (definitively malicious) to 100 (definitively legitimate), updated continuously based on observed behavior&lt;/li&gt;
&lt;li&gt;Support multiple detection methods: rate-based, behavioral fingerprinting, ML anomaly detection, honeypot-based, and collaborative intelligence (threat feeds)&lt;/li&gt;
&lt;li&gt;Automatically block or challenge (CAPTCHA) detected bots with configurable actions per threat level&lt;/li&gt;
&lt;li&gt;Provide a real-time dashboard showing attack patterns, blocked traffic, top threat sources, and detection accuracy metrics&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the detection system sits in the critical path. If it goes down, either all traffic is blocked (fail-closed) or all bots get through (fail-open). Neither is acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5ms decision time per request. Detection is in the hot path of every incoming request. Cannot add perceptible latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for IP reputation scores (a few seconds of propagation delay is fine). Real-time blocking decisions must use locally cached scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1M requests/sec. 10M unique IPs/day. 500K concurrent connections. Detection must scale linearly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;False Positive Rate:&lt;/strong&gt; &amp;lt; 0.1% — blocking legitimate users is worse than letting some bots through. False negatives (missed bots) are bad but recoverable. False positives lose customers.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1M requests/sec → each request needs a bot/not-bot decision&lt;/li&gt;
&lt;li&gt;1M feature extraction + scoring operations per second&lt;/li&gt;
&lt;li&gt;Assuming 5% of traffic is botnet → 50K malicious requests/sec to detect and block&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;ip-reputation-database&#34;&gt;IP Reputation Database&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M unique IPs observed per day&lt;/li&gt;
&lt;li&gt;Each IP record: ~500 bytes (IP, score, features, history, last seen, classification)&lt;/li&gt;
&lt;li&gt;Active IPs (last 24h): 10M × 500 bytes = &lt;strong&gt;5 GB&lt;/strong&gt; — fits in memory (Redis)&lt;/li&gt;
&lt;li&gt;Historical IPs (90-day window): 100M × 500 bytes = &lt;strong&gt;50 GB&lt;/strong&gt; — PostgreSQL with hot/cold tiering&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;feature-extraction&#34;&gt;Feature Extraction&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Per-request features: ~20 signals extracted in &amp;lt; 1ms
&lt;ul&gt;
&lt;li&gt;IP, User-Agent, headers, URL pattern, request timing, TLS fingerprint, geo&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Per-IP aggregated features (sliding window): ~50 signals updated in &amp;lt; 2ms
&lt;ul&gt;
&lt;li&gt;Request rate, endpoint diversity, timing regularity, error rate, session behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Feature vector size: ~200 bytes per request&lt;/li&gt;
&lt;/ul&gt;
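&lt;p&gt;The timing-regularity aggregate above is worth making concrete: bots tend to fire at metronome-like intervals, so the Shannon entropy of inter-request gaps is a cheap, discriminative feature. A minimal sketch, with an illustrative bucket size:&lt;/p&gt;

```python
import math
from collections import Counter

def timing_entropy(timestamps, bucket_ms=100):
    """Shannon entropy of inter-request gaps, bucketed to bucket_ms.

    Near-zero entropy means metronome-like traffic (a bot signal);
    human browsing produces irregular gaps and higher entropy.
    Bucket size and any downstream threshold are illustrative.
    """
    if len(timestamps) > 2:
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        buckets = Counter(round(g * 1000 / bucket_ms) for g in gaps)
        n = len(gaps)
        return -sum((c / n) * math.log2(c / n) for c in buckets.values())
    return 0.0

bot = timing_entropy([i * 0.5 for i in range(20)])  # exact 500ms cadence
human = timing_entropy([0, 0.4, 1.7, 2.1, 5.8, 6.0, 9.3, 12.2])
print(bot, human)  # bot entropy is 0.0; the human trace scores higher
```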
&lt;h3 id=&#34;ml-model&#34;&gt;ML Model&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Model: gradient-boosted decision tree (XGBoost/LightGBM)&lt;/li&gt;
&lt;li&gt;Inference time: &amp;lt; 1ms per request (100 trees, max depth 8)&lt;/li&gt;
&lt;li&gt;Model size: ~10 MB (loaded in memory on every detection node)&lt;/li&gt;
&lt;li&gt;Retraining: daily, on previous day&amp;rsquo;s labeled data&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;request-evaluation-api-internal-called-by-api-gatewaywaf&#34;&gt;Request Evaluation API (internal, called by API Gateway/WAF)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Evaluate a request for bot probability
POST /v1/evaluate
  Body: {
    &amp;#34;ip&amp;#34;: &amp;#34;203.0.113.42&amp;#34;,
    &amp;#34;user_agent&amp;#34;: &amp;#34;Mozilla/5.0 ...&amp;#34;,
    &amp;#34;url&amp;#34;: &amp;#34;/api/v1/login&amp;#34;,
    &amp;#34;method&amp;#34;: &amp;#34;POST&amp;#34;,
    &amp;#34;headers&amp;#34;: { ... },
    &amp;#34;tls_fingerprint&amp;#34;: &amp;#34;ja3_hash_abc&amp;#34;,
    &amp;#34;geo&amp;#34;: { &amp;#34;country&amp;#34;: &amp;#34;RU&amp;#34;, &amp;#34;asn&amp;#34;: 12345 },
    &amp;#34;session_id&amp;#34;: &amp;#34;sess_xyz&amp;#34;
  }
  Response 200: {
    &amp;#34;decision&amp;#34;: &amp;#34;block&amp;#34;,               // allow | challenge | throttle | block
    &amp;#34;confidence&amp;#34;: 0.94,
    &amp;#34;threat_type&amp;#34;: &amp;#34;credential_stuffing&amp;#34;,
    &amp;#34;ip_reputation&amp;#34;: 12,
    &amp;#34;signals&amp;#34;: [&amp;#34;high_request_rate&amp;#34;, &amp;#34;ja3_known_bot&amp;#34;, &amp;#34;geo_mismatch&amp;#34;],
    &amp;#34;latency_ms&amp;#34;: 2.3
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ip-reputation-api&#34;&gt;IP Reputation API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Query IP reputation
GET /v1/reputation/{ip}
  Response 200: {
    &amp;#34;ip&amp;#34;: &amp;#34;203.0.113.42&amp;#34;,
    &amp;#34;score&amp;#34;: 12,                        // 0-100, lower = more suspicious
    &amp;#34;classification&amp;#34;: &amp;#34;likely_bot&amp;#34;,
    &amp;#34;first_seen&amp;#34;: &amp;#34;2026-02-01T00:00:00Z&amp;#34;,
    &amp;#34;last_seen&amp;#34;: &amp;#34;2026-02-22T10:00:00Z&amp;#34;,
    &amp;#34;total_requests_24h&amp;#34;: 45000,
    &amp;#34;blocked_requests_24h&amp;#34;: 42000,
    &amp;#34;threat_types&amp;#34;: [&amp;#34;credential_stuffing&amp;#34;, &amp;#34;scraping&amp;#34;],
    &amp;#34;associated_asn&amp;#34;: { &amp;#34;number&amp;#34;: 12345, &amp;#34;name&amp;#34;: &amp;#34;Shady Hosting Inc.&amp;#34; }
  }

// Manually override IP reputation (analyst action)
PUT /v1/reputation/{ip}
  Body: { &amp;#34;score&amp;#34;: 100, &amp;#34;reason&amp;#34;: &amp;#34;False positive - legitimate partner API&amp;#34;, &amp;#34;ttl&amp;#34;: 86400 }

// Report malicious IP (collaborative intelligence)
POST /v1/reputation/report
  Body: { &amp;#34;ip&amp;#34;: &amp;#34;203.0.113.42&amp;#34;, &amp;#34;threat_type&amp;#34;: &amp;#34;ddos&amp;#34;, &amp;#34;evidence&amp;#34;: &amp;#34;...&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;configuration-api&#34;&gt;Configuration API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Set detection policy
PUT /v1/policies/{policy_id}
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;login_endpoint_protection&amp;#34;,
    &amp;#34;endpoint_pattern&amp;#34;: &amp;#34;/api/*/login&amp;#34;,
    &amp;#34;rules&amp;#34;: [
      { &amp;#34;type&amp;#34;: &amp;#34;rate_limit&amp;#34;, &amp;#34;threshold&amp;#34;: 10, &amp;#34;window&amp;#34;: &amp;#34;1m&amp;#34;, &amp;#34;action&amp;#34;: &amp;#34;challenge&amp;#34; },
      { &amp;#34;type&amp;#34;: &amp;#34;ml_score&amp;#34;, &amp;#34;threshold&amp;#34;: 0.8, &amp;#34;action&amp;#34;: &amp;#34;block&amp;#34; },
      { &amp;#34;type&amp;#34;: &amp;#34;geo_block&amp;#34;, &amp;#34;countries&amp;#34;: [&amp;#34;XX&amp;#34;], &amp;#34;action&amp;#34;: &amp;#34;block&amp;#34; }
    ]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ip-reputation-redis--hot-data--postgresql--full-history&#34;&gt;IP Reputation (Redis — hot data + PostgreSQL — full history)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Redis (real-time lookups)
Key: ip:{ip_address}
Type: Hash
Fields:
  score              | int (0-100)
  classification     | string
  request_count_1h   | int
  request_count_24h  | int
  block_count_24h    | int
  last_seen          | timestamp
  features           | bytes (serialized feature vector)
  threat_types       | string (comma-separated)
TTL: 24 hours (re-populate from PostgreSQL on cache miss)

// PostgreSQL (historical)
Table: ip_reputation
  ip_address       (PK) | inet
  score                  | smallint
  classification         | varchar(50)
  first_seen             | timestamp
  last_seen              | timestamp
  total_requests         | bigint
  total_blocks           | bigint
  threat_types           | text[]
  asn_number             | int
  country_code           | char(2)
  updated_at             | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;request-log-kafka--clickhouse-for-analytics&#34;&gt;Request Log (Kafka + ClickHouse for analytics)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Kafka topic: request_events (retained 7 days)
{
  &amp;#34;timestamp&amp;#34;: 1708632000,
  &amp;#34;ip&amp;#34;: &amp;#34;203.0.113.42&amp;#34;,
  &amp;#34;url&amp;#34;: &amp;#34;/api/v1/login&amp;#34;,
  &amp;#34;method&amp;#34;: &amp;#34;POST&amp;#34;,
  &amp;#34;user_agent&amp;#34;: &amp;#34;...&amp;#34;,
  &amp;#34;tls_fingerprint&amp;#34;: &amp;#34;ja3_abc&amp;#34;,
  &amp;#34;decision&amp;#34;: &amp;#34;block&amp;#34;,
  &amp;#34;ml_score&amp;#34;: 0.94,
  &amp;#34;signals&amp;#34;: [&amp;#34;high_rate&amp;#34;, &amp;#34;ja3_bot&amp;#34;],
  &amp;#34;latency_ms&amp;#34;: 2.3
}

// ClickHouse (aggregated analytics, 90-day retention)
Table: request_events
  timestamp         | DateTime
  ip                | IPv4
  url               | String
  decision          | Enum(&amp;#39;allow&amp;#39;,&amp;#39;challenge&amp;#39;,&amp;#39;throttle&amp;#39;,&amp;#39;block&amp;#39;)
  ml_score          | Float32
  threat_type       | String
  -- columnar storage, efficient for time-series aggregations
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;threat-intelligence-postgresql&#34;&gt;Threat Intelligence (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: threat_feeds
  source            | varchar(100)  -- e.g., &amp;#34;spamhaus&amp;#34;, &amp;#34;abuseipdb&amp;#34;
  ip_address        | inet
  threat_type       | varchar(50)
  confidence        | float
  reported_at       | timestamp
  expires_at        | timestamp

Table: ja3_fingerprints
  ja3_hash         (PK) | varchar(64)
  classification         | varchar(50)  -- &amp;#34;chrome_browser&amp;#34;, &amp;#34;python_requests&amp;#34;, &amp;#34;known_bot&amp;#34;
  confidence             | float
  source                 | varchar(100)
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;request-evaluation-flow&#34;&gt;Request Evaluation Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Incoming Request
  → Edge / API Gateway
    → Bot Detection Middleware (&amp;lt; 5ms total):
      → 1. Extract features (&amp;lt; 0.5ms):
           IP, User-Agent, URL, method, headers, TLS fingerprint (JA3)
      → 2. IP reputation lookup (&amp;lt; 0.5ms):
           Redis: HGET ip:{ip} score
           If score &amp;lt; 20 → fast-path BLOCK (skip ML)
           If score &amp;gt; 90 → fast-path ALLOW (skip ML)
      → 3. Rate check (&amp;lt; 0.5ms):
           Redis: INCR ip:{ip}:rate:{minute_bucket}
           If &amp;gt; threshold → flag as rate-exceeded
      → 4. ML scoring (&amp;lt; 1ms):
           Feed feature vector to in-process model
           Score = probability of being a bot (0.0 - 1.0)
      → 5. Decision engine (&amp;lt; 0.5ms):
           Combine: IP reputation + rate check + ML score + policy rules
           Output: allow / challenge / throttle / block
      → 6. Async: log request event to Kafka
    → If allowed: forward to backend
    → If challenged: return CAPTCHA page
    → If blocked: return 403
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;background-processing&#34;&gt;Background Processing&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Feature Aggregation Pipeline:
  Kafka (request_events) → Stream Processor (Flink/Kafka Streams):
    Per IP, compute sliding window aggregates:
      - Request rate (1min, 5min, 1hr windows)
      - Unique endpoints accessed
      - Error rate (4xx, 5xx responses)
      - Request timing entropy (regular intervals → bot)
      - Geographic consistency
    Write aggregated features to Redis: HSET ip:{ip} features {vector}
    Update IP reputation score based on new features

ML Model Training Pipeline (daily):
  ClickHouse (historical data) → Feature extraction → Label assignment:
    Labels from: manual reviews, honeypot catches, known-good traffic
  → Train XGBoost model → Validate on holdout set
  → If accuracy &amp;gt; threshold: deploy to all detection nodes
  → Canary: deploy to 5% of nodes first, monitor false positive rate
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Bot Detection Middleware:&lt;/strong&gt; Runs in-process on every API Gateway node. Performs feature extraction, reputation lookup, rate check, ML inference, and decision. Must be &amp;lt; 5ms total.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster:&lt;/strong&gt; IP reputation cache, rate counters, feature vectors. Co-located with API Gateway for &amp;lt; 1ms latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature Aggregation Service (Flink):&lt;/strong&gt; Consumes request events from Kafka. Computes per-IP behavioral features in sliding windows. Updates Redis with fresh feature vectors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reputation Scoring Service:&lt;/strong&gt; Periodically recalculates IP reputation scores based on aggregated features, threat feed data, and historical behavior. Writes to Redis + PostgreSQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ML Training Pipeline:&lt;/strong&gt; Daily batch training on labeled data. Model validation and canary deployment. Model versioning and A/B testing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Threat Intelligence Ingester:&lt;/strong&gt; Pulls from external feeds (Spamhaus, AbuseIPDB, etc.) hourly. Updates threat_feeds table and pre-populates IP reputation for known-bad IPs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Honeypot Service:&lt;/strong&gt; Invisible endpoints (hidden form fields, fake API paths) that no legitimate user would access. Any traffic to honeypots is definitively bot traffic → auto-label and block.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analyst Dashboard:&lt;/strong&gt; Real-time view of attack patterns, top blocked IPs, false positive reviews, detection accuracy metrics.&lt;/li&gt;
&lt;/ol&gt;
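&lt;p&gt;Component 1&amp;rsquo;s combination step can be sketched directly from the evaluation flow: the 20/90 reputation cut-offs and the 0.8 ML threshold appear above, while the middle-band actions are an illustrative policy rather than fixed rules:&lt;/p&gt;

```python
def decide(ip_reputation, ml_score, rate_exceeded, ml_block_threshold=0.8):
    """Combine signals into allow / challenge / throttle / block.

    Fast paths mirror the evaluation flow: very good or very bad
    reputation skips ML entirely. The middle band falls through to
    the model score and rate check; the per-band actions here are
    one illustrative policy, not a fixed rule set.
    """
    if ip_reputation > 90:           # fast-path ALLOW, skip ML
        return "allow"
    if 20 > ip_reputation:           # fast-path BLOCK, skip ML
        return "block"
    if ml_score >= ml_block_threshold:
        return "block"
    if rate_exceeded:
        return "throttle"
    if ml_score >= 0.5:              # uncertain: make the client prove itself
        return "challenge"
    return "allow"

print(decide(12, 0.94, True))    # block (reputation fast path)
print(decide(50, 0.6, False))    # challenge
```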
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Internet
  → CDN / Edge (DDoS mitigation at L3/L4 — Cloudflare/AWS Shield)
    → API Gateway with Bot Detection Middleware
      ├── Redis Cluster (reputation, rates, features)
      ├── In-process ML model (XGBoost)
      └── Decision Engine (policy rules)
    → Backend Services (if request allowed)

Async Pipeline:
  Kafka ← request events
    → Flink (feature aggregation) → Redis
    → ClickHouse (analytics storage)

Batch Pipeline:
  ClickHouse → ML Training → Model Store → Deploy to Gateway nodes
  Threat Feeds → Ingester → PostgreSQL + Redis
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-behavioral-fingerprinting-and-tls-analysis&#34;&gt;Deep Dive 1: Behavioral Fingerprinting and TLS Analysis&lt;/h3&gt;
&lt;p&gt;IP addresses alone are insufficient for bot detection. Sophisticated botnets use residential IP proxies, rotating through millions of IPs. Each IP has low request volume (below rate limits), making IP-based detection blind.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Cluster Health Monitoring System</title>
      <link>https://chiraghasija.cc/designs/cluster-health/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/cluster-health/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Monitor health of every node in a cluster (CPU, memory, disk, network, process status) via lightweight agents&lt;/li&gt;
&lt;li&gt;Detect node failures within seconds using heartbeat mechanism and classify failure type (node crash, network partition, disk failure, OOM)&lt;/li&gt;
&lt;li&gt;Define alerting rules (threshold-based and anomaly-based) and route alerts to on-call teams via multiple channels&lt;/li&gt;
&lt;li&gt;Provide a real-time dashboard showing cluster topology, node status, and aggregated metrics&lt;/li&gt;
&lt;li&gt;Support auto-remediation actions (restart service, drain node, scale up) triggered by specific failure patterns&lt;/li&gt;
&lt;/ol&gt;
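&lt;p&gt;The heartbeat rule in requirement 2 can be sketched with a simple missed-beat cutoff; the 5-second interval matches the estimation below, and declaring death after two missed beats is an assumption chosen to stay inside a 10-second detection budget while tolerating one dropped packet:&lt;/p&gt;

```python
import time

HEARTBEAT_INTERVAL = 5   # agents report every 5s (per the estimation)
MISSED_BEFORE_DEAD = 2   # assumed cutoff: 2 missed beats, a 10s budget

class FailureDetector:
    """Flag a node as dead once it misses consecutive heartbeats.

    Two missed 5s heartbeats keeps detection inside 10 seconds while
    a single dropped packet does not trigger a false positive.
    """

    def __init__(self):
        self.last_beat = {}

    def heartbeat(self, node_id, now=None):
        self.last_beat[node_id] = time.time() if now is None else now

    def dead_nodes(self, now=None):
        now = time.time() if now is None else now
        cutoff = HEARTBEAT_INTERVAL * MISSED_BEFORE_DEAD
        return sorted(n for n, ts in self.last_beat.items() if now - ts > cutoff)

fd = FailureDetector()
fd.heartbeat("node-a", now=100)
fd.heartbeat("node-b", now=100)
fd.heartbeat("node-a", now=105)
print(fd.dead_nodes(now=112))   # ['node-b'] missed the 105 and 110 beats
```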
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the monitoring system must be more available than the systems it monitors. A monitoring outage is a blind spot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Failure detection within 10 seconds. Dashboard data staleness &amp;lt; 30 seconds. Alert delivery &amp;lt; 60 seconds from event.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Heartbeat status must be accurate (no false positives for node death). Metrics can tolerate eventual consistency (30s lag acceptable).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100,000 nodes, each emitting ~50 metrics every 10 seconds = 500K metric samples/sec.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Metrics retained at full resolution for 7 days, downsampled (1-min aggregates) for 90 days, further downsampled (1-hour) for 2 years.&lt;/li&gt;
&lt;/ul&gt;
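&lt;p&gt;The durability tiers hinge on downsampling. A minimal sketch of the raw-to-1-minute rollup step; keeping min/max/avg rather than the average alone is an implementation choice that preserves the spikes threshold alerts care about:&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

def downsample(samples, bucket_s=60):
    """Roll raw (timestamp, value) samples up into per-bucket aggregates.

    This is the 7-day-raw to 1-minute-aggregate step. Each bucket keeps
    min, max, avg, and count, so a transient spike survives the rollup.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // bucket_s * bucket_s].append(value)
    return {
        start: {"min": min(vs), "max": max(vs), "avg": mean(vs), "count": len(vs)}
        for start, vs in sorted(buckets.items())
    }

raw = [(t, 40.0) for t in range(0, 60, 10)] + [(65, 95.0), (70, 41.0)]
rollup = downsample(raw)
print(rollup[0]["avg"])    # 40.0
print(rollup[60]["max"])   # 95.0, the spike survives downsampling
```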
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Nodes:&lt;/strong&gt; 100,000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metrics per node:&lt;/strong&gt; 50 metrics × 1 sample/10s = 5 samples/sec/node&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total ingestion:&lt;/strong&gt; 100K × 5 = &lt;strong&gt;500,000 samples/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heartbeats:&lt;/strong&gt; 100K nodes × 1 heartbeat/5s = &lt;strong&gt;20,000 heartbeats/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dashboard queries:&lt;/strong&gt; 500 concurrent operators × 1 query/5s = 100 QPS (but each query may scan millions of data points)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Per sample:&lt;/strong&gt; metric_name (8 bytes hashed) + value (8 bytes) + timestamp (8 bytes) + node_id (8 bytes) + tags (16 bytes) = ~48 bytes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Raw data (7 days):&lt;/strong&gt; 500K samples/sec × 86,400 sec/day × 7 days × 48 bytes = &lt;strong&gt;14.5TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1-min aggregates (90 days):&lt;/strong&gt; 100K nodes × 50 metrics × 1440 min/day × 90 days × 48 bytes = &lt;strong&gt;31TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1-hour aggregates (2 years):&lt;/strong&gt; 100K nodes × 50 metrics × 24 × 730 × 48 bytes = &lt;strong&gt;4.2TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total:&lt;/strong&gt; ~50TB active storage&lt;/li&gt;
&lt;/ul&gt;
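&lt;p&gt;The figures above can be reproduced directly from the stated inputs:&lt;/p&gt;

```python
SAMPLE_BYTES = 48
NODES, METRICS = 100_000, 50

raw_7d = 500_000 * 86_400 * 7 * SAMPLE_BYTES            # full resolution, 7 days
agg_1min = NODES * METRICS * 1_440 * 90 * SAMPLE_BYTES  # 90 days of 1-min rollups
agg_1hr = NODES * METRICS * 24 * 730 * SAMPLE_BYTES     # 2 years of 1-hr rollups

TB = 1e12
print(f"raw: {raw_7d / TB:.1f} TB")       # 14.5 TB
print(f"1-min: {agg_1min / TB:.1f} TB")   # 31.1 TB
print(f"1-hr: {agg_1hr / TB:.1f} TB")     # 4.2 TB
print(f"total: {(raw_7d + agg_1min + agg_1hr) / TB:.0f} TB")  # ~50 TB
```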
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;write-heavy time-series system&lt;/strong&gt; with a 5000:1 write-to-read ratio. The critical challenge is ingesting 500K samples/sec reliably while simultaneously running failure detection with &amp;lt; 10s latency. Storage must be optimized for time-range queries on specific metrics.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Concurrent Screen Limit System (Netflix)</title>
      <link>https://chiraghasija.cc/designs/screen-limit/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/screen-limit/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Enforce a maximum number of concurrent streams per account (e.g., Basic = 1, Standard = 2, Premium = 4)&lt;/li&gt;
&lt;li&gt;Track active streams in real-time via heartbeat-based session monitoring — a stream is &amp;ldquo;active&amp;rdquo; if a heartbeat was received within the last 60 seconds&lt;/li&gt;
&lt;li&gt;When the concurrent limit is reached, deny new stream requests with a clear error message (&amp;ldquo;Too many screens. Stop watching on another device to continue.&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Handle edge cases: device crashes (no explicit stop), network interruptions, and simultaneous stream starts from multiple devices (race conditions)&lt;/li&gt;
&lt;li&gt;Allow users to see active sessions and forcefully terminate a session from another device&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — blocking a paying customer from watching is worse than allowing one extra stream temporarily. Fail-open bias.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Stream start authorization &amp;lt; 100ms. Heartbeat processing &amp;lt; 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for stream count enforcement. Two devices simultaneously starting stream #3 on a 2-stream plan must not both succeed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 200M subscribers, 50M concurrent streams at peak, 100M heartbeats/minute (each stream sends a heartbeat every 30 seconds)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fault tolerance:&lt;/strong&gt; Single Redis node failure must not cause mass stream denials. Graceful degradation.&lt;/li&gt;
&lt;/ul&gt;
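&lt;p&gt;The strong-consistency requirement (two devices must not both win the last slot) comes down to making count-and-admit a single atomic step. In production this is typically a Redis Lua script, atomic because each shard executes scripts serially; the sketch below models that atomicity with an in-process lock, and all names are illustrative:&lt;/p&gt;

```python
import threading
import time

PLAN_LIMITS = {"basic": 1, "standard": 2, "premium": 4}
HEARTBEAT_TTL = 60  # a stream is live if it heartbeated within 60s

class StreamLimiter:
    """Check-and-register in one critical section, so two devices racing
    for the last slot cannot both succeed. The lock stands in for a
    Redis Lua script, which is atomic because a shard runs scripts
    serially; stale sessions (crashed devices) are purged before counting.
    """

    def __init__(self):
        self.sessions = {}   # account_id: {session_id: last_heartbeat}
        self.lock = threading.Lock()

    def try_start(self, account_id, session_id, plan, now=None):
        now = time.time() if now is None else now
        with self.lock:
            live = {
                sid: ts for sid, ts in self.sessions.get(account_id, {}).items()
                if HEARTBEAT_TTL >= now - ts   # drop crashed/stale devices
            }
            if len(live) >= PLAN_LIMITS[plan]:
                self.sessions[account_id] = live
                return False   # surfaces the "Too many screens" error
            live[session_id] = now
            self.sessions[account_id] = live
            return True

limiter = StreamLimiter()
print(limiter.try_start("acct1", "tv", "standard", now=1000))      # True
print(limiter.try_start("acct1", "phone", "standard", now=1010))   # True
print(limiter.try_start("acct1", "laptop", "standard", now=1020))  # False: at limit
print(limiter.try_start("acct1", "laptop", "standard", now=1100))  # True: tv and phone went stale
```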
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50M concurrent streams at peak&lt;/li&gt;
&lt;li&gt;Each stream sends heartbeat every 30 seconds → 50M / 30 = &lt;strong&gt;1.67M heartbeats/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stream starts: average stream duration = 45 minutes → 50M / 45 = &lt;strong&gt;~18.5K stream starts/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stream stops (explicit): ~70% of sessions end cleanly → &lt;strong&gt;~13K stops/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stream authorization checks (on start + periodic re-check every 5 min): &lt;strong&gt;~185K checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active session record: account_id (8B) + session_id (16B) + device_id (32B) + started_at (8B) + last_heartbeat (8B) + metadata (28B) = ~100 bytes&lt;/li&gt;
&lt;li&gt;50M concurrent sessions × 100 bytes = &lt;strong&gt;5 GB&lt;/strong&gt; — fits entirely in Redis&lt;/li&gt;
&lt;li&gt;Account → plan mapping: 200M accounts × 50 bytes = &lt;strong&gt;10 GB&lt;/strong&gt; — cached in Redis, source of truth in PostgreSQL&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;heartbeat-processing&#34;&gt;Heartbeat Processing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1.67M heartbeats/sec × 100 bytes each = &lt;strong&gt;167 MB/sec&lt;/strong&gt; of inbound traffic&lt;/li&gt;
&lt;li&gt;Each heartbeat updates a single key in Redis: O(1) operation&lt;/li&gt;
&lt;li&gt;Redis can handle 500K+ operations/sec per instance → need ~4 Redis shards minimum&lt;/li&gt;
&lt;li&gt;With 10 Redis shards: 167K operations/sec each — comfortable headroom&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;distributed counting problem with strong consistency requirements.&lt;/strong&gt; The core challenge is ensuring that the stream count for an account never exceeds the limit, even when multiple devices attempt to start streams simultaneously. This is complicated by the fact that playback clients can vanish without warning (crash, network loss) — we must use heartbeats to infer liveness rather than relying on explicit stop signals.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Credit Card Processing System</title>
      <link>https://chiraghasija.cc/designs/credit-card-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/credit-card-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Process credit card authorization requests in real-time — validate card, check funds, place a hold, return approve/decline&lt;/li&gt;
&lt;li&gt;Handle settlement/clearing: batch process authorized transactions at end of day, move funds from issuing bank to acquiring bank&lt;/li&gt;
&lt;li&gt;Support tokenization — replace sensitive card numbers with random tokens that cannot be reversed without access to the secure token vault, for PCI DSS compliance&lt;/li&gt;
&lt;li&gt;Detect and flag fraudulent transactions in real-time using rules and ML models before authorization&lt;/li&gt;
&lt;li&gt;Handle refunds, chargebacks, and multi-currency transactions with proper exchange rate management&lt;/li&gt;
&lt;/ol&gt;
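&lt;p&gt;Requirement 3&amp;rsquo;s tokenization can be sketched as a vault that maps each card number to pure randomness, so a token leaked outside the vault reveals nothing about the card. A dict stands in for the vault here; in practice it lives inside the PCI scope behind an HSM:&lt;/p&gt;

```python
import secrets

class TokenVault:
    """Map card numbers (PANs) to random tokens.

    The token carries no information about the PAN, so it cannot be
    reversed without the vault; only the vault can detokenize. The
    dict-backed storage is purely illustrative.
    """

    def __init__(self):
        self.pan_to_token = {}
        self.token_to_pan = {}

    def tokenize(self, pan):
        if pan in self.pan_to_token:   # stable token per card
            return self.pan_to_token[pan]
        token = "tok_" + secrets.token_hex(16)
        self.pan_to_token[pan] = token
        self.token_to_pan[token] = pan
        return token

    def detokenize(self, token):
        return self.token_to_pan[token]

vault = TokenVault()
t = vault.tokenize("4111111111111111")
print(vault.tokenize("4111111111111111") == t)   # True: idempotent per card
print(vault.detokenize(t) == "4111111111111111") # True
```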
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% for the authorization path — every second of downtime costs merchants millions in lost sales&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Authorization response &amp;lt; 300ms end-to-end (merchant → acquirer → card network → issuer → back). Our processing adds &amp;lt; 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for financial operations. Every transaction must be exactly-once. Duplicate charges are unacceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 150K transactions/sec peak (Black Friday). 5B transactions/day average. ~$90T annual payment volume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security:&lt;/strong&gt; PCI DSS Level 1 compliance. Card data encrypted at rest (AES-256) and in transit (TLS 1.3). HSM for key management.&lt;/li&gt;
&lt;/ul&gt;
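&lt;p&gt;The exactly-once requirement is usually enforced with idempotency keys: a retry carrying the same key gets the stored response back instead of creating a second charge. A minimal sketch, with an in-memory dict standing in for a durable store:&lt;/p&gt;

```python
import uuid

class AuthorizationProcessor:
    """Deduplicate authorizations by idempotency key.

    A retry (network timeout, client resend) with the same key returns
    the stored response instead of charging twice. The in-memory dict
    stands in for a durable store; names are illustrative.
    """

    def __init__(self):
        self.responses = {}
        self.charges = []

    def authorize(self, idempotency_key, card_token, amount_cents):
        if idempotency_key in self.responses:   # replay: no second charge
            return self.responses[idempotency_key]
        auth_id = str(uuid.uuid4())
        self.charges.append((card_token, amount_cents))
        response = {"auth_id": auth_id, "status": "approved", "amount": amount_cents}
        self.responses[idempotency_key] = response
        return response

proc = AuthorizationProcessor()
first = proc.authorize("key-123", "tok_abc", 5000)
retry = proc.authorize("key-123", "tok_abc", 5000)   # client retried after a timeout
print(first == retry)         # True: same response replayed
print(len(proc.charges))      # 1: exactly one hold placed
```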
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;5B transactions/day average = &lt;strong&gt;58K TPS average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak (Black Friday, flash sales): &lt;strong&gt;150K TPS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each authorization involves: fraud check + balance check + hold placement = 3-5 internal operations per transaction&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Transaction records: 5B/day × 500 bytes = 2.5TB/day, &lt;strong&gt;~900TB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Token vault: 2B unique cards × 200 bytes = 400GB (fits in memory with replication)&lt;/li&gt;
&lt;li&gt;Fraud model features: 5B/day × 1KB (computed features) = 5TB/day (stored for model training, purged after 90 days)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;financial-math&#34;&gt;Financial Math&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average transaction: $50&lt;/li&gt;
&lt;li&gt;Annual volume: 5B/day × 365 × $50 = &lt;strong&gt;$91T&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Interchange fee (~2%): $1.8T/year in fees flowing through the system&lt;/li&gt;
&lt;li&gt;A 1-second outage at peak: 150K transactions × $50 = &lt;strong&gt;$7.5M in potentially lost transactions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This system has &lt;strong&gt;zero tolerance for data loss or duplication&lt;/strong&gt;. A lost transaction means either the merchant doesn&amp;rsquo;t get paid or the customer gets charged without receiving goods. A duplicate means double-charging. Every operation must be idempotent and durable.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Cache (Memcached/Redis)</title>
      <link>https://chiraghasija.cc/designs/distributed-cache/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/distributed-cache/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Get/Set/Delete&lt;/strong&gt; — clients can store, retrieve, and remove key-value pairs with sub-millisecond latency&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TTL support&lt;/strong&gt; — every key can have an optional time-to-live; expired keys are automatically purged&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eviction policies&lt;/strong&gt; — when memory is full, evict keys according to a configurable policy (LRU, LFU, random)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Atomic operations&lt;/strong&gt; — support CAS (compare-and-swap), increment/decrement, and simple Lua-script-style transactions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cluster management&lt;/strong&gt; — automatically shard data across nodes, detect failures, and rebalance with minimal data movement&lt;/li&gt;
&lt;/ol&gt;
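&lt;p&gt;Requirements 1, 2, and 3 combine naturally in a single structure. A minimal single-node sketch using an ordered map for O(1) LRU bookkeeping; purging expired keys lazily on read is one implementation choice (the common alternative is a background expiry sweep):&lt;/p&gt;

```python
import time
from collections import OrderedDict

class LruTtlCache:
    """Get/Set with per-key TTL and LRU eviction.

    OrderedDict gives O(1) move-to-end, so the least recently used
    key always sits at the left end when capacity is exceeded.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # key: (value, expires_at or None)

    def set(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires = now + ttl if ttl is not None else None
        self.data[key] = (value, expires)
        self.data.move_to_end(key)
        while len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the LRU entry

    def get(self, key, now=None):
        now = time.time() if now is None else now
        if key not in self.data:
            return None
        value, expires = self.data[key]
        if expires is not None and now >= expires:
            del self.data[key]      # lazy purge of an expired key
            return None
        self.data.move_to_end(key)  # touch for LRU ordering
        return value

cache = LruTtlCache(capacity=2)
cache.set("a", 1, now=0)
cache.set("b", 2, ttl=10, now=0)
cache.get("a", now=1)           # touch "a" so "b" becomes the LRU entry
cache.set("c", 3, now=2)        # evicts "b"
print(cache.get("b", now=3))    # None: evicted
print(cache.get("a", now=3))    # 1
```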
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the cache is in the read hot-path; downtime causes a stampede on the backing store&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; p50 &amp;lt; 0.5ms, p99 &amp;lt; 2ms for single-key operations from within the same datacenter&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency across replicas is acceptable. For a single shard, reads-after-writes must be consistent (read-your-writes from the leader)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M+ ops/sec cluster-wide, 500TB+ aggregate memory across thousands of nodes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Best-effort. Cache is ephemeral by design, but optional AOF/snapshotting for warm restarts&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100M cache operations/sec (70% reads, 30% writes)&lt;/li&gt;
&lt;li&gt;Read QPS: &lt;strong&gt;70M/s&lt;/strong&gt;, Write QPS: &lt;strong&gt;30M/s&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Average key size: 64 bytes, average value size: 1 KB&lt;/li&gt;
&lt;li&gt;Average payload per op: ~1 KB → bandwidth: 100M × 1 KB = &lt;strong&gt;100 GB/s&lt;/strong&gt; cluster-wide&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500B unique keys in steady state (Zipf distribution — 20% of keys serve 80% of traffic)&lt;/li&gt;
&lt;li&gt;Average entry: 64 B (key) + 1 KB (value) + 48 B (metadata: TTL, flags, pointers) ≈ 1.1 KB&lt;/li&gt;
&lt;li&gt;Total memory: 500B × 1.1 KB = &lt;strong&gt;550 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With replication factor 2 (one replica per shard): &lt;strong&gt;1.1 PB&lt;/strong&gt; raw memory&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;node-count&#34;&gt;Node Count&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;If each node has 128 GB usable memory → 550 TB / 128 GB ≈ &lt;strong&gt;4,300 primary shards&lt;/strong&gt; + 4,300 replicas ≈ &lt;strong&gt;8,600 nodes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each node handles: 100M / 4,300 ≈ &lt;strong&gt;23K ops/sec&lt;/strong&gt; — very comfortable for a single Redis-like process&lt;/li&gt;
&lt;/ul&gt;
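&lt;p&gt;The arithmetic above is easy to fumble under time pressure; a small helper (all names illustrative) makes it repeatable:&lt;/p&gt;

```python
def cache_cluster_sizing(keys, entry_bytes, node_mem_bytes, ops_per_sec, replication=2):
    # Reproduce the back-of-envelope numbers above as checkable arithmetic.
    # All names here are illustrative, not part of the design.
    total_bytes = keys * entry_bytes
    primaries = -(-total_bytes // node_mem_bytes)   # ceiling division
    return {
        'total_tb': total_bytes / 1e12,
        'primary_shards': primaries,
        'total_nodes': primaries * replication,
        'ops_per_shard': ops_per_sec // primaries,
    }
```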
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;memory-bound, latency-critical&lt;/strong&gt; system. The hard problems are: (1) distributing keys evenly across thousands of shards, (2) handling hot keys that violate even distribution, and (3) preventing thundering herds when popular keys expire.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Job Scheduler</title>
      <link>https://chiraghasija.cc/designs/job-scheduler/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/job-scheduler/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Schedule jobs to run at a specific time, on a recurring cron schedule, or after a delay (e.g., &amp;ldquo;run in 30 minutes&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Execute jobs with at-least-once semantics — every scheduled job must run, even if a worker crashes mid-execution&lt;/li&gt;
&lt;li&gt;Support job dependencies (DAGs) — Job B runs only after Job A completes successfully&lt;/li&gt;
&lt;li&gt;Provide job priority levels (critical, high, normal, low) with priority-based queue ordering&lt;/li&gt;
&lt;li&gt;Support job lifecycle management: create, pause, resume, cancel, retry; expose job status, logs, and execution history&lt;/li&gt;
&lt;/ol&gt;
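&lt;p&gt;Requirements 1 and 4 (time-based dispatch with priority ordering) can be sketched with a min-heap keyed on (run_at, priority). This is an in-process illustration only, with assumed names; the rest of the design distributes exactly this responsibility:&lt;/p&gt;

```python
import heapq

class Scheduler:
    # Illustrative in-process sketch: jobs sit in a min-heap ordered by
    # (run_at, priority); due_jobs pops everything whose time has passed.
    PRIORITY = {'critical': 0, 'high': 1, 'normal': 2, 'low': 3}

    def __init__(self):
        self.heap = []
        self.seq = 0   # tie-breaker so heap comparisons never reach job_id

    def schedule(self, job_id, run_at, priority='normal'):
        heapq.heappush(self.heap, (run_at, self.PRIORITY[priority], self.seq, job_id))
        self.seq += 1

    def due_jobs(self, now):
        # Pop every job whose run_at has passed, in (time, priority) order.
        due = []
        while self.heap and now >= self.heap[0][0]:
            due.append(heapq.heappop(self.heap)[3])
        return due
```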
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — missed or delayed jobs can cost real money (billing runs, report generation, data pipelines)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Job dispatch within 1 second of scheduled time. Job completion depends on the job itself.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Exactly-once scheduling (no duplicate dispatches), at-least-once execution (idempotent jobs handle retries)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M scheduled jobs, 100K job executions per minute at peak, 10K concurrent running jobs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Job definitions and execution history must survive any single node failure. Zero job loss.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M total scheduled jobs (mix of one-time and recurring)&lt;/li&gt;
&lt;li&gt;Recurring jobs: 1M cron jobs, average frequency = every hour → 1M/3600 = &lt;strong&gt;~278 job dispatches/sec&lt;/strong&gt; from cron alone&lt;/li&gt;
&lt;li&gt;One-time jobs: 100K scheduled per hour → &lt;strong&gt;~28 dispatches/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 5× average → &lt;strong&gt;~1500 dispatches/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;API calls (CRUD + status checks): &lt;strong&gt;~5000 QPS&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Job definition: ~2 KB (name, schedule, payload, config, dependencies)&lt;/li&gt;
&lt;li&gt;10M jobs × 2 KB = &lt;strong&gt;20 GB&lt;/strong&gt; — fits in a single PostgreSQL instance&lt;/li&gt;
&lt;li&gt;Execution history: 100K executions/min × 1 KB per record = 100 MB/min = &lt;strong&gt;144 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Retention: 30 days of execution history = &lt;strong&gt;~4.3 TB&lt;/strong&gt; (needs archival strategy)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;worker-pool&#34;&gt;Worker Pool&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average job duration: 30 seconds&lt;/li&gt;
&lt;li&gt;100K executions/min ÷ 60 = 1,667 jobs/sec&lt;/li&gt;
&lt;li&gt;1,667 jobs/sec × 30 sec average = &lt;strong&gt;50,000 concurrent job slots needed&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With 16 slots per worker: &lt;strong&gt;~3,125 worker machines&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
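&lt;p&gt;The worker-pool math is an application of Little&amp;#39;s law: concurrency = arrival rate × average service time. A small helper (names are illustrative) keeps the estimate checkable:&lt;/p&gt;

```python
import math

def worker_fleet_size(executions_per_min, avg_job_seconds, slots_per_worker):
    # Little's law: concurrent jobs = arrival rate x average service time.
    concurrent_slots = executions_per_min * avg_job_seconds / 60
    return math.ceil(concurrent_slots / slots_per_worker)
```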
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The scheduler itself is not compute-heavy — the challenge is &lt;strong&gt;reliable, timely dispatch at scale with exactly-once semantics.&lt;/strong&gt; The workers that execute jobs are a separate scaling concern. The hardest problems are: (1) distributing scheduling responsibility without missing or duplicating jobs, (2) handling worker failures mid-execution, and (3) orchestrating DAG dependencies efficiently.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Key-Value Store</title>
      <link>https://chiraghasija.cc/designs/kv-store/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/kv-store/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;put(key, value)&lt;/code&gt; — store a key-value pair (upsert semantics)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;get(key)&lt;/code&gt; — retrieve the value for a given key&lt;/li&gt;
&lt;li&gt;&lt;code&gt;delete(key)&lt;/code&gt; — remove a key-value pair&lt;/li&gt;
&lt;li&gt;Support large value sizes (up to 1MB per value, keys up to 256 bytes)&lt;/li&gt;
&lt;li&gt;Automatic data partitioning across nodes with rebalancing on cluster resize&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the store must remain writable even during node failures and network partitions (AP system by default, tunable toward CP)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5ms p99 for both reads and writes (single datacenter)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Tunable — eventual consistency by default (R=1, W=1), strong consistency optional (R=W=quorum)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100TB+ total data, 500K+ operations/sec, hundreds of nodes, linear horizontal scaling&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; No data loss for acknowledged writes. Replication factor of 3 across failure domains.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Write QPS: 200K ops/sec&lt;/li&gt;
&lt;li&gt;Read QPS: 300K ops/sec (1.5:1 read-to-write ratio)&lt;/li&gt;
&lt;li&gt;Average value size: 10KB, average key size: 100 bytes&lt;/li&gt;
&lt;li&gt;Write bandwidth: 200K x 10KB = &lt;strong&gt;2 GB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Read bandwidth: 300K x 10KB = &lt;strong&gt;3 GB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100 billion key-value pairs in steady state&lt;/li&gt;
&lt;li&gt;Average size per pair: 10KB value + 100 bytes key + 200 bytes metadata = ~10.3KB&lt;/li&gt;
&lt;li&gt;Total data: 100B x 10.3KB = &lt;strong&gt;1 PB&lt;/strong&gt; before replication&lt;/li&gt;
&lt;li&gt;With RF=3: &lt;strong&gt;3 PB&lt;/strong&gt; total&lt;/li&gt;
&lt;li&gt;Per node (assuming 100 nodes): 30TB per node → 8 x 4TB SSDs each&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;memory&#34;&gt;Memory&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each node holds ~3B entries (30TB per node ÷ ~10.3KB per entry, replicas included)&lt;/li&gt;
&lt;li&gt;In-memory index (key → file offset): 100 bytes key + 16 bytes offset = 116 bytes per key&lt;/li&gt;
&lt;li&gt;Index per node: 3B x 116 bytes = &lt;strong&gt;~350 GB&lt;/strong&gt; — far too large for RAM&lt;/li&gt;
&lt;li&gt;Solution: Bloom filters + sparse index (see Deep Dive)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Write a key-value pair
PUT /kv/{key}
  Headers: {
    &amp;#34;X-Consistency&amp;#34;: &amp;#34;quorum&amp;#34;,      // optional: one, quorum, all
    &amp;#34;X-TTL&amp;#34;: 86400,                 // optional: TTL in seconds
    &amp;#34;If-Match&amp;#34;: &amp;#34;&amp;lt;vector_clock&amp;gt;&amp;#34;    // optional: conditional write (CAS)
  }
  Body: &amp;lt;raw bytes&amp;gt;
  Response 200: {
    &amp;#34;key&amp;#34;: &amp;#34;user:123:profile&amp;#34;,
    &amp;#34;version&amp;#34;: {&amp;#34;node1&amp;#34;: 3, &amp;#34;node2&amp;#34;: 1},  // vector clock
    &amp;#34;timestamp&amp;#34;: 1708632060000
  }

// Read a key
GET /kv/{key}
  Headers: {
    &amp;#34;X-Consistency&amp;#34;: &amp;#34;quorum&amp;#34;       // optional: one, quorum, all
  }
  Response 200: {
    &amp;#34;key&amp;#34;: &amp;#34;user:123:profile&amp;#34;,
    &amp;#34;value&amp;#34;: &amp;#34;&amp;lt;base64-encoded&amp;gt;&amp;#34;,
    &amp;#34;version&amp;#34;: {&amp;#34;node1&amp;#34;: 3, &amp;#34;node2&amp;#34;: 1},
    &amp;#34;timestamp&amp;#34;: 1708632060000
  }
  // If conflicting versions exist (eventual consistency):
  Response 200: {
    &amp;#34;key&amp;#34;: &amp;#34;user:123:profile&amp;#34;,
    &amp;#34;values&amp;#34;: [
      {&amp;#34;value&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;version&amp;#34;: {&amp;#34;node1&amp;#34;: 3}},
      {&amp;#34;value&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;version&amp;#34;: {&amp;#34;node2&amp;#34;: 2}}
    ],
    &amp;#34;conflict&amp;#34;: true
  }

// Delete a key
DELETE /kv/{key}
  Response 200: { &amp;#34;deleted&amp;#34;: true }
  // Internally: tombstone with TTL (not immediate physical delete)

// Scan keys by prefix
GET /kv?prefix=user:123:&amp;amp;limit=100
  Response 200: { &amp;#34;pairs&amp;#34;: [...], &amp;#34;cursor&amp;#34;: &amp;#34;...&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Client receives vector clock on write, passes it back on subsequent writes for conflict detection&lt;/li&gt;
&lt;li&gt;Conflicts surfaced to client (Dynamo-style) rather than hidden behind last-write-wins&lt;/li&gt;
&lt;li&gt;Delete uses tombstones — physical deletion happens during compaction&lt;/li&gt;
&lt;/ul&gt;
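&lt;p&gt;The conflict rule behind these decisions: clock A happened before clock B when no counter in A exceeds the matching counter in B; if each clock is ahead on some node, the writes are concurrent and both values are surfaced. A sketch under assumed names (not a real client API):&lt;/p&gt;

```python
def vc_compare(a, b):
    # Compare two vector clocks (dicts of node_id to counter).
    # Returns 'before', 'after', 'equal', or 'concurrent'.
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return 'concurrent'   # true conflict: surface both values to the client
    if a_ahead:
        return 'after'
    if b_ahead:
        return 'before'
    return 'equal'

def vc_merge(a, b):
    # Element-wise max, used after the client resolves a conflict.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```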
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;on-disk-storage-lsm-tree-per-node&#34;&gt;On-Disk Storage (LSM-Tree per node)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SSTable (Sorted String Table):
  ┌─────────────────────────────────┐
  │ Data Block 1 (4KB, sorted KVs)  │
  │ Data Block 2                    │
  │ ...                             │
  │ Data Block N                    │
  │ Index Block (key → block offset)│
  │ Bloom Filter Block              │
  │ Footer (offsets to index/filter)│
  └─────────────────────────────────┘

Each KV entry:
  key             | bytes (up to 256B)
  value           | bytes (up to 1MB)
  timestamp       | int64
  vector_clock    | map&amp;lt;node_id, counter&amp;gt;
  tombstone       | boolean
  ttl_expiry      | int64 (0 = no expiry)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;in-memory-structures-per-node&#34;&gt;In-Memory Structures (per node)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Memtable (Red-Black Tree or Skip List):
  - All recent writes, sorted by key
  - Flushed to SSTable when size exceeds 64MB

Write-Ahead Log (WAL):
  - Sequential append of every write
  - Replayed on crash recovery
  - Truncated after memtable flush

Bloom Filters:
  - One per SSTable, loaded in memory
  - 10 bits per key → ~1% false positive rate
  - Avoids reading SSTables that definitely don&amp;#39;t contain the key
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-lsm-tree-not-b-tree&#34;&gt;Why LSM-Tree (not B-Tree)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Write-optimized:&lt;/strong&gt; All writes are sequential appends (memtable → flush → SSTable). No random I/O on writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Reads may check multiple SSTables (L0, L1, &amp;hellip;, Ln). Mitigated by bloom filters, level compaction, and caching.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction:&lt;/strong&gt; Background process merges SSTables, removes tombstones, reduces read amplification.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Space amplification:&lt;/strong&gt; Leveled compaction keeps it at ~1.1x; size-tiered compaction can be ~2x but has better write throughput.&lt;/li&gt;
&lt;/ul&gt;
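&lt;p&gt;The 10-bits-per-key → ~1% figure quoted for the bloom filters follows from the standard formula: with b bits per key and the optimal k = b·ln 2 hash functions, the false-positive rate is (1 − e^(−k/b))^k ≈ 0.6185^b. A quick check (function name is illustrative):&lt;/p&gt;

```python
import math

def bloom_false_positive_rate(bits_per_key):
    # Optimal hash-function count k = bits_per_key * ln 2, rounded to an integer;
    # the resulting false-positive rate is (1 - e^(-k/b))^k, roughly 0.6185^b.
    k = max(1, round(bits_per_key * math.log(2)))
    return (1 - math.exp(-k / bits_per_key)) ** k
```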
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client SDK (consistent hashing + preference list)
  │
  │  Route to appropriate node(s)
  ▼
┌──────────────────────────────────────────────┐
│              Distributed KV Cluster           │
│                                               │
│  Node A (tokens: 0-90)                       │
│  ┌────────────────────────┐                  │
│  │ Request Handler        │                  │
│  │ ├─ Coordinator logic   │                  │
│  │ ├─ Replication manager │                  │
│  │ └─ Read repair         │                  │
│  │                        │                  │
│  │ Storage Engine         │                  │
│  │ ├─ Memtable (64MB)     │                  │
│  │ ├─ WAL                 │                  │
│  │ ├─ SSTables (L0..Ln)   │                  │
│  │ └─ Bloom filters       │                  │
│  └────────────────────────┘                  │
│                                               │
│  Node B (tokens: 91-180)   ...  Node F       │
│  (similar structure)                          │
│                                               │
│  Cross-cutting:                               │
│  ├─ Gossip Protocol (membership, failure)    │
│  ├─ Anti-Entropy (Merkle trees)              │
│  └─ Hinted Handoff (temporary failed nodes)  │
└──────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;write-path&#34;&gt;Write Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client
  → SDK hashes key → identifies coordinator node (owns token range)
  → Sends PUT to coordinator
  → Coordinator:
    1. Determine preference list: [Node A (primary), Node B (replica), Node C (replica)]
    2. Forward write to all 3 nodes in parallel
    3. Each node:
       a. Append to WAL (fsync for durability)
       b. Insert into Memtable
       c. Respond to coordinator
    4. Coordinator waits for W responses (configurable: W=1, W=2 quorum, W=3 all)
    5. Respond to client with vector clock
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;read-path&#34;&gt;Read Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client
  → SDK hashes key → identifies coordinator
  → Sends GET to coordinator
  → Coordinator:
    1. Send read to R nodes from preference list (R=1, R=2 quorum, R=3 all)
    2. Each node:
       a. Check Memtable → if found, return
       b. Check Bloom filters for each SSTable (L0 first, then L1, L2...)
       c. If bloom filter says &amp;#34;maybe present&amp;#34; → read SSTable index → read data block
       d. Return latest version (by vector clock)
    3. Coordinator:
       a. Compare versions from R responses
       b. If all match → return to client
       c. If conflict → return multiple values (client resolves) OR last-write-wins
       d. Trigger read repair (send latest version to stale nodes)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Nodes (100+):&lt;/strong&gt; Each node runs the storage engine (LSM-tree), handles coordinator logic, participates in gossip, and manages local replication.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistent Hash Ring:&lt;/strong&gt; 256 virtual nodes per physical node for balanced distribution. Token assignment stored in gossip state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gossip Protocol:&lt;/strong&gt; Every node pings 3 random nodes every second, exchanging membership state (heartbeats, token assignments, failure suspicions).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anti-Entropy Service:&lt;/strong&gt; Background process comparing Merkle trees between replica nodes to detect and repair divergence.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hinted Handoff Store:&lt;/strong&gt; When a target replica is down, the coordinator stores the write locally as a &amp;ldquo;hint&amp;rdquo; and replays it when the node recovers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction Manager:&lt;/strong&gt; Background threads running leveled or size-tiered compaction to merge SSTables and remove dead data.&lt;/li&gt;
&lt;/ol&gt;
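&lt;p&gt;Components 1 and 2 (virtual nodes on a hash ring, plus a preference list of distinct physical nodes) can be sketched briefly; the hash choice and class names here are assumptions:&lt;/p&gt;

```python
import bisect
import hashlib

class HashRing:
    # Illustrative consistent-hash ring: each physical node owns many virtual
    # tokens; a key belongs to the first n distinct nodes clockwise from its hash.
    def __init__(self, nodes, vnodes=256):
        self.nodes = list(nodes)
        self.ring = sorted(
            (self._hash(f'{node}:{i}'), node)
            for node in self.nodes for i in range(vnodes)
        )
        self.tokens = [token for token, _ in self.ring]

    @staticmethod
    def _hash(s):
        # 64-bit token from md5; the hash function is an assumption
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], 'big')

    def preference_list(self, key, n=3):
        n = min(n, len(self.nodes))
        i = bisect.bisect(self.tokens, self._hash(key))
        owners = []
        while n > len(owners):
            _, node = self.ring[i % len(self.ring)]
            if node not in owners:   # skip extra vnodes of a node already chosen
                owners.append(node)
            i += 1
        return owners
```

&lt;p&gt;Because each physical node owns many scattered tokens, adding a node steals a small slice from every existing node rather than splitting a single neighbor in half.&lt;/p&gt;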
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-consistent-hashing-and-data-partitioning&#34;&gt;Deep Dive 1: Consistent Hashing and Data Partitioning&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; How to distribute data across N nodes so that adding/removing a node only moves ~1/N of the data (not a full reshuffle)?&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Logging System (ELK/Splunk)</title>
      <link>https://chiraghasija.cc/designs/distributed-logging/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/distributed-logging/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Log Ingestion&lt;/strong&gt; — Collect structured and unstructured logs from 50K+ servers, containers, and serverless functions. Support push-based (agents forward logs) and pull-based (system scrapes endpoints) collection models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full-Text Search &amp;amp; Field-Based Filtering&lt;/strong&gt; — Users can search logs by arbitrary keywords (full-text), filter by structured fields (service name, log level, host, correlation ID), and scope by time range. Queries return results within seconds even over terabytes of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Live Tail (Real-Time Streaming)&lt;/strong&gt; — Provide a &lt;code&gt;tail -f&lt;/code&gt; equivalent where engineers can subscribe to a live stream of logs from a specific service, host, or filter expression. Latency from log emission to live tail display should be under 5 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alerting &amp;amp; Pattern Detection&lt;/strong&gt; — Define alert rules on log patterns (e.g., &amp;ldquo;more than 100 ERROR logs from payment-service in 5 minutes&amp;rdquo;). Support anomaly detection on log volume and error rates. Route alerts to PagerDuty, Slack, email.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retention &amp;amp; Lifecycle Management&lt;/strong&gt; — Configure per-tenant or per-service retention policies (e.g., 7 days hot, 30 days warm, 1 year cold, 7 years frozen for compliance). Automatic tiered storage migration and deletion.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability&lt;/strong&gt;: 99.99% uptime for ingestion (logs must never be dropped). Search can tolerate brief degradation (99.9%).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt;: Ingestion-to-searchable latency &amp;lt; 10 seconds (p99). Search queries return in &amp;lt; 5 seconds for recent data (last 24h), &amp;lt; 30 seconds for historical data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Eventual consistency is acceptable. A log written at time T may not appear in search results for up to 10 seconds. No strict ordering guarantees across services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt;: 50,000 servers, 2 million log lines/second sustained ingestion, 10M lines/sec burst. 1 PB of searchable hot/warm data. 10 PB in cold/frozen archival.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability&lt;/strong&gt;: Zero log loss once acknowledged by the ingestion pipeline. Replicated storage with 99.999999999% (11 nines) durability for archived data.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-path-ingestion&#34;&gt;Write Path (Ingestion)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Metric&lt;/th&gt;
          &lt;th&gt;Value&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Servers&lt;/td&gt;
          &lt;td&gt;50,000&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Avg log lines per server per second&lt;/td&gt;
          &lt;td&gt;40&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Sustained ingestion rate&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;2M lines/sec&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Burst ingestion rate (10x spikes)&lt;/td&gt;
          &lt;td&gt;10M lines/sec&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Avg log line size (JSON structured)&lt;/td&gt;
          &lt;td&gt;500 bytes&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Sustained throughput&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;1 GB/sec = 3.6 TB/hour = 86 TB/day&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Compression ratio (zstd)&lt;/td&gt;
          &lt;td&gt;~10:1&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Compressed daily storage&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;~8.6 TB/day&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;read-path-search&#34;&gt;Read Path (Search)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Metric&lt;/th&gt;
          &lt;th&gt;Value&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Concurrent search users&lt;/td&gt;
          &lt;td&gt;~500&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Search QPS&lt;/td&gt;
          &lt;td&gt;~200 queries/sec&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Live tail subscriptions&lt;/td&gt;
          &lt;td&gt;~1,000 concurrent streams&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Avg query scans&lt;/td&gt;
          &lt;td&gt;1–10 GB of index data per query&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;storage-tiers&#34;&gt;Storage Tiers&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Tier&lt;/th&gt;
          &lt;th&gt;Retention&lt;/th&gt;
          &lt;th&gt;Raw Data&lt;/th&gt;
          &lt;th&gt;Compressed&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Hot (SSD, fully indexed)&lt;/td&gt;
          &lt;td&gt;3 days&lt;/td&gt;
          &lt;td&gt;258 TB&lt;/td&gt;
          &lt;td&gt;~26 TB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Warm (HDD, fully indexed)&lt;/td&gt;
          &lt;td&gt;30 days&lt;/td&gt;
          &lt;td&gt;2.58 PB&lt;/td&gt;
          &lt;td&gt;~258 TB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Cold (object storage, partial index)&lt;/td&gt;
          &lt;td&gt;1 year&lt;/td&gt;
          &lt;td&gt;~31 PB&lt;/td&gt;
          &lt;td&gt;~3.1 PB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Frozen (object storage, metadata only)&lt;/td&gt;
          &lt;td&gt;7 years&lt;/td&gt;
          &lt;td&gt;archived&lt;/td&gt;
          &lt;td&gt;~20 PB&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;kafka-sizing&#34;&gt;Kafka Sizing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2M messages/sec × 500 bytes = 1 GB/sec.&lt;/li&gt;
&lt;li&gt;With replication factor 3, Kafka needs ~3 GB/sec disk write throughput.&lt;/li&gt;
&lt;li&gt;24-hour retention in Kafka (buffer for downstream failures): 86 TB × 3 replicas = 258 TB Kafka storage.&lt;/li&gt;
&lt;li&gt;~50 Kafka brokers (each handling ~60 MB/sec write throughput).&lt;/li&gt;
&lt;/ul&gt;
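&lt;p&gt;The sizing above can be restated as checkable arithmetic (parameter names are illustrative):&lt;/p&gt;

```python
import math

def kafka_sizing(msgs_per_sec, msg_bytes, replication=3, retention_hours=24,
                 broker_mb_per_sec=60):
    # Reproduce the Kafka sizing arithmetic above; names are illustrative.
    ingress_gb_s = msgs_per_sec * msg_bytes / 1e9
    disk_write_gb_s = ingress_gb_s * replication   # every byte written RF times
    storage_tb = ingress_gb_s * 3600 * retention_hours * replication / 1e3
    brokers = math.ceil(disk_write_gb_s * 1000 / broker_mb_per_sec)
    return ingress_gb_s, disk_write_gb_s, storage_tb, brokers
```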
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# --- Ingestion APIs ---

# Batch log ingestion (used by agents)
POST /v1/logs/ingest
Headers: X-Tenant-ID, X-API-Key, Content-Encoding: zstd
Body: {
  &amp;#34;logs&amp;#34;: [
    {
      &amp;#34;timestamp&amp;#34;: &amp;#34;2025-07-15T10:23:45.123Z&amp;#34;,
      &amp;#34;level&amp;#34;: &amp;#34;ERROR&amp;#34;,
      &amp;#34;service&amp;#34;: &amp;#34;payment-service&amp;#34;,
      &amp;#34;host&amp;#34;: &amp;#34;prod-web-042&amp;#34;,
      &amp;#34;trace_id&amp;#34;: &amp;#34;abc123def456&amp;#34;,
      &amp;#34;span_id&amp;#34;: &amp;#34;span-789&amp;#34;,
      &amp;#34;message&amp;#34;: &amp;#34;Failed to process payment&amp;#34;,
      &amp;#34;metadata&amp;#34;: { &amp;#34;user_id&amp;#34;: &amp;#34;u-123&amp;#34;, &amp;#34;order_id&amp;#34;: &amp;#34;ord-456&amp;#34;, &amp;#34;error_code&amp;#34;: &amp;#34;TIMEOUT&amp;#34; }
    }
  ]
}
Response: 202 Accepted { &amp;#34;ingested&amp;#34;: 150, &amp;#34;failed&amp;#34;: 0 }

# --- Search APIs ---

# Full-text and field-based search
POST /v1/logs/search
Body: {
  &amp;#34;query&amp;#34;: &amp;#34;Failed to process payment&amp;#34;,
  &amp;#34;filters&amp;#34;: {
    &amp;#34;service&amp;#34;: &amp;#34;payment-service&amp;#34;,
    &amp;#34;level&amp;#34;: [&amp;#34;ERROR&amp;#34;, &amp;#34;WARN&amp;#34;],
    &amp;#34;time_range&amp;#34;: { &amp;#34;from&amp;#34;: &amp;#34;2025-07-15T10:00:00Z&amp;#34;, &amp;#34;to&amp;#34;: &amp;#34;2025-07-15T11:00:00Z&amp;#34; }
  },
  &amp;#34;sort&amp;#34;: &amp;#34;timestamp:desc&amp;#34;,
  &amp;#34;limit&amp;#34;: 100,
  &amp;#34;cursor&amp;#34;: &amp;#34;eyJsYXN0X3RzIjoxNjg...&amp;#34;
}
Response: { &amp;#34;hits&amp;#34;: [...], &amp;#34;total&amp;#34;: 4521, &amp;#34;next_cursor&amp;#34;: &amp;#34;...&amp;#34; }

# Live tail — WebSocket
WS /v1/logs/tail?service=payment-service&amp;amp;level=ERROR
→ Server pushes matching log lines in real time

# Aggregation query (for dashboards)
POST /v1/logs/aggregate
Body: {
  &amp;#34;filters&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;payment-service&amp;#34;, &amp;#34;time_range&amp;#34;: { &amp;#34;last&amp;#34;: &amp;#34;1h&amp;#34; } },
  &amp;#34;group_by&amp;#34;: [&amp;#34;level&amp;#34;],
  &amp;#34;interval&amp;#34;: &amp;#34;1m&amp;#34;,
  &amp;#34;metric&amp;#34;: &amp;#34;count&amp;#34;
}

# --- Alert APIs ---

POST /v1/alerts/rules
Body: {
  &amp;#34;name&amp;#34;: &amp;#34;Payment errors spike&amp;#34;,
  &amp;#34;query&amp;#34;: &amp;#34;level:ERROR AND service:payment-service&amp;#34;,
  &amp;#34;condition&amp;#34;: { &amp;#34;threshold&amp;#34;: 100, &amp;#34;window&amp;#34;: &amp;#34;5m&amp;#34;, &amp;#34;operator&amp;#34;: &amp;#34;&amp;gt;&amp;#34; },
  &amp;#34;actions&amp;#34;: [{ &amp;#34;type&amp;#34;: &amp;#34;pagerduty&amp;#34;, &amp;#34;severity&amp;#34;: &amp;#34;critical&amp;#34; }]
}
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch ingestion&lt;/strong&gt; over single-line writes — amortizes network overhead, enables compression. Agents batch locally (every 1–5 seconds or 1000 lines, whichever comes first).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cursor-based pagination&lt;/strong&gt; instead of offset-based — handles the append-heavy, time-sorted nature of log data without expensive deep-page queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WebSocket for live tail&lt;/strong&gt; — HTTP long-polling would waste connections. WebSocket allows server-push with low latency. Each subscription is a filtered Kafka consumer under the hood.&lt;/li&gt;
&lt;/ul&gt;
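&lt;p&gt;The agent-side batching policy (flush every 1–5 seconds or 1000 lines, whichever comes first) is simple to sketch; the &lt;code&gt;LogBatcher&lt;/code&gt; name and callback shape are assumptions:&lt;/p&gt;

```python
import time

class LogBatcher:
    # Agent-side batching sketch: flush when the batch reaches max_lines
    # or when max_age_seconds has elapsed, whichever comes first.
    def __init__(self, send, max_lines=1000, max_age_seconds=5.0, clock=time.monotonic):
        self.send = send              # callable that ships a list of log lines
        self.max_lines = max_lines
        self.max_age = max_age_seconds
        self.clock = clock
        self.buf = []
        self.oldest = None            # timestamp of the oldest buffered line

    def add(self, line):
        if not self.buf:
            self.oldest = self.clock()
        self.buf.append(line)
        if len(self.buf) >= self.max_lines:
            self.flush()

    def tick(self):
        # Called periodically by the agent's timer loop.
        if self.buf and self.clock() - self.oldest >= self.max_age:
            self.flush()

    def flush(self):
        if self.buf:
            self.send(self.buf)
            self.buf = []
            self.oldest = None
```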
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;log-document-elasticsearch--clickhouse&#34;&gt;Log Document (Elasticsearch / ClickHouse)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Field&lt;/th&gt;
          &lt;th&gt;Type&lt;/th&gt;
          &lt;th&gt;Indexed&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;UUID&lt;/td&gt;
          &lt;td&gt;Primary key&lt;/td&gt;
          &lt;td&gt;Generated at ingestion, used for dedup&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;timestamp&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;DateTime64(ms)&lt;/td&gt;
          &lt;td&gt;Yes (sort key)&lt;/td&gt;
          &lt;td&gt;Nanosecond precision stored, ms for indexing&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;tenant_id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String&lt;/td&gt;
          &lt;td&gt;Yes (partition key)&lt;/td&gt;
          &lt;td&gt;Tenant isolation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;service&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String (keyword)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Exact match, not analyzed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;host&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String (keyword)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Server hostname&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;level&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Enum (TRACE/DEBUG/INFO/WARN/ERROR/FATAL)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Stored as uint8 internally&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;trace_id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String (keyword)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;For distributed tracing correlation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;span_id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String (keyword)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Links to specific trace span&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;message&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Text (analyzed)&lt;/td&gt;
          &lt;td&gt;Yes (full-text)&lt;/td&gt;
          &lt;td&gt;Inverted index for search&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;metadata&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;JSON / Map(String, String)&lt;/td&gt;
          &lt;td&gt;Selective&lt;/td&gt;
          &lt;td&gt;Dynamic fields, selectively indexed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;_raw&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String&lt;/td&gt;
          &lt;td&gt;No&lt;/td&gt;
          &lt;td&gt;Original log line, stored but not indexed&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;index-partitioning-strategy&#34;&gt;Index Partitioning Strategy&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Index naming: logs-{tenant_id}-{YYYY.MM.DD}-{shard_number}

Example: logs-payment-team-2025.07.15-003

Daily indices → easy to delete/archive entire days
Tenant prefix → physical isolation for noisy neighbors
Shard number → distribute within a day (target 30-50 GB per shard)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-elasticsearch--clickhouse-hybrid&#34;&gt;Why Elasticsearch + ClickHouse Hybrid?&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Engine&lt;/th&gt;
          &lt;th&gt;Use Case&lt;/th&gt;
          &lt;th&gt;Strength&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Elasticsearch&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Full-text search, live tail, ad-hoc queries&lt;/td&gt;
          &lt;td&gt;Inverted index excels at arbitrary text search. Near real-time indexing (refresh every 1s).&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Aggregations, dashboards, analytics, alerting&lt;/td&gt;
          &lt;td&gt;Columnar storage gives 10-100x faster GROUP BY, COUNT, and time-series aggregations. Excellent compression.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;S3 / GCS&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Cold &amp;amp; frozen storage&lt;/td&gt;
          &lt;td&gt;11 nines durability, ~$0.023/GB/month vs $0.10/GB for SSD. Parquet format for occasional queries.&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
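&lt;p&gt;The index-naming scheme from the partitioning section above can be wrapped in a small helper (a sketch; the function name is illustrative):&lt;/p&gt;

```python
from datetime import date

def index_name(tenant_id, day, shard):
    """Builds an index name per the scheme above:
    logs-{tenant_id}-{YYYY.MM.DD}-{shard_number}."""
    return f"logs-{tenant_id}-{day:%Y.%m.%d}-{shard:03d}"

print(index_name("payment-team", date(2025, 7, 15), 3))
# logs-payment-team-2025.07.15-003
```

&lt;p&gt;Expiring a whole day of a tenant&amp;#39;s logs then becomes a delete of the matching daily indices, which is what makes this layout cheap to archive.&lt;/p&gt;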
&lt;p&gt;Both engines are populated from the same Kafka topics. Elasticsearch handles interactive search. ClickHouse handles dashboard queries and alert rule evaluation. Cold data is written to object storage in compressed Parquet with only metadata indexed in a lightweight catalog (e.g., Hive Metastore or Iceberg).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Message Queue (RabbitMQ)</title>
      <link>https://chiraghasija.cc/designs/rabbitmq/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/rabbitmq/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Producers publish messages to named exchanges, which route messages to queues based on routing rules&lt;/li&gt;
&lt;li&gt;Consumers subscribe to queues and receive messages with acknowledgment-based delivery (message stays in queue until acked)&lt;/li&gt;
&lt;li&gt;Support multiple exchange types: direct (exact routing key match), topic (pattern matching), fanout (broadcast to all bound queues)&lt;/li&gt;
&lt;li&gt;Dead letter queues: messages that fail processing after N retries are moved to a DLQ for inspection&lt;/li&gt;
&lt;li&gt;Message persistence: critical messages survive broker restarts (durable queues + persistent messages)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the message queue is infrastructure that other services depend on. Downtime cascades.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5ms for message publish (broker acknowledges to producer). &amp;lt; 1ms for message delivery to a connected consumer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ordering:&lt;/strong&gt; Messages within a single queue are delivered in FIFO order. No ordering guarantees across queues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delivery guarantees:&lt;/strong&gt; At-least-once by default (ack-based). At-most-once available (auto-ack). Exactly-once achievable with idempotent consumers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100K messages/sec ingestion, 10,000 queues, 50,000 consumers, 10M messages in flight at any time&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Message publish rate: 100K messages/sec&lt;/li&gt;
&lt;li&gt;Average message size: 2KB (headers + body)&lt;/li&gt;
&lt;li&gt;Ingestion bandwidth: 100K x 2KB = &lt;strong&gt;200 MB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Consumer delivery rate: 100K messages/sec (steady state: publish rate = consume rate)&lt;/li&gt;
&lt;li&gt;With replication (RF=2): internal bandwidth = 200 MB/sec additional&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Messages in flight (unconsumed): 10M messages x 2KB = &lt;strong&gt;20 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak backlog (consumer down for 1 hour): 100K/sec x 3600 x 2KB = &lt;strong&gt;720 GB&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;This is the sizing case for disk: must handle 1-hour consumer outages without data loss&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Per broker (6 brokers, RF=2): ~7 GB in-flight + up to ~240 GB backlog per broker during incidents&lt;/li&gt;
&lt;li&gt;Disk: 500 GB SSD per broker with headroom&lt;/li&gt;
&lt;/ul&gt;
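&lt;p&gt;A quick arithmetic check of the sizing above (decimal units; the 6-broker count and RF=2 come from the estimates, the variable names are mine):&lt;/p&gt;

```python
MSG_RATE = 100_000      # messages/sec
MSG_SIZE = 2_000        # bytes (2 KB, decimal)
BROKERS = 6
RF = 2                  # replication factor

ingest_mb_s = MSG_RATE * MSG_SIZE / 1e6          # publish bandwidth, MB/sec
inflight_gb = 10_000_000 * MSG_SIZE / 1e9        # 10M unconsumed messages
backlog_gb = MSG_RATE * 3600 * MSG_SIZE / 1e9    # 1-hour consumer outage

# Each message is stored RF times, spread evenly across the brokers.
per_broker_gb = (inflight_gb + backlog_gb) * RF / BROKERS
print(ingest_mb_s, backlog_gb, round(per_broker_gb))  # 200.0 720.0 247
```

&lt;p&gt;Roughly 250 GB per broker in the worst case is what justifies 500 GB SSDs with about 2x headroom.&lt;/p&gt;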
&lt;h3 id=&#34;memory&#34;&gt;Memory&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Message metadata index: 10M messages x 100 bytes = &lt;strong&gt;1 GB&lt;/strong&gt; — fits easily in RAM&lt;/li&gt;
&lt;li&gt;Queue metadata: 10,000 queues x 1KB = &lt;strong&gt;10 MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Consumer connection state: 50,000 consumers x 2KB = &lt;strong&gt;100 MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Total memory per broker: ~8 GB for metadata + OS page cache for message data&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;publisher-api&#34;&gt;Publisher API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Declare an exchange
PUT /exchanges/{vhost}/{exchange_name}
  Body: {
    &amp;#34;type&amp;#34;: &amp;#34;topic&amp;#34;,                    // direct, topic, fanout, headers
    &amp;#34;durable&amp;#34;: true,                    // survive broker restart
    &amp;#34;auto_delete&amp;#34;: false                // delete when no queues bound
  }

// Publish a message
POST /exchanges/{vhost}/{exchange_name}/publish
  Body: {
    &amp;#34;routing_key&amp;#34;: &amp;#34;order.created&amp;#34;,
    &amp;#34;properties&amp;#34;: {
      &amp;#34;delivery_mode&amp;#34;: 2,              // 1=transient, 2=persistent
      &amp;#34;content_type&amp;#34;: &amp;#34;application/json&amp;#34;,
      &amp;#34;message_id&amp;#34;: &amp;#34;msg_abc123&amp;#34;,      // for deduplication
      &amp;#34;correlation_id&amp;#34;: &amp;#34;req_xyz&amp;#34;,     // for RPC-style patterns
      &amp;#34;expiration&amp;#34;: &amp;#34;60000&amp;#34;,           // TTL in milliseconds
      &amp;#34;headers&amp;#34;: {&amp;#34;x-retry-count&amp;#34;: 0}
    },
    &amp;#34;payload&amp;#34;: &amp;#34;{\&amp;#34;order_id\&amp;#34;: 12345, \&amp;#34;amount\&amp;#34;: 99.99}&amp;#34;
  }
  Response 200: { &amp;#34;routed&amp;#34;: true }     // message was routed to at least one queue
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;consumer-api-amqp-protocol-shown-as-pseudo-rest&#34;&gt;Consumer API (AMQP protocol, shown as pseudo-REST)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Declare a queue
PUT /queues/{vhost}/{queue_name}
  Body: {
    &amp;#34;durable&amp;#34;: true,
    &amp;#34;exclusive&amp;#34;: false,                // exclusive to this connection
    &amp;#34;auto_delete&amp;#34;: false,
    &amp;#34;arguments&amp;#34;: {
      &amp;#34;x-dead-letter-exchange&amp;#34;: &amp;#34;dlx&amp;#34;,
      &amp;#34;x-dead-letter-routing-key&amp;#34;: &amp;#34;dead.order&amp;#34;,
      &amp;#34;x-message-ttl&amp;#34;: 300000,        // 5 minutes
      &amp;#34;x-max-length&amp;#34;: 1000000,        // max 1M messages
      &amp;#34;x-max-priority&amp;#34;: 10            // enable priority queue
    }
  }

// Bind queue to exchange
POST /bindings/{vhost}/e/{exchange}/q/{queue}
  Body: { &amp;#34;routing_key&amp;#34;: &amp;#34;order.*&amp;#34; }   // pattern for topic exchange

// Consume messages (long-lived connection, push-based)
// In practice: AMQP channel with basic.consume
// Simplified:
GET /queues/{vhost}/{queue_name}/get?count=10&amp;amp;ack_mode=manual
  Response 200: {
    &amp;#34;messages&amp;#34;: [
      {
        &amp;#34;delivery_tag&amp;#34;: 1,             // unique per channel, used for ack/nack
        &amp;#34;exchange&amp;#34;: &amp;#34;orders&amp;#34;,
        &amp;#34;routing_key&amp;#34;: &amp;#34;order.created&amp;#34;,
        &amp;#34;properties&amp;#34;: {...},
        &amp;#34;payload&amp;#34;: &amp;#34;{\&amp;#34;order_id\&amp;#34;: 12345, ...}&amp;#34;,
        &amp;#34;redelivered&amp;#34;: false
      },
      ...
    ]
  }

// Acknowledge message (consumed successfully)
POST /queues/{vhost}/{queue_name}/ack
  Body: { &amp;#34;delivery_tag&amp;#34;: 1, &amp;#34;multiple&amp;#34;: false }

// Negative acknowledge (processing failed, requeue or dead-letter)
POST /queues/{vhost}/{queue_name}/nack
  Body: { &amp;#34;delivery_tag&amp;#34;: 1, &amp;#34;requeue&amp;#34;: false }  // false → route to DLQ
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Push-based delivery (broker pushes to consumers) — lower latency than polling, consumer prefetch controls flow&lt;/li&gt;
&lt;li&gt;Manual acknowledgment by default — messages are not removed until consumer confirms processing success&lt;/li&gt;
&lt;li&gt;DLQ routing on nack — failed messages automatically move to dead letter queue for debugging&lt;/li&gt;
&lt;/ul&gt;
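&lt;p&gt;A toy model of the acknowledgment semantics above (illustrative only, not the AMQP wire protocol): a delivered message is held as unacked until the consumer acks it, and a nack with requeue=false dead-letters it.&lt;/p&gt;

```python
from collections import deque

class ToyQueue:
    """Models the ack semantics above: a delivered message stays 'unacked'
    until the consumer acks; nack with requeue=False dead-letters it."""
    def __init__(self):
        self.ready = deque()     # messages waiting for delivery
        self.unacked = {}        # delivery_tag to message, in flight
        self.dlq = []            # dead-lettered messages
        self._tag = 0
    def publish(self, payload):
        self.ready.append(payload)
    def deliver(self):
        self._tag += 1
        self.unacked[self._tag] = self.ready.popleft()
        return self._tag         # delivery_tag, used for ack/nack
    def ack(self, tag):
        del self.unacked[tag]    # only now is the message safe to drop
    def nack(self, tag, requeue=False):
        msg = self.unacked.pop(tag)
        if requeue:
            self.ready.appendleft(msg)   # back at the head of the queue
        else:
            self.dlq.append(msg)         # route to the dead letter queue

q = ToyQueue()
q.publish("order-1"); q.publish("order-2")
q.ack(q.deliver())                  # consumed successfully
q.nack(q.deliver(), requeue=False)  # processing failed
print(q.dlq)                        # ['order-2']
```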
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;message-on-disk-and-in-flight&#34;&gt;Message (on-disk and in-flight)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Message:
  message_id       | uuid          -- publisher-assigned or broker-generated
  exchange         | string
  routing_key      | string
  body             | bytes         -- up to 128MB (configurable)
  properties:
    delivery_mode  | int (1 or 2)  -- transient vs persistent
    content_type   | string
    correlation_id | string
    reply_to       | string        -- for RPC pattern
    expiration     | string        -- TTL
    timestamp      | int64
    priority       | int (0-255)
    headers        | map&amp;lt;str,any&amp;gt;
  metadata:
    publish_seq    | int64         -- publisher confirm sequence number
    queue_position | int64         -- position in queue (for ordering)
    delivery_count | int           -- number of delivery attempts
    first_death_exchange | string  -- for DLQ: original exchange
    first_death_queue    | string  -- for DLQ: original queue
    first_death_reason   | string  -- rejected, expired, maxlen
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;queue-state-per-queue-in-memory--wal&#34;&gt;Queue State (per queue, in memory + WAL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Queue:
  name             | string (PK)
  vhost            | string
  durable          | boolean
  state:
    messages_ready      | int     -- messages waiting for delivery
    messages_unacked    | int     -- delivered but not yet acknowledged
    messages_total      | int
    consumers           | int     -- number of active consumers
    head_position       | int64   -- next message to deliver
    tail_position       | int64   -- where next publish lands

Message Index (in-memory):
  TreeMap&amp;lt;queue_position, MessageRef&amp;gt;
  MessageRef: {store_offset, size, expiry, priority}

  For priority queues: use a priority heap instead of a simple FIFO
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;persistent-message-store-per-broker-disk&#34;&gt;Persistent Message Store (per broker, disk)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Message Store (append-only segments):
  ┌──────────────────────────────┐
  │ Segment 1 (16MB)             │
  │  [msg1][msg2][msg3]...       │
  │ Segment 2 (16MB)             │
  │  [msg4][msg5]...             │
  └──────────────────────────────┘

  On publish (persistent msg):
    1. Append message body to current segment
    2. fsync (or batch fsync every 100ms for throughput)
    3. Add to queue&amp;#39;s in-memory index

  On acknowledge:
    1. Mark message as acked in index
    2. Segment compaction: when all messages in a segment are acked, delete segment

Bindings (in metadata store):
  Exchange → Queue mappings:
    (&amp;#34;orders&amp;#34;, &amp;#34;topic&amp;#34;) → [
      (&amp;#34;order-processing-queue&amp;#34;, &amp;#34;order.created&amp;#34;),
      (&amp;#34;analytics-queue&amp;#34;, &amp;#34;order.*&amp;#34;),
      (&amp;#34;audit-queue&amp;#34;, &amp;#34;#&amp;#34;)       // # matches everything in topic exchange
    ]
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-these-choices&#34;&gt;Why These Choices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Append-only segment files&lt;/strong&gt; — high write throughput, sequential I/O. Similar to Kafka&amp;rsquo;s log segments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In-memory index + on-disk messages&lt;/strong&gt; — index is small (100 bytes per message), messages can be large (KBs). Hot messages served from page cache, cold messages from disk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per-queue ordering&lt;/strong&gt; — each queue is an independent FIFO (or priority queue). No cross-queue coordination needed.&lt;/li&gt;
&lt;/ul&gt;
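&lt;p&gt;The append-then-compact lifecycle can be sketched as a toy in-memory store (real brokers append to disk segments and fsync; the class and method names are illustrative):&lt;/p&gt;

```python
class SegmentStore:
    """Toy append-only store: messages fill fixed-size segments, and a
    segment is deleted once every message in it has been acknowledged."""
    def __init__(self, segment_capacity=3):
        self.capacity = segment_capacity
        self.segments = []               # each segment maps msg_id to an acked flag
    def append(self, msg_id):
        if not self.segments or len(self.segments[-1]) == self.capacity:
            self.segments.append({})     # roll over to a new segment
        self.segments[-1][msg_id] = False
    def ack(self, msg_id):
        for seg in self.segments:
            if msg_id in seg:
                seg[msg_id] = True
        # compaction: drop any segment whose messages are all acked
        self.segments = [s for s in self.segments if not all(s.values())]

store = SegmentStore(segment_capacity=3)
for m in ["m1", "m2", "m3", "m4"]:
    store.append(m)                      # two segments: [m1,m2,m3] and [m4]
for m in ["m1", "m2", "m3"]:
    store.ack(m)                         # first segment fully acked, deleted
print(len(store.segments))               # 1
```

&lt;p&gt;Deleting whole segments instead of individual messages is why acks never trigger random-access disk writes.&lt;/p&gt;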
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Producers                              Consumers
  │                                       ▲
  │ AMQP / HTTP                           │ AMQP (push)
  ▼                                       │
┌──────────────────────────────────────────────────────┐
│                    Load Balancer                     │
│              (TCP passthrough, sticky)               │
└────────────────────────┬─────────────────────────────┘
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  Broker 1    │ │  Broker 2    │ │  Broker 3    │
│              │ │              │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Exchange │ │ │ │ Exchange │ │ │ │ Exchange │ │
│ │ Router   │ │ │ │ Router   │ │ │ │ Router   │ │
│ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │
│      │       │ │      │       │ │      │       │
│ ┌────▼─────┐ │ │ ┌────▼─────┐ │ │ ┌────▼─────┐ │
│ │ Queue    │ │ │ │ Queue    │ │ │ │ Queue    │ │
│ │ Manager  │ │ │ │ Manager  │ │ │ │ Manager  │ │
│ │          │ │ │ │          │ │ │ │          │ │
│ │ Q1 (lead)│ │ │ │ Q1 (mirr)│ │ │ │ Q4 (lead)│ │
│ │ Q2 (lead)│ │ │ │ Q3 (lead)│ │ │ │ Q5 (lead)│ │
│ │ Q3 (mirr)│ │ │ │ Q4 (mirr)│ │ │ │ Q2 (mirr)│ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
│              │ │              │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Message  │ │ │ │ Message  │ │ │ │ Message  │ │
│ │ Store    │ │ │ │ Store    │ │ │ │ Store    │ │
│ │ (Disk)   │ │ │ │ (Disk)   │ │ │ │ (Disk)   │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
└──────────────┘ └──────────────┘ └──────────────┘
        │                │                │
        └────────────────┼────────────────┘
                         │
              ┌──────────▼──────────┐
              │ Cluster Coordinator │
              │ (Raft / mnesia)     │
              │ - Queue leadership  │
              │ - Membership        │
              │ - Exchange/binding  │
              │   metadata          │
              └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;publish-flow&#34;&gt;Publish Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Producer → Broker (any broker via load balancer)
  │
  ▼
1. Receive AMQP basic.publish frame
2. Exchange Router:
   a. Look up exchange by name
   b. Get all bindings for this exchange
   c. Match routing_key against binding patterns:
      - Direct: exact match (routing_key == binding_key)
      - Topic: pattern match (&amp;#34;order.created&amp;#34; matches &amp;#34;order.*&amp;#34; and &amp;#34;#&amp;#34;)
      - Fanout: all bound queues (ignore routing key)
   d. Result: list of queues to route to

3. For each target queue:
   a. If queue leader is on THIS broker:
      → Append to local message store (disk write if persistent)
      → Add to queue&amp;#39;s in-memory index
      → Replicate to mirror brokers (sync or async based on config)
   b. If queue leader is on ANOTHER broker:
      → Forward message to that broker (internal cluster protocol)

4. Publisher confirm:
   → If publisher confirms enabled: send basic.ack to producer with sequence number
   → Only after message is persisted + replicated to mirrors (if ha-mode=all)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;consume-flow&#34;&gt;Consume Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Consumer → Broker (connects to any broker, but deliveries come from queue leader)
  │
  ▼
1. Consumer subscribes: basic.consume(queue=&amp;#34;order-processing&amp;#34;)
2. Broker locates queue leader:
   a. If leader is on THIS broker: deliver directly
   b. If leader is on another broker: proxy the channel to leader broker

3. Queue leader delivers messages:
   a. Check prefetch limit (QoS): consumer has capacity for more messages?
   b. If yes: pop next message from queue head
   c. Mark message as &amp;#34;unacked&amp;#34; (in flight to consumer)
   d. Send basic.deliver frame to consumer
   e. Start ack timeout timer (30 seconds default)

4. Consumer processes message:
   a. Success → basic.ack(delivery_tag)
      → Broker removes message from queue and disk
   b. Failure → basic.nack(delivery_tag, requeue=false)
      → Broker routes to dead letter exchange (if configured)
      → Or basic.nack(delivery_tag, requeue=true) → put back at head of queue

5. Timeout (consumer doesn&amp;#39;t ack within 30s):
   → Message requeued (redelivered flag set to true)
   → Delivered to same or different consumer
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Brokers (3-6):&lt;/strong&gt; Each broker handles connections, routing, queue management, and message storage. Queues are distributed across brokers — each queue has a &amp;ldquo;leader&amp;rdquo; broker.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exchange Router:&lt;/strong&gt; In-memory routing table. Evaluates routing rules per exchange type. Direct exchange: O(1) hash lookup. Topic exchange: O(B) where B = number of bindings (trie-based optimization available). Fanout: O(Q) where Q = bound queues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Queue Manager:&lt;/strong&gt; Manages queue state: ready messages, unacked messages, consumers, prefetch counters. One leader per queue, with mirrors on other brokers for HA.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Message Store:&lt;/strong&gt; Per-broker append-only segment files. Persistent messages fsync&amp;rsquo;d to disk. Segment compaction removes fully-acknowledged messages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cluster Coordinator:&lt;/strong&gt; Manages cluster membership, queue leader election, exchange/binding metadata. Uses Raft consensus (RabbitMQ 3.x used Mnesia/Erlang distribution).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Management UI/API:&lt;/strong&gt; HTTP API for monitoring: queue depths, message rates, consumer counts, connection status. Grafana dashboards for operations.&lt;/li&gt;
&lt;/ol&gt;
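&lt;p&gt;The topic-exchange matching rules from the publish flow (&lt;code&gt;*&lt;/code&gt; matches exactly one dot-separated word, &lt;code&gt;#&lt;/code&gt; matches zero or more) can be sketched as follows; a minimal illustration, not RabbitMQ&amp;#39;s trie-based implementation:&lt;/p&gt;

```python
def topic_match(pattern, key):
    """AMQP-style topic matching: '*' matches exactly one dot-separated
    word, '#' matches zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' absorbs zero or more words of the key
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return (p[0] == "*" or p[0] == k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), key.split("."))

# The bindings from the data-model section:
assert topic_match("order.*", "order.created")
assert topic_match("#", "order.created.v2")
assert not topic_match("order.*", "order.created.v2")  # '*' is one word only
```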
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-delivery-guarantees--at-least-once-at-most-once-exactly-once&#34;&gt;Deep Dive 1: Delivery Guarantees — At-Least-Once, At-Most-Once, Exactly-Once&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;At-most-once delivery:&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Metrics Logging and Aggregation System</title>
      <link>https://chiraghasija.cc/designs/metrics-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/metrics-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Ingest metrics from thousands of services (CPU, memory, request latency, error rate, custom business metrics)&lt;/li&gt;
&lt;li&gt;Store metrics with high granularity (per-second) for recent data, lower granularity (per-minute, per-hour) for historical data&lt;/li&gt;
&lt;li&gt;Query metrics: time-range aggregations (avg, sum, p50, p95, p99, max, min)&lt;/li&gt;
&lt;li&gt;Real-time dashboards with auto-refresh (&amp;lt; 30 second data freshness)&lt;/li&gt;
&lt;li&gt;Alerting: trigger alerts when metrics cross thresholds (e.g., p99 latency &amp;gt; 500ms for 5 minutes)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% for ingestion (losing metrics is acceptable during brief outages — we don&amp;rsquo;t want to lose all data, but a few seconds of data loss during failover is tolerable). 99.99% for querying (dashboards must be up during incidents — that&amp;rsquo;s when they&amp;rsquo;re needed most).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Ingestion → queryable in &amp;lt; 30 seconds. Dashboard queries: simple queries &amp;lt; 500ms, complex aggregations &amp;lt; 5 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M metrics data points/sec ingestion. 1 year retention at full granularity, 3 years at reduced granularity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Metrics data should survive individual node failures. Total loss acceptable only for catastrophic events.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ingestion&#34;&gt;Ingestion&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M data points/sec&lt;/li&gt;
&lt;li&gt;Each data point: metric_name (100B) + tags (200B) + timestamp (8B) + value (8B) = ~316 bytes → round to 300 bytes&lt;/li&gt;
&lt;li&gt;10M × 300 bytes = &lt;strong&gt;3GB/sec&lt;/strong&gt; ingestion throughput&lt;/li&gt;
&lt;li&gt;Per day: &lt;strong&gt;~260TB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Raw data at 1-second granularity: 260TB/day&lt;/li&gt;
&lt;li&gt;With compression (time-series data compresses well): ~10:1 → &lt;strong&gt;26TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;1 year retention: &lt;strong&gt;~9.5PB&lt;/strong&gt; compressed&lt;/li&gt;
&lt;li&gt;Rollup data (1-min, 1-hour granularity): ~1% of raw → additional ~100TB/year&lt;/li&gt;
&lt;/ul&gt;
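&lt;p&gt;A quick check of the ingestion and storage arithmetic above (decimal units; the 10:1 compression ratio is the assumption stated in the estimates, the variable names are mine):&lt;/p&gt;

```python
DPS = 10_000_000                 # data points per second
POINT_BYTES = 300                # rounded per-point size from above

ingest_gb_s = DPS * POINT_BYTES / 1e9        # 3.0 GB/sec
raw_tb_day = ingest_gb_s * 86_400 / 1e3      # ~259 TB/day raw
compressed_tb_day = raw_tb_day / 10          # ~26 TB/day at 10:1
year_pb = compressed_tb_day * 365 / 1e3      # ~9.5 PB for 1-year retention
print(round(raw_tb_day), round(compressed_tb_day), round(year_pb, 1))
```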
&lt;h3 id=&#34;query-patterns&#34;&gt;Query patterns&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Dashboard queries: typically 1-6 hour time ranges, 5-10 metrics&lt;/li&gt;
&lt;li&gt;Alert evaluation: thousands of alert rules evaluated every 60 seconds&lt;/li&gt;
&lt;li&gt;Ad-hoc queries: arbitrary time ranges, complex aggregations&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ingestion-api&#34;&gt;Ingestion API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/metrics
  Body: [
    {
      &amp;#34;metric&amp;#34;: &amp;#34;http.request.latency&amp;#34;,
      &amp;#34;tags&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;api-gateway&amp;#34;, &amp;#34;endpoint&amp;#34;: &amp;#34;/v1/users&amp;#34;, &amp;#34;method&amp;#34;: &amp;#34;GET&amp;#34;, &amp;#34;region&amp;#34;: &amp;#34;us-east-1&amp;#34; },
      &amp;#34;timestamp&amp;#34;: 1708632000,
      &amp;#34;value&amp;#34;: 45.2
    },
    {
      &amp;#34;metric&amp;#34;: &amp;#34;http.request.count&amp;#34;,
      &amp;#34;tags&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;api-gateway&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;200&amp;#34; },
      &amp;#34;timestamp&amp;#34;: 1708632000,
      &amp;#34;value&amp;#34;: 1
    }
  ]
  Response 202 Accepted

// StatsD/Prometheus-compatible push
UDP /statsd (fire-and-forget, lossy but fast)
  Payload: &amp;#34;http.request.latency:45.2|ms|#service:api-gateway,endpoint:/v1/users&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;query-api&#34;&gt;Query API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/query
  Body: {
    &amp;#34;metric&amp;#34;: &amp;#34;http.request.latency&amp;#34;,
    &amp;#34;aggregation&amp;#34;: &amp;#34;p99&amp;#34;,
    &amp;#34;tags&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;api-gateway&amp;#34; },
    &amp;#34;from&amp;#34;: &amp;#34;2026-02-22T12:00:00Z&amp;#34;,
    &amp;#34;to&amp;#34;: &amp;#34;2026-02-22T18:00:00Z&amp;#34;,
    &amp;#34;interval&amp;#34;: &amp;#34;1m&amp;#34;       // rollup interval
  }
  Response 200: {
    &amp;#34;series&amp;#34;: [
      { &amp;#34;timestamp&amp;#34;: 1708603200, &amp;#34;value&amp;#34;: 42.3 },
      { &amp;#34;timestamp&amp;#34;: 1708603260, &amp;#34;value&amp;#34;: 45.1 },
      ...
    ]
  }

// Multi-metric query (for dashboards)
POST /api/v1/query/batch
  Body: {
    &amp;#34;queries&amp;#34;: [
      { &amp;#34;metric&amp;#34;: &amp;#34;cpu.usage&amp;#34;, &amp;#34;aggregation&amp;#34;: &amp;#34;avg&amp;#34;, &amp;#34;tags&amp;#34;: {...}, ... },
      { &amp;#34;metric&amp;#34;: &amp;#34;memory.usage&amp;#34;, &amp;#34;aggregation&amp;#34;: &amp;#34;max&amp;#34;, &amp;#34;tags&amp;#34;: {...}, ... }
    ]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;alert-api&#34;&gt;Alert API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/alerts
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;High API Latency&amp;#34;,
    &amp;#34;metric&amp;#34;: &amp;#34;http.request.latency&amp;#34;,
    &amp;#34;aggregation&amp;#34;: &amp;#34;p99&amp;#34;,
    &amp;#34;tags&amp;#34;: { &amp;#34;service&amp;#34;: &amp;#34;api-gateway&amp;#34; },
    &amp;#34;condition&amp;#34;: &amp;#34;above&amp;#34;,
    &amp;#34;threshold&amp;#34;: 500,
    &amp;#34;duration&amp;#34;: &amp;#34;5m&amp;#34;,
    &amp;#34;channels&amp;#34;: [&amp;#34;slack:#oncall&amp;#34;, &amp;#34;pagerduty:team-infra&amp;#34;]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;time-series-storage-custom-or-tsdb-like-clickhousetimescaledb&#34;&gt;Time-Series Storage (custom or TSDB like ClickHouse/TimescaleDB)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Conceptual schema:
  metric_name    | string
  tag_set        | map&amp;lt;string, string&amp;gt;  (sorted, for consistent hashing)
  timestamp      | int64 (Unix epoch seconds)
  value          | float64

Physical storage (column-oriented):
  Series ID = hash(metric_name + sorted_tags)

  Series metadata table:
    series_id    (PK) | bigint
    metric_name        | string
    tags               | map&amp;lt;string, string&amp;gt;

  Data points table (partitioned by time):
    series_id          | bigint
    timestamp          | int64
    value              | float64
    Partition key: (series_id, time_bucket)
    Clustering: timestamp ASC
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;Key design choice:&lt;/strong&gt; Column-oriented storage. Time-series queries almost always read a specific metric over a time range (columnar scans). Row-oriented storage would waste I/O reading irrelevant columns.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Stream Processing System (Kafka)</title>
      <link>https://chiraghasija.cc/designs/kafka/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/kafka/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Producers publish messages to named topics with optional partitioning keys&lt;/li&gt;
&lt;li&gt;Consumers subscribe to topics and read messages in order within each partition&lt;/li&gt;
&lt;li&gt;Support consumer groups — each message is delivered to exactly one consumer within a group (load balancing), but to all groups (broadcast)&lt;/li&gt;
&lt;li&gt;Messages are durably persisted for a configurable retention period (default 7 days)&lt;/li&gt;
&lt;li&gt;Support message replay — consumers can seek to any offset and re-consume&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the messaging backbone cannot go down without cascading failures across the entire platform&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 10ms end-to-end for p99 publish-to-consume (single datacenter). Throughput is prioritized over single-message latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Messages within a partition are strictly ordered. Across partitions, no ordering guarantee. At-least-once delivery by default; exactly-once semantics available with idempotent producers + transactional consumers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1M+ messages/sec ingestion, 10TB+ daily throughput, 1000+ topics, 10,000+ partitions across the cluster&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero message loss for acknowledged writes. Replication factor of 3 minimum for production topics.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Write throughput: 1M messages/sec&lt;/li&gt;
&lt;li&gt;Average message size: 1KB&lt;/li&gt;
&lt;li&gt;Write bandwidth: 1M x 1KB = &lt;strong&gt;1 GB/sec&lt;/strong&gt; ingress&lt;/li&gt;
&lt;li&gt;Replication factor 3: 1 GB/sec x 3 = &lt;strong&gt;3 GB/sec&lt;/strong&gt; internal replication traffic&lt;/li&gt;
&lt;li&gt;Read throughput: Assume 5 consumer groups on average → &lt;strong&gt;5 GB/sec&lt;/strong&gt; egress&lt;/li&gt;
&lt;li&gt;Total network: ~9 GB/sec across the cluster&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Daily ingestion: 1 GB/sec x 86,400 = &lt;strong&gt;86.4 TB/day&lt;/strong&gt; (before replication)&lt;/li&gt;
&lt;li&gt;With replication factor 3: &lt;strong&gt;259 TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;7-day retention: &lt;strong&gt;1.8 PB&lt;/strong&gt; raw storage&lt;/li&gt;
&lt;li&gt;Each broker: 12 x 4TB SSDs = 48TB per broker → need ~38 brokers for storage alone&lt;/li&gt;
&lt;li&gt;In practice: &lt;strong&gt;50-60 brokers&lt;/strong&gt; (headroom for CPU, network, rebalancing)&lt;/li&gt;
&lt;/ul&gt;
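&lt;p&gt;The storage and broker-count arithmetic above can be reproduced with a short back-of-envelope script (a sketch; the 12 × 4TB broker disk layout is the one assumed in the bullets above):&lt;/p&gt;

```python
# Back-of-envelope sizing for the messaging cluster, using the
# figures from the estimation section.
MSGS_PER_SEC = 1_000_000
MSG_SIZE_BYTES = 1_000          # 1 KB average message
REPLICATION_FACTOR = 3
RETENTION_DAYS = 7
BROKER_DISK_TB = 12 * 4         # 12 x 4TB SSDs per broker

ingress_gb_s = MSGS_PER_SEC * MSG_SIZE_BYTES / 1e9            # 1 GB/sec
daily_tb = ingress_gb_s * 86_400 / 1_000                      # 86.4 TB/day pre-replication
replicated_daily_tb = daily_tb * REPLICATION_FACTOR           # ~259 TB/day
retained_pb = replicated_daily_tb * RETENTION_DAYS / 1_000    # ~1.8 PB for 7 days
brokers_for_storage = retained_pb * 1_000 / BROKER_DISK_TB    # ~38 brokers

print(f"{daily_tb:.1f} TB/day, {retained_pb:.2f} PB retained, "
      f"~{brokers_for_storage:.0f} brokers for storage alone")
```

&lt;p&gt;In practice the broker count is padded well past the storage minimum, as noted above, to leave headroom for CPU, network, and rebalancing.&lt;/p&gt;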
&lt;h3 id=&#34;partitions&#34;&gt;Partitions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1000 topics x average 10 partitions = 10,000 partitions&lt;/li&gt;
&lt;li&gt;Each partition is an append-only log, stored on disk as a sequence of segment files&lt;/li&gt;
&lt;li&gt;Partition leader handles all reads/writes → must balance leaders evenly across brokers&lt;/li&gt;
&lt;/ul&gt;
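&lt;p&gt;Leader balancing can be illustrated with round-robin replica assignment in the spirit of Kafka&amp;rsquo;s default assignor (a simplified sketch, not the exact algorithm):&lt;/p&gt;

```python
def assign_replicas(num_partitions: int, num_brokers: int, rf: int = 3):
    """Round-robin leader placement, with followers on the next brokers
    in the ring. Returns {partition: [leader, follower, ...]} broker ids."""
    assignment = {}
    for p in range(num_partitions):
        leader = p % num_brokers
        assignment[p] = [(leader + i) % num_brokers for i in range(rf)]
    return assignment

assignment = assign_replicas(num_partitions=12, num_brokers=4)
leaders = [replicas[0] for replicas in assignment.values()]
# Each of the 4 brokers leads exactly 3 of the 12 partitions.
print({b: leaders.count(b) for b in range(4)})
```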
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;producer-api&#34;&gt;Producer API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Publish a message
POST /topics/{topic}/messages
  Body: {
    &amp;#34;key&amp;#34;: &amp;#34;user_123&amp;#34;,              // optional partition key
    &amp;#34;value&amp;#34;: &amp;#34;&amp;lt;base64-encoded payload&amp;gt;&amp;#34;,
    &amp;#34;headers&amp;#34;: {&amp;#34;trace_id&amp;#34;: &amp;#34;abc&amp;#34;}, // optional metadata
    &amp;#34;partition&amp;#34;: null                // null = use key hash; int = explicit partition
  }
  Response 200: {
    &amp;#34;topic&amp;#34;: &amp;#34;user-events&amp;#34;,
    &amp;#34;partition&amp;#34;: 7,
    &amp;#34;offset&amp;#34;: 48291034,
    &amp;#34;timestamp&amp;#34;: 1708632060000
  }

// Batch publish (preferred for throughput)
POST /topics/{topic}/messages/batch
  Body: { &amp;#34;messages&amp;#34;: [ ... ] }
  Response 200: { &amp;#34;offsets&amp;#34;: [ ... ] }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;consumer-api&#34;&gt;Consumer API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Subscribe to topic(s) with a consumer group
POST /consumers/{group_id}/subscribe
  Body: { &amp;#34;topics&amp;#34;: [&amp;#34;user-events&amp;#34;, &amp;#34;page-views&amp;#34;] }

// Poll for messages
GET /consumers/{group_id}/poll?max_records=500&amp;amp;timeout_ms=1000
  Response 200: {
    &amp;#34;messages&amp;#34;: [
      {&amp;#34;topic&amp;#34;: &amp;#34;user-events&amp;#34;, &amp;#34;partition&amp;#34;: 7, &amp;#34;offset&amp;#34;: 48291034, &amp;#34;key&amp;#34;: &amp;#34;user_123&amp;#34;, &amp;#34;value&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;timestamp&amp;#34;: ...},
      ...
    ]
  }

// Commit offsets (acknowledge processing)
POST /consumers/{group_id}/offsets
  Body: { &amp;#34;offsets&amp;#34;: {&amp;#34;user-events&amp;#34;: {&amp;#34;7&amp;#34;: 48291035}} }

// Seek to specific offset (replay)
POST /consumers/{group_id}/seek
  Body: { &amp;#34;topic&amp;#34;: &amp;#34;user-events&amp;#34;, &amp;#34;partition&amp;#34;: 7, &amp;#34;offset&amp;#34;: 48000000 }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Batch publishing is the primary path — amortizes network round trips&lt;/li&gt;
&lt;li&gt;Consumer pull model (not push) — consumers control their own pace, natural backpressure&lt;/li&gt;
&lt;li&gt;Offsets are committed by consumers, not auto-acked — enables at-least-once and exactly-once semantics&lt;/li&gt;
&lt;/ul&gt;
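&lt;p&gt;The batch-first publish path can be sketched as a per-partition buffer that flushes on a size or time threshold, mirroring the batch.size / linger.ms settings that appear in the write path below (a simplified sketch):&lt;/p&gt;

```python
import time

class BatchingProducer:
    """Buffers messages per partition and flushes a batch when it reaches
    batch_size bytes or linger_ms elapses (sketch of batch.size/linger.ms)."""
    def __init__(self, send, batch_size=16_384, linger_ms=5):
        self.send = send                    # callable(partition, [messages])
        self.batch_size = batch_size
        self.linger_s = linger_ms / 1000
        self.batches = {}                   # partition -> (first_ts, msgs, nbytes)

    def publish(self, partition, msg: bytes):
        ts, msgs, nbytes = self.batches.get(
            partition, (time.monotonic(), [], 0))
        msgs.append(msg)
        nbytes += len(msg)
        self.batches[partition] = (ts, msgs, nbytes)
        if nbytes >= self.batch_size or time.monotonic() - ts >= self.linger_s:
            self.flush(partition)

    def flush(self, partition):
        _, msgs, _ = self.batches.pop(partition, (0, [], 0))
        if msgs:
            self.send(partition, list(msgs))

sent = []
# linger stretched to 1s so this demo flushes purely on batch size.
producer = BatchingProducer(
    send=lambda part, msgs: sent.append((part, len(msgs))),
    batch_size=10, linger_ms=1000)
for _ in range(4):
    producer.publish(0, b"abc")     # 3 bytes each; the 4th crosses 10 bytes
print(sent)
```

&lt;p&gt;Flushing on whichever threshold trips first keeps tail latency bounded while still amortizing network round trips.&lt;/p&gt;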
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;message-on-disk-log-format&#34;&gt;Message (on-disk log format)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Record:
  offset          | int64       -- monotonically increasing per partition
  timestamp       | int64       -- producer-set or broker-set
  key             | bytes       -- nullable, used for partitioning
  value           | bytes       -- the payload
  headers         | map&amp;lt;str,str&amp;gt;-- metadata
  crc32           | int32       -- integrity check
  attributes      | int8        -- compression codec, timestamp type
  batch_offset    | int32       -- offset within producer batch (for idempotency)
  producer_id     | int64       -- for exactly-once (idempotent producer)
  producer_epoch  | int16       -- for exactly-once (fencing)
  sequence_num    | int32       -- for exactly-once (dedup)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;topic-metadata-zookeeperkraft&#34;&gt;Topic Metadata (ZooKeeper/KRaft)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Topic:
  topic_name      | string (PK)
  num_partitions  | int
  replication_factor | int
  config_overrides   | map       -- retention.ms, segment.bytes, etc.

Partition Assignment:
  topic_name      | string
  partition_id    | int
  leader_broker   | int
  isr             | list&amp;lt;int&amp;gt;   -- in-sync replicas
  replicas        | list&amp;lt;int&amp;gt;   -- all assigned replicas
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;consumer-group-offsets&#34;&gt;Consumer Group Offsets&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;__consumer_offsets (internal compacted topic):
  Key: {group_id, topic, partition}
  Value: {committed_offset, metadata, timestamp}
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-this-storage-model&#34;&gt;Why This Storage Model&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Append-only log on local disk&lt;/strong&gt; — sequential writes are the fastest possible I/O pattern. HDDs do 200MB/s sequential, SSDs do 2GB/s+. No random seeks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No traditional database&lt;/strong&gt; — the log IS the database. No index overhead, no B-tree maintenance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumer offsets stored as a compacted topic&lt;/strong&gt; — self-hosted, replicated, no external DB dependency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KRaft (replacing ZooKeeper)&lt;/strong&gt; — metadata managed via Raft consensus among controller brokers, eliminating the ZK dependency.&lt;/li&gt;
&lt;/ul&gt;
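&lt;p&gt;The log-as-database idea is easy to see in miniature: an in-memory sketch of one partition log where the offset is simply a position (real brokers use on-disk segment files plus a sparse offset index):&lt;/p&gt;

```python
class PartitionLog:
    """Minimal append-only partition log: the offset is the list index."""
    def __init__(self):
        self.records = []

    def append(self, key, value):
        offset = len(self.records)        # next monotonically increasing offset
        self.records.append((key, value))
        return offset

    def read(self, offset, max_records=100):
        """Reads never mutate the log, so any consumer can replay any range."""
        return self.records[offset:offset + max_records]

log = PartitionLog()
for i in range(5):
    log.append(f"user_{i}", f"event_{i}")
print(log.read(offset=3))   # replay from offset 3
```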
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Producers (thousands)
  │
  ├─ Producer 1 ──┐
  ├─ Producer 2 ──┤    (messages partitioned by key hash)
  └─ Producer N ──┤
                  ▼
         ┌──────────────────────────────────────┐
         │          Kafka Cluster                │
         │                                       │
         │  Broker 1         Broker 2            │
         │  ┌────────────┐   ┌────────────┐     │
         │  │ Topic-A P0 │   │ Topic-A P1 │     │
         │  │  (Leader)  │   │  (Leader)  │     │
         │  │ Topic-A P1 │   │ Topic-A P0 │     │
         │  │  (Replica) │   │  (Replica) │     │
         │  └────────────┘   └────────────┘     │
         │                                       │
         │  Broker 3 (Controller)                │
         │  ┌────────────┐                      │
         │  │ Topic-A P0 │  KRaft Metadata      │
         │  │  (Replica) │  (Raft consensus)    │
         │  │ Topic-A P1 │                      │
         │  │  (Replica) │                      │
         │  └────────────┘                      │
         └──────────────────────────────────────┘
                  │
                  ▼
         Consumer Groups
         ┌─────────────────────┐
         │ Group &amp;#34;analytics&amp;#34;   │
         │  Consumer A → P0    │
         │  Consumer B → P1    │
         └─────────────────────┘
         ┌─────────────────────┐
         │ Group &amp;#34;search-index&amp;#34;│
         │  Consumer C → P0,P1 │
         └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;write-path-producer--broker&#34;&gt;Write Path (Producer → Broker)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Producer
  → Serialize message (Avro/Protobuf/JSON)
  → Compute partition: hash(key) % num_partitions (or round-robin if no key)
  → Batch messages destined for same partition (linger.ms = 5ms, batch.size = 16KB)
  → Compress batch (LZ4/Snappy/Zstd)
  → Send to partition leader broker
  → Leader:
    → Append to local log segment (sequential write)
    → Wait for ISR replicas to fetch and acknowledge
    → When acks=all: respond to producer only after all ISR replicas confirm
    → When acks=1: respond after local write (faster, risk of data loss on leader crash)
  → Producer receives offset confirmation
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;read-path-broker--consumer&#34;&gt;Read Path (Broker → Consumer)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Consumer
  → Poll leader broker for partition
  → Broker:
    → Read from page cache (hot data) or disk (cold data)
    → Use sendfile() / zero-copy: data goes directly from page cache → network socket
      (no kernel→user→kernel copy — saves 2 memory copies and 2 context switches)
    → Respond with batch of messages
  → Consumer deserializes and processes
  → Consumer commits offset (async or sync)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Brokers (50-60):&lt;/strong&gt; Store partition log segments, serve produce/fetch requests, replicate data. Each broker is a partition leader for some partitions and a follower for others.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Controller (KRaft):&lt;/strong&gt; 3-5 controller nodes running Raft consensus. Manages cluster metadata: topic creation, partition assignment, leader election, ISR management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Producers:&lt;/strong&gt; Client libraries that batch, compress, and route messages to the correct partition leader.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumers:&lt;/strong&gt; Client libraries that poll partition leaders, track offsets, and handle rebalancing within consumer groups.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Registry:&lt;/strong&gt; External service storing Avro/Protobuf schemas. Producers register schemas, consumers validate. Prevents breaking changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring (JMX/Prometheus):&lt;/strong&gt; Under-replicated partitions, consumer lag, broker throughput, request latency percentiles.&lt;/li&gt;
&lt;/ol&gt;
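&lt;p&gt;The read path and the commit-after-processing decision combine into a pull loop that gives at-least-once delivery (a sketch; poll/process/commit are stand-ins for the client library API, not its real signatures):&lt;/p&gt;

```python
def consume_loop(poll, process, commit, committed_offset):
    """At-least-once consumption: process first, commit after.
    A crash between process and commit causes redelivery, never loss."""
    offset = committed_offset
    while True:
        batch = poll(offset, max_records=500)
        if not batch:
            break
        for record in batch:
            process(record)
        offset += len(batch)
        commit(offset)     # acknowledge only after processing succeeded
    return offset

messages = ["m0", "m1", "m2", "m3", "m4"]
seen, commits = [], []
final = consume_loop(
    poll=lambda off, max_records: messages[off:off + 2],  # toy poll, 2 at a time
    process=seen.append,
    commit=commits.append,
    committed_offset=0,
)
print(final, commits)
```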
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-replication-and-isr-in-sync-replicas&#34;&gt;Deep Dive 1: Replication and ISR (In-Sync Replicas)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; How do we ensure durability without sacrificing throughput? Traditional quorum (majority must ack) wastes 1/3 of write capacity.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed System Control Plane</title>
      <link>https://chiraghasija.cc/designs/control-plane/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/control-plane/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Service Discovery:&lt;/strong&gt; Services register themselves on startup and discover other services by name. Registry is always up-to-date with running instances.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration Management:&lt;/strong&gt; Centrally manage and distribute configuration to all services. Support versioning, rollback, and environment-specific overrides.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Health Checking:&lt;/strong&gt; Continuously monitor service health. Automatically remove unhealthy instances from the service registry. Support liveness and readiness probes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rolling Deployments:&lt;/strong&gt; Orchestrate zero-downtime deployments by gradually replacing old instances with new ones, with automatic rollback on failure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load Balancing Policy:&lt;/strong&gt; Define and enforce load balancing policies (round-robin, least-connections, weighted) and circuit breaking rules at the control plane level.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% — if the control plane goes down, services can&amp;rsquo;t discover each other, configs can&amp;rsquo;t update, and deployments halt. Data plane must continue to function independently during control plane outages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Service discovery lookups &amp;lt; 5ms. Config pushes reach all nodes within 30 seconds. Health check detection &amp;lt; 10 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Service registry must be strongly consistent (a deregistered service must never receive traffic). Config updates must be atomically applied (no partial config states).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 50K service instances across 500 services. 10 regions. 1M service discovery lookups/sec. 100K config reads/sec.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Tolerance:&lt;/strong&gt; Control plane must handle network partitions gracefully. The data plane (actual service-to-service traffic) must continue even if the control plane is unreachable.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;service-registry&#34;&gt;Service Registry&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500 services × 100 instances each = &lt;strong&gt;50K registered instances&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each registration: ~1KB (service name, host, port, metadata, health status, version)&lt;/li&gt;
&lt;li&gt;Total registry size: 50K × 1KB = &lt;strong&gt;50MB&lt;/strong&gt; — trivially fits in memory on every node&lt;/li&gt;
&lt;li&gt;Registration/deregistration events: ~10K/hour (deploys, scaling, failures)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;service-discovery&#34;&gt;Service Discovery&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50K instances, each resolving other services ~20 times/sec = &lt;strong&gt;1M lookups/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With local caching (refresh every 5-10 seconds): actual control plane queries = 50K instances / 5 sec = &lt;strong&gt;10K queries/sec&lt;/strong&gt; — very manageable&lt;/li&gt;
&lt;/ul&gt;
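&lt;p&gt;That 100× reduction (1M lookups/sec down to 10K control plane queries/sec) falls out of a client-side TTL cache, sketched below (names are illustrative):&lt;/p&gt;

```python
import time

class DiscoveryClient:
    """Caches service -> instance lists locally; queries the control
    plane only when an entry is older than ttl seconds."""
    def __init__(self, fetch, ttl=5.0):
        self.fetch = fetch          # callable(service) -> [instances]
        self.ttl = ttl
        self.cache = {}             # service -> (fetched_at, instances)

    def resolve(self, service):
        entry = self.cache.get(service)
        if entry is None or time.monotonic() - entry[0] > self.ttl:
            self.cache[service] = (time.monotonic(), self.fetch(service))
        return self.cache[service][1]

control_plane_calls = []
client = DiscoveryClient(
    fetch=lambda s: control_plane_calls.append(s) or ["10.0.0.1:8080"])
for _ in range(1000):               # 1000 lookups within one TTL window...
    client.resolve("payments")
print(len(control_plane_calls))     # ...cost a single control plane query
```

&lt;p&gt;The same cache is what lets the data plane keep routing from its last known snapshot when the control plane is unreachable.&lt;/p&gt;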
&lt;h3 id=&#34;health-checking&#34;&gt;Health Checking&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50K instances, health checked every 5 seconds = &lt;strong&gt;10K health checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each health check: ~200 bytes response&lt;/li&gt;
&lt;li&gt;Network: 10K × 200 bytes = 2MB/sec — trivial&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;configuration&#34;&gt;Configuration&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500 services × 5 config keys each = 2,500 config entries&lt;/li&gt;
&lt;li&gt;Total config data: 2,500 × 10KB average = &lt;strong&gt;25MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Config change events: ~50/day (most configs rarely change)&lt;/li&gt;
&lt;li&gt;Config reads: 50K instances poll every 30 seconds = &lt;strong&gt;1,700 reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The control plane is a &lt;strong&gt;low-throughput, high-availability&lt;/strong&gt; system. Data volumes are small (&amp;lt; 100MB total state). The hard problem is availability, consistency, and graceful degradation — not scale.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Distributed Tracing System (Jaeger/Zipkin)</title>
      <link>https://chiraghasija.cc/designs/distributed-tracing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/distributed-tracing/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Collect spans from every service in the request path and assemble them into end-to-end traces with parent-child relationships&lt;/li&gt;
&lt;li&gt;Propagate trace context (trace ID, span ID, sampling decision) across service boundaries via HTTP headers, gRPC metadata, and message queues&lt;/li&gt;
&lt;li&gt;Support configurable sampling strategies (head-based probabilistic, tail-based on error/latency, always-on for debug)&lt;/li&gt;
&lt;li&gt;Store traces and provide query capabilities: search by trace ID, service name, operation, duration, tags, and time range&lt;/li&gt;
&lt;li&gt;Generate service dependency graphs and latency breakdowns from trace data&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% for collection pipeline (trace loss during outage is acceptable but not data corruption). Query system: 99.95%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Zero observable overhead on the critical path. Span reporting must be async and non-blocking (&amp;lt; 0.1ms per span in the application process).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Traces can be eventually consistent (30-second delay from span emission to queryability is fine).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1,000 microservices, 500K requests/sec, average 10 spans per trace = &lt;strong&gt;5M spans/sec&lt;/strong&gt; ingestion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Retain full traces for 7 days, sampled traces for 30 days, aggregated service metrics for 1 year.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Spans ingested:&lt;/strong&gt; 5M spans/sec (500K traces/sec × 10 spans/trace average)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;With 10% head-based sampling:&lt;/strong&gt; 500K spans/sec stored (reduces storage 10x while preserving statistical accuracy)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tail-based sampling (errors/slow):&lt;/strong&gt; adds ~50K spans/sec (all error traces and p99 latency traces kept at 100%)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total stored:&lt;/strong&gt; ~550K spans/sec&lt;/li&gt;
&lt;/ul&gt;
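&lt;p&gt;Head-based sampling must be a deterministic function of the trace ID so every service in the path makes the same keep/drop decision; tail-based sampling instead inspects the finished trace. Both can be sketched as follows (thresholds are illustrative):&lt;/p&gt;

```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministic head-based sampling: hash the trace ID so every
    service in the request path reaches the same keep/drop decision."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return (h / 0xFFFFFFFF) < rate

def tail_keep(spans) -> bool:
    """Tail-based override: keep the whole trace if any span errored
    or the root span was slow (1s threshold is illustrative)."""
    return any(s.get("error") for s in spans) or spans[0]["duration_ms"] > 1000

# Head decision is made once at the edge and propagated downstream:
decisions = [head_sample(f"trace-{i}") for i in range(100_000)]
print(f"sampled {sum(decisions) / len(decisions):.1%}")   # close to 10%

# Tail decision requires buffering the full trace until it completes:
print(tail_keep([{"duration_ms": 1500}, {"duration_ms": 30, "error": True}]))
```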
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Per span:&lt;/strong&gt; span_id (16B) + trace_id (16B) + parent_id (16B) + service (32B) + operation (64B) + start_time (8B) + duration (8B) + tags (128B) + logs (256B) = ~544 bytes average&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Raw spans (7 days):&lt;/strong&gt; 550K/sec × 86,400 × 7 × 544B = &lt;strong&gt;~181TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;With columnar compression (5x):&lt;/strong&gt; ~36TB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sampled traces (30 days, 1% of all traces):&lt;/strong&gt; 50K spans/sec × 86,400 × 30 × 544B ≈ 70TB raw, &lt;strong&gt;~14TB&lt;/strong&gt; compressed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service metrics (1 year):&lt;/strong&gt; aggregated data, ~100GB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;high-throughput write pipeline&lt;/strong&gt; with relatively infrequent reads. The core challenges are: (1) ingesting 5M spans/sec with near-zero application overhead, (2) tail-based sampling which requires holding spans in memory until the trace completes, and (3) efficiently querying traces by multiple dimensions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Document Management System</title>
      <link>https://chiraghasija.cc/designs/document-management/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/document-management/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Upload, download, and organize files in a hierarchical folder structure with support for large files (up to 10GB) via chunked/resumable uploads&lt;/li&gt;
&lt;li&gt;Version control: maintain full version history of documents, allow reverting to any previous version, and show diffs for text-based files&lt;/li&gt;
&lt;li&gt;Fine-grained access control: per-document and per-folder permissions (owner, editor, viewer), shareable links with expiry, and organization-wide policies&lt;/li&gt;
&lt;li&gt;Full-text search across document contents, metadata, and tags. Support filters by file type, date, owner, and folder.&lt;/li&gt;
&lt;li&gt;Real-time collaborative editing for text documents (Google Docs-style) with conflict resolution and presence indicators&lt;/li&gt;
&lt;/ol&gt;
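&lt;p&gt;The chunked/resumable upload in requirement 1 can be sketched client-side: split the file, record per-chunk and whole-file hashes in a manifest, and on resume send only the chunk indexes the server reports missing (the 4MB chunk size is an assumption for illustration):&lt;/p&gt;

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024     # 4MB chunks (illustrative size)

def make_manifest(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file into chunks; record per-chunk and whole-file hashes."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return {
        "file_sha256": hashlib.sha256(data).hexdigest(),
        "chunks": [hashlib.sha256(c).hexdigest() for c in chunks],
    }, chunks

def resume_upload(chunks, received: set):
    """Return the chunk indexes the server has not yet acknowledged."""
    return [i for i in range(len(chunks)) if i not in received]

manifest, chunks = make_manifest(b"x" * (10 * 1024 * 1024))   # 10MB file
# After an interruption, the server reports chunks 0 and 1 stored:
print(len(chunks), resume_upload(chunks, received={0, 1}))
```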
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for reads (viewing/downloading). 99.9% for writes (uploading/editing). Acceptable to queue uploads during degraded states.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; File listing &amp;lt; 100ms. File download start (first byte) &amp;lt; 200ms. Search results &amp;lt; 500ms. Collaborative edit sync &amp;lt; 100ms (real-time feel).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines) — zero data loss. Documents are business-critical. Use replication + backups.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M documents totaling 50PB of storage. 10M users. 1M uploads/day. 10M downloads/day. 100K concurrent collaborative editing sessions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; Audit trail for all document access. Support for retention policies and legal holds. GDPR right to deletion.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M documents, average 100MB = &lt;strong&gt;50PB total storage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With 5 versions average per document, block-level deduplication keeps version overhead to ~3x: 50PB × 3 = &lt;strong&gt;150PB&lt;/strong&gt; logical storage; S3&amp;rsquo;s internal redundancy on top of that is managed by the provider&lt;/li&gt;
&lt;li&gt;Daily uploads: 1M files × 100MB average = &lt;strong&gt;100TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Search index: 500M documents × 5KB extracted text average = &lt;strong&gt;2.5TB&lt;/strong&gt; Elasticsearch index&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Uploads: 1M/day = &lt;strong&gt;12 uploads/sec average&lt;/strong&gt;, peak 100/sec&lt;/li&gt;
&lt;li&gt;Downloads: 10M/day = &lt;strong&gt;115 downloads/sec average&lt;/strong&gt;, peak 1K/sec&lt;/li&gt;
&lt;li&gt;Search: 5M queries/day = &lt;strong&gt;58 queries/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Collaborative editing: 100K concurrent sessions, each generating ~10 operations/sec = &lt;strong&gt;1M ops/sec&lt;/strong&gt; for the collaboration engine&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Uploads: 100TB/day = &lt;strong&gt;9.3 Gbps sustained&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Downloads: 10M × 100MB = 1PB/day = &lt;strong&gt;93 Gbps sustained&lt;/strong&gt; → CDN handles this&lt;/li&gt;
&lt;li&gt;Total: ~100 Gbps — significant but manageable with CDN offload&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;Storage cost dominates. At S3 Standard pricing ($0.023/GB/month), 50PB = &lt;strong&gt;$1.15M/month&lt;/strong&gt;. Tiering cold documents to S3 Glacier ($0.004/GB/month) reduces this to ~$250K/month. Storage tiering is a critical cost optimization, not just a nice-to-have.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Food Ordering System (DoorDash/UberEats)</title>
      <link>https://chiraghasija.cc/designs/food-ordering/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/food-ordering/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Customers can browse nearby restaurants, view menus with real-time item availability, and place orders&lt;/li&gt;
&lt;li&gt;Orders flow through a state machine: placed → confirmed by restaurant → preparing → ready for pickup → picked up → delivered&lt;/li&gt;
&lt;li&gt;Assign delivery drivers to orders based on proximity, current load, and estimated restaurant prep time&lt;/li&gt;
&lt;li&gt;Real-time tracking of driver location from restaurant to customer doorstep&lt;/li&gt;
&lt;li&gt;Estimate and display accurate ETAs at each stage (restaurant prep, driver pickup, delivery)&lt;/li&gt;
&lt;/ol&gt;
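&lt;p&gt;The order lifecycle in requirement 2 can be enforced with an explicit transition table that rejects illegal jumps (a sketch; the cancellation branch is an assumption, since the requirement lists only the happy path):&lt;/p&gt;

```python
TRANSITIONS = {
    "placed": {"confirmed", "cancelled"},
    "confirmed": {"preparing", "cancelled"},
    "preparing": {"ready_for_pickup"},
    "ready_for_pickup": {"picked_up"},
    "picked_up": {"delivered"},
    "delivered": set(),                 # terminal
    "cancelled": set(),                 # terminal
}

def advance(order, new_state):
    """Apply a state transition, rejecting anything not in the table."""
    if new_state not in TRANSITIONS[order["state"]]:
        raise ValueError(f"illegal transition {order['state']} -> {new_state}")
    order["state"] = new_state
    return order

order = {"id": "o1", "state": "placed"}
for s in ["confirmed", "preparing", "ready_for_pickup", "picked_up", "delivered"]:
    advance(order, s)
print(order["state"])
```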
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% during meal-time peaks (11 AM - 1 PM, 5 PM - 9 PM). A 10-minute outage during dinner rush can cost millions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Menu browsing &amp;lt; 100ms, order placement &amp;lt; 500ms, location updates &amp;lt; 1 second.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Order state must be strongly consistent (no duplicate orders, no lost payments). Menu prices can be eventually consistent (seconds-stale acceptable).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500K restaurants, 30M orders/month (~12 orders/sec average, peak 100/sec), 200K concurrent drivers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reliability:&lt;/strong&gt; Payment capture must be exactly-once. Driver assignment must avoid double-booking.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Menu browsing:&lt;/strong&gt; 5M DAU × 5 searches × 3 restaurant views = 75M page loads/day = ~870 QPS (peak 3x = 2,600 QPS)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 1M orders/day = ~12/sec average, peak (dinner rush) = 100/sec&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver location updates:&lt;/strong&gt; 200K drivers × 1 update/4s = &lt;strong&gt;50K updates/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order status updates:&lt;/strong&gt; 1M orders × 6 state transitions = 6M events/day = 70/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Restaurants + menus:&lt;/strong&gt; 500K restaurants × 50 items × 2KB = 50GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 30M/month × 3KB = 90GB/month, ~1TB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver locations:&lt;/strong&gt; Real-time only (Redis), ~200K × 64 bytes = 12.8MB in memory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order tracking history:&lt;/strong&gt; 30M/month × 100 location points × 32 bytes = 96GB/month&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;three-sided marketplace&lt;/strong&gt; (customers, restaurants, drivers) with a complex orchestration problem. The hardest challenge is ETA accuracy — it depends on restaurant prep time (variable, 5-45 minutes), driver travel time (traffic-dependent), and coordinating pickup timing so the driver arrives when food is ready (not 15 minutes early waiting, not 10 minutes late with cold food).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Hotel Booking System</title>
      <link>https://chiraghasija.cc/designs/hotel-booking/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/hotel-booking/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Search for hotels by location, dates, guests, and filters (price range, star rating, amenities) with availability-aware results&lt;/li&gt;
&lt;li&gt;View hotel details, room types, photos, reviews, and real-time pricing for selected dates&lt;/li&gt;
&lt;li&gt;Book a room with a hold-then-confirm workflow: hold inventory for 10 minutes while user enters payment, then confirm or release&lt;/li&gt;
&lt;li&gt;Prevent double-booking: two users cannot book the same room for overlapping dates, even under concurrent requests&lt;/li&gt;
&lt;li&gt;Support booking lifecycle: create, confirm, modify (change dates), cancel (with cancellation policy enforcement), and refund&lt;/li&gt;
&lt;/ol&gt;
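&lt;p&gt;Requirement 4 (no double-booking under concurrency) reduces to an atomic check-and-decrement per (room type, night); in a real database this is a conditional UPDATE or SELECT ... FOR UPDATE inside one transaction. An in-memory sketch with a lock standing in for the row lock:&lt;/p&gt;

```python
import threading

class NightlyInventory:
    """available[(room_type, night)] -> rooms left. The lock stands in
    for a database row lock / conditional UPDATE ... WHERE available > 0."""
    def __init__(self, available):
        self.available = dict(available)
        self.lock = threading.Lock()

    def hold(self, room_type, nights):
        """Atomically reserve one room for every night, or none at all."""
        with self.lock:
            if any(self.available.get((room_type, n), 0) < 1 for n in nights):
                return False
            for n in nights:
                self.available[(room_type, n)] -= 1
            return True

# Two concurrent users race for the last deluxe room on the same nights:
inv = NightlyInventory({("deluxe", "2024-06-01"): 1, ("deluxe", "2024-06-02"): 1})
results = []
threads = [threading.Thread(
    target=lambda: results.append(inv.hold("deluxe", ["2024-06-01", "2024-06-02"])))
    for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))    # exactly one hold succeeds
```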
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for search, 99.999% for booking (losing a confirmed booking is catastrophic)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Search results &amp;lt; 500ms. Booking confirmation &amp;lt; 2 seconds (includes payment).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for inventory. A room shown as available must actually be bookable. Overbooking must be prevented at the database level.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500K hotels, 50M rooms globally, 100M searches/day, 1M bookings/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Booking records and payment transactions must survive any failure. Zero data loss.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Search: 100M/day = &lt;strong&gt;~1,150 searches/sec average&lt;/strong&gt;, 5× peak = &lt;strong&gt;~5,750/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Hotel detail page views: 3× searches = &lt;strong&gt;~3,450/sec average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Booking attempts: 1M/day = &lt;strong&gt;~12 bookings/sec average&lt;/strong&gt;, peak = &lt;strong&gt;~60/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Booking holds created: 3× bookings (many abandoned) = &lt;strong&gt;~36 holds/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Price lookups: combined with search and detail = &lt;strong&gt;~10K/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Hotels: 500K × 5 KB = &lt;strong&gt;2.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Room types: 500K hotels × 5 room types × 1 KB = &lt;strong&gt;2.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Room inventory (per room per night): 50M rooms × 365 nights × 50 bytes = &lt;strong&gt;913 GB&lt;/strong&gt; (~1 TB)&lt;/li&gt;
&lt;li&gt;Bookings: 1M/day × 365 days × 1 KB = &lt;strong&gt;365 GB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Search index (Elasticsearch): hotel metadata + denormalized availability = &lt;strong&gt;~50 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;pricing&#34;&gt;Pricing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Price varies by date, room type, demand, and channel&lt;/li&gt;
&lt;li&gt;Rate table: 500K hotels × 5 room types × 365 days = &lt;strong&gt;~900M rate entries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each entry: ~30 bytes → &lt;strong&gt;27 GB&lt;/strong&gt; of rate data&lt;/li&gt;
&lt;li&gt;Cached in Redis for fast lookup&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is an &lt;strong&gt;inventory management + search&lt;/strong&gt; problem. The core challenge is maintaining a correct, real-time inventory count (rooms available per night) under concurrent booking pressure while also providing fast, filter-rich search results. The booking flow is a distributed transaction spanning inventory, payment, and confirmation.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Large File Download System</title>
      <link>https://chiraghasija.cc/designs/file-download/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/file-download/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can download large files (100MB to 50GB) reliably over unstable network connections with resume support&lt;/li&gt;
&lt;li&gt;Files are split into chunks; clients can download chunks in parallel and resume from the last completed chunk after interruption&lt;/li&gt;
&lt;li&gt;Integrity verification at both chunk level (per-chunk checksum) and file level (whole-file hash) to detect corruption&lt;/li&gt;
&lt;li&gt;Distribute files globally via CDN edge nodes to minimize download latency and maximize throughput&lt;/li&gt;
&lt;li&gt;Support bandwidth throttling per user/tier and fair-share scheduling when multiple users download simultaneously&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — users should almost always be able to start or resume a download. Brief outages are tolerable if resume works.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; First byte within 200ms. Download speed should saturate the user&amp;rsquo;s available bandwidth (no artificial bottleneck from our side, except throttling).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; File metadata (checksums, chunk manifest) must be strongly consistent. A user must never download a partially-updated file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M registered users, 500K concurrent downloads, 100PB total stored files, 5PB egress/month.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines) for stored files. No data loss ever.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Concurrent downloads:&lt;/strong&gt; 500K&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Average file size:&lt;/strong&gt; 2GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Average download speed:&lt;/strong&gt; 50 Mbps per user&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total bandwidth:&lt;/strong&gt; 500K × 50 Mbps = &lt;strong&gt;25 Tbps&lt;/strong&gt; peak egress (this is CDN-scale, not single-origin)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk requests:&lt;/strong&gt; 2GB file / 8MB chunk = 250 chunks per download. 500K downloads × 250 = 125M chunk requests (spread over download duration)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk request rate:&lt;/strong&gt; at 50 Mbps a user completes one 8 MB (64 Mbit) chunk every ~1.3 sec, so 500K downloads generate &lt;strong&gt;~390K chunk requests/sec&lt;/strong&gt; at peak&lt;/li&gt;
&lt;/ul&gt;
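&lt;p&gt;The per-download chunk math above assumes a chunk manifest the client can resume against. A minimal sketch with 8 MB chunks and SHA-256 checksums (helper names are illustrative):&lt;/p&gt;

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB, matching the estimate above

def chunk_manifest(file_bytes):
    """Split a file into fixed-size chunks with per-chunk SHA-256 checksums."""
    chunks = []
    for offset in range(0, len(file_bytes), CHUNK_SIZE):
        data = file_bytes[offset:offset + CHUNK_SIZE]
        chunks.append({
            "index": offset // CHUNK_SIZE,
            "offset": offset,
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"file_sha256": hashlib.sha256(file_bytes).hexdigest(),
            "chunks": chunks}

def resume_plan(manifest, completed_indexes):
    """Chunks still needed after an interrupted download."""
    done = set(completed_indexes)
    return [c["index"] for c in manifest["chunks"] if c["index"] not in done]
```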
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Total files:&lt;/strong&gt; 100PB (stored in object storage like S3)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk metadata:&lt;/strong&gt; 50M files × 250 chunks × 64 bytes = 800GB (fits in a database)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File manifests:&lt;/strong&gt; 50M files × 2KB = 100GB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;cost-insight&#34;&gt;Cost Insight&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Egress cost is dominant:&lt;/strong&gt; 5PB/month × $0.05/GB (CDN) = $250K/month&lt;/li&gt;
&lt;li&gt;CDN cache hit ratio is critical: 80% hit rate → origin serves only 1PB/month&lt;/li&gt;
&lt;li&gt;Popular files (top 1%) account for 80% of downloads (cache-friendly)&lt;/li&gt;
&lt;li&gt;Long-tail files need origin serving — optimize with regional origin replicas&lt;/li&gt;
&lt;/ul&gt;
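&lt;p&gt;The egress economics reduce to a two-line calculation; a sketch using the figures above (the $0.05/GB CDN rate and 80% hit ratio are the stated assumptions):&lt;/p&gt;

```python
# Assumed figures from the estimate above.
MONTHLY_EGRESS_PB = 5
CDN_COST_PER_GB = 0.05
CACHE_HIT_RATE = 0.80

# CDN bill: every delivered byte goes through the CDN.
cdn_cost_usd = MONTHLY_EGRESS_PB * 1_000_000 * CDN_COST_PER_GB  # ~$250K/month
# Origin only serves cache misses.
origin_egress_pb = MONTHLY_EGRESS_PB * (1 - CACHE_HIT_RATE)     # ~1 PB/month
```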
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is an &lt;strong&gt;egress-heavy, reliability-focused system&lt;/strong&gt;. The core challenges are: (1) reliable chunk-based downloads with resume over unreliable networks, (2) efficient CDN distribution to minimize origin egress, and (3) integrity guarantees so users never get a corrupted file.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Large-Scale Data Migration System</title>
      <link>https://chiraghasija.cc/designs/data-migration/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/data-migration/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Migrate data from a legacy datastore (e.g., MySQL) to a new datastore (e.g., PostgreSQL, DynamoDB, or a new schema in the same engine) with zero downtime&lt;/li&gt;
&lt;li&gt;Maintain full data consistency between source and destination throughout the migration — no data loss, no corruption&lt;/li&gt;
&lt;li&gt;Provide a validation framework that continuously compares source and destination data and reports discrepancies&lt;/li&gt;
&lt;li&gt;Support rollback at any stage — if the new system has issues, revert to the old system without data loss&lt;/li&gt;
&lt;li&gt;Handle schema transformations during migration (column renames, type changes, data enrichment, denormalization)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; Zero downtime. The system must remain fully operational throughout the entire migration, which may take days to weeks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; No user-facing latency increase during migration. Reads and writes continue at normal speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency. After migration completes, the new datastore must have 100.000% of the data. Not 99.99% — 100%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 50 TB of data, 500M rows across 200 tables. 10K writes/sec, 100K reads/sec. Migration must complete within 2 weeks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; The migration process itself must be resumable. If the migration pipeline crashes, it resumes from where it left off, not from scratch.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;backfill-phase&#34;&gt;Backfill Phase&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50 TB of data to migrate&lt;/li&gt;
&lt;li&gt;Network throughput (source → destination): 1 Gbps sustained = 125 MB/sec&lt;/li&gt;
&lt;li&gt;Time to backfill: 50 TB / 125 MB/sec = 400,000 sec ≈ &lt;strong&gt;4.6 days&lt;/strong&gt; at full throughput&lt;/li&gt;
&lt;li&gt;With throttling to avoid impacting production (50% utilization): &lt;strong&gt;~9 days&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
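&lt;p&gt;The backfill timeline above is worth parameterizing, since link speed and throttling dominate the schedule; a sketch under the stated assumptions:&lt;/p&gt;

```python
def backfill_days(data_tb, link_gbps=1.0, utilization=1.0):
    """Days to copy data_tb over a sustained link at a given utilization.

    1 Gbps = 125 MB/s; utilization models throttling to protect the
    production source (0.5 = use half the link).
    """
    bytes_per_sec = link_gbps * 125_000_000 * utilization
    seconds = data_tb * 1_000_000_000_000 / bytes_per_sec
    return seconds / 86_400
```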
&lt;h3 id=&#34;change-data-capture-cdc-during-backfill&#34;&gt;Change Data Capture (CDC) During Backfill&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10K writes/sec to source DB during backfill&lt;/li&gt;
&lt;li&gt;CDC event size: ~500 bytes average&lt;/li&gt;
&lt;li&gt;CDC throughput: 10K × 500 bytes = &lt;strong&gt;5 MB/sec&lt;/strong&gt; (easily handled by Kafka)&lt;/li&gt;
&lt;li&gt;Events generated during 9-day backfill: 10K/sec × 86400 × 9 = &lt;strong&gt;7.8 billion events&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;At 500 bytes each: &lt;strong&gt;3.9 TB&lt;/strong&gt; of CDC data (Kafka with 14-day retention can handle this)&lt;/li&gt;
&lt;/ul&gt;
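&lt;p&gt;Because Kafka delivers CDC events at-least-once and the pipeline must be resumable, applying an event has to be idempotent. A sketch of replay-safe apply keyed on the source log position (the event shape is hypothetical):&lt;/p&gt;

```python
def apply_cdc_event(dest, event):
    """Idempotently apply a CDC event to the destination (a dict here).

    Events may be replayed after a crash, so apply only if the event's
    source log position (lsn) is newer than what the row already reflects.
    Hypothetical event shape: {op, pk, row, lsn}.
    """
    pk, lsn = event["pk"], event["lsn"]
    current = dest.get(pk)
    if current is not None and current["lsn"] >= lsn:
        return False  # stale or duplicate event: skip
    if event["op"] == "delete":
        dest[pk] = {"lsn": lsn, "row": None, "deleted": True}
    else:  # insert and update are both upserts
        dest[pk] = {"lsn": lsn, "row": event["row"], "deleted": False}
    return True
```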
&lt;h3 id=&#34;validation-phase&#34;&gt;Validation Phase&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Compare 500M rows between source and destination&lt;/li&gt;
&lt;li&gt;Comparison rate: 50K rows/sec (read from both, hash, compare)&lt;/li&gt;
&lt;li&gt;Time: 500M / 50K = 10,000 sec ≈ &lt;strong&gt;2.8 hours&lt;/strong&gt; per full validation pass&lt;/li&gt;
&lt;li&gt;Run 3 passes for confidence: ~8 hours&lt;/li&gt;
&lt;/ul&gt;
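&lt;p&gt;The comparison pass above amounts to hashing each row canonically on both sides and diffing the digests (helper names are illustrative):&lt;/p&gt;

```python
import hashlib

def row_digest(row):
    """Stable digest of a row for source/destination comparison."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()

def diff_batch(source_rows, dest_rows):
    """Return PKs whose digests differ, or that exist on only one side."""
    src = {pk: row_digest(r) for pk, r in source_rows.items()}
    dst = {pk: row_digest(r) for pk, r in dest_rows.items()}
    keys = set(src) | set(dst)
    return sorted(pk for pk in keys if src.get(pk) != dst.get(pk))
```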
&lt;h3 id=&#34;total-timeline&#34;&gt;Total Timeline&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Backfill: 9 days&lt;/li&gt;
&lt;li&gt;CDC catch-up: 1-2 hours (processing backlog of changes accumulated during backfill)&lt;/li&gt;
&lt;li&gt;Validation: 1 day (3 passes + fixing discrepancies)&lt;/li&gt;
&lt;li&gt;Shadow reads: 2-3 days (comparison in production)&lt;/li&gt;
&lt;li&gt;Cutover: minutes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~2 weeks&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;p&gt;This is an internal infrastructure system, not a user-facing API. But it needs operational APIs for engineers to manage the migration.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Live Comments System (YouTube Live/Twitch Chat)</title>
      <link>https://chiraghasija.cc/designs/live-comments/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/live-comments/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can send text messages to a live stream&amp;rsquo;s chat in real-time (&amp;lt; 200ms delivery to other viewers)&lt;/li&gt;
&lt;li&gt;All viewers of a stream see messages in a consistent chronological order&lt;/li&gt;
&lt;li&gt;Support rate limiting per user (e.g., 1 message every 2 seconds) and configurable slow mode (streamer sets interval)&lt;/li&gt;
&lt;li&gt;Moderation tools: delete messages, ban users, assign moderators, auto-filter spam/profanity&lt;/li&gt;
&lt;li&gt;Persist chat history so users joining late can load recent messages (last 200 messages on join)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — chat can briefly degrade (delay messages) but should not go fully offline during a stream&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 200ms from send to display for 99th percentile viewers. For popular streams (100K+ viewers), &amp;lt; 500ms is acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Total ordering per stream is required: all viewers must see messages in the same order, and messages from the same user must appear in the order they were sent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100K concurrent streams. Top streams have 500K concurrent viewers. 50K messages/sec globally across all streams. Top streams receive 5,000 messages/sec.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Chat history persisted for at least 30 days for moderation review and replay.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100K concurrent live streams, 50M total concurrent viewers&lt;/li&gt;
&lt;li&gt;50,000 messages/sec globally (write)&lt;/li&gt;
&lt;li&gt;Fan-out: each message is delivered to all viewers of that stream&lt;/li&gt;
&lt;li&gt;Average stream: 500 viewers. Top stream: 500K viewers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fan-out volume:&lt;/strong&gt; 50,000 messages/sec × average 500 recipients = 25M message deliveries/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average message: 200 bytes (user_id, stream_id, text, timestamp, metadata)&lt;/li&gt;
&lt;li&gt;50,000 messages/sec × 200 bytes = 10 MB/sec = &lt;strong&gt;864 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;30-day retention = &lt;strong&gt;~26 TB&lt;/strong&gt; (compressible to ~5 TB with gzip)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;connection-state&#34;&gt;Connection State&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50M concurrent WebSocket connections&lt;/li&gt;
&lt;li&gt;Each connection: ~10 KB memory overhead (buffers, metadata)&lt;/li&gt;
&lt;li&gt;Total: 50M × 10 KB = &lt;strong&gt;500 GB&lt;/strong&gt; of connection memory&lt;/li&gt;
&lt;li&gt;At 500K connections per server = &lt;strong&gt;100 WebSocket servers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average message delivered: 200 bytes × 25M deliveries/sec = &lt;strong&gt;5 GB/sec&lt;/strong&gt; outbound bandwidth&lt;/li&gt;
&lt;li&gt;Top stream (500K viewers × 5,000 msg/sec): 200 bytes × 500K × 5,000 would be 500 GB/sec — impossible without batching&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Batch messages. Send 50 messages every 200ms = 10 KB per batch per viewer. 500K × 10 KB × 5/sec = 25 GB/sec. Still high — need edge-level fan-out.&lt;/li&gt;
&lt;/ul&gt;
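&lt;p&gt;The batching arithmetic above fits in one function; note that the 50-message cap implies sampling messages in very busy streams (parameter names are illustrative):&lt;/p&gt;

```python
def batch_bandwidth_gb_per_sec(viewers, msgs_per_sec, msg_bytes=200,
                               batch_window_ms=200, batch_cap=50):
    """Outbound bandwidth with time-window batching (framing overhead ignored).

    The cap means streams busier than cap * (1000 / window) msg/sec are
    sampled down rather than delivered in full.
    """
    batches_per_sec = 1000 / batch_window_ms
    msgs_per_batch = min(batch_cap, msgs_per_sec / batches_per_sec)
    batch_bytes = msgs_per_batch * msg_bytes
    return viewers * batches_per_sec * batch_bytes / 1e9
```

&lt;p&gt;With the top-stream numbers (500K viewers, 5,000 msg/sec) this reproduces the 25 GB/sec figure above.&lt;/p&gt;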
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-endpoints&#34;&gt;REST Endpoints&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Send a chat message
POST /v1/streams/{stream_id}/messages
  Headers: Authorization: Bearer &amp;lt;token&amp;gt;
  Body: {
    &amp;#34;text&amp;#34;: &amp;#34;Great play!&amp;#34;,
    &amp;#34;reply_to&amp;#34;: &amp;#34;msg_abc&amp;#34;              // optional, for threaded replies
  }
  Response 201: {
    &amp;#34;message_id&amp;#34;: &amp;#34;msg_xyz&amp;#34;,
    &amp;#34;timestamp&amp;#34;: 1708632000123
  }

// Load recent messages (on join)
GET /v1/streams/{stream_id}/messages?limit=200&amp;amp;before=&amp;lt;cursor&amp;gt;
  Response 200: {
    &amp;#34;messages&amp;#34;: [...],
    &amp;#34;cursor&amp;#34;: &amp;#34;msg_abc&amp;#34;
  }

// Delete a message (moderator)
DELETE /v1/streams/{stream_id}/messages/{message_id}
  Headers: Authorization: Bearer &amp;lt;moderator_token&amp;gt;

// Ban a user from chat
POST /v1/streams/{stream_id}/bans
  Body: { &amp;#34;user_id&amp;#34;: &amp;#34;user_123&amp;#34;, &amp;#34;duration&amp;#34;: 600 }  // 10 min timeout

// Set slow mode
PUT /v1/streams/{stream_id}/settings
  Body: { &amp;#34;slow_mode_seconds&amp;#34;: 5 }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;websocket-protocol&#34;&gt;WebSocket Protocol&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Client connects
WS /v1/streams/{stream_id}/chat?token=&amp;lt;auth_token&amp;gt;

// Server → Client: new messages (batched)
{
  &amp;#34;type&amp;#34;: &amp;#34;messages&amp;#34;,
  &amp;#34;data&amp;#34;: [
    { &amp;#34;id&amp;#34;: &amp;#34;msg_1&amp;#34;, &amp;#34;user&amp;#34;: &amp;#34;alice&amp;#34;, &amp;#34;text&amp;#34;: &amp;#34;Hello!&amp;#34;, &amp;#34;ts&amp;#34;: 1708632000123 },
    { &amp;#34;id&amp;#34;: &amp;#34;msg_2&amp;#34;, &amp;#34;user&amp;#34;: &amp;#34;bob&amp;#34;, &amp;#34;text&amp;#34;: &amp;#34;GG!&amp;#34;, &amp;#34;ts&amp;#34;: 1708632000456 }
  ]
}

// Server → Client: message deleted
{ &amp;#34;type&amp;#34;: &amp;#34;delete&amp;#34;, &amp;#34;message_id&amp;#34;: &amp;#34;msg_1&amp;#34; }

// Server → Client: user banned
{ &amp;#34;type&amp;#34;: &amp;#34;ban&amp;#34;, &amp;#34;user_id&amp;#34;: &amp;#34;user_123&amp;#34;, &amp;#34;duration&amp;#34;: 600 }

// Client → Server: heartbeat (every 30s)
{ &amp;#34;type&amp;#34;: &amp;#34;ping&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;messages-cassandra--partitioned-by-stream_id-clustered-by-timestamp&#34;&gt;Messages (Cassandra — partitioned by stream_id, clustered by timestamp)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: messages
  stream_id      (PK)  | varchar
  message_id     (CK)  | varchar       -- time-based UUID (sortable)
  user_id              | varchar
  text                 | varchar(500)
  reply_to             | varchar        -- nullable
  is_deleted           | boolean
  created_at           | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;stream-settings-redis-hash&#34;&gt;Stream Settings (Redis Hash)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: stream:{stream_id}:settings
Fields:
  slow_mode_seconds   | int (0 = disabled)
  subscribers_only    | boolean
  emote_only          | boolean
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;bans-redis-set-with-ttl--postgresql-for-permanent-bans&#34;&gt;Bans (Redis Sorted Set scored by expiry + PostgreSQL for permanent bans)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: stream:{stream_id}:bans
Type: Sorted Set
Member: user_id
Score: ban_expiry_timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;rate-limit-state-redis&#34;&gt;Rate Limit State (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: ratelimit:{stream_id}:{user_id}
Value: last_message_timestamp
TTL: slow_mode_seconds
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-cassandra-for-messages&#34;&gt;Why Cassandra for Messages?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Write-optimized (append-only, log-structured)&lt;/li&gt;
&lt;li&gt;Partitioned by stream_id — all messages for a stream are co-located&lt;/li&gt;
&lt;li&gt;Clustering by message_id (time-based UUID) gives natural chronological ordering&lt;/li&gt;
&lt;li&gt;Handles 50K writes/sec easily with horizontal scaling&lt;/li&gt;
&lt;li&gt;TTL support for automatic 30-day cleanup&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;why-redis-for-real-time-state&#34;&gt;Why Redis for Real-Time State?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Sub-millisecond lookups for rate limiting, ban checks, and settings&lt;/li&gt;
&lt;li&gt;Pub/Sub for cross-server message fan-out&lt;/li&gt;
&lt;li&gt;Sorted sets for efficient ban expiry checks&lt;/li&gt;
&lt;/ul&gt;
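&lt;p&gt;The slow-mode check itself is a single Redis operation (SET key NX EX seconds succeeds only when no unexpired key exists). A self-contained sketch of the same semantics using an in-memory map:&lt;/p&gt;

```python
import time

class SlowMode:
    """In-memory sketch of the Redis rate-limit key.

    A send is allowed only when the previous timestamp has aged out of
    the slow-mode window, mirroring SET key NX EX slow_mode_seconds.
    """
    def __init__(self, slow_mode_seconds):
        self.window = slow_mode_seconds
        self.last = {}  # (stream_id, user_id) -> last send timestamp

    def allow(self, stream_id, user_id, now=None):
        now = time.time() if now is None else now
        key = (stream_id, user_id)
        prev = self.last.get(key)
        if prev is not None and self.window > now - prev:
            return False  # still inside the slow-mode window
        self.last[key] = now
        return True
```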
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;message-send-flow&#34;&gt;Message Send Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client (sender)
  → WebSocket Connection Server (WS Server)
    → 1. Rate limit check (Redis: last message timestamp)
       If too soon → reject with &amp;#34;slow down&amp;#34; error
    → 2. Ban check (Redis: is user_id in ban set?)
       If banned → reject
    → 3. Spam/profanity filter (in-process rules + ML service)
       If spam → silently drop or shadow-ban
    → 4. Assign message_id (time-based UUID) + server timestamp
    → 5. Publish to Message Bus (Kafka topic: chat.{stream_id})
    → 6. Return ACK to sender

Message Bus (Kafka)
  → Chat Dispatcher Service:
    → 1. Persist message to Cassandra
    → 2. Determine which WS Servers hold viewers of this stream
       (lookup: Redis set stream:{stream_id}:servers)
    → 3. Publish to each WS Server via Redis Pub/Sub channel
       Channel: ws_server:{server_id}:messages

WS Server (receivers)
  → 1. Receive message from Redis Pub/Sub
  → 2. Buffer message (batch: up to 50 messages or 200ms, whichever first)
  → 3. Send batched messages to all local WebSocket connections for that stream
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;WebSocket Connection Servers:&lt;/strong&gt; Maintain persistent connections with clients. Stateful — each server knows which streams its connected clients are watching. Horizontally scaled (100+ servers). Sticky routing by stream_id hash to concentrate viewers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Message Bus (Kafka):&lt;/strong&gt; Durable message queue. One topic per popular stream, shared topics for smaller streams. Provides ordering guarantee within a partition (one stream = one partition).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chat Dispatcher Service:&lt;/strong&gt; Consumes from Kafka, persists to Cassandra, fans out to WS Servers. Stateless workers, scaled by partition count.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster:&lt;/strong&gt; Rate limiting, ban lists, stream settings, pub/sub for WS server fan-out, stream→server mapping.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cassandra Cluster:&lt;/strong&gt; Persistent chat history. Read path for &amp;ldquo;load recent messages&amp;rdquo; on viewer join.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spam/Moderation Service:&lt;/strong&gt; Real-time text classification. Regex rules (profanity word list) + ML model (spam detection). Called synchronously on message send (&amp;lt; 10ms budget).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream Registry:&lt;/strong&gt; Tracks which WS servers have viewers for each stream. Updated when connections are established/dropped.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;connection-management&#34;&gt;Connection Management&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Viewer joins stream:
  1. Client establishes WebSocket to assigned WS Server (via load balancer)
  2. WS Server registers: SADD stream:{stream_id}:servers {server_id}
  3. WS Server loads last 200 messages from Cassandra for initial hydration
  4. Client starts receiving live messages via WebSocket

Viewer leaves / disconnects:
  1. WS Server detects close/timeout
  2. Decrement local viewer count for stream
  3. If no more local viewers for stream:
     SREM stream:{stream_id}:servers {server_id}
     Unsubscribe from Redis Pub/Sub channel
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-fan-out-at-scale--handling-500k-viewers&#34;&gt;Deep Dive 1: Fan-Out at Scale — Handling 500K Viewers&lt;/h3&gt;
&lt;p&gt;The hardest problem: a top stream has 500K concurrent viewers receiving 5,000 messages/sec. Naive fan-out (send each message individually to 500K connections) is 2.5 billion message deliveries per second. Impossible.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Notification System</title>
      <link>https://chiraghasija.cc/designs/notification-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/notification-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Send notifications across multiple channels: push notifications (iOS/Android), SMS, email, and in-app&lt;/li&gt;
&lt;li&gt;Support user notification preferences — users can opt in/out per channel, per notification type (e.g., marketing vs transactional)&lt;/li&gt;
&lt;li&gt;Template-based notifications with variable substitution (e.g., &amp;ldquo;Hi {{name}}, your order {{order_id}} has shipped&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Rate limiting per user per channel — no user receives more than N notifications per hour (prevent notification fatigue)&lt;/li&gt;
&lt;li&gt;Track delivery status for each notification (sent, delivered, opened, clicked, bounced, failed) with retry on failure&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — transactional notifications (password resets, 2FA codes) are critical path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Transactional notifications delivered within 5 seconds of trigger. Marketing/batch notifications within 30 minutes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; At-least-once delivery. No notification should be silently dropped. Deduplication prevents sending the same notification twice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1B notifications/day (across all channels), 50K concurrent sends/sec at peak, 500M registered users&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Every notification request and its delivery status persisted. Full audit trail.&lt;/li&gt;
&lt;/ul&gt;
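&lt;p&gt;The at-least-once-plus-deduplication requirement is usually met with an idempotency key per (recipient, template, trigger event). A minimal sketch; in production the key set would live in Redis via SETNX with a TTL, and all names here are illustrative:&lt;/p&gt;

```python
import hashlib

def dedup_key(recipient_id, template_id, event_id):
    """Idempotency key: the same trigger event never produces the same
    notification twice, even if the producer retries."""
    raw = f"{recipient_id}:{template_id}:{event_id}"
    return hashlib.sha256(raw.encode()).hexdigest()

def send_once(sent_keys, key, send_fn):
    """Suppress duplicate sends: only the first occurrence of a key
    triggers send_fn (a set stands in for Redis SETNX here)."""
    if key in sent_keys:
        return False
    sent_keys.add(key)
    send_fn()
    return True
```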
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B notifications/day = &lt;strong&gt;~11,500 notifications/sec average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak (morning/evening, campaign blasts): 5× = &lt;strong&gt;~58,000 notifications/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Channel breakdown: Push 50% (500M), Email 30% (300M), In-app 15% (150M), SMS 5% (50M)&lt;/li&gt;
&lt;li&gt;API ingestion (trigger events): ~100K events/sec at peak (a single event can fan out to several recipients and channels, while many others are filtered out by user preferences)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Notification record: ~500 bytes (recipient, channel, template, payload, status, timestamps)&lt;/li&gt;
&lt;li&gt;1B/day × 500 bytes = &lt;strong&gt;500 GB/day&lt;/strong&gt;, 30-day retention = &lt;strong&gt;15 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;User preferences: 500M users × 200 bytes = &lt;strong&gt;100 GB&lt;/strong&gt; (fits in a single DB)&lt;/li&gt;
&lt;li&gt;Templates: ~10K templates × 5 KB = &lt;strong&gt;50 MB&lt;/strong&gt; (negligible)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;third-party-throughput&#34;&gt;Third-Party Throughput&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Push (APNs/FCM): Both support ~100K sends/sec with connection pooling&lt;/li&gt;
&lt;li&gt;Email (SES/SendGrid): SES supports 50K emails/sec at scale&lt;/li&gt;
&lt;li&gt;SMS (Twilio/SNS): ~1000 SMS/sec per account (need multiple accounts or regional providers for 50M/day)&lt;/li&gt;
&lt;li&gt;SMS is the bottleneck — 50M SMS/day ÷ 86400 = 579/sec average, 2900/sec peak. Need multiple provider accounts.&lt;/li&gt;
&lt;/ul&gt;
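&lt;p&gt;The SMS account count follows directly from the peak math above; a sketch under those assumptions:&lt;/p&gt;

```python
import math

def sms_accounts_needed(daily_sms, per_account_rps=1000, peak_factor=5):
    """Provider accounts needed to cover peak SMS throughput,
    assuming ~1000 SMS/sec per account and a 5x peak over the daily average."""
    peak_rps = daily_sms / 86_400 * peak_factor
    return math.ceil(peak_rps / per_account_rps)
```

&lt;p&gt;At these rates, 50M SMS/day works out to 3 provider accounts.&lt;/p&gt;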
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;high-throughput, multi-channel delivery pipeline&lt;/strong&gt; where the core challenges are: (1) routing to the right channel at the right time, (2) respecting user preferences and rate limits, (3) handling third-party provider failures gracefully, and (4) ensuring no notification is lost or duplicated.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Parts Compatibility System (PCPartPicker)</title>
      <link>https://chiraghasija.cc/designs/parts-compatibility/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/parts-compatibility/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users select PC components (CPU, motherboard, RAM, GPU, storage, PSU, case) and the system validates compatibility across all parts in real-time&lt;/li&gt;
&lt;li&gt;Compatibility rules engine checks socket matching (CPU ↔ motherboard), form factor (motherboard ↔ case), RAM type (DDR4 vs DDR5), power requirements (total wattage ↔ PSU), and physical clearances (GPU length, cooler height)&lt;/li&gt;
&lt;li&gt;Provide a power budget calculator that sums component TDPs and recommends minimum PSU wattage with appropriate headroom&lt;/li&gt;
&lt;li&gt;Detect and warn about performance bottlenecks (e.g., pairing a high-end GPU with a low-end CPU)&lt;/li&gt;
&lt;li&gt;Aggregate pricing from multiple retailers with real-time price tracking, price history, and price drop alerts&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — the site is informational/commerce, not life-critical. Brief outages are tolerable but costly (lost affiliate revenue).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Compatibility checks must complete in &amp;lt; 200ms as users add each component. Price data should be &amp;lt; 1 hour stale.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Compatibility rules must be strictly correct — a false &amp;ldquo;compatible&amp;rdquo; signal that leads to a bad purchase is unacceptable. False &amp;ldquo;incompatible&amp;rdquo; (overly conservative) is tolerable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500K product SKUs across 15 component categories. 10M monthly active users. 50K concurrent build sessions. 100M compatibility checks/day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; User builds must be saved reliably. Price history must be retained for 2+ years for trend analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;product-catalog&#34;&gt;Product Catalog&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K SKUs across 15 categories&lt;/li&gt;
&lt;li&gt;Average product record: 2 KB (specs, descriptions, images, links)&lt;/li&gt;
&lt;li&gt;Total catalog: 500K × 2 KB = &lt;strong&gt;1 GB&lt;/strong&gt; — easily fits in memory&lt;/li&gt;
&lt;li&gt;Compatibility attributes per product: ~500 bytes of structured spec data&lt;/li&gt;
&lt;li&gt;Compatibility data: 500K × 500 bytes = &lt;strong&gt;250 MB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;compatibility-checks&#34;&gt;Compatibility Checks&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100M checks/day = &lt;strong&gt;1,157 checks/sec&lt;/strong&gt; average, ~5,000/sec peak&lt;/li&gt;
&lt;li&gt;Each check: evaluate ~10-15 rules per component pair&lt;/li&gt;
&lt;li&gt;Average build has 7 components → adding 1 component checks against 6 others → ~60-90 rule evaluations&lt;/li&gt;
&lt;li&gt;At &amp;lt; 200ms budget and ~90 rules: &amp;lt; 2ms per rule&lt;/li&gt;
&lt;/ul&gt;
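&lt;p&gt;A few of the pairwise rules can be sketched to show why each evaluation is cheap (well under the 2 ms budget). Field names and the ~30% PSU headroom rule are illustrative:&lt;/p&gt;

```python
def check_build(parts):
    """Evaluate a handful of compatibility rules over a build.
    parts: {category: spec dict}; rules and fields are illustrative."""
    issues = []
    cpu, mobo = parts.get("cpu"), parts.get("motherboard")
    if cpu and mobo and cpu["socket"] != mobo["socket"]:
        issues.append(f"CPU socket {cpu['socket']} does not fit {mobo['socket']}")
    ram = parts.get("ram")
    if ram and mobo and ram["ddr_gen"] != mobo["ddr_gen"]:
        issues.append("RAM generation mismatch (e.g. DDR4 vs DDR5)")
    psu = parts.get("psu")
    if psu:
        tdp = sum(p.get("tdp_watts", 0) for p in parts.values())
        if tdp * 1.3 > psu["wattage"]:  # ~30% headroom rule (assumed)
            issues.append(f"PSU {psu['wattage']} W is under ~{int(tdp * 1.3)} W recommended")
    return {"status": "incompatible" if issues else "compatible", "issues": issues}
```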
&lt;h3 id=&#34;price-data&#34;&gt;Price Data&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K products × 10 retailers average = 5M price points&lt;/li&gt;
&lt;li&gt;Price scrape frequency: every 1 hour per retailer&lt;/li&gt;
&lt;li&gt;5M scrapes/hour = &lt;strong&gt;1,400 scrapes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Price history: 5M price points × 365 days × 24 samples/day × 10 bytes = &lt;strong&gt;~440 GB/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;user-builds&#34;&gt;User Builds&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M monthly users, ~30% create builds = 3M builds/month&lt;/li&gt;
&lt;li&gt;Average build: 500 bytes (7 component IDs + metadata)&lt;/li&gt;
&lt;li&gt;3M × 500 bytes = &lt;strong&gt;1.5 GB/month&lt;/strong&gt; — trivial storage&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-endpoints&#34;&gt;REST Endpoints&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create or update a build
POST /v1/builds
  Headers: Authorization: Bearer &amp;lt;token&amp;gt;  // optional for anonymous builds
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;My Gaming PC 2026&amp;#34;,
    &amp;#34;components&amp;#34;: {
      &amp;#34;cpu&amp;#34;: &amp;#34;sku_amd_9800x3d&amp;#34;,
      &amp;#34;motherboard&amp;#34;: &amp;#34;sku_asus_x870e&amp;#34;,
      &amp;#34;ram&amp;#34;: &amp;#34;sku_corsair_ddr5_6000_32gb&amp;#34;,
      &amp;#34;gpu&amp;#34;: &amp;#34;sku_nvidia_5080&amp;#34;,
      &amp;#34;storage&amp;#34;: &amp;#34;sku_samsung_990_pro_2tb&amp;#34;,
      &amp;#34;psu&amp;#34;: &amp;#34;sku_corsair_rm850x&amp;#34;,
      &amp;#34;case&amp;#34;: &amp;#34;sku_fractal_north&amp;#34;
    }
  }
  Response 200: {
    &amp;#34;build_id&amp;#34;: &amp;#34;build_abc123&amp;#34;,
    &amp;#34;compatibility&amp;#34;: {
      &amp;#34;status&amp;#34;: &amp;#34;compatible&amp;#34;,           // compatible | warning | incompatible
      &amp;#34;issues&amp;#34;: [],
      &amp;#34;warnings&amp;#34;: [
        {
          &amp;#34;type&amp;#34;: &amp;#34;bottleneck&amp;#34;,
          &amp;#34;severity&amp;#34;: &amp;#34;low&amp;#34;,
          &amp;#34;message&amp;#34;: &amp;#34;CPU may slightly bottleneck GPU at 1080p. Consider 1440p+ gaming.&amp;#34;
        }
      ],
      &amp;#34;power_budget&amp;#34;: {
        &amp;#34;estimated_tdp&amp;#34;: 450,
        &amp;#34;psu_wattage&amp;#34;: 850,
        &amp;#34;headroom_pct&amp;#34;: 47,
        &amp;#34;rating&amp;#34;: &amp;#34;excellent&amp;#34;
      }
    },
    &amp;#34;total_price&amp;#34;: 189499,              // $1,894.99 in cents
    &amp;#34;price_by_component&amp;#34;: { ... }
  }

// Check compatibility for adding a single component
POST /v1/builds/{build_id}/check
  Body: { &amp;#34;category&amp;#34;: &amp;#34;gpu&amp;#34;, &amp;#34;sku&amp;#34;: &amp;#34;sku_nvidia_5080&amp;#34; }
  Response 200: {
    &amp;#34;compatible&amp;#34;: true,
    &amp;#34;issues&amp;#34;: [],
    &amp;#34;warnings&amp;#34;: [...],
    &amp;#34;updated_power_budget&amp;#34;: { ... }
  }

// Search products with filters
GET /v1/products?category=gpu&amp;amp;socket=AM5&amp;amp;min_vram=12&amp;amp;max_price=80000&amp;amp;sort=price_asc
  Response 200: {
    &amp;#34;products&amp;#34;: [
      {
        &amp;#34;sku&amp;#34;: &amp;#34;sku_nvidia_5080&amp;#34;,
        &amp;#34;name&amp;#34;: &amp;#34;NVIDIA GeForce RTX 5080&amp;#34;,
        &amp;#34;specs&amp;#34;: { &amp;#34;vram_gb&amp;#34;: 16, &amp;#34;tdp_watts&amp;#34;: 360, &amp;#34;length_mm&amp;#34;: 304, ... },
        &amp;#34;compatible_with_build&amp;#34;: true,  // pre-filtered based on current build
        &amp;#34;prices&amp;#34;: [
          { &amp;#34;retailer&amp;#34;: &amp;#34;Amazon&amp;#34;, &amp;#34;price&amp;#34;: 99999, &amp;#34;url&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;in_stock&amp;#34;: true },
          { &amp;#34;retailer&amp;#34;: &amp;#34;Newegg&amp;#34;, &amp;#34;price&amp;#34;: 98999, &amp;#34;url&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;in_stock&amp;#34;: true }
        ]
      }
    ]
  }

// Get price history for a product
GET /v1/products/{sku}/price-history?period=90d
  Response 200: {
    &amp;#34;sku&amp;#34;: &amp;#34;sku_nvidia_5080&amp;#34;,
    &amp;#34;history&amp;#34;: [
      { &amp;#34;date&amp;#34;: &amp;#34;2026-02-22&amp;#34;, &amp;#34;min_price&amp;#34;: 98999, &amp;#34;avg_price&amp;#34;: 100499 },
      ...
    ]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;products-postgresql&#34;&gt;Products (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: products
  sku              (PK) | varchar(50)
  category               | enum(&amp;#39;cpu&amp;#39;,&amp;#39;motherboard&amp;#39;,&amp;#39;ram&amp;#39;,&amp;#39;gpu&amp;#39;,&amp;#39;storage&amp;#39;,&amp;#39;psu&amp;#39;,&amp;#39;case&amp;#39;,&amp;#39;cooler&amp;#39;,&amp;#39;monitor&amp;#39;,...)
  name                   | varchar(200)
  manufacturer           | varchar(100)
  specs                  | jsonb         -- category-specific specifications
  compatibility_attrs    | jsonb         -- extracted compatibility attributes
  image_url              | varchar(500)
  release_date           | date
  is_active              | boolean
  updated_at             | timestamp

-- Example specs for a CPU:
-- { &amp;#34;socket&amp;#34;: &amp;#34;AM5&amp;#34;, &amp;#34;cores&amp;#34;: 8, &amp;#34;threads&amp;#34;: 16, &amp;#34;base_clock_ghz&amp;#34;: 4.7,
--   &amp;#34;boost_clock_ghz&amp;#34;: 5.2, &amp;#34;tdp_watts&amp;#34;: 120, &amp;#34;ram_type&amp;#34;: &amp;#34;DDR5&amp;#34;,
--   &amp;#34;max_ram_speed_mhz&amp;#34;: 5200, &amp;#34;pcie_version&amp;#34;: &amp;#34;5.0&amp;#34;, &amp;#34;pcie_lanes&amp;#34;: 24,
--   &amp;#34;integrated_graphics&amp;#34;: false, &amp;#34;cooler_included&amp;#34;: false }

-- Example compatibility_attrs for a CPU:
-- { &amp;#34;socket&amp;#34;: &amp;#34;AM5&amp;#34;, &amp;#34;ram_type&amp;#34;: &amp;#34;DDR5&amp;#34;, &amp;#34;tdp_watts&amp;#34;: 120, &amp;#34;pcie_version&amp;#34;: &amp;#34;5.0&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;compatibility-rules-postgresql--in-memory-cache&#34;&gt;Compatibility Rules (PostgreSQL + in-memory cache)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: compatibility_rules
  rule_id          (PK) | int
  name                   | varchar(100)
  category_a             | varchar(20)   -- e.g., &amp;#34;cpu&amp;#34;
  category_b             | varchar(20)   -- e.g., &amp;#34;motherboard&amp;#34;
  rule_type              | enum(&amp;#39;must_match&amp;#39;, &amp;#39;must_fit&amp;#39;, &amp;#39;must_not_exceed&amp;#39;, &amp;#39;warning&amp;#39;)
  attribute_a            | varchar(50)   -- e.g., &amp;#34;socket&amp;#34;
  attribute_b            | varchar(50)   -- e.g., &amp;#34;socket&amp;#34;
  operator               | enum(&amp;#39;equals&amp;#39;, &amp;#39;gte&amp;#39;, &amp;#39;lte&amp;#39;, &amp;#39;contains&amp;#39;, &amp;#39;fits_in&amp;#39;, &amp;#39;custom&amp;#39;)
  custom_logic           | text          -- for complex rules (e.g., RAM slot count + DIMM count)
  severity               | enum(&amp;#39;error&amp;#39;, &amp;#39;warning&amp;#39;, &amp;#39;info&amp;#39;)
  message_template       | text          -- &amp;#34;CPU socket {a} does not match motherboard socket {b}&amp;#34;
  is_active              | boolean
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;user-builds-postgresql&#34;&gt;User Builds (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: builds
  build_id         (PK) | varchar(20)
  user_id          (FK) | varchar(20)  -- nullable for anonymous
  name                   | varchar(100)
  components             | jsonb        -- { &amp;#34;cpu&amp;#34;: &amp;#34;sku_xxx&amp;#34;, &amp;#34;gpu&amp;#34;: &amp;#34;sku_yyy&amp;#34;, ... }
  total_price            | int
  compatibility_status   | enum(&amp;#39;compatible&amp;#39;,&amp;#39;warning&amp;#39;,&amp;#39;incompatible&amp;#39;)
  is_public              | boolean
  created_at             | timestamp
  updated_at             | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;prices-timescaledb--postgresql-with-partitioning&#34;&gt;Prices (TimescaleDB / PostgreSQL with partitioning)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: prices
  sku              (FK) | varchar(50)
  retailer               | varchar(50)
  price                  | int           -- in cents
  in_stock               | boolean
  url                    | varchar(500)
  scraped_at             | timestamp
  -- Partitioned by month on scraped_at
  -- Index on (sku, retailer, scraped_at DESC)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-postgresql&#34;&gt;Why PostgreSQL?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Product catalog and compatibility rules need relational integrity&lt;/li&gt;
&lt;li&gt;JSONB for flexible specs (each category has different attributes)&lt;/li&gt;
&lt;li&gt;Compatibility rules are read-heavy, easily cached — PostgreSQL is fine as the source of truth&lt;/li&gt;
&lt;li&gt;Price data benefits from TimescaleDB extension for time-series queries (price history charts)&lt;/li&gt;
&lt;/ul&gt;
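The declarative rules table above can be driven by a small interpreter. A minimal sketch, assuming the column names from the `compatibility_rules` schema; the `OPERATORS` map and `evaluate_rule` helper are illustrative names, not part of the design:

```python
# Illustrative: evaluate one compatibility_rules row against two components'
# compatibility_attrs dicts (the JSONB column from the products table).
OPERATORS = {
    "equals":   lambda a, b: a == b,
    "gte":      lambda a, b: a >= b,
    "lte":      lambda a, b: b >= a,   # i.e. a is at most b
    "contains": lambda a, b: b in a,
}

def evaluate_rule(rule, comp_a_attrs, comp_b_attrs):
    """Return None if the rule passes, else a dict describing the issue."""
    a = comp_a_attrs.get(rule["attribute_a"])
    b = comp_b_attrs.get(rule["attribute_b"])
    if a is None or b is None:
        return None  # missing spec data: flag upstream rather than block the build
    if OPERATORS[rule["operator"]](a, b):
        return None
    return {
        "severity": rule["severity"],
        "message": rule["message_template"].format(a=a, b=b),
    }

socket_rule = {
    "attribute_a": "socket", "attribute_b": "socket",
    "operator": "equals", "severity": "error",
    "message_template": "CPU socket {a} does not match motherboard socket {b}",
}
cpu  = {"socket": "AM5", "ram_type": "DDR5", "tdp_watts": 120}
mobo = {"socket": "LGA1851"}
issue = evaluate_rule(socket_rule, cpu, mobo)
# issue["message"] is "CPU socket AM5 does not match motherboard socket LGA1851"
```

Because rules are data, adding a new check (say, cooler height vs. case clearance) is a row insert plus a cache refresh, not a code deploy.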
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;compatibility-check-flow&#34;&gt;Compatibility Check Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User adds a GPU to their build:
  → Frontend sends: POST /v1/builds/{id}/check { &amp;#34;category&amp;#34;: &amp;#34;gpu&amp;#34;, &amp;#34;sku&amp;#34;: &amp;#34;sku_5080&amp;#34; }
    → Build Service:
      1. Load current build components from cache/DB
      2. Load new component&amp;#39;s compatibility_attrs from product cache
      3. Call Compatibility Engine:
         For each existing component in the build:
           → Find applicable rules (gpu ↔ motherboard, gpu ↔ psu, gpu ↔ case, ...)
           → Evaluate each rule:
              Rule: gpu.pcie_version &amp;lt;= motherboard.pcie_version → OK
              Rule: gpu.length_mm &amp;lt;= case.max_gpu_length_mm → OK
              Rule: gpu.tdp_watts + ... &amp;lt;= psu.wattage × 0.8 → OK
              Rule: gpu.power_connectors available on PSU → OK
           → Collect all issues and warnings
      4. Calculate power budget:
         Sum TDP of all components → estimated_draw
         Headroom = (psu_wattage - estimated_draw) / psu_wattage × 100
      5. Check bottleneck (heuristic):
         CPU tier vs GPU tier (based on benchmark scores)
         If mismatch &amp;gt; threshold → warning
      6. Return result
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;product-search-with-compatibility-filtering&#34;&gt;Product Search with Compatibility Filtering&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User is on GPU selection page (already has CPU, motherboard, case, PSU chosen):
  → GET /v1/products?category=gpu&amp;amp;compatible_with=build_abc123

  → Product Service:
    1. Load all GPUs from product cache (filtered by user&amp;#39;s other criteria)
    2. For each GPU candidate:
       → Run compatibility check against existing build components
       → Mark as compatible/warning/incompatible
    3. Return only compatible + warning GPUs (hide incompatible by default)
    4. Sort by user preference (price, performance, popularity)

  Optimization: Pre-compute compatibility for common component pairs.
  For 1,000 CPUs × 500 motherboards = 500K pairs → precompute and cache.
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Build Service:&lt;/strong&gt; Manages user builds. Orchestrates compatibility checks. Stateless, horizontally scaled.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compatibility Engine:&lt;/strong&gt; Core rule evaluation logic. Loaded in-memory with all rules and product compatibility attributes. &amp;lt; 200ms per full build check.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product Service:&lt;/strong&gt; Manages the product catalog. Search, filter, sort. Elasticsearch for full-text search + faceted filtering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Price Aggregation Service:&lt;/strong&gt; Scrapes retailer prices. Stores price history. Detects price drops and sends alerts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product Cache (Redis):&lt;/strong&gt; All 500K products + compatibility attributes cached. 250 MB — fits easily. Refreshed on catalog updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rules Cache (in-memory):&lt;/strong&gt; All compatibility rules loaded into each service instance&amp;rsquo;s memory. Refreshed every 5 minutes from PostgreSQL.&lt;/li&gt;
&lt;/ol&gt;
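The power-budget step from the flow above is simple enough to sketch directly. The 0.8 PSU loading factor and headroom formula come from the flow; the rating thresholds here are assumptions for illustration:

```python
# Sketch of step 4 of the compatibility check: sum component TDPs,
# compute PSU headroom, and rate the result.
def power_budget(components, psu_wattage):
    draw = sum(c.get("tdp_watts", 0) for c in components)
    headroom_pct = round((psu_wattage - draw) / psu_wattage * 100)
    ok = psu_wattage * 0.8 >= draw          # total draw must stay within 80% of PSU
    # Rating cutoffs below are illustrative, not from the design
    rating = "excellent" if headroom_pct >= 40 else ("adequate" if ok else "insufficient")
    return {"estimated_tdp": draw, "psu_wattage": psu_wattage,
            "headroom_pct": headroom_pct, "rating": rating}

build = [{"tdp_watts": 120},   # CPU
         {"tdp_watts": 300},   # GPU
         {"tdp_watts": 30}]    # rough allowance for everything else
budget = power_budget(build, 850)
# matches the API example: 450W draw, 47% headroom, "excellent"
```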
&lt;h3 id=&#34;architecture-diagram&#34;&gt;Architecture Diagram&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Users (browser)
  → CDN (static assets, product images)
  → API Gateway → Load Balancer
    → Build Service
      → Compatibility Engine (in-process)
        → Product Cache (Redis)
        → Rules Cache (in-memory)
    → Product Service
      → Elasticsearch (search + faceted filters)
      → PostgreSQL (catalog source of truth)
    → Price Service
      → Price Scrapers (distributed workers)
      → TimescaleDB (price history)
      → Redis (current prices)

Background:
  Price Scrapers → Retailer websites/APIs → TimescaleDB + Redis
  Catalog Updater → Manufacturer feeds → PostgreSQL → Redis cache invalidation
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-the-compatibility-rules-engine&#34;&gt;Deep Dive 1: The Compatibility Rules Engine&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Rule categories and examples:&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Payment Payout System</title>
      <link>https://chiraghasija.cc/designs/seller-payments/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/seller-payments/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;question-statement&#34;&gt;Question Statement&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Design a payment payout system for a marketplace (like Amazon, Uber, Airbnb) that pays out sellers/drivers/hosts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Constraints:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Payment gateway takes ~1 minute to process a payment&lt;/li&gt;
&lt;li&gt;Fixed fee per transfer charged by the payment gateway&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Requirements (in order of importance):&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Audit log of all payments made out to sellers&lt;/li&gt;
&lt;li&gt;Sellers paid out in their preferred form of payment, as soon as possible&lt;/li&gt;
&lt;li&gt;Sellers can check the status of their payments, or see steps required to fix problems&lt;/li&gt;
&lt;li&gt;Keep payment gateway fees to a minimum&lt;/li&gt;
&lt;li&gt;No payments should be dropped&lt;/li&gt;
&lt;li&gt;No duplicate payments should be made out&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;#&lt;/th&gt;
          &lt;th&gt;Requirement&lt;/th&gt;
          &lt;th&gt;Details&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;FR1&lt;/td&gt;
          &lt;td&gt;Payment Processing&lt;/td&gt;
          &lt;td&gt;Process seller payouts in their preferred method (bank transfer, PayPal, UPI, etc.)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR2&lt;/td&gt;
          &lt;td&gt;Batching&lt;/td&gt;
          &lt;td&gt;Batch multiple small payments into fewer gateway calls to minimize fees&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR3&lt;/td&gt;
          &lt;td&gt;Idempotency&lt;/td&gt;
          &lt;td&gt;No duplicate payments — exactly-once execution guarantee&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR4&lt;/td&gt;
          &lt;td&gt;Audit Log&lt;/td&gt;
          &lt;td&gt;Immutable, append-only log of every payment state transition&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR5&lt;/td&gt;
          &lt;td&gt;Status Tracking&lt;/td&gt;
          &lt;td&gt;Real-time, queryable status for every payment with actionable error messages&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR6&lt;/td&gt;
          &lt;td&gt;Retry &amp;amp; Recovery&lt;/td&gt;
          &lt;td&gt;Failed payments are retried with backoff; stuck payments are recovered automatically&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR7&lt;/td&gt;
          &lt;td&gt;Settlement Tracking&lt;/td&gt;
          &lt;td&gt;Track whether money actually landed in seller&amp;rsquo;s bank (gateway success ≠ bank settlement)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FR8&lt;/td&gt;
          &lt;td&gt;Reconciliation&lt;/td&gt;
          &lt;td&gt;Daily comparison of gateway records vs internal records to catch mismatches&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;#&lt;/th&gt;
          &lt;th&gt;Requirement&lt;/th&gt;
          &lt;th&gt;Target&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR1&lt;/td&gt;
          &lt;td&gt;Durability&lt;/td&gt;
          &lt;td&gt;Zero data loss. RPO = 0. Every payment survives any single system failure&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR2&lt;/td&gt;
          &lt;td&gt;Consistency&lt;/td&gt;
          &lt;td&gt;Strong consistency for payment state. No scenario where payment is charged but not recorded&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR3&lt;/td&gt;
          &lt;td&gt;Availability&lt;/td&gt;
          &lt;td&gt;99.99% for payment ingestion and status API&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR4&lt;/td&gt;
          &lt;td&gt;Ingestion Latency&lt;/td&gt;
          &lt;td&gt;&amp;lt; 200ms to accept and acknowledge a payment request&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR5&lt;/td&gt;
          &lt;td&gt;Status Lookup Latency&lt;/td&gt;
          &lt;td&gt;&amp;lt; 100ms&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR6&lt;/td&gt;
          &lt;td&gt;Execution Latency&lt;/td&gt;
          &lt;td&gt;Minutes are acceptable (batching + gateway processing time)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NFR7&lt;/td&gt;
          &lt;td&gt;Idempotency&lt;/td&gt;
          &lt;td&gt;Every payment has a globally unique idempotency key at every level&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
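FR3 and NFR7 hinge on the client-supplied idempotency key: a retried `POST /payments` must return the original payment, never create a second one. A minimal sketch, with an in-memory dict standing in for what would be a unique-constraint insert on the payments table (`PaymentStore` is a hypothetical name):

```python
# Idempotent ingestion sketch: the idempotency_key dedupes retried requests.
import uuid

class PaymentStore:
    def __init__(self):
        self._by_key = {}

    def ingest(self, idempotency_key, seller_id, amount):
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing            # duplicate request: same response, no new payment
        payment = {"payment_id": "p_" + uuid.uuid4().hex[:8],
                   "seller_id": seller_id, "amount": amount, "status": "PENDING"}
        self._by_key[idempotency_key] = payment
        return payment

store = PaymentStore()
first = store.ingest("order_12345_payout_v1", "s456", 250.00)
retry = store.ingest("order_12345_payout_v1", "s456", 250.00)
# exactly one payment exists; the retry got the original back
```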
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Active sellers:           1M
Payouts per day:          10M transactions
Ingestion rate:           10M / 86400 ≈ 115 payments/sec
Batch ratio (10:1):       ~12 gateway calls/sec
Audit log growth:         10M records/day → ~3.6B/year
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /payments                      → Initiate a payout
  Request:
    {
      idempotency_key: &amp;#34;order_12345_payout_v1&amp;#34;,   → client generates this
      seller_id: &amp;#34;s456&amp;#34;,
      amount: 250.00,
      currency: &amp;#34;USD&amp;#34;,
      payment_method: &amp;#34;bank_transfer&amp;#34;
    }
  Response: 202 Accepted
    {
      payment_id: &amp;#34;p789&amp;#34;,
      status: &amp;#34;PENDING&amp;#34;
    }

GET /payments/{payment_id}          → Check payment status
  Response:
    {
      payment_id: &amp;#34;p789&amp;#34;,
      status: &amp;#34;ACCEPTED&amp;#34;,
      amount: 250.00,
      batch_id: &amp;#34;b101&amp;#34;,
      failure_reason: null,
      action_required: null,
      estimated_settlement: &amp;#34;2026-02-26T14:00:00Z&amp;#34;
    }

GET /sellers/{seller_id}/payments   → Seller payment history
  Query params: ?status=FAILED&amp;amp;from=2026-01-01&amp;amp;to=2026-02-25
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-payment-state-machine&#34;&gt;4. Payment State Machine&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PENDING ───▶ BATCHED ───▶ SUBMITTED ───▶ ACCEPTED ───▶ SETTLED ✓
                                          │
                                          ├───▶ REVERSED (bank rejected after acceptance)
                                          │
                                          └───▶ RETURNED (settled but bounced back)

At any failed stage:
  FAILED ───▶ (re-enters as PENDING after fix or retry)
&lt;/code&gt;&lt;/pre&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Status&lt;/th&gt;
          &lt;th&gt;Meaning&lt;/th&gt;
          &lt;th&gt;Seller-Facing Message&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;PENDING&lt;/td&gt;
          &lt;td&gt;Payment received, waiting to be batched&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment received, queued for processing&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;BATCHED&lt;/td&gt;
          &lt;td&gt;Grouped with other payments for fee optimization&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment queued for processing&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;SUBMITTED&lt;/td&gt;
          &lt;td&gt;Batch sent to payment gateway&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment sent to payment processor&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;ACCEPTED&lt;/td&gt;
          &lt;td&gt;Gateway accepted, in transit to bank&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment in transit to your bank&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;SETTLED&lt;/td&gt;
          &lt;td&gt;Money confirmed in seller&amp;rsquo;s account&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment deposited in your account&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;REVERSED&lt;/td&gt;
          &lt;td&gt;Bank rejected the transfer&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment failed: [reason]. [action required]&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RETURNED&lt;/td&gt;
          &lt;td&gt;Settled but bounced back&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment returned by your bank. Please contact your bank.&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;FAILED&lt;/td&gt;
          &lt;td&gt;Exhausted retries or unrecoverable error&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;Payment failed: [reason]. [action required]&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
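The diagram and table above can be encoded as an explicit transition table, so an illegal jump (e.g. PENDING straight to SETTLED) is rejected before it can be recorded. A sketch; the exact edge set is illustrative (the diagram shows RETURNED branching from ACCEPTED, while the table describes it as settled-then-bounced, so both edges appear here):

```python
# Payment state machine as data: any transition not listed is rejected.
TRANSITIONS = {
    "PENDING":   {"BATCHED", "FAILED"},
    "BATCHED":   {"SUBMITTED", "FAILED"},
    "SUBMITTED": {"ACCEPTED", "FAILED"},
    "ACCEPTED":  {"SETTLED", "REVERSED", "RETURNED"},
    "SETTLED":   {"RETURNED"},         # settled money can still bounce back
    "FAILED":    {"PENDING"},          # re-enters after fix or retry
}

def advance(payment, new_status):
    if new_status not in TRANSITIONS.get(payment["status"], set()):
        raise ValueError(f"illegal transition {payment['status']} to {new_status}")
    payment["status"] = new_status     # in production: also append to the audit log
    return payment

p = {"status": "PENDING"}
for step in ("BATCHED", "SUBMITTED", "ACCEPTED", "SETTLED"):
    advance(p, step)                   # happy path walks the full chain
```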
&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Gateway returning &amp;ldquo;success&amp;rdquo; only means ACCEPTED, not SETTLED. True settlement confirmation comes hours/days later via webhooks or polling.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Peer-to-Peer File Sharing System (BitTorrent)</title>
      <link>https://chiraghasija.cc/designs/bittorrent/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/bittorrent/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can share files by creating a torrent (metadata file) and seeding (uploading) the actual content to peers&lt;/li&gt;
&lt;li&gt;Users can download files by obtaining a torrent file or magnet link, discovering peers who have the file, and downloading pieces from multiple peers simultaneously&lt;/li&gt;
&lt;li&gt;Support peer discovery through both centralized trackers and decentralized DHT (Distributed Hash Table)&lt;/li&gt;
&lt;li&gt;Ensure file integrity — every downloaded piece is verified against a cryptographic hash before being accepted&lt;/li&gt;
&lt;li&gt;Implement an incentive mechanism (tit-for-tat) — peers who upload more get faster downloads; freeloaders (leechers) get throttled&lt;/li&gt;
&lt;/ol&gt;
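The tit-for-tat mechanism in requirement 5 amounts to a periodic unchoke decision: reward the peers that uploaded to us the most, and reserve one random "optimistic" slot so new peers can prove themselves. A sketch; the slot counts are conventional client behavior, not fixed by the protocol:

```python
# Tit-for-tat unchoke selection sketch.
import random

def choose_unchoked(peers, regular_slots=3):
    """peers: dict of peer_id to bytes they uploaded to us in the last period."""
    by_contribution = sorted(peers, key=peers.get, reverse=True)
    unchoked = by_contribution[:regular_slots]      # best uploaders keep their slots
    choked = by_contribution[regular_slots:]
    if choked:
        unchoked.append(random.choice(choked))      # optimistic unchoke: give one a chance
    return set(unchoked)

swarm = {"peer_a": 9_000_000, "peer_b": 4_000_000, "peer_c": 100_000, "peer_d": 0}
active = choose_unchoked(swarm)
# top three uploaders are always unchoked; the freeloader only gets the optimistic slot
```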
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; The system must work even when the tracker is offline (DHT provides decentralized fallback). No single point of failure for content distribution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Peer discovery &amp;lt; 5 seconds. First bytes of download begin within 10 seconds of starting. Sustained throughput scales with the number of available peers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for peer lists (stale peers are tolerable — the client will simply fail to connect and try others). Strong consistency for file integrity (every piece must match its hash).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Support files from 1 MB to 100 GB. Swarms with 1 million simultaneous peers. Global DHT with 100 million nodes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Files remain available as long as at least one seeder exists. The more popular a file, the more available it becomes (self-scaling).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;file-distribution&#34;&gt;File Distribution&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A 4 GB file, split into 256 KB pieces = 16,384 pieces&lt;/li&gt;
&lt;li&gt;Each piece has a 20-byte SHA-1 hash for verification&lt;/li&gt;
&lt;li&gt;Torrent metadata (info dictionary): 16,384 × 20 bytes = &lt;strong&gt;320 KB&lt;/strong&gt; of hashes + metadata ≈ 350 KB total&lt;/li&gt;
&lt;/ul&gt;
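The integrity scheme behind these numbers can be sketched with Python's `hashlib`: split the file into 256 KB pieces, store each piece's 20-byte SHA-1 in the torrent metadata, and accept a downloaded piece only if its hash matches:

```python
# Piece hashing and verification sketch, piece size per the estimate above.
import hashlib

PIECE_SIZE = 256 * 1024

def piece_hashes(data):
    """20-byte SHA-1 per piece; this list is what goes in the torrent's info dict."""
    return [hashlib.sha1(data[i:i + PIECE_SIZE]).digest()
            for i in range(0, len(data), PIECE_SIZE)]

def verify_piece(index, piece, hashes):
    return hashlib.sha1(piece).digest() == hashes[index]

file_bytes = bytes(1_000_000)                 # stand-in for real content
hashes = piece_hashes(file_bytes)             # 4 pieces: ceil(1,000,000 / 262,144)
ok = verify_piece(0, file_bytes[:PIECE_SIZE], hashes)
bad = verify_piece(0, b"corrupted", hashes)
```

Verification happens per piece, so a corrupt or malicious peer costs at most one 256 KB re-download, never the whole file.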
&lt;h3 id=&#34;tracker-load&#34;&gt;Tracker Load&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1 million active torrents&lt;/li&gt;
&lt;li&gt;Average swarm size: 200 peers&lt;/li&gt;
&lt;li&gt;Each peer announces to the tracker every 30 minutes&lt;/li&gt;
&lt;li&gt;200M peers / 1800 seconds = &lt;strong&gt;111K announce requests/sec&lt;/strong&gt; to the tracker&lt;/li&gt;
&lt;li&gt;Each announce: ~200 bytes request, ~2 KB response (list of 50 peers)&lt;/li&gt;
&lt;li&gt;Bandwidth: 111K × 2 KB = &lt;strong&gt;222 MB/sec&lt;/strong&gt; tracker outbound&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;dht-scale&#34;&gt;DHT Scale&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100 million DHT nodes globally&lt;/li&gt;
&lt;li&gt;Routing table per node: 160 k-buckets × 8 entries = 1,280 peers ≈ &lt;strong&gt;50 KB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;DHT lookup: O(log N) hops. log2(100M) ≈ 27 hops, but k-bucket routing reduces this to ~4-6 hops in practice&lt;/li&gt;
&lt;li&gt;Each hop: ~50ms (UDP round trip) → &lt;strong&gt;200-300ms&lt;/strong&gt; total DHT lookup time&lt;/li&gt;
&lt;/ul&gt;
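The hop count above comes from Kademlia's XOR metric: each hop queries nodes whose IDs share a longer prefix with the target, roughly halving the remaining ID space. A sketch with small 8-bit IDs instead of the real 160-bit ones:

```python
# Kademlia XOR distance sketch (toy 8-bit IDs; BitTorrent's DHT uses 160-bit).
def xor_distance(node_id, target_id):
    return node_id ^ target_id

def closest(nodes, target_id, k=8):
    """One lookup step: return the k known nodes closest to the target."""
    return sorted(nodes, key=lambda n: xor_distance(n, target_id))[:k]

target = 0b1011_0000
known = [0b1011_0001, 0b0100_1111, 0b1010_0000, 0b1111_0000]
nearest = closest(known, target, k=2)
# the node differing only in the last bit (0b1011_0001) sorts first
```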
&lt;h3 id=&#34;download-throughput&#34;&gt;Download Throughput&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A well-seeded torrent with 100 peers, each contributing 1 MB/sec&lt;/li&gt;
&lt;li&gt;Theoretical max: 100 MB/sec aggregate; in practice the ceiling is the receiver&amp;rsquo;s own downlink, not the peer count&lt;/li&gt;
&lt;li&gt;Practical: 20-50 MB/sec (connection overhead, slow peers, churn)&lt;/li&gt;
&lt;li&gt;A 4 GB file at 30 MB/sec ≈ &lt;strong&gt;2.2 minutes&lt;/strong&gt; to download&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;p&gt;BitTorrent is a peer-to-peer protocol rather than a traditional client-server API, but it still defines three distinct interfaces.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Performance Metrics Collection System</title>
      <link>https://chiraghasija.cc/designs/performance-metrics/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/performance-metrics/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Collect performance metrics (CPU, memory, disk, network, custom application metrics) from thousands of servers and services&lt;/li&gt;
&lt;li&gt;Store time-series data with configurable retention (high-resolution for recent data, downsampled for historical)&lt;/li&gt;
&lt;li&gt;Support flexible querying: aggregate metrics across dimensions (host, service, region), with functions (avg, p99, sum, rate)&lt;/li&gt;
&lt;li&gt;Real-time alerting: trigger alerts when metrics cross thresholds (e.g., CPU &amp;gt; 90% for 5 minutes)&lt;/li&gt;
&lt;li&gt;Dashboard visualization: support building dashboards with graphs, heatmaps, and tables that auto-refresh&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — monitoring must be more reliable than what it monitors. A monitoring outage during a production incident is catastrophic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Metric ingestion: &amp;lt; 5 seconds end-to-end (from source to queryable). Dashboard queries: &amp;lt; 500ms for recent data (last 1 hour), &amp;lt; 2 seconds for historical data (last 30 days).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10,000 hosts, each emitting 500 metrics at 10-second intervals = 500K data points/sec ingestion. 10 million active time series. 10TB of metric data in storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Metric data must survive single-node failures. 30-day raw retention, 1-year downsampled retention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cardinality:&lt;/strong&gt; Handle high-cardinality labels gracefully (up to 1M unique label combinations per metric name without degrading query performance).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ingestion&#34;&gt;Ingestion&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10,000 hosts x 500 metrics x 1 sample/10 sec = &lt;strong&gt;500K samples/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each sample: metric_name (hashed to 8 bytes) + labels (8 bytes hash) + timestamp (8 bytes) + value (8 bytes) = ~32 bytes&lt;/li&gt;
&lt;li&gt;Ingestion bandwidth: 500K x 32 bytes = &lt;strong&gt;16 MB/sec&lt;/strong&gt; — modest, not a bottleneck&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Samples per day: 500K/sec x 86,400 = &lt;strong&gt;43.2 billion samples/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Raw storage per day: 43.2B x 16 bytes (timestamp + value, compressed) = &lt;strong&gt;691 GB/day&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;With compression (gorilla encoding): ~1.37 bytes per sample → &lt;strong&gt;59 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;30-day raw retention: 59 x 30 = &lt;strong&gt;1.8 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;1-year downsampled (5-min resolution): 59 GB/day / 30 (10s → 5min) x 365 = &lt;strong&gt;717 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Total: ~&lt;strong&gt;2.5 TB&lt;/strong&gt; — surprisingly small thanks to time-series compression&lt;/li&gt;
&lt;/ul&gt;
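The storage estimate above, reproduced end-to-end in decimal units (the 1.37 bytes/sample Gorilla figure and the 30x downsampling factor are the ones quoted above):

```python
# Storage arithmetic sketch for the figures above.
samples_per_sec = 10_000 * 500 / 10              # 500K samples/sec
per_day = samples_per_sec * 86_400               # 43.2 billion samples/day

raw_gb_day = per_day * 16 / 1e9                  # 16 bytes/sample uncompressed
gorilla_gb_day = per_day * 1.37 / 1e9            # ~1.37 bytes/sample with Gorilla

retained_30d_tb = gorilla_gb_day * 30 / 1e3      # 30-day raw retention
downsampled_year_gb = gorilla_gb_day / 30 * 365  # 10s to 5min is a 30x reduction

print(round(raw_gb_day))          # 691
print(round(gorilla_gb_day))      # 59
print(round(retained_30d_tb, 1))  # 1.8
```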
&lt;h3 id=&#34;queries&#34;&gt;Queries&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active dashboards: 1,000 dashboards, each with 10 panels, refreshing every 30 seconds&lt;/li&gt;
&lt;li&gt;Query QPS: 1,000 x 10 / 30 = &lt;strong&gt;~333 queries/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each query scans: ~1 hour of data for 100 time series = 100 x 360 samples = 36K samples per query&lt;/li&gt;
&lt;li&gt;Query throughput: 333 x 36K = &lt;strong&gt;12M samples scanned/sec&lt;/strong&gt; — well within SSD IOPS capacity&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ingestion-api&#34;&gt;Ingestion API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Push metrics (agent → collector)
POST /api/v1/write
  Content-Type: application/x-protobuf    // or application/json
  Body: {
    &amp;#34;timeseries&amp;#34;: [
      {
        &amp;#34;labels&amp;#34;: {&amp;#34;__name__&amp;#34;: &amp;#34;cpu_usage&amp;#34;, &amp;#34;host&amp;#34;: &amp;#34;web-01&amp;#34;, &amp;#34;region&amp;#34;: &amp;#34;us-east&amp;#34;, &amp;#34;service&amp;#34;: &amp;#34;api&amp;#34;},
        &amp;#34;samples&amp;#34;: [
          {&amp;#34;timestamp&amp;#34;: 1708632060, &amp;#34;value&amp;#34;: 73.5},
          {&amp;#34;timestamp&amp;#34;: 1708632070, &amp;#34;value&amp;#34;: 75.2}
        ]
      },
      {
        &amp;#34;labels&amp;#34;: {&amp;#34;__name__&amp;#34;: &amp;#34;http_request_duration_seconds&amp;#34;, &amp;#34;method&amp;#34;: &amp;#34;GET&amp;#34;, &amp;#34;path&amp;#34;: &amp;#34;/api/users&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;200&amp;#34;},
        &amp;#34;samples&amp;#34;: [
          {&amp;#34;timestamp&amp;#34;: 1708632060, &amp;#34;value&amp;#34;: 0.042}
        ]
      }
    ]
  }
  Response 200: { &amp;#34;samples_written&amp;#34;: 3 }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;query-api&#34;&gt;Query API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// PromQL-compatible query
GET /api/v1/query_range
  Params:
    query = avg(rate(http_request_duration_seconds{service=&amp;#34;api&amp;#34;}[5m])) by (path)
    start = 1708628460    // 1 hour ago
    end = 1708632060      // now
    step = 15             // 15-second resolution

  Response 200: {
    &amp;#34;result_type&amp;#34;: &amp;#34;matrix&amp;#34;,
    &amp;#34;result&amp;#34;: [
      {
        &amp;#34;labels&amp;#34;: {&amp;#34;path&amp;#34;: &amp;#34;/api/users&amp;#34;},
        &amp;#34;values&amp;#34;: [[1708628460, &amp;#34;0.035&amp;#34;], [1708628475, &amp;#34;0.038&amp;#34;], ...]
      },
      {
        &amp;#34;labels&amp;#34;: {&amp;#34;path&amp;#34;: &amp;#34;/api/orders&amp;#34;},
        &amp;#34;values&amp;#34;: [[1708628460, &amp;#34;0.122&amp;#34;], [1708628475, &amp;#34;0.118&amp;#34;], ...]
      }
    ]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;alerting-api&#34;&gt;Alerting API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create alert rule
POST /api/v1/alerts
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;HighCPU&amp;#34;,
    &amp;#34;expr&amp;#34;: &amp;#34;avg(cpu_usage{service=&amp;#39;api&amp;#39;}) by (host) &amp;gt; 90&amp;#34;,
    &amp;#34;for&amp;#34;: &amp;#34;5m&amp;#34;,                    // must be true for 5 minutes
    &amp;#34;severity&amp;#34;: &amp;#34;critical&amp;#34;,
    &amp;#34;annotations&amp;#34;: {
      &amp;#34;summary&amp;#34;: &amp;#34;Host {{ $labels.host }} CPU is {{ $value }}%&amp;#34;,
      &amp;#34;runbook&amp;#34;: &amp;#34;https://wiki/runbooks/high-cpu&amp;#34;
    },
    &amp;#34;notify&amp;#34;: [&amp;#34;pagerduty-oncall&amp;#34;, &amp;#34;slack-infra&amp;#34;]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Pull model (Prometheus-style) vs push model: we use &lt;strong&gt;push&lt;/strong&gt; — agents push to collectors. Better for ephemeral/auto-scaled instances that may disappear before being scraped.&lt;/li&gt;
&lt;li&gt;PromQL-compatible query language — industry standard, rich aggregation functions&lt;/li&gt;
&lt;li&gt;Protobuf for ingestion wire format — 50% smaller than JSON, faster serialization&lt;/li&gt;
&lt;/ul&gt;
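&lt;p&gt;As a sanity check on the write API above, here is a minimal Python sketch that batches samples into the request body. It uses JSON for readability (a production agent would use the protobuf wire format noted above), and the field names simply mirror the example payload:&lt;/p&gt;

```python
import json

def build_write_payload(series):
    """Batch samples into the JSON shape accepted by POST /api/v1/write.

    `series` is a list of (labels_dict, [(timestamp, value), ...]) pairs.
    """
    return {
        "series": [
            {
                "labels": labels,
                "samples": [{"timestamp": ts, "value": v} for ts, v in samples],
            }
            for labels, samples in series
        ]
    }

payload = build_write_payload([
    ({"__name__": "cpu_usage", "host": "web-01"},
     [(1708632060, 73.5), (1708632070, 75.2)]),
    ({"__name__": "http_request_duration_seconds", "method": "GET"},
     [(1708632060, 0.042)]),
])

# The server acknowledges with the total sample count across all series
total = sum(len(s["samples"]) for s in payload["series"])
print(total)  # 3 (matches the samples_written example response)
body = json.dumps(payload)  # what actually goes over the wire
```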
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;time-series-identification&#34;&gt;Time Series Identification&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;A time series is uniquely identified by its label set:
  {__name__=&amp;#34;http_requests_total&amp;#34;, method=&amp;#34;GET&amp;#34;, path=&amp;#34;/api/users&amp;#34;, status=&amp;#34;200&amp;#34;, host=&amp;#34;web-01&amp;#34;}

Internal representation:
  series_id  = hash(sorted_labels) → uint64
  This is the primary key for all operations
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;storage-format-custom-time-series-optimized&#34;&gt;Storage Format (custom time-series optimized)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;On-disk structure (per series, per time block):
  ┌───────────────────────────────────┐
  │ Block (2-hour time window)        │
  │ ┌───────────────────────────────┐ │
  │ │ Series 1: [timestamps][values]│ │
  │ │ Series 2: [timestamps][values]│ │
  │ │ ...                           │ │
  │ └───────────────────────────────┘ │
  │ Index: series_id → offset         │
  │ Label index: label → series_ids   │
  │ Tombstones (deletions)            │
  └───────────────────────────────────┘

Compression (Gorilla encoding):
  Timestamps: delta-of-delta encoding
    - Most deltas are 0 (fixed 10s interval) → 1 bit per timestamp
    - Occasional jitter → few extra bits
    - Average: 1-2 bits per timestamp

  Values: XOR encoding
    - Consecutive values of the same metric are similar
    - XOR of consecutive values has many leading/trailing zeros
    - Average: 1-2 bytes per value

  Result: ~1.37 bytes per sample (vs 16 bytes uncompressed) → 12x compression
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;label-index-inverted-index&#34;&gt;Label Index (inverted index)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Label to series mapping (for query filtering):
  &amp;#34;service=api&amp;#34;   → [series_1, series_2, series_5, series_99, ...]
  &amp;#34;host=web-01&amp;#34;   → [series_1, series_3, series_7, ...]
  &amp;#34;method=GET&amp;#34;    → [series_1, series_2, series_3, ...]

Query: http_requests{service=&amp;#34;api&amp;#34;, method=&amp;#34;GET&amp;#34;}
  → Intersect posting lists: [series_1, series_2, series_5, ...] ∩ [series_1, series_2, series_3, ...]
  → Result: [series_1, series_2]

Stored as: sorted arrays with bitmap intersection (Roaring bitmaps for cardinality &amp;gt; 10K)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;downsampled-data&#34;&gt;Downsampled Data&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: metrics_5min (5-minute aggregates)
  series_id       | uint64
  timestamp       | int64 (5-min aligned)
  min             | float64
  max             | float64
  sum             | float64
  count           | uint32

Computed by background job: raw data → 5-min aggregates → 1-hour aggregates → 1-day aggregates
Each level retains min/max/sum/count so any aggregation can be reconstructed.
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-this-design&#34;&gt;Why This Design&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Custom time-series format (not SQL)&lt;/strong&gt; — general-purpose relational databases are orders of magnitude less efficient for this workload due to per-row overhead, index maintenance, and the lack of columnar compression.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gorilla compression&lt;/strong&gt; — developed by Facebook for their monitoring system. 12x compression means 10TB logical data fits in ~800GB on disk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inverted label index&lt;/strong&gt; — enables fast multi-label query filtering without scanning all series. Critical for high-cardinality environments.&lt;/li&gt;
&lt;/ul&gt;
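&lt;p&gt;The compression figures above can be made concrete with a simplified sketch of both Gorilla tricks: delta-of-delta timestamps and XOR-ed values. The bit-bucket widths here are illustrative, not the exact table from the original Gorilla scheme:&lt;/p&gt;

```python
import struct

def delta_of_delta_bits(timestamps):
    """Count bits for Gorilla-style timestamp compression (simplified).

    A delta-of-delta of 0 costs a single bit; the other bucket widths
    are illustrative, not the exact table from the Gorilla scheme.
    """
    bits = 64  # first timestamp stored raw
    prev, prev_delta = timestamps[0], None
    for ts in timestamps[1:]:
        delta = ts - prev
        dod = delta if prev_delta is None else delta - prev_delta
        if dod == 0:
            bits += 1          # single '0' bit: the common case
        elif -63 <= dod <= 64:
            bits += 9          # control bits + small signed value
        else:
            bits += 36         # rare large jump
        prev, prev_delta = ts, delta
    return bits

def xor_stream(values):
    """XOR consecutive float64 bit patterns; similar values give mostly-zero XORs."""
    as_bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [a ^ b for a, b in zip(as_bits, as_bits[1:])]

ts = [1708632000 + 10 * i for i in range(100)]  # fixed 10s scrape interval
print(delta_of_delta_bits(ts))   # 171 bits for 100 timestamps vs 6400 raw
print(xor_stream([73.5, 73.5]))  # [0]: identical values XOR to zero
```

&lt;p&gt;With a fixed 10-second interval every delta-of-delta after the first is zero, so 100 timestamps cost 171 bits (about 0.2 bytes each) instead of 800 bytes raw.&lt;/p&gt;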
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Data Sources (10,000 hosts)
  │
  │  Each host runs a metrics agent (collectd/telegraf/custom)
  │  Agent collects: CPU, memory, disk, network, app-specific metrics
  │  Agent pushes every 10 seconds
  │
  ▼
┌─────────────────────────────────────────────────────┐
│                   Ingestion Layer                   │
│                                                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │ Collector 1 │  │ Collector 2 │  │ Collector N │  │
│  │ (stateless) │  │ (stateless) │  │ (stateless) │  │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  │
│         │                │                │         │
│         └────────────────┼────────────────┘         │
│                          │                          │
│                          ▼                          │
│                 ┌──────────────────┐                │
│                 │  Kafka Cluster   │                │
│                 │  (metrics topic, │                │
│                 │   32 partitions) │                │
│                 └────────┬─────────┘                │
└──────────────────────────┼──────────────────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
   ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
   │ Storage Node │ │ Storage Node │ │ Storage Node │
   │ (Ingester)   │ │ (Ingester)   │ │ (Ingester)   │
   │              │ │              │ │              │
   │ In-memory    │ │ In-memory    │ │ In-memory    │
   │ head block   │ │ head block   │ │ head block   │
   │ (last 2 hrs) │ │ (last 2 hrs) │ │ (last 2 hrs) │
   │      │       │ │      │       │ │      │       │
   │      ▼       │ │      ▼       │ │      ▼       │
   │ Persistent   │ │ Persistent   │ │ Persistent   │
   │ blocks (SSD) │ │ blocks (SSD) │ │ blocks (SSD) │
   └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
          │                │                │
          └────────────────┼────────────────┘
                           │
                           ▼
                  ┌──────────────────┐
                  │   Query Layer    │
                  │   (Queriers,     │
                  │   fan-out to     │
                  │   storage nodes) │
                  └────────┬─────────┘
                           │
                 ┌─────────┴─────────┐
                 ▼                   ▼
         ┌───────────────┐    ┌──────────────┐
         │ Alert Manager │    │ Dashboard    │
         │ (evaluates    │    │ (Grafana,    │
         │  rules every  │    │  custom UI)  │
         │  15 seconds)  │    │              │
         └───────────────┘    └──────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ingestion-flow&#34;&gt;Ingestion Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Agent (on host)
  → Collect metrics every 10 seconds
  → Batch into protobuf payload (~500 metrics = ~16KB)
  → Push to any Collector instance (round-robin via load balancer)

Collector (stateless)
  → Validate payload (label format, value range, cardinality check)
  → Route to correct Kafka partition based on hash(series_id) % 32
  → Respond 200 to agent

Kafka → Storage Node (Ingester)
  → Consume from assigned partitions
  → Append samples to in-memory &amp;#34;head block&amp;#34; (current 2-hour window)
  → WAL for crash recovery
  → Every 2 hours: flush head block to disk as compressed block
  → Replication: 2 ingesters consume same partition (RF=2)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;query-flow&#34;&gt;Query Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Dashboard query: avg(cpu_usage{service=&amp;#34;api&amp;#34;}) by (host) [last 1 hour]

Querier:
  1. Parse PromQL expression
  2. Label matching: which series match {service=&amp;#34;api&amp;#34;}?
     → Query label index on storage nodes → get list of series_ids
  3. Time range: last 1 hour → need head block (in-memory) + possibly 1 on-disk block
  4. Fan-out: send sub-queries to storage nodes that hold these series
  5. Each storage node:
     → Decompress relevant samples
     → Apply rate/avg functions locally
     → Return partial result
  6. Querier merges partial results
  7. Return to dashboard
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Metrics Agents (10,000):&lt;/strong&gt; Lightweight daemons on each host. Collect system metrics (CPU, memory, disk, network) and application-specific metrics (HTTP latency, queue depth). Push to collectors. Buffered locally for 1 hour if collectors are unreachable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collectors (10, stateless):&lt;/strong&gt; Receive pushed metrics, validate, route to Kafka. No state — purely a fan-in and routing layer. Horizontal scaling trivial.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka:&lt;/strong&gt; Durability buffer between collectors and storage. Handles burst traffic, provides replay capability. 32 partitions, 3x replication, 24-hour retention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage Nodes / Ingesters (6-10):&lt;/strong&gt; Each owns a subset of time series (by hash). In-memory head block for recent data (fast queries), compressed on-disk blocks for historical data. 2TB SSD per node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Queriers (stateless):&lt;/strong&gt; Parse queries, fan-out to storage nodes, merge results. Auto-scaled based on dashboard load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert Manager:&lt;/strong&gt; Evaluates alert rules every 15 seconds by running PromQL queries. Deduplication, grouping, silencing, routing to notification channels (PagerDuty, Slack, email).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Downsampler (background job):&lt;/strong&gt; Computes 5-min, 1-hour, and 1-day rollups. Runs continuously, processing blocks as they are flushed to disk.&lt;/li&gt;
&lt;/ol&gt;
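&lt;p&gt;The hash-based routing and head-block append described in the ingestion flow can be sketched as follows. The hash function and class shape are illustrative assumptions, not the actual implementation:&lt;/p&gt;

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 32
BLOCK_SECONDS = 2 * 3600  # head block covers a 2-hour window

def series_id(labels):
    """Stable uint64 id from the sorted label set (hash choice is illustrative)."""
    canon = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return int.from_bytes(hashlib.sha256(canon.encode()).digest()[:8], "big")

def partition_for(labels):
    """Collectors route each series to hash(series_id) % 32, per the ingestion flow."""
    return series_id(labels) % NUM_PARTITIONS

class HeadBlock:
    """In-memory append buffer for the current 2-hour window (WAL omitted)."""
    def __init__(self, start_ts):
        self.start_ts = start_ts
        self.samples = defaultdict(list)  # series_id -> [(ts, value)]

    def append(self, labels, ts, value):
        if not (self.start_ts <= ts < self.start_ts + BLOCK_SECONDS):
            raise ValueError("sample belongs to a different block")
        self.samples[series_id(labels)].append((ts, value))

labels = {"__name__": "cpu_usage", "host": "web-01"}
now = 1708632060
head = HeadBlock(start_ts=now - (now % BLOCK_SECONDS))
head.append(labels, now, 73.5)
print(0 <= partition_for(labels) < NUM_PARTITIONS)  # True
```

&lt;p&gt;Because every collector computes the same hash, all samples for one series land on the same Kafka partition and therefore the same pair of replicated ingesters.&lt;/p&gt;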
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-time-series-compression-gorilla-encoding&#34;&gt;Deep Dive 1: Time-Series Compression (Gorilla Encoding)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; At 500K samples/sec, naive storage (16 bytes per sample: 8 bytes timestamp + 8 bytes float64 value) would require 691 GB/day. We need at least 10x compression to keep costs manageable.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Photo Sharing Service (Flickr/Google Photos)</title>
      <link>https://chiraghasija.cc/designs/photo-sharing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/photo-sharing/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload photos (up to 50MB each) with the system generating multiple thumbnail sizes and extracting EXIF metadata automatically&lt;/li&gt;
&lt;li&gt;Organize photos into albums, add tags, and manage sharing permissions (private, shared with specific users, public)&lt;/li&gt;
&lt;li&gt;Search photos by metadata (date, location, camera), user-applied tags, and ML-generated labels (objects, faces, scenes)&lt;/li&gt;
&lt;li&gt;Browse photo feeds (own photos, shared albums, explore/discover) with infinite scroll and fast thumbnail loading&lt;/li&gt;
&lt;li&gt;Share photos and albums via links with configurable permissions (view-only, download allowed, password-protected)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for viewing photos (reads). 99.9% for uploads (briefly queuing uploads is acceptable).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Thumbnail loading &amp;lt; 100ms. Full-resolution photo &amp;lt; 500ms. Upload acknowledgment &amp;lt; 2 seconds (processing happens async).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Photo metadata must be strongly consistent (if you upload and refresh, your photo must appear). Search index can lag by 30 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M users, 5M daily active, 50M photo uploads/day, 500M photos viewed/day, 10PB total stored photos.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines) for original photos. Losing a user&amp;rsquo;s photo is unacceptable.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Uploads:&lt;/strong&gt; 50M photos/day = ~580 uploads/sec (peak 2x = 1,160/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Photo views (thumbnails):&lt;/strong&gt; 500M/day = ~5,800/sec (peak 3x = 17,400/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full-resolution views:&lt;/strong&gt; 50M/day = ~580/sec&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Search queries:&lt;/strong&gt; 5M DAU × 2 searches/day = 10M/day = ~115 QPS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upload bandwidth:&lt;/strong&gt; 580/sec × 5MB avg = &lt;strong&gt;2.9 GB/sec inbound&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Photos (originals):&lt;/strong&gt; 50M/day × 5MB avg × 365 days = &lt;strong&gt;91PB/year&lt;/strong&gt; raw (after deduplication: ~50PB/year effective)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thumbnails (4 sizes per photo):&lt;/strong&gt; 50M/day × 4 × 50KB avg = &lt;strong&gt;10TB/day&lt;/strong&gt; = 3.6PB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata:&lt;/strong&gt; 50M/day × 2KB = 100GB/day = 36TB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total stored:&lt;/strong&gt; ~10PB currently (growing 50PB/year)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;cost-insight&#34;&gt;Cost Insight&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Storage dominates cost: 10PB in S3 = ~$230K/month&lt;/li&gt;
&lt;li&gt;CDN egress for thumbnails: 500M views × 50KB = 25TB/day = ~$37K/month&lt;/li&gt;
&lt;li&gt;ML processing (image labeling): 50M photos × $0.001/photo = $50K/month&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;write-heavy, storage-intensive system&lt;/strong&gt;. The upload pipeline (ingest, process, store, index) is the most complex component. The read path is CDN-dominated (thumbnails served from edge). The most interesting engineering challenge is the image processing pipeline and ML-powered search.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Price Alert System (Google Flights)</title>
      <link>https://chiraghasija.cc/designs/price-alert/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/price-alert/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can set price alerts for a specific route (origin, destination, date range, cabin class) with a target price or &amp;ldquo;notify on any drop&amp;rdquo;&lt;/li&gt;
&lt;li&gt;System continuously monitors flight prices by polling airline APIs / aggregators and detects meaningful price changes&lt;/li&gt;
&lt;li&gt;Send notifications (email, push) when a tracked route&amp;rsquo;s price drops below the user&amp;rsquo;s threshold or changes significantly&lt;/li&gt;
&lt;li&gt;Show price history and trends for a route (price graph over last 30-90 days)&lt;/li&gt;
&lt;li&gt;Support alert management: list, edit, pause, delete alerts; set expiration (auto-delete after travel date)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% for alert creation/management; price monitoring can tolerate brief outages (users won&amp;rsquo;t notice a 5-minute gap in polling)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Alert creation &amp;lt; 200ms. Notification delivery within 15 minutes of a price change (not real-time — flights don&amp;rsquo;t change by the second).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; A user should never miss a significant price drop. False positives (alerting on a stale price) are worse than a 15-minute delay.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M active alerts across 50M users. 500K unique routes being monitored. 10M price checks/hour.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Airline API calls are expensive (rate-limited, sometimes paid). Minimize redundant polling.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;alerts&#34;&gt;Alerts&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50M users, average 2 active alerts each = &lt;strong&gt;100M active alerts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Alert creation/deletion: ~1M/day (relatively low write volume)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;price-monitoring&#34;&gt;Price Monitoring&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K unique routes being monitored&lt;/li&gt;
&lt;li&gt;Each route checked every 30 minutes = 500K × 48 checks/day = &lt;strong&gt;24M price checks/day ≈ 280 checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Popular routes (JFK→LAX) shared by 100K+ alerts; long-tail routes shared by &amp;lt; 10 alerts&lt;/li&gt;
&lt;li&gt;Key optimization: check routes, not individual alerts. 500K route checks serve 100M alerts.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;notifications&#34;&gt;Notifications&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Assume 5% of checks find a meaningful price change = 1.2M price changes/day&lt;/li&gt;
&lt;li&gt;Each change triggers alerts for all subscribed users. Average 200 users per route change = &lt;strong&gt;240M notifications/day ≈ 2,800/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 3-5x average during fare sales = ~10K notifications/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Alert records: 100M × 500 bytes = 50GB&lt;/li&gt;
&lt;li&gt;Price history: 500K routes × 365 days × 48 data points × 50 bytes = &lt;strong&gt;~440GB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Route metadata/cache: 500K × 2KB = 1GB&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;alert-management&#34;&gt;Alert Management&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /alerts
  Body: {
    &amp;#34;origin&amp;#34;: &amp;#34;JFK&amp;#34;,
    &amp;#34;destination&amp;#34;: &amp;#34;LAX&amp;#34;,
    &amp;#34;departure_date_start&amp;#34;: &amp;#34;2026-06-01&amp;#34;,
    &amp;#34;departure_date_end&amp;#34;: &amp;#34;2026-06-07&amp;#34;,
    &amp;#34;return_date_start&amp;#34;: &amp;#34;2026-06-08&amp;#34;,
    &amp;#34;return_date_end&amp;#34;: &amp;#34;2026-06-14&amp;#34;,
    &amp;#34;cabin_class&amp;#34;: &amp;#34;economy&amp;#34;,
    &amp;#34;max_price&amp;#34;: 250,                    // null = notify on any significant drop
    &amp;#34;notification_channels&amp;#34;: [&amp;#34;email&amp;#34;, &amp;#34;push&amp;#34;],
    &amp;#34;passengers&amp;#34;: 1
  }
  Response 201: { &amp;#34;alert_id&amp;#34;: &amp;#34;alt_abc123&amp;#34;, &amp;#34;current_price&amp;#34;: 312, ... }

GET /alerts?user_id=u_123&amp;amp;status=active
  Response: [list of alert objects with current prices]

DELETE /alerts/{alert_id}

PATCH /alerts/{alert_id}
  Body: { &amp;#34;max_price&amp;#34;: 200, &amp;#34;paused&amp;#34;: false }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;price-data&#34;&gt;Price Data&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;GET /routes/{origin}/{destination}/prices
  Query: departure_start, departure_end, cabin_class, lookback_days=30
  Response: {
    &amp;#34;current_price&amp;#34;: 312,
    &amp;#34;price_history&amp;#34;: [
      {&amp;#34;date&amp;#34;: &amp;#34;2026-02-20&amp;#34;, &amp;#34;min_price&amp;#34;: 318, &amp;#34;median_price&amp;#34;: 345},
      {&amp;#34;date&amp;#34;: &amp;#34;2026-02-21&amp;#34;, &amp;#34;min_price&amp;#34;: 312, &amp;#34;median_price&amp;#34;: 340},
      ...
    ],
    &amp;#34;trend&amp;#34;: &amp;#34;declining&amp;#34;,
    &amp;#34;typical_range&amp;#34;: {&amp;#34;low&amp;#34;: 280, &amp;#34;high&amp;#34;: 420}
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Alerts are defined on route+date_range, not specific flights. The system finds the cheapest option matching the criteria.&lt;/li&gt;
&lt;li&gt;Date ranges (not exact dates) because most travelers have flexibility — this dramatically increases alert match rates.&lt;/li&gt;
&lt;/ul&gt;
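&lt;p&gt;Because alerts are defined on routes rather than individual flights, many alerts can share one monitored route. A hypothetical sketch of deriving the shared key, following the &amp;#34;JFK-LAX-2026-06-economy&amp;#34; format used in the data model (the month-bucketing rule is an assumption):&lt;/p&gt;

```python
def route_key(origin, destination, departure_start, cabin_class):
    """Derive the shared monitoring key from an alert.

    Assumption: keys are bucketed by departure month, so alerts with
    nearby date ranges collapse onto the same monitored route.
    """
    year_month = departure_start[:7]  # 'YYYY-MM' from an ISO date string
    return f"{origin}-{destination}-{year_month}-{cabin_class}"

print(route_key("JFK", "LAX", "2026-06-01", "economy"))  # JFK-LAX-2026-06-economy
```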
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;alerts-postgresql&#34;&gt;Alerts (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: alerts
  alert_id        (PK) | uuid
  user_id         (FK) | uuid (indexed)
  route_key             | varchar(50) (indexed) -- &amp;#34;JFK-LAX-2026-06-economy&amp;#34;
  origin                | char(3)
  destination           | char(3)
  departure_start       | date
  departure_end         | date
  return_start          | date (nullable)
  return_end            | date (nullable)
  cabin_class           | enum(&amp;#39;economy&amp;#39;, &amp;#39;premium_economy&amp;#39;, &amp;#39;business&amp;#39;, &amp;#39;first&amp;#39;)
  max_price             | decimal (nullable)
  passengers            | int
  status                | enum(&amp;#39;active&amp;#39;, &amp;#39;paused&amp;#39;, &amp;#39;expired&amp;#39;, &amp;#39;triggered&amp;#39;)
  notification_channels | jsonb
  created_at            | timestamp
  expires_at            | timestamp -- auto-set to departure_start
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;monitored-routes-postgresql--redis-cache&#34;&gt;Monitored Routes (PostgreSQL + Redis cache)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: monitored_routes
  route_key       (PK) | varchar(50)
  origin                | char(3)
  destination           | char(3)
  date_range            | daterange
  cabin_class           | varchar(20)
  alert_count           | int -- denormalized, number of active alerts
  last_checked_at       | timestamp
  last_price            | decimal
  check_interval_min    | int -- 15 for popular, 60 for long-tail
  next_check_at         | timestamp (indexed)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;price-history-timescaledb-or-clickhouse&#34;&gt;Price History (TimescaleDB or ClickHouse)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: price_history
  route_key             | varchar(50)
  checked_at            | timestamp
  min_price             | decimal
  median_price          | decimal
  source                | varchar(50) -- airline, aggregator name
  flight_options        | jsonb -- top 3-5 cheapest options with details
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-these-choices&#34;&gt;Why These Choices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; for alerts and routes: strong consistency, complex queries (find all alerts for a route), transactional status updates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TimescaleDB&lt;/strong&gt; for price history: time-series optimized, automatic partitioning by time, efficient range queries for price graphs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis&lt;/strong&gt; for route check scheduling: sorted set with next_check_at as score — O(log N) to get the next routes to check&lt;/li&gt;
&lt;/ul&gt;
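&lt;p&gt;A minimal in-process stand-in for the Redis-based scheduler (production would use ZADD and ZRANGEBYSCORE with next_check_at as the score; the heap here just models the same ordering):&lt;/p&gt;

```python
import heapq

class RouteScheduler:
    """Stand-in for the Redis sorted set of routes keyed by next_check_at."""
    def __init__(self):
        self._heap = []  # (next_check_at, route_key)

    def schedule(self, route_key, next_check_at):
        heapq.heappush(self._heap, (next_check_at, route_key))

    def due(self, now, limit=100):
        """Pop up to `limit` routes whose next_check_at <= now."""
        out = []
        while self._heap and self._heap[0][0] <= now and len(out) < limit:
            out.append(heapq.heappop(self._heap)[1])
        return out

sched = RouteScheduler()
sched.schedule("JFK-LAX-2026-06-economy", 100)  # popular route, due soon
sched.schedule("BOS-SEA-2026-07-economy", 500)  # long-tail route, due later
print(sched.due(now=200))  # ['JFK-LAX-2026-06-economy']
```

&lt;p&gt;After a worker finishes a check, it re-schedules the route at now + check_interval_min, which is how popular routes end up polled more often than long-tail ones.&lt;/p&gt;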
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;price-monitoring-pipeline&#34;&gt;Price Monitoring Pipeline&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Scheduler (Redis Sorted Set: next_check_at)
  → every second: ZRANGEBYSCORE routes 0 {now} LIMIT 0 100
  → dispatch route checks to worker pool

Price Check Workers (horizontally scaled)
  → For each route:
    1. Call airline APIs / aggregator APIs (Google Flights, Skyscanner, etc.)
    2. Parse response → extract min price for the route criteria
    3. Compare with last_price in DB
    4. If significant change detected:
       → Write new price to price_history
       → Update last_price in monitored_routes
       → Publish PriceChangeEvent to Kafka
    5. Update next_check_at in scheduler

Kafka: PriceChangeEvent
  → Alert Matcher Service
    → Query: SELECT * FROM alerts WHERE route_key = ? AND status = &amp;#39;active&amp;#39;
    → For each matching alert:
       → If new_price &amp;lt;= max_price (or significant drop for &amp;#34;any drop&amp;#34; alerts):
         → Enqueue notification job

Notification Service
  → Read from notification queue
  → Deduplicate (don&amp;#39;t send same user same alert twice in 1 hour)
  → Render email/push template with price details
  → Send via email (SES) / push (FCM/APNs)
  → Update alert status if appropriate (mark as triggered)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Route Scheduler:&lt;/strong&gt; Redis sorted set of routes ordered by next_check_at. A cron-like process pops due routes and dispatches them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Price Check Workers:&lt;/strong&gt; Stateless workers that call external APIs. Auto-scale based on queue depth. Handle rate limits, retries, API key rotation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Route Consolidator:&lt;/strong&gt; Background job that merges overlapping date ranges across alerts into a minimal set of monitored routes. Runs on alert creation/deletion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert Matcher:&lt;/strong&gt; On price change, finds all alerts affected. Uses indexed route_key lookup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Service:&lt;/strong&gt; Manages delivery across channels. Handles batching (don&amp;rsquo;t send 10 alerts at once — batch into a digest), deduplication, and delivery tracking.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Price History Service:&lt;/strong&gt; Serves price graphs and trend analysis. Queries TimescaleDB with pre-computed daily aggregates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API Gateway:&lt;/strong&gt; Handles alert CRUD, price queries, user authentication.&lt;/li&gt;
&lt;/ol&gt;
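&lt;p&gt;The Route Consolidator above is essentially an interval merge per (origin, destination, cabin) group. A sketch under that assumption:&lt;/p&gt;

```python
from collections import defaultdict

def consolidate(alerts):
    """Merge overlapping (origin, destination, cabin) date ranges into routes.

    Each alert is (origin, destination, cabin, start, end) with ISO date
    strings, which compare correctly as strings. Classic interval merge.
    """
    groups = defaultdict(list)
    for origin, dest, cabin, start, end in alerts:
        groups[(origin, dest, cabin)].append((start, end))
    routes = []
    for (origin, dest, cabin), spans in groups.items():
        spans.sort()
        merged = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= merged[-1][1]:  # overlaps the previous span: extend it
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        routes += [(origin, dest, cabin, s, e) for s, e in merged]
    return routes

alerts = [
    ("JFK", "LAX", "economy",  "2026-06-01", "2026-06-07"),
    ("JFK", "LAX", "economy",  "2026-06-03", "2026-06-10"),
    ("JFK", "LAX", "business", "2026-06-01", "2026-06-07"),
]
routes = consolidate(alerts)
print(len(routes))  # 2: economy merged to Jun 1-10, business kept separate
```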
&lt;h3 id=&#34;route-consolidation-example&#34;&gt;Route Consolidation Example&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Alert 1: JFK→LAX, Jun 1-7, economy
Alert 2: JFK→LAX, Jun 3-10, economy
Alert 3: JFK→LAX, Jun 1-7, business

Monitored routes:
  Route A: JFK→LAX, Jun 1-10, economy (covers alerts 1 + 2)
  Route B: JFK→LAX, Jun 1-7, business (covers alert 3)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This reduces 100M alerts to ~500K monitored routes — a 200x reduction in API calls.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Real-Time Live Likes/Reactions System</title>
      <link>https://chiraghasija.cc/designs/live-likes/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/live-likes/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can send reactions (like, love, wow, haha, etc.) on a live stream or post, and all viewers see animated reactions in real-time&lt;/li&gt;
&lt;li&gt;Display an aggregated reaction count that updates live (e.g., &amp;ldquo;12.3K likes&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Each user can react multiple times (unlike a static like button — this is a live engagement feature, like Facebook Live or Instagram Live hearts)&lt;/li&gt;
&lt;li&gt;Reaction animations appear as floating icons on the viewer&amp;rsquo;s screen, reflecting the volume and type of reactions across all viewers&lt;/li&gt;
&lt;li&gt;Provide reaction rate metrics (reactions per second) for streamer dashboards and analytics&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — reactions are engagement features, not mission-critical. Brief delays are acceptable; total loss of reactions degrades the experience but does not break the product.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Reactions should appear on other viewers&amp;rsquo; screens within 1-2 seconds of being sent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is fine. Counts can be approximate. Showing &amp;ldquo;12.3K&amp;rdquo; when the true count is 12,347 is perfectly acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100K concurrent live streams. Top streams: 500K concurrent viewers, up to 50,000 reactions/sec per stream. Global: 5M reactions/sec across all streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Aggregate counts must be durable (total likes on a stream). Individual reaction events do not need permanent storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-traffic&#34;&gt;Write Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;5M reactions/sec globally&lt;/li&gt;
&lt;li&gt;Each reaction event: ~80 bytes (stream_id, user_id, reaction_type, timestamp)&lt;/li&gt;
&lt;li&gt;Inbound data rate: 5M × 80 bytes = &lt;strong&gt;400 MB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;fan-out&#34;&gt;Fan-Out&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each reaction is not individually broadcast — instead, reactions are aggregated into batches&lt;/li&gt;
&lt;li&gt;Every 500ms, each stream produces a batch summary: { &amp;ldquo;like&amp;rdquo;: 142, &amp;ldquo;love&amp;rdquo;: 37, &amp;ldquo;wow&amp;rdquo;: 12 }&lt;/li&gt;
&lt;li&gt;Batch per stream: ~200 bytes&lt;/li&gt;
&lt;li&gt;100K streams × 200 bytes × 2 batches/sec = &lt;strong&gt;40 MB/sec&lt;/strong&gt; fan-out from aggregation layer&lt;/li&gt;
&lt;li&gt;Each batch is pushed to all viewers of that stream&lt;/li&gt;
&lt;li&gt;Top stream: 500K viewers × 200 bytes × 2/sec = &lt;strong&gt;200 MB/sec&lt;/strong&gt; — manageable with edge fan-out&lt;/li&gt;
&lt;/ul&gt;
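&lt;p&gt;The 500ms batching that makes this fan-out tractable can be sketched as a per-stream counter that is flushed once per window (the class shape is illustrative):&lt;/p&gt;

```python
from collections import Counter, defaultdict

class ReactionAggregator:
    """Accumulate raw reactions and emit one batch summary per stream
    per flush window (500 ms in the design above)."""
    def __init__(self):
        self.pending = defaultdict(Counter)  # stream_id -> Counter of types

    def record(self, stream_id, reaction_type):
        self.pending[stream_id][reaction_type] += 1

    def flush(self):
        """Return {stream_id: {type: count}} and reset. These batches, not
        the individual events, are what gets pushed to viewers."""
        batches = {s: dict(c) for s, c in self.pending.items()}
        self.pending.clear()
        return batches

agg = ReactionAggregator()
for _ in range(142):
    agg.record("stream_abc", "like")
for _ in range(37):
    agg.record("stream_abc", "love")
print(agg.flush())  # {'stream_abc': {'like': 142, 'love': 37}}
```

&lt;p&gt;Whether 1 or 50,000 reactions arrive in a window, each viewer receives exactly one ~200-byte message per flush, which is what keeps the top-stream fan-out at a manageable 200 MB/sec.&lt;/p&gt;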
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Aggregate counts per stream: 100K streams × 6 reaction types × 8 bytes = &lt;strong&gt;~5 MB&lt;/strong&gt; in Redis&lt;/li&gt;
&lt;li&gt;Reaction event log (for analytics, retained 7 days): 5M/sec × 80 bytes × 86400 sec × 7 days = &lt;strong&gt;~240 TB/week&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Store sampled (1% sample = 2.4 TB/week) in a data lake for trend analysis, not the full firehose&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
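The 1% sample can be taken deterministically, so the decision needs no coordination and a given user's reactions are consistently in or out of the sample. A minimal sketch (hash-based sampling is an assumption here, not something the design above specifies):

```python
import hashlib

def in_sample(user_id: str) -> bool:
    """Deterministically keep ~1% of events, keyed by user_id, so the
    same user's reactions always land in (or out of) the sample."""
    digest = hashlib.md5(user_id.encode()).digest()
    # Interpret the first 8 bytes as an integer; one bucket out of 100.
    return int.from_bytes(digest[:8], "big") % 100 == 0

# Roughly 1% of a large population falls in the sample:
kept = sum(in_sample(f"user_{i}") for i in range(100_000))
```

Because the decision is a pure function of the event, any sampler instance makes the same choice, so the sampling tier can be scaled out statelessly.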
&lt;h3 id=&#34;count-storage&#34;&gt;Count Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Final aggregate counts per stream (permanent): negligible — one row per stream in the DB&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-endpoints&#34;&gt;REST Endpoints&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Send a reaction
POST /v1/streams/{stream_id}/reactions
  Headers: Authorization: Bearer &amp;lt;token&amp;gt;
  Body: {
    &amp;#34;type&amp;#34;: &amp;#34;like&amp;#34;                     // like, love, wow, haha, sad, angry
  }
  Response 202: { &amp;#34;status&amp;#34;: &amp;#34;accepted&amp;#34; }
  // 202 Accepted — fire-and-forget, no guarantee of individual delivery

// Get current reaction counts
GET /v1/streams/{stream_id}/reactions/counts
  Response 200: {
    &amp;#34;stream_id&amp;#34;: &amp;#34;stream_abc&amp;#34;,
    &amp;#34;counts&amp;#34;: {
      &amp;#34;like&amp;#34;: 1234567,
      &amp;#34;love&amp;#34;: 234567,
      &amp;#34;wow&amp;#34;: 45678,
      &amp;#34;haha&amp;#34;: 12345,
      &amp;#34;sad&amp;#34;: 1234,
      &amp;#34;angry&amp;#34;: 567
    },
    &amp;#34;rate&amp;#34;: {
      &amp;#34;total_per_second&amp;#34;: 4521,
      &amp;#34;by_type&amp;#34;: { &amp;#34;like&amp;#34;: 2100, &amp;#34;love&amp;#34;: 890, ... }
    }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;websocket-protocol&#34;&gt;WebSocket Protocol&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Client → Server: send a reaction (alternative to REST, lower overhead)
{ &amp;#34;type&amp;#34;: &amp;#34;reaction&amp;#34;, &amp;#34;reaction&amp;#34;: &amp;#34;like&amp;#34; }

// Server → Client: batched reaction update (every 500ms)
{
  &amp;#34;type&amp;#34;: &amp;#34;reaction_batch&amp;#34;,
  &amp;#34;window_ms&amp;#34;: 500,
  &amp;#34;counts&amp;#34;: {
    &amp;#34;like&amp;#34;: 142,
    &amp;#34;love&amp;#34;: 37,
    &amp;#34;wow&amp;#34;: 12,
    &amp;#34;haha&amp;#34;: 8,
    &amp;#34;sad&amp;#34;: 2,
    &amp;#34;angry&amp;#34;: 1
  },
  &amp;#34;total&amp;#34;: {
    &amp;#34;like&amp;#34;: 1234567,
    &amp;#34;love&amp;#34;: 234568,
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fire-and-forget writes:&lt;/strong&gt; Reactions return 202, not 201. We do not guarantee every reaction is counted. Losing 0.1% of reactions under extreme load is acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batched delivery:&lt;/strong&gt; Individual reactions are never pushed to clients. Instead, 500ms window summaries are pushed. For a top stream receiving 50K reactions/sec, this replaces 50K individual pushes with 2 batch messages per second, a reduction of four orders of magnitude.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Approximate counts in totals:&lt;/strong&gt; The total count uses eventual consistency. The per-window count (used for animations) is more important for UX.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;in-flight-reaction-aggregation-redis&#34;&gt;In-Flight Reaction Aggregation (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Per-window reaction counts (current 500ms window)
Key: stream:{stream_id}:reactions:current
Type: Hash
Fields: like → 142, love → 37, wow → 12, ...
TTL: 5 seconds (auto-cleanup)

// Running total counts
Key: stream:{stream_id}:reactions:total
Type: Hash
Fields: like → 1234567, love → 234567, ...
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;permanent-counts-postgresql--updated-periodically&#34;&gt;Permanent Counts (PostgreSQL — updated periodically)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: stream_reaction_counts
  stream_id      (PK) | varchar(20)
  like_count          | bigint
  love_count          | bigint
  wow_count           | bigint
  haha_count          | bigint
  sad_count           | bigint
  angry_count         | bigint
  updated_at          | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;rate-limit-state-redis&#34;&gt;Rate Limit State (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Per-user reaction rate limit (max 10 reactions per second per user)
Key: reaction_rl:{stream_id}:{user_id}
Type: String (counter)
TTL: 1 second
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-this-split&#34;&gt;Why This Split?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Redis for real-time:&lt;/strong&gt; All hot-path data lives in Redis. Hash HINCRBY is O(1) and handles millions of increments per second. The aggregation window resets every 500ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL for permanence:&lt;/strong&gt; The final count after a stream ends is checkpointed to PostgreSQL. This happens once per stream, not millions of times.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No per-event storage:&lt;/strong&gt; We do NOT store individual reaction events in a database. 5M events/sec is too expensive to persist individually. Instead, we aggregate in Redis and periodically flush summaries.&lt;/li&gt;
&lt;/ul&gt;
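The write-path split above can be sketched end to end. Plain dicts stand in for the Redis hashes and the 1-second rate-limit counter (HINCRBY / INCR with TTL); this is illustrative, not a production client:

```python
import time
from collections import defaultdict

class ReactionIngest:
    """In-memory stand-in for the Redis write path: rate-limit check,
    then increment the current-window and running-total hashes."""
    MAX_PER_SEC = 10  # per-user reaction rate limit

    def __init__(self):
        self.current = defaultdict(lambda: defaultdict(int))  # stream -> type -> count
        self.total = defaultdict(lambda: defaultdict(int))
        self.rl = defaultdict(int)  # (stream, user, second-bucket) -> count

    def react(self, stream_id, user_id, rtype, now=None):
        now = int(now if now is not None else time.time())
        key = (stream_id, user_id, now)        # 1-second bucket, like a TTL=1s counter
        self.rl[key] += 1
        if self.rl[key] > self.MAX_PER_SEC:
            return False                       # silently drop excess, no error
        self.current[stream_id][rtype] += 1    # HINCRBY ...:current
        self.total[stream_id][rtype] += 1      # HINCRBY ...:total
        return True

ingest = ReactionIngest()
accepted = sum(ingest.react("s1", "u1", "like", now=0) for _ in range(15))
# Only 10 of the 15 taps in the same second are counted.
```

The rate limiter and the aggregation are both O(1) per reaction, which is what keeps the hot path cheap at millions of events per second.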
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;reaction-write-path&#34;&gt;Reaction Write Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client taps &amp;#34;like&amp;#34; button
  → WebSocket Connection Server
    → 1. Rate limit check:
         INCR reaction_rl:{stream_id}:{user_id}
         If &amp;gt; 10 → silently drop (don&amp;#39;t error, just ignore excess)
    → 2. Forward to Reaction Aggregator:
         HINCRBY stream:{stream_id}:reactions:current like 1
         HINCRBY stream:{stream_id}:reactions:total like 1
    → 3. Done. No ACK needed to client.
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;reaction-broadcast-path&#34;&gt;Reaction Broadcast Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Reaction Broadcaster Service (per-stream timer):
  Every 500ms per active stream:
    → 1. HGETALL stream:{stream_id}:reactions:current
         Result: { &amp;#34;like&amp;#34;: 142, &amp;#34;love&amp;#34;: 37, &amp;#34;wow&amp;#34;: 12 }
    → 2. DEL stream:{stream_id}:reactions:current (reset window)
         (HGETALL + DEL is not atomic on its own; use a Lua script, or
          RENAME the key and read the copy. GETDEL applies only to strings, not hashes.)
    → 3. If total reactions in window &amp;gt; 0:
         Publish batch to fan-out layer:
         PUBLISH stream_reactions:{stream_id} {batch_json}
    → 4. If total reactions = 0 → skip (no broadcast, save bandwidth)

Fan-Out to Viewers:
  WebSocket Servers subscribe to Redis Pub/Sub: stream_reactions:{stream_id}
    → On batch received:
      → Push batch to all local WebSocket connections for that stream
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;WebSocket Connection Servers (100+ instances):&lt;/strong&gt; Handle client connections. Receive reactions via WebSocket. Forward to Redis for aggregation. Push batched updates to clients.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster:&lt;/strong&gt; Reaction aggregation (HINCRBY). Rate limiting. Pub/Sub for batch distribution. Stores running totals.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reaction Broadcaster Service:&lt;/strong&gt; Timer-based service, one logical timer per active stream. Reads current window, resets counter, publishes batch. Horizontally scaled — each instance owns a shard of streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Count Checkpoint Service:&lt;/strong&gt; Periodically (every 60 seconds) writes Redis totals to PostgreSQL. On stream end, performs a final checkpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics Sampler:&lt;/strong&gt; Taps into the reaction stream at 1% sample rate. Writes to Kafka for downstream analytics (trend detection, engagement scoring).&lt;/li&gt;
&lt;/ol&gt;
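The broadcaster's read-and-reset step, with empty windows skipped, might look like the following. A dict stands in for Redis; in real Redis the read + delete pair would be a Lua script so no increments are lost between the two commands:

```python
def drain_window(windows: dict, stream_id: str):
    """Atomically take and reset the current 500ms window for a stream.
    Returns None when there were no reactions (skip the broadcast)."""
    counts = windows.pop(stream_id, None)  # read + delete in one step
    if not counts or sum(counts.values()) == 0:
        return None                        # nothing to publish, save bandwidth
    return {"type": "reaction_batch", "window_ms": 500, "counts": counts}

windows = {"s1": {"like": 142, "love": 37, "wow": 12}, "s2": {}}
batch = drain_window(windows, "s1")   # batch with 191 reactions across 3 types
empty = drain_window(windows, "s2")   # None: no broadcast for a quiet stream
```

Skipping zero windows matters at this scale: most of the 100K streams are quiet most of the time, so the broadcaster only publishes for streams with activity.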
&lt;h3 id=&#34;architecture-diagram&#34;&gt;Architecture Diagram&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Clients (viewers)
  ↕ WebSocket
WebSocket Servers (100+)
  → HINCRBY → Redis Cluster
               ├── Current window counters
               ├── Running totals
               └── Pub/Sub channels

Reaction Broadcaster (sharded by stream)
  → Every 500ms: read + reset current window
  → PUBLISH batch to Redis Pub/Sub
  → WebSocket Servers receive and push to clients

Count Checkpoint Service
  → Every 60s: Redis totals → PostgreSQL

Analytics Sampler → Kafka → Data Warehouse
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;animation-rendering-client-side&#34;&gt;Animation Rendering (Client-Side)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;On receiving batch { &amp;#34;like&amp;#34;: 142, &amp;#34;love&amp;#34;: 37, &amp;#34;wow&amp;#34;: 12 }:
  1. Total reactions in window: 191
  2. Scale animation intensity:
     &amp;lt; 10 reactions → sparse floating icons
     10-100 → moderate stream
     100-1000 → dense stream with size variation
     &amp;gt; 1000 → &amp;#34;burst&amp;#34; mode with explosion effect
  3. For each reaction type, spawn proportional number of animated icons:
     like: 142/191 = 74% → 74% of animation slots are hearts
     love: 37/191 = 19% → 19% are love icons
  4. Randomize: position (x-axis), float speed, size, opacity
  5. Animate using requestAnimationFrame, recycle DOM nodes from a pool
  6. Cap at 60 visible animations simultaneously (performance)
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-high-write-throughput--handling-50k-reactionssec-per-stream&#34;&gt;Deep Dive 1: High Write Throughput — Handling 50K Reactions/Sec Per Stream&lt;/h3&gt;
&lt;p&gt;A top stream receives 50,000 reactions per second. Naively, each reaction is a separate Redis HINCRBY command. At 50K/sec for one stream, this is manageable for Redis (it handles 100K+ ops/sec). But across 100K streams with 5M total reactions/sec, we need to be smarter.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Real-Time Page Viewer Count System</title>
      <link>https://chiraghasija.cc/designs/page-viewers/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/page-viewers/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Display a real-time count of users currently viewing a specific page (e.g., &amp;ldquo;47 people viewing this page&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Count updates within 5 seconds of a viewer joining or leaving&lt;/li&gt;
&lt;li&gt;Track unique viewers — refreshing the page or opening multiple tabs from the same user should count as 1 viewer&lt;/li&gt;
&lt;li&gt;Support any page on the platform (product pages, articles, dashboards) — millions of distinct page IDs&lt;/li&gt;
&lt;li&gt;Provide an API to query current viewer count for any page (for analytics dashboards)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — viewer counts are informational, not business-critical. Showing a stale count briefly is acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Count updates pushed to viewers within 5 seconds. API queries return in &amp;lt; 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Approximate counts are fine. Off by 2-3 viewers is acceptable. Off by 50% is not.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M concurrent users across 5M distinct pages. Average page: 2 viewers. Hot pages (trending product, breaking news): 500K+ viewers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Viewer counts are ephemeral — no need to persist historical real-time counts. However, log peak counts for analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;connections&#34;&gt;Connections&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M concurrent WebSocket connections&lt;/li&gt;
&lt;li&gt;Each connection: ~10 KB memory → &lt;strong&gt;100 GB&lt;/strong&gt; total connection memory&lt;/li&gt;
&lt;li&gt;At 200K connections per server → &lt;strong&gt;50 WebSocket servers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;heartbeat-traffic&#34;&gt;Heartbeat Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each client sends a heartbeat every 30 seconds&lt;/li&gt;
&lt;li&gt;10M clients / 30 sec = &lt;strong&gt;333K heartbeats/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each heartbeat: ~100 bytes → 33 MB/sec inbound&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;count-update-traffic&#34;&gt;Count Update Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;When a viewer joins/leaves, update the count for that page&lt;/li&gt;
&lt;li&gt;Churn rate: ~5% of viewers change pages per minute → 500K join/leave events per minute → &lt;strong&gt;8,300 events/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each event triggers a count update pushed to all viewers of that page&lt;/li&gt;
&lt;li&gt;Average page: 2 viewers, hot pages: thousands&lt;/li&gt;
&lt;li&gt;Estimated push volume: ~50K count update messages/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active page viewer sets: 5M pages × ~100 bytes per entry × avg 2 viewers = &lt;strong&gt;1 GB&lt;/strong&gt; in Redis&lt;/li&gt;
&lt;li&gt;Hot pages with 500K viewers: a single sorted set with 500K members = ~50 MB. Manageable.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-endpoints&#34;&gt;REST Endpoints&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Get current viewer count for a page
GET /v1/pages/{page_id}/viewers/count
  Response 200: {
    &amp;#34;page_id&amp;#34;: &amp;#34;product_12345&amp;#34;,
    &amp;#34;viewer_count&amp;#34;: 47,
    &amp;#34;updated_at&amp;#34;: &amp;#34;2026-02-22T10:00:05Z&amp;#34;
  }

// Get viewer counts for multiple pages (batch)
POST /v1/pages/viewers/count
  Body: { &amp;#34;page_ids&amp;#34;: [&amp;#34;product_12345&amp;#34;, &amp;#34;article_678&amp;#34;, ...] }
  Response 200: {
    &amp;#34;counts&amp;#34;: {
      &amp;#34;product_12345&amp;#34;: 47,
      &amp;#34;article_678&amp;#34;: 1203
    }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;websocket-protocol&#34;&gt;WebSocket Protocol&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Client connects and subscribes to a page
WS /v1/pages/{page_id}/viewers

// Client → Server: heartbeat (every 30 seconds)
{ &amp;#34;type&amp;#34;: &amp;#34;heartbeat&amp;#34; }

// Server → Client: viewer count update
{ &amp;#34;type&amp;#34;: &amp;#34;viewer_count&amp;#34;, &amp;#34;count&amp;#34;: 48 }

// Client navigates to a different page
// → Close old WebSocket, open new one (or send subscribe/unsubscribe messages)
{ &amp;#34;type&amp;#34;: &amp;#34;subscribe&amp;#34;, &amp;#34;page_id&amp;#34;: &amp;#34;new_page_456&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;unsubscribe&amp;#34;, &amp;#34;page_id&amp;#34;: &amp;#34;product_12345&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use WebSocket for real-time push rather than polling (polling 10M clients every 5 seconds = 2M req/sec just for counts)&lt;/li&gt;
&lt;li&gt;Multiplexed WebSocket: single connection per client, subscribe/unsubscribe to pages as they navigate. Avoids reconnection overhead.&lt;/li&gt;
&lt;li&gt;Heartbeat is mandatory. If no heartbeat received in 90 seconds (3 missed), the server considers the viewer gone.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;active-viewer-set-redis&#34;&gt;Active Viewer Set (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Set of active viewers per page
Key: page:{page_id}:viewers
Type: Sorted Set
Member: viewer_id (user_id or session_id for anonymous users)
Score: last_heartbeat_timestamp

// Current count (cached, updated on join/leave)
Key: page:{page_id}:count
Type: String (integer)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;connection-registry-redis&#34;&gt;Connection Registry (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Which pages a connection is viewing
Key: conn:{connection_id}
Type: Hash
Fields:
  viewer_id      | varchar
  page_id        | varchar
  server_id      | varchar
  connected_at   | timestamp
  last_heartbeat | timestamp
TTL: 120 seconds (auto-cleanup if server crashes)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;unique-viewer-tracking-redis-hyperloglog&#34;&gt;Unique Viewer Tracking (Redis HyperLogLog)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Approximate unique viewers (for analytics, not real-time count)
Key: page:{page_id}:unique_viewers:{date}
Type: HyperLogLog
Operation: PFADD on each new viewer
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-redis&#34;&gt;Why Redis?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;All data is ephemeral (viewer presence is transient)&lt;/li&gt;
&lt;li&gt;Sub-millisecond reads/writes for count lookups and heartbeat updates&lt;/li&gt;
&lt;li&gt;Sorted sets enable efficient expiry scanning (remove viewers with old heartbeats)&lt;/li&gt;
&lt;li&gt;Pub/Sub for cross-server count change notifications&lt;/li&gt;
&lt;li&gt;HyperLogLog for memory-efficient unique counting (12 KB per counter regardless of cardinality)&lt;/li&gt;
&lt;/ul&gt;
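The presence pattern (dedupe on join, heartbeat refresh, stale reaping) can be modeled with a per-page map of viewer to last-heartbeat timestamp, standing in for the sorted set (ZADD score = timestamp, ZRANGEBYSCORE for reaping). A minimal sketch:

```python
STALE_AFTER = 90  # seconds without a heartbeat (3 missed intervals)

class PagePresence:
    def __init__(self):
        self.viewers = {}  # viewer_id -> last_heartbeat (the ZADD score)

    def join_or_heartbeat(self, viewer_id, now):
        """Returns True only for a genuinely new viewer (count changed).
        A refresh or a second tab just refreshes the timestamp."""
        is_new = viewer_id not in self.viewers  # the ZSCORE dedup check
        self.viewers[viewer_id] = now           # ZADD with score = now
        return is_new

    def reap(self, now):
        """Drop viewers whose heartbeat is older than 90s (ZRANGEBYSCORE)."""
        stale = [v for v, ts in self.viewers.items() if ts < now - STALE_AFTER]
        for v in stale:
            del self.viewers[v]
        return len(stale)

    @property
    def count(self):
        return len(self.viewers)

page = PagePresence()
page.join_or_heartbeat("alice", now=0)
page.join_or_heartbeat("alice", now=30)  # second tab: count stays at 1
page.join_or_heartbeat("bob", now=0)
page.reap(now=120)                       # bob missed 3 heartbeats, removed
```

Keying by viewer_id rather than connection_id is what makes multiple tabs from the same user count as one viewer.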
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;viewer-join-flow&#34;&gt;Viewer Join Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client opens page
  → WebSocket Connection Server
    → 1. Authenticate (extract user_id or generate session_id)
    → 2. Deduplicate: Check if this viewer_id already exists in page&amp;#39;s viewer set
         ZSCORE page:{page_id}:viewers {viewer_id}
         If exists → update heartbeat timestamp, do NOT increment count
         If new → ZADD page:{page_id}:viewers {now} {viewer_id}
                   INCR page:{page_id}:count
    → 3. Register connection: HSET conn:{connection_id} ...
    → 4. Publish count change: PUBLISH page_count:{page_id} {new_count}
    → 5. Return current count to client
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;heartbeat-flow&#34;&gt;Heartbeat Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client sends heartbeat every 30 seconds
  → WS Server receives heartbeat
    → ZADD page:{page_id}:viewers {now} {viewer_id}  (update score = timestamp)
    → EXPIRE conn:{connection_id} 120                  (refresh TTL)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;viewer-leave-flow&#34;&gt;Viewer Leave Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Case 1: Client navigates away (graceful close)
  → Client sends unsubscribe or closes WebSocket
  → WS Server:
    → ZREM page:{page_id}:viewers {viewer_id}
    → DECR page:{page_id}:count
    → PUBLISH page_count:{page_id} {new_count}
    → DEL conn:{connection_id}

Case 2: Client crashes / loses network (ungraceful)
  → No more heartbeats received
  → Heartbeat Reaper (background job):
    → Runs every 30 seconds
    → ZRANGEBYSCORE page:{page_id}:viewers -inf (now - 90)
    → Remove stale entries, decrement count
    → Publish updated count
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;count-broadcast-flow&#34;&gt;Count Broadcast Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Redis Pub/Sub channel: page_count:{page_id}
  → When count changes, PUBLISH to this channel
  → Each WS Server subscribes to channels for pages with local viewers
  → On receiving count update:
    → Push { &amp;#34;type&amp;#34;: &amp;#34;viewer_count&amp;#34;, &amp;#34;count&amp;#34;: N } to all local WebSocket
      connections viewing that page
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;WebSocket Connection Servers (50 servers):&lt;/strong&gt; Maintain persistent connections. Handle heartbeats, join/leave events. Subscribe to Redis Pub/Sub for count updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster (3 shards, replicated):&lt;/strong&gt; Stores viewer sets, counts, connection registry. Provides Pub/Sub for cross-server notifications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heartbeat Reaper Service:&lt;/strong&gt; Background job scanning for stale viewers. Runs on a schedule (every 30 seconds). Scans sorted sets for expired entries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Count Query API:&lt;/strong&gt; Stateless HTTP service for non-real-time queries. Reads directly from Redis. Used by analytics dashboards.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics Pipeline:&lt;/strong&gt; Kafka consumer logs peak counts, unique viewer counts (HyperLogLog reads) to a data warehouse for historical analysis.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;architecture-diagram&#34;&gt;Architecture Diagram&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Clients (10M)
  → Load Balancer (sticky by viewer_id hash)
    → WebSocket Servers (50 instances)
      → Redis Cluster
        ├── Viewer Sets (Sorted Sets)
        ├── Counts (Strings)
        ├── Connection Registry (Hashes)
        └── Pub/Sub (count change notifications)

Heartbeat Reaper (3 instances, leader-elected)
  → Scans Redis sorted sets for stale entries
  → Decrements counts and publishes updates

Analytics Pipeline
  → Periodically snapshots counts to Kafka → Data Warehouse
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-presence-detection--heartbeat-vs-websocket-state&#34;&gt;Deep Dive 1: Presence Detection — Heartbeat vs. WebSocket State&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Option A: Rely purely on WebSocket connection state&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Real-Time Stock Price System</title>
      <link>https://chiraghasija.cc/designs/stock-prices/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/stock-prices/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Ingest real-time market data feeds from multiple exchanges (NYSE, NASDAQ, etc.) and normalize into a unified price stream&lt;/li&gt;
&lt;li&gt;Fan out live price updates to millions of subscribers via WebSocket with sub-second latency&lt;/li&gt;
&lt;li&gt;Maintain and serve order book data (top-of-book and depth) for each symbol&lt;/li&gt;
&lt;li&gt;Compute and serve the National Best Bid and Offer (NBBO) by aggregating prices across exchanges&lt;/li&gt;
&lt;li&gt;Store and serve historical price data (OHLCV — Open, High, Low, Close, Volume) at multiple time resolutions and support price-change alerting&lt;/li&gt;
&lt;/ol&gt;
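For requirement 4, the NBBO aggregation itself is straightforward: the best bid is the highest bid and the best offer the lowest ask across venues. A minimal sketch with made-up quotes:

```python
def compute_nbbo(quotes):
    """quotes: list of (exchange, bid, ask) tuples for one symbol.
    Returns (best_bid, bid_venue, best_ask, ask_venue)."""
    best_bid = max(quotes, key=lambda q: q[1])  # highest bid wins
    best_ask = min(quotes, key=lambda q: q[2])  # lowest ask wins
    return best_bid[1], best_bid[0], best_ask[2], best_ask[0]

quotes = [
    ("NYSE",   189.97, 190.02),
    ("NASDAQ", 189.98, 190.03),
    ("ARCA",   189.96, 190.01),
]
bid, bid_venue, ask, ask_venue = compute_nbbo(quotes)
# NBBO: 189.98 (NASDAQ) x 190.01 (ARCA)
```

The hard part in practice is not this max/min but keeping per-exchange top-of-book state fresh within the 50ms accuracy budget, which is why the computation lives in-memory next to the feed handlers.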
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% during market hours (9:30 AM - 4:00 PM ET). Pre-market and after-hours: 99.9%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Tick-to-display &amp;lt; 100ms for retail users. Tick-to-internal-processing &amp;lt; 10ms. (Not targeting HFT microsecond latency — this is for retail/fintech platforms.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Price data must be ordered correctly (no out-of-order ticks shown to users). NBBO must be accurate within 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10,000 symbols, 1M price updates/sec during peak market hours, 5M concurrent WebSocket subscribers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; All tick data stored for regulatory compliance (7 years). Historical OHLCV data stored indefinitely.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inbound market data:&lt;/strong&gt; 1M ticks/sec during peak (10,000 symbols × 100 updates/sec for active symbols)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WebSocket subscribers:&lt;/strong&gt; 5M concurrent connections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fan-out:&lt;/strong&gt; Each tick for a popular symbol (AAPL, TSLA) may go to 500K subscribers → &lt;strong&gt;500M messages/sec&lt;/strong&gt; outbound at peak&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Historical data queries:&lt;/strong&gt; 10,000 QPS (chart data requests for different symbols/timeframes)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tick data (raw):&lt;/strong&gt; 1M ticks/sec × 64 bytes × 6.5 hours/day × 252 trading days = &lt;strong&gt;~377 TB/year&lt;/strong&gt; (≈1.5 TB per trading day, before compression)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OHLCV 1-min candles:&lt;/strong&gt; 10,000 symbols × 390 candles/day × 252 days × 48 bytes = &lt;strong&gt;47GB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OHLCV 1-day candles:&lt;/strong&gt; 10,000 symbols × 252 days × 48 bytes = &lt;strong&gt;121MB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order book snapshots (top 10 levels):&lt;/strong&gt; 10,000 symbols × 10 updates/sec × 200 bytes = 20MB/sec (stored in memory, snapshots persisted hourly)&lt;/li&gt;
&lt;/ul&gt;
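Since the raw-tick figure dominates the storage budget, it is worth sanity-checking the arithmetic:

```python
# Raw ticks: 1M ticks/sec x 64 bytes over a 6.5-hour session, 252 trading days/year
ticks_per_sec, tick_bytes = 1_000_000, 64
seconds_per_day = int(6.5 * 3600)                       # 23,400 trading seconds
per_day = ticks_per_sec * tick_bytes * seconds_per_day  # ~1.5 TB per trading day
per_year = per_day * 252                                # ~377 TB/year, pre-compression

# 1-min candles: 390 candles per symbol per trading day (6.5h x 60)
candles_per_year = 10_000 * 390 * 252 * 48              # ~47 GB/year
```

The two-orders-of-magnitude gap between raw ticks and candles is the argument for tiering: ticks go to cheap compressed cold storage for compliance, candles stay hot for chart queries.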
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;fan-out-heavy system&lt;/strong&gt;. Ingesting 1M ticks/sec is manageable, but fanning out each tick to potentially 500K subscribers creates a 500:1 amplification. The WebSocket fan-out layer is the hardest scaling challenge. Historical data is a classic time-series storage problem.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Stock Broker System (Zerodha/Robinhood)</title>
      <link>https://chiraghasija.cc/designs/stock-broker/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/stock-broker/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Order placement&lt;/strong&gt; — users can place market, limit, and stop-loss orders for equities, F&amp;amp;O, and ETFs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time portfolio tracking&lt;/strong&gt; — show holdings, positions, P&amp;amp;L (realized + unrealized), and margin utilization in real time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Market data feed&lt;/strong&gt; — stream live prices (LTP, bid/ask, depth) to clients via WebSocket&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order book management&lt;/strong&gt; — maintain per-user order book with status lifecycle (open → pending → executed → settled / rejected / cancelled)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Margin &amp;amp; risk management&lt;/strong&gt; — pre-trade risk checks: sufficient margin, position limits, circuit breaker enforcement, and exposure calculations&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% during market hours (9:15 AM – 3:30 PM IST). Every minute of downtime = real money lost for users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 10ms for order placement (broker-side), &amp;lt; 50ms for end-to-end order acknowledgement. Market data tick-to-display &amp;lt; 200ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Orders MUST be strongly consistent — a placed order must never be lost. Portfolio and P&amp;amp;L can be eventually consistent (~1s lag acceptable).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M registered users, 1M concurrent during market hours, 50K orders/sec peak (market open spike at 9:15 AM), 500K+ price ticks/sec from exchange feed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero tolerance for order loss. Every order must be audit-trailed with nanosecond-precision timestamps for regulatory compliance (SEBI/SEC).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 50K orders/sec peak (market open), ~5K orders/sec steady state during market hours&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Market data:&lt;/strong&gt; 5,000 instruments × 100 ticks/sec = 500K ticks/sec ingested from exchange&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WebSocket connections:&lt;/strong&gt; 1M concurrent users, each subscribed to ~10 instruments on average&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fan-out:&lt;/strong&gt; each tick fans out to subscribed users. Forwarding every raw tick would approach ~1B messages/sec (avg 2,000 subscribers per instrument × 500K ticks/sec); with ticks conflated per client to roughly one update every 2 seconds per subscribed instrument, peak outbound is ~5M messages/sec via WebSocket&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 50M orders/day × 500 bytes = &lt;strong&gt;25 GB/day&lt;/strong&gt; → 6 TB/year (must retain 7+ years for regulatory compliance)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trade history:&lt;/strong&gt; ~20M trades/day × 300 bytes = 6 GB/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Market data (OHLCV candles):&lt;/strong&gt; 5,000 instruments × 375 min × 100 bytes = 188 MB/day for 1-min candles. Tick-level data: 500K ticks/sec × 6.25 hrs × 50 bytes = &lt;strong&gt;~560 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User portfolios:&lt;/strong&gt; 10M users × 1 KB average = 10 GB (fits in memory for hot data)&lt;/li&gt;
&lt;/ul&gt;
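Zero order loss plus client retries implies logging to durable storage before acknowledging, keyed by an idempotency token. A minimal sketch: the in-memory list stands in for a real WAL, and `client_order_id` is an assumed client-supplied key, not part of any specific broker API:

```python
class OrderIntake:
    """Accept orders idempotently: persist to a WAL before acking, and
    return the same ack for a retried client_order_id."""
    def __init__(self):
        self.wal = []   # stand-in for a durable append-only log
        self.seen = {}  # client_order_id -> assigned order_id

    def place(self, client_order_id: str, order: dict) -> str:
        if client_order_id in self.seen:       # retry: do not re-append
            return self.seen[client_order_id]
        order_id = f"ord_{len(self.wal) + 1}"
        self.wal.append({"order_id": order_id, **order})  # durable BEFORE ack
        self.seen[client_order_id] = order_id
        return order_id

intake = OrderIntake()
a = intake.place("c1", {"symbol": "RELIANCE", "qty": 10, "side": "BUY"})
b = intake.place("c1", {"symbol": "RELIANCE", "qty": 10, "side": "BUY"})  # network retry
# a == b, and the WAL holds exactly one entry
```

Acking only after the append means a crash can duplicate a retry but never lose an acknowledged order, which matches the stated durability requirement.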
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;latency-critical, correctness-critical&lt;/strong&gt; system with a massive spike pattern. 9:15 AM (market open) sees 10–20× traffic surge within seconds. The order pipeline must be idempotent, durable (WAL + queue), and pre-warmed. Market data is a classic pub-sub fan-out problem — 500K ticks/sec → millions of WebSocket pushes.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Stock Price Alert System</title>
      <link>https://chiraghasija.cc/designs/stock-alert-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/stock-alert-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Create price alerts&lt;/strong&gt; — users set alerts on instruments with conditions: price above/below a threshold, or percentage change from a reference price (e.g., &amp;ldquo;alert me if RELIANCE drops 5% from ₹2500&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time matching&lt;/strong&gt; — alerts must be evaluated against every incoming price tick and trigger within seconds of the condition being met&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-channel notification&lt;/strong&gt; — triggered alerts are delivered via push notification, SMS, and/or email based on user preference&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert lifecycle management&lt;/strong&gt; — alerts have states: active → triggered → (snoozed | expired | deleted). Users can snooze (re-arm after cooldown), edit, or delete alerts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert dashboard&lt;/strong&gt; — users can list, filter, and manage all their alerts (active, triggered history, expired)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — missing an alert during a market crash is unacceptable. Degraded mode: delay delivery, but never lose a triggered alert.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5 seconds from price crossing the threshold to notification delivery (push). &amp;lt; 30 seconds for SMS/email.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M total alerts across 10M users. 500K price ticks/sec ingested from market data feed. Average 10 alerts per user.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; During volatile markets, up to 1M alerts could trigger within a 1-minute window (e.g., broad market crash).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Triggered alerts must be persisted before notification is attempted. At-least-once delivery guarantee for notifications.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Price ticks ingested:&lt;/strong&gt; 5,000 instruments × 100 ticks/sec = &lt;strong&gt;500K ticks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert evaluations:&lt;/strong&gt; Each tick must be checked against all active alerts for that instrument. Average 20K active alerts per instrument (100M alerts / 5,000 instruments) → &lt;strong&gt;500K × 20K = 10B comparisons/sec&lt;/strong&gt; (naive approach — this is why efficient matching is critical)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert triggers:&lt;/strong&gt; Normal day: ~500K alerts trigger/day. Volatile day: ~5M alerts trigger/day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert creation:&lt;/strong&gt; ~1M new alerts/day, ~500K deletions/day&lt;/li&gt;
&lt;/ul&gt;
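&lt;p&gt;The 10B comparisons/sec above is the naive cost; a per-instrument index sorted by trigger price brings each tick down to O(log N + K). A minimal sketch of a &amp;ldquo;price reaches X&amp;rdquo; index using Python&amp;rsquo;s &lt;code&gt;bisect&lt;/code&gt; (class and method names are illustrative, not from the design):&lt;/p&gt;

```python
import bisect

class InstrumentAlertIndex:
    """Per-instrument index of "alert when price reaches X" alerts,
    kept sorted by trigger price. A tick at price P fires every alert
    whose threshold is at or below P: one bisect finds the cut point
    in O(log N), and the K fired alerts are sliced off the front.
    "Price drops below X" alerts would use a mirrored index."""

    def __init__(self):
        self.thresholds = []  # sorted trigger prices
        self.alert_ids = []   # alert_ids[i] pairs with thresholds[i]

    def add(self, alert_id, threshold):
        i = bisect.bisect_left(self.thresholds, threshold)
        self.thresholds.insert(i, threshold)
        self.alert_ids.insert(i, alert_id)

    def on_tick(self, price):
        cut = bisect.bisect_right(self.thresholds, price)  # fired count
        fired = self.alert_ids[:cut]
        del self.thresholds[:cut], self.alert_ids[:cut]
        return fired

idx = InstrumentAlertIndex()
idx.add("a1", 2500.0)
idx.add("a2", 2400.0)
idx.add("a3", 2600.0)
print(idx.on_tick(2510.0))  # → ['a2', 'a1']
```

&lt;p&gt;The same query maps directly onto a Redis sorted set per instrument (&lt;code&gt;ZRANGEBYSCORE&lt;/code&gt; plus &lt;code&gt;ZREM&lt;/code&gt;), which is one way to share the index across matcher workers.&lt;/p&gt;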
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Alert records:&lt;/strong&gt; 100M alerts × 200 bytes = &lt;strong&gt;20 GB&lt;/strong&gt; — fits in memory for hot path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert history:&lt;/strong&gt; 500K triggers/day × 300 bytes = 150 MB/day → ~55 GB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification logs:&lt;/strong&gt; 1.5M notifications/day (3 channels avg per trigger) × 200 bytes = 300 MB/day&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The core challenge is &lt;strong&gt;efficient matching&lt;/strong&gt;: 500K ticks/sec against 100M alerts. Naive O(alerts_per_instrument) scan per tick is 10B comparisons/sec — too expensive. We need a data structure that answers &amp;ldquo;which alerts are triggered by price X?&amp;rdquo; in O(log N + K) where K is the number of triggered alerts. &lt;strong&gt;Sorted sets&lt;/strong&gt; (by trigger price) or &lt;strong&gt;interval trees&lt;/strong&gt; make this possible.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Surge Pricing System (Uber)</title>
      <link>https://chiraghasija.cc/designs/surge-pricing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/surge-pricing/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Compute a dynamic pricing multiplier for each geographic zone based on real-time supply (available drivers) and demand (ride requests)&lt;/li&gt;
&lt;li&gt;Divide the service area into geospatial zones and compute independent surge multipliers per zone&lt;/li&gt;
&lt;li&gt;Display the current surge multiplier to riders before they confirm a ride, with a price estimate&lt;/li&gt;
&lt;li&gt;Apply smoothing and dampening so surge prices don&amp;rsquo;t oscillate wildly (e.g., 1.0× → 3.5× → 1.2× within minutes)&lt;/li&gt;
&lt;li&gt;Enforce price caps and fairness rules (regulatory limits, max multiplier during emergencies, consistent pricing within a zone)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — surge pricing is in the critical path of every ride request. If it&amp;rsquo;s down, rides can&amp;rsquo;t be priced.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Surge multiplier lookup &amp;lt; 20ms per ride request. Surge recomputation runs every 30-60 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; All ride requests within the same zone at the same time should see the same surge multiplier. Eventual consistency across data centers is acceptable (&amp;lt; 5 second lag).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500 cities, 50K zones globally, 100K ride requests/sec at peak, 5M active drivers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Surge must reflect conditions no older than 60 seconds. Stale surge = mispricing = lost revenue or angry riders.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100K ride requests/sec at peak → each needs a surge lookup → &lt;strong&gt;100K reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Driver location updates: 5M drivers × 1 update every 4 seconds = &lt;strong&gt;1.25M location updates/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Surge recomputation: 50K zones × 1 recomputation/minute = &lt;strong&gt;~833 zone recomputations/sec&lt;/strong&gt; (lightweight)&lt;/li&gt;
&lt;li&gt;Ride request events (for demand counting): &lt;strong&gt;100K events/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Zone definitions: 50K zones × 1 KB (polygon or H3 index) = &lt;strong&gt;50 MB&lt;/strong&gt; (fits in memory)&lt;/li&gt;
&lt;li&gt;Current surge state: 50K zones × 100 bytes (multiplier, supply, demand, timestamp) = &lt;strong&gt;5 MB&lt;/strong&gt; (fits in Redis)&lt;/li&gt;
&lt;li&gt;Historical surge data (for analytics): 50K zones × 1 record/min × 60 min × 24 hr = 72M records/day × 200 bytes = &lt;strong&gt;~14 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;compute&#34;&gt;Compute&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Per zone recomputation: count supply (drivers in zone), count demand (requests in zone in last 2 min), compute ratio, apply formula&lt;/li&gt;
&lt;li&gt;Simple arithmetic — CPU is trivial&lt;/li&gt;
&lt;li&gt;The bottleneck is &lt;strong&gt;ingesting and aggregating 1.25M location updates/sec&lt;/strong&gt; to determine supply per zone in real-time&lt;/li&gt;
&lt;/ul&gt;
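&lt;p&gt;The recomputation itself is simple arithmetic; the subtlety is the smoothing called out in the requirements. A toy per-zone formula with clamping and exponential smoothing (every name and constant here is an illustrative assumption, not Uber&amp;rsquo;s actual formula):&lt;/p&gt;

```python
def surge_multiplier(demand, supply, prev_multiplier, alpha=0.3, max_mult=3.0):
    """Toy per-zone surge: raw demand/supply ratio, clamped to
    [1.0, max_mult], then exponentially smoothed against the previous
    value so the multiplier cannot whipsaw between recomputes.
    All names and constants are illustrative assumptions."""
    raw = demand / max(supply, 1)          # avoid divide-by-zero in empty zones
    target = min(max(raw, 1.0), max_mult)  # clamp to [1.0, max_mult]
    smoothed = alpha * target + (1 - alpha) * prev_multiplier
    return round(smoothed, 1)              # riders see one decimal place

# Demand spike, then collapse: the multiplier ramps instead of jumping.
m = 1.0
for demand, supply in [(30, 10), (30, 10), (5, 10)]:
    m = surge_multiplier(demand, supply, m)
    print(m)  # prints 1.6, then 2.0, then 1.7
```

&lt;p&gt;The smoothing factor trades responsiveness for stability; the price-cap and fairness rules from the requirements would be applied after this step.&lt;/p&gt;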
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;real-time geospatial aggregation&lt;/strong&gt; problem. The hard parts are: (1) efficiently mapping millions of driver locations to zones every few seconds, (2) computing stable surge multipliers that respond to demand without oscillating, and (3) ensuring riders and drivers see consistent prices during a ride.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a System to Identify Top-K Shared Articles</title>
      <link>https://chiraghasija.cc/designs/top-k-articles/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/top-k-articles/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Track every article share event across the platform (social shares, link copies, email shares)&lt;/li&gt;
&lt;li&gt;Return the top-K most shared articles for multiple time windows: last 1 minute, last 1 hour, last 24 hours&lt;/li&gt;
&lt;li&gt;Support real-time updates — the top-K list refreshes within seconds of share events&lt;/li&gt;
&lt;li&gt;Provide both global top-K and per-category top-K (e.g., top sports articles, top tech articles)&lt;/li&gt;
&lt;li&gt;Expose an API for clients to query current trending articles with share counts&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the trending list is a high-visibility feature; downtime is immediately noticeable&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 50ms for top-K queries. Share event ingestion can tolerate up to 5 seconds of delay before reflecting in results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Approximate counts are acceptable. If an article has 10,000 shares, reporting 9,950 is fine. Rankings may be slightly stale by a few seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100K share events/sec at peak (viral events, breaking news). 10M+ unique articles in the system. Top-K queries at 50K QPS.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Share events must not be lost (feed into analytics). Top-K results are recomputable from raw events.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Share events: 100K/sec peak, 30K/sec average&lt;/li&gt;
&lt;li&gt;Daily share events: 30K × 86,400 = &lt;strong&gt;2.6 billion/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Top-K query QPS: 50K/sec (served from cache, cheap)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each share event: article_id (8 bytes) + user_id (8 bytes) + timestamp (8 bytes) + type (1 byte) + category (2 bytes) = ~30 bytes&lt;/li&gt;
&lt;li&gt;Daily raw events: 2.6B × 30 bytes = &lt;strong&gt;78 GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;30-day retention for raw events: &lt;strong&gt;2.3 TB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;counting-infrastructure&#34;&gt;Counting Infrastructure&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Count-Min Sketch for approximate counting:
&lt;ul&gt;
&lt;li&gt;4 hash functions × 1M counters each = 4M counters&lt;/li&gt;
&lt;li&gt;Each counter: 4 bytes → &lt;strong&gt;16 MB&lt;/strong&gt; per time window&lt;/li&gt;
&lt;li&gt;3 time windows (1min, 1hr, 24hr) → &lt;strong&gt;48 MB&lt;/strong&gt; total&lt;/li&gt;
&lt;li&gt;Fits entirely in memory on a single node&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
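&lt;p&gt;To make the sizing concrete, here is a minimal Count-Min Sketch (width scaled down from the 1M counters in the estimate; the hash construction is an illustrative choice):&lt;/p&gt;

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: depth rows of width counters. Each
    key hashes to one counter per row; the estimate is the minimum
    across rows, so collisions can only over-count, never under-count."""

    def __init__(self, width=1000, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One independent hash per row, derived by salting blake2b.
        h = hashlib.blake2b(key.encode(), salt=str(row).encode())
        return int.from_bytes(h.digest()[:8], "big") % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += count

    def estimate(self, key):
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))

cms = CountMinSketch()
cms.add("article:42", 9950)
print(cms.estimate("article:42"))  # → 9950 (exact here: nothing else collides yet)
```

&lt;p&gt;A min-heap of the current top-K candidates sits next to the sketch: on each event, re-estimate the item and update the heap if its count exceeds the heap minimum.&lt;/p&gt;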
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The core challenge is &lt;strong&gt;not storage&lt;/strong&gt; — it is maintaining accurate, real-time rankings over sliding time windows at 100K events/sec. The Count-Min Sketch + min-heap approach keeps this in O(log K) per event with minimal memory.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Top-K System (Heavy Hitters)</title>
      <link>https://chiraghasija.cc/designs/top-k-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/top-k-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Track the top-K most frequent items (heavy hitters) across a stream of events in real-time (e.g., top 100 trending hashtags, most searched queries, most purchased products)&lt;/li&gt;
&lt;li&gt;Support time-windowed queries — top-K in the last 1 minute, 1 hour, 1 day&lt;/li&gt;
&lt;li&gt;Provide approximate counts for each item in the top-K list, with bounded error guarantees&lt;/li&gt;
&lt;li&gt;Support multiple independent top-K lists (per category, per region, global)&lt;/li&gt;
&lt;li&gt;Allow querying both the current top-K snapshot and historical top-K at any past timestamp&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the system is used for real-time dashboards and recommendation feeds&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Event processing &amp;lt; 10ms. Top-K query response &amp;lt; 50ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Approximate counts are acceptable (within 0.1% of true count). The top-K list may briefly lag by a few seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1M events/sec ingestion, 100K unique items per time window, top-K where K ≤ 1000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory efficiency:&lt;/strong&gt; Must not grow linearly with the number of unique items. A stream with 100M unique items should still use bounded memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Event stream: 1M events/sec (e.g., search queries, clicks, purchases)&lt;/li&gt;
&lt;li&gt;Each event: item_id (8 bytes) + timestamp (8 bytes) + metadata (32 bytes) = ~48 bytes&lt;/li&gt;
&lt;li&gt;Ingestion bandwidth: 1M × 48B = &lt;strong&gt;48 MB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Query QPS: ~10K queries/sec for top-K reads (cached aggressively)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;memory-for-exact-counting&#34;&gt;Memory for Exact Counting&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100M unique items per day × 16 bytes (8-byte item_id + 8-byte count) = &lt;strong&gt;1.6 GB&lt;/strong&gt; — feasible for a daily total, but maintaining sliding windows at 1-minute granularity is harder&lt;/li&gt;
&lt;li&gt;100M items × 1440 minutes × 16 bytes = &lt;strong&gt;2.3 TB&lt;/strong&gt; — impossible for exact per-minute counts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;memory-for-approximate-counting&#34;&gt;Memory for Approximate Counting&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Count-Min Sketch:&lt;/strong&gt; 4 hash functions × 1M counters × 4 bytes = &lt;strong&gt;16 MB&lt;/strong&gt; per time window&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Space-Saving (top-K tracker):&lt;/strong&gt; K=1000 items × (8 bytes key + 8 bytes count + 8 bytes error) = &lt;strong&gt;24 KB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per-minute windows for 24 hours:&lt;/strong&gt; 1440 windows × 16 MB = &lt;strong&gt;23 GB&lt;/strong&gt; — manageable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;With exponential decay (approximate sliding window):&lt;/strong&gt; Single sketch, no windowing needed = &lt;strong&gt;16 MB total&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
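&lt;p&gt;The Space-Saving tracker is small enough to show in full. A minimal sketch (a production version would back the minimum lookup with a heap or the stream-summary structure instead of a linear scan):&lt;/p&gt;

```python
class SpaceSaving:
    """Space-Saving top-K tracker with m counters. A new item that
    arrives when all m counters are occupied evicts the minimum
    counter and inherits its count as its error bound, so every
    stored count over-counts by at most the stored error."""

    def __init__(self, m):
        self.m = m
        self.counts = {}  # item: (count, error)

    def add(self, item):
        if item in self.counts:
            c, e = self.counts[item]
            self.counts[item] = (c + 1, e)
        elif len(self.counts) != self.m:  # still room; size never exceeds m
            self.counts[item] = (1, 0)
        else:
            victim = min(self.counts, key=lambda k: self.counts[k][0])
            c_min, _ = self.counts.pop(victim)
            self.counts[item] = (c_min + 1, c_min)

    def top_k(self, k):
        ranked = sorted(self.counts.items(), key=lambda kv: -kv[1][0])
        return [(item, c) for item, (c, _) in ranked[:k]]

ss = SpaceSaving(m=3)
for item in ["a", "a", "b", "a", "c", "d", "a", "b"]:
    ss.add(item)
print(ss.top_k(2))  # → [('a', 4), ('d', 2)]
```

&lt;p&gt;With m on the order of 1/ε counters, any item whose true frequency exceeds εN is guaranteed to be present in the summary.&lt;/p&gt;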
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The core challenge is the &lt;strong&gt;memory vs accuracy&lt;/strong&gt; trade-off. Exact counting requires O(N) memory where N is the number of unique items. Probabilistic data structures give ε-approximate answers in sublinear memory: Count-Min Sketch in O((1/ε) log(1/δ)) counters, Space-Saving in O(1/ε) counters. For top-K, we need both frequency estimation AND identification of the top items.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a URL Shortening Service (TinyURL)</title>
      <link>https://chiraghasija.cc/designs/url-shortener/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/url-shortener/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Given a long URL, generate a short, unique URL&lt;/li&gt;
&lt;li&gt;Given a short URL, redirect to the original long URL&lt;/li&gt;
&lt;li&gt;Users can optionally set a custom alias&lt;/li&gt;
&lt;li&gt;Links expire after a configurable TTL (default: 5 years)&lt;/li&gt;
&lt;li&gt;Analytics: track click count per short URL&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — redirects must always work; this is on the critical path of every click&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Redirect in &amp;lt; 10ms at p99 (just a lookup + 301)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is fine for analytics. Strong consistency for URL creation (no duplicate short codes)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 100M new URLs/day, 10:1 read-to-write ratio → 1B redirects/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; URLs must not be lost — a broken short link is permanent reputation damage&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-url-creation&#34;&gt;Write (URL creation)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100M URLs/day ÷ 100K sec/day = &lt;strong&gt;~1,000 writes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 5x → &lt;strong&gt;5,000 writes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;read-redirects&#34;&gt;Read (redirects)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B redirects/day ÷ 100K = &lt;strong&gt;~10,000 reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;50,000 reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each record: short code (7 bytes) + long URL (avg 200 bytes) + metadata (50 bytes) ≈ 250 bytes&lt;/li&gt;
&lt;li&gt;100M/day × 365 × 5 years = 182.5B records&lt;/li&gt;
&lt;li&gt;182.5B × 250 bytes = &lt;strong&gt;~45 TB&lt;/strong&gt; over 5 years&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;short-code-space&#34;&gt;Short code space&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Base62 (a-z, A-Z, 0-9), 7 characters = 62^7 = &lt;strong&gt;3.5 trillion&lt;/strong&gt; unique codes — more than enough&lt;/li&gt;
&lt;/ul&gt;
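&lt;p&gt;Base62 encoding of a monotonically increasing counter is a few lines; a sketch (the fixed 7-character width matches the code-space estimate above, and the alphabet ordering is an arbitrary choice):&lt;/p&gt;

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n, width=7):
    """Encode a non-negative counter as a zero-padded 7-char code."""
    chars = []
    while n:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars)).rjust(width, "0")

def base62_decode(code):
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

code = base62_encode(123456789)
print(code, base62_decode(code))  # → 008m0Kx 123456789
```

&lt;p&gt;The counter would come from a distributed ID generator (or pre-allocated key ranges) so that two app servers never encode the same number.&lt;/p&gt;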
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/shorten
  Headers: Authorization: Bearer &amp;lt;api_key&amp;gt;
  Body: {
    &amp;#34;long_url&amp;#34;: &amp;#34;https://example.com/very/long/path&amp;#34;,
    &amp;#34;custom_alias&amp;#34;: &amp;#34;my-link&amp;#34;,     // optional
    &amp;#34;ttl_days&amp;#34;: 365                // optional, default 1825
  }
  Response 201: {
    &amp;#34;short_url&amp;#34;: &amp;#34;https://tiny.url/aB3kX9p&amp;#34;,
    &amp;#34;short_code&amp;#34;: &amp;#34;aB3kX9p&amp;#34;,
    &amp;#34;expires_at&amp;#34;: &amp;#34;2031-02-22T00:00:00Z&amp;#34;
  }

GET /{short_code}
  Response 301: Location: https://example.com/very/long/path
  Response 404: { &amp;#34;error&amp;#34;: &amp;#34;URL not found or expired&amp;#34; }

GET /api/v1/stats/{short_code}
  Headers: Authorization: Bearer &amp;lt;api_key&amp;gt;
  Response 200: {
    &amp;#34;short_code&amp;#34;: &amp;#34;aB3kX9p&amp;#34;,
    &amp;#34;long_url&amp;#34;: &amp;#34;https://example.com/...&amp;#34;,
    &amp;#34;total_clicks&amp;#34;: 142857,
    &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T00:00:00Z&amp;#34;,
    &amp;#34;expires_at&amp;#34;: &amp;#34;2031-02-22T00:00:00Z&amp;#34;
  }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Weather Forecasting Service</title>
      <link>https://chiraghasija.cc/designs/weather-service/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/weather-service/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Provide current weather conditions and 7-day forecasts for any location worldwide (by coordinates, city name, or ZIP code)&lt;/li&gt;
&lt;li&gt;Ingest and process data from multiple sources: weather stations (100K+), satellites, radar, and third-party NWS/ECMWF model data&lt;/li&gt;
&lt;li&gt;Support geospatial queries: &amp;ldquo;weather at 40.7128,-74.0060&amp;rdquo; and reverse geocoding: &amp;ldquo;weather in New York, NY&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Severe weather alert system: tornado warnings, flood alerts, heat advisories — push notifications to affected users within minutes&lt;/li&gt;
&lt;li&gt;Provide historical weather data warehouse for trend analysis, agriculture, insurance, and research use cases&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for the API. Weather data is safety-critical — aviation, maritime, emergency services depend on it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Current conditions API &amp;lt; 100ms. Forecast API &amp;lt; 200ms. Alert delivery &amp;lt; 2 minutes from NWS issuance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Current conditions updated every 5-15 minutes. Forecasts updated every 6 hours (aligned with model runs). Alerts delivered in real-time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1B API requests/day across 10M registered developers. 50K concurrent data ingestion streams. 100TB historical data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accuracy:&lt;/strong&gt; Forecast accuracy comparable to top providers. Temperature within ±2°F for 24-hour forecasts, ±5°F for 7-day.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;api-traffic&#34;&gt;API Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B requests/day = &lt;strong&gt;11.5K requests/sec average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 5x during severe weather events = &lt;strong&gt;~60K requests/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Breakdown: 60% current conditions, 30% forecasts, 10% historical/alerts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;data-ingestion&#34;&gt;Data Ingestion&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Weather stations: 100K stations reporting every 5-15 minutes = ~400K–1.2M observations/hour&lt;/li&gt;
&lt;li&gt;Satellite data: 10 satellite feeds, each producing ~50GB/day = &lt;strong&gt;500GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Radar: 200 radar stations, 5-minute sweeps (288/day), each ~10MB = ~58K images/day = &lt;strong&gt;~600GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;NWS/ECMWF model output: 4 model runs/day × 50GB each = &lt;strong&gt;200GB/day&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Current conditions cache: 100K stations × 2KB = 200MB — fits entirely in Redis&lt;/li&gt;
&lt;li&gt;Forecast grid data: global 0.25-degree grid = 1,440 × 720 = 1M grid points × 7 days × 24 hours × 100 bytes = &lt;strong&gt;17GB&lt;/strong&gt; per model run → cached in memory&lt;/li&gt;
&lt;li&gt;Historical data: 100K stations × 365 days × 288 observations/day (every 5 min) × 200 bytes = &lt;strong&gt;2TB/year station data&lt;/strong&gt; + satellite/radar archives&lt;/li&gt;
&lt;li&gt;Total historical warehouse: &lt;strong&gt;~100TB&lt;/strong&gt; (5 years of all sources)&lt;/li&gt;
&lt;/ul&gt;
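&lt;p&gt;The 0.25-degree grid above implies a simple coordinate-to-cell mapping for forecast lookups. A sketch (the grid origin and row-major layout are assumptions for illustration):&lt;/p&gt;

```python
def grid_index(lat, lon, resolution=0.25):
    """Map a coordinate to its cell in a global 0.25-degree grid
    (1440 columns by 720 rows, matching the 1M-point estimate).
    Row 0 sits at latitude -90, column 0 at longitude -180."""
    rows = int(180 / resolution)   # 720
    cols = int(360 / resolution)   # 1440
    row = int((lat + 90.0) / resolution)
    col = int((lon + 180.0) / resolution)
    row = min(row, rows - 1)       # clamp the lat = +90 edge case
    col = col % cols               # wrap longitude across the date line
    return row * cols + col

# New York City: 40.7128 N, -74.0060 W
print(grid_index(40.7128, -74.0060))  # → 752103
```

&lt;p&gt;Every request for the same city hits the same cell, which is what makes the 95%+ cache hit rate achievable: cache by grid index, not by raw coordinates.&lt;/p&gt;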
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The read pattern is highly cacheable — millions of users in New York all get the same weather. The cache hit rate should be &amp;gt; 95%. The hard problem is ingesting, processing, and gridding heterogeneous data sources into a unified model.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Web Crawler</title>
      <link>https://chiraghasija.cc/designs/web-crawler/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/web-crawler/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Crawl the web starting from a seed set of URLs, discovering and following links to new pages&lt;/li&gt;
&lt;li&gt;Download and store the HTML content of each page for downstream indexing/processing&lt;/li&gt;
&lt;li&gt;Respect robots.txt directives and crawl-delay policies for every domain&lt;/li&gt;
&lt;li&gt;Detect and avoid duplicate URLs (normalization) and duplicate content (near-dedup)&lt;/li&gt;
&lt;li&gt;Support prioritized crawling — important/fresh pages crawled more frequently than stale/low-quality pages&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — the crawler should run continuously. Brief outages are acceptable (we just resume from where we left off).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; Crawl 1 billion pages/day (~12,000 pages/sec sustained). Scale to 5 billion pages total in the index.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Not a real-time system. End-to-end latency from URL discovery to content storage can be minutes to hours.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Politeness:&lt;/strong&gt; Never overload a single web server. Maximum 1 request/second per domain by default, respect Crawl-Delay.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Robustness:&lt;/strong&gt; Handle spider traps, malformed HTML, infinite URL spaces (calendars, session IDs), and adversarial pages gracefully.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;throughput&#34;&gt;Throughput&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Target: 1 billion pages/day&lt;/li&gt;
&lt;li&gt;1B / 86,400 = &lt;strong&gt;~12,000 pages/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Average page size: 100KB (HTML + headers)&lt;/li&gt;
&lt;li&gt;Download bandwidth: 12,000 × 100KB = &lt;strong&gt;1.2 GB/sec&lt;/strong&gt; = ~10 Gbps&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;5 billion pages in the index&lt;/li&gt;
&lt;li&gt;Average compressed page: 30KB (HTML compresses ~3:1)&lt;/li&gt;
&lt;li&gt;Total content storage: 5B × 30KB = &lt;strong&gt;150 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;URL frontier (URLs to crawl): 10 billion URLs × 200 bytes = &lt;strong&gt;2 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;URL seen set (deduplication): 50 billion URLs × 8 bytes (fingerprint) = &lt;strong&gt;400 GB&lt;/strong&gt; — fits in distributed memory&lt;/li&gt;
&lt;/ul&gt;
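&lt;p&gt;The 8-byte seen-set fingerprint only works if URLs are normalized first; otherwise the same page is crawled many times under trivially different URLs. A sketch (the normalization rules here are deliberately minimal and illustrative):&lt;/p&gt;

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def url_fingerprint(url):
    """Normalize a URL, then reduce it to an 8-byte fingerprint for
    the seen set (50B URLs at 8 bytes each = 400 GB, per the estimate).
    Normalization: lowercase scheme and host, drop the fragment,
    strip default ports, remove a trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    for default in (":80", ":443"):
        if host.endswith(default):
            host = host[: -len(default)]
    path = parts.path.rstrip("/") or "/"
    normalized = urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
    return int.from_bytes(hashlib.blake2b(normalized.encode()).digest()[:8], "big")

a = url_fingerprint("HTTP://Example.com:80/news/")
b = url_fingerprint("http://example.com/news")
print(a == b)  # → True
```

&lt;p&gt;Real crawlers add many more rules (query-parameter sorting, session-ID stripping), each a trade-off between collapsing duplicates and merging genuinely distinct pages.&lt;/p&gt;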
&lt;h3 id=&#34;dns&#34;&gt;DNS&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Unique domains: ~200 million&lt;/li&gt;
&lt;li&gt;DNS resolution: must cache aggressively. At 12,000 pages/sec, we cannot do a fresh DNS lookup for each.&lt;/li&gt;
&lt;li&gt;DNS cache: 200M domains × 100 bytes = &lt;strong&gt;20 GB&lt;/strong&gt; — fits in memory&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;machines&#34;&gt;Machines&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each crawler worker: ~200 concurrent connections, 500 pages/sec per worker&lt;/li&gt;
&lt;li&gt;12,000 / 500 = &lt;strong&gt;~24 crawler workers&lt;/strong&gt; (plus headroom → 40 workers)&lt;/li&gt;
&lt;li&gt;Each worker: 32 cores, 64GB RAM, 10 Gbps NIC&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;p&gt;The web crawler is not a user-facing API service — it is an internal batch processing system. However, it has internal control and data interfaces.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design a Wire Transfer System</title>
      <link>https://chiraghasija.cc/designs/wire-transfer/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/wire-transfer/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Transfer money between accounts within the same bank (internal) and across banks (external via SWIFT/ACH)&lt;/li&gt;
&lt;li&gt;Every transaction must follow double-entry bookkeeping — debit one account, credit another, with a full audit trail&lt;/li&gt;
&lt;li&gt;Support idempotent transfers — retrying the same request must not duplicate the transfer&lt;/li&gt;
&lt;li&gt;Provide real-time transaction status tracking (pending, processing, completed, failed, reversed)&lt;/li&gt;
&lt;li&gt;Enforce compliance checks (AML/KYC screening, sanctions list) before processing any transfer&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — financial systems cannot afford extended downtime. Scheduled maintenance windows are acceptable (e.g., 2am-3am) but unplanned outages are catastrophic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Internal transfers &amp;lt; 500ms end-to-end. External transfers (cross-bank) may take seconds to hours depending on the rail (ACH = batch, SWIFT = near-real-time).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency is mandatory. Money cannot be created or destroyed. Every debit must have a matching credit. We sacrifice availability for consistency (CP system).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10,000 transfers/sec at peak. $50B daily volume. 500M accounts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero data loss. Every transaction must be persisted to durable storage with replication before acknowledgment.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10,000 transfers/sec peak, ~3,000 avg&lt;/li&gt;
&lt;li&gt;Each transfer involves: 1 write to create the transfer, 2 writes to update account balances (debit + credit), 1 write to the ledger&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;~40,000 DB writes/sec peak&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Read traffic (balance checks, transaction history): ~50,000 reads/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M accounts × 500 bytes (account metadata + balance) = &lt;strong&gt;250 GB&lt;/strong&gt; for accounts&lt;/li&gt;
&lt;li&gt;300M transfers/day × 365 days × 1 KB per transfer = &lt;strong&gt;~110 TB/year&lt;/strong&gt; for transaction history&lt;/li&gt;
&lt;li&gt;Ledger entries: 600M/day (2 per transfer) × 500 bytes = &lt;strong&gt;~110 TB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Total: ~250 TB/year growing. Need partitioning and archival strategy.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;money-math&#34;&gt;Money Math&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;All monetary amounts stored as &lt;strong&gt;integers in the smallest currency unit&lt;/strong&gt; (cents for USD, pence for GBP)&lt;/li&gt;
&lt;li&gt;Never use floating point. $100.50 is stored as 10050 cents.&lt;/li&gt;
&lt;li&gt;Maximum transfer size: 64-bit integer → $92 quadrillion in cents. More than enough.&lt;/li&gt;
&lt;/ul&gt;
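&lt;p&gt;A sketch of the integer-cents rule at the API boundary: parse display amounts with &lt;code&gt;Decimal&lt;/code&gt;, never &lt;code&gt;float&lt;/code&gt;, and reject anything finer than the currency exponent (function name and error handling are illustrative):&lt;/p&gt;

```python
from decimal import Decimal

def to_minor_units(amount_str, exponent=2):
    """Parse a display amount such as "100.50" into integer minor
    units (cents for USD, pence for GBP). Decimal parses the string
    exactly, so floats never enter the pipeline; amounts finer than
    the currency exponent are rejected instead of silently rounded."""
    scaled = Decimal(amount_str).scaleb(exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError("more precision than the currency allows")
    return int(scaled)

print(to_minor_units("100.50"))                         # → 10050
print(to_minor_units("0.07") + to_minor_units("0.03"))  # → 10 (exact; the float sum 0.07 + 0.03 is not 0.1)
```

&lt;p&gt;All downstream arithmetic (balances, ledger entries) then stays in plain integer addition and subtraction, which is exact by construction.&lt;/p&gt;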
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Initiate a transfer
POST /v1/transfers
  Headers: Idempotency-Key: &amp;#34;uuid-abc-123&amp;#34;
  Body: {
    &amp;#34;from_account_id&amp;#34;: &amp;#34;acc_sender_001&amp;#34;,
    &amp;#34;to_account_id&amp;#34;: &amp;#34;acc_receiver_002&amp;#34;,
    &amp;#34;amount&amp;#34;: 10050,                    // $100.50 in cents
    &amp;#34;currency&amp;#34;: &amp;#34;USD&amp;#34;,
    &amp;#34;reference&amp;#34;: &amp;#34;Invoice #4521&amp;#34;,
    &amp;#34;transfer_type&amp;#34;: &amp;#34;internal&amp;#34;         // or &amp;#34;ach&amp;#34;, &amp;#34;swift&amp;#34;
  }
  Response 201: {
    &amp;#34;transfer_id&amp;#34;: &amp;#34;txn_xyz_789&amp;#34;,
    &amp;#34;status&amp;#34;: &amp;#34;pending&amp;#34;,
    &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T10:00:00Z&amp;#34;
  }

// Get transfer status
GET /v1/transfers/{transfer_id}
  Response 200: {
    &amp;#34;transfer_id&amp;#34;: &amp;#34;txn_xyz_789&amp;#34;,
    &amp;#34;status&amp;#34;: &amp;#34;completed&amp;#34;,              // pending | processing | completed | failed | reversed
    &amp;#34;from_account_id&amp;#34;: &amp;#34;acc_sender_001&amp;#34;,
    &amp;#34;to_account_id&amp;#34;: &amp;#34;acc_receiver_002&amp;#34;,
    &amp;#34;amount&amp;#34;: 10050,
    &amp;#34;currency&amp;#34;: &amp;#34;USD&amp;#34;,
    &amp;#34;compliance_status&amp;#34;: &amp;#34;cleared&amp;#34;,
    &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T10:00:00Z&amp;#34;,
    &amp;#34;completed_at&amp;#34;: &amp;#34;2026-02-22T10:00:02Z&amp;#34;
  }

// Get account balance
GET /v1/accounts/{account_id}/balance
  Response 200: {
    &amp;#34;account_id&amp;#34;: &amp;#34;acc_sender_001&amp;#34;,
    &amp;#34;available_balance&amp;#34;: 5000000,       // $50,000.00
    &amp;#34;pending_balance&amp;#34;: 4989950,         // after pending debit
    &amp;#34;currency&amp;#34;: &amp;#34;USD&amp;#34;
  }

// Get transaction history
GET /v1/accounts/{account_id}/transactions?limit=50&amp;amp;cursor=xxx
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Idempotency-Key header&lt;/strong&gt; is mandatory on POST. The server stores the key and returns the same response on retry.&lt;/li&gt;
&lt;li&gt;Amounts are always integers in the smallest currency unit. The API never accepts floats.&lt;/li&gt;
&lt;li&gt;Transfer creation is asynchronous — returns &lt;code&gt;pending&lt;/code&gt; immediately. Client polls or receives webhook for completion.&lt;/li&gt;
&lt;/ul&gt;
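The idempotency decision above can be sketched in-process (the real system keeps keys in Redis backed by PostgreSQL with a 24-hour TTL; the store and function names here are illustrative):

```python
import uuid

# Illustrative in-memory idempotency store; production uses
# Redis + PostgreSQL with a 24h TTL as described above.
_idempotency_store: dict[str, dict] = {}

def create_transfer(idempotency_key: str, body: dict) -> dict:
    # A retry with the same key returns the original response,
    # so a double-submitted POST cannot move money twice.
    if idempotency_key in _idempotency_store:
        return _idempotency_store[idempotency_key]
    response = {
        "transfer_id": f"txn_{uuid.uuid4().hex[:8]}",
        "status": "pending",
    }
    _idempotency_store[idempotency_key] = response
    return response

first = create_transfer("uuid-abc-123", {"amount": 10050})
retry = create_transfer("uuid-abc-123", {"amount": 10050})
assert first == retry  # same transfer_id, no duplicate transfer
```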
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;accounts-table-postgresql--sharded-by-account_id&#34;&gt;Accounts Table (PostgreSQL — sharded by account_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: accounts
  account_id        (PK)  | varchar(20)
  user_id           (FK)  | varchar(20)
  balance           | bigint          -- available balance in cents
  pending_balance   | bigint          -- balance after pending holds
  currency          | char(3)
  status            | enum(&amp;#39;active&amp;#39;, &amp;#39;frozen&amp;#39;, &amp;#39;closed&amp;#39;)
  created_at        | timestamp
  updated_at        | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;transfers-table-postgresql--sharded-by-transfer_id&#34;&gt;Transfers Table (PostgreSQL — sharded by transfer_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: transfers
  transfer_id       (PK)  | varchar(20)
  idempotency_key   (UQ)  | varchar(64)
  from_account_id   (FK)  | varchar(20)
  to_account_id     (FK)  | varchar(20)
  amount            | bigint
  currency          | char(3)
  transfer_type     | enum(&amp;#39;internal&amp;#39;, &amp;#39;ach&amp;#39;, &amp;#39;swift&amp;#39;)
  status            | enum(&amp;#39;pending&amp;#39;, &amp;#39;processing&amp;#39;, &amp;#39;completed&amp;#39;, &amp;#39;failed&amp;#39;, &amp;#39;reversed&amp;#39;)
  compliance_status | enum(&amp;#39;pending&amp;#39;, &amp;#39;cleared&amp;#39;, &amp;#39;flagged&amp;#39;, &amp;#39;blocked&amp;#39;)
  reference         | varchar(200)
  created_at        | timestamp
  completed_at      | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ledger-table-append-only--the-source-of-truth&#34;&gt;Ledger Table (Append-Only — the source of truth)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: ledger_entries
  entry_id          (PK)  | bigint (auto-increment)
  transfer_id       (FK)  | varchar(20)
  account_id        (FK)  | varchar(20)
  entry_type        | enum(&amp;#39;debit&amp;#39;, &amp;#39;credit&amp;#39;)
  amount            | bigint
  balance_after     | bigint          -- running balance snapshot
  created_at        | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;idempotency-store-redis--postgresql&#34;&gt;Idempotency Store (Redis + PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: idempotency_keys
  idempotency_key   (PK)  | varchar(64)
  transfer_id       | varchar(20)
  response_body     | jsonb
  created_at        | timestamp
  expires_at        | timestamp       -- TTL: 24 hours
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-postgresql&#34;&gt;Why PostgreSQL?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;ACID transactions are non-negotiable for financial systems&lt;/li&gt;
&lt;li&gt;Row-level locking for concurrent balance updates&lt;/li&gt;
&lt;li&gt;Serializable isolation level for critical transfer logic&lt;/li&gt;
&lt;li&gt;Rich constraint system (CHECK balance &amp;gt;= 0, foreign keys)&lt;/li&gt;
&lt;li&gt;Proven reliability in banking — this is not the place for eventual consistency&lt;/li&gt;
&lt;/ul&gt;
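The CHECK constraint mentioned above is a last-line safety net: even if application code forgets to validate, the database rejects an overdraft. A small demonstration (using SQLite for portability; the design itself specifies PostgreSQL):

```python
import sqlite3

# Sketch of the "CHECK (balance >= 0)" safety net.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id TEXT PRIMARY KEY,
        balance    INTEGER NOT NULL CHECK (balance >= 0)
    )
""")
conn.execute("INSERT INTO accounts VALUES ('acc_sender_001', 10050)")

# An overdraft violates the constraint even when application
# logic forgets to check the balance first.
try:
    conn.execute(
        "UPDATE accounts SET balance = balance - 20000 "
        "WHERE account_id = 'acc_sender_001'"
    )
    overdraft_blocked = False
except sqlite3.IntegrityError:
    overdraft_blocked = True

assert overdraft_blocked  # the UPDATE was rejected, balance untouched
```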
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;transfer-flow-internal&#34;&gt;Transfer Flow (Internal)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client
  → API Gateway (auth, rate limiting)
    → Transfer Service
      → 1. Validate idempotency key (Redis lookup)
         If exists → return cached response
      → 2. Create transfer record (status = pending)
      → 3. Compliance check (AML/KYC/sanctions screening)
         If flagged → status = blocked, notify compliance team
      → 4. Execute transfer (single DB transaction):
         BEGIN TRANSACTION (SERIALIZABLE)
           SELECT balance FROM accounts WHERE account_id = sender FOR UPDATE
           IF balance &amp;lt; amount → ROLLBACK, return insufficient funds
           UPDATE accounts SET balance = balance - amount WHERE account_id = sender
           UPDATE accounts SET balance = balance + amount WHERE account_id = receiver
           INSERT INTO ledger_entries (debit for sender)
           INSERT INTO ledger_entries (credit for receiver)
           UPDATE transfers SET status = &amp;#39;completed&amp;#39;
         COMMIT
      → 5. Store idempotency key → response mapping
      → 6. Send notifications (async via message queue)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;transfer-flow-cross-bank-via-achswift&#34;&gt;Transfer Flow (Cross-Bank via ACH/SWIFT)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client
  → API Gateway → Transfer Service
    → 1-3. Same as above (validate, create, compliance)
    → 4. Debit sender&amp;#39;s account + create hold
    → 5. Submit to Payment Rail:
         ACH: Batch file submitted to ACH operator (Nacha format)
              → Processed in batch windows (next business day)
         SWIFT: MT103 message to correspondent bank
              → Near-real-time via SWIFT network
    → 6. Await confirmation from external bank
         → Payment Rail Adapter (listens for responses)
           → On success: credit receiver&amp;#39;s account, update transfer status
           → On failure: release hold, reverse debit, update status
    → 7. Reconciliation job validates all external transfers daily
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;API Gateway:&lt;/strong&gt; Authentication, rate limiting, TLS termination. All traffic over mTLS internally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transfer Service:&lt;/strong&gt; Core business logic. Stateless, horizontally scaled. Orchestrates the transfer lifecycle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance Service:&lt;/strong&gt; Screens transfers against sanctions lists (OFAC, EU), runs AML rules (large amounts, velocity checks, geographic risk scoring). Calls external providers (e.g., Refinitiv, Dow Jones) for PEP/sanctions screening.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ledger Database (PostgreSQL):&lt;/strong&gt; Sharded by account_id. Primary + synchronous replica for zero data loss. Append-only ledger table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Payment Rail Adapters:&lt;/strong&gt; Separate services for ACH, SWIFT, FedWire. Handle protocol-specific formatting and communication.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Service:&lt;/strong&gt; Sends emails, push notifications, webhooks on transfer completion/failure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reconciliation Engine:&lt;/strong&gt; Batch job that runs daily. Compares internal ledger against bank statements and external rail confirmations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idempotency Store (Redis):&lt;/strong&gt; Fast lookup for duplicate detection. Backed by PostgreSQL for durability.&lt;/li&gt;
&lt;/ol&gt;
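The internal transfer transaction (step 4 of the flow above) can be sketched end to end. This is a simplified single-threaded, in-memory model, not the real SQL: concurrency safety comes from SERIALIZABLE isolation and row locks in the actual design.

```python
# Simplified model of step 4's atomic transfer: a debit, a credit,
# and two ledger entries succeed or fail together.
accounts = {"acc_sender_001": 5_000_000, "acc_receiver_002": 0}
ledger: list[dict] = []

def execute_transfer(transfer_id: str, sender: str, receiver: str,
                     amount: int) -> str:
    if accounts[sender] < amount:
        return "failed"  # maps to ROLLBACK: nothing was mutated
    accounts[sender] -= amount
    accounts[receiver] += amount
    ledger.append({"transfer_id": transfer_id, "account_id": sender,
                   "entry_type": "debit", "amount": amount,
                   "balance_after": accounts[sender]})
    ledger.append({"transfer_id": transfer_id, "account_id": receiver,
                   "entry_type": "credit", "amount": amount,
                   "balance_after": accounts[receiver]})
    return "completed"

assert execute_transfer("txn_xyz_789", "acc_sender_001",
                        "acc_receiver_002", 10050) == "completed"
# Double-entry invariant: total debits always equal total credits.
debits = sum(e["amount"] for e in ledger if e["entry_type"] == "debit")
credits = sum(e["amount"] for e in ledger if e["entry_type"] == "credit")
assert debits == credits
```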
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-double-entry-bookkeeping--acid-guarantees&#34;&gt;Deep Dive 1: Double-Entry Bookkeeping &amp;amp; ACID Guarantees&lt;/h3&gt;
&lt;p&gt;Every financial movement must create exactly two ledger entries: a debit from one account and a credit to another. The sum of all debits must always equal the sum of all credits (the fundamental accounting equation).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Amazon Cart Management Service</title>
      <link>https://chiraghasija.cc/designs/cart-service/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/cart-service/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Add, remove, and update quantity of items in the cart (CRUD operations with real-time inventory validation)&lt;/li&gt;
&lt;li&gt;Support both guest carts (cookie/session-based) and authenticated user carts, with automatic cart merging when a guest logs in&lt;/li&gt;
&lt;li&gt;Persist carts durably across sessions, devices, and app restarts — a user who adds an item on mobile must see it on desktop&lt;/li&gt;
&lt;li&gt;Handle price and availability changes while items sit in the cart — show current price, flag out-of-stock items, and surface price-change notifications&lt;/li&gt;
&lt;li&gt;Transition the cart atomically to checkout — reserve inventory, lock prices, and create an order in a single coordinated operation&lt;/li&gt;
&lt;/ol&gt;
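Requirement 5's inventory reservation hinges on a conditional decrement, so two concurrent checkouts cannot both claim the last unit. A minimal sketch of that check (illustrative names; production performs this as one atomic database or Redis operation):

```python
# Conditional decrement: reject the reservation instead of
# overselling when stock is insufficient.
inventory = {"item_42": 1}

def reserve(item_id: str, qty: int) -> bool:
    available = inventory.get(item_id, 0)
    if available < qty:
        return False  # second checkout sees no stock
    inventory[item_id] = available - qty
    return True

assert reserve("item_42", 1) is True    # first checkout wins
assert reserve("item_42", 1) is False   # last unit already claimed
```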
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% uptime. Cart unavailability directly equals lost revenue. Even 1 minute of cart downtime during Prime Day can cost millions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Add-to-cart &amp;lt; 100ms p99. Cart read (render page) &amp;lt; 50ms p99. These must hold during 10x traffic spikes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for cart reads (stale by at most 1-2 seconds). Strong consistency for checkout transition (inventory decrement must be atomic).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 300M+ active users, 500M+ carts (including guest and abandoned), ~1,750 add-to-cart operations/sec average, ~17,500/sec peak during Prime Day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero cart data loss. A cart is a purchase intent — losing a cart with 15 items a user spent 30 minutes curating is unacceptable.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active users: 300M monthly, ~50M daily&lt;/li&gt;
&lt;li&gt;Add/remove/update operations: 50M DAU × 3 cart ops/day = &lt;strong&gt;150M writes/day = ~1,750/sec average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak (Prime Day, 10x): &lt;strong&gt;~17,500 writes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Cart reads (page loads, mini-cart renders): 5× writes = &lt;strong&gt;~8,750/sec average, 87,500/sec peak&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Checkout transitions: 10M orders/day = &lt;strong&gt;~115/sec average, 1,150/sec peak&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Cart record: user_id + metadata = ~200 bytes&lt;/li&gt;
&lt;li&gt;Cart item: item_id + seller_id + quantity + price_at_add + timestamp = ~150 bytes&lt;/li&gt;
&lt;li&gt;Average cart: 5 items → 200 + (5 × 150) = &lt;strong&gt;~950 bytes per cart&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;500M carts × 950 bytes = &lt;strong&gt;~475 GB total cart data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Including indexes, replicas, and overhead: &lt;strong&gt;~2 TB provisioned storage&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;inventory-checks&#34;&gt;Inventory Checks&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Every add-to-cart triggers an inventory check: 1,750/sec average&lt;/li&gt;
&lt;li&gt;Every cart page load validates inventory for all items: 8,750/sec × 5 items = &lt;strong&gt;43,750 inventory lookups/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;~440K inventory lookups/sec&lt;/strong&gt; — must be served from cache, not direct DB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;guest-cart-volume&#34;&gt;Guest Cart Volume&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;~40% of carts are guest carts (no login)&lt;/li&gt;
&lt;li&gt;200M guest carts with average TTL of 30 days&lt;/li&gt;
&lt;li&gt;Cart merge events on login: ~5M/day (guest converts to authenticated)&lt;/li&gt;
&lt;/ul&gt;
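The guest-to-authenticated merge on login can be sketched as follows. Summing quantities for duplicate items is one illustrative policy; teams also choose max() or prompt the user:

```python
# Merge a guest cart into the authenticated user's cart so that
# no curated item is silently lost on login.
def merge_carts(user_cart: dict[str, int],
                guest_cart: dict[str, int]) -> dict[str, int]:
    merged = dict(user_cart)
    for item_id, qty in guest_cart.items():
        merged[item_id] = merged.get(item_id, 0) + qty
    return merged

user = {"book_1": 1}
guest = {"book_1": 1, "usb_c_cable": 2}
assert merge_carts(user, guest) == {"book_1": 2, "usb_c_cable": 2}
```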
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;hot-path, high-availability storage problem&lt;/strong&gt; with an inventory coordination challenge. The cart is on the critical path of every purchase. The hard problems are: (1) keeping cart data durable without sacrificing read latency, (2) handling inventory races at checkout, and (3) merging guest and authenticated carts without losing items.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an A/B Testing Platform</title>
      <link>https://chiraghasija.cc/designs/ab-testing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/ab-testing/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Create and manage experiments with multiple variants (A/B/n) and traffic allocation percentages&lt;/li&gt;
&lt;li&gt;Assign users deterministically to experiment variants (same user always sees the same variant)&lt;/li&gt;
&lt;li&gt;Support mutual exclusion (user in experiment X cannot be in experiment Y) and layering (independent experiments can run simultaneously)&lt;/li&gt;
&lt;li&gt;Collect and compute metrics (conversion rate, revenue, engagement) with statistical significance testing&lt;/li&gt;
&lt;li&gt;Provide a dashboard showing experiment results, confidence intervals, sample sizes, and guardrail metric alerts&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for the assignment service — if it goes down, every feature behind a flag breaks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 5ms for variant assignment — it&amp;rsquo;s in the critical path of page renders and API calls&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Assignment must be deterministic and sticky. A user must always see the same variant for the lifetime of an experiment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M daily active users, 50K+ concurrent experiments, 100B+ assignment checks/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; No event loss for exposure and metric events — statistical validity depends on complete data&lt;/li&gt;
&lt;/ul&gt;
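Deterministic, sticky assignment is typically a pure hash of the (experiment_id, user_id) pair, which is why it needs no database lookup and easily meets the latency budget. A sketch (bucket count and variant split are illustrative):

```python
import hashlib

# Hash (experiment_id, user_id) into one of 10,000 buckets, then
# map contiguous bucket ranges to variants. Same inputs always
# produce the same bucket, so assignment is sticky by construction.
def assign(experiment_id: str, user_id: str,
           variants: list[tuple[str, float]]) -> str:
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    cumulative = 0.0
    for name, fraction in variants:
        cumulative += fraction * 10_000
        if bucket < cumulative:
            return name
    return variants[-1][0]

split = [("control", 0.5), ("treatment", 0.5)]
first = assign("exp_checkout_button", "user_123", split)
# Determinism: repeated checks never flip a user's variant.
assert first == assign("exp_checkout_button", "user_123", split)
```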
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M DAU, average 20 page views/day = 10B page views/day&lt;/li&gt;
&lt;li&gt;Each page view checks ~10 experiments = &lt;strong&gt;100B assignment checks/day ≈ 1.15M checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Exposure logging: ~10B exposure events/day (one per experiment per user per session)&lt;/li&gt;
&lt;li&gt;Metric events: ~50B events/day (clicks, conversions, revenue, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Experiment configs: 50K experiments × 5KB = 250MB — trivially fits in memory&lt;/li&gt;
&lt;li&gt;Exposure events: 10B/day × 100 bytes = 1TB/day&lt;/li&gt;
&lt;li&gt;Metric events: 50B/day × 150 bytes = 7.5TB/day&lt;/li&gt;
&lt;li&gt;Total raw event storage: &lt;strong&gt;~8.5TB/day, ~3PB/year&lt;/strong&gt; → needs columnar storage (data warehouse)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;compute&#34;&gt;Compute&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Statistical analysis: for each experiment, aggregate metrics across millions of users. A single experiment with 10M users and 5 metrics requires scanning ~50M rows. Running 50K experiments → batch compute pipeline (Spark/Presto), not real-time.&lt;/li&gt;
&lt;li&gt;Pre-aggregation per experiment per day reduces warehouse scans dramatically.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;Assignment is a &lt;strong&gt;latency-critical, read-heavy&lt;/strong&gt; problem (solved by hashing, no database needed). Analysis is a &lt;strong&gt;compute-heavy, batch&lt;/strong&gt; problem (solved by a data pipeline). These are two very different subsystems.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an Advertising System (Google Ads)</title>
      <link>https://chiraghasija.cc/designs/ads-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/ads-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Advertisers create campaigns with targeting criteria (demographics, interests, keywords, geography, device type), set budgets (daily/total), and bid on ad placements (CPC, CPM, CPA)&lt;/li&gt;
&lt;li&gt;When a user visits a page, the ad serving system runs a real-time auction among eligible ads, selects the winner(s), and renders the ad within 100ms&lt;/li&gt;
&lt;li&gt;Track ad events (impressions, clicks, conversions) with accurate attribution and provide real-time reporting to advertisers&lt;/li&gt;
&lt;li&gt;Implement budget pacing — spend the daily budget evenly throughout the day rather than exhausting it in the first hour&lt;/li&gt;
&lt;li&gt;Detect and filter fraudulent clicks (bot clicks, click farms, competitor clicking) before charging advertisers&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — every failed ad request is lost revenue. At ~$55M daily revenue (see Revenue Math), each minute of downtime costs ~$40K.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 100ms end-to-end from ad request to ad response. The ad auction must complete within 50ms to leave time for network and rendering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Financial data (budgets, billing) must be strongly consistent. Ad serving can tolerate eventual consistency for targeting data (a few minutes of propagation is fine).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M ad requests/sec globally. 10M active ad campaigns. 1 billion ad impressions/day. 50M click events/day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Every click and impression must be logged durably — this is billing data. Zero data loss for financial events.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;ad-serving-traffic&#34;&gt;Ad Serving Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M ad requests/sec&lt;/li&gt;
&lt;li&gt;Each request: evaluate ~1000 candidate ads → filter to ~50 eligible → score and rank → select top 3-5&lt;/li&gt;
&lt;li&gt;Each ad request: ~2 KB (user context, page context, device info)&lt;/li&gt;
&lt;li&gt;Response: ~5 KB (ad creative URLs, tracking pixels, metadata)&lt;/li&gt;
&lt;li&gt;Bandwidth: 10M × 7 KB = &lt;strong&gt;70 GB/sec&lt;/strong&gt; — distributed across 50+ edge PoPs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;auction-computation&#34;&gt;Auction Computation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M auctions/sec&lt;/li&gt;
&lt;li&gt;Each auction evaluates ~50 eligible ads with ML click prediction&lt;/li&gt;
&lt;li&gt;ML inference: ~0.1ms per ad × 50 ads = &lt;strong&gt;5ms&lt;/strong&gt; per auction (batched inference)&lt;/li&gt;
&lt;li&gt;Total ML compute: 10M × 50 = &lt;strong&gt;500M inferences/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Requires: ~5,000 GPU/TPU instances for ML serving&lt;/li&gt;
&lt;/ul&gt;
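The per-auction scoring and second-price pricing estimated here (and detailed in Section 5) can be sketched as follows. The candidate list and pCTR values are made up for illustration; in production pCTR comes from the click-prediction model:

```python
# Rank candidates by eCPM = bid × pCTR × 1000, then charge the
# winner just enough to beat the runner-up (second-price auction).
# Bids are in cents per click.
candidates = [
    {"ad_id": "ad_001", "bid": 200, "pctr": 0.030},
    {"ad_id": "ad_002", "bid": 150, "pctr": 0.050},
    {"ad_id": "ad_003", "bid": 300, "pctr": 0.010},
]
for c in candidates:
    c["ecpm"] = c["bid"] * c["pctr"] * 1000

ranked = sorted(candidates, key=lambda c: c["ecpm"], reverse=True)
winner, runner_up = ranked[0], ranked[1]
# Convert the runner-up's eCPM back into the winner's CPC charge:
# the minimum per-click price that still beats second place.
price_to_charge = runner_up["ecpm"] / (winner["pctr"] * 1000)

assert winner["ad_id"] == "ad_002"       # 150 × 0.05 × 1000 = 7500 eCPM
assert price_to_charge <= winner["bid"]  # never pay more than your own bid
```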
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Event logging: 1B impressions/day × 500 bytes = &lt;strong&gt;500 GB/day&lt;/strong&gt; impressions&lt;/li&gt;
&lt;li&gt;50M clicks/day × 200 bytes = &lt;strong&gt;10 GB/day&lt;/strong&gt; clicks&lt;/li&gt;
&lt;li&gt;Total event storage: &lt;strong&gt;~200 TB/year&lt;/strong&gt; (retained for 2 years)&lt;/li&gt;
&lt;li&gt;Ad campaigns: 10M campaigns × 5 KB = &lt;strong&gt;50 GB&lt;/strong&gt; — fits in memory&lt;/li&gt;
&lt;li&gt;User profiles (targeting data): 2B users × 2 KB = &lt;strong&gt;4 TB&lt;/strong&gt; — distributed cache&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;revenue-math&#34;&gt;Revenue Math&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average CPM (cost per 1000 impressions): $5&lt;/li&gt;
&lt;li&gt;1B impressions/day × $5/1000 = &lt;strong&gt;$5M/day&lt;/strong&gt; from display&lt;/li&gt;
&lt;li&gt;Average CPC (cost per click): $1&lt;/li&gt;
&lt;li&gt;50M clicks/day × $1 = &lt;strong&gt;$50M/day&lt;/strong&gt; from search/click ads&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;~$55M/day&lt;/strong&gt; ≈ &lt;strong&gt;$20B/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;advertiser-facing-apis&#34;&gt;Advertiser-Facing APIs&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create a campaign
POST /v1/campaigns
  Headers: Authorization: Bearer &amp;lt;advertiser_token&amp;gt;
  Body: {
    &amp;#34;name&amp;#34;: &amp;#34;Summer Sale 2026&amp;#34;,
    &amp;#34;objective&amp;#34;: &amp;#34;clicks&amp;#34;,              // impressions | clicks | conversions
    &amp;#34;daily_budget&amp;#34;: 500000,             // $5,000.00 in cents
    &amp;#34;total_budget&amp;#34;: 15000000,           // $150,000.00
    &amp;#34;bid_strategy&amp;#34;: &amp;#34;manual_cpc&amp;#34;,       // manual_cpc | target_cpa | maximize_clicks
    &amp;#34;max_bid&amp;#34;: 200,                     // $2.00 max CPC
    &amp;#34;targeting&amp;#34;: {
      &amp;#34;geo&amp;#34;: [&amp;#34;US&amp;#34;, &amp;#34;CA&amp;#34;],
      &amp;#34;age_range&amp;#34;: [25, 54],
      &amp;#34;interests&amp;#34;: [&amp;#34;technology&amp;#34;, &amp;#34;gaming&amp;#34;],
      &amp;#34;keywords&amp;#34;: [&amp;#34;gaming laptop&amp;#34;, &amp;#34;best GPU 2026&amp;#34;],
      &amp;#34;devices&amp;#34;: [&amp;#34;desktop&amp;#34;, &amp;#34;mobile&amp;#34;],
      &amp;#34;time_of_day&amp;#34;: { &amp;#34;start&amp;#34;: 8, &amp;#34;end&amp;#34;: 22, &amp;#34;timezone&amp;#34;: &amp;#34;America/New_York&amp;#34; }
    },
    &amp;#34;creatives&amp;#34;: [
      { &amp;#34;type&amp;#34;: &amp;#34;banner&amp;#34;, &amp;#34;size&amp;#34;: &amp;#34;300x250&amp;#34;, &amp;#34;image_url&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;landing_url&amp;#34;: &amp;#34;...&amp;#34; },
      { &amp;#34;type&amp;#34;: &amp;#34;text&amp;#34;, &amp;#34;headline&amp;#34;: &amp;#34;50% Off Gaming Laptops&amp;#34;, &amp;#34;description&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;landing_url&amp;#34;: &amp;#34;...&amp;#34; }
    ],
    &amp;#34;start_date&amp;#34;: &amp;#34;2026-06-01&amp;#34;,
    &amp;#34;end_date&amp;#34;: &amp;#34;2026-08-31&amp;#34;
  }
  Response 201: { &amp;#34;campaign_id&amp;#34;: &amp;#34;camp_xyz&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;pending_review&amp;#34; }

// Get campaign performance
GET /v1/campaigns/{campaign_id}/metrics?date_range=last_7d&amp;amp;granularity=daily
  Response 200: {
    &amp;#34;campaign_id&amp;#34;: &amp;#34;camp_xyz&amp;#34;,
    &amp;#34;metrics&amp;#34;: {
      &amp;#34;impressions&amp;#34;: 1250000,
      &amp;#34;clicks&amp;#34;: 37500,
      &amp;#34;ctr&amp;#34;: 0.03,                      // 3% click-through rate
      &amp;#34;conversions&amp;#34;: 1125,
      &amp;#34;cpa&amp;#34;: 4444,                       // $44.44 cost per acquisition
      &amp;#34;spend&amp;#34;: 5000000,                 // $50,000.00
      &amp;#34;remaining_budget&amp;#34;: 10000000
    },
    &amp;#34;daily_breakdown&amp;#34;: [...]
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ad-serving-api-internal-called-by-publisher-pages&#34;&gt;Ad Serving API (internal, called by publisher pages)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Request ads for a page
GET /v1/ads?slots=3&amp;amp;format=banner_300x250,banner_728x90
    &amp;amp;page_url=example.com/tech/reviews
    &amp;amp;user_id=anon_abc                   // hashed, for targeting
    &amp;amp;device=mobile&amp;amp;geo=US-CA
  Response 200 (&amp;lt; 100ms): {
    &amp;#34;ads&amp;#34;: [
      {
        &amp;#34;ad_id&amp;#34;: &amp;#34;ad_001&amp;#34;,
        &amp;#34;creative_url&amp;#34;: &amp;#34;https://cdn.ads.com/banner_001.jpg&amp;#34;,
        &amp;#34;landing_url&amp;#34;: &amp;#34;https://advertiser.com/sale?utm_source=...&amp;#34;,
        &amp;#34;impression_url&amp;#34;: &amp;#34;https://track.ads.com/imp?id=ad_001&amp;amp;...&amp;#34;,
        &amp;#34;click_url&amp;#34;: &amp;#34;https://track.ads.com/click?id=ad_001&amp;amp;...&amp;#34;,
        &amp;#34;bid_price&amp;#34;: 150                // $1.50 CPM, for publisher revenue share
      }
    ],
    &amp;#34;auction_id&amp;#34;: &amp;#34;auc_12345&amp;#34;
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;event-tracking-api&#34;&gt;Event Tracking API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Impression beacon (fired when ad is rendered)
GET /v1/track/impression?ad_id=ad_001&amp;amp;auction_id=auc_12345&amp;amp;ts=1708632000
  Response 204 (no content, fire-and-forget)

// Click redirect (user clicks ad)
GET /v1/track/click?ad_id=ad_001&amp;amp;auction_id=auc_12345
  Response 302: Redirect to landing_url
  (logs click event before redirect)

// Conversion postback (from advertiser&amp;#39;s server)
POST /v1/track/conversion
  Body: { &amp;#34;campaign_id&amp;#34;: &amp;#34;camp_xyz&amp;#34;, &amp;#34;conversion_id&amp;#34;: &amp;#34;conv_001&amp;#34;, &amp;#34;value&amp;#34;: 9999 }
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;campaigns-postgresql--sharded-by-advertiser_id&#34;&gt;Campaigns (PostgreSQL — sharded by advertiser_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: campaigns
  campaign_id      (PK) | varchar(20)
  advertiser_id    (FK) | varchar(20)
  name                   | varchar(200)
  objective              | enum(&amp;#39;impressions&amp;#39;,&amp;#39;clicks&amp;#39;,&amp;#39;conversions&amp;#39;)
  daily_budget           | int           -- cents
  total_budget           | int
  spent_total            | int           -- running total spend
  bid_strategy           | varchar(30)
  max_bid                | int
  targeting              | jsonb
  status                 | enum(&amp;#39;draft&amp;#39;,&amp;#39;pending_review&amp;#39;,&amp;#39;active&amp;#39;,&amp;#39;paused&amp;#39;,&amp;#39;completed&amp;#39;,&amp;#39;rejected&amp;#39;)
  start_date             | date
  end_date               | date
  created_at             | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ad-creatives-postgresql&#34;&gt;Ad Creatives (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: creatives
  creative_id      (PK) | varchar(20)
  campaign_id      (FK) | varchar(20)
  type                   | enum(&amp;#39;banner&amp;#39;,&amp;#39;text&amp;#39;,&amp;#39;video&amp;#39;,&amp;#39;native&amp;#39;)
  size                   | varchar(20)
  content                | jsonb         -- image_url, headline, description, etc.
  landing_url            | varchar(500)
  status                 | enum(&amp;#39;pending_review&amp;#39;,&amp;#39;approved&amp;#39;,&amp;#39;rejected&amp;#39;)
  quality_score          | float         -- ML-predicted relevance/quality (0-1)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;ad-index-in-memory-rebuilt-every-few-minutes&#34;&gt;Ad Index (in-memory, rebuilt every few minutes)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Inverted index for fast candidate selection during auction
// Loaded into each ad server&amp;#39;s memory

keyword_index: Map&amp;lt;keyword, List&amp;lt;campaign_id&amp;gt;&amp;gt;
geo_index: Map&amp;lt;geo_code, List&amp;lt;campaign_id&amp;gt;&amp;gt;
interest_index: Map&amp;lt;interest, List&amp;lt;campaign_id&amp;gt;&amp;gt;
device_index: Map&amp;lt;device_type, List&amp;lt;campaign_id&amp;gt;&amp;gt;

// Each entry: campaign_id + max_bid + remaining_budget + quality_score
// Total size: ~10M campaigns × 100 bytes = 1 GB per ad server (fits in memory)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;event-log-kafka--clickhouse&#34;&gt;Event Log (Kafka + ClickHouse)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Kafka topics: ad_impressions, ad_clicks, ad_conversions
// Retained 7 days in Kafka

// ClickHouse (analytical queries, 2-year retention)
Table: events
  event_type       | Enum(&amp;#39;impression&amp;#39;,&amp;#39;click&amp;#39;,&amp;#39;conversion&amp;#39;)
  event_id         | String
  ad_id            | String
  campaign_id      | String
  advertiser_id    | String
  user_id          | String (hashed)
  auction_id       | String
  timestamp        | DateTime
  bid_price        | UInt32
  geo              | String
  device           | String
  page_url         | String
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;budget-tracking-redis&#34;&gt;Budget Tracking (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: budget:{campaign_id}:daily:{date}
Type: Hash
Fields:
  spent     | int (cents, incremented on each billable event)
  limit     | int (daily budget)
  pacing    | float (target spend rate per hour)

Key: budget:{campaign_id}:total
Type: String (remaining total budget in cents)
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
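The Redis structures above back a check-and-increment on the serving path. An in-memory sketch of that logic (the real system performs it atomically, e.g. via HINCRBY plus a compare, or a Lua script; the dict here is only a stand-in):

```python
# In-memory stand-in for the Redis daily-budget hash.
daily_budget = {"spent": 0, "limit": 500_000}  # cents, $5,000/day

def try_charge(amount_cents: int) -> bool:
    # Skip the ad rather than overspend the advertiser's budget.
    if daily_budget["spent"] + amount_cents > daily_budget["limit"]:
        return False
    daily_budget["spent"] += amount_cents
    return True

assert try_charge(499_900) is True
assert try_charge(200) is False          # would exceed the daily limit
assert daily_budget["spent"] == 499_900  # the failed charge left no trace
```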
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;ad-serving-pipeline--100ms-total&#34;&gt;Ad Serving Pipeline (&amp;lt; 100ms total)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User visits a web page
  → Publisher&amp;#39;s ad tag (JavaScript) fires ad request
    → CDN/Edge → Ad Server (closest PoP):

      Step 1: Parse Request (&amp;lt; 1ms)
        Extract: user_id, page context, device, geo, ad slot sizes

      Step 2: User Profile Lookup (&amp;lt; 5ms)
        Redis/Memcached: get user interests, demographics, behavior segments
        If cache miss → use contextual signals only (page content, keywords)

      Step 3: Candidate Selection (&amp;lt; 5ms)
        Query in-memory ad index:
          geo_index[US-CA] ∩ device_index[mobile] ∩ interest_index[gaming]
          → ~1000 candidate campaigns
        Filter: active status, within date range, has remaining budget, creative approved
          → ~200 eligible campaigns
        Filter: frequency capping (user hasn&amp;#39;t seen this ad &amp;gt; 3 times today)
          → ~150 final candidates

      Step 4: Bid Calculation + Click Prediction (&amp;lt; 20ms)
        For each candidate (batched ML inference):
          pCTR = click_prediction_model(user_features, ad_features, context_features)
          eCPM = bid × pCTR × 1000        // expected revenue per 1000 impressions
        Sort by eCPM descending

      Step 5: Auction (&amp;lt; 2ms)
        Run generalized second-price auction:
          Winner pays the minimum bid needed to beat the second-place ad
          price_per_click = second_place_eCPM / (winner_pCTR × 1000)
        Select top 3-5 ads for the available slots

      Step 6: Budget Check (&amp;lt; 2ms)
        Redis: INCRBY budget:{campaign_id}:daily:{date} {charge_amount}
        If new_total &amp;gt; daily_budget → DECRBY to roll back, then skip this ad and try the next candidate
        If total_budget exhausted → skip

      Step 7: Response (&amp;lt; 1ms)
        Return ad creatives, tracking URLs, metadata
        Total: ~35ms server-side, ~100ms with network

  → Browser renders ads, fires impression beacons
  → On click: redirect through click tracker → landing page
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;event-processing-pipeline&#34;&gt;Event Processing Pipeline&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Impression/Click Events:
  → Event Collector (edge servers, receive tracking pixels/redirects)
    → Kafka (durable event stream)
      → Stream Processor (Flink):
        1. Fraud Detection: filter invalid clicks (see Deep Dive 3)
        2. Deduplication: same impression/click ID within 5-min window
        3. Attribution: match clicks to impressions, conversions to clicks
        4. Budget Update: INCRBY in Redis for real-time budget tracking
        5. Write to ClickHouse for reporting
      → ClickHouse (analytical queries)
      → Billing Service (aggregate billable events → generate invoices)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Ad Servers (50+ edge PoPs, 100s of instances):&lt;/strong&gt; Handle ad requests. Run the full auction pipeline in-process. Stateless except for cached ad index and user profiles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ad Index Builder:&lt;/strong&gt; Reads campaigns from PostgreSQL, builds in-memory inverted indexes, pushes to ad servers every 2-5 minutes. Ensures ad servers have fresh campaign data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Click Prediction Service (ML):&lt;/strong&gt; Serves the pCTR model. Deployed on GPU instances. Batched inference for throughput. Model retrained daily on click/no-click data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Profile Service (Redis/Memcached):&lt;/strong&gt; Stores user interest segments, demographics, behavioral signals. Updated by a profile pipeline that processes user activity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Event Collectors:&lt;/strong&gt; Edge servers that receive impression beacons and click redirects. Ultra-low latency (&amp;lt; 10ms). Write to Kafka.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream Processor (Flink):&lt;/strong&gt; Real-time event processing: fraud detection, deduplication, attribution, budget updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Budget Pacing Service:&lt;/strong&gt; Calculates target spend rate per campaign per hour. Adjusts bid multipliers to pace spending evenly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fraud Detection Engine:&lt;/strong&gt; ML + rules-based click fraud detection. Filters invalid traffic before billing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reporting Service:&lt;/strong&gt; Reads from ClickHouse. Serves advertiser dashboards with near-real-time metrics (&amp;lt; 5 min delay).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Campaign Management Service:&lt;/strong&gt; CRUD for campaigns, creatives, targeting. Creative review (automated + manual).&lt;/li&gt;
&lt;/ol&gt;
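&lt;p&gt;The eCPM ranking and second-price charge from Steps 4-5 can be sketched as follows (a minimal Python sketch; the dict field names and hard-coded pCTR values are illustrative, and real pCTR comes from the click-prediction model):&lt;/p&gt;

```python
def run_auction(candidates):
    """Rank ads by eCPM, then price each winner with a generalized
    second-price (GSP) rule: pay just enough to beat the next ad."""
    for ad in candidates:
        # expected revenue per 1000 impressions
        ad["ecpm"] = ad["bid"] * ad["pctr"] * 1000
    ranked = sorted(candidates, key=lambda a: a["ecpm"], reverse=True)
    results = []
    for i, ad in enumerate(ranked):
        if i + 1 < len(ranked):
            # minimum per-click bid that still beats the next-ranked ad
            price = ranked[i + 1]["ecpm"] / (ad["pctr"] * 1000)
        else:
            price = ad.get("floor", 0.01)  # reserve price for the last slot
        results.append((ad["ad_id"], round(price, 4)))
    return results

ads = [
    {"ad_id": "a", "bid": 2.00, "pctr": 0.01},  # eCPM 20
    {"ad_id": "b", "bid": 1.00, "pctr": 0.03},  # eCPM 30
    {"ad_id": "c", "bid": 0.50, "pctr": 0.02},  # eCPM 10
]
print(run_auction(ads))  # [('b', 0.6667), ('a', 1.0), ('c', 0.01)]
```

&lt;p&gt;Note that the ad with the lower bid but higher pCTR ranks first, which is exactly the point of quality-weighted auctions.&lt;/p&gt;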
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-the-ad-auction-mechanism&#34;&gt;Deep Dive 1: The Ad Auction Mechanism&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why not a simple highest-bidder-wins auction?&lt;/strong&gt;
If we only ranked ads by bid price, advertisers with deep pockets would always win, regardless of ad quality. Users would see irrelevant, low-quality ads. Click rates would drop. Publishers would earn less long-term.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an API Rate Limiter</title>
      <link>https://chiraghasija.cc/designs/rate-limiter/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/rate-limiter/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Limit the number of API requests a client can make within a time window&lt;/li&gt;
&lt;li&gt;Support multiple rate limiting rules (per user, per IP, per API endpoint, per service)&lt;/li&gt;
&lt;li&gt;Return appropriate response when rate limit is exceeded (429 Too Many Requests)&lt;/li&gt;
&lt;li&gt;Include rate limit headers in every response (remaining quota, reset time)&lt;/li&gt;
&lt;li&gt;Support different limiting algorithms (configurable per rule)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% — if the rate limiter goes down, we either block all traffic (fail-closed) or allow all traffic (fail-open). Both are bad. It must be more available than the services it protects.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 1ms overhead per request — the rate limiter is in the hot path of every API call&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Approximate accuracy is acceptable. If the limit is 100 req/min, allowing 105 is fine. Allowing 1000 is not.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Must handle 10M+ requests/sec across all services&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed:&lt;/strong&gt; Must work correctly across multiple API gateway instances (not just per-node limiting)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M requests/sec across all API endpoints&lt;/li&gt;
&lt;li&gt;Each request → 1 rate limit check (read + conditional write to counter)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;10M rate limit operations/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active rate limit keys: assume 50M unique clients (user_id or IP)&lt;/li&gt;
&lt;li&gt;Each key: ~100 bytes (key + counter + window metadata)&lt;/li&gt;
&lt;li&gt;Total: 50M × 100 bytes = &lt;strong&gt;5GB&lt;/strong&gt; — fits entirely in memory (Redis)&lt;/li&gt;
&lt;/ul&gt;
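&lt;p&gt;As a minimal sketch of one candidate algorithm, a sliding-window-log limiter looks like this (in-process Python for illustration only; the production design keeps this state in Redis, e.g. a ZSET per client, so all gateway instances share one counter):&lt;/p&gt;

```python
import time

class SlidingWindowLimiter:
    """Sliding-window-log rate limiter. Stores a timestamp per request
    and counts only the timestamps inside the trailing window."""

    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window = window_sec
        self.hits = {}  # client_id -> list of request timestamps

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        # drop timestamps that have aged out of the window
        log = [t for t in self.hits.get(client_id, []) if t > now - self.window]
        if len(log) >= self.limit:
            self.hits[client_id] = log
            return False  # over limit: caller returns 429
        log.append(now)
        self.hits[client_id] = log
        return True

limiter = SlidingWindowLimiter(limit=3, window_sec=60)
print([limiter.allow("u1", now=100 + i) for i in range(5)])
# [True, True, True, False, False]
```

&lt;p&gt;The log variant is exact but costs memory per request; token bucket or sliding-window counters trade a little accuracy for O(1) state, which matches the &amp;ldquo;105 is fine, 1000 is not&amp;rdquo; consistency budget above.&lt;/p&gt;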
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;latency-critical, memory-bound&lt;/strong&gt; system. All state lives in Redis. The challenge is distributed correctness (accurate counting across multiple gateway instances) at extreme throughput with &amp;lt; 1ms latency.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an Authentication and Authorization System</title>
      <link>https://chiraghasija.cc/designs/auth-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/auth-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; Support email/password login, OAuth 2.0/OIDC (Google, GitHub, etc.), and multi-factor authentication (TOTP, SMS, WebAuthn/passkeys)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session Management:&lt;/strong&gt; Issue, validate, refresh, and revoke sessions. Support &amp;ldquo;remember me&amp;rdquo; (long-lived) and &amp;ldquo;sign out everywhere&amp;rdquo; (revoke all sessions).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authorization:&lt;/strong&gt; Role-Based Access Control (RBAC) with hierarchical roles and Attribute-Based Access Control (ABAC) for fine-grained policies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single Sign-On (SSO):&lt;/strong&gt; Act as an identity provider (IdP) — users authenticate once and access multiple applications without re-authenticating&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Account security:&lt;/strong&gt; Rate limit login attempts, detect credential stuffing, support password reset flows, and enforce password policies&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% — if auth is down, no user can log into any application. This is the single most critical shared service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Login &amp;lt; 500ms (includes password hashing). Token validation &amp;lt; 5ms (must be in the hot path of every API call). Authorization check &amp;lt; 10ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security:&lt;/strong&gt; Passwords hashed with Argon2id (memory-hard). Tokens encrypted in transit (TLS 1.3) and at rest. No plaintext secrets in logs. PII encrypted at rest.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M registered users. 50M daily logins. 10B token validations/day (every API call). 1B authorization checks/day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; GDPR (right to delete, data portability), SOC 2, support for data residency requirements.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;authentication-traffic&#34;&gt;Authentication Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50M logins/day = &lt;strong&gt;580 logins/sec average&lt;/strong&gt;, 5K/sec peak (morning login surge)&lt;/li&gt;
&lt;li&gt;Each login: password hash verification (Argon2id: ~250ms CPU per attempt) + session creation&lt;/li&gt;
&lt;li&gt;CPU: 580 logins/sec × 250ms = &lt;strong&gt;145 CPU-seconds/sec&lt;/strong&gt; → need ~150 CPU cores just for password hashing&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;token-validation&#34;&gt;Token Validation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10B validations/day = &lt;strong&gt;115K validations/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;JWT validation: verify signature + check expiry + decode claims = &lt;strong&gt;&amp;lt; 0.1ms&lt;/strong&gt; (no network call)&lt;/li&gt;
&lt;li&gt;With opaque tokens: Redis lookup = &lt;strong&gt;~1ms&lt;/strong&gt; per validation&lt;/li&gt;
&lt;/ul&gt;
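&lt;p&gt;The &amp;ldquo;no network call&amp;rdquo; property of JWT validation can be shown with a stdlib-only HS256 sketch (a toy, not a production JWT library; real deployments would use an asymmetric algorithm with rotating keys and a vetted library):&lt;/p&gt;

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

def validate_jwt(token: str, secret: bytes, now=None):
    """Local validation: recompute the HMAC, compare, check expiry.
    No network round trip, which is what keeps the 115K/sec path cheap."""
    header, payload, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        return None  # signature mismatch: tampered token or wrong key
    pad = "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload + pad))
    now = time.time() if now is None else now
    if claims.get("exp", 0) <= now:
        return None  # expired
    return claims

secret = b"demo-secret"
tok = sign_jwt({"sub": "user42", "exp": 2_000_000_000}, secret)
print(validate_jwt(tok, secret)["sub"])  # user42
print(validate_jwt(tok, b"wrong-key"))   # None
```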
&lt;h3 id=&#34;authorization&#34;&gt;Authorization&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B checks/day = &lt;strong&gt;11.5K checks/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each check: look up user&amp;rsquo;s roles/permissions, evaluate policy rules&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;User accounts: 500M × 2KB = &lt;strong&gt;1TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Sessions: 200M active sessions × 500 bytes = &lt;strong&gt;100GB&lt;/strong&gt; (fits in Redis)&lt;/li&gt;
&lt;li&gt;Roles/permissions: 1000 roles × 100 permissions = 100K entries — trivial&lt;/li&gt;
&lt;li&gt;Audit logs: 50M logins/day × 500 bytes = 25GB/day, &lt;strong&gt;9TB/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;Token validation is the hottest path (115K/sec). It MUST be local (no network call) → JWT with local signature verification. Password hashing is CPU-intensive → dedicated worker pool with rate limiting. Authorization checks need low latency → cache policies locally, evaluate in-process.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an ETA Estimation System</title>
      <link>https://chiraghasija.cc/designs/eta-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/eta-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Given an origin and destination, compute the estimated time of arrival (ETA) accounting for current traffic conditions&lt;/li&gt;
&lt;li&gt;Support multiple routing options (fastest, shortest, avoid tolls, avoid highways) and return ETAs for each&lt;/li&gt;
&lt;li&gt;Incorporate real-time traffic data (GPS probe data from active drivers, incident reports, road closures) to adjust ETAs continuously&lt;/li&gt;
&lt;li&gt;Provide ETA updates during an active trip (re-estimate as conditions change or the driver deviates from the route)&lt;/li&gt;
&lt;li&gt;Return confidence intervals with ETAs (e.g., &amp;ldquo;18-24 minutes&amp;rdquo; not just &amp;ldquo;21 minutes&amp;rdquo;) to communicate uncertainty&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — ETA is shown on every ride request, every search, every navigation step. It&amp;rsquo;s the most-queried service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; ETA computation &amp;lt; 200ms for a single origin-destination pair. Batch ETA (nearby drivers to rider) &amp;lt; 500ms for 50 candidates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accuracy:&lt;/strong&gt; Mean Absolute Error (MAE) &amp;lt; 15% of actual trip time. P90 error &amp;lt; 25%. Accuracy matters more than precision.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500K ETA requests/sec at peak (every map interaction, every ride match, every navigation step). Road network: 50M road segments globally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Traffic data reflected in ETAs within 2 minutes of observation. Stale traffic = wrong ETAs = bad routing.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K ETA requests/sec at peak&lt;/li&gt;
&lt;li&gt;Each ETA request requires a graph search over a road network subgraph&lt;/li&gt;
&lt;li&gt;Average route: ~200 road segments explored (varies by distance)&lt;/li&gt;
&lt;li&gt;Total road segments processed: 500K × 200 = &lt;strong&gt;100M segment lookups/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;road-network-size&#34;&gt;Road Network Size&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Global road network: ~50M road segments, ~40M intersections&lt;/li&gt;
&lt;li&gt;Graph representation: each segment = (from_node, to_node, distance, base_travel_time, current_speed)&lt;/li&gt;
&lt;li&gt;Storage: 50M segments × 100 bytes = &lt;strong&gt;5 GB&lt;/strong&gt; — fits in memory on a single server&lt;/li&gt;
&lt;li&gt;With precomputed contraction hierarchy: add ~10 GB for shortcut edges → &lt;strong&gt;15 GB total&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;traffic-data&#34;&gt;Traffic Data&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GPS probes from active drivers: 5M drivers × 1 update/4 seconds = &lt;strong&gt;1.25M probe points/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each probe: (lat, lng, speed, heading, timestamp) = ~40 bytes&lt;/li&gt;
&lt;li&gt;Probe ingestion bandwidth: 1.25M × 40B = &lt;strong&gt;50 MB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Map-matched probes per segment per minute: varies (highways: hundreds, residential: 0-5)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;ml-model-inference&#34;&gt;ML Model Inference&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;If using ML for ETA prediction (not just graph routing):
&lt;ul&gt;
&lt;li&gt;Feature extraction: ~50 features per route (distance, segment speeds, time of day, weather, etc.)&lt;/li&gt;
&lt;li&gt;Model inference: ~5ms per prediction (on GPU) or ~20ms (on CPU)&lt;/li&gt;
&lt;li&gt;At 500K requests/sec × 5ms per inference: need 2500 GPU-seconds/sec → &lt;strong&gt;~250 GPU instances&lt;/strong&gt; (assuming ~10 inferences batched concurrently per GPU)&lt;/li&gt;
&lt;li&gt;Or: precompute and cache ETAs for common routes; ML inference only for cache misses&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
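&lt;p&gt;The core idea, a shortest path over segments weighted by current rather than free-flow travel time, can be sketched with plain Dijkstra (toy Python; the node names and speeds are made up, and production would layer Contraction Hierarchies on top to meet the 200ms SLA):&lt;/p&gt;

```python
import heapq

def eta_seconds(graph, origin, dest):
    """Dijkstra where each edge weight is length / current_speed.
    Refreshing current_speed from probe data updates ETAs without
    rebuilding the graph."""
    dist = {origin: 0.0}
    heap = [(0.0, origin)]
    while heap:
        t, node = heapq.heappop(heap)
        if node == dest:
            return t
        if t > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, length_m, speed_mps in graph.get(node, []):
            nt = t + length_m / speed_mps  # live speed, not free-flow
            if nt < dist.get(nxt, float("inf")):
                dist[nxt] = nt
                heapq.heappush(heap, (nt, nxt))
    return None  # unreachable

# toy network: node -> [(to_node, length_m, current_speed_m_per_s)]
graph = {
    "A": [("B", 1000, 10), ("C", 500, 5)],
    "B": [("D", 1000, 10)],
    "C": [("D", 500, 2)],  # congested segment
}
print(eta_seconds(graph, "A", "D"))  # 200.0, via A-B-D
```

&lt;p&gt;With free-flow speeds the A-C-D path might win; the live weight on the congested C-D segment is what steers the route (and the ETA) around traffic.&lt;/p&gt;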
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;graph search problem enhanced by real-time data and ML.&lt;/strong&gt; Pure Dijkstra on 50M segments is too slow for 200ms SLA. We need hierarchical precomputation (Contraction Hierarchies or similar) to reduce search space by 100-1000×, combined with real-time traffic edge weights to keep ETAs accurate.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an External Sorting System</title>
      <link>https://chiraghasija.cc/designs/external-sort/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/external-sort/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Sort a dataset that is significantly larger than available RAM (e.g., 10TB of data with 64GB RAM)&lt;/li&gt;
&lt;li&gt;Support configurable sort keys (single column, composite keys, ascending/descending)&lt;/li&gt;
&lt;li&gt;Produce a single sorted output file (or set of sorted partitions for distributed consumers)&lt;/li&gt;
&lt;li&gt;Support pluggable input/output formats (CSV, Parquet, binary records)&lt;/li&gt;
&lt;li&gt;Provide progress reporting and the ability to resume after failures (checkpointing)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; Batch system — not always-on, but must complete within an SLA (e.g., 10TB sorted within 4 hours)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Optimize for total wall-clock time, not per-record latency. I/O throughput is the bottleneck.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Output must be perfectly sorted. No approximate or lossy sorting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Handle datasets from 1GB to 100TB. Single-machine for up to ~5TB, distributed (MapReduce-style) beyond that.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Intermediate state (sorted chunks) persisted to disk. If the process crashes, restart from the last completed chunk, not from scratch.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;single-machine-scenario&#34;&gt;Single Machine Scenario&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Dataset: 1TB, RAM: 64GB (usable for sort: ~50GB after OS and buffers)&lt;/li&gt;
&lt;li&gt;Record size: 100 bytes average, sort key: 20 bytes&lt;/li&gt;
&lt;li&gt;Total records: 1TB / 100B = &lt;strong&gt;10 billion records&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Number of sorted chunks: 1TB / 50GB = &lt;strong&gt;20 chunks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each chunk: 50GB, ~500M records, sorted in-memory using quicksort/timsort&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;io-analysis&#34;&gt;I/O Analysis&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Disk sequential read/write: 500 MB/s (NVMe SSD) or 200 MB/s (HDD)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 (chunk creation):&lt;/strong&gt; Read 1TB + write 1TB of sorted chunks = 2TB I/O
&lt;ul&gt;
&lt;li&gt;SSD: 2TB / 500 MB/s = ~67 minutes&lt;/li&gt;
&lt;li&gt;HDD: 2TB / 200 MB/s = ~167 minutes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 (K-way merge):&lt;/strong&gt; Read 1TB of sorted chunks + write 1TB final output = 2TB I/O
&lt;ul&gt;
&lt;li&gt;Same time as Phase 1&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total:&lt;/strong&gt; ~2.2 hours SSD, ~5.5 hours HDD (I/O dominated)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;distributed-scenario-mapreduce&#34;&gt;Distributed Scenario (MapReduce)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Dataset: 100TB across HDFS&lt;/li&gt;
&lt;li&gt;1000 mappers, each sorts 100GB locally&lt;/li&gt;
&lt;li&gt;Shuffle phase: each reducer pulls its key range from all mappers&lt;/li&gt;
&lt;li&gt;100 reducers, each merge-sorts 1TB from 1000 sorted inputs&lt;/li&gt;
&lt;li&gt;Network: each reducer pulls 1TB; at an effective ~1Gbps of sustained pull throughput per reducer (a 10Gbps NIC shared under shuffle contention) that is ~8,000 sec → ~2.2 hour shuffle with pipelining&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;External sort is &lt;strong&gt;I/O bound&lt;/strong&gt;, not CPU bound. Every optimization must target reducing I/O passes, maximizing sequential I/O, and minimizing random access. The CPU cost of comparisons is negligible compared to disk read/write time.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an IoC Container (Dependency Injection Framework)</title>
      <link>https://chiraghasija.cc/designs/ioc-container/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/ioc-container/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Register services with their implementations (interface → concrete class mapping) using explicit registration or auto-discovery (annotations/attributes)&lt;/li&gt;
&lt;li&gt;Support three lifecycle scopes: &lt;strong&gt;Singleton&lt;/strong&gt; (one instance per container), &lt;strong&gt;Transient&lt;/strong&gt; (new instance per request), and &lt;strong&gt;Scoped&lt;/strong&gt; (one instance per scope, e.g., per HTTP request)&lt;/li&gt;
&lt;li&gt;Automatically resolve dependency graphs — if Service A depends on B and C, and B depends on D, resolve the full chain recursively&lt;/li&gt;
&lt;li&gt;Detect circular dependencies at registration time (not at resolve time) and provide clear error messages showing the cycle&lt;/li&gt;
&lt;li&gt;Support named/keyed registrations, factory functions, lazy initialization, and optional dependencies&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Performance:&lt;/strong&gt; Singleton resolution &amp;lt; 100ns (cached lookup). Transient resolution &amp;lt; 1μs for a 5-deep dependency chain. The container is called millions of times — it must be near-zero overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory:&lt;/strong&gt; Container metadata (registrations, dependency graph) &amp;lt; 10MB for 10K registered services. No memory leaks from scoped instances.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thread Safety:&lt;/strong&gt; All resolve operations must be thread-safe. Singleton creation must be exactly-once (no double initialization).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developer Experience:&lt;/strong&gt; Clear error messages. Fail-fast on misconfiguration. Support for debugging tools (dependency graph visualization, unused registration detection).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensibility:&lt;/strong&gt; Plugin architecture for interceptors (AOP), decorators, and custom lifetime managers.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;scale-typical-large-application&#34;&gt;Scale (typical large application)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Registered services:&lt;/strong&gt; 500-5,000 in a large enterprise app&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependency depth:&lt;/strong&gt; Average 3-5 levels, max 10-15 levels&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution frequency:&lt;/strong&gt; Web app handling 10K requests/sec, each request resolves 20-50 services = &lt;strong&gt;200K-500K resolutions/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scoped instances:&lt;/strong&gt; Per HTTP request, ~30 scoped instances created and disposed&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;memory&#34;&gt;Memory&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Registration metadata: 5,000 services × 500 bytes (type info, lifetime, dependencies) = &lt;strong&gt;2.5MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Singleton cache: 500 singletons × 1KB average = &lt;strong&gt;500KB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Scoped instances per request: 30 × 1KB = 30KB (short-lived, GC&amp;rsquo;d after request)&lt;/li&gt;
&lt;li&gt;Compiled resolution plans (pre-computed): 5,000 × 200 bytes = &lt;strong&gt;1MB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Total container overhead: &lt;strong&gt;~5MB&lt;/strong&gt; — negligible&lt;/li&gt;
&lt;/ul&gt;
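&lt;p&gt;A minimal resolve-time sketch in Python (hypothetical API; real containers precompile resolution plans, and the design above moves the cycle check to registration time, but the DFS is the same):&lt;/p&gt;

```python
class Container:
    """Toy constructor-injection container: explicit registration,
    singleton vs transient lifetimes, cycle detection during the
    recursive resolve walk."""

    def __init__(self):
        self._regs = {}        # key -> (factory, deps, singleton?)
        self._singletons = {}  # singleton instance cache

    def register(self, key, factory, deps=(), singleton=False):
        self._regs[key] = (factory, deps, singleton)

    def resolve(self, key, _stack=()):
        if key in _stack:
            # the stack is the cycle path, so the error can show it
            raise ValueError("circular dependency: " + " -> ".join(_stack + (key,)))
        if key in self._singletons:
            return self._singletons[key]  # fast path: cached lookup
        factory, deps, singleton = self._regs[key]
        args = [self.resolve(d, _stack + (key,)) for d in deps]
        instance = factory(*args)
        if singleton:
            self._singletons[key] = instance
        return instance

c = Container()
c.register("config", lambda: {"dsn": "sqlite://"}, singleton=True)
c.register("db", lambda cfg: f"DB({cfg['dsn']})", deps=("config",))
c.register("repo", lambda db: f"Repo({db})", deps=("db",))
print(c.resolve("repo"))  # Repo(DB(sqlite://))
```

&lt;p&gt;Replacing the recursive walk with a pre-built plan per key is the optimization the Key Insight below describes: the graph traversal happens once, not per resolve.&lt;/p&gt;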
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is fundamentally a &lt;strong&gt;graph problem&lt;/strong&gt; (dependency graph resolution) combined with a &lt;strong&gt;caching problem&lt;/strong&gt; (singleton/scoped instance reuse). The critical optimization is pre-computing resolution plans at registration time so that resolve-time is just executing a pre-built plan — no graph traversal needed in the hot path.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an On-Call Rotation and Alerting System (PagerDuty)</title>
      <link>https://chiraghasija.cc/designs/on-call-system/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/on-call-system/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Define on-call rotation schedules (weekly, daily, custom) with multiple layers (primary, secondary, manager) and automatic handoffs&lt;/li&gt;
&lt;li&gt;Route incoming alerts through escalation policies: try primary on-call → wait N minutes → escalate to secondary → wait → escalate to manager&lt;/li&gt;
&lt;li&gt;Deliver alerts via multiple channels (push notification, SMS, phone call) with configurable per-user preferences and retry logic&lt;/li&gt;
&lt;li&gt;Track acknowledgment, resolution, and incident lifecycle (triggered → acknowledged → resolved) with timestamps and responder actions&lt;/li&gt;
&lt;li&gt;Support schedule overrides (swap shifts, temporary coverage) and fatigue prevention (quiet hours, max alerts/hour limits)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.999% — this system pages humans during outages. If PagerDuty is down when production is down, no one gets woken up. It must be the most reliable system in the organization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Alert delivery within 30 seconds of trigger. Phone call initiation within 60 seconds. Escalation timing accurate to within 5 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; An alert must be delivered to exactly one on-call person at a time (no missed alerts, no duplicate pages for the same incident at the same escalation level). Acknowledgment must be strongly consistent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10,000 teams, 100,000 users, 500K alerts/day, 50K concurrent active incidents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Complete audit trail of every alert, delivery attempt, acknowledgment, and escalation. Zero tolerance for lost alerts.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Incoming alerts:&lt;/strong&gt; 500K/day = ~6 alerts/sec (peak 10x during widespread outage = 60/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delivery attempts per alert:&lt;/strong&gt; avg 2.5 (primary gets push + SMS, sometimes escalates to secondary)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total deliveries:&lt;/strong&gt; 500K × 2.5 = &lt;strong&gt;1.25M delivery attempts/day&lt;/strong&gt; = ~15/sec&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phone calls:&lt;/strong&gt; ~10% of alerts escalate to phone call = 50K calls/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schedule lookups:&lt;/strong&gt; Every alert → resolve current on-call → 500K lookups/day (cached)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API requests (schedule management, incident updates):&lt;/strong&gt; ~200K/day&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Incidents:&lt;/strong&gt; 500K/day × 5KB (full incident record with timeline) = 2.5GB/day = ~900GB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schedules:&lt;/strong&gt; 10,000 teams × 10KB = 100MB (tiny, mostly static)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit log:&lt;/strong&gt; 1.25M delivery events/day × 500 bytes = 625MB/day = ~225GB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User preferences:&lt;/strong&gt; 100K users × 1KB = 100MB&lt;/li&gt;
&lt;/ul&gt;
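&lt;p&gt;The escalation flow from requirement 2 can be sketched as a pure function that expands a policy into concrete page times (illustrative Python; the real system persists these as durable timers that fire unless the incident is acknowledged first):&lt;/p&gt;

```python
def escalation_plan(policy, trigger_ts):
    """Expand an escalation policy into (fire_at_seconds, target)
    pairs, starting at the trigger time. Each level waits wait_min
    minutes before handing off to the next."""
    plan, t = [], trigger_ts
    for level in policy:
        plan.append((t, level["target"]))
        t += level["wait_min"] * 60
    return plan

policy = [
    {"target": "primary",   "wait_min": 5},
    {"target": "secondary", "wait_min": 10},
    {"target": "manager",   "wait_min": 0},
]
print(escalation_plan(policy, trigger_ts=0))
# [(0, 'primary'), (300, 'secondary'), (900, 'manager')]
```

&lt;p&gt;The hard part is not this arithmetic but making the timers survive crashes and cancel exactly once on acknowledgment, which is why the plan must be stored durably rather than held in process memory.&lt;/p&gt;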
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;reliability-critical orchestration system&lt;/strong&gt;. The data volume is modest, but the reliability requirements are extreme. The system must work when everything else is broken (datacenter fires, DNS outages, cloud region failures). The core challenge is guaranteeing alert delivery within tight time windows with multiple fallback channels.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design an Online Marketplace (Etsy/Amazon Marketplace)</title>
      <link>https://chiraghasija.cc/designs/marketplace/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/marketplace/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Sellers can onboard, list products with images/descriptions/pricing, and manage inventory&lt;/li&gt;
&lt;li&gt;Buyers can search/browse products, add to cart, and place orders with payment&lt;/li&gt;
&lt;li&gt;Platform handles payment splitting — holds funds in escrow, pays seller after delivery confirmation&lt;/li&gt;
&lt;li&gt;Buyers and sellers can leave reviews/ratings on completed transactions&lt;/li&gt;
&lt;li&gt;Platform detects and prevents fraudulent listings, fake reviews, and payment fraud&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — downtime directly means lost revenue for both platform and sellers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Search results &amp;lt; 200ms, product page load &amp;lt; 100ms, checkout &amp;lt; 500ms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Inventory must be strongly consistent (no overselling). Order/payment state must be exactly-once. Reviews can be eventually consistent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 50M products, 10M daily active buyers, 500K active sellers, 2M orders/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero tolerance for lost orders or payment records. All financial data replicated across regions.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Browse/Search:&lt;/strong&gt; 10M DAU × 20 searches/day = 200M searches/day = ~2,300 QPS (peak 3x = 7,000 QPS)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product page views:&lt;/strong&gt; 10M DAU × 30 views/day = 300M/day = ~3,500 QPS (peak 10,000 QPS)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 2M orders/day = ~23 orders/sec (peak 5x during flash sales = 115/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Listings:&lt;/strong&gt; 500K sellers × 2 updates/day = 1M write ops/day = ~12 QPS&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Product catalog:&lt;/strong&gt; 50M products × 5KB metadata = 250GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product images:&lt;/strong&gt; 50M products × 5 images × 500KB = 125TB (object storage)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orders:&lt;/strong&gt; 2M/day × 2KB × 365 days × 3 years = ~4TB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reviews:&lt;/strong&gt; 50M reviews × 1KB = 50GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User profiles:&lt;/strong&gt; (10M buyers + 500K sellers) × 2KB = ~21GB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;read-heavy system&lt;/strong&gt; (100:1 read-to-write ratio for catalog browsing). The hard problems are search relevance at scale, inventory consistency during concurrent purchases, and payment orchestration with escrow.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Dropbox / Google Drive</title>
      <link>https://chiraghasija.cc/designs/dropbox/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/dropbox/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload, download, and delete files from any device&lt;/li&gt;
&lt;li&gt;Files automatically sync across all connected devices&lt;/li&gt;
&lt;li&gt;Users can share files/folders with other users (view/edit permissions)&lt;/li&gt;
&lt;li&gt;File versioning — users can view and restore previous versions&lt;/li&gt;
&lt;li&gt;Offline support — changes made offline sync when connectivity resumes&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — users depend on this for critical documents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines) — losing a user&amp;rsquo;s file is unacceptable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Small file sync &amp;lt; 5 seconds end-to-end between devices. Large file upload should show progress and be resumable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for file metadata (rename, move, delete must be immediately reflected). Eventual consistency acceptable for sync propagation to other devices (within seconds).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M users, 100M DAU, average user stores 10GB, peak uploads 10M files/hour&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bandwidth efficiency:&lt;/strong&gt; Only transfer changed parts of files (delta sync)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M users × 10GB avg = &lt;strong&gt;5 exabytes (5,000PB)&lt;/strong&gt; total storage&lt;/li&gt;
&lt;li&gt;This is the defining constraint — everything revolves around efficient storage&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M file uploads/hour ÷ 3600 = &lt;strong&gt;~2,800 uploads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Average file: 500KB → &lt;strong&gt;1.4GB/sec&lt;/strong&gt; upload bandwidth&lt;/li&gt;
&lt;li&gt;Sync events (metadata changes): 10x file uploads = &lt;strong&gt;28,000 events/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;metadata&#34;&gt;Metadata&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average user: 2,000 files → 500M × 2,000 = &lt;strong&gt;1 trillion&lt;/strong&gt; file metadata records&lt;/li&gt;
&lt;li&gt;Each record: ~500 bytes → &lt;strong&gt;500TB&lt;/strong&gt; metadata&lt;/li&gt;
&lt;/ul&gt;
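The delta-sync requirement above reduces to content hashing: split each file into chunks, hash them, and upload only the chunks whose hash changed. A minimal fixed-size-chunk sketch (chunk size is illustrative; production systems often use content-defined chunking so an insert near the start does not shift every later chunk boundary):

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int = 4 * 1024 * 1024):
    """SHA-256 per fixed-size chunk (4MB default is illustrative)."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def changed_chunks(old: bytes, new: bytes, chunk_size: int = 4 * 1024 * 1024):
    """Indices of chunks in `new` that must be uploaded."""
    old_h = chunk_hashes(old, chunk_size)
    new_h = chunk_hashes(new, chunk_size)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or old_h[i] != h]

old = b"aaaa" + b"bbbb" + b"cccc"
new = b"aaaa" + b"BBBB" + b"cccc"
print(changed_chunks(old, new, chunk_size=4))  # [1] -- only the middle chunk re-uploads
```

Chunk hashes also enable cross-user deduplication: if another user already uploaded an identical chunk, the server can skip the transfer entirely.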
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// File operations
POST /api/v1/files/upload
  Headers: Content-Range: bytes 0-1048575/5242880  // chunked upload
  Body: &amp;lt;binary chunk&amp;gt;
  Response 200: { &amp;#34;upload_id&amp;#34;: &amp;#34;up_123&amp;#34;, &amp;#34;next_offset&amp;#34;: 1048576 }

POST /api/v1/files/upload/complete
  Body: { &amp;#34;upload_id&amp;#34;: &amp;#34;up_123&amp;#34;, &amp;#34;filename&amp;#34;: &amp;#34;doc.pdf&amp;#34;, &amp;#34;parent_id&amp;#34;: &amp;#34;folder_abc&amp;#34; }
  Response 201: { &amp;#34;file_id&amp;#34;: &amp;#34;f_xyz&amp;#34;, &amp;#34;version&amp;#34;: 1 }

GET /api/v1/files/{file_id}/download
  Response 200: redirect to pre-signed S3 URL

GET /api/v1/files/{file_id}/versions
  Response 200: [{ &amp;#34;version&amp;#34;: 3, &amp;#34;size&amp;#34;: 524288, &amp;#34;modified_at&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;modified_by&amp;#34;: &amp;#34;...&amp;#34; }]

POST /api/v1/files/{file_id}/restore?version=2

// Sync
GET /api/v1/sync/changes?cursor={cursor}
  Response 200: {
    &amp;#34;changes&amp;#34;: [
      { &amp;#34;type&amp;#34;: &amp;#34;create&amp;#34;, &amp;#34;file_id&amp;#34;: &amp;#34;f_xyz&amp;#34;, &amp;#34;path&amp;#34;: &amp;#34;/docs/notes.md&amp;#34;, ... },
      { &amp;#34;type&amp;#34;: &amp;#34;modify&amp;#34;, &amp;#34;file_id&amp;#34;: &amp;#34;f_abc&amp;#34;, &amp;#34;version&amp;#34;: 3, ... },
      { &amp;#34;type&amp;#34;: &amp;#34;delete&amp;#34;, &amp;#34;file_id&amp;#34;: &amp;#34;f_def&amp;#34;, ... }
    ],
    &amp;#34;cursor&amp;#34;: &amp;#34;c_next_123&amp;#34;,
    &amp;#34;has_more&amp;#34;: false
  }

// Sharing
POST /api/v1/files/{file_id}/share
  Body: { &amp;#34;user_email&amp;#34;: &amp;#34;bob@example.com&amp;#34;, &amp;#34;permission&amp;#34;: &amp;#34;edit&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Facebook Messenger / WhatsApp</title>
      <link>https://chiraghasija.cc/designs/messenger/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/messenger/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;1-on-1 messaging between users (text messages)&lt;/li&gt;
&lt;li&gt;Group messaging (up to 256 members)&lt;/li&gt;
&lt;li&gt;Message delivery status: sent, delivered, read&lt;/li&gt;
&lt;li&gt;Online/offline presence indicators&lt;/li&gt;
&lt;li&gt;Message history — persistent, accessible from any device&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — messaging is real-time communication; downtime is immediately felt&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Message delivery &amp;lt; 200ms end-to-end for online recipients&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Messages must be ordered correctly within a conversation. No message loss. Effectively exactly-once delivery: at-least-once transport plus server-side deduplication, so users never see duplicate messages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 2B registered users, 500M DAU, 100B messages/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Messages are persistent — stored forever (or until user deletes)&lt;/li&gt;
&lt;/ul&gt;
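The exactly-once requirement is achieved in practice with at-least-once delivery plus deduplication on a client-generated message id: the client retries until acknowledged, and the server recognizes retries. A minimal sketch with illustrative names:

```python
import itertools

_seq = itertools.count(1)
_seen = {}  # client_message_id -> server-assigned message_id

def send_message(client_message_id: str, content: str) -> str:
    """Idempotent send: a retry with the same client id returns the
    original server id instead of storing a duplicate."""
    if client_message_id in _seen:
        return _seen[client_message_id]
    server_id = f"m_{next(_seq)}"
    # ...persist (server_id, content) to the message store here...
    _seen[client_message_id] = server_id
    return server_id

first = send_message("cm_uuid_123", "Hello!")
retry = send_message("cm_uuid_123", "Hello!")  # network retry after timeout
print(first == retry)  # True -- no duplicate message stored
```

In a real deployment the dedup map lives in the message store itself (a unique index on the client id), not in process memory.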
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;100B messages/day ÷ 100K sec/day = &lt;strong&gt;1M messages/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 3x → &lt;strong&gt;3M messages/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;This is extremely high write throughput&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average message: 100 bytes (text + metadata)&lt;/li&gt;
&lt;li&gt;100B messages/day × 100 bytes = &lt;strong&gt;10TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Per year: &lt;strong&gt;~3.6PB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;With media (images, voice notes): 10x → &lt;strong&gt;~36PB/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;connections&#34;&gt;Connections&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M DAU with persistent WebSocket connections&lt;/li&gt;
&lt;li&gt;Average user online 4 hours/day → at any time ~83M concurrent connections&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;~150M concurrent WebSocket connections&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;rest-for-non-realtime-operations&#34;&gt;REST (for non-realtime operations)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/messages/send
  Body: {
    &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;,
    &amp;#34;content&amp;#34;: &amp;#34;Hello!&amp;#34;,
    &amp;#34;type&amp;#34;: &amp;#34;text&amp;#34;,
    &amp;#34;client_message_id&amp;#34;: &amp;#34;cm_uuid_123&amp;#34;  // idempotency key
  }
  Response 201: {
    &amp;#34;message_id&amp;#34;: &amp;#34;m_server_456&amp;#34;,
    &amp;#34;timestamp&amp;#34;: &amp;#34;2026-02-22T18:00:00.123Z&amp;#34;
  }

GET /api/v1/conversations
  Response 200: [
    {
      &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;,
      &amp;#34;type&amp;#34;: &amp;#34;1on1&amp;#34;,
      &amp;#34;participants&amp;#34;: [&amp;#34;u_1&amp;#34;, &amp;#34;u_2&amp;#34;],
      &amp;#34;last_message&amp;#34;: { &amp;#34;content&amp;#34;: &amp;#34;Hello!&amp;#34;, &amp;#34;timestamp&amp;#34;: &amp;#34;...&amp;#34; },
      &amp;#34;unread_count&amp;#34;: 3
    }
  ]

GET /api/v1/conversations/{conv_id}/messages?cursor={cursor}&amp;amp;limit=50
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;websocket-for-real-time&#34;&gt;WebSocket (for real-time)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Client → Server
{ &amp;#34;type&amp;#34;: &amp;#34;message&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hello!&amp;#34;, &amp;#34;client_id&amp;#34;: &amp;#34;cm_uuid_123&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;typing&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;ack&amp;#34;, &amp;#34;message_id&amp;#34;: &amp;#34;m_456&amp;#34; }        // delivery receipt
{ &amp;#34;type&amp;#34;: &amp;#34;read&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;up_to&amp;#34;: &amp;#34;m_456&amp;#34; }  // read receipt

// Server → Client
{ &amp;#34;type&amp;#34;: &amp;#34;message&amp;#34;, &amp;#34;message_id&amp;#34;: &amp;#34;m_456&amp;#34;, &amp;#34;from&amp;#34;: &amp;#34;u_2&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hi!&amp;#34;, &amp;#34;timestamp&amp;#34;: &amp;#34;...&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;delivered&amp;#34;, &amp;#34;message_id&amp;#34;: &amp;#34;m_123&amp;#34; }   // your message was delivered
{ &amp;#34;type&amp;#34;: &amp;#34;read&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;by&amp;#34;: &amp;#34;u_2&amp;#34;, &amp;#34;up_to&amp;#34;: &amp;#34;m_456&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;typing&amp;#34;, &amp;#34;conversation_id&amp;#34;: &amp;#34;conv_abc&amp;#34;, &amp;#34;user&amp;#34;: &amp;#34;u_2&amp;#34; }
{ &amp;#34;type&amp;#34;: &amp;#34;presence&amp;#34;, &amp;#34;user&amp;#34;: &amp;#34;u_2&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;online&amp;#34; }
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Facebook&#39;s Like/Reaction System</title>
      <link>https://chiraghasija.cc/designs/facebook-likes/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/facebook-likes/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can like/react to a post (like, love, haha, wow, sad, angry) — one reaction per user per post, toggleable (tap again to remove)&lt;/li&gt;
&lt;li&gt;Display the total reaction count on every post and a breakdown by reaction type&lt;/li&gt;
&lt;li&gt;Show &amp;ldquo;who reacted&amp;rdquo; — a paginated list of users who reacted to a post (with their reaction type)&lt;/li&gt;
&lt;li&gt;Notify the post author when their post receives reactions (batched, not per-reaction)&lt;/li&gt;
&lt;li&gt;Support reactions on posts, comments, messages, and other content types (polymorphic)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the like button is on every post in the News Feed. Downtime affects all users immediately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Like action &amp;lt; 100ms (write). Displaying count &amp;lt; 50ms (read). Counts can be slightly stale (eventual consistency acceptable for counts).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; 500K likes/sec sustained (2B DAU, average 20 likes/day = 40B likes/day). Peak: 2M likes/sec during viral events.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Toggle semantics must be strongly consistent per user — if I tap like twice, the second tap must undo the first. Counts can be eventually consistent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500B total reactions stored. 10B+ posts with at least one reaction.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-traffic&#34;&gt;Write Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2B DAU, average 20 reactions/day = 40B reactions/day&lt;/li&gt;
&lt;li&gt;Average: &lt;strong&gt;460K writes/sec&lt;/strong&gt;, peak: &lt;strong&gt;2M writes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each write: check if user already reacted → upsert/delete reaction → update counter&lt;/li&gt;
&lt;/ul&gt;
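The write path above (check existing reaction, then upsert/delete, then adjust the counter) can be sketched in memory; in production the reaction map is a sharded store and the counters live in a cache:

```python
reactions = {}  # (user_id, post_id) -> reaction_type  (sharded store in reality)
counts = {}     # (post_id, reaction_type) -> int      (denormalized counters)

def react(user_id, post_id, rtype):
    key = (user_id, post_id)
    prev = reactions.get(key)
    if prev is not None:
        counts[(post_id, prev)] -= 1     # undo the previous reaction's count
    if prev == rtype:
        del reactions[key]               # same reaction tapped twice -> removed
    else:
        reactions[key] = rtype           # new or changed reaction
        counts[(post_id, rtype)] = counts.get((post_id, rtype), 0) + 1

react("u1", "p1", "like")
react("u1", "p1", "love")  # changes like -> love
react("u1", "p1", "love")  # toggles love off
print(counts)  # {('p1', 'like'): 0, ('p1', 'love'): 0}
```

The toggle must be strongly consistent per (user, post), which is why the reaction record, not the counter, is the source of truth.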
&lt;h3 id=&#34;read-traffic&#34;&gt;Read Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Every post view shows reaction count. 2B DAU × 200 posts viewed/day = 400B post views/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;4.6M reads/sec&lt;/strong&gt; for reaction counts (but counts are cached on the post object — not a separate query)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Who reacted&amp;rdquo; list: viewed much less frequently. ~1B queries/day = 11K reads/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Reaction records: 500B reactions × 30 bytes (user_id, post_id, reaction_type, timestamp) = &lt;strong&gt;15TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Reaction counts: 10B posts × 50 bytes (6 counters + total) = &lt;strong&gt;500GB&lt;/strong&gt; (easily fits in cache)&lt;/li&gt;
&lt;li&gt;Active reaction count cache: top 1B posts × 50 bytes = &lt;strong&gt;50GB&lt;/strong&gt; Redis&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;Reads outnumber writes 10:1 for counts. The &amp;ldquo;like count&amp;rdquo; is shown on every post impression but only changes when someone reacts. This is a perfect case for &lt;strong&gt;denormalized counters&lt;/strong&gt; — pre-compute and cache the count rather than counting rows at query time.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Facebook&#39;s News Feed</title>
      <link>https://chiraghasija.cc/designs/news-feed/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/news-feed/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users see a personalized feed of posts from friends, pages, and groups they follow&lt;/li&gt;
&lt;li&gt;Feed is ranked by relevance (not chronological) — a scoring model determines post order&lt;/li&gt;
&lt;li&gt;Support multiple content types: text, images, videos, links, live streams, stories&lt;/li&gt;
&lt;li&gt;New posts appear in followers&amp;rsquo; feeds within seconds (near real-time)&lt;/li&gt;
&lt;li&gt;Infinite scroll with cursor-based pagination — load more posts as user scrolls&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the news feed IS the product. Any downtime is a headline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 200ms to render the first page of feed. Subsequent pages &amp;lt; 100ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is fine. A post appearing 5 seconds late in some feeds is acceptable. But a post should never permanently disappear from a feed it belongs in.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 2 billion DAU, 500M posts/day, each user has ~500 friends on average. Feed generation for 2B users at peak load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; New posts from close friends should appear within 5-10 seconds. Posts from pages/groups can tolerate 30-60 seconds.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2 billion DAU, each opens feed ~10 times/day&lt;/li&gt;
&lt;li&gt;Feed requests: 2B × 10 = &lt;strong&gt;20 billion feed loads/day&lt;/strong&gt; = &lt;strong&gt;230K QPS average&lt;/strong&gt;, &lt;strong&gt;500K QPS peak&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;New posts: 500M/day = &lt;strong&gt;5,800 posts/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each post fans out to ~500 followers (average friends list)&lt;/li&gt;
&lt;li&gt;Fan-out writes: 5,800 × 500 = &lt;strong&gt;2.9 million feed inserts/sec&lt;/strong&gt; (if fan-out on write)&lt;/li&gt;
&lt;/ul&gt;
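The hybrid resolution of that write-vs-read trade-off can be sketched: fan out on write for ordinary users, but keep high-follower authors' posts in their own list and merge them at read time. The threshold and names here are illustrative:

```python
FANOUT_THRESHOLD = 10_000  # illustrative follower-count cutoff

feeds = {}        # user_id -> precomputed list of post_ids (newest first)
celeb_posts = {}  # author_id -> their recent post_ids (pulled at read time)

def publish(author_id, post_id, followers):
    if len(followers) > FANOUT_THRESHOLD:
        celeb_posts.setdefault(author_id, []).insert(0, post_id)
    else:
        for f in followers:                   # fan-out on write
            feeds.setdefault(f, []).insert(0, post_id)

def load_feed(user_id, followed_celebs):
    merged = list(feeds.get(user_id, []))
    for celeb in followed_celebs:             # fan-out on read, celebs only
        merged = celeb_posts.get(celeb, []) + merged
    return merged  # the ranking model re-orders this candidate set
```

This caps fan-out write amplification (a 100M-follower post triggers one write, not 100M) while keeping reads cheap for the common case.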
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Feed cache per user: store 500 most recent post IDs&lt;/li&gt;
&lt;li&gt;500 post IDs x 8 bytes = 4KB per user&lt;/li&gt;
&lt;li&gt;2B users x 4KB = &lt;strong&gt;8 TB&lt;/strong&gt; for feed cache — fits in a Redis cluster&lt;/li&gt;
&lt;li&gt;Post content storage: 500M posts/day x 5KB avg = &lt;strong&gt;2.5 TB/day&lt;/strong&gt; → standard DB/object store&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The core trade-off is &lt;strong&gt;fan-out on write&lt;/strong&gt; (pre-compute feeds, 2.9M writes/sec) vs &lt;strong&gt;fan-out on read&lt;/strong&gt; (compute feed at request time, 500K QPS each querying 500 friends). Neither extreme works alone — the answer is a &lt;strong&gt;hybrid approach&lt;/strong&gt; based on follower count.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design FASTag (Electronic Toll Collection System)</title>
      <link>https://chiraghasija.cc/designs/fastag/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/fastag/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Vehicles with RFID-based FASTag pass through toll plazas without stopping — the system reads the tag, identifies the vehicle, and deducts the toll amount in real time&lt;/li&gt;
&lt;li&gt;Support prepaid wallet accounts linked to FASTag. Users can recharge via UPI, net banking, credit/debit cards, or auto-top-up&lt;/li&gt;
&lt;li&gt;Classify vehicles (car, LCV, bus, truck, multi-axle) automatically using RFID tag metadata and optionally ANPR (Automatic Number Plate Recognition) for verification&lt;/li&gt;
&lt;li&gt;Handle edge cases: insufficient balance (let pass with negative balance up to a threshold, or deny entry), cloned/blacklisted tags, expired tags, tag-less vehicles (fallback to ANPR + manual toll)&lt;/li&gt;
&lt;li&gt;Generate trip receipts, monthly statements, and provide real-time balance/transaction history via mobile app and SMS alerts&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — toll plazas operate 24/7. Even 1 minute of downtime causes massive traffic jams at ~800+ plazas nationwide.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 300ms end-to-end from RFID scan to barrier lift. Vehicles pass at 30 km/h through dedicated FASTag lanes — the window for processing is ~2 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for balance deduction. We cannot deduct the same balance twice or allow a transaction to silently fail and still charge.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; India has ~8 crore (80M) FASTags issued. ~1 crore (10M) toll transactions/day across 800+ plazas with 4000+ lanes. Peak: ~3x average during festivals/holidays.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Every transaction must be recorded. Financial data — zero loss. Full audit trail for regulatory compliance (NHAI, NPCI).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M transactions/day = ~115 TPS average&lt;/li&gt;
&lt;li&gt;Peak (festival season, 3x): &lt;strong&gt;~350 TPS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each transaction involves: RFID read → tag lookup → balance check → debit → receipt → barrier signal = 6 operations&lt;/li&gt;
&lt;li&gt;Effective peak internal ops: &lt;strong&gt;~2100 ops/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Read-heavy on tag lookup (every vehicle approaching triggers a read), write-heavy on transactions&lt;/li&gt;
&lt;/ul&gt;
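The six-step pipeline above must be idempotent end-to-end: if a plaza retries after a timeout, the tag must not be debited twice. A minimal sketch of the debit step, including the negative-balance grace from the requirements (threshold and names are illustrative):

```python
balances = {"TAG_1": 100}   # prepaid wallet balance, in rupees
processed = set()           # (tag_id, read_id) pairs already debited
GRACE = 150                 # illustrative negative-balance threshold

def debit(tag_id: str, read_id: str, toll: int) -> str:
    key = (tag_id, read_id)
    if key in processed:
        return "pass (duplicate read, no second debit)"
    if balances.get(tag_id, 0) - toll < -GRACE:
        return "deny"                 # beyond the grace threshold
    processed.add(key)
    balances[tag_id] -= toll
    return "pass"

print(debit("TAG_1", "read_001", 120))  # pass (balance -20, within grace)
print(debit("TAG_1", "read_001", 120))  # pass (duplicate read, no second debit)
print(debit("TAG_1", "read_002", 200))  # deny (-20 - 200 < -150)
```

In production the dedup key would be persisted with the transaction record so retries are recognized even across a server failover.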
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;80M FASTag accounts: ~500 bytes each (tag ID, vehicle info, owner, balance, status) = &lt;strong&gt;40GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;10M transactions/day × 365 days × 200 bytes = &lt;strong&gt;730GB/year&lt;/strong&gt; of transaction logs&lt;/li&gt;
&lt;li&gt;5 years retention (regulatory): &lt;strong&gt;~3.7TB&lt;/strong&gt; of historical transactions&lt;/li&gt;
&lt;li&gt;Hot data (accounts + last 30 days txns): ~100GB — fits comfortably in-memory cache + SSD&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each RFID read event from plaza: ~200 bytes&lt;/li&gt;
&lt;li&gt;Each transaction response to plaza: ~300 bytes&lt;/li&gt;
&lt;li&gt;Peak: 350 × 500 bytes = ~175KB/s — negligible bandwidth&lt;/li&gt;
&lt;li&gt;The bottleneck is &lt;strong&gt;latency&lt;/strong&gt; (sub-300ms), not throughput&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;latency-critical financial transaction system&lt;/strong&gt; with strong consistency requirements. The scale is modest (350 TPS peak is not extreme), but the latency budget is brutal (300ms including network to remote plazas) and the consequences of failure are physical (traffic jams, accidents). The design must prioritize reliability, fast failover, and degraded-mode operation.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Google Analytics</title>
      <link>https://chiraghasija.cc/designs/google-analytics/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/google-analytics/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Collect client-side events (page views, clicks, custom events) from millions of websites via a lightweight JavaScript SDK&lt;/li&gt;
&lt;li&gt;Provide real-time dashboards showing active users, page views per second, and top pages (within ~30 seconds of event)&lt;/li&gt;
&lt;li&gt;Support batch analytics queries — daily/weekly/monthly reports on sessions, funnels, bounce rate, conversion paths&lt;/li&gt;
&lt;li&gt;Sessionize raw events into user sessions with configurable timeout (default 30 minutes of inactivity)&lt;/li&gt;
&lt;li&gt;Count unique visitors accurately across time ranges (daily, weekly, monthly) with deduplication&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for the ingestion pipeline — dropping events is unacceptable for paying customers. Dashboard reads can tolerate brief degradation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Event ingestion &amp;lt; 50ms (client-perceived). Real-time dashboard data within 30 seconds of event. Batch reports within minutes of scheduled time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is acceptable. Real-time dashboards are approximate. Batch reports must be accurate to within 1%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10M tracked websites, 1M events/sec ingestion, 500TB+ of raw event data per year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero event loss once acknowledged. Raw events retained for 2 years, aggregated data retained indefinitely.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M websites tracked, average 10 page views/sec per active site&lt;/li&gt;
&lt;li&gt;Not all sites active simultaneously — assume 100K sites active at peak&lt;/li&gt;
&lt;li&gt;Peak ingestion: 100K sites × 10 events/sec = &lt;strong&gt;1M events/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Average ingestion: ~300K events/sec&lt;/li&gt;
&lt;li&gt;Read QPS (dashboard queries): ~50K/sec (most are cached)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average event payload: ~500 bytes (URL, timestamp, user agent, referrer, custom dimensions, session cookie)&lt;/li&gt;
&lt;li&gt;Daily raw events: 300K/sec × 86,400 sec × 500 bytes = &lt;strong&gt;~13TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Yearly raw events: &lt;strong&gt;~4.7PB/year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Aggregated rollups (hourly/daily per site per dimension): ~1% of raw = &lt;strong&gt;~50TB/year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;unique-visitor-counting&#34;&gt;Unique Visitor Counting&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M sites, each with up to 100M monthly unique visitors&lt;/li&gt;
&lt;li&gt;Exact counting: 10M sites × 100M visitors × 8 bytes = 8PB (impossible)&lt;/li&gt;
&lt;li&gt;HyperLogLog: 10M sites × 12KB per HLL = &lt;strong&gt;120GB&lt;/strong&gt; for monthly uniques — fits in memory&lt;/li&gt;
&lt;/ul&gt;
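HyperLogLog gets away with ~12KB per site because each distinct visitor hash touches at most one small register, and cardinality is recovered from the register maxima. A compact, self-contained sketch (p=14 gives 16,384 registers, matching the ~12KB sizing above, with roughly 1% standard error):

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counting; p=14 -> 2^14 registers, ~1% error."""

    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # small-range correction
            return int(self.m * math.log(self.m / zeros))
        return int(raw)

hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"visitor-{i}")
print(hll.count())  # approximately 50,000
```

HLL sketches also merge by taking element-wise register maxima, which is what makes per-day sketches roll up into weekly and monthly uniques.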
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;write-heavy, read-light&lt;/strong&gt; system. Ingestion throughput and storage cost dominate. The core challenge is building a pipeline that can ingest 1M events/sec, make data queryable in real-time, and run efficient batch aggregations without bankrupting the storage budget.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Google Calendar</title>
      <link>https://chiraghasija.cc/designs/google-calendar/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/google-calendar/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Create, read, update, and delete events with title, description, time range, location, and attendees&lt;/li&gt;
&lt;li&gt;Support recurring events with complex patterns (every weekday, first Monday of each month, every 2 weeks)&lt;/li&gt;
&lt;li&gt;Handle timezone conversions correctly — events display in the user&amp;rsquo;s local timezone regardless of where they were created&lt;/li&gt;
&lt;li&gt;Detect scheduling conflicts when creating or accepting events&lt;/li&gt;
&lt;li&gt;Send notifications and reminders (email, push) at configurable times before events&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — calendar outages directly cause missed meetings and lost productivity across entire organizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 200ms to render a week view (fetch all events for 7 days). Event creation &amp;lt; 300ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for the event owner&amp;rsquo;s view — after creating an event, the owner must immediately see it. Eventual consistency (&amp;lt; 5 seconds) acceptable for other attendees seeing the event appear.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M active users, average 20 events/week per user, 50K event writes/sec peak, 200K calendar view reads/sec peak&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sync:&lt;/strong&gt; Events must sync reliably across all devices (web, mobile, desktop) within 5 seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Event writes (create/update/delete): 50K/sec peak, 20K/sec average&lt;/li&gt;
&lt;li&gt;Calendar view reads: 200K/sec peak (Monday mornings are the spike)&lt;/li&gt;
&lt;li&gt;Notification triggers: 100K/sec (reminders firing across timezones)&lt;/li&gt;
&lt;li&gt;Recurring event expansion: done at query time for future dates, pre-expanded for past 30 days&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M users × 20 events/week × 52 weeks = 520 billion events/year&lt;/li&gt;
&lt;li&gt;But most events are non-recurring single events. Average active events per user: ~200 (upcoming + recent)&lt;/li&gt;
&lt;li&gt;500M users × 200 events × 500 bytes = &lt;strong&gt;50 TB&lt;/strong&gt; for active event data&lt;/li&gt;
&lt;li&gt;Recurring event storage: only the RRULE is stored (not expanded instances)
&lt;ul&gt;
&lt;li&gt;500M users × 10 recurring events avg × 200 bytes = &lt;strong&gt;1 TB&lt;/strong&gt; for recurring patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Total: ~&lt;strong&gt;51 TB&lt;/strong&gt; for core event data&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;calendar-views&#34;&gt;Calendar Views&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Week view: fetch ~20-30 events per user (single-occurrence + expanded recurring)&lt;/li&gt;
&lt;li&gt;Month view: fetch ~80-120 events per user&lt;/li&gt;
&lt;li&gt;Most queries are time-range queries on a single user&amp;rsquo;s calendar → excellent for sharding by user_id&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create an event
POST /calendars/{calendar_id}/events
  Body: {
    &amp;#34;title&amp;#34;: &amp;#34;Sprint Planning&amp;#34;,
    &amp;#34;description&amp;#34;: &amp;#34;Q1 sprint planning meeting&amp;#34;,
    &amp;#34;start&amp;#34;: &amp;#34;2026-02-23T09:00:00&amp;#34;,
    &amp;#34;end&amp;#34;: &amp;#34;2026-02-23T10:00:00&amp;#34;,
    &amp;#34;timezone&amp;#34;: &amp;#34;America/New_York&amp;#34;,
    &amp;#34;location&amp;#34;: &amp;#34;Conference Room A / https://meet.google.com/xyz&amp;#34;,
    &amp;#34;attendees&amp;#34;: [
      {&amp;#34;email&amp;#34;: &amp;#34;alice@company.com&amp;#34;, &amp;#34;optional&amp;#34;: false},
      {&amp;#34;email&amp;#34;: &amp;#34;bob@company.com&amp;#34;, &amp;#34;optional&amp;#34;: true}
    ],
    &amp;#34;recurrence&amp;#34;: &amp;#34;RRULE:FREQ=WEEKLY;BYDAY=MO;COUNT=12&amp;#34;,
    &amp;#34;reminders&amp;#34;: [
      {&amp;#34;method&amp;#34;: &amp;#34;popup&amp;#34;, &amp;#34;minutes&amp;#34;: 10},
      {&amp;#34;method&amp;#34;: &amp;#34;email&amp;#34;, &amp;#34;minutes&amp;#34;: 30}
    ],
    &amp;#34;visibility&amp;#34;: &amp;#34;default&amp;#34;,
    &amp;#34;color_id&amp;#34;: 5
  }
  Response 201: {
    &amp;#34;event_id&amp;#34;: &amp;#34;evt_abc123&amp;#34;,
    &amp;#34;calendar_id&amp;#34;: &amp;#34;cal_user456&amp;#34;,
    &amp;#34;html_link&amp;#34;: &amp;#34;https://calendar.google.com/event?eid=abc123&amp;#34;,
    &amp;#34;created&amp;#34;: &amp;#34;2026-02-22T14:30:00Z&amp;#34;,
    &amp;#34;updated&amp;#34;: &amp;#34;2026-02-22T14:30:00Z&amp;#34;,
    ...
  }

// Get events in a time range (calendar view)
GET /calendars/{calendar_id}/events?timeMin=2026-02-23T00:00:00Z&amp;amp;timeMax=2026-03-02T00:00:00Z&amp;amp;timezone=America/New_York&amp;amp;singleEvents=true
  Response 200: {
    &amp;#34;events&amp;#34;: [
      {
        &amp;#34;event_id&amp;#34;: &amp;#34;evt_abc123&amp;#34;,
        &amp;#34;title&amp;#34;: &amp;#34;Sprint Planning&amp;#34;,
        &amp;#34;start&amp;#34;: {&amp;#34;dateTime&amp;#34;: &amp;#34;2026-02-23T09:00:00-05:00&amp;#34;, &amp;#34;timezone&amp;#34;: &amp;#34;America/New_York&amp;#34;},
        &amp;#34;end&amp;#34;: {&amp;#34;dateTime&amp;#34;: &amp;#34;2026-02-23T10:00:00-05:00&amp;#34;, &amp;#34;timezone&amp;#34;: &amp;#34;America/New_York&amp;#34;},
        &amp;#34;recurring_event_id&amp;#34;: &amp;#34;evt_abc123&amp;#34;,    // parent recurring event
        &amp;#34;original_start&amp;#34;: &amp;#34;2026-02-23T09:00:00-05:00&amp;#34;,
        &amp;#34;attendees&amp;#34;: [...],
        &amp;#34;status&amp;#34;: &amp;#34;confirmed&amp;#34;
      },
      ...
    ]
  }

// Update a single instance of a recurring event
PUT /calendars/{calendar_id}/events/{event_id}?instance=2026-03-02T09:00:00-05:00
  Body: { &amp;#34;start&amp;#34;: &amp;#34;2026-03-02T10:00:00&amp;#34;, &amp;#34;end&amp;#34;: &amp;#34;2026-03-02T11:00:00&amp;#34; }

// Check free/busy time for scheduling
POST /freeBusy
  Body: {
    &amp;#34;timeMin&amp;#34;: &amp;#34;2026-02-24T08:00:00Z&amp;#34;,
    &amp;#34;timeMax&amp;#34;: &amp;#34;2026-02-24T18:00:00Z&amp;#34;,
    &amp;#34;items&amp;#34;: [
      {&amp;#34;id&amp;#34;: &amp;#34;alice@company.com&amp;#34;},
      {&amp;#34;id&amp;#34;: &amp;#34;bob@company.com&amp;#34;}
    ]
  }
  Response 200: {
    &amp;#34;calendars&amp;#34;: {
      &amp;#34;alice@company.com&amp;#34;: {
        &amp;#34;busy&amp;#34;: [
          {&amp;#34;start&amp;#34;: &amp;#34;2026-02-24T09:00:00Z&amp;#34;, &amp;#34;end&amp;#34;: &amp;#34;2026-02-24T10:00:00Z&amp;#34;},
          {&amp;#34;start&amp;#34;: &amp;#34;2026-02-24T14:00:00Z&amp;#34;, &amp;#34;end&amp;#34;: &amp;#34;2026-02-24T15:00:00Z&amp;#34;}
        ]
      },
      ...
    }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;singleEvents=true&lt;/code&gt; tells the API to expand recurring events into individual instances — critical for calendar rendering&lt;/li&gt;
&lt;li&gt;Recurring event modifications are tracked as &amp;ldquo;exception instances&amp;rdquo; linked to the parent event&lt;/li&gt;
&lt;li&gt;Free/busy API is a separate endpoint optimized for multi-user scheduling (returns only busy slots, not event details — respects privacy)&lt;/li&gt;
&lt;/ul&gt;
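&lt;p&gt;The free/busy response above returns coalesced busy slots per attendee. A minimal sketch of that merge step (the function name and tuple representation are illustrative, not part of the API):&lt;/p&gt;

```python
def merge_busy(slots):
    """Coalesce overlapping or back-to-back (start, end) busy slots.
    Works for any comparable endpoints, e.g. datetimes or ISO-8601 strings."""
    merged = []
    for start, end in sorted(slots):
        if merged and merged[-1][1] >= start:
            # Overlaps or touches the previous slot: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]
```

&lt;p&gt;Sorting first makes the merge a single linear pass. Returning only merged ranges also supports the privacy point above: callers learn when someone is busy, not how many individual events they have.&lt;/p&gt;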
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;events-table-mysqlpostgresql-sharded-by-owner_user_id&#34;&gt;Events Table (MySQL/PostgreSQL, sharded by owner_user_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: events
  event_id            (PK) | bigint (Snowflake ID)
  calendar_id              | bigint
  owner_user_id            | bigint (shard key)
  title                    | varchar(500)
  description              | text
  start_time               | timestamp with timezone
  end_time                 | timestamp with timezone
  start_timezone           | varchar(50)    -- e.g., &amp;#34;America/New_York&amp;#34;
  end_timezone             | varchar(50)
  location                 | varchar(500)
  is_all_day               | boolean
  recurrence_rule          | varchar(500)   -- RRULE string, null if non-recurring
  recurrence_end           | timestamp      -- when the recurrence stops
  visibility               | enum(default, public, private)
  status                   | enum(confirmed, tentative, cancelled)
  color_id                 | tinyint
  created_at               | timestamp
  updated_at               | timestamp

Indexes:
  (calendar_id, start_time)  -- primary query: events in a time range for a calendar
  (owner_user_id)            -- shard key
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;recurring-event-exceptions-table&#34;&gt;Recurring Event Exceptions Table&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: event_exceptions
  exception_id        (PK) | bigint
  parent_event_id          | bigint (FK → events)
  original_start_time      | timestamp   -- which instance is being modified
  is_cancelled             | boolean     -- true if this instance is deleted
  modified_title           | varchar(500)
  modified_start_time      | timestamp
  modified_end_time        | timestamp
  modified_location        | varchar(500)
  -- other overridden fields (null = use parent&amp;#39;s value)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;attendees-table&#34;&gt;Attendees Table&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: event_attendees
  event_id                 | bigint (FK → events)
  user_id                  | bigint
  email                    | varchar(200)
  response_status          | enum(needsAction, accepted, declined, tentative)
  is_optional              | boolean
  is_organizer             | boolean
  event_start_time         | timestamp   -- denormalized from events, for the index below

Index: (user_id, event_start_time)  -- for attendee&amp;#39;s calendar view
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;reminders-table&#34;&gt;Reminders Table&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: reminders
  reminder_id         (PK) | bigint
  event_id                 | bigint
  user_id                  | bigint
  method                   | enum(popup, email, sms)
  minutes_before           | int
  trigger_time             | timestamp   -- precomputed: event.start - minutes_before
  is_sent                  | boolean

Index: (trigger_time, is_sent)  -- for the reminder scheduler to efficiently find due reminders
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;why-these-choices&#34;&gt;Why These Choices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sharded MySQL by owner_user_id&lt;/strong&gt; — most queries are &amp;ldquo;my events this week&amp;rdquo; which hits a single shard. Cross-user queries (free/busy) require scatter-gather but are less frequent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RRULE stored as string, expanded at query time&lt;/strong&gt; — storing every instance of &amp;ldquo;every weekday forever&amp;rdquo; would be infinite. RRULE is compact and standards-based (RFC 5545).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exception table for recurring modifications&lt;/strong&gt; — cleanly separates the recurring pattern from per-instance changes. No need to duplicate the entire event for each modified instance.&lt;/li&gt;
&lt;/ul&gt;
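&lt;p&gt;To make &amp;ldquo;expanded at query time&amp;rdquo; concrete, here is a deliberately minimal sketch that handles only FREQ=WEEKLY with a COUNT; a production expander would use a full RFC 5545 implementation (for example python-dateutil&amp;rsquo;s rrule module) to cover BYDAY, INTERVAL, EXDATE, and the rest:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def expand_weekly(dtstart, count, window_start, window_end):
    """Expand a FREQ=WEEKLY;COUNT=count recurrence into the concrete
    instance start times that fall inside the requested view window."""
    instances = []
    for i in range(count):
        occurrence = dtstart + timedelta(weeks=i)
        if occurrence >= window_start and window_end >= occurrence:
            instances.append(occurrence)
    return instances
```

&lt;p&gt;For the Sprint Planning example (every Monday 9 AM starting 2026-02-23, 12 weeks), expanding against a two-week window yields just the two Monday instances, while the database still holds a single row.&lt;/p&gt;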
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client (Web / Mobile / Desktop)
  │
  │  REST API / WebSocket (for real-time sync)
  ▼
┌───────────────┐
│  API Gateway  │  (auth, rate limiting, routing)
└───────┬───────┘
        │
        ├──────────────────┬───────────────────┬──────────────────┐
        ▼                  ▼                   ▼                  ▼
┌───────────────┐ ┌─────────────────┐ ┌───────────────┐ ┌───────────────┐
│ Event Service │ │ Recurring Event │ │ Notification  │ │ Free/Busy     │
│ (CRUD for     │ │ Expander        │ │ Service       │ │ Service       │
│  events)      │ │ (expands RRULE  │ │ (reminders,   │ │ (scheduling   │
│               │ │  into instances)│ │  invites)     │ │  queries)     │
└───────┬───────┘ └────────┬────────┘ └───────┬───────┘ └───────────────┘
        │                  │                   │
        ▼                  ▼                   ▼
┌────────────────────────────────────────────────────────┐
│                   Shared Data Layer                    │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Events DB    │  │ Attendees DB │  │ Reminders DB │  │
│  │ (MySQL,      │  │ (MySQL,      │  │ (MySQL,      │  │
│  │  sharded by  │  │  sharded by  │  │  sharded by  │  │
│  │  owner_id)   │  │  user_id)    │  │  trigger_time│  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐                    │
│  │ Cache (Redis)│  │ Sync Queue   │                    │
│  │ (user&amp;#39;s week │  │ (Kafka, for  │                    │
│  │  view cache) │  │  cross-device│                    │
│  │              │  │  sync)       │                    │
│  └──────────────┘  └──────────────┘                    │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;event-creation-flow&#34;&gt;Event Creation Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User creates &amp;#34;Sprint Planning, every Monday 9-10 AM ET, 12 weeks&amp;#34;
  │
  ▼
Event Service:
  1. Validate input (times, timezone, RRULE syntax)
  2. Store event with recurrence_rule = &amp;#34;RRULE:FREQ=WEEKLY;BYDAY=MO;COUNT=12&amp;#34;
     (single row in events table, NOT 12 rows)
  3. Store attendees in event_attendees table
  4. Compute reminders for next 30 days of instances:
     → Expand RRULE for next 30 days → 4-5 instances
     → For each: insert into reminders table with precomputed trigger_time
  5. Publish event to Kafka &amp;#34;event-changes&amp;#34; topic
  │
  ├─→ Notification Service:
  │     → Send email invitations to all attendees
  │     → Each attendee&amp;#39;s calendar view cache is invalidated
  │
  └─→ Sync Service:
        → Push real-time update to all owner&amp;#39;s connected devices (WebSocket)
        → Push update to attendees&amp;#39; devices
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;calendar-view-week-flow&#34;&gt;Calendar View (Week) Flow&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;User opens calendar for week of Feb 23 - Mar 1
  │
  ▼
API Gateway → Event Service:
  1. Check cache: Redis key &amp;#34;cal_view:{user_id}:2026-W09&amp;#34;
     → Cache hit (80% of the time): return cached events, done
  2. Cache miss:
     a. Query events table:
        SELECT * FROM events
        WHERE calendar_id = ?
          AND ((start_time BETWEEN ? AND ?)            -- non-recurring events in range
               OR (recurrence_rule IS NOT NULL
                   AND (recurrence_end IS NULL OR recurrence_end >= ?)))  -- recurring events still active in the range
     b. For recurring events: expand RRULE to find instances in this week
     c. Join with event_exceptions: apply per-instance modifications, remove cancelled instances
     d. Query attendees table: get events where this user is an attendee
     e. Merge owner&amp;#39;s events + attendee events
     f. Sort by start_time
     g. Cache result in Redis (TTL: 5 minutes)
  3. Return to client
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Event Service:&lt;/strong&gt; Core CRUD. Handles event creation, updates, deletion. Sharded by owner_user_id for write locality.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recurring Event Expander:&lt;/strong&gt; Library/service that takes an RRULE + time range and produces concrete event instances. Uses RFC 5545-compliant parser. Handles complex rules like &amp;ldquo;last Thursday of every month&amp;rdquo; or &amp;ldquo;every 3rd day, excluding weekends.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Service:&lt;/strong&gt; Processes reminder triggers and attendee invitations. Polls the reminders table every 10 seconds for due reminders. Sends via email, push notification, or in-app popup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Free/Busy Service:&lt;/strong&gt; Optimized for multi-user scheduling queries. Maintains a denormalized &amp;ldquo;busy slots&amp;rdquo; table per user (pre-computed from events). Returns only time ranges, no event details.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sync Service:&lt;/strong&gt; Real-time sync across devices. Uses WebSocket connections for push updates. When an event changes, publishes to Kafka, which fans out to all connected devices of affected users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache Layer (Redis):&lt;/strong&gt; Caches calendar views per user per week/month. Invalidated on event create/update/delete. Hit rate: ~80% (users repeatedly view the same week).&lt;/li&gt;
&lt;/ol&gt;
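&lt;p&gt;The week-view cache key from the flow above (&lt;code&gt;cal_view:{user_id}:2026-W09&lt;/code&gt;) can be derived from any date in the requested range via the ISO calendar, so every day of a week maps to the same entry. A small sketch (the function name is illustrative):&lt;/p&gt;

```python
from datetime import date

def week_view_cache_key(user_id, day):
    """Redis key for a user's cached week view, e.g. 'cal_view:42:2026-W09'.
    Uses the ISO year/week so all days of one week share a single entry."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"cal_view:{user_id}:{iso_year}-W{iso_week:02d}"
```

&lt;p&gt;Invalidation then deletes exactly the keys for the weeks an event touches; a recurring event can touch many weeks, which is where the short 5-minute TTL earns its keep as a backstop.&lt;/p&gt;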
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-recurring-events-and-rrule&#34;&gt;Deep Dive 1: Recurring Events and RRULE&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Recurring events are the hardest part of a calendar system. &amp;ldquo;Every weekday&amp;rdquo; generates ~260 instances/year. &amp;ldquo;Every day forever&amp;rdquo; is infinite. We cannot store every instance. But users can modify individual instances (move Tuesday&amp;rsquo;s meeting to Wednesday, cancel one occurrence).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Google Docs (Collaborative Document Editor)</title>
      <link>https://chiraghasija.cc/designs/google-docs/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/google-docs/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can create, edit, and delete documents with rich-text formatting (bold, italic, headings, lists, tables, images)&lt;/li&gt;
&lt;li&gt;Multiple users can simultaneously edit the same document in real time, seeing each other&amp;rsquo;s changes within ~200ms&lt;/li&gt;
&lt;li&gt;Each collaborator&amp;rsquo;s cursor position and selection is visible to all other editors (presence awareness)&lt;/li&gt;
&lt;li&gt;Full version history — users can view, name, and restore any previous version of the document&lt;/li&gt;
&lt;li&gt;Sharing and permissions model: owner, editor, commenter, viewer roles with link-sharing and per-user ACLs&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — documents are a user&amp;rsquo;s primary work artifact. Downtime during working hours is extremely costly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 200ms for local edit acknowledgment; &amp;lt; 500ms for remote edit propagation to other collaborators&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for document state, but operations must converge — all users see the identical document after quiescence. No lost edits, ever.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 1B documents total, 10M DAU, up to 100 concurrent editors per document, peak 500K concurrent editing sessions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Zero data loss. Every keystroke must be persisted. Point-in-time recovery for any document.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10M DAU, average 3 editing sessions/day, average session 20 minutes&lt;/li&gt;
&lt;li&gt;Average 2 operations/second per active user (character insert, delete, format change)&lt;/li&gt;
&lt;li&gt;Concurrent active editing sessions at peak: ~500K (10M DAU x 3 sessions x 20 min works out to ~417K concurrent on average; peak per the scale requirement above)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write QPS:&lt;/strong&gt; 500K sessions x 2 ops/sec = &lt;strong&gt;1M operations/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read QPS:&lt;/strong&gt; Document opens: 10M x 3 = 30M/day = ~350 reads/sec (bursty, 3x peak = ~1K/sec). Presence/cursor updates: same as write QPS.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B documents, average 50KB per document (plain text + formatting metadata)&lt;/li&gt;
&lt;li&gt;Document content: 1B x 50KB = &lt;strong&gt;50TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Operation log (for version history): assume 500 ops per document per day, 10 bytes per op average, retained for 1 year
&lt;ul&gt;
&lt;li&gt;Active documents per day: 30M. 30M docs x 500 ops x 10 bytes x 365 days = &lt;strong&gt;~55TB/year&lt;/strong&gt; of operation logs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;~105TB&lt;/strong&gt; primary storage, replicated 3x = ~315TB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;write-heavy, real-time collaboration&lt;/strong&gt; system. The core challenge is not storage or throughput — it&amp;rsquo;s ensuring that concurrent edits from multiple users converge to the same document state without conflicts or lost updates. The algorithm choice (OT vs CRDT) dominates the design.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Google Drive (Cloud Storage &amp; Collaboration)</title>
      <link>https://chiraghasija.cc/designs/google-drive/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/google-drive/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload, download, and delete files of any type and size (up to 5TB per file). Uploads support chunked and resumable upload protocols so large files survive network interruptions.&lt;/li&gt;
&lt;li&gt;Files sync automatically across all of a user&amp;rsquo;s devices. When a file is modified on one device, all other devices reflect the change within seconds.&lt;/li&gt;
&lt;li&gt;Sharing and permissions: users can share files/folders with specific people (viewer, commenter, editor) or generate shareable links with configurable access levels. Support organizational domains (anyone at company X can view).&lt;/li&gt;
&lt;li&gt;Version history: every edit creates a new version. Users can view and restore previous versions (up to 100 versions, or 30 days, whichever comes first).&lt;/li&gt;
&lt;li&gt;Full-text search across file names, contents (for supported formats: docs, PDFs, images via OCR), and metadata (owner, type, modified date).&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% for file access (download). Upload can tolerate slightly lower availability (resumable uploads mask brief outages).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Metadata operations (list, search, share): &amp;lt; 200ms. Small file download (&amp;lt; 1MB): &amp;lt; 500ms from edge CDN. Upload acknowledgment: &amp;lt; 100ms per chunk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for metadata (permissions, ownership, folder structure). Eventual consistency for file content propagation to other devices is acceptable (target: &amp;lt; 10 seconds).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 2B users total, 500M MAU, 100M DAU. 2 trillion files stored. 10B API requests/day. 5PB of new data uploaded per day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; 99.999999999% (11 nines). Files must never be lost. This is the most critical non-functional requirement.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;10B API requests/day = ~116K requests/sec average&lt;/li&gt;
&lt;li&gt;Breakdown: 60% metadata reads (list, search), 25% downloads, 15% uploads&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read QPS:&lt;/strong&gt; ~70K metadata reads/sec + ~29K downloads/sec&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write QPS:&lt;/strong&gt; ~17K uploads/sec&lt;/li&gt;
&lt;li&gt;Peak: 3x average = &lt;strong&gt;~350K requests/sec total&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Total files: 2 trillion&lt;/li&gt;
&lt;li&gt;Average file size: 2.5MB (skewed: many small docs, fewer large videos)&lt;/li&gt;
&lt;li&gt;Total storage: 2T x 2.5MB = &lt;strong&gt;5 exabytes (5,000 PB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;New data: 5PB/day → 1.8EB/year&lt;/li&gt;
&lt;li&gt;With 3x replication: &lt;strong&gt;15EB&lt;/strong&gt; raw storage&lt;/li&gt;
&lt;li&gt;With deduplication (estimated 30% duplicate data): effective ~10.5EB&lt;/li&gt;
&lt;/ul&gt;
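&lt;p&gt;The ~30% deduplication estimate above presumes content-addressed storage. One common mechanism, sketched here as an assumption (the source does not say how dedup works), is to split files into fixed-size chunks keyed by their hash, so identical chunks are stored once:&lt;/p&gt;

```python
import hashlib

def chunk_ids(data, chunk_size=4 * 1024 * 1024):
    """Split file bytes into fixed-size chunks and return each chunk's
    SHA-256 digest. Chunks with the same digest are stored only once."""
    ids = []
    for off in range(0, len(data), chunk_size):
        ids.append(hashlib.sha256(data[off:off + chunk_size]).hexdigest())
    return ids
```

&lt;p&gt;A file manifest then stores the ordered chunk ids; re-uploading an unchanged file, or one sharing chunks with a file already stored, costs only metadata. Real systems often prefer content-defined chunking over fixed offsets so an insertion does not shift every later chunk boundary.&lt;/p&gt;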
&lt;h3 id=&#34;metadata&#34;&gt;Metadata&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2 trillion files x 1KB metadata per file = &lt;strong&gt;2PB&lt;/strong&gt; of metadata&lt;/li&gt;
&lt;li&gt;This must be in a fast, queryable database — not blob storage&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Downloads: 29K/sec x 2.5MB average = &lt;strong&gt;72.5GB/sec = 580Gbps&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Uploads: 17K/sec x 2.5MB average = &lt;strong&gt;42.5GB/sec = 340Gbps&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;CDN absorbs most download traffic (cache-hit ratio ~85%), so origin bandwidth: ~87Gbps&lt;/li&gt;
&lt;/ul&gt;
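&lt;p&gt;The bandwidth figures follow mechanically from the request rates and the 2.5MB average file size; a quick arithmetic check using the numbers above (variable names are illustrative):&lt;/p&gt;

```python
# Bandwidth sanity check for the estimates above.
avg_file_mb = 2.5
download_qps = 29_000
upload_qps = 17_000
cdn_hit_ratio = 0.85

download_gbps = download_qps * avg_file_mb / 1000 * 8   # GB/s -> Gbps
upload_gbps = upload_qps * avg_file_mb / 1000 * 8
origin_gbps = download_gbps * (1 - cdn_hit_ratio)       # CDN absorbs 85%

print(download_gbps, upload_gbps, round(origin_gbps))   # 580.0 340.0 87
```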
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;storage-dominant system&lt;/strong&gt; at planetary scale. The core challenges are: (1) storing exabytes of data durably and cost-efficiently, (2) syncing file changes across devices with minimal bandwidth and latency, and (3) making 2 trillion files searchable. The metadata layer is essentially a distributed file system.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Instagram</title>
      <link>https://chiraghasija.cc/designs/instagram/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/instagram/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload photos with captions&lt;/li&gt;
&lt;li&gt;Users can follow/unfollow other users&lt;/li&gt;
&lt;li&gt;Users can view a personalized news feed (photos from people they follow)&lt;/li&gt;
&lt;li&gt;Users can like and comment on photos&lt;/li&gt;
&lt;li&gt;Users can view any user&amp;rsquo;s profile (grid of their photos)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — social feeds being down is immediately visible to millions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Feed load &amp;lt; 300ms at p99, photo upload acknowledgment &amp;lt; 2s&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for feed (2-5 seconds stale is fine). Strong consistency for uploads (post → refresh → must see it)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M DAU, 50M photo uploads/day, average user views feed 10 times/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage:&lt;/strong&gt; Photos are large; need cost-efficient media storage&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Uploads:&lt;/strong&gt; 50M/day ÷ 100K = &lt;strong&gt;500 writes/sec&lt;/strong&gt;, peak &lt;strong&gt;2,500/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feed reads:&lt;/strong&gt; 500M × 10/day = 5B reads/day ÷ 100K = &lt;strong&gt;50,000 reads/sec&lt;/strong&gt;, peak &lt;strong&gt;250,000/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read-to-write ratio:&lt;/strong&gt; 100:1 — extremely read-heavy&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average photo: 2MB original, store 4 sizes (thumbnail 50KB, small 200KB, medium 500KB, large 2MB) = ~2.75MB total per photo&lt;/li&gt;
&lt;li&gt;50M photos/day × 2.75MB = &lt;strong&gt;137TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Per year: &lt;strong&gt;~50PB&lt;/strong&gt; — this is a massive storage system&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Feed: 10 photos per load × 500KB avg = 5MB per feed load&lt;/li&gt;
&lt;li&gt;50,000 feeds/sec × 5MB = &lt;strong&gt;250GB/sec&lt;/strong&gt; average read bandwidth; at the 250,000/sec peak, &lt;strong&gt;1.25TB/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;CDN is absolutely critical&lt;/li&gt;
&lt;/ul&gt;
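&lt;p&gt;A quick check of the feed-bandwidth arithmetic above (variable names are illustrative):&lt;/p&gt;

```python
# Feed read-bandwidth check using the estimates above.
photos_per_load = 10
avg_photo_kb = 500                 # medium rendition
avg_feed_loads_per_sec = 50_000
peak_feed_loads_per_sec = 250_000

load_mb = photos_per_load * avg_photo_kb / 1000               # 5 MB per feed load
avg_gb_per_sec = avg_feed_loads_per_sec * load_mb / 1000      # 250 GB/sec average
peak_gb_per_sec = peak_feed_loads_per_sec * load_mb / 1000    # 1250 GB/sec peak
```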
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/photos
  Content-Type: multipart/form-data
  Body: {
    &amp;#34;photo&amp;#34;: &amp;lt;binary&amp;gt;,
    &amp;#34;caption&amp;#34;: &amp;#34;Sunset vibes&amp;#34;,
    &amp;#34;location&amp;#34;: &amp;#34;Mumbai, India&amp;#34;     // optional
  }
  Response 201: {
    &amp;#34;photo_id&amp;#34;: &amp;#34;p_abc123&amp;#34;,
    &amp;#34;urls&amp;#34;: {
      &amp;#34;thumbnail&amp;#34;: &amp;#34;https://cdn.ig.com/thumb/p_abc123.jpg&amp;#34;,
      &amp;#34;full&amp;#34;: &amp;#34;https://cdn.ig.com/full/p_abc123.jpg&amp;#34;
    }
  }

GET /api/v1/feed?cursor={cursor}&amp;amp;limit=10
  Response 200: {
    &amp;#34;photos&amp;#34;: [
      {
        &amp;#34;photo_id&amp;#34;: &amp;#34;p_abc123&amp;#34;,
        &amp;#34;user&amp;#34;: { &amp;#34;id&amp;#34;: &amp;#34;u_1&amp;#34;, &amp;#34;username&amp;#34;: &amp;#34;chirag&amp;#34;, &amp;#34;avatar&amp;#34;: &amp;#34;...&amp;#34; },
        &amp;#34;caption&amp;#34;: &amp;#34;Sunset vibes&amp;#34;,
        &amp;#34;photo_url&amp;#34;: &amp;#34;https://cdn.ig.com/med/p_abc123.jpg&amp;#34;,
        &amp;#34;like_count&amp;#34;: 2847,
        &amp;#34;comment_count&amp;#34;: 43,
        &amp;#34;liked_by_me&amp;#34;: true,
        &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T18:00:00Z&amp;#34;
      }
    ],
    &amp;#34;next_cursor&amp;#34;: &amp;#34;ts_1708632000&amp;#34;
  }

POST /api/v1/photos/{photo_id}/like
DELETE /api/v1/photos/{photo_id}/like

POST /api/v1/users/{user_id}/follow
DELETE /api/v1/users/{user_id}/follow

GET /api/v1/users/{user_id}/photos?cursor={cursor}&amp;amp;limit=30
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Pastebin</title>
      <link>https://chiraghasija.cc/designs/pastebin/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/pastebin/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can create a paste (text content, up to 10MB)&lt;/li&gt;
&lt;li&gt;Each paste gets a unique, shareable URL&lt;/li&gt;
&lt;li&gt;Pastes can be public or private (unlisted — only accessible via URL)&lt;/li&gt;
&lt;li&gt;Pastes can have an optional expiration (10 min, 1 hour, 1 day, 1 week, never)&lt;/li&gt;
&lt;li&gt;Syntax highlighting for code pastes (client-side, not a backend concern)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.9% — reads must be highly available; writes can tolerate brief degradation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Paste retrieval &amp;lt; 200ms at p99&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Strong consistency for writes (create → immediately readable). Eventual consistency acceptable for metadata like view counts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 5M new pastes/day, 50M reads/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage:&lt;/strong&gt; Most pastes are small (&amp;lt; 50KB), but we support up to 10MB&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Writes: 5M/day ÷ 100K = &lt;strong&gt;50 writes/sec&lt;/strong&gt;, peak &lt;strong&gt;250/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Reads: 50M/day ÷ 100K = &lt;strong&gt;500 reads/sec&lt;/strong&gt;, peak &lt;strong&gt;2,500/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average paste: 10KB (most are small code snippets)&lt;/li&gt;
&lt;li&gt;5M/day × 10KB = 50GB/day&lt;/li&gt;
&lt;li&gt;Per year: &lt;strong&gt;~18TB&lt;/strong&gt; of paste content&lt;/li&gt;
&lt;li&gt;Over 5 years: &lt;strong&gt;~90TB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Read: 500 reads/sec × 10KB = &lt;strong&gt;5MB/sec&lt;/strong&gt; average&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;25MB/sec&lt;/strong&gt; — manageable&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/pastes
  Body: {
    &amp;#34;content&amp;#34;: &amp;#34;def hello():\n    print(&amp;#39;world&amp;#39;)&amp;#34;,
    &amp;#34;title&amp;#34;: &amp;#34;My Snippet&amp;#34;,          // optional
    &amp;#34;language&amp;#34;: &amp;#34;python&amp;#34;,            // optional, for syntax highlighting
    &amp;#34;expiration&amp;#34;: &amp;#34;1d&amp;#34;,              // optional: 10m, 1h, 1d, 1w, never
    &amp;#34;visibility&amp;#34;: &amp;#34;unlisted&amp;#34;         // public or unlisted
  }
  Response 201: {
    &amp;#34;id&amp;#34;: &amp;#34;aB3kX9p&amp;#34;,
    &amp;#34;url&amp;#34;: &amp;#34;https://paste.example.com/aB3kX9p&amp;#34;,
    &amp;#34;raw_url&amp;#34;: &amp;#34;https://paste.example.com/raw/aB3kX9p&amp;#34;,
    &amp;#34;expires_at&amp;#34;: &amp;#34;2026-02-23T12:00:00Z&amp;#34;
  }

GET /api/v1/pastes/{id}
  Response 200: {
    &amp;#34;id&amp;#34;: &amp;#34;aB3kX9p&amp;#34;,
    &amp;#34;title&amp;#34;: &amp;#34;My Snippet&amp;#34;,
    &amp;#34;content&amp;#34;: &amp;#34;def hello():\n    print(&amp;#39;world&amp;#39;)&amp;#34;,
    &amp;#34;language&amp;#34;: &amp;#34;python&amp;#34;,
    &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T12:00:00Z&amp;#34;,
    &amp;#34;expires_at&amp;#34;: &amp;#34;2026-02-23T12:00:00Z&amp;#34;,
    &amp;#34;view_count&amp;#34;: 42
  }

GET /raw/{id}
  Response 200: (plain text content, no JSON wrapper)
  Content-Type: text/plain

GET /api/v1/pastes/recent?limit=20&amp;amp;cursor={cursor}
  Response 200: { &amp;#34;pastes&amp;#34;: [...], &amp;#34;next_cursor&amp;#34;: &amp;#34;...&amp;#34; }
  // Only returns public pastes
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;metadata-store-sql-postgresql&#34;&gt;Metadata Store: SQL (PostgreSQL)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: pastes
  id           (PK)   | char(7), Base62 encoded
  title                | varchar(200), nullable
  language             | varchar(50), nullable
  visibility           | enum(&amp;#39;public&amp;#39;, &amp;#39;unlisted&amp;#39;)
  content_key          | varchar(100)  -- S3 object key
  content_size         | int
  created_at           | timestamptz
  expires_at           | timestamptz, nullable
  view_count           | bigint, default 0
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;content-store-object-storage-s3&#34;&gt;Content Store: Object Storage (S3)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;pastes/{id}&lt;/code&gt; → raw text content&lt;/li&gt;
&lt;li&gt;Content is stored separately because metadata queries (list recent, check expiry) shouldn&amp;rsquo;t load paste content, S3 scales storage cheaply to petabytes, and object storage enables CDN integration for reads&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;why-sql-for-metadata&#34;&gt;Why SQL for metadata?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Need to query &lt;code&gt;WHERE visibility = &#39;public&#39; ORDER BY created_at DESC&lt;/code&gt; for the recent pastes feed&lt;/li&gt;
&lt;li&gt;Need to query &lt;code&gt;WHERE expires_at &amp;lt; NOW()&lt;/code&gt; for cleanup&lt;/li&gt;
&lt;li&gt;5M inserts/day (~50/sec) is well within PostgreSQL&amp;rsquo;s capacity with proper indexing&lt;/li&gt;
&lt;/ul&gt;
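&lt;p&gt;A minimal sketch of the two metadata queries and the indexes that serve them (SQLite stands in for PostgreSQL here; the index names are illustrative assumptions):&lt;/p&gt;

```python
import sqlite3

# SQLite stands in for PostgreSQL; schema trimmed to the queried columns.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pastes (
    id TEXT PRIMARY KEY,
    visibility TEXT,
    created_at TEXT,
    expires_at TEXT)""")

# Composite index for the public feed, single-column index for cleanup
# (index names are illustrative).
conn.execute("CREATE INDEX idx_public_recent ON pastes (visibility, created_at)")
conn.execute("CREATE INDEX idx_expiry ON pastes (expires_at)")

conn.executemany("INSERT INTO pastes VALUES (?, ?, ?, ?)", [
    ("aB3kX9p", "public", "2026-02-22", "2026-02-23"),
    ("zQ1mW4r", "unlisted", "2026-02-21", None),
])

# Recent public pastes feed
recent = conn.execute(
    "SELECT id FROM pastes WHERE visibility = 'public' "
    "ORDER BY created_at DESC LIMIT 20").fetchall()

# Expired pastes for the cleanup worker ('now' passed as a parameter)
expired = conn.execute(
    "SELECT id FROM pastes WHERE expires_at IS NOT NULL AND expires_at < ?",
    ("2026-02-24",)).fetchall()
```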
&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;write-path&#34;&gt;Write Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → Load Balancer → Paste Service
  → Generate unique ID (Snowflake → Base62)
  → Upload content to S3 (key: pastes/{id})
  → Write metadata to PostgreSQL
  → Return paste URL
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;read-path&#34;&gt;Read Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → CDN (CloudFront/Cloudflare)
  Cache hit → Return content
  Cache miss → Load Balancer → Paste Service
    → Read metadata from PostgreSQL (or Redis cache)
    → Check expiry → if expired, return 404
    → Fetch content from S3 (or Redis if cached)
    → Increment view count async (Kafka → consumer)
    → Return response + cache at CDN
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Paste Service&lt;/strong&gt;: Stateless application servers handling create/read&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; (primary + replica): Metadata store&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;S3&lt;/strong&gt;: Content store — cheap, durable, scales to petabytes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis&lt;/strong&gt;: Cache hot pastes (metadata + content for small pastes &amp;lt; 100KB)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CDN&lt;/strong&gt;: Cache popular pastes at edge; raw endpoint is especially CDN-friendly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka&lt;/strong&gt;: Async view count updates + expiry event stream&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cleanup Worker&lt;/strong&gt;: Periodic job to delete expired pastes from S3 and PostgreSQL&lt;/li&gt;
&lt;/ol&gt;
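&lt;p&gt;The ID step in the write path (Snowflake → Base62) can be sketched as follows; the digit alphabet ordering is an assumption:&lt;/p&gt;

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    """Encode a Snowflake-style integer ID in Base62."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

# Seven Base62 characters cover 62**7 ≈ 3.5 trillion IDs; at 5M pastes/day
# that keyspace lasts for centuries, so char(7) is a comfortable choice.
```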
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-storage-architecture-s3-vs-database&#34;&gt;Deep Dive 1: Storage Architecture (S3 vs Database)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why not store content in PostgreSQL?&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Twitter for Millions of Users</title>
      <link>https://chiraghasija.cc/designs/twitter/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/twitter/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can post tweets (280 chars, text only for scope)&lt;/li&gt;
&lt;li&gt;Users can follow/unfollow other users&lt;/li&gt;
&lt;li&gt;Users can view their home timeline (tweets from followed users)&lt;/li&gt;
&lt;li&gt;Users can like and retweet&lt;/li&gt;
&lt;li&gt;Users can search for tweets&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — timeline is the core product&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Timeline load &amp;lt; 200ms at p99, tweet post &amp;lt; 500ms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for timeline (a few seconds stale is fine). Read-your-writes consistency for the author&amp;rsquo;s own tweets (post → refresh → see it).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 500M DAU, 200M tweets/day. Average user follows 200 people and loads timeline 10 times/day.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;writes&#34;&gt;Writes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;200M tweets/day ÷ 100K = &lt;strong&gt;2,000 tweets/sec&lt;/strong&gt;, peak &lt;strong&gt;10,000/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;1B likes/day → &lt;strong&gt;10,000 likes/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;reads-timeline&#34;&gt;Reads (Timeline)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500M DAU × 10 loads/day = 5B loads/day ÷ 100K = &lt;strong&gt;50,000 timeline reads/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: &lt;strong&gt;250,000/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Read-to-write ratio: 25:1&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Tweet: 280 chars × 2 bytes + metadata ~300 bytes = ~860 bytes → round to 1KB&lt;/li&gt;
&lt;li&gt;200M/day × 1KB = 200GB/day → &lt;strong&gt;73TB/year&lt;/strong&gt; for tweets alone&lt;/li&gt;
&lt;li&gt;User metadata, follows, likes: additional ~20TB/year&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;fan-out-calculation&#34;&gt;Fan-out calculation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Average user has 200 followers&lt;/li&gt;
&lt;li&gt;200M tweets/day × 200 followers = &lt;strong&gt;40B fan-out writes/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Celebrity with 50M followers posting → 50M cache writes for one tweet&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;POST /api/v1/tweets
  Body: { &amp;#34;content&amp;#34;: &amp;#34;Hello world!&amp;#34;, &amp;#34;reply_to&amp;#34;: &amp;#34;t_abc&amp;#34; }  // reply_to optional
  Response 201: { &amp;#34;tweet_id&amp;#34;: &amp;#34;t_xyz&amp;#34;, &amp;#34;created_at&amp;#34;: &amp;#34;...&amp;#34; }

GET /api/v1/timeline?cursor={cursor}&amp;amp;limit=20
  Response 200: {
    &amp;#34;tweets&amp;#34;: [
      {
        &amp;#34;tweet_id&amp;#34;: &amp;#34;t_xyz&amp;#34;,
        &amp;#34;user&amp;#34;: { &amp;#34;id&amp;#34;: &amp;#34;u_1&amp;#34;, &amp;#34;username&amp;#34;: &amp;#34;chirag&amp;#34;, &amp;#34;display_name&amp;#34;: &amp;#34;Chirag&amp;#34;, &amp;#34;avatar&amp;#34;: &amp;#34;...&amp;#34; },
        &amp;#34;content&amp;#34;: &amp;#34;Hello world!&amp;#34;,
        &amp;#34;like_count&amp;#34;: 42,
        &amp;#34;retweet_count&amp;#34;: 7,
        &amp;#34;reply_count&amp;#34;: 3,
        &amp;#34;liked_by_me&amp;#34;: false,
        &amp;#34;retweeted_by_me&amp;#34;: false,
        &amp;#34;created_at&amp;#34;: &amp;#34;2026-02-22T18:00:00Z&amp;#34;
      }
    ],
    &amp;#34;next_cursor&amp;#34;: &amp;#34;1708632000_t_abc&amp;#34;
  }

POST /api/v1/tweets/{tweet_id}/like
DELETE /api/v1/tweets/{tweet_id}/like

POST /api/v1/tweets/{tweet_id}/retweet
DELETE /api/v1/tweets/{tweet_id}/retweet

POST /api/v1/users/{user_id}/follow
DELETE /api/v1/users/{user_id}/follow

GET /api/v1/search?q={query}&amp;amp;cursor={cursor}
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;user--social-graph-postgresql-sharded-by-user_id&#34;&gt;User &amp;amp; Social Graph (PostgreSQL, sharded by user_id)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: users
  user_id       (PK) | bigint
  username             | varchar(15), unique
  display_name         | varchar(50)
  bio                  | varchar(160)
  follower_count       | int (denormalized)
  following_count      | int (denormalized)

Table: follows
  follower_id          | bigint
  followee_id          | bigint
  created_at           | timestamptz
  PK: (follower_id, followee_id)
  Index: (followee_id)  -- for &amp;#34;who follows me&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;tweets-cassandra-or-dynamodb&#34;&gt;Tweets (Cassandra or DynamoDB)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: tweets
  tweet_id      (PK) | bigint (Snowflake ID)
  user_id              | bigint
  content              | text
  reply_to             | bigint, nullable
  like_count           | int (denormalized, eventually consistent)
  retweet_count        | int
  created_at           | timestamp
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;timeline-cache-redis&#34;&gt;Timeline Cache (Redis)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Key: timeline:{user_id}
Type: Sorted Set
Members: tweet_ids, scored by created_at timestamp
Max size: 800 entries (older tweets evicted)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;likes-cassandra&#34;&gt;Likes (Cassandra)&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Table: likes
  tweet_id  (partition key) | bigint
  user_id   (clustering key)| bigint
  created_at                | timestamp

Table: user_likes  (reverse index for &amp;#34;tweets I liked&amp;#34;)
  user_id   (partition key) | bigint
  tweet_id  (clustering key)| bigint
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id=&#34;5-high-level-design-12-min&#34;&gt;5. High-Level Design (12 min)&lt;/h2&gt;
&lt;h3 id=&#34;tweet-post-path&#34;&gt;Tweet Post Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → Load Balancer → Tweet Service
  → Validate (length, content policy)
  → Write to Cassandra (tweets table)
  → Return success to client (fast ack)
  → Publish to Kafka: tweet_events topic
  → Fan-out Service consumes from Kafka:
    → Fetch follower list for the author
    → For each follower:
      → ZADD timeline:{follower_id} {timestamp} {tweet_id} in Redis
      → ZREMRANGEBYRANK to keep max 800 entries
    → For celebrities (&amp;gt; 10K followers): skip fan-out (pull model)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;timeline-read-path&#34;&gt;Timeline Read Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → Load Balancer → Timeline Service
  → Read timeline:{user_id} from Redis
    → ZREVRANGE with cursor-based pagination
  → For each tweet_id, batch fetch tweet data:
    → Redis cache first (hot tweets)
    → Cassandra for cache misses
  → Merge with celebrity tweets (pull):
    → Get list of celebrities the user follows
    → Fetch their recent tweets
    → Merge + sort by timestamp
  → Hydrate: add user info, liked_by_me, retweeted_by_me
  → Return assembled timeline
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;search-path&#34;&gt;Search Path&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Client → Search Service → Elasticsearch
  → Full-text search on tweet content
  → Filter by time range, user, engagement metrics
  → Return results ranked by relevance + recency
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;components&#34;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Tweet Service&lt;/strong&gt;: Write path — validates and stores tweets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timeline Service&lt;/strong&gt;: Read path — assembles personalized timelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fan-out Service&lt;/strong&gt;: Async — distributes tweets to followers&amp;rsquo; cached timelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Search Service&lt;/strong&gt;: Elasticsearch-backed tweet search&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cassandra&lt;/strong&gt;: Tweet storage, likes, retweets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt;: User profiles, social graph (follows)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis Cluster&lt;/strong&gt;: Timeline caches, hot tweet cache, user data cache&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka&lt;/strong&gt;: Event stream — tweet events, like events, follow events&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CDN&lt;/strong&gt;: Cache user avatars, media attachments&lt;/li&gt;
&lt;/ol&gt;
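&lt;p&gt;The hybrid push/pull flow above can be simulated in memory (a minimal sketch: plain Python lists stand in for Redis sorted sets, so ZADD and ZREMRANGEBYRANK become append, sort, and trim):&lt;/p&gt;

```python
import heapq

CELEBRITY_CUTOFF = 10_000   # follower count above which we skip fan-out
TIMELINE_MAX = 800          # entries kept per cached timeline

timelines = {}              # follower_id -> list of (timestamp, tweet_id)
celebrity_tweets = {}       # celebrity user_id -> list of (timestamp, tweet_id)

def post_tweet(author, followers, tweet_id, ts):
    """Push to follower timelines, or store for pull if the author is a celebrity."""
    if len(followers) >= CELEBRITY_CUTOFF:
        celebrity_tweets.setdefault(author, []).append((ts, tweet_id))
        return
    for f in followers:
        tl = timelines.setdefault(f, [])
        tl.append((ts, tweet_id))
        tl.sort()
        del tl[:-TIMELINE_MAX]   # keep only the newest 800, like ZREMRANGEBYRANK

def read_timeline(user, followed_celebs, limit=20):
    """Merge the pushed timeline with pulled celebrity tweets, newest first."""
    pushed = timelines.get(user, [])
    pulled = [t for c in followed_celebs for t in celebrity_tweets.get(c, [])]
    return heapq.nlargest(limit, pushed + pulled)
```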
&lt;hr&gt;
&lt;h2 id=&#34;6-deep-dives-15-min&#34;&gt;6. Deep Dives (15 min)&lt;/h2&gt;
&lt;h3 id=&#34;deep-dive-1-fan-out-strategy-the-celebrity-problem&#34;&gt;Deep Dive 1: Fan-out Strategy (The Celebrity Problem)&lt;/h3&gt;
&lt;p&gt;This is THE defining problem of Twitter&amp;rsquo;s architecture.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Twitter Trending Topics</title>
      <link>https://chiraghasija.cc/designs/twitter-trending/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/twitter-trending/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Detect trending topics&lt;/strong&gt; — Identify hashtags, phrases, and named entities that are experiencing an abnormal surge in tweet volume (velocity over absolute volume)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geographic trending&lt;/strong&gt; — Compute separate trend lists for global, per-country, and per-city levels (~500 geographic regions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Personalized trends&lt;/strong&gt; — Show a mix of global/local trends and topics relevant to the user&amp;rsquo;s interests (who they follow, what they engage with)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trend context&lt;/strong&gt; — For each trending topic, display explanatory context: &amp;ldquo;Trending because of [event]&amp;rdquo;, tweet count, related tweets, and a category (Sports, Politics, Entertainment, etc.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trend timeline&lt;/strong&gt; — Show how a topic&amp;rsquo;s volume changed over time (sparkline), when it started trending, and whether it&amp;rsquo;s rising, peaking, or declining&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — the trends sidebar is visible on every Twitter page. Downtime is highly visible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Trend list must load in &amp;lt; 200ms. Trend detection lag: a topic should appear in the trending list within 2-5 minutes of its surge beginning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Trends update every 1-2 minutes. Stale trends (showing yesterday&amp;rsquo;s event) severely degrade trust.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Twitter processes ~500M tweets/day (6,000 tweets/sec average, 15,000/sec peak). Each tweet may contain 0-5 hashtagged or extractable topics. That&amp;rsquo;s up to 75,000 topic events/sec at peak.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spam resilience:&lt;/strong&gt; Coordinated bot campaigns must not be able to artificially push a topic to trending. False positives are worse than false negatives (a fake trend damages trust).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Tweets/day: 500M → &lt;strong&gt;6,000 tweets/sec&lt;/strong&gt; avg, &lt;strong&gt;15,000/sec&lt;/strong&gt; peak&lt;/li&gt;
&lt;li&gt;Topics extracted per tweet: ~2 (hashtags + NLP-extracted phrases) → &lt;strong&gt;12,000 topic events/sec&lt;/strong&gt; avg, &lt;strong&gt;30,000/sec&lt;/strong&gt; peak&lt;/li&gt;
&lt;li&gt;Trend reads: Every Twitter page load fetches trends. 300M DAU × 10 page loads/day = 3B trend reads/day → &lt;strong&gt;35,000 reads/sec&lt;/strong&gt; avg, &lt;strong&gt;100,000 reads/sec&lt;/strong&gt; peak&lt;/li&gt;
&lt;li&gt;Geographic granularity: ~500 regions × trend list refresh every 60 seconds = &lt;strong&gt;500 trend computations/minute&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-time counters:&lt;/strong&gt; Sliding window counts for ~5M unique topics across 500 regions
&lt;ul&gt;
&lt;li&gt;Each topic-region pair: topic_hash (8B) + count (8B) + window metadata (32B) ≈ 48B&lt;/li&gt;
&lt;li&gt;5M topics × 500 regions × 48B = &lt;strong&gt;120 GB&lt;/strong&gt; — fits in memory (Redis cluster)&lt;/li&gt;
&lt;li&gt;In practice, most topics only trend in 1-3 regions, so effective storage is much less&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Historical trends:&lt;/strong&gt; Archive of what trended, when, and why
&lt;ul&gt;
&lt;li&gt;500 regions × 30 trends × 1440 minutes/day × 365 days = 7.9B records/year&lt;/li&gt;
&lt;li&gt;Each record: ~200B → &lt;strong&gt;1.6 TB/year&lt;/strong&gt; — manageable in a columnar store (ClickHouse)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Count-Min Sketch:&lt;/strong&gt; For approximate frequency counting of all topics
&lt;ul&gt;
&lt;li&gt;5 hash functions × 1M counters × 4 bytes = &lt;strong&gt;20 MB per sketch&lt;/strong&gt; per region&lt;/li&gt;
&lt;li&gt;500 regions × 20 MB = &lt;strong&gt;10 GB&lt;/strong&gt; total — trivially small&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
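&lt;p&gt;The Count-Min Sketch sized above works as follows (a minimal sketch, shrunk to a 1K-counter width for illustration; md5 stands in for the production hash functions):&lt;/p&gt;

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: may overestimate, never underestimates."""

    def __init__(self, rows=5, width=1000):
        self.rows = rows
        self.width = width
        self.table = [[0] * width for _ in range(rows)]

    def _buckets(self, topic):
        # One independent-ish hash per row, derived by salting with the row index.
        for row in range(self.rows):
            h = hashlib.md5(f"{row}:{topic}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, topic, count=1):
        for row, col in self._buckets(topic):
            self.table[row][col] += count

    def estimate(self, topic):
        # The true count is at most the minimum across rows.
        return min(self.table[row][col] for row, col in self._buckets(topic))
```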
&lt;h3 id=&#34;compute&#34;&gt;Compute&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Topic extraction (NLP): 15,000 tweets/sec × ~1ms per tweet = &lt;strong&gt;15 CPU-seconds/sec&lt;/strong&gt; → 15 cores dedicated to NLP extraction&lt;/li&gt;
&lt;li&gt;Trend scoring: 500 regions × every 60 seconds, score 10K candidate topics per region = 500 × 10K / 60 ≈ &lt;strong&gt;83K scoring operations/sec&lt;/strong&gt; — lightweight math, &amp;lt; 1 core&lt;/li&gt;
&lt;/ul&gt;
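&lt;p&gt;The per-topic scoring is lightweight because it reduces to a ratio of recent rate to baseline rate (velocity over volume); a minimal score sketch, with the smoothing constant as an assumed tuning knob:&lt;/p&gt;

```python
def trend_score(recent_counts, baseline_counts, smoothing=10.0):
    """How much a topic's recent per-minute rate exceeds its trailing baseline.

    recent_counts:   per-minute counts for the last few minutes
    baseline_counts: per-minute counts over a longer trailing window
    """
    recent_rate = sum(recent_counts) / max(len(recent_counts), 1)
    baseline_rate = sum(baseline_counts) / max(len(baseline_counts), 1)
    # Smoothing keeps brand-new topics with near-zero baselines from dominating.
    return recent_rate / (baseline_rate + smoothing)

# A topic spiking from ~2/min to ~120/min outranks a big topic holding steady:
spike = trend_score([110, 120, 130], [2] * 60)
steady = trend_score([1000, 1000, 1000], [1000] * 60)
```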
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;trending-topics-api&#34;&gt;Trending Topics API&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Get trending topics for a user (personalized + geographic)
GET /api/v1/trends
  Params:
    user_id (optional)          // for personalized trends
    woeid (optional)            // &amp;#34;Where On Earth ID&amp;#34; — geographic region. 1 = global
    count (default 30)          // number of trends to return
  → 200 {
    &amp;#34;as_of&amp;#34;: &amp;#34;2024-03-15T14:32:00Z&amp;#34;,
    &amp;#34;location&amp;#34;: { &amp;#34;name&amp;#34;: &amp;#34;San Francisco&amp;#34;, &amp;#34;woeid&amp;#34;: 2487956 },
    &amp;#34;trends&amp;#34;: [
      {
        &amp;#34;name&amp;#34;: &amp;#34;#SuperBowl&amp;#34;,
        &amp;#34;query&amp;#34;: &amp;#34;%23SuperBowl&amp;#34;,
        &amp;#34;url&amp;#34;: &amp;#34;https://twitter.com/search?q=%23SuperBowl&amp;#34;,
        &amp;#34;tweet_volume_24h&amp;#34;: 2340000,
        &amp;#34;tweet_volume_1h&amp;#34;: 185000,
        &amp;#34;category&amp;#34;: &amp;#34;Sports&amp;#34;,
        &amp;#34;context&amp;#34;: &amp;#34;The Super Bowl is happening tonight at Allegiant Stadium&amp;#34;,
        &amp;#34;started_trending_at&amp;#34;: &amp;#34;2024-03-15T10:00:00Z&amp;#34;,
        &amp;#34;trend_type&amp;#34;: &amp;#34;breaking&amp;#34;,         // breaking | sustained | recurring
        &amp;#34;sparkline&amp;#34;: [12, 15, 45, 120, 185],  // hourly volumes (last 5h)
        &amp;#34;promoted&amp;#34;: false
      },
      ...
    ]
  }

// Get available trending locations
GET /api/v1/trends/available
  → 200 { &amp;#34;locations&amp;#34;: [{ &amp;#34;name&amp;#34;: &amp;#34;Worldwide&amp;#34;, &amp;#34;woeid&amp;#34;: 1 }, ...] }

// Get trend details (for &amp;#34;Why is X trending&amp;#34; page)
GET /api/v1/trends/{topic}/details
  Params: woeid
  → 200 {
    &amp;#34;topic&amp;#34;: &amp;#34;#SuperBowl&amp;#34;,
    &amp;#34;summary&amp;#34;: &amp;#34;The Super Bowl LVIII between...&amp;#34;,
    &amp;#34;related_topics&amp;#34;: [&amp;#34;#Chiefs&amp;#34;, &amp;#34;#49ers&amp;#34;, &amp;#34;#HalftimeShow&amp;#34;],
    &amp;#34;top_tweets&amp;#34;: [...],
    &amp;#34;volume_timeseries&amp;#34;: [...],          // minute-by-minute for last 24h
    &amp;#34;demographics&amp;#34;: { &amp;#34;top_geos&amp;#34;: [...], &amp;#34;top_age_groups&amp;#34;: [...] }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;internal-apis&#34;&gt;Internal APIs&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Topic ingestion (called by tweet processing pipeline)
POST /internal/topics/ingest
  Body: {
    &amp;#34;tweet_id&amp;#34;: &amp;#34;...&amp;#34;,
    &amp;#34;topics&amp;#34;: [&amp;#34;#SuperBowl&amp;#34;, &amp;#34;Chiefs&amp;#34;, &amp;#34;halftime show&amp;#34;],
    &amp;#34;user_id&amp;#34;: &amp;#34;...&amp;#34;,
    &amp;#34;geo&amp;#34;: { &amp;#34;country&amp;#34;: &amp;#34;US&amp;#34;, &amp;#34;city&amp;#34;: &amp;#34;San Francisco&amp;#34; },
    &amp;#34;timestamp&amp;#34;: &amp;#34;...&amp;#34;
  }

// Admin: suppress a trend (safety/trust intervention)
POST /internal/trends/suppress
  Body: { &amp;#34;topic&amp;#34;: &amp;#34;#ScamCoin&amp;#34;, &amp;#34;reason&amp;#34;: &amp;#34;coordinated_inauthentic&amp;#34;, &amp;#34;duration_hours&amp;#34;: 24 }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;WOEID (Where On Earth ID):&lt;/strong&gt; Using Yahoo&amp;rsquo;s WOEID system for geographic identifiers — it&amp;rsquo;s a well-established standard for location hierarchy (city → state → country → world).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sparkline data inline:&lt;/strong&gt; Including a small array of recent hourly volumes allows the client to render a trend line without a separate API call.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trend type classification:&lt;/strong&gt; Distinguishing &amp;ldquo;breaking&amp;rdquo; (sudden spike), &amp;ldquo;sustained&amp;rdquo; (high volume over hours), and &amp;ldquo;recurring&amp;rdquo; (regularly trends at this time, like #MondayMotivation) helps the UI prioritize display.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;topic-counts--sliding-window-redis&#34;&gt;Topic Counts — Sliding Window (Redis)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Key Pattern&lt;/th&gt;
          &lt;th&gt;Type&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;topic:{hash}:window:{region}:{minute_bucket}&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;INT&lt;/td&gt;
          &lt;td&gt;Tweet count in this 1-minute bucket&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;topic:{hash}:meta&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Hash&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;{name, category, first_seen, is_hashtag}&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;trend:list:{woeid}&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Sorted Set&lt;/td&gt;
          &lt;td&gt;Current trends, scored by trend_score. Top 30 = the trend list&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;trend:suppress:{topic_hash}&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;String&lt;/td&gt;
          &lt;td&gt;TTL-based suppression flag&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Sliding Window Design:&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Typeahead Suggestion / Autocomplete</title>
      <link>https://chiraghasija.cc/designs/typeahead/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/typeahead/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;As the user types in a search box, show top 5-10 suggestions matching the prefix&lt;/li&gt;
&lt;li&gt;Suggestions should be ranked by popularity (search frequency)&lt;/li&gt;
&lt;li&gt;Suggestions update as new search trends emerge (near-real-time)&lt;/li&gt;
&lt;li&gt;Support personalization (user&amp;rsquo;s own search history weighted higher)&lt;/li&gt;
&lt;li&gt;Handle typos/fuzzy matching (optional stretch goal)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — broken autocomplete degrades the entire search experience&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Suggestions must appear in &amp;lt; 50ms at p99 (users expect instant response as they type)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency is fine — if a new trending search takes a few minutes to appear in suggestions, that&amp;rsquo;s acceptable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 10B search queries/day, suggestions requested on every keystroke → 50B+ suggestion requests/day (average query = 5 keystrokes)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;50B suggestion requests/day ÷ 100K = &lt;strong&gt;500,000 requests/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Peak: 3x → &lt;strong&gt;1.5M requests/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;This is an extremely high-QPS system — latency and throughput are everything&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;data-size&#34;&gt;Data Size&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Vocabulary: ~5M unique search phrases (with frequency counts)&lt;/li&gt;
&lt;li&gt;Each phrase: avg 30 chars + frequency count + metadata = ~100 bytes&lt;/li&gt;
&lt;li&gt;Total vocabulary: 5M × 100 bytes = &lt;strong&gt;500MB&lt;/strong&gt; — fits entirely in memory&lt;/li&gt;
&lt;/ul&gt;
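&lt;p&gt;Since the whole vocabulary fits in memory, even a sorted list with binary search gives fast prefix lookup. A baseline sketch (the sample vocabulary is invented; real systems typically precompute top-k suggestions per prefix):&lt;/p&gt;

```python
import bisect

# (phrase, frequency) pairs, kept sorted by phrase for binary search.
VOCAB = sorted([
    ("system design", 9000),
    ("system of a down", 7000),
    ("systemd", 5000),
    ("sysadmin", 3000),
])
PHRASES = [p for p, _ in VOCAB]

def suggest(prefix, k=5):
    """Top-k phrases with the given prefix, ranked by search frequency."""
    lo = bisect.bisect_left(PHRASES, prefix)
    hi = bisect.bisect_left(PHRASES, prefix + "\uffff")  # end of the prefix range
    matches = list(VOCAB[lo:hi])
    matches.sort(key=lambda pf: -pf[1])
    return [p for p, _ in matches[:k]]
```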
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;memory-bound, latency-critical&lt;/strong&gt; system. The entire dataset fits in RAM. The challenge is serving 1.5M QPS at &amp;lt; 50ms consistently.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Uber&#39;s Driver-Rider Matching System</title>
      <link>https://chiraghasija.cc/designs/uber-matching/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/uber-matching/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Rider requests a ride (pickup location, destination) and the system finds the best available driver within seconds&lt;/li&gt;
&lt;li&gt;Drivers continuously report their GPS location; the system maintains a real-time view of all available drivers&lt;/li&gt;
&lt;li&gt;Matching algorithm considers distance, ETA, driver rating, vehicle type, and supply-demand balance&lt;/li&gt;
&lt;li&gt;Support ride types: standard, pool/shared, premium, XL — each with different matching criteria&lt;/li&gt;
&lt;li&gt;Handle surge pricing zones and dynamically adjust pricing based on supply-demand ratio&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — riders cannot request rides if matching is down. Revenue loss is immediate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Match a rider to a driver within 3-5 seconds. Location updates processed within 1 second.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; A driver must never be matched to two riders simultaneously (strong consistency on driver state). Ride pricing can be eventually consistent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 5M concurrent drivers, 20M rides/day, ~1.25M location updates/sec globally (drivers report every 4 seconds).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness:&lt;/strong&gt; Driver location must be accurate within 5 seconds for meaningful ETA calculations.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Location updates:&lt;/strong&gt; 5M drivers × 1 update/4 seconds = &lt;strong&gt;1.25M updates/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ride requests:&lt;/strong&gt; 20M rides/day = ~230 rides/sec (peak 5x = 1,150/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Match queries:&lt;/strong&gt; Each ride request queries nearby drivers → geospatial index lookup (1,150 QPS at peak)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ETA calculations:&lt;/strong&gt; Per match attempt, compute ETA for ~10 candidate drivers = 11,500 ETA computations/sec at peak&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Driver locations:&lt;/strong&gt; 5M drivers × 64 bytes (id + lat/lng + timestamp + status) = 320MB (fits in memory)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ride history:&lt;/strong&gt; 20M rides/day × 1KB × 365 = 7.3TB/year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geospatial index:&lt;/strong&gt; In-memory spatial index of 5M points = ~500MB&lt;/li&gt;
&lt;/ul&gt;
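&lt;p&gt;The in-memory spatial index can be sketched as fixed-size grid buckets (the cell size here is an illustrative assumption; production systems typically use geohash, S2, or H3 cells):&lt;/p&gt;

```python
from collections import defaultdict

CELL_DEG = 0.01                 # roughly 1km cells at the equator (illustrative)

cells = defaultdict(set)        # (cell_x, cell_y) -> set of driver ids
driver_pos = {}                 # driver id -> (lat, lng)

def cell_of(lat, lng):
    return (int(lat / CELL_DEG), int(lng / CELL_DEG))

def update_location(driver, lat, lng):
    """Move a driver between grid cells; O(1) work per GPS ping."""
    if driver in driver_pos:
        cells[cell_of(*driver_pos[driver])].discard(driver)
    driver_pos[driver] = (lat, lng)
    cells[cell_of(lat, lng)].add(driver)

def nearby(lat, lng):
    """Candidate drivers in the rider's cell and its 8 neighbors."""
    cx, cy = cell_of(lat, lng)
    found = set()
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            found |= cells[(cx + dx, cy + dy)]
    return found
```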
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;real-time geospatial matching problem&lt;/strong&gt;. The core challenge is maintaining an up-to-date spatial index of 5M moving points with 1.25M updates/sec, then efficiently querying it to find optimal matches. The matching itself is a constrained optimization problem (minimize total wait time across all concurrent requests, not just each individual one).&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design Yelp (Proximity Service)</title>
      <link>https://chiraghasija.cc/designs/yelp/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/yelp/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users search for businesses near a given location (latitude, longitude) within a specified radius&lt;/li&gt;
&lt;li&gt;Results are ranked by a combination of relevance, distance, and user ratings&lt;/li&gt;
&lt;li&gt;Support filtering by business category (restaurants, hotels, gas stations), price range, open now, and rating threshold&lt;/li&gt;
&lt;li&gt;Display business details: name, address, photos, hours, reviews, ratings&lt;/li&gt;
&lt;li&gt;Business owners can add/update their business listing information&lt;/li&gt;
&lt;/ol&gt;
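&lt;p&gt;The blended ranking in requirement 2 can be sketched as a weighted score over relevance, proximity, and rating. The weights and the distance-decay shape below are illustrative assumptions; a real system learns them from engagement data.&lt;/p&gt;

```python
def rank_score(relevance, distance_m, rating, weights=(0.5, 0.3, 0.2)):
    # Illustrative weights, not tuned values.
    proximity = 1.0 / (1.0 + distance_m / 500.0)  # decays past ~500 m
    w_rel, w_prox, w_rating = weights
    return w_rel * relevance + w_prox * proximity + w_rating * (rating / 5.0)

# A close, well-rated match can outrank a more relevant but distant one.
results = [
    {"name": "A", "relevance": 0.9, "distance_m": 1200, "rating": 4.5},
    {"name": "B", "relevance": 0.7, "distance_m": 100,  "rating": 4.8},
]
ranked = sorted(
    results,
    key=lambda b: rank_score(b["relevance"], b["distance_m"], b["rating"]),
    reverse=True)
```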
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — search is the core product. Users expect instant results when they open the app.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; &amp;lt; 100ms for proximity search queries (p99). Business detail pages &amp;lt; 200ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Eventual consistency for business data updates. A new restaurant appearing in search within 1 hour is acceptable. But search results must be consistent within a single query (no partial results).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 200M businesses worldwide, 100K search QPS at peak, 500M DAU&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Location accuracy:&lt;/strong&gt; Results within a 500m radius should be nearly exhaustive (recall &amp;gt; 99%). Results at larger radii (5km+) can prioritize relevance over exhaustiveness.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Search QPS: 100K peak, 40K average&lt;/li&gt;
&lt;li&gt;Business detail views: 200K QPS (users click into results)&lt;/li&gt;
&lt;li&gt;Business writes (new/updated listings): 1K/sec (low — read-heavy system)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;200M businesses x 2KB per business (name, address, coordinates, category, hours, ratings) = &lt;strong&gt;400 GB&lt;/strong&gt; for core business data&lt;/li&gt;
&lt;li&gt;Photos: 200M businesses x 5 photos avg x 200KB = &lt;strong&gt;200 TB&lt;/strong&gt; (CDN-served)&lt;/li&gt;
&lt;li&gt;Reviews: 2 billion reviews x 500 bytes = &lt;strong&gt;1 TB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Geospatial index: 200M businesses x 50 bytes (coordinates + geohash + pointers) = &lt;strong&gt;10 GB&lt;/strong&gt; — fits entirely in memory&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;key-insight&#34;&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The geospatial index (10GB) is small enough to fit in memory on a single machine, but we need to replicate it across multiple read replicas for 100K QPS. The real challenge is not storage — it is efficient spatial querying and ranking at low latency.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design YouTube / Netflix</title>
      <link>https://chiraghasija.cc/designs/youtube/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/youtube/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Users can upload videos&lt;/li&gt;
&lt;li&gt;Users can stream/watch videos (adaptive bitrate)&lt;/li&gt;
&lt;li&gt;Users can search for videos&lt;/li&gt;
&lt;li&gt;Personalized home feed (recommended videos)&lt;/li&gt;
&lt;li&gt;Video metadata: title, description, view count, likes, comments&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.99% — video playback must be rock-solid&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Video playback start &amp;lt; 2 seconds. Search results &amp;lt; 300ms. Home feed &amp;lt; 500ms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; View counts and likes can be eventually consistent (seconds of delay acceptable). Video availability after upload: within minutes (transcoding pipeline).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; 2B MAU, 1B videos watched/day, 500K video uploads/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bandwidth:&lt;/strong&gt; This is a bandwidth-dominated system — video streaming is 80%+ of internet traffic&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;500K uploads/day × avg 5 minutes × 10MB/min (original) = 25TB/day raw uploads&lt;/li&gt;
&lt;li&gt;After transcoding (5 resolutions × 3 codecs): ~5x storage multiplier = &lt;strong&gt;125TB/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Per year: &lt;strong&gt;~45PB&lt;/strong&gt; — massive storage system&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;1B video views/day, avg 5 min watch time, avg bitrate 3Mbps&lt;/li&gt;
&lt;li&gt;Concurrent viewers (assume 10% of the 2B user base streaming at peak): 200M concurrent&lt;/li&gt;
&lt;li&gt;200M × 3Mbps = &lt;strong&gt;600Tbps&lt;/strong&gt; peak bandwidth&lt;/li&gt;
&lt;li&gt;Even with CDN (95%+ cached), origin bandwidth: &lt;strong&gt;30Tbps&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
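&lt;p&gt;Restating the peak-bandwidth arithmetic explicitly; the 10% concurrency and the 95% CDN hit rate are the assumptions stated above:&lt;/p&gt;

```python
users = 2_000_000_000            # 2B users
concurrent = users // 10         # assume 10% streaming at peak
bitrate_bps = 3_000_000          # 3 Mbps average stream

peak_bps = concurrent * bitrate_bps   # total egress at peak
origin_bps = peak_bps * 5 // 100      # the 5% of requests that miss the CDN

TBPS = 10 ** 12
print(peak_bps // TBPS, origin_bps // TBPS)  # 600 30
```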
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Upload: 500K/day ÷ 100K = &lt;strong&gt;5 uploads/sec&lt;/strong&gt; (low, but each is large and long-running)&lt;/li&gt;
&lt;li&gt;Video plays: 1B/day ÷ 100K = &lt;strong&gt;10,000 play starts/sec&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Search: assume 500M/day = &lt;strong&gt;5,000 searches/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Upload flow (chunked, resumable)
POST /api/v1/videos/upload/init
  Body: { &amp;#34;title&amp;#34;: &amp;#34;My Video&amp;#34;, &amp;#34;description&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;filename&amp;#34;: &amp;#34;video.mp4&amp;#34; }
  Response 200: { &amp;#34;upload_id&amp;#34;: &amp;#34;up_123&amp;#34;, &amp;#34;upload_url&amp;#34;: &amp;#34;https://upload.yt.com/up_123&amp;#34; }

PUT /upload/{upload_id}
  Headers: Content-Range: bytes 0-5242879/*
  Body: &amp;lt;binary chunk&amp;gt;
  Response 308: { &amp;#34;next_offset&amp;#34;: 5242880 }

POST /api/v1/videos/upload/{upload_id}/complete
  Response 202: { &amp;#34;video_id&amp;#34;: &amp;#34;v_abc&amp;#34;, &amp;#34;status&amp;#34;: &amp;#34;processing&amp;#34; }

// Playback
GET /api/v1/videos/{video_id}
  Response 200: {
    &amp;#34;video_id&amp;#34;: &amp;#34;v_abc&amp;#34;,
    &amp;#34;title&amp;#34;: &amp;#34;My Video&amp;#34;,
    &amp;#34;description&amp;#34;: &amp;#34;...&amp;#34;,
    &amp;#34;channel&amp;#34;: { &amp;#34;id&amp;#34;: &amp;#34;c_1&amp;#34;, &amp;#34;name&amp;#34;: &amp;#34;TechChannel&amp;#34;, &amp;#34;subscriber_count&amp;#34;: 1000000 },
    &amp;#34;view_count&amp;#34;: 1500000,
    &amp;#34;like_count&amp;#34;: 50000,
    &amp;#34;duration&amp;#34;: 300,
    &amp;#34;stream_urls&amp;#34;: {
      &amp;#34;dash&amp;#34;: &amp;#34;https://cdn.yt.com/v_abc/manifest.mpd&amp;#34;,
      &amp;#34;hls&amp;#34;: &amp;#34;https://cdn.yt.com/v_abc/master.m3u8&amp;#34;
    },
    &amp;#34;thumbnails&amp;#34;: { &amp;#34;default&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;high&amp;#34;: &amp;#34;...&amp;#34; },
    &amp;#34;published_at&amp;#34;: &amp;#34;2026-02-22T12:00:00Z&amp;#34;
  }

GET /api/v1/feed?cursor={cursor}&amp;amp;limit=20
GET /api/v1/search?q={query}&amp;amp;cursor={cursor}

POST /api/v1/videos/{video_id}/like
POST /api/v1/videos/{video_id}/view   // fire-and-forget, for analytics
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Key decisions:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Design YouTube Surveys (In-Video Polls)</title>
      <link>https://chiraghasija.cc/designs/youtube-surveys/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://chiraghasija.cc/designs/youtube-surveys/</guid>
      <description>&lt;h2 id=&#34;1-requirements--scope-5-min&#34;&gt;1. Requirements &amp;amp; Scope (5 min)&lt;/h2&gt;
&lt;h3 id=&#34;functional-requirements&#34;&gt;Functional Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Create polls&lt;/strong&gt; — Creators/advertisers can define survey questions (multiple choice, single select, rating scale) and attach them to specific timestamps in a video&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Render polls mid-roll&lt;/strong&gt; — Display a non-intrusive overlay poll at the configured timestamp during video playback, without pausing or blocking the video&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collect votes&lt;/strong&gt; — Record user responses in real time, enforce one vote per user per poll, and allow changing a vote before the poll closes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time results&lt;/strong&gt; — Show live vote counts / percentages to the user after they vote (instant feedback)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics dashboard&lt;/strong&gt; — Provide creators with detailed poll analytics: response rate, demographic breakdown, completion funnel, and A/B test results for different poll placements&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;non-functional-requirements&#34;&gt;Non-Functional Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability:&lt;/strong&gt; 99.95% — a poll failing to render is a missed data collection opportunity, but not as critical as video playback itself&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Poll UI must render within 200ms of the trigger timestamp. Vote submission must ACK within 100ms (perceived instant).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Votes must be counted exactly once. Read-after-write consistency for a user seeing their own vote. Aggregate counts can be eventually consistent (1-2 second delay is fine).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; YouTube has 800M daily active viewers, 500M hours of video watched/day. If 5% of videos carry polls and 60% of those viewers see the poll → ~192M poll impressions/day → &lt;strong&gt;~6,700 poll renders/sec&lt;/strong&gt;, &lt;strong&gt;~2,000 votes/sec&lt;/strong&gt; at peak.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Every vote must be durably stored. Zero data loss on votes.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-estimation-3-min&#34;&gt;2. Estimation (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;traffic&#34;&gt;Traffic&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Daily active viewers: 800M&lt;/li&gt;
&lt;li&gt;Videos watched per viewer: ~8/day → 6.4B video views/day&lt;/li&gt;
&lt;li&gt;Videos with polls: 5% → 320M poll-eligible views/day&lt;/li&gt;
&lt;li&gt;Poll impression rate (viewer sees the poll): 60% → &lt;strong&gt;192M poll impressions/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Vote rate (viewer actually votes): 30% of impressions → &lt;strong&gt;57.6M votes/day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vote QPS:&lt;/strong&gt; 57.6M / 86400 ≈ &lt;strong&gt;670 votes/sec&lt;/strong&gt; (average) × 3 (peak multiplier) ≈ &lt;strong&gt;2,000 votes/sec&lt;/strong&gt; (peak)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poll render QPS:&lt;/strong&gt; 192M / 86400 ≈ 2,200 renders/sec (average) × 3 ≈ &lt;strong&gt;6,700 renders/sec&lt;/strong&gt; (peak)&lt;/li&gt;
&lt;/ul&gt;
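&lt;p&gt;The funnel above as integer arithmetic; note that the 3× multiplier is what turns the ~670/sec average into the ~2,000/sec peak:&lt;/p&gt;

```python
dau = 800_000_000
views = dau * 8                      # 6.4B video views/day
eligible = views * 5 // 100          # 5% of views are on poll-carrying videos
impressions = eligible * 60 // 100   # 60% impression rate
votes = impressions * 30 // 100      # 30% vote rate

avg_vote_qps = votes // 86_400       # 666, i.e. ~670/sec average
peak_vote_qps = avg_vote_qps * 3     # ~2,000/sec with a 3x peak multiplier
```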
&lt;h3 id=&#34;storage&#34;&gt;Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Poll definitions:&lt;/strong&gt; 10M active polls × 2 KB (question, options, targeting rules, schedule) = &lt;strong&gt;20 GB&lt;/strong&gt; — easily fits in a relational DB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Votes:&lt;/strong&gt; 57.6M votes/day × 365 days × 3 years retention = 63B votes
&lt;ul&gt;
&lt;li&gt;Each vote: poll_id (8B) + user_id (8B) + option_id (4B) + timestamp (8B) + metadata (32B) ≈ 60 bytes&lt;/li&gt;
&lt;li&gt;Total: 63B votes × 60 bytes = &lt;strong&gt;3.78 TB&lt;/strong&gt; — manageable with partitioned storage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregated counts:&lt;/strong&gt; Per-poll, per-option counters. 10M polls × 5 options × 16B = &lt;strong&gt;800 MB&lt;/strong&gt; — trivially small, lives in Redis&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bandwidth&#34;&gt;Bandwidth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Poll render payload: ~5 KB (question text, options, styling, targeting metadata)&lt;/li&gt;
&lt;li&gt;6,700 renders/sec × 5 KB = &lt;strong&gt;33.5 MB/s&lt;/strong&gt; — negligible compared to video streaming bandwidth&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-api-design-3-min&#34;&gt;3. API Design (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;creator-facing-apis&#34;&gt;Creator-Facing APIs&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Create a poll attached to a video
POST /api/v1/videos/{video_id}/polls
  Body: {
    &amp;#34;question&amp;#34;: &amp;#34;What feature should we build next?&amp;#34;,
    &amp;#34;type&amp;#34;: &amp;#34;single_select&amp;#34;,           // single_select | multi_select | rating
    &amp;#34;options&amp;#34;: [&amp;#34;Dark mode&amp;#34;, &amp;#34;Offline support&amp;#34;, &amp;#34;AI search&amp;#34;, &amp;#34;Better perf&amp;#34;],
    &amp;#34;trigger_time_sec&amp;#34;: 145,           // show at 2:25 in the video
    &amp;#34;display_duration_sec&amp;#34;: 15,        // auto-dismiss after 15s
    &amp;#34;targeting&amp;#34;: {
      &amp;#34;geo&amp;#34;: [&amp;#34;US&amp;#34;, &amp;#34;CA&amp;#34;, &amp;#34;GB&amp;#34;],
      &amp;#34;demographics&amp;#34;: { &amp;#34;age_min&amp;#34;: 18, &amp;#34;age_max&amp;#34;: 45 },
      &amp;#34;sample_pct&amp;#34;: 10                 // only show to 10% of viewers (A/B test)
    },
    &amp;#34;close_after_hours&amp;#34;: 168           // stop accepting votes after 7 days
  }
  → 201 { &amp;#34;poll_id&amp;#34;: &amp;#34;p_abc123&amp;#34;, ... }

// Get poll analytics
GET /api/v1/polls/{poll_id}/analytics
  → 200 {
    &amp;#34;impressions&amp;#34;: 145230,
    &amp;#34;votes&amp;#34;: 43120,
    &amp;#34;response_rate&amp;#34;: 0.297,
    &amp;#34;results&amp;#34;: [
      { &amp;#34;option&amp;#34;: &amp;#34;Dark mode&amp;#34;, &amp;#34;votes&amp;#34;: 18200, &amp;#34;pct&amp;#34;: 42.2 },
      { &amp;#34;option&amp;#34;: &amp;#34;AI search&amp;#34;, &amp;#34;votes&amp;#34;: 12500, &amp;#34;pct&amp;#34;: 29.0 },
      ...
    ],
    &amp;#34;demographics&amp;#34;: { ... },
    &amp;#34;ab_test&amp;#34;: { &amp;#34;variant_a_response_rate&amp;#34;: 0.31, &amp;#34;variant_b_response_rate&amp;#34;: 0.26 }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;viewer-facing-apis&#34;&gt;Viewer-Facing APIs&lt;/h3&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;// Fetch polls for a video (called when video starts playing)
GET /api/v1/videos/{video_id}/polls?viewer_id={uid}
  → 200 {
    &amp;#34;polls&amp;#34;: [
      {
        &amp;#34;poll_id&amp;#34;: &amp;#34;p_abc123&amp;#34;,
        &amp;#34;trigger_time_sec&amp;#34;: 145,
        &amp;#34;question&amp;#34;: &amp;#34;What feature should we build next?&amp;#34;,
        &amp;#34;options&amp;#34;: [...],
        &amp;#34;user_vote&amp;#34;: null              // or option_id if already voted
      }
    ]
  }

// Submit a vote
POST /api/v1/polls/{poll_id}/vote
  Body: { &amp;#34;option_id&amp;#34;: &amp;#34;opt_2&amp;#34;, &amp;#34;viewer_id&amp;#34;: &amp;#34;u_xyz&amp;#34; }
  → 200 { &amp;#34;results&amp;#34;: { &amp;#34;opt_1&amp;#34;: 42.2, &amp;#34;opt_2&amp;#34;: 29.0, ... }, &amp;#34;total_votes&amp;#34;: 43121 }

// Change vote (idempotent PUT)
PUT /api/v1/polls/{poll_id}/vote
  Body: { &amp;#34;option_id&amp;#34;: &amp;#34;opt_3&amp;#34;, &amp;#34;viewer_id&amp;#34;: &amp;#34;u_xyz&amp;#34; }
  → 200 { &amp;#34;results&amp;#34;: { ... } }
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;key-decisions&#34;&gt;Key Decisions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pre-fetch polls at video start:&lt;/strong&gt; The player fetches all polls for the video when playback begins. This avoids a network request at the exact trigger timestamp (which could cause a visible delay).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vote response includes live results:&lt;/strong&gt; After voting, the user immediately sees percentages. This is the &amp;ldquo;social proof&amp;rdquo; hook that drives engagement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Targeting evaluated client-side:&lt;/strong&gt; The server sends all eligible polls plus targeting rules. The client-side SDK evaluates targeting (geo, demographics, A/B bucket) to avoid an extra server round-trip at trigger time.&lt;/li&gt;
&lt;/ul&gt;
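&lt;p&gt;The client-side targeting decision can be sketched as below. The hash-bucket sampling is an assumed implementation of sample_pct (chosen so a viewer lands in the same A/B cohort on every rewatch), and the field names simply follow the API sketch above.&lt;/p&gt;

```python
from zlib import crc32

def should_render(poll, viewer):
    """Client-side targeting check; field names follow the API sketch."""
    t = poll.get("targeting", {})
    if "geo" in t and viewer["geo"] not in t["geo"]:
        return False
    # Deterministic sampling: hash the (viewer, poll) pair into a 0-99
    # bucket, so the same viewer stays in the same cohort for this poll.
    pct = t.get("sample_pct", 100)
    bucket = crc32(f'{viewer["id"]}:{poll["poll_id"]}'.encode()) % 100
    return pct > bucket
```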
&lt;hr&gt;
&lt;h2 id=&#34;4-data-model-3-min&#34;&gt;4. Data Model (3 min)&lt;/h2&gt;
&lt;h3 id=&#34;polls-table-postgresql--strong-consistency-for-definitions&#34;&gt;Polls Table (PostgreSQL — strong consistency for definitions)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Column&lt;/th&gt;
          &lt;th&gt;Type&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;poll_id&lt;/td&gt;
          &lt;td&gt;UUID (PK)&lt;/td&gt;
          &lt;td&gt;Globally unique&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;video_id&lt;/td&gt;
          &lt;td&gt;VARCHAR(11)&lt;/td&gt;
          &lt;td&gt;YouTube video ID, indexed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;creator_id&lt;/td&gt;
          &lt;td&gt;BIGINT&lt;/td&gt;
          &lt;td&gt;FK to creator accounts&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;question&lt;/td&gt;
          &lt;td&gt;TEXT&lt;/td&gt;
          &lt;td&gt;Poll question text&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;type&lt;/td&gt;
          &lt;td&gt;ENUM&lt;/td&gt;
          &lt;td&gt;single_select, multi_select, rating&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;options&lt;/td&gt;
          &lt;td&gt;JSONB&lt;/td&gt;
          &lt;td&gt;Array of {id, text, display_order}&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;trigger_time_sec&lt;/td&gt;
          &lt;td&gt;INT&lt;/td&gt;
          &lt;td&gt;Seconds into the video&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;display_duration_sec&lt;/td&gt;
          &lt;td&gt;INT&lt;/td&gt;
          &lt;td&gt;How long to show the overlay&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;targeting&lt;/td&gt;
          &lt;td&gt;JSONB&lt;/td&gt;
          &lt;td&gt;Geo, demographics, sample_pct, etc.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;status&lt;/td&gt;
          &lt;td&gt;ENUM&lt;/td&gt;
          &lt;td&gt;draft, active, paused, closed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;close_at&lt;/td&gt;
          &lt;td&gt;TIMESTAMP&lt;/td&gt;
          &lt;td&gt;When to stop accepting votes&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;created_at&lt;/td&gt;
          &lt;td&gt;TIMESTAMP&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;votes-table-cassandra--high-write-throughput-partitioned-by-poll_id&#34;&gt;Votes Table (Cassandra — high write throughput, partitioned by poll_id)&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Column&lt;/th&gt;
          &lt;th&gt;Type&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;poll_id&lt;/td&gt;
          &lt;td&gt;UUID (partition key)&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;viewer_id&lt;/td&gt;
          &lt;td&gt;BIGINT (clustering key)&lt;/td&gt;
          &lt;td&gt;Ensures uniqueness: one vote per user per poll&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;option_id&lt;/td&gt;
          &lt;td&gt;UUID&lt;/td&gt;
          &lt;td&gt;Which option they chose&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;voted_at&lt;/td&gt;
          &lt;td&gt;TIMESTAMP&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;metadata&lt;/td&gt;
          &lt;td&gt;MAP&amp;lt;TEXT,TEXT&amp;gt;&lt;/td&gt;
          &lt;td&gt;Device, geo, referrer, A/B variant&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
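&lt;p&gt;How the (poll_id, viewer_id) key enforces one vote per user, keeps re-submits idempotent, and moves the aggregate tally on a vote change can be shown with an in-memory stand-in (Python for illustration only, not the actual storage code):&lt;/p&gt;

```python
from collections import defaultdict

# Stand-ins for the votes table (one row per (poll_id, viewer_id), as in
# the Cassandra schema above) and the Redis per-option counters.
votes = {}
counts = defaultdict(lambda: defaultdict(int))

def cast_vote(poll_id, viewer_id, option_id):
    key = (poll_id, viewer_id)
    previous = votes.get(key)
    if previous == option_id:
        return dict(counts[poll_id])     # idempotent re-submit: no double count
    if previous is not None:
        counts[poll_id][previous] -= 1   # vote change: move the tally over
    votes[key] = option_id
    counts[poll_id][option_id] += 1
    return dict(counts[poll_id])
```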
&lt;p&gt;&lt;strong&gt;Why Cassandra for votes?&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
