1. Requirements & Scope (5 min)

Functional Requirements

  1. Users can send reactions (like, love, wow, haha, etc.) on a live stream or post, and all viewers see animated reactions in real-time
  2. Display an aggregated reaction count that updates live (e.g., “12.3K likes”)
  3. Each user can react multiple times (unlike a static like button — this is a live engagement feature, like Facebook Live or Instagram Live hearts)
  4. Reaction animations appear as floating icons on the viewer’s screen, reflecting the volume and type of reactions across all viewers
  5. Provide reaction rate metrics (reactions per second) for streamer dashboards and analytics

Non-Functional Requirements

  • Availability: 99.9% — reactions are engagement features, not mission-critical. Brief delays are acceptable; total loss of reactions degrades the experience but does not break the product.
  • Latency: Reactions should appear on other viewers’ screens within 1-2 seconds of being sent.
  • Consistency: Eventual consistency is fine. Counts can be approximate. Showing “12.3K” when the true count is 12,347 is perfectly acceptable.
  • Scale: 100K concurrent live streams. Top streams: 500K concurrent viewers, up to 50,000 reactions/sec per stream. Global: 5M reactions/sec across all streams.
  • Durability: Aggregate counts must be durable (total likes on a stream). Individual reaction events do not need permanent storage.

2. Estimation (3 min)

Write Traffic

  • 5M reactions/sec globally
  • Each reaction event: ~80 bytes (stream_id, user_id, reaction_type, timestamp)
  • Inbound data rate: 5M × 80 bytes = 400 MB/sec

Fan-Out

  • Each reaction is not individually broadcast — instead, reactions are aggregated into batches
  • Every 500ms, each stream produces a batch summary: { "like": 142, "love": 37, "wow": 12 }
  • Batch per stream: ~200 bytes
  • 100K streams × 200 bytes × 2 batches/sec = 40 MB/sec fan-out from aggregation layer
  • Each batch is pushed to all viewers of that stream
  • Top stream: 500K viewers × 200 bytes × 2/sec = 200 MB/sec — manageable with edge fan-out

Storage

  • Aggregate counts per stream: 100K streams × 6 reaction types × 8 bytes = ~5 MB in Redis
  • Reaction event log (for analytics, retained 7 days): 5M/sec × 80 bytes × 86400 sec × 7 days = ~240 TB/week
    • Store sampled (1% sample = 2.4 TB/week) in a data lake for trend analysis, not the full firehose

Count Storage

  • Final aggregate counts per stream (permanent): negligible — one row per stream in the DB
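The back-of-envelope numbers above can be sanity-checked in a few lines. A quick sketch; the constants mirror the estimates in this section:

```python
# Sanity-check the estimation section's arithmetic.
REACTIONS_PER_SEC = 5_000_000   # global write rate
EVENT_BYTES = 80                # stream_id, user_id, reaction_type, timestamp
STREAMS = 100_000               # concurrent live streams
BATCH_BYTES = 200               # one 500ms batch summary
BATCHES_PER_SEC = 2             # 500ms windows

inbound_mb_per_sec = REACTIONS_PER_SEC * EVENT_BYTES / 1e6             # 400 MB/sec in
fanout_mb_per_sec = STREAMS * BATCH_BYTES * BATCHES_PER_SEC / 1e6      # 40 MB/sec out
top_stream_mb_per_sec = 500_000 * BATCH_BYTES * BATCHES_PER_SEC / 1e6  # 200 MB/sec
event_log_tb_per_week = (REACTIONS_PER_SEC * EVENT_BYTES
                         * 86_400 * 7) / 1e12                          # ~242 TB/week

print(inbound_mb_per_sec, fanout_mb_per_sec,
      top_stream_mb_per_sec, round(event_log_tb_per_week, 1))
```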

3. API Design (3 min)

REST Endpoints

// Send a reaction
POST /v1/streams/{stream_id}/reactions
  Headers: Authorization: Bearer <token>
  Body: {
    "type": "like"                     // like, love, wow, haha, sad, angry
  }
  Response 202: { "status": "accepted" }
  // 202 Accepted — fire-and-forget, no guarantee of individual delivery

// Get current reaction counts
GET /v1/streams/{stream_id}/reactions/counts
  Response 200: {
    "stream_id": "stream_abc",
    "counts": {
      "like": 1234567,
      "love": 234567,
      "wow": 45678,
      "haha": 12345,
      "sad": 1234,
      "angry": 567
    },
    "rate": {
      "total_per_second": 4521,
      "by_type": { "like": 2100, "love": 890, ... }
    }
  }

WebSocket Protocol

// Client → Server: send a reaction (alternative to REST, lower overhead)
{ "type": "reaction", "reaction": "like" }

// Server → Client: batched reaction update (every 500ms)
{
  "type": "reaction_batch",
  "window_ms": 500,
  "counts": {
    "like": 142,
    "love": 37,
    "wow": 12,
    "haha": 8,
    "sad": 2,
    "angry": 1
  },
  "total": {
    "like": 1234567,
    "love": 234568,
    ...
  }
}

Key Decisions

  • Fire-and-forget writes: Reactions return 202, not 201. We do not guarantee every reaction is counted. Losing 0.1% of reactions under extreme load is acceptable.
  • Batched delivery: Individual reactions are never pushed to clients. Instead, 500ms window summaries are pushed. For a top stream at 50K reactions/sec, each viewer receives 2 pushes/sec instead of 50K, a reduction of over four orders of magnitude.
  • Approximate counts in totals: The total count uses eventual consistency. The per-window count (used for animations) is more important for UX.

4. Data Model (3 min)

In-Flight Reaction Aggregation (Redis)

// Per-window reaction counts (current 500ms window)
Key: stream:{stream_id}:reactions:current
Type: Hash
Fields: like → 142, love → 37, wow → 12, ...
TTL: 5 seconds (auto-cleanup)

// Running total counts
Key: stream:{stream_id}:reactions:total
Type: Hash
Fields: like → 1234567, love → 234567, ...

Permanent Counts (PostgreSQL — updated periodically)

Table: stream_reaction_counts
  stream_id      (PK) | varchar(20)
  like_count          | bigint
  love_count          | bigint
  wow_count           | bigint
  haha_count          | bigint
  sad_count           | bigint
  angry_count         | bigint
  updated_at          | timestamp

Rate Limit State (Redis)

// Per-user reaction rate limit (max 10 reactions per second per user)
Key: reaction_rl:{stream_id}:{user_id}
Type: String (counter)
TTL: 1 second
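In production this is a single Redis INCR with a 1-second TTL; the semantics can be sketched in plain Python. This is an illustration, not the production path, and the `FixedWindowLimiter` name is hypothetical:

```python
import time

class FixedWindowLimiter:
    """Mirrors the Redis pattern: INCR reaction_rl:{stream}:{user} + EXPIRE 1s.
    Allows at most `limit` reactions per user per 1-second window."""

    def __init__(self, limit=10, window_sec=1.0, clock=time.monotonic):
        self.limit = limit
        self.window_sec = window_sec
        self.clock = clock
        self._counters = {}  # key -> (window_start, count)

    def allow(self, stream_id, user_id):
        key = f"reaction_rl:{stream_id}:{user_id}"
        now = self.clock()
        start, count = self._counters.get(key, (now, 0))
        if now - start >= self.window_sec:  # TTL expired -> fresh window
            start, count = now, 0
        count += 1
        self._counters[key] = (start, count)
        return count <= self.limit          # excess is silently dropped
```

With redis-py the equivalent is roughly `n = r.incr(key)` followed by `r.expire(key, 1)` when `n == 1`.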

Why This Split?

  • Redis for real-time: All hot-path data lives in Redis. Hash HINCRBY is O(1) and handles millions of increments per second. The aggregation window resets every 500ms.
  • PostgreSQL for permanence: The final count after a stream ends is checkpointed to PostgreSQL. This happens once per stream, not millions of times.
  • No per-event storage: We do NOT store individual reaction events in a database. 5M events/sec is too expensive to persist individually. Instead, we aggregate in Redis and periodically flush summaries.

5. High-Level Design (12 min)

Reaction Write Path

Client taps "like" button
  → WebSocket Connection Server
    → 1. Rate limit check:
         INCR reaction_rl:{stream_id}:{user_id}
         If > 10 → silently drop (don't error, just ignore excess)
    → 2. Forward to Reaction Aggregator:
         HINCRBY stream:{stream_id}:reactions:current like 1
         HINCRBY stream:{stream_id}:reactions:total like 1
    → 3. Done. No ACK needed to client.

Reaction Broadcast Path

Reaction Broadcaster Service (per-stream timer):
  Every 500ms per active stream:
    → 1. HGETALL stream:{stream_id}:reactions:current
         Result: { "like": 142, "love": 37, "wow": 12 }
    → 2. DEL stream:{stream_id}:reactions:current (reset window)
         (GETDEL only works on string keys, so for this hash use a Lua
          script or RENAME to a temporary key for an atomic read-and-reset)
    → 3. If total reactions in window > 0:
         Publish batch to fan-out layer:
         PUBLISH stream_reactions:{stream_id} {batch_json}
    → 4. If total reactions = 0 → skip (no broadcast, save bandwidth)
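Steps 1-3 above can be sketched as a Lua script for the atomic read-and-reset, plus a couple of pure helpers. A sketch using redis-py conventions; the helper names are illustrative:

```python
import json

# Lua: atomically read the current window hash and delete it.
# Running as a script makes the HGETALL + DEL pair atomic on the owning shard.
# HGETALL returns a flat [field, value, field, value, ...] array from Lua.
READ_AND_RESET = """
local counts = redis.call('HGETALL', KEYS[1])
redis.call('DEL', KEYS[1])
return counts
"""

def should_broadcast(counts: dict) -> bool:
    """Step 4: skip the push entirely when the window saw no reactions."""
    return sum(counts.values()) > 0

def make_batch(counts: dict, window_ms: int = 500) -> str:
    """Payload published to stream_reactions:{stream_id}."""
    return json.dumps({"type": "reaction_batch",
                       "window_ms": window_ms,
                       "counts": counts})
```

With redis-py this would run as `r.eval(READ_AND_RESET, 1, key)` (or via `r.register_script`), with the flat reply paired back into a dict.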

Fan-Out to Viewers:
  WebSocket Servers subscribe to Redis Pub/Sub: stream_reactions:{stream_id}
    → On batch received:
      → Push batch to all local WebSocket connections for that stream

Components

  1. WebSocket Connection Servers (100+ instances): Handle client connections. Receive reactions via WebSocket. Forward to Redis for aggregation. Push batched updates to clients.
  2. Redis Cluster: Reaction aggregation (HINCRBY). Rate limiting. Pub/Sub for batch distribution. Stores running totals.
  3. Reaction Broadcaster Service: Timer-based service, one logical timer per active stream. Reads current window, resets counter, publishes batch. Horizontally scaled — each instance owns a shard of streams.
  4. Count Checkpoint Service: Periodically (every 60 seconds) writes Redis totals to PostgreSQL. On stream end, performs a final checkpoint.
  5. Analytics Sampler: Taps into the reaction stream at 1% sample rate. Writes to Kafka for downstream analytics (trend detection, engagement scoring).

Architecture Diagram

Clients (viewers)
  ↕ WebSocket
WebSocket Servers (100+)
  → HINCRBY → Redis Cluster
               ├── Current window counters
               ├── Running totals
               └── Pub/Sub channels

Reaction Broadcaster (sharded by stream)
  → Every 500ms: read + reset current window
  → PUBLISH batch to Redis Pub/Sub
  → WebSocket Servers receive and push to clients

Count Checkpoint Service
  → Every 60s: Redis totals → PostgreSQL

Analytics Sampler → Kafka → Data Warehouse

Animation Rendering (Client-Side)

On receiving batch { "like": 142, "love": 37, "wow": 12 }:
  1. Total reactions in window: 191
  2. Scale animation intensity:
     < 10 reactions → sparse floating icons
     10-100 → moderate stream
     100-1000 → dense stream with size variation
     > 1000 → "burst" mode with explosion effect
  3. For each reaction type, spawn proportional number of animated icons:
     like: 142/191 = 74% → 74% of animation slots are hearts
     love: 37/191 = 19% → 19% are love icons
  4. Randomize: position (x-axis), float speed, size, opacity
  5. Animate using requestAnimationFrame, recycle DOM nodes from a pool
  6. Cap at 60 visible animations simultaneously (performance)
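Steps 2-3 (intensity scaling plus proportional slot allocation) reduce to a small allocation function. Sketched in Python for clarity, though the real implementation lives in client-side JavaScript; the function name is illustrative:

```python
def animation_slots(counts: dict, cap: int = 60) -> dict:
    """Allocate capped animation slots proportionally to the reaction mix.
    E.g. {"like": 142, "love": 37, "wow": 12} with cap 60 keeps roughly
    the 74% / 19% / 6% split from the example above."""
    total = sum(counts.values())
    if total == 0:
        return {}
    slots = min(cap, total)
    alloc = {t: (n * slots) // total for t, n in counts.items()}
    # Hand leftover slots (lost to integer division) to the largest types.
    leftover = slots - sum(alloc.values())
    for t in sorted(counts, key=counts.get, reverse=True)[:leftover]:
        alloc[t] += 1
    return alloc
```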

6. Deep Dives (15 min)

Deep Dive 1: High Write Throughput — Handling 50K Reactions/Sec Per Stream

A top stream receives 50,000 reactions per second. Naively, each reaction is a separate Redis HINCRBY command. At 50K/sec for one stream, this is manageable for Redis (it handles 100K+ ops/sec). But across 100K streams with 5M total reactions/sec, we need to be smarter.

Level 1: Connection Server Local Batching

Instead of sending each reaction individually to Redis:
  - Each WS Server maintains an in-memory counter per stream per reaction type
  - Every 100ms, flush accumulated counts to Redis in a single pipeline:
    MULTI
      HINCRBY stream:abc:reactions:current like 47
      HINCRBY stream:abc:reactions:current love 12
      HINCRBY stream:abc:reactions:total like 47
      HINCRBY stream:abc:reactions:total love 12
    EXEC

Reduction:
  - Without batching: 50K Redis round trips/sec for one stream
  - With 100ms batching across 50 WS Servers:
    50 servers × 10 flushes/sec = 500 pipelined round trips/sec per stream,
    each pipeline carrying at most 12 HINCRBY commands (6 types × 2 keys)
  - ~100x reduction in Redis round trips for the hottest stream
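The local batching can be sketched as an in-memory accumulator that drains into one pipeline per flush. A sketch: `LocalBatcher` is a hypothetical name, and the real flush would run on a 100ms timer against redis-py's `pipeline()`:

```python
from collections import defaultdict

class LocalBatcher:
    """Per-WS-server accumulator: absorb reactions in memory, then flush
    the accumulated deltas as one pipelined set of HINCRBY commands."""

    def __init__(self):
        self._pending = defaultdict(int)  # (stream_id, type) -> count

    def record(self, stream_id: str, reaction_type: str):
        self._pending[(stream_id, reaction_type)] += 1

    def drain(self):
        """Return the HINCRBY commands for one pipeline flush and reset."""
        cmds = []
        for (stream_id, rtype), n in self._pending.items():
            cmds.append(("HINCRBY", f"stream:{stream_id}:reactions:current", rtype, n))
            cmds.append(("HINCRBY", f"stream:{stream_id}:reactions:total", rtype, n))
        self._pending.clear()
        return cmds
```

With redis-py, the drain loop would feed `pipe = r.pipeline(transaction=False)` via `pipe.hincrby(key, field, n)` calls, then `pipe.execute()`.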

Level 2: Redis Pipeline and Cluster Sharding

- Stream IDs are distributed across Redis shards by consistent hashing
- Hot streams may land on the same shard → hot key problem
- Solution: sub-shard hot streams
  Key: stream:{stream_id}:reactions:current:{shard_N}  (N = 0..7)
  Each WS Server writes to a random sub-shard
  Broadcaster reads all sub-shards and sums them

  Read cost at broadcast time: 8 HGETALL commands (negligible at 2/sec)
  Write distribution: spread across 8 Redis keys → 8x reduction in per-key contention
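The sub-sharding scheme above amounts to: writers pick a random suffix, the broadcaster merges all suffixes. A sketch using the key format from the example; the helper names are illustrative:

```python
import random

NUM_SUBSHARDS = 8

def write_key(stream_id: str, rng=random) -> str:
    """Writers spread increments across the 8 sub-shard keys."""
    n = rng.randrange(NUM_SUBSHARDS)
    return f"stream:{stream_id}:reactions:current:{n}"

def merge_subshards(hashes: list) -> dict:
    """Broadcaster: HGETALL each sub-shard key, then sum field-by-field."""
    merged = {}
    for h in hashes:
        for rtype, n in h.items():
            merged[rtype] = merged.get(rtype, 0) + n
    return merged
```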

Level 3: Lossy counting under extreme load

If a stream exceeds 100K reactions/sec:
  - WS Servers probabilistically sample: accept reaction with probability P
  - P = min(1.0, 100000 / current_rate)
  - Multiply displayed count by 1/P to compensate
  - Users cannot perceive the difference in animations at this volume
  - Total count accuracy: ±5% (acceptable for live engagement)
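The sampling rule follows directly from the formula above. A sketch; a real implementation would estimate `current_rate` from a sliding counter on the WS Server:

```python
import random

MAX_RATE = 100_000  # reactions/sec threshold above which we sample

def sample_probability(current_rate: float) -> float:
    """P = min(1.0, 100000 / current_rate); below the threshold, keep all."""
    return min(1.0, MAX_RATE / max(current_rate, 1.0))

def accept(current_rate: float, rng=random) -> bool:
    """Probabilistically accept a reaction under extreme load."""
    return rng.random() < sample_probability(current_rate)

def compensate(sampled_count: int, current_rate: float) -> int:
    """Scale the sampled count back up by 1/P for display."""
    return round(sampled_count / sample_probability(current_rate))
```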

Deep Dive 2: Batched Broadcasting and Client Animation Throttling

Why 500ms windows?

  • Humans perceive animation changes at ~200ms granularity
  • 500ms windows give smooth animation while reducing server push rate to 2/sec/stream
  • Shorter windows (100ms) = 10x more pushes with minimal perceptual benefit
  • Longer windows (2sec) = noticeable lag between tapping “like” and seeing others’ reactions

Adaptive window sizing:

Low activity (< 10 reactions/sec): 2-second windows
  → Reactions trickle in, each one is noticeable
  → Longer window to accumulate enough for a meaningful batch

Medium activity (10-1000 reactions/sec): 500ms windows
  → Standard cadence, good animation density

High activity (> 1000 reactions/sec): 500ms windows, aggregated counts
  → Same push frequency, but counts are larger
  → Client adjusts animation intensity based on count magnitude

Mega activity (> 10K reactions/sec): 1-second windows, sampled
  → At this rate, client cannot render individual reactions anyway
  → Show "burst" animation + counter increment
  → Push frequency drops to 1/sec to save bandwidth
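The tiers above collapse to a small lookup. Boundaries are taken from the tiers as written; how exact boundary values are bucketed is an assumption:

```python
def window_ms(reactions_per_sec: float) -> int:
    """Map the current reaction rate to a broadcast window size."""
    if reactions_per_sec < 10:
        return 2000  # low activity: accumulate a meaningful batch
    if reactions_per_sec <= 10_000:
        return 500   # medium/high: standard 2 pushes/sec cadence
    return 1000      # mega: sampled, 1 push/sec to save bandwidth
```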

Client-side animation throttling:

Performance budget: 60 FPS, max 16.7ms per frame
  - Object pool: pre-allocate 100 reaction DOM elements
  - On batch: determine animation count = min(60, batch_total / 10)
  - Distribute animation spawns across the 500ms window:
    If 30 animations for 500ms → spawn 1 every 16.7ms (one per frame)
  - Each animation: CSS transform (translate + opacity) for GPU acceleration
  - No layout thrashing: all reactions are position: absolute
  - Recycle: when animation completes (float to top + fade), return to pool

Low-powered devices:
  - Detect via navigator.hardwareConcurrency or frame-rate monitoring
  - Reduce max animations to 20 or 10
  - Use simpler animations (no rotation, no size variation)

Deep Dive 3: Count Consistency and Durability

Problem: Reactions are counted in Redis (volatile memory). If Redis crashes, we lose the running total. The stream shows “0 likes” suddenly.

Solution: Multi-layer count durability

Layer 1: Redis (real-time, volatile)
  - Updated on every reaction (via HINCRBY)
  - Source of truth for live count display

Layer 2: PostgreSQL checkpoint (durable, periodic)
  - Every 60 seconds, Checkpoint Service reads Redis totals
    and writes to PostgreSQL
  - On Redis failure, fall back to last PostgreSQL checkpoint
  - Max data loss: 60 seconds of reactions (acceptable)

Layer 3: Kafka event log (durable, append-only)
  - WS Servers tee their batched counts into a Kafka pipeline (append-only log)
  - On catastrophic failure (Redis + PostgreSQL), replay Kafka to reconstruct
  - Retention: 7 days

Recovery sequence on Redis failure:
  1. Load last checkpoint from PostgreSQL (e.g., counts as of 60 sec ago)
  2. Replay Kafka events from checkpoint timestamp to now
  3. Reconstruct current totals in new Redis instance
  4. Resume normal operation
  Time: < 30 seconds for automated recovery

Preventing double-counting on recovery:

Each Checkpoint Service write includes a checkpoint_id (monotonic):
  UPDATE stream_reaction_counts
  SET like_count = {value}, ..., checkpoint_id = {id}
  WHERE stream_id = {stream_id} AND checkpoint_id < {id}

If checkpoint runs twice (at-least-once delivery), the second write is a no-op
because checkpoint_id is already >= {id}.
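The guarded UPDATE's idempotency can be illustrated against an in-memory row. A sketch; the real write is the SQL above:

```python
def apply_checkpoint(row: dict, counts: dict, checkpoint_id: int) -> bool:
    """Mirrors: UPDATE ... WHERE checkpoint_id < {id}.
    Returns True if the write was applied, False if it was a no-op."""
    if row.get("checkpoint_id", -1) >= checkpoint_id:
        return False  # duplicate or stale delivery: no-op
    row.update(counts)
    row["checkpoint_id"] = checkpoint_id
    return True
```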

End-of-stream finalization:

When a stream ends:
  1. Stop accepting new reactions
  2. Final checkpoint: Redis → PostgreSQL (exact counts)
  3. Mark stream as finalized
  4. Clear Redis keys for this stream
  5. Final counts are now permanent in PostgreSQL

7. Extensions (2 min)

  • Reaction heatmap timeline: Record reaction rate over the stream’s duration (1-second buckets). Display a “reaction timeline” graph showing peaks of engagement. Useful for clipping highlights — peaks in reaction rate correlate with exciting moments.
  • Custom reactions and emotes: Allow streamers to define custom reaction types (custom emoji). Store in a per-stream reaction config. Client downloads the emote sprite sheet on stream join. Aggregation logic is identical — just more hash fields.
  • Reaction leaderboard: Track which viewers sent the most reactions. Store per-user counts in a Redis sorted set per stream. Display “Top fans” leaderboard on the stream page. Incentivizes engagement.
  • Sentiment analysis: Feed reaction proportions into a sentiment model. If sad/angry reactions spike, flag the stream for review. Useful for detecting controversial content in real-time.
  • Cross-stream reaction aggregation: For multi-stream events (e.g., esports tournament), aggregate reactions across all streams in the event. Show a global reaction counter and animation. Sum counts from multiple stream keys in Redis.