1. Requirements & Scope (5 min)

Functional Requirements

  1. Users can like/react to a post (like, love, haha, wow, sad, angry) — one reaction per user per post, toggleable (tap again to remove)
  2. Display the total reaction count on every post and a breakdown by reaction type
  3. Show “who reacted” — a paginated list of users who reacted to a post (with their reaction type)
  4. Notify the post author when their post receives reactions (batched, not per-reaction)
  5. Support reactions on posts, comments, messages, and other content types (polymorphic)

Non-Functional Requirements

  • Availability: 99.99% — the like button is on every post in the News Feed. Downtime affects all users immediately.
  • Latency: Like action < 100ms (write). Displaying count < 50ms (read). Counts can be slightly stale (eventual consistency acceptable for counts).
  • Throughput: ~500K likes/sec sustained (2B DAU × 20 likes/day = 40B likes/day, or about 460K/sec averaged over 86,400 seconds). Peak: 2M likes/sec during viral events.
  • Consistency: Toggle semantics must be strongly consistent per user — if I tap like twice, the second tap must undo the first. Counts can be eventually consistent.
  • Scale: 500B total reactions stored. 10B+ posts with at least one reaction.

2. Estimation (3 min)

Write Traffic

  • 2B DAU, average 20 reactions/day = 40B reactions/day
  • Average: 460K writes/sec, peak: 2M writes/sec
  • Each write: check if user already reacted → upsert/delete reaction → update counter

Read Traffic

  • Every post view shows reaction count. 2B DAU × 200 posts viewed/day = 400B post views/day
  • 4.6M reads/sec for reaction counts (but counts are cached on the post object — not a separate query)
  • “Who reacted” list: viewed much less frequently. ~1B queries/day ≈ 12K reads/sec

Storage

  • Reaction records: 500B reactions × 30 bytes (user_id, post_id, reaction_type, timestamp) = 15TB
  • Reaction counts: 10B posts × 50 bytes (6 counters + total) = 500GB (easily fits in cache)
  • Active reaction count cache: top 1B posts × 50 bytes = 50GB Redis
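
The arithmetic behind these estimates can be checked with a short script. This is only a sanity check of the figures already stated above; the variable names are illustrative.

```python
# Back-of-envelope check of the traffic and storage figures above.
SECONDS_PER_DAY = 86_400

dau = 2_000_000_000
reactions_per_user_per_day = 20
post_views_per_user_per_day = 200

writes_per_day = dau * reactions_per_user_per_day        # 40B reactions/day
views_per_day = dau * post_views_per_user_per_day        # 400B post views/day

avg_writes_per_sec = writes_per_day / SECONDS_PER_DAY    # roughly 460K
avg_reads_per_sec = views_per_day / SECONDS_PER_DAY      # roughly 4.6M

# Storage, using the per-row sizes assumed in the estimate.
reaction_storage_tb = 500_000_000_000 * 30 / 1e12        # reaction records
count_storage_gb = 10_000_000_000 * 50 / 1e9             # all count rows
hot_cache_gb = 1_000_000_000 * 50 / 1e9                  # top 1B posts in Redis

if __name__ == "__main__":
    print(f"avg writes/sec: {avg_writes_per_sec:,.0f}")
    print(f"avg reads/sec:  {avg_reads_per_sec:,.0f}")
    print(f"reactions: {reaction_storage_tb:.0f} TB, counts: {count_storage_gb:.0f} GB, "
          f"hot cache: {hot_cache_gb:.0f} GB")
```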

Key Insight

Reads outnumber writes 10:1 for counts (4.6M reads/sec vs 460K writes/sec). The “like count” is shown on every post impression but only changes when someone reacts. This is a strong case for denormalized counters: pre-compute and cache the count rather than counting rows at query time.


3. API Design (3 min)

React to a Post

POST /v1/reactions
  Body: {
    "content_id": "post_abc123",
    "content_type": "post",
    "reaction_type": "like"           // like, love, haha, wow, sad, angry
  }
  Response 200: {
    "action": "added",                // "added" or "changed" or "removed"
    "previous_reaction": null,        // or "like" if changed from like to love
    "counts": {
      "total": 1542,
      "like": 1203, "love": 187, "haha": 89,
      "wow": 31, "sad": 22, "angry": 10
    }
  }

// Toggle semantics:
// POST with reaction_type = "like" when user already liked → REMOVES the like
// POST with reaction_type = "love" when user already liked → CHANGES to love
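
The toggle rules above reduce to a small pure function. A minimal sketch, assuming the server has already looked up the user's existing reaction; `resolve_toggle` is a hypothetical helper name, not part of the API.

```python
from typing import Optional, Tuple


def resolve_toggle(existing: Optional[str], requested: str) -> Tuple[str, Optional[str]]:
    """Return (action, resulting_reaction) for one POST /v1/reactions call.

    existing  -- the user's current reaction on the content, or None
    requested -- the reaction_type sent in the request body
    """
    if existing is None:
        return "added", requested      # no reaction yet: add it
    if existing == requested:
        return "removed", None         # same reaction tapped again: toggle off
    return "changed", requested        # different reaction: switch to it
```

The `action` value maps directly to the `"action"` field of the response, and `existing` becomes `"previous_reaction"`.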

Delete Reaction (explicit)

DELETE /v1/reactions?content_id=post_abc123&content_type=post
  Response 200: { "action": "removed", "counts": { ... } }

Get Reactions on a Post

GET /v1/reactions?content_id=post_abc123&content_type=post
  &filter=love&cursor=xxx&limit=20
  Response 200: {
    "counts": { "total": 1542, ... },
    "user_reaction": "like",          // current user's reaction (null if none)
    "reactors": [
      {"user_id": "u_1", "name": "Alice", "reaction": "love", "reacted_at": "..."},
      {"user_id": "u_2", "name": "Bob", "reaction": "love", "reacted_at": "..."},
      ...
    ],
    "next_cursor": "yyy"
  }

Key Decisions

  • Toggle via POST, not DELETE: Same endpoint for add/change/remove simplifies the client. The server determines the action based on current state.
  • Return updated counts in the write response: Client can update the UI immediately without a separate read. Saves a round-trip.
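  • Cursor-based pagination for “who reacted”: Offset pagination skips or repeats rows when new reactions arrive between page fetches. A cursor of (last_reacted_at, user_id) stays stable because each page resumes strictly after the last row returned.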
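
A sketch of the cursor idea, using an in-memory list in place of the database. The cursor encoding (base64 over JSON) and the newest-first ordering are assumptions made for illustration; the design only specifies that the cursor carries (last_reacted_at, user_id).

```python
import base64
import json
from typing import List, Optional, Tuple

Reactor = Tuple[int, int]  # (reacted_at as epoch seconds, user_id)


def encode_cursor(reacted_at: int, user_id: int) -> str:
    raw = json.dumps([reacted_at, user_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> Reactor:
    reacted_at, user_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return reacted_at, user_id


def page_reactors(rows: List[Reactor], cursor: Optional[str],
                  limit: int) -> Tuple[List[Reactor], Optional[str]]:
    """Newest-first keyset pagination over (reacted_at, user_id)."""
    ordered = sorted(rows, reverse=True)
    if cursor is not None:
        last = decode_cursor(cursor)
        # Resume strictly after the last row of the previous page.
        ordered = [row for row in ordered if row < last]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = encode_cursor(*page[-1]) if has_more else None
    return page, next_cursor
```

A reaction that arrives between two page fetches sorts ahead of the cursor, so it neither shifts nor duplicates the rows on later pages.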

4. Data Model (3 min)

Reactions (Sharded MySQL/PostgreSQL)

Table: reactions
  content_id       | bigint         -- post, comment, or message ID
  content_type     | tinyint        -- 1=post, 2=comment, 3=message
  user_id          | bigint
  reaction_type    | tinyint        -- 1=like, 2=love, 3=haha, etc.
  created_at       | timestamp

  PRIMARY KEY: (content_id, content_type, user_id)
  -- Composite PK ensures one reaction per user per content item
  -- Shard key: content_id (all reactions for a post on same shard)

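  INDEX: (user_id, created_at)
  -- For "posts I've reacted to" queries
  -- Note: with content_id as the shard key this index is local to each shard,
  -- so a per-user query fans out to all shards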
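
To illustrate how the composite primary key enforces one reaction per user per content item, here is a small sketch using SQLite in memory as a stand-in for a single MySQL/PostgreSQL shard. The column types and the `INSERT OR REPLACE` upsert are SQLite-specific assumptions; a production shard would use its own upsert syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reactions (
        content_id    INTEGER NOT NULL,
        content_type  INTEGER NOT NULL,
        user_id       INTEGER NOT NULL,
        reaction_type INTEGER NOT NULL,
        created_at    INTEGER NOT NULL,
        PRIMARY KEY (content_id, content_type, user_id)
    )
""")


def set_reaction(content_id: int, content_type: int, user_id: int,
                 reaction_type: int, created_at: int) -> None:
    """Insert a reaction, or overwrite the user's existing one on this content."""
    conn.execute(
        "INSERT OR REPLACE INTO reactions VALUES (?, ?, ?, ?, ?)",
        (content_id, content_type, user_id, reaction_type, created_at),
    )


def remove_reaction(content_id: int, content_type: int, user_id: int) -> None:
    conn.execute(
        "DELETE FROM reactions WHERE content_id = ? AND content_type = ? AND user_id = ?",
        (content_id, content_type, user_id),
    )
```

A plain `INSERT` for a (content_id, content_type, user_id) triple that already exists raises `sqlite3.IntegrityError`, which is the database-level guarantee behind the toggle semantics.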

Reaction Counts (Denormalized, Redis + DB)

Table: reaction_counts
  content_id       | bigint
  content_type     | tinyint
  total            | int
  count_like       | int
  count_love       | int
  count_haha       | int
  count_wow        | int
  count_sad        | int
  count_angry      | int
  updated_at       | timestamp

  PRIMARY KEY: (content_id, content_type)

Redis Cache Structure

// Reaction counts (hot path for every post render)
Key: rc:{content_id}
Type: Hash
Fields: total, like, love, haha, wow, sad, angry
TTL: 24 hours (refreshed on access)

// User's reaction on a post (for showing "you liked this")
Key: ur:{content_id}:{user_id}
Type: String
Value: reaction_type (1-6) or empty
TTL: 1 hour

Why These Choices

  • Sharded MySQL by content_id: Reactions are almost always queried by content_id (“show reactions on this post”). Sharding by content_id keeps all reactions for a post on one shard, avoiding scatter-gather.
  • Denormalized counts: COUNT(*) on 500B rows is not viable. Pre-computed counts served from cache handle 4.6M reads/sec with < 1ms latency.
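  • Tinyint for reaction_type: Saves storage over varchar at 500B row scale. A tinyint costs 1 byte × 500B = 500GB in total, while a short string such as “like” or “angry” costs roughly 4-6 bytes per row (about 2-3TB), so the tinyint saves on the order of 2TB.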

5. High-Level Design (12 min)

Write Path (User Reacts)

Client taps "Like"
  → API Server:
      1. Validate request
      2. Read current reaction for (content_id, user_id):
         → Redis cache ur:{content_id}:{user_id}
         → If cache miss: query reactions table
      3. Determine action:
         → No existing reaction → INSERT new reaction
         → Same reaction exists → DELETE (toggle off)
         → Different reaction → UPDATE reaction_type
      4. Write to reactions table (synchronous, consistent)
      5. Update denormalized count (async via Kafka):
         → Publish ReactionEvent { content_id, action, old_type, new_type }
      6. Invalidate/update Redis caches
      7. Return response with new counts (optimistically computed)

Kafka Consumer (Counter Updater):
  → Consume ReactionEvent
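  → Apply the delta to reaction_counts based on the action:
       added:   count_{new_type} + 1, total + 1
       removed: count_{old_type} - 1, total - 1
       changed: count_{old_type} - 1, count_{new_type} + 1 (total unchanged)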
  → Update Redis hash rc:{content_id}

Read Path (View Post with Counts)

News Feed Service renders post:
  → Fetch post data (includes content_id)
  → Fetch reaction counts: Redis HGETALL rc:{content_id} (the key is a Hash)
    → Cache hit (99.9%): return immediately (< 1ms)
    → Cache miss: query reaction_counts table, populate cache
  → Fetch user's reaction: Redis GET ur:{content_id}:{user_id}
    → To show "You and 1,541 others liked this"
  → Return to client
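
The count lookup on the read path is a standard cache-aside read. A minimal sketch, with a dict standing in for Redis and a callable standing in for the reaction_counts query; `CountCache` is a hypothetical name and TTL handling is omitted.

```python
from typing import Callable, Dict


class CountCache:
    """Cache-aside lookup for reaction counts (dict stands in for Redis)."""

    def __init__(self, load_from_db: Callable[[str], Dict[str, int]]) -> None:
        self._cache: Dict[str, Dict[str, int]] = {}
        self._load_from_db = load_from_db
        self.hits = 0
        self.misses = 0

    def get_counts(self, content_id: str) -> Dict[str, int]:
        key = f"rc:{content_id}"
        cached = self._cache.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        # Cache miss: read the denormalized row and populate the cache.
        self.misses += 1
        counts = self._load_from_db(content_id)
        self._cache[key] = counts
        return counts
```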

Components

  1. API Servers (stateless): Handle reaction writes. Validate, check current state, write to DB, publish events. Horizontally scaled.
  2. Reactions DB (sharded MySQL): Source of truth for individual reactions. Sharded by content_id. 500B rows across ~1000 shards.
  3. Counter Service (Kafka consumers): Processes reaction events and updates denormalized counts. Handles counter drift correction.
  4. Redis Cluster: Caches reaction counts (hot path) and per-user reaction state. 50GB+ for counts, distributed across cluster.
  5. Notification Service: Batches reaction notifications. “Alice, Bob, and 15 others liked your post” — aggregated every 5-15 minutes.
  6. Counter Correction Job (hourly): Recalculates counts from source-of-truth (reactions table) and fixes any drift in the denormalized counters.

6. Deep Dives (15 min)

Deep Dive 1: Handling High Write Throughput (500K-2M writes/sec)

The challenge: 500K writes/sec to a single logical table. Sharded across 1000 instances, that averages only 500 writes/sec per shard, which is manageable. The real problem is hot posts: a viral post might get 50K likes/sec, all hitting the same shard.

Solution 1: Write buffering with coalescing

Instead of writing each reaction directly to DB:

1. Write to a per-shard write buffer (Redis list):
   RPUSH reactions:buffer:{shard_id} {serialized_reaction_event}

2. Buffer consumer (per shard) processes in micro-batches:
   Every 100ms or 1000 events (whichever first):
     → Group by content_id
     → Batch INSERT/UPDATE/DELETE to MySQL (single transaction per batch)
     → Batch counter updates

3. For reads during the buffer window:
   → Check buffer first, then DB
   → Or: update Redis cache immediately (write-behind pattern)

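With the 1,000-event cap, a viral post receiving 50K writes/sec fills a batch every 20ms, so 50K individual writes/sec become roughly 50 batched transactions/sec. Under lighter load the 100ms timer applies instead, capping the rate at 10 batches/sec per shard.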
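
The coalescing step inside one micro-batch might look like the sketch below. It assumes each buffered event carries the resulting state for a (content_id, user_id) pair, so only the last event per pair needs to reach MySQL; the event shape and function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (content_id, user_id, resulting reaction or None if toggled off)
Event = Tuple[str, int, Optional[str]]


def coalesce(events: List[Event]) -> Dict[Tuple[str, int], Optional[str]]:
    """Keep only the final state per (content_id, user_id) within one batch."""
    final: Dict[Tuple[str, int], Optional[str]] = {}
    for content_id, user_id, reaction in events:
        final[(content_id, user_id)] = reaction  # later events overwrite earlier ones
    return final


def group_by_content(
    final: Dict[Tuple[str, int], Optional[str]]
) -> Dict[str, List[Tuple[int, Optional[str]]]]:
    """Group the coalesced writes by content_id, one batched statement per group."""
    grouped: Dict[str, List[Tuple[int, Optional[str]]]] = defaultdict(list)
    for (content_id, user_id), reaction in final.items():
        grouped[content_id].append((user_id, reaction))
    return dict(grouped)
```

Three rapid taps by the same user on the same post collapse into a single row write, which is where most of the saving on hot posts comes from.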

Solution 2: Sharding hot posts further

For posts with > 10K reactions in 1 hour (detected dynamically):
  → Sub-shard the counter: rc:{content_id}:0 through rc:{content_id}:15
  → Each sub-counter handles 1/16th of the writes
  → Read: SUM all 16 sub-counters (still fast: 16 pipelined Redis HGETALL calls < 2ms)
  → This is the "counter sharding" or "distributed counter" pattern
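
A sketch of the distributed counter pattern, with plain dicts standing in for the 16 Redis hashes. Picking a sub-counter at random is one possible assignment strategy (an assumption here); hashing the user_id would work as well.

```python
import random
from typing import Dict, Optional

NUM_SUB_COUNTERS = 16


class ShardedCounter:
    """Spreads writes for one hot post across several sub-counters."""

    def __init__(self, content_id: str, num_shards: int = NUM_SUB_COUNTERS,
                 rng: Optional[random.Random] = None) -> None:
        self.keys = [f"rc:{content_id}:{i}" for i in range(num_shards)]
        self.store: Dict[str, Dict[str, int]] = {key: {} for key in self.keys}
        self._rng = rng or random.Random()

    def incr(self, reaction: str, delta: int = 1) -> None:
        # Each write touches exactly one sub-counter.
        shard = self.store[self._rng.choice(self.keys)]
        shard[reaction] = shard.get(reaction, 0) + delta
        shard["total"] = shard.get("total", 0) + delta

    def read(self) -> Dict[str, int]:
        # Each read sums every sub-counter field by field.
        merged: Dict[str, int] = {}
        for shard in self.store.values():
            for field, value in shard.items():
                merged[field] = merged.get(field, 0) + value
        return merged
```

Individual sub-counters can go negative when a removal lands on a different shard than the original add; only the sum is meaningful.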

Idempotent toggle handling:

// Ensure exactly-once semantics for rapid tapping
// User taps like 5 times quickly → should end as "liked" if they started with no reaction (odd number of taps = liked)

Per-user lock: Redis SET ur:lock:{content_id}:{user_id} NX EX 2
  → If lock acquired: process normally
  → If lock exists: enqueue, process after lock released
  → This serializes operations per (user, content) pair
  → No global contention — lock is per user per post
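
The locking behaviour can be illustrated without a Redis server. The class below emulates `SET key value NX EX ttl` with an explicit clock, and `apply_taps_serially` shows why serialized taps give the expected parity. Both names are hypothetical, and the sketch skips the owner token that a real Redis lock should check before release.

```python
from typing import Dict, List, Optional


class InMemoryLocks:
    """Emulates Redis SET ... NX EX with a caller-supplied clock (seconds)."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def acquire(self, key: str, ttl_seconds: float, now: float) -> bool:
        current = self._expires_at.get(key)
        if current is not None and current > now:
            return False                      # lock held and not yet expired
        self._expires_at[key] = now + ttl_seconds
        return True

    def release(self, key: str) -> None:
        self._expires_at.pop(key, None)


def lock_key(content_id: str, user_id: str) -> str:
    return f"ur:lock:{content_id}:{user_id}"


def apply_taps_serially(existing: Optional[str], taps: List[str]) -> Optional[str]:
    """Apply taps one at a time, as the per-(user, content) lock guarantees."""
    state = existing
    for requested in taps:
        state = None if state == requested else requested
    return state
```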

Deep Dive 2: Approximate vs Exact Counts

The trade-off:

Exact counts require strong consistency between the reactions table and the counts table. At 500K writes/sec, keeping them in sync adds latency and complexity.

Approach: Exact source-of-truth, approximate cache, periodic reconciliation

Layer 1 — Redis cache (approximate, fast):
  → Updated async via Kafka (eventual consistency)
  → May be off by a few reactions during peak load
  → Serves 99.9% of read traffic
  → Acceptable because users can't tell if a post has 1,542 vs 1,545 likes

Layer 2 — reaction_counts table (eventually consistent):
  → Updated by counter service with micro-batching
  → Consistent within seconds of a reaction
  → Serves cache misses

Layer 3 — reactions table (exact, source of truth):
  → COUNT(*) GROUP BY reaction_type on demand
  → Used only by reconciliation job, never for live reads

Reconciliation job (runs every hour):
  FOR each content_id with reactions in the last hour:
    exact = SELECT reaction_type, COUNT(*) FROM reactions
            WHERE content_id = ? GROUP BY reaction_type;
    cached = SELECT * FROM reaction_counts WHERE content_id = ?;
    IF drift > 5%:
      UPDATE reaction_counts to exact values
      UPDATE Redis cache
      Log the drift for monitoring
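
The per-post check in the reconciliation job could be sketched as follows. The job above does not define "drift" precisely; this sketch assumes it means the relative difference between the stored total and the exact total, and the function names are illustrative.

```python
from typing import Dict, Optional


def relative_drift(exact_total: int, stored_total: int) -> float:
    """Relative difference between stored and exact totals (assumed definition)."""
    if exact_total == 0:
        return 0.0 if stored_total == 0 else 1.0
    return abs(stored_total - exact_total) / exact_total


def reconcile(exact_by_type: Dict[str, int], stored: Dict[str, int],
              threshold: float = 0.05) -> Optional[Dict[str, int]]:
    """Return corrected counts if drift exceeds the threshold, else None.

    exact_by_type -- result of the GROUP BY reaction_type query
    stored        -- current reaction_counts row as a dict with a "total" field
    """
    exact_total = sum(exact_by_type.values())
    if relative_drift(exact_total, stored.get("total", 0)) <= threshold:
        return None
    return {"total": exact_total, **exact_by_type}
```

A non-None result is what the job would write back to reaction_counts and Redis, and log for monitoring.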

Why not just use exact counts everywhere?

  • COUNT(*) at query time is too slow for live reads: in a 500B-row table, counting even a single post with 10M reactions takes seconds
  • Locking the counter row for every write creates contention (50K writes/sec to one row)
  • Approximate counts with periodic correction give us speed AND eventual accuracy

Display optimization:

< 1,000:            show exact count ("847 reactions")
1,000 to 999,999:   show rounded to thousands ("1.5K reactions")
≥ 1,000,000:        show rounded to millions ("2.3M reactions")

Once counts are rounded (1,000 and above), being off by 10 reactions is invisible to the user.
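
A possible formatter for these display rules. It truncates to one decimal place rather than rounding, an assumption made so that 999,999 never renders as "1000.0K"; `format_count` is a hypothetical name.

```python
def format_count(n: int) -> str:
    """Format a reaction count for display: 847, 1.5K, 2.3M."""
    if n < 0:
        raise ValueError("count cannot be negative")
    if n < 1_000:
        return str(n)
    unit, suffix = (1_000_000, "M") if n >= 1_000_000 else (1_000, "K")
    tenths = n // (unit // 10)            # value in tenths of a unit, truncated
    whole, frac = divmod(tenths, 10)
    return f"{whole}{suffix}" if frac == 0 else f"{whole}.{frac}{suffix}"
```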

Deep Dive 3: Fan-Out for Notifications

The problem: When a post gets liked, the author should be notified. But a viral post with 50K likes/sec would spam the author with 50K notifications.

Solution: Batched notification aggregation

Reaction event → Kafka → Notification Aggregator

Aggregator logic:
  1. Maintain a per-post notification buffer:
     Key: notif:pending:{author_id}:{content_id}
     Value: list of (user_id, reaction_type, timestamp)

  2. Flush rules (whichever triggers first):
     → First reaction on a post: flush immediately (instant gratification)
     → Subsequent: flush every 5 minutes
     → Or when buffer reaches 50 reactions
     → At flush: create a single aggregated notification:
       "Alice, Bob, and 48 others reacted to your post"

  3. Notification deduplication:
     → If the author already received a notification for this post in the last hour,
       update the existing notification ("Alice, Bob, and 148 others...")
       instead of creating a new one

  4. Priority handling:
     → Reaction from a close friend → include by name: "Alice reacted..."
     → Reaction from a stranger → count only: "...and 48 others"
     → Determined by social graph proximity score
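
The flush rules can be sketched with a small buffer class. This is a simplification under stated assumptions: the time-based rule is only evaluated when a new reaction arrives (a real aggregator would also need a timer), names are listed in arrival order rather than by social graph proximity, and `NotificationBuffer` and `summarize` are hypothetical names.

```python
from typing import List, Optional


def summarize(names: List[str]) -> str:
    """Build the aggregated notification text for a list of reactor names."""
    if not names:
        raise ValueError("nothing to summarize")
    if len(names) == 1:
        return f"{names[0]} reacted to your post"
    if len(names) == 2:
        return f"{names[0]} and {names[1]} reacted to your post"
    others = len(names) - 2
    noun = "other" if others == 1 else "others"
    return f"{names[0]}, {names[1]}, and {others} {noun} reacted to your post"


class NotificationBuffer:
    """Pending reactions for one (author_id, content_id) pair."""

    def __init__(self, flush_interval_s: float = 300.0, max_pending: int = 50) -> None:
        self.flush_interval_s = flush_interval_s
        self.max_pending = max_pending
        self.pending: List[str] = []
        self.last_flush: Optional[float] = None

    def add(self, reactor_name: str, now: float) -> Optional[str]:
        """Buffer one reaction; return a notification if a flush rule fires."""
        self.pending.append(reactor_name)
        is_first = self.last_flush is None                 # first reaction: flush now
        is_due = (not is_first) and now - self.last_flush >= self.flush_interval_s
        is_full = len(self.pending) >= self.max_pending
        if is_first or is_due or is_full:
            message = summarize(self.pending)
            self.pending = []
            self.last_flush = now
            return message
        return None
```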

Real-time count updates (live on the page):

While a user is viewing a post:
  → Subscribe to WebSocket channel: reactions:{content_id}
  → Server pushes count deltas every 2 seconds: { "delta_like": 3, "delta_love": 1 }
  → Client updates count locally (optimistic)
  → On significant changes (> 100 in 1 minute), push a full refresh

For posts not currently being viewed:
  → No real-time updates needed
  → Next time the post is loaded, fresh counts are fetched from cache

7. Extensions (2 min)

  • Custom reactions: Allow page admins (brands, groups) to define custom reaction emojis. Extends reaction_type from 6 fixed values to a configurable set. Requires a reaction_type_id → emoji mapping per page/context.
  • Reaction analytics for page owners: Dashboard showing reaction sentiment over time. “Your post about Product X received 60% love, 5% angry — positive sentiment.” Aggregated from reaction_counts with time-series bucketing.
  • Reaction-weighted ranking: Feed ranking algorithm uses reactions as quality signals. A post with 100 “love” reactions ranks higher than one with 100 “angry” reactions. Weight each reaction type differently in the engagement score.
  • Undo grace period: After toggling a reaction off, keep the record for 5 seconds before deletion. If the user re-taps within 5 seconds, restore instead of creating a new record. Reduces write amplification from accidental taps.
  • Cross-platform sync: Reactions on a post shared to Instagram, WhatsApp, or Threads should be aggregated. Central reaction counter with per-platform source tracking.