1. Requirements & Scope (5 min)

Functional Requirements

  1. Display a real-time count of users currently viewing a specific page (e.g., “47 people viewing this page”)
  2. Count updates within 5 seconds of a viewer joining or leaving
  3. Track unique viewers — refreshing the page or opening multiple tabs from the same user should count as 1 viewer
  4. Support any page on the platform (product pages, articles, dashboards) — millions of distinct page IDs
  5. Provide an API to query current viewer count for any page (for analytics dashboards)

Non-Functional Requirements

  • Availability: 99.9% — viewer counts are informational, not business-critical. Showing a stale count briefly is acceptable.
  • Latency: Count updates pushed to viewers within 5 seconds. API queries return in < 50ms.
  • Consistency: Approximate counts are fine. Off by 2-3 viewers is acceptable. Off by 50% is not.
  • Scale: 10M concurrent users across 5M distinct pages. Average page: 2 viewers. Hot pages (trending product, breaking news): 500K+ viewers.
  • Durability: Viewer counts are ephemeral — no need to persist historical real-time counts. However, log peak counts for analytics.

2. Estimation (3 min)

Connections

  • 10M concurrent WebSocket connections
  • Each connection: ~10 KB memory → 100 GB total connection memory
  • At 200K connections per server → 50 WebSocket servers

Heartbeat Traffic

  • Each client sends a heartbeat every 30 seconds
  • 10M clients / 30 sec = 333K heartbeats/sec
  • Each heartbeat: ~100 bytes → 33 MB/sec inbound

Count Update Traffic

  • When a viewer joins/leaves, update the count for that page
  • Churn rate: ~5% of viewers change pages per minute → 500K join/leave events per minute → 8,300 events/sec
  • Each event triggers a count update pushed to all viewers of that page
  • Average page: 2 viewers, hot pages: thousands
  • Estimated push volume: ~50K count update messages/sec

Storage

  • Active page viewer sets: 5M pages × ~100 bytes per entry × avg 2 viewers = 1 GB in Redis
  • Hot pages with 500K viewers: a single sorted set with 500K members = ~50 MB. Manageable.
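As a sanity check, the figures in this estimation section reduce to a few lines of arithmetic (decimal units, matching the rounding above):

```python
clients = 10_000_000

conn_mem_gb = clients * 10_000 / 1e9                 # 10 KB each -> 100.0 GB
ws_servers = clients // 200_000                      # -> 50 servers
heartbeats_per_sec = clients / 30                    # -> ~333,333/sec
hb_bandwidth_mb_s = heartbeats_per_sec * 100 / 1e6   # 100 B each -> ~33.3 MB/s
churn_events_per_sec = clients * 0.05 / 60           # 5%/min -> ~8,333/sec
viewer_set_gb = 5_000_000 * 2 * 100 / 1e9            # -> 1.0 GB in Redis
```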

3. API Design (3 min)

REST Endpoints

// Get current viewer count for a page
GET /v1/pages/{page_id}/viewers/count
  Response 200: {
    "page_id": "product_12345",
    "viewer_count": 47,
    "updated_at": "2026-02-22T10:00:05Z"
  }

// Get viewer counts for multiple pages (batch)
POST /v1/pages/viewers/count
  Body: { "page_ids": ["product_12345", "article_678", ...] }
  Response 200: {
    "counts": {
      "product_12345": 47,
      "article_678": 1203
    }
  }

WebSocket Protocol

// Client connects and subscribes to a page
WS /v1/pages/{page_id}/viewers

// Client → Server: heartbeat (every 30 seconds)
{ "type": "heartbeat" }

// Server → Client: viewer count update
{ "type": "viewer_count", "count": 48 }

// Client navigates to a different page
// → Send subscribe/unsubscribe on the existing multiplexed connection
{ "type": "subscribe", "page_id": "new_page_456" }
{ "type": "unsubscribe", "page_id": "product_12345" }

Key Decisions

  • Use WebSocket for real-time push rather than polling (polling 10M clients every 5 seconds = 2M req/sec just for counts)
  • Multiplexed WebSocket: single connection per client, subscribe/unsubscribe to pages as they navigate. Avoids reconnection overhead.
  • Heartbeat is mandatory. If no heartbeat received in 90 seconds (3 missed), the server considers the viewer gone.

4. Data Model (3 min)

Active Viewer Set (Redis)

// Set of active viewers per page
Key: page:{page_id}:viewers
Type: Sorted Set
Member: viewer_id (user_id or session_id for anonymous users)
Score: last_heartbeat_timestamp

// Current count (cached, updated on join/leave)
Key: page:{page_id}:count
Type: String (integer)

Connection Registry (Redis)

// Which page a connection is viewing, and which server owns it
Key: conn:{connection_id}
Type: Hash
Fields:
  viewer_id      | varchar
  page_id        | varchar
  server_id      | varchar
  connected_at   | timestamp
  last_heartbeat | timestamp
TTL: 120 seconds (auto-cleanup if server crashes)

Unique Viewer Tracking (Redis HyperLogLog)

// Approximate unique viewers (for analytics, not real-time count)
Key: page:{page_id}:unique_viewers:{date}
Type: HyperLogLog
Operation: PFADD on each new viewer

Why Redis?

  • All data is ephemeral (viewer presence is transient)
  • Sub-millisecond reads/writes for count lookups and heartbeat updates
  • Sorted sets enable efficient expiry scanning (remove viewers with old heartbeats)
  • Pub/Sub for cross-server count change notifications
  • HyperLogLog for memory-efficient unique counting (12 KB per counter regardless of cardinality)
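To keep key construction consistent across the WS servers, reaper, and query API, the key names above can live in one shared module. A sketch (function names are illustrative):

```python
# One source of truth for the Redis key schema described above.

def viewers_key(page_id: str) -> str:
    """Sorted set: member = viewer_id, score = last heartbeat timestamp."""
    return f"page:{page_id}:viewers"

def count_key(page_id: str) -> str:
    """String counter: cached count, updated on join/leave."""
    return f"page:{page_id}:count"

def conn_key(connection_id: str) -> str:
    """Hash describing one WebSocket connection; 120 s TTL."""
    return f"conn:{connection_id}"

def unique_viewers_key(page_id: str, date: str) -> str:
    """HyperLogLog of unique viewer_ids per page per day."""
    return f"page:{page_id}:unique_viewers:{date}"

def count_channel(page_id: str) -> str:
    """Pub/Sub channel for count-change notifications."""
    return f"page_count:{page_id}"
```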

5. High-Level Design (12 min)

Viewer Join Flow

Client opens page
  → WebSocket Connection Server
    → 1. Authenticate (extract user_id or generate session_id)
    → 2. Deduplicate: Check if this viewer_id already exists in page's viewer set
         ZSCORE page:{page_id}:viewers {viewer_id}
         If exists → update heartbeat timestamp, do NOT increment count
         If new → ZADD page:{page_id}:viewers {now} {viewer_id}
                   INCR page:{page_id}:count
    → 3. Register connection: HSET conn:{connection_id} ...
    → 4. Publish count change: PUBLISH page_count:{page_id} {new_count}
    → 5. Return current count to client
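The join flow above can be sketched as a handler against a redis-py-style client. Note the ZSCORE/ZADD pair here is not atomic as written; in production, steps 2-4 would typically be wrapped in a Lua script:

```python
def handle_join(r, page_id: str, viewer_id: str, conn_id: str, now: float) -> int:
    """Steps 2-5 of the join flow. `r` is a redis-py-style client."""
    vkey = f"page:{page_id}:viewers"
    ckey = f"page:{page_id}:count"

    # 2. Deduplicate: only a first-time viewer_id increments the count
    is_new = r.zscore(vkey, viewer_id) is None
    r.zadd(vkey, {viewer_id: now})                 # add, or refresh heartbeat
    count = r.incr(ckey) if is_new else int(r.get(ckey) or 0)

    # 3. Register the connection (TTL cleans it up if the server dies)
    r.hset(f"conn:{conn_id}", mapping={
        "viewer_id": viewer_id, "page_id": page_id, "last_heartbeat": now})
    r.expire(f"conn:{conn_id}", 120)

    # 4. Notify other servers; 5. return the current count to the client
    r.publish(f"page_count:{page_id}", count)
    return count
```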

Heartbeat Flow

Client sends heartbeat every 30 seconds
  → WS Server receives heartbeat
    → ZADD page:{page_id}:viewers {now} {viewer_id}  (update score = timestamp)
    → EXPIRE conn:{connection_id} 120                  (refresh TTL)

Viewer Leave Flow

Case 1: Client navigates away (graceful close)
  → Client sends unsubscribe or closes WebSocket
  → WS Server:
    → ZREM page:{page_id}:viewers {viewer_id}
    → DECR page:{page_id}:count
    → PUBLISH page_count:{page_id} {new_count}
    → DEL conn:{connection_id}

Case 2: Client crashes / loses network (ungraceful)
  → No more heartbeats received
  → Heartbeat Reaper (background job):
    → Runs every 30 seconds
    → ZRANGEBYSCORE page:{page_id}:viewers -inf (now - 90)
    → Remove stale entries, decrement count
    → Publish updated count
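One pass of the reaper over a single page might look like the sketch below. It resets the cached count from ZCARD rather than decrementing per removal, so a missed decrement cannot compound drift:

```python
def reap_page(r, page_id: str, now: float, stale_after: float = 90.0) -> int:
    """Remove viewers whose last heartbeat is older than `stale_after` seconds.
    Returns the number of stale viewers removed. `r` is a redis-py-style client."""
    vkey = f"page:{page_id}:viewers"
    cutoff = now - stale_after

    removed = r.zremrangebyscore(vkey, "-inf", cutoff)
    if removed:
        # Reset from the set itself (source of truth) instead of decrementing
        count = r.zcard(vkey)
        r.set(f"page:{page_id}:count", count)
        r.publish(f"page_count:{page_id}", count)
    return removed
```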

Count Broadcast Flow

Redis Pub/Sub channel: page_count:{page_id}
  → When count changes, PUBLISH to this channel
  → Each WS Server subscribes to channels for pages with local viewers
  → On receiving count update:
    → Push { "type": "viewer_count", "count": N } to all local WebSocket
      connections viewing that page
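On the WS-server side, handling a count-change notification is a local fan-out. A sketch, where `local_conns` (an assumed in-memory index) maps page_id to that server's open sockets:

```python
import json

def fan_out(local_conns: dict, page_id: str, count: int) -> int:
    """Push a viewer_count update to every local connection viewing page_id.
    `local_conns` maps page_id -> list of socket-like objects with .send()."""
    msg = json.dumps({"type": "viewer_count", "count": count})
    conns = local_conns.get(page_id, [])
    for ws in conns:
        ws.send(msg)
    return len(conns)
```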

Components

  1. WebSocket Connection Servers (50 servers): Maintain persistent connections. Handle heartbeats, join/leave events. Subscribe to Redis Pub/Sub for count updates.
  2. Redis Cluster (3 shards, replicated): Stores viewer sets, counts, connection registry. Provides Pub/Sub for cross-server notifications.
  3. Heartbeat Reaper Service: Background job scanning for stale viewers. Runs on a schedule (every 30 seconds). Scans sorted sets for expired entries.
  4. Count Query API: Stateless HTTP service for non-real-time queries. Reads directly from Redis. Used by analytics dashboards.
  5. Analytics Pipeline: Kafka consumer logs peak counts, unique viewer counts (HyperLogLog reads) to a data warehouse for historical analysis.

Architecture Diagram

Clients (10M)
  → Load Balancer (sticky by viewer_id hash)
    → WebSocket Servers (50 instances)
      → Redis Cluster
        ├── Viewer Sets (Sorted Sets)
        ├── Counts (Strings)
        ├── Connection Registry (Hashes)
        └── Pub/Sub (count change notifications)

Heartbeat Reaper (3 instances, pages partitioned by hash)
  → Scans Redis sorted sets for stale entries
  → Decrements counts and publishes updates

Analytics Pipeline
  → Periodically snapshots counts to Kafka → Data Warehouse

6. Deep Dives (15 min)

Deep Dive 1: Presence Detection — Heartbeat vs. WebSocket State

Option A: Rely purely on WebSocket connection state

  • If WebSocket is connected → viewer is present
  • If WebSocket disconnects → viewer is gone
  • Problem: WebSocket disconnections are unreliable. TCP half-open connections can linger for minutes. Mobile devices may keep connections open while the app is backgrounded. Load balancer timeouts may silently drop connections without notifying the server.
  • Result: Phantom viewers (count inflated)

Option B: Heartbeat-based presence (chosen approach)

Client sends heartbeat every 30 seconds:
  { "type": "heartbeat" }

Server updates the viewer's score in the sorted set:
  ZADD page:{page_id}:viewers {current_timestamp} {viewer_id}

Reaper removes stale entries:
  ZRANGEBYSCORE page:{page_id}:viewers -inf {now - 90}
  → Any viewer who hasn't sent a heartbeat in 90 seconds is removed

Why 90 seconds (3 missed heartbeats)?
  - 1 missed heartbeat: likely a temporary network hiccup
  - 2 missed heartbeats: probably still a transient issue
  - 3 missed heartbeats (90 sec): viewer is almost certainly gone
  - This gives a 60-120 second detection lag for ungraceful departures
    (up to 90 seconds since the last heartbeat, plus up to one 30-second
    reaper interval)

Optimizing the reaper:

  • Don’t scan all 5M pages every 30 seconds
  • Maintain an active pages set: only pages with at least 1 viewer
  • Further optimization: partition pages across reaper instances by hash
  • Each reaper instance owns a slice of pages (consistent hashing)
  • Scan cost: ZRANGEBYSCORE is O(log N + M) where M is the number of expired entries
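Partitioning pages across reaper instances can be as simple as stable hashing. Modulo hashing is shown here for brevity; true consistent hashing only matters if instances are added and removed frequently:

```python
import hashlib

def reaper_owner(page_id: str, num_reapers: int) -> int:
    """Deterministically assign each page to exactly one reaper instance."""
    digest = hashlib.sha1(page_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reapers

def my_pages(active_pages, instance_id: int, num_reapers: int):
    """The slice of active pages this reaper instance is responsible for."""
    return [p for p in active_pages
            if reaper_owner(p, num_reapers) == instance_id]
```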

Tab/window deduplication:

Scenario: User has 3 tabs open to the same product page
  - Each tab has its own WebSocket connection
  - All 3 connections share the same viewer_id (user_id)
  - ZADD is idempotent on the same member: only 1 entry in the sorted set
  - Heartbeat from any tab refreshes the timestamp
  - Count = ZCARD (number of unique members) = 1, not 3

For anonymous users:
  - Generate a session_id stored in a cookie
  - Multiple tabs share the same cookie → same session_id → counted as 1

Deep Dive 2: Scaling Hot Pages (500K Concurrent Viewers)

A trending product page has 500K viewers. The count changes frequently (viewers joining/leaving constantly). Broadcasting every count change to 500K WebSocket connections is expensive.

Problem breakdown:

  • At 5% churn/min: 25K join/leave events per minute = ~400 events/sec
  • Each event triggers a Pub/Sub message to all WS Servers with viewers on this page
  • Each WS Server pushes to thousands of local connections

Solution 1: Throttle count updates

Instead of broadcasting every individual change:
  - Buffer count changes for 3-5 seconds
  - Send a single update with the latest count
  - Implementation: WS Server maintains a per-page timer
    On count change:
      if timer not active:
        start 3-second timer
      on timer fire:
        push latest count to all local connections for this page
        reset timer

Result: Max 1 push per page per 3 seconds per server, regardless of churn rate.
For a 500K-viewer page spread across 50 servers, that is at most 50 throttled
pushes every 3 seconds instead of ~400 broadcast events per second. Trivial.
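The per-page timer can be made deterministic (and easy to test) by driving it from a periodic tick instead of wall-clock timers. A sketch:

```python
class CountThrottler:
    """Coalesce count changes: at most one push per page per interval."""

    def __init__(self, push_fn, interval: float = 3.0):
        self.push_fn = push_fn      # called as push_fn(page_id, count)
        self.interval = interval
        self.pending = {}           # page_id -> latest count seen
        self.fire_at = {}           # page_id -> deadline of the active timer

    def on_count_change(self, page_id: str, count: int, now: float):
        """Record the latest count; start a timer if none is active."""
        self.pending[page_id] = count
        self.fire_at.setdefault(page_id, now + self.interval)

    def tick(self, now: float):
        """Fire expired timers, pushing only the latest count per page."""
        fired = []
        for page_id, deadline in list(self.fire_at.items()):
            if now >= deadline:
                self.push_fn(page_id, self.pending.pop(page_id))
                del self.fire_at[page_id]
                fired.append(page_id)
        return fired
```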

Solution 2: Approximate counting for mega-pages

For pages with > 10K viewers, relax exactness:
  - Option A: Keep the full sorted set. ZCARD is O(1) in Redis, and a sorted
    set holds ~1M members comfortably, so this covers all but the very
    largest pages
  - Option B: Beyond ~1M members, skip the central set entirely and use
    batched server-local counting:
    Each WS Server maintains a local count of its viewers for that page
    Periodically (every 5 sec), publish the local delta to Redis
    Global count = sum of all server-local counts

Display: "~500K viewing" (round to nearest 100 or 1K for mega-pages)
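Option B's server-local counting can be sketched with a Redis hash of absolute per-server counts (an assumption here: absolute reports rather than deltas, since absolute values are self-healing where a lost delta message would drift):

```python
def report_local_count(r, page_id: str, server_id: str, local_count: int):
    """Each WS Server periodically reports its own viewer count for the page."""
    r.hset(f"page:{page_id}:server_counts", server_id, local_count)

def global_count(r, page_id: str) -> int:
    """Global count = sum of all server-local counts."""
    counts = r.hgetall(f"page:{page_id}:server_counts")
    return sum(int(v) for v in counts.values())
```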

Solution 3: Edge-level aggregation

For truly massive pages (1M+ viewers):
  - Use CDN edge servers as WebSocket proxies
  - Each edge maintains local viewer count
  - Edges report to origin every 5 seconds
  - Origin sums all edge counts → publishes back to edges
  - Edge pushes updated global count to local viewers
  - Reduces origin fan-out from 1M to ~200 edge PoPs

Deep Dive 3: Server Crashes and Count Accuracy

Problem: A WS Server holding 200K connections crashes. Those viewers are gone, but their entries remain in Redis sorted sets. The viewer count is now inflated by up to 200K.

Recovery mechanism:

1. Heartbeat reaper (standard path):
   - Crashed server's connections stop sending heartbeats
   - After 90 seconds, reaper removes all stale entries
   - Count self-corrects within ~90 seconds

2. Fast recovery (server crash detection):
   - Each WS Server maintains a liveness key: SET server:{server_id}:alive 1 EX 60
     (Redis sets have no per-member TTL, so use one key per server, or a
      sorted set scored by last refresh time)
   - Server refreshes its key every 30 seconds
   - Crash detector watches for servers whose liveness key has expired
   - If a server disappears:
     → Query connection registry: all conn:{*} with server_id = crashed_server
     → Bulk remove those viewers from page sorted sets
     → Recount and publish corrected counts
   - Recovery time: ~60 seconds instead of 90

3. Consistency check (periodic, background):
   - Every 5 minutes, for each active page:
     Recalculate count from ZCARD (ignoring cached counter)
     If |ZCARD - cached_count| > threshold → reset cached count to ZCARD
   - Catches any drift caused by race conditions or missed decrements
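One way to implement the fast-recovery liveness check is a sorted set scored by last refresh time, so expired servers fall out of a range query. A sketch:

```python
def refresh_liveness(r, server_id: str, now: float):
    """Each WS Server calls this every 30 seconds."""
    r.zadd("active_servers", {server_id: now})

def crashed_servers(r, now: float, ttl: float = 60.0):
    """Servers that haven't refreshed within `ttl` seconds are presumed dead."""
    return r.zrangebyscore("active_servers", "-inf", now - ttl)
```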

Preventing count drift from race conditions:

Increment/decrement race:
  Thread A: ZREM viewer_1 → success → DECR count → count = 46
  Thread B: ZADD viewer_2 → success → INCR count → count = 47
  Actual ZCARD = 47 ✓ (each individual Redis command is atomic, so this
  interleaving converges; note that a ZREM+DECR pair is not atomic as a
  whole, so wrap the pair in a Lua script or MULTI/EXEC to close that gap)

But what if INCR/DECR gets lost (network error)?
  - The counter drifts from the actual set cardinality
  - Fix: use ZCARD as the source of truth, counter is just a fast cache
  - Periodically: SET page:{page_id}:count (ZCARD page:{page_id}:viewers)
  - For hot pages (>1000 viewers), run this every 30 seconds
  - For cold pages (<10 viewers), run this every 5 minutes
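The periodic reconciliation is a few lines: ZCARD is the source of truth and the counter is just a cache. A sketch:

```python
def reconcile_count(r, page_id: str, threshold: int = 2) -> bool:
    """Reset the cached counter from set cardinality if it has drifted.
    Returns True if a correction was made. `r` is a redis-py-style client."""
    actual = r.zcard(f"page:{page_id}:viewers")
    cached = int(r.get(f"page:{page_id}:count") or 0)
    if abs(actual - cached) > threshold:
        r.set(f"page:{page_id}:count", actual)
        r.publish(f"page_count:{page_id}", actual)
        return True
    return False
```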

7. Extensions (2 min)

  • Historical peak tracking: Record peak concurrent viewers per page per hour in a time-series DB (InfluxDB/TimescaleDB). Display “Peak: 12,847 viewers today” on the page. Useful for product teams to correlate traffic with marketing campaigns.
  • Viewer breakdown by source: Track referrer alongside viewer_id. Show “23 from Google, 15 from Twitter, 9 direct.” Store referrer in the connection hash and aggregate on demand.
  • Anonymous vs. logged-in viewer counts: Display “47 viewers (12 signed in)” by partitioning the sorted set or using two sets. Logged-in viewers may see additional social features (“3 of your friends are viewing this”).
  • Cross-page presence: Extend beyond single-page counts to “342 people shopping right now” across a category. Aggregate counts from all pages in a category using a Redis set of active pages per category.
  • Heatmap integration: Beyond count, track which section of the page users are scrolling to. Use periodic scroll-position reports (every 10 sec) to build a real-time engagement heatmap. Useful for A/B testing and UX optimization.