1. Requirements & Scope (5 min)

Functional Requirements

  1. Display a real-time count of users currently viewing a specific page (e.g., “47 people viewing this page”)
  2. Count updates within 5 seconds of a viewer joining or leaving
  3. Track unique viewers — refreshing the page or opening multiple tabs from the same user should count as 1 viewer
  4. Support any page on the platform (product pages, articles, dashboards) — millions of distinct page IDs
  5. Provide an API to query current viewer count for any page (for analytics dashboards)

Non-Functional Requirements

  • Availability: 99.9% — viewer counts are informational, not business-critical. Showing a stale count briefly is acceptable.
  • Latency: Count updates pushed to viewers within 5 seconds. API queries return in < 50ms.
  • Consistency: Approximate counts are fine. Off by 2-3 viewers is acceptable. Off by 50% is not.
  • Scale: 10M concurrent users across 5M distinct pages. Average page: 2 viewers. Hot pages (trending product, breaking news): 500K+ viewers.
  • Durability: Viewer counts are ephemeral — no need to persist historical real-time counts. However, log peak counts for analytics.

2. Estimation (3 min)

Connections

  • 10M concurrent WebSocket connections
  • Each connection: ~10 KB memory → 100 GB total connection memory
  • At 200K connections per server → 50 WebSocket servers

Heartbeat Traffic

  • Each client sends a heartbeat every 30 seconds
  • 10M clients / 30 sec = 333K heartbeats/sec
  • Each heartbeat: ~100 bytes → 33 MB/sec inbound

Count Update Traffic

  • When a viewer joins/leaves, update the count for that page
  • Churn rate: ~5% of viewers change pages per minute → 500K join/leave events per minute → 8,300 events/sec
  • Each event triggers a count update pushed to all viewers of that page
  • Average page: 2 viewers, hot pages: thousands
  • Estimated push volume: ~50K count update messages/sec

Storage

  • Active page viewer sets: 5M pages × ~100 bytes per entry × avg 2 viewers = 1 GB in Redis
  • Hot pages with 500K viewers: a single sorted set with 500K members = ~50 MB. Manageable.
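As a sanity check, the figures in this estimation section reduce to a few lines of arithmetic (decimal units, matching the rounding above):

```python
clients = 10_000_000

conn_mem_gb = clients * 10_000 / 1e9                 # 10 KB each -> 100.0 GB
ws_servers = clients // 200_000                      # -> 50 servers
heartbeats_per_sec = clients / 30                    # -> ~333,333/sec
hb_bandwidth_mb_s = heartbeats_per_sec * 100 / 1e6   # 100 B each -> ~33.3 MB/s
churn_events_per_sec = clients * 0.05 / 60           # 5%/min -> ~8,333/sec
viewer_set_gb = 5_000_000 * 2 * 100 / 1e9            # -> 1.0 GB in Redis
```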

3. API Design (3 min)

REST Endpoints

// Get current viewer count for a page
GET /v1/pages/{page_id}/viewers/count
  Response 200: {
    "page_id": "product_12345",
    "viewer_count": 47,
    "updated_at": "2026-02-22T10:00:05Z"
  }

// Get viewer counts for multiple pages (batch)
POST /v1/pages/viewers/count
  Body: { "page_ids": ["product_12345", "article_678", ...] }
  Response 200: {
    "counts": {
      "product_12345": 47,
      "article_678": 1203
    }
  }

WebSocket Protocol

// Client connects and subscribes to a page
WS /v1/pages/{page_id}/viewers

// Client → Server: heartbeat (every 30 seconds)
{ "type": "heartbeat" }

// Server → Client: viewer count update
{ "type": "viewer_count", "count": 48 }

// Client navigates to a different page
// → Send subscribe/unsubscribe on the existing multiplexed connection
{ "type": "subscribe", "page_id": "new_page_456" }
{ "type": "unsubscribe", "page_id": "product_12345" }

Key Decisions

  • Use WebSocket for real-time push rather than polling (polling 10M clients every 5 seconds = 2M req/sec just for counts)
  • Multiplexed WebSocket: single connection per client, subscribe/unsubscribe to pages as they navigate. Avoids reconnection overhead.
  • Heartbeat is mandatory. If no heartbeat received in 90 seconds (3 missed), the server considers the viewer gone.

4. Data Model (3 min)

Active Viewer Set (Redis)

// Set of active viewers per page
Key: page:{page_id}:viewers
Type: Sorted Set
Member: viewer_id (user_id or session_id for anonymous users)
Score: last_heartbeat_timestamp

// Current count (cached, updated on join/leave)
Key: page:{page_id}:count
Type: String (integer)

Connection Registry (Redis)

// Which page a connection is viewing, and which server owns it
Key: conn:{connection_id}
Type: Hash
Fields:
  viewer_id      | varchar
  page_id        | varchar
  server_id      | varchar
  connected_at   | timestamp
  last_heartbeat | timestamp
TTL: 120 seconds (auto-cleanup if server crashes)

Unique Viewer Tracking (Redis HyperLogLog)

// Approximate unique viewers (for analytics, not real-time count)
Key: page:{page_id}:unique_viewers:{date}
Type: HyperLogLog
Operation: PFADD on each new viewer

Why Redis?

  • All data is ephemeral (viewer presence is transient)
  • Sub-millisecond reads/writes for count lookups and heartbeat updates
  • Sorted sets enable efficient expiry scanning (remove viewers with old heartbeats)
  • Pub/Sub for cross-server count change notifications
  • HyperLogLog for memory-efficient unique counting (12 KB per counter regardless of cardinality)
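To keep key construction consistent across the WS servers, reaper, and query API, the key names above can live in one shared module. A sketch (function names are illustrative):

```python
# One source of truth for the Redis key schema described above.

def viewers_key(page_id: str) -> str:
    """Sorted set: member = viewer_id, score = last heartbeat timestamp."""
    return f"page:{page_id}:viewers"

def count_key(page_id: str) -> str:
    """String counter: cached count, updated on join/leave."""
    return f"page:{page_id}:count"

def conn_key(connection_id: str) -> str:
    """Hash describing one WebSocket connection; 120 s TTL."""
    return f"conn:{connection_id}"

def unique_viewers_key(page_id: str, date: str) -> str:
    """HyperLogLog of unique viewer_ids per page per day."""
    return f"page:{page_id}:unique_viewers:{date}"

def count_channel(page_id: str) -> str:
    """Pub/Sub channel for count-change notifications."""
    return f"page_count:{page_id}"
```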

5. High-Level Design (12 min)

Viewer Join Flow

Client opens page
  → WebSocket Connection Server
    → 1. Authenticate (extract user_id or generate session_id)
    → 2. Deduplicate: Check if this viewer_id already exists in page's viewer set
         ZSCORE page:{page_id}:viewers {viewer_id}
         If exists → update heartbeat timestamp, do NOT increment count
         If new → ZADD page:{page_id}:viewers {now} {viewer_id}
                   INCR page:{page_id}:count
    → 3. Register connection: HSET conn:{connection_id} ...
    → 4. Publish count change: PUBLISH page_count:{page_id} {new_count}
    → 5. Return current count to client
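The join flow above can be sketched as a handler against a redis-py-style client. Note the ZSCORE/ZADD pair here is not atomic as written; in production, steps 2-4 would typically be wrapped in a Lua script:

```python
def handle_join(r, page_id: str, viewer_id: str, conn_id: str, now: float) -> int:
    """Steps 2-5 of the join flow. `r` is a redis-py-style client."""
    vkey = f"page:{page_id}:viewers"
    ckey = f"page:{page_id}:count"

    # 2. Deduplicate: only a first-time viewer_id increments the count
    is_new = r.zscore(vkey, viewer_id) is None
    r.zadd(vkey, {viewer_id: now})                 # add, or refresh heartbeat
    count = r.incr(ckey) if is_new else int(r.get(ckey) or 0)

    # 3. Register the connection (TTL cleans it up if the server dies)
    r.hset(f"conn:{conn_id}", mapping={
        "viewer_id": viewer_id, "page_id": page_id, "last_heartbeat": now})
    r.expire(f"conn:{conn_id}", 120)

    # 4. Notify other servers; 5. return the current count to the client
    r.publish(f"page_count:{page_id}", count)
    return count
```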

Heartbeat Flow

Client sends heartbeat every 30 seconds
  → WS Server receives heartbeat
    → ZADD page:{page_id}:viewers {now} {viewer_id}  (update score = timestamp)
    → EXPIRE conn:{connection_id} 120                  (refresh TTL)

Viewer Leave Flow

Case 1: Client navigates away (graceful close)
  → Client sends unsubscribe or closes WebSocket
  → WS Server:
    → ZREM page:{page_id}:viewers {viewer_id}
    → DECR page:{page_id}:count
    → PUBLISH page_count:{page_id} {new_count}
    → DEL conn:{connection_id}

Case 2: Client crashes / loses network (ungraceful)
  → No more heartbeats received
  → Heartbeat Reaper (background job):
    → Runs every 30 seconds
    → ZRANGEBYSCORE page:{page_id}:viewers -inf (now - 90)
    → Remove stale entries, decrement count
    → Publish updated count
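One pass of the reaper over a single page might look like the sketch below. It resets the cached count from ZCARD rather than decrementing per removal, so a missed decrement cannot compound drift:

```python
def reap_page(r, page_id: str, now: float, stale_after: float = 90.0) -> int:
    """Remove viewers whose last heartbeat is older than `stale_after` seconds.
    Returns the number of stale viewers removed. `r` is a redis-py-style client."""
    vkey = f"page:{page_id}:viewers"
    cutoff = now - stale_after

    removed = r.zremrangebyscore(vkey, "-inf", cutoff)
    if removed:
        # Reset from the set itself (source of truth) instead of decrementing
        count = r.zcard(vkey)
        r.set(f"page:{page_id}:count", count)
        r.publish(f"page_count:{page_id}", count)
    return removed
```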

Count Broadcast Flow

Redis Pub/Sub channel: page_count:{page_id}
  → When count changes, PUBLISH to this channel
  → Each WS Server subscribes to channels for pages with local viewers
  → On receiving count update:
    → Push { "type": "viewer_count", "count": N } to all local WebSocket
      connections viewing that page
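On the WS-server side, handling a count-change notification is a local fan-out. A sketch, where `local_conns` (an assumed in-memory index) maps page_id to that server's open sockets:

```python
import json

def fan_out(local_conns: dict, page_id: str, count: int) -> int:
    """Push a viewer_count update to every local connection viewing page_id.
    `local_conns` maps page_id -> list of socket-like objects with .send()."""
    msg = json.dumps({"type": "viewer_count", "count": count})
    conns = local_conns.get(page_id, [])
    for ws in conns:
        ws.send(msg)
    return len(conns)
```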

Components

  1. WebSocket Connection Servers (50 servers): Maintain persistent connections. Handle heartbeats, join/leave events. Subscribe to Redis Pub/Sub for count updates.
  2. Redis Cluster (3 shards, replicated): Stores viewer sets, counts, connection registry. Provides Pub/Sub for cross-server notifications.
  3. Heartbeat Reaper Service: Background job scanning for stale viewers. Runs on a schedule (every 30 seconds). Scans sorted sets for expired entries.
  4. Count Query API: Stateless HTTP service for non-real-time queries. Reads directly from Redis. Used by analytics dashboards.
  5. Analytics Pipeline: Kafka consumer logs peak counts, unique viewer counts (HyperLogLog reads) to a data warehouse for historical analysis.

Architecture Diagram

Clients (10M)
  → Load Balancer (sticky by viewer_id hash)
    → WebSocket Servers (50 instances)
      → Redis Cluster
        ├── Viewer Sets (Sorted Sets)
        ├── Counts (Strings)
        ├── Connection Registry (Hashes)
        └── Pub/Sub (count change notifications)

Heartbeat Reaper (3 instances, pages partitioned by hash)
  → Scans Redis sorted sets for stale entries
  → Decrements counts and publishes updates

Analytics Pipeline
  → Periodically snapshots counts to Kafka → Data Warehouse

6. Deep Dives (15 min)

Deep Dive 1: Presence Detection — Heartbeat vs. WebSocket State

Option A: Rely purely on WebSocket connection state

  • If WebSocket is connected → viewer is present
  • If WebSocket disconnects → viewer is gone
  • Problem: WebSocket disconnections are unreliable. TCP half-open connections can linger for minutes. Mobile devices may keep connections open while the app is backgrounded. Load balancer timeouts may silently drop connections without notifying the server.
  • Result: Phantom viewers (count inflated)

Option B: Heartbeat-based presence (chosen approach)

Client sends heartbeat every 30 seconds:
  { "type": "heartbeat" }

Server updates the viewer's score in the sorted set:
  ZADD page:{page_id}:viewers {current_timestamp} {viewer_id}

Reaper removes stale entries:
  ZRANGEBYSCORE page:{page_id}:viewers -inf {now - 90}
  → Any viewer who hasn't sent a heartbeat in 90 seconds is removed

Why 90 seconds (3 missed heartbeats)?
  - 1 missed heartbeat: likely a temporary network hiccup
  - 2 missed heartbeats: probably still a transient issue
  - 3 missed heartbeats (90 sec): viewer is almost certainly gone
  - This gives a 60-120 second detection lag for ungraceful departures
    (up to 90 seconds since the last heartbeat, plus up to one 30-second
    reaper interval)

Optimizing the reaper:

  • Don’t scan all 5M pages every 30 seconds
  • Maintain an active pages set: only pages with at least 1 viewer
  • Further optimization: partition pages across reaper instances by hash
  • Each reaper instance owns a slice of pages (consistent hashing)
  • Scan cost: ZRANGEBYSCORE is O(log N + M) where M is the number of expired entries
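Partitioning pages across reaper instances can be as simple as stable hashing. Modulo hashing is shown here for brevity; true consistent hashing only matters if instances are added and removed frequently:

```python
import hashlib

def reaper_owner(page_id: str, num_reapers: int) -> int:
    """Deterministically assign each page to exactly one reaper instance."""
    digest = hashlib.sha1(page_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reapers

def my_pages(active_pages, instance_id: int, num_reapers: int):
    """The slice of active pages this reaper instance is responsible for."""
    return [p for p in active_pages
            if reaper_owner(p, num_reapers) == instance_id]
```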

Tab/window deduplication:

Scenario: User has 3 tabs open to the same product page
  - Each tab has its own WebSocket connection
  - All 3 connections share the same viewer_id (user_id)
  - ZADD is idempotent on the same member: only 1 entry in the sorted set
  - Heartbeat from any tab refreshes the timestamp
  - Count = ZCARD (number of unique members) = 1, not 3

For anonymous users:
  - Generate a session_id stored in a cookie
  - Multiple tabs share the same cookie → same session_id → counted as 1

Deep Dive 2: Scaling Hot Pages (500K Concurrent Viewers)

A trending product page has 500K viewers. The count changes frequently (viewers joining/leaving constantly). Broadcasting every count change to 500K WebSocket connections is expensive.

Problem breakdown:

  • At 5% churn/min: 25K join/leave events per minute = ~400 events/sec
  • Each event triggers a Pub/Sub message to all WS Servers with viewers on this page
  • Each WS Server pushes to thousands of local connections

Solution 1: Throttle count updates

Instead of broadcasting every individual change:
  - Buffer count changes for 3-5 seconds
  - Send a single update with the latest count
  - Implementation: WS Server maintains a per-page timer
    On count change:
      if timer not active:
        start 3-second timer
      on timer fire:
        push latest count to all local connections for this page
        reset timer

Result: Max 1 push per page per 3 seconds per server, regardless of churn rate.
For a 500K-viewer page spread across 50 servers, that is at most 50 throttled
pushes every 3 seconds instead of ~400 broadcast events per second. Trivial.
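The per-page timer can be made deterministic (and easy to test) by driving it from a periodic tick instead of wall-clock timers. A sketch:

```python
class CountThrottler:
    """Coalesce count changes: at most one push per page per interval."""

    def __init__(self, push_fn, interval: float = 3.0):
        self.push_fn = push_fn      # called as push_fn(page_id, count)
        self.interval = interval
        self.pending = {}           # page_id -> latest count seen
        self.fire_at = {}           # page_id -> deadline of the active timer

    def on_count_change(self, page_id: str, count: int, now: float):
        """Record the latest count; start a timer if none is active."""
        self.pending[page_id] = count
        self.fire_at.setdefault(page_id, now + self.interval)

    def tick(self, now: float):
        """Fire expired timers, pushing only the latest count per page."""
        fired = []
        for page_id, deadline in list(self.fire_at.items()):
            if now >= deadline:
                self.push_fn(page_id, self.pending.pop(page_id))
                del self.fire_at[page_id]
                fired.append(page_id)
        return fired
```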

Solution 2: Approximate counting for mega-pages

For pages with > 10K viewers, relax exactness:
  - Option A: Keep the full sorted set. ZCARD is O(1) in Redis, and a sorted
    set holds ~1M members comfortably, so this covers all but the very
    largest pages
  - Option B: Beyond ~1M members, skip the central set entirely and use
    batched server-local counting:
    Each WS Server maintains a local count of its viewers for that page
    Periodically (every 5 sec), publish the local delta to Redis
    Global count = sum of all server-local counts

Display: "~500K viewing" (round to nearest 100 or 1K for mega-pages)
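Option B's server-local counting can be sketched with a Redis hash of absolute per-server counts (an assumption here: absolute reports rather than deltas, since absolute values are self-healing where a lost delta message would drift):

```python
def report_local_count(r, page_id: str, server_id: str, local_count: int):
    """Each WS Server periodically reports its own viewer count for the page."""
    r.hset(f"page:{page_id}:server_counts", server_id, local_count)

def global_count(r, page_id: str) -> int:
    """Global count = sum of all server-local counts."""
    counts = r.hgetall(f"page:{page_id}:server_counts")
    return sum(int(v) for v in counts.values())
```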

Solution 3: Edge-level aggregation

For truly massive pages (1M+ viewers):
  - Use CDN edge servers as WebSocket proxies
  - Each edge maintains local viewer count
  - Edges report to origin every 5 seconds
  - Origin sums all edge counts → publishes back to edges
  - Edge pushes updated global count to local viewers
  - Reduces origin fan-out from 1M to ~200 edge PoPs

Deep Dive 3: Server Crashes and Count Accuracy

Problem: A WS Server holding 200K connections crashes. Those viewers are gone, but their entries remain in Redis sorted sets. The viewer count is now inflated by up to 200K.

Recovery mechanism:

1. Heartbeat reaper (standard path):
   - Crashed server's connections stop sending heartbeats
   - After 90 seconds, reaper removes all stale entries
   - Count self-corrects within ~90 seconds

2. Fast recovery (server crash detection):
   - Each WS Server maintains a liveness key: SET server:{server_id}:alive 1 EX 60
     (Redis sets have no per-member TTL, so use one key per server, or a
      sorted set scored by last refresh time)
   - Server refreshes its key every 30 seconds
   - Crash detector watches for servers whose liveness key has expired
   - If a server disappears:
     → Query connection registry: all conn:{*} with server_id = crashed_server
     → Bulk remove those viewers from page sorted sets
     → Recount and publish corrected counts
   - Recovery time: ~60 seconds instead of 90

3. Consistency check (periodic, background):
   - Every 5 minutes, for each active page:
     Recalculate count from ZCARD (ignoring cached counter)
     If |ZCARD - cached_count| > threshold → reset cached count to ZCARD
   - Catches any drift caused by race conditions or missed decrements
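One way to implement the fast-recovery liveness check is a sorted set scored by last refresh time, so expired servers fall out of a range query. A sketch:

```python
def refresh_liveness(r, server_id: str, now: float):
    """Each WS Server calls this every 30 seconds."""
    r.zadd("active_servers", {server_id: now})

def crashed_servers(r, now: float, ttl: float = 60.0):
    """Servers that haven't refreshed within `ttl` seconds are presumed dead."""
    return r.zrangebyscore("active_servers", "-inf", now - ttl)
```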

Preventing count drift from race conditions:

Increment/decrement race:
  Thread A: ZREM viewer_1 → success → DECR count → count = 46
  Thread B: ZADD viewer_2 → success → INCR count → count = 47
  Actual ZCARD = 47 ✓ (each individual Redis command is atomic, so this
  interleaving converges; note that a ZREM+DECR pair is not atomic as a
  whole, so wrap the pair in a Lua script or MULTI/EXEC to close that gap)

But what if INCR/DECR gets lost (network error)?
  - The counter drifts from the actual set cardinality
  - Fix: use ZCARD as the source of truth, counter is just a fast cache
  - Periodically: SET page:{page_id}:count (ZCARD page:{page_id}:viewers)
  - For hot pages (>1000 viewers), run this every 30 seconds
  - For cold pages (<10 viewers), run this every 5 minutes
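The periodic reconciliation is a few lines: ZCARD is the source of truth and the counter is just a cache. A sketch:

```python
def reconcile_count(r, page_id: str, threshold: int = 2) -> bool:
    """Reset the cached counter from set cardinality if it has drifted.
    Returns True if a correction was made. `r` is a redis-py-style client."""
    actual = r.zcard(f"page:{page_id}:viewers")
    cached = int(r.get(f"page:{page_id}:count") or 0)
    if abs(actual - cached) > threshold:
        r.set(f"page:{page_id}:count", actual)
        r.publish(f"page_count:{page_id}", actual)
        return True
    return False
```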

7. Extensions (2 min)

  • Historical peak tracking: Record peak concurrent viewers per page per hour in a time-series DB (InfluxDB/TimescaleDB). Display “Peak: 12,847 viewers today” on the page. Useful for product teams to correlate traffic with marketing campaigns.
  • Viewer breakdown by source: Track referrer alongside viewer_id. Show “23 from Google, 15 from Twitter, 9 direct.” Store referrer in the connection hash and aggregate on demand.
  • Anonymous vs. logged-in viewer counts: Display “47 viewers (12 signed in)” by partitioning the sorted set or using two sets. Logged-in viewers may see additional social features (“3 of your friends are viewing this”).
  • Cross-page presence: Extend beyond single-page counts to “342 people shopping right now” across a category. Aggregate counts from all pages in a category using a Redis set of active pages per category.
  • Heatmap integration: Beyond count, track which section of the page users are scrolling to. Use periodic scroll-position reports (every 10 sec) to build a real-time engagement heatmap. Useful for A/B testing and UX optimization.