1. Requirements & Scope (5 min)
Functional Requirements
- Display a real-time count of users currently viewing a specific page (e.g., “47 people viewing this page”)
- Count updates within 5 seconds of a viewer joining or leaving
- Track unique viewers — refreshing the page or opening multiple tabs from the same user should count as 1 viewer
- Support any page on the platform (product pages, articles, dashboards) — millions of distinct page IDs
- Provide an API to query current viewer count for any page (for analytics dashboards)
Non-Functional Requirements
- Availability: 99.9% — viewer counts are informational, not business-critical. Showing a stale count briefly is acceptable.
- Latency: Count updates pushed to viewers within 5 seconds. API queries return in < 50ms.
- Consistency: Approximate counts are fine. Off by 2-3 viewers is acceptable. Off by 50% is not.
- Scale: 10M concurrent users across 5M distinct pages. Average page: 2 viewers. Hot pages (trending product, breaking news): 500K+ viewers.
- Durability: Viewer counts are ephemeral — no need to persist historical real-time counts. However, log peak counts for analytics.
2. Estimation (3 min)
Connections
- 10M concurrent WebSocket connections
- Each connection: ~10 KB memory → 100 GB total connection memory
- At 200K connections per server → 50 WebSocket servers
Heartbeat Traffic
- Each client sends a heartbeat every 30 seconds
- 10M clients / 30 sec = 333K heartbeats/sec
- Each heartbeat: ~100 bytes → 33 MB/sec inbound
Count Update Traffic
- When a viewer joins/leaves, update the count for that page
- Churn rate: ~5% of viewers change pages per minute → 500K page changes per minute → ~8,300/sec (each change is a leave on one page plus a join on another)
- Each event triggers a count update pushed to all viewers of that page
- Average page: 2 viewers, hot pages: thousands
- Estimated push volume: ~50K count update messages/sec
Storage
- Active page viewer sets: 5M pages × ~100 bytes per entry × avg 2 viewers = 1 GB in Redis
- Hot pages with 500K viewers: a single sorted set with 500K members = ~50 MB. Manageable.
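The estimates above can be sanity-checked with a few lines of arithmetic (the per-connection and per-message byte sizes are the assumptions stated in this section, using decimal units):

```python
# Back-of-envelope check for the estimates above. All constants are the
# assumptions stated in this section, not measured values.
CONCURRENT_USERS = 10_000_000
CONN_MEMORY_BYTES = 10_000          # ~10 KB per WebSocket connection
CONNS_PER_SERVER = 200_000
HEARTBEAT_INTERVAL_S = 30
HEARTBEAT_BYTES = 100
CHURN_PER_MINUTE = 0.05             # 5% of viewers change pages per minute

conn_memory_gb = CONCURRENT_USERS * CONN_MEMORY_BYTES / 1e9
ws_servers = CONCURRENT_USERS // CONNS_PER_SERVER
heartbeats_per_sec = CONCURRENT_USERS / HEARTBEAT_INTERVAL_S
heartbeat_mb_per_sec = heartbeats_per_sec * HEARTBEAT_BYTES / 1e6
events_per_sec = CONCURRENT_USERS * CHURN_PER_MINUTE / 60

print(f"{conn_memory_gb:.0f} GB connection memory across {ws_servers} servers")
print(f"{heartbeats_per_sec:,.0f} heartbeats/s = {heartbeat_mb_per_sec:.0f} MB/s inbound")
print(f"{events_per_sec:,.0f} join/leave events/s")
```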
3. API Design (3 min)
REST Endpoints
// Get current viewer count for a page
GET /v1/pages/{page_id}/viewers/count
Response 200: {
"page_id": "product_12345",
"viewer_count": 47,
"updated_at": "2026-02-22T10:00:05Z"
}
// Get viewer counts for multiple pages (batch)
POST /v1/pages/viewers/count
Body: { "page_ids": ["product_12345", "article_678", ...] }
Response 200: {
"counts": {
"product_12345": 47,
"article_678": 1203
}
}
WebSocket Protocol
// Client connects and subscribes to a page
WS /v1/pages/{page_id}/viewers
// Client → Server: heartbeat (every 30 seconds)
{ "type": "heartbeat" }
// Server → Client: viewer count update
{ "type": "viewer_count", "count": 48 }
// Client navigates to a different page
// → Send subscribe/unsubscribe on the existing connection (preferred; closing and reopening the WebSocket per page also works, but adds handshake overhead)
{ "type": "subscribe", "page_id": "new_page_456" }
{ "type": "unsubscribe", "page_id": "product_12345" }
Key Decisions
- Use WebSocket for real-time push rather than polling (polling 10M clients every 5 seconds = 2M req/sec just for counts)
- Multiplexed WebSocket: single connection per client, subscribe/unsubscribe to pages as they navigate. Avoids reconnection overhead.
- Heartbeat is mandatory. If no heartbeat received in 90 seconds (3 missed), the server considers the viewer gone.
4. Data Model (3 min)
Active Viewer Set (Redis)
// Set of active viewers per page
Key: page:{page_id}:viewers
Type: Sorted Set
Member: viewer_id (user_id or session_id for anonymous users)
Score: last_heartbeat_timestamp
// Current count (cached, updated on join/leave)
Key: page:{page_id}:count
Type: String (integer)
Connection Registry (Redis)
// Which pages a connection is viewing
Key: conn:{connection_id}
Type: Hash
Fields:
viewer_id | varchar
page_id | varchar
server_id | varchar
connected_at | timestamp
last_heartbeat | timestamp
TTL: 120 seconds (auto-cleanup if server crashes)
Unique Viewer Tracking (Redis HyperLogLog)
// Approximate unique viewers (for analytics, not real-time count)
Key: page:{page_id}:unique_viewers:{date}
Type: HyperLogLog
Operation: PFADD on each new viewer
Why Redis?
- All data is ephemeral (viewer presence is transient)
- Sub-millisecond reads/writes for count lookups and heartbeat updates
- Sorted sets enable efficient expiry scanning (remove viewers with old heartbeats)
- Pub/Sub for cross-server count change notifications
- HyperLogLog for memory-efficient unique counting (12 KB per counter regardless of cardinality)
5. High-Level Design (12 min)
Viewer Join Flow
Client opens page
→ WebSocket Connection Server
→ 1. Authenticate (extract user_id or generate session_id)
→ 2. Deduplicate: Check if this viewer_id already exists in page's viewer set
ZSCORE page:{page_id}:viewers {viewer_id}
If exists → update heartbeat timestamp, do NOT increment count
If new → ZADD page:{page_id}:viewers {now} {viewer_id}
INCR page:{page_id}:count
→ 3. Register connection: HSET conn:{connection_id} ...
→ 4. Publish count change: PUBLISH page_count:{page_id} {new_count}
→ 5. Return current count to client
Heartbeat Flow
Client sends heartbeat every 30 seconds
→ WS Server receives heartbeat
→ ZADD page:{page_id}:viewers {now} {viewer_id} (update score = timestamp)
→ EXPIRE conn:{connection_id} 120 (refresh TTL)
Viewer Leave Flow
Case 1: Client navigates away (graceful close)
→ Client sends unsubscribe or closes WebSocket
→ WS Server:
→ ZREM page:{page_id}:viewers {viewer_id}
→ DECR page:{page_id}:count
→ PUBLISH page_count:{page_id} {new_count}
→ DEL conn:{connection_id}
Case 2: Client crashes / loses network (ungraceful)
→ No more heartbeats received
→ Heartbeat Reaper (background job):
→ Runs every 30 seconds
→ ZRANGEBYSCORE page:{page_id}:viewers -inf {now - 90} to find stale viewers
→ ZREM the stale members, DECRBY the count by the number removed
→ Publish updated count
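The reaper in Case 2 can be sketched against the same model (a dict of viewer_id → last-heartbeat score stands in for the Redis sorted set; ZRANGEBYSCORE becomes a filter, ZREM a delete):

```python
# Sketch of the heartbeat reaper for one page. The 90 s cutoff matches the
# 3-missed-heartbeats rule established earlier.
STALE_AFTER_S = 90.0

def reap_page(viewers: dict[str, float], count: int, now: float) -> tuple[int, list[str]]:
    """Remove viewers whose last heartbeat is older than the cutoff.

    Returns the corrected count and the removed viewer_ids, so the caller
    can DECRBY the cached count and publish the update."""
    stale = [v for v, ts in viewers.items() if ts < now - STALE_AFTER_S]
    for v in stale:
        del viewers[v]             # ZREM equivalent
    return count - len(stale), stale

page = {"user_1": 1000.0, "user_2": 1085.0, "user_3": 900.0}
count, removed = reap_page(page, count=3, now=1100.0)
assert count == 1 and sorted(removed) == ["user_1", "user_3"]
```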
Count Broadcast Flow
Redis Pub/Sub channel: page_count:{page_id}
→ When count changes, PUBLISH to this channel
→ Each WS Server subscribes to channels for pages with local viewers
→ On receiving count update:
→ Push { "type": "viewer_count", "count": N } to all local WebSocket
connections viewing that page
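The two-stage fan-out above (Pub/Sub delivery to servers, then local push to connections) can be modeled in a few lines; the structures here are a toy in-memory simulation, not the real Pub/Sub client:

```python
from collections import defaultdict

# Toy model of the broadcast path: Redis Pub/Sub delivers one message per
# subscribed WS server; each server then fans out to its local connections.
subscriptions = defaultdict(set)    # page_id -> {server_id}
local_conns = defaultdict(list)     # (server_id, page_id) -> [connection ids]
pushed = []                         # (connection, message) pairs, for inspection

def subscribe(server_id, page_id, conn_id):
    subscriptions[page_id].add(server_id)
    local_conns[(server_id, page_id)].append(conn_id)

def publish_count(page_id, count):
    msg = {"type": "viewer_count", "count": count}
    for server_id in subscriptions[page_id]:            # Pub/Sub delivery
        for conn in local_conns[(server_id, page_id)]:  # local WS fan-out
            pushed.append((conn, msg))

subscribe("ws-1", "product_12345", "c1")
subscribe("ws-1", "product_12345", "c2")
subscribe("ws-2", "product_12345", "c3")
publish_count("product_12345", 48)
assert len(pushed) == 3             # every viewer of the page gets the update
```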
Components
- WebSocket Connection Servers (50 servers): Maintain persistent connections. Handle heartbeats, join/leave events. Subscribe to Redis Pub/Sub for count updates.
- Redis Cluster (3 shards, replicated): Stores viewer sets, counts, connection registry. Provides Pub/Sub for cross-server notifications.
- Heartbeat Reaper Service: Background job scanning for stale viewers. Runs on a schedule (every 30 seconds). Scans sorted sets for expired entries.
- Count Query API: Stateless HTTP service for non-real-time queries. Reads directly from Redis. Used by analytics dashboards.
- Analytics Pipeline: Kafka consumer logs peak counts, unique viewer counts (HyperLogLog reads) to a data warehouse for historical analysis.
Architecture Diagram
Clients (10M)
→ Load Balancer (sticky by viewer_id hash)
→ WebSocket Servers (50 instances)
→ Redis Cluster
├── Viewer Sets (Sorted Sets)
├── Counts (Strings)
├── Connection Registry (Hashes)
└── Pub/Sub (count change notifications)
Heartbeat Reaper (3 instances, leader-elected)
→ Scans Redis sorted sets for stale entries
→ Decrements counts and publishes updates
Analytics Pipeline
→ Periodically snapshots counts to Kafka → Data Warehouse
6. Deep Dives (15 min)
Deep Dive 1: Presence Detection — Heartbeat vs. WebSocket State
Option A: Rely purely on WebSocket connection state
- If WebSocket is connected → viewer is present
- If WebSocket disconnects → viewer is gone
- Problem: WebSocket disconnections are unreliable. TCP half-open connections can linger for minutes. Mobile devices may keep connections open while the app is backgrounded. Load balancer timeouts may silently drop connections without notifying the server.
- Result: Phantom viewers (count inflated)
Option B: Heartbeat-based presence (chosen approach)
Client sends heartbeat every 30 seconds:
{ "type": "heartbeat" }
Server updates the viewer's score in the sorted set:
ZADD page:{page_id}:viewers {current_timestamp} {viewer_id}
Reaper finds and removes stale entries:
ZRANGEBYSCORE page:{page_id}:viewers -inf {now - 90} → then ZREM the results (ZRANGEBYSCORE only reads; ZREMRANGEBYSCORE combines find-and-remove when the member list isn't needed)
→ Any viewer who hasn't sent a heartbeat in 90 seconds is removed
Why 90 seconds (3 missed heartbeats)?
- 1 missed heartbeat: likely a temporary network hiccup
- 2 missed heartbeats: probably still a transient issue
- 3 missed heartbeats (90 sec): viewer is almost certainly gone
- This gives a roughly 60-120 second lag for ungraceful departures (up to 90 seconds from the last heartbeat, plus up to 30 seconds until the next reaper pass)
Optimizing the reaper:
- Don’t scan all 5M pages every 30 seconds
- Maintain an active pages set: only pages with at least 1 viewer
- Further optimization: partition pages across reaper instances by hash
- Each reaper instance owns a slice of pages (consistent hashing)
- Scan cost: ZRANGEBYSCORE is O(log N + M) where M is the number of expired entries
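The page-partitioning step can be sketched with simple hash-mod assignment (a real deployment would use consistent hashing, as noted above, so that adding or removing a reaper instance only reshuffles a fraction of the pages):

```python
import hashlib

# Sketch of partitioning active pages across reaper instances: each
# instance owns the pages that hash into its slot, so no page is scanned
# twice and no single instance scans them all.
NUM_REAPERS = 3

def owner(page_id: str) -> int:
    """Map a page to the reaper instance responsible for it."""
    digest = hashlib.sha1(page_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_REAPERS

pages = [f"page_{i}" for i in range(1000)]
partitions = [[p for p in pages if owner(p) == r] for r in range(NUM_REAPERS)]
assert sum(len(part) for part in partitions) == len(pages)  # full coverage
assert all(partitions)                                      # every reaper has work
```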
Tab/window deduplication:
Scenario: User has 3 tabs open to the same product page
- Each tab has its own WebSocket connection
- All 3 connections share the same viewer_id (user_id)
- ZADD is idempotent on the same member: only 1 entry in the sorted set
- Heartbeat from any tab refreshes the timestamp
- Count = ZCARD (number of unique members) = 1, not 3
For anonymous users:
- Generate a session_id stored in a cookie
- Multiple tabs share the same cookie → same session_id → counted as 1
Deep Dive 2: Scaling Hot Pages (500K Concurrent Viewers)
A trending product page has 500K viewers. The count changes frequently (viewers joining/leaving constantly). Broadcasting every count change to 500K WebSocket connections is expensive.
Problem breakdown:
- At 5% churn/min: 25K join/leave events per minute = ~400 events/sec
- Each event triggers a Pub/Sub message to all WS Servers with viewers on this page
- Each WS Server pushes to thousands of local connections
Solution 1: Throttle count updates
Instead of broadcasting every individual change:
- Buffer count changes for 3-5 seconds
- Send a single update with the latest count
- Implementation: WS Server maintains a per-page timer
On count change:
if timer not active:
start 3-second timer
on timer fire:
push latest count to all local connections for this page
reset timer
Result: Max 1 push per page per 3 seconds, regardless of churn rate.
At 500K viewers across 50 servers → 50 Pub/Sub messages every 3 sec = trivial
Solution 2: Cheaper counting for mega-pages
For pages with > 10K viewers, reconsider how the count is maintained:
- A full sorted set gets expensive to maintain at 500K members, even though reads stay cheap
- Option A: Keep the sorted set and read ZCARD — it's O(1) in Redis, and a single sorted set stays manageable up to ~1M members
- Option B: Beyond ~1M viewers, drop the global per-page set and batch increments:
Each WS Server maintains a local count of its viewers for that page
Periodically (every 5 sec), publish local delta to Redis
Global count = sum of all server-local counts
Display: "~500K viewing" (round to nearest 100 or 1K for mega-pages)
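Option B's delta-flush scheme can be sketched in a few lines (dicts stand in for the Redis counters; the flush step would be an INCRBY per page in production):

```python
from collections import defaultdict

# Sketch of batched, server-local counting for mega-pages: each WS server
# tracks its own joins/leaves and periodically flushes a delta; the global
# count is the sum of everything flushed.
global_count = defaultdict(int)             # page_id -> aggregated count

class ServerLocalCounter:
    def __init__(self):
        self.local = defaultdict(int)       # page_id -> unflushed delta

    def on_join(self, page_id): self.local[page_id] += 1
    def on_leave(self, page_id): self.local[page_id] -= 1

    def flush(self):
        """Every ~5 s: publish local deltas and reset (INCRBY per page)."""
        for page_id, delta in self.local.items():
            global_count[page_id] += delta
        self.local.clear()

a, b = ServerLocalCounter(), ServerLocalCounter()
for _ in range(3): a.on_join("hot_page")
b.on_join("hot_page"); b.on_leave("hot_page"); b.on_join("hot_page")
a.flush(); b.flush()
assert global_count["hot_page"] == 4        # 3 from server a, net +1 from server b
```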
Solution 3: Edge-level aggregation
For truly massive pages (1M+ viewers):
- Use CDN edge servers as WebSocket proxies
- Each edge maintains local viewer count
- Edges report to origin every 5 seconds
- Origin sums all edge counts → publishes back to edges
- Edge pushes updated global count to local viewers
- Reduces origin fan-out from 1M to ~200 edge PoPs
Deep Dive 3: Server Crashes and Count Accuracy
Problem: A WS Server holding 200K connections crashes. Those viewers are gone, but their entries remain in Redis sorted sets. The viewer count is now inflated by up to 200K.
Recovery mechanism:
1. Heartbeat reaper (standard path):
- Crashed server's connections stop sending heartbeats
- After 90 seconds, reaper removes all stale entries
- Count self-corrects within ~90 seconds
2. Fast recovery (server crash detection):
- Each WS Server registers a liveness key: SET server:{server_id}:alive 1 EX 60 (Redis sets don't support per-member TTLs, so use one key per server)
- Server refreshes the key every 30 seconds
- Crash detector monitors these liveness keys
- If a server's key expires:
→ Look up its connections via a per-server index set maintained at connect time (e.g. SADD server:{server_id}:conns {connection_id}) — the conn:{connection_id} hashes can't be efficiently scanned by server_id
→ Bulk remove those viewers from page sorted sets
→ Recount and publish corrected counts
- Recovery time: ~60 seconds instead of 90
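The bulk-removal step can be sketched as follows, assuming an index of which connections lived on the crashed server is maintained at connect time (dicts stand in for the Redis registry and sorted sets):

```python
# Sketch of fast crash recovery: remove every viewer whose connection was
# on the dead server, then recount affected pages from set cardinality.
def recover_crashed_server(
    server_conns: set[str],                # conn ids on the dead server
    conn_registry: dict[str, dict],        # conn_id -> {viewer_id, page_id}
    viewers: dict[str, dict[str, float]],  # page_id -> {viewer_id: hb_ts}
) -> dict[str, int]:
    """Return corrected counts for every affected page."""
    affected = set()
    for conn_id in server_conns:
        entry = conn_registry.pop(conn_id, None)  # DEL conn:{connection_id}
        if entry is None:
            continue
        viewers.get(entry["page_id"], {}).pop(entry["viewer_id"], None)  # ZREM
        affected.add(entry["page_id"])
    return {p: len(viewers.get(p, {})) for p in affected}  # recount = ZCARD

registry = {"c1": {"viewer_id": "u1", "page_id": "p1"},
            "c2": {"viewer_id": "u2", "page_id": "p1"}}
pages = {"p1": {"u1": 0.0, "u2": 0.0, "u3": 0.0}}
assert recover_crashed_server({"c1", "c2"}, registry, pages) == {"p1": 1}
```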
3. Consistency check (periodic, background):
- Every 5 minutes, for each active page:
Recalculate count from ZCARD (ignoring cached counter)
If |ZCARD - cached_count| > threshold → reset cached count to ZCARD
- Catches any drift caused by race conditions or missed decrements
Preventing count drift from race conditions:
Increment/decrement race:
Thread A: ZREM viewer_1 → success → DECR count → count = 46
Thread B: ZADD viewer_2 → success → INCR count → count = 47
Actual ZCARD = 47 ✓ (no issue, both operations are atomic in Redis)
But what if INCR/DECR gets lost (network error)?
- The counter drifts from the actual set cardinality
- Fix: use ZCARD as the source of truth, counter is just a fast cache
- Periodically: read ZCARD page:{page_id}:viewers, then SET page:{page_id}:count to that value (two commands, or one small Lua script to make the read-then-set atomic)
- For hot pages (>1000 viewers), run this every 30 seconds
- For cold pages (<10 viewers), run this every 5 minutes
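The reconciliation rule above amounts to a small function (the threshold value is illustrative):

```python
# Sketch of the periodic consistency check: the cached counter is only a
# fast cache; the set cardinality (ZCARD) is the source of truth.
def reconcile(cached_count: int, actual_cardinality: int, threshold: int = 2) -> int:
    """Reset the cached count when it has drifted past the threshold."""
    if abs(actual_cardinality - cached_count) > threshold:
        return actual_cardinality       # SET count = ZCARD
    return cached_count                 # small drift: leave the fast path alone

assert reconcile(cached_count=47, actual_cardinality=47) == 47
assert reconcile(cached_count=47, actual_cardinality=46) == 47  # within tolerance
assert reconcile(cached_count=60, actual_cardinality=47) == 47  # drift: reset
```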
7. Extensions (2 min)
- Historical peak tracking: Record peak concurrent viewers per page per hour in a time-series DB (InfluxDB/TimescaleDB). Display “Peak: 12,847 viewers today” on the page. Useful for product teams to correlate traffic with marketing campaigns.
- Viewer breakdown by source: Track referrer alongside viewer_id. Show “23 from Google, 15 from Twitter, 9 direct.” Store referrer in the connection hash and aggregate on demand.
- Anonymous vs. logged-in viewer counts: Display “47 viewers (12 signed in)” by partitioning the sorted set or using two sets. Logged-in viewers may see additional social features (“3 of your friends are viewing this”).
- Cross-page presence: Extend beyond single-page counts to “342 people shopping right now” across a category. Aggregate counts from all pages in a category using a Redis set of active pages per category.
- Heatmap integration: Beyond count, track which section of the page users are scrolling to. Use periodic scroll-position reports (every 10 sec) to build a real-time engagement heatmap. Useful for A/B testing and UX optimization.