1. Requirements & Scope (5 min)

Functional Requirements

  1. Enforce a maximum number of concurrent streams per account (e.g., Basic = 1, Standard = 2, Premium = 4)
  2. Track active streams in real time via heartbeat-based session monitoring — a stream is “active” if a heartbeat was received within the last 60 seconds
  3. When the concurrent limit is reached, deny new stream requests with a clear error message (“Too many screens. Stop watching on another device to continue.”)
  4. Handle edge cases: device crashes (no explicit stop), network interruptions, and simultaneous stream starts from multiple devices (race conditions)
  5. Allow users to see active sessions and forcefully terminate a session from another device

Non-Functional Requirements

  • Availability: 99.99% — blocking a paying customer from watching is worse than allowing one extra stream temporarily. Fail-open bias.
  • Latency: Stream start authorization < 100ms. Heartbeat processing < 50ms.
  • Consistency: Strong consistency for stream count enforcement. Two devices simultaneously starting stream #3 on a 2-stream plan must not both succeed.
  • Scale: 200M subscribers, 50M concurrent streams at peak, ~100M heartbeats/minute (each stream sends a heartbeat every 30 seconds)
  • Fault tolerance: Single Redis node failure must not cause mass stream denials. Graceful degradation.

2. Estimation (3 min)

Traffic

  • 50M concurrent streams at peak
  • Each stream sends heartbeat every 30 seconds → 50M / 30 = 1.67M heartbeats/sec
  • Stream starts: average stream duration = 45 minutes → 50M / (45 × 60 s) ≈ 18.5K stream starts/sec
  • Stream stops (explicit): ~70% of sessions end cleanly → ~13K stops/sec
  • Stream authorization checks (on start + periodic re-check every 5 min): 18.5K starts + 50M / 300 s ≈ 167K re-checks ≈ 185K checks/sec

Storage

  • Active session record: account_id (8B) + session_id (16B) + device_id (32B) + started_at (8B) + last_heartbeat (8B) + metadata (28B) = ~100 bytes
  • 50M concurrent sessions × 100 bytes = 5 GB — fits entirely in Redis
  • Account → plan mapping: 200M accounts × 50 bytes = 10 GB — cached in Redis, source of truth in PostgreSQL

Heartbeat Processing

  • 1.67M heartbeats/sec × 100 bytes each = 167 MB/sec of inbound traffic
  • Each heartbeat updates a single key in Redis: O(1) operation
  • Redis can handle 500K+ operations/sec per instance → need ~4 Redis shards minimum
  • With 10 Redis shards: 167K operations/sec each — comfortable headroom
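The back-of-envelope numbers above can be sanity-checked with a few lines of arithmetic. This is just the estimates restated as code; all constants come from the assumptions stated in this section:

```python
# Sanity-check of the capacity estimates (constants from the assumptions above).
CONCURRENT_STREAMS = 50_000_000
HEARTBEAT_INTERVAL_S = 30
AVG_STREAM_DURATION_S = 45 * 60
SESSION_RECORD_BYTES = 100
REDIS_OPS_PER_INSTANCE = 500_000

heartbeats_per_sec = CONCURRENT_STREAMS / HEARTBEAT_INTERVAL_S        # ~1.67M
stream_starts_per_sec = CONCURRENT_STREAMS / AVG_STREAM_DURATION_S    # ~18.5K
session_storage_gb = CONCURRENT_STREAMS * SESSION_RECORD_BYTES / 1e9  # 5 GB
# Ceiling division for the minimum shard count to absorb heartbeat load.
min_redis_shards = -(-heartbeats_per_sec // REDIS_OPS_PER_INSTANCE)   # 4
```

With 10 shards instead of the minimum 4, each shard sees roughly 167K ops/sec, which is the headroom figure quoted above.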

Key Insight

This is a distributed counting problem with strong consistency requirements. The core challenge is ensuring that the stream count for an account never exceeds the limit, even when multiple devices attempt to start streams simultaneously. This is complicated by the fact that segment-based HTTP playback gives the server no reliable end-of-stream signal — we must use heartbeats to infer liveness rather than relying on explicit stops.


3. API Design (3 min)

Start Stream

POST /streams/start
Body: {
  "account_id": "acc_123",
  "device_id": "device_abc",
  "device_info": {
    "type": "smart_tv",
    "name": "Living Room Samsung",
    "ip": "73.162.44.100"
  },
  "content_id": "movie_456",
  "session_token": "jwt_..."           // authenticated user session
}
Response 200: {
  "stream_id": "stream_xyz",
  "heartbeat_interval_seconds": 30,
  "heartbeat_url": "/streams/stream_xyz/heartbeat",
  "license_url": "/drm/license?stream_id=stream_xyz"
}

// When limit exceeded:
Response 403: {
  "error": "concurrent_limit_reached",
  "message": "Too many screens. Your plan allows 2 concurrent streams.",
  "active_streams": [
    { "stream_id": "stream_001", "device_name": "iPhone", "content": "Breaking Bad S3E5", "started_at": "..." },
    { "stream_id": "stream_002", "device_name": "iPad", "content": "Stranger Things S4E1", "started_at": "..." }
  ],
  "plan_limit": 2
}

Heartbeat

POST /streams/{stream_id}/heartbeat
Body: {
  "position_seconds": 1847,           // playback position for resume
  "bitrate": "1080p",
  "buffer_health": "good"
}
Response 200: { "continue": true }

// If session was force-stopped:
Response 410: { "error": "session_terminated", "reason": "User stopped from another device" }

Stop Stream

DELETE /streams/{stream_id}
Response 204 No Content

List Active Sessions

GET /accounts/{account_id}/active-streams
Response 200: {
  "plan_limit": 2,
  "active_streams": [
    { "stream_id": "stream_001", "device_name": "iPhone", "content_title": "Breaking Bad S3E5", "started_at": "..." },
    { "stream_id": "stream_002", "device_name": "iPad", "content_title": "Stranger Things S4E1", "started_at": "..." }
  ]
}

Force Stop

DELETE /accounts/{account_id}/streams/{stream_id}
Response 204 No Content
// Next heartbeat from that device returns 410 Gone

Key Decisions

  • Heartbeat interval = 30 seconds: Balances freshness (detect dead streams quickly) with server load. 60-second timeout = 2 missed heartbeats before declaring stream dead.
  • Stream ID returned on start: Client uses this server-generated, unguessable ID for all subsequent heartbeats and the stop call. It acts as a capability token — another device cannot spoof or replay heartbeats for a session it doesn't own.
  • 403 with active stream list: Lets the user decide which stream to stop. Better UX than a generic error.

4. Data Model (3 min)

Active Sessions (Redis)

Data structure: Hash per account + Set for enumeration

// Per-session data
Key: session:{stream_id}
Type: Hash
Fields: {
  "account_id": "acc_123",
  "device_id": "device_abc",
  "device_name": "Living Room Samsung",
  "content_id": "movie_456",
  "content_title": "Breaking Bad S3E5",
  "started_at": "1708632000",
  "last_heartbeat": "1708632900",
  "playback_position": "1847"
}
TTL: 90 seconds (auto-expire if no heartbeat refreshes it)

// Per-account session index
Key: account_sessions:{account_id}
Type: Sorted Set
Members: stream_id
Score: last_heartbeat_timestamp
// Sorted set allows: count active sessions, list sessions, expire old ones

Account Plan (Redis cache + PostgreSQL source)

// Redis cache
Key: plan:{account_id}
Value: { "max_streams": 2, "plan_name": "standard" }
TTL: 300 seconds (refresh from DB periodically)

// PostgreSQL (source of truth)
Table: accounts
  account_id          | varchar(50) (PK)
  plan_id             | int (FK)
  max_concurrent      | int             -- denormalized for fast reads
  email               | varchar(255)
  created_at          | timestamp

Session History (PostgreSQL, for analytics)

Table: stream_sessions (partitioned by started_at monthly)
  stream_id           | UUID (PK)
  account_id          | varchar(50)
  device_id           | varchar(200)
  content_id          | varchar(50)
  started_at          | timestamp
  ended_at            | timestamp
  end_reason          | enum('user_stop', 'heartbeat_timeout', 'force_stop', 'error')
  duration_seconds    | int
  device_type         | varchar(50)

Why Redis

  • Sub-millisecond reads and writes — critical for 100ms stream start SLA
  • TTL-based auto-expiry — dead sessions clean themselves up without explicit garbage collection
  • Sorted sets — efficiently count and enumerate active sessions per account
  • Lua scripts — atomic check-and-increment for race condition prevention (see Deep Dives)
  • Cluster mode — shard by account_id for horizontal scaling

5. High-Level Design (12 min)

Architecture

Client (TV, Phone, Tablet, Browser)
  → CDN / Load Balancer
  → Stream Authorization Service (stateless):
       1. Authenticate user (JWT validation)
       2. Look up account plan (Redis cache or PostgreSQL)
       3. Atomic check-and-add in Redis:
          IF count(account_sessions) < max_streams:
            ADD new session
            RETURN success
          ELSE:
            RETURN error with active session list
       4. If success: return stream_id + DRM license URL

  → Heartbeat Service (stateless):
       1. Validate stream_id
       2. Update last_heartbeat in Redis (refresh TTL)
       3. Check if session was force-terminated → return 410

  → Session Reaper (background job):
       1. Every 30 seconds: scan for sessions with last_heartbeat > 60s ago
       2. Remove stale sessions from account_sessions set
       3. Write session end record to PostgreSQL
       (Note: Redis TTL handles most cleanup, reaper is defense-in-depth)

Stream Start Flow (detailed)

1. Client → POST /streams/start { account_id, device_id, content_id }

2. Stream Auth Service:
   a. Validate JWT → extract account_id
   b. Redis: GET plan:{account_id} → { max_streams: 2 }
      (Cache miss → query PostgreSQL → cache in Redis for 5 min)

   c. Redis Lua script (ATOMIC):
      -- Remove expired members from sorted set
      ZREMRANGEBYSCORE account_sessions:{account_id} 0 (now - 60)

      -- Count active sessions
      local count = ZCARD account_sessions:{account_id}

      -- Check limit
      if count >= max_streams then
        return "DENIED"
      end

      -- Add new session
      ZADD account_sessions:{account_id} now stream_id
      HSET session:{stream_id} account_id ... device_id ... (etc)
      EXPIRE session:{stream_id} 90
      return "ALLOWED"

   d. If ALLOWED:
      → Generate stream_id
      → Return 200 with heartbeat instructions
      → Log session start to Kafka (for analytics)

   e. If DENIED:
      → Fetch active sessions: ZRANGE account_sessions:{account_id} 0 -1
      → For each: HGETALL session:{stream_id}
      → Return 403 with active session details

3. Client begins playback:
   → Requests DRM license (requires valid stream_id)
   → Starts heartbeat timer (every 30 seconds)

Heartbeat Flow

Every 30 seconds from each active client:

1. Client → POST /streams/{stream_id}/heartbeat { position }

2. Heartbeat Service:
   a. Check: EXISTS session:{stream_id}
      If not exists → return 410 (session expired or force-stopped)

   b. Update session:
      HSET session:{stream_id} last_heartbeat now playback_position position
      EXPIRE session:{stream_id} 90     // refresh TTL
      ZADD account_sessions:{account_id} now stream_id  // update score

   c. Return 200 { continue: true }
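The TTL semantics behind the heartbeat flow can be sketched in pure Python. This is a minimal in-memory stand-in for the Redis session hash, not production code — `InMemorySessionStore` and its method names are illustrative, and real deployments would use Redis `HSET`/`EXPIRE` as described above:

```python
import time

SESSION_TTL_S = 90  # refreshed on every heartbeat, matching the Redis TTL above

class InMemorySessionStore:
    """Pure-Python stand-in for the Redis session hash + TTL semantics."""

    def __init__(self):
        self._sessions = {}  # stream_id -> (fields dict, expires_at)

    def start(self, stream_id, fields, now=None):
        now = time.time() if now is None else now
        self._sessions[stream_id] = (dict(fields), now + SESSION_TTL_S)

    def heartbeat(self, stream_id, position, now=None):
        now = time.time() if now is None else now
        entry = self._sessions.get(stream_id)
        if entry is None or entry[1] <= now:
            # Expired or force-stopped: mirrors the 410 Gone response.
            self._sessions.pop(stream_id, None)
            return {"status": 410, "error": "session_terminated"}
        fields, _ = entry
        fields["last_heartbeat"] = now
        fields["playback_position"] = position
        # Refresh the TTL, as EXPIRE session:{stream_id} 90 does in Redis.
        self._sessions[stream_id] = (fields, now + SESSION_TTL_S)
        return {"status": 200, "continue": True}
```

For example, a session started at t=0 and heartbeated at t=30 expires at t=120; a heartbeat arriving at t=130 gets a 410 and the client must re-start the stream.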

Components

  1. Stream Authorization Service: Enforces concurrent limit on stream start. Uses Redis Lua for atomic check-and-add. Stateless, horizontally scalable.
  2. Heartbeat Service: High-throughput heartbeat processor. Updates Redis TTLs. Detects force-terminated sessions.
  3. Session Reaper: Background job that cleans up stale sessions missed by TTL. Writes session-end records to PostgreSQL.
  4. Redis Cluster: Sharded by account_id. Stores all active sessions and plan caches. The core of the system.
  5. PostgreSQL: Source of truth for account plans. Historical session records for analytics and billing.
  6. Kafka: Event bus for session start/stop events. Powers analytics, fraud detection, and billing.

6. Deep Dives (15 min)

Deep Dive 1: Race Condition on Simultaneous Stream Starts

The problem: User has 2-stream plan with 1 active stream. Two devices (TV and phone) simultaneously try to start a stream.

Without atomicity:
  TV:    READ count = 1; count < 2? YES → ADD session → count = 2 ✓
  Phone: READ count = 1; count < 2? YES → ADD session → count = 3 ✗

Both succeed! Account now has 3 streams on a 2-stream plan.

Solution: Redis Lua script for atomic check-and-add

-- KEYS[1] = account_sessions:{account_id}
-- ARGV[1] = current_timestamp
-- ARGV[2] = max_streams
-- ARGV[3] = stream_id
-- ARGV[4] = expiry_seconds (60)

local now    = tonumber(ARGV[1])
local limit  = tonumber(ARGV[2])
local expiry = tonumber(ARGV[4])

-- Step 1: Remove expired sessions
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - expiry)

-- Step 2: Count current active sessions
local count = redis.call('ZCARD', KEYS[1])

-- Step 3: Check limit
if count >= limit then
  return redis.error_reply('LIMIT_REACHED:' .. count)
end

-- Step 4: Add new session (atomic with the check)
redis.call('ZADD', KEYS[1], now, ARGV[3])
return count + 1

Why this works:

  • Redis is single-threaded — Lua scripts execute atomically
  • Between the ZCARD check and the ZADD, no other command can modify the set
  • Both TV and Phone send the Lua script simultaneously, but Redis executes them serially
  • First one succeeds (count goes to 2), second one fails (count = 2 ≥ max_streams)

What about Redis Cluster (sharded)?

  • All data for one account lives on the same shard (sharded by account_id)
  • Lua script runs locally on that shard → atomicity preserved
  • Cross-shard transactions not needed (each account is self-contained)
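The serialization argument above can be demonstrated without Redis. The sketch below models the Lua script in pure Python: a lock stands in for Redis's single-threaded command execution, so the check and the add happen as one atomic step even when two devices race. All names here are illustrative:

```python
import threading

def make_atomic_check_and_add(max_streams):
    """Models the atomic Lua script. The lock plays the role of Redis's
    single-threaded executor: one 'script' runs to completion at a time."""
    sessions = set()          # stands in for the account_sessions sorted set
    lock = threading.Lock()
    results = []

    def try_start(stream_id):
        with lock:  # check + add are atomic, exactly as in the Lua script
            if len(sessions) >= max_streams:
                results.append((stream_id, "DENIED"))
            else:
                sessions.add(stream_id)
                results.append((stream_id, "ALLOWED"))

    return try_start, sessions, results

try_start, sessions, results = make_atomic_check_and_add(max_streams=2)
sessions.add("existing")  # the account already has one active stream

# TV and phone race to start stream #2 at the same moment.
threads = [threading.Thread(target=try_start, args=(d,)) for d in ("tv", "phone")]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one of tv/phone is allowed; the account never exceeds 2 streams.
```

Whichever thread acquires the lock first sees count = 1 and is allowed; the loser then sees count = 2 and is denied — the non-atomic read-then-write failure mode cannot occur.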

Deep Dive 2: Handling Device Crashes and Network Issues

Scenario 1: Device crashes (no explicit stop signal)

Timeline:
  t=0:   Stream started on TV, heartbeat sent
  t=30:  Heartbeat sent
  t=32:  TV crashes (power outage)
  t=60:  Expected heartbeat — not received
  t=90:  Expected heartbeat — not received
  t=120: Redis key session:{stream_id} expires (its 90s TTL was last refreshed
         at the t=30 heartbeat)

The sorted-set entry can be reclaimed even sooner: whenever a new stream start
runs, the Lua script's ZREMRANGEBYSCORE prunes any session whose score is older
than (now - 60).

  t=92: User tries to start stream on phone.
  → Lua script runs ZREMRANGEBYSCORE, removing sessions with score < (now - 60)
  → TV session score = 30, now = 92, 30 < (92 - 60) = 32 → REMOVED
  → Phone stream allowed.

Effective dead-session detection time: 60-90 seconds after the last heartbeat.

Scenario 2: Temporary network interruption

  t=0:   Heartbeat sent
  t=30:  Network down — heartbeat fails
  t=35:  Network restored — video still playing locally from buffer
  t=60:  Heartbeat sent — succeeds!
  → Session stays alive (heartbeat arrived within 60-second timeout)

  If network is down for > 60 seconds:
  → Session expires
  → On next heartbeat, server returns 410 (session not found)
  → Client must re-start stream (POST /streams/start)
  → If under limit: succeeds silently (seamless to user)
  → If at limit (another device started in the gap): shows error

Grace period for reconnection:

To improve UX during network flakiness:

Option 1: Longer liveness timeout (90s instead of 60s for the sorted-set cutoff)
  Pros: More tolerant of network issues
  Cons: Dead sessions linger longer — the user waits up to 90s to reclaim a slot

Option 2: Soft/Hard expiry
  Soft expiry at 60s: session slot is reclaimable by same device
  Hard expiry at 120s: session fully removed

  Implementation:
  - At 60s without heartbeat: remove from account_sessions sorted set (slot freed)
  - But keep session:{stream_id} hash for another 60s
  - If original device heartbeats within 60-120s window:
    → Re-add to sorted set (if limit not exceeded by other devices)
    → Seamless reconnection
  - After 120s: hash expires, device must fully re-start
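The soft/hard window logic above reduces to a small state machine. A sketch under the stated 60s/120s thresholds — the function names and return values are illustrative, not a fixed API:

```python
SOFT_EXPIRY_S = 60    # slot reclaimable by other devices
HARD_EXPIRY_S = 120   # session fully removed

def classify_session(last_heartbeat, now):
    """Classify a session under the soft/hard expiry scheme."""
    age = now - last_heartbeat
    if age < SOFT_EXPIRY_S:
        return "active"          # counts against the limit
    if age < HARD_EXPIRY_S:
        return "reconnectable"   # slot freed, same device may still resume
    return "dead"                # hash expired, device must fully re-start

def on_late_heartbeat(last_heartbeat, now, slots_used, max_streams):
    """Decide whether a late heartbeat can seamlessly resume the session."""
    state = classify_session(last_heartbeat, now)
    if state == "active":
        return "continue"
    if state == "reconnectable" and slots_used < max_streams:
        return "resume"          # re-add stream_id to the sorted set
    return "terminated"          # 410 Gone → client must re-start
```

Note the limit re-check on resume: if another device claimed the freed slot during the soft window, the reconnecting device is terminated rather than pushing the account over its plan limit.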

Deep Dive 3: Distributed Redis Failure and Graceful Degradation

Problem: Redis cluster has a node failure. What happens to stream authorization?

Redis Cluster with replicas:

Setup: 10 shards, each with 1 primary + 2 replicas
  Shard 1: Primary A1, Replica A2, A3
  ...

Normal failover:
  Primary A1 goes down → Redis Cluster's built-in failover promotes A2 to primary in ~5 seconds (Sentinel applies only to non-clustered deployments)
  During those 5 seconds: writes to shard 1 fail

Impact: accounts hashed to shard 1 (~10% of accounts) can't start or heartbeat

Graceful degradation strategies:

Strategy 1: Fail-open with soft enforcement

If Redis is unreachable for an account:
  → ALLOW the stream start (fail-open)
  → Log the event to Kafka for async reconciliation
  → Set a client-side timer: re-check authorization in 60 seconds
  → When Redis recovers: count actual streams, terminate excess

Rationale: It's better to allow 3 streams on a 2-stream plan temporarily
than to block the ~20M accounts (~5M concurrent streams) hashed to the failed shard.

Strategy 2: Local in-memory fallback

Each Stream Auth Service instance maintains a local cache:
  account_id → last_known_stream_count (refreshed from Redis every 10s)

If Redis is unreachable:
  1. Check local cache for stream count
  2. If count < limit: allow (tentatively)
  3. If count >= limit: deny
  4. When Redis recovers: reconcile

Downside: different auth instances may have stale local state
→ May allow slight over-limit during Redis outage
→ Acceptable trade-off vs total denial
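Strategies 1 and 2 combine into one authorization decision: use Redis when reachable, fall back to the stale local cache, and fail open as the last resort, flagging degraded decisions for later reconciliation. A sketch with illustrative names (`None` stands for "source unavailable"):

```python
def authorize_stream(redis_count, cached_count, max_streams):
    """Fail-open authorization sketch.

    redis_count:  live count from Redis, or None if Redis is unreachable.
    cached_count: last-known count from the local in-memory cache, or None.
    Returns (allowed, needs_reconciliation).
    """
    if redis_count is not None:
        # Normal path: authoritative count, no reconciliation needed.
        return (redis_count < max_streams, False)
    if cached_count is not None:
        # Strategy 2: stale local cache — may slightly over-admit.
        return (cached_count < max_streams, True)
    # Strategy 1: no data at all — fail open, reconcile when Redis recovers.
    return (True, True)
```

Any decision with `needs_reconciliation=True` would also be logged (e.g. to Kafka, per Strategy 1) so excess streams can be counted and terminated once Redis recovers.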

Strategy 3: Multi-region Redis with cross-region fallback

Region A: Redis Cluster (primary writes)
Region B: Redis Cluster (read replica, promoted to primary on failover)

Normal operation:
  All writes go to Region A Redis
  Reads can go to either (for lower latency)

Region A failure:
  → Region B Redis promoted to primary
  → Some state may be lost (async replication lag ~1-2 seconds)
  → Worst case: a few accounts get 1 extra stream for a brief period
  → Far better than total outage

Monitoring and alerting:

Key metrics:
  - Redis latency P99 (alert > 5ms)
  - Stream start failure rate (alert > 1%)
  - Heartbeat failure rate (alert > 0.5%)
  - Sessions in "zombie" state (no heartbeat but not expired)
  - Accounts exceeding plan limit (should be ~0 after reconciliation)

7. Extensions (2 min)

  • Device fingerprinting: Track device type, OS, and hardware ID to prevent screen-sharing workarounds (one device pretending to be multiple). Hash hardware identifiers for privacy. Flag accounts where device fingerprints change suspiciously fast.
  • Household detection: If multiple streams originate from the same IP/network, count them as “home” streams with a higher limit. Streams from different IPs count against a stricter “away” limit. Similar to Netflix’s household sharing enforcement.
  • Quality-based stream limits: Instead of raw screen count, enforce a bandwidth limit. 4K stream = 2 “credits,” HD = 1 credit, SD = 0.5 credits. Premium plan gets 8 credits → either 2×4K or 4×HD or 8×SD.
  • Offline download management: Downloaded content for offline viewing counts against a device limit (not concurrent screen limit). Track authorized devices separately. Revoke offline licenses when device limit is exceeded.
  • Account sharing detection: ML model analyzes stream patterns (geographic spread, time zones, device diversity) to detect likely account sharing. Flag accounts for review rather than auto-block, allowing for legitimate travel use cases.