1. Requirements & Scope (5 min)

Functional Requirements

  1. 1-on-1 messaging between users (text messages)
  2. Group messaging (up to 256 members)
  3. Message delivery status: sent, delivered, read
  4. Online/offline presence indicators
  5. Message history — persistent, accessible from any device

Non-Functional Requirements

  • Availability: 99.99% — messaging is real-time communication; downtime is immediately felt
  • Latency: Message delivery < 200ms end-to-end for online recipients
  • Consistency: Messages must be ordered correctly within a conversation. No message loss. Effectively exactly-once delivery: at-least-once transport plus deduplication, so no duplicate messages are shown.
  • Scale: 2B registered users, 500M DAU, 100B messages/day
  • Durability: Messages are persistent — stored forever (or until user deletes)

2. Estimation (3 min)

Traffic

  • 100B messages/day ÷ ~100K sec/day (86,400, rounded) ≈ 1M messages/sec
  • Peak: 3x → 3M messages/sec
  • This is extremely high write throughput

Storage

  • Average message: 100 bytes (text + metadata)
  • 100B messages/day × 100 bytes = 10TB/day
  • Per year: ~3.6PB
  • With media (images, voice notes): 10x → ~36PB/year

Connections

  • 500M DAU with persistent WebSocket connections
  • Average user online 4 hours/day → at any time ~83M concurrent connections
  • Peak: ~150M concurrent WebSocket connections
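The arithmetic above can be checked with a few lines of Python (86,400 sec/day is rounded to ~100K in the text):

```python
# Back-of-envelope numbers from the estimates above, Python as a calculator.
SECONDS_PER_DAY = 86_400
MESSAGES_PER_DAY = 100e9
DAU = 500e6
AVG_ONLINE_HOURS = 4

msgs_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY        # ~1.16M (quoted as ~1M)
peak_msgs_per_sec = 3 * msgs_per_sec                     # ~3.5M at 3x peak
storage_per_day_tb = MESSAGES_PER_DAY * 100 / 1e12       # 10 TB/day at 100 B/msg
storage_per_year_pb = storage_per_day_tb * 365 / 1000    # ~3.65 PB/yr
concurrent_conns = DAU * AVG_ONLINE_HOURS / 24           # ~83M average concurrent

print(f"{msgs_per_sec / 1e6:.2f}M msg/s, peak {peak_msgs_per_sec / 1e6:.2f}M")
print(f"{storage_per_day_tb:.0f} TB/day, {storage_per_year_pb:.2f} PB/yr")
print(f"{concurrent_conns / 1e6:.0f}M average concurrent connections")
```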

3. API Design (3 min)

REST (for non-realtime operations)

POST /api/v1/messages/send
  Body: {
    "conversation_id": "conv_abc",
    "content": "Hello!",
    "type": "text",
    "client_message_id": "cm_uuid_123"  // idempotency key
  }
  Response 201: {
    "message_id": "m_server_456",
    "timestamp": "2026-02-22T18:00:00.123Z"
  }

GET /api/v1/conversations
  Response 200: [
    {
      "conversation_id": "conv_abc",
      "type": "1on1",
      "participants": ["u_1", "u_2"],
      "last_message": { "content": "Hello!", "timestamp": "..." },
      "unread_count": 3
    }
  ]

GET /api/v1/conversations/{conv_id}/messages?cursor={cursor}&limit=50
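A minimal sketch of the idempotent send handler, with in-memory dicts standing in for the Redis dedup cache and the Cassandra store (all names here are illustrative, not a prescribed API):

```python
import uuid
from datetime import datetime, timezone

# Stand-ins for the dedup cache (Redis in the real design) and the
# message store (Cassandra). Illustrative only.
dedup_cache: dict[str, str] = {}   # client_message_id -> server message_id
message_store: list[dict] = []

def send_message(conversation_id: str, content: str, client_message_id: str) -> dict:
    """Handle POST /api/v1/messages/send with idempotency."""
    # A retry with the same client_message_id returns the original message_id.
    if client_message_id in dedup_cache:
        return {"message_id": dedup_cache[client_message_id], "duplicate": True}
    message_id = f"m_{uuid.uuid4().hex[:8]}"
    message_store.append({
        "message_id": message_id,
        "conversation_id": conversation_id,
        "content": content,
    })
    dedup_cache[client_message_id] = message_id
    return {
        "message_id": message_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "duplicate": False,
    }

first = send_message("conv_abc", "Hello!", "cm_uuid_123")
retry = send_message("conv_abc", "Hello!", "cm_uuid_123")  # network retry
assert retry["message_id"] == first["message_id"]          # deduplicated
```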

WebSocket (for real-time)

// Client → Server
{ "type": "message", "conversation_id": "conv_abc", "content": "Hello!", "client_id": "cm_uuid_123" }
{ "type": "typing", "conversation_id": "conv_abc" }
{ "type": "ack", "message_id": "m_456" }        // delivery receipt
{ "type": "read", "conversation_id": "conv_abc", "up_to": "m_456" }  // read receipt

// Server → Client
{ "type": "message", "message_id": "m_456", "from": "u_2", "conversation_id": "conv_abc", "content": "Hi!", "timestamp": "..." }
{ "type": "delivered", "message_id": "m_123" }   // your message was delivered
{ "type": "read", "conversation_id": "conv_abc", "by": "u_2", "up_to": "m_456" }
{ "type": "typing", "conversation_id": "conv_abc", "user": "u_2" }
{ "type": "presence", "user": "u_2", "status": "online" }

Key decisions:

  • client_message_id for idempotency — if client retries due to network issues, server deduplicates
  • Read receipts as “up_to” marker — not per-message, reduces traffic: “I’ve read everything up to message X”
  • WebSocket for all real-time events: messages, typing, presence, delivery/read receipts
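One way a Chat Server might dispatch the client-to-server frames above, sketched with stub handlers (handler names and return strings are illustrative, not a prescribed API):

```python
import json

# Stub handlers for each frame type; in the real system these would
# store/route messages, broadcast typing, and update receipt state.
def on_message(frame): return f"store+route {frame['client_id']}"
def on_typing(frame): return f"broadcast typing in {frame['conversation_id']}"
def on_ack(frame): return f"mark delivered {frame['message_id']}"
def on_read(frame): return f"mark read up to {frame['up_to']}"

HANDLERS = {"message": on_message, "typing": on_typing, "ack": on_ack, "read": on_read}

def dispatch(raw: str) -> str:
    frame = json.loads(raw)
    handler = HANDLERS.get(frame["type"])
    if handler is None:
        raise ValueError(f"unknown frame type: {frame['type']}")
    return handler(frame)

print(dispatch('{"type": "ack", "message_id": "m_456"}'))
```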

4. Data Model (3 min)

Conversations & Participants (PostgreSQL; conversations sharded by conversation_id, participants by user_id so the "my conversations" query is single-shard)

Table: conversations
  conversation_id  (PK) | bigint
  type                   | enum('1on1', 'group')
  name                   | varchar(100)  -- for groups
  created_at             | timestamptz

Table: participants
  conversation_id        | bigint
  user_id                | bigint
  last_read_message_id   | bigint
  muted_until            | timestamptz, nullable
  PK: (conversation_id, user_id)
  Index: (user_id, conversation_id)  -- "my conversations" query

Messages (Cassandra, partitioned by conversation_id)

Table: messages
  conversation_id  (partition key) | bigint
  message_id       (clustering key, DESC) | bigint (Snowflake ID = time-ordered)
  sender_id                        | bigint
  content                          | text
  type                             | text ('text', 'image', 'voice')
  created_at                       | timestamp

Why Cassandra for messages?

  • 1M writes/sec requires horizontally scalable write throughput
  • Access pattern is always: “get messages for conversation X, ordered by time” — perfect for Cassandra’s partition + clustering key model
  • No cross-conversation queries needed (no joins)

Connection Registry (Redis)

Key: conn:{user_id}
Value: { "server_id": "ws-server-42", "connected_at": "..." }
TTL: refreshed by heartbeats; expires automatically if the Chat Server dies (explicit DEL on clean disconnect)

5. High-Level Design (12 min)

Message Send Path

Sender's device → WebSocket → Chat Server (sender's connection)
  → Chat Service:
    1. Validate message, dedup by client_message_id
    2. Write to Cassandra (messages table)
    3. Update conversation metadata (last_message) in PostgreSQL
    4. Publish to Kafka: topic messages, keyed by conversation_id (key-based partitioning preserves per-conversation order; a topic per conversation would not scale)
  → Kafka → Routing Service:
    → For each recipient in the conversation:
      → Check Redis: is recipient online?
      → Online: find their Chat Server → push via WebSocket
      → Offline: store in undelivered queue (Redis list per user), then trigger Push Notification Service (APNs/FCM)
  → Sender receives "sent" ack
  → Recipient's Chat Server receives message → pushes to client
  → Recipient sends "delivered" ack → routed back to sender
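The send path above, sketched as one orchestration function; the action strings stand in for the actual Cassandra, PostgreSQL, Kafka, Redis, and push-notification calls (dedup, step 1, is shown separately in the API section):

```python
# Orchestration sketch of steps 2-4 plus routing. Action strings are
# illustrative stand-ins for real storage and delivery calls.
def send_path(message: dict, participants: list[str], online: set[str]) -> dict:
    actions = []
    actions.append("write:cassandra")           # 2. persist the message
    actions.append("update:postgres-metadata")  # 3. last_message, timestamps
    actions.append(f"publish:kafka:messages:{message['conversation_id']}")  # 4
    for user in participants:
        if user == message["sender"]:
            continue
        if user in online:
            actions.append(f"push:websocket:{user}")
        else:
            actions.append(f"queue:redis-undelivered:{user}")
            actions.append(f"notify:apns-fcm:{user}")
    return {"ack_to_sender": "sent", "actions": actions}

result = send_path(
    {"conversation_id": "conv_abc", "sender": "u_1", "content": "Hello!"},
    participants=["u_1", "u_2", "u_3"],
    online={"u_2"},  # u_3 is offline
)
```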

Message Read Path (History)

Client opens conversation
  → REST API: GET /conversations/{id}/messages?cursor=...
  → Chat Service → Cassandra: query messages partition
  → Return paginated results
  → Client also receives real-time messages via WebSocket

Components

  1. Chat Servers: Maintain WebSocket connections, handle real-time message routing
  2. Chat Service: Business logic — validation, storage, orchestration
  3. Routing Service: Determines how to reach each recipient (online → push, offline → notification)
  4. PostgreSQL: Conversation metadata, user profiles, participant lists
  5. Cassandra: Message storage — optimized for time-series write-heavy workloads
  6. Redis: Connection registry (which user is on which chat server), undelivered message queues, presence cache
  7. Kafka: Message event stream — decouples write from delivery
  8. Push Notification Service: APNs/FCM integration for offline users
  9. Load Balancer (L4): TCP-level balancing for WebSocket connections (sticky sessions)

6. Deep Dives (15 min)

Deep Dive 1: Message Ordering and Delivery Guarantees

Problem: At 1M messages/sec with distributed servers, how do we guarantee messages are ordered correctly and delivered exactly once?

Ordering within a conversation:

  • Each message gets a Snowflake ID (time-based, globally unique, monotonically increasing per server)
  • Within a single conversation, messages from different senders may arrive at different servers at slightly different times
  • We use the Snowflake ID as the canonical order — it’s the timestamp when the message was received by the Chat Service
  • Client displays messages sorted by this ID
  • If two messages have IDs 1ms apart, the order might be “wrong” from the users’ perspective, but this is acceptable and matches how every major messaging app works
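A minimal Snowflake-style generator, assuming the common 41/10/12-bit layout (millisecond timestamp, server id, per-millisecond sequence); the custom epoch is illustrative, and the exact bit layout is an assumption rather than something the design above specifies:

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC, an illustrative custom epoch

class Snowflake:
    """Time-ordered, unique, monotonic-per-server 64-bit IDs."""

    def __init__(self, server_id: int):
        assert 0 <= server_id < 1024  # 10 bits
        self.server_id = server_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence
                if self.seq == 0:                  # exhausted this millisecond
                    while now <= self.last_ms:     # spin to the next ms
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.server_id << 12) | self.seq

gen = Snowflake(server_id=42)
a, b = gen.next_id(), gen.next_id()
assert a < b  # sortable, so clients can order by ID within a conversation
```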

Exactly-once delivery:

  • Client assigns a client_message_id (UUID) to each message before sending
  • Server stores this ID in a dedup cache (Redis, TTL 24h)
  • If the same client_message_id arrives again (network retry), server returns the existing message_id without re-processing
  • On the receiving side: client tracks the last received message_id per conversation. Server only sends messages with higher IDs.

Delivery guarantees (at-least-once → exactly-once):

  1. Sender → Server: acknowledged when written to Cassandra + Kafka (persistent)
  2. Server → Recipient: push via WebSocket, wait for client ACK
  3. No ACK within 30 seconds → retry push
  4. Recipient still offline → message stays in Redis undelivered queue → delivered on next connect
  5. Client-side dedup by message_id prevents displaying duplicates
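The receive side of this scheme in miniature: client-side dedup by message_id makes the server's re-pushes after a missed ACK safe, which is what turns at-least-once transport into exactly-once display (class and method names are illustrative):

```python
# Client-side inbox: tracks seen message_ids so server retries never
# produce duplicate messages on screen.
class ClientInbox:
    def __init__(self):
        self.seen: set[str] = set()
        self.displayed: list[str] = []

    def on_push(self, message_id: str, content: str) -> str:
        if message_id in self.seen:
            return "ack"          # duplicate: re-ack, do not re-display
        self.seen.add(message_id)
        self.displayed.append(content)
        return "ack"

inbox = ClientInbox()
inbox.on_push("m_456", "Hi!")
inbox.on_push("m_456", "Hi!")      # server retried after a missed ACK
assert inbox.displayed == ["Hi!"]  # displayed exactly once
```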

Deep Dive 2: Handling 150M Concurrent WebSocket Connections

Problem: Each WebSocket connection is a persistent TCP connection consuming memory (40-50KB per connection for buffers). 150M connections × 50KB = 7.5TB of memory just for connection state.

Architecture:

  • Fleet of Chat Servers: each handles ~100K connections at ~100KB each (kernel buffers plus application state) ≈ 10GB RAM per server
  • Need: 150M / 100K = 1,500 Chat Servers
  • Deployed across multiple availability zones and regions

Connection routing:

  • When a message needs to be delivered to user X, the Routing Service needs to know: which Chat Server is user X connected to?
  • Redis stores the mapping: conn:{user_id} → chat-server-42
  • When user connects: SETEX in Redis with TTL
  • When user disconnects: DEL from Redis
  • Heartbeat every 30 seconds to detect dead connections

Sticky sessions:

  • L4 load balancer (not L7) for WebSocket connections
  • Once connected, all traffic for that TCP connection goes to the same Chat Server
  • If a Chat Server dies, clients reconnect (they’re designed for this — mobile networks drop connections constantly)

Multi-region:

  • Chat Servers deployed in each region (US, EU, Asia, etc.)
  • Messages between users in different regions: Kafka cross-region replication (MirrorMaker)
  • Latency: intra-region < 50ms, cross-region < 200ms

Deep Dive 3: Group Messaging

Problem: A group with 256 members means each message must be delivered to 255 other users. A busy group with 100 messages/min → 25,500 deliveries/min for one group.

Fan-out approach:

  1. Sender sends message to Chat Service
  2. Chat Service writes to Cassandra (one write — message is stored once per conversation, not per recipient)
  3. Publishes to Kafka: topic group_messages, keyed by conversation_id
  4. Routing Service reads participant list (cached in Redis)
  5. For each participant: check online status → push or notify

Optimization — Group message batching:

  • Don’t fan out to all 255 members individually in the Routing Service
  • Instead, Chat Servers subscribe to group topics
  • Each Chat Server checks: “do I have any connected users who are in this group?”
  • If yes, push to those users. If no, skip entirely.
  • This reduces cross-server traffic dramatically because each Chat Server only needs to handle its own connected users.
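The per-server filter is just a set intersection between a server's connected users and the group's membership; a sketch with illustrative data shapes:

```python
# Each Chat Server, on receiving a group message from the topic, pushes
# only to its own connected users who are group members.
def deliver_on_server(connected_users: set[str], group_members: set[str],
                      sender: str, message: str) -> list[str]:
    targets = (connected_users & group_members) - {sender}
    return [f"push:{user}:{message}" for user in sorted(targets)]

members = {"u_1", "u_2", "u_5", "u_9"}
# Server A holds the sender u_1 plus u_2; server B holds u_5; u_9 is offline.
server_a = deliver_on_server({"u_1", "u_2", "u_3"}, members, "u_1", "hello group")
server_b = deliver_on_server({"u_5", "u_7"}, members, "u_1", "hello group")
assert server_a == ["push:u_2:hello group"]
assert server_b == ["push:u_5:hello group"]
```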

Unread counts:

  • Maintain last_read_message_id per user per conversation
  • Unread count = SELECT count(*) FROM messages WHERE conversation_id = X AND message_id > last_read_message_id
  • This maps to a range scan on the clustering key in Cassandra, cheap for small ranges; cap the displayed count (e.g. "99+") so long-idle conversations don't force large scans
  • Cache frequently accessed unread counts in Redis
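A sketch of the unread computation, with a sorted in-memory id list standing in for the Cassandra partition and bisect mimicking the clustering-key range scan:

```python
import bisect

# message_ids for one conversation, ascending (Snowflake IDs are
# time-ordered, so this mirrors the partition's clustering order).
def unread_count(message_ids: list[int], last_read: int) -> int:
    """Count messages with message_id > last_read_message_id."""
    return len(message_ids) - bisect.bisect_right(message_ids, last_read)

ids = [101, 102, 105, 107, 110]
assert unread_count(ids, last_read=105) == 2  # 107 and 110 unread
assert unread_count(ids, last_read=0) == 5    # nothing read yet
```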

7. Extensions (2 min)

  • End-to-end encryption (E2EE): Signal Protocol for 1-on-1 (X3DH key agreement plus the Double Ratchet). For groups: Sender Keys protocol. Server stores only ciphertext — cannot read messages. Key challenge: multi-device sync (each device has its own key pair).
  • Media messages: Images, videos, voice notes → upload to S3 first, send a message with the S3 URL. Thumbnail generation for inline preview.
  • Message search: Client-side search for E2EE (server can’t index encrypted content). For non-E2EE: Elasticsearch index on message content.
  • Voice/Video calls: WebRTC for peer-to-peer. TURN/STUN servers for NAT traversal. SFU (Selective Forwarding Unit) for group calls.
  • Message reactions: Lightweight — just an update to the message record. Fan-out same as regular messages.