1. Requirements & Scope (5 min)
Functional Requirements
- 1-on-1 messaging between users (text messages)
- Group messaging (up to 256 members)
- Message delivery status: sent, delivered, read
- Online/offline presence indicators
- Message history — persistent, accessible from any device
Non-Functional Requirements
- Availability: 99.99% — messaging is real-time communication; downtime is immediately felt
- Latency: Message delivery < 200ms end-to-end for online recipients
- Consistency: Messages must be ordered correctly within a conversation. No message loss. Exactly-once delivery semantics (no duplicate messages).
- Scale: 2B registered users, 500M DAU, 100B messages/day
- Durability: Messages are persistent — stored forever (or until user deletes)
2. Estimation (3 min)
Traffic
- 100B messages/day ÷ 100K sec/day = 1M messages/sec
- Peak: 3x → 3M messages/sec
- This is extremely high write throughput
Storage
- Average message: 100 bytes (text + metadata)
- 100B messages/day × 100 bytes = 10TB/day
- Per year: ~3.6PB
- With media (images, voice notes): 10x → ~36PB/year
Connections
- 500M DAU with persistent WebSocket connections
- Average user online 4 hours/day → at any time ~83M concurrent connections
- Peak: ~150M concurrent WebSocket connections
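The back-of-the-envelope numbers above can be reproduced in a few lines. This is a sketch; every input is an assumption stated in this section (100B msg/day, ~100K sec/day, 100-byte messages, 500M DAU online 4h/day):

```python
# Back-of-the-envelope estimation using the assumptions in this section.

MESSAGES_PER_DAY = 100e9
SECONDS_PER_DAY = 100_000          # rounded from 86,400 for easy math

avg_qps = MESSAGES_PER_DAY / SECONDS_PER_DAY
peak_qps = 3 * avg_qps             # assumed 3x peak factor

AVG_MESSAGE_BYTES = 100
storage_per_day_tb = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e12
storage_per_year_pb = storage_per_day_tb * 365 / 1e3

DAU = 500e6
HOURS_ONLINE = 4
avg_concurrent = DAU * HOURS_ONLINE / 24   # fraction of the day online

print(f"avg ~{avg_qps/1e6:.0f}M msg/s, peak ~{peak_qps/1e6:.0f}M msg/s")
print(f"storage ~{storage_per_day_tb:.0f}TB/day, ~{storage_per_year_pb:.2f}PB/yr")
print(f"avg concurrent connections ~{avg_concurrent/1e6:.0f}M")
```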
3. API Design (3 min)
REST (for non-realtime operations)
POST /api/v1/messages/send
Body: {
"conversation_id": "conv_abc",
"content": "Hello!",
"type": "text",
"client_message_id": "cm_uuid_123" // idempotency key
}
Response 201: {
"message_id": "m_server_456",
"timestamp": "2026-02-22T18:00:00.123Z"
}
GET /api/v1/conversations
Response 200: [
{
"conversation_id": "conv_abc",
"type": "1on1",
"participants": ["u_1", "u_2"],
"last_message": { "content": "Hello!", "timestamp": "..." },
"unread_count": 3
}
]
GET /api/v1/conversations/{conv_id}/messages?cursor={cursor}&limit=50
WebSocket (for real-time)
// Client → Server
{ "type": "message", "conversation_id": "conv_abc", "content": "Hello!", "client_id": "cm_uuid_123" }
{ "type": "typing", "conversation_id": "conv_abc" }
{ "type": "ack", "message_id": "m_456" } // delivery receipt
{ "type": "read", "conversation_id": "conv_abc", "up_to": "m_456" } // read receipt
// Server → Client
{ "type": "message", "message_id": "m_456", "from": "u_2", "conversation_id": "conv_abc", "content": "Hi!", "timestamp": "..." }
{ "type": "delivered", "message_id": "m_123" } // your message was delivered
{ "type": "read", "conversation_id": "conv_abc", "by": "u_2", "up_to": "m_456" }
{ "type": "typing", "conversation_id": "conv_abc", "user": "u_2" }
{ "type": "presence", "user": "u_2", "status": "online" }
Key decisions:
- client_message_id for idempotency — if client retries due to network issues, server deduplicates
- Read receipts as “up_to” marker — not per-message, reduces traffic: “I’ve read everything up to message X”
- WebSocket for all real-time events: messages, typing, presence, delivery/read receipts
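The idempotency decision can be sketched server-side as follows. A dict stands in for the Redis dedup cache (TTL 24h in production), and names like `handle_send` are illustrative, not a real API:

```python
import time
import uuid

DEDUP_TTL = 24 * 3600
_dedup = {}  # client_message_id -> (server message_id, stored_at)

def handle_send(client_message_id, conversation_id, content):
    """Dedup by client_message_id: a retried request gets the same message_id."""
    entry = _dedup.get(client_message_id)
    if entry and time.time() - entry[1] < DEDUP_TTL:
        return entry[0]                       # retry: return existing message_id
    message_id = f"m_{uuid.uuid4().hex[:8]}"  # real system: Snowflake ID
    # ... write to Cassandra, publish to Kafka ...
    _dedup[client_message_id] = (message_id, time.time())
    return message_id

first = handle_send("cm_uuid_123", "conv_abc", "Hello!")
retry = handle_send("cm_uuid_123", "conv_abc", "Hello!")  # network retry
assert first == retry  # deduplicated: no second message created
```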
4. Data Model (3 min)
Conversations & Participants (PostgreSQL, sharded by user_id)
Table: conversations
conversation_id (PK) | bigint
type | enum('1on1', 'group')
name | varchar(100) -- for groups
created_at | timestamptz
Table: participants
conversation_id | bigint
user_id | bigint
last_read_message_id | bigint
muted_until | timestamptz, nullable
PK: (conversation_id, user_id)
Index: (user_id, conversation_id) -- "my conversations" query
Messages (Cassandra, partitioned by conversation_id)
Table: messages
conversation_id (partition key) | bigint
message_id (clustering key, DESC) | bigint (Snowflake ID = time-ordered)
sender_id | bigint
content | text
type | text ('text', 'image', 'voice')
created_at | timestamp
Why Cassandra for messages?
- 1M writes/sec requires horizontally scalable write throughput
- Access pattern is always: “get messages for conversation X, ordered by time” — perfect for Cassandra’s partition + clustering key model
- No cross-conversation queries needed (no joins)
Connection Registry (Redis)
Key: conn:{user_id}
Value: { "server_id": "ws-server-42", "connected_at": "..." }
TTL: auto-expiry on disconnect
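A minimal sketch of the registry's lifecycle, with an in-memory dict standing in for Redis. Real code would call `SETEX`/`DEL`; the TTL is refreshed by the heartbeat so entries for dead connections expire on their own:

```python
import time

HEARTBEAT_TTL = 90  # seconds; assumed to cover ~3 missed 30s heartbeats

registry = {}  # "conn:{user_id}" -> (server_id, expires_at)

def on_connect(user_id, server_id):
    registry[f"conn:{user_id}"] = (server_id, time.time() + HEARTBEAT_TTL)

def on_heartbeat(user_id):
    key = f"conn:{user_id}"
    if key in registry:
        server_id, _ = registry[key]
        registry[key] = (server_id, time.time() + HEARTBEAT_TTL)  # refresh TTL

def lookup(user_id):
    """Which Chat Server is this user on? None if offline or expired."""
    entry = registry.get(f"conn:{user_id}")
    if entry and entry[1] > time.time():
        return entry[0]
    return None

on_connect("u_2", "ws-server-42")
assert lookup("u_2") == "ws-server-42"
assert lookup("u_999") is None  # never connected -> treated as offline
```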
5. High-Level Design (12 min)
Message Send Path
Sender's device → WebSocket → Chat Server (sender's connection)
→ Chat Service:
1. Validate message, dedup by client_message_id
2. Write to Cassandra (messages table)
3. Update conversation metadata (last_message) in PostgreSQL
4. Publish to Kafka topic: messages.{conversation_id}
→ Kafka → Routing Service:
→ For each recipient in the conversation:
→ Check Redis: is recipient online?
→ Online: find their Chat Server → push via WebSocket
→ Offline: push to Push Notification Service (APNs/FCM)
→ Store in undelivered queue (Redis list per user)
→ Sender receives "sent" ack
→ Recipient's Chat Server receives message → pushes to client
→ Recipient sends "delivered" ack → routed back to sender
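The per-recipient branch in the Routing Service can be sketched like this (all helpers and data structures are in-memory stand-ins for Redis, the WebSocket push, and APNs/FCM):

```python
def route_message(message, recipients, registry, undelivered):
    """Online -> WebSocket push via that user's Chat Server;
    offline -> push notification + Redis undelivered queue."""
    outcome = {}
    for user_id in recipients:
        server_id = registry.get(f"conn:{user_id}")
        if server_id:
            outcome[user_id] = f"push via {server_id}"
        else:
            undelivered.setdefault(user_id, []).append(message)
            outcome[user_id] = "offline: notified + queued"
    return outcome

registry = {"conn:u_2": "ws-server-42"}
undelivered = {}
msg = {"message_id": "m_456", "conversation_id": "conv_abc", "content": "Hi!"}
result = route_message(msg, ["u_2", "u_3"], registry, undelivered)
assert result["u_2"] == "push via ws-server-42"
assert undelivered["u_3"] == [msg]  # delivered on u_3's next connect
```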
Message Read Path (History)
Client opens conversation
→ REST API: GET /conversations/{id}/messages?cursor=...
→ Chat Service → Cassandra: query messages partition
→ Return paginated results
→ Client also receives real-time messages via WebSocket
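Cursor pagination over a conversation mirrors the Cassandra layout (partition per conversation, clustering key DESC). A sorted list stands in for the partition in this sketch:

```python
def get_messages(partition, cursor=None, limit=50):
    """partition is sorted by message_id DESC; cursor = last message_id seen."""
    if cursor is not None:
        partition = [m for m in partition if m["message_id"] < cursor]
    page = partition[:limit]
    next_cursor = page[-1]["message_id"] if len(page) == limit else None
    return page, next_cursor

# Newest-first partition with message_ids 100..1
partition = [{"message_id": i} for i in range(100, 0, -1)]
page1, cur = get_messages(partition, cursor=None, limit=50)
assert page1[0]["message_id"] == 100 and cur == 51
page2, cur2 = get_messages(partition, cursor=cur, limit=50)
assert page2[0]["message_id"] == 50  # scroll-back continues seamlessly
```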
Components
- Chat Servers: Maintain WebSocket connections, handle real-time message routing
- Chat Service: Business logic — validation, storage, orchestration
- Routing Service: Determines how to reach each recipient (online → push, offline → notification)
- PostgreSQL: Conversation metadata, user profiles, participant lists
- Cassandra: Message storage — optimized for time-series write-heavy workloads
- Redis: Connection registry (which user is on which chat server), undelivered message queues, presence cache
- Kafka: Message event stream — decouples write from delivery
- Push Notification Service: APNs/FCM integration for offline users
- Load Balancer (L4): TCP-level balancing for WebSocket connections (sticky sessions)
6. Deep Dives (15 min)
Deep Dive 1: Message Ordering and Delivery Guarantees
Problem: At 1M messages/sec with distributed servers, how do we guarantee messages are ordered correctly and delivered exactly once?
Ordering within a conversation:
- Each message gets a Snowflake ID (time-based, globally unique, monotonically increasing per server)
- Within a single conversation, messages from different senders may arrive at different servers at slightly different times
- We use the Snowflake ID as the canonical order — it’s the timestamp when the message was received by the Chat Service
- Client displays messages sorted by this ID
- If two messages have IDs 1ms apart, the order might be “wrong” from the users’ perspective, but this is acceptable in practice and matches how major messaging apps behave
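A Snowflake-style generator can be sketched as below, using the common Twitter layout (41 bits of milliseconds since a custom epoch, 10 bits of worker ID, 12 bits of per-millisecond sequence); the epoch here is an arbitrary assumption, and real implementations also handle clocks moving backwards:

```python
import threading
import time

EPOCH_MS = 1577836800000  # 2020-01-01, an assumed custom epoch

class Snowflake:
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024   # 10-bit worker ID
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:         # 4096 IDs this ms: wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

gen = Snowflake(worker_id=42)
a, b = gen.next_id(), gen.next_id()
assert a < b  # time-ordered: later messages sort after earlier ones
```

Because the timestamp occupies the high bits, sorting by ID is sorting by server receive time, which is exactly the canonical order described above.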
Exactly-once delivery:
- Client assigns a client_message_id (UUID) to each message before sending
- Server stores this ID in a dedup cache (Redis, TTL 24h)
- If the same client_message_id arrives again (network retry), server returns the existing message_id without re-processing
- On the receiving side: client tracks the last received message_id per conversation; server only sends messages with higher IDs
Delivery guarantees (at-least-once → exactly-once):
- Sender → Server: acknowledged when written to Cassandra + Kafka (persistent)
- Server → Recipient: push via WebSocket, wait for client ACK
- No ACK within 30 seconds → retry push
- Recipient still offline → message stays in Redis undelivered queue → delivered on next connect
- Client-side dedup by message_id prevents displaying duplicates
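The last step — client-side dedup — is what turns the at-least-once push path into exactly-once display. A minimal sketch (a set stands in for the client's local store):

```python
def receive(message, seen, timeline):
    """Drop duplicate pushes: a retry after a missed ACK resends the same message_id."""
    mid = message["message_id"]
    if mid in seen:
        return            # already displayed: ignore the duplicate
    seen.add(mid)
    timeline.append(message)

seen, timeline = set(), []
msg = {"message_id": "m_456", "content": "Hi!"}
receive(msg, seen, timeline)
receive(msg, seen, timeline)  # server retried after 30s without an ACK
assert len(timeline) == 1     # displayed exactly once
```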
Deep Dive 2: Handling 150M Concurrent WebSocket Connections
Problem: Each WebSocket connection is a persistent TCP connection consuming memory (40-50KB per connection for buffers). 150M connections × 50KB = 7.5TB of memory just for connection state.
Architecture:
- Fleet of Chat Servers: each handles ~100K connections (100KB overhead including application state) = ~10GB RAM per server
- Need: 150M / 100K = 1,500 Chat Servers
- Deployed across multiple availability zones and regions
Connection routing:
- When a message needs to be delivered to user X, the Routing Service needs to know: which Chat Server is user X connected to?
- Redis stores the mapping: conn:{user_id} → chat-server-42
- When user connects: SETEX in Redis with TTL
- When user disconnects: DEL from Redis
- Heartbeat every 30 seconds to detect dead connections
Sticky sessions:
- L4 load balancer (not L7) for WebSocket connections
- Once connected, all traffic for that TCP connection goes to the same Chat Server
- If a Chat Server dies, clients reconnect (they’re designed for this — mobile networks drop connections constantly)
Multi-region:
- Chat Servers deployed in each region (US, EU, Asia, etc.)
- Messages between users in different regions: Kafka cross-region replication (MirrorMaker)
- Latency: intra-region < 50ms, cross-region < 200ms
Deep Dive 3: Group Messaging
Problem: A group with 256 members means each message must be delivered to 255 other users. A busy group with 100 messages/min → 25,500 deliveries/min for one group.
Fan-out approach:
- Sender sends message to Chat Service
- Chat Service writes to Cassandra (one write — message is stored once per conversation, not per recipient)
- Publishes to Kafka: group_messages.{conversation_id}
- Routing Service reads participant list (cached in Redis)
- For each participant: check online status → push or notify
Optimization — Group message batching:
- Don’t fan out to all 255 members individually in the Routing Service
- Instead, Chat Servers subscribe to group topics
- Each Chat Server checks: “do I have any connected users who are in this group?”
- If yes, push to those users. If no, skip entirely.
- This reduces cross-server traffic dramatically because each Chat Server only needs to handle its own connected users.
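The per-server filter can be sketched as a set intersection — each Chat Server pushes only to group members it is actually holding connections for (data structures here are in-memory stand-ins):

```python
def deliver_on_server(message, participants, connected_here):
    """Push only to group members connected to THIS Chat Server,
    skipping the sender (they already have the message)."""
    local = participants & connected_here
    return [(u, message["message_id"]) for u in local if u != message["sender"]]

participants = {f"u_{i}" for i in range(256)}       # the 256-member group
connected_here = {"u_3", "u_7", "u_900"}            # u_900 is not in this group
msg = {"message_id": "m_456", "sender": "u_3"}
pushes = deliver_on_server(msg, participants, connected_here)
assert pushes == [("u_7", "m_456")]  # sender and non-members skipped
```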
Unread counts:
- Maintain last_read_message_id per user per conversation
- Unread count = SELECT count(*) FROM messages WHERE conversation_id = X AND message_id > last_read_message_id
- This query is efficient in Cassandra (range scan on the clustering key within a single partition)
- Cache frequently accessed unread counts in Redis
7. Extensions (2 min)
- End-to-end encryption (E2EE): Signal Protocol for 1-on-1 (Diffie-Hellman key exchange). For groups: Sender Keys protocol. Server stores only ciphertext — cannot read messages. Key challenge: multi-device sync (each device has its own key pair).
- Media messages: Images, videos, voice notes → upload to S3 first, send a message with the S3 URL. Thumbnail generation for inline preview.
- Message search: Client-side search for E2EE (server can’t index encrypted content). For non-E2EE: Elasticsearch index on message content.
- Voice/Video calls: WebRTC for peer-to-peer. TURN/STUN servers for NAT traversal. SFU (Selective Forwarding Unit) for group calls.
- Message reactions: Lightweight — just an update to the message record. Fan-out same as regular messages.