Design WhatsApp / a Chat Messaging System - System Design

A chat app sounds like the easiest system you will ever build. “User A sends a message, user B receives it.” One INSERT, one SELECT. The interviewer lets you say that, then asks the questions that turn it into one of the hardest real-time systems in the building: how does B receive it instantly when B might be offline, on a train, or logged in on three devices at once? How do you show the second grey tick the moment it lands on B’s phone, and the two blue ticks the moment B actually opens the chat? How do hundreds of millions of phones hold an open connection to your servers at the same time without melting?

The whole problem comes down to one fact: HTTP request/response is the wrong shape for chat. Chat is server-initiated. The server needs to push to a client that did not ask for anything right now. Everything hard in this design - the persistent connections, the delivery guarantees, the presence system, the multi-device fan-out - falls out of solving “how does the server reach a specific user’s specific phone the instant a message exists for it.” Let me build it properly.

Functional Requirements (FR)

In scope:

1:1 messaging. User A sends a text message to user B. It must arrive in near real time when B is online, and be waiting for B when B comes back online.
Delivery state / receipts. The sender sees the message progress through states: sent (server has it, one tick), delivered (it reached the recipient’s device, two ticks), read (the recipient opened the chat, two blue ticks).
Online presence. Show whether a contact is online, or their “last seen” timestamp.
Typing indicators. “Alice is typing…” - a transient, best-effort signal.
Group messaging. A message to a group of up to ~1,000 members fans out to all of them. I will treat groups as an extension of 1:1, not a separate beast.
Multi-device. One account logged in on a phone plus several linked devices (web, desktop, tablet), all kept in sync. A message and its read state must converge across all of a user’s devices.
Message history / sync. A device that was offline catches up on everything it missed when it reconnects.

Explicitly out of scope (say this out loud so you own the scope):

End-to-end encryption internals. Real WhatsApp is E2E encrypted with the Signal protocol; I will note where it changes the design (the server stores ciphertext and cannot read or search content) but I am not designing the double-ratchet key exchange.
Voice/video calls. That is a WebRTC/media problem and its own system, covered in a separate design.
Media storage and transcoding. Photos and videos upload to a blob store/CDN and the message carries a pointer; I am not designing that pipeline.
Spam, abuse, payments, status/stories. Each is its own system.

The one decision that drives everything: this is a write-heavy, push-oriented, stateful-connection system. Unlike a social feed (read-heavy, pull), chat is roughly symmetric in reads and writes, and the defining constraint is that the server must hold a live connection to every online user so it can push the instant a message exists. That statefulness is the whole problem.

Non-Functional Requirements (NFR)

Scale: ~2B registered users, ~500M concurrent online connections at peak, ~100B messages/day. Groups up to ~1,000 members.
Latency: message delivery end to end (send tap to recipient’s screen) p99 under 500ms when both are online. The “sent” acknowledgment back to the sender must be near-instant (under 100ms) so the UI feels responsive.
Availability: 99.99%+ on the messaging path. People notice immediately when messages do not send. The connection layer must fail over gracefully.
Consistency: per-conversation ordering must be preserved (messages in a chat appear in the order they were sent), and no message is ever lost or silently dropped. Across conversations, global ordering does not matter. Receipts and presence are eventually consistent and can be slightly stale.
Durability: an undelivered message must survive server restarts and be delivered when the recipient returns. Once delivered to all of a user’s devices, the server may delete it (WhatsApp’s model - the server is a relay, not a permanent archive). Delivery guarantee is at-least-once, deduplicated by message ID to make it effectively exactly-once at the client.
Presence freshness: online/offline within a few seconds. “Last seen” can lag.

Back-of-the-Envelope Estimation (BoE)

Real numbers with the arithmetic, because they dictate the architecture.

Messages (writes):

100B messages / day
= 100,000,000,000 / 86,400 seconds
≈ 1,160,000 messages/sec  (call it ~1.2M msg/sec average)
Peak (3x average) ≈ 3.5M messages/sec

Reads / delivery:

Each message is written once but delivered to at least one recipient, often more (groups, multi-device). Assume an average fan-out of ~2 (1:1 dominates, groups pull the average up, multi-device adds copies):

1.2M msg/sec * ~2 deliveries ≈ 2.4M delivery-pushes/sec average
Peak ≈ 7M pushes/sec

So unlike a feed, reads and writes are the same order of magnitude. There is no 100:1 asymmetry to exploit; you cannot cache your way out of it. You must actually move every message.

Concurrent connections (the defining number):

500M concurrent WebSocket connections at peak.
If one connection server holds ~500K connections (aggressive but achievable
with epoll/edge-triggered IO and tuned kernels):
500M / 500K ≈ 1,000 connection servers just to hold the sockets.

That 1,000-server figure is the heart of the problem. These boxes do almost no CPU work; they exist to hold file descriptors open and shuttle bytes. Memory per connection (socket buffers + a little app state, say ~10KB) is the binding constraint:

500K connections * 10KB ≈ 5GB RAM per connection server, just for connection state.

Storage (undelivered queue + recent history):

The server is a relay, so it mainly stores undelivered messages plus a short window for multi-device sync. Per stored message:

message_id    16 bytes
conv_id        8 bytes
sender_id      8 bytes
recipient_id   8 bytes
payload      ~256 bytes (ciphertext, text-sized; media is a pointer)
timestamp      8 bytes
state          1 byte
metadata     ~50 bytes
------------------------------------------------
≈ 360 bytes/message  (round to ~400)

Most messages are delivered within seconds and are then eligible for deletion. Assume we also retain a sync window: a message is kept until all of the recipient’s devices have received it, up to ~30 days, after which undelivered messages expire:

Suppose ~5% of a day's messages are undelivered at any moment (offline recipients):
100B * 5% * 400 bytes ≈ 2 TB of pending messages outstanding.
The 30-day multi-device sync buffer is the larger driver. Upper bound, if every
message were kept for the full window:
100B/day * 30 days * 400 bytes ≈ 1.2 PB before replication.
Most messages are deleted once every device has acked, so the real figure is far
lower, and it shards cleanly. Storage is NOT the hard part here;
connection state and push throughput are.

Bandwidth:

Peak ~7M pushes/sec * ~400 bytes ≈ 2.8 GB/sec of message payload pushed.
Plus presence/typing/receipt control traffic, which is high in count
but tiny per event. Real but spread across 1,000 connection servers
(~3MB/sec each) - trivially handled.

The takeaways: ~500M live connections needing ~1,000 connection servers, ~3.5M messages/sec peak that all have to actually move (no read-cache shortcut), and storage that is modest because the server relays rather than archives. Connections and push throughput, not storage, are the design drivers.
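To make the arithmetic checkable, here is the same estimate as a short Python script. Every constant is an assumption taken from the numbers above (3x peak factor, fan-out of ~2, ~10KB per connection, ~400 bytes per message, ~5% pending); change one and the conclusions move with it. The unrounded results differ slightly from the rounded figures in the text (for example ~6.9M pushes/sec rather than ~7M).

```python
# Back-of-the-envelope inputs. These are the assumed figures from the
# estimate, not measurements.
MESSAGES_PER_DAY = 100_000_000_000
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3
FANOUT = 2
CONCURRENT_CONNECTIONS = 500_000_000
CONNECTIONS_PER_SERVER = 500_000
BYTES_PER_CONNECTION = 10_000       # ~10KB of buffers + app state
BYTES_PER_MESSAGE = 400
PENDING_FRACTION = 0.05             # share of a day's messages still undelivered

avg_msgs_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * PEAK_FACTOR
peak_pushes_per_sec = peak_msgs_per_sec * FANOUT

connection_servers = CONCURRENT_CONNECTIONS // CONNECTIONS_PER_SERVER
ram_per_server_gb = CONNECTIONS_PER_SERVER * BYTES_PER_CONNECTION / 1e9

pending_tb = MESSAGES_PER_DAY * PENDING_FRACTION * BYTES_PER_MESSAGE / 1e12
peak_bandwidth_gb_per_sec = peak_pushes_per_sec * BYTES_PER_MESSAGE / 1e9
per_server_mb_per_sec = peak_bandwidth_gb_per_sec * 1e3 / connection_servers

print(f"avg messages/sec:   {avg_msgs_per_sec:,.0f}")
print(f"peak messages/sec:  {peak_msgs_per_sec:,.0f}")
print(f"peak pushes/sec:    {peak_pushes_per_sec:,.0f}")
print(f"connection servers: {connection_servers:,}")
print(f"RAM per server:     {ram_per_server_gb:.1f} GB")
print(f"pending storage:    {pending_tb:.1f} TB")
print(f"peak bandwidth:     {peak_bandwidth_gb_per_sec:.2f} GB/s "
      f"({per_server_mb_per_sec:.2f} MB/s per server)")
```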

High-Level Design (HLD)

The architecture splits into a stateful connection layer (holds the WebSockets, knows which user is on which server) and a stateless message-routing layer behind it. The trick that ties them together is a session registry: a fast lookup from user_id/device_id -> which connection server holds that socket. To deliver a message you ask the registry where the recipient is, then hand the message to that specific connection server to push.

        ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
        │  Phone A     │   │  Web (A)    │   │  Phone B     │
        │ (WebSocket)  │   │ (WebSocket) │   │ (WebSocket)  │
        └──────┬───────┘   └──────┬──────┘   └──────┬───────┘
               │   persistent WSS │                  │
        ┌──────▼──────────────────▼──────────────────▼───────┐
        │              Load Balancer (L4, sticky)             │
        └──────┬──────────────────┬──────────────────┬───────┘
               │                  │                  │
        ┌──────▼──────┐    ┌──────▼──────┐    ┌──────▼──────┐
        │ Conn Server │    │ Conn Server │    │ Conn Server │   ... ~1,000 of these
        │   (gateway) │    │   (gateway) │    │   (gateway) │   each holds ~500K sockets
        └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
               │  registers (user,device)->this server │
        ┌──────▼───────────────────────────────────────▼──────┐
        │        Session Registry (Redis: user/device          │
        │        -> conn_server_id, + presence TTL keys)        │
        └──────────────────────────┬───────────────────────────┘
                                   │ "where is recipient?"
                         ┌─────────▼──────────┐
                         │  Message Service    │  (stateless)
                         │  routes + persists  │
                         └───┬────────────┬────┘
                  enqueue if │            │ persist always
                  offline    │            │
              ┌──────────────▼──┐    ┌────▼────────────────┐
              │  Message Queue   │    │  Message Store       │
              │  (per-recipient  │    │  (sharded by         │
              │   outbox, Kafka/ │    │  recipient/conv;     │
              │   durable)       │    │  pending + sync win) │
              └──────────────────┘    └─────────────────────┘
                         ┌──────────────────────┐
                         │  Presence Service     │  (heartbeats, last-seen)
                         └──────────────────────┘

Send flow (A messages B, both online):

1. Phone A already holds a persistent WebSocket to some connection server. A sends a SEND frame: {client_msg_id, conv_id, recipient_id, payload}.
2. The connection server forwards it to the stateless Message Service.
3. Message Service assigns a server-side message_id (time-sortable), persists the message to the Message Store (durability first - never ack before it is safe), and immediately acks A with sent (one tick). This ack carries the canonical message_id.
4. Message Service asks the Session Registry: “where is B, and on which devices?” The registry returns that B is online on connection server #734 (phone) and #91 (web).
5. Message Service pushes the message to servers #734 and #91, which write it down their respective WebSockets to B’s devices.
6. B’s device, on receiving it, sends back a delivered receipt. This flows back to A and flips A’s UI to two ticks. When B opens the chat, B’s device sends a read receipt -> two blue ticks on A.

Send flow (B is offline):

1-3. Identical to the online flow: persist, then ack A with sent.
4. Registry lookup says B is offline (no live session). Message Service enqueues the message into B’s durable outbox (per-recipient queue) and leaves it persisted.
5. When B reconnects, B’s connection server registers B in the Session Registry and triggers a sync: drain B’s outbox, push all pending messages in order, B acks each as delivered, and the receipts propagate back to the senders.

The key insight: the connection layer is stateful and the routing layer is stateless, glued by the Session Registry. Online delivery is a registry lookup plus a direct push; offline delivery is a durable queue drained on reconnect. Both paths persist first, so nothing is ever lost.
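Both flows can be sketched in a few lines of Python. This is a toy, in-memory model under loud assumptions: the class name MessageService, its fields, and the reconnect method are hypothetical stand-ins for the Message Store, Session Registry, and outbox, and a "push" is just a tuple appended to a list. Removing outbox entries on the delivered ack is left out to keep it short.

```python
import itertools


class MessageService:
    """Toy routing layer: persist, ack, then push or enqueue (illustrative only)."""

    def __init__(self):
        self.store = {}      # message_id -> message; stands in for the Message Store
        self.sessions = {}   # user -> {device: conn_server_id}; stands in for the registry
        self.outbox = {}     # user -> [message_id, ...]; stands in for the durable outbox
        self.pushed = []     # (conn_server_id, device, message_id) pushes we "sent"
        self._ids = itertools.count(1)

    def send(self, sender, recipient, payload):
        message_id = next(self._ids)
        # Persist first: the ack is only built after the write has happened.
        self.store[message_id] = {
            "sender": sender, "recipient": recipient, "payload": payload,
        }
        ack = {"type": "ACK", "message_id": message_id, "state": "SENT"}

        live = self.sessions.get(recipient, {})
        if live:
            # Online: one direct push per device the registry knows about.
            for device, server in live.items():
                self.pushed.append((server, device, message_id))
        else:
            # Offline: park the message until the recipient reconnects.
            self.outbox.setdefault(recipient, []).append(message_id)
        return ack

    def reconnect(self, user, device, server):
        """Register the new session, then replay pending messages in order."""
        self.sessions.setdefault(user, {})[device] = server
        pending = list(self.outbox.get(user, []))
        for message_id in pending:
            self.pushed.append((server, device, message_id))
        return pending
```

The point of the sketch is the ordering inside send: the write to the store happens before anything else, so both the online and offline branches start from a persisted message.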

Component Deep Dive

The hard parts, naive-first then evolved: (1) persistent connections and the session registry, (2) guaranteed delivery and the tick state machine, (3) presence at scale, (4) multi-device sync.

1. Persistent connections and the session registry

Naive approach: HTTP polling. The client asks “any new messages?” every few seconds over normal HTTP.

loop every 3s:
    GET /messages?since=<last_seen_ts>

Where it breaks:

Latency vs load tradeoff is unwinnable. Poll every 3s and messages are up to 3s late (chat feels broken). Poll every 200ms for snappiness and you have 500M clients each firing 5 requests/sec = 2.5B requests/sec of mostly-empty polls. You are paying full HTTP request cost (connection setup, headers, auth) to learn “nothing happened” 99% of the time.
No server push. The fundamental mismatch: the server knows the instant a message exists but cannot tell anyone until they next ask.
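The tradeoff is easy to put numbers on. A minimal sketch, assuming every one of the 500M online clients polls at the same fixed interval (the function name polling_load is made up for illustration):

```python
def polling_load(clients: int, interval_seconds: float) -> float:
    """Requests per second generated by `clients` polling every `interval_seconds`."""
    return clients / interval_seconds


ONLINE_CLIENTS = 500_000_000

for interval in (3.0, 1.0, 0.2):
    rps = polling_load(ONLINE_CLIENTS, interval)
    # Worst-case added latency equals the polling interval itself.
    print(f"poll every {interval:>4}s -> {rps / 1e6:,.0f}M requests/sec, "
          f"up to {interval}s late")
```

Shrinking the interval buys latency linearly and costs request volume linearly, which is why there is no good setting.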

First evolution: long polling. The client opens a request and the server holds it open until a message arrives or a timeout hits, then responds and the client immediately re-opens. This removes empty polls and gives near-instant delivery. It is a real, workable answer and worth stating as the stepping stone. But: every held request still ties up a server thread/connection, reconnection churn is high, and it is one-directional (server->client); the client still needs separate requests to send. For a system where messages flow constantly in both directions, you want a true bidirectional channel.

The answer: WebSockets. One TCP connection, upgraded from HTTP once (Upgrade: websocket), then kept open for the life of the session. Both sides can send frames at any time with almost no per-message overhead (a few bytes of framing, no headers, no re-auth). This is the natural shape for chat: full-duplex, low-latency, server-initiated push built in.

Client                          Connection Server
  │  HTTP GET Upgrade: websocket  │
  │ ─────────────────────────────▶│
  │  101 Switching Protocols      │
  │ ◀─────────────────────────────│
  │  ===== persistent WSS ======= │
  │  SEND frame ─────────────────▶│
  │ ◀──────────── ACK(sent) frame │
  │ ◀──────────── PUSH(msg) frame │   (server-initiated, no request needed)
  │  RECEIPT(delivered) ─────────▶│

Now the hard part: where does a message-for-B actually go? B’s socket lives on exactly one of 1,000 connection servers, and the Message Service has no idea which. This is the session registry problem.

Naive: broadcast. When a message for B arrives, ask all 1,000 connection servers “do you have B?” That is a 1,000-way fan-out per message at 3.5M messages/sec. It melts instantly.
The fix: a session registry. When B’s WebSocket connects, the connection server writes B/device -> server_734 into a fast shared store (Redis). When a message for B arrives, the Message Service does a single GET on each of B’s device keys to find server #734, then pushes directly there. O(1) lookup per device, one hop.

On connect:   registry.set("sess:{user}:{device}", conn_server_id, TTL=heartbeat*3)
On message:   server = registry.get("sess:{recipient}:{device}")
              if server: push_to(server, message)     # online
              else:      enqueue_outbox(recipient, message)  # offline
On disconnect: registry.del(...)  (or let TTL expire if the server crashed)

The push from the Message Service to connection server #734 goes over an internal RPC mesh or a lightweight internal message bus, not the public internet. The registry entry carries a TTL refreshed by heartbeats so that if a connection server crashes, its stale entries expire and the client’s reconnect (to a different server) re-registers cleanly. The load balancer at L4 is connection-sticky but does not need application stickiness, because the registry, not the LB, is the source of truth for “where is B.”
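Here is a minimal in-memory sketch of the registry semantics, assuming a 90s TTL (three 30s heartbeats). The class SessionRegistry and its method names are hypothetical; a real deployment would use the shared store's own key expiry rather than this dictionary, and the clock is injected only so the expiry can be tested without sleeping.

```python
import time


class SessionRegistry:
    """(user, device) -> conn_server_id, with entries that expire unless refreshed."""

    def __init__(self, ttl_seconds=90, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}   # (user, device) -> (conn_server_id, expires_at)

    def register(self, user, device, conn_server_id):
        """Called on connect and again on every heartbeat to refresh the TTL."""
        self._entries[(user, device)] = (conn_server_id, self._clock() + self._ttl)

    def lookup(self, user, device):
        """Return the server holding the socket, or None if offline or expired."""
        entry = self._entries.get((user, device))
        if entry is None:
            return None
        conn_server_id, expires_at = entry
        if self._clock() >= expires_at:
            # Stale entry from a crashed server or a silently dead socket.
            del self._entries[(user, device)]
            return None
        return conn_server_id

    def unregister(self, user, device):
        """Called on a clean disconnect; crashes rely on the TTL instead."""
        self._entries.pop((user, device), None)
```

If the connection server dies without calling unregister, lookups start returning None once the TTL lapses, and the message falls through to the offline path.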

2. Guaranteed delivery and the tick state machine

Naive approach: fire and forget. Message Service receives the message and pushes it down B’s socket. Done.

Where it breaks:

TCP delivered != application received. The OS may have buffered the bytes, then B’s phone dies, or the app is killed, or the socket silently half-opens. The message is gone and nobody knows. “Delivered” at the TCP layer is not “delivered” to the user.
No durability before ack. If you ack the sender before persisting and the Message Service crashes, the message vanishes but A thinks it sent. Unacceptable - “no message is ever lost” is a hard NFR.
Ordering. Two messages from A to B over slightly different paths can arrive out of order; the chat looks scrambled.

The fix is a persist-then-ack pipeline with application-level acknowledgments at every hop, plus per-conversation sequencing.

Durability first. The Message Service persists the message to the Message Store before acking the sender. Only after the write is durable does A get the sent tick. If the service crashes mid-flight, A never got the ack, A’s client retries with the same client_msg_id, and the server dedups. At-least-once delivery + dedup by ID = effectively exactly-once at the client.
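A small sketch of the retry path, assuming the server keeps an index from (sender, client_msg_id) to the message_id it already assigned. The class IngestService and the in-memory dictionaries are illustrative stand-ins for a durable store, not a real implementation.

```python
class IngestService:
    """Persist-then-ack with idempotent retries keyed on client_msg_id."""

    def __init__(self):
        self.store = {}           # message_id -> payload (stand-in for the durable store)
        self._by_client_id = {}   # (sender, client_msg_id) -> message_id
        self._next_id = 1

    def _ack(self, client_msg_id, message_id):
        return {"type": "ACK", "client_msg_id": client_msg_id,
                "message_id": message_id, "state": "SENT"}

    def handle_send(self, sender, client_msg_id, payload):
        key = (sender, client_msg_id)
        if key in self._by_client_id:
            # A retry of something already persisted: re-ack, do not store twice.
            return self._ack(client_msg_id, self._by_client_id[key])

        message_id = self._next_id
        self._next_id += 1
        self.store[message_id] = payload      # the durable write comes first
        self._by_client_id[key] = message_id
        return self._ack(client_msg_id, message_id)   # only now does A see one tick
```

A client that never saw the first ack simply resends the same frame and gets back the same message_id.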

The tick state machine is the receipt protocol made explicit. Every transition is driven by an application-level ack from the next hop, not by TCP:

   A taps send
        │
        ▼
   [PENDING]  (clock icon on A's phone; not yet on server)
        │  server persists + acks
        ▼
   [SENT]     ✓   one grey tick   (server has durable copy)
        │  pushed to B's device, B's app acks "delivered"
        ▼
   [DELIVERED] ✓✓ two grey ticks  (on B's device, not yet seen)
        │  B opens the chat, B's app sends "read"
        ▼
   [READ]     ✓✓ two blue ticks   (B actually viewed it)

Each receipt is itself a small message that flows back through the same routing machinery (registry lookup of the original sender, push or enqueue). Receipts are persisted too, so if A is offline when B reads, A learns about the blue ticks on reconnect.
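One way to keep the sender's UI consistent is to treat the states as an ordered scale and only ever move forward. The sketch below assumes that rule (it is my simplification, not something the protocol mandates): a duplicated or late receipt is ignored, and a READ that arrives before DELIVERED jumps straight to READ, since reading implies delivery.

```python
from enum import IntEnum


class Tick(IntEnum):
    PENDING = 0     # clock icon, not yet on the server
    SENT = 1        # one grey tick
    DELIVERED = 2   # two grey ticks
    READ = 3        # two blue ticks


def apply_receipt(current: Tick, incoming: Tick) -> Tick:
    """Advance the tick state; never regress on stale or duplicate receipts."""
    return max(current, incoming)


state = Tick.PENDING
for receipt in (Tick.SENT, Tick.READ, Tick.DELIVERED):   # DELIVERED arrives late
    state = apply_receipt(state, receipt)
print(state.name)
```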

Per-conversation ordering. Assign each message a monotonic per-conversation sequence number (or rely on a time-sortable message_id scoped to the conversation). The recipient’s client orders by that sequence, not by arrival time, so out-of-order network delivery still renders correctly. We do not need global ordering across conversations - only within a conversation - which lets us shard by conversation without a global clock.
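On the client this amounts to keeping each conversation sorted by sequence number as pushes arrive. A minimal sketch, with a hypothetical ConversationView class and the assumption that seq is unique within a conversation:

```python
import bisect


class ConversationView:
    """Keeps one conversation's messages ordered by seq, whatever the arrival order."""

    def __init__(self):
        self._messages = []   # sorted list of (seq, payload)
        self._seen = set()    # seqs already inserted, to drop duplicates

    def on_push(self, seq, payload):
        if seq in self._seen:
            return            # a re-pushed message; already on screen
        self._seen.add(seq)
        bisect.insort(self._messages, (seq, payload))

    def rendered(self):
        return [payload for _, payload in self._messages]


view = ConversationView()
for seq, payload in [(2, "world"), (1, "hello"), (3, "!"), (2, "world")]:
    view.on_push(seq, payload)
print(view.rendered())
```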

Retries and dedup. Undelivered messages sit in the recipient’s durable outbox and are re-pushed on every reconnect until the recipient acks delivered. Because re-pushes can duplicate, the client dedups on message_id. The outbox entry is removed only after the delivered ack. This is a classic at-least-once queue with idempotent consumers.
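The at-least-once loop can be sketched as two small pieces: an outbox that keeps an entry until it is acked, and a client that shows each message_id once. DeviceOutbox and Client are hypothetical names, and the in-memory dict stands in for a durable queue.

```python
class DeviceOutbox:
    """Pending messages for one device; an entry leaves only on a delivered ack."""

    def __init__(self):
        self._pending = {}   # message_id -> payload; dicts keep insertion order

    def enqueue(self, message_id, payload):
        self._pending[message_id] = payload

    def replay(self):
        """Everything still unacked, in enqueue order (called on each reconnect)."""
        return list(self._pending.items())

    def ack_delivered(self, message_id):
        self._pending.pop(message_id, None)

    def __len__(self):
        return len(self._pending)


class Client:
    """Idempotent consumer: renders each message_id at most once."""

    def __init__(self):
        self._seen = set()
        self.screen = []

    def on_push(self, message_id, payload):
        if message_id not in self._seen:
            self._seen.add(message_id)
            self.screen.append(payload)
        return message_id    # always ack, so a duplicate still clears the outbox
```

If an ack is lost in transit, the next replay pushes the message again, the client ignores the duplicate on screen, and the repeated ack finally removes the entry.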

Guarantee          Naive fire-and-forget   Persist-then-ack + outbox
--------------------------------------------------------------------------------------
Lost on crash      Yes, silently           No - persisted before ack, retried from outbox
Offline recipient  Message dropped         Queued, delivered on reconnect
Duplicates         Possible, unhandled     Possible but deduped by message_id
Ordering           Arrival order (wrong)   Per-conversation sequence (correct)
Sender feedback    None                    Full tick state machine

3. Presence at scale (online / last-seen / typing)

Naive approach: a presence table you write on every change. UPDATE users SET online=true WHERE id=B on connect, false on disconnect, and clients poll it.

Where it breaks:

Crashes leave ghosts. If a connection server dies, it never runs the “set offline” code, so B shows online forever. Presence built on explicit disconnect events is unreliable because the most common “disconnect” (network drop, crash) sends no event.
Write storms. 500M users connecting/disconnecting as phones flap between wifi and cellular means millions of presence writes/sec to a database. And presence is fan-out heavy on read: when B comes online, everyone with B’s chat open wants to know. Naively that is a broadcast to all of B’s contacts.

The fix: heartbeat-driven presence with TTLs, and subscribe-based fan-out.

Heartbeats + TTL, not explicit events. The client sends a lightweight heartbeat every ~30s over the existing WebSocket. The Presence Service stores presence:{user} -> last_heartbeat as a Redis key with a TTL of ~3 heartbeats (~90s). Online is derived: the key exists -> online; the key expired -> offline, and last_seen = last heartbeat time.

on heartbeat from B:   redis.set("presence:B", now, EX=90)
is B online?           redis.exists("presence:B")
B's last seen:         the last heartbeat time, persisted separately (an expired key has no value left to read)

This self-heals: a crashed connection server stops B’s heartbeats, the key expires in 90s, B goes offline automatically. No reliance on a clean disconnect ever firing. The cost is up to ~90s of staleness on “offline” after an unclean disconnect. That is fine for last-seen, which is allowed to lag, but it overshoots the few-second online/offline target in the NFRs. To close the gap, delete the key explicitly on a clean disconnect and keep the TTL as the fallback for crashes, or tune the TTL down at the cost of more heartbeat traffic.
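The derived-online idea fits in a few lines. In this sketch a plain dictionary plays the role of both the TTL key and the persisted last-seen value, which is a simplification; PresenceService and its methods are hypothetical names, and the clock is injectable so expiry can be exercised without waiting.

```python
import time


class PresenceService:
    """Online is derived from heartbeat age; nothing is written on disconnect."""

    def __init__(self, ttl_seconds=90, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_heartbeat = {}   # user -> timestamp of the latest heartbeat

    def heartbeat(self, user):
        self._last_heartbeat[user] = self._clock()

    def is_online(self, user):
        ts = self._last_heartbeat.get(user)
        # No heartbeat ever, or one older than the TTL, both mean offline.
        return ts is not None and self._clock() - ts < self._ttl

    def last_seen(self, user):
        """Time of the last heartbeat, or None if the user never connected."""
        return self._last_heartbeat.get(user)
```

Nothing in the class handles a disconnect, which is the point: a user whose heartbeats stop for any reason simply ages out.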

Fan-out only to interested watchers. Do not broadcast B’s status to all of B’s contacts. Instead, presence is subscribe-on-demand: when A opens a chat with B, A’s client subscribes to B’s presence; the Presence Service pushes B’s status changes only to the small set of users currently viewing a chat with B. When A closes the chat, A unsubscribes. This turns an O(contacts) broadcast into O(active viewers), which is tiny.
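A sketch of the subscription bookkeeping, assuming the Presence Service keeps a map from each watched user to the set of users currently viewing a chat with them. PresenceFanout and its method names are invented for illustration, and publish returns the events instead of pushing them over a socket.

```python
from collections import defaultdict


class PresenceFanout:
    """Status changes go only to users who subscribed by opening the chat."""

    def __init__(self):
        self._watchers = defaultdict(set)   # watched user -> set of watching users

    def subscribe(self, watcher, target):
        self._watchers[target].add(watcher)

    def unsubscribe(self, watcher, target):
        watchers = self._watchers.get(target)
        if watchers is not None:
            watchers.discard(watcher)
            if not watchers:
                del self._watchers[target]   # keep the map small

    def publish(self, target, state):
        """Return (watcher, event) pairs; cost is O(active viewers), not O(contacts)."""
        event = {"type": "PRESENCE_EVENT", "user_id": target, "state": state}
        return [(w, event) for w in sorted(self._watchers.get(target, ()))]
```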

Typing indicators are pure ephemeral signaling: a typing event sent over the WebSocket, relayed to the other party (or group members currently viewing the chat), never persisted, auto-expiring after a few seconds if no “stopped typing” follows. Best-effort - if one is lost, nobody cares. Keeping it off the durable path keeps the firehose of typing events away from the storage layer entirely.

4. Multi-device sync

This is the part people underestimate. One account on a phone + web + desktop, all needing the same messages and the same read state. WhatsApp’s real model: each linked device is a first-class endpoint with its own session and its own encryption identity.

Naive approach: treat the account as one mailbox. Deliver each message once to “the user,” whichever device grabs it first. Mark it read for the account.

Where it breaks:

A message read on the phone must show as read on the web too, and vice versa, or the UX is maddening. A single account-level mailbox where one device consumes the message means the other devices never see it.
A device offline for a week must catch up on everything it missed without re-downloading the entire history.
Receipts must converge. If you read on your phone, your laptop should stop showing the unread badge.

The fix: model each device as its own delivery target with per-device cursors, and treat the account as a set of devices.

Per-device fan-out. The Session Registry keys are user/device, not just user. When a message arrives for B, the Message Service looks up all of B’s registered devices and delivers a copy to each (online ones pushed now, offline ones queued in that device’s outbox). So “send to B” is internally a small fan-out to B’s device set.

deliver_to(user B, message m):
    devices = registry.devices_of(B)          # [phone, web, desktop]
    for d in devices:
        if online(B, d): push(server_of(B,d), m)
        else:            enqueue_outbox(B, d, m)   # per-device outbox

Per-device sync cursor. Each device tracks the message_id (or per-conversation sequence) up to which it has acked delivery. On reconnect, the device sends its cursor and the server streams everything after it, in order. A week-offline laptop just resumes from its cursor - bounded, incremental, no full re-download.
Read-state convergence. A read receipt is itself a synced event. When B reads a chat on the phone, the phone emits a read event that the server fans out to B’s other devices too (not just to the original sender A). Each device applies it and clears its unread badge. Read state is thus eventually consistent across the account, converging within a sync cycle.
A note on encryption. Because each device has its own keys, the sender actually encrypts the message once per recipient device (the Signal protocol’s pairwise, per-device session model; “sender keys” are the separate optimization used for groups). The server still just routes ciphertext blobs to device endpoints - it never decrypts - so from the routing design’s point of view, multi-device is “fan out to N device mailboxes.” The encryption detail explains why it is per-device and not a shared account mailbox, but does not change the routing topology.

The mental shift: the unit of delivery is a device, not a user. Once you model it that way, multi-device, offline catch-up, and cross-device read state all reduce to “the same message/receipt machinery, fanned out across a device set with per-device cursors.”
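The cursor mechanics are small enough to show directly. This sketch assumes message_ids increase over time and models one account's stream as a list; AccountLog and its methods are hypothetical, and the per-device outbox is collapsed into "everything after this device's cursor" for brevity.

```python
class AccountLog:
    """One account's message stream plus a delivery cursor per linked device."""

    def __init__(self):
        self._log = []       # [(message_id, payload)] in ascending message_id order
        self._cursors = {}   # device -> highest message_id acked as delivered

    def append(self, message_id, payload):
        self._log.append((message_id, payload))

    def sync(self, device):
        """What a reconnecting device still needs: everything after its cursor."""
        cursor = self._cursors.get(device, 0)
        return [(mid, payload) for mid, payload in self._log if mid > cursor]

    def ack(self, device, message_id):
        # Cursors only move forward, so a stale or duplicate ack is harmless.
        self._cursors[device] = max(self._cursors.get(device, 0), message_id)


log = AccountLog()
for mid, text in [(1, "a"), (2, "b"), (3, "c")]:
    log.append(mid, text)
log.ack("phone", 3)     # the phone was online the whole time
log.ack("laptop", 1)    # the laptop went offline after the first message
print(log.sync("laptop"))
```

Each device advances independently, so the laptop resumes from message 2 regardless of what the phone has already seen.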

API Design & Data Schema

Most traffic is WebSocket frames, not REST. REST is used for the connection bootstrap, auth, and history backfill; the real-time path is framed messages over the open socket.

REST (bootstrap and history)

POST /api/v1/auth/connect
  -> issues a short-lived connection token + the WSS endpoint to dial
Response 200: { "ws_url": "wss://chat-edge.wa.net/ws", "token": "..." }

GET /api/v1/conversations/{conv_id}/messages?before=<message_id>&limit=50
  -> paginated history backfill (cursor-based on message_id)
Response 200: { "messages": [...], "next_cursor": "<message_id>" }

POST /api/v1/devices/link      { "device_pubkey": "..." }  -> link a new device
GET  /api/v1/devices                                        -> list linked devices

WebSocket frames (the real-time path)

// client -> server: send a message
{ "type": "SEND", "client_msg_id": "c-abc", "conv_id": "g42",
  "recipient_id": "B", "payload": "<ciphertext>", "ts": 1719... }

// server -> client: acknowledge persistence (one tick)
{ "type": "ACK", "client_msg_id": "c-abc", "message_id": "0x1f3a...",
  "state": "SENT" }

// server -> client: push a new message (server-initiated)
{ "type": "PUSH", "message_id": "0x1f3a...", "conv_id": "g42",
  "sender_id": "A", "seq": 8821, "payload": "<ciphertext>" }

// client -> server: receipts
{ "type": "RECEIPT", "message_id": "0x1f3a...", "state": "DELIVERED" }
{ "type": "RECEIPT", "message_id": "0x1f3a...", "state": "READ" }

// presence + typing (ephemeral)
{ "type": "HEARTBEAT" }                                  // client, every ~30s
{ "type": "PRESENCE_SUB",   "user_id": "B" }             // start watching B
{ "type": "PRESENCE_EVENT", "user_id": "B", "state": "online", "last_seen": null }
{ "type": "TYPING", "conv_id": "g42", "state": "start" } // best-effort

Pagination for history is cursor-based on message_id, not offset - new messages arrive constantly, so offsets would skip or duplicate. Since message_id is time-sortable, the cursor is a stable position in the conversation stream.
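A minimal sketch of the backfill logic, assuming messages are held in ascending message_id order and that the cursor means "strictly older than this id". The function page_before is illustrative and only mirrors the before/limit parameters of the endpoint above.

```python
def page_before(messages, before=None, limit=50):
    """Return (page, next_cursor) for one backfill request.

    `messages` must be sorted ascending by message_id and `limit` must be >= 1.
    The page holds the newest `limit` messages older than `before`.
    """
    older = [m for m in messages if before is None or m["message_id"] < before]
    page = older[-limit:]
    has_more = len(older) > len(page)
    next_cursor = page[0]["message_id"] if has_more else None
    return page, next_cursor


history = [{"message_id": i} for i in range(1, 8)]
page, cursor = page_before(history, limit=3)            # newest three
history.append({"message_id": 8})                       # a new message arrives
older_page, older_cursor = page_before(history, before=cursor, limit=3)
print([m["message_id"] for m in older_page], older_cursor)
```

The new message lands after the cursor, so the second page is unaffected; with an offset it would have shifted every position by one.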

Data store choices

1. Message Store - NoSQL wide-column (Cassandra / ScyllaDB / DynamoDB). The access pattern is “write a message” and “range-scan a conversation’s recent messages by sequence,” at millions of writes/sec, with no cross-entity transactions and no joins. That is exactly a distributed wide-column store - high write throughput, horizontal scale, tunable consistency. A single relational DB cannot sustain 3.5M writes/sec or hold the data.

Table: messages
  conv_id      STRING   PARTITION KEY        -- all messages of one conversation together
  seq          BIGINT   CLUSTERING KEY ASC   -- per-conversation monotonic sequence
  message_id   BIGINT                        -- global time-sortable id (Snowflake)
  sender_id    BIGINT
  payload      BLOB                           -- ciphertext; server cannot read it
  created_at   TIMESTAMP
  Shard by: conv_id   -> a chat's history is one partition, range-scanned by seq
  Read pattern: "messages in conv g42 after seq 8800" = single-partition range scan

Sharding by conv_id keeps every conversation’s messages co-located and ordered, so history reads and the per-conversation sequence are both single-partition operations. A very hot group (1,000 members blasting messages) is the one hot partition. Mitigate it by capping group size and by storing each group message once in this ordered partition, so fan-out reads from it instead of writing a copy per member.

2. Per-device Outbox / pending queue - durable queue (Kafka, or a per-recipient table). Holds messages for offline devices until acked delivered.

Table: outbox
  user_id      BIGINT
  device_id    BIGINT
  PARTITION KEY = (user_id, device_id)        -- one queue per device
  message_id   BIGINT   CLUSTERING KEY ASC     -- ordered, drained on reconnect
  enqueued_at  TIMESTAMP
  Read pattern: "drain everything for (B, web) after cursor" on reconnect
  Delete: on DELIVERED receipt from that device

Sharding by (user_id, device_id) gives each device an independent, ordered queue - exactly what per-device sync cursors need.

3. Session Registry + Presence - Redis (in-memory, replicated). sess:{user}:{device} -> conn_server_id with heartbeat TTL, and presence:{user} -> last_heartbeat with TTL. Both are small, hot, and ephemeral - the perfect fit for an in-memory store, and rebuildable (a lost registry just forces clients to re-register on their next heartbeat).

Why NoSQL/Redis over SQL across the board: the workload is partition-scoped, transaction-free, write-heavy, and demands horizontal scale to billions of rows and millions of ops/sec, plus an in-memory tier for connection routing. SQL’s joins and ACID transactions are features we do not need; its single-writer scaling ceiling is exactly the wall we would hit. There is no billing ledger or balance here that would justify strong relational consistency.

Bottlenecks & Scaling

Where it breaks first, in order, and the fix for each:

1. Holding 500M concurrent connections (breaks first). A normal app server holds a few thousand connections; you need ~500K per box. Fix: dedicated, thin connection servers tuned for IO, not CPU - epoll/edge-triggered event loops (e.g. an Erlang/Elixir BEAM, Go, or C++ stack), raised file-descriptor limits, tuned kernel socket buffers, and minimal per-connection memory. Scale horizontally to ~1,000 boxes. These do no business logic; they hold sockets and relay frames.

2. Finding which server holds the recipient. Cannot broadcast to 1,000 servers per message. Fix: the Session Registry (Redis), user/device -> conn_server_id, single GET per delivery. TTL-refreshed by heartbeats so crashed servers’ entries self-expire.

3. Message write throughput (3.5M/sec peak). Fix: shard the Message Store by conv_id; writes spread across the partition space. Per-conversation sequence avoids any global counter. Replication factor 3 for durability. Persist-then-ack so durability never depends on a single node staying up.

4. Presence write/read storms. Millions of connect/disconnect flaps and contact-fan-out. Fix: heartbeat + TTL keys (online is derived, self-healing) instead of explicit event writes; and subscribe-on-demand fan-out so status changes go only to users currently viewing that chat, turning O(contacts) into O(active viewers). Typing indicators stay entirely off the durable path.

5. Hot group / hot partition. A 1,000-member active group concentrates writes and fan-out. Fix: cap group size; store the group’s messages in one ordered partition (by conv_id) and fan out references to each member’s session rather than duplicating the payload per member; rate-limit per-sender. For very large broadcast lists, treat fan-out asynchronously through the queue so one chatty group cannot starve 1:1 traffic.

6. Connection server crash = mass disconnect. One box dropping kills ~500K sessions at once. Fix: clients auto-reconnect (with jittered backoff to avoid a thundering herd) to a different connection server via the LB; the new server re-registers them in the registry and triggers outbox sync. Because messages were persisted and queued, nothing is lost - the reconnect just replays from each device’s cursor. Stale registry entries from the dead box expire by TTL.
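A common way to implement the jittered backoff is "full jitter": pick a random delay between zero and an exponentially growing ceiling. The sketch below assumes a 1s base and a 60s cap, which are illustrative values rather than anything the design fixes.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    The ceiling doubles each attempt up to `cap`; the actual delay is drawn
    uniformly from [0, ceiling] so 500K clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


for attempt in range(5):
    print(f"attempt {attempt}: wait {reconnect_delay(attempt):.2f}s")
```

Passing a seeded random.Random instance as rng makes the delays reproducible in tests.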

7. Shard keys, stated plainly. Message Store -> conv_id (a chat’s history is one ordered partition). Outbox -> (user_id, device_id) (independent per-device queues with cursors). Session Registry / Presence -> user_id (point lookups). Sharding messages by created_at would be the classic mistake: it piles all current writes onto the “now” shard, a permanent hot spot. conv_id spreads writes evenly while keeping each conversation ordered and co-located.
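The hot-spot claim is easy to demonstrate with a toy experiment. The sketch assumes 16 shards, hash-based placement for conv_id, and hour-sized buckets for the created_at scheme; the key names and the fixed timestamp are arbitrary.

```python
import hashlib
from collections import Counter

SHARDS = 16
WRITES = 10_000


def shard_for(key: str, shards: int = SHARDS) -> int:
    """Stable hash placement; sha256 is used only to get a well-mixed integer."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shards


# Scheme 1: shard by conv_id. Writes from many conversations spread out.
by_conv = Counter(shard_for(f"conv-{i}") for i in range(WRITES))

# Scheme 2: shard by created_at hour. All writes happening "now" share one bucket.
now = 1_700_000_000   # arbitrary fixed epoch second
by_time = Counter(
    ((now + i // 1000) // 3600) % SHARDS   # the writes span ~10 seconds
    for i in range(WRITES)
)

print("conv_id shards used:   ", len(by_conv), "busiest:", max(by_conv.values()))
print("created_at shards used:", len(by_time), "busiest:", max(by_time.values()))
```

With conv_id the busiest shard takes roughly its fair share of writes; with the time bucket a single shard takes all of them until the hour rolls over.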

8. At-least-once duplicates. Retries from the outbox can re-deliver a message the device already has. Fix: idempotent clients dedup on message_id; outbox entries are deleted only on the delivered ack. The combination yields effectively exactly-once at the screen.
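The interplay between outbox retries and client-side dedup can be sketched as follows. `DeviceOutbox` and `ClientInbox` are hypothetical names, and the lost ack is simulated by simply not calling `ack_delivered` after the first push.

```python
class DeviceOutbox:
    """Hypothetical per-device queue: entries leave only on a delivered ack."""

    def __init__(self) -> None:
        self._pending = {}  # message_id -> payload (dicts keep insertion order)

    def enqueue(self, message_id: str, payload: str) -> None:
        self._pending[message_id] = payload

    def pending(self) -> list:
        return list(self._pending.items())

    def ack_delivered(self, message_id: str) -> None:
        self._pending.pop(message_id, None)


class ClientInbox:
    """Hypothetical client side: renders each message_id at most once."""

    def __init__(self) -> None:
        self._seen = set()
        self.rendered = []

    def receive(self, message_id: str, payload: str) -> bool:
        if message_id in self._seen:
            return False  # duplicate push, dropped before it reaches the screen
        self._seen.add(message_id)
        self.rendered.append(payload)
        return True


outbox, client = DeviceOutbox(), ClientInbox()
outbox.enqueue("m1", "hello")

# First push arrives, but the delivered ack is lost, so the entry stays queued.
for message_id, payload in outbox.pending():
    client.receive(message_id, payload)

# Retry: the same message is pushed again; this time the ack gets through.
for message_id, payload in outbox.pending():
    client.receive(message_id, payload)
    outbox.ack_delivered(message_id)
```

The server pushed the message twice, yet the client rendered it once and the outbox is empty only after the ack, which is the at-least-once-plus-dedup behaviour described above.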

9. Single points of failure. Connection servers are interchangeable (state lives in the registry, not the box). Message Service is stateless behind the LB. Registry, queue, and store are all replicated. There is no single box whose loss stops messages from flowing - at worst a slice of users reconnects and resyncs.

10. Multi-region latency for a global user base. Fix: deploy connection servers and registries per region, and route via GeoDNS to the nearest edge. Traffic between two users in the same region never leaves it. Cross-region chats hop between regional message buses; because per-conversation ordering (not global) is the only requirement, asynchronous cross-region replication is fine. A user’s home region owns their outbox and serves their reconnect sync.

Wrap-Up

The trade-offs that define this design:

WebSockets over polling/long-polling. Chat is server-initiated and bidirectional; we pay the cost of holding 500M stateful connections to get true low-latency push, because no request/response shape can deliver instantly without melting under empty polls.
Stateful connection layer + stateless routing, glued by a session registry. We keep the expensive statefulness (open sockets) in thin, interchangeable gateway boxes and put a fast user/device -> server registry in front, so routing a message is one lookup and one hop, not a broadcast.
Persist-then-ack with at-least-once + dedup. We never acknowledge before the message is durable, queue everything for offline devices, and dedup by ID at the client - trading some duplicate pushes for a hard no-loss guarantee and the full sent/delivered/read tick state machine.
Heartbeat-derived presence with subscribe-on-demand fan-out. We make online a derived, self-healing fact (TTL keys) instead of trusting clean disconnects, and we push status only to users actually watching - trading up to ~90s of last-seen staleness for a system that never shows ghosts and never broadcasts to millions.
The device, not the user, is the unit of delivery. Per-device sessions, outboxes, and sync cursors make multi-device, offline catch-up, and cross-device read state all fall out of the same message machinery - and they line up with the per-device encryption model.

One-line summary: a push-oriented, write-heavy messaging system where every online device holds a persistent WebSocket to a thin gateway tracked in a session registry, messages are persisted before they are acked and queued per-device for offline delivery with at-least-once-plus-dedup guarantees and a sent/delivered/read tick state machine, presence is derived from self-healing heartbeat TTLs fanned out only to active watchers, and the whole thing shards by conversation and device to move 3.5M messages/sec across 500M live connections under 500ms.
