1. Requirements & Scope (5 min)

Functional Requirements

  1. Users can create, edit, and delete documents with rich-text formatting (bold, italic, headings, lists, tables, images)
  2. Multiple users can simultaneously edit the same document in real time, seeing each other’s changes within ~200ms
  3. Each collaborator’s cursor position and selection are visible to all other editors (presence awareness)
  4. Full version history — users can view, name, and restore any previous version of the document
  5. Sharing and permissions model: owner, editor, commenter, viewer roles with link-sharing and per-user ACLs

Non-Functional Requirements

  • Availability: 99.99% — documents are a user’s primary work artifact. Downtime during working hours is extremely costly.
  • Latency: < 200ms for local edit acknowledgment; < 500ms for remote edit propagation to other collaborators
  • Consistency: Eventual consistency for document state, but operations must converge — all users see the identical document after quiescence. No lost edits, ever.
  • Scale: 1B documents total, 10M DAU, up to 100 concurrent editors per document, peak 500K concurrent editing sessions
  • Durability: Zero data loss. Every keystroke must be persisted. Point-in-time recovery for any document.

2. Estimation (3 min)

Traffic

  • 10M DAU, average 3 editing sessions/day, average session 20 minutes
  • Average 2 operations/second per active user (character insert, delete, format change)
  • Concurrent active editors: 10M x 3 sessions x 20 min / 1,440 min/day ≈ 420K on average; ~500K at peak (matching the NFR)
  • Write QPS: 500K peak editors x 2 ops/sec = 1M operations/sec
  • Read QPS: Document opens: 10M x 3 = 30M/day = ~350 reads/sec (bursty, 3x peak = ~1K/sec). Presence/cursor updates arrive at roughly the same rate as writes.
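
These figures can be sanity-checked with Little’s law. A quick back-of-envelope script, using only the assumptions stated above:

```python
# Back-of-envelope check of the traffic numbers, from the stated assumptions.
DAU = 10_000_000
SESSIONS_PER_USER = 3          # editing sessions per day
SESSION_MINUTES = 20
OPS_PER_SEC = 2                # per active editor
PEAK_CONCURRENT = 500_000      # from the NFRs (peak concurrent sessions)

# Little's law: average concurrency = total session-minutes / minutes per day
avg_concurrent = DAU * SESSIONS_PER_USER * SESSION_MINUTES / (24 * 60)
peak_write_qps = PEAK_CONCURRENT * OPS_PER_SEC
avg_open_qps = DAU * SESSIONS_PER_USER / 86_400

print(f"avg concurrent editors: {avg_concurrent:,.0f}")  # ~417K
print(f"peak write QPS:        {peak_write_qps:,}")      # 1M
print(f"avg doc-open QPS:      {avg_open_qps:,.0f}")     # ~350
```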

Storage

  • 1B documents, average 50KB per document (plain text + formatting metadata)
  • Document content: 1B x 50KB = 50TB
  • Operation log (for version history): assume 500 ops per document per day, 10 bytes per op average, retained for 1 year
    • Active documents per day: ~30M (10M DAU x 3 sessions). 30M x 500 x 10B x 365 = ~55TB/year of operation logs
  • Total: ~105TB primary storage, replicated 3x = ~315TB
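
A quick check of the storage arithmetic (decimal units; the 10B/op figure is the assumption above):

```python
# Sanity check of the storage estimate under the stated assumptions.
DOCS = 1_000_000_000
DOC_SIZE = 50_000                     # 50KB average per document
content_bytes = DOCS * DOC_SIZE       # document content

ACTIVE_DOCS_PER_DAY = 30_000_000
OPS_PER_DOC_PER_DAY = 500
OP_SIZE = 10                          # bytes per serialized op (assumption)
oplog_bytes_per_year = ACTIVE_DOCS_PER_DAY * OPS_PER_DOC_PER_DAY * OP_SIZE * 365

TB = 10 ** 12
print(f"content : {content_bytes / TB:.0f} TB")          # 50 TB
print(f"op log  : {oplog_bytes_per_year / TB:.0f} TB/y") # ~55 TB/year
print(f"total x3: {(content_bytes + oplog_bytes_per_year) * 3 / TB:.0f} TB")
```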

Key Insight

This is a write-heavy, real-time collaboration system. The core challenge is not storage or throughput — it’s ensuring that concurrent edits from multiple users converge to the same document state without conflicts or lost updates. The algorithm choice (OT vs CRDT) dominates the design.


3. API Design (3 min)

REST Endpoints (Document CRUD)

POST   /api/documents
  Body: { "title": "Untitled" }
  Response 201: { "doc_id": "d_abc123", "created_at": "..." }

GET    /api/documents/{doc_id}
  Response 200: { "doc_id": "...", "title": "...", "content": <snapshot>, "version": 4872, "permissions": {...} }

DELETE /api/documents/{doc_id}
  Response 204

GET    /api/documents/{doc_id}/history?from_version=100&to_version=200
  Response 200: { "versions": [{ "version": 101, "timestamp": "...", "author": "...", "ops": [...] }, ...] }

POST   /api/documents/{doc_id}/restore
  Body: { "target_version": 150 }
  Response 200: { "version": 4873, "content": <snapshot> }

PUT    /api/documents/{doc_id}/share
  Body: { "user_id": "u_xyz", "role": "editor" }
  Response 200

WebSocket Protocol (Real-Time Editing)

// Client connects
WS /ws/documents/{doc_id}?token=<jwt>

// Client → Server: send operation
{ "type": "op", "client_id": "c_1", "parent_version": 4871, "ops": [{ "type": "insert", "pos": 42, "text": "hello" }] }

// Server → Client: broadcast transformed operation
{ "type": "op", "server_version": 4872, "author": "u_abc", "ops": [{ "type": "insert", "pos": 45, "text": "hello" }] }

// Server → Client: presence update
{ "type": "presence", "user_id": "u_abc", "cursor": { "pos": 47, "selection_end": 47 }, "color": "#e06c75" }

// Client → Server: cursor update
{ "type": "cursor", "pos": 50, "selection_end": 55 }

Key Decisions

  • Operations are sent as deltas (not full document snapshots) to minimize bandwidth
  • Each operation references a parent_version so the server can determine which concurrent ops need transformation
  • Presence (cursors) is sent on a separate channel/message type — it’s ephemeral and does not need persistence
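
To make the wire format concrete, here is a minimal Python sketch of the client → server op message, assuming plain JSON over the WebSocket. Field names follow the examples above; the transport itself and the other message types are omitted:

```python
# Encode/decode sketch for the "op" message; the dataclass names are
# illustrative, only the JSON field names come from the protocol above.
import json
from dataclasses import dataclass, asdict

@dataclass
class InsertOp:
    type: str
    pos: int
    text: str

@dataclass
class OpMessage:
    type: str
    client_id: str
    parent_version: int
    ops: list

def encode(msg: OpMessage) -> str:
    return json.dumps(asdict(msg))          # asdict recurses into nested ops

def decode(frame: str) -> OpMessage:
    raw = json.loads(frame)
    raw["ops"] = [InsertOp(**o) for o in raw["ops"]]
    return OpMessage(**raw)

frame = encode(OpMessage("op", "c_1", 4871, [InsertOp("insert", 42, "hello")]))
msg = decode(frame)
assert msg.parent_version == 4871 and msg.ops[0].text == "hello"
```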

4. Data Model (3 min)

Documents Table (PostgreSQL)

Field             Type           Notes
doc_id            UUID (PK)      Globally unique
owner_id          UUID (FK)      Creator
title             VARCHAR(500)   Document title
current_version   BIGINT         Monotonically increasing
content_snapshot  JSONB          Latest full document state (rich text tree)
snapshot_version  BIGINT         Version at which snapshot was taken
created_at        TIMESTAMP
updated_at        TIMESTAMP

Operations Table (Cassandra / ScyllaDB)

Field       Type                       Notes
doc_id      UUID (partition key)
version     BIGINT (clustering key)    Server-assigned monotonic version
author_id   UUID                       Who made this edit
ops         BLOB                       Serialized operation (insert/delete/format)
timestamp   TIMESTAMP

Permissions Table (PostgreSQL)

Field            Type        Notes
doc_id           UUID (PK)
user_id          UUID (PK)
role             ENUM        owner / editor / commenter / viewer
granted_at       TIMESTAMP
link_share_role  ENUM        Role for anyone-with-link (nullable)

Why These DB Choices?

  • PostgreSQL for documents and permissions: strong consistency needed for access control; document metadata is relational and moderate-scale
  • Cassandra/ScyllaDB for operations log: append-heavy, time-series-like workload. Partitioned by doc_id so all ops for one document are co-located. Excellent write throughput. We only need range scans within a partition (get ops from version X to Y).
  • Redis for ephemeral state: cursor positions, active sessions, presence data. TTL-based expiry.

5. High-Level Design (12 min)

System Architecture

                          ┌──────────────────────────┐
                          │      Load Balancer       │
                          │   (L7, sticky sessions   │
                          │     by doc_id hash)      │
                          └────────────┬─────────────┘
                                       │
                    ┌──────────────────┼──────────────────┐
                    │                  │                   │
              ┌─────▼─────┐     ┌─────▼─────┐     ┌─────▼─────┐
              │  API GW   │     │  API GW   │     │  API GW   │
              │ (REST +   │     │ (REST +   │     │ (REST +   │
              │  WS mgmt) │     │  WS mgmt) │     │  WS mgmt) │
              └─────┬─────┘     └─────┬─────┘     └─────┬─────┘
                    │                  │                   │
                    └──────────────────┼──────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              │                        │                        │
       ┌──────▼──────┐         ┌──────▼──────┐         ┌──────▼──────┐
        │ Collab      │         │ Collab      │         │ Collab      │
        │ Server      │         │ Server      │         │ Server      │
        │ (OT Engine) │         │ (OT Engine) │         │ (OT Engine) │
        │             │         │             │         │             │
        │ Owns doc    │         │ Owns doc    │         │ Owns doc    │
        │ partitions  │         │ partitions  │         │ partitions  │
        │ A-F         │         │ G-M         │         │ N-Z         │
        └──────┬──────┘         └──────┬──────┘         └──────┬──────┘
              │                        │                        │
              └────────────────────────┼────────────────────────┘
                                       │
                     ┌─────────────────┼──────────────────┐
                     │                 │                   │
               ┌──────▼──────┐   ┌──────▼──────┐    ┌──────▼──────┐
               │  PostgreSQL │   │  Cassandra  │    │   Redis     │
               │  (docs,     │   │  (ops log)  │    │  (presence, │
               │   perms,    │   │             │    │   sessions, │
               │   metadata) │   │             │    │   cursors)  │
               └─────────────┘   └─────────────┘    └─────────────┘

Request Flow — User Types a Character

User A types 'x' at position 42 (their local version = 100)
       │
       ▼
Client applies op locally (optimistic) → user sees 'x' immediately
       │
       ▼
Client sends: { op: insert('x', 42), parent_version: 100 }
       │
       ▼
WebSocket → API Gateway → routes to Collab Server owning this doc
       │
       ▼
Collab Server:
  1. Acquire per-document lock (in-memory, single-threaded per doc)
  2. Current server version = 103 (3 ops happened since client's version 100)
  3. Transform client's op against ops 101, 102, 103
     - e.g., op 101 inserted 5 chars before pos 42 → transformed pos = 47
  4. Apply transformed op → new server version = 104
  5. Persist op 104 to Cassandra (async, buffered)
  6. Broadcast transformed op to all connected clients (except sender)
  7. Send ACK to sender with server_version = 104
       │
       ▼
Other clients receive transformed op → apply it → their docs converge

Component Responsibilities

  1. API Gateway: Handles REST endpoints, upgrades HTTP to WebSocket, authenticates JWT, routes WebSocket connections to the correct Collab Server based on document partition
  2. Collaboration Server: The brain. Owns a partition of documents. Runs the OT engine. Maintains in-memory document state and connected client list. Single writer per document (critical for OT correctness).
  3. PostgreSQL: Source of truth for document metadata, permissions, and periodic content snapshots
  4. Cassandra: Append-only operation log. Enables version history, undo, and recovery.
  5. Redis: Ephemeral presence data (cursors, online users). Pub/Sub for cross-server presence fanout.
  6. Snapshot Worker: Background job that periodically compacts operations into a full document snapshot (every ~1000 ops) to bound recovery time.

6. Deep Dives (15 min)

Deep Dive 1: Operational Transformation (OT) vs CRDT

This is the defining technical decision of the system.

Operational Transformation (OT):

  • Each edit is an “operation” (insert char at position X, delete range Y-Z, format range as bold)
  • When two users edit concurrently, their operations must be “transformed” against each other so they can be applied in any order and converge to the same state
  • Requires a central server that assigns a total order to operations
  • Google Docs uses OT (specifically, a variant called Jupiter/Wave OT)

CRDTs (Conflict-free Replicated Data Types):

  • Each character has a unique, globally-ordered ID (e.g., Lamport timestamp + node ID)
  • Operations are commutative — no transformation needed
  • Can work fully peer-to-peer (no central server required)
  • Used by Figma, Yjs, Automerge

Why we choose OT for this design:

  • Server requirement: OT needs a central server to order operations (we already need one for auth and persistence); CRDTs can work without one
  • Memory overhead: OT is low (only the current doc state in memory); CRDTs are high (tombstones for every deleted character, unique IDs per character)
  • Rich text support: OT is well-understood (Google has shipped it for 15 years); formatting operations across ranges are tricky with CRDTs
  • Undo/redo: well-defined in OT (inverse operations); complex with CRDTs (need to reason about causal history)
  • Offline editing: harder with OT (ops diverge significantly); CRDTs are designed for this
  • Implementation complexity: OT transform functions are error-prone but well-documented; CRDTs are conceptually simpler but the metadata overhead is real

OT Transform Example:

User A: insert('x', pos=5) at version 10
User B: insert('y', pos=3) at version 10  (concurrent)

Server receives A first → version 11: insert('x', 5)
Server receives B second → must transform against A's op:
  - B's pos (3) < A's pos (5), so A doesn't affect B
  - Transformed B: insert('y', pos=3)
  - But A must be adjusted for clients who already applied B:
    Transformed A for B's clients: insert('x', pos=6)  // shifted right by 1

Both converge to: ...y..x... (same document)

The critical invariant: For any two operations O1 and O2, applying transform(O1, O2) after O2 must produce the same state as applying transform(O2, O1) after O1. This is called the TP1 property.
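
A minimal insert-insert transform makes the convergence argument concrete. This is an illustrative sketch, not production OT: ties at equal positions are broken by a site id, and the delete/format cases are omitted:

```python
# Insert-insert transform plus a TP1 check on the example above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: int  # tie-breaker when two inserts hit the same position

def transform(op: Insert, against: Insert) -> Insert:
    """Adjust `op` so it can be applied after `against` has been applied."""
    if against.pos < op.pos or (against.pos == op.pos and against.site < op.site):
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op

def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]

doc = "abcdefgh"
a = Insert(5, "x", site=1)   # user A's concurrent op
b = Insert(3, "y", site=2)   # user B's concurrent op

# TP1: O1 then transform(O2, O1) must equal O2 then transform(O1, O2)
left = apply(apply(doc, a), transform(b, a))
right = apply(apply(doc, b), transform(a, b))
assert left == right == "abcydexfgh"   # both orders converge
```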

Server-side implementation:

Per-document state (in memory):
  - current_version: int
  - ops_buffer: [last 1000 ops]  // for transforming late-arriving ops
  - connected_clients: map<client_id, {websocket, last_ack_version}>

On receiving op from client:
  1. Lock(doc_id)  // in-memory mutex, single Collab Server owns each doc
  2. gap = current_version - client_op.parent_version  // ops the client hasn't seen yet
  3. for each server_op in ops_buffer[client_op.parent_version+1 .. current_version]:
       client_op = transform(client_op, server_op)
  4. current_version++
  5. append(ops_buffer, {version: current_version, op: client_op})
  6. persist_async(cassandra, doc_id, current_version, client_op)
  7. broadcast(client_op, current_version, exclude=sender)
  8. ack(sender, current_version)
  9. Unlock(doc_id)
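
The same loop can be sketched in Python under simplifying assumptions: each document is handled by a single thread, so locking is implicit; persistence and broadcast are stubbed; and a minimal insert-only transform stands in for the full OT transform:

```python
# Per-document receive loop, simplified. Ops are {"pos": int, "text": str}.
from collections import deque

def transform(op, against):
    # Minimal insert-only transform: shift right if a concurrent insert
    # landed at or before our position. Real OT also handles delete/format.
    if against["pos"] <= op["pos"]:
        return {"pos": op["pos"] + len(against["text"]), "text": op["text"]}
    return op

class DocSession:
    def __init__(self, doc_id, version=0):
        self.doc_id = doc_id
        self.current_version = version
        self.ops_buffer = deque(maxlen=1000)  # (version, op), for late arrivals
        self.op_log = []                      # stand-in for the Cassandra write

    def receive(self, client_op, parent_version):
        # Transform against every op the client hasn't seen yet
        for version, server_op in self.ops_buffer:
            if version > parent_version:
                client_op = transform(client_op, server_op)
        self.current_version += 1
        self.ops_buffer.append((self.current_version, client_op))
        self.op_log.append((self.current_version, client_op))  # persist (stub)
        # Real system: broadcast to all clients except the sender, then ACK
        return self.current_version, client_op

session = DocSession("d_abc123", version=103)
session.ops_buffer.extend([(101, {"pos": 10, "text": "abcde"}),
                           (102, {"pos": 90, "text": "!"}),
                           (103, {"pos": 1, "text": "z"})])
version, op = session.receive({"pos": 42, "text": "x"}, parent_version=100)
assert version == 104
assert op["pos"] == 48  # +5 from op 101, +1 from op 103; op 102 was after it
```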

Deep Dive 2: Document Storage, Versioning & Snapshotting

Storing every individual operation forever is necessary for version history but expensive for document loading. We use a snapshot + operation log strategy:

Snapshot Strategy:

  • A background worker periodically takes a full document snapshot (every 1000 operations or every 5 minutes, whichever comes first)
  • Snapshot is stored in PostgreSQL as a JSONB blob (the full rich-text document tree)
  • To load a document: fetch latest snapshot, then replay all ops after that snapshot’s version

Version History:

  • Users see named versions (auto-saved every 30 minutes + manually named versions)
  • To render version N: find the closest snapshot before N, replay ops up to N
  • For frequently accessed historical versions, we can cache reconstructed snapshots

Compaction:

  • Ops older than 1 year are compacted: we keep only hourly snapshots beyond 1 year
  • This bounds storage growth while preserving meaningful history

Recovery:

  • If a Collab Server crashes, the new owner loads the latest snapshot + replays ops from Cassandra
  • Recovery time: snapshot load (50KB, ~5ms) + replay of up to 1000 ops (~20ms) = <50ms to fully reconstruct document state

Timeline of a document:

Snapshot v0 ──op1──op2──...──op999──op1000── Snapshot v1000 ──op1001──...──op1500 (current)
  (PostgreSQL)        (Cassandra)              (PostgreSQL)        (Cassandra)

To load current state:
  1. Fetch snapshot at v1000
  2. Fetch ops 1001-1500 from Cassandra
  3. Replay 500 ops → current document state
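
The load path can be sketched in Python, with the document modeled as plain text and hypothetical fetch_snapshot/fetch_ops stubs standing in for the PostgreSQL and Cassandra reads:

```python
# Load = latest snapshot + replay of ops after the snapshot's version.
def fetch_snapshot(doc_id):
    # hypothetical PostgreSQL read: (snapshot_version, content_snapshot)
    return 1000, "Hello world"

def fetch_ops(doc_id, from_version):
    # hypothetical Cassandra range scan within the doc_id partition
    return [(1001, {"pos": 5, "text": ","}),
            (1002, {"pos": 12, "text": "!"})]

def load_document(doc_id):
    snapshot_version, content = fetch_snapshot(doc_id)
    for version, op in fetch_ops(doc_id, snapshot_version + 1):
        content = content[:op["pos"]] + op["text"] + content[op["pos"]:]
    return content

assert load_document("d_abc123") == "Hello, world!"
```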

Deep Dive 3: WebSocket Connection Management & Presence

Connection Management:

  • Each document maps to exactly one Collab Server (consistent hashing on doc_id)
  • When a user opens a document, the API Gateway establishes a WebSocket to the appropriate Collab Server
  • If the Collab Server dies, clients reconnect through the API Gateway, which routes them to the new owner (via consistent hashing with virtual nodes)
  • Client maintains a heartbeat (every 30s). Server evicts clients after 90s of silence.
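
Routing by consistent hashing with virtual nodes can be sketched as follows; MD5 is an arbitrary stand-in for whatever hash the gateway actually uses, and the server names are hypothetical:

```python
# doc_id -> collab server via consistent hashing with virtual nodes.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        # Each server appears vnodes times on the ring for smoother balance
        self.ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, doc_id: str) -> str:
        # First virtual node clockwise from the document's hash owns the doc
        idx = bisect.bisect(self.keys, self._hash(doc_id)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["collab-1", "collab-2", "collab-3"])
assert ring.owner("d_abc123") == ring.owner("d_abc123")  # deterministic
```

When a server dies, only the documents whose nearest virtual nodes belonged to it move; everything else keeps its owner, which bounds reconnect churn.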

Presence System:

  • Cursor position, selection range, and user info (name, avatar color) are sent as ephemeral WebSocket messages
  • Not persisted to Cassandra — presence is transient
  • Stored in Redis with 60s TTL (refreshed on every cursor update)
  • For cross-server presence (when viewers are on different API Gateway instances), Redis Pub/Sub fans out cursor updates

Scaling Considerations:

  • A document with 100 concurrent editors generates ~200 cursor updates/sec (each user moves cursor ~2x/sec)
  • Each update is broadcast to 99 other users = 19,800 messages/sec per document
  • We throttle cursor broadcasts to 5Hz (every 200ms) and batch updates to reduce WebSocket frame overhead
  • For “spectator” mode (100+ viewers), we switch to polling presence every 2 seconds instead of real-time push
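
The 5Hz throttle can be sketched as a coalescing filter: every cursor move overwrites the pending position, and at most one broadcast per 200ms window goes out. A real implementation would also flush the pending position on a timer so the final resting place is never lost:

```python
# Coalescing cursor throttle; the clock is injectable for testing.
import time

class CursorThrottle:
    def __init__(self, interval=0.2, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_sent = float("-inf")
        self.pending = None          # latest unsent cursor position

    def update(self, cursor):
        """Record a move; return it if a broadcast is due, else None."""
        self.pending = cursor
        now = self.clock()
        if now - self.last_sent >= self.interval:
            self.last_sent = now
            out, self.pending = self.pending, None
            return out
        return None

t = [0.0]
throttle = CursorThrottle(clock=lambda: t[0])
assert throttle.update({"pos": 1}) == {"pos": 1}   # first move goes out
assert throttle.update({"pos": 2}) is None          # coalesced into pending
t[0] = 0.25
assert throttle.update({"pos": 3}) == {"pos": 3}   # next window: latest wins
```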

Offline Editing & Sync:

  • Client stores pending operations in IndexedDB when offline
  • On reconnect, client sends all buffered ops with their parent_version
  • Server transforms and applies them in sequence
  • If the gap is too large (>10,000 ops behind), server sends a full snapshot + diff instead of transforming each op

7. Extensions (2 min)

  • Suggesting mode & comments: Operations include a “suggestion” flag that renders as tracked changes. Comments are anchored to ranges using stable position markers that survive OT transforms.
  • Real-time spell check & grammar: Offload to a separate microservice. Client sends sentences after a debounce; server returns underline ranges. Must handle that text may shift before the response arrives (use version-tagged positions).
  • Document templates & mail merge: Template engine that substitutes placeholders. Rendered server-side for consistency. Templates stored as special documents with variable bindings.
  • Export to PDF/DOCX: Background rendering service (e.g., headless Chromium for PDF, Apache POI for DOCX). Queued via message broker to avoid blocking.
  • Mobile editing with bandwidth constraints: Compress operations with protocol buffers. Batch ops over unreliable connections. Use CRDTs for mobile-specific offline scenarios where OT’s server dependency is problematic.