1. Requirements & Scope (5 min)
Functional Requirements
- Users can create, edit, and delete documents with rich-text formatting (bold, italic, headings, lists, tables, images)
- Multiple users can simultaneously edit the same document in real time, seeing each other’s changes within ~200ms
- Each collaborator’s cursor position and selection are visible to all other editors (presence awareness)
- Full version history — users can view, name, and restore any previous version of the document
- Sharing and permissions model: owner, editor, commenter, viewer roles with link-sharing and per-user ACLs
Non-Functional Requirements
- Availability: 99.99% — documents are a user’s primary work artifact. Downtime during working hours is extremely costly.
- Latency: < 200ms for local edit acknowledgment; < 500ms for remote edit propagation to other collaborators
- Consistency: Eventual consistency for document state, but operations must converge — all users see the identical document after quiescence. No lost edits, ever.
- Scale: 1B documents total, 10M DAU, up to 100 concurrent editors per document, peak 500K concurrent editing sessions
- Durability: Zero data loss. Every keystroke must be persisted. Point-in-time recovery for any document.
2. Estimation (3 min)
Traffic
- 10M DAU, average 3 editing sessions/day, average session 20 minutes
- Average 2 operations/second per active user (character insert, delete, format change)
- Concurrent active editors: 10M DAU x 60 active min/day = 600M user-minutes, spread over 1,440 minutes ≈ 420K average; peak ~500K concurrent sessions (per the requirements)
- Write QPS: 500K peak sessions x 2 ops/sec = 1M operations/sec
- Read QPS: Document opens: 10M x 3 = 30M/day = ~350 reads/sec (bursty, 3x peak = ~1K/sec). Presence/cursor updates: same as write QPS.
Storage
- 1B documents, average 50KB per document (plain text + formatting metadata)
- Document content: 1B x 50KB = 50TB
- Operation log (for version history): assume 500 ops per document per day, 10 bytes per op average, retained for 1 year
- Active documents per day: 30M. 30M x 500 x 10B x 365 = ~55TB/year of operation logs
- Total: ~105TB primary storage, replicated 3x = ~315TB
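The arithmetic above can be sanity-checked in a few lines; the constants mirror the stated assumptions (the 500K peak is taken from the requirements):

```python
# Back-of-envelope check of the figures above (mirrors the stated assumptions).
DAU = 10_000_000
SESSIONS_PER_USER = 3
SESSION_MINUTES = 20
OPS_PER_SEC_PER_EDITOR = 2

# Average concurrency: total active user-minutes spread over a day.
active_minutes = DAU * SESSIONS_PER_USER * SESSION_MINUTES   # 600M user-minutes/day
avg_concurrent = active_minutes / (24 * 60)                  # ~420K sessions
peak_concurrent = 500_000                                    # peak from the requirements

write_qps = peak_concurrent * OPS_PER_SEC_PER_EDITOR         # 1M ops/sec
doc_opens_per_sec = DAU * SESSIONS_PER_USER / 86_400         # ~350/sec

content_tb = 1_000_000_000 * 50_000 / 1e12                   # 50 TB of document content
ops_tb_per_year = 30_000_000 * 500 * 10 * 365 / 1e12         # ~55 TB/year of op logs
```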
Key Insight
This is a write-heavy, real-time collaboration system. The core challenge is not storage or throughput — it’s ensuring that concurrent edits from multiple users converge to the same document state without conflicts or lost updates. The algorithm choice (OT vs CRDT) dominates the design.
3. API Design (3 min)
REST Endpoints (Document CRUD)
POST /api/documents
Body: { "title": "Untitled" }
Response 201: { "doc_id": "d_abc123", "created_at": "..." }
GET /api/documents/{doc_id}
Response 200: { "doc_id": "...", "title": "...", "content": <snapshot>, "version": 4872, "permissions": {...} }
DELETE /api/documents/{doc_id}
Response 204
GET /api/documents/{doc_id}/history?from_version=100&to_version=200
Response 200: { "versions": [{ "version": 101, "timestamp": "...", "author": "...", "ops": [...] }, ...] }
POST /api/documents/{doc_id}/restore
Body: { "target_version": 150 }
Response 200: { "version": 4873, "content": <snapshot> }
PUT /api/documents/{doc_id}/share
Body: { "user_id": "u_xyz", "role": "editor" }
Response 200
WebSocket Protocol (Real-Time Editing)
// Client connects
WS /ws/documents/{doc_id}?token=<jwt>
// Client → Server: send operation
{ "type": "op", "client_id": "c_1", "parent_version": 4871, "ops": [{ "type": "insert", "pos": 42, "text": "hello" }] }
// Server → Client: broadcast transformed operation
{ "type": "op", "server_version": 4872, "author": "u_abc", "ops": [{ "type": "insert", "pos": 45, "text": "hello" }] }
// Server → Client: presence update
{ "type": "presence", "user_id": "u_abc", "cursor": { "pos": 47, "selection_end": 47 }, "color": "#e06c75" }
// Client → Server: cursor update
{ "type": "cursor", "pos": 50, "selection_end": 55 }
Key Decisions
- Operations are sent as deltas (not full document snapshots) to minimize bandwidth
- Each operation references a parent_version so the server can determine which concurrent ops need transformation
- Presence (cursors) is sent on a separate channel/message type; it is ephemeral and does not need persistence
4. Data Model (3 min)
Documents Table (PostgreSQL)
| Field | Type | Notes |
|---|---|---|
| doc_id | UUID (PK) | Globally unique |
| owner_id | UUID (FK) | Creator |
| title | VARCHAR(500) | Document title |
| current_version | BIGINT | Monotonically increasing |
| content_snapshot | JSONB | Latest full document state (rich text tree) |
| snapshot_version | BIGINT | Version at which snapshot was taken |
| created_at | TIMESTAMP | |
| updated_at | TIMESTAMP | |
Operations Table (Cassandra / ScyllaDB)
| Field | Type | Notes |
|---|---|---|
| doc_id | UUID (partition key) | |
| version | BIGINT (clustering key) | Server-assigned monotonic version |
| author_id | UUID | Who made this edit |
| ops | BLOB | Serialized operation (insert/delete/format) |
| timestamp | TIMESTAMP | |
Permissions Table (PostgreSQL)
| Field | Type | Notes |
|---|---|---|
| doc_id | UUID (PK) | |
| user_id | UUID (PK) | |
| role | ENUM | owner / editor / commenter / viewer |
| granted_at | TIMESTAMP | |
| link_share_role | ENUM | Role for anyone-with-link (nullable) |
Why These DB Choices?
- PostgreSQL for documents and permissions: strong consistency needed for access control; document metadata is relational and moderate-scale
- Cassandra/ScyllaDB for operations log: append-heavy, time-series-like workload. Partitioned by doc_id so all ops for one document are co-located. Excellent write throughput. We only need range scans within a partition (get ops from version X to Y).
- Redis for ephemeral state: cursor positions, active sessions, presence data. TTL-based expiry.
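The partition-plus-clustering access pattern is the whole point of the operations table. This toy in-memory model (not real Cassandra client code) shows why version-range reads are cheap: all rows for one doc_id live together, sorted by version, so a range read is one contiguous slice:

```python
from bisect import bisect_right
from collections import defaultdict

class OpsLog:
    """Toy in-memory model of the operations table: rows partitioned by
    doc_id and clustered (sorted) by version, so a (from, to] range read
    touches a single contiguous slice of one partition."""
    def __init__(self):
        self.partitions = defaultdict(list)   # doc_id -> [(version, op)], sorted

    def append(self, doc_id, version, op):
        self.partitions[doc_id].append((version, op))   # versions arrive in order

    def range(self, doc_id, after_version, up_to_version):
        rows = self.partitions[doc_id]
        versions = [v for v, _ in rows]
        lo = bisect_right(versions, after_version)
        hi = bisect_right(versions, up_to_version)
        return rows[lo:hi]

log = OpsLog()
for v in range(1, 6):
    log.append("d_abc123", v, f"op{v}")
assert [v for v, _ in log.range("d_abc123", 2, 4)] == [3, 4]
```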
5. High-Level Design (12 min)
System Architecture
┌──────────────────────────┐
│ Load Balancer │
│ (L7, sticky sessions │
│ by doc_id hash) │
└────────────┬─────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ API GW │ │ API GW │ │ API GW │
│ (REST + │ │ (REST + │ │ (REST + │
│ WS mgmt) │ │ WS mgmt) │ │ WS mgmt) │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└──────────────────┼──────────────────┘
│
┌────────────────────────┼────────────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Collab │ │ Collab │ │ Collab │
│ Server │ │ Server │ │ Server │
│ (OT Engine) │ │ (OT Engine) │ │ (OT Engine) │
│ │ │ │ │ │
│ Owns doc │ │ Owns doc │ │ Owns doc │
│ partitions │ │ partitions │ │ partitions │
│ A-F │ │ G-M │ │ N-Z │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└────────────────────────┼────────────────────────┘
│
┌─────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌─────▼──────┐ ┌──────▼──────┐
│ PostgreSQL │ │ Cassandra │ │ Redis │
│ (docs, │ │ (ops log) │ │ (presence, │
│ perms, │ │ │ │ sessions, │
│ metadata) │ │ │ │ cursors) │
└─────────────┘ └────────────┘ └─────────────┘
Request Flow — User Types a Character
User A types 'x' at position 42 (their local version = 100)
│
▼
Client applies op locally (optimistic) → user sees 'x' immediately
│
▼
Client sends: { op: insert('x', 42), parent_version: 100 }
│
▼
WebSocket → API Gateway → routes to Collab Server owning this doc
│
▼
Collab Server:
1. Acquire per-document lock (in-memory, single-threaded per doc)
2. Current server version = 103 (3 ops happened since client's version 100)
3. Transform client's op against ops 101, 102, 103
- e.g., op 101 inserted 5 chars before pos 42 → transformed pos = 47
4. Apply transformed op → new server version = 104
5. Persist op 104 to Cassandra (async, buffered)
6. Broadcast transformed op to all connected clients (except sender)
7. Send ACK to sender with server_version = 104
│
▼
Other clients receive transformed op → apply it → their docs converge
Component Responsibilities
- API Gateway: Handles REST endpoints, upgrades HTTP to WebSocket, authenticates JWT, routes WebSocket connections to the correct Collab Server based on document partition
- Collaboration Server: The brain. Owns a partition of documents. Runs the OT engine. Maintains in-memory document state and connected client list. Single writer per document (critical for OT correctness).
- PostgreSQL: Source of truth for document metadata, permissions, and periodic content snapshots
- Cassandra: Append-only operation log. Enables version history, undo, and recovery.
- Redis: Ephemeral presence data (cursors, online users). Pub/Sub for cross-server presence fanout.
- Snapshot Worker: Background job that periodically compacts operations into a full document snapshot (every ~1000 ops) to bound recovery time.
6. Deep Dives (15 min)
Deep Dive 1: Operational Transformation (OT) vs CRDT
This is the defining technical decision of the system.
Operational Transformation (OT):
- Each edit is an “operation” (insert char at position X, delete range Y-Z, format range as bold)
- When two users edit concurrently, their operations must be “transformed” against each other so they can be applied in any order and converge to the same state
- Requires a central server that assigns a total order to operations
- Google Docs uses OT (specifically, a variant called Jupiter/Wave OT)
CRDTs (Conflict-free Replicated Data Types):
- Each character has a unique, globally-ordered ID (e.g., Lamport timestamp + node ID)
- Operations are commutative — no transformation needed
- Can work fully peer-to-peer (no central server required)
- Used by Figma, Yjs, Automerge
Why we choose OT for this design:
| Criterion | OT | CRDT |
|---|---|---|
| Server requirement | Central server (we already need one for auth, persistence) | Optional |
| Memory overhead | Low — only current doc state in memory | High — tombstones for every deleted character, unique IDs per character |
| Rich text support | Well-understood (Google has shipped it for 15 years) | Harder — formatting operations across ranges are tricky with CRDTs |
| Undo/redo | Well-defined (inverse operations) | Complex (need to reason about causal history) |
| Offline editing | Harder (ops diverge significantly) | Better (designed for this) |
| Implementation complexity | Transform functions are error-prone but well-documented | Conceptually simpler but metadata overhead is real |
OT Transform Example:
User A: insert('x', pos=5) at version 10
User B: insert('y', pos=3) at version 10 (concurrent)
Server receives A first → version 11: insert('x', 5)
Server receives B second → must transform against A's op:
- B's pos (3) < A's pos (5), so A doesn't affect B
- Transformed B: insert('y', pos=3)
- But A must be adjusted for clients who already applied B:
Transformed A for B's clients: insert('x', pos=6) // shifted right by 1
Both converge to: ...y..x... (same document)
The critical invariant: For any two operations O1 and O2, applying transform(O1, O2) after O2 must produce the same state as applying transform(O2, O1) after O1. This is called the TP1 property.
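A minimal insert-insert transform makes TP1 concrete. This is a sketch, not a full OT engine: delete and format cases are omitted, and the `site` field is an assumed convention for breaking ties when two inserts land at the same position:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: int   # assumed tiebreak: lower site wins when positions are equal

def transform(op: Insert, against: Insert) -> Insert:
    """Shift `op` so it applies correctly after `against` has been applied."""
    if against.pos < op.pos or (against.pos == op.pos and against.site < op.site):
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op

def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]

# TP1: O1 then transform(O2, O1) must equal O2 then transform(O1, O2).
doc = "abcdefgh"
a = Insert(pos=5, text="x", site=1)   # User A's op from the example
b = Insert(pos=3, text="y", site=2)   # User B's concurrent op
left  = apply(apply(doc, a), transform(b, a))
right = apply(apply(doc, b), transform(a, b))
assert left == right == "abcydexfgh"
```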
Server-side implementation:
Per-document state (in memory):
- current_version: int
- ops_buffer: [last 1000 ops] // for transforming late-arriving ops
- connected_clients: map<client_id, {websocket, last_ack_version}>
On receiving op from client:
1. Lock(doc_id) // in-memory mutex, single Collab Server owns each doc
2. gap = current_version - op.parent_version // number of concurrent server ops to transform against
3. for each server_op in ops_buffer[parent_version+1 .. current_version]:
client_op = transform(client_op, server_op)
4. current_version++
5. append(ops_buffer, {version: current_version, op: client_op})
6. persist_async(cassandra, doc_id, current_version, client_op)
7. broadcast(client_op, current_version, exclude=sender)
8. ack(sender, current_version)
9. Unlock(doc_id)
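The receive path above can be made concrete with an in-memory sketch. Inserts only; locking, async persistence, and broadcast are stubbed out as comments, so this illustrates the transform-assign-buffer core rather than a production server:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Ins:
    pos: int
    text: str

def transform(op: Ins, against: Ins) -> Ins:
    """Shift op right when a concurrent, earlier-ordered insert landed at or before it."""
    if against.pos <= op.pos:
        return Ins(op.pos + len(against.text), op.text)
    return op

@dataclass
class DocState:
    version: int = 0
    ops_buffer: list = field(default_factory=list)   # [(version, Ins)], last N kept

def receive(state: DocState, client_op: Ins, parent_version: int):
    # (1) caller is assumed to hold the per-document lock
    # (2-3) transform against every op the client has not yet seen
    for v, server_op in state.ops_buffer:
        if v > parent_version:
            client_op = transform(client_op, server_op)
    # (4-5) assign the next version; buffer the op for future transforms
    state.version += 1
    state.ops_buffer.append((state.version, client_op))
    # (6-8) persist_async / broadcast / ack would go here
    return state.version, client_op

state = DocState()
receive(state, Ins(0, "hello"), parent_version=0)       # server assigns v1
v, op = receive(state, Ins(2, "x"), parent_version=0)   # concurrent: shifted by "hello"
assert (v, op.pos) == (2, 7)
```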
Deep Dive 2: Document Storage, Versioning & Snapshotting
Storing every individual operation forever is necessary for version history but expensive for document loading. We use a snapshot + operation log strategy:
Snapshot Strategy:
- A background worker periodically takes a full document snapshot (every 1000 operations or every 5 minutes, whichever comes first)
- Snapshot is stored in PostgreSQL as a JSONB blob (the full rich-text document tree)
- To load a document: fetch latest snapshot, then replay all ops after that snapshot’s version
Version History:
- Users see named versions (auto-saved every 30 minutes + manually named versions)
- To render version N: find the closest snapshot before N, replay ops up to N
- For frequently accessed historical versions, we can cache reconstructed snapshots
Compaction:
- Ops older than 1 year are compacted: we keep only hourly snapshots beyond 1 year
- This bounds storage growth while preserving meaningful history
Recovery:
- If a Collab Server crashes, the new owner loads the latest snapshot + replays ops from Cassandra
- Recovery time: snapshot load (50KB, ~5ms) + replay 1000 ops (~20ms) = <50ms to fully reconstruct document state
Timeline of a document:
Snapshot v0 ──op1──op2──...──op999──op1000── Snapshot v1000 ──op1001──...──op1500 (current)
(PostgreSQL) (Cassandra) (PostgreSQL) (Cassandra)
To load current state:
1. Fetch snapshot at v1000
2. Fetch ops 1001-1500 from Cassandra
3. Replay 500 ops → current document state
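Step 3 can be sketched in a few lines (inserts only; a real replay also handles deletes and formatting ops, and the snapshot/op values here are toy stand-ins):

```python
def replay(snapshot_text: str, ops: list[tuple[int, str]]) -> str:
    """Rebuild current state: start from the snapshot, apply each logged op
    in version order. Ops are simplified to (pos, text) inserts."""
    doc = snapshot_text
    for pos, text in ops:
        doc = doc[:pos] + text + doc[pos:]
    return doc

snapshot = "hello world"            # stand-in for the snapshot at v1000
ops_after = [(5, ","), (12, "!")]   # stand-ins for ops 1001 and 1002
assert replay(snapshot, ops_after) == "hello, world!"
```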
Deep Dive 3: WebSocket Connection Management & Presence
Connection Management:
- Each document maps to exactly one Collab Server (consistent hashing on doc_id)
- When a user opens a document, the API Gateway establishes a WebSocket to the appropriate Collab Server
- If the Collab Server dies, clients reconnect through the API Gateway, which routes them to the new owner (via consistent hashing with virtual nodes)
- Client maintains a heartbeat (every 30s). Server evicts clients after 90s of silence.
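The doc-to-server mapping above can be sketched as a consistent-hash ring with virtual nodes. MD5 and the vnode count are arbitrary illustrative choices, not a claim about any particular implementation:

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Consistent-hash ring with virtual nodes: each doc_id maps to exactly
    one Collab Server, and removing a server only remaps the docs it owned."""
    def __init__(self, servers: list[str], vnodes: int = 100):
        self.ring = sorted((self._h(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, doc_id: str) -> str:
        # First vnode clockwise from the doc's hash owns the doc.
        i = bisect_right(self.keys, self._h(doc_id)) % len(self.keys)
        return self.ring[i][1]

servers = ["collab-1", "collab-2", "collab-3"]
ring = Ring(servers)
assert ring.owner("d_abc123") == ring.owner("d_abc123")   # routing is deterministic
```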
Presence System:
- Cursor position, selection range, and user info (name, avatar color) are sent as ephemeral WebSocket messages
- Not persisted to Cassandra — presence is transient
- Stored in Redis with 60s TTL (refreshed on every cursor update)
- For cross-server presence (when viewers are on different API Gateway instances), Redis Pub/Sub fans out cursor updates
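The TTL behavior can be modeled with an in-memory stand-in; the clock is injected so expiry is deterministic. Real code would use Redis SETEX plus Pub/Sub rather than this toy dict:

```python
class PresenceStore:
    """In-memory stand-in for the Redis pattern: entries carry a 60s TTL
    that is refreshed on every cursor update; stale entries simply lapse."""
    def __init__(self, ttl: float = 60.0):
        self.ttl = ttl
        self.entries = {}   # (doc_id, user_id) -> (cursor, expires_at)

    def update(self, doc_id: str, user_id: str, cursor: dict, now: float) -> None:
        self.entries[(doc_id, user_id)] = (cursor, now + self.ttl)

    def active(self, doc_id: str, now: float) -> dict:
        return {u: cursor for (d, u), (cursor, exp) in self.entries.items()
                if d == doc_id and exp > now}

store = PresenceStore(ttl=60)
store.update("d_abc123", "u_a", {"pos": 47}, now=0)
store.update("d_abc123", "u_b", {"pos": 10}, now=30)
assert set(store.active("d_abc123", now=50)) == {"u_a", "u_b"}
assert set(store.active("d_abc123", now=70)) == {"u_b"}   # u_a's TTL lapsed
```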
Scaling Considerations:
- A document with 100 concurrent editors generates ~200 cursor updates/sec (each user moves cursor ~2x/sec)
- Each update is broadcast to 99 other users = 19,800 messages/sec per document
- We throttle cursor broadcasts to 5Hz (every 200ms) and batch updates to reduce WebSocket frame overhead
- For “spectator” mode (100+ viewers), we switch to polling presence every 2 seconds instead of real-time push
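The 5 Hz throttle amounts to a coalescing buffer: later cursor positions overwrite earlier ones, and a periodic flush emits at most one update per user per 200 ms. A sketch with illustrative names:

```python
class CursorThrottle:
    """Coalesce cursor updates: keep only the newest position per user and
    broadcast at most once per interval (0.2s => 5 Hz)."""
    def __init__(self, interval: float = 0.2):
        self.interval = interval
        self.last_sent: dict[str, float] = {}
        self.pending: dict[str, dict] = {}

    def on_cursor(self, user_id: str, cursor: dict) -> None:
        self.pending[user_id] = cursor   # a newer update overwrites an older one

    def flush(self, now: float) -> list:
        """Called on a timer; returns the updates to broadcast this tick."""
        out = []
        for user_id in list(self.pending):
            if now - self.last_sent.get(user_id, float("-inf")) >= self.interval:
                out.append((user_id, self.pending.pop(user_id)))
                self.last_sent[user_id] = now
        return out

t = CursorThrottle()
t.on_cursor("u_abc", {"pos": 47})
t.on_cursor("u_abc", {"pos": 50})             # overwrites; only pos 50 is sent
assert t.flush(now=0.0) == [("u_abc", {"pos": 50})]
t.on_cursor("u_abc", {"pos": 55})
assert t.flush(now=0.1) == []                 # inside the 200 ms window: suppressed
assert t.flush(now=0.25) == [("u_abc", {"pos": 55})]
```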
Offline Editing & Sync:
- Client stores pending operations in IndexedDB when offline
- On reconnect, client sends all buffered ops with their parent_version
- Server transforms and applies them in sequence
- If the gap is too large (>10,000 ops behind), server sends a full snapshot + diff instead of transforming each op
7. Extensions (2 min)
- Suggesting mode & comments: Operations include a “suggestion” flag that renders as tracked changes. Comments are anchored to ranges using stable position markers that survive OT transforms.
- Real-time spell check & grammar: Offload to a separate microservice. Client sends sentences after a debounce; server returns underline ranges. Must handle that text may shift before the response arrives (use version-tagged positions).
- Document templates & mail merge: Template engine that substitutes placeholders. Rendered server-side for consistency. Templates stored as special documents with variable bindings.
- Export to PDF/DOCX: Background rendering service (e.g., headless Chromium for PDF, Apache POI for DOCX). Queued via message broker to avoid blocking.
- Mobile editing with bandwidth constraints: Compress operations with protocol buffers. Batch ops over unreliable connections. Use CRDTs for mobile-specific offline scenarios where OT’s server dependency is problematic.