1. Requirements & Scope (5 min)
Functional Requirements
- Users can create, edit, and delete documents with rich-text formatting (bold, italic, headings, lists, tables, images)
- Multiple users can simultaneously edit the same document in real time, seeing each other’s changes within ~200ms
- Each collaborator’s cursor position and selection are visible to all other editors (presence awareness)
- Full version history — users can view, name, and restore any previous version of the document
- Sharing and permissions model: owner, editor, commenter, viewer roles with link-sharing and per-user ACLs
Non-Functional Requirements
- Availability: 99.99% — documents are a user’s primary work artifact. Downtime during working hours is extremely costly.
- Latency: < 200ms for local edit acknowledgment; < 500ms for remote edit propagation to other collaborators
- Consistency: Eventual consistency for document state, but operations must converge — all users see the identical document after quiescence. No lost edits, ever.
- Scale: 1B documents total, 10M DAU, up to 100 concurrent editors per document, peak 500K concurrent editing sessions
- Durability: Zero data loss. Every keystroke must be persisted. Point-in-time recovery for any document.
2. Estimation (3 min)
Traffic
- 10M DAU, average 3 editing sessions/day, average session 20 minutes
- Average 2 operations/second per active user (character insert, delete, format change)
- Concurrent active editors: 10M DAU x 60 active min/day = 600M user-minutes, spread over 1,440 minutes ≈ 420K average; peak ~500K concurrent sessions (per the requirements)
- Write QPS: 500K peak sessions x 2 ops/sec = 1M operations/sec
- Read QPS: Document opens: 10M x 3 = 30M/day = ~350 reads/sec (bursty, 3x peak = ~1K/sec). Presence/cursor updates: same as write QPS.
Storage
- 1B documents, average 50KB per document (plain text + formatting metadata)
- Document content: 1B x 50KB = 50TB
- Operation log (for version history): assume 500 ops per document per day, 10 bytes per op average, retained for 1 year
- Active documents per day: 30M. 30M x 500 x 10B x 365 = ~55TB/year of operation logs
- Total: ~105TB primary storage, replicated 3x = ~315TB
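The arithmetic above can be sanity-checked in a few lines; the constants mirror the stated assumptions (the 500K peak is taken from the requirements):

```python
# Back-of-envelope check of the figures above (mirrors the stated assumptions).
DAU = 10_000_000
SESSIONS_PER_USER = 3
SESSION_MINUTES = 20
OPS_PER_SEC_PER_EDITOR = 2

# Average concurrency: total active user-minutes spread over a day.
active_minutes = DAU * SESSIONS_PER_USER * SESSION_MINUTES   # 600M user-minutes/day
avg_concurrent = active_minutes / (24 * 60)                  # ~420K sessions
peak_concurrent = 500_000                                    # peak from the requirements

write_qps = peak_concurrent * OPS_PER_SEC_PER_EDITOR         # 1M ops/sec
doc_opens_per_sec = DAU * SESSIONS_PER_USER / 86_400         # ~350/sec

content_tb = 1_000_000_000 * 50_000 / 1e12                   # 50 TB of document content
ops_tb_per_year = 30_000_000 * 500 * 10 * 365 / 1e12         # ~55 TB/year of op logs
```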
Key Insight
This is a write-heavy, real-time collaboration system. The core challenge is not storage or throughput — it’s ensuring that concurrent edits from multiple users converge to the same document state without conflicts or lost updates. The algorithm choice (OT vs CRDT) dominates the design.
3. API Design (3 min)
REST Endpoints (Document CRUD)
POST /api/documents
Body: { "title": "Untitled" }
Response 201: { "doc_id": "d_abc123", "created_at": "..." }
GET /api/documents/{doc_id}
Response 200: { "doc_id": "...", "title": "...", "content": <snapshot>, "version": 4872, "permissions": {...} }
DELETE /api/documents/{doc_id}
Response 204
GET /api/documents/{doc_id}/history?from_version=100&to_version=200
Response 200: { "versions": [{ "version": 101, "timestamp": "...", "author": "...", "ops": [...] }, ...] }
POST /api/documents/{doc_id}/restore
Body: { "target_version": 150 }
Response 200: { "version": 4873, "content": <snapshot> }
PUT /api/documents/{doc_id}/share
Body: { "user_id": "u_xyz", "role": "editor" }
Response 200
WebSocket Protocol (Real-Time Editing)
// Client connects
WS /ws/documents/{doc_id}?token=<jwt>
// Client → Server: send operation
{ "type": "op", "client_id": "c_1", "parent_version": 4871, "ops": [{ "type": "insert", "pos": 42, "text": "hello" }] }
// Server → Client: broadcast transformed operation
{ "type": "op", "server_version": 4872, "author": "u_abc", "ops": [{ "type": "insert", "pos": 45, "text": "hello" }] }
// Server → Client: presence update
{ "type": "presence", "user_id": "u_abc", "cursor": { "pos": 47, "selection_end": 47 }, "color": "#e06c75" }
// Client → Server: cursor update
{ "type": "cursor", "pos": 50, "selection_end": 55 }
Key Decisions
- Operations are sent as deltas (not full document snapshots) to minimize bandwidth
- Each operation references a parent_version so the server can determine which concurrent ops need transformation
- Presence (cursors) is sent on a separate channel/message type; it is ephemeral and does not need persistence
4. Data Model (3 min)
Documents Table (PostgreSQL)
| Field | Type | Notes |
|---|---|---|
| doc_id | UUID (PK) | Globally unique |
| owner_id | UUID (FK) | Creator |
| title | VARCHAR(500) | Document title |
| current_version | BIGINT | Monotonically increasing |
| content_snapshot | JSONB | Latest full document state (rich text tree) |
| snapshot_version | BIGINT | Version at which snapshot was taken |
| created_at | TIMESTAMP | |
| updated_at | TIMESTAMP | |
Operations Table (Cassandra / ScyllaDB)
| Field | Type | Notes |
|---|---|---|
| doc_id | UUID (partition key) | |
| version | BIGINT (clustering key) | Server-assigned monotonic version |
| author_id | UUID | Who made this edit |
| ops | BLOB | Serialized operation (insert/delete/format) |
| timestamp | TIMESTAMP | |
Permissions Table (PostgreSQL)
| Field | Type | Notes |
|---|---|---|
| doc_id | UUID (PK) | |
| user_id | UUID (PK) | |
| role | ENUM | owner / editor / commenter / viewer |
| granted_at | TIMESTAMP | |
| link_share_role | ENUM | Role for anyone-with-link (nullable) |
Why These DB Choices?
- PostgreSQL for documents and permissions: strong consistency needed for access control; document metadata is relational and moderate-scale
- Cassandra/ScyllaDB for operations log: append-heavy, time-series-like workload. Partitioned by doc_id so all ops for one document are co-located. Excellent write throughput. We only need range scans within a partition (get ops from version X to Y).
- Redis for ephemeral state: cursor positions, active sessions, presence data. TTL-based expiry.
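The partition-plus-clustering access pattern is the whole point of the operations table. This toy in-memory model (not real Cassandra client code) shows why version-range reads are cheap: all rows for one doc_id live together, sorted by version, so a range read is one contiguous slice:

```python
from bisect import bisect_right
from collections import defaultdict

class OpsLog:
    """Toy in-memory model of the operations table: rows partitioned by
    doc_id and clustered (sorted) by version, so a (from, to] range read
    touches a single contiguous slice of one partition."""
    def __init__(self):
        self.partitions = defaultdict(list)   # doc_id -> [(version, op)], sorted

    def append(self, doc_id, version, op):
        self.partitions[doc_id].append((version, op))   # versions arrive in order

    def range(self, doc_id, after_version, up_to_version):
        rows = self.partitions[doc_id]
        versions = [v for v, _ in rows]
        lo = bisect_right(versions, after_version)
        hi = bisect_right(versions, up_to_version)
        return rows[lo:hi]

log = OpsLog()
for v in range(1, 6):
    log.append("d_abc123", v, f"op{v}")
assert [v for v, _ in log.range("d_abc123", 2, 4)] == [3, 4]
```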
5. High-Level Design (12 min)
System Architecture
┌──────────────────────────┐
│ Load Balancer │
│ (L7, sticky sessions │
│ by doc_id hash) │
└────────────┬─────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ API GW │ │ API GW │ │ API GW │
│ (REST + │ │ (REST + │ │ (REST + │
│ WS mgmt) │ │ WS mgmt) │ │ WS mgmt) │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└──────────────────┼──────────────────┘
│
┌────────────────────────┼────────────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Collab │ │ Collab │ │ Collab │
│ Server │ │ Server │ │ Server │
│ (OT Engine) │ │ (OT Engine) │ │ (OT Engine) │
│ │ │ │ │ │
│ Owns doc │ │ Owns doc │ │ Owns doc │
│ partitions │ │ partitions │ │ partitions │
│ A-F │ │ G-M │ │ N-Z │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└────────────────────────┼────────────────────────┘
│
┌─────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌─────▼──────┐ ┌──────▼──────┐
│ PostgreSQL │ │ Cassandra │ │ Redis │
│ (docs, │ │ (ops log) │ │ (presence, │
│ perms, │ │ │ │ sessions, │
│ metadata) │ │ │ │ cursors) │
└─────────────┘ └────────────┘ └─────────────┘
Request Flow — User Types a Character
User A types 'x' at position 42 (their local version = 100)
│
▼
Client applies op locally (optimistic) → user sees 'x' immediately
│
▼
Client sends: { op: insert('x', 42), parent_version: 100 }
│
▼
WebSocket → API Gateway → routes to Collab Server owning this doc
│
▼
Collab Server:
1. Acquire per-document lock (in-memory, single-threaded per doc)
2. Current server version = 103 (3 ops happened since client's version 100)
3. Transform client's op against ops 101, 102, 103
- e.g., op 101 inserted 5 chars before pos 42 → transformed pos = 47
4. Apply transformed op → new server version = 104
5. Persist op 104 to Cassandra (async, buffered)
6. Broadcast transformed op to all connected clients (except sender)
7. Send ACK to sender with server_version = 104
│
▼
Other clients receive transformed op → apply it → their docs converge
Component Responsibilities
- API Gateway: Handles REST endpoints, upgrades HTTP to WebSocket, authenticates JWT, routes WebSocket connections to the correct Collab Server based on document partition
- Collaboration Server: The brain. Owns a partition of documents. Runs the OT engine. Maintains in-memory document state and connected client list. Single writer per document (critical for OT correctness).
- PostgreSQL: Source of truth for document metadata, permissions, and periodic content snapshots
- Cassandra: Append-only operation log. Enables version history, undo, and recovery.
- Redis: Ephemeral presence data (cursors, online users). Pub/Sub for cross-server presence fanout.
- Snapshot Worker: Background job that periodically compacts operations into a full document snapshot (every ~1000 ops) to bound recovery time.
6. Deep Dives (15 min)
Deep Dive 1: Operational Transformation (OT) vs CRDT
This is the defining technical decision of the system.
Operational Transformation (OT):
- Each edit is an “operation” (insert char at position X, delete range Y-Z, format range as bold)
- When two users edit concurrently, their operations must be “transformed” against each other so they can be applied in any order and converge to the same state
- Requires a central server that assigns a total order to operations
- Google Docs uses OT (specifically, a variant called Jupiter/Wave OT)
CRDTs (Conflict-free Replicated Data Types):
- Each character has a unique, globally-ordered ID (e.g., Lamport timestamp + node ID)
- Operations are commutative — no transformation needed
- Can work fully peer-to-peer (no central server required)
- Used by Figma, Yjs, Automerge
Why we choose OT for this design:
| Criterion | OT | CRDT |
|---|---|---|
| Server requirement | Central server (we already need one for auth, persistence) | Optional |
| Memory overhead | Low — only current doc state in memory | High — tombstones for every deleted character, unique IDs per character |
| Rich text support | Well-understood (Google has shipped it for 15 years) | Harder — formatting operations across ranges are tricky with CRDTs |
| Undo/redo | Well-defined (inverse operations) | Complex (need to reason about causal history) |
| Offline editing | Harder (ops diverge significantly) | Better (designed for this) |
| Implementation complexity | Transform functions are error-prone but well-documented | Conceptually simpler but metadata overhead is real |
OT Transform Example:
User A: insert('x', pos=5) at version 10
User B: insert('y', pos=3) at version 10 (concurrent)
Server receives A first → version 11: insert('x', 5)
Server receives B second → must transform against A's op:
- B's pos (3) < A's pos (5), so A doesn't affect B
- Transformed B: insert('y', pos=3)
- But A must be adjusted for clients who already applied B:
Transformed A for B's clients: insert('x', pos=6) // shifted right by 1
Both converge to: ...y..x... (same document)
The critical invariant: For any two operations O1 and O2, applying transform(O1, O2) after O2 must produce the same state as applying transform(O2, O1) after O1. This is called the TP1 property.
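A minimal insert-insert transform makes TP1 concrete. This is a sketch, not a full OT engine: delete and format cases are omitted, and the `site` field is an assumed convention for breaking ties when two inserts land at the same position:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: int   # assumed tiebreak: lower site wins when positions are equal

def transform(op: Insert, against: Insert) -> Insert:
    """Shift `op` so it applies correctly after `against` has been applied."""
    if against.pos < op.pos or (against.pos == op.pos and against.site < op.site):
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op

def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]

# TP1: O1 then transform(O2, O1) must equal O2 then transform(O1, O2).
doc = "abcdefgh"
a = Insert(pos=5, text="x", site=1)   # User A's op from the example
b = Insert(pos=3, text="y", site=2)   # User B's concurrent op
left  = apply(apply(doc, a), transform(b, a))
right = apply(apply(doc, b), transform(a, b))
assert left == right == "abcydexfgh"
```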
Server-side implementation:
Per-document state (in memory):
- current_version: int
- ops_buffer: [last 1000 ops] // for transforming late-arriving ops
- connected_clients: map<client_id, {websocket, last_ack_version}>
On receiving op from client:
1. Lock(doc_id) // in-memory mutex, single Collab Server owns each doc
2. gap = current_version - op.parent_version // number of concurrent server ops to transform against
3. for each server_op in ops_buffer[parent_version+1 .. current_version]:
client_op = transform(client_op, server_op)
4. current_version++
5. append(ops_buffer, {version: current_version, op: client_op})
6. persist_async(cassandra, doc_id, current_version, client_op)
7. broadcast(client_op, current_version, exclude=sender)
8. ack(sender, current_version)
9. Unlock(doc_id)
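The receive path above can be made concrete with an in-memory sketch. Inserts only; locking, async persistence, and broadcast are stubbed out as comments, so this illustrates the transform-assign-buffer core rather than a production server:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Ins:
    pos: int
    text: str

def transform(op: Ins, against: Ins) -> Ins:
    """Shift op right when a concurrent, earlier-ordered insert landed at or before it."""
    if against.pos <= op.pos:
        return Ins(op.pos + len(against.text), op.text)
    return op

@dataclass
class DocState:
    version: int = 0
    ops_buffer: list = field(default_factory=list)   # [(version, Ins)], last N kept

def receive(state: DocState, client_op: Ins, parent_version: int):
    # (1) caller is assumed to hold the per-document lock
    # (2-3) transform against every op the client has not yet seen
    for v, server_op in state.ops_buffer:
        if v > parent_version:
            client_op = transform(client_op, server_op)
    # (4-5) assign the next version; buffer the op for future transforms
    state.version += 1
    state.ops_buffer.append((state.version, client_op))
    # (6-8) persist_async / broadcast / ack would go here
    return state.version, client_op

state = DocState()
receive(state, Ins(0, "hello"), parent_version=0)       # server assigns v1
v, op = receive(state, Ins(2, "x"), parent_version=0)   # concurrent: shifted by "hello"
assert (v, op.pos) == (2, 7)
```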
Deep Dive 2: Document Storage, Versioning & Snapshotting
Storing every individual operation forever is necessary for version history but expensive for document loading. We use a snapshot + operation log strategy:
Snapshot Strategy:
- A background worker periodically takes a full document snapshot (every 1000 operations or every 5 minutes, whichever comes first)
- Snapshot is stored in PostgreSQL as a JSONB blob (the full rich-text document tree)
- To load a document: fetch latest snapshot, then replay all ops after that snapshot’s version
Version History:
- Users see named versions (auto-saved every 30 minutes + manually named versions)
- To render version N: find the closest snapshot before N, replay ops up to N
- For frequently accessed historical versions, we can cache reconstructed snapshots
Compaction:
- Ops older than 1 year are compacted: we keep only hourly snapshots beyond 1 year
- This bounds storage growth while preserving meaningful history
Recovery:
- If a Collab Server crashes, the new owner loads the latest snapshot + replays ops from Cassandra
- Recovery time: snapshot load (50KB, ~5ms) + replay 1000 ops (~20ms) = <50ms to fully reconstruct document state
Timeline of a document:
Snapshot v0 ──op1──op2──...──op999──op1000── Snapshot v1000 ──op1001──...──op1500 (current)
(PostgreSQL) (Cassandra) (PostgreSQL) (Cassandra)
To load current state:
1. Fetch snapshot at v1000
2. Fetch ops 1001-1500 from Cassandra
3. Replay 500 ops → current document state
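Step 3 can be sketched in a few lines (inserts only; a real replay also handles deletes and formatting ops, and the snapshot/op values here are toy stand-ins):

```python
def replay(snapshot_text: str, ops: list[tuple[int, str]]) -> str:
    """Rebuild current state: start from the snapshot, apply each logged op
    in version order. Ops are simplified to (pos, text) inserts."""
    doc = snapshot_text
    for pos, text in ops:
        doc = doc[:pos] + text + doc[pos:]
    return doc

snapshot = "hello world"            # stand-in for the snapshot at v1000
ops_after = [(5, ","), (12, "!")]   # stand-ins for ops 1001 and 1002
assert replay(snapshot, ops_after) == "hello, world!"
```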
Deep Dive 3: WebSocket Connection Management & Presence
Connection Management:
- Each document maps to exactly one Collab Server (consistent hashing on doc_id)
- When a user opens a document, the API Gateway establishes a WebSocket to the appropriate Collab Server
- If the Collab Server dies, clients reconnect through the API Gateway, which routes them to the new owner (via consistent hashing with virtual nodes)
- Client maintains a heartbeat (every 30s). Server evicts clients after 90s of silence.
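The doc-to-server mapping above can be sketched as a consistent-hash ring with virtual nodes. MD5 and the vnode count are arbitrary illustrative choices, not a claim about any particular implementation:

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Consistent-hash ring with virtual nodes: each doc_id maps to exactly
    one Collab Server, and removing a server only remaps the docs it owned."""
    def __init__(self, servers: list[str], vnodes: int = 100):
        self.ring = sorted((self._h(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, doc_id: str) -> str:
        # First vnode clockwise from the doc's hash owns the doc.
        i = bisect_right(self.keys, self._h(doc_id)) % len(self.keys)
        return self.ring[i][1]

servers = ["collab-1", "collab-2", "collab-3"]
ring = Ring(servers)
assert ring.owner("d_abc123") == ring.owner("d_abc123")   # routing is deterministic
```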
Presence System:
- Cursor position, selection range, and user info (name, avatar color) are sent as ephemeral WebSocket messages
- Not persisted to Cassandra — presence is transient
- Stored in Redis with 60s TTL (refreshed on every cursor update)
- For cross-server presence (when viewers are on different API Gateway instances), Redis Pub/Sub fans out cursor updates
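The TTL behavior can be modeled with an in-memory stand-in; the clock is injected so expiry is deterministic. Real code would use Redis SETEX plus Pub/Sub rather than this toy dict:

```python
class PresenceStore:
    """In-memory stand-in for the Redis pattern: entries carry a 60s TTL
    that is refreshed on every cursor update; stale entries simply lapse."""
    def __init__(self, ttl: float = 60.0):
        self.ttl = ttl
        self.entries = {}   # (doc_id, user_id) -> (cursor, expires_at)

    def update(self, doc_id: str, user_id: str, cursor: dict, now: float) -> None:
        self.entries[(doc_id, user_id)] = (cursor, now + self.ttl)

    def active(self, doc_id: str, now: float) -> dict:
        return {u: cursor for (d, u), (cursor, exp) in self.entries.items()
                if d == doc_id and exp > now}

store = PresenceStore(ttl=60)
store.update("d_abc123", "u_a", {"pos": 47}, now=0)
store.update("d_abc123", "u_b", {"pos": 10}, now=30)
assert set(store.active("d_abc123", now=50)) == {"u_a", "u_b"}
assert set(store.active("d_abc123", now=70)) == {"u_b"}   # u_a's TTL lapsed
```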
Scaling Considerations:
- A document with 100 concurrent editors generates ~200 cursor updates/sec (each user moves cursor ~2x/sec)
- Each update is broadcast to 99 other users = 19,800 messages/sec per document
- We throttle cursor broadcasts to 5Hz (every 200ms) and batch updates to reduce WebSocket frame overhead
- For “spectator” mode (100+ viewers), we switch to polling presence every 2 seconds instead of real-time push
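The 5 Hz throttle amounts to a coalescing buffer: later cursor positions overwrite earlier ones, and a periodic flush emits at most one update per user per 200 ms. A sketch with illustrative names:

```python
class CursorThrottle:
    """Coalesce cursor updates: keep only the newest position per user and
    broadcast at most once per interval (0.2s => 5 Hz)."""
    def __init__(self, interval: float = 0.2):
        self.interval = interval
        self.last_sent: dict[str, float] = {}
        self.pending: dict[str, dict] = {}

    def on_cursor(self, user_id: str, cursor: dict) -> None:
        self.pending[user_id] = cursor   # a newer update overwrites an older one

    def flush(self, now: float) -> list:
        """Called on a timer; returns the updates to broadcast this tick."""
        out = []
        for user_id in list(self.pending):
            if now - self.last_sent.get(user_id, float("-inf")) >= self.interval:
                out.append((user_id, self.pending.pop(user_id)))
                self.last_sent[user_id] = now
        return out

t = CursorThrottle()
t.on_cursor("u_abc", {"pos": 47})
t.on_cursor("u_abc", {"pos": 50})             # overwrites; only pos 50 is sent
assert t.flush(now=0.0) == [("u_abc", {"pos": 50})]
t.on_cursor("u_abc", {"pos": 55})
assert t.flush(now=0.1) == []                 # inside the 200 ms window: suppressed
assert t.flush(now=0.25) == [("u_abc", {"pos": 55})]
```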
Offline Editing & Sync:
- Client stores pending operations in IndexedDB when offline
- On reconnect, client sends all buffered ops with their parent_version
- Server transforms and applies them in sequence
- If the gap is too large (>10,000 ops behind), server sends a full snapshot + diff instead of transforming each op
7. Extensions (2 min)
- Suggesting mode & comments: Operations include a “suggestion” flag that renders as tracked changes. Comments are anchored to ranges using stable position markers that survive OT transforms.
- Real-time spell check & grammar: Offload to a separate microservice. Client sends sentences after a debounce; server returns underline ranges. Must handle that text may shift before the response arrives (use version-tagged positions).
- Document templates & mail merge: Template engine that substitutes placeholders. Rendered server-side for consistency. Templates stored as special documents with variable bindings.
- Export to PDF/DOCX: Background rendering service (e.g., headless Chromium for PDF, Apache POI for DOCX). Queued via message broker to avoid blocking.
- Mobile editing with bandwidth constraints: Compress operations with protocol buffers. Batch ops over unreliable connections. Use CRDTs for mobile-specific offline scenarios where OT’s server dependency is problematic.