1. Requirements & Scope (5 min)

Functional Requirements

  1. Upload, download, and organize files in a hierarchical folder structure with support for large files (up to 10GB) via chunked/resumable uploads
  2. Version control: maintain full version history of documents, allow reverting to any previous version, and show diffs for text-based files
  3. Fine-grained access control: per-document and per-folder permissions (owner, editor, viewer), shareable links with expiry, and organization-wide policies
  4. Full-text search across document contents, metadata, and tags. Support filters by file type, date, owner, and folder.
  5. Real-time collaborative editing for text documents (Google Docs-style) with conflict resolution and presence indicators

Non-Functional Requirements

  • Availability: 99.99% for reads (viewing/downloading). 99.9% for writes (uploading/editing). Acceptable to queue uploads during degraded states.
  • Latency: File listing < 100ms. File download start (first byte) < 200ms. Search results < 500ms. Collaborative edit sync < 100ms (real-time feel).
  • Durability: 99.999999999% (11 nines). Documents are business-critical, so target effectively zero data loss via replication + backups.
  • Scale: 500M documents totaling 50PB of storage. 10M users. 1M uploads/day. 10M downloads/day. 100K concurrent collaborative editing sessions.
  • Compliance: Audit trail for all document access. Support for retention policies and legal holds. GDPR right to deletion.

2. Estimation (3 min)

Storage

  • 500M documents, average 100MB = 50PB total storage
  • With 5 versions average per document, dedup holds version overhead to ~3x: 50PB × 3 = 150PB raw storage. S3's internal redundancy is already priced in, so no extra replication multiplier is needed.
  • Daily uploads: 1M files × 100MB average = 100TB/day
  • Search index: 500M documents × 5KB extracted text average = 2.5TB Elasticsearch index

Traffic

  • Uploads: 1M/day = 12 uploads/sec average, peak 100/sec
  • Downloads: 10M/day = 115 downloads/sec average, peak 1K/sec
  • Search: 5M queries/day = 58 queries/sec
  • Collaborative editing: 100K concurrent sessions, each generating ~10 operations/sec = 1M ops/sec for the collaboration engine

Bandwidth

  • Uploads: 100TB/day = 9.3 Gbps sustained
  • Downloads: 10M × 100MB = 1PB/day = 93 Gbps sustained → CDN handles this
  • Total: ~100 Gbps — significant but manageable with CDN offload

Key Insight

Storage cost dominates. At S3 Standard pricing ($0.023/GB/month), 50PB = $1.15M/month. Tiering colder documents down through Infrequent Access and the Glacier classes cuts this to roughly $335K/month (the breakdown is in Deep Dive 2). Storage tiering is a critical cost optimization, not just a nice-to-have.
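The arithmetic above can be sanity-checked in a few lines. Prices are the S3 list prices quoted in this doc; the tier split matches the breakdown in Deep Dive 2:

```python
# Back-of-envelope storage cost check. Prices in $/GB/month (S3 list prices
# quoted in the text); the tier mix is the Deep Dive 2 split.
PB = 1_000_000  # GB per PB (decimal)

price = {"hot": 0.023, "warm": 0.0125, "cold": 0.004, "archive": 0.00099}
mix_pb = {"hot": 5, "warm": 10, "cold": 20, "archive": 15}  # sums to 50PB

all_hot = 50 * PB * price["hot"]                          # everything hot
tiered = sum(mix_pb[t] * PB * price[t] for t in mix_pb)   # tiered mix

print(f"all-hot: ${all_hot:,.0f}/mo, tiered: ${tiered:,.0f}/mo "
      f"({1 - tiered / all_hot:.0%} savings)")
```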


3. API Design (3 min)

File Operations

// Initiate resumable upload
POST /v1/files/upload
  Body: {
    "name": "Q4-Report.pdf",
    "folder_id": "fld_abc",
    "size_bytes": 52428800,        // 50MB
    "content_type": "application/pdf",
    "checksum_sha256": "abc123..."
  }
  Response 200: {
    "upload_id": "upl_xyz",
    "chunk_size": 5242880,          // 5MB chunks
    "upload_url": "https://upload.service.com/upl_xyz"
  }

// Upload chunk
PUT /v1/files/upload/{upload_id}/chunks/{chunk_number}
  Body: <binary chunk data>
  Headers: Content-Range: bytes 0-5242879/52428800
  Response 200: { "chunk_number": 0, "status": "received" }

// Complete upload
POST /v1/files/upload/{upload_id}/complete
  Response 201: {
    "file_id": "file_123",
    "version": 1,
    "size_bytes": 52428800,
    "url": "https://cdn.service.com/file_123"
  }
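The range bookkeeping for the chunk PUTs above is easy to get off by one; a small helper makes it mechanical (header format as in this section; the actual network calls are omitted):

```python
# Compute the Content-Range header for each chunk of a resumable upload.
def chunk_ranges(size_bytes: int, chunk_size: int):
    """Yield (chunk_number, header) pairs matching the PUT /chunks API."""
    for number, start in enumerate(range(0, size_bytes, chunk_size)):
        end = min(start + chunk_size, size_bytes) - 1  # inclusive end byte
        yield number, f"bytes {start}-{end}/{size_bytes}"

# 50MB file in 5MB chunks -> 10 chunks; first header matches the example above.
ranges = list(chunk_ranges(52_428_800, 5_242_880))
print(ranges[0])  # (0, 'bytes 0-5242879/52428800')
```

Failed chunks can then be retried individually by re-sending just that (chunk_number, header, data) triple.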

// Download (pre-signed URL for direct S3/CDN access)
GET /v1/files/{file_id}/download?version=2
  Response 302: Location: https://cdn.service.com/file_123?token=...&expires=...

// List folder contents
GET /v1/folders/{folder_id}/contents?sort=modified&cursor=xxx&limit=50
  Response 200: {
    "items": [
      {"id": "fld_child", "type": "folder", "name": "Reports", "modified_at": "..."},
      {"id": "file_123", "type": "file", "name": "Q4-Report.pdf", "size": 52428800, ...}
    ],
    "next_cursor": "yyy"
  }

// Search
GET /v1/search?q=quarterly+revenue&type=pdf&folder_id=fld_abc&modified_after=2026-01-01
  Response 200: {
    "results": [
      {
        "file_id": "file_123",
        "name": "Q4-Report.pdf",
        "snippet": "...quarterly revenue increased by 23% to $4.2B...",
        "path": "/Finance/Reports/Q4-Report.pdf",
        "score": 0.92,
        "modified_at": "2026-02-15T10:00:00Z"
      }
    ],
    "total": 47,
    "next_cursor": "zzz"
  }

Sharing

POST /v1/files/{file_id}/share
  Body: {
    "type": "user",                     // user, group, link
    "target": "u_456",
    "permission": "editor",             // owner, editor, commenter, viewer
    "expires_at": "2026-03-01T00:00:00Z"
  }

POST /v1/files/{file_id}/share
  Body: {
    "type": "link",
    "permission": "viewer",
    "password": "optional",
    "expires_at": "2026-03-01T00:00:00Z"
  }
  Response 201: { "share_url": "https://docs.service.com/s/abc123" }

4. Data Model (3 min)

Files and Folders (PostgreSQL)

Table: files
  file_id          (PK) | uuid
  name                   | varchar(255)
  folder_id        (FK) | uuid (nullable, null = root)
  owner_id         (FK) | uuid
  org_id           (FK) | uuid
  current_version        | int
  size_bytes             | bigint
  content_type           | varchar(100)
  storage_tier           | enum('hot', 'warm', 'cold', 'archive')
  status                 | enum('active', 'trashed', 'deleted')
  created_at             | timestamp
  updated_at             | timestamp
  trashed_at             | timestamp
  INDEX: (folder_id, name)  -- folder listings
  INDEX: (owner_id)          -- "my files"
  INDEX: (org_id, updated_at) -- org-wide recent files

Table: folders
  folder_id        (PK) | uuid
  name                   | varchar(255)
  parent_id        (FK) | uuid (nullable, null = root)
  owner_id         (FK) | uuid
  org_id           (FK) | uuid
  path                   | ltree   -- materialized path: "root.finance.reports"
  created_at             | timestamp
  INDEX: GIST on path    -- efficient ancestor/descendant queries

Table: file_versions
  version_id       (PK) | uuid
  file_id          (FK) | uuid
  version_number         | int
  storage_key            | varchar(500) -- S3 key
  size_bytes             | bigint
  checksum_sha256        | varchar(64)
  created_by             | uuid
  created_at             | timestamp
  change_description     | text
  UNIQUE: (file_id, version_number)

Permissions (PostgreSQL)

Table: permissions
  permission_id    (PK) | uuid
  resource_id            | uuid   -- file_id or folder_id
  resource_type          | enum('file', 'folder')
  grantee_id             | uuid   -- user_id or group_id
  grantee_type           | enum('user', 'group')
  role                   | enum('owner', 'editor', 'commenter', 'viewer')
  inherited              | boolean -- true if inherited from parent folder
  granted_by             | uuid
  expires_at             | timestamp
  INDEX: (resource_id, resource_type)
  INDEX: (grantee_id, grantee_type)

Table: share_links
  link_id          (PK) | uuid
  token            (UQ) | varchar(64) -- URL token
  file_id          (FK) | uuid
  permission             | enum('viewer', 'commenter', 'editor')
  password_hash          | varchar(255)
  expires_at             | timestamp
  max_downloads          | int
  download_count         | int
  created_by             | uuid

Why These Choices

  • PostgreSQL for metadata: ACID transactions for move/rename/permission changes. ltree extension for efficient folder hierarchy queries (“all descendants of folder X”). Recursive CTEs for permission inheritance.
  • S3 for file content: 11 nines durability, automatic replication, lifecycle policies for tiering, pre-signed URLs for direct download (bypass our servers).
  • Elasticsearch for search: Full-text search with relevance scoring, faceted search (filter by type, date), highlighting snippets, scales to TB-level indexes.
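As a concrete illustration of the inheritance logic, a pure-Python sketch of resolving a user's effective role by walking up the folder tree (the same walk the recursive CTE performs in SQL; the IDs and in-memory maps are hypothetical stand-ins for the permissions/folders tables):

```python
# Resolve a user's effective role on a resource: check direct grants,
# then walk up parent folders, keeping the strongest role seen.
ROLE_RANK = {"viewer": 1, "commenter": 2, "editor": 3, "owner": 4}

def effective_role(user_id, resource_id, grants, parents):
    """grants: {(resource_id, user_id): role}; parents: {child: parent or None}."""
    best = None
    node = resource_id
    while node is not None:
        role = grants.get((node, user_id))
        if role and (best is None or ROLE_RANK[role] > ROLE_RANK[best]):
            best = role
        node = parents.get(node)  # None at the root terminates the walk
    return best

parents = {"file_123": "fld_reports", "fld_reports": "fld_finance",
           "fld_finance": None}
grants = {("fld_finance", "u_456"): "viewer",
          ("fld_reports", "u_456"): "editor"}
print(effective_role("u_456", "file_123", grants, parents))  # editor
```

The closer grant does not automatically win; the strongest role along the path does, which matches the "inherited" semantics of the permissions table.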

5. High-Level Design (12 min)

Upload Flow

Client (browser/app)
  → POST /v1/files/upload (initiate)
  → API Server:
      1. Validate: user has write permission on folder
      2. Create upload record in DB (status: uploading)
      3. Create multipart upload in S3
      4. Return upload_id + pre-signed chunk upload URLs (the PUT /chunks endpoint from the API section fronts these for clients that cannot reach S3 directly)

  → Client uploads chunks directly to S3 (parallel, 5MB each):
      PUT {pre-signed-url-chunk-0} <data>
      PUT {pre-signed-url-chunk-1} <data>
      ... (can retry individual failed chunks)

  → POST /v1/files/upload/{upload_id}/complete
  → API Server:
      1. Complete S3 multipart upload
      2. Verify checksum (SHA-256 of all chunks)
      3. Create file record + version 1 in DB
      4. Publish FileCreated event to Kafka:
         → Trigger: text extraction, thumbnail generation, search indexing
      5. Return file metadata

Download Flow

Client → GET /v1/files/{file_id}/download
  → API Server:
      1. Check permission: user has view access?
      2. Log access in audit trail
      3. Generate pre-signed S3 URL (expires in 15 minutes)
      4. Redirect (302) to pre-signed URL
  → Client downloads directly from S3/CDN
  (our servers never touch the file bytes on download — S3/CDN handles it)

Processing Pipeline (async, after upload)

Kafka: FileCreated / FileVersionCreated event
  → Text Extraction Worker:
      PDF → Apache Tika → plain text
      DOCX → python-docx → plain text
      Images → OCR (Tesseract) → plain text
      → Write extracted text to file_text table

  → Search Indexer:
      → Index in Elasticsearch: file metadata + extracted text

  → Thumbnail Generator:
      PDF → first page rendered as PNG
      Images → resized to 200×200
      Videos → frame at 2 seconds
      → Store thumbnail in S3, link from file metadata

  → Metadata Extractor:
      EXIF data from images (camera, GPS, date)
      Author/title from Office documents
      → Store as metadata tags on file record
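The fan-out above is one consumer group per worker type. A sketch of the common dispatch skeleton (handler bodies are stubs, and the Kafka client itself is omitted; any consumer yielding event dicts plugs in the same way):

```python
# Skeleton of a processing worker: a handler registry plus a dispatch step.
HANDLERS = {}

def handles(event_type):
    """Register a handler for one event type (FileCreated, etc.)."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register

@handles("FileCreated")
def extract_text(event):
    return f"extracted text for {event['file_id']}"  # Tika/Tesseract stub

@handles("FileVersionCreated")
def reindex(event):
    return f"reindexed {event['file_id']} v{event['version']}"

def dispatch(event):
    handler = HANDLERS.get(event["type"])
    return handler(event) if handler else None  # unknown events are skipped

print(dispatch({"type": "FileCreated", "file_id": "file_123"}))
```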

Components

  1. API Service: Handles CRUD, permission checks, upload/download orchestration. Stateless, horizontally scaled.
  2. Object Store (S3): File content, all versions. Lifecycle policies for storage tiering. Cross-region replication for durability.
  3. Metadata DB (PostgreSQL): Files, folders, versions, permissions. Strongly consistent for permission changes.
  4. Search Engine (Elasticsearch): Full-text index of document contents and metadata. Supports faceted search and highlighting.
  5. Processing Workers: Text extraction, thumbnail generation, OCR, metadata extraction. Kafka-driven, auto-scaled.
  6. Collaboration Engine: Real-time document editing. OT/CRDT-based conflict resolution. WebSocket connections.
  7. CDN (CloudFront): Serves file downloads globally. Pre-signed URLs with expiry. Reduces S3 bandwidth costs.
  8. Audit Service: Immutable log of all file access, permission changes, and sharing events. Required for compliance.

6. Deep Dives (15 min)

Deep Dive 1: Collaborative Editing with Operational Transformation (OT)

The problem: Two users editing the same document simultaneously. User A types “Hello” at position 0. User B deletes character at position 5. If applied naively, the operations conflict.

Operational Transformation (OT) — Google Docs approach:

Model: the document is a sequence of characters. Each edit is an "operation":
  Insert(position, character)
  Delete(position)

Transform function: given two concurrent operations, transform each so they
produce the same result regardless of application order.

Example:
  Document: "ABCDE"  (positions are 0-indexed)
  User A: Insert(0, "X")  → "XABCDE"
  User B: Delete(3)       → "ABCE"  (deleted "D")

  If A's op arrives first:
    Apply A: "XABCDE"
    Transform B against A: Delete(3) → Delete(4)  (shifted by 1 due to A's insert)
    Apply transformed B: "XABCE"

  If B's op arrives first:
    Apply B: "ABCE"
    Transform A against B: Insert(0, "X") → Insert(0, "X")  (unaffected, insertion point precedes the deletion)
    Apply transformed A: "XABCE"

  Both orders produce "XABCE" ✓
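The transform function for this worked example fits in a few lines for plain insert/delete (0-indexed positions, so deleting "D" from "ABCDE" is Delete(3); the two-deletes-at-the-same-position tie and rich text are deliberately out of scope):

```python
def transform(op, against):
    """Rewrite `op` so it applies correctly after `against` has been applied.

    Ops are ("ins", pos, char) or ("del", pos). The same-position delete/delete
    tie (op becomes a no-op) is omitted in this sketch.
    """
    kind, pos = op[0], op[1]
    akind, apos = against[0], against[1]
    if akind == "ins":
        shift = 1 if pos >= apos else 0   # insert before us pushes us right
    else:
        shift = -1 if pos > apos else 0   # delete before us pulls us left
    return (kind, pos + shift) + op[2:]

def apply(doc, op):
    if op[0] == "ins":
        return doc[:op[1]] + op[2] + doc[op[1]:]
    return doc[:op[1]] + doc[op[1] + 1:]

a = ("ins", 0, "X")   # User A: Insert(0, "X")
b = ("del", 3)        # User B: Delete(3), removes "D" from "ABCDE"

# Both application orders converge to the same document:
assert apply(apply("ABCDE", a), transform(b, a)) == "XABCE"
assert apply(apply("ABCDE", b), transform(a, b)) == "XABCE"
```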

Architecture:

Client A ←→ WebSocket ←→ Collaboration Server ←→ WebSocket ←→ Client B
                              ↕
                        Document Store (Redis)
                              ↕
                        Persistence (S3 + DB)

Flow:
  1. Client A opens document → WebSocket to Collaboration Server
  2. Server loads document state from Redis (or S3 on cache miss)
  3. Client A makes edit → sends operation to server
  4. Server:
     a. Receives op with client's last-known server version
     b. Transforms op against all ops between client's version and current version
     c. Applies transformed op to server document state
     d. Broadcasts transformed op to all other clients
     e. Acknowledges to Client A
  5. Client B receives op → applies to local state

Server is the single source of truth (central server model).
Each document has exactly one server (sharded by document_id).
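The document-to-server mapping can be a stable hash. A sketch using rendezvous (highest-random-weight) hashing, which moves only the affected documents when a server joins or leaves (server names are illustrative):

```python
import hashlib

def owner_server(document_id: str, servers: list[str]) -> str:
    """Pick the single collaboration server responsible for a document."""
    def weight(server):
        # Score each (server, document) pair; the highest score wins.
        digest = hashlib.sha256(f"{server}:{document_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(servers, key=weight)

servers = ["collab-1", "collab-2", "collab-3"]
owner = owner_server("doc_42", servers)
# The mapping is deterministic, so every router agrees on the owner:
assert owner_server("doc_42", servers) == owner
```

Unlike `hash(doc_id) % N`, removing one server here reassigns only the documents that server owned, which keeps WebSocket sessions stable during scaling events.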

Why OT over CRDT?

OT:
  + Simpler operations (insert/delete at position)
  + Smaller operation size (position + char vs full CRDT metadata)
  + Well-understood and battle-tested (Google Docs has used OT for well over a decade)
  - Requires central server for transformation
  - Complex transform functions for rich text (bold, lists, tables)

CRDT (Conflict-free Replicated Data Type):
  + Truly peer-to-peer (no central server needed)
  + Convergence guaranteed by data structure properties
  - Larger metadata overhead (each character carries unique ID)
  - Tombstones accumulate (deleted characters leave metadata)
  - More complex implementation for rich text

For our centralized service: OT is the better choice.
For offline-first or P2P: CRDT (e.g., Yjs, Automerge).

Handling disconnects and reconnects:

Client disconnects:
  → Local edits queued in client buffer
  → On reconnect: send buffer to server with last-known version
  → Server transforms buffered ops against missed server ops
  → Merge and broadcast

Document auto-save:
  → Every 30 seconds OR after 100 operations:
    → Snapshot document state → create new version in S3
    → Update file_versions table
  → Prevents data loss on server crash

Deep Dive 2: File Versioning and Storage Optimization

Naive versioning: Store every version as a complete copy.

  • File: 10MB document, 50 versions = 500MB
  • At scale: 500M files × 5 versions × 100MB = 250PB → extremely expensive

Storage optimization strategies:

Strategy 1: Delta/diff storage (for text-based files)

Version 1: Full content (10MB)
Version 2: Delta from V1 (50KB — only the changed paragraphs)
Version 3: Delta from V2 (30KB)
...
Every 10th version: full snapshot (to limit delta chain length)

Storage for 50 versions: 5 snapshots × 10MB + 45 deltas × 50KB ≈ 52MB (vs 500MB naive)
~90% reduction!

Implementation:
  → Use xdelta3 or bsdiff for binary delta computation
  → Store deltas in S3 alongside full snapshots
  → To reconstruct V7: load V1 (snapshot) + apply deltas V2-V7
  → Max delta chain: 10 → max 10 × 50KB = 500KB overhead to reconstruct
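Reconstruction is simple once the delta format is treated as opaque. In this sketch, deltas are stubbed as callables standing in for xdelta3/bsdiff patches fetched from S3:

```python
# Reconstruct version N from the nearest earlier snapshot plus its delta chain.
# Deltas are stubbed as (old_bytes -> new_bytes) callables; in production they
# would be xdelta3/bsdiff patches applied to the previous version's bytes.
def reconstruct(version, snapshots, deltas):
    """snapshots: {version: bytes}; deltas: {version: callable}."""
    base = max(v for v in snapshots if v <= version)  # nearest snapshot <= N
    data = snapshots[base]
    for v in range(base + 1, version + 1):            # apply chain in order
        data = deltas[v](data)
    return data

snapshots = {1: b"v1 full content"}
deltas = {2: lambda d: d + b" +p2", 3: lambda d: d + b" +p3"}
print(reconstruct(3, snapshots, deltas))  # b'v1 full content +p2 +p3'
```

Capping the chain at 10 bounds both the number of S3 fetches and the patch-apply work per read.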

Strategy 2: Content-Defined Chunking (CDC) with deduplication

Instead of storing whole files, break into content-defined chunks:
  1. Use Rabin fingerprint rolling hash to identify chunk boundaries
     → Average chunk size: 4MB (min 1MB, max 16MB)
  2. Hash each chunk (SHA-256)
  3. Store chunks in S3, deduplicated by hash:
     chunk:sha256:abc123 → <data>
  4. File version = ordered list of chunk hashes

Example:
  V1: [chunk_A, chunk_B, chunk_C, chunk_D]  (stored: 4 chunks, 40MB)
  V2: [chunk_A, chunk_B, chunk_E, chunk_D]  (stored: 1 new chunk, 10MB)
  → chunk_A, chunk_B, chunk_D are reused (deduplication)
  → Total: 50MB instead of 80MB

Cross-file dedup: if two users upload the same PowerPoint template,
  → Only one copy stored. Both file records reference the same chunks.
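A toy version of CDC plus a hash-addressed chunk store (the boundary rule and tiny sizes here are illustrative; production uses Rabin fingerprints with ~4MB average chunks):

```python
import hashlib

# Simplified content-defined chunking: declare a boundary wherever the hash
# of the trailing 16-byte window matches a target bit pattern. Demo sizes
# are tiny so the example runs on KB-scale data.
WINDOW, MIN, MAX, MASK = 16, 32, 256, 0x3F

def chunk(data: bytes):
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        window = data[max(i - WINDOW + 1, 0): i + 1]
        h = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if (length >= MIN and (h & MASK) == 0) or length >= MAX:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # flush the tail
    return chunks

store = {}  # sha256 hex -> chunk bytes (the dedup store)

def put(data: bytes):
    """Store a file version; return its recipe (ordered chunk hashes)."""
    recipe = []
    for c in chunk(data):
        digest = hashlib.sha256(c).hexdigest()
        store.setdefault(digest, c)  # identical chunks stored once
        recipe.append(digest)
    return recipe

v1 = bytes(range(256)) * 8            # 2KB "file"
v2 = v1 + b"appended edit"            # new version, mostly unchanged
r1, r2 = put(v1), put(v2)
assert b"".join(store[h] for h in r1) == v1  # reconstruction works
assert len(store) < len(r1) + len(r2)        # shared chunks deduped
```

Because boundaries depend only on content, the unchanged prefix of v2 chunks identically to v1, so only the tail near the edit produces new chunks.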

Storage tiering (lifecycle management):

Automatic tiering based on access patterns:

Hot (S3 Standard, $0.023/GB/month):
  → Files accessed in last 30 days
  → Current version of all active files

Warm (S3 Infrequent Access, $0.0125/GB/month):
  → Files accessed 30-90 days ago
  → Non-current versions (V1-V4 when V5 is current)

Cold (S3 Glacier Instant Retrieval, $0.004/GB/month):
  → Files accessed > 90 days ago
  → Retrieval time: milliseconds (instant)

Archive (S3 Glacier Deep Archive, $0.00099/GB/month):
  → Files not accessed in > 1 year
  → Retrieval time: 12 hours (acceptable for compliance/legal)
  → Primarily: old versions, legal holds

Cost reduction:
  Hot only (50PB): $1.15M/month
  Tiered mix:      ~$335K/month (hot: 5PB, warm: 10PB, cold: 20PB, archive: 15PB)
  → ~71% cost savings

Deep Dive 3: Search and Content Indexing

Challenge: Users search for “Q4 revenue” and expect to find a PDF buried in a nested folder, even if the file is named “report_final_v3.pdf.” We need full-text search of document contents.

Text extraction pipeline:

File uploaded → Kafka event → Text Extraction Worker

By file type:
  PDF        → Apache Tika (handles most formats)
  DOCX/XLSX  → Apache Tika or python-docx
  Images     → Tesseract OCR (+ GPU for throughput)
  Code files → Index as-is (already text)
  Videos     → Extract audio → speech-to-text (Whisper)

Output: plain text + structural metadata (headings, tables, page numbers)
  → Store in file_text table (for re-indexing)
  → Send to Elasticsearch

Elasticsearch index design:

Index: documents
  Mapping: {
    "file_id":       { "type": "keyword" },
    "name":          { "type": "text", "analyzer": "standard",
                       "fields": { "keyword": { "type": "keyword" } } },
    "content":       { "type": "text", "analyzer": "standard" },
    "path":          { "type": "text", "analyzer": "path_hierarchy" },
    "content_type":  { "type": "keyword" },
    "owner_id":      { "type": "keyword" },
    "org_id":        { "type": "keyword" },
    "tags":          { "type": "keyword" },
    "created_at":    { "type": "date" },
    "updated_at":    { "type": "date" },
    "size_bytes":    { "type": "long" }
  }

Query: "Q4 revenue" in org_123, PDFs only, modified this year:
  {
    "bool": {
      "must": [
        { "multi_match": { "query": "Q4 revenue", "fields": ["name^3", "content"] } }
      ],
      "filter": [
        { "term": { "org_id": "org_123" } },
        { "term": { "content_type": "application/pdf" } },
        { "range": { "updated_at": { "gte": "2026-01-01" } } }
      ]
    }
  }

Search with access control (the hard part):

Problem: User A searches "revenue" but should only see files they have access to.
Naive: post-filter by running a permission check on each hit → slow (one recursive permission lookup per result × 50 results on every query), and pagination breaks when hits are filtered out.

Solution: Index permissions in Elasticsearch:
  "authorized_users":  ["u_123", "u_456"],
  "authorized_groups": ["g_admin", "g_finance"]

Add permission filter to every search query:
  "filter": [
    { "bool": { "should": [
      { "term": { "authorized_users": "u_123" } },
      { "terms": { "authorized_groups": ["g_admin", "g_finance"] } }
    ] } }
  ]

Challenge: when permissions change, re-index affected documents.
  → Permission change on a folder with 10K files → 10K re-index operations
  → Use bulk API, async. Search may be stale for ~30 seconds after permission change.
  → Acceptable trade-off: search is eventually consistent with permissions.
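Query construction can then be one pure function that always appends the permission clause, so no unfiltered path exists (field names match the mapping above; the actual es.search call is omitted):

```python
def build_query(q, user_id, group_ids, org_id, filters=None):
    """Build the ES bool query with the mandatory permission filter."""
    filter_clauses = [{"term": {"org_id": org_id}}] + list(filters or [])
    # The permission clause is unconditional: every query carries it.
    filter_clauses.append({"bool": {"should": [
        {"term": {"authorized_users": user_id}},
        {"terms": {"authorized_groups": group_ids}},
    ]}})
    return {"bool": {
        "must": [{"multi_match": {"query": q, "fields": ["name^3", "content"]}}],
        "filter": filter_clauses,
    }}

query = build_query("Q4 revenue", "u_123", ["g_admin", "g_finance"], "org_123",
                    filters=[{"term": {"content_type": "application/pdf"}}])
print(query["bool"]["filter"][-1])  # the permission should-clause
```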

7. Extensions (2 min)

  • Offline sync: Desktop/mobile clients maintain a local copy of synced folders. Detect conflicts when the same file is edited offline by two users. Offer merge, keep-both, or overwrite options. Use vector clocks for conflict detection.
  • Preview generation: Generate in-browser previews for non-native formats: PDF viewer (PDF.js), Office documents (LibreOffice conversion to HTML), CAD files (web-based renderer). Queue-based generation on upload, cached for subsequent views.
  • Retention policies and legal hold: Enforce data retention rules: “delete all files in /Projects older than 7 years.” Legal hold: prevent deletion of specific files/folders during litigation, even if retention policy says delete. Override hierarchy: legal hold > retention policy > user delete.
  • Activity feed and audit: Show per-document activity: “Alice edited on Feb 20, Bob viewed on Feb 21, shared with Finance team.” Per-user activity: “Files you recently opened.” Audit log for compliance: every view, download, share, and permission change.
  • Smart content classification: ML-based auto-tagging (invoice, contract, receipt, photo). PII detection (SSN, credit card numbers) with automatic redaction or access restriction. Content-based duplicate detection (“this file is 95% similar to existing file X”).