1. Requirements & Scope (5 min)

Functional Requirements

  1. Users can upload, download, and delete files of any type and size (up to 5TB per file). Uploads support chunked and resumable upload protocols so large files survive network interruptions.
  2. Files sync automatically across all of a user’s devices. When a file is modified on one device, all other devices reflect the change within seconds.
  3. Sharing and permissions: users can share files/folders with specific people (viewer, commenter, editor) or generate shareable links with configurable access levels. Support organizational domains (anyone at company X can view).
  4. Version history: every edit creates a new version. Users can view and restore previous versions (up to 100 versions, or 30 days, whichever comes first).
  5. Full-text search across file names, contents (for supported formats: docs, PDFs, images via OCR), and metadata (owner, type, modified date).

Non-Functional Requirements

  • Availability: 99.99% for file access (download). Upload can tolerate slightly lower availability (resumable uploads mask brief outages).
  • Latency: Metadata operations (list, search, share): < 200ms. Small file download (< 1MB): < 500ms from edge CDN. Upload acknowledgment: < 100ms per chunk.
  • Consistency: Strong consistency for metadata (permissions, ownership, folder structure). Eventual consistency for file content propagation to other devices is acceptable (target: < 10 seconds).
  • Scale: 2B users total, 500M MAU, 100M DAU. 2 trillion files stored. 10B API requests/day. 5PB of new data uploaded per day.
  • Durability: 99.999999999% (11 nines). Files must never be lost. This is the most critical non-functional requirement.

2. Estimation (3 min)

Traffic

  • 10B API requests/day = ~116K requests/sec average
  • Breakdown: 60% metadata reads (list, search), 25% downloads, 15% uploads
  • Read QPS: ~70K metadata reads/sec + ~29K downloads/sec
  • Write QPS: ~17K uploads/sec
  • Peak: 3x average = ~350K requests/sec total

Storage

  • Total files: 2 trillion
  • Average file size: 2.5MB (skewed: many small docs, fewer large videos)
  • Total storage: 2T x 2.5MB = 5 exabytes (5,000 PB)
  • New data: 5PB/day → 1.8EB/year
  • With 3x replication: 15EB raw storage
  • With deduplication (estimated 30% duplicate data): effective ~10.5EB

Metadata

  • 2 trillion files x 1KB metadata per file = 2PB of metadata
  • This must be in a fast, queryable database — not blob storage

Bandwidth

  • Downloads: 29K/sec x 2.5MB average = 72.5GB/sec = 580Gbps
  • Uploads: 17K/sec x 2.5MB average = 42.5GB/sec = 340Gbps
  • CDN absorbs most download traffic (cache-hit ratio ~85%), so origin bandwidth: ~87Gbps
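The figures above can be checked with quick arithmetic. A minimal sanity-check script, using only the assumptions stated in this section:

```python
# Back-of-envelope check of the traffic, storage, and bandwidth estimates.
# All inputs are the design assumptions above, not measured values.

requests_per_day = 10e9
avg_qps = requests_per_day / 86_400            # ~116K req/s average
metadata_qps = avg_qps * 0.60                  # ~70K metadata reads/s
download_qps = avg_qps * 0.25                  # ~29K downloads/s
upload_qps = avg_qps * 0.15                    # ~17K uploads/s
peak_qps = avg_qps * 3                         # ~350K req/s at peak

total_files = 2e12
avg_file_mb = 2.5
total_storage_eb = total_files * avg_file_mb / 1e12   # MB -> EB: 5 EB

download_gbps = download_qps * avg_file_mb * 8 / 1000  # MB/s -> Gbps: ~580
origin_gbps = download_gbps * 0.15                     # 85% CDN hit: ~87

print(f"avg {avg_qps/1e3:.0f}K req/s, peak {peak_qps/1e3:.0f}K req/s, "
      f"storage {total_storage_eb:.0f} EB, downloads {download_gbps:.0f} Gbps, "
      f"origin {origin_gbps:.0f} Gbps")
```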

Key Insight

This is a storage-dominant system at planetary scale. The core challenges are: (1) storing exabytes of data durably and cost-efficiently, (2) syncing file changes across devices with minimal bandwidth and latency, and (3) making 2 trillion files searchable. The metadata layer is essentially a distributed file system.


3. API Design (3 min)

File Operations

// Initiate resumable upload
POST /api/v1/files/upload/init
  Body: {
    "name": "report.pdf",
    "parent_folder_id": "fld_abc",
    "size": 104857600,                    // 100MB
    "mime_type": "application/pdf",
    "checksum_sha256": "a1b2c3..."
  }
  Response 200: {
    "upload_id": "upl_xyz",
    "chunk_size": 8388608,                // 8MB recommended chunk size
    "upload_url": "https://upload.drive.example.com/upl_xyz"
  }

// Upload chunk
PUT /api/v1/files/upload/{upload_id}/chunks/{chunk_number}
  Headers: Content-Range: bytes 0-8388607/104857600
  Body: <binary chunk data>
  Response 200: { "chunk_number": 0, "status": "RECEIVED" }

// Complete upload
POST /api/v1/files/upload/{upload_id}/complete
  Body: { "chunk_count": 13 }
  Response 201: {
    "file_id": "fil_def456",
    "version": 1,
    "size": 104857600,
    "created_at": "2026-02-22T..."
  }

// Download file
GET /api/v1/files/{file_id}/content?version=3
  Response 302: redirect to CDN signed URL (valid for 15 minutes)

// Get file metadata
GET /api/v1/files/{file_id}
  Response 200: { "file_id": "...", "name": "...", "size": ..., "versions": [...], ... }

// Delete file (move to trash)
DELETE /api/v1/files/{file_id}
  Response 200: { "status": "TRASHED", "permanent_delete_at": "2026-03-24T..." }

// Search
GET /api/v1/files/search?q=quarterly+report&type=pdf&owner=me&modified_after=2026-01-01
  Response 200: { "results": [...], "next_page_token": "..." }
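A client driving the chunked upload API needs to derive chunk numbers and `Content-Range` headers from the server-chosen chunk size. A small sketch of that bookkeeping (the HTTP calls themselves are omitted; the helper is hypothetical, not part of the API):

```python
# Hypothetical client-side helper for the resumable upload API above.
# plan_chunks() computes, for each chunk, the (number, start, end) byte
# range and the Content-Range header the chunk PUT would carry.

def plan_chunks(file_size: int, chunk_size: int):
    chunks = []
    for number, start in enumerate(range(0, file_size, chunk_size)):
        end = min(start + chunk_size, file_size) - 1   # inclusive last byte
        content_range = f"bytes {start}-{end}/{file_size}"
        chunks.append((number, start, end, content_range))
    return chunks

# The 100MB example from the init call: 12 full 8MB chunks + one 4MB tail.
chunks = plan_chunks(104_857_600, 8_388_608)
print(len(chunks), chunks[0][3])   # 13 bytes 0-8388607/104857600
```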

Sharing

POST /api/v1/files/{file_id}/permissions
  Body: {
    "email": "[email protected]",
    "role": "editor",                     // viewer | commenter | editor
    "notify": true,
    "message": "Please review this doc"
  }
  Response 200: { "permission_id": "perm_abc" }

POST /api/v1/files/{file_id}/link
  Body: { "access": "ANYONE_WITH_LINK", "role": "viewer", "expires_at": "2026-03-01" }
  Response 200: { "link": "https://drive.example.com/s/abc123xyz" }

Key Decisions

  • Resumable uploads are mandatory — a 5TB upload must survive network failures without restarting from scratch
  • Download via CDN redirect — signed URLs with short TTL for security; CDN caches popular files at edge
  • Trash with 30-day retention — soft delete with automatic permanent deletion, giving users a recovery window
  • Chunk size is server-determined — allows the server to optimize based on network conditions and file size

4. Data Model (3 min)

File Metadata (Sharded PostgreSQL / Vitess / CockroachDB)

Field               Type            Notes
file_id             UUID            PK; globally unique
owner_id            UUID            File creator
name                VARCHAR(500)    File name
parent_folder_id    UUID            FK; folder hierarchy (nullable for root)
mime_type           VARCHAR(100)
size                BIGINT          Bytes
current_version     INT
checksum_sha256     VARCHAR(64)     Content hash (used for dedup)
status              ENUM            ACTIVE, TRASHED, PERMANENTLY_DELETED
trashed_at          TIMESTAMP       When moved to trash
created_at          TIMESTAMP
updated_at          TIMESTAMP

File Versions (Sharded PostgreSQL)

Field               Type            Notes
file_id             UUID            Composite PK (with version)
version             INT             Composite PK
blob_references     JSONB           List of {chunk_hash, blob_store_key, offset, size}
size                BIGINT          Version size
checksum_sha256     VARCHAR(64)
created_by          UUID            Who uploaded this version
created_at          TIMESTAMP

Permissions (Sharded PostgreSQL)

Field               Type            Notes
file_id             UUID            Composite PK (with principal)
principal           VARCHAR(200)    Composite PK; user_id, group_id, “anyone”, or domain
role                ENUM            owner, editor, commenter, viewer
inherited           BOOLEAN         True if inherited from parent folder
granted_by          UUID
created_at          TIMESTAMP
expires_at          TIMESTAMP       Optional link expiry

Sync State (Per-Device) — Redis + PostgreSQL

Field               Type            Notes
user_id             UUID
device_id           UUID
last_sync_cursor    BIGINT          Server-side change sequence number
last_sync_at        TIMESTAMP

Why These DB Choices?

  • Sharded PostgreSQL (or CockroachDB/Vitess): 2 trillion files require sharding. Shard by owner_id so all of a user’s files are co-located (efficient for listing, searching within “My Drive”). PostgreSQL for ACID transactions on metadata.
  • Blob Storage (GCS / S3 / custom): Actual file content stored in a distributed object store. Files are chunked (8MB chunks) and each chunk is content-addressed (SHA-256 hash as key). This enables deduplication at the chunk level.
  • Elasticsearch: Full-text search index. Indexes file names, extracted text content (from PDF, DOCX, OCR), and metadata. Sharded by user for query isolation.
  • Redis: Sync cursors, device presence, and caching hot metadata (frequently accessed folders, recently used files).
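The shard-by-owner_id choice can be sketched in a few lines. This is a toy illustration, not production routing code; the shard count (4096 logical shards) is an assumption:

```python
# Minimal sketch of owner_id-based shard routing: all of one user's file
# metadata lands on a single logical shard, so listing or searching "My
# Drive" touches one shard. 4096 logical shards is an illustrative choice;
# logical shards would be mapped many-to-one onto physical databases so
# resharding only moves mappings, not rows.

import hashlib

NUM_SHARDS = 4096  # assumed: logical shards, remapped onto physical DBs

def shard_for_owner(owner_id: str) -> int:
    digest = hashlib.sha256(owner_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Every file owned by the same user routes to the same shard:
print(shard_for_owner("usr_123") == shard_for_owner("usr_123"))  # True
```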

5. High-Level Design (12 min)

System Architecture

              ┌────────────────────────────────────────────────┐
              │              Client (Desktop / Mobile / Web)    │
              │  ┌──────────┐  ┌──────────┐  ┌──────────────┐ │
              │  │ File     │  │ Sync     │  │ Web UI       │ │
              │  │ Watcher  │  │ Engine   │  │ (Browser)    │ │
              │  └────┬─────┘  └────┬─────┘  └──────┬───────┘ │
              └───────┼─────────────┼───────────────┼──────────┘
                      │             │               │
                      ▼             ▼               ▼
              ┌─────────────────────────────────────────────────┐
              │                  CDN / Edge                      │
              │            (Download acceleration)               │
              └───────────────────────┬─────────────────────────┘
                                      │
                            ┌─────────▼──────────┐
                            │   API Gateway /    │
                            │   Load Balancer    │
                            └─────────┬──────────┘
                                      │
         ┌────────────────────────────┼────────────────────────────┐
         │                            │                            │
  ┌──────▼──────┐           ┌────────▼────────┐          ┌───────▼───────┐
  │  Metadata   │           │  Upload         │          │  Download     │
  │  Service    │           │  Service        │          │  Service      │
  │ (CRUD, list,│           │ (Chunked,       │          │ (Signed URL   │
  │  search,    │           │  resumable)     │          │  generation,  │
  │  share)     │           │                 │          │  range reqs)  │
  └──────┬──────┘           └────────┬────────┘          └───────┬───────┘
         │                           │                           │
         │                    ┌──────▼──────┐                    │
         │                    │  Chunk      │                    │
         │                    │  Processing │                    │
         │                    │  Pipeline   │                    │
         │                    │ (hash,dedup,│                    │
         │                    │  encrypt)   │                    │
         │                    └──────┬──────┘                    │
         │                           │                           │
    ┌────▼────────────┐      ┌──────▼──────┐           ┌───────▼──────┐
    │ Metadata DB     │      │ Blob Store  │           │    CDN       │
    │ (Sharded PG)    │      │ (Distributed│           │   Origin     │
    │                 │      │  Object     │           │              │
    │ + Elasticsearch │      │  Store)     │           │              │
    └─────────────────┘      └─────────────┘           └──────────────┘
         │
    ┌────▼────────────┐
    │  Sync / Notify  │
    │  Service        │
    │ (Change feed,   │
    │  push notif)    │
    └─────────────────┘

Upload Flow (Resumable, Chunked)

Client wants to upload report.pdf (100MB)
        │
        ▼
POST /api/v1/files/upload/init → Upload Service
  - Creates upload session: upload_id, expected size, chunk_size=8MB
  - Stores session in Redis (TTL 24 hours)
  - Returns upload_id + chunk_size
        │
        ▼
Client splits file into 13 chunks (12 x 8MB + 1 x 4MB)
        │
        ▼
For each chunk (can be parallelized, 4 concurrent):
  PUT /api/v1/files/upload/{upload_id}/chunks/{i}
    │
    ▼
  Upload Service:
    1. Validate chunk number and size
    2. Compute SHA-256 hash of the plaintext chunk (this is the dedup key)
    3. Check blob store: does this hash already exist?
       ├── YES → skip storage (deduplication), just record reference
       └── NO  → encrypt the chunk (AES-256-GCM, content-derived key wrapped
                 by KMS — per-user keys would defeat cross-user dedup), then
                 write it to the blob store with the hash as key
    4. Mark chunk as received in Redis session
    5. Return 200
        │
        ▼
POST /api/v1/files/upload/{upload_id}/complete
    │
    ▼
  Upload Service:
    1. Verify all chunks received
    2. Compute whole-file checksum from chunk hashes (Merkle tree)
    3. Create file metadata in Metadata DB:
       - file_id, name, size, parent_folder_id, version=1
       - blob_references: [{chunk_hash_0, offset_0, size_0}, ...]
    4. Update parent folder's updated_at timestamp
    5. Publish FileCreated event to Sync Service
    6. Enqueue for indexing (Elasticsearch: extract text if PDF/DOCX)
    7. Return 201 with file_id
        │
        ▼
Sync Service picks up FileCreated event:
    - Notify all of this user's other devices via long-poll / WebSocket / push notification
    - Other devices fetch the new file metadata + content
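Step 2 of the complete call derives the whole-file checksum from the per-chunk hashes. A minimal sketch of that derivation — a flat hash-of-hashes; a real Merkle tree would additionally let clients verify individual subranges:

```python
# Whole-file checksum from ordered chunk hashes, as in the "complete" step.
# Flat hash-of-hashes shown for brevity; a full Merkle tree is a superset
# of this idea.

import hashlib

def chunk_hash(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()

def file_checksum(chunk_hashes: list[bytes]) -> str:
    h = hashlib.sha256()
    for ch in chunk_hashes:        # order matters: chunk 0, 1, 2, ...
        h.update(ch)
    return h.hexdigest()

data = b"x" * (8 * 1024 * 1024) + b"y" * 1024   # 8MB chunk + small tail
chunks = [data[: 8 * 1024 * 1024], data[8 * 1024 * 1024:]]
print(file_checksum([chunk_hash(c) for c in chunks]))
```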

Download Flow

Client requests: GET /api/v1/files/{file_id}/content
        │
        ▼
Download Service:
  1. Authenticate + authorize (check permissions)
  2. Look up file metadata: get blob_references for current version
  3. Generate signed CDN URL (expires in 15 min, bound to user IP)
  4. Return 302 redirect to signed URL
        │
        ▼
Client follows redirect → CDN edge
  ├── CDN cache HIT → serve directly from edge (~50ms)
  └── CDN cache MISS → CDN fetches from blob store origin
        │
        ▼
  Blob store assembles file from chunks → streams to CDN → streams to client
  (For large files: HTTP range requests supported for parallel/partial download)
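Step 3 (signed URL generation) typically comes down to an HMAC over the path, expiry, and client IP, verified at the CDN edge. A sketch under assumptions — the key, parameter names, and URL layout are illustrative, not a real CDN's scheme:

```python
# Sketch of signed-URL generation and the check the CDN edge would perform.
# SIGNING_KEY, parameter names, and the domain are assumptions.

import hashlib
import hmac
import time

SIGNING_KEY = b"shared-secret-with-cdn"   # in practice rotated via KMS

def sign_url(file_id: str, client_ip: str, ttl_s: int = 900) -> str:
    """Build the URL returned by the 302 in step 4 (15-minute TTL)."""
    expires = int(time.time()) + ttl_s
    payload = f"/files/{file_id}?expires={expires}&ip={client_ip}"
    sig = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"https://cdn.drive.example.com{payload}&sig={sig}"

def verify(payload: str, sig: str, now: int) -> bool:
    """What the edge checks before serving: signature, then expiry."""
    expected = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):   # constant-time compare
        return False
    expires = int(payload.split("expires=")[1].split("&")[0])
    return now < expires

print(sign_url("fil_def456", "203.0.113.9").split("&sig=")[0])
```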

Sync Flow (Delta Sync)

Desktop client detects local file change (File Watcher)
        │
        ▼
Sync Engine computes delta:
  1. Compare new file content with cached previous version
  2. Use rsync-like algorithm: rolling checksum (Rabin fingerprint) to find changed blocks
  3. Only upload changed blocks (delta), not the entire file
        │
        ▼
  Upload delta chunks → Upload Service (same chunked upload flow)
        │
        ▼
  Server creates new version with updated blob_references:
    - Unchanged chunks → reference existing blobs (no new storage)
    - Changed chunks → new blobs stored
        │
        ▼
  Sync Service notifies other devices:
    - Device B receives notification
    - Device B requests: GET /api/v1/files/{file_id}/delta?from_version=3&to_version=4
    - Server returns: list of changed chunk ranges + new chunk hashes
    - Device B downloads only changed chunks and patches its local copy

Component Responsibilities

  1. Metadata Service: All CRUD operations on files, folders, permissions. Handles the folder tree (adjacency list model with materialized path for fast ancestry queries). Powers search via Elasticsearch integration.
  2. Upload Service: Manages resumable upload sessions. Handles chunking, deduplication, encryption, and blob storage writes. Stateless except for Redis-backed upload sessions.
  3. Download Service: Authorization check + signed URL generation. Does not actually serve file bytes — that is the CDN / blob store’s job.
  4. Chunk Processing Pipeline: Content-addressable storage. SHA-256 hash is the chunk key. Enables cross-user and cross-file deduplication. Handles encryption with convergent, content-derived chunk keys wrapped by KMS (per-user keys would break cross-user dedup).
  5. Blob Store: Distributed object storage (think S3/GCS). Handles replication (3 copies across availability zones), erasure coding for cold data, and lifecycle management (hot → warm → cold → archive tiers).
  6. Sync / Notification Service: Maintains a change log (monotonically increasing sequence number per user). Devices poll with their last cursor to get changes since last sync. Supports long-polling and WebSocket push for low-latency sync.
  7. CDN: Caches popular files at edge locations. Handles SSL termination, range requests, and bandwidth optimization.

6. Deep Dives (15 min)

Deep Dive 1: Chunked Upload, Deduplication & Delta Sync

Chunking Strategy:

  • Fixed-size chunks (8MB) are simple but suboptimal: inserting 1 byte at the start of a file shifts all chunk boundaries, causing every chunk to be “new.”
  • Content-defined chunking (CDC) uses a rolling hash (Rabin fingerprint) to find chunk boundaries at content-determined points. This means inserting or deleting bytes only affects 1-2 chunks, not the entire file.
  • Trade-off: CDC has variable chunk sizes (4MB - 16MB, targeting 8MB average). More complex but critical for delta sync efficiency.

Fixed chunking vs Content-defined chunking:

Original file: [========A========][========B========][========C========]

After inserting 10 bytes at position 100:

Fixed chunks:   [====A'===][===B'====][===C'====][D']  ← ALL chunks changed
CDC chunks:     [===A'====][========B========][========C========]  ← only chunk A changed
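The effect in the diagram above can be demonstrated with a toy chunker. This uses a simple byte-wise rolling hash and tiny min/avg/max sizes so the behavior is visible at small scale; a production system would use Rabin fingerprints or FastCDC with the 4MB/8MB/16MB parameters from the text:

```python
# Toy content-defined chunker. Boundaries are cut where the low bits of a
# rolling hash are zero, so they depend on local content, not byte offsets.
# Sizes are deliberately tiny (64B min, ~128B avg, 1KB max) for the demo.

MIN_SIZE, MAX_SIZE = 64, 1024
MASK = (1 << 6) - 1               # cut when low 6 bits of the hash are zero

def cdc_chunks(data: bytes):
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h = ((h << 1) + data[i]) & 0xFFFFFFFF   # ~32-byte effective window
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])             # final partial chunk
    return chunks

import random
random.seed(7)
original = bytes(random.randrange(256) for _ in range(8192))
edited = original[:100] + b"0123456789" + original[100:]  # 10-byte insert

shared = set(cdc_chunks(original)) & set(cdc_chunks(edited))
print(f"{len(shared)} chunks shared of {len(cdc_chunks(original))}")
```

Despite the insertion shifting every subsequent byte offset, almost all chunks after the edit point are byte-identical and dedup against the previous version.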

Deduplication:

  • Each chunk is stored with its SHA-256 hash as the key in blob store
  • Before writing, check if the hash already exists → skip write
  • Works across files and across users (global dedup)
  • Estimated space savings: 25-35% (many people store the same popular files, OS installers, shared documents)

Dedup at scale:

Dedup lookup flow:
  1. Compute SHA-256 of chunk (CPU-bound; roughly 5-50ms for an 8MB chunk, depending on hardware SHA acceleration)
  2. Check bloom filter (in-memory, sharded across index nodes: ~2.4TB for 1 trillion chunks at ~0.01% false positive rate, i.e. about 19 bits per entry)
     ├── Not in bloom filter → chunk is definitely new, store it
     └── Possibly in bloom filter → check blob store index (single point lookup)
         ├── Exists → skip storage, just add reference
         └── Doesn't exist (false positive) → store it
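The lookup flow above can be sketched with a minimal Bloom filter. Sizes here are toy values; the real filter is sized for trillions of entries, and `stored_index` stands in for the blob store's key index:

```python
# Tiny Bloom filter + dedup decision, mirroring the flow above: the filter
# screens out definitely-new hashes so only "maybe seen" ones pay for an
# index lookup. Parameters are illustrative.

import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

seen = BloomFilter()
stored_index = set()              # stand-in for the blob store's key index

def store_chunk_if_new(chunk_hash: bytes) -> bool:
    """True if the chunk bytes were actually written (not a duplicate)."""
    if seen.might_contain(chunk_hash) and chunk_hash in stored_index:
        return False              # duplicate: just record a reference
    seen.add(chunk_hash)
    stored_index.add(chunk_hash)  # false positives fall through to here
    return True

h = hashlib.sha256(b"chunk-0").digest()
print(store_chunk_if_new(h), store_chunk_if_new(h))   # True False
```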

Delta Sync Algorithm (rsync-like):

When a user modifies a file locally:

1. Client-side:
   a. Compute rolling checksums (Rabin fingerprint, 512-byte window) over the new file
   b. Compare with stored checksums of the previous version
   c. Identify changed regions: ranges where checksums differ
   d. Upload only the changed regions as new chunks

2. Server-side:
   a. Receive delta chunks
   b. Create new version: reference unchanged old chunks + store new delta chunks
   c. New version's blob_references = mix of old and new chunk references

3. Other devices syncing:
   a. Request delta: "what changed between version 3 and version 4?"
   b. Server returns: [{offset: 16MB, size: 8MB, new_chunk_hash: "abc..."}]
   c. Device downloads only the 8MB changed chunk
   d. Device patches its local copy: replace bytes at offset 16MB with new chunk

Bandwidth savings: For a 100MB file with a 1-page edit, sync transfers ~8MB instead of 100MB (92% savings).

Deep Dive 2: File Sync Across Devices — Consistency & Conflict Resolution

Change Notification System:

Each user has a monotonically increasing change sequence number. Every metadata change (create, update, delete, rename, move, share) increments this counter and is logged.

User's change log:
  seq=1001: FileCreated   { file_id: "fil_a", name: "report.pdf", folder: "fld_root" }
  seq=1002: FileModified  { file_id: "fil_b", version: 3→4 }
  seq=1003: FileRenamed   { file_id: "fil_a", old_name: "report.pdf", new_name: "Q4 Report.pdf" }
  seq=1004: FileShared    { file_id: "fil_a", shared_with: "alice@..." }

Device sync protocol:
  1. Device stores last_sync_cursor (e.g., 1000)
  2. Device calls: GET /api/v1/sync/changes?cursor=1000&limit=500
  3. Server returns changes 1001-1004
  4. Device applies changes locally, updates cursor to 1004
  5. Device calls again with cursor=1004 → server holds connection (long-poll, up to 60s)
  6. When a new change arrives → server responds immediately
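The client side of this cursor protocol is small. A sketch with an in-memory stand-in for `GET /api/v1/sync/changes` (which would long-poll in practice):

```python
# Client-side view of the cursor protocol: fetch changes after the stored
# cursor, apply them in sequence order, advance the cursor only after each
# change is applied. CHANGE_LOG stands in for the server's per-user log.

CHANGE_LOG = [
    (1001, "FileCreated", "fil_a"),
    (1002, "FileModified", "fil_b"),
    (1003, "FileRenamed", "fil_a"),
    (1004, "FileShared", "fil_a"),
]

def fetch_changes(cursor: int, limit: int = 500):
    """Stand-in for GET /api/v1/sync/changes?cursor=...&limit=..."""
    return [c for c in CHANGE_LOG if c[0] > cursor][:limit]

def sync_once(cursor: int):
    applied = []
    for seq, kind, file_id in fetch_changes(cursor):
        applied.append((kind, file_id))   # apply to local state, in order
        cursor = seq                      # persist cursor after applying
    return cursor, applied

print(sync_once(1000))   # catches up through seq 1004
```

Advancing the cursor only after a change is applied makes the protocol crash-safe: a client that dies mid-batch simply re-fetches from its last persisted cursor, and re-applying a change is idempotent.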

Conflict Resolution:

What happens when two devices edit the same file while offline, then both come online?

Scenario:
  Device A (laptop): edits report.pdf, creates version 4a (offline)
  Device B (phone):  edits report.pdf, creates version 4b (offline)
  Both come online. Server has version 3.

Resolution strategy:
  1. First device to sync wins the version number.
     Device A syncs first → server creates version 4 from A's changes.
  2. Device B syncs → server detects conflict (B's parent version is 3, but current is 4).
  3. Server creates version 5 from B's changes AND saves a "conflict copy":
     "report (conflict from Device B - Feb 23).pdf"
  4. Both devices are notified. User manually merges if needed.

This is essentially the “conflicted copy” strategy popularized by Dropbox; Google Drive’s desktop sync behaves similarly.
Trade-off: simpler than automatic merging (which only works for text files with OT/CRDT).
For binary files (images, PDFs), there's no sensible automatic merge — conflict copies are the only option.
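The server-side conflict check reduces to comparing the version a device branched from against the current version. A minimal sketch (the function and naming are illustrative; the real conflict-copy name also embeds a date, as in the scenario above):

```python
# Conflict detection at commit time: a device submits an edit plus the
# version it branched from. If that parent is stale, the edit is still
# kept, but filed as a conflict copy rather than silently overwriting.

def commit_edit(current_version: int, parent_version: int, device: str):
    new_version = current_version + 1
    if parent_version == current_version:
        return new_version, None                       # clean fast-forward
    conflict_name = f"report (conflict from {device}).pdf"
    return new_version, conflict_name                  # keep both edits

# Device A branched from v3 while the server is at v3: clean commit to v4.
print(commit_edit(3, 3, "Device A"))   # (4, None)
# Device B also branched from v3, but the server is now at v4: conflict.
print(commit_edit(4, 3, "Device B"))
```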

Handling Massive Sync Backlogs:

  • A device offline for weeks may have thousands of changes to sync
  • Server paginates changes (500 per request)
  • Client prioritizes: sync currently-open folders first, then breadth-first from root
  • For initial setup of a new device: server sends a “snapshot” (full file tree) instead of replaying the entire change log

Deep Dive 3: Storage Architecture & Durability

At exabyte scale, the storage layer is the most critical (and expensive) component.

Storage Tiers:

┌─────────────────┐
│   Hot Tier      │  Files accessed in last 30 days
│   (SSD, 3x     │  ~20% of data, 80% of reads
│    replication) │  Cost: $$$$
├─────────────────┤
│   Warm Tier     │  Files accessed in last 90 days
│   (HDD, 3x     │  ~30% of data, 15% of reads
│    replication) │  Cost: $$$
├─────────────────┤
│   Cold Tier     │  Files not accessed in 90+ days
│   (HDD, erasure │  ~40% of data, 4% of reads
│    coded 8+4)   │  Cost: $$
├─────────────────┤
│   Archive Tier  │  Trashed files, old versions
│   (Tape/deep    │  ~10% of data, 1% of reads
│    archive,     │  Cost: $
│    erasure 12+4)│
└─────────────────┘

Durability: 11 nines (99.999999999%)

How to achieve this:

  • Hot/Warm: 3x replication across 3 availability zones. Can survive loss of 2 replicas simultaneously.
  • Cold: Reed-Solomon erasure coding (8 data + 4 parity shards = 12 shards total). Can lose any 4 shards and still reconstruct. Stored across 12 different machines in 3 AZs.
  • Bitrot detection: Every chunk has a checksum verified on read. Background scrubbing job reads and verifies every chunk every 90 days.
  • Cross-region replication: For critical data (enterprise customers), replicate to a geographically distant region.

Cost Optimization:

Without optimization:
  5EB x 3 replicas x $0.023/GB/month (S3 standard) = $345M/month

With tiering + erasure coding:
  Hot (1EB):    3x replication, SSD:     1EB x 3 x $0.023 =  $69M/month
  Warm (1.5EB): 3x replication, HDD:     1.5EB x 3 x $0.010 = $45M/month
  Cold (2EB):   Erasure coded (1.5x):    2EB x 1.5 x $0.004 = $12M/month
  Archive (0.5EB): Erasure coded (1.33x): 0.5EB x 1.33 x $0.001 = $0.7M/month

  Total: ~$127M/month (63% savings)
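The tiering arithmetic above, checked in code. The per-GB prices are the illustrative figures from this section, not real cloud list prices:

```python
# Storage cost model: baseline (everything 3x-replicated at the hot-tier
# price) vs tiered storage with erasure coding for cold data.

EB = 1e9  # GB per exabyte

baseline = 5 * EB * 3 * 0.023                  # 5EB, 3x, $0.023/GB: $345M/mo

tiers = [                                      # (EB, storage overhead, $/GB/mo)
    (1.0, 3.00, 0.023),   # hot: 3x replication, SSD
    (1.5, 3.00, 0.010),   # warm: 3x replication, HDD
    (2.0, 1.50, 0.004),   # cold: 8+4 erasure coding (1.5x overhead)
    (0.5, 1.33, 0.001),   # archive: 12+4 erasure coding (1.33x overhead)
]
optimized = sum(eb * EB * overhead * price for eb, overhead, price in tiers)

print(f"${baseline/1e6:.0f}M vs ${optimized/1e6:.1f}M "
      f"({1 - optimized/baseline:.0%} savings)")
```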

Quota Management:

  • Each user has a storage quota (15GB free, 100GB/2TB/5TB paid)
  • Quota is tracked in the metadata DB: sum of all file versions’ sizes
  • Uploads are rejected if they would exceed quota
  • Shared files count against the owner’s quota, not the viewer’s
  • Trashed files continue to count against quota until permanently deleted
  • Version history counts against quota (motivates the 100-version / 30-day limit)

7. Extensions (2 min)

  • Real-time collaboration integration: When a user opens a Google Docs/Sheets/Slides file in Drive, hand off to the collaborative editing service (OT/CRDT engine). Drive provides the storage and versioning layer; the editing service provides the real-time collaboration layer.
  • AI-powered search and organization: Use ML models to classify files (invoices, receipts, photos, resumes), extract entities (dates, people, amounts), and enable natural-language search (“find the budget spreadsheet Alice shared with me last month”). Auto-suggest folder organization.
  • Offline mode for mobile/desktop: Cache frequently accessed files and entire “starred” folders locally. Use a write-ahead log for offline edits. Sync when connectivity returns. Priority-based sync (user-starred files first, then recent, then rest).
  • Enterprise features: Shared drives (team-owned, not individual-owned), data loss prevention (DLP) scanning to prevent sharing of sensitive files (SSN, credit card numbers), admin console with usage analytics, and legal hold (prevent deletion of files under litigation).
  • Content-aware storage optimization: Automatically transcode uploaded videos to lower resolutions for preview. Generate image thumbnails at multiple sizes on upload. Store original + derived formats. This significantly improves download performance for preview/thumbnail use cases without touching the original file.