1. Requirements & Scope (5 min)
Functional Requirements
- Users can upload, download, and delete files of any type, up to 5TB per file. Uploads support chunked and resumable upload protocols so large files survive network interruptions.
- Files sync automatically across all of a user’s devices. When a file is modified on one device, all other devices reflect the change within seconds.
- Sharing and permissions: users can share files/folders with specific people (viewer, commenter, editor) or generate shareable links with configurable access levels. Support organizational domains (anyone at company X can view).
- Version history: every edit creates a new version. Users can view and restore previous versions (up to 100 versions, or 30 days, whichever comes first).
- Full-text search across file names, contents (for supported formats: docs, PDFs, images via OCR), and metadata (owner, type, modified date).
Non-Functional Requirements
- Availability: 99.99% for file access (download). Upload can tolerate slightly lower availability (resumable uploads mask brief outages).
- Latency: Metadata operations (list, search, share): < 200ms. Small file download (< 1MB): < 500ms from edge CDN. Upload acknowledgment: < 100ms per chunk.
- Consistency: Strong consistency for metadata (permissions, ownership, folder structure). Eventual consistency for file content propagation to other devices is acceptable (target: < 10 seconds).
- Scale: 2B users total, 500M MAU, 100M DAU. 2 trillion files stored. 10B API requests/day. 5PB of new data uploaded per day.
- Durability: 99.999999999% (11 nines). Files must never be lost. This is the most critical non-functional requirement.
2. Estimation (3 min)
Traffic
- 10B API requests/day = ~116K requests/sec average
- Breakdown: 60% metadata reads (list, search), 25% downloads, 15% uploads
- Read QPS: ~70K metadata reads/sec + ~29K downloads/sec
- Write QPS: ~17K uploads/sec
- Peak: 3x average = ~350K requests/sec total
Storage
- Total files: 2 trillion
- Average file size: 2.5MB (skewed: many small docs, fewer large videos)
- Total storage: 2T x 2.5MB = 5 exabytes (5,000 PB)
- New data: 5PB/day → 1.8EB/year
- With 3x replication: 15EB raw storage
- With deduplication (estimated 30% duplicate data): effective ~10.5EB
Metadata
- 2 trillion files x 1KB metadata per file = 2PB of metadata
- This must be in a fast, queryable database — not blob storage
Bandwidth
- Downloads: 29K/sec x 2.5MB average = 72.5GB/sec = 580Gbps
- Uploads: 17K/sec x 2.5MB average = 42.5GB/sec = 340Gbps
- CDN absorbs most download traffic (cache-hit ratio ~85%), so origin bandwidth: ~87Gbps
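These estimates can be sanity-checked with quick arithmetic (a sketch using only the inputs stated above; small rounding differences from the rounded figures in the text are expected):

```python
# Back-of-envelope check of the traffic and bandwidth estimates above.
requests_per_day = 10e9
avg_rps = requests_per_day / 86_400                     # seconds per day -> ~116K req/s

download_rps = 0.25 * avg_rps                           # 25% of traffic
upload_rps = 0.15 * avg_rps                             # 15% of traffic
avg_file_mb = 2.5

download_gbps = download_rps * avg_file_mb * 8 / 1000   # MB/s -> Gbit/s
upload_gbps = upload_rps * avg_file_mb * 8 / 1000
origin_gbps = download_gbps * (1 - 0.85)                # 85% CDN cache-hit ratio

print(f"avg: {avg_rps/1000:.0f}K req/s, downloads: {download_gbps:.0f} Gbps, "
      f"uploads: {upload_gbps:.0f} Gbps, origin: {origin_gbps:.0f} Gbps")
```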
Key Insight
This is a storage-dominant system at planetary scale. The core challenges are: (1) storing exabytes of data durably and cost-efficiently, (2) syncing file changes across devices with minimal bandwidth and latency, and (3) making 2 trillion files searchable. The metadata layer is essentially a distributed file system.
3. API Design (3 min)
File Operations
// Initiate resumable upload
POST /api/v1/files/upload/init
Body: {
"name": "report.pdf",
"parent_folder_id": "fld_abc",
"size": 104857600, // 100MB
"mime_type": "application/pdf",
"checksum_sha256": "a1b2c3..."
}
Response 200: {
"upload_id": "upl_xyz",
"chunk_size": 8388608, // 8MB recommended chunk size
"upload_url": "https://upload.drive.example.com/upl_xyz"
}
// Upload chunk
PUT /api/v1/files/upload/{upload_id}/chunks/{chunk_number}
Headers: Content-Range: bytes 0-8388607/104857600
Body: <binary chunk data>
Response 200: { "chunk_number": 0, "status": "RECEIVED" }
// Complete upload
POST /api/v1/files/upload/{upload_id}/complete
Body: { "chunk_count": 13 }
Response 201: {
"file_id": "fil_def456",
"version": 1,
"size": 104857600,
"created_at": "2026-02-22T..."
}
// Download file
GET /api/v1/files/{file_id}/content?version=3
Response 302: redirect to CDN signed URL (valid for 15 minutes)
// Get file metadata
GET /api/v1/files/{file_id}
Response 200: { "file_id": "...", "name": "...", "size": ..., "versions": [...], ... }
// Delete file (move to trash)
DELETE /api/v1/files/{file_id}
Response 200: { "status": "TRASHED", "permanent_delete_at": "2026-03-24T..." }
// Search
GET /api/v1/files/search?q=quarterly+report&type=pdf&owner=me&modified_after=2026-01-01
Response 200: { "results": [...], "next_page_token": "..." }
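The chunk bookkeeping behind the upload endpoints is mechanical; here is a minimal sketch (the helper name `plan_chunks` is mine, not part of the API):

```python
def plan_chunks(file_size: int, chunk_size: int) -> list[dict]:
    """Split an upload into chunk descriptors with the Content-Range headers
    used by PUT /api/v1/files/upload/{upload_id}/chunks/{chunk_number}."""
    chunks = []
    for number, start in enumerate(range(0, file_size, chunk_size)):
        end = min(start + chunk_size, file_size) - 1   # Content-Range end is inclusive
        chunks.append({
            "chunk_number": number,
            "content_range": f"bytes {start}-{end}/{file_size}",
            "size": end - start + 1,
        })
    return chunks

# The 100MB example from the init call: 12 full 8MB chunks plus one 4MB tail.
chunks = plan_chunks(104_857_600, 8_388_608)
assert len(chunks) == 13
assert chunks[0]["content_range"] == "bytes 0-8388607/104857600"
assert chunks[-1]["size"] == 4_194_304
```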
Sharing
POST /api/v1/files/{file_id}/permissions
Body: {
"email": "colleague@example.com",
"role": "editor", // viewer | commenter | editor
"notify": true,
"message": "Please review this doc"
}
Response 200: { "permission_id": "perm_abc" }
POST /api/v1/files/{file_id}/link
Body: { "access": "ANYONE_WITH_LINK", "role": "viewer", "expires_at": "2026-03-01" }
Response 200: { "link": "https://drive.example.com/s/abc123xyz" }
Key Decisions
- Resumable uploads are mandatory — a 5TB upload must survive network failures without restarting from scratch
- Download via CDN redirect — signed URLs with short TTL for security; CDN caches popular files at edge
- Trash with 30-day retention — soft delete with automatic permanent deletion, giving users a recovery window
- Chunk size is server-determined — allows the server to optimize based on network conditions and file size
4. Data Model (3 min)
File Metadata (Sharded PostgreSQL / Vitess / CockroachDB)
| Field | Type | Notes |
|---|---|---|
| file_id | UUID PK | Globally unique |
| owner_id | UUID | File creator |
| name | VARCHAR(500) | File name |
| parent_folder_id | UUID FK | Folder hierarchy (nullable for root) |
| mime_type | VARCHAR(100) | |
| size | BIGINT | Bytes |
| current_version | INT | |
| checksum_sha256 | VARCHAR(64) | Content hash (used for dedup) |
| status | ENUM | ACTIVE, TRASHED, PERMANENTLY_DELETED |
| trashed_at | TIMESTAMP | When moved to trash |
| created_at | TIMESTAMP | |
| updated_at | TIMESTAMP | |
File Versions (Sharded PostgreSQL)
| Field | Type | Notes |
|---|---|---|
| file_id | UUID (composite PK) | |
| version | INT (composite PK) | |
| blob_references | JSONB | List of {chunk_hash, blob_store_key, offset, size} |
| size | BIGINT | Version size |
| checksum_sha256 | VARCHAR(64) | |
| created_by | UUID | Who uploaded this version |
| created_at | TIMESTAMP | |
Permissions (Sharded PostgreSQL)
| Field | Type | Notes |
|---|---|---|
| file_id | UUID (composite PK) | |
| principal | VARCHAR(200) (composite PK) | user_id, group_id, “anyone”, or domain |
| role | ENUM | owner, editor, commenter, viewer |
| inherited | BOOLEAN | True if inherited from parent folder |
| granted_by | UUID | |
| created_at | TIMESTAMP | |
| expires_at | TIMESTAMP | Optional link expiry |
Sync State (Per-Device) — Redis + PostgreSQL
| Field | Type | Notes |
|---|---|---|
| user_id | UUID | |
| device_id | UUID | |
| last_sync_cursor | BIGINT | Server-side change sequence number |
| last_sync_at | TIMESTAMP | |
Why These DB Choices?
- Sharded PostgreSQL (or CockroachDB/Vitess): 2 trillion files require sharding. Shard by owner_id so all of a user’s files are co-located (efficient for listing, searching within “My Drive”). PostgreSQL for ACID transactions on metadata.
- Blob Storage (GCS / S3 / custom): Actual file content stored in a distributed object store. Files are chunked (8MB chunks) and each chunk is content-addressed (SHA-256 hash as key). This enables deduplication at the chunk level.
- Elasticsearch: Full-text search index. Indexes file names, extracted text content (from PDF, DOCX, OCR), and metadata. Sharded by user for query isolation.
- Redis: Sync cursors, device presence, and caching hot metadata (frequently accessed folders, recently used files).
5. High-Level Design (12 min)
System Architecture
┌────────────────────────────────────────────────┐
│ Client (Desktop / Mobile / Web) │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ File │ │ Sync │ │ Web UI │ │
│ │ Watcher │ │ Engine │ │ (Browser) │ │
│ └────┬─────┘ └────┬─────┘ └──────┬───────┘ │
└───────┼─────────────┼───────────────┼──────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────┐
│ CDN / Edge │
│ (Download acceleration) │
└───────────────────────┬─────────────────────────┘
│
┌─────────▼──────────┐
│ API Gateway / │
│ Load Balancer │
└─────────┬──────────┘
│
┌────────────────────────────┼────────────────────────────┐
│ │ │
┌──────▼──────┐ ┌────────▼────────┐ ┌───────▼───────┐
│ Metadata │ │ Upload │ │ Download │
│ Service │ │ Service │ │ Service │
│ (CRUD, list,│ │ (Chunked, │ │ (Signed URL │
│ search, │ │ resumable) │ │ generation, │
│ share) │ │ │ │ range reqs) │
└──────┬──────┘ └────────┬────────┘ └───────┬───────┘
│ │ │
│ ┌──────▼──────┐ │
│ │ Chunk │ │
│ │ Processing │ │
│ │ Pipeline │ │
│ │ (hash,dedup,│ │
│ │ encrypt) │ │
│ └──────┬──────┘ │
│ │ │
┌────▼────────────┐ ┌──────▼──────┐ ┌───────▼──────┐
│ Metadata DB │ │ Blob Store │ │ CDN │
│ (Sharded PG) │ │ (Distributed│ │ Origin │
│ │ │ Object │ │ │
│ + Elasticsearch │ │ Store) │ │ │
└─────────────────┘ └─────────────┘ └──────────────┘
│
┌────▼────────────┐
│ Sync / Notify │
│ Service │
│ (Change feed, │
│ push notif) │
└─────────────────┘
Upload Flow (Resumable, Chunked)
Client wants to upload report.pdf (100MB)
│
▼
POST /api/v1/files/upload/init → Upload Service
- Creates upload session: upload_id, expected size, chunk_size=8MB
- Stores session in Redis (TTL 24 hours)
- Returns upload_id + chunk_size
│
▼
Client splits file into 13 chunks (12 x 8MB + 1 x 4MB)
│
▼
For each chunk (can be parallelized, 4 concurrent):
PUT /api/v1/files/upload/{upload_id}/chunks/{i}
│
▼
Upload Service:
1. Validate chunk number and size
2. Compute SHA-256 hash of the plaintext chunk content
3. Check blob store: does this hash already exist?
   ├── YES → skip storage (deduplication), just record reference
   └── NO → encrypt the chunk (AES-256-GCM) and write it to the blob store with the hash as key
4. Mark chunk as received in Redis session
5. Return 200
Note: encryption happens before the blob write, and cross-user dedup requires convergent encryption (chunk key derived from the content hash) — a purely per-user key would make identical plaintext chunks encrypt to different ciphertexts and defeat dedup.
│
▼
POST /api/v1/files/upload/{upload_id}/complete
│
▼
Upload Service:
1. Verify all chunks received
2. Compute whole-file checksum from chunk hashes (Merkle tree)
3. Create file metadata in Metadata DB:
- file_id, name, size, parent_folder_id, version=1
- blob_references: [{chunk_hash_0, offset_0, size_0}, ...]
4. Update parent folder's updated_at timestamp
5. Publish FileCreated event to Sync Service
6. Enqueue for indexing (Elasticsearch: extract text if PDF/DOCX)
7. Return 201 with file_id
│
▼
Sync Service picks up FileCreated event:
- Notify all of this user's other devices via long-poll / WebSocket / push notification
- Other devices fetch the new file metadata + content
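Step 2 of the complete handler (whole-file checksum derived from chunk hashes) can be sketched as a simple Merkle fold; the odd-leaf promotion rule here is one common convention, an assumption rather than something specified above:

```python
import hashlib

def merkle_root(chunk_hashes: list[bytes]) -> bytes:
    """Fold per-chunk SHA-256 hashes pairwise into one whole-file checksum.
    An odd node at each level is promoted unchanged to the next level."""
    assert chunk_hashes, "an upload has at least one chunk"
    level = chunk_hashes
    while len(level) > 1:
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                  # odd leaf carries up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

# 13 chunk hashes, as in the 100MB example upload.
chunks = [hashlib.sha256(bytes([i]) * 1024).digest() for i in range(13)]
root = merkle_root(chunks)
```

Because the root commits to chunk order as well as content, a reordered or corrupted chunk changes the whole-file checksum.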
Download Flow
Client requests: GET /api/v1/files/{file_id}/content
│
▼
Download Service:
1. Authenticate + authorize (check permissions)
2. Look up file metadata: get blob_references for current version
3. Generate signed CDN URL (expires in 15 min, bound to user IP)
4. Return 302 redirect to signed URL
│
▼
Client follows redirect → CDN edge
├── CDN cache HIT → serve directly from edge (~50ms)
└── CDN cache MISS → CDN fetches from blob store origin
│
▼
Blob store assembles file from chunks → streams to CDN → streams to client
(For large files: HTTP range requests supported for parallel/partial download)
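Step 3 (signed CDN URL) is typically an HMAC over the path, expiry, and any bound attributes. A sketch, with made-up query parameter names and a hard-coded key standing in for a KMS-managed secret:

```python
import hashlib, hmac, time
from urllib.parse import urlencode

SIGNING_KEY = b"shared-secret-with-cdn"     # illustrative; rotated via KMS in practice

def signed_url(file_id: str, user_ip: str, ttl_s: int = 900) -> str:
    """Download Service: mint a 15-minute signed URL bound to the user's IP."""
    expires = int(time.time()) + ttl_s
    payload = f"/files/{file_id}|{expires}|{user_ip}".encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    qs = urlencode({"expires": expires, "ip": user_ip, "sig": sig})
    return f"https://cdn.drive.example.com/files/{file_id}?{qs}"

def verify(file_id: str, user_ip: str, expires: int, sig: str) -> bool:
    """What the CDN edge runs before serving bytes: check expiry, recompute HMAC."""
    if expires < time.time():
        return False
    payload = f"/files/{file_id}|{expires}|{user_ip}".encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

url = signed_url("fil_def456", "203.0.113.7")
query = dict(p.split("=") for p in url.split("?")[1].split("&"))
```

The edge needs only the shared key, not a call back to the origin, so authorization cost is paid once at redirect time.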
Sync Flow (Delta Sync)
Desktop client detects local file change (File Watcher)
│
▼
Sync Engine computes delta:
1. Compare new file content with cached previous version
2. Use rsync-like algorithm: rolling checksum (Rabin fingerprint) to find changed blocks
3. Only upload changed blocks (delta), not the entire file
│
▼
Upload delta chunks → Upload Service (same chunked upload flow)
│
▼
Server creates new version with updated blob_references:
- Unchanged chunks → reference existing blobs (no new storage)
- Changed chunks → new blobs stored
│
▼
Sync Service notifies other devices:
- Device B receives notification
- Device B requests: GET /api/v1/files/{file_id}/delta?from_version=3&to_version=4
- Server returns: list of changed chunk ranges + new chunk hashes
- Device B downloads only changed chunks and patches its local copy
Component Responsibilities
- Metadata Service: All CRUD operations on files, folders, permissions. Handles the folder tree (adjacency list model with materialized path for fast ancestry queries). Powers search via Elasticsearch integration.
- Upload Service: Manages resumable upload sessions. Handles chunking, deduplication, encryption, and blob storage writes. Stateless except for Redis-backed upload sessions.
- Download Service: Authorization check + signed URL generation. Does not actually serve file bytes — that is the CDN / blob store’s job.
- Chunk Processing Pipeline: Content-addressable storage. SHA-256 hash is the chunk key, enabling cross-user and cross-file deduplication. Handles encryption: deduplicated chunks use convergent encryption (key derived from the chunk's content hash), with those chunk keys in turn wrapped by per-user keys managed by KMS.
- Blob Store: Distributed object storage (think S3/GCS). Handles replication (3 copies across availability zones), erasure coding for cold data, and lifecycle management (hot → warm → cold → archive tiers).
- Sync / Notification Service: Maintains a change log (monotonically increasing sequence number per user). Devices poll with their last cursor to get changes since last sync. Supports long-polling and WebSocket push for low-latency sync.
- CDN: Caches popular files at edge locations. Handles SSL termination, range requests, and bandwidth optimization.
6. Deep Dives (15 min)
Deep Dive 1: Chunked Upload, Deduplication & Delta Sync
Chunking Strategy:
- Fixed-size chunks (8MB) are simple but suboptimal: inserting 1 byte at the start of a file shifts all chunk boundaries, causing every chunk to be “new.”
- Content-defined chunking (CDC) uses a rolling hash (Rabin fingerprint) to find chunk boundaries at content-determined points. This means inserting or deleting bytes only affects 1-2 chunks, not the entire file.
- Trade-off: CDC has variable chunk sizes (4MB - 16MB, targeting 8MB average). More complex but critical for delta sync efficiency.
Fixed chunking vs Content-defined chunking:
Original file: [========A========][========B========][========C========]
After inserting 10 bytes at position 100:
Fixed chunks: [====A'===][===B'====][===C'====][D'] ← ALL chunks changed
CDC chunks: [===A'====][========B========][========C========] ← only chunk A changed
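The resync behavior shown above can be demonstrated with a toy content-defined chunker using a Buzhash-style rolling hash. All parameters here (16-byte window, ~64-byte average chunks) are scaled far down from the 4-16MB production targets purely for illustration:

```python
import random

rng = random.Random(0)
TABLE = [rng.getrandbits(32) for _ in range(256)]   # random 32-bit value per byte
WINDOW, MIN_CHUNK, MAX_CHUNK = 16, 32, 256          # tiny demo parameters
MASK = (1 << 6) - 1                                 # 6 zero bits -> ~64-byte average

def rotl32(x: int, n: int) -> int:
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def cdc_chunks(data: bytes) -> list[bytes]:
    """Declare a boundary where the rolling hash of the last WINDOW bytes
    has all-zero low bits, subject to min/max chunk-size bounds."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = rotl32(h, 1) ^ TABLE[b]                 # roll the new byte in
        if i - start >= WINDOW:                     # slide: roll the outgoing byte out
            h ^= rotl32(TABLE[data[i - WINDOW]], WINDOW)
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

data = bytes(rng.randrange(256) for _ in range(4000))
before = cdc_chunks(data)
after = cdc_chunks(b"0123456789" + data)            # insert 10 bytes at the front
reused = len(set(before) & set(after)) / len(before)
assert reused > 0.5    # boundaries resync; most chunks are shared despite the shift
```

A fixed-size chunker on the same shifted input would share essentially no chunks, which is exactly the failure mode the diagram above illustrates.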
Deduplication:
- Each chunk is stored with its SHA-256 hash as the key in blob store
- Before writing, check if the hash already exists → skip write
- Works across files and across users (global dedup)
- Estimated space savings: 25-35% (many people store the same popular files, OS installers, shared documents)
Dedup at scale:
Dedup lookup flow:
1. Compute SHA-256 of chunk (CPU-bound, roughly 10-50ms for an 8MB chunk depending on hardware)
2. Check bloom filter (~19 bits per chunk at a 0.01% false-positive rate → ~2.4TB for 1 trillion chunks, sharded in memory across dedup index nodes)
├── Not in bloom filter → chunk is definitely new, store it
└── Possibly in bloom filter → check blob store index (single point lookup)
├── Exists → skip storage, just add reference
└── Doesn't exist (false positive) → store it
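The dedup gate above can be sketched with a small Bloom filter (sizes here are toy; the function and class names are mine):

```python
import hashlib

class BloomFilter:
    """k hash positions derived from SHA-256. No false negatives;
    the false-positive rate is tuned via bits-per-element."""
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(key))

def store_chunk(chunk_hash: bytes, bloom: BloomFilter, index: set) -> bool:
    """Return True if the chunk must actually be written (dedup miss)."""
    if not bloom.might_contain(chunk_hash):
        bloom.add(chunk_hash); index.add(chunk_hash)
        return True                      # definitely new: store it
    if chunk_hash in index:              # possible hit: confirm in the real index
        return False                     # duplicate: record a reference only
    bloom.add(chunk_hash); index.add(chunk_hash)
    return True                          # bloom false positive: still new

bloom, index = BloomFilter(10_000, 7), set()
h1 = hashlib.sha256(b"chunk-A").digest()
assert store_chunk(h1, bloom, index) is True       # first sight: write
assert store_chunk(h1, bloom, index) is False      # duplicate: reference only
```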
Delta Sync Algorithm (rsync-like):
When a user modifies a file locally:
1. Client-side:
a. Compute rolling checksums (Rabin fingerprint, 512-byte window) over the new file
b. Compare with stored checksums of the previous version
c. Identify changed regions: ranges where checksums differ
d. Upload only the changed regions as new chunks
2. Server-side:
a. Receive delta chunks
b. Create new version: reference unchanged old chunks + store new delta chunks
c. New version's blob_references = mix of old and new chunk references
3. Other devices syncing:
a. Request delta: "what changed between version 3 and version 4?"
b. Server returns: [{offset: 16MB, size: 8MB, new_chunk_hash: "abc..."}]
c. Device downloads only the 8MB changed chunk
d. Device patches its local copy: replace bytes at offset 16MB with new chunk
Bandwidth savings: For a 100MB file with a 1-page edit, sync transfers ~8MB instead of 100MB (92% savings).
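Server-side, the delta a device must fetch falls out of comparing the two versions' blob_references lists (a sketch; the structures mirror the data model above, with offsets in MB for brevity):

```python
def version_delta(old_refs: list[dict], new_refs: list[dict]) -> list[dict]:
    """Chunks a syncing device must download to move from the old version
    to the new one. Each ref is {chunk_hash, offset, size}."""
    old_hashes = {ref["chunk_hash"] for ref in old_refs}
    return [ref for ref in new_refs if ref["chunk_hash"] not in old_hashes]

# A 96MB file as 12 chunks; one 8MB chunk (at offset 16MB) was edited.
v3 = [{"chunk_hash": f"h{i}", "offset": i * 8, "size": 8} for i in range(12)]
v4 = list(v3)
v4[2] = {"chunk_hash": "h_new", "offset": 16, "size": 8}

delta = version_delta(v3, v4)
assert delta == [{"chunk_hash": "h_new", "offset": 16, "size": 8}]
```

This is the payload behind the `GET /files/{file_id}/delta?from_version=3&to_version=4` response: the device downloads only the listed chunks and patches them in at their offsets.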
Deep Dive 2: File Sync Across Devices — Consistency & Conflict Resolution
Change Notification System:
Each user has a monotonically increasing change sequence number. Every metadata change (create, update, delete, rename, move, share) increments this counter and is logged.
User's change log:
seq=1001: FileCreated { file_id: "fil_a", name: "report.pdf", folder: "fld_root" }
seq=1002: FileModified { file_id: "fil_b", version: 3→4 }
seq=1003: FileRenamed { file_id: "fil_a", old_name: "report.pdf", new_name: "Q4 Report.pdf" }
seq=1004: FileShared { file_id: "fil_a", shared_with: "alice@..." }
Device sync protocol:
1. Device stores last_sync_cursor (e.g., 1000)
2. Device calls: GET /api/v1/sync/changes?cursor=1000&limit=500
3. Server returns changes 1001-1004
4. Device applies changes locally, updates cursor to 1004
5. Device calls again with cursor=1004 → server holds connection (long-poll, up to 60s)
6. When a new change arrives → server responds immediately
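On the server, the cursor protocol reduces to a filtered, paginated read of the per-user change log (a sketch; the long-poll blocking in step 5 is elided, and the `Change` shape is assumed):

```python
from dataclasses import dataclass

@dataclass
class Change:
    seq: int
    kind: str       # FileCreated, FileModified, FileRenamed, FileShared, ...
    payload: dict

def changes_since(log: list[Change], cursor: int, limit: int = 500):
    """GET /api/v1/sync/changes?cursor=...&limit=...: the next page of
    changes after `cursor`, plus the new cursor the device should store."""
    page = [c for c in log if c.seq > cursor][:limit]
    next_cursor = page[-1].seq if page else cursor
    return page, next_cursor

# The example change log from above.
log = [Change(1001, "FileCreated", {"file_id": "fil_a"}),
       Change(1002, "FileModified", {"file_id": "fil_b"}),
       Change(1003, "FileRenamed", {"file_id": "fil_a"}),
       Change(1004, "FileShared", {"file_id": "fil_a"})]

page, cursor = changes_since(log, cursor=1000)
assert [c.seq for c in page] == [1001, 1002, 1003, 1004] and cursor == 1004
page, cursor = changes_since(log, cursor=1004)
assert page == [] and cursor == 1004   # caught up: the device now long-polls
```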
Conflict Resolution:
What happens when two devices edit the same file while offline, then both come online?
Scenario:
Device A (laptop): edits report.pdf, creates version 4a (offline)
Device B (phone): edits report.pdf, creates version 4b (offline)
Both come online. Server has version 3.
Resolution strategy:
1. First device to sync wins the version number.
Device A syncs first → server creates version 4 from A's changes.
2. Device B syncs → server detects conflict (B's parent version is 3, but current is 4).
3. Server creates version 5 from B's changes AND saves a "conflict copy":
"report (conflict from Device B - Feb 23).pdf"
4. Both devices are notified. User manually merges if needed.
This mirrors the conflict-copy strategy popularized by Dropbox and used by other major sync products.
Trade-off: simpler than automatic merging (which only works for text files with OT/CRDT).
For binary files (images, PDFs), there's no sensible automatic merge — conflict copies are the only option.
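The first-writer-wins rule with conflict copies amounts to a compare-and-set on the parent version. A sketch of the server-side check (function and field names are mine):

```python
def commit_version(file_versions: dict, file_id: str,
                   parent_version: int, device: str) -> dict:
    """Accept an upload whose base was `parent_version`. If another device
    committed first, keep both: the upload still becomes the next version,
    AND a conflict copy is surfaced for the user to merge manually."""
    current = file_versions.get(file_id, 0)
    new_version = current + 1
    file_versions[file_id] = new_version
    if parent_version != current:    # someone committed since this device synced
        return {"version": new_version,
                "conflict_copy": f"report (conflict from {device}).pdf"}
    return {"version": new_version, "conflict_copy": None}

# The offline-edit scenario above: both devices based their edit on version 3.
versions = {"fil_a": 3}
a = commit_version(versions, "fil_a", parent_version=3, device="Device A")
b = commit_version(versions, "fil_a", parent_version=3, device="Device B")
assert a == {"version": 4, "conflict_copy": None}          # A synced first
assert b["version"] == 5 and "Device B" in b["conflict_copy"]
```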
Handling Massive Sync Backlogs:
- A device offline for weeks may have thousands of changes to sync
- Server paginates changes (500 per request)
- Client prioritizes: sync currently-open folders first, then breadth-first from root
- For initial setup of a new device: server sends a “snapshot” (full file tree) instead of replaying the entire change log
Deep Dive 3: Storage Architecture & Durability
At exabyte scale, the storage layer is the most critical (and expensive) component.
Storage Tiers:
┌─────────────────┐
│ Hot Tier │ Files accessed in last 30 days
│ (SSD, 3x │ ~20% of data, 80% of reads
│ replication) │ Cost: $$$$
├─────────────────┤
│ Warm Tier │ Files accessed in last 90 days
│ (HDD, 3x │ ~30% of data, 15% of reads
│ replication) │ Cost: $$$
├─────────────────┤
│ Cold Tier │ Files not accessed in 90+ days
│ (HDD, erasure │ ~40% of data, 4% of reads
│ coded 8+4) │ Cost: $$
├─────────────────┤
│ Archive Tier │ Trashed files, old versions
│ (Tape/deep │ ~10% of data, 1% of reads
│ archive, │ Cost: $
│ erasure 12+4)│
└─────────────────┘
Durability: 11 nines (99.999999999%)
How to achieve this:
- Hot/Warm: 3x replication across 3 availability zones. Can survive loss of 2 replicas simultaneously.
- Cold: Reed-Solomon erasure coding (8 data + 4 parity shards = 12 shards total). Can lose any 4 shards and still reconstruct. Stored across 12 different machines in 3 AZs.
- Bitrot detection: Every chunk has a checksum verified on read. Background scrubbing job reads and verifies every chunk every 90 days.
- Cross-region replication: For critical data (enterprise customers), replicate to a geographically distant region.
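The 8+4 scheme's robustness can be ball-parked with a binomial tail under an independent-failure assumption. The annual disk failure rate and repair window below are illustrative numbers of my own, not figures from the text:

```python
from math import comb

def stripe_loss_probability(data: int, parity: int, p_shard_loss: float) -> float:
    """Probability that more than `parity` of the data+parity shards fail
    within one repair window, making the stripe unrecoverable."""
    n = data + parity
    return sum(comb(n, k) * p_shard_loss**k * (1 - p_shard_loss)**(n - k)
               for k in range(parity + 1, n + 1))

# Illustrative: ~2% annual disk failure rate, 24h to re-replicate a lost shard.
p_during_repair = 0.02 * (24 / (365 * 24))
p_loss = stripe_loss_probability(8, 4, p_during_repair)
print(f"per-stripe loss probability per repair window: {p_loss:.2e}")
```

Even with these pessimistic-ish inputs the per-stripe loss probability is vanishingly small, which is how erasure coding plus background scrubbing reaches the 11-nines target.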
Cost Optimization:
Without optimization:
5EB x 3 replicas x $0.023/GB/month (S3 standard) = $345M/month
With tiering + erasure coding:
Hot (1EB): 3x replication, SSD: 1EB x 3 x $0.023 = $69M/month
Warm (1.5EB): 3x replication, HDD: 1.5EB x 3 x $0.010 = $45M/month
Cold (2EB): Erasure coded (1.5x): 2EB x 1.5 x $0.004 = $12M/month
Archive (0.5EB): Erasure coded (1.33x): 0.5EB x 1.33 x $0.001 = $0.7M/month
Total: ~$127M/month (63% savings)
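The tiering arithmetic can be verified directly from the figures above (a quick check; storage overhead is the replication factor or erasure-coding expansion):

```python
GB_PER_EB = 1e9

tiers = [  # (name, size in EB, storage overhead, $/GB/month)
    ("hot",     1.0, 3.0,     0.023),   # SSD, 3x replication
    ("warm",    1.5, 3.0,     0.010),   # HDD, 3x replication
    ("cold",    2.0, 1.5,     0.004),   # erasure coded 8+4 -> 12/8 = 1.5x
    ("archive", 0.5, 16 / 12, 0.001),   # erasure coded 12+4 -> 16/12 = 1.33x
]

baseline = 5 * GB_PER_EB * 3 * 0.023    # everything hot, 3x replicated
tiered = sum(eb * GB_PER_EB * overhead * price
             for _, eb, overhead, price in tiers)

print(f"baseline ${baseline/1e6:.0f}M/mo, tiered ${tiered/1e6:.0f}M/mo, "
      f"savings {1 - tiered/baseline:.0%}")
```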
Quota Management:
- Each user has a storage quota (15GB free, 100GB/2TB/5TB paid)
- Quota is tracked in the metadata DB: sum of all file versions’ sizes
- Uploads are rejected if they would exceed quota
- Shared files count against the owner’s quota, not the viewer’s
- Trashed files continue to count against quota until permanently deleted
- Version history counts against quota (motivates the 100-version / 30-day limit)
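Quota enforcement at upload-init time is then a straightforward pre-check; a sketch whose version-size sum mirrors the rules above (function and field names are mine):

```python
def can_upload(owner_versions: list[dict], quota_bytes: int,
               upload_size: int) -> bool:
    """All versions of all owned files count against quota, including
    TRASHED ones, until permanent deletion reclaims the space."""
    used = sum(v["size"] for v in owner_versions
               if v["status"] != "PERMANENTLY_DELETED")
    return used + upload_size <= quota_bytes

GB = 1024**3
versions = [{"size": 10 * GB, "status": "ACTIVE"},
            {"size": 4 * GB,  "status": "TRASHED"},              # still counts
            {"size": 3 * GB,  "status": "PERMANENTLY_DELETED"}]  # reclaimed

assert can_upload(versions, quota_bytes=15 * GB, upload_size=1 * GB)       # 14+1 fits
assert not can_upload(versions, quota_bytes=15 * GB, upload_size=2 * GB)   # 14+2 exceeds
```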
7. Extensions (2 min)
- Real-time collaboration integration: When a user opens a Google Docs/Sheets/Slides file in Drive, hand off to the collaborative editing service (OT/CRDT engine). Drive provides the storage and versioning layer; the editing service provides the real-time collaboration layer.
- AI-powered search and organization: Use ML models to classify files (invoices, receipts, photos, resumes), extract entities (dates, people, amounts), and enable natural-language search (“find the budget spreadsheet Alice shared with me last month”). Auto-suggest folder organization.
- Offline mode for mobile/desktop: Cache frequently accessed files and entire “starred” folders locally. Use a write-ahead log for offline edits. Sync when connectivity returns. Priority-based sync (user-starred files first, then recent, then rest).
- Enterprise features: Shared drives (team-owned, not individual-owned), data loss prevention (DLP) scanning to prevent sharing of sensitive files (SSN, credit card numbers), admin console with usage analytics, and legal hold (prevent deletion of files under litigation).
- Content-aware storage optimization: Automatically transcode uploaded videos to lower resolutions for preview. Generate image thumbnails at multiple sizes on upload. Store original + derived formats. This significantly improves download performance for preview/thumbnail use cases without touching the original file.