1. Requirements & Scope (5 min)

Functional Requirements

  1. Users can download large files (100MB to 50GB) reliably over unstable network connections with resume support
  2. Files are split into chunks; clients can download chunks in parallel and resume from the last completed chunk after interruption
  3. Integrity verification at both chunk level (per-chunk checksum) and file level (whole-file hash) to detect corruption
  4. Distribute files globally via CDN edge nodes to minimize download latency and maximize throughput
  5. Support bandwidth throttling per user/tier and fair-share scheduling when multiple users download simultaneously

Non-Functional Requirements

  • Availability: 99.95% — users should almost always be able to start or resume a download. Brief outages are tolerable if resume works.
  • Latency: First byte within 200ms. Download speed should saturate the user’s available bandwidth (no artificial bottleneck from our side, except throttling).
  • Consistency: File metadata (checksums, chunk manifest) must be strongly consistent. A user must never download a partially-updated file.
  • Scale: 10M registered users, 500K concurrent downloads, 100PB total stored files, 5PB egress/month.
  • Durability: 99.999999999% (11 nines) for stored files. No data loss ever.

2. Estimation (3 min)

Traffic

  • Concurrent downloads: 500K
  • Average file size: 2GB
  • Average download speed: 50 Mbps per user
  • Total bandwidth: 500K × 50 Mbps = 25 Tbps theoretical peak egress if every download saturates simultaneously (this is CDN-scale, not single-origin; sustained egress is far lower, per the 5PB/month figure)
  • Chunk requests: 2GB file / 8MB chunk = 256 chunks per download. 500K downloads × 256 = 128M chunk requests (spread over download duration)
  • Chunk request rate: ~370K chunk requests/sec at peak (each user completes an 8MB chunk every ~1.3s at 50 Mbps)
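These back-of-envelope numbers can be sanity-checked in a few lines (a sketch; the 8MB chunk size and 50 Mbps per-user rate are the assumptions above):

```python
# Back-of-envelope traffic math for the estimates above.
CONCURRENT = 500_000
FILE_SIZE = 2 * 1024**3           # 2 GiB average file
CHUNK_SIZE = 8 * 1024**2          # 8 MiB chunks
USER_BPS = 50_000_000             # 50 Mbps per user

chunks_per_file = FILE_SIZE // CHUNK_SIZE         # 256
peak_egress_tbps = CONCURRENT * USER_BPS / 1e12   # 25 Tbps
chunk_seconds = CHUNK_SIZE * 8 / USER_BPS         # ~1.34s per chunk
chunk_rps = CONCURRENT / chunk_seconds            # ~370K chunk requests/sec

print(chunks_per_file, peak_egress_tbps, round(chunk_rps))
```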

Storage

  • Total file data: 100PB, i.e. ~50M files at 2GB average (stored in object storage like S3)
  • Chunk metadata: 50M files × 256 chunks × 64 bytes ≈ 820GB (fits in a database)
  • File manifests: 50M files × 2KB = 100GB

Cost Insight

  • Egress cost is dominant: 5PB/month × $0.05/GB (CDN) = $250K/month
  • CDN cache hit ratio is critical: 80% hit rate → origin serves only 1PB/month
  • Popular files (top 1%) account for 80% of downloads (cache-friendly)
  • Long-tail files need origin serving — optimize with regional origin replicas

Key Insight

This is an egress-heavy, reliability-focused system. The core challenges are: (1) reliable chunk-based downloads with resume over unreliable networks, (2) efficient CDN distribution to minimize origin egress, and (3) integrity guarantees so users never get a corrupted file.


3. API Design (3 min)

File Manifest (download initialization)

GET /v1/files/{file_id}/manifest
  Response 200: {
    "file_id": "file_abc",
    "file_name": "dataset-v2.tar.gz",
    "file_size_bytes": 2147483648,           // 2GB
    "file_sha256": "a1b2c3d4...",            // whole-file checksum
    "chunk_size_bytes": 8388608,             // 8MB per chunk
    "chunks": [
      { "index": 0, "offset": 0, "size": 8388608, "sha256": "chunk0hash..." },
      { "index": 1, "offset": 8388608, "size": 8388608, "sha256": "chunk1hash..." },
      // ... 254 more chunks (256 total)
    ],
    "cdn_base_url": "https://cdn.example.com/files/file_abc",
    "download_token": "jwt_token_valid_24h",  // signed, includes user tier/throttle info
    "expires_at": "2024-02-23T18:00:00Z"
  }

Chunk Download (supports HTTP Range)

GET /v1/files/{file_id}/chunks/{chunk_index}
  Headers:
    Authorization: Bearer {download_token}
    Range: bytes=0-8388607                   // optional, for sub-chunk resume

  Response 200 (or 206 Partial Content):
    Content-Length: 8388608
    Content-Range: bytes 0-8388607/8388608
    ETag: "chunk0hash..."
    X-Chunk-SHA256: chunk0hash...
    Body: <binary chunk data>

Alternative: Direct CDN Download with Range Headers

GET https://cdn.example.com/files/file_abc
  Headers:
    Range: bytes=0-8388607                   // first 8MB chunk
    Authorization: Bearer {download_token}

  Response 206 Partial Content:
    Content-Range: bytes 0-8388607/2147483648
    Accept-Ranges: bytes
    Body: <binary data>

Key Decisions

  • Manifest-first approach: Client fetches manifest, then downloads chunks independently. Enables parallel chunk downloads and smart resume.
  • JWT download tokens: Include user ID, file ID, expiry, and throttle tier. CDN edge validates token without hitting origin.
  • Two download modes: Chunk API (for our custom client with parallel downloads) and standard Range header API (for browsers/wget/curl compatibility).
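A minimal sketch of minting and edge-verifying such a token, using only the standard library for HS256 signing (the claim names and shared secret are illustrative assumptions; production would use a JWT library and key rotation):

```python
import base64, hashlib, hmac, json, time

SECRET = b"shared-with-cdn-edge"  # hypothetical key distributed to CDN edges

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint_download_token(user_id: str, file_id: str, tier_bps: int) -> str:
    """Build a compact HS256 JWT the CDN edge can verify without the origin."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "sub": user_id,
        "file_id": file_id,
        "throttle_bps": tier_bps,             # read by edge for throttling
        "exp": int(time.time()) + 24 * 3600,  # 24h validity, per the manifest
    }
    signing_input = (b64url(json.dumps(header).encode()) + "." +
                     b64url(json.dumps(claims).encode()))
    sig = hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"

def verify(token: str) -> dict:
    """Edge-side check: recompute the MAC, then check expiry."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("bad signature")
    payload = signing_input.split(".")[1]
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("expired")
    return claims
```

Because the secret is shared with the edge, token validation never touches the origin, which is the whole point of the design.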

4. Data Model (3 min)

Files (PostgreSQL — metadata and manifests)

Table: files
  file_id          (PK) | uuid
  file_name             | varchar(500)
  file_size_bytes       | bigint
  sha256_hash           | char(64)
  chunk_size_bytes      | int              -- typically 8MB
  chunk_count           | int
  s3_bucket             | varchar(100)
  s3_key                | varchar(500)
  content_type          | varchar(100)
  uploaded_at           | timestamp
  status                | enum('processing', 'available', 'deleted')
  access_tier           | enum('hot', 'warm', 'cold')  -- determines CDN caching

Table: file_chunks
  file_id          (FK) | uuid
  chunk_index           | int
  offset_bytes          | bigint
  size_bytes            | int
  sha256_hash           | char(64)
  (composite PK: file_id + chunk_index)

Download Tracking (PostgreSQL + Redis)

Table: downloads (PostgreSQL — for analytics and billing)
  download_id      (PK) | uuid
  user_id          (FK) | uuid
  file_id          (FK) | uuid
  started_at            | timestamp
  completed_at          | timestamp         -- NULL if in progress
  bytes_downloaded      | bigint
  status                | enum('in_progress', 'completed', 'failed', 'paused')

// Redis — real-time download state for resume
HSET download:{download_id}
  file_id: "file_abc"
  chunks_completed: "0,1,2,3,4"            // bitset or comma-separated
  bytes_downloaded: 41943040
  last_chunk_offset: 2097152               // partial chunk resume point
  throttle_bps: 52428800                   // 50 Mbps for this user tier

Object Storage (S3)

Bucket: files-{region}
  Key: {file_id}/{chunk_index}             // each chunk stored as separate object
  OR
  Key: {file_id}/data                      // single object, served with Range headers

// Trade-off:
// Separate chunk objects: simpler CDN caching per chunk, easier invalidation
// Single object with Range: fewer S3 objects, standard HTTP Range support
// Recommendation: Single object with Range (CDN handles chunking via Range)

Why PostgreSQL + S3?

  • PostgreSQL for file metadata and chunk manifests (small, relational, ACID)
  • S3 for file storage (11 nines durability, unlimited scale, cost-effective)
  • Redis for real-time download state (fast reads for resume, TTL-based cleanup)

5. High-Level Design (12 min)

Architecture

Client (custom download client or browser)
  → DNS → CDN Edge Node (closest POP)
    → Cache hit? Serve chunk directly (80% of requests)
    → Cache miss? Pull from Origin Shield
      → Origin Shield (regional cache layer)
        → Cache hit? Serve to edge
        → Cache miss? Pull from S3 Origin
          → S3 (object storage, multi-region replicated)

Manifest Flow:
  Client → API Gateway → File Service → PostgreSQL (file metadata)
    → Generate signed download token (JWT)
    → Return manifest with CDN URLs

Download Flow (parallel chunks):
  Client downloads manifest
    → Spawns N parallel workers (N=4 by default)
    → Worker 1: GET cdn/file_abc Range: bytes=0-8388607
    → Worker 2: GET cdn/file_abc Range: bytes=8388608-16777215
    → Worker 3: GET cdn/file_abc Range: bytes=16777216-25165823
    → Worker 4: GET cdn/file_abc Range: bytes=25165824-33554431
    → Each worker: verify chunk SHA256, write to disk
    → Completed chunks tracked in local state file
    → On network failure: resume from last incomplete chunk
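The per-worker Range headers above follow directly from the chunk index (a sketch; `chunk_range` is a hypothetical client-side helper):

```python
CHUNK_SIZE = 8 * 1024**2  # 8 MiB

def chunk_range(index: int, file_size: int, chunk_size: int = CHUNK_SIZE) -> str:
    """HTTP Range header for one chunk; the last chunk may be short."""
    start = index * chunk_size
    end = min(start + chunk_size, file_size) - 1
    return f"bytes={start}-{end}"

# Reproduces the four worker ranges above for a 2 GiB file:
for i in range(4):
    print(chunk_range(i, 2 * 1024**3))
```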

Throttling:
  CDN Edge → reads throttle tier from JWT → enforces bandwidth limit
  Token Bucket: refill_rate = user_tier_bps, bucket_size = 2 * chunk_size
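The edge-side limiter can be sketched as a standard token bucket (illustrative only; real CDNs expose this as edge configuration, not application code):

```python
import time

class TokenBucket:
    """Bandwidth throttle: refill at the tier rate, burst up to 2 chunks."""
    def __init__(self, rate_bps: float, chunk_size: int = 8 * 1024**2):
        self.rate = rate_bps / 8          # refill rate in bytes/sec
        self.capacity = 2 * chunk_size    # burst allowance, per the config above
        self.tokens = float(self.capacity)
        self.last = time.monotonic()

    def consume(self, nbytes: int) -> float:
        """Seconds to wait before nbytes may be sent (0.0 = send now)."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return 0.0
        return (nbytes - self.tokens) / self.rate
```

The 2× chunk burst capacity lets a download start its first two chunks at full speed before the steady-state rate limit kicks in.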

Upload Pipeline (file preparation)

Publisher uploads file:
  → Upload Service: receive file via multipart upload
  → S3: store original file
  → Processing Pipeline:
    1. Compute whole-file SHA256
    2. Split into chunks (logical, via byte offsets — file stays as one S3 object)
    3. Compute per-chunk SHA256
    4. Store manifest in PostgreSQL
    5. Pre-warm CDN: push file to origin shield
    6. Mark file as "available"
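Steps 1-3 of the pipeline can run in a single streaming pass, hashing the whole file and each 8MB logical chunk without holding the file in memory (a sketch; `build_manifest` is a hypothetical helper):

```python
import hashlib
from typing import BinaryIO

CHUNK_SIZE = 8 * 1024**2  # 8 MiB logical chunks

def build_manifest(f: BinaryIO) -> dict:
    """One pass: whole-file SHA256 plus per-chunk SHA256s at fixed offsets."""
    file_hash = hashlib.sha256()
    chunks, offset = [], 0
    while True:
        data = f.read(CHUNK_SIZE)
        if not data:
            break
        file_hash.update(data)
        chunks.append({
            "index": len(chunks),
            "offset": offset,
            "size": len(data),   # last chunk may be shorter than CHUNK_SIZE
            "sha256": hashlib.sha256(data).hexdigest(),
        })
        offset += len(data)
    return {"file_size_bytes": offset,
            "file_sha256": file_hash.hexdigest(),
            "chunk_size_bytes": CHUNK_SIZE,
            "chunks": chunks}
```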

Components

  1. File Service: Manages file metadata, generates manifests and download tokens. Stateless API behind load balancer.
  2. CDN (CloudFront/Akamai/Fastly): Global edge network, 200+ POPs. Caches file chunks at the edge. Validates JWT tokens. Enforces bandwidth throttling.
  3. Origin Shield: Regional cache layer between CDN edges and S3. Reduces S3 egress by 5-10x. One per major region (US, EU, APAC).
  4. S3 (Object Storage): Primary file storage. Cross-region replication for durability. Lifecycle policies move cold files to Glacier.
  5. Download Client (SDK): Custom client library (or CLI tool) that handles parallel chunk downloads, resume, and integrity verification. Falls back to standard HTTP Range for browser downloads.
  6. Throttle Service: Manages per-user bandwidth limits. Tier info embedded in JWT; enforced at CDN edge.
  7. Analytics Service: Tracks download metrics (speed, completion rate, retry rate) for monitoring and capacity planning.

6. Deep Dives (15 min)

Deep Dive 1: Chunk-Based Download with Resume

The problem: Users downloading a 10GB file over a home connection may experience disconnections, ISP resets, laptop sleep, or network switches (WiFi → cellular). The download must resume seamlessly without re-downloading completed data.

Client-side state tracking:

// Local state file: ~/.downloads/file_abc.state
{
  "file_id": "file_abc",
  "file_size": 10737418240,
  "chunk_size": 8388608,
  "total_chunks": 1280,
  "completed_chunks": [0, 1, 2, ..., 450],     // bitset in practice
  "partial_chunk": {
    "index": 451,
    "bytes_downloaded": 3145728,                // 3MB of 8MB chunk
    "temp_file": "/tmp/file_abc_chunk_451.part"
  },
  "file_sha256": "a1b2c3d4...",
  "manifest_etag": "etag123"
}

Resume protocol:

1. Client starts download, fetches manifest, saves to state file
2. Downloads chunks in parallel (4 workers)
3. Network drops at chunk 451 (3MB into 8MB chunk)
4. Client detects connection failure, saves state

--- User reconnects later ---

5. Client reads state file
6. Verify manifest hasn't changed: HEAD /v1/files/file_abc/manifest → check ETag
   - If ETag matches: resume with existing state
   - If ETag changed: file was updated → restart download with new manifest
7. Skip chunks 0-450 (already completed and verified)
8. Resume chunk 451 from byte 3,145,728:
   GET cdn/file_abc
     Range: bytes=3786407936-3791650815    // remainder of chunk 451 (skip first 3MB)
9. Continue with remaining chunks 452-1279

Sub-chunk resume with HTTP Range:

Chunk 451 starts at file offset: 451 * 8388608 = 3,783,262,208
Already downloaded: 3,145,728 bytes of this chunk
Resume offset: 3,783,262,208 + 3,145,728 = 3,786,407,936

GET https://cdn.example.com/files/file_abc
  Range: bytes=3786407936-3791650815       // remaining 5,242,880 bytes of chunk 451
  Response 206 Partial Content
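This arithmetic is easy to get wrong by one byte; a tiny helper keeps it honest (hypothetical helper that mirrors the numbers above, assuming a full-size chunk — the final chunk would need the file size to clamp the end offset):

```python
CHUNK_SIZE = 8 * 1024**2  # 8 MiB

def resume_range(chunk_index: int, bytes_done: int,
                 chunk_size: int = CHUNK_SIZE) -> str:
    """Range header for the unfinished tail of a partially downloaded chunk."""
    start = chunk_index * chunk_size + bytes_done
    end = (chunk_index + 1) * chunk_size - 1
    return f"bytes={start}-{end}"

print(resume_range(451, 3_145_728))   # bytes=3786407936-3791650815
```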

Integrity verification:

For each completed chunk:
  computed_hash = SHA256(chunk_data)
  expected_hash = manifest.chunks[index].sha256
  if computed_hash != expected_hash:
    → Discard chunk, re-download
    → If 3 consecutive failures for same chunk: report corruption, try different CDN edge

After all chunks assembled:
  computed_file_hash = SHA256(assembled_file)
  if computed_file_hash != manifest.file_sha256:
    → Identify corrupted chunk(s) by re-verifying each
    → Re-download only corrupted chunks
    → Worst case: full re-download (extremely rare)
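The verification logic above, as code (a sketch; it assumes the assembled file is on disk and the manifest has the shape from the API section, with the retry policy handled by the caller):

```python
import hashlib

def verify_chunk(data: bytes, expected_sha256: str) -> bool:
    """Per-chunk integrity check against the manifest."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

def find_corrupt_chunks(path: str, manifest: dict) -> list[int]:
    """Re-verify every chunk of an assembled file; return bad chunk indices.

    Used after a whole-file hash mismatch to re-download only what's broken.
    """
    bad = []
    with open(path, "rb") as f:
        for c in manifest["chunks"]:
            f.seek(c["offset"])
            if not verify_chunk(f.read(c["size"]), c["sha256"]):
                bad.append(c["index"])
    return bad
```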

Deep Dive 2: CDN Distribution and Cache Optimization

The problem: 5PB/month egress at $0.05-0.09/GB from S3 is $250K-450K/month. CDN reduces this but only if cache hit rates are high. Large files (10GB+) are hard to cache entirely at edge nodes.

CDN architecture:

User → CDN Edge (200+ POPs globally)
  → Origin Shield (3 regional caches: US, EU, APAC)
    → S3 Origin (us-east-1 primary, eu-west-1 replica)

Cache layers:
  Edge: 50TB SSD per POP, caches popular chunks
  Origin Shield: 500TB per region, caches all files accessed in last 7 days
  S3: unlimited, cold storage

Range-request-aware caching:

Challenge: A 10GB file cached as a single CDN object wastes edge cache space.
Solution: CDN caches at the Range request granularity.

Client requests: Range: bytes=0-8388607 (first 8MB)
CDN caches: file_abc:bytes=0-8388607 as a separate cache entry

Benefits:
  - Only popular portions of files cached at edge (first chunks downloaded most — many users start but don't finish)
  - Each 8MB cache entry fits efficiently in SSD
  - Subsequent Range requests for same chunk → cache hit

CDN configuration (CloudFront):
  Cache key: {file_id} + {Range header}
  TTL: 7 days for "available" files
  Invalidation: on file update, invalidate all cache entries for file_id
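One common refinement (a sketch under the assumption of 8MB alignment; details vary by CDN) is to normalize arbitrary client Ranges onto aligned slices, so two clients requesting overlapping byte ranges still hit the same cache entries:

```python
CHUNK_SIZE = 8 * 1024**2  # 8 MiB cache slice, matching the chunk size

def aligned_cache_keys(file_id: str, start: int, end: int) -> list[str]:
    """Map an arbitrary byte Range onto 8MB-aligned cache keys."""
    first = start // CHUNK_SIZE
    last = end // CHUNK_SIZE
    keys = []
    for i in range(first, last + 1):
        s = i * CHUNK_SIZE
        keys.append(f"{file_id}:bytes={s}-{s + CHUNK_SIZE - 1}")
    return keys

print(aligned_cache_keys("file_abc", 0, 8_388_607))
# ['file_abc:bytes=0-8388607']
```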

Cache warming for new releases:

When a popular file is published (e.g., new game update):
  1. Push to origin shield in all regions (pre-warm)
  2. For top 20 POPs by expected demand: pre-warm first 10% of file (most-requested chunks)
  3. Remaining chunks fill cache organically from user requests

  Without pre-warming: first 1000 users all hit S3 origin → 10Gbps burst on origin
  With pre-warming: first 1000 users served from edge → smooth rollout

Cost optimization:

Strategy                  | Savings
Origin Shield             | 5-10x reduction in S3 egress
Range-aware caching       | 30% better cache utilization (don't cache full large files)
Regional S3 replicas      | Avoid cross-region transfer fees ($0.02/GB)
Tiered CDN pricing        | Commit to volume for lower per-GB rate
Cold file eviction        | Files not downloaded in 30 days → removed from CDN, served from S3 on-demand

Deep Dive 3: Parallel Downloads and Bandwidth Optimization

The problem: A single TCP connection is often limited well below link capacity by packet loss and round-trip time (TCP congestion control); at 50 Mbps, a 10GB file takes ~27 minutes. Four parallel connections at 50 Mbps each yield 200 Mbps aggregate, cutting that to ~7 minutes. But we need to balance parallelism benefits against server load and fairness.

Adaptive parallel download:

Client starts with 4 parallel workers (default).

Adaptive algorithm:
  every 5 seconds:
    measure aggregate_throughput across all workers
    measure per_worker_throughput

    if aggregate_throughput increased with last worker added AND workers < 8:
      add_worker()  // bandwidth not saturated, more parallelism helps
    if per_worker_throughput < 1 Mbps AND workers > 1:
      remove_worker()  // connection is slow, reduce load
    if any worker consistently slower than others:
      reassign its chunks to faster workers (different CDN edge)
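The control loop can be reduced to a pure decision function over 5-second throughput samples (a sketch; the 5% improvement threshold is an illustrative assumption, the 1 Mbps floor and 8-worker cap come from the pseudocode above):

```python
def adjust_workers(workers: int, agg_bps: float, prev_agg_bps: float,
                   per_worker_bps: float) -> int:
    """Decide the next worker count from throughput measurements."""
    if agg_bps > prev_agg_bps * 1.05 and workers < 8:
        return workers + 1   # last worker helped: bandwidth not yet saturated
    if per_worker_bps < 1_000_000 and workers > 1:
        return workers - 1   # per-connection speed too low: back off
    return workers
```

Keeping the decision pure (no I/O, no threads) makes it trivial to unit-test against the scenarios in "Typical behavior".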

Typical behavior:
  Home broadband (100 Mbps): 4 workers optimal, each ~25 Mbps
  Data center (1 Gbps): ramps to 8 workers, each ~125 Mbps
  Mobile (10 Mbps): drops to 2 workers, each ~5 Mbps

Bandwidth throttling (server-side):

Per-user throttle enforced at CDN edge:
  Free tier: 10 Mbps per download
  Pro tier: 100 Mbps per download
  Enterprise: unlimited (fair-share)

Implementation: Token bucket in CDN edge config
  Refill rate: tier_bandwidth_bps
  Bucket size: 2 * chunk_size_bytes (allows burst for chunk start)

  CDN reads throttle_tier from JWT claim, applies limit
  No origin involvement — pure edge enforcement

Fair-share scheduling:

When CDN edge is congested (serving 10,000 concurrent downloads):
  Total edge bandwidth: 100 Gbps
  Per-user fair share: 10 Mbps (if all active)

  But many users have lower-tier limits (10 Mbps)
  Spare bandwidth redistributed to higher-tier users

  Algorithm: Weighted fair queuing (WFQ)
    weight(user) = min(user_tier_bandwidth, fair_share * 2)
    Higher-tier users get proportionally more bandwidth during congestion
    No user gets zero bandwidth (minimum guarantee: 1 Mbps)
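One way to realize this allocation is progressive filling up to tier caps (a simplification of WFQ, sketched below; the 1 Mbps floor is from the policy above, and the 2× weight cap is omitted for brevity):

```python
def allocate(total_bps: float, tier_caps: list[float]) -> list[float]:
    """Give every user a floor, then fill equal shares up to each tier cap,
    redistributing leftover bandwidth to users whose caps aren't hit."""
    n = len(tier_caps)
    floor = min(1_000_000, total_bps / n)   # minimum guarantee: 1 Mbps
    alloc = [floor] * n
    remaining = total_bps - floor * n
    active = [i for i in range(n) if tier_caps[i] > alloc[i]]
    while remaining > 1 and active:
        share = remaining / len(active)
        for i in list(active):
            give = min(share, tier_caps[i] - alloc[i])
            alloc[i] += give
            remaining -= give
            if alloc[i] >= tier_caps[i]:
                active.remove(i)
    return alloc
```

For example, 100 Mbps of edge capacity split across two 10 Mbps free-tier users and one 100 Mbps pro user caps the free users at 10 Mbps each and hands the remaining ~80 Mbps to the pro user.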

Peer-assisted delivery (for extremely popular files):

When a file has > 10,000 concurrent downloaders:
  → Enable WebRTC-based peer exchange
  → Users who have downloaded chunks serve them to nearby peers
  → Similar to BitTorrent but coordinated by our tracker

  Peer selection: same ISP, same city (minimize cross-network traffic)
  Fallback: if peer is slow or unavailable, seamlessly fall back to CDN

  Saves: 30-50% CDN egress for viral files (game launches, OS updates)
  Privacy: peers only exchange encrypted chunks, no user identification

7. Extensions (2 min)

  • Delta downloads: When a file is updated (e.g., v2.1 → v2.2), compute binary diff (bsdiff/xdelta). Client downloads only the delta (often 5-10% of full file size). Requires storing diff manifests alongside full files.
  • Deduplication: Content-addressable storage — chunks with the same SHA256 stored once, regardless of how many files reference them. Saves 20-40% storage for datasets with overlapping content.
  • Download scheduling: Allow users to schedule downloads during off-peak hours (2 AM - 6 AM) for lower priority / cost. Background downloads on mobile with WiFi-only constraint.
  • Geographic access control: Restrict file downloads by country/region (licensing compliance). CDN edge checks user’s geographic location against file’s allowed regions. Block or redirect to appropriate licensed version.
  • End-to-end encryption: Files encrypted at rest in S3 with per-file keys. Decryption key delivered via manifest (encrypted with user’s public key). CDN serves encrypted bytes — even CDN edge cannot read file contents.