1. Requirements & Scope (5 min)
Functional Requirements
- Users can download large files (100MB to 50GB) reliably over unstable network connections with resume support
- Files are split into chunks; clients can download chunks in parallel and resume from the last completed chunk after interruption
- Integrity verification at both chunk level (per-chunk checksum) and file level (whole-file hash) to detect corruption
- Distribute files globally via CDN edge nodes to minimize download latency and maximize throughput
- Support bandwidth throttling per user/tier and fair-share scheduling when multiple users download simultaneously
Non-Functional Requirements
- Availability: 99.95% — users should almost always be able to start or resume a download. Brief outages are tolerable if resume works.
- Latency: First byte within 200ms. Download speed should saturate the user’s available bandwidth (no artificial bottleneck from our side, except throttling).
- Consistency: File metadata (checksums, chunk manifest) must be strongly consistent. A user must never download a partially-updated file.
- Scale: 10M registered users, 500K concurrent downloads, 100PB total stored files, 5PB egress/month.
- Durability: 99.999999999% (11 nines) for stored files. No data loss ever.
2. Estimation (3 min)
Traffic
- Concurrent downloads: 500K
- Average file size: 2GB
- Average download speed: 50 Mbps per user
- Total bandwidth: 500K × 50 Mbps = 25 Tbps theoretical peak egress if every download saturates simultaneously (this is CDN-scale, not single-origin). Note the 5PB/month average works out to only ~15 Gbps sustained, so traffic is highly bursty relative to the peak.
- Chunk requests: 2GB file / 8MB chunk = 256 chunks per download. 500K downloads × 256 = 128M chunk requests (spread over each download's duration)
- Chunk request rate: a 2GB download at 50 Mbps takes ~340s, i.e. ~0.75 chunks/sec per download → ~375K chunk requests/sec at peak
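The chunk arithmetic above can be sanity-checked in a few lines. This is a back-of-envelope sketch: it assumes 2 GiB files, 8 MiB chunks, and a fully saturated 50 Mbps link per download.

```python
# Back-of-envelope check of the traffic estimates above.
FILE_SIZE = 2 * 1024**3           # 2 GiB average file
CHUNK_SIZE = 8 * 1024**2          # 8 MiB chunks
CONCURRENT = 500_000              # concurrent downloads
SPEED_BPS = 50 * 10**6            # 50 Mbps per user

chunks_per_file = FILE_SIZE // CHUNK_SIZE              # 256
total_chunk_requests = CONCURRENT * chunks_per_file    # 128,000,000
download_secs = FILE_SIZE * 8 / SPEED_BPS              # ~344 s per download
rate = total_chunk_requests / download_secs            # ~372K chunk requests/sec

print(chunks_per_file, total_chunk_requests, round(download_secs), round(rate))
```

The request rate matters because it sizes the CDN edge fleet and the token-validation path, not the origin (most of these requests should be edge cache hits).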
Storage
- Total storage: 100PB across ~50M files (2GB average), in object storage like S3
- Chunk metadata: 50M files × 256 chunks × 64 bytes ≈ 820GB (fits in a database)
- File manifests: 50M files × 2KB = 100GB
Cost Insight
- Egress cost is dominant: 5PB/month × $0.05/GB (CDN) = $250K/month
- CDN cache hit ratio is critical: 80% hit rate → origin serves only 1PB/month
- Popular files (top 1%) account for 80% of downloads (cache-friendly)
- Long-tail files need origin serving — optimize with regional origin replicas
Key Insight
This is an egress-heavy, reliability-focused system. The core challenges are: (1) reliable chunk-based downloads with resume over unreliable networks, (2) efficient CDN distribution to minimize origin egress, and (3) integrity guarantees so users never get a corrupted file.
3. API Design (3 min)
File Manifest (download initialization)
GET /v1/files/{file_id}/manifest
Response 200: {
"file_id": "file_abc",
"file_name": "dataset-v2.tar.gz",
"file_size_bytes": 2147483648, // 2GB
"file_sha256": "a1b2c3d4...", // whole-file checksum
"chunk_size_bytes": 8388608, // 8MB per chunk
"chunks": [
{ "index": 0, "offset": 0, "size": 8388608, "sha256": "chunk0hash..." },
{ "index": 1, "offset": 8388608, "size": 8388608, "sha256": "chunk1hash..." },
// ... 256 chunks total (indices 0-255)
],
"cdn_base_url": "https://cdn.example.com/files/file_abc",
"download_token": "jwt_token_valid_24h", // signed, includes user tier/throttle info
"expires_at": "2024-02-23T18:00:00Z"
}
Chunk Download (supports HTTP Range)
GET /v1/files/{file_id}/chunks/{chunk_index}
Headers:
Authorization: Bearer {download_token}
Range: bytes=0-8388607 // optional, for sub-chunk resume
Response 200 (or 206 Partial Content):
Content-Length: 8388608
Content-Range: bytes 0-8388607/8388608
ETag: "chunk0hash..."
X-Chunk-SHA256: chunk0hash...
Body: <binary chunk data>
Alternative: Direct CDN Download with Range Headers
GET https://cdn.example.com/files/file_abc
Headers:
Range: bytes=0-8388607 // first 8MB chunk
Authorization: Bearer {download_token}
Response 206 Partial Content:
Content-Range: bytes 0-8388607/2147483648
Accept-Ranges: bytes
Body: <binary data>
Key Decisions
- Manifest-first approach: Client fetches manifest, then downloads chunks independently. Enables parallel chunk downloads and smart resume.
- JWT download tokens: Include user ID, file ID, expiry, and throttle tier. CDN edge validates token without hitting origin.
- Two download modes: Chunk API (for our custom client with parallel downloads) and standard Range header API (for browsers/wget/curl compatibility).
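Both download modes ultimately reduce to byte ranges. A small helper (hypothetical name `range_for_chunk`) shows how a chunk index maps to the HTTP Range header, including the short final chunk; Range end offsets are inclusive per the HTTP spec.

```python
def range_for_chunk(index: int, chunk_size: int, file_size: int) -> str:
    """Return the HTTP Range header value for one chunk of a file.

    The last chunk may be shorter than chunk_size; Range end offsets
    are inclusive, so a chunk spans [start, end] not [start, end).
    """
    start = index * chunk_size
    if start >= file_size:
        raise IndexError("chunk index past end of file")
    end = min(start + chunk_size, file_size) - 1
    return f"bytes={start}-{end}"

# First 8MB chunk of a 2GiB file:
print(range_for_chunk(0, 8_388_608, 2_147_483_648))    # bytes=0-8388607
# Last chunk (index 255) ends exactly at the file boundary:
print(range_for_chunk(255, 8_388_608, 2_147_483_648))  # bytes=2139095040-2147483647
```

The custom client uses this per chunk; a browser or curl sends the same header directly against the CDN URL.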
4. Data Model (3 min)
Files (PostgreSQL — metadata and manifests)
Table: files
file_id (PK) | uuid
file_name | varchar(500)
file_size_bytes | bigint
sha256_hash | char(64)
chunk_size_bytes | int -- typically 8MB
chunk_count | int
s3_bucket | varchar(100)
s3_key | varchar(500)
content_type | varchar(100)
uploaded_at | timestamp
status | enum('processing', 'available', 'deleted')
access_tier | enum('hot', 'warm', 'cold') -- determines CDN caching
Table: file_chunks
file_id (FK) | uuid
chunk_index | int
offset_bytes | bigint
size_bytes | int
sha256_hash | char(64)
(composite PK: file_id + chunk_index)
Download Tracking (PostgreSQL + Redis)
Table: downloads (PostgreSQL — for analytics and billing)
download_id (PK) | uuid
user_id (FK) | uuid
file_id (FK) | uuid
started_at | timestamp
completed_at | timestamp -- NULL if in progress
bytes_downloaded | bigint
status | enum('in_progress', 'completed', 'failed', 'paused')
// Redis — real-time download state for resume
HSET download:{download_id}
file_id: "file_abc"
chunks_completed: "0,1,2,3,4" // bitset or comma-separated
bytes_downloaded: 41943040
last_chunk_offset: 2097152 // partial chunk resume point
throttle_bps: 52428800 // 50 Mbps for this user tier
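For `chunks_completed`, a bitset is much cheaper than a comma-separated string at large chunk counts. A sketch of the packing (one bit per chunk, packed client-side before writing to Redis; names hypothetical):

```python
def pack_completed(chunks: set[int], total: int) -> bytes:
    """Pack completed chunk indices into a bitset, 1 bit per chunk."""
    buf = bytearray((total + 7) // 8)
    for i in chunks:
        buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)

def unpack_completed(buf: bytes) -> set[int]:
    """Recover the set of completed chunk indices from the bitset."""
    return {i for i in range(len(buf) * 8) if buf[i // 8] & (1 << (i % 8))}

# 1280 chunks (a 10GB file) fit in 160 bytes instead of a multi-KB CSV string.
bits = pack_completed({0, 1, 2, 451}, 1280)
print(len(bits))  # 160
```

Round-tripping through `unpack_completed` is what the client does on resume to rebuild its worklist.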
Object Storage (S3)
Bucket: files-{region}
Key: {file_id}/{chunk_index} // each chunk stored as separate object
OR
Key: {file_id}/data // single object, served with Range headers
// Trade-off:
// Separate chunk objects: simpler CDN caching per chunk, easier invalidation
// Single object with Range: fewer S3 objects, standard HTTP Range support
// Recommendation: Single object with Range (CDN handles chunking via Range)
Why PostgreSQL + S3?
- PostgreSQL for file metadata and chunk manifests (small, relational, ACID)
- S3 for file storage (11 nines durability, unlimited scale, cost-effective)
- Redis for real-time download state (fast reads for resume, TTL-based cleanup)
5. High-Level Design (12 min)
Architecture
Client (custom download client or browser)
→ DNS → CDN Edge Node (closest POP)
→ Cache hit? Serve chunk directly (80% of requests)
→ Cache miss? Pull from Origin Shield
→ Origin Shield (regional cache layer)
→ Cache hit? Serve to edge
→ Cache miss? Pull from S3 Origin
→ S3 (object storage, multi-region replicated)
Manifest Flow:
Client → API Gateway → File Service → PostgreSQL (file metadata)
→ Generate signed download token (JWT)
→ Return manifest with CDN URLs
Download Flow (parallel chunks):
Client downloads manifest
→ Spawns N parallel workers (N=4 by default)
→ Worker 1: GET cdn/file_abc Range: bytes=0-8388607
→ Worker 2: GET cdn/file_abc Range: bytes=8388608-16777215
→ Worker 3: GET cdn/file_abc Range: bytes=16777216-25165823
→ Worker 4: GET cdn/file_abc Range: bytes=25165824-33554431
→ Each worker: verify chunk SHA256, write to disk
→ Completed chunks tracked in local state file
→ On network failure: resume from last incomplete chunk
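The parallel-worker flow above can be sketched with a thread pool. `blob_fetch` is a stand-in for the real CDN Range request (an assumption, not a real API), so the example runs against an in-memory blob with tiny chunks:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8  # tiny chunk size so the example is self-contained

def download(blob_fetch, file_size, manifest_hashes, workers=4):
    """Download all chunks in parallel, verify each SHA256, reassemble."""
    n = (file_size + CHUNK - 1) // CHUNK
    def get(i):
        start = i * CHUNK
        # blob_fetch takes inclusive start/end offsets, like an HTTP Range.
        data = blob_fetch(start, min(start + CHUNK, file_size) - 1)
        if hashlib.sha256(data).hexdigest() != manifest_hashes[i]:
            raise IOError(f"chunk {i} failed integrity check")
        return i, data
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = dict(pool.map(get, range(n)))
    return b"".join(parts[i] for i in range(n))

# Simulated 20-byte "file" served via inclusive byte ranges:
blob = b"abcdefghijklmnopqrst"
fetch = lambda s, e: blob[s:e + 1]
hashes = [hashlib.sha256(blob[i:i + CHUNK]).hexdigest() for i in range(0, 20, CHUNK)]
print(download(fetch, len(blob), hashes) == blob)  # True
```

A real client would additionally write chunks to disk as they complete and persist the state file described in Deep Dive 1.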
Throttling:
CDN Edge → reads throttle tier from JWT → enforces bandwidth limit
Token Bucket: refill_rate = user_tier_bps, bucket_size = 2 * chunk_size
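A minimal token bucket matching the edge config above (refill at the tier rate, burst of two chunks). Time is injected rather than read from a clock so the sketch stays deterministic:

```python
class TokenBucket:
    """Byte-based token bucket: refill_rate bytes/sec, capped at bucket_size."""
    def __init__(self, refill_rate: float, bucket_size: float):
        self.rate, self.size = refill_rate, bucket_size
        self.tokens, self.last = bucket_size, 0.0  # starts full: allows burst

    def allow(self, nbytes: int, now: float) -> bool:
        """Spend nbytes if available; otherwise the caller must wait."""
        self.tokens = min(self.size, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False

# 50 Mbps tier = 6.25 MB/s refill, burst capacity of 2 x 8MB chunks:
tb = TokenBucket(refill_rate=6_250_000, bucket_size=2 * 8_388_608)
print(tb.allow(8_388_608, now=0.0))  # True  (first burst chunk)
print(tb.allow(8_388_608, now=0.0))  # True  (second burst chunk)
print(tb.allow(8_388_608, now=0.0))  # False (bucket empty)
print(tb.allow(8_388_608, now=2.0))  # True  (12.5MB refilled over 2s)
```

In production this logic lives in the CDN edge's rate-limiting config rather than application code; the sketch just makes the refill/burst semantics concrete.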
Upload Pipeline (file preparation)
Publisher uploads file:
→ Upload Service: receive file via multipart upload
→ S3: store original file
→ Processing Pipeline:
1. Compute whole-file SHA256
2. Split into chunks (logical, via byte offsets — file stays as one S3 object)
3. Compute per-chunk SHA256
4. Store manifest in PostgreSQL
5. Pre-warm CDN: push file to origin shield
6. Mark file as "available"
Components
- File Service: Manages file metadata, generates manifests and download tokens. Stateless API behind load balancer.
- CDN (CloudFront/Akamai/Fastly): Global edge network, 200+ POPs. Caches file chunks at the edge. Validates JWT tokens. Enforces bandwidth throttling.
- Origin Shield: Regional cache layer between CDN edges and S3. Reduces S3 egress by 5-10x. One per major region (US, EU, APAC).
- S3 (Object Storage): Primary file storage. Cross-region replication for durability. Lifecycle policies move cold files to Glacier.
- Download Client (SDK): Custom client library (or CLI tool) that handles parallel chunk downloads, resume, and integrity verification. Falls back to standard HTTP Range for browser downloads.
- Throttle Service: Manages per-user bandwidth limits. Tier info embedded in JWT; enforced at CDN edge.
- Analytics Service: Tracks download metrics (speed, completion rate, retry rate) for monitoring and capacity planning.
6. Deep Dives (15 min)
Deep Dive 1: Chunk-Based Download with Resume
The problem: Users downloading a 10GB file over a home connection may experience disconnections, ISP resets, laptop sleep, or network switches (WiFi → cellular). The download must resume seamlessly without re-downloading completed data.
Client-side state tracking:
// Local state file: ~/.downloads/file_abc.state
{
"file_id": "file_abc",
"file_size": 10737418240,
"chunk_size": 8388608,
"total_chunks": 1280,
"completed_chunks": [0, 1, 2, ..., 450], // bitset in practice
"partial_chunk": {
"index": 451,
"bytes_downloaded": 3145728, // 3MB of 8MB chunk
"temp_file": "/tmp/file_abc_chunk_451.part"
},
"file_sha256": "a1b2c3d4...",
"manifest_etag": "etag123"
}
Resume protocol:
1. Client starts download, fetches manifest, saves to state file
2. Downloads chunks in parallel (4 workers)
3. Network drops at chunk 451 (3MB into 8MB chunk)
4. Client detects connection failure, saves state
--- User reconnects later ---
5. Client reads state file
6. Verify manifest hasn't changed: HEAD /v1/files/file_abc/manifest → check ETag
- If ETag matches: resume with existing state
- If ETag changed: file was updated → restart download with new manifest
7. Skip chunks 0-450 (already completed and verified)
8. Resume chunk 451 from byte 3,145,728:
GET cdn/file_abc
Range: bytes=3786407936-3791650815 // rest of chunk 451 (absolute offsets, skipping the 3MB already on disk)
9. Continue with remaining chunks 452-1279
Sub-chunk resume with HTTP Range:
Chunk 451 starts at file offset: 451 * 8388608 = 3,783,262,208
Already downloaded: 3,145,728 bytes of this chunk
Resume offset: 3,783,262,208 + 3,145,728 = 3,786,407,936
GET https://cdn.example.com/files/file_abc
Range: bytes=3786407936-3791650815 // remaining 5,242,880 bytes of chunk 451
Response 206 Partial Content
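The offset arithmetic above generalizes to a small helper (hypothetical name `resume_range`), which reproduces the chunk-451 numbers exactly:

```python
def resume_range(chunk_index: int, bytes_done: int,
                 chunk_size: int, file_size: int) -> str:
    """Absolute Range header for the rest of a partially downloaded chunk."""
    chunk_start = chunk_index * chunk_size
    start = chunk_start + bytes_done            # skip what is already on disk
    end = min(chunk_start + chunk_size, file_size) - 1  # inclusive end offset
    return f"bytes={start}-{end}"

# Chunk 451 of a 10GiB file, 3MiB already downloaded:
print(resume_range(451, 3_145_728, 8_388_608, 10_737_418_240))
# bytes=3786407936-3791650815
```

Note that because the resumed fragment spans only part of the chunk, the client must hash the concatenation of the on-disk prefix and the new bytes before comparing against the manifest's per-chunk SHA256.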
Integrity verification:
For each completed chunk:
computed_hash = SHA256(chunk_data)
expected_hash = manifest.chunks[index].sha256
if computed_hash != expected_hash:
→ Discard chunk, re-download
→ If 3 consecutive failures for same chunk: report corruption, try different CDN edge
After all chunks assembled:
computed_file_hash = SHA256(assembled_file)
if computed_file_hash != manifest.file_sha256:
→ Identify corrupted chunk(s) by re-verifying each
→ Re-download only corrupted chunks
→ Worst case: full re-download (extremely rare)
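The fallback step (re-verify every chunk when the whole-file hash mismatches) can be sketched as a single pass over the assembled file; names are illustrative:

```python
import hashlib

def find_corrupted_chunks(data: bytes, chunk_size: int,
                          manifest_hashes: list[str]) -> list[int]:
    """Return indices of chunks whose SHA256 disagrees with the manifest."""
    bad = []
    for i, expected in enumerate(manifest_hashes):
        chunk = data[i * chunk_size:(i + 1) * chunk_size]
        if hashlib.sha256(chunk).hexdigest() != expected:
            bad.append(i)
    return bad

blob = b"0123456789abcdef"  # two 8-byte "chunks" for the example
good = [hashlib.sha256(blob[i:i + 8]).hexdigest() for i in (0, 8)]
corrupted = bytearray(blob)
corrupted[9] ^= 0xFF        # flip one byte inside chunk 1
print(find_corrupted_chunks(bytes(corrupted), 8, good))  # [1]
```

Only the returned indices need re-downloading, which is why the worst case (full re-download) is rare: it requires the per-chunk hashes themselves to be wrong.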
Deep Dive 2: CDN Distribution and Cache Optimization
The problem: 5PB/month egress at $0.05-0.09/GB from S3 is $250K-450K/month. CDN reduces this but only if cache hit rates are high. Large files (10GB+) are hard to cache entirely at edge nodes.
CDN architecture:
User → CDN Edge (200+ POPs globally)
→ Origin Shield (3 regional caches: US, EU, APAC)
→ S3 Origin (us-east-1 primary, eu-west-1 replica)
Cache layers:
Edge: 50TB SSD per POP, caches popular chunks
Origin Shield: 500TB per region, caches all files accessed in last 7 days
S3: unlimited, cold storage
Range-request-aware caching:
Challenge: A 10GB file cached as a single CDN object wastes edge cache space.
Solution: CDN caches at the Range request granularity.
Client requests: Range: bytes=0-8388607 (first 8MB)
CDN caches: file_abc:bytes=0-8388607 as a separate cache entry
Benefits:
- Only popular portions of files cached at edge (first chunks downloaded most — many users start but don't finish)
- Each 8MB cache entry fits efficiently in SSD
- Subsequent Range requests for same chunk → cache hit
CDN configuration (CloudFront):
Cache key: {file_id} + {Range header}
TTL: 7 days for "available" files
Invalidation: on file update, invalidate all cache entries for file_id
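To keep the cache-key space bounded, the edge normalizes arbitrary client Ranges onto chunk boundaries before keying the cache. Real CDNs implement this internally via configuration rather than code; the sketch below (hypothetical function) just shows the alignment logic:

```python
def cache_keys_for_range(file_id: str, start: int, end: int,
                         chunk_size: int) -> list[str]:
    """Map an arbitrary inclusive byte range onto chunk-aligned cache keys."""
    first = start // chunk_size   # chunk containing the first byte
    last = end // chunk_size      # chunk containing the last byte
    return [
        f"{file_id}:bytes={i * chunk_size}-{(i + 1) * chunk_size - 1}"
        for i in range(first, last + 1)
    ]

# A misaligned 10MB read touches exactly two 8MB chunk entries:
print(cache_keys_for_range("file_abc", 1_000_000, 11_000_000, 8_388_608))
# ['file_abc:bytes=0-8388607', 'file_abc:bytes=8388608-16777215']
```

Chunk-aligned keys mean two clients asking for overlapping but unequal ranges still hit the same cache entries.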
Cache warming for new releases:
When a popular file is published (e.g., new game update):
1. Push to origin shield in all regions (pre-warm)
2. For top 20 POPs by expected demand: pre-warm first 10% of file (most-requested chunks)
3. Remaining chunks fill cache organically from user requests
Without pre-warming: first 1000 users all hit S3 origin → 10Gbps burst on origin
With pre-warming: first 1000 users served from edge → smooth rollout
Cost optimization:
Strategy | Savings
Origin Shield | 5-10x reduction in S3 egress
Range-aware caching | 30% better cache utilization (don't cache full large files)
Regional S3 replicas | Avoid cross-region transfer fees ($0.02/GB)
Tiered CDN pricing | Commit to volume for lower per-GB rate
Cold file eviction | Files not downloaded in 30 days → removed from CDN, served from S3 on-demand
Deep Dive 3: Parallel Downloads and Bandwidth Optimization
The problem: A single TCP connection often tops out around 50 Mbps on long paths (window limits, loss recovery), so a 10GB file takes ~27 minutes. With 4 parallel connections each reaching 50 Mbps, the aggregate 200 Mbps finishes in ~7 minutes. But we need to balance parallelism benefits against server load and fairness.
Adaptive parallel download:
Client starts with 4 parallel workers (default).
Adaptive algorithm:
every 5 seconds:
measure aggregate_throughput across all workers
measure per_worker_throughput
if aggregate_throughput increased with last worker added AND workers < 8:
add_worker() // bandwidth not saturated, more parallelism helps
if per_worker_throughput < 1 Mbps AND workers > 1:
remove_worker() // connection is slow, reduce load
if any worker consistently slower than others:
reassign its chunks to faster workers (different CDN edge)
Typical behavior:
Home broadband (100 Mbps): 4 workers optimal, each ~25 Mbps
Data center (1 Gbps): ramps to 8 workers, each ~125 Mbps
Mobile (10 Mbps): drops to 2 workers, each ~5 Mbps
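A deterministic sketch of the adaptive loop above, with throughput samples injected per tick instead of measured (thresholds are the ones named in the pseudocode; class and method names are illustrative):

```python
class AdaptiveWorkers:
    """Adjust parallel-worker count from per-worker throughput samples (Mbps)."""
    def __init__(self, workers: int = 4, max_workers: int = 8,
                 min_rate_mbps: float = 1.0):
        self.workers = workers
        self.max_workers = max_workers
        self.min_rate = min_rate_mbps
        self.prev_aggregate = 0.0

    def tick(self, per_worker_mbps: list[float]) -> int:
        aggregate = sum(per_worker_mbps)
        if aggregate > self.prev_aggregate and self.workers < self.max_workers:
            self.workers += 1   # adding the last worker helped; keep scaling up
        elif per_worker_mbps and min(per_worker_mbps) < self.min_rate \
                and self.workers > 1:
            self.workers -= 1   # a connection is starving; back off
        self.prev_aggregate = aggregate
        return self.workers

ctl = AdaptiveWorkers()
print(ctl.tick([25, 25, 25, 25]))      # 5 (aggregate rose from 0 -> scale up)
print(ctl.tick([20, 20, 20, 20, 20]))  # 5 (aggregate flat, all workers healthy)
```

The real client would also reassign a slow worker's remaining chunks to faster workers, which this sketch omits.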
Bandwidth throttling (server-side):
Per-user throttle enforced at CDN edge:
Free tier: 10 Mbps per download
Pro tier: 100 Mbps per download
Enterprise: unlimited (fair-share)
Implementation: Token bucket in CDN edge config
Refill rate: tier_bandwidth_bps
Bucket size: 2 * chunk_size_bytes (allows burst for chunk start)
CDN reads throttle_tier from JWT claim, applies limit
No origin involvement — pure edge enforcement
Fair-share scheduling:
When CDN edge is congested (serving 10,000 concurrent downloads):
Total edge bandwidth: 100 Gbps
Per-user fair share: 10 Mbps (if all active)
But many users have lower-tier limits (10 Mbps)
Spare bandwidth redistributed to higher-tier users
Algorithm: Weighted fair queuing (WFQ)
weight(user) = min(user_tier_bandwidth, fair_share * 2)
Higher-tier users get proportionally more bandwidth during congestion
No user gets zero bandwidth (minimum guarantee: 1 Mbps)
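The weighted fair-share split can be sketched as one proportional allocation with the 1 Mbps floor; `allocate_bandwidth` is a simplification of WFQ (real WFQ schedules packets, not static shares), under the weight rule stated above:

```python
def allocate_bandwidth(tier_bps: list[float], capacity_bps: float,
                       floor_bps: float = 1e6) -> list[float]:
    """Split edge capacity in proportion to WFQ-style weights.

    weight = min(tier limit, 2 x equal share); every user gets at least
    floor_bps, and no user exceeds their tier limit.
    """
    fair = capacity_bps / len(tier_bps)
    weights = [min(t, 2 * fair) for t in tier_bps]
    total = sum(weights)
    return [max(floor_bps, min(t, capacity_bps * w / total))
            for t, w in zip(tier_bps, weights)]

# 3 users on a congested 120 Mbps link: free (10), pro (100), pro (100):
shares = allocate_bandwidth([10e6, 100e6, 100e6], 120e6)
print([round(s / 1e6) for s in shares])  # [7, 56, 56]
```

The free-tier user is squeezed below their nominal 10 Mbps during congestion but never below the floor, while spare weight flows to the higher tiers, matching the behavior described above.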
Peer-assisted delivery (for extremely popular files):
When a file has > 10,000 concurrent downloaders:
→ Enable WebRTC-based peer exchange
→ Users who have downloaded chunks serve them to nearby peers
→ Similar to BitTorrent but coordinated by our tracker
Peer selection: same ISP, same city (minimize cross-network traffic)
Fallback: if peer is slow or unavailable, seamlessly fall back to CDN
Saves: 30-50% CDN egress for viral files (game launches, OS updates)
Privacy: peers only exchange encrypted chunks, no user identification
7. Extensions (2 min)
- Delta downloads: When a file is updated (e.g., v2.1 → v2.2), compute binary diff (bsdiff/xdelta). Client downloads only the delta (often 5-10% of full file size). Requires storing diff manifests alongside full files.
- Deduplication: Content-addressable storage — chunks with the same SHA256 stored once, regardless of how many files reference them. Saves 20-40% storage for datasets with overlapping content.
- Download scheduling: Allow users to schedule downloads during off-peak hours (2 AM - 6 AM) for lower priority / cost. Background downloads on mobile with WiFi-only constraint.
- Geographic access control: Restrict file downloads by country/region (licensing compliance). CDN edge checks user’s geographic location against file’s allowed regions. Block or redirect to appropriate licensed version.
- End-to-end encryption: Files encrypted at rest in S3 with per-file keys. Decryption key delivered via manifest (encrypted with user’s public key). CDN serves encrypted bytes — even CDN edge cannot read file contents.
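As a sketch of the deduplication extension: a content-addressable chunk store keys each chunk by its SHA256, so identical chunks shared across files (or file versions) are stored once. All names here are hypothetical and the store is in-memory; a real implementation would back `blobs` with S3 and `files` with the manifest table:

```python
import hashlib

class ChunkStore:
    """Content-addressable store: chunk key = SHA256 of its bytes."""
    def __init__(self):
        self.blobs: dict[str, bytes] = {}       # hash -> chunk bytes
        self.files: dict[str, list[str]] = {}   # file_id -> ordered chunk hashes

    def put_file(self, file_id: str, chunks: list[bytes]) -> None:
        keys = []
        for c in chunks:
            h = hashlib.sha256(c).hexdigest()
            self.blobs.setdefault(h, c)         # identical chunks stored once
            keys.append(h)
        self.files[file_id] = keys

    def get_file(self, file_id: str) -> bytes:
        return b"".join(self.blobs[h] for h in self.files[file_id])

store = ChunkStore()
store.put_file("v1", [b"header", b"shared-data"])
store.put_file("v2", [b"header2", b"shared-data"])  # second "shared-data" deduped
print(len(store.blobs))        # 3 unique chunks for 4 logical chunks
print(store.get_file("v2"))    # b'header2shared-data'
```

Note dedup interacts with the manifest design already in place: since per-chunk SHA256s are stored anyway, the content address comes for free; the cost is reference counting for safe deletion.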