1. Requirements & Scope (5 min)
Functional Requirements
- Users can upload photos (up to 50MB each) with the system generating multiple thumbnail sizes and extracting EXIF metadata automatically
- Organize photos into albums, add tags, and manage sharing permissions (private, shared with specific users, public)
- Search photos by metadata (date, location, camera), user-applied tags, and ML-generated labels (objects, faces, scenes)
- Browse photo feeds (own photos, shared albums, explore/discover) with infinite scroll and fast thumbnail loading
- Share photos and albums via links with configurable permissions (view-only, download allowed, password-protected)
Non-Functional Requirements
- Availability: 99.99% for viewing photos (reads). 99.9% for uploads (briefly queuing uploads is acceptable).
- Latency: Thumbnail loading < 100ms. Full-resolution photo < 500ms. Upload acknowledgment < 2 seconds (processing happens async).
- Consistency: Photo metadata must be strongly consistent (if you upload and refresh, your photo must appear). Search index can lag by 30 seconds.
- Scale: 100M users, 5M daily active, 50M photo uploads/day, 500M photos viewed/day, 10PB total stored photos.
- Durability: 99.999999999% (11 nines) for original photos. Losing a user’s photo is unacceptable.
2. Estimation (3 min)
Traffic
- Uploads: 50M photos/day = ~580 uploads/sec (peak 2x = 1,160/sec)
- Photo views (thumbnails): 500M/day = ~5,800/sec (peak 3x = 17,400/sec)
- Full-resolution views: 50M/day = ~580/sec
- Search queries: 5M DAU × 2 searches/day = 10M/day = ~115 QPS
- Upload bandwidth: 580/sec × 5MB avg = 2.9 GB/sec inbound
Storage
- Photos (originals): 50M/day × 5MB avg × 365 days = ~91PB/year raw (after deduplication and compression: ~50PB/year)
- Thumbnails (4 sizes per photo): 50M/day × 4 × 50KB avg = 10TB/day = 3.6PB/year
- Metadata: 50M/day × 2KB = 100GB/day = 36TB/year
- Total stored: ~10PB currently (growing 50PB/year)
Cost Insight
- Storage: 10PB in S3 Standard = ~$230K/month (at ~$0.023/GB-month)
- CDN egress for thumbnails: 500M views/day × 50KB = 25TB/day = ~$37.5K/month (at ~$0.05/GB)
- ML processing (image labeling): 50M photos/day × $0.001/photo = $50K/day = ~$1.5M/month at managed-API list price — the largest line item, and the main argument for self-hosted inference (see Deep Dive 1)
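These figures can be sanity-checked with a short back-of-envelope script. The unit prices are assumed list prices; real pricing varies by region, volume tier, and negotiated discounts.

```python
# Back-of-envelope monthly costs (assumed list prices, USD).
S3_PER_GB_MONTH = 0.023   # S3 Standard
CDN_PER_GB = 0.05         # CDN egress
ML_PER_IMAGE = 0.001      # managed image-labeling API

stored_gb = 10e6                                   # 10 PB expressed in GB
storage_monthly = stored_gb * S3_PER_GB_MONTH      # -> $230,000

egress_gb_per_day = 500e6 * 50 / 1e6               # 500M views x 50KB = 25,000 GB/day
cdn_monthly = egress_gb_per_day * CDN_PER_GB * 30  # -> $37,500

ml_monthly = 50e6 * ML_PER_IMAGE * 30              # 50M uploads/day -> $1,500,000

print(f"storage=${storage_monthly:,.0f} cdn=${cdn_monthly:,.0f} ml=${ml_monthly:,.0f}")
```

At list price, ML labeling dwarfs storage and CDN — which is why the build-vs-buy decision for labeling gets its own deep dive.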
Key Insight
This is a write-heavy, storage-intensive system. The upload pipeline (ingest, process, store, index) is the most complex component. The read path is CDN-dominated (thumbnails served from edge). The most interesting engineering challenge is the image processing pipeline and ML-powered search.
3. API Design (3 min)
Photo Upload
// Step 1: Request upload URL (presigned)
POST /v1/photos/upload-url
Body: { "filename": "sunset.jpg", "content_type": "image/jpeg", "size_bytes": 4500000 }
Response 200: {
"upload_id": "upl_abc",
"presigned_url": "https://upload.s3.amazonaws.com/...",
"expires_at": "2024-02-22T19:00:00Z"
}
// Step 2: Client uploads directly to S3 via presigned URL
PUT {presigned_url}
Headers: Content-Type: image/jpeg
Body: <binary image data>
// Step 3: Confirm upload and set metadata
POST /v1/photos
Body: {
"upload_id": "upl_abc",
"title": "Beach Sunset",
"description": "Golden hour at Venice Beach",
"album_id": "album_123", // optional
"tags": ["sunset", "beach"],
"visibility": "private" // private, shared, public
}
Response 201: {
"photo_id": "photo_xyz",
"status": "processing", // thumbnails being generated
"urls": {
"original": null, // available after processing
"large": null,
"medium": null,
"thumb": null
}
}
Photo Browsing
GET /v1/photos?album_id=album_123&page=1&per_page=50
Response: {
"photos": [
{ "photo_id": "photo_xyz", "title": "Beach Sunset",
"urls": { "thumb": "https://cdn.example.com/t/photo_xyz_200.jpg",
"medium": "https://cdn.example.com/t/photo_xyz_800.jpg" },
"width": 4000, "height": 3000, "taken_at": "2024-02-20T18:30:00Z",
"tags": ["sunset", "beach"], "ml_labels": ["beach", "sunset", "ocean", "sky"] }
],
"total": 342, "next_page": 2
}
GET /v1/photos/{photo_id}
Response: { full metadata + all size URLs + EXIF data + sharing info }
Search
GET /v1/search/photos?q=sunset+beach&date_from=2024-01-01&location=california&page=1
Response: { "results": [...], "total": 89, "facets": { "years": [...], "locations": [...] } }
Sharing
POST /v1/albums/{album_id}/share
Body: {
"type": "link", // link, users, public
"permissions": "view", // view, download
"password": null, // optional
"expires_at": "2024-03-22T00:00:00Z"
}
Response 201: { "share_url": "https://photos.example.com/s/abc123", "share_id": "share_abc" }
Key Decisions
- Presigned URLs for upload (bypass API servers for large binary uploads, directly to S3)
- Processing is async (upload returns immediately with “processing” status, WebSocket/polling for completion)
- Thumbnail URLs use CDN domain (served from edge, not origin)
4. Data Model (3 min)
Photos (PostgreSQL — relational metadata)
Table: photos
photo_id (PK) | uuid
user_id (FK) | uuid
title | varchar(200)
description | text
visibility | enum('private', 'shared', 'public')
s3_key_original | varchar(500)
s3_key_large | varchar(500) -- 2048px longest edge
s3_key_medium | varchar(500) -- 800px
s3_key_thumb | varchar(500) -- 200px
width | int
height | int
file_size_bytes | bigint
content_type | varchar(50)
taken_at | timestamp -- from EXIF
uploaded_at | timestamp
processing_status | enum('pending', 'processing', 'ready', 'failed')
Table: photo_exif
photo_id (FK) | uuid (PK)
camera_make | varchar(100)
camera_model | varchar(100)
lens | varchar(100)
focal_length_mm | decimal(6,1)
aperture | decimal(4,1)
shutter_speed | varchar(20)
iso | int
gps_lat | decimal(9,6)
gps_lng | decimal(9,6)
orientation | int
Table: photo_tags
photo_id (FK) | uuid
tag | varchar(50)
source | enum('user', 'ml') -- user-applied vs ML-generated
confidence | decimal(3,2) -- ML confidence score (NULL for user tags)
(composite PK: photo_id + tag + source)
Albums (PostgreSQL)
Table: albums
album_id (PK) | uuid
user_id (FK) | uuid
title | varchar(200)
description | text
cover_photo_id (FK) | uuid
visibility | enum('private', 'shared', 'public')
created_at | timestamp
Table: album_photos
album_id (FK) | uuid
photo_id (FK) | uuid
position | int
added_at | timestamp
(composite PK: album_id + photo_id)
Search Index (Elasticsearch)
Document per photo:
{
"photo_id": "photo_xyz",
"user_id": "user_123",
"title": "Beach Sunset",
"description": "Golden hour at Venice Beach",
"tags": ["sunset", "beach"],
"ml_labels": [
{ "label": "beach", "confidence": 0.97 },
{ "label": "sunset", "confidence": 0.95 },
{ "label": "ocean", "confidence": 0.91 }
],
"location": { "lat": 33.985, "lon": -118.473 },
"taken_at": "2024-02-20T18:30:00Z",
"camera": "Canon EOS R5",
"visibility": "public"
}
Why PostgreSQL + S3 + Elasticsearch?
- PostgreSQL: relational integrity for photos, albums, sharing permissions (ACID for ownership/access control)
- S3: object storage for image binaries (11 nines durability, unlimited scale)
- Elasticsearch: full-text search on titles/descriptions + structured search on tags/EXIF + geo-search on location
5. High-Level Design (12 min)
Architecture
Client (Web/Mobile App)
│
├→ Upload Flow:
│ POST /v1/photos/upload-url → API Server → generate presigned S3 URL
│ PUT presigned_url → S3 (direct upload, bypasses API servers)
│ POST /v1/photos → API Server → PostgreSQL (metadata)
│ → SQS/Kafka (photo-uploaded event)
│ → Image Processing Pipeline (async):
│ 1. Download original from S3
│ 2. Extract EXIF metadata → write to PostgreSQL
│ 3. Generate thumbnails (200px, 800px, 2048px) → upload to S3
│ 4. Run ML labeling (object detection, scene classification) → write tags
│ 5. Index in Elasticsearch
│ 6. Update processing_status = 'ready'
│ 7. Notify client via WebSocket / push
│
├→ Browse Flow:
│ GET /v1/photos → API Server → PostgreSQL (metadata)
│ Thumbnail images → CDN → S3 (origin)
│ Full-res images → CDN → S3 (origin)
│
└→ Search Flow:
GET /v1/search → API Server → Elasticsearch
→ Return photo_ids → fetch metadata from PostgreSQL (batch)
→ Return thumbnail URLs (CDN)
CDN Configuration:
Origin: S3 bucket (photos-{region})
Cache: thumbnails TTL 30 days, originals TTL 7 days
Edge: 200+ global POPs
Transform: on-the-fly resize at edge (for uncommon sizes)
Image Processing Pipeline (Detail)
S3 Event (ObjectCreated) → SQS Queue → Processing Workers (auto-scaling)
Worker steps (per photo):
1. Download original from S3 (stream, don't load fully into memory for large files)
2. EXIF extraction (libexif):
→ Camera make/model, lens, settings (aperture, ISO, shutter speed)
→ GPS coordinates → reverse geocode to "Venice Beach, CA"
→ Date/time taken
→ Orientation (rotation needed?)
→ Write to photo_exif table
3. Thumbnail generation (libvips — fast, low-memory):
→ Apply EXIF orientation (auto-rotate)
→ Generate 4 sizes:
thumb: 200px longest edge, JPEG quality 80, ~15KB
medium: 800px longest edge, JPEG quality 85, ~80KB
large: 2048px longest edge, JPEG quality 90, ~300KB
(original preserved as-is)
→ Strip EXIF from thumbnails (privacy: no GPS in shared thumbnails)
→ Upload to S3 with content-type and cache-control headers
→ Invalidate CDN cache for this photo (if re-processing)
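The longest-edge sizing above reduces to a small aspect-ratio calculation. A minimal sketch (the actual pixel work would be delegated to libvips; only the dimension math is shown):

```python
# Longest-edge thumbnail sizing for the 4 variants described above.
SIZES = {"thumb": 200, "medium": 800, "large": 2048}

def target_dims(width: int, height: int, longest_edge: int) -> tuple[int, int]:
    """Scale (width, height) so the longest edge equals `longest_edge`,
    preserving aspect ratio. Never upscale a smaller source."""
    scale = longest_edge / max(width, height)
    if scale >= 1.0:
        return width, height          # source already small enough
    return round(width * scale), round(height * scale)

# The 4000x3000 example photo from the API section:
print({name: target_dims(4000, 3000, edge) for name, edge in SIZES.items()})
# thumb -> (200, 150), medium -> (800, 600), large -> (2048, 1536)
```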
4. ML labeling (async, separate queue for GPU workers):
→ Object detection: "beach", "person", "dog", "car", ...
→ Scene classification: "sunset", "outdoor", "landscape"
→ Face detection: detect faces, cluster by identity (for "People" album)
→ OCR: detect text in photos (screenshots, documents)
→ Write labels with confidence scores to photo_tags table
5. Search indexing:
→ Build Elasticsearch document from metadata + tags + ML labels
→ Upsert into search index
6. Update status:
→ UPDATE photos SET processing_status = 'ready' WHERE photo_id = ...
→ Push WebSocket notification to client: "Your photo is ready"
Processing time: 2-5 seconds for thumbnails, 5-15 seconds for ML labeling
Auto-scaling: 50M photos/day ÷ 86,400 sec = 580/sec → need ~2,900 workers (at 5 sec each)
Peak: 1,160/sec → 5,800 workers (auto-scale with SQS queue depth)
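The worker-count arithmetic above is Little's law: concurrency ≈ arrival rate × per-item service time. As a sketch:

```python
import math

# Worker sizing via Little's law, using the estimates above
# (50M photos/day, ~5 seconds of processing each).
def workers_needed(items_per_day: float, seconds_per_item: float,
                   peak_multiplier: float = 1.0) -> int:
    rate = items_per_day / 86_400 * peak_multiplier   # items/sec
    return math.ceil(rate * seconds_per_item)

steady = workers_needed(50e6, 5)        # ~2,900 concurrent workers
peak = workers_needed(50e6, 5, 2.0)     # ~5,800 at 2x peak
```

In practice the auto-scaler would target SQS queue depth rather than computing this directly, but the formula gives the capacity ceiling to budget for.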
Components
- API Server: Handles authentication, metadata CRUD, presigned URL generation. Stateless, horizontally scaled.
- S3 (Object Storage): Stores original photos and thumbnails. Cross-region replication for durability. Lifecycle policies for cost optimization.
- Image Processing Workers: Stateless workers (ECS/Lambda) that generate thumbnails, extract EXIF, and trigger ML labeling. Auto-scale based on queue depth.
- ML Labeling Service: GPU-powered workers running image classification and object detection models. Can be a managed service (AWS Rekognition) or self-hosted (PyTorch models on GPU instances).
- CDN: CloudFront or similar. Serves all image requests. Thumbnails cached aggressively (30-day TTL). Image transformations at edge for uncommon sizes.
- Elasticsearch: Full-text + structured search index. Updated async after upload processing.
- PostgreSQL: Source of truth for photo metadata, albums, users, permissions.
- Notification Service: WebSocket + push notifications for upload completion, share invitations.
6. Deep Dives (15 min)
Deep Dive 1: Image Upload Pipeline and Processing at Scale
The problem: 50M photos uploaded per day. Each needs thumbnail generation (CPU-bound), EXIF extraction, and ML labeling (GPU-bound). The pipeline must be fast (< 15 seconds end-to-end), reliable (no photo lost), and cost-efficient.
Presigned URL upload pattern:
Why not upload through our API servers?
- 5MB average × 580/sec = 2.9 GB/sec of upload bandwidth
- API servers would need massive network bandwidth just for pass-through
- Presigned URLs: client uploads directly to S3, API servers handle only metadata (KB-sized requests)
- S3 handles the heavy lifting: multi-part upload, checksums, retries
Flow:
Client → API Server: "I want to upload sunset.jpg (5MB)"
API Server → S3: generate presigned PUT URL (valid 1 hour)
API Server → Client: presigned URL
Client → S3: PUT binary data directly
S3 → SQS: ObjectCreated event (trigger processing)
Client → API Server: POST /v1/photos with metadata
Processing pipeline reliability:
Problem: What if a processing worker crashes mid-thumbnail-generation?
Solution: SQS with visibility timeout and dead-letter queue (DLQ)
1. Photo uploaded → SQS message created
2. Worker picks up message (visibility timeout: 5 minutes)
3. Worker processes photo → deletes message on success
4. If worker crashes → message becomes visible again after 5 minutes → another worker picks it up
5. If processing fails 3 times → message moves to DLQ → human investigation
Idempotency: processing is idempotent (re-processing the same photo is safe — overwrites thumbnails in S3)
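The redelivery-then-DLQ behavior can be simulated in a few lines — an in-memory stand-in for SQS's visibility timeout and redrive policy:

```python
MAX_RECEIVES = 3   # matches the "fails 3 times -> DLQ" policy above

def run_pipeline(photo_ids, process):
    """Deliver each message until `process` succeeds or it has been
    received MAX_RECEIVES times; exhausted messages go to the DLQ."""
    queue = [(pid, 0) for pid in photo_ids]
    done, dlq = [], []
    while queue:
        pid, receives = queue.pop(0)
        receives += 1
        try:
            process(pid)
            done.append(pid)                   # success: delete the message
        except Exception:
            if receives >= MAX_RECEIVES:
                dlq.append(pid)                # give up: dead-letter queue
            else:
                queue.append((pid, receives))  # redelivered after timeout
    return done, dlq

def flaky_worker(photo_id):
    if photo_id == "photo_bad":
        raise RuntimeError("corrupt image")    # a poison message

done, dlq = run_pipeline(["photo_a", "photo_bad", "photo_b"], flaky_worker)
# done == ["photo_a", "photo_b"], dlq == ["photo_bad"]
```

Because processing is idempotent, the redelivery path needs no special casing — a half-finished photo is simply reprocessed from scratch.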
Cost optimization for ML labeling:
ML labeling is expensive (GPU instances or API calls)
- Option A: AWS Rekognition — ~$1/1,000 images = $50K/day (≈$1.5M/month) at 50M photos/day
- Option B: Self-hosted (PyTorch on GPU instances):
- g5.xlarge: $1.006/hour, processes ~200 images/sec
- 50M/day ÷ 200/sec = 250K GPU-seconds ≈ 70 GPU-hours/day = ~$70/day = ~$2,100/month
- Orders of magnitude cheaper at this volume, but requires ML engineering to build and maintain models
Recommendation: self-hosted for core labels (10 most common categories),
Rekognition for advanced features (face clustering, OCR, explicit content detection)
Deep Dive 2: Search by ML-Generated Labels and Visual Similarity
The problem: Users want to search “beach sunset” and find all their photos matching that description, even if they never tagged them. Requires ML-powered semantic understanding of photo content.
ML labeling pipeline:
Model stack (per photo):
1. Object Detection (YOLO v8 or similar):
→ Detects: "person", "dog", "car", "tree", "building", ...
→ Output: labels with bounding boxes and confidence scores
→ Keep labels with confidence > 0.7
2. Scene Classification (ResNet-50 or EfficientNet):
→ Classifies: "beach", "mountain", "restaurant", "office", ...
→ Output: top-5 scenes with confidence
→ Keep labels with confidence > 0.6
3. Face Detection + Clustering:
→ Detect faces → compute face embeddings (128-dim vector)
→ Cluster faces across user's library (same person = same cluster)
→ User names clusters → "Photos of Sarah"
4. Image Embedding (CLIP model):
→ Generate 512-dim embedding vector for each photo
→ Enables: "find photos similar to this one"
→ Enables: text-to-image search ("sunset over water" → find matching photos)
Search implementation:
Text query: "beach sunset California"
1. Keyword search (Elasticsearch):
→ Match against: title, description, user tags, ML labels
→ BM25 scoring on text fields
→ Geo-filter if location is specified (reverse geocoded from EXIF GPS)
2. Semantic search (CLIP embeddings):
→ Encode query text → CLIP text embedding (512-dim)
→ Query vector DB (Milvus/Pinecone): find photos with most similar image embeddings
→ Return top-100 by cosine similarity
3. Hybrid ranking:
→ Merge results from keyword and semantic search
→ Re-rank by: text_score * 0.4 + semantic_score * 0.4 + recency * 0.1 + engagement * 0.1
→ Return top-20
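A sketch of the hybrid re-ranking step with those weights. Signal names and normalization of each score to [0, 1] are assumed to happen upstream:

```python
# Weighted merge of keyword, semantic, recency, and engagement signals.
WEIGHTS = {"text": 0.4, "semantic": 0.4, "recency": 0.1, "engagement": 0.1}

def hybrid_rank(candidates: dict[str, dict[str, float]], top_k: int = 20):
    """candidates: photo_id -> partial scores in [0, 1].
    A photo missing a signal (e.g. found only by CLIP) scores 0 for it."""
    def score(signals):
        return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)
    ranked = sorted(candidates.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [photo_id for photo_id, _ in ranked[:top_k]]

results = hybrid_rank({
    "photo_keyword_only": {"text": 0.9},                   # strong BM25 hit only
    "photo_both":         {"text": 0.7, "semantic": 0.8},  # matched both ways
    "photo_semantic":     {"semantic": 0.9, "recency": 1.0},
})
# A photo matched by both retrievers outranks a stronger single-signal hit.
```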
Example results for "beach sunset California":
1. Photo with title "Venice Beach Sunset" (keyword match) + ML label "beach, sunset" (semantic match)
2. Photo with no title but ML labels "coastline, golden_hour" and GPS in Malibu (semantic match + geo)
3. Photo tagged "Santa Monica pier" with sunset in the image (partial keyword + semantic)
Face search (“Photos of Sarah”):
User identifies Sarah in one photo → face embedding stored
Search: find all photos where Sarah's face appears
→ Query face embedding DB (approximate nearest neighbor search)
→ Return photos where face embedding similarity > 0.85
Performance: 100M users × avg 5 named faces = 500M embeddings; at 128 dims × 4 bytes (float32) that is ~256GB, or ~64GB with int8 quantization
→ Fits in Milvus/Pinecone, sub-100ms query time
→ Per-user isolation: each user's faces are a separate namespace
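The face query reduces to nearest-neighbor search over embeddings. A brute-force sketch with toy 3-dim vectors (production would use 128-dim embeddings in an ANN index like Milvus, scoped to the user's namespace):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_photos_of(query_embedding, photo_faces, threshold=0.85):
    """photo_faces: list of (photo_id, face_embedding) for one user.
    Returns photo_ids with a face clearing the similarity threshold."""
    return sorted({pid for pid, emb in photo_faces
                   if cosine(query_embedding, emb) > threshold})

sarah = [1.0, 0.0, 0.2]                      # toy embedding of the named face
faces = [("photo_1", [0.9, 0.1, 0.25]),      # Sarah, different angle
         ("photo_2", [0.0, 1.0, 0.0]),       # someone else
         ("photo_3", [1.0, 0.05, 0.18])]     # Sarah again
print(find_photos_of(sarah, faces))          # ['photo_1', 'photo_3']
```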
Deep Dive 3: CDN Delivery and Image Optimization
The problem: 500M photo views/day, mostly thumbnails. Must load fast on any device (desktop, mobile, slow 3G). CDN must be efficient (high cache hit rate) while supporting multiple image sizes and formats.
Multi-format delivery:
Modern browsers support WebP (30% smaller than JPEG) and AVIF (50% smaller).
But older browsers only support JPEG.
Strategy: Content negotiation at CDN edge
Client sends: Accept: image/avif, image/webp, image/jpeg
CDN edge:
1. Check if AVIF variant exists in cache → serve
2. Check if WebP variant exists in cache → serve
3. Serve JPEG (always available)
Pre-generation: generate WebP and AVIF during upload processing
Original JPEG: 80KB (thumbnail)
WebP: 56KB (-30%)
AVIF: 40KB (-50%)
Annual savings: 500M views/day × 40KB savings × 365 = 7.3PB less CDN egress/year
At $0.05/GB: $365K/year savings
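The edge negotiation step can be sketched as a preference walk over the `Accept` header (wildcard matching and quality values are ignored for brevity):

```python
# Pick the smallest format the client accepts; JPEG is always available.
FORMAT_PREFERENCE = ["image/avif", "image/webp", "image/jpeg"]  # best first

def negotiate_format(accept_header: str, available: set[str]) -> str:
    """available: variants already generated for this photo."""
    accepted = {part.split(";")[0].strip()       # drop ";q=..." parameters
                for part in accept_header.split(",")}
    for fmt in FORMAT_PREFERENCE:
        if fmt in accepted and fmt in available:
            return fmt
    return "image/jpeg"   # universal fallback

# Modern browser gets AVIF:
negotiate_format("image/avif,image/webp,image/*;q=0.8",
                 {"image/avif", "image/webp", "image/jpeg"})  # -> "image/avif"
# Old browser falls back to JPEG:
negotiate_format("*/*", {"image/avif", "image/jpeg"})         # -> "image/jpeg"
```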
Responsive image sizing:
Pre-generated sizes: 200px, 800px, 2048px + original
But what if a device needs 400px? (e.g., 2x retina mobile display showing 200px container)
Two approaches:
A) Pre-generate more sizes: 200, 400, 600, 800, 1200, 2048
Pro: simple, cacheable
Con: 6 sizes × 3 formats × 50M photos/day = 900M thumbnails/day (storage intensive)
B) On-the-fly resize at CDN edge (Cloudflare Image Resizing, CloudFront Lambda@Edge):
URL: cdn.example.com/photos/photo_xyz?w=400&format=auto
Edge worker: fetch nearest pre-generated size (800px), resize to 400px, cache result
Pro: infinite sizes, no pre-generation
Con: first request is slower (resize adds 50-100ms), edge compute cost
Recommendation: Pre-generate 4 sizes (200, 800, 2048, original) + on-the-fly for edge cases
→ Covers 90% of requests with pre-generated sizes
→ On-the-fly handles the long tail of unusual sizes
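Source-size selection for the hybrid approach, as a sketch — serve an exact pre-generated match from cache, otherwise downscale from the next size up at the edge, and never upscale:

```python
# Pre-generated longest-edge sizes, smallest first.
PREGENERATED = [200, 800, 2048]

def pick_source(requested_px: int) -> tuple[int, bool]:
    """Return (source_size, needs_edge_resize) for a requested width."""
    for size in PREGENERATED:
        if size == requested_px:
            return size, False        # exact hit: no edge compute
        if size > requested_px:
            return size, True         # downscale this variant at the edge
    return PREGENERATED[-1], False    # larger than 2048: serve 2048 as-is

pick_source(200)    # -> (200, False): the common case, pure cache hit
pick_source(400)    # -> (800, True): 2x-retina long tail, resized at edge
pick_source(4000)   # -> (2048, False): never upscale
```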
Lazy loading and progressive rendering:
Photo grid (infinite scroll):
1. Load low-quality placeholder first:
→ 20px wide blurred thumbnail (< 1KB), inline as base64 in HTML
→ Renders instantly (no network request)
2. Intersection Observer: when thumbnail enters viewport:
→ Load full thumbnail (200px, ~15KB)
→ Fade in over 200ms (smooth transition from blur to sharp)
3. User clicks photo → load 800px medium image
4. User zooms → load 2048px or original
Result: initial page load shows 50 blurred placeholders instantly,
then sharpens as thumbnails load (perceptually fast)
Total data for initial grid of 50 photos:
Inline placeholders: 50 × 1KB = 50KB (in HTML)
Visible thumbnails (20 on screen): 20 × 15KB = 300KB
Total: 350KB for initial paint — loads in < 1 second on 3G
7. Extensions (2 min)
- Collaborative albums: Multiple users contribute to a shared album. Requires: real-time updates (new photo appears for all viewers), conflict resolution (two users reorder simultaneously), and granular permissions (some can add/remove, others view-only).
- Photo editing: Basic in-browser editing (crop, rotate, filters, adjust brightness/contrast). Non-destructive editing: store edits as a list of operations applied to the original. Revert to original anytime.
- Memories / “On This Day”: Surface photos from the same date in previous years. Requires: date indexing, ML-based “interesting photo” scoring (skip duplicates and low-quality shots). Push notification: “You have memories from 3 years ago.”
- Storage tiering and cost optimization: Hot storage (S3 Standard) for photos accessed in the last 90 days. Warm (S3 Standard-IA) for 90 days to 1 year. Cold (Glacier) for older photos. Auto-tier based on access frequency. Transparent to users — accessing a cold photo takes a few seconds extra (show a loading indicator).
- Print and merchandise: Let users order physical prints, photo books, calendars, or canvas prints. Integration with print fulfillment partners via API. Requires: color-accurate rendering previews and high-resolution image export with proper DPI.