1. Requirements & Scope (5 min)

Functional Requirements

  1. Users can upload photos (up to 50MB each) with the system generating multiple thumbnail sizes and extracting EXIF metadata automatically
  2. Organize photos into albums, add tags, and manage sharing permissions (private, shared with specific users, public)
  3. Search photos by metadata (date, location, camera), user-applied tags, and ML-generated labels (objects, faces, scenes)
  4. Browse photo feeds (own photos, shared albums, explore/discover) with infinite scroll and fast thumbnail loading
  5. Share photos and albums via links with configurable permissions (view-only, download allowed, password-protected)

Non-Functional Requirements

  • Availability: 99.99% for viewing photos (reads). 99.9% for uploads (briefly queuing uploads is acceptable).
  • Latency: Thumbnail loading < 100ms. Full-resolution photo < 500ms. Upload acknowledgment < 2 seconds (processing happens async).
  • Consistency: Photo metadata must be strongly consistent (if you upload and refresh, your photo must appear). Search index can lag by 30 seconds.
  • Scale: 100M users, 5M daily active, 50M photo uploads/day, 500M photos viewed/day, 10PB total stored photos.
  • Durability: 99.999999999% (11 nines) for original photos. Losing a user’s photo is unacceptable.

2. Estimation (3 min)

Traffic

  • Uploads: 50M photos/day = ~580 uploads/sec (peak 2x = 1,160/sec)
  • Photo views (thumbnails): 500M/day = ~5,800/sec (peak 3x = 17,400/sec)
  • Full-resolution views: 50M/day = ~580/sec
  • Search queries: 5M DAU × 2 searches/day = 10M/day = ~115 QPS
  • Upload bandwidth: 580/sec × 5MB avg = 2.9 GB/sec inbound

Storage

  • Photos (originals): 50M/day × 5MB avg × 365 days = ~91PB/year gross (after dedup: ~50PB/year net)
  • Thumbnails (4 sizes per photo): 50M/day × 4 × 50KB avg = 10TB/day = 3.6PB/year
  • Metadata: 50M/day × 2KB = 100GB/day = 36TB/year
  • Total stored: ~10PB currently (growing 50PB/year)
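
These figures follow from simple arithmetic; a quick sanity check in Python (decimal units throughout, all inputs taken straight from the requirements):

```python
# Back-of-envelope check of the traffic and storage estimates above.
SECONDS_PER_DAY = 86_400

uploads_per_day = 50_000_000
views_per_day = 500_000_000
avg_photo_mb = 5        # average original size
thumb_kb = 50           # average derived-image size
thumb_sizes = 4         # sizes generated per photo

uploads_per_sec = uploads_per_day / SECONDS_PER_DAY            # ≈ 580/sec
views_per_sec = views_per_day / SECONDS_PER_DAY                # ≈ 5,800/sec
upload_bandwidth_gb_s = uploads_per_sec * avg_photo_mb / 1000  # ≈ 2.9 GB/sec
originals_pb_year = uploads_per_day * avg_photo_mb * 365 / 1e9 # ≈ 91 PB/year
thumbs_tb_day = uploads_per_day * thumb_sizes * thumb_kb / 1e9 # ≈ 10 TB/day
```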

Cost Insight

  • Storage dominates cost: 10PB in S3 = ~$230K/month
  • CDN egress for thumbnails: 500M views × 50KB = 25TB/day = ~$37K/month
  • ML processing (image labeling): 50M photos/day × $0.001/photo = $50K/day (~$1.5M/month at list price), which motivates self-hosting (see Deep Dive 1)

Key Insight

This is a write-heavy, storage-intensive system. The upload pipeline (ingest, process, store, index) is the most complex component. The read path is CDN-dominated (thumbnails served from edge). The most interesting engineering challenges are the image processing pipeline and ML-powered search.


3. API Design (3 min)

Photo Upload

// Step 1: Request upload URL (presigned)
POST /v1/photos/upload-url
  Body: { "filename": "sunset.jpg", "content_type": "image/jpeg", "size_bytes": 4500000 }
  Response 200: {
    "upload_id": "upl_abc",
    "presigned_url": "https://upload.s3.amazonaws.com/...",
    "expires_at": "2024-02-22T19:00:00Z"
  }

// Step 2: Client uploads directly to S3 via presigned URL
PUT {presigned_url}
  Headers: Content-Type: image/jpeg
  Body: <binary image data>

// Step 3: Confirm upload and set metadata
POST /v1/photos
  Body: {
    "upload_id": "upl_abc",
    "title": "Beach Sunset",
    "description": "Golden hour at Venice Beach",
    "album_id": "album_123",           // optional
    "tags": ["sunset", "beach"],
    "visibility": "private"            // private, shared, public
  }
  Response 201: {
    "photo_id": "photo_xyz",
    "status": "processing",            // thumbnails being generated
    "urls": {
      "original": null,                 // available after processing
      "large": null,
      "medium": null,
      "thumb": null
    }
  }

Photo Browsing

GET /v1/photos?album_id=album_123&page=1&per_page=50
  Response: {
    "photos": [
      { "photo_id": "photo_xyz", "title": "Beach Sunset",
        "urls": { "thumb": "https://cdn.example.com/t/photo_xyz_200.jpg",
                  "medium": "https://cdn.example.com/t/photo_xyz_800.jpg" },
        "width": 4000, "height": 3000, "taken_at": "2024-02-20T18:30:00Z",
        "tags": ["sunset", "beach"], "ml_labels": ["beach", "sunset", "ocean", "sky"] }
    ],
    "total": 342, "next_page": 2
  }

GET /v1/photos/{photo_id}
  Response: { full metadata + all size URLs + EXIF data + sharing info }
GET /v1/search/photos?q=sunset+beach&date_from=2024-01-01&location=california&page=1
  Response: { "results": [...], "total": 89, "facets": { "years": [...], "locations": [...] } }

Sharing

POST /v1/albums/{album_id}/share
  Body: {
    "type": "link",                     // link, users, public
    "permissions": "view",              // view, download
    "password": null,                   // optional
    "expires_at": "2024-03-22T00:00:00Z"
  }
  Response 201: { "share_url": "https://photos.example.com/s/abc123", "share_id": "share_abc" }

Key Decisions

  • Presigned URLs for upload (bypass API servers for large binary uploads, directly to S3)
  • Processing is async (upload returns immediately with “processing” status, WebSocket/polling for completion)
  • Thumbnail URLs use CDN domain (served from edge, not origin)

4. Data Model (3 min)

Photos (PostgreSQL — relational metadata)

Table: photos
  photo_id         (PK) | uuid
  user_id          (FK) | uuid
  title                 | varchar(200)
  description           | text
  visibility            | enum('private', 'shared', 'public')
  s3_key_original       | varchar(500)
  s3_key_large          | varchar(500)   -- 2048px longest edge
  s3_key_medium         | varchar(500)   -- 800px
  s3_key_thumb          | varchar(500)   -- 200px
  width                 | int
  height                | int
  file_size_bytes       | bigint
  content_type          | varchar(50)
  taken_at              | timestamp       -- from EXIF
  uploaded_at           | timestamp
  processing_status     | enum('pending', 'processing', 'ready', 'failed')

Table: photo_exif
  photo_id         (FK) | uuid (PK)
  camera_make           | varchar(100)
  camera_model          | varchar(100)
  lens                  | varchar(100)
  focal_length_mm       | decimal(6,1)
  aperture              | decimal(4,1)
  shutter_speed         | varchar(20)
  iso                   | int
  gps_lat               | decimal(9,6)
  gps_lng               | decimal(9,6)
  orientation           | int

Table: photo_tags
  photo_id         (FK) | uuid
  tag                   | varchar(50)
  source                | enum('user', 'ml')       -- user-applied vs ML-generated
  confidence            | decimal(3,2)              -- ML confidence score (NULL for user tags)
  (composite PK: photo_id + tag + source)

Albums (PostgreSQL)

Table: albums
  album_id         (PK) | uuid
  user_id          (FK) | uuid
  title                 | varchar(200)
  description           | text
  cover_photo_id   (FK) | uuid
  visibility            | enum('private', 'shared', 'public')
  created_at            | timestamp

Table: album_photos
  album_id         (FK) | uuid
  photo_id         (FK) | uuid
  position              | int
  added_at              | timestamp
  (composite PK: album_id + photo_id)

Search Index (Elasticsearch)

Document per photo:
{
  "photo_id": "photo_xyz",
  "user_id": "user_123",
  "title": "Beach Sunset",
  "description": "Golden hour at Venice Beach",
  "tags": ["sunset", "beach"],
  "ml_labels": [
    { "label": "beach", "confidence": 0.97 },
    { "label": "sunset", "confidence": 0.95 },
    { "label": "ocean", "confidence": 0.91 }
  ],
  "location": { "lat": 33.985, "lon": -118.473 },
  "taken_at": "2024-02-20T18:30:00Z",
  "camera": "Canon EOS R5",
  "visibility": "public"
}

Why PostgreSQL + S3 + Elasticsearch?

  • PostgreSQL: relational integrity for photos, albums, sharing permissions (ACID for ownership/access control)
  • S3: object storage for image binaries (11 nines durability, unlimited scale)
  • Elasticsearch: full-text search on titles/descriptions + structured search on tags/EXIF + geo-search on location

5. High-Level Design (12 min)

Architecture

Client (Web/Mobile App)
  │
  ├→ Upload Flow:
  │    POST /v1/photos/upload-url → API Server → generate presigned S3 URL
  │    PUT presigned_url → S3 (direct upload, bypasses API servers)
  │    POST /v1/photos → API Server → PostgreSQL (metadata)
  │      → SQS/Kafka (photo-uploaded event)
  │        → Image Processing Pipeline (async):
  │          1. Download original from S3
  │          2. Extract EXIF metadata → write to PostgreSQL
  │          3. Generate thumbnails (200px, 800px, 2048px) → upload to S3
  │          4. Run ML labeling (object detection, scene classification) → write tags
  │          5. Index in Elasticsearch
  │          6. Update processing_status = 'ready'
  │          7. Notify client via WebSocket / push
  │
  ├→ Browse Flow:
  │    GET /v1/photos → API Server → PostgreSQL (metadata)
  │    Thumbnail images → CDN → S3 (origin)
  │    Full-res images → CDN → S3 (origin)
  │
  └→ Search Flow:
       GET /v1/search → API Server → Elasticsearch
         → Return photo_ids → fetch metadata from PostgreSQL (batch)
         → Return thumbnail URLs (CDN)

CDN Configuration:
  Origin: S3 bucket (photos-{region})
  Cache: thumbnails TTL 30 days, originals TTL 7 days
  Edge: 200+ global POPs
  Transform: on-the-fly resize at edge (for uncommon sizes)

Image Processing Pipeline (Detail)

S3 Event (ObjectCreated) → SQS Queue → Processing Workers (auto-scaling)

Worker steps (per photo):
  1. Download original from S3 (stream, don't load fully into memory for large files)

  2. EXIF extraction (libexif):
     → Camera make/model, lens, settings (aperture, ISO, shutter speed)
     → GPS coordinates → reverse geocode to "Venice Beach, CA"
     → Date/time taken
     → Orientation (rotation needed?)
     → Write to photo_exif table

  3. Thumbnail generation (libvips — fast, low-memory):
     → Apply EXIF orientation (auto-rotate)
     → Generate 4 sizes:
        thumb:  200px longest edge, JPEG quality 80, ~15KB
        medium: 800px longest edge, JPEG quality 85, ~80KB
        large:  2048px longest edge, JPEG quality 90, ~300KB
        (original preserved as-is)
     → Strip EXIF from thumbnails (privacy: no GPS in shared thumbnails)
     → Upload to S3 with content-type and cache-control headers
     → Invalidate CDN cache for this photo (if re-processing)

  4. ML labeling (async, separate queue for GPU workers):
     → Object detection: "beach", "person", "dog", "car", ...
     → Scene classification: "sunset", "outdoor", "landscape"
     → Face detection: detect faces, cluster by identity (for "People" album)
     → OCR: detect text in photos (screenshots, documents)
     → Write labels with confidence scores to photo_tags table

  5. Search indexing:
     → Build Elasticsearch document from metadata + tags + ML labels
     → Upsert into search index

  6. Update status:
     → UPDATE photos SET processing_status = 'ready' WHERE photo_id = ...
     → Push WebSocket notification to client: "Your photo is ready"

Processing time: 2-5 seconds for thumbnails, 5-15 seconds for ML labeling
Auto-scaling: 50M photos/day ÷ 86,400 sec = 580/sec → need ~2,900 workers (at 5 sec each)
  Peak: 1,160/sec → 5,800 workers (auto-scale with SQS queue depth)
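
The worker-count math above is Little's law (required concurrency = arrival rate × service time). A tiny helper, with hypothetical names, makes the steady-state and peak sizing explicit:

```python
import math

def required_workers(arrivals_per_sec: float, seconds_per_job: float) -> int:
    """Little's law: concurrent workers needed to keep queue depth stable."""
    return math.ceil(arrivals_per_sec * seconds_per_job)

# Numbers from the pipeline sizing above (5 sec/photo for thumbnails):
assert required_workers(580, 5) == 2_900    # steady state
assert required_workers(1_160, 5) == 5_800  # 2x peak
```

In practice the autoscaler would not compute this directly; it scales on SQS queue depth, which converges to the same answer.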

Components

  1. API Server: Handles authentication, metadata CRUD, presigned URL generation. Stateless, horizontally scaled.
  2. S3 (Object Storage): Stores original photos and thumbnails. Cross-region replication for durability. Lifecycle policies for cost optimization.
  3. Image Processing Workers: Stateless workers (ECS/Lambda) that generate thumbnails, extract EXIF, and trigger ML labeling. Auto-scale based on queue depth.
  4. ML Labeling Service: GPU-powered workers running image classification and object detection models. Can be a managed service (AWS Rekognition) or self-hosted (PyTorch models on GPU instances).
  5. CDN: CloudFront or similar. Serves all image requests. Thumbnails cached aggressively (30-day TTL). Image transformations at edge for uncommon sizes.
  6. Elasticsearch: Full-text + structured search index. Updated async after upload processing.
  7. PostgreSQL: Source of truth for photo metadata, albums, users, permissions.
  8. Notification Service: WebSocket + push notifications for upload completion, share invitations.

6. Deep Dives (15 min)

Deep Dive 1: Image Upload Pipeline and Processing at Scale

The problem: 50M photos uploaded per day. Each needs thumbnail generation (CPU-bound), EXIF extraction, and ML labeling (GPU-bound). The pipeline must be fast (< 15 seconds end-to-end), reliable (no photo lost), and cost-efficient.

Presigned URL upload pattern:

Why not upload through our API servers?
  - 5MB average × 580/sec = 2.9 GB/sec of upload bandwidth
  - API servers would need massive network bandwidth just for pass-through
  - Presigned URLs: client uploads directly to S3, API servers handle only metadata (KB-sized requests)
  - S3 handles the heavy lifting: multi-part upload, checksums, retries

Flow:
  Client → API Server: "I want to upload sunset.jpg (5MB)"
  API Server → S3: generate presigned PUT URL (valid 1 hour)
  API Server → Client: presigned URL
  Client → S3: PUT binary data directly
  S3 → SQS: ObjectCreated event (trigger processing)
  Client → API Server: POST /v1/photos with metadata
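
A minimal in-memory sketch of the three-step handshake (hypothetical names throughout; a real service would call S3's generate_presigned_url for step 1 and trust the ObjectCreated event, not the client, for step 2):

```python
import uuid
from datetime import datetime, timedelta, timezone

# upload_id -> session state; stands in for a database table
_sessions: dict = {}

def request_upload_url(filename: str, size_bytes: int) -> dict:
    """Step 1: open an upload session and issue a (placeholder) presigned URL."""
    upload_id = f"upl_{uuid.uuid4().hex[:8]}"
    _sessions[upload_id] = {"filename": filename, "size": size_bytes, "uploaded": False}
    return {
        "upload_id": upload_id,
        "presigned_url": f"https://upload.example.com/{upload_id}",  # placeholder
        "expires_at": (datetime.now(timezone.utc) + timedelta(hours=1)).isoformat(),
    }

def mark_uploaded(upload_id: str) -> None:
    """Step 2: in production this is the S3 ObjectCreated event, not a client call."""
    _sessions[upload_id]["uploaded"] = True

def confirm_photo(upload_id: str, title: str) -> dict:
    """Step 3: create the metadata row; thumbnailing/ML happen async."""
    if not _sessions.get(upload_id, {}).get("uploaded"):
        raise ValueError("binary not uploaded yet")
    return {"photo_id": f"photo_{upload_id[4:]}", "status": "processing", "title": title}
```

Note that confirming before the binary lands is rejected, which closes the race between steps 2 and 3.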

Processing pipeline reliability:

Problem: What if a processing worker crashes mid-thumbnail-generation?
Solution: SQS with visibility timeout and dead-letter queue (DLQ)

  1. Photo uploaded → SQS message created
  2. Worker picks up message (visibility timeout: 5 minutes)
  3. Worker processes photo → deletes message on success
  4. If worker crashes → message becomes visible again after 5 minutes → another worker picks it up
  5. If processing fails 3 times → message moves to DLQ → human investigation

Idempotency: processing is idempotent (re-processing the same photo is safe — overwrites thumbnails in S3)
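
The retry semantics can be modeled with a toy queue (illustrative only: real SQS tracks ApproximateReceiveCount, and a redrive policy performs the DLQ move automatically):

```python
from typing import Optional

class RetryQueue:
    """Toy model of SQS redelivery + dead-letter queue after max receives."""
    def __init__(self, max_receives: int = 3):
        self.max_receives = max_receives
        self.messages: list = []
        self.dlq: list = []

    def send(self, body: str) -> None:
        self.messages.append({"body": body, "receives": 0})

    def receive(self) -> Optional[dict]:
        if not self.messages:
            return None
        msg = self.messages.pop(0)
        msg["receives"] += 1
        return msg

    def nack(self, msg: dict) -> None:
        """Worker crashed / visibility timeout expired: redeliver or dead-letter."""
        if msg["receives"] >= self.max_receives:
            self.dlq.append(msg)       # -> human investigation
        else:
            self.messages.append(msg)  # becomes visible again

q = RetryQueue(max_receives=3)
q.send("photo_xyz")
for _ in range(3):                     # processing fails three times
    q.nack(q.receive())
assert [m["body"] for m in q.dlq] == ["photo_xyz"]
```

A successful worker simply never calls nack (it deletes the message), so the happy path leaves both queues empty.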

Cost optimization for ML labeling:

ML labeling is expensive (GPU instances or API calls)
  - Option A: AWS Rekognition — $1/1,000 images × 50M photos/day = $50K/day ≈ $1.5M/month
  - Option B: Self-hosted (PyTorch on GPU instances):
    - g5.xlarge: $1.006/hour, processes ~200 images/sec
    - 50M/day ÷ 200/sec = 250K seconds = ~70 GPU-hours/day = $70/day = ~$2,100/month
    - ~700x cheaper but requires ML engineering to maintain models

  Recommendation: self-hosted for core labels (10 most common categories),
  Rekognition for advanced features (face clustering, OCR, explicit content detection)

Deep Dive 2: Search by ML-Generated Labels and Visual Similarity

The problem: Users want to search “beach sunset” and find all their photos matching that description, even if they never tagged them. Requires ML-powered semantic understanding of photo content.

ML labeling pipeline:

Model stack (per photo):
  1. Object Detection (YOLO v8 or similar):
     → Detects: "person", "dog", "car", "tree", "building", ...
     → Output: labels with bounding boxes and confidence scores
     → Keep labels with confidence > 0.7

  2. Scene Classification (ResNet-50 or EfficientNet):
     → Classifies: "beach", "mountain", "restaurant", "office", ...
     → Output: top-5 scenes with confidence
     → Keep labels with confidence > 0.6

  3. Face Detection + Clustering:
     → Detect faces → compute face embeddings (128-dim vector)
     → Cluster faces across user's library (same person = same cluster)
     → User names clusters → "Photos of Sarah"

  4. Image Embedding (CLIP model):
     → Generate 512-dim embedding vector for each photo
     → Enables: "find photos similar to this one"
     → Enables: text-to-image search ("sunset over water" → find matching photos)

Search implementation:

Text query: "beach sunset California"

1. Keyword search (Elasticsearch):
   → Match against: title, description, user tags, ML labels
   → BM25 scoring on text fields
   → Geo-filter if location is specified (reverse geocoded from EXIF GPS)

2. Semantic search (CLIP embeddings):
   → Encode query text → CLIP text embedding (512-dim)
   → Query vector DB (Milvus/Pinecone): find photos with most similar image embeddings
   → Return top-100 by cosine similarity

3. Hybrid ranking:
   → Merge results from keyword and semantic search
   → Re-rank by: text_score * 0.4 + semantic_score * 0.4 + recency * 0.1 + engagement * 0.1
   → Return top-20
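
The merge-and-re-rank step above can be sketched directly with the stated weights (assumes each signal is pre-normalized to [0, 1]; names are my own):

```python
def hybrid_rank(keyword: dict, semantic: dict, recency: dict, engagement: dict,
                k: int = 20) -> list:
    """Blend per-photo scores from both retrieval paths; missing scores are 0."""
    ids = set(keyword) | set(semantic)
    scored = {
        pid: 0.4 * keyword.get(pid, 0.0)
             + 0.4 * semantic.get(pid, 0.0)
             + 0.1 * recency.get(pid, 0.0)
             + 0.1 * engagement.get(pid, 0.0)
        for pid in ids
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]

# A strong keyword-only hit still outranks a weak semantic-only hit:
assert hybrid_rank({"a": 0.9}, {"b": 0.5}, {}, {}) == ["a", "b"]
```

Photos found by both paths get both contributions, which is what pushes "Venice Beach Sunset" to the top of the example results below.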

Example results for "beach sunset California":
  1. Photo with title "Venice Beach Sunset" (keyword match) + ML label "beach, sunset" (semantic match)
  2. Photo with no title but ML labels "coastline, golden_hour" and GPS in Malibu (semantic match + geo)
  3. Photo tagged "Santa Monica pier" with sunset in the image (partial keyword + semantic)

Face search (“Photos of Sarah”):

User identifies Sarah in one photo → face embedding stored
Search: find all photos where Sarah's face appears
  → Query face embedding DB (approximate nearest neighbor search)
  → Return photos where face embedding similarity > 0.85

Performance: 100M users × avg 5 named faces × 128-dim float32 embedding ≈ 256GB (~64GB if quantized to 1 byte/dim)
  → Fits in Milvus/Pinecone, sub-100ms query time
  → Per-user isolation: each user's faces are a separate namespace
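
Face matching reduces to a cosine-similarity threshold over embeddings. A pure-Python sketch of the query (toy 3-dim vectors; a real system would run this through an ANN index like Milvus, scoped to the user's namespace):

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_matches(query: list, faces: dict, threshold: float = 0.85) -> list:
    """Return photo_ids whose face embedding is close enough to the query."""
    return [pid for pid, emb in faces.items() if cosine(query, emb) > threshold]

sarah = [1.0, 0.0, 0.2]  # embedding from the photo where the user named Sarah
faces = {"photo_1": [0.9, 0.1, 0.25],   # same person, slightly different shot
         "photo_2": [0.0, 1.0, 0.0]}    # someone else
assert find_matches(sarah, faces) == ["photo_1"]
```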

Deep Dive 3: CDN Delivery and Image Optimization

The problem: 500M photo views/day, mostly thumbnails. Must load fast on any device (desktop, mobile, slow 3G). CDN must be efficient (high cache hit rate) while supporting multiple image sizes and formats.

Multi-format delivery:

Modern browsers support WebP (30% smaller than JPEG) and AVIF (50% smaller).
But older browsers only support JPEG.

Strategy: Content negotiation at CDN edge
  Client sends: Accept: image/avif, image/webp, image/jpeg
  CDN edge:
    1. Check if AVIF variant exists in cache → serve
    2. Check if WebP variant exists in cache → serve
    3. Serve JPEG (always available)
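
The edge-side choice is a simple preference ladder over the Accept header. A sketch (simplified: a production version would also honor q-values and treat wildcards as accepting everything):

```python
def pick_format(accept_header: str) -> str:
    """Serve the smallest format the client explicitly advertises; JPEG fallback."""
    accepted = {part.split(";")[0].strip() for part in accept_header.split(",")}
    for fmt in ("image/avif", "image/webp"):  # preference order: smallest first
        if fmt in accepted:
            return fmt
    return "image/jpeg"

assert pick_format("image/avif, image/webp, image/jpeg") == "image/avif"
assert pick_format("image/webp, image/jpeg") == "image/webp"
assert pick_format("image/jpeg") == "image/jpeg"
```

The chosen format must be part of the CDN cache key (Vary: Accept), otherwise an AVIF response could be served to a JPEG-only client.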

Pre-generation: generate WebP and AVIF during upload processing
  Original JPEG: 80KB (thumbnail)
  WebP: 56KB (-30%)
  AVIF: 40KB (-50%)

Annual savings: 500M views/day × 40KB savings × 365 = 7.3PB less CDN egress/year
  At $0.05/GB: $365K/year savings

Responsive image sizing:

Pre-generated sizes: 200px, 800px, 2048px + original
But what if a device needs 400px? (e.g., 2x retina mobile display showing 200px container)

Two approaches:
  A) Pre-generate more sizes: 200, 400, 600, 800, 1200, 2048
     Pro: simple, cacheable
     Con: 6 sizes × 3 formats × 50M photos/day = 900M thumbnails/day (storage intensive)

  B) On-the-fly resize at CDN edge (Cloudflare Image Resizing, CloudFront Lambda@Edge):
     URL: cdn.example.com/photos/photo_xyz?w=400&format=auto
     Edge worker: fetch nearest pre-generated size (800px), resize to 400px, cache result
     Pro: infinite sizes, no pre-generation
     Con: first request is slower (resize adds 50-100ms), edge compute cost

  Recommendation: Pre-generate 4 sizes (200, 800, 2048, original) + on-the-fly for edge cases
    → Covers 90% of requests with pre-generated sizes
    → On-the-fly handles the long tail of unusual sizes
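
Choosing which stored rendition backs a given request is a one-liner: take the smallest pre-generated size that is at least as large as the request, falling back to the largest (helper names are my own; sizes from the design above):

```python
PREGENERATED = [200, 800, 2048]  # longest-edge px; original handled separately

def source_size(requested_px: int) -> int:
    """Smallest pre-generated size >= request; never upscale from a smaller one."""
    for size in PREGENERATED:
        if size >= requested_px:
            return size
    return PREGENERATED[-1]

assert source_size(200) == 200    # exact hit, served straight from cache
assert source_size(400) == 800    # edge worker downscales 800 -> 400
assert source_size(3000) == 2048  # beyond the largest rendition
```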

Lazy loading and progressive rendering:

Photo grid (infinite scroll):
  1. Load low-quality placeholder first:
     → 20px wide blurred thumbnail (< 1KB), inline as base64 in HTML
     → Renders instantly (no network request)
  2. Intersection Observer: when thumbnail enters viewport:
     → Load full thumbnail (200px, ~15KB)
     → Fade in over 200ms (smooth transition from blur to sharp)
  3. User clicks photo → load 800px medium image
  4. User zooms → load 2048px or original

  Result: initial page load shows 50 blurred placeholders instantly,
  then sharpens as thumbnails load (perceptually fast)

  Total data for initial grid of 50 photos:
    Inline placeholders: 50 × 1KB = 50KB (in HTML)
    Visible thumbnails (20 on screen): 20 × 15KB = 300KB
    Total: 350KB for initial paint — loads in a second or two even on 3G, sub-second on faster connections

7. Extensions (2 min)

  • Collaborative albums: Multiple users contribute to a shared album. Requires: real-time updates (new photo appears for all viewers), conflict resolution (two users reorder simultaneously), and granular permissions (some can add/remove, others view-only).
  • Photo editing: Basic in-browser editing (crop, rotate, filters, adjust brightness/contrast). Non-destructive editing: store edits as a list of operations applied to the original. Revert to original anytime.
  • Memories / “On This Day”: Surface photos from the same date in previous years. Requires: date indexing, ML-based “interesting photo” scoring (skip duplicates and low-quality shots). Push notification: “You have memories from 3 years ago.”
  • Storage tiering and cost optimization: Hot storage (S3 Standard) for photos accessed in last 90 days. Warm (S3 IA) for 90 days - 1 year. Cold (Glacier) for older photos. Auto-tier based on access frequency. Transparent to users — accessing a cold photo takes 5 seconds extra (show loading indicator).
  • Print and merchandise: Let users order physical prints, photo books, calendars, or canvas prints. Integration with print fulfillment partners via API. Requires: color-accurate rendering previews and high-resolution image export with proper DPI.