1. Requirements & Scope (5 min)

Functional Requirements

  1. Users can create a paste (text content, up to 10MB)
  2. Each paste gets a unique, shareable URL
  3. Pastes can be public or private (unlisted — only accessible via URL)
  4. Pastes can have an optional expiration (10 min, 1 hour, 1 day, 1 week, never)
  5. Syntax highlighting for code pastes (client-side, not a backend concern)

Non-Functional Requirements

  • Availability: 99.9% — reads must be highly available; writes can tolerate brief degradation
  • Latency: Paste retrieval < 200ms at p99
  • Consistency: Strong consistency for writes (create → immediately readable). Eventual consistency acceptable for metadata like view counts.
  • Scale: 5M new pastes/day, 50M reads/day
  • Storage: Most pastes are small (< 50KB), but we support up to 10MB

2. Estimation (3 min)

Traffic

  • Writes: 5M/day ÷ ~100K seconds/day ≈ 50 writes/sec; assume 5× peak → 250/sec
  • Reads: 50M/day ÷ ~100K seconds/day ≈ 500 reads/sec; peak 2,500/sec

Storage

  • Average paste: 10KB (most are small code snippets)
  • 5M/day × 10KB = 50GB/day
  • Per year: ~18TB of paste content
  • Over 5 years: ~90TB

Bandwidth

  • Read: 500 reads/sec × 10KB = 5MB/sec average
  • Peak: 25MB/sec — manageable
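These estimates are easy to sanity-check in code. A minimal sketch using the same ~100K-seconds-per-day rounding as above:

```python
SECONDS_PER_DAY = 100_000  # ~86,400, rounded for back-of-envelope math

writes_per_day = 5_000_000
reads_per_day = 50_000_000
avg_paste_kb = 10
peak_factor = 5  # assumed peak-to-average ratio

write_qps = writes_per_day // SECONDS_PER_DAY                    # 50/sec, peak 250/sec
read_qps = reads_per_day // SECONDS_PER_DAY                      # 500/sec, peak 2,500/sec
storage_gb_per_day = writes_per_day * avg_paste_kb // 1_000_000  # 50 GB/day
storage_tb_per_year = storage_gb_per_day * 365 / 1000            # ~18.25 TB/year
read_mb_per_sec = read_qps * avg_paste_kb / 1000                 # 5 MB/sec average
```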

3. API Design (3 min)

POST /api/v1/pastes
  Body: {
    "content": "def hello():\n    print('world')",
    "title": "My Snippet",          // optional
    "language": "python",            // optional, for syntax highlighting
    "expiration": "1d",              // optional: 10m, 1h, 1d, 1w, never
    "visibility": "unlisted"         // public or unlisted
  }
  Response 201: {
    "id": "aB3kX9p",
    "url": "https://paste.example.com/aB3kX9p",
    "raw_url": "https://paste.example.com/raw/aB3kX9p",
    "expires_at": "2026-02-23T12:00:00Z"
  }

GET /api/v1/pastes/{id}
  Response 200: {
    "id": "aB3kX9p",
    "title": "My Snippet",
    "content": "def hello():\n    print('world')",
    "language": "python",
    "created_at": "2026-02-22T12:00:00Z",
    "expires_at": "2026-02-23T12:00:00Z",
    "view_count": 42
  }

GET /raw/{id}
  Response 200: (plain text content, no JSON wrapper)
  Content-Type: text/plain

GET /api/v1/pastes/recent?limit=20&cursor={cursor}
  Response 200: { "pastes": [...], "next_cursor": "..." }
  // Only returns public pastes
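A client for the create endpoint can be sketched with the standard library alone. This is a hypothetical client, not a shipped SDK; the host comes from the example responses above, and build_create_body is an illustrative helper that mirrors the optional fields of the request schema:

```python
import json
import urllib.request

BASE = "https://paste.example.com"  # host from the example responses above

def build_create_body(content, title=None, language=None,
                      expiration=None, visibility="unlisted"):
    """Assemble the POST /api/v1/pastes body, omitting unset optionals."""
    body = {"content": content, "visibility": visibility}
    if title is not None:
        body["title"] = title
    if language is not None:
        body["language"] = language
    if expiration is not None:
        body["expiration"] = expiration
    return body

def create_paste(**kwargs):
    """Send the create request and return the parsed 201 response."""
    req = urllib.request.Request(
        f"{BASE}/api/v1/pastes",
        data=json.dumps(build_create_body(**kwargs)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```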

4. Data Model (3 min)

Metadata Store: SQL (PostgreSQL)

Table: pastes
  id           (PK)   | char(7), Base62 encoded
  title                | varchar(200), nullable
  language             | varchar(50), nullable
  visibility           | enum('public', 'unlisted')
  content_key          | varchar(100)  -- S3 object key
  content_size         | int
  created_at           | timestamptz
  expires_at           | timestamptz, nullable
  view_count           | bigint, default 0

Content Store: Object Storage (S3)

  • Key: pastes/{id} → raw text content
  • Content stored separately because: metadata queries (list recent, check expiry) shouldn’t load paste content; S3 scales storage cheaply to petabytes; enables CDN integration for reads

Why SQL for metadata?

  • Need to query WHERE visibility = 'public' ORDER BY created_at DESC for the recent pastes feed
  • Need to query WHERE expires_at < NOW() for cleanup
  • 5M inserts/day (~50/sec) is well within PostgreSQL’s capacity with proper indexing
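The two query patterns above each want an index. A runnable sketch using SQLite as a stand-in (PostgreSQL could instead use a partial index WHERE visibility = 'public' for the feed, and would use real enum/timestamptz types):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pastes (
  id           TEXT PRIMARY KEY,        -- char(7) Base62
  title        TEXT,
  language     TEXT,
  visibility   TEXT NOT NULL CHECK (visibility IN ('public', 'unlisted')),
  content_key  TEXT,                    -- S3 object key
  content_size INTEGER,
  created_at   TEXT NOT NULL,
  expires_at   TEXT,
  view_count   INTEGER NOT NULL DEFAULT 0
);
-- Serves: WHERE visibility = 'public' ORDER BY created_at DESC
CREATE INDEX idx_public_recent ON pastes (visibility, created_at DESC);
-- Serves: WHERE expires_at < NOW() for the cleanup worker
CREATE INDEX idx_expiry ON pastes (expires_at);
""")
```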

5. High-Level Design (12 min)

Write Path

Client → Load Balancer → Paste Service
  → Generate unique 7-char Base62 ID (62^7 ≈ 3.5 trillion keyspace, ample for ~9B pastes over 5 years; note a full 64-bit Snowflake ID encodes to ~11 Base62 chars, so use a random ID with collision check or a range-allocating counter instead)
  → Upload content to S3 (key: pastes/{id})
  → Write metadata to PostgreSQL
  → Return paste URL
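The ID-generation step can be sketched as follows, assuming random 7-char Base62 IDs (one common choice; a uniqueness check against the pastes table on insert handles the rare collision):

```python
import secrets
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer in Base62."""
    if n == 0:
        return BASE62[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(BASE62[r])
    return "".join(reversed(out))

def new_paste_id(length: int = 7) -> str:
    """Random 7-char Base62 ID: 62**7 ≈ 3.5 trillion keyspace.
    Caller retries on the (rare) unique-constraint violation at insert."""
    n = secrets.randbelow(62 ** length)
    return base62_encode(n).rjust(length, BASE62[0])
```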

Read Path

Client → CDN (CloudFront/Cloudflare)
  Cache hit → Return content
  Cache miss → Load Balancer → Paste Service
    → Read metadata from PostgreSQL (or Redis cache)
    → Check expiry → if expired, return 404
    → Fetch content from S3 (or Redis if cached)
    → Increment view count async (Kafka → consumer)
    → Return response + cache at CDN

Components

  1. Paste Service: Stateless application servers handling create/read
  2. PostgreSQL (primary + replica): Metadata store
  3. S3: Content store — cheap, durable, scales to petabytes
  4. Redis: Cache hot pastes (metadata + content for small pastes < 100KB)
  5. CDN: Cache popular pastes at edge; raw endpoint is especially CDN-friendly
  6. Kafka: Async view count updates + expiry event stream
  7. Cleanup Worker: Periodic job to delete expired pastes from S3 and PostgreSQL

6. Deep Dives (15 min)

Deep Dive 1: Storage Architecture (S3 vs Database)

Why not store content in PostgreSQL?

  • 10KB avg × 5M/day = 50GB/day in the database. Over a year, that’s ~18TB in PostgreSQL — painful to back up, replicate, and query.
  • Large TEXT columns hurt query performance for metadata operations
  • S3 gives us effectively unlimited storage at $0.023/GB/month
  • S3 has built-in 11 nines (99.999999999%) durability

Optimization for small pastes: For pastes < 1KB (very common — short code snippets), we could inline the content in the PostgreSQL row to avoid the S3 roundtrip. This is a classic optimization:

content_inline  TEXT  -- populated for pastes < 1KB
content_key     TEXT  -- populated for pastes >= 1KB (S3 key)

Read path: if content_inline is not null, return it directly. Otherwise, fetch from S3.
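That read-path branch is a one-liner. A sketch where row is the pastes record and s3_get is a stand-in for the real S3 client:

```python
INLINE_THRESHOLD = 1024  # bytes; pastes under this are stored in the row

def resolve_content(row, s3_get):
    """Return paste content, preferring the inlined copy.

    `row` is a dict-like pastes record; `s3_get` fetches an object
    body by key — both are stand-ins for real DB/S3 clients.
    """
    if row.get("content_inline") is not None:
        return row["content_inline"]      # no S3 roundtrip
    return s3_get(row["content_key"])     # fall back to object storage
```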

Content compression: gzip text content before storing in S3. Code text compresses at ~5:1 ratio, reducing storage from 18TB/year to ~3.6TB/year and bandwidth from 5MB/sec to ~1MB/sec.
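The compression step is stdlib gzip. A small demo (the text here is artificially repetitive, so it compresses far better than the ~5:1 assumed above; real ratios depend on the content):

```python
import gzip

# Representative code-like text; repetitive source compresses well.
snippet = ("def handler(request):\n"
           "    return render(request, 'page.html')\n" * 200).encode()

compressed = gzip.compress(snippet)
ratio = len(snippet) / len(compressed)
# Store `compressed` in S3 with Content-Encoding: gzip so the CDN and
# browsers can serve it without a server-side decompression hop.
```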

Deep Dive 2: Handling Expiry at Scale

With millions of pastes expiring at different times, we need an efficient cleanup strategy.

Approach: Multi-layered expiry

  1. Read-time check (immediate): Every read checks expires_at. If expired, return 404 — even if the data hasn’t been physically deleted yet. This is the critical path and must be fast.

  2. Background cleanup (batched): A cron-based worker runs every hour:

    SELECT id, content_key FROM pastes
    WHERE expires_at < NOW()
    ORDER BY expires_at ASC
    LIMIT 10000;
    

    For each batch: delete from S3, delete from PostgreSQL, invalidate CDN/Redis cache.

  3. Database partitioning: Partition the pastes table by created_at month. Once every paste in a partition has expired, drop the whole partition — orders of magnitude faster than row-by-row deletion. Never-expiring pastes should live in a separate partition so they don’t block partition drops.

S3 lifecycle rules: As a safety net, set S3 lifecycle policy to delete objects older than the maximum possible TTL. This catches any objects the cleanup worker missed.
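The batched cleanup pass (step 2 above) can be sketched like this, using an SQLite-style DB handle and stand-in s3_delete / cache_invalidate callables in place of real S3 and CDN/Redis clients:

```python
import time

BATCH = 10_000

def cleanup_pass(db, s3_delete, cache_invalidate, now=None):
    """Delete expired pastes in batches: S3 object, caches, then the row.

    `db` is a DB-API connection ('?' placeholders here; PostgreSQL
    would use %s), `s3_delete` / `cache_invalidate` are stand-ins.
    """
    now = time.time() if now is None else now
    while True:
        rows = db.execute(
            "SELECT id, content_key FROM pastes "
            "WHERE expires_at < ? ORDER BY expires_at LIMIT ?",
            (now, BATCH),
        ).fetchall()
        if not rows:
            return
        for paste_id, content_key in rows:
            if content_key:                 # inlined pastes have no S3 object
                s3_delete(content_key)
            cache_invalidate(paste_id)
        db.executemany("DELETE FROM pastes WHERE id = ?",
                       [(pid,) for pid, _ in rows])
        db.commit()
```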

Deep Dive 3: Rate Limiting and Abuse Prevention

Pastebin is a magnet for abuse: malware distribution, data dumps, spam.

Rate limiting layers:

  1. IP-based: 30 pastes/hour per IP (anonymous), 100/hour for authenticated users
  2. Content size: Hard limit 10MB. Pastes > 1MB require authentication.
  3. Content scanning: Async pipeline — after paste creation, push to a content moderation queue:
    • Check against known malware signatures (ClamAV)
    • Check for sensitive data patterns (credit cards, SSNs)
    • Flag for manual review if suspicious
  4. CAPTCHA: Trigger on repeated creations from same IP

Storage abuse: Without limits, someone could use Pastebin as free unlimited storage. Mitigations:

  • Per-user storage quota (500MB free, paid tiers for more)
  • Anonymous pastes auto-expire after 30 days
  • Monitor creation patterns: if an IP creates thousands of pastes with random content, block it

7. Extensions (2 min)

  • Paste forking: Allow users to create a copy of any public paste and modify it (like GitHub gists)
  • Diff view: Compare two paste versions side by side
  • Collaboration: Real-time collaborative editing using CRDT/OT (like Google Docs) — significant complexity increase
  • Burn after reading: One-time view pastes that self-destruct after first read
  • API access: Full REST API with API keys for programmatic paste creation (CI/CD log sharing, etc.)
  • Search: Full-text search over public pastes using Elasticsearch