1. Requirements & Scope (5 min)

Functional Requirements

  1. Users can create a paste (text content, up to 10MB)
  2. Each paste gets a unique, shareable URL
  3. Pastes can be public or private (unlisted — only accessible via URL)
  4. Pastes can have an optional expiration (10 min, 1 hour, 1 day, 1 week, never)
  5. Syntax highlighting for code pastes (client-side, not a backend concern)

Non-Functional Requirements

  • Availability: 99.9% — reads must be highly available; writes can tolerate brief degradation
  • Latency: Paste retrieval < 200ms at p99
  • Consistency: Strong consistency for writes (create → immediately readable). Eventual consistency acceptable for metadata like view counts.
  • Scale: 5M new pastes/day, 50M reads/day
  • Storage: Most pastes are small (< 50KB), but we support up to 10MB

2. Estimation (3 min)

Traffic

  • Writes: 5M/day ÷ ~100K seconds/day ≈ 50 writes/sec; assume 5× peak → 250/sec
  • Reads: 50M/day ÷ ~100K seconds/day ≈ 500 reads/sec; peak 2,500/sec

Storage

  • Average paste: 10KB (most are small code snippets)
  • 5M/day × 10KB = 50GB/day
  • Per year: ~18TB of paste content
  • Over 5 years: ~90TB

Bandwidth

  • Read: 500 reads/sec × 10KB = 5MB/sec average
  • Peak: 25MB/sec — manageable
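These estimates are easy to sanity-check in code. A minimal sketch using the same ~100K-seconds-per-day rounding as above:

```python
SECONDS_PER_DAY = 100_000  # ~86,400, rounded for back-of-envelope math

writes_per_day = 5_000_000
reads_per_day = 50_000_000
avg_paste_kb = 10
peak_factor = 5  # assumed peak-to-average ratio

write_qps = writes_per_day // SECONDS_PER_DAY                    # 50/sec, peak 250/sec
read_qps = reads_per_day // SECONDS_PER_DAY                      # 500/sec, peak 2,500/sec
storage_gb_per_day = writes_per_day * avg_paste_kb // 1_000_000  # 50 GB/day
storage_tb_per_year = storage_gb_per_day * 365 / 1000            # ~18.25 TB/year
read_mb_per_sec = read_qps * avg_paste_kb / 1000                 # 5 MB/sec average
```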

3. API Design (3 min)

POST /api/v1/pastes
  Body: {
    "content": "def hello():\n    print('world')",
    "title": "My Snippet",          // optional
    "language": "python",            // optional, for syntax highlighting
    "expiration": "1d",              // optional: 10m, 1h, 1d, 1w, never
    "visibility": "unlisted"         // public or unlisted
  }
  Response 201: {
    "id": "aB3kX9p",
    "url": "https://paste.example.com/aB3kX9p",
    "raw_url": "https://paste.example.com/raw/aB3kX9p",
    "expires_at": "2026-02-23T12:00:00Z"
  }

GET /api/v1/pastes/{id}
  Response 200: {
    "id": "aB3kX9p",
    "title": "My Snippet",
    "content": "def hello():\n    print('world')",
    "language": "python",
    "created_at": "2026-02-22T12:00:00Z",
    "expires_at": "2026-02-23T12:00:00Z",
    "view_count": 42
  }

GET /raw/{id}
  Response 200: (plain text content, no JSON wrapper)
  Content-Type: text/plain

GET /api/v1/pastes/recent?limit=20&cursor={cursor}
  Response 200: { "pastes": [...], "next_cursor": "..." }
  // Only returns public pastes
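A client for the create endpoint can be sketched with the standard library alone. This is a hypothetical client, not a shipped SDK; the host comes from the example responses above, and build_create_body is an illustrative helper that mirrors the optional fields of the request schema:

```python
import json
import urllib.request

BASE = "https://paste.example.com"  # host from the example responses above

def build_create_body(content, title=None, language=None,
                      expiration=None, visibility="unlisted"):
    """Assemble the POST /api/v1/pastes body, omitting unset optionals."""
    body = {"content": content, "visibility": visibility}
    if title is not None:
        body["title"] = title
    if language is not None:
        body["language"] = language
    if expiration is not None:
        body["expiration"] = expiration
    return body

def create_paste(**kwargs):
    """Send the create request and return the parsed 201 response."""
    req = urllib.request.Request(
        f"{BASE}/api/v1/pastes",
        data=json.dumps(build_create_body(**kwargs)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```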

4. Data Model (3 min)

Metadata Store: SQL (PostgreSQL)

Table: pastes
  id           (PK)   | char(7), Base62 encoded
  title                | varchar(200), nullable
  language             | varchar(50), nullable
  visibility           | enum('public', 'unlisted')
  content_key          | varchar(100)  -- S3 object key
  content_size         | int
  created_at           | timestamptz
  expires_at           | timestamptz, nullable
  view_count           | bigint, default 0

Content Store: Object Storage (S3)

  • Key: pastes/{id} → raw text content
  • Content stored separately because: metadata queries (list recent, check expiry) shouldn’t load paste content; S3 scales storage cheaply to petabytes; enables CDN integration for reads

Why SQL for metadata?

  • Need to query WHERE visibility = 'public' ORDER BY created_at DESC for the recent pastes feed
  • Need to query WHERE expires_at < NOW() for cleanup
  • 5M inserts/day (~50/sec) is well within PostgreSQL’s capacity with proper indexing
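The two query patterns above each want an index. A runnable sketch using SQLite as a stand-in (PostgreSQL could instead use a partial index WHERE visibility = 'public' for the feed, and would use real enum/timestamptz types):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pastes (
  id           TEXT PRIMARY KEY,        -- char(7) Base62
  title        TEXT,
  language     TEXT,
  visibility   TEXT NOT NULL CHECK (visibility IN ('public', 'unlisted')),
  content_key  TEXT,                    -- S3 object key
  content_size INTEGER,
  created_at   TEXT NOT NULL,
  expires_at   TEXT,
  view_count   INTEGER NOT NULL DEFAULT 0
);
-- Serves: WHERE visibility = 'public' ORDER BY created_at DESC
CREATE INDEX idx_public_recent ON pastes (visibility, created_at DESC);
-- Serves: WHERE expires_at < NOW() for the cleanup worker
CREATE INDEX idx_expiry ON pastes (expires_at);
""")
```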

5. High-Level Design (12 min)

Write Path

Client → Load Balancer → Paste Service
  → Generate unique 7-char Base62 ID (62^7 ≈ 3.5 trillion keyspace, ample for ~9B pastes over 5 years; note a full 64-bit Snowflake ID encodes to ~11 Base62 chars, so use a random ID with collision check or a range-allocating counter instead)
  → Upload content to S3 (key: pastes/{id})
  → Write metadata to PostgreSQL
  → Return paste URL
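The ID-generation step can be sketched as follows, assuming random 7-char Base62 IDs (one common choice; a uniqueness check against the pastes table on insert handles the rare collision):

```python
import secrets
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer in Base62."""
    if n == 0:
        return BASE62[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(BASE62[r])
    return "".join(reversed(out))

def new_paste_id(length: int = 7) -> str:
    """Random 7-char Base62 ID: 62**7 ≈ 3.5 trillion keyspace.
    Caller retries on the (rare) unique-constraint violation at insert."""
    n = secrets.randbelow(62 ** length)
    return base62_encode(n).rjust(length, BASE62[0])
```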

Read Path

Client → CDN (CloudFront/Cloudflare)
  Cache hit → Return content
  Cache miss → Load Balancer → Paste Service
    → Read metadata from PostgreSQL (or Redis cache)
    → Check expiry → if expired, return 404
    → Fetch content from S3 (or Redis if cached)
    → Increment view count async (Kafka → consumer)
    → Return response + cache at CDN

Components

  1. Paste Service: Stateless application servers handling create/read
  2. PostgreSQL (primary + replica): Metadata store
  3. S3: Content store — cheap, durable, scales to petabytes
  4. Redis: Cache hot pastes (metadata + content for small pastes < 100KB)
  5. CDN: Cache popular pastes at edge; raw endpoint is especially CDN-friendly
  6. Kafka: Async view count updates + expiry event stream
  7. Cleanup Worker: Periodic job to delete expired pastes from S3 and PostgreSQL

6. Deep Dives (15 min)

Deep Dive 1: Storage Architecture (S3 vs Database)

Why not store content in PostgreSQL?

  • 10KB avg × 5M/day = 50GB/day in the database. Over a year, that’s ~18TB in PostgreSQL — painful to back up, replicate, and query.
  • Large TEXT columns hurt query performance for metadata operations
  • S3 gives us effectively unlimited storage at $0.023/GB/month
  • S3 has built-in 11 nines (99.999999999%) durability

Optimization for small pastes: For pastes < 1KB (very common — short code snippets), we could inline the content in the PostgreSQL row to avoid the S3 roundtrip. This is a classic optimization:

content_inline  TEXT  -- populated for pastes < 1KB
content_key     TEXT  -- populated for pastes >= 1KB (S3 key)

Read path: if content_inline is not null, return it directly. Otherwise, fetch from S3.
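That read-path branch is a one-liner. A sketch where row is the pastes record and s3_get is a stand-in for the real S3 client:

```python
INLINE_THRESHOLD = 1024  # bytes; pastes under this are stored in the row

def resolve_content(row, s3_get):
    """Return paste content, preferring the inlined copy.

    `row` is a dict-like pastes record; `s3_get` fetches an object
    body by key — both are stand-ins for real DB/S3 clients.
    """
    if row.get("content_inline") is not None:
        return row["content_inline"]      # no S3 roundtrip
    return s3_get(row["content_key"])     # fall back to object storage
```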

Content compression: gzip text content before storing in S3. Code text compresses at ~5:1 ratio, reducing storage from 18TB/year to ~3.6TB/year and bandwidth from 5MB/sec to ~1MB/sec.
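The compression step is stdlib gzip. A small demo (the text here is artificially repetitive, so it compresses far better than the ~5:1 assumed above; real ratios depend on the content):

```python
import gzip

# Representative code-like text; repetitive source compresses well.
snippet = ("def handler(request):\n"
           "    return render(request, 'page.html')\n" * 200).encode()

compressed = gzip.compress(snippet)
ratio = len(snippet) / len(compressed)
# Store `compressed` in S3 with Content-Encoding: gzip so the CDN and
# browsers can serve it without a server-side decompression hop.
```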

Deep Dive 2: Handling Expiry at Scale

With millions of pastes expiring at different times, we need an efficient cleanup strategy.

Approach: Multi-layered expiry

  1. Read-time check (immediate): Every read checks expires_at. If expired, return 404 — even if the data hasn’t been physically deleted yet. This is the critical path and must be fast.

  2. Background cleanup (batched): A cron-based worker runs every hour:

    SELECT id, content_key FROM pastes
    WHERE expires_at < NOW()
    ORDER BY expires_at ASC
    LIMIT 10000;
    

    For each batch: delete from S3, delete from PostgreSQL, invalidate CDN/Redis cache.

  3. Database partitioning: Partition the pastes table by created_at month. Once every paste in a partition has expired, drop the whole partition — orders of magnitude faster than row-by-row deletion. Never-expiring pastes should live in a separate partition so they don’t block partition drops.

S3 lifecycle rules: As a safety net, set S3 lifecycle policy to delete objects older than the maximum possible TTL. This catches any objects the cleanup worker missed.
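The batched cleanup pass (step 2 above) can be sketched like this, using an SQLite-style DB handle and stand-in s3_delete / cache_invalidate callables in place of real S3 and CDN/Redis clients:

```python
import time

BATCH = 10_000

def cleanup_pass(db, s3_delete, cache_invalidate, now=None):
    """Delete expired pastes in batches: S3 object, caches, then the row.

    `db` is a DB-API connection ('?' placeholders here; PostgreSQL
    would use %s), `s3_delete` / `cache_invalidate` are stand-ins.
    """
    now = time.time() if now is None else now
    while True:
        rows = db.execute(
            "SELECT id, content_key FROM pastes "
            "WHERE expires_at < ? ORDER BY expires_at LIMIT ?",
            (now, BATCH),
        ).fetchall()
        if not rows:
            return
        for paste_id, content_key in rows:
            if content_key:                 # inlined pastes have no S3 object
                s3_delete(content_key)
            cache_invalidate(paste_id)
        db.executemany("DELETE FROM pastes WHERE id = ?",
                       [(pid,) for pid, _ in rows])
        db.commit()
```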

Deep Dive 3: Rate Limiting and Abuse Prevention

Pastebin is a magnet for abuse: malware distribution, data dumps, spam.

Rate limiting layers:

  1. IP-based: 30 pastes/hour per IP (anonymous), 100/hour for authenticated users
  2. Content size: Hard limit 10MB. Pastes > 1MB require authentication.
  3. Content scanning: Async pipeline — after paste creation, push to a content moderation queue:
    • Check against known malware signatures (ClamAV)
    • Check for sensitive data patterns (credit cards, SSNs)
    • Flag for manual review if suspicious
  4. CAPTCHA: Trigger on repeated creations from same IP

Storage abuse: Without limits, someone could use Pastebin as free unlimited storage. Mitigations:

  • Per-user storage quota (500MB free, paid tiers for more)
  • Anonymous pastes auto-expire after 30 days
  • Monitor creation patterns: if an IP creates thousands of pastes with random content, block it

7. Extensions (2 min)

  • Paste forking: Allow users to create a copy of any public paste and modify it (like GitHub gists)
  • Diff view: Compare two paste versions side by side
  • Collaboration: Real-time collaborative editing using CRDT/OT (like Google Docs) — significant complexity increase
  • Burn after reading: One-time view pastes that self-destruct after first read
  • API access: Full REST API with API keys for programmatic paste creation (CI/CD log sharing, etc.)
  • Search: Full-text search over public pastes using Elasticsearch