1. Requirements & Scope (5 min)
Functional Requirements
- Users can create a paste (text content, up to 10MB)
- Each paste gets a unique, shareable URL
- Pastes can be public or private (unlisted — only accessible via URL)
- Pastes can have an optional expiration (10 min, 1 hour, 1 day, 1 week, never)
- Syntax highlighting for code pastes (client-side, not a backend concern)
Non-Functional Requirements
- Availability: 99.9% — reads must be highly available; writes can tolerate brief degradation
- Latency: Paste retrieval < 200ms at p99
- Consistency: Strong consistency for writes (create → immediately readable). Eventual consistency acceptable for metadata like view counts.
- Scale: 5M new pastes/day, 50M reads/day
- Storage: Most pastes are small (< 50KB), but we support up to 10MB
2. Estimation (3 min)
Traffic
- Writes: 5M/day ÷ ~100K seconds/day ≈ 50 writes/sec, peak (5×) 250/sec
- Reads: 50M/day ÷ ~100K seconds/day ≈ 500 reads/sec, peak (5×) 2,500/sec
Storage
- Average paste: 10KB (most are small code snippets)
- 5M/day × 10KB = 50GB/day
- Per year: ~18TB of paste content
- Over 5 years: ~90TB
Bandwidth
- Read: 500 reads/sec × 10KB = 5MB/sec average
- Peak: 25MB/sec — manageable
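The estimates above can be sanity-checked in a few lines; the ÷100K divisor is the usual interview rounding of 86,400 seconds/day:

```python
# Back-of-envelope math for the traffic, storage, and bandwidth estimates.
SECONDS_PER_DAY = 100_000  # rounded from 86,400 for mental arithmetic

writes_per_sec = 5_000_000 // SECONDS_PER_DAY          # 50
reads_per_sec = 50_000_000 // SECONDS_PER_DAY          # 500
storage_per_day_gb = 5_000_000 * 10 // 1_000_000       # 10KB avg -> 50 GB/day
storage_per_year_tb = storage_per_day_gb * 365 / 1000  # ~18.25 TB
read_bandwidth_mb_s = reads_per_sec * 10 / 1000        # 5 MB/sec average
```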
3. API Design (3 min)
POST /api/v1/pastes
Body: {
"content": "def hello():\n print('world')",
"title": "My Snippet", // optional
"language": "python", // optional, for syntax highlighting
"expiration": "1d", // optional: 10m, 1h, 1d, 1w, never
"visibility": "unlisted" // public or unlisted
}
Response 201: {
"id": "aB3kX9p",
"url": "https://paste.example.com/aB3kX9p",
"raw_url": "https://paste.example.com/raw/aB3kX9p",
"expires_at": "2026-02-23T12:00:00Z"
}
GET /api/v1/pastes/{id}
Response 200: {
"id": "aB3kX9p",
"title": "My Snippet",
"content": "def hello():\n print('world')",
"language": "python",
"created_at": "2026-02-22T12:00:00Z",
"expires_at": "2026-02-23T12:00:00Z",
"view_count": 42
}
GET /raw/{id}
Response 200: (plain text content, no JSON wrapper)
Content-Type: text/plain
GET /api/v1/pastes/recent?limit=20&cursor={cursor}
Response 200: { "pastes": [...], "next_cursor": "..." }
// Only returns public pastes
4. Data Model (3 min)
Metadata Store: SQL (PostgreSQL)
Table: pastes
id (PK) | char(7), Base62 encoded
title | varchar(200), nullable
language | varchar(50), nullable
visibility | enum('public', 'unlisted')
content_key | varchar(100) -- S3 object key
content_size | int
created_at | timestamptz
expires_at | timestamptz, nullable
view_count | bigint, default 0
Content Store: Object Storage (S3)
- Key: pastes/{id} → raw text content
- Content stored separately because: metadata queries (list recent, check expiry) shouldn’t load paste content; S3 scales storage cheaply to petabytes; enables CDN integration for reads
Why SQL for metadata?
- Need to query WHERE visibility = 'public' ORDER BY created_at DESC for the recent pastes feed
- Need to query WHERE expires_at < NOW() for cleanup
- 5M inserts/day (~50/sec) is well within PostgreSQL’s capacity with proper indexing
5. High-Level Design (12 min)
Write Path
Client → Load Balancer → Paste Service
→ Generate unique ID (Snowflake → Base62)
→ Upload content to S3 (key: pastes/{id})
→ Write metadata to PostgreSQL
→ Return paste URL
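The Snowflake → Base62 step can be sketched as below, assuming a 64-bit integer ID arrives from an upstream Snowflake-style generator; the alphabet ordering here is one common choice, not mandated by the design:

```python
import string

# digits, then lowercase, then uppercase: 62 symbols total
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def to_base62(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return BASE62[0]
    out = []
    while n > 0:
        n, r = divmod(n, 62)
        out.append(BASE62[r])
    return "".join(reversed(out))
```

Seven Base62 characters cover 62^7 ≈ 3.5 trillion IDs, far more than years of traffic at 5M pastes/day.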
Read Path
Client → CDN (CloudFront/Cloudflare)
Cache hit → Return content
Cache miss → Load Balancer → Paste Service
→ Read metadata from PostgreSQL (or Redis cache)
→ Check expiry → if expired, return 404
→ Fetch content from S3 (or Redis if cached)
→ Increment view count async (Kafka → consumer)
→ Return response + cache at CDN
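The cache-miss branch of the read path can be sketched as follows; `db`, `cache`, and `s3` are hypothetical adapters around the real clients, and the async view-count step is omitted:

```python
from datetime import datetime, timezone

def get_paste(paste_id, db, cache, s3, now=None):
    """Read-path sketch: cache -> metadata -> expiry check -> content.
    Returns the content string, or None for a 404 (missing or expired)."""
    now = now or datetime.now(timezone.utc)
    cached = cache.get(paste_id)
    if cached is not None:
        return cached                       # hot paste served from Redis
    row = db.get(paste_id)                  # metadata lookup (PostgreSQL)
    if row is None:
        return None
    if row["expires_at"] is not None and row["expires_at"] <= now:
        return None                         # expired -> 404, even before cleanup runs
    content = s3.get(row["content_key"])    # content lookup (S3)
    cache.set(paste_id, content)            # populate cache for the next reader
    return content
```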
Components
- Paste Service: Stateless application servers handling create/read
- PostgreSQL (primary + replica): Metadata store
- S3: Content store — cheap, durable, scales to petabytes
- Redis: Cache hot pastes (metadata + content for small pastes < 100KB)
- CDN: Cache popular pastes at edge; raw endpoint is especially CDN-friendly
- Kafka: Async view count updates + expiry event stream
- Cleanup Worker: Periodic job to delete expired pastes from S3 and PostgreSQL
6. Deep Dives (15 min)
Deep Dive 1: Storage Architecture (S3 vs Database)
Why not store content in PostgreSQL?
- 10KB avg × 5M/day = 50GB/day in the database. Over a year, that’s 18TB in PostgreSQL — painful to backup, replicate, and query.
- Large TEXT columns hurt query performance for metadata operations
- S3 gives us effectively unlimited storage at $0.023/GB/month
- S3 has built-in 11 nines (99.999999999%) durability
Optimization for small pastes: For pastes < 1KB (very common — short code snippets), we could inline the content in the PostgreSQL row to avoid the S3 roundtrip. This is a classic optimization:
content_inline TEXT -- populated for pastes < 1KB
content_key TEXT -- populated for pastes >= 1KB (S3 key)
Read path: if content_inline is not null, return it directly. Otherwise, fetch from S3.
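That branch can be sketched in a few lines, with `s3_fetch` standing in for the real S3 client call:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PasteRow:
    content_inline: Optional[str]  # populated for pastes < 1KB
    content_key: Optional[str]     # S3 key, populated for pastes >= 1KB

def load_content(row: PasteRow, s3_fetch: Callable[[str], str]) -> str:
    """Prefer the inlined copy; fall back to an S3 roundtrip."""
    if row.content_inline is not None:
        return row.content_inline   # no network hop for small pastes
    return s3_fetch(row.content_key)
```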
Content compression: gzip text content before storing in S3. Code text compresses at ~5:1 ratio, reducing storage from 18TB/year to ~3.6TB/year and bandwidth from 5MB/sec to ~1MB/sec.
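A roundtrip sketch using Python's standard gzip module; the 5:1 ratio is typical for code text, not guaranteed:

```python
import gzip

def compress_paste(text: str) -> bytes:
    """Gzip the UTF-8 bytes before the S3 PUT."""
    return gzip.compress(text.encode("utf-8"))

def decompress_paste(blob: bytes) -> str:
    """Inverse transform on the read path (or at the CDN origin)."""
    return gzip.decompress(blob).decode("utf-8")
```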
Deep Dive 2: Handling Expiry at Scale
With millions of pastes expiring at different times, we need an efficient cleanup strategy.
Approach: Multi-layered expiry
- Read-time check (immediate): Every read checks expires_at. If expired, return 404 — even if the data hasn’t been physically deleted yet. This is the critical path and must be fast.
- Background cleanup (batched): A cron-based worker runs every hour:
  SELECT id, content_key FROM pastes WHERE expires_at < NOW() ORDER BY expires_at ASC LIMIT 10000;
  For each batch: delete from S3, delete from PostgreSQL, invalidate CDN/Redis cache.
- Database partitioning: Partition the pastes table by created_at month. When all pastes in a partition have expired (excluding pastes with “never” expiry), drop the entire partition — orders of magnitude faster than row-by-row deletion.
- S3 lifecycle rules: As a safety net, set an S3 lifecycle policy to delete objects older than the maximum possible TTL. This catches any objects the cleanup worker missed.
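The hourly worker reduces to a drain loop over that query; `db`, `s3`, and `cache` below are hypothetical adapters around the real clients:

```python
from datetime import datetime, timezone

BATCH_SIZE = 10_000

def cleanup_expired(db, s3, cache, now=None):
    """One cleanup pass: drain expired pastes in batches until none remain."""
    now = now or datetime.now(timezone.utc)
    while True:
        # Wraps: SELECT id, content_key FROM pastes
        #        WHERE expires_at < NOW() ORDER BY expires_at ASC LIMIT 10000
        batch = db.fetch_expired(now, limit=BATCH_SIZE)
        if not batch:
            return
        for paste_id, content_key in batch:
            s3.delete(content_key)     # remove the content object
            db.delete_paste(paste_id)  # then the metadata row
            cache.invalidate(paste_id) # purge Redis/CDN entries
```

If a step fails mid-batch, the read-time expiry check still hides the paste, so the worker can simply retry on its next run.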
Deep Dive 3: Rate Limiting and Abuse Prevention
Pastebin is a magnet for abuse: malware distribution, data dumps, spam.
Rate limiting layers:
- IP-based: 30 pastes/hour per IP (anonymous), 100/hour for authenticated users
- Content size: Hard limit 10MB. Pastes > 1MB require authentication.
- Content scanning: Async pipeline — after paste creation, push to a content moderation queue:
- Check against known malware signatures (ClamAV)
- Check for sensitive data patterns (credit cards, SSNs)
- Flag for manual review if suspicious
- CAPTCHA: Trigger on repeated creations from same IP
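A fixed-window counter is the simplest sketch of the per-IP limit; the limits come from the numbers above, and a sliding window or token bucket would smooth the window-boundary burst this version allows:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Fixed-window counter, e.g. FixedWindowLimiter(30, 3600) for 30/hour per IP."""
    def __init__(self, limit: int, window_sec: int):
        self.limit = limit
        self.window_sec = window_sec
        self.counts = defaultdict(int)  # (key, window index) -> count

    def allow(self, key: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window_sec))
        if self.counts[bucket] >= self.limit:
            return False                # over the limit in this window
        self.counts[bucket] += 1
        return True
```

In production this state would live in Redis (INCR plus a TTL on the window key) so every stateless Paste Service instance sees the same counts.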
Storage abuse: Without limits, someone could use Pastebin as free unlimited storage. Mitigations:
- Per-user storage quota (500MB free, paid tiers for more)
- Anonymous pastes auto-expire after 30 days
- Monitor creation patterns: if an IP creates thousands of pastes with random content, block it
7. Extensions (2 min)
- Paste forking: Allow users to create a copy of any public paste and modify it (like GitHub gists)
- Diff view: Compare two paste versions side by side
- Collaboration: Real-time collaborative editing using CRDT/OT (like Google Docs) — significant complexity increase
- Burn after reading: One-time view pastes that self-destruct after first read
- API access: Full REST API with API keys for programmatic paste creation (CI/CD log sharing, etc.)
- Search: Full-text search over public pastes using Elasticsearch