1. Requirements & Scope (5 min)
Functional Requirements
- Crawl the web starting from a seed set of URLs, discovering and following links to new pages
- Download and store the HTML content of each page for downstream indexing/processing
- Respect robots.txt directives and crawl-delay policies for every domain
- Detect and avoid duplicate URLs (normalization) and duplicate content (near-dedup)
- Support prioritized crawling — important/fresh pages crawled more frequently than stale/low-quality pages
Non-Functional Requirements
- Availability: 99.9% — the crawler should run continuously. Brief outages are acceptable (we just resume from where we left off).
- Throughput: Crawl 1 billion pages/day (~12,000 pages/sec sustained). Scale to 5 billion pages total in the index.
- Latency: Not a real-time system. End-to-end latency from URL discovery to content storage can be minutes to hours.
- Politeness: Never overload a single web server. Default maximum of 1 request/second per domain; honor any robots.txt Crawl-delay directive.
- Robustness: Handle spider traps, malformed HTML, infinite URL spaces (calendars, session IDs), and adversarial pages gracefully.
2. Estimation (3 min)
Throughput
- Target: 1 billion pages/day
- 1B / 86,400 s ≈ 11,600, rounded up to ~12,000 pages/sec for headroom
- Average page size: 100KB (HTML + headers)
- Download bandwidth: 12,000 x 100KB = 1.2 GB/sec = ~10 Gbps
Storage
- 5 billion pages in the index
- Average compressed page: 30KB (HTML compresses ~3:1)
- Total content storage: 5B x 30KB = 150 TB
- URL frontier (URLs to crawl): 10 billion URLs x 200 bytes = 2 TB
- URL seen set (deduplication): 50 billion URLs x 8 bytes (fingerprint) = 400 GB — fits in distributed memory
DNS
- Unique domains: ~200 million
- DNS resolution: must cache aggressively. At 12,000 pages/sec, we cannot afford a fresh DNS lookup for each request.
- DNS cache: 200M domains x 100 bytes = 20 GB — fits in memory
Machines
- Each crawler worker: ~200 concurrent connections, 500 pages/sec per worker
- 12,000 / 500 = ~24 crawler workers (plus headroom → 40 workers)
- Each worker: 32 cores, 64GB RAM, 10 Gbps NIC
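These figures can be sanity-checked with a few lines of Python; every constant below comes from the estimates above:

```python
# Back-of-envelope check of the estimation numbers above.
PAGES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
AVG_PAGE_KB = 100            # uncompressed HTML + headers
COMPRESSED_KB = 30           # ~3:1 gzip ratio
INDEX_PAGES = 5_000_000_000
PAGES_PER_WORKER = 500       # sustained fetch rate per worker

pages_per_sec = PAGES_PER_DAY / SECONDS_PER_DAY                  # ~11,600
bandwidth_gbps = pages_per_sec * AVG_PAGE_KB * 1000 * 8 / 1e9    # ~9.3 Gbps
storage_tb = INDEX_PAGES * COMPRESSED_KB * 1000 / 1e12           # 150 TB
workers = pages_per_sec / PAGES_PER_WORKER                       # ~23 → 40 with headroom

print(f"{pages_per_sec:,.0f} pages/s, {bandwidth_gbps:.1f} Gbps, "
      f"{storage_tb:.0f} TB content, {workers:.0f} workers minimum")
```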
3. API Design (3 min)
The web crawler is not a user-facing API service — it is an internal batch processing system. However, it has internal control and data interfaces.
Control API
// Add seed URLs to the frontier
POST /crawler/seeds
Body: { "urls": ["https://example.com", "https://news.site.com"] }
Response 200: { "added": 2 }
// Configure crawl policy for a domain
PUT /crawler/policies/{domain}
Body: {
  "max_qps": 1.0,                 // requests per second
  "crawl_depth": 10,              // max link hops from seed
  "priority": "high",             // high, medium, low
  "recrawl_interval_hours": 24,   // how often to re-crawl
  "user_agent": "MyBot/1.0"
}
// Get crawl status
GET /crawler/status
Response 200: {
  "pages_crawled_today": 892000000,
  "frontier_size": 3400000000,
  "active_workers": 38,
  "avg_pages_per_sec": 11500,
  "errors_today": 42000
}
// Pause/resume crawling for a domain
POST /crawler/domains/{domain}/pause
POST /crawler/domains/{domain}/resume
Data Output
// Crawled pages written to distributed storage
Path: s3://crawl-data/{date}/{domain_hash}/{url_hash}.warc.gz
WARC record format:
WARC-Type: response
WARC-Target-URI: https://example.com/page
WARC-Date: 2026-02-22T10:30:00Z
Content-Type: application/http; msgtype=response
[HTTP response headers]
[HTML body]
Key Decisions
- WARC format for storage — industry standard (used by Internet Archive, Common Crawl)
- Domain-level politeness is enforced in the frontier, not at the worker level
- Crawl data flows to downstream pipeline (indexer, content processor) via message queue
4. Data Model (3 min)
URL Frontier (Redis + disk-backed priority queue)
Per-domain queue:
Key: frontier:{domain}
Type: Priority Queue (sorted by priority score)
Entry: {
  url           | string
  priority      | float    -- higher = crawl sooner
  discovered_at | int64
  depth         | int      -- hops from seed
  last_crawled  | int64    -- 0 if never crawled
  retries       | int
}
Domain metadata:
Key: domain:{domain}
Fields: {
  robots_txt     | string  -- cached robots.txt content
  robots_fetched | int64   -- when we last fetched robots.txt
  crawl_delay    | float   -- from robots.txt or policy
  last_request   | int64   -- timestamp of last request (politeness)
  dns_ip         | string  -- cached DNS resolution
  dns_ttl        | int64   -- DNS cache expiry
}
URL Seen Set (Bloom filter + backing store)
In-memory: Bloom filter
- 50 billion URL fingerprints
- 10 bits per element, 7 hash functions → ~1% false positive rate
- Size: 50B x 10 bits = 62.5 GB → distributed across workers
Backing store (RocksDB / LevelDB per worker):
Key: MD5(normalized_url) -- 16 bytes
Value: {last_crawled, content_hash} -- 16 bytes
Crawled Content (S3 / HDFS)
WARC files:
- One WARC file per domain-hour batch
- Compressed (gzip): ~30KB per page average
- Stored in S3 with lifecycle policy (keep 3 most recent crawls per page)
Why These Choices
- Redis for frontier hot state — fast priority queue operations, domain-level rate limiting
- Bloom filter for seen set — O(1) membership test, acceptable false positive rate (re-crawling a page is wasteful but not harmful)
- RocksDB for persistent dedup — handles billions of keys efficiently on local SSDs
- S3/WARC for content — cheap, durable, industry standard, decoupled from crawl process
5. High-Level Design (12 min)
Architecture
Seed URLs
│
▼
┌────────────────────────────────────────────────────────┐
│ URL Frontier Service │
│ │
│ ┌──────────────────────┐ ┌────────────────────────┐ │
│ │ Priority Scheduler │ │ Politeness Enforcer │ │
│ │ (global priority │ │ (per-domain rate limit,│ │
│ │ across all domains) │ │ robots.txt cache) │ │
│ └──────────┬───────────┘ └────────────┬───────────┘ │
│ │ │ │
│ └──────────┬─────────────────┘ │
│ │ │
│ Emit next URL │
│ (highest priority │
│ + politeness OK) │
└────────────────────────┬───────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Crawler Workers (40 machines) │
│ │
│ Per worker: │
│ 1. Receive URL from frontier │
│ 2. DNS resolution (local cache → DNS resolver) │
│ 3. HTTP fetch (async, 200 concurrent connections) │
│ 4. Parse robots.txt (if not cached) │
│ 5. Download page (follow redirects, handle timeouts) │
│ 6. Extract links from HTML │
│ 7. Normalize discovered URLs │
│ 8. Content fingerprint (SimHash for near-dedup) │
│ └─→ Send results to output pipeline │
└────────────────────────┬───────────────────────────────┘
│
┌──────────┼──────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Content  │ │ URL      │ │ Link     │
│ Store    │ │ Dedup    │ │ Discovery│
│(S3/WARC) │ │ Service  │ │→ back to │
│          │ │ (bloom + │ │ frontier │
│          │ │ RocksDB) │ │          │
└──────────┘ └──────────┘ └──────────┘
URL Frontier Detail (the brain of the crawler)
Two-level queue architecture:
Level 1: Priority queues (front queues)
┌──────┐ ┌──────┐ ┌──────┐
│ High │ │ Med │ │ Low │
│prio │ │prio │ │prio │
└──┬───┘ └──┬───┘ └──┬───┘
│ │ │
└────────┼────────┘
│
Weighted selection (80% high, 15% med, 5% low)
│
▼
Level 2: Per-domain queues (back queues)
┌────────────┐ ┌────────────┐ ┌────────────┐
│example.com │ │news.com │ │blog.org │
│ [url, url] │ │ [url, url] │ │ [url, url] │
│ next: 1.2s │ │ next: 0.5s │ │ next: 3.0s │
└────────────┘ └────────────┘ └────────────┘
Each back queue has a "next allowed fetch time"
Scheduler picks the queue with earliest "next" time
After fetch: next_time += 1/crawl_rate (e.g., +1 second)
This ensures:
- Important pages are crawled first (front queue priority)
- No domain is overwhelmed (back queue rate limiting)
- Fair distribution across domains
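The back-queue mechanics above can be sketched as a small scheduler keyed on each domain's "next allowed fetch time." This is an illustrative sketch only (a real frontier persists queues to disk and shards by domain); class and method names are assumptions:

```python
import heapq
import time
from collections import deque

class BackQueueScheduler:
    """Per-domain back queues behind a heap of 'next allowed fetch time'.

    Enforces one request per crawl_delay seconds per domain, picking the
    domain whose next-allowed time is earliest, as described above.
    """

    def __init__(self):
        self.queues = {}   # domain -> deque of (url, crawl_delay)
        self.heap = []     # (next_allowed_time, domain)

    def add_url(self, domain, url, crawl_delay=1.0, now=None):
        now = time.monotonic() if now is None else now
        if domain not in self.queues:
            self.queues[domain] = deque()
            heapq.heappush(self.heap, (now, domain))
        self.queues[domain].append((url, crawl_delay))

    def next_url(self, now=None):
        """Return (url, wait_seconds); url is None if no domain is ready."""
        now = time.monotonic() if now is None else now
        while self.heap:
            next_time, domain = self.heap[0]
            if not self.queues[domain]:
                heapq.heappop(self.heap)        # domain fully drained
                del self.queues[domain]
                continue
            if next_time > now:
                return None, next_time - now    # politeness: wait
            heapq.heappop(self.heap)
            url, delay = self.queues[domain].popleft()
            # After fetch: next_time += 1/crawl_rate (the rule above)
            heapq.heappush(self.heap, (now + delay, domain))
            return url, 0.0
        return None, 0.0
```

Usage: two URLs on the same domain come back one crawl-delay apart, regardless of how fast the caller polls.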
Components
- URL Frontier Service: Central scheduling brain. Manages priority + politeness. Distributed across multiple nodes with domain-level sharding.
- Crawler Workers (40): Stateless HTTP fetchers. Pull URLs from frontier, download pages, extract links. Async I/O (epoll) for high concurrency.
- DNS Resolver Cache: Local DNS cache on each worker + shared Redis DNS cache. TTL-aware. Falls back to upstream DNS only on cache miss.
- URL Dedup Service: Bloom filter (in-memory) + RocksDB (persistent). Checks if URL was already crawled or queued. Normalized URL fingerprint as key.
- Content Store (S3): WARC files organized by domain and date. Lifecycle management keeps recent crawls, archives older versions.
- robots.txt Cache: Redis-backed, refreshed every 24 hours per domain. Parsed into allow/disallow rules for fast path matching.
- Monitoring: Pages crawled/sec, error rates by type (timeout, 4xx, 5xx), frontier size, domain coverage, politeness violations.
6. Deep Dives (15 min)
Deep Dive 1: URL Normalization and Deduplication
The problem: The same page can be reached via many different URLs. Without normalization, we waste bandwidth re-crawling identical content.
URL normalization rules:
1. Lowercase scheme and host:
HTTP://Example.COM/Page → http://example.com/Page
2. Remove default ports:
http://example.com:80/page → http://example.com/page
https://example.com:443/page → https://example.com/page
3. Remove fragment:
http://example.com/page#section → http://example.com/page
4. Sort query parameters:
http://example.com/page?b=2&a=1 → http://example.com/page?a=1&b=2
5. Remove tracking parameters:
http://example.com/page?utm_source=twitter&id=5 → http://example.com/page?id=5
(Maintain a list: utm_source, utm_medium, utm_campaign, fbclid, gclid, etc.)
6. Remove trailing slash:
http://example.com/page/ → http://example.com/page
7. Decode unnecessary percent-encoding:
http://example.com/%7Euser → http://example.com/~user
8. Canonicalize path:
http://example.com/a/../b/./c → http://example.com/b/c
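A sketch of rules 1, 2, 3, 4, 5, 6, and 8 using only the Python standard library. The tracking-parameter list is the illustrative subset named above, and rule 7 (percent-decoding) is omitted for brevity:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
import posixpath

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
DEFAULT_PORTS = {"http": "80", "https": "443"}

def normalize_url(url: str) -> str:
    scheme, netloc, path, query, _fragment = urlsplit(url)  # rule 3: drop fragment
    scheme = scheme.lower()                   # rule 1: lowercase scheme
    host = netloc.lower()                     # rule 1: lowercase host
    if ":" in host:                           # rule 2: drop default port
        name, port = host.rsplit(":", 1)
        if DEFAULT_PORTS.get(scheme) == port:
            host = name
    # rules 4 + 5: sort query params, drop tracking params
    params = sorted((k, v) for k, v in parse_qsl(query, keep_blank_values=True)
                    if k not in TRACKING_PARAMS)
    # rule 8: collapse ./ and ../ segments in the path
    path = posixpath.normpath(path) if path else "/"
    if path == ".":
        path = "/"
    path = path.rstrip("/") or "/"            # rule 6: drop trailing slash
    return urlunsplit((scheme, host, path, urlencode(params), ""))
```

Note the path's case is preserved (only scheme and host are case-insensitive), matching the rule 1 example.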
URL fingerprinting:
After normalization:
fingerprint = MD5(normalized_url) -- 128-bit, 16 bytes
(MD5 is fine for dedup — we're not using it for security)
Bloom filter check:
if bloom_filter.might_contain(fingerprint):
    check RocksDB for the definitive answer (handle false positives)
else:
    definitely new URL → add to frontier and bloom filter
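The check above can be made concrete with a minimal Bloom filter (k positions derived from one MD5 digest via double hashing) and a plain dict standing in for RocksDB; sizes and names here are illustrative, not the production parameters:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; k index positions come from one MD5 digest."""

    def __init__(self, num_bits: int, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        digest = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        # Double hashing: position_i = h1 + i*h2 (mod num_bits)
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def is_new_url(fingerprint: str, bloom: BloomFilter, backing_store: dict) -> bool:
    """Seen-set check from the pseudocode above; dict stands in for RocksDB."""
    if bloom.might_contain(fingerprint) and fingerprint in backing_store:
        return False                      # definitely seen before
    bloom.add(fingerprint)                # new URL (or a bloom false positive)
    backing_store[fingerprint] = True
    return True
```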
Content-level deduplication (near-dedup):
Problem: Same content at different URLs (mirrors, syndication, URL parameters that don't change content)
SimHash (locality-sensitive hashing):
1. Extract text from HTML (strip tags)
2. Tokenize into shingles (3-word sliding window)
3. Hash each shingle → 64-bit hash
4. For each bit position: if hash bit = 1, add +1; if 0, add -1
5. Final SimHash: for each position, if sum > 0 → 1, else → 0
Two documents with SimHash Hamming distance ≤ 3 (out of 64 bits) → near-duplicate
→ Skip crawling or mark as duplicate in index
Storage: 8 bytes per page. For 5B pages: 40 GB. Searchable via bit-partitioned index.
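The five steps translate almost directly into Python. MD5 truncated to 64 bits stands in for the shingle hash, and the function names are illustrative:

```python
import hashlib

def simhash(text: str, shingle_size: int = 3) -> int:
    """64-bit SimHash over word shingles, following the 5 steps above."""
    words = text.lower().split()                       # step 1 assumes tags stripped
    shingles = [" ".join(words[i:i + shingle_size])    # step 2: 3-word window
                for i in range(max(1, len(words) - shingle_size + 1))]
    sums = [0] * 64
    for shingle in shingles:
        # step 3: 64-bit hash per shingle
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for bit in range(64):
            # step 4: +1 if the bit is set, -1 otherwise
            sums[bit] += 1 if (h >> bit) & 1 else -1
    result = 0
    for bit in range(64):
        if sums[bit] > 0:                              # step 5: sign of the sum
            result |= 1 << bit
    return result

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    return hamming_distance(a, b) <= threshold
```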
Deep Dive 2: Handling Spider Traps and Adversarial Pages
Spider traps — infinite URL spaces:
1. Calendar trap:
/calendar/2026/02/23 → links to /calendar/2026/02/24 → ... infinitely
Detection: URL path depth limit (e.g., max 15 segments)
2. Session ID trap:
/page?session=abc → contains link to /page?session=def → ... infinite unique URLs
Detection: URL normalization strips session-like parameters
Heuristic: if a parameter value changes but content hash is identical → parameter is cosmetic
3. Soft 404s:
/nonexistent-page returns 200 OK with "Page not found" content
Detection: content similarity to known 404 pages of the domain
4. Adversarial content:
Page generates random links to keep crawler busy
Detection: per-domain anomaly detection — if one domain generates 100x more unique URLs than similar-sized domains, flag and throttle
5. Redirect loops:
A → B → C → A
Detection: max redirect depth (e.g., 10 hops)
Defenses:
Per-domain limits:
- Max URLs per domain in frontier: 100,000
- Max crawl depth from seed: 15 hops
- Max URL length: 2048 characters
- Max page size: 10MB (truncate after)
Per-page limits:
- Max links extracted per page: 500
- Max redirect hops: 10
- Fetch timeout: 30 seconds
- Connection timeout: 10 seconds
Adaptive throttling:
- If a domain's error rate > 50% over last 100 requests → pause for 1 hour
- If a domain generates > 10,000 URLs/hour from a single page → flag as trap, stop crawling that subtree
- Manual blocklist for known adversarial domains
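A minimal sketch of the first adaptive-throttling rule above (pause a domain for an hour when more than 50% of its last 100 requests failed). The class name and in-memory dict state are illustrative assumptions:

```python
import time
from collections import deque

class DomainThrottle:
    """Pause a domain when its recent error rate crosses a threshold.

    Tracks the last `window` request outcomes per domain; once the window
    is full and the error fraction exceeds `max_error_rate`, the domain
    is paused for `pause_seconds`.
    """

    def __init__(self, window=100, max_error_rate=0.5, pause_seconds=3600):
        self.window = window
        self.max_error_rate = max_error_rate
        self.pause_seconds = pause_seconds
        self.outcomes = {}      # domain -> deque of bools (True = error)
        self.paused_until = {}  # domain -> unix timestamp

    def record(self, domain: str, error: bool, now=None):
        now = time.time() if now is None else now
        dq = self.outcomes.setdefault(domain, deque(maxlen=self.window))
        dq.append(error)
        if len(dq) == self.window and sum(dq) / self.window > self.max_error_rate:
            self.paused_until[domain] = now + self.pause_seconds
            dq.clear()          # start a fresh window after the pause

    def allowed(self, domain: str, now=None) -> bool:
        now = time.time() if now is None else now
        return now >= self.paused_until.get(domain, 0)
```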
Deep Dive 3: Crawl Prioritization and Freshness
The problem: 50 billion discoverable URLs, but we can only crawl 1 billion/day. How do we decide what to crawl first, and how often to re-crawl?
Priority scoring model:
priority(url) = w1 * page_importance + w2 * freshness_need + w3 * discovery_signal
1. Page Importance (static, updated periodically):
- PageRank or simplified version (domain authority)
- URL depth (shallower = more important: /about > /blog/2020/old-post)
- Domain priority tier (news sites > personal blogs)
2. Freshness Need (dynamic, per-page):
- Estimated change frequency based on past crawls:
If page changed in 3 of last 5 crawls → high change rate → re-crawl sooner
- Content type: news homepage → re-crawl every 15 minutes; static about page → re-crawl weekly
- Adaptive re-crawl interval:
new_interval = old_interval * 2 (if unchanged)
new_interval = old_interval / 2 (if changed)
Clamped to [15 minutes, 30 days]
3. Discovery Signal:
- New URL never crawled → boost priority
- URL discovered from multiple sources → likely important
- URL from sitemap.xml → explicitly requested by site owner
Implementation:
Priority score computed when URL enters frontier:
- New URLs: high base priority (discovery bonus)
- Re-crawl URLs: priority based on estimated change rate * importance
Frontier maintains separate queues for:
1. Never-crawled URLs (exploration)
2. Re-crawl URLs (freshness maintenance)
Budget allocation: 70% re-crawl (keep index fresh), 30% exploration (discover new content)
Starvation prevention:
- Every domain gets at least 1 crawl per day regardless of priority
- Low-priority URLs age-out: priority increases linearly with time since last crawl
- After 30 days with no crawl, any URL reaches maximum priority
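The adaptive re-crawl rule and the weighted priority score with linear aging can be sketched as below. The weights w1..w3 are illustrative assumptions (the formula above does not fix them), and the aging term implements the 30-day starvation bound:

```python
def recrawl_interval(old_hours: float, changed: bool,
                     min_hours: float = 0.25, max_hours: float = 720.0) -> float:
    """Adaptive re-crawl interval: halve if the page changed, double if not.

    Clamped to [15 minutes, 30 days] as described above.
    """
    new_hours = old_hours / 2 if changed else old_hours * 2
    return max(min_hours, min(max_hours, new_hours))

def priority(importance: float, freshness_need: float, discovery: float,
             hours_since_last_crawl: float = 0.0,
             w1: float = 0.5, w2: float = 0.3, w3: float = 0.2) -> float:
    """priority = w1*importance + w2*freshness_need + w3*discovery + aging.

    The aging term rises linearly and maxes out at 720 hours (30 days)
    without a crawl, so no URL starves indefinitely.
    """
    aging = min(hours_since_last_crawl / 720.0, 1.0)
    return w1 * importance + w2 * freshness_need + w3 * discovery + aging
```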
7. Extensions (2 min)
- Incremental/delta crawling: Instead of re-downloading the entire page, use HTTP conditional requests (If-Modified-Since, ETag). If the server supports it, we get a 304 Not Modified → skip download, saving ~90% of re-crawl bandwidth.
- JavaScript rendering: Many modern pages require JS execution to render content. Add a headless browser farm (Puppeteer/Playwright) as a second-pass crawler for JS-heavy domains. First pass: raw HTML. Second pass (selective): rendered DOM. Trade-off: 10x slower, 50x more CPU.
- Distributed crawler coordination: For multi-datacenter deployment, assign domain ranges to specific datacenters to avoid duplicate crawling. Use consistent hashing on domain to assign “ownership.” Crawler in US handles .com/.org, crawler in EU handles .de/.fr, etc.
- Link graph extraction: In addition to storing page content, extract and store the link graph (source URL → destination URL). This feeds PageRank computation, spam detection, and web structure analysis. Store as an edge list in a graph database or adjacency list in HDFS.
- Compliance and legal: Respect noindex/nofollow meta tags. Honor opt-out requests (GDPR right to be forgotten). Maintain a domain blocklist for DMCA/legal takedowns. Log all crawl activity for audit trails.
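The incremental-crawling extension above can be illustrated with a conditional GET built from the standard library. The validator values (ETag, Last-Modified) are whatever the previous crawl's response carried; a 304 Not Modified reply means the stored copy is still current and the body download is skipped:

```python
from urllib.request import Request

def make_conditional_request(url: str, etag=None, last_modified=None) -> Request:
    """Build a conditional GET from the previous crawl's response metadata.

    If-None-Match carries the stored ETag (strong validator);
    If-Modified-Since carries the stored Last-Modified date. A server
    that supports them answers 304 with no body when the page is unchanged.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return Request(url, headers=headers)

def should_download(status_code: int) -> bool:
    """304 Not Modified → keep the stored copy, skip the download."""
    return status_code != 304
```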