1. Requirements & Scope (5 min)

Functional Requirements

  1. Crawl the web starting from a seed set of URLs, discovering and following links to new pages
  2. Download and store the HTML content of each page for downstream indexing/processing
  3. Respect robots.txt directives and crawl-delay policies for every domain
  4. Detect and avoid duplicate URLs (normalization) and duplicate content (near-dedup)
  5. Support prioritized crawling — important/fresh pages crawled more frequently than stale/low-quality pages

Non-Functional Requirements

  • Availability: 99.9% — the crawler should run continuously. Brief outages are acceptable (we just resume from where we left off).
  • Throughput: Crawl 1 billion pages/day (~12,000 pages/sec sustained). Scale to 5 billion pages total in the index.
  • Latency: Not a real-time system. End-to-end latency from URL discovery to content storage can be minutes to hours.
  • Politeness: Never overload a single web server. Maximum 1 request/second per domain by default, respect Crawl-Delay.
  • Robustness: Handle spider traps, malformed HTML, infinite URL spaces (calendars, session IDs), and adversarial pages gracefully.

2. Estimation (3 min)

Throughput

  • Target: 1 billion pages/day
  • 1B / 86,400 s ≈ 11,600 pages/sec → budget ~12,000 pages/sec
  • Average page size: 100KB (HTML + headers)
  • Download bandwidth: 12,000 x 100KB = 1.2 GB/sec = ~10 Gbps

Storage

  • 5 billion pages in the index
  • Average compressed page: 30KB (HTML compresses ~3:1)
  • Total content storage: 5B x 30KB = 150 TB
  • URL frontier (URLs to crawl): 10 billion URLs x 200 bytes = 2 TB
  • URL seen set (deduplication): 50 billion URLs x 8 bytes (fingerprint) = 400 GB — fits in distributed memory

DNS

  • Unique domains: ~200 million
  • DNS resolution: must cache aggressively. At 12,000 pages/sec, we cannot afford a fresh DNS lookup for every page.
  • DNS cache: 200M domains x 100 bytes = 20 GB — fits in memory

Machines

  • Each crawler worker: ~200 concurrent connections, 500 pages/sec per worker
  • 12,000 / 500 = 24 crawler workers (plus headroom → 40 workers)
  • Each worker: 32 cores, 64GB RAM, 10 Gbps NIC
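These estimates can be sanity-checked in a few lines. The constants below simply mirror the numbers above; nothing here is new data:

```python
# Back-of-envelope sanity check for the estimates above.
PAGES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
AVG_PAGE_KB = 100            # raw HTML + headers
COMPRESSED_KB = 30           # ~3:1 gzip compression
INDEX_PAGES = 5_000_000_000
PAGES_PER_WORKER = 500

pages_per_sec = PAGES_PER_DAY / SECONDS_PER_DAY                 # ~11,600
bandwidth_gbps = pages_per_sec * AVG_PAGE_KB * 1e3 * 8 / 1e9    # ~9.3 Gbps
content_tb = INDEX_PAGES * COMPRESSED_KB * 1e3 / 1e12           # 150 TB
workers = -(-pages_per_sec // PAGES_PER_WORKER)                 # ceiling -> 24

print(round(pages_per_sec), round(bandwidth_gbps, 1), content_tb, int(workers))
```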

3. API Design (3 min)

The web crawler is not a user-facing API service — it is an internal batch processing system. However, it has internal control and data interfaces.

Control API

// Add seed URLs to the frontier
POST /crawler/seeds
  Body: { "urls": ["https://example.com", "https://news.site.com"] }
  Response 200: { "added": 2 }

// Configure crawl policy for a domain
PUT /crawler/policies/{domain}
  Body: {
    "max_qps": 1.0,                    // requests per second
    "crawl_depth": 10,                  // max link hops from seed
    "priority": "high",                 // high, medium, low
    "recrawl_interval_hours": 24,       // how often to re-crawl
    "user_agent": "MyBot/1.0"
  }

// Get crawl status
GET /crawler/status
  Response 200: {
    "pages_crawled_today": 892000000,
    "frontier_size": 3400000000,
    "active_workers": 38,
    "avg_pages_per_sec": 11500,
    "errors_today": 42000
  }

// Pause/resume crawling for a domain
POST /crawler/domains/{domain}/pause
POST /crawler/domains/{domain}/resume

Data Output

// Crawled pages written to distributed storage
Path: s3://crawl-data/{date}/{domain_hash}/{url_hash}.warc.gz

WARC record format:
  WARC-Type: response
  WARC-Target-URI: https://example.com/page
  WARC-Date: 2026-02-22T10:30:00Z
  Content-Type: application/http; msgtype=response
  [HTTP response headers]
  [HTML body]

Key Decisions

  • WARC format for storage — industry standard (used by Internet Archive, Common Crawl)
  • Domain-level politeness is enforced in the frontier, not at the worker level
  • Crawl data flows to downstream pipeline (indexer, content processor) via message queue

4. Data Model (3 min)

URL Frontier (Redis + disk-backed priority queue)

Per-domain queue:
  Key: frontier:{domain}
  Type: Priority Queue (sorted by priority score)
  Entry: {
    url             | string
    priority        | float       -- higher = crawl sooner
    discovered_at   | int64
    depth           | int         -- hops from seed
    last_crawled    | int64       -- 0 if never crawled
    retries         | int
  }

Domain metadata:
  Key: domain:{domain}
  Fields: {
    robots_txt       | string     -- cached robots.txt content
    robots_fetched   | int64      -- when we last fetched robots.txt
    crawl_delay      | float      -- from robots.txt or policy
    last_request     | int64      -- timestamp of last request (politeness)
    dns_ip           | string     -- cached DNS resolution
    dns_ttl          | int64      -- DNS cache expiry
  }

URL Seen Set (Bloom filter + backing store)

In-memory: Bloom filter
  - 50 billion URL fingerprints
  - 10 bits per element, 7 hash functions → ~1% false positive rate
  - Size: 50B x 10 bits = 62.5 GB → distributed across workers

Backing store (RocksDB / LevelDB per worker):
  Key: MD5(normalized_url)    -- 16 bytes
  Value: {last_crawled, content_hash}  -- 16 bytes
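The ~1% false-positive figure follows from the standard Bloom filter formula p ≈ (1 - e^(-kn/m))^k. A quick check with m/n = 10 bits per element and k = 7 hash functions:

```python
import math

bits_per_element = 10   # m/n from the sizing above
k = 7                   # hash functions

# False-positive probability: p = (1 - e^(-k * n/m))^k
p = (1 - math.exp(-k / bits_per_element)) ** k
print(f"{p:.4f}")   # comes out just under 1%
```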

Crawled Content (S3 / HDFS)

WARC files:
  - One WARC file per domain-hour batch
  - Compressed (gzip): ~30KB per page average
  - Stored in S3 with lifecycle policy (keep 3 most recent crawls per page)

Why These Choices

  • Redis for frontier hot state — fast priority queue operations, domain-level rate limiting
  • Bloom filter for seen set — O(1) membership test, acceptable false positive rate (re-crawling a page is wasteful but not harmful)
  • RocksDB for persistent dedup — handles billions of keys efficiently on local SSDs
  • S3/WARC for content — cheap, durable, industry standard, decoupled from crawl process

5. High-Level Design (12 min)

Architecture

Seed URLs
  │
  ▼
┌────────────────────────────────────────────────────────┐
│                    URL Frontier Service                  │
│                                                          │
│  ┌──────────────────────┐  ┌────────────────────────┐  │
│  │ Priority Scheduler    │  │ Politeness Enforcer    │  │
│  │ (global priority      │  │ (per-domain rate limit,│  │
│  │  across all domains)  │  │  robots.txt cache)     │  │
│  └──────────┬───────────┘  └────────────┬───────────┘  │
│             │                            │               │
│             └──────────┬─────────────────┘               │
│                        │                                  │
│                  Emit next URL                            │
│                  (highest priority                        │
│                   + politeness OK)                        │
└────────────────────────┬─────────────────────────────────┘
                         │
                         ▼
┌────────────────────────────────────────────────────────┐
│              Crawler Workers (40 machines)               │
│                                                          │
│  Per worker:                                             │
│  1. Receive URL from frontier                            │
│  2. DNS resolution (local cache → DNS resolver)          │
│  3. HTTP fetch (async, 200 concurrent connections)       │
│  4. Parse robots.txt (if not cached)                     │
│  5. Download page (follow redirects, handle timeouts)    │
│  6. Extract links from HTML                              │
│  7. Normalize discovered URLs                            │
│  8. Content fingerprint (SimHash for near-dedup)         │
│  └─→ Send results to output pipeline                    │
└────────────────────────┬─────────────────────────────────┘
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
     ┌──────────┐ ┌──────────┐ ┌──────────┐
     │ Content   │ │ URL      │ │ Link     │
     │ Store     │ │ Dedup    │ │ Discovery│
     │ (S3/WARC) │ │ Service  │ │ → back to│
     │           │ │ (bloom + │ │  frontier │
     │           │ │  RocksDB)│ │           │
     └──────────┘ └──────────┘ └──────────┘

URL Frontier Detail (the brain of the crawler)

Two-level queue architecture:

Level 1: Priority queues (front queues)
  ┌──────┐ ┌──────┐ ┌──────┐
  │ High │ │ Med  │ │ Low  │
  │prio  │ │prio  │ │prio  │
  └──┬───┘ └──┬───┘ └──┬───┘
     │        │        │
     └────────┼────────┘
              │
     Weighted selection (80% high, 15% med, 5% low)
              │
              ▼
Level 2: Per-domain queues (back queues)
  ┌────────────┐ ┌────────────┐ ┌────────────┐
  │example.com │ │news.com    │ │blog.org    │
  │ [url, url] │ │ [url, url] │ │ [url, url] │
  │ next: 1.2s │ │ next: 0.5s │ │ next: 3.0s │
  └────────────┘ └────────────┘ └────────────┘

  Each back queue has a "next allowed fetch time"
  Scheduler picks the queue with earliest "next" time
  After fetch: next_time += 1/crawl_rate (e.g., +1 second)

This ensures:
  - Important pages are crawled first (front queue priority)
  - No domain is overwhelmed (back queue rate limiting)
  - Fair distribution across domains
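A minimal sketch of the back-queue side of this design (class and method names are hypothetical): per-domain FIFO queues plus a min-heap keyed on each domain's "next allowed fetch time", so the scheduler always sees the earliest-eligible domain at the top.

```python
import heapq
import time

class BackQueueScheduler:
    """Per-domain back queues with a min-heap on next-allowed-fetch time."""

    def __init__(self):
        self.queues = {}   # domain -> list of (url, crawl_delay)
        self.heap = []     # (next_allowed_time, domain)

    def add(self, domain, url, crawl_delay=1.0, now=None):
        now = time.time() if now is None else now
        if domain not in self.queues:
            self.queues[domain] = []
            heapq.heappush(self.heap, (now, domain))  # eligible immediately
        self.queues[domain].append((url, crawl_delay))

    def next_url(self, now=None):
        """Return (url, domain) for the earliest-eligible domain, or None."""
        now = time.time() if now is None else now
        if not self.heap or self.heap[0][0] > now:
            return None                      # no domain is polite to fetch yet
        _, domain = heapq.heappop(self.heap)
        url, delay = self.queues[domain].pop(0)  # FIFO within a domain
        if self.queues[domain]:
            # Re-schedule: next fetch no sooner than now + crawl_delay
            heapq.heappush(self.heap, (now + delay, domain))
        else:
            del self.queues[domain]
        return url, domain
```

Calling `next_url` again for the same domain inside its crawl delay returns None, which is exactly the politeness guarantee the back queues exist to provide.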

Components

  1. URL Frontier Service: Central scheduling brain. Manages priority + politeness. Distributed across multiple nodes with domain-level sharding.
  2. Crawler Workers (40): Stateless HTTP fetchers. Pull URLs from frontier, download pages, extract links. Async I/O (epoll) for high concurrency.
  3. DNS Resolver Cache: Local DNS cache on each worker + shared Redis DNS cache. TTL-aware. Falls back to upstream DNS only on cache miss.
  4. URL Dedup Service: Bloom filter (in-memory) + RocksDB (persistent). Checks if URL was already crawled or queued. Normalized URL fingerprint as key.
  5. Content Store (S3): WARC files organized by domain and date. Lifecycle management keeps recent crawls, archives older versions.
  6. robots.txt Cache: Redis-backed, refreshed every 24 hours per domain. Parsed into allow/disallow rules for fast path matching.
  7. Monitoring: Pages crawled/sec, error rates by type (timeout, 4xx, 5xx), frontier size, domain coverage, politeness violations.
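Component 3's TTL-aware caching can be sketched as a small wrapper (names hypothetical; the upstream resolver is injected so the cache logic stays testable without network access):

```python
import time

class DnsCache:
    """TTL-aware DNS cache; falls back to the upstream resolver on miss/expiry."""

    def __init__(self, resolve_upstream, default_ttl=300):
        self.resolve_upstream = resolve_upstream  # domain -> (ip, ttl_seconds)
        self.default_ttl = default_ttl
        self.cache = {}                           # domain -> (ip, expires_at)

    def resolve(self, domain, now=None):
        now = time.time() if now is None else now
        hit = self.cache.get(domain)
        if hit and hit[1] > now:
            return hit[0]                         # fresh cache entry
        ip, ttl = self.resolve_upstream(domain)   # cache miss or expired
        self.cache[domain] = (ip, now + (ttl or self.default_ttl))
        return ip
```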

6. Deep Dives (15 min)

Deep Dive 1: URL Normalization and Deduplication

The problem: The same page can be reached via many different URLs. Without normalization, we waste bandwidth re-crawling identical content.

URL normalization rules:

1. Lowercase scheme and host:
   HTTP://Example.COM/Page → http://example.com/Page

2. Remove default ports:
   http://example.com:80/page → http://example.com/page
   https://example.com:443/page → https://example.com/page

3. Remove fragment:
   http://example.com/page#section → http://example.com/page

4. Sort query parameters:
   http://example.com/page?b=2&a=1 → http://example.com/page?a=1&b=2

5. Remove tracking parameters:
   http://example.com/page?utm_source=twitter&id=5 → http://example.com/page?id=5
   (Maintain a list: utm_source, utm_medium, utm_campaign, fbclid, gclid, etc.)

6. Remove trailing slash:
   http://example.com/page/ → http://example.com/page

7. Decode unnecessary percent-encoding:
   http://example.com/%7Euser → http://example.com/~user

8. Canonicalize path:
   http://example.com/a/../b/./c → http://example.com/b/c
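Rules 1-6 can be sketched with the standard library alone (the tracking-parameter list here is abbreviated; percent-decoding and path canonicalization, rules 7-8, are omitted for brevity):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
DEFAULT_PORTS = {"http": "80", "https": "443"}

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                       # rule 1: lowercase scheme
    host = parts.hostname.lower() if parts.hostname else ""
    if parts.port and str(parts.port) != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"                   # rule 2: drop default port
    # rules 4-5: drop tracking params, sort the rest
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS))
    path = parts.path.rstrip("/") or "/"                # rule 6: trailing slash
    return urlunsplit((scheme, host, path, query, ""))  # rule 3: drop fragment
```

Note that path case is preserved (paths are case-sensitive on most servers), matching the example under rule 1.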

URL fingerprinting:

After normalization:
  fingerprint = MD5(normalized_url)  -- 128-bit, 16 bytes
  (MD5 is fine for dedup — we're not using it for security)

Bloom filter check:
  if bloom_filter.might_contain(fingerprint):
    check RocksDB for definitive answer (handle false positives)
  else:
    definitely new URL → add to frontier and bloom filter

Content-level deduplication (near-dedup):

Problem: Same content at different URLs (mirrors, syndication, URL parameters that don't change content)

SimHash (locality-sensitive hashing):
  1. Extract text from HTML (strip tags)
  2. Tokenize into shingles (3-word sliding window)
  3. Hash each shingle → 64-bit hash
  4. For each bit position: if hash bit = 1, add +1; if 0, add -1
  5. Final SimHash: for each position, if sum > 0 → 1, else → 0

  Two documents with SimHash Hamming distance ≤ 3 (out of 64 bits) → near-duplicate
  → Skip crawling or mark as duplicate in index

Storage: 8 bytes per page. For 5B pages: 40 GB. Searchable via bit-partitioned index.
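The five SimHash steps can be sketched directly. This uses MD5 (truncated to 64 bits) as the per-shingle hash for determinism; any fast 64-bit hash would do:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash over 3-word shingles, per the steps above."""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    counts = [0] * bits
    for sh in shingles:
        h = int.from_bytes(hashlib.md5(sh.encode()).digest()[:8], "big")
        for bit in range(bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1   # step 4: +1 / -1
    # Step 5: bit is 1 where the sum is positive
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    return hamming(a, b) <= threshold
```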

Deep Dive 2: Handling Spider Traps and Adversarial Pages

Spider traps — infinite URL spaces:

1. Calendar trap:
   /calendar/2026/02/23 → links to /calendar/2026/02/24 → ... infinitely
   Detection: URL path depth limit (e.g., max 15 segments)

2. Session ID trap:
   /page?session=abc → contains link to /page?session=def → ... infinite unique URLs
   Detection: URL normalization strips session-like parameters
   Heuristic: if a parameter value changes but content hash is identical → parameter is cosmetic

3. Soft 404s:
   /nonexistent-page returns 200 OK with "Page not found" content
   Detection: content similarity to known 404 pages of the domain

4. Adversarial content:
   Page generates random links to keep crawler busy
   Detection: anomaly detection per domain — if one domain generates 100x more unique URLs than
   similar-sized domains, flag and throttle

5. Redirect loops:
   A → B → C → A
   Detection: max redirect depth (e.g., 10 hops)

Defenses:

Per-domain limits:
  - Max URLs per domain in frontier: 100,000
  - Max crawl depth from seed: 15 hops
  - Max URL length: 2048 characters
  - Max page size: 10MB (truncate after)

Per-page limits:
  - Max links extracted per page: 500
  - Max redirect hops: 10
  - Fetch timeout: 30 seconds
  - Connection timeout: 10 seconds
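The structural limits above make a cheap pre-filter before a URL enters the frontier. A sketch (function names hypothetical, constants taken from the lists above):

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048
MAX_PATH_DEPTH = 15
MAX_LINKS_PER_PAGE = 500

def url_passes_limits(url: str) -> bool:
    """Cheap structural checks applied before a URL enters the frontier."""
    if len(url) > MAX_URL_LENGTH:
        return False
    path = urlsplit(url).path
    depth = len([seg for seg in path.split("/") if seg])  # non-empty segments
    return depth <= MAX_PATH_DEPTH

def cap_links(links: list[str]) -> list[str]:
    """Keep at most MAX_LINKS_PER_PAGE extracted links that pass the checks."""
    return [u for u in links[:MAX_LINKS_PER_PAGE] if url_passes_limits(u)]
```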

Adaptive throttling:
  - If a domain's error rate > 50% over last 100 requests → pause for 1 hour
  - If a domain generates > 10,000 URLs/hour from a single page → flag as trap, stop crawling that subtree
  - Manual blocklist for known adversarial domains
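The error-rate rule can be implemented as a small per-domain sliding window (a sketch with hypothetical names; thresholds match the numbers above):

```python
from collections import deque

class DomainThrottle:
    """Pause a domain when its recent error rate exceeds a threshold."""

    WINDOW = 100           # look at the last 100 requests
    ERROR_THRESHOLD = 0.5  # >50% errors -> pause
    PAUSE_SECONDS = 3600   # 1 hour

    def __init__(self):
        self.results = deque(maxlen=self.WINDOW)  # True = success
        self.paused_until = 0.0

    def record(self, ok: bool, now: float):
        self.results.append(ok)
        if len(self.results) == self.WINDOW:
            error_rate = 1 - sum(self.results) / self.WINDOW
            if error_rate > self.ERROR_THRESHOLD:
                self.paused_until = now + self.PAUSE_SECONDS
                self.results.clear()   # fresh window after the pause

    def allowed(self, now: float) -> bool:
        return now >= self.paused_until
```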

Deep Dive 3: Crawl Prioritization and Freshness

The problem: 50 billion discoverable URLs, but we can only crawl 1 billion/day. How do we decide what to crawl first, and how often to re-crawl?

Priority scoring model:

priority(url) = w1 * page_importance + w2 * freshness_need + w3 * discovery_signal

1. Page Importance (static, updated periodically):
   - PageRank or simplified version (domain authority)
   - URL depth (shallower = more important: /about > /blog/2020/old-post)
   - Domain priority tier (news sites > personal blogs)

2. Freshness Need (dynamic, per-page):
   - Estimated change frequency based on past crawls:
     If page changed in 3 of last 5 crawls → high change rate → re-crawl sooner
   - Content type: news homepage → re-crawl every 15 minutes; static about page → re-crawl weekly
   - Adaptive re-crawl interval:
       new_interval = old_interval * 2 (if unchanged)
       new_interval = old_interval / 2 (if changed)
       Clamped to [15 minutes, 30 days]

3. Discovery Signal:
   - New URL never crawled → boost priority
   - URL discovered from multiple sources → likely important
   - URL from sitemap.xml → explicitly requested by site owner

Implementation:

Priority score computed when URL enters frontier:
  - New URLs: high base priority (discovery bonus)
  - Re-crawl URLs: priority based on estimated change rate * importance

Frontier maintains separate queues for:
  1. Never-crawled URLs (exploration)
  2. Re-crawl URLs (freshness maintenance)

Budget allocation: 70% re-crawl (keep index fresh), 30% exploration (discover new content)

Starvation prevention:
  - Every domain gets at least 1 crawl per day regardless of priority
  - Low-priority URLs age-out: priority increases linearly with time since last crawl
  - After 30 days with no crawl, any URL reaches maximum priority
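The adaptive re-crawl interval and the linear aging rule are both one-liners; a sketch with the clamps stated above (function names hypothetical, priorities assumed to be in [0, 1]):

```python
MIN_INTERVAL = 15 * 60           # 15 minutes
MAX_INTERVAL = 30 * 24 * 3600    # 30 days

def next_recrawl_interval(old_interval: float, changed: bool) -> float:
    """Halve the interval when the page changed, double it when it did not."""
    new = old_interval / 2 if changed else old_interval * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, new))

def aged_priority(base_priority: float, seconds_since_crawl: float,
                  max_priority: float = 1.0) -> float:
    """Linear aging: any URL reaches max priority after 30 days uncrawled."""
    boost = (max_priority - base_priority) * min(1.0, seconds_since_crawl / MAX_INTERVAL)
    return base_priority + boost
```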

7. Extensions (2 min)

  • Incremental/delta crawling: Instead of re-downloading the entire page, use HTTP conditional requests (If-Modified-Since, ETag). If the server supports it, we get a 304 Not Modified → skip download, saving ~90% of re-crawl bandwidth.
  • JavaScript rendering: Many modern pages require JS execution to render content. Add a headless browser farm (Puppeteer/Playwright) as a second-pass crawler for JS-heavy domains. First pass: raw HTML. Second pass (selective): rendered DOM. Trade-off: 10x slower, 50x more CPU.
  • Distributed crawler coordination: For multi-datacenter deployment, assign domain ranges to specific datacenters to avoid duplicate crawling. Use consistent hashing on domain to assign “ownership.” Crawler in US handles .com/.org, crawler in EU handles .de/.fr, etc.
  • Link graph extraction: In addition to storing page content, extract and store the link graph (source URL → destination URL). This feeds PageRank computation, spam detection, and web structure analysis. Store as an edge list in a graph database or adjacency list in HDFS.
  • Compliance and legal: Respect noindex/nofollow meta tags. Honor opt-out requests (GDPR right to be forgotten). Maintain a domain blocklist for DMCA/legal takedowns. Log all crawl activity for audit trails.