1. Requirements & Scope (5 min)

Functional Requirements

  1. Crawl the web starting from a seed set of URLs, discovering and following links to new pages
  2. Download and store the HTML content of each page for downstream indexing/processing
  3. Respect robots.txt directives and crawl-delay policies for every domain
  4. Detect and avoid duplicate URLs (normalization) and duplicate content (near-dedup)
  5. Support prioritized crawling — important/fresh pages crawled more frequently than stale/low-quality pages

Non-Functional Requirements

  • Availability: 99.9% — the crawler should run continuously. Brief outages are acceptable (we just resume from where we left off).
  • Throughput: Crawl 1 billion pages/day (~12,000 pages/sec sustained). Scale to 5 billion pages total in the index.
  • Latency: Not a real-time system. End-to-end latency from URL discovery to content storage can be minutes to hours.
  • Politeness: Never overload a single web server. Maximum 1 request/second per domain by default, respect Crawl-Delay.
  • Robustness: Handle spider traps, malformed HTML, infinite URL spaces (calendars, session IDs), and adversarial pages gracefully.

2. Estimation (3 min)

Throughput

  • Target: 1 billion pages/day
  • 1B / 86,400 s ≈ 11,600 pages/sec → budget ~12,000 pages/sec
  • Average page size: 100KB (HTML + headers)
  • Download bandwidth: 12,000 x 100KB = 1.2 GB/sec = ~10 Gbps

Storage

  • 5 billion pages in the index
  • Average compressed page: 30KB (HTML compresses ~3:1)
  • Total content storage: 5B x 30KB = 150 TB
  • URL frontier (URLs to crawl): 10 billion URLs x 200 bytes = 2 TB
  • URL seen set (deduplication): 50 billion URLs x 8 bytes (fingerprint) = 400 GB — fits in distributed memory

DNS

  • Unique domains: ~200 million
  • DNS resolution: must cache aggressively. At 12,000 pages/sec, we cannot afford a fresh DNS lookup for every page.
  • DNS cache: 200M domains x 100 bytes = 20 GB — fits in memory

Machines

  • Each crawler worker: ~200 concurrent connections, 500 pages/sec per worker
  • 12,000 / 500 = 24 crawler workers (plus headroom → 40 workers)
  • Each worker: 32 cores, 64GB RAM, 10 Gbps NIC
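These estimates can be sanity-checked in a few lines. The constants below simply mirror the numbers above; nothing here is new data:

```python
# Back-of-envelope sanity check for the estimates above.
PAGES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
AVG_PAGE_KB = 100            # raw HTML + headers
COMPRESSED_KB = 30           # ~3:1 gzip compression
INDEX_PAGES = 5_000_000_000
PAGES_PER_WORKER = 500

pages_per_sec = PAGES_PER_DAY / SECONDS_PER_DAY                 # ~11,600
bandwidth_gbps = pages_per_sec * AVG_PAGE_KB * 1e3 * 8 / 1e9    # ~9.3 Gbps
content_tb = INDEX_PAGES * COMPRESSED_KB * 1e3 / 1e12           # 150 TB
workers = -(-pages_per_sec // PAGES_PER_WORKER)                 # ceiling -> 24

print(round(pages_per_sec), round(bandwidth_gbps, 1), content_tb, int(workers))
```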

3. API Design (3 min)

The web crawler is not a user-facing API service — it is an internal batch processing system. However, it has internal control and data interfaces.

Control API

// Add seed URLs to the frontier
POST /crawler/seeds
  Body: { "urls": ["https://example.com", "https://news.site.com"] }
  Response 200: { "added": 2 }

// Configure crawl policy for a domain
PUT /crawler/policies/{domain}
  Body: {
    "max_qps": 1.0,                    // requests per second
    "crawl_depth": 10,                  // max link hops from seed
    "priority": "high",                 // high, medium, low
    "recrawl_interval_hours": 24,       // how often to re-crawl
    "user_agent": "MyBot/1.0"
  }

// Get crawl status
GET /crawler/status
  Response 200: {
    "pages_crawled_today": 892000000,
    "frontier_size": 3400000000,
    "active_workers": 38,
    "avg_pages_per_sec": 11500,
    "errors_today": 42000
  }

// Pause/resume crawling for a domain
POST /crawler/domains/{domain}/pause
POST /crawler/domains/{domain}/resume

Data Output

// Crawled pages written to distributed storage
Path: s3://crawl-data/{date}/{domain_hash}/{url_hash}.warc.gz

WARC record format:
  WARC-Type: response
  WARC-Target-URI: https://example.com/page
  WARC-Date: 2026-02-22T10:30:00Z
  Content-Type: application/http; msgtype=response
  [HTTP response headers]
  [HTML body]

Key Decisions

  • WARC format for storage — industry standard (used by Internet Archive, Common Crawl)
  • Domain-level politeness is enforced in the frontier, not at the worker level
  • Crawl data flows to downstream pipeline (indexer, content processor) via message queue

4. Data Model (3 min)

URL Frontier (Redis + disk-backed priority queue)

Per-domain queue:
  Key: frontier:{domain}
  Type: Priority Queue (sorted by priority score)
  Entry: {
    url             | string
    priority        | float       -- higher = crawl sooner
    discovered_at   | int64
    depth           | int         -- hops from seed
    last_crawled    | int64       -- 0 if never crawled
    retries         | int
  }

Domain metadata:
  Key: domain:{domain}
  Fields: {
    robots_txt       | string     -- cached robots.txt content
    robots_fetched   | int64      -- when we last fetched robots.txt
    crawl_delay      | float      -- from robots.txt or policy
    last_request     | int64      -- timestamp of last request (politeness)
    dns_ip           | string     -- cached DNS resolution
    dns_ttl          | int64      -- DNS cache expiry
  }

URL Seen Set (Bloom filter + backing store)

In-memory: Bloom filter
  - 50 billion URL fingerprints
  - 10 bits per element, 7 hash functions → ~1% false positive rate
  - Size: 50B x 10 bits = 62.5 GB → distributed across workers

Backing store (RocksDB / LevelDB per worker):
  Key: MD5(normalized_url)    -- 16 bytes
  Value: {last_crawled, content_hash}  -- 16 bytes
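The ~1% false-positive figure follows from the standard Bloom filter formula p ≈ (1 - e^(-kn/m))^k. A quick check with m/n = 10 bits per element and k = 7 hash functions:

```python
import math

bits_per_element = 10   # m/n from the sizing above
k = 7                   # hash functions

# False-positive probability: p = (1 - e^(-k * n/m))^k
p = (1 - math.exp(-k / bits_per_element)) ** k
print(f"{p:.4f}")   # comes out just under 1%
```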

Crawled Content (S3 / HDFS)

WARC files:
  - One WARC file per domain-hour batch
  - Compressed (gzip): ~30KB per page average
  - Stored in S3 with lifecycle policy (keep 3 most recent crawls per page)

Why These Choices

  • Redis for frontier hot state — fast priority queue operations, domain-level rate limiting
  • Bloom filter for seen set — O(1) membership test, acceptable false positive rate (re-crawling a page is wasteful but not harmful)
  • RocksDB for persistent dedup — handles billions of keys efficiently on local SSDs
  • S3/WARC for content — cheap, durable, industry standard, decoupled from crawl process

5. High-Level Design (12 min)

Architecture

Seed URLs
  │
  ▼
┌────────────────────────────────────────────────────────┐
│                    URL Frontier Service                  │
│                                                          │
│  ┌──────────────────────┐  ┌────────────────────────┐  │
│  │ Priority Scheduler    │  │ Politeness Enforcer    │  │
│  │ (global priority      │  │ (per-domain rate limit,│  │
│  │  across all domains)  │  │  robots.txt cache)     │  │
│  └──────────┬───────────┘  └────────────┬───────────┘  │
│             │                            │               │
│             └──────────┬─────────────────┘               │
│                        │                                  │
│                  Emit next URL                            │
│                  (highest priority                        │
│                   + politeness OK)                        │
└────────────────────────┬─────────────────────────────────┘
                         │
                         ▼
┌────────────────────────────────────────────────────────┐
│              Crawler Workers (40 machines)               │
│                                                          │
│  Per worker:                                             │
│  1. Receive URL from frontier                            │
│  2. DNS resolution (local cache → DNS resolver)          │
│  3. HTTP fetch (async, 200 concurrent connections)       │
│  4. Parse robots.txt (if not cached)                     │
│  5. Download page (follow redirects, handle timeouts)    │
│  6. Extract links from HTML                              │
│  7. Normalize discovered URLs                            │
│  8. Content fingerprint (SimHash for near-dedup)         │
│  └─→ Send results to output pipeline                    │
└────────────────────────┬─────────────────────────────────┘
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
     ┌──────────┐ ┌──────────┐ ┌──────────┐
     │ Content   │ │ URL      │ │ Link     │
     │ Store     │ │ Dedup    │ │ Discovery│
     │ (S3/WARC) │ │ Service  │ │ → back to│
     │           │ │ (bloom + │ │  frontier │
     │           │ │  RocksDB)│ │           │
     └──────────┘ └──────────┘ └──────────┘

URL Frontier Detail (the brain of the crawler)

Two-level queue architecture:

Level 1: Priority queues (front queues)
  ┌──────┐ ┌──────┐ ┌──────┐
  │ High │ │ Med  │ │ Low  │
  │prio  │ │prio  │ │prio  │
  └──┬───┘ └──┬───┘ └──┬───┘
     │        │        │
     └────────┼────────┘
              │
     Weighted selection (80% high, 15% med, 5% low)
              │
              ▼
Level 2: Per-domain queues (back queues)
  ┌────────────┐ ┌────────────┐ ┌────────────┐
  │example.com │ │news.com    │ │blog.org    │
  │ [url, url] │ │ [url, url] │ │ [url, url] │
  │ next: 1.2s │ │ next: 0.5s │ │ next: 3.0s │
  └────────────┘ └────────────┘ └────────────┘

  Each back queue has a "next allowed fetch time"
  Scheduler picks the queue with earliest "next" time
  After fetch: next_time += 1/crawl_rate (e.g., +1 second)

This ensures:
  - Important pages are crawled first (front queue priority)
  - No domain is overwhelmed (back queue rate limiting)
  - Fair distribution across domains
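A minimal sketch of the back-queue side of this design (class and method names are hypothetical): per-domain FIFO queues plus a min-heap keyed on each domain's "next allowed fetch time", so the scheduler always sees the earliest-eligible domain at the top.

```python
import heapq
import time

class BackQueueScheduler:
    """Per-domain back queues with a min-heap on next-allowed-fetch time."""

    def __init__(self):
        self.queues = {}   # domain -> list of (url, crawl_delay)
        self.heap = []     # (next_allowed_time, domain)

    def add(self, domain, url, crawl_delay=1.0, now=None):
        now = time.time() if now is None else now
        if domain not in self.queues:
            self.queues[domain] = []
            heapq.heappush(self.heap, (now, domain))  # eligible immediately
        self.queues[domain].append((url, crawl_delay))

    def next_url(self, now=None):
        """Return (url, domain) for the earliest-eligible domain, or None."""
        now = time.time() if now is None else now
        if not self.heap or self.heap[0][0] > now:
            return None                      # no domain is polite to fetch yet
        _, domain = heapq.heappop(self.heap)
        url, delay = self.queues[domain].pop(0)  # FIFO within a domain
        if self.queues[domain]:
            # Re-schedule: next fetch no sooner than now + crawl_delay
            heapq.heappush(self.heap, (now + delay, domain))
        else:
            del self.queues[domain]
        return url, domain
```

Calling `next_url` again for the same domain inside its crawl delay returns None, which is exactly the politeness guarantee the back queues exist to provide.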

Components

  1. URL Frontier Service: Central scheduling brain. Manages priority + politeness. Distributed across multiple nodes with domain-level sharding.
  2. Crawler Workers (40): Stateless HTTP fetchers. Pull URLs from frontier, download pages, extract links. Async I/O (epoll) for high concurrency.
  3. DNS Resolver Cache: Local DNS cache on each worker + shared Redis DNS cache. TTL-aware. Falls back to upstream DNS only on cache miss.
  4. URL Dedup Service: Bloom filter (in-memory) + RocksDB (persistent). Checks if URL was already crawled or queued. Normalized URL fingerprint as key.
  5. Content Store (S3): WARC files organized by domain and date. Lifecycle management keeps recent crawls, archives older versions.
  6. robots.txt Cache: Redis-backed, refreshed every 24 hours per domain. Parsed into allow/disallow rules for fast path matching.
  7. Monitoring: Pages crawled/sec, error rates by type (timeout, 4xx, 5xx), frontier size, domain coverage, politeness violations.
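Component 3's TTL-aware caching can be sketched as a small wrapper (names hypothetical; the upstream resolver is injected so the cache logic stays testable without network access):

```python
import time

class DnsCache:
    """TTL-aware DNS cache; falls back to the upstream resolver on miss/expiry."""

    def __init__(self, resolve_upstream, default_ttl=300):
        self.resolve_upstream = resolve_upstream  # domain -> (ip, ttl_seconds)
        self.default_ttl = default_ttl
        self.cache = {}                           # domain -> (ip, expires_at)

    def resolve(self, domain, now=None):
        now = time.time() if now is None else now
        hit = self.cache.get(domain)
        if hit and hit[1] > now:
            return hit[0]                         # fresh cache entry
        ip, ttl = self.resolve_upstream(domain)   # cache miss or expired
        self.cache[domain] = (ip, now + (ttl or self.default_ttl))
        return ip
```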

6. Deep Dives (15 min)

Deep Dive 1: URL Normalization and Deduplication

The problem: The same page can be reached via many different URLs. Without normalization, we waste bandwidth re-crawling identical content.

URL normalization rules:

1. Lowercase scheme and host:
   HTTP://Example.COM/Page → http://example.com/Page

2. Remove default ports:
   http://example.com:80/page → http://example.com/page
   https://example.com:443/page → https://example.com/page

3. Remove fragment:
   http://example.com/page#section → http://example.com/page

4. Sort query parameters:
   http://example.com/page?b=2&a=1 → http://example.com/page?a=1&b=2

5. Remove tracking parameters:
   http://example.com/page?utm_source=twitter&id=5 → http://example.com/page?id=5
   (Maintain a list: utm_source, utm_medium, utm_campaign, fbclid, gclid, etc.)

6. Remove trailing slash:
   http://example.com/page/ → http://example.com/page

7. Decode unnecessary percent-encoding:
   http://example.com/%7Euser → http://example.com/~user

8. Canonicalize path:
   http://example.com/a/../b/./c → http://example.com/b/c
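Rules 1-6 can be sketched with the standard library alone (the tracking-parameter list here is abbreviated; percent-decoding and path canonicalization, rules 7-8, are omitted for brevity):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
DEFAULT_PORTS = {"http": "80", "https": "443"}

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                       # rule 1: lowercase scheme
    host = parts.hostname.lower() if parts.hostname else ""
    if parts.port and str(parts.port) != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"                   # rule 2: drop default port
    # rules 4-5: drop tracking params, sort the rest
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS))
    path = parts.path.rstrip("/") or "/"                # rule 6: trailing slash
    return urlunsplit((scheme, host, path, query, ""))  # rule 3: drop fragment
```

Note that path case is preserved (paths are case-sensitive on most servers), matching the example under rule 1.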

URL fingerprinting:

After normalization:
  fingerprint = MD5(normalized_url)  -- 128-bit, 16 bytes
  (MD5 is fine for dedup — we're not using it for security)

Bloom filter check:
  if bloom_filter.might_contain(fingerprint):
    check RocksDB for definitive answer (handle false positives)
  else:
    definitely new URL → add to frontier and bloom filter

Content-level deduplication (near-dedup):

Problem: Same content at different URLs (mirrors, syndication, URL parameters that don't change content)

SimHash (locality-sensitive hashing):
  1. Extract text from HTML (strip tags)
  2. Tokenize into shingles (3-word sliding window)
  3. Hash each shingle → 64-bit hash
  4. For each bit position: if hash bit = 1, add +1; if 0, add -1
  5. Final SimHash: for each position, if sum > 0 → 1, else → 0

  Two documents with SimHash Hamming distance ≤ 3 (out of 64 bits) → near-duplicate
  → Skip crawling or mark as duplicate in index

Storage: 8 bytes per page. For 5B pages: 40 GB. Searchable via bit-partitioned index.
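The five SimHash steps can be sketched directly. This uses MD5 (truncated to 64 bits) as the per-shingle hash for determinism; any fast 64-bit hash would do:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash over 3-word shingles, per the steps above."""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    counts = [0] * bits
    for sh in shingles:
        h = int.from_bytes(hashlib.md5(sh.encode()).digest()[:8], "big")
        for bit in range(bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1   # step 4: +1 / -1
    # Step 5: bit is 1 where the sum is positive
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    return hamming(a, b) <= threshold
```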

Deep Dive 2: Handling Spider Traps and Adversarial Pages

Spider traps — infinite URL spaces:

1. Calendar trap:
   /calendar/2026/02/23 → links to /calendar/2026/02/24 → ... infinitely
   Detection: URL path depth limit (e.g., max 15 segments)

2. Session ID trap:
   /page?session=abc → contains link to /page?session=def → ... infinite unique URLs
   Detection: URL normalization strips session-like parameters
   Heuristic: if a parameter value changes but content hash is identical → parameter is cosmetic

3. Soft 404s:
   /nonexistent-page returns 200 OK with "Page not found" content
   Detection: content similarity to known 404 pages of the domain

4. Adversarial content:
   Page generates random links to keep crawler busy
   Detection: anomaly detection per domain — if one domain generates 100x more unique URLs than
   similar-sized domains, flag and throttle

5. Redirect loops:
   A → B → C → A
   Detection: max redirect depth (e.g., 10 hops)

Defenses:

Per-domain limits:
  - Max URLs per domain in frontier: 100,000
  - Max crawl depth from seed: 15 hops
  - Max URL length: 2048 characters
  - Max page size: 10MB (truncate after)

Per-page limits:
  - Max links extracted per page: 500
  - Max redirect hops: 10
  - Fetch timeout: 30 seconds
  - Connection timeout: 10 seconds
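The structural limits above make a cheap pre-filter before a URL enters the frontier. A sketch (function names hypothetical, constants taken from the lists above):

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048
MAX_PATH_DEPTH = 15
MAX_LINKS_PER_PAGE = 500

def url_passes_limits(url: str) -> bool:
    """Cheap structural checks applied before a URL enters the frontier."""
    if len(url) > MAX_URL_LENGTH:
        return False
    path = urlsplit(url).path
    depth = len([seg for seg in path.split("/") if seg])  # non-empty segments
    return depth <= MAX_PATH_DEPTH

def cap_links(links: list[str]) -> list[str]:
    """Keep at most MAX_LINKS_PER_PAGE extracted links that pass the checks."""
    return [u for u in links[:MAX_LINKS_PER_PAGE] if url_passes_limits(u)]
```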

Adaptive throttling:
  - If a domain's error rate > 50% over last 100 requests → pause for 1 hour
  - If a domain generates > 10,000 URLs/hour from a single page → flag as trap, stop crawling that subtree
  - Manual blocklist for known adversarial domains
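The error-rate rule can be implemented as a small per-domain sliding window (a sketch with hypothetical names; thresholds match the numbers above):

```python
from collections import deque

class DomainThrottle:
    """Pause a domain when its recent error rate exceeds a threshold."""

    WINDOW = 100           # look at the last 100 requests
    ERROR_THRESHOLD = 0.5  # >50% errors -> pause
    PAUSE_SECONDS = 3600   # 1 hour

    def __init__(self):
        self.results = deque(maxlen=self.WINDOW)  # True = success
        self.paused_until = 0.0

    def record(self, ok: bool, now: float):
        self.results.append(ok)
        if len(self.results) == self.WINDOW:
            error_rate = 1 - sum(self.results) / self.WINDOW
            if error_rate > self.ERROR_THRESHOLD:
                self.paused_until = now + self.PAUSE_SECONDS
                self.results.clear()   # fresh window after the pause

    def allowed(self, now: float) -> bool:
        return now >= self.paused_until
```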

Deep Dive 3: Crawl Prioritization and Freshness

The problem: 50 billion discoverable URLs, but we can only crawl 1 billion/day. How do we decide what to crawl first, and how often to re-crawl?

Priority scoring model:

priority(url) = w1 * page_importance + w2 * freshness_need + w3 * discovery_signal

1. Page Importance (static, updated periodically):
   - PageRank or simplified version (domain authority)
   - URL depth (shallower = more important: /about > /blog/2020/old-post)
   - Domain priority tier (news sites > personal blogs)

2. Freshness Need (dynamic, per-page):
   - Estimated change frequency based on past crawls:
     If page changed in 3 of last 5 crawls → high change rate → re-crawl sooner
   - Content type: news homepage → re-crawl every 15 minutes; static about page → re-crawl weekly
   - Adaptive re-crawl interval:
       new_interval = old_interval * 2 (if unchanged)
       new_interval = old_interval / 2 (if changed)
       Clamped to [15 minutes, 30 days]

3. Discovery Signal:
   - New URL never crawled → boost priority
   - URL discovered from multiple sources → likely important
   - URL from sitemap.xml → explicitly requested by site owner

Implementation:

Priority score computed when URL enters frontier:
  - New URLs: high base priority (discovery bonus)
  - Re-crawl URLs: priority based on estimated change rate * importance

Frontier maintains separate queues for:
  1. Never-crawled URLs (exploration)
  2. Re-crawl URLs (freshness maintenance)

Budget allocation: 70% re-crawl (keep index fresh), 30% exploration (discover new content)

Starvation prevention:
  - Every domain gets at least 1 crawl per day regardless of priority
  - Low-priority URLs age-out: priority increases linearly with time since last crawl
  - After 30 days with no crawl, any URL reaches maximum priority
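The adaptive re-crawl interval and the linear aging rule are both one-liners; a sketch with the clamps stated above (function names hypothetical, priorities assumed to be in [0, 1]):

```python
MIN_INTERVAL = 15 * 60           # 15 minutes
MAX_INTERVAL = 30 * 24 * 3600    # 30 days

def next_recrawl_interval(old_interval: float, changed: bool) -> float:
    """Halve the interval when the page changed, double it when it did not."""
    new = old_interval / 2 if changed else old_interval * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, new))

def aged_priority(base_priority: float, seconds_since_crawl: float,
                  max_priority: float = 1.0) -> float:
    """Linear aging: any URL reaches max priority after 30 days uncrawled."""
    boost = (max_priority - base_priority) * min(1.0, seconds_since_crawl / MAX_INTERVAL)
    return base_priority + boost
```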

7. Extensions (2 min)

  • Incremental/delta crawling: Instead of re-downloading the entire page, use HTTP conditional requests (If-Modified-Since, ETag). If the server supports it, we get a 304 Not Modified → skip download, saving ~90% of re-crawl bandwidth.
  • JavaScript rendering: Many modern pages require JS execution to render content. Add a headless browser farm (Puppeteer/Playwright) as a second-pass crawler for JS-heavy domains. First pass: raw HTML. Second pass (selective): rendered DOM. Trade-off: 10x slower, 50x more CPU.
  • Distributed crawler coordination: For multi-datacenter deployment, assign domain ranges to specific datacenters to avoid duplicate crawling. Use consistent hashing on domain to assign “ownership.” Crawler in US handles .com/.org, crawler in EU handles .de/.fr, etc.
  • Link graph extraction: In addition to storing page content, extract and store the link graph (source URL → destination URL). This feeds PageRank computation, spam detection, and web structure analysis. Store as an edge list in a graph database or adjacency list in HDFS.
  • Compliance and legal: Respect noindex/nofollow meta tags. Honor opt-out requests (GDPR right to be forgotten). Maintain a domain blocklist for DMCA/legal takedowns. Log all crawl activity for audit trails.