1. Requirements & Scope (5 min)

Functional Requirements

  1. Detect and classify botnet traffic in real-time — distinguish between legitimate users, known bots (scrapers, crawlers), and malicious botnets (DDoS, credential stuffing, spam)
  2. Maintain an IP reputation database that scores IPs from 0 (definitively malicious) to 100 (definitively legitimate), updated continuously based on observed behavior
  3. Support multiple detection methods: rate-based, behavioral fingerprinting, ML anomaly detection, honeypot-based, and collaborative intelligence (threat feeds)
  4. Automatically block or challenge (CAPTCHA) detected bots with configurable actions per threat level
  5. Provide a real-time dashboard showing attack patterns, blocked traffic, top threat sources, and detection accuracy metrics

Non-Functional Requirements

  • Availability: 99.99% — the detection system sits in the critical path. If it goes down, either all traffic is blocked (fail-closed) or all bots get through (fail-open). Neither is acceptable.
  • Latency: < 5ms decision time per request. Detection is in the hot path of every incoming request. Cannot add perceptible latency.
  • Consistency: Eventual consistency for IP reputation scores (a few seconds of propagation delay is fine). Real-time blocking decisions must use locally cached scores.
  • Scale: 1M requests/sec. 10M unique IPs/day. 500K concurrent connections. Detection must scale linearly.
  • False Positive Rate: < 0.1% — blocking legitimate users is worse than letting some bots through. False negatives (missed bots) are bad but recoverable. False positives lose customers.

2. Estimation (3 min)

Traffic

  • 1M requests/sec → each request needs a bot/not-bot decision
  • 1M feature extraction + scoring operations per second
  • Assuming 5% of traffic is botnet → 50K malicious requests/sec to detect and block

IP Reputation Database

  • 10M unique IPs observed per day
  • Each IP record: ~500 bytes (IP, score, features, history, last seen, classification)
  • Active IPs (last 24h): 10M × 500 bytes = 5 GB — fits in memory (Redis)
  • Historical IPs (90-day window): 100M × 500 bytes = 50 GB — PostgreSQL with hot/cold tiering

Feature Extraction

  • Per-request features: ~20 signals extracted in < 1ms
    • IP, User-Agent, headers, URL pattern, request timing, TLS fingerprint, geo
  • Per-IP aggregated features (sliding window): ~30 signals updated in < 2ms
    • Request rate, endpoint diversity, timing regularity, error rate, session behavior
  • Feature vector size: ~50 features × 4 bytes ≈ 200 bytes per request

ML Model

  • Model: gradient-boosted decision tree (XGBoost/LightGBM)
  • Inference time: < 1ms per request (100 trees, max depth 8)
  • Model size: ~10 MB (loaded in memory on every detection node)
  • Retraining: daily, on previous day’s labeled data
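The back-of-envelope numbers above are easy to sanity-check in a few lines (figures taken directly from the estimates in this section):

```python
# Sanity-check the capacity estimates from Section 2.

RPS = 1_000_000                      # requests/sec
BOT_FRACTION = 0.05                  # assumed botnet share of traffic
malicious_rps = int(RPS * BOT_FRACTION)            # requests/sec to block

IP_RECORD_BYTES = 500
active_ips = 10_000_000              # unique IPs per day
hot_cache_gb = active_ips * IP_RECORD_BYTES / 1e9  # Redis working set

historical_ips = 100_000_000         # ~90-day window
cold_store_gb = historical_ips * IP_RECORD_BYTES / 1e9  # PostgreSQL tier

print(malicious_rps, hot_cache_gb, cold_store_gb)  # 50000 5.0 50.0
```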

3. API Design (3 min)

Request Evaluation API (internal, called by API Gateway/WAF)

// Evaluate a request for bot probability
POST /v1/evaluate
  Body: {
    "ip": "203.0.113.42",
    "user_agent": "Mozilla/5.0 ...",
    "url": "/api/v1/login",
    "method": "POST",
    "headers": { ... },
    "tls_fingerprint": "ja3_hash_abc",
    "geo": { "country": "RU", "asn": 12345 },
    "session_id": "sess_xyz"
  }
  Response 200: {
    "decision": "block",               // allow | challenge | throttle | block
    "confidence": 0.94,
    "threat_type": "credential_stuffing",
    "ip_reputation": 12,
    "signals": ["high_request_rate", "ja3_known_bot", "geo_mismatch"],
    "latency_ms": 2.3
  }

IP Reputation API

// Query IP reputation
GET /v1/reputation/{ip}
  Response 200: {
    "ip": "203.0.113.42",
    "score": 12,                        // 0-100, lower = more suspicious
    "classification": "likely_bot",
    "first_seen": "2026-02-01T00:00:00Z",
    "last_seen": "2026-02-22T10:00:00Z",
    "total_requests_24h": 45000,
    "blocked_requests_24h": 42000,
    "threat_types": ["credential_stuffing", "scraping"],
    "associated_asn": { "number": 12345, "name": "Shady Hosting Inc." }
  }

// Manually override IP reputation (analyst action)
PUT /v1/reputation/{ip}
  Body: { "score": 100, "reason": "False positive - legitimate partner API", "ttl": 86400 }

// Report malicious IP (collaborative intelligence)
POST /v1/reputation/report
  Body: { "ip": "203.0.113.42", "threat_type": "ddos", "evidence": "..." }

Configuration API

// Set detection policy
PUT /v1/policies/{policy_id}
  Body: {
    "name": "login_endpoint_protection",
    "endpoint_pattern": "/api/*/login",
    "rules": [
      { "type": "rate_limit", "threshold": 10, "window": "1m", "action": "challenge" },
      { "type": "ml_score", "threshold": 0.8, "action": "block" },
      { "type": "geo_block", "countries": ["XX"], "action": "block" }
    ]
  }
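A hypothetical sketch of how a policy like the one above could be evaluated per request. Rule types and field names mirror the API body; glob matching of `endpoint_pattern` via `fnmatch` and first-match-wins ordering are assumptions, not a specified behavior:

```python
from fnmatch import fnmatch

def evaluate_policy(policy: dict, request: dict) -> str:
    """Return the action of the first matching rule, or 'allow'."""
    if not fnmatch(request["url"], policy["endpoint_pattern"]):
        return "allow"  # policy does not apply to this endpoint
    for rule in policy["rules"]:
        if rule["type"] == "rate_limit" and request["rate_1m"] > rule["threshold"]:
            return rule["action"]
        if rule["type"] == "ml_score" and request["ml_score"] >= rule["threshold"]:
            return rule["action"]
        if rule["type"] == "geo_block" and request["country"] in rule["countries"]:
            return rule["action"]
    return "allow"

policy = {
    "endpoint_pattern": "/api/*/login",
    "rules": [
        {"type": "rate_limit", "threshold": 10, "action": "challenge"},
        {"type": "ml_score", "threshold": 0.8, "action": "block"},
    ],
}
req = {"url": "/api/v1/login", "rate_1m": 3, "ml_score": 0.92, "country": "US"}
print(evaluate_policy(policy, req))  # block
```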

4. Data Model (3 min)

IP Reputation (Redis — hot data + PostgreSQL — full history)

// Redis (real-time lookups)
Key: ip:{ip_address}
Type: Hash
Fields:
  score              | int (0-100)
  classification     | string
  request_count_1h   | int
  request_count_24h  | int
  block_count_24h    | int
  last_seen          | timestamp
  features           | bytes (serialized feature vector)
  threat_types       | string (comma-separated)
TTL: 24 hours (re-populate from PostgreSQL on cache miss)
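The read-through behavior described above (serve from Redis, fall back to PostgreSQL on a miss, re-populate with a 24h TTL) can be sketched as follows. A dict-backed stub stands in for the Redis client so the sketch runs standalone; with redis-py the equivalent calls are `hgetall`, `hset(mapping=...)`, and `expire`. The neutral default score of 50 for never-seen IPs is an assumption:

```python
import time

class FakeRedis:
    """Minimal stand-in for a Redis client (hash ops + TTL only)."""
    def __init__(self):
        self.store, self.ttls = {}, {}
    def hgetall(self, key):
        return dict(self.store.get(key, {}))
    def hset(self, key, mapping):
        self.store.setdefault(key, {}).update(mapping)
    def expire(self, key, seconds):
        self.ttls[key] = time.time() + seconds

TTL_SECONDS = 24 * 3600  # matches the schema's 24h TTL

def get_ip_record(cache, db, ip: str) -> dict:
    key = f"ip:{ip}"
    record = cache.hgetall(key)
    if record:
        return record                    # hot path: served from Redis
    record = db.get(ip)                  # cache miss: fall back to PostgreSQL
    if record is None:
        record = {"score": 50, "classification": "unknown"}  # assumed default
    cache.hset(key, mapping=record)      # re-populate the cache
    cache.expire(key, TTL_SECONDS)
    return record

cache = FakeRedis()
db = {"203.0.113.42": {"score": 12, "classification": "likely_bot"}}
rec = get_ip_record(cache, db, "203.0.113.42")
print(rec["score"])  # 12
```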

// PostgreSQL (historical)
Table: ip_reputation
  ip_address       (PK) | inet
  score                  | smallint
  classification         | varchar(50)
  first_seen             | timestamp
  last_seen              | timestamp
  total_requests         | bigint
  total_blocks           | bigint
  threat_types           | text[]
  asn_number             | int
  country_code           | char(2)
  updated_at             | timestamp

Request Log (Kafka + ClickHouse for analytics)

// Kafka topic: request_events (retained 7 days)
{
  "timestamp": 1708632000,
  "ip": "203.0.113.42",
  "url": "/api/v1/login",
  "method": "POST",
  "user_agent": "...",
  "tls_fingerprint": "ja3_abc",
  "decision": "block",
  "ml_score": 0.94,
  "signals": ["high_rate", "ja3_bot"],
  "latency_ms": 2.3
}

// ClickHouse (aggregated analytics, 90-day retention)
Table: request_events
  timestamp         | DateTime
  ip                | IPv4
  url               | String
  decision          | Enum('allow','challenge','throttle','block')
  ml_score          | Float32
  threat_type       | String
  -- columnar storage, efficient for time-series aggregations

Threat Intelligence (PostgreSQL)

Table: threat_feeds
  source            | varchar(100)  -- e.g., "spamhaus", "abuseipdb"
  ip_address        | inet
  threat_type       | varchar(50)
  confidence        | float
  reported_at       | timestamp
  expires_at        | timestamp

Table: ja3_fingerprints
  ja3_hash         (PK) | varchar(64)
  classification         | varchar(50)  -- "chrome_browser", "python_requests", "known_bot"
  confidence             | float
  source                 | varchar(100)

5. High-Level Design (12 min)

Request Evaluation Flow

Incoming Request
  → Edge / API Gateway
    → Bot Detection Middleware (< 5ms total):
      → 1. Extract features (< 0.5ms):
           IP, User-Agent, URL, method, headers, TLS fingerprint (JA3)
      → 2. IP reputation lookup (< 0.5ms):
           Redis: HGET ip:{ip} score
           If score < 10 → fast-path BLOCK (skip ML)
           If score < 20 → fast-path CHALLENGE (skip ML)
           If score > 90 → fast-path ALLOW (skip ML)
      → 3. Rate check (< 0.5ms):
           Redis: INCR ip:{ip}:rate:{minute_bucket}
           If > threshold → flag as rate-exceeded
      → 4. ML scoring (< 1ms):
           Feed feature vector to in-process model
           Score = probability of being a bot (0.0 - 1.0)
      → 5. Decision engine (< 0.5ms):
           Combine: IP reputation + rate check + ML score + policy rules
           Output: allow / challenge / throttle / block
      → 6. Async: log request event to Kafka
    → If allowed: forward to backend
    → If challenged: return CAPTCHA page
    → If blocked: return 403
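Steps 2-5 of the flow above can be condensed into a single decision function. The reputation thresholds follow the three-tier breakdown in Deep Dive 3 (<10 block, <20 challenge, >90 allow); the ML score cutoffs of 0.8/0.5 are illustrative assumptions:

```python
def evaluate(reputation: int, rate_1m: int, rate_limit: int, ml_score: float) -> str:
    # Reputation fast paths: no ML inference needed
    if reputation < 10:
        return "block"        # Tier-1 blocklist territory
    if reputation > 90:
        return "allow"        # trusted fast path
    # Rate check
    if rate_1m > rate_limit:
        return "throttle"
    if reputation < 20:
        return "challenge"    # suspicious but not blocklisted
    # ML score decides the ambiguous middle band (assumed cutoffs)
    if ml_score >= 0.8:
        return "block"
    if ml_score >= 0.5:
        return "challenge"
    return "allow"

print(evaluate(5, 1, 60, 0.0))      # block   (fast path, ML skipped)
print(evaluate(55, 3, 60, 0.94))    # block   (ML score)
print(evaluate(55, 3, 60, 0.1))     # allow
```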

Background Processing

Feature Aggregation Pipeline:
  Kafka (request_events) → Stream Processor (Flink/Kafka Streams):
    Per IP, compute sliding window aggregates:
      - Request rate (1min, 5min, 1hr windows)
      - Unique endpoints accessed
      - Error rate (4xx, 5xx responses)
      - Request timing entropy (regular intervals → bot)
      - Geographic consistency
    Write aggregated features to Redis: HSET ip:{ip} features {vector}
    Update IP reputation score based on new features
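A minimal sketch of the per-IP sliding-window state the stream job maintains, using a deque for eviction. A real Flink/Kafka Streams job would keep this keyed state per IP with watermarks and state TTL; the 60-second window and the two features shown are a small subset of the aggregates listed above:

```python
from collections import deque

class IpWindow:
    """Sliding-window aggregates for a single IP."""
    def __init__(self, window_seconds: int = 60):
        self.window = window_seconds
        self.events = deque()            # (timestamp, endpoint)

    def add(self, ts: float, endpoint: str):
        self.events.append((ts, endpoint))
        # Evict events that fell out of the window
        while self.events and self.events[0][0] <= ts - self.window:
            self.events.popleft()

    def features(self) -> dict:
        return {
            "requests_per_minute": len(self.events),
            "unique_endpoints": len({e for _, e in self.events}),
        }

w = IpWindow(window_seconds=60)
for t in range(0, 120, 2):               # one request every 2s for 2 minutes
    w.add(float(t), "/api/v1/login")
print(w.features())  # {'requests_per_minute': 30, 'unique_endpoints': 1}
```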

ML Model Training Pipeline (daily):
  ClickHouse (historical data) → Feature extraction → Label assignment:
    Labels from: manual reviews, honeypot catches, known-good traffic
  → Train XGBoost model → Validate on holdout set
  → If accuracy > threshold: deploy to all detection nodes
  → Canary: deploy to 5% of nodes first, monitor false positive rate

Components

  1. Bot Detection Middleware: Runs in-process on every API Gateway node. Performs feature extraction, reputation lookup, rate check, ML inference, and decision. Must be < 5ms total.
  2. Redis Cluster: IP reputation cache, rate counters, feature vectors. Co-located with API Gateway for < 1ms latency.
  3. Feature Aggregation Service (Flink): Consumes request events from Kafka. Computes per-IP behavioral features in sliding windows. Updates Redis with fresh feature vectors.
  4. Reputation Scoring Service: Periodically recalculates IP reputation scores based on aggregated features, threat feed data, and historical behavior. Writes to Redis + PostgreSQL.
  5. ML Training Pipeline: Daily batch training on labeled data. Model validation and canary deployment. Model versioning and A/B testing.
  6. Threat Intelligence Ingester: Pulls from external feeds (Spamhaus, AbuseIPDB, etc.) hourly. Updates threat_feeds table and pre-populates IP reputation for known-bad IPs.
  7. Honeypot Service: Invisible endpoints (hidden form fields, fake API paths) that no legitimate user would access. Any traffic to honeypots is definitively bot traffic → auto-label and block.
  8. Analyst Dashboard: Real-time view of attack patterns, top blocked IPs, false positive reviews, detection accuracy metrics.

Architecture

Internet
  → CDN / Edge (DDoS mitigation at L3/L4 — Cloudflare/AWS Shield)
    → API Gateway with Bot Detection Middleware
      ├── Redis Cluster (reputation, rates, features)
      ├── In-process ML model (XGBoost)
      └── Decision Engine (policy rules)
    → Backend Services (if request allowed)

Async Pipeline:
  Kafka ← request events
    → Flink (feature aggregation) → Redis
    → ClickHouse (analytics storage)

Batch Pipeline:
  ClickHouse → ML Training → Model Store → Deploy to Gateway nodes
  Threat Feeds → Ingester → PostgreSQL + Redis

6. Deep Dives (15 min)

Deep Dive 1: Behavioral Fingerprinting and TLS Analysis

IP addresses alone are insufficient for bot detection. Sophisticated botnets use residential IP proxies, rotating through millions of IPs. Each IP has low request volume (below rate limits), making IP-based detection blind.

TLS Fingerprinting (JA3/JA3S):

How it works:
  During TLS handshake, the client sends a ClientHello with:
    - TLS version
    - Cipher suites (ordered list)
    - Extensions
    - Elliptic curves
    - Point formats

  JA3 renders these as five comma-separated fields (values within a field joined by dashes) and hashes the string: MD5(TLSVersion,Ciphers,Extensions,Curves,PointFormats)

Why it's powerful:
  - Chrome, Firefox, Safari each have a unique JA3 fingerprint
  - Python requests, Go net/http, curl each have distinct fingerprints
  - A bot claiming to be "Chrome" via User-Agent but with a Python JA3 → caught

  Detection:
    Known browser JA3 hashes → allow (if other signals are clean)
    Known bot library JA3 hashes → increase suspicion score
    Unknown JA3 → neutral (could be a new browser version)

Evasion and counter-evasion:
  - Sophisticated bots use headless Chrome (real JA3 fingerprint)
  - Counter: combine JA3 with other behavioral signals (see below)
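The JA3 computation itself is tiny. In this illustrative sketch the field values are made up for the example, not a real browser's fingerprint, and the "known bot" table is hypothetical:

```python
import hashlib

def ja3(tls_version: int, ciphers, extensions, curves, point_formats) -> str:
    """MD5 over the five ClientHello fields, dash-joined within each field."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

fp = ja3(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
print(fp)  # 32-char hex digest, stable for identical ClientHellos

# Detection idea from the text: UA claims Chrome, JA3 matches a bot library
known_bot_ja3 = {fp}                  # hypothetical fingerprint table
claimed_ua = "Mozilla/5.0 (... Chrome ...)"
if fp in known_bot_ja3 and "Chrome" in claimed_ua:
    print("UA/JA3 mismatch -> raise suspicion")
```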

HTTP/2 Fingerprinting:

HTTP/2 settings frame reveals:
  - HEADER_TABLE_SIZE
  - ENABLE_PUSH
  - MAX_CONCURRENT_STREAMS
  - INITIAL_WINDOW_SIZE
  - MAX_FRAME_SIZE
  - MAX_HEADER_LIST_SIZE
  - Priority frames and dependency tree

Each browser has a characteristic HTTP/2 fingerprint.
Headless Chrome has a slightly different fingerprint than real Chrome
(e.g., different priority tree structure for subresources).

Behavioral signals (per-session analysis):

Signal: Mouse/touch event patterns (for web traffic with JS challenge)
  - Bots: no mouse movement, instant form fills, zero scroll events
  - Humans: gradual mouse movement, variable typing speed, scrolling

Signal: Request timing entropy
  - Bots: regular intervals (exactly 100ms between requests)
  - Humans: irregular intervals (Poisson-distributed)
  - Metric: Shannon entropy of inter-request timing
    H = -Σ p(x) log2(p(x)) over binned timing intervals
    Low entropy → regular → likely bot
    High entropy → random → likely human
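The entropy metric above can be computed directly. This sketch bins inter-request gaps into 50ms buckets (the bin width is an assumption): regular bot-like timing collapses into one bin (entropy 0), while jittery human timing spreads across bins:

```python
import math
from collections import Counter

def timing_entropy(timestamps, bin_ms: int = 50) -> float:
    """Shannon entropy of binned inter-request intervals (seconds in)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    bins = Counter(round(g * 1000) // bin_ms for g in gaps)
    n = len(gaps)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

bot = [i * 0.100 for i in range(20)]                 # exactly 100ms apart
human = [0.0, 0.35, 0.41, 1.2, 1.9, 2.05, 3.4, 3.6]  # irregular gaps
print(timing_entropy(bot))    # 0.0 -> regular, likely bot
print(timing_entropy(human))  # > 1 -> irregular, likely human
```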

Signal: Navigation pattern
  - Bots: go directly to target URLs (no homepage, no CSS/JS loading)
  - Humans: load homepage → navigate → load subresources
  - Metric: ratio of API calls to page loads, subresource loading pattern

Signal: Session behavior
  - Bots: no cookies, no referrer, high bounce rate
  - Humans: cookies present, referrer chain, multi-page sessions

Deep Dive 2: ML-Based Anomaly Detection

Feature engineering (per request + per IP aggregate):

Per-request features (20 features):
  - ip_reputation_score (0-100)
  - is_known_datacenter_ip (0/1) — AWS, GCP, Azure, etc.
  - ja3_is_known_browser (0/1)
  - ja3_matches_user_agent (0/1)
  - user_agent_is_common (0/1)
  - url_depth (int)
  - has_referrer (0/1)
  - has_cookies (0/1)
  - http_version (float: 1.0, 1.1, 2.0)
  - content_type_appropriate (0/1)
  - ... (10 more header-based features)

Per-IP aggregate features (30 features, from sliding windows):
  - requests_per_minute_1m, _5m, _1h
  - unique_endpoints_1h
  - unique_user_agents_24h (rotation → bot)
  - error_rate_1h (4xx + 5xx / total)
  - timing_entropy_5m
  - avg_response_size_1h
  - login_attempt_rate_1h
  - geographic_distance_from_previous (km, if IP changed)
  - time_since_first_seen (seconds)
  - session_count_24h
  - ... (20 more behavioral features)

Model choice — XGBoost:

Why XGBoost over deep learning:
  - Inference time: < 1ms (critical for hot-path detection)
  - Interpretable: SHAP values explain which features drove the decision
  - Handles mixed features (categorical + numerical) natively
  - Small model size (~10 MB) — loaded in every gateway node's memory
  - Robust with missing features (graceful degradation)
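The reason GBDT inference is so cheap: scoring is just N shallow tree traversals plus a sigmoid. This toy sketch illustrates the mechanics with hand-made trees; the split features, thresholds, and leaf values are invented for the example, not a trained model (a real deployment loads the XGBoost booster):

```python
import math

def traverse(node: dict, x: dict) -> float:
    """Walk one tree from root to leaf; depth-8 trees need <= 8 comparisons."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

def predict_proba(trees, x: dict) -> float:
    margin = sum(traverse(t, x) for t in trees)    # additive ensemble
    return 1.0 / (1.0 + math.exp(-margin))         # sigmoid -> bot probability

tree = {
    "feature": "requests_per_minute", "threshold": 60,
    "left": {"leaf": -0.4},                        # low rate -> human-ish
    "right": {                                     # high rate: check timing
        "feature": "timing_entropy", "threshold": 1.0,
        "left": {"leaf": 1.2},                     # high rate + regular timing
        "right": {"leaf": 0.2},
    },
}
trees = [tree] * 3                                 # a 3-tree "ensemble"
bot = {"requests_per_minute": 500, "timing_entropy": 0.1}
print(round(predict_proba(trees, bot), 3))         # high score -> likely bot
```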

Training pipeline:
  1. Labels:
     - Definitive bot: caught by honeypot (100% bot)
     - Definitive human: passed CAPTCHA + had normal session (100% human)
     - Weak bot signal: from threat feeds, rate limiting
     - Weak human signal: long session, organic navigation

  2. Train daily on last 7 days of labeled data
     - ~10M labeled samples/day (mostly from automated labeling)
     - 80/10/10 train/validation/test split

  3. Evaluation metrics:
     - FPR (false positive rate) must be < 0.1%
     - At FPR = 0.1%, target TPR (true positive rate) > 95%
     - AUC-ROC > 0.99

  4. Deployment:
     - New model compared against current model on holdout set
     - If metrics improve → canary deploy to 5% of nodes
     - Monitor FPR for 1 hour
     - If FPR stable → roll out to 100%
     - If FPR spikes → auto-rollback to previous model
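The promotion gate described in step 4 is essentially two threshold checks. A minimal sketch, using the FPR/TPR budgets stated above; the exact comparison against the current model (strict TPR improvement) is an assumption:

```python
MAX_FPR = 0.001       # < 0.1% false positive rate
MIN_TPR = 0.95        # > 95% true positive rate at that FPR

def promotion_decision(candidate: dict, current: dict) -> str:
    """Gate a freshly trained model before it touches production traffic."""
    if candidate["fpr"] > MAX_FPR or candidate["tpr"] < MIN_TPR:
        return "reject"                 # fails the absolute budgets
    if candidate["tpr"] <= current["tpr"]:
        return "reject"                 # no improvement over current model
    return "canary_5pct"

def canary_decision(observed_fpr: float) -> str:
    """After 1 hour of canary traffic: roll out or roll back."""
    return "rollback" if observed_fpr > MAX_FPR else "rollout_100pct"

print(promotion_decision({"fpr": 0.0008, "tpr": 0.97},
                         {"fpr": 0.0009, "tpr": 0.96}))  # canary_5pct
print(canary_decision(0.0005))                           # rollout_100pct
```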

Handling adversarial evasion:

Sophisticated bots adapt to avoid detection. Counter-measures:

1. Feature rotation: periodically change which features have the highest weight.
   Retrain weekly with randomized feature subsets.

2. Ensemble models: run 3 models trained on different feature subsets.
   If any model flags the request → challenge. Harder to evade all 3.

3. Honeypot-based ground truth: honeypots provide fresh, unevadable labels.
   Any request to a honeypot is definitively a bot, regardless of how well
   it mimics human behavior on other endpoints.

4. Behavioral challenges: inject invisible challenges (JS puzzles, canvas fingerprinting)
   that bots cannot solve without running a full browser engine.
   Cost: 50ms latency for first request only (then cached via cookie).

Deep Dive 3: Real-Time Blocking Architecture

Challenge: Making blocking decisions in < 5ms at 1M req/sec

Architecture: Three-tier decision system

Tier 1: Blocklist (< 0.1ms)
  - Pre-computed blocklist stored in memory (hash set)
  - Contains IPs with reputation score < 10
  - Updated every 60 seconds from Redis
  - Size: ~100K IPv4s × 4 bytes ≈ 400 KB raw (a hash set adds per-entry overhead, but the structure stays comfortably cache-resident)
  - O(1) lookup. If IP is in blocklist → BLOCK immediately.
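A quick check of the Tier-1 sizing and lookup claims: IPv4 addresses pack into 4-byte integers, and a set of those integers gives the O(1) membership test. The sample addresses are illustrative:

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """Pack an IPv4 address into its 4-byte integer form."""
    return int(ipaddress.IPv4Address(ip))

# Toy blocklist over one /24; production would hold ~100K entries
blocklist = {ip_to_int(f"203.0.113.{i}") for i in range(256)}

raw_kb = 100_000 * 4 / 1024
print(round(raw_kb, 1))                          # ~390.6 KiB raw payload
print(ip_to_int("203.0.113.42") in blocklist)    # True: O(1) lookup
```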

Tier 2: Rate + Reputation (< 1ms)
  - Redis lookup for IP reputation score and rate counter
  - If rate > threshold → THROTTLE or BLOCK
  - If reputation < 20 → CHALLENGE
  - If reputation > 90 → ALLOW (skip ML)
  - Most requests are resolved here (80% of traffic)

Tier 3: ML Scoring (< 2ms)
  - Only for ambiguous requests (reputation 20-90, rate below threshold)
  - ~20% of traffic reaches this tier
  - In-process XGBoost inference (no network call)
  - At 1M req/sec, 20% = 200K inferences/sec
  - Each inference: ~10 µs of CPU time (well within the < 1ms budget) → ~100K inferences/sec per core
  - Need 2 cores dedicated to ML on each gateway node

Total budget breakdown:
  Feature extraction: 0.5ms
  Tier 1 (blocklist): 0.1ms
  Tier 2 (Redis): 0.5ms
  Tier 3 (ML): 1.0ms (when needed)
  Decision + logging: 0.3ms
  Total: 2.4ms typical, < 5ms worst case ✓

Challenge response flow (CAPTCHA):

When decision = "challenge":
  1. Return 403 with challenge page (CAPTCHA or JS puzzle)
  2. Client solves challenge → POST /v1/challenge/verify
  3. If solved correctly:
     → Set signed cookie (HMAC + timestamp) valid for 1 hour
     → Subsequent requests include cookie → bypass challenge
     → Boost IP reputation by +10 points
  4. If solved incorrectly or not attempted:
     → Decrease IP reputation by 5 points
     → If multiple failures → block IP for 1 hour
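The signed cookie in step 3 can be sketched with stdlib HMAC: sign over the IP plus issue timestamp, verify with a constant-time compare and the 1-hour expiry. The secret key here is a placeholder; production would rotate keys:

```python
import hashlib
import hmac

SECRET = b"rotate-me"                 # placeholder; rotate in production
COOKIE_TTL = 3600                     # 1 hour, as above

def issue_cookie(ip: str, now: float) -> str:
    payload = f"{ip}:{int(now)}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def verify_cookie(cookie: str, ip: str, now: float) -> bool:
    try:
        cookie_ip, ts, sig = cookie.rsplit(":", 2)  # rsplit tolerates IPv6 colons
    except ValueError:
        return False
    payload = f"{cookie_ip}:{ts}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(sig, expected)     # untampered
        and cookie_ip == ip                    # bound to the solving IP
        and now - int(ts) < COOKIE_TTL         # not expired
    )

c = issue_cookie("203.0.113.42", now=1_700_000_000)
print(verify_cookie(c, "203.0.113.42", now=1_700_000_500))   # True
print(verify_cookie(c, "203.0.113.42", now=1_700_004_000))   # False (expired)
```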

7. Extensions (2 min)

  • Collaborative intelligence network: Share anonymized threat data across customers. When one customer detects a new botnet IP, propagate the reputation decrease to all customers within seconds. Similar to Cloudflare’s network effect — more customers = better detection.
  • Device fingerprinting: Beyond IP and TLS, use browser-level fingerprints (canvas, WebGL, font enumeration, screen resolution). Creates a device_id that persists across IP changes. Detect when a single device rotates through thousands of IPs (proxy/VPN abuse).
  • Credential stuffing protection: Specialized detection for login endpoints. Track unique username/password combinations per IP. If an IP tries 1000 different username/password pairs in an hour, it is definitively credential stuffing, regardless of rate. Integrate with Have I Been Pwned to check if attempted credentials are from known breaches.
  • DDoS mitigation at application layer (L7): Detect HTTP flood attacks targeting specific endpoints. Auto-scale challenge rate when traffic to a specific URL exceeds 10x normal. Progressive defense: rate limit → challenge → block → null-route at network edge.
  • Bot management (not just detection): Some bots are desirable (Googlebot, partner APIs). Maintain an allow-list with verified bot identities (reverse DNS verification for Googlebot, API key for partners). Provide different rate limits and access policies for known-good bots vs. unknown bots.