1. Requirements & Scope (5 min)
Functional Requirements
- Detect and classify botnet traffic in real-time — distinguish between legitimate users, known bots (scrapers, crawlers), and malicious botnets (DDoS, credential stuffing, spam)
- Maintain an IP reputation database that scores IPs from 0 (definitively malicious) to 100 (definitively legitimate), updated continuously based on observed behavior
- Support multiple detection methods: rate-based, behavioral fingerprinting, ML anomaly detection, honeypot-based, and collaborative intelligence (threat feeds)
- Automatically block or challenge (CAPTCHA) detected bots with configurable actions per threat level
- Provide a real-time dashboard showing attack patterns, blocked traffic, top threat sources, and detection accuracy metrics
Non-Functional Requirements
- Availability: 99.99% — the detection system sits in the critical path. If it goes down, either all traffic is blocked (fail-closed) or all bots get through (fail-open). Neither is acceptable.
- Latency: < 5ms decision time per request. Detection is in the hot path of every incoming request. Cannot add perceptible latency.
- Consistency: Eventual consistency for IP reputation scores (a few seconds of propagation delay is fine). Real-time blocking decisions must use locally cached scores.
- Scale: 1M requests/sec. 10M unique IPs/day. 500K concurrent connections. Detection must scale linearly.
- False Positive Rate: < 0.1% — blocking legitimate users is worse than letting some bots through. False negatives (missed bots) are bad but recoverable. False positives lose customers.
2. Estimation (3 min)
Traffic
- 1M requests/sec → each request needs a bot/not-bot decision
- 1M feature extraction + scoring operations per second
- Assuming 5% of traffic is botnet → 50K malicious requests/sec to detect and block
IP Reputation Database
- 10M unique IPs observed per day
- Each IP record: ~500 bytes (IP, score, features, history, last seen, classification)
- Active IPs (last 24h): 10M × 500 bytes = 5 GB — fits in memory (Redis)
- Historical IPs (90-day window): 100M × 500 bytes = 50 GB — PostgreSQL with hot/cold tiering
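The storage sizing above can be sanity-checked with quick arithmetic (record size and IP counts are the estimates from this section):

```python
# Back-of-envelope check of the reputation-store sizing.
RECORD_BYTES = 500            # per-IP record (IP, score, features, history, ...)
ACTIVE_IPS = 10_000_000       # unique IPs seen in the last 24h
HISTORICAL_IPS = 100_000_000  # ~90-day window

hot_gb = ACTIVE_IPS * RECORD_BYTES / 1e9
cold_gb = HISTORICAL_IPS * RECORD_BYTES / 1e9
```

This confirms the 5 GB hot tier (Redis) and 50 GB cold tier (PostgreSQL) figures.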
Feature Extraction
- Per-request features: ~20 signals extracted in < 1ms
- IP, User-Agent, headers, URL pattern, request timing, TLS fingerprint, geo
- Per-IP aggregated features (sliding window): ~50 signals updated in < 2ms
- Request rate, endpoint diversity, timing regularity, error rate, session behavior
- Feature vector size: ~200 bytes per request
ML Model
- Model: gradient-boosted decision tree (XGBoost/LightGBM)
- Inference time: < 1ms per request (100 trees, max depth 8)
- Model size: ~10 MB (loaded in memory on every detection node)
- Retraining: daily, on the last 7 days of labeled data
3. API Design (3 min)
Request Evaluation API (internal, called by API Gateway/WAF)
// Evaluate a request for bot probability
POST /v1/evaluate
Body: {
"ip": "203.0.113.42",
"user_agent": "Mozilla/5.0 ...",
"url": "/api/v1/login",
"method": "POST",
"headers": { ... },
"tls_fingerprint": "ja3_hash_abc",
"geo": { "country": "RU", "asn": 12345 },
"session_id": "sess_xyz"
}
Response 200: {
"decision": "block", // allow | challenge | throttle | block
"confidence": 0.94,
"threat_type": "credential_stuffing",
"ip_reputation": 12,
"signals": ["high_request_rate", "ja3_known_bot", "geo_mismatch"],
"latency_ms": 2.3
}
IP Reputation API
// Query IP reputation
GET /v1/reputation/{ip}
Response 200: {
"ip": "203.0.113.42",
"score": 12, // 0-100, lower = more suspicious
"classification": "likely_bot",
"first_seen": "2026-02-01T00:00:00Z",
"last_seen": "2026-02-22T10:00:00Z",
"total_requests_24h": 45000,
"blocked_requests_24h": 42000,
"threat_types": ["credential_stuffing", "scraping"],
"associated_asn": { "number": 12345, "name": "Shady Hosting Inc." }
}
// Manually override IP reputation (analyst action)
PUT /v1/reputation/{ip}
Body: { "score": 100, "reason": "False positive - legitimate partner API", "ttl": 86400 }
// Report malicious IP (collaborative intelligence)
POST /v1/reputation/report
Body: { "ip": "203.0.113.42", "threat_type": "ddos", "evidence": "..." }
Configuration API
// Set detection policy
PUT /v1/policies/{policy_id}
Body: {
"name": "login_endpoint_protection",
"endpoint_pattern": "/api/*/login",
"rules": [
{ "type": "rate_limit", "threshold": 10, "window": "1m", "action": "challenge" },
{ "type": "ml_score", "threshold": 0.8, "action": "block" },
{ "type": "geo_block", "countries": ["XX"], "action": "block" }
]
}
4. Data Model (3 min)
IP Reputation (Redis — hot data + PostgreSQL — full history)
// Redis (real-time lookups)
Key: ip:{ip_address}
Type: Hash
Fields:
score | int (0-100)
classification | string
request_count_1h | int
request_count_24h | int
block_count_24h | int
last_seen | timestamp
features | bytes (serialized feature vector)
threat_types | string (comma-separated)
TTL: 24 hours (re-populate from PostgreSQL on cache miss)
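A minimal sketch of reads and writes against this hash layout. The redis-py call shapes (`hset`/`hget`) are real, but the in-memory stand-in and the neutral default score of 50 on cache miss are assumptions so the snippet runs without a server:

```python
import time

class FakeRedis:
    """Tiny in-memory stand-in for the two redis-py calls used below."""
    def __init__(self):
        self.store = {}
    def hset(self, key, mapping):
        self.store.setdefault(key, {}).update(mapping)
    def hget(self, key, field):
        return self.store.get(key, {}).get(field)

def write_reputation(r, ip, score, classification):
    # Mirrors the Hash layout above: HSET ip:{ip} score ... classification ...
    r.hset(f"ip:{ip}", mapping={
        "score": score,
        "classification": classification,
        "last_seen": int(time.time()),
    })

def read_score(r, ip, default=50):
    # Cache miss -> neutral score while PostgreSQL re-populates the key async.
    raw = r.hget(f"ip:{ip}", "score")
    return int(raw) if raw is not None else default

r = FakeRedis()
write_reputation(r, "203.0.113.42", 12, "likely_bot")
```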
// PostgreSQL (historical)
Table: ip_reputation
ip_address (PK) | inet
score | smallint
classification | varchar(50)
first_seen | timestamp
last_seen | timestamp
total_requests | bigint
total_blocks | bigint
threat_types | text[]
asn_number | int
country_code | char(2)
updated_at | timestamp
Request Log (Kafka + ClickHouse for analytics)
// Kafka topic: request_events (retained 7 days)
{
"timestamp": 1708632000,
"ip": "203.0.113.42",
"url": "/api/v1/login",
"method": "POST",
"user_agent": "...",
"tls_fingerprint": "ja3_abc",
"decision": "block",
"ml_score": 0.94,
"signals": ["high_rate", "ja3_bot"],
"latency_ms": 2.3
}
// ClickHouse (aggregated analytics, 90-day retention)
Table: request_events
timestamp | DateTime
ip | IPv4
url | String
decision | Enum('allow','challenge','throttle','block')
ml_score | Float32
threat_type | String
-- columnar storage, efficient for time-series aggregations
Threat Intelligence (PostgreSQL)
Table: threat_feeds
source | varchar(100) -- e.g., "spamhaus", "abuseipdb"
ip_address | inet
threat_type | varchar(50)
confidence | float
reported_at | timestamp
expires_at | timestamp
Table: ja3_fingerprints
ja3_hash (PK) | varchar(64)
classification | varchar(50) -- "chrome_browser", "python_requests", "known_bot"
confidence | float
source | varchar(100)
5. High-Level Design (12 min)
Request Evaluation Flow
Incoming Request
→ Edge / API Gateway
→ Bot Detection Middleware (< 5ms total):
→ 1. Extract features (< 0.5ms):
IP, User-Agent, URL, method, headers, TLS fingerprint (JA3)
→ 2. IP reputation lookup (< 0.5ms):
Redis: HGET ip:{ip} score
If score < 10 → fast-path BLOCK (skip ML); if score < 20 → CHALLENGE
If score > 90 → fast-path ALLOW (skip ML)
→ 3. Rate check (< 0.5ms):
Redis: INCR ip:{ip}:rate:{minute_bucket}
If > threshold → flag as rate-exceeded
→ 4. ML scoring (< 1ms):
Feed feature vector to in-process model
Score = probability of being a bot (0.0 - 1.0)
→ 5. Decision engine (< 0.5ms):
Combine: IP reputation + rate check + ML score + policy rules
Output: allow / challenge / throttle / block
→ 6. Async: log request event to Kafka
→ If allowed: forward to backend
→ If challenged: return CAPTCHA page
→ If blocked: return 403
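Steps 2-5 above collapse into a single decision function. This is an illustrative sketch, not the real policy engine; the reputation bands (10/20/90) follow the tiers in Deep Dive 3, and the 0.8 block threshold matches the example policy in the Configuration API:

```python
def decide(ip_score, rate_exceeded, ml_score, block_threshold=0.8):
    """Combine reputation fast paths, rate flag, and ML score into a verdict."""
    if ip_score < 10:
        return "block"       # fast path: blocklisted reputation, skip ML
    if ip_score < 20:
        return "challenge"   # very suspicious, but offer a CAPTCHA escape hatch
    if ip_score > 90:
        return "allow"       # fast path: trusted reputation, skip ML
    if rate_exceeded:
        return "throttle"
    if ml_score >= block_threshold:
        return "block"
    return "allow"
```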
Background Processing
Feature Aggregation Pipeline:
Kafka (request_events) → Stream Processor (Flink/Kafka Streams):
Per IP, compute sliding window aggregates:
- Request rate (1min, 5min, 1hr windows)
- Unique endpoints accessed
- Error rate (4xx, 5xx responses)
- Request timing entropy (regular intervals → bot)
- Geographic consistency
Write aggregated features to Redis: HSET ip:{ip} features {vector}
Update IP reputation score based on new features
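A single-process sketch of one of these sliding-window aggregates (request rate). A real Flink job would partition by IP and use event-time windows; the deque-based eviction here just shows the mechanic:

```python
from collections import defaultdict, deque

class SlidingRate:
    """Per-IP request count over a fixed window (seconds)."""
    def __init__(self, window_s=60):
        self.window_s = window_s
        self.events = defaultdict(deque)   # ip -> timestamps, oldest first

    def record(self, ip, ts):
        q = self.events[ip]
        q.append(ts)
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= ts - self.window_s:
            q.popleft()

    def rate_per_window(self, ip):
        return len(self.events[ip])
```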
ML Model Training Pipeline (daily):
ClickHouse (historical data) → Feature extraction → Label assignment:
Labels from: manual reviews, honeypot catches, known-good traffic
→ Train XGBoost model → Validate on holdout set
→ If accuracy > threshold: deploy to all detection nodes
→ Canary: deploy to 5% of nodes first, monitor false positive rate
Components
- Bot Detection Middleware: Runs in-process on every API Gateway node. Performs feature extraction, reputation lookup, rate check, ML inference, and decision. Must be < 5ms total.
- Redis Cluster: IP reputation cache, rate counters, feature vectors. Co-located with API Gateway for < 1ms latency.
- Feature Aggregation Service (Flink): Consumes request events from Kafka. Computes per-IP behavioral features in sliding windows. Updates Redis with fresh feature vectors.
- Reputation Scoring Service: Periodically recalculates IP reputation scores based on aggregated features, threat feed data, and historical behavior. Writes to Redis + PostgreSQL.
- ML Training Pipeline: Daily batch training on labeled data. Model validation and canary deployment. Model versioning and A/B testing.
- Threat Intelligence Ingester: Pulls from external feeds (Spamhaus, AbuseIPDB, etc.) hourly. Updates threat_feeds table and pre-populates IP reputation for known-bad IPs.
- Honeypot Service: Invisible endpoints (hidden form fields, fake API paths) that no legitimate user would access. Any traffic to honeypots is definitively bot traffic → auto-label and block.
- Analyst Dashboard: Real-time view of attack patterns, top blocked IPs, false positive reviews, detection accuracy metrics.
Architecture
Internet
→ CDN / Edge (DDoS mitigation at L3/L4 — Cloudflare/AWS Shield)
→ API Gateway with Bot Detection Middleware
├── Redis Cluster (reputation, rates, features)
├── In-process ML model (XGBoost)
└── Decision Engine (policy rules)
→ Backend Services (if request allowed)
Async Pipeline:
Kafka ← request events
→ Flink (feature aggregation) → Redis
→ ClickHouse (analytics storage)
Batch Pipeline:
ClickHouse → ML Training → Model Store → Deploy to Gateway nodes
Threat Feeds → Ingester → PostgreSQL + Redis
6. Deep Dives (15 min)
Deep Dive 1: Behavioral Fingerprinting and TLS Analysis
IP addresses alone are insufficient for bot detection. Sophisticated botnets use residential proxy networks, rotating through millions of IPs. Each IP generates low request volume (staying below rate limits), leaving purely IP-based detection blind.
TLS Fingerprinting (JA3/JA3S):
How it works:
During TLS handshake, the client sends a ClientHello with:
- TLS version
- Cipher suites (ordered list)
- Extensions
- Elliptic curves
- Point formats
JA3 hashes these into a fingerprint: MD5(TLSVersion,Ciphers,Extensions,Curves,PointFormats)
Why it's powerful:
- Chrome, Firefox, Safari each have a unique JA3 fingerprint
- Python requests, Go net/http, curl each have distinct fingerprints
- A bot claiming to be "Chrome" via User-Agent but with a Python JA3 → caught
Detection:
Known browser JA3 hashes → allow (if other signals are clean)
Known bot library JA3 hashes → increase suspicion score
Unknown JA3 → neutral (could be a new browser version)
Evasion and counter-evasion:
- Sophisticated bots use headless Chrome (real JA3 fingerprint)
- Counter: combine JA3 with other behavioral signals (see below)
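The JA3 construction above can be sketched directly. The example field values are illustrative, and a production implementation would also strip GREASE values before hashing:

```python
import hashlib

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """JA3: MD5 over the comma-joined, dash-separated decimal ClientHello fields."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)          # e.g. "771,4865-4866,...,29-23,0"
    return hashlib.md5(ja3_string.encode()).hexdigest()
```

Two clients with identical TLS stacks produce identical hashes; a bot library with a different cipher ordering produces a different one, regardless of its User-Agent.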
HTTP/2 Fingerprinting:
HTTP/2 settings frame reveals:
- HEADER_TABLE_SIZE
- ENABLE_PUSH
- MAX_CONCURRENT_STREAMS
- INITIAL_WINDOW_SIZE
- MAX_FRAME_SIZE
- MAX_HEADER_LIST_SIZE
- Priority frames and dependency tree
Each browser has a characteristic HTTP/2 fingerprint.
Headless Chrome has a slightly different fingerprint than real Chrome
(e.g., different priority tree structure for subresources).
Behavioral signals (per-session analysis):
Signal: Mouse/touch event patterns (for web traffic with JS challenge)
- Bots: no mouse movement, instant form fills, zero scroll events
- Humans: gradual mouse movement, variable typing speed, scrolling
Signal: Request timing entropy
- Bots: regular intervals (exactly 100ms between requests)
- Humans: irregular intervals (Poisson-distributed)
- Metric: Shannon entropy of inter-request timing
H = -Σ p(x) log2(p(x)) over binned timing intervals
Low entropy → regular → likely bot
High entropy → random → likely human
Signal: Navigation pattern
- Bots: go directly to target URLs (no homepage, no CSS/JS loading)
- Humans: load homepage → navigate → load subresources
- Metric: ratio of API calls to page loads, subresource loading pattern
Signal: Session behavior
- Bots: no cookies, no referrer, high bounce rate
- Humans: cookies present, referrer chain, multi-page sessions
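The timing-entropy signal can be computed as sketched below. The 50ms bin width and the synthetic traffic are assumptions for illustration: a metronomic bot collapses into one bin (entropy 0), while Poisson-like human arrivals spread across many bins:

```python
import math
import random
from collections import Counter

def timing_entropy(timestamps_ms, bin_ms=50):
    """Shannon entropy H = -sum p(x) log2 p(x) over binned inter-request gaps."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    bins = Counter(int(g // bin_ms) for g in gaps)
    n = sum(bins.values())
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

bot_ts = list(range(0, 5000, 100))    # exactly 100 ms apart -> one bin
rng = random.Random(7)                # fixed seed for reproducibility
human_ts, t = [], 0.0
for _ in range(50):
    t += rng.expovariate(1 / 400)     # Poisson-like arrivals, mean ~400 ms gaps
    human_ts.append(t)
```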
Deep Dive 2: ML-Based Anomaly Detection
Feature engineering (per request + per IP aggregate):
Per-request features (20 features):
- ip_reputation_score (0-100)
- is_known_datacenter_ip (0/1) — AWS, GCP, Azure, etc.
- ja3_is_known_browser (0/1)
- ja3_matches_user_agent (0/1)
- user_agent_is_common (0/1)
- url_depth (int)
- has_referrer (0/1)
- has_cookies (0/1)
- http_version (float: 1.0, 1.1, 2.0)
- content_type_appropriate (0/1)
- ... (10 more header-based features)
Per-IP aggregate features (30 features, from sliding windows):
- requests_per_minute_1m, _5m, _1h
- unique_endpoints_1h
- unique_user_agents_24h (rotation → bot)
- error_rate_1h (4xx + 5xx / total)
- timing_entropy_5m
- avg_response_size_1h
- login_attempt_rate_1h
- geographic_distance_from_previous (km, if IP changed)
- time_since_first_seen (seconds)
- session_count_24h
- ... (20 more behavioral features)
Model choice — XGBoost:
Why XGBoost over deep learning:
- Inference time: < 1ms (critical for hot-path detection)
- Interpretable: SHAP values explain which features drove the decision
- Handles mixed features (categorical + numerical) natively
- Small model size (~10 MB) — loaded in every gateway node's memory
- Robust with missing features (graceful degradation)
Training pipeline:
1. Labels:
- Definitive bot: caught by honeypot (100% bot)
- Definitive human: passed CAPTCHA + had normal session (100% human)
- Weak bot signal: from threat feeds, rate limiting
- Weak human signal: long session, organic navigation
2. Train daily on last 7 days of labeled data
- ~10M labeled samples/day (mostly from automated labeling)
- 80/10/10 train/validation/test split
3. Evaluation metrics:
- FPR (false positive rate) must be < 0.1%
- At FPR = 0.1%, target TPR (true positive rate) > 95%
- AUC-ROC > 0.99
4. Deployment:
- New model compared against current model on holdout set
- If metrics improve → canary deploy to 5% of nodes
- Monitor FPR for 1 hour
- If FPR stable → roll out to 100%
- If FPR spikes → auto-rollback to previous model
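The FPR constraint in step 3 translates into threshold selection on the validation set. A minimal sketch (labels: 1 = bot, 0 = human; the tie-breaking epsilon is an implementation detail, and blocking means score >= threshold):

```python
def pick_threshold(scores, labels, max_fpr=0.001):
    """Lowest score threshold whose false-positive rate stays within budget."""
    humans = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not humans:
        return 0.5
    allowed_fp = int(len(humans) * max_fpr)   # e.g. 2 FPs allowed per 2000 humans
    return humans[allowed_fp] + 1e-9          # just above the first disallowed human

def fpr_tpr(scores, labels, thr):
    """Fraction of humans wrongly blocked, fraction of bots caught."""
    fp = sum(s >= thr for s, y in zip(scores, labels) if y == 0)
    tp = sum(s >= thr for s, y in zip(scores, labels) if y == 1)
    return fp / labels.count(0), tp / labels.count(1)
```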
Handling adversarial evasion:
Sophisticated bots adapt to avoid detection. Counter-measures:
1. Feature rotation: periodically change which features have the highest weight.
Retrain weekly with randomized feature subsets.
2. Ensemble models: run 3 models trained on different feature subsets.
If any model flags the request → challenge. Harder to evade all 3.
3. Honeypot-based ground truth: honeypots provide fresh, unevadable labels.
Any request to a honeypot is definitively a bot, regardless of how well
it mimics human behavior on other endpoints.
4. Behavioral challenges: inject invisible challenges (JS puzzles, canvas fingerprinting)
that bots cannot solve without running a full browser engine.
Cost: 50ms latency for first request only (then cached via cookie).
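Counter-measure 2 reduces to an any-vote ensemble. A sketch, with models represented as plain callables returning a bot probability:

```python
def ensemble_decision(features, models, flag_threshold=0.8):
    """Escalate to a challenge if ANY model in the ensemble flags the request.
    An evader must stay under the threshold on all feature subsets at once."""
    scores = [model(features) for model in models]
    return "challenge" if any(s >= flag_threshold for s in scores) else "allow"
```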
Deep Dive 3: Real-Time Blocking Architecture
Challenge: Making blocking decisions in < 5ms at 1M req/sec
Architecture: Three-tier decision system
Tier 1: Blocklist (< 0.1ms)
- Pre-computed blocklist stored in memory (hash set)
- Contains IPs with reputation score < 10
- Updated every 60 seconds from Redis
- Size: ~100K IPv4 addresses × 4 bytes of key data ≈ 400 KB (plus hash-set overhead; still small enough to stay cache-resident)
- O(1) lookup. If IP is in blocklist → BLOCK immediately.
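A sketch of the Tier 1 structure: an immutable set of packed IPv4 addresses, rebuilt off to the side on each refresh and swapped by reference, so lookups never observe a half-built set. The `ipaddress`-based packing is for readability; a production version would likely keep raw 4-byte keys:

```python
import ipaddress

class Blocklist:
    """Tier-1 fast path: frozen set of packed IPv4 addresses."""
    def __init__(self):
        self._ips = frozenset()

    def refresh(self, bad_ips):
        # Build the replacement set, then swap the reference atomically.
        self._ips = frozenset(int(ipaddress.IPv4Address(ip)) for ip in bad_ips)

    def __contains__(self, ip):
        # O(1) membership test in the hot path.
        return int(ipaddress.IPv4Address(ip)) in self._ips
```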
Tier 2: Rate + Reputation (< 1ms)
- Redis lookup for IP reputation score and rate counter
- If rate > threshold → THROTTLE or BLOCK
- If reputation < 20 → CHALLENGE
- If reputation > 90 → ALLOW (skip ML)
- Most requests are resolved here (80% of traffic)
Tier 3: ML Scoring (< 2ms)
- Only for ambiguous requests (reputation 20-90, rate below threshold)
- ~20% of traffic reaches this tier
- In-process XGBoost inference (no network call)
- At 1M req/sec, 20% = 200K inferences/sec
- Each inference: ~10µs of CPU time on a modern core (the < 1ms figure above is the end-to-end latency budget, not the compute cost) → ~100K inferences/sec per core
- A couple of cores reserved for ML on each gateway node covers that node's share with headroom
Total budget breakdown:
Feature extraction: 0.5ms
Tier 1 (blocklist): 0.1ms
Tier 2 (Redis): 0.5ms
Tier 3 (ML): 1.0ms (when needed)
Decision + logging: 0.3ms
Total: 2.4ms typical, < 5ms worst case ✓
Challenge response flow (CAPTCHA):
When decision = "challenge":
1. Return 403 with challenge page (CAPTCHA or JS puzzle)
2. Client solves challenge → POST /v1/challenge/verify
3. If solved correctly:
→ Set signed cookie (HMAC + timestamp) valid for 1 hour
→ Subsequent requests include cookie → bypass challenge
→ Boost IP reputation by +10 points
4. If solved incorrectly or not attempted:
→ IP reputation decreases by 5 points
→ If multiple failures → block IP for 1 hour
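The signed cookie in step 3 can be sketched as a timestamped HMAC. The `ip|timestamp` payload format and the secret are assumptions; the 1-hour TTL matches the flow above:

```python
import hashlib
import hmac
import time

SECRET = b"rotate-me"  # per-deployment secret, rotated regularly (assumption)

def issue_cookie(ip, now=None, secret=SECRET):
    """Cookie = timestamp + HMAC-SHA256 over ip|timestamp."""
    ts = int(now if now is not None else time.time())
    sig = hmac.new(secret, f"{ip}|{ts}".encode(), hashlib.sha256).hexdigest()
    return f"{ts}.{sig}"

def verify_cookie(cookie, ip, now=None, ttl=3600, secret=SECRET):
    """Valid only for the same IP, unexpired, and with an authentic signature."""
    try:
        ts_str, sig = cookie.split(".", 1)
        ts = int(ts_str)
    except ValueError:
        return False
    expected = hmac.new(secret, f"{ip}|{ts}".encode(), hashlib.sha256).hexdigest()
    fresh = (now if now is not None else time.time()) - ts < ttl
    return fresh and hmac.compare_digest(sig, expected)
```

Binding the signature to the IP means a harvested cookie cannot be replayed from a different address; rotating residential proxies would have to re-solve the challenge on every IP change.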
7. Extensions (2 min)
- Collaborative intelligence network: Share anonymized threat data across customers. When one customer detects a new botnet IP, propagate the reputation decrease to all customers within seconds. Similar to Cloudflare’s network effect — more customers = better detection.
- Device fingerprinting: Beyond IP and TLS, use browser-level fingerprints (canvas, WebGL, font enumeration, screen resolution). Creates a device_id that persists across IP changes. Detect when a single device rotates through thousands of IPs (proxy/VPN abuse).
- Credential stuffing protection: Specialized detection for login endpoints. Track unique username/password combinations per IP. If an IP tries 1000 different username/password pairs in an hour, it is definitively credential stuffing, regardless of rate. Integrate with Have I Been Pwned to check if attempted credentials are from known breaches.
- DDoS mitigation at application layer (L7): Detect HTTP flood attacks targeting specific endpoints. Auto-scale challenge rate when traffic to a specific URL exceeds 10x normal. Progressive defense: rate limit → challenge → block → null-route at network edge.
- Bot management (not just detection): Some bots are desirable (Googlebot, partner APIs). Maintain an allow-list with verified bot identities (reverse DNS verification for Googlebot, API key for partners). Provide different rate limits and access policies for known-good bots vs. unknown bots.