1. Requirements & Scope (5 min)
Functional Requirements
- Limit the number of API requests a client can make within a time window
- Support multiple rate limiting rules (per user, per IP, per API endpoint, per service)
- Return appropriate response when rate limit is exceeded (429 Too Many Requests)
- Include rate limit headers in every response (remaining quota, reset time)
- Support different limiting algorithms (configurable per rule)
Non-Functional Requirements
- Availability: 99.999% — if the rate limiter goes down, we either block all traffic (fail-closed) or allow all traffic (fail-open). Both are bad. It must be more available than the services it protects.
- Latency: < 1ms overhead per request — the rate limiter is in the hot path of every API call
- Consistency: Approximate accuracy is acceptable. If the limit is 100 req/min, allowing 105 is fine. Allowing 1000 is not.
- Scale: Must handle 10M+ requests/sec across all services
- Distributed: Must work correctly across multiple API gateway instances (not just per-node limiting)
2. Estimation (3 min)
Traffic
- 10M requests/sec across all API endpoints
- Each request → 1 rate limit check (read + conditional write to counter)
- 10M rate limit operations/sec
Storage
- Active rate limit keys: assume 50M unique clients (user_id or IP)
- Each key: ~100 bytes (key + counter + window metadata)
- Total: 50M × 100 bytes = 5GB — fits entirely in memory (Redis)
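A quick sanity check on the storage math (pure arithmetic from the figures above):

```python
# Back-of-envelope memory footprint for the rate limit counters.
active_keys = 50_000_000     # unique clients (user_id or IP)
bytes_per_key = 100          # key string + counter + window metadata

total_gb = active_keys * bytes_per_key / 10**9
print(total_gb)  # 5.0 -> fits comfortably in a Redis cluster's RAM
```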
Key Insight
This is a latency-critical, memory-bound system. All state lives in Redis. The challenge is distributed correctness (accurate counting across multiple gateway instances) at extreme throughput with < 1ms latency.
3. API Design (3 min)
The rate limiter itself isn’t a user-facing API — it’s middleware. But it has an internal API and affects response headers.
Internal API (called by API Gateway)
// Check if request should be allowed
POST /ratelimit/check
Body: {
  "client_id": "user_123",       // or IP address
  "endpoint": "/api/v1/tweets",
  "method": "POST"
}
Response 200: {
  "allowed": true,
  "remaining": 47,
  "limit": 100,
  "reset_at": 1708632060         // Unix timestamp
}
// or
Response 200: {
  "allowed": false,
  "remaining": 0,
  "limit": 100,
  "reset_at": 1708632060,
  "retry_after": 23              // seconds
}
Response Headers (added to every API response)
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1708632060
// When rate limited:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708632060
Retry-After: 23
Rules Configuration API
POST /ratelimit/rules
Body: {
  "name": "tweets_post_limit",
  "endpoint": "/api/v1/tweets",
  "method": "POST",
  "key": "user_id",              // rate limit per user
  "limit": 100,
  "window": "1h",
  "algorithm": "sliding_window"
}
4. Data Model (3 min)
Rate Limit Rules (PostgreSQL or config file)
Table: rate_limit_rules
rule_id (PK) | int
name | varchar(100)
endpoint_pattern | varchar(200) -- regex or exact match
method | varchar(10)
key_type | enum('user_id', 'ip', 'api_key', 'service')
limit | int -- note: "limit" is a reserved word in SQL; quote it or rename in practice
window_seconds | int
algorithm | enum('fixed_window', 'sliding_window', 'token_bucket', 'leaky_bucket')
enabled | boolean
Rate Limit Counters (Redis)
// Fixed Window
Key: rl:{rule_id}:{client_id}:{window_start}
Value: int (request count)
TTL: window_seconds
// Sliding Window Log
Key: rl:{rule_id}:{client_id}
Type: Sorted Set
Members: request timestamps
Score: timestamp
// Token Bucket
Key: rl:{rule_id}:{client_id}
Type: Hash
Fields: tokens (float), last_refill (timestamp)
5. High-Level Design (12 min)
Architecture
Client → Load Balancer → API Gateway
  → Rate Limiter Middleware:
      1. Extract client identifier (user_id, IP, API key)
      2. Look up applicable rules (cached locally, refreshed from Rules Service)
      3. For each applicable rule:
         → Check counter in Redis
         → If allowed: increment counter, pass through
         → If exceeded: return 429
      4. Add rate limit headers to response
  → Backend Service (if allowed)
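The middleware flow above can be sketched roughly as follows. This is an illustrative sketch: the rule table and counter store are in-memory stand-ins for the Rules Service and Redis, it uses a simple fixed window for brevity, and names like `check_request` are not a real API:

```python
import time

# In-memory stand-ins for the Rules Service cache and the Redis counters.
RULES = [  # normally cached locally, refreshed from the Rules Service
    {"endpoint": "/api/v1/tweets", "method": "POST", "limit": 100, "window": 60},
]
counters = {}  # (endpoint, client_id, window_start) -> request count

def check_request(client_id, endpoint, method, now=None):
    """Return (status, rate-limit headers) for one incoming request."""
    now = now or time.time()
    for rule in RULES:
        if rule["endpoint"] == endpoint and rule["method"] == method:
            window_start = int(now) - int(now) % rule["window"]
            key = (endpoint, client_id, window_start)
            count = counters.get(key, 0)
            if count >= rule["limit"]:
                return 429, {"X-RateLimit-Remaining": 0}
            counters[key] = count + 1
            return 200, {"X-RateLimit-Remaining": rule["limit"] - count - 1}
    return 200, {}  # no rule matched: pass through

status, headers = check_request("user_123", "/api/v1/tweets", "POST")
```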
Components
- API Gateway (with Rate Limiter middleware): Nginx/Envoy/custom gateway. Rate limiting runs as middleware before routing to backend.
- Redis Cluster: Stores all rate limit counters. Must be co-located with API gateways for < 1ms latency.
- Rules Service: Manages rate limit rules. Rules cached locally at each gateway (refresh every 30 seconds).
- Rules DB (PostgreSQL): Persistent storage for rate limit configuration.
- Monitoring: Real-time dashboards showing rate limit hits, top-limited clients, rule effectiveness.
Deployment
Region A:
  API Gateway instances (10+) → Redis Cluster (same AZ)
Region B:
  API Gateway instances (10+) → Redis Cluster (same AZ)
Critical: Redis must be in the same availability zone as the API Gateway. Cross-AZ Redis calls add 1-2ms of latency, which on its own exceeds the < 1ms overhead budget.
6. Deep Dives (15 min)
Deep Dive 1: Rate Limiting Algorithms
1. Fixed Window Counter
Window: every minute, starting at :00
Key: rl:rule1:user123:202602221800
Operation: INCR key; if count > limit → reject
TTL: 60 seconds
- Pros: Simple, O(1) per request, minimal storage
- Cons: Boundary burst problem — user can send 100 requests at 18:00:59 and 100 more at 18:01:00 → 200 requests in 2 seconds while “respecting” the 100/min limit
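The boundary burst is easy to demonstrate with an in-memory fixed-window counter (illustrative sketch, not production code):

```python
# In-memory fixed-window counter showing the boundary-burst problem.
LIMIT, WINDOW = 100, 60
counters = {}  # window_start -> request count for one client

def allow(now):
    window_start = now - now % WINDOW   # e.g. every second of 18:00:xx maps to 18:00:00
    counters[window_start] = counters.get(window_start, 0) + 1
    return counters[window_start] <= LIMIT

# 100 requests at second 59 and 100 more at second 60: all allowed,
# i.e. 200 requests in 2 seconds while "respecting" the 100/min limit.
burst1 = sum(allow(59) for _ in range(100))  # window [0, 60)
burst2 = sum(allow(60) for _ in range(100))  # window [60, 120)
print(burst1 + burst2)  # 200
```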
2. Sliding Window Log
Key: rl:rule1:user123 (Sorted Set)
On request (run as a single Lua script or MULTI/EXEC so the steps are atomic):
1. ZREMRANGEBYSCORE key 0 (now - window) // remove entries older than the window
2. ZCARD key // count entries in window
3. If count < limit: ZADD key now now // add new entry (score = member = timestamp)
4. EXPIRE key window_seconds // so idle keys are evicted
- Pros: Perfectly accurate, no boundary issues
- Cons: O(N) storage where N = max requests in window. For 100 req/min limit → 100 entries per key. For 10M requests/sec across 50M keys → expensive.
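The same steps, sketched against an in-memory sorted list instead of a Redis sorted set (illustrative only; Redis would use ZREMRANGEBYSCORE / ZCARD / ZADD):

```python
import bisect

# In-memory analogue of the sliding-window-log algorithm for one
# (rule, client) key. The list plays the role of the Redis sorted set.
LIMIT, WINDOW = 100, 60
log = []  # sorted request timestamps

def allow(now):
    # 1. Drop entries older than the window (ZREMRANGEBYSCORE).
    del log[:bisect.bisect_left(log, now - WINDOW)]
    # 2. Count what's left (ZCARD); 3. admit and record (ZADD) if under the limit.
    if len(log) < LIMIT:
        bisect.insort(log, now)
        return True
    return False
```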
3. Sliding Window Counter (best trade-off)
Combines fixed window + weighted overlap:
current_window_count + (previous_window_count × overlap_percentage)
Example (100 req/min limit, current time = 18:01:15):
Previous window (18:00 - 18:01): 84 requests
Current window (18:01 - 18:02): 36 requests
Overlap: (60 - 15) / 60 = 75% of previous window
Estimated count: 36 + (84 × 0.75) = 99 → allowed (under 100)
- Pros: O(1) per request, very accurate (error < 0.003% per Cloudflare’s analysis), minimal storage (2 counters per key)
- Cons: Approximate (but close enough)
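The weighted-overlap estimate can be written directly; the numbers below match the walkthrough above:

```python
# Sliding window counter: estimate the rolling-window count from two
# fixed-window counters plus the overlap weight.
def estimated_count(curr_count, prev_count, seconds_into_window, window=60):
    overlap = (window - seconds_into_window) / window
    return curr_count + prev_count * overlap

# Example from above: current time 18:01:15, limit 100/min.
est = estimated_count(curr_count=36, prev_count=84, seconds_into_window=15)
print(est)  # 99.0 -> allowed (under 100)
```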
4. Token Bucket
Bucket has max N tokens. Refills at rate R tokens/sec.
Each request consumes 1 token.
If bucket is empty → reject.
Redis implementation (Lua script for atomicity):
-- KEYS[1] = bucket key; ARGV[1] = max_tokens, ARGV[2] = refill_rate, ARGV[3] = now
local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens')) or tonumber(ARGV[1])
local last_refill = tonumber(redis.call('HGET', KEYS[1], 'last_refill')) or tonumber(ARGV[3])
local elapsed = tonumber(ARGV[3]) - last_refill
tokens = math.min(tonumber(ARGV[1]), tokens + elapsed * tonumber(ARGV[2]))
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', ARGV[3])
return allowed
- Pros: Allows controlled bursts (up to bucket size), smooth rate limiting
- Cons: Slightly more complex, 2 values per key instead of 1
Recommendation: Sliding Window Counter for most use cases (simple + accurate). Token Bucket when burst handling matters (API rate limits for external developers).
Deep Dive 2: Distributed Rate Limiting
The problem: 10 API Gateway instances, each checking Redis independently. With a naive read-then-write, there’s a race condition:
- Gateway A: reads count = 99, allows the request, writes 100
- Gateway B: reads count = 99 (before A’s write lands), allows the request, also writes 100 (lost update)
- Result: 101 requests allowed when the limit is 100, and the counter undercounts what actually got through
Solution 1: Lua Scripts (atomic operations in Redis)
-- Atomic check-and-increment
local current = redis.call('INCR', KEYS[1])
if current == 1 then
redis.call('EXPIRE', KEYS[1], ARGV[1]) -- set TTL on first request
end
if current > tonumber(ARGV[2]) then
return 0 -- rejected
end
return 1 -- allowed
This runs atomically in Redis — no race condition. INCR is O(1) and the Lua script executes as a single Redis command.
Solution 2: Local rate limiting with sync (for extreme throughput)
- Each gateway maintains a local counter (in-process)
- Local limit = global_limit / num_instances (e.g., 100/10 = 10 per instance)
- Periodically sync with Redis (every 1-5 seconds)
- Pros: Zero network calls for most requests (sub-microsecond)
- Cons: Less accurate (bursty traffic may over/under-limit by 10-20%)
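A sketch of the local-counter idea; the periodic Redis sync is stubbed out, and the class and method names are illustrative:

```python
import time

class LocalRateLimiter:
    """Per-gateway local counter: local_limit = global_limit / num_instances.

    The sync with Redis is omitted here; in production a background thread
    would push local counts and pull the global total every 1-5 seconds.
    """
    def __init__(self, global_limit, num_instances, window=60):
        self.local_limit = global_limit // num_instances
        self.window = window
        self.count = 0
        self.window_start = time.time()

    def allow(self, now=None):
        now = now or time.time()
        if now - self.window_start >= self.window:  # roll the window over
            self.count, self.window_start = 0, now
        if self.count < self.local_limit:
            self.count += 1
            return True
        return False

limiter = LocalRateLimiter(global_limit=100, num_instances=10)  # 10 req/window locally
```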
Recommendation: Lua scripts for most cases. Local counters only if Redis latency becomes a bottleneck (unlikely at < 1ms).
Deep Dive 3: Failure Modes and Resilience
What happens when Redis goes down?
Two philosophies:
- Fail-open: If Redis is unreachable, allow all requests. Risk: no rate limiting → potential DDoS of backend services.
- Fail-closed: If Redis is unreachable, reject all requests. Risk: complete service outage.
Best approach: Graceful degradation
- Redis Cluster with replicas — automatic failover in < 5 seconds
- If Redis is completely down:
- Fall back to local in-memory rate limiting per gateway instance
- Each instance enforces local_limit = global_limit / expected_instances
- Less accurate but prevents both zero-limiting and total outage
- Circuit breaker on Redis calls — after 3 consecutive failures, switch to local mode for 30 seconds before retrying Redis
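The circuit-breaker behavior above can be sketched as follows (thresholds match the bullet: 3 consecutive failures, 30-second cool-off; `check_redis` and `check_local` are hypothetical callables supplied by the gateway):

```python
import time

class RedisCircuitBreaker:
    """Switch to local-mode rate limiting after repeated Redis failures."""
    FAILURE_THRESHOLD = 3   # consecutive failures before opening the circuit
    COOL_OFF_SECONDS = 30   # how long to stay in local mode before retrying Redis

    def __init__(self, check_redis, check_local):
        self.check_redis = check_redis  # callable that may raise on Redis errors
        self.check_local = check_local  # in-memory fallback limiter
        self.failures = 0
        self.open_until = 0.0

    def check(self, request, now=None):
        now = now or time.time()
        if now < self.open_until:             # circuit open: local mode, no Redis call
            return self.check_local(request)
        try:
            result = self.check_redis(request)
            self.failures = 0                 # a success closes the circuit
            return result
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.FAILURE_THRESHOLD:
                self.open_until = now + self.COOL_OFF_SECONDS
                self.failures = 0
            return self.check_local(request)  # fail over for this request
```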
Monitoring and alerting:
- Dashboard: requests allowed/rejected per rule, per endpoint, per client tier
- Alert on: Redis latency > 5ms, rate limit hit rate > 50% for any rule, Redis failover events
- Per-client rate limit status endpoint: let developers check their own quota usage
7. Extensions (2 min)
- Multi-level limiting: Global limit (service-wide) + per-user limit + per-endpoint limit. Check all levels; reject if any is exceeded.
- Dynamic rate limits: Adjust limits based on system load. If backend service is at 90% CPU, automatically tighten rate limits by 50%.
- Rate limit tiers: Free users: 100 req/min. Pro users: 1000 req/min. Enterprise: 10,000 req/min. Key type = API key → look up tier.
- Cost-based rate limiting: Instead of counting requests, count “cost units.” A GET = 1 unit, a POST = 5 units, a search = 10 units.
- Distributed rate limiting across regions: Global rate limit = sum of regional limits. Requires cross-region Redis sync or a centralized coordinator.
- IP reputation: Integrate with threat intelligence. Known bad IPs get stricter limits automatically.