1. Requirements & Scope (5 min)
Functional Requirements
- Limit the number of API requests a client can make within a time window
- Support multiple rate limiting rules (per user, per IP, per API endpoint, per service)
- Return appropriate response when rate limit is exceeded (429 Too Many Requests)
- Include rate limit headers in every response (remaining quota, reset time)
- Support different limiting algorithms (configurable per rule)
Non-Functional Requirements
- Availability: 99.999% — if the rate limiter goes down, we either block all traffic (fail-closed) or allow all traffic (fail-open). Both are bad. It must be more available than the services it protects.
- Latency: < 1ms overhead per request — the rate limiter is in the hot path of every API call
- Consistency: Approximate accuracy is acceptable. If the limit is 100 req/min, allowing 105 is fine. Allowing 1000 is not.
- Scale: Must handle 10M+ requests/sec across all services
- Distributed: Must work correctly across multiple API gateway instances (not just per-node limiting)
2. Estimation (3 min)
Traffic
- 10M requests/sec across all API endpoints
- Each request → 1 rate limit check (read + conditional write to counter)
- 10M rate limit operations/sec
Storage
- Active rate limit keys: assume 50M unique clients (user_id or IP)
- Each key: ~100 bytes (key + counter + window metadata)
- Total: 50M × 100 bytes = 5GB — fits entirely in memory (Redis)
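A quick sanity check on the storage math (pure arithmetic from the figures above):

```python
# Back-of-envelope memory footprint for the rate limit counters.
active_keys = 50_000_000     # unique clients (user_id or IP)
bytes_per_key = 100          # key string + counter + window metadata

total_gb = active_keys * bytes_per_key / 10**9
print(total_gb)  # 5.0 -> fits comfortably in a Redis cluster's RAM
```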
Key Insight
This is a latency-critical, memory-bound system. All state lives in Redis. The challenge is distributed correctness (accurate counting across multiple gateway instances) at extreme throughput with < 1ms latency.
3. API Design (3 min)
The rate limiter itself isn’t a user-facing API — it’s middleware. But it has an internal API and affects response headers.
Internal API (called by API Gateway)
// Check if request should be allowed
POST /ratelimit/check
Body: {
  "client_id": "user_123",       // or IP address
  "endpoint": "/api/v1/tweets",
  "method": "POST"
}
Response 200: {
  "allowed": true,
  "remaining": 47,
  "limit": 100,
  "reset_at": 1708632060         // Unix timestamp
}
// or
Response 200: {
  "allowed": false,
  "remaining": 0,
  "limit": 100,
  "reset_at": 1708632060,
  "retry_after": 23              // seconds
}
Response Headers (added to every API response)
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1708632060
// When rate limited:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708632060
Retry-After: 23
Rules Configuration API
POST /ratelimit/rules
Body: {
  "name": "tweets_post_limit",
  "endpoint": "/api/v1/tweets",
  "method": "POST",
  "key": "user_id",              // rate limit per user
  "limit": 100,
  "window": "1h",
  "algorithm": "sliding_window"
}
4. Data Model (3 min)
Rate Limit Rules (PostgreSQL or config file)
Table: rate_limit_rules
rule_id (PK) | int
name | varchar(100)
endpoint_pattern | varchar(200) -- regex or exact match
method | varchar(10)
key_type | enum('user_id', 'ip', 'api_key', 'service')
limit | int -- note: "limit" is a reserved word in SQL; quote it or rename in practice
window_seconds | int
algorithm | enum('fixed_window', 'sliding_window', 'token_bucket', 'leaky_bucket')
enabled | boolean
Rate Limit Counters (Redis)
// Fixed Window
Key: rl:{rule_id}:{client_id}:{window_start}
Value: int (request count)
TTL: window_seconds
// Sliding Window Log
Key: rl:{rule_id}:{client_id}
Type: Sorted Set
Members: request timestamps
Score: timestamp
// Token Bucket
Key: rl:{rule_id}:{client_id}
Type: Hash
Fields: tokens (float), last_refill (timestamp)
5. High-Level Design (12 min)
Architecture
Client → Load Balancer → API Gateway
  → Rate Limiter Middleware:
      1. Extract client identifier (user_id, IP, API key)
      2. Look up applicable rules (cached locally, refreshed from Rules Service)
      3. For each applicable rule:
         → Check counter in Redis
         → If allowed: increment counter, pass through
         → If exceeded: return 429
      4. Add rate limit headers to response
  → Backend Service (if allowed)
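The middleware flow above can be sketched roughly as follows. This is an illustrative sketch: the rule table and counter store are in-memory stand-ins for the Rules Service and Redis, it uses a simple fixed window for brevity, and names like `check_request` are not a real API:

```python
import time

# In-memory stand-ins for the Rules Service cache and the Redis counters.
RULES = [  # normally cached locally, refreshed from the Rules Service
    {"endpoint": "/api/v1/tweets", "method": "POST", "limit": 100, "window": 60},
]
counters = {}  # (endpoint, client_id, window_start) -> request count

def check_request(client_id, endpoint, method, now=None):
    """Return (status, rate-limit headers) for one incoming request."""
    now = now or time.time()
    for rule in RULES:
        if rule["endpoint"] == endpoint and rule["method"] == method:
            window_start = int(now) - int(now) % rule["window"]
            key = (endpoint, client_id, window_start)
            count = counters.get(key, 0)
            if count >= rule["limit"]:
                return 429, {"X-RateLimit-Remaining": 0}
            counters[key] = count + 1
            return 200, {"X-RateLimit-Remaining": rule["limit"] - count - 1}
    return 200, {}  # no rule matched: pass through

status, headers = check_request("user_123", "/api/v1/tweets", "POST")
```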
Components
- API Gateway (with Rate Limiter middleware): Nginx/Envoy/custom gateway. Rate limiting runs as middleware before routing to backend.
- Redis Cluster: Stores all rate limit counters. Must be co-located with API gateways for < 1ms latency.
- Rules Service: Manages rate limit rules. Rules cached locally at each gateway (refresh every 30 seconds).
- Rules DB (PostgreSQL): Persistent storage for rate limit configuration.
- Monitoring: Real-time dashboards showing rate limit hits, top-limited clients, rule effectiveness.
Deployment
Region A:
  API Gateway instances (10+) → Redis Cluster (same AZ)
Region B:
  API Gateway instances (10+) → Redis Cluster (same AZ)
Critical: Redis must be in the same availability zone as the API Gateway. Cross-AZ Redis calls add 1-2ms of latency, which on its own exceeds the < 1ms overhead budget.
6. Deep Dives (15 min)
Deep Dive 1: Rate Limiting Algorithms
1. Fixed Window Counter
Window: every minute, starting at :00
Key: rl:rule1:user123:202602221800
Operation: INCR key; if count > limit → reject
TTL: 60 seconds
- Pros: Simple, O(1) per request, minimal storage
- Cons: Boundary burst problem — user can send 100 requests at 18:00:59 and 100 more at 18:01:00 → 200 requests in 2 seconds while “respecting” the 100/min limit
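The boundary burst is easy to demonstrate with an in-memory fixed-window counter (illustrative sketch, not production code):

```python
# In-memory fixed-window counter showing the boundary-burst problem.
LIMIT, WINDOW = 100, 60
counters = {}  # window_start -> request count for one client

def allow(now):
    window_start = now - now % WINDOW   # e.g. every second of 18:00:xx maps to 18:00:00
    counters[window_start] = counters.get(window_start, 0) + 1
    return counters[window_start] <= LIMIT

# 100 requests at second 59 and 100 more at second 60: all allowed,
# i.e. 200 requests in 2 seconds while "respecting" the 100/min limit.
burst1 = sum(allow(59) for _ in range(100))  # window [0, 60)
burst2 = sum(allow(60) for _ in range(100))  # window [60, 120)
print(burst1 + burst2)  # 200
```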
2. Sliding Window Log
Key: rl:rule1:user123 (Sorted Set)
On request (run as a single Lua script or MULTI/EXEC so the steps are atomic):
1. ZREMRANGEBYSCORE key 0 (now - window) // remove entries older than the window
2. ZCARD key // count entries in window
3. If count < limit: ZADD key now now // add new entry (score = member = timestamp)
4. EXPIRE key window_seconds // so idle keys are evicted
- Pros: Perfectly accurate, no boundary issues
- Cons: O(N) storage where N = max requests in window. For 100 req/min limit → 100 entries per key. For 10M requests/sec across 50M keys → expensive.
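The same steps, sketched against an in-memory sorted list instead of a Redis sorted set (illustrative only; Redis would use ZREMRANGEBYSCORE / ZCARD / ZADD):

```python
import bisect

# In-memory analogue of the sliding-window-log algorithm for one
# (rule, client) key. The list plays the role of the Redis sorted set.
LIMIT, WINDOW = 100, 60
log = []  # sorted request timestamps

def allow(now):
    # 1. Drop entries older than the window (ZREMRANGEBYSCORE).
    del log[:bisect.bisect_left(log, now - WINDOW)]
    # 2. Count what's left (ZCARD); 3. admit and record (ZADD) if under the limit.
    if len(log) < LIMIT:
        bisect.insort(log, now)
        return True
    return False
```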
3. Sliding Window Counter (best trade-off)
Combines fixed window + weighted overlap:
current_window_count + (previous_window_count × overlap_percentage)
Example (100 req/min limit, current time = 18:01:15):
Previous window (18:00 - 18:01): 84 requests
Current window (18:01 - 18:02): 36 requests
Overlap: (60 - 15) / 60 = 75% of previous window
Estimated count: 36 + (84 × 0.75) = 99 → allowed (under 100)
- Pros: O(1) per request, very accurate (error < 0.003% per Cloudflare’s analysis), minimal storage (2 counters per key)
- Cons: Approximate (but close enough)
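The weighted-overlap estimate can be written directly; the numbers below match the walkthrough above:

```python
# Sliding window counter: estimate the rolling-window count from two
# fixed-window counters plus the overlap weight.
def estimated_count(curr_count, prev_count, seconds_into_window, window=60):
    overlap = (window - seconds_into_window) / window
    return curr_count + prev_count * overlap

# Example from above: current time 18:01:15, limit 100/min.
est = estimated_count(curr_count=36, prev_count=84, seconds_into_window=15)
print(est)  # 99.0 -> allowed (under 100)
```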
4. Token Bucket
Bucket has max N tokens. Refills at rate R tokens/sec.
Each request consumes 1 token.
If bucket is empty → reject.
Redis implementation (Lua script for atomicity):
-- KEYS[1] = bucket key; ARGV[1] = max_tokens, ARGV[2] = refill_rate, ARGV[3] = now
local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens')) or tonumber(ARGV[1])
local last_refill = tonumber(redis.call('HGET', KEYS[1], 'last_refill')) or tonumber(ARGV[3])
local elapsed = tonumber(ARGV[3]) - last_refill
tokens = math.min(tonumber(ARGV[1]), tokens + elapsed * tonumber(ARGV[2]))
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', ARGV[3])
return allowed
- Pros: Allows controlled bursts (up to bucket size), smooth rate limiting
- Cons: Slightly more complex, 2 values per key instead of 1
Recommendation: Sliding Window Counter for most use cases (simple + accurate). Token Bucket when burst handling matters (API rate limits for external developers).
Deep Dive 2: Distributed Rate Limiting
The problem: 10 API Gateway instances, each checking Redis independently. With a naive read-then-write, there’s a race condition:
- Gateway A: reads count = 99, allows the request, writes 100
- Gateway B: reads count = 99 (before A’s write lands), allows the request, also writes 100 (lost update)
- Result: 101 requests allowed when the limit is 100, and the counter undercounts what actually got through
Solution 1: Lua Scripts (atomic operations in Redis)
-- Atomic check-and-increment
local current = redis.call('INCR', KEYS[1])
if current == 1 then
redis.call('EXPIRE', KEYS[1], ARGV[1]) -- set TTL on first request
end
if current > tonumber(ARGV[2]) then
return 0 -- rejected
end
return 1 -- allowed
This runs atomically in Redis — no race condition. INCR is O(1) and the Lua script executes as a single Redis command.
Solution 2: Local rate limiting with sync (for extreme throughput)
- Each gateway maintains a local counter (in-process)
- Local limit = global_limit / num_instances (e.g., 100/10 = 10 per instance)
- Periodically sync with Redis (every 1-5 seconds)
- Pros: Zero network calls for most requests (sub-microsecond)
- Cons: Less accurate (bursty traffic may over/under-limit by 10-20%)
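A sketch of the local-counter idea; the periodic Redis sync is stubbed out, and the class and method names are illustrative:

```python
import time

class LocalRateLimiter:
    """Per-gateway local counter: local_limit = global_limit / num_instances.

    The sync with Redis is omitted here; in production a background thread
    would push local counts and pull the global total every 1-5 seconds.
    """
    def __init__(self, global_limit, num_instances, window=60):
        self.local_limit = global_limit // num_instances
        self.window = window
        self.count = 0
        self.window_start = time.time()

    def allow(self, now=None):
        now = now or time.time()
        if now - self.window_start >= self.window:  # roll the window over
            self.count, self.window_start = 0, now
        if self.count < self.local_limit:
            self.count += 1
            return True
        return False

limiter = LocalRateLimiter(global_limit=100, num_instances=10)  # 10 req/window locally
```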
Recommendation: Lua scripts for most cases. Local counters only if Redis latency becomes a bottleneck (unlikely at < 1ms).
Deep Dive 3: Failure Modes and Resilience
What happens when Redis goes down?
Two philosophies:
- Fail-open: If Redis is unreachable, allow all requests. Risk: no rate limiting → potential DDoS of backend services.
- Fail-closed: If Redis is unreachable, reject all requests. Risk: complete service outage.
Best approach: Graceful degradation
- Redis Cluster with replicas — automatic failover in < 5 seconds
- If Redis is completely down:
- Fall back to local in-memory rate limiting per gateway instance
- Each instance enforces local_limit = global_limit / expected_instances
- Less accurate but prevents both zero-limiting and total outage
- Circuit breaker on Redis calls — after 3 consecutive failures, switch to local mode for 30 seconds before retrying Redis
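The circuit-breaker behavior above can be sketched as follows (thresholds match the bullet: 3 consecutive failures, 30-second cool-off; `check_redis` and `check_local` are hypothetical callables supplied by the gateway):

```python
import time

class RedisCircuitBreaker:
    """Switch to local-mode rate limiting after repeated Redis failures."""
    FAILURE_THRESHOLD = 3   # consecutive failures before opening the circuit
    COOL_OFF_SECONDS = 30   # how long to stay in local mode before retrying Redis

    def __init__(self, check_redis, check_local):
        self.check_redis = check_redis  # callable that may raise on Redis errors
        self.check_local = check_local  # in-memory fallback limiter
        self.failures = 0
        self.open_until = 0.0

    def check(self, request, now=None):
        now = now or time.time()
        if now < self.open_until:             # circuit open: local mode, no Redis call
            return self.check_local(request)
        try:
            result = self.check_redis(request)
            self.failures = 0                 # a success closes the circuit
            return result
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.FAILURE_THRESHOLD:
                self.open_until = now + self.COOL_OFF_SECONDS
                self.failures = 0
            return self.check_local(request)  # fail over for this request
```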
Monitoring and alerting:
- Dashboard: requests allowed/rejected per rule, per endpoint, per client tier
- Alert on: Redis latency > 5ms, rate limit hit rate > 50% for any rule, Redis failover events
- Per-client rate limit status endpoint: let developers check their own quota usage
7. Extensions (2 min)
- Multi-level limiting: Global limit (service-wide) + per-user limit + per-endpoint limit. Check all levels; reject if any is exceeded.
- Dynamic rate limits: Adjust limits based on system load. If backend service is at 90% CPU, automatically tighten rate limits by 50%.
- Rate limit tiers: Free users: 100 req/min. Pro users: 1000 req/min. Enterprise: 10,000 req/min. Key type = API key → look up tier.
- Cost-based rate limiting: Instead of counting requests, count “cost units.” A GET = 1 unit, a POST = 5 units, a search = 10 units.
- Distributed rate limiting across regions: Global rate limit = sum of regional limits. Requires cross-region Redis sync or a centralized coordinator.
- IP reputation: Integrate with threat intelligence. Known bad IPs get stricter limits automatically.