Everyone thinks the rate limiter is a for-loop with a counter. “Count requests per user per minute, reject when it goes over 100, done.” Then the interviewer asks the questions that actually matter: which algorithm, where does the counter live when you have 40 API servers behind a load balancer, how do you stay accurate when two servers increment the same user’s counter at the same millisecond, and how much latency does this add to every single request in the system. Now it is a real problem.

The rate limiter is deceptive because it sits in front of everything. It runs on the hot path of every request, so it must be fast, and it must be correct, and those two goals fight each other the moment you go distributed. The whole design comes down to three decisions: the algorithm, where the limiter physically runs, and how you keep the count consistent across a cluster without paying a round trip to a shared store on every request.

Let me do it properly.

Functional Requirements (FR)

In scope:

  • Limit requests per client per window. Given a client identity (API key, user ID, or IP), allow up to N requests per time window and reject the rest.
  • Multiple rules / tiers. Different limits for different endpoints and different customer tiers (free user 100/min, paid user 10,000/min, the /login endpoint 5/min regardless of tier).
  • Reject cleanly. Over-limit requests get HTTP 429 Too Many Requests with a Retry-After header and rate-limit headers so clients can back off.
  • Configurable rules at runtime. Operators can change limits without redeploying the fleet.
  • Distributed correctness. The limit is global across the whole cluster, not per-server. A user limited to 100/min must not get 100 per server times 40 servers.

Explicitly out of scope (say this out loud so you control the scope):

  • Authentication and identity resolution. We assume the request already carries a resolved client identity; how you authenticate is a separate system.
  • Billing, quota accounting, and usage metering (a rate limiter is short-window traffic shaping, not a monthly-quota ledger).
  • DDoS / L3-L4 volumetric attack mitigation. That belongs at the network edge (scrubbing, SYN cookies, anycast). We handle application-layer request rate, not packet floods.
  • Full WAF rule engines. We do rate rules, not signature-based blocking.

The one decision that drives everything: the rate limiter runs on the critical path of every request. Every millisecond and every failure mode it introduces is multiplied by total system QPS. So the design is dominated by “how do I make the allow/deny decision fast and available, while keeping the count accurate across many nodes.”

Non-Functional Requirements (NFR)

  • Scale: the fleet fronts ~1M requests/sec at peak. The limiter itself therefore does ~1M limit-checks/sec. Tens of millions of distinct clients (keys/users/IPs).
  • Latency: the limiter must add under ~1ms at p99 to a request. Anything more and you have made every API call slower to protect against a minority of abusers. Ideally the common “allowed” case is sub-millisecond and never blocks on a slow store.
  • Availability: 99.99%+. But there is a subtle policy question: if the limiter’s backing store is down, do you fail open (allow the request) or fail closed (reject)? For most public APIs you fail open - a rate limiter outage should not take down the whole API. For a login endpoint protecting against brute force, you may fail closed. This is a per-rule choice.
  • Consistency: eventual, bounded. Perfect global accuracy on every request would require a synchronous quorum write per request, which kills latency. We accept a small, bounded over-count (a few percent overshoot at window boundaries) in exchange for speed. The interview answer is “as accurate as we can get without a synchronous round trip per request.”
  • Durability: near zero. Counters are ephemeral, sub-minute state. If we lose the counters on a crash, the worst case is a brief window where a few clients get extra requests. We never persist rate-limit counters to disk; they live in memory (Redis).

Back-of-the-Envelope Estimation (BoE)

Let me ground the design in numbers.

Request rate (the limiter’s own load):

Peak system traffic: 1,000,000 requests/sec
The limiter checks every request:
  = 1M limit-checks/sec

Each check is: read counter -> compare -> maybe increment.

If every check is a network round trip to a central store:

1M checks/sec against Redis.
A single Redis node does ~100K-200K simple ops/sec comfortably.
1M / 150K ≈ 7+ Redis nodes just to hold the op rate,
and every request pays one RTT (~0.3-0.5ms in-datacenter).

That RTT-per-request number is the whole ballgame. It tells us two things: (1) we must shard Redis, and (2) we want to avoid a round trip on the common case if we can, which is what pushes us toward local token buckets with async sync.

Memory for counters:

Distinct active clients in a window: ~10M
Per client, per rule, we store a small structure:

  Token bucket: (tokens_float, last_refill_ts)  ≈ 16 bytes value
  + key overhead (client_id + rule_id string) ≈ 40-60 bytes
  ≈ ~80 bytes/client/rule with Redis overhead

10M clients * ~80 bytes ≈ 800 MB
With a few rules per client, call it 2-4 GB.

Tiny. The entire rate-limit state fits in RAM on a small Redis cluster many times over. This is a latency and coordination problem, not a storage problem.

Sliding window log cost (why we reject it at scale):

Sliding window LOG stores a timestamp per request per client.
A client doing 10,000 req/min = 10,000 timestamps held for 60s.
10,000 * 8 bytes = 80 KB for ONE heavy client.
100K heavy clients = 8 GB just for one minute of one rule's logs,
plus the CPU to trim the sorted set on every request.

That memory-and-CPU blowup is exactly why the sliding window log, while the most accurate algorithm, does not survive contact with scale. We will use it as the correctness baseline and then pick something cheaper.

Bandwidth: each limit-check request/response is tiny (a key and a small integer). Even at 1M checks/sec against Redis, that is a few hundred MB/sec of internal traffic spread across the shard cluster. Negligible compared to the actual application payloads. This is a request-rate and latency problem, not bandwidth.

High-Level Design (HLD)

The rate limiter is not a standalone service you route to; it is a decision made inline before a request reaches business logic. The question is where that decision physically runs and what state backs it. Here is the end-to-end picture with the limiter living at the API gateway, backed by a sharded Redis cluster, with a local in-process cache in front of Redis to kill the common-case round trip.

                        ┌─────────────────────────┐
                        │        Clients            │
                        │ (apps, servers, browsers) │
                        └────────────┬──────────────┘
                                     │
                        ┌────────────▼──────────────┐
                        │      Load Balancer          │
                        └────────────┬──────────────┘
                                     │
                 ┌───────────────────┼───────────────────┐
                 │                   │                   │
        ┌────────▼───────┐  ┌────────▼───────┐  ┌────────▼───────┐
        │  API Gateway 1  │  │  API Gateway 2 │  │  API Gateway N │
        │ ┌────────────┐ │  │ ┌────────────┐ │  │ ┌────────────┐ │
        │ │Rate Limiter│ │  │ │Rate Limiter│ │  │ │Rate Limiter│ │
        │ │ middleware │ │  │ │ middleware │ │  │ │ middleware │ │
        │ │  + local   │ │  │ │  + local   │ │  │ │  + local   │ │
        │ │  cache     │ │  │ │  cache     │ │  │ │  cache     │ │
        │ └─────┬──────┘ │  │ └─────┬──────┘ │  │ └─────┬──────┘ │
        └───────┼────────┘  └───────┼────────┘  └───────┼────────┘
                │  allow            │                   │
                │  -> upstream      │ (429 if denied)   │
                ▼                   ▼                   ▼
        ┌───────────────────────────────────────────────────┐
        │        Upstream / Business Services                 │
        └───────────────────────────────────────────────────┘

        Shared counter state (consulted by all gateways):
        ┌───────────────────────────────────────────────────┐
        │      Redis Cluster (sharded by client_id)           │
        │   shard = hash(client_id) % num_shards              │
        │   each shard: primary + replica, Lua-atomic ops     │
        └───────────────────────────────────────────────────┘

        Rules & config:
        ┌───────────────────────────────────────────────────┐
        │  Config Service (rules) -> pushed to gateways       │
        │  (etcd / control plane), hot-reloaded, no redeploy  │
        └───────────────────────────────────────────────────┘

Request flow (the hot path):

  1. A request lands on a gateway instance. The rate-limiter middleware resolves the client identity (API key/user/IP) and the applicable rule(s) for the endpoint.
  2. The middleware builds a key like rl:{rule_id}:{client_id} and asks: “is this client under its limit right now?”
  3. It first consults the local in-process state for that key (a local token bucket or a short-lived cached decision). If the local view can decide confidently (clearly under limit, or a local reservation is available), it decides immediately with zero network calls.
  4. On the boundary cases, or on a fixed cadence, it consults the shared Redis shard for that client via an atomic operation (a Lua script) that reads-modifies-writes the counter in one round trip.
  5. If allowed, the request proceeds to the upstream service and the limiter returns rate-limit headers (X-RateLimit-Remaining, etc.).
  6. If denied, the middleware short-circuits with 429 Too Many Requests and a Retry-After header. The upstream is never touched.

The key architectural insight is step 3: you do not want a Redis round trip on every one of a million requests per second. The design tension the whole rest of the post resolves is “keep it accurate globally (needs shared state) while keeping it fast (avoid shared state on the hot path).”

Component Deep Dive

Three hard parts: (1) which algorithm, (2) where the limiter lives, and (3) making it accurate in a distributed cluster with Redis. I will walk each from naive to scalable.

1. The Algorithm: Fixed Window -> Sliding Window -> Token Bucket

Approach A: Fixed window counter (the naive instinct)

Keep one integer per client per calendar window. Reset it when the clock ticks over to the next window.

# key: "rl:{client}:{minute_bucket}"
def allow(client, limit):
    bucket = current_minute()                 # e.g. 2026-07-05T10:31
    key = f"rl:{client}:{bucket}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, 60)
    return count <= limit

Dead simple, one counter, O(1) memory per client. Where it breaks: the boundary burst. The window resets on a hard clock edge, so a client can send limit requests in the last second of one window and limit more in the first second of the next window - 2 * limit requests in a ~2-second span while technically never violating either window. For a limit of 100/min, an attacker gets 200 requests in 2 seconds. That defeats the purpose of the limit, which was to smooth traffic, not just to cap a calendar minute.

Approach B: Sliding window log (the accurate but expensive fix)

Store the exact timestamp of every request in a sorted set. To decide, drop all timestamps older than now - window, then count what remains.

def allow(client, limit, window=60):
    key = f"rl:{client}"
    now = time.time()
    redis.zremrangebyscore(key, 0, now - window)  # trim old
    count = redis.zcard(key)
    if count < limit:
        redis.zadd(key, {str(now): now})
        redis.expire(key, window)
        return True
    return False

This is perfectly accurate - it counts exactly the requests in the trailing window seconds, no boundary burst. Where it breaks: memory and CPU. As the BoE showed, you store one entry per request. A client doing 10,000 req/min holds 10,000 timestamps, and every single request pays a ZREMRANGEBYSCORE + ZCARD + ZADD on a growing sorted set. At a million requests per second this is both a memory blowup (gigabytes of timestamps) and a CPU blowup (sorted-set trimming on the hot path). Accurate, but it does not survive scale. Keep it as the mental gold standard.

Approach C: Sliding window counter (the practical middle)

Approximate the sliding window using two fixed-window counters and a weighted average. You keep the current window’s count and the previous window’s count, and estimate the rolling count by weighting the previous window by how far into the current window you are.

estimated = current_count
          + previous_count * (1 - elapsed_fraction_of_current_window)

# Example: limit 100/min, 25% into the current minute.
# previous minute had 80, current minute has 30 so far.
# estimated = 30 + 80 * (1 - 0.25) = 30 + 60 = 90  -> allow (90 < 100)

O(1) memory (two counters), no per-request timestamp storage, and it smooths the boundary burst to within a small approximation error (it assumes the previous window’s requests were uniformly distributed, which is not exactly true, but the error is a few percent). This is what a lot of production systems actually run because it is cheap and good enough.

Approach D: Token bucket (the answer for most APIs)

A bucket holds up to B tokens and refills at R tokens/sec. Each request takes one token; if the bucket is empty, reject. You store just two numbers per client: the current token count and the last refill timestamp. On each request you lazily refill based on elapsed time, then try to take a token.

def allow(client, capacity, refill_rate):
    key = f"rl:{client}"
    now = time.time()
    tokens, last = redis.hmget(key, "tokens", "ts")  # read state
    tokens = float(tokens) if tokens else capacity
    last = float(last) if last else now

    # lazy refill: add tokens for elapsed time, cap at capacity
    tokens = min(capacity, tokens + (now - last) * refill_rate)

    if tokens >= 1:
        tokens -= 1
        allowed = True
    else:
        allowed = False

    redis.hset(key, mapping={"tokens": tokens, "ts": now})
    return allowed

Why this wins for general API rate limiting:

  • O(1) memory - two numbers per client, like the sliding counter, but it also models bursts naturally.
  • Allows controlled bursts. A client that has been quiet accumulates tokens up to capacity and can burst that many at once, then is throttled to the steady refill_rate. This matches how real clients behave (idle, then a burst) and is usually what you want - you cap the sustained rate while tolerating short spikes.
  • Lazy refill means no background timer. You compute refill on read from elapsed time. No cron, no per-bucket ticking thread.

Its cousin, the leaky bucket, models the opposite: a fixed-rate outflow queue that smooths bursts into a constant drip. Use leaky bucket when the downstream needs a strictly steady rate (e.g. protecting a fragile legacy backend); use token bucket when you want to allow bursts up to a cap (most public APIs).

Decision summary:

Algorithm Memory/client Boundary burst Allows bursts Cost Verdict
Fixed window O(1) Yes (2x) No Cheapest Too inaccurate
Sliding log O(requests) None No Expensive Gold standard, does not scale
Sliding counter O(1) Approx none No Cheap Good default
Token bucket O(1) None Yes (to cap) Cheap The answer for APIs
Leaky bucket O(1) None No (steady out) Cheap When downstream needs steady rate

The answer: token bucket for general API rate limiting (bursts are a feature), with sliding window counter as the alternative when you specifically want to forbid bursts.

2. Where the Limiter Lives: Gateway vs Sidecar vs Middleware

The algorithm is portable; where you run it changes the whole operational story. Three placements.

Approach A: A middleware library in each service (the naive start)

Drop a rate-limiter library into every service. Where it breaks: every team must integrate and upgrade it, you get N language implementations (Go service, Java service, Python service each need their own), and rules are scattered. It works for one service but does not give you one place to enforce and change policy across the platform. Fine for a monolith, wrong for a fleet.

Approach B: At the API gateway (the standard answer)

Put the rate limiter in the API gateway (the layer every external request already passes through: Kong, Envoy, NGINX, or your own gateway). This is the natural home because:

  • One choke point. Every request already traverses the gateway, so you get one place to enforce, one place to configure, one place to observe. Rules are centralized.
  • Reject early. An over-limit request dies at the edge before consuming any upstream capacity - it never opens a connection to a business service. That is the whole point: protect the expensive services by shedding load cheaply and early.
  • Language-agnostic. The limiter is implemented once at the gateway, not re-implemented per service language.

The trade-off: the gateway now holds state (or talks to a state store), and it is on the critical path, so its latency budget is tight. That is what the local-cache + Redis design below solves.

Approach C: Sidecar (service mesh) for internal / east-west traffic

For internal service-to-service traffic (service A calling service B inside your mesh), a central gateway is not on that path - internal calls do not egress through the edge gateway. Here you run the limiter as a sidecar proxy (Envoy next to each pod in a service mesh like Istio). Each pod’s sidecar enforces limits on traffic to/from that pod, talking to the same shared counter store.

North-south (external):  Client -> Edge Gateway [rate limit] -> Services
East-west  (internal):   Service A -> [sidecar rate limit] -> Service B

The rule of thumb I would state in the interview: gateway for north-south (external ingress), sidecar for east-west (internal mesh) traffic. Both share the same algorithm and the same backing Redis so limits are consistent regardless of path. The sidecar trades a little more resource overhead (a proxy per pod) for enforcement on paths the edge gateway never sees.

3. Distributed Accuracy with Redis (the core hard part)

A single gateway with a local counter is trivially accurate. The problem: you have 40 gateway instances behind a load balancer, and a client’s requests are sprayed across all of them. Each instance sees only a fraction of the client’s traffic, so no single instance can enforce a global limit alone.

Naive: local counters only

Each gateway keeps its own in-memory counter. Where it breaks: with 40 gateways and a limit of 100/min, if you set each local counter to 100 the client gets 4,000/min globally. If you divide (100/40 ≈ 2.5 each), a client whose traffic happens to hash to one gateway gets throttled at 2-3 while the global limit is 100. Local-only is either far too loose or too tight. You need shared state.

Better: shared Redis counter with atomic operations

All gateways read and write the same counter in a central Redis, sharded by client. The critical requirement is atomicity: the read-refill-decrement of a token bucket must be one indivisible operation, or two gateways racing on the same client will both read “1 token left,” both decrement, and both allow - a lost-update race that lets the client exceed the limit.

The wrong way is a naive GET then SET from the application: that is a read-modify-write with a gap in the middle where another gateway interleaves. The right way is to make Redis do the whole thing atomically. Redis is single-threaded per shard, so a Lua script executes start-to-finish with no interleaving:

-- token_bucket.lua  KEYS[1]=bucket key
-- ARGV: capacity, refill_rate, now, requested (usually 1)
local capacity    = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now         = tonumber(ARGV[3])
local requested   = tonumber(ARGV[4])

local state = redis.call("HMGET", KEYS[1], "tokens", "ts")
local tokens = tonumber(state[1])
local last   = tonumber(state[2])
if tokens == nil then tokens = capacity; last = now end

-- lazy refill based on elapsed time
tokens = math.min(capacity, tokens + (now - last) * refill_rate)

local allowed = 0
if tokens >= requested then
  tokens = tokens - requested
  allowed = 1
end

redis.call("HSET", KEYS[1], "tokens", tokens, "ts", now)
-- expire idle buckets so memory self-cleans
redis.call("PEXPIRE", KEYS[1], math.ceil(capacity / refill_rate * 1000) + 1000)

return { allowed, tokens }

This gives us exact global accuracy - one atomic operation per request, no race - at the cost of one Redis round trip per request. That is the accuracy-vs-latency trade in its purest form. At 1M req/sec this is the design that is correct but pays a full RTT every time and needs a sharded Redis to hold the op rate.

Best for scale: local token bucket + periodic Redis sync (the hybrid)

To get accuracy without a round trip on every request, each gateway keeps a local token bucket per hot client and periodically reconciles with Redis. The common case (clearly under limit) is decided locally in nanoseconds; Redis is consulted only at the boundary or on a fixed cadence.

Two proven patterns:

  • Batched reservations (the practical one). A gateway does not ask Redis for one token at a time; it atomically claims a small batch (say 10 tokens) from the global bucket in one round trip, then serves those 10 requests locally with zero further Redis calls. When its local reservation runs low, it claims another batch. This cuts Redis traffic by the batch factor (10x fewer round trips) at the cost of a bounded overshoot: a gateway holding an unused reservation when the window shifts can let the global count run slightly over. You tune batch size to trade accuracy against Redis load.

  • Sync-and-correct. Each gateway serves from its local bucket and every T milliseconds pushes its local consumption to Redis and pulls back the authoritative global remaining, then corrects its local view. Between syncs it can drift by at most (num_gateways * per-gateway consumption in T). You bound the error by choosing T.

Both accept a small, bounded overshoot in exchange for killing the per-request round trip. This is the honest interview answer: perfect global accuracy needs a synchronous atomic op per request; at a million requests per second you trade a few percent of overshoot for an order-of-magnitude drop in coordination cost. State the knob (batch size / sync interval) and that it directly tunes accuracy vs latency.

Redis failure and the fail-open/fail-closed decision

If the Redis shard for a client is unreachable, the gateway cannot consult global state. Policy:

  • Fail open (allow) for most public API traffic - a rate-limiter outage must not become an API outage. The local bucket still provides a coarse ceiling, so it is not wide open.
  • Fail closed (reject) for security-critical rules like /login brute-force protection, where letting requests through unlimited is worse than a brief outage.

Make this a per-rule flag. Also run Redis as primary + replica per shard with fast failover (Redis Sentinel or Cluster) so a single node loss does not trigger fail-open at all.

API Design & Data Schema

Client-facing behavior (headers, not a REST API)

The rate limiter is inline middleware, so its “API” is the set of headers and status codes it attaches to normal traffic:

Any request -> gateway:

Allowed (request proceeds upstream), response carries:
  X-RateLimit-Limit: 100
  X-RateLimit-Remaining: 42
  X-RateLimit-Reset: 1751702400        # epoch when tokens refill

Denied:
  HTTP/1.1 429 Too Many Requests
  Retry-After: 12                       # seconds until a token is available
  X-RateLimit-Limit: 100
  X-RateLimit-Remaining: 0
  X-RateLimit-Reset: 1751702400
  Body: { "error": "rate_limit_exceeded", "retry_after": 12 }

Admin / control-plane API (managing rules)

PUT /admin/v1/rules/{rule_id}
Body:
{
  "match": { "endpoint": "/api/v1/*", "tier": "free" },
  "algorithm": "token_bucket",
  "capacity": 100,
  "refill_rate_per_sec": 1.67,          # ~100/min sustained
  "key_by": "api_key",                  # api_key | user_id | ip
  "on_store_failure": "fail_open"       # fail_open | fail_closed
}
Response 200: { "rule_id": "free-tier-default", "version": 7 }

GET /admin/v1/rules            -> list all active rules
DELETE /admin/v1/rules/{id}    -> remove a rule

Rules live in a config service (etcd or similar) and are pushed to every gateway, which hot-reloads them into memory. No redeploy to change a limit. Gateways read rules from local memory on the hot path (never a network call to fetch a rule per request).

Data store choice: in-memory KV (Redis), not SQL

The state is tiny, ephemeral, and accessed by exact key at extreme rate: client_id -> {tokens, last_ts}. There are no joins, no range scans, no durability requirement, and we need atomic read-modify-write at a million ops/sec. That is the exact profile for an in-memory key-value store (Redis), sharded. SQL is the wrong tool: durable disk-backed transactions add latency and durability we do not need, and a relational engine cannot cheaply serve atomic per-key increments at this rate. We deliberately want the data to be non-durable - counters are throwaway sub-minute state.

Counter state (Redis, per client per rule)

Key:   rl:{rule_id}:{client_id}          -- e.g. "rl:free-default:apikey_abc123"
Type:  Hash
Fields:
  tokens   FLOAT      -- current tokens remaining (fractional; lazy refill)
  ts       FLOAT      -- last refill timestamp (epoch seconds)
TTL:   set to ~ (capacity / refill_rate) + slack, so idle buckets self-evict
Access: single atomic Lua script (read + refill + decrement + expire)
Shard:  hash(client_id) % num_shards

For the sliding-window-counter alternative, the value is instead two integers (current_window_count, previous_window_count) keyed with the window bucket, updated with an atomic INCR + weighted read.

Rules (config service, e.g. etcd)

rule_id            STRING PK
match              JSON        -- endpoint pattern, tier, method
algorithm          ENUM        -- token_bucket | sliding_counter | leaky_bucket
capacity           INT
refill_rate        FLOAT
key_by             ENUM        -- api_key | user_id | ip
on_store_failure   ENUM        -- fail_open | fail_closed
version            INT

Rules are read-heavy, tiny, and change rarely - perfect for a strongly consistent config store that pushes updates to gateways. This is the one place we want strong consistency (a stale rule that allows 10x traffic is bad), and the volume is trivial so consistency is cheap here.

Bottlenecks & Scaling

Where it breaks first, in order, and the fix for each:

1. Redis round-trip on every request (breaks first). At 1M req/sec, one RTT per request both saturates Redis op capacity and adds latency to every call. Fix: the hybrid local-bucket design - batched reservations or periodic sync so the common case is decided in-process and Redis is consulted per batch, not per request. This is the single most important scaling move and it trades a bounded overshoot for an order-of-magnitude drop in Redis load.

2. Single Redis node as bottleneck / SPOF. One node cannot hold 1M ops/sec and its loss would blind the whole fleet. Fix: shard Redis by client_id (hash(client_id) % num_shards). Each client’s state lives on exactly one shard, so an atomic Lua op touches one shard and load spreads evenly. Client IDs are high-cardinality and uniformly hashed, so no natural hot shard. Run each shard as primary + replica with Sentinel/Cluster failover for availability.

Shard key choice: client_id. Every limit-check is keyed by client, so the client is the natural partition key and each check is a single-shard, single-key atomic op. Sharding by rule_id would be wrong - a few popular rules would create hot shards and every client on that rule would pile onto one node.

3. Hot key (one client hammering, or a shared IP behind NAT). A single super-heavy client concentrates all its traffic on one Redis key/shard - the viral or abusive client scenario. Fix: the local token bucket already absorbs most of a hot client’s traffic in-process (that client’s requests are decided locally between syncs, not sent to Redis each time). For a genuinely enormous single key you can also shard one logical bucket into K sub-buckets (client_id:0..K-1, each holding capacity/K) spread across shards, picking a sub-bucket per request; you trade a little accuracy for spreading a single hot key’s load. Note the shared-NAT-IP case: if you key by IP, a whole office behind one NAT shares a limit - prefer keying by API key/user where possible.

4. Latency added to the critical path. The limiter sits in front of everything, so its latency multiplies by total QPS. Fix: decide the common “allowed” case in-process (no network), keep the Lua op at O(1), colocate Redis in the same datacenter/AZ as the gateways to keep the occasional RTT sub-millisecond, and never fetch rules over the network on the hot path (rules are pushed and cached in gateway memory).

5. Boundary-burst inaccuracy. Fixed windows let clients get 2x at the edges. Fix: token bucket (or sliding window counter) instead of fixed window - no hard reset edge, so no boundary doubling. Covered in the algorithm deep dive.

6. Store outage taking down the API. If the counter store is unreachable and you fail closed, a Redis blip becomes a full API outage. Fix: per-rule fail-open/fail-closed policy - fail open for general traffic (the local bucket still caps abuse coarsely), fail closed only for security-critical endpoints. Combined with replicated shards and fast failover so the outage window is tiny.

7. Clock skew across gateways. Token-bucket lazy refill uses timestamps; if gateway clocks disagree, refill math drifts. Fix: compute elapsed time using Redis server time (via the Lua TIME command) as the single clock source for the atomic path, rather than each gateway’s local clock, so all refills reference one monotonic authority. Keep gateways NTP-synced for the local-bucket path and keep sync intervals short so drift stays bounded.

Wrap-Up

The trade-offs that define this design:

  • Token bucket over sliding-window-log. We give up the log’s perfect accuracy and take O(1) memory plus natural burst handling. The log is the correctness gold standard but its per-request memory and CPU do not survive a million requests per second. Token bucket is the answer for general APIs; sliding window counter is the pick when bursts must be forbidden.
  • Gateway for north-south, sidecar for east-west. One choke point at the edge to reject over-limit traffic before it costs anything upstream, and sidecars for internal mesh traffic the edge never sees. Same algorithm, same Redis, consistent limits on every path.
  • Local bucket + periodic Redis sync over a round trip per request. This is the central accuracy-vs-latency call. Perfect global accuracy needs a synchronous atomic op per request; we accept a small bounded overshoot (tuned by batch size / sync interval) to decide the common case in-process and cut Redis load by an order of magnitude.
  • Atomic Lua on a sharded, non-durable Redis. Single-threaded Redis makes the read-refill-decrement indivisible, killing the lost-update race; sharding by client_id spreads the load with no hot shard; we deliberately keep the state in memory and throwaway.
  • Fail open by default, fail closed where it matters. A rate-limiter outage must not become an API outage - except on endpoints where letting traffic through is worse than a brief block.

One-line summary: a token-bucket rate limiter enforced inline at the API gateway (and mesh sidecars), where each node decides the common case from a local bucket and reconciles with an atomic, Lua-scripted, client-sharded Redis on a fixed cadence, trading a small bounded overshoot for sub-millisecond decisions at a million requests per second.