1. Requirements & Scope (5 min)

Functional Requirements

  1. Get/Set/Delete — clients can store, retrieve, and remove key-value pairs with sub-millisecond latency
  2. TTL support — every key can have an optional time-to-live; expired keys are automatically purged
  3. Eviction policies — when memory is full, evict keys according to a configurable policy (LRU, LFU, random)
  4. Atomic operations — support CAS (compare-and-swap), increment/decrement, and simple Lua-script-style transactions
  5. Cluster management — automatically shard data across nodes, detect failures, and rebalance with minimal data movement

Non-Functional Requirements

  • Availability: 99.99% — the cache is in the read hot-path; downtime causes a stampede on the backing store
  • Latency: p50 < 0.5ms, p99 < 2ms for single-key operations from within the same datacenter
  • Consistency: Eventual consistency across replicas is acceptable. For a single shard, reads-after-writes must be consistent (read-your-writes from the leader)
  • Scale: 100M+ ops/sec cluster-wide, 500TB+ aggregate memory across thousands of nodes
  • Durability: Best-effort. Cache is ephemeral by design, but optional AOF/snapshotting for warm restarts

2. Estimation (3 min)

Traffic

  • 100M cache operations/sec (70% reads, 30% writes)
  • Read QPS: 70M/s, Write QPS: 30M/s
  • Average key size: 64 bytes, average value size: 1 KB
  • Average payload per op: ~1 KB → bandwidth: 100M × 1 KB = 100 GB/s cluster-wide

Storage

  • 500B unique keys in steady state (Zipf distribution — 20% of keys serve 80% of traffic)
  • Average entry: 64 B (key) + 1 KB (value) + 48 B (metadata: TTL, flags, pointers) ≈ 1.1 KB
  • Total memory: 500B × 1.1 KB = 550 TB
  • With replication factor 2 (one replica per shard): 1.1 PB raw memory

Node Count

  • If each node has 128 GB usable memory → 550 TB / 128 GB ≈ 4,300 primary shards + 4,300 replicas ≈ 8,600 nodes
  • Each node handles: 100M / 4,300 ≈ 23K ops/sec — very comfortable for a single Redis-like process

Key Insight

This is a memory-bound, latency-critical system. The hard problems are: (1) distributing keys evenly across thousands of shards, (2) handling hot keys that violate even distribution, and (3) preventing thundering herds when popular keys expire.


3. API Design (3 min)

Core Operations

GET /cache/{key}
  → 200 { "value": "...", "ttl_remaining": 3600, "cas_token": "abc123" }
  → 404 (cache miss)

PUT /cache/{key}
  Body: { "value": "...", "ttl": 3600, "cas_token": "abc123" }
  → 201 (created) | 200 (updated) | 409 (CAS conflict)

DELETE /cache/{key}
  → 204 (deleted) | 404 (not found)

POST /cache/_batch
  Body: { "gets": ["k1","k2","k3"], "sets": [{"key":"k4","value":"v4","ttl":60}] }
  → 200 { "results": { "k1": "v1", "k2": null, "k3": "v3" }, "set_acks": ["k4"] }

Key Design Decisions

  • Binary protocol in production (not REST). REST shown for clarity. Real systems use a compact binary protocol (like Memcached binary protocol or Redis RESP3) to minimize serialization overhead.
  • Batch/multi-get is critical. Clients often need 10-50 keys per request. Batching amortizes round-trip latency.
  • CAS tokens enable optimistic concurrency without distributed locks. The client reads a CAS token with GET, then includes it in the SET. If another writer changed the value, the CAS token won’t match → 409 Conflict.
  • Client-side routing: The client library knows the shard map and sends requests directly to the correct shard — no proxy hop.
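The CAS read-modify-write loop described above can be sketched end to end. The `ToyShard` class and the `get`/`set` signatures below are illustrative stand-ins for a real client library, not an actual API:

```python
class ToyShard:
    """Stand-in for one cache shard: value plus a monotonically
    increasing CAS version per key (cas_version in the data model)."""
    def __init__(self):
        self._data = {}  # key -> (value, cas_version)

    def get(self, key):
        return self._data.get(key)  # (value, cas_token) or None on miss

    def set(self, key, value, cas_token=None):
        current = self._data.get(key)
        if current is not None and cas_token is not None and current[1] != cas_token:
            return False  # 409: another writer changed the value first
        version = current[1] + 1 if current else 1
        self._data[key] = (value, version)
        return True

def cas_update(shard, key, transform, max_retries=5):
    """Optimistic read-modify-write: retry the whole cycle on CAS conflict."""
    for _ in range(max_retries):
        entry = shard.get(key)
        if entry is None:
            return None  # miss: caller populates via read-through
        value, token = entry
        new_value = transform(value)
        if shard.set(key, new_value, cas_token=token):
            return new_value
    raise RuntimeError(f"CAS retries exhausted for {key!r}")
```

The retry loop is the whole point: no locks are held between the GET and the SET, so a conflicting writer costs only one extra round trip.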

4. Data Model (3 min)

In-Memory Data Structure (per shard)

Field              Type     Notes
key                byte[]   Raw bytes, up to 250 B (Memcached default)
value              byte[]   Up to 1 MB per value
ttl_expires_at     int64    Unix timestamp in ms; 0 = no expiry
cas_version        uint64   Monotonically increasing per key
flags              uint32   Client-defined metadata (e.g., serialization format)
last_access_time   int64    For LRU eviction
access_frequency   uint16   For LFU eviction (logarithmic counter, like Redis)

Hash Table

  • Primary index: chained hash table with incremental rehashing (like Redis's dict)
  • Two hash tables during resize: reads check both; writes go to the new table; background migration moves buckets
  • Load factor trigger: resize when load > 1.0 (or shrink when < 0.1)
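A toy sketch of that two-table resize: reads probe both tables, writes land in the new one, and each operation performs a migration step in lieu of a real background migrator. Plain Python dicts stand in for the bucket arrays; only the read/write/migrate interplay is the point here:

```python
class IncrementalDict:
    """Toy model of incremental rehashing with two tables."""
    def __init__(self):
        self.old = {}         # table being drained during a resize
        self.new = {}         # table receiving all new writes
        self.resizing = False

    def start_resize(self):
        self.old, self.new = self.new, {}
        self.resizing = True

    def _migrate_step(self, n=1):
        # Move up to n entries old -> new (background work in a real shard)
        for _ in range(min(n, len(self.old))):
            key, value = self.old.popitem()
            self.new.setdefault(key, value)  # never clobber a newer write
        if not self.old:
            self.resizing = False

    def get(self, key):
        if self.resizing:
            self._migrate_step()
        if key in self.new:                  # new table wins: newest writes
            return self.new[key]
        return self.old.get(key)

    def put(self, key, value):
        if self.resizing:
            self._migrate_step()
            self.old.pop(key, None)          # old copy is now stale
        self.new[key] = value
```

The invariant to notice: a key lives in at most one authoritative place at a time, so reads stay correct mid-resize without any stop-the-world pause.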

Memory Allocator: Slab Allocator (Memcached approach)

  • Pre-allocate slab classes: 64B, 128B, 256B, 512B, 1KB, 2KB, … 1MB
  • Each class has a pool of fixed-size chunks. A key-value pair is stored in the smallest class that fits.
  • Why? Avoids external fragmentation. malloc/free on millions of small objects leads to fragmentation; slab allocation keeps memory usage predictable.
  • Trade-off: Internal fragmentation — a 65B value wastes 63B in a 128B slab. Tunable with custom slab class sizes.
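Slab-class selection is a few lines. This sketch assumes the doubling class sizes listed above (64 B up to the 1 MB value cap); real allocators use a tunable growth factor rather than strict doubling:

```python
# Slab classes double from 64 B to 1 MiB: 64 * 2^14 = 1,048,576 B
SLAB_CLASSES = [64 << i for i in range(15)]

def slab_class_for(item_size):
    """Chunk size of the smallest slab class that fits the item,
    or None if it exceeds the 1 MB cap (rejected, as in Memcached)."""
    for chunk in SLAB_CLASSES:
        if item_size <= chunk:
            return chunk
    return None

def internal_fragmentation(item_size):
    """Bytes wasted by rounding the item up to its slab class."""
    chunk = slab_class_for(item_size)
    return chunk - item_size if chunk is not None else None
```

Running `internal_fragmentation(65)` reproduces the trade-off above: a 65 B item lands in the 128 B class and wastes 63 B.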

Why Not Disk?

This is a pure in-memory store. Disk adds latency (even NVMe is 10-100x slower than DRAM). Optional persistence (AOF/RDB snapshots) is only for warm restart, not for serving reads.


5. High-Level Design (12 min)

Architecture Overview

                          ┌─────────────────────────────────────┐
                          │         Configuration Service        │
                          │  (shard map, ring topology, health)  │
                          └──────────┬──────────────────────────┘
                                     │ watch shard map
                    ┌────────────────┼────────────────────┐
                    ▼                ▼                     ▼
             ┌────────────┐   ┌────────────┐       ┌────────────┐
             │  Client Lib │   │  Client Lib │       │  Client Lib │
             │  (App Svc)  │   │  (App Svc)  │       │  (App Svc)  │
             └──────┬─────┘   └──────┬─────┘       └──────┬─────┘
                    │                │                     │
        consistent hash              │          consistent hash
        route to shard               │          route to shard
                    │                │                     │
          ┌─────────┼────────────────┼─────────────────────┤
          ▼         ▼                ▼                     ▼
   ┌─────────┐ ┌─────────┐   ┌─────────┐          ┌─────────┐
   │ Shard 0  │ │ Shard 1  │   │ Shard 2  │   ...   │ Shard N  │
   │ (Leader) │ │ (Leader) │   │ (Leader) │          │ (Leader) │
   └────┬────┘ └────┬────┘   └────┬────┘          └────┬────┘
        │           │              │                    │
        ▼           ▼              ▼                    ▼
   ┌─────────┐ ┌─────────┐   ┌─────────┐          ┌─────────┐
   │ Replica  │ │ Replica  │   │ Replica  │          │ Replica  │
   │ (Follow) │ │ (Follow) │   │ (Follow) │          │ (Follow) │
   └─────────┘ └─────────┘   └─────────┘          └─────────┘

Component Breakdown

1. Client Library (Smart Client)

  • Maintains a local copy of the shard map (hash ring)
  • Routes requests directly to the correct shard leader — no intermediate proxy
  • Handles connection pooling, pipelining, retries, and circuit breaking
  • On cache miss, can optionally call a “loader function” to fetch from the backing store and populate the cache

2. Cache Shards (Leader)

  • Single-threaded event loop (like Redis) — no locks, no contention
  • Handles GET/SET/DEL/CAS/batch operations
  • Runs eviction when memory pressure exceeds configured limit
  • Replicates writes to followers asynchronously

3. Cache Replicas (Follower)

  • Receive replication stream from leader
  • Serve read traffic if configured for read replicas (reduces leader load)
  • Promote to leader on leader failure (automatic failover)

4. Configuration Service (ZooKeeper / etcd)

  • Stores the hash ring configuration: which virtual nodes map to which physical shards
  • Stores health status of each shard
  • Clients watch for changes → update local shard map within seconds

5. Cluster Manager (Control Plane)

  • Monitors shard health via heartbeats
  • Triggers failover: promotes replica to leader, updates shard map in config service
  • Handles scaling: adds/removes shards, triggers data migration
  • Rebalances hot shards by splitting them

Request Flow (Cache Hit)

App Server → Client Lib: GET("user:123")
Client Lib → hash("user:123") → virtual node 47 → Shard 3 (leader)
Client Lib → Shard 3: GET user:123
Shard 3 → hash table lookup → found, TTL not expired
Shard 3 → Client Lib: {value, cas_token, ttl}
Client Lib → App Server: cache hit, return value

Request Flow (Cache Miss with Read-Through)

App Server → Client Lib: GET("user:123")
Client Lib → Shard 3: GET user:123
Shard 3 → hash table lookup → not found
Shard 3 → Client Lib: MISS
Client Lib → App Server: cache miss
App Server → Database: SELECT * FROM users WHERE id = 123
App Server → Client Lib: SET("user:123", userData, ttl=3600)
Client Lib → Shard 3: SET user:123 ...

6. Deep Dives (15 min)

Deep Dive 1: Consistent Hashing & Key Distribution

The Problem: With 4,300+ shards, adding or removing a node should not cause a massive rehash (which would invalidate most of the cache).

Solution: Consistent Hashing with Virtual Nodes

  1. Map the keyspace to a ring (0 to 2^32 - 1)
  2. Each physical shard owns multiple virtual nodes (e.g., 150-200 vnodes per physical node) placed at random positions on the ring
  3. A key maps to the first vnode clockwise from its hash position
  4. When a shard is added, it takes ownership of ~1/N of the ring — only keys that fall in those segments are migrated
  5. When a shard is removed, its vnodes are distributed to the remaining shards
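The five steps above fit in a small sketch. The use of `md5` and 150 vnodes per shard are illustrative choices; any stable hash with uniform output works:

```python
import bisect
import hashlib

def _hash32(s):
    """Stable 32-bit position on the ring (md5 is a stand-in hash)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (1 << 32)

class HashRing:
    def __init__(self, shards, vnodes=150):
        # Each physical shard owns `vnodes` positions on the ring
        self._ring = sorted(
            (_hash32(f"{shard}#{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def owner(self, key):
        # First vnode clockwise from the key's position (wrap at the end)
        i = bisect.bisect_right(self._positions, _hash32(key)) % len(self._ring)
        return self._ring[i][1]
```

The payoff is the migration property in step 4: removing a shard moves only the keys that shard owned, which the test below checks directly.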

Why virtual nodes?

  • With only physical nodes, the ring is unbalanced (nodes own unequal ranges)
  • Virtual nodes smooth out the distribution. With 150 vnodes per node, standard deviation of load is ~5%
  • Also enables heterogeneous hardware: a beefy node gets more vnodes than a small one

Client-Side vs Server-Side Sharding:

                    Client-Side (Memcached model)                  Server-Side (Redis Cluster model)
Routing             Client computes hash, sends to correct shard   Client sends to any node; node redirects (MOVED/ASK)
Latency             Lower (no redirect hop)                        Higher on first request (redirect), then client caches slots
Client complexity   High (must know ring topology)                 Low (dumb client works, smart client is better)
Operational         Harder to change ring (all clients must update) Cluster manages ring internally

Our choice: Client-side sharding with config-service-driven updates. Clients subscribe to shard map changes and apply each new map atomically. This gives us the lowest routing latency without letting clients drift onto a stale ring.

Handling Rebalancing:

  • When a new shard joins, the cluster manager assigns it vnodes and begins background migration
  • During migration, the old shard keeps serving reads for migrating keys
  • Writes for migrating keys go to the new shard (write-to-new, read-from-both)
  • Once migration completes, the shard map updates atomically

Deep Dive 2: Hot Key Handling & Thundering Herd Prevention

Hot Key Problem: A single key (e.g., a viral tweet, a breaking news article) can receive 1M+ reads/sec. A single shard (single-threaded) maxes out at ~200K ops/sec. This key alone would overwhelm the shard.

Solution 1: Client-Side Local Cache (L1 Cache)

  • The client library maintains a small local in-process cache (e.g., 100MB LRU)
  • Hot keys are automatically cached locally with a short TTL (1-5 seconds)
  • Eliminates network round-trips entirely for the hottest keys
  • Trade-off: stale data for up to TTL seconds. Acceptable for read-heavy, eventually-consistent use cases.
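A minimal sketch of such an L1, assuming an entry-count bound for brevity (a production cache would bound bytes) and an injectable clock for testability:

```python
import time
from collections import OrderedDict

class L1Cache:
    """Tiny in-process LRU with a short per-entry TTL."""
    def __init__(self, capacity=1024, ttl=2.0):
        self._entries = OrderedDict()  # key -> (value, expires_at)
        self._capacity = capacity
        self._ttl = ttl

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None or entry[1] <= now:
            self._entries.pop(key, None)
            return None                      # miss or expired: go to the shard
        self._entries.move_to_end(key)       # mark most-recently-used
        return entry[0]

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._entries[key] = (value, now + self._ttl)
        self._entries.move_to_end(key)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the LRU entry
```

The short TTL is what bounds staleness: even if an invalidation never reaches this process, a hot key is re-fetched from the shard within a few seconds.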

Solution 2: Server-Side Key Replication

  • Detect hot keys (shard tracks per-key access counts with a probabilistic counter)
  • Replicate hot keys to K additional shards (e.g., 5 replicas)
  • Client routes hot key reads to hash(key + random(0..K)) → spreads load across K+1 shards
  • Trade-off: writes to hot keys must update all replicas (fan-out write cost)

Solution 3: Read Replicas for Hot Shards

  • Add more read replicas to the shard that owns the hot key
  • Client reads from a random replica
  • Simple but coarse-grained — helps if the entire shard is hot, not just one key

Thundering Herd / Cache Stampede Prevention:

Scenario: A popular key expires. 10,000 concurrent requests see a cache miss and all hit the database simultaneously.

Solution A: Probabilistic Early Expiration (PER)

  • Before TTL expires, each request has an increasing probability of “voluntarily” recomputing the value
  • Formula (the XFetch algorithm): should_recompute = now - delta * beta * ln(random()) >= expiry_time, where delta is how long a recompute takes and random() is uniform in (0, 1]
  • Since ln(random()) is negative, the condition can fire shortly before expiry, with probability rising as expiry approaches. The beta parameter controls how aggressively early recomputation starts (typically 1.0)
  • Result: one or two requests recompute before expiry, the rest still see the cached value
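A runnable sketch of that check (XFetch: recompute when now - delta * beta * ln(rand) >= expiry, with delta the observed recompute cost). The time units are whatever the caller uses consistently:

```python
import math
import random

def should_recompute(now, expiry_time, delta, beta=1.0):
    """Probabilistic early expiration: the closer `now` is to
    `expiry_time`, the more likely a request volunteers to recompute.
    `delta` is the observed cost of one recompute, in the same units."""
    # 1 - random.random() is uniform in (0, 1], so log() never sees 0
    return now - delta * beta * math.log(1.0 - random.random()) >= expiry_time
```

Far from expiry the log term is negligible next to the remaining TTL, so the check almost never fires; at expiry it always fires, which is what makes the scheme a strict improvement over a hard TTL cutoff.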

Solution B: Locking / Coalescing

  • On cache miss, the first request acquires a per-key lock (a shard-local lock suffices, since exactly one shard owns each key; no distributed lock is needed)
  • Subsequent requests for the same key either wait (blocking) or get a stale value (serve-stale)
  • The lock holder fetches from DB, populates cache, releases lock
  • Other waiters then read from cache
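The coalescing half of this pattern is often called "singleflight" (after Go's package of that name). A thread-based sketch, assuming the loader does the DB fetch and cache population:

```python
import threading

class Singleflight:
    """First miss for a key runs the loader; concurrent misses for the
    same key block on that result instead of hitting the database."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done Event, result holder dict)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:                    # we are the leader
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:                                # someone is already loading
                leader = False
        event, holder = entry
        if leader:
            try:
                holder["value"] = loader()       # e.g., DB fetch + cache SET
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                event.set()
        else:
            event.wait()
        return holder.get("value")
```

No lock is held while the loader runs, so slow DB fetches for one key never block misses for other keys.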

Solution C: Background Refresh

  • For critical keys, a background job refreshes them before TTL expiry
  • Application never sees a miss for these keys
  • Works for a known set of important keys, not for arbitrary keys

Our approach: Layer all three. PER for general keys, locking for keys with expensive recomputation, background refresh for the top-1000 known-hot keys.

Deep Dive 3: Write Policies & Cache Coherence

Write-Through:

App → Cache.SET(key, value) → Cache writes to DB synchronously → ACK to app
  • Pro: Cache and DB are always consistent
  • Con: Every write has DB latency in the critical path. Write throughput limited by DB.
  • Use when: Strong consistency is required (e.g., user account data)

Write-Behind (Write-Back):

App → Cache.SET(key, value) → ACK immediately → Cache async-flushes to DB
  • Pro: Write latency = cache latency only. Can batch/coalesce writes to DB.
  • Con: Data loss risk if cache node dies before flush. DB is temporarily stale.
  • Use when: High write throughput matters more than durability (e.g., view counters, session data)

Write-Around:

App → DB.WRITE(key, value) → invalidate Cache.DEL(key)
Next read → cache miss → read from DB → populate cache
  • Pro: Cache only holds data that’s actually read (no wasted memory on write-only keys)
  • Con: First read after write is slow (cache miss + DB read)
  • Use when: Writes rarely re-read immediately (e.g., log entries, analytics events)
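The three policies can be contrasted side by side. The toy dicts and the explicit drain() below stand in for a real DB client and a background flusher:

```python
from collections import deque

class WritePolicyDemo:
    """Toy cache + DB pair showing the three write policies."""
    def __init__(self):
        self.cache, self.db = {}, {}
        self.flush_queue = deque()

    def write_through(self, key, value):
        self.db[key] = value          # synchronous DB write in the critical path
        self.cache[key] = value

    def write_behind(self, key, value):
        self.cache[key] = value       # ACK immediately...
        self.flush_queue.append((key, value))

    def drain(self):
        # ...flush to the DB later; a real flusher can batch and coalesce
        while self.flush_queue:
            key, value = self.flush_queue.popleft()
            self.db[key] = value

    def write_around(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)     # invalidate; next read repopulates
```

The write-behind data-loss risk is visible in the model: anything still in `flush_queue` when the node dies never reaches the DB.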

Cache Invalidation Strategies:

  1. TTL-based: Simple, eventual consistency. Good default.
  2. Event-driven invalidation: DB publishes change events (CDC / binlog) → invalidation service deletes stale cache keys. Tighter consistency.
  3. Lease-based: On cache miss, the cache gives the client a lease (token). Client fetches from DB, then sets with the lease. If the key was updated by another writer during the fetch, the lease is invalid → prevents stale set.
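A toy sketch of the lease mechanism in (3): a miss hands out a token, and a later SET is accepted only if the key was not invalidated in between. The token scheme here (a simple counter) is an illustrative simplification:

```python
import itertools

class LeasedCache:
    def __init__(self):
        self._data = {}                 # key -> value
        self._leases = {}               # key -> outstanding lease token
        self._next_token = itertools.count(1)

    def get(self, key):
        if key in self._data:
            return ("hit", self._data[key])
        token = next(self._next_token)
        self._leases[key] = token       # newest lease wins
        return ("miss", token)

    def delete(self, key):
        """Invalidation (e.g., driven by a DB write) voids any lease."""
        self._data.pop(key, None)
        self._leases.pop(key, None)

    def set_with_lease(self, key, value, token):
        if self._leases.get(key) != token:
            return False                # stale set prevented
        del self._leases[key]
        self._data[key] = value
        return True
```

This closes the classic race: a reader that fetched from the DB before a concurrent write cannot later install its now-stale value over the invalidation.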

Cache Coherence Across Datacenters:

  • Each datacenter has its own cache cluster
  • Writes in DC-A invalidate cache in DC-B via an async cross-DC invalidation bus
  • Uses “delete-on-write” semantics: when a key is written in any DC, all other DCs delete that key from their local cache
  • This avoids conflicting values across DCs (at the cost of a cache miss in remote DCs after a write)

7. Extensions (2 min)

  • Multi-tenancy & isolation: Namespace keys per tenant, enforce per-tenant memory quotas, and implement fair-share scheduling so one tenant’s hot keys don’t starve others.
  • Tiered caching (DRAM + NVMe): Store hot data in DRAM (sub-ms latency) and warm data on NVMe SSDs (< 5 ms latency). This increases effective capacity per node roughly 10x at slightly higher tail latency. Facebook's CacheLib does this.
  • Compression: For values > 1KB, compress with LZ4 (fast) or ZSTD (better ratio). Reduces memory usage by 2-4x for text-heavy workloads. Trade-off: CPU overhead per request.
  • Observability: Expose per-shard metrics (hit rate, eviction rate, memory usage, p99 latency), per-key access tracking for hot key detection, and slow-log for operations exceeding latency thresholds.
  • Graceful degradation: When a shard is overloaded, shed load by returning cache misses for low-priority keys (based on key prefix / namespace priority). Better to miss the cache than to crash the shard.