1. Requirements & Scope (5 min)

Functional Requirements

  1. Get/Set/Delete — clients can store, retrieve, and remove key-value pairs with sub-millisecond latency
  2. TTL support — every key can have an optional time-to-live; expired keys are automatically purged
  3. Eviction policies — when memory is full, evict keys according to a configurable policy (LRU, LFU, random)
  4. Atomic operations — support CAS (compare-and-swap), increment/decrement, and simple Lua-script-style transactions
  5. Cluster management — automatically shard data across nodes, detect failures, and rebalance with minimal data movement

Non-Functional Requirements

  • Availability: 99.99% — the cache is in the read hot-path; downtime causes a stampede on the backing store
  • Latency: p50 < 0.5ms, p99 < 2ms for single-key operations from within the same datacenter
  • Consistency: Eventual consistency across replicas is acceptable. For a single shard, reads-after-writes must be consistent (read-your-writes from the leader)
  • Scale: 100M+ ops/sec cluster-wide, 500TB+ aggregate memory across thousands of nodes
  • Durability: Best-effort. Cache is ephemeral by design, but optional AOF/snapshotting for warm restarts

2. Estimation (3 min)

Traffic

  • 100M cache operations/sec (70% reads, 30% writes)
  • Read QPS: 70M/s, Write QPS: 30M/s
  • Average key size: 64 bytes, average value size: 1 KB
  • Average payload per op: ~1 KB → bandwidth: 100M × 1 KB = 100 GB/s cluster-wide

Storage

  • 500B unique keys in steady state (Zipf distribution — 20% of keys serve 80% of traffic)
  • Average entry: 64 B (key) + 1 KB (value) + 48 B (metadata: TTL, flags, pointers) ≈ 1.1 KB
  • Total memory: 500B × 1.1 KB = 550 TB
  • With replication factor 2 (one replica per shard): 1.1 PB raw memory

Node Count

  • If each node has 128 GB usable memory → 550 TB / 128 GB ≈ 4,300 primary shards + 4,300 replicas ≈ 8,600 nodes
  • Each node handles: 100M / 4,300 ≈ 23K ops/sec — very comfortable for a single Redis-like process

Key Insight

This is a memory-bound, latency-critical system. The hard problems are: (1) distributing keys evenly across thousands of shards, (2) handling hot keys that violate even distribution, and (3) preventing thundering herds when popular keys expire.


3. API Design (3 min)

Core Operations

GET /cache/{key}
  → 200 { "value": "...", "ttl_remaining": 3600, "cas_token": "abc123" }
  → 404 (cache miss)

PUT /cache/{key}
  Body: { "value": "...", "ttl": 3600, "cas_token": "abc123" }
  → 201 (created) | 200 (updated) | 409 (CAS conflict)

DELETE /cache/{key}
  → 204 (deleted) | 404 (not found)

POST /cache/_batch
  Body: { "gets": ["k1","k2","k3"], "sets": [{"key":"k4","value":"v4","ttl":60}] }
  → 200 { "results": { "k1": "v1", "k2": null, "k3": "v3" }, "set_acks": ["k4"] }

Key Design Decisions

  • Binary protocol in production (not REST). REST shown for clarity. Real systems use a compact binary protocol (like Memcached binary protocol or Redis RESP3) to minimize serialization overhead.
  • Batch/multi-get is critical. Clients often need 10-50 keys per request. Batching amortizes round-trip latency.
  • CAS tokens enable optimistic concurrency without distributed locks. The client reads a CAS token with GET, then includes it in the SET. If another writer changed the value, the CAS token won’t match → 409 Conflict.
  • Client-side routing: The client library knows the shard map and sends requests directly to the correct shard — no proxy hop.
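The CAS read-modify-write loop described above can be sketched end to end. The `ToyShard` class and the `get`/`set` signatures below are illustrative stand-ins for a real client library, not an actual API:

```python
class ToyShard:
    """Stand-in for one cache shard: value plus a monotonically
    increasing CAS version per key (cas_version in the data model)."""
    def __init__(self):
        self._data = {}  # key -> (value, cas_version)

    def get(self, key):
        return self._data.get(key)  # (value, cas_token) or None on miss

    def set(self, key, value, cas_token=None):
        current = self._data.get(key)
        if current is not None and cas_token is not None and current[1] != cas_token:
            return False  # 409: another writer changed the value first
        version = current[1] + 1 if current else 1
        self._data[key] = (value, version)
        return True

def cas_update(shard, key, transform, max_retries=5):
    """Optimistic read-modify-write: retry the whole cycle on CAS conflict."""
    for _ in range(max_retries):
        entry = shard.get(key)
        if entry is None:
            return None  # miss: caller populates via read-through
        value, token = entry
        new_value = transform(value)
        if shard.set(key, new_value, cas_token=token):
            return new_value
    raise RuntimeError(f"CAS retries exhausted for {key!r}")
```

The retry loop is the whole point: no locks are held between the GET and the SET, so a conflicting writer costs only one extra round trip.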

4. Data Model (3 min)

In-Memory Data Structure (per shard)

Field              Type     Notes
key                byte[]   Raw bytes, up to 250 B (Memcached default)
value              byte[]   Up to 1 MB per value
ttl_expires_at     int64    Unix timestamp in ms; 0 = no expiry
cas_version        uint64   Monotonically increasing per key
flags              uint32   Client-defined metadata (e.g., serialization format)
last_access_time   int64    For LRU eviction
access_frequency   uint16   For LFU eviction (logarithmic counter, like Redis)

Hash Table

  • Primary index: chained hash table with incremental rehashing (like Redis's dict)
  • Two hash tables during resize: reads check both; writes go to the new table; background migration moves buckets
  • Load factor trigger: resize when load > 1.0 (or shrink when < 0.1)
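A toy sketch of that two-table resize: reads probe both tables, writes land in the new one, and each operation performs a migration step in lieu of a real background migrator. Plain Python dicts stand in for the bucket arrays; only the read/write/migrate interplay is the point here:

```python
class IncrementalDict:
    """Toy model of incremental rehashing with two tables."""
    def __init__(self):
        self.old = {}         # table being drained during a resize
        self.new = {}         # table receiving all new writes
        self.resizing = False

    def start_resize(self):
        self.old, self.new = self.new, {}
        self.resizing = True

    def _migrate_step(self, n=1):
        # Move up to n entries old -> new (background work in a real shard)
        for _ in range(min(n, len(self.old))):
            key, value = self.old.popitem()
            self.new.setdefault(key, value)  # never clobber a newer write
        if not self.old:
            self.resizing = False

    def get(self, key):
        if self.resizing:
            self._migrate_step()
        if key in self.new:                  # new table wins: newest writes
            return self.new[key]
        return self.old.get(key)

    def put(self, key, value):
        if self.resizing:
            self._migrate_step()
            self.old.pop(key, None)          # old copy is now stale
        self.new[key] = value
```

The invariant to notice: a key lives in at most one authoritative place at a time, so reads stay correct mid-resize without any stop-the-world pause.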

Memory Allocator: Slab Allocator (Memcached approach)

  • Pre-allocate slab classes: 64B, 128B, 256B, 512B, 1KB, 2KB, … 1MB
  • Each class has a pool of fixed-size chunks. A key-value pair is stored in the smallest class that fits.
  • Why? Avoids external fragmentation. malloc/free on millions of small objects leads to fragmentation; slab allocation keeps memory usage predictable.
  • Trade-off: Internal fragmentation — a 65B value wastes 63B in a 128B slab. Tunable with custom slab class sizes.
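Slab-class selection is a few lines. This sketch assumes the doubling class sizes listed above (64 B up to the 1 MB value cap); real allocators use a tunable growth factor rather than strict doubling:

```python
# Slab classes double from 64 B to 1 MiB: 64 * 2^14 = 1,048,576 B
SLAB_CLASSES = [64 << i for i in range(15)]

def slab_class_for(item_size):
    """Chunk size of the smallest slab class that fits the item,
    or None if it exceeds the 1 MB cap (rejected, as in Memcached)."""
    for chunk in SLAB_CLASSES:
        if item_size <= chunk:
            return chunk
    return None

def internal_fragmentation(item_size):
    """Bytes wasted by rounding the item up to its slab class."""
    chunk = slab_class_for(item_size)
    return chunk - item_size if chunk is not None else None
```

Running `internal_fragmentation(65)` reproduces the trade-off above: a 65 B item lands in the 128 B class and wastes 63 B.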

Why Not Disk?

This is a pure in-memory store. Disk adds latency (even NVMe is 10-100x slower than DRAM). Optional persistence (AOF/RDB snapshots) is only for warm restart, not for serving reads.


5. High-Level Design (12 min)

Architecture Overview

                          ┌─────────────────────────────────────┐
                          │         Configuration Service        │
                          │  (shard map, ring topology, health)  │
                          └──────────┬──────────────────────────┘
                                     │ watch shard map
                    ┌────────────────┼────────────────────┐
                    ▼                ▼                     ▼
             ┌────────────┐   ┌────────────┐       ┌────────────┐
             │  Client Lib │   │  Client Lib │       │  Client Lib │
             │  (App Svc)  │   │  (App Svc)  │       │  (App Svc)  │
             └──────┬─────┘   └──────┬─────┘       └──────┬─────┘
                    │                │                     │
        consistent hash              │          consistent hash
        route to shard               │          route to shard
                    │                │                     │
          ┌─────────┼────────────────┼─────────────────────┤
          ▼         ▼                ▼                     ▼
   ┌─────────┐ ┌─────────┐   ┌─────────┐          ┌─────────┐
   │ Shard 0  │ │ Shard 1  │   │ Shard 2  │   ...   │ Shard N  │
   │ (Leader) │ │ (Leader) │   │ (Leader) │          │ (Leader) │
   └────┬────┘ └────┬────┘   └────┬────┘          └────┬────┘
        │           │              │                    │
        ▼           ▼              ▼                    ▼
   ┌─────────┐ ┌─────────┐   ┌─────────┐          ┌─────────┐
   │ Replica  │ │ Replica  │   │ Replica  │          │ Replica  │
   │ (Follow) │ │ (Follow) │   │ (Follow) │          │ (Follow) │
   └─────────┘ └─────────┘   └─────────┘          └─────────┘

Component Breakdown

1. Client Library (Smart Client)

  • Maintains a local copy of the shard map (hash ring)
  • Routes requests directly to the correct shard leader — no intermediate proxy
  • Handles connection pooling, pipelining, retries, and circuit breaking
  • On cache miss, can optionally call a “loader function” to fetch from the backing store and populate the cache

2. Cache Shards (Leader)

  • Single-threaded event loop (like Redis) — no locks, no contention
  • Handles GET/SET/DEL/CAS/batch operations
  • Runs eviction when memory pressure exceeds configured limit
  • Replicates writes to followers asynchronously

3. Cache Replicas (Follower)

  • Receive replication stream from leader
  • Serve read traffic if configured for read replicas (reduces leader load)
  • Promote to leader on leader failure (automatic failover)

4. Configuration Service (ZooKeeper / etcd)

  • Stores the hash ring configuration: which virtual nodes map to which physical shards
  • Stores health status of each shard
  • Clients watch for changes → update local shard map within seconds

5. Cluster Manager (Control Plane)

  • Monitors shard health via heartbeats
  • Triggers failover: promotes replica to leader, updates shard map in config service
  • Handles scaling: adds/removes shards, triggers data migration
  • Rebalances hot shards by splitting them

Request Flow (Cache Hit)

App Server → Client Lib: GET("user:123")
Client Lib → hash("user:123") → virtual node 47 → Shard 3 (leader)
Client Lib → Shard 3: GET user:123
Shard 3 → hash table lookup → found, TTL not expired
Shard 3 → Client Lib: {value, cas_token, ttl}
Client Lib → App Server: cache hit, return value

Request Flow (Cache Miss with Read-Through)

App Server → Client Lib: GET("user:123")
Client Lib → Shard 3: GET user:123
Shard 3 → hash table lookup → not found
Shard 3 → Client Lib: MISS
Client Lib → App Server: cache miss
App Server → Database: SELECT * FROM users WHERE id = 123
App Server → Client Lib: SET("user:123", userData, ttl=3600)
Client Lib → Shard 3: SET user:123 ...

6. Deep Dives (15 min)

Deep Dive 1: Consistent Hashing & Key Distribution

The Problem: With 4,300+ shards, adding or removing a node should not cause a massive rehash (which would invalidate most of the cache).

Solution: Consistent Hashing with Virtual Nodes

  1. Map the keyspace to a ring (0 to 2^32 - 1)
  2. Each physical shard owns multiple virtual nodes (e.g., 150-200 vnodes per physical node) placed at random positions on the ring
  3. A key maps to the first vnode clockwise from its hash position
  4. When a shard is added, it takes ownership of ~1/N of the ring — only keys that fall in those segments are migrated
  5. When a shard is removed, its vnodes are distributed to the remaining shards
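The five steps above fit in a small sketch. The use of `md5` and 150 vnodes per shard are illustrative choices; any stable hash with uniform output works:

```python
import bisect
import hashlib

def _hash32(s):
    """Stable 32-bit position on the ring (md5 is a stand-in hash)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (1 << 32)

class HashRing:
    def __init__(self, shards, vnodes=150):
        # Each physical shard owns `vnodes` positions on the ring
        self._ring = sorted(
            (_hash32(f"{shard}#{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def owner(self, key):
        # First vnode clockwise from the key's position (wrap at the end)
        i = bisect.bisect_right(self._positions, _hash32(key)) % len(self._ring)
        return self._ring[i][1]
```

The payoff is the migration property in step 4: removing a shard moves only the keys that shard owned, which the test below checks directly.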

Why virtual nodes?

  • With only physical nodes, the ring is unbalanced (nodes own unequal ranges)
  • Virtual nodes smooth out the distribution. With 150 vnodes per node, standard deviation of load is ~5%
  • Also enables heterogeneous hardware: a beefy node gets more vnodes than a small one

Client-Side vs Server-Side Sharding:

                    Client-Side (Memcached model)                  Server-Side (Redis Cluster model)
Routing             Client computes hash, sends to correct shard   Client sends to any node; node redirects (MOVED/ASK)
Latency             Lower (no redirect hop)                        Higher on first request (redirect), then client caches slots
Client complexity   High (must know ring topology)                 Low (dumb client works, smart client is better)
Operational         Harder to change ring (all clients must update) Cluster manages ring internally

Our choice: Client-side sharding with config-service-driven updates. Clients subscribe to shard map changes and apply each new map atomically. This gives us the lowest routing latency without letting clients drift onto a stale ring.

Handling Rebalancing:

  • When a new shard joins, the cluster manager assigns it vnodes and begins background migration
  • During migration, the old shard keeps serving reads for migrating keys
  • Writes for migrating keys go to the new shard (write-to-new, read-from-both)
  • Once migration completes, the shard map updates atomically

Deep Dive 2: Hot Key Handling & Thundering Herd Prevention

Hot Key Problem: A single key (e.g., a viral tweet, a breaking news article) can receive 1M+ reads/sec. A single shard (single-threaded) maxes out at ~200K ops/sec. This key alone would overwhelm the shard.

Solution 1: Client-Side Local Cache (L1 Cache)

  • The client library maintains a small local in-process cache (e.g., 100MB LRU)
  • Hot keys are automatically cached locally with a short TTL (1-5 seconds)
  • Eliminates network round-trips entirely for the hottest keys
  • Trade-off: stale data for up to TTL seconds. Acceptable for read-heavy, eventually-consistent use cases.
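A minimal sketch of such an L1, assuming an entry-count bound for brevity (a production cache would bound bytes) and an injectable clock for testability:

```python
import time
from collections import OrderedDict

class L1Cache:
    """Tiny in-process LRU with a short per-entry TTL."""
    def __init__(self, capacity=1024, ttl=2.0):
        self._entries = OrderedDict()  # key -> (value, expires_at)
        self._capacity = capacity
        self._ttl = ttl

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None or entry[1] <= now:
            self._entries.pop(key, None)
            return None                      # miss or expired: go to the shard
        self._entries.move_to_end(key)       # mark most-recently-used
        return entry[0]

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._entries[key] = (value, now + self._ttl)
        self._entries.move_to_end(key)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the LRU entry
```

The short TTL is what bounds staleness: even if an invalidation never reaches this process, a hot key is re-fetched from the shard within a few seconds.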

Solution 2: Server-Side Key Replication

  • Detect hot keys (shard tracks per-key access counts with a probabilistic counter)
  • Replicate hot keys to K additional shards (e.g., 5 replicas)
  • Client routes hot key reads to hash(key + random(0..K)) → spreads load across K+1 shards
  • Trade-off: writes to hot keys must update all replicas (fan-out write cost)

Solution 3: Read Replicas for Hot Shards

  • Add more read replicas to the shard that owns the hot key
  • Client reads from a random replica
  • Simple but coarse-grained — helps if the entire shard is hot, not just one key

Thundering Herd / Cache Stampede Prevention:

Scenario: A popular key expires. 10,000 concurrent requests see a cache miss and all hit the database simultaneously.

Solution A: Probabilistic Early Expiration (PER)

  • Before TTL expires, each request has an increasing probability of “voluntarily” recomputing the value
  • Formula (the XFetch algorithm): should_recompute = now - delta * beta * ln(random()) >= expiry_time, where delta is how long a recompute takes and random() is uniform in (0, 1]
  • Since ln(random()) is negative, the condition can fire shortly before expiry, with probability rising as expiry approaches. The beta parameter controls how aggressively early recomputation starts (typically 1.0)
  • Result: one or two requests recompute before expiry, the rest still see the cached value
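A runnable sketch of that check (XFetch: recompute when now - delta * beta * ln(rand) >= expiry, with delta the observed recompute cost). The time units are whatever the caller uses consistently:

```python
import math
import random

def should_recompute(now, expiry_time, delta, beta=1.0):
    """Probabilistic early expiration: the closer `now` is to
    `expiry_time`, the more likely a request volunteers to recompute.
    `delta` is the observed cost of one recompute, in the same units."""
    # 1 - random.random() is uniform in (0, 1], so log() never sees 0
    return now - delta * beta * math.log(1.0 - random.random()) >= expiry_time
```

Far from expiry the log term is negligible next to the remaining TTL, so the check almost never fires; at expiry it always fires, which is what makes the scheme a strict improvement over a hard TTL cutoff.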

Solution B: Locking / Coalescing

  • On cache miss, the first request acquires a per-key lock (a shard-local lock suffices, since exactly one shard owns each key; no distributed lock is needed)
  • Subsequent requests for the same key either wait (blocking) or get a stale value (serve-stale)
  • The lock holder fetches from DB, populates cache, releases lock
  • Other waiters then read from cache
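The coalescing half of this pattern is often called "singleflight" (after Go's package of that name). A thread-based sketch, assuming the loader does the DB fetch and cache population:

```python
import threading

class Singleflight:
    """First miss for a key runs the loader; concurrent misses for the
    same key block on that result instead of hitting the database."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done Event, result holder dict)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:                    # we are the leader
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:                                # someone is already loading
                leader = False
        event, holder = entry
        if leader:
            try:
                holder["value"] = loader()       # e.g., DB fetch + cache SET
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                event.set()
        else:
            event.wait()
        return holder.get("value")
```

No lock is held while the loader runs, so slow DB fetches for one key never block misses for other keys.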

Solution C: Background Refresh

  • For critical keys, a background job refreshes them before TTL expiry
  • Application never sees a miss for these keys
  • Works for a known set of important keys, not for arbitrary keys

Our approach: Layer all three. PER for general keys, locking for keys with expensive recomputation, background refresh for the top-1000 known-hot keys.

Deep Dive 3: Write Policies & Cache Coherence

Write-Through:

App → Cache.SET(key, value) → Cache writes to DB synchronously → ACK to app
  • Pro: Cache and DB are always consistent
  • Con: Every write has DB latency in the critical path. Write throughput limited by DB.
  • Use when: Strong consistency is required (e.g., user account data)

Write-Behind (Write-Back):

App → Cache.SET(key, value) → ACK immediately → Cache async-flushes to DB
  • Pro: Write latency = cache latency only. Can batch/coalesce writes to DB.
  • Con: Data loss risk if cache node dies before flush. DB is temporarily stale.
  • Use when: High write throughput matters more than durability (e.g., view counters, session data)

Write-Around:

App → DB.WRITE(key, value) → invalidate Cache.DEL(key)
Next read → cache miss → read from DB → populate cache
  • Pro: Cache only holds data that’s actually read (no wasted memory on write-only keys)
  • Con: First read after write is slow (cache miss + DB read)
  • Use when: Writes rarely re-read immediately (e.g., log entries, analytics events)
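The three policies can be contrasted side by side. The toy dicts and the explicit drain() below stand in for a real DB client and a background flusher:

```python
from collections import deque

class WritePolicyDemo:
    """Toy cache + DB pair showing the three write policies."""
    def __init__(self):
        self.cache, self.db = {}, {}
        self.flush_queue = deque()

    def write_through(self, key, value):
        self.db[key] = value          # synchronous DB write in the critical path
        self.cache[key] = value

    def write_behind(self, key, value):
        self.cache[key] = value       # ACK immediately...
        self.flush_queue.append((key, value))

    def drain(self):
        # ...flush to the DB later; a real flusher can batch and coalesce
        while self.flush_queue:
            key, value = self.flush_queue.popleft()
            self.db[key] = value

    def write_around(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)     # invalidate; next read repopulates
```

The write-behind data-loss risk is visible in the model: anything still in `flush_queue` when the node dies never reaches the DB.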

Cache Invalidation Strategies:

  1. TTL-based: Simple, eventual consistency. Good default.
  2. Event-driven invalidation: DB publishes change events (CDC / binlog) → invalidation service deletes stale cache keys. Tighter consistency.
  3. Lease-based: On cache miss, the cache gives the client a lease (token). Client fetches from DB, then sets with the lease. If the key was updated by another writer during the fetch, the lease is invalid → prevents stale set.
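A toy sketch of the lease mechanism in (3): a miss hands out a token, and a later SET is accepted only if the key was not invalidated in between. The token scheme here (a simple counter) is an illustrative simplification:

```python
import itertools

class LeasedCache:
    def __init__(self):
        self._data = {}                 # key -> value
        self._leases = {}               # key -> outstanding lease token
        self._next_token = itertools.count(1)

    def get(self, key):
        if key in self._data:
            return ("hit", self._data[key])
        token = next(self._next_token)
        self._leases[key] = token       # newest lease wins
        return ("miss", token)

    def delete(self, key):
        """Invalidation (e.g., driven by a DB write) voids any lease."""
        self._data.pop(key, None)
        self._leases.pop(key, None)

    def set_with_lease(self, key, value, token):
        if self._leases.get(key) != token:
            return False                # stale set prevented
        del self._leases[key]
        self._data[key] = value
        return True
```

This closes the classic race: a reader that fetched from the DB before a concurrent write cannot later install its now-stale value over the invalidation.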

Cache Coherence Across Datacenters:

  • Each datacenter has its own cache cluster
  • Writes in DC-A invalidate cache in DC-B via an async cross-DC invalidation bus
  • Uses “delete-on-write” semantics: when a key is written in any DC, all other DCs delete that key from their local cache
  • This avoids conflicting values across DCs (at the cost of a cache miss in remote DCs after a write)

7. Extensions (2 min)

  • Multi-tenancy & isolation: Namespace keys per tenant, enforce per-tenant memory quotas, and implement fair-share scheduling so one tenant’s hot keys don’t starve others.
  • Tiered caching (DRAM + NVMe): Store hot data in DRAM (sub-ms latency) and warm data on NVMe SSDs (< 5 ms latency). This increases effective capacity per node roughly 10x at slightly higher tail latency. Facebook's CacheLib does this.
  • Compression: For values > 1KB, compress with LZ4 (fast) or ZSTD (better ratio). Reduces memory usage by 2-4x for text-heavy workloads. Trade-off: CPU overhead per request.
  • Observability: Expose per-shard metrics (hit rate, eviction rate, memory usage, p99 latency), per-key access tracking for hot key detection, and slow-log for operations exceeding latency thresholds.
  • Graceful degradation: When a shard is overloaded, shed load by returning cache misses for low-priority keys (based on key prefix / namespace priority). Better to miss the cache than to crash the shard.