1. Requirements & Scope (5 min)

Functional Requirements

  1. As the user types in a search box, show top 5-10 suggestions matching the prefix
  2. Suggestions should be ranked by popularity (search frequency)
  3. Suggestions update as new search trends emerge (near-real-time)
  4. Support personalization (user’s own search history weighted higher)
  5. Handle typos/fuzzy matching (optional stretch goal)

Non-Functional Requirements

  • Availability: 99.99% — broken autocomplete degrades the entire search experience
  • Latency: Suggestions must appear in < 50ms at p99 (users expect instant response as they type)
  • Consistency: Eventual consistency is fine — if a new trending search takes a few minutes to appear in suggestions, that’s acceptable
  • Scale: 10B search queries/day, suggestions requested on every keystroke → 50B+ suggestion requests/day (average query = 5 keystrokes)

2. Estimation (3 min)

Traffic

  • 50B suggestion requests/day ÷ ~100K seconds/day ≈ 500,000 requests/sec
  • Peak: 3x → 1.5M requests/sec
  • This is an extremely high-QPS system — latency and throughput are everything

Data Size

  • Vocabulary: ~5M unique search phrases (with frequency counts)
  • Each phrase: avg 30 chars + frequency count + metadata = ~100 bytes
  • Total vocabulary: 5M × 100 bytes = 500MB — fits entirely in memory

Key Insight

This is a memory-bound, latency-critical system. The entire dataset fits in RAM. The challenge is serving 1.5M QPS at < 50ms consistently.
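
The back-of-envelope numbers above can be sanity-checked in a few lines (using the same ~100K-seconds/day shortcut as the text):

```python
# Back-of-envelope check for the traffic and memory estimates above.
SECONDS_PER_DAY_APPROX = 100_000            # 86,400 rounded up for easy math

suggestion_requests_per_day = 10e9 * 5      # 10B searches x 5 keystrokes each
avg_qps = suggestion_requests_per_day / SECONDS_PER_DAY_APPROX   # ~500K req/s
peak_qps = avg_qps * 3                      # 3x peak factor -> 1.5M req/s

vocab_bytes = 5_000_000 * 100               # 5M phrases x ~100 bytes each
print(f"{avg_qps:,.0f} req/s avg, {peak_qps:,.0f} peak, "
      f"{vocab_bytes / 1e6:.0f} MB vocabulary")
```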


3. API Design (3 min)

GET /api/v1/suggest?q={prefix}&limit=10&user_id={user_id}
  Response 200: {
    "suggestions": [
      { "text": "system design interview", "score": 9850 },
      { "text": "system design primer", "score": 7200 },
      { "text": "system design patterns", "score": 5100 },
      { "text": "system design youtube", "score": 4800 },
      { "text": "system design book", "score": 3200 }
    ]
  }

// Internal: update search frequency (called by Search Service)
POST /api/v1/suggest/record
  Body: { "query": "system design interview" }
  Response 204

Key decisions:

  • No authentication required — autocomplete is called on every keystroke, adding auth overhead would kill latency
  • Client-side debouncing — client should wait 100-200ms after last keystroke before requesting (reduces QPS by 3-5x)
  • Limit parameter — default 5-10 suggestions, never more than 20
  • User context optional — for personalization (boost user’s recent searches)

4. Data Model (3 min)

Trie (in-memory, per-server)

Trie Node:
  char           | single character
  children       | map<char, TrieNode>
  is_end         | boolean (marks end of a complete phrase)
  top_k          | list<(phrase, score)>  -- pre-computed top-K for this prefix

Pre-computed top-K at each node:

  • Each trie node stores the top 10 suggestions for its prefix
  • When user types “sys”, we look up the node at s→y→s and immediately return top_k
  • This makes lookup O(L) where L = prefix length — typically 3-5 characters → ~5 pointer traversals = sub-microsecond
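
The lookup described above can be sketched in a few lines; the `top_k` lists are assumed to be filled in offline by the Trie Builder:

```python
# Minimal sketch of the trie node and O(L) lookup described above.
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # marks the end of a complete phrase
        self.top_k = []      # [(phrase, score)], precomputed at build time

def lookup(root, prefix):
    """Walk one node per character, then return the precomputed top-K list."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []        # no phrase starts with this prefix
    return node.top_k
```

No sorting or scanning happens at request time; the cost is one dictionary hop per character of the prefix.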

Why not query-time sorting?

  • For prefix “s”, there might be 500K matching phrases
  • Sorting 500K items per request at 1.5M QPS is prohibitively expensive
  • Pre-computing top-K at build time moves the cost offline

Frequency Data (Kafka → Redis → Batch Rebuild)

Redis (real-time frequency tracking):
  Key: freq:{normalized_query}
  Value: count (incremented on each search)

Aggregated data (periodic snapshot):
  query_text | frequency | updated_at
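
The write path for the frequency keys can be sketched as follows; a plain dict stands in for Redis here (with redis-py this would be `r.incr(f"freq:{normalized}")` against the same key scheme):

```python
# Sketch of real-time frequency tracking; freq_store stands in for Redis.
freq_store = {}

def normalize(query):
    """Lowercase and collapse whitespace so variants share one counter."""
    return " ".join(query.lower().split())

def record_search(query):
    """Increment the counter for this query -- Redis INCR semantics."""
    key = f"freq:{normalize(query)}"
    freq_store[key] = freq_store.get(key, 0) + 1
    return freq_store[key]
```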

5. High-Level Design (12 min)

Suggestion Serving Path (hot path — must be fast)

Client types "sys"
  → Client debounces (100ms)
  → GET /suggest?q=sys
  → Load Balancer → Suggestion Server
    → Trie lookup: s → y → s → return top_k
    → (Optional) Merge with personalized results from user history
    → Return top 10 suggestions
  → Total time: < 5ms

Data Update Path (offline/near-real-time)

User performs a search → Search Service
  → Publish to Kafka: search_queries topic
  → Real-time aggregator (Flink/Spark Streaming):
    → Count query frequencies in sliding windows (1h, 24h)
    → Write updated frequencies to Redis
  → Periodic Trie Rebuilder (every 15-30 minutes):
    → Read all query frequencies from Redis
    → Build new Trie with pre-computed top-K at each node
    → Serialize Trie → Publish to Trie Distribution Service
    → Each Suggestion Server downloads the new Trie, swaps it in atomically

Components

  1. Suggestion Servers (stateful): Each holds the complete Trie in memory. Horizontal scaling = more replicas.
  2. Trie Builder (periodic batch job): Builds new Trie from frequency data. Runs every 15-30 minutes.
  3. Frequency Aggregator (Flink/Spark Streaming): Processes search events from Kafka, maintains frequency counts in Redis.
  4. Redis: Real-time frequency counts, user search history (for personalization)
  5. Kafka: Search event stream
  6. Trie Distribution Service (S3 or internal blob store): Stores serialized Trie snapshots. Suggestion Servers pull new versions.

6. Deep Dives (15 min)

Deep Dive 1: Trie Design and Optimization

Basic Trie → Compressed Trie (Patricia Trie):

  • Many paths in a Trie have no branching (e.g., “system” = s→y→s→t→e→m with no branches)
  • Compressed Trie merges single-child paths: “system” stored as one node
  • Memory reduction: 40-60% fewer nodes

Pre-computed top-K propagation:

  • Build bottom-up: leaf nodes have their own frequency
  • At each internal node, top-K is the union of its children’s top-K, re-sorted
  • Space: each node stores 10 entries × ~50 bytes = 500 bytes
  • For 5M unique queries at avg 30 chars → up to ~150M trie nodes (an upper bound that ignores shared prefixes) → after compression ~50M nodes
  • 50M × 500 bytes top-K = 25GB per server — fits in a mid-range server’s RAM
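
The bottom-up propagation step can be sketched as a single merge function, applied at each internal node after its children are built:

```python
import heapq

K = 10  # entries kept per node

def merge_top_k(children_top_k, own=None, k=K):
    """Union the children's precomputed top-K lists (plus this node's own
    phrase, if the node ends one) and keep the k highest-scoring entries.
    This is the per-node step of the bottom-up build described above."""
    candidates = [entry for child in children_top_k for entry in child]
    if own is not None:
        candidates.append(own)
    return heapq.nlargest(k, candidates, key=lambda e: e[1])
```

Each merge touches only the children's short lists (at most K entries each), so the full build stays linear in the number of nodes.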

Atomic Trie swap:

  • Build new Trie in a background thread
  • When complete, atomically swap the pointer (read pointer → new Trie)
  • Old Trie is garbage collected after all in-flight requests complete
  • Zero-downtime updates
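
A minimal sketch of the swap, assuming each request grabs the trie reference once at the start (in CPython, a single reference assignment is atomic under the GIL, so readers never see a half-built trie):

```python
import threading

class TrieHolder:
    """Holds the live trie; readers take one pointer read, writers swap it."""
    def __init__(self, trie):
        self._current = trie
        self._lock = threading.Lock()   # serializes writers only

    def get(self):
        return self._current            # readers: no lock on the hot path

    def swap(self, new_trie):
        with self._lock:
            old, self._current = self._current, new_trie
        return old                      # freed once no in-flight request holds it
```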

Sharding for larger vocabularies:

  • If vocabulary exceeds single-server memory, shard by first character (or first 2 chars)
  • 26 shards for a-z, or 676 shards for aa-zz
  • Request routing: hash first char(s) of prefix → correct shard
  • But for most systems, 5M unique queries at 25GB is fine for a single server with replicas
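
The routing rule for the single-character scheme is trivial; this sketch sends non-letter prefixes (digits, unicode) to a hashed bucket, which is one possible choice:

```python
NUM_SHARDS = 26

def shard_for(prefix):
    """Route by first character: 'a'-'z' map to shards 0-25; anything else
    hashes into an existing bucket (a simplification for this sketch)."""
    c = prefix[0].lower()
    if "a" <= c <= "z":
        return ord(c) - ord("a")
    return hash(c) % NUM_SHARDS
```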

Deep Dive 2: Real-Time Trending Queries

The problem: A static Trie built from historical data won’t surface breaking news or trending searches. “Election results 2026” should appear within minutes of becoming a trending search, not after the next 30-minute Trie rebuild.

Solution: Two-layer architecture

  1. Static Trie (rebuilt every 30 min): Contains all queries with stable frequency. This is the primary data source.

  2. Trending cache (updated every 1-2 min): Kafka → Flink streaming aggregation with sliding 1-hour window. Detect queries with abnormal velocity (frequency / historical_avg > 5x). Store trending queries in a small Redis sorted set.

  3. At query time: Merge results from static Trie + trending cache. Trending queries get a boost factor in scoring.

Decay function: Trending score = recent_count × decay_factor^(minutes_since_first_spike). This ensures trending queries rise fast and also fade naturally.
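
The decay formula above in code form (the 0.95 decay factor is an illustrative choice, not a prescribed value):

```python
def trending_score(recent_count, minutes_since_first_spike, decay_factor=0.95):
    """Exponential decay: the score falls off as the spike ages, so stale
    trends drop out of the Redis sorted set on their own."""
    return recent_count * (decay_factor ** minutes_since_first_spike)
```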

Content filtering: Trending queries pass through a filter before being shown:

  • Blocklist of offensive terms
  • Adult content filter
  • Spam detection (bot-driven query patterns)

Deep Dive 3: Personalization

User search history:

  • Store last 1000 searches per user in Redis: user_history:{user_id} (sorted set by timestamp)
  • On suggestion request with user_id:
    1. Get top-K from Trie (global popularity)
    2. Get matching queries from user’s history
    3. Merge with weighted scoring: final_score = 0.7 × global_score + 0.3 × personal_score
    4. Personal score = recency × frequency (how often and how recently the user searched this)
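
The merge step above can be sketched as follows; it assumes personal scores have already been normalized to the same scale as global scores:

```python
W_GLOBAL, W_PERSONAL = 0.7, 0.3

def merge_suggestions(global_top, personal_scores, limit=10):
    """Blend the trie's global top-K with the user's history using the
    0.7/0.3 weighting above. personal_scores maps phrase -> recency*frequency
    score (assumed pre-normalized to the global scale for this sketch)."""
    merged = {}
    for phrase, score in global_top:
        merged[phrase] = W_GLOBAL * score + W_PERSONAL * personal_scores.get(phrase, 0)
    # Phrases only in the user's history still surface, with no global component.
    for phrase, score in personal_scores.items():
        merged.setdefault(phrase, W_PERSONAL * score)
    return sorted(merged.items(), key=lambda e: e[1], reverse=True)[:limit]
```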

Privacy consideration:

  • User can clear search history → delete Redis key
  • Search history stored per-user, not shared
  • Personalization is optional (works without user_id)

Edge case — cold start:

  • New user with no history → pure global suggestions
  • As they search more, personalization kicks in gradually

7. Extensions (2 min)

  • Fuzzy matching: Handle typos using edit distance. Build a secondary index of common misspellings → correct query mappings. At query time, if exact prefix has few results, try fuzzy alternatives.
  • Multi-language: Separate Tries per language. Detect language from input characters or user locale.
  • Category-aware suggestions: “apple” → suggest “apple stock price” (finance), “apple macbook” (tech), “apple pie recipe” (food) with category labels.
  • Query completion vs entity suggestion: Alongside query completions, show entity cards (e.g., typing “barack” → show “Barack Obama” with a photo, Wikipedia snippet). Requires an entity knowledge graph.
  • Offline/client-side: For mobile apps, ship a compressed Trie of top 100K queries in the app binary. Serve suggestions locally when offline. Sync with server when online.
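
For the fuzzy-matching extension, one common fallback when the exact prefix yields few results is to try all strings within edit distance 1; a minimal sketch (real systems would precompute a misspelling → correction map instead of generating variants per request):

```python
import string

def edit1_variants(prefix):
    """All strings within edit distance 1 of the prefix: deletions,
    substitutions, and insertions over lowercase ASCII letters."""
    letters = string.ascii_lowercase
    splits = [(prefix[:i], prefix[i:]) for i in range(len(prefix) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return (deletes | subs | inserts) - {prefix}
```

Each variant is then looked up in the trie, and any that produce results are merged into the response.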