1. Requirements & Scope (5 min)

Functional Requirements

  1. As the user types in a search box, show top 5-10 suggestions matching the prefix
  2. Suggestions should be ranked by popularity (search frequency)
  3. Suggestions update as new search trends emerge (near-real-time)
  4. Support personalization (user’s own search history weighted higher)
  5. Handle typos/fuzzy matching (optional stretch goal)

Non-Functional Requirements

  • Availability: 99.99% — broken autocomplete degrades the entire search experience
  • Latency: Suggestions must appear in < 50ms at p99 (users expect instant response as they type)
  • Consistency: Eventual consistency is fine — if a new trending search takes a few minutes to appear in suggestions, that’s acceptable
  • Scale: 10B search queries/day, suggestions requested on every keystroke → 50B+ suggestion requests/day (average query = 5 keystrokes)

2. Estimation (3 min)

Traffic

  • 50B suggestion requests/day ÷ ~100K seconds/day ≈ 500,000 requests/sec
  • Peak: 3x → 1.5M requests/sec
  • This is an extremely high-QPS system — latency and throughput are everything

Data Size

  • Vocabulary: ~5M unique search phrases (with frequency counts)
  • Each phrase: avg 30 chars + frequency count + metadata = ~100 bytes
  • Total vocabulary: 5M × 100 bytes = 500MB — fits entirely in memory

Key Insight

This is a memory-bound, latency-critical system. The entire dataset fits in RAM. The challenge is serving 1.5M QPS at < 50ms consistently.
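
The back-of-envelope numbers above can be sanity-checked in a few lines (using the same ~100K-seconds/day shortcut as the text):

```python
# Back-of-envelope check for the traffic and memory estimates above.
SECONDS_PER_DAY_APPROX = 100_000            # 86,400 rounded up for easy math

suggestion_requests_per_day = 10e9 * 5      # 10B searches x 5 keystrokes each
avg_qps = suggestion_requests_per_day / SECONDS_PER_DAY_APPROX   # ~500K req/s
peak_qps = avg_qps * 3                      # 3x peak factor -> 1.5M req/s

vocab_bytes = 5_000_000 * 100               # 5M phrases x ~100 bytes each
print(f"{avg_qps:,.0f} req/s avg, {peak_qps:,.0f} peak, "
      f"{vocab_bytes / 1e6:.0f} MB vocabulary")
```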


3. API Design (3 min)

GET /api/v1/suggest?q={prefix}&limit=10&user_id={user_id}
  Response 200: {
    "suggestions": [
      { "text": "system design interview", "score": 9850 },
      { "text": "system design primer", "score": 7200 },
      { "text": "system design patterns", "score": 5100 },
      { "text": "system design youtube", "score": 4800 },
      { "text": "system design book", "score": 3200 }
    ]
  }

// Internal: update search frequency (called by Search Service)
POST /api/v1/suggest/record
  Body: { "query": "system design interview" }
  Response 204

Key decisions:

  • No authentication required — autocomplete is called on every keystroke, adding auth overhead would kill latency
  • Client-side debouncing — client should wait 100-200ms after last keystroke before requesting (reduces QPS by 3-5x)
  • Limit parameter — default 5-10 suggestions, never more than 20
  • User context optional — for personalization (boost user’s recent searches)

4. Data Model (3 min)

Trie (in-memory, per-server)

Trie Node:
  char           | single character
  children       | map<char, TrieNode>
  is_end         | boolean (marks end of a complete phrase)
  top_k          | list<(phrase, score)>  -- pre-computed top-K for this prefix

Pre-computed top-K at each node:

  • Each trie node stores the top 10 suggestions for its prefix
  • When user types “sys”, we look up the node at s→y→s and immediately return top_k
  • This makes lookup O(L) where L = prefix length — typically 3-5 characters → ~5 pointer traversals = sub-microsecond
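
The lookup described above can be sketched in a few lines; the `top_k` lists are assumed to be filled in offline by the Trie Builder:

```python
# Minimal sketch of the trie node and O(L) lookup described above.
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # marks the end of a complete phrase
        self.top_k = []      # [(phrase, score)], precomputed at build time

def lookup(root, prefix):
    """Walk one node per character, then return the precomputed top-K list."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []        # no phrase starts with this prefix
    return node.top_k
```

No sorting or scanning happens at request time; the cost is one dictionary hop per character of the prefix.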

Why not query-time sorting?

  • For prefix “s”, there might be 500K matching phrases
  • Sorting 500K items per request at 1.5M QPS is prohibitively expensive
  • Pre-computing top-K at build time moves the cost offline

Frequency Data (Kafka → Redis → Batch Rebuild)

Redis (real-time frequency tracking):
  Key: freq:{normalized_query}
  Value: count (incremented on each search)

Aggregated data (periodic snapshot):
  query_text | frequency | updated_at
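
The write path for the frequency keys can be sketched as follows; a plain dict stands in for Redis here (with redis-py this would be `r.incr(f"freq:{normalized}")` against the same key scheme):

```python
# Sketch of real-time frequency tracking; freq_store stands in for Redis.
freq_store = {}

def normalize(query):
    """Lowercase and collapse whitespace so variants share one counter."""
    return " ".join(query.lower().split())

def record_search(query):
    """Increment the counter for this query -- Redis INCR semantics."""
    key = f"freq:{normalize(query)}"
    freq_store[key] = freq_store.get(key, 0) + 1
    return freq_store[key]
```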

5. High-Level Design (12 min)

Suggestion Serving Path (hot path — must be fast)

Client types "sys"
  → Client debounces (100ms)
  → GET /suggest?q=sys
  → Load Balancer → Suggestion Server
    → Trie lookup: s → y → s → return top_k
    → (Optional) Merge with personalized results from user history
    → Return top 10 suggestions
  → Total time: < 5ms

Data Update Path (offline/near-real-time)

User performs a search → Search Service
  → Publish to Kafka: search_queries topic
  → Real-time aggregator (Flink/Spark Streaming):
    → Count query frequencies in sliding windows (1h, 24h)
    → Write updated frequencies to Redis
  → Periodic Trie Rebuilder (every 15-30 minutes):
    → Read all query frequencies from Redis
    → Build new Trie with pre-computed top-K at each node
    → Serialize Trie → Publish to Trie Distribution Service
    → Each Suggestion Server downloads the new Trie, swaps it in atomically

Components

  1. Suggestion Servers (stateful): Each holds the complete Trie in memory. Horizontal scaling = more replicas.
  2. Trie Builder (periodic batch job): Builds new Trie from frequency data. Runs every 15-30 minutes.
  3. Frequency Aggregator (Flink/Spark Streaming): Processes search events from Kafka, maintains frequency counts in Redis.
  4. Redis: Real-time frequency counts, user search history (for personalization)
  5. Kafka: Search event stream
  6. Trie Distribution Service (S3 or internal blob store): Stores serialized Trie snapshots. Suggestion Servers pull new versions.

6. Deep Dives (15 min)

Deep Dive 1: Trie Design and Optimization

Basic Trie → Compressed Trie (Patricia Trie):

  • Many paths in a Trie have no branching (e.g., “system” = s→y→s→t→e→m with no branches)
  • Compressed Trie merges single-child paths: “system” stored as one node
  • Memory reduction: 40-60% fewer nodes

Pre-computed top-K propagation:

  • Build bottom-up: leaf nodes have their own frequency
  • At each internal node, top-K is the union of its children’s top-K, re-sorted
  • Space: each node stores 10 entries × ~50 bytes = 500 bytes
  • For 5M unique queries at avg 30 chars → up to ~150M trie nodes (an upper bound that ignores shared prefixes) → after compression ~50M nodes
  • 50M × 500 bytes top-K = 25GB per server — fits in a mid-range server’s RAM
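
The bottom-up propagation step can be sketched as a single merge function, applied at each internal node after its children are built:

```python
import heapq

K = 10  # entries kept per node

def merge_top_k(children_top_k, own=None, k=K):
    """Union the children's precomputed top-K lists (plus this node's own
    phrase, if the node ends one) and keep the k highest-scoring entries.
    This is the per-node step of the bottom-up build described above."""
    candidates = [entry for child in children_top_k for entry in child]
    if own is not None:
        candidates.append(own)
    return heapq.nlargest(k, candidates, key=lambda e: e[1])
```

Each merge touches only the children's short lists (at most K entries each), so the full build stays linear in the number of nodes.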

Atomic Trie swap:

  • Build new Trie in a background thread
  • When complete, atomically swap the pointer (read pointer → new Trie)
  • Old Trie is garbage collected after all in-flight requests complete
  • Zero-downtime updates
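
A minimal sketch of the swap, assuming each request grabs the trie reference once at the start (in CPython, a single reference assignment is atomic under the GIL, so readers never see a half-built trie):

```python
import threading

class TrieHolder:
    """Holds the live trie; readers take one pointer read, writers swap it."""
    def __init__(self, trie):
        self._current = trie
        self._lock = threading.Lock()   # serializes writers only

    def get(self):
        return self._current            # readers: no lock on the hot path

    def swap(self, new_trie):
        with self._lock:
            old, self._current = self._current, new_trie
        return old                      # freed once no in-flight request holds it
```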

Sharding for larger vocabularies:

  • If vocabulary exceeds single-server memory, shard by first character (or first 2 chars)
  • 26 shards for a-z, or 676 shards for aa-zz
  • Request routing: hash first char(s) of prefix → correct shard
  • But for most systems, 5M unique queries at 25GB is fine for a single server with replicas
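
The routing rule for the single-character scheme is trivial; this sketch sends non-letter prefixes (digits, unicode) to a hashed bucket, which is one possible choice:

```python
NUM_SHARDS = 26

def shard_for(prefix):
    """Route by first character: 'a'-'z' map to shards 0-25; anything else
    hashes into an existing bucket (a simplification for this sketch)."""
    c = prefix[0].lower()
    if "a" <= c <= "z":
        return ord(c) - ord("a")
    return hash(c) % NUM_SHARDS
```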

Deep Dive 2: Real-Time Trending Queries

The problem: A static Trie built from historical data won’t surface breaking news or trending searches. “Election results 2026” should appear within minutes of becoming a trending search, not after the next 30-minute Trie rebuild.

Solution: Two-layer architecture

  1. Static Trie (rebuilt every 30 min): Contains all queries with stable frequency. This is the primary data source.

  2. Trending cache (updated every 1-2 min): Kafka → Flink streaming aggregation with sliding 1-hour window. Detect queries with abnormal velocity (frequency / historical_avg > 5x). Store trending queries in a small Redis sorted set.

  3. At query time: Merge results from static Trie + trending cache. Trending queries get a boost factor in scoring.

Decay function: Trending score = recent_count × decay_factor^(minutes_since_first_spike). This ensures trending queries rise fast and also fade naturally.
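
The decay formula above in code form (the 0.95 decay factor is an illustrative choice, not a prescribed value):

```python
def trending_score(recent_count, minutes_since_first_spike, decay_factor=0.95):
    """Exponential decay: the score falls off as the spike ages, so stale
    trends drop out of the Redis sorted set on their own."""
    return recent_count * (decay_factor ** minutes_since_first_spike)
```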

Content filtering: Trending queries pass through a filter before being shown:

  • Blocklist of offensive terms
  • Adult content filter
  • Spam detection (bot-driven query patterns)

Deep Dive 3: Personalization

User search history:

  • Store last 1000 searches per user in Redis: user_history:{user_id} (sorted set by timestamp)
  • On suggestion request with user_id:
    1. Get top-K from Trie (global popularity)
    2. Get matching queries from user’s history
    3. Merge with weighted scoring: final_score = 0.7 × global_score + 0.3 × personal_score
    4. Personal score = recency × frequency (how often and how recently the user searched this)
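
The merge step above can be sketched as follows; it assumes personal scores have already been normalized to the same scale as global scores:

```python
W_GLOBAL, W_PERSONAL = 0.7, 0.3

def merge_suggestions(global_top, personal_scores, limit=10):
    """Blend the trie's global top-K with the user's history using the
    0.7/0.3 weighting above. personal_scores maps phrase -> recency*frequency
    score (assumed pre-normalized to the global scale for this sketch)."""
    merged = {}
    for phrase, score in global_top:
        merged[phrase] = W_GLOBAL * score + W_PERSONAL * personal_scores.get(phrase, 0)
    # Phrases only in the user's history still surface, with no global component.
    for phrase, score in personal_scores.items():
        merged.setdefault(phrase, W_PERSONAL * score)
    return sorted(merged.items(), key=lambda e: e[1], reverse=True)[:limit]
```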

Privacy consideration:

  • User can clear search history → delete Redis key
  • Search history stored per-user, not shared
  • Personalization is optional (works without user_id)

Edge case — cold start:

  • New user with no history → pure global suggestions
  • As they search more, personalization kicks in gradually

7. Extensions (2 min)

  • Fuzzy matching: Handle typos using edit distance. Build a secondary index of common misspellings → correct query mappings. At query time, if exact prefix has few results, try fuzzy alternatives.
  • Multi-language: Separate Tries per language. Detect language from input characters or user locale.
  • Category-aware suggestions: “apple” → suggest “apple stock price” (finance), “apple macbook” (tech), “apple pie recipe” (food) with category labels.
  • Query completion vs entity suggestion: Alongside query completions, show entity cards (e.g., typing “barack” → show “Barack Obama” with a photo, Wikipedia snippet). Requires an entity knowledge graph.
  • Offline/client-side: For mobile apps, ship a compressed Trie of top 100K queries in the app binary. Serve suggestions locally when offline. Sync with server when online.
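
For the fuzzy-matching extension, one common fallback when the exact prefix yields few results is to try all strings within edit distance 1; a minimal sketch (real systems would precompute a misspelling → correction map instead of generating variants per request):

```python
import string

def edit1_variants(prefix):
    """All strings within edit distance 1 of the prefix: deletions,
    substitutions, and insertions over lowercase ASCII letters."""
    letters = string.ascii_lowercase
    splits = [(prefix[:i], prefix[i:]) for i in range(len(prefix) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return (deletes | subs | inserts) - {prefix}
```

Each variant is then looked up in the trie, and any that produce results are merged into the response.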