1. Requirements & Scope (5 min)
Functional Requirements
- Users search for businesses near a given location (latitude, longitude) within a specified radius
- Results are ranked by a combination of relevance, distance, and user ratings
- Support filtering by business category (restaurants, hotels, gas stations), price range, open now, and rating threshold
- Display business details: name, address, photos, hours, reviews, ratings
- Business owners can add/update their business listing information
Non-Functional Requirements
- Availability: 99.99% — search is the core product. Users expect instant results when they open the app.
- Latency: < 100ms for proximity search queries (p99). Business detail pages < 200ms.
- Consistency: Eventual consistency for business data updates. A new restaurant appearing in search within 1 hour is acceptable. But search results must be consistent within a single query (no partial results).
- Scale: 200M businesses worldwide, 100K search QPS at peak, 500M DAU
- Location accuracy: Results within a 500m radius should be nearly exhaustive (recall > 99%). Results at larger radii (5km+) can prioritize relevance over exhaustiveness.
2. Estimation (3 min)
Traffic
- Search QPS: 100K peak, 40K average
- Business detail views: 200K QPS (users click into results)
- Business writes (new/updated listings): 1K/sec (low — read-heavy system)
Storage
- 200M businesses x 2KB per business (name, address, coordinates, category, hours, ratings) = 400 GB for core business data
- Photos: 200M businesses x 5 photos avg x 200KB = 200 TB (CDN-served)
- Reviews: 2 billion reviews x 500 bytes = 1 TB
- Geospatial index: 200M businesses x 50 bytes (coordinates + geohash + pointers) = 10 GB — fits entirely in memory
Key Insight
The geospatial index (10GB) is small enough to fit in memory on a single machine, but we need to replicate it across multiple read replicas for 100K QPS. The real challenge is not storage — it is efficient spatial querying and ranking at low latency.
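As a quick cross-check, the estimates above reduce to a few multiplications (numbers taken straight from this section):

```python
# Back-of-envelope check of the storage estimates above.
BUSINESSES = 200_000_000

core_data = BUSINESSES * 2_000        # 2KB per business record
photos    = BUSINESSES * 5 * 200_000  # 5 photos x 200KB each
reviews   = 2_000_000_000 * 500       # 2B reviews x 500 bytes
geo_index = BUSINESSES * 50           # coords + geohash + pointers

GB, TB = 10**9, 10**12
print(core_data / GB)   # 400.0 (GB)
print(photos / TB)      # 200.0 (TB)
print(reviews / TB)     # 1.0 (TB)
print(geo_index / GB)   # 10.0 (GB)
```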
3. API Design (3 min)
// Search for nearby businesses
GET /search?lat=37.7749&lng=-122.4194&radius=5000&category=restaurant&sort=best_match&limit=20&cursor={cursor}
Response 200: {
"businesses": [
{
"business_id": "biz_abc123",
"name": "Joe's Pizza",
"address": "123 Main St, San Francisco, CA",
"lat": 37.7752,
"lng": -122.4183,
"distance_m": 120,
"category": "restaurant",
"subcategory": "pizza",
"rating": 4.5,
"review_count": 342,
"price_level": 2, // 1-4 ($-$$$$)
"is_open": true,
"photo_url": "https://cdn.yelp.com/photos/abc.jpg",
"relevance_score": 0.94
},
...
],
"total_results": 87,
"next_cursor": "eyJ..."
}
// Get business details
GET /businesses/{business_id}
Response 200: {
"business_id": "biz_abc123",
"name": "Joe's Pizza",
"address": "123 Main St, San Francisco, CA",
"lat": 37.7752,
"lng": -122.4183,
"category": "restaurant",
"hours": {"mon": "11:00-22:00", "tue": "11:00-22:00", ...},
"rating": 4.5,
"review_count": 342,
"photos": ["url1", "url2", ...],
"attributes": {"delivery": true, "outdoor_seating": true, ...}
}
// Add/update business (business owner)
POST /businesses
PUT /businesses/{business_id}
Body: { "name": "...", "address": "...", "lat": ..., "lng": ..., ... }
Key Decisions
- Search radius in meters (not miles) — internal consistency, convert for display
- sort=best_match is the default (not purely distance-based) — mirrors real user intent
- Cursor-based pagination for stable results during scrolling
- Business writes are separate from the search index — async update pipeline
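One way to implement the opaque next_cursor above: base64url-encode a JSON payload holding the last item's sort key, so pagination stays stable when results shift between pages. The field names here are illustrative, not a spec:

```python
import base64
import json

def encode_cursor(last_score: float, last_business_id: str) -> str:
    """Opaque keyset cursor: base64url-encoded JSON of the last item's
    sort key and id. Stable across inserts, unlike offset pagination."""
    payload = json.dumps({"s": last_score, "id": last_business_id})
    return base64.urlsafe_b64encode(payload.encode()).decode().rstrip("=")

def decode_cursor(cursor: str) -> dict:
    padded = cursor + "=" * (-len(cursor) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))

c = encode_cursor(0.94, "biz_abc123")
# Starts with "eyJ" because '{"' base64-encodes to "eyJ" — matching the
# "eyJ..." cursor shown in the example response.
```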
4. Data Model (3 min)
Business Table (MySQL, sharded by business_id)
Table: businesses
business_id (PK) | bigint
name | varchar(200)
address | varchar(500)
latitude | decimal(10,7)
longitude | decimal(10,7)
geohash | varchar(12) -- precomputed for indexing
category | varchar(50)
subcategory | varchar(50)
price_level | tinyint -- 1-4
rating_avg | decimal(2,1)
review_count | int
is_active | boolean
hours_json | json
attributes_json | json
created_at | timestamp
updated_at | timestamp
Geospatial Index (in-memory, replicated)
Option A: Geohash-based index
HashMap<geohash_prefix, List<Business>>
geohash precision 6 (~1.2km x 0.6km cells)
Search: compute geohash of query point → fetch all businesses in that cell
+ fetch businesses in 8 neighboring cells
→ filter by exact distance, rank, return
Option B: Quadtree
Root covers entire world → recursively split into 4 quadrants
Split when a node has > 100 businesses
Leaf nodes store business lists
Search: find leaf containing query point → expand to neighbors until radius covered
Option C: S2 Geometry (Google's approach)
Map Earth surface to faces of a cube → Hilbert curve → 64-bit cell ID
Hierarchical cells at varying levels of precision
Search: compute S2 cell covering for the search circle → fetch businesses in those cells
Reviews Table (MySQL, sharded by business_id)
Table: reviews
review_id (PK) | bigint
business_id | bigint (FK, shard key)
user_id | bigint
rating | tinyint -- 1-5
text | text
photos | json
created_at | timestamp
helpful_count | int
Why These Choices
- MySQL for business data — structured, relational, well-suited for CRUD operations. Sharded by business_id for write distribution.
- In-memory geospatial index — 10GB fits in RAM. Read replicas handle 100K QPS. Updated asynchronously from business DB writes.
- Geohash over Quadtree for primary index — simpler, Redis-native (GEOSEARCH, formerly GEORADIUS), easy to shard. Quadtree as secondary for dynamic radius searches.
5. High-Level Design (12 min)
Architecture
Mobile/Web Client
│
▼
┌───────────────┐
│ API Gateway │ (rate limiting, auth, routing)
│ + Load Balancer│
└───────┬───────┘
│
├────────────────────┬─────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌─────────────────┐ ┌───────────────┐
│ Search Service │ │ Business Service│ │ Review Service │
│ (proximity │ │ (CRUD for │ │ (CRUD for │
│ queries) │ │ listings) │ │ reviews) │
└───────┬───────┘ └────────┬────────┘ └───────────────┘
│ │
▼ │
┌───────────────────┐ │
│ Geospatial Index │ │
│ (in-memory, │ │
│ read replicas) │ │
│ │ │
│ ┌────────────────┐ │ │
│ │ Geohash Index │ │ │
│ │ HashMap<prefix,│ │ │
│ │ businesses> │ │ │
│ └────────────────┘ │ │
│ ┌────────────────┐ │ │
│ │ Quadtree │ │ │
│ │ (for dynamic │ │ │
│ │ radius) │ │ │
│ └────────────────┘ │ │
└───────────────────┘ │
▲ │
│ ▼
│ ┌─────────────────┐
└──────────│ Index Builder │
│ (async, reads │
│ from Business │
│ DB, rebuilds │
│ geo index) │
└─────────────────┘
Storage Layer:
┌──────────────┐ ┌──────────────┐ ┌──────────┐
│ Business DB │ │ Review DB │ │ CDN │
│ (MySQL, │ │ (MySQL, │ │ (photos, │
│ sharded) │ │ sharded) │ │ static) │
└──────────────┘ └──────────────┘ └──────────┘
Search Query Flow
Client: "Find restaurants within 2km of (37.7749, -122.4194)"
│
▼
Search Service:
1. Compute geohash of query point: "9q8yyk" (precision 6)
2. Fetch candidate businesses from geohash cells:
- Primary cell: "9q8yyk"
- 8 neighboring cells: "9q8yyj", "9q8yym", "9q8yys", ...
→ ~500 candidate businesses
3. Filter:
a. Exact distance: haversine(query_lat, query_lng, biz_lat, biz_lng) <= 2000m
b. Category: category == "restaurant"
c. Active: is_active == true
→ ~120 businesses pass filters
4. Rank:
score = 0.4 * relevance + 0.3 * distance_score + 0.2 * rating_score + 0.1 * popularity
- distance_score = 1 - (distance / max_radius)
- rating_score = (rating - 1) / 4
- popularity = log(review_count) / log(max_review_count)
- relevance = text_match_score (if search query includes keywords)
5. Return top 20, encode cursor
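Steps 3a and 4 can be sketched as follows. The weights are the 0.4/0.3/0.2/0.1 split given above; function names are illustrative:

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters (step 3a's exact-distance filter)."""
    r = 6_371_000  # mean Earth radius, meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def rank_score(relevance, distance_m, max_radius_m, rating, review_count, max_review_count):
    """Step 4's weighted blend; assumes review_count >= 1 and max_review_count > 1."""
    distance_score = 1 - distance_m / max_radius_m
    rating_score = (rating - 1) / 4
    popularity = math.log(review_count) / math.log(max_review_count)
    return 0.4 * relevance + 0.3 * distance_score + 0.2 * rating_score + 0.1 * popularity
```

For the Joe's Pizza coordinates in the API example, haversine_m(37.7749, -122.4194, 37.7752, -122.4183) comes out near 100m.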
Components
- Search Service (stateless, 50+ instances): Receives proximity queries, queries geospatial index, filters, ranks, returns results. Each instance holds a read-only copy of the geo index in memory.
- Geospatial Index (in-memory, 20 read replicas): Geohash-based primary index + Quadtree for dynamic radius. 10GB in memory. Rebuilt every 15 minutes from Business DB.
- Business Service: CRUD operations for business listings. Writes to MySQL, triggers async index rebuild.
- Review Service: Handles review CRUD, rating aggregation. Updates rating_avg and review_count on Business table asynchronously.
- Index Builder: Periodic job that reads all active businesses from MySQL, builds geohash index and Quadtree, pushes to read replicas. Can also do incremental updates via CDC (Change Data Capture).
- CDN: All photos served from CDN edge locations. Business detail pages reference CDN URLs.
6. Deep Dives (15 min)
Deep Dive 1: Geospatial Indexing — Geohash vs Quadtree vs S2
Geohash:
Encoding: Interleave bits of latitude and longitude → base32 string
(37.7749, -122.4194) → "9q8yyk8y"
Precision levels:
Precision 4: ~39km x 19.5km (country-level)
Precision 5: ~4.9km x 4.9km (city district)
Precision 6: ~1.2km x 0.6km (neighborhood) ← primary index level
Precision 7: ~153m x 153m (block level)
Precision 8: ~38m x 19m (building level)
Key property: nearby points share a common prefix (mostly)
"9q8yyk" and "9q8yym" are adjacent cells
Search at precision 6:
1. Compute geohash prefix for query point
2. Fetch businesses in this cell + 8 neighbors = 9 cells
3. 9 cells x ~1.2km x 0.6km ≈ covers ~3.6km x 1.8km area
4. Filter by exact distance within this area
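The bit-interleaving encoding described above, as a minimal sketch (binary-search each coordinate range, longitude bit first, 5 bits per base32 character):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lng, precision=6):
    """Interleave lng/lat bits by repeated range-halving, packing
    5 bits into each base32 character."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars, bits, bit_count, even = [], 0, 0, True
    while len(chars) < precision:
        if even:  # even bit positions refine longitude
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits = bits * 2 + 1
                lng_lo = mid
            else:
                bits *= 2
                lng_hi = mid
        else:     # odd bit positions refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = bits * 2 + 1
                lat_lo = mid
            else:
                bits *= 2
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(37.7749, -122.4194, 6))  # 9q8yyk
```

The Joe's Pizza coordinates from the API example (37.7752, -122.4183) fall in the same precision-6 cell, illustrating the shared-prefix property.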
Pros:
- Simple: just a string prefix lookup in a hash map
- Redis-native: GEOSEARCH command uses geohash internally
- Easy to distribute: shard by geohash prefix
Cons:
- Edge problem: points near cell boundaries may be close geographically but have different prefixes
Solution: always query neighboring cells (8-connectivity)
- Fixed grid: cells don't adapt to business density
Dense downtown: 500 businesses per cell → filter overhead
Rural area: 0 businesses per cell → wasted queries
Quadtree:
Structure: Recursive subdivision of 2D space into 4 quadrants
Root: entire world
Split rule: when a node has > K businesses (K = 100), split into 4 children
Leaf nodes: contain ≤ K businesses each
Dense area (Manhattan): tree depth ~20 → leaves cover ~40m x 40m
Sparse area (Wyoming): tree depth ~8 → leaves cover ~150km x 150km
(each level halves the cell side: ~40,000km / 2^depth)
Search:
1. Traverse tree to find leaf containing query point
2. All businesses in this leaf are candidates
3. If radius extends beyond leaf boundary:
Traverse to neighboring leaves (walk up tree, then down siblings)
4. Continue expanding until the search circle is fully covered
Pros:
- Adaptive: dense areas get fine granularity, sparse areas stay coarse
- Efficient for variable-radius queries
- Natural for nearest-neighbor (expand outward until K results found)
Cons:
- More complex to implement than geohash
- In-memory tree structure: ~50 bytes per node, ~2M nodes for 200M businesses → 100MB
- Harder to distribute (tree structure is inherently single-machine)
- Updates require tree rebalancing
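A minimal in-memory quadtree with the split rule above, as an illustrative sketch (x is longitude, y is latitude; production code would also need neighbor traversal and rebalancing):

```python
class QuadNode:
    """Region quadtree node: splits into 4 children past `capacity` points."""
    def __init__(self, x0, y0, x1, y1, capacity=100):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.capacity = capacity
        self.points = []      # (x, y, business_id) tuples, leaves only
        self.children = None  # None while this node is a leaf

    def insert(self, x, y, biz_id):
        if self.children is not None:
            self._child_for(x, y).insert(x, y, biz_id)
            return
        self.points.append((x, y, biz_id))
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        self.children = [
            QuadNode(self.x0, self.y0, mx, my, self.capacity),  # SW
            QuadNode(mx, self.y0, self.x1, my, self.capacity),  # SE
            QuadNode(self.x0, my, mx, self.y1, self.capacity),  # NW
            QuadNode(mx, my, self.x1, self.y1, self.capacity),  # NE
        ]
        for p in self.points:  # redistribute points into children
            self._child_for(p[0], p[1]).insert(*p)
        self.points = []

    def _child_for(self, x, y):
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        return self.children[(2 if y >= my else 0) + (1 if x >= mx else 0)]

    def query(self, qx0, qy0, qx1, qy1, out):
        """Collect points inside the query box, pruning disjoint subtrees."""
        if qx1 < self.x0 or qx0 > self.x1 or qy1 < self.y0 or qy0 > self.y1:
            return out
        if self.children is None:
            out.extend(p for p in self.points
                       if qx0 <= p[0] <= qx1 and qy0 <= p[1] <= qy1)
        else:
            for c in self.children:
                c.query(qx0, qy0, qx1, qy1, out)
        return out
```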
S2 Geometry (Google’s approach):
Mapping: Earth → cube (6 faces) → Hilbert curve → 64-bit cell ID
Hierarchical: level 0 = face, level 30 = ~1cm²
Cell ID encodes the path from root to leaf
Search:
1. Compute S2 cell covering for the search circle (set of cells that cover the area)
2. Cell covering adapts: tight covering for small areas, coarse for large
3. Fetch businesses in all covering cells
Pros:
- No edge effects (unlike geohash)
- Hilbert curve preserves spatial locality better than geohash
- Very efficient coverings: typically 4-8 cells to cover any circle
- Production-proven at Google scale
Cons:
- Complex implementation (use the S2 library)
- Less intuitive than geohash for debugging
Recommendation: Geohash for the primary index (simplicity, Redis compatibility). Quadtree as a secondary structure for nearest-neighbor queries and dynamic radius expansion.
Deep Dive 2: Dynamic Radius Expansion and “No Results” Handling
The problem: User searches for “sushi restaurants within 1km” in a suburb. There are zero results. What do we do?
Strategy: Expand radius progressively:
1. Initial search: radius = 1km (user's request)
→ 0 results
2. Auto-expand: radius = 2km
→ 3 results
→ Return these with a note: "Showing results within 2km"
3. If still 0: radius = 5km, then 10km, then 25km
→ Cap at 50km (beyond this, results are irrelevant)
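The expansion strategy above as a loop; search_at_radius is a hypothetical hook for one fixed-radius proximity search:

```python
EXPANSION_LADDER_M = [1_000, 2_000, 5_000, 10_000, 25_000, 50_000]

def search_with_expansion(query, requested_radius_m, search_at_radius):
    """Try the requested radius first, then climb the ladder until results
    appear or the 50km cap is exhausted. Returns (results, user-facing note)."""
    ladder = [r for r in EXPANSION_LADDER_M if r > requested_radius_m]
    for radius in [requested_radius_m] + ladder:
        results = search_at_radius(query, radius)
        if results:
            expanded = radius != requested_radius_m
            note = f"Showing results within {radius // 1000}km" if expanded else None
            return results, note
    return [], "No results within 50km"
```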
Implementation:
With Quadtree:
Start at leaf node → expand to parent → parent's siblings → grandparent
Natural expanding search: O(log N) per expansion step
With Geohash:
Precision 7 (153m) → Precision 6 (1.2km) → Precision 5 (4.9km) → Precision 4 (39km)
Each dropped character covers ~32x more area (5 bits per base32 character)
Less granular expansion (jumpy)
Hybrid (best):
1. Geohash precision 6 for initial search (covers ~3.6km with neighbors)
2. If insufficient results: drop to precision 5 (covers ~15km)
3. If still insufficient: Quadtree nearest-neighbor search (expand outward incrementally)
Ranking adjustment with expanded radius:
When radius is auto-expanded, re-weight the ranking formula:
- Increase distance weight: users who wanted 1km results strongly prefer closer options
- Apply distance penalty: score *= exp(-distance / original_radius)
- This ensures that a 4.5-star restaurant at 1.5km ranks above a 4.8-star restaurant at 10km
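The penalty in code form (the 4.5-star vs 4.8-star example checks out numerically):

```python
import math

def penalized(score, distance_m, original_radius_m):
    """Distance penalty from the note above: decay the score with distance
    measured in units of the user's originally requested radius."""
    return score * math.exp(-distance_m / original_radius_m)

# With a 1km original radius, 4.5 stars at 1.5km beats 4.8 stars at 10km.
near = penalized(4.5 / 5, 1_500, 1_000)
far = penalized(4.8 / 5, 10_000, 1_000)
```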
Presentation:
"No sushi restaurants within 1km. Showing results within 5km."
Results grouped: "Within 2km" | "Within 5km"
Deep Dive 3: Caching Strategy for Location Queries
The problem: Location queries seem uncacheable — every user has a different (lat, lng). But there are patterns we can exploit.
Level 1: Geohash-based query caching
Observation: Two users 200m apart in the same geohash cell, searching for "pizza",
will get nearly identical results.
Cache key: geohash_prefix:category:sort:filters
Example: "9q8yyk:restaurant:best_match:price_2" → cached result set
Cache hit rate: ~60-70% (users cluster in dense areas, search for common categories)
TTL: 5 minutes (business data doesn't change frequently)
Implementation:
1. Compute geohash of query point (precision 6)
2. Construct cache key from geohash + filters
3. If cache hit: filter cached results by exact distance from user's actual location, re-rank
4. If cache miss: execute full search, populate cache
The cached result set is larger than what any single user needs (all businesses in the geohash area).
Per-user distance filtering and re-ranking is done on the cached superset.
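Steps 1-4 can be sketched as follows; run_search and distance_fn are hypothetical hooks (the full 9-cell search and the exact-distance function), and the cache is shown as a plain dict standing in for Redis:

```python
def cache_key(geohash6, category, sort, filters):
    """Normalized key: same-cell queries with the same filters share one entry."""
    filter_part = ",".join(f"{k}={v}" for k, v in sorted(filters.items()))
    return f"{geohash6}:{category}:{sort}:{filter_part}"

def search_with_cache(cache, geohash6, category, sort, filters,
                      user_lat, user_lng, radius_m,
                      run_search, distance_fn, limit=20):
    """Serve from the cached per-cell superset when possible. The superset
    holds every matching business in the geohash area, so each user only
    needs a local distance filter and re-rank."""
    key = cache_key(geohash6, category, sort, filters)
    superset = cache.get(key)
    if superset is None:                      # cache miss: run the full search
        superset = run_search(geohash6, category, sort, filters)
        cache[key] = superset                 # real deployment: SET with 5min TTL
    nearby = [b for b in superset
              if distance_fn(user_lat, user_lng, b["lat"], b["lng"]) <= radius_m]
    nearby.sort(key=lambda b: b["score"], reverse=True)
    return nearby[:limit]
```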
Level 2: Popular location pre-computation
Pre-compute search results for popular (geohash, category) pairs:
- Top 1000 geohash cells by query volume (city centers, airports, tourist areas)
- Top 20 categories (restaurant, coffee, gas station, hotel, ...)
- 1000 x 20 = 20,000 pre-computed result sets
- Updated every 5 minutes
These serve as a "warm cache" — 40% of all queries hit these pre-computed results.
Level 3: Client-side caching
When user loads search results:
- Cache results on the client for 10 minutes
- If user pans map slightly (< 500m), filter existing results client-side
- Only fetch from server when user moves significantly or changes filters
- Reduces server QPS by ~30%
7. Extensions (2 min)
- Real-time business status: Integrate live data — current wait times, live occupancy, “unusually busy right now.” Use business check-ins, Google Maps popular times equivalent, or IoT sensors. Display as a badge on search results.
- Recommendation engine: “Because you liked Joe’s Pizza, you might like…” Collaborative filtering on user review history + content-based filtering on business attributes. Surface in a “Recommended for You” section below search results.
- Map-based search: When user pans/zooms the map, dynamically load businesses in the visible viewport. Use S2 cell covering of the viewport rectangle. Cluster markers at low zoom levels (100 restaurants in downtown → single cluster marker).
- Review authenticity and spam detection: ML model to detect fake reviews (burst patterns, reviewer history, sentiment analysis, copy-paste detection). Flag suspicious reviews for manual moderation. Weight verified-purchase reviews higher in the rating calculation.
- Offline-first mobile experience: Cache recently viewed areas and business details on the mobile device. Users can search and browse while offline (stale but functional). Sync when connectivity is restored.