1. Requirements & Scope (5 min)
Functional Requirements
- Compute a dynamic pricing multiplier for each geographic zone based on real-time supply (available drivers) and demand (ride requests)
- Divide the service area into geospatial zones and compute independent surge multipliers per zone
- Display the current surge multiplier to riders before they confirm a ride, with a price estimate
- Apply smoothing and dampening so surge prices don’t oscillate wildly (e.g., 1.0× → 3.5× → 1.2× within minutes)
- Enforce price caps and fairness rules (regulatory limits, max multiplier during emergencies, consistent pricing within a zone)
Non-Functional Requirements
- Availability: 99.99% — surge pricing is in the critical path of every ride request. If it’s down, rides can’t be priced.
- Latency: Surge multiplier lookup < 20ms per ride request. Surge recomputation runs every 30-60 seconds.
- Consistency: All ride requests within the same zone at the same time should see the same surge multiplier. Eventual consistency across data centers is acceptable (< 5 second lag).
- Scale: 500 cities, 50K zones globally, 100K ride requests/sec at peak, 5M active drivers
- Freshness: Surge must reflect conditions no older than 60 seconds. Stale surge = mispricing = lost revenue or angry riders.
2. Estimation (3 min)
Traffic
- 100K ride requests/sec at peak → each needs a surge lookup → 100K reads/sec
- Driver location updates: 5M drivers × 1 update every 4 seconds = 1.25M location updates/sec
- Surge recomputation: 50K zones × 1 recomputation/minute = ~833 zone recomputations/sec (lightweight)
- Ride request events (for demand counting): 100K events/sec
Storage
- Zone definitions: 50K zones × 1 KB (polygon or H3 index) = 50 MB (fits in memory)
- Current surge state: 50K zones × 100 bytes (multiplier, supply, demand, timestamp) = 5 MB (fits in Redis)
- Historical surge data (for analytics): 50K zones × 1 record/min × 60 min × 24 hr = 72M records/day × 200 bytes = ~14 GB/day
Compute
- Per zone recomputation: count supply (drivers in zone), count demand (requests in zone in last 2 min), compute ratio, apply formula
- Simple arithmetic — CPU is trivial
- The bottleneck is ingesting and aggregating 1.25M location updates/sec to determine supply per zone in real-time
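The estimates above are simple enough to sanity-check in code; a quick back-of-envelope script using only the figures stated in this section:

```python
# Back-of-envelope sanity check for the estimates above.
DRIVERS = 5_000_000
UPDATE_INTERVAL_S = 4            # one GPS update per driver every 4 seconds
ZONES = 50_000
SURGE_STATE_BYTES_PER_ZONE = 100
HISTORY_BYTES_PER_RECORD = 200   # one record per zone per minute

location_updates_per_sec = DRIVERS / UPDATE_INTERVAL_S        # 1.25M/sec
zone_recomputes_per_sec = ZONES / 60                          # ~833/sec
surge_state_bytes = ZONES * SURGE_STATE_BYTES_PER_ZONE        # 5 MB
history_bytes_per_day = ZONES * 60 * 24 * HISTORY_BYTES_PER_RECORD  # ~14.4 GB

print(f"{location_updates_per_sec:,.0f} location updates/sec")
print(f"{zone_recomputes_per_sec:,.0f} zone recomputes/sec")
print(f"{surge_state_bytes / 1e6:.1f} MB surge state")
print(f"{history_bytes_per_day / 1e9:.1f} GB/day history")
```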
Key Insight
This is a real-time geospatial aggregation problem. The hard parts are: (1) efficiently mapping millions of driver locations to zones every few seconds, (2) computing stable surge multipliers that respond to demand without oscillating, and (3) ensuring riders and drivers see consistent prices during a ride.
3. API Design (3 min)
Surge Lookup (called on every ride request)
GET /surge?lat=37.7749&lng=-122.4194
Response 200: {
"zone_id": "h3_872830828ffffff",
"surge_multiplier": 1.8,
"price_estimate": {
"base_fare": 5.00,
"per_mile": 1.50,
"per_minute": 0.25,
"surge_multiplied_estimate": 18.90,
"currency": "USD"
},
"surge_expiry": 1708632090, // valid for 90 seconds
"surge_id": "surge_abc123" // lock this price for the rider
}
Confirm Ride with Surge
POST /rides
Body: {
"rider_id": "rider_456",
"pickup": { "lat": 37.7749, "lng": -122.4194 },
"dropoff": { "lat": 37.7849, "lng": -122.4094 },
"surge_id": "surge_abc123", // locks in the surge multiplier
"payment_method_id": "pm_xyz"
}
Response 201: {
"ride_id": "ride_789",
"surge_multiplier": 1.8, // locked at time of confirm
"estimated_fare": 18.90
}
Driver Location Update (high-frequency)
POST /drivers/location
Body: {
"driver_id": "driver_123",
"lat": 37.7755,
"lng": -122.4180,
"status": "available", // available, on_trip, offline
"timestamp": 1708632000
}
Response 204 No Content
Admin / Analytics
GET /surge/zones?city=san_francisco
→ Returns all zones with current surge multiplier
GET /surge/history?zone_id=h3_872830828ffffff&start=...&end=...
→ Historical surge data for analytics
PUT /surge/config
Body: {
"city": "san_francisco",
"max_multiplier": 5.0,
"min_multiplier": 1.0,
"smoothing_factor": 0.3,
"emergency_cap": 1.0 // disable surge during declared emergencies
}
Key Decisions
- Surge ID locking: When a rider sees surge 1.8×, that price is locked for 90 seconds. Even if surge changes to 2.5× before they confirm, they pay 1.8×. Prevents bait-and-switch frustration.
- 204 for location updates: No response body needed, minimizes latency on the highest-volume endpoint
- H3 hexagonal zones (not arbitrary polygons) — uniform area, efficient spatial indexing
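The lookup and price-lock decisions above can be sketched end to end. This is a minimal illustration, not production code: a plain dict stands in for Redis, and zone_for is a stand-in for h3.latlng_to_cell:

```python
import time
import uuid

surge_state = {}   # zone_id -> {"multiplier": ...}   (Redis hash in production)
surge_locks = {}   # surge_id -> lock record          (Redis key with 90 s TTL)

def zone_for(lat, lng):
    # Stand-in for h3.latlng_to_cell(lat, lng, 7); real zone IDs come from H3.
    return f"zone_{round(lat, 1)}_{round(lng, 1)}"

def lookup_surge(lat, lng, now=None):
    """Resolve the zone, read its multiplier, and mint a 90-second price lock."""
    if now is None:
        now = time.time()
    zone_id = zone_for(lat, lng)
    state = surge_state.get(zone_id)
    multiplier = state["multiplier"] if state else 1.0  # failsafe: no state -> no surge
    surge_id = f"surge_{uuid.uuid4().hex[:8]}"
    surge_locks[surge_id] = {"zone_id": zone_id, "multiplier": multiplier,
                             "expires_at": now + 90}
    return {"zone_id": zone_id, "surge_multiplier": multiplier,
            "surge_expiry": int(now + 90), "surge_id": surge_id}
```

The failsafe default (1.0× when no state exists) mirrors the TTL-expiry behavior described in the data model: if the pipeline dies, prices degrade to no surge rather than stale surge.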
4. Data Model (3 min)
Zone Definitions (in-memory, loaded from config)
Structure: zone_config (loaded into each service instance)
zone_id | string -- H3 index at resolution 7
city | string
center_lat | float
center_lng | float
neighbors | array<string> -- adjacent H3 cells (for smoothing)
Real-Time Zone State (Redis)
Key: surge:{zone_id}
Value (Hash): {
"multiplier": 1.8,
"supply": 45, // available drivers in zone
"demand": 82, // ride requests in last 2 min
"supply_demand_ratio": 0.549,
"updated_at": 1708632000,
"version": 12345 // monotonic version for consistency
}
TTL: 120 seconds (auto-expire if not refreshed — failsafe to 1.0× if pipeline dies)
Surge Lock (Redis, short TTL)
Key: surge_lock:{surge_id}
Value: {
"zone_id": "h3_872830828ffffff",
"multiplier": 1.8,
"created_at": 1708632000
}
TTL: 90 seconds
Driver Locations (Redis Geo or custom)
Key: drivers:{city}
Type: Redis GeoSet
Members: driver_id with (lat, lng)
// Or: in-memory geospatial index in the Surge Computation Service
// Updated from driver location stream
Historical Surge (ClickHouse)
Table: surge_history
zone_id | string
timestamp | datetime
multiplier | float
supply | int
demand | int
ratio | float
city | string
-- Partitioned by date, ordered by (zone_id, timestamp)
-- Used for analytics, A/B testing analysis, and ML model training
Why These Choices
- Redis for real-time state: sub-millisecond reads, 100K lookups/sec trivially. TTL-based expiry as a safety net.
- H3 hexagons (Uber’s own library): uniform area (unlike lat/lng grids), hierarchical (can aggregate to coarser resolution), well-supported.
- ClickHouse for historical: columnar storage, fast aggregation for surge analytics.
5. High-Level Design (12 min)
Architecture
Driver App → Location Service → Kafka (driver-locations topic)
│
Rider App → Ride Service → Kafka (ride-requests topic)
│
┌──────────▼──────────────┐
│ Surge Computation │
│ Service │
│ (runs every 30 seconds) │
│ │
│ For each zone: │
│ 1. Count supply │
│ 2. Count demand │
│ 3. Compute ratio │
│ 4. Apply formula │
│ 5. Apply smoothing │
│ 6. Apply caps │
│ 7. Write to Redis │
└──────────┬──────────────┘
│
┌────▼────┐
│ Redis │
│ (surge │
│ state) │
└────┬────┘
│
Ride Request → Ride Service → GET surge from Redis → return to rider
Surge Computation Pipeline (detailed)
Every 30 seconds, for each zone:
Step 1: Count Supply
Query driver location index: how many drivers with status='available'
are within zone_id's H3 hexagon?
Implementation: Flink job consuming driver-locations topic
Maintains in-memory H3-indexed driver set
On location update: re-index driver into its new H3 cell
Supply query: count(drivers in cell where status = 'available')
Step 2: Count Demand
Flink job consuming ride-requests topic
Sliding window: count requests per zone in last 2 minutes
(Weighted: recent requests weighted more than 2-min-old ones)
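The weighting scheme isn't specified above; one simple option is linear decay over the window, sketched here (the decay function is an illustrative assumption):

```python
def weighted_demand(request_ages_s, window_s=120):
    """Count ride requests in the sliding window, weighting recent ones more.
    Linear decay: a request `age` seconds old contributes (1 - age/window),
    so a just-arrived request counts 1.0 and a nearly-expired one counts ~0."""
    return sum(1 - age / window_s for age in request_ages_s if age < window_s)
```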
Step 3: Compute Supply-Demand Ratio
ratio = supply / max(demand, 1) // avoid division by zero
Step 4: Apply Surge Formula
if ratio >= 1.0: multiplier = 1.0 // supply exceeds demand, no surge
if ratio < 1.0:
// Inverse relationship: less supply relative to demand = higher surge
raw_multiplier = 1.0 / ratio
// Example: supply=20, demand=60 → ratio=0.33 → raw_multiplier=3.0
Step 5: Apply Smoothing (prevent oscillation)
new_multiplier = α × raw_multiplier + (1 - α) × previous_multiplier
where α = 0.3 (smoothing factor)
Example: previous=1.5, raw=3.0, α=0.3
new = 0.3 × 3.0 + 0.7 × 1.5 = 0.9 + 1.05 = 1.95
Surge rises gradually from 1.5 to 1.95, not jumping to 3.0
Step 6: Apply Caps and Rules
multiplier = max(min_multiplier, min(multiplier, max_multiplier))
// min = 1.0, max = 5.0 (configurable per city)
// Emergency override
if city.emergency_mode: multiplier = 1.0
// Regulatory cap (e.g., some cities cap at 2.0)
if city.regulatory_cap: multiplier = min(multiplier, city.regulatory_cap)
Step 7: Spatial Smoothing (optional)
// Prevent sharp boundaries between adjacent zones
smoothed = 0.6 × zone_multiplier + 0.4 × avg(neighbor_multipliers)
// Rider at a zone boundary shouldn't see 1.0 vs 3.0 by moving 50 meters
Step 8: Write to Redis
HSET surge:{zone_id} multiplier {value} supply {n} demand {n} updated_at {ts}
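Steps 3-6 condense to a few lines; a sketch using the formula and parameter values from above:

```python
def compute_multiplier(supply, demand, previous, alpha=0.3,
                       min_mult=1.0, max_mult=5.0,
                       emergency_mode=False, regulatory_cap=None):
    """One zone's recompute: ratio -> raw surge -> EMA smoothing -> caps."""
    ratio = supply / max(demand, 1)                  # Step 3: avoid division by zero
    raw = 1.0 if ratio >= 1.0 else 1.0 / ratio       # Step 4: inverse relationship
    smoothed = alpha * raw + (1 - alpha) * previous  # Step 5: EMA smoothing
    capped = max(min_mult, min(smoothed, max_mult))  # Step 6: floor and ceiling
    if regulatory_cap is not None:
        capped = min(capped, regulatory_cap)
    if emergency_mode:
        capped = 1.0                                 # emergency override wins
    return capped
```

With the worked example from Step 5 (supply=20, demand=60, previous=1.5, α=0.3): raw = 3.0, smoothed result = 1.95.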
Components
- Location Service: Receives driver GPS updates, validates, publishes to Kafka. High-throughput (1.25M/sec).
- Ride Service: Handles ride requests. On request, looks up surge from Redis, returns price estimate. On confirm, locks surge_id.
- Surge Computation Service (Flink): Stateful stream processor. Maintains real-time supply/demand counts per zone. Recomputes multipliers every 30 seconds. Writes to Redis.
- Redis Cluster: Stores current surge state per zone. Sub-millisecond reads for 100K ride requests/sec.
- Config Service: Manages per-city surge parameters (max multiplier, smoothing factor, emergency overrides). Changes propagated to computation service in real-time.
- Analytics Pipeline: Writes surge history to ClickHouse for post-hoc analysis, A/B testing, and ML model training.
6. Deep Dives (15 min)
Deep Dive 1: H3 Hexagonal Grid — Why and How
Why not latitude/longitude grid squares?
- Grid squares have non-uniform areas (a 0.01° × 0.01° cell at the equator is ~1.2 km²; at 60°N it shrinks to ~0.6 km², since east-west width scales with cos(latitude))
- Square grids have two types of neighbors (edge-adjacent and corner-adjacent) — uneven neighbor relationships
- Hexagons have uniform area and exactly 6 equidistant neighbors
H3 Indexing System (developed by Uber):
Resolutions:
Level 7: ~5.16 km² per hexagon (good for city-level surge)
Level 8: ~0.74 km² per hexagon (neighborhood-level)
Level 9: ~0.11 km² per hexagon (block-level, too granular for surge)
Typical choice: Resolution 7 for surge pricing
San Francisco: ~120 km² → ~23 hexagons
New York City: ~783 km² → ~152 hexagons
Global: 50K hexagons across all served cities
Converting lat/lng to H3:
h3_index = h3.latlng_to_cell(lat, lng, 7) // resolution 7
// Returns: "872830828ffffff"
// O(1) computation — just bit manipulation
Finding neighbors:
neighbors = h3.grid_ring(h3_index, k=1) // 6 neighbors at distance 1
// Used for spatial smoothing of surge multipliers
Hierarchical aggregation:
If resolution 7 zones are too granular (low data density in suburb):
→ Aggregate to resolution 6 (7× larger): h3.cell_to_parent(cell, 6)
→ Compute surge at coarser granularity for sparse areas
Adaptive resolution:
Urban core: resolution 8 (high driver/rider density → fine granularity)
Suburbs: resolution 7
Rural: resolution 6 or no surge at all
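Adaptive resolution can be a simple density-driven lookup; the thresholds below are illustrative assumptions, not Uber's actual values. Cells that land at a fine resolution but have sparse data are then aggregated upward with h3.cell_to_parent as described above:

```python
def pick_resolution(drivers_per_km2):
    """Choose an H3 resolution by driver density (thresholds are illustrative)."""
    if drivers_per_km2 >= 50:
        return 8   # urban core: ~0.74 km2 cells, fine granularity
    if drivers_per_km2 >= 5:
        return 7   # suburbs: ~5.16 km2 cells, the default for surge
    return 6       # rural: coarse cells (or skip surge entirely)
```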
Deep Dive 2: Surge Multiplier Stability — Smoothing and Dampening
The oscillation problem:
t=0: supply=50, demand=100, ratio=0.5, multiplier=2.0
→ Riders see 2.0× surge, some cancel → demand drops
→ Drivers rush to surge zone → supply increases
t=30s: supply=90, demand=40, ratio=2.25, multiplier=1.0
→ Low surge attracts riders back, drivers leave
t=60s: supply=30, demand=120, ratio=0.25, multiplier=4.0
→ Wild oscillation: 2.0 → 1.0 → 4.0
This is a classic feedback loop / control theory problem.
Solution 1: Exponential Moving Average (EMA)
smoothed_multiplier(t) = α × raw(t) + (1 - α) × smoothed(t - 1)
α = 0.3 (slow response, stable):
t=0: raw=2.0, smoothed = 0.3(2.0) + 0.7(1.0) = 1.3
t=30: raw=1.0, smoothed = 0.3(1.0) + 0.7(1.3) = 1.21
t=60: raw=4.0, smoothed = 0.3(4.0) + 0.7(1.21) = 2.05
Result: instead of 2.0 → 1.0 → 4.0, we get 1.3 → 1.21 → 2.05
Much more stable, but responds slowly to genuine demand spikes.
Solution 2: Asymmetric smoothing
// Surge should rise faster than it falls
// Rising = responding to demand (important for driver incentive)
// Falling = returning to normal (should be gradual to prevent re-oscillation)
α_up = 0.5 (faster rise)
α_down = 0.15 (slower fall)
if raw(t) > smoothed(t-1):
smoothed(t) = α_up × raw(t) + (1 - α_up) × smoothed(t-1)
else:
smoothed(t) = α_down × raw(t) + (1 - α_down) × smoothed(t-1)
Solution 3: Minimum duration (hysteresis)
Once surge rises above a threshold, it stays there for at least N minutes.
Prevents fluttering between 1.0× and 1.5× every 30 seconds.
if multiplier(t) > 1.5 and multiplier(t-1) < 1.5:
surge_start_time = t
// Cannot drop below 1.5 until t + 5 minutes
Solution 4: Step-function discretization
Instead of continuous values (1.0, 1.1, 1.2, ...):
Allowed surge levels: [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 4.0, 5.0]
Snap to nearest allowed level.
This prevents visually confusing changes like 1.7× → 1.8× → 1.7×
Users see fewer, more meaningful changes.
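Solutions 2 and 4 are each a few lines; a sketch combining asymmetric smoothing with snap-to-level discretization, using the α values and level list from above:

```python
SURGE_LEVELS = [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 4.0, 5.0]

def asymmetric_ema(raw, previous, alpha_up=0.5, alpha_down=0.15):
    """Solution 2: surge rises faster than it falls."""
    alpha = alpha_up if raw > previous else alpha_down
    return alpha * raw + (1 - alpha) * previous

def snap_to_level(multiplier, levels=SURGE_LEVELS):
    """Solution 4: discretize to the nearest allowed surge level."""
    return min(levels, key=lambda lv: abs(lv - multiplier))
```

A demand spike from 1.5× toward a raw 3.0× moves the smoothed value to 2.25 in one step (fast rise), while a drop from 2.0× toward raw 1.0× only falls to ~1.85 (slow decay).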
Deep Dive 3: Fairness, Gaming Prevention, and A/B Testing
Fairness concerns:
Problem 1: Low-income neighborhoods might see persistent surge due to fewer drivers.
→ Monitor surge frequency by zone. Flag zones with surge > 1.5× more than 40% of the time.
→ Possible mitigations: driver incentive bonuses for underserved zones, reduced max multiplier.
Problem 2: Surge during emergencies (natural disasters, mass shootings).
→ Emergency mode: cap multiplier at 1.0 for affected zones.
→ Triggered manually by ops or automatically via emergency API feeds.
→ Uber actually implemented this after backlash during Hurricane Sandy.
Problem 3: Riders game surge by moving their pin to an adjacent non-surge zone.
→ Use pickup location for billing, not pin location.
→ Or: compute surge at both pin and actual pickup, use the maximum.
A/B testing surge parameters:
Goal: Test whether α=0.3 or α=0.5 produces better revenue and rider satisfaction.
Setup:
Treatment A: zones with even H3 indices use α=0.3
Treatment B: zones with odd H3 indices use α=0.5
(Ensures geographic interleaving — reduces city-level confounders. Caveat: adjacent zones in different arms can contaminate each other as drivers move between them, so interference must be accounted for in the analysis.)
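A common alternative to even/odd index parity is a salted hash split, which stays deterministic per zone but keeps assignments independent across experiments (the experiment name and 50/50 split here are illustrative):

```python
import hashlib

def assign_treatment(zone_id, experiment="smoothing_alpha_v1"):
    """Deterministic 50/50 split: hash the zone and experiment name together.
    Salting with the experiment name decorrelates assignments across tests."""
    digest = hashlib.sha256(f"{experiment}:{zone_id}".encode()).digest()
    return "A" if digest[0] % 2 == 0 else "B"

# Treatment A uses alpha=0.3, treatment B uses alpha=0.5
alpha = {"A": 0.3, "B": 0.5}[assign_treatment("h3_872830828ffffff")]
```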
Metrics:
- Revenue per ride (higher surge = more revenue but fewer rides)
- Ride completion rate (too much surge = riders cancel)
- Driver earnings (surge should attract drivers to high-demand areas)
- ETA (surge should reduce wait time by attracting supply)
- Rider satisfaction (post-ride survey, 1-5 stars)
Duration: 2-4 weeks per city (need enough data for statistical significance)
Guardrails:
- If treatment group ride completions drop > 5%, auto-revert
- Revenue difference > 10% → alert team for review
Price lock guarantee:
When rider sees surge 2.0× and taps "confirm":
1. Server creates surge_lock in Redis: { surge_id, multiplier: 2.0, expires: +90s }
2. Rider has 90 seconds to complete booking
3. On ride creation: read surge_lock, apply multiplier 2.0 regardless of current surge
4. After 90 seconds: lock expires, rider must re-check current surge
Why: Prevents scenario where surge jumps from 2.0 to 3.5 between
the rider seeing the price and tapping confirm (3-5 seconds)
→ Would feel like bait-and-switch
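The lock lifecycle above, sketched with an in-memory stand-in for Redis (production would use SET with an EX TTL, letting Redis expire locks automatically):

```python
import time

class SurgeLockStore:
    """In-memory stand-in for the Redis surge_lock:{surge_id} keys (90 s TTL)."""

    def __init__(self, ttl=90):
        self.ttl = ttl
        self._locks = {}

    def create(self, surge_id, zone_id, multiplier, now=None):
        """Step 1: record the locked price when the rider sees the quote."""
        if now is None:
            now = time.time()
        self._locks[surge_id] = {"zone_id": zone_id, "multiplier": multiplier,
                                 "expires_at": now + self.ttl}

    def redeem(self, surge_id, now=None):
        """Step 3: on ride creation, return the locked multiplier —
        or None if the lock is missing/expired (rider must re-check surge)."""
        if now is None:
            now = time.time()
        lock = self._locks.get(surge_id)
        if lock is None or now > lock["expires_at"]:
            return None
        return lock["multiplier"]
```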
7. Extensions (2 min)
- Predictive surge: ML model trained on historical data predicts surge 15-30 minutes ahead. Inputs: time of day, day of week, weather, local events, real-time supply trends. Use prediction to pre-position drivers (suggest drivers move toward areas where surge is expected).
- Surge-aware routing: When computing ride ETAs and routes, factor in surge in nearby zones. A driver completing a trip near a surge zone might be routed slightly toward it for the next pickup, increasing effective supply.
- Multi-product surge: Different multipliers for UberX, UberXL, UberBlack. Luxury tiers may have different supply-demand dynamics. Compute per-(zone, product) multipliers.
- Rider surge tolerance profiles: Some riders always accept surge; others cancel at 1.5×. Use historical acceptance data to personalize the surge UI (e.g., show “prices are higher than usual” vs exact multiplier). Note: personalized pricing is legally sensitive in many jurisdictions.
- Driver incentive integration: Instead of only charging riders more, offer temporary driver bonuses in high-demand zones (“Earn 2× in downtown for the next 30 minutes”). This increases supply-side response without always raising rider prices.