1. Requirements & Scope (5 min)
Functional Requirements
- Compute a dynamic pricing multiplier for each geographic zone based on real-time supply (available drivers) and demand (ride requests)
- Divide the service area into geospatial zones and compute independent surge multipliers per zone
- Display the current surge multiplier to riders before they confirm a ride, with a price estimate
- Apply smoothing and dampening so surge prices don’t oscillate wildly (e.g., 1.0× → 3.5× → 1.2× within minutes)
- Enforce price caps and fairness rules (regulatory limits, max multiplier during emergencies, consistent pricing within a zone)
Non-Functional Requirements
- Availability: 99.99% — surge pricing is in the critical path of every ride request. If it’s down, rides can’t be priced.
- Latency: Surge multiplier lookup < 20ms per ride request. Surge recomputation runs every 30-60 seconds.
- Consistency: All ride requests within the same zone at the same time should see the same surge multiplier. Eventual consistency across data centers is acceptable (< 5 second lag).
- Scale: 500 cities, 50K zones globally, 100K ride requests/sec at peak, 5M active drivers
- Freshness: Surge must reflect conditions no older than 60 seconds. Stale surge = mispricing = lost revenue or angry riders.
2. Estimation (3 min)
Traffic
- 100K ride requests/sec at peak → each needs a surge lookup → 100K reads/sec
- Driver location updates: 5M drivers × 1 update every 4 seconds = 1.25M location updates/sec
- Surge recomputation: 50K zones × 1 recomputation/minute = ~833 zone recomputations/sec (lightweight)
- Ride request events (for demand counting): 100K events/sec
Storage
- Zone definitions: 50K zones × 1 KB (polygon or H3 index) = 50 MB (fits in memory)
- Current surge state: 50K zones × 100 bytes (multiplier, supply, demand, timestamp) = 5 MB (fits in Redis)
- Historical surge data (for analytics): 50K zones × 1 record/min × 60 min × 24 hr = 72M records/day × 200 bytes = ~14 GB/day
Compute
- Per zone recomputation: count supply (drivers in zone), count demand (requests in zone in last 2 min), compute ratio, apply formula
- Simple arithmetic — CPU is trivial
- The bottleneck is ingesting and aggregating 1.25M location updates/sec to determine supply per zone in real-time
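The estimates above are simple enough to sanity-check in code; a quick back-of-envelope script using only the figures stated in this section:

```python
# Back-of-envelope sanity check for the estimates above.
DRIVERS = 5_000_000
UPDATE_INTERVAL_S = 4            # one GPS update per driver every 4 seconds
ZONES = 50_000
SURGE_STATE_BYTES_PER_ZONE = 100
HISTORY_BYTES_PER_RECORD = 200   # one record per zone per minute

location_updates_per_sec = DRIVERS / UPDATE_INTERVAL_S        # 1.25M/sec
zone_recomputes_per_sec = ZONES / 60                          # ~833/sec
surge_state_bytes = ZONES * SURGE_STATE_BYTES_PER_ZONE        # 5 MB
history_bytes_per_day = ZONES * 60 * 24 * HISTORY_BYTES_PER_RECORD  # ~14.4 GB

print(f"{location_updates_per_sec:,.0f} location updates/sec")
print(f"{zone_recomputes_per_sec:,.0f} zone recomputes/sec")
print(f"{surge_state_bytes / 1e6:.1f} MB surge state")
print(f"{history_bytes_per_day / 1e9:.1f} GB/day history")
```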
Key Insight
This is a real-time geospatial aggregation problem. The hard parts are: (1) efficiently mapping millions of driver locations to zones every few seconds, (2) computing stable surge multipliers that respond to demand without oscillating, and (3) ensuring riders and drivers see consistent prices during a ride.
3. API Design (3 min)
Surge Lookup (called on every ride request)
GET /surge?lat=37.7749&lng=-122.4194
Response 200: {
"zone_id": "h3_872830828ffffff",
"surge_multiplier": 1.8,
"price_estimate": {
"base_fare": 5.00,
"per_mile": 1.50,
"per_minute": 0.25,
"surge_multiplied_estimate": 18.90,
"currency": "USD"
},
"surge_expiry": 1708632090, // valid for 90 seconds
"surge_id": "surge_abc123" // lock this price for the rider
}
Confirm Ride with Surge
POST /rides
Body: {
"rider_id": "rider_456",
"pickup": { "lat": 37.7749, "lng": -122.4194 },
"dropoff": { "lat": 37.7849, "lng": -122.4094 },
"surge_id": "surge_abc123", // locks in the surge multiplier
"payment_method_id": "pm_xyz"
}
Response 201: {
"ride_id": "ride_789",
"surge_multiplier": 1.8, // locked at time of confirm
"estimated_fare": 18.90
}
Driver Location Update (high-frequency)
POST /drivers/location
Body: {
"driver_id": "driver_123",
"lat": 37.7755,
"lng": -122.4180,
"status": "available", // available, on_trip, offline
"timestamp": 1708632000
}
Response 204 No Content
Admin / Analytics
GET /surge/zones?city=san_francisco
→ Returns all zones with current surge multiplier
GET /surge/history?zone_id=h3_872830828ffffff&start=...&end=...
→ Historical surge data for analytics
PUT /surge/config
Body: {
"city": "san_francisco",
"max_multiplier": 5.0,
"min_multiplier": 1.0,
"smoothing_factor": 0.3,
"emergency_cap": 1.0 // disable surge during declared emergencies
}
Key Decisions
- Surge ID locking: When a rider sees surge 1.8×, that price is locked for 90 seconds. Even if surge changes to 2.5× before they confirm, they pay 1.8×. Prevents bait-and-switch frustration.
- 204 for location updates: No response body needed, minimizes latency on the highest-volume endpoint
- H3 hexagonal zones (not arbitrary polygons) — uniform area, efficient spatial indexing
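The lookup and price-lock decisions above can be sketched end to end. This is a minimal illustration, not production code: a plain dict stands in for Redis, and zone_for is a stand-in for h3.latlng_to_cell:

```python
import time
import uuid

surge_state = {}   # zone_id -> {"multiplier": ...}   (Redis hash in production)
surge_locks = {}   # surge_id -> lock record          (Redis key with 90 s TTL)

def zone_for(lat, lng):
    # Stand-in for h3.latlng_to_cell(lat, lng, 7); real zone IDs come from H3.
    return f"zone_{round(lat, 1)}_{round(lng, 1)}"

def lookup_surge(lat, lng, now=None):
    """Resolve the zone, read its multiplier, and mint a 90-second price lock."""
    if now is None:
        now = time.time()
    zone_id = zone_for(lat, lng)
    state = surge_state.get(zone_id)
    multiplier = state["multiplier"] if state else 1.0  # failsafe: no state -> no surge
    surge_id = f"surge_{uuid.uuid4().hex[:8]}"
    surge_locks[surge_id] = {"zone_id": zone_id, "multiplier": multiplier,
                             "expires_at": now + 90}
    return {"zone_id": zone_id, "surge_multiplier": multiplier,
            "surge_expiry": int(now + 90), "surge_id": surge_id}
```

The failsafe default (1.0× when no state exists) mirrors the TTL-expiry behavior described in the data model: if the pipeline dies, prices degrade to no surge rather than stale surge.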
4. Data Model (3 min)
Zone Definitions (in-memory, loaded from config)
Structure: zone_config (loaded into each service instance)
zone_id | string -- H3 index at resolution 7
city | string
center_lat | float
center_lng | float
neighbors | array<string> -- adjacent H3 cells (for smoothing)
Real-Time Zone State (Redis)
Key: surge:{zone_id}
Value (Hash): {
"multiplier": 1.8,
"supply": 45, // available drivers in zone
"demand": 82, // ride requests in last 2 min
"supply_demand_ratio": 0.549,
"updated_at": 1708632000,
"version": 12345 // monotonic version for consistency
}
TTL: 120 seconds (auto-expire if not refreshed — failsafe to 1.0× if pipeline dies)
Surge Lock (Redis, short TTL)
Key: surge_lock:{surge_id}
Value: {
"zone_id": "h3_872830828ffffff",
"multiplier": 1.8,
"created_at": 1708632000
}
TTL: 90 seconds
Driver Locations (Redis Geo or custom)
Key: drivers:{city}
Type: Redis GeoSet
Members: driver_id with (lat, lng)
// Or: in-memory geospatial index in the Surge Computation Service
// Updated from driver location stream
Historical Surge (ClickHouse)
Table: surge_history
zone_id | string
timestamp | datetime
multiplier | float
supply | int
demand | int
ratio | float
city | string
-- Partitioned by date, ordered by (zone_id, timestamp)
-- Used for analytics, A/B testing analysis, and ML model training
Why These Choices
- Redis for real-time state: sub-millisecond reads, 100K lookups/sec trivially. TTL-based expiry as a safety net.
- H3 hexagons (Uber’s own library): uniform area (unlike lat/lng grids), hierarchical (can aggregate to coarser resolution), well-supported.
- ClickHouse for historical: columnar storage, fast aggregation for surge analytics.
5. High-Level Design (12 min)
Architecture
Driver App → Location Service → Kafka (driver-locations topic)
│
Rider App → Ride Service → Kafka (ride-requests topic)
│
┌──────────▼──────────────┐
│ Surge Computation │
│ Service │
│ (runs every 30 seconds) │
│ │
│ For each zone: │
│ 1. Count supply │
│ 2. Count demand │
│ 3. Compute ratio │
│ 4. Apply formula │
│ 5. Apply smoothing │
│ 6. Apply caps │
│ 7. Write to Redis │
└──────────┬──────────────┘
│
┌────▼────┐
│ Redis │
│ (surge │
│ state) │
└────┬────┘
│
Ride Request → Ride Service → GET surge from Redis → return to rider
Surge Computation Pipeline (detailed)
Every 30 seconds, for each zone:
Step 1: Count Supply
Query driver location index: how many drivers with status='available'
are within zone_id's H3 hexagon?
Implementation: Flink job consuming driver-locations topic
Maintains in-memory H3-indexed driver set
On location update: re-index driver into its new H3 cell
Supply query: count(drivers in cell where status = 'available')
Step 2: Count Demand
Flink job consuming ride-requests topic
Sliding window: count requests per zone in last 2 minutes
(Weighted: recent requests weighted more than 2-min-old ones)
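The weighting scheme isn't specified above; one simple option is linear decay over the window, sketched here (the decay function is an illustrative assumption):

```python
def weighted_demand(request_ages_s, window_s=120):
    """Count ride requests in the sliding window, weighting recent ones more.
    Linear decay: a request `age` seconds old contributes (1 - age/window),
    so a just-arrived request counts 1.0 and a nearly-expired one counts ~0."""
    return sum(1 - age / window_s for age in request_ages_s if age < window_s)
```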
Step 3: Compute Supply-Demand Ratio
ratio = supply / max(demand, 1) // avoid division by zero
Step 4: Apply Surge Formula
if ratio >= 1.0: multiplier = 1.0 // supply exceeds demand, no surge
if ratio < 1.0:
// Inverse relationship: less supply relative to demand = higher surge
raw_multiplier = 1.0 / ratio
// Example: supply=20, demand=60 → ratio=0.33 → raw_multiplier=3.0
Step 5: Apply Smoothing (prevent oscillation)
new_multiplier = α × raw_multiplier + (1 - α) × previous_multiplier
where α = 0.3 (smoothing factor)
Example: previous=1.5, raw=3.0, α=0.3
new = 0.3 × 3.0 + 0.7 × 1.5 = 0.9 + 1.05 = 1.95
Surge rises gradually from 1.5 to 1.95, not jumping to 3.0
Step 6: Apply Caps and Rules
multiplier = max(min_multiplier, min(multiplier, max_multiplier))
// min = 1.0, max = 5.0 (configurable per city)
// Emergency override
if city.emergency_mode: multiplier = 1.0
// Regulatory cap (e.g., some cities cap at 2.0)
if city.regulatory_cap: multiplier = min(multiplier, city.regulatory_cap)
Step 7: Spatial Smoothing (optional)
// Prevent sharp boundaries between adjacent zones
smoothed = 0.6 × zone_multiplier + 0.4 × avg(neighbor_multipliers)
// Rider at a zone boundary shouldn't see 1.0 vs 3.0 by moving 50 meters
Step 8: Write to Redis
HSET surge:{zone_id} multiplier {value} supply {n} demand {n} updated_at {ts}
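Steps 3-6 condense to a few lines; a sketch using the formula and parameter values from above:

```python
def compute_multiplier(supply, demand, previous, alpha=0.3,
                       min_mult=1.0, max_mult=5.0,
                       emergency_mode=False, regulatory_cap=None):
    """One zone's recompute: ratio -> raw surge -> EMA smoothing -> caps."""
    ratio = supply / max(demand, 1)                  # Step 3: avoid division by zero
    raw = 1.0 if ratio >= 1.0 else 1.0 / ratio       # Step 4: inverse relationship
    smoothed = alpha * raw + (1 - alpha) * previous  # Step 5: EMA smoothing
    capped = max(min_mult, min(smoothed, max_mult))  # Step 6: floor and ceiling
    if regulatory_cap is not None:
        capped = min(capped, regulatory_cap)
    if emergency_mode:
        capped = 1.0                                 # emergency override wins
    return capped
```

With the worked example from Step 5 (supply=20, demand=60, previous=1.5, α=0.3): raw = 3.0, smoothed result = 1.95.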
Components
- Location Service: Receives driver GPS updates, validates, publishes to Kafka. High-throughput (1.25M/sec).
- Ride Service: Handles ride requests. On request, looks up surge from Redis, returns price estimate. On confirm, locks surge_id.
- Surge Computation Service (Flink): Stateful stream processor. Maintains real-time supply/demand counts per zone. Recomputes multipliers every 30 seconds. Writes to Redis.
- Redis Cluster: Stores current surge state per zone. Sub-millisecond reads for 100K ride requests/sec.
- Config Service: Manages per-city surge parameters (max multiplier, smoothing factor, emergency overrides). Changes propagated to computation service in real-time.
- Analytics Pipeline: Writes surge history to ClickHouse for post-hoc analysis, A/B testing, and ML model training.
6. Deep Dives (15 min)
Deep Dive 1: H3 Hexagonal Grid — Why and How
Why not latitude/longitude grid squares?
- Grid squares have non-uniform areas (a 0.01° × 0.01° cell at the equator is ~1.2 km²; at 60°N it shrinks to ~0.6 km², since east-west width scales with cos(latitude))
- Square grids have two types of neighbors (edge-adjacent and corner-adjacent) — uneven neighbor relationships
- Hexagons have uniform area and exactly 6 equidistant neighbors
H3 Indexing System (developed by Uber):
Resolutions:
Level 7: ~5.16 km² per hexagon (good for city-level surge)
Level 8: ~0.74 km² per hexagon (neighborhood-level)
Level 9: ~0.11 km² per hexagon (block-level, too granular for surge)
Typical choice: Resolution 7 for surge pricing
San Francisco: ~120 km² → ~23 hexagons
New York City: ~783 km² → ~152 hexagons
Global: 50K hexagons across all served cities
Converting lat/lng to H3:
h3_index = h3.latlng_to_cell(lat, lng, 7) // resolution 7
// Returns: "872830828ffffff"
// O(1) computation — just bit manipulation
Finding neighbors:
neighbors = h3.grid_ring(h3_index, k=1) // 6 neighbors at distance 1
// Used for spatial smoothing of surge multipliers
Hierarchical aggregation:
If resolution 7 zones are too granular (low data density in suburb):
→ Aggregate to resolution 6 (7× larger): h3.cell_to_parent(cell, 6)
→ Compute surge at coarser granularity for sparse areas
Adaptive resolution:
Urban core: resolution 8 (high driver/rider density → fine granularity)
Suburbs: resolution 7
Rural: resolution 6 or no surge at all
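Adaptive resolution can be a simple density-driven lookup; the thresholds below are illustrative assumptions, not Uber's actual values. Cells that land at a fine resolution but have sparse data are then aggregated upward with h3.cell_to_parent as described above:

```python
def pick_resolution(drivers_per_km2):
    """Choose an H3 resolution by driver density (thresholds are illustrative)."""
    if drivers_per_km2 >= 50:
        return 8   # urban core: ~0.74 km2 cells, fine granularity
    if drivers_per_km2 >= 5:
        return 7   # suburbs: ~5.16 km2 cells, the default for surge
    return 6       # rural: coarse cells (or skip surge entirely)
```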
Deep Dive 2: Surge Multiplier Stability — Smoothing and Dampening
The oscillation problem:
t=0: supply=50, demand=100, ratio=0.5, multiplier=2.0
→ Riders see 2.0× surge, some cancel → demand drops
→ Drivers rush to surge zone → supply increases
t=30s: supply=90, demand=40, ratio=2.25, multiplier=1.0
→ Low surge attracts riders back, drivers leave
t=60s: supply=30, demand=120, ratio=0.25, multiplier=4.0
→ Wild oscillation: 2.0 → 1.0 → 4.0
This is a classic feedback loop / control theory problem.
Solution 1: Exponential Moving Average (EMA)
smoothed_multiplier(t) = α × raw(t) + (1 - α) × smoothed(t - 1)
α = 0.3 (slow response, stable):
t=0: raw=2.0, smoothed = 0.3(2.0) + 0.7(1.0) = 1.3
t=30: raw=1.0, smoothed = 0.3(1.0) + 0.7(1.3) = 1.21
t=60: raw=4.0, smoothed = 0.3(4.0) + 0.7(1.21) = 2.05
Result: instead of 2.0 → 1.0 → 4.0, we get 1.3 → 1.21 → 2.05
Much more stable, but responds slowly to genuine demand spikes.
Solution 2: Asymmetric smoothing
// Surge should rise faster than it falls
// Rising = responding to demand (important for driver incentive)
// Falling = returning to normal (should be gradual to prevent re-oscillation)
α_up = 0.5 (faster rise)
α_down = 0.15 (slower fall)
if raw(t) > smoothed(t-1):
smoothed(t) = α_up × raw(t) + (1 - α_up) × smoothed(t-1)
else:
smoothed(t) = α_down × raw(t) + (1 - α_down) × smoothed(t-1)
Solution 3: Minimum duration (hysteresis)
Once surge rises above a threshold, it stays there for at least N minutes.
Prevents fluttering between 1.0× and 1.5× every 30 seconds.
if multiplier(t) > 1.5 and multiplier(t-1) < 1.5:
surge_start_time = t
// Cannot drop below 1.5 until t + 5 minutes
Solution 4: Step-function discretization
Instead of continuous values (1.0, 1.1, 1.2, ...):
Allowed surge levels: [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 4.0, 5.0]
Snap to nearest allowed level.
This prevents visually confusing changes like 1.7× → 1.8× → 1.7×
Users see fewer, more meaningful changes.
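Solutions 2 and 4 are each a few lines; a sketch combining asymmetric smoothing with snap-to-level discretization, using the α values and level list from above:

```python
SURGE_LEVELS = [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 4.0, 5.0]

def asymmetric_ema(raw, previous, alpha_up=0.5, alpha_down=0.15):
    """Solution 2: surge rises faster than it falls."""
    alpha = alpha_up if raw > previous else alpha_down
    return alpha * raw + (1 - alpha) * previous

def snap_to_level(multiplier, levels=SURGE_LEVELS):
    """Solution 4: discretize to the nearest allowed surge level."""
    return min(levels, key=lambda lv: abs(lv - multiplier))
```

A demand spike from 1.5× toward a raw 3.0× moves the smoothed value to 2.25 in one step (fast rise), while a drop from 2.0× toward raw 1.0× only falls to ~1.85 (slow decay).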
Deep Dive 3: Fairness, Gaming Prevention, and A/B Testing
Fairness concerns:
Problem 1: Low-income neighborhoods might see persistent surge due to fewer drivers.
→ Monitor surge frequency by zone. Flag zones with surge > 1.5× more than 40% of the time.
→ Possible mitigations: driver incentive bonuses for underserved zones, reduced max multiplier.
Problem 2: Surge during emergencies (natural disasters, mass shootings).
→ Emergency mode: cap multiplier at 1.0 for affected zones.
→ Triggered manually by ops or automatically via emergency API feeds.
→ Uber actually implemented this after backlash during Hurricane Sandy.
Problem 3: Riders game surge by moving their pin to an adjacent non-surge zone.
→ Use pickup location for billing, not pin location.
→ Or: compute surge at both pin and actual pickup, use the maximum.
A/B testing surge parameters:
Goal: Test whether α=0.3 or α=0.5 produces better revenue and rider satisfaction.
Setup:
Treatment A: zones with even H3 indices use α=0.3
Treatment B: zones with odd H3 indices use α=0.5
(Ensures geographic interleaving — reduces city-level confounders. Caveat: adjacent zones in different arms can contaminate each other as drivers move between them, so interference must be accounted for in the analysis.)
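A common alternative to even/odd index parity is a salted hash split, which stays deterministic per zone but keeps assignments independent across experiments (the experiment name and 50/50 split here are illustrative):

```python
import hashlib

def assign_treatment(zone_id, experiment="smoothing_alpha_v1"):
    """Deterministic 50/50 split: hash the zone and experiment name together.
    Salting with the experiment name decorrelates assignments across tests."""
    digest = hashlib.sha256(f"{experiment}:{zone_id}".encode()).digest()
    return "A" if digest[0] % 2 == 0 else "B"

# Treatment A uses alpha=0.3, treatment B uses alpha=0.5
alpha = {"A": 0.3, "B": 0.5}[assign_treatment("h3_872830828ffffff")]
```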
Metrics:
- Revenue per ride (higher surge = more revenue but fewer rides)
- Ride completion rate (too much surge = riders cancel)
- Driver earnings (surge should attract drivers to high-demand areas)
- ETA (surge should reduce wait time by attracting supply)
- Rider satisfaction (post-ride survey, 1-5 stars)
Duration: 2-4 weeks per city (need enough data for statistical significance)
Guardrails:
- If treatment group ride completions drop > 5%, auto-revert
- Revenue difference > 10% → alert team for review
Price lock guarantee:
When rider sees surge 2.0× and taps "confirm":
1. Server creates surge_lock in Redis: { surge_id, multiplier: 2.0, expires: +90s }
2. Rider has 90 seconds to complete booking
3. On ride creation: read surge_lock, apply multiplier 2.0 regardless of current surge
4. After 90 seconds: lock expires, rider must re-check current surge
Why: Prevents scenario where surge jumps from 2.0 to 3.5 between
the rider seeing the price and tapping confirm (3-5 seconds)
→ Would feel like bait-and-switch
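The lock lifecycle above, sketched with an in-memory stand-in for Redis (production would use SET with an EX TTL, letting Redis expire locks automatically):

```python
import time

class SurgeLockStore:
    """In-memory stand-in for the Redis surge_lock:{surge_id} keys (90 s TTL)."""

    def __init__(self, ttl=90):
        self.ttl = ttl
        self._locks = {}

    def create(self, surge_id, zone_id, multiplier, now=None):
        """Step 1: record the locked price when the rider sees the quote."""
        if now is None:
            now = time.time()
        self._locks[surge_id] = {"zone_id": zone_id, "multiplier": multiplier,
                                 "expires_at": now + self.ttl}

    def redeem(self, surge_id, now=None):
        """Step 3: on ride creation, return the locked multiplier —
        or None if the lock is missing/expired (rider must re-check surge)."""
        if now is None:
            now = time.time()
        lock = self._locks.get(surge_id)
        if lock is None or now > lock["expires_at"]:
            return None
        return lock["multiplier"]
```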
7. Extensions (2 min)
- Predictive surge: ML model trained on historical data predicts surge 15-30 minutes ahead. Inputs: time of day, day of week, weather, local events, real-time supply trends. Use prediction to pre-position drivers (suggest drivers move toward areas where surge is expected).
- Surge-aware routing: When computing ride ETAs and routes, factor in surge in nearby zones. A driver completing a trip near a surge zone might be routed slightly toward it for the next pickup, increasing effective supply.
- Multi-product surge: Different multipliers for UberX, UberXL, UberBlack. Luxury tiers may have different supply-demand dynamics. Compute per-(zone, product) multipliers.
- Rider surge tolerance profiles: Some riders always accept surge; others cancel at 1.5×. Use historical acceptance data to personalize the surge UI (e.g., show “prices are higher than usual” vs exact multiplier). Note: personalized pricing is legally sensitive in many jurisdictions.
- Driver incentive integration: Instead of only charging riders more, offer temporary driver bonuses in high-demand zones (“Earn 2× in downtown for the next 30 minutes”). This increases supply-side response without always raising rider prices.