1. Requirements & Scope (5 min)
Functional Requirements
- Given an origin and destination, compute the estimated time of arrival (ETA) accounting for current traffic conditions
- Support multiple routing options (fastest, shortest, avoid tolls, avoid highways) and return ETAs for each
- Incorporate real-time traffic data (GPS probe data from active drivers, incident reports, road closures) to adjust ETAs continuously
- Provide ETA updates during an active trip (re-estimate as conditions change or the driver deviates from the route)
- Return confidence intervals with ETAs (e.g., “18-24 minutes” not just “21 minutes”) to communicate uncertainty
Non-Functional Requirements
- Availability: 99.99% — ETA is shown on every ride request, every search, every navigation step. It’s the most-queried service.
- Latency: ETA computation < 200ms for a single origin-destination pair. Batch ETA (nearby drivers to rider) < 500ms for 50 candidates.
- Accuracy: Mean Absolute Error (MAE) < 15% of actual trip time. P90 error < 25%. An honest, well-calibrated estimate matters more than a falsely precise one.
- Scale: 500K ETA requests/sec at peak (every map interaction, every ride match, every navigation step). Road network: 50M road segments globally.
- Freshness: Traffic data reflected in ETAs within 2 minutes of observation. Stale traffic = wrong ETAs = bad routing.
2. Estimation (3 min)
Traffic
- 500K ETA requests/sec at peak
- Each ETA request requires a graph search over a road network subgraph
- Average route: ~200 road segments explored (varies by distance)
- Total road segments processed: 500K × 200 = 100M segment lookups/sec
Road Network Size
- Global road network: ~50M road segments, ~40M intersections
- Graph representation: each segment = (from_node, to_node, distance, base_travel_time, current_speed)
- Storage: 50M segments × 100 bytes = 5 GB — fits in memory on a single server
- With precomputed contraction hierarchy: add ~10 GB for shortcut edges → 15 GB total
Traffic Data
- GPS probes from active drivers: 5M drivers × 1 update/4 seconds = 1.25M probe points/sec
- Each probe: (lat, lng, speed, heading, timestamp) = ~40 bytes
- Probe ingestion bandwidth: 1.25M × 40B = 50 MB/sec
- Map-matched probes per segment per minute: varies (highways: hundreds, residential: 0-5)
ML Model Inference
- If using ML for ETA prediction (not just graph routing):
- Feature extraction: ~50 features per route (distance, segment speeds, time of day, weather, etc.)
- Model inference: ~5ms per prediction (on GPU) or ~20ms (on CPU)
- At 500K requests/sec × 5ms per inference: 2,500 GPU-seconds of compute per second → ~250 GPU instances, feasible only with batching (~10 predictions per inference call)
- Or: precompute and cache ETAs for common routes; ML inference only for cache misses
Key Insight
This is a graph search problem enhanced by real-time data and ML. Pure Dijkstra on 50M segments is too slow for 200ms SLA. We need hierarchical precomputation (Contraction Hierarchies or similar) to reduce search space by 100-1000×, combined with real-time traffic edge weights to keep ETAs accurate.
3. API Design (3 min)
Single Route ETA
GET /eta?origin=37.7749,-122.4194
&destination=37.7849,-122.4094
&departure_time=now
&routing_preference=fastest
&alternatives=true
Response 200: {
"routes": [
{
"route_id": "route_abc",
"eta_seconds": 1260,
"eta_range": { "min": 1080, "max": 1500 }, // confidence interval
"distance_meters": 8500,
"departure_time": "2024-02-22T14:30:00Z",
"arrival_time": "2024-02-22T14:51:00Z",
"segments": [
{ "road": "Market St", "distance_m": 1200, "duration_s": 180, "speed_kmh": 24, "traffic": "moderate" },
{ "road": "US-101 N", "distance_m": 5000, "duration_s": 240, "speed_kmh": 75, "traffic": "light" },
...
],
"traffic_summary": "Moderate traffic on Market St, light elsewhere"
},
{ "route_id": "route_def", ... } // alternative route
]
}
Batch ETA (for ride matching: many drivers to one rider)
POST /eta/batch
Body: {
"destination": { "lat": 37.7849, "lng": -122.4094 },
"origins": [
{ "id": "driver_1", "lat": 37.7749, "lng": -122.4194 },
{ "id": "driver_2", "lat": 37.7800, "lng": -122.4150 },
... // up to 50 origins
],
"departure_time": "now"
}
Response 200: {
"etas": [
{ "id": "driver_1", "eta_seconds": 420, "distance_meters": 2100 },
{ "id": "driver_2", "eta_seconds": 180, "distance_meters": 900 },
...
]
}
Live Trip ETA Update
GET /eta/trip/{trip_id}
Response 200: {
"trip_id": "trip_789",
"current_position": { "lat": 37.7780, "lng": -122.4160 },
"remaining_eta_seconds": 780,
"remaining_distance_meters": 5200,
"original_eta_seconds": 1260,
"eta_change": -180, // 3 minutes faster than original
"traffic_ahead": "clearing"
}
Key Decisions
- Confidence intervals instead of point estimates — reduces user frustration when ETA is off by 3 minutes
- Batch ETA is critical for ride matching — 50 ETAs must come back in < 500ms, a 10ms-per-ETA budget met by running the graph searches in parallel
- Alternatives returned by default — let users choose (fastest vs avoiding highways)
- Traffic level per segment — helps users understand why ETA is high (“heavy traffic on I-280”)
4. Data Model (3 min)
Road Network Graph (in-memory)
Structure: adjacency list
Node (intersection):
node_id | int64
lat | float
lng | float
level | int // contraction hierarchy level
Edge (road segment):
edge_id | int64
from_node | int64
to_node | int64
distance_meters | int
base_travel_time_ms | int // travel time at speed limit, no traffic
road_class | enum // highway, arterial, residential, etc.
speed_limit_kmh | int
is_toll | boolean
is_oneway | boolean
shortcut_edges | array<int64> // for Contraction Hierarchies
Real-Time Traffic Layer (Redis + in-memory)
Key: traffic:{edge_id}
Value: {
"current_speed_kmh": 35,
"free_flow_speed_kmh": 65,
"congestion_ratio": 0.54, // current/freeflow (< 1 means congestion)
"updated_at": 1708632000,
"probe_count": 12, // how many GPS probes contributed
"confidence": 0.85 // higher with more probes
}
TTL: 300 seconds (fall back to historical if no recent data)
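A minimal sketch of the traffic-layer lookup with the TTL fallback, using a plain dict to stand in for Redis. The key format and fields mirror the schema above; the helper name `get_edge_traffic` is illustrative, not a real API:

```python
import json
import time

TTL_SECONDS = 300  # matches the Redis TTL above

def get_edge_traffic(store, edge_id, now=None):
    """Return the live traffic record for an edge, or None if missing/stale."""
    now = now or time.time()
    raw = store.get(f"traffic:{edge_id}")
    if raw is None:
        return None  # no recent probes -> caller falls back to historical
    record = json.loads(raw)
    if now - record["updated_at"] > TTL_SECONDS:
        return None  # stale entry -> treat as missing
    return record

store = {
    "traffic:42": json.dumps({
        "current_speed_kmh": 35, "free_flow_speed_kmh": 65,
        "congestion_ratio": 0.54, "updated_at": int(time.time()),
        "probe_count": 12, "confidence": 0.85,
    })
}
live = get_edge_traffic(store, 42)      # fresh entry -> record returned
missing = get_edge_traffic(store, 99)   # unknown edge -> None
```

With real Redis the TTL check is redundant (the key expires server-side), but keeping it client-side guards against clock-skewed writers.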
Historical Traffic Patterns (precomputed, loaded on startup)
Table: historical_speeds (Parquet files, loaded into memory)
edge_id | int64
day_of_week | int // 0=Monday, 6=Sunday
time_bucket | int // 5-minute bucket (0-287)
avg_speed_kmh | float
p25_speed_kmh | float
p75_speed_kmh | float
sample_count | int
// 50M edges × 7 days × 288 buckets × 20 bytes = ~2TB
// Too large for memory → store only top 5M most-used edges (highways, arterials)
// 5M × 7 × 288 × 20B = ~200GB → partition across 20 servers (10GB each)
ML Feature Store (Redis)
Key: ml_features:{route_hash}
Value: {
"distance": 8500,
"num_segments": 42,
"num_turns": 12,
"pct_highway": 0.58,
"time_of_day_bucket": 58, // 5-minute bucket
"day_of_week": 4, // Friday (0=Monday, matching the historical table)
"weather_code": 2, // light rain
"avg_segment_congestion": 0.72,
"historical_mean_eta": 1150,
"historical_std_eta": 180
}
TTL: 60 seconds (recomputed frequently)
Why These Choices
- In-memory graph: Road network must be in memory for sub-millisecond segment lookups during graph search. 15 GB is trivially cacheable.
- Redis for traffic: Shared across all routing instances. Sub-millisecond per-segment speed lookup. TTL ensures stale data auto-expires.
- Parquet for historical: Columnar format, efficient for loading specific edge × time bucket slices.
5. High-Level Design (12 min)
Architecture
GPS Probes (from driver apps)
→ Probe Ingestion Service
→ Kafka (raw-probes topic)
→ Map Matching Pipeline (Flink):
Raw (lat,lng) → snapped to road segment
→ Compute speed per segment
→ Aggregate: per-edge speed (median of last 5 min of probes)
→ Write to Redis (traffic layer)
ETA Request Flow:
Client → API Gateway → ETA Service:
1. Snap origin/destination to nearest nodes (KD-tree lookup)
2. Query traffic layer (Redis) for edge weights
3. Run graph search (Contraction Hierarchies with traffic weights)
4. Optional: ML model refinement of graph-based ETA
5. Compute confidence interval
6. Return ETA + route
ETA Computation Pipeline
Step 1: Geocode & Snap
Origin (lat, lng) → nearest graph node (KD-tree, < 1ms)
Destination (lat, lng) → nearest graph node
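A toy sketch of the snap step. Production would query a KD-tree over all graph nodes; a linear scan over a made-up node list shows the same contract (node ids and coordinates are illustrative):

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def snap_to_node(nodes, lat, lng):
    """Return the node_id closest to (lat, lng). nodes: [(id, lat, lng), ...]"""
    return min(nodes, key=lambda n: haversine_m(lat, lng, n[1], n[2]))[0]

nodes = [
    (1, 37.7749, -122.4194),
    (2, 37.7849, -122.4094),
    (3, 37.8000, -122.4000),
]
origin_node = snap_to_node(nodes, 37.7750, -122.4190)  # -> 1 (tens of meters away)
```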
Step 2: Compute Edge Weights
For each edge in the graph:
if real-time traffic available (probe_count > 3 in last 5 min):
weight = distance / current_speed
elif historical data available:
weight = distance / historical_speed[day_of_week][time_bucket]
else:
weight = (distance / free_flow_speed) × 1.2 (20% congestion buffer)
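The fallback chain above as a runnable function. Parameter names are illustrative; `live` is a traffic record shaped like the Redis value from the data model:

```python
def edge_weight_seconds(distance_m, live=None, historical_kmh=None,
                        free_flow_kmh=50):
    """Travel-time weight for one edge, in seconds, per the fallback order."""
    if live is not None and live.get("probe_count", 0) > 3:
        speed_kmh = live["current_speed_kmh"]       # real-time probes win
    elif historical_kmh is not None:
        speed_kmh = historical_kmh                  # historical profile next
    else:
        speed_kmh = free_flow_kmh / 1.2             # 20% congestion buffer
    return distance_m / (speed_kmh / 3.6)           # km/h -> m/s

# 500 m segment with live speed 36 km/h (10 m/s) -> 50 s
w = edge_weight_seconds(500, live={"current_speed_kmh": 36, "probe_count": 8})
```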
Step 3: Graph Search
Use Contraction Hierarchies (CH) bidirectional Dijkstra:
Forward search from origin (upward in hierarchy)
Backward search from destination (upward in hierarchy)
Meet in the middle at a high-level node
Unpack shortcut edges to reconstruct full path
CH reduces search space from ~1M nodes to ~1000 nodes
Search time: < 5ms for city-scale routes
Step 4: ML Refinement (optional, for high-value routes)
Features: graph-based ETA, route characteristics, time/weather
Model: gradient-boosted tree (XGBoost) or neural network
Output: adjusted ETA (corrects systematic biases in graph ETA)
Graph ETA might say 21 min, but ML knows that this corridor
at 5 PM on Fridays typically takes 10% longer → adjusted ETA = 23 min
Step 5: Confidence Interval
Based on historical variance for this route type:
std_dev = f(distance, road_class_mix, time_of_day)
eta_range = [eta - 1.5 × std_dev, eta + 1.5 × std_dev]
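A heuristic sketch of the interval step. The std-dev model below is a placeholder; a real system fits f(distance, road_class_mix, time_of_day) from completed-trip history:

```python
def eta_interval(eta_s, congested_fraction):
    """Widen the interval for longer and more congested routes (heuristic)."""
    std_dev = 0.05 * eta_s + 0.1 * eta_s * congested_fraction
    lo = max(0, eta_s - 1.5 * std_dev)
    hi = eta_s + 1.5 * std_dev
    return round(lo), round(hi)

lo, hi = eta_interval(1260, congested_fraction=0.3)  # -> (1109, 1411)
```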
Components
- Probe Ingestion Service: Receives GPS updates from drivers. High-throughput (1.25M/sec). Publishes to Kafka.
- Map Matching Pipeline (Flink): Matches raw GPS coordinates to road segments using Hidden Markov Model (HMM). Computes per-segment speeds. Aggregates over 5-minute windows.
- Traffic Service (Redis): Stores current speed per edge. Read by all ETA Service instances.
- ETA Service: Core computation. Loads road graph + CH index into memory. Queries Redis for traffic. Runs graph search. Optionally invokes ML model. Returns ETA.
- ML Model Service: Serves ETA prediction model. Feature extraction from route characteristics + traffic. Deployed as a sidecar or separate gRPC service.
- Historical Data Service: Serves precomputed historical speed profiles. Used as fallback when real-time probe data is sparse.
- Monitoring: ETA accuracy tracking (predicted vs actual trip time), traffic data freshness, probe coverage heatmap.
Deployment
Per region:
ETA Service: 100+ instances (each with full road graph in memory, ~15GB)
Redis Cluster: 10 shards for traffic data
Flink Cluster: map matching + traffic aggregation
ML Model: 50 GPU instances (or CPU with model optimization)
6. Deep Dives (15 min)
Deep Dive 1: Contraction Hierarchies — Why Graph Search is Fast
Problem: Dijkstra on a 50M-edge graph takes ~500ms. We need < 5ms.
Contraction Hierarchies (CH):
Preprocessing (offline, takes hours):
1. Assign importance to each node (heuristic: road class, edge count)
Highway nodes = high importance, residential = low
2. Iteratively "contract" the least important node:
- Remove node v from graph
- For each pair (u, w) of v's neighbors:
if the shortest path u→w goes through v (no equally short "witness" path avoids it):
add shortcut edge u→w with weight = weight(u→v) + weight(v→w)
3. Result: hierarchical graph where
- Bottom: residential streets
- Middle: arterial roads
- Top: highways and major intersections
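The contraction step above can be sketched for a single node. The graph is a dict `{node: {neighbor: weight}}` with symmetric edges; a real implementation would also order nodes by importance and iterate to build the full hierarchy:

```python
import heapq

def dijkstra(graph, src, dst, skip, limit):
    """Shortest src->dst distance ignoring node `skip`; prune beyond `limit`."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")) or d > limit:
            continue
        for v, w in graph.get(u, {}).items():
            if v == skip:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

def contract(graph, v):
    """Add the shortcut edges needed once v is removed; return them."""
    shortcuts = []
    neighbors = list(graph.get(v, {}))
    for u in neighbors:
        for w in neighbors:
            if u == w:
                continue
            via_v = graph[u][v] + graph[v][w]
            # Witness search: is there a u->w path avoiding v that is as short?
            if dijkstra(graph, u, w, skip=v, limit=via_v) > via_v:
                graph[u][w] = min(graph[u].get(w, float("inf")), via_v)
                shortcuts.append((u, w, via_v))
    return shortcuts

# Toy graph: fast a-v-b chain plus a slow a-c-b detour
g = {"a": {"v": 1, "c": 5}, "v": {"a": 1, "b": 1},
     "b": {"v": 1, "c": 5}, "c": {"a": 5, "b": 5}}
added = contract(g, "v")  # shortcuts a->b and b->a, weight 2 each
```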
Query (online, < 5ms):
Bidirectional search:
Forward from origin: explore only edges going UP in hierarchy
Backward from destination: explore only edges going UP in hierarchy
Meet at the highest common node
Why this works:
Any shortest path goes "up" to highways, travels along, then goes "down"
By only searching upward from both ends, we explore ~1000 nodes instead of ~1M
Path unpacking:
Shortcut edges are expanded recursively to reconstruct the actual road-by-road path
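Unpacking is a small recursion if each shortcut remembers the node it bypassed. The `middle` map (shortcut edge → contracted node) is an illustrative structure, not a fixed format:

```python
def unpack(u, w, middle):
    """Expand edge (u, w) into the underlying road-by-road node sequence."""
    v = middle.get((u, w))
    if v is None:                  # original edge, nothing to expand
        return [u, w]
    left = unpack(u, v, middle)    # recurse into both halves of the shortcut
    right = unpack(v, w, middle)
    return left + right[1:]        # drop the duplicated middle node

# Shortcut a->d bypassed c; shortcut a->c in turn bypassed b
middle = {("a", "d"): "c", ("a", "c"): "b"}
path = unpack("a", "d", middle)    # -> ["a", "b", "c", "d"]
```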
Performance comparison:
| Algorithm | Nodes Explored | Query Time |
|---|---|---|
| Dijkstra | ~1,000,000 | ~500ms |
| A* | ~300,000 | ~150ms |
| CH | ~1,000 | ~2-5ms |
| CH + traffic | ~2,000 | ~5-10ms |
CH with dynamic traffic weights (the hard part):
Standard CH assumes static edge weights (precomputed).
With traffic, weights change every few minutes.
Approach 1: Full CH recomputation every 5 minutes
Too slow — CH preprocessing takes 30+ minutes for a city
Approach 2: Customizable CH (CCH)
Preprocessing: compute CH structure (topology only) offline
Runtime: update edge weights without changing structure
Reweight every 1-2 minutes → takes ~2 seconds for a city
Query time: ~10ms (slightly slower than static CH)
Approach 3: A* with landmarks (ALT) as fallback
When traffic changes invalidate CH shortcuts, fall back to A*
with landmark-based heuristic (15-30ms, still fast enough)
Recommendation: CCH (Customizable Contraction Hierarchies)
- Topology precomputed once (offline, takes hours)
- Metric (edge weights) updated every minute (takes seconds)
- Query time < 10ms even with dynamic traffic
Deep Dive 2: Real-Time Traffic from GPS Probes
Pipeline: Raw GPS → Road Segment Speeds
Step 1: Map Matching (GPS → road segment)
Problem: GPS is noisy (±10m). A point at (37.775, -122.419) could be on
Market St, Mission St, or an alley between them.
Solution: Hidden Markov Model (HMM) map matching
States: road segments within 50m of GPS point
Emission probability: P(GPS point | on segment) ∝ Gaussian(distance)
Transition probability: P(segment_j at t+1 | segment_i at t)
∝ exp(-|route_distance - GPS_distance| / β)
(prefer transitions where driving distance matches GPS displacement)
Viterbi algorithm: find most likely sequence of segments
Input: [(37.775, -122.419, t=0), (37.776, -122.418, t=4s), ...]
Output: [segment_1234 at t=0, segment_1234 at t=4s, segment_1235 at t=8s, ...]
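A toy Viterbi pass over two candidate segments per GPS fix. The Gaussian emission and constant transition preference are crude stand-ins for the probabilities above; all names and distances are made up to show the shape of the computation, not a real map matcher:

```python
import math

SIGMA = 10.0  # GPS noise, meters

def emission(dist_m):
    return math.exp(-0.5 * (dist_m / SIGMA) ** 2)

def transition(seg_a, seg_b):
    return 1.0 if seg_a == seg_b else 0.3  # crude stand-in for the route check

def viterbi(candidates):
    """candidates: per GPS fix, a dict {segment_id: distance_to_gps_m}."""
    scores = {s: math.log(emission(d)) for s, d in candidates[0].items()}
    back = []
    for obs in candidates[1:]:
        new_scores, pointers = {}, {}
        for s, d in obs.items():
            prev = max(scores, key=lambda p: scores[p] + math.log(transition(p, s)))
            new_scores[s] = (scores[prev] + math.log(transition(prev, s))
                             + math.log(emission(d)))
            pointers[s] = prev
        scores, back = new_scores, back + [pointers]
    seq = [max(scores, key=scores.get)]     # backtrack from best final segment
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))

fixes = [
    {"market_st": 4.0, "mission_st": 18.0},
    {"market_st": 6.0, "mission_st": 15.0},
    {"market_st": 5.0, "mission_st": 20.0},
]
matched = viterbi(fixes)  # -> ["market_st", "market_st", "market_st"]
```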
Step 2: Speed computation per segment
From map-matched trajectory:
Driver traversed segment 1235 (length 200m) in 8 seconds
Speed = 200m / 8s = 25 m/s = 90 km/h
Aggregate over all drivers on segment 1235 in last 5 minutes:
Probe speeds: [88, 90, 85, 92, 87, 91] km/h
Median speed: 89 km/h (median is robust to outliers)
Probe count: 6 (high confidence)
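The aggregation step, runnable as written (field names are illustrative):

```python
import statistics

def aggregate_segment_speed(probe_speeds_kmh):
    """Median speed + probe count for one segment over the last window."""
    return {
        "median_speed_kmh": statistics.median(probe_speeds_kmh),
        "probe_count": len(probe_speeds_kmh),
    }

agg = aggregate_segment_speed([88, 90, 85, 92, 87, 91])  # median 89, count 6
```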
Step 3: Handling sparse data
Problem: Residential streets may have 0 probes in the last 5 minutes.
Fallback hierarchy:
1. Real-time probes (last 5 min, > 3 probes): use median speed
2. Recent probes (last 15 min, > 3 probes): use median speed × 0.95 (slight penalty for staleness)
3. Historical speed profile (same day-of-week, same 5-minute bucket): use historical median
4. Free-flow speed × congestion factor (based on nearby segments): last resort
Nearby segment propagation:
If segment S has no data but adjacent segments A and B show 60% of free-flow:
Estimate S at 60% of free-flow (spatial smoothing)
Weighted by road class similarity
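A minimal sketch of the propagation rule, with road-class weighting simplified to an unweighted average (the function name and signature are illustrative):

```python
def estimate_congestion(neighbor_ratios, free_flow_kmh):
    """Estimate a no-data segment's speed from adjacent congestion ratios."""
    if not neighbor_ratios:
        return None  # nothing to smooth from; caller uses free-flow fallback
    avg_ratio = sum(neighbor_ratios) / len(neighbor_ratios)
    return avg_ratio * free_flow_kmh

# Adjacent segments at 60% of free-flow -> estimate this one at 60% too
speed = estimate_congestion([0.6, 0.6], free_flow_kmh=50)  # -> 30.0
```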
Step 4: Incident detection
Sudden speed drop on a segment (from 60 km/h to 5 km/h) → possible incident
1. Flag segment as "incident suspected"
2. Propagate congestion to upstream segments (traffic backs up)
3. Adjust ETAs for routes through this area
4. If official incident report received → confirm and adjust TTL
Incidents cleared:
When probe speeds recover to > 80% of free-flow → remove incident flag
Gradual recovery: don't immediately jump to full speed (road takes time to clear)
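The flag/clear logic above as a minimal state machine. Thresholds are illustrative:

```python
def update_incident_state(flagged, speed_kmh, free_flow_kmh):
    """Flag on a severe drop; clear once speed recovers past 80% of free-flow."""
    ratio = speed_kmh / free_flow_kmh
    if not flagged and ratio < 0.15:
        return True    # sudden severe slowdown -> incident suspected
    if flagged and ratio > 0.8:
        return False   # traffic recovered -> remove incident flag
    return flagged     # otherwise keep current state (hysteresis)

s = update_incident_state(False, 5, 60)  # 5 km/h on a 60 km/h road -> True
s = update_incident_state(s, 50, 60)     # recovered past 80% -> False
```

The two thresholds deliberately differ (hysteresis) so the flag doesn't flap as speeds hover near a single cutoff.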
Deep Dive 3: ML-Based ETA Prediction
Why graph-based ETA isn’t enough:
Graph search computes: sum(distance_i / speed_i) for all segments on route
This misses:
- Turn delays (left turns at intersections take 30-60 seconds)
- Traffic signal timing (red lights add 0-120 seconds per intersection)
- Merge/exit delays on highways
- Systematic biases (school zones at 3 PM, stadium events)
- Weather impact (rain → 15% slower, snow → 30% slower)
Graph ETA error: typically 20-30% MAE
ML-corrected ETA error: typically 10-15% MAE (50% reduction)
ML Model Architecture:
Input features (per route):
Route features:
- Total distance (meters)
- Number of segments
- Number of turns (left, right, U-turn)
- Road class distribution (% highway, % arterial, % residential)
- Number of traffic signals (estimated from intersection data)
Traffic features:
- Graph-based ETA (from CH search)
- Average congestion ratio along route
- Max congestion ratio (bottleneck segment)
- Number of congested segments (speed < 50% of free-flow)
Temporal features:
- Hour of day (one-hot or sinusoidal encoding)
- Day of week
- Is holiday
- Minutes since sunrise/sunset
Weather features:
- Temperature, precipitation, visibility
- Road condition (dry, wet, icy)
Historical features:
- Historical mean ETA for this O-D pair at this time
- Historical variance
Model: Gradient-boosted trees (XGBoost / LightGBM)
- 500 trees, max depth 8
- Training data: millions of completed trips with actual duration
- Loss function: Huber loss (robust to outliers)
- Output: predicted ETA in seconds
Inference time: ~2ms on CPU (small model), < 1ms on GPU
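The Huber loss named above, written out to show why it is robust: quadratic near zero, linear in the tails, so one badly mispredicted trip contributes far less to training than under squared error. The 60-second delta is an illustrative choice:

```python
def huber(residual_s, delta=60.0):
    """Huber loss on an ETA residual in seconds."""
    a = abs(residual_s)
    if a <= delta:
        return 0.5 * a * a          # quadratic region (like squared error)
    return delta * (a - delta / 2)  # linear region (outliers capped)

inlier = huber(30)    # 450.0, same as squared-error/2
outlier = huber(600)  # 34200.0, vs 180000.0 under squared-error/2
```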
Alternative: Deep learning (Transformer on segment sequence)
- Encode each segment as a vector (speed, distance, road class)
- Self-attention over segment sequence
- Better for long routes with complex interactions
- Slower inference: ~10ms
- Use for high-value routes (long-distance navigation)
Confidence intervals:
Train a quantile regression model alongside the point estimate:
Model outputs: P10, P50 (median), P90 of ETA
Example: P10 = 16 min, P50 = 21 min, P90 = 28 min
Display: "21 min (16-28 min range)"
Alternatively: predict mean + variance
ETA = 21 min, σ = 4 min
95% CI: [21 - 2×4, 21 + 2×4] = [13, 29] min
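Rendering the quantile outputs in the display format above (function name is illustrative):

```python
def format_eta(p50_s, p10_s, p90_s):
    """Render P10/P50/P90 ETA quantiles (seconds) as a display string."""
    to_min = lambda s: round(s / 60)
    return f"{to_min(p50_s)} min ({to_min(p10_s)}-{to_min(p90_s)} min range)"

msg = format_eta(1260, 960, 1680)  # -> "21 min (16-28 min range)"
```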
Factors that increase uncertainty:
- Longer routes (more segments = more variance)
- Routes through congested areas (traffic is unpredictable)
- Bad weather
- Low probe coverage (less data = less confidence)
7. Extensions (2 min)
- Multi-modal ETA: Combine driving, walking, and public transit. Compute ETAs for “drive to station → take train → walk to destination.” Requires transit schedule integration (GTFS feeds) alongside road network graph.
- Predictive ETA (future departure): “If I leave at 5 PM, how long will it take?” Use historical traffic patterns for the departure time, not current traffic. Blend historical and real-time as departure time approaches.
- ETA for delivery fleets: Optimize for multiple stops (Traveling Salesman variant). Compute arrival time at each stop. Factor in loading/unloading time. Update ETAs as driver completes each stop.
- Offline-capable ETA: Cache road graph and recent traffic on-device for areas with poor connectivity. Compute ETAs locally with degraded accuracy. Sync when connectivity returns.
- ETA accuracy feedback loop: Compare predicted ETAs with actual trip durations in real-time. If a route’s ETA error exceeds 20%, trigger re-calibration of traffic weights for that area. Auto-detect map data errors (road closures, new roads).