1. Requirements & Scope (5 min)
Functional Requirements
- Given an origin and destination, compute the estimated time of arrival (ETA) accounting for current traffic conditions
- Support multiple routing options (fastest, shortest, avoid tolls, avoid highways) and return ETAs for each
- Incorporate real-time traffic data (GPS probe data from active drivers, incident reports, road closures) to adjust ETAs continuously
- Provide ETA updates during an active trip (re-estimate as conditions change or the driver deviates from the route)
- Return confidence intervals with ETAs (e.g., “18-24 minutes” not just “21 minutes”) to communicate uncertainty
Non-Functional Requirements
- Availability: 99.99% — ETA is shown on every ride request, every search, every navigation step. It’s the most-queried service.
- Latency: ETA computation < 200ms for a single origin-destination pair. Batch ETA (nearby drivers to rider) < 500ms for 50 candidates.
- Accuracy: Mean Absolute Error (MAE) < 15% of actual trip time. P90 error < 25%. An honest, well-calibrated estimate matters more than a falsely precise one.
- Scale: 500K ETA requests/sec at peak (every map interaction, every ride match, every navigation step). Road network: 50M road segments globally.
- Freshness: Traffic data reflected in ETAs within 2 minutes of observation. Stale traffic = wrong ETAs = bad routing.
2. Estimation (3 min)
Traffic
- 500K ETA requests/sec at peak
- Each ETA request requires a graph search over a road network subgraph
- Average route: ~200 road segments explored (varies by distance)
- Total road segments processed: 500K × 200 = 100M segment lookups/sec
Road Network Size
- Global road network: ~50M road segments, ~40M intersections
- Graph representation: each segment = (from_node, to_node, distance, base_travel_time, current_speed)
- Storage: 50M segments × 100 bytes = 5 GB — fits in memory on a single server
- With precomputed contraction hierarchy: add ~10 GB for shortcut edges → 15 GB total
Traffic Data
- GPS probes from active drivers: 5M drivers × 1 update/4 seconds = 1.25M probe points/sec
- Each probe: (lat, lng, speed, heading, timestamp) = ~40 bytes
- Probe ingestion bandwidth: 1.25M × 40B = 50 MB/sec
- Map-matched probes per segment per minute: varies (highways: hundreds, residential: 0-5)
ML Model Inference
- If using ML for ETA prediction (not just graph routing):
- Feature extraction: ~50 features per route (distance, segment speeds, time of day, weather, etc.)
- Model inference: ~5ms per prediction (on GPU) or ~20ms (on CPU)
- At 500K requests/sec × 5ms per inference: 2,500 GPU-seconds of compute per second → ~250 GPU instances, feasible only with batching (~10 predictions per inference call)
- Or: precompute and cache ETAs for common routes; ML inference only for cache misses
Key Insight
This is a graph search problem enhanced by real-time data and ML. Pure Dijkstra on 50M segments is too slow for 200ms SLA. We need hierarchical precomputation (Contraction Hierarchies or similar) to reduce search space by 100-1000×, combined with real-time traffic edge weights to keep ETAs accurate.
3. API Design (3 min)
Single Route ETA
GET /eta?origin=37.7749,-122.4194
&destination=37.7849,-122.4094
&departure_time=now
&routing_preference=fastest
&alternatives=true
Response 200: {
"routes": [
{
"route_id": "route_abc",
"eta_seconds": 1260,
"eta_range": { "min": 1080, "max": 1500 }, // confidence interval
"distance_meters": 8500,
"departure_time": "2024-02-22T14:30:00Z",
"arrival_time": "2024-02-22T14:51:00Z",
"segments": [
{ "road": "Market St", "distance_m": 1200, "duration_s": 180, "speed_kmh": 24, "traffic": "moderate" },
{ "road": "US-101 N", "distance_m": 5000, "duration_s": 240, "speed_kmh": 75, "traffic": "light" },
...
],
"traffic_summary": "Moderate traffic on Market St, light elsewhere"
},
{ "route_id": "route_def", ... } // alternative route
]
}
Batch ETA (for ride matching: many drivers to one rider)
POST /eta/batch
Body: {
"destination": { "lat": 37.7849, "lng": -122.4094 },
"origins": [
{ "id": "driver_1", "lat": 37.7749, "lng": -122.4194 },
{ "id": "driver_2", "lat": 37.7800, "lng": -122.4150 },
... // up to 50 origins
],
"departure_time": "now"
}
Response 200: {
"etas": [
{ "id": "driver_1", "eta_seconds": 420, "distance_meters": 2100 },
{ "id": "driver_2", "eta_seconds": 180, "distance_meters": 900 },
...
]
}
Live Trip ETA Update
GET /eta/trip/{trip_id}
Response 200: {
"trip_id": "trip_789",
"current_position": { "lat": 37.7780, "lng": -122.4160 },
"remaining_eta_seconds": 780,
"remaining_distance_meters": 5200,
"original_eta_seconds": 1260,
"eta_change": -180, // 3 minutes faster than original
"traffic_ahead": "clearing"
}
Key Decisions
- Confidence intervals instead of point estimates — reduces user frustration when ETA is off by 3 minutes
- Batch ETA is critical for ride matching — 50 ETAs must come back in < 500ms, a 10ms-per-ETA budget met by running the graph searches in parallel
- Alternatives returned by default — let users choose (fastest vs avoiding highways)
- Traffic level per segment — helps users understand why ETA is high (“heavy traffic on I-280”)
4. Data Model (3 min)
Road Network Graph (in-memory)
Structure: adjacency list
Node (intersection):
node_id | int64
lat | float
lng | float
level | int // contraction hierarchy level
Edge (road segment):
edge_id | int64
from_node | int64
to_node | int64
distance_meters | int
base_travel_time_ms | int // travel time at speed limit, no traffic
road_class | enum // highway, arterial, residential, etc.
speed_limit_kmh | int
is_toll | boolean
is_oneway | boolean
shortcut_edges | array<int64> // for Contraction Hierarchies
Real-Time Traffic Layer (Redis + in-memory)
Key: traffic:{edge_id}
Value: {
"current_speed_kmh": 35,
"free_flow_speed_kmh": 65,
"congestion_ratio": 0.54, // current/freeflow (< 1 means congestion)
"updated_at": 1708632000,
"probe_count": 12, // how many GPS probes contributed
"confidence": 0.85 // higher with more probes
}
TTL: 300 seconds (fall back to historical if no recent data)
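A minimal sketch of the traffic-layer lookup with the TTL fallback, using a plain dict to stand in for Redis. The key format and fields mirror the schema above; the helper name `get_edge_traffic` is illustrative, not a real API:

```python
import json
import time

TTL_SECONDS = 300  # matches the Redis TTL above

def get_edge_traffic(store, edge_id, now=None):
    """Return the live traffic record for an edge, or None if missing/stale."""
    now = now or time.time()
    raw = store.get(f"traffic:{edge_id}")
    if raw is None:
        return None  # no recent probes -> caller falls back to historical
    record = json.loads(raw)
    if now - record["updated_at"] > TTL_SECONDS:
        return None  # stale entry -> treat as missing
    return record

store = {
    "traffic:42": json.dumps({
        "current_speed_kmh": 35, "free_flow_speed_kmh": 65,
        "congestion_ratio": 0.54, "updated_at": int(time.time()),
        "probe_count": 12, "confidence": 0.85,
    })
}
live = get_edge_traffic(store, 42)      # fresh entry -> record returned
missing = get_edge_traffic(store, 99)   # unknown edge -> None
```

With real Redis the TTL check is redundant (the key expires server-side), but keeping it client-side guards against clock-skewed writers.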
Historical Traffic Patterns (precomputed, loaded on startup)
Table: historical_speeds (Parquet files, loaded into memory)
edge_id | int64
day_of_week | int // 0=Monday, 6=Sunday
time_bucket | int // 5-minute bucket (0-287)
avg_speed_kmh | float
p25_speed_kmh | float
p75_speed_kmh | float
sample_count | int
// 50M edges × 7 days × 288 buckets × 20 bytes = ~2TB
// Too large for memory → store only top 5M most-used edges (highways, arterials)
// 5M × 7 × 288 × 20B = ~200GB → partition across 20 servers (10GB each)
ML Feature Store (Redis)
Key: ml_features:{route_hash}
Value: {
"distance": 8500,
"num_segments": 42,
"num_turns": 12,
"pct_highway": 0.58,
"time_of_day_bucket": 58, // 5-minute bucket
"day_of_week": 4, // Friday (0=Monday, matching the historical table)
"weather_code": 2, // light rain
"avg_segment_congestion": 0.72,
"historical_mean_eta": 1150,
"historical_std_eta": 180
}
TTL: 60 seconds (recomputed frequently)
Why These Choices
- In-memory graph: Road network must be in memory for sub-millisecond segment lookups during graph search. 15 GB is trivially cacheable.
- Redis for traffic: Shared across all routing instances. Sub-millisecond per-segment speed lookup. TTL ensures stale data auto-expires.
- Parquet for historical: Columnar format, efficient for loading specific edge × time bucket slices.
5. High-Level Design (12 min)
Architecture
GPS Probes (from driver apps)
→ Probe Ingestion Service
→ Kafka (raw-probes topic)
→ Map Matching Pipeline (Flink):
Raw (lat,lng) → snapped to road segment
→ Compute speed per segment
→ Aggregate: per-edge speed (median of last 5 min of probes)
→ Write to Redis (traffic layer)
ETA Request Flow:
Client → API Gateway → ETA Service:
1. Snap origin/destination to nearest nodes (KD-tree lookup)
2. Query traffic layer (Redis) for edge weights
3. Run graph search (Contraction Hierarchies with traffic weights)
4. Optional: ML model refinement of graph-based ETA
5. Compute confidence interval
6. Return ETA + route
ETA Computation Pipeline
Step 1: Geocode & Snap
Origin (lat, lng) → nearest graph node (KD-tree, < 1ms)
Destination (lat, lng) → nearest graph node
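A toy sketch of the snap step. Production would query a KD-tree over all graph nodes; a linear scan over a made-up node list shows the same contract (node ids and coordinates are illustrative):

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def snap_to_node(nodes, lat, lng):
    """Return the node_id closest to (lat, lng). nodes: [(id, lat, lng), ...]"""
    return min(nodes, key=lambda n: haversine_m(lat, lng, n[1], n[2]))[0]

nodes = [
    (1, 37.7749, -122.4194),
    (2, 37.7849, -122.4094),
    (3, 37.8000, -122.4000),
]
origin_node = snap_to_node(nodes, 37.7750, -122.4190)  # -> 1 (tens of meters away)
```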
Step 2: Compute Edge Weights
For each edge in the graph:
if real-time traffic available (probe_count > 3 in last 5 min):
weight = distance / current_speed
elif historical data available:
weight = distance / historical_speed[day_of_week][time_bucket]
else:
weight = (distance / free_flow_speed) × 1.2 (20% congestion buffer)
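The fallback chain above as a runnable function. Parameter names are illustrative; `live` is a traffic record shaped like the Redis value from the data model:

```python
def edge_weight_seconds(distance_m, live=None, historical_kmh=None,
                        free_flow_kmh=50):
    """Travel-time weight for one edge, in seconds, per the fallback order."""
    if live is not None and live.get("probe_count", 0) > 3:
        speed_kmh = live["current_speed_kmh"]       # real-time probes win
    elif historical_kmh is not None:
        speed_kmh = historical_kmh                  # historical profile next
    else:
        speed_kmh = free_flow_kmh / 1.2             # 20% congestion buffer
    return distance_m / (speed_kmh / 3.6)           # km/h -> m/s

# 500 m segment with live speed 36 km/h (10 m/s) -> 50 s
w = edge_weight_seconds(500, live={"current_speed_kmh": 36, "probe_count": 8})
```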
Step 3: Graph Search
Use Contraction Hierarchies (CH) bidirectional Dijkstra:
Forward search from origin (upward in hierarchy)
Backward search from destination (upward in hierarchy)
Meet in the middle at a high-level node
Unpack shortcut edges to reconstruct full path
CH reduces search space from ~1M nodes to ~1000 nodes
Search time: < 5ms for city-scale routes
Step 4: ML Refinement (optional, for high-value routes)
Features: graph-based ETA, route characteristics, time/weather
Model: gradient-boosted tree (XGBoost) or neural network
Output: adjusted ETA (corrects systematic biases in graph ETA)
Graph ETA might say 21 min, but ML knows that this corridor
at 5 PM on Fridays typically takes 10% longer → adjusted ETA = 23 min
Step 5: Confidence Interval
Based on historical variance for this route type:
std_dev = f(distance, road_class_mix, time_of_day)
eta_range = [eta - 1.5 × std_dev, eta + 1.5 × std_dev]
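A heuristic sketch of the interval step. The std-dev model below is a placeholder; a real system fits f(distance, road_class_mix, time_of_day) from completed-trip history:

```python
def eta_interval(eta_s, congested_fraction):
    """Widen the interval for longer and more congested routes (heuristic)."""
    std_dev = 0.05 * eta_s + 0.1 * eta_s * congested_fraction
    lo = max(0, eta_s - 1.5 * std_dev)
    hi = eta_s + 1.5 * std_dev
    return round(lo), round(hi)

lo, hi = eta_interval(1260, congested_fraction=0.3)  # -> (1109, 1411)
```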
Components
- Probe Ingestion Service: Receives GPS updates from drivers. High-throughput (1.25M/sec). Publishes to Kafka.
- Map Matching Pipeline (Flink): Matches raw GPS coordinates to road segments using Hidden Markov Model (HMM). Computes per-segment speeds. Aggregates over 5-minute windows.
- Traffic Service (Redis): Stores current speed per edge. Read by all ETA Service instances.
- ETA Service: Core computation. Loads road graph + CH index into memory. Queries Redis for traffic. Runs graph search. Optionally invokes ML model. Returns ETA.
- ML Model Service: Serves ETA prediction model. Feature extraction from route characteristics + traffic. Deployed as a sidecar or separate gRPC service.
- Historical Data Service: Serves precomputed historical speed profiles. Used as fallback when real-time probe data is sparse.
- Monitoring: ETA accuracy tracking (predicted vs actual trip time), traffic data freshness, probe coverage heatmap.
Deployment
Per region:
ETA Service: 100+ instances (each with full road graph in memory, ~15GB)
Redis Cluster: 10 shards for traffic data
Flink Cluster: map matching + traffic aggregation
ML Model: 50 GPU instances (or CPU with model optimization)
6. Deep Dives (15 min)
Deep Dive 1: Contraction Hierarchies — Why Graph Search is Fast
Problem: Dijkstra on a 50M-edge graph takes ~500ms. We need < 5ms.
Contraction Hierarchies (CH):
Preprocessing (offline, takes hours):
1. Assign importance to each node (heuristic: road class, edge count)
Highway nodes = high importance, residential = low
2. Iteratively "contract" the least important node:
- Remove node v from graph
- For each pair (u, w) of v's neighbors:
if the shortest path u→w goes through v (no equally short "witness" path avoids it):
add shortcut edge u→w with weight = weight(u→v) + weight(v→w)
3. Result: hierarchical graph where
- Bottom: residential streets
- Middle: arterial roads
- Top: highways and major intersections
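The contraction step above can be sketched for a single node. The graph is a dict `{node: {neighbor: weight}}` with symmetric edges; a real implementation would also order nodes by importance and iterate to build the full hierarchy:

```python
import heapq

def dijkstra(graph, src, dst, skip, limit):
    """Shortest src->dst distance ignoring node `skip`; prune beyond `limit`."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")) or d > limit:
            continue
        for v, w in graph.get(u, {}).items():
            if v == skip:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

def contract(graph, v):
    """Add the shortcut edges needed once v is removed; return them."""
    shortcuts = []
    neighbors = list(graph.get(v, {}))
    for u in neighbors:
        for w in neighbors:
            if u == w:
                continue
            via_v = graph[u][v] + graph[v][w]
            # Witness search: is there a u->w path avoiding v that is as short?
            if dijkstra(graph, u, w, skip=v, limit=via_v) > via_v:
                graph[u][w] = min(graph[u].get(w, float("inf")), via_v)
                shortcuts.append((u, w, via_v))
    return shortcuts

# Toy graph: fast a-v-b chain plus a slow a-c-b detour
g = {"a": {"v": 1, "c": 5}, "v": {"a": 1, "b": 1},
     "b": {"v": 1, "c": 5}, "c": {"a": 5, "b": 5}}
added = contract(g, "v")  # shortcuts a->b and b->a, weight 2 each
```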
Query (online, < 5ms):
Bidirectional search:
Forward from origin: explore only edges going UP in hierarchy
Backward from destination: explore only edges going UP in hierarchy
Meet at the highest common node
Why this works:
Any shortest path goes "up" to highways, travels along, then goes "down"
By only searching upward from both ends, we explore ~1000 nodes instead of ~1M
Path unpacking:
Shortcut edges are expanded recursively to reconstruct the actual road-by-road path
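Unpacking is a small recursion if each shortcut remembers the node it bypassed. The `middle` map (shortcut edge → contracted node) is an illustrative structure, not a fixed format:

```python
def unpack(u, w, middle):
    """Expand edge (u, w) into the underlying road-by-road node sequence."""
    v = middle.get((u, w))
    if v is None:                  # original edge, nothing to expand
        return [u, w]
    left = unpack(u, v, middle)    # recurse into both halves of the shortcut
    right = unpack(v, w, middle)
    return left + right[1:]        # drop the duplicated middle node

# Shortcut a->d bypassed c; shortcut a->c in turn bypassed b
middle = {("a", "d"): "c", ("a", "c"): "b"}
path = unpack("a", "d", middle)    # -> ["a", "b", "c", "d"]
```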
Performance comparison:
| Algorithm | Nodes Explored | Query Time |
|---|---|---|
| Dijkstra | ~1,000,000 | ~500ms |
| A* | ~300,000 | ~150ms |
| CH | ~1,000 | ~2-5ms |
| CH + traffic | ~2,000 | ~5-10ms |
CH with dynamic traffic weights (the hard part):
Standard CH assumes static edge weights (precomputed).
With traffic, weights change every few minutes.
Approach 1: Full CH recomputation every 5 minutes
Too slow — CH preprocessing takes 30+ minutes for a city
Approach 2: Customizable CH (CCH)
Preprocessing: compute CH structure (topology only) offline
Runtime: update edge weights without changing structure
Reweight every 1-2 minutes → takes ~2 seconds for a city
Query time: ~10ms (slightly slower than static CH)
Approach 3: A* with landmarks (ALT) as fallback
When traffic changes invalidate CH shortcuts, fall back to A*
with landmark-based heuristic (15-30ms, still fast enough)
Recommendation: CCH (Customizable Contraction Hierarchies)
- Topology precomputed once (offline, takes hours)
- Metric (edge weights) updated every minute (takes seconds)
- Query time < 10ms even with dynamic traffic
Deep Dive 2: Real-Time Traffic from GPS Probes
Pipeline: Raw GPS → Road Segment Speeds
Step 1: Map Matching (GPS → road segment)
Problem: GPS is noisy (±10m). A point at (37.775, -122.419) could be on
Market St, Mission St, or an alley between them.
Solution: Hidden Markov Model (HMM) map matching
States: road segments within 50m of GPS point
Emission probability: P(GPS point | on segment) ∝ Gaussian(distance)
Transition probability: P(segment_j at t+1 | segment_i at t)
∝ exp(-|route_distance - GPS_distance| / β)
(prefer transitions where driving distance matches GPS displacement)
Viterbi algorithm: find most likely sequence of segments
Input: [(37.775, -122.419, t=0), (37.776, -122.418, t=4s), ...]
Output: [segment_1234 at t=0, segment_1234 at t=4s, segment_1235 at t=8s, ...]
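A toy Viterbi pass over two candidate segments per GPS fix. The Gaussian emission and constant transition preference are crude stand-ins for the probabilities above; all names and distances are made up to show the shape of the computation, not a real map matcher:

```python
import math

SIGMA = 10.0  # GPS noise, meters

def emission(dist_m):
    return math.exp(-0.5 * (dist_m / SIGMA) ** 2)

def transition(seg_a, seg_b):
    return 1.0 if seg_a == seg_b else 0.3  # crude stand-in for the route check

def viterbi(candidates):
    """candidates: per GPS fix, a dict {segment_id: distance_to_gps_m}."""
    scores = {s: math.log(emission(d)) for s, d in candidates[0].items()}
    back = []
    for obs in candidates[1:]:
        new_scores, pointers = {}, {}
        for s, d in obs.items():
            prev = max(scores, key=lambda p: scores[p] + math.log(transition(p, s)))
            new_scores[s] = (scores[prev] + math.log(transition(prev, s))
                             + math.log(emission(d)))
            pointers[s] = prev
        scores, back = new_scores, back + [pointers]
    seq = [max(scores, key=scores.get)]     # backtrack from best final segment
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))

fixes = [
    {"market_st": 4.0, "mission_st": 18.0},
    {"market_st": 6.0, "mission_st": 15.0},
    {"market_st": 5.0, "mission_st": 20.0},
]
matched = viterbi(fixes)  # -> ["market_st", "market_st", "market_st"]
```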
Step 2: Speed computation per segment
From map-matched trajectory:
Driver traversed segment 1235 (length 200m) in 8 seconds
Speed = 200m / 8s = 25 m/s = 90 km/h
Aggregate over all drivers on segment 1235 in last 5 minutes:
Probe speeds: [88, 90, 85, 92, 87, 91] km/h
Median speed: 89 km/h (median is robust to outliers)
Probe count: 6 (high confidence)
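The aggregation step, runnable as written (field names are illustrative):

```python
import statistics

def aggregate_segment_speed(probe_speeds_kmh):
    """Median speed + probe count for one segment over the last window."""
    return {
        "median_speed_kmh": statistics.median(probe_speeds_kmh),
        "probe_count": len(probe_speeds_kmh),
    }

agg = aggregate_segment_speed([88, 90, 85, 92, 87, 91])  # median 89, count 6
```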
Step 3: Handling sparse data
Problem: Residential streets may have 0 probes in the last 5 minutes.
Fallback hierarchy:
1. Real-time probes (last 5 min, > 3 probes): use median speed
2. Recent probes (last 15 min, > 3 probes): use median speed × 0.95 (slight penalty for staleness)
3. Historical speed profile (same day-of-week, same 5-minute bucket): use historical median
4. Free-flow speed × congestion factor (based on nearby segments): last resort
Nearby segment propagation:
If segment S has no data but adjacent segments A and B show 60% of free-flow:
Estimate S at 60% of free-flow (spatial smoothing)
Weighted by road class similarity
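A minimal sketch of the propagation rule, with road-class weighting simplified to an unweighted average (the function name and signature are illustrative):

```python
def estimate_congestion(neighbor_ratios, free_flow_kmh):
    """Estimate a no-data segment's speed from adjacent congestion ratios."""
    if not neighbor_ratios:
        return None  # nothing to smooth from; caller uses free-flow fallback
    avg_ratio = sum(neighbor_ratios) / len(neighbor_ratios)
    return avg_ratio * free_flow_kmh

# Adjacent segments at 60% of free-flow -> estimate this one at 60% too
speed = estimate_congestion([0.6, 0.6], free_flow_kmh=50)  # -> 30.0
```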
Step 4: Incident detection
Sudden speed drop on a segment (from 60 km/h to 5 km/h) → possible incident
1. Flag segment as "incident suspected"
2. Propagate congestion to upstream segments (traffic backs up)
3. Adjust ETAs for routes through this area
4. If official incident report received → confirm and adjust TTL
Incidents cleared:
When probe speeds recover to > 80% of free-flow → remove incident flag
Gradual recovery: don't immediately jump to full speed (road takes time to clear)
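The flag/clear logic above as a minimal state machine. Thresholds are illustrative:

```python
def update_incident_state(flagged, speed_kmh, free_flow_kmh):
    """Flag on a severe drop; clear once speed recovers past 80% of free-flow."""
    ratio = speed_kmh / free_flow_kmh
    if not flagged and ratio < 0.15:
        return True    # sudden severe slowdown -> incident suspected
    if flagged and ratio > 0.8:
        return False   # traffic recovered -> remove incident flag
    return flagged     # otherwise keep current state (hysteresis)

s = update_incident_state(False, 5, 60)  # 5 km/h on a 60 km/h road -> True
s = update_incident_state(s, 50, 60)     # recovered past 80% -> False
```

The two thresholds deliberately differ (hysteresis) so the flag doesn't flap as speeds hover near a single cutoff.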
Deep Dive 3: ML-Based ETA Prediction
Why graph-based ETA isn’t enough:
Graph search computes: sum(distance_i / speed_i) for all segments on route
This misses:
- Turn delays (left turns at intersections take 30-60 seconds)
- Traffic signal timing (red lights add 0-120 seconds per intersection)
- Merge/exit delays on highways
- Systematic biases (school zones at 3 PM, stadium events)
- Weather impact (rain → 15% slower, snow → 30% slower)
Graph ETA error: typically 20-30% MAE
ML-corrected ETA error: typically 10-15% MAE (50% reduction)
ML Model Architecture:
Input features (per route):
Route features:
- Total distance (meters)
- Number of segments
- Number of turns (left, right, U-turn)
- Road class distribution (% highway, % arterial, % residential)
- Number of traffic signals (estimated from intersection data)
Traffic features:
- Graph-based ETA (from CH search)
- Average congestion ratio along route
- Max congestion ratio (bottleneck segment)
- Number of congested segments (speed < 50% of free-flow)
Temporal features:
- Hour of day (one-hot or sinusoidal encoding)
- Day of week
- Is holiday
- Minutes since sunrise/sunset
Weather features:
- Temperature, precipitation, visibility
- Road condition (dry, wet, icy)
Historical features:
- Historical mean ETA for this O-D pair at this time
- Historical variance
Model: Gradient-boosted trees (XGBoost / LightGBM)
- 500 trees, max depth 8
- Training data: millions of completed trips with actual duration
- Loss function: Huber loss (robust to outliers)
- Output: predicted ETA in seconds
Inference time: ~2ms on CPU (small model), < 1ms on GPU
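The Huber loss named above, written out to show why it is robust: quadratic near zero, linear in the tails, so one badly mispredicted trip contributes far less to training than under squared error. The 60-second delta is an illustrative choice:

```python
def huber(residual_s, delta=60.0):
    """Huber loss on an ETA residual in seconds."""
    a = abs(residual_s)
    if a <= delta:
        return 0.5 * a * a          # quadratic region (like squared error)
    return delta * (a - delta / 2)  # linear region (outliers capped)

inlier = huber(30)    # 450.0, same as squared-error/2
outlier = huber(600)  # 34200.0, vs 180000.0 under squared-error/2
```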
Alternative: Deep learning (Transformer on segment sequence)
- Encode each segment as a vector (speed, distance, road class)
- Self-attention over segment sequence
- Better for long routes with complex interactions
- Slower inference: ~10ms
- Use for high-value routes (long-distance navigation)
Confidence intervals:
Train a quantile regression model alongside the point estimate:
Model outputs: P10, P50 (median), P90 of ETA
Example: P10 = 16 min, P50 = 21 min, P90 = 28 min
Display: "21 min (16-28 min range)"
Alternatively: predict mean + variance
ETA = 21 min, σ = 4 min
95% CI: [21 - 2×4, 21 + 2×4] = [13, 29] min
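Rendering the quantile outputs in the display format above (function name is illustrative):

```python
def format_eta(p50_s, p10_s, p90_s):
    """Render P10/P50/P90 ETA quantiles (seconds) as a display string."""
    to_min = lambda s: round(s / 60)
    return f"{to_min(p50_s)} min ({to_min(p10_s)}-{to_min(p90_s)} min range)"

msg = format_eta(1260, 960, 1680)  # -> "21 min (16-28 min range)"
```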
Factors that increase uncertainty:
- Longer routes (more segments = more variance)
- Routes through congested areas (traffic is unpredictable)
- Bad weather
- Low probe coverage (less data = less confidence)
7. Extensions (2 min)
- Multi-modal ETA: Combine driving, walking, and public transit. Compute ETAs for “drive to station → take train → walk to destination.” Requires transit schedule integration (GTFS feeds) alongside road network graph.
- Predictive ETA (future departure): “If I leave at 5 PM, how long will it take?” Use historical traffic patterns for the departure time, not current traffic. Blend historical and real-time as departure time approaches.
- ETA for delivery fleets: Optimize for multiple stops (Traveling Salesman variant). Compute arrival time at each stop. Factor in loading/unloading time. Update ETAs as driver completes each stop.
- Offline-capable ETA: Cache road graph and recent traffic on-device for areas with poor connectivity. Compute ETAs locally with degraded accuracy. Sync when connectivity returns.
- ETA accuracy feedback loop: Compare predicted ETAs with actual trip durations in real-time. If a route’s ETA error exceeds 20%, trigger re-calibration of traffic weights for that area. Auto-detect map data errors (road closures, new roads).