“Design Uber” sounds like a CRUD app with a map on top. A rider taps a pin, a driver shows up, money moves. The interviewer lets you believe that for about thirty seconds, then asks the question that breaks the toy version: there are 5 million drivers on the road right now, each one broadcasting its GPS position every few seconds, and a rider standing on a corner wants the nearest available car in under a second. How do you find “the closest driver” out of millions of constantly-moving points, hundreds of thousands of times per second, without scanning the whole planet on every request?
That single question - nearest-neighbour search over millions of moving points, at high write and high read rates simultaneously - is the spine of the whole system. Everything hard here hangs off it: how you ingest a firehose of location pings without melting your database, how you index a 2D surface so “who is near me” is a small bounded lookup instead of a full scan, how you match a rider to a driver without two dispatchers handing the same car to two people, and how surge pricing reads that same live supply-and-demand picture to move the price. Let me build it properly.
Functional Requirements (FR)
In scope:
- Driver location updates. Every active driver’s app streams its GPS position (lat/lng) to the backend every ~4 seconds. The system tracks the live location of millions of drivers.
- Request a ride. A rider enters pickup and drop-off, sees an ETA and a fare estimate, and requests. The system finds nearby available drivers and dispatches the request.
- Driver-rider matching. Given a rider’s pickup location, find the best available driver (nearest, subject to rating/ETA/acceptance heuristics) and offer the trip. The driver accepts or declines; on decline or timeout, offer the next candidate.
- Live trip tracking. Once matched, the rider sees the driver’s car moving toward them on the map in real time, and both parties track the trip to completion.
- Ride lifecycle. A trip moves through states: requested -> matched -> driver en route -> arrived -> in progress -> completed (or cancelled), with fare computed at the end.
- Surge pricing. When demand in an area outstrips supply, apply a dynamic price multiplier to that area, shown to the rider before they confirm.
- ETA and fare estimation. Estimated time of arrival for pickup, and an up-front fare estimate based on distance, time, and surge.
Explicitly out of scope (say this out loud so you own the scope):
- Payments and fraud. Charging a card, driver payouts, and fraud detection are their own systems; the trip ends by handing a fare and rider/driver IDs to a billing service. I will not design the ledger.
- Maps, routing, and road-network ETA internals. Turn-by-turn navigation, the road graph, and the routing engine (Dijkstra/contraction hierarchies over a road graph) are a separate specialist system. I will treat “give me the road-distance ETA between A and B” as a service call.
- Ratings, support, promotions, Uber Eats. Each is its own system.
- Driver onboarding / KYC. Assume drivers exist and are verified.
The one decision that drives everything: this is a read-and-write-heavy geospatial system where the data is location and location is constantly changing. Unlike a feed (read-heavy) or chat (push-heavy but point-to-point), the defining constraint is spatial - “find me things near this point” over a dataset that is rewritten every few seconds. That is why the geospatial index, not the database, is the heart of the design.
Non-Functional Requirements (NFR)
- Scale: ~5M active drivers online at peak globally, ~100M monthly riders, ~20M trips/day. Location pings from every online driver every ~4 seconds.
- Latency: the “find nearby drivers” query must return in under ~200ms so a ride request feels instant. Matching (offer -> accept) should complete within a few seconds. Live location updates on the rider’s map should feel real time (p99 under ~1s end to end).
- Availability: 99.99%+ on the request-ride and matching path. A rider who cannot get a car churns immediately. Location ingestion can tolerate brief degradation (a dropped ping is recoverable on the next one) but matching cannot.
- Consistency: matching must be strongly consistent for the driver assignment - a driver can be offered to exactly one rider at a time and assigned to exactly one trip. Two riders must never both be told “your driver is arriving” for the same car. Location data, by contrast, is best-effort and eventually consistent - a location that is 4 seconds stale is fine.
- Durability: trip records (who, where, when, fare, state transitions) must be durable and never lost - they are financial and legal records. Raw location pings are ephemeral; only the trip’s route needs archiving.
- Geospatial freshness: a driver’s tracked position should be no more than one ping interval (~4s) stale for matching purposes.
Back-of-the-Envelope Estimation (BoE)
Real numbers with the arithmetic, because they dictate the architecture.
Location updates (the write firehose):
5M active drivers, each pinging every 4 seconds
= 5,000,000 / 4
= 1,250,000 location writes/sec average
Peak (2x average, rush hour) ≈ 2.5M location writes/sec
That 1.25M-2.5M writes/sec is the number that kills a naive design. It is a sustained firehose, not a spike, and every write is a small update to a hot dataset.
Ride requests (reads that trigger the geo-query):
20M trips/day
= 20,000,000 / 86,400
≈ 230 trips/sec average
Peak (5x, commute crunch) ≈ 1,200 trips/sec
Each ride request triggers one “find nearby drivers” query (which reads many driver positions) plus a matching flow. So reads-that-matter are ~1,200/sec, but each fans out to read dozens of candidate drivers, and the location writes dwarf everything. The asymmetry is inverted from a normal app: writes (locations) are ~1,000x the trip requests.
Storage:
Live location state is small and overwritten in place - we do not keep history of every ping for matching:
Per driver live record:
driver_id 8 bytes
lat, lng 16 bytes
updated_at 8 bytes
status 1 byte
heading/speed 8 bytes
≈ 40 bytes (round to ~100 with keys/overhead)
5M drivers * 100 bytes ≈ 500 MB of live location state.
500MB fits comfortably in memory. That is the key realization: the entire live location index is small enough to live in RAM. The hard part is not storing it, it is updating it 2.5M times/sec and querying it spatially.
Trip records, which we do keep durably:
Per trip ≈ 1 KB (ids, timestamps, pickup/drop coords, route polyline pointer, fare, state history)
20M trips/day * 1 KB = 20 GB/day ≈ 7.3 TB/year.
Modest and easily sharded. If we also archive raw location traces per trip for disputes (a polyline of a few KB), add a bounded multiple - still tens of TB/year, cheap object storage.
Bandwidth:
Ingress: 2.5M pings/sec * ~200 bytes (payload + framing) ≈ 500 MB/sec inbound.
Egress: live trip tracking pushes driver positions to riders in active trips.
~1M concurrent trips * 1 push / few seconds * ~100 bytes ≈ tens of MB/sec.
The takeaways: a 2.5M writes/sec location firehose that must not touch a relational DB, a live index small enough (~500MB) to hold in memory, spatial queries that must be bounded (never a full scan), and matching that must be strongly consistent even though everything around it is eventually consistent. The geospatial index and the ingestion path, not storage, are the design drivers.
High-Level Design (HLD)
The architecture separates the location plane (a high-write, in-memory geospatial index fed by a streaming ingestion pipeline) from the transactional plane (matching, trips, pricing) that queries it. A rider’s request reads the location plane to get candidates, then the matching service runs a strongly-consistent assignment against them.
┌──────────────┐ ┌──────────────┐
│ Driver App │ location ping / 4s │ Rider App │
│ (GPS stream) │ │ (request/track)│
└──────┬────────┘ └──────┬────────┘
│ │
┌──────▼───────────────┐ ┌──────────▼──────────┐
│ Location Gateway │ │ API Gateway │
│ (WebSocket/gRPC edge) │ │ (ride requests, │
└──────┬────────────────┘ │ trip tracking) │
│ pings └──────────┬──────────┘
┌──────▼───────────────┐ │
│ Ingestion Stream │ │
│ (Kafka: location │ │
│ topic, keyed by │ │
│ driver_id) │ │
└──────┬────────────────┘ │
│ consumed │
┌──────▼─────────────────────────────┐ │
│ Location Service │◀────────────┤ "drivers near (lat,lng)?"
│ ┌───────────────────────────────┐ │ │
│ │ In-memory Geospatial Index │ │ │
│ │ (geohash buckets / QuadTree, │ │ ┌──────▼──────────┐
│ │ sharded by region) │ │ │ Matching Service │
│ └───────────────────────────────┘ │ │ (dispatch, offer,│
└─────────────────────────────────────┘ │ strong assign) │
▲ └──────┬──────────┘
│ candidate drivers │
│ ┌────────▼─────────┐
┌──────┴────────┐ ┌────────────────┐ │ Trip Service │
│ Surge/Pricing │ │ ETA / Routing │ │ (state machine, │
│ Service │ │ Service │ │ durable trips) │
└───────────────┘ └────────────────┘ └────────┬─────────┘
▲ │
│ supply/demand per cell ┌───────────▼──────────┐
└───────────────────────────────────│ Trip Store (sharded │
│ SQL, by trip/city) │
└──────────────────────┘
Driver location update flow:
- The driver app holds a persistent connection (WebSocket or gRPC stream) to a Location Gateway at the edge and pushes a ping
{driver_id, lat, lng, heading, speed, ts}every ~4s. - The gateway drops the ping onto a Kafka topic keyed by
driver_id(keying guarantees per-driver ordering; a newer position never overtakes an older one). - The Location Service consumes the stream and updates its in-memory geospatial index: it computes the driver’s geohash/cell, moves the driver from its old cell to its new one if it crossed a boundary, and overwrites the driver’s live position. The old ping is discarded - we keep only “where is each driver now.”
- Nothing hits a durable database on the hot path. The location plane is memory plus a stream.
Ride request / matching flow:
- Rider taps request. API Gateway calls the Matching Service with pickup
(lat, lng). - Matching Service asks the Location Service: “give me available drivers within radius R of this point.” The geo-index returns candidates from the pickup cell plus its neighbours - a bounded lookup, not a scan.
- Matching Service asks the ETA Service for road-distance ETAs to rank the candidates, and the Surge Service for the current multiplier in that cell.
- Matching offers the trip to the best candidate. It atomically locks that driver (strongly consistent) so no other request can grab them. The driver accepts or declines/times out; on failure, unlock and offer the next candidate.
- On accept, the Trip Service creates a durable trip record (state =
matched), and both apps switch to live tracking mode. The driver’s status flips toon_tripso the location index stops offering them.
Live tracking flow: during the trip, the driver’s ongoing pings are additionally relayed to the matched rider (via the rider’s open connection) so the rider sees the car move. The trip advances through its state machine until completed, at which point the fare is finalized and handed to billing.
The key insight: the location plane is memory + stream and is eventually consistent; the matching plane is a small strongly-consistent transaction layered on top of it. You keep the firehose away from your transactional database entirely, and you only pay for strong consistency on the rare, tiny “assign this driver” operation.
Component Deep Dive
The hard parts, naive-first then evolved: (1) ingesting the location firehose, (2) the geospatial index for nearest-neighbour search, (3) driver-rider matching without double-assignment, (4) surge pricing.
1. Ingesting 2.5M location updates/sec
Naive approach: write each ping to a database row. The driver app does UPDATE drivers SET lat=?, lng=?, updated_at=? WHERE id=? every 4 seconds.
Where it breaks:
- Write throughput. 2.5M
UPDATEs/sec against any relational database is hopeless. Even a well-sharded Postgres fleet chokes; each update is a row write, index update, WAL append, and replication event. You would need thousands of shards purely to absorb writes that are immediately overwritten 4 seconds later. - Write amplification for nothing. You are durably persisting data with a useful lifetime of 4 seconds. Every fsync, every replica, every index rebuild is wasted effort on data you will throw away.
- The index churn. If you also index
(lat, lng)for spatial queries, every ping rewrites that index. A B-tree spatial index under 2.5M writes/sec is a lock-contention nightmare.
First evolution: batch and drop durability. Write to an in-memory store (Redis) instead of a relational DB, and have drivers batch pings. Better - Redis can take the writes - but a single Redis is still a hot spot at millions of writes/sec, and you have coupled ingestion directly to the store, so a slow consumer backs up onto the driver apps.
The answer: a streaming ingestion pipeline that decouples ingest from index update, with the live state in memory.
- Persistent connections at the edge. Drivers hold a WebSocket/gRPC stream to a Location Gateway rather than doing an HTTP round trip every 4s. This removes per-ping connection setup and auth cost (same reasoning as chat: request/response is the wrong shape for a constant stream).
- Kafka as a shock absorber, keyed by
driver_id. The gateway publishes pings to a partitioned Kafka topic. Keying bydriver_idguarantees all of one driver’s pings land on the same partition in order, so the index never applies a stale position over a fresh one. Kafka absorbs bursts; if the Location Service falls behind, pings queue in the log instead of overwhelming it or blocking drivers. - In-memory index, partitioned by geography. Location Service consumers apply pings to a sharded in-memory geospatial index. Because the whole live dataset is ~500MB, it fits in RAM across a handful of nodes. Updates are O(1) hash operations (compute cell, move driver between buckets), not B-tree writes.
Driver app --WSS--> Location Gateway --produce--> Kafka (topic: locations,
partition = hash(driver_id))
Kafka --consume--> Location Service shard --> in-memory index:
cell_new = geohash(lat, lng, precision)
if cell_new != driver.cell_old:
index[cell_old].remove(driver_id)
index[cell_new].add(driver_id)
driver.pos = (lat, lng); driver.updated_at = ts
We accept that this state is not durable - and that is correct, because a lost location is re-sent 4 seconds later. If a Location Service node crashes, a replacement rebuilds its slice of the index by replaying the last few seconds of the Kafka topic (retention only needs to be a minute or two for this purpose). We traded durability we did not need for the throughput we absolutely need.
2. The geospatial index (geohash / QuadTree)
This is the core of the whole problem: given a point, return the drivers near it, fast, over millions of moving points.
Naive approach: scan all drivers and compute distance. For a ride request at (lat, lng), loop over all 5M drivers, compute the haversine distance to each, keep those within radius R, sort, return the closest.
Where it breaks:
- O(N) per query. 5M distance calculations per ride request, ~1,200 requests/sec peak = 6 billion haversine computations/sec. Absurd.
- No pruning. You examine drivers in Tokyo to answer a request in London. The overwhelming majority of the dataset is irrelevant to any given query but you touch all of it anyway.
First evolution: a naive lat/lng grid with SQL. Index drivers by rounded (lat, lng) and query a bounding box: WHERE lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?. This prunes to a box, which is a huge improvement, but: a database B-tree on two independent columns cannot efficiently answer a 2D range (it picks one column, scans, filters the other), the index churns under the write firehose, and a fixed-degree box is a poor fit for wildly varying density (a box that is right for rural Montana returns 10,000 drivers in Manhattan).
The answer: a spatial index that maps 2D space to buckets - geohash - with density handled by a QuadTree where needed. Two closely related tools; I will use both and say when.
Geohash. A geohash encodes a (lat, lng) into a short string by recursively bisecting the world into a grid and interleaving the bits of latitude and longitude. Each added character refines the cell:
precision 4 ≈ 39 km x 19 km cell
precision 5 ≈ 4.9 km x 4.9 km
precision 6 ≈ 1.2 km x 0.6 km
precision 7 ≈ 153 m x 153 m
The magic property: points that are close in space share a geohash prefix. So “drivers near me” becomes “drivers whose geohash starts with my prefix” - a hash/prefix lookup, not a distance scan. We bucket drivers by, say, a precision-6 geohash (~1km cells) in an in-memory map geohash_prefix -> set<driver_id>.
index: Map<geohash6, Set<driver_id>>
find_nearby(lat, lng, radius):
my_cell = geohash(lat, lng, 6)
cells = [my_cell] + neighbours(my_cell) # the 8 adjacent cells
candidates = union(index[c] for c in cells)
return [d for d in candidates if haversine(d.pos,(lat,lng)) <= radius]
Two subtleties, both must be handled:
- The edge problem. The nearest driver might be just across a cell boundary, in an adjacent geohash with a different prefix. Fix: always query the target cell plus its 8 neighbours. Geohash neighbour computation is a known bit-manipulation; you gather all 9 cells’ drivers as candidates, then do exact distance filtering on that small set.
- Density variation. A precision-6 cell in Manhattan holds thousands of drivers; the same precision in a suburb holds two. A fixed precision is a bad global choice. This is where the QuadTree earns its place.
QuadTree. A QuadTree is an adaptive spatial index: start with the whole map as one node; whenever a node holds more than a threshold (say 100) drivers, split it into four quadrants; recurse. Dense areas get deep, fine-grained subdivisions; sparse areas stay shallow and coarse.
world
/ | | \
NW NE SW SE
|
(dense -> split again)
/ | | \
... ... ... ... (each leaf ≤ ~100 drivers)
A nearest-neighbour query descends the tree to the rider’s leaf, then walks outward to sibling/neighbour leaves until it has enough candidates within the radius. Because each leaf is bounded (~100 drivers), the query touches a small, roughly constant number of drivers regardless of whether the rider is in Manhattan or Montana. That density-adaptiveness is exactly what a fixed geohash grid lacks.
How they combine in practice. Geohash gives you a dead-simple, shardable bucket key and prefix search that is trivial to distribute and to store in Redis-like structures. QuadTree gives you density-adaptive leaves. A common real design: geohash for coarse sharding and bucketing, with adaptive precision (use a longer prefix in dense cities, shorter in sparse regions) - effectively a QuadTree’s adaptiveness expressed through variable geohash precision. Redis even ships a geospatial type (GEOADD/GEOSEARCH) built on geohash-scored sorted sets, which is a ready-made version of this bucket index. The interview point is understanding why both exist: prefix-shareability for distribution, adaptive depth for density.
| Approach | Query cost | Handles density | Distributable | Write cost |
|---|---|---|---|---|
| Full scan | O(N) | n/a | no | trivial |
| Lat/lng SQL box | O(box) + DB churn | poorly (fixed box) | hard | high (index churn) |
| Geohash buckets | O(9 cells) | fixed precision | easily (by prefix) | O(1) in memory |
| QuadTree | ~O(log N) to leaf | adaptively | with care | O(1), occasional split |
The rebuild cadence. The index is continuously updated by the ingestion stream. A driver crossing a cell boundary moves buckets on its next ping. We never rebuild the whole thing - it mutates in place, one driver at a time, which is why keeping it in memory with O(1) bucket moves is essential.
3. Driver-rider matching without double-assignment
Naive approach: give the rider the nearest driver. Query the index, pick the closest available driver, mark them taken, done.
Where it breaks:
- Race condition / double booking. Two riders request within the same instant, the index returns the same nearest driver to both matching workers, both mark that driver taken, both tell their rider “your car is arriving.” One rider gets ghosted. This is the single most important correctness bug in the system - it directly violates the “a driver is assigned to exactly one trip” NFR.
- Nearest is not best. The geometrically closest driver may be across a river with no bridge (20 minutes by road though 200m as the crow flies), or facing the wrong way on a highway, or a serial decliner. Straight-line nearest ignores road distance and driver behaviour.
- Decline handling. Drivers decline or ignore offers. If you “assign” and walk away, a declined offer leaves the rider stranded.
The fix: a candidate-ranking dispatch loop with a strongly-consistent per-driver lock and sequential offers.
Strong consistency on the assignment, only there. The whole system is eventually consistent except this one operation. When Matching picks a candidate, it must atomically claim that driver so no concurrent request can claim the same one. Options, in order of preference:
- A distributed lock / conditional write keyed by
driver_id:SET lock:driver:{id} = trip_id NX EX 30(RedisNX= set only if absent). Only one matcher wins the key; the loser skips that driver and moves to the next candidate. The 30s TTL auto-releases if the offer flow crashes, so a driver is never locked forever. - Equivalently, a conditional update in a strongly-consistent store:
UPDATE drivers SET status='offered', trip=? WHERE id=? AND status='available'- theAND status='available'makes it a compare-and-swap; only one wins.
match(rider, pickup):
candidates = location.find_nearby(pickup, R) # eventually-consistent read
ranked = rank(candidates, by=[road_eta, rating, accept_rate]) # ETA service
for d in ranked:
if not acquire_lock(d.id, trip_id): # STRONG: CAS / SET NX
continue # someone else grabbed d
offer(d, trip_id) # push offer to driver app
resp = await_response(d, timeout=15s)
if resp == ACCEPT:
trip.assign(d); d.status = on_trip
return d
release_lock(d.id) # declined/timeout -> next
return NO_DRIVER
Ranking, not raw nearest. Candidates from the geo-index are ranked by road ETA (from the routing service, not straight-line distance), tie-broken by driver rating and historical acceptance rate. This is where “nearest by geohash” becomes “best by real-world pickup time.” The geo-index’s job is to cheaply produce a small candidate set; ranking refines it.
Sequential offers with timeout. The trip is offered to one driver at a time (or a small batch in some designs, first-to-accept wins). On decline or a ~15s timeout, release the lock and offer the next-ranked driver. This continues until someone accepts or the candidate list is exhausted (then widen the radius, or tell the rider none available).
Why not just optimistic and reconcile later? Because the failure mode - two riders both promised the same car - is user-visible and infuriating, and reconciling “actually your driver isn’t coming” after the fact is far worse than a 5ms lock. This is the one place a small dose of strong consistency is worth its cost, and because assignments are rare (~1,200/sec) compared to location writes (2.5M/sec), the lock traffic is negligible.
4. Surge pricing
Naive approach: a global static price, or a nightly-computed multiplier. Charge a fixed per-km rate, maybe bump it during “known” rush hours from a config table.
Where it breaks:
- It does not balance supply and demand. Surge exists to do economic work: when 500 people want cars in a district with 20 free drivers, raising the price both rations demand (some riders wait or walk) and pulls supply (nearby drivers move toward the money). A static price cannot respond to a concert letting out or a sudden downpour.
- Wrong granularity. One global multiplier is useless; surge is intensely local. The airport can be 3x while a district 2km away is 1x.
- Staleness. A multiplier computed nightly cannot react to conditions that change minute to minute.
The fix: compute a demand/supply ratio per geo-cell on a short rolling window, and map it to a multiplier.
Surge reuses the same geohash cells the location index already uses. For each cell, on a rolling window (say the last 1-2 minutes), track two counters:
- Supply: number of available drivers currently in the cell (the location index already knows this - it is
len(index[cell])filtered to available). - Demand: number of ride requests originating in the cell in the window (incremented by the Matching Service, or derived from a demand stream).
for each active cell every ~30s:
supply = available_drivers_in(cell)
demand = ride_requests_in(cell, last_2_min)
ratio = demand / max(supply, 1)
surge = clamp(surge_curve(ratio), 1.0, 5.0) # e.g. 1.0 below 1, rising past it
cache.set("surge:" + cell, surge, TTL=60s)
- Streaming aggregation. Demand and supply per cell are computed by a stream processor (Kafka Streams / Flink) consuming the ride-request stream and the location stream, windowed over ~1-2 minutes. Output is a small map
cell -> multiplier, written to a fast cache (Redis) with a short TTL. - Read path is trivial. When a rider requests, the Pricing Service does
GET surge:{cell}- an O(1) lookup - multiplies the base fare, and shows it before the rider confirms. The heavy lifting (aggregation) is asynchronous and off the request path. - Smoothing and caps. Raw ratio is jittery, so the surge curve smooths it (exponential moving average) and caps it (e.g. 5x) to avoid absurd prices and oscillation. A rider who sees a quote is honoured that quote for a short window even if surge moves, so the price does not change under them mid-confirmation.
- Hysteresis. Bumping and dropping surge on every tick makes prices flicker; a small hysteresis band (raise faster than you lower) keeps it stable.
The elegant part: surge is a read of the same supply-and-demand picture the matching system already maintains. Supply is literally the count in each location cell; demand is the request rate the matcher already sees. Surge pricing is not a bolt-on - it is a windowed aggregation over data the system already streams.
API Design & Data Schema
The location path is streamed frames; the request/trip path is REST plus a tracking stream.
Location streaming (driver -> backend)
// persistent gRPC / WebSocket stream, one frame every ~4s
{ "type": "LOCATION", "driver_id": "d_88", "lat": 19.0760, "lng": 72.8777,
"heading": 210, "speed_kph": 34, "ts": 1751... }
// server -> driver: dispatch offer (server-initiated)
{ "type": "OFFER", "trip_id": "t_51", "pickup": {"lat":..,"lng":..},
"rider_rating": 4.8, "est_fare": 240, "expires_in_s": 15 }
// driver -> server
{ "type": "OFFER_RESPONSE", "trip_id": "t_51", "decision": "ACCEPT" }
REST (rider-facing)
POST /api/v1/rides/estimate
body: { pickup:{lat,lng}, dropoff:{lat,lng} }
-> { eta_pickup_s: 240, distance_km: 8.2, base_fare: 190,
surge: 1.8, total_estimate: 342 }
POST /api/v1/rides/request
body: { pickup:{lat,lng}, dropoff:{lat,lng}, product: "go", quote_id: "q_9" }
-> 202 Accepted { trip_id: "t_51", status: "matching" }
GET /api/v1/rides/{trip_id}
-> { trip_id, status: "driver_en_route", driver:{id,name,car,rating},
driver_location:{lat,lng}, eta_s: 180 }
POST /api/v1/rides/{trip_id}/cancel -> { status: "cancelled", fee: 0 }
GET /api/v1/drivers/nearby?lat=..&lng=..&radius_m=3000 (internal, matching)
-> { drivers: [ {id, lat, lng, eta_s, rating}, ... ] }
Live trip tracking is pushed over the rider’s open connection (the matched driver’s pings relayed to the rider), not polled - polling every second for 1M active trips is wasteful.
Data stores - the right tool per plane
1. Live location index - in-memory geospatial (Redis GEO or a custom in-memory QuadTree), NOT a database. Sharded by geohash region. Rebuildable from the last ~1 minute of the Kafka location topic. This is the deliberate use of non-durable, in-memory storage because the data’s lifetime is 4 seconds.
Redis (geo) per region shard:
GEOADD drivers:region_5 <lng> <lat> <driver_id> # updates position
GEOSEARCH drivers:region_5 FROMLONLAT <lng> <lat>
BYRADIUS 3 km ASC COUNT 20 # nearby query
driver:{id}:status -> available | offered | on_trip (with TTL/heartbeat)
2. Trip Store - relational (sharded PostgreSQL / a NewSQL like CockroachDB). Trips are the one place we want SQL: a trip is a transactional entity with a strict state machine, foreign relationships (rider, driver, fare), and records that must be durable and queried by multiple keys. Volume is modest (~230 trips/sec), so a sharded relational fleet is perfect, and it gives us ACID transitions for the state machine.
Table: trips
trip_id BIGINT PRIMARY KEY -- Snowflake, time-sortable
rider_id BIGINT INDEX
driver_id BIGINT INDEX
status ENUM(requested, matched, en_route, arrived,
in_progress, completed, cancelled)
pickup_lat/lng, drop_lat/lng DOUBLE
requested_at, matched_at, completed_at TIMESTAMP
surge DECIMAL(3,2)
fare DECIMAL(10,2)
route_ref STRING -- pointer to route polyline in blob store
city_id INT -- SHARD KEY
Shard by: city_id -- trips are naturally partitioned by geography;
-- a city's trips live together, queried together
Indexes: (rider_id, requested_at), (driver_id, requested_at), status
3. Driver / rider profiles - relational, cached in Redis. Slow-changing reference data (name, car, rating, home city). Read-through cache in front.
4. Surge multipliers - Redis, short TTL. surge:{cell} -> multiplier, written by the streaming aggregator, read O(1) on the estimate/request path.
Why SQL for trips but in-memory/NoSQL for locations: the two planes have opposite needs. Locations are ephemeral, enormous in write rate, and need spatial queries - a durable relational store is exactly wrong (write amplification, index churn). Trips are financial records with a state machine and relationships - eventual consistency and a schemaless bucket store would be exactly wrong (you cannot lose a completed trip’s fare, and you want ACID transitions). Matching the store to the plane is the whole schema decision.
Bottlenecks & Scaling
Where it breaks first, in order, and the fix for each:
1. The location write firehose (breaks first). 2.5M writes/sec will destroy any durable database.
Fix: never write locations to a durable DB on the hot path. Stream through Kafka (keyed by driver_id for per-driver order) into an in-memory, region-sharded geo-index. Accept non-durability - a lost ping self-heals on the next one; a crashed index node rebuilds from a minute of Kafka retention.
2. Nearest-neighbour query cost. A full scan is O(N) over 5M points. Fix: the geohash / QuadTree index turns it into a bounded lookup of one cell plus its 8 neighbours (or a QuadTree leaf plus siblings) - a small, roughly constant candidate set regardless of city density. Exact distance filtering runs only on that small set.
3. Geographic sharding of the index. A single in-memory index node cannot hold or update the whole world at 2.5M writes/sec. Fix: shard the location index by geohash region (each node owns a set of top-level cells). A driver’s pings route to the node owning its region. This spreads both the write load and the query load geographically. Cross-region queries (a request near a shard boundary) fan out to the two adjacent region nodes - the same “query neighbours” logic, one level up.
4. Hot spots - dense cities and events. Manhattan at rush hour, or a stadium emptying, concentrates both drivers and requests onto one region shard and a few cells.
Fix: adaptive cell precision (finer geohash / deeper QuadTree in dense areas) so a hot cell does not return 10,000 candidates; sub-shard hot regions onto dedicated nodes; cap candidates returned (COUNT 20) since you only need the top few by ETA anyway. Because the index is in memory and cheap to move, hot regions can be split dynamically.
5. The double-assignment race (the correctness bottleneck). Concurrent matchers grabbing the same driver.
Fix: strongly-consistent per-driver lock (SET NX / CAS on status='available') so exactly one matcher claims a driver. Rare (~1,200/sec) relative to location writes, so the strong-consistency cost is negligible. TTL on the lock prevents a crashed offer from stranding a driver.
6. Matching latency during offer/decline loops. Sequentially offering to slow-responding drivers adds seconds. Fix: tight offer timeouts (~15s), rank by acceptance rate so likely-accepters are offered first, and optionally offer to a small batch in parallel with first-accept-wins (still guarded by the lock so only one is assigned). Widen the search radius progressively if the initial candidate set is exhausted.
7. Surge aggregation load. Computing supply/demand per cell continuously.
Fix: it is a windowed streaming aggregation (Flink/Kafka Streams), fully off the request path, writing a tiny cell -> multiplier map to Redis with a short TTL. The request path only does an O(1) GET. Smoothing and hysteresis keep it from oscillating.
8. Trip store scaling. Millions of durable trips.
Fix: shard the relational trip store by city_id - trips are naturally geo-partitioned, and per-city queries (analytics, driver history, support) stay within a shard. Read replicas for the tracking reads. Archive completed trips older than N months to cheaper storage; keep the hot set small.
9. Live tracking fan-out. ~1M concurrent trips each pushing the driver’s position to the rider. Fix: relay the driver’s existing ping stream to the matched rider over the rider’s open connection - no extra GPS source, no polling. This is point-to-point (one driver -> one rider per trip), so it scales with connection servers, not with a broadcast fan-out.
10. Single points of failure. Fix: Location Gateways and Location Service shards are stateless-relative-to-Kafka (rebuild from the log); Matching and Pricing are stateless behind the LB; Kafka, Redis, and the trip DB are all replicated. The one strongly-consistent component (the driver lock) runs on a replicated Redis/consensus store with failover. No single box whose loss stops rides - at worst a region’s index rebuilds from the stream while requests briefly retry.
11. Multi-region / global. Fix: deploy the whole stack per region and route by geography (a rider and the drivers around them are always in the same region), so the latency-critical location and matching loop never crosses a region boundary. Trips replicate asynchronously to a global store for analytics; per-conversation-style cross-region coordination is unnecessary because a trip lives entirely in one region.
Wrap-Up
The trade-offs that define this design:
- In-memory streaming location plane over a durable database. We deliberately refuse durability for location data because it lives 4 seconds - trading “we might lose a ping” (self-healing on the next one) for the only architecture that survives 2.5M writes/sec, and keeping the firehose entirely off any relational store.
- Geohash + QuadTree spatial index over full scan or SQL boxes. We map 2D space to prefix-shareable buckets (geohash) with density-adaptive depth (QuadTree) so nearest-neighbour is a bounded lookup of one cell plus neighbours, not an O(N) scan - the single technique that makes “find the nearest driver” cheap.
- Eventual consistency everywhere except the assignment. The location plane, presence, and surge are all eventually consistent; only the driver claim gets a strongly-consistent lock/CAS. We pay for strong consistency exactly once per match (rare) to make double-booking impossible, and nowhere else.
- Ranking by road ETA, not straight-line distance. The geo-index cheaply produces a candidate set; the routing service ranks it by real pickup time and driver behaviour - separating “who is nearby” (fast, spatial) from “who is best” (accurate, road-aware).
- Surge as a windowed read of the same supply/demand the system already streams. Supply is the count per location cell; demand is the request rate the matcher already sees. Surge is an asynchronous aggregation writing a tiny multiplier map, so the price path stays an O(1) cache read.
One-line summary: a geospatial ride-hailing system where millions of drivers stream GPS through Kafka into an in-memory, region-sharded geohash/QuadTree index that answers “nearest driver” as a bounded cell lookup, matching claims a driver with a single strongly-consistent lock to prevent double-booking, ranks candidates by real road ETA, computes surge as a short-window supply/demand aggregation over the same location cells, and persists only the transactional trips to a city-sharded relational store.
Comments