1. Requirements & Scope (5 min)

Functional Requirements

  1. Ingest real-time market data feeds from multiple exchanges (NYSE, NASDAQ, etc.) and normalize into a unified price stream
  2. Fan out live price updates to millions of subscribers via WebSocket with sub-second latency
  3. Maintain and serve order book data (top-of-book and depth) for each symbol
  4. Compute and serve the National Best Bid and Offer (NBBO) by aggregating prices across exchanges
  5. Store and serve historical price data (OHLCV — Open, High, Low, Close, Volume) at multiple time resolutions and support price-change alerting

Non-Functional Requirements

  • Availability: 99.99% during market hours (9:30 AM - 4:00 PM ET). Pre-market and after-hours: 99.9%.
  • Latency: Tick-to-display < 100ms for retail users. Tick-to-internal-processing < 10ms. (Not targeting HFT microsecond latency — this is for retail/fintech platforms.)
  • Consistency: Price data must be ordered correctly (no out-of-order ticks shown to users). NBBO must be accurate within 50ms.
  • Scale: 10,000 symbols, 1M price updates/sec during peak market hours, 5M concurrent WebSocket subscribers.
  • Durability: All tick data stored for regulatory compliance (7 years). Historical OHLCV data stored indefinitely.

2. Estimation (3 min)

Traffic

  • Inbound market data: 1M ticks/sec during peak (10,000 symbols × 100 updates/sec for active symbols)
  • WebSocket subscribers: 5M concurrent connections
  • Fan-out: Each tick for a popular symbol (AAPL, TSLA) may go to 500K subscribers → 500M messages/sec outbound at peak
  • Historical data queries: 10,000 QPS (chart data requests for different symbols/timeframes)

Storage

  • Tick data (raw): 1M ticks/sec × 64 bytes × 6.5 hours/day × 252 trading days ≈ 377TB/year (~1.5TB per trading day; an upper bound, since 1M/sec is the peak rate)
  • OHLCV 1-min candles: 10,000 symbols × 390 candles/day × 252 days × 48 bytes = 47GB/year
  • OHLCV 1-day candles: 10,000 symbols × 252 days × 48 bytes = 121MB/year
  • Order book snapshots (top 10 levels): 10,000 symbols × 10 updates/sec × 200 bytes = 20MB/sec (stored in memory, snapshots persisted hourly)

Key Insight

This is a fan-out-heavy system. Ingesting 1M ticks/sec is manageable, but fanning out each tick to potentially 500K subscribers creates a 500:1 amplification. The WebSocket fan-out layer is the hardest scaling challenge. Historical data is a classic time-series storage problem.


3. API Design (3 min)

Real-Time Price Subscription (WebSocket)

// Client connects
WS wss://stream.example.com/v1/prices

// Client subscribes to symbols
→ { "action": "subscribe", "symbols": ["AAPL", "TSLA", "GOOGL"] }
← { "type": "subscribed", "symbols": ["AAPL", "TSLA", "GOOGL"] }

// Server pushes price updates
← { "type": "tick", "symbol": "AAPL", "price": 185.42, "bid": 185.41,
    "ask": 185.43, "volume": 1200, "exchange": "NASDAQ",
    "timestamp": 1708632060123 }

// Client unsubscribes
→ { "action": "unsubscribe", "symbols": ["TSLA"] }

// Server pushes NBBO updates
← { "type": "nbbo", "symbol": "AAPL", "best_bid": 185.41,
    "best_ask": 185.42, "bid_exchange": "NYSE", "ask_exchange": "NASDAQ",
    "timestamp": 1708632060125 }

Historical Data (REST)

GET /v1/prices/{symbol}/candles?interval=1m&from=1708600000&to=1708632000
  Response: {
    "symbol": "AAPL",
    "interval": "1m",
    "candles": [
      { "time": 1708600000, "open": 184.50, "high": 184.75,
        "low": 184.30, "close": 184.60, "volume": 52340 },
      ...
    ]
  }

GET /v1/prices/{symbol}/book?depth=10
  Response: {
    "symbol": "AAPL",
    "bids": [
      { "price": 185.41, "quantity": 500, "exchange": "NYSE" },
      { "price": 185.40, "quantity": 1200, "exchange": "NASDAQ" },
      ...
    ],
    "asks": [ ... ],
    "timestamp": 1708632060123
  }

Price Alerts

POST /v1/alerts
  Body: { "symbol": "AAPL", "condition": "price_above", "threshold": 190.00,
          "notify": ["push", "email"] }
  Response 201: { "alert_id": "alert_123", "status": "active" }

Key Decisions

  • WebSocket for real-time (not SSE) — bidirectional allows subscribe/unsubscribe without reconnection
  • Clients subscribe to specific symbols (not all market data) — reduces fan-out per connection
  • Historical API uses standard REST with caching headers (ETags, Cache-Control) — candle data for closed periods is immutable

4. Data Model (3 min)

Real-Time State (Redis / In-Memory)

// Last trade per symbol
HSET price:AAPL
  last_price: 185.42
  bid: 185.41
  ask: 185.43
  volume: 52340
  timestamp: 1708632060123
  change_pct: 0.45
  day_high: 186.10
  day_low: 183.90

// Order book (sorted sets per side)
ZADD book:AAPL:bids 185.41 "185.41:500:NYSE"
ZADD book:AAPL:bids 185.40 "185.40:1200:NASDAQ"
ZADD book:AAPL:asks 185.42 "185.42:800:NASDAQ"

Tick Data (ClickHouse — time-series optimized)

Table: ticks
  symbol           | LowCardinality(String)
  exchange         | LowCardinality(String)
  price            | Decimal64(4)
  bid              | Decimal64(4)
  ask              | Decimal64(4)
  volume           | UInt32
  timestamp        | DateTime64(3)           -- millisecond precision
  tick_type        | Enum8('trade', 'quote', 'nbbo')

  ENGINE = MergeTree()
  PARTITION BY toDate(timestamp)         -- date only: adding symbol would create millions of tiny partitions
  ORDER BY (symbol, timestamp)           -- symbol locality comes from the sort key instead
  TTL timestamp + INTERVAL 7 YEAR DELETE

OHLCV Candles (ClickHouse — materialized views)

Table: candles_1m (materialized view from ticks)
  symbol           | LowCardinality(String)
  time_bucket      | DateTime                 -- 1-minute bucket
  open             | Decimal64(4)
  high             | Decimal64(4)
  low              | Decimal64(4)
  close            | Decimal64(4)
  volume           | UInt64
  trade_count      | UInt32

Table: candles_1h, candles_1d (aggregated from candles_1m)

Price Alerts (PostgreSQL)

Table: price_alerts
  alert_id         (PK) | uuid
  user_id          (FK) | uuid
  symbol                | varchar(10)
  condition             | enum('price_above', 'price_below', 'change_pct_above', 'change_pct_below')
  threshold             | decimal(12,4)
  notify_channels       | varchar(20)[]
  status                | enum('active', 'triggered', 'expired')
  created_at            | timestamp
  triggered_at          | timestamp

Why ClickHouse?

  • Columnar storage with extreme compression for time-series data (10-15x compression on price data)
  • Handles 1M inserts/sec with ease (designed for high-throughput ingestion)
  • Blazing-fast aggregation queries for OHLCV computation
  • Partition by date + symbol for efficient time-range queries

5. High-Level Design (12 min)

Architecture

Market Data Feeds (NYSE, NASDAQ, BATS, etc.)
  → Feed Handlers (one per exchange, protocol-specific)
    → Normalize to unified tick format
    → Publish to Kafka (market-data topic, partitioned by symbol)

Kafka (market-data, 256 partitions by symbol hash):
  → Consumer 1: NBBO Engine
    → Computes best bid/offer across exchanges
    → Publishes NBBO updates to Kafka (nbbo topic)

  → Consumer 2: Candle Builder
    → Maintains in-memory OHLCV state per symbol per interval
    → Flushes completed candles to ClickHouse every minute

  → Consumer 3: Tick Writer
    → Batch inserts raw ticks to ClickHouse (10K rows/batch)

  → Consumer 4: Alert Evaluator
    → Checks each tick against active alerts for that symbol
    → Triggers notifications for matched alerts

  → Consumer 5: Fan-Out Service
    → Routes ticks to WebSocket servers based on subscriber interest

WebSocket Layer (fan-out):
  WebSocket Servers (500 instances, each handles 10K connections)
    → Subscriber Registry: maps symbol → set of connections
    → On tick for AAPL: lookup all connections subscribed to AAPL → push

  Client → Load Balancer (sticky by connection) → WebSocket Server
    → Subscribes to symbols → registered in local subscriber map
    → Also registered in global registry (Redis pub/sub or Kafka)

Historical Queries:
  Client → API Gateway → Query Service → ClickHouse (candles/ticks)
    → Response cached in Redis (1-min TTL for recent data, infinite for closed periods)

Fan-Out Architecture (Critical Path)

Tick for AAPL arrives (500K subscribers interested):
  1. Fan-Out Service receives tick from Kafka
  2. Lookup: which WebSocket servers have AAPL subscribers?
     → Redis SET "subscribers:AAPL" = {ws-server-1, ws-server-5, ws-server-42, ...}
  3. Publish to each relevant WebSocket server's channel:
     → Redis PUB "ws-server-1:ticks" → { symbol: AAPL, price: 185.42 }
     → Redis PUB "ws-server-5:ticks" → ...
  4. Each WebSocket server receives tick, fans out to local subscribers
     → ws-server-1 has 2,000 AAPL subscribers → sends 2,000 messages
     → ws-server-5 has 1,500 AAPL subscribers → sends 1,500 messages

Total fan-out: 1 tick → 500 WebSocket servers × 1,000 local sends = 500K messages
Latency budget: Kafka → Fan-Out → Redis PUB → WS Server → Client = < 50ms
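The routing step above can be sketched with in-memory stand-ins for the Redis pieces (`FanOutRouter` and its method names are hypothetical; `server_subs` plays the role of the `subscribers:<symbol>` SET, `channels` the per-server pub/sub channel):

```python
from collections import defaultdict

class FanOutRouter:
    """Tier-1 router: one tick in, one message per interested WS server out."""

    def __init__(self):
        self.server_subs = defaultdict(set)   # symbol -> {server_id, ...}
        self.channels = defaultdict(list)     # server_id -> delivered ticks

    def register(self, server_id: str, symbol: str) -> None:
        """Record that some connection on this server wants this symbol."""
        self.server_subs[symbol].add(server_id)

    def route(self, tick: dict) -> int:
        """Publish the tick once per interested server; return messages sent."""
        servers = self.server_subs.get(tick["symbol"], set())
        for server_id in servers:
            self.channels[server_id].append(tick)
        return len(servers)
```

The per-connection amplification then happens locally on each WebSocket server, which keeps the router's output bounded by the server count rather than the subscriber count.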

Components

  1. Feed Handlers: Exchange-specific parsers (FIX protocol for NYSE, ITCH for NASDAQ). One handler per exchange, stateless, auto-reconnecting.
  2. NBBO Engine: Stateful processor maintaining best bid/ask per symbol across all exchanges. Outputs NBBO changes to downstream consumers.
  3. Candle Builder: Maintains in-memory OHLCV accumulators. First tick in window sets Open. Each tick updates High/Low/Close/Volume. Flushes on window close.
  4. Tick Writer: High-throughput batch writer to ClickHouse. Buffers ticks in memory, flushes every second or every 10K ticks.
  5. Fan-Out Service: Routes ticks to the correct WebSocket servers based on subscriber interest. The scaling bottleneck — must handle 1M ticks/sec input, 500M messages/sec output.
  6. WebSocket Servers: Maintain persistent connections with clients. Each server handles 10K connections. Stateless except for local subscriber list.
  7. Alert Evaluator: Checks incoming ticks against active alerts. Uses in-memory index (symbol → alerts sorted by threshold) for O(log N) evaluation.
  8. Query Service: Serves historical data requests. Caches immutable candle data aggressively.
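The Alert Evaluator's sorted-threshold index (component 7) maps cleanly onto the stdlib bisect module. A sketch for `price_above` alerts only (class and field names are ours):

```python
import bisect

class AlertIndex:
    """Per-symbol index of 'price_above' alerts sorted by threshold.

    A tick at price p triggers every alert with threshold below p,
    located via binary search instead of scanning all alerts.
    """

    def __init__(self):
        self.by_symbol = {}  # symbol -> sorted list of (threshold, alert_id)

    def add(self, symbol: str, threshold: float, alert_id: str) -> None:
        bisect.insort(self.by_symbol.setdefault(symbol, []), (threshold, alert_id))

    def matches(self, symbol: str, price: float) -> list[str]:
        """Return alert_ids whose threshold is strictly below the tick price."""
        alerts = self.by_symbol.get(symbol, [])
        cut = bisect.bisect_left(alerts, (price, ""))
        return [alert_id for _, alert_id in alerts[:cut]]
```

`price_below` alerts would use a mirror-image index; matched alerts are then marked triggered in PostgreSQL so they fire once.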

6. Deep Dives (15 min)

Deep Dive 1: WebSocket Fan-Out at Scale (5M Concurrent Connections)

The problem: 5M concurrent WebSocket connections, each subscribed to 5-20 symbols. When AAPL ticks 100 times/sec and has 500K subscribers, we need to deliver 50M messages/sec just for one symbol. Total across all symbols: hundreds of millions of messages per second.

Two-tier fan-out architecture:

Tier 1: Kafka → Fan-Out Workers (50 instances)
  - Each worker handles a partition of symbols
  - Worker for AAPL: reads tick → looks up which WS server groups have AAPL subscribers
  - Publishes to per-server-group channels (not per-connection)
  - Fan-out: 1 tick → 50 server groups = 50 messages (manageable)

Tier 2: WS Server Group → Individual Connections
  - Each server group: 10 WS servers behind a load balancer
  - Each WS server: 10K connections
  - Server receives tick for AAPL → iterates local subscriber set → sends to each
  - Fan-out: 1 tick per server → 1,000 sends (10K connections, ~10% subscribed to AAPL)
  - Total: 50 groups × 10 servers × 1,000 sends = 500K deliveries

Optimizations:

  1. Message batching: Instead of sending each tick individually, buffer 100ms of ticks per symbol and send as a batch. Reduces per-connection message rate from 100/sec to 10/sec.
       Instead of: 100 individual AAPL ticks/sec per connection
       Send: 10 batches/sec, each with 10 ticks
       Reduces: WebSocket frame overhead by 10x
  2. Subscription-level throttling: Not all users need 100 ticks/sec. Free users get 1 update/sec (last-value throttle). Pro users get full speed.
       For free users: maintain last_sent timestamp per (connection, symbol)
       If now - last_sent < 1 second: skip (will get next tick)
       Reduces outbound messages by 99% for free users
  3. Binary encoding: Use Protocol Buffers or MessagePack instead of JSON for WebSocket messages. 3-5x smaller payload.
       JSON: {"type":"tick","symbol":"AAPL","price":185.42,"bid":185.41,"ask":185.43,"volume":1200,"timestamp":1708632060123}
         = 120 bytes
       Protobuf: [binary encoded same data]
         = 32 bytes
  4. Connection affinity: Users subscribing to similar symbols are routed to the same WS server group. This means each server group handles fewer distinct symbols, reducing the fan-out surface.

Deep Dive 2: NBBO Computation and Order Book Management

The problem: Each exchange has its own order book for each symbol. The National Best Bid and Offer (NBBO) is the best bid and best ask across ALL exchanges. It changes every time any exchange’s top-of-book changes. Must be computed in < 10ms.

NBBO computation:

State per symbol:
  exchange_books = {
    "NYSE":    { best_bid: 185.40, bid_size: 500,  best_ask: 185.43, ask_size: 800 },
    "NASDAQ":  { best_bid: 185.41, bid_size: 1200, best_ask: 185.42, ask_size: 600 },
    "BATS":    { best_bid: 185.39, bid_size: 300,  best_ask: 185.44, ask_size: 400 },
    "IEX":     { best_bid: 185.40, bid_size: 200,  best_ask: 185.43, ask_size: 500 }
  }

NBBO = {
  best_bid: max(all exchange best_bids) = 185.41 (NASDAQ)
  best_ask: min(all exchange best_asks) = 185.42 (NASDAQ)
  spread: best_ask - best_bid = 0.01
}

On each exchange quote update:
  1. Update that exchange's entry in exchange_books
  2. Recompute NBBO (scan 4-15 exchanges, O(E) where E is small)
  3. If NBBO changed: publish NBBO update
  4. Detect locked market (bid = ask) or crossed market (bid > ask) → flag
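The recompute step is a small max/min scan over the exchange_books shape shown above (the `compute_nbbo` function name is ours):

```python
def compute_nbbo(exchange_books: dict) -> dict:
    """Aggregate per-exchange top-of-book into the NBBO.

    best_bid is the highest bid across exchanges, best_ask the lowest ask.
    A locked (bid == ask) or crossed (bid > ask) market can appear
    transiently when one exchange feed lags another.
    """
    bid_exchange = max(exchange_books, key=lambda e: exchange_books[e]["best_bid"])
    ask_exchange = min(exchange_books, key=lambda e: exchange_books[e]["best_ask"])
    best_bid = exchange_books[bid_exchange]["best_bid"]
    best_ask = exchange_books[ask_exchange]["best_ask"]
    return {
        "best_bid": best_bid,
        "best_ask": best_ask,
        "bid_exchange": bid_exchange,
        "ask_exchange": ask_exchange,
        "spread": round(best_ask - best_bid, 4),
        "locked": best_bid == best_ask,
        "crossed": best_bid > best_ask,
    }
```

With 4-15 exchanges this scan is a handful of comparisons, so recomputing on every quote update is far cheaper than maintaining incremental state.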

Order book management (top N levels):

Per symbol, per exchange:
  Sorted map: price_level → (quantity, order_count)

  Bids: descending by price (highest bid first)
    185.41 → (1200 shares, 15 orders)
    185.40 → (500 shares, 8 orders)
    185.39 → (2000 shares, 22 orders)
    ...

  Asks: ascending by price (lowest ask first)
    185.42 → (600 shares, 10 orders)
    185.43 → (800 shares, 12 orders)
    185.44 → (400 shares, 5 orders)
    ...

Update types:
  ADD: new order at price level → increment quantity
  REMOVE: order cancelled → decrement quantity (remove level if 0)
  TRADE: execution → decrement quantity at best price level
  REPLACE: order modified → update quantity at price level

Consistency guarantee:

  • Exchange feeds are processed in strict sequence number order
  • If a gap in sequence is detected → request retransmission from exchange
  • During retransmission, NBBO for that symbol is marked as “stale” (not published to users)
  • Stale NBBO resolved within 1 second (exchange retransmission is fast)

Deep Dive 3: Historical Data Storage and OHLCV Candle Generation

The problem: Store roughly 377TB/year of raw tick data, compute OHLCV candles at multiple resolutions (1m, 5m, 15m, 1h, 1d, 1w), and serve chart queries with < 200ms latency for any symbol and time range.

Real-time candle generation (streaming):

Candle Builder state (in-memory, per symbol, per interval):
  current_candle = {
    symbol: "AAPL",
    interval: "1m",
    window_start: 1708632060000,          // 2024-02-22 20:01:00 UTC
    window_end:   1708632120000,          // 20:02:00 UTC
    open:  185.40,                         // first tick in window
    high:  185.55,                         // max price so far
    low:   185.35,                         // min price so far
    close: 185.42,                         // last tick received
    volume: 12340,                         // cumulative volume
    trade_count: 87                        // number of trades
  }

On each tick:
  1. If tick.timestamp >= window_end:
     → Flush current_candle to ClickHouse
     → Start new candle (open = tick.price)
  2. Else:
     → Update high = max(high, tick.price)
     → Update low = min(low, tick.price)
     → Update close = tick.price
     → Update volume += tick.volume
     → Update trade_count++
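The tick-handling rules above translate directly into code. A minimal single-symbol sketch (the `CandleBuilder` class is illustrative; the `on_flush` callback stands in for the batched ClickHouse write):

```python
class CandleBuilder:
    """Streaming OHLCV accumulator for one symbol at one interval."""

    def __init__(self, interval_ms: int, on_flush):
        self.interval_ms = interval_ms
        self.on_flush = on_flush  # called with each completed candle
        self.candle = None

    def on_tick(self, ts_ms: int, price: float, volume: int) -> None:
        window_start = ts_ms - ts_ms % self.interval_ms
        if self.candle is None or window_start >= self.candle["window_end"]:
            if self.candle is not None:
                self.on_flush(self.candle)  # window closed: persist it
            # First tick of the new window seeds all four OHLC fields.
            self.candle = {
                "window_start": window_start,
                "window_end": window_start + self.interval_ms,
                "open": price, "high": price, "low": price, "close": price,
                "volume": volume, "trade_count": 1,
            }
            return
        c = self.candle
        c["high"] = max(c["high"], price)
        c["low"] = min(c["low"], price)
        c["close"] = price
        c["volume"] += volume
        c["trade_count"] += 1
```

A production builder also needs a timer-driven flush, since a thinly traded symbol may receive no tick to trigger the window close.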

Higher-resolution candles built from lower:
  1m candles → aggregate → 5m candles
  5m candles → aggregate → 15m candles
  15m candles → aggregate → 1h candles
  1h candles → aggregate → 1d candles
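Rolling lower-resolution candles up is a pure reduce over each group. A sketch (assumes input candles are sorted by time_bucket and complete; the function name is ours):

```python
def aggregate_candles(candles: list[dict], factor: int) -> list[dict]:
    """Merge consecutive groups of `factor` candles into coarser ones.

    Open comes from the first candle of the group, close from the last;
    high/low/volume are max/min/sum across the group.
    """
    out = []
    for i in range(0, len(candles), factor):
        group = candles[i:i + factor]
        out.append({
            "time_bucket": group[0]["time_bucket"],
            "open": group[0]["open"],
            "high": max(c["high"] for c in group),
            "low": min(c["low"] for c in group),
            "close": group[-1]["close"],
            "volume": sum(c["volume"] for c in group),
        })
    return out
```

In ClickHouse this same reduce is expressed as a materialized view using argMin/argMax for open/close, so the 5m/15m/1h tables stay current without application code.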

Historical query optimization:

Query: "AAPL 1-minute candles for the last 30 days"
  = 10,000 symbols × 390 candles/day × 30 days = 117M rows total
  But for one symbol: 390 × 30 = 11,700 rows → scans in < 10ms on ClickHouse

Query: "AAPL 1-day candles for 5 years"
  = 252 × 5 = 1,260 rows → trivially fast

Query: "All symbols' latest 1-day candle" (screener/heatmap)
  = 10,000 rows → fast with materialized view

Caching strategy:
  - Completed candles (past minutes/hours/days): cache forever (immutable)
  - Current candle (in-progress minute): cache with 5-second TTL
  - Pre-compute popular views: top 100 symbols' candles pre-cached in Redis

Data lifecycle:

Raw ticks:
  0-7 days: hot storage (ClickHouse SSD)
  7-90 days: warm storage (ClickHouse HDD)
  90 days - 7 years: cold storage (S3 Parquet files, query via ClickHouse external tables)

OHLCV candles (1m):
  0-1 year: ClickHouse SSD
  1-7 years: ClickHouse HDD

OHLCV candles (1d):
  All time: ClickHouse SSD (tiny data, ~1MB/symbol for 20 years)

7. Extensions (2 min)

  • Technical indicators: Compute common indicators (SMA, EMA, RSI, MACD, Bollinger Bands) as streaming aggregations from candle data. Pre-compute for popular symbols, compute on-demand for others. Serve via API alongside candle data.
  • Market replay: Allow users to “replay” a trading day at 1x or faster speed. Stream historical ticks through the same WebSocket interface as live data. Useful for backtesting trading strategies.
  • Cross-asset correlation: Compute real-time correlation matrices between stocks, crypto, forex, and commodities. Detect unusual correlation breakdowns as early warning signals for market events.
  • Regulatory audit trail: Immutable log of all price data received, all NBBO changes, and all data served to clients. Queryable by regulators (SEC/FINRA). Stored in append-only, tamper-proof storage with cryptographic chaining.
  • Personalized watchlists with smart alerts: ML-based alerts beyond simple thresholds — “AAPL is showing unusual volume” or “TSLA broke through 200-day moving average.” Alert conditions expressed in a DSL that compiles to streaming queries.