1. Requirements & Scope (5 min)

Functional Requirements

  1. Order placement — users can place market, limit, and stop-loss orders for equities, F&O, and ETFs
  2. Real-time portfolio tracking — show holdings, positions, P&L (realized + unrealized), and margin utilization in real time
  3. Market data feed — stream live prices (LTP, bid/ask, depth) to clients via WebSocket
  4. Order book management — maintain per-user order book with status lifecycle (pending → open → partially filled → executed → settled, or terminal rejected / cancelled)
  5. Margin & risk management — pre-trade risk checks: sufficient margin, position limits, circuit breaker enforcement, and exposure calculations

Non-Functional Requirements

  • Availability: 99.99% during market hours (9:15 AM – 3:30 PM IST). Every minute of downtime = real money lost for users.
  • Latency: < 10ms for order placement (broker-side), < 50ms for end-to-end order acknowledgement. Market data tick-to-display < 200ms.
  • Consistency: Orders MUST be strongly consistent — a placed order must never be lost. Portfolio and P&L can be eventually consistent (~1s lag acceptable).
  • Scale: 10M registered users, 1M concurrent during market hours, 50K orders/sec peak (market open spike at 9:15 AM), 500K+ price ticks/sec from exchange feed.
  • Durability: Zero tolerance for order loss. Every order must be audit-trailed with nanosecond-precision timestamps for regulatory compliance (SEBI/SEC).

2. Estimation (3 min)

Traffic

  • Orders: 50K orders/sec peak (market open), ~5K orders/sec steady state during market hours
  • Market data: 5,000 instruments × 100 ticks/sec = 500K ticks/sec ingested from exchange
  • WebSocket connections: 1M concurrent users, each subscribed to ~10 instruments on average
  • Fan-out: each tick fans out to subscribed users — raw fan-out would be ~1B messages/sec (1M users × 10 instruments × 100 ticks/sec); server-side throttling brings the outbound WebSocket peak to ~40M messages/sec

Storage

  • Orders: 50M orders/day × 500 bytes = 25 GB/day → 6 TB/year (must retain 7+ years for regulatory compliance)
  • Trade history: ~20M trades/day × 300 bytes = 6 GB/day
  • Market data (OHLCV candles): 5,000 instruments × 375 min × 100 bytes = 188 MB/day for 1-min candles. Tick-level data: 500K ticks/sec × 6.25 hrs × 50 bytes ≈ 560 GB/day
  • User portfolios: 10M users × 1 KB average = 10 GB (fits in memory for hot data)

Key Insight

This is a latency-critical, correctness-critical system with a massive spike pattern. 9:15 AM (market open) sees 10–20× traffic surge within seconds. The order pipeline must be idempotent, durable (WAL + queue), and pre-warmed. Market data is a classic pub-sub fan-out problem — 500K ticks/sec → millions of WebSocket pushes.


3. API Design (3 min)

Order Placement

POST /api/v1/orders
  Headers: Authorization: Bearer <token>
  Body: {
    "instrument_id": "NSE:RELIANCE",
    "order_type": "LIMIT",            // MARKET | LIMIT | SL | SL-M
    "transaction_type": "BUY",        // BUY | SELL
    "quantity": 100,
    "price": 2450.50,                 // required for LIMIT/SL
    "trigger_price": 2445.00,         // required for SL/SL-M
    "product": "CNC",                 // CNC (delivery) | MIS (intraday) | NRML (F&O)
    "validity": "DAY",                // DAY | IOC | GTT
    "idempotency_key": "uuid-v4"      // client-generated, prevents duplicate orders
  }
  Response 202: {
    "order_id": "230215000012345",
    "status": "PENDING_VALIDATION",
    "timestamp": "2025-02-15T09:15:00.123456789Z"
  }
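
The idempotency contract is easiest to see from the client side. A minimal sketch (Python; the `send` callable is a hypothetical stand-in for the HTTP POST): the key is generated once per logical order, so any number of network retries can only ever create one order.

```python
import uuid

def place_order_with_retry(send, order, max_attempts=3):
    """Retry order placement safely: the idempotency key is generated
    once, so the broker deduplicates transport-level retries."""
    payload = dict(order, idempotency_key=str(uuid.uuid4()))
    last_err = None
    for _ in range(max_attempts):
        try:
            return send(payload)      # e.g. POST /api/v1/orders
        except ConnectionError as err:
            last_err = err            # same key is re-sent, never a fresh one
    raise last_err
```
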

Order Management

PUT    /api/v1/orders/{order_id}        — modify open order (price/qty)
DELETE /api/v1/orders/{order_id}        — cancel open order
GET    /api/v1/orders                   — list today's orders (with filters)
GET    /api/v1/orders/{order_id}        — order detail + execution history

Portfolio & Positions

GET /api/v1/portfolio/holdings          — delivery holdings with avg cost, P&L
GET /api/v1/portfolio/positions         — intraday + F&O positions with live P&L
GET /api/v1/portfolio/margins           — available margin, used margin, exposure

Market Data (WebSocket)

WS /api/v1/stream
  → Subscribe:   {"action": "subscribe", "instruments": ["NSE:RELIANCE", "NSE:INFY"]}
  ← Tick:        {"instrument": "NSE:RELIANCE", "ltp": 2451.30, "bid": 2451.25,
                   "ask": 2451.35, "volume": 1234567, "timestamp": 1708012500123}
  ← Order Update: {"type": "order_update", "order_id": "...", "status": "EXECUTED", ...}

Key Decisions

  • 202 Accepted for order placement — order enters async pipeline. Client gets status via WebSocket push or polling.
  • Idempotency key is mandatory — network retries must not create duplicate orders.
  • Market data and order updates share the same WebSocket connection to minimize connection overhead.

4. Data Model (3 min)

Orders Table (PostgreSQL — strong consistency, ACID for financial data)

orders
├── order_id          BIGINT PRIMARY KEY (exchange-format sequential ID)
├── user_id           BIGINT NOT NULL (FK → users)
├── instrument_id     VARCHAR(30) NOT NULL          -- "NSE:RELIANCE"
├── order_type        ENUM('MARKET','LIMIT','SL','SL_M')
├── transaction_type  ENUM('BUY','SELL')
├── product           ENUM('CNC','MIS','NRML')
├── quantity          INT NOT NULL
├── price             DECIMAL(12,2)                  -- NULL for MARKET orders
├── trigger_price     DECIMAL(12,2)                  -- NULL if not SL
├── filled_quantity   INT DEFAULT 0
├── avg_fill_price    DECIMAL(12,2)
├── status            ENUM('PENDING','OPEN','PARTIALLY_FILLED','EXECUTED','CANCELLED','REJECTED')
├── idempotency_key   UUID UNIQUE
├── validity          ENUM('DAY','IOC','GTT')
├── exchange_order_id VARCHAR(30)                    -- ID returned by exchange
├── reject_reason     TEXT
├── created_at        TIMESTAMP(6) NOT NULL          -- Postgres caps timestamps at microseconds
├── created_at_ns     BIGINT NOT NULL                -- epoch nanoseconds, for the audit trail
├── updated_at        TIMESTAMP(6) NOT NULL
├── INDEX (user_id, created_at DESC)
├── INDEX (instrument_id, status)
└── INDEX (idempotency_key)                          -- implicit via the UNIQUE constraint

Holdings Table

holdings
├── user_id           BIGINT
├── instrument_id     VARCHAR(30)
├── quantity          INT
├── avg_cost          DECIMAL(12,2)
└── PRIMARY KEY (user_id, instrument_id)

Positions Table (intraday/F&O — recalculated each trading day)

positions
├── user_id           BIGINT
├── instrument_id     VARCHAR(30)
├── product           ENUM('MIS','NRML')
├── buy_quantity      INT
├── sell_quantity     INT
├── buy_avg           DECIMAL(12,2)
├── sell_avg          DECIMAL(12,2)
├── realized_pnl      DECIMAL(14,2)
└── PRIMARY KEY (user_id, instrument_id, product)

Margin Ledger

margin_ledger
├── user_id           BIGINT PRIMARY KEY
├── available_cash    DECIMAL(14,2)
├── used_margin       DECIMAL(14,2)
├── blocked_margin    DECIMAL(14,2)     -- blocked for pending orders
├── exposure_margin   DECIMAL(14,2)     -- for F&O positions
└── updated_at        TIMESTAMP

Why PostgreSQL?

  • Orders are financial transactions — ACID compliance is non-negotiable
  • Need complex queries for regulatory reporting, P&L computation, audit trails
  • Partitioning by date for orders table (easy archival of old data)
  • Read replicas for portfolio queries (eventual consistency acceptable for reads)
  • Redis for hot path: margin cache, position cache, session store, idempotency dedup

5. High-Level Design (12 min)

System Architecture

                         ┌──────────────┐
                         │   Exchange    │
                         │ (NSE / BSE)  │
                         └──────┬───────┘
                                │
                    ┌───────────┴───────────┐
                    │  Exchange Gateway      │  ← FIX protocol / binary protocol
                    │  (Order Router +       │  ← Sends orders, receives fills
                    │   Market Data Recv)    │  ← Receives 500K ticks/sec
                    └─────┬──────────┬──────┘
                          │          │
              ┌───────────┘          └───────────────┐
              ▼                                      ▼
    ┌──────────────────┐                  ┌──────────────────┐
    │  Order Execution │                  │  Market Data     │
    │  Engine          │                  │  Distributor     │
    │  (Order Queue +  │                  │  (Kafka topic    │
    │   Risk Check +   │                  │   per instrument │
    │   Matching)      │                  │   group)         │
    └────────┬─────────┘                  └────────┬─────────┘
             │                                     │
             │                                     ▼
             │                            ┌──────────────────┐
             │                            │  WebSocket       │
             │                            │  Gateway Cluster │
             │                            │  (1M connections)│
             │                            └────────┬─────────┘
             │                                     │
             ▼                                     ▼
    ┌──────────────────┐               ┌───────────────────┐
    │  PostgreSQL      │               │  Mobile / Web     │
    │  (Orders, Trades │               │  Clients          │
    │   Holdings)      │               └───────────────────┘
    └────────┬─────────┘                        ▲
             │                                  │
             ▼                                  │
    ┌──────────────────┐               ┌──────────────────┐
    │  Settlement      │               │  API Gateway      │
    │  Service         │               │  (Auth, Rate      │
    │  (T+1 batch)     │               │   Limit, Route)   │
    └──────────────────┘               └──────────────────┘

Component Breakdown

1. API Gateway

  • Handles authentication (JWT + session), rate limiting (1000 req/sec per user), and request routing
  • TLS termination, request logging for audit trail
  • Separate pools for order placement (low-latency path) vs portfolio queries (can tolerate higher latency)

2. Order Execution Engine

  • Consumes from Kafka order queue (partitioned by user_id for ordering guarantees)
  • Pipeline: Validation → Idempotency Check → Margin Block → Risk Check → Send to Exchange
  • Each step is synchronous within the pipeline — total < 5ms broker-side
  • Idempotency: check Redis SET NX on idempotency_key before processing
  • Margin block: atomic decrement in Redis (optimistic lock), reconcile with PostgreSQL async
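
The pipeline steps can be sketched end to end. This is a simplified, non-authoritative model: `r` is any Redis-like client (redis-py compatible `set`/`get`/`decrby`/`incrby`), `required_margin` is assumed precomputed in integer paise, and the margin step is shown as get-then-decrement for readability — in production it must be the atomic Lua script of Deep Dive 2.

```python
def process_order(r, order, required_margin, send_to_exchange):
    """Broker-side order pipeline sketch: idempotency -> margin block
    -> risk check -> exchange hand-off. All names illustrative."""
    # 1. Idempotency: SET NX claims the key exactly once; a network
    #    retry with the same key becomes a harmless no-op.
    if not r.set("idem:" + order["idempotency_key"], "1", nx=True, ex=86400):
        return {"status": "DUPLICATE"}
    # 2. Margin block (one atomic Lua script in production; split here).
    key = "margin:%d" % order["user_id"]
    if int(r.get(key) or 0) < required_margin:
        return {"status": "REJECTED", "reason": "INSUFFICIENT_MARGIN"}
    r.decrby(key, required_margin)
    # 3. Risk check (price bands, limits) — only a placeholder check here.
    if order["quantity"] <= 0:
        r.incrby(key, required_margin)   # roll back the margin block
        return {"status": "REJECTED", "reason": "INVALID_QUANTITY"}
    # 4. Hand off to the exchange gateway (FIX session behind this call).
    send_to_exchange(order)
    return {"status": "OPEN"}
```
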

3. Exchange Gateway

  • FIX 4.2/5.0 protocol adapter for exchange communication
  • Maintains persistent TCP connections to the exchange — multiple FIX sessions for redundancy and throughput (5 sessions, sized in the capacity planning of Deep Dive 1)
  • Handles order acknowledgements, partial fills, full fills, and rejections
  • Publishes execution reports back to Kafka → consumed by Order Status Updater

4. Market Data Distributor

  • Receives raw tick data from exchange multicast feed (UDP)
  • Normalizes, timestamps, and publishes to Kafka topics (partitioned by instrument group)
  • Fan-out to WebSocket gateways via internal pub-sub (Redis Pub/Sub or dedicated message bus)

5. WebSocket Gateway Cluster

  • Stateful connections — each server holds ~100K connections
  • Subscription registry: which user subscribes to which instruments (in-memory hash map)
  • On tick arrival: look up subscribers → push to their WebSocket connections
  • Horizontal scaling: consistent hashing of instruments across gateway nodes
  • Heartbeat + reconnection handling
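
The in-memory subscription registry plus the fan-out step can be sketched as follows; `conn` stands for any object with a `send()` method, a stand-in for a real WebSocket handle.

```python
from collections import defaultdict

class SubscriptionRegistry:
    """Per-gateway-node map from instrument -> connected clients.
    A sketch only: real code needs locking and slow-consumer handling."""
    def __init__(self):
        self.subs = defaultdict(set)

    def subscribe(self, conn, instruments):
        for ins in instruments:
            self.subs[ins].add(conn)

    def unsubscribe(self, conn, instruments):
        for ins in instruments:
            self.subs[ins].discard(conn)

    def publish(self, instrument, tick):
        # On tick arrival: look up subscribers, push to each connection.
        for conn in self.subs.get(instrument, ()):
            conn.send(tick)
```
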

6. Risk Management Service

  • Pre-trade checks: margin sufficiency, quantity limits, price band checks (circuit breaker)
  • Real-time position tracking: aggregated net position per user per instrument
  • Kill switch: can disable trading for a user or instrument instantly
  • Runs as a sidecar to the Order Execution Engine for minimal latency

7. Settlement Service

  • Batch job runs after market close
  • T+1 settlement: converts blocked margin into final debits/credits on the user's ledger
  • Updates holdings table (delivery trades), closes intraday positions
  • Generates contract notes (regulatory requirement)
  • Reconciles with exchange settlement files

8. Portfolio Service

  • Computes real-time P&L by combining positions with live market prices
  • Holdings P&L = (LTP - avg_cost) × quantity
  • Positions P&L = realized + unrealized (mark-to-market)
  • Caches in Redis, updated on each relevant tick
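
The two P&L formulas reduce to a few lines of arithmetic. A sketch only; the realized leg assumes simple average-price matching, which is one common convention.

```python
def holdings_pnl(ltp, avg_cost, quantity):
    """Unrealized P&L on delivery holdings: (LTP - avg cost) x qty."""
    return (ltp - avg_cost) * quantity

def position_pnl(buy_qty, sell_qty, buy_avg, sell_avg, ltp):
    """Intraday/F&O P&L = realized (matched buy/sell legs) +
    unrealized (open net quantity marked to market at LTP)."""
    matched = min(buy_qty, sell_qty)
    realized = matched * (sell_avg - buy_avg)
    net = buy_qty - sell_qty
    if net > 0:    # net long: open quantity carried at buy_avg
        unrealized = net * (ltp - buy_avg)
    else:          # net short: open quantity carried at sell_avg
        unrealized = -net * (sell_avg - ltp)
    return realized + unrealized
```
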

6. Deep Dives (15 min)

Deep Dive 1: Handling the Market Open Spike (9:15 AM)

The Problem: At 9:15 AM IST, the Indian stock market opens. Within the first 10 seconds, order volume spikes from 0 to 50K orders/sec. Pre-market orders (placed between 9:00–9:15) flood in simultaneously. This is the most critical moment — if we drop orders, users lose real money.

Solution: Queue-Based Order Processing for Fairness

Pre-market (9:00–9:15):
  User places order → Order persisted to DB (status: PENDING_OPEN)
                     → Enqueued to Kafka partition (by user_id)
                     → Acknowledged to user (202)

Market open (9:15 AM):
  Exchange Gateway sends "market open" signal
  → Order Execution Engine begins draining Kafka queues
  → Orders processed in FIFO order within each partition
  → Margin checks happen in real-time (Redis atomic ops)
  → Orders sent to exchange via FIX protocol (batched if allowed)

Key Design Choices:

  • Pre-warm everything: At 9:00 AM, load all user margins into Redis, pre-validate instruments, establish exchange connections
  • Kafka partitioning by user_id: Ensures a user’s orders are processed in sequence (no race conditions on margin)
  • Backpressure: If exchange gateway queue backs up, apply per-user throttling (max 10 orders in flight). Don’t drop orders — slow them down.
  • Circuit breaker on the broker side: If exchange rejects > 5% of orders in a 1-second window, pause and re-check connectivity
  • Pre-computed margin blocks: For pre-market orders, margin is already blocked. At 9:15, skip the margin check — go straight to exchange submission.
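
The per-user backpressure rule — hold back, never drop — can be sketched with an in-flight counter per user. A synchronous toy model; a real engine would run this inside the Kafka consumer loop.

```python
class InFlightThrottle:
    """Allow at most `limit` orders in flight to the exchange per user;
    excess orders are queued, never dropped. Illustrative sketch."""
    def __init__(self, limit=10):
        self.limit = limit
        self.in_flight = {}   # user_id -> orders currently at the exchange
        self.waiting = {}     # user_id -> queued orders (FIFO)

    def submit(self, user_id, order):
        if self.in_flight.get(user_id, 0) < self.limit:
            self.in_flight[user_id] = self.in_flight.get(user_id, 0) + 1
            return order                  # dispatch immediately
        self.waiting.setdefault(user_id, []).append(order)
        return None                       # held back, not dropped

    def on_ack(self, user_id):
        """One exchange ack frees a slot: release the oldest waiter."""
        self.in_flight[user_id] -= 1
        queue = self.waiting.get(user_id)
        if queue:
            return self.submit(user_id, queue.pop(0))
        return None
```
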

Fairness Guarantee: Orders are timestamped at ingestion (nanosecond precision). Kafka preserves ordering within a partition. For orders across users, the exchange’s matching engine determines execution priority (price-time priority), not us — our job is to submit them as fast as possible in the order they were received.

Capacity Planning:

  • Kafka: 100 partitions × 500 orders/sec/partition = 50K orders/sec throughput
  • Exchange Gateway: 5 FIX sessions × 10K orders/sec each = 50K orders/sec
  • Redis: the margin check is a single atomic Lua call — easily handles 500K ops/sec on a single node

Deep Dive 2: Margin & Risk Management — Preventing Negative Balances

The Problem: A user has ₹1,00,000 available margin. They rapidly fire 10 orders each requiring ₹50,000 margin. Without proper concurrency control, all 10 could pass the margin check simultaneously, and the user ends up with -₹4,00,000 exposure. The broker is now liable.

Solution: Pessimistic Margin Blocking with Redis Atomic Operations

Order arrives → Lua script on Redis (KEYS[1] = "margin:<user_id>", ARGV[1] = required margin; amounts in integer paise, since DECRBY only handles integers):
  local available = tonumber(redis.call('GET', KEYS[1]) or '0')
  local required  = tonumber(ARGV[1])
  if available >= required then
    redis.call('DECRBY', KEYS[1], required)
    return 1  -- approved
  else
    return 0  -- rejected: insufficient margin
  end

Why Redis Lua script?

  • Atomic: check-and-decrement in a single operation — no TOCTOU race
  • Fast: < 0.1ms per operation
  • The Lua script runs single-threaded on Redis — natural serialization per key
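
To make the semantics unit-testable without a Redis server, here is an in-process model of the same check-and-decrement, with amounts in integer paise. In a real deployment the script is registered once (e.g. via redis-py's `register_script`) and invoked as one atomic call; this class is only a behavioral sketch.

```python
class MarginCache:
    """In-process model of the Lua check-and-decrement. In production
    the whole try_block body is a single atomic EVALSHA on Redis."""
    def __init__(self):
        self.margin = {}   # user_id -> available margin, integer paise

    def try_block(self, user_id, required):
        available = self.margin.get(user_id, 0)
        if available >= required:
            self.margin[user_id] = available - required
            return 1   # approved
        return 0       # rejected: insufficient margin

    def unblock(self, user_id, amount):
        # Rollback path: exchange rejection or user cancellation.
        self.margin[user_id] = self.margin.get(user_id, 0) + amount
```

With this model, the scenario above resolves correctly: from ₹1,00,000 available, only two ₹50,000 blocks succeed, and the rest are rejected instead of driving the balance negative.
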

Margin Calculation:

Required Margin = Σ(blocked for pending orders) + Σ(exposure for open positions) + new order requirement

For equities (CNC):     margin = order_value × 1.0 (full payment)
For intraday (MIS):     margin = order_value × 0.20 (5× leverage)
For F&O (NRML):         margin = SPAN margin + exposure margin (from exchange)

Lifecycle of Margin:

  1. Order placed → margin BLOCKED in Redis (available ↓, blocked ↑)
  2. Order rejected by exchange → margin UNBLOCKED (rollback)
  3. Order executed → margin CONSUMED (blocked ↓, used ↑)
  4. Order cancelled by user → margin UNBLOCKED (rollback)
  5. Position closed → margin RELEASED (used ↓, available ↑)
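
The five lifecycle steps form a small transition table over the (available, blocked, used) triple; sketched below with deltas in multiples of the order's margin amount. Event names are illustrative, not a fixed protocol.

```python
# event -> (d_available, d_blocked, d_used), each scaled by the
# order's margin amount. Note every row sums to zero: margin moves
# between buckets, it is never created or destroyed here.
MARGIN_TRANSITIONS = {
    "ORDER_PLACED":    (-1, +1, 0),   # block
    "ORDER_REJECTED":  (+1, -1, 0),   # rollback
    "ORDER_EXECUTED":  (0, -1, +1),   # consume
    "ORDER_CANCELLED": (+1, -1, 0),   # rollback
    "POSITION_CLOSED": (+1, 0, -1),   # release
}

def apply_event(balances, event, amount):
    """Apply one lifecycle step to an (available, blocked, used) tuple."""
    da, db, du = MARGIN_TRANSITIONS[event]
    a, b, u = balances
    return (a + da * amount, b + db * amount, u + du * amount)
```
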

Reconciliation:

  • Every 30 seconds, a background job reconciles Redis margin cache with PostgreSQL margin_ledger
  • If drift > ₹100, raise an alert — indicates a bug or race condition
  • End-of-day: full reconciliation with exchange clearing data
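
The 30-second reconciliation pass is a straight comparison. A sketch with amounts in paise (so the ₹100 drift threshold is 10,000 paise); the two input dicts are assumptions of this snippet, standing in for the Redis cache and the margin_ledger rows.

```python
def reconcile(redis_margins, ledger_margins, threshold=100_00):
    """Compare the Redis margin cache against the PostgreSQL ledger.
    Returns (user_id, drift) pairs exceeding the threshold, for alerting."""
    alerts = []
    for user_id, ledger_value in ledger_margins.items():
        drift = abs(redis_margins.get(user_id, 0) - ledger_value)
        if drift > threshold:
            alerts.append((user_id, drift))
    return alerts
```
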

Kill Switch:

  • If a user’s total exposure exceeds their risk limit, the Risk Service sends a “freeze” command
  • All pending orders are cancelled, new orders are rejected
  • Requires manual review by risk team to unfreeze

Deep Dive 3: Real-Time P&L Calculation and Market Data Fan-Out

The Problem: 1M connected users, each watching ~10 instruments. Each instrument ticks ~100 times/sec. We need to compute and push P&L updates in real-time without melting the servers.

Solution: Tiered Fan-Out with Instrument-Level Aggregation

Exchange Feed → Market Data Distributor → Kafka (by instrument group)
                                              │
                    ┌─────────────────────────┼─────────────────────┐
                    ▼                         ▼                     ▼
            WS Gateway 1              WS Gateway 2           WS Gateway N
            (instruments A-F)         (instruments G-M)      (instruments N-Z)
                    │                         │                     │
            Local subscriber map      Local subscriber map     Local subscriber map
            instrument → [conn1,      instrument → [conn1,     instrument → [conn1,
                          conn2, ...]              conn2, ...]              conn2, ...]

Key Optimizations:

  1. Server-side throttling: Even if an instrument ticks 100 times/sec, push to clients at most 4 times/sec (250ms intervals). Coalesce intermediate ticks and send only the latest state.
  2. Delta compression: Don’t send the full tick every time. Send only changed fields: {"ltp": 2451.30} instead of the full 15-field tick object.
  3. Instrument-level partitioning: Each WebSocket gateway node is responsible for a subset of instruments. When a tick arrives, only the relevant gateway node processes it — no broadcast to all nodes.
  4. P&L computed on the client: Server sends (LTP, prev_close). Client has (quantity, avg_cost) locally. P&L = (LTP - avg_cost) × quantity. This eliminates per-user server-side computation.
  5. Snapshot + stream: On WebSocket connect, client receives a full snapshot of subscribed instruments. Then receives incremental tick updates. Handles reconnection by re-sending snapshot.
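
Optimizations 1 and 2 — conflation plus delta compression — can be sketched together; the 250 ms timer lives outside this class and simply calls `flush()`.

```python
class TickConflator:
    """Keep only the latest tick per instrument between flushes, and on
    flush emit only the fields that changed since the last push."""
    def __init__(self):
        self.latest = {}     # instrument -> latest full tick seen
        self.last_sent = {}  # instrument -> last state pushed to clients

    def ingest(self, instrument, tick):
        self.latest[instrument] = tick   # intermediate ticks are overwritten

    def flush(self):
        """Called every 250 ms. Returns {instrument: delta} to push."""
        out = {}
        for ins, tick in self.latest.items():
            prev = self.last_sent.get(ins, {})
            delta = {k: v for k, v in tick.items() if prev.get(k) != v}
            if delta:                    # unchanged state sends nothing
                out[ins] = delta
                self.last_sent[ins] = dict(tick)
        self.latest.clear()
        return out
```
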

Numbers:

  • 500K ticks/sec ingested
  • After throttling (4 pushes/sec/instrument): 5,000 instruments × 4 = 20K messages/sec total, ~2K messages/sec per gateway node (instruments partitioned across 10 nodes)
  • Each message fans out to ~2,000 subscribers (10M subscriptions ÷ 5,000 instruments) = ~4M WebSocket writes/sec per node
  • 10 gateway nodes → total 40M WebSocket writes/sec
  • Each write ~100 bytes → ~4 GB/sec outbound overall, ~400 MB/sec (~3.2 Gbps) per node — manageable with 10 Gbps NICs

7. Extensions (2 min)

  • GTT (Good Till Triggered) orders: Persist in a separate store, checked against every tick using an in-memory price index (similar to the alert system). Triggered orders enter the normal order pipeline. Challenge: millions of GTT orders × 500K ticks/sec — need efficient data structure (sorted set or interval tree).
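
One workable structure for the GTT check is a per-instrument sorted list of trigger prices, so each tick costs O(log n + k) instead of a scan over every GTT order. A sketch for "trigger at or above" orders (the "at or below" side is symmetric), using the stdlib `bisect` module.

```python
import bisect

class GttIndex:
    """Per-instrument sorted index of 'trigger at or above price' GTT
    orders. Each tick pops exactly the crossed triggers. Sketch only."""
    def __init__(self):
        self.prices = []   # sorted trigger prices
        self.orders = []   # order ids, kept parallel to prices

    def add(self, trigger_price, order_id):
        i = bisect.bisect_left(self.prices, trigger_price)
        self.prices.insert(i, trigger_price)
        self.orders.insert(i, order_id)

    def on_tick(self, ltp):
        """Return (and remove) every order whose trigger <= LTP."""
        i = bisect.bisect_right(self.prices, ltp)
        fired = self.orders[:i]
        self.prices = self.prices[i:]
        self.orders = self.orders[i:]
        return fired
```

Triggered orders would then be fed into the normal order pipeline; a production version also needs removal by order_id (user cancels a GTT) and persistence, which this sketch omits.
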

  • After-market / pre-market orders (AMO): Orders placed outside market hours are queued in a priority queue. At next market open, they are released in timestamp order. Need separate margin blocking that accounts for overnight price gaps.

  • Multi-exchange routing: Smart order routing (SOR) that splits large orders across NSE and BSE based on best available price and liquidity depth. Requires real-time order book aggregation from multiple exchanges.

  • Paper trading / virtual portfolio: Separate order pipeline that simulates execution against real market data without touching real margins or exchange. Useful for new users. Shares the market data infrastructure but bypasses the exchange gateway entirely.

  • Regulatory reporting & audit trail: Every order mutation (place, modify, cancel, execute) logged to an append-only audit table with nanosecond timestamps, user IP, device fingerprint, and request trace ID. Required for SEBI/SEC compliance. Stored in a separate time-series database (TimescaleDB) with 7-year retention.