1. Requirements & Scope (5 min)

Question Statement

Design a payment payout system for a marketplace (like Amazon, Uber, Airbnb) that pays out sellers/drivers/hosts.

Constraints:

  • Payment gateway takes ~1 minute to process a payment
  • Fixed fee per transfer charged by the payment gateway

Requirements (in order of importance):

  1. Audit log of all payments made out to sellers
  2. Sellers paid out in their preferred form of payment, as soon as possible
  3. Sellers can check the status of their payments, or see steps required to fix problems
  4. Keep payment gateway fees to a minimum
  5. No payments should be dropped
  6. No duplicate payments should be made

Functional Requirements

#     Requirement            Details
───   ───────────            ───────
FR1   Payment Processing     Process seller payouts in their preferred method (bank transfer, PayPal, UPI, etc.)
FR2   Batching               Batch multiple small payments into fewer gateway calls to minimize fees
FR3   Idempotency            No duplicate payments — exactly-once execution guarantee
FR4   Audit Log              Immutable, append-only log of every payment state transition
FR5   Status Tracking        Real-time, queryable status for every payment with actionable error messages
FR6   Retry & Recovery       Failed payments are retried with backoff; stuck payments are recovered automatically
FR7   Settlement Tracking    Track whether money actually landed in the seller’s bank (gateway success ≠ bank settlement)
FR8   Reconciliation         Daily comparison of gateway records vs internal records to catch mismatches

Non-Functional Requirements

#      Requirement           Target
───    ───────────           ──────
NFR1   Durability            Zero data loss. RPO = 0. Every payment survives any single system failure
NFR2   Consistency           Strong consistency for payment state. No scenario where a payment is charged but not recorded
NFR3   Availability          99.99% for payment ingestion and the status API
NFR4   Ingestion Latency     < 200ms to accept and acknowledge a payment request
NFR5   Status Lookup         < 100ms latency
NFR6   Execution Latency     Minutes are acceptable (batching + gateway processing time)
NFR7   Idempotency           Every payment has a globally unique idempotency key at every level

2. Estimation (3 min)

Active sellers:           1M
Payouts per day:          10M transactions
Ingestion rate:           10M / 86400 ≈ 115 payments/sec
Batch ratio (10:1):       ~12 gateway calls/sec
Audit log growth:         10M records/day → ~3.6B/year
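These figures can be sanity-checked in a few lines (a sketch; the 10:1 batch ratio and a 365-day year are the stated assumptions):

```python
SECONDS_PER_DAY = 86_400

payouts_per_day = 10_000_000
batch_ratio = 10  # payments per gateway call

ingestion_rate = payouts_per_day / SECONDS_PER_DAY  # payments/sec
gateway_rate = ingestion_rate / batch_ratio         # gateway calls/sec
audit_rows_per_year = payouts_per_day * 365         # one record per payment

print(int(ingestion_rate))        # 115
print(round(gateway_rate))        # 12
print(audit_rows_per_year)        # 3,650,000,000 → ~3.6B
```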

3. API Design (3 min)

POST /payments                      → Initiate a payout
  Request:
    {
      idempotency_key: "order_12345_payout_v1",   → client generates this
      seller_id: "s456",
      amount: 250.00,
      currency: "USD",
      payment_method: "bank_transfer"
    }
  Response: 202 Accepted
    {
      payment_id: "p789",
      status: "PENDING"
    }

GET /payments/{payment_id}          → Check payment status
  Response:
    {
      payment_id: "p789",
      status: "ACCEPTED",
      amount: 250.00,
      batch_id: "b101",
      failure_reason: null,
      action_required: null,
      estimated_settlement: "2026-02-26T14:00:00Z"
    }

GET /sellers/{seller_id}/payments   → Seller payment history
  Query params: ?status=FAILED&from=2026-01-01&to=2026-02-25

4. Payment State Machine

PENDING ───▶ BATCHED ───▶ SUBMITTED ───▶ ACCEPTED ───▶ SETTLED ✓
                                          │
                                          ├───▶ REVERSED (bank rejected after acceptance)
                                          │
                                          └───▶ RETURNED (settled but bounced back)

At any failed stage:
  FAILED ───▶ (re-enters as PENDING after fix or retry)

Status      Meaning                                            Seller-Facing Message
──────      ───────                                            ─────────────────────
PENDING     Payment received, waiting to be batched            “Payment received, queued for processing”
BATCHED     Grouped with other payments for fee optimization   “Payment queued for processing”
SUBMITTED   Batch sent to payment gateway                      “Payment sent to payment processor”
ACCEPTED    Gateway accepted, in transit to bank               “Payment in transit to your bank”
SETTLED     Money confirmed in seller’s account                “Payment deposited in your account”
REVERSED    Bank rejected the transfer                         “Payment failed: [reason]. [action required]”
RETURNED    Settled but bounced back                           “Payment returned by your bank. Please contact your bank.”
FAILED      Exhausted retries or unrecoverable error           “Payment failed: [reason]. [action required]”

Key insight: Gateway returning “success” only means ACCEPTED, not SETTLED. True settlement confirmation comes hours/days later via webhooks or polling.
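One way to read the diagram is as a transition table. A minimal sketch — the exact set of legal edges (e.g. whether REVERSED re-enters directly as PENDING or routes through FAILED first) is an assumption, not spelled out above:

```python
# Allowed payment state transitions, one reading of the diagram above.
TRANSITIONS = {
    "PENDING":   {"BATCHED", "FAILED"},
    "BATCHED":   {"SUBMITTED", "FAILED"},
    "SUBMITTED": {"ACCEPTED", "FAILED"},
    "ACCEPTED":  {"SETTLED", "REVERSED"},
    "SETTLED":   {"RETURNED"},   # settled money can still bounce back
    "REVERSED":  {"PENDING"},    # re-enters after the seller fixes the problem
    "RETURNED":  set(),          # seller must contact their bank
    "FAILED":    {"PENDING"},    # retry after fix
}

def transition(current: str, target: str) -> str:
    """Reject any state change the diagram does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Centralizing the table means the executor, webhook handler, and sweeper all enforce the same edges instead of each hard-coding status strings.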


5. Data Model (5 min)

Payments Table (source of truth, massive)

CREATE TABLE payments (
    payment_id       UUID PRIMARY KEY,
    idempotency_key  VARCHAR UNIQUE,        -- prevents duplicate API calls
    seller_id        UUID,
    amount           DECIMAL(12,2),
    currency         VARCHAR(3),
    payment_method   VARCHAR,               -- bank_transfer, paypal, upi
    status           VARCHAR,               -- PENDING, BATCHED, FAILED
    batch_id         UUID,
    failure_reason   TEXT,
    action_required  TEXT,                   -- "Update bank details"
    created_at       TIMESTAMP,
    updated_at       TIMESTAMP
);

CREATE INDEX idx_seller_status ON payments(seller_id, status);
CREATE INDEX idx_batch ON payments(batch_id);
CREATE INDEX idx_idempotency ON payments(idempotency_key);

Note: This table is NEVER scanned by background jobs. It serves ingestion and status lookups only.

Payment Batches Table (~500 active rows at any time)

CREATE TABLE payment_batches (
    batch_id             UUID PRIMARY KEY,
    idempotency_key      VARCHAR UNIQUE,    -- deterministic: hash(seller + method + sorted payment_ids)
    seller_id            UUID,
    payment_method       VARCHAR,
    total_amount         DECIMAL(12,2),
    payment_count        INT,
    status               VARCHAR,           -- OPEN, SEALED, SUBMITTED, ACCEPTED,
                                            --   FULLY_SETTLED, PARTIALLY_SETTLED, FULLY_REVERSED
    gateway_ref          VARCHAR,
    sealed_at            TIMESTAMP,
    submitted_at         TIMESTAMP,
    completed_at         TIMESTAMP,
    created_at           TIMESTAMP
);

Batch Payments Table (per-payment settlement tracking, ~500K active rows)

CREATE TABLE batch_payments (
    batch_id        UUID,
    payment_id      UUID,
    gateway_ref     VARCHAR,                -- per-payment reference from gateway
    status          VARCHAR,                -- ACCEPTED, SETTLED, REVERSED
    settled_at      TIMESTAMP,
    failure_reason  TEXT,
    PRIMARY KEY (batch_id, payment_id)
);

Why this table? Most gateways process each payment individually within a batch. Some settle, some fail. This table tracks per-payment status without scanning the massive payments table.

At any time: ~500 active batches × ~1000 payments = ~500K rows. Tiny.

Settlement Queue (~500 rows, ephemeral)

CREATE TABLE settlement_queue (
    batch_id        UUID PRIMARY KEY,
    gateway_ref     VARCHAR,
    total_payments  INT,
    settled_count   INT DEFAULT 0,
    reversed_count  INT DEFAULT 0,
    accepted_at     TIMESTAMP,
    last_checked_at TIMESTAMP,
    check_count     INT DEFAULT 0,
    status          VARCHAR                 -- IN_PROGRESS, FULLY_SETTLED,
                                            --   PARTIALLY_SETTLED, FULLY_REVERSED
);

Purpose: Settlement checker polls THIS table every 30 min instead of scanning the payments table. Only contains unsettled batches. Rows are removed once fully resolved.

Settlement Ledger (append-only, for reconciliation)

CREATE TABLE settlement_ledger (
    ledger_id       BIGSERIAL PRIMARY KEY,
    batch_id        UUID,
    gateway_ref     VARCHAR,
    amount          DECIMAL(12,2),
    settled_at      TIMESTAMP,
    source          VARCHAR                 -- 'webhook' or 'polling'
);

Audit Log (append-only, immutable)

CREATE TABLE payment_audit_log (
    log_id           BIGSERIAL PRIMARY KEY,
    payment_id       UUID,
    batch_id         UUID,
    previous_status  VARCHAR,
    new_status       VARCHAR,
    changed_by       VARCHAR,               -- "batch_aggregator", "executor", "webhook", "manual"
    metadata         JSONB,                 -- gateway response, error details
    created_at       TIMESTAMP
);

-- NEVER UPDATE or DELETE from this table. Append only.
-- One row per BATCH state change (not per payment) to avoid 1000 inserts per batch.

Table Size Summary

Table               Active Rows         Purpose                      Scanned By
─────               ───────────         ───────                      ──────────
payments            billions over time  Ingestion, status lookup     Status Service (by PK)
payment_batches     ~500                Batch tracking               Executor, Status Service
batch_payments      ~500K               Per-payment settlement       Settlement Service
settlement_queue    ~500                Unsettled batch polling      Settlement Checker
settlement_ledger   grows daily         Reconciliation               Recon Service (by date)
audit_log           grows daily         Audit trail                  Audit Service (by date)

6. High-Level Design (5 min)

                              ┌──────────────────────┐
                              │   Marketplace        │
                              │   (client)           │
                              └────────┬─────────────┘
                                       │
                              ┌────────▼─────────────┐
                              │   API Gateway        │
                              │   (auth, rate        │
                              │    limit)            │
                              └─────┬─────────────┬──┘
                                    │             │
                          Write ────┘             └──── Read
                          path                          path
                           │                             │
                    ┌──────▼───────┐            ┌───────▼───────┐
                    │  Payment     │            │  Status       │
                    │  Service     │            │  Service      │
                    └──────┬───────┘            └───────┬───────┘
                           │                            │
                 ┌─────────┴──────────┐                 │
                 ▼                    ▼                  ▼
           ┌───────────┐       ┌──────────┐      ┌──────────┐
           │PostgreSQL │       │  Kafka   │      │  Redis   │
           │(payments, │       │          │      │  (cache) │
           │ audit log)│       └──────┬───┘      └──────────┘
           └───────────┘              │
                             ┌───────▼───────┐
                             │   Batch       │
                             │ Aggregator    │
                             └───────┬───────┘
                                     │
                             ┌───────▼───────┐
                             │  Payment      │
                             │  Executor     │
                             └───────┬───────┘
                                     │
                             ┌───────▼───────┐
                             │  Payment      │
                             │  Gateway      │
                             │  (Stripe,     │
                             │   PayPal)     │
                             └────┬───────┬──┘
                                  │       │
                         webhook  │       │  polling
                                  ▼       ▼
                      ┌────────────────────────┐
                      │  Settlement Service     │
                      └────────────┬────────────┘
                                   │
                      ┌────────────▼────────────┐
                      │  Reconciliation         │
                      │  Service (daily)        │
                      └─────────────────────────┘

7. Idempotency — Who Sets It At Each Level

Level 1: Individual payment
  Set by:    CLIENT (marketplace)
  Key:       "order_12345_payout_v1"
  Purpose:   Client retries same request → gets same result, no duplicate payment created

Level 2: Batch
  Set by:    BATCH AGGREGATOR
  Key:       deterministic_hash(seller_id + payment_method + sorted(payment_ids))
  Purpose:   Aggregator crashes and restarts → recomputes same hash → no duplicate batch

Level 3: Gateway call
  Set by:    PAYMENT EXECUTOR
  Key:       "batch_{batch_id}_attempt_{attempt_number}"
  Purpose:   Executor retries gateway call → gateway returns previous result, no double charge
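The Level 2 key can be sketched directly (SHA-256 is an assumed choice of digest; any stable hash works):

```python
import hashlib

def batch_idempotency_key(seller_id: str, method: str, payment_ids: list[str]) -> str:
    # Sorting makes the key independent of the order payments arrived in,
    # so a restarted aggregator recomputes the exact same key.
    material = "|".join([seller_id, method] + sorted(payment_ids))
    return hashlib.sha256(material.encode()).hexdigest()
```

Because the key is a pure function of the batch contents, a crashed-and-restarted aggregator that regroups the same payments produces the same key, and the unique constraint on payment_batches.idempotency_key rejects the duplicate.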

8. Write Path — Detailed Flow (10 min)

Step 1: Payment Ingestion (Payment Service)

Client: POST /payments {idempotency_key, seller_id, amount, method}
    │
    ▼
Payment Service:
    1. Check idempotency_key in DB
       → exists? Return existing payment status
       → new? Continue

    2. Write to PostgreSQL: status = PENDING
       (durable commit — once this write lands, the payment can NEVER be lost)

    3. Publish to Kafka: topic=payments, event=payment_created

    4. Return 202 Accepted {payment_id, status: "PENDING"}
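A minimal sketch of these four steps, exercised against in-memory stand-ins for PostgreSQL and Kafka (FakeDB and FakeKafka are illustrative, not real client libraries):

```python
import uuid

class FakeDB:
    """Stand-in for the payments table with its unique idempotency_key."""
    def __init__(self):
        self.rows = {}  # idempotency_key -> payment row
    def find_by_idempotency_key(self, key):
        return self.rows.get(key)
    def insert_payment(self, status, **req):
        row = {"payment_id": str(uuid.uuid4()), "status": status, **req}
        self.rows[req["idempotency_key"]] = row
        return row

class FakeKafka:
    def __init__(self):
        self.events = []
    def publish(self, topic, event):
        self.events.append((topic, event))

def create_payment(db, kafka, req):
    existing = db.find_by_idempotency_key(req["idempotency_key"])
    if existing:                           # step 1: client retry → same result
        return 200, existing
    payment = db.insert_payment(status="PENDING", **req)   # step 2: durable first
    try:                                   # step 3: best-effort publish
        kafka.publish("payments", {"event": "payment_created",
                                   "payment_id": payment["payment_id"]})
    except Exception:
        pass  # the recovery sweeper re-publishes stuck PENDING rows later
    return 202, {"payment_id": payment["payment_id"], "status": "PENDING"}
```

The ordering is the point: the DB write happens before the Kafka publish, so a crash between the two leaves a PENDING row the sweeper can find, never a lost payment.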

Step 2: Batching (Batch Aggregator Service)

Kafka consumer reads payment_created events:
    │
    ▼
Batch Aggregator:
    Groups payments by (seller_id, payment_method)

    Flush batch when:
      - Batch sum > $100                    → fee optimization
      - OR 1 hour elapsed since first payment in batch  → latency SLA
      - OR seller marked "instant payout"   → premium feature

    On flush:
      1. Create batch record in payment_batches: status = SEALED
      2. Compute batch idempotency key = hash(seller + method + sorted payment_ids)
      3. Update individual payments: status = BATCHED, batch_id = X
      4. Publish to Kafka: event = batch_sealed
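The three flush triggers reduce to a single predicate (thresholds taken from the list above; the batch field names are illustrative):

```python
import time

FLUSH_AMOUNT = 100.00     # fee-optimization threshold, in dollars
FLUSH_AGE_SECONDS = 3600  # latency SLA: flush after 1 hour regardless of size

def should_flush(batch, now=None, instant_payout=False):
    """True if any of the three flush conditions above holds."""
    if now is None:
        now = time.time()
    total = sum(p["amount"] for p in batch["payments"])
    age = now - batch["opened_at"]
    return total > FLUSH_AMOUNT or age >= FLUSH_AGE_SECONDS or instant_payout
```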

Step 3: Execution (Payment Executor Service)

Kafka consumer reads batch_sealed events:
    │
    ▼
Payment Executor:
    1. Acquire distributed lock in Redis:
       SET exec:{batch_id} "executor-3" EX 120 NX
       (prevents duplicate execution)

    2. Background thread: renew lock every 30 seconds
       EXPIRE exec:{batch_id} 120

    3. Call payment gateway with idempotency key:
       POST /gateway/transfer {
         amount, destination, webhook_url,
         idempotency_key: "batch_b789_attempt_1"
       }
       (takes ~1 minute)

    4. Gateway returns ACCEPTED:
       - Update payment_batches: status = ACCEPTED
       - Insert into batch_payments (per-payment gateway refs)
       - Insert into settlement_queue: status = IN_PROGRESS
       - Insert audit log entry

    5. Release lock: DEL exec:{batch_id}

Redis Lock Details

Lock TTL: 120 seconds
  Gateway processing:  ~60s
  Network buffer:      ~15s
  DB update buffer:    ~5s
  Safety margin:       ~40s

Lock renewal: every 30 seconds (heartbeat pattern)
  t=0    SET exec:batch_789 "executor-3" EX 120 NX    → acquire
  t=30   EXPIRE exec:batch_789 120                      → renew
  t=60   EXPIRE exec:batch_789 120                      → renew
  t=90   Gateway responds → DEL exec:batch_789          → release

If executor crashes:
  No more renewals → lock expires at t=120
  Another executor acquires lock → retries with same idempotency key
  Gateway returns previous result → no duplicate charge
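The acquire/work/release shape can be sketched against an in-memory stand-in for the three Redis commands involved (a sketch; a real executor would also renew via EXPIRE on a background thread, omitted here for brevity):

```python
import time

class FakeRedis:
    """In-memory stand-in for SET NX EX, EXPIRE, and DEL."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)
    def set_nx_ex(self, key, value, ttl):
        cur = self.store.get(key)
        if cur and cur[1] > time.time():
            return False                       # someone else holds the lock
        self.store[key] = (value, time.time() + ttl)
        return True
    def expire(self, key, ttl):
        if key in self.store:
            self.store[key] = (self.store[key][0], time.time() + ttl)
    def delete(self, key):
        self.store.pop(key, None)

def run_batch(redis, batch_id, owner, work, ttl=120):
    """Execute `work` only while holding the per-batch lock."""
    if not redis.set_nx_ex(f"exec:{batch_id}", owner, ttl):
        return None          # another executor owns this batch; skip
    try:
        return work()        # gateway call goes here (~60s)
    finally:
        redis.delete(f"exec:{batch_id}")
```

If the process dies inside `work`, the `finally` never runs, the lock simply expires at the TTL, and the next executor retries with the same gateway idempotency key.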

9. Settlement Tracking (5 min)

Why Gateway Success ≠ Settlement

Timeline:
  t=0          You call gateway API
  t=1 min      Gateway: "ACCEPTED"          → NOT final
  t=2 hours    Bank processes transfer
  t=2.5 hours  Bank: "settled"              → money actually moved

  Or:
  t=24 hours   Bank: "reversed → invalid account"  → failed after acceptance

Two Mechanisms: Webhook + Polling

┌───────────────┐     webhook (fast)    ┌────────────────────┐
│   Gateway     │ ────────────────────▶ │  Webhook Handler   │───▶ Update DB
│               │                       └────────────────────┘
│               │     polling (backup)  ┌────────────────────┐
│               │ ────────────────────▶ │ Settlement Checker │───▶ Update DB
└───────────────┘                       └────────────────────┘

Both are idempotent. Receiving the same status twice is fine.

Webhook Handler

POST /webhooks/gateway
{
  batch_ref: "gw_xyz",
  results: [
    {gateway_ref: "gw_p1", status: "settled"},
    {gateway_ref: "gw_p2", status: "settled"},
    {gateway_ref: "gw_p847", status: "reversed", reason: "INVALID_ACCOUNT"}
  ]
}

Handler:
  1. Update batch_payments for each payment
  2. Update settlement_queue counters
  3. If all payments accounted for:
     - Mark batch FULLY_SETTLED or PARTIALLY_SETTLED
     - Remove from settlement_queue
  4. For reversed payments:
     - Update main payments table: status=FAILED, action_required="..."
     - Notify seller
     - Re-enqueue as PENDING if retryable
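Steps 1-3 amount to folding per-payment results into the settlement counters; a sketch (deduplicating on gateway_ref is what makes webhook redelivery safe):

```python
def apply_webhook(results, total_payments):
    """Fold per-payment webhook results into counters and derive the
    batch outcome. Idempotent: re-delivered results are counted once."""
    seen, settled, reversed_ = set(), 0, 0
    for r in results:
        if r["gateway_ref"] in seen:
            continue                        # duplicate delivery → no-op
        seen.add(r["gateway_ref"])
        if r["status"] == "settled":
            settled += 1
        elif r["status"] == "reversed":
            reversed_ += 1
    if settled + reversed_ < total_payments:
        return "IN_PROGRESS", settled, reversed_
    if reversed_ == 0:
        return "FULLY_SETTLED", settled, reversed_
    if settled == 0:
        return "FULLY_REVERSED", settled, reversed_
    return "PARTIALLY_SETTLED", settled, reversed_
```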

Settlement Checker (polls every 30 min)

SELECT * FROM settlement_queue WHERE status = 'IN_PROGRESS'
→ ~500 rows (tiny!)

For each batch:
  Poll gateway: GET /transfers/{gateway_ref}/status
  Update batch_payments with results
  Update settlement_queue counters

  If check_count > 48 (24 hours of checking):
    → Escalate to ops team

Partial Settlement Handling

Batch b789: 1000 payments submitted

Webhook arrives:
  998 settled ✓
  2 reversed ✗ (reason: INVALID_ACCOUNT)

Actions:
  1. batch_payments: 998 rows → SETTLED, 2 rows → REVERSED
  2. payment_batches: status = PARTIALLY_SETTLED
  3. Main payments table: update ONLY the 2 failed payments
     (not 1000 — just 2 targeted updates)
  4. Notify affected sellers with actionable error
  5. Once seller fixes bank details → payment re-enters as PENDING → re-batched

10. Failure Handling — Every Failure Point (10 min)

F1: Client Request Times Out

Client sends POST /payments → no response after 5 seconds
Client retries with SAME idempotency_key

Payment Service:
  SELECT * FROM payments WHERE idempotency_key = 'order_12345_payout_v1'
  → Found? Return existing record
  → Not found? Create new payment

Result: No duplicate.

F2: Payment Service Crashes After DB Write, Before Kafka Publish

Payment in PG as PENDING ✓
Kafka publish failed ✗

Recovery sweeper (every 5 min):
  SELECT * FROM payments
  WHERE status = 'PENDING'
    AND created_at < NOW() - INTERVAL '5 minutes'

  → Re-publishes to Kafka
  → Batch aggregator deduplicates by payment_id

F3: Kafka Goes Down

Payment Service can't publish.

Payment is already in PG as PENDING.
Return 503 to client (client retries, idempotency key catches it).
Recovery sweeper will eventually re-publish when Kafka recovers.

Payment is NEVER lost → PG is the source of truth.

F4: Batch Aggregator Crashes

Was building a batch in memory with 8 of 10 payments. Crashes.

On restart:
  - Kafka consumer replays from last committed offset
  - Rebuilds in-progress batch
  - Deterministic batch idempotency key → same batch recreated

Result: No payments lost. Batch delayed by a few minutes.

F5: Executor Crashes

Case A: Before gateway call

Lock acquired ✓ → Gateway call ✗ (crash)

Lock expires after 120s.
Recovery sweeper finds batch with status=SEALED.
Another executor picks it up.

Result: Delayed, no duplicate.

Case B: During gateway call (worst case)

Lock acquired ✓ → Gateway call in progress ✓ → Crash, don't know result

Lock expires. Another executor picks up the batch.
Calls gateway AGAIN with same idempotency key.
Gateway recognizes the key:
  → Already succeeded? Returns success. No double charge.
  → Still processing? Returns "in progress". Executor waits.
  → Failed? Returns failure. Executor records failure.

Result: No duplicate payment. Gateway idempotency saves us.

Case C: After gateway success, before DB update

Lock acquired ✓ → Gateway success ✓ → DB update ✗ (crash)

Lock expires. Another executor picks up.
Calls gateway with same idempotency key → returns "already succeeded."
Executor updates DB: status = ACCEPTED.

Result: Delayed DB update, correct final state.

F6: Payment Gateway Failure

Explicit error:

Gateway: {"error": "invalid_bank_account", "code": "R04"}

Executor:
  Update payment: status = FAILED
  Set failure_reason + action_required
  Seller sees actionable error

Timeout (no response):

DO NOT retry blindly.

1. Mark the attempt: status = UNKNOWN
2. Call the gateway status API: GET /transfers/{ref}/status
3. Succeeded → record ACCEPTED, hand off to settlement tracking
4. Failed → safe to retry with the same idempotency key
5. Still processing → wait and poll again

Transient error (500):

Retry with exponential backoff:
  Attempt 1: wait 5s
  Attempt 2: wait 15s
  Attempt 3: wait 45s
  Attempt 4: wait 120s
  Max attempts: 5

All retries use SAME idempotency key.
After max attempts → FAILED, alert ops team.
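The schedule above is a base of 5s tripling per attempt, capped at 120s:

```python
def backoff_delay(attempt, base=5, factor=3, cap=120):
    """Delay in seconds before retry `attempt` (1-indexed),
    matching the schedule above: 5s, 15s, 45s, then capped at 120s."""
    return min(base * factor ** (attempt - 1), cap)
```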

11. Recovery Sweeper (Safety Net)

Runs every 5 minutes. Catches anything that fell through the cracks.

1. Stuck PENDING (not published to Kafka):
   WHERE status='PENDING' AND created_at < NOW() - INTERVAL '5 minutes'
   → Re-publish to Kafka

2. Stuck BATCHED (aggregator crashed):
   WHERE status='BATCHED' AND updated_at < NOW() - INTERVAL '1 hour'
   → Re-seal batch, re-enqueue for execution

3. Stuck SUBMITTED (executor crashed):
   WHERE status='SUBMITTED' AND updated_at < NOW() - INTERVAL '30 minutes'
   → Query gateway for status, update accordingly

4. Stuck ACCEPTED (gateway taking too long to settle):
   WHERE status='ACCEPTED' AND updated_at < NOW() - INTERVAL '1 hour'
   → Query gateway, escalate if no response

All actions are idempotent. Safe to run repeatedly.
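The four rules above can be sketched as a data-driven sweep (thresholds in minutes, state names from the section 4 state machine; the action names are placeholders for the recovery steps listed):

```python
SWEEP_RULES = [
    # (status, stale_after_minutes, recovery action)
    ("PENDING",   5,  "republish_to_kafka"),
    ("BATCHED",   60, "reseal_and_requeue"),
    ("SUBMITTED", 30, "query_gateway_status"),
    ("ACCEPTED",  60, "query_gateway_escalate"),
]

def sweep(payments, now_minutes):
    """One idempotent sweep pass: return (payment_id, action) for every
    row stuck in a state longer than that state's threshold."""
    actions = []
    for status, threshold, action in SWEEP_RULES:
        for p in payments:
            age = now_minutes - p["updated_at_minutes"]
            if p["status"] == status and age > threshold:
                actions.append((p["payment_id"], action))
    return actions
```

Because every action is a re-publish or a status query, running the same sweep twice over the same rows is harmless.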

12. Read Path — Status Lookup

GET /payments/{payment_id}
    │
    ▼
Status Service:
  1. Check Redis cache (hot payments, last 24h)
     → hit? Return cached status

  2. Cache miss → query PostgreSQL:
     SELECT p.*, bp.status as settlement_status, bp.failure_reason
     FROM payments p
     LEFT JOIN batch_payments bp ON p.payment_id = bp.payment_id
     WHERE p.payment_id = 'p847'

  3. If batched, derive effective status:
     - Payment status = BATCHED, batch status = FULLY_SETTLED → "SETTLED"
     - Payment in batch_payments as REVERSED → "FAILED" + reason + action

  4. Return to caller, cache in Redis with 60s TTL

Status Resolution Logic

If payment has no batch_id:
  → Return payment.status directly (PENDING or FAILED)

If payment has batch_id:
  → Check batch_payments for per-payment settlement status
  → If batch_payments says SETTLED → effective status = SETTLED
  → If batch_payments says REVERSED → effective status = FAILED + reason
  → If not yet in batch_payments → derive from payment_batches.status
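The resolution rules condense to one function (a sketch; dicts stand in for rows from the three tables):

```python
def effective_status(payment, batch_payment=None, batch=None):
    """Derive the seller-visible status from the payments row, the
    batch_payments row (if any), and the payment_batches row."""
    if payment.get("batch_id") is None:
        return payment["status"]                 # PENDING or FAILED
    if batch_payment is not None:
        if batch_payment["status"] == "SETTLED":
            return "SETTLED"
        if batch_payment["status"] == "REVERSED":
            return "FAILED"                      # plus reason + action in practice
    # not yet in batch_payments → fall back to the batch's lifecycle status
    return {"OPEN": "BATCHED", "SEALED": "BATCHED"}.get(batch["status"],
                                                        batch["status"])
```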

13. Daily Reconciliation

Reconciliation Service (daily cron):

  YOUR SIDE:
    SELECT * FROM settlement_ledger WHERE settled_at::date = '2026-02-24'
    → ~200 rows (200 batches settled yesterday)

  GATEWAY SIDE:
    GET /gateway/settlements?date=2026-02-24
    → list of settlements from gateway

  COMPARE:
    ┌────────────────────────────────────────────────────────────┐
    │  Gateway says settled, DB says settled   → ✓ match        │
    │  Gateway says settled, DB says accepted  → ⚠ webhook missed│
    │  Gateway says reversed, DB says settled  → 🚨 critical     │
    │  DB says settled, gateway has no record  → 🚨 phantom      │
    └────────────────────────────────────────────────────────────┘

  Flag mismatches → alert finance team → manual review
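The four-way comparison is a keyed diff on gateway_ref; a sketch (statuses lowercased to match the webhook payloads):

```python
def reconcile(ours, theirs):
    """Compare our settlement records against the gateway's, keyed by
    gateway_ref; returns the four buckets from the table above."""
    matches, webhook_missed, critical, phantom = [], [], [], []
    for ref, gw_status in theirs.items():
        mine = ours.get(ref)
        if mine == "settled" and gw_status == "settled":
            matches.append(ref)
        elif mine == "accepted" and gw_status == "settled":
            webhook_missed.append(ref)   # we never saw the webhook
        elif mine == "settled" and gw_status == "reversed":
            critical.append(ref)         # money we recorded as ours is gone
    for ref, mine in ours.items():
        if mine == "settled" and ref not in theirs:
            phantom.append(ref)          # gateway has no record at all
    return matches, webhook_missed, critical, phantom
```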

14. Audit Log Strategy

Log at BATCH level, not payment level:

  Batch sealed:    1 audit row  (not 1000)
  Batch submitted: 1 audit row
  Batch settled:   1 audit row
  Payment failed:  1 audit row per failed payment (only failures, not successes)

This keeps audit log growth manageable:
  10M payments/day ÷ 1000 per batch = 10K batches/day
  ~4-5 audit rows per batch lifecycle = ~50K audit rows/day
  vs 40-50M rows/day if logged per-payment

15. Component Responsibility Map

Component               Reads From              Writes To
─────────               ──────────              ─────────
Payment Service         PostgreSQL              PostgreSQL, Kafka
Batch Aggregator        Kafka                   PostgreSQL, Kafka
Payment Executor        Kafka, Redis            PostgreSQL, Redis, Gateway
Settlement Service      settlement_queue, GW    batch_payments, payment_batches, payments
Reconciliation Service  settlement_ledger, GW   Alert system
Recovery Sweeper        PostgreSQL              Kafka, PostgreSQL
Status Service          Redis, PostgreSQL       Redis (cache)
Audit Service           audit_log table         —

16. No-Drop Guarantee End to End

Layer                       Protection
─────                       ──────────
Client → Payment Service    Idempotency key (no duplicate creation)
Payment Service → Kafka     PG write-ahead (sweeper re-publishes if Kafka missed)
Kafka → Batch Aggregator    Consumer offsets (replay on crash)
Aggregator → Executor       Deterministic batch key (no duplicate batch)
Executor → Gateway          Redis lock + gateway idempotency key (no duplicate charge)
Gateway → Settlement        Webhook + polling + reconciliation (no missed settlement)
Everything                  Recovery sweeper catches anything stuck > threshold

17. Key Design Decisions

• PostgreSQL for payments (not NoSQL)
  Financial data needs ACID. Idempotency = unique constraint. Status + audit = single
  transaction. At 115 writes/sec, PG handles this easily.

• Kafka between ingestion and batching
  Decouples accept from process. Payment accepted = durably stored. Aggregator can crash
  and catch up from offset.

• Separate batch aggregator and executor
  Different concerns: WHEN to pay vs HOW to pay. Different scaling needs, different
  failure modes. Executor retries independently without re-batching.

• Distributed lock on executor
  Prevents duplicate gateway calls when executor crashes/restarts. Combined with
  idempotency key = guaranteed no duplicate payment.

• Settlement queue (separate small table)
  Never scan the massive payments table for background work. Settlement queue has ~500
  rows at any time.

• batch_payments table for per-payment tracking
  Handles partial settlement (998 succeed, 2 fail) without scanning the payments table.
  ~500K active rows.

• Webhook + polling for settlement
  Webhook = fast notification. Polling = safety net for missed webhooks. Both idempotent.

• Batch-level audit logging
  50K rows/day instead of 40-50M. Payment-level detail only for failures.

18. Monitoring & Alerts

Alert on:
  - Payment stuck in PENDING > 10 minutes
  - Batch stuck in SEALED > 30 minutes
  - Gateway error rate > 5%
  - Settlement queue growing (batches not settling)
  - Reconciliation mismatches (any)
  - Kafka consumer lag > 1 minute

Dashboard:
  - Payments in each state (real-time)
  - Average time to payout (ingestion → settlement)
  - Gateway fee spend (daily/monthly)
  - Failure rate by reason (invalid account, frozen, etc.)
  - Batch size distribution
  - Settlement time distribution