1. Requirements & Scope (5 min)
Functional Requirements
- Process credit card authorization requests in real-time — validate card, check funds, place a hold, return approve/decline
- Handle settlement/clearing: batch process authorized transactions at end of day, move funds from issuing bank to acquiring bank
- Support tokenization: replace sensitive card numbers with random tokens that cannot be reversed without access to the token vault, for PCI DSS compliance
- Detect and flag fraudulent transactions in real-time using rules and ML models before authorization
- Handle refunds, chargebacks, and multi-currency transactions with proper exchange rate management
Non-Functional Requirements
- Availability: 99.999% for the authorization path — every second of downtime costs merchants millions in lost sales
- Latency: Authorization response < 300ms end-to-end (merchant → acquirer → card network → issuer → back). Our processing adds < 50ms.
- Consistency: Strong consistency for financial operations. Every transaction must be exactly-once. Duplicate charges are unacceptable.
- Scale: 150K transactions/sec peak (Black Friday). 5B transactions/day average. ~$91T annual payment volume at a $50 average transaction.
- Security: PCI DSS Level 1 compliance. Card data encrypted at rest (AES-256) and in transit (TLS 1.3). HSM for key management.
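To make the availability target concrete, the downtime budget implied by 99.999% can be computed directly. This is a small sketch; the function name is illustrative and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def downtime_budget_seconds(availability: float) -> float:
    """Maximum downtime per year allowed by an availability fraction."""
    return SECONDS_PER_YEAR * (1.0 - availability)


budget = downtime_budget_seconds(0.99999)
print(f"99.999% allows about {budget / 60:.2f} minutes of downtime per year")
```

Five nines leaves roughly five minutes per year for the whole authorization path, which is why every component on that path needs redundancy.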
2. Estimation (3 min)
Traffic
- 5B transactions/day average = 58K TPS average
- Peak (Black Friday, flash sales): 150K TPS
- Each authorization involves: fraud check + balance check + hold placement = 3-5 internal operations per transaction
Storage
- Transaction records: 5B/day × 500 bytes = 2.5TB/day, ~900TB/year
- Token vault: 2B unique cards × 200 bytes = 400GB (fits in memory with replication)
- Fraud model features: 5B/day × 1KB (computed features) = 5TB/day (stored for model training, purged after 90 days)
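The traffic and storage figures above can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

txns_per_day = 5_000_000_000
avg_tps = txns_per_day / SECONDS_PER_DAY               # about 57,870 TPS

record_bytes = 500
txn_tb_per_day = txns_per_day * record_bytes / 1e12     # 2.5 TB/day
txn_tb_per_year = txn_tb_per_day * 365                  # 912.5 TB/year

token_vault_gb = 2_000_000_000 * 200 / 1e9              # 400 GB

feature_tb_per_day = txns_per_day * 1_000 / 1e12        # 5 TB/day
feature_tb_retained = feature_tb_per_day * 90           # 450 TB at 90-day retention

print(round(avg_tps), txn_tb_per_day, txn_tb_per_year, token_vault_gb, feature_tb_retained)
```

The 90-day retention window means the feature store holds about 450 TB at steady state, which is worth budgeting separately from the transaction records.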
Financial Math
- Average transaction: $50
- Annual volume: 5B/day × 365 × $50 = $91T
- Interchange fee (~2%): $1.8T/year in fees flowing through the system
- A 1-second outage at peak: 150K transactions × $50 = $7.5M in potentially lost transactions
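The financial figures follow from the same inputs. This sketch keeps dollar amounts as integers where possible; the flat 2% interchange rate is the approximation used above, not a real fee schedule.

```python
txns_per_day = 5_000_000_000
avg_ticket_usd = 50
interchange_rate = 0.02      # flat 2%, as approximated above
peak_tps = 150_000

annual_volume_usd = txns_per_day * 365 * avg_ticket_usd         # 91.25 trillion
annual_interchange_usd = annual_volume_usd * interchange_rate   # about 1.8 trillion
loss_per_peak_second_usd = peak_tps * avg_ticket_usd            # 7.5 million

print(annual_volume_usd, annual_interchange_usd, loss_per_peak_second_usd)
```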
Key Insight
This system has zero tolerance for data loss or duplication. A lost transaction means either the merchant doesn’t get paid or the customer gets charged without receiving goods. A duplicate means double-charging. Every operation must be idempotent and durable.
3. API Design (3 min)
Authorization
POST /v1/authorize
Headers: {
"Idempotency-Key": "merchant_txn_abc123", // CRITICAL for exactly-once
"X-Merchant-Id": "m_456"
}
Body: {
"card_token": "tok_xyz789", // tokenized card, never raw PAN
"amount": 4999, // cents, always integer to avoid float issues
"currency": "USD",
"merchant_category_code": "5411", // groceries
"merchant_name": "Whole Foods #1234",
"billing_address": { ... }, // for AVS check
"metadata": { "order_id": "ord_123" }
}
Response 200: {
"authorization_id": "auth_abc",
"status": "approved", // or "declined"
"decline_reason": null, // "insufficient_funds", "fraud_suspected", etc.
"avs_result": "Y", // address verification
"cvv_result": "M", // CVV match
"auth_code": "A12345",
"remaining_hold_amount": 4999
}
Capture (Settlement)
POST /v1/capture
Body: {
"authorization_id": "auth_abc",
"amount": 4999, // can be <= auth amount (partial capture)
"idempotency_key": "cap_abc123"
}
Response 200: { "capture_id": "cap_def", "status": "captured" }
Refund
POST /v1/refund
Body: {
"capture_id": "cap_def",
"amount": 2000, // partial refund
"reason": "customer_request",
"idempotency_key": "ref_abc123"
}
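The capture and refund endpoints imply two invariants: captures cannot exceed the authorized amount, and refunds cannot exceed what was captured. The sketch below is one way to enforce them. The class and method names are hypothetical, and it assumes multiple partial captures are allowed and that refunds are tracked per authorization rather than per capture.

```python
from dataclasses import dataclass


@dataclass
class Authorization:
    amount: int          # authorized amount in minor units (cents)
    captured: int = 0    # total captured so far
    refunded: int = 0    # total refunded so far

    def capture(self, amount: int) -> None:
        """Capture part or all of the remaining authorized amount."""
        if amount <= 0 or self.captured + amount > self.amount:
            raise ValueError("capture exceeds remaining authorized amount")
        self.captured += amount

    def refund(self, amount: int) -> None:
        """Refund part or all of the captured amount."""
        if amount <= 0 or self.refunded + amount > self.captured:
            raise ValueError("refund exceeds captured amount")
        self.refunded += amount


auth = Authorization(amount=4999)
auth.capture(4999)   # full capture
auth.refund(2000)    # partial refund, as in the example request
```

In a real system these checks would run inside the same database transaction as the insert into captures or refunds, so two concurrent requests cannot both pass the check.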
Key Decisions
- Amounts in smallest currency unit (cents): Avoids floating-point precision issues. $49.99 = 4999 cents.
- Idempotency keys on every mutating endpoint: If a network timeout causes a retry, the system returns the same response without re-processing.
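- Token-only API: Raw card numbers (PANs) never reach the merchant-facing API or the merchant's servers. Tokenization happens at the edge (POS terminal or browser SDK), which sends the PAN only to the isolated Tokenization Service.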
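To illustrate the first decision, the sketch below converts a decimal amount string to integer minor units using Python's decimal module. The function name and the small currency table are assumptions for illustration; a real system would load minor-unit exponents from ISO 4217 data.

```python
from decimal import Decimal

# Minor-unit exponents for a few currencies; illustrative subset only.
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0}


def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string such as "49.99" to integer minor units."""
    exponent = MINOR_UNIT_EXPONENT[currency]
    scaled = Decimal(amount).scaleb(exponent)   # shift the decimal point
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has too many decimal places for {currency}")
    return int(scaled)


# Binary floats cannot represent most decimal fractions exactly:
print(0.1 + 0.2 == 0.3)                 # False
print(to_minor_units("49.99", "USD"))   # 4999
```

Parsing from a string rather than a float matters here: once a value has passed through a float, the rounding error is already baked in.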
4. Data Model (3 min)
Transactions (PostgreSQL with partitioning)
Table: authorizations
auth_id (PK) | uuid
idempotency_key (UQ) | varchar(100)
merchant_id | uuid
card_token | varchar(64) (FK → token_vault)
amount_cents | bigint
currency | char(3)
status | enum('pending', 'approved', 'declined', 'expired', 'reversed')
decline_reason | varchar(50)
fraud_score | float
auth_code | varchar(10)
avs_result | char(1)
cvv_result | char(1)
mcc | char(4)
created_at | timestamp
expires_at | timestamp -- auths expire after 7-30 days
-- Partitioned by created_at (daily)
Table: captures
capture_id (PK) | uuid
auth_id (FK) | uuid
idempotency_key (UQ) | varchar(100)
amount_cents | bigint
status | enum('pending', 'settled', 'failed')
settled_at | timestamp
batch_id | uuid -- settlement batch
Table: refunds
refund_id (PK) | uuid
capture_id (FK) | uuid
idempotency_key (UQ) | varchar(100)
amount_cents | bigint
reason | varchar(50)
status | enum('pending', 'processed', 'failed')
Token Vault (Isolated, HSM-backed)
Table: token_vault (separate database, separate network segment)
token (PK) | varchar(64)
encrypted_pan | bytea -- AES-256 encrypted with HSM-managed key
card_fingerprint | varchar(64) -- hash of PAN, for dedup without decryption
expiry_month | int
expiry_year | int
issuer_bin | char(6) -- first 6 digits, for routing (networks are migrating to 8-digit BINs, so size accordingly)
created_at | timestamp
Why These Choices
- PostgreSQL for transactions: ACID guarantees, strong consistency, mature partitioning (daily partitions for 5B txns/day). Each partition is ~2.5TB.
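- Separate token vault database: Cardholder data lives in its own network segment (CDE, Cardholder Data Environment). PCI DSS does not strictly mandate segmentation, but it is the standard way to keep the rest of the system out of audit scope. The token vault has no direct internet access.
- HSM (Hardware Security Module): Encryption keys for the token vault are managed by HSMs, which are tamper-resistant hardware. The master key never leaves the HSM; data encryption keys are stored only in wrapped (encrypted) form.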
5. High-Level Design (12 min)
Authorization Flow
Merchant POS/App
→ Acquirer (merchant's bank processor)
→ Card Network (Visa/Mastercard)
→ Our Authorization Service:
1. Parse request, validate format
2. Detokenize: token_vault lookup → get BIN for routing
3. Fraud Check (< 20ms):
→ Rules engine (velocity checks, geo checks)
→ ML model (real-time feature computation + inference)
→ If fraud_score > threshold → decline
4. Issuer Routing: based on BIN, route to correct issuing bank
5. Balance/Credit Check: (at issuing bank)
→ Sufficient funds? → place hold
6. Log transaction (sync write to DB)
7. Return auth response
← Card Network
← Acquirer
← Merchant
Total: < 300ms end-to-end
Settlement Flow (batch, end of day)
Settlement Batch Job (runs at EOD):
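1. Collect all captures that have not yet been settled
→ SELECT * FROM captures WHERE status = 'pending' AND batch_id IS NULL -- not filtered by date (captures has no created_at column), so captures missed by an earlier batch are picked up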
2. Group by issuing bank
3. For each bank:
→ Create settlement file (ISO 8583 format)
→ Submit to card network's clearing system
4. Card network calculates interchange fees and nets positions:
→ Merchant's bank receives: transaction_amount - interchange_fee
→ Issuing bank pays: transaction_amount - interchange_fee
5. Update capture status → 'settled'
6. Generate merchant payout records
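The grouping and netting in steps 2 to 4 can be sketched as below. This is a simplified illustration: it assumes a flat 2% interchange applied to each issuer's daily total, whereas real interchange is set per transaction by the card network. The function and constant names are hypothetical.

```python
from collections import defaultdict

INTERCHANGE_BPS = 200  # assumed flat 2% interchange, in basis points


def net_positions(captures):
    """Group captures by issuer and compute what each issuer pays.

    `captures` is an iterable of (issuer_id, amount_cents) pairs.
    Returns {issuer_id: (gross_cents, interchange_cents, net_paid_cents)}.
    """
    gross = defaultdict(int)
    for issuer_id, amount_cents in captures:
        gross[issuer_id] += amount_cents

    positions = {}
    for issuer_id, total in gross.items():
        # Integer arithmetic with round-half-up, so no float error creeps in.
        interchange = (total * INTERCHANGE_BPS + 5_000) // 10_000
        positions[issuer_id] = (total, interchange, total - interchange)
    return positions


positions = net_positions([("bank_a", 4999), ("bank_a", 10000), ("bank_b", 2500)])
print(positions)
```

The net figure is both what the issuing bank pays and what the merchant's bank receives, matching step 4.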
Components
- API Gateway: TLS termination, authentication, rate limiting, idempotency layer. All merchant-facing endpoints.
- Authorization Engine: Core transaction processing. Orchestrates fraud check, balance check, and hold placement. Stateless — all state in DB.
- Tokenization Service: Runs in the CDE (isolated network). Handles PAN → token and token → BIN lookups. Connected to HSMs.
- Fraud Detection Service: Real-time scoring pipeline. Rules engine + ML model. Feature store (Redis) for velocity counters (transactions per card per hour, etc.).
- Issuer Gateway: Routes authorization requests to the correct issuing bank based on BIN. Manages connections to 10K+ issuing banks worldwide.
- Settlement Engine: Batch processing of captures into settlement files. Runs at EOD. Handles netting, interchange calculation, and reconciliation.
- Reconciliation Service: Compares our transaction records with card network records daily. Flags mismatches for manual review. Critical for financial accuracy.
- Chargeback Handler: Processes dispute requests. Manages the lifecycle: dispute filed → merchant notified → evidence submitted → resolution.
Idempotency Layer (critical)
On every mutating request:
1. Hash(idempotency_key) → look up in idempotency_store (Redis + PostgreSQL)
2. If found: return cached response (exact same response as original)
3. If not found:
→ Process request
→ Store response keyed by idempotency_key (TTL: 24 hours)
→ Return response
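Combined with a unique constraint on the key, this gives effectively exactly-once processing for retries that arrive within the TTL window. A retry after the key has expired is treated as a new request.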
6. Deep Dives (15 min)
Deep Dive 1: PCI DSS Compliance and Tokenization
PCI DSS Level 1 applies to merchants processing > 6M card transactions/year (service providers reach Level 1 at a much lower volume). At 5B/day, we’re far beyond either threshold.
Key requirements and how we meet them:
Requirement 1: Network segmentation
Public Zone:
→ API Gateway (no card data — only tokens)
→ Merchant Dashboard
Private Zone:
→ Authorization Engine
→ Settlement Engine
→ Fraud Service
CDE (Cardholder Data Environment):
→ Token Vault
→ HSMs
→ No direct internet access (firewalled from the public zone; only the Tokenization Service endpoint accepts inbound PANs)
→ All access logged and audited
→ Separate credentials, separate VPN
Tokenization flow:
1. Merchant's browser SDK (PCI-compliant iframe) → sends raw PAN directly to Tokenization Service
(merchant's server NEVER sees the raw PAN — reduces their PCI scope)
2. Tokenization Service:
→ Validate card (Luhn check, expiry)
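→ Generate fingerprint: HMAC-SHA-256(secret key, PAN) for deduplication (the key must be global rather than per-card, or identical PANs would not match; an unkeyed hash of a 16-digit PAN is brute-forceable)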
→ Check if fingerprint exists → if yes, return existing token
→ If new: encrypt PAN with AES-256-GCM using HSM-managed key
→ Store encrypted PAN in token vault
→ Return token ("tok_" + random string; opaque and not derived from the PAN)
3. All subsequent API calls use token, never PAN
4. Detokenization only happens inside the CDE for:
→ Generating settlement files (need last 4 digits)
→ Routing to issuer (need BIN — first 6 digits)
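→ Routing reads the plaintext issuer_bin column, so it needs no decryption; AES-GCM authenticates the whole ciphertext, so there is no safe way to decrypt only some digits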
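The validate, fingerprint, and deduplicate steps of the tokenization flow can be sketched as follows. Several things here are assumptions for illustration: the fingerprint uses HMAC-SHA-256 as one way to realize the salted hash, the key is a hard-coded placeholder (a real key would be HSM-managed), the vault is an in-memory dict, and encryption of the PAN is omitted entirely.

```python
import hashlib
import hmac
import secrets

FINGERPRINT_KEY = b"demo-only-key"   # hypothetical; a real key would live in the HSM
_vault = {}                          # fingerprint -> token, stand-in for the token vault


def luhn_valid(pan: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    if not (pan.isascii() and pan.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(pan)):
        d = int(ch)
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def fingerprint(pan: str) -> str:
    """Keyed hash of the PAN, used only for deduplication."""
    return hmac.new(FINGERPRINT_KEY, pan.encode(), hashlib.sha256).hexdigest()


def tokenize(pan: str) -> str:
    """Return the existing token for a PAN, or issue a new random one."""
    if not luhn_valid(pan):
        raise ValueError("invalid card number")
    fp = fingerprint(pan)
    if fp not in _vault:
        # Encrypting and storing the PAN is omitted from this sketch.
        _vault[fp] = "tok_" + secrets.token_hex(16)
    return _vault[fp]


first = tokenize("4111111111111111")    # a well-known test card number
second = tokenize("4111111111111111")   # same PAN, so the same token comes back
```

Because the token comes from a random generator rather than from the PAN, possessing a token reveals nothing about the card without a vault lookup.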
Key management:
- Master key stored in HSM (never exportable)
- Data encryption keys (DEKs) encrypted by master key (envelope encryption)
- DEK rotation every 90 days — re-encrypt token vault with new DEK
- All key operations logged to immutable audit log
Deep Dive 2: Real-Time Fraud Detection
Two-phase approach: rules engine + ML model, both must complete in < 20ms
Phase 1: Rules Engine (< 5ms)
Hard rules (instant decline):
- Card reported stolen/lost
- Country on sanctions list
- Merchant on blacklist
- Card expired
Velocity rules (Redis counters):
- > 5 transactions in 1 minute on same card → flag
- > $5,000 in 1 hour on same card → flag
- Transaction in Country A, then Country B within 1 hour → impossible travel → flag
- First transaction on a new card > $500 → flag
Redis operations (O(1) per check):
INCR fraud:card:{token}:txn_count:min:{minute_bucket}
INCRBY fraud:card:{token}:amount:hour:{hour_bucket} {amount}
GET fraud:card:{token}:last_country
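The counter-based velocity rules can be sketched with an in-memory dict standing in for Redis. The key scheme mirrors the commands above; the function name and flag strings are hypothetical, and key expiry (which Redis would handle with a TTL on each bucket) is left out.

```python
from collections import defaultdict

MAX_TXNS_PER_MINUTE = 5
MAX_CENTS_PER_HOUR = 500_000   # $5,000

_counters = defaultdict(int)   # stand-in for Redis


def check_velocity(token: str, amount_cents: int, now: float) -> list:
    """Record one transaction and return the velocity flags it trips."""
    minute_bucket = int(now // 60)
    hour_bucket = int(now // 3600)

    count_key = f"fraud:card:{token}:txn_count:min:{minute_bucket}"
    amount_key = f"fraud:card:{token}:amount:hour:{hour_bucket}"

    _counters[count_key] += 1                 # equivalent of INCR
    _counters[amount_key] += amount_cents     # equivalent of INCRBY

    flags = []
    if _counters[count_key] > MAX_TXNS_PER_MINUTE:
        flags.append("txn_count_per_minute")
    if _counters[amount_key] > MAX_CENTS_PER_HOUR:
        flags.append("amount_per_hour")
    return flags
```

Fixed buckets only approximate "within 1 minute": a burst that straddles a bucket boundary is split across two counters. A sliding window closes that gap at the cost of more state per card.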
Phase 2: ML Model (< 15ms)
Features (computed in real-time):
- Transaction features: amount, MCC, time of day, day of week
- Card features: age, avg transaction amount, typical MCC, typical country
- Merchant features: chargeback rate, average ticket size
- Behavioral: time since last transaction, deviation from spending pattern
- Network: number of cards used at this merchant today
Model: gradient boosted trees (XGBoost/LightGBM)
- Trained offline on historical fraud-labeled data
- Inference: < 5ms (pre-loaded model, feature vector is ~200 floats)
- Output: fraud_score [0, 1]
- Threshold: ≥ 0.85 → auto-decline, 0.5 to < 0.85 → step-up auth (3D Secure), < 0.5 → approve
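The threshold policy maps a score to one of three outcomes. A minimal sketch follows; the function name and return strings are hypothetical, and treating a score of exactly 0.85 as a decline (and exactly 0.5 as step-up) is an assumption about boundary handling.

```python
DECLINE_THRESHOLD = 0.85
STEP_UP_THRESHOLD = 0.5


def decide(fraud_score: float) -> str:
    """Map a fraud score in [0, 1] to approve, step_up, or decline."""
    if not 0.0 <= fraud_score <= 1.0:
        raise ValueError("fraud_score must be in [0, 1]")
    if fraud_score >= DECLINE_THRESHOLD:
        return "decline"
    if fraud_score >= STEP_UP_THRESHOLD:
        return "step_up"      # trigger 3D Secure
    return "approve"
```

Keeping the thresholds as named constants makes them easy to tune per merchant category without touching the decision logic.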
Model serving:
- Model loaded in-memory on each authorization node
- Updated via blue-green deployment (no downtime)
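- Shadow mode for new models: run old and new side by side on live traffic, act only on the old model's score, and promote the new model once it beats the old one on labeled outcomes (e.g., fraud caught vs. false positives)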
Feedback loop:
- Chargebacks labeled as fraud → add to training data
- Declined transactions that were legitimate (customer calls) → false positive feedback
- Retrain model weekly with fresh data
Deep Dive 3: Idempotency and Exactly-Once Processing
The problem: Network partitions between merchant and our API gateway are common. The merchant retries. Without idempotency, the customer gets charged twice.
Implementation:
Idempotency key: merchant-provided unique ID per transaction attempt
Step 1: Before processing, check idempotency store:
-- PostgreSQL (durable):
SELECT response_body, status_code
FROM idempotency_keys
WHERE key = $1 AND merchant_id = $2;
-- Redis (fast path, cache):
GET idempotency:{merchant_id}:{key}
Step 2: If found → return cached response immediately (no re-processing)
Step 3: If not found:
a. Insert a "processing" record:
INSERT INTO idempotency_keys (key, merchant_id, status)
VALUES ($1, $2, 'processing');
-- UNIQUE constraint prevents parallel duplicate inserts
b. Process the transaction normally
c. Update with the response:
UPDATE idempotency_keys
SET status = 'completed', response_body = $3, status_code = $4
WHERE key = $1 AND merchant_id = $2;
d. Cache in Redis (TTL: 24 hours)
Step 4: Race condition handling:
If step 3a fails (unique constraint violation) → another request is already processing the same key
→ Poll the idempotency record with exponential backoff (bounded by a timeout) until it reaches 'completed'
→ Return the stored response
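The claim-then-complete pattern in steps 1 to 4 can be sketched with an in-memory SQLite database standing in for PostgreSQL. The function name `handle` and the `process` callback are hypothetical; the Redis cache, the status_code column, and the backoff loop are omitted, so an in-progress key simply raises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE idempotency_keys (
        key           TEXT NOT NULL,
        merchant_id   TEXT NOT NULL,
        status        TEXT NOT NULL,
        response_body TEXT,
        UNIQUE (key, merchant_id)
    )
    """
)


def handle(key, merchant_id, process):
    """Run process() at most once per (key, merchant_id); replay otherwise."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO idempotency_keys (key, merchant_id, status) "
                "VALUES (?, ?, 'processing')",
                (key, merchant_id),
            )
    except sqlite3.IntegrityError:
        # The key was already claimed: replay the stored response if present.
        row = conn.execute(
            "SELECT status, response_body FROM idempotency_keys "
            "WHERE key = ? AND merchant_id = ?",
            (key, merchant_id),
        ).fetchone()
        if row[0] == "completed":
            return row[1]
        raise RuntimeError("request still in progress; retry with backoff")

    response = process()
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET status = 'completed', response_body = ? "
            "WHERE key = ? AND merchant_id = ?",
            (response, key, merchant_id),
        )
    return response
```

The unique constraint is what makes the claim atomic: two concurrent requests with the same key cannot both insert, so only one of them ever runs `process()`.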
Edge case: What if the server crashes mid-processing (after step 3a, before 3c)?
- The idempotency record is in “processing” state with no response
- Cleanup job: any “processing” records older than 60 seconds → mark as “failed”
- Next retry from merchant → sees “failed” → re-processes from scratch
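- Re-processing is only safe if the issuer did not already place a hold. Some issuers deduplicate repeated authorizations (same card+amount+merchant within a window), but this should not be assumed: send an authorization reversal for the abandoned attempt before retrying.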
Multi-currency handling:
POST /v1/authorize
Body: {
"amount": 4200,
"currency": "EUR", // customer's card currency: USD
"fx_rate_id": "fx_abc123" // locked rate from /v1/fx-rates
}
FX rate locking:
1. Merchant calls GET /v1/fx-rates?from=EUR&to=USD → returns rate + fx_rate_id (valid for 60 seconds)
2. Authorization uses the locked rate: 4200 EUR cents (€42.00) × 1.0852 = 4558 USD cents ($45.58) authorized
3. Settlement uses the rate at settlement time (may differ — merchant bears FX risk, or we offer guaranteed rate for a fee)
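The locked-rate conversion in step 2 can be sketched with decimal arithmetic. The function name and parameters are hypothetical. The sketch assumes both currencies use two minor-unit digits and that amounts round half-up; the source does not specify a rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE_TTL_SECONDS = 60   # lock validity stated above


def convert_minor_units(amount_minor: int, rate: str,
                        locked_at: float, now: float) -> int:
    """Convert minor units using a locked rate, rejecting expired locks."""
    if now - locked_at > RATE_TTL_SECONDS:
        raise ValueError("fx rate lock expired")
    converted = Decimal(amount_minor) * Decimal(rate)
    return int(converted.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# 4200 EUR cents at 1.0852 is 4557.84, which rounds to 4558 USD cents.
usd_cents = convert_minor_units(4200, "1.0852", locked_at=0.0, now=30.0)
print(usd_cents)   # 4558
```

Passing the rate as a string keeps it exact; a float rate would reintroduce the precision problems that integer amounts are meant to avoid.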
7. Extensions (2 min)
- 3D Secure (step-up authentication): For medium-risk transactions (fraud score 0.5-0.85), redirect the customer to their issuing bank’s authentication page (password, biometric, SMS OTP). Shifts fraud liability from merchant to issuer.
- Recurring payments and subscriptions: Store a card-on-file token. Process recurring charges on a schedule. Handle declines gracefully with smart retry logic (retry on the 1st, 15th; avoid weekends; try alternate card on file).
- Real-time merchant analytics: Stream transaction data to a real-time analytics engine. Merchants see live dashboards: revenue, approval rate, average ticket size, chargeback rate. Alert on anomalies (sudden drop in approval rate → possible configuration issue).
- Network tokenization: Replace our tokens with card network tokens (Visa Token Service, Mastercard MDES). Network tokens have higher approval rates because the issuing bank trusts them more. Automatically update when cards are reissued.
- Instant payouts: Instead of T+2 settlement (standard), offer merchants instant or same-day payouts for a fee. Requires pre-funding a pool and managing settlement risk.