1. Requirements & Scope (5 min)
Functional Requirements
- Create and manage experiments with multiple variants (A/B/n) and traffic allocation percentages
- Assign users deterministically to experiment variants (same user always sees the same variant)
- Support mutual exclusion (user in experiment X cannot be in experiment Y) and layering (independent experiments can run simultaneously)
- Collect and compute metrics (conversion rate, revenue, engagement) with statistical significance testing
- Provide a dashboard showing experiment results, confidence intervals, sample sizes, and guardrail metric alerts
Non-Functional Requirements
- Availability: 99.99% for the assignment service — if it goes down, every feature behind a flag breaks
- Latency: < 5ms for variant assignment — it’s in the critical path of page renders and API calls
- Consistency: Assignment must be deterministic and sticky. A user must always see the same variant for the lifetime of an experiment.
- Scale: 500M daily active users, 50K+ concurrent experiments, 100B+ assignment checks/day
- Durability: No event loss for exposure and metric events — statistical validity depends on complete data
2. Estimation (3 min)
Traffic
- 500M DAU, average 20 page views/day = 10B page views/day
- Each page view checks ~10 experiments = 100B assignment checks/day ≈ 1.15M checks/sec
- Exposure logging: ~10B exposure events/day (one per experiment per user per session)
- Metric events: ~50B events/day (clicks, conversions, revenue, etc.)
Storage
- Experiment configs: 50K experiments × 5KB = 250MB — trivially fits in memory
- Exposure events: 10B/day × 100 bytes = 1TB/day
- Metric events: 50B/day × 150 bytes = 7.5TB/day
- Total raw event storage: ~8.5TB/day, ~3PB/year → needs columnar storage (data warehouse)
Compute
- Statistical analysis: for each experiment, aggregate metrics across millions of users. A single experiment with 10M users and 5 metrics requires scanning ~50M rows. Running 50K experiments → batch compute pipeline (Spark/Presto), not real-time.
- Pre-aggregation per experiment per day reduces warehouse scans dramatically.
Key Insight
Assignment is a latency-critical, read-heavy problem (solved by hashing, no database needed). Analysis is a compute-heavy, batch problem (solved by a data pipeline). These are two very different subsystems.
3. API Design (3 min)
Experiment Management
POST /experiments
Body: {
"name": "checkout_button_color",
"description": "Test blue vs green checkout button",
"layer": "checkout_ui",
"variants": [
{"name": "control", "weight": 50},
{"name": "treatment_blue", "weight": 25},
{"name": "treatment_green", "weight": 25}
],
"metrics": ["conversion_rate", "revenue_per_user", "bounce_rate"],
"guardrail_metrics": ["error_rate", "page_load_time"],
"target_sample_size": 500000,
"targeting": {"country": ["US", "CA"], "platform": ["web"]},
"status": "draft"
}
PATCH /experiments/{id}/status
Body: {"status": "running"} // draft → running → stopped → completed
GET /experiments/{id}/results
Response: {
"experiment_id": "exp_123",
"variants": [
{
"name": "control",
"users": 251023,
"conversion_rate": 0.0342,
"confidence_interval": [0.0331, 0.0353],
"revenue_per_user": 2.14
},
{
"name": "treatment_blue",
"users": 124891,
"conversion_rate": 0.0389,
"confidence_interval": [0.0372, 0.0406],
"revenue_per_user": 2.31,
"p_value": 0.003,
"lift": "+13.7%",
"significant": true
}
],
"guardrails": {
"error_rate": {"status": "pass", "delta": "+0.01%"},
"page_load_time": {"status": "pass", "delta": "-12ms"}
}
}
SDK / Assignment Endpoint
GET /assignments?user_id=u_abc&context=checkout_page
Response: {
"assignments": {
"checkout_button_color": "treatment_blue",
"checkout_layout": "control",
"pricing_display": "variant_a"
}
}
Key Decision
Assignment is computed client-side via a hash function with locally-cached experiment configs. The SDK only calls the server to refresh configs (every 60 seconds). This eliminates the assignment service as a bottleneck entirely.
4. Data Model (3 min)
Experiments (PostgreSQL)
Table: experiments
experiment_id (PK) | uuid
name | varchar(200)
description | text
layer_id | uuid (FK → layers)
status | enum('draft', 'running', 'stopped', 'completed')
targeting_rules | jsonb
metrics | jsonb (list of metric definitions)
guardrail_metrics | jsonb
target_sample_size | int
created_by | varchar(100)
created_at | timestamp
started_at | timestamp
stopped_at | timestamp
Table: variants
variant_id (PK) | uuid
experiment_id (FK) | uuid
name | varchar(100)
weight | int (0-10000, basis points for precision; percentage weights in the API are converted on write)
Table: layers
layer_id (PK) | uuid
name | varchar(100)
description | text
traffic_percentage | int (0-100)
Exposure Events (Kafka → Data Warehouse / ClickHouse)
Table: exposures
event_id | uuid
experiment_id | uuid
variant_name | varchar(100)
user_id | varchar(100)
timestamp | timestamp
context | jsonb (page, platform, country)
Metric Events (Kafka → Data Warehouse)
Table: metric_events
event_id | uuid
user_id | varchar(100)
metric_name | varchar(100)
value | float
timestamp | timestamp
context | jsonb
Why These Choices
- PostgreSQL for experiments: Low volume, strong consistency for config, ACID guarantees for status transitions
- ClickHouse/BigQuery for events: Columnar storage optimized for aggregation queries across billions of rows. Events are append-only → perfect for columnar.
- Kafka as event bus: Decouples ingestion from processing, handles 50B+ events/day, provides replay capability
5. High-Level Design (12 min)
Assignment Flow
Client App (SDK)
→ Fetches experiment configs (cached locally, refreshed every 60s)
→ On feature flag check:
1. Hash(experiment_salt + user_id) → deterministic bucket in [0, 9999]
2. Check targeting rules (country, platform, user segment)
3. Check layer allocation (is user in this layer's traffic?)
4. Map bucket to variant based on weights
5. Log exposure event → local buffer → Kafka
→ Return variant name to application code
Analysis Pipeline
Kafka (exposure + metric events)
→ Stream Processing (Flink/Spark Streaming)
→ Deduplicate events
→ Join exposures with metrics (by user_id)
→ Pre-aggregate per experiment per variant per day
→ Write to Data Warehouse (ClickHouse)
→ Batch Analysis (hourly/daily)
→ Compute statistical significance per experiment
→ Check guardrail metrics
→ Update experiment results in PostgreSQL
→ Dashboard reads from PostgreSQL + ClickHouse
Config Distribution
Experiment Manager (CRUD) → PostgreSQL
→ Config Publisher (on change)
→ Serialize all active experiments to a JSON blob
→ Push to CDN / Edge KV store
→ SDK polls CDN every 60s (or SSE for real-time push)
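The SDK-side refresh loop might look like the sketch below. The URL, ETag handling, and 60s interval are assumptions, not a documented API; the key property shown is that a failed refresh keeps serving the stale snapshot rather than failing assignment, which is what makes the 99.99% availability target achievable.

```python
import json
import threading
import time
import urllib.request


class ConfigCache:
    """Hypothetical SDK-side config cache: polls the config endpoint
    periodically and swaps the snapshot atomically on success."""

    def __init__(self, url: str, interval: float = 60.0):
        self.url, self.interval = url, interval
        self.snapshot: dict = {}     # experiment name -> config
        self._etag = None

    def refresh_once(self):
        req = urllib.request.Request(self.url)
        if self._etag:
            # Conditional GET: a 304 raises HTTPError and is caught below,
            # leaving the current snapshot in place.
            req.add_header("If-None-Match", self._etag)
        try:
            with urllib.request.urlopen(req, timeout=2) as resp:
                self._etag = resp.headers.get("ETag")
                self.snapshot = json.loads(resp.read())
        except Exception:
            pass  # keep serving the stale snapshot; assignment stays available

    def start(self):
        def loop():
            while True:
                self.refresh_once()
                time.sleep(self.interval)
        threading.Thread(target=loop, daemon=True).start()
```

Because assignment is pure hashing over this cached snapshot, a CDN outage degrades config freshness, not availability.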
Components
- SDK (client-side): Handles assignment via hashing. Caches experiment configs. Logs exposure events. Available for web (JS), iOS, Android, server-side (Python, Go, Java).
- Experiment Manager Service: CRUD for experiments, layers, variants. Validates configs. Publishes config snapshots.
- Config CDN: Global edge distribution of experiment configs. Ensures < 5ms assignment even without server calls.
- Event Ingestion (Kafka): Receives exposure and metric events. Partitioned by user_id for ordered processing.
- Stream Processor (Flink): Deduplicates exposures, pre-aggregates metrics, joins exposure with outcome data.
- Data Warehouse (ClickHouse): Stores raw and aggregated event data. Supports ad-hoc queries.
- Analysis Engine: Runs statistical tests (t-test, chi-square, sequential analysis). Computes confidence intervals and p-values.
- Dashboard: Visualizes experiment results, power analysis, metric trends, guardrail alerts.
6. Deep Dives (15 min)
Deep Dive 1: Deterministic Assignment via Hashing
The core requirement is: the same user always gets the same variant, without storing any per-user state.
Algorithm:
bucket = hash(experiment_salt + user_id) % 10000 // range [0, 9999]
Variants:
control: weight=5000 → buckets [0, 4999]
treatment_blue: weight=2500 → buckets [5000, 7499]
treatment_green: weight=2500 → buckets [7500, 9999]
If bucket = 6231 → user gets treatment_blue
Why this works:
- Deterministic: same input always produces the same hash. No state to store or look up.
- Uniform: a good hash (MD5, MurmurHash3, SHA-256) distributes uniformly across buckets
- Experiment-specific salt ensures the same user gets different assignments across independent experiments (avoiding correlation)
Hash function choice: MurmurHash3 — fast (< 1 microsecond), excellent uniformity, not cryptographic (no need for it here). Avoid Math.random() or any non-deterministic approach.
Ramp-up without reassignment: If we increase treatment from 25% to 50%, we extend the bucket range [5000, 7499] → [5000, 9999]. Users already in treatment stay in treatment. Some control users move to treatment. No one in treatment moves back. This is critical for experiment integrity.
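The bucketing scheme above can be sketched in a few lines. MurmurHash3 is the production choice; MD5 stands in here only so the sketch stays stdlib-only, and the salt format and function names are illustrative. Because variant ranges are allocated cumulatively in declaration order, growing a treatment's weight only moves users forward into treatment, never back — the ramp-up property described above.

```python
import hashlib

BUCKETS = 10_000


def bucket_for(salt: str, user_id: str) -> int:
    """Deterministic bucket in [0, 9999]. MD5 stands in for MurmurHash3."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign(salt: str, user_id: str, variants: list[tuple[str, int]]) -> str:
    """variants: (name, weight) pairs in basis points summing to 10_000.
    Ranges are cumulative in declaration order, so ramping a treatment up
    only moves users into it, never out of it."""
    b = bucket_for(salt, user_id)
    cumulative = 0
    for name, weight in variants:
        cumulative += weight
        if b < cumulative:
            return name
    raise ValueError("weights must sum to 10_000")


variants = [("control", 5000), ("treatment_blue", 2500), ("treatment_green", 2500)]
v1 = assign("exp_checkout_button", "u_abc", variants)
v2 = assign("exp_checkout_button", "u_abc", variants)
assert v1 == v2  # sticky: same user, same variant, no state stored
```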
Deep Dive 2: Mutual Exclusion and Layering
Problem: If a user is in Experiment A (checkout flow change) and Experiment B (pricing display change), and revenue increases, which experiment caused it? We can’t tell — they’re confounded.
Solution: Layers
A layer is a slice of traffic. Each layer is independent:
Layer: "checkout_ui" (100% of traffic)
├── Experiment A: checkout_button_color (50% of layer traffic)
└── Experiment B: checkout_layout (30% of layer traffic)
→ Mutually exclusive: A and B occupy disjoint slices of the layer's traffic
Layer: "pricing" (100% of traffic)
└── Experiment C: pricing_display (100% of layer traffic)
→ Independent: C runs on all users regardless of checkout_ui assignments
How mutual exclusion works within a layer:
layer_bucket = hash(layer_salt + user_id) % 10000
Experiment A: allocated buckets [0, 4999] (50%)
Experiment B: allocated buckets [5000, 7999] (30%)
Free traffic: buckets [8000, 9999] (20%)
User with layer_bucket = 3000 → eligible for Experiment A (not B)
User with layer_bucket = 6500 → eligible for Experiment B (not A)
User with layer_bucket = 9000 → not in any experiment in this layer
Within an experiment, a second hash (experiment_salt + user_id) determines the variant.
Key insight: Different salt per layer ensures layer assignments are independent. A user in bucket 3000 for “checkout_ui” might be in bucket 7500 for “pricing” — completely uncorrelated.
Deep Dive 3: Statistical Analysis and Guardrails
Sample Size Calculation (pre-experiment):
For a two-sided t-test:
n = (Z_{α/2} + Z_β)² × 2σ² / δ²
Where:
α = 0.05 (significance level)
β = 0.20 (power = 80%)
σ = baseline metric standard deviation
δ = minimum detectable effect (MDE)
Example: baseline conversion = 3.5%, MDE = 0.5% (relative 14%)
σ² = p(1-p) = 0.035 × 0.965 = 0.0338
n = (1.96 + 0.84)² × 2 × 0.0338 / (0.005)²
n ≈ 21,200 per variant → ~42,400 total
At 100K eligible users/day → under a day to reach the target sample size
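The calculation above as a small helper, assuming an absolute MDE on a proportion and taking the z-quantiles from the stdlib rather than a lookup table:

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_baseline: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Fixed-horizon two-sided test on a proportion, per the formula
    n = (z_{alpha/2} + z_beta)^2 * 2*sigma^2 / delta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84
    sigma_sq = p_baseline * (1 - p_baseline)       # variance of a Bernoulli
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * sigma_sq / mde_abs ** 2)


# Worked example from above: 3.5% baseline, 0.5% absolute MDE
n = sample_size_per_variant(0.035, 0.005)  # ≈ 21,200 per variant
```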
Sequential Testing (avoiding peeking problem):
The classic problem: a PM checks results every day, and on day 3, p-value dips below 0.05. They call the experiment. But the p-value would have risen back above 0.05 on day 5 — this was a false positive caused by repeated testing.
Solution: Always-valid confidence intervals / sequential analysis
- Use methods like mSPRT (mixture Sequential Probability Ratio Test) or confidence sequences
- These produce p-values that are valid at any stopping time, not just at a pre-determined sample size
- Cost: requires ~20-30% more samples for the same power as a fixed-horizon test
- Benefit: teams can check results any time without inflating false positive rates
Guardrail Metrics:
- Define “do no harm” metrics: page load time, error rate, crash rate, revenue (for non-revenue experiments)
- Run a one-sided test: is the treatment significantly WORSE than control?
- If any guardrail is triggered (p < 0.01 for degradation), auto-alert the experiment owner and optionally auto-stop the experiment
- Guardrails use a more lenient significance threshold (0.01 instead of 0.05) to reduce false alarms while still catching real regressions
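One plausible shape for the guardrail check is a one-sided two-proportion z-test with a normal approximation; the function name and the 0.01 threshold follow this section, but the exact test a real system uses may differ:

```python
import math
from statistics import NormalDist


def guardrail_degraded(x_ctrl: int, n_ctrl: int,
                       x_trt: int, n_trt: int,
                       alpha: float = 0.01) -> bool:
    """One-sided z-test on two proportions (e.g. error rates): is the
    treatment rate significantly HIGHER than control? Uses the lenient
    0.01 threshold to cut false alarms."""
    p_c, p_t = x_ctrl / n_ctrl, x_trt / n_trt
    p_pool = (x_ctrl + x_trt) / (n_ctrl + n_trt)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_trt))
    if se == 0:
        return False
    z = (p_t - p_c) / se
    p_value = 1 - NormalDist().cdf(z)  # one-sided: treatment worse
    return p_value < alpha
```

A triggered guardrail would then fire the alert (and optionally the auto-stop) described above.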
Metric Pipeline:
Raw events → Flink →
1. Deduplicate by (user_id, experiment_id, day) for exposures
2. Join: for each exposed user, collect their metric events
3. Aggregate: per user → sum/count/mean of each metric
4. Per experiment → compute:
- Mean and variance per variant
- Welch's t-test (unequal variances)
- Confidence intervals
- Sequential testing boundaries
5. Write results → PostgreSQL → Dashboard
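Step 4's significance computation might look like the sketch below. A normal approximation stands in for the exact t distribution, which is reasonable at experiment-scale sample sizes and keeps the sketch stdlib-only (scipy.stats would give the exact tail); the per-variant aggregates are hypothetical numbers in the spirit of the results API above.

```python
import math
from statistics import NormalDist


def welch_t(mean_c: float, var_c: float, n_c: int,
            mean_t: float, var_t: float, n_t: int) -> tuple[float, float]:
    """Welch's t statistic (unequal variances) and a two-sided p-value.
    With millions of users per variant, the normal approximation to the
    t distribution is adequate."""
    se = math.sqrt(var_c / n_c + var_t / n_t)  # standard error of the difference
    t = (mean_t - mean_c) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical per-variant aggregates (conversion as a 0/1 metric,
# so var = p * (1 - p))
t, p = welch_t(mean_c=0.0342, var_c=0.0330, n_c=251_023,
               mean_t=0.0389, var_t=0.0374, n_t=124_891)
```

These (t, p) pairs are what step 5 writes to PostgreSQL for the dashboard.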
7. Extensions (2 min)
- Feature flags integration: Experiments are just feature flags with measurement. Support permanent flags, kill switches, and gradual rollouts (0% → 5% → 25% → 100%) using the same assignment infrastructure.
- Multi-armed bandit: Instead of fixed 50/50 traffic, dynamically shift traffic toward the winning variant using Thompson Sampling or UCB. Reduces opportunity cost of running losing variants.
- Holdback groups: Permanently keep 1-5% of users on the old experience as a long-term baseline. Measures cumulative effect of all shipped experiments.
- Interaction detection: Automatically detect when two experiments in different layers interact (e.g., combined effect differs from sum of individual effects). Uses ANOVA on the cross-product of variant assignments.
- Automated experiment lifecycle: Auto-stop experiments when they reach statistical significance or target sample size. Auto-ramp winners to 100%. Alert on stale experiments running > 30 days with no decision.