1. Requirements & Scope (5 min)

Functional Requirements

  1. Create and manage experiments with multiple variants (A/B/n) and traffic allocation percentages
  2. Assign users deterministically to experiment variants (same user always sees the same variant)
  3. Support mutual exclusion (user in experiment X cannot be in experiment Y) and layering (independent experiments can run simultaneously)
  4. Collect and compute metrics (conversion rate, revenue, engagement) with statistical significance testing
  5. Provide a dashboard showing experiment results, confidence intervals, sample sizes, and guardrail metric alerts

Non-Functional Requirements

  • Availability: 99.99% for the assignment service — if it goes down, every feature behind a flag breaks
  • Latency: < 5ms for variant assignment — it’s in the critical path of page renders and API calls
  • Consistency: Assignment must be deterministic and sticky. A user must always see the same variant for the lifetime of an experiment.
  • Scale: 500M daily active users, 50K+ concurrent experiments, 100B+ assignment checks/day
  • Durability: No event loss for exposure and metric events — statistical validity depends on complete data

2. Estimation (3 min)

Traffic

  • 500M DAU, average 20 page views/day = 10B page views/day
  • Each page view checks ~10 experiments = 100B assignment checks/day ≈ 1.15M checks/sec
  • Exposure logging: ~10B exposure events/day (one per experiment per user per session)
  • Metric events: ~50B events/day (clicks, conversions, revenue, etc.)

Storage

  • Experiment configs: 50K experiments × 5KB = 250MB — trivially fits in memory
  • Exposure events: 10B/day × 100 bytes = 1TB/day
  • Metric events: 50B/day × 150 bytes = 7.5TB/day
  • Total raw event storage: ~8.5TB/day, ~3PB/year → needs columnar storage (data warehouse)

Compute

  • Statistical analysis: for each experiment, aggregate metrics across millions of users. A single experiment with 10M users and 5 metrics requires scanning ~50M rows. Running 50K experiments → batch compute pipeline (Spark/Presto), not real-time.
  • Pre-aggregation per experiment per day reduces warehouse scans dramatically.

Key Insight

Assignment is a latency-critical, read-heavy problem (solved by hashing, no database needed). Analysis is a compute-heavy, batch problem (solved by a data pipeline). These are two very different subsystems.


3. API Design (3 min)

Experiment Management

POST /experiments
  Body: {
    "name": "checkout_button_color",
    "description": "Test blue vs green checkout button",
    "layer": "checkout_ui",
    "variants": [
      {"name": "control", "weight": 50},
      {"name": "treatment_blue", "weight": 25},
      {"name": "treatment_green", "weight": 25}
    ],
    "metrics": ["conversion_rate", "revenue_per_user", "bounce_rate"],
    "guardrail_metrics": ["error_rate", "page_load_time"],
    "target_sample_size": 500000,
    "targeting": {"country": ["US", "CA"], "platform": ["web"]},
    "status": "draft"
  }

PATCH /experiments/{id}/status
  Body: {"status": "running"}  // draft → running → stopped → completed

GET /experiments/{id}/results
  Response: {
    "experiment_id": "exp_123",
    "variants": [
      {
        "name": "control",
        "users": 251023,
        "conversion_rate": 0.0342,
        "confidence_interval": [0.0331, 0.0353],
        "revenue_per_user": 2.14
      },
      {
        "name": "treatment_blue",
        "users": 124891,
        "conversion_rate": 0.0389,
        "confidence_interval": [0.0372, 0.0406],
        "revenue_per_user": 2.31,
        "p_value": 0.003,
        "lift": "+13.7%",
        "significant": true
      }
    ],
    "guardrails": {
      "error_rate": {"status": "pass", "delta": "+0.01%"},
      "page_load_time": {"status": "pass", "delta": "-12ms"}
    }
  }

SDK / Assignment Endpoint

GET /assignments?user_id=u_abc&context=checkout_page
  Response: {
    "assignments": {
      "checkout_button_color": "treatment_blue",
      "checkout_layout": "control",
      "pricing_display": "variant_a"
    }
  }

Key Decision

Assignment is computed client-side via a hash function with locally-cached experiment configs. The SDK only calls the server to refresh configs (every 60 seconds). This eliminates the assignment service as a bottleneck entirely.
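The caching behavior this decision relies on can be sketched as a small TTL wrapper. This is a minimal sketch, not the SDK's actual API: `make_config_cache` and its `fetch`/`clock` parameters are hypothetical names, and the key property shown is that a failed refresh serves stale configs rather than breaking assignment.

```python
import time

def make_config_cache(fetch, ttl=60, clock=time.time):
    """Wrap a fetch() callable (e.g. an HTTP GET against the config CDN)
    in a TTL cache. On fetch failure the stale copy keeps being served,
    so assignment never breaks while the config service is down."""
    state = {"configs": {}, "fetched_at": float("-inf")}

    def get_configs():
        now = clock()
        if now - state["fetched_at"] > ttl:
            try:
                state["configs"] = fetch()
            except Exception:
                pass  # keep stale configs; retry after the next TTL window
            state["fetched_at"] = now
        return state["configs"]

    return get_configs
```

Injecting `fetch` and `clock` keeps the sketch testable without a network; a real SDK would also jitter the refresh interval so clients don't stampede the CDN.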


4. Data Model (3 min)

Experiments (PostgreSQL)

Table: experiments
  experiment_id   (PK) | uuid
  name                  | varchar(200)
  description           | text
  layer_id              | uuid (FK → layers)
  status                | enum('draft', 'running', 'stopped', 'completed')
  targeting_rules       | jsonb
  metrics               | jsonb (list of metric definitions)
  guardrail_metrics     | jsonb
  target_sample_size    | int
  created_by            | varchar(100)
  created_at            | timestamp
  started_at            | timestamp
  stopped_at            | timestamp

Table: variants
  variant_id      (PK) | uuid
  experiment_id   (FK) | uuid
  name                  | varchar(100)
  weight                | int (0-10000, basis points; API percentages like 50 map to 5000)

Table: layers
  layer_id        (PK) | uuid
  name                  | varchar(100)
  description           | text
  traffic_percentage    | int (0-100)

Exposure Events (Kafka → Data Warehouse / ClickHouse)

Table: exposures
  event_id              | uuid
  experiment_id         | uuid
  variant_name          | varchar(100)
  user_id               | varchar(100)
  timestamp             | timestamp
  context               | jsonb (page, platform, country)

Metric Events (Kafka → Data Warehouse)

Table: metric_events
  event_id              | uuid
  user_id               | varchar(100)
  metric_name           | varchar(100)
  value                 | float
  timestamp             | timestamp
  context               | jsonb

Why These Choices

  • PostgreSQL for experiments: Low volume, strong consistency for config, ACID guarantees for status transitions
  • ClickHouse/BigQuery for events: Columnar storage optimized for aggregation queries across billions of rows. Events are append-only → perfect for columnar.
  • Kafka as event bus: Decouples ingestion from processing, handles 50B+ events/day, provides replay capability

5. High-Level Design (12 min)

Assignment Flow

Client App (SDK)
  → Fetches experiment configs (cached locally, refreshed every 60s)
  → On feature flag check:
      1. Hash(experiment_salt + user_id) → deterministic bucket [0, 9999]
      2. Check targeting rules (country, platform, user segment)
      3. Check layer allocation (is user in this layer's traffic?)
      4. Map bucket to variant based on weights
      5. Log exposure event → local buffer → Kafka
  → Return variant name to application code

Analysis Pipeline

Kafka (exposure + metric events)
  → Stream Processing (Flink/Spark Streaming)
    → Deduplicate events
    → Join exposures with metrics (by user_id)
    → Pre-aggregate per experiment per variant per day
  → Write to Data Warehouse (ClickHouse)
  → Batch Analysis (hourly/daily)
    → Compute statistical significance per experiment
    → Check guardrail metrics
    → Update experiment results in PostgreSQL
  → Dashboard reads from PostgreSQL + ClickHouse

Config Distribution

Experiment Manager (CRUD) → PostgreSQL
  → Config Publisher (on change)
    → Serialize all active experiments to a JSON blob
    → Push to CDN / Edge KV store
    → SDK polls CDN every 60s (or SSE for real-time push)

Components

  1. SDK (client-side): Handles assignment via hashing. Caches experiment configs. Logs exposure events. Available for web (JS), iOS, Android, server-side (Python, Go, Java).
  2. Experiment Manager Service: CRUD for experiments, layers, variants. Validates configs. Publishes config snapshots.
  3. Config CDN: Global edge distribution of experiment configs. Ensures < 5ms assignment even without server calls.
  4. Event Ingestion (Kafka): Receives exposure and metric events. Partitioned by user_id for ordered processing.
  5. Stream Processor (Flink): Deduplicates exposures, pre-aggregates metrics, joins exposure with outcome data.
  6. Data Warehouse (ClickHouse): Stores raw and aggregated event data. Supports ad-hoc queries.
  7. Analysis Engine: Runs statistical tests (t-test, chi-square, sequential analysis). Computes confidence intervals and p-values.
  8. Dashboard: Visualizes experiment results, power analysis, metric trends, guardrail alerts.

6. Deep Dives (15 min)

Deep Dive 1: Deterministic Assignment via Hashing

The core requirement is: the same user always gets the same variant, without storing any per-user state.

Algorithm:

bucket = hash(experiment_salt + user_id) % 10000   // range [0, 9999]

Variants:
  control:         weight=5000 → buckets [0, 4999]
  treatment_blue:  weight=2500 → buckets [5000, 7499]
  treatment_green: weight=2500 → buckets [7500, 9999]

If bucket = 6231 → user gets treatment_blue
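The algorithm above fits in a few lines. A minimal sketch, using stdlib MD5 for portability (a production SDK would use MurmurHash3); the weights-sum-to-10000 convention matches the bucket range:

```python
import hashlib

def bucket(salt: str, user_id: str, n_buckets: int = 10000) -> int:
    """Deterministic bucket in [0, n_buckets): same inputs, same bucket."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def assign(experiment_salt: str, user_id: str, variants) -> str:
    """Walk cumulative weight ranges; weights must sum to 10000."""
    b = bucket(experiment_salt, user_id)
    cumulative = 0
    for name, weight in variants:
        cumulative += weight
        if b < cumulative:
            return name
    raise ValueError("variant weights must sum to 10000")

variants = [("control", 5000), ("treatment_blue", 2500), ("treatment_green", 2500)]
```

Note there is no database read anywhere: the only inputs are the user ID and the cached experiment config.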

Why this works:

  • Deterministic: same input always produces the same hash. No state to store or look up.
  • Uniform: a good hash (MD5, MurmurHash3, SHA-256) distributes uniformly across buckets
  • Experiment-specific salt ensures the same user gets different assignments across independent experiments (avoiding correlation)

Hash function choice: MurmurHash3 — fast (< 1 microsecond), excellent uniformity, not cryptographic (no need for it here). Avoid Math.random() or any non-deterministic approach.

Ramp-up without reassignment: If we increase treatment from 25% to 50%, we extend the bucket range [5000, 7499] → [5000, 9999]. Users already in treatment stay in treatment. Some control users move to treatment. No one in treatment moves back. This is critical for experiment integrity.
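The ramp-up property is easy to check empirically. A sketch with stdlib MD5 standing in for the production hash: extending the treatment range can only add users, never evict them.

```python
import hashlib

def bucket(salt, user_id, n=10000):
    return int(hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest(), 16) % n

salt = "checkout_button_color"
users = [f"user_{i}" for i in range(10000)]

# Treatment at 25% owns buckets [5000, 7499]; after ramping to 50%, [5000, 9999].
before = {u for u in users if 5000 <= bucket(salt, u) <= 7499}
after = {u for u in users if 5000 <= bucket(salt, u) <= 9999}

assert before <= after  # nobody already in treatment moves back
```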

Deep Dive 2: Mutual Exclusion and Layering

Problem: If a user is in Experiment A (checkout flow change) and Experiment B (pricing display change), and revenue increases, which experiment caused it? We can’t tell — they’re confounded.

Solution: Layers

A layer is a slice of traffic. Each layer is independent:

Layer: "checkout_ui" (100% of traffic)
  ├── Experiment A: checkout_button_color (50% of layer traffic)
  └── Experiment B: checkout_layout (30% of layer traffic)
  → Mutually exclusive: A and B share the layer's traffic

Layer: "pricing" (100% of traffic)
  └── Experiment C: pricing_display (100% of layer traffic)
  → Independent: C runs on all users regardless of checkout_ui assignments

How mutual exclusion works within a layer:

layer_bucket = hash(layer_salt + user_id) % 10000

Experiment A: allocated buckets [0, 4999] (50%)
Experiment B: allocated buckets [5000, 7999] (30%)
Free traffic: buckets [8000, 9999] (20%)

User with layer_bucket = 3000 → eligible for Experiment A (not B)
User with layer_bucket = 6500 → eligible for Experiment B (not A)
User with layer_bucket = 9000 → not in any experiment in this layer

Within an experiment, a second hash (experiment_salt + user_id) determines the variant.

Key insight: Different salt per layer ensures layer assignments are independent. A user in bucket 3000 for “checkout_ui” might be in bucket 7500 for “pricing” — completely uncorrelated.
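The two-level hashing can be sketched directly: the layer-salted hash picks at most one experiment, then an experiment-salted hash picks the variant. The bucket ranges are the ones assumed in the example above; the hash is stdlib MD5 standing in for MurmurHash3.

```python
import hashlib

def bucket(salt, user_id, n=10000):
    return int(hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest(), 16) % n

def checkout_ui_assignment(user_id):
    """Layer hash selects at most one experiment; a second, experiment-salted
    hash selects the variant within it."""
    lb = bucket("checkout_ui", user_id)          # layer salt
    if lb < 5000:
        exp = "checkout_button_color"            # Experiment A: [0, 4999]
    elif lb < 8000:
        exp = "checkout_layout"                  # Experiment B: [5000, 7999]
    else:
        return None                              # free traffic: [8000, 9999]
    vb = bucket(exp, user_id)                    # experiment-specific salt
    return exp, ("control" if vb < 5000 else "treatment")
```

Because a user gets exactly one layer bucket, mutual exclusion within the layer holds by construction; no coordination between experiments is needed.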

Deep Dive 3: Statistical Analysis and Guardrails

Sample Size Calculation (pre-experiment):

For a two-sided t-test:
  n = (Z_{α/2} + Z_β)² × 2σ² / δ²

Where:
  α = 0.05 (significance level)
  β = 0.20 (power = 80%)
  σ = baseline metric standard deviation
  δ = minimum detectable effect (MDE)

Example: baseline conversion = 3.5%, MDE = 0.5% (relative 14%)
  σ² = p(1-p) = 0.035 × 0.965 = 0.0338
  n = (1.96 + 0.84)² × 2 × 0.0338 / (0.005)²
  n ≈ 21,200 per variant → ~42,400 total
  At 100K users/day → ~1 day to reach the target sample size
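The same calculation in code, using the normal-approximation z-values from the stdlib (a dedicated power-analysis library would give an exact answer; `sample_size_per_variant` is a name invented here):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """n = (z_{alpha/2} + z_beta)^2 * 2 * sigma^2 / delta^2 for a proportion metric."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    sigma_sq = p_baseline * (1 - p_baseline)
    return ceil((z_alpha + z_beta) ** 2 * 2 * sigma_sq / mde_abs ** 2)

sample_size_per_variant(0.035, 0.005)  # ≈ 21,200 per variant, as above
```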

Sequential Testing (avoiding peeking problem):

The classic problem: a PM checks results every day, and on day 3, p-value dips below 0.05. They call the experiment. But the p-value would have risen back above 0.05 on day 5 — this was a false positive caused by repeated testing.

Solution: Always-valid confidence intervals / sequential analysis

  • Use methods like mSPRT (mixture Sequential Probability Ratio Test) or confidence sequences
  • These produce p-values that are valid at any stopping time, not just at a pre-determined sample size
  • Cost: requires ~20-30% more samples for the same power as a fixed-horizon test
  • Benefit: teams can check results any time without inflating false positive rates

Guardrail Metrics:

  • Define “do no harm” metrics: page load time, error rate, crash rate, revenue (for non-revenue experiments)
  • Run a one-sided test: is the treatment significantly WORSE than control?
  • If any guardrail is triggered (p < 0.01 for degradation), auto-alert the experiment owner and optionally auto-stop the experiment
  • Guardrails use a more lenient significance threshold (0.01 instead of 0.05) to reduce false alarms while still catching real regressions
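A guardrail check along these lines could look as follows. This is a sketch using a one-sided z-approximation on per-variant aggregates; `guardrail_breached` and its parameters are illustrative names, and the alpha default matches the 0.01 threshold above.

```python
from math import sqrt
from statistics import NormalDist

def guardrail_breached(mean_c, var_c, n_c, mean_t, var_t, n_t,
                       higher_is_worse=True, alpha=0.01):
    """One-sided z-test: is the treatment significantly WORSE than control?"""
    se = sqrt(var_c / n_c + var_t / n_t)
    z = (mean_t - mean_c) / se
    if not higher_is_worse:  # e.g. for revenue, lower is worse
        z = -z
    p_value = 1 - NormalDist().cdf(z)
    return p_value < alpha

# Page load time regressed 100ms -> 110ms with large samples: breached.
guardrail_breached(100.0, 400.0, 50000, 110.0, 400.0, 50000)  # True
```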

Metric Pipeline:

Raw events → Flink →
  1. Deduplicate by (user_id, experiment_id, day) for exposures
  2. Join: for each exposed user, collect their metric events
  3. Aggregate: per user → sum/count/mean of each metric
  4. Per experiment → compute:
     - Mean and variance per variant
     - Welch's t-test (unequal variances)
     - Confidence intervals
     - Sequential testing boundaries
  5. Write results → PostgreSQL → Dashboard
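Step 4's significance test can be sketched from the per-variant aggregates alone. This uses a normal approximation to the t distribution, which is reasonable at experiment-scale sample sizes; `scipy.stats.ttest_ind(equal_var=False)` is the exact equivalent.

```python
from math import sqrt
from statistics import NormalDist

def welch_test(mean_a, var_a, n_a, mean_b, var_b, n_b, alpha=0.05):
    """Welch's t statistic (unequal variances), two-sided p-value via the
    normal approximation, and a CI on the difference in means."""
    se = sqrt(var_a / n_a + var_b / n_b)
    t = (mean_b - mean_a) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = mean_b - mean_a
    return t, p, (diff - z * se, diff + z * se)
```

Feeding in the variant aggregates from the results API example (conversion rates 3.42% vs 3.89% with those sample sizes) yields a p-value well below 0.05 and a confidence interval excluding zero, matching the "significant: true" verdict shown there.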

7. Extensions (2 min)

  • Feature flags integration: Experiments are just feature flags with measurement. Support permanent flags, kill switches, and gradual rollouts (0% → 5% → 25% → 100%) using the same assignment infrastructure.
  • Multi-armed bandit: Instead of fixed 50/50 traffic, dynamically shift traffic toward the winning variant using Thompson Sampling or UCB. Reduces opportunity cost of running losing variants.
  • Holdback groups: Permanently keep 1-5% of users on the old experience as a long-term baseline. Measures cumulative effect of all shipped experiments.
  • Interaction detection: Automatically detect when two experiments in different layers interact (e.g., combined effect differs from sum of individual effects). Uses ANOVA on the cross-product of variant assignments.
  • Automated experiment lifecycle: Auto-stop experiments when they reach statistical significance or target sample size. Auto-ramp winners to 100%. Alert on stale experiments running > 30 days with no decision.