1. Requirements & Scope (5 min)

Functional Requirements

  1. Collect client-side events (page views, clicks, custom events) from millions of websites via a lightweight JavaScript SDK
  2. Provide real-time dashboards showing active users, page views per second, and top pages (within ~30 seconds of event)
  3. Support batch analytics queries — daily/weekly/monthly reports on sessions, funnels, bounce rate, conversion paths
  4. Sessionize raw events into user sessions with configurable timeout (default 30 minutes of inactivity)
  5. Count unique visitors accurately across time ranges (daily, weekly, monthly) with deduplication

Non-Functional Requirements

  • Availability: 99.99% for the ingestion pipeline — dropping events is unacceptable for paying customers. Dashboard reads can tolerate brief degradation.
  • Latency: Event ingestion < 50ms (client-perceived). Real-time dashboard data within 30 seconds of event. Batch reports within minutes of scheduled time.
  • Consistency: Eventual consistency is acceptable. Real-time dashboards are approximate. Batch reports must be accurate to within 1%.
  • Scale: 10M tracked websites, 1M events/sec peak ingestion, ~5PB of raw event data per year
  • Durability: Zero event loss once acknowledged. Raw events retained for 2 years, aggregated data retained indefinitely.

2. Estimation (3 min)

Traffic

  • 10M websites tracked, average 10 page views/sec per active site
  • Not all sites active simultaneously — assume 100K sites active at peak
  • Peak ingestion: 100K sites × 10 events/sec = 1M events/sec
  • Average ingestion: ~300K events/sec
  • Read QPS (dashboard queries): ~50K/sec (most are cached)

Storage

  • Average event payload: ~500 bytes (URL, timestamp, user agent, referrer, custom dimensions, session cookie)
  • Daily raw events: 300K/sec × 86,400 sec × 500 bytes = ~13TB/day
  • Yearly raw events: ~4.7PB/year
  • Aggregated rollups (hourly/daily per site per dimension): ~1% of raw = ~50TB/year

Unique Visitor Counting

  • 10M sites, each with up to 100M monthly unique visitors
  • Exact counting (worst case): 10M sites × 100M visitors × 8 bytes = 8PB of set data — infeasible
  • HyperLogLog: 10M sites × 12KB per HLL = 120GB for monthly uniques — fits in memory
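The arithmetic above can be sanity-checked in a few lines (a sketch; the inputs are the assumptions stated in the bullets):

```python
# Back-of-envelope check of the estimates above.
events_per_sec_avg = 300_000
bytes_per_event = 500

daily_raw = events_per_sec_avg * 86_400 * bytes_per_event   # bytes/day
yearly_raw = daily_raw * 365

print(f"daily raw:  {daily_raw / 1e12:.1f} TB")             # ~13 TB/day
print(f"yearly raw: {yearly_raw / 1e15:.1f} PB")            # ~4.7 PB/year

# Exact unique-visitor sets vs HyperLogLog sketches
sites, visitors, id_bytes = 10_000_000, 100_000_000, 8
exact = sites * visitors * id_bytes                          # 8 PB
hll = sites * 12 * 1024                                      # ~120 GB
print(f"exact sets: {exact / 1e15:.0f} PB, HLL: {hll / 1e9:.0f} GB")
```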

Key Insight

This is a write-heavy, read-light system. Ingestion throughput and storage cost dominate. The core challenge is building a pipeline that can ingest 1M events/sec, make data queryable in real-time, and run efficient batch aggregations without bankrupting the storage budget.


3. API Design (3 min)

Event Collection Endpoint (called by JS SDK)

POST /collect
Content-Type: application/json

Body: {
  "tracking_id": "UA-123456",
  "client_id": "c8f3a2...",            // first-party cookie UUID
  "events": [                           // batched for efficiency
    {
      "type": "pageview",
      "timestamp": 1708632000,
      "url": "https://example.com/blog",
      "referrer": "https://google.com",
      "title": "My Blog Post",
      "screen_resolution": "1920x1080",
      "language": "en-US",
      "custom_dimensions": { "author": "alice", "category": "tech" }
    }
  ]
}
Response 204 No Content                  // fast ack, no body

Query API (for dashboard / reports)

GET /api/v1/report?tracking_id=UA-123456
  &metric=pageviews,unique_visitors,sessions
  &dimension=page_path,referrer
  &start_date=2024-02-01
  &end_date=2024-02-22
  &granularity=daily
  &filters=country:US;device:mobile
  &sort=-pageviews
  &limit=100

Response 200: {
  "rows": [
    { "date": "2024-02-01", "page_path": "/blog", "pageviews": 12340, "unique_visitors": 8900, "sessions": 9200 },
    ...
  ],
  "totals": { "pageviews": 450000, "unique_visitors": 120000, "sessions": 135000 },
  "sampled": false
}

Real-Time API

GET /api/v1/realtime?tracking_id=UA-123456
Response 200: {
  "active_users": 1423,
  "pageviews_per_second": 87,
  "top_pages": [ { "path": "/", "active_users": 312 }, ... ],
  "top_referrers": [ ... ]
}

Key Decisions

  • Batch events in the SDK (send every 5 seconds or on page unload via navigator.sendBeacon)
  • 204 No Content response — minimal latency, no body to parse
  • Pixel fallback: GET /collect?data=base64encoded... for environments where POST is blocked
  • Query API supports sampling flag — for very large sites, queries over raw data use 10% sample and extrapolate

4. Data Model (3 min)

Raw Events (stored in columnar format — Parquet on S3/GCS)

Schema: raw_events (partitioned by tracking_id + date)
  event_id            | UUID
  tracking_id         | string          -- site identifier
  client_id           | string          -- user cookie
  session_id          | string          -- computed during sessionization
  event_type          | string          -- pageview, click, custom
  timestamp           | int64           -- Unix ms
  url                 | string
  referrer            | string
  page_title          | string
  user_agent          | string
  ip_hash             | string          -- hashed for privacy
  country             | string          -- derived from IP
  city                | string
  device_type         | enum            -- desktop, mobile, tablet
  browser             | string
  os                  | string
  screen_resolution   | string
  custom_dimensions   | map<string, string>

Pre-Aggregated Rollups (ClickHouse or Druid)

Table: hourly_metrics
  tracking_id         | string
  date_hour           | datetime
  dimension_key       | string          -- e.g., "page_path=/blog"
  pageviews           | uint64
  unique_visitors_hll | binary          -- serialized HyperLogLog
  sessions            | uint64
  total_duration_sec  | uint64
  bounce_count        | uint64

Session Table (derived via sessionization job)

Table: sessions
  session_id          | string (PK)
  tracking_id         | string
  client_id           | string
  start_time          | datetime
  end_time            | datetime
  duration_sec        | int
  page_count          | int
  entry_page          | string
  exit_page           | string
  referrer            | string
  is_bounce           | boolean         -- page_count == 1
  country             | string
  device_type         | string
  conversion_events   | array<string>

DB Choice

  • Raw events: S3 + Parquet. Cheap, durable, query with Spark/Presto for batch.
  • Real-time aggregates: ClickHouse (or Apache Druid). Columnar OLAP database optimized for time-series aggregation. Handles 1M inserts/sec and sub-second analytical queries.
  • HyperLogLog sketches: Stored in ClickHouse as AggregateFunction(uniq, String) — native HLL support with merge capabilities across time ranges.
  • Metadata (tracking configs, users): PostgreSQL.

5. High-Level Design (12 min)

Data Flow

Website (JS SDK)
  → CDN Edge (collect endpoint)
  → Kafka (raw events topic, partitioned by tracking_id)
  → Two consumers in parallel:
     ├── Real-Time Pipeline (Flink/Spark Streaming)
     │     → Pre-aggregate in 10-second windows
     │     → Write to ClickHouse (real-time tables)
     │     → Update HyperLogLog sketches (in-memory, flushed every minute)
     │     → Power real-time dashboard API
     └── Batch Pipeline (Spark, runs hourly)
           → Read from Kafka → S3 (raw Parquet)
           → Sessionization job (group events by client_id, 30-min timeout)
           → Aggregation job (hourly/daily rollups with HLL uniques)
           → Write to ClickHouse (batch tables)
           → Power historical reporting API

Components

  1. JS SDK: Lightweight (~15KB), collects pageviews and custom events, batches and sends via sendBeacon or XHR. Generates client_id (UUID stored in first-party cookie).
  2. Collection Tier (CDN Edge): HTTP endpoint deployed at edge (CloudFront/Fastly). Validates payload, assigns server timestamp, publishes to Kafka. Returns 204 immediately.
  3. Kafka Cluster: Central event bus. Partitioned by tracking_id to ensure all events for a site land on the same partition (critical for sessionization). Retention: 7 days.
  4. Real-Time Pipeline (Flink): Consumes from Kafka, maintains in-memory counters per tracking_id (active users, page views/sec). Flushes to ClickHouse every 10 seconds. Maintains HLL sketches for real-time unique counting.
  5. Batch Pipeline (Spark): Hourly Spark jobs read from S3 raw data. Sessionization job groups events by (tracking_id, client_id) sorted by timestamp, splits sessions on 30-minute gaps. Aggregation job produces hourly and daily rollups.
  6. ClickHouse Cluster: OLAP database for both real-time and batch-aggregated data. Supports sub-second queries over billions of rows with columnar compression.
  7. Query Service: Stateless API servers that translate dashboard queries into ClickHouse SQL. Handles caching (Redis, 30-second TTL for real-time, 5-minute for reports), sampling decisions, and access control.
  8. Monitoring: Ingestion lag (Kafka consumer offset), ClickHouse query latency, event drop rate, HLL error rate sampling.
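The cache-aside behavior in component 7 can be sketched as follows (a plain dict stands in for Redis; the names are illustrative, and the TTLs match the text: 30 seconds for real-time, 5 minutes for reports):

```python
import time

class TtlCache:
    """In-process stand-in for Redis with per-key TTLs."""
    def __init__(self):
        self._store = {}                      # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[0] > time.time():  # still fresh?
            return entry[1]
        return None

    def set(self, key, value, ttl_sec):
        self._store[key] = (time.time() + ttl_sec, value)

def cached_report(cache, key, ttl_sec, run_query):
    """Cache-aside: serve a hit, otherwise query ClickHouse and fill."""
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = run_query()                      # fall through to the database
    cache.set(key, result, ttl_sec)
    return result
```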

Deployment

Edge (30+ PoPs worldwide):
  Collection endpoints → local Kafka producer → regional Kafka cluster

Region A (primary processing):
  Kafka Cluster (100+ brokers) → Flink (real-time) + Spark (batch)
  ClickHouse Cluster (sharded by tracking_id, replicated)
  Query Service (behind LB)

Region B (standby):
  Kafka MirrorMaker → standby pipeline (hot failover)

6. Deep Dives (15 min)

Deep Dive 1: Unique Visitor Counting with HyperLogLog

The problem: Counting exact unique visitors across arbitrary time ranges is prohibitively expensive. A site with 100M monthly visitors would need a 100M-entry set per month. Across 10M sites and multiple time granularities, this is petabytes of set data.

HyperLogLog (HLL):

  • Probabilistic data structure that estimates cardinality with ~0.8% standard error using only 12KB of memory
  • Key property: HLL sketches are mergeable. You can compute hourly HLLs and merge them to get daily/weekly/monthly uniques without re-scanning raw data.

Implementation:

Event arrives: client_id = "abc123", tracking_id = "UA-456"

1. Real-time: Hash client_id → update in-memory HLL for (UA-456, current_hour)
2. Hourly batch: For each (tracking_id, hour), create HLL from all client_ids
3. Daily unique query:
   daily_hll = merge(hour_0_hll, hour_1_hll, ..., hour_23_hll)
   unique_visitors ≈ daily_hll.estimate()
4. Monthly unique query:
   monthly_hll = merge(day_1_hll, day_2_hll, ..., day_30_hll)
   unique_visitors ≈ monthly_hll.estimate()
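A minimal HLL in Python makes the mergeability property concrete (illustrative only — in production you would rely on ClickHouse's built-in aggregate states or a library such as Apache DataSketches):

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog: 2**p registers, standard error ~1.04/sqrt(2**p)."""
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, item: str):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading-zero run + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HLL") -> "HLL":
        assert self.p == other.p
        out = HLL(self.p)
        out.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )
        return out

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:          # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

Merging two sketches is a register-wise max, which is why hourly sketches compose into daily and monthly counts with no rescan of raw data.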

ClickHouse native support:

-- Store HLL as AggregateFunction
CREATE TABLE hourly_uniques (
  tracking_id String,
  hour DateTime,
  visitors AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
ORDER BY (tracking_id, hour);

-- Insert
INSERT INTO hourly_uniques
SELECT tracking_id, toStartOfHour(timestamp), uniqState(client_id)
FROM raw_events
GROUP BY tracking_id, toStartOfHour(timestamp);

-- Query monthly uniques (merges hourly HLLs)
SELECT tracking_id, uniqMerge(visitors) as unique_visitors
FROM hourly_uniques
WHERE hour >= '2024-02-01' AND hour < '2024-03-01'  -- half-open range covers all of February
GROUP BY tracking_id;

Accuracy vs Memory trade-off:

  Precision       Memory per HLL   Standard Error
  10              1 KB             3.25%
  12              4 KB             1.63%
  14 (default)    16 KB            0.81%
  16              64 KB            0.41%

(Memory here assumes one byte per register; implementations that pack registers into 6 bits, as Redis does, need ~12KB at precision 14 — the figure used in the estimation section.)

For 10M sites × 24 hourly HLLs × 16KB = ~3.8TB/day at precision 14. Manageable. Use precision 12 (4KB) for smaller sites to save space.

Deep Dive 2: Sessionization

The problem: Raw events are just timestamped actions. A “session” is a derived concept — a sequence of events from the same user with no gap longer than 30 minutes.

Sessionization algorithm:

Input:  Events for (tracking_id, client_id) sorted by timestamp
Output: session_id assigned to each event

session_id = generate_uuid()
last_event_time = null

for each event in sorted_events:
  if last_event_time != null AND (event.timestamp - last_event_time) > 30 minutes:
    session_id = generate_uuid()    // new session
  event.session_id = session_id
  last_event_time = event.timestamp
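The pseudocode above translates directly into a runnable sketch:

```python
import uuid

def sessionize(events, gap_sec=30 * 60):
    """Assign session_ids to one user's events.

    events: list of dicts with a "timestamp" key (Unix seconds).
    Returns the events in timestamp order with a "session_id" added.
    """
    session_id = None
    last_ts = None
    out = sorted(events, key=lambda e: e["timestamp"])
    for ev in out:
        if last_ts is None or ev["timestamp"] - last_ts > gap_sec:
            session_id = str(uuid.uuid4())   # gap exceeded: new session
        ev["session_id"] = session_id
        last_ts = ev["timestamp"]
    return out
```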

At scale (Spark implementation):

1. Read hourly raw events from S3 (Parquet)
2. Partition by (tracking_id, client_id)   // repartition to co-locate user events
3. Sort within partition by timestamp
4. Apply sessionization UDF (window function with 30-min gap detection)
5. Write sessionized events back to S3
6. Aggregate session-level metrics (duration, page count, bounce, entry/exit page)

Challenge — cross-boundary sessions: A session might span two hourly batches. User visits at 13:50, clicks around until 14:10.

  • Solution: Sessionization runs on overlapping windows. The hourly job for 14:00 also reads events from 13:30-14:00 (30-minute lookback). Dedup by session_id when merging.
  • Alternative: Use Flink for continuous sessionization with session windows (event-time based, gap = 30 minutes). More complex operationally but eliminates boundary issues.

Real-time active sessions:

  • Flink maintains a session state store keyed by (tracking_id, client_id)
  • On each event: check if existing session is still active (last event < 30 min ago)
  • If yes: update session. If no: close old session, open new.
  • “Active users right now” = count of sessions with last event within 5 minutes.
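The state logic above, reduced to its essentials (Flink would keep this state partitioned by key and fault-tolerant; the plain dict here is only for illustration):

```python
def update_state(state, client_id, ts):
    """Record the latest event time per visitor."""
    state[client_id] = ts

def active_users(state, now, window_sec=300):
    """Visitors whose last event falls within the 5-minute window."""
    return sum(1 for t in state.values() if now - t <= window_sec)
```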

Deep Dive 3: Sampling Strategies for Large Sites

The problem: A top-100 website might generate 100K events/sec — 8.6B events/day. Running ad-hoc queries over this data is slow and expensive.

When to sample:

  • Real-time dashboard: always use full data (already pre-aggregated)
  • Standard reports (daily/weekly): use full data (pre-computed rollups)
  • Ad-hoc exploration queries: sample when raw table scan > 1B rows

Sampling strategies:

1. Client-side sampling (configured per tracking_id):

// SDK checks sampling rate on init
if (Math.random() > samplingRate) return; // skip all events for this session
// Send events with sampling_weight = 1 / samplingRate
  • Pros: Reduces ingestion volume and cost
  • Cons: Loses data permanently; can’t un-sample later

2. Query-time sampling (better):

-- ClickHouse sample clause
SELECT page_path, count() * 10 as estimated_pageviews
FROM raw_events
SAMPLE 0.1   -- 10% sample, deterministic by row hash
WHERE tracking_id = 'UA-123456'
  AND date = '2024-02-22'
GROUP BY page_path
ORDER BY estimated_pageviews DESC;
  • Pros: Raw data fully preserved, sample at query time
  • Cons: Still touches 10% of the data, though columnar storage mitigates this (ClickHouse can skip ~90% of granules)

3. Pre-computed samples (best for large sites):

  • During batch processing, create a 10% deterministic sample table: hash(client_id) % 10 == 0
  • Ad-hoc queries run against the sample table with 10x multiplier
  • Dashboard shows “This report is based on X% of sessions” (like GA does)
  • For metrics where sampling error matters (conversion rate), provide confidence intervals
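The deterministic bucket test can be sketched as follows (the hash choice is illustrative; any stable hash works, and queries against the sample multiply counts by the bucket count to extrapolate):

```python
import hashlib

def in_sample(client_id: str, buckets: int = 10) -> bool:
    """Deterministic 1-in-`buckets` sample: same visitor is always
    in (or out of) the sample, so session metrics stay consistent."""
    digest = hashlib.md5(client_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets == 0
```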

Adaptive sampling: Sites with < 100K daily events: no sampling. 100K-10M: sample at 50%. 10M+: sample at 10%. Configured automatically based on rolling 7-day event volume.
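The tiers above as a function (thresholds taken from the text; the input would be the rolling 7-day average of daily events):

```python
def adaptive_sampling_rate(daily_events: int) -> float:
    """Map a site's daily event volume to its sampling rate."""
    if daily_events < 100_000:
        return 1.0    # no sampling
    if daily_events < 10_000_000:
        return 0.5
    return 0.1
```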


7. Extensions (2 min)

  • Funnel analysis: Define ordered sequences of events (e.g., landing → add to cart → checkout → purchase). Pre-compute funnel steps using window functions in ClickHouse. Show conversion and drop-off rates at each step.
  • Cohort analysis: Group users by first-visit date (cohort) and track retention over time. Requires joining session data with a users table keyed by client_id → first_seen_date.
  • Attribution modeling: Multi-touch attribution across marketing channels. Track UTM parameters, assign credit to touchpoints using last-click, linear, or time-decay models.
  • Privacy and consent: GDPR/CCPA compliance — support opt-out, data deletion by client_id, anonymized IP mode, cookie-less tracking via fingerprinting (with consent). Server-side tracking API for first-party data.
  • Custom event schemas: Let customers define typed event schemas (e.g., purchase with amount, currency, product_id). Validate at ingestion, store in dedicated columns for efficient querying.