1. Requirements & Scope (5 min)
Functional Requirements
- Collect client-side events (page views, clicks, custom events) from millions of websites via a lightweight JavaScript SDK
- Provide real-time dashboards showing active users, page views per second, and top pages (within ~30 seconds of event)
- Support batch analytics queries — daily/weekly/monthly reports on sessions, funnels, bounce rate, conversion paths
- Sessionize raw events into user sessions with configurable timeout (default 30 minutes of inactivity)
- Count unique visitors accurately across time ranges (daily, weekly, monthly) with deduplication
Non-Functional Requirements
- Availability: 99.99% for the ingestion pipeline — dropping events is unacceptable for paying customers. Dashboard reads can tolerate brief degradation.
- Latency: Event ingestion < 50ms (client-perceived). Real-time dashboard data within 30 seconds of event. Batch reports within minutes of scheduled time.
- Consistency: Eventual consistency is acceptable. Real-time dashboards are approximate. Batch reports must be accurate to within 1%.
- Scale: 10M tracked websites, 1M events/sec peak ingestion, ~5PB of raw event data per year
- Durability: Zero event loss once acknowledged. Raw events retained for 2 years, aggregated data retained indefinitely.
2. Estimation (3 min)
Traffic
- 10M websites tracked, average 10 page views/sec per active site
- Not all sites active simultaneously — assume 100K sites active at peak
- Peak ingestion: 100K sites × 10 events/sec = 1M events/sec
- Average ingestion: ~300K events/sec
- Read QPS (dashboard queries): ~50K/sec (most are cached)
Storage
- Average event payload: ~500 bytes (URL, timestamp, user agent, referrer, custom dimensions, session cookie)
- Daily raw events: 300K/sec × 86,400 sec × 500 bytes = ~13TB/day
- Yearly raw events: ~4.7PB/year
- Aggregated rollups (hourly/daily per site per dimension): ~1% of raw = ~50TB/year
Unique Visitor Counting
- 10M sites; the largest have up to 100M monthly unique visitors
- Exact counting (worst case): 10M sites × 100M visitors × 8 bytes = 8PB (infeasible)
- HyperLogLog: 10M sites × 12KB per HLL = 120GB for monthly uniques — fits in memory
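These estimates are simple enough to sanity-check in a few lines (the HLL figure assumes 12,000 bytes per sketch, matching the 12KB above):

```python
SECONDS_PER_DAY = 86_400

# Storage: 300K average events/sec at ~500 bytes each
daily_bytes = 300_000 * SECONDS_PER_DAY * 500
yearly_bytes = daily_bytes * 365
print(f"raw events: {daily_bytes / 1e12:.1f} TB/day, {yearly_bytes / 1e15:.1f} PB/year")

# Unique visitors: exact 8-byte IDs per visitor vs one fixed-size HLL per site
exact_bytes = 10_000_000 * 100_000_000 * 8
hll_bytes = 10_000_000 * 12_000
print(f"exact sets: {exact_bytes / 1e15:.0f} PB, HLL sketches: {hll_bytes / 1e9:.0f} GB")
```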
Key Insight
This is a write-heavy, read-light system. Ingestion throughput and storage cost dominate. The core challenge is building a pipeline that can ingest 1M events/sec, make data queryable in real-time, and run efficient batch aggregations without bankrupting the storage budget.
3. API Design (3 min)
Event Collection Endpoint (called by JS SDK)
POST /collect
Content-Type: application/json
Body: {
  "tracking_id": "UA-123456",
  "client_id": "c8f3a2...",    // first-party cookie UUID
  "events": [                  // batched for efficiency
    {
      "type": "pageview",
      "timestamp": 1708632000,
      "url": "https://example.com/blog",
      "referrer": "https://google.com",
      "title": "My Blog Post",
      "screen_resolution": "1920x1080",
      "language": "en-US",
      "custom_dimensions": { "author": "alice", "category": "tech" }
    }
  ]
}
Response 204 No Content // fast ack, no body
Query API (for dashboard / reports)
GET /api/v1/report?tracking_id=UA-123456
&metric=pageviews,unique_visitors,sessions
&dimension=page_path,referrer
&start_date=2024-02-01
&end_date=2024-02-22
&granularity=daily
&filters=country:US;device:mobile
&sort=-pageviews
&limit=100
Response 200: {
"rows": [
{ "date": "2024-02-01", "page_path": "/blog", "pageviews": 12340, "unique_visitors": 8900, "sessions": 9200 },
...
],
"totals": { "pageviews": 450000, "unique_visitors": 120000, "sessions": 135000 },
"sampled": false
}
Real-Time API
GET /api/v1/realtime?tracking_id=UA-123456
Response 200: {
"active_users": 1423,
"pageviews_per_second": 87,
"top_pages": [ { "path": "/", "active_users": 312 }, ... ],
"top_referrers": [ ... ]
}
Key Decisions
- Batch events in the SDK (send every 5 seconds or on page unload via navigator.sendBeacon)
- 204 No Content response — minimal latency, no body to parse
- Pixel fallback — GET /collect?data=base64encoded... for environments where POST is blocked
- Query API supports a sampling flag — for very large sites, queries over raw data use a 10% sample and extrapolate
4. Data Model (3 min)
Raw Events (stored in columnar format — Parquet on S3/GCS)
Schema: raw_events (partitioned by tracking_id + date)
event_id | UUID
tracking_id | string -- site identifier
client_id | string -- user cookie
session_id | string -- computed during sessionization
event_type | string -- pageview, click, custom
timestamp | int64 -- Unix ms
url | string
referrer | string
page_title | string
user_agent | string
ip_hash | string -- hashed for privacy
country | string -- derived from IP
city | string
device_type | enum -- desktop, mobile, tablet
browser | string
os | string
screen_resolution | string
custom_dimensions | map<string, string>
Pre-Aggregated Rollups (ClickHouse or Druid)
Table: hourly_metrics
tracking_id | string
date_hour | datetime
dimension_key | string -- e.g., "page_path=/blog"
pageviews | uint64
unique_visitors_hll | binary -- serialized HyperLogLog
sessions | uint64
total_duration_sec | uint64
bounce_count | uint64
Session Table (derived via sessionization job)
Table: sessions
session_id | string (PK)
tracking_id | string
client_id | string
start_time | datetime
end_time | datetime
duration_sec | int
page_count | int
entry_page | string
exit_page | string
referrer | string
is_bounce | boolean -- page_count == 1
country | string
device_type | string
conversion_events | array<string>
DB Choice
- Raw events: S3 + Parquet. Cheap, durable, query with Spark/Presto for batch.
- Real-time aggregates: ClickHouse (or Apache Druid). Columnar OLAP database optimized for time-series aggregation. Handles 1M inserts/sec and sub-second analytical queries.
- HyperLogLog sketches: Stored in ClickHouse as AggregateFunction(uniq, String) — native HLL support with merge capabilities across time ranges.
- Metadata (tracking configs, users): PostgreSQL.
5. High-Level Design (12 min)
Data Flow
Website (JS SDK)
→ CDN Edge (collect endpoint)
→ Kafka (raw events topic, partitioned by tracking_id)
→ Two consumers in parallel:
├── Real-Time Pipeline (Flink/Spark Streaming)
│ → Pre-aggregate in 10-second windows
│ → Write to ClickHouse (real-time tables)
│ → Update HyperLogLog sketches (in-memory, flushed every minute)
│ → Power real-time dashboard API
└── Batch Pipeline (Spark, runs hourly)
→ Read from Kafka → S3 (raw Parquet)
→ Sessionization job (group events by client_id, 30-min timeout)
→ Aggregation job (hourly/daily rollups with HLL uniques)
→ Write to ClickHouse (batch tables)
→ Power historical reporting API
Components
- JS SDK: Lightweight (~15KB), collects pageviews and custom events, batches and sends via sendBeacon or XHR. Generates client_id (UUID stored in a first-party cookie).
- Collection Tier (CDN Edge): HTTP endpoint deployed at the edge (CloudFront/Fastly). Validates the payload, assigns a server timestamp, publishes to Kafka. Returns 204 immediately.
- Kafka Cluster: Central event bus. Partitioned by tracking_id to ensure all events for a site land on the same partition (critical for sessionization). Retention: 7 days.
- Real-Time Pipeline (Flink): Consumes from Kafka, maintains in-memory counters per tracking_id (active users, page views/sec). Flushes to ClickHouse every 10 seconds. Maintains HLL sketches for real-time unique counting.
- Batch Pipeline (Spark): Hourly Spark jobs read raw data from S3. The sessionization job groups events by (tracking_id, client_id) sorted by timestamp and splits sessions on 30-minute gaps. The aggregation job produces hourly and daily rollups.
- ClickHouse Cluster: OLAP database for both real-time and batch-aggregated data. Supports sub-second queries over billions of rows with columnar compression.
- Query Service: Stateless API servers that translate dashboard queries into ClickHouse SQL. Handles caching (Redis, 30-second TTL for real-time, 5-minute for reports), sampling decisions, and access control.
- Monitoring: Ingestion lag (Kafka consumer offset), ClickHouse query latency, event drop rate, HLL error rate sampling.
Deployment
Edge (30+ PoPs worldwide):
Collection endpoints → local Kafka producer → regional Kafka cluster
Region A (primary processing):
Kafka Cluster (100+ brokers) → Flink (real-time) + Spark (batch)
ClickHouse Cluster (sharded by tracking_id, replicated)
Query Service (behind LB)
Region B (standby):
Kafka MirrorMaker → standby pipeline (hot failover)
6. Deep Dives (15 min)
Deep Dive 1: Unique Visitor Counting with HyperLogLog
The problem: Counting exact unique visitors across arbitrary time ranges is prohibitively expensive. A site with 100M monthly visitors would need a 100M-entry set per month. Across 10M sites and multiple time granularities, this is petabytes of set data.
HyperLogLog (HLL):
- Probabilistic data structure that estimates cardinality with ~0.8% standard error using only 12KB of memory
- Key property: HLL sketches are mergeable. You can compute hourly HLLs and merge them to get daily/weekly/monthly uniques without re-scanning raw data.
Implementation:
Event arrives: client_id = "abc123", tracking_id = "UA-456"
1. Real-time: Hash client_id → update in-memory HLL for (UA-456, current_hour)
2. Hourly batch: For each (tracking_id, hour), create HLL from all client_ids
3. Daily unique query:
daily_hll = merge(hour_0_hll, hour_1_hll, ..., hour_23_hll)
unique_visitors ≈ daily_hll.estimate()
4. Monthly unique query:
monthly_hll = merge(day_1_hll, day_2_hll, ..., day_30_hll)
unique_visitors ≈ monthly_hll.estimate()
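To make the merge property concrete, here is a from-scratch HLL sketch in Python. This is a teaching toy with hypothetical helper names (new_hll, add, merge, estimate); a real deployment would use a library or ClickHouse's built-in uniqState/uniqMerge.

```python
import hashlib
import math

P = 14          # precision: 2^14 registers, ~0.81% standard error
M = 1 << P

def _hash64(value: str) -> int:
    # Stable 64-bit hash of the client_id
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

def new_hll():
    return [0] * M

def add(hll, value: str) -> None:
    h = _hash64(value)
    idx = h & (M - 1)                   # low P bits select a register
    rest = h >> P                       # remaining 50 bits
    rank = 50 - rest.bit_length() + 1   # leading zeros + 1
    hll[idx] = max(hll[idx], rank)

def merge(a, b):
    # The key property: register-wise max. Hourly sketches merge into
    # daily/monthly sketches without re-scanning raw events.
    return [max(x, y) for x, y in zip(a, b)]

def estimate(hll) -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in hll)
    zeros = hll.count(0)
    if raw <= 2.5 * M and zeros:        # small-range (linear counting) correction
        return M * math.log(M / zeros)
    return raw

# Two "hourly" sketches with 2,000 overlapping visitors
hour_a, hour_b = new_hll(), new_hll()
for i in range(6_000):
    add(hour_a, f"user{i}")
for i in range(4_000, 10_000):
    add(hour_b, f"user{i}")
daily = merge(hour_a, hour_b)
print(round(estimate(daily)))   # close to the 10,000 true uniques
```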
ClickHouse native support:
-- Store HLL as AggregateFunction
CREATE TABLE hourly_uniques (
tracking_id String,
hour DateTime,
visitors AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
ORDER BY (tracking_id, hour);
-- Insert
INSERT INTO hourly_uniques
SELECT tracking_id, toStartOfHour(timestamp), uniqState(client_id)
FROM raw_events
GROUP BY tracking_id, toStartOfHour(timestamp);
-- Query monthly uniques (merges hourly HLLs)
SELECT tracking_id, uniqMerge(visitors) as unique_visitors
FROM hourly_uniques
WHERE hour >= '2024-02-01' AND hour < '2024-03-01'
GROUP BY tracking_id;
Accuracy vs Memory trade-off:
| Precision | Memory per HLL | Standard Error |
|---|---|---|
| 10 | 1 KB | 3.25% |
| 12 | 4 KB | 1.63% |
| 14 (default) | 16 KB | 0.81% |
| 16 | 64 KB | 0.41% |
For 10M sites × 24 hourly HLLs × 16KB = 3.8TB/day at precision 14. Manageable. Use precision 12 (4KB) for smaller sites to save space.
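The table's figures follow directly from the classic HLL error bound (≈ 1.04/√m), assuming one byte per register as above:

```python
import math

def hll_stats(precision: int):
    m = 2 ** precision               # number of registers
    memory_kb = m / 1024             # one byte per register
    std_error = 1.04 / math.sqrt(m)  # standard HLL error bound
    return memory_kb, std_error

for p in (10, 12, 14, 16):
    mem, err = hll_stats(p)
    print(f"precision {p}: {mem:.0f} KB, {err * 100:.2f}% error")
```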
Deep Dive 2: Sessionization
The problem: Raw events are just timestamped actions. A “session” is a derived concept — a sequence of events from the same user with no gap longer than 30 minutes.
Sessionization algorithm:
Input: Events for (tracking_id, client_id) sorted by timestamp
Output: session_id assigned to each event
session_id = generate_uuid()
last_event_time = null
for each event in sorted_events:
    if last_event_time != null AND (event.timestamp - last_event_time) > 30 minutes:
        session_id = generate_uuid()   // new session
    event.session_id = session_id
    last_event_time = event.timestamp
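The same gap-splitting logic as a runnable Python sketch (timestamps in unix seconds; uuid4 stands in for whatever session-ID scheme is actually used):

```python
import uuid

SESSION_GAP = 30 * 60  # 30 minutes, in seconds

def sessionize(events):
    """Assign a session_id to each event of one (tracking_id, client_id).

    `events` is a list of dicts with a 'timestamp' key (unix seconds);
    they are sorted here so callers need not pre-sort.
    """
    session_id = str(uuid.uuid4())
    last_ts = None
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if last_ts is not None and event["timestamp"] - last_ts > SESSION_GAP:
            session_id = str(uuid.uuid4())   # gap exceeded: start a new session
        event["session_id"] = session_id
        last_ts = event["timestamp"]
    return events

# Third event arrives just over 30 minutes after the second -> two sessions
out = sessionize([{"timestamp": t} for t in (0, 600, 600 + 1801)])
```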
At scale (Spark implementation):
1. Read hourly raw events from S3 (Parquet)
2. Partition by (tracking_id, client_id) // repartition to co-locate user events
3. Sort within partition by timestamp
4. Apply sessionization UDF (window function with 30-min gap detection)
5. Write sessionized events back to S3
6. Aggregate session-level metrics (duration, page count, bounce, entry/exit page)
Challenge — cross-boundary sessions: A session might span two hourly batches. User visits at 13:50, clicks around until 14:10.
- Solution: Sessionization runs on overlapping windows. The hourly job for 14:00 also reads events from 13:30-14:00 (30-minute lookback). Dedup by session_id when merging.
- Alternative: Use Flink for continuous sessionization with session windows (event-time based, gap = 30 minutes). More complex operationally but eliminates boundary issues.
Real-time active sessions:
- Flink maintains a session state store keyed by (tracking_id, client_id)
- On each event: check whether the existing session is still active (last event < 30 min ago)
- If yes: update session. If no: close old session, open new.
- “Active users right now” = count of sessions with last event within 5 minutes.
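A single-process stand-in for that keyed state (a plain dict instead of Flink's RocksDB-backed state; the class and method names here are illustrative):

```python
import time

SESSION_GAP = 30 * 60    # close a session after 30 min of silence
ACTIVE_WINDOW = 5 * 60   # "active now" = event within the last 5 min

class ActiveSessionTracker:
    """Toy model of the Flink session state keyed by (tracking_id, client_id)."""

    def __init__(self):
        self.last_seen = {}  # (tracking_id, client_id) -> unix seconds

    def on_event(self, tracking_id, client_id, ts):
        key = (tracking_id, client_id)
        prev = self.last_seen.get(key)
        if prev is not None and ts - prev > SESSION_GAP:
            # Previous session ended; a real pipeline would emit the
            # closed session record downstream here.
            pass
        self.last_seen[key] = ts

    def active_users(self, tracking_id, now=None):
        now = time.time() if now is None else now
        return sum(
            1 for (tid, _), ts in self.last_seen.items()
            if tid == tracking_id and now - ts <= ACTIVE_WINDOW
        )
```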
Deep Dive 3: Sampling Strategies for Large Sites
The problem: A top-100 website might generate 100K events/sec — 8.6B events/day. Running ad-hoc queries over this data is slow and expensive.
When to sample:
- Real-time dashboard: always use full data (already pre-aggregated)
- Standard reports (daily/weekly): use full data (pre-computed rollups)
- Ad-hoc exploration queries: sample when raw table scan > 1B rows
Sampling strategies:
1. Client-side sampling (configured per tracking_id):
// SDK checks sampling rate on init
if (Math.random() > samplingRate) return; // skip all events for this session
// Send events with sampling_weight = 1 / samplingRate
- Pros: Reduces ingestion volume and cost
- Cons: Loses data permanently; can’t un-sample later
2. Query-time sampling (better):
-- ClickHouse sample clause
SELECT page_path, count() * 10 as estimated_pageviews
FROM raw_events
SAMPLE 0.1 -- 10% sample, deterministic by row hash
WHERE tracking_id = 'UA-123456'
AND date = '2024-02-22'
GROUP BY page_path
ORDER BY estimated_pageviews DESC;
- Pros: Raw data fully preserved, sample at query time
- Cons: Still need to touch 10% of data; works well with columnar storage (ClickHouse can skip 90% of granules)
3. Pre-computed samples (best for large sites):
- During batch processing, create a 10% deterministic sample table: hash(client_id) % 10 == 0
- Ad-hoc queries run against the sample table with a 10x multiplier
- Dashboard shows “This report is based on X% of sessions” (like GA does)
- For metrics where sampling error matters (conversion rate), provide confidence intervals
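The deterministic-sample rule can be sketched as below. Hashing by client_id (rather than per event) keeps all of a user's events in or out together, so sessions in the sample stay intact; sha256 is one reasonable choice of stable hash.

```python
import hashlib

SAMPLE_MOD = 10  # keep ~10% of users

def in_sample(client_id: str) -> bool:
    # A cryptographic hash, not Python's hash(), so the decision is
    # stable across processes and runs. Bucketing by user keeps whole
    # sessions inside the sample.
    digest = hashlib.sha256(client_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % SAMPLE_MOD == 0

def estimated_total(sample_count: int) -> int:
    # Extrapolate counts from the sample back to the full population
    return sample_count * SAMPLE_MOD
```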
Adaptive sampling: Sites with < 100K daily events: no sampling. 100K-10M: sample at 50%. 10M+: sample at 10%. Configured automatically based on rolling 7-day event volume.
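Those tiers reduce to a small lookup; the function name and exact thresholds below just restate the policy above:

```python
def adaptive_sampling_rate(daily_events_7d_avg: int) -> float:
    """Pick a sampling rate from a site's rolling 7-day average event volume."""
    if daily_events_7d_avg < 100_000:
        return 1.0   # small site: no sampling
    if daily_events_7d_avg < 10_000_000:
        return 0.5
    return 0.1
```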
7. Extensions (2 min)
- Funnel analysis: Define ordered sequences of events (e.g., landing → add to cart → checkout → purchase). Pre-compute funnel steps using window functions in ClickHouse. Show conversion and drop-off rates at each step.
- Cohort analysis: Group users by first-visit date (cohort) and track retention over time. Requires joining session data with a users table keyed by client_id → first_seen_date.
- Attribution modeling: Multi-touch attribution across marketing channels. Track UTM parameters and assign credit to touchpoints using last-click, linear, or time-decay models.
- Privacy and consent: GDPR/CCPA compliance — support opt-out, data deletion by client_id, anonymized IP mode, cookie-less tracking via fingerprinting (with consent). Server-side tracking API for first-party data.
- Custom event schemas: Let customers define typed event schemas (e.g., purchase with amount, currency, product_id). Validate at ingestion and store in dedicated columns for efficient querying.