1. Requirements & Scope (5 min)
Functional Requirements
- Track every article share event across the platform (social shares, link copies, email shares)
- Return the top-K most shared articles for multiple time windows: last 1 minute, last 1 hour, last 24 hours
- Support real-time updates — the top-K list refreshes within seconds of share events
- Provide both global top-K and per-category top-K (e.g., top sports articles, top tech articles)
- Expose an API for clients to query current trending articles with share counts
Non-Functional Requirements
- Availability: 99.99% — trending articles is a high-visibility feature; downtime is immediately noticeable
- Latency: < 50ms for top-K queries. Share event ingestion can tolerate up to 5 seconds of delay before reflecting in results.
- Consistency: Approximate counts are acceptable. If an article has 10,000 shares, reporting 9,950 is fine. Rankings may be slightly stale by a few seconds.
- Scale: 100K share events/sec at peak (viral events, breaking news). 10M+ unique articles in the system. Top-K queries at 50K QPS.
- Durability: Share events must not be lost (feed into analytics). Top-K results are recomputable from raw events.
2. Estimation (3 min)
Traffic
- Share events: 100K/sec peak, 30K/sec average
- Daily share events: 30K x 86,400 = 2.6 billion/day
- Top-K query QPS: 50K/sec (served from cache, cheap)
Storage
- Each share event: article_id (8 bytes) + user_id (8 bytes) + timestamp (8 bytes) + type (1 byte) + category (2 bytes) = 27 bytes, ~30 bytes with encoding overhead
- Daily raw events: 2.6B x 30 bytes = 78 GB/day
- 30-day retention for raw events: 2.3 TB
Counting Infrastructure
- Count-Min Sketch for approximate counting:
- 4 hash functions x 1M counters each = 4M counters
- Each counter: 4 bytes → 16 MB per time window
- 3 time windows (1min, 1hr, 24hr) → 48 MB total
- Fits entirely in memory on a single node
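As a sanity check, the arithmetic above in a few lines of Python (constants are the figures stated in this section):

```python
# Back-of-envelope check for the estimation numbers above.
AVG_EVENTS_PER_SEC = 30_000
SECONDS_PER_DAY = 86_400
EVENT_BYTES = 30

daily_events = AVG_EVENTS_PER_SEC * SECONDS_PER_DAY       # ~2.6 billion/day
daily_bytes = daily_events * EVENT_BYTES                  # ~78 GB/day
retention_bytes = daily_bytes * 30                        # ~2.3 TB for 30 days

# Count-Min Sketch sizing: 4 hash rows x 1M counters x 4 bytes each
cms_bytes = 4 * 1_000_000 * 4                             # 16 MB per window
total_cms_bytes = cms_bytes * 3                           # 48 MB for 3 windows

print(f"{daily_events / 1e9:.2f}B events/day")
print(f"{daily_bytes / 1e9:.0f} GB/day raw, {retention_bytes / 1e12:.1f} TB retained")
print(f"{cms_bytes / 1e6:.0f} MB per sketch, {total_cms_bytes / 1e6:.0f} MB total")
```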
Key Insight
The core challenge is not storage — it is maintaining accurate, real-time rankings over sliding time windows at 100K events/sec. The Count-Min Sketch + min-heap approach keeps this at O(d + log K) work per event (d hash updates plus one heap operation) with minimal memory.
3. API Design (3 min)
// Query top-K articles
GET /trending/articles?window=1h&k=100&category=tech
Response 200: {
"window": "1h",
"timestamp": 1708632060,
"articles": [
{"article_id": "abc123", "title": "...", "share_count": 15420, "rank": 1},
{"article_id": "def456", "title": "...", "share_count": 12301, "rank": 2},
...
]
}
// Ingest share event (internal — from share service)
POST /events/share
Body: {
"article_id": "abc123",
"user_id": "user_789",
"share_type": "twitter", // twitter, facebook, email, copy_link
"category": "tech",
"timestamp": 1708632060000
}
Response 202: { "status": "accepted" }
// Get share count for a specific article
GET /articles/{article_id}/shares?window=24h
Response 200: {
"article_id": "abc123",
"window": "24h",
"share_count": 45230,
"rank": 7
}
Key Decisions
- Share event ingestion is async (202 Accepted) — event is queued for processing, not synchronously counted
- Top-K queries served from pre-computed cache — no on-the-fly aggregation
- Category filtering is a first-class parameter, not a client-side filter on global results
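A minimal sketch of the read path implied by the decisions above: the query is answered entirely from the pre-computed cache, with no aggregation at request time. The function names and the dict standing in for the cache are illustrative, not part of the design:

```python
import json
import time

# Stand-in for the result cache (Redis/memcached in production).
_result_cache: dict[str, str] = {}

def cache_key(window: str, category: str, k: int) -> str:
    # Mirrors the pre-computed result key: trending:{window}:{category}:{k}
    return f"trending:{window}:{category}:{k}"

def get_trending(window: str = "1h", k: int = 100, category: str = "global") -> dict:
    """Serve GET /trending/articles purely from the pre-computed cache."""
    cached = _result_cache.get(cache_key(window, category, k))
    if cached is None:
        # Cache miss: return an empty list rather than aggregating on the fly.
        return {"window": window, "timestamp": int(time.time()), "articles": []}
    return json.loads(cached)

# A cache-refresh job (not shown) would periodically write entries like:
_result_cache[cache_key("1h", "tech", 2)] = json.dumps({
    "window": "1h",
    "timestamp": 1708632060,
    "articles": [
        {"article_id": "abc123", "share_count": 15420, "rank": 1},
        {"article_id": "def456", "share_count": 12301, "rank": 2},
    ],
})
```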
4. Data Model (3 min)
Raw Events (Kafka topic → cold storage)
Topic: article-shares
Key: article_id
Value: {
article_id | string
user_id | string
share_type | enum(twitter, facebook, email, copy_link)
category | string
timestamp | int64
}
Partitioned by: article_id (co-locate events for same article)
Aggregated Counts (Redis)
// Real-time counts per time window (Sorted Set)
Key: topk:{window}:{category}
Type: Sorted Set
Member: article_id
Score: share_count
Example:
topk:1h:global → {("abc123", 15420), ("def456", 12301), ...}
topk:1h:tech → {("abc123", 8200), ("ghi789", 5100), ...}
topk:24h:global → ...
// Per-article count for specific lookups
Key: shares:{article_id}:{window}
Type: String (integer)
TTL: window duration
Pre-computed Results (Cache)
Key: trending:{window}:{category}:{k}
Value: JSON array of top-K articles with metadata
TTL: 5 seconds (1min window), 30 seconds (1hr window), 5 minutes (24hr window)
Why These Choices
- Kafka for event ingestion: durable, replayable, handles 100K events/sec easily
- Redis Sorted Sets for real-time rankings: O(log N) insert, O(K) to get top-K, in-memory speed
- Separate cache layer for query results: decouples query load from counting infrastructure
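The sorted-set schema can be exercised without a live Redis. With redis-py, the two operations below map to `zincrby(key, 1, article_id)` on ingest and `zrevrange(key, 0, k - 1, withscores=True)` on query; the dict stand-in here is purely illustrative:

```python
from collections import defaultdict

# Stand-in for Redis sorted sets keyed as topk:{window}:{category}.
sorted_sets: dict[str, dict[str, int]] = defaultdict(dict)

def record_share(window: str, category: str, article_id: str, n: int = 1) -> None:
    """Bump an article's score in both its category ranking and the global one
    (ZINCRBY in Redis)."""
    for cat in (category, "global"):
        key = f"topk:{window}:{cat}"
        sorted_sets[key][article_id] = sorted_sets[key].get(article_id, 0) + n

def top_k(window: str, category: str, k: int) -> list[tuple[str, int]]:
    """Highest-scored members first (ZREVRANGE ... WITHSCORES in Redis)."""
    key = f"topk:{window}:{category}"
    return sorted(sorted_sets[key].items(), key=lambda kv: -kv[1])[:k]

record_share("1h", "tech", "abc123", 8200)
record_share("1h", "sports", "xyz999", 9000)
record_share("1h", "tech", "ghi789", 5100)
```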
5. High-Level Design (12 min)
Architecture
Share Events (100K/sec)
│
▼
┌─────────────────┐
│ Kafka Cluster │ (article-shares topic, 64 partitions)
│ (event buffer) │
└────────┬────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Stream Processor (Flink/custom) │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Per-partition worker: │ │
│ │ 1. Read share event │ │
│ │ 2. Update Count-Min Sketch (approx) │ │
│ │ 3. Update min-heap (top-K candidate) │ │
│ │ 4. Emit to aggregator if top-K change│ │
│ └──────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Aggregator (single node per window): │ │
│ │ 1. Merge per-partition top-K lists │ │
│ │ 2. Produce global top-K │ │
│ │ 3. Write to Redis Sorted Set │ │
│ └──────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ Redis Cluster │ ←── │ Cache Refresh │
│ (sorted sets, │ │ (every 5-30s, │
│ per-article) │ │ build JSON) │
└────────┬────────┘ └─────────────────┘
│
▼
┌─────────────────┐
│ API Servers │ ← Client queries (50K QPS)
│ (read from │
│ cached JSON) │
└─────────────────┘
Time Window Management
1-minute window (tumbling):
┌──────┐ ┌──────┐ ┌──────┐
│ :00 │ │ :01 │ │ :02 │ ...
│ -:01 │ │ -:02 │ │ -:03 │
└──────┘ └──────┘ └──────┘
Each bucket: independent Count-Min Sketch + heap
On expiry: discard bucket, current top-K = latest bucket only
1-hour window (sliding, approximated with 60 tumbling 1-min buckets):
Current count = sum of last 60 one-minute buckets
Every minute: drop oldest bucket, add new bucket
Re-merge heaps from all 60 buckets → new top-K
24-hour window (sliding, approximated with 24 tumbling 1-hr buckets):
Current count = sum of last 24 one-hour buckets
Re-merge hourly → new top-K
Components
- Kafka Cluster: 64 partitions for the share events topic. Handles 100K events/sec with headroom. Partitioned by article_id for locality.
- Stream Processors (Flink or custom): 64 parallel workers (one per partition). Each maintains a Count-Min Sketch and min-heap per time window. Stateful, checkpointed.
- Aggregator: Merges per-partition top-K lists into global top-K. One aggregator per (window, category) pair. Lightweight — just a heap merge.
- Redis Cluster: Stores current top-K sorted sets and per-article counts. Serves as the source of truth for the API layer.
- API Servers (stateless): Serve cached top-K JSON. Cache refreshed every 5-30 seconds depending on window granularity.
- Cold Storage (S3/HDFS): Raw share events archived for batch analytics, ML model training, and historical trend analysis.
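The aggregator's merge step is small enough to sketch. Because the topic is partitioned by article_id, each article's full count lives in exactly one partition, so the merge reduces to a k-largest selection over the union of candidates:

```python
import heapq

def merge_topk(partition_lists: list[list[tuple[str, int]]],
               k: int) -> list[tuple[str, int]]:
    """Merge per-partition (article_id, count) top-K lists into a global top-K.

    Partitioning by article_id means no article appears in two lists, so no
    count summation is needed: flatten and pick the k largest.
    """
    candidates = [item for plist in partition_lists for item in plist]
    return heapq.nlargest(k, candidates, key=lambda item: item[1])

# Illustrative per-partition top-K lists from three stream workers:
partitions = [
    [("a1", 500), ("a2", 300)],
    [("b1", 450), ("b2", 100)],
    [("c1", 900)],
]
```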
6. Deep Dives (15 min)
Deep Dive 1: Count-Min Sketch + Min-Heap for Streaming Top-K
The problem: We have 100K events/sec and 10M unique articles. We need top-100 at any instant. Maintaining exact counts for all 10M articles in real-time is expensive. Can we do better?
Count-Min Sketch (CMS):
Data structure: 2D array of counters, d rows x w columns
d = number of hash functions (typically 4-5)
w = number of counters per row (typically 10,000 - 1,000,000)
Increment(article_id):
for i in 0..d:
j = hash_i(article_id) % w
CMS[i][j] += 1
Query(article_id):
return min(CMS[i][hash_i(article_id) % w] for i in 0..d)
Properties:
- Never undercounts (can only overcount due to hash collisions)
- Error bound: overcount ≤ ε * total_count with probability 1 - δ
- Space: O(1/ε * ln(1/δ))
- With w = 100,000 and d = 5: ε = e/w ≈ 0.000027, δ = e^-d ≈ 0.007
→ overcount ≤ ~0.003% of total events with > 99% probability
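The pseudocode above as a minimal, self-contained sketch (defaults follow the deep dive; the per-row hash functions are derived by salting a single hash, an implementation choice, not the only one):

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch: d rows x w counters; min over rows bounds overcount."""

    def __init__(self, d: int = 5, w: int = 100_000):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row: int, key: str) -> int:
        # One independent hash per row, derived by salting with the row number.
        h = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little"))
        return int.from_bytes(h.digest()[:8], "little") % self.w

    def increment(self, key: str, n: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._index(row, key)] += n

    def query(self, key: str) -> int:
        # Min over rows: never undercounts, may overcount on collisions.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.d))

    def merge(self, other: "CountMinSketch") -> None:
        # Element-wise sum combines sketches built over disjoint event streams
        # (both sketches must share d, w, and hash functions).
        for row in range(self.d):
            for j in range(self.w):
                self.table[row][j] += other.table[row][j]
```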
Min-Heap for Top-K:
Maintain a min-heap of size K (e.g., K = 100)
On each event:
1. Increment article count in CMS
2. estimated_count = CMS.query(article_id)
3. If article_id already in heap:
Update its count, sift up if needed
4. Else if estimated_count > heap.min():
Remove heap minimum
Insert (article_id, estimated_count)
5. Else: ignore (not in top-K)
Complexity per event: O(d) for CMS + O(log K) for heap = O(d + log K)
With d=5, K=100: ~12 operations per event — trivially fast
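The heap procedure above, sketched with exact counts standing in for the CMS estimate (step 2 would call the sketch's query in the real pipeline). For brevity the minimum is found by an O(K) scan over a dict rather than a position-indexed heap, which is fine at K = 100:

```python
import heapq

class TopK:
    """Size-K tracker following the update rules above (dict-based sketch)."""

    def __init__(self, k: int):
        self.k = k
        self.counts: dict[str, int] = {}  # articles currently tracked as top-K

    def offer(self, article_id: str, estimated_count: int) -> None:
        if article_id in self.counts:
            self.counts[article_id] = estimated_count       # update in place
        elif len(self.counts) < self.k:
            self.counts[article_id] = estimated_count       # heap not full yet
        else:
            tail = min(self.counts, key=self.counts.get)    # current minimum
            if estimated_count > self.counts[tail]:
                del self.counts[tail]                       # evict the minimum
                self.counts[article_id] = estimated_count
            # else: below the bar, ignore (not in top-K)

    def result(self) -> list[tuple[str, int]]:
        return heapq.nlargest(self.k, self.counts.items(), key=lambda kv: kv[1])
```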
Why not exact counting?
Exact approach: HashMap<article_id, count>
- 10M articles x 20 bytes = 200MB per time window
- Finding top-K: O(N log K) scan with a size-K heap, or maintain a sorted structure at O(log N) per update
- Works but:
a. Higher memory (200MB vs 16MB for CMS)
b. Harder to merge across partitions (merge 64 hash maps vs merge 64 heaps)
For this use case: CMS is preferred because:
- Approximate counts are explicitly acceptable
- Memory is 10x lower
- Merging is trivial (element-wise sum of the CMS arrays, merge of the K-heaps)
- We only care about the top-K, not exact counts for all articles
Deep Dive 2: Sliding Windows vs Tumbling Windows
Tumbling (fixed) windows:
│←── Window 1 ──→│←── Window 2 ──→│←── Window 3 ──→│
0 60 120 180
Problems:
- Boundary effect: an article gets 500 shares at t=59 and 500 at t=61
- Window 1 sees 500, Window 2 sees 500
- Neither window shows the true spike of 1000 shares in 2 seconds
- Top-K can miss genuinely trending articles at window boundaries
Sliding windows (ideal but expensive):
At any time t, the window covers [t - W, t]:
- Continuously sliding, no boundaries
- Perfectly captures all spikes
- But: requires tracking every individual event with its timestamp
- Memory: O(events_in_window) — could be billions for 24-hour window
Practical approach: Sliding window approximated with sub-windows:
For 1-hour sliding window:
Maintain 60 tumbling 1-minute buckets, each with its own CMS + heap
At time t:
Active buckets = last 60 minutes of buckets
For top-K query:
1. For each article in any bucket's heap:
total_count = sum of CMS.query(article) across all 60 buckets
2. Build top-K from total counts
Alternatively: merge all 60 heaps → candidate set of up to 60*K articles → re-rank
On minute boundary:
- Create new empty bucket
- Drop bucket from 61 minutes ago
- Re-compute merged top-K
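The rotate-and-merge mechanics above can be sketched with Counters standing in for the per-bucket CMS + heap (the bucket bookkeeping, not the counting structure, is what this illustrates):

```python
import heapq
from collections import Counter, deque

class SlidingWindowTopK:
    """Sliding window approximated by N tumbling sub-buckets (60 x 1min for 1h).

    Counter buckets stand in for per-bucket CMS + heap in this sketch.
    """

    def __init__(self, num_buckets: int = 60):
        self.buckets = deque(
            [Counter() for _ in range(num_buckets)], maxlen=num_buckets
        )

    def record(self, article_id: str, n: int = 1) -> None:
        self.buckets[-1][article_id] += n       # always write the newest bucket

    def rotate(self) -> None:
        # Minute boundary: append a fresh bucket; maxlen drops the oldest one.
        self.buckets.append(Counter())

    def top_k(self, k: int) -> list[tuple[str, int]]:
        merged = Counter()
        for bucket in self.buckets:             # sum counts across live buckets
            merged.update(bucket)
        return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])
```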
Granularity vs accuracy trade-off:
- 60 one-minute buckets → max boundary error = 1 minute (good enough)
- 12 five-minute buckets → max error = 5 minutes (acceptable for 1-hour window)
- For 24-hour window: 24 one-hour sub-buckets → max error = 1 hour
Memory: 60 CMS sketches x 16MB = 960MB per window
→ Optimize: use smaller CMS for fine-grained buckets (100K counters → 1.6MB each → 96MB total)
Deep Dive 3: Lambda Architecture vs Kappa Architecture
Lambda Architecture (batch + real-time):
┌──── Batch Layer (Hadoop/Spark) ──── Batch View ───┐
│ (exact counts, hourly) │
Raw Events ───────┤ ├─→ Serving Layer → API
│ │
└──── Speed Layer (Flink/custom) ──── Real-time View┘
(approximate, real-time)
How it works:
1. Speed layer: CMS + heap, approximate, updates in seconds
2. Batch layer: MapReduce exact count every hour
3. Serving layer: merge batch (accurate but stale) + speed (approximate but fresh)
4. When batch catches up, speed layer results for that period are discarded
Pros: Exact counts eventually, real-time approximation in between
Cons: Two separate codepaths (batch + stream), operational complexity, data reconciliation
Kappa Architecture (stream only):
Raw Events → Kafka → Stream Processor → Views → API
↑
└── Replay from offset 0 to recompute
How it works:
1. Single stream processor handles all counting
2. If logic changes or corruption occurs: replay Kafka from the beginning
3. Build new view in parallel, swap atomically
Pros: Single codebase, simpler operations, Kafka is the source of truth
Cons: Replay can be slow for large windows (24hr = 2.6B events to reprocess)
Recommendation for this system:
Kappa architecture with tiered accuracy:
- 1-minute window: pure streaming (CMS + heap), approximate
- 1-hour window: streaming with sub-minute buckets, approximate
- 24-hour window: streaming for real-time + periodic batch correction (every hour)
The batch job runs exact GROUP BY article_id, COUNT(*) on the last 24 hours
and replaces the approximate counts
This hybrid gives:
- Real-time freshness for all windows
- Exact 24-hour counts (corrected hourly)
- Single streaming codebase + one simple batch job
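The hourly correction job is just an exact group-by count over the window's raw events; a Counter stands in here for the Spark/SQL job over cold storage:

```python
from collections import Counter

def exact_counts(raw_events: list[dict]) -> Counter:
    """Exact GROUP BY article_id, COUNT(*) over one window's raw events.

    The result replaces the approximate (CMS) 24-hour counts wholesale,
    since it is authoritative for that window.
    """
    return Counter(event["article_id"] for event in raw_events)

# Illustrative raw events replayed from cold storage:
events = [{"article_id": "abc123"}] * 3 + [{"article_id": "def456"}] * 2
```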
7. Extensions (2 min)
- Weighted sharing: Not all shares are equal. A share by a user with 1M followers is more impactful than a share by a user with 10 followers. Weight share events by the sharer’s follower count or engagement rate.
- Spam/bot detection: Filter out bot-generated share events before counting. Use rate limiting per user, anomaly detection on share velocity (article going from 0 to 10K shares in 1 minute with no corresponding view increase), and CAPTCHAs on share actions.
- Personalized trending: Instead of global top-K, compute per-user or per-region top-K. Use the user’s interest graph (followed categories, past reading history) to re-rank the global top-K. Two-stage: global candidate generation + personalized re-ranking.
- Historical trends: Store top-K snapshots every hour for historical analysis. “What was trending last Tuesday at 3pm?” Enables trend-over-time visualizations and retrospective analysis for editorial teams.
- Real-time notifications: When an article enters the top-K for the first time, push a notification to subscribed editorial dashboards and content recommendation pipelines. Use a change-detection layer that compares consecutive top-K snapshots.
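The change-detection comparison reduces to a set difference over consecutive snapshots; the function name below is illustrative:

```python
def new_entrants(prev_topk: list[str], curr_topk: list[str]) -> set[str]:
    """Articles present in the current top-K snapshot but not the previous one.

    The notification layer fires once per returned id; it would also keep
    dedup state so an article bouncing in and out does not re-notify.
    """
    return set(curr_topk) - set(prev_topk)
```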