1. Requirements & Scope (5 min)
Functional Requirements
- Track every article share event across the platform (social shares, link copies, email shares)
- Return the top-K most shared articles for multiple time windows: last 1 minute, last 1 hour, last 24 hours
- Support real-time updates — the top-K list refreshes within seconds of share events
- Provide both global top-K and per-category top-K (e.g., top sports articles, top tech articles)
- Expose an API for clients to query current trending articles with share counts
Non-Functional Requirements
- Availability: 99.99% — trending articles is a high-visibility feature; downtime is immediately noticeable
- Latency: < 50ms for top-K queries. Share event ingestion can tolerate up to 5 seconds of delay before reflecting in results.
- Consistency: Approximate counts are acceptable. If an article has 10,000 shares, reporting 9,950 is fine. Rankings may be slightly stale by a few seconds.
- Scale: 100K share events/sec at peak (viral events, breaking news). 10M+ unique articles in the system. Top-K queries at 50K QPS.
- Durability: Share events must not be lost (feed into analytics). Top-K results are recomputable from raw events.
2. Estimation (3 min)
Traffic
- Share events: 100K/sec peak, 30K/sec average
- Daily share events: 30K x 86,400 = 2.6 billion/day
- Top-K query QPS: 50K/sec (served from cache, cheap)
Storage
- Each share event: article_id (8 bytes) + user_id (8 bytes) + timestamp (8 bytes) + type (1 byte) + category (2 bytes) = 27 bytes, ~30 bytes with encoding overhead
- Daily raw events: 2.6B x 30 bytes = 78 GB/day
- 30-day retention for raw events: 2.3 TB
Counting Infrastructure
- Count-Min Sketch for approximate counting:
- 4 hash functions x 1M counters each = 4M counters
- Each counter: 4 bytes → 16 MB per time window
- 3 time windows (1min, 1hr, 24hr) → 48 MB total
- Fits entirely in memory on a single node
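As a sanity check, the arithmetic above in a few lines of Python (constants are the figures stated in this section):

```python
# Back-of-envelope check for the estimation numbers above.
AVG_EVENTS_PER_SEC = 30_000
SECONDS_PER_DAY = 86_400
EVENT_BYTES = 30

daily_events = AVG_EVENTS_PER_SEC * SECONDS_PER_DAY       # ~2.6 billion/day
daily_bytes = daily_events * EVENT_BYTES                  # ~78 GB/day
retention_bytes = daily_bytes * 30                        # ~2.3 TB for 30 days

# Count-Min Sketch sizing: 4 hash rows x 1M counters x 4 bytes each
cms_bytes = 4 * 1_000_000 * 4                             # 16 MB per window
total_cms_bytes = cms_bytes * 3                           # 48 MB for 3 windows

print(f"{daily_events / 1e9:.2f}B events/day")
print(f"{daily_bytes / 1e9:.0f} GB/day raw, {retention_bytes / 1e12:.1f} TB retained")
print(f"{cms_bytes / 1e6:.0f} MB per sketch, {total_cms_bytes / 1e6:.0f} MB total")
```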
Key Insight
The core challenge is not storage — it is maintaining accurate, real-time rankings over sliding time windows at 100K events/sec. The Count-Min Sketch + min-heap approach keeps this at O(d + log K) work per event (d hash updates plus one heap operation) with minimal memory.
3. API Design (3 min)
// Query top-K articles
GET /trending/articles?window=1h&k=100&category=tech
Response 200: {
"window": "1h",
"timestamp": 1708632060,
"articles": [
{"article_id": "abc123", "title": "...", "share_count": 15420, "rank": 1},
{"article_id": "def456", "title": "...", "share_count": 12301, "rank": 2},
...
]
}
// Ingest share event (internal — from share service)
POST /events/share
Body: {
"article_id": "abc123",
"user_id": "user_789",
"share_type": "twitter", // twitter, facebook, email, copy_link
"category": "tech",
"timestamp": 1708632060000
}
Response 202: { "status": "accepted" }
// Get share count for a specific article
GET /articles/{article_id}/shares?window=24h
Response 200: {
"article_id": "abc123",
"window": "24h",
"share_count": 45230,
"rank": 7
}
Key Decisions
- Share event ingestion is async (202 Accepted) — event is queued for processing, not synchronously counted
- Top-K queries served from pre-computed cache — no on-the-fly aggregation
- Category filtering is a first-class parameter, not a client-side filter on global results
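A minimal sketch of the read path implied by the decisions above: the query is answered entirely from the pre-computed cache, with no aggregation at request time. The function names and the dict standing in for the cache are illustrative, not part of the design:

```python
import json
import time

# Stand-in for the result cache (Redis/memcached in production).
_result_cache: dict[str, str] = {}

def cache_key(window: str, category: str, k: int) -> str:
    # Mirrors the pre-computed result key: trending:{window}:{category}:{k}
    return f"trending:{window}:{category}:{k}"

def get_trending(window: str = "1h", k: int = 100, category: str = "global") -> dict:
    """Serve GET /trending/articles purely from the pre-computed cache."""
    cached = _result_cache.get(cache_key(window, category, k))
    if cached is None:
        # Cache miss: return an empty list rather than aggregating on the fly.
        return {"window": window, "timestamp": int(time.time()), "articles": []}
    return json.loads(cached)

# A cache-refresh job (not shown) would periodically write entries like:
_result_cache[cache_key("1h", "tech", 2)] = json.dumps({
    "window": "1h",
    "timestamp": 1708632060,
    "articles": [
        {"article_id": "abc123", "share_count": 15420, "rank": 1},
        {"article_id": "def456", "share_count": 12301, "rank": 2},
    ],
})
```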
4. Data Model (3 min)
Raw Events (Kafka topic → cold storage)
Topic: article-shares
Key: article_id
Value: {
article_id | string
user_id | string
share_type | enum(twitter, facebook, email, copy_link)
category | string
timestamp | int64
}
Partitioned by: article_id (co-locate events for same article)
Aggregated Counts (Redis)
// Real-time counts per time window (Sorted Set)
Key: topk:{window}:{category}
Type: Sorted Set
Member: article_id
Score: share_count
Example:
topk:1h:global → {("abc123", 15420), ("def456", 12301), ...}
topk:1h:tech → {("abc123", 8200), ("ghi789", 5100), ...}
topk:24h:global → ...
// Per-article count for specific lookups
Key: shares:{article_id}:{window}
Type: String (integer)
TTL: window duration
Pre-computed Results (Cache)
Key: trending:{window}:{category}:{k}
Value: JSON array of top-K articles with metadata
TTL: 5 seconds (1min window), 30 seconds (1hr window), 5 minutes (24hr window)
Why These Choices
- Kafka for event ingestion: durable, replayable, handles 100K events/sec easily
- Redis Sorted Sets for real-time rankings: O(log N) insert, O(K) to get top-K, in-memory speed
- Separate cache layer for query results: decouples query load from counting infrastructure
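The sorted-set schema can be exercised without a live Redis. With redis-py, the two operations below map to `zincrby(key, 1, article_id)` on ingest and `zrevrange(key, 0, k - 1, withscores=True)` on query; the dict stand-in here is purely illustrative:

```python
from collections import defaultdict

# Stand-in for Redis sorted sets keyed as topk:{window}:{category}.
sorted_sets: dict[str, dict[str, int]] = defaultdict(dict)

def record_share(window: str, category: str, article_id: str, n: int = 1) -> None:
    """Bump an article's score in both its category ranking and the global one
    (ZINCRBY in Redis)."""
    for cat in (category, "global"):
        key = f"topk:{window}:{cat}"
        sorted_sets[key][article_id] = sorted_sets[key].get(article_id, 0) + n

def top_k(window: str, category: str, k: int) -> list[tuple[str, int]]:
    """Highest-scored members first (ZREVRANGE ... WITHSCORES in Redis)."""
    key = f"topk:{window}:{category}"
    return sorted(sorted_sets[key].items(), key=lambda kv: -kv[1])[:k]

record_share("1h", "tech", "abc123", 8200)
record_share("1h", "sports", "xyz999", 9000)
record_share("1h", "tech", "ghi789", 5100)
```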
5. High-Level Design (12 min)
Architecture
Share Events (100K/sec)
│
▼
┌─────────────────┐
│ Kafka Cluster │ (article-shares topic, 64 partitions)
│ (event buffer) │
└────────┬────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Stream Processor (Flink/custom) │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Per-partition worker: │ │
│ │ 1. Read share event │ │
│ │ 2. Update Count-Min Sketch (approx) │ │
│ │ 3. Update min-heap (top-K candidate) │ │
│ │ 4. Emit to aggregator if top-K change│ │
│ └──────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Aggregator (single node per window): │ │
│ │ 1. Merge per-partition top-K lists │ │
│ │ 2. Produce global top-K │ │
│ │ 3. Write to Redis Sorted Set │ │
│ └──────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ Redis Cluster │ ←── │ Cache Refresh │
│ (sorted sets, │ │ (every 5-30s, │
│ per-article) │ │ build JSON) │
└────────┬────────┘ └─────────────────┘
│
▼
┌─────────────────┐
│ API Servers │ ← Client queries (50K QPS)
│ (read from │
│ cached JSON) │
└─────────────────┘
Time Window Management
1-minute window (tumbling):
┌──────┐ ┌──────┐ ┌──────┐
│ :00 │ │ :01 │ │ :02 │ ...
│ -:01 │ │ -:02 │ │ -:03 │
└──────┘ └──────┘ └──────┘
Each bucket: independent Count-Min Sketch + heap
On expiry: discard bucket, current top-K = latest bucket only
1-hour window (sliding, approximated with 60 tumbling 1-min buckets):
Current count = sum of last 60 one-minute buckets
Every minute: drop oldest bucket, add new bucket
Re-merge heaps from all 60 buckets → new top-K
24-hour window (sliding, approximated with 24 tumbling 1-hr buckets):
Current count = sum of last 24 one-hour buckets
Re-merge hourly → new top-K
Components
- Kafka Cluster: 64 partitions for the share events topic. Handles 100K events/sec with headroom. Partitioned by article_id for locality.
- Stream Processors (Flink or custom): 64 parallel workers (one per partition). Each maintains a Count-Min Sketch and min-heap per time window. Stateful, checkpointed.
- Aggregator: Merges per-partition top-K lists into global top-K. One aggregator per (window, category) pair. Lightweight — just a heap merge.
- Redis Cluster: Stores current top-K sorted sets and per-article counts. Serves as the source of truth for the API layer.
- API Servers (stateless): Serve cached top-K JSON. Cache refreshed every 5-30 seconds depending on window granularity.
- Cold Storage (S3/HDFS): Raw share events archived for batch analytics, ML model training, and historical trend analysis.
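The aggregator's merge step is small enough to sketch. Because the topic is partitioned by article_id, each article's full count lives in exactly one partition, so the merge reduces to a k-largest selection over the union of candidates:

```python
import heapq

def merge_topk(partition_lists: list[list[tuple[str, int]]],
               k: int) -> list[tuple[str, int]]:
    """Merge per-partition (article_id, count) top-K lists into a global top-K.

    Partitioning by article_id means no article appears in two lists, so no
    count summation is needed: flatten and pick the k largest.
    """
    candidates = [item for plist in partition_lists for item in plist]
    return heapq.nlargest(k, candidates, key=lambda item: item[1])

# Illustrative per-partition top-K lists from three stream workers:
partitions = [
    [("a1", 500), ("a2", 300)],
    [("b1", 450), ("b2", 100)],
    [("c1", 900)],
]
```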
6. Deep Dives (15 min)
Deep Dive 1: Count-Min Sketch + Min-Heap for Streaming Top-K
The problem: We have 100K events/sec and 10M unique articles. We need top-100 at any instant. Maintaining exact counts for all 10M articles in real-time is expensive. Can we do better?
Count-Min Sketch (CMS):
Data structure: 2D array of counters, d rows x w columns
d = number of hash functions (typically 4-5)
w = number of counters per row (typically 10,000 - 1,000,000)
Increment(article_id):
for i in 0..d:
j = hash_i(article_id) % w
CMS[i][j] += 1
Query(article_id):
return min(CMS[i][hash_i(article_id) % w] for i in 0..d)
Properties:
- Never undercounts (can only overcount due to hash collisions)
- Error bound: overcount ≤ ε * total_count with probability 1 - δ
- Space: O(1/ε * ln(1/δ))
- With w = 100,000 and d = 5: ε = e/w ≈ 0.000027, δ = e^-d ≈ 0.007
→ overcount ≤ ~0.003% of total events with > 99% probability
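The pseudocode above as a minimal, self-contained sketch (defaults follow the deep dive; the per-row hash functions are derived by salting a single hash, an implementation choice, not the only one):

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch: d rows x w counters; min over rows bounds overcount."""

    def __init__(self, d: int = 5, w: int = 100_000):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row: int, key: str) -> int:
        # One independent hash per row, derived by salting with the row number.
        h = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little"))
        return int.from_bytes(h.digest()[:8], "little") % self.w

    def increment(self, key: str, n: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._index(row, key)] += n

    def query(self, key: str) -> int:
        # Min over rows: never undercounts, may overcount on collisions.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.d))

    def merge(self, other: "CountMinSketch") -> None:
        # Element-wise sum combines sketches built over disjoint event streams
        # (both sketches must share d, w, and hash functions).
        for row in range(self.d):
            for j in range(self.w):
                self.table[row][j] += other.table[row][j]
```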
Min-Heap for Top-K:
Maintain a min-heap of size K (e.g., K = 100)
On each event:
1. Increment article count in CMS
2. estimated_count = CMS.query(article_id)
3. If article_id already in heap:
Update its count, sift up if needed
4. Else if estimated_count > heap.min():
Remove heap minimum
Insert (article_id, estimated_count)
5. Else: ignore (not in top-K)
Complexity per event: O(d) for CMS + O(log K) for heap = O(d + log K)
With d=5, K=100: ~12 operations per event — trivially fast
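The heap procedure above, sketched with exact counts standing in for the CMS estimate (step 2 would call the sketch's query in the real pipeline). For brevity the minimum is found by an O(K) scan over a dict rather than a position-indexed heap, which is fine at K = 100:

```python
import heapq

class TopK:
    """Size-K tracker following the update rules above (dict-based sketch)."""

    def __init__(self, k: int):
        self.k = k
        self.counts: dict[str, int] = {}  # articles currently tracked as top-K

    def offer(self, article_id: str, estimated_count: int) -> None:
        if article_id in self.counts:
            self.counts[article_id] = estimated_count       # update in place
        elif len(self.counts) < self.k:
            self.counts[article_id] = estimated_count       # heap not full yet
        else:
            tail = min(self.counts, key=self.counts.get)    # current minimum
            if estimated_count > self.counts[tail]:
                del self.counts[tail]                       # evict the minimum
                self.counts[article_id] = estimated_count
            # else: below the bar, ignore (not in top-K)

    def result(self) -> list[tuple[str, int]]:
        return heapq.nlargest(self.k, self.counts.items(), key=lambda kv: kv[1])
```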
Why not exact counting?
Exact approach: HashMap<article_id, count>
- 10M articles x 20 bytes = 200MB per time window
- Finding top-K: O(N log K) scan with a size-K heap, or maintain a sorted structure at O(log N) per update
- Works but:
a. Higher memory (200MB vs 16MB for CMS)
b. Harder to merge across partitions (merge 64 hash maps vs merge 64 heaps)
For this use case: CMS is preferred because:
- Approximate counts are explicitly acceptable
- Memory is 10x lower
- Merging is trivial (element-wise sum of the CMS arrays, merge of the K-heaps)
- We only care about the top-K, not exact counts for all articles
Deep Dive 2: Sliding Windows vs Tumbling Windows
Tumbling (fixed) windows:
│←── Window 1 ──→│←── Window 2 ──→│←── Window 3 ──→│
0 60 120 180
Problems:
- Boundary effect: an article gets 500 shares at t=59 and 500 at t=61
- Window 1 sees 500, Window 2 sees 500
- Neither window shows the true spike of 1000 shares in 2 seconds
- Top-K can miss genuinely trending articles at window boundaries
Sliding windows (ideal but expensive):
At any time t, the window covers [t - W, t]:
- Continuously sliding, no boundaries
- Perfectly captures all spikes
- But: requires tracking every individual event with its timestamp
- Memory: O(events_in_window) — could be billions for 24-hour window
Practical approach: Sliding window approximated with sub-windows:
For 1-hour sliding window:
Maintain 60 tumbling 1-minute buckets, each with its own CMS + heap
At time t:
Active buckets = last 60 minutes of buckets
For top-K query:
1. For each article in any bucket's heap:
total_count = sum of CMS.query(article) across all 60 buckets
2. Build top-K from total counts
Alternatively: merge all 60 heaps → candidate set of up to 60*K articles → re-rank
On minute boundary:
- Create new empty bucket
- Drop bucket from 61 minutes ago
- Re-compute merged top-K
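The rotate-and-merge mechanics above can be sketched with Counters standing in for the per-bucket CMS + heap (the bucket bookkeeping, not the counting structure, is what this illustrates):

```python
import heapq
from collections import Counter, deque

class SlidingWindowTopK:
    """Sliding window approximated by N tumbling sub-buckets (60 x 1min for 1h).

    Counter buckets stand in for per-bucket CMS + heap in this sketch.
    """

    def __init__(self, num_buckets: int = 60):
        self.buckets = deque(
            [Counter() for _ in range(num_buckets)], maxlen=num_buckets
        )

    def record(self, article_id: str, n: int = 1) -> None:
        self.buckets[-1][article_id] += n       # always write the newest bucket

    def rotate(self) -> None:
        # Minute boundary: append a fresh bucket; maxlen drops the oldest one.
        self.buckets.append(Counter())

    def top_k(self, k: int) -> list[tuple[str, int]]:
        merged = Counter()
        for bucket in self.buckets:             # sum counts across live buckets
            merged.update(bucket)
        return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])
```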
Granularity vs accuracy trade-off:
- 60 one-minute buckets → max boundary error = 1 minute (good enough)
- 12 five-minute buckets → max error = 5 minutes (acceptable for 1-hour window)
- For 24-hour window: 24 one-hour sub-buckets → max error = 1 hour
Memory: 60 CMS sketches x 16MB = 960MB per window
→ Optimize: use smaller CMS for fine-grained buckets (100K counters → 1.6MB each → 96MB total)
Deep Dive 3: Lambda Architecture vs Kappa Architecture
Lambda Architecture (batch + real-time):
┌──── Batch Layer (Hadoop/Spark) ──── Batch View ───┐
│ (exact counts, hourly) │
Raw Events ───────┤ ├─→ Serving Layer → API
│ │
└──── Speed Layer (Flink/custom) ──── Real-time View┘
(approximate, real-time)
How it works:
1. Speed layer: CMS + heap, approximate, updates in seconds
2. Batch layer: MapReduce exact count every hour
3. Serving layer: merge batch (accurate but stale) + speed (approximate but fresh)
4. When batch catches up, speed layer results for that period are discarded
Pros: Exact counts eventually, real-time approximation in between
Cons: Two separate codepaths (batch + stream), operational complexity, data reconciliation
Kappa Architecture (stream only):
Raw Events → Kafka → Stream Processor → Views → API
↑
└── Replay from offset 0 to recompute
How it works:
1. Single stream processor handles all counting
2. If logic changes or corruption occurs: replay Kafka from the beginning
3. Build new view in parallel, swap atomically
Pros: Single codebase, simpler operations, Kafka is the source of truth
Cons: Replay can be slow for large windows (24hr = 2.6B events to reprocess)
Recommendation for this system:
Kappa architecture with tiered accuracy:
- 1-minute window: pure streaming (CMS + heap), approximate
- 1-hour window: streaming with sub-minute buckets, approximate
- 24-hour window: streaming for real-time + periodic batch correction (every hour)
The batch job runs exact GROUP BY article_id, COUNT(*) on the last 24 hours
and replaces the approximate counts
This hybrid gives:
- Real-time freshness for all windows
- Exact 24-hour counts (corrected hourly)
- Single streaming codebase + one simple batch job
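The hourly correction job is just an exact group-by count over the window's raw events; a Counter stands in here for the Spark/SQL job over cold storage:

```python
from collections import Counter

def exact_counts(raw_events: list[dict]) -> Counter:
    """Exact GROUP BY article_id, COUNT(*) over one window's raw events.

    The result replaces the approximate (CMS) 24-hour counts wholesale,
    since it is authoritative for that window.
    """
    return Counter(event["article_id"] for event in raw_events)

# Illustrative raw events replayed from cold storage:
events = [{"article_id": "abc123"}] * 3 + [{"article_id": "def456"}] * 2
```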
7. Extensions (2 min)
- Weighted sharing: Not all shares are equal. A share by a user with 1M followers is more impactful than a share by a user with 10 followers. Weight share events by the sharer’s follower count or engagement rate.
- Spam/bot detection: Filter out bot-generated share events before counting. Use rate limiting per user, anomaly detection on share velocity (article going from 0 to 10K shares in 1 minute with no corresponding view increase), and CAPTCHAs on share actions.
- Personalized trending: Instead of global top-K, compute per-user or per-region top-K. Use the user’s interest graph (followed categories, past reading history) to re-rank the global top-K. Two-stage: global candidate generation + personalized re-ranking.
- Historical trends: Store top-K snapshots every hour for historical analysis. “What was trending last Tuesday at 3pm?” Enables trend-over-time visualizations and retrospective analysis for editorial teams.
- Real-time notifications: When an article enters the top-K for the first time, push a notification to subscribed editorial dashboards and content recommendation pipelines. Use a change-detection layer that compares consecutive top-K snapshots.
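The change-detection comparison reduces to a set difference over consecutive snapshots; the function name below is illustrative:

```python
def new_entrants(prev_topk: list[str], curr_topk: list[str]) -> set[str]:
    """Articles present in the current top-K snapshot but not the previous one.

    The notification layer fires once per returned id; it would also keep
    dedup state so an article bouncing in and out does not re-notify.
    """
    return set(curr_topk) - set(prev_topk)
```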