1. Requirements & Scope (5 min)
Functional Requirements
- Detect trending topics — Identify hashtags, phrases, and named entities that are experiencing an abnormal surge in tweet volume (velocity over absolute volume)
- Geographic trending — Compute separate trend lists for global, per-country, and per-city levels (~500 geographic regions)
- Personalized trends — Show a mix of global/local trends and topics relevant to the user’s interests (who they follow, what they engage with)
- Trend context — For each trending topic, display explanatory context: “Trending because of [event]”, tweet count, related tweets, and a category (Sports, Politics, Entertainment, etc.)
- Trend timeline — Show how a topic’s volume changed over time (sparkline), when it started trending, and whether it’s rising, peaking, or declining
Non-Functional Requirements
- Availability: 99.99% — the trends sidebar is visible on every Twitter page. Downtime is highly visible.
- Latency: Trend list must load in < 200ms. Trend detection lag: a topic should appear in the trending list within 2-5 minutes of its surge beginning.
- Freshness: Trends update every 1-2 minutes. Stale trends (showing yesterday’s event) severely degrade trust.
- Scale: Twitter processes ~500M tweets/day (6,000 tweets/sec average, 15,000/sec peak). Each tweet may contain 0-5 hashtagged or extractable topics. That’s up to 75,000 topic events/sec at peak.
- Spam resilience: Coordinated bot campaigns must not be able to artificially push a topic to trending. False positives are worse than false negatives (a fake trend damages trust).
2. Estimation (3 min)
Traffic
- Tweets/day: 500M → 6,000 tweets/sec avg, 15,000/sec peak
- Topics extracted per tweet: ~2 (hashtags + NLP-extracted phrases) → 12,000 topic events/sec avg, 30,000/sec peak
- Trend reads: Every Twitter page load fetches trends. 300M DAU × 10 page loads/day = 3B trend reads/day → 35,000 reads/sec avg, 100,000 reads/sec peak
- Geographic granularity: ~500 regions × trend list refresh every 60 seconds = 500 trend computations/minute
Storage
- Real-time counters: Sliding window counts for ~5M unique topics across 500 regions
- Each topic-region pair: topic_hash (8B) + count (8B) + window metadata (32B) ≈ 48B
- 5M topics × 500 regions × 48B = 120 GB — fits in memory (Redis cluster)
- In practice, most topics only trend in 1-3 regions, so effective storage is much less
- Historical trends: Archive of what trended, when, and why
- 500 regions × 30 trends × 1440 minutes/day × 365 days = 7.9B records/year
- Each record: ~200B → 1.6 TB/year — manageable in a columnar store (ClickHouse)
- Count-Min Sketch: For approximate frequency counting of all topics
- 5 hash functions × 1M counters × 4 bytes = 20 MB per sketch per region
- 500 regions × 20 MB = 10 GB total — trivially small
Compute
- Topic extraction (NLP): 15,000 tweets/sec × ~1ms per tweet = 15 CPU-seconds/sec → 15 cores dedicated to NLP extraction
- Trend scoring: 500 regions × every 60 seconds, score 10K candidate topics per region = 500 × 10K / 60 ≈ 83K scoring operations/sec — lightweight math, < 1 core
3. API Design (3 min)
Trending Topics API
// Get trending topics for a user (personalized + geographic)
GET /api/v1/trends
Params:
user_id (optional) // for personalized trends
woeid (optional) // "Where On Earth ID" — geographic region. 1 = global
count (default 30) // number of trends to return
→ 200 {
"as_of": "2024-03-15T14:32:00Z",
"location": { "name": "San Francisco", "woeid": 2487956 },
"trends": [
{
"name": "#SuperBowl",
"query": "%23SuperBowl",
"url": "https://twitter.com/search?q=%23SuperBowl",
"tweet_volume_24h": 2340000,
"tweet_volume_1h": 185000,
"category": "Sports",
"context": "The Super Bowl is happening tonight at Allegiant Stadium",
"started_trending_at": "2024-03-15T10:00:00Z",
"trend_type": "breaking", // breaking | sustained | recurring
"sparkline": [12, 15, 45, 120, 185], // hourly volumes (last 5h)
"promoted": false
},
...
]
}
// Get available trending locations
GET /api/v1/trends/available
→ 200 { "locations": [{ "name": "Worldwide", "woeid": 1 }, ...] }
// Get trend details (for "Why is X trending" page)
GET /api/v1/trends/{topic}/details
Params: woeid
→ 200 {
"topic": "#SuperBowl",
"summary": "The Super Bowl LVIII between...",
"related_topics": ["#Chiefs", "#49ers", "#HalftimeShow"],
"top_tweets": [...],
"volume_timeseries": [...], // minute-by-minute for last 24h
"demographics": { "top_geos": [...], "top_age_groups": [...] }
}
Internal APIs
// Topic ingestion (called by tweet processing pipeline)
POST /internal/topics/ingest
Body: {
"tweet_id": "...",
"topics": ["#SuperBowl", "Chiefs", "halftime show"],
"user_id": "...",
"geo": { "country": "US", "city": "San Francisco" },
"timestamp": "..."
}
// Admin: suppress a trend (safety/trust intervention)
POST /internal/trends/suppress
Body: { "topic": "#ScamCoin", "reason": "coordinated_inauthentic", "duration_hours": 24 }
Key Decisions
- WOEID (Where On Earth ID): Using Yahoo’s WOEID system for geographic identifiers — it’s a well-established standard for location hierarchy (city → state → country → world).
- Sparkline data inline: Including a small array of recent hourly volumes allows the client to render a trend line without a separate API call.
- Trend type classification: Distinguishing “breaking” (sudden spike), “sustained” (high volume over hours), and “recurring” (regularly trends at this time, like #MondayMotivation) helps the UI prioritize display.
4. Data Model (3 min)
Topic Counts — Sliding Window (Redis)
| Key Pattern | Type | Notes |
|---|---|---|
| topic:{hash}:window:{region}:{minute_bucket} | INT | Tweet count in this 1-minute bucket |
| topic:{hash}:meta | Hash | {name, category, first_seen, is_hashtag} |
| trend:list:{woeid} | Sorted Set | Current trends, scored by trend_score. Top 30 = the trend list |
| trend:suppress:{topic_hash} | String | TTL-based suppression flag |
Sliding Window Design:
- Each topic-region pair has 60 minute-buckets (1-hour window) plus 24 hour-buckets (24-hour window)
- To compute “volume in the last hour”: SUM the last 60 minute-buckets
- Old buckets expire automatically via Redis TTL
- This gives us both short-term velocity (for trend detection) and longer-term volume (for display)
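To make the bucket scheme concrete, here is a minimal in-memory sketch of the minute-bucket logic. In production this would be Redis `INCR` on the `topic:{hash}:window:{region}:{minute_bucket}` keys with a TTL slightly longer than the largest window; the class and method names here are illustrative only.

```python
import time
from collections import defaultdict

class MinuteBucketCounter:
    """In-memory stand-in for the Redis minute-bucket scheme: one counter
    per (topic, region, minute), summed on demand for a sliding window."""

    def __init__(self):
        self.buckets = defaultdict(int)  # (topic, region, minute) -> count

    def increment(self, topic, region, ts=None):
        minute = int((ts or time.time()) // 60)
        self.buckets[(topic, region, minute)] += 1

    def volume_last_n_minutes(self, topic, region, n, ts=None):
        """Equivalent of summing the last n minute-buckets in Redis."""
        now = int((ts or time.time()) // 60)
        return sum(self.buckets.get((topic, region, now - i), 0)
                   for i in range(n))

counter = MinuteBucketCounter()
t0 = 1_700_000_000  # fixed timestamp keeps the example deterministic
for offset in (0, 30, 90, 4000):  # three tweets within an hour, one much later
    counter.increment("#SuperBowl", 1, ts=t0 + offset)

print(counter.volume_last_n_minutes("#SuperBowl", 1, 60, ts=t0 + 90))  # 3
```

Redis would handle expiry of old buckets via TTL; here they simply fall outside the summed range.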
Count-Min Sketch (In-Memory, per Region)
| Field | Type | Notes |
|---|---|---|
| sketch | 5 × 1M array of uint32 | Approximate frequency for all topics |
| time_window | enum | 5min, 15min, 1hr — separate sketches per window |
| region_id | INT | WOEID |
Why Count-Min Sketch?
- We can’t maintain exact counters for every possible n-gram in every tweet in every region
- CMS gives us O(1) increment and query with bounded overcount error
- Space: 20 MB per region vs potentially unbounded exact counter storage
- Error bound: with width w = 1M and depth d = 5, the estimate exceeds the true count by more than ε·N (ε = e/w ≈ 2.7 × 10⁻⁶ of total events N) with probability at most e⁻ᵈ ≈ 0.7%
Trend History (ClickHouse — Columnar OLAP)
| Column | Type | Notes |
|---|---|---|
| topic | String | The trending topic text |
| topic_hash | UInt64 | For efficient joins |
| woeid | UInt32 | Geographic region |
| trended_at | DateTime | When it entered the trending list |
| left_trending_at | DateTime | When it left |
| peak_volume_1h | UInt32 | Max hourly volume during the trend |
| trend_type | Enum8 | breaking, sustained, recurring |
| category | String | Sports, Politics, Entertainment, etc. |
| suppressed | Bool | Was it manually suppressed? |
| suppression_reason | String | If suppressed, why |
User Interest Profile (for Personalized Trends)
| Column | Type | Notes |
|---|---|---|
| user_id | BIGINT (PK) | |
| topic_affinities | JSONB | {"Sports": 0.8, "Politics": 0.3, "Tech": 0.9, ...} |
| followed_topics | SET | Explicit topic follows |
| geo_home | INT | WOEID of home location |
| last_updated | TIMESTAMP | |
5. High-Level Design (12 min)
Architecture Overview
┌────────────────────────────────────────────────────────────────────────┐
│ Tweet Ingestion Pipeline │
│ │
│ Tweet → Kafka Topic: "tweets" │
│ │ │
│ ├──→ Topic Extraction Service (NLP + hashtag parsing) │
│ │ │ │
│ │ ▼ │
│ │ Kafka Topic: "topic-events" │
│ │ │ │
│ │ ├──→ Counter Service (updates sliding window counts) │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ Redis Cluster (per-region minute-bucket counts) │
│ │ │ Count-Min Sketches (in-memory) │
│ │ │ │
│ │ └──→ Spam Filter Service (bot detection pipeline) │
│ │ │ │
│ │ ▼ │
│ │ Flagged topic-events removed or downweighted │
│ │ │
└────────────┼───────────────────────────────────────────────────────────┘
│
│ Every 60 seconds:
▼
┌────────────────────────────────────────────────────────────────────────┐
│ Trend Scoring Pipeline │
│ │
│ Trend Scorer (per-region, runs every 60s): │
│ 1. Read sliding window counts for top-N candidate topics │
│ 2. Compute velocity score (current window vs baseline) │
│ 3. Apply spam penalty, category bonus, freshness boost │
│ 4. Rank topics → top 30 = trending list for this region │
│ 5. Write to Redis sorted set: trend:list:{woeid} │
│ 6. Write to ClickHouse for historical archive │
│ │
│ Trend Contextualization: │
│ - For each new trend, fetch representative tweets │
│ - Run LLM summarization: "Why is X trending?" │
│ - Classify category (Sports, Politics, etc.) │
│ - Store context in Redis with TTL │
│ │
└────────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────┐
│ Trend Serving Layer │
│ │
│ ┌────────────────┐ ┌─────────────────┐ ┌───────────────────┐ │
│ │ Trend API │ │ Personalization │ │ CDN / Edge Cache │ │
│ │ Service │ ←── │ Service │ │ (global trends │ │
│ │ (reads Redis │ │ (blends global + │ │ cached 60s) │ │
│ │ sorted sets) │ │ local + personal│ │ │ │
│ └────────────────┘ │ trends) │ └───────────────────┘ │
│ └─────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
Component Breakdown
1. Topic Extraction Service
- Consumes the raw tweet stream from Kafka
- Extracts: explicit hashtags, @mentions (for person-trending), and NLP-extracted noun phrases / named entities
- NLP pipeline: tokenize → NER (spaCy / custom model) → phrase normalization (lowercase, lemmatize, merge variants like “Super Bowl” / “SuperBowl” / “#superbowl”)
- Output: {tweet_id, user_id, topics[], geo, timestamp} → published to the topic-events Kafka topic
- Throughput: 15K tweets/sec × ~1ms NLP ≈ 15 cores. Horizontally scalable Kafka consumer group.
2. Counter Service
- Consumes topic-events from Kafka
- For each topic-event, increments:
  - Redis: topic:{hash}:window:{region}:{minute_bucket} for the appropriate regions (city, country, global)
  - The in-memory Count-Min Sketch for the current time window
- Also maintains a “candidate set” — topics whose recent velocity exceeds a threshold → these are the candidates evaluated by the Trend Scorer
3. Spam Filter Service
- Runs in parallel with the Counter Service on the same topic-events stream
- Checks each tweet/topic against bot detection signals:
- User account age < 7 days and tweeting about a trending topic → suspicious
- Burst of identical tweets from different accounts within seconds → coordinated
- Known bot network graph clusters (pre-computed offline)
- Output: a “spam score” per topic-event. Events with high spam score are either dropped or downweighted in counters.
4. Trend Scorer (Cron Job, every 60 seconds)
- For each of the 500 regions, scores all candidate topics and selects the top 30
- This is the core algorithm — detailed in Deep Dive 1
5. Personalization Service
- When a user requests trends, blends three sources:
- Global/local trends (from Redis sorted set for the user’s WOEID)
- Personal topics (topics that are surging AND match the user’s interest profile)
- Followed topics (topics the user explicitly follows that are active)
- Blending ratio: typically 60% geographic, 25% personalized, 15% followed topics
- Ensures diversity: no more than 3 trends from the same category
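A sketch of how the 60/25/15 blend with a per-category cap might compose the final list. The function name, field names, and rounding behavior are assumptions for illustration, not the production logic:

```python
def blend_trends(geo, personal, followed, count=30, max_per_category=3):
    """Blend three trend sources per the 60/25/15 ratio, deduplicating
    and capping each category at max_per_category. Each trend is a dict
    with at least 'name' and 'category' keys (illustrative schema)."""
    quotas = [(geo, round(count * 0.60)),
              (personal, round(count * 0.25)),
              (followed, round(count * 0.15))]
    result, seen, per_cat = [], set(), {}
    for source, quota in quotas:
        taken = 0
        for trend in source:
            if taken >= quota or len(result) >= count:
                break
            cat = trend["category"]
            if trend["name"] in seen or per_cat.get(cat, 0) >= max_per_category:
                continue  # skip duplicates and over-represented categories
            result.append(trend)
            seen.add(trend["name"])
            per_cat[cat] = per_cat.get(cat, 0) + 1
            taken += 1
    return result
```

Because each source list is already ranked, taking items in order preserves score ordering within each slice.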
6. CDN / Edge Cache
- Global and per-country trend lists are cacheable (same for all users in that region)
- Cache with 60-second TTL at the CDN edge → serves 90%+ of trend reads without hitting the API
- Personalized trends are not cacheable (per-user) → served from the Trend API directly
Request Flow: User Opens Twitter and Sees Trends
Client → CDN: GET /api/v1/trends?woeid=2487956&user_id=u_123
CDN: cache miss (personalized request, not cacheable)
CDN → Load Balancer → Trend API Service:
1. Fetch geographic trends: Redis ZREVRANGE trend:list:2487956 0 29
→ Top 30 trends for San Francisco with scores
2. Fetch user profile: user_interest_cache:{u_123} → {Sports: 0.8, Tech: 0.9}
3. Personalization: boost Sports/Tech trends, inject 5-8 personalized topics
4. For each trend: fetch context from Redis (trend:{topic}:context)
5. Assemble response with sparklines, categories, context
6. Return to client (< 50ms from API, < 200ms end-to-end)
6. Deep Dives (15 min)
Deep Dive 1: Trend Scoring Algorithm — Velocity over Volume
The Problem: Absolute tweet volume is a poor signal for “trending.” Justin Bieber gets 500K tweets/day every day — that’s not trending, that’s baseline. A local earthquake that goes from 0 to 50K tweets in 10 minutes IS trending, despite lower absolute volume.
Solution: Velocity-Based Scoring with Time Decay
The core insight: a topic is trending when its current velocity significantly exceeds its expected baseline velocity.
Step 1: Compute current velocity
v_current = tweet_count(topic, region, last_5_minutes) / 5
Step 2: Compute baseline velocity
v_baseline = exponential_moving_average(
tweet_count(topic, region, per_5min_window),
alpha = 0.1, // slow-moving EMA; effective memory ≈ 1/alpha = 10 five-minute windows (~50 min)
windows = last_48_hours
)
Step 3: Compute trend score
raw_score = (v_current - v_baseline) / max(v_baseline, min_baseline)
- min_baseline prevents division by zero for brand-new topics (set to 1.0)
- This is essentially a z-score-like ratio: how far above its baseline is the current velocity, measured in multiples of that baseline?
Step 4: Apply modifiers
trend_score = raw_score
× freshness_boost(time_since_started_trending) // newer trends score higher
× spam_penalty(spam_score) // 0.0 to 1.0
× diversity_penalty(category_count_in_top_30) // penalize if too many from same category
× volume_floor(v_current) // must have minimum absolute volume to qualify
Freshness boost: freshness = max(0, 1 - (minutes_trending / 360)) — a trend that’s been trending for 6+ hours gets no freshness boost. This ensures rotation and prevents stale trends.
Volume floor: Even with high velocity, a topic with only 10 tweets shouldn’t trend. Minimum thresholds:
- Global: 10,000 tweets in last hour
- Country: 1,000 tweets in last hour
- City: 100 tweets in last hour
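Putting steps 1–4 together, a hedged Python sketch of the scorer. The exact modifier shapes are not pinned down above, so this models the freshness boost as a 1+freshness multiplier and the diversity/volume modifiers as simple cutoffs — assumptions for illustration, not the production formulas:

```python
def ema(values, alpha=0.1):
    """Exponential moving average of per-window counts (oldest first)."""
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg

def trend_score(window_counts, minutes_trending, spam_score,
                same_category_in_top30, hourly_volume, volume_floor,
                min_baseline=1.0):
    """Steps 1-4: velocity vs EMA baseline, then modifiers."""
    v_current = window_counts[-1] / 5                       # tweets/min, last 5-min window
    v_baseline = max(ema(window_counts[:-1]) / 5, min_baseline)
    raw = (v_current - v_baseline) / v_baseline             # step 3
    freshness_boost = 1 + max(0.0, 1 - minutes_trending / 360)
    spam_penalty = 1.0 - spam_score                         # spam_score in [0, 1]
    diversity_penalty = 0.5 if same_category_in_top30 >= 3 else 1.0
    volume_ok = 1.0 if hourly_volume >= volume_floor else 0.0
    return raw * freshness_boost * spam_penalty * diversity_penalty * volume_ok

# The #Earthquake example: ~5 tweets per 5-min window baseline, then 8,000
score = trend_score([5] * 48 + [8000], minutes_trending=0, spam_score=0.0,
                    same_category_in_top30=0, hourly_volume=8000,
                    volume_floor=1000)
print(score)  # raw score of 1,599 (matching the worked example), doubled by freshness
```

Note that the raw score of 1,599 reproduces the worked example below: a steady baseline of 5 collapses to v_baseline = 1 tweet/min.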
Breaking vs Sustained vs Recurring:
- Breaking: v_current > 10 × v_baseline AND started_trending < 30 min ago
- Sustained: v_current > 3 × v_baseline AND trending for > 2 hours
- Recurring: Topic has trended at this time-of-week in 3+ of the last 8 weeks (detected from ClickHouse history)
Example:
- Topic: “#Earthquake”
- Baseline: 5 tweets/5min (people always tweet about earthquakes)
- Current: 8,000 tweets/5min (earthquake just happened in LA)
- Raw score: (8000 - 5) / 5 = 1,599 — extremely high, instant #1 trend
- Classification: Breaking (v_current = 8,000 ≫ 10 × v_baseline = 50, and it started trending < 30 min ago)
Deep Dive 2: Spam & Bot Filtering for Trend Integrity
The Problem: Coordinated bot networks regularly attempt to push topics to trending for political manipulation, crypto pump-and-dump schemes, and astroturfing campaigns. A single successful fake trend damages user trust and can cause real-world harm.
Multi-Layer Defense:
Layer 1: Account-Level Signals (Pre-filter) Each tweet gets an “account trust score” based on:
- Account age (< 30 days = low trust)
- Follower/following ratio (following thousands of accounts while having almost no followers = bot indicator)
- Profile completeness (no bio, no avatar, default name = low trust)
- Historical behavior (ratio of original tweets to retweets, tweet frequency distribution)
- Graph cluster: does this account belong to a known bot network cluster? (computed offline with graph community detection)
Tweets from low-trust accounts are downweighted (not removed) in trend counters:
effective_count = sum(trust_weight(account) for account in tweets_about_topic)
// trust_weight: 1.0 for normal accounts, 0.1 for low-trust, 0.0 for known bots
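A runnable version of the trust-weighted count above. The three tiers and their weights mirror the comment in the pseudocode; the tier names themselves are illustrative:

```python
# Trust weights per the tiers described above (tier names are illustrative)
TRUST_WEIGHTS = {"normal": 1.0, "low_trust": 0.1, "known_bot": 0.0}

def effective_count(account_tiers):
    """Weighted topic count: each tweeting account contributes its trust
    weight instead of a flat 1, so bot swarms barely move the counter."""
    return sum(TRUST_WEIGHTS[tier] for tier in account_tiers)

# 100 bot tweets + 20 low-trust + 10 normal accounts:
print(effective_count(["known_bot"] * 100 + ["low_trust"] * 20 + ["normal"] * 10))
```

One hundred bot tweets contribute nothing, while ten real users dominate the effective count (≈ 12.0).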
Layer 2: Behavioral Anomaly Detection (Real-Time)
- Burst detection: If 80%+ of tweets about a topic in the last 5 minutes come from accounts created in the last 48 hours → flag topic as suspicious
- Content similarity: If > 60% of tweets about a topic are near-identical (cosine similarity > 0.9 on TF-IDF vectors) → coordinated campaign
- Temporal pattern: Natural trends have a smooth growth curve; bot campaigns show a step function (0 → 10K instantly). Fit both shapes to the volume curve; if the step function fits better (higher R²) than a sigmoid → suspicious.
- Geographic anomaly: A topic trending “worldwide” but 90% of tweets come from a single country with VPN-diverse IPs → suspicious
Layer 3: Human Review (Post-Detection)
- Topics flagged by layers 1-2 are routed to a Trust & Safety review queue
- Reviewers can: suppress the trend, label it as spam, or clear it as legitimate
- Suppressed trends get a TTL-based block in Redis (trend:suppress:{topic_hash})
- Feedback loop: reviewer decisions train the ML models for layers 1-2
Layer 4: Proactive Blocking
- Maintain a blocklist of known spam topics (crypto scam tokens, repeated harassment campaigns)
- These topics are pre-suppressed before they can ever appear in trends
- Updated by both automated detection and manual curation
Metrics:
- False positive rate target: < 0.1% (legitimate trends incorrectly suppressed)
- False negative rate target: < 5% (spam trends that slip through)
- Detection latency: < 2 minutes from bot campaign start to suppression
Deep Dive 3: Count-Min Sketch for Scalable Approximate Counting
The Problem: We need to count the frequency of millions of unique topics (hashtags, phrases, n-grams) across 500 regions in real time. Exact counters for every possible topic-region pair would require unbounded memory.
Solution: Count-Min Sketch (CMS)
A CMS is a probabilistic data structure that provides approximate frequency counts with a guaranteed upper bound on overcount error and zero undercount.
Structure:
d = 5 hash functions (rows)
w = 1,000,000 counters per row (columns)
CMS[d][w] — a 5 × 1M matrix of uint32 counters
Increment(topic):
for i in 0..d:
j = hash_i(topic) % w
CMS[i][j] += 1
Query(topic):
return min(CMS[i][hash_i(topic) % w] for i in 0..d)
Why min? Hash collisions cause overcounting. Taking the minimum across all d hash functions minimizes the collision impact. The true count is always ≤ the CMS estimate.
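A minimal CMS implementation matching the pseudocode above. Salted SHA-1 stands in for the d independent hash functions; a production system would use faster non-cryptographic hashes (e.g. MurmurHash), but the logic is identical:

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch with d rows and w counters per row.
    Increment adds 1 to one cell per row; query takes the min across rows,
    so estimates can only overcount, never undercount."""

    def __init__(self, depth=5, width=1_000_000):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salting the hash with the row number simulates d independent hashes
        digest = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def increment(self, item, by=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += by

    def query(self, item):
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch(depth=5, width=4096)  # small sketch for the demo
for _ in range(100):
    cms.increment("#SuperBowl")
cms.increment("#Chiefs", by=7)
print(cms.query("#SuperBowl"))  # >= 100; never undercounts
```

With only two items in a 4,096-wide sketch, collisions are essentially nil, so the estimates should equal the true counts.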
Error Bounds:
- With w = 1M, d = 5:
- ε (error margin) = e/w = 2.72/1M ≈ 0.00027% of total count
- δ (failure probability) = e^(-d) = e^(-5) ≈ 0.67%
- Meaning: with ≥ 99.33% probability, the estimate overcounts by no more than 0.00027% of the total number of events counted (and it never undercounts)
Why CMS over exact counters?
- Space: 5 × 1M × 4B = 20 MB per region per time window
- vs exact: potentially millions of unique topics × 8B each = hundreds of MB, growing unboundedly
- CMS is O(d) for both increment and query — constant time regardless of unique topic count
Integration with Trend Detection:
For each time window (5min, 15min, 1hr), maintain a separate CMS per region.
Candidate selection:
- Maintain an exact "top-K heavy hitters" structure alongside the CMS
- When a topic's CMS count crosses a threshold → promote to exact counter
- Only promoted topics become trend candidates
This hybrid approach gives us:
- CMS: space-efficient counting of the long tail (millions of low-frequency topics)
- Exact counters: precise counts for the ~10K topics that matter (trend candidates)
Sliding Window with CMS:
- Maintain 12 CMS instances per region: one for each 5-minute bucket in the last hour
- “Volume in the last hour” = sum of all 12 sketches (CMS supports merging via element-wise addition)
- Every 5 minutes, discard the oldest sketch and create a new empty one
- This gives us a sliding window with 5-minute granularity using only 12 × 20 MB = 240 MB per region
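The ring-of-sketches mechanics can be sketched as follows, with a `Counter` standing in for each 20 MB per-bucket sketch (a real CMS merges exactly the same way — by element-wise addition — so the structure is unchanged):

```python
from collections import Counter, deque

class SlidingWindowCounts:
    """Ring of 12 five-minute buckets. Counter is a stand-in for the
    per-bucket Count-Min Sketch; rotating the ring drops the oldest
    bucket, and the hourly total is the merge (sum) of all buckets."""

    def __init__(self, buckets=12):
        self.ring = deque((Counter() for _ in range(buckets)), maxlen=buckets)

    def increment(self, topic):
        self.ring[-1][topic] += 1    # write into the current 5-minute bucket

    def rotate(self):
        self.ring.append(Counter())  # every 5 min: drop oldest, open new bucket

    def hourly_volume(self, topic):
        return sum(bucket[topic] for bucket in self.ring)

win = SlidingWindowCounts()
for _ in range(10):
    win.increment("#Earthquake")
win.rotate()                         # five minutes pass
for _ in range(3):
    win.increment("#Earthquake")
print(win.hourly_volume("#Earthquake"))  # 13
```

After twelve rotations the first bucket's 10 tweets fall out of the window, leaving only the most recent counts.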
Space-Saving Alternative: For the heavy hitters detection, we use the Space-Saving algorithm alongside CMS:
- Maintains exact counts for the top K items (K = 10,000)
- When a new item arrives that isn’t tracked, evict the item with the lowest count and replace it
- Guaranteed to find all items with frequency > N/K (where N is total events)
- This feeds the candidate set to the Trend Scorer
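A compact sketch of the Space-Saving algorithm as described. This dict-based version is the simplest correct form; a production implementation would use a min-heap or the stream-summary structure for O(1) eviction:

```python
def space_saving(stream, k):
    """Space-Saving heavy hitters: keep at most k counters. When a new item
    arrives and all k slots are full, evict the minimum-count item and let
    the newcomer inherit its count (the source of bounded overestimation).
    Guaranteed to retain every item with true frequency > len(stream)/k."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

stream = ["#SuperBowl"] * 50 + ["#Chiefs"] * 30 + ["x1", "x2", "x3", "x4"]
print(space_saving(stream, k=3))  # heavy hitters survive; tail items churn
```

The two genuinely frequent topics keep exact counts, while the long-tail items repeatedly evict each other in the last slot.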
7. Extensions (2 min)
- “Why is X trending?” with LLM summarization: For each new trending topic, sample 50 representative tweets, feed them to an LLM (with safety guardrails), and generate a 1-2 sentence explanation. Cache the summary with a 10-minute TTL. This transforms a raw hashtag into actionable context for users.
- Predictive trending: Use historical patterns and early signals (velocity of velocity — acceleration) to predict topics that will trend in the next 30-60 minutes. Surface these as “rising” topics before they hit the main trending list. Useful for newsrooms and journalists.
- Cross-platform trend correlation: Correlate Twitter trends with signals from other platforms (Reddit posts, Google Search spikes, news article publications). This validates trend legitimacy (a topic trending simultaneously across platforms is almost certainly real, not bot-driven) and enriches context.
- Advertiser trend targeting: Allow advertisers to bid on trending topics in real-time — “when #SuperBowl is trending, show my ad.” Requires a real-time auction integration that evaluates trend status at ad-serving time. Massive revenue opportunity (trending moments have the highest engagement).
- Trend notifications: Push-notify users when a topic they follow or a topic relevant to their interests starts trending. Must be carefully rate-limited (max 3 trend notifications/day) to avoid notification fatigue. Use the breaking/sustained/recurring classification to only notify on breaking trends.