1. Requirements & Scope (5 min)
Functional Requirements
- Detect trending topics — Identify hashtags, phrases, and named entities that are experiencing an abnormal surge in tweet volume (velocity over absolute volume)
- Geographic trending — Compute separate trend lists for global, per-country, and per-city levels (~500 geographic regions)
- Personalized trends — Show a mix of global/local trends and topics relevant to the user’s interests (who they follow, what they engage with)
- Trend context — For each trending topic, display explanatory context: “Trending because of [event]”, tweet count, related tweets, and a category (Sports, Politics, Entertainment, etc.)
- Trend timeline — Show how a topic’s volume changed over time (sparkline), when it started trending, and whether it’s rising, peaking, or declining
Non-Functional Requirements
- Availability: 99.99% — the trends sidebar is visible on every Twitter page. Downtime is highly visible.
- Latency: Trend list must load in < 200ms. Trend detection lag: a topic should appear in the trending list within 2-5 minutes of its surge beginning.
- Freshness: Trends update every 1-2 minutes. Stale trends (showing yesterday’s event) severely degrade trust.
- Scale: Twitter processes ~500M tweets/day (6,000 tweets/sec average, 15,000/sec peak). Each tweet may contain 0-5 hashtagged or extractable topics. That’s up to 75,000 topic events/sec at peak.
- Spam resilience: Coordinated bot campaigns must not be able to artificially push a topic to trending. False positives are worse than false negatives (a fake trend damages trust).
2. Estimation (3 min)
Traffic
- Tweets/day: 500M → 6,000 tweets/sec avg, 15,000/sec peak
- Topics extracted per tweet: ~2 (hashtags + NLP-extracted phrases) → 12,000 topic events/sec avg, 30,000/sec peak
- Trend reads: Every Twitter page load fetches trends. 300M DAU × 10 page loads/day = 3B trend reads/day → 35,000 reads/sec avg, 100,000 reads/sec peak
- Geographic granularity: ~500 regions × trend list refresh every 60 seconds = 500 trend computations/minute
Storage
- Real-time counters: Sliding window counts for ~5M unique topics across 500 regions
- Each topic-region pair: topic_hash (8B) + count (8B) + window metadata (32B) ≈ 48B
- 5M topics × 500 regions × 48B = 120 GB — fits in memory (Redis cluster)
- In practice, most topics only trend in 1-3 regions, so effective storage is much less
- Historical trends: Archive of what trended, when, and why
- 500 regions × 30 trends × 1440 minutes/day × 365 days = 7.9B records/year
- Each record: ~200B → 1.6 TB/year — manageable in a columnar store (ClickHouse)
- Count-Min Sketch: For approximate frequency counting of all topics
- 5 hash functions × 1M counters × 4 bytes = 20 MB per sketch per region
- 500 regions × 20 MB = 10 GB total — trivially small
Compute
- Topic extraction (NLP): 15,000 tweets/sec × ~1ms per tweet = 15 CPU-seconds/sec → 15 cores dedicated to NLP extraction
- Trend scoring: 500 regions × every 60 seconds, score 10K candidate topics per region = 500 × 10K / 60 ≈ 83K scoring operations/sec — lightweight math, < 1 core
3. API Design (3 min)
Trending Topics API
// Get trending topics for a user (personalized + geographic)
GET /api/v1/trends
Params:
user_id (optional) // for personalized trends
woeid (optional) // "Where On Earth ID" — geographic region. 1 = global
count (default 30) // number of trends to return
→ 200 {
"as_of": "2024-03-15T14:32:00Z",
"location": { "name": "San Francisco", "woeid": 2487956 },
"trends": [
{
"name": "#SuperBowl",
"query": "%23SuperBowl",
"url": "https://twitter.com/search?q=%23SuperBowl",
"tweet_volume_24h": 2340000,
"tweet_volume_1h": 185000,
"category": "Sports",
"context": "The Super Bowl is happening tonight at Allegiant Stadium",
"started_trending_at": "2024-03-15T10:00:00Z",
"trend_type": "breaking", // breaking | sustained | recurring
"sparkline": [12, 15, 45, 120, 185], // hourly volumes (last 5h)
"promoted": false
},
...
]
}
// Get available trending locations
GET /api/v1/trends/available
→ 200 { "locations": [{ "name": "Worldwide", "woeid": 1 }, ...] }
// Get trend details (for "Why is X trending" page)
GET /api/v1/trends/{topic}/details
Params: woeid
→ 200 {
"topic": "#SuperBowl",
"summary": "The Super Bowl LVIII between...",
"related_topics": ["#Chiefs", "#49ers", "#HalftimeShow"],
"top_tweets": [...],
"volume_timeseries": [...], // minute-by-minute for last 24h
"demographics": { "top_geos": [...], "top_age_groups": [...] }
}
Internal APIs
// Topic ingestion (called by tweet processing pipeline)
POST /internal/topics/ingest
Body: {
"tweet_id": "...",
"topics": ["#SuperBowl", "Chiefs", "halftime show"],
"user_id": "...",
"geo": { "country": "US", "city": "San Francisco" },
"timestamp": "..."
}
// Admin: suppress a trend (safety/trust intervention)
POST /internal/trends/suppress
Body: { "topic": "#ScamCoin", "reason": "coordinated_inauthentic", "duration_hours": 24 }
Key Decisions
- WOEID (Where On Earth ID): Using Yahoo’s WOEID system for geographic identifiers — it’s a well-established standard for location hierarchy (city → state → country → world).
- Sparkline data inline: Including a small array of recent hourly volumes allows the client to render a trend line without a separate API call.
- Trend type classification: Distinguishing “breaking” (sudden spike), “sustained” (high volume over hours), and “recurring” (regularly trends at this time, like #MondayMotivation) helps the UI prioritize display.
4. Data Model (3 min)
Topic Counts — Sliding Window (Redis)
| Key Pattern | Type | Notes |
|---|---|---|
| topic:{hash}:window:{region}:{minute_bucket} | INT | Tweet count in this 1-minute bucket |
| topic:{hash}:meta | Hash | {name, category, first_seen, is_hashtag} |
| trend:list:{woeid} | Sorted Set | Current trends, scored by trend_score. Top 30 = the trend list |
| trend:suppress:{topic_hash} | String | TTL-based suppression flag |
Sliding Window Design:
- Each topic-region pair has 60 minute-buckets (1-hour window) plus 24 hour-buckets (24-hour window)
- To compute “volume in the last hour”: SUM the last 60 minute-buckets
- Old buckets expire automatically via Redis TTL
- This gives us both short-term velocity (for trend detection) and longer-term volume (for display)
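To make the bucket scheme concrete, here is a minimal in-memory sketch of the minute-bucket logic. In production this would be Redis `INCR` on the `topic:{hash}:window:{region}:{minute_bucket}` keys with a TTL slightly longer than the largest window; the class and method names here are illustrative only.

```python
import time
from collections import defaultdict

class MinuteBucketCounter:
    """In-memory stand-in for the Redis minute-bucket scheme: one counter
    per (topic, region, minute), summed on demand for a sliding window."""

    def __init__(self):
        self.buckets = defaultdict(int)  # (topic, region, minute) -> count

    def increment(self, topic, region, ts=None):
        minute = int((ts or time.time()) // 60)
        self.buckets[(topic, region, minute)] += 1

    def volume_last_n_minutes(self, topic, region, n, ts=None):
        """Equivalent of summing the last n minute-buckets in Redis."""
        now = int((ts or time.time()) // 60)
        return sum(self.buckets.get((topic, region, now - i), 0)
                   for i in range(n))

counter = MinuteBucketCounter()
t0 = 1_700_000_000  # fixed timestamp keeps the example deterministic
for offset in (0, 30, 90, 4000):  # three tweets within an hour, one much later
    counter.increment("#SuperBowl", 1, ts=t0 + offset)

print(counter.volume_last_n_minutes("#SuperBowl", 1, 60, ts=t0 + 90))  # 3
```

Redis would handle expiry of old buckets via TTL; here they simply fall outside the summed range.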
Count-Min Sketch (In-Memory, per Region)
| Field | Type | Notes |
|---|---|---|
| sketch | 5 × 1M array of uint32 | Approximate frequency for all topics |
| time_window | enum | 5min, 15min, 1hr — separate sketches per window |
| region_id | INT | WOEID |
Why Count-Min Sketch?
- We can’t maintain exact counters for every possible n-gram in every tweet in every region
- CMS gives us O(1) increment and query with bounded overcount error
- Space: 20 MB per region vs potentially unbounded exact counter storage
- Error bound: with width w = 1M and depth d = 5, the estimate exceeds the true count by more than ε·N (ε = e/w ≈ 2.7 × 10⁻⁶ of total events N) with probability at most e⁻ᵈ ≈ 0.7%
Trend History (ClickHouse — Columnar OLAP)
| Column | Type | Notes |
|---|---|---|
| topic | String | The trending topic text |
| topic_hash | UInt64 | For efficient joins |
| woeid | UInt32 | Geographic region |
| trended_at | DateTime | When it entered the trending list |
| left_trending_at | DateTime | When it left |
| peak_volume_1h | UInt32 | Max hourly volume during the trend |
| trend_type | Enum8 | breaking, sustained, recurring |
| category | String | Sports, Politics, Entertainment, etc. |
| suppressed | Bool | Was it manually suppressed? |
| suppression_reason | String | If suppressed, why |
User Interest Profile (for Personalized Trends)
| Column | Type | Notes |
|---|---|---|
| user_id | BIGINT (PK) | |
| topic_affinities | JSONB | {"Sports": 0.8, "Politics": 0.3, "Tech": 0.9, ...} |
| followed_topics | SET | Explicit topic follows |
| geo_home | INT | WOEID of home location |
| last_updated | TIMESTAMP | |
5. High-Level Design (12 min)
Architecture Overview
┌────────────────────────────────────────────────────────────────────────┐
│ Tweet Ingestion Pipeline │
│ │
│ Tweet → Kafka Topic: "tweets" │
│ │ │
│ ├──→ Topic Extraction Service (NLP + hashtag parsing) │
│ │ │ │
│ │ ▼ │
│ │ Kafka Topic: "topic-events" │
│ │ │ │
│ │ ├──→ Counter Service (updates sliding window counts) │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ Redis Cluster (per-region minute-bucket counts) │
│ │ │ Count-Min Sketches (in-memory) │
│ │ │ │
│ │ └──→ Spam Filter Service (bot detection pipeline) │
│ │ │ │
│ │ ▼ │
│ │ Flagged topic-events removed or downweighted │
│ │ │
└────────────┼───────────────────────────────────────────────────────────┘
│
│ Every 60 seconds:
▼
┌────────────────────────────────────────────────────────────────────────┐
│ Trend Scoring Pipeline │
│ │
│ Trend Scorer (per-region, runs every 60s): │
│ 1. Read sliding window counts for top-N candidate topics │
│ 2. Compute velocity score (current window vs baseline) │
│ 3. Apply spam penalty, category bonus, freshness boost │
│ 4. Rank topics → top 30 = trending list for this region │
│ 5. Write to Redis sorted set: trend:list:{woeid} │
│ 6. Write to ClickHouse for historical archive │
│ │
│ Trend Contextualization: │
│ - For each new trend, fetch representative tweets │
│ - Run LLM summarization: "Why is X trending?" │
│ - Classify category (Sports, Politics, etc.) │
│ - Store context in Redis with TTL │
│ │
└────────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────┐
│ Trend Serving Layer │
│ │
│ ┌────────────────┐ ┌─────────────────┐ ┌───────────────────┐ │
│ │ Trend API │ │ Personalization │ │ CDN / Edge Cache │ │
│ │ Service │ ←── │ Service │ │ (global trends │ │
│ │ (reads Redis │ │ (blends global + │ │ cached 60s) │ │
│ │ sorted sets) │ │ local + personal│ │ │ │
│ └────────────────┘ │ trends) │ └───────────────────┘ │
│ └─────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
Component Breakdown
1. Topic Extraction Service
- Consumes the raw tweet stream from Kafka
- Extracts: explicit hashtags, @mentions (for person-trending), and NLP-extracted noun phrases / named entities
- NLP pipeline: tokenize → NER (spaCy / custom model) → phrase normalization (lowercase, lemmatize, merge variants like “Super Bowl” / “SuperBowl” / “#superbowl”)
- Output: {tweet_id, user_id, topics[], geo, timestamp} → published to the topic-events Kafka topic
- Throughput: 15K tweets/sec × ~1ms NLP ≈ 15 cores. Horizontally scalable Kafka consumer group.
2. Counter Service
- Consumes topic-events from Kafka
- For each topic-event, increments:
  - Redis: topic:{hash}:window:{region}:{minute_bucket} for the appropriate regions (city, country, global)
  - The in-memory Count-Min Sketch for the current time window
- Also maintains a “candidate set” — topics whose recent velocity exceeds a threshold → these are the candidates evaluated by the Trend Scorer
3. Spam Filter Service
- Runs in parallel with the Counter Service on the same topic-events stream
- Checks each tweet/topic against bot detection signals:
- User account age < 7 days and tweeting about a trending topic → suspicious
- Burst of identical tweets from different accounts within seconds → coordinated
- Known bot network graph clusters (pre-computed offline)
- Output: a “spam score” per topic-event. Events with high spam score are either dropped or downweighted in counters.
4. Trend Scorer (Cron Job, every 60 seconds)
- For each of the 500 regions, scores all candidate topics and selects the top 30
- This is the core algorithm — detailed in Deep Dive 1
5. Personalization Service
- When a user requests trends, blends three sources:
- Global/local trends (from Redis sorted set for the user’s WOEID)
- Personal topics (topics that are surging AND match the user’s interest profile)
- Followed topics (topics the user explicitly follows that are active)
- Blending ratio: typically 60% geographic, 25% personalized, 15% followed topics
- Ensures diversity: no more than 3 trends from the same category
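A sketch of how the 60/25/15 blend with a per-category cap might compose the final list. The function name, field names, and rounding behavior are assumptions for illustration, not the production logic:

```python
def blend_trends(geo, personal, followed, count=30, max_per_category=3):
    """Blend three trend sources per the 60/25/15 ratio, deduplicating
    and capping each category at max_per_category. Each trend is a dict
    with at least 'name' and 'category' keys (illustrative schema)."""
    quotas = [(geo, round(count * 0.60)),
              (personal, round(count * 0.25)),
              (followed, round(count * 0.15))]
    result, seen, per_cat = [], set(), {}
    for source, quota in quotas:
        taken = 0
        for trend in source:
            if taken >= quota or len(result) >= count:
                break
            cat = trend["category"]
            if trend["name"] in seen or per_cat.get(cat, 0) >= max_per_category:
                continue  # skip duplicates and over-represented categories
            result.append(trend)
            seen.add(trend["name"])
            per_cat[cat] = per_cat.get(cat, 0) + 1
            taken += 1
    return result
```

Because each source list is already ranked, taking items in order preserves score ordering within each slice.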
6. CDN / Edge Cache
- Global and per-country trend lists are cacheable (same for all users in that region)
- Cache with 60-second TTL at the CDN edge → serves 90%+ of trend reads without hitting the API
- Personalized trends are not cacheable (per-user) → served from the Trend API directly
Request Flow: User Opens Twitter and Sees Trends
Client → CDN: GET /api/v1/trends?woeid=2487956&user_id=u_123
CDN: cache miss (personalized request, not cacheable)
CDN → Load Balancer → Trend API Service:
1. Fetch geographic trends: Redis ZREVRANGE trend:list:2487956 0 29
→ Top 30 trends for San Francisco with scores
2. Fetch user profile: user_interest_cache:{u_123} → {Sports: 0.8, Tech: 0.9}
3. Personalization: boost Sports/Tech trends, inject 5-8 personalized topics
4. For each trend: fetch context from Redis (trend:{topic}:context)
5. Assemble response with sparklines, categories, context
6. Return to client (< 50ms from API, < 200ms end-to-end)
6. Deep Dives (15 min)
Deep Dive 1: Trend Scoring Algorithm — Velocity over Volume
The Problem: Absolute tweet volume is a poor signal for “trending.” Justin Bieber gets 500K tweets/day every day — that’s not trending, that’s baseline. A local earthquake that goes from 0 to 50K tweets in 10 minutes IS trending, despite lower absolute volume.
Solution: Velocity-Based Scoring with Time Decay
The core insight: a topic is trending when its current velocity significantly exceeds its expected baseline velocity.
Step 1: Compute current velocity
v_current = tweet_count(topic, region, last_5_minutes) / 5
Step 2: Compute baseline velocity
v_baseline = exponential_moving_average(
tweet_count(topic, region, per_5min_window),
alpha = 0.1, // slow-moving EMA; effective memory ≈ 1/alpha = 10 five-minute windows (~50 min)
windows = last_48_hours
)
Step 3: Compute trend score
raw_score = (v_current - v_baseline) / max(v_baseline, min_baseline)
- min_baseline prevents division by zero for brand-new topics (set to 1.0)
- This is essentially a z-score-like ratio: how far above its baseline is the current velocity, measured in multiples of that baseline?
Step 4: Apply modifiers
trend_score = raw_score
× freshness_boost(time_since_started_trending) // newer trends score higher
× spam_penalty(spam_score) // 0.0 to 1.0
× diversity_penalty(category_count_in_top_30) // penalize if too many from same category
× volume_floor(v_current) // must have minimum absolute volume to qualify
Freshness boost: freshness = max(0, 1 - (minutes_trending / 360)) — a trend that’s been trending for 6+ hours gets no freshness boost. This ensures rotation and prevents stale trends.
Volume floor: Even with high velocity, a topic with only 10 tweets shouldn’t trend. Minimum thresholds:
- Global: 10,000 tweets in last hour
- Country: 1,000 tweets in last hour
- City: 100 tweets in last hour
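Putting steps 1–4 together, a hedged Python sketch of the scorer. The exact modifier shapes are not pinned down above, so this models the freshness boost as a 1+freshness multiplier and the diversity/volume modifiers as simple cutoffs — assumptions for illustration, not the production formulas:

```python
def ema(values, alpha=0.1):
    """Exponential moving average of per-window counts (oldest first)."""
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg

def trend_score(window_counts, minutes_trending, spam_score,
                same_category_in_top30, hourly_volume, volume_floor,
                min_baseline=1.0):
    """Steps 1-4: velocity vs EMA baseline, then modifiers."""
    v_current = window_counts[-1] / 5                       # tweets/min, last 5-min window
    v_baseline = max(ema(window_counts[:-1]) / 5, min_baseline)
    raw = (v_current - v_baseline) / v_baseline             # step 3
    freshness_boost = 1 + max(0.0, 1 - minutes_trending / 360)
    spam_penalty = 1.0 - spam_score                         # spam_score in [0, 1]
    diversity_penalty = 0.5 if same_category_in_top30 >= 3 else 1.0
    volume_ok = 1.0 if hourly_volume >= volume_floor else 0.0
    return raw * freshness_boost * spam_penalty * diversity_penalty * volume_ok

# The #Earthquake example: ~5 tweets per 5-min window baseline, then 8,000
score = trend_score([5] * 48 + [8000], minutes_trending=0, spam_score=0.0,
                    same_category_in_top30=0, hourly_volume=8000,
                    volume_floor=1000)
print(score)  # raw score of 1,599 (matching the worked example), doubled by freshness
```

Note that the raw score of 1,599 reproduces the worked example below: a steady baseline of 5 collapses to v_baseline = 1 tweet/min.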
Breaking vs Sustained vs Recurring:
- Breaking: v_current > 10 × v_baseline AND started_trending < 30 min ago
- Sustained: v_current > 3 × v_baseline AND trending for > 2 hours
- Recurring: Topic has trended at this time-of-week in 3+ of the last 8 weeks (detected from ClickHouse history)
Example:
- Topic: “#Earthquake”
- Baseline: 5 tweets/5min (people always tweet about earthquakes)
- Current: 8,000 tweets/5min (earthquake just happened in LA)
- Raw score: (8000 - 5) / 5 = 1,599 — extremely high, instant #1 trend
- Classification: Breaking (v_current = 8,000 ≫ 10 × v_baseline = 50, and it started trending < 30 min ago)
Deep Dive 2: Spam & Bot Filtering for Trend Integrity
The Problem: Coordinated bot networks regularly attempt to push topics to trending for political manipulation, crypto pump-and-dump schemes, and astroturfing campaigns. A single successful fake trend damages user trust and can cause real-world harm.
Multi-Layer Defense:
Layer 1: Account-Level Signals (Pre-filter) Each tweet gets an “account trust score” based on:
- Account age (< 30 days = low trust)
- Follower/following ratio (following thousands of accounts while having almost no followers = bot indicator)
- Profile completeness (no bio, no avatar, default name = low trust)
- Historical behavior (ratio of original tweets to retweets, tweet frequency distribution)
- Graph cluster: does this account belong to a known bot network cluster? (computed offline with graph community detection)
Tweets from low-trust accounts are downweighted (not removed) in trend counters:
effective_count = sum(trust_weight(account) for account in tweets_about_topic)
// trust_weight: 1.0 for normal accounts, 0.1 for low-trust, 0.0 for known bots
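A runnable version of the trust-weighted count above. The three tiers and their weights mirror the comment in the pseudocode; the tier names themselves are illustrative:

```python
# Trust weights per the tiers described above (tier names are illustrative)
TRUST_WEIGHTS = {"normal": 1.0, "low_trust": 0.1, "known_bot": 0.0}

def effective_count(account_tiers):
    """Weighted topic count: each tweeting account contributes its trust
    weight instead of a flat 1, so bot swarms barely move the counter."""
    return sum(TRUST_WEIGHTS[tier] for tier in account_tiers)

# 100 bot tweets + 20 low-trust + 10 normal accounts:
print(effective_count(["known_bot"] * 100 + ["low_trust"] * 20 + ["normal"] * 10))
```

One hundred bot tweets contribute nothing, while ten real users dominate the effective count (≈ 12.0).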
Layer 2: Behavioral Anomaly Detection (Real-Time)
- Burst detection: If 80%+ of tweets about a topic in the last 5 minutes come from accounts created in the last 48 hours → flag topic as suspicious
- Content similarity: If > 60% of tweets about a topic are near-identical (cosine similarity > 0.9 on TF-IDF vectors) → coordinated campaign
- Temporal pattern: Natural trends have a smooth growth curve; bot campaigns show a step function (0 → 10K instantly). Fit both shapes to the volume curve; if the step function fits better (higher R²) than a sigmoid → suspicious.
- Geographic anomaly: A topic trending “worldwide” but 90% of tweets come from a single country with VPN-diverse IPs → suspicious
Layer 3: Human Review (Post-Detection)
- Topics flagged by layers 1-2 are routed to a Trust & Safety review queue
- Reviewers can: suppress the trend, label it as spam, or clear it as legitimate
- Suppressed trends get a TTL-based block in Redis (trend:suppress:{topic_hash})
- Feedback loop: reviewer decisions train the ML models for layers 1-2
Layer 4: Proactive Blocking
- Maintain a blocklist of known spam topics (crypto scam tokens, repeated harassment campaigns)
- These topics are pre-suppressed before they can ever appear in trends
- Updated by both automated detection and manual curation
Metrics:
- False positive rate target: < 0.1% (legitimate trends incorrectly suppressed)
- False negative rate target: < 5% (spam trends that slip through)
- Detection latency: < 2 minutes from bot campaign start to suppression
Deep Dive 3: Count-Min Sketch for Scalable Approximate Counting
The Problem: We need to count the frequency of millions of unique topics (hashtags, phrases, n-grams) across 500 regions in real time. Exact counters for every possible topic-region pair would require unbounded memory.
Solution: Count-Min Sketch (CMS)
A CMS is a probabilistic data structure that provides approximate frequency counts with a guaranteed upper bound on overcount error and zero undercount.
Structure:
d = 5 hash functions (rows)
w = 1,000,000 counters per row (columns)
CMS[d][w] — a 5 × 1M matrix of uint32 counters
Increment(topic):
for i in 0..d:
j = hash_i(topic) % w
CMS[i][j] += 1
Query(topic):
return min(CMS[i][hash_i(topic) % w] for i in 0..d)
Why min? Hash collisions cause overcounting. Taking the minimum across all d hash functions minimizes the collision impact. The true count is always ≤ the CMS estimate.
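A minimal CMS implementation matching the pseudocode above. Salted SHA-1 stands in for the d independent hash functions; a production system would use faster non-cryptographic hashes (e.g. MurmurHash), but the logic is identical:

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch with d rows and w counters per row.
    Increment adds 1 to one cell per row; query takes the min across rows,
    so estimates can only overcount, never undercount."""

    def __init__(self, depth=5, width=1_000_000):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salting the hash with the row number simulates d independent hashes
        digest = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def increment(self, item, by=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += by

    def query(self, item):
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch(depth=5, width=4096)  # small sketch for the demo
for _ in range(100):
    cms.increment("#SuperBowl")
cms.increment("#Chiefs", by=7)
print(cms.query("#SuperBowl"))  # >= 100; never undercounts
```

With only two items in a 4,096-wide sketch, collisions are essentially nil, so the estimates should equal the true counts.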
Error Bounds:
- With w = 1M, d = 5:
- ε (error margin) = e/w = 2.72/1M ≈ 0.00027% of total count
- δ (failure probability) = e^(-d) = e^(-5) ≈ 0.67%
- Meaning: with ≥ 99.33% probability, the estimate overcounts by no more than 0.00027% of the total number of events counted (and it never undercounts)
Why CMS over exact counters?
- Space: 5 × 1M × 4B = 20 MB per region per time window
- vs exact: potentially millions of unique topics × 8B each = hundreds of MB, growing unboundedly
- CMS is O(d) for both increment and query — constant time regardless of unique topic count
Integration with Trend Detection:
For each time window (5min, 15min, 1hr), maintain a separate CMS per region.
Candidate selection:
- Maintain an exact "top-K heavy hitters" structure alongside the CMS
- When a topic's CMS count crosses a threshold → promote to exact counter
- Only promoted topics become trend candidates
This hybrid approach gives us:
- CMS: space-efficient counting of the long tail (millions of low-frequency topics)
- Exact counters: precise counts for the ~10K topics that matter (trend candidates)
Sliding Window with CMS:
- Maintain 12 CMS instances per region: one for each 5-minute bucket in the last hour
- “Volume in the last hour” = sum of all 12 sketches (CMS supports merging via element-wise addition)
- Every 5 minutes, discard the oldest sketch and create a new empty one
- This gives us a sliding window with 5-minute granularity using only 12 × 20 MB = 240 MB per region
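The ring-of-sketches mechanics can be sketched as follows, with a `Counter` standing in for each 20 MB per-bucket sketch (a real CMS merges exactly the same way — by element-wise addition — so the structure is unchanged):

```python
from collections import Counter, deque

class SlidingWindowCounts:
    """Ring of 12 five-minute buckets. Counter is a stand-in for the
    per-bucket Count-Min Sketch; rotating the ring drops the oldest
    bucket, and the hourly total is the merge (sum) of all buckets."""

    def __init__(self, buckets=12):
        self.ring = deque((Counter() for _ in range(buckets)), maxlen=buckets)

    def increment(self, topic):
        self.ring[-1][topic] += 1    # write into the current 5-minute bucket

    def rotate(self):
        self.ring.append(Counter())  # every 5 min: drop oldest, open new bucket

    def hourly_volume(self, topic):
        return sum(bucket[topic] for bucket in self.ring)

win = SlidingWindowCounts()
for _ in range(10):
    win.increment("#Earthquake")
win.rotate()                         # five minutes pass
for _ in range(3):
    win.increment("#Earthquake")
print(win.hourly_volume("#Earthquake"))  # 13
```

After twelve rotations the first bucket's 10 tweets fall out of the window, leaving only the most recent counts.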
Space-Saving Alternative: For the heavy hitters detection, we use the Space-Saving algorithm alongside CMS:
- Maintains exact counts for the top K items (K = 10,000)
- When a new item arrives that isn’t tracked, evict the item with the lowest count and replace it
- Guaranteed to find all items with frequency > N/K (where N is total events)
- This feeds the candidate set to the Trend Scorer
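A compact sketch of the Space-Saving algorithm as described. This dict-based version is the simplest correct form; a production implementation would use a min-heap or the stream-summary structure for O(1) eviction:

```python
def space_saving(stream, k):
    """Space-Saving heavy hitters: keep at most k counters. When a new item
    arrives and all k slots are full, evict the minimum-count item and let
    the newcomer inherit its count (the source of bounded overestimation).
    Guaranteed to retain every item with true frequency > len(stream)/k."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

stream = ["#SuperBowl"] * 50 + ["#Chiefs"] * 30 + ["x1", "x2", "x3", "x4"]
print(space_saving(stream, k=3))  # heavy hitters survive; tail items churn
```

The two genuinely frequent topics keep exact counts, while the long-tail items repeatedly evict each other in the last slot.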
7. Extensions (2 min)
- “Why is X trending?” with LLM summarization: For each new trending topic, sample 50 representative tweets, feed them to an LLM (with safety guardrails), and generate a 1-2 sentence explanation. Cache the summary with a 10-minute TTL. This transforms a raw hashtag into actionable context for users.
- Predictive trending: Use historical patterns and early signals (velocity of velocity — acceleration) to predict topics that will trend in the next 30-60 minutes. Surface these as “rising” topics before they hit the main trending list. Useful for newsrooms and journalists.
- Cross-platform trend correlation: Correlate Twitter trends with signals from other platforms (Reddit posts, Google Search spikes, news article publications). This validates trend legitimacy (a topic trending simultaneously across platforms is almost certainly real, not bot-driven) and enriches context.
- Advertiser trend targeting: Allow advertisers to bid on trending topics in real-time — “when #SuperBowl is trending, show my ad.” Requires a real-time auction integration that evaluates trend status at ad-serving time. Massive revenue opportunity (trending moments have the highest engagement).
- Trend notifications: Push-notify users when a topic they follow or a topic relevant to their interests starts trending. Must be carefully rate-limited (max 3 trend notifications/day) to avoid notification fatigue. Use the breaking/sustained/recurring classification to only notify on breaking trends.