The Twitter timeline looks trivial until you say the numbers out loud. “Show me a list of tweets from people I follow, newest first.” It is a join. SELECT * FROM tweets WHERE author IN (my followees) ORDER BY time DESC LIMIT 50. Done. Then the interviewer points out that some users follow 5,000 accounts, some accounts have 100 million followers, the timeline must load in under 200ms, and 300 million people refresh it all day. The join is now the most expensive query on the internet.

The entire problem is one decision and its consequences: do you build each user’s timeline when a tweet is posted (fan-out on write), or when the user opens the app (fan-out on read)? Every hard part - the celebrity who breaks fan-out-on-write, the firehose of follows, the cache that holds 300 million timelines - falls out of that one call. Let me do it properly.

Functional Requirements (FR)

In scope:

  • Post a tweet. A user publishes a short text tweet (assume text plus optional media reference). It must appear on their followers’ timelines quickly.
  • Home timeline. Given a user, return a reverse-chronological-ish feed of recent tweets from the accounts they follow, paginated.
  • Follow / unfollow. A user follows or unfollows another account; this changes whose tweets show up in their timeline.
  • User (author) timeline. The list of a single user’s own tweets - a much simpler, secondary read.
  • Ranking. Order the home timeline by relevance, not pure recency (the modern Twitter behavior). Treat this as a layer on top of a candidate set.

Explicitly out of scope (say this out loud so you control the scope):

  • Replies, retweets, quote-tweets, likes as first-class graph features. I will mention retweets where they matter for fan-out but I am not designing the full engagement graph.
  • Direct messages, notifications, search, trends, ads. Each is its own system.
  • Media storage and transcoding. Tweets carry a media URL pointing at a separate blob/CDN system; I am not designing that.
  • Spam, abuse, and content moderation pipelines. They exist; I am not designing them here.

The one decision that drives everything: this is a massively read-heavy system with a brutally skewed fan-out. Reads (timeline loads) dominate writes (tweets) by ~100:1, and the cost of a single write depends entirely on how many followers the author has. That asymmetry is the whole problem.

Non-Functional Requirements (NFR)

  • Scale: ~300M daily active users. ~500M tweets/day. Average user follows a few hundred accounts; the distribution has a long, heavy tail (celebrities with 50M-150M followers).
  • Latency: home timeline read p99 under 200ms. This is the number users feel. Posting a tweet can be slower from the author’s perspective (p99 under 500ms to acknowledge) because the expensive fan-out happens asynchronously.
  • Availability: 99.99% on the timeline read path. A blank timeline is the most visible failure Twitter can have. Posting can tolerate marginally lower availability since it is async behind the scenes.
  • Consistency: eventual. It is completely fine if your tweet shows up on a follower’s timeline a few seconds late. Nobody needs read-after-write across the social graph. We trade strict consistency for availability and latency without hesitation.
  • Durability: a posted tweet must never be lost. The tweet itself is durable, replicated storage. The materialized timelines are a cache and can be rebuilt - losing a timeline is recoverable, losing a tweet is not.
  • Freshness: a tweet should reach most followers’ timelines within a few seconds under normal load.

Back-of-the-Envelope Estimation (BoE)

Real numbers, with the arithmetic, because the architecture is dictated by them.

Writes (tweets):

500M tweets / day
= 500,000,000 / 86,400 seconds
≈ 5,800 tweets/sec  (call it ~6K WPS average)
Peak (3x average) ≈ 18K WPS

Reads (timeline loads):

300M DAU, each opening the app and refreshing several times.
Assume ~15 timeline fetches/user/day (open, pull-to-refresh, scroll pages).
300M * 15 = 4.5B timeline reads/day
= 4.5B / 86,400 ≈ 52,000 reads/sec average
Peak (3x) ≈ 150K timeline reads/sec

So roughly 150K timeline reads/sec at peak vs ~18K tweets/sec. Read:write ≈ 100:1 on request count. But the real story is fan-out amplification, next.

Fan-out amplification (the killer number):

Average followers per posting account: assume ~200 (heavily skewed).
Each tweet, under fan-out-on-write, must be written into ~200 timelines.
6K tweets/sec * 200 followers = 1.2M timeline-writes/sec average.
Peak: 18K * 200 = 3.6M timeline-writes/sec.

Now the tail: a celebrity with 100M followers posting once
= 100,000,000 timeline inserts from a SINGLE tweet.

That last line is the entire reason this problem is hard. One tweet can be 100 million writes. You cannot do that synchronously, and for the biggest accounts you should not do it at all (deep dive below).

Storage:

Per tweet, the canonical row:

tweet_id       8 bytes  (Snowflake-style 64-bit, time-sortable)
author_id      8 bytes
text           ~200 bytes (280 chars, mostly ASCII; budget generous)
media_url      ~100 bytes (nullable)
created_at     8 bytes (often encoded in tweet_id)
metadata       ~76 bytes
------------------------------------------------
≈ 400 bytes/tweet
500M tweets/day * 400 bytes ≈ 200 GB/day
* 365 ≈ 73 TB/year of raw tweets

Over 5 years, ~365TB of tweet bodies. That is the durable store and it is large but very tractable for a sharded cluster. The materialized timelines are far bigger in row count but tiny per entry.

Timeline cache size:

The home timeline is a list of tweet IDs (8 bytes each), not full tweets. Cache the most recent ~800 entries per active user.

Per timeline: 800 tweet_ids * 8 bytes ≈ 6.4 KB (round to ~8KB with overhead)
Cache only ACTIVE users (say the 300M DAU, not all 1B+ accounts):
300M * 8KB ≈ 2.4 TB of timeline cache

~2.4TB of RAM across a Redis cluster. That is a few hundred Redis nodes - large but completely standard at this scale. The point: store IDs, not tweets, and only for active users. We hydrate IDs into full tweets at read time from a separate cache.

Bandwidth:

A timeline page = ~50 tweets * ~400 bytes hydrated ≈ 20 KB.
Read bandwidth at peak: 150K reads/sec * 20KB ≈ 3 GB/sec served.

Real but handled by CDN/edge for media and horizontal read scaling for tweet bodies. The hard constraint is request rate and fan-out write volume, not raw bandwidth.

High-Level Design (HLD)

Two paths with wildly different cost profiles: the write/post path (cheap per request but with catastrophic fan-out amplification) and the read/timeline path (the high-QPS, latency-critical one). I separate them with a queue so the expensive fan-out never blocks the user who posted.

                         ┌──────────────────────────┐
                         │         Clients          │
                         │  (mobile, web, API)      │
                         └────────────┬─────────────┘
                                      │
                         ┌────────────▼─────────────┐
                         │      Load Balancer        │
                         └──────┬───────────────┬────┘
              WRITE / POST path │               │ READ / TIMELINE path
                 ┌──────────────▼──┐       ┌─────▼──────────────┐
                 │  Tweet Service   │       │ Timeline Service   │
                 │  (stateless)     │       │  (stateless)       │
                 └───┬──────────┬───┘       └─────┬──────────────┘
                     │          │                 │
        write tweet  │          │ emit            │ read precomputed
        (durable)    │          │ fan-out job     │ timeline (IDs)
            ┌────────▼───┐  ┌────▼─────────┐   ┌───▼────────────┐
            │ Tweet Store │  │ Fan-out Queue │   │ Timeline Cache │
            │ (sharded by │  │   (Kafka)     │   │  (Redis: user  │
            │  tweet_id)  │  └────┬──────────┘   │  -> [tweetID]) │
            └─────────────┘       │              └───┬────────────┘
                                  │ consumed          │ hydrate IDs
                         ┌────────▼─────────┐    ┌────▼───────────┐
                         │  Fan-out Workers  │    │  Tweet Cache    │
                         │ (push tweet_id    │    │  (Redis: id ->  │
                         │  into follower    │───▶│   tweet body)   │
                         │  timelines)       │    └────────────────┘
                         └────────┬──────────┘
                                  │ reads follower lists
                         ┌────────▼─────────┐    ┌────────────────┐
                         │  Social Graph     │    │ Ranking Service │
                         │  Service          │    │ (scores         │
                         │ (follows/followers)│    │  candidates)    │
                         └───────────────────┘    └────────────────┘

Write flow (post a tweet):

  1. Client POST /tweet. Tweet Service validates and assigns a time-sortable tweet_id (Snowflake).
  2. Write the tweet to the durable, sharded Tweet Store (sharded by tweet_id). Also write into the Tweet Cache.
  3. Acknowledge the author immediately. The author’s own timeline is updated synchronously (one write) so they see their tweet instantly.
  4. Emit a fan-out job to Kafka: {tweet_id, author_id}.
  5. Fan-out Workers consume the job, look up the author’s followers from the Social Graph Service, and push tweet_id into each follower’s cached timeline list - but only for non-celebrity authors (see deep dive).

Read flow (load home timeline) - the hot path:

  1. Client GET /timeline. Timeline Service reads the user’s precomputed timeline (a list of tweet IDs) from the Timeline Cache. This is a single Redis read of ~800 IDs. Fast.
  2. Merge in celebrity tweets at read time: the user’s followed celebrities are not fanned out on write. Timeline Service fetches each followed celebrity’s recent tweet IDs (themselves cached) and merges them into the candidate list. This is the hybrid model.
  3. The merged candidate set of tweet IDs goes to the Ranking Service, which scores and orders them.
  4. Hydrate the top N IDs into full tweet objects from the Tweet Cache (batch MGET).
  5. Return the page. Cache misses on hydration fall through to the Tweet Store.

The key insight is in steps 1 and 2: most of the work was already done at write time, so the read is a cheap list fetch plus a small merge. That is what buys sub-200ms reads at 150K QPS. The celebrity merge is the deliberate exception that keeps fan-out-on-write from exploding.

Component Deep Dive

The hard parts, naive-first then evolved: (1) fan-out on write vs read and the hybrid, (2) the celebrity hot-key problem, (3) timeline caching and hydration, (4) feed ranking.

1. Fan-out: on write vs on read

This is the heart of the problem. There are two pure strategies and they are mirror images of each other.

Approach A: Fan-out on read (the naive query)

When the user opens the app, go get the tweets right then: read their follow list, fetch recent tweets from every followee, merge-sort by time, return the top 50.

SELECT tweet_id, author_id, created_at, text
FROM tweets
WHERE author_id IN (/* the 300 accounts I follow */)
ORDER BY created_at DESC
LIMIT 50;

Where it breaks:

  • This runs on every single timeline load, at 150K reads/sec. Each one is a scatter-gather across potentially hundreds of authors living on different shards, then an in-memory merge-sort. A user following 5,000 accounts triggers a 5,000-way fan-in per refresh.
  • The work is repeated every time even though nothing changed between refreshes. You recompute the same merge over and over.
  • p99 latency is uncontrollable: it depends on how many people you follow and which shards are slow. You cannot hold 200ms.

Fan-out on read is cheap to write (one row per tweet, no amplification) but ruinously expensive to read. For a read-heavy system, optimizing the rare write at the cost of the common read is backwards.

Approach B: Fan-out on write (precompute the timeline)

Invert it. When a tweet is posted, immediately push its ID into the materialized timeline of every follower. The timeline is precomputed and sitting in a cache. A read is then just “fetch my list.”

on post(tweet):
    followers = graph.get_followers(tweet.author_id)
    for f in followers:
        timeline_cache.lpush(f, tweet.tweet_id)   # newest first
        timeline_cache.ltrim(f, 0, 800)           # cap length

Read becomes trivial:

on load_timeline(user):
    tweet_ids = timeline_cache.lrange(user, 0, 50)
    return hydrate(tweet_ids)

This is fantastic for reads - a single list fetch, predictable, sub-millisecond at the cache. And reads are 100x more common than writes, so we move the cost to the rarer operation. This is the right default.

Where the naive version breaks: the fan-out cost is O(followers) per tweet, and the follower distribution has a monstrous tail. Posting for an account with 100M followers means 100M timeline writes for one tweet. At celebrity posting rates that is a write storm that can stall the queue, and it means a single tweet from one account can saturate the fan-out workers and delay everyone else’s tweets. Pure fan-out-on-write dies on celebrities.

Approach C: The hybrid (the answer)

Use fan-out on write for normal accounts, and fan-out on read for celebrities. Split authors by follower count at a threshold (say ~100K followers; tune empirically).

  • Normal author posts -> fan out to followers’ timelines at write time, as in Approach B. Cheap because they have few followers.
  • Celebrity author posts -> do NOT fan out. Just store the tweet. Their tweet sits in their own (cached) author timeline.
  • On read, the Timeline Service takes the user’s precomputed timeline (normal-author tweets, already merged) and merges in the recent tweets of the handful of celebrities that user follows, pulled from those celebrities’ cached author timelines. Then ranks and returns.
on post(tweet):
    store(tweet)
    if follower_count(author) < CELEB_THRESHOLD:
        enqueue_fanout(tweet)        # push to followers' timelines
    else:
        pass                          # celebrity: pulled at read time

on load_timeline(user):
    base = timeline_cache.lrange(user, 0, 800)          # fanned-out tweets
    celebs = graph.celebrities_followed_by(user)        # small set
    celeb_tweets = [author_timeline_cache.recent(c) for c in celebs]
    candidates = merge(base, flatten(celeb_tweets))
    return rank_and_hydrate(candidates)

Why this is the right call:

  • Fan-out-on-write handles the 99.9% of accounts with few followers, so reads for those are precomputed and instant.
  • Fan-out-on-read handles the tiny number of celebrities, so one tweet never triggers 100M writes. A user follows only a few dozen celebrities at most, so the read-time merge is small and bounded.
  • The expensive amplification is eliminated for exactly the accounts that caused it, and the per-read cost stays small for exactly the accounts that don’t.
StrategyWrite costRead costBreaks onVerdict
Fan-out on readO(1) storeO(followees) scatter-gather every readHigh-follow users, read QPSToo slow to read
Fan-out on writeO(followers) per tweetO(1) list fetchCelebrities (100M writes/tweet)Great except the tail
Hybrid (write + celeb read)O(followers) for normal, O(1) for celebO(1) + small celeb mergeNothing at scaleThe answer

This hybrid is the canonical Twitter answer, and the threshold split is the single most important decision in the design.

2. The celebrity hot-key problem

Even with the hybrid, celebrities create two distinct hot-key problems: a write-side one (already solved by not fanning them out) and a read-side one.

Read-side hot key: when a celebrity tweets, tens of millions of their followers will, within minutes, all open the app and pull that same celebrity’s recent tweets at read time. Every one of those reads hits the same cache entry - the celebrity’s author timeline, and the same hydrated tweet body. One Redis key now absorbs, say, 100K+ reads/sec. A single Redis node has a per-key throughput ceiling; this melts it.

Fixes, layered:

  • Hot tweet cache replication. Replicate the hottest authors’ timelines and the hottest tweet bodies across multiple cache nodes and read from a random replica. 100K RPS spread across 10 replicas is 10K each - fine.
  • Per-instance local cache. Each Timeline Service instance keeps a tiny in-process LRU (TTL of a few seconds) of the top few thousand hot tweets and hot-celebrity timelines. 100K RPS spread across 100 service instances is 1K each, served from local RAM without touching Redis at all. A few-second TTL is acceptable because nobody notices a celebrity tweet being 3 seconds stale.
  • Edge/CDN for tweet bodies. Tweet content is immutable once posted, so the hydrated tweet body can be served from a CDN. The viral tweet itself gets served at the edge.

Write-side hot key (follow storms): when a celebrity does something newsworthy, millions of new follows can hit the graph for one account in a short window, hammering the shard that owns that account’s follower list. We do not need that follower list synchronously (we are not fanning them out), so follow writes go through a queue and the follower-count is maintained as an approximate, asynchronously updated counter. We never block a follow on updating a 100M-entry list synchronously.

The combination - not fanning out celebrities + replicating their hot read keys + per-instance local cache + edge cache for bodies - solves the celebrity problem end to end.

3. Timeline caching and hydration

The naive version: store full tweet objects in each user’s materialized timeline.

Where it breaks: the same tweet from a 200-follower user gets copied, in full, into 200 timelines. A viral retweet chain duplicates the same 400-byte body thousands of times across the cache. Storage explodes, and if you ever edit/delete a tweet you must hunt down every copy.

Fix: store only tweet IDs in timelines; hydrate at read time.

  • The Timeline Cache maps user_id -> [tweet_id, tweet_id, ...] (a capped Redis list, newest first, length ~800). 8 bytes per entry. This is the precomputed feed.
  • The Tweet Cache maps tweet_id -> tweet_body, stored exactly once regardless of how many timelines reference it. On read, batch-fetch the top N IDs with a single MGET. One tweet body, one cache entry, referenced by ID everywhere.
Timeline Cache (per user, capped list):
  user:42  ->  [tw_9931, tw_9930, tw_9925, ... up to 800]

Tweet Cache (shared, dedup'd):
  tw_9931  ->  {author, text, media_url, created_at}

This gives us two wins: storage is bounded (the 2.4TB estimate above assumes IDs only), and edits/deletes touch exactly one Tweet Cache entry - the timelines hold a reference, not a copy. A deleted tweet simply fails hydration and is filtered out of the page.

Capping and trimming: every timeline is LTRIMmed to ~800 entries on insert. Nobody scrolls back 800 tweets; deep history is served by fan-out-on-read against the Tweet Store as a rare fallback. Capping keeps per-user memory flat.

Rebuild path: because timelines are a cache, a cold or evicted timeline (e.g. a user who was inactive and got evicted, then returns) is rebuilt on demand: read their follow list, pull recent tweets from each followee (fan-out-on-read, the Approach A query), populate the cache, and from then on keep it warm via fan-out-on-write. So inactive users cost zero standing cache, and only active users hold the 2.4TB. This is why we only materialize timelines for DAU, not all 1B+ accounts.

4. Feed ranking

The naive version: pure reverse-chronological. Merge candidates by created_at descending, return. Simple and was the original Twitter.

Where it falls short (not “breaks” - it works, it’s just worse): with hundreds of followees, the best content gets buried under high-frequency, low-value posters, and a user who opens the app once a day misses the one tweet they actually cared about because it scrolled past. Engagement drops. Modern Twitter ranks.

Evolved design: ranking as a layer on top of the candidate set. Crucially, ranking does NOT replace fan-out - it consumes the candidate IDs that fan-out produced. The pipeline:

  1. Candidate generation (already done): the merged set of fanned-out tweets + followed-celebrity tweets, maybe a few hundred candidate IDs.
  2. Feature hydration: for each candidate, pull lightweight features - tweet age, author affinity (how often this user engages with that author), tweet engagement velocity (likes/retweets per minute so far), media presence, whether it is a reply.
  3. Scoring: a lightweight model (logistic regression / gradient-boosted trees, or a small neural ranker) produces a relevance score per candidate. The score predicts probability of engagement.
  4. Ordering and diversity rules: sort by score, then apply business rules - don’t show 5 tweets from the same author in a row, inject a small amount of recency so the feed feels live, demote already-seen tweets.
candidates  ->  hydrate features  ->  score(model)  ->  reorder + diversity  ->  page

Keep ranking out of the write path entirely. We do not precompute scores at fan-out time because affinity and engagement-velocity features change minute to minute; scoring at read time on a few hundred candidates is cheap (sub-10ms with a light model) and always fresh. The expensive part (candidate generation) is precomputed; the cheap, freshness-sensitive part (scoring) is at read time. That split is the whole trick.

For latency safety, the ranker has a strict time budget; if scoring is slow or the model service is down, fall back to reverse-chronological. The feed must render even if ranking fails - degrade, never blank.

API Design & Data Schema

API

POST /api/v1/tweets
Body:
{
  "text": "shipping the new timeline service today",
  "media_url": "https://cdn.x.com/m/abc123.jpg"   // optional
}
Response 201:
{
  "tweet_id": "1789324411223300",
  "author_id": "42",
  "created_at": "2026-06-27T09:00:00Z"
}
Errors:
  400  empty or over-length text
  401  not authenticated
  429  rate limit exceeded
GET /api/v1/timeline?limit=50&cursor=<opaque>
Response 200:
{
  "tweets": [
    { "tweet_id": "...", "author": {...}, "text": "...",
      "media_url": "...", "created_at": "..." },
    ...
  ],
  "next_cursor": "<opaque>"      // for pagination; cursor encodes position
}
Latency target: p99 < 200ms
Errors:
  401  not authenticated
POST   /api/v1/follow      { "target_id": "999" }   -> 202 Accepted (async)
DELETE /api/v1/follow      { "target_id": "999" }   -> 202 Accepted
GET    /api/v1/users/{id}/tweets?limit=50&cursor=.. -> author timeline

Pagination is cursor-based, not offset-based. Offsets break when new tweets arrive between pages (you get duplicates or gaps). The cursor encodes the last-seen tweet_id; since IDs are time-sortable (Snowflake), the cursor is a stable position in the stream.

Data store choices

This system uses several stores because the access patterns genuinely differ. Be explicit about each.

1. Tweet Store - NoSQL wide-column (Cassandra / DynamoDB). The access pattern is point-lookup by tweet_id and range-scan of an author’s recent tweets. No joins, no cross-row transactions, needs to hold ~365TB and scale writes horizontally. That is the sweet spot for a distributed wide-column store, not a single relational DB.

Table: tweets
  tweet_id     BIGINT   PARTITION KEY   -- Snowflake: time-sortable 64-bit
  author_id    BIGINT
  text         STRING
  media_url    STRING   NULL
  created_at   TIMESTAMP                -- also embedded in tweet_id
  reply_to     BIGINT   NULL

  Shard by: tweet_id (uniform distribution, single-key reads land on one shard)
  Secondary access: author_timelines table below for "an author's tweets"
Table: author_timelines        -- an author's own tweets, for the celeb read-merge
  author_id    BIGINT   PARTITION KEY
  tweet_id     BIGINT   CLUSTERING KEY DESC   -- ordered newest first
  Shard by: author_id
  Read pattern: "recent K tweets by author X" = single-partition range scan

2. Social Graph Store. Follows are a graph: follower_id -> followee_id. We need two queries: “who do I follow” (for read-time celeb merge and timeline rebuild) and “who follows X” (for fan-out on write). Store both directions.

Table: following            Table: followers
  user_id   PARTITION KEY     user_id   PARTITION KEY     -- the followed account
  followee  CLUSTERING KEY     follower  CLUSTERING KEY
  Shard by: user_id           Shard by: user_id

"who do I follow"  -> following[me]      (single partition)
"who follows X"    -> followers[X]       (single partition; huge for celebs)
follower_count[X]  -> maintained as an approximate async counter

Why two tables: you cannot efficiently answer both directions from one. The cost is doubled writes on follow (insert into both) - acceptable, follows are rare relative to reads. We could use a dedicated graph database, but a wide-column store with these two materialized adjacency lists scales more simply at this size.

Why NoSQL over SQL across the board: the workload is high-volume, key/partition-scoped reads and writes with no need for cross-entity ACID transactions, demanding horizontal scale to hundreds of TB. SQL’s joins and transactions are exactly the features we do not need here, and a single relational instance cannot hold this data or sustain the write rate. The one place strong consistency would matter (e.g. a billing ledger) does not exist in this problem.

3. Timeline Cache and Tweet Cache - Redis. Covered in the deep dive: user_id -> [tweet_id] capped lists, and tweet_id -> body shared entries. Pure in-memory, replicated, sharded by user_id and tweet_id respectively.

Bottlenecks & Scaling

Where it breaks first, in order, and the fix for each:

1. Fan-out write amplification on celebrities (breaks first and worst). One tweet = up to 100M timeline writes. Fix: the hybrid model. Do not fan out accounts above the follower threshold; merge their tweets at read time instead. This is the single decision that makes the system possible. Tune the threshold by watching fan-out queue lag.

2. Timeline read QPS (150K/sec). Cannot recompute a merge per read. Fix: precomputed, cached timelines (fan-out on write) so a read is a single Redis list fetch plus a small bounded celeb merge. Store IDs not bodies; hydrate with one batched MGET.

3. Celebrity read hot key. Millions reading the same celeb timeline / tweet body at once. Fix: replicate hot keys across multiple cache nodes (read a random replica), per-instance local LRU with a few-second TTL, and CDN for immutable tweet bodies. Covered in deep dive.

4. Fan-out queue lag / thundering herd. A burst of posting (or a near-threshold mega-account) can back the Kafka fan-out queue up, delaying delivery for everyone. Fix: partition the fan-out queue, scale workers horizontally, and rate-control per-author fan-out so one heavy account cannot starve the rest. Workers are idempotent (pushing the same tweet_id twice is a no-op on a deduped list) so retries are safe.

5. Tweet Store storage and write throughput (~200GB/day, 18K WPS peak). Fix: shard by tweet_id. Snowflake IDs are uniformly distributed, so writes spread evenly with no hot shard. Replication factor 3 for durability. Author-timeline table shards by author_id; the only naturally hot partition is a celebrity’s own tweet list, which we handle with cache replication, not by hitting the store.

Shard keys, stated plainly: Tweet Store -> tweet_id (every read is a single-key lookup). Author timelines -> author_id (range scan of one author’s tweets). Social graph -> user_id on both adjacency tables. Timeline cache -> user_id. Choosing created_at as a shard key anywhere would be a classic mistake: it concentrates all current writes on one shard (the “now” shard) and creates a permanent write hot spot.

6. Timeline cache memory growth. Fix: cap each timeline to ~800 entries (LTRIM on insert), and only materialize timelines for active users. Inactive users are evicted and rebuilt on return via fan-out-on-read. This keeps standing memory at ~2.4TB (DAU only), not tens of TB (all accounts).

7. Ranking service as a latency/availability risk on the read path. Fix: strict scoring time budget and a reverse-chronological fallback if the model service is slow or down. The feed degrades to recency; it never blanks.

8. Single points of failure. Tweet Service and Timeline Service are stateless behind load balancers - scale horizontally, any instance serves any request. Kafka, the caches, and the stores are all replicated. Fan-out workers are a horizontally-scaled, idempotent consumer group. No single box whose loss stops timelines from loading.

9. Multi-region latency for a global user base. Fix: deploy Timeline Service and caches in multiple regions, route via GeoDNS. Because timelines are eventually consistent and rebuildable, cross-region replication is forgiving - a few seconds of lag on a far-region timeline is invisible. The Tweet Store replicates asynchronously across regions; the author always reads their own just-posted tweet from their home region (read-your-writes on the author timeline only).

Wrap-Up

The trade-offs that define this design:

  • Hybrid fan-out over either pure model. We fan out on write for normal accounts (fast reads, the common case) and on read for celebrities (no 100M-write explosion). We trade a small read-time merge for eliminating the catastrophic tail. This is the single most important call in the whole system.
  • Precompute the timeline, store IDs not bodies. We move work off the 100x-more-common read onto the rarer write, and dedup tweet bodies behind a shared Tweet Cache so a viral tweet is one cache entry, not thousands.
  • Eventual consistency, embraced. A tweet appearing a few seconds late on a follower’s timeline is free latency and availability. We never pay for strict consistency we don’t need.
  • Ranking at read time, on top of fan-out. Candidate generation is precomputed; the freshness-sensitive scoring is cheap and done per read, with a reverse-chronological fallback so the feed never blanks.
  • NoSQL, sharded by the read key everywhere. Partition-scoped, transaction-free access at hundreds of TB is exactly what wide-column stores are for, and sharding by the read key keeps every hot-path read a single-partition lookup.

One-line summary: a read-heavy, eventually-consistent social feed where normal accounts are fanned out to precomputed, ID-only cached timelines on write while celebrities are merged in on read, hot keys are absorbed by replicated and per-instance caches, and a read-time ranker orders the result in under 200ms at 150K timeline reads per second.