Design a Notification System (Push, Email, SMS) - System Design

“Send the user a notification” sounds like one function call. Then you write it down. Which channel - push, email, SMS, in-app? What if the user turned push off but left email on? What if APNs is down for ten minutes? What if the order service retries and now the user gets charged-confirmation SMS three times? What if a bug in one service starts emitting a million notifications a second and you SMS your entire user base at a rupee each? The one-line function is actually a distributed system with a fan-out, a provider integration layer, a retry engine, a preference store, and a rate limiter, and every one of those is a place it breaks.

The core of this problem is that a notification has to go from one logical event (“your order shipped”) to zero or more physical deliveries across different channels and different third-party providers, exactly the right number of times, respecting what the user asked for, without ever hammering a downstream provider or the user. Let me build it properly.

Functional Requirements (FR)

In scope:

Send a notification. An upstream service (orders, payments, social, marketing) triggers a notification to a user through an internal API. The system decides which channels to use and delivers on each.
Multi-channel delivery. Push (mobile via APNs/FCM, web push), email (via SES/SendGrid), and SMS (via Twilio/SNS). One logical notification can fan out to several channels.
User preferences. Users control which channels they receive, per category (transactional, promotional, social). Promotional off must never suppress a critical transactional message.
Templates. Notifications are rendered from templates with variables (name, order id, amount) so services send data, not formatted strings.
Delivery guarantees. At-least-once delivery with deduplication so the effective behavior is close to exactly-once per (user, event, channel). Retries on transient provider failures.
Rate control. Cap how many notifications a single user receives in a window, and protect providers from overload / respect their quotas.

Explicitly out of scope (say it so you own the scope):

The full marketing campaign / segmentation engine (audience building, A/B, scheduling). I design the delivery pipeline it feeds into, not the campaign UI.
In-app notification feed storage and the unread-badge read model. I mention the in-app channel but do not design the feed store.
Analytics/attribution dashboards beyond the delivery status we must track for retries.
The device-token registration client SDK internals. I design the token store and its API, not the mobile SDK.

The decision that drives everything: a notification is fire-and-forget from the caller’s view, but at-least-once with dedup underneath, fanned out across unreliable third-party providers. The caller must not block on a slow SMS provider, must not be able to send a duplicate, and must not be able to bypass a user’s preferences. That asymmetry - cheap synchronous accept, expensive asynchronous multi-provider delivery - is the whole shape of the system.

Non-Functional Requirements (NFR)

Scale: assume a large consumer app. ~100M registered users, ~20M DAU. Peak ~10K notification-events/sec ingested (a marketing blast or a flash-sale spike drives the peak). Each event fans out to ~1.5 channels on average, so ~15K channel-deliveries/sec at peak.
Latency: transactional notifications (OTP, order shipped, payment) should reach the provider within a couple of seconds, p99 under ~5s end to end. Promotional can tolerate minutes. The ingestion accept must be fast, p99 under 50ms, because callers are on a request path.
Availability: 99.9%+ on ingestion. Accepting a notification must almost never fail; the durable queue absorbs downstream outages. Delivery itself is best-effort against providers that themselves go down.
Consistency: eventual. It is fine for a notification to arrive a few seconds late. What is NOT fine is a duplicate (dedup must be strong) or a preference violation (a user who opted out getting spammed).
Durability: an accepted notification must not be silently dropped. Once we return 202 to the caller, that notification is on a durable queue and will be attempted (or deliberately suppressed by preferences, which is logged). Losing an OTP because a worker crashed is a real user-facing failure.
Ordering: best-effort per user. Notifications are largely independent; we do not need strict global ordering. Where it matters (a “shipped” then “delivered” pair) we lean on timestamps, not a total order.

Back-of-the-Envelope Estimation (BoE)

Real numbers with the arithmetic, because they dictate the architecture.

Ingestion (notification events):

Assume 20M DAU, each receiving ~5 notifications/day on average
(transactional + a couple promotional).
20M * 5 = 100M notification-events/day
= 100,000,000 / 86,400 sec
≈ 1,160 events/sec average
Peak (marketing blast / flash sale, ~8x avg) ≈ 10,000 events/sec

Channel deliveries (after fan-out):

Average fan-out ≈ 1.5 channels per event (many are single-channel push,
some transactional go push + email + SMS).
Average: 1,160 * 1.5 ≈ 1,740 deliveries/sec
Peak:   10,000 * 1.5 ≈ 15,000 deliveries/sec

Split by channel at peak, roughly: push ~9K/sec, email ~4K/sec, SMS ~2K/sec. SMS is the lowest volume but the most expensive per message and the most rate-limited by providers, so it dominates the cost and quota story.

Storage (notification log for status + dedup + retries):

Per notification-delivery record:

notification_id   16 bytes (UUID)
dedup_key         ~40 bytes (hash of user+event+channel)
user_id           8 bytes
channel           1 byte
template_id       8 bytes
payload_ref       ~40 bytes (rendered payload or pointer)
status            1 byte  (queued/sent/delivered/failed)
provider_msg_id   ~40 bytes
attempts          1 byte
timestamps        ~24 bytes
--------------------------------------------------
≈ 180 bytes, round to ~250 bytes with overhead/indexes

100M events/day * 1.5 = 150M delivery records/day
150M * 250 bytes ≈ 37.5 GB/day
* 365 ≈ 13.7 TB/year

We do not keep raw records forever. Hot status/dedup data lives ~7-30 days (long enough for dedup windows and retry/troubleshooting), then rolls to cold storage or is dropped. So the online store holds ~1-2 TB, very tractable for a sharded cluster.

Dedup cache:

Dedup keys kept for a 24-72h window in a fast store (Redis).
Peak-ish steady 1,740 deliveries/sec * 86,400 * 3 days ≈ 450M keys
Each key ~40 bytes key + ~40 bytes value + overhead ≈ ~120 bytes
450M * 120 bytes ≈ 54 GB

~50-60 GB of Redis for dedup with TTL eviction. A handful of nodes. The dedup layer is also backed by a unique constraint in the durable store so a cache miss cannot cause a duplicate (belt and suspenders, covered in the deep dive).

Device token store:

100M users * ~1.5 devices avg = 150M tokens
Per token: user_id(8) + token(~120) + platform(1) + timestamps(16) ≈ ~150 bytes
150M * 150 bytes ≈ 22 GB

Small. Fits comfortably in a sharded store, cached hot.

The headline: ingestion is only ~10K/sec at peak, which is modest. The hard parts are not raw throughput - they are the fan-out correctness, the unreliable providers behind retries, dedup under at-least-once, and not melting a provider or a user with rate control.

High-Level Design (HLD)

The spine is an ingestion API that does almost nothing synchronously except validate, dedup-check, and enqueue, followed by an asynchronous pipeline that resolves preferences, renders templates, fans out per channel, and hands each message to a channel-specific worker that talks to an abstracted provider. A durable queue between accept and delivery is what buys us availability and retries.

                          ┌────────────────────────────┐
                          │   Upstream Services         │
                          │ (orders, payments, social,  │
                          │  marketing/campaign engine) │
                          └───────────────┬─────────────┘
                                          │ POST /notifications
                          ┌───────────────▼─────────────┐
                          │   Notification API (Ingest)  │
                          │  validate + dedup-check +    │
                          │  enqueue. Stateless.         │
                          └───────┬──────────────┬───────┘
                     dedup-check  │              │ enqueue event
                        ┌─────────▼────┐   ┌──────▼─────────────┐
                        │ Dedup Store  │   │  Ingest Queue      │
                        │ (Redis + DB  │   │  (Kafka topic)     │
                        │  unique key) │   └──────┬─────────────┘
                        └──────────────┘          │ consumed
                                          ┌────────▼─────────────┐
                                          │  Notification Worker  │
                                          │  (orchestrator):      │
                                          │  1 load prefs         │
                                          │  2 apply rate control │
                                          │  3 render template    │
                                          │  4 fan out per channel│
                                          └──┬───────┬────────┬───┘
                    reads prefs ┌────────────┘       │        └───────────┐ per-channel enqueue
             ┌──────────────────▼──┐        ┌────────▼───────┐   ┌─────────▼────────┐
             │ Preference Service   │        │ Rate Limiter   │   │ Per-Channel Queues│
             │ (user + category     │        │ (token bucket, │   │ push / email / sms│
             │  channel settings)   │        │  Redis)        │   │ (Kafka topics)    │
             └──────────────────────┘        └────────────────┘   └───┬───┬───┬──────┘
                                                                       │   │   │
                              ┌────────────────────────────────────────┘   │   └────────────┐
                     ┌────────▼────────┐               ┌────────▼────────┐        ┌──────────▼───────┐
                     │ Push Worker      │               │ Email Worker    │        │ SMS Worker        │
                     │ (APNs/FCM adapter)│              │ (SES/SendGrid)  │        │ (Twilio/SNS)      │
                     └───┬─────────────┘                └───┬────────────┘        └───┬──────────────┘
                         │ provider call                    │                         │
                 ┌───────▼──────────┐               ┌───────▼───────┐         ┌───────▼────────┐
                 │ Provider Adapter │◀── retries ──▶│ Provider Adapt│         │ Provider Adapt │
                 │ + circuit breaker│               │ + breaker     │         │ + breaker      │
                 └───────┬──────────┘               └───────────────┘         └────────────────┘
                         │ status callbacks (webhooks) update:
                 ┌───────▼────────────────────────────────────────────────┐
                 │  Notification Log / Status Store (delivered, failed,     │
                 │  bounced) + Device Token Store + Dead Letter Queue       │
                 └─────────────────────────────────────────────────────────┘

Ingestion flow (synchronous, must be fast):

Upstream calls POST /notifications with {user_id, event_type, category, template_id, data, dedup_key?, channels?}.
Notification API validates the payload and authenticates the caller (service token). It computes a dedup_key if the caller did not supply one (hash of user_id + event_type + a caller-provided idempotency id).
It does a fast dedup check against the Dedup Store. If this key was seen inside the window, return 202 with the existing notification_id and enqueue nothing. This makes the endpoint idempotent.
Otherwise it writes the event to the durable Ingest Queue (Kafka) and returns 202 Accepted immediately. The caller is now done. Everything expensive happens asynchronously.

Delivery flow (asynchronous):

Notification Workers consume from the Ingest Queue. For each event they load the user’s preferences (which channels are enabled for this category).
If the category is promotional and the user opted out, suppress (log it, do not deliver). Transactional/critical categories ignore promotional opt-outs but still respect explicit per-channel disables where allowed by policy (OTP always sends).
Apply rate control per user and per category. If the user is over their window budget for promotional messages, drop or defer. Transactional bypasses the user cap but still counts toward provider-side quotas.
Render the template with the event data into per-channel payloads (a push has title/body, an email has subject/html, an SMS has a short text).
Fan out: for each enabled channel, write a per-channel message onto that channel’s queue (push, email, sms topics), each carrying the rendered payload, the notification_id, and a per-channel dedup_key.
Channel workers consume their queue, look up delivery targets (device tokens for push, email address, phone number), call the provider through an adapter, and record the result. Transient failures are retried with backoff; permanent failures (invalid token, hard bounce) are recorded and the token/address is marked bad.
Provider status webhooks (delivered/bounced/opened) come back asynchronously and update the Notification Log. Terminal-failed messages after max retries land in a Dead Letter Queue for inspection.

The key structural insight: two queue hops. The first (Ingest Queue) decouples the caller from all delivery work. The second (per-channel queues) decouples channels from each other so a slow SMS provider cannot back up push delivery, and so each channel can be scaled and rate-limited independently. That separation is what makes the system resilient.

Component Deep Dive

The hard parts, naive-first then evolved: (1) fan-out across channels, (2) provider abstraction, (3) retries + dedup / idempotency, (4) user preferences, (5) rate control.

1. Fan-out across channels

The job: turn one logical event into the right set of physical deliveries.

Naive approach: fan out inside the ingestion request

The obvious first cut is to do it all in the POST /notifications handler: load prefs, render, then loop over channels and call APNs, SES, and Twilio inline, returning to the caller when they all complete.

def post_notification(req):
    prefs = prefs_db.get(req.user_id)
    for ch in enabled_channels(prefs, req.category):
        payload = render(req.template_id, req.data, ch)
        provider[ch].send(payload)   # blocking network call to APNs/SES/Twilio
    return 200

Where it breaks:

The caller is now blocked on three third-party network calls, the slowest of which sets the latency. Twilio having a bad minute makes the order service’s checkout endpoint slow. Coupling a user-facing request to a flaky SMS provider is unacceptable.
If the process crashes after sending push but before SMS, the notification is half-delivered with no record and no retry. There is no durability.
A provider outage takes down ingestion. There is nowhere to buffer.
No independent scaling: push is 9K/sec and SMS is 2K/sec with a hard provider quota, but they share one code path and one failure domain.

Doing fan-out synchronously on the request path fails on every axis: latency, durability, isolation, and scaling.

Evolved approach: two-stage async fan-out through queues

Split fan-out into two stages across two queue hops, as in the HLD.

Stage 1 - accept and enqueue. Ingestion only validates, dedups, and drops the event on the durable Ingest Queue. Returns 202 in milliseconds. No provider calls on the request path.
Stage 2a - orchestrate. Notification Workers consume the event, resolve preferences and rate limits once, render per channel, and then fan out by producing one message per enabled channel onto that channel’s own queue. This is the actual fan-out point, and it is cheap: it is just enqueues, no provider calls.
Stage 2b - deliver per channel. Each channel has its own worker pool and its own queue, so push, email, and SMS are fully isolated. A Twilio outage backs up only the SMS topic; push and email keep flowing. Each channel scales its consumer count to its own volume and its own provider quota.

Ingest event ─▶ [Ingest Queue] ─▶ Orchestrator
                                     │ render + fan out
                     ┌───────────────┼───────────────┐
                [push topic]    [email topic]    [sms topic]
                     │               │                │
               Push workers    Email workers     SMS workers

Why this is right:

Isolation: channels fail and scale independently. This is the single biggest win.
Durability: every stage boundary is a durable queue. A worker crash means the message is redelivered, not lost.
Bounded latency for the caller: the caller sees only the enqueue.
Natural place for retries and rate limits: each channel worker owns its own retry policy and its own provider rate budget, because each channel’s provider behaves differently.

One subtlety: fan-out must be idempotent per channel. If the orchestrator crashes after producing the push message but before the email message, it will be redelivered and re-run. Producing the push message again must not cause a duplicate push. We solve that with a per-channel dedup_key and the idempotency machinery in deep dive 3, so redelivery of the orchestration step is safe.

2. Provider abstraction

The job: talk to APNs, FCM, SES, SendGrid, Twilio, SNS - each with different auth, payload shapes, error codes, rate limits, and webhook formats - without that mess leaking into the core pipeline.

Naive approach: call each provider SDK directly in the worker

The first cut puts provider SDK calls straight into channel workers with per-provider branching.

def deliver_sms(msg):
    if region == "US":
        twilio.messages.create(to=msg.phone, body=msg.text, from_=NUM)
    else:
        sns.publish(PhoneNumber=msg.phone, Message=msg.text)

Where it breaks:

Provider-specific error handling, retry semantics, and rate limits are now smeared across the worker. Twilio’s 429 and SNS’s throttling exception mean the same thing but look different; the worker has to know both.
Adding or swapping a provider (failover from SendGrid to SES during an outage) is a code change deep in the delivery loop.
Testing requires mocking each SDK. There is no single seam.
Multi-provider routing (send US SMS via Twilio, India via a local aggregator for cost) becomes a pile of conditionals.

Provider details bleeding into core logic makes the system brittle exactly where the outside world is least reliable.

Evolved approach: a uniform ProviderAdapter interface behind a router

Define one interface every provider adapter implements, and put a channel router in front that picks the adapter and handles cross-provider concerns uniformly.

interface ProviderAdapter {
    send(Message) -> SendResult { provider_msg_id, status, retryable: bool }
    parseWebhook(raw) -> DeliveryStatus   // normalize delivered/bounced/failed
    supports(channel, region) -> bool
    quota() -> RateBudget
}

Each provider (ApnsAdapter, FcmAdapter, SesAdapter, SendGridAdapter, TwilioAdapter, SnsAdapter) implements this and translates its own quirks into the normalized SendResult. The rest of the system only ever sees retryable: true/false and a normalized status - it never learns what a Twilio 21610 (unsubscribed) code is; the adapter maps it to PERMANENT_FAILURE / mark address bad.
A channel router chooses which adapter to use based on channel, region, cost, and current health. This is where multi-provider policy lives: primary Email = SES, fallback = SendGrid; SMS routing by country for cost.
Each adapter carries a circuit breaker and its own rate budget (from quota()). When a provider starts failing or throttling, the breaker opens and the router fails over to the secondary or parks messages for retry, instead of hammering a dead provider.

Channel Worker ─▶ Router.pick(channel, region, health)
                     │
              ┌──────┴───────┐
        primary adapter   fallback adapter
        (SES, breaker)    (SendGrid, breaker)
              │
        normalized SendResult { retryable, provider_msg_id }

Why this is right: the volatile, ugly part of the system (third-party integrations) is quarantined behind a stable interface. The core pipeline reasons in terms of “retryable or not,” failover is a router policy change not a code surgery, and adding a provider is writing one class. Webhooks are normalized the same way so status updates from six providers land in one code path.

3. Retries, dedup, and idempotency

This is the correctness core. We promised at-least-once delivery (never silently drop) but the user must not get duplicates. Those two goals fight, and the resolution is dedup + idempotency around an at-least-once pipeline.

Naive approach: fire once, hope, and retry on any error

Send the message; if the call throws, retry a few times; assume the queue delivers each message exactly once.

Where it breaks:

Queues are at-least-once, not exactly-once. Kafka will redeliver a message if a consumer crashes after processing but before committing its offset. So the same event gets processed twice - two pushes.
Caller retries. The order service’s HTTP call to POST /notifications times out (but actually succeeded server-side); it retries; now there are two events.
Ambiguous provider failures. You call Twilio, the connection drops before you read the response. Did it send? Retrying might double-send; not retrying might drop an OTP. You cannot tell from the error alone.
Naive “retry on any error” also retries permanent failures (invalid phone number), wasting attempts and money, and can amplify a provider outage into a retry storm.

At-least-once machinery without dedup produces duplicates; retrying blindly produces both duplicates and retry storms.

Evolved approach: idempotency keys, a dedup store with a hard unique constraint, and typed retries

Idempotency and dedup. Every notification carries a dedup_key derived from user_id + event_type + caller_idempotency_id + channel. Dedup is enforced at two layers:

Fast path - Redis. SET dedup_key <notification_id> NX EX <window>. If it fails (key exists), this is a duplicate; skip. This absorbs the vast majority of duplicates cheaply.
Authoritative path - a unique constraint in the durable store. The Notification Log has a UNIQUE(dedup_key) index. The channel worker inserts a “sending” row keyed by dedup_key before calling the provider. If the insert violates the unique constraint, another worker already owns this delivery; skip. This is the backstop that guarantees correctness even if Redis evicts the key or has a gap. Redis is the fast filter; the DB unique key is the truth.

def deliver(msg):
    if not redis.set(msg.dedup_key, msg.id, nx=True, ex=WINDOW):
        return  # fast dedup hit
    try:
        db.insert_delivery(dedup_key=msg.dedup_key, status='sending')  # UNIQUE
    except UniqueViolation:
        return  # authoritative dedup hit
    result = adapter.send(msg)
    db.update_status(msg.dedup_key, result.status, result.provider_msg_id)

This makes redelivery from either queue hop, and caller retries, safe: the second attempt loses the race on the unique key and does nothing.

Typed retries. The adapter classifies every failure:

Failure class	Example	Action
Transient	5xx, timeout, throttle (429)	Retry with exponential backoff + jitter
Ambiguous	connection dropped mid-call	Retry, but the provider_msg_id + our dedup makes it safe; providers also dedup on our idempotency key when we pass one
Permanent	invalid token, hard bounce, unsubscribed	Do NOT retry. Mark token/address bad. Record terminal failure

Retries run with exponential backoff and jitter (e.g. 1s, 4s, 15s, 60s) to avoid a synchronized retry storm against a recovering provider. After N attempts the message goes to a Dead Letter Queue with its full context for manual inspection or a later bulk replay. We also pass our idempotency key to providers that support it (Twilio, SES) so even a genuine double-send is dedup’d on their side.

Retry mechanics without blocking a worker: rather than sleeping in the worker (which wastes a consumer slot), a failed message is re-enqueued onto a delay queue keyed by its next-attempt time, or given a visibility-timeout / scheduled redelivery. The worker moves on immediately; the message reappears when its backoff expires.

The combination - dedup key with a Redis fast path and a DB unique-constraint backstop, plus typed retries with backoff and a DLQ - turns a fundamentally at-least-once pipeline into effectively exactly-once delivery per (user, event, channel), which is the correctness bar.

4. User preferences

The job: deliver only on the channels the user allows, per category, without ever suppressing a message they must legally or functionally receive.

Naive approach: one “notifications on/off” flag per user

A single boolean on the user row: notifications enabled or not.

Where it breaks:

Users do not want all-or-nothing. They want promotional emails off but order updates on, push on but SMS off. One flag cannot express that.
Turning “notifications off” to stop promo spam would also kill their OTP and payment alerts - a functional and sometimes legal problem (transactional messages and OTPs must go through).
No per-channel granularity means the preference cannot inform fan-out at all.

A single flag collapses a two-dimensional problem (channel x category) into one bit and gets the important case (must-send transactional) wrong.

Evolved approach: a preference matrix of (category x channel) with mandatory overrides

Model preferences as a matrix: for each category (transactional, promotional, social, security) and each channel (push, email, sms, in-app), a boolean, with system-enforced rules that some (category, channel) cells cannot be disabled.

Preferences[user_id] = {
  transactional: { push: on,  email: on,  sms: on   },   # sms not user-disable-able for OTP
  promotional:   { push: off, email: on,  sms: off  },
  social:        { push: on,  email: off, sms: off  },
  security:      { push: on,  email: on,  sms: on   }     # security always on (locked)
}

The orchestrator, for a given event’s category, intersects the requested channels with the enabled cells to get the actual delivery set.
Mandatory categories (security OTP, critical transactional) are locked on at the policy layer - the UI does not let users disable them and the pipeline ignores any stale “off” for those cells. This is where the “promotional off must not suppress transactional” rule lives, concretely.
Preferences are read on every event, so they sit behind a cache (read-heavy, rarely written). Store is a simple keyed lookup by user_id.
Additional controls layer on the same store: quiet hours (defer non-urgent notifications to a per-user time window, respecting timezone), frequency caps (see rate control), and global unsubscribe for a channel (email List-Unsubscribe compliance).

enabled = requested_channels
          ∩ prefs[category]           # user choices
          ∪ mandatory[category]       # locked-on overrides win

Why this is right: it expresses exactly the two dimensions users care about, it makes the must-send guarantee a first-class locked cell rather than an afterthought, and it plugs straight into fan-out as a set intersection. Quiet hours and frequency caps ride on the same lookup so there is one preference read per event.

5. Rate control

The job: protect the user from being spammed, and protect the providers (and your bill) from overload, and stop a buggy upstream from blasting everyone.

Naive approach: a global counter and a hard cap

Keep one global counter of messages sent this second; if it exceeds a limit, drop. Or a per-user counter with a fixed daily cap.

Where it breaks:

A single global counter is a hot key and a coarse instrument: it throttles OTPs and marketing identically, so a promo surge can starve transactional OTP delivery. You must never rate-limit a security OTP because a marketing blast filled the bucket.
A fixed daily per-user cap does not smooth bursts within the day, and a naive counter reset at midnight lets a burst through right after reset.
Nothing here respects per-provider quotas. Twilio caps you at N messages/sec per number; blowing past it gets you 429s and eventually throttled or suspended. A user-side cap does nothing for the provider side.
A runaway loop in an upstream service (emit the same notification a million times) sails past a per-user cap if it spreads across users - you SMS your whole base before anyone notices.

One global cap conflates independent concerns (user experience, provider quota, abuse protection) and gets the priority ordering (transactional over promotional) backwards.

Evolved approach: layered token buckets - per user, per category, per provider - plus a kill switch

Rate control is not one limiter; it is several, at different scopes, using token buckets in Redis (smooths bursts, refills continuously, unlike a fixed window).

Per-user, per-category frequency cap. A token bucket per (user_id, category). Promotional might be “5 per day, burst 2”; social “10 per hour.” Transactional and security bypass the user cap entirely - they are never dropped for frequency. This is the “don’t spam the user, but always deliver the OTP” rule, enforced by scoping buckets per category and exempting the critical ones.
Per-provider / per-channel quota. A token bucket per provider (and per sending identity, e.g. per Twilio number or per SES sending domain) sized to the provider’s contracted rate. Channel workers acquire a token before each provider call; if none is available, the message is re-queued with backoff rather than sent, so we shape our outbound rate to exactly what the provider allows and never eat a 429. This is a smoothing throttle, not a drop.
Global abuse kill switch. A per-(upstream_service, event_type) rate monitor. If a single source suddenly exceeds a sane ceiling (a runaway loop emitting millions of events), trip a circuit breaker that pauses that source and alerts, instead of faithfully delivering a million-message bug to real users. This is the blast-radius limiter and it is worth its weight in gold the one day a bad deploy happens.

On each delivery:
  if category in (transactional, security):
      pass                                   # exempt from user cap
  elif not user_bucket[(user, category)].take():
      drop_or_defer()                        # user frequency cap
  if not provider_bucket[provider].take():
      requeue_with_backoff()                 # provider quota: shape, don't drop
  if source_monitor[(svc, event)].exceeds_ceiling():
      trip_kill_switch(svc, event)           # abuse / runaway protection

Why this is right: each concern gets its own limiter at its own scope, so they do not interfere - a marketing surge can be throttled to protect the user and the provider without ever touching OTP delivery. Token buckets smooth bursts instead of the cliff-edge of fixed windows. And the source-level kill switch means a single upstream bug cannot turn into a company-wide spam incident, which is the failure mode that actually makes the news.

API Design & Data Schema

API

The ingestion endpoint upstream services call:

POST /api/v1/notifications
Headers: Authorization: Service <token>
         Idempotency-Key: <caller-supplied unique id for this event>
Body:
{
  "user_id": "42",
  "event_type": "order_shipped",
  "category": "transactional",         // transactional|promotional|social|security
  "template_id": "order_shipped_v3",
  "data": { "order_id": "A123", "eta": "2026-07-08" },
  "channels": ["push", "email"],       // optional; omitted = let prefs decide
  "priority": "high"                   // high|normal|low; affects queue + rate policy
}
Response 202:
{
  "notification_id": "b1e2...",
  "status": "accepted",
  "dedup": false                       // true if this collapsed onto an existing one
}
Errors:
  400  invalid payload / unknown template
  401  bad service token
  409  (optional) explicit conflict if caller demands exactly-once semantics
  429  ingestion rate exceeded for this source

Preference management (called by the app / settings UI):

GET  /api/v1/users/{id}/preferences         -> the (category x channel) matrix
PUT  /api/v1/users/{id}/preferences         -> update mutable cells (locked cells rejected)
POST /api/v1/users/{id}/devices             -> register a push token {token, platform}
DELETE /api/v1/users/{id}/devices/{token}   -> unregister (or on hard failure)

Status / observability:

GET  /api/v1/notifications/{id}             -> per-channel delivery status
POST /api/v1/webhooks/{provider}            -> provider delivery callbacks (delivered/bounced)

Ingestion returns 202, not 200 - we accepted responsibility, we did not finish delivering. The Idempotency-Key header is what makes caller retries safe; it feeds directly into the dedup_key.

Data stores

Different access patterns, different stores. Be explicit.

1. Notification Log / Status Store - NoSQL wide-column (Cassandra / DynamoDB), with a unique-key backstop. The access pattern is: write a record per delivery, look it up by notification_id or dedup_key, update its status on webhook, and range-scan a user’s recent notifications. High write volume (150M/day), no joins, TTL-based expiry. That is a wide-column sweet spot. The one relational-flavored need - a hard UNIQUE(dedup_key) for the dedup backstop - is served either by the store’s conditional-write (IF NOT EXISTS) or by a small dedicated dedup table.

Table: notifications
  notification_id  UUID     PARTITION KEY
  user_id          BIGINT
  event_type       STRING
  category         STRING
  created_at       TIMESTAMP
  TTL: 30 days

Table: deliveries                     -- one row per (notification, channel)
  dedup_key        STRING   PARTITION KEY   -- UNIQUE; conditional-write on insert
  notification_id  UUID
  user_id          BIGINT
  channel          STRING
  provider         STRING
  provider_msg_id  STRING   NULL
  status           STRING   -- queued|sending|sent|delivered|failed|bounced|suppressed
  attempts         INT
  updated_at       TIMESTAMP
  TTL: 30 days
  Shard by: dedup_key   (uniform; every dedup check + status update is single-key)

Table: user_notifications             -- for "show me my recent notifications"
  user_id          BIGINT   PARTITION KEY
  notification_id  UUID     CLUSTERING KEY DESC
  Shard by: user_id   (single-partition range scan of a user's recent notifications)

2. Preference Store - can be SQL or a keyed KV; small and read-heavy. 100M users, a small matrix each, read on every event, rarely written. A relational table is perfectly fine here (the data is tiny and structured), fronted by a cache. If you prefer one storage engine, a KV keyed by user_id holding the matrix as a document also works.

Table: preferences
  user_id     BIGINT   PRIMARY KEY
  matrix      JSON     -- { category: { channel: bool } }
  quiet_hours JSON     -- { start, end, timezone }
  updated_at  TIMESTAMP
  Cache: user_id -> matrix, high hit rate, invalidate on PUT

3. Device Token Store - NoSQL KV. user_id -> [ {token, platform, last_seen, valid} ]. Point lookup by user at push time; write on register; mark invalid on permanent APNs/FCM failure. Small (~22GB), sharded by user_id.

Table: devices
  user_id     BIGINT   PARTITION KEY
  token       STRING   CLUSTERING KEY
  platform    STRING   -- ios|android|web
  valid       BOOL
  last_seen   TIMESTAMP
  Shard by: user_id

4. Dedup Store - Redis. dedup_key -> notification_id with TTL = the dedup window (24-72h). The fast filter in front of the DB unique constraint.

5. Rate-limit Store - Redis. Token buckets: bucket:{user}:{category} and bucket:provider:{name}, updated with atomic Lua scripts so take-a-token is race-free across workers.

Why NoSQL for the log and tokens, SQL-optional for preferences: the log and token workloads are high-volume, key-scoped, TTL’d, and need horizontal scale with no cross-entity transactions - wide-column territory. Preferences are small, structured, and low-write, so a relational table is fine and its constraints (locked mandatory cells) are easy to enforce; the only place we genuinely want a hard uniqueness guarantee is the dedup backstop, which we get from a conditional write. We do not need global ACID transactions anywhere, which is exactly why NoSQL fits the hot paths.

Bottlenecks & Scaling

Where it breaks first, in order, and the fix for each.

1. A slow or dead third-party provider (breaks first). APNs/Twilio/SES having a bad ten minutes must not stall the pipeline or the caller. Fix: the two queue hops isolate the caller entirely, and per-channel queues isolate channels from each other. Provider adapters have circuit breakers and the router fails over to a secondary provider (SES -> SendGrid). Failed messages ride a delay/retry queue with exponential backoff so a recovering provider is not stampeded. The channel’s queue simply grows during the outage and drains after - buffered, not lost.

2. Duplicates under at-least-once + caller retries + queue redelivery. The correctness failure users actually notice. Fix: dedup_key enforced by a Redis fast path plus a DB unique-constraint backstop, and we pass idempotency keys to providers that support them. Redelivery and caller retries lose the race on the unique key and no-op.

3. A runaway upstream (bug emits millions of events). The nightmare: SMS the whole user base at real money each. Fix: the per-source kill switch / rate monitor. Abnormal volume from one (service, event_type) trips a breaker that pauses that source and alerts, capping the blast radius before it reaches users or the provider bill.

4. Provider quota / 429 throttling under a marketing blast. 15K deliveries/sec peak, but SMS providers cap per-number rates. Fix: per-provider token buckets; workers acquire a token before sending and re-queue with backoff when none is available, shaping outbound rate to the contracted quota. Spread SMS across multiple sending numbers/short codes and route by region to distribute load and cost. This shapes, it never drops transactional traffic.

5. Hot keys in the rate-limit and dedup stores. A viral event means one template and many users hit the same limiter/dedup shards. Fix: shard buckets by user_id/provider so they spread; use atomic Lua for the take so no lock contention; replicate hot provider-bucket keys. Dedup keys are per (user, event, channel) so they are naturally high-cardinality and spread well.

6. Orchestrator / ingestion throughput. ~10K events/sec peak on ingest, higher fan-out on delivery. Fix: ingestion and all workers are stateless behind load balancers - scale horizontally. Kafka topics are partitioned (partition by user_id so a user’s notifications keep per-user order and spread evenly). Consumer groups scale independently per channel to match each channel’s volume.

7. Preference and token reads on every event. A DB read per event at 10K/sec would hammer the store. Fix: cache preferences and device tokens aggressively (user_id-keyed, high hit rate, invalidate on write). These are read-mostly, so the cache carries almost all traffic.

8. Priority inversion - a promo blast delaying an OTP. If OTPs and marketing share a queue, a 10M-message campaign sitting in front of an OTP delays a login. Fix: priority lanes. High-priority/transactional traffic uses separate, short queues (or a dedicated high-priority topic and worker pool) so it is never stuck behind a promotional backlog. Bulk/promotional runs on its own lane that can lag without hurting anyone.

9. Status-store write amplification and growth (~37GB/day). Fix: TTL the log to ~30 days online, roll older records to cold storage/analytics. Shard by dedup_key/user_id. Status webhook updates are single-key updates, cheap at scale.

10. Single points of failure. Every stateful piece (Kafka, Redis, the stores) is replicated; every stateless piece (API, orchestrator, channel workers) is horizontally scaled behind LBs. The Dead Letter Queue catches terminally-failed messages so nothing vanishes silently - a human can inspect and replay. No single box whose loss stops notifications from flowing.

Shard keys, stated plainly: Ingest and per-channel Kafka topics -> partition by user_id (per-user ordering + even spread). Deliveries table -> dedup_key (every dedup check and status update is single-key). user_notifications and devices and preferences -> user_id. Rate buckets -> (user_id, category) and provider. Never shard the log by created_at - it would concentrate all current writes on the “now” shard, a permanent hot spot.

Wrap-Up

The trade-offs that define this design:

Accept fast, deliver async, across two queue hops. The caller gets a 202 in milliseconds; all provider work happens behind durable queues. We trade immediate delivery for availability, durability, and isolation - a dead provider can never stall a caller.
At-least-once plus dedup = effectively exactly-once. We lean into at-least-once queues and caller retries, then kill duplicates with a dedup_key fronted by Redis and backstopped by a hard DB unique constraint. Correctness comes from the backstop, speed from the cache.
Provider ugliness quarantined behind one adapter interface. The volatile outside world (six providers, six error dialects, six webhook formats) is normalized to retryable: true/false, so the core pipeline stays simple and failover is a router policy, not surgery.
Preferences as a (category x channel) matrix with locked cells. The must-send guarantee (OTP, security) is a first-class locked cell, so “promotional off” can never suppress a transactional message.
Layered rate control, not one cap. Per-user-per-category buckets protect the user, per-provider buckets protect the quota and the bill, and a per-source kill switch caps the blast radius of a runaway upstream - three independent concerns, three independent limiters, transactional always exempt.

One-line summary: a fire-and-forget ingestion API that durably queues events, fans them out across isolated per-channel pipelines to provider adapters behind circuit breakers, guarantees effectively-once delivery via dedup keys with a DB unique-constraint backstop and typed retries, and enforces user preferences and layered token-bucket rate control so the right messages reach the right users the right number of times without ever melting a provider or spamming a person.

Functional Requirements (FR)#

Non-Functional Requirements (NFR)#

Back-of-the-Envelope Estimation (BoE)#

High-Level Design (HLD)#

Component Deep Dive#

1. Fan-out across channels#

Naive approach: fan out inside the ingestion request#

Evolved approach: two-stage async fan-out through queues#

2. Provider abstraction#

Naive approach: call each provider SDK directly in the worker#

Evolved approach: a uniform ProviderAdapter interface behind a router#

3. Retries, dedup, and idempotency#

Naive approach: fire once, hope, and retry on any error#

Evolved approach: idempotency keys, a dedup store with a hard unique constraint, and typed retries#

4. User preferences#

Naive approach: one “notifications on/off” flag per user#

Evolved approach: a preference matrix of (category x channel) with mandatory overrides#

5. Rate control#

Naive approach: a global counter and a hard cap#

Evolved approach: layered token buckets - per user, per category, per provider - plus a kill switch#

API Design & Data Schema#

API#

Data stores#

Bottlenecks & Scaling#

Wrap-Up#

Comments

Functional Requirements (FR)

Non-Functional Requirements (NFR)

Back-of-the-Envelope Estimation (BoE)

High-Level Design (HLD)

Component Deep Dive

1. Fan-out across channels

Naive approach: fan out inside the ingestion request

Evolved approach: two-stage async fan-out through queues

2. Provider abstraction

Naive approach: call each provider SDK directly in the worker

Evolved approach: a uniform ProviderAdapter interface behind a router

3. Retries, dedup, and idempotency

Naive approach: fire once, hope, and retry on any error

Evolved approach: idempotency keys, a dedup store with a hard unique constraint, and typed retries

4. User preferences

Naive approach: one “notifications on/off” flag per user

Evolved approach: a preference matrix of (category x channel) with mandatory overrides

5. Rate control

Naive approach: a global counter and a hard cap

Evolved approach: layered token buckets - per user, per category, per provider - plus a kill switch

API Design & Data Schema

API

Data stores

Bottlenecks & Scaling

Wrap-Up