1. Requirements & Scope (5 min)

Functional Requirements

  1. Send notifications across multiple channels: push notifications (iOS/Android), SMS, email, and in-app
  2. Support user notification preferences — users can opt in/out per channel, per notification type (e.g., marketing vs transactional)
  3. Template-based notifications with variable substitution (e.g., “Hi {{name}}, your order {{order_id}} has shipped”)
  4. Rate limiting per user per channel — no user receives more than N notifications per hour (prevent notification fatigue)
  5. Track delivery status for each notification (sent, delivered, opened, clicked, bounced, failed) with retry on failure

Non-Functional Requirements

  • Availability: 99.99% — transactional notifications (password resets, 2FA codes) are critical path
  • Latency: Transactional notifications delivered within 5 seconds of trigger. Marketing/batch notifications within 30 minutes.
  • Consistency: At-least-once delivery. No notification should be silently dropped. Deduplication prevents sending the same notification twice.
  • Scale: 1B notifications/day (across all channels), 50K concurrent sends/sec at peak, 500M registered users
  • Durability: Every notification request and its delivery status persisted. Full audit trail.

2. Estimation (3 min)

Traffic

  • 1B notifications/day ÷ 86,400 sec = ~11,600 notifications/sec average
  • Peak (morning/evening, campaign blasts): 5× = ~58,000 notifications/sec
  • Channel breakdown: Push 50% (500M), Email 30% (300M), In-app 15% (150M), SMS 5% (50M)
  • API ingestion (trigger events): ~30K events/sec at peak — fewer events than notifications, since one event often fans out to several channel notifications

Storage

  • Notification record: ~500 bytes (recipient, channel, template, payload, status, timestamps)
  • 1B/day × 500 bytes = 500 GB/day, 30-day retention = 15 TB
  • User preferences: 500M users × 200 bytes = 100 GB (fits in a single DB)
  • Templates: ~10K templates × 5 KB = 50 MB (negligible)

Third-Party Throughput

  • Push (APNs/FCM): Both support ~100K sends/sec with connection pooling
  • Email (SES/SendGrid): SES supports 50K emails/sec at scale
  • SMS (Twilio/SNS): ~1000 SMS/sec per account (need multiple accounts or regional providers for 50M/day)
  • SMS is the bottleneck — 50M SMS/day ÷ 86400 = 579/sec average, 2900/sec peak. Need multiple provider accounts.
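The estimation arithmetic above is easy to sanity-check in a few lines of Python:

```python
# Sanity check of the traffic, storage, and SMS estimates above.
NOTIFICATIONS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400

avg_per_sec = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY       # ~11,574/sec average
peak_per_sec = 5 * avg_per_sec                              # ~57,870/sec at 5x peak

record_bytes = 500
daily_storage_gb = NOTIFICATIONS_PER_DAY * record_bytes / 1e9   # 500 GB/day
retention_tb = daily_storage_gb * 30 / 1_000                    # 15 TB over 30 days

sms_per_day = 50_000_000                                    # 5% of 1B
sms_avg_per_sec = sms_per_day / SECONDS_PER_DAY             # ~579/sec average

print(round(avg_per_sec), daily_storage_gb, retention_tb, round(sms_avg_per_sec))
# 11574 500.0 15.0 579
```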

Key Insight

This is a high-throughput, multi-channel delivery pipeline where the core challenges are: (1) routing to the right channel at the right time, (2) respecting user preferences and rate limits, (3) handling third-party provider failures gracefully, and (4) ensuring no notification is lost or duplicated.


3. API Design (3 min)

Send Notification

POST /notifications
Body: {
  "recipient_id": "user_456",
  "notification_type": "order_shipped",    // maps to template + preference rules
  "priority": "high",                       // critical (2FA, password reset) / high (transactional) / normal / low (marketing)
  "channels": ["push", "email"],            // optional override; default: user preference
  "template_data": {
    "name": "Alice",
    "order_id": "ORD-789",
    "tracking_url": "https://track.example.com/789"
  },
  "idempotency_key": "order-789-shipped",   // prevents duplicate sends
  "schedule_at": null                        // null = send immediately
}
Response 202: {
  "notification_id": "notif_abc123",
  "status": "queued",
  "channels_targeted": ["push", "email"]
}

Batch Send (for campaigns)

POST /notifications/batch
Body: {
  "notification_type": "weekly_digest",
  "audience": {
    "segment_id": "active_users_7d",         // pre-computed segment
    "filters": { "country": "US", "plan": "pro" }
  },
  "template_data": { "week": "Feb 12-18" },
  "throttle": {
    "max_per_second": 10000,                 // ramp up slowly
    "send_window": { "start": "09:00", "end": "21:00", "timezone": "recipient_local" }
  }
}
Response 202: { "batch_id": "batch_xyz", "estimated_recipients": 2400000 }

Delivery Status

GET /notifications/notif_abc123/status
Response 200: {
  "notification_id": "notif_abc123",
  "channels": {
    "push": { "status": "delivered", "delivered_at": "...", "opened_at": "..." },
    "email": { "status": "sent", "sent_at": "...", "opened_at": null }
  }
}

User Preferences

GET /users/user_456/notification-preferences
PUT /users/user_456/notification-preferences
Body: {
  "channels": {
    "push": { "enabled": true, "quiet_hours": { "start": "22:00", "end": "08:00" } },
    "email": { "enabled": true },
    "sms": { "enabled": false }
  },
  "types": {
    "order_updates": { "push": true, "email": true, "sms": true },
    "marketing": { "push": false, "email": true, "sms": false },
    "security_alerts": { "push": true, "email": true, "sms": true }
  }
}

Key Decisions

  • 202 Accepted (not 200) — notification is queued, not delivered synchronously
  • Idempotency key prevents duplicate sends (critical for retried API calls)
  • Priority levels map to separate queues — transactional notifications skip the marketing queue
  • Schedule support — send at a future time or within recipient’s local time window

4. Data Model (3 min)

Notifications (PostgreSQL, partitioned by created_at)

Table: notifications
  notification_id     | UUID (PK)
  recipient_id        | varchar(50)     -- user_id
  notification_type   | varchar(100)    -- e.g., 'order_shipped'
  priority            | enum('critical', 'high', 'normal', 'low')
  idempotency_key     | varchar(255)    -- unique index
  template_id         | varchar(100)
  template_data       | jsonb
  status              | enum('queued', 'processing', 'sent', 'partially_sent', 'failed')
  created_at          | timestamp
  scheduled_at        | timestamp       -- NULL for immediate
  batch_id            | UUID            -- NULL for individual

Index: (idempotency_key) UNIQUE
Index: (recipient_id, created_at)
Index: (batch_id, status)

Delivery Records (per channel attempt)

Table: delivery_records (partitioned by created_at monthly)
  delivery_id         | UUID (PK)
  notification_id     | UUID (FK)
  channel             | enum('push', 'email', 'sms', 'in_app')
  status              | enum('pending', 'sent', 'delivered', 'opened', 'clicked', 'bounced', 'failed')
  provider            | varchar(50)     -- 'apns', 'fcm', 'ses', 'twilio'
  provider_message_id | varchar(200)    -- ID from the provider for tracking
  attempt_count       | int
  last_attempt_at     | timestamp
  delivered_at        | timestamp
  error_message       | text
  metadata            | jsonb           -- open/click tracking data

Index: (notification_id, channel)
Index: (provider_message_id)         -- for webhook callbacks from providers

User Preferences (PostgreSQL)

Table: user_preferences
  user_id             | varchar(50) (PK)
  global_push_enabled | boolean
  global_email_enabled| boolean
  global_sms_enabled  | boolean
  quiet_hours_start   | time
  quiet_hours_end     | time
  timezone            | varchar(50)
  type_preferences    | jsonb           -- { "order_updates": {"push": true, "email": true}, ... }
  updated_at          | timestamp

Templates (PostgreSQL)

Table: templates
  template_id         | varchar(100) (PK)
  channel             | enum('push', 'email', 'sms', 'in_app')
  subject             | text            -- for email
  body                | text            -- with {{variable}} placeholders
  version             | int
  active              | boolean

Device Tokens (PostgreSQL)

Table: device_tokens
  user_id             | varchar(50)
  device_id           | varchar(200)
  platform            | enum('ios', 'android', 'web')
  token               | varchar(500)    -- APNs/FCM token
  is_active           | boolean
  last_used_at        | timestamp

Index: (user_id, is_active)

Why PostgreSQL

  • ACID guarantees for idempotency (unique constraint on idempotency_key)
  • JSONB for flexible template data and preferences
  • Partitioning for high-volume delivery records (partition by month, drop after 30 days)
  • Read replicas for status queries (delivery tracking)

5. High-Level Design (12 min)

Architecture

Trigger Sources (services, campaigns, scheduled)
  → API Service (validation, dedup, preference check)
  → Kafka (notification events, partitioned by priority)
     ├── Topic: notifications.critical  (password reset, 2FA)
     ├── Topic: notifications.high      (order updates)
     ├── Topic: notifications.normal    (social, reminders)
     └── Topic: notifications.low       (marketing, digest)

  → Notification Processor (stateless workers):
       1. Render template with data
       2. Resolve channels (user preferences)
       3. Check rate limits (Redis)
       4. Check quiet hours
       5. Fan out: one message per channel to channel-specific queues

  → Channel Queues (Kafka):
     ├── Topic: channel.push   → Push Sender  → APNs / FCM
     ├── Topic: channel.email  → Email Sender  → SES / SendGrid
     ├── Topic: channel.sms    → SMS Sender    → Twilio / SNS
     └── Topic: channel.in_app → In-App Sender → WebSocket / Polling API

  → Delivery Webhooks (from providers):
       → Webhook Ingestion Service → update delivery_records status

Detailed Flow (single notification)

1. POST /notifications { recipient: "user_456", type: "order_shipped", ... }
2. API Service:
   a. Check idempotency_key → if exists, return existing notification_id
   b. Validate payload, look up template
   c. Write notification to PostgreSQL (status: 'queued')
   d. Publish to Kafka (notifications.high)

3. Notification Processor consumes message:
   a. Fetch user preferences for user_456
   b. Resolve channels:
      - Type "order_shipped" → user wants push + email
      - Check: push enabled? → yes. email enabled? → yes.
   c. Rate limit check (Redis):
      Key: rl:user_456:push → 12 push notifications in last hour (limit: 20) → OK
      Key: rl:user_456:email → 5 emails in last hour (limit: 10) → OK
   d. Quiet hours check:
      User timezone: US/Pacific, current time: 2 PM → OK (quiet = 10 PM - 8 AM)
   e. Render templates:
      Push: "Your order ORD-789 has shipped! Track it here."
      Email: Full HTML template with tracking link
   f. Fan out to channel queues:
      → channel.push: { notification_id, user_456, rendered_push_body, device_tokens: [...] }
      → channel.email: { notification_id, user_456, rendered_email, email: "alice@..." }

4. Push Sender:
   a. Look up device tokens for user_456 (2 devices: iPhone, Android)
   b. Send to APNs (iPhone token) → success
   c. Send to FCM (Android token) → success
   d. Write delivery_records: { channel: push, status: 'sent', provider_message_id: "apns_xyz" }

5. Email Sender:
   a. Send via SES → success, message_id: "ses_abc"
   b. Write delivery_record: { channel: email, status: 'sent' }

6. Later: SES webhook → email opened → update delivery_record status to 'opened'
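The quiet-hours check in step 3d can be sketched with the stdlib; the overnight case (start after end, e.g. 22:00-08:00) is the subtle part. The function name is illustrative, and America/Los_Angeles stands in for US/Pacific:

```python
# Quiet-hours check: is the user's local time inside their do-not-disturb window?
from datetime import datetime, time
from zoneinfo import ZoneInfo

def in_quiet_hours(now_utc: datetime, tz: str, start: time, end: time) -> bool:
    local = now_utc.astimezone(ZoneInfo(tz)).time()
    if start <= end:                          # same-day window, e.g. 01:00-05:00
        return start <= local < end
    return local >= start or local < end      # overnight window, e.g. 22:00-08:00

now = datetime(2024, 2, 15, 22, 0, tzinfo=ZoneInfo("UTC"))  # 2 PM in Los Angeles
print(in_quiet_hours(now, "America/Los_Angeles", time(22, 0), time(8, 0)))  # False: OK to send
```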

Components

  1. API Service: REST endpoint. Validates, deduplicates (idempotency_key), writes to DB, publishes to Kafka.
  2. Notification Processor: Core orchestration. Resolves preferences, applies rate limits, renders templates, fans out to channel queues.
  3. Channel Senders (per channel): Push Sender manages APNs/FCM connection pools. Email Sender manages SES/SendGrid. SMS Sender manages Twilio. Each handles provider-specific retry logic.
  4. Rate Limiter (Redis): Per-user, per-channel windowed counters (see Deep Dive 1). Shared across all processor instances.
  5. Template Engine: Renders templates with Mustache/Handlebars-style variable substitution. Supports i18n (locale-based template selection).
  6. Webhook Ingestion: Receives delivery callbacks from providers (delivery receipts, opens, clicks, bounces). Updates delivery_records.
  7. Kafka: Decouples ingestion from processing. Priority topics ensure transactional notifications are never stuck behind a marketing blast.
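The Template Engine's variable substitution (component 5) can be sketched with a regex; a real deployment would use a Mustache/Handlebars library with HTML escaping and i18n, so this shows only the core idea:

```python
# Minimal {{variable}} substitution. Raises on a missing variable so a bad
# template fails loudly instead of sending "Hi {{name}}" to a user.
import re

_VAR = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")

def render(template: str, data: dict) -> str:
    def sub(m: re.Match) -> str:
        key = m.group(1)
        if key not in data:
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])
    return _VAR.sub(sub, template)

print(render("Hi {{name}}, your order {{order_id}} has shipped",
             {"name": "Alice", "order_id": "ORD-789"}))
# Hi Alice, your order ORD-789 has shipped
```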

6. Deep Dives (15 min)

Deep Dive 1: Rate Limiting and Deduplication

Per-user rate limiting:

Goal: No user receives > 20 push notifications per hour, or > 5 SMS per day.

Redis implementation (fixed-window counter: O(1) and simple; a sorted-set sliding window is more precise but costlier):
  Key: rl:{user_id}:{channel}:{window}
  Example: rl:user_456:push:hourly

  On notification:
    1. INCR key
    2. If first increment (count = 1): EXPIRE key 3600
    3. If count > 20: drop this notification, log as 'rate_limited'

Per-channel, per-type limits:
  rl:user_456:push:marketing:daily → limit 3 (marketing push per day)
  rl:user_456:push:transactional:hourly → limit 50 (transactional is higher)

Global rate limiting (protect providers):

Problem: A campaign targeting 10M users at once will overwhelm SES (50K/sec limit).

Solution: Token bucket per provider
  Key: provider:ses:tokens
  Capacity: 50,000 tokens
  Refill: 50,000 tokens/sec

  Email Sender checks bucket before each send:
    if tokens > 0: decrement, send
    else: wait (backpressure via Kafka consumer lag)
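The per-provider token bucket can be sketched in-process (capacity and refill rate are the SES numbers assumed above); a bucket shared across sender instances would live in Redis, e.g. as a Lua script, which is omitted here:

```python
# Token bucket: refill continuously up to capacity, spend one token per send.
class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = 0.0

    def try_acquire(self, now: float, n: int = 1) -> bool:
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False   # caller waits; Kafka consumer lag provides backpressure

bucket = TokenBucket(capacity=50_000, refill_per_sec=50_000)
print(all(bucket.try_acquire(now=0.0) for _ in range(50_000)))  # True: burst to capacity
print(bucket.try_acquire(now=0.0))    # False: bucket empty
print(bucket.try_acquire(now=0.001))  # True: ~50 tokens refilled after 1 ms
```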

Deduplication (effectively exactly-once sends on top of at-least-once delivery):

Scenario: API retry sends the same notification twice.

Layer 1: Idempotency key (API level)
  INSERT INTO notifications ... ON CONFLICT (idempotency_key) DO NOTHING;
  → DO NOTHING returns no row for a duplicate, so when the insert is a no-op,
    SELECT the existing row by idempotency_key and return its notification_id

Layer 2: Delivery-level dedup (processor level)
  Before sending, check: does delivery_record exist for (notification_id, channel)?
  If yes and status != 'failed': skip (already sent or in progress)
  This handles reprocessing of Kafka messages after consumer restart.
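The Layer 2 check as a sketch, with a dict standing in for the delivery_records table (in production this is a SELECT, or an INSERT guarded by a unique constraint on (notification_id, channel)):

```python
# Skip any (notification_id, channel) pair that already has a non-failed record.
delivery_records: dict[tuple[str, str], str] = {}  # (notification_id, channel) -> status

def should_send(notification_id: str, channel: str) -> bool:
    status = delivery_records.get((notification_id, channel))
    return status is None or status == "failed"   # only failures are retried

delivery_records[("notif_abc123", "push")] = "sent"
delivery_records[("notif_abc123", "email")] = "failed"
print(should_send("notif_abc123", "push"))   # False: already sent, skip on Kafka replay
print(should_send("notif_abc123", "email"))  # True: failed, eligible for retry
print(should_send("notif_abc123", "sms"))    # True: no record yet
```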

Layer 3: Provider-level dedup
  Some providers and queueing layers expose a dedup/idempotency key (support varies; check each provider's API).
  Where available, pass notification_id as the dedup key to prevent double-send at the provider level.

Deep Dive 2: Handling Provider Failures and Multi-Provider Failover

Problem: Third-party providers (APNs, SES, Twilio) have outages. We need to keep delivering notifications.

Retry strategy (per channel):

Attempt 1: Send via primary provider (e.g., SES for email)
  If success → done
  If transient error (429, 500, timeout):
    → Exponential backoff: 1s, 2s, 4s, 8s, 16s
    → Max 5 retries over ~30 seconds
  If permanent error (invalid email, unsubscribed):
    → Don't retry. Mark as 'failed'. Update device_tokens (deactivate bad token).
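The retry policy above as a sketch; the exception classes are illustrative stand-ins for classifying provider responses (429/5xx/timeout as transient, 4xx address errors as permanent):

```python
# Retry with exponential backoff on transient errors; fail fast on permanent ones.
import time

class TransientError(Exception): pass   # 429, 5xx, timeout
class PermanentError(Exception): pass   # invalid address, unsubscribed

def send_with_retry(send, max_attempts: int = 6, base_delay: float = 1.0,
                    sleep=time.sleep):
    # max_attempts = 1 initial attempt + 5 retries; backoff 1, 2, 4, 8, 16 seconds
    for attempt in range(max_attempts):
        try:
            return send()
        except PermanentError:
            raise                              # never retry; mark 'failed'
        except TransientError:
            if attempt == max_attempts - 1:
                raise                          # exhausted: escalate to failover / DLQ
            sleep(base_delay * 2 ** attempt)

calls = {"n": 0}
def flaky_send():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503 from provider")
    return "ses_abc"

delays: list[float] = []
print(send_with_retry(flaky_send, sleep=delays.append))  # ses_abc
print(delays)                                            # [1.0, 2.0]
```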

Multi-provider failover:

Email: Primary = SES, Secondary = SendGrid
Push: Primary = FCM/APNs, Secondary = none (platform-specific)
SMS: Primary = Twilio, Secondary = AWS SNS, Tertiary = Vonage

Failover logic:
  if primary_provider.error_rate > 10% over last 5 minutes:
    → Circuit breaker: route to secondary provider
    → Try primary again after 60 seconds

  Per-message failover:
    if primary fails for this specific message after 3 retries:
      → Try secondary provider
      → If secondary also fails: dead letter queue
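A minimal sketch of the error-rate circuit breaker, assuming a rolling window of recent outcomes rather than a true 5-minute clock window (threshold and cooldown match the numbers above):

```python
# Circuit breaker: trip when the recent error rate exceeds the threshold,
# route to the secondary, and re-probe the primary after the cooldown.
from collections import deque

class CircuitBreaker:
    def __init__(self, threshold: float = 0.10, window: int = 100,
                 cooldown: float = 60.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.outcomes: deque[bool] = deque(maxlen=window)  # True = success
        self.opened_at: float | None = None

    def record(self, success: bool, now: float) -> None:
        self.outcomes.append(success)
        n = len(self.outcomes)
        if n >= 20 and self.outcomes.count(False) / n > self.threshold:
            self.opened_at = now                 # trip: stop using the primary

    def allow_primary(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            self.opened_at = None                # half-open: probe the primary again
            return True
        return False

cb = CircuitBreaker()
for _ in range(15):
    cb.record(True, now=0)
for _ in range(5):
    cb.record(False, now=0)          # 25% errors over the last 20 outcomes
print(cb.allow_primary(now=10))      # False: route to secondary provider
print(cb.allow_primary(now=61))      # True: cooldown over, probe the primary
```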

Provider health monitoring:

Each sender tracks per-provider metrics:
  - Success rate (last 1 min, 5 min)
  - P50/P99 latency
  - Error types (rate limited, server error, auth error)

Dashboard: real-time provider health
Alert: if success rate < 95% for any provider → page oncall

Handling stale device tokens (push notifications):

APNs and FCM return specific error codes for invalid tokens:
  APNs: "Unregistered" or "BadDeviceToken"
  FCM: "NotRegistered"

On such error:
  1. Mark token as inactive in device_tokens table
  2. Don't retry this notification to this token
  3. If user has other active tokens, notification still delivered

Proactive cleanup:
  - FCM: periodic instance ID checks
  - APNs: feedback service (deprecated) / HTTP/2 error responses
  - Delete tokens not used in > 90 days
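The on-error token handling can be sketched as follows; the error-code sets mirror the APNs/FCM codes listed above, and the token dict stands in for a device_tokens row:

```python
# Deactivate dead tokens instead of retrying; retry only transient errors.
STALE_TOKEN_ERRORS = {
    "apns": {"Unregistered", "BadDeviceToken"},
    "fcm": {"NotRegistered"},
}

def handle_push_error(provider: str, error_code: str, token: dict) -> str:
    if error_code in STALE_TOKEN_ERRORS.get(provider, set()):
        token["is_active"] = False     # UPDATE device_tokens SET is_active = false
        return "deactivated"           # never retry this token
    return "retry"                     # transient error: normal retry path

token = {"user_id": "user_456", "token": "apns_xyz", "is_active": True}
print(handle_push_error("apns", "Unregistered", token))  # deactivated
print(token["is_active"])                                # False
print(handle_push_error("fcm", "Unavailable", token))    # retry (transient)
```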

Deep Dive 3: Priority Queues and Campaign Throttling

Problem: A marketing campaign targeting 10M users should not delay a password reset notification.

Priority isolation:

Kafka topics by priority:
  notifications.critical  → 100 consumer instances (password reset, 2FA)
  notifications.high      → 50 consumer instances (order updates)
  notifications.normal    → 30 consumer instances (social notifications)
  notifications.low       → 20 consumer instances (marketing, digest)

Critical notifications are processed by dedicated consumers that never compete
with marketing traffic. Even if the low-priority queue has 10M pending messages,
critical notifications flow through their own isolated pipeline.

Campaign throttling:

Campaign: "Weekly digest to 10M users"
Constraint: Don't send more than 10K/sec (provider limits, user experience)

Implementation:
  1. Campaign Service creates 10M notification requests
  2. Instead of dumping all into Kafka at once:
     → Use a "drip" publisher that emits 10K messages/sec to Kafka
     → Controlled by a token bucket with rate = 10,000/sec
  3. Additionally, respect per-user local time:
     → For each user, compute send_time based on their timezone
     → Queue messages with schedule_at = 9:00 AM in user's timezone
     → Scheduler releases them at the appropriate time
  4. If campaign is paused mid-send:
     → Stop the drip publisher
     → Already-queued messages are still processed (or use a campaign-status check before send)
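Step 3's per-user schedule_at computation can be sketched with stdlib zoneinfo; the scheduler stores the UTC instant, and America/Los_Angeles stands in for US/Pacific:

```python
# Compute schedule_at = 9:00 AM in the recipient's own timezone, as UTC.
from datetime import date, datetime, time
from zoneinfo import ZoneInfo

def local_send_time_utc(send_date: date, local_time: time, tz: str) -> datetime:
    local = datetime.combine(send_date, local_time, tzinfo=ZoneInfo(tz))
    return local.astimezone(ZoneInfo("UTC"))

d = date(2024, 2, 15)
print(local_send_time_utc(d, time(9, 0), "America/Los_Angeles"))  # 2024-02-15 17:00:00+00:00
print(local_send_time_utc(d, time(9, 0), "Asia/Tokyo"))           # 2024-02-15 00:00:00+00:00
```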

Backpressure handling:

If channel queue (e.g., channel.email) builds up beyond 1M messages:
  1. Auto-scale Email Sender instances (more consumers)
  2. If still growing: throttle the Notification Processor for low-priority traffic
  3. Never throttle critical/high priority — those have separate queues
  4. Alert if queue depth exceeds 5M (~100 seconds of backlog at a 50K/sec drain rate)

7. Extensions (2 min)

  • Notification center (in-app inbox): Store all notifications in a per-user inbox (Cassandra or DynamoDB). Users see notification history, mark as read, and catch up on missed push notifications. Support pagination and unread count badge.
  • A/B testing notifications: Test different templates, send times, or channels. Split users into control/treatment groups. Track open rates, click-through rates, and conversion as metrics.
  • Intelligent send-time optimization: ML model predicts the best time to send each user a notification based on their historical open patterns. Instead of sending at campaign time, delay to each user’s optimal window.
  • Notification grouping and summarization: If a user has 20 unread “X liked your post” notifications, group them into “X, Y, and 18 others liked your post.” Reduces notification fatigue and improves engagement.
  • Cross-channel escalation: If a push notification is not opened within 30 minutes, automatically escalate to email. If email not opened in 2 hours, escalate to SMS (for critical notifications only). Configurable escalation chains per notification type.