1. Requirements & Scope (5 min)
Functional Requirements
- Send notifications across multiple channels: push notifications (iOS/Android), SMS, email, and in-app
- Support user notification preferences — users can opt in/out per channel, per notification type (e.g., marketing vs transactional)
- Template-based notifications with variable substitution (e.g., “Hi {{name}}, your order {{order_id}} has shipped”)
- Rate limiting per user per channel — no user receives more than N notifications per hour (prevent notification fatigue)
- Track delivery status for each notification (sent, delivered, opened, clicked, bounced, failed) with retry on failure
Non-Functional Requirements
- Availability: 99.99% — transactional notifications (password resets, 2FA codes) are critical path
- Latency: Transactional notifications delivered within 5 seconds of trigger. Marketing/batch notifications within 30 minutes.
- Consistency: At-least-once delivery. No notification should be silently dropped. Deduplication prevents sending the same notification twice.
- Scale: 1B notifications/day (across all channels), 50K concurrent sends/sec at peak, 500M registered users
- Durability: Every notification request and its delivery status persisted. Full audit trail.
2. Estimation (3 min)
Traffic
- 1B notifications/day = ~11,500 notifications/sec average
- Peak (morning/evening, campaign blasts): 5× = ~58,000 notifications/sec
- Channel breakdown: Push 50% (500M), Email 30% (300M), In-app 15% (150M), SMS 5% (50M)
- API ingestion (trigger events): ~100K events/sec (many produce multiple notifications)
Storage
- Notification record: ~500 bytes (recipient, channel, template, payload, status, timestamps)
- 1B/day × 500 bytes = 500 GB/day, 30-day retention = 15 TB
- User preferences: 500M users × 200 bytes = 100 GB (fits in a single DB)
- Templates: ~10K templates × 5 KB = 50 MB (negligible)
Third-Party Throughput
- Push (APNs/FCM): Both support ~100K sends/sec with connection pooling
- Email (SES/SendGrid): SES supports 50K emails/sec at scale
- SMS (Twilio/SNS): ~1,000 SMS/sec per account
- SMS is the bottleneck — 50M SMS/day ÷ 86,400 = ~580/sec average, ~2,900/sec at 5× peak, well above a single account's sustained throughput. Need multiple provider accounts or regional providers.
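The back-of-envelope numbers above are easy to sanity-check in a few lines (pure arithmetic, no assumptions beyond the stated inputs):

```python
# Sanity check for the traffic, storage, and SMS estimates above.
DAY_SECONDS = 86_400

daily_total = 1_000_000_000                          # 1B notifications/day
avg_rps = daily_total / DAY_SECONDS                  # ~11,574/sec average
peak_rps = avg_rps * 5                               # ~57,870/sec at 5x peak

record_bytes = 500
daily_storage_gb = daily_total * record_bytes / 1e9  # 500 GB/day
retention_tb = daily_storage_gb * 30 / 1e3           # 15 TB for 30-day retention

sms_daily = 50_000_000                               # 5% of 1B
sms_avg = sms_daily / DAY_SECONDS                    # ~579/sec average
sms_peak = sms_avg * 5                               # ~2,894/sec peak
```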
Key Insight
This is a high-throughput, multi-channel delivery pipeline where the core challenges are: (1) routing to the right channel at the right time, (2) respecting user preferences and rate limits, (3) handling third-party provider failures gracefully, and (4) ensuring no notification is lost or duplicated.
3. API Design (3 min)
Send Notification
POST /notifications
Body: {
"recipient_id": "user_456",
"notification_type": "order_shipped", // maps to template + preference rules
"priority": "high", // critical (2FA, password resets) / high (transactional) / normal / low (marketing)
"channels": ["push", "email"], // optional override; default: user preference
"template_data": {
"name": "Alice",
"order_id": "ORD-789",
"tracking_url": "https://track.example.com/789"
},
"idempotency_key": "order-789-shipped", // prevents duplicate sends
"schedule_at": null // null = send immediately
}
Response 202: {
"notification_id": "notif_abc123",
"status": "queued",
"channels_targeted": ["push", "email"]
}
Batch Send (for campaigns)
POST /notifications/batch
Body: {
"notification_type": "weekly_digest",
"audience": {
"segment_id": "active_users_7d", // pre-computed segment
"filters": { "country": "US", "plan": "pro" }
},
"template_data": { "week": "Feb 12-18" },
"throttle": {
"max_per_second": 10000, // ramp up slowly
"send_window": { "start": "09:00", "end": "21:00", "timezone": "recipient_local" }
}
}
Response 202: { "batch_id": "batch_xyz", "estimated_recipients": 2400000 }
Delivery Status
GET /notifications/notif_abc123/status
Response 200: {
"notification_id": "notif_abc123",
"channels": {
"push": { "status": "delivered", "delivered_at": "...", "opened_at": "..." },
"email": { "status": "sent", "sent_at": "...", "opened_at": null }
}
}
User Preferences
GET /users/user_456/notification-preferences
PUT /users/user_456/notification-preferences
Body: {
"channels": {
"push": { "enabled": true, "quiet_hours": { "start": "22:00", "end": "08:00" } },
"email": { "enabled": true },
"sms": { "enabled": false }
},
"types": {
"order_updates": { "push": true, "email": true, "sms": true },
"marketing": { "push": false, "email": true, "sms": false },
"security_alerts": { "push": true, "email": true, "sms": true }
}
}
Key Decisions
- 202 Accepted (not 200) — notification is queued, not delivered synchronously
- Idempotency key prevents duplicate sends (critical for retried API calls)
- Priority levels map to separate queues — transactional notifications skip the marketing queue
- Schedule support — send at a future time or within recipient’s local time window
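One way callers keep retries safe is to derive the idempotency key deterministically from the triggering business event, so a retried API call carries the same key and the server's unique index makes it a no-op. A minimal sketch — the function name and key format here are illustrative assumptions, not part of the API above:

```python
import hashlib

def idempotency_key(notification_type: str, recipient_id: str, event_id: str) -> str:
    """Derive a stable idempotency key from the triggering event, so a
    retried API call for the same event reuses the same key. (Hypothetical
    helper; the key format is an assumption, not part of the spec above.)"""
    raw = f"{notification_type}:{recipient_id}:{event_id}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

# Two retries of the same event collapse to one key:
k1 = idempotency_key("order_shipped", "user_456", "ORD-789")
k2 = idempotency_key("order_shipped", "user_456", "ORD-789")
```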
4. Data Model (3 min)
Notifications (PostgreSQL, partitioned by created_at)
Table: notifications
notification_id | UUID (PK)
recipient_id | varchar(50) -- user_id
notification_type | varchar(100) -- e.g., 'order_shipped'
priority | enum('critical', 'high', 'normal', 'low')
idempotency_key | varchar(255) -- unique index
template_id | varchar(100)
template_data | jsonb
status | enum('queued', 'processing', 'sent', 'partially_sent', 'failed')
created_at | timestamp
scheduled_at | timestamp -- NULL for immediate
batch_id | UUID -- NULL for individual
Index: (idempotency_key) UNIQUE
Index: (recipient_id, created_at)
Index: (batch_id, status)
Delivery Records (per channel attempt)
Table: delivery_records (partitioned by created_at monthly)
delivery_id | UUID (PK)
notification_id | UUID (FK)
channel | enum('push', 'email', 'sms', 'in_app')
status | enum('pending', 'sent', 'delivered', 'opened', 'clicked', 'bounced', 'failed')
provider | varchar(50) -- 'apns', 'fcm', 'ses', 'twilio'
provider_message_id | varchar(200) -- ID from the provider for tracking
attempt_count | int
last_attempt_at | timestamp
delivered_at | timestamp
error_message | text
metadata | jsonb -- open/click tracking data
Index: (notification_id, channel)
Index: (provider_message_id) -- for webhook callbacks from providers
User Preferences (PostgreSQL)
Table: user_preferences
user_id | varchar(50) (PK)
global_push_enabled | boolean
global_email_enabled | boolean
global_sms_enabled | boolean
quiet_hours_start | time
quiet_hours_end | time
timezone | varchar(50)
type_preferences | jsonb -- { "order_updates": {"push": true, "email": true}, ... }
updated_at | timestamp
Templates (PostgreSQL)
Table: templates
template_id | varchar(100) (PK)
channel | enum('push', 'email', 'sms', 'in_app')
subject | text -- for email
body | text -- with {{variable}} placeholders
version | int
active | boolean
Device Tokens (PostgreSQL)
Table: device_tokens
user_id | varchar(50)
device_id | varchar(200)
platform | enum('ios', 'android', 'web')
token | varchar(500) -- APNs/FCM token
is_active | boolean
last_used_at | timestamp
Index: (user_id, is_active)
Why PostgreSQL
- ACID guarantees for idempotency (unique constraint on idempotency_key)
- JSONB for flexible template data and preferences
- Partitioning for high-volume delivery records (monthly partitions; drop a partition once it is fully past the 30-day retention window)
- Read replicas for status queries (delivery tracking)
5. High-Level Design (12 min)
Architecture
Trigger Sources (services, campaigns, scheduled)
→ API Service (validation, dedup, preference check)
→ Kafka (notification events, partitioned by priority)
├── Topic: notifications.critical (password reset, 2FA)
├── Topic: notifications.high (order updates)
├── Topic: notifications.normal (social, reminders)
└── Topic: notifications.low (marketing, digest)
→ Notification Processor (stateless workers):
1. Render template with data
2. Resolve channels (user preferences)
3. Check rate limits (Redis)
4. Check quiet hours
5. Fan out: one message per channel to channel-specific queues
→ Channel Queues (Kafka):
├── Topic: channel.push → Push Sender → APNs / FCM
├── Topic: channel.email → Email Sender → SES / SendGrid
├── Topic: channel.sms → SMS Sender → Twilio / SNS
└── Topic: channel.in_app → In-App Sender → WebSocket / Polling API
→ Delivery Webhooks (from providers):
→ Webhook Ingestion Service → update delivery_records status
Detailed Flow (single notification)
1. POST /notifications { recipient: "user_456", type: "order_shipped", ... }
2. API Service:
a. Check idempotency_key → if exists, return existing notification_id
b. Validate payload, look up template
c. Write notification to PostgreSQL (status: 'queued')
d. Publish to Kafka (notifications.high)
3. Notification Processor consumes message:
a. Fetch user preferences for user_456
b. Resolve channels:
- Type "order_shipped" → user wants push + email
- Check: push enabled? → yes. email enabled? → yes.
c. Rate limit check (Redis):
Key: rl:user_456:push → 12 push notifications in last hour (limit: 20) → OK
Key: rl:user_456:email → 5 emails in last hour (limit: 10) → OK
d. Quiet hours check:
User timezone: US/Pacific, current time: 2 PM → OK (quiet = 10 PM - 8 AM)
e. Render templates:
Push: "Your order ORD-789 has shipped! Track it here."
Email: Full HTML template with tracking link
f. Fan out to channel queues:
→ channel.push: { notification_id, user_456, rendered_push_body, device_tokens: [...] }
→ channel.email: { notification_id, user_456, rendered_email, email: "alice@..." }
4. Push Sender:
a. Look up device tokens for user_456 (2 devices: iPhone, Android)
b. Send to APNs (iPhone token) → success
c. Send to FCM (Android token) → success
d. Write delivery_records: { channel: push, status: 'sent', provider_message_id: "apns_xyz" }
5. Email Sender:
a. Send via SES → success, message_id: "ses_abc"
b. Write delivery_record: { channel: email, status: 'sent' }
6. Later: SES webhook → email opened → update delivery_record status to 'opened'
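Steps 3b-3d (channel resolution plus quiet hours) can be sketched as a small pure function. The data shapes below mirror the preferences API in section 3, but the function itself is an illustrative assumption, not a spec:

```python
from datetime import time

def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside the quiet window; handles windows that
    wrap midnight (e.g., 22:00-08:00)."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end

def resolve_channels(prefs: dict, notification_type: str, now: time) -> list[str]:
    """Intersect type-level preferences with channel-level opt-ins, then
    suppress channels that are inside their quiet hours."""
    wanted = prefs["types"].get(notification_type, {})
    channels = []
    for ch, enabled in wanted.items():
        if not enabled or not prefs["channels"].get(ch, {}).get("enabled", False):
            continue
        qh = prefs["channels"][ch].get("quiet_hours")
        if qh and in_quiet_hours(now, qh["start"], qh["end"]):
            continue  # skip (or defer) during quiet hours
        channels.append(ch)
    return channels

prefs = {  # shape mirrors the preferences API in section 3
    "channels": {
        "push": {"enabled": True, "quiet_hours": {"start": time(22, 0), "end": time(8, 0)}},
        "email": {"enabled": True},
        "sms": {"enabled": False},
    },
    "types": {"order_updates": {"push": True, "email": True, "sms": True}},
}
```

At 2 PM this yields push + email (SMS is globally disabled); at 11 PM the push channel is suppressed by quiet hours.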
Components
- API Service: REST endpoint. Validates, deduplicates (idempotency_key), writes to DB, publishes to Kafka.
- Notification Processor: Core orchestration. Resolves preferences, applies rate limits, renders templates, fans out to channel queues.
- Channel Senders (per channel): Push Sender manages APNs/FCM connection pools. Email Sender manages SES/SendGrid. SMS Sender manages Twilio. Each handles provider-specific retry logic.
- Rate Limiter (Redis): Per-user, per-channel windowed counters (fixed-window in the deep dive below; sliding-window variants trade memory for accuracy). Shared across all processor instances.
- Template Engine: Renders templates with Mustache/Handlebars-style variable substitution. Supports i18n (locale-based template selection).
- Webhook Ingestion: Receives delivery callbacks from providers (delivery receipts, opens, clicks, bounces). Updates delivery_records.
- Kafka: Decouples ingestion from processing. Priority topics ensure transactional notifications are never stuck behind a marketing blast.
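The Template Engine's variable substitution is a one-regex sketch in its simplest form. This is a minimal stand-in, not a real engine — a production system would use Mustache/Handlebars proper, with HTML escaping and locale-based template selection:

```python
import re

def render(template: str, data: dict) -> str:
    """Minimal Mustache-style {{variable}} substitution (sketch only —
    no escaping, sections, or i18n). Missing variables render as ''."""
    def sub(match: re.Match) -> str:
        return str(data.get(match.group(1).strip(), ""))
    return re.sub(r"\{\{(.*?)\}\}", sub, template)

msg = render("Hi {{name}}, your order {{order_id}} has shipped",
             {"name": "Alice", "order_id": "ORD-789"})
```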
6. Deep Dives (15 min)
Deep Dive 1: Rate Limiting and Deduplication
Per-user rate limiting:
Goal: Cap per-user volume — e.g., no more than 20 push notifications per hour or 5 SMS per day (transactional types get higher limits, as below).
Redis implementation (fixed-window counter — a cheap approximation of a sliding window):
Key: rl:{user_id}:{channel}:{window}
Example: rl:user_456:push:hourly
On notification:
1. INCR key
2. If first increment (count = 1): EXPIRE key 3600
3. If count > 20: drop this notification, log as 'rate_limited'
Per-channel, per-type limits:
rl:user_456:push:marketing:daily → limit 3 (marketing push per day)
rl:user_456:push:transactional:hourly → limit 50 (transactional is higher)
Global rate limiting (protect providers):
Problem: A campaign targeting 10M users at once will overwhelm SES (50K/sec limit).
Solution: Token bucket per provider
Key: provider:ses:tokens
Capacity: 50,000 tokens
Refill: 50,000 tokens/sec
Email Sender checks bucket before each send:
if tokens > 0: decrement, send
else: wait (backpressure via Kafka consumer lag)
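A provider-level token bucket, again sketched locally for clarity — a real deployment would keep the bucket in Redis (e.g., via a Lua script) so all sender instances draw from one shared pool:

```python
class TokenBucket:
    """Token bucket sketch for provider-level rate limiting. For SES the
    capacity and refill would both be 50,000 (per the numbers above)."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = capacity
        self.last = 0.0

    def try_acquire(self, now: float, n: int = 1) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # back off; Kafka consumer lag provides backpressure
```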
Deduplication (effectively-once send, on top of at-least-once delivery):
Scenario: API retry sends the same notification twice.
Layer 1: Idempotency key (API level)
INSERT INTO notifications ... ON CONFLICT (idempotency_key) DO NOTHING;
→ Returns existing notification_id if duplicate
Layer 2: Delivery-level dedup (processor level)
Before sending, check: does delivery_record exist for (notification_id, channel)?
If yes and status != 'failed': skip (already sent or in progress)
This handles reprocessing of Kafka messages after consumer restart.
Layer 3: Provider-level dedup
Some providers accept a client-supplied idempotency or deduplication key per message.
Where supported, pass the notification_id so a retried submit can't double-send at the provider.
Deep Dive 2: Handling Provider Failures and Multi-Provider Failover
Problem: Third-party providers (APNs, SES, Twilio) have outages. We need to keep delivering notifications.
Retry strategy (per channel):
Attempt 1: Send via primary provider (e.g., SES for email)
If success → done
If transient error (429, 500, timeout):
→ Exponential backoff: 1s, 2s, 4s, 8s, 16s
→ Max 5 retries over ~30 seconds
If permanent error (invalid email, unsubscribed):
→ Don't retry. Mark as 'failed'. Update device_tokens (deactivate bad token).
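The retry policy above reduces to two small pure functions — a backoff schedule and a transient-vs-permanent classifier (the status-code mapping is a simplifying assumption; real providers return richer error bodies):

```python
def backoff_schedule(base: float = 1.0, max_retries: int = 5) -> list[float]:
    """Exponential backoff delays for transient errors: 1s, 2s, 4s, 8s, 16s
    (31s total worst case, matching the ~30-second budget above)."""
    return [base * (2 ** i) for i in range(max_retries)]

def is_retryable(status_code: int) -> bool:
    """Transient: rate limiting (429) and server errors (5xx). Permanent
    errors (invalid address, unsubscribed) must not be retried."""
    return status_code == 429 or 500 <= status_code < 600
```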
Multi-provider failover:
Email: Primary = SES, Secondary = SendGrid
Push: Primary = FCM/APNs, Secondary = none (platform-specific)
SMS: Primary = Twilio, Secondary = AWS SNS, Tertiary = Vonage
Failover logic:
if primary_provider.error_rate > 10% over last 5 minutes:
→ Circuit breaker: route to secondary provider
→ Try primary again after 60 seconds
Per-message failover:
if primary fails for this specific message after 3 retries:
→ Try secondary provider
→ If secondary also fails: dead letter queue
Provider health monitoring:
Each sender tracks per-provider metrics:
- Success rate (last 1 min, 5 min)
- P50/P99 latency
- Error types (rate limited, server error, auth error)
Dashboard: real-time provider health
Alert: if success rate < 95% for any provider → page oncall
Handling stale device tokens (push notifications):
APNs and FCM return specific error codes for invalid tokens:
APNs: "Unregistered" or "BadDeviceToken"
FCM: "NotRegistered"
On such error:
1. Mark token as inactive in device_tokens table
2. Don't retry this notification to this token
3. If user has other active tokens, notification still delivered
Proactive cleanup:
- FCM: periodic instance ID checks
- APNs: feedback service (deprecated) / HTTP/2 error responses
- Delete tokens not used in > 90 days
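The stale-token handling above boils down to classifying provider error codes into "dead token, deactivate" versus "retry." A sketch, with a plain dict standing in for the device_tokens table:

```python
# Error codes that mean "this token is permanently dead" (per APNs/FCM docs).
STALE_TOKEN_ERRORS = {
    ("apns", "Unregistered"),
    ("apns", "BadDeviceToken"),
    ("fcm", "NotRegistered"),
}

def handle_push_error(provider: str, error_code: str,
                      token_store: dict, token: str) -> str:
    """Deactivate dead tokens instead of retrying; everything else follows
    the normal backoff policy. token_store is a stand-in for the
    device_tokens table (token -> is_active)."""
    if (provider, error_code) in STALE_TOKEN_ERRORS:
        token_store[token] = False   # is_active = false; never retry
        return "deactivated"
    return "retry"
```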
Deep Dive 3: Priority Queues and Campaign Throttling
Problem: A marketing campaign targeting 10M users should not delay a password reset notification.
Priority isolation:
Kafka topics by priority:
notifications.critical → 100 consumer instances (password reset, 2FA)
notifications.high → 50 consumer instances (order updates)
notifications.normal → 30 consumer instances (social notifications)
notifications.low → 20 consumer instances (marketing, digest)
Critical notifications are processed by dedicated consumers that never compete
with marketing traffic. Even if the low-priority queue has 10M pending messages,
critical notifications flow through their own isolated pipeline.
Campaign throttling:
Campaign: "Weekly digest to 10M users"
Constraint: Don't send more than 10K/sec (provider limits, user experience)
Implementation:
1. Campaign Service creates 10M notification requests
2. Instead of dumping all into Kafka at once:
→ Use a "drip" publisher that emits 10K messages/sec to Kafka
→ Controlled by a token bucket with rate = 10,000/sec
3. Additionally, respect per-user local time:
→ For each user, compute send_time based on their timezone
→ Queue messages with schedule_at = 9:00 AM in user's timezone
→ Scheduler releases them at the appropriate time
4. If campaign is paused mid-send:
→ Stop the drip publisher
→ Already-queued messages are still processed (or use a campaign-status check before send)
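The drip publisher from step 2 can be sketched as a generator that carves the campaign into per-tick batches (campaign-status checks and timezone scheduling omitted for brevity):

```python
def drip_batches(total: int, rate_per_sec: int, tick_seconds: float = 1.0):
    """Yield (tick_index, batch_size) so a campaign of `total` recipients
    is published to Kafka at `rate_per_sec`. A real implementation would
    also re-check campaign status (paused?) on every tick."""
    per_tick = int(rate_per_sec * tick_seconds)
    tick = 0
    while total > 0:
        batch = min(per_tick, total)
        yield tick, batch
        total -= batch
        tick += 1

# 10M users at 10K/sec -> 1,000 one-second ticks (~17 minutes of publishing)
```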
Backpressure handling:
If channel queue (e.g., channel.email) builds up beyond 1M messages:
1. Auto-scale Email Sender instances (more consumers)
2. If still growing: throttle the Notification Processor for low-priority traffic
3. Never throttle critical/high priority — those have separate queues
4. Alert if queue depth exceeds 5M (~100 seconds of backlog at 50K/sec — an early warning before delay becomes user-visible)
7. Extensions (2 min)
- Notification center (in-app inbox): Store all notifications in a per-user inbox (Cassandra or DynamoDB). Users see notification history, mark as read, and catch up on missed push notifications. Support pagination and unread count badge.
- A/B testing notifications: Test different templates, send times, or channels. Split users into control/treatment groups. Track open rates, click-through rates, and conversion as metrics.
- Intelligent send-time optimization: ML model predicts the best time to send each user a notification based on their historical open patterns. Instead of sending at campaign time, delay to each user’s optimal window.
- Notification grouping and summarization: If a user has 20 unread “X liked your post” notifications, group them into “X, Y, and 18 others liked your post.” Reduces notification fatigue and improves engagement.
- Cross-channel escalation: If a push notification is not opened within 30 minutes, automatically escalate to email. If email not opened in 2 hours, escalate to SMS (for critical notifications only). Configurable escalation chains per notification type.