1. Requirements & Scope (5 min)
Functional Requirements
- Define on-call rotation schedules (weekly, daily, custom) with multiple layers (primary, secondary, manager) and automatic handoffs
- Route incoming alerts through escalation policies: try primary on-call → wait N minutes → escalate to secondary → wait → escalate to manager
- Deliver alerts via multiple channels (push notification, SMS, phone call) with configurable per-user preferences and retry logic
- Track acknowledgment, resolution, and incident lifecycle (triggered → acknowledged → resolved) with timestamps and responder actions
- Support schedule overrides (swap shifts, temporary coverage) and fatigue prevention (quiet hours, max alerts/hour limits)
Non-Functional Requirements
- Availability: 99.999% — this system pages humans during outages. If PagerDuty is down when production is down, no one gets woken up. It must be the most reliable system in the organization.
- Latency: Alert delivery within 30 seconds of trigger. Phone call initiation within 60 seconds. Escalation timing accurate to within 5 seconds.
- Consistency: An alert must be delivered to exactly one on-call person at a time (no missed alerts, no duplicate pages for the same incident at the same escalation level). Acknowledgment must be strongly consistent.
- Scale: 10,000 teams, 100,000 users, 500K alerts/day, 50K concurrent active incidents.
- Durability: Complete audit trail of every alert, delivery attempt, acknowledgment, and escalation. Zero tolerance for lost alerts.
2. Estimation (3 min)
Traffic
- Incoming alerts: 500K/day = ~6 alerts/sec (peak 10x during widespread outage = 60/sec)
- Delivery attempts per alert: avg 2.5 (primary gets push + SMS, sometimes escalates to secondary)
- Total deliveries: 500K × 2.5 = 1.25M delivery attempts/day = ~15/sec
- Phone calls: ~10% of alerts escalate to phone call = 50K calls/day
- Schedule lookups: Every alert → resolve current on-call → 500K lookups/day (cached)
- API requests (schedule management, incident updates): ~200K/day
Storage
- Incidents: 500K/day × 5KB (full incident record with timeline) = 2.5GB/day = ~900GB/year
- Schedules: 10,000 teams × 10KB = 100MB (tiny, mostly static)
- Audit log: 1.25M delivery events/day × 500 bytes = 625MB/day = ~225GB/year
- User preferences: 100K users × 1KB = 100MB
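The arithmetic above can be sanity-checked with a quick script (all inputs are the section's own estimates; Python is used only as a calculator):

```python
# Back-of-envelope check of the traffic and storage estimates above.
ALERTS_PER_DAY = 500_000
SECONDS_PER_DAY = 86_400

alerts_per_sec = ALERTS_PER_DAY / SECONDS_PER_DAY            # ~5.8/sec steady state
peak_alerts_per_sec = alerts_per_sec * 10                    # ~58/sec during a widespread outage

deliveries_per_day = ALERTS_PER_DAY * 2.5                    # 1.25M delivery attempts
deliveries_per_sec = deliveries_per_day / SECONDS_PER_DAY    # ~14.5/sec

incident_storage_per_day_gb = ALERTS_PER_DAY * 5_000 / 1e9   # 2.5 GB/day of incident records
audit_log_per_day_mb = deliveries_per_day * 500 / 1e6        # 625 MB/day of delivery events
```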
Key Insight
This is a reliability-critical orchestration system. The data volume is modest, but the reliability requirements are extreme. The system must work when everything else is broken (datacenter fires, DNS outages, cloud region failures). The core challenge is guaranteeing alert delivery within tight time windows with multiple fallback channels.
3. API Design (3 min)
Schedule Management
POST /v1/schedules
Body: {
"team_id": "team_infra",
"name": "Infra On-Call Primary",
"timezone": "America/New_York",
"rotation_type": "weekly", // weekly, daily, custom
"handoff_time": "09:00", // rotation changes at 9 AM
"handoff_day": "monday",
"participants": [
{ "user_id": "alice", "order": 1 },
{ "user_id": "bob", "order": 2 },
{ "user_id": "carol", "order": 3 }
],
"layers": [
{ "name": "primary", "schedule_id": "sched_primary" },
{ "name": "secondary", "schedule_id": "sched_secondary" },
{ "name": "manager", "schedule_id": "sched_manager" }
]
}
GET /v1/schedules/{schedule_id}/on-call?at=2024-02-22T18:00:00Z
Response: { "user_id": "alice", "layer": "primary",
"shift_start": "2024-02-19T09:00:00-05:00",
"shift_end": "2024-02-26T09:00:00-05:00" }
POST /v1/schedules/{schedule_id}/overrides
Body: { "user_id": "bob", "start": "2024-02-22T18:00:00Z",
"end": "2024-02-23T09:00:00Z", "reason": "Alice at dentist" }
Alert Ingestion
POST /v1/alerts
Body: {
"routing_key": "team_infra_critical", // maps to escalation policy
"severity": "critical", // critical, warning, info
"summary": "Database CPU > 95% for 5 minutes",
"source": "datadog",
"dedup_key": "db-cpu-prod-primary", // for grouping/dedup
"details": { "host": "db-prod-1", "cpu_pct": 97.3, "duration_min": 5 },
"links": [
{ "text": "Runbook", "href": "https://wiki.example.com/db-cpu-runbook" },
{ "text": "Dashboard", "href": "https://grafana.example.com/db-dashboard" }
]
}
Response 202: { "incident_id": "inc_abc", "status": "triggered",
"assigned_to": "alice", "dedup_key": "db-cpu-prod-primary" }
Incident Management
POST /v1/incidents/{incident_id}/acknowledge
Body: { "user_id": "alice" }
Response 200: { "status": "acknowledged", "acknowledged_at": "2024-02-22T18:01:30Z" }
POST /v1/incidents/{incident_id}/resolve
Body: { "user_id": "alice", "resolution_note": "Killed runaway query, CPU back to normal" }
POST /v1/incidents/{incident_id}/escalate
Body: { "user_id": "alice", "reason": "Need DBA help" }
// Immediately escalates to next level, regardless of timeout
Key Decisions
- dedup_key enables alert grouping: if the same dedup_key fires while an incident is active, the event is grouped into it (no new incident created). Prevents alert storms.
- 202 Accepted response: alert processing is async so the API never blocks (monitoring systems must not be blocked by the alerting system).
- Routing keys decouple alert sources from escalation policies (change routing without changing monitoring tool config).
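A minimal sketch of the dedup check on ingestion, with an in-memory dict standing in for the indexed dedup_key lookup on active incidents (names and structure are illustrative, not the real schema):

```python
import uuid

# Active incidents keyed by dedup_key; in production this is an indexed
# PostgreSQL lookup on incidents where status != 'resolved'.
active_incidents: dict[str, dict] = {}

def ingest_alert(dedup_key: str, summary: str) -> dict:
    """Group into an active incident if one exists, else open a new one."""
    incident = active_incidents.get(dedup_key)
    if incident is not None:
        incident["grouped_events"] += 1          # grouped: no new page goes out
        return {"incident_id": incident["id"], "grouped": True}
    incident = {"id": f"inc_{uuid.uuid4().hex[:8]}", "summary": summary,
                "status": "triggered", "grouped_events": 0}
    active_incidents[dedup_key] = incident       # new incident: page the on-call
    return {"incident_id": incident["id"], "grouped": False}

def resolve(dedup_key: str) -> None:
    # Once resolved, the next alert with this key opens a fresh incident.
    active_incidents.pop(dedup_key, None)
```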
4. Data Model (3 min)
Schedules (PostgreSQL)
Table: schedules
schedule_id (PK) | uuid
team_id (FK) | uuid
name | varchar(200)
timezone | varchar(50)
rotation_type | enum('weekly', 'daily', 'custom')
handoff_time | time
handoff_day | varchar(10) -- NULL for daily rotation
created_at | timestamp
Table: schedule_participants
schedule_id (FK) | uuid
user_id (FK) | uuid
rotation_order | int
(composite PK: schedule_id + user_id)
Table: schedule_overrides
override_id (PK) | uuid
schedule_id (FK) | uuid
user_id | uuid -- who is covering
start_time | timestamp
end_time | timestamp
reason | text
created_by | uuid
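The tables above are enough to resolve who is on-call at any instant. A sketch of that resolution for a weekly rotation with overrides (the anchor date and field names are illustrative assumptions; a real implementation must also convert into the schedule's timezone before computing handoffs):

```python
from datetime import datetime, timedelta

def on_call_at(participants: list[str], anchor: datetime, at: datetime,
               overrides: list[dict]) -> str:
    """participants in rotation_order; anchor = any known shift-start handoff.
    Overrides (start/end/user_id) take precedence over the base rotation."""
    for ov in overrides:
        if ov["start"] <= at < ov["end"]:
            return ov["user_id"]
    # Whole weeks elapsed since the anchor handoff selects the rotation slot.
    weeks_elapsed = (at - anchor) // timedelta(weeks=1)
    return participants[weeks_elapsed % len(participants)]
```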
Escalation Policies (PostgreSQL)
Table: escalation_policies
policy_id (PK) | uuid
team_id (FK) | uuid
name | varchar(200)
repeat_count | int -- how many times to cycle through all levels (0 = once)
Table: escalation_levels
policy_id (FK) | uuid
level_order | int -- 1 = primary, 2 = secondary, etc.
schedule_id (FK) | uuid -- which schedule to page
timeout_minutes | int -- wait this long before escalating to next level
(composite PK: policy_id + level_order)
Incidents (PostgreSQL)
Table: incidents
incident_id (PK) | uuid
policy_id (FK) | uuid
dedup_key | varchar(200) -- indexed, for grouping
severity | enum('critical', 'warning', 'info')
summary | text
source | varchar(100)
details | jsonb
status | enum('triggered', 'acknowledged', 'resolved')
current_level | int -- current escalation level
assigned_to (FK) | uuid -- current responder
triggered_at | timestamp
acknowledged_at | timestamp
resolved_at | timestamp
Table: incident_timeline
entry_id (PK) | uuid
incident_id (FK) | uuid
event_type | enum('triggered', 'notified', 'delivery_success', 'delivery_failed',
'acknowledged', 'escalated', 'resolved', 'note_added')
user_id | uuid
channel | varchar(20) -- 'push', 'sms', 'phone', 'email'
details | jsonb
created_at | timestamp
Delivery Queue (Redis + PostgreSQL)
// Redis sorted set for timed deliveries
ZADD delivery_queue {delivery_time_epoch} {delivery_job_json}
// Delivery job:
{
"incident_id": "inc_abc",
"user_id": "alice",
"channel": "push",
"attempt": 1,
"scheduled_at": 1708632090
}
Why PostgreSQL + Redis?
- PostgreSQL: ACID for incident state (acknowledgment must be atomic, escalation must be consistent)
- Redis: Sorted set as a reliable delay queue for timed escalations (check every second for due deliveries)
- Both: replicated across multiple availability zones
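The sorted-set delay queue behaves like the in-memory model below. This is a sketch using a heap in place of the Redis ZADD/ZRANGEBYSCORE/ZREM calls, so the scheduling and cancellation semantics can be seen in isolation; the real scheduler polls Redis once per second:

```python
import heapq
import json

class DelayQueue:
    """Models ZADD delivery_queue {epoch} {job} plus polling for due jobs."""
    def __init__(self):
        self._heap: list[tuple[float, str]] = []
        self._cancelled: set[str] = set()

    def enqueue(self, run_at: float, job: dict) -> None:      # ~ ZADD
        heapq.heappush(self._heap, (run_at, json.dumps(job, sort_keys=True)))

    def cancel(self, job: dict) -> None:                      # ~ ZREM
        self._cancelled.add(json.dumps(job, sort_keys=True))

    def pop_due(self, now: float) -> list[dict]:              # ~ ZRANGEBYSCORE -inf now
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, raw = heapq.heappop(self._heap)
            if raw not in self._cancelled:                    # skip acknowledged jobs
                due.append(json.loads(raw))
        return due
```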
5. High-Level Design (12 min)
Architecture
Alert Sources (Datadog, Prometheus, Custom)
→ POST /v1/alerts
→ API Gateway (multi-region, active-active)
→ Alert Ingestion Service:
1. Validate alert payload
2. Dedup check: is there an active incident for this dedup_key?
→ If yes: group (add event to existing incident)
→ If no: create new incident
3. Resolve escalation policy → find current on-call user
4. Enqueue notification delivery
Notification Pipeline:
→ Delivery Scheduler (reads from Redis sorted set):
Every 1 second: pop all jobs where scheduled_at <= now
→ For each job:
1. Determine channel (push → SMS → phone, based on user preferences)
2. Send via channel provider:
→ Push: Firebase Cloud Messaging / Apple Push Notification
→ SMS: Twilio
→ Phone: Twilio (voice call with TTS)
3. Log delivery attempt in incident_timeline
4. If delivery fails → retry on next channel
→ Escalation Timer:
When incident is triggered at level 1 with 5-min timeout:
→ Schedule escalation job at (triggered_at + 5 min) in Redis sorted set
→ If acknowledged before timeout: cancel escalation job
→ If not acknowledged: escalation job fires → page level 2 on-call
Acknowledgment Flow:
User receives push notification / SMS / phone call
→ Taps "Acknowledge" in app / replies "ACK" to SMS / presses 1 on phone
→ POST /v1/incidents/{id}/acknowledge
→ Update incident status to "acknowledged"
→ Cancel pending escalation timer
→ Log in timeline
→ Notify team channel (Slack): "Alice acknowledged database CPU alert"
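The acknowledgment step above must update incident status and cancel the escalation timer together, and duplicate ACKs (app tap plus SMS reply) must be harmless. A sketch of that transition, with a lock standing in for a PostgreSQL row-level transaction:

```python
import threading

class IncidentManager:
    def __init__(self, cancel_escalation):
        self._lock = threading.Lock()
        self._incidents: dict[str, dict] = {}
        self._cancel_escalation = cancel_escalation   # e.g. ZREM on the timer queue

    def trigger(self, incident_id: str, assigned_to: str) -> None:
        self._incidents[incident_id] = {"status": "triggered",
                                        "assigned_to": assigned_to}

    def acknowledge(self, incident_id: str, user_id: str) -> bool:
        """Returns True only for the call that performed the transition."""
        with self._lock:
            inc = self._incidents[incident_id]
            if inc["status"] != "triggered":
                return False                          # duplicate or late ACK: no-op
            inc["status"] = "acknowledged"
            inc["acknowledged_by"] = user_id
        self._cancel_escalation(incident_id)          # stop pending pages
        return True
```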
Multi-Region Deployment (Critical for 99.999%)
Region A (us-east-1): Region B (us-west-2):
API Gateway (active) API Gateway (active)
Alert Ingestion Service Alert Ingestion Service
Delivery Scheduler Delivery Scheduler
PostgreSQL Primary →→→→→→→→→→→→→→→→ PostgreSQL Replica (standby)
Redis Primary →→→→→→→→→→→→→→→→→→→→→ Redis Replica (standby)
Both regions can accept alerts independently.
PostgreSQL: synchronous replication (strong consistency for incidents)
Failover: if Region A is down, Region B promotes to primary within 30 seconds.
Delivery providers (Twilio, FCM) are external — not affected by our region failures.
Components
- Alert Ingestion Service: Receives alerts, deduplicates, creates incidents. Stateless, auto-scaled.
- Schedule Resolver: Given a team + time, determine who is on-call. Evaluates rotation rules, applies overrides. Caches resolved schedule in Redis (5-minute TTL).
- Delivery Scheduler: Reads from Redis sorted set every second. Dispatches notification jobs to channel-specific workers.
- Push Worker: Sends push notifications via FCM/APNS. Handles token management and delivery receipts.
- SMS Worker: Sends SMS via Twilio. Handles delivery status callbacks. Supports 2-way SMS (reply “ACK” to acknowledge).
- Voice Worker: Initiates phone calls via Twilio Voice API. Text-to-speech reads alert summary. User presses 1 to acknowledge, 2 to escalate.
- Escalation Engine: Manages escalation timers. Cancels timers on acknowledgment. Fires escalation (page next level) on timeout.
- Incident Manager: Tracks incident lifecycle. Provides API for acknowledge, resolve, add notes, manual escalation.
6. Deep Dives (15 min)
Deep Dive 1: Reliable Alert Delivery — Never Miss a Page
The problem: At 3 AM, a critical production database goes down. The on-call engineer must be woken up within 60 seconds. If the push notification doesn’t wake them, SMS must follow. If SMS fails, a phone call. If no one responds, escalate. This must work every single time.
Multi-channel delivery with fallback:
User preference: { primary: "push", fallback_1: "sms", fallback_2: "phone" }
Timeouts: push (30s), sms (60s), phone (90s)
Delivery flow for a critical alert:
T+0s: Send push notification
T+30s: If not acknowledged → send SMS
T+60s: If not acknowledged → initiate phone call
T+90s: If not acknowledged → escalate to next level
Implementation:
1. Immediately enqueue: push delivery at T+0
2. Enqueue: SMS delivery at T+30 (in Redis sorted set)
3. Enqueue: phone delivery at T+60
4. Enqueue: escalation at T+90
On acknowledgment at any point:
→ Cancel all pending delivery jobs for this incident+user
→ Mark SMS/phone jobs as "cancelled" (idempotent)
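The T+0/T+30/T+60/T+90 plan can be built as a batch of queue jobs in one shot. A sketch, with the default channels and offsets taken from the flow above (real offsets come from the user's per-channel timeout preferences):

```python
def build_delivery_plan(incident_id: str, user_id: str,
                        channels=("push", "sms", "phone"),
                        offsets=(0, 30, 60), escalate_at=90) -> list[dict]:
    """One job per fallback channel plus the escalation job; all are
    enqueued up front and cancelled together on acknowledgment."""
    jobs = [{"incident_id": incident_id, "user_id": user_id,
             "channel": ch, "run_at_offset_s": off}
            for ch, off in zip(channels, offsets)]
    jobs.append({"incident_id": incident_id, "user_id": user_id,
                 "channel": "escalate", "run_at_offset_s": escalate_at})
    return jobs
```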
Delivery reliability per channel:
Push notifications:
- FCM/APNS delivery rate: ~97% (3% lost due to battery optimization, connection issues)
- Not reliable enough alone for critical alerts
- No delivery confirmation from device (only "sent to provider")
- Mitigation: always follow up with SMS for critical alerts after 30s
SMS:
- Delivery rate: ~99.5% (0.5% lost due to carrier issues, phone off)
- Delivery receipt available (Twilio callback)
- More reliable than push but slower (5-30 second delivery)
- Cost: $0.0075 per SMS
Phone call:
- Delivery rate: ~99.9% (0.1% failure due to phone off, no signal)
- Hard to ignore: the phone rings until answered or sent to voicemail
- Most expensive: $0.013/min
- Most disruptive: guaranteed to wake someone up
Strategy by severity:
Critical: push + SMS simultaneously, phone after 60s
Warning: push, SMS after 120s
Info: push only, no escalation
Preventing delivery failures:
1. Redundant delivery providers:
Primary SMS: Twilio
Fallback SMS: Vonage (if Twilio is down)
Primary Voice: Twilio
Fallback Voice: Amazon Connect
If primary provider returns error → immediately retry on fallback
Provider health check: ping every 30 seconds, auto-switch if unhealthy
2. Delivery job persistence:
Every delivery job written to PostgreSQL AND Redis
Redis is the fast queue; PostgreSQL is the durable backup
Reconciliation job every 60 seconds: compare Redis queue with PostgreSQL
If a job exists in PostgreSQL but not Redis (Redis lost it) → re-enqueue
3. At-least-once delivery guarantee:
Delivery workers are idempotent (sending duplicate SMS is better than not sending)
Each delivery attempt logged with unique attempt_id
Dedup at provider level (Twilio's idempotency key)
Deep Dive 2: Escalation Engine and Timing Guarantees
The problem: If the primary on-call doesn’t acknowledge within 5 minutes, escalate to secondary. Timing must be precise — a 5-minute timeout must fire between 4:55 and 5:05, not at 5:30.
Escalation timer implementation:
When incident triggers at level 1:
escalation_policy = {
levels: [
{ order: 1, schedule: "primary", timeout: 5min },
{ order: 2, schedule: "secondary", timeout: 10min },
{ order: 3, schedule: "manager", timeout: 15min }
],
repeat: 1 // after cycling all levels, repeat once more
}
1. Page level 1 (primary on-call: Alice)
2. Schedule escalation: "escalate inc_abc to level 2 at T+5min"
→ ZADD escalation_queue (now + 300) "inc_abc:level:2"
If Alice acknowledges at T+3min:
→ ZREM escalation_queue "inc_abc:level:2" // cancel escalation
→ Done
If T+5min passes without acknowledgment:
→ Escalation worker picks up job from sorted set
→ Resolve level 2 on-call: Bob
→ Page Bob (same multi-channel delivery flow)
→ Schedule next escalation: "escalate to level 3 at T+15min"
→ Update incident: current_level = 2, assigned_to = Bob
If all levels exhausted and repeat > 0:
→ Start over from level 1 (re-page Alice)
→ Decrement repeat counter
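The escalation walk above (advance through levels, then optionally repeat the cycle) reduces to a small pure function. A sketch; the return value is the next level to page plus the remaining repeat budget, or None when the policy is exhausted:

```python
def next_escalation(current_level: int, num_levels: int,
                    repeats_left: int) -> "tuple[int, int] | None":
    """Given the level that just timed out without acknowledgment, return
    (next_level, repeats_left), or None when levels and repeats are spent."""
    if current_level < num_levels:
        return current_level + 1, repeats_left      # escalate to the next level
    if repeats_left > 0:
        return 1, repeats_left - 1                  # cycle again from level 1
    return None                                     # policy exhausted
```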
Timing precision:
Problem: Delivery scheduler polls Redis every 1 second.
Worst case timing error: 1 second (acceptable)
But what if the scheduler process crashes?
→ Standby scheduler in Region B takes over
→ PostgreSQL has the incident with escalation_due_at timestamp
→ Recovery: scan all active incidents, re-enqueue any missed escalations
→ Max delay: 30 seconds (failover time) + 1 second (polling) = 31 seconds (exceeds the normal 5-second precision target, but only during a rare regional failover)
Monitoring the escalation engine itself:
→ "Meta-alert": if escalation_queue has jobs overdue by > 60 seconds → alert the platform team
→ This meta-alert goes through a SEPARATE alerting path (direct Twilio call to CTO)
Edge cases:
1. Schedule handoff during incident:
Alert fires at 8:55 AM. Alice is on-call until 9:00 AM. Bob starts at 9:00 AM.
→ Alice gets paged (she's on-call at trigger time)
→ Schedules are re-resolved at page time, not trigger time
→ If the primary layer is paged again after 9:00 AM (e.g. on a repeat cycle), Bob gets it (correct: Alice's shift ended)
2. Override during incident:
Carol creates override to cover Alice from 8:00-10:00 PM
Alert fires at 8:30 PM
→ Carol gets paged (override takes precedence)
3. User has Do Not Disturb (quiet hours) set:
Alice sets quiet hours: 10 PM - 7 AM (info/warning only)
Critical alert at 3 AM:
→ Quiet hours DO NOT apply to critical alerts (always deliver)
→ Quiet hours suppress info/warning alerts (those wait until 7 AM or go to secondary)
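Quiet hours need care because the window usually crosses midnight. A sketch of the suppression check implementing edge case 3 (critical always delivers; default window taken from the example above):

```python
from datetime import time

def suppressed_by_quiet_hours(severity: str, now: time,
                              start: time = time(22, 0),
                              end: time = time(7, 0)) -> bool:
    """True if the alert should be held until quiet hours end.
    Critical alerts always deliver; the window may wrap past midnight."""
    if severity == "critical":
        return False
    if start <= end:                      # same-day window, e.g. 13:00-15:00
        return start <= now < end
    return now >= start or now < end      # overnight window, e.g. 22:00-07:00
```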
Deep Dive 3: Fatigue Prevention and Alert Quality
The problem: If an engineer gets 200 alerts in one night, they start ignoring them. Alert fatigue is the #1 cause of missed critical incidents. The system must prevent this.
Fatigue prevention mechanisms:
1. Alert grouping (dedup_key):
Database CPU high → incident created (dedup_key: "db-cpu-prod")
Database CPU high → same dedup_key → grouped into existing incident (no new page)
Database CPU high → same dedup_key → still grouped
Result: 1 page instead of 50 pages for the same issue
2. Alert suppression rules:
"If there's an active P1 incident for service X, suppress P3/P4 alerts for X"
Logic: downstream alerts are noise if the root cause is already being worked on
3. Per-user rate limiting:
Max 10 pages per hour per user
If exceeded: route to secondary on-call
Alert to manager: "Alice has been paged 10 times in the last hour"
4. Quiet hours (configurable per user):
Default: suppress info/warning during 10 PM - 7 AM
Critical alerts: always deliver (override quiet hours)
Warnings during quiet hours: queue and deliver at 7 AM as a batch
5. Snooze:
After acknowledging, user can "snooze" for 30 min
If the same dedup_key fires again within 30 min → don't re-page
After snooze expires: new alerts for same key trigger normally
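The per-user rate limit (mechanism 3 above) is a sliding-window count. A sketch with an in-memory deque per user; production would keep the counter in Redis so it is shared across delivery workers:

```python
from collections import deque

class PageRateLimiter:
    """Allow at most max_pages per user within a window_s-second window."""
    def __init__(self, max_pages: int = 10, window_s: int = 3600):
        self.max_pages = max_pages
        self.window_s = window_s
        self._pages: dict[str, deque] = {}

    def allow(self, user_id: str, now: float) -> bool:
        q = self._pages.setdefault(user_id, deque())
        while q and q[0] <= now - self.window_s:   # drop pages outside the window
            q.popleft()
        if len(q) >= self.max_pages:
            return False                           # route to secondary instead
        q.append(now)
        return True
```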
Alert quality scoring:
Track per alert source:
- Acknowledge rate: what % of alerts are acknowledged (vs. auto-resolved)?
- Time to acknowledge: how quickly do engineers respond?
- Time to resolve: how long is the incident open?
- False positive rate: alerts that resolve without human action (likely noise)
Score = (acknowledge_rate * 0.3) + ((1 - false_positive_rate) * 0.5) + (resolution_speed * 0.2), each factor normalized to 0-1
If score < 0.3 for a routing key:
→ Flag as "noisy alert" in dashboard
→ Suggest to team: "This alert fires 200 times/week with 80% false positive rate. Consider tuning the threshold or converting to warning severity."
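The scoring formula as a function (a sketch; resolution_speed is assumed to be pre-normalized to 0-1, which the weighted sum requires):

```python
def alert_quality_score(acknowledge_rate: float, false_positive_rate: float,
                        resolution_speed: float) -> float:
    """All inputs in [0, 1]; higher is better."""
    return (acknowledge_rate * 0.3
            + (1 - false_positive_rate) * 0.5
            + resolution_speed * 0.2)

def is_noisy(score: float) -> bool:
    """Scores below 0.3 get flagged in the dashboard as noisy alerts."""
    return score < 0.3
```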
Monthly alert quality report per team:
- Total alerts: 1,234
- Acknowledged: 856 (69%)
- False positives: 312 (25%)
- Mean time to acknowledge: 3.2 minutes
- Noisiest alerts: [list]
- Recommendation: tune 3 alerts to reduce noise by 40%
Incident post-mortem data:
After resolution, auto-generate incident summary:
Incident: Database CPU > 95%
Duration: 23 minutes (triggered → resolved)
Timeline:
18:00:00 Triggered (CPU at 97.3%)
18:00:05 Push notification sent to Alice
18:00:32 SMS sent to Alice (push not acknowledged)
18:01:30 Alice acknowledged via SMS reply
18:01:45 Alice added note: "Investigating runaway query"
18:12:00 Alice added note: "Killed query, CPU dropping"
18:23:00 Alice resolved: "Root cause: unoptimized JOIN in new deployment"
Responder: Alice (primary on-call)
Escalations: 0
Related alerts: 3 (grouped by dedup_key)
7. Extensions (2 min)
- Slack/Teams integration: Dedicated incident channel auto-created on trigger. All timeline events posted. Responders can acknowledge and add notes directly from Slack. War room coordination for major incidents.
- Intelligent routing (ML): Learn which engineers are best at resolving which types of incidents. If a database alert fires, route to the DBA who resolved similar issues fastest, even if they’re secondary on-call.
- On-call compensation tracking: Track hours spent on-call, incidents responded to, and sleep interruptions. Generate fair compensation reports. Identify teams with disproportionate on-call burden.
- Runbook automation: Attach runbooks to alert types. When an alert fires, present the relevant runbook steps in the incident channel. One-click execution of common remediation steps (restart service, scale up, rollback deployment).
- Global incident coordination: For company-wide outages affecting multiple teams, create a “major incident” that links related team incidents. Assign an incident commander. Broadcast status updates to all stakeholders. Auto-generate status page updates.