1. Requirements & Scope (5 min)
Functional Requirements
- Define on-call rotation schedules (weekly, daily, custom) with multiple layers (primary, secondary, manager) and automatic handoffs
- Route incoming alerts through escalation policies: try primary on-call → wait N minutes → escalate to secondary → wait → escalate to manager
- Deliver alerts via multiple channels (push notification, SMS, phone call) with configurable per-user preferences and retry logic
- Track acknowledgment, resolution, and incident lifecycle (triggered → acknowledged → resolved) with timestamps and responder actions
- Support schedule overrides (swap shifts, temporary coverage) and fatigue prevention (quiet hours, max alerts/hour limits)
Non-Functional Requirements
- Availability: 99.999% — this system pages humans during outages. If PagerDuty is down when production is down, no one gets woken up. It must be the most reliable system in the organization.
- Latency: Alert delivery within 30 seconds of trigger. Phone call initiation within 60 seconds. Escalation timing accurate to within 5 seconds.
- Consistency: An alert must be delivered to exactly one on-call person at a time (no missed alerts, no duplicate pages for the same incident at the same escalation level). Acknowledgment must be strongly consistent.
- Scale: 10,000 teams, 100,000 users, 500K alerts/day, 50K concurrent active incidents.
- Durability: Complete audit trail of every alert, delivery attempt, acknowledgment, and escalation. Zero tolerance for lost alerts.
2. Estimation (3 min)
Traffic
- Incoming alerts: 500K/day = ~6 alerts/sec (peak 10x during widespread outage = 60/sec)
- Delivery attempts per alert: avg 2.5 (primary gets push + SMS, sometimes escalates to secondary)
- Total deliveries: 500K × 2.5 = 1.25M delivery attempts/day = ~15/sec
- Phone calls: ~10% of alerts escalate to phone call = 50K calls/day
- Schedule lookups: Every alert → resolve current on-call → 500K lookups/day (cached)
- API requests (schedule management, incident updates): ~200K/day
Storage
- Incidents: 500K/day × 5KB (full incident record with timeline) = 2.5GB/day = ~900GB/year
- Schedules: 10,000 teams × 10KB = 100MB (tiny, mostly static)
- Audit log: 1.25M delivery events/day × 500 bytes = 625MB/day = ~225GB/year
- User preferences: 100K users × 1KB = 100MB
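The arithmetic above can be sanity-checked with a quick script (all inputs are the section's own estimates; Python is used only as a calculator):

```python
# Back-of-envelope check of the traffic and storage estimates above.
ALERTS_PER_DAY = 500_000
SECONDS_PER_DAY = 86_400

alerts_per_sec = ALERTS_PER_DAY / SECONDS_PER_DAY            # ~5.8/sec steady state
peak_alerts_per_sec = alerts_per_sec * 10                    # ~58/sec during a widespread outage

deliveries_per_day = ALERTS_PER_DAY * 2.5                    # 1.25M delivery attempts
deliveries_per_sec = deliveries_per_day / SECONDS_PER_DAY    # ~14.5/sec

incident_storage_per_day_gb = ALERTS_PER_DAY * 5_000 / 1e9   # 2.5 GB/day of incident records
audit_log_per_day_mb = deliveries_per_day * 500 / 1e6        # 625 MB/day of delivery events
```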
Key Insight
This is a reliability-critical orchestration system. The data volume is modest, but the reliability requirements are extreme. The system must work when everything else is broken (datacenter fires, DNS outages, cloud region failures). The core challenge is guaranteeing alert delivery within tight time windows with multiple fallback channels.
3. API Design (3 min)
Schedule Management
POST /v1/schedules
Body: {
"team_id": "team_infra",
"name": "Infra On-Call Primary",
"timezone": "America/New_York",
"rotation_type": "weekly", // weekly, daily, custom
"handoff_time": "09:00", // rotation changes at 9 AM
"handoff_day": "monday",
"participants": [
{ "user_id": "alice", "order": 1 },
{ "user_id": "bob", "order": 2 },
{ "user_id": "carol", "order": 3 }
],
"layers": [
{ "name": "primary", "schedule_id": "sched_primary" },
{ "name": "secondary", "schedule_id": "sched_secondary" },
{ "name": "manager", "schedule_id": "sched_manager" }
]
}
GET /v1/schedules/{schedule_id}/on-call?at=2024-02-22T18:00:00Z
Response: { "user_id": "alice", "layer": "primary",
"shift_start": "2024-02-19T09:00:00-05:00",
"shift_end": "2024-02-26T09:00:00-05:00" }
POST /v1/schedules/{schedule_id}/overrides
Body: { "user_id": "bob", "start": "2024-02-22T18:00:00Z",
"end": "2024-02-23T09:00:00Z", "reason": "Alice at dentist" }
Alert Ingestion
POST /v1/alerts
Body: {
"routing_key": "team_infra_critical", // maps to escalation policy
"severity": "critical", // critical, warning, info
"summary": "Database CPU > 95% for 5 minutes",
"source": "datadog",
"dedup_key": "db-cpu-prod-primary", // for grouping/dedup
"details": { "host": "db-prod-1", "cpu_pct": 97.3, "duration_min": 5 },
"links": [
{ "text": "Runbook", "href": "https://wiki.example.com/db-cpu-runbook" },
{ "text": "Dashboard", "href": "https://grafana.example.com/db-dashboard" }
]
}
Response 202: { "incident_id": "inc_abc", "status": "triggered",
"assigned_to": "alice", "dedup_key": "db-cpu-prod-primary" }
Incident Management
POST /v1/incidents/{incident_id}/acknowledge
Body: { "user_id": "alice" }
Response 200: { "status": "acknowledged", "acknowledged_at": "2024-02-22T18:01:30Z" }
POST /v1/incidents/{incident_id}/resolve
Body: { "user_id": "alice", "resolution_note": "Killed runaway query, CPU back to normal" }
POST /v1/incidents/{incident_id}/escalate
Body: { "user_id": "alice", "reason": "Need DBA help" }
// Immediately escalates to next level, regardless of timeout
Key Decisions
- dedup_key enables alert grouping: if the same dedup_key fires while an incident is active, the event is grouped into it (no new incident created). Prevents alert storms.
- 202 Accepted response: alert processing is async so the API never blocks (monitoring systems must not be blocked by the alerting system).
- Routing keys decouple alert sources from escalation policies (change routing without changing monitoring tool config).
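A minimal sketch of the dedup check on ingestion, with an in-memory dict standing in for the indexed dedup_key lookup on active incidents (names and structure are illustrative, not the real schema):

```python
import uuid

# Active incidents keyed by dedup_key; in production this is an indexed
# PostgreSQL lookup on incidents where status != 'resolved'.
active_incidents: dict[str, dict] = {}

def ingest_alert(dedup_key: str, summary: str) -> dict:
    """Group into an active incident if one exists, else open a new one."""
    incident = active_incidents.get(dedup_key)
    if incident is not None:
        incident["grouped_events"] += 1          # grouped: no new page goes out
        return {"incident_id": incident["id"], "grouped": True}
    incident = {"id": f"inc_{uuid.uuid4().hex[:8]}", "summary": summary,
                "status": "triggered", "grouped_events": 0}
    active_incidents[dedup_key] = incident       # new incident: page the on-call
    return {"incident_id": incident["id"], "grouped": False}

def resolve(dedup_key: str) -> None:
    # Once resolved, the next alert with this key opens a fresh incident.
    active_incidents.pop(dedup_key, None)
```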
4. Data Model (3 min)
Schedules (PostgreSQL)
Table: schedules
schedule_id (PK) | uuid
team_id (FK) | uuid
name | varchar(200)
timezone | varchar(50)
rotation_type | enum('weekly', 'daily', 'custom')
handoff_time | time
handoff_day | varchar(10) -- NULL for daily rotation
created_at | timestamp
Table: schedule_participants
schedule_id (FK) | uuid
user_id (FK) | uuid
rotation_order | int
(composite PK: schedule_id + user_id)
Table: schedule_overrides
override_id (PK) | uuid
schedule_id (FK) | uuid
user_id | uuid -- who is covering
start_time | timestamp
end_time | timestamp
reason | text
created_by | uuid
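The tables above are enough to resolve who is on-call at any instant. A sketch of that resolution for a weekly rotation with overrides (the anchor date and field names are illustrative assumptions; a real implementation must also convert into the schedule's timezone before computing handoffs):

```python
from datetime import datetime, timedelta

def on_call_at(participants: list[str], anchor: datetime, at: datetime,
               overrides: list[dict]) -> str:
    """participants in rotation_order; anchor = any known shift-start handoff.
    Overrides (start/end/user_id) take precedence over the base rotation."""
    for ov in overrides:
        if ov["start"] <= at < ov["end"]:
            return ov["user_id"]
    # Whole weeks elapsed since the anchor handoff selects the rotation slot.
    weeks_elapsed = (at - anchor) // timedelta(weeks=1)
    return participants[weeks_elapsed % len(participants)]
```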
Escalation Policies (PostgreSQL)
Table: escalation_policies
policy_id (PK) | uuid
team_id (FK) | uuid
name | varchar(200)
repeat_count | int -- how many times to cycle through all levels (0 = once)
Table: escalation_levels
policy_id (FK) | uuid
level_order | int -- 1 = primary, 2 = secondary, etc.
schedule_id (FK) | uuid -- which schedule to page
timeout_minutes | int -- wait this long before escalating to next level
(composite PK: policy_id + level_order)
Incidents (PostgreSQL)
Table: incidents
incident_id (PK) | uuid
policy_id (FK) | uuid
dedup_key | varchar(200) -- indexed, for grouping
severity | enum('critical', 'warning', 'info')
summary | text
source | varchar(100)
details | jsonb
status | enum('triggered', 'acknowledged', 'resolved')
current_level | int -- current escalation level
assigned_to (FK) | uuid -- current responder
triggered_at | timestamp
acknowledged_at | timestamp
resolved_at | timestamp
Table: incident_timeline
entry_id (PK) | uuid
incident_id (FK) | uuid
event_type | enum('triggered', 'notified', 'delivery_success', 'delivery_failed',
'acknowledged', 'escalated', 'resolved', 'note_added')
user_id | uuid
channel | varchar(20) -- 'push', 'sms', 'phone', 'email'
details | jsonb
created_at | timestamp
Delivery Queue (Redis + PostgreSQL)
// Redis sorted set for timed deliveries
ZADD delivery_queue {delivery_time_epoch} {delivery_job_json}
// Delivery job:
{
"incident_id": "inc_abc",
"user_id": "alice",
"channel": "push",
"attempt": 1,
"scheduled_at": 1708632090
}
Why PostgreSQL + Redis?
- PostgreSQL: ACID for incident state (acknowledgment must be atomic, escalation must be consistent)
- Redis: Sorted set as a reliable delay queue for timed escalations (check every second for due deliveries)
- Both: replicated across multiple availability zones
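The sorted-set delay queue behaves like the in-memory model below. This is a sketch using a heap in place of the Redis ZADD/ZRANGEBYSCORE/ZREM calls, so the scheduling and cancellation semantics can be seen in isolation; the real scheduler polls Redis once per second:

```python
import heapq
import json

class DelayQueue:
    """Models ZADD delivery_queue {epoch} {job} plus polling for due jobs."""
    def __init__(self):
        self._heap: list[tuple[float, str]] = []
        self._cancelled: set[str] = set()

    def enqueue(self, run_at: float, job: dict) -> None:      # ~ ZADD
        heapq.heappush(self._heap, (run_at, json.dumps(job, sort_keys=True)))

    def cancel(self, job: dict) -> None:                      # ~ ZREM
        self._cancelled.add(json.dumps(job, sort_keys=True))

    def pop_due(self, now: float) -> list[dict]:              # ~ ZRANGEBYSCORE -inf now
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, raw = heapq.heappop(self._heap)
            if raw not in self._cancelled:                    # skip acknowledged jobs
                due.append(json.loads(raw))
        return due
```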
5. High-Level Design (12 min)
Architecture
Alert Sources (Datadog, Prometheus, Custom)
→ POST /v1/alerts
→ API Gateway (multi-region, active-active)
→ Alert Ingestion Service:
1. Validate alert payload
2. Dedup check: is there an active incident for this dedup_key?
→ If yes: group (add event to existing incident)
→ If no: create new incident
3. Resolve escalation policy → find current on-call user
4. Enqueue notification delivery
Notification Pipeline:
→ Delivery Scheduler (reads from Redis sorted set):
Every 1 second: pop all jobs where scheduled_at <= now
→ For each job:
1. Determine channel (push → SMS → phone, based on user preferences)
2. Send via channel provider:
→ Push: Firebase Cloud Messaging / Apple Push Notification
→ SMS: Twilio
→ Phone: Twilio (voice call with TTS)
3. Log delivery attempt in incident_timeline
4. If delivery fails → retry on next channel
→ Escalation Timer:
When incident is triggered at level 1 with 5-min timeout:
→ Schedule escalation job at (triggered_at + 5 min) in Redis sorted set
→ If acknowledged before timeout: cancel escalation job
→ If not acknowledged: escalation job fires → page level 2 on-call
Acknowledgment Flow:
User receives push notification / SMS / phone call
→ Taps "Acknowledge" in app / replies "ACK" to SMS / presses 1 on phone
→ POST /v1/incidents/{id}/acknowledge
→ Update incident status to "acknowledged"
→ Cancel pending escalation timer
→ Log in timeline
→ Notify team channel (Slack): "Alice acknowledged database CPU alert"
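The acknowledgment step above must update incident status and cancel the escalation timer together, and duplicate ACKs (app tap plus SMS reply) must be harmless. A sketch of that transition, with a lock standing in for a PostgreSQL row-level transaction:

```python
import threading

class IncidentManager:
    def __init__(self, cancel_escalation):
        self._lock = threading.Lock()
        self._incidents: dict[str, dict] = {}
        self._cancel_escalation = cancel_escalation   # e.g. ZREM on the timer queue

    def trigger(self, incident_id: str, assigned_to: str) -> None:
        self._incidents[incident_id] = {"status": "triggered",
                                        "assigned_to": assigned_to}

    def acknowledge(self, incident_id: str, user_id: str) -> bool:
        """Returns True only for the call that performed the transition."""
        with self._lock:
            inc = self._incidents[incident_id]
            if inc["status"] != "triggered":
                return False                          # duplicate or late ACK: no-op
            inc["status"] = "acknowledged"
            inc["acknowledged_by"] = user_id
        self._cancel_escalation(incident_id)          # stop pending pages
        return True
```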
Multi-Region Deployment (Critical for 99.999%)
Region A (us-east-1): Region B (us-west-2):
API Gateway (active) API Gateway (active)
Alert Ingestion Service Alert Ingestion Service
Delivery Scheduler Delivery Scheduler
PostgreSQL Primary →→→→→→→→→→→→→→→→ PostgreSQL Replica (standby)
Redis Primary →→→→→→→→→→→→→→→→→→→→→ Redis Replica (standby)
Both regions can accept alerts independently.
PostgreSQL: synchronous replication (strong consistency for incidents)
Failover: if Region A is down, Region B promotes to primary within 30 seconds.
Delivery providers (Twilio, FCM) are external — not affected by our region failures.
Components
- Alert Ingestion Service: Receives alerts, deduplicates, creates incidents. Stateless, auto-scaled.
- Schedule Resolver: Given a team + time, determine who is on-call. Evaluates rotation rules, applies overrides. Caches resolved schedule in Redis (5-minute TTL).
- Delivery Scheduler: Reads from Redis sorted set every second. Dispatches notification jobs to channel-specific workers.
- Push Worker: Sends push notifications via FCM/APNS. Handles token management and delivery receipts.
- SMS Worker: Sends SMS via Twilio. Handles delivery status callbacks. Supports 2-way SMS (reply “ACK” to acknowledge).
- Voice Worker: Initiates phone calls via Twilio Voice API. Text-to-speech reads alert summary. User presses 1 to acknowledge, 2 to escalate.
- Escalation Engine: Manages escalation timers. Cancels timers on acknowledgment. Fires escalation (page next level) on timeout.
- Incident Manager: Tracks incident lifecycle. Provides API for acknowledge, resolve, add notes, manual escalation.
6. Deep Dives (15 min)
Deep Dive 1: Reliable Alert Delivery — Never Miss a Page
The problem: At 3 AM, a critical production database goes down. The on-call engineer must be woken up within 60 seconds. If the push notification doesn’t wake them, SMS must follow. If SMS fails, a phone call. If no one responds, escalate. This must work every single time.
Multi-channel delivery with fallback:
User preference: { primary: "push", fallback_1: "sms", fallback_2: "phone" }
Timeouts: push (30s), sms (60s), phone (90s)
Delivery flow for a critical alert:
T+0s: Send push notification
T+30s: If not acknowledged → send SMS
T+60s: If not acknowledged → initiate phone call
T+90s: If not acknowledged → escalate to next level
Implementation:
1. Immediately enqueue: push delivery at T+0
2. Enqueue: SMS delivery at T+30 (in Redis sorted set)
3. Enqueue: phone delivery at T+60
4. Enqueue: escalation at T+90
On acknowledgment at any point:
→ Cancel all pending delivery jobs for this incident+user
→ Mark SMS/phone jobs as "cancelled" (idempotent)
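The T+0/T+30/T+60/T+90 plan can be built as a batch of queue jobs in one shot. A sketch, with the default channels and offsets taken from the flow above (real offsets come from the user's per-channel timeout preferences):

```python
def build_delivery_plan(incident_id: str, user_id: str,
                        channels=("push", "sms", "phone"),
                        offsets=(0, 30, 60), escalate_at=90) -> list[dict]:
    """One job per fallback channel plus the escalation job; all are
    enqueued up front and cancelled together on acknowledgment."""
    jobs = [{"incident_id": incident_id, "user_id": user_id,
             "channel": ch, "run_at_offset_s": off}
            for ch, off in zip(channels, offsets)]
    jobs.append({"incident_id": incident_id, "user_id": user_id,
                 "channel": "escalate", "run_at_offset_s": escalate_at})
    return jobs
```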
Delivery reliability per channel:
Push notifications:
- FCM/APNS delivery rate: ~97% (3% lost due to battery optimization, connection issues)
- Not reliable enough alone for critical alerts
- No delivery confirmation from device (only "sent to provider")
- Mitigation: always follow up with SMS for critical alerts after 30s
SMS:
- Delivery rate: ~99.5% (0.5% lost due to carrier issues, phone off)
- Delivery receipt available (Twilio callback)
- More reliable than push but slower (5-30 second delivery)
- Cost: $0.0075 per SMS
Phone call:
- Delivery rate: ~99.9% (0.1% failure due to phone off, no signal)
- Hard to ignore: the phone rings until answered or sent to voicemail
- Most expensive: $0.013/min
- Most disruptive: guaranteed to wake someone up
Strategy by severity:
Critical: push + SMS simultaneously, phone after 60s
Warning: push, SMS after 120s
Info: push only, no escalation
Preventing delivery failures:
1. Redundant delivery providers:
Primary SMS: Twilio
Fallback SMS: Vonage (if Twilio is down)
Primary Voice: Twilio
Fallback Voice: Amazon Connect
If primary provider returns error → immediately retry on fallback
Provider health check: ping every 30 seconds, auto-switch if unhealthy
2. Delivery job persistence:
Every delivery job written to PostgreSQL AND Redis
Redis is the fast queue; PostgreSQL is the durable backup
Reconciliation job every 60 seconds: compare Redis queue with PostgreSQL
If a job exists in PostgreSQL but not Redis (Redis lost it) → re-enqueue
3. At-least-once delivery guarantee:
Delivery workers are idempotent (sending duplicate SMS is better than not sending)
Each delivery attempt logged with unique attempt_id
Dedup at provider level (Twilio's idempotency key)
Deep Dive 2: Escalation Engine and Timing Guarantees
The problem: If the primary on-call doesn’t acknowledge within 5 minutes, escalate to secondary. Timing must be precise — a 5-minute timeout must fire between 4:55 and 5:05, not at 5:30.
Escalation timer implementation:
When incident triggers at level 1:
escalation_policy = {
levels: [
{ order: 1, schedule: "primary", timeout: 5min },
{ order: 2, schedule: "secondary", timeout: 10min },
{ order: 3, schedule: "manager", timeout: 15min }
],
repeat: 1 // after cycling all levels, repeat once more
}
1. Page level 1 (primary on-call: Alice)
2. Schedule escalation: "escalate inc_abc to level 2 at T+5min"
→ ZADD escalation_queue (now + 300) "inc_abc:level:2"
If Alice acknowledges at T+3min:
→ ZREM escalation_queue "inc_abc:level:2" // cancel escalation
→ Done
If T+5min passes without acknowledgment:
→ Escalation worker picks up job from sorted set
→ Resolve level 2 on-call: Bob
→ Page Bob (same multi-channel delivery flow)
→ Schedule next escalation: "escalate to level 3 at T+15min"
→ Update incident: current_level = 2, assigned_to = Bob
If all levels exhausted and repeat > 0:
→ Start over from level 1 (re-page Alice)
→ Decrement repeat counter
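The escalation walk above (advance through levels, then optionally repeat the cycle) reduces to a small pure function. A sketch; the return value is the next level to page plus the remaining repeat budget, or None when the policy is exhausted:

```python
def next_escalation(current_level: int, num_levels: int,
                    repeats_left: int) -> "tuple[int, int] | None":
    """Given the level that just timed out without acknowledgment, return
    (next_level, repeats_left), or None when levels and repeats are spent."""
    if current_level < num_levels:
        return current_level + 1, repeats_left      # escalate to the next level
    if repeats_left > 0:
        return 1, repeats_left - 1                  # cycle again from level 1
    return None                                     # policy exhausted
```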
Timing precision:
Problem: Delivery scheduler polls Redis every 1 second.
Worst case timing error: 1 second (acceptable)
But what if the scheduler process crashes?
→ Standby scheduler in Region B takes over
→ PostgreSQL has the incident with escalation_due_at timestamp
→ Recovery: scan all active incidents, re-enqueue any missed escalations
→ Max delay: 30 seconds (failover time) + 1 second (polling) = 31 seconds (exceeds the normal 5-second precision target, but only during a rare regional failover)
Monitoring the escalation engine itself:
→ "Meta-alert": if escalation_queue has jobs overdue by > 60 seconds → alert the platform team
→ This meta-alert goes through a SEPARATE alerting path (direct Twilio call to CTO)
Edge cases:
1. Schedule handoff during incident:
Alert fires at 8:55 AM. Alice is on-call until 9:00 AM. Bob starts at 9:00 AM.
→ Alice gets paged (she's on-call at trigger time)
→ Schedules are re-resolved at page time, not trigger time
→ If the primary layer is paged again after 9:00 AM (e.g. on a repeat cycle), Bob gets it (correct: Alice's shift ended)
2. Override during incident:
Carol creates override to cover Alice from 8:00-10:00 PM
Alert fires at 8:30 PM
→ Carol gets paged (override takes precedence)
3. User has Do Not Disturb (quiet hours) set:
Alice sets quiet hours: 10 PM - 7 AM (info/warning only)
Critical alert at 3 AM:
→ Quiet hours DO NOT apply to critical alerts (always deliver)
→ Quiet hours suppress info/warning alerts (those wait until 7 AM or go to secondary)
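Quiet hours need care because the window usually crosses midnight. A sketch of the suppression check implementing edge case 3 (critical always delivers; default window taken from the example above):

```python
from datetime import time

def suppressed_by_quiet_hours(severity: str, now: time,
                              start: time = time(22, 0),
                              end: time = time(7, 0)) -> bool:
    """True if the alert should be held until quiet hours end.
    Critical alerts always deliver; the window may wrap past midnight."""
    if severity == "critical":
        return False
    if start <= end:                      # same-day window, e.g. 13:00-15:00
        return start <= now < end
    return now >= start or now < end      # overnight window, e.g. 22:00-07:00
```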
Deep Dive 3: Fatigue Prevention and Alert Quality
The problem: If an engineer gets 200 alerts in one night, they start ignoring them. Alert fatigue is the #1 cause of missed critical incidents. The system must prevent this.
Fatigue prevention mechanisms:
1. Alert grouping (dedup_key):
Database CPU high → incident created (dedup_key: "db-cpu-prod")
Database CPU high → same dedup_key → grouped into existing incident (no new page)
Database CPU high → same dedup_key → still grouped
Result: 1 page instead of 50 pages for the same issue
2. Alert suppression rules:
"If there's an active P1 incident for service X, suppress P3/P4 alerts for X"
Logic: downstream alerts are noise if the root cause is already being worked on
3. Per-user rate limiting:
Max 10 pages per hour per user
If exceeded: route to secondary on-call
Alert to manager: "Alice has been paged 10 times in the last hour"
4. Quiet hours (configurable per user):
Default: suppress info/warning during 10 PM - 7 AM
Critical alerts: always deliver (override quiet hours)
Warnings during quiet hours: queue and deliver at 7 AM as a batch
5. Snooze:
After acknowledging, user can "snooze" for 30 min
If the same dedup_key fires again within 30 min → don't re-page
After snooze expires: new alerts for same key trigger normally
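The per-user rate limit (mechanism 3 above) is a sliding-window count. A sketch with an in-memory deque per user; production would keep the counter in Redis so it is shared across delivery workers:

```python
from collections import deque

class PageRateLimiter:
    """Allow at most max_pages per user within a window_s-second window."""
    def __init__(self, max_pages: int = 10, window_s: int = 3600):
        self.max_pages = max_pages
        self.window_s = window_s
        self._pages: dict[str, deque] = {}

    def allow(self, user_id: str, now: float) -> bool:
        q = self._pages.setdefault(user_id, deque())
        while q and q[0] <= now - self.window_s:   # drop pages outside the window
            q.popleft()
        if len(q) >= self.max_pages:
            return False                           # route to secondary instead
        q.append(now)
        return True
```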
Alert quality scoring:
Track per alert source:
- Acknowledge rate: what % of alerts are acknowledged (vs. auto-resolved)?
- Time to acknowledge: how quickly do engineers respond?
- Time to resolve: how long is the incident open?
- False positive rate: alerts that resolve without human action (likely noise)
Score = (acknowledge_rate * 0.3) + ((1 - false_positive_rate) * 0.5) + (resolution_speed * 0.2), each factor normalized to 0-1
If score < 0.3 for a routing key:
→ Flag as "noisy alert" in dashboard
→ Suggest to team: "This alert fires 200 times/week with 80% false positive rate. Consider tuning the threshold or converting to warning severity."
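The scoring formula as a function (a sketch; resolution_speed is assumed to be pre-normalized to 0-1, which the weighted sum requires):

```python
def alert_quality_score(acknowledge_rate: float, false_positive_rate: float,
                        resolution_speed: float) -> float:
    """All inputs in [0, 1]; higher is better."""
    return (acknowledge_rate * 0.3
            + (1 - false_positive_rate) * 0.5
            + resolution_speed * 0.2)

def is_noisy(score: float) -> bool:
    """Scores below 0.3 get flagged in the dashboard as noisy alerts."""
    return score < 0.3
```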
Monthly alert quality report per team:
- Total alerts: 1,234
- Acknowledged: 856 (69%)
- False positives: 312 (25%)
- Mean time to acknowledge: 3.2 minutes
- Noisiest alerts: [list]
- Recommendation: tune 3 alerts to reduce noise by 40%
Incident post-mortem data:
After resolution, auto-generate incident summary:
Incident: Database CPU > 95%
Duration: 23 minutes (triggered → resolved)
Timeline:
18:00:00 Triggered (CPU at 97.3%)
18:00:05 Push notification sent to Alice
18:00:32 SMS sent to Alice (push not acknowledged)
18:01:30 Alice acknowledged via SMS reply
18:01:45 Alice added note: "Investigating runaway query"
18:12:00 Alice added note: "Killed query, CPU dropping"
18:23:00 Alice resolved: "Root cause: unoptimized JOIN in new deployment"
Responder: Alice (primary on-call)
Escalations: 0
Related alerts: 3 (grouped by dedup_key)
7. Extensions (2 min)
- Slack/Teams integration: Dedicated incident channel auto-created on trigger. All timeline events posted. Responders can acknowledge and add notes directly from Slack. War room coordination for major incidents.
- Intelligent routing (ML): Learn which engineers are best at resolving which types of incidents. If a database alert fires, route to the DBA who resolved similar issues fastest, even if they’re secondary on-call.
- On-call compensation tracking: Track hours spent on-call, incidents responded to, and sleep interruptions. Generate fair compensation reports. Identify teams with disproportionate on-call burden.
- Runbook automation: Attach runbooks to alert types. When an alert fires, present the relevant runbook steps in the incident channel. One-click execution of common remediation steps (restart service, scale up, rollback deployment).
- Global incident coordination: For company-wide outages affecting multiple teams, create a “major incident” that links related team incidents. Assign an incident commander. Broadcast status updates to all stakeholders. Auto-generate status page updates.