1. Requirements & Scope (5 min)
Functional Requirements
- Producers publish messages to named exchanges, which route messages to queues based on routing rules
- Consumers subscribe to queues and receive messages with acknowledgment-based delivery (message stays in queue until acked)
- Support multiple exchange types: direct (exact routing key match), topic (pattern matching), fanout (broadcast to all bound queues)
- Dead letter queues: messages that fail processing after N retries are moved to a DLQ for inspection
- Message persistence: critical messages survive broker restarts (durable queues + persistent messages)
Non-Functional Requirements
- Availability: 99.99% — the message queue is infrastructure that other services depend on. Downtime cascades.
- Latency: < 5ms for message publish (broker acknowledges to producer). < 1ms for message delivery to a connected consumer.
- Ordering: Messages within a single queue are delivered in FIFO order. No ordering guarantees across queues.
- Delivery guarantees: At-least-once by default (ack-based). At-most-once available (auto-ack). Exactly-once achievable with idempotent consumers.
- Scale: 100K messages/sec ingestion, 10,000 queues, 50,000 consumers, 10M messages in flight at any time
2. Estimation (3 min)
Traffic
- Message publish rate: 100K messages/sec
- Average message size: 2KB (headers + body)
- Ingestion bandwidth: 100K x 2KB = 200 MB/sec
- Consumer delivery rate: 100K messages/sec (steady state: publish rate = consume rate)
- With replication (RF=2): internal bandwidth = 200 MB/sec additional
Storage
- Messages in flight (unconsumed): 10M messages x 2KB = 20 GB
- Peak backlog (consumer down for 1 hour): 100K/sec x 3600 x 2KB = 720 GB
- This is the sizing case for disk: must handle 1-hour consumer outages without data loss
- Per broker: 6 brokers at RF=2 → ~7 GB in-flight + up to 240 GB backlog per broker during incidents (720 GB x 2 replicas / 6 brokers)
- Disk: 500 GB SSD per broker with headroom
Memory
- Message metadata index: 10M messages x 100 bytes = 1 GB — fits easily in RAM
- Queue metadata: 10,000 queues x 1KB = 10 MB
- Consumer connection state: 50,000 consumers x 2KB = 100 MB
- Total memory per broker: ~8 GB for metadata + OS page cache for message data
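The estimates above are easy to sanity-check in a few lines (a sketch; the constants are the workload assumptions from this section, not measurements, and sizes use decimal KB/GB to match the figures above):

```python
# Back-of-envelope sizing for the broker cluster (numbers from the estimates above).
MSG_RATE = 100_000      # messages/sec published
MSG_SIZE = 2_000        # 2 KB average (headers + body), decimal to match the estimates
IN_FLIGHT = 10_000_000  # unconsumed messages at any time
BROKERS = 6
RF = 2                  # replication factor

ingest_bw = MSG_RATE * MSG_SIZE                  # bytes/sec into the cluster
in_flight_bytes = IN_FLIGHT * MSG_SIZE           # steady-state unconsumed data
outage_backlog = MSG_RATE * 3600 * MSG_SIZE      # 1-hour consumer outage, the disk sizing case
per_broker_disk = outage_backlog * RF / BROKERS  # worst-case disk per broker

print(f"ingest:      {ingest_bw / 1e6:.0f} MB/s")        # 200 MB/s
print(f"in flight:   {in_flight_bytes / 1e9:.0f} GB")    # 20 GB
print(f"1h backlog:  {outage_backlog / 1e9:.0f} GB")     # 720 GB
print(f"disk/broker: {per_broker_disk / 1e9:.0f} GB")    # 240 GB → 500 GB SSD has headroom
```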
3. API Design (3 min)
Publisher API
// Declare an exchange
PUT /exchanges/{vhost}/{exchange_name}
Body: {
"type": "topic", // direct, topic, fanout, headers
"durable": true, // survive broker restart
"auto_delete": false // delete when no queues bound
}
// Publish a message
POST /exchanges/{vhost}/{exchange_name}/publish
Body: {
"routing_key": "order.created",
"properties": {
"delivery_mode": 2, // 1=transient, 2=persistent
"content_type": "application/json",
"message_id": "msg_abc123", // for deduplication
"correlation_id": "req_xyz", // for RPC-style patterns
"expiration": "60000", // TTL in milliseconds
"headers": {"x-retry-count": 0}
},
"payload": "{\"order_id\": 12345, \"amount\": 99.99}"
}
Response 200: { "routed": true } // message was routed to at least one queue
Consumer API (AMQP protocol, shown as pseudo-REST)
// Declare a queue
PUT /queues/{vhost}/{queue_name}
Body: {
"durable": true,
"exclusive": false, // exclusive to this connection
"auto_delete": false,
"arguments": {
"x-dead-letter-exchange": "dlx",
"x-dead-letter-routing-key": "dead.order",
"x-message-ttl": 300000, // 5 minutes
"x-max-length": 1000000, // max 1M messages
"x-max-priority": 10 // enable priority queue
}
}
// Bind queue to exchange
POST /bindings/{vhost}/e/{exchange}/q/{queue}
Body: { "routing_key": "order.*" } // pattern for topic exchange
// Consume messages (long-lived connection, push-based)
// In practice: AMQP channel with basic.consume
// Simplified:
GET /queues/{vhost}/{queue_name}/get?count=10&ack_mode=manual
Response 200: {
"messages": [
{
"delivery_tag": 1, // unique per channel, used for ack/nack
"exchange": "orders",
"routing_key": "order.created",
"properties": {...},
"payload": "{\"order_id\": 12345, ...}",
"redelivered": false
},
...
]
}
// Acknowledge message (consumed successfully)
POST /queues/{vhost}/{queue_name}/ack
Body: { "delivery_tag": 1, "multiple": false }
// Negative acknowledge (processing failed, requeue or dead-letter)
POST /queues/{vhost}/{queue_name}/nack
Body: { "delivery_tag": 1, "requeue": false } // false → route to DLQ
Key Decisions
- Push-based delivery (broker pushes to consumers) — lower latency than polling, consumer prefetch controls flow
- Manual acknowledgment by default — messages are not removed until consumer confirms processing success
- DLQ routing on nack — failed messages automatically move to dead letter queue for debugging
4. Data Model (3 min)
Message (on-disk and in-flight)
Message:
message_id | uuid -- publisher-assigned or broker-generated
exchange | string
routing_key | string
body | bytes -- up to 128MB (configurable)
properties:
delivery_mode | int (1 or 2) -- transient vs persistent
content_type | string
correlation_id | string
reply_to | string -- for RPC pattern
expiration | string -- TTL
timestamp | int64
priority | int (0-255)
headers | map<str,any>
metadata:
publish_seq | int64 -- publisher confirm sequence number
queue_position | int64 -- position in queue (for ordering)
delivery_count | int -- number of delivery attempts
first_death_exchange | string -- for DLQ: original exchange
first_death_queue | string -- for DLQ: original queue
first_death_reason | string -- rejected, expired, maxlen
Queue State (per queue, in memory + WAL)
Queue:
name | string (PK)
vhost | string
durable | boolean
state:
messages_ready | int -- messages waiting for delivery
messages_unacked | int -- delivered but not yet acknowledged
messages_total | int
consumers | int -- number of active consumers
head_position | int64 -- next message to deliver
tail_position | int64 -- where next publish lands
Message Index (in-memory):
TreeMap<queue_position, MessageRef>
MessageRef: {store_offset, size, expiry, priority}
For priority queues: use a priority heap instead of a simple FIFO
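A minimal sketch of the two index variants above (names like MessageRef follow the data model; positions are assigned monotonically, so a dict plus head/tail pointers behaves like the TreeMap):

```python
import heapq
from collections import namedtuple

# Mirrors MessageRef above: where the body lives on disk, plus the fields
# the queue needs without touching the message store.
MessageRef = namedtuple("MessageRef", "store_offset size expiry priority")

class FifoIndex:
    """TreeMap<queue_position, MessageRef>: positions only grow, so a dict
    with head/tail pointers gives the same FIFO behaviour."""
    def __init__(self):
        self.refs = {}
        self.head = 0                 # next position to deliver
        self.tail = 0                 # where the next publish lands

    def append(self, ref):
        self.refs[self.tail] = ref
        self.tail += 1
        return self.tail - 1          # queue_position of the new message

    def pop_next(self):
        if self.head == self.tail:
            return None               # queue empty
        pos, ref = self.head, self.refs.pop(self.head)
        self.head += 1
        return pos, ref

class PriorityIndex:
    """Priority-queue variant: a heap ordered by (-priority, position), so
    higher priority wins and FIFO order breaks ties within a priority."""
    def __init__(self):
        self.heap = []
        self.tail = 0

    def append(self, ref):
        heapq.heappush(self.heap, (-ref.priority, self.tail, ref))
        self.tail += 1

    def pop_next(self):
        if not self.heap:
            return None
        _, pos, ref = heapq.heappop(self.heap)
        return pos, ref
```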
Persistent Message Store (per broker, disk)
Message Store (append-only segments):
┌───────────────────────────────┐
│ Segment 1 (16MB) │
│ [msg1][msg2][msg3]... │
│ Segment 2 (16MB) │
│ [msg4][msg5]... │
└───────────────────────────────┘
On publish (persistent msg):
1. Append message body to current segment
2. fsync (or batch fsync every 100ms for throughput)
3. Add to queue's in-memory index
On acknowledge:
1. Mark message as acked in index
2. Segment compaction: when all messages in a segment are acked, delete segment
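The publish and acknowledge paths above can be sketched as follows (an illustrative stand-in, not a production store; it fsyncs every append for clarity, whereas the design batches fsyncs every 100ms):

```python
import os

SEGMENT_SIZE = 16 * 1024 * 1024   # 16MB per segment, as above

class SegmentStore:
    """Append-only message store: bodies go into fixed-size segment files; a
    sealed segment is deleted once every message in it has been acknowledged."""
    def __init__(self, directory):
        self.dir = directory
        self.seg_id = 0
        self.seg_bytes = 0
        self.live = {}                # segment_id -> count of un-acked messages
        self.current = open(self._path(0), "ab")

    def _path(self, seg_id):
        return os.path.join(self.dir, f"segment-{seg_id:08d}.log")

    def append(self, body: bytes):
        if self.seg_bytes + len(body) > SEGMENT_SIZE:
            self.current.close()      # roll to a new segment
            self.seg_id += 1
            self.seg_bytes = 0
            self.current = open(self._path(self.seg_id), "ab")
        offset = self.current.tell()
        self.current.write(body)
        self.current.flush()
        os.fsync(self.current.fileno())  # design batches this every 100ms instead
        self.seg_bytes += len(body)
        self.live[self.seg_id] = self.live.get(self.seg_id, 0) + 1
        return self.seg_id, offset    # (segment, offset) goes into the in-memory index

    def ack(self, seg_id):
        self.live[seg_id] -= 1
        if self.live[seg_id] == 0 and seg_id != self.seg_id:
            del self.live[seg_id]
            os.remove(self._path(seg_id))  # compaction: whole segment is acked
```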
Bindings (in metadata store):
Exchange → Queue mappings:
("orders", "topic") → [
("order-processing-queue", "order.created"),
("analytics-queue", "order.*"),
("audit-queue", "#") // # matches everything in topic exchange
]
Why These Choices
- Append-only segment files — high write throughput, sequential I/O. Similar to Kafka’s log segments.
- In-memory index + on-disk messages — index is small (100 bytes per message), messages can be large (KBs). Hot messages served from page cache, cold messages from disk.
- Per-queue ordering — each queue is an independent FIFO (or priority queue). No cross-queue coordination needed.
5. High-Level Design (12 min)
Architecture
Producers Consumers
│ ▲
│ AMQP / HTTP │ AMQP (push)
▼ │
┌──────────────────────────────────────────────────────┐
│ Load Balancer │
│ (TCP passthrough, sticky) │
└────────────────────────┬─────────────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Broker 1 │ │ Broker 2 │ │ Broker 3 │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Exchange │ │ │ │ Exchange │ │ │ │ Exchange │ │
│ │ Router │ │ │ │ Router │ │ │ │ Router │ │
│ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │
│ │ │ │ │ │ │ │ │
│ ┌────▼─────┐ │ │ ┌────▼─────┐ │ │ ┌────▼─────┐ │
│ │ Queue │ │ │ │ Queue │ │ │ │ Queue │ │
│ │ Manager │ │ │ │ Manager │ │ │ │ Manager │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ Q1 (lead) │ │ │ │ Q1 (mirr) │ │ │ │ Q4 (lead) │ │
│ │ Q2 (lead) │ │ │ │ Q3 (lead) │ │ │ │ Q5 (lead) │ │
│ │ Q3 (mirr) │ │ │ │ Q4 (mirr) │ │ │ │ Q2 (mirr) │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Message │ │ │ │ Message │ │ │ │ Message │ │
│ │ Store │ │ │ │ Store │ │ │ │ Store │ │
│ │ (Disk) │ │ │ │ (Disk) │ │ │ │ (Disk) │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└────────────────┼────────────────┘
│
┌──────────▼──────────┐
│ Cluster Coordinator │
│ (Raft / mnesia) │
│ - Queue leadership │
│ - Membership │
│ - Exchange/binding │
│ metadata │
└─────────────────────┘
Publish Flow
Producer → Broker (any broker via load balancer)
│
▼
1. Receive AMQP basic.publish frame
2. Exchange Router:
a. Look up exchange by name
b. Get all bindings for this exchange
c. Match routing_key against binding patterns:
- Direct: exact match (routing_key == binding_key)
- Topic: pattern match ("order.created" matches "order.*" and "#")
- Fanout: all bound queues (ignore routing key)
d. Result: list of queues to route to
3. For each target queue:
a. If queue leader is on THIS broker:
→ Append to local message store (disk write if persistent)
→ Add to queue's in-memory index
→ Replicate to mirror brokers (sync or async based on config)
b. If queue leader is on ANOTHER broker:
→ Forward message to that broker (internal cluster protocol)
4. Publisher confirm:
→ If publisher confirms enabled: send basic.ack to producer with sequence number
→ Only after message is persisted + replicated to mirrors (if ha-mode=all)
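The publisher-confirm bookkeeping in step 4 can be sketched from the producer's side (a simplification: the class name and retransmit policy are illustrative, but the seq-number and multiple-ack semantics follow AMQP basic.ack):

```python
class ConfirmTracker:
    """Producer-side publisher confirms: every publish gets a sequence number;
    the broker's basic.ack(seq, multiple) clears pending entries. Anything
    still pending after a timeout would be retransmitted (at-least-once)."""
    def __init__(self):
        self.next_seq = 1
        self.pending = {}              # seq -> message, awaiting broker ack

    def publish(self, message):
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = message    # keep for retransmit until confirmed
        return seq

    def on_ack(self, seq, multiple=False):
        if multiple:                   # broker confirmed everything <= seq
            for s in [s for s in self.pending if s <= seq]:
                del self.pending[s]
        else:
            self.pending.pop(seq, None)

    def unconfirmed(self):
        return sorted(self.pending)    # candidates for retransmission
```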
Consume Flow
Consumer → Broker (connects to any broker, but deliveries come from queue leader)
│
▼
1. Consumer subscribes: basic.consume(queue="order-processing")
2. Broker locates queue leader:
a. If leader is on THIS broker: deliver directly
b. If leader is on another broker: proxy the channel to leader broker
3. Queue leader delivers messages:
a. Check prefetch limit (QoS): consumer has capacity for more messages?
b. If yes: pop next message from queue head
c. Mark message as "unacked" (in flight to consumer)
d. Send basic.deliver frame to consumer
e. Start ack timeout timer (30 seconds default)
4. Consumer processes message:
a. Success → basic.ack(delivery_tag)
→ Broker removes message from queue and disk
b. Failure → basic.nack(delivery_tag, requeue=false)
→ Broker routes to dead letter exchange (if configured)
→ Or basic.nack(delivery_tag, requeue=true) → put back at head of queue
5. Timeout (consumer doesn't ack within 30s):
→ Message requeued (redelivered flag set to true)
→ Delivered to same or different consumer
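Steps 3–5 of the delivery loop can be sketched as a single-consumer queue leader (illustrative: real delivery is asynchronous per channel, and the timeout path reuses the same requeue logic shown in nack):

```python
from collections import deque

class QueueLeader:
    """Delivery loop from the consume flow above: respect the consumer's
    prefetch limit, track unacked deliveries, requeue with redelivered=True
    on nack (the ack-timeout path takes the same requeue branch)."""
    def __init__(self, prefetch=10):
        self.ready = deque()      # messages waiting for delivery
        self.unacked = {}         # delivery_tag -> message in flight
        self.prefetch = prefetch
        self.next_tag = 1

    def enqueue(self, payload):
        self.ready.append({"payload": payload, "redelivered": False})

    def deliver(self):
        """Deliver as many messages as the prefetch window allows."""
        delivered = []
        while self.ready and len(self.unacked) < self.prefetch:
            msg = self.ready.popleft()
            tag = self.next_tag
            self.next_tag += 1
            self.unacked[tag] = msg
            delivered.append((tag, msg))
        return delivered

    def ack(self, tag):
        del self.unacked[tag]     # message is gone for good

    def nack(self, tag, requeue=True):
        msg = self.unacked.pop(tag)
        if requeue:               # also the path taken on ack timeout
            msg["redelivered"] = True
            self.ready.appendleft(msg)
            return None
        return msg                # caller routes this to the DLX
```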
Components
- Brokers (3-6): Each broker handles connections, routing, queue management, and message storage. Queues are distributed across brokers — each queue has a “leader” broker.
- Exchange Router: In-memory routing table. Evaluates routing rules per exchange type. Direct exchange: O(1) hash lookup. Topic exchange: O(B) where B = number of bindings (trie-based optimization available). Fanout: O(Q) where Q = bound queues.
- Queue Manager: Manages queue state: ready messages, unacked messages, consumers, prefetch counters. One leader per queue, with mirrors on other brokers for HA.
- Message Store: Per-broker append-only segment files. Persistent messages fsync’d to disk. Segment compaction removes fully-acknowledged messages.
- Cluster Coordinator: Manages cluster membership, queue leader election, exchange/binding metadata. Uses Raft consensus (RabbitMQ 3.x used Mnesia/Erlang distribution).
- Management UI/API: HTTP API for monitoring: queue depths, message rates, consumer counts, connection status. Grafana dashboards for operations.
6. Deep Dives (15 min)
Deep Dive 1: Delivery Guarantees — At-Least-Once, At-Most-Once, Exactly-Once
At-most-once delivery:
Configuration:
- Publisher: no confirms (fire-and-forget)
- Consumer: auto-ack (basic.consume with no_ack=true)
Flow:
Producer → Broker: message arrives, broker acks immediately
Broker → Consumer: deliver message, remove from queue immediately
If consumer crashes after delivery but before processing → message lost
Use case: Logging, metrics, best-effort notifications
Pros: Highest throughput, lowest latency
Cons: Messages can be lost
At-least-once delivery (default, recommended):
Configuration:
- Publisher: confirm mode (basic.confirm)
- Consumer: manual ack (basic.ack after processing)
Flow:
Producer → Broker:
1. Send message with publisher confirm enabled
2. Broker persists to disk + replicates to mirrors
3. Broker sends basic.ack(seq=N) to producer
4. If producer doesn't receive ack → retransmit (may cause duplicate)
Broker → Consumer:
1. Deliver message (mark as unacked)
2. Consumer processes
3. Consumer sends basic.ack(delivery_tag)
4. Broker removes message
5. If consumer crashes before ack → broker redelivers (redelivered=true)
Duplicate scenarios:
a. Producer retransmits (network blip after broker persisted) → duplicate in queue
b. Consumer crashes after processing but before ack → duplicate delivery
Handling duplicates: consumers must be idempotent
Example: UPDATE orders SET status='shipped' WHERE order_id=123 AND status='confirmed'
Running this twice has the same effect as once → idempotent
Exactly-once delivery (end-to-end):
The broker cannot guarantee exactly-once alone. It requires cooperation:
Pattern 1: Idempotent Consumer (most common)
- At-least-once delivery + consumer deduplication
- Consumer maintains a set of processed message_ids
- On receive: if message_id already processed → skip
- message_id stored in the same DB transaction as the business operation:
BEGIN;
INSERT INTO processed_messages (message_id) VALUES ('msg_abc');
UPDATE orders SET status = 'shipped' WHERE order_id = 123;
COMMIT;
- If the INSERT fails (duplicate key) → skip processing
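Pattern 1 can be demonstrated end to end with sqlite3 as a stand-in for the consumer's database (table names follow the SQL above; the handler name is illustrative):

```python
import sqlite3

# In-memory stand-in for the consumer's database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO orders VALUES (123, 'confirmed')")

def handle(message_id, order_id):
    """Process a delivery idempotently: the dedup insert and the business
    update commit in the same transaction, so a redelivery hits the
    duplicate primary key, rolls back, and is skipped."""
    try:
        with db:  # one transaction: both statements commit or neither does
            db.execute("INSERT INTO processed_messages VALUES (?)", (message_id,))
            db.execute("UPDATE orders SET status='shipped' WHERE order_id=?",
                       (order_id,))
        return "processed"
    except sqlite3.IntegrityError:
        return "duplicate"   # already processed: ack the redelivery and move on
```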
Pattern 2: Outbox Pattern (for publish-side exactly-once)
- Instead of publishing directly to the queue:
BEGIN;
UPDATE orders SET status = 'paid' WHERE order_id = 123;
INSERT INTO outbox (message_id, exchange, routing_key, body) VALUES (...);
COMMIT;
- A separate CDC process reads the outbox table and publishes to the queue
- Even if the CDC process crashes and retries, the outbox row is written exactly once
Pattern 3: Transactional publish + consume (RabbitMQ AMQP transactions)
- tx.select, tx.commit (AMQP tx class)
- Groups publish + ack into an atomic transaction
- Very slow (~250x slower than non-transactional) — rarely used
- NOT recommended for high-throughput workloads
Deep Dive 2: Exchange Types and Routing
Direct Exchange:
Routing rule: exact match of routing_key to binding_key
Exchange "orders" (direct)
Binding: queue="order-processing", key="order.created"
Binding: queue="invoice-service", key="order.paid"
Binding: queue="shipping-service", key="order.shipped"
Message with routing_key="order.created" → routes to "order-processing" only
Message with routing_key="order.paid" → routes to "invoice-service" only
Implementation: HashMap<routing_key, List<Queue>>
Lookup: O(1)
Topic Exchange:
Routing rule: pattern matching with wildcards
* = exactly one word
# = zero or more words
Words separated by dots
Exchange "events" (topic)
Binding: queue="all-orders", key="order.#" → matches order.created, order.paid, order.shipped.us
Binding: queue="us-shipping", key="order.shipped.us" → matches order.shipped.us only
Binding: queue="everything", key="#" → matches everything
Message with routing_key="order.shipped.us":
→ "all-orders" (matches order.#)
→ "us-shipping" (matches order.shipped.us)
→ "everything" (matches #)
Implementation: Trie with wildcard nodes
Build trie from all binding patterns
On publish: traverse trie matching routing_key words
Complexity: O(W * B) where W = words in key, B = branching factor at wildcard nodes
In practice: very fast (< 1μs for typical patterns)
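The matching rule above can be written as a direct word-by-word recursion (a sketch for clarity; as noted, a real broker compiles all bindings into a trie instead of matching each pattern separately):

```python
def topic_match(pattern: str, routing_key: str) -> bool:
    """Match a routing key against a topic binding pattern.
    '*' matches exactly one word, '#' matches zero or more words."""
    def match(pat, key):
        if not pat:
            return not key                # both exhausted -> match
        if pat[0] == "#":
            # '#' swallows zero words, or one word and stays active
            return match(pat[1:], key) or (bool(key) and match(pat, key[1:]))
        if not key:
            return False                  # pattern has words left, key does not
        if pat[0] == "*" or pat[0] == key[0]:
            return match(pat[1:], key[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))
```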
Fanout Exchange:
Routing rule: broadcast to ALL bound queues (routing_key ignored)
Exchange "notifications" (fanout)
Binding: queue="email-service"
Binding: queue="push-service"
Binding: queue="sms-service"
Any message → routes to all three queues
Implementation: List<Queue>
Lookup: O(Q) where Q = number of bound queues
Use case: broadcast events, pub/sub pattern
Headers Exchange (advanced):
Routing rule: match on message header values (not routing key); values are compared by equality
x-match=all → every bound header must match; x-match=any → at least one must match
Binding: queue="premium-orders", headers={"x-customer-tier": "premium", "x-match": "all"}
Binding: queue="bulk-orders", headers={"x-order-size": "bulk", "x-match": "any"}
Message with headers {"x-customer-tier": "premium", "x-order-size": "single"}:
→ "premium-orders" (all bound headers match)
→ NOT "bulk-orders" (no bound header matches)
Use case: Content-based routing without encoding routing info in the key.
Note: standard headers exchanges do equality matching only; numeric comparisons like ">1000" are not supported
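Headers matching reduces to a few lines (a simplified sketch: bindings hold plain key/value pairs plus the x-match directive, and values are compared by equality only, per standard headers-exchange semantics):

```python
def headers_match(binding: dict, message_headers: dict) -> bool:
    """Headers-exchange matching: every binding header except the x-match
    directive is compared by equality against the message headers.
    x-match='all' requires every pair to match, 'any' requires at least one."""
    x_match = binding.get("x-match", "all")
    pairs = [(k, v) for k, v in binding.items() if k != "x-match"]
    hits = [message_headers.get(k) == v for k, v in pairs]
    return all(hits) if x_match == "all" else any(hits)
```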
Deep Dive 3: Dead Letter Queues and Retry Patterns
Dead Letter Exchange (DLX):
Messages are "dead lettered" when:
1. Consumer nacks with requeue=false (explicit rejection)
2. Message TTL expires (x-message-ttl on queue or per-message expiration)
3. Queue length exceeded (x-max-length, oldest messages evicted)
Configuration:
Queue "order-processing":
x-dead-letter-exchange: "dlx"
x-dead-letter-routing-key: "dead.order"
x-message-ttl: 300000 // messages expire after 5 minutes
Exchange "dlx" (direct):
Binding: queue="dead-letter-orders", key="dead.order"
When a message is dead lettered:
- Original exchange/routing_key preserved in x-death headers
- Death reason recorded: rejected, expired, maxlen
- Death count incremented
- Message routed to DLX with the specified routing key
Retry pattern with exponential backoff:
Problem: Consumer fails to process a message. We want to retry with increasing delays
(1s, 5s, 30s, 5min) before giving up and dead-lettering.
Solution: Chain of delay queues using TTL + DLX:
┌──────────────────┐ nack ┌─────────────────────┐
│ order-processing │ ────────→ │ retry-1s (TTL=1000) │
│ (main queue) │ │ DLX → retry-exchange│
└──────────────────┘ └─────────┬───────────┘
▲ │ expires
│ ▼
│ ┌──────────────────┐
└────────────────────────│ retry-exchange │
│ routes back to │
│ order-processing │
└──────────────────┘
With multiple retry tiers:
Attempt 1 failure → retry-1s queue (TTL 1 second)
Attempt 2 failure → retry-5s queue (TTL 5 seconds)
Attempt 3 failure → retry-30s queue (TTL 30 seconds)
Attempt 4 failure → retry-5min queue (TTL 5 minutes)
Attempt 5 failure → dead-letter-orders queue (final, for human inspection)
Consumer logic:
x_retry_count = message.headers.get("x-retry-count", 0)
try:
    process(message)
    ack(message)
except Exception:
    if x_retry_count >= MAX_RETRIES:
        nack(message, requeue=False)  # → DLQ
    else:
        # Publish to the appropriate retry queue with incremented retry count
        publish(retry_exchange, message,
                headers={"x-retry-count": x_retry_count + 1})
        ack(message)  # ack the original
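Selecting "the appropriate retry queue" is a lookup against the tier table above (queue names and TTLs are the ones listed; the function name is illustrative):

```python
# Retry tiers from the table above: retry count -> (delay queue, TTL in ms).
RETRY_TIERS = [
    ("retry-1s", 1_000),
    ("retry-5s", 5_000),
    ("retry-30s", 30_000),
    ("retry-5min", 300_000),
]

def next_hop(retry_count: int):
    """Pick where a failed message goes next: a delay queue for the first
    four failures, then the dead-letter queue for human inspection."""
    if retry_count < len(RETRY_TIERS):
        return RETRY_TIERS[retry_count]
    return ("dead-letter-orders", None)   # final stop, no further retries
```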
DLQ monitoring and reprocessing:
Operations dashboard:
- Alert when DLQ depth > 0 (messages failing processing)
- DLQ messages include original exchange, routing key, death reason, death count
- Engineers inspect DLQ messages to diagnose the failure
Reprocessing (after fixing the bug):
POST /queues/{vhost}/dead-letter-orders/reprocess
→ Move all messages from DLQ back to original queue
→ Implemented as: consume from DLQ, publish to original exchange with original routing key
Automatic reprocessing:
- Schedule a job that retries DLQ messages once per hour
- If the message succeeds → done
- If it fails again → put back in DLQ
- After 24 hours in DLQ → archive to cold storage and alert on-call
7. Extensions (2 min)
- Priority queues: Messages with higher priority are delivered before lower-priority messages, even if they arrived later. Implemented as N internal sub-queues (one per priority level) with highest-priority sub-queue drained first. Trade-off: O(P) overhead per dequeue where P = priority levels.
- Delayed messages (scheduled delivery): Publish a message now but deliver it at a future time (e.g., “send reminder email in 30 minutes”). Implemented via a plugin that holds messages in a delay store and moves them to the target queue when the delivery time arrives. Alternative: use TTL on intermediate queues as described in the retry pattern.
- Message deduplication: Broker-level dedup using message_id. Maintain a Bloom filter or TTL-bounded set of recent message_ids. If a duplicate arrives within the dedup window → silently discard. Prevents duplicate queue entries from producer retries.
- Shovel and federation (multi-cluster): Replicate messages across geographically distributed RabbitMQ clusters. Shovel: move messages from a queue on Cluster A to an exchange on Cluster B (one-directional). Federation: exchanges and queues on Cluster B automatically receive messages from Cluster A (declarative, policy-based).
- Stream queues (RabbitMQ Streams): Append-only log semantics (Kafka-like) within RabbitMQ. Consumers can read from any offset, replay messages, and multiple consumer groups can read the same stream independently. Best of both worlds: AMQP routing flexibility + Kafka-style replay and consumer groups.