1. Requirements & Scope (5 min)
Functional Requirements
- Producers publish messages to named exchanges, which route messages to queues based on routing rules
- Consumers subscribe to queues and receive messages with acknowledgment-based delivery (message stays in queue until acked)
- Support multiple exchange types: direct (exact routing key match), topic (pattern matching), fanout (broadcast to all bound queues)
- Dead letter queues: messages that fail processing after N retries are moved to a DLQ for inspection
- Message persistence: critical messages survive broker restarts (durable queues + persistent messages)
Non-Functional Requirements
- Availability: 99.99% — the message queue is infrastructure that other services depend on. Downtime cascades.
- Latency: < 5ms for message publish (broker acknowledges to producer). < 1ms for message delivery to a connected consumer.
- Ordering: Messages within a single queue are delivered in FIFO order. No ordering guarantees across queues.
- Delivery guarantees: At-least-once by default (ack-based). At-most-once available (auto-ack). Exactly-once achievable with idempotent consumers.
- Scale: 100K messages/sec ingestion, 10,000 queues, 50,000 consumers, 10M messages in flight at any time
2. Estimation (3 min)
Traffic
- Message publish rate: 100K messages/sec
- Average message size: 2KB (headers + body)
- Ingestion bandwidth: 100K x 2KB = 200 MB/sec
- Consumer delivery rate: 100K messages/sec (steady state: publish rate = consume rate)
- With replication (RF=2): internal bandwidth = 200 MB/sec additional
Storage
- Messages in flight (unconsumed): 10M messages x 2KB = 20 GB
- Peak backlog (consumer down for 1 hour): 100K/sec x 3600 x 2KB = 720 GB
- This is the sizing case for disk: must handle 1-hour consumer outages without data loss
- Per broker: 6 brokers at RF=2 → ~7 GB in-flight + up to 240 GB backlog per broker during incidents (720 GB x 2 replicas / 6 brokers)
- Disk: 500 GB SSD per broker with headroom
Memory
- Message metadata index: 10M messages x 100 bytes = 1 GB — fits easily in RAM
- Queue metadata: 10,000 queues x 1KB = 10 MB
- Consumer connection state: 50,000 consumers x 2KB = 100 MB
- Total memory per broker: ~8 GB for metadata + OS page cache for message data
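The estimates above are easy to sanity-check in a few lines (a sketch; the constants are the workload assumptions from this section, not measurements, and sizes use decimal KB/GB to match the figures above):

```python
# Back-of-envelope sizing for the broker cluster (numbers from the estimates above).
MSG_RATE = 100_000      # messages/sec published
MSG_SIZE = 2_000        # 2 KB average (headers + body), decimal to match the estimates
IN_FLIGHT = 10_000_000  # unconsumed messages at any time
BROKERS = 6
RF = 2                  # replication factor

ingest_bw = MSG_RATE * MSG_SIZE                  # bytes/sec into the cluster
in_flight_bytes = IN_FLIGHT * MSG_SIZE           # steady-state unconsumed data
outage_backlog = MSG_RATE * 3600 * MSG_SIZE      # 1-hour consumer outage, the disk sizing case
per_broker_disk = outage_backlog * RF / BROKERS  # worst-case disk per broker

print(f"ingest:      {ingest_bw / 1e6:.0f} MB/s")        # 200 MB/s
print(f"in flight:   {in_flight_bytes / 1e9:.0f} GB")    # 20 GB
print(f"1h backlog:  {outage_backlog / 1e9:.0f} GB")     # 720 GB
print(f"disk/broker: {per_broker_disk / 1e9:.0f} GB")    # 240 GB → 500 GB SSD has headroom
```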
3. API Design (3 min)
Publisher API
// Declare an exchange
PUT /exchanges/{vhost}/{exchange_name}
Body: {
"type": "topic", // direct, topic, fanout, headers
"durable": true, // survive broker restart
"auto_delete": false // delete when no queues bound
}
// Publish a message
POST /exchanges/{vhost}/{exchange_name}/publish
Body: {
"routing_key": "order.created",
"properties": {
"delivery_mode": 2, // 1=transient, 2=persistent
"content_type": "application/json",
"message_id": "msg_abc123", // for deduplication
"correlation_id": "req_xyz", // for RPC-style patterns
"expiration": "60000", // TTL in milliseconds
"headers": {"x-retry-count": 0}
},
"payload": "{\"order_id\": 12345, \"amount\": 99.99}"
}
Response 200: { "routed": true } // message was routed to at least one queue
Consumer API (AMQP protocol, shown as pseudo-REST)
// Declare a queue
PUT /queues/{vhost}/{queue_name}
Body: {
"durable": true,
"exclusive": false, // exclusive to this connection
"auto_delete": false,
"arguments": {
"x-dead-letter-exchange": "dlx",
"x-dead-letter-routing-key": "dead.order",
"x-message-ttl": 300000, // 5 minutes
"x-max-length": 1000000, // max 1M messages
"x-max-priority": 10 // enable priority queue
}
}
// Bind queue to exchange
POST /bindings/{vhost}/e/{exchange}/q/{queue}
Body: { "routing_key": "order.*" } // pattern for topic exchange
// Consume messages (long-lived connection, push-based)
// In practice: AMQP channel with basic.consume
// Simplified:
GET /queues/{vhost}/{queue_name}/get?count=10&ack_mode=manual
Response 200: {
"messages": [
{
"delivery_tag": 1, // unique per channel, used for ack/nack
"exchange": "orders",
"routing_key": "order.created",
"properties": {...},
"payload": "{\"order_id\": 12345, ...}",
"redelivered": false
},
...
]
}
// Acknowledge message (consumed successfully)
POST /queues/{vhost}/{queue_name}/ack
Body: { "delivery_tag": 1, "multiple": false }
// Negative acknowledge (processing failed, requeue or dead-letter)
POST /queues/{vhost}/{queue_name}/nack
Body: { "delivery_tag": 1, "requeue": false } // false → route to DLQ
Key Decisions
- Push-based delivery (broker pushes to consumers) — lower latency than polling, consumer prefetch controls flow
- Manual acknowledgment by default — messages are not removed until consumer confirms processing success
- DLQ routing on nack — failed messages automatically move to dead letter queue for debugging
4. Data Model (3 min)
Message (on-disk and in-flight)
Message:
message_id | uuid -- publisher-assigned or broker-generated
exchange | string
routing_key | string
body | bytes -- up to 128MB (configurable)
properties:
delivery_mode | int (1 or 2) -- transient vs persistent
content_type | string
correlation_id | string
reply_to | string -- for RPC pattern
expiration | string -- TTL
timestamp | int64
priority | int (0-255)
headers | map<str,any>
metadata:
publish_seq | int64 -- publisher confirm sequence number
queue_position | int64 -- position in queue (for ordering)
delivery_count | int -- number of delivery attempts
first_death_exchange | string -- for DLQ: original exchange
first_death_queue | string -- for DLQ: original queue
first_death_reason | string -- rejected, expired, maxlen
Queue State (per queue, in memory + WAL)
Queue:
name | string (PK)
vhost | string
durable | boolean
state:
messages_ready | int -- messages waiting for delivery
messages_unacked | int -- delivered but not yet acknowledged
messages_total | int
consumers | int -- number of active consumers
head_position | int64 -- next message to deliver
tail_position | int64 -- where next publish lands
Message Index (in-memory):
TreeMap<queue_position, MessageRef>
MessageRef: {store_offset, size, expiry, priority}
For priority queues: use a priority heap instead of a simple FIFO
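A minimal sketch of the two index variants above (names like MessageRef follow the data model; positions are assigned monotonically, so a dict plus head/tail pointers behaves like the TreeMap):

```python
import heapq
from collections import namedtuple

# Mirrors MessageRef above: where the body lives on disk, plus the fields
# the queue needs without touching the message store.
MessageRef = namedtuple("MessageRef", "store_offset size expiry priority")

class FifoIndex:
    """TreeMap<queue_position, MessageRef>: positions only grow, so a dict
    with head/tail pointers gives the same FIFO behaviour."""
    def __init__(self):
        self.refs = {}
        self.head = 0                 # next position to deliver
        self.tail = 0                 # where the next publish lands

    def append(self, ref):
        self.refs[self.tail] = ref
        self.tail += 1
        return self.tail - 1          # queue_position of the new message

    def pop_next(self):
        if self.head == self.tail:
            return None               # queue empty
        pos, ref = self.head, self.refs.pop(self.head)
        self.head += 1
        return pos, ref

class PriorityIndex:
    """Priority-queue variant: a heap ordered by (-priority, position), so
    higher priority wins and FIFO order breaks ties within a priority."""
    def __init__(self):
        self.heap = []
        self.tail = 0

    def append(self, ref):
        heapq.heappush(self.heap, (-ref.priority, self.tail, ref))
        self.tail += 1

    def pop_next(self):
        if not self.heap:
            return None
        _, pos, ref = heapq.heappop(self.heap)
        return pos, ref
```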
Persistent Message Store (per broker, disk)
Message Store (append-only segments):
┌───────────────────────────────┐
│ Segment 1 (16MB) │
│ [msg1][msg2][msg3]... │
│ Segment 2 (16MB) │
│ [msg4][msg5]... │
└───────────────────────────────┘
On publish (persistent msg):
1. Append message body to current segment
2. fsync (or batch fsync every 100ms for throughput)
3. Add to queue's in-memory index
On acknowledge:
1. Mark message as acked in index
2. Segment compaction: when all messages in a segment are acked, delete segment
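The publish and acknowledge paths above can be sketched as follows (an illustrative stand-in, not a production store; it fsyncs every append for clarity, whereas the design batches fsyncs every 100ms):

```python
import os

SEGMENT_SIZE = 16 * 1024 * 1024   # 16MB per segment, as above

class SegmentStore:
    """Append-only message store: bodies go into fixed-size segment files; a
    sealed segment is deleted once every message in it has been acknowledged."""
    def __init__(self, directory):
        self.dir = directory
        self.seg_id = 0
        self.seg_bytes = 0
        self.live = {}                # segment_id -> count of un-acked messages
        self.current = open(self._path(0), "ab")

    def _path(self, seg_id):
        return os.path.join(self.dir, f"segment-{seg_id:08d}.log")

    def append(self, body: bytes):
        if self.seg_bytes + len(body) > SEGMENT_SIZE:
            self.current.close()      # roll to a new segment
            self.seg_id += 1
            self.seg_bytes = 0
            self.current = open(self._path(self.seg_id), "ab")
        offset = self.current.tell()
        self.current.write(body)
        self.current.flush()
        os.fsync(self.current.fileno())  # design batches this every 100ms instead
        self.seg_bytes += len(body)
        self.live[self.seg_id] = self.live.get(self.seg_id, 0) + 1
        return self.seg_id, offset    # (segment, offset) goes into the in-memory index

    def ack(self, seg_id):
        self.live[seg_id] -= 1
        if self.live[seg_id] == 0 and seg_id != self.seg_id:
            del self.live[seg_id]
            os.remove(self._path(seg_id))  # compaction: whole segment is acked
```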
Bindings (in metadata store):
Exchange → Queue mappings:
("orders", "topic") → [
("order-processing-queue", "order.created"),
("analytics-queue", "order.*"),
("audit-queue", "#") // # matches everything in topic exchange
]
Why These Choices
- Append-only segment files — high write throughput, sequential I/O. Similar to Kafka’s log segments.
- In-memory index + on-disk messages — index is small (100 bytes per message), messages can be large (KBs). Hot messages served from page cache, cold messages from disk.
- Per-queue ordering — each queue is an independent FIFO (or priority queue). No cross-queue coordination needed.
5. High-Level Design (12 min)
Architecture
Producers Consumers
│ ▲
│ AMQP / HTTP │ AMQP (push)
▼ │
┌──────────────────────────────────────────────────────┐
│ Load Balancer │
│ (TCP passthrough, sticky) │
└────────────────────────┬─────────────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Broker 1 │ │ Broker 2 │ │ Broker 3 │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Exchange │ │ │ │ Exchange │ │ │ │ Exchange │ │
│ │ Router │ │ │ │ Router │ │ │ │ Router │ │
│ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │
│ │ │ │ │ │ │ │ │
│ ┌────▼─────┐ │ │ ┌────▼─────┐ │ │ ┌────▼─────┐ │
│ │ Queue │ │ │ │ Queue │ │ │ │ Queue │ │
│ │ Manager │ │ │ │ Manager │ │ │ │ Manager │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ Q1 (lead) │ │ │ │ Q1 (mirr) │ │ │ │ Q4 (lead) │ │
│ │ Q2 (lead) │ │ │ │ Q3 (lead) │ │ │ │ Q5 (lead) │ │
│ │ Q3 (mirr) │ │ │ │ Q4 (mirr) │ │ │ │ Q2 (mirr) │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Message │ │ │ │ Message │ │ │ │ Message │ │
│ │ Store │ │ │ │ Store │ │ │ │ Store │ │
│ │ (Disk) │ │ │ │ (Disk) │ │ │ │ (Disk) │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└────────────────┼────────────────┘
│
┌──────────▼──────────┐
│ Cluster Coordinator │
│ (Raft / mnesia) │
│ - Queue leadership │
│ - Membership │
│ - Exchange/binding │
│ metadata │
└─────────────────────┘
Publish Flow
Producer → Broker (any broker via load balancer)
│
▼
1. Receive AMQP basic.publish frame
2. Exchange Router:
a. Look up exchange by name
b. Get all bindings for this exchange
c. Match routing_key against binding patterns:
- Direct: exact match (routing_key == binding_key)
- Topic: pattern match ("order.created" matches "order.*" and "#")
- Fanout: all bound queues (ignore routing key)
d. Result: list of queues to route to
3. For each target queue:
a. If queue leader is on THIS broker:
→ Append to local message store (disk write if persistent)
→ Add to queue's in-memory index
→ Replicate to mirror brokers (sync or async based on config)
b. If queue leader is on ANOTHER broker:
→ Forward message to that broker (internal cluster protocol)
4. Publisher confirm:
→ If publisher confirms enabled: send basic.ack to producer with sequence number
→ Only after message is persisted + replicated to mirrors (if ha-mode=all)
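The publisher-confirm bookkeeping in step 4 can be sketched from the producer's side (a simplification: the class name and retransmit policy are illustrative, but the seq-number and multiple-ack semantics follow AMQP basic.ack):

```python
class ConfirmTracker:
    """Producer-side publisher confirms: every publish gets a sequence number;
    the broker's basic.ack(seq, multiple) clears pending entries. Anything
    still pending after a timeout would be retransmitted (at-least-once)."""
    def __init__(self):
        self.next_seq = 1
        self.pending = {}              # seq -> message, awaiting broker ack

    def publish(self, message):
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = message    # keep for retransmit until confirmed
        return seq

    def on_ack(self, seq, multiple=False):
        if multiple:                   # broker confirmed everything <= seq
            for s in [s for s in self.pending if s <= seq]:
                del self.pending[s]
        else:
            self.pending.pop(seq, None)

    def unconfirmed(self):
        return sorted(self.pending)    # candidates for retransmission
```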
Consume Flow
Consumer → Broker (connects to any broker, but deliveries come from queue leader)
│
▼
1. Consumer subscribes: basic.consume(queue="order-processing")
2. Broker locates queue leader:
a. If leader is on THIS broker: deliver directly
b. If leader is on another broker: proxy the channel to leader broker
3. Queue leader delivers messages:
a. Check prefetch limit (QoS): consumer has capacity for more messages?
b. If yes: pop next message from queue head
c. Mark message as "unacked" (in flight to consumer)
d. Send basic.deliver frame to consumer
e. Start ack timeout timer (30 seconds default)
4. Consumer processes message:
a. Success → basic.ack(delivery_tag)
→ Broker removes message from queue and disk
b. Failure → basic.nack(delivery_tag, requeue=false)
→ Broker routes to dead letter exchange (if configured)
→ Or basic.nack(delivery_tag, requeue=true) → put back at head of queue
5. Timeout (consumer doesn't ack within 30s):
→ Message requeued (redelivered flag set to true)
→ Delivered to same or different consumer
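Steps 3–5 of the delivery loop can be sketched as a single-consumer queue leader (illustrative: real delivery is asynchronous per channel, and the timeout path reuses the same requeue logic shown in nack):

```python
from collections import deque

class QueueLeader:
    """Delivery loop from the consume flow above: respect the consumer's
    prefetch limit, track unacked deliveries, requeue with redelivered=True
    on nack (the ack-timeout path takes the same requeue branch)."""
    def __init__(self, prefetch=10):
        self.ready = deque()      # messages waiting for delivery
        self.unacked = {}         # delivery_tag -> message in flight
        self.prefetch = prefetch
        self.next_tag = 1

    def enqueue(self, payload):
        self.ready.append({"payload": payload, "redelivered": False})

    def deliver(self):
        """Deliver as many messages as the prefetch window allows."""
        delivered = []
        while self.ready and len(self.unacked) < self.prefetch:
            msg = self.ready.popleft()
            tag = self.next_tag
            self.next_tag += 1
            self.unacked[tag] = msg
            delivered.append((tag, msg))
        return delivered

    def ack(self, tag):
        del self.unacked[tag]     # message is gone for good

    def nack(self, tag, requeue=True):
        msg = self.unacked.pop(tag)
        if requeue:               # also the path taken on ack timeout
            msg["redelivered"] = True
            self.ready.appendleft(msg)
            return None
        return msg                # caller routes this to the DLX
```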
Components
- Brokers (3-6): Each broker handles connections, routing, queue management, and message storage. Queues are distributed across brokers — each queue has a “leader” broker.
- Exchange Router: In-memory routing table. Evaluates routing rules per exchange type. Direct exchange: O(1) hash lookup. Topic exchange: O(B) where B = number of bindings (trie-based optimization available). Fanout: O(Q) where Q = bound queues.
- Queue Manager: Manages queue state: ready messages, unacked messages, consumers, prefetch counters. One leader per queue, with mirrors on other brokers for HA.
- Message Store: Per-broker append-only segment files. Persistent messages fsync’d to disk. Segment compaction removes fully-acknowledged messages.
- Cluster Coordinator: Manages cluster membership, queue leader election, exchange/binding metadata. Uses Raft consensus (RabbitMQ 3.x used Mnesia/Erlang distribution).
- Management UI/API: HTTP API for monitoring: queue depths, message rates, consumer counts, connection status. Grafana dashboards for operations.
6. Deep Dives (15 min)
Deep Dive 1: Delivery Guarantees — At-Least-Once, At-Most-Once, Exactly-Once
At-most-once delivery:
Configuration:
- Publisher: no confirms (fire-and-forget)
- Consumer: auto-ack (basic.consume with no_ack=true)
Flow:
Producer → Broker: message arrives, broker acks immediately
Broker → Consumer: deliver message, remove from queue immediately
If consumer crashes after delivery but before processing → message lost
Use case: Logging, metrics, best-effort notifications
Pros: Highest throughput, lowest latency
Cons: Messages can be lost
At-least-once delivery (default, recommended):
Configuration:
- Publisher: confirm mode (basic.confirm)
- Consumer: manual ack (basic.ack after processing)
Flow:
Producer → Broker:
1. Send message with publisher confirm enabled
2. Broker persists to disk + replicates to mirrors
3. Broker sends basic.ack(seq=N) to producer
4. If producer doesn't receive ack → retransmit (may cause duplicate)
Broker → Consumer:
1. Deliver message (mark as unacked)
2. Consumer processes
3. Consumer sends basic.ack(delivery_tag)
4. Broker removes message
5. If consumer crashes before ack → broker redelivers (redelivered=true)
Duplicate scenarios:
a. Producer retransmits (network blip after broker persisted) → duplicate in queue
b. Consumer crashes after processing but before ack → duplicate delivery
Handling duplicates: consumers must be idempotent
Example: UPDATE orders SET status='shipped' WHERE order_id=123 AND status='confirmed'
Running this twice has the same effect as once → idempotent
Exactly-once delivery (end-to-end):
The broker cannot guarantee exactly-once alone. It requires cooperation:
Pattern 1: Idempotent Consumer (most common)
- At-least-once delivery + consumer deduplication
- Consumer maintains a set of processed message_ids
- On receive: if message_id already processed → skip
- message_id stored in the same DB transaction as the business operation:
BEGIN;
INSERT INTO processed_messages (message_id) VALUES ('msg_abc');
UPDATE orders SET status = 'shipped' WHERE order_id = 123;
COMMIT;
- If the INSERT fails (duplicate key) → skip processing
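Pattern 1 can be demonstrated end to end with sqlite3 as a stand-in for the consumer's database (table names follow the SQL above; the handler name is illustrative):

```python
import sqlite3

# In-memory stand-in for the consumer's database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO orders VALUES (123, 'confirmed')")

def handle(message_id, order_id):
    """Process a delivery idempotently: the dedup insert and the business
    update commit in the same transaction, so a redelivery hits the
    duplicate primary key, rolls back, and is skipped."""
    try:
        with db:  # one transaction: both statements commit or neither does
            db.execute("INSERT INTO processed_messages VALUES (?)", (message_id,))
            db.execute("UPDATE orders SET status='shipped' WHERE order_id=?",
                       (order_id,))
        return "processed"
    except sqlite3.IntegrityError:
        return "duplicate"   # already processed: ack the redelivery and move on
```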
Pattern 2: Outbox Pattern (for publish-side exactly-once)
- Instead of publishing directly to the queue:
BEGIN;
UPDATE orders SET status = 'paid' WHERE order_id = 123;
INSERT INTO outbox (message_id, exchange, routing_key, body) VALUES (...);
COMMIT;
- A separate CDC process reads the outbox table and publishes to the queue
- Even if the CDC process crashes and retries, the outbox row is written exactly once
Pattern 3: Transactional publish + consume (RabbitMQ AMQP transactions)
- tx.select, tx.commit (AMQP tx class)
- Groups publish + ack into an atomic transaction
- Very slow (~250x slower than non-transactional) — rarely used
- NOT recommended for high-throughput workloads
Deep Dive 2: Exchange Types and Routing
Direct Exchange:
Routing rule: exact match of routing_key to binding_key
Exchange "orders" (direct)
Binding: queue="order-processing", key="order.created"
Binding: queue="invoice-service", key="order.paid"
Binding: queue="shipping-service", key="order.shipped"
Message with routing_key="order.created" → routes to "order-processing" only
Message with routing_key="order.paid" → routes to "invoice-service" only
Implementation: HashMap<routing_key, List<Queue>>
Lookup: O(1)
Topic Exchange:
Routing rule: pattern matching with wildcards
* = exactly one word
# = zero or more words
Words separated by dots
Exchange "events" (topic)
Binding: queue="all-orders", key="order.#" → matches order.created, order.paid, order.shipped.us
Binding: queue="us-shipping", key="order.shipped.us" → matches order.shipped.us only
Binding: queue="everything", key="#" → matches everything
Message with routing_key="order.shipped.us":
→ "all-orders" (matches order.#)
→ "us-shipping" (matches order.shipped.us)
→ "everything" (matches #)
Implementation: Trie with wildcard nodes
Build trie from all binding patterns
On publish: traverse trie matching routing_key words
Complexity: O(W * B) where W = words in key, B = branching factor at wildcard nodes
In practice: very fast (< 1μs for typical patterns)
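The matching rule above can be written as a direct word-by-word recursion (a sketch for clarity; as noted, a real broker compiles all bindings into a trie instead of matching each pattern separately):

```python
def topic_match(pattern: str, routing_key: str) -> bool:
    """Match a routing key against a topic binding pattern.
    '*' matches exactly one word, '#' matches zero or more words."""
    def match(pat, key):
        if not pat:
            return not key                # both exhausted -> match
        if pat[0] == "#":
            # '#' swallows zero words, or one word and stays active
            return match(pat[1:], key) or (bool(key) and match(pat, key[1:]))
        if not key:
            return False                  # pattern has words left, key does not
        if pat[0] == "*" or pat[0] == key[0]:
            return match(pat[1:], key[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))
```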
Fanout Exchange:
Routing rule: broadcast to ALL bound queues (routing_key ignored)
Exchange "notifications" (fanout)
Binding: queue="email-service"
Binding: queue="push-service"
Binding: queue="sms-service"
Any message → routes to all three queues
Implementation: List<Queue>
Lookup: O(Q) where Q = number of bound queues
Use case: broadcast events, pub/sub pattern
Headers Exchange (advanced):
Routing rule: match on message header values (not routing key); values are compared by equality
x-match=all → every bound header must match; x-match=any → at least one must match
Binding: queue="premium-orders", headers={"x-customer-tier": "premium", "x-match": "all"}
Binding: queue="bulk-orders", headers={"x-order-size": "bulk", "x-match": "any"}
Message with headers {"x-customer-tier": "premium", "x-order-size": "single"}:
→ "premium-orders" (all bound headers match)
→ NOT "bulk-orders" (no bound header matches)
Use case: Content-based routing without encoding routing info in the key.
Note: standard headers exchanges do equality matching only; numeric comparisons like ">1000" are not supported
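Headers matching reduces to a few lines (a simplified sketch: bindings hold plain key/value pairs plus the x-match directive, and values are compared by equality only, per standard headers-exchange semantics):

```python
def headers_match(binding: dict, message_headers: dict) -> bool:
    """Headers-exchange matching: every binding header except the x-match
    directive is compared by equality against the message headers.
    x-match='all' requires every pair to match, 'any' requires at least one."""
    x_match = binding.get("x-match", "all")
    pairs = [(k, v) for k, v in binding.items() if k != "x-match"]
    hits = [message_headers.get(k) == v for k, v in pairs]
    return all(hits) if x_match == "all" else any(hits)
```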
Deep Dive 3: Dead Letter Queues and Retry Patterns
Dead Letter Exchange (DLX):
Messages are "dead lettered" when:
1. Consumer nacks with requeue=false (explicit rejection)
2. Message TTL expires (x-message-ttl on queue or per-message expiration)
3. Queue length exceeded (x-max-length, oldest messages evicted)
Configuration:
Queue "order-processing":
x-dead-letter-exchange: "dlx"
x-dead-letter-routing-key: "dead.order"
x-message-ttl: 300000 // messages expire after 5 minutes
Exchange "dlx" (direct):
Binding: queue="dead-letter-orders", key="dead.order"
When a message is dead lettered:
- Original exchange/routing_key preserved in x-death headers
- Death reason recorded: rejected, expired, maxlen
- Death count incremented
- Message routed to DLX with the specified routing key
Retry pattern with exponential backoff:
Problem: Consumer fails to process a message. We want to retry with increasing delays
(1s, 5s, 30s, 5min) before giving up and dead-lettering.
Solution: Chain of delay queues using TTL + DLX:
┌──────────────────┐ nack ┌─────────────────────┐
│ order-processing │ ────────→ │ retry-1s (TTL=1000) │
│ (main queue) │ │ DLX → retry-exchange│
└──────────────────┘ └─────────┬───────────┘
▲ │ expires
│ ▼
│ ┌──────────────────┐
└────────────────────────│ retry-exchange │
│ routes back to │
│ order-processing │
└──────────────────┘
With multiple retry tiers:
Attempt 1 failure → retry-1s queue (TTL 1 second)
Attempt 2 failure → retry-5s queue (TTL 5 seconds)
Attempt 3 failure → retry-30s queue (TTL 30 seconds)
Attempt 4 failure → retry-5min queue (TTL 5 minutes)
Attempt 5 failure → dead-letter-orders queue (final, for human inspection)
Consumer logic:
x_retry_count = message.headers.get("x-retry-count", 0)
try:
    process(message)
    ack(message)
except Exception:
    if x_retry_count >= MAX_RETRIES:
        nack(message, requeue=False)  # → DLQ
    else:
        # Publish to the appropriate retry queue with incremented retry count
        publish(retry_exchange, message,
                headers={"x-retry-count": x_retry_count + 1})
        ack(message)  # ack the original
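Selecting "the appropriate retry queue" is a lookup against the tier table above (queue names and TTLs are the ones listed; the function name is illustrative):

```python
# Retry tiers from the table above: retry count -> (delay queue, TTL in ms).
RETRY_TIERS = [
    ("retry-1s", 1_000),
    ("retry-5s", 5_000),
    ("retry-30s", 30_000),
    ("retry-5min", 300_000),
]

def next_hop(retry_count: int):
    """Pick where a failed message goes next: a delay queue for the first
    four failures, then the dead-letter queue for human inspection."""
    if retry_count < len(RETRY_TIERS):
        return RETRY_TIERS[retry_count]
    return ("dead-letter-orders", None)   # final stop, no further retries
```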
DLQ monitoring and reprocessing:
Operations dashboard:
- Alert when DLQ depth > 0 (messages failing processing)
- DLQ messages include original exchange, routing key, death reason, death count
- Engineers inspect DLQ messages to diagnose the failure
Reprocessing (after fixing the bug):
POST /queues/{vhost}/dead-letter-orders/reprocess
→ Move all messages from DLQ back to original queue
→ Implemented as: consume from DLQ, publish to original exchange with original routing key
Automatic reprocessing:
- Schedule a job that retries DLQ messages once per hour
- If the message succeeds → done
- If it fails again → put back in DLQ
- After 24 hours in DLQ → archive to cold storage and alert on-call
7. Extensions (2 min)
- Priority queues: Messages with higher priority are delivered before lower-priority messages, even if they arrived later. Implemented as N internal sub-queues (one per priority level) with highest-priority sub-queue drained first. Trade-off: O(P) overhead per dequeue where P = priority levels.
- Delayed messages (scheduled delivery): Publish a message now but deliver it at a future time (e.g., “send reminder email in 30 minutes”). Implemented via a plugin that holds messages in a delay store and moves them to the target queue when the delivery time arrives. Alternative: use TTL on intermediate queues as described in the retry pattern.
- Message deduplication: Broker-level dedup using message_id. Maintain a Bloom filter or TTL-bounded set of recent message_ids. If a duplicate arrives within the dedup window → silently discard. Prevents duplicate queue entries from producer retries.
- Shovel and federation (multi-cluster): Replicate messages across geographically distributed RabbitMQ clusters. Shovel: move messages from a queue on Cluster A to an exchange on Cluster B (one-directional). Federation: exchanges and queues on Cluster B automatically receive messages from Cluster A (declarative, policy-based).
- Stream queues (RabbitMQ Streams): Append-only log semantics (Kafka-like) within RabbitMQ. Consumers can read from any offset, replay messages, and multiple consumer groups can read the same stream independently. Best of both worlds: AMQP routing flexibility + Kafka-style replay and consumer groups.