Every team that has tried to build microservices with synchronous REST calls between services has hit the same wall. Service A calls Service B which calls Service C. Service C is slow, so Service B times out, so Service A retries, and now you have a cascading failure across three services because one database query took too long.
Event-driven architecture solves this by removing the synchronous coupling. Services produce events when things happen. Other services consume those events when they are ready. If a consumer is slow or down, the events wait in the queue. No cascading failures. No retry storms.
This is not new. What is new in 2026 is that the tooling has matured enough that event-driven is no longer an architecture choice reserved for Netflix-scale companies. Mid-size teams are adopting it because the patterns are well-understood and the infrastructure is manageable.
Three Things That Sound Similar But Are Not
These terms get conflated constantly. They are different:
| Concept | What it means | Example |
|---|---|---|
| Message queues | Point-to-point delivery. One producer, one consumer per message | SQS: “Process this payment” |
| Event-driven architecture | Services emit events about what happened. Multiple consumers can react independently | “OrderPlaced” event consumed by inventory, billing, and notification services |
| Event sourcing | Store state as a sequence of events, not as current state | Account balance = sum of all deposit and withdrawal events |
You can use event-driven architecture without event sourcing. Most teams should. Event sourcing adds complexity that is only justified when you need a complete audit trail or time-travel queries (financial systems, compliance-heavy domains).
Kafka vs NATS vs SQS
The broker choice matters less than people think, but there are real differences:
| Feature | Kafka | NATS JetStream | SQS |
|---|---|---|---|
| Throughput | Millions of msgs/sec | Millions of msgs/sec | Near-unlimited (standard); limited for FIFO |
| Ordering | Per-partition | Per-stream | Best-effort (FIFO queues: per-group) |
| Retention | Configurable (days/weeks/forever) | Configurable | 14 days max |
| Replay | Yes - consumers can rewind | Yes - consumers can rewind | No - once deleted, gone |
| Operational complexity | High (ZooKeeper/KRaft, partitions, ISR) | Low (single binary, embedded) | None (managed) |
| Consumer groups | Native | Native | Requires manual coordination |
| Cost at scale | Infrastructure only | Infrastructure only | Per-request pricing adds up |
Choose Kafka when you need event replay, high throughput, and you have the team to operate it. Kafka on Kubernetes with Strimzi is manageable in 2026 but still not trivial.
Choose NATS when you want Kafka-like capabilities without the operational overhead. JetStream gives you persistence, replay, and consumer groups in a single binary that uses a fraction of the resources. This is the right choice for most teams in 2026.
Choose SQS when you are on AWS, do not need replay, and want zero operational burden. SQS with SNS for fan-out covers many use cases. The per-request cost becomes significant above ~100M messages/month.
The Patterns That Make It Work
Event-driven architecture introduces eventual consistency, which means you need patterns to handle failures gracefully. These four patterns are non-negotiable for production systems.
1. The Outbox Pattern
The most common bug in event-driven systems: your service updates the database and then publishes an event. If the publish fails after the database commit, your system is inconsistent. The event never fires, but the state changed.
The outbox pattern fixes this:
BEGIN;
UPDATE orders SET status = 'confirmed' WHERE id = 123;
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('OrderConfirmed', '{"order_id": 123}', NOW());
COMMIT;
A separate process (or CDC) reads the outbox table and publishes events. Since the state change and the event record are in the same transaction, they are atomic. If the transaction fails, neither happens.
// Outbox publisher - polls the outbox table for unpublished events
func (p *Publisher) processOutbox(ctx context.Context) error {
	rows, err := p.db.QueryContext(ctx,
		"SELECT id, event_type, payload FROM outbox WHERE published = false ORDER BY created_at LIMIT 100")
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		var eventType, payload string
		if err := rows.Scan(&id, &eventType, &payload); err != nil {
			return err
		}
		if err := p.broker.Publish(eventType, []byte(payload)); err != nil {
			return err // Retry on next poll
		}
		// If this update fails, the event is re-published on the next
		// poll - consumers must be idempotent anyway (see pattern 4)
		if _, err := p.db.ExecContext(ctx,
			"UPDATE outbox SET published = true WHERE id = $1", id); err != nil {
			return err
		}
	}
	return rows.Err()
}
2. The Saga Pattern
When a business process spans multiple services, you cannot use a database transaction. Sagas coordinate multi-service workflows using events:
OrderPlaced -> ReserveInventory -> ChargePayment -> ShipOrder
                      |                  |
               InventoryFailed      PaymentFailed
                      |                  |
                 CancelOrder    ReleaseInventory + CancelOrder
Each step publishes an event. If a step fails, compensating events undo the previous steps. This is a choreography-based saga: each service knows what to do when it receives an event.
Orchestration-based sagas use a central coordinator that tells each service what to do. This is easier to reason about but creates a single point of coordination.
Use choreography when you have 3-4 services in the saga and the flow is straightforward. Use orchestration when the flow has complex branching or more than 5 services.
3. Change Data Capture (CDC)
CDC watches your database’s transaction log and emits events for every row change. Debezium is the standard tool for this:
# Debezium connector config
{
"name": "orders-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "orders-db",
"database.port": "5432",
"database.dbname": "orders",
"table.include.list": "public.orders",
"topic.prefix": "orders",
"slot.name": "debezium_orders"
}
}
CDC is powerful because it requires zero code changes in your application. Your service writes to the database as usual, and Debezium turns those writes into Kafka events automatically.
The downside: CDC events are database-level (row changes), not domain-level (“OrderPlaced”). You often need a transformer service to convert CDC events into meaningful domain events.
4. Idempotent Consumers
Events will be delivered more than once. Your consumers must handle duplicates gracefully:
func (h *Handler) handleOrderConfirmed(ctx context.Context, event Event) error {
	// Check if already processed
	processed, err := h.store.IsProcessed(ctx, event.ID)
	if err != nil {
		return err
	}
	if processed {
		return nil // Already handled, skip
	}
	// Process the event
	if err := h.processOrder(ctx, event); err != nil {
		return err
	}
	// Mark as processed. Ideally the processing writes and this mark
	// share one database transaction, so a crash between them cannot
	// leave the event half-applied
	return h.store.MarkProcessed(ctx, event.ID)
}
Store processed event IDs in your database. Check before processing. This is not optional - it is a requirement for any event-driven system.
When NOT to Use Event-Driven Architecture
Event-driven is not universally better. Avoid it when:
- You have a monolith and it works: Adding Kafka to a monolith is adding complexity without solving a real problem. Events make sense for decoupling services, not for internal module communication
- You need synchronous responses: User submits a form and needs an immediate result. Request-response is fine for this. Do not force it through an event pipeline
- Your team is small (under 5 engineers): The operational overhead of a message broker, dead letter queues, and eventual consistency debugging is not justified if three people own the entire system
- Data consistency is non-negotiable: If a bank transfer must be atomic across two accounts, a distributed transaction (or a single database) is simpler and safer than a saga
The Architecture Decision Framework
Ask these questions before choosing event-driven:
- Do multiple services need to react to the same state change? If yes, events.
- Can consumers tolerate processing delays of seconds to minutes? If no, use synchronous calls.
- Do you need to replay historical events for debugging or rebuilding state? If yes, Kafka or NATS with retention.
- Do you have the team to operate a message broker and debug eventual consistency issues? If no, start with SQS or a managed Kafka service.
Event-driven architecture is a tool, not a religion. The teams getting value from it in 2026 are the ones who applied it to the right problems - service decoupling, async workflows, and fan-out - and kept synchronous calls where they make sense.