Designing Event-Driven Architectures for Scale
Event-driven architecture is the backbone of systems that need to scale independently, process asynchronously, and maintain auditability. But most teams adopt it wrong — coupling events to implementation details and creating distributed monoliths.

Request-response architectures break at scale. When Service A calls Service B which calls Service C synchronously, you have created a distributed monolith — a system that has all the operational complexity of microservices with none of the independence. Event-driven architecture breaks this coupling by having services communicate through events rather than direct calls.
The core principle: services publish facts about what happened, not commands about what should happen. "OrderPlaced" is an event. "ProcessPayment" is a command. This distinction matters because events allow multiple consumers to react independently without the publisher knowing or caring about downstream behavior.
Core Patterns
Event Notification
The simplest pattern. A service publishes a lightweight event — "CustomerCreated", "OrderShipped" — and interested services subscribe. The event carries minimal data (typically just an ID and event type), and consumers fetch additional data if needed. This pattern is easy to adopt and works well for decoupling, but creates chattiness if consumers frequently need to call back for details.
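A minimal sketch of the notification pattern: the event carries only an ID and type, and the consumer calls back to the owning service for details. The in-memory dict and all names here are illustrative stand-ins, not a real client.

```python
# Thin notification event: just an ID and a type, no payload.
notification = {"event_type": "CustomerCreated", "customer_id": "cust-42"}

# Stand-in for the customer service's API; a real consumer would make
# an HTTP or gRPC call here -- this is the "chattiness" the text warns about.
CUSTOMER_DB = {"cust-42": {"name": "Ada", "tier": "gold"}}

def handle(event):
    # The event carries only an ID; call back to the owner for details.
    details = CUSTOMER_DB[event["customer_id"]]
    return f"welcome email queued for {details['name']}"

print(handle(notification))  # → welcome email queued for Ada
```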
Event-Carried State Transfer
Events carry all the data consumers need. "OrderPlaced" includes the full order details, customer information, and line items. Consumers maintain their own local copy of the data they care about. This eliminates callback chattiness and makes consumers fully autonomous, but means data is duplicated across services and events are larger.
Event Sourcing
Instead of storing current state, you store the sequence of events that led to the current state. An account balance is not a row in a database — it is the sum of all deposit and withdrawal events. This gives you a complete audit trail, the ability to reconstruct state at any point in time, and natural support for temporal queries. The trade-off is complexity: read queries require replaying or projecting events, and the event store grows indefinitely.
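The account-balance example can be expressed as a fold over the event history; the event classes below are illustrative, not taken from any library.

```python
from dataclasses import dataclass

# Hypothetical event types for illustration.
@dataclass(frozen=True)
class Deposited:
    amount: int  # cents

@dataclass(frozen=True)
class Withdrawn:
    amount: int  # cents

def balance(events):
    """Derive the current balance by folding over the event history."""
    total = 0
    for event in events:
        if isinstance(event, Deposited):
            total += event.amount
        elif isinstance(event, Withdrawn):
            total -= event.amount
    return total

history = [Deposited(10_000), Withdrawn(2_500), Deposited(500)]
print(balance(history))  # → 8000
```

Temporal queries fall out for free: replaying only the events up to a given timestamp yields the balance at that moment.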
| Pattern | Complexity | Coupling | Auditability | Best For |
|---|---|---|---|---|
| Event Notification | Low | Low | Limited | Simple decoupling between services |
| Event-Carried State | Medium | Very low | Good | Autonomous services with local data |
| Event Sourcing | High | Very low | Complete | Financial systems, audit-heavy domains |
| CQRS + Event Sourcing | Very high | Minimal | Complete | High-read, high-write systems at scale |
Apache Kafka in Practice
Kafka dominates event-driven infrastructure for a reason: it provides durable, ordered, high-throughput event streaming with consumer group semantics that allow independent scaling of producers and consumers. But Kafka is not simple to operate, and the teams that treat it as a drop-in message queue discover its complexity the hard way.
- Partition count is hard to change after creation — size partitions for expected peak throughput at topic creation time
- Consumer lag monitoring is non-negotiable — a consumer falling behind is invisible until it becomes a production incident
- Schema evolution must be managed from day one — use a schema registry with compatibility checks or you will break consumers
- Retention policies determine your replay window — set them based on your disaster recovery requirements, not arbitrary defaults
- Exactly-once semantics require idempotent consumers — Kafka provides at-least-once delivery, your consumers must handle duplicates
The AI Agent Event Bus
AI agents are creating a new demand pattern for event-driven architectures. An AI agent that monitors customer behavior, detects patterns, and takes autonomous actions is fundamentally an event consumer and producer. It subscribes to behavioral events, processes them through a model, and publishes action events. The event bus becomes the coordination layer between AI agents and traditional services.
This works well when agents are reactive — responding to events as they occur. It becomes more complex when agents need to maintain state across multiple events (a customer journey spanning days or weeks) or coordinate with other agents. The emerging pattern is to combine event sourcing for state management with an orchestration layer that manages multi-agent coordination.
Migrating to Event-Driven Architecture
1. Map every synchronous service-to-service call in your system. Rank them by coupling impact: which ones, if they fail, cascade failures across multiple services?
2. Do not rip out REST calls. Add event publishing alongside them, and let consumers gradually shift from polling and calling to subscribing.
3. Establish an event envelope (event type, timestamp, correlation ID, schema version) and a schema registry before the first event is published. Retroactive schema management is painful.
4. Make every consumer handle duplicate events gracefully, using deduplication keys or idempotent operations. This is not optional: at-least-once delivery means duplicates will happen.
5. Build observability from the start: distributed tracing across event producers and consumers, consumer lag dashboards, and dead letter queue monitoring. Without it, debugging event-driven systems is guesswork.
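The envelope fields above (event type, timestamp, correlation ID, schema version) can be sketched as a frozen dataclass; the field names and defaults are illustrative, not a standard.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class EventEnvelope:
    # Assumed envelope shape -- adapt field names to your own conventions.
    event_type: str
    schema_version: int
    payload: dict
    correlation_id: str          # ties the event to the originating request
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

env = EventEnvelope(
    event_type="OrderPlaced",
    schema_version=1,
    payload={"order_id": "ord-123", "total_cents": 4999},
    correlation_id="req-789",
)
print(env.event_type, env.schema_version)  # → OrderPlaced 1
```

The `event_id` doubles as the deduplication key for idempotent consumers, and `correlation_id` is what distributed tracing joins on.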
“The biggest mistake teams make with event-driven architecture is not the technology choice — it is treating events as remote procedure calls with extra steps. Events are facts about the past, not requests for the future.”
Kafka vs RabbitMQ vs NATS: Choosing the Right Broker
The choice of message broker determines your system's throughput ceiling, delivery guarantees, and operational complexity for years. Teams frequently adopt Kafka because it is well-known, then discover that their use case required RabbitMQ's routing flexibility or NATS's low-latency characteristics. The decision deserves deliberate analysis.
| Broker | Throughput | Latency | Message retention | Consumer model | Best for |
|---|---|---|---|---|---|
| Apache Kafka | Millions/sec per cluster | 5-15ms (end-to-end) | Configurable (days to years) | Pull — consumer groups, offset-based | Event sourcing, audit logs, stream processing, replay |
| RabbitMQ 3.x | ~50K msg/sec per node | 1-5ms | Until consumed (or DLQ) | Push — queue-based, routing with exchanges | Task queues, request/reply, complex routing, RPC patterns |
| NATS (JetStream) | Millions/sec per node (core NATS; lower with JetStream persistence) | Sub-millisecond (core) | Configurable retention | Both push and pull | Low-latency IoT, edge computing, multi-cloud messaging |
| AWS SQS/SNS | Unlimited (managed) | 1-3 seconds typical | Up to 14 days (SQS) | Pull (SQS), push (SNS) | AWS-native teams, serverless, simple fan-out |
| Google Pub/Sub | Unlimited (managed) | < 100ms | Up to 7 days | Pull or push | GCP-native, global message distribution |
Kafka is the right choice when: you need message replay (re-process events from any point in time), your consumers need to process events at their own pace without the broker losing messages, or you are building stream processing pipelines with Kafka Streams or Flink. Kafka is the wrong choice when: your use case is a simple task queue where work needs routing to specific consumers, you need request-reply semantics, or your team does not have the operational expertise to run Kafka (ZooKeeper dependency prior to KRaft mode, partition management, consumer group rebalancing complexity).
RabbitMQ excels at routing: direct, fanout, topic, and header exchanges give you fine-grained control over message delivery. A message can be routed to zero, one, or many queues based on routing keys and binding patterns. For microservices that need sophisticated event routing without building routing logic into consumers, RabbitMQ's exchange model is powerful. It pairs naturally with API gateway patterns for event-driven request handling.
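The binding-pattern semantics of a topic exchange are worth internalising: `*` matches exactly one word, `#` matches zero or more words. The matcher below is an illustrative sketch of those rules, not RabbitMQ's implementation.

```python
def matches(binding: str, routing_key: str) -> bool:
    """Topic-exchange style matching: '*' = one word, '#' = zero or more."""
    def match(b, k):
        if not b:
            return not k
        if b[0] == "#":
            # '#' may consume any number of words, including none.
            return any(match(b[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if b[0] == "*" or b[0] == k[0]:
            return match(b[1:], k[1:])
        return False
    return match(binding.split("."), routing_key.split("."))

print(matches("order.*", "order.created"))     # → True
print(matches("order.#", "order.eu.created"))  # → True
print(matches("order.*", "order.eu.created"))  # → False
```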
Event Sourcing: Patterns and Practical Implementation
Event sourcing stores the history of state changes as an immutable sequence of events, rather than storing the current state directly. The current state is derived by replaying events from the beginning (or from a snapshot). This sounds academic until you need to answer "what was the state of order #12345 at 14:32 on March 15th?" — a question that is trivial with event sourcing and impossible with a traditional mutable database.
The practical implementation pattern for production event sourcing: each aggregate (e.g., an Order) has an event stream identified by aggregate_id. Events are appended with an incrementing version number. To load an aggregate, fetch all events from the stream and fold them left over the initial state. For long-lived aggregates (thousands of events), take periodic snapshots at version N and replay only events after the snapshot.
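A minimal sketch of the snapshot-plus-replay load described above, assuming events are dicts carrying a version and a numeric delta (both shapes are assumptions for illustration).

```python
def load_aggregate(snapshot_state, snapshot_version, events):
    """Start from the snapshot, then fold only events newer than it."""
    state = snapshot_state
    version = snapshot_version
    for event in events:
        if event["version"] <= version:
            continue  # already reflected in the snapshot
        state += event["amount"]
        version = event["version"]
    return state, version

events = [
    {"version": 1, "amount": 100},
    {"version": 2, "amount": -30},
    {"version": 3, "amount": 10},
]
# Snapshot taken at version 2 with state 70; only event 3 is replayed.
state, version = load_aggregate(70, 2, events)
print(state, version)  # → 80 3
```

Replaying from scratch (`load_aggregate(0, 0, events)`) must yield the same state as snapshot-plus-replay; that invariant is a cheap consistency check for your snapshotting code.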
Kafka is a common storage layer for event sourcing: topics map to aggregate event streams, partitioned by aggregate ID for ordering guarantees, and each consumer group reads the stream at its own pace. EventStoreDB is purpose-built for event sourcing and provides aggregate management, subscriptions, and projections out of the box, without building them on top of a general-purpose message broker.
The Saga Pattern for Distributed Transactions
Distributed transactions (2PC) do not scale. The Saga pattern replaces a single atomic transaction across services with a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the completed steps. Two coordination styles exist: choreography (services react to events directly) and orchestration (a central coordinator, the saga orchestrator, issues commands to each service). For complex multi-step workflows spanning more than three services, the orchestration style is easier to reason about and debug.
Consider an order saga:
1. The saga orchestrator creates a saga instance, persists its initial state, and publishes a ReserveInventory command to the inventory service.
2. The inventory service reserves stock and publishes an InventoryReserved event. The orchestrator listens for this event and advances the saga to the next step.
3. The payment service processes the payment and publishes a PaymentProcessed or PaymentFailed event.
4. The fulfilment service confirms shipment and the saga completes; the orchestrator marks the saga instance as completed.
5. If payment fails, the orchestrator instead publishes a ReleaseInventoryReservation command; the inventory service reverses the reservation and the saga ends in a cancelled state with a full audit trail.
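The happy path and the compensation path of the order saga can be sketched as a toy orchestrator. Service calls are simulated with in-process steps; in production each would be a command published to the broker, and all names here are illustrative.

```python
def run_order_saga(payment_succeeds: bool):
    """Toy saga: reserve inventory, take payment, ship; compensate on failure."""
    completed = []  # steps that may need compensating, in order
    log = []        # commands the orchestrator issued

    log.append("ReserveInventory")
    completed.append("inventory")   # assume reservation succeeds

    log.append("ProcessPayment")
    if not payment_succeeds:
        # Compensate completed steps in reverse order.
        for step in reversed(completed):
            if step == "inventory":
                log.append("ReleaseInventoryReservation")
        return "cancelled", log
    completed.append("payment")

    log.append("ConfirmShipment")
    return "completed", log

status, log = run_order_saga(payment_succeeds=False)
print(status, log)
# → cancelled ['ReserveInventory', 'ProcessPayment', 'ReleaseInventoryReservation']
```

A real orchestrator would persist the saga state after every transition so a crash mid-saga resumes (or compensates) rather than losing the workflow.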
Dead Letter Queues and Poison Pill Handling
A poison pill message is a message that causes consumer crashes or processing failures on every attempt. Without a dead letter queue (DLQ), the broker will redeliver the message indefinitely — blocking the consumer or consuming all retry budget. Every production event-driven system needs a DLQ strategy.
The pattern: configure a maximum delivery count (e.g., 5 attempts). After N failures, the broker moves the message to a DLQ automatically. A separate consumer or a manual review process handles DLQ messages. The DLQ is not a discard bin — it is a quarantine. Messages should be inspectable (readable), actionable (can be replayed to the main queue after a fix), and monitored (alert on DLQ depth).
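The max-delivery-count flow can be sketched as follows; an in-memory list stands in for the DLQ, retries are immediate (a real broker applies backoff), and `MAX_DELIVERIES` and the message shape are assumptions.

```python
MAX_DELIVERIES = 5  # assumed policy; match your broker's setting

def deliver(message, handler, dlq):
    """Attempt delivery up to MAX_DELIVERIES times, then quarantine."""
    for attempt in range(1, MAX_DELIVERIES + 1):
        try:
            handler(message)
            return "processed"
        except Exception as exc:
            last_error = exc  # no backoff here; real brokers delay retries
    # Quarantine with enough context to inspect and replay later.
    dlq.append({
        "message": message,
        "attempts": MAX_DELIVERIES,
        "error": str(last_error),
    })
    return "dead-lettered"

def poison_handler(msg):
    raise ValueError("cannot parse payload")

dlq = []
print(deliver({"id": "evt-1"}, poison_handler, dlq))  # → dead-lettered
print(len(dlq))  # → 1
```

Note that the quarantined entry keeps the original message and the last error, which is what makes the DLQ inspectable and replayable rather than a discard bin.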
In Kafka, DLQ is not a built-in concept — you implement it with a separate topic (e.g., topic-name.DLT) and a retry topic per attempt (topic-name.retry.0, topic-name.retry.1). Spring Kafka's @RetryableTopic annotation handles this automatically with configurable backoff intervals between attempts. In RabbitMQ, set x-dead-letter-exchange on the queue; after max-retries, messages route to the DLX automatically.
Event Schema Evolution: Avro vs Protobuf vs JSON Schema
| Format | Schema registry needed | Backward compat tools | Payload size | Schema evolution model | Best for |
|---|---|---|---|---|---|
| Avro | Yes (Confluent/Apicurio) | Strong — reader/writer schema resolution | Binary, compact | Full compatibility modes | Kafka-native, Java/JVM ecosystems |
| Protobuf | Optional but recommended | Strong — field numbers are immutable | Binary, very compact | Additive-only by convention | Polyglot systems, gRPC, high throughput |
| JSON Schema | Optional | Weak — convention-based only | JSON text, verbose | Manual discipline required | Developer experience, REST/webhook payloads |
Avro with Confluent Schema Registry is the Kafka-native choice: the schema ID is embedded in every message, the registry enforces compatibility rules (backward, forward, full), and generated code handles serialisation. The tradeoff is operational dependency on the schema registry — a registry outage blocks all producers and consumers.
Protobuf is preferred for polyglot systems where producers and consumers span multiple languages. Field numbers provide a stable identifier that survives field renames. The rule: never reuse field numbers. Add new optional fields; never remove or change existing fields. Protobuf files are your API contract — treat them with the same discipline as a public API.
Exactly-Once Semantics: Reality Check
Nearly every message broker eventually advertises exactly-once semantics; the practical reality is more nuanced. Kafka added exactly-once semantics (EOS) in version 0.11.0 via idempotent producers and transactional APIs. Within a single Kafka cluster, producer-to-broker-to-consumer EOS is achievable. The guarantee breaks the moment you write to an external system (a database, another service): the external write is outside the transaction boundary.
The practical answer for most systems: design for at-least-once delivery and idempotent consumers. An idempotent consumer processes the same message N times with the same effect as processing it once. Implement idempotency with a deduplication key (event ID) stored in a processed-events table. Check for the key before processing; write it after. This pattern is simpler than chasing distributed EOS guarantees and works across brokers. For real-time sync systems that need stronger guarantees, CRDT-based approaches handle concurrent writes without coordination.
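The deduplication-key pattern can be sketched with an in-memory set standing in for the processed-events table; a production implementation would write the key and the side effect in one database transaction so a crash between the two cannot cause a duplicate.

```python
processed = set()   # stand-in for the processed-events table
side_effects = []   # stand-in for the real work's outcome

def handle_once(event):
    """Process each event_id at most once: check before, record after."""
    if event["event_id"] in processed:
        return "skipped duplicate"
    side_effects.append(event["payload"])  # the real work
    processed.add(event["event_id"])       # record after processing
    return "processed"

evt = {"event_id": "evt-7", "payload": "charge card"}
print(handle_once(evt))  # → processed
print(handle_once(evt))  # → skipped duplicate
print(side_effects)      # → ['charge card']
```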
Operational Considerations
Event-driven architectures shift complexity from request-time to infrastructure-time. You trade synchronous call chains (simple to reason about, fragile under load) for asynchronous event flows (resilient under load, harder to debug). This is not a free trade — it requires investment in tooling. At minimum: a dead letter queue dashboard showing failed events with replay capability, a schema registry for event evolution, consumer lag monitoring with alerts, and a way to trace a single business event through the entire system.
Consumer lag is the most important operational metric. When a consumer falls behind the producer, events accumulate in the broker. If lag grows faster than the consumer can process, you have a capacity problem that will eventually cause data loss (when retention expires) or memory pressure on the broker. Monitor consumer lag per partition, alert when it exceeds your SLA threshold (typically the maximum acceptable event processing delay), and have a runbook for scaling consumers horizontally. For a broader treatment of operational monitoring, see our guide on observability-driven development.
Testing Event-Driven Systems
Testing event-driven architectures requires a fundamentally different approach than testing request-response systems. You cannot simply send a request and assert on the response — instead, you publish an event and assert that the correct downstream effects occurred within a timeout window. This temporal dimension makes tests slower and more fragile than synchronous tests, which is why most teams under-test their event-driven code.
The testing strategy: unit test your event handlers in isolation (mock the broker, test the handler function directly), integration test the event flow end-to-end using Testcontainers to spin up a real Kafka/RabbitMQ instance in Docker, and contract test your event schemas using a schema registry that rejects backward-incompatible changes. The contract tests are the most important layer — they prevent the most common failure mode in event-driven systems: a producer changes the event schema and breaks every consumer that depends on the old schema.
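The first layer, unit testing a handler in isolation, can look like this; the broker is replaced by a recording fake, and `FakeProducer`, the topic name, and the handler are hypothetical.

```python
class FakeProducer:
    """Records published events instead of talking to a real broker."""
    def __init__(self):
        self.published = []

    def publish(self, topic, event):
        self.published.append((topic, event))

def on_order_placed(event, producer):
    # Handler under test: reacts to OrderPlaced by requesting an invoice.
    producer.publish("invoices", {
        "type": "InvoiceRequested",
        "order_id": event["order_id"],
    })

producer = FakeProducer()
on_order_placed({"type": "OrderPlaced", "order_id": "ord-1"}, producer)
print(producer.published)
# → [('invoices', {'type': 'InvoiceRequested', 'order_id': 'ord-1'})]
```

Because the fake records exact `(topic, event)` pairs, assertions catch both wrong routing and wrong payloads, the two failure modes a mocked broker would otherwise hide.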
For teams running event consumers as part of a broader CI/CD pipeline, the integration test phase should spin up the message broker, publish test events, wait for consumers to process them (with a configurable timeout), and assert on the resulting state in the database or downstream service. Testcontainers makes this practical — a Kafka container starts in 5-10 seconds and provides a fully functional broker for your test suite.