Designing Event-Driven Architectures for Scale
Event-driven architecture is the backbone of systems that need to scale independently, process asynchronously, and maintain auditability. But most teams adopt it wrong — coupling events to implementation details and creating distributed monoliths.

Request-response architectures break at scale. When Service A calls Service B which calls Service C synchronously, you have created a distributed monolith — a system that has all the operational complexity of microservices with none of the independence. Event-driven architecture breaks this coupling by having services communicate through events rather than direct calls.
The core principle: services publish facts about what happened, not commands about what should happen. "OrderPlaced" is an event. "ProcessPayment" is a command. This distinction matters because events allow multiple consumers to react independently without the publisher knowing or caring about downstream behavior.
Core Patterns
Event Notification
The simplest pattern. A service publishes a lightweight event — "CustomerCreated", "OrderShipped" — and interested services subscribe. The event carries minimal data (typically just an ID and event type), and consumers fetch additional data if needed. This pattern is easy to adopt and works well for decoupling, but creates chattiness if consumers frequently need to call back for details.
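A minimal sketch of the notification pattern: the event carries only an ID and type, and the consumer calls back to the owning service for details. The in-memory dict and all names here are illustrative stand-ins, not a real client.

```python
# Thin notification event: just an ID and a type, no payload.
notification = {"event_type": "CustomerCreated", "customer_id": "cust-42"}

# Stand-in for the customer service's API; a real consumer would make
# an HTTP or gRPC call here -- this is the "chattiness" the text warns about.
CUSTOMER_DB = {"cust-42": {"name": "Ada", "tier": "gold"}}

def handle(event):
    # The event carries only an ID; call back to the owner for details.
    details = CUSTOMER_DB[event["customer_id"]]
    return f"welcome email queued for {details['name']}"

print(handle(notification))  # → welcome email queued for Ada
```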
Event-Carried State Transfer
Events carry all the data consumers need. "OrderPlaced" includes the full order details, customer information, and line items. Consumers maintain their own local copy of the data they care about. This eliminates callback chattiness and makes consumers fully autonomous, but means data is duplicated across services and events are larger.
Event Sourcing
Instead of storing current state, you store the sequence of events that led to the current state. An account balance is not a row in a database — it is the sum of all deposit and withdrawal events. This gives you a complete audit trail, the ability to reconstruct state at any point in time, and natural support for temporal queries. The trade-off is complexity: read queries require replaying or projecting events, and the event store grows indefinitely.
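The account-balance example can be expressed as a fold over the event history; the event classes below are illustrative, not taken from any library.

```python
from dataclasses import dataclass

# Hypothetical event types for illustration.
@dataclass(frozen=True)
class Deposited:
    amount: int  # cents

@dataclass(frozen=True)
class Withdrawn:
    amount: int  # cents

def balance(events):
    """Derive the current balance by folding over the event history."""
    total = 0
    for event in events:
        if isinstance(event, Deposited):
            total += event.amount
        elif isinstance(event, Withdrawn):
            total -= event.amount
    return total

history = [Deposited(10_000), Withdrawn(2_500), Deposited(500)]
print(balance(history))  # → 8000
```

Temporal queries fall out for free: replaying only the events up to a given timestamp yields the balance at that moment.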
| Pattern | Complexity | Coupling | Auditability | Best For |
|---|---|---|---|---|
| Event Notification | Low | Low | Limited | Simple decoupling between services |
| Event-Carried State | Medium | Very low | Good | Autonomous services with local data |
| Event Sourcing | High | Very low | Complete | Financial systems, audit-heavy domains |
| CQRS + Event Sourcing | Very high | Minimal | Complete | High-read, high-write systems at scale |
Apache Kafka in Practice
Kafka dominates event-driven infrastructure for a reason: it provides durable, ordered, high-throughput event streaming with consumer group semantics that allow independent scaling of producers and consumers. But Kafka is not simple to operate, and the teams that treat it as a drop-in message queue discover its complexity the hard way.
- Partition count is hard to change after creation — size partitions for expected peak throughput at topic creation time
- Consumer lag monitoring is non-negotiable — a consumer falling behind is invisible until it becomes a production incident
- Schema evolution must be managed from day one — use a schema registry with compatibility checks or you will break consumers
- Retention policies determine your replay window — set them based on your disaster recovery requirements, not arbitrary defaults
- Exactly-once semantics require idempotent consumers — Kafka provides at-least-once delivery, your consumers must handle duplicates
The AI Agent Event Bus
AI agents are creating a new demand pattern for event-driven architectures. An AI agent that monitors customer behavior, detects patterns, and takes autonomous actions is fundamentally an event consumer and producer. It subscribes to behavioral events, processes them through a model, and publishes action events. The event bus becomes the coordination layer between AI agents and traditional services.
This works well when agents are reactive — responding to events as they occur. It becomes more complex when agents need to maintain state across multiple events (a customer journey spanning days or weeks) or coordinate with other agents. The emerging pattern is to combine event sourcing for state management with an orchestration layer that manages multi-agent coordination.
Migrating to Event-Driven Architecture
1. Map every synchronous service-to-service call in your system. Rank them by coupling impact: which ones, if they fail, cascade failures across multiple services?
2. Do not rip out REST calls. Add event publishing alongside them, and let consumers gradually shift from polling and calling to subscribing.
3. Establish an event envelope (event type, timestamp, correlation ID, schema version) and a schema registry before the first event is published. Retroactive schema management is painful.
4. Make every consumer handle duplicate events gracefully, using deduplication keys or idempotent operations. This is not optional: at-least-once delivery means duplicates will happen.
5. Build observability from the start: distributed tracing across event producers and consumers, consumer lag dashboards, and dead letter queue monitoring. Without it, debugging event-driven systems is guesswork.
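The envelope fields above (event type, timestamp, correlation ID, schema version) can be sketched as a frozen dataclass; the field names and defaults are illustrative, not a standard.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class EventEnvelope:
    # Assumed envelope shape -- adapt field names to your own conventions.
    event_type: str
    schema_version: int
    payload: dict
    correlation_id: str          # ties the event to the originating request
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

env = EventEnvelope(
    event_type="OrderPlaced",
    schema_version=1,
    payload={"order_id": "ord-123", "total_cents": 4999},
    correlation_id="req-789",
)
print(env.event_type, env.schema_version)  # → OrderPlaced 1
```

The `event_id` doubles as the deduplication key for idempotent consumers, and `correlation_id` is what distributed tracing joins on.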
“The biggest mistake teams make with event-driven architecture is not the technology choice — it is treating events as remote procedure calls with extra steps. Events are facts about the past, not requests for the future.”
Kafka vs RabbitMQ vs NATS: Choosing the Right Broker
The choice of message broker determines your system's throughput ceiling, delivery guarantees, and operational complexity for years. Teams frequently adopt Kafka because it is well-known, then discover that their use case required RabbitMQ's routing flexibility or NATS's low-latency characteristics. The decision deserves deliberate analysis.
| Broker | Throughput | Latency | Message retention | Consumer model | Best for |
|---|---|---|---|---|---|
| Apache Kafka | Millions/sec per cluster | 5-15ms (end-to-end) | Configurable (days to years) | Pull — consumer groups, offset-based | Event sourcing, audit logs, stream processing, replay |
| RabbitMQ 3.x | ~50K msg/sec per node | 1-5ms | Until consumed (or DLQ) | Push — queue-based, routing with exchanges | Task queues, request/reply, complex routing, RPC patterns |
| NATS (JetStream) | Millions/sec per node (core NATS; lower with JetStream persistence) | Sub-millisecond (core) | Configurable retention | Both push and pull | Low-latency IoT, edge computing, multi-cloud messaging |
| AWS SQS/SNS | Unlimited (managed) | 1-3 seconds typical | Up to 14 days (SQS) | Pull (SQS), push (SNS) | AWS-native teams, serverless, simple fan-out |
| Google Pub/Sub | Unlimited (managed) | < 100ms | Up to 7 days | Pull or push | GCP-native, global message distribution |
Kafka is the right choice when: you need message replay (re-process events from any point in time), your consumers need to process events at their own pace without the broker losing messages, or you are building stream processing pipelines with Kafka Streams or Flink. Kafka is the wrong choice when: your use case is a simple task queue where work needs routing to specific consumers, you need request-reply semantics, or your team does not have the operational expertise to run Kafka (ZooKeeper dependency prior to KRaft mode, partition management, consumer group rebalancing complexity).
RabbitMQ excels at routing: direct, fanout, topic, and header exchanges give you fine-grained control over message delivery. A message can be routed to zero, one, or many queues based on routing keys and binding patterns. For microservices that need sophisticated event routing without building routing logic into consumers, RabbitMQ's exchange model is powerful. It pairs naturally with API gateway patterns for event-driven request handling.
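The binding-pattern semantics of a topic exchange are worth internalising: `*` matches exactly one word, `#` matches zero or more words. The matcher below is an illustrative sketch of those rules, not RabbitMQ's implementation.

```python
def matches(binding: str, routing_key: str) -> bool:
    """Topic-exchange style matching: '*' = one word, '#' = zero or more."""
    def match(b, k):
        if not b:
            return not k
        if b[0] == "#":
            # '#' may consume any number of words, including none.
            return any(match(b[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if b[0] == "*" or b[0] == k[0]:
            return match(b[1:], k[1:])
        return False
    return match(binding.split("."), routing_key.split("."))

print(matches("order.*", "order.created"))     # → True
print(matches("order.#", "order.eu.created"))  # → True
print(matches("order.*", "order.eu.created"))  # → False
```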
Event Sourcing: Patterns and Practical Implementation
Event sourcing stores the history of state changes as an immutable sequence of events, rather than storing the current state directly. The current state is derived by replaying events from the beginning (or from a snapshot). This sounds academic until you need to answer "what was the state of order #12345 at 14:32 on March 15th?" — a question that is trivial with event sourcing and impossible with a traditional mutable database.
The practical implementation pattern for production event sourcing: each aggregate (e.g., an Order) has an event stream identified by aggregate_id. Events are appended with an incrementing version number. To load an aggregate, fetch all events from the stream and fold them left over the initial state. For long-lived aggregates (thousands of events), take periodic snapshots at version N and replay only events after the snapshot.
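A minimal sketch of the snapshot-plus-replay load described above, assuming events are dicts carrying a version and a numeric delta (both shapes are assumptions for illustration).

```python
def load_aggregate(snapshot_state, snapshot_version, events):
    """Start from the snapshot, then fold only events newer than it."""
    state = snapshot_state
    version = snapshot_version
    for event in events:
        if event["version"] <= version:
            continue  # already reflected in the snapshot
        state += event["amount"]
        version = event["version"]
    return state, version

events = [
    {"version": 1, "amount": 100},
    {"version": 2, "amount": -30},
    {"version": 3, "amount": 10},
]
# Snapshot taken at version 2 with state 70; only event 3 is replayed.
state, version = load_aggregate(70, 2, events)
print(state, version)  # → 80 3
```

Replaying from scratch (`load_aggregate(0, 0, events)`) must yield the same state as snapshot-plus-replay; that invariant is a cheap consistency check for your snapshotting code.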
Kafka is a common storage layer for event sourcing: topics map to aggregate event streams, partitioned by aggregate ID for ordering guarantees, and each consumer group reads the stream at its own pace. EventStoreDB is purpose-built for event sourcing and provides aggregate management, subscriptions, and projections out of the box, without building them on top of a general-purpose message broker.
The Saga Pattern for Distributed Transactions
Distributed transactions (2PC) do not scale. The Saga pattern replaces a single atomic transaction across services with a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the completed steps. Two coordination styles exist: choreography (services react to events directly) and orchestration (a central coordinator, the saga orchestrator, issues commands to each service). For complex multi-step workflows spanning more than three services, the orchestration style is easier to reason about and debug.
Consider an order saga:
1. The saga orchestrator creates a saga instance, persists its initial state, and publishes a ReserveInventory command to the inventory service.
2. The inventory service reserves stock and publishes an InventoryReserved event. The orchestrator listens for this event and advances the saga to the next step.
3. The payment service processes the payment and publishes a PaymentProcessed or PaymentFailed event.
4. The fulfilment service confirms shipment and the saga completes; the orchestrator marks the saga instance as completed.
5. If payment fails, the orchestrator instead publishes a ReleaseInventoryReservation command; the inventory service reverses the reservation and the saga ends in a cancelled state with a full audit trail.
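The happy path and the compensation path of the order saga can be sketched as a toy orchestrator. Service calls are simulated with in-process steps; in production each would be a command published to the broker, and all names here are illustrative.

```python
def run_order_saga(payment_succeeds: bool):
    """Toy saga: reserve inventory, take payment, ship; compensate on failure."""
    completed = []  # steps that may need compensating, in order
    log = []        # commands the orchestrator issued

    log.append("ReserveInventory")
    completed.append("inventory")   # assume reservation succeeds

    log.append("ProcessPayment")
    if not payment_succeeds:
        # Compensate completed steps in reverse order.
        for step in reversed(completed):
            if step == "inventory":
                log.append("ReleaseInventoryReservation")
        return "cancelled", log
    completed.append("payment")

    log.append("ConfirmShipment")
    return "completed", log

status, log = run_order_saga(payment_succeeds=False)
print(status, log)
# → cancelled ['ReserveInventory', 'ProcessPayment', 'ReleaseInventoryReservation']
```

A real orchestrator would persist the saga state after every transition so a crash mid-saga resumes (or compensates) rather than losing the workflow.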
Dead Letter Queues and Poison Pill Handling
A poison pill message is a message that causes consumer crashes or processing failures on every attempt. Without a dead letter queue (DLQ), the broker will redeliver the message indefinitely — blocking the consumer or consuming all retry budget. Every production event-driven system needs a DLQ strategy.
The pattern: configure a maximum delivery count (e.g., 5 attempts). After N failures, the broker moves the message to a DLQ automatically. A separate consumer or a manual review process handles DLQ messages. The DLQ is not a discard bin — it is a quarantine. Messages should be inspectable (readable), actionable (can be replayed to the main queue after a fix), and monitored (alert on DLQ depth).
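The max-delivery-count flow can be sketched as follows; an in-memory list stands in for the DLQ, retries are immediate (a real broker applies backoff), and `MAX_DELIVERIES` and the message shape are assumptions.

```python
MAX_DELIVERIES = 5  # assumed policy; match your broker's setting

def deliver(message, handler, dlq):
    """Attempt delivery up to MAX_DELIVERIES times, then quarantine."""
    for attempt in range(1, MAX_DELIVERIES + 1):
        try:
            handler(message)
            return "processed"
        except Exception as exc:
            last_error = exc  # no backoff here; real brokers delay retries
    # Quarantine with enough context to inspect and replay later.
    dlq.append({
        "message": message,
        "attempts": MAX_DELIVERIES,
        "error": str(last_error),
    })
    return "dead-lettered"

def poison_handler(msg):
    raise ValueError("cannot parse payload")

dlq = []
print(deliver({"id": "evt-1"}, poison_handler, dlq))  # → dead-lettered
print(len(dlq))  # → 1
```

Note that the quarantined entry keeps the original message and the last error, which is what makes the DLQ inspectable and replayable rather than a discard bin.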
In Kafka, DLQ is not a built-in concept — you implement it with a separate topic (e.g., topic-name.DLT) and a retry topic per attempt (topic-name.retry.0, topic-name.retry.1). Spring Kafka's @RetryableTopic annotation handles this automatically with configurable backoff intervals between attempts. In RabbitMQ, set x-dead-letter-exchange on the queue; after max-retries, messages route to the DLX automatically.
Event Schema Evolution: Avro vs Protobuf vs JSON Schema
| Format | Schema registry needed | Backward compat tools | Payload size | Schema evolution model | Best for |
|---|---|---|---|---|---|
| Avro | Yes (Confluent/Apicurio) | Strong — reader/writer schema resolution | Binary, compact | Full compatibility modes | Kafka-native, Java/JVM ecosystems |
| Protobuf | Optional but recommended | Strong — field numbers are immutable | Binary, very compact | Additive-only by convention | Polyglot systems, gRPC, high throughput |
| JSON Schema | Optional | Weak — convention-based only | JSON text, verbose | Manual discipline required | Developer experience, REST/webhook payloads |
Avro with Confluent Schema Registry is the Kafka-native choice: the schema ID is embedded in every message, the registry enforces compatibility rules (backward, forward, full), and generated code handles serialisation. The tradeoff is operational dependency on the schema registry — a registry outage blocks all producers and consumers.
Protobuf is preferred for polyglot systems where producers and consumers span multiple languages. Field numbers provide a stable identifier that survives field renames. The rule: never reuse field numbers. Add new optional fields; never remove or change existing fields. Protobuf files are your API contract — treat them with the same discipline as a public API.
Exactly-Once Semantics: Reality Check
Nearly every message broker eventually advertises exactly-once semantics; the practical reality is more nuanced. Kafka added exactly-once semantics (EOS) in version 0.11.0 via idempotent producers and transactional APIs. Within a single Kafka cluster, producer-to-broker-to-consumer EOS is achievable. The guarantee breaks the moment you write to an external system (a database, another service): the external write is outside the transaction boundary.
The practical answer for most systems: design for at-least-once delivery and idempotent consumers. An idempotent consumer processes the same message N times with the same effect as processing it once. Implement idempotency with a deduplication key (event ID) stored in a processed-events table. Check for the key before processing; write it after. This pattern is simpler than chasing distributed EOS guarantees and works across brokers. For real-time sync systems that need stronger guarantees, CRDT-based approaches handle concurrent writes without coordination.
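The deduplication-key pattern can be sketched with an in-memory set standing in for the processed-events table; a production implementation would write the key and the side effect in one database transaction so a crash between the two cannot cause a duplicate.

```python
processed = set()   # stand-in for the processed-events table
side_effects = []   # stand-in for the real work's outcome

def handle_once(event):
    """Process each event_id at most once: check before, record after."""
    if event["event_id"] in processed:
        return "skipped duplicate"
    side_effects.append(event["payload"])  # the real work
    processed.add(event["event_id"])       # record after processing
    return "processed"

evt = {"event_id": "evt-7", "payload": "charge card"}
print(handle_once(evt))  # → processed
print(handle_once(evt))  # → skipped duplicate
print(side_effects)      # → ['charge card']
```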
Operational Considerations
Event-driven architectures shift complexity from request-time to infrastructure-time. You trade synchronous call chains (simple to reason about, fragile under load) for asynchronous event flows (resilient under load, harder to debug). This is not a free trade — it requires investment in tooling. At minimum: a dead letter queue dashboard showing failed events with replay capability, a schema registry for event evolution, consumer lag monitoring with alerts, and a way to trace a single business event through the entire system.
Consumer lag is the most important operational metric. When a consumer falls behind the producer, events accumulate in the broker. If lag grows faster than the consumer can process, you have a capacity problem that will eventually cause data loss (when retention expires) or memory pressure on the broker. Monitor consumer lag per partition, alert when it exceeds your SLA threshold (typically the maximum acceptable event processing delay), and have a runbook for scaling consumers horizontally. For a broader treatment of operational monitoring, see our guide on observability-driven development.
Testing Event-Driven Systems
Testing event-driven architectures requires a fundamentally different approach than testing request-response systems. You cannot simply send a request and assert on the response — instead, you publish an event and assert that the correct downstream effects occurred within a timeout window. This temporal dimension makes tests slower and more fragile than synchronous tests, which is why most teams under-test their event-driven code.
The testing strategy: unit test your event handlers in isolation (mock the broker, test the handler function directly), integration test the event flow end-to-end using Testcontainers to spin up a real Kafka/RabbitMQ instance in Docker, and contract test your event schemas using a schema registry that rejects backward-incompatible changes. The contract tests are the most important layer — they prevent the most common failure mode in event-driven systems: a producer changes the event schema and breaks every consumer that depends on the old schema.
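The first layer, unit testing a handler in isolation, can look like this; the broker is replaced by a recording fake, and `FakeProducer`, the topic name, and the handler are hypothetical.

```python
class FakeProducer:
    """Records published events instead of talking to a real broker."""
    def __init__(self):
        self.published = []

    def publish(self, topic, event):
        self.published.append((topic, event))

def on_order_placed(event, producer):
    # Handler under test: reacts to OrderPlaced by requesting an invoice.
    producer.publish("invoices", {
        "type": "InvoiceRequested",
        "order_id": event["order_id"],
    })

producer = FakeProducer()
on_order_placed({"type": "OrderPlaced", "order_id": "ord-1"}, producer)
print(producer.published)
# → [('invoices', {'type': 'InvoiceRequested', 'order_id': 'ord-1'})]
```

Because the fake records exact `(topic, event)` pairs, assertions catch both wrong routing and wrong payloads, the two failure modes a mocked broker would otherwise hide.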
For teams running event consumers as part of a broader CI/CD pipeline, the integration test phase should spin up the message broker, publish test events, wait for consumers to process them (with a configurable timeout), and assert on the resulting state in the database or downstream service. Testcontainers makes this practical — a Kafka container starts in 5-10 seconds and provides a fully functional broker for your test suite.