The Transactional Outbox Pattern: Reliable Event Publishing Without 2PC

Every event-driven system depends on a deceptively simple guarantee: when a service commits a state change to its database, the corresponding event must be published. Both happen, or neither does. The system that gets this wrong silently corrupts itself — orders accepted with no inventory deduction, payments charged with no shipping notification, accounts created with no welcome email, all because somewhere in the stack a database write committed and the event publish quietly failed.

This is the dual-write problem. It's the single most common reliability bug in event-driven architectures, and the pattern that solves it cleanly — without resorting to two-phase commit — is the transactional outbox. The pattern is small in code but load-bearing in consequence: every event-driven system at production scale needs some version of it, whether the team realises it or not.

This piece is the practitioner's guide: what the problem actually is, why 2PC isn't the answer, how the outbox pattern works mechanically, the two implementation styles (polling and CDC), and the production concerns that bite teams who skip it.

If you've read the ESB + event-driven architecture piece or the saga pattern piece, this is the foundational piece those depend on. Without reliable event publishing, both architectures degrade into eventual-corruption systems.

The Dual-Write Problem, Visualised

The naive implementation of "update the database and publish an event" is two independent operations:

   Service code:

      db.save(order)              // operation 1
      eventBus.publish(orderEvent) // operation 2

This looks atomic. It is not. Three failure modes between the two operations:

   Failure mode 1: db commit succeeds, publish fails

   db.save(order)              --> succeeds, order persisted
   eventBus.publish(orderEvent) --> fails (network, broker down)

   Result: order exists in database, event was never published.
   Downstream systems never react. Silent corruption.


   Failure mode 2: db commit fails, publish never runs

   db.save(order)              --> fails (constraint violation)
   eventBus.publish(orderEvent) --> would succeed, but never runs

   Result: this case is fine; nothing happened.


   Failure mode 3: publish succeeds, db commit fails after
   (if order is reversed: publish first, then save)

   eventBus.publish(orderEvent) --> succeeds, event in flight
   db.save(order)              --> fails

   Result: event published for an order that never persisted.
   Downstream systems react to a state that doesn't exist.

The asymmetry between database transactions and event-bus publishes makes the dual-write inherently fragile. The database is transactional; the event bus is "fire and forget" from the service's point of view. There is no shared transaction that spans both. Whichever order you do them in, some failure mode produces silent corruption.
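The first failure mode can be reproduced in a few lines. The sketch below is illustrative only: `FlakyBus` and `place_order` are hypothetical names, an in-memory SQLite database stands in for the service's database, and the bus is rigged to fail on every publish.

```python
import sqlite3


class FlakyBus:
    """Hypothetical event-bus stand-in whose publish always fails."""

    def __init__(self):
        self.delivered = []

    def publish(self, event):
        raise ConnectionError("broker unreachable")


def place_order(conn, bus, order_id):
    # Operation 1: the database write commits on its own.
    with conn:
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
    # Operation 2: the publish runs outside that transaction.
    bus.publish({"type": "OrderPlaced", "order_id": order_id})


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")
bus = FlakyBus()

try:
    place_order(conn, bus, "order-1")
except ConnectionError:
    pass  # the caller sees an error, but the order row is already committed

orders_saved = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
events_sent = len(bus.delivered)
print(orders_saved, events_sent)  # 1 0
```

The order is durable and the event is gone. Nothing in the database records that a publish was ever owed.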

The frequency of these failures is low — usually well under 0.1% of writes — but the consequences are severe and the diagnostic difficulty is high. You discover the bug weeks later, when a customer-support ticket reveals an order that "doesn't exist" downstream even though the customer was charged. By then, hundreds of similar drifts have accumulated.

Why 2PC Isn't the Answer

The classical-database response, wrapping both operations in a distributed transaction with two-phase commit, has been covered in the saga piece. The short version: 2PC requires every participant (the database and the event broker) to support the protocol, holds locks across the wire, and makes the coordinator a single point of failure whose outage can block both systems. Event brokers like Kafka don't support XA transactions in the way enterprise databases historically did. Even where the protocol is technically available, the operational cost is unacceptable at the rate of normal application writes.

The architecture we need preserves the conceptual atomicity of "DB write + event publish" without requiring distributed transactions. The transactional outbox is that architecture.

The Outbox Pattern: The Core Idea

The trick is to make the event publish part of the same database transaction as the state change. You can't transact across the database and the event bus — but you can transact within the database. So you write the event to a table in the same database, in the same transaction as the state change, and let a separate process publish from that table to the event bus.

   Service code (single transaction):

   BEGIN TRANSACTION
      db.save(order)                          // domain state
      db.insert(outbox, { event: orderEvent }) // event in same DB
   COMMIT

   ... (eventually) ...

   Publisher process (separate):
      reads from outbox -> publishes to event bus -> marks as published

The state change and the event row are written atomically. Either both succeed and the transaction commits, or both fail and the transaction rolls back. There is no window in which one happened and the other didn't.
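A minimal sketch of that single transaction, assuming an in-memory SQLite database in place of the service's real one. The table layout, the `place_order` helper, and the `OrderPlaced` event type are illustrative assumptions, not a prescribed schema. The second call deliberately fails on the outbox insert to show that the order row is rolled back with it.

```python
import json
import sqlite3
import uuid


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)")
    conn.execute(
        """CREATE TABLE outbox (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               event_id TEXT NOT NULL UNIQUE,
               aggregate_id TEXT NOT NULL,
               event_type TEXT NOT NULL,
               payload TEXT NOT NULL,
               published_at TEXT
           )"""
    )


def place_order(conn, order_id, total_cents, event_type="OrderPlaced"):
    event = {"order_id": order_id, "total_cents": total_cents}
    # `with conn` wraps both inserts in one transaction:
    # commit if the block completes, roll back if it raises.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, event_type, json.dumps(event)),
        )


def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
create_schema(conn)

place_order(conn, "order-1", 4999)

try:
    # event_type=None violates NOT NULL on the outbox insert,
    # so the order insert made just before it is rolled back too.
    place_order(conn, "order-2", 1000, event_type=None)
except sqlite3.IntegrityError:
    pass

print(count(conn, "orders"), count(conn, "outbox"))  # 1 1
```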

The event row sits in the database until the publisher picks it up and forwards it to the event bus. The publisher is a separate process, but its role is only to forward events that have already been atomically committed. It can fail. It can retry. It can lag. None of those failure modes can lose an event or invent one, because the event is already durably committed in the database; at worst, delivery is delayed or repeated. The publisher is just a delivery mechanism, not a source of truth.

   +---------------+
   | Service code  |
   +-------+-------+
           |
           v (single ACID transaction)
   +-------+-------+
   | Database      |
   |  +---------+  |
   |  | orders  |  |
   |  +---------+  |
   |  | outbox  |  |
   |  +---------+  |
   +-------+-------+
           |
           | (publisher process polls or CDC tails)
           v
   +---------------+
   | Event Bus     |
   | (Kafka, etc.) |
   +---------------+

The architectural elegance: the dual-write problem is eliminated, not mitigated. There is no failure mode in which the order exists and the event doesn't — because the event row is part of the same atomic transaction. The publisher's only job is delivery, and delivery failures are retried until they succeed, which is a much easier problem than "did this thing happen or not."

The pattern is small enough to fit in a paragraph, and load-bearing enough to be the foundation of every event-driven architecture at production scale. The implementation has two main styles.

Implementation: The Polling Publisher

The simplest implementation. A background process polls the outbox table on an interval, picks up unpublished rows, publishes them to the event bus, and marks them as published.

   Publisher loop (every N seconds):

   1. SELECT * FROM outbox WHERE published_at IS NULL ORDER BY id LIMIT 100
   2. For each row:
        try:
            eventBus.publish(row.payload)
            UPDATE outbox SET published_at = NOW() WHERE id = row.id
        except:
            break  // stop this batch so later rows are not published ahead of the failed one;
                   // the row stays unpublished and is retried on the next iteration
   3. Sleep N seconds; repeat.
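One pass of that loop might look like the following sketch. It assumes SQLite as a stand-in database and a caller-supplied `publish` function; `publish_pending` and the simulated broker outage are hypothetical. The sketch stops the batch at the first failed publish, a design choice that keeps later rows from being delivered ahead of an earlier one.

```python
import sqlite3
from datetime import datetime, timezone


def publish_pending(conn, publish, batch_size=100):
    """Run one polling pass; return the number of rows marked as published."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published_at IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            publish(payload)
        except Exception:
            # Leave the row unpublished and end the pass; the next pass retries it.
            break
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = ? WHERE id = ?",
                (datetime.now(timezone.utc).isoformat(), row_id),
            )
        sent += 1
    return sent


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT NOT NULL, published_at TEXT)"
)
with conn:
    conn.executemany("INSERT INTO outbox (payload) VALUES (?)", [("e1",), ("e2",), ("e3",)])

delivered = []
broker = {"up": False}


def publish(payload):
    # Simulated outage: the broker rejects e2 until it is marked as up.
    if payload == "e2" and not broker["up"]:
        raise ConnectionError("broker unreachable")
    delivered.append(payload)


first_pass = publish_pending(conn, publish)   # delivers e1, stops at e2
broker["up"] = True
second_pass = publish_pending(conn, publish)  # delivers e2 and e3
print(first_pass, second_pass, delivered)  # 1 2 ['e1', 'e2', 'e3']
```

A real publisher would call `publish_pending` in a loop with a sleep between passes. If the process dies between `publish` and the `UPDATE`, the row is published again on the next pass, which is why consumers have to tolerate duplicates.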

The strengths of polling:

Operationally simple. No additional infrastructure. The publisher is just a service that talks to the database.
Easy to reason about. The mental model is straightforward — read rows, publish, mark as done.
Works with any database. No requirement for specific replication features.
Failure handling is obvious. If a publish fails, the row remains unpublished and gets retried on the next poll.

The weaknesses:

Latency. Events lag behind their state change by up to the poll interval. With a 1-second poll, events arrive on average half a second late. For most workloads this is fine; for latency-critical ones it's not.
Database load. Every poll is a query, even when there's nothing to publish. At low write rates, the publisher is mostly running empty queries.
Polling competes with other work. Indexes have to support the polling query without slowing down the normal write path.
Ordering can be subtle. If multiple publishers run in parallel for scale, ensuring ordered delivery per stream requires care.
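One way to handle the ordering concern, sketched here as an assumption rather than a prescribed design, is to give each stream exactly one owner: hash the aggregate ID to a publisher index, and let each publisher process only the rows it owns, in id order. The `owner_of` and `rows_for_publisher` helpers are hypothetical. Row-level locking such as Postgres's SELECT ... FOR UPDATE SKIP LOCKED keeps two publishers from picking up the same row, but does not by itself preserve per-stream order.

```python
import hashlib


def owner_of(aggregate_id: str, publisher_count: int) -> int:
    """Map an aggregate to exactly one publisher using a stable hash."""
    # hashlib is used instead of the built-in hash(), which is salted per process.
    digest = hashlib.sha256(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % publisher_count


def rows_for_publisher(rows, publisher_index, publisher_count):
    """Keep only the outbox rows this publisher owns, in id order."""
    return [
        row
        for row in sorted(rows, key=lambda r: r["id"])
        if owner_of(row["aggregate_id"], publisher_count) == publisher_index
    ]


rows = [
    {"id": 1, "aggregate_id": "order-1", "event_type": "OrderPlaced"},
    {"id": 2, "aggregate_id": "order-2", "event_type": "OrderPlaced"},
    {"id": 3, "aggregate_id": "order-1", "event_type": "OrderPaid"},
    {"id": 4, "aggregate_id": "order-2", "event_type": "OrderShipped"},
]

# Each publisher's slice of the batch; every event for a given order lands in one slice.
assignments = [rows_for_publisher(rows, i, 2) for i in range(2)]
for index, owned in enumerate(assignments):
    print(index, [(r["id"], r["aggregate_id"]) for r in owned])
```

The trade-off is that changing the number of publishers reassigns streams, so the old and new owners must not run at the same time during a resize.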

Polling is the right starting choice for most teams. It works at surprisingly large scale (low tens of thousands of events per second) before the operational characteristics start to push you toward CDC.

Implementation: The CDC Publisher

The more sophisticated implementation. Instead of polling, the publisher uses change data capture — tailing the database's transaction log directly — to pick up new outbox rows as soon as they're committed.

   +---------------+
   | Database      |
   |  +---------+  |
   |  | outbox  |  |
   |  +---------+  |
   |  WAL/binlog   |
   +-------+-------+
           |
           | (CDC tool tails the log)
           v
   +---------------+
   | Debezium /    |
   | similar       |
   +-------+-------+
           |
           v
   +---------------+
   | Event Bus     |
   | (Kafka)       |
   +---------------+

CDC tools (Debezium is the de facto choice for Postgres and MySQL) read the database's transaction log (the Postgres WAL, the MySQL binlog) and emit events whenever rows change. Set up to watch the outbox table specifically, the CDC tool publishes events to Kafka shortly after a row is committed, typically within tens of milliseconds.
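As a rough illustration, a Debezium Postgres connector restricted to the outbox table could be registered with a configuration along these lines. Everything here is a sketch under assumptions: the host, credentials, connector name, and column names are hypothetical, routing by `event_type` is one possible choice, and the property names should be checked against the Debezium documentation for the version in use. The Python below only builds the JSON body; it does not contact Kafka Connect.

```python
import json

# Hypothetical connection details and names; replace with your own.
connector = {
    "name": "orders-outbox-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "db.internal.example",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "change-me",
        "database.dbname": "orders",
        "topic.prefix": "orders",
        # Capture changes from the outbox table only.
        "table.include.list": "public.outbox",
        # Debezium's outbox event router reshapes each captured row into an event.
        "transforms": "outbox",
        "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
        # Map the router's expected fields onto the assumed outbox columns.
        "transforms.outbox.table.field.event.id": "id",
        "transforms.outbox.table.field.event.key": "aggregate_id",
        "transforms.outbox.table.field.event.payload": "payload",
        "transforms.outbox.route.by.field": "event_type",
    },
}

request_body = json.dumps(connector, indent=2)
print(request_body)
```

The resulting JSON is the kind of body that is POSTed to the `/connectors` endpoint of the Kafka Connect REST API. With this style of publisher, the application code that writes to the outbox stays exactly as it was.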

The strengths of CDC:

Low latency. Events arrive milliseconds after the underlying commit. Fast enough for nearly all latency-sensitive workloads.
No polling overhead. The CDC tool consumes the log as it's written, not on a fixed schedule.
High throughput. Industrial-scale CDC pipelines handle hundreds of thousands of events per second.
Ordering is natural. Events arrive in commit order from the log.

The weaknesses:

Operational complexity. Debezium-on-Kafka-Connect is real infrastructure to run. Schema registry, connector configuration, offset management, snapshot orchestration — all need attention.
Database coupling. The CDC tool is tightly coupled to your database's specific log format. Switching databases means re-doing the CDC setup.
Schema evolution. Changing the outbox table's schema requires coordinated changes in the CDC pipeline. Doable, but not as transparent as the polling approach.
More failure modes. Log retention, connector failures, replication lag — all need monitoring and runbooks.

CDC is the right choice when latency matters and the operational footprint can absorb the additional infrastructure. The PostgreSQL deep-dive piece covers the WAL-level details that matter for tuning a CDC publisher.

Polling vs CDC: Trade-offs

   Concern                 | Polling                           | CDC
   ------------------------+-----------------------------------+--------------------------------------
   Operational complexity  | Low                               | Higher
   Latency                 | Seconds                           | Tens of ms
   Throughput ceiling      | Mid (10K+ events/sec)             | High (100K+ events/sec)
   Infrastructure required | None extra                        | CDC tooling (Debezium, Kafka Connect)
   Database coupling       | Loose                             | Tight
   Failure recovery        | Trivially simple                  | Requires runbook
   Best for                | Starting points; moderate volumes | Latency-sensitive; high-volume

The pragmatic path most teams follow: start with polling. Cross over to CDC when the latency or throughput numbers force the move, not before. Migrating from polling to CDC is straightforward because the outbox table format is the same; only the publisher changes.

Idempotency and Deduplication on the Consumer Side

The outbox pattern guarantees at-least-once delivery. The publisher will deliver every event committed to the outbox. It may, under specific failure conditions (publisher dies after publishing but before marking the row as published), deliver the same event more than once.

This means consumers must be idempotent — applying the same event twice must produce the same result as applying it once.

   Consumer flow:

   1. Receive event
   2. Look up: have I processed event_id X before?
        Yes -> skip; we already handled this one
        No  -> apply the event; record event_id X as processed
   3. Acknowledge

The standard implementation maintains a processed-events table on the consumer side. Each event carries a unique ID, and the consumer records that ID in the same local transaction that applies the event's effects, so a crash cannot leave one without the other. Duplicate events are detected by ID lookup and dropped.
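A minimal sketch of such a consumer, assuming SQLite as the consumer's local database and an inventory deduction as the event's effect. The `handle_event` function and both table names are hypothetical. Instead of a separate lookup, the sketch relies on the primary key of the processed-events table to reject a duplicate ID, and it records the ID and applies the effect in one transaction.

```python
import sqlite3


def handle_event(conn, event_id, sku, quantity):
    """Apply an inventory deduction at most once per event_id; return True if applied."""
    try:
        with conn:
            # The primary key on event_id rejects duplicates. Recording the ID and
            # applying the change share one transaction, so they commit or roll back together.
            conn.execute("INSERT INTO processed_events (event_id) VALUES (?)", (event_id,))
            conn.execute(
                "UPDATE inventory SET on_hand = on_hand - ? WHERE sku = ?",
                (quantity, sku),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # already processed: skip, then acknowledge as usual


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, on_hand INTEGER NOT NULL)")
with conn:
    conn.execute("INSERT INTO inventory (sku, on_hand) VALUES ('widget', 10)")

first = handle_event(conn, "evt-1", "widget", 3)
second = handle_event(conn, "evt-1", "widget", 3)  # redelivery of the same event
on_hand = conn.execute("SELECT on_hand FROM inventory WHERE sku = 'widget'").fetchone()[0]
print(first, second, on_hand)  # True False 7
```

The redelivered event is acknowledged without touching the inventory a second time.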

The architectural principle: every consumer in an event-driven system must be idempotent. This is non-negotiable. The outbox guarantees you'll get the event; the consumer must guarantee that getting it twice doesn't cause harm.

For saga implementations, this matters even more — the same logic that handles duplicate events also handles retried compensations. Idempotent design is the foundation that makes both patterns work in production.

What This Pattern Solves

The pros worth being explicit about. The outbox pattern solves several concrete problems that the naive dual-write architecture doesn't:

The dual-write failure mode is eliminated, not mitigated. Every committed state change has its event recorded, precisely because the event row is part of the same atomic transaction. There is no failure scenario in which one happened and the other didn't.
Recovery from publisher outages is automatic. If the publisher process dies for an hour, events accumulate in the outbox and are flushed when it comes back online. No lost events, no manual reconciliation.
The publisher can be replaced or scaled independently. Switching from polling to CDC is a publisher swap, not an application-code change. Scaling out publishers for higher throughput is straightforward.
Schema evolution is contained. Event schemas live in the outbox table's payload column. Evolving them affects the publisher and consumers, not the application's domain code.
Audit and observability come naturally. The outbox table is, incidentally, a log of every event the service has produced, for as long as published rows are retained rather than pruned. Useful for debugging, for compliance, for analytics, and for event sourcing if you choose to take the next step.
It composes with everything. Sagas, integrations, downstream consumers, analytics pipelines — all benefit from reliable event delivery. The outbox is the foundation, not the application.

The compounding benefit: the outbox pattern turns event-driven architecture from "mostly reliable" into actually reliable. Without it, every event-driven system carries a hidden tax of occasional silent corruption that nobody can fully reason about. With it, the events are as reliable as the database commits they piggyback on.

When Not to Use It

Honest limits. Three cases where the outbox pattern is the wrong tool:

Pure read paths or pure transient operations. If the operation doesn't produce a state change in your own database, there's no outbox transaction to attach to. Events about purely external state (a webhook fired by another service, an inbound message you're forwarding) don't need the outbox — they need other reliability patterns (idempotent forwarding, deduplication on receive).
Best-effort notifications where occasional loss is acceptable. If the event is "the user might want to know about this" and the system is fine if it sometimes doesn't get sent, the outbox infrastructure isn't justified. Direct publishing is fine. Most production systems have some events in this category — be honest about which ones, and don't over-engineer the unimportant cases.
Single-database, single-service systems. If you don't yet have an event-driven architecture, you don't yet need the outbox. Add it when you add event publishing as a real architectural primitive, not preemptively.

The general anti-pattern to avoid: applying the outbox indiscriminately to every database write. Most writes don't produce events that other systems need. Be selective about which writes have associated events, and apply the outbox specifically there.

Implementation Stack Choices

Common production stacks for the outbox pattern:

Hand-rolled polling on Postgres. A simple outbox table with columns (id, aggregate_id, event_type, payload, published_at) and a small background worker that polls and publishes. The right starting point for most teams. Postgres patterns are covered in the Postgres deep-dive piece.
Debezium + Kafka Connect. The de facto CDC stack. Tail Postgres or MySQL into Kafka topics. Requires running Kafka Connect, but very production-mature.
AWS DMS for CDC. Managed CDC into Kinesis or other AWS streaming services. Good for AWS-native architectures.
Per-language libraries. Spring Modulith has outbox support for Spring Boot teams. Various Node.js libraries handle the pattern for Node services. Often the right choice when you want the pattern baked into the framework rather than hand-rolled.
Database-native features. Some newer databases (CockroachDB, YugabyteDB) offer built-in changefeed or CDC features that subsume the separate CDC layer. Worth knowing about as the database market evolves.

The decision criteria for the implementation: throughput, latency, existing infrastructure, and team operational comfort. Most teams should start hand-rolled-on-Postgres and migrate only when scale or latency forces the move.

The Practitioner's Take

The transactional outbox is the quietest critical pattern in event-driven architecture. It doesn't generate headlines or marketing pages. It just makes the difference between an event-driven system that actually works in production and one that mostly works but occasionally drifts in ways nobody can diagnose. The teams that ship reliable event-driven architectures have an outbox somewhere in the stack, even if they don't always call it that. The teams that don't ship reliable event-driven architectures often don't realise the outbox is what they're missing.

The right move is to adopt it deliberately, the first time you add real cross-service events to your architecture. Start with the polling implementation — it's simple, it works, and it composes with everything. Migrate to CDC when the latency or throughput numbers force it, not because the CDC option exists. Apply the pattern selectively, to the writes that actually produce cross-service events, not as a universal database wrapper.

Event-driven architecture is sold on its reactive elegance. The reactive elegance is real, but it depends entirely on a boring, unglamorous reliability foundation. The outbox is that foundation. Without it, the elegance is a marketing slide. With it, the architecture actually holds up.