The Saga Pattern: Distributed Transactions Without Two-Phase Commit

Splitting a system into independent services solves a lot of problems. It also creates one specifically nasty new one: how do you change state across multiple services as a single atomic operation, when each service has its own database and there's no shared transaction to roll back?

The traditional answer, two-phase commit (2PC), does not scale. It requires every participant to support a coordinated prepare-and-commit protocol, holds locks across the wire, and turns one slow participant into a cascading outage for every other participant. Few production microservices systems use it for cross-service state changes, and the ones that do tend to regret it.

The pattern that does work is the saga — a sequence of local transactions, each one owned by one service, coordinated by either events (choreography) or a central coordinator (orchestration), with explicit compensating transactions to roll back partial progress when something fails. This is the practitioner's guide to building sagas: when each style wins, how compensation actually works, the hard parts that bite teams in production, and the stack choices for implementing one.

If you've read the ESB + event-driven architecture piece, the saga pattern is the natural extension — the piece that explains how to actually maintain consistency across the services connected to that bus.

Why Two-Phase Commit Doesn't Scale

Before the saga pattern makes sense, the failure of the alternative has to be visible. Two-phase commit works in theory:

   Coordinator ----> Phase 1: PREPARE ----> Service A: "ready"
                                            Service B: "ready"
                                            Service C: "ready"

   Coordinator ----> Phase 2: COMMIT  ----> Service A: commits
                                            Service B: commits
                                            Service C: commits

In practice, the failure mode is the blocking-on-coordinator problem. If the coordinator fails between phase 1 and phase 2, every participant is stuck holding locks, waiting for a COMMIT decision that will not arrive until the coordinator is recovered:

   Coordinator: X  (crashed after phase 1)
        |
        v
   Service A: locked, waiting for COMMIT
   Service B: locked, waiting for COMMIT
   Service C: locked, waiting for COMMIT

   ... and now nothing can proceed until manual recovery.

This isn't a theoretical concern. It's the reason 2PC fell out of favour for cross-service transactions in the 2010s. The participants hold locks for the duration of the protocol. Network partitions, slow participants, and coordinator failures all turn into cascading availability incidents. The architecture buys atomicity at the cost of liveness, and at scale that trade is unacceptable.
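
To make the blocking problem concrete, here is a minimal Python sketch. The `Participant` class, its `holding_locks` property, and the `crash_after_prepare` switch are illustrative assumptions, not a real 2PC implementation. The only point is that a participant that has voted "ready" cannot release its locks on its own.

```python
from enum import Enum
from typing import List


class State(Enum):
    IDLE = "idle"
    PREPARED = "prepared"    # voted "ready", now holding locks
    COMMITTED = "committed"


class Participant:
    """A hypothetical 2PC participant that takes its locks on PREPARE."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.IDLE

    def prepare(self) -> bool:
        self.state = State.PREPARED   # locks are taken here
        return True

    def commit(self) -> None:
        self.state = State.COMMITTED  # locks are released here

    @property
    def holding_locks(self) -> bool:
        return self.state is State.PREPARED


def two_phase_commit(participants: List[Participant],
                     crash_after_prepare: bool = False) -> None:
    """Run both phases; optionally simulate a coordinator crash between them."""
    votes = [p.prepare() for p in participants]   # phase 1
    if crash_after_prepare:
        return  # coordinator is gone; nobody hears COMMIT or ABORT
    if all(votes):
        for p in participants:                    # phase 2
            p.commit()


services = [Participant(n) for n in ("A", "B", "C")]
two_phase_commit(services, crash_after_prepare=True)
stuck = [p.name for p in services if p.holding_locks]
print(stuck)  # ['A', 'B', 'C']
```

Every participant is left in `PREPARED`, and nothing in the participant's own code can move it out of that state.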

The alternative pattern abandons strict atomicity in exchange for liveness, while preserving the conceptual intent of "this multi-step business operation either fully succeeds or gets cleanly rolled back." That pattern is the saga.

The Saga Pattern: The Core Idea

A saga is a sequence of local transactions. Each transaction is small, owned by exactly one service, and commits or fails on its own. The coordination between transactions happens via events or commands at the application level, not via distributed locks.

   +---------+   +---------+   +---------+   +----------+
   | T1: Pay |-->| T2: Ship|-->| T3: Bill|-->| T4: Loyal|
   +---------+   +---------+   +---------+   +----------+
   commits       commits       commits       commits

Each step does its work and commits locally. The next step starts when the previous one completes. If everything succeeds, the saga ends in the desired final state. If something fails partway through — say, T3 fails — the saga compensates by running explicit rollback transactions for the earlier steps, in reverse order:

   +---------+   +---------+   +---------+
   | T1: Pay |-->| T2: Ship|-->| T3: FAIL|
   +---------+   +---------+   +---------+
        |             |             |
        v             v             v
   +---------+   +---------+
   | C1:     |<--| C2:     |<--- compensation
   | Refund  |   | Cancel  |
   +---------+   +---------+

C2 cancels the shipment. C1 refunds the payment. The system arrives at a state equivalent to "the saga never happened," even though individual transactions committed along the way.

The deep difference from 2PC: at no point are any of these transactions holding locks across services. Each commits locally and immediately. Failures are handled by running more transactions (the compensating ones), not by rolling back distributed state. The price is isolation: other requests can observe the intermediate states while the saga is in flight. This trades strict atomicity for liveness, and at scale that's the right trade.
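
The core loop is small enough to sketch in a few lines of Python. This is an in-process illustration under stated assumptions: `Step` and `run_saga` are hypothetical names, and each `action` stands in for a local transaction that would really run in a separate service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # local transaction; raises on failure
    compensation: Callable[[], None]  # business-level undo of `action`


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_billing() -> None:
    raise RuntimeError("billing service rejected the request")


saga = [
    Step("pay", lambda: log.append("T1: pay"), lambda: log.append("C1: refund")),
    Step("ship", lambda: log.append("T2: ship"), lambda: log.append("C2: cancel shipment")),
    Step("bill", fail_billing, lambda: log.append("C3: void bill")),
]

ok = run_saga(saga)
print(ok, log)
# False ['T1: pay', 'T2: ship', 'C2: cancel shipment', 'C1: refund']
```

The failed step itself is not compensated because it never committed. Only the steps in `completed` are unwound.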

There are two ways to implement the coordination: choreography and orchestration.

Choreography Sagas

In a choreographed saga, there is no central coordinator. Each service knows what events to react to and what events to publish. The saga happens because the services' reactions chain together.

   [Order Service]
        |
        | publishes: OrderPlaced
        v
   +-------------+
   | event bus   |
   +-------------+
        |
        v
   [Payment Service]  reacts: charges the card
        |
        | publishes: PaymentSucceeded
        v
   +-------------+
   | event bus   |
   +-------------+
        |
        v
   [Shipping Service]  reacts: schedules shipment
        |
        | publishes: ShipmentScheduled
        v
   ... and so on

The order service doesn't know that payment exists. Payment doesn't know about shipping. Each service knows only its own piece — "when I see event X, I do work Y and publish event Z." The saga emerges from the chain of reactions.
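
A minimal sketch of that chain of reactions, assuming an in-memory `EventBus` as a stand-in for a real broker. The class and the two service functions are hypothetical. A real bus would be asynchronous and durable, which this is not.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[dict], None]


class EventBus:
    """In-memory stand-in for a broker; delivery is synchronous and in order."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.queue: Deque[Tuple[str, dict]] = deque()
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.queue.append((event_type, payload))

    def run(self) -> None:
        """Deliver queued events until no reaction publishes anything new."""
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.history.append(event_type)
            for handler in self.handlers[event_type]:
                handler(payload)


def payment_service(bus: EventBus) -> None:
    def on_order_placed(order: dict) -> None:
        # Charge the card in a local transaction, then announce the result.
        bus.publish("PaymentSucceeded", order)

    bus.subscribe("OrderPlaced", on_order_placed)


def shipping_service(bus: EventBus) -> None:
    def on_payment_succeeded(order: dict) -> None:
        # Schedule the shipment locally, then announce it.
        bus.publish("ShipmentScheduled", order)

    bus.subscribe("PaymentSucceeded", on_payment_succeeded)


bus = EventBus()
payment_service(bus)
shipping_service(bus)
bus.publish("OrderPlaced", {"order_id": "o-1"})  # the order service's only job
bus.run()
print(bus.history)
# ['OrderPlaced', 'PaymentSucceeded', 'ShipmentScheduled']
```

Neither service function references the other. The only shared vocabulary is the event names.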

The strengths of choreography:

Loose coupling. Each service is genuinely independent. Adding a new step is adding a new event reaction, no coordinator changes required.
Natural fit for event-driven architecture. If you're already running an event bus, the saga uses the same primitive you already have.
No central point of failure. No coordinator to crash. Each service handles its own piece.

The weaknesses:

Hard to reason about end-to-end. "What does the order saga actually do?" requires reading the event-reaction logic across five services. No single place describes the workflow.
Compensation is decentralised. Each service has to know how to compensate on the failure events it might see. The rollback logic is scattered.
Easy to accidentally create event cycles. Service A publishes X, Service B reacts and publishes Y, Service C reacts and publishes X. Three services later you have an infinite loop you didn't intend.
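
One way to catch accidental cycles before they reach production is to check the event-reaction graph statically. The sketch below assumes you can assemble a map from each event type to the event types published in reaction to it, by hand or from service metadata. The `reactions` map and `find_event_cycle` are hypothetical, and the check is a plain depth-first search.

```python
from typing import Dict, List, Optional


def find_event_cycle(reactions: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle of event types (first == last), or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    colour: Dict[str, int] = {}
    path: List[str] = []

    def visit(event: str) -> Optional[List[str]]:
        colour[event] = GREY
        path.append(event)
        for nxt in reactions.get(event, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:          # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        colour[event] = BLACK
        return None

    for event in list(reactions):
        if colour.get(event, WHITE) == WHITE:
            cycle = visit(event)
            if cycle:
                return cycle
    return None


# Service B reacts to X by publishing Y; Service C reacts to Y by publishing X.
looping = {"X": ["Y"], "Y": ["X"]}
print(find_event_cycle(looping))  # ['X', 'Y', 'X']

healthy = {
    "OrderPlaced": ["PaymentSucceeded"],
    "PaymentSucceeded": ["ShipmentScheduled"],
}
print(find_event_cycle(healthy))  # None
```

A check like this only sees the reactions you have declared, so it is a safety net rather than a guarantee.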

Choreography is the right choice when the saga steps are genuinely independent business operations that happen to chain together, and when the team values service autonomy over workflow visibility.

Orchestration Sagas

In an orchestrated saga, a central coordinator (the orchestrator) explicitly drives the workflow. The orchestrator sends commands to each service, listens for responses, and decides what happens next.

            +-------------------------------------+
            |            Orchestrator             |
            |          (workflow state)           |
            +---+---------+---------+---------+---+
                |         |         |         |
                | cmd:    | cmd:    | cmd:    | cmd:
                | pay     | ship    | bill    | loyal
                v         v         v         v
            +-------+ +-------+ +-------+ +-------+
            | Pay   | | Ship  | | Bill  | | Loyal |
            | Svc   | | Svc   | | Svc   | | Svc   |
            +---+---+ +---+---+ +---+---+ +---+---+
                |         |         |         |
                v         v         v         v
                response  response  response  response
                          (back to orchestrator)

The orchestrator holds the workflow state. It knows "I sent the pay command, I got back PaymentSucceeded, now I send the ship command." On failure, the orchestrator knows which compensating commands to send and in what order. The participants stay simple — they just respond to commands and report back.
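
A sketch of that state machine, assuming a three-step workflow and a simplified reply model in which each reply is just success or failure. `WORKFLOW`, `Orchestrator`, and `drive` are hypothetical names. A real orchestrator would also persist its state and match replies to commands by ID.

```python
from typing import List, Optional, Tuple

# Hypothetical workflow definition: (forward command, compensating command).
WORKFLOW: List[Tuple[str, str]] = [
    ("pay", "refund"),
    ("ship", "cancel_shipment"),
    ("bill", "void_bill"),
]


class Orchestrator:
    """Holds workflow state and picks the next command from each reply."""

    def __init__(self) -> None:
        self.position = 0          # index of the step with a command in flight
        self.compensating = False
        self.status = "RUNNING"    # RUNNING -> COMPLETED or COMPENSATED

    def start(self) -> str:
        return WORKFLOW[0][0]

    def on_reply(self, success: bool) -> Optional[str]:
        """Consume the reply to the outstanding command; return the next one."""
        if self.compensating:
            if not success:
                return WORKFLOW[self.position][1]   # retry the same compensation
        elif success:
            self.position += 1
            if self.position == len(WORKFLOW):
                self.status = "COMPLETED"
                return None
            return WORKFLOW[self.position][0]
        else:
            self.compensating = True                # forward step failed
        # Step back to the most recently committed step and undo it.
        self.position -= 1
        if self.position < 0:
            self.status = "COMPENSATED"
            return None
        return WORKFLOW[self.position][1]


def drive(orchestrator: Orchestrator,
          failing_command: Optional[str] = None) -> List[str]:
    """Simulate participants: every command succeeds except `failing_command`."""
    sent: List[str] = []
    command = orchestrator.start()
    while command is not None:
        sent.append(command)
        command = orchestrator.on_reply(command != failing_command)
    return sent


print(drive(Orchestrator(), failing_command="bill"))
# ['pay', 'ship', 'bill', 'cancel_shipment', 'refund']
```

The participants never see the workflow. All sequencing and rollback decisions live in `on_reply`.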

The strengths of orchestration:

Workflow visibility. The saga is in one place. You can read the orchestrator's state machine and understand exactly what happens, in what order, with what compensations.
Centralised compensation logic. The orchestrator knows the full rollback sequence. Each service stays focused on its own forward and reverse operations.
Easier observability. The orchestrator naturally produces a timeline of "what step are we on, what failed where, when did each step run."

The weaknesses:

The orchestrator can become a coupling point. Every service knows the orchestrator exists. Adding a new step requires updating the orchestrator.
The orchestrator needs to be durable and recoverable. Its state machine is the workflow's source of truth — losing it loses the saga's progress.
It can become a bottleneck. All workflow traffic flows through it. Performance and availability concerns concentrate there.
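
The durability requirement is the one most worth sketching. The example below assumes the simplest possible persistence, a JSON file per saga written with write-then-rename. `SagaStore`, `STEPS`, and `run` are hypothetical. A production orchestrator would use a database or a workflow engine's own storage.

```python
import json
import os
import tempfile
from typing import List, Optional

STEPS = ["pay", "ship", "bill"]  # hypothetical workflow


class SagaStore:
    """Persists the orchestrator's position so a restart can resume from it."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> dict:
        if not os.path.exists(self.path):
            return {"next_step": 0}
        with open(self.path) as f:
            return json.load(f)

    def save(self, state: dict) -> None:
        # Write-then-rename so a crash never leaves a half-written record.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)


def run(store: SagaStore, executed: List[str],
        crash_before: Optional[str] = None) -> None:
    state = store.load()
    while state["next_step"] < len(STEPS):
        step = STEPS[state["next_step"]]
        if step == crash_before:
            raise RuntimeError("orchestrator crashed")  # simulated process death
        executed.append(step)      # stands in for sending the command
        state["next_step"] += 1
        store.save(state)          # a crash before this line re-sends `step`


executed: List[str] = []
store = SagaStore(os.path.join(tempfile.mkdtemp(), "saga-42.json"))
try:
    run(store, executed, crash_before="ship")
except RuntimeError:
    pass
run(store, executed)  # a fresh "process" resumes from the saved position
print(executed)  # ['pay', 'ship', 'bill']
```

Note the gap between sending a command and saving the new position. A crash in that window means the command is sent again after restart, which is one reason participants need to be idempotent.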

Orchestration is the right choice when the saga represents a meaningful business workflow with explicit ordering, where visibility and centralised reasoning outweigh the coupling cost.

Choreography vs Orchestration: When Each Wins

A direct comparison:

Concern	Choreography	Orchestration
Service autonomy	High	Lower (services know the orchestrator)
Workflow visibility	Low (scattered across services)	High (one place)
Compensation complexity	Distributed (each service compensates)	Centralised (orchestrator drives)
Coupling	Loose	Tighter
Risk of accidental cycles	Real	Low
Operational complexity	Lower per service	Concentrated in orchestrator
Right for	Independent steps; loosely related operations	Defined business workflows; high-visibility sagas

A pragmatic rule of thumb: use choreography when the steps are naturally events that other services happen to care about. Use orchestration when the steps are a workflow with explicit ordering and compensation logic that matters as a unit.

In practice, many production systems use both — choreography for loose event-reaction flows, orchestration for the handful of critical business workflows (checkout, signup, refund) where visibility and control are worth the coupling cost.

Whichever coordination style you choose, the CQRS+ES architecture I've written about pairs naturally with sagas: the event log is a good substrate both for carrying saga events and for reconstructing saga state.

Compensating Transactions

The single most important design discipline in any saga is correctly identifying the compensating transaction for each step. A compensating transaction is not necessarily the literal inverse — it's the business-logical undo of the step.

A few principles:

Compensation doesn't have to be perfect. The goal is to leave the system in a state equivalent to "the saga didn't happen," not to byte-for-byte revert.
Compensation must be safe to retry. If the compensation itself fails partially, you'll need to run it again. Idempotency matters as much for compensations as for forward steps.
Some operations have no clean compensation. Sending a confirmation email can't be unsent — the compensation is a follow-up "ignore that previous message" email. Designing the saga to put irrevocable steps as late as possible minimises this kind of problem.

   Forward saga (success path):

   T1: Reserve inventory             --> commits
   T2: Charge payment                --> commits
   T3: Schedule shipment             --> commits
   T4: Send confirmation email       --> commits

   Failure at T3 -- compensating saga:

   C2: Refund payment                <-- runs first
   C1: Release inventory reservation <-- runs second

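Notice the reverse order. Compensations typically run in reverse because later steps usually depend on the effects of earlier ones. Unwinding like a stack means each compensation only has to undo its own step, and it runs against a state in which everything that committed after it has already been undone.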
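
Two of the principles above, "business-logical undo rather than literal inverse" and "safe to retry", can be sketched with an append-only ledger. The `Ledger` class and its entry IDs are assumptions made for illustration. The refund is a new offsetting entry, not a deletion, and calling it twice has no further effect.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LedgerEntry:
    entry_id: str
    account: str
    amount_cents: int  # negative = charge, positive = refund


class Ledger:
    """Hypothetical append-only ledger owned by the payment service."""

    def __init__(self) -> None:
        self.entries: List[LedgerEntry] = []

    def _has(self, entry_id: str) -> bool:
        return any(e.entry_id == entry_id for e in self.entries)

    def charge(self, charge_id: str, account: str, amount_cents: int) -> None:
        if not self._has(charge_id):  # duplicate requests are ignored
            self.entries.append(LedgerEntry(charge_id, account, -amount_cents))

    def refund(self, charge_id: str) -> None:
        """Compensation: append an offsetting entry; never delete the charge."""
        refund_id = f"refund:{charge_id}"
        if self._has(refund_id):
            return                    # already compensated, retry is a no-op
        for entry in self.entries:
            if entry.entry_id == charge_id:
                self.entries.append(
                    LedgerEntry(refund_id, entry.account, -entry.amount_cents)
                )
                return
        # The charge never committed, so there is nothing to undo.

    def balance(self, account: str) -> int:
        return sum(e.amount_cents for e in self.entries if e.account == account)


ledger = Ledger()
ledger.charge("c-1", "alice", 2500)
ledger.refund("c-1")
ledger.refund("c-1")  # a retried compensation
print(ledger.balance("alice"), len(ledger.entries))  # 0 2
```

The ledger is not byte-for-byte what it was before the saga, since it now holds two entries, but the balance is equivalent to the saga never having happened.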

For the team running the saga: build the failure-injection tests before you ship. Every step's compensation needs to be exercised in test, ideally automatically. Most production saga bugs I've seen are in compensation paths that nobody tested because they "should never run." They run.
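
A failure-injection check can be quite small. The sketch below assumes an in-process model of the saga where each step mutates a shared dictionary. The step functions, `STEPS`, and `check_every_compensation_path` are hypothetical. It injects a failure at every step in turn and reports any step whose compensation path leaves residue.

```python
import copy
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
Step = Tuple[str, Callable[[State], None], Callable[[State], None]]


def reserve(state: State) -> None:
    state["stock"] -= 1
    state["reserved"] += 1


def release(state: State) -> None:
    state["stock"] += 1
    state["reserved"] -= 1


def charge(state: State) -> None:
    state["balance_cents"] -= 2500


def refund(state: State) -> None:
    state["balance_cents"] += 2500


def schedule(state: State) -> None:
    state["shipments"] += 1


def cancel(state: State) -> None:
    state["shipments"] -= 1


STEPS: List[Step] = [
    ("reserve", reserve, release),
    ("charge", charge, refund),
    ("ship", schedule, cancel),
]


def run_saga(state: State, fail_at: Optional[int] = None) -> bool:
    """Run STEPS; `fail_at` injects a failure in place of that step index."""
    done: List[Step] = []
    for index, step in enumerate(STEPS):
        if index == fail_at:
            for _, _, compensate in reversed(done):
                compensate(state)
            return False
        step[1](state)
        done.append(step)
    return True


def check_every_compensation_path() -> List[int]:
    """Fail each step in turn; return the indexes that left residue behind."""
    initial = {"stock": 5, "reserved": 0, "balance_cents": 10_000, "shipments": 0}
    broken: List[int] = []
    for fail_at in range(len(STEPS)):
        state = copy.deepcopy(initial)
        run_saga(state, fail_at=fail_at)
        if state != initial:
            broken.append(fail_at)
    return broken


print(check_every_compensation_path())  # []
```

An empty list means every injected failure unwound cleanly. Against real services the same loop would drive a test environment and compare business-level state instead of a dictionary.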

The Hard Parts

Three production concerns that bite teams who haven't thought about them in advance:

Idempotency. Network failures mean steps will be retried. A step that "charges the card" must be safe to call twice without charging the card twice. The standard fix is per-step idempotency keys — every saga step includes a unique ID that the receiving service uses to detect and dedupe duplicate requests.
Ordering and partial failures. In a choreography saga, events can arrive out of order if the underlying messaging system doesn't guarantee ordering; a cancellation can, for example, overtake the event it is meant to cancel. Design the steps to be commutative where possible, or use ordering keys (keyed on the saga ID, say) to enforce per-saga sequence.
Saga visibility. Without observability, debugging a stuck saga is archaeology. Every saga should have a unique correlation ID propagated through every step's logs and metrics, with a dashboard showing in-flight sagas and their current step.
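
The idempotency-key fix from the first item can be sketched as follows. `PaymentService`, `step_key`, and the key format are assumptions. A real service would keep the key table in its own database, written in the same local transaction as the charge.

```python
from typing import Dict


def step_key(saga_id: str, step: str) -> str:
    """One key per saga step, so every retry of that step reuses it."""
    return f"{saga_id}:{step}"


class PaymentService:
    """Hypothetical participant that dedupes requests by idempotency key."""

    def __init__(self) -> None:
        self.charges_made = 0
        self._results: Dict[str, dict] = {}  # idempotency key -> stored response

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Duplicate request: return the original response, charge nothing.
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {
            "charge_id": f"ch-{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


payments = PaymentService()
key = step_key("saga-42", "charge")
first = payments.charge(key, 2500)
retry = payments.charge(key, 2500)   # e.g. the first reply was lost in transit
print(payments.charges_made, first == retry)  # 1 True
```

Returning the stored response, rather than an error, matters: the caller that retried cannot tell a duplicate from a first attempt and needs the same answer either way.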

The deeper failure mode — sagas that get stuck in a half-finished state because both the forward and compensation paths failed — is the one that's easiest to ignore until it bites you. The architectural answer: build a saga monitor that watches for sagas that have been in progress longer than expected, alerts on them, and provides tooling for human intervention. Most of the production agent-failure-mode patterns I've written about apply directly here — the surprising failure mode is always the one nobody designed an alert for.
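
The detection half of such a monitor can be sketched in a few lines. `SagaRecord`, the status names, and the five-minute default are assumptions. A real monitor would query the orchestrator's store or the saga's event log on a schedule and feed the result into alerting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

IN_FLIGHT = {"RUNNING", "COMPENSATING"}


@dataclass
class SagaRecord:
    saga_id: str
    current_step: str
    status: str              # RUNNING, COMPENSATING, COMPLETED or COMPENSATED
    last_progress: datetime  # when the saga last changed step or status


def find_stuck(sagas: List[SagaRecord], now: datetime,
               max_idle: timedelta = timedelta(minutes=5)) -> List[SagaRecord]:
    """Return in-flight sagas that have made no progress for `max_idle`."""
    return [
        s for s in sagas
        if s.status in IN_FLIGHT and now - s.last_progress > max_idle
    ]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sagas = [
    SagaRecord("s1", "ship", "RUNNING", now - timedelta(minutes=10)),
    SagaRecord("s2", "pay", "RUNNING", now - timedelta(minutes=2)),
    SagaRecord("s3", "done", "COMPLETED", now - timedelta(hours=2)),
    SagaRecord("s4", "refund", "COMPENSATING", now - timedelta(hours=1)),
]
print([s.saga_id for s in find_stuck(sagas, now)])  # ['s1', 's4']
```

Sagas stuck in `COMPENSATING` are the ones that most need a human, so they are deliberately included alongside stalled forward runs.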

What This Pattern Solves

The pros are worth being explicit about. Sagas address several concrete problems that 2PC and direct-call architectures don't:

Liveness without locks. Each transaction commits locally and immediately. No participant holds distributed locks. The system doesn't grind to a halt because one slow service is in the middle of a coordinated transaction.
Cross-service workflows become first-class. The saga is an explicit entity with state, history, and a clear lifecycle — not a scattered set of side effects across multiple services.
Failure is recoverable. Partial failures roll back via compensations, so the system arrives at a consistent state regardless of where the failure happened, provided the compensations themselves eventually succeed.
Independent service deployment. Each service in the saga can deploy and scale independently. The saga doesn't require coordinated releases.
Reliable event publishing is the foundation. Sagas depend on events being delivered reliably — which is exactly the problem the transactional outbox pattern solves. The two patterns compose naturally.
Visibility is configurable. Orchestrated sagas give you central workflow visibility; choreographed sagas trade that for autonomy. You pick where the trade lands.

The compounding benefit is that the saga becomes a unit of reasoning for the whole organisation. Product, engineering, and operations can all talk about "the checkout saga" or "the refund saga" as a thing with known steps, known compensations, and known failure modes.

When Not to Use It

Honest limits. Three cases where sagas are the wrong tool:

Single-database operations. If the transaction lives entirely inside one service's database, use a normal ACID transaction. Sagas exist for cross-service consistency; using a saga where a local transaction suffices is over-engineering.
Operations with no business-logical undo. Some real-world side effects can't be compensated — physical shipping that's already left the warehouse, government filings, transactions on external rails that don't support reversal. Either restructure the workflow to put these last, or accept that the saga has a one-way commit point past which "compensation" means something else (human intervention, follow-up workflow, brand-protection messaging).
High-velocity, low-stakes operations. If a step occasionally not happening is operationally tolerable, the engineering investment in saga infrastructure isn't justified. Many real systems run for years on "best-effort cross-service consistency with occasional manual reconciliation" because the cost of saga infrastructure outweighs the cost of occasional drift.

The pattern is genuinely powerful, and it's also genuinely expensive to build well. The teams that succeed with it are the ones who use it where the business case is clearest — checkout, payments, complex multi-step workflows — and don't reach for it for every cross-service interaction.

Implementation Stack Choices

Common production stacks for saga implementation:

Temporal. A widely used choice for orchestration sagas in greenfield systems. Durable workflow engine, compensation support, strong observability. The right answer when the orchestration use case is real and you can adopt the operational footprint.
Camunda / Camunda Cloud. BPMN-based workflow engine with strong enterprise tooling. The right answer in regulated environments where business analysts need to read and modify the workflow.
Axon Saga. JVM/Spring-native saga support, integrates tightly with CQRS + event sourcing implementations on the same platform.
Hand-rolled on Kafka. A choreography saga with each service publishing and subscribing to events on Kafka topics. Simplest infrastructure footprint. Pairs naturally with the broader ESB + event-driven architecture.
AWS Step Functions / Azure Durable Functions. Managed orchestration in cloud-native systems. The right choice when you're already deep in a single cloud and want managed durability and observability without operating a workflow engine yourself.

The decision criteria: orchestration vs choreography first, then visibility and tooling needs, then existing infrastructure investments. Most teams should not build their own saga engine — the failure modes are subtle enough that buying or adopting a mature option is almost always the right call.

The Practitioner's Take

The saga pattern is what makes cross-service consistency actually work at scale. The alternative — pretending 2PC still applies, or living with partial-failure bugs that occasionally corrupt user state — is worse than the operational cost of doing it properly. The teams that ship reliable multi-service workflows in production are running sagas, whether they call them that or not.

The right move is to identify the workflows where consistency genuinely matters as a business property (not as an abstract engineering preference), implement those as explicit sagas with clear compensation paths, and leave the rest of the system using simpler patterns. Most production systems have somewhere between three and ten genuine sagas — checkout, refund, signup, account closure, a handful of operational workflows — and many more cross-service interactions that don't need the pattern at all. Pick the right ones, build them carefully, instrument them well, and the architecture will hold up under load.

The teams that succeed with sagas treat them as critical infrastructure with the operational discipline that implies. The teams that don't end up debugging stuck workflows at 2am and rediscovering, slowly, why distributed transactions are hard. Either way, the lesson lands. Better to learn it deliberately than at the cost of production incidents.