Saga

A long-running business transaction split across services, coordinated by a sequence of local transactions and explicit compensations when a later step fails. You trade atomic rollback for the ability to commit each step locally.

When to reach for it

  • A workflow crosses three or more service boundaries and can't be pushed back into one database
  • Each step has a meaningful local commit — book flight, charge card, allocate stock — worth keeping if a later step fails
  • Every forward step has a real compensating action the business actually wants (refund, release, reverse), not a hand-wave
  • You need observable, resumable workflow state, not "the process died somewhere between step 3 and step 4 and we found out from a customer"

What it actually costs

Saga isn't free distributed transactions — it's a workflow engine and a discipline. You'll maintain a compensation for every forward step, deal with non-commutative steps that can't really be undone (sent emails, external API calls, physical side effects), and spend serious time on idempotency, retries and timeouts. Orchestration concentrates the complexity in one place but couples every service to the orchestrator; choreography spreads it across event handlers and you'll lose hours reconstructing what happened from logs.

The failure mode nobody mentions

Partial compensation. Step 5 fails, the saga starts compensating, then compensation step 2 also fails — now you're in a half-rolled-back state with no automated path forward. Teams plan the happy path and the single-failure path; almost nobody plans the cascading-failure path. You find out at 2 AM when a queue backs up and someone is hand-writing SQL to reconcile orders, payments and inventory before the morning batch.

When not to use it

A workflow that fits in one database, one service, or two services with a single coordinated commit — pulling in a saga buys you a workflow engine you don't need.