Distributed Transactions: A Comprehensive Guide to 2PC, 3PC, Paxos Commit, and Saga
Distributed transactions are one of the most challenging problems in distributed systems. How do you ensure that operations across multiple nodes either all succeed or all fail, even when nodes crash or networks partition? This comprehensive guide explores the major approaches: Two-Phase Commit (2PC), Three-Phase Commit (3PC), Paxos Commit, and the Saga pattern.
The ACID Challenge in Distributed Systems
In a single-node database, ACID properties are relatively straightforward:
- Atomicity: All operations succeed or all fail
- Consistency: Database remains in valid state
- Isolation: Concurrent transactions don’t interfere
- Durability: Committed changes persist
In distributed systems, achieving ACID across multiple nodes introduces significant challenges:
- Network failures: Messages may be lost or delayed
- Node failures: Nodes may crash at any time
- Partitions: Network may split, isolating nodes
- Performance: Coordination overhead increases latency
The Two-Phase Commit Protocol (2PC)
Two-Phase Commit is the classic distributed transaction protocol, providing strong consistency guarantees at the cost of availability and performance.
How 2PC Works
2PC divides the commit process into two phases, coordinated by a transaction coordinator (a minimal code sketch follows the walkthrough):
Phase 1: Prepare (Voting Phase)
- Coordinator sends PREPARE: The coordinator sends a prepare request to all participants
- Participants vote: Each participant:
  - Writes the transaction to its write-ahead log (WAL)
  - Locks the affected resources
  - Responds with VOTE-COMMIT or VOTE-ABORT
- Coordinator collects votes: Waits for all votes or a timeout
Phase 2: Commit (Decision Phase)
If all votes are COMMIT:
- Coordinator decides COMMIT: Writes commit decision to log
- Send COMMIT: Coordinator sends commit to all participants
- Participants commit: Each participant:
  - Commits the transaction
  - Releases its locks
  - Sends an ACK to the coordinator
- Coordinator completes: After all ACKs, coordinator forgets transaction
If any vote is ABORT:
- Coordinator decides ABORT: Writes abort decision to log
- Send ABORT: Coordinator sends abort to all participants
- Participants abort: Each participant:
  - Rolls back the transaction
  - Releases its locks
  - Sends an ACK to the coordinator
- Coordinator completes: After all ACKs, coordinator forgets transaction
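To make the two phases concrete, here is a minimal, single-process sketch of the coordinator's logic in Python. `Participant` is a hypothetical in-memory stub standing in for a real RPC endpoint, and the `log` list stands in for a durable decision log; a production coordinator would also need timeouts, retries, and real persistence.

```python
from dataclasses import dataclass, field

@dataclass
class Participant:
    """Hypothetical in-memory stand-in for a participant's RPC endpoint."""
    name: str
    vote: str = "VOTE-COMMIT"          # what this stub answers in Phase 1

    def prepare(self) -> str:          # Phase 1: would write WAL entry, take locks
        return self.vote

    def commit(self) -> None:          # Phase 2: would apply changes, release locks
        print(f"{self.name}: committed")

    def abort(self) -> None:           # Phase 2: would roll back, release locks
        print(f"{self.name}: aborted")

@dataclass
class Coordinator:
    participants: list
    log: list = field(default_factory=list)   # stands in for the durable decision log

    def run(self) -> str:
        # Phase 1: collect votes (a real coordinator would abort on timeout).
        votes = [p.prepare() for p in self.participants]
        decision = "COMMIT" if all(v == "VOTE-COMMIT" for v in votes) else "ABORT"
        # The decision must be durable *before* any participant learns it.
        self.log.append(decision)
        # Phase 2: broadcast the decision, then wait for ACKs (omitted here).
        for p in self.participants:
            p.commit() if decision == "COMMIT" else p.abort()
        return decision

print(Coordinator([Participant("A"), Participant("B")]).run())   # -> COMMIT
```

Constructing one stub as `Participant("B", vote="VOTE-ABORT")` exercises the abort path: a single no vote forces ABORT for everyone.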
2PC State Machine
Coordinator States:
[INIT] → [WAITING] → [COMMITTED] / [ABORTED]
Participant States:
[INIT] → [PREPARED] → [COMMITTED] / [ABORTED]
Example: 2PC Execution
```
Coordinator (C)        Participant A        Participant B
---------------        -------------        -------------
[INIT]                 [INIT]               [INIT]
   |---- PREPARE ----->|                    |
   |---- PREPARE ------------------------->|
[WAITING]              [PREPARED]           [PREPARED]
   |<-- VOTE-COMMIT ---|                    |
   |<-- VOTE-COMMIT -----------------------|
(all votes are COMMIT)
   |---- COMMIT ------>|                    |
   |---- COMMIT -------------------------->|
[COMMITTED]            [COMMITTED]          [COMMITTED]
   |<------ ACK -------|                    |
   |<------ ACK --------------------------|
```
Failure Scenarios in 2PC
Coordinator Failure During Phase 1
Scenario: Coordinator crashes after sending PREPARE but before receiving all votes.
Problem: Participants are in PREPARED state, holding locks indefinitely.
Solutions:
- Timeout: Participants time out and abort (unsafe: violates atomicity if the coordinator had already decided COMMIT)
- Recovery: New coordinator queries participants and decides based on votes
- Blocking: Participants must wait for coordinator recovery (blocks system)
Coordinator Failure During Phase 2
Scenario: Coordinator crashes after deciding COMMIT but before all participants receive it.
Problem: Some participants committed, others didn’t receive commit message.
Solutions:
- Recovery: A new coordinator queries the participants (see the sketch after this list):
  - If any participant committed → send COMMIT to all
  - If any participant aborted → send ABORT to all
  - If all are prepared → safe to commit (everyone voted yes, and Phase 2 had begun)
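That recovery rule reduces to a small pure function. A sketch, assuming the new coordinator can poll each participant for a state label (the labels are hypothetical):

```python
# Derive the outcome from polled participant states, per the rules above.
# Assumed labels: "committed", "aborted", "prepared", "init".
def recover_decision(states: list) -> str:
    if "committed" in states:
        return "COMMIT"    # someone already saw the COMMIT decision
    if "aborted" in states:
        return "ABORT"     # someone already saw (or unilaterally made) ABORT
    if all(s == "prepared" for s in states):
        return "COMMIT"    # all voted yes and Phase 2 had begun
    return "ABORT"         # a participant never prepared, so aborting is safe

assert recover_decision(["prepared", "committed", "prepared"]) == "COMMIT"
assert recover_decision(["prepared", "init"]) == "ABORT"
```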
Participant Failure
Scenario: Participant crashes after voting COMMIT but before receiving coordinator’s decision.
Problem: Participant must recover and learn the decision.
Solutions:
- Recovery: On recovery, participant queries coordinator or other participants
- Log: Decision is logged, can be recovered
- Blocking: Coordinator waits for participant recovery (or times out and aborts)
2PC Properties
- Consistency: Strong - all nodes agree on commit/abort
- Availability: Blocks during coordinator failure or network partition
- Performance: High latency (two round trips), blocking locks
- Fault tolerance: Blocks on coordinator failure
Real-World Use Cases
XA Transactions (Java): Java’s XA interface uses 2PC for distributed transactions across multiple resources (databases, message queues).
Distributed Databases: Some distributed databases use 2PC internally for cross-shard transactions.
Limitations: 2PC is rarely used in modern distributed systems due to blocking behavior and poor availability.
Three-Phase Commit (3PC)
Three-Phase Commit extends 2PC with an additional phase to reduce blocking, but still doesn’t solve all problems.
How 3PC Works
3PC adds a “pre-commit” phase between prepare and commit:
Phase 1: Prepare (Voting)
Same as 2PC - coordinator collects votes from participants.
Phase 2: Pre-Commit
If all votes are COMMIT:
- Coordinator sends PRE-COMMIT: Notifies all participants of commit decision
- Participants acknowledge: Participants acknowledge and enter PRE-COMMITTED state
- Coordinator waits: Waits for all acknowledgments
Phase 3: Commit
- Coordinator sends COMMIT: After all PRE-COMMIT acks received
- Participants commit: Participants commit and send final ACK
3PC State Machine
Coordinator States:
[INIT] → [WAITING] → [PRE-COMMITTED] → [COMMITTED] / [ABORTED]
Participant States:
[INIT] → [PREPARED] → [PRE-COMMITTED] → [COMMITTED] / [ABORTED]
State Transitions:
- Coordinator: INIT → WAITING (all PREPARE sent) → PRE-COMMITTED (all votes COMMIT) → COMMITTED (all PRE-COMMIT acks received)
- Participant: INIT → PREPARED (PREPARE received) → PRE-COMMITTED (PRE-COMMIT received) → COMMITTED (COMMIT received)
Why Pre-Commit Phase?
The PRE-COMMIT phase ensures that:
- All participants know the decision: Before the coordinator is allowed to forget the transaction
- Non-blocking recovery: If the coordinator fails, participants can recover without blocking
- Timeout safety: Participants in the PRE-COMMITTED state can safely time out and commit (see the sketch below)
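A sketch of that timeout rule as a participant-side function. The state strings are the ones from the state machine above; note the safety claim holds only when the network is merely slow, not partitioned (see the partition scenario below).

```python
# What a 3PC participant does when it stops hearing from the coordinator.
def on_timeout(state: str) -> str:
    if state in ("INIT", "PREPARED"):
        return "ABORT"        # no PRE-COMMIT seen, so nobody can have committed
    if state == "PRE-COMMITTED":
        return "COMMIT"       # everyone voted yes and learned the decision
    return state              # already COMMITTED or ABORTED; nothing to do

assert on_timeout("PREPARED") == "ABORT"
assert on_timeout("PRE-COMMITTED") == "COMMIT"
```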
Failure Scenarios in 3PC
Coordinator Failure After Pre-Commit
Scenario: Coordinator crashes after sending PRE-COMMIT but before COMMIT.
Recovery:
- Participants in PRE-COMMITTED state can safely commit
- New coordinator can query participants and complete commit
- Non-blocking: System can make progress without coordinator
Network Partition
Scenario: Network partitions, isolating coordinator from some participants.
Problem: 3PC remains unsafe here. Participants in the PRE-COMMITTED state that lose contact with the coordinator time out and commit, while partitioned participants still in PREPARED time out and abort, so the two sides of the partition reach opposite decisions.
3PC Properties
- Consistency: Strong (same as 2PC)
- Availability: Better than 2PC (non-blocking recovery), but still blocks on partitions
- Performance: Higher latency than 2PC (three round trips)
- Fault tolerance: Better than 2PC, but still has issues
Limitations of 3PC
- Still blocks on partitions: Cannot guarantee consistency during network partitions
- Higher latency: Additional round-trip increases latency
- Complexity: More complex than 2PC
- Not widely used: Rarely implemented in practice
Paxos Commit
Paxos Commit uses the Paxos consensus algorithm to achieve distributed commit, providing better fault tolerance than 2PC/3PC.
Why Paxos for Commit?
Paxos solves the consensus problem: getting multiple nodes to agree on a single value. For distributed transactions, that value is the commit/abort decision.
How Paxos Commit Works
Instead of a single coordinator, Paxos Commit uses Paxos to reach consensus on the commit decision:
- Propose commit: Any node can propose COMMIT or ABORT
- Paxos consensus: Use Paxos to agree on the decision
- Learn decision: All nodes learn the agreed decision
- Execute: Participants execute based on consensus decision
Paxos Basics (Simplified)
Paxos has three roles:
- Proposers: Propose values
- Acceptors: Accept proposals
- Learners: Learn the chosen value
Phase 1 (Prepare):
- The proposer sends a prepare request with a proposal number
- Acceptors promise not to accept proposals with lower numbers
- If a majority promises, proceed to Phase 2
Phase 2 (Accept):
- The proposer sends an accept request with the proposal number and a value
- An acceptor accepts unless it has already promised a higher-numbered proposal
- If a majority accepts, the value is chosen (a toy acceptor is sketched below)
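Here is a toy single-decree acceptor, assuming integer proposal numbers. A real acceptor must persist its promised/accepted state to disk before replying; this sketch keeps it in memory.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised so far
        self.accepted = None      # (number, value) of the last accepted proposal

    def on_prepare(self, n: int):
        if n > self.promised:
            self.promised = n
            return ("PROMISE", self.accepted)   # report any earlier acceptance
        return ("REJECT", None)

    def on_accept(self, n: int, value: str):
        if n >= self.promised:                  # no higher-numbered promise made
            self.promised = n
            self.accepted = (n, value)
            return "ACCEPTED"
        return "REJECT"

a = Acceptor()
assert a.on_prepare(1) == ("PROMISE", None)
assert a.on_accept(1, "COMMIT") == "ACCEPTED"
assert a.on_prepare(1) == ("REJECT", None)      # already promised 1; not higher
```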
Paxos Commit Flow
Phase 1: Prepare
```
Proposer (Node A)         Acceptor (Node B)        Acceptor (Node C)
-----------------         -----------------        -----------------
Propose COMMIT
   |---- Prepare(n=1) --->|                        |
   |---- Prepare(n=1) ---------------------------->|
   |<--- Promise(n=1) ----|                        |
   |<--- Promise(n=1) -----------------------------|
Majority promised → proceed to Accept
```
Phase 2: Accept
```
Proposer (Node A)         Acceptor (Node B)        Acceptor (Node C)
-----------------         -----------------        -----------------
   |---- Accept(n=1, COMMIT) --->|                 |
   |---- Accept(n=1, COMMIT) --------------------->|
   |<--- Accepted(n=1) ----------|                 |
   |<--- Accepted(n=1) ----------------------------|
COMMIT chosen; all nodes learn COMMIT
```
Advantages of Paxos Commit
- No single point of failure: No coordinator that can block system
- Handles failures: Continues working with majority of nodes
- Handles partitions: Can make progress if majority partition exists
- Proven correctness: Paxos has formal proofs of correctness
Disadvantages of Paxos Commit
- Complexity: Paxos is complex to understand and implement
- Latency: Multiple rounds of communication
- Overhead: More messages than 2PC
- Rarely used for transactions: In practice, Paxos deployments handle coordination and configuration far more often than transaction commit
Real-World Use
Chubby (Google): Uses Paxos for distributed lock service and configuration.
ZooKeeper: Uses Zab (Paxos-like) for coordination and configuration.
Note: Paxos Commit is rarely used for actual database transactions - it’s more common for coordination and configuration.
Saga Pattern
The Saga pattern takes a fundamentally different approach: instead of trying to achieve ACID transactions, it embraces eventual consistency and uses compensating transactions.
Philosophy
Traditional approach: Try to make distributed operations atomic (hard, blocks on failures).
Saga approach: Accept that operations may complete partially, but provide compensation to undo if needed.
How Saga Works
A Saga is a sequence of local transactions, each with a compensating transaction:
- Execute transactions: Execute T1, T2, T3, … sequentially
- If all succeed: The saga completes successfully
- If any fails: Execute the compensating transactions in reverse order (see the sketch below):
  - If T3 fails, run C2 and then C1 (T3 never took effect, so it has nothing to compensate)
  - Each compensation undoes the effects of its corresponding transaction
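A minimal orchestrated runner for this pattern. Each step pairs an action with its compensation; the step names and callables here are hypothetical placeholders for real service calls.

```python
def run_saga(steps):
    """steps: list of (name, action, compensate) tuples; actions raise on failure."""
    done = []
    for name, action, compensate in steps:
        try:
            action()
            done.append((name, compensate))
        except Exception as exc:
            print(f"{name} failed ({exc}); compensating")
            for prev_name, prev_comp in reversed(done):
                prev_comp()                       # undo in reverse order
                print(f"compensated {prev_name}")
            return False
    return True
```

Note that the failed step itself is never compensated; only the steps that completed are undone, in reverse order.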
Saga Execution Patterns
Orchestrated Saga
A central orchestrator coordinates the saga execution:
```
Orchestrator        Transaction 1      Transaction 2      Transaction 3
------------        -------------      -------------      -------------
   |---- Execute ------->|                  |                  |
   |<--- Success --------|                  |                  |
   |---- Execute ---------------------------->|                |
   |<--- Success -----------------------------|                |
   |---- Execute -------------------------------------------->|
   |<--- Failure! ----------------------------------------------|
   |---- Compensate (C2) --------------------->|               |
   |<--- Done ---------------------------------|               |
   |---- Compensate (C1) -->|                  |               |
   |<--- Done --------------|                  |               |
Saga aborted
```
Characteristics:
- Central control: Orchestrator has full visibility
- Easier to reason: Clear execution flow
- Single point of failure: Orchestrator can fail (but can be made resilient)
- Tight coupling: All participants depend on orchestrator
Choreographed Saga
Each participant knows what to do next and coordinates itself (a toy event-bus sketch follows the characteristics list):
```
Transaction 1        Event Bus         Transaction 2       Transaction 3
-------------        ---------         -------------       -------------
   |-- publish "T1 completed" -->|           |                   |
   |                             |-- deliver ->| (execute T2)    |
   |                             |<- publish "T2 completed" -|   |
   |                             |-- deliver ------------------->| (execute T3: fails)
   |                             |<- publish "T3 failed" --------|
   |                             |-- deliver ->| (run C2)        |
   |                             |<- publish "C2 completed" -|   |
   |<-- deliver "C2 completed" --| (run C1)                      |
Saga aborted
```
Characteristics:
- Decentralized: No central coordinator
- Loosely coupled: Participants communicate via events
- More complex: Harder to understand execution flow
- Resilient: No single point of failure
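A toy in-memory event bus makes the choreography concrete: each service registers handlers for the events it cares about and publishes follow-up events, and no component sees the whole flow. Event names are hypothetical.

```python
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(event, handler):
    subscribers[event].append(handler)

def publish(event):
    print(f"event: {event}")
    for handler in list(subscribers[event]):
        handler()

# Each service only knows which events it reacts to and which it emits.
subscribe("T1.completed", lambda: publish("T2.completed"))    # T2 runs, succeeds
subscribe("T2.completed", lambda: publish("T3.failed"))       # T3 runs, fails
subscribe("T3.failed",    lambda: publish("C2.completed"))    # T2 compensates
subscribe("C2.completed", lambda: publish("C1.completed"))    # T1 compensates

publish("T1.completed")   # prints the full event chain, ending in C1.completed
```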
Example: E-Commerce Order Saga
Scenario: Process an order that involves:
- Reserve inventory
- Charge credit card
- Ship product
- Send confirmation email
Orchestrated Saga
```
Orchestrator        Inventory Svc      Payment Svc      Shipping Svc
------------        -------------      -----------      ------------
   |-- Reserve inventory -->|               |                |
   |<-- Success ------------|               |                |
   |-- Charge credit card ----------------->|                |
   |<-- Success ----------------------------|                |
   |-- Ship product --------------------------------------->|
   |<-- Failure! --------------------------------------------|
   |-- Refund payment (compensate) -------->|                |
   |<-- Refunded ---------------------------|                |
   |-- Release inventory (compensate) -->|                   |
   |<-- Released -----------|               |                |
Saga aborted (the Email Service is never invoked)
```
Execution Flow (replayed in code below):
- The orchestrator requests an inventory reservation
- If successful, it requests the payment charge
- If successful, it requests shipment
- If shipment fails, the orchestrator triggers compensation:
  - Refund the payment
  - Release the inventory
- Saga aborted
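The same flow, replayed with the `run_saga` sketch from the "How Saga Works" section; every service call here is a hypothetical stub.

```python
def reserve_inventory(): print("inventory reserved")
def release_inventory(): print("inventory released")
def charge_card():       print("card charged")
def refund_card():       print("card refunded")
def ship_product():      raise RuntimeError("no carrier available")

ok = run_saga([
    ("reserve-inventory", reserve_inventory, release_inventory),
    ("charge-card",       charge_card,       refund_card),
    ("ship-product",      ship_product,      lambda: None),
    # The email step is never reached because shipping fails first.
])
print("saga committed" if ok else "saga aborted")    # -> saga aborted
```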
Choreographed Saga
```
Inventory Svc          Event Bus          Payment Svc          Shipping Svc
-------------          ---------          -----------          ------------
   |-- publish "inventory_reserved" -->|        |                    |
   |                                   |-- deliver ->| (charge card) |
   |                                   |<- publish "payment_succeeded" -|
   |                                   |-- deliver -------------------->| (shipment fails)
   |                                   |<- publish "shipment_failed" ---|
   |                                   |-- deliver ->| (refund payment) |
   |                                   |<- publish "payment_refunded" -|
   |<- deliver "shipment_failed" ------| (release inventory)           |
   |-- publish "inventory_released" -->|
```
Execution Flow:
- Inventory service reserves items and publishes event
- Payment service subscribes, charges card, publishes success
- Shipping service subscribes, attempts shipment, publishes failure
- Payment service subscribes to failure, refunds payment
- Inventory service subscribes to failure, releases inventory
Saga Properties
- Consistency: Eventual - the system converges to a consistent state
- Availability: High - no blocking
- Performance: Good - no blocking locks
- Fault tolerance: Resilient - failures are handled gracefully via compensation
Saga Challenges
Compensating Transactions
Problem: Not all operations are easily reversible.
Examples:
- Easy to compensate: Reserve inventory → Release inventory
- Hard to compensate: Send email → Cannot unsend email
- Impossible to compensate: Launch missile → Cannot unlaunch
Solutions:
- Design operations to be reversible
- Use idempotent operations
- Accept that some operations cannot be compensated
Partial Failures
Problem: Saga may complete partially, leaving system in inconsistent state.
Example: Inventory reserved, payment charged, but shipment fails. System must compensate payment and release inventory.
Solutions:
- Idempotency: Make operations idempotent (safe to retry)
- Timeouts: Set timeouts for each step
- Monitoring: Monitor saga execution and handle stuck sagas
Ordering Guarantees
Problem: In choreographed saga, events may arrive out of order.
Solutions:
- Event ordering: Rely on the transport's ordering guarantees (e.g., a Kafka partition preserves order per key)
- Versioning: Use version numbers to detect out-of-order events
- Idempotency: Make handlers idempotent (a combined sketch follows)
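A sketch combining the last two defenses: drop duplicate deliveries by event ID (idempotency) and drop stale deliveries by version. The field names are hypothetical, and this version simply discards out-of-order events rather than buffering them.

```python
seen_ids = set()
current_version = 0

def handle(event: dict) -> bool:
    """Apply an event at most once and in order; return True if applied."""
    global current_version
    if event["id"] in seen_ids:                # duplicate delivery: ignore
        return False
    if event["version"] <= current_version:    # stale/out-of-order: ignore
        return False
    seen_ids.add(event["id"])
    current_version = event["version"]
    return True

assert handle({"id": "e1", "version": 1})
assert not handle({"id": "e1", "version": 1})    # duplicate dropped
assert not handle({"id": "e0", "version": 1})    # stale version dropped
```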
Real-World Examples
Netflix: Conductor (Orchestrated)
Netflix uses Conductor for orchestrated sagas:
- Workflow orchestration: Coordinates microservices
- Compensation: Built-in compensation support
- Visibility: Full visibility into saga execution
Amazon: Step Functions (Orchestrated)
AWS Step Functions provides orchestrated sagas:
- State machines: Define saga as state machine
- Compensation: Support for compensation logic
- Integration: Integrates with AWS services
Event Sourcing + Saga (Choreographed)
Many event-sourced systems use choreographed sagas:
- Events as coordination: Events coordinate saga execution
- Event store: Events stored for audit and replay
- Compensation: Compensation handled via events
Comparison: 2PC vs. 3PC vs. Paxos Commit vs. Saga
| Aspect | 2PC | 3PC | Paxos Commit | Saga |
|---|---|---|---|---|
| Consistency | Strong | Strong | Strong | Eventual |
| Availability | Low | Medium | High | High |
| Performance | Low | Lower | Medium | High |
| Blocking | Yes | Reduced | No | No |
| Complexity | Medium | High | Very High | Medium |
| Use Case | ACID transactions | ACID transactions | Consensus | Business processes |
| Fault Tolerance | Low | Medium | High | High |
| Latency | 2 RTT | 3 RTT | Multiple RTT | Sequential |
Choosing the Right Approach
Use 2PC/3PC When:
- Strong consistency is required
- Short-lived transactions
- All participants are reliable
- Network is stable
- Don’t use if: High availability required, network partitions common
Use Paxos Commit When:
- Need consensus on commit decision
- High availability required
- Can tolerate complexity
- Don’t use if: Simple solution needed, low latency critical
Use Saga When:
- Long-running business processes
- High availability required
- Can accept eventual consistency
- Operations can be compensated
- Don’t use if: Strong consistency required, operations cannot be compensated
Best Practices
For 2PC/3PC:
- Keep transactions short: Reduce lock time
- Use timeouts: Prevent indefinite blocking
- Monitor coordinator: Ensure coordinator is healthy
- Log everything: Critical for recovery
- Test failures: Regularly test coordinator and participant failures
For Saga:
- Design for compensation: Make operations reversible
- Use idempotency: Make operations safe to retry
- Monitor saga execution: Track saga progress and handle stuck sagas
- Set timeouts: Prevent sagas from running indefinitely
- Use orchestration: Prefer orchestrated for visibility and control
- Event ordering: Ensure events are processed in order (choreographed)
Real-World System Analysis
Google Spanner: Distributed Transactions
Spanner uses 2PC with Paxos for distributed transactions:
- 2PC: For transaction coordination
- Paxos: For replica consensus
- TrueTime: For timestamp ordering
- External consistency: Transactions appear to execute in an order consistent with real time (strict serializability)
Why this works: Spanner controls the entire stack (hardware, network, software), allowing it to provide strong consistency with good performance.
Amazon DynamoDB: Transactions
DynamoDB uses 2PC-like protocol for transactions:
- Prepare phase: Validates and reserves resources
- Commit phase: Applies changes atomically
- Limited scope: Transactions are capped in item count (originally 25 items per transaction) and confined to a single AWS region and account
- Performance: Optimized for short, small transactions
Why limited: Every item in a transaction adds coordination work, so capping transaction size keeps latency low and predictable.
Uber: Distributed Transactions
Uber uses Saga pattern for business processes:
- Orchestrated sagas: For complex workflows
- Compensation: For handling failures
- Event-driven: Choreographed sagas for some processes
- High availability: Critical for ride-sharing
Why Saga: Long-running processes (matching driver to rider) cannot use blocking protocols.
Conclusion
Distributed transactions represent a fundamental trade-off between consistency, availability, and performance:
- 2PC/3PC: Provide strong consistency but block on failures
- Paxos Commit: Provides consensus but adds complexity
- Saga: Provides availability but accepts eventual consistency
The right choice depends on your requirements:
- Financial systems: May need 2PC for strong consistency
- Coordination systems: May use Paxos for consensus
- Business processes: Often use Saga for availability
Modern systems often combine approaches:
- Strong consistency for critical data (2PC/Paxos)
- Eventual consistency for business processes (Saga)
- Compensation for handling failures (Saga)
Understanding these patterns helps you make informed architectural decisions and build systems that meet your consistency and availability requirements.