Understanding Raft Consensus Algorithm: A Comprehensive Guide
Consensus is one of the most fundamental problems in distributed systems. How do multiple nodes agree on a value when some nodes may fail or messages may be delayed? The Raft consensus algorithm provides an elegant solution to this problem, designed to be more understandable than its predecessor, Paxos.
What is Consensus?
Consensus is the problem of getting multiple nodes in a distributed system to agree on a value or sequence of values, even in the presence of failures. It’s the foundation for:
- Replicated state machines: Ensuring all replicas have the same state
- Distributed databases: Maintaining consistency across replicas
- Configuration management: Agreeing on cluster membership
- Leader election: Choosing a single leader among multiple candidates
Why Consensus Matters
In distributed systems, we need consensus to ensure:
- Consistency: All nodes see the same data
- Availability: System continues working despite failures
- Correctness: Operations are applied in the same order everywhere
The Problem with Paxos
Paxos, the classic consensus algorithm, has been notoriously difficult to understand and implement correctly. Its complexity comes from:
- Multiple roles: Proposers, acceptors, learners
- Complex state transitions: Hard to reason about correctness
- Implementation variations: Many “Paxos-like” algorithms exist
- Difficult debugging: Hard to understand what went wrong
This led Diego Ongaro and John Ousterhout to create Raft in 2013, with the explicit goal of making consensus understandable.
Raft: Designed for Understandability
Raft separates consensus into three relatively independent subproblems:
- Leader election: A new leader must be chosen when the current leader fails
- Log replication: The leader must accept log entries from clients and replicate them to a majority of servers
- Safety: If any server has applied a log entry at a given index to its state machine, no other server may apply a different entry at that index
This separation makes Raft easier to understand, implement, and reason about than Paxos.
Raft Basics
Server States
Each Raft server is always in one of three states:
- Leader: Handles all client requests and replicates log entries
- Follower: Passively receives log entries from the leader
- Candidate: Intermediate state used during leader election
[Follower] -- timeout --> [Candidate] -- majority votes --> [Leader]
     ^                         |                                |
     |                         |                                |
     +-------------------------+--------------------------------+
        (discovers current leader or a server with a higher term)
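Throughout this guide, short Go sketches illustrate the ideas. None of them come from a specific library; every type, field, and helper name is an assumption for illustration only. As a starting point, a server's role and minimal bookkeeping might look like this (LogEntry and Peer are sketched in later sections):

```go
package raft

import "sync"

// State is the role a Raft server currently holds.
type State int

const (
	Follower  State = iota // default state: responds to RPCs from leaders and candidates
	Candidate              // campaigning for leadership after an election timeout
	Leader                 // handles client requests and replicates log entries
)

// Server holds the minimal per-node state used by the sketches in this guide.
// A real implementation persists currentTerm, votedFor, and the log to disk.
type Server struct {
	mu          sync.Mutex
	id          int
	state       State
	currentTerm int
	votedFor    int        // candidate ID voted for in currentTerm, or -1
	log         []LogEntry // LogEntry is sketched in the Log Replication section
	commitIndex int        // highest entry known to be committed
	peers       []*Peer    // assumed RPC stubs for the other servers
	matchIndex  []int      // leader only: highest index replicated on each peer
}
```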
Terms
Raft divides time into terms, numbered with consecutive integers. Each term begins with an election, where one or more candidates attempt to become leader. If a candidate wins the election, it serves as leader for the rest of the term.
- Term number: Monotonically increasing
- Election: At the start of each term
- At most one leader: Per term (safety property)
RPCs
Raft servers communicate using two RPCs:
- RequestVote: Used by candidates to gather votes
- AppendEntries: Used by leaders to replicate log entries and as heartbeat
Leader Election
Leader election ensures that a new leader is chosen when the current leader fails or becomes unreachable.
Election Process
- Timeout: Followers wait for a random timeout (election timeout, typically 150-300ms)
- Become Candidate: If no heartbeat received, follower becomes candidate
- Request Votes: Candidate sends RequestVote RPCs to all other servers
- Vote: Each server grants its vote to at most one candidate per term, on a first-come basis (subject to the voting rules below)
- Win Election: Candidate becomes leader if it receives votes from majority
- Heartbeat: Leader sends periodic heartbeats to maintain leadership
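A rough sketch of the candidate side of these steps, assuming the illustrative Server type from earlier, the RequestVote message described in the next subsection, and hypothetical helpers (Peer.RequestVote, lastLogIndex, lastLogTerm, becomeLeader) that are not real library calls:

```go
// startElection sketches steps 2-5 above: bump the term, vote for
// ourselves, and ask every peer for its vote in parallel.
func (s *Server) startElection() {
	s.mu.Lock()
	s.state = Candidate
	s.currentTerm++    // a new term begins with this election
	s.votedFor = s.id  // vote for ourselves
	term := s.currentTerm
	votes := 1
	s.mu.Unlock()

	for _, p := range s.peers {
		go func(p *Peer) {
			resp, err := p.RequestVote(RequestVoteRequest{
				Term:         term,
				CandidateID:  s.id,
				LastLogIndex: s.lastLogIndex(),
				LastLogTerm:  s.lastLogTerm(),
			})
			if err != nil || !resp.VoteGranted {
				return
			}
			s.mu.Lock()
			defer s.mu.Unlock()
			votes++
			// Step 5: a majority of the full cluster (peers + ourselves) elects us.
			if s.state == Candidate && s.currentTerm == term && votes > (len(s.peers)+1)/2 {
				s.becomeLeader() // step 6: start sending periodic heartbeats
			}
		}(p)
	}
}
```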
RequestVote RPC
RequestVoteRequest contains:
- term: Candidate’s term number
- candidate_id: ID of the candidate requesting the vote
- last_log_index: Index of the candidate’s last log entry
- last_log_term: Term of the candidate’s last log entry
RequestVoteResponse contains:
- term: Current term (for the candidate to update itself)
- vote_granted: Boolean indicating whether the candidate received the vote
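Translated into the Go sketch (field names mirror the list above; this is illustrative, not the wire format of any particular implementation):

```go
// RequestVoteRequest is sent by a candidate to gather votes.
type RequestVoteRequest struct {
	Term         int // candidate's term number
	CandidateID  int // ID of the candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// RequestVoteResponse is the voter's answer.
type RequestVoteResponse struct {
	Term        int  // current term, so the candidate can update itself
	VoteGranted bool // true if the candidate received this server's vote
}
```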
RequestVote RPC Flow:
Candidate Follower 1 Follower 2 Follower 3
-------- ---------- ---------- ----------
| | | |
|--RequestVote---->| | |
|--RequestVote----------------------->| |
|--RequestVote------------------------------------------>|
| | | |
|<--Vote (true)----| | |
|<--Vote (true)-----------------------| |
|<--Vote (false)----------------------------------------|
| | | |
| (2 follower votes + its own vote = 3 of 4: majority)
| Becomes Leader
Voting Rules
A server votes for a candidate if:
- Term check: Candidate’s term >= server’s current term
- Log completeness: Candidate’s log is at least as up-to-date as server’s log
- First vote: Server hasn’t voted for another candidate in this term
Log Completeness
A log is more up-to-date if:
- The last entry has a higher term, OR
- The last entries have the same term, but the log is longer
This ensures the leader has all committed entries.
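A sketch of how a server might combine the voting rules with this up-to-date check, continuing the illustrative Server type from earlier:

```go
// isUpToDate reports whether a candidate's log, described by the index and
// term of its last entry, is at least as up-to-date as this server's log.
func (s *Server) isUpToDate(lastLogIndex, lastLogTerm int) bool {
	myIndex := len(s.log)
	myTerm := 0
	if myIndex > 0 {
		myTerm = s.log[myIndex-1].Term
	}
	if lastLogTerm != myTerm {
		return lastLogTerm > myTerm // higher last term wins
	}
	return lastLogIndex >= myIndex // same last term: longer log wins
}

// handleRequestVote applies the three voting rules listed above.
func (s *Server) handleRequestVote(req RequestVoteRequest) RequestVoteResponse {
	s.mu.Lock()
	defer s.mu.Unlock()

	if req.Term > s.currentTerm { // a newer term always demotes us to follower
		s.currentTerm = req.Term
		s.state = Follower
		s.votedFor = -1
	}

	grant := req.Term >= s.currentTerm && // term check
		(s.votedFor == -1 || s.votedFor == req.CandidateID) && // first vote this term
		s.isUpToDate(req.LastLogIndex, req.LastLogTerm) // log completeness

	if grant {
		s.votedFor = req.CandidateID
	}
	return RequestVoteResponse{Term: s.currentTerm, VoteGranted: grant}
}
```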
Split Vote Prevention
Raft uses randomized election timeouts to prevent split votes:
- Random timeout: Each server picks a random election timeout in the 150-300 ms range
- First to time out: That server becomes a candidate and usually wins before the others time out
- Split votes are rare: Even when one occurs, the next election (with fresh random timeouts) resolves it quickly
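Picking such a timeout is a one-liner in the Go sketch:

```go
import (
	"math/rand"
	"time"
)

// randomElectionTimeout returns a fresh timeout in the 150-300 ms range,
// re-drawn before every wait so repeated candidate collisions stay unlikely.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}
```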
Example: Leader Election
Timeline:
Time Server A (Follower) Server B (Follower) Server C (Follower)
---- ------------------- ------------------- -------------------
T0 Waiting (180ms) Waiting (250ms) Waiting (200ms)
for heartbeat for heartbeat for heartbeat
T180 Timeout!
→ Become Candidate
→ RequestVote(B, C)
T181 Vote for A Vote for A
T182 Received 2 votes (majority)
→ Become Leader
→ AppendEntries(heartbeat) → B
→ AppendEntries(heartbeat) → C
T183 Become Follower Become Follower
Log Replication
Once a leader is elected, it begins serving client requests. Each client request contains a command to be executed by the replicated state machine.
Log Structure
Each log entry contains:
- Index: Position in the log (1-indexed)
- Term: Term when entry was created
- Command: State machine command
Index: 1 2 3 4 5 6
Term: 1 1 2 2 2 3
Cmd: x=1 y=2 x=3 z=4 w=5 y=6
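In the Go sketch, an entry is just a small struct:

```go
// LogEntry is one slot in the replicated log.
type LogEntry struct {
	Index   int    // 1-indexed position in the log
	Term    int    // term in which the leader created the entry
	Command []byte // opaque state machine command, e.g. an encoded "x=1"
}
```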
Replication Process
- Client Request: Client sends command to leader
- Append to Log: Leader appends entry to its log
- Replicate: Leader sends AppendEntries RPCs to all followers
- Acknowledge: Followers append entry and respond
- Commit: When majority acknowledges, leader commits entry
- Apply: Leader applies entry to state machine and responds to client
- Notify Followers: Leader notifies followers of committed entries
AppendEntries RPC
AppendEntriesRequest contains:
- term: Leader’s term number
- leader_id: ID of the leader
- prev_log_index: Index of the log entry immediately preceding the new ones
- prev_log_term: Term of the prev_log_index entry
- entries: Log entries to store (empty for heartbeat)
- leader_commit: Leader’s commitIndex
AppendEntriesResponse contains:
- term: Current term (for the leader to update itself)
- success: Boolean indicating whether the follower contained an entry matching prev_log_index and prev_log_term
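And in the Go sketch (again mirroring the field list above, not any real wire format):

```go
// AppendEntriesRequest is sent by the leader to replicate entries and,
// with an empty Entries slice, as a heartbeat.
type AppendEntriesRequest struct {
	Term         int        // leader's term number
	LeaderID     int        // ID of the leader
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commitIndex
}

// AppendEntriesResponse is the follower's answer.
type AppendEntriesResponse struct {
	Term    int  // current term, so the leader can update itself
	Success bool // true if the follower matched PrevLogIndex and PrevLogTerm
}
```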
AppendEntries RPC Flow:
Leader Follower 1 Follower 2 Follower 3
------ ---------- ---------- ----------
| | | |
|--AppendEntries---->| | |
|--AppendEntries---------------------->| |
|--AppendEntries---------------------------------------->|
| | | |
|<--Response(success)| | |
|<--Response(success)---------------------| |
|<--Response(fail)----------------------------------------|
| | | |
| (Majority acknowledged - Entry committed)
Consistency Check
The prev_log_index and prev_log_term ensure log consistency:
- Matching entry: The follower must have an entry at prev_log_index with term prev_log_term
- If match: The follower appends the new entries and deletes any conflicting entries that follow
- If mismatch: The follower rejects the request and the leader retries with an earlier entry
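A simplified follower-side sketch of this check, continuing the illustrative Server type (no persistence; a full implementation truncates only from the first conflicting entry onward):

```go
// handleAppendEntries performs the consistency check, appends entries,
// and advances the follower's commit index.
func (s *Server) handleAppendEntries(req AppendEntriesRequest) AppendEntriesResponse {
	s.mu.Lock()
	defer s.mu.Unlock()

	if req.Term < s.currentTerm { // stale leader
		return AppendEntriesResponse{Term: s.currentTerm, Success: false}
	}
	s.currentTerm = req.Term
	s.state = Follower

	// Consistency check: we must already hold the entry the new ones attach to.
	if req.PrevLogIndex > 0 &&
		(req.PrevLogIndex > len(s.log) || s.log[req.PrevLogIndex-1].Term != req.PrevLogTerm) {
		return AppendEntriesResponse{Term: s.currentTerm, Success: false}
	}

	// Drop anything after the match point and append the new entries.
	s.log = append(s.log[:req.PrevLogIndex], req.Entries...)

	// Learn about newly committed entries from the leader, capped at our log length.
	if req.LeaderCommit > s.commitIndex {
		s.commitIndex = req.LeaderCommit
		if last := len(s.log); s.commitIndex > last {
			s.commitIndex = last
		}
	}
	return AppendEntriesResponse{Term: s.currentTerm, Success: true}
}
```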
Commit Index
- Leader commitIndex: Highest log entry known to be committed
- Follower commitIndex: Highest log entry the follower knows to be committed (learned from the leader’s leader_commit field)
- Commit rule: An entry from the leader’s current term is committed once it is stored on a majority of servers; earlier-term entries become committed indirectly when a later entry commits
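On the leader side, the commit rule can be sketched by counting how many servers store each index, using the matchIndex bookkeeping from the illustrative Server type:

```go
// advanceCommitIndex finds the highest index stored on a majority of servers
// and, if that entry belongs to the current term, marks it committed.
func (s *Server) advanceCommitIndex() {
	for n := len(s.log); n > s.commitIndex; n-- {
		count := 1 // the leader itself stores every entry in its log
		for _, m := range s.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > (len(s.peers)+1)/2 && s.log[n-1].Term == s.currentTerm {
			s.commitIndex = n
			return
		}
	}
}
```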
Example: Log Replication
Initial State:
- Leader (A): Log [1,2,3]
- Follower (B): Log [1,2,3]
- Follower (C): Log [1,2,3]
Execution:
Client Leader (A) Follower (B) Follower (C)
------ ---------- ---------- ----------
| | | |
|--Write x=4------>| | |
| | | |
| Append entry 4 | |
| Log: [1,2,3,4] | |
| | | |
| |--AppendEntries(prev=3, entries=[4])-->|
| |--AppendEntries(prev=3, entries=[4])--|
| | | |
| | Log: [1,2,3,4] Log: [1,2,3,4]
| |<--Response(success) |
| |<--Response(success)-------------------|
| | | |
| Majority acknowledged |
| Commit entry 4 |
| | | |
| |--AppendEntries(commit=4)------------->|
| |--AppendEntries(commit=4)-------------|
| | | |
| Apply x=4 Apply x=4 Apply x=4
|<--Success--------| | |
Safety Properties
Raft ensures several critical safety properties:
Election Safety
At most one leader can be elected in a given term.
This is guaranteed by:
- Each server votes at most once per term
- Majority voting ensures at most one candidate can win
Leader Append-Only
A leader never overwrites or deletes entries in its log; it only appends new entries.
This ensures that once an entry is in the leader’s log, it remains there.
Log Matching
If two logs contain an entry with the same index and term, then the logs are identical in all preceding entries.
This is maintained by:
- Consistency check in AppendEntries
- Leader only creates one entry per log position per term
Leader Completeness
If a log entry is committed in a given term, then that entry will be present in the logs of all leaders for all higher-numbered terms.
This is guaranteed by:
- Election restriction: only candidates with complete logs can become leader
- Log completeness check during voting
State Machine Safety
If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry at the same index.
This follows from the previous properties.
Handling Failures
Leader Failure
When a leader fails:
- No heartbeats: Followers stop receiving AppendEntries
- Election timeout: Followers become candidates
- New election: New leader is elected
- Log recovery: New leader brings followers up to date
Follower Failure
When a follower fails:
- Leader continues: Leader continues serving requests
- Missing entries: Follower misses some log entries
- Recovery: When follower recovers, leader resends missing entries
- Catch up: Follower catches up using AppendEntries
Network Partition
During a network partition:
- Minority partition: Cannot elect leader (needs majority)
- Majority partition: Continues operating normally
- Rejoin: When partition heals, minority partition’s leader steps down
- Log synchronization: New leader brings all nodes up to date
Example: Leader Failure
Timeline:
Time Leader (A) Follower (B) Follower (C)
---- ---------- ----------- -----------
T0 [1,2,3] [1,2,3] [1,2,3]
Serving Waiting Waiting
T100 CRASH! Waiting Waiting
(down) (timeout: 150ms) (timeout: 250ms)
T250 (down) Timeout!
→ Become Candidate
→ RequestVote(C)
T251 (down) Received vote from C
→ Become Leader
→ AppendEntries(heartbeat) → C
T252 (down) [1,2,3] [1,2,3]
Serving Following
Log Compaction
Over time, logs grow unbounded. Raft uses snapshotting for log compaction:
- Snapshot: Replace committed entries with snapshot
- Snapshot metadata: Include last included index and term
- InstallSnapshot RPC: Leader sends snapshots to lagging followers
- State machine: Snapshot includes complete state machine state
Snapshot Process
Before: [1,2,3,4,5,6,7,8,9,10] (10 entries)
After: [snapshot(1-7), 8,9,10] (snapshot + 3 entries)
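The metadata such a snapshot carries might be sketched as:

```go
// Snapshot replaces the compacted prefix of the log.
type Snapshot struct {
	LastIncludedIndex int    // index of the last entry the snapshot covers (7 above)
	LastIncludedTerm  int    // term of that entry
	StateMachineData  []byte // serialized state machine state up to that index
}
```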
Cluster Membership Changes
Adding or removing servers requires careful handling to avoid split-brain scenarios.
Joint Consensus
Raft uses a two-phase approach for membership changes:
- Joint consensus: Transition to intermediate configuration (old + new)
- New consensus: Once joint consensus committed, transition to new configuration
This ensures that during the transition, both old and new majorities can make decisions.
Configuration Changes
Old: {A, B, C}
New: {A, B, C, D}
Step 1: Joint consensus {A,B,C} + {A,B,C,D}
Step 2: Once committed, switch to {A,B,C,D}
Performance Characteristics
Throughput
- Single leader: All writes go through leader (bottleneck)
- Batching: Can batch multiple entries in one AppendEntries RPC
- Pipelining: Can have multiple AppendEntries RPCs in flight
Latency
- Client latency: One round trip from the client to the leader plus one leader-to-majority round trip (typically < 10 ms within a datacenter)
- Commit latency: One round trip from the leader to a majority of followers
- Read optimization: Can serve reads from leader without log replication
Scalability
- Log size: Grows linearly with number of operations
- Network: O(n) messages per operation (n = number of servers)
- Typical cluster size: 3-7 servers (odd number for majority)
Real-World Implementations
etcd
etcd uses Raft for distributed key-value store:
- Configuration: Kubernetes, Docker Swarm use etcd
- Consistency: Strong consistency guarantees
- Performance: Handles thousands of writes per second
Consul
Consul uses Raft for service discovery and configuration:
- Service registry: Maintains consistent service catalog
- Leader election: Ensures single leader for coordination
- Multi-datacenter: Uses Raft within each datacenter
CockroachDB
CockroachDB uses Raft (with Multi-Raft) for distributed SQL database:
- Range replication: Each range uses Raft
- Consistency: Strong consistency across regions
- Scale: Handles millions of ranges
MongoDB Replica Sets
MongoDB uses a Raft-like algorithm for replica sets:
- Primary election: Similar to Raft leader election
- Oplog replication: Similar to log replication
- Automatic failover: Handles primary failures
Raft vs. Paxos
| Aspect | Raft | Paxos |
|---|---|---|
| Understandability | High | Low |
| Leader | Strong leader built into the protocol | No built-in leader (Multi-Paxos adds one) |
| Log structure | Sequential log | Independent proposals |
| Implementation | Easier | Harder |
| Performance | Similar | Similar |
| Safety | Same guarantees | Same guarantees |
Common Implementation Challenges
Stale Reads
Problem: Reading from followers may return stale data.
Solution:
- Read from leader (strong consistency)
- Use read leases (time-based guarantee)
- Use read index (check if entry is committed)
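A rough sketch of the read-index idea, with confirmLeadership, waitForApplied, and stateMachine as assumed helpers rather than real library calls:

```go
// linearizableRead is a simplified read-index sketch: capture the current
// commit index, confirm we are still leader, wait for the state machine to
// catch up to that point, then serve the read locally.
func (s *Server) linearizableRead(key string) (value []byte, ok bool) {
	s.mu.Lock()
	if s.state != Leader {
		s.mu.Unlock()
		return nil, false // redirect the client to the current leader
	}
	readIndex := s.commitIndex // the read must reflect at least this entry
	s.mu.Unlock()

	// A heartbeat round to a majority proves no newer leader has been elected.
	if !s.confirmLeadership() {
		return nil, false
	}
	s.waitForApplied(readIndex) // block until lastApplied >= readIndex
	return s.stateMachine.Get(key), true
}
```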
Split Votes
Problem: Multiple candidates can cause repeated elections.
Solution:
- Randomized timeouts
- Pre-vote mechanism (a would-be candidate first checks that it could win before incrementing its term)
Log Growth
Problem: Log grows unbounded over time.
Solution:
- Periodic snapshots
- Log compaction
- Truncate old entries
Network Partitions
Problem: Minority partition cannot make progress.
Solution:
- This is by design (prevents split-brain)
- Use majority quorums
- Consider availability vs. consistency trade-offs
Best Practices
- Use odd number of servers: Ensures clear majority (3, 5, 7)
- Monitor leader health: Track leader election frequency
- Tune timeouts: Balance between availability and performance
- Handle snapshots: Implement efficient snapshotting
- Test failures: Regularly test leader failures and network partitions
- Monitor metrics: Track commit latency, election frequency, log size
Conclusion
Raft provides an understandable, correct, and efficient solution to the consensus problem. Its separation of concerns (leader election, log replication, safety) makes it easier to understand and implement than Paxos while providing the same safety guarantees.
Key takeaways:
- Understandability: Raft’s design prioritizes clarity
- Safety: Strong guarantees even with failures
- Performance: Efficient for most use cases
- Real-world: Widely used in production systems
- Trade-offs: Understand availability vs. consistency
Whether you’re building a distributed database, configuration system, or any system requiring consensus, Raft provides a solid foundation for ensuring consistency and correctness in distributed systems.