Understanding Raft Consensus Algorithm: A Comprehensive Guide

Consensus is one of the most fundamental problems in distributed systems. How do multiple nodes agree on a value when some nodes may fail or messages may be delayed? The Raft consensus algorithm provides an elegant solution to this problem, designed to be more understandable than its predecessor, Paxos.

What is Consensus?

Consensus is the problem of getting multiple nodes in a distributed system to agree on a value or sequence of values, even in the presence of failures. It’s the foundation for:

  • Replicated state machines: Ensuring all replicas have the same state
  • Distributed databases: Maintaining consistency across replicas
  • Configuration management: Agreeing on cluster membership
  • Leader election: Choosing a single leader among multiple candidates

Why Consensus Matters

In distributed systems, we need consensus to ensure:

  • Consistency: All nodes see the same data
  • Availability: System continues working despite failures
  • Correctness: Operations are applied in the same order everywhere

The Problem with Paxos

Paxos, the classic consensus algorithm, has been notoriously difficult to understand and implement correctly. Its complexity comes from:

  • Multiple roles: Proposers, acceptors, learners
  • Complex state transitions: Hard to reason about correctness
  • Implementation variations: Many “Paxos-like” algorithms exist
  • Difficult debugging: Hard to understand what went wrong

This led Diego Ongaro and John Ousterhout to create Raft in 2013, with the explicit goal of making consensus understandable.

Raft: Designed for Understandability

Raft separates consensus into three relatively independent subproblems:

  1. Leader election: A new leader must be chosen when the current leader fails
  2. Log replication: The leader must accept log entries from clients and replicate them to a majority of servers
  3. Safety: The key safety property: if any server has applied a log entry at a given index to its state machine, no other server may apply a different command for that index

This separation makes Raft easier to understand, implement, and reason about than Paxos.

Raft Basics

Server States

Each Raft server is always in one of three states:

  • Leader: Handles all client requests and replicates log entries
  • Follower: Passively receives log entries from the leader
  • Candidate: Intermediate state used during leader election
[Follower] -- timeout --> [Candidate] -- majority votes --> [Leader]
    ^                           |                                |
    |                           |                                |
    +---------------------------+--------------------------------+
          (a Candidate reverts to Follower when it discovers the current leader
           or a higher term; a Leader steps down when it sees a higher term)
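
In code, these states are typically just an enum. A minimal Go sketch (the names here are illustrative, not taken from any particular implementation):

// ServerState models the three Raft roles a server can be in.
type ServerState int

const (
    Follower  ServerState = iota // default state on startup
    Candidate                    // trying to win an election
    Leader                       // handles client requests, sends heartbeats
)

func (s ServerState) String() string {
    switch s {
    case Follower:
        return "Follower"
    case Candidate:
        return "Candidate"
    default:
        return "Leader"
    }
}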

Terms

Raft divides time into terms, numbered with consecutive integers. Each term begins with an election, where one or more candidates attempt to become leader. If a candidate wins the election, it serves as leader for the rest of the term.

  • Term number: Monotonically increasing
  • Election: At the start of each term
  • At most one leader: Per term (safety property)

RPCs

Raft servers communicate using two core RPCs:

  1. RequestVote: Used by candidates to gather votes
  2. AppendEntries: Used by leaders to replicate log entries and as heartbeat

Leader Election

Leader election ensures that a new leader is chosen when the current leader fails or becomes unreachable.

Election Process

  1. Timeout: Followers wait for a random timeout (election timeout, typically 150-300ms)
  2. Become Candidate: If no heartbeat received, follower becomes candidate
  3. Request Votes: Candidate sends RequestVote RPCs to all other servers
  4. Vote: Each server grants at most one vote per term, first-come first-served (subject to the voting rules below)
  5. Win Election: Candidate becomes leader if it receives votes from majority
  6. Heartbeat: Leader sends periodic heartbeats to maintain leadership
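
Sketched in Go, one election round following the steps above might look roughly like this (a simplified, synchronous sketch: the Node type, its fields, and helpers such as stepDown and sendHeartbeats are assumptions for illustration; real implementations send the RPCs in parallel, persist currentTerm and votedFor, and keep handling timeouts while waiting):

// startElection is a simplified sketch of steps 2-6 above.
func (n *Node) startElection() {
    n.state = Candidate
    n.currentTerm++   // start a new term
    n.votedFor = n.id // vote for ourselves
    votes := 1

    for _, peer := range n.peers {
        req := RequestVoteRequest{ // fields described in the next section
            Term:         n.currentTerm,
            CandidateID:  n.id,
            LastLogIndex: n.lastLogIndex(),
            LastLogTerm:  n.lastLogTerm(),
        }
        resp := peer.RequestVote(req)
        if resp.Term > n.currentTerm {
            n.stepDown(resp.Term) // someone is ahead of us: back to follower
            return
        }
        if resp.VoteGranted {
            votes++
        }
    }
    // A majority of the full cluster (ourselves + peers) wins the election.
    if votes > (len(n.peers)+1)/2 {
        n.state = Leader
        n.sendHeartbeats() // assert leadership immediately (step 6)
    }
}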

RequestVote RPC

RequestVoteRequest contains:

  • term: Candidate’s term number
  • candidate_id: ID of the candidate requesting vote
  • last_log_index: Index of candidate’s last log entry
  • last_log_term: Term of candidate’s last log entry

RequestVoteResponse contains:

  • term: Current term (for candidate to update itself)
  • vote_granted: Boolean indicating if candidate received vote
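
As Go structs, mirroring the fields listed above (a real implementation would add whatever metadata its RPC framework requires):

// RequestVoteRequest is sent by a candidate to gather votes.
type RequestVoteRequest struct {
    Term         int // candidate's term
    CandidateID  int // ID of the candidate requesting the vote
    LastLogIndex int // index of the candidate's last log entry
    LastLogTerm  int // term of the candidate's last log entry
}

// RequestVoteResponse is the voter's reply.
type RequestVoteResponse struct {
    Term        int  // the voter's current term, so the candidate can update itself
    VoteGranted bool // true if the candidate received this server's vote
}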

RequestVote RPC Flow:

Candidate          Follower 1        Follower 2        Follower 3
--------          ----------        ----------        ----------
    |                  |                  |                  |
    |--RequestVote---->|                  |                  |
    |--RequestVote----------------------->|                  |
    |--RequestVote------------------------------------------>|
    |                  |                  |                  |
    |<--Vote (true)----|                  |                  |
    |<--Vote (true)-----------------------|                  |
    |<--Vote (false)----------------------------------------|
    |                  |                  |                  |
    | (2 follower votes + its own vote = 3 of 4: majority)
    | Becomes Leader

Voting Rules

A server votes for a candidate if:

  1. Term check: Candidate’s term >= server’s current term
  2. Log completeness: Candidate’s log is at least as up-to-date as server’s log
  3. First vote: Server hasn’t voted for another candidate in this term

Log Completeness

A log is more up-to-date if:

  • The last entry has a higher term, OR
  • The last entries have the same term, but the log is longer

This ensures the leader has all committed entries.
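
Combining the voting rules with the up-to-date check gives a follower-side handler along these lines (a sketch: persistence of currentTerm and votedFor and the election-timer reset are omitted, and -1 stands for "no vote cast this term"):

// handleRequestVote applies the three voting rules plus the log
// up-to-date check. lastLogIndex/lastLogTerm describe this server's log.
func (n *Node) handleRequestVote(req RequestVoteRequest) RequestVoteResponse {
    // Rule 1: reject candidates from an older term.
    if req.Term < n.currentTerm {
        return RequestVoteResponse{Term: n.currentTerm, VoteGranted: false}
    }
    // A higher term means we are stale: adopt it and revert to follower.
    if req.Term > n.currentTerm {
        n.currentTerm = req.Term
        n.votedFor = -1
        n.state = Follower
    }
    // Rule 2: the candidate's log must be at least as up-to-date as ours.
    upToDate := req.LastLogTerm > n.lastLogTerm() ||
        (req.LastLogTerm == n.lastLogTerm() && req.LastLogIndex >= n.lastLogIndex())
    // Rule 3: at most one vote per term.
    if upToDate && (n.votedFor == -1 || n.votedFor == req.CandidateID) {
        n.votedFor = req.CandidateID
        return RequestVoteResponse{Term: n.currentTerm, VoteGranted: true}
    }
    return RequestVoteResponse{Term: n.currentTerm, VoteGranted: false}
}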

Split Vote Prevention

Raft uses randomized election timeouts to prevent split votes:

  • Random timeout: Each server picks a random timeout between 150-300ms
  • First to timeout: Usually wins election before others timeout
  • Split vote rare: Even if split occurs, new election happens quickly
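
The randomization itself is tiny; for example, with the commonly cited 150-300 ms range:

import (
    "math/rand"
    "time"
)

// electionTimeout picks a fresh random timeout in [150ms, 300ms).
// A new value should be drawn every time the election timer is reset.
func electionTimeout() time.Duration {
    return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}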

Example: Leader Election

Timeline:

Time    Server A (Follower)    Server B (Follower)    Server C (Follower)
----    -------------------    -------------------    -------------------
T0      Waiting (180ms)        Waiting (250ms)        Waiting (200ms)
        for heartbeat          for heartbeat          for heartbeat

T180    Timeout!
        → Become Candidate
        → RequestVote(B, C)

T181                            Vote for A             Vote for A

T182    Received 2 votes (majority)
        → Become Leader
        → AppendEntries(heartbeat) → B
        → AppendEntries(heartbeat) → C

T183                            Become Follower        Become Follower

Log Replication

Once a leader is elected, it begins serving client requests. Each client request contains a command to be executed by the replicated state machine.

Log Structure

Each log entry contains:

  • Index: Position in the log (1-indexed)
  • Term: Term when entry was created
  • Command: State machine command
Index:  1    2    3    4    5    6
Term:   1    1    2    2    2    3
Cmd:    x=1  y=2  x=3  z=4  w=5  y=6
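
As a Go struct (the command is opaque to Raft itself, so it is sketched here as raw bytes):

// LogEntry is one slot in the replicated log.
type LogEntry struct {
    Index   int    // position in the log, 1-indexed
    Term    int    // term in which a leader created the entry
    Command []byte // state machine command, e.g. "x=1"
}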

Replication Process

  1. Client Request: Client sends command to leader
  2. Append to Log: Leader appends entry to its log
  3. Replicate: Leader sends AppendEntries RPCs to all followers
  4. Acknowledge: Followers append entry and respond
  5. Commit: When majority acknowledges, leader commits entry
  6. Apply: Leader applies entry to state machine and responds to client
  7. Notify Followers: Leader notifies followers of committed entries
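
On the leader, the seven steps above collapse into a handler roughly like this (a deliberately synchronous sketch: real leaders replicate in parallel, retry failed followers, and only answer the client after applying the entry; replicateTo and applyCommitted are assumed helpers):

import "errors"

// handleClientCommand sketches the leader-side path for one client request.
func (n *Node) handleClientCommand(cmd []byte) error {
    if n.state != Leader {
        return errors.New("not the leader: redirect the client") // step 1
    }
    // Step 2: append the command to our own log.
    entry := LogEntry{Index: len(n.log) + 1, Term: n.currentTerm, Command: cmd}
    n.log = append(n.log, entry)

    // Steps 3-4: replicate and count acknowledgements (ourselves included).
    acks := 1
    for _, peer := range n.peers {
        if n.replicateTo(peer, entry) {
            acks++
        }
    }
    // Steps 5-7: commit once a majority has the entry, then apply it.
    // Followers learn the new commit index on the next AppendEntries.
    if acks > (len(n.peers)+1)/2 {
        n.commitIndex = entry.Index
        n.applyCommitted()
        return nil
    }
    return errors.New("could not reach a majority")
}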

AppendEntries RPC

AppendEntriesRequest contains:

  • term: Leader’s term number
  • leader_id: ID of the leader
  • prev_log_index: Index of log entry immediately preceding new ones
  • prev_log_term: Term of prev_log_index entry
  • entries: Log entries to store (empty for heartbeat)
  • leader_commit: Leader’s commitIndex

AppendEntriesResponse contains:

  • term: Current term (for leader to update itself)
  • success: Boolean indicating if follower contained entry matching prev_log_index and prev_log_term
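
As Go structs, again mirroring the fields above:

// AppendEntriesRequest is sent by the leader, both to replicate entries
// and, with an empty Entries slice, as a heartbeat.
type AppendEntriesRequest struct {
    Term         int        // leader's term
    LeaderID     int        // so followers can redirect clients
    PrevLogIndex int        // index of the entry immediately preceding Entries
    PrevLogTerm  int        // term of the entry at PrevLogIndex
    Entries      []LogEntry // log entries to store (empty for heartbeat)
    LeaderCommit int        // leader's commitIndex
}

// AppendEntriesResponse is the follower's reply.
type AppendEntriesResponse struct {
    Term    int  // follower's current term, so the leader can update itself
    Success bool // true if the follower had the entry at PrevLogIndex/PrevLogTerm
}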

AppendEntries RPC Flow:

Leader            Follower 1        Follower 2        Follower 3
------            ----------        ----------        ----------
  |                    |                  |                  |
  |--AppendEntries---->|                  |                  |
  |--AppendEntries---------------------->|                  |
  |--AppendEntries---------------------------------------->|
  |                    |                  |                  |
  |<--Response(success)|                  |                  |
  |<--Response(success)---------------------|                  |
  |<--Response(fail)----------------------------------------|
  |                    |                  |                  |
  | (Majority acknowledged - Entry committed)

Consistency Check

The prev_log_index and prev_log_term ensure log consistency:

  • Matching entry: Follower must have entry at prev_log_index with term prev_log_term
  • If match: Follower appends new entries and deletes conflicting entries
  • If mismatch: Follower rejects and leader retries with earlier entry
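
A follower-side sketch of that check (1-indexed entries stored in a 0-indexed slice, so entry i lives at n.log[i-1]; a PrevLogIndex of 0 means "start of the log"; resetElectionTimer and applyCommitted are assumed helpers):

// handleAppendEntries applies the consistency check, then appends entries.
func (n *Node) handleAppendEntries(req AppendEntriesRequest) AppendEntriesResponse {
    if req.Term < n.currentTerm {
        return AppendEntriesResponse{Term: n.currentTerm, Success: false}
    }
    // A valid AppendEntries for this (or a newer) term means a leader exists.
    n.currentTerm = req.Term
    n.state = Follower
    n.resetElectionTimer()

    // Consistency check: we must already hold the entry at PrevLogIndex
    // with the expected term.
    if req.PrevLogIndex > 0 {
        if len(n.log) < req.PrevLogIndex ||
            n.log[req.PrevLogIndex-1].Term != req.PrevLogTerm {
            return AppendEntriesResponse{Term: n.currentTerm, Success: false}
        }
    }
    // Drop any conflicting suffix and append the new entries. (Production
    // code truncates only on a real conflict so reordered RPCs are harmless.)
    n.log = append(n.log[:req.PrevLogIndex], req.Entries...)

    // Learn the leader's commit index and apply newly committed entries.
    if req.LeaderCommit > n.commitIndex {
        n.commitIndex = req.LeaderCommit
        if last := len(n.log); n.commitIndex > last {
            n.commitIndex = last
        }
        n.applyCommitted()
    }
    return AppendEntriesResponse{Term: n.currentTerm, Success: true}
}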

Commit Index

  • commitIndex: Highest log entry known to be committed; the leader advances it, and followers learn it from leader_commit in AppendEntries
  • lastApplied: Highest log entry applied to the state machine (always <= commitIndex)
  • Commit rule: An entry is committed once it is stored on a majority of servers
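
On the leader, the commit rule boils down to scanning for the highest index stored on a majority; matchIndex (how far each follower's log is known to match the leader's) drives the scan. A sketch:

// advanceCommitIndex moves commitIndex to the highest index N that is
// stored on a majority of servers. Raft additionally requires that
// log[N] was created in the current term; older entries then become
// committed indirectly.
func (n *Node) advanceCommitIndex() {
    for N := len(n.log); N > n.commitIndex; N-- {
        if n.log[N-1].Term != n.currentTerm {
            continue
        }
        count := 1 // the leader itself stores every entry in its log
        for _, match := range n.matchIndex {
            if match >= N {
                count++
            }
        }
        if count > (len(n.matchIndex)+1)/2 {
            n.commitIndex = N
            n.applyCommitted()
            return
        }
    }
}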

Example: Log Replication

Initial State:

  • Leader (A): Log [1,2,3]
  • Follower (B): Log [1,2,3]
  • Follower (C): Log [1,2,3]

Execution:

Client          Leader (A)         Follower (B)       Follower (C)
------          ----------         ----------         ----------
  |                  |                   |                   |
  |--Write x=4------>|                   |                   |
  |                  |                   |                   |
  |              Append entry 4          |                   |
  |              Log: [1,2,3,4]          |                   |
  |                  |                   |                   |
  |                  |--AppendEntries(prev=3, entries=[4])-->|
  |                  |--AppendEntries(prev=3, entries=[4])--|
  |                  |                   |                   |
  |                  |                   | Log: [1,2,3,4]    |
  |                  |<--Response(success)                  |
  |                  |<--Response(success)-------------------|
  |                  |                   |                   |
  |              Majority acknowledged                       |
  |              Commit entry 4                              |
  |                  |                   |                   |
  |                  |--AppendEntries(commit=4)------------->|
  |                  |--AppendEntries(commit=4)-------------|
  |                  |                   |                   |
  |              Apply x=4            Apply x=4          Apply x=4
  |<--Success--------|                   |                   |

Safety Properties

Raft ensures several critical safety properties:

Election Safety

At most one leader can be elected in a given term.

This is guaranteed by:

  • Each server votes at most once per term
  • Majority voting ensures at most one candidate can win

Leader Append-Only

A leader never overwrites or deletes entries in its log; it only appends new entries.

This ensures that once an entry is in the leader’s log, it remains there.

Log Matching

If two logs contain an entry with the same index and term, then the logs are identical in all preceding entries.

This is maintained by:

  • Consistency check in AppendEntries
  • Leader only creates one entry per log position per term

Leader Completeness

If a log entry is committed in a given term, then that entry will be present in the logs of all leaders for all higher-numbered terms.

This is guaranteed by:

  • Election restriction: only a candidate whose log contains all committed entries can win an election
  • Log completeness check during voting

State Machine Safety

If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry at the same index.

This follows from the previous properties.

Handling Failures

Leader Failure

When a leader fails:

  1. No heartbeats: Followers stop receiving AppendEntries
  2. Election timeout: Followers become candidates
  3. New election: New leader is elected
  4. Log recovery: New leader brings followers up to date

Follower Failure

When a follower fails:

  1. Leader continues: Leader continues serving requests
  2. Missing entries: Follower misses some log entries
  3. Recovery: When follower recovers, leader resends missing entries
  4. Catch up: Follower catches up using AppendEntries
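
Step 3 above is usually driven by a per-follower nextIndex that the leader walks backwards on each rejection until the consistency check passes (a sketch; Peer, termAt, and stepDown are assumed, and real implementations back up more aggressively or fall back to InstallSnapshot for a follower that is very far behind):

// syncFollower retries AppendEntries until the follower's log matches,
// then ships everything the follower is missing.
func (n *Node) syncFollower(peer Peer) {
    for n.state == Leader {
        next := n.nextIndex[peer.ID()] // first entry the follower might be missing
        req := AppendEntriesRequest{
            Term:         n.currentTerm,
            LeaderID:     n.id,
            PrevLogIndex: next - 1,
            PrevLogTerm:  n.termAt(next - 1), // 0 when next-1 == 0
            Entries:      n.log[next-1:],     // everything from 'next' onwards
            LeaderCommit: n.commitIndex,
        }
        resp := peer.AppendEntries(req)
        if resp.Term > n.currentTerm {
            n.stepDown(resp.Term) // a newer leader exists
            return
        }
        if resp.Success {
            n.nextIndex[peer.ID()] = len(n.log) + 1
            n.matchIndex[peer.ID()] = len(n.log)
            return
        }
        n.nextIndex[peer.ID()]-- // consistency check failed: try one entry earlier
    }
}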

Network Partition

During a network partition:

  1. Minority partition: Cannot elect leader (needs majority)
  2. Majority partition: Continues operating normally
  3. Rejoin: When the partition heals, any stale leader stranded in the minority sees the higher term and steps down
  4. Log synchronization: New leader brings all nodes up to date

Example: Leader Failure

Timeline:

Time    Leader (A)        Follower (B)        Follower (C)
----    ----------        -----------         -----------
T0      [1,2,3]           [1,2,3]            [1,2,3]
        Serving            Waiting            Waiting

T100    CRASH!            Waiting            Waiting
        (down)             (timeout: 150ms)   (timeout: 250ms)

T250    (down)            Timeout!
                          → Become Candidate
                          → RequestVote(C)

T251    (down)            Received vote from C
                          → Become Leader
                          → AppendEntries(heartbeat) → C

T252    (down)            [1,2,3]             [1,2,3]
                          Serving              Following

Log Compaction

Over time, logs grow unbounded. Raft uses snapshotting for log compaction:

  1. Snapshot: Replace committed entries with snapshot
  2. Snapshot metadata: Include last included index and term
  3. InstallSnapshot RPC: Leader sends snapshots to lagging followers
  4. State machine: Snapshot includes complete state machine state

Snapshot Process

Before:  [1,2,3,4,5,6,7,8,9,10] (10 entries)
After:   [snapshot(1-7), 8,9,10] (snapshot + 3 entries)
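
A sketch of the bookkeeping for that compaction (the serialized state machine is application-specific, and the index arithmetic that maps absolute log indices onto the shortened slice is omitted here):

// Snapshot captures the state machine up to and including LastIncludedIndex.
type Snapshot struct {
    LastIncludedIndex int    // last log index folded into the snapshot
    LastIncludedTerm  int    // its term, needed for later consistency checks
    Data              []byte // serialized state machine
}

// compact replaces already-applied log entries 1..upto with a snapshot.
func (n *Node) compact(upto int, stateMachineData []byte) Snapshot {
    snap := Snapshot{
        LastIncludedIndex: upto,
        LastIncludedTerm:  n.log[upto-1].Term,
        Data:              stateMachineData,
    }
    n.log = append([]LogEntry(nil), n.log[upto:]...) // keep entries after 'upto'
    n.snapshot = snap
    return snap
}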

Cluster Membership Changes

Adding or removing servers requires careful handling to avoid split-brain scenarios.

Joint Consensus

Raft uses a two-phase approach for membership changes:

  1. Joint consensus: Transition to intermediate configuration (old + new)
  2. New consensus: Once joint consensus committed, transition to new configuration

This ensures that during the transition, both old and new majorities can make decisions.

Configuration Changes

Old: {A, B, C}
New: {A, B, C, D}

Step 1: Joint consensus {A,B,C} + {A,B,C,D}
Step 2: Once committed, switch to {A,B,C,D}
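
While the cluster is in the joint configuration, decisions (commits and elections) need a majority of the old servers and a majority of the new servers. A small sketch of that quorum check:

// jointQuorum reports whether the acknowledging servers form a majority of
// the old configuration AND of the new one, as required during joint
// consensus (the C_old,new phase).
func jointQuorum(acks map[int]bool, oldCfg, newCfg []int) bool {
    return hasMajority(acks, oldCfg) && hasMajority(acks, newCfg)
}

func hasMajority(acks map[int]bool, cfg []int) bool {
    count := 0
    for _, id := range cfg {
        if acks[id] {
            count++
        }
    }
    return count > len(cfg)/2
}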

Performance Characteristics

Throughput

  • Single leader: All writes go through leader (bottleneck)
  • Batching: Can batch multiple entries in one AppendEntries RPC
  • Pipelining: Can have multiple AppendEntries RPCs in flight

Latency

  • Client latency: One round trip from client to leader plus the commit round trip (typically under 10 ms within a datacenter)
  • Commit latency: One round trip from the leader to a majority of followers
  • Read optimization: Can serve reads from leader without log replication

Scalability

  • Log size: Grows linearly with number of operations
  • Network: O(n) messages per operation (n = number of servers)
  • Typical cluster size: 3-7 servers (odd number for majority)

Real-World Implementations

etcd

etcd uses Raft for distributed key-value store:

  • Configuration: Kubernetes, Docker Swarm use etcd
  • Consistency: Strong consistency guarantees
  • Performance: Handles thousands of writes per second

Consul

Consul uses Raft for service discovery and configuration:

  • Service registry: Maintains consistent service catalog
  • Leader election: Ensures single leader for coordination
  • Multi-datacenter: Uses Raft within each datacenter

CockroachDB

CockroachDB uses Raft (with Multi-Raft) for distributed SQL database:

  • Range replication: Each range uses Raft
  • Consistency: Strong consistency across regions
  • Scale: Handles millions of ranges

MongoDB Replica Sets

MongoDB uses a Raft-like algorithm for replica sets:

  • Primary election: Similar to Raft leader election
  • Oplog replication: Similar to log replication
  • Automatic failover: Handles primary failures

Raft vs. Paxos

Aspect              Raft                  Paxos
------              ----                  -----
Understandability   High                  Low
Leader              Always has a leader   No stable leader
Log structure       Sequential log        Independent proposals
Implementation      Easier                Harder
Performance         Similar               Similar
Safety              Same guarantees       Same guarantees

Common Implementation Challenges

Stale Reads

Problem: Reading from followers may return stale data.

Solution:

  • Read from leader (strong consistency)
  • Use read leases (time-based guarantee)
  • Use read index (check if entry is committed)
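
The read-index option can be sketched like this (confirmLeadershipWithHeartbeats and waitForApplied are assumed helpers; the heartbeat round is what stops a deposed leader from serving stale data):

import "errors"

// linearizableRead notes the current commit index, confirms leadership with
// a round of heartbeats, waits until the state machine has applied at least
// that far, and only then reads locally.
func (n *Node) linearizableRead(key string) ([]byte, error) {
    if n.state != Leader {
        return nil, errors.New("not the leader")
    }
    readIndex := n.commitIndex
    if !n.confirmLeadershipWithHeartbeats() { // a majority still follows us
        return nil, errors.New("leadership could not be confirmed")
    }
    n.waitForApplied(readIndex) // block until lastApplied >= readIndex
    return n.stateMachine.Get(key), nil
}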

Split Votes

Problem: Multiple candidates can cause repeated elections.

Solution:

  • Randomized timeouts
  • Pre-vote mechanism (a server first asks peers whether they would grant it a vote, without incrementing the term)

Log Growth

Problem: Log grows unbounded over time.

Solution:

  • Periodic snapshots
  • Log compaction
  • Truncate old entries

Network Partitions

Problem: Minority partition cannot make progress.

Solution:

  • This is by design (prevents split-brain)
  • Use majority quorums
  • Consider availability vs. consistency trade-offs

Best Practices

  1. Use odd number of servers: Ensures clear majority (3, 5, 7)
  2. Monitor leader health: Track leader election frequency
  3. Tune timeouts: Balance between availability and performance
  4. Handle snapshots: Implement efficient snapshotting
  5. Test failures: Regularly test leader failures and network partitions
  6. Monitor metrics: Track commit latency, election frequency, log size

Conclusion

Raft provides an understandable, correct, and efficient solution to the consensus problem. Its separation of concerns (leader election, log replication, safety) makes it easier to understand and implement than Paxos, while providing the same safety guarantees.

Key takeaways:

  • Understandability: Raft’s design prioritizes clarity
  • Safety: Strong guarantees even with failures
  • Performance: Efficient for most use cases
  • Real-world: Widely used in production systems
  • Trade-offs: Understand availability vs. consistency

Whether you’re building a distributed database, configuration system, or any system requiring consensus, Raft provides a solid foundation for ensuring consistency and correctness in distributed systems.