Understanding Raft Consensus Algorithm: A Comprehensive Guide
Consensus is one of the most fundamental problems in distributed systems. How do multiple nodes agree on a value when some nodes may fail or messages may be delayed? The Raft consensus algorithm provides an elegant solution to this problem, designed to be more understandable than its predecessor, Paxos.
What is Consensus?
Consensus is the problem of getting multiple nodes in a distributed system to agree on a value or sequence of values, even in the presence of failures. It’s the foundation for:
- Replicated state machines: Ensuring all replicas have the same state
- Distributed databases: Maintaining consistency across replicas
- Configuration management: Agreeing on cluster membership
- Leader election: Choosing a single leader among multiple candidates
Why Consensus Matters
In distributed systems, we need consensus to ensure:
- Consistency: All nodes see the same data
- Availability: System continues working despite failures
- Correctness: Operations are applied in the same order everywhere
The Problem with Paxos
Paxos, the classic consensus algorithm, has been notoriously difficult to understand and implement correctly. Its complexity comes from:
- Multiple roles: Proposers, acceptors, learners
- Complex state transitions: Hard to reason about correctness
- Implementation variations: Many “Paxos-like” algorithms exist
- Difficult debugging: Hard to understand what went wrong
This led Diego Ongaro and John Ousterhout to create Raft in 2013, with the explicit goal of making consensus understandable.
Raft: Designed for Understandability
Raft separates consensus into three relatively independent subproblems:
- Leader election: A new leader must be chosen when the current leader fails
- Log replication: The leader must accept log entries from clients and replicate them to a majority of servers
- Safety: If any server has applied a log entry at a given index to its state machine, no other server may apply a different entry at that index
This separation makes Raft easier to understand, implement, and reason about than Paxos.
Raft Basics
Server States
Each Raft server is always in one of three states:
- Leader: Handles all client requests and replicates log entries
- Follower: Passively receives log entries from the leader
- Candidate: Intermediate state used during leader election
[Follower] -- timeout --> [Candidate] -- majority votes --> [Leader]
     ^                         |                                |
     |                         |                                |
     +-------------------------+--------------------------------+
        (discovers current leader or a server with a higher term)
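Throughout this guide, short Go sketches illustrate the ideas. None of them come from a specific library; every type, field, and helper name is an assumption for illustration only. As a starting point, a server's role and minimal bookkeeping might look like this (LogEntry and Peer are sketched in later sections):

```go
package raft

import "sync"

// State is the role a Raft server currently holds.
type State int

const (
	Follower  State = iota // default state: responds to RPCs from leaders and candidates
	Candidate              // campaigning for leadership after an election timeout
	Leader                 // handles client requests and replicates log entries
)

// Server holds the minimal per-node state used by the sketches in this guide.
// A real implementation persists currentTerm, votedFor, and the log to disk.
type Server struct {
	mu          sync.Mutex
	id          int
	state       State
	currentTerm int
	votedFor    int        // candidate ID voted for in currentTerm, or -1
	log         []LogEntry // LogEntry is sketched in the Log Replication section
	commitIndex int        // highest entry known to be committed
	peers       []*Peer    // assumed RPC stubs for the other servers
	matchIndex  []int      // leader only: highest index replicated on each peer
}
```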
Terms
Raft divides time into terms, numbered with consecutive integers. Each term begins with an election, where one or more candidates attempt to become leader. If a candidate wins the election, it serves as leader for the rest of the term.
- Term number: Monotonically increasing
- Election: At the start of each term
- At most one leader: Per term (safety property)
RPCs
Raft servers communicate using two RPCs:
- RequestVote: Used by candidates to gather votes
- AppendEntries: Used by leaders to replicate log entries and as heartbeat
Leader Election
Leader election ensures that a new leader is chosen when the current leader fails or becomes unreachable.
Election Process
- Timeout: Followers wait for a random timeout (election timeout, typically 150-300ms)
- Become Candidate: If no heartbeat received, follower becomes candidate
- Request Votes: Candidate sends RequestVote RPCs to all other servers
- Vote: Each server grants its vote to at most one candidate per term, on a first-come basis (subject to the voting rules below)
- Win Election: Candidate becomes leader if it receives votes from majority
- Heartbeat: Leader sends periodic heartbeats to maintain leadership
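A rough sketch of the candidate side of these steps, assuming the illustrative Server type from earlier, the RequestVote message described in the next subsection, and hypothetical helpers (Peer.RequestVote, lastLogIndex, lastLogTerm, becomeLeader) that are not real library calls:

```go
// startElection sketches steps 2-5 above: bump the term, vote for
// ourselves, and ask every peer for its vote in parallel.
func (s *Server) startElection() {
	s.mu.Lock()
	s.state = Candidate
	s.currentTerm++    // a new term begins with this election
	s.votedFor = s.id  // vote for ourselves
	term := s.currentTerm
	votes := 1
	s.mu.Unlock()

	for _, p := range s.peers {
		go func(p *Peer) {
			resp, err := p.RequestVote(RequestVoteRequest{
				Term:         term,
				CandidateID:  s.id,
				LastLogIndex: s.lastLogIndex(),
				LastLogTerm:  s.lastLogTerm(),
			})
			if err != nil || !resp.VoteGranted {
				return
			}
			s.mu.Lock()
			defer s.mu.Unlock()
			votes++
			// Step 5: a majority of the full cluster (peers + ourselves) elects us.
			if s.state == Candidate && s.currentTerm == term && votes > (len(s.peers)+1)/2 {
				s.becomeLeader() // step 6: start sending periodic heartbeats
			}
		}(p)
	}
}
```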
RequestVote RPC
RequestVoteRequest contains:
- term: Candidate’s term number
- candidate_id: ID of the candidate requesting the vote
- last_log_index: Index of the candidate’s last log entry
- last_log_term: Term of the candidate’s last log entry
RequestVoteResponse contains:
- term: Current term (for the candidate to update itself)
- vote_granted: Boolean indicating whether the candidate received the vote
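Translated into the Go sketch (field names mirror the list above; this is illustrative, not the wire format of any particular implementation):

```go
// RequestVoteRequest is sent by a candidate to gather votes.
type RequestVoteRequest struct {
	Term         int // candidate's term number
	CandidateID  int // ID of the candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// RequestVoteResponse is the voter's answer.
type RequestVoteResponse struct {
	Term        int  // current term, so the candidate can update itself
	VoteGranted bool // true if the candidate received this server's vote
}
```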
RequestVote RPC Flow:
Candidate Follower 1 Follower 2 Follower 3
-------- ---------- ---------- ----------
| | | |
|--RequestVote---->| | |
|--RequestVote----------------------->| |
|--RequestVote------------------------------------------>|
| | | |
|<--Vote (true)----| | |
|<--Vote (true)-----------------------| |
|<--Vote (false)----------------------------------------|
| | | |
| (2 follower votes + its own vote = 3 of 4: majority)
| Becomes Leader
Voting Rules
A server votes for a candidate if:
- Term check: Candidate’s term >= server’s current term
- Log completeness: Candidate’s log is at least as up-to-date as server’s log
- First vote: Server hasn’t voted for another candidate in this term
Log Completeness
A log is more up-to-date if:
- The last entry has a higher term, OR
- The last entries have the same term, but the log is longer
This ensures the leader has all committed entries.
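A sketch of how a server might combine the voting rules with this up-to-date check, continuing the illustrative Server type from earlier:

```go
// isUpToDate reports whether a candidate's log, described by the index and
// term of its last entry, is at least as up-to-date as this server's log.
func (s *Server) isUpToDate(lastLogIndex, lastLogTerm int) bool {
	myIndex := len(s.log)
	myTerm := 0
	if myIndex > 0 {
		myTerm = s.log[myIndex-1].Term
	}
	if lastLogTerm != myTerm {
		return lastLogTerm > myTerm // higher last term wins
	}
	return lastLogIndex >= myIndex // same last term: longer log wins
}

// handleRequestVote applies the three voting rules listed above.
func (s *Server) handleRequestVote(req RequestVoteRequest) RequestVoteResponse {
	s.mu.Lock()
	defer s.mu.Unlock()

	if req.Term > s.currentTerm { // a newer term always demotes us to follower
		s.currentTerm = req.Term
		s.state = Follower
		s.votedFor = -1
	}

	grant := req.Term >= s.currentTerm && // term check
		(s.votedFor == -1 || s.votedFor == req.CandidateID) && // first vote this term
		s.isUpToDate(req.LastLogIndex, req.LastLogTerm) // log completeness

	if grant {
		s.votedFor = req.CandidateID
	}
	return RequestVoteResponse{Term: s.currentTerm, VoteGranted: grant}
}
```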
Split Vote Prevention
Raft uses randomized election timeouts to prevent split votes:
- Random timeout: Each server picks a random election timeout in the 150-300 ms range
- First to time out: That server becomes a candidate and usually wins before the others time out
- Split votes are rare: Even when one occurs, the next election (with fresh random timeouts) resolves it quickly
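Picking such a timeout is a one-liner in the Go sketch:

```go
import (
	"math/rand"
	"time"
)

// randomElectionTimeout returns a fresh timeout in the 150-300 ms range,
// re-drawn before every wait so repeated candidate collisions stay unlikely.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}
```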
Example: Leader Election
Timeline:
Time Server A (Follower) Server B (Follower) Server C (Follower)
---- ------------------- ------------------- -------------------
T0 Waiting (180ms) Waiting (250ms) Waiting (200ms)
for heartbeat for heartbeat for heartbeat
T180 Timeout!
→ Become Candidate
→ RequestVote(B, C)
T181 Vote for A Vote for A
T182 Received 2 votes (majority)
→ Become Leader
→ AppendEntries(heartbeat) → B
→ AppendEntries(heartbeat) → C
T183 Become Follower Become Follower
Log Replication
Once a leader is elected, it begins serving client requests. Each client request contains a command to be executed by the replicated state machine.
Log Structure
Each log entry contains:
- Index: Position in the log (1-indexed)
- Term: Term when entry was created
- Command: State machine command
Index: 1 2 3 4 5 6
Term: 1 1 2 2 2 3
Cmd: x=1 y=2 x=3 z=4 w=5 y=6
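In the Go sketch, an entry is just a small struct:

```go
// LogEntry is one slot in the replicated log.
type LogEntry struct {
	Index   int    // 1-indexed position in the log
	Term    int    // term in which the leader created the entry
	Command []byte // opaque state machine command, e.g. an encoded "x=1"
}
```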
Replication Process
- Client Request: Client sends command to leader
- Append to Log: Leader appends entry to its log
- Replicate: Leader sends AppendEntries RPCs to all followers
- Acknowledge: Followers append entry and respond
- Commit: When majority acknowledges, leader commits entry
- Apply: Leader applies entry to state machine and responds to client
- Notify Followers: Leader notifies followers of committed entries
AppendEntries RPC
AppendEntriesRequest contains:
- term: Leader’s term number
- leader_id: ID of the leader
- prev_log_index: Index of the log entry immediately preceding the new ones
- prev_log_term: Term of the prev_log_index entry
- entries: Log entries to store (empty for heartbeat)
- leader_commit: Leader’s commitIndex
AppendEntriesResponse contains:
- term: Current term (for the leader to update itself)
- success: Boolean indicating whether the follower contained an entry matching prev_log_index and prev_log_term
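And in the Go sketch (again mirroring the field list above, not any real wire format):

```go
// AppendEntriesRequest is sent by the leader to replicate entries and,
// with an empty Entries slice, as a heartbeat.
type AppendEntriesRequest struct {
	Term         int        // leader's term number
	LeaderID     int        // ID of the leader
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commitIndex
}

// AppendEntriesResponse is the follower's answer.
type AppendEntriesResponse struct {
	Term    int  // current term, so the leader can update itself
	Success bool // true if the follower matched PrevLogIndex and PrevLogTerm
}
```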
AppendEntries RPC Flow:
Leader Follower 1 Follower 2 Follower 3
------ ---------- ---------- ----------
| | | |
|--AppendEntries---->| | |
|--AppendEntries---------------------->| |
|--AppendEntries---------------------------------------->|
| | | |
|<--Response(success)| | |
|<--Response(success)---------------------| |
|<--Response(fail)----------------------------------------|
| | | |
| (Majority acknowledged - Entry committed)
Consistency Check
The prev_log_index and prev_log_term ensure log consistency:
- Matching entry: The follower must have an entry at prev_log_index with term prev_log_term
- If match: The follower appends the new entries and deletes any conflicting entries that follow
- If mismatch: The follower rejects the request and the leader retries with an earlier entry
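A simplified follower-side sketch of this check, continuing the illustrative Server type (no persistence; a full implementation truncates only from the first conflicting entry onward):

```go
// handleAppendEntries performs the consistency check, appends entries,
// and advances the follower's commit index.
func (s *Server) handleAppendEntries(req AppendEntriesRequest) AppendEntriesResponse {
	s.mu.Lock()
	defer s.mu.Unlock()

	if req.Term < s.currentTerm { // stale leader
		return AppendEntriesResponse{Term: s.currentTerm, Success: false}
	}
	s.currentTerm = req.Term
	s.state = Follower

	// Consistency check: we must already hold the entry the new ones attach to.
	if req.PrevLogIndex > 0 &&
		(req.PrevLogIndex > len(s.log) || s.log[req.PrevLogIndex-1].Term != req.PrevLogTerm) {
		return AppendEntriesResponse{Term: s.currentTerm, Success: false}
	}

	// Drop anything after the match point and append the new entries.
	s.log = append(s.log[:req.PrevLogIndex], req.Entries...)

	// Learn about newly committed entries from the leader, capped at our log length.
	if req.LeaderCommit > s.commitIndex {
		s.commitIndex = req.LeaderCommit
		if last := len(s.log); s.commitIndex > last {
			s.commitIndex = last
		}
	}
	return AppendEntriesResponse{Term: s.currentTerm, Success: true}
}
```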
Commit Index
- Leader commitIndex: Highest log entry known to be committed
- Follower commitIndex: Highest log entry the follower knows to be committed (learned from the leader’s leader_commit field)
- Commit rule: An entry from the leader’s current term is committed once it is stored on a majority of servers; earlier-term entries become committed indirectly when a later entry commits
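On the leader side, the commit rule can be sketched by counting how many servers store each index, using the matchIndex bookkeeping from the illustrative Server type:

```go
// advanceCommitIndex finds the highest index stored on a majority of servers
// and, if that entry belongs to the current term, marks it committed.
func (s *Server) advanceCommitIndex() {
	for n := len(s.log); n > s.commitIndex; n-- {
		count := 1 // the leader itself stores every entry in its log
		for _, m := range s.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > (len(s.peers)+1)/2 && s.log[n-1].Term == s.currentTerm {
			s.commitIndex = n
			return
		}
	}
}
```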
Example: Log Replication
Initial State:
- Leader (A): Log [1,2,3]
- Follower (B): Log [1,2,3]
- Follower (C): Log [1,2,3]
Execution:
Client Leader (A) Follower (B) Follower (C)
------ ---------- ---------- ----------
| | | |
|--Write x=4------>| | |
| | | |
| Append entry 4 | |
| Log: [1,2,3,4] | |
| | | |
| |--AppendEntries(prev=3, entries=[4])-->|
| |--AppendEntries(prev=3, entries=[4])--|
| | | |
| | Log: [1,2,3,4] Log: [1,2,3,4]
| |<--Response(success) |
| |<--Response(success)-------------------|
| | | |
| Majority acknowledged |
| Commit entry 4 |
| | | |
| |--AppendEntries(commit=4)------------->|
| |--AppendEntries(commit=4)-------------|
| | | |
| Apply x=4 Apply x=4 Apply x=4
|<--Success--------| | |
Safety Properties
Raft ensures several critical safety properties:
Election Safety
At most one leader can be elected in a given term.
This is guaranteed by:
- Each server votes at most once per term
- Majority voting ensures at most one candidate can win
Leader Append-Only
A leader never overwrites or deletes entries in its log; it only appends new entries.
This ensures that once an entry is in the leader’s log, it remains there.
Log Matching
If two logs contain an entry with the same index and term, then the logs are identical in all preceding entries.
This is maintained by:
- Consistency check in AppendEntries
- Leader only creates one entry per log position per term
Leader Completeness
If a log entry is committed in a given term, then that entry will be present in the logs of all leaders for all higher-numbered terms.
This is guaranteed by:
- Election restriction: only candidates with complete logs can become leader
- Log completeness check during voting
State Machine Safety
If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry at the same index.
This follows from the previous properties.
Handling Failures
Leader Failure
When a leader fails:
- No heartbeats: Followers stop receiving AppendEntries
- Election timeout: Followers become candidates
- New election: New leader is elected
- Log recovery: New leader brings followers up to date
Follower Failure
When a follower fails:
- Leader continues: Leader continues serving requests
- Missing entries: Follower misses some log entries
- Recovery: When follower recovers, leader resends missing entries
- Catch up: Follower catches up using AppendEntries
Network Partition
During a network partition:
- Minority partition: Cannot elect leader (needs majority)
- Majority partition: Continues operating normally
- Rejoin: When partition heals, minority partition’s leader steps down
- Log synchronization: New leader brings all nodes up to date
Example: Leader Failure
Timeline:
Time Leader (A) Follower (B) Follower (C)
---- ---------- ----------- -----------
T0 [1,2,3] [1,2,3] [1,2,3]
Serving Waiting Waiting
T100 CRASH! Waiting Waiting
(down) (timeout: 150ms) (timeout: 250ms)
T250 (down) Timeout!
→ Become Candidate
→ RequestVote(C)
T251 (down) Received vote from C
→ Become Leader
→ AppendEntries(heartbeat) → C
T252 (down) [1,2,3] [1,2,3]
Serving Following
Log Compaction
Over time, logs grow unbounded. Raft uses snapshotting for log compaction:
- Snapshot: Replace committed entries with snapshot
- Snapshot metadata: Include last included index and term
- InstallSnapshot RPC: Leader sends snapshots to lagging followers
- State machine: Snapshot includes complete state machine state
Snapshot Process
Before: [1,2,3,4,5,6,7,8,9,10] (10 entries)
After: [snapshot(1-7), 8,9,10] (snapshot + 3 entries)
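The metadata such a snapshot carries might be sketched as:

```go
// Snapshot replaces the compacted prefix of the log.
type Snapshot struct {
	LastIncludedIndex int    // index of the last entry the snapshot covers (7 above)
	LastIncludedTerm  int    // term of that entry
	StateMachineData  []byte // serialized state machine state up to that index
}
```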
Cluster Membership Changes
Adding or removing servers requires careful handling to avoid split-brain scenarios.
Joint Consensus
Raft uses a two-phase approach for membership changes:
- Joint consensus: Transition to intermediate configuration (old + new)
- New consensus: Once joint consensus committed, transition to new configuration
This ensures that during the transition, both old and new majorities can make decisions.
Configuration Changes
Old: {A, B, C}
New: {A, B, C, D}
Step 1: Joint consensus {A,B,C} + {A,B,C,D}
Step 2: Once committed, switch to {A,B,C,D}
Performance Characteristics
Throughput
- Single leader: All writes go through leader (bottleneck)
- Batching: Can batch multiple entries in one AppendEntries RPC
- Pipelining: Can have multiple AppendEntries RPCs in flight
Latency
- Client latency: One round trip from the client to the leader plus one leader-to-majority round trip (typically < 10 ms within a datacenter)
- Commit latency: One round trip from the leader to a majority of followers
- Read optimization: Can serve reads from leader without log replication
Scalability
- Log size: Grows linearly with number of operations
- Network: O(n) messages per operation (n = number of servers)
- Typical cluster size: 3-7 servers (odd number for majority)
Real-World Implementations
etcd
etcd uses Raft for distributed key-value store:
- Configuration: Kubernetes, Docker Swarm use etcd
- Consistency: Strong consistency guarantees
- Performance: Handles thousands of writes per second
Consul
Consul uses Raft for service discovery and configuration:
- Service registry: Maintains consistent service catalog
- Leader election: Ensures single leader for coordination
- Multi-datacenter: Uses Raft within each datacenter
CockroachDB
CockroachDB uses Raft (with Multi-Raft) for distributed SQL database:
- Range replication: Each range uses Raft
- Consistency: Strong consistency across regions
- Scale: Handles millions of ranges
MongoDB Replica Sets
MongoDB uses a Raft-like algorithm for replica sets:
- Primary election: Similar to Raft leader election
- Oplog replication: Similar to log replication
- Automatic failover: Handles primary failures
Raft vs. Paxos
| Aspect | Raft | Paxos |
|---|---|---|
| Understandability | High | Low |
| Leader | Strong leader built into the protocol | No built-in leader (Multi-Paxos adds one) |
| Log structure | Sequential log | Independent proposals |
| Implementation | Easier | Harder |
| Performance | Similar | Similar |
| Safety | Same guarantees | Same guarantees |
Common Implementation Challenges
Stale Reads
Problem: Reading from followers may return stale data.
Solution:
- Read from leader (strong consistency)
- Use read leases (time-based guarantee)
- Use read index (check if entry is committed)
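A rough sketch of the read-index idea, with confirmLeadership, waitForApplied, and stateMachine as assumed helpers rather than real library calls:

```go
// linearizableRead is a simplified read-index sketch: capture the current
// commit index, confirm we are still leader, wait for the state machine to
// catch up to that point, then serve the read locally.
func (s *Server) linearizableRead(key string) (value []byte, ok bool) {
	s.mu.Lock()
	if s.state != Leader {
		s.mu.Unlock()
		return nil, false // redirect the client to the current leader
	}
	readIndex := s.commitIndex // the read must reflect at least this entry
	s.mu.Unlock()

	// A heartbeat round to a majority proves no newer leader has been elected.
	if !s.confirmLeadership() {
		return nil, false
	}
	s.waitForApplied(readIndex) // block until lastApplied >= readIndex
	return s.stateMachine.Get(key), true
}
```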
Split Votes
Problem: Multiple candidates can cause repeated elections.
Solution:
- Randomized timeouts
- Pre-vote mechanism (a would-be candidate first checks that it could win before incrementing its term)
Log Growth
Problem: Log grows unbounded over time.
Solution:
- Periodic snapshots
- Log compaction
- Truncate old entries
Network Partitions
Problem: Minority partition cannot make progress.
Solution:
- This is by design (prevents split-brain)
- Use majority quorums
- Consider availability vs. consistency trade-offs
Best Practices
- Use odd number of servers: Ensures clear majority (3, 5, 7)
- Monitor leader health: Track leader election frequency
- Tune timeouts: Balance between availability and performance
- Handle snapshots: Implement efficient snapshotting
- Test failures: Regularly test leader failures and network partitions
- Monitor metrics: Track commit latency, election frequency, log size
Conclusion
Raft provides an understandable, correct, and efficient solution to the consensus problem. Its separation of concerns (leader election, log replication, safety) makes it easier to understand and implement than Paxos while providing the same safety guarantees.
Key takeaways:
- Understandability: Raft’s design prioritizes clarity
- Safety: Strong guarantees even with failures
- Performance: Efficient for most use cases
- Real-world: Widely used in production systems
- Trade-offs: Understand availability vs. consistency
Whether you’re building a distributed database, configuration system, or any system requiring consensus, Raft provides a solid foundation for ensuring consistency and correctness in distributed systems.