22_Leader_Election
Difficulty: Advanced
Generated on: 2025-07-13 02:55:08
Category: System Design Cheatsheet
Leader Election - Advanced System Design Cheatsheet
1. Core Concept
Leader election is the process of choosing a single node (the “leader”) from a distributed set of nodes to be responsible for coordinating tasks or making decisions. It’s crucial for achieving fault tolerance, consistency, and coordination in distributed systems where a single point of failure is unacceptable. Without a leader, systems can suffer from split-brain scenarios, data inconsistencies, and overall instability.
2. Key Principles
- Safety: Only one leader can be active at any given time. This is paramount to avoid conflicts and data corruption.
- Liveness: If the current leader fails, a new leader will eventually be elected. The system must recover and continue functioning.
- Fault Tolerance: The system should tolerate node failures (both leader and follower nodes) without compromising safety or liveness.
- Consistency: The elected leader should have a consistent view of the system’s state.
- Quorum: Most leader election algorithms rely on a quorum (typically a majority) of nodes agreeing on the leader. This ensures that the election is valid even if some nodes are unavailable.
- Heartbeats: The leader periodically sends heartbeats to followers to assert it is still alive (some systems also have followers ping the leader). A missed heartbeat window is treated as a failure and triggers a new election.
- Epochs/Terms: Each leader election is associated with a unique epoch or term number. This helps prevent stale leaders from interfering with the new leader.
- Fencing Tokens: Assigning monotonically increasing “fencing tokens” to leaders helps prevent stale leaders from performing operations after a new leader has been elected. The token must be passed with every write operation.
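The fencing-token idea above can be shown in a few lines. This is a minimal single-process sketch (the `FencedStore` name and API are illustrative, not from any real library): the store remembers the highest token it has seen and rejects writes carrying an older one, so a deposed leader cannot clobber the new leader's writes.

```python
# Minimal sketch of fencing tokens: the store rejects any write whose
# token is older than the newest token it has seen, fencing out a
# stale (deposed) leader. Names here are illustrative.

class FencedStore:
    def __init__(self):
        self.max_token = -1          # highest fencing token seen so far
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.max_token:   # stale leader: reject the write
            return False
        self.max_token = token       # tokens increase monotonically
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(1, "config", "v1")         # leader with token 1
assert store.write(2, "config", "v2")         # new leader, token 2
assert not store.write(1, "config", "stale")  # old leader is fenced
assert store.data["config"] == "v2"
```

The key property is that the check happens on the storage side, so it works even if the stale leader's process has no idea it was deposed.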
3. Diagrams
- Basic Leader Election Flow (Raft Example):
```mermaid
sequenceDiagram
    participant Follower1
    participant Follower2
    participant Follower3
    participant Candidate
    participant Leader

    Follower1->>Candidate: Timeout, starts election
    Candidate->>Follower1: Request Vote (term=N)
    Candidate->>Follower2: Request Vote (term=N)
    Candidate->>Follower3: Request Vote (term=N)
    Follower1->>Candidate: Vote Granted
    Follower2->>Candidate: Vote Granted
    Follower3->>Candidate: Vote Granted
    Candidate->>Leader: Become Leader (term=N)
    Leader->>Follower1: Heartbeat (term=N)
    Leader->>Follower2: Heartbeat (term=N)
    Leader->>Follower3: Heartbeat (term=N)

    alt Leader fails
        Leader->>Follower1: X (Failure)
        Follower1->>Candidate: Timeout, starts election
    end
```

- Leader Election with Distributed Lock (e.g., using ZooKeeper):

```mermaid
sequenceDiagram
    participant Node1
    participant Node2
    participant Node3
    participant ZooKeeper

    Node1->>ZooKeeper: Attempt to create lock node (/leader)
    ZooKeeper->>Node1: Lock created successfully (Node1 becomes leader)
    Node2->>ZooKeeper: Attempt to create lock node (/leader)
    ZooKeeper->>Node2: Lock creation failed (Node2 watches /leader)
    Node3->>ZooKeeper: Attempt to create lock node (/leader)
    ZooKeeper->>Node3: Lock creation failed (Node3 watches /leader)

    alt Node1 fails
        Node1->>ZooKeeper: X (Failure)
        ZooKeeper->>Node2: /leader node deleted
        Node2->>ZooKeeper: Node2 becomes new leader
        ZooKeeper->>Node3: /leader node deleted
        Node3->>ZooKeeper: Node3 retries, fails
    end
```

4. Use Cases
- Distributed Databases: Electing a primary node for write operations (e.g., Raft in etcd, MongoDB's replica-set elections).
- Message Queues: Selecting a primary broker to handle message routing and persistence (e.g., Kafka controller election).
- Distributed Lock Management: Granting exclusive access to a resource to only one node at a time (e.g., using ZooKeeper or Consul).
- Service Discovery: Choosing a leader to maintain the service registry and handle service registration/deregistration.
- Configuration Management: Electing a leader to manage and distribute configuration updates across a cluster.
- Scheduler: Electing a leader to schedule jobs across multiple worker nodes.
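Several of the use cases above reduce to "the first node to grab a lock becomes leader," the pattern shown in the ZooKeeper diagram. Below is a deterministic single-process sketch of it; `LockService` and `Node` are illustrative stand-ins for a real coordination service and its clients, not any real API.

```python
# Single-process sketch of lock-based leader election: the first node
# to create the lock becomes leader; when it dies, the first watcher
# takes over. All names are illustrative.

class LockService:
    """Stands in for ZooKeeper's ephemeral /leader znode."""
    def __init__(self):
        self.owner = None
        self.watchers = []              # nodes waiting on the lock

    def try_acquire(self, node) -> bool:
        if self.owner is None:
            self.owner = node
            return True
        self.watchers.append(node)      # lost the race: watch the lock
        return False

    def release(self):
        # The ephemeral node vanishes when its owner dies.
        if self.owner is not None:
            self.owner.is_leader = False
        self.owner = None
        if self.watchers:
            self.watchers.pop(0).on_lock_released(self)

class Node:
    def __init__(self, name: str):
        self.name = name
        self.is_leader = False

    def campaign(self, lock: LockService):
        self.is_leader = lock.try_acquire(self)

    def on_lock_released(self, lock: LockService):
        self.campaign(lock)             # retry once the lock frees up

lock = LockService()
nodes = [Node(f"node{i}") for i in range(1, 4)]
for n in nodes:
    n.campaign(lock)
assert [n.is_leader for n in nodes] == [True, False, False]

lock.release()                          # simulate node1 failing
assert nodes[1].is_leader               # first watcher becomes leader
```

In a real deployment the "ephemeral node vanishing" is driven by session timeouts in the coordination service, and a fencing token (see Key Principles) should accompany the lock to guard against stale owners.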
When to Use:
- When you need a single point of coordination in a distributed system.
- When you require fault tolerance and automatic failover.
- When you need to ensure consistency across multiple nodes.
When to Avoid:
- For simple, stateless systems where no coordination is needed.
- When the overhead of leader election outweighs the benefits (e.g., very small clusters).
- For use cases that are easily sharded and can operate independently without coordination.
5. Trade-offs
| Trade-off | Description |
|---|---|
| Complexity | Leader election algorithms can be complex to implement and debug. Consider using a well-tested library or framework. |
| Performance | Election processes can introduce latency, especially during failover. Optimize heartbeat intervals and election timeouts. |
| Consistency vs. Availability | Strong consistency (e.g., Raft) guarantees that all nodes see the same data but may sacrifice availability during network partitions. Weaker consistency (e.g., Gossip protocol) prioritizes availability over consistency. |
| Split-Brain | A critical concern. Algorithms must be carefully designed to prevent split-brain scenarios, where multiple nodes believe they are the leader. |
| False Positives (Leader Death) | A follower may incorrectly assume the leader is dead due to temporary network issues, triggering an unnecessary election. This adds load to the system. |
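The split-brain row above is exactly what a majority quorum rules out: two disjoint partitions of an N-node cluster cannot both contain more than N // 2 nodes, so at most one side can ever win an election. A tiny arithmetic sketch:

```python
# Why a strict-majority quorum prevents split-brain: no partition of
# N nodes into two sides can give both sides a majority.

def can_elect(votes: int, cluster_size: int) -> bool:
    return votes > cluster_size // 2   # strict majority required

N = 5
# A network partition splits the cluster 3 / 2:
assert can_elect(3, N)        # majority side can elect a leader
assert not can_elect(2, N)    # minority side cannot

# Exhaustively: no split lets both sides elect simultaneously.
assert all(not (can_elect(k, N) and can_elect(N - k, N))
           for k in range(N + 1))
```

This is also why odd cluster sizes are preferred: a 4-node cluster still needs 3 votes, so it tolerates no more failures than a 3-node cluster.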
6. Scalability & Performance
- Scalability: The scalability of leader election depends on the underlying algorithm and implementation.
- Raft/Paxos: Consensus groups typically use a small set of voters (3-7 nodes). Performance degrades as the voter count grows, since every election and every committed decision requires a round of quorum communication.
- ZooKeeper/Consul: The coordination ensemble itself is kept small (typically 3-7 voting servers; ZooKeeper uses the ZAB protocol, Consul uses Raft), though it can coordinate a much larger number of client nodes.
- Performance:
- Election Latency: The time it takes to elect a new leader after a failure. Minimize this by optimizing heartbeat intervals and election timeouts.
- Heartbeat Frequency: A trade-off between detecting failures quickly and reducing network overhead.
- Network Communication: Leader election algorithms rely heavily on network communication. Optimize network latency and bandwidth.
- CPU Usage: Election processes can be CPU-intensive, especially during high contention.
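The election-latency and heartbeat-frequency bullets interact: followers only declare the leader dead after missing heartbeats for a full election timeout, and Raft randomizes that timeout so candidates don't all start elections at once. A deterministic sketch (the specific intervals are illustrative, not tuned recommendations):

```python
import random

# Sketch of Raft-style failure detection: a follower presumes the
# leader dead when no heartbeat arrives within its randomized election
# timeout. Randomization staggers candidates and reduces split votes.
# Interval values below are illustrative only.

HEARTBEAT_INTERVAL = 0.05            # leader sends every 50 ms

def election_timeout(rng=random.random):
    # Conventionally several times the heartbeat interval, with jitter.
    return 0.15 + rng() * 0.15       # uniform in [150 ms, 300 ms)

def leader_presumed_dead(last_heartbeat: float, now: float,
                         timeout: float) -> bool:
    return now - last_heartbeat > timeout

t = election_timeout()
assert 0.15 <= t <= 0.30
assert not leader_presumed_dead(last_heartbeat=1.00, now=1.10, timeout=t)
assert leader_presumed_dead(last_heartbeat=1.00, now=1.40, timeout=t)
```

Shrinking the timeout detects real failures faster but raises the false-positive rate under network jitter, the exact trade-off called out in the table above.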
Strategies for Improving Scalability & Performance:
- Reduce Heartbeat Interval: Carefully tune the heartbeat interval to balance responsiveness and network load.
- Optimize Network Latency: Place nodes in close proximity to minimize network latency.
- Batch Operations: Batch multiple operations together to reduce the number of network requests.
- Use a Dedicated Network: Isolate leader election traffic on a dedicated network to minimize interference.
- Hierarchical Leader Election: For very large clusters, consider a hierarchical approach where leaders are elected within smaller groups, and then a global leader is elected from the group leaders.
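The hierarchical idea in the last bullet can be sketched in a few lines. Here "lowest ID wins" is a deliberate placeholder for whatever real election algorithm each group runs; the group names and IDs are made up for illustration.

```python
# Sketch of hierarchical leader election: each group elects a local
# leader, then a global leader is chosen among the group leaders.
# "Lowest ID wins" stands in for a real election algorithm.

def elect(nodes):
    return min(nodes)                # placeholder election rule

groups = {
    "rack-a": [7, 3, 9],
    "rack-b": [12, 5],
    "rack-c": [8, 14, 6],
}
group_leaders = {g: elect(members) for g, members in groups.items()}
global_leader = elect(group_leaders.values())

assert group_leaders == {"rack-a": 3, "rack-b": 5, "rack-c": 6}
assert global_leader == 3
```

The payoff is that most election traffic stays inside a group; only the handful of group leaders participate in the global round.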
7. Real-world Examples
- etcd (Kubernetes): Uses Raft for leader election and distributed key-value store. Kubernetes relies on etcd to store cluster state and configuration.
- ZooKeeper (Hadoop, Kafka): Uses the ZAB protocol for leader election. Hadoop uses ZooKeeper to manage NameNode high availability. Kafka uses ZooKeeper for controller election.
- Consul (HashiCorp): Uses Raft for leader election and service discovery.
- Cassandra: Leaderless for normal reads and writes; uses a Paxos-based protocol only for lightweight transactions (linearizable compare-and-set operations), not for electing a cluster-wide leader.
- Kafka: Uses ZooKeeper for controller election and cluster metadata management (prior to KRaft). Now transitioning to KRaft (Kafka Raft) which removes the dependency on ZooKeeper.
Example: Kubernetes and etcd
Kubernetes uses etcd as its backing store. etcd uses Raft for leader election, ensuring only one etcd node acts as the leader for writing cluster state. If the etcd leader fails, a new leader is automatically elected, ensuring high availability for the Kubernetes control plane.
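Beyond etcd's internal Raft, Kubernetes control-plane components (e.g., the scheduler and controller-manager) elect an active replica using Lease objects: the leader periodically renews a lease, and any replica may claim a lease whose deadline has passed. The sketch below shows that pattern with the clock passed in explicitly so it stays deterministic; the `Lease` class and holder names are illustrative, not the real Kubernetes API.

```python
# Sketch of lease-based leader election (the pattern Kubernetes
# control-plane components use): the holder renews periodically, and
# any node may claim an expired lease. Clock values are explicit
# arguments to keep the sketch deterministic; names are illustrative.

class Lease:
    def __init__(self, duration: float):
        self.duration = duration
        self.holder = None
        self.renewed_at = float("-inf")

    def try_acquire(self, node: str, now: float) -> bool:
        expired = now - self.renewed_at > self.duration
        if self.holder is None or self.holder == node or expired:
            self.holder = node       # claim, renew, or take over
            self.renewed_at = now
            return True
        return False

lease = Lease(duration=15.0)
assert lease.try_acquire("replica-1", now=0.0)      # first claim wins
assert not lease.try_acquire("replica-2", now=5.0)  # still held
assert lease.try_acquire("replica-1", now=10.0)     # holder renews
assert lease.try_acquire("replica-2", now=30.0)     # expired: takeover
assert lease.holder == "replica-2"
```

Note the reliance on reasonably synchronized clocks; production implementations pair leases with fencing tokens or observed-renewal checks to guard against a paused-but-alive old leader.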
8. Interview Questions
- Explain the concept of leader election and its importance in distributed systems.
- Describe different leader election algorithms (e.g., Raft, Paxos, Zab). What are their trade-offs?
- How does Raft ensure consistency and fault tolerance?
- What is a quorum and why is it important in leader election?
- How can you prevent split-brain scenarios in a distributed system?
- Explain the concept of fencing tokens and how they prevent stale writes.
- How would you design a leader election system using ZooKeeper?
- What are the performance implications of leader election? How can you optimize it?
- How does Kubernetes use etcd for leader election and distributed state management?
- Discuss the CAP theorem in the context of leader election.
- How do you handle network partitions in a distributed system with leader election?
- What are the differences between strong consistency and eventual consistency in the context of leader election?
- Design a system where a leader is elected and then assigns tasks to worker nodes. Consider fault tolerance and scalability.
- How do you monitor the health of the leader and followers in a leader election system?
- How do you handle a scenario where a leader is falsely detected as dead (false positive)?
- Compare and contrast the use of leader election in a database versus a message queue.
- Explain the concept of epochs/terms and how they help in preventing stale leaders from interfering.
- How does the choice of leader election algorithm affect the overall architecture and complexity of a distributed system?
- Explain the advantages and disadvantages of using a centralized coordination service like ZooKeeper for leader election versus a decentralized algorithm like Raft.
- How would you design a leader election system for a globally distributed application? What challenges would you need to address?