
12_Consistent_Hashing

Difficulty: Intermediate
Generated on: 2025-07-13 02:52:48
Category: System Design Cheatsheet


Consistent Hashing Cheatsheet (Intermediate)


Consistent hashing is a distributed hashing technique that minimizes the disruption caused by adding or removing servers (nodes) in a distributed system. Instead of remapping all keys when the number of servers changes, only a small fraction of keys — on average K/N, for K keys and N servers — needs to be remapped. This makes dynamic scaling possible without large-scale data movement, which is crucial for maintaining high availability in large-scale systems.

  • Hash Ring: Servers and data items (keys) are mapped to a circular hash space (e.g., 0 to 2^32 − 1, with the ends wrapping around to form a ring).
  • Hashing Function: A consistent hashing function maps both servers and keys to points on the hash ring. Common choices include MD5, SHA-1, or MurmurHash.
  • Placement: A key is assigned to the first server it encounters in a clockwise direction on the hash ring (successor).
  • Node Addition/Removal: When a server is added or removed, only the keys that were previously mapped to the affected server need to be remapped.
  • Virtual Nodes (vnodes): Each physical server is represented by multiple virtual nodes on the hash ring. This helps to distribute the load more evenly and reduce the impact of server failures.
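The concepts above can be sketched in a few lines of Python. This is an illustrative, non-production sketch — the class and method names are my own, and MD5 is chosen only because it spreads keys evenly, not for any security property:

```python
# Minimal consistent-hash ring with virtual nodes (illustrative sketch).
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map any string to a point on the ring (0 .. 2**128 - 1 with MD5).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes   # virtual nodes per physical server
        self._ring = []        # sorted vnode positions on the ring
        self._owner = {}       # vnode position -> physical server

    def add_node(self, node: str) -> None:
        # Place `vnodes` points for this server on the ring.
        for i in range(self.vnodes):
            pos = _hash(f"{node}#vn{i}")
            bisect.insort(self._ring, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        # Remove all of this server's vnode points; its keys fall
        # through to the next server clockwise.
        for i in range(self.vnodes):
            pos = _hash(f"{node}#vn{i}")
            self._ring.remove(pos)
            del self._owner[pos]

    def get_node(self, key: str) -> str:
        # First vnode clockwise from the key's position (wrap at the end).
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect_right(self._ring, _hash(key)) % len(self._ring)
        return self._owner[self._ring[idx]]
```

Usage: `ring = ConsistentHashRing(); ring.add_node("server-1"); ring.get_node("user:42")`. Note that removing a node moves only the keys that node owned — every other key keeps its server.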
Keys are hashed onto the ring, and each key is served by the first server clockwise from its position:

```mermaid
graph LR
A[Key 1] --> H1((Hash Function))
B[Key 2] --> H1
C[Key 3] --> H1
D[Key 4] --> H1
H1 --> K1[Hash Ring]
K1 -- Maps to --> S1(Server 1)
K1 -- Maps to --> S2(Server 2)
K1 -- Maps to --> S3(Server 3)
style K1 fill:#f9f,stroke:#333,stroke-width:2px
```
With virtual nodes, each physical server appears at several points on the ring, smoothing out the load:

```mermaid
graph LR
A[Key 1] --> H1((Hash Function))
B[Key 2] --> H1
C[Key 3] --> H1
D[Key 4] --> H1
H1 --> K1[Hash Ring]
K1 -- Maps to --> VN11((Virtual Node 1.1))
K1 -- Maps to --> VN12((Virtual Node 1.2))
K1 -- Maps to --> VN21((Virtual Node 2.1))
K1 -- Maps to --> VN22((Virtual Node 2.2))
K1 -- Maps to --> VN31((Virtual Node 3.1))
K1 -- Maps to --> VN32((Virtual Node 3.2))
VN11 --> S1(Server 1)
VN12 --> S1
VN21 --> S2(Server 2)
VN22 --> S2
VN31 --> S3(Server 3)
VN32 --> S3
style K1 fill:#f9f,stroke:#333,stroke-width:2px
```
Use Cases:

  • Distributed Caches: Spreading cached data across servers so that adding or removing a cache server invalidates only a small slice of the cache (Memcached clients commonly use the ketama consistent-hashing scheme; Redis Cluster uses fixed hash slots, a related approach).
  • Database Sharding: Distributing data across multiple database servers. Consistent hashing ensures that data is relatively evenly distributed and that data movement is minimized during scaling.
  • Content Delivery Networks (CDNs): Routing user requests to the closest or least loaded server.
  • Distributed Message Queues: Routing messages or assigning partitions to brokers with minimal reshuffling when brokers join or leave (e.g., RabbitMQ's consistent-hash exchange plugin; note that Kafka's default partitioner uses simple hash-mod, not consistent hashing).
  • Peer-to-Peer (P2P) Networks: Locating and distributing files across a network of peers.

When to Use:

  • Frequent node additions/removals are expected.
  • Minimizing data movement during scaling is crucial.
  • Even data distribution across nodes is desired.
  • High availability is a primary concern.

When to Avoid:

  • Very small datasets where the overhead outweighs the benefits.
  • Static environments with no expected changes in the number of servers.
  • When perfect data distribution is absolutely critical (consistent hashing provides relatively even distribution, but not perfect). Consider range partitioning if perfect distribution is needed.
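The "minimal data movement" claim is easy to check empirically. The sketch below (helper names are made up; MD5 is used for hashing) compares the fraction of keys that move when a fourth server joins, under naive modulo placement versus a consistent-hash ring:

```python
# Illustrative comparison: keys remapped when adding a 4th server under
# modulo hashing vs. consistent hashing.
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=100):
    # Sorted (position, owner) pairs for every virtual node, split into
    # parallel lists for bisect.
    ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    return [p for p, _ in ring], [owner for _, owner in ring]

def lookup(positions, owners, key):
    return owners[bisect.bisect_right(positions, h(key)) % len(positions)]

keys = [f"key-{i}" for i in range(2000)]

# Modulo hashing: placement = hash(key) % N, so changing N from 3 to 4
# reshuffles roughly 3/4 of all keys.
mod_moved = sum(h(k) % 3 != h(k) % 4 for k in keys) / len(keys)

# Consistent hashing: only keys landing on the new server's arcs move
# (roughly 1/N of them).
p3, o3 = build_ring(["s1", "s2", "s3"])
p4, o4 = build_ring(["s1", "s2", "s3", "s4"])
ch_moved = sum(lookup(p3, o3, k) != lookup(p4, o4, k) for k in keys) / len(keys)

print(f"modulo:     {mod_moved:.0%} of keys remapped")
print(f"consistent: {ch_moved:.0%} of keys remapped")
```

With modulo hashing the mapping of nearly every key depends on N, so most keys move; with the ring, the new server simply claims its arcs and everything else stays put.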
| Feature | Pros | Cons |
| --- | --- | --- |
| Scaling | Minimizes data movement during node additions/removals. | Uneven distribution is possible, especially with a small number of nodes. |
| Availability | Reduces the impact of node failures; only a small portion of the data becomes unavailable. | Requires a mechanism to handle node failures and data replication for true high availability. |
| Complexity | Relatively simple to implement compared to other sharding techniques. | More complex than simple modulo hashing. |
| Data Locality | Improves data locality by minimizing data movement. | Data locality is not guaranteed. |
| Hotspots | Can mitigate hotspots by using virtual nodes to distribute load more evenly. | Hotspots can still occur if the hashing function isn't well-suited for the data distribution. |
| Overhead | Low overhead for reads after initial setup. | Increased complexity for initial setup and node management. |
  • Scalability: Horizontally scalable by adding more servers to the hash ring. Virtual nodes enhance scalability by improving load distribution.
  • Performance:
    • Read Performance: Excellent read performance as data is located directly on the responsible server.
    • Write Performance: Write performance depends on the replication factor and consistency requirements.
    • Node Addition/Removal: Adding or removing a server requires remapping only a small percentage of keys, minimizing disruption. The performance impact of this remapping depends on the size of the data being remapped and the speed of the network.
  • Key Considerations:
    • Hashing Function: Choose a hashing function that distributes keys evenly across the hash ring.
    • Number of Virtual Nodes: Experiment to find the optimal number of virtual nodes for your workload. More virtual nodes improve distribution but increase metadata overhead.
    • Replication: Implement data replication to ensure data durability and availability in case of node failures.
    • Data Consistency: Implement appropriate consistency mechanisms (e.g., eventual consistency, quorum) to ensure data integrity.
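On the virtual-node count specifically, the trade-off is easy to measure. This illustrative sketch assigns 10,000 keys to 5 servers and reports the max/min load ratio for a few vnode counts (the exact numbers depend on the hash function and key names):

```python
# Illustrative experiment: effect of virtual-node count on load balance.
import bisect
import hashlib
from collections import Counter

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def assign(nodes, keys, vnodes):
    # Build the ring once, then count how many keys each server owns.
    ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    pos = [p for p, _ in ring]
    counts = Counter()
    for k in keys:
        counts[ring[bisect.bisect_right(pos, h(k)) % len(ring)][1]] += 1
    return counts

nodes = [f"server-{i}" for i in range(5)]
keys = [f"key-{i}" for i in range(10_000)]

ratios = {}
for v in (1, 10, 200):
    c = assign(nodes, keys, v)
    # max/min keys-per-server; 1.0 would be a perfectly even split.
    ratios[v] = max(c.values()) / max(min(c.values()), 1)
    print(f"{v:>3} vnodes -> max/min load ratio: {ratios[v]:.2f}")
```

With a single point per server the arc sizes are wildly uneven; with a few hundred vnodes per server the ratio approaches 1, at the cost of a larger ring to store and search.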
Real-World Examples:

  • Amazon DynamoDB: Uses consistent hashing with virtual nodes to distribute data across storage nodes.
  • Apache Cassandra: Uses consistent hashing to partition data across nodes in a cluster.
  • Akamai CDN: Employs consistent hashing for content distribution and load balancing.
  • Riak: A distributed NoSQL database that uses consistent hashing.
Common Interview Questions:

  • What is consistent hashing and why is it important?
  • How does consistent hashing work? Explain the concept of the hash ring and virtual nodes.
  • What are the advantages and disadvantages of consistent hashing?
  • How does consistent hashing compare to other sharding techniques, such as modulo hashing?
  • How do you handle node failures in a consistent hashing system?
  • How do you choose the number of virtual nodes in a consistent hashing system?
  • How does consistent hashing improve scalability and performance?
  • Describe a real-world use case where consistent hashing would be beneficial.
  • How do you ensure data consistency in a consistent hashing system?
  • How would you implement consistent hashing in a distributed cache?
  • Design a system for a distributed key-value store using consistent hashing. Consider aspects like data replication, consistency, and fault tolerance.