
12_Consistent_Hashing

Difficulty: Intermediate
Generated on: 2025-07-13 02:52:48
Category: System Design Cheatsheet


Consistent Hashing Cheatsheet (Intermediate)


Consistent hashing is a distributed hashing technique that minimizes the disruption caused by adding or removing servers (nodes) in a distributed system. Instead of remapping all keys when the number of servers changes, only a small fraction of keys — on average K/N, for K keys and N servers — needs to be remapped. This makes dynamic scaling possible without large-scale data movement, which is crucial for maintaining high availability in large-scale systems.

  • Hash Ring: Servers and data items (keys) are mapped to a circular hash space (e.g., 0 to 2^32 − 1, with the ends wrapping around to form a ring).
  • Hashing Function: A consistent hashing function maps both servers and keys to points on the hash ring. Common choices include MD5, SHA-1, or MurmurHash.
  • Placement: A key is assigned to the first server it encounters in a clockwise direction on the hash ring (successor).
  • Node Addition/Removal: When a server is added or removed, only the keys that were previously mapped to the affected server need to be remapped.
  • Virtual Nodes (vnodes): Each physical server is represented by multiple virtual nodes on the hash ring. This helps to distribute the load more evenly and reduce the impact of server failures.
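The concepts above can be sketched in a few lines of Python. This is an illustrative, non-production sketch — the class and method names are my own, and MD5 is chosen only because it spreads keys evenly, not for any security property:

```python
# Minimal consistent-hash ring with virtual nodes (illustrative sketch).
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map any string to a point on the ring (0 .. 2**128 - 1 with MD5).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes   # virtual nodes per physical server
        self._ring = []        # sorted vnode positions on the ring
        self._owner = {}       # vnode position -> physical server

    def add_node(self, node: str) -> None:
        # Place `vnodes` points for this server on the ring.
        for i in range(self.vnodes):
            pos = _hash(f"{node}#vn{i}")
            bisect.insort(self._ring, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        # Remove all of this server's vnode points; its keys fall
        # through to the next server clockwise.
        for i in range(self.vnodes):
            pos = _hash(f"{node}#vn{i}")
            self._ring.remove(pos)
            del self._owner[pos]

    def get_node(self, key: str) -> str:
        # First vnode clockwise from the key's position (wrap at the end).
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect_right(self._ring, _hash(key)) % len(self._ring)
        return self._owner[self._ring[idx]]
```

Usage: `ring = ConsistentHashRing(); ring.add_node("server-1"); ring.get_node("user:42")`. Note that removing a node moves only the keys that node owned — every other key keeps its server.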
Keys are hashed onto the ring, and each key is served by the first server clockwise from its position:

```mermaid
graph LR
A[Key 1] --> H1((Hash Function))
B[Key 2] --> H1
C[Key 3] --> H1
D[Key 4] --> H1
H1 --> K1[Hash Ring]
K1 -- Maps to --> S1(Server 1)
K1 -- Maps to --> S2(Server 2)
K1 -- Maps to --> S3(Server 3)
style K1 fill:#f9f,stroke:#333,stroke-width:2px
```
With virtual nodes, each physical server appears at several points on the ring, smoothing out the load:

```mermaid
graph LR
A[Key 1] --> H1((Hash Function))
B[Key 2] --> H1
C[Key 3] --> H1
D[Key 4] --> H1
H1 --> K1[Hash Ring]
K1 -- Maps to --> VN11((Virtual Node 1.1))
K1 -- Maps to --> VN12((Virtual Node 1.2))
K1 -- Maps to --> VN21((Virtual Node 2.1))
K1 -- Maps to --> VN22((Virtual Node 2.2))
K1 -- Maps to --> VN31((Virtual Node 3.1))
K1 -- Maps to --> VN32((Virtual Node 3.2))
VN11 --> S1(Server 1)
VN12 --> S1
VN21 --> S2(Server 2)
VN22 --> S2
VN31 --> S3(Server 3)
VN32 --> S3
style K1 fill:#f9f,stroke:#333,stroke-width:2px
```
Use Cases:

  • Distributed Caches: Spreading cached data across servers so that adding or removing a cache server invalidates only a small slice of the cache (Memcached clients commonly use the ketama consistent-hashing scheme; Redis Cluster uses fixed hash slots, a related approach).
  • Database Sharding: Distributing data across multiple database servers. Consistent hashing ensures that data is relatively evenly distributed and that data movement is minimized during scaling.
  • Content Delivery Networks (CDNs): Routing user requests to the closest or least loaded server.
  • Distributed Message Queues: Routing messages or assigning partitions to brokers with minimal reshuffling when brokers join or leave (e.g., RabbitMQ's consistent-hash exchange plugin; note that Kafka's default partitioner uses simple hash-mod, not consistent hashing).
  • Peer-to-Peer (P2P) Networks: Locating and distributing files across a network of peers.

When to Use:

  • Frequent node additions/removals are expected.
  • Minimizing data movement during scaling is crucial.
  • Even data distribution across nodes is desired.
  • High availability is a primary concern.

When to Avoid:

  • Very small datasets where the overhead outweighs the benefits.
  • Static environments with no expected changes in the number of servers.
  • When perfect data distribution is absolutely critical (consistent hashing provides relatively even distribution, but not perfect). Consider range partitioning if perfect distribution is needed.
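The "minimal data movement" claim is easy to check empirically. The sketch below (helper names are made up; MD5 is used for hashing) compares the fraction of keys that move when a fourth server joins, under naive modulo placement versus a consistent-hash ring:

```python
# Illustrative comparison: keys remapped when adding a 4th server under
# modulo hashing vs. consistent hashing.
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=100):
    # Sorted (position, owner) pairs for every virtual node, split into
    # parallel lists for bisect.
    ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    return [p for p, _ in ring], [owner for _, owner in ring]

def lookup(positions, owners, key):
    return owners[bisect.bisect_right(positions, h(key)) % len(positions)]

keys = [f"key-{i}" for i in range(2000)]

# Modulo hashing: placement = hash(key) % N, so changing N from 3 to 4
# reshuffles roughly 3/4 of all keys.
mod_moved = sum(h(k) % 3 != h(k) % 4 for k in keys) / len(keys)

# Consistent hashing: only keys landing on the new server's arcs move
# (roughly 1/N of them).
p3, o3 = build_ring(["s1", "s2", "s3"])
p4, o4 = build_ring(["s1", "s2", "s3", "s4"])
ch_moved = sum(lookup(p3, o3, k) != lookup(p4, o4, k) for k in keys) / len(keys)

print(f"modulo:     {mod_moved:.0%} of keys remapped")
print(f"consistent: {ch_moved:.0%} of keys remapped")
```

With modulo hashing the mapping of nearly every key depends on N, so most keys move; with the ring, the new server simply claims its arcs and everything else stays put.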
| Feature | Pros | Cons |
| --- | --- | --- |
| Scaling | Minimizes data movement during node additions/removals. | Uneven distribution is possible, especially with a small number of nodes. |
| Availability | Reduces the impact of node failures; only a small portion of the data becomes unavailable. | Requires a mechanism to handle node failures and data replication for true high availability. |
| Complexity | Relatively simple to implement compared to other sharding techniques. | More complex than simple modulo hashing. |
| Data Locality | Improves data locality by minimizing data movement. | Data locality is not guaranteed. |
| Hotspots | Can mitigate hotspots by using virtual nodes to distribute load more evenly. | Hotspots can still occur if the hashing function isn't well-suited for the data distribution. |
| Overhead | Low overhead for reads after initial setup. | Increased complexity for initial setup and node management. |
  • Scalability: Horizontally scalable by adding more servers to the hash ring. Virtual nodes enhance scalability by improving load distribution.
  • Performance:
    • Read Performance: Excellent read performance as data is located directly on the responsible server.
    • Write Performance: Write performance depends on the replication factor and consistency requirements.
    • Node Addition/Removal: Adding or removing a server requires remapping only a small percentage of keys, minimizing disruption. The performance impact of this remapping depends on the size of the data being remapped and the speed of the network.
  • Key Considerations:
    • Hashing Function: Choose a hashing function that distributes keys evenly across the hash ring.
    • Number of Virtual Nodes: Experiment to find the optimal number of virtual nodes for your workload. More virtual nodes improve distribution but increase metadata overhead.
    • Replication: Implement data replication to ensure data durability and availability in case of node failures.
    • Data Consistency: Implement appropriate consistency mechanisms (e.g., eventual consistency, quorum) to ensure data integrity.
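On the virtual-node count specifically, the trade-off is easy to measure. This illustrative sketch assigns 10,000 keys to 5 servers and reports the max/min load ratio for a few vnode counts (the exact numbers depend on the hash function and key names):

```python
# Illustrative experiment: effect of virtual-node count on load balance.
import bisect
import hashlib
from collections import Counter

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def assign(nodes, keys, vnodes):
    # Build the ring once, then count how many keys each server owns.
    ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    pos = [p for p, _ in ring]
    counts = Counter()
    for k in keys:
        counts[ring[bisect.bisect_right(pos, h(k)) % len(ring)][1]] += 1
    return counts

nodes = [f"server-{i}" for i in range(5)]
keys = [f"key-{i}" for i in range(10_000)]

ratios = {}
for v in (1, 10, 200):
    c = assign(nodes, keys, v)
    # max/min keys-per-server; 1.0 would be a perfectly even split.
    ratios[v] = max(c.values()) / max(min(c.values()), 1)
    print(f"{v:>3} vnodes -> max/min load ratio: {ratios[v]:.2f}")
```

With a single point per server the arc sizes are wildly uneven; with a few hundred vnodes per server the ratio approaches 1, at the cost of a larger ring to store and search.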
Real-World Examples:

  • Amazon DynamoDB: Uses consistent hashing with virtual nodes to distribute data across storage nodes.
  • Apache Cassandra: Uses consistent hashing to partition data across nodes in a cluster.
  • Akamai CDN: Employs consistent hashing for content distribution and load balancing.
  • Riak: A distributed NoSQL database that uses consistent hashing.
Common Interview Questions:

  • What is consistent hashing and why is it important?
  • How does consistent hashing work? Explain the concept of the hash ring and virtual nodes.
  • What are the advantages and disadvantages of consistent hashing?
  • How does consistent hashing compare to other sharding techniques, such as modulo hashing?
  • How do you handle node failures in a consistent hashing system?
  • How do you choose the number of virtual nodes in a consistent hashing system?
  • How does consistent hashing improve scalability and performance?
  • Describe a real-world use case where consistent hashing would be beneficial.
  • How do you ensure data consistency in a consistent hashing system?
  • How would you implement consistent hashing in a distributed cache?
  • Design a system for a distributed key-value store using consistent hashing. Consider aspects like data replication, consistency, and fault tolerance.