Consistent Hashing Explained

To achieve horizontal scaling, it is important to distribute data evenly across servers. A common technique to achieve this is the consistent hashing. But before discussing the problem, first understand it in-depth.

Rehashing Problem

You have n database servers and you want to distribute the data evenly among these database servers. You are using this approach to map server with the key:

server_index = hash(key) % n, where n is the number of servers

For example we have 4 database servers and the 8 string keys:

Now, to store data onto the database server or to fetch data from them we use the above mapping where key0 is mapped to the server0, key1 is mapped to server1 and so on.

The problem is that this approach only works when there are fixed number of the database servers in the system adding or removing a server from it will cause lot of troubles.

For example, due to some issue the server2 goes offline so the server pool size become 3 and the modulo calculation become hash % 3 .

New distribution of the keys become:

As you can see that when server2 goes offline almost all of the keys are redistributed. This means that now most of the client will connect to the wrong database to access there data, which can cause serious issues. Consistent hashing is an effective technique to mitigate this problem.

Consistent Hashing

Quoted from Wikipedia: Consistent hashing is a special kind of hashing technique such that when a hash table is resized, only n/m keys need to be remapped on average where n is the number of keys and m is the number of slots. Consistent hashing evenly distributes keys across shards, even if some of the shards crash or become unavailable. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.

Hash Space and Hash Ring

Now that we understand what consistent hashing is, let’s see how it works internally.

Assume we use SHA-1 as our hash function f. The output of this function is a number between:

0 to 2^160 - 1

This entire range of possible output values is called the hash space.

Instead of treating this range as a straight line, consistent hashing connects both ends together.

In other words:

0 and 2¹⁶⁰ − 1 are considered adjacent
The hash space is visualized as a circle

When we connect the start and end of the number line, it becomes a ring. This circular structure is called the hash ring.

Hash Servers and Keys

Using the same SHA hash function we mapped the servers and keys onto the hash ring.

One important thing is that the hash function used here is different from the rehashing problem and we are not using the modulo operation here. Using SHA hash function whichever numeric value we get we map it to the hash ring.

Server Lookup

To assign a server to the key we move in the clockwise direction and keep moving until server is found so k0 is stored on server s1, k1 is stored on server s2 and so on.

Adding a Server

After adding a new server s4 only the key k1 need to redistributed. k0 and k2 remains on the same server.

Removing a server

After removing the server s3 only k2 key is redistributed. k0 and k1 remains on the same server.

But, two issues are identified in the above approach:

The partition between two servers can be non-uniform considering that a server can be added or removed(either it would be fairly large or small). The partition is the hash space between adjacents servers. As you can see, if s3 is removed, s4 partition becomes twice of s1 and s2 partitions.
It is also possible to have non-uniform distribution of the keys. In Figure, server s1 get the most of keys while server s2 and s3 has no data.

A technique called virtual nodes or replicas is used to solve these problems.

Virtual Nodes

A virtual node refers to the real node, and each server is represented by multiple virtual nodes on the ring. Both server 0 and server 1 have 3 virtual nodes. The 3 is arbitrarily number; and in real-world systems, the number of virtual nodes is much larger. Instead of using s0, we have s0_0, s0_1, and s0_2 to represent server 0 on the ring. Similarly, s1_0, s1_1, and s1_2 represent server 1 on the ring. With virtual nodes, each server is responsible for multiple partitions. Partitions (edges) with label s0 are managed by server 0. On the other hand, partitions with label s1 are managed by server 1. To find which server a key is stored on, we go clockwise from the key’s location and find the first virtual node encountered on the ring.

As the number of virtual nodes increases, the distribution of keys becomes more balanced in the system.

Conclusion

Consistent hashing is a powerful technique that enables distributed systems to scale efficiently without causing massive data reshuffling. Unlike traditional modulo-based hashing, which redistributes almost all keys when the number of servers changes, consistent hashing ensures that only a small portion of data is reassigned when a server is added or removed.

Reference Materials

https://bytebytego.com/courses/system-design-interview/design-consistent-hashing

Consistent Hashing

Rehashing Problem