How to avoid HBase Hotspotting?

  • Post last modified:February 27, 2018
  • Post category:BigData

HBase hotspotting occurs when a large amount of client traffic is directed at a single node, or at only a few nodes, in the cluster. It is usually caused by poor row key design. In this article, we will see how to avoid HBase hotspotting, also called region server hotspotting.

How Does HBase Hotspotting Occur?

HBase hotspotting occurs because of a poorly designed row key. HBase stores rows sorted lexicographically by row key, so with a bad row key a large share of the data lands in a single region on a single node. When clients request that data, all of the traffic goes to that node while the other nodes sit idle.

This traffic may represent reads, writes, or other operations. All of it goes to the single machine hosting the region that contains the required data. This overload degrades performance and can sometimes make the region unavailable.

Your schema should be designed so that data is distributed evenly across all the regions on all the nodes in the cluster.

How to avoid HBase Hotspotting?

So the question is: how do you avoid HBase hotspotting?

The answer lies in your schema and row key design. Design your row key so that the data being written goes to multiple regions across the cluster.

Several techniques can be used to avoid hotspotting, each with its own pros and cons:

Salting

Salting means adding a randomly assigned prefix to the start of the row key. The number of distinct prefixes depends on the number of regions you want to spread the data across.

Salting is helpful when you have a small, fixed set of row key patterns that come up over and over again.

For example, suppose you have the following four row key values:

machine0001
machine0002
machine0003
machine0004

If you would like to write these across four different regions, you can use the four letters a, b, c and d as prefixes. The updated values would be:

a-machine0001
b-machine0002
c-machine0003
d-machine0004

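The problem with salting is that the prefix is random. If you add details for one more machine, salting assigns it one of the four prefixes at random, so the row ends up in any one of the four regions. A reader cannot tell from the original key alone which prefix was used, so reads take more work.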
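The salting example can be sketched in a few lines of Python. This is only an illustration of the idea, not HBase client code: the names `SALT_PREFIXES`, `salt_key` and `candidate_keys` are hypothetical, and the four prefixes are assumed to correspond to four target regions.

```python
import random

# Hypothetical salt values, one per region we want to spread writes across.
SALT_PREFIXES = ["a", "b", "c", "d"]


def salt_key(row_key, rng=random):
    """Return the row key with a randomly chosen salt prefix in front."""
    return f"{rng.choice(SALT_PREFIXES)}-{row_key}"


def candidate_keys(row_key):
    """Return every salted form a key might have been stored under.

    Because the salt is random, a reader has to try all of these.
    """
    return [f"{prefix}-{row_key}" for prefix in SALT_PREFIXES]


if __name__ == "__main__":
    print(salt_key("machine0005"))        # one of the four salted forms
    print(candidate_keys("machine0005"))  # all four forms a reader must check
```

The `candidate_keys` helper shows the read-side cost: a lookup for one logical key turns into one lookup per salt value.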

Hashing

Hashing uses a hash function, instead of a random value, to choose the prefix.

With a one-way hash function, a given row is always “salted” with the same prefix. This still spreads the load across the region servers, and it lets a reader recompute the prefix and fetch the row directly.
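A minimal Python sketch of a hashed prefix follows. The choice of MD5, the bucket count of four, and the names `NUM_BUCKETS` and `hashed_key` are assumptions made for illustration; any stable hash function and any bucket count that matches your regions would work the same way.

```python
import hashlib

# Hypothetical number of prefixes (buckets) to spread rows across.
NUM_BUCKETS = 4


def hashed_key(row_key, num_buckets=NUM_BUCKETS):
    """Prefix the row key with a bucket number derived from its hash.

    The same row key always maps to the same bucket, so a reader can
    recompute the full key without trying every prefix.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return f"{bucket:02d}-{row_key}"


if __name__ == "__main__":
    for key in ["machine0001", "machine0002", "machine0003", "machine0004"]:
        print(hashed_key(key))
```

Unlike the random salt, calling `hashed_key` twice with the same input returns the same stored key, which is what makes direct reads possible.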

Reversing the Key

A third common technique for preventing hotspotting is to reverse a fixed-width or numeric row key so that the part that changes most often comes first. This effectively randomizes the row keys, but it sacrifices row ordering.
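As a small illustration, here is key reversal in Python, assuming a fixed-width, zero-padded numeric key. The function name `reverse_key` and the sample IDs are hypothetical.

```python
def reverse_key(row_key):
    """Reverse a fixed-width key so its fastest-changing digits come first."""
    return row_key[::-1]


if __name__ == "__main__":
    # Consecutive IDs share a long common prefix and would sort next to
    # each other, landing in the same region.
    for key in ["0000001234", "0000001235", "0000001236"]:
        print(key, "->", reverse_key(key))
    # After reversing, the keys start with 4, 5 and 6, so they sort far apart.
```

Reversing is its own inverse, so applying `reverse_key` to a stored key recovers the original value.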
