HBase Auto Sharding Concept and Explanation

  • Post last modified: March 2, 2018
  • Post category: BigData

HBase is a distributed storage manager that runs on top of Hadoop HDFS. It provides low-latency random reads and writes and is designed to scale to petabytes of data.

One of the interesting capabilities of HBase is auto sharding: when tables become too large, the system dynamically splits them and distributes the pieces across different region servers. In other words, splitting and serving regions can be thought of as the HBase equivalent of the auto sharding offered by other systems.

Regions and Region Servers

In HBase, the basic unit of scalability and load balancing is the region. Regions are contiguous ranges of rows stored together. HBase splits regions dynamically when they become too large. Regions may also be merged to reduce their number and the number of storage files required.

You can also pre-split an HBase table manually when you create it.
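As a rough illustration of what choosing pre-split points can look like, the sketch below computes evenly spaced split keys. It assumes row keys are fixed-width, uniformly distributed hexadecimal strings, which is an assumption of this example and not something HBase requires. The function name `hex_split_points` is hypothetical and not part of any HBase API. The resulting keys are the kind of values you would hand to table creation as split points, for example through the `SPLITS` option of the HBase shell `create` command.

```python
def hex_split_points(num_regions, width=8):
    """Return num_regions - 1 split keys that divide a hex key space evenly.

    Assumes row keys are lowercase hex strings of exactly `width` characters.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    space = 16 ** width  # total number of possible keys of this width
    return [
        format(space * i // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]


# Four regions need three split points.
print(hex_split_points(4))
```

If your real row keys are not uniformly distributed, evenly spaced split points will produce unevenly loaded regions, so the split keys should follow the actual key distribution.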

Each region is served by exactly one region server, and each region server can serve many regions at any given time.

Initially, when you create an HBase table, it has only one region. As you add more rows and the region becomes too large, it is split into two at the middle key, creating two roughly equal halves.
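The split behavior described above can be sketched with a small toy model. This is not HBase code: the `Region` and `Table` classes are hypothetical, and the split is triggered by a row-count threshold purely to keep the example short, whereas HBase decides based on the size of a region's store files.

```python
from bisect import insort


class Region:
    """Toy region: a contiguous, sorted range of row keys."""

    def __init__(self, start_key, end_key, rows=None):
        self.start_key = start_key  # inclusive; "" means start of table
        self.end_key = end_key      # exclusive; "" means end of table
        self.rows = sorted(rows or [])

    def contains(self, row_key):
        return row_key >= self.start_key and (
            self.end_key == "" or row_key < self.end_key
        )

    def split(self):
        """Split at the middle key into two roughly equal halves."""
        mid = self.rows[len(self.rows) // 2]
        left = Region(self.start_key, mid, [r for r in self.rows if r < mid])
        right = Region(mid, self.end_key, [r for r in self.rows if r >= mid])
        return left, right


class Table:
    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region  # assumed threshold, in rows
        self.regions = [Region("", "")]      # a new table has one region

    def put(self, row_key):
        # Find the single region whose key range covers this row key.
        index = next(
            i for i, r in enumerate(self.regions) if r.contains(row_key)
        )
        region = self.regions[index]
        if row_key not in region.rows:
            insort(region.rows, row_key)
        # Replace an oversized region with its two halves, keeping order.
        if len(region.rows) > self.max_rows:
            self.regions[index:index + 1] = region.split()


table = Table(max_rows_per_region=4)
for n in range(10):
    table.put(f"row-{n:02d}")

for r in table.regions:
    print(repr(r.start_key), repr(r.end_key), r.rows)
```

Because the keys in this example arrive in increasing order, only the last region keeps growing and splitting, which is also why monotonically increasing row keys tend to concentrate writes on one region.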


The diagram below depicts how rows are split into regions.

[Figure: HBase Auto Sharding Concept and Explanation]

Identify Row Key and Region Server

The client does not have to contact the master to put or get rows from a table. Instead, it queries the META table to identify the region server responsible for the region whose key range contains the row key.

The mapping of regions to region servers is kept in an HBase system table called META (named hbase:meta in current releases). The META table records which region is responsible for a given row key. For any read or write operation, the client goes directly to the region server; the HMaster is not involved. Region servers are responsible for serving the requested data to client applications.
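The lookup a client performs can be sketched as a search over region start keys. The example below is a simplified model, not the HBase client: the `META` list, the host names, and the `locate_region_server` function are all hypothetical, and a real META row carries more information than a start key and a server name.

```python
from bisect import bisect_right

# Hypothetical META contents: (region start key, region server),
# sorted by start key. "" marks the first region of the table.
META = [
    ("", "rs1.example.com"),
    ("row-02", "rs2.example.com"),
    ("row-04", "rs1.example.com"),
    ("row-06", "rs3.example.com"),
]


def locate_region_server(meta, row_key):
    """Return the server of the region whose key range contains row_key.

    The owning region is the one with the greatest start key that is
    less than or equal to the row key.
    """
    start_keys = [start for start, _ in meta]
    index = bisect_right(start_keys, row_key) - 1
    return meta[index][1]


print(locate_region_server(META, "row-03"))
print(locate_region_server(META, "row-09"))
```

Note that `rs1.example.com` appears twice in the sample data: a region server can hold many regions, while each region maps to exactly one server.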
