Apache HBase distributes its load through region splitting. HBase stores rows in tables, and each table is split into ‘regions’. Those regions are distributed across the cluster and are hosted and made available to client processes by the RegionServer processes. Within a table, rows are sorted by row key, and each region holds the rows between its start key and its end key. Every row belongs to exactly one region, and a region is served by a single region server at any given point in time. In this article, we will look at splitting HBase tables, with examples and best practices.
HBase Table Regions
Regions are the physical mechanism used to distribute the write and query load across region servers in HBase. A table in HBase consists of one or more regions, each assigned to a region server. When a table is created, HBase by default allocates a single region to it. Thus, the initial load of an HBase table does not use the full capacity of the cluster, because all writes go to one region server until that region splits.
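To make the region model concrete, here is a minimal Python sketch of how a sorted list of split points maps each row key to exactly one region. It is an illustration only, not HBase code: the split points and the `region_index` helper are hypothetical, and byte strings are used because HBase compares row keys as bytes.

```python
from bisect import bisect_right

# Hypothetical split points. Three split points define four regions:
# [start, b"g"), [b"g", b"n"), [b"n", b"t"), [b"t", end)
SPLIT_POINTS = [b"g", b"n", b"t"]


def region_index(row_key, split_points=SPLIT_POINTS):
    """Return the index of the region whose key range contains row_key."""
    # A region's start key is inclusive and its end key is exclusive,
    # so a key equal to a split point falls into the region on its right.
    return bisect_right(split_points, row_key)


for key in [b"apple", b"g", b"mango", b"zebra"]:
    print(key.decode(), "-> region", region_index(key))
```

Because the split points are sorted and the ranges do not overlap, every key resolves to a single region, which mirrors the "one row, one region" rule described above.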
Pre-splitting HBase Tables
As mentioned in the previous section, HBase allocates only one region to a new table by default, because it does not know how to split the table into multiple regions. With pre-splitting, you can create an HBase table with many regions by supplying the split points at table creation time.
However, pre-splitting carries a risk: if the split points do not match the actual distribution of row keys, data skew leaves some regions heavily loaded while others stay nearly empty. You should always know the key distribution before choosing split points.
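The effect of data skew can be sketched in a few lines of Python. The example below counts how many rows land in each region for a fixed set of split points, first with evenly spread keys and then with keys that all share one prefix. The split points, the key formats, and the `rows_per_region` helper are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter


def rows_per_region(row_keys, split_points):
    """Count how many row keys fall into each region defined by split_points."""
    counts = Counter(bisect_right(split_points, key) for key in row_keys)
    return [counts.get(i, 0) for i in range(len(split_points) + 1)]


# Hypothetical split points chosen for two-digit keys "00".."99".
split_points = ["25", "50", "75"]

# Keys spread evenly over the key space.
uniform_keys = [f"{i:02d}" for i in range(100)]

# Skewed keys: every key starts with "1", so all of them sort before "25".
skewed_keys = [f"1{i:03d}" for i in range(100)]

print(rows_per_region(uniform_keys, split_points))  # [25, 25, 25, 25]
print(rows_per_region(skewed_keys, split_points))   # [100, 0, 0, 0]
```

With the skewed keys, the table still has four regions, but only the first one receives any data, so the pre-split brings no benefit.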
Calculating Split Point for Tables
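You can use the RegionSplitter utility to calculate the split points for a table. RegionSplitter generates the split points using either the HexStringSplit or the UniformSplit algorithm. HexStringSplit suits row keys that are hexadecimal strings, while UniformSplit suits row keys that are roughly uniform raw bytes.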
For example, create table ‘table1’ with 5 regions:
https://gist.github.com/d88578b5c9145efac4b339cf32fa2c61.js
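To show what evenly spaced hexadecimal split points look like, here is a small Python sketch that divides the key space from `00000000` to `FFFFFFFF` into equal parts. It only approximates the idea behind HexStringSplit under those assumed bounds and is not the RegionSplitter implementation; `hex_split_points` is a hypothetical helper.

```python
def hex_split_points(num_regions, first=0x00000000, last=0xFFFFFFFF, width=8):
    """Return num_regions - 1 evenly spaced, zero-padded hex split points."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    if last <= first:
        raise ValueError("last must be greater than first")
    # Size of each region's slice of the key space (integer division).
    step = (last - first + 1) // num_regions
    # N regions need N - 1 boundaries between them.
    return [format(first + step * i, f"0{width}x") for i in range(1, num_regions)]


print(hex_split_points(5))
# ['33333333', '66666666', '99999999', 'cccccccc']
```

Five regions need four split points, which is why the list has four entries. Split points like these only balance the load when the row keys themselves are spread evenly over the hexadecimal range, for example hashed keys.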
Pre-splitting HBase Tables Examples
If you know the split points, you can pass them to the HBase shell create command to pre-split the table. Below is an example of pre-splitting an HBase table:
https://gist.github.com/cefcf8a2f643ad9737c921d3a5f4c088.js
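When the split points come from a script rather than being typed by hand, it can help to generate the shell statement programmatically. The Python sketch below renders an HBase shell `create` statement that uses the `SPLITS` option. The table name, the column family `cf1`, the split points, and the `create_statement` helper are all placeholders chosen for illustration, and the ordering check is a choice made in this sketch rather than a documented HBase requirement.

```python
def create_statement(table, family, split_points):
    """Render an HBase shell create statement with explicit split points."""
    points = list(split_points)
    # This sketch insists on unique, ascending split points so the
    # generated statement reads in the same order as the resulting regions.
    if points != sorted(set(points)):
        raise ValueError("split points must be unique and in ascending order")
    quoted = ", ".join(f"'{p}'" for p in points)
    return f"create '{table}', '{family}', SPLITS => [{quoted}]"


print(create_statement("table1", "cf1", ["g", "n", "t"]))
# create 'table1', 'cf1', SPLITS => ['g', 'n', 't']
```

The printed line can be pasted into the HBase shell; three split points produce a table with four regions.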