Apache HBase Writing Data Best Practices

For writing data into HBase, you use the put methods of the HTableInterface class from the Java API, or you use the HBase shell put command. When you issue an HBase shell Put command, the coordinates of the data are the row, the column, and the timestamp. The timestamp is unique per version of a cell; it can be generated automatically or specified programmatically by your application, and it must be a long integer.

In this article, we will check Apache HBase writing data best practices that help you tune HBase write performance.

Apache HBase Writing Data Best Practices

Below are some of the best practices for writing data into Hadoop HBase:

  • Bulk Loading
  • Number of Regions per HBase Table
  • HBase Row Key design
  • Number of Column Families
  • MapReduce: Skip Reducer
  • HBase Client: AutoFlush
  • Splitting
  • Durability

Bulk Loading

Use the bulk loading tools, such as ImportTsv and completebulkload, whenever possible. Bulk loading generates HFiles and loads them directly into HBase tables, bypassing the normal write path, so it gets data from flat files into HBase as fast as possible.
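
Below is a rough sketch of what a programmatic bulk load looks like with the Java API: the driver configures a MapReduce job to write HFiles with HFileOutputFormat2 and then hands them to LoadIncrementalHFiles. The mapper class MyHFileMapper and the table name hbase_test are assumptions chosen for illustration, so treat this as an outline of the moving parts rather than a drop-in tool.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName name = TableName.valueOf("hbase_test");          // assumed table name
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(name);
         RegionLocator locator = conn.getRegionLocator(name);
         Admin admin = conn.getAdmin()) {

      Job job = Job.getInstance(conf, "hbase-bulk-load");
      job.setJarByClass(BulkLoadDriver.class);
      job.setMapperClass(MyHFileMapper.class);                  // hypothetical mapper emitting (ImmutableBytesWritable, Put)
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));     // flat input files
      FileOutputFormat.setOutputPath(job, new Path(args[1]));   // staging directory for the generated HFiles

      // Sets up partitioning and sorting so the HFiles line up with the table's regions.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

      if (job.waitForCompletion(true)) {
        // Moves the finished HFiles directly into the regions, bypassing the write path.
        new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), admin, table, locator);
      }
    }
  }
}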

Number of Regions per HBase Table

HBase regions are where HBase data is kept, in the form of HFiles. When you create an HBase table, you can either define the number of regions explicitly or let HBase start the table with a single region and split it later as data grows. For better write performance, it is suggested that you define the number of regions explicitly.

Defining the number of regions explicitly spreads the initial write load across several RegionServers and helps avoid the issue of hotspotting.
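
Below is a minimal sketch of pre-splitting a table from the Java Admin API, assuming hex-style row keys and an illustrative table and column family name; the createTable overload used here takes a start key, an end key, and the number of regions to create.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("hbase_test"));
      desc.addFamily(new HColumnDescriptor("cf"));
      // Create 8 regions spread evenly between the given start and end row keys,
      // instead of letting the table start with a single region.
      admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("ffffffff"), 8);
    }
  }
}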

Number of Column Families

The most common approach in HBase table schema design is to have a single column family and keep all the columns in that one column family. However, if more column families are required, there should be no more than 10 column families per HBase table.

MapReduce: Skip Reducer

When writing a lot of data to an HBase table from a MapReduce job, and specifically when HBase Puts are emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted and shuffled to other Reducers that will most likely be off-node. It is far more efficient to write directly to the HBase table from the map tasks.
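
Below is a minimal sketch of a map-only driver, assuming a hypothetical mapper class MyPutMapper that emits (row key, Put) pairs; setting the number of reduce tasks to zero removes the spill, sort, and shuffle entirely and lets TableOutputFormat write each Put straight to the table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MapOnlyHBaseWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "map-only-hbase-write");
    job.setJarByClass(MapOnlyHBaseWrite.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Hypothetical mapper that parses each input line and emits
    // (ImmutableBytesWritable rowKey, Put) pairs via context.write(...).
    job.setMapperClass(MyPutMapper.class);

    // Wires TableOutputFormat to the target table; passing null means no Reducer class.
    TableMapReduceUtil.initTableReducerJob("hbase_test", null, job);

    // Skip the Reducer entirely: no spill, sort, or shuffle, mappers write straight to HBase.
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}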

HBase Client: AutoFlush

When performing a lot of Puts, make sure that setAutoFlush is set to false on your HTable instance. Otherwise, the Puts will be sent one at a time to the RegionServer. With autoflush disabled, Puts are buffered on the client side, so call flushCommits() (or close the table) when you are done writing to make sure the remaining buffered Puts are sent.
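
Below is a minimal sketch using the classic HTable client API this section refers to (newer client versions buffer writes through BufferedMutator instead); the table, column family, and row key format are illustrative.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedPuts {
  public static void main(String[] args) throws Exception {
    try (HTable table = new HTable(HBaseConfiguration.create(), TableName.valueOf("hbase_test"))) {
      table.setAutoFlush(false);          // buffer Puts on the client instead of one RPC per Put
      for (int i = 0; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes(String.format("%08x", i)));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
        table.put(put);                   // queued in the client-side write buffer
      }
      table.flushCommits();               // send whatever is still buffered
    }
  }
}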

Splitting

Splitting is another way of improving write performance in Apache HBase. To define split points manually, you must know your data well. If you do not, you can split using a default split algorithm provided by HBase called "HexStringSplit". HexStringSplit computes evenly spaced region boundaries for the number of regions you request, assuming row keys that look like hexadecimal strings (for example, hashed keys).

Splitting may still cause hotspotting if the split boundaries do not match your actual row key distribution, so use a splitting mechanism that fits your keys.
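
If you prefer the Java API over the shell's SPLITALGO option, RegionSplitter's HexStringSplit can generate the split boundaries for you. Below is a small sketch with an illustrative table and column family name.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.RegionSplitter;

public class HexSplitTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("hbase_hex_test"));
      desc.addFamily(new HColumnDescriptor("cf"));
      // HexStringSplit produces evenly spaced hexadecimal boundaries for the requested region count.
      byte[][] splits = new RegionSplitter.HexStringSplit().split(8);
      admin.createTable(desc, splits);
    }
  }
}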

Durability

HBase has the concept of the Write Ahead Log (WAL). You can set the WAL durability to asynchronous mode to improve HBase write performance, at the cost of possibly losing the most recent edits if a RegionServer fails.

Before any change is committed to the StoreFiles, it is first recorded in the WAL and then written to the MemStore; MemStores are flushed to StoreFiles later. The WAL makes sure that acknowledged writes are not lost: if a RegionServer crashes before its MemStores are flushed, the WAL is replayed to recover the writes. With ASYNC_WAL, the WAL entries are written asynchronously, so a crash can lose edits that have not yet been synced.
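
Durability can be set table-wide (as the shell example below does with DURABILITY => 'ASYNC_WAL') or per mutation from the Java client. Below is a minimal sketch of the per-Put form, with an illustrative table, column family, and row key.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AsyncWalPut {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("hbase_test"))) {
      Put put = new Put(Bytes.toBytes("row-0001"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
      // The WAL entry for this Put is written asynchronously: faster, but the
      // latest edits can be lost if the RegionServer crashes before the sync.
      put.setDurability(Durability.ASYNC_WAL);
      table.put(put);
    }
  }
}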

Below is an example of creating an HBase table that applies the above recommendations:

create 'hbase_test', {NAME => 'cf', COMPRESSION => 'SNAPPY'}, {NUMREGIONS => 8, SPLITALGO => 'HexStringSplit', DURABILITY => 'ASYNC_WAL'}