Apache HBase Bulk Load CSV and Examples

Apache HBase picks up where Hadoop HDFS leaves off: it provides random, real-time read/write access to big data. If you have flat files such as CSV and TSV, you can use the Apache HBase bulk load features to get the data into HBase tables.

Apache HBase Bulk Load CSV

In this post, I will show you how to import data into HBase from CSV and TSV files. We will not dig into any transformations; we will simply import data into an already existing HBase table. A minimal sketch of the prerequisite setup follows.
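The examples below assume a table named personal with a single column family personal_data, and a CSV file personal.csv already copied to the HDFS directory /test. Here is a minimal sketch of that setup; the local path /tmp/personal.csv is only an assumption for illustration:

# Create the target table with one column family (run inside the HBase shell)
hbase(main):001:0> create 'personal', 'personal_data'

# Copy the local CSV file into HDFS so ImportTsv can read it
$ hdfs dfs -mkdir -p /test
$ hdfs dfs -put /tmp/personal.csv /test/personal.csv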

HBase Importtsv utility

Importtsv is a utility that loads data in TSV or CSV format into HBase; the generic form of each invocation is sketched after the list below.

Importtsv has two distinct usages:

  • Loading data from TSV or CSV format in HDFS into HBase via Puts.
  • Preparing StoreFiles to be loaded via completebulkload.
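As a hedged sketch of the two modes, the generic invocations look roughly like this; the table name, input directory, output directory, and column names are placeholders:

# Mode 1: load data directly into the table via Puts
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 <tablename> <hdfs-inputdir>

# Mode 2: write StoreFiles to an output directory for completebulkload instead of writing to the table
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 -Dimporttsv.bulk.output=<hdfs-outputdir> <tablename> <hdfs-inputdir>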

Load Data from TSV or CSV Format in HDFS to HBase

Below is an example that loads data from an HDFS file into an HBase table. You must first copy the local file into an HDFS directory (as sketched above); then you can load it into the HBase table.

$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,personal_data:name,personal_data:city,personal_data:age personal /test

The above command launches a MapReduce job that loads the data from the CSV file into the HBase table.
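Note that -Dimporttsv.separator is only needed here because the input is comma-separated; ImportTsv assumes tab-separated input by default. As a sketch, loading a hypothetical tab-separated file /test/personal.tsv would simply drop that option:

$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,personal_data:name,personal_data:city,personal_data:age personal /test/personal.tsv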

Verify HDFS file and Table contents

Below is the HDFS file content:

$ hdfs dfs -cat /test/personal.csv
2,sham,Bengaluru,24
3,Guru,New Delhi,27
4,John,NY,26
5,Rock,DC,30

Below is the HBase table content:

hbase(main):001:0> scan 'personal'
ROW COLUMN+CELL
 2 column=personal_data:age, timestamp=1505968148863, value=24
 2 column=personal_data:city, timestamp=1505968148863, value=Bengaluru
 2 column=personal_data:name, timestamp=1505968148863, value=sham
 3 column=personal_data:age, timestamp=1505968148863, value=27
 3 column=personal_data:city, timestamp=1505968148863, value=New Delhi
 3 column=personal_data:name, timestamp=1505968148863, value=Guru
 4 column=personal_data:age, timestamp=1505968148863, value=26
 4 column=personal_data:city, timestamp=1505968148863, value=NY
 4 column=personal_data:name, timestamp=1505968148863, value=John
 5 column=personal_data:age, timestamp=1505968148863, value=30
 5 column=personal_data:city, timestamp=1505968148863, value=DC
 5 column=personal_data:name, timestamp=1505968148863, value=Rock

4 row(s) in 0.2870 seconds

Apache HBase Bulk Load CSV using completebulkload

The completebulkload utility moves generated StoreFiles into an HBase table. To use it, you first create the StoreFiles with importtsv (using the -Dimporttsv.bulk.output option) and then load them into HBase with completebulkload.

Below are the steps to use completebulkload:

Create StoreFile first using importtsv

Use the command below to create the StoreFiles:

$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,personal_data:name,personal_data:city,personal_data:age -Dimporttsv.bulk.output=hdfs:///test_sf personal /test

You can list the HDFS output directory to verify the generated files.

$ hdfs dfs -ls hdfs:///test_sf
Found 2 items
-rw-r--r-- 3 impadmin hdfs 0 2017-09-21 10:58 hdfs:///test_sf/_SUCCESS
drwxr-xr-x - impadmin hdfs 0 2017-09-21 10:58 hdfs:///test_sf/personal_data
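The personal_data subdirectory holds the HFiles generated for that column family. As a sketch, a recursive listing shows them (the exact file names will vary):

$ hdfs dfs -ls -R hdfs:///test_sf/personal_data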

Load Data to HBase Table using completebulkload

Below is the command:

$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs:///test_sf personal
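Depending on your HBase version, the same tool may also be available through the completebulkload shortcut; treat this variant as an assumption and check it against your installation:

$ hbase completebulkload hdfs:///test_sf personal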

Below is the HBase table content:

hbase(main):001:0> scan 'personal'
ROW COLUMN+CELL
 2 column=personal_data:age, timestamp=1505971701012, value=24
 2 column=personal_data:city, timestamp=1505971701012, value=Bengaluru
 2 column=personal_data:name, timestamp=1505971701012, value=sham
 3 column=personal_data:age, timestamp=1505971701012, value=27
 3 column=personal_data:city, timestamp=1505971701012, value=New Delhi
 3 column=personal_data:name, timestamp=1505971701012, value=Guru
 4 column=personal_data:age, timestamp=1505971701012, value=26
 4 column=personal_data:city, timestamp=1505971701012, value=NY
 4 column=personal_data:name, timestamp=1505971701012, value=John
 5 column=personal_data:age, timestamp=1505971701012, value=30
 5 column=personal_data:city, timestamp=1505971701012, value=DC
 5 column=personal_data:name, timestamp=1505971701012, value=Rock

4 row(s) in 0.9360 seconds
