Apache HBase picks up where Hadoop HDFS leaves off: it provides random, real-time read/write access to big data. If you have flat files such as CSV or TSV, you can use the Apache HBase bulk load features to get that data into HBase tables.
In this post, I will show you how to import data into HBase from CSV and TSV files. We will not dig into any transformations; we will simply import the data into an already existing HBase table.
HBase ImportTsv utility
ImportTsv is a utility that loads data in TSV or CSV format into HBase.
ImportTsv has two distinct usages:
- Loading data from TSV or CSV files in HDFS into HBase via Puts (normal writes).
- Preparing StoreFiles to be loaded via completebulkload.
Load Data from TSV or CSV format in HDFS to HBase
Below is an example that loads data from an HDFS file into an HBase table. You must first copy the local file into an HDFS folder; then you can load it into the HBase table.
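If you are starting from scratch, the target table and the HDFS input directory have to exist first. A minimal setup sketch, assuming the `personal` table, `personal_data` column family, and `/test` directory used throughout this post, might look like this:

```shell
# Create the target table with the 'personal_data' column family
# by piping a DDL statement into the HBase shell
echo "create 'personal', 'personal_data'" | hbase shell

# Create the HDFS input directory and copy the local CSV file into it
hdfs dfs -mkdir -p /test
hdfs dfs -put personal.csv /test/
```

The table must exist before running ImportTsv; the utility will not create it for you.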
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,personal_data:name,personal_data:city,personal_data:age personal /test
Note that there must be no spaces in the -Dimporttsv.columns list. The above command launches a MapReduce job that loads the data from the CSV file into the HBase table.
Verify HDFS file and Table contents
Below is the HDFS file content:
$ hdfs dfs -cat /test/personal.csv
2,sham,Bengaluru,24
3,Guru,New Delhi,27
4,John,NY,26
5,Rock,DC,30
Below is the HBase table content:
hbase(main):001:0> scan 'personal'
ROW    COLUMN+CELL
 2     column=personal_data:age, timestamp=1505968148863, value=24
 2     column=personal_data:city, timestamp=1505968148863, value=Bengaluru
 2     column=personal_data:name, timestamp=1505968148863, value=sham
 3     column=personal_data:age, timestamp=1505968148863, value=27
 3     column=personal_data:city, timestamp=1505968148863, value=New Delhi
 3     column=personal_data:name, timestamp=1505968148863, value=Guru
 4     column=personal_data:age, timestamp=1505968148863, value=26
 4     column=personal_data:city, timestamp=1505968148863, value=NY
 4     column=personal_data:name, timestamp=1505968148863, value=John
 5     column=personal_data:age, timestamp=1505968148863, value=30
 5     column=personal_data:city, timestamp=1505968148863, value=DC
 5     column=personal_data:name, timestamp=1505968148863, value=Rock
4 row(s) in 0.2870 seconds
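You can also spot-check a single row with the HBase shell get command; for example, fetching row key 2 should return the three cells loaded for that row:

```shell
# Fetch all cells for row key '2' from the 'personal' table
echo "get 'personal', '2'" | hbase shell
```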
Apache HBase Bulk Load CSV using completebulkload
The completebulkload utility moves generated StoreFiles into an HBase table. To use this utility, you first create the StoreFiles with ImportTsv (using the -Dimporttsv.bulk.output option) and then load them into HBase with completebulkload.
Below are the steps to use completebulkload.
Create StoreFiles using ImportTsv
Use the command below to create the StoreFiles:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,personal_data:name,personal_data:city,personal_data:age -Dimporttsv.bulk.output=hdfs:///test_sf personal /test
You can check the HDFS output directory to verify that the StoreFiles were created.
$ hdfs dfs -ls hdfs:///test_sf
Found 2 items
-rw-r--r--   3 impadmin hdfs          0 2017-09-21 10:58 hdfs:///test_sf/_SUCCESS
drwxr-xr-x   - impadmin hdfs          0 2017-09-21 10:58 hdfs:///test_sf/personal_data
Load Data into the HBase Table using completebulkload
Below is the command:
$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs:///test_sf personal
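The LoadIncrementalHFiles class shown above is the classic entry point. On recent HBase releases (2.x), the same step can also be run via the completebulkload command, which, assuming the same output directory and table, would look like this:

```shell
# HBase 2.x shortcut for the same bulk-load step:
# completebulkload <StoreFile output directory> <table name>
hbase completebulkload hdfs:///test_sf personal
```

Either form moves the generated StoreFiles into the regions of the target table.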
Below is the HBase table content:
hbase(main):001:0> scan 'personal'
ROW    COLUMN+CELL
 2     column=personal_data:age, timestamp=1505971701012, value=24
 2     column=personal_data:city, timestamp=1505971701012, value=Bengaluru
 2     column=personal_data:name, timestamp=1505971701012, value=sham
 3     column=personal_data:age, timestamp=1505971701012, value=27
 3     column=personal_data:city, timestamp=1505971701012, value=New Delhi
 3     column=personal_data:name, timestamp=1505971701012, value=Guru
 4     column=personal_data:age, timestamp=1505971701012, value=26
 4     column=personal_data:city, timestamp=1505971701012, value=NY
 4     column=personal_data:name, timestamp=1505971701012, value=John
 5     column=personal_data:age, timestamp=1505971701012, value=30
 5     column=personal_data:city, timestamp=1505971701012, value=DC
 5     column=personal_data:name, timestamp=1505971701012, value=Rock
4 row(s) in 0.9360 seconds
Read:
- HBase Table Schema Design and Concept
- How to avoid HBase Hotspotting?
- Insert data using HBase shell put Command and Examples
- Read HBase Table using HBase shell get Command
- Hadoop HDFS Architecture Introduction and Design
- Create Tables using HBase Shell
- Official Apache HBase documentation
- HBase Architecture and its Components
I am trying to follow all the above steps sequentially, but I am unable to load the .csv file into HBase.
Please help me out.
Thank you,
Hi Nagesh,
Please share the error message that you are getting.
Thanks