Improve Hive Memory Usage using Hadoop Archive

Hadoop HDFS is designed such that the number of HDFS files directly affects memory consumption on the NameNode, which must keep track of every file in the cluster. This is not an issue on a small cluster, but memory usage can become a problem once the file count crosses 50 to 100 million files. The Hadoop ecosystem performs best with a smaller number of files. In this article, we will look at how to improve Hive memory usage using Hadoop Archive.

Improve Hive Memory Usage using Hadoop Archive

You can use Hadoop archiving to reduce the number of HDFS files in a Hive table partition. Hive has built-in functionality to convert a Hive table partition into a Hadoop Archive (HAR). HAR does not compress the files; it is analogous to the Linux tar command.

Note that if a Hive table partition is archived, Hive SQL queries may run slower because of the additional overhead of reading HAR files.

Hive Hadoop Archive Settings

There are three settings that you can set on the Hive console to enable Hadoop archiving:

https://gist.github.com/2b4d7a79e6b382657fcf9bb2e454d779
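If the embedded gist does not load, the sketch below shows the archiving-related properties documented for Hive; the values shown are illustrative, so adjust them for your cluster:

-- Enable Hive's built-in archiving support
SET hive.archive.enabled=true;
-- Allow Hive to set the parent directory when creating the HAR
SET hive.archive.har.parentdir.settable=true;
-- Target size (in bytes) of the part files that make up the archive
SET har.partfile.size=1099511627776;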

Improve Hive Memory Usage using Hadoop Archive Examples

You can use the Hive ALTER TABLE command with the ARCHIVE PARTITION option to perform Hadoop archiving. Below are examples of Hadoop archiving usage.

Archive using Hive ALTER TABLE

As mentioned earlier, a Hadoop archive helps you reduce the number of HDFS files in a table partition. You can use the Hive ALTER TABLE command to archive Hive table partitions. Below is a demonstration of the Hadoop archive functionality:

https://gist.github.com/9a16578510b43c33a8ad381ab2a2b557
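As an illustration, assuming a hypothetical table sales partitioned by sale_date, the archive command would look something like this:

-- Archive a single partition of a hypothetical 'sales' table
ALTER TABLE sales ARCHIVE PARTITION (sale_date='2018-02-28');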

Unarchive using Hive ALTER TABLE

Hadoop also supports unarchive functionality. You can use the Hive ALTER TABLE command to unarchive Hive table partitions whenever required. Below is the command to unarchive partitions in a table:

https://gist.github.com/f998e3ac6b1cf0b6b5d8d5c72b711a21
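For completeness, here is a matching unarchive example for the same hypothetical partition:

-- Restore the partition to its original, unarchived HDFS files
ALTER TABLE sales UNARCHIVE PARTITION (sale_date='2018-02-28');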