Hadoop HDFS is designed in such a way that the number of HDFS files directly affects memory consumption in the NameNode, because the NameNode must keep track of every file in the HDFS environment. This is not a concern on a small cluster, but memory usage can become a problem once the file count crosses roughly 50 to 100 million files. The Hadoop ecosystem performs best with a smaller number of files. Now, let us check how to improve Hive memory usage using Hadoop Archive.
Improve Hive Memory Usage using Hadoop Archive
You can use Hadoop archiving to reduce the number of HDFS files in a Hive table partition. Hive has built-in functionality to convert a Hive table partition into a Hadoop Archive (HAR). HAR does not compress the files; it is analogous to the Linux tar command.
Note that if a Hive table partition is archived, Hive SQL queries may run slower because of the additional overhead of reading HAR files.
Hive Hadoop Archive Settings
There are three settings that you can use on the Hive console to set up Hadoop archiving:
https://gist.github.com/2b4d7a79e6b382657fcf9bb2e454d779
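A minimal sketch of these settings, based on the archiving configuration described in the Hive documentation (the part file size shown here is only an illustrative value):

-- Enable Hive's archiving operations (ALTER TABLE ... ARCHIVE PARTITION)
SET hive.archive.enabled=true;
-- Allow Hive to set the parent directory when creating the HAR
SET hive.archive.har.parentdir.settable=true;
-- Target size of each HAR part file (illustrative value: 1 TB)
SET har.partfile.size=1099511627776;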
Improve Hive Memory Usage using Hadoop Archive Examples
You can use the Hive ALTER TABLE command with the ARCHIVE PARTITION option to perform Hadoop archiving. Below are examples of Hadoop archiving usage.
Archive using Hive ALTER TABLE
As mentioned earlier, a Hadoop archive helps you reduce the number of HDFS files in a table partition. You can use the Hive ALTER TABLE command to archive Hive table partitions. Below is a demonstration of the Hadoop archive functionality:
https://gist.github.com/9a16578510b43c33a8ad381ab2a2b557
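A minimal sketch of the archive command, assuming a hypothetical table sales_data partitioned by the column ds:

-- Pack the files of one partition of the (hypothetical) sales_data table into a HAR
ALTER TABLE sales_data ARCHIVE PARTITION (ds='2021-01-01');

After the partition is archived, it remains queryable through normal Hive SQL, only with the additional HAR read overhead noted above.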
Unarchive using Hive ALTER TABLE
Hive also supports the unarchive functionality. You can use the Hive ALTER TABLE command with the UNARCHIVE option to unarchive Hive table partitions whenever required. Below is the command to unarchive partitions in a table:
https://gist.github.com/f998e3ac6b1cf0b6b5d8d5c72b711a21
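A minimal sketch of the unarchive command, using the same hypothetical table and partition as above:

-- Restore the partition's original files from the HAR
ALTER TABLE sales_data UNARCHIVE PARTITION (ds='2021-01-01');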