Apache Hive Fixed-Width File Loading Options and Examples

  • Post author:
  • Post last modified:February 28, 2018
  • Post category:BigData
  • Reading time:4 mins read

In general, fixed-width text files are special types of text files where the row format is specified by column widths, pad character and either left or right alignments. In the fixed width file format, column width is in terms of units of characters. Fixed width format files are usually generated by machines such as switches, SS7 etc. In this article, we will learn about Apache Hive fixed-width file loading options and some examples.

Fixed-Width File Overview

In general, fixed-length format files use ordinal positions, which are offsets to identify where fields are within the record of the file. Usually, fixed-width data files have uniform lengths for each data row in the file. Each field in a fixed-width data file has the same length and position.

In fixed width format files, there are no field delimiters, and there may or may not be an end-of-record delimiter associated with file. There is no field delimiter in fixed width files. If you have any requirement that needs to convert data (for e.g. date) to required data type, then you should use explicit type conversion functions.

Sample Fixed-Width Format File

Below is the sample fixed width file format file:

Apache Hive Fixed-Width File Loading Options

Apache Hive Fixed-Width File Loading Options

There are couple of options available to load fixed width file into Hive tables.

  1. Create Hive external table using SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’ and load target table.
  2. Load fixed-width file in single column table and use substring to extract require fields.
Create Hive external table 
using SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’

You can use regex Serde to create the external table on top of fixed width file. You can then load target table using external table or convert that o managed table.

Related reading:

Below is the example of creating Hive external table using Regular Expression serde:

https://gist.github.com/ec6f6e424aaf4de0a5608a5414da4b0b

Fixed-Width File Overview

Now verify the external table:

Once above Hive external table is created pointing to fixed width file, you can query the external table to verify the column data. Below is the output of fixed width file external table.

https://gist.github.com/446033c731753313ab1929cfa599ff32

Load fixed-width file in single column table and use Hive substring to extract require fields

Another easy option is to load fixed-width file to single column stage table and then use Hive substring to load target table:

Related reading:

Hive Load fixed-width file to single column table

As mentioned earlier, you can get records from the fixed width file b using single column table. Simply load the entire table to single column table and use substring to get required fields.

https://gist.github.com/9d9dc12b9057dce0df69ea4158862b99

Use Substring to get required fields

Here is the example of using substring on single column table:

https://gist.github.com/a46140d43494320b4b66fb6ee94c6dd8

This Post Has 5 Comments

  1. Arpan Gupta

    Hi, thank you for such neat explanation. Is there a limit on length of row for SerDe to work properly?

    I was able to successfully read a fixed width file having 30 characters in a row into a hive table using SerDe. But when I try to use SerDe for a dataset having 358 columns and around 6600 characters in each row, the table is created but data couldn’t be loaded. It only loads all NULLs. Doing a select * outputs NULL in all columns and rows. Is it a limitation of SerDe or am I doing something wrong?

    1. Vithal S

      Hi,

      I don’t think there is a limitation. Meanwhile, check your input regex. There will be NULL in case if your regex is wrong. I will also try to replicate this issue.

      Thanks

      1. Arpan Gupta

        I have exactly 6,610 characters in one row. And my input regex is as follows :

        ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’

        WITH SERDEPROPERTIES (“input.regex” = “(.{7})(.{20})(.{90})(.{64})(.{64})(.{64})(.{64})(.{64})(.{50})(.{50})(.{16})(.{20})(.{16})(.{20})(.{9})(.{1})(.{8})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{2})(.{1})(.{1})(.{1})(.{9})(.{1})(.{9})(.{20})(.{8})(.{4})(.{20})(.{10})(.{10})(.{8})(.{8})(.{20})(.{20})(.{20})(.{20})(.{20})(.{20})(.{20})(.{20})(.{20})(.{20})(.{20})(.{20})(.{7})(.{1})(.{1})(.{1})(.{1})(.{3})(.{120})(.{64})(.{64})(.{50})(.{50})(.{16})(.{50})(.{3})(.{64})(.{50})(.{50})(.{16})(.{50})(.{3})(.{16})(.{16})(.{4})(.{30})(.{30})(.{30})(.{30})(.{30})(.{50})(.{50})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{9})(.{16})(.{5})(.{1})(.{6})(.{41})(.{1})(.{4})(.{4})(.{15})(.{1})(.{15})(.{1})(.{18})(.{4})(.{9})(.{1})(.{8})(.{3})(.{8})(.{8})(.{8})(.{8})(.{8})(.{8})(.{30})(.{30})(.{30})(.{30})(.{30})(.{30})(.{6})(.{6})(.{6})(.{6})(.{6})(.{6})(.{120})(.{120})(.{120})(.{120})(.{120})(.{120})(.{2})(.{1})(.{4})(.{1})(.{9})(.{120})(.{64})(.{64})(.{50})(.{50})(.{16})(.{50})(.{3})(.{9})(.{120})(.{64})(.{64})(.{50})(.{50})(.{16})(.{50})(.{3})(.{9})(.{120})(.{64})(.{64})(.{50})(.{50})(.{16})(.{50})(.{3})(.{9})(.{120})(.{64})(.{64})(.{50})(.{50})(.{16})(.{50})(.{3})(.{2})(.{9})(.{1})(.{50})(.{1})(.{3})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{3})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{3})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{3})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{3})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{1})(.{50})(.{1})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{1})(.{50})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{1})(.{50})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{1})(.{50})(.{1})(.{1})(.{8})(.{8})(.{8})(.{1})(.{50})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{1})(.{50})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{50})(.{1})(.{1})(.{8})(.{8})(.{20})(.{8})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{1})(.{3})(.{1})(.{5})(.{5})(.{5})(.{3})(.{5}).*”)

        I haven’t written this regex manually, I used R to generate this string so I don’t expect it have errors somewhere in the middle. Is there something else that is wrong in this?

      2. Arpan Gupta

        Got the error. No problem with the SerDe limitation or regex, but my input fixed width file had less no. of characters in all rows than listed in the dictionary.

        I figured it out by removing last column from the regex.

        1. Vithal S

          Awesome! Thanks for sharing answer.

Comments are closed.