Hadoop HDFS Schema Design for ETL Process

Nowadays, many organisations use Hadoop for their ETL processing. In this post, you will learn about good schema design for data that you store directly in Hadoop HDFS.

Hadoop HDFS Schema Design Overview

Many organisations use Hadoop for storing and processing unstructured, semi-structured, and structured data. Hadoop follows a schema-on-read model, so it does not impose any schema requirements when you load data into the Hadoop ecosystem. You can simply ingest data into HDFS using any of the available ingestion methods.

In many cases, HDFS storage is shared across multiple departments and teams, and HDFS usually acts as a data hub for the entire organisation. Creating a structured and organised repository of your data in HDFS provides many benefits, such as:

  • A standard directory structure makes it easier to share data across multiple teams within the organisation.
  • The structure helps protect against accidental deletion or corruption of data.
  • Using a staging area before moving data to the target directory helps ensure that partially loaded data is never processed (see the sketch after this list).
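
As a rough sketch (the staging and target paths below, /etl/bi/invoice-data/incoming and /etl/bi/invoice-data/input, are assumptions for illustration), the staging pattern works like this: load the extract into a staging directory first, and move it into the target directory only after the transfer completes, so downstream jobs never see partial data.

    # Load the raw extract into a staging directory first
    hdfs dfs -mkdir -p /etl/bi/invoice-data/incoming
    hdfs dfs -put invoices_2018-02-28.csv /etl/bi/invoice-data/incoming/

    # Only after the upload succeeds, move the file into the directory
    # the downstream job reads from (a move within HDFS is a cheap rename)
    hdfs dfs -mkdir -p /etl/bi/invoice-data/input
    hdfs dfs -mv /etl/bi/invoice-data/incoming/invoices_2018-02-28.csv /etl/bi/invoice-data/input/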

Hadoop HDFS Schema Design

Data warehouse implementations on the Hadoop ecosystem and other event stores are likely to use a schema similar to the traditional star schema, including structured fact and dimension tables. Unstructured and semi-structured data, on the other hand, rely more on directory placement and metadata management.

Standard HDFS locations make it easier to find and share data between different teams within the organisation. The following is an example HDFS directory structure that we use when implementing a data warehouse on the Hadoop ecosystem.

/user/<user-name>

This folder holds user-specific configuration files. The directories under /user/ are typically readable and writable only by the users who own them.
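
For example (the user name "jane" below is hypothetical), a home directory can be created and locked down to its owner like this:

    hdfs dfs -mkdir -p /user/jane
    hdfs dfs -chown jane:jane /user/jane
    hdfs dfs -chmod 700 /user/jane    # readable and writable only by the owner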

/etl

Data is processed in various stages by the ETL workflows. All the directories under /etl are readable and writable by ETL processes and ETL team members. The /etl directory has a subdirectory for each group that performs ETL operations, such as analytics, BI, etc.

For example, if you are part of the business intelligence (BI) team and you are dealing with invoice data, one of your processes may be aggregation of that data. The recommended directory for holding the aggregated data is: /etl/bi/invoice-data/aggregate
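
A minimal sketch of creating that layout, assuming a hypothetical "etl" group owns the tree (the group name and the analytics path are assumptions):

    hdfs dfs -mkdir -p /etl/bi/invoice-data/aggregate
    hdfs dfs -mkdir -p /etl/analytics
    hdfs dfs -chgrp -R etl /etl
    hdfs dfs -chmod -R 770 /etl    # ETL processes and ETL team members only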

/temp

This directory holds temporary data that is shared between different users within the organisation. It typically holds data for a day and is cleaned up by automated processes. The directory usually has full read and write permissions for everyone.
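
A sketch of how such a shared scratch area can be set up and cleaned (HDFS has no built-in expiry for plain directories, so the nightly cleanup and the per-day subdirectory naming below are assumptions):

    # World-writable scratch area with the sticky bit set, so users can
    # only delete their own files
    hdfs dfs -mkdir -p /temp
    hdfs dfs -chmod 1777 /temp

    # Example nightly cron job: drop yesterday's subdirectory
    # (assumes data lands under per-day directories such as /temp/2018-02-27)
    hdfs dfs -rm -r -skipTrash /temp/$(date -d "yesterday" +%Y-%m-%d)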

/data

This directory holds data that has been processed and is shared across the organisation. Because this directory is critical to the organisation, only end applications are allowed to write to it, and users have read-only access.
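
A sketch of that permission model, where a hypothetical "etl-app" service account is the only writer and everyone else gets read-only access:

    hdfs dfs -mkdir -p /data
    hdfs dfs -chown etl-app:etl-app /data
    hdfs dfs -chmod 755 /data    # owner writes; everyone else can only read and list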

/bin

This directory holds everything that is required to run the Hadoop data warehouse application, including JAR files, functions, scripts, workflow definitions, etc.
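
For example, application artifacts can be published to this directory like so (the JAR and script names are hypothetical):

    hdfs dfs -mkdir -p /bin/invoice-etl
    hdfs dfs -put invoice-etl-1.0.jar /bin/invoice-etl/
    hdfs dfs -put run_invoice_etl.sh /bin/invoice-etl/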

/sql

This directory holds the Hive QL or Impala QL files that are required to run the application. These SQL files are usually used to load the metadata tables that track the ETL operations.
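
For example (the script name below is hypothetical), an HQL file can be stored here and then fetched and executed as one step of the workflow:

    hdfs dfs -mkdir -p /sql/invoice-etl
    hdfs dfs -put load_etl_audit.hql /sql/invoice-etl/

    # Later, fetch and run the script as part of the ETL workflow
    hdfs dfs -get /sql/invoice-etl/load_etl_audit.hql /tmp/load_etl_audit.hql
    hive -f /tmp/load_etl_audit.hql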

/parms

This directory holds any parameter files that are used to process the data in the ETL operations. For example, if you have a requirement to load a fixed-width file, you store its layout definition in a parameter file.
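
As an illustration (the file name and the name,width layout format are assumptions), a fixed-width layout definition might be kept as a small parameter file and published alongside the other parameters:

    # invoice_layout.parm (hypothetical contents): field name and width
    #   invoice_id,10
    #   customer_id,8
    #   invoice_date,8
    #   amount,12
    hdfs dfs -mkdir -p /parms
    hdfs dfs -put invoice_layout.parm /parms/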

/archive

This directory holds processed files. Any file you transfer to this directory will be compressed by an automated process.

This directory is simply a backup directory for data files.
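
For example (the paths and file name are hypothetical), a processed file can be transferred into the archive like this:

    # Move a processed file into the archive area; an automated job
    # later compresses everything under /archive
    hdfs dfs -mkdir -p /archive/invoice-data
    hdfs dfs -mv /data/invoice-data/invoices_2018-02-28.csv /archive/invoice-data/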