Sqoop Architecture – Mappers with No Reducers


In today's big data world, data is everything, yet most organisations still rely on relational databases for their day-to-day needs. Performing complex calculations on data stored in an RDBMS is either not possible or takes a lot of time. The Sqoop tool is designed to transfer data between a Hadoop cluster and an RDBMS.

Sqoop's architecture is designed to transfer data between Hadoop and the various relational databases available in the market. You can use Sqoop to import data from a relational database management system (RDBMS) such as Netezza, MySQL, Oracle or SQL Server into HDFS, transform the data and perform complex calculations in Hadoop MapReduce programs, and then export the results back to the RDBMS. Sqoop is built on a connector architecture that supports plugins to provide connectivity to external systems (RDBMS).

Sqoop uses MapReduce jobs to import and export the data, which gives both imports and exports parallelism as well as fault tolerance. Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported.
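Because Sqoop pulls the schema directly from the database, you can also use it to inspect the source system before running an import. The commands below are only a sketch; the connection string, user name and table are placeholders borrowed from the examples later in this post, and -P simply prompts for the password.

$ sqoop list-tables --connect jdbc:netezza://localhost/MYDB --username Vithal -P
$ sqoop eval --connect jdbc:netezza://localhost/MYDB --username Vithal -P --query "SELECT COUNT(*) FROM MY_TABLE"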

The input to the Sqoop import process is a database table. Sqoop reads the table row by row into HDFS, and the output of this process is a set of HDFS files containing a copy of the imported table. Because the import is performed in parallel, the output is spread across multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
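For instance, the import file format is selected with arguments such as --as-textfile (the default), --as-sequencefile or --as-avrodatafile. The command below is a minimal sketch; the connection details, table name and target directory are placeholders.

$ sqoop import --connect jdbc:netezza://localhost/MYDB --username Vithal -P --table MY_TABLE --target-dir /user/vithal/MY_TABLE --as-avrodatafile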

Sqoop Architecture

Sqoop can be installed on any available node of a Hadoop cluster; in production environments it is usually installed on an edge node. What happens under the covers when you run Sqoop is very straightforward: the dataset being transferred is sliced into partitions, and a map-only job is launched in which each mapper is responsible for transferring one slice of the dataset. Each record is handled in a type-safe manner because Sqoop uses the database metadata to infer the data types.
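The number of mappers, and the column used to slice the table, can be controlled explicitly with --num-mappers and --split-by. The command below is a sketch only; ID stands in for any splittable column in your table.

$ sqoop import --connect jdbc:netezza://localhost/MYDB --username Vithal -P --table MY_TABLE --split-by ID --num-mappers 4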

Sqoop Output

The Sqoop output is either a set of HDFS files (for an import) or rows in an RDBMS table (for an export).
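Since the import runs as a map-only job, each mapper writes its own file named part-m-00000, part-m-00001 and so on under the target directory. You can inspect the result with standard HDFS commands; the directory below is a placeholder.

$ hdfs dfs -ls /user/vithal/MY_TABLE
$ hdfs dfs -cat /user/vithal/MY_TABLE/part-m-00000 | head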

A sample Sqoop import command is given below:

$ sqoop import --connect jdbc:netezza://localhost/MYDB --username Vithal --password xxxxx --direct --table MY_TABLE --num-mappers 8 --escaped-by '\\' --fields-terminated-by ','

A sample Sqoop export command is given below:

$ sqoop export --connect jdbc:netezza://localhost/MYDB --username Vithal --password xxxxx --direct --export-dir /user/arvind/MY_TABLE --table MY_TABLE_TGT --num-mappers 8 --input-escaped-by '\\'

The direct mode Netezza connector supports the following Netezza-specific arguments for imports and exports.