Export data using Apache Sqoop

In some cases, data processed by the Hadoop cluster may be needed in production systems to run additional critical business functions. Sqoop can export a set of files from HDFS to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to user-specified delimiters, which are provided in a Sqoop options file or as CLI options.

Because the target system is an RDBMS, the default operation is to transform these records into a set of INSERT statements that inject them into the database. The Sqoop export command also provides an update option. In update mode, Sqoop generates UPDATE statements that replace existing records in the database.

$ sqoop export --connect jdbc:netezza://localhost/MYDB \
  --table HEALTH --username Vithal --password **** \
  --export-dir /user/Vithal/HEALTH

export: This is the sub-command that instructs Sqoop to initiate an export.

--connect <connect string>, --username <user name>, --password <password>: These are connection parameters used to connect to the database.

--table <table name>: This parameter specifies the table that will be populated.

--export-dir <directory path>: This is the directory from which data will be exported.

Export is done in two steps, as shown in Figure 1. The first step is to retrieve metadata about the table from the database; the second step is to transfer the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

Figure 1: Sqoop Export Architecture

Because a failed export could leave the production table in a corrupted, partially updated state, some connectors support a staging table. The staging table is populated by the map tasks, and its contents are merged into the target table only after all the data has been copied to the database.
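For connectors that support it, the staging table is specified on the command line. The following is a minimal sketch based on the earlier example; HEALTH_STG is a hypothetical staging table that must already exist with the same schema as the target table, and --clear-staging-table empties it before the export begins.

$ sqoop export --connect jdbc:netezza://localhost/MYDB \
  --table HEALTH --staging-table HEALTH_STG --clear-staging-table \
  --username Vithal --password **** \
  --export-dir /user/Vithal/HEALTH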

Connectors

Sqoop can connect with external systems that have optimised import and export facilities by using specialised connectors. Connectors are plugin components based on Sqoop’s extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector.

By default, Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and PostgreSQL databases. Fast-path connectors are specialized connectors that use database-specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database that is accessible via JDBC.
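For example, the MySQL fast-path connector is selected with the --direct flag, which hands the transfer off to the mysqlimport batch utility instead of issuing JDBC INSERT statements. The connection details below (an employees database and emp table on db.example.com) are purely illustrative.

$ sqoop export --connect jdbc:mysql://db.example.com/employees \
  --table emp --username dbuser --password **** \
  --export-dir /user/hive/warehouse/emp --direct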

Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialised connectors for enterprise data warehouse systems to NoSQL datastores.

You can control the number of mappers independently from the number of files present in the directory. Export performance depends on the degree of parallelism. By default, Sqoop will use four tasks in parallel for the export process. This may not be optimal; you will need to experiment with your own particular setup. Additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indices, invoking triggers, and so on, then additional load may decrease performance. The --num-mappers or -m arguments control the number of map tasks, which is the degree of parallelism used.
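As a sketch, the command below reuses the connection details from the earlier example and raises the parallelism from the default of four to eight map tasks; whether this helps depends on how much concurrent load your database can absorb.

$ sqoop export --connect jdbc:netezza://localhost/MYDB \
  --table HEALTH --username Vithal --password **** \
  --export-dir /user/Vithal/HEALTH \
  --num-mappers 8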

Inserts and updates

By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If the table contains constraints, such as a primary key or unique key column whose values must be unique, and it already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table that will receive these results.

If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is treated as an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key.

$ sqoop export --connect jdbc:netezza://localhost/MYDB \
  --table sqoop_updt_test --update-key id \
  --export-dir /dw/sqoop_updt_test

Transactions

Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database, and these connections have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound and cause out-of-memory conditions. Note that partial results from the export will become visible before the export is complete.
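Depending on the connector in use, both figures can be tuned through Hadoop properties passed with -D: sqoop.export.records.per.statement controls the rows per multi-row INSERT, and sqoop.export.statements.per.transaction controls how many statements are issued before a commit. The sketch below only illustrates the syntax; the values shown are arbitrary, and the generic -D arguments must appear before the tool-specific arguments.

$ sqoop export -D sqoop.export.records.per.statement=100 \
  -D sqoop.export.statements.per.transaction=50 \
  --connect jdbc:netezza://localhost/MYDB \
  --table HEALTH --username Vithal --password **** \
  --export-dir /user/Vithal/HEALTH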

Failed Transactions

A transaction may fail for a number of reasons:

  • Loss of connectivity from the Hadoop cluster to the database
  • Violation of a primary key or unique key constraint
  • Attempting to parse an incomplete or malformed record from the HDFS source data
  • Attempting to parse records using incorrect delimiters
  • Capacity issues, such as insufficient memory or disk space

If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back.