Export data using Apache Sqoop

In some cases, data processed by the Hadoop cluster may be needed in production systems to run additional critical business functions. Sqoop can export a set of files from HDFS to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to user-specified delimiters, which are provided in a Sqoop options file or as CLI options.

Because the target system is an RDBMS, the default operation is to transform these records into a set of INSERT statements that inject them into the database. The Sqoop export command also provides an update option. In update mode, Sqoop generates UPDATE statements that replace existing records in the database.

$ sqoop export --connect jdbc:netezza://localhost/MYDB \
  --table HEALTH --username Vithal --password **** \
  --export-dir /user/Vithal/HEALTH

export: This is the sub-command that instructs Sqoop to initiate an export.

--connect <connect string>, --username <user name>, --password <password>: These are connection parameters used to connect to the database.

--table <table name>: This parameter specifies the table that will be populated.

--export-dir <directory path>: This is the directory from which data will be exported.

Export is done in two steps, as shown in Figure 1. The first step is to retrieve metadata about the table from the database; the second step is to transfer the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

Figure 1: Sqoop Export Architecture

Because a failed export could leave the production table in a corrupted, partially updated state, some connectors support a staging table. The staging table is populated by the map tasks, and its contents are merged into the target table only after all the data has been copied to the database.
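For connectors that support it, the staging table is specified on the command line. The following is a minimal sketch based on the earlier example; HEALTH_STG is a hypothetical staging table that must already exist with the same schema as the target table, and --clear-staging-table empties it before the export begins.

$ sqoop export --connect jdbc:netezza://localhost/MYDB \
  --table HEALTH --staging-table HEALTH_STG --clear-staging-table \
  --username Vithal --password **** \
  --export-dir /user/Vithal/HEALTH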

Connectors

Sqoop can connect with external systems that have optimised import and export facilities by using specialised connectors. Connectors are plugin components based on Sqoop’s extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector.

By default, Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and PostgreSQL databases. Fast-path connectors are specialized connectors that use database-specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database that is accessible via JDBC.
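For example, the MySQL fast-path connector is selected with the --direct flag, which hands the transfer off to the mysqlimport batch utility instead of issuing JDBC INSERT statements. The connection details below (an employees database and emp table on db.example.com) are purely illustrative.

$ sqoop export --connect jdbc:mysql://db.example.com/employees \
  --table emp --username dbuser --password **** \
  --export-dir /user/hive/warehouse/emp --direct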

Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialised connectors for enterprise data warehouse systems to NoSQL datastores.

You can control the number of mappers independently from the number of files present in the directory. Export performance depends on the degree of parallelism. By default, Sqoop will use four tasks in parallel for the export process. This may not be optimal; you will need to experiment with your own particular setup. Additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indices, invoking triggers, and so on, then additional load may decrease performance. The --num-mappers or -m arguments control the number of map tasks, which is the degree of parallelism used.
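As a sketch, the command below reuses the connection details from the earlier example and raises the parallelism from the default of four to eight map tasks; whether this helps depends on how much concurrent load your database can absorb.

$ sqoop export --connect jdbc:netezza://localhost/MYDB \
  --table HEALTH --username Vithal --password **** \
  --export-dir /user/Vithal/HEALTH \
  --num-mappers 8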

Inserts and updates

By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If the table contains constraints, such as a primary key or unique key column whose values must be unique, and it already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table that will receive these results.

If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is treated as an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key.

$ sqoop export --connect jdbc:netezza://localhost/MYDB \
  --table sqoop_updt_test --update-key id \
  --export-dir /dw/sqoop_updt_test

Transactions

Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database, and these connections have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound and cause out-of-memory conditions. Note that partial results from the export will become visible before the export is complete.
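Depending on the connector in use, both figures can be tuned through Hadoop properties passed with -D: sqoop.export.records.per.statement controls the rows per multi-row INSERT, and sqoop.export.statements.per.transaction controls how many statements are issued before a commit. The sketch below only illustrates the syntax; the values shown are arbitrary, and the generic -D arguments must appear before the tool-specific arguments.

$ sqoop export -D sqoop.export.records.per.statement=100 \
  -D sqoop.export.statements.per.transaction=50 \
  --connect jdbc:netezza://localhost/MYDB \
  --table HEALTH --username Vithal --password **** \
  --export-dir /user/Vithal/HEALTH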

Failed Transactions

A transaction may fail for a number of reasons:

  • Loss of connectivity from the Hadoop cluster to the database
  • Violation of a primary key or unique key constraint
  • Attempting to parse an incomplete or malformed record from the HDFS source data
  • Attempting to parse records using incorrect delimiters
  • Capacity issues, such as insufficient memory or disk space

If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back.