Nowadays everybody wants to migrate to the Hadoop environment for their analytics, including real-time or near real-time workloads. In this post I will explain some best practices for migrating Netezza SQL to Impala SQL.
Impala uses standard SQL, but you might still need to modify the source SQL when bringing a specific application to Hadoop Impala because of variations in data types, built-in functions and, of course, Hadoop-specific syntax. Even if the SQL works correctly in Impala, you might consider rewriting it to improve performance.
Read:
- Netezza Hadoop Connector and its Usage
- Netezza Hadoop Integration and Different Types of Ingestions
- Import Data Using Apache Sqoop
Migrating DDL and DML Statements
When migrating Netezza SQL to Impala SQL, expect to find a number of differences in DDL and DML statements.
For example, Netezza uses the DISTRIBUTE ON clause to distribute data across data slices, but Impala does not support this syntax, so you should remove it from such DDLs.
Since Impala supports standard SQL, expect DML queries to have a much higher degree of compatibility. With small modifications for Hadoop-specific syntax, you will be able to run most queries.
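For example, here is a minimal sketch of how a Netezza table DDL might be converted (the table and column names are hypothetical):

```sql
-- Netezza DDL with a distribution clause
CREATE TABLE sales_fact (
    sale_id   BIGINT,
    sale_date DATE,
    amount    NUMERIC(12,2)
)
DISTRIBUTE ON (sale_id);

-- Equivalent Impala DDL: DISTRIBUTE ON is dropped and types adjusted
CREATE TABLE sales_fact (
    sale_id   BIGINT,
    sale_date TIMESTAMP,
    amount    DECIMAL(12,2)   -- DECIMAL requires CDH 5.2.0 or higher
)
STORED AS PARQUET;
```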
Some points to remember when porting DDL:
- A CREATE TABLE statement with no STORED AS clause creates data files in plain text format. Modify the DDL if you want to store data in a specific format; file formats such as PARQUET are well suited for Impala tables.
- A CREATE TABLE statement with no PARTITIONED BY clause stores all the data files in the same physical location, which will impact performance as the data volume increases.
- Use INSERT … SELECT statements whenever possible for your transformations.
- Rewrite UNION queries to use the UNION ALL operator. UNION ALL is more efficient because it avoids materializing and sorting the entire result set to eliminate duplicate values.
- Change any GENERATE STATISTICS statements to COMPUTE STATS. A combined sketch of these points follows this list.
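Here is a minimal sketch tying these points together (the table names such as sales_pq and staging_sales are hypothetical):

```sql
-- Partitioned Parquet table instead of the default text format
CREATE TABLE sales_pq (
    sale_id BIGINT,
    amount  DECIMAL(12,2)
)
PARTITIONED BY (sale_year INT)
STORED AS PARQUET;

-- Transform and load with INSERT ... SELECT
-- (dynamic partitioning: the partition column comes last in the select list)
INSERT INTO sales_pq PARTITION (sale_year)
SELECT sale_id, amount, sale_year
FROM staging_sales;

-- Prefer UNION ALL over UNION when duplicate rows are acceptable
SELECT sale_id FROM staging_sales
UNION ALL
SELECT sale_id FROM archive_sales;

-- Replace GENERATE STATISTICS with COMPUTE STATS
COMPUTE STATS sales_pq;
```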
Migrating Netezza Data Types to Impala-Specific Data Types
- Modify any columns with VARCHAR or CHAR data types to the STRING data type if you are using CDH 5.1.0 or a lower version.
- Remove constraints such as PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, UNSIGNED, or CHECK from CREATE TABLE statements.
- Remove any INDEX, KEY, or PRIMARY KEY clauses from CREATE TABLE and ALTER TABLE statements.
- Change any DATE, DATETIME, or TIME columns to TIMESTAMP. Consider removing any timezone-specific data types.
- Impala supports INTERVAL expressions for date and time arithmetic; you can make use of them if your query requires it.
- The DECIMAL type is supported in CDH 5.2.0 or higher. Consider changing any NUMERIC columns to INT depending upon the application.
- FLOAT, DOUBLE, and REAL types are supported in Impala. Remove any precision and scale specifications.
- Most of the Netezza integer data types are supported in Impala SQL; you can use them as-is while migrating.
- Binary large objects such as BLOB, RAW BINARY, and VARBINARY are not supported in Impala.
- For any boolean-like type, use the Impala BOOLEAN type. A sample type mapping is shown after this list.
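As a rough sketch of these mappings (the customer table and its columns are made up for illustration):

```sql
-- Netezza DDL
CREATE TABLE customer (
    cust_id   INTEGER NOT NULL PRIMARY KEY,
    cust_name VARCHAR(100),
    signup_dt DATE,
    balance   NUMERIC(10,2),
    is_active BOOLEAN
)
DISTRIBUTE ON (cust_id);

-- Impala DDL: constraints and DISTRIBUTE ON removed, types mapped
CREATE TABLE customer (
    cust_id   INT,
    cust_name STRING,        -- use STRING on CDH 5.1.0 or lower
    signup_dt TIMESTAMP,     -- DATE becomes TIMESTAMP
    balance   DECIMAL(10,2), -- requires CDH 5.2.0 or higher
    is_active BOOLEAN
)
STORED AS PARQUET;
```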
Netezza Statements not Supported in Impala SQL
Some SQL statements and clauses are not supported in Impala:
- Impala does not support DELETE statements.
- Impala does not support UPDATE statements.
- Impala does not support the INTERSECT set operation.
- Impala has no transaction statements such as COMMIT or ROLLBACK; all writes are effectively auto-committed.
- Older Impala releases support subqueries only in the FROM clause of a query, not within the WHERE clause. Consider rewriting such SQL as joins, as shown in the sketch after this list.
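For instance, an INTERSECT or a WHERE-clause subquery can usually be rewritten as a join; the tables below are hypothetical:

```sql
-- Netezza: INTERSECT
SELECT cust_id FROM orders_2015
INTERSECT
SELECT cust_id FROM orders_2016;

-- Impala rewrite: inner join plus DISTINCT to mimic set semantics
SELECT DISTINCT a.cust_id
FROM orders_2015 a
JOIN orders_2016 b ON a.cust_id = b.cust_id;

-- Netezza: subquery in the WHERE clause
-- SELECT * FROM orders WHERE cust_id IN (SELECT cust_id FROM vip_customers);
-- Impala rewrite: LEFT SEMI JOIN
SELECT o.*
FROM orders o
LEFT SEMI JOIN vip_customers v ON o.cust_id = v.cust_id;
```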
This is not a perfect migration guide. Feel free to comment and share if you like this post.
Great article, Vithal Sampagar. Thanks for sharing. I have some more questions; wondering if we can connect.
I have several TB of Netezza data to be offloaded into a Hadoop-based data lake. I can't do this over the wire due to bandwidth limitations and would like to do it on a temporary staging platform.
Any suggestions?
Hi Prem,
Thank you 🙂
If you have bandwidth limitations, my suggestion would be to mount shared storage on the Netezza host (or an edge node) and offload the required tables as flat files. Then you can mount that storage on a Hadoop edge node as well and use those flat files in the Hadoop ecosystem. This is the fastest way to get data into Hadoop when bandwidth is limited. Talk to your local IT team about this idea; they can help with mounting the storage.
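As a rough sketch of that approach (the paths and table names are made up, and it assumes the shared storage is mounted on both hosts):

```sql
-- On Netezza: unload a table to a flat file on the mounted storage
CREATE EXTERNAL TABLE '/mnt/nz_offload/customer.dat'
USING (DELIMITER '|')
AS SELECT * FROM customer;

-- After mounting the same storage on a Hadoop edge node, copy the file into HDFS
-- (for example: hdfs dfs -put /mnt/nz_offload/customer.dat /data/landing/customer/)
-- and point an Impala external table at that directory:
CREATE EXTERNAL TABLE customer_stg (
    cust_id   INT,
    cust_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/landing/customer/';
```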
Let me know how it goes.
Please contact me via the “Contact us” page; I will reply to any of your queries.
Keep reading…