How to Connect to Netezza Server from Spark? – Example


I was working on a Spark project where we had a requirement to connect to a Netezza server from Spark. Integrating Netezza and Apache Spark enables analytic capabilities using Apache Spark for data that resides in a Netezza database.

Connect to Netezza Server from Spark

There are various ways to connect to a Netezza server from a Spark program. You can also connect to the Netezza server from PySpark.

So, Why Connect to Netezza Server from Spark?

This is an interesting question. To answer it, let us check how Apache Spark works. Apache Spark works on data that you provide using the local file system or HDFS. If you have a very big table in Netezza, it may take significant time to create a flat file and then feed it to Spark. Instead, you can directly connect to the Netezza server using the nzjdbc.jar driver and create a DataFrame for a particular table. You can then use that DataFrame for your subsequent calculations.


How to Connect to Netezza Server from Spark?

In this article, we will check a simple method to connect to a Netezza server using the JDBC driver. Make sure you use Spark version 1.2 or above, as it includes the data source API.

Download JDBC Driver

IBM provides the nzjdbc*.jar JDBC driver that you can use to integrate Netezza and Apache Spark.

You can download the JDBC driver from IBM Fix Central. Create an IBM ID if you don't already have one; you need it to download any software from IBM Fix Central.

Alternatively, this driver may come as part of your Netezza software bundle. Check with your local Netezza administrator for the jar file, or, if you are an administrator, look for it at /nz/kit/sbin/nzjdbc*.jar.

Add nzjdbc.jar File Path to CLASSPATH

The next step is to add the nzjdbc*.jar file path to the CLASSPATH environment variable. You can either set it manually or add an export statement to .bashrc or .profile.

export CLASSPATH=$PWD/nzjdbc3.jar
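
Alternatively, if you don't want to rely on CLASSPATH, you can pass the driver jar directly when launching the Spark shell using the --jars and --driver-class-path options. The path below is just an illustration; point it at wherever you saved the driver.

spark-shell --driver-class-path /path/to/nzjdbc3.jar --jars /path/to/nzjdbc3.jar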

Connect to Netezza Server from Spark

Now you are all set. Just establish the JDBC connection, read the Netezza table, and store it in a DataFrame variable.

Below is the command and example output. Change the connection details as per your Netezza server configuration.

scala> val df = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:netezza://192.168.100.200:5480/NEWTPCDS", "user" -> "admin", "password" -> "password", "dbtable" -> "REASON", "driver" -> "org.netezza.Driver")).load()
df: org.apache.spark.sql.DataFrame = [R_REASON_SK: int, R_REASON_ID: string, R_REASON_DESC: string]

scala> df.show()

+-----------+----------------+--------------------+
|R_REASON_SK|     R_REASON_ID|       R_REASON_DESC|
+-----------+----------------+--------------------+
|          4|AAAAAAAAEAAAAAAA|Not the product t...|
|          8|AAAAAAAAIAAAAAAA|Did not like the ...|
|         12|AAAAAAAAMAAAAAAA|No service locati...|
|         16|AAAAAAAAABAAAAAA|Did not fit      ...|
|         20|AAAAAAAAEBAAAAAA|duplicate purchas...|
…
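
Once the DataFrame is loaded, all subsequent calculations run in Spark. As a minimal sketch using the same Spark 1.x DataFrame API, you can filter it directly, or register it as a temporary table and query it with SQL; the temp table name reason here is just an illustration.

scala> df.filter("R_REASON_SK > 10").show()

scala> df.registerTempTable("reason")
scala> sqlContext.sql("SELECT R_REASON_SK, R_REASON_DESC FROM reason").show()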

Footnote

Always remember to:

  • Use the correct Netezza JDBC driver. Check IBM Fix Central for the latest version.
  • Set up the CLASSPATH environment variable.
  • Use correct details in the JDBC connection string (see the format below).
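
For reference, the Netezza JDBC connection string follows the format below, where host, port, and database are placeholders for your server details (5480 is the default Netezza port):

jdbc:netezza://<host>:<port>/<database>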

Hope this helps!