I was working on a Spark project where we had a requirement to connect to a Netezza server from Spark. Integrating Netezza and Apache Spark enables analytics with Apache Spark on data that resides in a Netezza database.
There are various ways to connect to a Netezza server from a Spark program. You can also connect to the Netezza server from PySpark.
So, Why Connect to a Netezza Server from Spark?
This is an interesting question. To answer it, let us check how Apache Spark works. Apache Spark works on data that you provide from the local file system or HDFS. If you have a very big table in Netezza, it may take significant time to create a flat file and then feed that to Spark. Instead, you can directly connect to the Netezza server using the nzjdbc.jar driver and create a DataFrame for a particular table (see the example below). You can then use that DataFrame for your subsequent calculations.
Related Articles:
- Steps to Connect Teradata Database from Spark? – Examples
- Steps to Connect Oracle Database from Spark? – Examples
- How to Connect Netezza using JDBC Driver and working Examples
- Connect to SQL Server From Spark – PySpark
How to Connect to a Netezza Server from Spark?
In this article, we will check a simple method to connect to a Netezza server using the JDBC driver. Make sure you use Spark version 1.2 or above, as it includes the data source API.
Download JDBC Driver
IBM provides the nzjdbc*.jar JDBC driver that you can use to integrate Netezza and Apache Spark.
You can download the JDBC driver from IBM Fix Central. Create an IBM ID if you don't already have one; it is required to download any software from IBM Fix Central.
Alternatively, this driver may come as part of your Netezza software bundle. Check with your local Netezza administrator for the jar file, or, if you are an administrator, check the /nz/kit/sbin/nzjdbc*.jar location.
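If you have shell access to the Netezza host, a quick way to check for the bundled driver at the kit location mentioned above:

ls /nz/kit/sbin/nzjdbc*.jar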
Add nzjdbc.jar File Path to CLASSPATH
The next step is to add the nzjdbc*.jar file path to the CLASSPATH environment variable. You can either do it manually or add the export statement to .bashrc or .profile.
export CLASSPATH=$PWD/nzjdbc3.jar
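Alternatively, you can hand the driver jar to Spark at launch time with the --jars option of spark-shell (adding --driver-class-path as well makes it available to the driver JVM). A minimal sketch, assuming nzjdbc3.jar sits in your current directory:

spark-shell --jars $PWD/nzjdbc3.jar --driver-class-path $PWD/nzjdbc3.jar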
Connect to Netezza Server from Spark
Now you are all set. Just establish the JDBC connection, read the Netezza table, and store it in a DataFrame variable.
Below is the command and an example. Change it as per your Netezza server configuration.
scala> val df = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:netezza://192.168.100.200:5480/NEWTPCDS", "user" -> "admin", "password" -> "password", "dbtable" -> "REASON", "driver" -> "org.netezza.Driver")).load()
df: org.apache.spark.sql.DataFrame = [R_REASON_SK: int, R_REASON_ID: string, R_REASON_DESC: string]
scala> df.show()
+-----------+----------------+--------------------+
|R_REASON_SK| R_REASON_ID| R_REASON_DESC|
+-----------+----------------+--------------------+
| 4|AAAAAAAAEAAAAAAA|Not the product t...|
| 8|AAAAAAAAIAAAAAAA|Did not like the ...|
| 12|AAAAAAAAMAAAAAAA|No service locati...|
| 16|AAAAAAAAABAAAAAA|Did not fit ...|
| 20|AAAAAAAAEBAAAAAA|duplicate purchas...|
…
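Once loaded, the Netezza table behaves like any other Spark DataFrame. As a minimal sketch building on the df created above (the temporary table name reason is only an illustration; registerTempTable is the Spark 1.x API, replaced by createOrReplaceTempView in Spark 2.x), you can register it and query it with Spark SQL:

scala> df.registerTempTable("reason")
scala> sqlContext.sql("SELECT R_REASON_SK, R_REASON_DESC FROM reason WHERE R_REASON_SK <= 10").show()

If you prefer Netezza itself to do the filtering, the JDBC data source also accepts a subquery in place of a table name, for example "dbtable" -> "(SELECT * FROM REASON WHERE R_REASON_SK <= 10) AS t", so only the matching rows are transferred.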
Footnote
Always remember to:
- Use the correct Netezza JDBC driver. Check IBM Fix Central for the latest version.
- Set the CLASSPATH environment variable.
- Use the correct details in the JDBC connection string.
Hope this helps 🙂