Apache Spark is one of the emerging big data technologies, thanks to its fast, in-memory distributed computation. You can analyze petabytes of data using Apache Spark's in-memory distributed computation, and you can connect Spark to all major databases on the market, such as Netezza, Oracle, etc. In this article, we will check one of the methods to connect to a Teradata database from a Spark program. You can connect using either Scala or Python (PySpark); for all examples in this article, we will use Scala to read Teradata tables. You can even execute queries and create Spark DataFrames.
Steps to Connect Teradata Database from Spark
Teradata is one of the widely used MPP databases in the world. Most companies use Teradata as a data warehouse appliance or for transactional systems.
Related Articles:
- Steps to Connect Oracle Database from Spark? – Examples
- How to Connect Netezza Server from Spark? – Example
- How to Connect Netezza using JDBC Driver and working Examples
Below are the steps to connect Teradata Database from Spark:
Download Teradata JDBC Driver
You need two jar files to use the Teradata JDBC connection string: terajdbc4.jar and tdgssconfig.jar. You can download the latest version of the JDBC driver from the official Teradata download site. Note that newer releases of the driver bundle everything into a single terajdbc4.jar, so tdgssconfig.jar may only be needed with older driver versions.
Add JDBC Driver Path to CLASSPATH
In this step, add the paths of the terajdbc4.jar and tdgssconfig.jar files to the CLASSPATH environment variable. You can either set it manually or add an export statement to your .bashrc or .profile.
export CLASSPATH=$CLASSPATH:$PWD/terajdbc4.jar:$PWD/tdgssconfig.jar
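Alternatively, you can skip editing CLASSPATH and pass the driver jars directly when launching the Spark shell (the paths below assume the jars sit in the current directory):
spark-shell --jars terajdbc4.jar,tdgssconfig.jar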
Connect Teradata Database from Spark
Now you are all set: establish the JDBC connection, read a Teradata table, and store the result in a Spark DataFrame variable.
Teradata JDBC Connection String
Below is the format of the Teradata JDBC connection string:
jdbc:teradata://host_IP/database=db_name,TMODE=TERA
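The driver also accepts additional comma-separated connection parameters, such as the session character set and logon mechanism. The values below are illustrative; check the Teradata JDBC driver documentation for the full list:
jdbc:teradata://host_IP/database=db_name,TMODE=TERA,CHARSET=UTF8,LOGMECH=TD2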
Below is an example command. Change the details as per your Teradata server configuration.
scala> val df = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:teradata://192.168.100.123/database=test_db,TMODE=TERA", "user" -> "dbc", "password" -> "dbc", "dbtable" -> "reason", "driver" -> "com.teradata.jdbc.TeraDriver")).load()
df: org.apache.spark.sql.DataFrame = [r_reason_sk: int, r_reason_id: string, r_reason_desc: string]
scala> df.show()
…
+-----------+----------------+--------------------+
|r_reason_sk| r_reason_id| r_reason_desc|
+-----------+----------------+--------------------+
| 19|AAAAAAAADBAAAAAA|unauthoized purch...|
| 17|AAAAAAAABBAAAAAA|Wrong size ...|
| 7|AAAAAAAAHAAAAAAA|Gift exchange ...|
| 15|AAAAAAAAPAAAAAAA|Not working any m...|
| 26|AAAAAAAAKBAAAAAA|reason 26 ...|
| 34|AAAAAAAACCAAAAAA|reason 34 ...|
| 5|AAAAAAAAFAAAAAAA|Parts missing ...|
| 13|AAAAAAAANAAAAAAA|Found a better pr...|
| 24|AAAAAAAAIBAAAAAA|reason 24 ...|
| 32|AAAAAAAAACAAAAAA|reason 32 ...|
+-----------+----------------+--------------------+
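The examples above use sqlContext from the Spark 1.x shell. On Spark 2.x and later, the same read is done through SparkSession, which the shell exposes as spark. A minimal sketch, assuming the same server and credentials as above:
scala> val df = spark.read.format("jdbc").option("url", "jdbc:teradata://192.168.100.123/database=test_db,TMODE=TERA").option("user", "dbc").option("password", "dbc").option("dbtable", "reason").option("driver", "com.teradata.jdbc.TeraDriver").load()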
Execute Query on Teradata Database Using Spark
You can also execute queries from Spark. Wrap the query in parentheses, give the derived table an alias, and pass it as the dbtable option. Below is the example:
scala> val Query = "(select * from reason) emp"
Query: String = (select * from reason) emp
scala> val df = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:teradata://192.168.100.123/database=test_db,TMODE=TERA", "user" -> "dbc", "password" -> "dbc", "dbtable" -> Query, "driver" -> "com.teradata.jdbc.TeraDriver")).load()
df: org.apache.spark.sql.DataFrame = [r_reason_sk: int, r_reason_id: string, r_reason_desc: string]
scala> df.show()
…
+-----------+----------------+--------------------+
|r_reason_sk| r_reason_id| r_reason_desc|
+-----------+----------------+--------------------+
| 19|AAAAAAAADBAAAAAA|unauthoized purch...|
| 17|AAAAAAAABBAAAAAA|Wrong size ...|
| 7|AAAAAAAAHAAAAAAA|Gift exchange ...|
| 15|AAAAAAAAPAAAAAAA|Not working any m...|
| 26|AAAAAAAAKBAAAAAA|reason 26 ...|
| 34|AAAAAAAACCAAAAAA|reason 34 ...|
| 5|AAAAAAAAFAAAAAAA|Parts missing ...|
| 13|AAAAAAAANAAAAAAA|Found a better pr...|
| 24|AAAAAAAAIBAAAAAA|reason 24 ...|
| 32|AAAAAAAAACAAAAAA|reason 32 ...|
+-----------+----------------+--------------------+
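Because the query runs as a derived table on Teradata, any filtering or aggregation inside it is executed by the database before the rows reach Spark. A sketch with a hypothetical filter on the same reason table:
scala> val Query = "(select r_reason_sk, r_reason_desc from reason where r_reason_sk < 10) t"
scala> val df = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:teradata://192.168.100.123/database=test_db,TMODE=TERA", "user" -> "dbc", "password" -> "dbc", "dbtable" -> Query, "driver" -> "com.teradata.jdbc.TeraDriver")).load()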
Footnote
Always remember to:
- Use the correct Teradata JDBC driver. Check the Teradata download center for the latest version.
- Set up the CLASSPATH environment variable.
- Use the correct details in the JDBC connection string.
Hope this helps 🙂