Apache Spark is one of the emerging big data technologies, thanks to its fast, in-memory distributed computation. You can analyze petabytes of data using Apache Spark's in-memory distributed computation, and you can connect Spark to all major databases on the market, such as Netezza, Oracle, etc. In this article, we will check one of the methods to connect to a Teradata database from a Spark program. You can connect using either Scala or Python (PySpark); for all examples in this article, we will use Scala to read Teradata tables. You can even execute queries and create Spark DataFrames.
Steps to Connect Teradata Database from Spark
Teradata is one of the widely used MPP databases in the world. Most companies use Teradata as a data warehouse appliance or for transactional systems.
Related Articles:
- Steps to Connect Oracle Database from Spark? – Examples
- How to Connect Netezza Server from Spark? – Example
- How to Connect Netezza using JDBC Driver and working Examples
Below are the steps to connect Teradata Database from Spark:
Download Teradata JDBC Driver
You need two jar files to use the Teradata JDBC connection string: terajdbc4.jar and tdgssconfig.jar. You can download the latest version of the JDBC driver from the official Teradata download site. Note that newer releases of the driver bundle everything into a single terajdbc4.jar, so tdgssconfig.jar may only be needed with older driver versions.
Add JDBC Driver Path to CLASSPATH
In this step, add the paths of the terajdbc4.jar and tdgssconfig.jar files to the CLASSPATH environment variable. You can either set it manually or add an export statement to your .bashrc or .profile.
export CLASSPATH=$CLASSPATH:$PWD/terajdbc4.jar:$PWD/tdgssconfig.jar
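Alternatively, you can skip editing CLASSPATH and pass the driver jars directly when launching the Spark shell (the paths below assume the jars sit in the current directory):
spark-shell --jars terajdbc4.jar,tdgssconfig.jar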
Connect Teradata Database from Spark
Now you are all set: establish the JDBC connection, read a Teradata table, and store the result in a Spark DataFrame variable.
Teradata JDBC Connection String
Below is the format of the Teradata JDBC connection string:
jdbc:teradata://host_IP/database=db_name,TMODE=TERA
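The driver also accepts additional comma-separated connection parameters, such as the session character set and logon mechanism. The values below are illustrative; check the Teradata JDBC driver documentation for the full list:
jdbc:teradata://host_IP/database=db_name,TMODE=TERA,CHARSET=UTF8,LOGMECH=TD2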
Below is an example command. Change the details as per your Teradata server configuration.
scala> val df = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:teradata://192.168.100.123/database=test_db,TMODE=TERA", "user" -> "dbc", "password" -> "dbc", "dbtable" -> "reason", "driver" -> "com.teradata.jdbc.TeraDriver")).load()
df: org.apache.spark.sql.DataFrame = [r_reason_sk: int, r_reason_id: string, r_reason_desc: string]
scala> df.show()
…
+-----------+----------------+--------------------+
|r_reason_sk| r_reason_id| r_reason_desc|
+-----------+----------------+--------------------+
| 19|AAAAAAAADBAAAAAA|unauthoized purch...|
| 17|AAAAAAAABBAAAAAA|Wrong size ...|
| 7|AAAAAAAAHAAAAAAA|Gift exchange ...|
| 15|AAAAAAAAPAAAAAAA|Not working any m...|
| 26|AAAAAAAAKBAAAAAA|reason 26 ...|
| 34|AAAAAAAACCAAAAAA|reason 34 ...|
| 5|AAAAAAAAFAAAAAAA|Parts missing ...|
| 13|AAAAAAAANAAAAAAA|Found a better pr...|
| 24|AAAAAAAAIBAAAAAA|reason 24 ...|
| 32|AAAAAAAAACAAAAAA|reason 32 ...|
+-----------+----------------+--------------------+
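The examples above use sqlContext from the Spark 1.x shell. On Spark 2.x and later, the same read is done through SparkSession, which the shell exposes as spark. A minimal sketch, assuming the same server and credentials as above:
scala> val df = spark.read.format("jdbc").option("url", "jdbc:teradata://192.168.100.123/database=test_db,TMODE=TERA").option("user", "dbc").option("password", "dbc").option("dbtable", "reason").option("driver", "com.teradata.jdbc.TeraDriver").load()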
Execute Query on Teradata Database Using Spark
You can also execute queries from Spark. Wrap the query in parentheses, give the derived table an alias, and pass it as the dbtable option. Below is the example:
scala> val Query = "(select * from reason) emp"
Query: String = (select * from reason) emp
scala> val df = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:teradata://192.168.100.123/database=test_db,TMODE=TERA", "user" -> "dbc", "password" -> "dbc", "dbtable" -> Query, "driver" -> "com.teradata.jdbc.TeraDriver")).load()
df: org.apache.spark.sql.DataFrame = [r_reason_sk: int, r_reason_id: string, r_reason_desc: string]
scala> df.show()
…
+-----------+----------------+--------------------+
|r_reason_sk| r_reason_id| r_reason_desc|
+-----------+----------------+--------------------+
| 19|AAAAAAAADBAAAAAA|unauthoized purch...|
| 17|AAAAAAAABBAAAAAA|Wrong size ...|
| 7|AAAAAAAAHAAAAAAA|Gift exchange ...|
| 15|AAAAAAAAPAAAAAAA|Not working any m...|
| 26|AAAAAAAAKBAAAAAA|reason 26 ...|
| 34|AAAAAAAACCAAAAAA|reason 34 ...|
| 5|AAAAAAAAFAAAAAAA|Parts missing ...|
| 13|AAAAAAAANAAAAAAA|Found a better pr...|
| 24|AAAAAAAAIBAAAAAA|reason 24 ...|
| 32|AAAAAAAAACAAAAAA|reason 32 ...|
+-----------+----------------+--------------------+
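Because the query runs as a derived table on Teradata, any filtering or aggregation inside it is executed by the database before the rows reach Spark. A sketch with a hypothetical filter on the same reason table:
scala> val Query = "(select r_reason_sk, r_reason_desc from reason where r_reason_sk < 10) t"
scala> val df = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:teradata://192.168.100.123/database=test_db,TMODE=TERA", "user" -> "dbc", "password" -> "dbc", "dbtable" -> Query, "driver" -> "com.teradata.jdbc.TeraDriver")).load()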
Footnote
Always remember to:
- Use the correct Teradata JDBC driver. Check the Teradata download center for the latest version.
- Set up the CLASSPATH environment variable.
- Use the correct details in the JDBC connection string.
Hope this helps 🙂