Steps to Connect Oracle Database from Spark – Examples

  • Post last modified: February 15, 2023
  • Post category: Apache Spark

Apache Spark is one of the emerging big data technologies, thanks to its fast, in-memory distributed computation. You can analyze petabytes of data using Apache Spark's in-memory distributed processing. In this article, we will check one of the methods to connect to an Oracle database from a Spark program. We will use Scala to read Oracle tables. You can even execute queries and create Spark DataFrames.

Steps to Connect Oracle Database from Spark

Oracle Database is one of the most widely used databases in the world. Many companies use Oracle as a data warehouse appliance or for transactional systems.

Spark represents data in the form of DataFrames. You can create a DataFrame from local file system or HDFS files.


Below are the steps to connect Oracle Database from Spark:

Download Oracle ojdbc6.jar JDBC Driver

You need an Oracle JDBC driver to connect to the Oracle server. This article uses the ojdbc6.jar driver; check the Oracle download center for the version that matches your Java and Oracle Database versions (for example, ojdbc8.jar for Java 8).

You can download the driver from the official Oracle website. You will need to create an Oracle account first if you do not already have one.

Add ojdbc6.jar File Path to CLASSPATH

In this step, add the ojdbc6.jar file path to the CLASSPATH environment variable. You can either set it manually or add an export statement to your .bashrc or .profile.

export CLASSPATH=$PWD/ojdbc6.jar
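For example, you can extend an existing CLASSPATH rather than overwrite it, and append the export line to your .bashrc so it persists across sessions (the jar path below is an example; point it at wherever you saved the driver):

```shell
# Extend CLASSPATH with the driver jar (path is illustrative)
export CLASSPATH="$CLASSPATH:/opt/oracle/ojdbc6.jar"
echo "$CLASSPATH"

# To persist across sessions, append the export line to ~/.bashrc:
# echo 'export CLASSPATH=$CLASSPATH:/opt/oracle/ojdbc6.jar' >> ~/.bashrc
```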

Connect Oracle Database from Spark

Now you are all set. Just establish a JDBC connection, read the Oracle table, and store it as a DataFrame.

Oracle JDBC Connection String

Below is the connection string that you can use in your Scala program.

jdbc:oracle:thin:@host_IP:port_number:SID
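As a quick illustration, the pieces of the thin-driver URL can be assembled with a small helper (the helper name and the host, port, and SID values are made-up examples):

```python
def oracle_thin_url(host: str, port: int, sid: str) -> str:
    """Build an Oracle thin-driver JDBC URL: jdbc:oracle:thin:@host:port:SID."""
    return f"jdbc:oracle:thin:@{host}:{port}:{sid}"

# Example values -- replace with your Oracle server details
url = oracle_thin_url("192.168.100.200", 1521, "ORCL")
print(url)  # jdbc:oracle:thin:@192.168.100.200:1521:ORCL
```

Note that when connecting by service name rather than SID, the thin URL uses a slash instead of a colon: jdbc:oracle:thin:@//host:port/service_name.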

Below is an example; change the details as per your Oracle server configuration. (This example uses the Spark 1.x `sqlContext`; on Spark 2.x and later, use `spark.read` instead.)

scala> val df = sqlContext.read.format("jdbc").
     |   options(Map(
     |     "url" -> "jdbc:oracle:thin:@192.168.100.200:1521:ORCL",
     |     "user" -> "USER1",
     |     "password" -> "password",
     |     "dbtable" -> "employee",
     |     "driver" -> "oracle.jdbc.OracleDriver")).
     |   load()
df: org.apache.spark.sql.DataFrame = [EID: decimal(10,0), NAME: string, SALARY: string, DESTINATION: string, EMAIL: string, TEXT: string, PHONENUMBER: string, STATE: string]
scala> df.show()

+----+-----------+------+-----------------+
| EID|       NAME|SALARY|      DESTINATION|
+----+-----------+------+-----------------+
|1201|      Gopal| 45000|Technical manager|
|1202|    Manisha| 45000|     Proof reader|
|1203|Masthanvali| 40000| Technical writer|
|1204|      Kiran| 40000|         Hr Admin|
|1205|    Kranthi| 30000|         Op Admin|
|1206|     Kiran1| 40000|         Hr Admin|
|1207|   Kranthi1| 30000|         Op Admin|
+----+-----------+------+-----------------+

Execute Query on Oracle Database Using Spark

You can also push a query down to Oracle and load only its result into Spark. Below is an example:

scala> val query = "(select EID, NAME, state from employee) emp"
query: String = (select EID, NAME, state from employee) emp

scala> val df = sqlContext.read.format("jdbc").
     |   options(Map(
     |     "url" -> "jdbc:oracle:thin:@192.168.41.93:1521:ORCL",
     |     "user" -> "ILABUSER",
     |     "password" -> "impetus",
     |     "dbtable" -> query,
     |     "driver" -> "oracle.jdbc.driver.OracleDriver")).
     |   load()
df: org.apache.spark.sql.DataFrame = [EID: decimal(10,0), NAME: string, STATE: string]

scala> df.show()
...
+----+-----------+----------+
| EID|       NAME|     STATE|
+----+-----------+----------+
|1201|      Gopal|   Alabama|
|1202|    Manisha|California|
|1203|Masthanvali|  Maryland|
|1204|      Kiran|  Michigan|
|1205|    Kranthi|New Jersey|
|1206|     Kiran1|   Alabama|
|1207|   Kranthi1|  Maryland|
+----+-----------+----------+
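The `(select …) emp` pattern above works because Spark substitutes the `dbtable` value into a `SELECT … FROM <dbtable>` statement, so a subquery must carry an alias to remain valid SQL. A small helper can sketch this wrapping (the function and alias names are our own):

```python
def as_dbtable(query: str, alias: str = "subq") -> str:
    """Wrap a SELECT query so it can be passed as the 'dbtable' option."""
    return f"({query.strip().rstrip(';')}) {alias}"

print(as_dbtable("select EID, NAME, state from employee;", "emp"))
# (select EID, NAME, state from employee) emp
```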

Steps to Connect Oracle Database from PySpark

You can add the Oracle JDBC jar to the spark-submit command while executing PySpark code to connect to an Oracle database.

Following is the PySpark example.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('SparkConnect') \
    .enableHiveSupport() \
    .getOrCreate()

connection_details = {
    "user": "orauser",
    "password": "Password@123",
    "driver": "oracle.jdbc.OracleDriver",
}
jdbc_url = "jdbc:oracle:thin:@192.168.10.48:1521:ORCL"

df = spark.read.jdbc(url=jdbc_url, table="SALE_VIEW", properties=connection_details)
df.show()
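Hard-coding a password in the script is risky. One sketch, using environment variables (the `ORA_USER`/`ORA_PASSWORD` names are our own choice, and the fallback values are placeholders), keeps the credentials out of the source:

```python
import os

# Read credentials from the environment instead of hard-coding them;
# fall back to placeholders so the sketch runs anywhere.
connection_details = {
    "user": os.environ.get("ORA_USER", "orauser"),
    "password": os.environ.get("ORA_PASSWORD", "change_me"),
    "driver": "oracle.jdbc.OracleDriver",
}
# Then, as above:
# df = spark.read.jdbc(url=jdbc_url, table="SALE_VIEW", properties=connection_details)
```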

You can execute the above PySpark code using spark-submit:

spark-submit --jars /home/jars/oracle/ojdbc7.jar /home/spark_examples/pyspark_oracle.py

Foot Note

Always remember to:

  • Use the correct JDBC driver; check the Oracle download center for the latest version.
  • Use correct details in the JDBC connection string; a wrong host, port, or SID typically results in a "java.sql.SQLException: The Network Adapter could not establish the connection" error.
  • Set up the CLASSPATH environment variable.
  • Use spark-submit with the --jars option to provide the JDBC driver jar.


Hope this helps 🙂