Connect to SQL Server From Spark – PySpark

Due to its in-memory, distributed, and fast computation, Apache Spark is one of the emerging big data technologies. Apache Spark's in-memory distributed computation allows you to analyze petabytes of data without performance issues. In this article, we will check one of the methods to connect to a SQL Server database from a Spark program. Preferably, we will use PySpark to read a SQL Server table.

The connection method is similar to the ones we have already discussed for Oracle, Netezza, Snowflake, Teradata, etc.

Steps to Connect SQL Server From Spark

To access SQL Server from Apache Spark, you can use the JDBC connector provided by Microsoft. The SQL Server JDBC connector allows you to connect from big data frameworks such as Apache Spark, Databricks, etc.

You can use the following steps to connect to a SQL Server database from Spark:

Download SQL Server sqlserverjdbc.jar JDBC Driver

You need the SQL Server JDBC driver to connect to the database. You can download this driver from the official Microsoft website.

Add JDBC Jar File Path to spark.driver.extraClassPath While Creating Spark Context

To access a SQL Server database using the JDBC driver, you will need to include the driver in your Spark application. In this step, add the downloaded JDBC jar to the Spark extraClassPath.

The following code creates a Spark session by adding the JDBC jar to “spark.driver.extraClassPath”.

from pyspark.sql import SparkSession

# Create a Spark session with the SQL Server JDBC jar on the driver classpath
spark = SparkSession \
    .builder \
    .appName('SparkConnect') \
    .enableHiveSupport() \
    .config("spark.driver.extraClassPath", "/home/jars/sqlserver/sqlserverjdbc.jar") \
    .getOrCreate()
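
Alternatively, you can pass the jar through the “spark.jars” configuration, which adds it to both the driver and executor classpaths. A minimal sketch, assuming the same jar path as above:

from pyspark.sql import SparkSession

# Ship the JDBC jar to the driver and executors via spark.jars
spark = SparkSession \
    .builder \
    .appName('SparkConnect') \
    .config("spark.jars", "/home/jars/sqlserver/sqlserverjdbc.jar") \
    .getOrCreate()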

Connect SQL Server From Spark

Now, you are all set to read a Microsoft SQL Server table from Apache Spark.

MS SQL Server JDBC Connection String

Following is the connection string that you can use in your PySpark program.

jdbc:sqlserver://hostname;databaseName=<your_databaseName>
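
The hostname can optionally be followed by a port (SQL Server listens on 1433 by default). For example, the URL can be built in Python as follows; the host, port, and database name below are placeholder values:

# Placeholder connection details -- replace with your own
jdbc_hostname = "192.168.2.1"
jdbc_port = 1433               # default SQL Server port
database = "NEWDB"

# Produces: jdbc:sqlserver://192.168.2.1:1433;databaseName=NEWDB
jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(jdbc_hostname, jdbc_port, database)
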
PySpark Program to Access SQL Server from Spark

You can access SQL Server from Spark by specifying the JDBC driver class, the JDBC connection URL, and the connection properties. Once you have established a connection to SQL Server, you can use Spark’s DataFrame API to read and write data from and to the Microsoft SQL Server database. To read data from SQL Server to Spark, you can use the “spark.read.jdbc()” method.
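
Writing works the same way through the DataFrame API. A minimal sketch of “df.write.jdbc()”, using hypothetical connection details and a hypothetical target table:

# Hypothetical connection details for illustration
jdbc_url = "jdbc:sqlserver://192.168.2.1;databaseName=NEWDB"
connection_details = {
    "user": "sa",
    "password": "Password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Append rows of an existing DataFrame "df" to a (hypothetical) SQL Server table
df.write.jdbc(url=jdbc_url, table="dbo.Customers_copy", mode="append", properties=connection_details)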

Following is the complete PySpark program:

from pyspark.sql import SparkSession

def connect_to_sql(spark, jdbc_hostname, jdbc_port, database, data_table, username, password):
    # Build the JDBC URL from the hostname, port, and database name
    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(jdbc_hostname, jdbc_port, database)
    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
    # Read the SQL Server table into a Spark DataFrame
    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('test') \
        .enableHiveSupport() \
        .config("spark.driver.extraClassPath", "/home/jars/sqlserver/sqlserverjdbc.jar") \
        .getOrCreate()

    df = connect_to_sql(spark, '192.168.2.1', 1433, 'NEWDB', 'dbo.Customers', 'sa', 'Password')

And finally, following is the output:

>>> df.select("CustomerID", "CustomerFirstName").show()
+----------+-----------------+
|CustomerID|CustomerFirstName|
+----------+-----------------+
|         1|         Stefanie|
|         2|            Sandy|
|         3|              Lee|
|         4|           Regina|
|         5|           Daniel|
|         6|           Dennis|
|         7|             Myra|
|         8|            Teddy|
|         9|            Annie|
|        10|           Herman|
+----------+-----------------+
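
If you only need a subset of the table, you can push the filtering down to SQL Server instead of pulling the whole table: the “table” argument of “spark.read.jdbc()” also accepts a parenthesized subquery with an alias. A minimal sketch, reusing the hypothetical connection details built in connect_to_sql above:

# Push the filter down to SQL Server by passing a subquery as the "table"
# (jdbc_url and connection_details as built in connect_to_sql above)
query = "(SELECT CustomerID, CustomerFirstName FROM dbo.Customers WHERE CustomerID <= 10) AS customers"
df_subset = spark.read.jdbc(url=jdbc_url, table=query, properties=connection_details)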

Foot Note

Always remember to:

  • Use the correct JDBC driver. Check the official Microsoft website for the latest version.
  • Use the correct JDBC driver class: com.microsoft.sqlserver.jdbc.SQLServerDriver.
  • Use the correct JDBC driver; otherwise, you will end up with a “java.sql.SQLException: The Network Adapter could not establish the connection” error.
  • Set up the CLASSPATH environment variable or add the JDBC driver to spark.driver.extraClassPath.
  • Use the correct details in the JDBC connection string.

Hope this helps 🙂