Apache Spark is one of the emerging big data technologies, thanks to its fast, in-memory, distributed computation, which lets you analyze petabytes of data. In this article, we will check one of the methods to connect to a SQL Server database from a Spark program. Preferably, we will use PySpark to read a SQL Server table.
The connection method is similar to the ones we have already discussed for Oracle, Netezza, Snowflake, Teradata, etc.
Steps to Connect SQL Server From Spark
To access SQL Server from Apache Spark, you can use the JDBC driver provided by Microsoft. The SQL Server JDBC driver lets big data frameworks such as Apache Spark, Databricks, etc. connect to SQL Server.
You can use the following steps to connect to a SQL Server database from Spark:
Download the SQL Server JDBC Driver (sqlserverjdbc.jar)
You need the SQL Server JDBC driver to connect to the database. You can download it from Microsoft's official website; the official jar is typically named mssql-jdbc-<version>.jre<N>.jar, although this article refers to it as sqlserverjdbc.jar.
Add the JDBC Jar File Path to spark.driver.extraClassPath While Creating the Spark Session
To access a SQL Server database using the JDBC driver, you need to include the driver in your Spark application. In this step, add the downloaded JDBC jar to the Spark extraClassPath.
The following code creates a SparkSession with the JDBC jar added to "spark.driver.extraClassPath".
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('SparkConnect') \
    .enableHiveSupport() \
    .config("spark.driver.extraClassPath", "/home/jars/sqlserver/sqlserverjdbc.jar") \
    .getOrCreate()
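Alternatively, if you launch your application with spark-submit, you can pass the driver jar on the command line instead of setting it in code. A sketch using the same jar path as above and a hypothetical script name:

spark-submit \
    --driver-class-path /home/jars/sqlserver/sqlserverjdbc.jar \
    --jars /home/jars/sqlserver/sqlserverjdbc.jar \
    spark_connect.py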
Connect SQL Server From Spark
Now you are all set to read a Microsoft SQL Server table from Apache Spark.
MS SQL Server JDBC Connection String
Following is the connection string that you can use in your PySpark program.
jdbc:sqlserver://<hostname>:<port>;databaseName=<your_databaseName>
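For example, with the hypothetical host and database used in the program below, and the port included explicitly (1433 is the SQL Server default):

jdbc:sqlserver://192.168.2.1:1433;databaseName=NEWDB

Note that recent versions of the Microsoft JDBC driver enable encryption by default. If your server does not present a trusted certificate, you may need to append properties such as encrypt=false or trustServerCertificate=true to the connection string.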
PySpark Program to Access SQL Server from Spark
You can access SQL Server from Spark by specifying the JDBC driver class, the JDBC connection URL, and the connection properties. Once you have established a connection to SQL Server, you can use Spark’s DataFrame API to read and write data from and to the Microsoft SQL Server database. To read data from SQL Server to Spark, you can use the “spark.read.jdbc()” method.
Following is the complete PySpark program:
from pyspark.sql import SparkSession


def connect_to_sql(spark, jdbc_hostname, jdbc_port, database, data_table, username, password):
    # Build the JDBC URL from the hostname, port and database name
    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(jdbc_hostname, jdbc_port, database)
    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
    # Read the SQL Server table into a Spark DataFrame
    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('test') \
        .enableHiveSupport() \
        .config("spark.driver.extraClassPath", "/home/jars/sqlserver/sqlserverjdbc.jar") \
        .getOrCreate()

    df = connect_to_sql(spark, '192.168.2.1', 1433, 'NEWDB', 'dbo.Customers', 'sa', 'Password')
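The connection works in both directions as well. As a minimal sketch, assuming a hypothetical target table dbo.Customers_copy and the same URL and connection properties built inside connect_to_sql(), you can write the DataFrame back with "df.write.jdbc()":

jdbc_url = "jdbc:sqlserver://192.168.2.1:1433;databaseName=NEWDB"
connection_details = {
    "user": "sa",
    "password": "Password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# mode='append' inserts rows into the target table;
# mode='overwrite' drops and recreates it.
df.write.jdbc(url=jdbc_url, table="dbo.Customers_copy", mode="append", properties=connection_details)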
And finally, the following is the output of reading the SQL Server table:
>>> df.select("CustomerID", "CustomerFirstName").show()
+----------+-----------------+
|CustomerID|CustomerFirstName|
+----------+-----------------+
| 1| Stefanie|
| 2| Sandy|
| 3| Lee|
| 4| Regina|
| 5| Daniel|
| 6| Dennis|
| 7| Myra|
| 8| Teddy|
| 9| Annie|
| 10| Herman|
+----------+-----------------+
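Note that the table argument of "spark.read.jdbc()" does not have to be a physical table. You can push a query down to SQL Server by passing a parenthesized subquery with an alias. A sketch reusing the jdbc_url and connection_details defined in the write example above:

# SQL Server executes the subquery; Spark only receives the result
customers_query = "(SELECT CustomerID, CustomerFirstName FROM dbo.Customers WHERE CustomerID <= 10) AS customers"
df_subset = spark.read.jdbc(url=jdbc_url, table=customers_query, properties=connection_details)
df_subset.show()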
Footnote
Always remember to:
- Use the correct JDBC driver. Check Microsoft's official website for the latest version.
- Use the correct JDBC driver class: com.microsoft.sqlserver.jdbc.SQLServerDriver.
- Make sure the driver jar is available; otherwise, you will end up with a "java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver" error.
- Set the CLASSPATH environment variable or add the JDBC driver to spark.driver.extraClassPath.
- Use the correct details in the JDBC connection string.
Related Articles:
- How to Export SQL Server Table to S3 using Spark?
- SQL INITCAP Function Alternative in Azure Synapse and TSQL
- SQL LPAD and RPAD Functions Alternative in Synapse and TSQL
- What are SQL GREATEST and LEAST Functions Alternative in Synapse and TSQL
Hope this helps 🙂