Apache Spark is one of the emerging big data technologies. Thanks to its in-memory, distributed, and fast computation, you can use it for heavy jobs such as analyzing petabytes of data or exporting millions or billions of records from a relational database to cloud storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. In this article, we will check how to export a SQL Server table to an Amazon S3 bucket using Spark. We will use PySpark to demonstrate the method.
In my other article, we have seen how to connect to SQL Server from Spark. This article is somewhat of an extension of that one.
Export SQL Server Table to S3 using Spark
One of the simplest methods to export a SQL Server table to an Amazon S3 bucket using Spark is to use the Spark JDBC library to read the table from SQL Server and then use the Spark built-in support for writing data to S3.
Apache Spark can read and write data in object stores through filesystem connectors. These connectors are implemented either in Hadoop or by the infrastructure providers themselves (such as Amazon, Azure, and Google Cloud). They make object stores look almost like file systems, with directories, files, and the classic operations on them such as list, delete, and rename.
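On many standalone Spark installations the S3A connector ships in the hadoop-aws module, which is not always on the classpath by default. As a minimal sketch (the hadoop-aws version below is an assumption; match it to the Hadoop version bundled with your Spark distribution), you can let Spark pull the connector in when the session is created:
from pyspark.sql import SparkSession

# Sketch: pull in the Hadoop S3A connector when the Spark session starts.
# The hadoop-aws version is an assumption; match it to your Hadoop build.
spark = SparkSession \
    .builder \
    .appName('ExportToS3') \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .getOrCreate()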
Steps to Export SQL Server Table to S3 using Spark
You can use the following steps to export a SQL Server database table to Amazon S3 using Apache Spark:
Download SQL Server sqlserverjdbc.jar JDBC Driver
You need the SQL Server JDBC driver to connect to the database. You can download it from the official Microsoft website.
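If you prefer not to manage the jar file manually, Spark can also resolve the driver from Maven Central at start-up. This is only a sketch, not part of the original steps, and the mssql-jdbc version shown is an assumption; the rest of this article assumes the downloaded jar.
from pyspark.sql import SparkSession

# Sketch: resolve the Microsoft JDBC driver (mssql-jdbc) from Maven Central
# instead of downloading the jar by hand. Pick the version matching your JRE.
spark = SparkSession \
    .builder \
    .appName('ExportToS3') \
    .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:12.4.2.jre11") \
    .getOrCreate()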
Add JDBC Jar File Path to spark.driver.extraClassPath While Creating Spark Context
To access a SQL Server database using the JDBC driver, you need to include the driver in your Spark application. In this step, add the downloaded JDBC jar to the Spark extraClassPath.
The following code creates a SparkSession with the JDBC jar added to “spark.driver.extraClassPath”.
spark = SparkSession \
    .builder \
    .appName('ExportToS3') \
    .enableHiveSupport() \
    .config("spark.driver.extraClassPath", "/home/jars/sqlserver/sqlserverjdbc.jar") \
    .getOrCreate()
Use the JDBC Library to Read the Table from SQL Server
You can use the following sample code to read data from SQL Server.
df = spark.read.format("jdbc").options(
    url="jdbc:sqlserver://<hostname>:<port>;databaseName=<dbname>",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable="<tablename>",
    user="<username>",
    password="<password>"
).load()
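If the table has millions or billions of rows, a single JDBC connection can become the bottleneck. As a hedged sketch, Spark can split the read across parallel connections when you supply a numeric partition column; the column name "id" and the bounds below are assumptions, so substitute a numeric, roughly evenly distributed column from your own table.
# Sketch of a parallel JDBC read for large tables. partitionColumn, lowerBound,
# upperBound and numPartitions below are assumptions -- tune them to your table.
df = spark.read.format("jdbc").options(
    url="jdbc:sqlserver://<hostname>:<port>;databaseName=<dbname>",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable="<tablename>",
    user="<username>",
    password="<password>",
    partitionColumn="id",   # assumed numeric key column
    lowerBound="1",         # minimum value of the partition column
    upperBound="1000000",   # maximum value of the partition column
    numPartitions="8"       # number of parallel JDBC connections
).load()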
Configure Spark to use Hadoop S3A Connector
Make sure to configure Spark to use the Hadoop S3A connector by adding the following lines of code. You can get the access and secret keys from your AWS administrator.
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<Your_Access_Key>")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<Your_Secret_Key>")
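The same credentials can also be supplied through the SparkSession builder, which avoids reaching into the private _jsc attribute; any configuration key prefixed with spark.hadoop. is forwarded to the Hadoop configuration. A minimal sketch:
from pyspark.sql import SparkSession

# Sketch: set the S3A credentials via the builder instead of _jsc.
# "spark.hadoop.*" settings are passed through to the Hadoop configuration.
spark = SparkSession \
    .builder \
    .appName('ExportToS3') \
    .config("spark.hadoop.fs.s3a.access.key", "<Your_Access_Key>") \
    .config("spark.hadoop.fs.s3a.secret.key", "<Your_Secret_Key>") \
    .getOrCreate()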
Use Spark Write Method to Export the DataFrame to Amazon S3
The following code exports the DataFrame to Amazon S3 as CSV files.
df \
    .write \
    .option('fs.s3a.committer.staging.conflict-mode', 'replace') \
    .option("fs.s3a.fast.upload.buffer", "bytebuffer") \
    .mode('overwrite') \
    .csv(path='s3a://demo/tmp2/', sep=',')
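CSV works well for interchange, but if the downstream consumer can read columnar formats, writing Parquet usually produces smaller files and preserves the schema. A minimal sketch (the target prefix below is illustrative, not from the original example):
# Sketch: export the same DataFrame as Snappy-compressed Parquet instead of CSV.
df \
    .write \
    .mode('overwrite') \
    .option('compression', 'snappy') \
    .parquet('s3a://demo/tmp2_parquet/')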
Here is the complete PySpark program that copies a SQL Server table to Amazon S3.
from pyspark.sql import SparkSession
# Connect to SQL Server and return the requested table as a DataFrame.
def connect_to_sql(spark, jdbc_hostname, jdbc_port, database, data_table, username, password):
    # Build the JDBC URL from the hostname, port and database name.
    jdbc_url = "jdbc:sqlserver://" + jdbc_hostname + ":" + str(jdbc_port) + ";databaseName=" + database
    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df
if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('ExportToS3') \
        .enableHiveSupport() \
        .config("spark.driver.extraClassPath", "/home/jars/sqlserver/sqlserverjdbc.jar") \
        .getOrCreate()

    df = connect_to_sql(spark, '192.168.1.1', 1433, 'NEWDB', 'dbo.Customers', 'sa', 'Password')

    # Amazon S3 Integration
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<Your_Access_Key>")
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<Your_Secret_Key>")

    # Export DataFrame to Amazon S3
    df \
        .write \
        .option('fs.s3a.committer.staging.conflict-mode', 'replace') \
        .option("fs.s3a.fast.upload.buffer", "bytebuffer") \
        .mode('overwrite') \
        .csv(path='s3a://demo/tmp2/', sep=',')
Footnote
Always remember to:
- Use the correct JDBC driver. Check the official Microsoft website for the latest version.
- Use the correct JDBC driver class: com.microsoft.sqlserver.jdbc.SQLServerDriver.
- Use the correct driver version; otherwise, you will end up with a java.sql.SQLException connection error.
- Add the JDBC driver jar to spark.driver.extraClassPath.
- Use the correct details in the JDBC connection string. A quick way to verify them is shown after this list.
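As a quick sanity check (a suggestion, not part of the original steps), you can print the schema and fetch a handful of rows before launching the full export:
# Sketch: verify the JDBC connection details before exporting the whole table.
df.printSchema()   # confirms the table was read and shows the column types
df.show(5)         # pulls a few rows to prove the connection actually works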
Related Articles,
- SQL INITCAP Function Alternative in Azure Synapse and TSQL
- SQL LPAD and RPAD Functions Alternative in Synapse and TSQL
- What are SQL GREATEST and LEAST Functions Alternative in Synapse and TSQL
Hope this helps 🙂