In my other article, How to Create Redshift Table from DataFrame using Python, we have seen how to create a Redshift table from a Python Pandas DataFrame. In this article, we will check how to export a Spark DataFrame to a Redshift table.
Export Spark DataFrame to Redshift Table
Apache Spark is fast because of its in-memory computation. It is common practice to use Spark as an execution engine to process huge amounts of data. Sometimes you may get a requirement to export the processed data back to Redshift for reporting.
We are going to use a JDBC driver to write data from a Spark DataFrame to database tables. In the subsequent sections, we will explore the method to write a Spark DataFrame to a Redshift table.
Download Redshift JDBC Driver
You can download this driver from the official website. Go ahead and download the JDBC driver appropriate for your setup.
Add JDBC Driver to CLASSPATH
There are two methods that you can follow to add a Redshift JDBC driver to the CLASSPATH. The first is to export the jar in the CLASSPATH environment variable:
export CLASSPATH=$PWD/RedshiftJDBC42-1.1.17.1017.jar
Sometimes, Spark will not recognize the driver class when you export it in the CLASSPATH. In that case, another option is to add the jar files when you start the pyspark shell.
Start PySpark by Providing Jar Files
Pass the jar file paths on the command line when you start the pyspark shell:
pyspark --conf spark.executor.extraClassPath=/path/RedshiftJDBC42-1.1.17.1017.jar --driver-class-path /path/RedshiftJDBC42-1.1.17.1017.jar --jars /path/RedshiftJDBC42-1.1.17.1017.jar
This will work as expected. But when you are working on an application, you should add the jars to the Spark config instead, as in the sketch below.
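For instance, a minimal sketch of that approach could look like the following. The jar path is an assumption carried over from the commands above; replace it with the actual location on your machines.

from pyspark.sql import SparkSession

# Point Spark at the Redshift JDBC jar through config instead of the command line.
# The jar path below is an example; use the path where you placed the driver.
spark = SparkSession.builder \
    .appName("dftoRedshift") \
    .config("spark.jars", "/path/RedshiftJDBC42-1.1.17.1017.jar") \
    .config("spark.driver.extraClassPath", "/path/RedshiftJDBC42-1.1.17.1017.jar") \
    .config("spark.executor.extraClassPath", "/path/RedshiftJDBC42-1.1.17.1017.jar") \
    .getOrCreate()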
Enable Hive Support
This is an optional step. We enable Hive support to read data from a Hive table and create a test DataFrame. We will export the same test DataFrame to a Redshift table.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("dftoRedshift").enableHiveSupport().getOrCreate()
Create Test DataFrame
You can use Spark SQL to read a Hive table and create the test DataFrame that we are going to load into the Redshift table.
>>> testDf = spark.sql("select * from default.Test")
>>> testDf.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| a| b| v|
+----+----+----+
Example to Export Spark DataFrame to Redshift Table
Now the environment is set and the test DataFrame is created. We can use the dataframe.write method to load the DataFrame into Redshift tables.
For example, the following piece of code will establish a JDBC connection with the Redshift cluster and load the DataFrame content into the table.
testDf.write.format('jdbc').options(
    url='jdbc:redshift://testredshift.us-east-1.redshift.amazonaws.com:5439/dev',
    driver='com.amazon.redshift.jdbc42.Driver',
    dbtable='public.df_load_test',
    user='redshiftuser',
    password='Password').mode('append').save()
Note that the mode is ‘append’. It will insert records into the existing table.
Check Redshift Table
If the previous command executed successfully, you can check the Redshift table to confirm that the data was loaded. One option is to read the table back from the same Spark session, as in the sketch below.
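A minimal sketch, reusing the example connection details from above, reads the table back over JDBC and displays it:

# Read the loaded table back over JDBC to confirm the rows arrived.
# Connection details are the same example values used in the write above.
checkDf = spark.read.format('jdbc').options(
    url='jdbc:redshift://testredshift.us-east-1.redshift.amazonaws.com:5439/dev',
    driver='com.amazon.redshift.jdbc42.Driver',
    dbtable='public.df_load_test',
    user='redshiftuser',
    password='Password').load()
checkDf.show()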
Create Redshift Table from Spark DataFrame Example
You can also create the table from the Spark DataFrame if it is not already present.
Consider the following example.
testDf.write.format('jdbc').options(
    url='jdbc:redshift://testredshift.us-east-1.redshift.amazonaws.com:5439/dev',
    driver='com.amazon.redshift.jdbc42.Driver',
    dbtable='public.df_load_test',
    user='redshiftuser',
    password='Password').mode('overwrite').save()
Note that the mode is ‘overwrite‘. If the table is already present, Spark will by default drop and recreate it before inserting the records; otherwise, it will create a new table.
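If you would rather keep the existing table definition and only replace its rows, Spark's JDBC writer also accepts a truncate option. A sketch, assuming the same example table as above (whether truncate is honored depends on the target database dialect):

# Keep the existing table and only replace its rows on overwrite
# by setting the JDBC 'truncate' option.
testDf.write.format('jdbc').options(
    url='jdbc:redshift://testredshift.us-east-1.redshift.amazonaws.com:5439/dev',
    driver='com.amazon.redshift.jdbc42.Driver',
    dbtable='public.df_load_test',
    user='redshiftuser',
    password='Password',
    truncate='true').mode('overwrite').save()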
Do let me know in the comment section if you are using a different method.
Hope this helps 🙂