How to Export Spark DataFrame to Redshift Table


In my other article, How to Create Redshift Table from DataFrame using Python, we saw how to create a Redshift table from a Python Pandas DataFrame. In this article, we will check how to export a Spark DataFrame to a Redshift table.

Export Spark DataFrame to Redshift Table

Apache Spark is fast because of its in-memory computation. It is common practice to use Spark as an execution engine to process huge amounts of data. Sometimes, you may get a requirement to export the processed data back to Redshift for reporting.

We are going to use a JDBC driver to write data from a Spark DataFrame to database tables. In the subsequent sections, we will explore the method to write a Spark DataFrame to a Redshift table.

Download Redshift JDBC Driver

You can download this driver from the official AWS website. Go ahead and download the appropriate JDBC driver.

Add JDBC Driver to CLASSPATH

There are two methods that you can follow to add the Redshift JDBC driver to the CLASSPATH. The first is to export the jar file on the CLASSPATH:

export CLASSPATH=$PWD/RedshiftJDBC42-1.1.17.1017.jar

Sometimes, Spark will not recognize the driver class when you export it in the CLASSPATH. In that case, another option is to add the jar file when you start the pyspark shell.

Start Pyspark by providing jar files

Another method is to add the jar file when you start the pyspark shell.

pyspark --conf spark.executor.extraClassPath=/path/RedshiftJDBC42-1.1.17.1017.jar --driver-class-path /path/RedshiftJDBC42-1.1.17.1017.jar --jars /path/RedshiftJDBC42-1.1.17.1017.jar	

This will work as expected. But when you are working on a Spark application, you should add the jars to the Spark configuration instead, as shown in the sketch below.
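For example, when building the SparkSession in application code, you can point the session at the driver jar through the configuration. This is a minimal sketch; the jar path is a placeholder for wherever you saved the driver.

from pyspark.sql import SparkSession

# Placeholder path; adjust to the location of the downloaded driver jar
redshift_jar = '/path/RedshiftJDBC42-1.1.17.1017.jar'

spark = SparkSession.builder \
    .appName('dftoRedshift') \
    .config('spark.jars', redshift_jar) \
    .config('spark.driver.extraClassPath', redshift_jar) \
    .config('spark.executor.extraClassPath', redshift_jar) \
    .getOrCreate()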

Enable Hive Support

This is an optional step. We enable Hive support to read data from a Hive table and create a test DataFrame. We will export the same test DataFrame to a Redshift table.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("dftoRedshift").enableHiveSupport().getOrCreate()

Create Test DataFrame

You can use Spark SQL to read the Hive table and create the test DataFrame that we are going to load into the Redshift table.

>>> testDf = spark.sql("select * from default.Test")
>>> testDf.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   a|   b|   v|
+----+----+----+
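If you do not have a Hive table handy, a quick alternative is to build the same test DataFrame directly in the shell with spark.createDataFrame. This sketch reproduces the columns and the row shown above:

>>> testDf = spark.createDataFrame([('a', 'b', 'v')], ['col1', 'col2', 'col3'])
>>> testDf.show()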

Example to Export Spark DataFrame to Redshift Table

Now the environment is set and the test DataFrame is created. We can use the DataFrame.write method to load the DataFrame into Redshift tables.

For example, the following piece of code will establish a JDBC connection with the Redshift cluster and load the DataFrame content into the table.

testDf.write.format('jdbc').options(
      url='jdbc:redshift://testredshift.us-east-1.redshift.amazonaws.com:5439/dev',	  
      driver='com.amazon.redshift.jdbc42.Driver',
      dbtable='public.df_load_test',
      user='redshiftuser',
      password='Password').mode('append').save() 

Note that the mode is ‘append’. It will insert records into the existing table.
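As an alternative to the format('jdbc').options(...) style, the same append can be expressed with the DataFrameWriter.jdbc method, passing the driver and credentials in a properties dictionary. This is a sketch that reuses the placeholder URL, table name and credentials from the example above.

testDf.write.jdbc(
      url='jdbc:redshift://testredshift.us-east-1.redshift.amazonaws.com:5439/dev',
      table='public.df_load_test',
      mode='append',
      properties={'user': 'redshiftuser',
                  'password': 'Password',
                  'driver': 'com.amazon.redshift.jdbc42.Driver'})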

Check Redshift Table

If the previous command executed successfully, you can check the Redshift table to see if the data is loaded.
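For example, you can read the table back over the same JDBC connection from the shell. This is a minimal sketch that reuses the placeholder connection details from the write example:

spark.read.format('jdbc').options(
      url='jdbc:redshift://testredshift.us-east-1.redshift.amazonaws.com:5439/dev',
      driver='com.amazon.redshift.jdbc42.Driver',
      dbtable='public.df_load_test',
      user='redshiftuser',
      password='Password').load().show()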


Create Redshift Table from Spark DataFrame Example

You can also create the table from the Spark DataFrame if it is not already present.

For example, consider the following snippet.

testDf.write.format('jdbc').options(
      url='jdbc:redshift://testredshift.us-east-1.redshift.amazonaws.com:5439/dev',	  
      driver='com.amazon.redshift.jdbc42.Driver',
      dbtable='public.df_load_test',
      user='redshiftuser',
      password='Password').mode('overwrite').save() 

Note that the mode is ‘overwrite‘. If the table is present, it will be replaced with the contents of the DataFrame. Otherwise, it will create a new table.
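By default, Spark's JDBC writer implements overwrite by dropping and recreating the table. If you would rather keep the existing table definition and only empty it before loading, the JDBC data source supports a truncate option. A sketch, again using the placeholder connection details from above:

testDf.write.format('jdbc').options(
      url='jdbc:redshift://testredshift.us-east-1.redshift.amazonaws.com:5439/dev',
      driver='com.amazon.redshift.jdbc42.Driver',
      dbtable='public.df_load_test',
      user='redshiftuser',
      password='Password',
      truncate='true').mode('overwrite').save()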

Do let me know in the comment section if you are using a different method.

Hope this helps 🙂