How to Create Redshift Table from DataFrame using Python

As you know, Python is one of the widely used Programming languages for the data analysis, data science and machine learning. When analyzing data using Python, you will use Numpy and Pandas extensively. In this article, we will check how to create Redshift table from DataFrame in Python. Here DataFrame is actually referred to pandas not Spark. I will write another article on how to create tables out of Spark DataFrame, but for now let us stick to pandas df.

Python Pandas DataFrame

The pandas DataFrame’s are really very useful when you are working on the non-numeric values. You can view your data in the form of rows and columns just like relational database and that allows you to view data in a more structured format.

For example, consider below pandas dataFrame.

df = pd.DataFrame([{'Name': 'Ram', 'Age': 50},
                   {'Name': 'Bhim', 'Age': 23},
                   {'Name': 'Shyam', 'Age': 25}])

print(df)

   Age   Name
0   50    Ram
1   23   Bhim
2   25  Shyam

As you can see, data is organized into rows and columns.

Create Redshift Table from DataFrame using Python

As mentioned in the previous section, Pandas DataFrame organize your data into rows and column format.

Now, the question is there any way we can create a table out of pandas dataFrame?

Yes, you can dump pandas data frame into relational database table. For simplicity, we will be using Redshift as a relational database for our demonstration.

Pandas data from provides many useful methods. One of such methods is to_sql, you can use to_sql to push dataFrame data to a Redshift database. In this article, we have used SQLAlchemy engine to create an active database connection to Redshift.

Note that, you might have to install psycopg2 to connect to Redshift.

Pandas DataFrame to_sql options

Below are the some of the important to_sql options that you should take care of.

Set index = False
if_exists = ‘replace’ – The table will be created if it doesn’t exist, and you can specify if you want you call to replace the table, append to the table, or fail if the table already exists.

Example to Create Redshift Table from DataFrame using Python

Below is a working example that will create Redshift table from pandas DataFrame.

from sqlalchemy import create_engine
import pandas as pd

conn = create_engine('postgresql://username:password@redshift-cluster.com:5439/database')

df = pd.DataFrame([{'Name': 'Ram', 'Age': 50},
                   {'Name': 'Bhim', 'Age': 23},
                   {'Name': 'Shyam', 'Age': 25}])

df.to_sql('test_table', conn, index=False, if_exists='replace')

Now, verify in your Redshift cluster for created table.

databse=# select * from test_table;
 age | name
-----+-------
  50 | Ram
  23 | Bhim
  25 | Shyam
(3 rows)

Hope this helps 🙂