Apache Spark is one of the most actively developed open-source frameworks. Many e-commerce, data analytics, and travel companies use Spark to analyze huge amounts of data as quickly as possible. Because of its in-memory computation, Apache Spark can return results 10 to 100 times faster than Hive. In this article, we will look at how to save a Spark DataFrame as a Hive table, with some examples.
How to Save Spark DataFrame as Hive Table?
Because of its in-memory computation, Spark is often used for complex processing. If you have a requirement to save a Spark DataFrame as a Hive table, you can follow the steps below to create a Hive table out of the Spark DataFrame.
Create Test Data Set
Let us create a sample Apache Spark DataFrame that we want to store in a Hive table.
For example, consider the sampleDF data frame below.
# Create Sample DataFrame
sampleDF = sqlContext.createDataFrame(
    [(1, 111),
     (2, 111),
     (3, 222),
     (4, 222),
     (5, 222),
     (6, 111),
     (7, 333),
     (8, 444)],
    ["std_id", "dept_id"])

sampleDF.show()
+------+-------+
|std_id|dept_id|
+------+-------+
| 1| 111|
| 2| 111|
| 3| 222|
| 4| 222|
| 5| 222|
| 6| 111|
| 7| 333|
| 8| 444|
+------+-------+
As you can see, we have created a Spark data frame with two columns that hold the student id and the department id.
Use DataFrame Writer to Save Spark DataFrame as a Hive Table
The next step is to use the DataFrame writer to save the DataFrame as a Hive table. This method works on older versions of Apache Spark as well, so it is the better choice if you want your application to stay backward compatible.
Note that we have added the hive-site.xml file to the Spark conf folder so that the Spark or PySpark shell connects to the Hive metastore automatically.
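If you are on Spark 2.x, you can also enable Hive support programmatically instead of relying on the shell. Below is a minimal sketch, assuming PySpark 2.x; the application name is a placeholder, and hive-site.xml (or equivalent metastore settings) must still be visible to Spark.

# Minimal sketch (Spark 2.x): build a Hive-enabled SparkSession explicitly.
# The application name is a placeholder used only for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("save-dataframe-as-hive-table") \
    .enableHiveSupport() \
    .getOrCreate()

# spark.createDataFrame() and spark.sql() can then be used in place of
# the sqlContext calls shown in this article.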
For example, the statement below stores the sampleDF data frame to Hive.
sampleDF.write.saveAsTable('newtest.sampleStudentTable')
Now, go to the newtest database in Hive and check whether the table has been created.
select * from newtest.sampleStudentTable;
+---------+----------+--+
| std_id | dept_id |
+---------+----------+--+
| 1 | 111 |
| 2 | 111 |
| 3 | 222 |
| 4 | 222 |
| 5 | 222 |
| 6 | 111 |
| 7 | 333 |
| 8 | 444 |
+---------+----------+--+
8 rows selected (0.186 seconds)
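Note that saveAsTable fails by default if the target table already exists. If you rerun the job, you can pass a save mode; the sketch below overwrites the table and stores it as Parquet, which is just one possible choice.

# Sketch: control what happens when the table already exists.
# "overwrite" replaces the table; "append" would add rows instead.
sampleDF.write \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable("newtest.sampleStudentTable")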
createOrReplaceTempView Method
The createOrReplaceTempView method is another option if you are on a newer Spark version. This method has been available in Apache Spark since version 2.0.0.
sampleDF.createOrReplaceTempView("sampleStudentTable")
Then use Spark SQL to create the table in Hive.
For example, consider the statement below.
sqlContext.sql("create table newtest.sampleStudentTable as select * from sampleStudentTable");
Related Articles,
- Spark Dataset Join Operators using Pyspark – Examples
- How to Update Spark DataFrame Column Values using Pyspark?
Hope this helps 🙂