Spark SQL INSERT INTO Table VALUES issue and Alternatives

  • Post last modified:July 7, 2019
  • Post category:Apache Spark
  • Reading time:5 mins read

Spark SQL is gaining popularity because of its fast distributed framework. Spark SQL is fast compared to Apache Hive. You can create tables in the Spark warehouse as explained in the Spark SQL introduction, or connect to a Hive metastore and work on Hive tables. Not all Hive syntax is supported in Spark SQL; one such unsupported statement is Spark SQL INSERT INTO Table VALUES.

You cannot use the INSERT INTO table VALUES option in Spark. We will discuss the alternative approaches with some examples.

Spark SQL INSERT INTO Table VALUES

As mentioned in the introductory paragraph, INSERT INTO tables with the VALUES option is not supported in Spark SQL as of now. One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. You can manipulate the data present in Hive tables using either the Spark JDBC driver or by creating a HiveContext to connect to an existing Hive database.

As you may be aware, HiveQL supports a SQL-like statement to insert a single row into a table, i.e. the INSERT INTO table VALUES option. Below is the HiveQL syntax to insert a single row into a table:

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

But unfortunately, this syntax does not work in Spark SQL.

Let us discuss the possible alternative approaches to replace this HiveQL syntax.

Spark SQL INSERT INTO Table VALUES Alternative

There are two alternative approaches that you can use to replace the HiveQL method of inserting a single row into a table.

  • Use INSERT INTO table SELECT Syntax
  • Use DataFrame insertInto Option

Now let us discuss these two methods with an example.

INSERT INTO table SELECT Syntax

This is one of the easiest methods to insert a record into a table. The syntax is very similar to the HiveQL version, but with a SELECT clause instead of VALUES.

Below is the syntax:

INSERT INTO tableName SELECT t.* FROM (SELECT value1, value2, ...) t;

For example:

sqlContext.sql("insert into sample_tab1 select t.* from (select '6','FFF') t")
DataFrame insertInto Option

This approach requires the input data to be a Spark DataFrame. It is one of the fastest approaches for inserting data into the target table.

Below are the steps:

  • Create Input Spark DataFrame

You can create a Spark DataFrame using the createDataFrame method.

df = sqlContext.createDataFrame([(10, 'ZZZ')],["id", "name"])
  • Write DataFrame Value to Target table

You can write the DataFrame to the target table using the insertInto option.

df.write.insertInto('testtdb.sample_tab1', overwrite=False)


Hope this helps 🙂