What is SQL Cursor Alternative in Spark SQL?

  • Post last modified:May 22, 2019
  • Post category:Apache Spark

A SQL cursor is a database object used to retrieve data from a result set one row at a time. You can also think of a cursor as a temporary workspace created in database system memory while a SQL query executes. Because a cursor returns one row at a time, you can perform calculations on each returned row.

Cursors are usually written in a SQL procedural language such as Oracle PL/SQL or Netezza NZPL/SQL.

SQL Cursor Example

Below is a sample Oracle PL/SQL procedure with a cursor defined:

CREATE OR REPLACE PROCEDURE Sample_proc 
IS 
  str1 VARCHAR2(20); 
  str2 INT; 
  str3 INT; 
  CURSOR sample_cur IS 
    SELECT empid, 
           ename, 
           deptid 
    FROM   emp1; 
BEGIN 
    OPEN sample_cur; 

    -- empid -> str2, ename -> str1, deptid -> str3
    FETCH sample_cur INTO str2, str1, str3; 

    CLOSE sample_cur; 
END; 

We will refer to this sample cursor in the subsequent examples.

Does Spark SQL Support Cursors?

As of now, Spark SQL does not support cursors. If you use the Spark JDBC connection, the driver creates cursors to hold intermediate query results, but those are different from actual SQL cursors. In the subsequent sections, we will look at possible SQL cursor alternatives in Spark SQL.


What is SQL Cursor Alternative in Spark SQL?

One of the easiest ways is to create a DataFrame and iterate through its rows, which is equivalent to iterating through a SQL cursor with loop statements.

Note that we use PySpark to implement the SQL cursor alternative in Spark SQL.

Spark DataFrame as a SQL Cursor Alternative in Spark SQL

One SQL cursor alternative is to create a DataFrame by executing a Spark SQL query. You can then loop through the records in the DataFrame and perform assignments or data manipulations.

Below are the steps to create a DataFrame from a Spark SQL query:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Create the Spark and Hive contexts
conf = SparkConf().setAppName("cursorAlternative")
sc = SparkContext(conf=conf)
hiveContext = HiveContext(sc)
hiveContext.sql("use testtdb")

# Variable declarations, analogous to the PL/SQL declarations
str1 = ''
str2 = ''
str3 = ''

# Create a DataFrame in place of the SQL cursor
sample_cur = hiveContext.sql("SELECT empid, ename, deptid FROM emp1")

# Assign values to variables one row at a time, mirroring
# FETCH sample_cur INTO str2, str1, str3 from the PL/SQL example.
# Note: collect() brings all rows to the driver.
for row in sample_cur.rdd.collect():
    str1 = row['ename']
    str2 = row['empid']
    str3 = row['deptid']

The method above is one of the easiest SQL cursor alternatives. Keep in mind that collect() pulls the entire result set into driver memory, so it is only suitable for result sets that fit on the driver.

Create Intermediate Table

Another approach is to create an intermediate table and build a DataFrame from that table. You can use this approach if your SQL cursor query is very complicated or difficult to execute through HiveContext.

Hope this helps 🙂