Spark SQL COALESCE on DataFrame – Examples

You will know the importance of the coalesce function if you come from a SQL or data warehouse background. Coalesce is one of the most widely used functions in SQL, and you can use it to return the first non-null value from a list of expressions. In this article, we will check how to use Spark SQL coalesce on an Apache Spark DataFrame with examples.

Spark SQL COALESCE on DataFrame

Coalesce is a non-aggregate regular function in Spark SQL. It returns the first non-null value among the given columns, or null if all of the columns are null.

Coalesce requires at least one column, and all columns have to be of the same or compatible types.
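
For instance, here is a minimal illustration of that behavior, assuming a SparkSession (or SQLContext) is available as spark:

# coalesce evaluates its arguments from left to right and returns the first non-null value
spark.sql("SELECT coalesce(NULL, NULL, 3) AS first_non_null").show()

+--------------+
|first_non_null|
+--------------+
|             3|
+--------------+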

Spark SQL COALESCE on DataFrame Examples

You can apply the COALESCE function to DataFrame column values, or you can write your own expression to test conditions.

The following example demonstrates the usage of the COALESCE function on DataFrame columns to create a new column. We have used PySpark to demonstrate the Spark coalesce function.

Test Data

We will be using the following DataFrame to test the Spark SQL COALESCE function.

testDF = sqlContext.createDataFrame([(1,1), (2,2), (None,3), (4,None)], ["id", "number"])
testDF.show()

+----+------+
|  id|number|
+----+------+
|   1|     1|
|   2|     2|
|null|     3|
|   4|  null|
+----+------+
Import Required PySpark Functions

As a first step, you need to import the required functions, such as coalesce.

For example, execute the following import on the PySpark command line interface or add it to your Python script.

# coalesce is available in the pyspark.sql.functions module
from pyspark.sql.functions import coalesce

You can use the coalesce function either on a DataFrame or in a Spark SQL query if you are working with tables.

Spark COALESCE Function on DataFrame

This is one of the most commonly used methods to get non-null values.

The following example creates a new column that contains the first non-null value across the id and number columns.

# create a new column holding the first non-null value from id and number
tmp = testDF.withColumn('newColumn', coalesce(testDF['id'], testDF['number']))

# check the contents of the new DataFrame
tmp.show()

+----+------+---------+
|  id|number|newColumn|
+----+------+---------+
|   1|     1|        1|
|   2|     2|        2|
|null|     3|        3|
|   4|  null|        4|
+----+------+---------+

As you can see, newColumn contains only non-null values. Optionally, you can rename this column and drop the existing columns.
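
For example, the following is a quick sketch of that cleanup; the new name id_or_number is just an illustrative choice:

# rename the derived column and drop the original columns
result = tmp.withColumnRenamed('newColumn', 'id_or_number').drop('id', 'number')
result.show()

+------------+
|id_or_number|
+------------+
|           1|
|           2|
|           3|
|           4|
+------------+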

COALESCE Function in Spark SQL Queries

You can use the coalesce function in your Spark SQL queries if you are working with Hive or Spark SQL tables or views.

For example, consider the below query, which uses coalesce. Note that we have registered the Spark DataFrame as a temp table using the registerTempTable method, as shown below.
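
If the DataFrame is not yet registered, a single call like the one below takes care of it (on newer Spark versions, createOrReplaceTempView is the preferred equivalent):

# register the test DataFrame as a temp table named temp_table so it can be queried with SQL
testDF.registerTempTable("temp_table")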

sqlContext.sql("select *, coalesce(id, number) as newColumn from temp_table").show()

+----+------+---------+
|  id|number|newColumn|
+----+------+---------+
|   1|     1|        1|
|   2|     2|        2|
|null|     3|        3|
|   4|  null|        4|
+----+------+---------+
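
You can also pass a literal as the last argument so the query falls back to a default when every column is null. This is a small sketch against the same temp_table; with the sample data above it returns the same result as the previous query, because no row has both columns null:

# fall back to 0 when both id and number are null
sqlContext.sql("select *, coalesce(id, number, 0) as newColumn from temp_table").show()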

Hope this helps 🙂