In general, the CASE expression or command is a conditional expression, similar to the if-then-else statements found in other languages. Spark SQL supports almost all of the features available in Apache Hive, and the CASE statement is one of them. In this article, we will look at how to use the CASE WHEN and OTHERWISE statement on a Spark SQL DataFrame.
Spark SQL CASE WHEN on DataFrame
The CASE WHEN and OTHERWISE statement tests a sequence of conditions in order and returns the result associated with the first condition that evaluates to true.
Spark SQL DataFrame CASE Statement Examples
You can write the CASE statement on DataFrame column values, or you can write your own expression to test conditions.
The following are different kinds of examples of the CASE WHEN and OTHERWISE statement. We use PySpark to demonstrate the Spark CASE statement.
Test Data
We will be using the following DataFrame to test the Spark SQL CASE statement.
testDF = sqlContext.createDataFrame([(1,"One"), (2,"Two"), (3,"Three"), (4,"Four")], ["id", "number"])
testDF.show()
+---+------+
| id|number|
+---+------+
| 1| One|
| 2| Two|
| 3| Three|
| 4| Four|
+---+------+
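Note that on Spark 2.0 and later you can create the same DataFrame through the SparkSession entry point (conventionally named spark) instead of sqlContext; a minimal equivalent, assuming a session named spark already exists:
testDF = spark.createDataFrame([(1, "One"), (2, "Two"), (3, "Three"), (4, "Four")], ["id", "number"])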
Import Required PySpark Functions
As a first step, you need to import the required functions, such as col and when. For example, execute the following command on the PySpark command line interface or add it to your Python script.
from pyspark.sql.functions import col, when
Spark DataFrame CASE with multiple WHEN Conditions
In this example, we will check multiple WHEN conditions without any ELSE part.
For example,
>>> testDF.select(col("*"), when(col("id") == "1", col("number")) \
.when(col("id") == "2", col("number")) \
.alias("number")).show()
+---+------+------+
| id|number|number|
+---+------+------+
| 1| One| One|
| 2| Two| Two|
| 3| Three| null|
| 4| Four| null|
+---+------+------+
Note that the null values in the result appear for rows where none of the conditions matched.
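You can also collapse several tests into a single WHEN branch using the column operators | (or) and & (and); a minimal sketch against the same testDF, assuming the imports above (each comparison must be wrapped in parentheses):
>>> testDF.select(col("*"), when((col("id") == 1) | (col("id") == 2), col("number")) \
.alias("number")).show()
This produces the same result as chaining the two WHEN conditions above.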
Spark DataFrame CASE WHEN Condition with ELSE Part (OTHERWISE)
You can also specify the OTHERWISE part, which is returned when none of the WHEN conditions are met. You can think of it as the ELSE part of the statement.
The following example demonstrates the Spark SQL CASE WHEN with a default OTHERWISE condition.
>>> testDF.select(col("*"), when(col("id") == "1", col("number")) \
.when(col("id") == "2", col("number")) \
.otherwise("No Condition") \
.alias("number")).show()
+---+------+------------+
| id|number| number|
+---+------+------------+
| 1| One| One|
| 2| Two| Two|
| 3| Three|No Condition|
| 4| Four|No Condition|
+---+------+------------+
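If you prefer the SQL syntax inside a DataFrame transformation, the same logic can also be written as a literal CASE expression through the expr function; a minimal sketch, assuming the same testDF:
>>> from pyspark.sql.functions import expr
>>> testDF.select(col("*"), expr("CASE WHEN id = 1 THEN number WHEN id = 2 THEN number ELSE 'No Condition' END") \
.alias("number")).show()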
CASE Statement using Spark SQL Query
You can also write a CASE statement in a SQL query and execute it using Spark SQL.
The following example executes such a CASE statement through the Spark SQL context.
>>> sqlContext.sql(" SELECT CASE 1 WHEN 1 THEN 'ONE' WHEN 2 THEN 'TWO' ELSE 'NO MATCH FOUND' END").show()
+---+
|_c0|
+---+
|ONE|
+---+
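The query above evaluates the CASE expression against a literal value. To run a similar statement against the columns of the test DataFrame, first register it as a temporary view; a minimal sketch, assuming Spark 2.0+ where createOrReplaceTempView is available (older releases use registerTempTable):
>>> testDF.createOrReplaceTempView("test")
>>> sqlContext.sql("SELECT id, number, CASE WHEN id = 1 THEN number WHEN id = 2 THEN number ELSE 'No Condition' END AS number_case FROM test").show()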
Related Articles
- Spark SQL Date and Timestamp Functions and Examples
- Rename PySpark DataFrame Column – Methods and Examples
Hope this helps 🙂