A DataFrame in Spark is a distributed collection of data organized into named columns. You can compare a Spark DataFrame to a Pandas DataFrame, with one key difference: Spark DataFrames are immutable, i.e. you cannot change the data in an already created DataFrame. In this article, we will check how to update Spark DataFrame column values using PySpark. The same concept applies to Scala as well.
How to Update Spark DataFrame Column Values using PySpark?
The Spark DataFrame is one of the most widely used features in Apache Spark. Most of the transformations and actions you would express on RDDs or SQL tables can also be expressed on DataFrames. Just like SQL, you can join two DataFrames and perform various actions and transformations on them, as the short sketch below illustrates.
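For example, here is a minimal join sketch. It uses the modern SparkSession entry point and made-up orders/customers data purely for illustration; none of these names come from this article's own example.
from pyspark.sql import SparkSession
# Illustrative data; the names and values here are assumptions
spark = SparkSession.builder.appName('join-sketch').getOrCreate()
orders = spark.createDataFrame([(1, 100), (2, 200)], ['cust_id', 'amount'])
customers = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['cust_id', 'name'])
# Inner join on cust_id, just like SQL's JOIN ... ON
orders.join(customers, on='cust_id', how='inner').show()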
As mentioned earlier, Spark DataFrames are immutable. You cannot change an existing DataFrame; instead, you create a new DataFrame with the updated values, as the quick sketch below demonstrates.
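A minimal sketch of that behavior, again with illustrative data: withColumn() leaves the original DataFrame untouched and returns a new one.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName('immutability-sketch').getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ['col1'])
# withColumn() does not modify df in place; it returns a new DataFrame
df2 = df.withColumn('col1', F.col('col1') + 1)
df.show()   # col1 is still 1, 2
df2.show()  # col1 is now 2, 3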
I was working on a task to convert an Oracle stored procedure to a PySpark application, where we used Spark DataFrames as an alternative to SQL cursors. As part of the same task, we had a requirement to update a DataFrame. In the following sections, I will explain how we updated Spark DataFrames.
Note that we have used HiveContext to create DataFrames from Apache Hive; Spark and Hive are integrated in our environment.
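If you are on Spark 2.x or later, keep in mind that HiveContext is deprecated there in favor of a SparkSession created with Hive support. A minimal sketch of the equivalent setup (using the database and table names from the example further below):
from pyspark.sql import SparkSession
# Spark 2.x+ replacement for HiveContext
spark = SparkSession.builder \
    .appName('update-dataframe-example') \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql('use newtpcds')
res = spark.sql('select * from test')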
Required Modules
We have used the following PySpark modules to update Spark DataFrame column values:
- SQLContext
- HiveContext
- Functions from pyspark.sql
Update Spark DataFrame Column Values Examples
We will check two examples: updating a DataFrame column that contains NULL values, and updating a column that contains zeros.
Update NULL values in Spark DataFrame
You can use the isNull() column function to identify rows with NULL values, and the when()/otherwise() conditional functions to replace them with the desired value.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import functions as F
# Create a SparkContext first; HiveContext is built on top of it
conf = SparkConf().setAppName('update-column-values')
sc = SparkContext(conf=conf)
hiveContext = HiveContext(sc)
# Connect to Hive database
hiveContext.sql('use newtpcds')
# Read Table from hive
res = hiveContext.sql('select * from test')
res.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2|null|
|   3|null|
+----+----+
# Replace NULL values in col2 with 0
res = res.withColumn("col2", F.when(F.col("col2").isNull(), 0).otherwise(F.col("col2")))
res.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2|   0|
|   3|   0|
+----+----+
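As an aside, Spark DataFrames also ship with a built-in fillna() method (an alias of na.fill()) that handles exactly this NULL replacement case in one line; a minimal sketch on the same res DataFrame:
# Equivalent one-liner: replace NULLs in col2 with 0
res = res.fillna(0, subset=['col2'])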
Update Specific values in Spark DataFrame
You can use an equality condition to identify zero values, and the when()/otherwise() conditional functions to replace them with the desired value.
# Replace 0 in col2 with 100
res = res.withColumn("col2", F.when(F.col("col2")==0, 100).otherwise(F.col("col2")))
res.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2| 100|
|   3| 100|
+----+----+
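Similarly, the built-in replace() method covers this value substitution case in one line; a minimal sketch on the same res DataFrame:
# Equivalent one-liner: replace 0 with 100 in col2
res = res.replace(0, 100, subset=['col2'])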
Related Articles
- SQL Merge Operation Using Pyspark – UPSERT Example
- How to Save Spark DataFrame as Hive Table – Example
- Spark DataFrame Integer Type Check and Conversion
- Spark DataFrame Column Type Conversion using CAST
- Rename PySpark DataFrame Column – Methods and Examples
- Spark SQL Create Temporary Tables, Syntax and Examples
- Spark SQL Recursive DataFrame – Pyspark and Scala
Hope this helps 🙂