How to Update Spark DataFrame Column Values using Pyspark?


A DataFrame in Spark is a distributed collection of data organized into named columns. You can compare a Spark DataFrame with a Pandas DataFrame, but with one key difference: Spark DataFrames are immutable, i.e. you cannot change the data in an already created DataFrame. In this article, we will check how to update Spark DataFrame column values using PySpark. The same concept applies to Scala as well.


The Spark DataFrame is one of the widely used features in Apache Spark. DataFrames support most of the operations available on RDDs, and just like in SQL, you can join two DataFrames and perform various actions and transformations on them.

As mentioned earlier, Spark DataFrames are immutable. You cannot change an existing DataFrame; instead, you create a new DataFrame with the updated values.
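For instance, withColumn() does not modify the DataFrame it is called on; it returns a new one. Here is a minimal sketch illustrating this (the tiny demo DataFrame below is made up for illustration):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(conf=SparkConf().setAppName('immutability-demo'))
sqlContext = SQLContext(sc)

# A small throwaway DataFrame just for this demonstration
df = sqlContext.createDataFrame([(1, 2), (2, 0)], ['col1', 'col2'])

# withColumn() returns a NEW DataFrame; df itself is unchanged
updated_df = df.withColumn('col2', F.lit(0))

df.show()          # still shows the original values
updated_df.show()  # shows the updated column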

I was working on a task to transform an Oracle stored procedure into a PySpark application. We were using Spark DataFrames as an alternative to SQL cursors. As part of the same task, we had a requirement to update a DataFrame. In the subsequent sections, I will explain how we updated Spark DataFrames.


Note that we have used HiveContext to create DataFrames from Apache Hive tables; Spark and Hive are integrated in our environment.

Required Modules

We have used the following PySpark modules to update Spark DataFrame column values:

  • SQLContext
  • HiveContext
  • Functions from pyspark.sql

Update Spark DataFrame Column Values Examples

We will check two examples: updating a DataFrame column that contains NULL values, and updating a column that contains zeros.

Update NULL values in Spark DataFrame

You can use the isNull() column function to test for NULL values, and the when()/otherwise() conditional functions to replace them with the desired value.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import functions as F

# Create a SparkContext (skip this step if your shell already provides sc)
conf = SparkConf().setAppName('update-dataframe-columns')
sc = SparkContext(conf=conf)

hiveContext = HiveContext(sc)

# Switch to the Hive database
hiveContext.sql('use newtpcds')

# Read the table from Hive
res = hiveContext.sql('select * from test')
res.show()

+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2|null|
|   3|null|
+----+----+

# Replace NULL values in col2 with 0
res = res.withColumn("col2", F.when(F.col("col2").isNull(), 0).otherwise(F.col("col2")))

res.show()

+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2|   0|
|   3|   0|
+----+----+
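As an aside, PySpark also ships a built-in shortcut for this common case: fillna() (an alias for na.fill()) replaces NULLs with a constant, so you can skip the when()/otherwise() expression. A minimal sketch of the same replacement:

# fillna() replaces NULLs in the listed columns with the given value
res = res.fillna(0, subset=['col2'])
res.show()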

Update Specific values in Spark DataFrame

You can use an equality condition to test for zero values, and the same when()/otherwise() conditional functions to replace them with the desired value.

# Replace zero values in col2 with 100
res = res.withColumn("col2", F.when(F.col("col2") == 0, 100).otherwise(F.col("col2")))

res.show()

+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2| 100|
|   3| 100|
+----+----+
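Similarly, for simple one-to-one substitutions like this one, the built-in DataFrame.replace() method can swap values directly, without a conditional expression. A minimal sketch:

# replace() substitutes 100 for 0 in col2 only
res = res.replace(0, 100, subset=['col2'])
res.show()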


Hope this helps 🙂