Replace Pyspark DataFrame Column Value – Methods

  • Post last modified: June 16, 2022
  • Post category: Apache Spark
  • Reading time: 5 mins read

A DataFrame in Spark is a dataset organized into named columns. A Spark DataFrame consists of columns and rows, similar to a relational database table. In many situations you may end up with unwanted or invalid values in a DataFrame. In this article, we will check how to replace such values in a PySpark DataFrame column and look at the methods available for doing so.


Replace Pyspark DataFrame Column Value

As mentioned, we often get a requirement to cleanse data by replacing unwanted values in DataFrame columns. The method is the same in both PySpark and Spark Scala.

Note that we are replacing values; we are not renaming columns or converting DataFrame column data types.

Following are some methods that you can use to replace a DataFrame column value in PySpark.

  • Use regexp_replace Function
  • Use Translate Function (Recommended for character replace)

Now, let us check these methods with an example.

Test Data

Following is the test DataFrame that we will be using in subsequent methods and examples.

testDF = spark.createDataFrame([(1,"a"), (2,"a"), (3,"222"), (4,"222"), (5,"222"), (6,"a"), (7,"333"), (8,"444")], ["id", "d_id"])

+---+----+
| id|d_id|
+---+----+
|  1|   a|
|  2|   a|
|  3| 222|
|  4| 222|
|  5| 222|
|  6|   a|
|  7| 333|
|  8| 444|
+---+----+

You may have noticed there are some invalid values (“a”) in the test data.

Replace Spark DataFrame Column Value using regexp_replace

This is one of the easiest methods that you can use to replace a DataFrame column value.

For example, the following snippet replaces “a” with zero.

from pyspark.sql.functions import regexp_replace

newDf = testDF.withColumn('d_id', regexp_replace('d_id', 'a', '0'))
newDf.show()

+---+----+
| id|d_id|
+---+----+
|  1|   0|
|  2|   0|
|  3| 222|
|  4| 222|
|  5| 222|
|  6|   0|
|  7| 333|
|  8| 444|
+---+----+

Notice that we have used withColumn along with the regexp_replace function. The regexp_replace function generates a new column by replacing all occurrences of “a” with zero. The withColumn function replaces the column if a column with that name already exists in the DataFrame.
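Because regexp_replace treats its pattern as a regular expression, it can match more than a literal character. Below is a minimal sketch, assuming a local SparkSession and the same shape of test data as above, that uses the pattern "[^0-9]+" to collapse any run of non-digit characters to a single zero:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.master("local[1]").appName("regexp-demo").getOrCreate()

testDF = spark.createDataFrame(
    [(1, "a"), (2, "ab"), (3, "222")],
    ["id", "d_id"],
)

# "[^0-9]+" matches one or more non-digit characters, so both
# "a" and "ab" collapse to a single "0", while "222" is untouched.
cleaned = testDF.withColumn("d_id", regexp_replace("d_id", "[^0-9]+", "0"))
cleaned.show()

This kind of pattern is handy when the invalid values are not a single known character but any non-numeric text.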

Replace Spark DataFrame Column Value using Translate Function

This method is recommended if you are replacing individual characters within the given values.

For example, the following snippet replaces occurrences of “a” with zero.

from pyspark.sql.functions import translate
newDf = testDF.withColumn('d_id', translate('d_id', 'a', '0'))
newDf.show()

+---+----+
| id|d_id|
+---+----+
|  1|   0|
|  2|   0|
|  3| 222|
|  4| 222|
|  5| 222|
|  6|   0|
|  7| 333|
|  8| 444|
+---+----+

Similar to the previous method, we have used withColumn along with the translate function. The translate function generates a new column by replacing all occurrences of “a” with zero. The withColumn function replaces the column if a column with that name already exists in the DataFrame.
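Unlike regexp_replace, translate maps characters positionally: each character in the matching string is replaced by the character at the same position in the replacement string, and characters with no counterpart are removed. A short sketch, assuming a local SparkSession and made-up sample data, illustrates this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import translate

spark = SparkSession.builder.master("local[1]").appName("translate-demo").getOrCreate()

df = spark.createDataFrame([(1, "abc"), (2, "2b2")], ["id", "code"])

# Positional mapping: 'a' -> '0', 'b' -> '9'.
# 'c' has no counterpart in the replacement string, so it is deleted.
mapped = df.withColumn("code", translate("code", "abc", "09"))
mapped.show()

This makes translate a good fit for character-level substitutions, while regexp_replace is better suited for pattern-based substring replacement.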


Hope this helps 🙂