How to Use Spark SQL REPLACE on DataFrame?

Similar to the DataFrame COALESCE function, REPLACE function is one of the important functions that you will use to manipulate string data. Replace function is one of the widely used function in SQL. You can use the replace function to replace values. In this article, we will check how to use Spark SQL replace function on an Apache Spark DataFrame with an example.

Spark SQL REPLACE on DataFrame

In a SQL, replace function removes all occurrences of a specified substring, and optionally replaces them with another string. But, in a DataFrame, this function Returns a new DataFrame replacing a value with another value.

Following is the DataFrame replace syntax:

DataFrame.replace(to_replace, value=<no value>, subset=None)

In the above syntax, to_replace is a value to be replaced and data type can be bool, int, float, string, list or dict. The to_replace value cannot be a ‘None’. The value is a replacement value must be a bool, int, float, string or None. The subset is an optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored.

Note that, values to_replace and a value must have the same type and can only be Numerics, Booleans, or Strings.

and following is Spark SQL replace function syntax:

replace(str, search [, replace] )

In the above syntax,

str: A string expression to be searched.
search: A string repression to be replaced.
replace: An optional string expression to replace search with. The default is an empty string.

Spark SQL REPLACE on DataFrame Examples

Following is the test DataFrame that we will use in an example.

>>> testDF = spark.createDataFrame([(1,1), (2,2), (None,3), (4,None)], ["id", "number"])
>>> testDF.show()
+----+------+
|  id|number|
+----+------+
|   1|     1|
|   2|     2|
|null|     3|
|   4|  null|
+----+------+

Following is the DataFrame replace function example.

>>> testDF.replace(1, 0).show()
+----+------+
|  id|number|
+----+------+
|   0|     0|
|   2|     2|
|null|     3|
|   4|  null|
+----+------+

>>> testDF.replace(0, 999).show()
+---+------+
| id|number|
+---+------+
|  1|     1|
|  2|     2|
|999|     3|
|  4|   999|
+---+------+

Spark SQL REPLACE Function Examples

If you are writing a Spark SQL query and want to use the replace function following example helps you on syntax and usage front:

>>> spark.sql('select replace(id, 1,0) as id, replace(number,1, 0) as number from testTable').show()
+----+------+
|  id|number|
+----+------+
|   0|     0|
|   2|     2|
|null|     3|
|   4|  null|
+----+------+

When to use DataFrame Replace and SQL Replace in Spark?

When you are performing operations on DataFrame, then you can use DataFrame replace. And if you are writing Spark SQL queries use SQL replace function. The DataFrame replace function is faster compared to SQL replace as you are directly performing DataFrame operations.

Mixed type replacements are not supported

You will get “Mixed type replacements are not supported” error when you try to replace value with different data type. For example, if you try to replace integer with a string. Values to_replace and value must have the same type and can only be Numerics, Booleans, or Strings.