How to Search String in Spark DataFrame? – Scala and PySpark

  • Post last modified: June 16, 2022
  • Post category: Apache Spark

As a data engineer, you work with many different kinds of datasets. A common requirement is to filter or search for a specific string within a DataFrame, for example, to identify junk strings in a dataset. In this article, we will look at different methods for searching for a string in a Spark DataFrame.

How to Search String in Spark DataFrame?

Apache Spark provides several built-in API methods that you can use to search for specific strings in a DataFrame.

The following are some of the commonly used methods to search strings in a Spark DataFrame: contains(), like, and rlike.

Test Data

The following test DataFrame is used in all the subsequent examples.

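import spark.implicits._ // enables toDF on a Seq; spark-shell imports this automatically
val testDF = Seq((1, "Jhon Smith"), (2, "Michael Munna"), (3, "Bob Williamson"),
  (4, "Jack Rose"), (5, "Bob Williamson"), (6, "Rob Williamson")).toDF("ID", "Name")
testDF.show()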

+---+--------------+
| ID|          Name|
+---+--------------+
|  1|    Jhon Smith|
|  2| Michael Munna|
|  3|Bob Williamson|
|  4|     Jack Rose|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

Spark Contains() Function to Search Strings in DataFrame

You can use the contains() function in Spark and PySpark to check whether a DataFrame column value contains a literal string. The match is case-sensitive.

Spark Contains() Function

The following Scala example uses the contains() function to search for a string.

import org.apache.spark.sql.functions.col

testDF.filter(col("name").contains("Williamson")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

PySpark Contains() Function

The following PySpark example uses the contains() function to search for a string.

from pyspark.sql.functions import col
testDF.filter(col("name").contains("Williamson")).show()
+---+--------------+
| id|          name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

Filter Spark DataFrame using like Function

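The like function in Spark and PySpark matches DataFrame column values against a SQL LIKE pattern, where % matches any sequence of characters and _ matches a single character. For example, the pattern %Williamson matches values that end with Williamson.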

Spark like Function to Search Strings in DataFrame

The following Scala example uses the like function to search for a string.

import org.apache.spark.sql.functions.col
testDF.filter(col("name").like("%Williamson")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

PySpark like Function to Search String in DataFrame

The following PySpark example uses the like function to search for a string.

from pyspark.sql.functions import col
testDF.filter(col("name").like("%Williamson")).show()
+---+--------------+
| id|          name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

Filter Spark DataFrame using rlike Function

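The rlike method in Spark and PySpark matches DataFrame column values against a regular expression (regexp). The pattern uses Java regular expression syntax and matches when it is found anywhere in the value, so use anchors such as ^ and $ to restrict a match to the start or end of the string.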

Spark rlike Function to Search String in DataFrame

The following Scala example uses the rlike function to search for a string.

import org.apache.spark.sql.functions.col
testDF.filter(col("name").rlike("Bob|Rob")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

PySpark rlike Function to Search String in DataFrame

The following PySpark example uses the rlike function to search for a string.

from pyspark.sql.functions import col
testDF.filter(col("name").rlike("Bob|Rob")).show()
+---+--------------+
| id|          name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

Hope this helps 🙂