As a data engineer, you work with many different kinds of datasets. A common requirement is to filter or search for a specific string in a DataFrame column, for example, to identify junk strings within a dataset. In this article, we will check how to search for a string in a Spark DataFrame using different methods.
How to Search String in Spark DataFrame?
Apache Spark provides several built-in API methods that you can use to search for a specific string in a DataFrame.
The following are some of the commonly used methods to search strings in a Spark DataFrame: contains(), like, and rlike.
Test Data
The following is the test DataFrame that we are going to use in all subsequent examples.
// toDF needs spark.implicits._ (spark-shell imports it automatically)
import spark.implicits._
val testDF = Seq((1,"Jhon Smith"), (2,"Michael Munna"), (3,"Bob Williamson"), (4,"Jack Rose"), (5,"Bob Williamson"), (6,"Rob Williamson")
).toDF("ID", "Name")
+---+--------------+
| ID|          Name|
+---+--------------+
|  1|    Jhon Smith|
|  2| Michael Munna|
|  3|Bob Williamson|
|  4|     Jack Rose|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
Spark Contains() Function to Search Strings in DataFrame
You can use the contains() function in Spark and PySpark to check whether a DataFrame column value contains a literal string. The match is case-sensitive.
Spark Contains() Function
The following Spark example uses the contains() function to search for a string.
import org.apache.spark.sql.functions.col
testDF.filter(col("name").contains("Williamson")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark Contains() Function
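The following PySpark example uses the contains() function to search for a string.
from pyspark.sql.functions import col
# Assumes an active SparkSession named spark (as in the pyspark shell); column names follow the output below
testDF = spark.createDataFrame([(1, "Jhon Smith"), (2, "Michael Munna"), (3, "Bob Williamson"), (4, "Jack Rose"), (5, "Bob Williamson"), (6, "Rob Williamson")], ["id", "name"])
testDF.filter(col("name").contains("Williamson")).show()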
+---+--------------+
| id|          name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
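Because contains() is a case-sensitive literal match, searching the test data for "williamson" in lowercase returns no rows. The plain-Python sketch below models that behavior on the same names and shows one common workaround: lowercasing both sides before comparing. In Spark, the equivalent would be wrapping the column in the lower function. The helper names here are hypothetical, and the code only illustrates the matching semantics; it is not Spark code.

```python
# Plain-Python model of a case-sensitive literal substring search.
# The rows mirror the test DataFrame; the function names are hypothetical.
rows = [
    (1, "Jhon Smith"), (2, "Michael Munna"), (3, "Bob Williamson"),
    (4, "Jack Rose"), (5, "Bob Williamson"), (6, "Rob Williamson"),
]

def contains(rows, needle):
    """Keep rows whose name contains needle exactly as written."""
    return [(i, name) for i, name in rows if needle in name]

def contains_ignore_case(rows, needle):
    """Keep rows whose name contains needle, ignoring letter case."""
    return [(i, name) for i, name in rows if needle.lower() in name.lower()]

print(contains(rows, "Williamson"))              # rows with ids 3, 5, 6
print(contains(rows, "williamson"))              # no rows: the case differs
print(contains_ignore_case(rows, "williamson"))  # rows with ids 3, 5, 6
```

If your junk strings can appear in mixed case, normalizing the case first is usually simpler than listing every variant.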
Filter Spark DataFrame using like Function
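The like function in Spark and PySpark matches DataFrame column values against a SQL LIKE pattern, where % matches any sequence of characters (including none) and _ matches exactly one character. The pattern must match the entire column value.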
Spark like Function to Search Strings in DataFrame
The following Spark example uses the like function to search for a string.
import org.apache.spark.sql.functions.col
testDF.filter(col("name").like("%Williamson")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark like Function to Search String in DataFrame
The following PySpark example uses the like function to search for a string.
from pyspark.sql.functions import col
testDF.filter(col("name").like("%Williamson")).show()
+---+--------------+
| id|          name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
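To see why the pattern above is "%Williamson" rather than just "Williamson", it helps to model how a LIKE pattern is interpreted: % stands for any run of characters, _ stands for exactly one character, and the pattern has to cover the whole value. The sketch below translates a LIKE pattern into an anchored regular expression in plain Python. The function names are hypothetical, escape characters are deliberately not handled, and this is only an approximation of the semantics, not how Spark implements like.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a regular expression (no escape support)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is a literal
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern):
    """Return True when the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None

names = ["Jhon Smith", "Michael Munna", "Bob Williamson", "Jack Rose", "Rob Williamson"]

print([n for n in names if like(n, "%Williamson")])  # values ending in Williamson
print([n for n in names if like(n, "Williamson")])   # empty: whole value must match
print([n for n in names if like(n, "_ob %")])        # Bob ... and Rob ...
print([n for n in names if like(n, "%Mun%")])        # values containing Mun
```

A pattern of the form "%text%" therefore behaves like contains(), while "text%" and "%text" act as starts-with and ends-with checks.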
Filter Spark DataFrame using rlike Function
The Spark and PySpark rlike method lets you match column values against a regular expression (regexp). Unlike like, the expression only needs to match part of the value unless you anchor it with ^ or $.
Spark rlike Function to Search String in DataFrame
The following Spark example uses the rlike function to search for a string.
import org.apache.spark.sql.functions.col
testDF.filter(col("name").rlike("Bob|Rob")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark rlike Function to Search String in DataFrame
The following PySpark example uses the rlike function to search for a string.
from pyspark.sql.functions import col
testDF.filter(col("name").rlike("Bob|Rob")).show()
+---+--------------+
| id|          name|
+---+--------------+
|  3|Bob Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
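One detail worth keeping in mind is that rlike performs a partial match: the expression "Bob|Rob" succeeds if it is found anywhere in the value, not only when it covers the whole value. The plain-Python sketch below models that with re.search on the same names and shows how anchors narrow the match. The rlike helper is hypothetical, and Spark uses Java regular expressions, so treat this as an illustration of the matching behavior rather than an exact equivalent.

```python
import re

names = ["Jhon Smith", "Michael Munna", "Bob Williamson", "Jack Rose", "Rob Williamson"]

def rlike(value, pattern):
    """Model a partial (unanchored) regex match: True if pattern is found anywhere."""
    return re.search(pattern, value) is not None

print([n for n in names if rlike(n, "Bob|Rob")])     # Bob Williamson, Rob Williamson
print([n for n in names if rlike(n, "Rose")])        # Jack Rose: found mid-value
print([n for n in names if rlike(n, "^Rose")])       # empty: no value starts with Rose
print([n for n in names if rlike(n, "Rose$")])       # Jack Rose: value ends with Rose
print([n for n in names if rlike(n, "(?i)^jack")])   # Jack Rose: case-insensitive flag
```

If you need the expression to cover the entire value, anchor it on both sides, for example "^(Bob|Rob) Williamson$".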
Related Articles,
- How to Add Column with Default Value to Pyspark DataFrame?
- How to Use Spark SQL REPLACE on DataFrame?
Hope this helps 🙂