One of the best things about Spark is how easily it handles semi-structured data such as JSON. The JSON can contain arrays or map elements, and you may have a requirement to create a row for each array or map element. In this article, we will check how to use the PySpark explode function to create a row for each array element.
Create a Row for Each Array Element using PySpark Explode
Before jumping into the examples, let us first understand what the explode function in PySpark is.
Pyspark Explode Function
The PySpark explode function returns a new row for each element in a given array, or for each key-value pair in a given map. This is similar to LATERAL VIEW EXPLODE in HiveQL.
Following is the syntax of the explode function in PySpark; it is the same in Scala as well.
pyspark.sql.functions.explode(col)
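When applied to a map column, explode produces one row per key-value pair, returned in two columns named key and value by default. Below is a minimal sketch of this behavior; the DataFrame and its prefs map column are hypothetical, created inline just for illustration.

from pyspark.sql.functions import explode

# Hypothetical DataFrame with a map column named "prefs"
data = [("student1", {"lang": "en", "level": "UG"})]
df_map = spark.createDataFrame(data, ["Name", "prefs"])

# explode on a map yields one row per key-value pair,
# with default column names "key" and "value"
df_map.select(df_map.Name, explode("prefs")).show()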
Create a Row for Each Array Element Example
You can use the explode function to create a row for each array or map element in JSON content. Consider the following example, which uses explode to convert each element of an array column into a separate row.
from pyspark.sql.functions import explode

# Build a DataFrame from a JSON string that contains an array column
df = spark.read.json(sc.parallelize(["""
{
  "Name": "student1",
  "course": ["CS", "Maths", "EC"]
}
"""]))

# Explode the "course" array so that each element becomes its own row
df1 = df.select(df.Name, explode("course").alias("course"))
df1.show()
+--------+------+
| Name|course|
+--------+------+
|student1| CS|
|student1| Maths|
|student1| EC|
+--------+------+
As you can see, the explode function expands the array into multiple rows, one row per array element.
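Since explode is the DataFrame counterpart of HiveQL's LATERAL VIEW EXPLODE mentioned earlier, the same result can also be produced with Spark SQL. Here is a minimal sketch, assuming the df DataFrame from the example above is still in scope; the view name students is arbitrary.

# Register the DataFrame as a temporary view and use LATERAL VIEW
df.createOrReplaceTempView("students")
spark.sql("""
    SELECT Name, course
    FROM students
    LATERAL VIEW explode(course) exploded AS course
""").show()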
Related Articles,
- Rename PySpark DataFrame Column – Methods and Examples
- Spark DataFrame Column Type Conversion using CAST
Hope this helps 🙂