Create Row for each array Element using PySpark Explode

  • Post last modified: June 14, 2021
  • Post category: Apache Spark
  • Reading time: 3 mins read

One of the best things about Spark is how easily you can work with semi-structured data such as JSON. The JSON can contain arrays or map elements, and you may need to create a row for each array or map element. In this article, we will check how to use the PySpark explode function to create a row for each array element.

Create a Row for each array Element using PySpark Explode

Before jumping into the examples, let us first understand what the explode function in PySpark is.

PySpark Explode Function

The PySpark explode function returns a new row for each element in the given array or map: one row per array element, or one row per key-value pair. This is similar to LATERAL VIEW EXPLODE in HiveQL.

Following is the syntax of the explode function in PySpark; it is the same in Scala as well.

pyspark.sql.functions.explode(col)
Create a Row for each array Element Example

You can use the explode function to create a row for each array or map element in the JSON content. It operates on the array column and converts each element into a row.

Consider the following example, which uses the explode function to transform array elements into rows.

from pyspark.sql.functions import explode

# Read a JSON document with an array field from an in-memory RDD of strings
df = spark.read.json(sc.parallelize(["""
{
  "Name": "student1",
  "course": ["CS", "Maths", "EC"]
}
"""]))

# explode creates one row per element of the "course" array
df1 = df.select(df.Name, explode("course").alias("course"))
df1.show()

+--------+------+
|    Name|course|
+--------+------+
|student1|    CS|
|student1| Maths|
|student1|    EC|
+--------+------+

As you can see, the explode function explodes the array into multiple rows; in other words, it expands each array element into its own row.

Hope this helps 🙂