Spark Dataset Join Operators using Pyspark – Examples


Joining two tables produces a new dataset. You can join two datasets to perform a specific task, such as getting the rows common to both. Relational databases like Netezza and Teradata support different join types, and just like an RDBMS, Apache Hive also supports different join types. In this article, we will check Spark Dataset join operators using Pyspark and some examples to demonstrate different join types.

Spark Dataset Join Operators using Pyspark

Before going into Spark SQL DataFrame join types, let us check what a join in SQL is.

“A query that accesses multiple rows of the same or different tables is called a join query. The result of the query is based on the joining condition that you provide in your query.”


Spark SQL Dataset Join Types

Spark SQL dataset join types are similar to relational database join types. Below is a list of commonly used Spark dataset join types:

  • INNER JOIN
  • LEFT JOIN
  • RIGHT JOIN
  • FULL OUTER JOIN
  • LEFT SEMI JOIN

Pyspark Join Syntax

Pyspark DataFrames have a join method that takes three parameters: the DataFrame on the right side of the join, the fields being joined on, and the type of join, as sketched below.
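A minimal sketch of the call shape, where left_df, right_df, and key are placeholder names rather than anything defined in this article:

joined = left_df.join(right_df, on=left_df.key == right_df.key, how="inner")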

Let us discuss these join types using examples.

from pyspark.sql import SparkSession

# Create a SparkSession, the entry point for creating DataFrames
spark = SparkSession.builder.appName("join-examples").getOrCreate()

# Create sample student data
std_data = spark.createDataFrame([(1, 111),
                                  (2, 111),
                                  (3, 222),
                                  (4, 222),
                                  (5, 222),
                                  (6, 111),
                                  (7, 333),
                                  (8, 444)],
                                 ["std_id", "dept_id"])
std_data.show()
+------+-------+
|std_id|dept_id|
+------+-------+
|     1|    111|
|     2|    111|
|     3|    222|
|     4|    222|
|     5|    222|
|     6|    111|
|     7|    333|
|     8|    444|
+------+-------+
# Create sample department data
dpt_data = spark.createDataFrame([(111, 1),
                                  (222, 1),
                                  (333, 2),
                                  (444, 2),
                                  (555, 3)],
                                 ["dept_id", "std_id"])
dpt_data.show()

+-------+------+
|dept_id|std_id|
+-------+------+
|    111|     1|
|    222|     1|
|    333|     2|
|    444|     2|
|    555|     3|
+-------+------+

Pyspark Inner Join Example

An inner join returns the matching rows from both datasets. Inner join is the default join type: if you don’t provide a join type, an inner join is used.

inner_join = std_data.join(dpt_data, std_data.std_id == dpt_data.std_id)

inner_join.show()

+------+-------+-------+------+
|std_id|dept_id|dept_id|std_id|
+------+-------+-------+------+
|     1|    111|    111|     1|
|     1|    111|    222|     1|
|     2|    111|    333|     2|
|     2|    111|    444|     2|
|     3|    222|    555|     3|
+------+-------+-------+------+
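Note that the joined result contains duplicate column names, since both DataFrames carry std_id and dept_id. As a minimal sketch of one way to reduce the ambiguity, joining on the column name (a string) keeps a single copy of the join key (the two dept_id columns would still both appear in this example):

# Joining on a column name keeps only one copy of the join key
inner_join2 = std_data.join(dpt_data, "std_id")
inner_join2.show()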
Spark SQL Inner Join

You can write the inner join using SQL mode as well.

For example:

SELECT std_data.*, dpt_data.*
FROM std_data
INNER JOIN dpt_data
  ON (std_data.std_id = dpt_data.std_id);
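To run SQL like this from Pyspark, the DataFrames must first be registered as temporary views. A minimal sketch, using the spark session and the two DataFrames created above:

# Register the DataFrames as temporary views so they can be queried with SQL
std_data.createOrReplaceTempView("std_data")
dpt_data.createOrReplaceTempView("dpt_data")

inner_join_sql = spark.sql("""
    SELECT std_data.*, dpt_data.*
    FROM std_data
    INNER JOIN dpt_data
      ON std_data.std_id = dpt_data.std_id
""")
inner_join_sql.show()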

Pyspark Left Join Example

This type of join returns all rows from the left dataset even if there are no matching rows in the right dataset. The columns from the right dataset will be NULL for unmatched records.

left_join = std_data.join(dpt_data, std_data.std_id == dpt_data.std_id,how='left')

left_join.show()

+------+-------+-------+------+
|std_id|dept_id|dept_id|std_id|
+------+-------+-------+------+
|     1|    111|    111|     1|
|     1|    111|    222|     1|
|     2|    111|    333|     2|
|     2|    111|    444|     2|
|     3|    222|    555|     3|
|     4|    222|   null|  null|
|     5|    222|   null|  null|
|     6|    111|   null|  null|
|     7|    333|   null|  null|
|     8|    444|   null|  null|
+------+-------+-------+------+
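A common use of a left join is finding left-side rows that have no match. A minimal sketch, filtering on a NULL right-side column of the left join built above:

# Students with no matching department record
no_match = left_join.filter(dpt_data.std_id.isNull())
no_match.show()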

Note that you can also use the join type ‘left_outer’ instead of ‘left’.

Spark SQL Left Join

You can write the left outer join using SQL mode as well.

For example:

SELECT std_data.*, dpt_data.*
FROM std_data
LEFT JOIN dpt_data
  ON (std_data.std_id = dpt_data.std_id);

Pyspark Right Join Example

This type of join returns all rows from the right dataset even if there is no matching row in the left dataset. The columns from the left dataset will be NULL for unmatched records. In this example every row of dpt_data has a matching std_id, so the result is the same as the inner join.

right_join = std_data.join(dpt_data, std_data.std_id == dpt_data.std_id,how='right')

right_join.show()

+------+-------+-------+------+
|std_id|dept_id|dept_id|std_id|
+------+-------+-------+------+
|     1|    111|    111|     1|
|     1|    111|    222|     1|
|     2|    111|    333|     2|
|     2|    111|    444|     2|
|     3|    222|    555|     3|
+------+-------+-------+------+

Note that you can also use the join type ‘right_outer’ instead of ‘right’.

Spark SQL Right Join

You can write the right outer join using SQL mode as well.

For example:

SELECT std_data.*, dpt_data.*
FROM std_data
RIGHT JOIN dpt_data
  ON (std_data.std_id = dpt_data.std_id);

Pyspark Full Outer Join Example

This type of join returns all rows from both datasets, whether or not they fulfill the JOIN condition. For unmatched rows, the columns from the other dataset will be NULL.

full_join = std_data.join(dpt_data, std_data.std_id == dpt_data.std_id,how='full')

full_join.show()

+------+-------+-------+------+
|std_id|dept_id|dept_id|std_id|
+------+-------+-------+------+
|     1|    111|    111|     1|
|     1|    111|    222|     1|
|     2|    111|    333|     2|
|     2|    111|    444|     2|
|     3|    222|    555|     3|
|     4|    222|   null|  null|
|     5|    222|   null|  null|
|     6|    111|   null|  null|
|     7|    333|   null|  null|
|     8|    444|   null|  null|
+------+-------+-------+------+
Spark SQL Full Outer Join

You can write the full outer join using SQL mode as well.

SELECT std_data.*, dpt_data.*
FROM std_data
FULL OUTER JOIN dpt_data
  ON (std_data.std_id = dpt_data.std_id);

Pyspark Left Semi Join Example

A left semi join returns only the columns from the left-hand dataset, for records whose join key appears in the right dataset.

left_semi_join = std_data.join(dpt_data, std_data.std_id == dpt_data.std_id,how='leftsemi')
left_semi_join.show()
+------+-------+
|std_id|dept_id|
+------+-------+
|     1|    111|
|     2|    111|
|     3|    222|
+------+-------+
Spark SQL Left Semi Join

You can write the left semi join using SQL mode as well.

SELECT std_data.*
FROM std_data
LEFT SEMI JOIN dpt_data
  ON (std_data.std_id = dpt_data.std_id);

Note that only columns from the left-hand table can be referenced in the SELECT list of a left semi join.


Hope this helps 🙂