Apache Hive Set Operators: UNION and UNION ALL

You can use the Apache Hive set operators to combine similar data sets from two or more SELECT statements into a single result set . Here the similar data set literally mean, the data type of the result set should also match. Otherwise, you have to explicitly convert type using type conversion functions.

Hive Set Operators

Hadoop Hive supports following set operators.

UNION [DISTINCT]
UNION ALL

Hive versions prior to 1.2.0 only support UNION ALL (bag union), in which duplicate rows are not eliminated. Starting version 1.2.0, Hive includes optional ALL keywords.

Hive UNION and UNION ALL Syntax

Following is the syntax for Hive UNION and UNION ALL set operators.

select Statement
{ UNION [DISTINCT] | UNION [ ALL ] } 
select Statement;

Now, let us check Hive UNION and UNION ALL in brief.

Hive UNION Set Operator

The UNION set operation combines the results of two or more similar sub-queries into a single result set that contains the rows that are returned by all SELECT statements. Data types of the column that you are trying to combine should match. The Hive UNION set operation is different from JOIN, which combine the columns from two tables.

A UNION set operation removes duplicate rows from the result set.

Hive UNION Set Operator Example

Following example demonstrates the Hive UNION set operation.

select * from test1 
union 
select * from test2;

+-----------+-----------+--+
| _u2.col1  | _u2.col2  |
+-----------+-----------+--+
| 1         | 1         |
| 1         | 2         |
| 2         | 2         |
| 2         | 3         |
| 3         | 3         |
| 4         | 4         |
+-----------+-----------+--+
6 rows selected (2.023 seconds)

You can also use UNION with DISTINCT keyword. For example, consider following Hive query.

select * from test1 
union distinct 
select * from test2;

+-----------+-----------+--+
| _u2.col1  | _u2.col2  |
+-----------+-----------+--+
| 1         | 1         |
| 1         | 2         |
| 2         | 2         |
| 2         | 3         |
| 3         | 3         |
| 4         | 4         |
+-----------+-----------+--+
6 rows selected (11.873 seconds)

Hive UNION ALL Set Operator

The UNION ALL set operation combines the results of two or more similar sub-queries into a single result set that contains the rows that are returned by all SELECT statements. Data types of the column that you are trying to combine should match.

A UNION ALL set operation does not remove duplicate rows from the result set.

Hive UNION ALL Set Operator Example

Following example demonstrates the Hive UNION ALL set operation.

select * from test1 
union all 
select * from test2;

+-----------+-----------+--+
| _u1.col1  | _u1.col2  |
+-----------+-----------+--+
| 1         | 1         |
| 2         | 2         |
| 3         | 3         |
| 1         | 2         |
| 2         | 3         |
| 4         | 4         |
| 1         | 1         |
+-----------+-----------+--+
7 rows selected (1.178 seconds)

Note that, UNION ALL keeps the duplicate records.

Unsupported Set Operators in Hive

Following set operations are not supported in Hive as of now.

INTERSECT [DISTINCT]
EXCEPT [DISTINCT] or MINUS [DISTINCT]

SQL SET Operator MINUS Alternative in Hive and Examples

Hope this helps 🙂