You can use the Apache Hive set operators to combine similar data sets from two or more SELECT statements into a single result set . Here the similar data set literally mean, the data type of the result set should also match. Otherwise, you have to explicitly convert type using type conversion functions.
Hive Set Operators
Hadoop Hive supports following set operators.
- UNION [DISTINCT]
- UNION ALL
Hive versions prior to 1.2.0 only support UNION ALL (bag union), in which duplicate rows are not eliminated. Starting version 1.2.0, Hive includes optional ALL keywords.
Hive UNION and UNION ALL Syntax
Following is the syntax for Hive UNION and UNION ALL set operators.
select Statement
{ UNION [DISTINCT] | UNION [ ALL ] }
select Statement;
Now, let us check Hive UNION and UNION ALL in brief.
Hive UNION Set Operator
The UNION set operation combines the results of two or more similar sub-queries into a single result set that contains the rows that are returned by all SELECT statements. Data types of the column that you are trying to combine should match. The Hive UNION set operation is different from JOIN, which combine the columns from two tables.
A UNION set operation removes duplicate rows from the result set.
Hive UNION Set Operator Example
Following example demonstrates the Hive UNION set operation.
select * from test1
union
select * from test2;
+-----------+-----------+--+
| _u2.col1 | _u2.col2 |
+-----------+-----------+--+
| 1 | 1 |
| 1 | 2 |
| 2 | 2 |
| 2 | 3 |
| 3 | 3 |
| 4 | 4 |
+-----------+-----------+--+
6 rows selected (2.023 seconds)
You can also use UNION with DISTINCT keyword. For example, consider following Hive query.
select * from test1
union distinct
select * from test2;
+-----------+-----------+--+
| _u2.col1 | _u2.col2 |
+-----------+-----------+--+
| 1 | 1 |
| 1 | 2 |
| 2 | 2 |
| 2 | 3 |
| 3 | 3 |
| 4 | 4 |
+-----------+-----------+--+
6 rows selected (11.873 seconds)
Hive UNION ALL Set Operator
The UNION ALL set operation combines the results of two or more similar sub-queries into a single result set that contains the rows that are returned by all SELECT statements. Data types of the column that you are trying to combine should match.
A UNION ALL set operation does not remove duplicate rows from the result set.
Hive UNION ALL Set Operator Example
Following example demonstrates the Hive UNION ALL set operation.
select * from test1
union all
select * from test2;
+-----------+-----------+--+
| _u1.col1 | _u1.col2 |
+-----------+-----------+--+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 1 | 2 |
| 2 | 3 |
| 4 | 4 |
| 1 | 1 |
+-----------+-----------+--+
7 rows selected (1.178 seconds)
Note that, UNION ALL keeps the duplicate records.
Unsupported Set Operators in Hive
Following set operations are not supported in Hive as of now.
- INTERSECT [DISTINCT]
- EXCEPT [DISTINCT] or MINUS [DISTINCT]
Related Articles
Hope this helps 🙂