The Relational databases like SQL server supports writing queries on a relatively small number of rows from the very large table. In this article, we will check Hive table sampling concept, methods and some examples.
Hive Table Sampling Concept
The Hive TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table. The sampling comes handy when you are working on the large tables and it takes time to return results. The TABLESAMPLE clause can be added to any table in the FROM clause.
Type of Hive Sampling
There are two type of Hive tables sampling
- Sampling Bucketized Table
- Hive Block Sampling
Hive Table Sampling Syntax
You can use following syntax to get sample records from the Hive table.
Bucketized Sampling
Following is the syntax of the Bucketized Sampling
SELECT *
FROM source TABLESAMPLE (BUCKET x OUT OF y [ON colname]) s;
Block Sampling
Following is the syntax of the block Sampling
SELECT *
FROM source TABLESAMPLE (n PERCENT) s;
Where, the BUCKET is numbered starting from 1. colname indicates the column on which to sample each row in the table. Instead of colname, use rand() indicating sampling on the entire row instead of an individual column.
And n is the percent of data size in case of block sampling.
Hive Sampling Bucketized Table
The sampling Bucketized table allows you to get sample records using the number of buckets. The Bucketized sampling method can be used when your tables are bucketed.
You can provide the bucket number starting from 1 along with colname on which to sample each row in the Hive table. You can also use rand() indicating sampling on the entire row instead of an individual column.
For example, following example provides random sample rows from the bucket 1.
SELECT *
FROM SAMPLE_DEMO TABLESAMPLE(BUCKET 1 OUT OF 3 ON rand()) s;
+-------+---------+--------+--+
| s.id | s.name | s.mnt |
+-------+---------+--------+--+
| 3 | ccc | 1 |
| 11 | kkk | 4 |
+-------+---------+--------+--+
Related Articles
Hive Block Sampling
This sampling method will allow Hive to pick up at least n% data size. Note that, PERCENT doesn’t necessarily mean the number of rows, it is the percentage of table size. If your table is small then it may return all rows.
For example, in the following example the input size 0.1% or more will be used for the query.
SELECT *
FROM SAMPLE_DEMO TABLESAMPLE(0.01 PERCENT) s;
+-------+---------+--------+--+
| s.id | s.name | s.mnt |
+-------+---------+--------+--+
| 1 | aaa | 1 |
| 2 | bbb | 1 |
| 3 | ccc | 1 |
+-------+---------+--------+--+
Hope this helps 🙂