Hive Table Sampling – Concept and Example

  • Post author:
  • Post last modified:November 17, 2019
  • Post category:BigData
  • Reading time:5 mins read

The Relational databases like SQL server supports writing queries on a relatively small number of rows from the very large table. In this article, we will check Hive table sampling concept, methods and some examples.

Hive Table Sampling - Concept, Methods and Example

Hive Table Sampling Concept

The Hive TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table. The sampling comes handy when you are working on the large tables and it takes time to return results. The TABLESAMPLE clause can be added to any table in the FROM clause.

Type of Hive Sampling

There are two type of Hive tables sampling

  • Sampling Bucketized Table
  • Hive Block Sampling

Hive Table Sampling Syntax

You can use following syntax to get sample records from the Hive table.

Bucketized Sampling

Following is the syntax of the Bucketized Sampling

SELECT *
FROM source TABLESAMPLE (BUCKET x OUT OF y [ON colname]) s;

Block Sampling

Following is the syntax of the block Sampling

SELECT *
FROM source TABLESAMPLE (n PERCENT) s;

Where, the BUCKET is numbered starting from 1. colname indicates the column on which to sample each row in the table. Instead of colname, use rand() indicating sampling on the entire row instead of an individual column.

And n is the percent of data size in case of block sampling.

Hive Sampling Bucketized Table

The sampling Bucketized table allows you to get sample records using the number of buckets. The Bucketized sampling method can be used when your tables are bucketed.

You can provide the bucket number starting from 1 along with colname on which to sample each row in the Hive table. You can also use rand() indicating sampling on the entire row instead of an individual column.

For example, following example provides random sample rows from the bucket 1.

SELECT *
FROM SAMPLE_DEMO TABLESAMPLE(BUCKET 1 OUT OF 3 ON rand()) s;

+-------+---------+--------+--+
| s.id  | s.name  | s.mnt  |
+-------+---------+--------+--+
| 3     | ccc     | 1      |
| 11    | kkk     | 4      |
+-------+---------+--------+--+

Related Articles

Hive Block Sampling

This sampling method will allow Hive to pick up at least n% data size. Note that, PERCENT doesn’t necessarily mean the number of rows, it is the percentage of table size. If your table is small then it may return all rows.

For example, in the following example the input size 0.1% or more will be used for the query.

SELECT *
FROM SAMPLE_DEMO TABLESAMPLE(0.01 PERCENT) s;

+-------+---------+--------+--+
| s.id  | s.name  | s.mnt  |
+-------+---------+--------+--+
| 1     | aaa     | 1      |
| 2     | bbb     | 1      |
| 3     | ccc     | 1      |
+-------+---------+--------+--+

Hope this helps 🙂