Hadoop Hive bucket concept is dividing Hive partition into number of equal clusters or buckets. The bucketing concept is very much similar to Netezza Organize on clause for table clustering. Hive bucket is decomposing the hive partitioned data into more manageable parts.
Let us check out the example of Hive bucket usage. Let us say we have sales table with sales_date, product_id, product_dtl etc. The Hive table will be partitioned on sales_date and product_id as the second-level partition would have led to too many small partitions in HDFS. To tackle this situation, we will use Hive bucketing concept. Partition by sale_date and bucketing by product_id. The value of this column will be hashed by a user-defined number into buckets. Records with the same product_id will always be stored in the same bucket.
Hadoop Hive Bucket Concept
Hive bucketing concept is diving Hive partitioned data into further equal number of buckets or clusters. You have to use the CLUSTERED BY (Col) clause with Hive create table command to create buckets.
Syntax to create Bucket on Hadoop Hive Tables
Below is the syntax to create bucket on Hive tables:
CREATE TABLE bucketed_table ( Col1 integer, col2 string, col3 string, ... ) PARTITIONED BY (col4 date) CLUSTERED BY (col1) INTO 32 BUCKETS STORED AS TEXTFILE;
You can create buckets on only one column, you cannot specify more than one column. You can create Hive buckets on Hive managed tables or hive external tables based on your requirement and type of select statement that you run on those tables. The high cardinality field in the bucketing results in even distribution of data among created buckets.
You can use the buckets in sampling Hive table.
Hadoop Hive Bucketing Concept Examples
Below is the example of the bucketed table:
CREATE TABLE order_table ( username STRING, orderdate STRING, amount DOUBLE, tax DOUBLE, ) PARTITIONED BY (company STRING) CLUSTERED BY (username) INTO 25 BUCKETS;
Advantages of Hive Table Bucketing
Below are some of the advantages of bucketing in Hive:
- Optimized Hive tables
- Enables more efficient queries
- Optimized access to the table data
- Evenly distribute the data.
Also read my other posts on Apache Hive
- Hadoop Hive WITH Clause Syntax and Examples
- Hadoop Hive Date Functions and Examples
- Hive Create Table Command and Examples
- Commonly used Hadoop Hive Commands
- Hadoop Hive Dynamic Partition and Examples
- Hive String Functions and Examples
- Hive Join Types and Examples