Apache Spark SQL Bucketing Support – Explanation

Spark SQL supports clustering column values using the bucketing concept. Bucketing and partitioning in Spark SQL are similar to the Hive concepts, but with some syntax changes. In this article, we will check Apache Spark SQL bucketing support in different versions of Spark.

In this article, we will concentrate mainly on the Spark SQL DDL changes; a brief example of applying bucketing on a DataFrame is included towards the end.

Apache Spark SQL Bucketing Support

Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine how data is laid out: each row is assigned to a bucket based on the hash of its bucketing column value modulo the number of buckets. Bucketing is mainly used to optimize joins: when the tables participating in a join are bucketed on the join key into the same number of buckets, Spark can avoid shuffling them.

All versions of Spark SQL support bucketing via the CLUSTERED BY clause. However, not all Spark versions support the same syntax.

Now, let us check bucketing on different Spark versions.

Bucketing on Spark SQL Version 1.x

Spark SQL 1.x supports the CLUSTERED BY syntax, which is similar to Hive DDL.

For example, consider the following Spark SQL DDL.

CREATE TABLE bucketed_table
(
  col1 int,
  col2 string,
  col3 string
)
PARTITIONED BY (col4 date)
CLUSTERED BY (col1) INTO 32 BUCKETS
STORED AS TEXTFILE;

The above syntax is not supported in Spark 2.2.x, but it is supported again in version 2.3.x and above.
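
You can verify that the bucketing specification was recorded in the metastore by describing the table. DESCRIBE FORMATTED is a standard Spark SQL statement; the exact layout of its output varies by version, but the bucket count and bucket columns should appear in the detailed table information.

DESCRIBE FORMATTED bucketed_table;
-- Look for 'Num Buckets' and 'Bucket Columns' in the output.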

Bucketing on Spark SQL Version 2.2.x

Spark 2.2.x supports bucketing with slightly different syntax compared to Spark SQL 1.x.

For example, consider the following DDL, which uses the USING clause to specify the storage format.

CREATE TABLE users_bucketed_and_partitioned
(
  name STRING,
  favorite_color STRING,
  favorite_numbers INT
) USING PARQUET
PARTITIONED BY (favorite_color)
CLUSTERED BY (name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;

This syntax is not supported in Spark SQL 1.x.
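
One quick way to see the datasource-style DDL that Spark records for such a table is SHOW CREATE TABLE. This is a standard statement, though the exact rendering of the output differs across versions.

SHOW CREATE TABLE users_bucketed_and_partitioned;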

Bucketing on Spark SQL Version 2.3.x and Above

Spark SQL version 2.3.x and above supports both the Spark SQL 1.x and the 2.2.x syntax.

For example, consider the following Spark SQL 2.2.x-style bucketing syntax.

CREATE TABLE users_bucketed_and_partitioned3
(
  name STRING,
  favorite_color STRING,
  favorite_numbers INT
) USING PARQUET
PARTITIONED BY (favorite_color)
CLUSTERED BY (name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
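
To see the join optimization that bucketing enables, you can compare query plans with EXPLAIN. The sketch below is illustrative: users_bucketed_and_partitioned4 is a hypothetical second table, assumed to be created with the same CLUSTERED BY (name) ... INTO 42 BUCKETS specification. When both sides are bucketed on the join key into the same number of buckets, the physical plan should show no shuffle (Exchange) for the bucketed sides, although exact plans vary by version and configuration.

EXPLAIN
SELECT a.name, a.favorite_color, b.favorite_numbers
FROM users_bucketed_and_partitioned3 a
JOIN users_bucketed_and_partitioned4 b -- hypothetical twin table
  ON a.name = b.name;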

And the following Spark SQL 1.x-style example is also supported.

CREATE TABLE bucketed_table3
(
  col1 int,
  col2 string,
  col3 string
)
PARTITIONED BY (col4 date)
CLUSTERED BY (col1) INTO 32 BUCKETS
STORED AS TEXTFILE;

Bucketing on Spark DataFrames

You can also apply bucketing when writing out a DataFrame. This is the option to use when you are working with DataFrames instead of tables.

For example, consider the following simple snippet that buckets a DataFrame and saves it as a table.

// people is an existing DataFrame with ID and dept columns.
// Rows are hash-bucketed into 42 buckets on ID and sorted by dept
// within each bucket before being persisted as a catalog table.
people.write
  .bucketBy(42, "ID")
  .sortBy("dept")
  .saveAsTable("employee_bucketed")
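
Since saveAsTable registers employee_bucketed in the session catalog, you can query the bucketed table with plain SQL like any other table. For example (the dept value below is just a placeholder):

SELECT ID, dept
FROM employee_bucketed
WHERE dept = 'HR'; -- placeholder value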

How to enable Bucketing on Spark?

Bucketing is enabled when the spark.sql.sources.bucketing.enabled configuration property is set to true, which is the default.

Optionally, you can use the following command to explicitly enable bucketing in Spark.

SET spark.sql.sources.bucketing.enabled=true;
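
Conversely, to compare query plans with and without the optimization, you can turn the property off for the current session; joins on bucketed tables then fall back to the usual shuffle-based plans.

SET spark.sql.sources.bucketing.enabled=false;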

Hope this helps 🙂