Spark SQL supports clustering column values using the bucketing concept. Bucketing and partitioning are similar to the corresponding Hive concepts, but with syntax changes. In this article, we will check Apache Spark SQL bucketing support in different versions of Spark.
We will concentrate only on the Spark SQL DDL changes here. For applying bucketing to a DataFrame, see the section later in this article.
Apache Spark SQL Bucketing Support
Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning. It optimizes joins by avoiding shuffles of the tables participating in the join.
All versions of Spark SQL support bucketing via the CLUSTERED BY clause. However, not all Spark versions support the same syntax.
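For instance, when two tables are bucketed into the same number of buckets on the join key, Spark can join matching buckets directly instead of shuffling both tables first. The following is a minimal sketch; the tables orders and customers and the column cust_id are illustrative assumptions, with both tables presumed bucketed on cust_id into the same number of buckets.
EXPLAIN
SELECT o.cust_id, c.name
FROM orders o
JOIN customers c ON o.cust_id = c.cust_id;
When bucketing is picked up, the physical plan printed by EXPLAIN should not contain an Exchange (shuffle) step on either side of the join.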
Now, let us check bucketing on different Spark versions.
Bucketing on Spark SQL Version 1.x
Spark SQL 1.x supports the CLUSTERED BY syntax, which is similar to the Hive DDL. For example, consider the following Spark SQL DDL.
CREATE TABLE bucketed_table
(
  col1 int,
  col2 string,
  col3 string
)
PARTITIONED BY (col4 date)
CLUSTERED BY (col1) INTO 32 BUCKETS
STORED AS TEXTFILE;
The above syntax is not supported in Spark 2.2.x, but it is supported again in version 2.3.x and above.
Bucketing on Spark SQL Version 2.2.x
Spark 2.2.x supports bucketing with slightly different syntax compared to Spark SQL 1.x. For example, consider the following example, which uses the USING clause to specify the storage format.
CREATE TABLE users_bucketed_and_partitioned
(
  name STRING,
  favorite_color STRING,
  favorite_numbers INT
) USING TEXT
PARTITIONED BY (favorite_color)
CLUSTERED BY (name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
This syntax is not supported in Spark SQL 1.x.
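To verify that Spark recorded the bucketing specification, inspect the table metadata. The DESCRIBE FORMATTED output should include entries such as Num Buckets, Bucket Columns, and Sort Columns for the table created above.
DESCRIBE FORMATTED users_bucketed_and_partitioned;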
Bucketing on Spark SQL Version 2.3.x and Above
Spark SQL version 2.3.x and above supports both the Spark SQL 1.x and 2.2.x syntaxes. For example, consider the following Spark SQL 2.2.x bucketing syntax.
CREATE TABLE users_bucketed_and_partitioned3
(
  name STRING,
  favorite_color STRING,
  favorite_numbers INT
) USING TEXT
PARTITIONED BY (favorite_color)
CLUSTERED BY (name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
And the following is an example using the syntax that is supported in Spark SQL version 1.x.
CREATE TABLE bucketed_table3
(
  col1 int,
  col2 string,
  col3 string
)
PARTITIONED BY (col4 date)
CLUSTERED BY (col1) INTO 32 BUCKETS
STORED AS TEXTFILE;
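On Spark SQL 2.3.x and above, you can also round-trip a data source table definition. SHOW CREATE TABLE should print the DDL back, including the CLUSTERED BY ... INTO ... BUCKETS clause, for the table created with the USING clause above.
SHOW CREATE TABLE users_bucketed_and_partitioned3;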
Bucketing on Spark DataFrames
You can also apply bucketing when writing a DataFrame. This is the option to use when you are working with DataFrames instead of tables. For example, consider the following simple example that buckets a DataFrame and saves it as a table. Note that bucketBy works only together with saveAsTable; the plain save() API does not support bucketing.
// people is an existing DataFrame; ID and dept are its columns
people.write
  .bucketBy(42, "ID")               // hash rows into 42 buckets by ID
  .sortBy("dept")                   // sort rows within each bucket by dept
  .saveAsTable("employee_bucketed") // persist as a bucketed table
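A table written this way lives in the metastore like any table created through DDL, so the same metadata check applies. As a quick sanity check, assuming the write above succeeded, the output should report 42 buckets on the ID column.
DESCRIBE FORMATTED employee_bucketed;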
How to enable Bucketing on Spark?
Bucketing is enabled when the spark.sql.sources.bucketing.enabled configuration property is set to true, and it is turned on by default.
Optionally, you can use the following command to enable bucketing on Spark.
SET spark.sql.sources.bucketing.enabled=true;
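To check the current value without changing it, run SET with just the property name; Spark returns the key along with its current value.
SET spark.sql.sources.bucketing.enabled;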
Hope this helps 🙂