Spark SQL Bucketing on DataFrame – Examples
We have already discussed the Hive bucketing concept in my other post. The concept is the same in Spark SQL. Bucketing divides the data in a partition into a number of clusters of roughly equal size, called buckets (the technique is also called clustering). The concept is very similar to clustering in relational databases such as Netezza, Snowflake, etc.

In this article, we will check Spark SQL bucketing on a DataFrame instead of tables. We will use PySpark to demonstrate the bucketing examples. The concept is the same in Scala as well.

Spark SQL Bucketing on DataFrame

Bucketing is an optimization technique in both…
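To make the idea of dividing rows into buckets more concrete, here is a minimal plain-Python sketch of hash-based bucket assignment. It is only an illustration, not how Spark implements bucketing: the helper names `bucket_id` and `assign_buckets` are hypothetical, the sample rows are made up, and `zlib.crc32` is used as a simple stand-in for the hash function Spark applies internally.

```python
import zlib
from collections import defaultdict


def bucket_id(key, num_buckets):
    """Map a key to a bucket number in the range [0, num_buckets).

    crc32 is an assumed stand-in hash, chosen because it is stable
    across runs (unlike Python's built-in hash() for strings).
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def assign_buckets(rows, column, num_buckets):
    """Group rows (dicts) into buckets by hashing the value of `column`."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[column], num_buckets)].append(row)
    return dict(buckets)


# Hypothetical sample data: ten rows, bucketed on "id" into 4 buckets.
rows = [{"id": i, "name": f"user_{i}"} for i in range(10)]
buckets = assign_buckets(rows, "id", 4)

for number in sorted(buckets):
    print(number, [row["id"] for row in buckets[number]])
```

Rows with the same value in the bucketing column always land in the same bucket, which is the property that lets an engine avoid reshuffling data when joining or aggregating on that column. In PySpark itself, bucketing is requested on the writer side through `DataFrameWriter.bucketBy`, followed by `saveAsTable`.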