In general, Query hints or optimizer hints can be used with SQL statements to alter execution plans. Hints let you make decisions that are usually made by the optimizer while generating an execution plan. As a data architect, you might know information about your data that the optimizer does not know. Hints provide a mechanism to direct the optimizer to choose a certain query execution plan based on the specific criteria. In this article, we will check Spark SQL and Dataset hints types, usage and examples.
Spark SQL and Dataset Hints
Query hints give users a way to suggest how Spark SQL to use specific approaches to generate its execution plan. This can be very useful when the query optimizer cannot make optimal decisions, For example, join types due to lack if data size information. You can give hints to optimizer to use certain join type as per your data size and storage criteria.
Hint Framework was added in Spark SQL 2.2. Spark SQL supports many hints types such as COALESCE and REPARTITION, JOIN type hints including BROADCAST hints. Query hints are useful to improve the performance of the Spark SQL.
Specifying Spark SQL Query Hints
You can specify query hints using Dataset.hint operator or SELECT SQL statements with hints.
Foe example,
// Hints on Dataset
spark.table("src").join (spark.table("records").hint ("broadcast"), "a").show()
//Hints on SQL
spark.sql("SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.a = s.a").show()
There are two type of Spark SQL hints.
Now, let us check these two hint types in briefly.
Spark SQL Partitioning Hints
Spark SQL partitioning hints allow users to suggest a partitioning strategy that Spark should follow. When multiple partitioning hints are specified, multiple nodes are inserted into the logical plan, but the leftmost hint is picked by the optimizer.
Following are the Spark SQL partitioning hints
Spark SQL COALESCE Hint
You can use the COALESCE
hint to reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter. This partition hint is equivalent to coalesce Dataset APIs,
For example,
// COALESCE Dataset hint
spark.table("src").join (spark.table("records").hint ("COALESCE", 1), "a").show()
// Spark SQL COALESCE hint
spark.sql("SELECT /*+ COALESCE(1) */ * FROM records r JOIN src s ON r.a = s.a").show()
Spark SQL REPARTITION Hint
You can use the REPARTITION
hint to repartition to the specified number of partitions using the specified partitioning expressions. It takes a partition number, column names, or both as parameters. This repartition hint is equivalent to repartition Dataset APIs,
For example.
// REPARTITION Dataset hint
spark.table("src").join (spark.table("records").hint ("REPARTITION", 10), "a").show()
// Spark SQL REPARTITION hint
spark.sql("SELECT /*+ REPARTITION(10,s.a) */ * FROM records r JOIN src s ON r.a = s.a").show()
REPARTITION_BY_RANGE
You can use the REPARTITION_BY_RANGE
hint to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters. This hint is equivalent to repartitionByRange Dataset APIs,
For example,
// REPARTITION_BY_RANGE Dataset hint
spark.table("src").join (spark.table("records").hint ("REPARTITION_BY_RANGE", 10 ), "a").show()
// Spark SQL REPARTITION_BY_RANGE hint
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(10, s.a) */ * FROM records r JOIN src s ON r.a = s.a").show()
Spark SQL Join Hints
Join hints allow users to suggest the join strategy that Spark should use. Prior to Spark 3.0, only the BROADCAST
Join Hint was supported.
Following are the Spark SQL join hints
Spark SQL BROADCAST Join Hint
The Spark SQL BROADCAST join hint suggests that Spark use broadcast join. The join side with the hint will be broadcast. If both sides of the join have the broadcast hints, the one with the smaller size (based on stats) will be broadcast. The aliases for BROADCAST
hint are BROADCASTJOIN
and MAPJOIN
For example,
// BROADCAST join Hint on Dataset
spark.table("src").join (spark.table("records").hint ("broadcast"), "a").show()
spark.table("src").join (spark.table("records").hint ("BROADCASTJOIN"), "a").show()
spark.table("src").join (spark.table("records").hint ("MAPJOIN"), "a").show()
//Spark SQL BROADCAST Join Hint
spark.sql("SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.a = s.a").show()
spark.sql("SELECT /*+ BROADCASTJOIN(r) */ * FROM records r JOIN src s ON r.a = s.a").show()
spark.sql("SELECT /*+ MAPJOIN(r) */ * FROM records r JOIN src s ON r.a = s.a").show()
Spark SQL MERGE Join Hint
The Spark SQL MERGE join hint Suggests that Spark use shuffle sort merge join. The aliases for MERGE
join hint are SHUFFLE_MERGE
and MERGEJOIN
.
For example.
// MERGE join Hint on Dataset
spark.table("src").join (spark.table("records").hint ("MERGE"), "a").show()
spark.table("src").join (spark.table("records").hint ("SHUFFLE_MERGE"), "a").show()
spark.table("src").join (spark.table("records").hint ("MERGEJOIN"), "a").show()
//Spark SQL MERGE Join Hint
spark.sql("SELECT /*+ MERGE(r) */ * FROM records r JOIN src s ON r.a = s.a").show()
spark.sql("SELECT /*+ SHUFFLE_MERGE(r) */ * FROM records r JOIN src s ON r.a = s.a").show()
spark.sql("SELECT /*+ MERGEJOIN(r) */ * FROM records r JOIN src s ON r.a = s.a").show()
Spark SQL SHUFFLE_HASH Join Hint
The Spark SQL SHUFFLE_HASH join hint suggests that Spark use shuffle hash join. If both sides have the shuffle hash hints, Spark chooses the smaller side (based on stats) as the build side.
For example,
// SHUFFLE_HASH join Hint on Dataset
spark.table("src").join (spark.table("records").hint ("SHUFFLE_MERGE"), "a").show()
//Spark SQL SHUFFLE_HASH Join Hint
spark.sql("SELECT /*+ SHUFFLE_MERGE(r) */ * FROM records r JOIN src s ON r.a = s.a").show()
Spark SQL SHUFFLE_REPLICATE_NL Join Hint
The Spark SQL SHUFFLE_REPLICATE_NL Join Hint suggests that Spark use shuffle-and-replicate nested loop join.
For example,
// SHUFFLE_REPLICATE_NL join Hint on Dataset
spark.table("src").join (spark.table("records").hint ("SHUFFLE_REPLICATE_NL"), "a").show()
//Spark SQL SHUFFLE_REPLICATE_NL Join Hint
spark.sql("SELECT /*+ SHUFFLE_REPLICATE_NL(r) */ * FROM records r JOIN src s ON r.a = s.a").show()
Related Article,
- Basic Spark Transformations and Actions using pyspark
- Spark SQL Performance Tuning – Improve Spark SQL Performance
- Spark RDD Cache and Persist to Improve Performance
- Spark SQL Recursive DataFrame – Pyspark and Scala
- Apache Spark SQL Supported Subqueries and Examples
Hope this helps 🙂