Spark SQL Performance Tuning – Improve Spark SQL Performance

You can improve Spark SQL performance by making simple changes to system parameters. Tuning requires some knowledge of Spark and of the type of file system you use. In this article, we will check the Spark SQL performance tuning options that help improve Spark SQL performance.

Data Storage Considerations for Spark Performance

Before going into Spark SQL performance tuning, let us check some of the data storage considerations that affect Spark performance.

Optimize File System

To improve Spark SQL performance, you should optimize the file layout on your file system. Files should not be too small, as Spark will spend a lot of time opening and reading all those small files.

Files should not be too large either, as Spark will spend time splitting them when it reads. The optimal file size is roughly 64 MB to 1 GB.
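
As a minimal PySpark sketch, you can rewrite many small files as a handful of larger ones before downstream queries read them; the input and output paths and the partition count below are hypothetical and should be adjusted to your data volume.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-size-tuning").getOrCreate()

# Read many small files and rewrite them as fewer, larger files.
df = spark.read.parquet("/data/input")                              # hypothetical input path
df.repartition(8).write.mode("overwrite").parquet("/data/output")   # aim for roughly 64 MB to 1 GB per file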

Spark SQL Performance Tuning

Spark SQL is the Spark module for processing structured data. Almost all organizations use relational databases, and if they want in-memory processing they can use Spark SQL. The high-level query language and additional type information make Spark SQL more efficient.

Spark SQL uses in-memory columnar storage. This feature stores cached data in a columnar format rather than a row format. Columnar storage works extremely well with complex analytic queries: columnar data takes less space and takes less time to fetch when a query is executed.
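
As a quick illustration, the following PySpark sketch caches a table in Spark's in-memory columnar format before querying it; the path, the sales table name and the region and amount columns are made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-cache").getOrCreate()

df = spark.read.parquet("/data/sales")     # hypothetical path
df.createOrReplaceTempView("sales")

# CACHE TABLE stores the data in Spark's in-memory columnar format.
spark.sql("CACHE TABLE sales")
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()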

Spark SQL Performance Tuning Options

There are many ways you can tune Spark SQL queries.

Use Explain Plan to Analyze Query

One of the easiest ways to optimize a Spark SQL query is to use the Spark EXPLAIN plan to identify how the Spark engine will execute your query.

You can get more information on the Spark EXPLAIN plan in my other post – Spark SQL EXPLAIN Operator and Examples.
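
For example, here is a minimal sketch that prints the plan from the DataFrame API and through plain SQL; it reuses the hypothetical sales view from the caching example above.

# Assumes the SparkSession `spark` and the `sales` view created earlier.
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").explain(True)

# The same plan through the EXPLAIN keyword in SQL.
spark.sql("EXPLAIN EXTENDED SELECT region, SUM(amount) FROM sales GROUP BY region").show(truncate=False)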

Set Spark Parameters to Speed Up Spark SQL

Here are some commonly used Spark parameter settings.

spark.sql.inMemoryColumnarStorage.compressed

This option controls compression of Spark's in-memory columnar storage. Make sure it is set to true; by default, spark.sql.inMemoryColumnarStorage.compressed is already true, so cached data is compressed.
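
A minimal sketch of setting it explicitly before caching (the sales view is the hypothetical one used above):

# Assumes the SparkSession `spark` and the `sales` view created earlier.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.catalog.cacheTable("sales")   # cached columns are stored compressed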

spark.sql.codegen

The default value of spark.sql.codegen is false. Setting it to true makes Spark compile each query to Java bytecode, which speeds up large or repeated queries. It can add overhead for smaller queries, so you should know your workload before enabling it.
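
A minimal sketch of enabling it; note that spark.sql.codegen is a flag from older Spark releases, and on Spark 2.x and later whole-stage code generation is controlled by spark.sql.codegen.wholeStage and is on by default.

# Assumes the SparkSession `spark` created earlier.
spark.conf.set("spark.sql.codegen", "true")             # legacy flag on older Spark versions

# On Spark 2.x and later, whole-stage code generation is already enabled by default.
spark.conf.set("spark.sql.codegen.wholeStage", "true")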

spark.sql.parquet.compression.codec

With this option, you can choose the compression algorithm for the Parquet file format. By default, the compression codec is snappy. Other possible options are uncompressed, gzip and lzo.
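
A minimal sketch, assuming the hypothetical DataFrame df and output paths from the earlier examples:

# Set the codec globally for Parquet writes...
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
df.write.mode("overwrite").parquet("/data/output_gzip")

# ...or override it for a single write.
df.write.option("compression", "gzip").parquet("/data/output_gzip_single")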

spark.sql.inMemoryColumnarStorage.batchSize

The default value of spark.sql.inMemoryColumnarStorage.batchSize is 10000. It controls the batch size for columnar caching. Larger values improve memory utilization and compression, but risk out-of-memory errors when caching data. You should know your cluster infrastructure before changing this option.
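
A minimal sketch of raising the batch size before caching; 20000 is an arbitrary illustrative value, not a recommendation.

# Assumes the SparkSession `spark` and the `sales` view created earlier.
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")
spark.catalog.cacheTable("sales")   # larger batches per cached column, at the cost of more memory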

Hope this helps 🙂