Whether it is a relational database, Hive, or Spark SQL, finding the table size is a common requirement. Relational databases such as Snowflake, Teradata, etc. provide system tables that you can query to identify the size of tables. Spark SQL has no such system tables, but you can use the Spark catalog API to find the table sizes in a Spark SQL database.
Find Table Size in Spark SQL
Starting with version 2.0, Spark supports the catalog API. It has many useful methods such as listTables, listDatabases, etc. You can read more about those APIs in my other post – Search Table in Spark Database.
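For instance, a minimal sketch from the spark-shell (assuming the SparkSession is available as spark and a "default" database exists) looks like this:
// List all databases visible to the current SparkSession
spark.catalog.listDatabases().show(false)

// List all tables in the "default" database
spark.catalog.listTables("default").show(false)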
You can determine the size of a table by summing the sizes of the individual files within its underlying directory (a sketch of this approach follows the example below). Alternatively, you can use queryExecution.analyzed.stats to return the size.
For example, the following returns the size of the "customer" table.
spark.read.table("customer").queryExecution.analyzed.stats.sizeInBytes
and the following is the output:
res0: BigInt = 13
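If you prefer the first approach, summing the sizes of the files in the table's underlying directory, the following is a minimal sketch using the Hadoop FileSystem API. The warehouse path /user/hive/warehouse/customer is an assumption; adjust it to your table's actual location.
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumed location of the "customer" table's data files; change as needed
val tablePath = new Path("/user/hive/warehouse/customer")

// Sum the sizes of all files under the table's directory
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val tableSizeInBytes = fs.getContentSummary(tablePath).getLength

println("customer: " + tableSizeInBytes)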
Identify the Size of All Tables Present in a Spark SQL Database
Along with the Spark catalog API methods, you can use queryExecution.analyzed.stats to return the size of all tables in a Spark SQL database.
The following example returns the size of all tables present in the "default" database. Feel free to change the database name as per your requirements.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL Find Table Size")
  .getOrCreate()

val db_name = "default"

// List the names of all tables in the given database
val tb_list = spark.catalog.listTables(db_name).select("name")

// Print the size (in bytes) of each table
for (row <- tb_list.rdd.collect) {
  val tb_name = row(0).toString
  println(tb_name + ": " + spark.read.table(db_name + "." + tb_name).queryExecution.analyzed.stats.sizeInBytes)
}
Sample output:
customer: 13
date_dim: 300
demo_data: 88
Hope this helps 🙂