How to Search String in Spark DataFrame? – Scala and PySpark

As a data engineer, you work with many different kinds of datasets, and a common requirement is to filter or search for a specific string within a DataFrame, for example to identify junk strings in a dataset. This article covers different methods for searching for a string in a Spark DataFrame. Apache Spark supports many built-in API methods that you can use to search for a specific string in a DataFrame. Following are the…

How to Find Tables Size in Spark SQL? – Scala Example

Whether you work with a relational database, Hive, or Spark SQL, finding the size of a table is a common requirement. Relational databases such as Snowflake and Teradata provide system tables that you can use to identify table sizes. Spark SQL has no such system tables, but you can use the Spark catalog API to find the sizes of the tables in a Spark SQL database. Spark has supported the catalog API since version 2.0. It has many useful methods such as listTables, listDatabases,…

Spark SQL to_date() Function – PySpark and Scala

Spark SQL supports many date and time conversion functions. One such function is to_date(), which converts a string containing a date into a value of date type. It is useful when you need to transform captured string data into a proper date type. This article shows how to use the Spark to_date function on a DataFrame as well as in plain SQL queries. You can use the Spark to_date() function to convert and format string…

Spark SQL Recursive DataFrame – PySpark and Scala

Identifying the top-level hierarchy of one column from another column is an important feature that many relational databases such as Teradata, Oracle, and Snowflake support. Relational databases use recursive queries to identify hierarchies of data, such as an organizational structure, employee-manager, bill-of-materials, and document hierarchy. Databases such as Teradata and Snowflake support recursive queries in the form of a recursive WITH clause or recursive views. But Spark SQL does not support recursive CTEs or recursive views. In this article, we will check Spark SQL recursive DataFrame using…
