How to Use Spark SQL REPLACE on DataFrame?

Similar to the DataFrame COALESCE function, the REPLACE function is one of the important functions you will use to manipulate string data. REPLACE is one of the most widely used functions in SQL, and you can use it to replace values. In this article, we will check how to use the Spark SQL replace function on an Apache Spark DataFrame with an example. In SQL, the replace function removes all occurrences of a specified substring and optionally replaces them with another string.…
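As a quick illustration, here is a minimal sketch of two ways to do this, one through a SQL expression and one through the DataFrame API, assuming an active SparkSession named spark; the "company" column and its values are made up for the example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

// Assumed SparkSession; in spark-shell this already exists as `spark`.
val spark = SparkSession.builder().appName("ReplaceExample").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data; column names and values are made up.
val df = Seq(("1", "ABC-Corp"), ("2", "XYZ-Corp")).toDF("id", "company")

// Spark SQL replace() used through an expression (Spark 2.3+).
df.selectExpr("id", "replace(company, '-Corp', ' Inc') AS company").show()

// Equivalent DataFrame API approach using regexp_replace.
df.withColumn("company", regexp_replace($"company", "-Corp", " Inc")).show()

Note that replace() works on literal substrings, while regexp_replace treats the search string as a regular expression.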


How to Search String in Spark DataFrame? – Scala and PySpark

As a data engineer, you will work with many different kinds of datasets, and you will often get a requirement to filter out or search for a specific string within a dataset or DataFrame, for example, to identify junk strings within the data. In this article, we will check how to search for a string in a Spark DataFrame using different methods. Apache Spark supports many built-in API methods that you can use to search for a specific string in a DataFrame. Following are the…
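For example, here is a minimal sketch of the common filter-based approaches, assuming an active SparkSession named spark with spark.implicits._ imported; the DataFrame, column names, and values are illustrative.

import org.apache.spark.sql.functions.col

// Illustrative data; names and cities are made up.
val people = Seq(("John", "New York"), ("Jane", "Newark"), ("Mike", "Boston")).toDF("name", "city")

// contains() checks for a literal substring.
people.filter(col("city").contains("New")).show()

// like() uses SQL wildcard patterns.
people.filter(col("city").like("New%")).show()

// rlike() matches a regular expression.
people.filter(col("city").rlike("^New")).show()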


How to Find Tables Size in Spark SQL? – Scala Example

Be it a relational database, Hive, or Spark SQL, finding the table size is a common requirement. Relational databases such as Snowflake, Teradata, etc., support system tables, and you can use those system tables to identify the size of tables. But there are no such system tables in Spark SQL. Instead, you can make use of the Spark catalog API to find table sizes in a Spark SQL database. Starting with version 2.0, Spark supports the catalog API, which has many useful methods such as listTables, listDatabases,…
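One possible sketch is shown below, assuming an active SparkSession named spark, Spark 2.3 or later, and a database called "default" (an assumed name); it reports the size estimate from the optimized plan statistics, which can be an estimate rather than the exact on-disk size.

// List tables in a database and print Spark's size estimate for each.
val dbName = "default"  // assumed database name
spark.catalog.listTables(dbName).collect().foreach { t =>
  val sizeInBytes = spark.table(s"$dbName.${t.name}")
    .queryExecution.optimizedPlan.stats.sizeInBytes
  println(s"${t.name}: $sizeInBytes bytes")
}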


Spark SQL Array Functions – Syntax and Examples

Similar to relational databases such as Snowflake and Teradata, Spark SQL supports many useful array functions. You can use these functions to manipulate array types. In this article, we will check how to work with Spark SQL array functions, their syntax, and examples. Following is the list of Spark SQL array functions with brief descriptions:

array(expr, ...) – Returns an array with the given elements.
array_contains(array, value) – Returns true if the array contains the value.
array_distinct(array) – Removes duplicate values from the array.
array_except(array1, array2) – Returns an array…
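A short sketch of a few of these functions follows, assuming an active SparkSession named spark with spark.implicits._ imported and Spark 2.4 or later (where array_distinct and array_except were added); the sample arrays are made up.

import org.apache.spark.sql.functions._

// Illustrative array columns.
val arrDf = Seq((Seq(1, 2, 2, 3), Seq(2, 4))).toDF("a", "b")

arrDf.select(
  array_contains($"a", 2).as("contains_2"),   // true
  array_distinct($"a").as("distinct_a"),      // [1, 2, 3]
  array_except($"a", $"b").as("a_minus_b")    // [1, 3]
).show()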


Spark SQL Correlated Subquery and Usage Restrictions

A correlated subquery in Spark SQL is a query within a query that refers to columns from the parent, or outer, query table. This kind of subquery contains one or more correlations between its columns and the columns produced by the outer query. Spark SQL supports both regular and correlated subqueries. You can use subqueries to improve the performance of Spark SQL queries, for example by limiting the number of records returned by the subquery. Spark SQL supports many types of subqueries. However, it…
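For instance, here is a minimal sketch of a correlated scalar subquery, assuming an active SparkSession named spark with spark.implicits._ imported; the emp view and its columns are made up, and note that Spark requires correlated scalar subqueries like this one to be aggregated.

// Illustrative data registered as a temporary view.
Seq((1, "Sales", 5000), (2, "Sales", 7000), (3, "HR", 4000))
  .toDF("id", "dept", "salary")
  .createOrReplaceTempView("emp")

// Employees earning more than their department's average salary.
spark.sql("""
  SELECT e.id, e.dept, e.salary
  FROM emp e
  WHERE e.salary > (SELECT avg(i.salary) FROM emp i WHERE i.dept = e.dept)
""").show()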


Spark SQL to_date() Function – Pyspark and Scala

Spark SQL supports many date and time conversion functions. One such function is to_date(), which is used to convert a string containing a date into a date type. The function is useful when you are trying to transform captured string data into a particular data type, such as a date type. In this article, we will check how to use the Spark to_date function on a DataFrame as well as in plain SQL queries. You can use the Spark to_date() function to convert and format string…
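As a quick sketch, assuming an active SparkSession named spark with spark.implicits._ imported; the date strings and column names are illustrative.

import org.apache.spark.sql.functions.to_date

// Default format (yyyy-MM-dd).
Seq("2021-01-15", "2021-02-20").toDF("dt_str")
  .withColumn("dt", to_date($"dt_str"))
  .show()

// Explicit format string for non-default layouts.
Seq("15-01-2021").toDF("dt_str")
  .withColumn("dt", to_date($"dt_str", "dd-MM-yyyy"))
  .show()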


Apache Spark SQL Supported Subqueries and Examples

A subquery in Spark SQL is a select expression enclosed in parentheses as a nested query block within a query statement. A subquery in Apache Spark SQL is similar to a subquery in other relational databases: it may return zero, one, or more values to its outer select statement. In this article, we will check the subqueries that Apache Spark SQL supports, along with some examples. Spark SQL subqueries are another select statement or expression enclosed in parentheses as a nested query block. You can use these…
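A small sketch of two supported forms, an IN predicate subquery and an uncorrelated scalar subquery, assuming an active SparkSession named spark with spark.implicits._ imported; the customers and orders views are made up.

// Illustrative data registered as temporary views.
Seq((1, "Alice"), (2, "Bob")).toDF("id", "name").createOrReplaceTempView("customers")
Seq((1, 100.0)).toDF("customer_id", "amount").createOrReplaceTempView("orders")

// IN subquery in the WHERE clause.
spark.sql("SELECT name FROM customers c WHERE c.id IN (SELECT customer_id FROM orders)").show()

// Uncorrelated scalar subquery in the SELECT list.
spark.sql("SELECT name, (SELECT max(amount) FROM orders) AS max_amount FROM customers").show()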


How to Remove Duplicate Records from Spark DataFrame – Pyspark and Scala

You can create a Spark DataFrame with duplicate records; there are no methods that prevent you from adding duplicate records to a Spark DataFrame. There is a chance that some applications, such as an ETL process, may create a DataFrame with duplicate records. Spark SQL supports several methods to de-duplicate a table. In this article, we will check how to identify and remove duplicate records from a Spark SQL DataFrame. There are many methods that you can use to identify and remove duplicate records from a Spark SQL DataFrame.…
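For example, here is a minimal sketch of the two most common approaches, assuming an active SparkSession named spark with spark.implicits._ imported; the sample rows are made up.

// Illustrative DataFrame containing duplicates.
val dup = Seq((1, "A"), (1, "A"), (2, "B"), (2, "C")).toDF("id", "name")

// distinct() drops rows that are identical across all columns.
dup.distinct().show()

// dropDuplicates() can de-duplicate on a subset of columns.
dup.dropDuplicates("id").show()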


Spark SQL Recursive DataFrame – Pyspark and Scala

Identifying the top-level hierarchy of one column from another column is one of the important features that many relational databases such as Teradata, Oracle, and Snowflake support. Relational databases use recursive queries to identify hierarchies of data, such as an organizational structure, employee-manager relationships, a bill of materials, or a document hierarchy. Databases such as Teradata and Snowflake support recursive queries in the form of a recursive WITH clause or recursive views, but Spark SQL does not support recursive CTEs or recursive views. In this article, we will check how to build a Spark SQL recursive DataFrame using…
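One way to emulate a recursive CTE is to loop over self-joins until no new level is produced. The sketch below assumes an active SparkSession named spark with spark.implicits._ imported; the employee-manager data and column names are made up.

import org.apache.spark.sql.functions._

// Illustrative employee-manager hierarchy; mgr_id 0 marks the root.
val emp = Seq((1, "CEO", 0), (2, "VP", 1), (3, "Director", 2), (4, "Engineer", 3))
  .toDF("id", "name", "mgr_id")

// Start from the root level and repeatedly join children to the
// previous level, emulating recursion with a plain loop.
var level = emp.filter($"mgr_id" === 0).withColumn("level", lit(0))
var result = level
while (level.count() > 0) {
  level = emp.as("c")
    .join(level.as("p"), $"c.mgr_id" === $"p.id")
    .select($"c.id", $"c.name", $"c.mgr_id", ($"p.level" + 1).as("level"))
  result = result.unionByName(level)
}
result.orderBy("level").show()

Each iteration triggers a count() action, which is fine for a sketch; for large hierarchies you would typically cache the intermediate levels.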
