Apache Spark SQL Supported Subqueries and Examples

A subquery in Spark SQL is a select expression enclosed in parentheses as a nested query block within a query statement. As in other relational databases, a subquery in Apache Spark SQL may return zero, one, or more values to its outer select statement. In this article, we will check the Apache Spark SQL supported subqueries with some examples. You can use these…
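
A minimal PySpark sketch of the idea (not from the original article); the employees view, its columns, and the sample values below are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("subquery_example").getOrCreate()

    # Hypothetical sample data, used only to illustrate the syntax.
    spark.createDataFrame(
        [(1, "Alice", 10, 4500), (2, "Bob", 20, 3800), (3, "Carol", 10, 5200)],
        ["id", "name", "dept_id", "salary"],
    ).createOrReplaceTempView("employees")

    # Scalar subquery: employees earning more than the overall average salary.
    spark.sql("""
        SELECT name, salary
        FROM employees
        WHERE salary > (SELECT AVG(salary) FROM employees)
    """).show()

    # IN subquery: employees in departments that appear more than once.
    spark.sql("""
        SELECT name
        FROM employees
        WHERE dept_id IN (SELECT dept_id FROM employees
                          GROUP BY dept_id HAVING COUNT(*) > 1)
    """).show()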


How to Remove Duplicate Records from Spark DataFrame – Pyspark and Scala

You can create a Spark DataFrame with duplicate records; there are no methods that prevent you from adding them, and applications such as ETL processes may well create DataFrames containing duplicates. Spark SQL supports several methods to de-duplicate a table. In this article, we will check how to identify and remove duplicate records from a Spark SQL DataFrame. There are many methods that you can use to identify and remove duplicate records from a Spark SQL DataFrame…
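
A short illustrative sketch (the sample data is hypothetical) of the commonly used de-duplication methods, distinct and dropDuplicates:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedup_example").getOrCreate()

    # Hypothetical data containing exact and partial duplicates.
    df = spark.createDataFrame(
        [(1, "Alice"), (1, "Alice"), (2, "Bob"), (2, "Robert")],
        ["id", "name"],
    )

    # Remove rows that are duplicated across all columns.
    df.distinct().show()
    df.dropDuplicates().show()

    # Remove rows duplicated on a subset of columns (keeps one arbitrary row per id).
    df.dropDuplicates(["id"]).show()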


Spark SQL Recursive DataFrame – Pyspark and Scala

Identifying the top-level hierarchy of one column from another column is an important feature that many relational databases such as Teradata, Oracle, and Snowflake support. These databases use recursive queries to identify hierarchies in data, such as an organizational structure, employee-manager relationships, a bill of materials, or a document hierarchy, typically in the form of a recursive WITH clause or recursive views. Spark SQL, however, does not support recursive CTEs or recursive views. In this article, we will check how to build a Spark SQL recursive DataFrame using…
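
One common workaround is to emulate the recursive CTE with an iterative self-join loop. The sketch below is an assumption about how such a loop might look, using hypothetical employee-manager edges; it is not the article's own implementation and is naive for deep hierarchies (each iteration triggers a Spark job):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("recursive_example").getOrCreate()

    # Hypothetical employee -> manager edges; None marks the top of the hierarchy.
    edges = spark.createDataFrame(
        [(1, None), (2, 1), (3, 1), (4, 2), (5, 4)],
        ["emp_id", "mgr_id"],
    )

    # Start from the root, then repeatedly join to walk one level deeper,
    # accumulating the level number until no new rows are produced.
    level = edges.filter(F.col("mgr_id").isNull()).select("emp_id").withColumn("level", F.lit(0))
    result = level
    while True:
        nxt = (
            edges.join(level, edges.mgr_id == level.emp_id)
            .select(edges.emp_id, (level.level + 1).alias("level"))
        )
        if nxt.count() == 0:
            break
        result = result.unionByName(nxt)
        level = nxt

    result.orderBy("level", "emp_id").show()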


Spark SQL and Dataset Hints Types- Usage and Examples

In general, query hints or optimizer hints can be used with SQL statements to alter execution plans. Hints let you make decisions that are usually made by the optimizer while generating an execution plan. As a data architect, you might know information about your data that the optimizer does not. Hints provide a mechanism to direct the optimizer to choose a particular query execution plan based on specific criteria. In this article, we will check Spark SQL and Dataset hint types, their usage, and examples…
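
As a small illustration (the table names and data here are made up), the broadcast hint can be given either through the Dataset hint method or as a SQL comment hint:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hint_example").getOrCreate()

    large = spark.range(0, 1000000).withColumnRenamed("id", "key")
    small = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "value"])

    # Dataset hint: ask the optimizer to broadcast the small side of the join.
    large.join(small.hint("broadcast"), "key").explain()

    # Equivalent SQL hint syntax.
    large.createOrReplaceTempView("large_tbl")
    small.createOrReplaceTempView("small_tbl")
    spark.sql("""
        SELECT /*+ BROADCAST(s) */ l.key, s.value
        FROM large_tbl l JOIN small_tbl s ON l.key = s.key
    """).explain()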


Create Spark SQL isdate Function – Date Validation

Many databases, such as SQL Server, support an isdate function. Spark SQL supports many DataFrame methods, and we have already seen Spark SQL date functions in my other post, "Spark SQL Date and Timestamp Functions". You may have noticed that there is no built-in function to validate date and timestamp values in Spark SQL. Alternatively, you can use Hive date functions to filter out unwanted dates. In this article, we will check how to create a Spark SQL isdate user defined function with an example. The best part about…
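
A minimal sketch of one way such a UDF could look; the function name, date format, and sample values are assumptions for illustration, not necessarily the article's exact implementation:

    from datetime import datetime
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.appName("isdate_example").getOrCreate()

    def isdate(value, fmt="%Y-%m-%d"):
        """Return True if value parses as a date in the given format."""
        try:
            datetime.strptime(value, fmt)
            return True
        except (TypeError, ValueError):
            return False

    isdate_udf = udf(isdate, BooleanType())

    df = spark.createDataFrame([("2021-02-28",), ("2021-02-30",), ("not a date",)], ["dt"])
    df.withColumn("is_valid", isdate_udf("dt")).show()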


Spark SQL Bucketing on DataFrame – Examples

We have already discussed the Hive bucketing concept in my other post, and the concept is the same in Spark SQL. Bucketing divides a partition into a number of equal clusters (also called clustering) or buckets, and is very similar to clustering in relational databases such as Netezza, Snowflake, etc. In this article, we will check Spark SQL bucketing on DataFrames instead of tables. We will use PySpark to demonstrate the bucketing examples; the concept is the same in Scala. Bucketing is an optimization technique in both…
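
A small sketch of writing a bucketed table from a DataFrame (the table name, bucket count, and sample data are hypothetical). Note that bucketing only applies to persistent tables, so the write goes through saveAsTable:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bucketing_example").getOrCreate()

    df = spark.createDataFrame(
        [(1, "US"), (2, "IN"), (3, "US"), (4, "UK")],
        ["id", "country"],
    )

    (
        df.write
        .bucketBy(4, "id")    # 4 buckets, hashed on the id column
        .sortBy("id")         # optional: sort rows within each bucket
        .mode("overwrite")
        .saveAsTable("bucketed_demo")
    )

    spark.sql("DESCRIBE EXTENDED bucketed_demo").show(truncate=False)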


Replace Pyspark DataFrame Column Value – Methods

A DataFrame in Spark is a dataset organized into named columns. A Spark DataFrame consists of columns and rows, similar to a relational database table. There are many situations in which a DataFrame may contain unwanted values, such as invalid entries. In this article, we will check how to replace such a value in a PySpark DataFrame column, along with other methods to replace values in Spark DataFrames. As mentioned, we often get a requirement to cleanse the data by replacing unwanted values in the DataFrame…
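
A brief sketch of a few common replacement approaches, using made-up sample data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("replace_example").getOrCreate()

    df = spark.createDataFrame(
        [(1, "N/A"), (2, "USA"), (3, "U.S.A")],
        ["id", "country"],
    )

    # Method 1: regexp_replace rewrites matching substrings.
    df.withColumn("country", F.regexp_replace("country", r"U\.S\.A", "USA")).show()

    # Method 2: when/otherwise for conditional replacement.
    df.withColumn(
        "country",
        F.when(F.col("country") == "N/A", None).otherwise(F.col("country")),
    ).show()

    # Method 3: DataFrameNaFunctions.replace for a simple value mapping.
    df.na.replace({"N/A": "UNKNOWN"}, subset=["country"]).show()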


Create Row for each array Element using PySpark Explode

One of the best things about Spark is that you can easily work with semi-structured data such as JSON. JSON can contain array or map elements, and you may get a requirement to create a row for each array or map element. In this article, we will check how to use the PySpark explode function to create a row for each array element. Before jumping into the examples, let us first understand what the explode function in PySpark is. The PySpark explode function returns…
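
A quick sketch of explode on a hypothetical array column (the names and phone numbers are invented for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("explode_example").getOrCreate()

    # Hypothetical semi-structured data: each row holds an array of phone numbers.
    df = spark.createDataFrame(
        [("Alice", ["111-1111", "222-2222"]), ("Bob", ["333-3333"]), ("Carol", [])],
        ["name", "phones"],
    )

    # explode creates one output row per array element; empty arrays produce no rows.
    df.select("name", F.explode("phones").alias("phone")).show()

    # explode_outer keeps rows whose array is empty or null (phone becomes null).
    df.select("name", F.explode_outer("phones").alias("phone")).show()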


Apache Spark SQL Bucketing Support – Explanation

Spark SQL supports clustering column values using the bucketing concept. Bucketing and partitioning are similar to the Hive concepts, but with some syntax changes. In this article, we will check Apache Spark SQL bucketing support in different versions of Spark, concentrating only on the Spark SQL DDL changes. For applying bucketing to a DataFrame, go through the article above. Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning. The bucketing concept is one of the optimization techniques that uses buckets to…
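
For reference, a minimal sketch of the Spark SQL DDL form of bucketing (the table name, columns, and bucket count are assumptions made for this example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bucketing_ddl_example").getOrCreate()

    # Spark SQL DDL for a bucketed table; the syntax differs slightly from Hive DDL.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_bucketed (
            order_id    INT,
            customer_id INT,
            amount      DOUBLE
        )
        USING PARQUET
        CLUSTERED BY (customer_id) INTO 8 BUCKETS
    """)

    spark.sql("DESCRIBE EXTENDED sales_bucketed").show(truncate=False)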


Spark SQL COALESCE on DataFrame – Examples

You will know the importance of the coalesce function if you come from a SQL or data warehouse background. Coalesce is one of the most widely used functions in SQL; you can use it to return non-null values. In this article, we will check how to use Spark SQL coalesce on an Apache Spark DataFrame with an example. Coalesce is a non-aggregate regular function in Spark SQL that gives the first non-null value among the given columns, or null if all columns are null. Coalesce requires at…
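
A small illustration of the behavior described above, using hypothetical contact data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("coalesce_example").getOrCreate()

    df = spark.createDataFrame(
        [("Alice", None, "555-0001"), ("Bob", "555-0002", None), ("Carol", None, None)],
        ["name", "mobile", "office"],
    )

    # coalesce returns the first non-null value among its arguments,
    # or null (here replaced by a literal fallback) if all are null.
    df.select(
        "name",
        F.coalesce(F.col("mobile"), F.col("office"), F.lit("no phone")).alias("contact"),
    ).show()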
