DataFrame Archives - DWgeek.com

How to Use Spark SQL REPLACE on DataFrame?

Similar to the DataFrame COALESCE function, the REPLACE function is one of the important functions that you will use to manipulate string data. It is one of the most widely used functions in SQL, and you can use it to replace values. In this article, we will check how to use the Spark SQL replace function on an Apache Spark DataFrame with an example. Spark SQL REPLACE on DataFrame: In SQL, the replace function removes all occurrences of a specified substring, and optionally replaces them with another string.…
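The replace semantics described in this excerpt can be illustrated without a Spark cluster. The following is a minimal plain-Python sketch, not the Spark API: the helper name `sql_replace` is hypothetical, and treating `None` like a SQL NULL is an assumption of this sketch.

```python
def sql_replace(value, search, replacement=""):
    """Model SQL-style REPLACE on a single string value.

    Hypothetical helper for illustration only; not part of Spark.
    With no replacement given, every occurrence of `search` is removed.
    A None input is treated like SQL NULL and passed through (assumption).
    """
    if value is None:
        return None
    return value.replace(search, replacement)


# Sample column values, including a NULL-like entry.
rows = ["spark-sql", "py-spark", None]

# Remove every "-" from each value.
removed = [sql_replace(v, "-") for v in rows]

# Replace every "-" with "_" instead.
replaced = [sql_replace(v, "-", "_") for v in rows]
```

In PySpark itself, string replacement on a column is commonly done with `pyspark.sql.functions.regexp_replace`, which takes a regular-expression pattern rather than a literal substring.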

June 16, 2022

Apache Spark

How to Add Column with Default Value to Pyspark DataFrame?

Since its inception, Spark has improved a lot and added many useful DataFrame APIs. If you come from a SQL background, you might have noticed that assigning a default value to a newly added column is a common practice. This ensures that the new column does not hold junk or NULL values. In this article, we will check how to add a column with a default or constant value to a Pyspark DataFrame. Add a Column with Default Value to Pyspark DataFrame: Adding a…

June 11, 2021

Apache Spark

How to Remove Duplicate Records from Spark DataFrame – Pyspark and Scala

You can create a Spark DataFrame with duplicate records. There are no methods that prevent you from adding duplicate records to a Spark DataFrame, and an application such as an ETL process may create a DataFrame with duplicate records. Spark SQL supports several methods to de-duplicate the table. In this article, we will check how to identify and remove duplicate records from a Spark SQL DataFrame. Remove Duplicate Records from Spark DataFrame: There are many methods that you can use to identify and remove the duplicate records from the Spark SQL DataFrame.…

June 2, 2021

Apache Spark

Spark SQL Recursive DataFrame – Pyspark and Scala

Identifying the top-level hierarchy of one column from another column is one of the important features that many relational databases such as Teradata, Oracle, and Snowflake support. Relational databases use recursive queries to identify hierarchies of data, such as an organizational structure, employee-manager, bill-of-materials, and document hierarchy. Relational databases such as Teradata and Snowflake support recursive queries in the form of a recursive WITH clause or recursive views. However, Spark SQL does not support recursive CTEs or recursive views. In this article, we will check Spark SQL recursive DataFrame using…

May 31, 2021

Apache Spark

Spark SQL COALESCE on DataFrame – Examples

You will know the importance of the coalesce function if you are from a SQL or data warehouse background. Coalesce is one of the most widely used functions in SQL. You can use it to return non-null values. In this article, we will check how to use Spark SQL coalesce on an Apache Spark DataFrame with an example. Spark SQL COALESCE on DataFrame: Coalesce is a non-aggregate regular function in Spark SQL. It gives the first non-null value among the given columns, or null if all columns are null. Coalesce requires at…
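The rule stated in this excerpt (the first non-null value among the given columns, or null if all of them are null) can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not the Spark implementation: the local function `coalesce` and the sample rows are hypothetical, and `None` stands in for SQL NULL.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None.

    Illustrative model of SQL COALESCE; None plays the role of NULL.
    """
    for v in values:
        if v is not None:
            return v
    return None


# Each tuple models one row with three nullable columns.
rows = [
    (None, "b", "c"),
    (None, None, "c"),
    (None, None, None),
]

# Apply the rule row by row.
result = [coalesce(*row) for row in rows]
```

Note that only `None` counts as null here, so values such as `0` or an empty string are returned as-is. In PySpark, the corresponding column function is `pyspark.sql.functions.coalesce`, which takes column arguments.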

May 7, 2020

Apache Spark

Spark SQL DataFrame Self Join and Example

You can use Spark Dataset join operators to join multiple DataFrames in Spark. Two or more DataFrames are joined to perform specific tasks, such as getting common data from both DataFrames. In this article, we will check how to perform a Spark SQL DataFrame self join using Pyspark. Spark SQL DataFrame Self Join using Pyspark: Spark DataFrame supports various join types, as mentioned in Spark Dataset join operators. A self join is a join in which a DataFrame is joined to itself. The self join is used to identify…

November 14, 2019