Spark SQL Bucketing on DataFrame – Examples

We have already discussed the Hive bucketing concept in another post, and the concept is the same in Spark SQL. Bucketing divides a partition into a number of equal clusters, or buckets (the process is also called clustering). It is very similar to clustering in relational databases such as Netezza, Snowflake, etc. In this article, we will check Spark SQL bucketing on DataFrames instead of tables. We will use PySpark to demonstrate the bucketing examples; the concept is the same in Scala as well. Spark SQL Bucketing on DataFrame: Bucketing is an optimization technique in both…
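
For a quick illustration, here is a minimal PySpark sketch of bucketing a DataFrame; the sample data, bucket count, and table name are illustrative. Note that bucketBy() works only with saveAsTable(), not with a plain file path.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-example").getOrCreate()

# Illustrative sample data
df = spark.createDataFrame(
    [(1, "alice"), (2, "bob"), (3, "carol")],
    ["id", "name"],
)

# bucketBy only works with saveAsTable; writing bucketed output
# directly to a file path is not supported by Spark
(df.write
    .bucketBy(4, "id")    # hash rows into 4 buckets on the "id" column
    .sortBy("id")         # optionally sort rows within each bucket
    .mode("overwrite")
    .saveAsTable("bucketed_demo_table"))
```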

Replace Pyspark DataFrame Column Value – Methods

A DataFrame in Spark is a dataset organized into named columns. A Spark DataFrame consists of columns and rows, similar to a relational database table. There are many situations in which you may get unwanted values, such as invalid values, in the data frame. In this article, we will check how to replace such values in a PySpark DataFrame column. We will also check methods to replace values in Spark DataFrames. Replace Pyspark DataFrame Column Value: As mentioned, we often get a requirement to cleanse the data by replacing unwanted values in the DataFrame…
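
As a quick sketch, two common replacement methods are regexp_replace() and when()/otherwise(); the column name and the unwanted value below are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, when

spark = SparkSession.builder.appName("replace-example").getOrCreate()

# Illustrative data containing an unwanted "N/A" value
df = spark.createDataFrame([("N/A",), ("100",), ("250",)], ["amount"])

# Method 1: regexp_replace() swaps a matching pattern for a new value
df1 = df.withColumn("amount", regexp_replace(col("amount"), "N/A", "0"))

# Method 2: when()/otherwise() replaces values conditionally
df2 = df.withColumn(
    "amount",
    when(col("amount") == "N/A", "0").otherwise(col("amount")),
)
df2.show()
```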

Create Row for each array Element using PySpark Explode

The best thing about Spark is that you can easily work with semi-structured data such as JSON. The JSON can contain array or map elements, and you may get a requirement to create a row for each array or map element. In this article, we will check how to use the PySpark explode function to create a row for each array element. Create a Row for each array Element using PySpark Explode: Before jumping into the examples, let us first understand what the explode function in PySpark is. Pyspark Explode Function: The PySpark explode function returns…
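
Here is a minimal sketch of explode() in action; the id/letters columns and sample arrays are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-example").getOrCreate()

# Each row holds an id and an array of values
df = spark.createDataFrame(
    [(1, ["a", "b"]), (2, ["c"])],
    ["id", "letters"],
)

# explode() emits one output row per array element
df.select("id", explode("letters").alias("letter")).show()
# +---+------+
# | id|letter|
# +---+------+
# |  1|     a|
# |  1|     b|
# |  2|     c|
# +---+------+
```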

Apache Spark SQL Bucketing Support – Explanation

Spark SQL supports clustering column values using the bucketing concept. Bucketing and partitioning are similar to the Hive concepts, but with syntax changes. In this article, we will check Apache Spark SQL bucketing support in different versions of Spark, concentrating only on the Spark SQL DDL changes. For applying bucketing on a DataFrame, go through the related article. Apache Spark SQL Bucketing Support: Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning. Bucketing is one of the optimization techniques that uses buckets to…
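
As a sketch of the DDL form, the CLUSTERED BY ... INTO n BUCKETS clause can be issued through spark.sql(); the table name and schema below are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-ddl").getOrCreate()

# Spark SQL DDL form: CLUSTERED BY (...) [SORTED BY (...)] INTO n BUCKETS
spark.sql("""
    CREATE TABLE IF NOT EXISTS bucketed_tbl (
        id   INT,
        name STRING
    )
    USING PARQUET
    CLUSTERED BY (id)
    SORTED BY (id)
    INTO 4 BUCKETS
""")
```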

Spark SQL COALESCE on DataFrame – Examples

You will know the importance of the coalesce function if you come from a SQL or data warehouse background. Coalesce is one of the most widely used functions in SQL, and you can use it to return non-null values. In this article, we will check how to use Spark SQL coalesce on an Apache Spark DataFrame with an example. Spark SQL COALESCE on DataFrame: The coalesce is a non-aggregate regular function in Spark SQL. It returns the first non-null value among the given columns, or null if all columns are null. Coalesce requires at…
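
A minimal sketch of coalesce() on a DataFrame follows; the column names and the lit("n/a") default are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit

spark = SparkSession.builder.appName("coalesce-example").getOrCreate()

df = spark.createDataFrame(
    [(None, "backup"), ("primary", None), (None, None)],
    ["col_a", "col_b"],
)

# coalesce() returns the first non-null value per row;
# lit("n/a") acts as a final default when all columns are null
df.select(
    coalesce(col("col_a"), col("col_b"), lit("n/a")).alias("result")
).show()
```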

Redshift Temporary Tables, Usage and Examples

Similar to many other relational databases such as Netezza, Snowflake, Oracle, etc., Amazon Redshift supports creating temp or temporary tables to hold non-permanent data, i.e., data which you will use only in the current session; Redshift will drop the temp table soon after the session ends. In this article, we will check how to create Redshift temp or temporary tables, along with their syntax, usage, and restrictions, with some examples. Redshift Temporary Tables: A temporary table in Redshift is visible only within the current session. The table is automatically dropped at the…
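
Here is a sketch of creating and using a temp table from Python via psycopg2 (one common way to run SQL against Redshift); the cluster endpoint, credentials, and table definition are hypothetical.

```python
import psycopg2  # assumes psycopg2 is installed; connection details are hypothetical

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="secret",
)
cur = conn.cursor()

# CREATE TEMP TABLE creates a session-scoped table; Redshift drops it
# automatically when the session ends
cur.execute("CREATE TEMP TABLE stage_sales (id INT, amount DECIMAL(10,2));")
cur.execute("INSERT INTO stage_sales VALUES (1, 99.99), (2, 45.50);")
cur.execute("SELECT * FROM stage_sales;")
print(cur.fetchall())

conn.close()  # stage_sales disappears with the session
```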

Handle Cursor in Snowflake Stored Procedures – Examples

Snowflake stored procedures are used to encapsulate data migration, data validation, and business-specific logic. A stored procedure can also handle exceptions, if any, in your data, or apply custom exception handling. Relational databases such as Oracle, Redshift, Netezza, etc. support cursor variables. In this article, we will check how to handle a cursor variable in Snowflake stored procedures with an example. Handle Cursor in Snowflake Stored Procedures: In a relational database, cursors are extensively used in stored procedures to loop through the records from SELECT statements. Stored procedures encapsulate the business logic. For…
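
One way to loop through a cursor is a Snowflake Scripting (SQL) procedure, sketched below via the Python connector; older JavaScript procedures instead emulate cursors with a ResultSet. The account credentials, sales table, and amount column are hypothetical.

```python
import snowflake.connector  # assumes the connector is installed; credentials are hypothetical

conn = snowflake.connector.connect(
    account="xy12345", user="demo_user", password="secret",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# A Snowflake Scripting procedure that loops over a cursor;
# the sales table and amount column are illustrative
cur.execute("""
CREATE OR REPLACE PROCEDURE total_sales()
RETURNS FLOAT
LANGUAGE SQL
AS
$$
DECLARE
    total FLOAT DEFAULT 0.0;
    c1 CURSOR FOR SELECT amount FROM sales;
BEGIN
    FOR rec IN c1 DO
        total := total + rec.amount;
    END FOR;
    RETURN total;
END;
$$
""")
cur.execute("CALL total_sales();")
print(cur.fetchone())
conn.close()
```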

Convert Permanent table to Transient Table in Snowflake

Snowflake transient tables are similar to permanent tables; the only difference is that they do not support a fail-safe period. Therefore, the cost associated with fail-safe is not applicable to transient tables. You can use transient tables in an ETL design to hold temporary data. In this article, we will check how to convert a permanent table to a transient table in Snowflake. Transient Table in Snowflake: As mentioned earlier, transient tables are similar to permanent tables, with the key difference that fail-safe is not available. Transient tables are designed for transitory…
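
Since Snowflake offers no direct ALTER option to change a table's kind, one common approach is to clone into a transient copy and rename it, sketched below; the table names and credentials are hypothetical.

```python
import snowflake.connector  # credentials are hypothetical

conn = snowflake.connector.connect(
    account="xy12345", user="demo_user", password="secret",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# Snowflake has no ALTER TABLE option to turn a permanent table into a
# transient one, so clone it into a transient copy, then swap the names
cur.execute("CREATE TRANSIENT TABLE my_table_transient CLONE my_table;")
cur.execute("DROP TABLE my_table;")
cur.execute("ALTER TABLE my_table_transient RENAME TO my_table;")
conn.close()
```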

TRY_CAST Function Alternative in Redshift – Examples

There are many situations where a CAST conversion fails. For example, say you are trying to convert a varchar to an integer; the cast function will fail if the content is not a valid integer value. Databases such as Snowflake and Azure SQL Data Warehouse support the try_cast function to safely convert data types. In this article, we will check the TRY_CAST function alternative in Redshift and how to use it to safely convert the data types of input values. TRY_CAST Function Alternative in Redshift: Before going into details about the try_cast alternative in…
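
One common alternative is to guard the cast with a regex check in a CASE expression, sketched below via psycopg2; the connection details, table, and column names are hypothetical.

```python
import psycopg2  # connection details are hypothetical

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="secret",
)
cur = conn.cursor()

# Guard the cast with a POSIX regex check (~) so that non-numeric
# strings become NULL instead of raising a conversion error
cur.execute("""
    SELECT CASE
               WHEN raw_value ~ '^[0-9]+$' THEN raw_value::INTEGER
               ELSE NULL
           END AS safe_int
    FROM my_staging_table;
""")
print(cur.fetchall())
conn.close()
```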

Working with Snowflake External Tables and S3 Examples

Snowflake external tables allow you to access files stored in an external stage as a regular table. You can join a Snowflake external table with a permanent or managed table to get the required information, or perform complex transformations involving various tables. External tables are commonly used to build a data lake, where you access the raw data stored in the form of files and join it with existing tables. Snowflake External Tables: As mentioned earlier, external tables access files stored in an external stage area such as Amazon…
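
A sketch of the typical flow via the Python connector follows: create an external stage over S3, define an external table on it, then query it. The bucket URL, credentials, and object names are hypothetical.

```python
import snowflake.connector  # credentials, bucket, and names are hypothetical

conn = snowflake.connector.connect(
    account="xy12345", user="demo_user", password="secret",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# Point an external stage at an S3 location...
cur.execute("""
    CREATE OR REPLACE STAGE my_s3_stage
    URL = 's3://my-bucket/sales/'
    CREDENTIALS = (AWS_KEY_ID = 'my_key_id' AWS_SECRET_KEY = 'my_secret');
""")

# ...then expose the staged files as an external table
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE ext_sales
    WITH LOCATION = @my_s3_stage
    FILE_FORMAT = (TYPE = PARQUET);
""")

# Query it (and join with managed tables) like a regular table;
# the VALUE column holds each row as a VARIANT
cur.execute("SELECT value FROM ext_sales LIMIT 10;")
print(cur.fetchall())
conn.close()
```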
