Spark SQL Bucketing on DataFrame – Examples

We have already discussed the Hive bucketing concept in another post, and the concept is the same in Spark SQL. Bucketing divides a partition into a number of equal clusters, or buckets (the process is also called clustering). It is very similar to clustering in relational databases such as Netezza, Snowflake, etc. In this article, we will check Spark SQL bucketing on DataFrames instead of tables. We will use PySpark to demonstrate the bucketing examples; the concept is the same in Scala as well. Spark SQL Bucketing on DataFrame: Bucketing is an optimization technique in both…
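
For a quick illustration, here is a minimal PySpark sketch of bucketing a DataFrame; the sample data, bucket count, and table name are illustrative. Note that bucketBy() works only with saveAsTable(), not with a plain file path.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-example").getOrCreate()

# Illustrative sample data
df = spark.createDataFrame(
    [(1, "alice"), (2, "bob"), (3, "carol")],
    ["id", "name"],
)

# bucketBy only works with saveAsTable; writing bucketed output
# directly to a file path is not supported by Spark
(df.write
    .bucketBy(4, "id")    # hash rows into 4 buckets on the "id" column
    .sortBy("id")         # optionally sort rows within each bucket
    .mode("overwrite")
    .saveAsTable("bucketed_demo_table"))
```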

Replace Pyspark DataFrame Column Value – Methods

A DataFrame in Spark is a dataset organized into named columns. A Spark DataFrame consists of columns and rows, similar to a relational database table. There are many situations in which you may get unwanted values, such as invalid values, in the data frame. In this article, we will check how to replace such values in a PySpark DataFrame column. We will also check methods to replace values in Spark DataFrames. Replace Pyspark DataFrame Column Value: As mentioned, we often get a requirement to cleanse the data by replacing unwanted values in the DataFrame…
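
As a quick sketch, two common replacement methods are regexp_replace() and when()/otherwise(); the column name and the unwanted value below are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, when

spark = SparkSession.builder.appName("replace-example").getOrCreate()

# Illustrative data containing an unwanted "N/A" value
df = spark.createDataFrame([("N/A",), ("100",), ("250",)], ["amount"])

# Method 1: regexp_replace() swaps a matching pattern for a new value
df1 = df.withColumn("amount", regexp_replace(col("amount"), "N/A", "0"))

# Method 2: when()/otherwise() replaces values conditionally
df2 = df.withColumn(
    "amount",
    when(col("amount") == "N/A", "0").otherwise(col("amount")),
)
df2.show()
```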

Create Row for each array Element using PySpark Explode

The best thing about Spark is that you can easily work with semi-structured data such as JSON. The JSON can contain array or map elements, and you may get a requirement to create a row for each array or map element. In this article, we will check how to use the PySpark explode function to create a row for each array element. Create a Row for each array Element using PySpark Explode: Before jumping into the examples, let us first understand what the explode function in PySpark is. Pyspark Explode Function: The PySpark explode function returns…
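
Here is a minimal sketch of explode() in action; the id/letters columns and sample arrays are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-example").getOrCreate()

# Each row holds an id and an array of values
df = spark.createDataFrame(
    [(1, ["a", "b"]), (2, ["c"])],
    ["id", "letters"],
)

# explode() emits one output row per array element
df.select("id", explode("letters").alias("letter")).show()
# +---+------+
# | id|letter|
# +---+------+
# |  1|     a|
# |  1|     b|
# |  2|     c|
# +---+------+
```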

Apache Spark SQL Bucketing Support – Explanation

Spark SQL supports clustering column values using the bucketing concept. Bucketing and partitioning are similar to the Hive concepts, but with syntax changes. In this article, we will check Apache Spark SQL bucketing support in different versions of Spark, concentrating only on the Spark SQL DDL changes. For applying bucketing on a DataFrame, go through the related article. Apache Spark SQL Bucketing Support: Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning. Bucketing is one of the optimization techniques that uses buckets to…
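
As a sketch of the DDL form, the CLUSTERED BY ... INTO n BUCKETS clause can be issued through spark.sql(); the table name and schema below are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-ddl").getOrCreate()

# Spark SQL DDL form: CLUSTERED BY (...) [SORTED BY (...)] INTO n BUCKETS
spark.sql("""
    CREATE TABLE IF NOT EXISTS bucketed_tbl (
        id   INT,
        name STRING
    )
    USING PARQUET
    CLUSTERED BY (id)
    SORTED BY (id)
    INTO 4 BUCKETS
""")
```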

Spark SQL COALESCE on DataFrame – Examples

You will know the importance of the coalesce function if you come from a SQL or data warehouse background. Coalesce is one of the most widely used functions in SQL, and you can use it to return non-null values. In this article, we will check how to use Spark SQL coalesce on an Apache Spark DataFrame with an example. Spark SQL COALESCE on DataFrame: The coalesce is a non-aggregate regular function in Spark SQL. It returns the first non-null value among the given columns, or null if all columns are null. Coalesce requires at…
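
A minimal sketch of coalesce() on a DataFrame follows; the column names and the lit("n/a") default are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit

spark = SparkSession.builder.appName("coalesce-example").getOrCreate()

df = spark.createDataFrame(
    [(None, "backup"), ("primary", None), (None, None)],
    ["col_a", "col_b"],
)

# coalesce() returns the first non-null value per row;
# lit("n/a") acts as a final default when all columns are null
df.select(
    coalesce(col("col_a"), col("col_b"), lit("n/a")).alias("result")
).show()
```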

Redshift Temporary Tables, Usage and Examples

Similar to many other relational databases such as Netezza, Snowflake, Oracle, etc., Amazon Redshift supports creating temp or temporary tables to hold non-permanent data, i.e., data which you will use only in the current session; Redshift will drop the temp table soon after the session ends. In this article, we will check how to create Redshift temp or temporary tables, along with their syntax, usage, and restrictions, with some examples. Redshift Temporary Tables: A temporary table in Redshift is visible only within the current session. The table is automatically dropped at the…
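
Here is a sketch of creating and using a temp table from Python via psycopg2 (one common way to run SQL against Redshift); the cluster endpoint, credentials, and table definition are hypothetical.

```python
import psycopg2  # assumes psycopg2 is installed; connection details are hypothetical

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="secret",
)
cur = conn.cursor()

# CREATE TEMP TABLE creates a session-scoped table; Redshift drops it
# automatically when the session ends
cur.execute("CREATE TEMP TABLE stage_sales (id INT, amount DECIMAL(10,2));")
cur.execute("INSERT INTO stage_sales VALUES (1, 99.99), (2, 45.50);")
cur.execute("SELECT * FROM stage_sales;")
print(cur.fetchall())

conn.close()  # stage_sales disappears with the session
```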

Handle Cursor in Snowflake Stored Procedures – Examples

Snowflake stored procedures are used to encapsulate data migration, data validation, and business-specific logic. A stored procedure can also handle exceptions, if any, in your data, or apply custom exception handling. Relational databases such as Oracle, Redshift, Netezza, etc. support cursor variables. In this article, we will check how to handle a cursor variable in Snowflake stored procedures with an example. Handle Cursor in Snowflake Stored Procedures: In a relational database, cursors are extensively used in stored procedures to loop through the records from SELECT statements. Stored procedures encapsulate the business logic. For…
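
One way to loop through a cursor is a Snowflake Scripting (SQL) procedure, sketched below via the Python connector; older JavaScript procedures instead emulate cursors with a ResultSet. The account credentials, sales table, and amount column are hypothetical.

```python
import snowflake.connector  # assumes the connector is installed; credentials are hypothetical

conn = snowflake.connector.connect(
    account="xy12345", user="demo_user", password="secret",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# A Snowflake Scripting procedure that loops over a cursor;
# the sales table and amount column are illustrative
cur.execute("""
CREATE OR REPLACE PROCEDURE total_sales()
RETURNS FLOAT
LANGUAGE SQL
AS
$$
DECLARE
    total FLOAT DEFAULT 0.0;
    c1 CURSOR FOR SELECT amount FROM sales;
BEGIN
    FOR rec IN c1 DO
        total := total + rec.amount;
    END FOR;
    RETURN total;
END;
$$
""")
cur.execute("CALL total_sales();")
print(cur.fetchone())
conn.close()
```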

Convert Permanent table to Transient Table in Snowflake

Snowflake transient tables are similar to permanent tables; the only difference is that they do not support a fail-safe period. Therefore, the cost associated with fail-safe is not applicable to transient tables. You can use transient tables in an ETL design to hold temporary data. In this article, we will check how to convert a permanent table to a transient table in Snowflake. Transient Table in Snowflake: As mentioned earlier, transient tables are similar to permanent tables, with the key difference that fail-safe is not available. Transient tables are designed for transitory…
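
Since Snowflake offers no direct ALTER option to change a table's kind, one common approach is to clone into a transient copy and rename it, sketched below; the table names and credentials are hypothetical.

```python
import snowflake.connector  # credentials are hypothetical

conn = snowflake.connector.connect(
    account="xy12345", user="demo_user", password="secret",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# Snowflake has no ALTER TABLE option to turn a permanent table into a
# transient one, so clone it into a transient copy, then swap the names
cur.execute("CREATE TRANSIENT TABLE my_table_transient CLONE my_table;")
cur.execute("DROP TABLE my_table;")
cur.execute("ALTER TABLE my_table_transient RENAME TO my_table;")
conn.close()
```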

TRY_CAST Function Alternative in Redshift – Examples

There are many situations where a CAST conversion fails. For example, say you are trying to convert a varchar to an integer; the cast function will fail if the content is not a valid integer value. Databases such as Snowflake and Azure SQL Data Warehouse support the try_cast function to safely convert data types. In this article, we will check the TRY_CAST function alternative in Redshift and how to use it to safely convert the data types of input values. TRY_CAST Function Alternative in Redshift: Before going into details about the try_cast alternative in…
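
One common alternative is to guard the cast with a regex check in a CASE expression, sketched below via psycopg2; the connection details, table, and column names are hypothetical.

```python
import psycopg2  # connection details are hypothetical

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="secret",
)
cur = conn.cursor()

# Guard the cast with a POSIX regex check (~) so that non-numeric
# strings become NULL instead of raising a conversion error
cur.execute("""
    SELECT CASE
               WHEN raw_value ~ '^[0-9]+$' THEN raw_value::INTEGER
               ELSE NULL
           END AS safe_int
    FROM my_staging_table;
""")
print(cur.fetchall())
conn.close()
```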

Working with Snowflake External Tables and S3 Examples

Snowflake external tables allow you to access files stored in an external stage as a regular table. You can join a Snowflake external table with a permanent or managed table to get the required information, or perform complex transformations involving various tables. External tables are commonly used to build a data lake, where you access the raw data stored in the form of files and join it with existing tables. Snowflake External Tables: As mentioned earlier, external tables access files stored in an external stage area such as Amazon…
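
A sketch of the typical flow via the Python connector follows: create an external stage over S3, define an external table on it, then query it. The bucket URL, credentials, and object names are hypothetical.

```python
import snowflake.connector  # credentials, bucket, and names are hypothetical

conn = snowflake.connector.connect(
    account="xy12345", user="demo_user", password="secret",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# Point an external stage at an S3 location...
cur.execute("""
    CREATE OR REPLACE STAGE my_s3_stage
    URL = 's3://my-bucket/sales/'
    CREDENTIALS = (AWS_KEY_ID = 'my_key_id' AWS_SECRET_KEY = 'my_secret');
""")

# ...then expose the staged files as an external table
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE ext_sales
    WITH LOCATION = @my_s3_stage
    FILE_FORMAT = (TYPE = PARQUET);
""")

# Query it (and join with managed tables) like a regular table;
# the VALUE column holds each row as a VARIANT
cur.execute("SELECT value FROM ext_sales LIMIT 10;")
print(cur.fetchall())
conn.close()
```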
