Spark SQL and Dataset Hints Types – Usage and Examples

In general, query hints or optimizer hints can be used with SQL statements to alter execution plans. Hints let you make decisions that are usually made by the optimizer while generating an execution plan. As a data architect, you might know information about your data that the optimizer does not know. Hints provide a mechanism to direct the optimizer to choose a particular query execution plan based on specific criteria. In this article, we will check Spark SQL and Dataset hint types, usage and examples. Spark SQL and Dataset…
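
As a minimal sketch of what a hint looks like in practice (the DataFrame and view names below are made up for illustration, not taken from the full article), a broadcast join hint can be expressed through either the PySpark DataFrame API or a SQL comment hint:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("hints-example").getOrCreate()

# Hypothetical DataFrames: one large, one small
large_df = spark.range(1_000_000).withColumnRenamed("id", "key")
small_df = spark.range(100).withColumnRenamed("id", "key")

# DataFrame API: ask the optimizer to broadcast the smaller join side
joined = large_df.join(broadcast(small_df), "key")

# SQL: the same intent expressed as a /*+ ... */ query hint
large_df.createOrReplaceTempView("large_tbl")
small_df.createOrReplaceTempView("small_tbl")
joined_sql = spark.sql(
    "SELECT /*+ BROADCAST(small_tbl) */ * "
    "FROM large_tbl JOIN small_tbl ON large_tbl.key = small_tbl.key"
)
```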

Oracle INSERT ALL Alternative in Hive/Spark SQL

Oracle Database is one of the most widely used relational databases. It supports syntax that is not available in other transactional databases. One such command is INSERT ALL. The INSERT ALL statement is used to insert computed records into multiple tables based on conditions. In this article, we will check the Oracle INSERT ALL alternatives in Hive and Spark SQL. Oracle INSERT ALL alternative in Hive/Spark SQL When migrating Oracle scripts to Apache Hive and Spark SQL, you will notice that Hive and Spark SQL do not support many Oracle…
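
As a hedged sketch of one common alternative, a Hive-style multi-table insert scans the source once and routes rows into multiple targets by condition, much like Oracle's conditional INSERT ALL. The table names (src, tbl_a, tbl_b) are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source and target tables for illustration
spark.sql("CREATE TABLE IF NOT EXISTS src (id INT, category STRING) USING parquet")
spark.sql("CREATE TABLE IF NOT EXISTS tbl_a (id INT) USING parquet")
spark.sql("CREATE TABLE IF NOT EXISTS tbl_b (id INT) USING parquet")

# Hive-style multi-insert: one scan of src, conditional routing per target
spark.sql("""
    FROM src
    INSERT INTO TABLE tbl_a SELECT id WHERE category = 'A'
    INSERT INTO TABLE tbl_b SELECT id WHERE category = 'B'
""")
```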

Create Spark SQL isdate Function – Date Validation

Many databases, such as SQL Server, support an isdate function. Spark SQL supports many DataFrame methods. We have already seen Spark SQL date functions in my other post, "Spark SQL Date and Timestamp Functions". You may have noticed that there is no function to validate date and timestamp values in Spark SQL. Alternatively, you can use Hive date functions to filter out unwanted dates. In this article, we will check how to create a Spark SQL isdate user defined function with an example. Create Spark SQL isdate Function The best part about…
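
A minimal sketch of one way to build such a validation UDF (the function name, date format and sample data here are illustrative assumptions, not necessarily what the full article uses):

```python
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

def isdate(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

# Register so the function can be called from SQL statements
spark.udf.register("isdate", isdate, BooleanType())

df = spark.createDataFrame([("2021-01-15",), ("not-a-date",)], ["dt"])
df.createOrReplaceTempView("t")
spark.sql("SELECT dt, isdate(dt) AS is_valid FROM t").show()
```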

Spark SQL Bucketing on DataFrame – Examples

We have already discussed the Hive bucketing concept in my other post. The concept is the same in Spark SQL. Bucketing divides a partition into a number of equal clusters, also called buckets; this is also known as clustering. The concept is very similar to clustering in relational databases such as Netezza, Snowflake, etc. In this article, we will check Spark SQL bucketing on DataFrames instead of tables. We will use PySpark to demonstrate the bucketing examples; the concept is the same in Scala as well. Spark SQL Bucketing on DataFrame Bucketing is an optimization technique in both…
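
A quick PySpark sketch (the table and column names are made up): bucketing a DataFrame goes through the DataFrameWriter, and bucketBy only works together with saveAsTable:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10_000).withColumnRenamed("id", "order_id")

# bucketBy is only supported with saveAsTable, not with plain save()
(df.write
   .bucketBy(8, "order_id")   # 8 buckets clustered on order_id
   .sortBy("order_id")        # optional: sort rows within each bucket
   .mode("overwrite")
   .saveAsTable("orders_bucketed"))
```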

Replace Pyspark DataFrame Column Value – Methods

A DataFrame in Spark is a dataset organized into named columns. A Spark DataFrame consists of columns and rows, similar to a relational database table. There are many situations in which you may get unwanted values, such as invalid values, in the data frame. In this article, we will check how to replace such a value in a PySpark DataFrame column. We will also check methods to replace values in Spark DataFrames. Replace Pyspark DataFrame Column Value As mentioned, we often get a requirement to cleanse the data by replacing unwanted values from the DataFrame…
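
A short sketch of three common replacement methods (the sample data and replacement values are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("N/A",), ("CA",)], ["state"])

# Method 1: DataFrame.replace for exact-match substitution
df1 = df.replace("N/A", "UNKNOWN", subset=["state"])

# Method 2: when/otherwise for conditional replacement
df2 = df.withColumn(
    "state",
    when(col("state") == "N/A", "UNKNOWN").otherwise(col("state")),
)

# Method 3: regexp_replace for pattern-based replacement
df3 = df.withColumn("state", regexp_replace("state", "^N/A$", "UNKNOWN"))
```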

Apache Spark SQL Bucketing Support – Explanation

Spark SQL supports clustering column values using the bucketing concept. Bucketing and partitioning are similar to the corresponding Hive concepts, but with a syntax change. In this article, we will check Apache Spark SQL bucketing support in different versions of Spark, concentrating only on the Spark SQL DDL changes. For applying bucketing on a DataFrame, go through the dedicated article. Apache Spark SQL Bucketing Support Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning. The bucketing concept is one of the optimization techniques that use bucketing to…
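
On the DDL side, a hedged sketch of bucketed-table creation in recent Spark versions (the table name is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CLUSTERED BY ... INTO n BUCKETS in a Spark SQL CREATE TABLE statement
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_bucketed_ddl (
        order_id INT,
        amount DOUBLE
    )
    USING parquet
    CLUSTERED BY (order_id) INTO 8 BUCKETS
""")
```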

Spark SQL COALESCE on DataFrame – Examples

You will know the importance of the coalesce function if you are from a SQL or data warehouse background. The coalesce function is one of the most widely used functions in SQL. You can use the coalesce function to return non-null values. In this article, we will check how to use Spark SQL coalesce on an Apache Spark DataFrame with an example. Spark SQL COALESCE on DataFrame Coalesce is a non-aggregate regular function in Spark SQL. It returns the first non-null value among the given columns, or null if all columns are null. Coalesce requires at…
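
A minimal sketch of coalesce on a DataFrame (the sample data is illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(None, "b1"), ("a2", None), (None, None)], ["a", "b"]
)

# First non-null value across the given columns, with a literal fallback
df.select(
    coalesce(col("a"), col("b"), lit("n/a")).alias("first_non_null")
).show()
```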

Spark SQL Create Temporary Tables, Syntax and Examples

Temporary tables are tables that are available within the current session. They are automatically dropped at the end of the session. In this article, we will check how to create Spark SQL temporary tables, their syntax and some examples. Spark SQL Create Temporary Tables Temporary tables, or temp tables, in Spark are available within the current Spark session. Spark temp tables are useful, for example, when you want to join a DataFrame column with other tables. Spark DataFrame Methods or Functions to Create Temp Tables Depends on the…
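
A short sketch of the usual methods (the view names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Session-scoped temp view: dropped when the Spark session ends
df.createOrReplaceTempView("my_temp")
spark.sql("SELECT * FROM my_temp").show()

# Global temp view: visible across sessions of the same application,
# always qualified with the global_temp database
df.createOrReplaceGlobalTempView("my_global_temp")
spark.sql("SELECT * FROM global_temp.my_global_temp").show()
```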

Spark SQL CASE WHEN on DataFrame – Examples

In general, the CASE expression or command is a conditional expression, similar to the if-then-else statements found in other languages. Spark SQL supports almost all features that are available in Apache Hive. One such feature is the CASE statement. In this article, we will check how to use the CASE WHEN and OTHERWISE statements on a Spark SQL DataFrame. Spark SQL CASE WHEN on DataFrame The CASE WHEN and OTHERWISE statement tests whether any of a sequence of expressions is true, and returns the corresponding result for the first true expression. Spark…
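
A compact sketch of both forms (the sample data and thresholds are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(95,), (72,), (40,)], ["score"])

# DataFrame API: when/otherwise mirrors the SQL CASE WHEN expression
df.withColumn(
    "grade",
    when(col("score") >= 90, "A")
    .when(col("score") >= 60, "B")
    .otherwise("F"),
).show()

# Equivalent SQL CASE expression
df.createOrReplaceTempView("scores")
spark.sql("""
    SELECT score,
           CASE WHEN score >= 90 THEN 'A'
                WHEN score >= 60 THEN 'B'
                ELSE 'F'
           END AS grade
    FROM scores
""").show()
```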

How to Create Spark SQL User Defined Functions? Example

A user defined function (UDF) is a function written to perform a specific task when a built-in function is not available for it. In a Hadoop environment, you can write user defined functions using Java, Python, R, etc. In this article, we will check how to create Spark SQL user defined functions, with a Python user defined function example. Spark SQL User-defined Functions When you migrate your relational database warehouse to Hive and use Spark as an execution engine, you may miss some of the built-in function support. Some user defined functions…
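
A minimal Python UDF sketch (the function and sample data are illustrative assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def standardize(s):
    """Trim and upper-case a string; pass nulls through unchanged."""
    return s.strip().upper() if s is not None else None

# Wrap for the DataFrame API
standardize_udf = udf(standardize, StringType())

df = spark.createDataFrame([(" alice ",), (None,)], ["name"])
df.select(standardize_udf("name").alias("name_std")).show()

# Register the same function for use in SQL statements
spark.udf.register("standardize", standardize, StringType())
df.createOrReplaceTempView("people")
spark.sql("SELECT standardize(name) AS name_std FROM people").show()
```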
