Spark SQL Correlated Subquery and Usage Restrictions

A correlated subquery in Spark SQL is a query within a query that refers to columns from the parent, or outer, query. This kind of subquery contains one or more correlations between its columns and the columns produced by the outer query. Spark SQL supports both regular and correlated subqueries. You can use subqueries to improve the performance of Spark SQL queries, for example by limiting the number of records returned by the subquery.

Spark SQL Correlated Subquery

Spark SQL supports many types of subqueries. However, it…
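To make the idea concrete, here is a minimal PySpark sketch of a correlated EXISTS subquery; the customers and orders views, and their columns, are invented for this illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("correlated-subquery").getOrCreate()

# Hypothetical sample data, registered as temporary views.
spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"]) \
    .createOrReplaceTempView("customers")
spark.createDataFrame([(100, 1), (101, 1)], ["order_id", "customer_id"]) \
    .createOrReplaceTempView("orders")

# The inner query references the outer table's column (c.id);
# that reference is what makes the subquery correlated.
spark.sql("""
    SELECT c.name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").show()
```

Only customers with at least one matching order are returned here.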

Spark SQL to_date() Function – Pyspark and Scala

Spark SQL supports many date and time conversion functions. One such function is to_date(). The Spark SQL to_date() function converts a string containing a date into the date type. The function is useful when you are trying to transform captured string data into a particular data type, such as a date type. In this article, we will check how to use the Spark to_date() function on a DataFrame as well as in plain SQL queries.

Spark SQL to_date() Function

You can use the Spark to_date() function to convert and format string…
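As a quick sketch, the following PySpark snippet applies to_date() through the DataFrame API and in a plain SQL query; the column and view names are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("to-date-demo").getOrCreate()

df = spark.createDataFrame([("2021-01-15",), ("2021-02-20",)], ["dt_str"])

# DataFrame API: parse the string using an explicit format pattern.
df.withColumn("dt", to_date(col("dt_str"), "yyyy-MM-dd")).printSchema()

# Plain SQL: the same conversion through a temporary view.
df.createOrReplaceTempView("dates")
spark.sql("SELECT to_date(dt_str, 'yyyy-MM-dd') AS dt FROM dates").show()
```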

Apache Spark SQL Supported Subqueries and Examples

A subquery in Spark SQL is a select expression enclosed in parentheses as a nested query block within a query statement. A subquery in Apache Spark SQL is similar to a subquery in other relational databases: it may return zero, one, or more values to its enclosing select statement. In this article, we will check the subqueries that Apache Spark SQL supports, with some examples.

Spark SQL Supported Subqueries

Spark SQL subqueries are another select statement or expression enclosed in parentheses as a nested query block. You can use these…
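For illustration, here is a small PySpark sketch of two commonly supported forms, a scalar subquery and an IN subquery; the sales view is invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("subquery-demo").getOrCreate()

spark.createDataFrame(
    [(1, 450.0), (2, 120.0), (3, 800.0)], ["id", "amount"]
).createOrReplaceTempView("sales")

# Scalar subquery: the nested block returns exactly one value.
spark.sql("""
    SELECT id, amount, (SELECT MAX(amount) FROM sales) AS max_amount
    FROM sales
""").show()

# IN subquery: the outer rows are filtered against the nested block.
spark.sql("""
    SELECT id FROM sales
    WHERE id IN (SELECT id FROM sales WHERE amount > 400)
""").show()
```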

How to Remove Duplicate Records from Spark DataFrame – Pyspark and Scala

You can create a Spark DataFrame with duplicate records; there is no method that prevents you from adding duplicate records to a Spark DataFrame. Some applications, such as ETL processes, may create DataFrames with duplicate records. Spark SQL supports several methods to de-duplicate a table. In this article, we will check how to identify and remove duplicate records from a Spark SQL DataFrame.

Remove Duplicate Records from Spark DataFrame

There are many methods that you can use to identify and remove duplicate records from a Spark SQL DataFrame.…
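As a brief sketch, the snippet below shows two of the common methods, distinct() and dropDuplicates(); the sample data is made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])

# distinct() drops rows that are duplicated across every column.
df.distinct().show()

# dropDuplicates() can de-duplicate on a subset of columns instead.
df.dropDuplicates(["id"]).show()
```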

How to Merge JSON Objects in Snowflake?

One of the greatest strengths of Snowflake is that it can handle both structured and semi-structured data, including JSON and XML. Snowflake allows you to store and query JSON or XML data without using any special functions. However, a built-in function for merging two or more JSON objects is not available as of now. Instead, you can write a Snowflake user defined function in JavaScript. In this article, we will check how to merge two JSON objects in Snowflake.

Merge JSON Objects in Snowflake…
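As a hedged sketch of the JavaScript UDF approach, the snippet below creates and calls a merge function through the snowflake-connector-python package; the connection values are placeholders and merge_objects is a hypothetical function name.

```python
import snowflake.connector

# Placeholder credentials; replace with real account details.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    database="<database>", schema="<schema>", warehouse="<warehouse>"
)
cur = conn.cursor()

# JavaScript UDF arguments are referenced in uppercase inside the body.
# Object.assign() copies B's keys into A, merging the two objects.
cur.execute("""
    CREATE OR REPLACE FUNCTION merge_objects(A VARIANT, B VARIANT)
    RETURNS VARIANT
    LANGUAGE JAVASCRIPT
    AS 'return Object.assign(A, B);'
""")

cur.execute(
    """SELECT merge_objects(parse_json('{"a": 1}'), parse_json('{"b": 2}'))"""
)
print(cur.fetchone()[0])  # {"a": 1, "b": 2}
```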

How Snowflake Internally Handles Updates? – Explanation

When you load data into Snowflake, it reorganizes that data into micro-partitions and stores it in its internal optimized, compressed, columnar format. Snowflake keeps this optimized data in cloud storage, using Amazon S3, Azure Blob Storage, or Google Cloud Storage. However, all of these storage layers are immutable, which raises an obvious question: how does Snowflake internally handle updates when you execute an UPDATE command?

How Snowflake Internally Handles Updates?

Many people have thought of this question while going through the Snowflake architecture. It is a complex question. Snowflake uses…

Spark SQL Recursive DataFrame – Pyspark and Scala

Identifying the top-level hierarchy of one column from another column is an important feature that many relational databases such as Teradata, Oracle, and Snowflake support. Relational databases use recursive queries to identify hierarchies in data, such as an organizational structure, employee-manager relationships, a bill of materials, or a document hierarchy. Databases such as Teradata and Snowflake support recursive queries in the form of a recursive WITH clause or recursive views. Spark SQL, however, does not support recursive CTEs or recursive views. In this article, we will check how to build a Spark SQL recursive DataFrame using…
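One workaround, shown as a rough sketch below, is to emulate the recursive CTE with an iterative self-join loop; the employee/manager data is invented, and the loop assumes the hierarchy contains no cycles.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("recursive-demo").getOrCreate()

emp = spark.createDataFrame(
    [(1, None), (2, 1), (3, 2), (4, 3)], ["id", "manager_id"]
)

# Seed with the top-level rows, then join one level per iteration
# until no new rows are found. A cycle in the data would make this
# loop run forever, so it assumes an acyclic hierarchy.
level = emp.filter(F.col("manager_id").isNull()).withColumn("lvl", F.lit(0))
result = level
while level.count() > 0:
    level = (
        emp.alias("e")
        .join(level.alias("l"), F.col("e.manager_id") == F.col("l.id"))
        .select("e.id", "e.manager_id", (F.col("l.lvl") + 1).alias("lvl"))
    )
    result = result.unionByName(level)

result.orderBy("lvl").show()
```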

Snowflake Array Functions – Syntax and Examples

It is a very common practice to store values in the form of an array in databases. Snowflake supports many array functions, and you can use these functions to manipulate array types. In this article, we will check how to work with the Snowflake array functions, with syntax and examples.

Snowflake Array Functions

Following is the list of Snowflake array functions with brief descriptions:

ARRAY_AGG – Returns the input values, pivoted into an ARRAY.
ARRAY_APPEND – Returns an array containing all elements from the…
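As a quick sketch, the snippet below exercises ARRAY_AGG and ARRAY_APPEND through the snowflake-connector-python package; the connection values are placeholders.

```python
import snowflake.connector

# Placeholder credentials; replace with real account details.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>"
)
cur = conn.cursor()

# ARRAY_AGG pivots the input values into a single ARRAY.
cur.execute("SELECT ARRAY_AGG(seq4()) FROM TABLE(GENERATOR(ROWCOUNT => 3))")
print(cur.fetchone()[0])  # [0, 1, 2]

# ARRAY_APPEND returns a new array with the element added at the end.
cur.execute("SELECT ARRAY_APPEND(ARRAY_CONSTRUCT(1, 2), 3)")
print(cur.fetchone()[0])  # [1, 2, 3]
```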

Spark SQL and Dataset Hint Types – Usage and Examples

In general, query hints, or optimizer hints, can be used with SQL statements to alter execution plans. Hints let you make decisions that are usually made by the optimizer while generating an execution plan. As a data architect, you might know things about your data that the optimizer does not. Hints provide a mechanism to direct the optimizer to choose a certain query execution plan based on specific criteria. In this article, we will check the Spark SQL and Dataset hint types, their usage, and examples.

Spark SQL and Dataset…
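For example, the broadcast hint, one of the most commonly used Spark hints, can be given through the Dataset API or as a /*+ ... */ comment in SQL; here is a minimal sketch with invented view names.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("hints-demo").getOrCreate()

large = spark.range(100000).withColumnRenamed("id", "k")
small = spark.range(100).withColumnRenamed("id", "k")

# Dataset API: mark the small side for a broadcast hash join.
large.join(broadcast(small), "k").explain()

# SQL: the same hint expressed as a /*+ BROADCAST(...) */ comment.
large.createOrReplaceTempView("large_t")
small.createOrReplaceTempView("small_t")
spark.sql("""
    SELECT /*+ BROADCAST(s) */ *
    FROM large_t l JOIN small_t s ON l.k = s.k
""").explain()
```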

What are SELECT INTO Alternatives in Snowflake?

In my other post, I discussed different methods to create a Snowflake table. There are many database-specific syntaxes that are not supported in Snowflake yet, and SELECT INTO is one of them. Databases such as SQL Server, Redshift, and Teradata support the SELECT INTO clause to create a new table and insert the resulting rows from a query into it. In this article, we will check the SELECT INTO alternatives in Snowflake with some examples.

SELECT INTO Alternatives in Snowflake

In databases such as SQL…
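The usual replacement is CREATE TABLE ... AS SELECT (CTAS); the sketch below runs it through the snowflake-connector-python package, with placeholder credentials and hypothetical table names.

```python
import snowflake.connector

# Placeholder credentials; replace with real account details.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    database="<database>", schema="<schema>", warehouse="<warehouse>"
)
cur = conn.cursor()

# SQL Server style: SELECT col1, col2 INTO new_table FROM old_table;
# Snowflake equivalent: create the table directly from the query result.
cur.execute("""
    CREATE OR REPLACE TABLE new_table AS
    SELECT col1, col2 FROM old_table
""")
```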
