Spark SQL to_date() Function – Pyspark and Scala

Spark SQL supports many date and time conversion functions. One such function is to_date(), which converts a string containing a date into a date type. The function is useful when you are trying to transform captured string data into a particular data type, such as a date type. In this article, we will check how to use the Spark to_date() function on a DataFrame as well as in plain SQL queries. You can use the Spark to_date() function to convert and format string…
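For illustration, here is a minimal PySpark sketch of both usages; the sample data and format pattern are assumptions for the example, not taken from the article:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("to_date_example").getOrCreate()

# Hypothetical captured string data
df = spark.createDataFrame([("2021-01-15",), ("2021-02-20",)], ["date_str"])

# DataFrame API: convert the string column to DateType with an explicit pattern
df = df.withColumn("date_col", to_date(col("date_str"), "yyyy-MM-dd"))

# Plain SQL: the same conversion through a temporary view
df.createOrReplaceTempView("captured")
spark.sql("SELECT to_date(date_str, 'yyyy-MM-dd') AS date_col FROM captured").show()
```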

Apache Spark SQL Supported Subqueries and Examples

A subquery in Spark SQL is a select expression that is enclosed in parentheses as a nested query block in a query statement. The subquery in Apache Spark SQL is similar to a subquery in other relational databases: it may return zero, one, or more values to its outer select statement. In this article, we will check Apache Spark SQL supported subqueries and some examples. Spark SQL subqueries are another select statement or expression enclosed in parentheses as a nested query block. You can use these…
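As a quick illustration, a PySpark sketch of two commonly supported forms, a scalar subquery and an IN subquery; the tables and data are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("subquery_example").getOrCreate()

# Hypothetical tables registered as temporary views
spark.createDataFrame([(1, 100), (2, 200), (3, 50)], ["id", "amount"]) \
    .createOrReplaceTempView("orders")
spark.createDataFrame([(1,), (2,)], ["id"]).createOrReplaceTempView("customers")

# Scalar subquery: returns a single value to the outer select
spark.sql(
    "SELECT id, amount, (SELECT MAX(amount) FROM orders) AS max_amount FROM orders"
).show()

# IN subquery: filters the outer query against the nested query block
spark.sql("SELECT * FROM orders WHERE id IN (SELECT id FROM customers)").show()
```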

How to Remove Duplicate Records from Spark DataFrame – Pyspark and Scala

You can create a Spark DataFrame with duplicate records; there is no method that prevents you from adding duplicate records to a Spark DataFrame. There is a chance that some applications, such as an ETL process, may create a DataFrame with duplicate records. Spark SQL supports several methods to de-duplicate a table. In this article, we will check how to identify and remove duplicate records from a Spark SQL DataFrame. There are many methods that you can use to identify and remove the duplicate records from the Spark SQL DataFrame…
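For example, two of those methods, distinct() and dropDuplicates(), in a minimal PySpark sketch with hypothetical data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup_example").getOrCreate()

# Hypothetical DataFrame containing duplicate records
df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "name"])

# distinct() drops rows that are identical across all columns
df.distinct().show()

# dropDuplicates() can de-duplicate on a subset of columns
df.dropDuplicates(["id"]).show()
```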

Spark SQL Recursive DataFrame – Pyspark and Scala

Identifying the top-level hierarchy of one column from another column is an important feature that many relational databases, such as Teradata, Oracle, and Snowflake, support. Relational databases use recursive queries to identify hierarchies of data, such as an organizational structure, employee-manager relationships, a bill of materials, or a document hierarchy. Databases such as Teradata and Snowflake support recursive queries in the form of a recursive WITH clause or recursive views, but Spark SQL does not support recursive CTEs or recursive views. In this article, we will check Spark SQL recursive DataFrame using…
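One common workaround, sketched below in PySpark, is to emulate the recursive CTE with an iterative self-join that runs until no new hierarchy level is found; the employee-manager edge table is hypothetical, and this is not necessarily the article's exact approach:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("recursive_dataframe").getOrCreate()

# Hypothetical employee-manager hierarchy as (child_id, parent_id) edges
edges = spark.createDataFrame([(2, 1), (3, 2), (4, 3)], ["child_id", "parent_id"])

# Iteratively join each new level back onto the edges until a fixed point
result = edges
frontier = edges
while True:
    step = (
        frontier.alias("f")
        .join(edges.alias("e"), col("f.parent_id") == col("e.child_id"))
        .select(col("f.child_id"), col("e.parent_id"))
    )
    if step.count() == 0:
        break
    result = result.union(step)
    frontier = step

result.show()  # every (descendant, ancestor) pair in the hierarchy
```

Note that count() materializes each level as a separate job, so this pattern fits shallow hierarchies.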

Spark SQL and Dataset Hints Types – Usage and Examples

In general, query hints or optimizer hints can be used with SQL statements to alter execution plans. Hints let you make decisions that are usually made by the optimizer while generating an execution plan. As a data architect, you might know information about your data that the optimizer does not. Hints provide a mechanism to direct the optimizer to choose a particular query execution plan based on specific criteria. In this article, we will check Spark SQL and Dataset hint types, usage, and examples. Spark SQL and Dataset…
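For example, the broadcast join hint, which Spark exposes through both the Dataset API and SQL comment syntax; the tables here are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("hints_example").getOrCreate()

large = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "val"])
small = spark.createDataFrame([(1, "a")], ["id", "code"])

# Dataset hint: ask the optimizer to broadcast the smaller join side
large.join(broadcast(small), "id").explain()

# Equivalent SQL hint embedded as a comment in the statement
large.createOrReplaceTempView("large_t")
small.createOrReplaceTempView("small_t")
spark.sql(
    "SELECT /*+ BROADCAST(s) */ * FROM large_t l JOIN small_t s ON l.id = s.id"
).explain()
```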

Oracle INSERT ALL Alternative in Hive/Spark SQL

Oracle is one of the most widely used relational databases. It supports many syntaxes that are not available in other transactional databases, and one such command is INSERT ALL, which is used to insert computed records into multiple tables based on conditions. In this article, we will check the Oracle INSERT ALL alternative in Hive and Spark SQL. When migrating Oracle scripts to Apache Hive and Spark SQL, you will notice that Hive and Spark SQL do not support many Oracle…
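One common alternative, sketched below, is to mirror each WHEN clause of the Oracle INSERT ALL with a separate conditional INSERT ... SELECT; the tables and routing threshold are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("insert_all_alternative").getOrCreate()

# Hypothetical source and target tables
spark.sql("CREATE TABLE IF NOT EXISTS src (id INT, amount INT) USING parquet")
spark.sql("CREATE TABLE IF NOT EXISTS small_orders (id INT, amount INT) USING parquet")
spark.sql("CREATE TABLE IF NOT EXISTS large_orders (id INT, amount INT) USING parquet")

# One INSERT ... SELECT per target table, each carrying the routing condition
spark.sql("INSERT INTO small_orders SELECT id, amount FROM src WHERE amount < 100")
spark.sql("INSERT INTO large_orders SELECT id, amount FROM src WHERE amount >= 100")
```

Hive additionally offers a multi-table `FROM src INSERT ...` form that scans the source only once.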

Create Spark SQL isdate Function – Date Validation

Many databases, such as SQL Server, support an isdate function. Spark SQL supports many DataFrame methods, and we have already seen Spark SQL date functions in my other post, "Spark SQL Date and Timestamp Functions". You may have noticed that there is no function to validate date and timestamp values in Spark SQL. Alternatively, you can use Hive date functions to filter out unwanted dates. In this article, we will check how to create a Spark SQL isdate user defined function with an example. The best part about…
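A minimal sketch of such a user defined function in PySpark; the fixed yyyy-MM-dd format is an assumption for the example:

```python
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName("isdate_udf").getOrCreate()

def isdate(value):
    """Return True if the string parses as a yyyy-MM-dd date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (ValueError, TypeError):
        return False

# Registering the UDF makes it callable from SQL as well as the DataFrame API
spark.udf.register("isdate", isdate, BooleanType())

spark.sql("SELECT isdate('2021-02-30') AS valid").show()  # false: no Feb 30
spark.sql("SELECT isdate('2021-02-28') AS valid").show()  # true
```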

How to Export Spark DataFrame to Teradata Table

In my other article, Steps to Connect Teradata Database from Spark, we have seen how to connect to a Teradata database from Spark using the JDBC driver. In this article, we will check how to export a Spark DataFrame to a Teradata table using the same JDBC driver. We will also check how to create the table out of the Spark DataFrame if it is not present in the target database, i.e., the Teradata database. Apache Spark is fast because of its in-memory computation. It is common practice to use Spark as…
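As a sketch, the JDBC write could look like the following; the host, credentials, and table name are placeholders, and the Teradata JDBC driver jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata_export").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# Placeholder Teradata connection details
(df.write
   .format("jdbc")
   .option("url", "jdbc:teradata://td_host/DATABASE=mydb")
   .option("driver", "com.teradata.jdbc.TeraDriver")
   .option("dbtable", "mydb.target_table")
   .option("user", "td_user")
   .option("password", "td_password")
   .mode("append")  # append creates the table when it does not already exist
   .save())
```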

How to Export Spark DataFrame to Redshift Table

In my other article, How to Create Redshift Table from DataFrame using Python, we have seen how to create a Redshift table from a Python Pandas DataFrame. In this article, we will check how to export a Spark DataFrame to a Redshift table. Apache Spark is fast because of its in-memory computation. It is common practice to use Spark as an execution engine to process huge amounts of data. Sometimes you may get a requirement to export processed data back to Redshift for reporting. We are going to use…
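A minimal JDBC-based sketch; the cluster endpoint, credentials, and table are placeholders, and the Redshift JDBC driver jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift_export").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# Placeholder Redshift cluster endpoint and credentials
(df.write
   .format("jdbc")
   .option("url", "jdbc:redshift://examplecluster.abc123.us-west-2.redshift.amazonaws.com:5439/dev")
   .option("driver", "com.amazon.redshift.jdbc42.Driver")
   .option("dbtable", "public.target_table")
   .option("user", "rs_user")
   .option("password", "rs_password")
   .mode("append")
   .save())
```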

How to Load Spark DataFrame to Oracle Table – Example

In my previous post, Steps to Connect Oracle Database from Spark, I explained how to connect to an Oracle database and query its tables. In some cases, you may get a requirement to load a Spark DataFrame into an Oracle table. We can also use JDBC to write data from a Spark DataFrame to database tables. In the subsequent sections, we will explore a method to write a Spark DataFrame to an Oracle table. As mentioned in the previous section, we can use the JDBC driver to write the DataFrame…
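A minimal sketch of that JDBC write; the connection string, credentials, and table name are placeholders, and the Oracle JDBC driver jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle_load").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# Placeholder Oracle connection details
(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//ora_host:1521/ORCLPDB1")
   .option("driver", "oracle.jdbc.driver.OracleDriver")
   .option("dbtable", "target_schema.target_table")
   .option("user", "ora_user")
   .option("password", "ora_password")
   .mode("append")
   .save())
```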
