How to Export SQL Server Table to S3 using Spark?

Apache Spark is one of the emerging big data technologies. Thanks to its fast, in-memory, distributed computation, you can use it for heavy jobs such as analyzing petabytes of data or exporting millions or billions of records from any relational database to cloud storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. In this article, we will check how to export a SQL Server table to an Amazon S3 bucket using Spark. We will use PySpark to demonstrate the method. In my other article, we have…

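As a rough illustration of the flow described above, here is a minimal PySpark sketch. The server, database, credentials, table, and bucket names are hypothetical placeholders, and it assumes the SQL Server JDBC driver and the hadoop-aws/S3A libraries are available on the cluster.

```python
from pyspark.sql import SparkSession

# The JDBC driver jar and the S3A (hadoop-aws) jars are assumed to be
# available on the driver and executors.
spark = SparkSession.builder.appName("sqlserver-to-s3").getOrCreate()

# Hypothetical connection details - replace with your own.
jdbc_url = "jdbc:sqlserver://your-server:1433;databaseName=your_db"
connection_props = {
    "user": "your_user",
    "password": "your_password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Read the SQL Server table into a DataFrame.
df = spark.read.jdbc(url=jdbc_url, table="dbo.sales", properties=connection_props)

# Write the DataFrame to an S3 bucket as Parquet (bucket and path are placeholders).
df.write.mode("overwrite").parquet("s3a://your-bucket/exports/sales/")
```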

Connect to SQL Server From Spark – PySpark

Thanks to its fast, in-memory, distributed computation, Apache Spark is one of the emerging big data technologies. Spark's in-memory distributed computation allows you to analyze petabytes of data without performance issues. In this article, we will check one of the methods to connect to a SQL Server database from a Spark program. We will use PySpark to read a SQL Server table. The connection method is similar to the ones already discussed for Oracle, Netezza, Snowflake, Teradata, etc. Steps to Connect SQL Server From Spark To access SQL Server from Apache…

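For reference, a hedged sketch of the JDBC read described above; the host, database, table, and credentials are placeholders, and the Microsoft SQL Server JDBC driver is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-sqlserver").getOrCreate()

# Placeholder connection options; adjust for your environment.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://your-server:1433;databaseName=your_db")
    .option("dbtable", "dbo.customers")
    .option("user", "your_user")
    .option("password", "your_password")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Quick sanity checks on the loaded table.
df.printSchema()
df.show(5)
```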

Spark SQL Count Distinct Window Function

Window functions are commonly used analytical functions in a Spark SQL query. COUNT is one such window function that allows you to count over a certain window. Many relational databases such as Oracle support the COUNT window function with the DISTINCT keyword. However, Spark SQL does not support a count distinct window function. Spark SQL Count Distinct Window Function Spark SQL supports the COUNT window function. However, the COUNT window function with the DISTINCT keyword is not supported as of now. Following is an example of an Oracle COUNT window function with DISTINCT…

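Since COUNT(DISTINCT ...) OVER (...) is rejected by Spark SQL, a commonly used workaround is to take the size of collect_set over the same window. A small sketch with a throwaway DataFrame:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("count-distinct-window").getOrCreate()

df = spark.createDataFrame(
    [("A", "x"), ("A", "y"), ("A", "x"), ("B", "z")],
    ["grp", "val"],
)

# COUNT(DISTINCT val) OVER (PARTITION BY grp) is not supported, but the
# size of a collect_set over the same window yields the same count.
w = Window.partitionBy("grp")
df.withColumn("distinct_vals", F.size(F.collect_set("val").over(w))).show()
```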

How to Use Spark SQL REPLACE on DataFrame?

Similar to the DataFrame COALESCE function, the REPLACE function is one of the important functions you will use to manipulate string data. REPLACE is one of the widely used functions in SQL, and you can use it to replace values. In this article, we will check how to use the Spark SQL replace function on an Apache Spark DataFrame with an example. Spark SQL REPLACE on DataFrame In SQL, the replace function removes all occurrences of a specified substring and optionally replaces them with another string.…

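A small sketch of one way to call the replace function from PySpark, using expr() for the SQL function and regexp_replace as the DataFrame-API alternative; the sample DataFrame is made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("replace-example").getOrCreate()

df = spark.createDataFrame([("hello world",), ("hello spark",)], ["txt"])

# Spark SQL's replace(str, search, replace) called through expr(), plus
# regexp_replace for pattern-based replacement from the DataFrame API.
df.select(
    F.expr("replace(txt, 'hello', 'hi')").alias("sql_replace"),
    F.regexp_replace("txt", "hello", "hi").alias("regexp_replace"),
).show(truncate=False)
```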

How to Search String in Spark DataFrame? – Scala and PySpark

As a data engineer, you may work with many different kinds of datasets. You will often get a requirement to filter out or search for a specific string within a dataset or DataFrame, for example, to identify junk strings within a dataset. In this article, we will check how to search for a string in a Spark DataFrame using different methods. How to Search String in Spark DataFrame? Apache Spark supports many different built-in API methods that you can use to search for specific strings in a DataFrame. Following are the…

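As a quick illustration, a few common search methods on a made-up DataFrame, using contains, like, and rlike:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("search-string").getOrCreate()

df = spark.createDataFrame(
    [("alice@example.com",), ("bob@test.org",), ("junk###",)], ["email"]
)

# Three common ways to search for a string in a column.
df.filter(F.col("email").contains("example")).show()  # substring match
df.filter(F.col("email").like("%test%")).show()       # SQL LIKE pattern
df.filter(F.col("email").rlike("[#]{3}")).show()      # regular expression
```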

How to Find Tables Size in Spark SQL? – Scala Example

Be it a relational database, Hive, or Spark SQL, finding the table size is one of the common requirements. Relational databases such as Snowflake, Teradata, etc. support system tables, and you can use those system tables to identify the size of tables. But there are no system tables in Spark SQL. You can make use of the Spark catalog API to find table sizes in a Spark SQL database. Find Tables Size in Spark SQL Starting with version 2.0, Spark supports the catalog API. It has many useful methods such as listTables, listDatabases,…

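The article's example is in Scala; as a rough PySpark equivalent (an assumption, not the article's exact code), the sketch below lists tables through the catalog API, computes statistics, and reads the size back from DESCRIBE TABLE EXTENDED. The database name is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-sizes").getOrCreate()

database = "default"  # placeholder database name

for table in spark.catalog.listTables(database):
    full_name = f"{database}.{table.name}"
    # Collect statistics first so the size is populated; this step
    # applies to tables, not temporary views.
    spark.sql(f"ANALYZE TABLE {full_name} COMPUTE STATISTICS")
    stats = (
        spark.sql(f"DESCRIBE TABLE EXTENDED {full_name}")
        .filter("col_name = 'Statistics'")
        .collect()
    )
    if stats:
        print(full_name, stats[0]["data_type"])  # e.g. "1234 bytes, 10 rows"
```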

Spark SQL Array Functions – Syntax and Examples

Similar to relational databases such as Snowflake and Teradata, Spark SQL supports many useful array functions. You can use these functions to manipulate array types. In this article, we will check how to work with Spark SQL array functions, their syntax, and examples. Spark SQL Array Functions Following is the list of Spark SQL array functions with brief descriptions: array(expr, ...) returns an array with the given elements; array_contains(array, value) returns true if the array contains the value; array_distinct(array) removes duplicate values from the array; array_except(array1, array2) returns an array…

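A quick sketch running a few of the listed array functions through spark.sql():

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("array-functions").getOrCreate()

# A handful of the array functions mentioned above, evaluated on literals.
spark.sql("""
    SELECT
        array(1, 2, 3)                         AS arr,
        array_contains(array(1, 2, 3), 2)      AS has_two,
        array_distinct(array(1, 2, 2, 3))      AS deduped,
        array_except(array(1, 2, 3), array(2)) AS except_result
""").show(truncate=False)
```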

How to Add Column with Default Value to Pyspark DataFrame?

Since its inception, Spark has made many improvements and added many useful DataFrame APIs. If you are from a SQL background, you might have noticed that adding a default value to a column when you add a new column is a common practice. This is just to make sure the new column does not hold junk or NULL values. In this article, we will check how to add a column with a default or constant value to a PySpark DataFrame. Add a Column with Default Value to Pyspark DataFrame Adding a…

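A minimal sketch of the usual approach, adding a constant column with lit() on a made-up DataFrame:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("default-column").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# lit() wraps a constant so it can be used as a column expression;
# every row gets the same default value.
df = df.withColumn("country", F.lit("unknown"))
df.show()
```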

Spark SQL Correlated Subquery and Usage Restrictions

A correlated subquery in Spark SQL is a query within a query that refers to columns from the parent or outer query table. This kind of subquery contains one or more correlations between its columns and the columns produced by the outer query. Spark SQL supports both regular and correlated subqueries. You can use subqueries to improve the performance of Spark SQL queries, for example by limiting the number of records returned by the subquery. Spark SQL Correlated Subquery Spark SQL supports many types of subqueries. However, it…

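A small sketch of a correlated subquery form that Spark SQL accepts, using EXISTS with a predicate that references the outer query; the temp views and data are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("correlated-subquery").getOrCreate()

spark.createDataFrame(
    [(1, "alice"), (2, "bob")], ["id", "name"]
).createOrReplaceTempView("customers")

spark.createDataFrame(
    [(1, 100.0), (1, 50.0)], ["customer_id", "amount"]
).createOrReplaceTempView("orders")

# The inner query's predicate o.customer_id = c.id references the outer
# query, making this a correlated subquery.
spark.sql("""
    SELECT c.id, c.name
    FROM customers c
    WHERE EXISTS (
        SELECT 1 FROM orders o WHERE o.customer_id = c.id
    )
""").show()
```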