How to Export SQL Server Table to S3 using Spark?

Apache Spark is one of the emerging Big Data technologies. Thanks to its in-memory, distributed, and fast computation, you can use it for heavy jobs such as analyzing petabytes of data or exporting millions or billions of records from any relational database to cloud storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. In this article, we will check how to export a SQL Server table to an Amazon S3 bucket using Spark. We will use PySpark to demonstrate the method. In my other article, we have…
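
As a rough sketch of the approach, the snippet below reads a SQL Server table over JDBC and writes it to S3 as Parquet. The server, database, table, credentials, and bucket names are placeholders, and it assumes the Microsoft JDBC driver and hadoop-aws S3 support are already configured for your Spark cluster.

```python
from pyspark.sql import SparkSession

# Assumes the Microsoft SQL Server JDBC driver (mssql-jdbc) and hadoop-aws
# (plus S3 credentials) are already configured; names below are placeholders.
spark = SparkSession.builder.appName("ExportSQLServerToS3").getOrCreate()

# Read the source table over JDBC
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://your-server:1433;databaseName=your_db")
      .option("dbtable", "dbo.your_table")
      .option("user", "your_user")
      .option("password", "your_password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

# Write the records to an S3 bucket as Parquet files
df.write.mode("overwrite").parquet("s3a://your-bucket/exports/your_table/")
```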


Connect to SQL Server From Spark – PySpark

Thanks to its in-memory, distributed, and fast computation, Apache Spark is one of the emerging Big Data technologies. Apache Spark's in-memory distributed computation allows you to analyze petabytes of data without performance issues. In this article, we will check one of the methods to connect to a SQL Server database from a Spark program. We will use PySpark to read a SQL Server table. The connection method is similar to the ones we have already discussed for Oracle, Netezza, Snowflake, Teradata, etc. Steps to Connect SQL Server From Spark To access SQL Server from Apache…
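
A minimal PySpark sketch of the connection step, assuming placeholder host, database, and credentials, with the mssql-jdbc driver jar supplied to Spark (for example via spark-submit --jars):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadSQLServerTable").getOrCreate()

# Placeholder connection details; the driver class comes from mssql-jdbc
jdbc_url = "jdbc:sqlserver://your-server:1433;databaseName=your_db"
connection_properties = {
    "user": "your_user",
    "password": "your_password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Read the table into a Spark DataFrame over JDBC
df = spark.read.jdbc(url=jdbc_url, table="dbo.your_table",
                     properties=connection_properties)

df.printSchema()
df.show(5)
```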


How to Search String in Spark DataFrame? – Scala and PySpark

As a data engineer, you may work with many different kinds of datasets. You will often get a requirement to filter out or search for a specific string within the data or a DataFrame, for example, to identify junk strings within a dataset. In this article, we will check how to search for a string in a Spark DataFrame using different methods. How to Search String in Spark DataFrame? Apache Spark provides many built-in API methods that you can use to search for a specific string in a DataFrame. Following are the…
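
A short illustration on hypothetical sample data, showing three of the commonly used search methods, contains(), like(), and rlike():

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SearchString").getOrCreate()

# Hypothetical sample data
df = spark.createDataFrame(
    [(1, "alpha"), (2, "beta"), (3, "alphabet"), (4, "N/A")],
    ["id", "name"],
)

# contains(): rows where the column holds a substring
df.filter(col("name").contains("alpha")).show()

# like(): SQL-style wildcard match
df.filter(col("name").like("%bet%")).show()

# rlike(): regular-expression match, e.g. to flag junk values
df.filter(col("name").rlike("^N/A$")).show()
```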


How to Connect to Snowflake from Databricks?

Many organizations use a hybrid model to process their data. They use Databricks to perform operations such as machine learning tasks and copy the end results to Snowflake for reporting or further analysis. In this article, we will check how to connect to Snowflake from Databricks to build a hybrid architecture. Connect to Snowflake from Databricks Snowflake is one of the relational databases that provide a connector for Spark. You can use the Snowflake Spark connector to connect to the Snowflake server and copy data from Databricks to Snowflake. Test Data We will…
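
A rough sketch of the write path, assuming it runs in a Databricks notebook where the spark session is pre-created and the Snowflake Spark connector is available through the built-in "snowflake" source; the account, credentials, warehouse, and table names are placeholders:

```python
# Runs in a Databricks notebook, where `spark` is already created.
# Account, credentials, warehouse, and table names are placeholders.
sf_options = {
    "sfUrl": "your_account.snowflakecomputing.com",
    "sfUser": "your_user",
    "sfPassword": "your_password",
    "sfDatabase": "YOUR_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "YOUR_WH",
}

# Hypothetical end results produced on Databricks
results_df = spark.createDataFrame([(1, 0.91), (2, 0.47)], ["id", "score"])

# Copy the DataFrame from Databricks to a Snowflake table
(results_df.write
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", "ML_RESULTS")
    .mode("overwrite")
    .save())

# Read it back from Snowflake to verify
check_df = (spark.read
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", "ML_RESULTS")
    .load())
check_df.show()
```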


How to Add Column with Default Value to Pyspark DataFrame?

Since its inception, Spark has made a lot of improvements and added many useful DataFrame APIs. If you are from a SQL background, you might have noticed that adding a default value to a column when you add a new column is a common practice. This is just to make sure the new column does not hold junk or NULL values. In this article, we will check how to add a column with a default or constant value to a Pyspark DataFrame. Add a Column with Default Value to Pyspark DataFrame Adding a…
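
A minimal sketch using withColumn() with lit() on hypothetical sample data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("DefaultColumn").getOrCreate()

# Hypothetical sample data
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "code"])

# withColumn() with lit() adds a column holding the same constant in every row
df_with_defaults = (df.withColumn("load_flag", lit("Y"))
                      .withColumn("load_count", lit(0)))

df_with_defaults.show()
```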


Spark SQL to_date() Function – Pyspark and Scala

Spark SQL supports many date and time conversion functions. One such function is to_date(). The Spark SQL to_date() function is used to convert a string containing a date into a date format. The function is useful when you are trying to transform captured string data into a particular data type, such as a date type. In this article, we will check how to use the Spark to_date() function on a DataFrame as well as in plain SQL queries. Spark SQL to_date() Function You can use the Spark to_date() function to convert and format string…
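
A short sketch showing to_date() on a DataFrame column and in a plain SQL query, using hypothetical string dates in dd/MM/yyyy format:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("ToDateExample").getOrCreate()

# Hypothetical captured dates stored as strings in dd/MM/yyyy format
df = spark.createDataFrame([("15/02/2021",), ("28/02/2021",)], ["dt_str"])

# DataFrame API: to_date(column, format) returns a DateType column
df.select(to_date(col("dt_str"), "dd/MM/yyyy").alias("dt")).show()

# The same function works in plain Spark SQL
df.createOrReplaceTempView("sample_dates")
spark.sql("SELECT to_date(dt_str, 'dd/MM/yyyy') AS dt FROM sample_dates").show()
```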


How to Remove Duplicate Records from Spark DataFrame – Pyspark and Scala

You can create a Spark DataFrame with duplicate records; there are no methods that prevent you from adding duplicate records to a Spark DataFrame. There are chances that some applications, such as an ETL process, may create a DataFrame with duplicate records. Spark SQL supports several methods to de-duplicate a table. In this article, we will check how to identify and remove duplicate records from a Spark SQL DataFrame. Remove Duplicate Records from Spark DataFrame There are many methods that you can use to identify and remove the duplicate records from a Spark SQL DataFrame.…
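
A minimal sketch using distinct() and dropDuplicates() on hypothetical sample data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeduplicateExample").getOrCreate()

# Hypothetical data with one fully duplicated row and one partial duplicate
df = spark.createDataFrame(
    [(1, "a", 10), (1, "a", 10), (1, "a", 20), (2, "b", 30)],
    ["id", "code", "amount"],
)

# distinct() drops rows that are duplicated across all columns
df.distinct().show()

# dropDuplicates() keeps one row per combination of the listed columns
df.dropDuplicates(["id", "code"]).show()
```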


Spark SQL Recursive DataFrame – Pyspark and Scala

Identifying the top-level hierarchy of one column from another column is one of the important features that many relational databases such as Teradata, Oracle, Snowflake, etc. support. Relational databases use recursive queries to identify hierarchies of data, such as an organizational structure, employee-manager relationships, a bill of materials, or a document hierarchy. Relational databases such as Teradata and Snowflake support recursive queries in the form of a recursive WITH clause or recursive views. But Spark SQL does not support recursive CTEs or recursive views. In this article, we will check Spark SQL recursive DataFrame using…
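
One common workaround is to emulate the recursion with an iterative self-join loop. The sketch below does that for a hypothetical employee-manager table; it illustrates the idea under those assumptions rather than reproducing the article's exact method:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("RecursiveHierarchy").getOrCreate()

# Hypothetical employee-manager data; a NULL manager marks the top level
emp = spark.createDataFrame(
    [(1, None), (2, 1), (3, 1), (4, 2), (5, 4)],
    ["emp_id", "mgr_id"],
)

# Seed with the top-level rows, then repeatedly join children onto the
# previous level until no new rows are found (emulating a recursive CTE)
level = 0
current = emp.filter(col("mgr_id").isNull()).withColumn("level", lit(level))
result = current

while current.count() > 0:
    level += 1
    current = (emp.alias("e")
               .join(current.alias("p"), col("e.mgr_id") == col("p.emp_id"))
               .select(col("e.emp_id"), col("e.mgr_id"))
               .withColumn("level", lit(level)))
    result = result.union(current)

result.orderBy("level", "emp_id").show()
```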


How to Load Spark DataFrame to Oracle Table – Example

In my previous post, Steps to Connect Oracle Database from Spark, I explained how to connect to Oracle and query tables from the database. But in some cases, you may get a requirement to load a Spark DataFrame to an Oracle table. We can also use JDBC to write data from a Spark DataFrame to database tables. In the subsequent sections, we will explore a method to write a Spark DataFrame to an Oracle table. Load Spark DataFrame to Oracle Table As mentioned in the previous section, we can use the JDBC driver to write a DataFrame…
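
A minimal sketch of the JDBC write, assuming placeholder host, service, credentials, and table names, with the Oracle JDBC driver (ojdbc8.jar) available to Spark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteDataFrameToOracle").getOrCreate()

# Hypothetical DataFrame to be loaded into Oracle
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "code"])

# Placeholder Oracle connection details; append the rows to the target table
(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//your-host:1521/your_service")
   .option("dbtable", "your_schema.target_table")
   .option("user", "your_user")
   .option("password", "your_password")
   .option("driver", "oracle.jdbc.driver.OracleDriver")
   .mode("append")
   .save())
```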
