Spark SQL Bucketing on DataFrame – Examples

We have already discussed the Hive bucketing concept in my other post. The concept is the same in Spark SQL. Bucketing divides a partition into a number of equal clusters (also called clustering) or buckets. The concept is very similar to clustering in relational databases such as Netezza, Snowflake, etc. In this article, we will check Spark SQL bucketing on DataFrames instead of tables. We will use Pyspark to demonstrate the bucketing examples; the concept is the same in Scala as well. Spark SQL Bucketing on DataFrame: Bucketing is an optimization technique in both…
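
As a rough illustration of the idea (not taken from the article itself), here is a minimal PySpark sketch that buckets a hypothetical DataFrame on an "id" column. The sample data, table name, and bucket count are assumptions; bucketBy() is used together with saveAsTable(), since bucketed output is stored as a table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BucketingExample").enableHiveSupport().getOrCreate()

# Hypothetical sample DataFrame
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
    ["id", "value"],
)

# Write the DataFrame into 4 buckets on the "id" column and sort within each bucket
(df.write
   .format("parquet")
   .bucketBy(4, "id")
   .sortBy("id")
   .mode("overwrite")
   .saveAsTable("bucketed_demo_table"))
```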


Replace Pyspark DataFrame Column Value – Methods

A DataFrame in Spark is a dataset organized into named columns. A Spark DataFrame consists of columns and rows, similar to a relational database table. There are many situations in which you may get unwanted values, such as invalid values, in the data frame. In this article, we will check how to replace such a value in a Pyspark DataFrame column. We will also check methods to replace values in Spark DataFrames. Replace Pyspark DataFrame Column Value: As mentioned, we often get a requirement to cleanse the data by replacing unwanted values in the DataFrame…
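
As a minimal sketch of the kind of replacement methods the article describes, here are three common approaches on a hypothetical DataFrame; the sample data, column names, and replacement values are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, when, col

spark = SparkSession.builder.appName("ReplaceValueExample").getOrCreate()

# Hypothetical DataFrame with an invalid value in the "country" column
df = spark.createDataFrame([(1, "USA"), (2, "N/A"), (3, "India")], ["id", "country"])

# Method 1: regexp_replace() to replace a matching string
df1 = df.withColumn("country", regexp_replace("country", "N/A", "Unknown"))

# Method 2: when()/otherwise() to replace conditionally
df2 = df.withColumn(
    "country", when(col("country") == "N/A", "Unknown").otherwise(col("country"))
)

# Method 3: DataFrame.na.replace() on the target column only
df3 = df.na.replace("N/A", "Unknown", subset=["country"])
```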


Create Row for each array Element using PySpark Explode

One of the best things about Spark is that you can easily work with semi-structured data such as JSON. The JSON can contain arrays or map elements, and you may get a requirement to create a row for each array or map element. In this article, we will check how to use the Pyspark explode function to create a row for each array element. Create a Row for each array Element using PySpark Explode: Before jumping into the examples, first, let us understand what the explode function is in PySpark. Pyspark Explode Function: The Pyspark explode function returns…
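
Here is a small sketch of the technique on a hypothetical DataFrame with an array column (the sample data and column names are assumptions): explode() produces one output row per array element.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("ExplodeExample").getOrCreate()

# Hypothetical DataFrame with an array column
df = spark.createDataFrame(
    [("x", [1, 2, 3]), ("y", [4, 5])],
    ["key", "numbers"],
)

# One row is produced for each element of the "numbers" array
df.select("key", explode("numbers").alias("number")).show()
```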


Spark SQL Create Temporary Tables, Syntax and Examples

Temporary tables are tables that are available within the current session. They are automatically dropped at the end of the current session. In this article, we will check how to create Spark SQL temporary tables, their syntax and some examples. Spark SQL Create Temporary Tables: Temporary tables or temp tables in Spark are available within the current Spark session. Spark temp tables are useful, for example, when you want to join a DataFrame column with other tables. Spark DataFrame Methods or Functions to Create Temp Tables: Depends on the…
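
As a minimal sketch of the approach, assuming a hypothetical DataFrame, a temp view can be registered from a DataFrame and then queried with Spark SQL within the same session; the view and column names are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TempTableExample").getOrCreate()

# Hypothetical DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Register the DataFrame as a temporary view scoped to the current session
df.createOrReplaceTempView("temp_table")

# Query it with Spark SQL, e.g. to join it with other registered tables
spark.sql("SELECT id, value FROM temp_table WHERE id = 1").show()
```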


Import CSV file to Pyspark DataFrame – Example

Many organizations use a flat file format such as CSV or TSV to offload their tables. Managing flat files is easy, and they can be transported by any electronic medium. In this article, we will check how to import a CSV file to a Pyspark DataFrame with some examples. Import CSV file to Pyspark DataFrame: There are many methods that you can use to import a CSV file into a Pyspark or Spark DataFrame, but the following methods are easy to use: Read Local CSV using com.databricks.spark.csv Format; Run Spark SQL Query to Create Spark DataFrame…
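
As a brief sketch of the first method, here is how a local CSV file can be read into a DataFrame; the file path and options are assumptions, and in Spark 2.x and later the com.databricks.spark.csv functionality is available as the built-in "csv" source.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CsvImportExample").getOrCreate()

# Hypothetical local file path
csv_path = "/tmp/sample.csv"

# Method 1: DataFrameReader with the csv format
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(csv_path))

# Method 2: shorthand spark.read.csv()
df2 = spark.read.csv(csv_path, header=True, inferSchema=True)
```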


Spark SQL Date and Timestamp Functions and Examples

Spark SQL provides many built-in functions. Date and time functions are useful when you are working with a DataFrame that stores date and time type values. The built-in functions also include type conversion functions that you can use to format the date or time type. In this article, we will check what Spark SQL date and timestamp functions are, with some examples. Spark SQL Date and Timestamp Functions: Spark SQL supports almost all of the date and time functions that are supported in Apache Hive. You can use these…
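
To illustrate a few of these built-ins, here is a minimal sketch on a hypothetical DataFrame of date strings; the sample data and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    current_date, current_timestamp, date_add, datediff, to_date, col
)

spark = SparkSession.builder.appName("DateFunctionsExample").getOrCreate()

# Hypothetical DataFrame holding date strings
df = spark.createDataFrame([("2021-01-01",), ("2021-06-15",)], ["dt_str"])

(df.withColumn("dt", to_date(col("dt_str"), "yyyy-MM-dd"))       # type conversion
   .withColumn("today", current_date())                          # current date
   .withColumn("now", current_timestamp())                       # current timestamp
   .withColumn("plus_7_days", date_add(col("dt"), 7))            # date arithmetic
   .withColumn("days_elapsed", datediff(current_date(), col("dt")))
   .show(truncate=False))
```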


Rename PySpark DataFrame Column – Methods and Examples

A DataFrame in Spark is a dataset organized into named columns. A Spark data frame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations. When you work with DataFrames, you may get a requirement to rename a column. In this article, we will check how to rename a PySpark DataFrame column, the methods to rename a DF column and some examples. Rename PySpark DataFrame Column: As mentioned earlier, we often need to rename one column or multiple columns on a PySpark (or Spark) DataFrame. Note…
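
A minimal sketch of common rename methods on a hypothetical DataFrame follows; the sample data and new column names are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RenameColumnExample").getOrCreate()

# Hypothetical DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Rename a single column
df1 = df.withColumnRenamed("val", "value")

# Rename all columns at once by rebuilding the header with toDF()
df2 = df.toDF("record_id", "record_value")

# selectExpr() with AS also works for renaming while selecting
df3 = df.selectExpr("id AS record_id", "val AS record_value")
```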


SQL Merge Operation Using Pyspark – UPSERT Example

In relational databases such as Snowflake, Netezza, Oracle, etc., the MERGE statement is used to manipulate the data stored in a table. In this article, we will check how to simulate the SQL MERGE operation using Pyspark. The method is the same in Scala with a little modification. SQL Merge Statement: The MERGE command in relational databases allows you to update old records and insert new records simultaneously. This command is sometimes called UPSERT (UPdate and inSERT command). Following is a sample merge statement available in an RDBMS: merge into merge_test using merge_test2 on…
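
One common way to simulate an UPSERT in plain PySpark (a sketch, not necessarily the exact method the article uses) is a full outer join on the key followed by coalescing the incoming value over the existing one; the sample data, key, and column names below are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col

spark = SparkSession.builder.appName("MergeUpsertExample").getOrCreate()

# Hypothetical target table and incoming updates, keyed on "id"
target = spark.createDataFrame([(1, "old"), (2, "old")], ["id", "value"])
updates = spark.createDataFrame([(2, "new"), (3, "new")], ["id", "value"])

# Full outer join on the key, then prefer the incoming value when it exists:
# id=1 keeps the old value, id=2 is updated, id=3 is inserted.
merged = (target.alias("t")
          .join(updates.alias("s"), on="id", how="full_outer")
          .select(
              col("id"),
              coalesce(col("s.value"), col("t.value")).alias("value"),
          ))

merged.show()
```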
