Spark SQL Bucketing on DataFrame – Examples

We have already discussed the Hive bucketing concept in my other post, and the concept is the same in Spark SQL. Bucketing divides a partition into a number of equal clusters (also called clustering) or buckets. The concept is very similar to clustering in relational databases such as Netezza, Snowflake, etc. In this article, we will check Spark SQL bucketing on a DataFrame instead of on tables. We will use PySpark to demonstrate the bucketing examples; the concept is the same in Scala.

Spark SQL Bucketing on DataFrame

Bucketing is an optimization technique in both…
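A minimal PySpark sketch of bucketing a DataFrame on write; the table name, columns, and bucket count below are illustrative, not from the article:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BucketingExample").enableHiveSupport().getOrCreate()

# Illustrative sample data
df = spark.createDataFrame(
    [(1, "A"), (2, "B"), (3, "C"), (4, "D")],
    ["id", "value"],
)

# bucketBy() currently works only with saveAsTable(); a plain save() raises an error.
(df.write
   .bucketBy(4, "id")    # 4 buckets, clustered on the id column
   .sortBy("id")         # optionally sort rows within each bucket
   .mode("overwrite")
   .saveAsTable("bucketed_example"))
```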

Create Row for each array Element using PySpark Explode

The best thing about Spark is that you can easily work with semi-structured data such as JSON. The JSON can contain array or map elements, and you may get a requirement to create a row for each array or map element. In this article, we will check how to use the PySpark explode function to create a row for each array element.

Create a Row for each array Element using PySpark Explode

Before jumping into the examples, let us first understand what the explode function is in PySpark.

PySpark Explode Function

The PySpark explode function returns…
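A minimal sketch of explode on an array column, assuming a made-up DataFrame with a name and a scores array:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("ExplodeExample").getOrCreate()

# Illustrative data: each row carries an array column
df = spark.createDataFrame(
    [("alice", [1, 2, 3]), ("bob", [4, 5])],
    ["name", "scores"],
)

# explode() produces one output row per element of the scores array
df.select("name", explode("scores").alias("score")).show()
```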

Apache Spark SQL Bucketing Support – Explanation

Spark SQL supports clustering column values using the bucketing concept. Bucketing and partitioning are similar to the Hive concepts, but with syntax changes. In this article, we will check Apache Spark SQL bucketing support in different versions of Spark, and we will concentrate only on the Spark SQL DDL changes. For applying bucketing on a DataFrame, go through that article.

Apache Spark SQL Bucketing Support

Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning. The bucketing concept is one of the optimization techniques that uses bucketing to…
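A hedged sketch of the DDL side of bucketing; the table and column names here are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BucketedDDL").enableHiveSupport().getOrCreate()

# Spark SQL DDL for a bucketed table; sales_bucketed and its columns are illustrative
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_bucketed (
        order_id    INT,
        customer_id INT,
        amount      DOUBLE
    )
    USING PARQUET
    CLUSTERED BY (customer_id) INTO 8 BUCKETS
""")
```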

Spark SQL COALESCE on DataFrame – Examples

You will know the importance of the coalesce function if you come from a SQL or data warehouse background. Coalesce is one of the most widely used functions in SQL; you can use it to return non-null values. In this article, we will check how to use Spark SQL coalesce on an Apache Spark DataFrame with an example.

Spark SQL COALESCE on DataFrame

Coalesce is a non-aggregate regular function in Spark SQL. It returns the first non-null value among the given columns, or null if all columns are null. Coalesce requires at…
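A minimal sketch of coalesce on a DataFrame, with made-up columns and a literal fallback value:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit

spark = SparkSession.builder.appName("CoalesceExample").getOrCreate()

# Illustrative data containing nulls
df = spark.createDataFrame(
    [(None, "x"), ("a", None), (None, None)],
    ["col1", "col2"],
)

# Return the first non-null value across col1 and col2, falling back to a literal
df.select(
    coalesce(col("col1"), col("col2"), lit("default")).alias("first_non_null")
).show()
```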

Spark SQL Create Temporary Tables, Syntax and Examples

Temporary tables are tables that are available only within the current session; they are automatically dropped at the end of the session. In this article, we will check how to create Spark SQL temporary tables, their syntax, and some examples.

Spark SQL Create Temporary Tables

Temporary tables, or temp tables, in Spark are available within the current Spark session. Spark temp tables are useful, for example, when you want to join a DataFrame column with other tables.

Spark DataFrame Methods or Functions to Create Temp Tables

Depends on the…
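A minimal sketch of registering a DataFrame as a session-scoped temporary view; the view and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TempTableExample").getOrCreate()

# Illustrative DataFrame
df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "value"])

# Register the DataFrame as a temporary view scoped to this Spark session
df.createOrReplaceTempView("temp_example")

# The temp view can now be queried (and joined) with Spark SQL
spark.sql("SELECT id, value FROM temp_example WHERE id = 1").show()
```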

Spark SQL CASE WHEN on DataFrame – Examples

In general, the CASE expression or command is a conditional expression, similar to the if-then-else statements found in other languages. Spark SQL supports almost all the features that are available in Apache Hive, and one such feature is the CASE statement. In this article, we will check how to use the CASE WHEN and OTHERWISE statements on a Spark SQL DataFrame.

Spark SQL CASE WHEN on DataFrame

The CASE WHEN and OTHERWISE function or statement tests whether any of a sequence of expressions is true and returns a corresponding result for the first true expression. Spark…
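A minimal sketch of CASE WHEN / OTHERWISE expressed with the DataFrame API; the score thresholds and grades are invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.appName("CaseWhenExample").getOrCreate()

# Illustrative data
df = spark.createDataFrame([(85,), (60,), (30,)], ["score"])

# Equivalent of CASE WHEN ... THEN ... ELSE ... END on a DataFrame column
df.withColumn(
    "grade",
    when(col("score") >= 80, "A")
    .when(col("score") >= 50, "B")
    .otherwise("C"),
).show()
```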

Import CSV file to Pyspark DataFrame – Example

Many organizations use a flat file format such as CSV or TSV to offload their tables. Flat files are easy to manage and can be transported by any electronic medium. In this article, we will check how to import a CSV file into a PySpark DataFrame with some examples.

Import CSV file to Pyspark DataFrame

There are many methods that you can use to import a CSV file into a PySpark or Spark DataFrame, but the following methods are easy to use:

Read Local CSV using com.databricks.spark.csv Format
Run Spark SQL Query to Create Spark DataFrame…
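A minimal sketch using the built-in CSV reader; the file path is a placeholder, and in older Spark versions the same result was achieved through the com.databricks.spark.csv format:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CsvImportExample").getOrCreate()

# Read a CSV file into a DataFrame; the path below is a placeholder
df = spark.read.csv(
    "/tmp/sample_data.csv",   # hypothetical file path
    header=True,              # treat the first line as column names
    inferSchema=True,         # let Spark infer column types
)
df.printSchema()
df.show(5)
```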

Spark SQL Date and Timestamp Functions and Examples

Spark SQL provides many built-in functions. The date and time functions are useful when you are working with a DataFrame that stores date and time type values. The built-in functions also include type conversion functions that you can use to format the date or time types. In this article, we will check the Spark SQL date and timestamp functions with some examples.

Spark SQL Date and Timestamp Functions

Spark SQL supports almost all date and time functions that are supported in Apache Hive. You can use these…
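A minimal sketch using a few common date functions; the input dates and output format are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, date_format, datediff, to_date

spark = SparkSession.builder.appName("DateFunctionsExample").getOrCreate()

# Illustrative data with date strings
df = spark.createDataFrame([("2021-01-15",), ("2021-03-01",)], ["event_date"])

(df.withColumn("event_date", to_date(col("event_date"), "yyyy-MM-dd"))
   .withColumn("today", current_date())
   .withColumn("days_since_event", datediff(current_date(), col("event_date")))
   .withColumn("formatted", date_format(col("event_date"), "dd/MM/yyyy"))
   .show())
```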

Rename PySpark DataFrame Column – Methods and Examples

A DataFrame in Spark is a dataset organized into named columns. A Spark DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations. When you work with DataFrames, you may get a requirement to rename a column. In this article, we will check how to rename a PySpark DataFrame column, the methods to rename a DataFrame column, and some examples.

Rename PySpark DataFrame Column

As mentioned earlier, we often need to rename one or multiple columns on a PySpark (or Spark) DataFrame. Note…
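A minimal sketch of two common renaming approaches; the column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RenameColumnExample").getOrCreate()

# Illustrative DataFrame
df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "val"])

# Rename a single column
df = df.withColumnRenamed("val", "value")

# Rename every column at once by supplying the full list of new names
df = df.toDF("record_id", "record_value")

df.printSchema()
```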

SQL Merge Operation Using Pyspark – UPSERT Example

In relational databases such as Snowflake, Netezza, Oracle, etc., the MERGE statement is used to manipulate the data stored in a table. In this article, we will check how to simulate the SQL MERGE operation using PySpark. The method is the same in Scala, with little modification.

SQL Merge Statement

The MERGE command in relational databases allows you to update old records and insert new records simultaneously. This command is sometimes called UPSERT (UPdate and inSERT). Following is a sample merge statement available in an RDBMS: merge into merge_test using merge_test2 on…
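One way to sketch an upsert with plain DataFrame operations; this is an assumed approach, not necessarily the one used in the article, and the table and key names are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MergeSimulation").getOrCreate()

# Illustrative target and source tables keyed on "id" (assumed key column)
target = spark.createDataFrame([(1, "old"), (2, "old")], ["id", "value"])
source = spark.createDataFrame([(2, "new"), (3, "new")], ["id", "value"])

# Simulate MERGE: keep target rows that have no match in the source,
# then append all source rows (updates and new inserts)
merged = (
    target.join(source, on="id", how="left_anti")   # unmatched target rows
          .unionByName(source)                       # updated + inserted rows
)
merged.orderBy("id").show()
```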
