Spark RDD Cache and Persist to Improve Performance

Apache Spark is a fast, distributed processing engine. As per the official documentation, Spark is up to 100x faster than traditional MapReduce processing. Another motivation for using Spark is its ease of use: you can work with Apache Spark using any of your favorite programming languages, such as Scala, Java, Python, or R. In this article, we will check how to improve the performance of iterative applications using the Spark RDD cache and persist methods. Spark RDD caching and persistence are optimization techniques for iterative and interactive Spark applications. Caching and…
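
As a minimal PySpark sketch of the idea (the SparkContext handle and the sample data below are assumptions for illustration, not from the article), cache() keeps an RDD in memory with the default storage level, while persist() lets you pick a storage level explicitly:

from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1, 1001))

# cache() keeps the RDD in memory using the default storage level
cached_rdd = rdd.cache()

# persist() lets you choose a storage level explicitly
persisted_rdd = rdd.map(lambda x: x * 2).persist(StorageLevel.MEMORY_AND_DISK)

# subsequent actions reuse the cached/persisted data instead of recomputing it
print(cached_rdd.count())
print(persisted_rdd.sum())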


Spark SQL INSERT INTO Table VALUES issue and Alternatives

Spark SQL is gaining popularity because it is a fast, distributed framework. Spark SQL is fast compared to Apache Hive. You can create tables in the Spark warehouse as explained in the Spark SQL introduction, or connect to a Hive metastore and work on the Hive tables. Not all Hive syntax is supported in Spark SQL; one such statement is Spark SQL INSERT INTO Table VALUES, which is not supported. You cannot use the INSERT INTO table VALUES option in Spark. We will discuss an alternate approach with some examples…
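
As a rough sketch of this kind of workaround (the table name sample_tab and the sample values are made up for illustration), you can load the values into a DataFrame, expose it as a temporary view, and use INSERT INTO ... SELECT instead:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# values you would normally put in an INSERT INTO ... VALUES statement
new_rows = spark.createDataFrame([(1, "abc"), (2, "def")], ["id", "name"])
new_rows.createOrReplaceTempView("new_rows")

# INSERT INTO ... SELECT is supported where INSERT INTO ... VALUES is not
spark.sql("INSERT INTO TABLE sample_tab SELECT id, name FROM new_rows")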


Python Pyspark Iterator - How to Create and Use?

An iterator is an object in Python representing a stream of data. You can create an iterator object by applying the iter() built-in function to an iterable. In Python, you can create your own iterator from a list or a tuple. For example, a list is an iterable, and you can run a for loop over it. In this article, we will check the Python PySpark iterator and how to create and use it. As you know, Spark is a fast, distributed processing engine. It uses RDDs to distribute the…
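
A minimal sketch, assuming a small sample RDD (the data here is invented for illustration): in PySpark you can obtain a Python iterator over an RDD with toLocalIterator() and loop over it like any other iterator:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([10, 20, 30, 40])

# toLocalIterator() returns a Python iterator over the RDD elements
it = rdd.toLocalIterator()
for value in it:
    print(value)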


Register Python Function into Pyspark – Example

Similar to UDFs in Hive, you can add custom UDFs to the PySpark Spark context. We have discussed "Register Hive UDF jar into pyspark" in another post, where we covered how to add a UDF packaged in a jar to the Spark executors and then register it in Spark SQL using the CREATE FUNCTION command. In this article, we will check how to register a Python function in PySpark with an example. Python is one of the most widely used programming languages, and most organizations use PySpark to perform…
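
A minimal sketch of registering a plain Python function as a Spark SQL UDF (the function name and the sample query are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def to_upper(s):
    return s.upper() if s is not None else None

# register the Python function so it can be called from Spark SQL
spark.udf.register("to_upper", to_upper, StringType())

spark.sql("SELECT to_upper('hello') AS upper_val").show()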


Register Hive UDF jar into pyspark – Steps and Examples

Apache Spark is one of the most widely used processing engines because of its fast, in-memory computation. Most organizations use both Hive and Spark: Hive as a data source and Spark as a processing engine. You can use any of your favorite programming languages to interact with Hadoop, and you can write custom UDFs in Java, Python, or Scala. To use those UDFs, you have to register them in Hive so that you can use them like normal built-in functions. In this article, we will check a couple of methods…
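
As a hedged sketch of the general pattern (the jar path and the Java class name below are placeholders, not from the article): add the jar to the Spark session and register the UDF class with CREATE TEMPORARY FUNCTION so it behaves like a built-in function:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# make the jar containing the Hive UDF available to the session and executors
spark.sql("ADD JAR /path/to/hive_udf.jar")

# register the UDF class so it can be used like a built-in function
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper'")

spark.sql("SELECT my_upper('hello')").show()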


How to Update Spark DataFrame Column Values using Pyspark?

A DataFrame in Spark is a distributed collection of data organized into named columns. You can compare a Spark DataFrame with a Pandas DataFrame, but one key difference is that Spark DataFrames are immutable, i.e. you cannot change data in an already created DataFrame. In this article, we will check how to update Spark DataFrame column values using PySpark. The same concept applies to Scala as well. The Spark DataFrame is one of the most widely used features in Apache Spark. All…
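
A minimal PySpark sketch (the column names and the update condition are invented for illustration): because DataFrames are immutable, you "update" a column by deriving a new DataFrame with withColumn():

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "N"), (2, "Y")], ["id", "flag"])

# produce a new DataFrame with the flag column conditionally updated
updated_df = df.withColumn("flag", when(col("id") == 1, "Y").otherwise(col("flag")))
updated_df.show()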


What is SQL Cursor Alternative in Spark SQL?

A SQL cursor is a database object used to retrieve data from a result set one row at a time. You can also think of a cursor as a temporary workspace created in database system memory when a SQL query is executed. A SQL cursor always returns one row at a time, and you can perform your calculations on the returned values. Cursors are usually written in a SQL procedural language such as Oracle PL/SQL or Netezza NZPL/SQL. Below is a sample Oracle PL/SQL procedure with a cursor defined: CREATE OR REPLACE PROCEDURE Sample_proc IS str1…
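
As a rough, hedged sketch of one cursor-like pattern in PySpark (the sample data and column names are placeholders, not from the article): fetch the result set as a DataFrame and walk it one row at a time with toLocalIterator(), much like a cursor fetch loop:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["id", "amount"])

# process the result set one Row at a time, similar to fetching from a cursor
for row in df.toLocalIterator():
    print(row["id"], row["amount"] * 1.1)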


Spark Dataset Join Operators using Pyspark – Examples

Joining two different tables results in a new dataset. You can join two different datasets to perform a specific task, such as getting the common rows. Relational databases like Netezza and Teradata support different join types, and just like an RDBMS, Apache Hive also supports different join types. In this article, we will check Spark Dataset join operators using PySpark and some examples to demonstrate the different join types. Before going into the Spark SQL DataFrame join types, let us check what a join in SQL is: "A query that accesses multiple rows of the same or different table…
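
A minimal sketch of the PySpark join syntax (the sample DataFrames and the join column are made up for illustration); the how argument selects the join type, e.g. inner, left, right, or full:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

emp = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["dept_id", "name"])
dept = spark.createDataFrame([(1, "Sales"), (3, "HR")], ["dept_id", "dept_name"])

# inner join returns only the common rows; change how= for other join types
emp.join(dept, on="dept_id", how="inner").show()
emp.join(dept, on="dept_id", how="left").show()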


Spark SQL Cumulative Average Function and Examples

Spark SQL supports analytic or window functions. You can use Spark SQL to calculate certain results based on a range of values. The result might depend on previous or next row values; in that case, you can use cumulative sum or average functions. Databases like Netezza, Teradata, Oracle, and even the latest versions of Apache Hive support analytic or window functions. In this article, we will check the Spark SQL cumulative average function and how to use it with an example. There are two methods to calculate a cumulative…
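
A minimal sketch using the DataFrame window API (the sample data and the ordering column are assumptions for illustration): a cumulative average is avg() over a window that runs from the start of the partition up to the current row:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 60.0)], ["id", "amount"])

# window from the first row up to and including the current row
w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("cum_avg", avg(col("amount")).over(w)).show()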


Spark SQL Cumulative Sum Function and Examples

Spark SQL supports analytic or window functions. You can use Spark SQL to calculate certain results based on a range of values. Most databases like Netezza, Teradata, and Oracle, and even the latest versions of Apache Hive, support analytic or window functions. In this article, we will check the Spark SQL cumulative sum function and how to use it with an example. Before going deep into calculating a cumulative sum, let us first check what a running total or cumulative sum is: "A running total or cumulative sum refers…
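
A minimal Spark SQL sketch, assuming a hypothetical sample table (the table and column names are not from the article): the running total is SUM() over a window ordered by some column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ["id", "amount"]) \
     .createOrReplaceTempView("sample_sales")

# running total of amount, ordered by id
spark.sql("""
    SELECT id, amount,
           SUM(amount) OVER (ORDER BY id
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_sum
    FROM sample_sales
""").show()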
