Register Hive UDF jar into pyspark – Steps and Examples

Apache Spark is one of the most widely used processing engines because of its fast, in-memory computation. Many organizations use Hive and Spark together: Hive as a data source and Spark as a processing engine. You can use any of your favorite programming languages to interact with Hadoop, and you can write custom UDFs in Java, Python or Scala. To use those UDFs, you have to register them in Hive so that you can call them like normal built-in functions. In this article, we check a couple of methods…
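As a quick illustration of the idea, below is a minimal Pyspark sketch of registering a Hive UDF jar and calling the function; the jar path, class name, function name and table are placeholders rather than the ones used in the article.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Make the jar available to the session (path is a placeholder)
spark.sql("ADD JAR /tmp/my_hive_udfs.jar")

# Register the UDF class from the jar as a Hive function (class name is a placeholder)
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper'")

# The registered function can now be used like any built-in function
spark.sql("SELECT my_upper(name) FROM sample_table").show()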

Continue Reading: Register Hive UDF jar into pyspark – Steps and Examples

How to Update Spark DataFrame Column Values using Pyspark?

A DataFrame in Spark is a distributed collection of data organized into named columns. You can compare a Spark DataFrame with a Pandas DataFrame; the key difference is that Spark DataFrames are immutable, i.e. you cannot change the data in an already created DataFrame. In this article, we will check how to update Spark DataFrame column values using Pyspark. The same concept applies to Scala as well. How to Update Spark DataFrame Column Values using Pyspark? The Spark DataFrame is one of the most widely used features in Apache Spark. All…
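As a small taste of the approach, here is a minimal Pyspark sketch that "updates" a column by deriving a new DataFrame with when/otherwise; the column names and values are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "N"), (2, "Y")], ["id", "flag"])

# DataFrames are immutable, so an update means producing a new DataFrame
updated_df = df.withColumn("flag", when(col("flag") == "Y", "Yes").otherwise("No"))
updated_df.show()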

Continue Reading: How to Update Spark DataFrame Column Values using Pyspark?

What is SQL Cursor Alternative in Spark SQL?

A SQL cursor is a database object used to retrieve data from a result set one row at a time. You can also think of a cursor as a temporary workspace created in database system memory when a SQL query is executed. A SQL cursor always returns one row at a time, and you can perform your calculations on the returned values. Cursors are usually written in a SQL procedural language such as Oracle PL/SQL or Netezza NZPL/SQL. Sample SQL Cursor Example Below is a sample Oracle PL/SQL procedure with a cursor defined: CREATE OR replace PROCEDURE Sample_proc IS str1…
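As one hedged example of a cursor-like pattern in Spark, the sketch below pulls rows to the driver one at a time with toLocalIterator; the sample data and calculation are placeholders, and the article discusses the Spark SQL alternatives in detail.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["id", "amount"])

# Rows are fetched one at a time on the driver, similar to fetching from a cursor
for row in df.toLocalIterator():
    print(row.id, row.amount * 1.1)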

Continue Reading: What is SQL Cursor Alternative in Spark SQL?

Spark Dataset Join Operators using Pyspark – Examples

Joining two different tables results in a different dataset. You can join two different datasets to perform a specific task, such as getting the common rows. Relational databases like Netezza and Teradata support different join types, and just like an RDBMS, Apache Hive also supports different join types. In this article, we will check Spark Dataset join operators using Pyspark and some examples to demonstrate the different join types. Before going into Spark SQL DataFrame join types, let us check what a join in SQL is: "A query that accesses multiple rows of the same or different table…
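For a quick flavour of the syntax, here is a minimal Pyspark join sketch; the sample datasets are invented, and the inner join shown is just one of the join types covered in the article.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
emp = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["dept_id", "name"])
dept = spark.createDataFrame([(1, "Sales"), (3, "HR")], ["dept_id", "dept_name"])

# Inner join keeps only the common rows; "left", "right", "full", "left_semi",
# "left_anti" and "cross" use the same join() API
emp.join(dept, on="dept_id", how="inner").show()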

Continue Reading: Spark Dataset Join Operators using Pyspark – Examples

Spark SQL Cumulative Average Function and Examples

Spark SQL supports analytic or window functions. You can use Spark SQL to calculate certain results based on a range of values. The result might depend on previous or next row values; in that case you can use cumulative sum or average functions. Databases like Netezza, Teradata, Oracle and even the latest versions of Apache Hive support analytic or window functions. In this article, we will check the Spark SQL cumulative average function and how to use it with an example. Spark SQL Cumulative Average Function There are two methods to calculate cumulative…
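A minimal Pyspark sketch of one such method, using a window spec that runs from the first row up to the current row; the column names and data are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 60.0)], ["day", "sales"])

# Average over all rows from the start of the window up to the current row
w = Window.orderBy("day").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("cum_avg", F.avg("sales").over(w)).show()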

Continue Reading: Spark SQL Cumulative Average Function and Examples

Spark SQL Cumulative Sum Function and Examples

Spark SQL supports analytic or window functions. You can use Spark SQL to calculate certain results based on a range of values. Most databases like Netezza, Teradata, Oracle and even the latest versions of Apache Hive support analytic or window functions. In this article, we will check the Spark SQL cumulative sum function and how to use it with an example. Spark SQL Cumulative Sum Function Before going deep into calculating a cumulative sum, first, let us check what a running total or cumulative sum is: "A running total or cumulative sum refers…
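As a brief sketch of the same windowing idea applied to a running total (column names and data are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 60.0)], ["day", "sales"])

# Sum over all rows from the start of the window up to the current row
w = Window.orderBy("day").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("cum_sum", F.sum("sales").over(w)).show()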

Continue Reading: Spark SQL Cumulative Sum Function and Examples

Spark SQL Analytic Functions and Examples

Spark SQL analytic functions, sometimes called Spark SQL window functions, compute an aggregate value based on a group of rows. These functions optionally partition the rows based on a partition column in the window spec. Like other analytic functions such as Hive analytic functions, Netezza analytic functions and Teradata analytic functions, Spark SQL analytic functions work on groups of rows. These functions optionally ignore NULL values in the data. Spark SQL Analytic Functions There are two types of Spark SQL window functions: ranking functions and analytic functions. Related Articles:…
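As a small taste of the ranking side, below is a hedged Pyspark sketch of row_number and rank over a partitioned window; the data and partition column are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, rank
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Sales", "Alice", 300), ("Sales", "Bob", 200), ("HR", "Carol", 250)],
    ["dept", "name", "amount"])

# Rows are numbered and ranked within each dept partition, ordered by amount
w = Window.partitionBy("dept").orderBy("amount")
df.withColumn("rn", row_number().over(w)).withColumn("rnk", rank().over(w)).show()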

Continue Reading: Spark SQL Analytic Functions and Examples

Running SQL using Spark-SQL Command line Interface-CLI

In my other post, we have seen how to connect to Spark SQL using a beeline JDBC connection. You can execute SQL queries in many ways, such as programmatically, using the spark or pyspark shell, or the beeline JDBC client. Many do not know that Spark supports a spark-sql command line interface. You can use this to run the Hive metastore service in local mode. Related Articles: Methods to access Hive Tables from Apache Spark, Spark SQL Cumulative Sum Function and Examples. What is the Spark-SQL command line Interface (CLI)? The Spark SQL command line interface or…
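As a quick illustration, the CLI can be launched from a shell and used interactively, or handed a single statement with the Hive-style -e option; the table name below is a placeholder.

$ spark-sql
spark-sql> SELECT count(*) FROM sample_table;

$ spark-sql -e "SELECT count(*) FROM sample_table"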

Continue Reading: Running SQL using Spark-SQL Command line Interface-CLI

Steps to Connect Teradata Database from Spark – Examples

Apache Spark is one of the emerging big data technologies, thanks to its fast, in-memory distributed computation. You can analyze petabytes of data using Apache Spark's in-memory distributed computation. You can connect Spark to all major databases in the market, such as Netezza, Oracle, etc. In this article, we will check one of the methods to connect to a Teradata database from a Spark program. You can connect using either Scala or Python (Pyspark). For all examples in this article, we will use Scala to read Teradata tables. You can even execute…
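The article's examples are in Scala, but as a rough Pyspark-flavoured sketch of the same JDBC read (driver class, URL, credentials and table name are placeholders, and the Teradata JDBC jars must be on the Spark classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a Teradata table into a Spark DataFrame over JDBC
teradata_df = (spark.read.format("jdbc")
    .option("url", "jdbc:teradata://teradata-host/DATABASE=mydb")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable", "mydb.sample_table")
    .option("user", "dbuser")
    .option("password", "dbpassword")
    .load())
teradata_df.show()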

Continue Reading: Steps to Connect Teradata Database from Spark – Examples

Steps to Connect Oracle Database from Spark – Examples

Apache Spark is one of the emerging big data technologies, thanks to its fast, in-memory distributed computation. You can analyze petabytes of data using Apache Spark's in-memory distributed computation. In this article, we will check one of the methods to connect to an Oracle database from a Spark program. Preferably, we will use Scala to read Oracle tables. You can even execute queries and create a Spark DataFrame. Steps to Connect Oracle Database from Spark The Oracle database is one of the most widely used databases in the world. Almost all companies use Oracle as…
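Again, the article itself walks through the Scala version; as a hedged Pyspark sketch of the equivalent JDBC read (host, service name, credentials and table are placeholders, and the Oracle JDBC driver jar must be available to Spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read an Oracle table into a Spark DataFrame over JDBC
oracle_df = (spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("dbtable", "SCOTT.EMP")
    .option("user", "scott")
    .option("password", "tiger")
    .load())
oracle_df.show()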

Continue Reading: Steps to Connect Oracle Database from Spark – Examples