Spark DataFrame Column Type Conversion using CAST

In another post, we discussed how to check whether a Spark DataFrame column is of integer type. Some applications expect a column to be of a specific type; for example, machine learning models may accept only integer values. In this article, we will check how to perform Spark DataFrame column type conversion using the Spark DataFrame CAST method. Spark DataFrame Column Type Conversion: You can use the Spark CAST method to convert a DataFrame column's data type to the required format. Test Data Frame: Following is the test data frame (df) that…
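As a minimal sketch of the CAST approach (the column names and test data here are hypothetical, not the article's own test frame), you can call cast() on a column and replace it with withColumn:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("CastExample").getOrCreate()

# Hypothetical test DataFrame with string columns.
df = spark.createDataFrame([("1", "10.5"), ("2", "20.7")], ["id", "amount"])

# Cast the string columns to the required numeric types.
df_casted = df.withColumn("id", col("id").cast("integer")) \
              .withColumn("amount", col("amount").cast("double"))

df_casted.printSchema()  # id: integer, amount: double
```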

Spark DataFrame Integer Type Check and Example

Apache Spark is one of the easiest frameworks for dealing with different data sources. You can combine heterogeneous data sources with the help of DataFrames. Some applications, for example, machine learning models, require only integer values. You should check the data types of a DataFrame before feeding it to an ML model, or type cast the columns to an integer type. In this article, we will check how to perform a Spark DataFrame integer type check and how to convert columns using the CAST function in Spark. Spark DataFrame Integer Type Check Requirement: As mentioned…
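A minimal sketch of such a check, assuming a hypothetical DataFrame, inspects the schema for IntegerType fields:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("TypeCheckExample").getOrCreate()

# Hypothetical DataFrame with an explicit schema.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("label", StringType(), True),
])
df = spark.createDataFrame([(1, "a"), (2, "b")], schema)

# Report which columns are of integer type.
for field in df.schema.fields:
    print(field.name, "is integer:", isinstance(field.dataType, IntegerType))
```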

How to Create Spark SQL User Defined Functions? Example

A user defined function (UDF) is a function written to perform a specific task when a built-in function is not available for it. In a Hadoop environment, you can write user defined functions in Java, Python, R, etc. In this article, we will check how to create Spark SQL user defined functions, with a Python user defined function example. Spark SQL User-defined Functions: When you migrate your relational database warehouse to Hive and use Spark as an execution engine, you may miss some of the built-in function support. Some user defined functions…
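A minimal sketch of a Python UDF registered for use from Spark SQL (the function name and logic are illustrative, not the article's example):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("UdfExample").getOrCreate()

# A plain Python function wrapped as a Spark SQL UDF.
def to_upper(s):
    return s.upper() if s is not None else None

spark.udf.register("to_upper", to_upper, StringType())

spark.createDataFrame([("alice",), ("bob",)], ["name"]) \
     .createOrReplaceTempView("people")

# The registered UDF is now callable from SQL.
spark.sql("SELECT to_upper(name) AS name FROM people").show()
```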

Spark SQL isnumeric Function Alternative and Example

Many organizations are moving their data warehouses to Hive and using Spark as an execution engine, which boosts performance. In SQL, there are many options you can use to deal with non-numeric values; for example, you can create user defined functions to filter out unwanted data. In this article, we will check Spark SQL isnumeric function alternatives and examples. Spark SQL isnumeric Function: Neither Spark SQL nor Apache Hive provides support for an isnumeric function. You have to write…
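One common alternative, sketched below with hypothetical data, is to treat a value as numeric when it casts successfully (CAST returns NULL on failure), or to filter with a regular expression:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("IsNumericExample").getOrCreate()
df = spark.createDataFrame([("123",), ("45a",), ("678",)], ["val"])

# CAST yields NULL for non-numeric strings, so a non-null result means "numeric".
numeric_rows = df.filter(col("val").cast("int").isNotNull())

# Equivalent regex-based filter for unsigned integer strings.
numeric_rows_rlike = df.filter(col("val").rlike("^[0-9]+$"))

numeric_rows.show()  # keeps "123" and "678", drops "45a"
```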

Spark SQL DataFrame Self Join and Example

You can use Spark Dataset join operators to join multiple DataFrames in Spark. Two or more DataFrames are joined to perform specific tasks, such as getting the common data from both DataFrames. In this article, we will check how to perform a Spark SQL DataFrame self join using Pyspark. Spark SQL DataFrame Self Join using Pyspark: Spark DataFrames support various join types, as mentioned in Spark Dataset join operators. A self join is a join in which a DataFrame is joined to itself. The self join is used to identify…
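A minimal self-join sketch with hypothetical employee/manager data: alias the same DataFrame twice and join on the relationship column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SelfJoinExample").getOrCreate()

# Hypothetical table: manager_id references emp_id in the same table.
df = spark.createDataFrame(
    [(1, "Ann", None), (2, "Bob", 1), (3, "Cal", 1)],
    ["emp_id", "name", "manager_id"],
)

# Self join: the "e" side is the employee, the "m" side is the manager.
result = df.alias("e").join(
    df.alias("m"), col("e.manager_id") == col("m.emp_id"), "inner"
).select(col("e.name").alias("employee"), col("m.name").alias("manager"))

result.show()
```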

How to Save Spark DataFrame as Hive Table – Example

Apache Spark is one of the most actively contributed-to frameworks. Many e-commerce, data analytics, and travel companies use Spark to analyze huge amounts of data as quickly as possible. Because of its in-memory computation, Apache Spark can provide results 10 to 100x faster than Hive. In this article, we will check how to save a Spark DataFrame as a Hive table, with some examples. How to Save Spark DataFrame as Hive Table? Because of its in-memory computation, Spark is used to process complex computations. In case you have…
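A minimal sketch, assuming Spark was built with Hive support and using a hypothetical table name, writes the DataFrame with saveAsTable:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() is required so the table lands in the Hive metastore.
spark = SparkSession.builder.appName("SaveHiveExample") \
    .enableHiveSupport().getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Write the DataFrame as a managed Hive table (the name is hypothetical).
df.write.mode("overwrite").saveAsTable("default.example_table")
```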

How to Export Spark-SQL Results to CSV?

Data plays an important role in today's decision-making processes. Online bookstores, e-commerce websites, and food delivery applications all use user data to provide better customer service. Many organizations share data with decision-making systems, providing it in the form of flat files or direct access to the source system. Many companies use Spark as an execution engine. In this article, we will check how to export Spark-SQL results to a CSV flat file. The created flat files or CSV files can then be transported using…
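A minimal sketch: run a Spark SQL query and write the resulting DataFrame as CSV (the query, view, and output path are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExportCsvExample").getOrCreate()

spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"]) \
     .createOrReplaceTempView("t")

result = spark.sql("SELECT id, label FROM t WHERE id > 1")

# coalesce(1) produces a single CSV part file; the path is hypothetical.
result.coalesce(1).write.option("header", "true") \
      .mode("overwrite").csv("/tmp/spark_sql_results")
```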

Spark Modes of Operation and Deployment

The Apache Spark mode of operation, or deployment mode, refers to how Spark will run. Spark can run either in local mode or in cluster mode. Local mode is used to test your application, and cluster mode for production deployment. In this article, we will check the Spark modes of operation and deployment. Spark Mode of Operation: Apache Spark runs in local mode by default. Usually, local mode is used for developing applications and unit testing. Spark can be configured to run in cluster mode using the YARN cluster manager. Currently, Spark supports three cluster…
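As a sketch of how the mode is selected, the master URL decides between local and cluster execution; the application and script names below are hypothetical:

```python
from pyspark.sql import SparkSession

# Local mode: everything runs in a single JVM, one thread per core.
local_spark = SparkSession.builder.master("local[*]") \
    .appName("LocalModeExample").getOrCreate()
print(local_spark.sparkContext.master)  # local[*]
local_spark.stop()

# Cluster mode is typically selected at submit time instead, e.g.:
#   spark-submit --master yarn --deploy-mode cluster my_app.py
# where my_app.py builds its SparkSession without hard-coding a master.
```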

Pass Functions to pyspark – Run Python Functions on Spark Cluster

Functions in any programming language are used to handle particular tasks and improve the readability of the overall code. By definition, a function is a block of organized, reusable code used to perform a single, related action. Functions provide better modularity for your application and a high degree of code reuse. In this article, we will check how to pass functions to the pyspark driver program to execute on the cluster. Pass Functions to pyspark: The Spark API requires you to pass functions to the driver program so that they will be…
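A minimal sketch of two common ways to pass a function to a pyspark transformation, a lambda and a named function defined in the driver (names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PassFunctionExample").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])

# 1. Lambda expression passed directly to a transformation.
squares = rdd.map(lambda x: x * x)

# 2. Named function defined in the driver and shipped to the executors.
def add_ten(x):
    return x + 10

shifted = rdd.map(add_ten)

print(squares.collect())  # [1, 4, 9, 16]
print(shifted.collect())  # [11, 12, 13, 14]
```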

Pyspark Storagelevel and Explanation

The basic building block of Apache Spark is the RDD. The main abstraction Apache Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. In this article, we will check how to control the storage of an RDD using Pyspark StorageLevel. We will also check the various storage levels with some examples. Pyspark Storagelevel Explanation: Pyspark storage levels are flags for controlling the storage of a resilient distributed dataset (RDD). Each StorageLevel helps Spark decide whether to use…
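A minimal sketch, persisting a hypothetical RDD at a chosen storage level:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageLevelExample").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100))

# Keep partitions in memory, spilling to disk if they do not fit.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())            # materializes (and caches) the RDD
print(rdd.getStorageLevel())  # shows the effective storage flags

rdd.unpersist()  # release the cached partitions
```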
