How to Connect Netezza Server from Spark? – Example

I was working on a Spark project where we had a requirement to connect to a Netezza server from Spark. Integrating Netezza and Apache Spark enables analytic capabilities using Apache Spark on data that resides in a Netezza database. There are various ways to connect to a Netezza server from a Spark program, and you can also connect to the Netezza server from Pyspark. So, why connect to a Netezza server from Spark? This is an interesting question. To answer it, let us check how Apache Spark works. Apache Spark works on data…
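As a rough sketch of one such approach, the snippet below reads a Netezza table into a Spark DataFrame over JDBC. The host, port, database, table and credentials are placeholders, and it assumes the IBM Netezza JDBC jar (driver class org.netezza.Driver) has been put on the Spark classpath, for example via --jars.

# Minimal sketch: read a Netezza table into a Spark DataFrame over JDBC.
# Host, port, database, table and credentials below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NetezzaRead").getOrCreate()

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:netezza://nz-host:5480/testdb")  # placeholder host/db
      .option("driver", "org.netezza.Driver")               # Netezza JDBC driver class
      .option("dbtable", "sample_table")                    # placeholder table
      .option("user", "admin")
      .option("password", "password")
      .load())

df.show(5)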

Basic Spark Transformations and Actions using pyspark

Apache Spark provides two kinds of operations: Transformations and Actions. We will check the commonly used basic Spark Transformations and Actions using pyspark. Create RDD from Local File: you can use the textFile spark context method to create an RDD from local or HDFS file systems: rdd = sc.textFile("file:///home/impadmin/test.txt"). Related Articles: Apache Spark Architecture, Design and Overview. Create RDD from HDFS File: rdd = sc.textFile("hdfs://localhost:8020/home/impadmin/test.txt"). Basic Spark Transformations: transformations are Spark operations that transform one RDD into another. Transformations always create a new RDD from the original one. Below are some basic…
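To make the distinction concrete, here is a minimal pyspark sketch (the input path is a placeholder): map, flatMap and filter are transformations that build a new RDD lazily, while count and take are actions that trigger the actual computation.

# sc is the SparkContext, created for you in the pyspark shell.
# Transformations (lazy): build new RDDs without touching the data yet.
rdd = sc.textFile("file:///home/impadmin/test.txt")   # placeholder path
words = rdd.flatMap(lambda line: line.split())        # transformation
long_words = words.filter(lambda w: len(w) > 3)       # transformation

# Actions (eager): trigger execution and return results to the driver.
print(long_words.count())                             # action
print(long_words.take(5))                             # action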

Apache Spark SQL Introduction and Features

In Apache Spark, Spark SQL is a module for working with structured and semi-structured data. Any data that has a schema is considered structured data, for example, JSON, Hive tables, Parquet file formats, etc. Semi-structured data, in contrast, has no separation between the schema and the data. In this article, we will check the Apache Spark SQL introduction and its features. Apache Spark SQL Introduction: as mentioned earlier, Spark SQL is a module for working with structured and semi-structured data. Spark SQL works well with huge…
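As a small illustration of the module, the sketch below loads a JSON file into a DataFrame and queries it with SQL; the file path and column names are assumptions for the example. JSON carries its schema with the data, so Spark can infer it on read.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLIntro").getOrCreate()

# Read self-describing JSON; Spark infers the schema from the data.
df = spark.read.json("file:///tmp/people.json")   # placeholder path
df.printSchema()

# Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()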

Spark SQL Performance Tuning – Improve Spark SQL Performance

You can improve the performance of Spark SQL by making simple changes to system parameters. Tuning Spark SQL performance requires knowledge of Spark and of the type of file system in use. In this article, we will check Spark SQL performance tuning to improve Spark SQL performance. Related Articles: Apache Spark SQL Introduction and Features; Apache Spark Architecture, Design and Overview. Data Storage Considerations for Spark Performance: before going into Spark SQL performance tuning, let us check some of the data storage considerations for Spark performance. Optimize…
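For a flavor of the "simple changes to system parameters" this refers to, the sketch below tweaks two commonly tuned Spark SQL settings and caches a frequently used table. The values and the table name are illustrative, not recommendations; the right settings depend on your cluster and data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative tuning knobs; tune these to your workload.
spark.conf.set("spark.sql.shuffle.partitions", "200")              # partitions used for shuffles
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760") # broadcast joins under ~10 MB

# Cache a hot table in memory so repeated queries skip the file system.
spark.sql("CACHE TABLE sales")                        # placeholder table name
spark.sql("SELECT COUNT(*) FROM sales").show()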

Spark SQL EXPLAIN Operator and Examples

Spark SQL uses the Catalyst optimizer to create an optimal execution plan. The execution plan changes based on scans, join operations, join order, join types, sub-queries and aggregate operations. In this article, we will check the Spark SQL EXPLAIN operator and some working examples. Spark SQL EXPLAIN Operator: the Spark SQL EXPLAIN operator provides detailed plan information about a SQL statement without actually running it. You can use the Spark SQL EXPLAIN operator to display the execution plan that the Spark execution engine generates and uses while executing a query. You can use this execution plan…
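A quick sketch of the operator in action (the table and column names are placeholders): you can prefix any statement with EXPLAIN, or call explain() on a DataFrame for the same information.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Show the plan for a SQL statement without running it.
spark.sql("EXPLAIN EXTENDED SELECT dept, COUNT(*) FROM employees GROUP BY dept") \
     .show(truncate=False)

# Equivalent DataFrame API: print the physical plan to stdout.
spark.table("employees").groupBy("dept").count().explain()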

Create Pyspark sparkContext within python Program

In my other article, we have seen how to connect to Spark using a JDBC driver and the Jaydebeapi module. Hadoop clusters like the Cloudera Hadoop distribution (CDH) do not provide a JDBC driver. You either have to set up your own JDBC access by using the Spark thrift server, or create a Pyspark sparkContext within your Python program to enter the Apache Spark world. Spark Context or Hive Context: SparkContext or HiveContext is the entry gate to interact with the Spark engine. When you execute any Spark application, the driver program initiates the context for you. For example, when you start…
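A minimal sketch of creating the context yourself inside a Python program (the app name and master are placeholders; in the pyspark shell this object is created for you as sc):

from pyspark import SparkConf, SparkContext

# Build a configuration and initialise the context manually.
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")  # placeholder app/master
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10))
print(rdd.sum())   # simple action to prove the context works

sc.stop()          # release the context when done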

Execute Pyspark Script from Python and Examples

As Apache Spark gains popularity, most organizations are trying to integrate their existing big data ecosystems with Spark so that they can utilize the speed and distributed computation power of Apache Spark. In my earlier post, I discussed various Methods to Access Hive Tables from Apache Spark and how to access Spark from Python. In this post, we will discuss how to execute a pyspark script from Python, with working examples. Python Pyspark: Python is a widely used programming language and is easy to learn. Well, you can access Apache Spark within python…
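One common approach, sketched below under assumed paths, is to point a plain Python interpreter at the Spark installation and add Spark's bundled python libraries to sys.path so that pyspark becomes importable. The installation path is a placeholder for your environment.

import glob
import os
import sys

# Point at the Spark installation (placeholder path) so plain Python can import pyspark.
spark_home = "/usr/local/spark"
os.environ["SPARK_HOME"] = spark_home
sys.path.append(os.path.join(spark_home, "python"))
# py4j ships inside Spark; its zip name carries a version, so locate it with glob.
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

from pyspark import SparkContext

sc = SparkContext("local[*]", "FromPlainPython")
print(sc.parallelize([1, 2, 3, 4]).count())
sc.stop()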

Methods to Access Hive Tables from Apache Spark

Nowadays, with growing data sizes, Apache Spark is gaining importance. It is an open-source, general-purpose and lightning-fast distributed computing framework. Apache Spark can be up to 100 times faster than Hadoop MapReduce for in-memory workloads. Considering its speed, you can use Apache Spark to access the Hive metastore and process the required data. In this post, we will check methods to access Hive tables from Apache Spark. Why Apache Spark? As mentioned earlier, Apache Spark can be up to 100 times faster than Hadoop for in-memory processing and more than 10 times faster than accessing data from disk. Spark…
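One of those methods, sketched here, is to enable Hive support on the SparkSession so that Spark talks to the Hive metastore directly. It assumes hive-site.xml is available in Spark's conf directory, and the database and table names are placeholders.

from pyspark.sql import SparkSession

# enableHiveSupport() wires the session to the Hive metastore.
spark = (SparkSession.builder
         .appName("HiveAccess")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM default.sample_table LIMIT 10").show()   # placeholder table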

Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python

Apache Spark supports both local and remote metastores. You can connect to a remote HiveServer2 using the Apache Spark JDBC drivers. The Hive JDBC driver for Spark2 is available in the jars folder located in the Spark installation directory. In this post, we will check the steps to connect to HiveServer2 using the Apache Spark JDBC driver and Python. Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python: there are various methods that you can use to connect to HiveServer2, and using the Spark JDBC driver is one of the easiest. Methods to Access Hive Tables…
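As a rough sketch of that method, the snippet below uses the Jaydebeapi module with the Hive JDBC driver shipped in Spark's jars folder; the host, port, credentials, table and jar path (including the jar's version) are placeholders, and in practice the driver may need additional Hadoop jars on its classpath.

import jaydebeapi

# Placeholder connection details; adjust host, port, user and jar path.
conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://hs2-host:10000/default",
    ["hive_user", "hive_password"],
    "/usr/local/spark/jars/hive-jdbc-1.2.1.spark2.jar",  # placeholder jar path/version
)

cursor = conn.cursor()
cursor.execute("SELECT * FROM sample_table LIMIT 5")     # placeholder table
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()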

Apache Spark Architecture, Design and Overview

Apache Spark is a fast, open-source and general-purpose cluster computing system with an in-memory data processing engine. Apache Spark is written in Scala and provides high-level APIs in Java, Scala, Python and R, along with an optimized engine that supports general execution graphs. The Apache Spark architecture is designed in such a way that you can use it for ETL (Spark SQL), analytics, machine learning (MLlib), graph processing or building streaming applications (Spark Streaming). Spark is often called a cluster computing engine or simply an execution engine. Apache Spark is one of the most…
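As a tiny illustration of that unified design, the sketch below uses a single SparkSession to reach both the low-level RDD API and the Spark SQL module from Python; the names and data are illustrative.

from pyspark.sql import SparkSession

# One entry point exposes the whole engine: RDDs, SQL, MLlib, streaming.
spark = SparkSession.builder.appName("ArchitectureDemo").getOrCreate()

# Low-level RDD API through the underlying SparkContext.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])

# Structured API on the same engine: convert to a DataFrame and query with SQL.
df = rdd.toDF(["key", "value"])
df.createOrReplaceTempView("pairs")
spark.sql("SELECT key, SUM(value) AS total FROM pairs GROUP BY key").show()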
