Execute Pyspark Script from Python and Examples

As Apache Spark gains popularity, most organizations are trying to integrate their existing big data ecosystem with Spark so that they can use its speed and distributed computation power. In my earlier post, Methods to Access Hive Tables from Apache Spark, I discussed various ways to access Spark from Python. In this post, we will discuss how to execute a PySpark script from Python, with working examples. Python and PySpark: Python is a widely used programming language and is easy to learn. You can access Apache Spark from within Python…
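One common way to run a PySpark script from a plain Python program is to launch `spark-submit` as a child process. The sketch below is only an illustration under assumptions: it assumes `spark-submit` is on the `PATH`, and the helper names `build_spark_submit_command` and `run_pyspark_script` are hypothetical, not taken from the post.

```python
import subprocess


def build_spark_submit_command(script_path, master="local[*]", script_args=()):
    """Return the argument list used to launch a PySpark script.

    Assumes the `spark-submit` launcher is available on the PATH.
    """
    return ["spark-submit", "--master", master, script_path, *script_args]


def run_pyspark_script(script_path, master="local[*]", script_args=()):
    """Run the script with spark-submit and return (exit_code, stdout, stderr)."""
    command = build_spark_submit_command(script_path, master, script_args)
    # capture_output collects stdout/stderr so the caller can inspect them.
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stdout, completed.stderr
```

For example, `run_pyspark_script("wordcount.py", script_args=["input.txt"])` would start the (placeholder) script `wordcount.py` in local mode and return its exit code and captured output.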


Methods to Access Hive Tables from Apache Spark

Nowadays, with growing data sizes, Apache Spark is gaining importance. It is an open-source, general-purpose, and fast distributed computing framework. Spark is often cited as running up to 100 times faster than Hadoop MapReduce for in-memory workloads. Considering its speed, you can use Apache Spark to access the Hive metastore and process the required data. In this post, we will check methods to access Hive tables from Apache Spark. Why Apache Spark? As mentioned earlier, Apache Spark is claimed to be up to 100 times faster than Hadoop MapReduce in memory and about 10 times faster on disk. Spark…
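As a rough illustration of one such method, the sketch below queries a Hive table through a `SparkSession` created with Hive support enabled. It assumes PySpark is installed and that Spark is configured to reach the Hive metastore; the helper names, the application name, and the database and table names are hypothetical.

```python
def hive_query(database, table, limit=10):
    """Build a simple SELECT statement against a Hive table."""
    return f"SELECT * FROM {database}.{table} LIMIT {int(limit)}"


def read_hive_table(database, table, limit=10):
    """Return a Spark DataFrame with the first rows of a Hive table."""
    # Imported lazily so hive_query() stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-access-sketch")  # hypothetical application name
        .enableHiveSupport()            # lets Spark SQL use the Hive metastore
        .getOrCreate()
    )
    return spark.sql(hive_query(database, table, limit))
```

Calling `read_hive_table("default", "employees", 5).show()` would print up to five rows, provided a table with that (placeholder) name exists in the metastore.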


Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python

Apache Spark supports both local and remote metastores. You can connect to a remote HiveServer2 using the Apache Spark JDBC driver. The Hive JDBC driver for Spark2 is available in the jars folder located in the Spark installation directory. In this post, we will check the steps to connect to HiveServer2 using the Apache Spark JDBC driver and Python. Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python: There are various methods that you can use to connect to HiveServer2. Using the Spark JDBC driver is one of the easier methods. Methods to Access Hive Tables…
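A minimal sketch of such a connection from Python follows. It assumes the third-party `jaydebeapi` package as the JDBC bridge, which may not be what the post itself uses, and a HiveServer2 instance on the default port 10000. The helper names are hypothetical, and the caller must supply the actual paths of the Hive JDBC jars from the Spark jars folder.

```python
def hive_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL (10000 is the default HiveServer2 port)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def connect_hiveserver2(host, jar_paths, user="", password="",
                        port=10000, database="default"):
    """Open a DB-API connection to HiveServer2 through the Hive JDBC driver.

    jar_paths: list of paths to the Hive JDBC jar and its dependencies.
    """
    # Assumption: jaydebeapi is installed; imported lazily so that
    # hive_jdbc_url() works without it.
    import jaydebeapi

    return jaydebeapi.connect(
        "org.apache.hive.jdbc.HiveDriver",      # Hive JDBC driver class
        hive_jdbc_url(host, port, database),
        [user, password],
        jar_paths,
    )
```

The returned object follows the Python DB-API, so a query can be run with `cursor = conn.cursor()`, `cursor.execute("SHOW TABLES")`, and `cursor.fetchall()`.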


Apache Spark Architecture, Design and Overview

Apache Spark is a fast, open-source, general-purpose cluster computing system with an in-memory data processing engine. Apache Spark is written in Scala and provides high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general execution graphs. The Apache Spark architecture is designed so that you can use it for ETL (Spark SQL), analytics, machine learning (MLlib), graph processing, or building streaming applications (Spark Streaming). Spark is often called a cluster computing engine or simply an execution engine. Apache Spark is one of the most…
