Execute Pyspark Script from Python and Examples

As Apache Spark gains popularity, most organizations are trying to integrate their existing big data ecosystem with Spark so that they can utilize the speed and distributed computation power of Apache Spark. In my earlier post, I discussed various Methods to Access Hive Tables from Apache Spark to access Spark from Python. In this post, we will discuss how to execute a pyspark script from Python, with working examples. Python Pyspark: Python is a widely used programming language and is easy to learn. Well, you can access Apache Spark within python…
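One common way to launch a PySpark script from a plain Python program is to call the `spark-submit` launcher through `subprocess`. The sketch below is a minimal illustration under stated assumptions, not the method the full post describes: it assumes `spark-submit` is on the `PATH`, and the helper names `build_spark_submit_command` and `run_pyspark_script`, as well as the script name `word_count.py`, are hypothetical.

```python
import shutil
import subprocess


def build_spark_submit_command(script_path, master="local[*]", script_args=None):
    """Return the argument list for launching a PySpark script with spark-submit."""
    command = ["spark-submit", "--master", master, script_path]
    # Any extra arguments are passed through to the PySpark script itself.
    command.extend(script_args or [])
    return command


def run_pyspark_script(script_path, master="local[*]", script_args=None):
    """Run the script and return (exit code, stdout, stderr)."""
    # Assumption: spark-submit is installed and reachable on PATH.
    if shutil.which("spark-submit") is None:
        raise FileNotFoundError("spark-submit was not found on PATH")
    completed = subprocess.run(
        build_spark_submit_command(script_path, master, script_args),
        capture_output=True,
        text=True,
    )
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    # "word_count.py" and "input.txt" are hypothetical names for illustration.
    print(build_spark_submit_command("word_count.py", script_args=["input.txt"]))
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths or arguments contain spaces.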


Methods to Access Hive Tables from Python

Apache Hive is a database framework on top of the Hadoop Distributed File System (HDFS) for querying structured and semi-structured data. Just like in your regular RDBMS, you access HDFS files in the form of tables. You can create tables, views, etc. in Apache Hive. You can analyze structured data using the HiveQL language, which is similar to Structured Query Language (SQL). In this article, we will check different methods to access Hive tables from a Python program. The methods we are going to discuss here will help you connect to Hive tables and get…


Methods to Access Hive Tables from Apache Spark

Nowadays, with growing data sizes, Apache Spark is gaining importance. It is an open-source, general-purpose and lightning-fast distributed computing framework. Apache Spark can be up to 100 times faster than Hadoop MapReduce for in-memory workloads. Considering its speed, you can use Apache Spark to access the Hive metastore and process the required data. In this post, we will check methods to access Hive tables from Apache Spark. Why Apache Spark? As mentioned earlier, Apache Spark can be up to 100 times faster than Hadoop MapReduce when working in memory and around 10 times faster when working from disk. Spark…


Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python

Apache Spark supports both local and remote metastores. You can connect to a remote HiveServer2 using the Apache Spark JDBC drivers. The Hive JDBC driver for Spark2 is available in the jars folder located in the Spark installation directory. In this post, we will check the steps to connect to HiveServer2 using the Apache Spark JDBC driver and Python. Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python: There are various methods that you can use to connect to HiveServer2. Using the Spark JDBC driver is one of the easy methods. Methods to Access Hive Tables…


Set and Use Environment Variable inside Python Script

Setting and using bash environment variables in a Python script file is somewhat difficult. The same step is very easy and straightforward in a shell script. In this post, we will check one of the methods to set and use an environment variable inside a Python script file. Note that the steps mentioned in this post help only if you are setting and using that variable inside the same process, i.e. in the same Python script. There is no way you can modify bash script from python and use that variable…
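As a minimal sketch of the idea, the standard library's `os.environ` mapping can set a variable for the running script. The variable name `DEMO_DATA_DIR` and its value are hypothetical and chosen only for illustration. The change is visible inside the same process and in child processes started from it, but not in the shell that launched the script.

```python
import os
import subprocess
import sys

# DEMO_DATA_DIR is a hypothetical variable name used only for illustration.
os.environ["DEMO_DATA_DIR"] = "/tmp/demo"

# The value can be read back anywhere later in the same script.
data_dir = os.environ.get("DEMO_DATA_DIR", "not set")

# Child processes started from this script inherit a copy of os.environ.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_DATA_DIR'])"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = child.stdout.strip()

print(data_dir, child_value)
```

Using `os.environ.get` with a default avoids a `KeyError` when the variable has not been set.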


Steps to Connect HiveServer2 from Python using Hive JDBC Drivers

HiveServer2 has a JDBC driver, and it supports both embedded and remote access to HiveServer2. Usually, remote HiveServer2 is recommended for production environments, as it does not require direct metastore or HDFS access to be given to Hive users. In this article, we will check the steps to connect to HiveServer2 from Python using Hive JDBC drivers. Steps to Connect HiveServer2 from Python using Hive JDBC Drivers: The Hive JDBC driver is one of the most widely used methods to connect to HiveServer2. You can use Hive JDBC with the open-source Python Jaydebeapi module.…
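A connection through Jaydebeapi might look roughly like the sketch below. It assumes the third-party `jaydebeapi` package and a Java runtime are installed, that the Hive JDBC driver jars are available locally, and that HiveServer2 listens on port 10000. The helper names, host, and credentials are hypothetical placeholders, not values from the full post.

```python
def build_hive_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def connect_to_hiveserver2(host, user, password, driver_jars, port=10000, database="default"):
    """Open a connection through the Hive JDBC driver using jaydebeapi."""
    # Third-party module, imported lazily so the URL helper works without it.
    import jaydebeapi

    return jaydebeapi.connect(
        "org.apache.hive.jdbc.HiveDriver",          # Hive JDBC driver class
        build_hive_jdbc_url(host, port, database),  # connection URL
        [user, password],                           # driver arguments
        driver_jars,                                # path(s) to the Hive JDBC jar files
    )


if __name__ == "__main__":
    # "hive.example.com" is a hypothetical host used only for illustration.
    print(build_hive_jdbc_url("hive.example.com"))
```

Once connected, the returned object follows the Python DB-API, so queries run through `connection.cursor()`, `cursor.execute(...)` and `cursor.fetchall()`.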


Execute Hive Beeline JDBC String Command from Python

To perform any analysis, you need to have data in place. To collect data, you may have to connect your application to different data sources. In this article, we will discuss one such approach: executing a Hive Beeline JDBC string command from a Python application. This is one of the simplest and easiest approaches to connect to a Kerberos-enabled HiveServer2 using the Beeline shell. I was working on a machine learning project to predict query execution time on a Hadoop Hive cluster. We were gathering various features from the HiveQL…
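The approach can be sketched with `subprocess` calling the `beeline` client. This is an illustrative sketch under assumptions rather than the code from the full post: it assumes `beeline` is on the `PATH` and, for a Kerberos-enabled cluster, that a valid ticket already exists (for example from `kinit`). The helper names, host, and principal are hypothetical.

```python
import subprocess


def build_beeline_command(jdbc_url, query):
    """Return the argument list that runs one query through Beeline."""
    return [
        "beeline",
        "-u", jdbc_url,           # JDBC connection string
        "--silent=true",          # suppress informational log output
        "--outputformat=csv2",    # comma-separated output that is easy to parse
        "-e", query,              # query to execute
    ]


def run_beeline_query(jdbc_url, query):
    """Run the query and return (exit code, stdout, stderr)."""
    completed = subprocess.run(
        build_beeline_command(jdbc_url, query),
        capture_output=True,
        text=True,
    )
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    # Hypothetical host and Kerberos principal, shown only to illustrate the URL shape.
    url = "jdbc:hive2://hive.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"
    print(build_beeline_command(url, "SHOW TABLES"))
```

Passing the arguments as a list keeps the semicolon in the JDBC string from being interpreted by a shell.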


Apache Spark Architecture, Design and Overview

Apache Spark is a fast, open-source, general-purpose cluster computing system with an in-memory data processing engine. Apache Spark is written in Scala and provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. The Apache Spark architecture is designed in such a way that you can use it for ETL (Spark SQL), analytics, machine learning (MLlib), graph processing or building streaming applications (Spark Streaming). Spark is often called a cluster computing engine or simply an execution engine. Apache Spark is one of the most…


Netezza Stored Procedure ARRAY Variables and Examples

The ARRAY data type is a composite data value that consists of zero or more elements of a specified data type. Netezza nzplsql allows you to define ARRAY types along with other scalar variables. In this article, we will check Netezza stored procedure ARRAY variables, their declaration and examples. Netezza Stored Procedure ARRAY Variables: You can define and use VARRAY in Netezza stored procedures. You can insert values into array variables, increase the array size if its length is exceeded, and remove elements from the array. ARRAY variables are allowed anywhere…


Apache Hive Grouping Function, Alternative and Examples

Most relational databases support the GROUPING function to segregate super-aggregated rows. Support for the SQL GROUPING function was added to Apache Hive in version 2.3.0, but those who are using a lower version of Hive will have a difficult time porting SQL queries that are written using grouping functions. In this article, we will check an Apache Hive GROUPING function alternative and examples. Grouping Function: In general, the grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row. The value 0 represents a column that is part…
