Methods to Access Hive Tables from Apache Spark

Nowadays, with growing data sizes, Apache Spark is gaining importance. It is an open-source, general-purpose, fast distributed computing framework. Spark is often cited as running up to 100 times faster than Hadoop MapReduce. Considering its speed, you can use Apache Spark to access the Hive metastore and process the required data. In this post, we will check methods to access Hive tables from Apache Spark. Why Apache Spark? As mentioned earlier, Apache Spark is cited as up to 100 times faster than Hadoop MapReduce when working in memory, and about 10 times faster when working from disk. Spark…

Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python

Apache Spark supports both local and remote metastores. You can connect to a remote HiveServer2 using the Apache Spark JDBC drivers. The Hive JDBC driver for Spark2 is available in the jars folder of the Spark installation directory. In this post, we will check the steps to connect to HiveServer2 using the Apache Spark JDBC driver and Python. Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python: There are various methods that you can use to connect to HiveServer2. Using the Spark JDBC driver is one of the easier methods. Methods to Access Hive Tables…

Set and Use Environment Variable inside Python Script

Setting and using bash environment variables from a Python script is somewhat difficult, whereas the same step is easy and straightforward in a shell script. In this post, we will check one of the methods to set and use an environment variable inside a Python script file. Note that the steps mentioned in this post help only if you are setting and using the variable inside the same process, i.e. in the same Python script. There is no way you can modify bash script from python and use that variable…
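The same-process behaviour described in this excerpt can be sketched as follows. This is a minimal illustration rather than the code from the original post: the variable name DEMO_VAR and the helper set_and_read are hypothetical. It sets a variable through os.environ, reads it back in the same script, and shows that a child process started from the script inherits it. The parent shell that launched the script is not affected.

```python
import os
import subprocess
import sys


def set_and_read(name, value):
    """Set an environment variable in this process and read it back.

    Returns the value seen in this process and the value seen by a
    child Python process started from it.
    """
    # Assigning to os.environ changes the environment of the current process.
    os.environ[name] = value
    in_process = os.environ.get(name)

    # Child processes inherit the modified environment by default.
    child = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({name!r}))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return in_process, child.stdout.strip()


# DEMO_VAR is a hypothetical name used only for this illustration.
result = set_and_read("DEMO_VAR", "hello")
```

Once the script exits, the variable is gone: the shell that started the script never sees DEMO_VAR.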

Steps to Connect HiveServer2 from Python using Hive JDBC Drivers

HiveServer2 has a JDBC driver that supports both embedded and remote access to HiveServer2. Usually, remote HiveServer2 is recommended for production environments, as it does not require direct metastore or HDFS access to be given to Hive users. In this article, we will check the steps to connect to HiveServer2 from Python using Hive JDBC drivers. Steps to Connect HiveServer2 from Python using Hive JDBC Drivers: The Hive JDBC driver is one of the most widely used methods to connect to HiveServer2. You can use Hive JDBC with the open-source Python module JayDeBeApi.…

Execute Hive Beeline JDBC String Command from Python

To perform any analysis, you need to have data in place. To collect data, you may have to connect your application to different data sources. In this article, we will discuss one such approach: executing a Hive Beeline JDBC string command from a Python application. This is one of the simplest approaches to connect to a Kerberos-secured HiveServer2 using the Beeline shell. I was working on a machine learning project to predict query execution time on a Hadoop Hive cluster. We were gathering various features from the HiveQL…
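A rough sketch of this approach, assuming Beeline is on the PATH and a valid Kerberos ticket already exists (for example from kinit). The host, port, database, and principal below are placeholders, and the helper names build_beeline_command and run_beeline are hypothetical, not taken from the original post. The options used (-u, -e, --silent, --outputformat) are standard Beeline command-line options.

```python
import subprocess


def build_beeline_command(host, port, database, principal, query):
    """Build the argument list for a Beeline call against a Kerberized HiveServer2."""
    # The principal is passed as a session variable inside the JDBC URL.
    jdbc_url = f"jdbc:hive2://{host}:{port}/{database};principal={principal}"
    return [
        "beeline",
        "-u", jdbc_url,
        "--silent=true",         # suppress log chatter
        "--outputformat=csv2",   # plain CSV rows, easy to parse
        "-e", query,
    ]


def run_beeline(command):
    """Run Beeline and return its standard output as a list of lines.

    Requires a reachable HiveServer2, so it is not called in this sketch.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# Placeholder connection details; replace with values for your cluster.
cmd = build_beeline_command(
    "hs2.example.com", 10000, "default", "hive/_HOST@EXAMPLE.COM", "SHOW TABLES"
)
```

Passing the command as a list, rather than one shell string, avoids quoting problems with the semicolon in the JDBC URL and with spaces in the query.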

Apache Hive Grouping Function, Alternative and Examples

Most relational databases support the GROUPING function to distinguish super-aggregated rows. Apache Hive support for the SQL GROUPING function was added in Hive 2.3.0, but those who are using a lower version of Hive will have a difficult time porting SQL queries that are written using grouping functions. In this article, we will check the Apache Hive GROUPING function, an alternative, and examples. Grouping Function: In general, the grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row. The value 0 represents a column that is part…

Step by Step Guide Connecting HiveServer2 using Python Pyhive

Data plays an important role in every decision-making process. You may have to connect to various remote servers to get the required data for your application. This article explains how to connect to Hive running on a remote host (HiveServer2) using a commonly used Python package, Pyhive. In this article, we will check a step-by-step guide to connecting to HiveServer2 using Python Pyhive. There are a lot of other Python packages available to connect to remote Hive, but Pyhive is one of the easiest to use and is well maintained and supported. There is an option to connect to…

Apache Hive Table Update using ACID Transactions and Examples

Apache Hive and Cloudera Impala support SQL on Hadoop and provide a better way to manage data in the Hadoop ecosystem. Many frameworks that support SQL on Hadoop are available, but Hive and Impala are widely used and popular. Until recently, Apache Hive did not support updating tables. From version 0.14 onwards, Hive supports ACID transactions. You must define the table as transactional to use ACID operations such as UPDATE and DELETE. In this article, we will check Apache Hive table update using ACID transactions, with examples. Apache Hive Table…

Apache Hive Derived Column Support and Alternative

Derived columns are columns that are derived from previously derived or computed columns in the same table. Derived or computed columns are virtual columns that are not physically stored in the table; their values are recalculated every time they are referenced in a query. Many relational databases, such as Netezza, support derived columns, but Apache Hive does not. In this article, we will check Apache Hive derived column support and an alternative method that you can use to derive columns. What are derived columns? Before going in…

How to update Hive Table without Setting Table Properties?

Apache Hive and Cloudera Impala provide a better way to manage data in the Hadoop ecosystem. Many frameworks that support SQL on Hadoop are available, but Hive and Impala are widely used and popular. Until recently, Apache Hive did not support updating tables. You must set TBLPROPERTIES to use transactions on a Hive table. These are relatively new features and should be used with caution. In this article, we will discuss how to update a Hive table without setting table properties. You should not think Apache Hive as a…