Details about bigdata

Steps to Connect HiveServer2 from Python using Hive JDBC Drivers

HiveServer2 has a JDBC driver that supports both embedded and remote access to HiveServer2. Remote HiveServer2 is usually recommended for production environments because it does not require giving Hive users direct metastore or HDFS access. In this article, we will go through the steps to connect to HiveServer2 from Python using Hive JDBC drivers. The Hive JDBC driver is one of the most widely used methods of connecting to HiveServer2. You can use the Hive JDBC driver with the open source Python JayDeBeApi module.…
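As a rough illustration of the JDBC route, the sketch below builds a HiveServer2 JDBC URL and passes it to JayDeBeApi. The helper names, the default port 10000, and the jar paths you would supply are assumptions made for this example, not details taken from the article.

```python
def hive_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/db."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def connect_hive_jdbc(host, jar_paths, user="", password=""):
    """Open a DB-API connection to HiveServer2 through the Hive JDBC driver.

    jar_paths is a list of local paths to the Hive JDBC jar and its
    dependencies (hypothetical; they depend on your installation).
    """
    # Imported lazily so the URL helper works without JayDeBeApi or a JVM.
    import jaydebeapi

    return jaydebeapi.connect(
        "org.apache.hive.jdbc.HiveDriver",  # Hive JDBC driver class
        hive_jdbc_url(host),
        [user, password],                   # driver arguments
        jar_paths,                          # jars added to the JVM classpath
    )
```

Calling `connect_hive_jdbc("hive.example.com", ["/path/to/hive-jdbc-standalone.jar"])` would return a connection whose cursor can run HiveQL; both the host name and the jar path here are placeholders.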


Execute Hive Beeline JDBC String Command from Python

To perform any analysis, you need to have data in place. To collect that data, you may have to connect your application to different data sources. In this article, we will discuss one such approach: executing a Hive Beeline JDBC string command from a Python application. This is a simple and easy way to connect to a Kerberos-enabled HiveServer2 using the Beeline shell. I was working on a machine learning project to predict query execution time on a Hadoop Hive cluster. We were gathering various features from the HiveQL…
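One way to picture this approach is a small wrapper that assembles the Beeline command line and runs it with `subprocess`. This is a minimal sketch: the host, realm, and helper names are hypothetical, and it assumes `beeline` is on the PATH and that a valid Kerberos ticket already exists.

```python
import subprocess

# Hypothetical Kerberos-enabled JDBC string; replace host and realm with yours.
KERBEROS_URL = (
    "jdbc:hive2://hive.example.com:10000/default;"
    "principal=hive/_HOST@EXAMPLE.COM"
)


def beeline_command(jdbc_url, query):
    """Return the argument list for a one-shot Beeline query."""
    return [
        "beeline",
        "-u", jdbc_url,          # JDBC connection string
        "--silent=true",         # suppress informational messages
        "--showHeader=false",    # omit the column header row
        "--outputformat=csv2",   # plain comma-separated output
        "-e", query,             # query to execute
    ]


def run_beeline(jdbc_url, query):
    """Run the query through Beeline and return its output lines."""
    result = subprocess.run(
        beeline_command(jdbc_url, query),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Passing the arguments as a list avoids shell quoting problems when the query itself contains quotes or semicolons.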


Apache Hive Grouping Function, Alternative and Examples

Most relational databases support the GROUPING function to distinguish super-aggregated rows. Support for the SQL GROUPING function was added in Apache Hive 2.3.0, but those who use an earlier version of Hive will have a difficult time porting SQL queries written with grouping functions. In this article, we will look at an alternative to the Apache Hive GROUPING function, with examples. Grouping Function: In general, the grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row. The value 0 represents a column that is part…
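To make the 0/1 indicator concrete, here is a small pure-Python emulation of a ROLLUP with a grouping flag per key. It is only a model of the semantics (0 when the column is part of the grouping set for that row, 1 when it has been aggregated away), not Hive code, and the sample data and function name are invented for illustration.

```python
from collections import defaultdict


def rollup_with_grouping(rows, keys, measure):
    """Emulate GROUP BY ... WITH ROLLUP and add a grouping flag per key.

    Flag 0: the key is part of the grouping set for that output row.
    Flag 1: the key was aggregated away (a super-aggregate row).
    """
    out = []
    # ROLLUP over (a, b) yields the grouping sets (a, b), (a), and ().
    for n in range(len(keys), -1, -1):
        active = keys[:n]
        totals = defaultdict(int)
        for row in rows:
            totals[tuple(row[k] for k in active)] += row[measure]
        for group, total in sorted(totals.items()):
            values = dict(zip(active, group))
            record = {k: values.get(k) for k in keys}
            record.update(
                {f"grouping_{k}": 0 if k in active else 1 for k in keys}
            )
            record[measure] = total
            out.append(record)
    return out


sales = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "east", "product": "b", "amount": 5},
    {"region": "west", "product": "a", "amount": 7},
]
result = rollup_with_grouping(sales, ["region", "product"], "amount")
```

The flag is what lets a query tell a subtotal row apart from a detail row whose key happens to be NULL.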


Step by Step Guide Connecting HiveServer2 using Python Pyhive

Data plays an important role in every decision-making process. You may have to connect to various remote servers to get the data your application requires. This article is a step-by-step guide to connecting to Hive running on a remote host (HiveServer2) using the commonly used Python package PyHive. Many other Python packages can connect to a remote Hive, but PyHive is one of the easiest to use and is well maintained and supported. There is an option to connect to…
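A minimal sketch of the PyHive route might look like the following. The helper names and default values are assumptions for this example; `fetch_rows` relies only on the DB-API 2.0 interface that PyHive connections expose, so it can be tried against any DB-API connection.

```python
def open_hive_connection(host, port=10000, username=None, database="default"):
    """Open a PyHive connection to a remote HiveServer2."""
    # Imported lazily so fetch_rows can be used without PyHive installed.
    from pyhive import hive

    return hive.connect(
        host=host, port=port, username=username, database=database
    )


def fetch_rows(connection, query):
    """Run a query on a DB-API 2.0 connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()
```

For example, `fetch_rows(open_hive_connection("hive.example.com", username="etl"), "SHOW DATABASES")` would list the databases on a hypothetical host.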


Apache Hive Table Update using ACID Transactions and Examples

Apache Hive and Cloudera Impala support SQL on Hadoop and provide a better way to manage data in the Hadoop ecosystem. Many frameworks support SQL on Hadoop, but Hive and Cloudera Impala are widely used and popular. Until recently, Apache Hive did not support updating tables. From version 0.14 onwards, Hive supports ACID transactions. You must define the table as transactional to use ACID operations such as UPDATE and DELETE. In this article, we will look at Apache Hive table updates using ACID transactions, with examples. Apache Hive Table…
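As a hedged sketch of what "defining the table as transactional" can look like, the helper below generates HiveQL for a bucketed ORC table with the `transactional` table property, followed by an UPDATE statement. The table, columns, and helper name are made up for illustration, and exact requirements (such as bucketing) vary between Hive versions.

```python
def transactional_table_ddl(table, columns, bucket_column, buckets=4):
    """Return HiveQL that creates a bucketed ORC table with ACID enabled."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"CLUSTERED BY ({bucket_column}) INTO {buckets} BUCKETS "
        "STORED AS ORC "
        "TBLPROPERTIES ('transactional'='true')"
    )


# Hypothetical table used only for this example.
ddl = transactional_table_ddl(
    "employees", [("id", "INT"), ("name", "STRING")], bucket_column="id"
)
update_sql = "UPDATE employees SET name = 'Alice' WHERE id = 1"
```

Both strings can then be sent to Hive through whichever client you use, provided the cluster has transactions enabled.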


Apache Hive Derived Column Support and Alternative

Derived columns are columns that are derived from previously derived or computed columns in the same table. Derived or computed columns are virtual columns that are not physically stored in the table; their values are recalculated every time they are referenced in a query. Many relational databases, such as Netezza, support derived columns, but Apache Hive does not. In this article, we will look at Apache Hive derived column support and an alternative method that you can use to derive columns. What are derived columns? Before going in…


How to update Hive Table without Setting Table Properties?

Apache Hive and Cloudera Impala provide a better way to manage data in the Hadoop ecosystem. Many frameworks support SQL on Hadoop, but Hive and Cloudera Impala are widely used and popular. Until recently, Apache Hive did not support updating tables. You must set TBLPROPERTIES to use transactions on a Hive table. These are relatively new features and should be used with caution. In this article, we will discuss how to update a Hive table without setting table properties. You should not think of Apache Hive as a…


Automatically Delete HBase row – Time to Live (TTL) Settings

One HBase feature is that it can delete rows from a table automatically. This saves much of the time otherwise needed to maintain rows when you are handling sensitive data. In this article, we will look at automatically deleting HBase rows using the time to live (TTL) setting. HBase Time to Live (TTL) Option - Automatically Delete HBase Row: You can set a TTL length in seconds on a ColumnFamily, and HBase will automatically delete (expire) rows once the expiration time is reached. This setting applies…
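The expiry behaviour can be pictured with a toy in-memory model. This is not HBase client code: the class, its methods, and the explicit `now` argument are all invented to illustrate how a per-family TTL in seconds makes old data disappear.

```python
class TtlColumnFamily:
    """Toy model of a column family whose cells expire after ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._cells = {}  # row key -> (value, write time in seconds)

    def put(self, row, value, now):
        """Store a value and remember when it was written."""
        self._cells[row] = (value, now)

    def get(self, row, now):
        """Return the value, or None if the row is missing or has expired."""
        cell = self._cells.get(row)
        if cell is None:
            return None
        value, written_at = cell
        if now - written_at >= self.ttl_seconds:
            del self._cells[row]  # expired data is dropped
            return None
        return value


family = TtlColumnFamily(ttl_seconds=60)
family.put("row1", "secret", now=0)
```

Reading `row1` at `now=30` still returns the value, while reading it at `now=61` returns nothing because the cell has outlived its TTL.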


HBase Auto Sharding Concept and Explanation

HBase is a storage manager on top of Hadoop HDFS that provides low-latency random reads and writes and can handle petabytes of data. One of the interesting capabilities of HBase is auto sharding, which simply means that tables are dynamically distributed by the system to different region servers when they become too large. In other words, splitting and serving regions can be thought of as auto sharding, as offered by other systems. Regions and Region Servers: In HBase, scalability and load balancing are…
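The idea of a table splitting itself as it grows can be sketched with a toy model in which each region is a sorted run of row keys that splits in half once it exceeds a row limit. This is a simplification under stated assumptions: the class and its row-count threshold are hypothetical, and it does not model assigning regions to region servers.

```python
import bisect


class RegionedTable:
    """Toy table that splits a region in two when it grows too large."""

    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region
        self.regions = [[]]  # each region is a sorted list of row keys

    def put(self, key):
        # Locate the region whose key range contains the key, using the
        # first key of every region after the first as its start key.
        start_keys = [region[0] for region in self.regions[1:]]
        index = bisect.bisect_right(start_keys, key)
        region = self.regions[index]
        if key not in region:
            bisect.insort(region, key)
        # Split at the middle key once the region exceeds the limit.
        if len(region) > self.max_rows:
            mid = len(region) // 2
            self.regions[index:index + 1] = [region[:mid], region[mid:]]


table = RegionedTable(max_rows_per_region=2)
for row_key in ["a", "b", "c", "d", "e"]:
    table.put(row_key)
```

After the five inserts the single starting region has been split several times, and each resulting region covers a contiguous range of row keys.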


Apache HBase Column Versions and Explanations

A cell in HBase is identified by a combination of row, column family, and version; it contains a value and a timestamp, which represents the version. In this article, we will look at Apache HBase column versions, with some examples. Apache HBase Column Versions: As mentioned at the beginning of this post, a {row, column, version} tuple exactly specifies a cell in HBase. In Apache HBase you can have many cells where the row and column are the same and only the version differs. A version is a timestamp value…
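The {row, column, version} addressing can be illustrated with a small in-memory model. It is not HBase client code; the class, the column name `cf:city`, and the integer timestamps are assumptions chosen for the example.

```python
class VersionedStore:
    """Toy store where a cell is addressed by (row, column, version)."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {version: value}

    def put(self, row, column, version, value):
        """Write a value for one specific version of a cell."""
        self._cells.setdefault((row, column), {})[version] = value

    def get(self, row, column, version=None):
        """Return the requested version, or the newest one by default."""
        versions = self._cells.get((row, column), {})
        if not versions:
            return None
        if version is None:
            version = max(versions)  # highest timestamp wins
        return versions.get(version)


store = VersionedStore()
store.put("row1", "cf:city", 1000, "Pune")
store.put("row1", "cf:city", 2000, "Mumbai")
```

Here the same row and column hold two cells that differ only in version: a plain read returns the newest value, while asking for version 1000 returns the older one.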
