SQL SET Operator MINUS Alternative in Hive and Examples

The set operators in SQL are used to combine the result sets of two or more SELECT statements. Here, a similar result set literally means the number of columns and their data types must match; otherwise, you must explicitly cast the data types of the values in the SELECT statements. Hive supports the UNION and UNION ALL set operators, but INTERSECT and MINUS are not supported as of now. In this article, we will check the SQL set operator MINUS alternative in Hive with an example…
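A common MINUS workaround in Hive is a LEFT OUTER JOIN that keeps only the unmatched rows. A minimal sketch, assuming two hypothetical tables table_a and table_b with matching id and name columns:

```sql
-- MINUS equivalent: rows in table_a that do not appear in table_b.
-- table_a and table_b are hypothetical tables with identical column lists.
SELECT a.id, a.name
FROM table_a a
LEFT OUTER JOIN table_b b
  ON a.id = b.id AND a.name = b.name
WHERE b.id IS NULL;
```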


What are SQL Features Missing in Hive?

Apache Hive syntax looks similar to the SQL-92 standard, but Hive is not fully SQL-92 compliant. The way it stores and queries underlying tables closely resembles the traditional databases available in the industry. HiveQL provides some extensions that are not present in traditional databases, but there is also a feature gap between traditional SQL and Apache Hive. In this article, we will check some basic yet important SQL features missing in Hive.

SQL Features Missing in Hive

Below are some of the important yet basic SQL features missing in Hive:

Online Transaction Processing (OLTP)
Correlated Sub-queries
Materialized Views
Truncate Table
Indexes…
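For instance, one of the listed gaps, correlated sub-queries, can often be rewritten as a join. A hypothetical sketch, assuming customers and orders tables:

```sql
-- Correlated form (not supported in older Hive versions):
--   SELECT c.name FROM customers c
--   WHERE EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.id);
-- Equivalent join rewrite that Hive accepts:
SELECT DISTINCT c.name
FROM customers c
JOIN orders o
  ON o.cust_id = c.id;
```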


Hive DELETE FROM Table Alternative– Easy Steps

By definition, a Data Warehouse is a mechanism to store historical data in an easily accessible manner. Data may be updated to keep tables up to date. This performance-critical operation still matters when you plan to migrate your data warehouse to the big data world. In this article, we will check one of the methods to remove outdated records from a Hive table, i.e. the Hive DELETE FROM table alternative.

Hive DELETE FROM Table Alternative

Apache Hive is not designed for online transaction processing and does not offer real-time queries and row level…
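The usual alternative is to overwrite the table (or a partition) with only the rows you want to keep. A minimal sketch, assuming a hypothetical sales table with a sale_date column:

```sql
-- Simulate "DELETE FROM sales WHERE sale_date < '2019-01-01'"
-- by rewriting the table with the rows that should survive.
INSERT OVERWRITE TABLE sales
SELECT * FROM sales
WHERE sale_date >= '2019-01-01';
```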


Running SQL using Spark-SQL Command line Interface-CLI

In my other post, we have seen how to connect to Spark SQL using a beeline JDBC connection. You can execute SQL queries in many ways, such as programmatically, using the spark or pyspark shell, or through the beeline JDBC client. Many do not know that Spark supports a spark-sql command line interface. You can use it to run the Hive metastore service in local mode.

Related Articles:
Methods to access Hive Tables from Apache Spark
Spark SQL Cumulative Sum Function and Examples

What is the Spark-SQL command line interface (CLI)?

The Spark SQL command line interface or…
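For illustration, the CLI can be started interactively or handed a single query or a script file. These invocations assume a working Spark installation on the PATH; my_table and the script path are placeholders:

```shell
# Start an interactive spark-sql session
spark-sql

# Run a single query and exit (my_table is a hypothetical table)
spark-sql -e "SELECT COUNT(*) FROM my_table"

# Run all statements in a script file
spark-sql -f /path/to/queries.sql
```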


How to Import Netezza Tables using Sqoop?

With growing data, organizations are moving the computation part to the Hadoop ecosystem. Apache Sqoop is an open source tool to import data from relational databases into Hadoop and vice versa, and it is one of the easiest tools for importing a relational database such as Netezza into the Hadoop ecosystem. The Sqoop command allows you to import all tables or a single table, or to execute a query and store the result in Hadoop HDFS. In this article, we will check how to import Netezza tables using Sqoop with some practical examples. Sqoop uses a connector-based architecture which…
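A representative single-table import might look like the following. The host, database, credentials, table, and target directory are all placeholders, and the Netezza JDBC driver jar must be available to Sqoop:

```shell
# Hypothetical connection details; requires Sqoop plus the Netezza JDBC driver.
sqoop import \
  --connect jdbc:netezza://nz-host:5480/sales_db \
  --username etl_user \
  --password-file /user/etl/.nz_password \
  --table CUSTOMERS \
  --target-dir /data/netezza/customers \
  --num-mappers 4
```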


Apache Hive User-defined Functions

Apache Hive is a data warehouse framework on top of the Hadoop ecosystem. The Apache Hive architecture is different from that of the other Hadoop tools available. Being an open source project, Apache Hive has added a lot of functionality since its inception, but it still lacks some basic features that are available in traditional data warehouse systems such as Netezza, Teradata and Oracle. In this post, we will check Apache Hive user-defined functions and how to use them to perform a specific task.

Apache Hive User-defined Functions

When you start…
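As a quick illustration, built-in UDFs are called directly in queries, while custom UDFs are registered from a jar before use. The jar path, class name, and tables below are hypothetical:

```sql
-- Built-in UDFs are available out of the box:
SELECT upper(name), length(name) FROM employees;

-- A custom UDF is registered from a jar (path and class are placeholders):
ADD JAR hdfs:///udfs/my-udfs.jar;
CREATE TEMPORARY FUNCTION clean_text AS 'com.example.hive.CleanTextUDF';
SELECT clean_text(comments) FROM feedback;
```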


Best Practices to Optimize Hive Query Performance

As we have seen in my other post, Steps to Optimize SQL Query Performance, we can improve the performance of back-end SQL by making simple improvements while writing SQL queries. The Apache Hive architecture behaves differently depending on the data and the type of HQL query you write. In this post, we will check best practices to optimize Hive query performance with some examples. In a data warehouse environment, we write a lot of queries and pay very little attention to the optimization part. Tuning the performance of a Hive query is one of the important steps and requires…
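As a small illustration, two common tactics are enabling parallel and vectorized execution and filtering on partition columns so Hive can prune partitions instead of scanning the whole table. The web_logs table and its log_date partition column are hypothetical:

```sql
SET hive.exec.parallel=true;               -- run independent stages in parallel
SET hive.vectorized.execution.enabled=true; -- process rows in batches

-- Filtering on the partition column (log_date) enables partition pruning:
SELECT COUNT(*) FROM web_logs
WHERE log_date = '2020-01-01';
```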


Hive ANALYZE TABLE Command – Table Statistics

Hive uses a cost-based optimizer. Statistics serve as the input to the cost functions of the Hive optimizer so that it can compare different plans and choose the best among them. Hive uses statistics such as the number of rows in a table or table partition to generate an optimal query plan. Beyond the optimizer, Hive uses these statistics in many other ways. In this post, we will check Apache Hive table statistics, the Hive ANALYZE TABLE command, and some examples.

Uses of Hive Table or Partition Statistics

There are many ways…
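The basic forms of the command look like this; the sales table and its partition are hypothetical:

```sql
-- Collect table-level statistics (row count, size, etc.):
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Collect statistics for a single partition:
ANALYZE TABLE sales PARTITION (sale_date='2020-01-01') COMPUTE STATISTICS;

-- Collect column-level statistics:
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;

-- Inspect the collected statistics:
DESCRIBE FORMATTED sales;
```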


Execute Pyspark Script from Python and Examples

As Apache Spark gains popularity, most organizations are trying to integrate their existing big data ecosystem with Spark so that they can utilize its speed and distributed computation power. In my earlier post, I discussed various Methods to Access Hive Tables from Apache Spark, including accessing Spark from Python. In this post we will discuss how to execute a pyspark script from Python with working examples.

Python Pyspark

Python is a widely used programming language and is easy to learn. Well, you can access Apache Spark within python…
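One common pattern is to launch the script as a child process through spark-submit using Python's subprocess module. A minimal sketch; the spark-submit binary is assumed to be on the PATH, and the parameter can be swapped out (e.g. for a plain Python interpreter) when testing:

```python
import subprocess

def run_pyspark_script(script_path, spark_submit="spark-submit", args=None):
    """Run a PySpark script as a child process via spark-submit.

    Returns (exit_code, stdout). The spark_submit parameter is a
    placeholder for the launcher binary and can be replaced for testing.
    """
    cmd = [spark_submit, script_path] + list(args or [])
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout
```

Calling run_pyspark_script("/path/to/job.py") then blocks until the Spark job finishes and hands back its exit status and console output for inspection.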


Methods to Access Hive Tables from Python

Apache Hive is a database framework on top of the Hadoop Distributed File System (HDFS) for querying structured and semi-structured data. Just as with a regular RDBMS, you access HDFS files in the form of tables. You can create tables, views, etc. in Apache Hive, and you can analyze structured data using the HiveQL language, which is similar to the Structured Query Language (SQL). In this article, we will check different methods to access Hive tables from a Python program. The methods we are going to discuss here will help you to connect to Hive tables and get…
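One of the options is the third-party PyHive package. A minimal sketch, assuming a reachable HiveServer2 instance; the host, port, and username are placeholders, and PyHive must be installed separately (pip install 'pyhive[hive]'):

```python
def fetch_hive_rows(query, host="localhost", port=10000, username="hive"):
    """Run a HiveQL query over HiveServer2 using PyHive and return all rows.

    The connection parameters are placeholders for a real cluster. The
    import is kept inside the function so this module still loads in
    environments where PyHive is not installed.
    """
    from pyhive import hive  # third-party dependency: pyhive[hive]
    conn = hive.Connection(host=host, port=port, username=username)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()
```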
