Best Practices to Optimize Hive Query Performance

As we have seen in my other post, Steps to Optimize SQL Query Performance, we can improve back-end SQL performance by making simple improvements while writing SQL queries. Apache Hive's architecture behaves differently depending on your data and the type of HQL query you write. In this post, we will check best practices to optimize Hive query performance with some examples. In a data warehouse environment, we write a lot of queries and pay very little attention to optimization. Tuning Hive query performance is an important step and require…
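A minimal sketch of two practices the post covers, partition pruning and vectorized execution; the settings are real Hive parameters, but the table, column, and partition values are hypothetical:

```sql
-- Illustrative session settings (defaults vary by Hive version).
SET hive.vectorized.execution.enabled = true;
SET hive.exec.parallel = true;

-- Partition pruning: filtering on the partition column lets Hive scan
-- only the matching partitions instead of the whole table.
SELECT order_id, amount
FROM   sales_orders            -- hypothetical table partitioned by order_date
WHERE  order_date = '2020-01-01';
```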


Netezza Query History details using nz_query_history Table

Sometimes you may need to verify the queries that have been running for a long time on production servers. There are several ways to perform this task. For instance, you can use the Netezza administrative tool to identify long-running queries. In this post, we will check how to get Netezza query history details using the nz_query_history table. The Netezza query history configuration steps are simple. You can follow the steps below to use Netezza query history views to collect historical query data in a separate history table in an optional history database. Why…
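As a sketch, a history lookup might look like the following; the column names are assumptions and depend on how your history configuration was created, so check the view definition in your history database:

```sql
-- Hypothetical column names; verify them against your nz_query_history view.
SELECT qh_sessionid,   -- session that issued the query
       qh_user,        -- user name
       qh_tsubmit,     -- submit timestamp
       qh_sql          -- query text
FROM   nz_query_history
ORDER  BY qh_tsubmit DESC
LIMIT  20;
```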


Spark SQL Performance Tuning – Improve Spark SQL Performance

You can improve the performance of Spark SQL by making simple changes to the system parameters. Doing so requires knowledge of Spark and of the type of file system that is used. In this article, we will check Spark SQL performance tuning to improve Spark SQL performance. Related articles: Apache Spark SQL Introduction and Features; Apache Spark Architecture, Design and Overview. Data Storage Considerations for Spark Performance: Before going into Spark SQL performance tuning, let us check some data storage considerations for Spark performance. Optimize…
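For instance, a few session-level knobs can be set directly from Spark SQL; the parameters are real Spark settings, but the values and table name below are illustrative examples, not recommendations:

```sql
-- Shuffle parallelism: tune to your cluster size and data volume.
SET spark.sql.shuffle.partitions = 200;

-- Tables smaller than this threshold (bytes) are broadcast in joins.
SET spark.sql.autoBroadcastJoinThreshold = 10485760;  -- 10 MB

-- Cache a hypothetical frequently joined table in memory.
CACHE TABLE hot_lookup;
```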


Hive ANALYZE TABLE Command – Table Statistics

Hive uses a cost-based optimizer. Statistics serve as the input to the cost functions of the Hive optimizer so that it can compare different plans and choose the best among them. Hive uses statistics such as the number of rows in a table or table partition to generate an optimal query plan. Beyond the optimizer, Hive uses these statistics in many other ways. In this post, we will check Apache Hive table statistics - the Hive ANALYZE TABLE command - with some examples. Uses of Hive Table or Partition Statistics: There are many ways…
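The command comes in table-, partition-, and column-level variants; the table and partition names below are hypothetical:

```sql
-- Table-level statistics (row count, raw data size, number of files).
ANALYZE TABLE sales_orders COMPUTE STATISTICS;

-- Statistics for one specific partition of a partitioned table.
ANALYZE TABLE sales_orders PARTITION (order_date = '2020-01-01') COMPUTE STATISTICS;

-- Column-level statistics, which feed the cost-based optimizer.
ANALYZE TABLE sales_orders COMPUTE STATISTICS FOR COLUMNS;
```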


Steps to Optimize SQL Query Performance – Best Practices

We pay a lot of attention to improving the performance of web applications but ignore back-end SQL performance tuning. Even experts such as application architects and developers often have no idea how databases process SQL queries internally, usually because of a lack of SQL and database knowledge. In this post, we will check best practices to optimize SQL query performance. How to Select a SQL Query for Optimization? Identifying the query to optimize is a crucial step. Even today's most advanced SQL engines require optimization. A simple SQL query tweak may increase the…
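The effect of one such tweak, indexing a filtered column, can be sketched with SQLite's EXPLAIN QUERY PLAN; the table and index names are illustrative:

```python
import sqlite3

# In-memory database; table and index names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                 [(i % 100, i * 1.5) for i in range(1000)])

# Without an index on customer_id, the predicate forces a full table scan.
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall()
print(before[0][3])  # e.g. 'SCAN orders' (wording varies by SQLite version)

# An index on the filtered column lets the optimizer seek instead of scan.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall()
print(after[0][3])  # e.g. 'SEARCH orders USING INDEX idx_orders_customer (customer_id=?)'
```

The same principle applies to any cost-based engine: an access path on the filtered column turns a full scan into a targeted lookup.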


Steps to Generate and Load TPC-DS Data into Netezza Server

The TPC-DS benchmark models the decision support system of a retail product supplier. It includes various queries and data maintenance tasks. The database schema, data population, queries, data maintenance model, and implementation rules have been designed to be broadly representative of modern decision support systems. In this post, we will discuss steps to generate and load TPC-DS data into a Netezza or PureData Systems server. What is TPC-DS? TPC-DS is a decision support benchmark defined by the Transaction Processing Performance Council (TPC), of which several major firms are members. You can get more…
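The generate-and-load flow might be sketched as follows; the paths, database name, and scale factor are assumptions to adapt to your environment, and flag spellings can differ between dsdgen and nzload versions:

```sh
# 1. Generate flat files at scale factor 1 (roughly 1 GB of data).
#    dsdgen ships with the TPC-DS toolkit; paths here are hypothetical.
./dsdgen -SCALE 1 -DIR /tmp/tpcds_data

# 2. Load one generated table into Netezza with nzload.
nzload -db TPCDS -t store_sales \
       -df /tmp/tpcds_data/store_sales.dat \
       -delim '|' -outputDir /tmp/tpcds_logs
```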


Different Methods to Display Netezza Table Statistics

IBM Netezza or PureData Systems uses a cost-based optimizer to determine the best methods for redistribution, scans, joins, and join orders. The optimizer uses statistics to generate an optimal execution plan, so you need to collect statistics on tables or databases regularly. In my other post, Netezza Generate Statistics: A Guide and Best Practices, we have discussed some best practices for generating statistics on Netezza tables. In this post, we will discuss different methods to display Netezza table statistics with some examples. What Are Table Statistics? Table statistics are nothing but information about each column…
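A sketch of collecting and then inspecting statistics; the table name is hypothetical, and the system-view column names are assumptions that may differ by release:

```sql
-- Collect statistics first (table name is illustrative).
GENERATE STATISTICS ON sales_orders;

-- One way to inspect per-column information is through system views;
-- verify the view and column names against your Netezza release.
SELECT attname,          -- column name
       attdispersion     -- assumed dispersion (distinct-value) statistic
FROM   _v_relation_column
WHERE  name = 'SALES_ORDERS';
```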


Spark SQL EXPLAIN Operator and Examples

Spark SQL uses the Catalyst optimizer to create an optimal execution plan. The execution plan changes based on scans, join operations, join order, join types, sub-queries, and aggregate operations. In this article, we will check the Spark SQL EXPLAIN operator and some working examples. Spark SQL EXPLAIN Operator: The Spark SQL EXPLAIN operator provides detailed plan information about a SQL statement without actually running it. You can use the Spark SQL EXPLAIN operator to display the execution plan that the Spark execution engine generates and uses while executing a query. You can use this execution plan…
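A minimal sketch; the table and columns are hypothetical, the EXPLAIN syntax is standard Spark SQL:

```sql
-- EXPLAIN prints the physical plan without running the query.
EXPLAIN
SELECT customer_id, SUM(amount)
FROM   sales_orders          -- hypothetical table
GROUP  BY customer_id;

-- EXPLAIN EXTENDED also shows the parsed, analyzed, and
-- optimized logical plans alongside the physical plan.
EXPLAIN EXTENDED
SELECT * FROM sales_orders WHERE customer_id = 42;
```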


Create Pyspark sparkContext within python Program

In my other article, we have seen how to connect to Spark using the JDBC driver and the Jaydebeapi module. Hadoop clusters such as the Cloudera Hadoop distribution (CDH) do not provide a JDBC driver out of the box. You either have to expose your own JDBC endpoint by using the Spark Thrift Server or create a PySpark SparkContext within a Python program to enter the Apache Spark world. Spark Context or Hive Context: SparkContext or HiveContext is the entry gate to interact with the Spark engine. When you execute any Spark application, the driver program initiates the context for you. For example, when you start…
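A minimal sketch of creating a SparkContext from Python, assuming a local PySpark installation; the app name and master URL are illustrative:

```python
# Requires PySpark on the machine running this script.
from pyspark import SparkConf, SparkContext

# Application name and master URL are placeholders; on a real cluster
# the master would point at YARN or a standalone Spark master.
conf = SparkConf().setAppName("my-app").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Quick sanity check that the context works.
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())  # 10

sc.stop()
```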


How to Connect Netezza using JDBC Driver and working Examples

Netezza is one of the most widely used MPP databases. You can connect to it using various methods and programming languages. Netezza supports ODBC, OLE DB, and JDBC drivers for connections. Connecting to Netezza using the JDBC driver is easy and one of the most widely used methods. In this article, we will check how to connect to Netezza using the JDBC driver, with some working examples. Netezza JDBC Driver: Netezza provides a JDBC driver; you can use it from any programming language that supports JDBC connections, such as Java or Python. You can download the JDBC…
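From Python, for example, a connection sketch using the Jaydebeapi module might look like this; the host, port, database, credentials, and jar path are placeholders:

```python
import jaydebeapi

# All connection details below are placeholders for your environment.
conn = jaydebeapi.connect(
    "org.netezza.Driver",                        # Netezza JDBC driver class
    "jdbc:netezza://192.168.1.100:5480/TESTDB",  # JDBC URL: host, port, database
    ["admin", "password"],                       # user name and password
    "/path/to/nzjdbc.jar",                       # driver jar shipped with Netezza
)

cur = conn.cursor()
cur.execute("SELECT CURRENT_TIMESTAMP")
print(cur.fetchone())

cur.close()
conn.close()
```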
