Details about bigdata

Cloudera Impala Regular Expression Functions and Examples

The Cloudera Impala regular expression functions identify precise patterns of characters in a given string. They are useful for extracting strings from data and for validating existing data: for example, validating dates, performing range checks, checking for characters, and extracting specific characters. In this article, we will check some commonly used Cloudera Impala regular expression functions with examples. Types of Cloudera Impala Regular Expression Functions: As of now, Cloudera Impala supports only three regular expression functions: regexp_extract, regexp_like, and regexp_replace. Impala regexp_extract Function: The Impala…
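
The use cases named in this excerpt (date validation, character checks, extraction) can be sketched in a few lines. The block below is a rough illustration in Python, chosen only because it runs without a cluster. The three helper functions are hypothetical stand-ins that borrow the Impala function names; they use Python's `re` syntax, which may differ from Impala's regular expression dialect in details, and the sample strings are invented.

```python
import re

# Hypothetical helpers that loosely mirror the three function names listed
# above. They are NOT Impala's implementation and use Python regex syntax.

def regexp_like(source: str, pattern: str) -> bool:
    """Return True if the pattern matches anywhere in source."""
    return re.search(pattern, source) is not None

def regexp_extract(source: str, pattern: str, index: int) -> str:
    """Return capture group `index` of the first match, or '' if none."""
    match = re.search(pattern, source)
    return (match.group(index) or "") if match else ""

def regexp_replace(source: str, pattern: str, replacement: str) -> str:
    """Replace every match of pattern in source with replacement."""
    return re.sub(pattern, replacement, source)

# Validate a date in YYYY-MM-DD form (format check only, not calendar-aware).
DATE_PATTERN = r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$"
is_valid_date = regexp_like("2018-02-28", DATE_PATTERN)

# Extract the digits that follow a known prefix (made-up sample value).
order_id = regexp_extract("order-12345-US", r"order-(\d+)", 1)

# Strip every non-digit character from a phone number (made-up sample value).
digits_only = regexp_replace("(555) 010-9999", r"\D", "")
```

The date pattern only checks the shape of the string, so a value such as `2018-02-31` would still pass; a real validation would need a calendar-aware check on top.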


Hadoop Hive Regular Expression Functions and Examples

The Hadoop Hive regular expression functions identify precise patterns of characters in a given string. They are useful for extracting strings from data and for validating existing data: for example, validating dates, performing range checks, checking for characters, and extracting specific characters. In this article, we will check some commonly used Hadoop Hive regular expression functions with examples. Types of Hadoop Hive Regular Expression Functions: As of now, Hive supports only two regular expression functions: REGEXP_REPLACE and REGEXP_EXTRACT. Hive REGEXP_REPLACE Function: Searches a string for a…


Apache Hive Table Design Best Practices and Considerations

As you plan your database or data warehouse migration to the Hadoop ecosystem, there are key table design decisions that will heavily influence overall Hive query performance. In this article, we will check Apache Hive table design best practices. Apache Hive Table Design Best Practices: Table design plays a very important role in Hive query performance. These design choices also have a significant effect on storage requirements, which in turn affects query performance by reducing the number of I/O operations and minimizing the memory required to process Hive queries. Read: Apache Hive…


Apache Hive EXPLAIN Command and Example

Recent versions of Hive use a cost-based optimizer (CBO) to improve query performance. The optimizer determines the best method for scan and join operations, join order, and aggregate operations. You can use the Apache Hive EXPLAIN command to display the actual execution plan that the Hive query engine generates and uses while executing any query in the Hadoop ecosystem. Read: Hive ANALYZE TABLE Command; Hive Performance Tuning Best Practices. Apache Hive Cost Based Optimizer: Latest version of Apache Hive uses the cost based optimizer to…


HiveServer2 Beeline Command Line Shell Options and Examples

HiveServer2 supports a command shell called Beeline. It is a JDBC client based on the SQLLine CLI. The Beeline shell works in both embedded and remote modes. In embedded mode, it runs an embedded Hive (similar to the Hive command line), whereas remote mode is for connecting to a separate HiveServer2 process over Thrift. In this article, we will check commonly used HiveServer2 Beeline command line shell options with examples. You can run all Hive command line and Interactive options from Beeline…


Apache Hive Performance Tuning Best Practices – Steps

When it comes to building a data warehouse on the Hadoop ecosystem, there are a handful of open source frameworks available. Hive and Impala are the most widely used for building a data warehouse on the Hadoop framework. Hive was originally developed at Facebook and Impala by Cloudera. In this article, we will explain Apache Hive performance tuning best practices and the steps to follow to achieve high performance. Apache Hive Performance Tuning Best Practices: You can adopt a number of steps to tune performance in Hive, including better schema design, the right file format, using proper execution engines etc.…


Commonly used Apache Hive Interactive Shell Command Options and Examples

You can use the Hive interactive shell command options to add JAR or resource files, set variables, display the list of resource files, and delete them when they are no longer required. The Hive interactive shell provides various options. You can even execute shell or Linux commands from the Hive interactive shell without leaving it. For ad hoc queries and data exploration, you can submit SQL statements in an interactive session. You can add UDF JAR files to Hive using the Apache Hive interactive shell command options. Read: Steps to Connect to Hive…


Commonly used Apache Hive Command Line Options and Examples

You can use the Hive shell interactive tool (hive) to set up databases and tables, insert data, and issue queries. If you have worked with Netezza or Oracle, this tool is similar to nzsql or SQL*Plus. For ad hoc queries and data exploration, you can submit SQL statements in an interactive session. You can also write the queries in a script file and execute them using the Hive shell command line options. Read: Steps to Connect to Hive Using Beeline CLI; HiveServer2 Beeline Command Line Shell Options and Examples; Commonly used Hive…


Apache HBase Writing Data Best Practices

For writing data into HBase, you use the methods of HTableInterface. You can also use the Java API directly, or use the HBase shell commands. When you issue an HBase shell put command, the coordinates of the data are the row, the column, and the timestamp. The timestamp is unique per version of the cell; it can be generated automatically or specified programmatically by your application, and it must be a long integer. In this article, we will check Apache HBase writing data best practices to tune the performance…
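
The coordinate model described in this excerpt (row, column, and a long-integer timestamp identifying one version of a cell) can be sketched as a small in-memory structure. The `CellStore` class below is a hypothetical teaching model in Python, not the HBase client API; the method names, the row and column values, and the use of the current time in milliseconds as the generated timestamp are illustrative assumptions.

```python
import time
from typing import Dict, Optional, Tuple

# Hypothetical in-memory model of cell coordinates; not the HBase client API.
Coordinates = Tuple[str, str, int]  # (row key, column, timestamp)

class CellStore:
    def __init__(self) -> None:
        self._cells: Dict[Coordinates, bytes] = {}

    def put(self, row: str, column: str, value: bytes,
            timestamp: Optional[int] = None) -> int:
        """Store one version of a cell and return the timestamp used."""
        if timestamp is None:
            # Assumption: generate the timestamp from the clock, in ms.
            timestamp = int(time.time() * 1000)
        if not isinstance(timestamp, int):
            raise TypeError("timestamp must be an integer")
        # Each distinct timestamp is a separate version of the same cell.
        self._cells[(row, column, timestamp)] = value
        return timestamp

    def get_latest(self, row: str, column: str) -> Optional[bytes]:
        """Return the value with the highest timestamp for this row/column."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

store = CellStore()
store.put("row1", "cf:name", b"old", timestamp=1)
store.put("row1", "cf:name", b"new", timestamp=2)
latest = store.get_latest("row1", "cf:name")
```

Writing the same row and column with a new timestamp adds a version rather than overwriting the old one, which is why `get_latest` has to pick the highest timestamp.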


Hive CREATE INDEX to Optimize and Improve Query Performance

The main goal of creating an index on a Hive table is to improve data retrieval speed and optimize query performance. For example, say you execute a Hive query with the filter condition WHERE col1 = 100. Without an index, Hive loads the entire table or partition to process the records; with an index on col1, it loads only part of the HDFS file. Note, however, that indexes on Hive tables are not recommended. CREATE INDEX can help if you are migrating your existing data warehouse to Hive and…
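
The difference between reading everything and reading only the relevant part can be shown with a toy model. The Python sketch below illustrates only the general idea of trading a full scan for a lookup structure; it does not model how Hive stores or uses its index tables, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (col1, payload)

def full_scan(rows: List[Row], wanted: int) -> Tuple[List[Row], int]:
    """Check every row; return the matches and the number of rows read."""
    matches = [row for row in rows if row[0] == wanted]
    return matches, len(rows)

def build_index(rows: List[Row]) -> Dict[int, List[int]]:
    """Map each col1 value to the positions of the rows that hold it."""
    index: Dict[int, List[int]] = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[0]].append(position)
    return dict(index)

def index_lookup(rows: List[Row], index: Dict[int, List[int]],
                 wanted: int) -> Tuple[List[Row], int]:
    """Read only the positions the index points at."""
    positions = index.get(wanted, [])
    return [rows[p] for p in positions], len(positions)

# Made-up data: 10,000 rows, with col1 cycling through 0..999.
rows = [(i % 1000, f"payload-{i}") for i in range(10_000)]
index = build_index(rows)

scan_matches, scan_reads = full_scan(rows, 100)
index_matches, index_reads = index_lookup(rows, index, 100)
```

Both paths return the same ten rows, but the scan touches all 10,000 rows while the lookup touches ten. The index itself still has to be built and kept up to date, which is part of the cost an index carries.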
