Spark SQL isnumeric Function Alternative and Example

Many organizations are moving their data warehouses to Hive and using Spark as the execution engine, which can improve performance. SQL offers several options for dealing with non-numeric values; for example, you can create user-defined functions to filter out unwanted data. In this article, we will check the Spark SQL isnumeric function alternative with examples.

Spark SQL isnumeric Function

Neither Spark SQL nor Apache Hive provides an isnumeric function. You have to write…
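As a rough illustration of the user-defined-function approach mentioned above, here is a minimal sketch of a numeric check in plain Python. The name `is_numeric` and the regular expression are assumptions made for this example, not the function from the full article; in a Spark job, a function like this would typically be registered as a UDF before being used in SQL.

```python
import re
from typing import Optional

# Hypothetical pattern: optional sign, then digits with an optional decimal part.
_NUMERIC_PATTERN = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def is_numeric(value: Optional[str]) -> bool:
    """Return True if value looks like a plain integer or decimal number."""
    if value is None:
        return False
    return _NUMERIC_PATTERN.fullmatch(value.strip()) is not None


# Filter out unwanted (non-numeric) values from some sample data.
rows = ["123", "-4.5", "abc", "12a", "", None]
numeric_rows = [v for v in rows if is_numeric(v)]
print(numeric_rows)
```

The pattern deliberately rejects scientific notation and thousands separators; whether those should count as numeric depends on your data.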

Set and Use Environment Variable inside Python Script

Setting and using bash environment variables from a Python script is somewhat harder than in a shell script, where the same step is easy and straightforward. In this post, we will check one method to set and use an environment variable inside a Python script file. Note that the steps in this post help only if you set and use the variable inside the same process, i.e. in the same Python script. There is no way you can modify bash script from Python and use that variable…
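A minimal sketch of the idea, using only the standard library. The variable name `DEMO_DATA_DIR` and its value are hypothetical and chosen only for illustration. The sketch also starts a child process to show that processes launched from the script inherit the change, while the shell that launched the script does not see it.

```python
import os
import subprocess
import sys

# DEMO_DATA_DIR is a hypothetical variable name used only for illustration.
os.environ["DEMO_DATA_DIR"] = "/tmp/demo"

# Read the variable back inside the same Python process.
same_process_value = os.environ.get("DEMO_DATA_DIR")
print(same_process_value)

# A child process started from this script inherits the modified environment.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('DEMO_DATA_DIR'))"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = child.stdout.strip()
print(child_value)
```

Once the script exits, the parent shell's environment is unchanged, which matches the same-process limitation described above.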

Steps to Connect HiveServer2 from Python using Hive JDBC Drivers

HiveServer2 has a JDBC driver that supports both embedded and remote access to HiveServer2. Remote HiveServer2 is usually recommended for production environments because it does not require Hive users to be given direct metastore or HDFS access. In this article, we will check the steps to connect to HiveServer2 from Python using Hive JDBC drivers.

Steps to Connect HiveServer2 from Python using Hive JDBC Drivers

The Hive JDBC driver is one of the most widely used methods to connect to HiveServer2. You can use Hive JDBC with the open source Python Jaydebeapi module.…
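The following is a rough sketch of what such a connection might look like, not the exact steps from the full article. The host name, port 10000, the `default` database, and the helper names `build_hive_jdbc_url` and `connect_to_hive` are all assumptions for illustration. The connection function needs the third-party `jaydebeapi` package, a Java runtime, and the Hive JDBC driver jar, so it is defined here but not called.

```python
def build_hive_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2 JDBC URL (port and database defaults are assumptions)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def connect_to_hive(host: str, user: str, password: str, driver_jar: str):
    """Open a connection through the Hive JDBC driver (hypothetical helper).

    driver_jar is the local path to the Hive JDBC driver jar.
    """
    import jaydebeapi  # third-party module, imported lazily

    return jaydebeapi.connect(
        "org.apache.hive.jdbc.HiveDriver",  # Hive JDBC driver class
        build_hive_jdbc_url(host),
        [user, password],
        driver_jar,
    )


# hive.example.com is a placeholder host name.
print(build_hive_jdbc_url("hive.example.com"))
```

With a reachable server, the returned connection would be used like any DB-API connection: obtain a cursor, execute a query, and fetch the rows.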

Apache Spark Architecture, Design and Overview

Apache Spark is a fast, open source, general-purpose cluster computing system with an in-memory data processing engine. Spark is written in Scala and provides high-level APIs in Java, Scala, Python and R, along with an optimized engine that supports general execution graphs. Its architecture is designed so that you can use it for ETL (Spark SQL), analytics, machine learning (MLlib), graph processing, or building streaming applications (Spark Streaming). Spark is often called a cluster computing engine or simply an execution engine. Apache Spark is one of the most…

Apache Hive Load Quoted Values CSV File and Examples

If you are reading this post, you are probably considering big data or have started using the big data ecosystem for large-scale data processing. Huge data means you may receive many different kinds of structured, unstructured and semi-structured data. Hive works much like a regular data warehouse appliance, and you may receive files with single- or double-quoted values. In this article, we will see how to load CSV files with quoted values into Apache Hive, along with some examples.

Apache Hive Load Quoted Values CSV File

Let us…

Commonly used Cloudera Impala Date Functions and Examples

This article gives short descriptions and examples of the commonly used Cloudera Impala date functions that you can use to manipulate date columns in Impala SQL. In real-world scenarios, many applications manipulate date and time data types. Impala SQL supports most of the date and time functions that relational databases support. Date types are highly formatted and complicated: each date value contains the century, year, month, day, hour, minute, and second. We shall see how to use the Impala date functions with examples. Cloudera…

Different Hive Join Types and Examples

A join is a clause used to combine specific fields from two or more tables based on common columns; in other words, it combines rows from multiple tables. Joins in Hive are similar to SQL joins. In this article, we will learn about the different Hive join types with examples.

Read: Hadoop Hive Bucket Concept and Bucketing Examples; Hive Create Table Command and Examples; Hive Create View Syntax and Examples

Below are the tables that we will be using to demonstrate different join types in Hive:…

Hadoop Streaming Map Reduce using Python

In this article, we will check how to work with Hadoop Streaming Map Reduce using Python.

Hadoop Streaming

First, let us look at Hadoop Streaming. Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. Any language that supports standard input and output, for example Python or C#, can be used to write a Hadoop Map-Reduce job.

Read: Hadoop HDFS Schema Design for…
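To make the mapper/reducer idea concrete, here is a minimal word-count sketch. It is an illustrative example rather than the code from the full article: the function names `mapper` and `reducer` and the sample input are assumptions. In a real streaming job each function would live in its own script and read lines from standard input; here the map, sort, and reduce phases are simulated locally with an in-memory list.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per word; records must already be sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of map -> sort -> reduce on hypothetical sample input.
sample = ["hello hadoop", "hello python"]
result = list(reducer(sorted(mapper(sample))))
print(result)
```

The `sorted` call stands in for the sort that happens between the map and reduce phases, which is why the reducer can rely on identical keys arriving next to each other.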

Migrating Netezza to Impala SQL Best Practices

Nowadays, many teams want to migrate their analytics, including real-time or near-real-time workloads, to a Hadoop environment. In this post, I will explain some best practices for migrating Netezza to Impala SQL. Impala uses standard SQL, but you might still need to modify the source SQL when bringing a specific application to Hadoop Impala because of variations in data types, built-in functions and, of course, Hadoop-specific syntax. Even if the SQL works correctly in Impala, you might consider rewriting it to improve performance.

Read: Netezza Hadoop Connector…

Hadoop Single Node Cluster Setup on Ubuntu

In this tutorial, I will explain how to set up a Hadoop single node cluster on Ubuntu 14.04. The single node cluster will sit on top of the Hadoop Distributed File System (HDFS).

Hadoop single node cluster setup on Ubuntu 14.04

Hadoop is a Java framework for running applications on large clusters made up of commodity hardware. The Hadoop framework allows us to run MapReduce programs on data stored in the highly fault-tolerant Hadoop Distributed File System.

Related Readings: How to Learn Apache Hadoop. Also: 7 Best Books to Learn Bigdata Hadoop

The main…
