An Introduction to Cloudera Hadoop Impala Architecture

Post author:Vithal S
Post last modified:February 28, 2018
Post category:BigData

The Cloudera Hadoop Impala architecture is very different from that of other database engines on HDFS, such as Hive. The Impala server is a distributed, massively parallel processing (MPP) database engine.

Figure: Cloudera Hadoop Impala architecture

The architecture is similar to that of other distributed databases such as Netezza and Greenplum. Hadoop Impala consists of several daemon processes that run on specific hosts within your CDH cluster.

Cloudera Hadoop Impala Architecture Overview

Hadoop Impala consists of three components: the Impala Daemon, the Impala Statestore, and the Impala Catalog Service.

The Impala Daemon

The Impala Daemon is the core component of Hadoop Impala. It runs on every node in the CDH cluster and is identified by the impalad process. It reads and writes data files, and it accepts queries submitted from the impala-shell command, ODBC, JDBC, or Hue.

You can connect and submit a query to the Impala Daemon running on any DataNode. The daemon that accepts the query acts as the coordinator for it: it parallelizes the query, distributes the workload across the Hadoop cluster, and collects the results back from all nodes.
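The coordinator role described above follows a scatter/gather pattern. The following is a minimal Python sketch of that pattern only, not Impala's actual implementation: the node names, the NODE_DATA table, and the run_fragment and coordinate functions are all hypothetical, and threads stand in for remote daemons.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds a local slice of one column of a table.
NODE_DATA = {
    "node1": [10, 20, 30],
    "node2": [5, 15],
    "node3": [40],
}


def run_fragment(node):
    """Work done on one node: a partial SUM and COUNT over its local rows."""
    rows = NODE_DATA[node]
    return sum(rows), len(rows)


def coordinate(nodes):
    """Coordinator role: send a fragment to every node, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(run_fragment, nodes))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count


total, count = coordinate(sorted(NODE_DATA))
print(f"SUM={total} COUNT={count} AVG={total / count}")  # SUM=120 COUNT=6 AVG=20.0
```

Each node computes only a partial aggregate over its own data, so the coordinator has to merge a handful of small results rather than pull every row across the network.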

Each Impala Daemon communicates continuously with the statestore to confirm which nodes are healthy and can accept new work. Each daemon also receives a broadcast message whenever any Impala node in the cluster creates, alters, or drops an object, or processes a statement such as INSERT or LOAD DATA.

The Impala Statestore

This component checks the health of all Impala Daemons on all the DataNodes in the Hadoop cluster. It is physically represented by the statestored process. Only one such process is needed, on one host in the Hadoop cluster.

If an Impala Daemon goes down, the statestore informs all the other Impala Daemons so that they can avoid the failed node when distributing future queries.
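To make the health-tracking idea concrete, here is a toy Python model of a membership tracker that marks a node as failed when it has not been heard from within a timeout, then pushes the healthy list to every subscriber. It is a sketch under stated assumptions: ToyStatestore, ToyDaemon, the timeout value, and the direction of the heartbeat calls are all invented for illustration and do not reflect how statestored actually exchanges messages.

```python
class ToyDaemon:
    """Stands in for an impalad that keeps a local list of healthy nodes."""

    def __init__(self, name):
        self.name = name
        self.known_healthy = []

    def on_update(self, healthy_nodes):
        # Called by the toy statestore whenever membership is re-evaluated.
        self.known_healthy = list(healthy_nodes)


class ToyStatestore:
    """Tracks the last time each node was heard from and broadcasts the healthy set."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}    # node name -> time of last heartbeat
        self.subscribers = []  # callbacks that receive membership updates

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def check(self, now):
        healthy = sorted(
            n for n, t in self.last_seen.items() if now - t <= self.timeout_s
        )
        for callback in self.subscribers:
            callback(healthy)
        return healthy


daemons = {name: ToyDaemon(name) for name in ("node1", "node2", "node3")}
store = ToyStatestore(timeout_s=10)
for d in daemons.values():
    store.subscribe(d.on_update)
    store.heartbeat(d.name, now=0)

# node3 stops reporting; the other two report again at t=15.
store.heartbeat("node1", now=15)
store.heartbeat("node2", now=15)
store.check(now=20)
print(daemons["node1"].known_healthy)  # ['node1', 'node2']
```

Because every daemon receives the same list, any of them can act as coordinator and skip node3 when planning the next query.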

The Impala Catalog Service

This component relays metadata changes from Impala SQL statements to all the DataNodes in the Hadoop cluster. It is physically represented by the catalogd daemon process. Only one such process is needed, on one host in the Hadoop cluster. Usually, the statestored and catalogd processes run on the same host, because catalog updates are passed through statestored.

The catalog service removes the need to issue REFRESH and INVALIDATE METADATA statements when metadata changes are made by statements issued through Impala.
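The difference between a change made through Impala and one made outside it can be illustrated with a small Python model. This is only a sketch: ToyCatalog, ToyDaemon, the metastore dictionary, and the "sales" table are hypothetical, and the refresh method merely mimics the effect of a REFRESH statement rather than its real mechanics.

```python
class ToyDaemon:
    """Stands in for an impalad with a local metadata cache."""

    def __init__(self):
        self.cache = {}  # table name -> list of column names


class ToyCatalog:
    """Stands in for catalogd: writes metadata and pushes it to every daemon."""

    def __init__(self, metastore, daemons):
        self.metastore = metastore  # shared source of truth (a plain dict here)
        self.daemons = daemons

    def _broadcast(self, table):
        for d in self.daemons:
            d.cache[table] = list(self.metastore[table])

    def apply_ddl(self, table, columns):
        """Change issued through Impala: stored and broadcast in one step."""
        self.metastore[table] = list(columns)
        self._broadcast(table)

    def refresh(self, table):
        """Mimics REFRESH: reload the table from the metastore and rebroadcast."""
        self._broadcast(table)


metastore = {}
daemons = [ToyDaemon(), ToyDaemon()]
catalog = ToyCatalog(metastore, daemons)

# Change made through Impala: every daemon sees it immediately.
catalog.apply_ddl("sales", ["id", "amount"])
via_impala = [d.cache["sales"] for d in daemons]

# Change made outside Impala: the daemons' caches are stale until a refresh.
metastore["sales"] = ["id", "amount", "region"]
stale = daemons[0].cache["sales"]
catalog.refresh("sales")
fresh = daemons[0].cache["sales"]

print(stale)  # ['id', 'amount']
print(fresh)  # ['id', 'amount', 'region']
```

In this model, only the change that bypasses the catalog leaves the daemons out of date, which is the case where a manual REFRESH or INVALIDATE METADATA is still needed.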

Tags: Hadoop, Hadoop Architecture, Hadoop Impala, Impala SQL