Methods to Access Impala Tables from Python

  • Post last modified:June 4, 2019
  • Post category:BigData

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in Apache Hadoop. Impala provides fast, interactive SQL queries directly on your Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface as Apache Hive. In this article, we will look at different methods to access Impala tables from a Python program or script. These methods will help you connect to Impala and retrieve the data you need for your analysis.

Methods to Access Impala Tables from Python

The following methods are commonly used to connect to Impala from a Python program:

  • Execute impala-shell command from Python.
  • Connect to Impala using Impyla.
  • Connect to Impala using JDBC Driver.

Now, let us look at these methods in detail.

Execute impala-shell command from Python

Cloudera Impala has its own shell, impala-shell. You can run it from an edge node and point it at a specific Impala daemon by passing the impalad host and port with the -i option. The query can be executed on any Impala daemon in the cluster.

Below is an example Python script:

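# Requires Python 3.7+ (the Python 2 "commands" module no longer exists).
import subprocess

query = "select id from my_table"

impalad = "192.168.154.128"
port = "21000"
user = "cloudera"
database = "default"

# Build the command as an argument list so the query needs no shell quoting.
# -B (short form of --delimited) prints plain delimited text, not a formatted table.
command = ["impala-shell", "-i", impalad + ":" + port, "-u", user,
           "-d", database, "-B", "-q", query]
print(" ".join(command))

# Run impala-shell and capture its standard output and standard error.
result = subprocess.run(command, capture_output=True, text=True)
if result.returncode == 0:
    print(result.stdout)
else:
    print("Error encountered while executing Impala query.")
    print(result.stderr)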
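
If you need the result as rows rather than one block of text, the delimited output can be split into fields. The following is a minimal sketch, assuming Python 3.7+ and that impala-shell uses its default tab delimiter for -B output. The helper names (build_impala_shell_command, parse_delimited, run_query) are hypothetical and exist only for illustration.

```python
import csv
import io
import subprocess


def build_impala_shell_command(query, host, port="21000", user="cloudera",
                               database="default"):
    """Return the impala-shell argument list for a single query."""
    return ["impala-shell", "-i", "{}:{}".format(host, port),
            "-u", user, "-d", database, "-B", "-q", query]


def parse_delimited(output, delimiter="\t"):
    """Split delimited impala-shell output into a list of rows (lists of strings)."""
    reader = csv.reader(io.StringIO(output), delimiter=delimiter)
    # Skip empty lines so a trailing newline does not produce an empty row.
    return [row for row in reader if row]


def run_query(query, host, **kwargs):
    """Run the query through impala-shell and return the parsed rows.

    Raises subprocess.CalledProcessError if impala-shell exits with an error.
    """
    command = build_impala_shell_command(query, host, **kwargs)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_delimited(result.stdout)
```

Calling run_query("select id from my_table", "192.168.154.128") on a node where impala-shell is installed would return one list per row. All values come back as strings, so convert them to numbers or dates yourself where needed.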

Connect to Impala using Impyla

Impyla is a Python client for HiveServer2 implementations of distributed query engines (e.g., Impala, Hive). You can use it to connect to Impala from a Python script or program.

For more details, refer to the official Impyla documentation.
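
As a rough illustration, a query through Impyla could look like the sketch below. It assumes the impyla package is installed (pip install impyla), that the Impala daemon listens on its usual HiveServer2 port 21050, and that the cluster does not require authentication. The helper names connect_impala and fetch_rows are hypothetical.

```python
def connect_impala(host, port=21050, database="default"):
    """Open a DB-API connection to an Impala daemon using Impyla.

    Assumes an unsecured cluster; pass extra arguments such as
    auth_mechanism to connect() if yours requires authentication.
    """
    # Imported here so the rest of the module loads without impyla installed.
    from impala.dbapi import connect
    return connect(host=host, port=port, database=database)


def fetch_rows(conn, query):
    """Run a query on any DB-API connection and return all rows."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()
```

With a reachable cluster, fetch_rows(connect_impala("192.168.154.128"), "select id from my_table") would return the rows as a list of tuples. Because Impyla follows the Python DB-API, fetch_rows works with any other DB-API connection as well.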

Connect to Impala using JDBC Driver

Cloudera provides a JDBC driver that supports both embedded and remote access to HiveServer2/Impala. Use the Python Jaydebeapi package to connect to Impala from a Python program.

Note that there are two versions of Jaydebeapi available: Jaydebeapi for Python 2 and Jaydebeapi3 for Python 3.

Connecting to Impala works the same way as connecting with the HiveServer2 JDBC driver. Follow the steps given in the post below to use the Hive JDBC driver with a Python program:
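
For orientation, a JDBC connection through Jaydebeapi might be sketched as follows. This is a sketch under several assumptions: the jaydebeapi package and a Java runtime are installed, the Hive JDBC driver class org.apache.hive.jdbc.HiveDriver is used, the cluster is unsecured (hence auth=noSasl in the URL), and you supply the paths to the driver jars yourself. The helper names build_jdbc_url and query_impala_jdbc are hypothetical.

```python
def build_jdbc_url(host, port=21050, database="default"):
    """Build a HiveServer2-style JDBC URL that points at an Impala daemon.

    auth=noSasl assumes a cluster without Kerberos or LDAP authentication.
    """
    return "jdbc:hive2://{}:{}/{};auth=noSasl".format(host, port, database)


def query_impala_jdbc(host, query, jars,
                      driver_class="org.apache.hive.jdbc.HiveDriver"):
    """Run a query over JDBC and return all rows.

    jars is a list of paths to the JDBC driver jar files on your machine.
    """
    # Imported here so the URL helper works without jaydebeapi installed.
    import jaydebeapi

    conn = jaydebeapi.connect(driver_class, build_jdbc_url(host), jars=jars)
    try:
        cursor = conn.cursor()
        try:
            cursor.execute(query)
            return cursor.fetchall()
        finally:
            cursor.close()
    finally:
        conn.close()
```

If you use the Cloudera Impala JDBC driver instead of the Hive one, the driver class name and URL prefix differ, so check the driver's own documentation for the exact values.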

Hope this helps 🙂