Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in Apache Hadoop. Impala provides fast, interactive SQL queries directly on your Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface as Apache Hive. In this article, we will check different methods to access Impala tables from a Python program or script. The methods discussed here will help you connect to Impala tables and get the data required for your analysis.
Methods to Access Impala Tables from Python
The following methods are commonly used to connect to Impala from a Python program:
- Execute impala-shell command from Python.
- Connect to Impala using Impyla.
- Connect to Impala using JDBC Driver.
Now, let us look at these methods in detail.
Execute impala-shell command from Python
Impala has its own shell, impala-shell. You can run any command from the edge node by passing the host and port of an impalad to impala-shell, and you can run a query against any Impala daemon in the cluster.
Below is an example Python script:
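import subprocess

query = "select id from my_table"
impalad = "192.168.154.128"  # host running an impalad daemon
port = "21000"
user = "cloudera"
database = "default"

# Build the command as an argument list so no shell quoting is needed.
# -B prints plain delimited output instead of a pretty-printed table.
command = ["impala-shell", "-i", impalad + ":" + port, "-u", user,
           "-d", database, "-B", "-q", query]
print(" ".join(command))

# The Python 2 "commands" module no longer exists in Python 3; use subprocess.
result = subprocess.run(command, capture_output=True, text=True)
if result.returncode == 0:
    print(result.stdout)
else:
    print("Error encountered while executing Impala query.")
    print(result.stderr)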
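With -B, impala-shell prints one row per line, with columns separated by a tab by default, so the captured output can be turned into Python rows. The following is a minimal sketch under that assumption; `build_command` and `parse_rows` are hypothetical helper names for illustration, not part of impala-shell:

```python
def build_command(query, impalad="192.168.154.128", port="21000",
                  user="cloudera", database="default"):
    """Return the impala-shell argument list for one query (hypothetical helper)."""
    return ["impala-shell", "-i", impalad + ":" + port, "-u", user,
            "-d", database, "-B", "-q", query]


def parse_rows(output, delimiter="\t"):
    """Split delimited impala-shell output into rows, each a list of strings."""
    # Empty lines are skipped; every value stays a string, as printed by the shell.
    return [line.split(delimiter) for line in output.splitlines() if line]
```

Passing the captured standard output of the command to `parse_rows` gives a list of rows that is easier to work with than raw text. All values are strings, so convert numeric columns yourself if needed.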
Connect to Impala using Impyla
Impyla is a Python client for HiveServer2 implementations of distributed query engines (e.g., Impala, Hive). You can use it to connect to Impala from a Python script or program.
For more details, refer to the official Impyla documentation.
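As a rough illustration, here is a minimal sketch of a query through Impyla's DB-API interface (`impala.dbapi.connect`). It assumes an unsecured cluster and port 21050, the default HiveServer2 port on impalad (impala-shell uses 21000). The host, the table name, and the `fetch_rows` helper with its `connect_fn` parameter are illustrative assumptions:

```python
def fetch_rows(query, host="192.168.154.128", port=21050, connect_fn=None):
    """Run a query over the HiveServer2 protocol and return all rows."""
    if connect_fn is None:
        # Imported lazily so this file loads even where impyla is not installed.
        from impala.dbapi import connect as connect_fn
    conn = connect_fn(host=host, port=port)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()  # list of tuples, one per row
    finally:
        conn.close()  # always release the connection, even on errors

# Example call (needs impyla installed and a reachable impalad):
# rows = fetch_rows("select id from my_table")
```

The `connect_fn` parameter exists only so the helper can be exercised without a live cluster; in normal use you would leave it unset.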
Connect to Impala using JDBC Driver
Cloudera provides a JDBC driver that supports both embedded and remote access to HiveServer2/Impala. Use the Python JayDeBeApi package to connect to Impala from a Python program.
Note that there are two versions of the package: JayDeBeApi, originally for Python 2, and JayDeBeApi3 for Python 3. Recent JayDeBeApi releases also run on Python 3.
Connecting to Impala works the same way as connecting through the HiveServer2 JDBC driver, so the steps for using the Hive JDBC driver with a Python program apply here as well.
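A hedged sketch of the JDBC route, using `jaydebeapi.connect` from JayDeBeApi 1.x with the Apache Hive JDBC driver. The jar path, the host, and the `auth=noSasl` setting (for an unsecured cluster) are assumptions, and `jdbc_url` and `query_impala` are hypothetical helpers. The driver class and URL format will differ if you use Cloudera's Impala JDBC driver instead:

```python
HIVE_DRIVER = "org.apache.hive.jdbc.HiveDriver"  # driver class in the Hive JDBC jar


def jdbc_url(host, port=21050, database="default"):
    """Build a hive2 JDBC URL pointing at impalad (no authentication assumed)."""
    return "jdbc:hive2://{}:{}/{};auth=noSasl".format(host, port, database)


def query_impala(query, host="192.168.154.128",
                 jar="/path/to/hive-jdbc-standalone.jar",  # placeholder path
                 connect_fn=None):
    """Run a query over JDBC and return all rows."""
    if connect_fn is None:
        import jaydebeapi  # imported lazily; needs a JVM and the driver jar
        connect_fn = jaydebeapi.connect
    conn = connect_fn(HIVE_DRIVER, jdbc_url(host), jars=jar)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()

# Example call (needs JayDeBeApi, a JVM, and the driver jar):
# rows = query_impala("select id from my_table")
```

Compared with Impyla, this route needs a Java runtime on the client, so it is mainly useful when a JDBC driver is already part of your setup.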
Hope this helps 🙂