Execute Hive Beeline JDBC String Command from Python

  • Post author:
  • Post last modified:September 21, 2018
  • Post category:BigData
  • Reading time:3 mins read

To perform any analysis, you need to have data in place. To collect data, you may have to connect your application to different data source. In this article, we will discuss on one of such approach to execute Hive Beeline JDBC string command from Python application. This is one of the simple and easy approach to connect to Kerberos HiveServer2 using Beeline shell.

I was working on one of the machine learning project to predict query execution time on Hadoop Hive cluster. We were gathering various features from the HiveQL query plan. To get plan, we had application that connect to Hive using beeline JDBC connection string and fetch required explain plan of that query and store that string into Python variable. We again used that string in pre-processing step.

You can also use the Hive JDBC drivers to connect HiveServer2 from Python using Jaydebeapi.

Before going more into details, let us learn what Hive Beeline is?

What is Hive beeline?

Hive Beeline is a JDBC client based on on the SQLLine CL. This client is used to connect HiveServer2. The Beeline shell works in both embedded mode as well as remote mode. In the embedded mode, it runs an embedded Hive whereas remote mode is for connecting to a separate HiveServer2 process. You can even connect to HiveServer2 which has Kerberos authentication mechanism.

You can read more about beeline command on my other post: Beeline Hive Command Options and Examples.

Execute Hive Beeline JDBC String Command from Python

Now, we know what Hive Beeline is, now let us try to connect that Beeline to Kerberized Hive cluster without any external packages such as Pyhs2, impyla or Pyhive.
Below is the example of the script that you are looking for:

import commands
import re

query = 'explain select * from date_dim limit 1'

host=str('192.168.0.100')
port=str('10000')
authMechanism=str('KERBEROS')
database=str('store')
principal=str('hive/system1.example.co.in@EXAMPLE.CO.IN')

result_string= 'beeline -u "jdbc:hive2://"'+host +'":"'+port+'"/"'+database+'";principal="'+principal+'"" ' \
'--fastConnect=true --showHeader=false --verbose=false --showWarnings=false --silent=true -n user1 -p ' \
'passwd -e "' + query + ';"'

status, output = commands.getstatusoutput(result_string)

if status == 0:
    return output
else:
    print "Error encountered while executing HiveQL queries."

This script uses Python commands module to execute command and get results into Python variables.

You can also used well maintained packages like pyhive to connect to Hive. Read more in Step by Step Guide Connecting HiveServer2 using Python Pyhive

Hope this helps and let me know if you used different method.