Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python


Apache Spark supports both local and remote metastores. You can connect to a remote HiveServer2 using the Apache Spark JDBC drivers. The Hive JDBC driver for Spark2 is available in the jars folder located in the Spark installation directory. In this post, we will check the steps to connect to HiveServer2 using the Apache Spark JDBC driver and Python.

Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python

There are various methods that you can use to connect to HiveServer2. Using the Spark JDBC driver is one of the easiest.

The Hive JDBC driver is one of the most widely used drivers for connecting to HiveServer2. The Hive Spark2 JDBC driver uses the Thrift server, so you should start the Thrift server before attempting to connect to a remote HiveServer2. Apache Spark comes with the Hive JDBC driver for Spark2, and the JDBC driver jars come with the standard installation. You can use the Hive Spark2 JDBC jar files along with the open source Python module JayDeBeApi to connect to a remote HiveServer2 from Python. There are other options, such as PySpark, that you can also use to connect to HiveServer2.

Note that the example in this post uses JayDeBeApi for Python 2. If you are using Python 3, you should install JayDeBeApi3.

Install Jaydebeapi

The JayDeBeApi module allows you to connect to any database that supports a JDBC driver. It provides a Python DB-API v2.0 interface to that database.

You can install it using pip:

pip install JayDeBeApi

For Python 3, use the below pip command:

pip install JayDeBeApi3
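
You can quickly verify the installation with a simple import check. This assumes the module is importable as jaydebeapi (the import name used later in this post; I am assuming the JayDeBeApi3 fork keeps the same import name):

python -c "import jaydebeapi"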

Set CLASSPATH to Driver Location

The Hive Spark2 JDBC driver depends on many other Hadoop jars. You can either download them or simply add the hadoop-client and spark2-client paths to the CLASSPATH shell environment variable.

export CLASSPATH=$CLASSPATH:$(hadoop classpath):/usr/hdp/current/hadoop-client/*:/usr/hdp/current/spark2-client/jars/*:/usr/hdp/current/hadoop-client/client/*

Execute the above command on your Linux edge node where the Kerberos ticket has been created. If you are trying to execute from Windows, then you might want to set user-specific environment variables.

Note that, as the jaydebeapi module depends on many other Hadoop-specific jar files, it will not work if you don't have all the required jar files.

You can read about how to set the CLASSPATH variable in another of my posts: Set and Use Environment Variable inside Python Script.
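
You can also set CLASSPATH from within the Python script itself. Below is a minimal sketch, assuming the HDP-style jar locations from the export command above (adjust the paths for your cluster); JayDeBeApi picks up the CLASSPATH environment variable when it starts the JVM, so set it before the first connect() call:

import glob
import os

# Gather the Spark2 and Hadoop client jars (HDP-style paths; adjust for your cluster)
jars = (glob.glob('/usr/hdp/current/spark2-client/jars/*.jar')
        + glob.glob('/usr/hdp/current/hadoop-client/client/*.jar'))

# Append the jars to CLASSPATH before the JVM is started by the first connect() call
os.environ['CLASSPATH'] = os.environ.get('CLASSPATH', '') + ':' + ':'.join(jars)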

Create Kerberos Ticket

Before connecting to the Hive server, you must create a Kerberos ticket. You can use the kinit command along with a keytab file to create the ticket. Get your local admin's help if you are unable to find the keytab file and create the Kerberos ticket.
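
For example, a command of the following form creates the ticket from a keytab (the keytab path and principal below are placeholders; substitute the values for your cluster):

kinit -kt /etc/security/keytabs/hive.service.keytab hive/example.domain.com@DOMAIN.COM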

Connect to HiveServer2 using Hive JDBC Driver for Apache Spark2

Now you are all set to connect to HiveServer2. Use the klist command to check if a Kerberos ticket is available:
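
klist

If no valid ticket is listed, re-run the kinit command shown above.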

Below is the code that you can use to connect to HiveServer2 from Python using the Hive JDBC driver:

import jaydebeapi

database = 'testtdb'
driver = 'org.apache.hive.jdbc.HiveDriver'
server = '192.168.200.100'
principal = 'hive/example.domain.com@DOMAIN.COM'
port = 10000

# JDBC connection string
url = ("jdbc:hive2://" + server + ":" + str(port) + "/" + database + ";principal=" + principal + ";")

# Connect to HiveServer2
conn = jaydebeapi.connect(driver, url)
cursor = conn.cursor()

# Execute SQL query
sql = "select * from item limit 10"
cursor.execute(sql)
results = cursor.fetchall()
print(results)
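
Once you are done, it is good practice to release the resources; a short follow-up to the example above:

# Close the cursor and connection after fetching the results
cursor.close()
conn.close()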

Hope this helps, let me know how it goes 🙂