Methods to Access Hive Tables from Apache Spark

Nowadays, with growing data sizes, Apache Spark is gaining importance. It is an open-source, general-purpose, lightning-fast distributed computing framework. Apache Spark can run up to 100 times faster than Hadoop MapReduce for in-memory workloads. Considering its speed, you can use Apache Spark to access the Hive metastore and process the required data. In this post, we will check methods to access Hive tables from Apache Spark.

Why Apache Spark?

As mentioned earlier, Apache Spark can be up to 100 times faster than Hadoop MapReduce when data fits in memory, and up to 10 times faster when processing data from disk. Spark is written in Scala and provides rich APIs for Java, Python, R, etc.

Apache Hive can perform only batch processing; for near real-time processing you need a much faster framework such as Apache Spark. Spark can process data in batch as well as in real time, and you can use it to run heavy processing on huge amounts of data.

There is a big demand for a powerful engine like Apache Spark because it can process data in real time as well as in batch mode. The Spark framework uses in-memory processing and can respond to your queries in sub-second time.

Apache Spark is a powerful open-source engine that provides real-time stream processing, interactive processing, graph processing, and in-memory batch processing, all with high speed and ease of use.

Methods to Access Hive Tables from Apache Spark

There are various methods that you can follow to connect to the Hive metastore or access Hive tables from the Apache Spark processing framework.

Below are the commonly used methods to access Hive tables from Apache Spark that we will cover in this post:

  • Access Hive tables using the Spark beeline client
  • Access Hive tables using the Spark-provided Hive JDBC driver

Let us check these methods in detail.

Access Hive Tables using Apache Spark Beeline

A beeline application comes with the Apache Spark installation. You can find it in the /bin directory of the Spark installation:

$ ls -ltr /usr/hdp/current/spark2-client/bin/beeline
-rwxr-xr-x. 1 root root 1119 Aug 26 2016 /usr/hdp/current/spark2-client/bin/beeline
$

You just have to execute beeline and connect using the HiveServer2 JDBC URL.

$ /usr/hdp/current/spark2-client/bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://192.168.100.103:10000/default;principal=hive/server1.domain.co.in@DOMAIN.CO.IN;auth=Kerberos;
Connecting to jdbc:hive2://192.168.100.103:10000/default;principal=hive/server1.domain.co.in@DOMAIN.CO.IN;auth=Kerberos;
Enter username for jdbc:hive2://192.168.100.103:10000/default;principal=hive/server1.domain.co.in@DOMAIN.CO.IN;auth=Kerberos;: impadmin
Enter password for jdbc:hive2://192.168.100.103:10000/default;principal=hive/server1.domain.co.in@DOMAIN.CO.IN;auth=Kerberos;: *******
18/11/16 15:08:09 INFO Utils: Supplied authorities: 192.168.100.103:10000
18/11/16 15:08:09 INFO Utils: Resolved authority: 192.168.100.103:10000
Connected to: Apache Hive (version 1.2.1000.2.5.0.0-1245)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://192.168.100.103:10000/default>

You can read more about beeline command options in my other post.

Now you can query the regular Hive databases and tables.
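
For example, a quick sanity check from the beeline prompt might look like the session below. The table name test_table is hypothetical; substitute one of your own Hive tables.

0: jdbc:hive2://192.168.100.103:10000/default> SHOW DATABASES;
0: jdbc:hive2://192.168.100.103:10000/default> USE default;
0: jdbc:hive2://192.168.100.103:10000/default> SELECT * FROM test_table LIMIT 10;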

Access Hive Tables using Apache Spark JDBC Driver

Another method is to use the hive-jdbc driver that ships with Spark. This method connects to the remote HiveServer2 through its Thrift service. If you have a requirement to connect to Apache Hive tables from an Apache Spark program, the bundled JDBC driver can save your day.

You can read my other post about using the Spark2 JDBC driver to connect to a remote HiveServer2.
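
As a rough illustration, below is a minimal Scala sketch of this approach using the standard java.sql JDBC API. It assumes the hive-jdbc driver jar is on the classpath and reuses the connection details from the beeline example above; the table name test_table and the credentials are hypothetical.

import java.sql.DriverManager

object HiveJdbcExample {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver (bundled with the Spark installation).
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Connect to the remote HiveServer2 Thrift endpoint.
    // For a Kerberized cluster, append ;principal=... to the URL
    // as shown in the beeline example above.
    val url = "jdbc:hive2://192.168.100.103:10000/default"
    val connection = DriverManager.getConnection(url, "impadmin", "password")

    try {
      val statement = connection.createStatement()
      // test_table is a hypothetical Hive table used only for illustration.
      val resultSet = statement.executeQuery("SELECT * FROM test_table LIMIT 10")
      while (resultSet.next()) {
        println(resultSet.getString(1))
      }
    } finally {
      connection.close()
    }
  }
}

When you run such a program, make sure the hive-jdbc jar from the Spark installation is on the classpath, for example by passing it with the --jars option of spark-submit.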

Hope this helps. Let me know if you are using any other method 🙂