Nowadays, with growing data sizes, Apache Spark is gaining importance. It is an open-source, general-purpose, lightning-fast distributed computing framework. Apache Spark can be up to 100 times faster than Hadoop MapReduce for in-memory workloads. Considering its speed, you can use Apache Spark to access the Hive metastore and process the required data. In this post, we will check methods to access Hive tables from Apache Spark.
Why Apache Spark?
As mentioned earlier, Apache Spark can be up to 100 times faster than Hadoop MapReduce for in-memory processing and more than 10 times faster when accessing data from disk. Spark is written in Scala and provides rich APIs for Java, Python, R, etc.
Apache Hive can perform only batch processing; for near real-time processing you need a much faster framework like Apache Spark. Spark can process data in batch as well as in real-time. You can use the Spark framework to perform complex processing on huge amounts of data.
There is a big demand for a powerful engine like Apache Spark because it can process data in real-time as well as in batch mode. The Spark framework uses in-memory processing and can respond to your queries in sub-second time.
Apache Spark is a powerful open-source engine that provides real-time stream processing, interactive processing, graph processing, in-memory processing, and batch processing with very high speed and ease of use.
Methods to Access Hive Tables from Apache Spark
There are various methods that you can follow to connect to the Hive metastore or access Hive tables from the Apache Spark processing framework.
Below are some of the commonly used methods to access Hive tables from Apache Spark:
- Access Hive Tables using Apache Spark Beeline
- Accessing Hive Tables using Apache Spark JDBC Driver
- Execute PySpark Script from Python and Examples (see the short sketch after this list)
Let us check these methods in detail. The first two methods are covered in their own sections below.
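For the third method, here is a minimal PySpark sketch. It assumes your Spark build has Hive support, the script can reach your Hive metastore, and that a hypothetical table default.test_table exists:

from pyspark.sql import SparkSession

# Create a Spark session with Hive support so Spark SQL can use the Hive metastore.
spark = (SparkSession.builder
         .appName("HiveAccessExample")
         .enableHiveSupport()
         .getOrCreate())

# Query a regular Hive table with Spark SQL and print the first rows.
spark.sql("SELECT * FROM default.test_table LIMIT 10").show()

You would typically run such a script with spark-submit so that it picks up your cluster's Hive configuration.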
Access Hive Tables using Apache Spark Beeline
There is a beeline application that comes with the Apache Spark installation. You can find it in the /bin directory of your Spark installation.
$ ls -ltr /usr/hdp/current/spark2-client/bin/beeline
-rwxr-xr-x. 1 root root 1119 Aug 26 2016 /usr/hdp/current/spark2-client/bin/beeline
$
You just have to execute beeline and connect using the JDBC driver URL.
$ /usr/hdp/current/spark2-client/bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://192.168.100.103:10000/default;principal=hive/server1.domain.co.in@DOMAIN.CO.IN;auth=Kerberos;
Connecting to jdbc:hive2://192.168.100.103:10000/default;principal=hive/server1.domain.co.in@DOMAIN.CO.IN;auth=Kerberos;
Enter username for jdbc:hive2://192.168.100.103:10000/default;principal=hive/server1.domain.co.in@DOMAIN.CO.IN;auth=Kerberos;: impadmin
Enter password for jdbc:hive2://192.168.100.103:10000/default;principal=hive/server1.domain.co.in@DOMAIN.CO.IN;auth=Kerberos;: *******
18/11/16 15:08:09 INFO Utils: Supplied authorities: 192.168.100.103:10000
18/11/16 15:08:09 INFO Utils: Resolved authority: 192.168.100.103:10000
Connected to: Apache Hive (version 1.2.1000.2.5.0.0-1245)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://192.168.100.103:10000/default>
You can read more about Beeline command options in my other posts (see the related articles below).
Now you can query the regular Hive databases and tables.
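For example, you can list the available databases and run a simple query right at the Beeline prompt (test_table here is a hypothetical table used for illustration):

0: jdbc:hive2://192.168.100.103:10000/default> show databases;
0: jdbc:hive2://192.168.100.103:10000/default> select * from default.test_table limit 10;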
Related Articles:
- Steps to Connect to Hive Using Beeline CLI
- HiveServer2 Beeline Command Line Shell Options and Examples
Access Hive Tables using Apache Spark JDBC Driver
Another method is to use the Spark-provided hive-jdbc driver. This method uses the Thrift server to connect to a remote HiveServer2. If you have a requirement to connect to Apache Hive tables from an Apache Spark program, then the Spark-provided JDBC driver can save your day.
You can read my other post about using the Spark2 JDBC driver to connect to a remote HiveServer2.
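As a quick illustration, below is a minimal PySpark sketch of this approach. It assumes HiveServer2 is running at 192.168.100.103:10000 (the address from the Beeline example above), that the Hive JDBC driver jar is on Spark's classpath, and that mydb.test_table is a hypothetical table:

from pyspark.sql import SparkSession

# The Hive JDBC driver jar must be on the Spark classpath,
# e.g. passed via --jars when you submit the application.
spark = SparkSession.builder.appName("HiveJdbcExample").getOrCreate()

# Read a Hive table through HiveServer2 using the Hive JDBC driver.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:hive2://192.168.100.103:10000/default")  # HiveServer2 URL from the example above
      .option("driver", "org.apache.hive.jdbc.HiveDriver")          # Hive JDBC driver class
      .option("dbtable", "mydb.test_table")                         # hypothetical table
      .load())

df.show(10)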
Hope this helps. Let me know if you are using any other method 🙂