Create PySpark SparkContext within Python Program


In my other article, we have seen how to connect to Spark using the JDBC driver and the Jaydebeapi module. Hadoop distributions such as Cloudera's (CDH) do not ship a Spark JDBC driver. You either have to set up your own JDBC endpoint using the Spark Thrift Server, or create a PySpark SparkContext within your Python program to enter the Apache Spark world.

SparkContext or HiveContext

A SparkContext or HiveContext is the entry gate to interact with the Spark engine. When you execute any Spark application, the driver program initiates the context for you. For example, when you start the pyspark shell, the driver program creates a SparkContext as ‘sc’ and a HiveContext as ‘sqlContext’. You can use that ‘sc’ directly in your applications. But if you are trying to interact with Spark from a standalone Python program, you have to create the SparkContext or HiveContext manually to execute your queries.
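
For instance, when you start the pyspark shell, both objects are already there. A sample session is shown below; the exact types and addresses depend on your Spark version, and sqlContext is a HiveContext only on builds with Hive support:

$ pyspark
>>> sc
<pyspark.context.SparkContext object at 0x7f...>
>>> sqlContext
<pyspark.sql.context.HiveContext object at 0x7f...>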

Creating a SparkContext in Python using pyspark is very similar to creating one in Scala. However, you need an additional Python module if you are trying to create a SparkContext in a plain Python script or program. One such module is the findspark module.

findspark Python Module

The findspark module is one of the easiest and most useful modules you can find in the Python world. It provides findspark.init() to make pyspark importable as a regular library in your Python application.
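
findspark is published on PyPI; assuming pip is available on your edge node, you can install it with pip install findspark. Once installed, two lines are enough to make pyspark importable:

import findspark
findspark.init()   # resolves the Spark installation, then exposes pyspark
import pyspark     # now importable as a regular library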

findspark automatically identifies the Spark installation directory if the SPARK_HOME variable is set; otherwise you have to provide the installation directory manually:

findspark.init('/usr/hdp/current/spark2-client')

Alternatively, findspark picks up Hadoop configuration files such as hive-site.xml, core-site.xml, yarn-site.xml, etc. from the SPARK_CLASSPATH variable. With the help of findspark you can easily import pyspark within your Python program.
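
If you are unsure which installation findspark resolved, recent findspark releases also expose findspark.find(), which returns the detected Spark home (whether find() is available is an assumption about your installed version, so check it first):

import findspark
print(findspark.find())   # e.g. /usr/hdp/current/spark2-client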


Now, let us see the complete code to create a SparkContext:

Create PySpark SparkContext within Python Program

Here is an example that creates a PySpark SparkContext and HiveContext within a Python program or script:

# findspark must be imported before pyspark can be found
import findspark
findspark.init('/opt/cloudera/parcels/CDH/lib/spark')

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext

# Configure and create the SparkContext
conf = SparkConf().setAppName("Test").set("spark.driver.memory", "1g")
sc = SparkContext(conf=conf)
sc.addFile("/home/application/spark/conf/spark-defaults.conf")
sc.setLogLevel("ERROR")

sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)

qry = "select 1"
hiveContext.sql("use test")
results = hiveContext.sql(qry)

# showString is an internal (private) DataFrame API; its signature can
# vary across Spark versions
output = results._jdf.showString(100000, True)
print(output)
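
If you would rather stay on the public DataFrame API, a minimal alternative sketch is to collect() the rows and print them yourself. Note that collect() brings every row to the driver, so it is only suitable for small result sets:

# Fetch all rows of the query result to the driver and print them
rows = results.collect()
for row in rows:
    print(row)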

Here are some useful tips for creating a SparkContext and HiveContext:

  • Use the set function to set parameter values on your SparkConf (see the sketch after this list)
  • Add a configuration file to the SparkContext using sc.addFile
  • Suppress INFO and WARN messages using the setLogLevel method
  • Use the results._jdf.showString(100000, True) function to return the query execution output as a string. Note, replace results with your Spark SQL DataFrame. The df.show() function only prints results to the console, so you can use the given method to capture the output of the query.
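
For example, set() returns the SparkConf itself, so multiple parameters can be chained on one object. The property names below are standard Spark configuration keys, shown purely for illustration:

from pyspark import SparkConf

# Each set() returns the SparkConf, so calls can be chained
conf = (SparkConf()
        .setAppName("Test")
        .set("spark.driver.memory", "1g")
        .set("spark.executor.memory", "2g")
        .set("spark.sql.shuffle.partitions", "50"))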