Execute Pyspark Script from Python and Examples


As Apache Spark gains popularity, most organizations are trying to integrate their existing big data ecosystems with Spark so that they can take advantage of its speed and distributed computation power. In my earlier post, I discussed various Methods to Access Hive Tables from Apache Spark. In this post we will discuss how to execute a pyspark script from Python, with working examples.


Python Pyspark

Python is a widely used programming language and is easy to learn. You can access Apache Spark from Python with the pyspark shell. As you may already know, Apache Spark is a fast, general-purpose distributed engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

There are various ways to access Spark from within a Python program, such as JDBC, Spark beeline, etc. Pyspark provides easy methods to create RDDs, DataFrames and so on, as shown in the sketch below. Pyspark isn't as fast as Scala, but it serves the purpose.
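For example, here is a minimal sketch (assuming Spark is installed and the pyspark package is importable; the app name and sample data are made up for illustration) that creates an RDD and a DataFrame:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Run against the local master; "pyspark-basics" is just an example app name
sc = SparkContext("local[*]", "pyspark-basics")
sqlContext = SQLContext(sc)

# Create an RDD from a Python list and apply a simple transformation
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).collect())   # [1, 4, 9, 16, 25]

# Create a DataFrame from a list of tuples with column names
df = sqlContext.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
df.show()

sc.stop()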


Execute Pyspark Script from Python

If you are familiar with Python Pandas DataFrames, NumPy, etc., then pyspark isn't that hard to learn. You should understand the basics of Spark RDDs, DataFrames, etc. Refer to the official Apache Spark documentation for more information.

Now, coming back to our main topic, here is the question:

Can you execute pyspark scripts from Python?

Yes, you can use spark-submit to execute a pyspark application or script. The spark-submit script in Spark's installation bin directory is used to launch applications on a cluster.

Launching Applications with spark-submit

Create your pyspark application and bundle it as a script, preferably with a .py extension. Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes that Spark offers:

Here is the syntax of spark-submit:

./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

Some of the commonly used options are:

--class: The entry point for your application (not required for pyspark scripts; your .py file takes the place of <application-jar> above)
--master: The master URL for the cluster
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format
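For instance, a pyspark script could be submitted as shown below. This is only an illustrative sketch: the YARN master, client deploy mode and executor memory setting are placeholder values that you would adjust for your own cluster.

./bin/spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.executor.memory=2g \
pyspark-example.py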

Execute Pyspark Script from Python: Examples

Here is an example of executing a pyspark script from Python:

pyspark-example.py

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create a SparkContext and a HiveContext on top of it
sc = SparkContext()
sqlContext = HiveContext(sc)

# Keep Hive ORC tables on the Hive SerDe path instead of Spark's native ORC reader
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")

# Run a simple query and show the result without truncating column values
txt = sqlContext.sql("SELECT 1")
txt.show(2000000, False)

Submit this script using spark-submit:

$ spark-submit pyspark-example.py

After a lot of INFO and WARN messages, you will get the output:

+---+
|_c0|
+---+
|1  |
+---+

Hope this helps. Let me know if you have a better idea. 🙂
