Pass Functions to pyspark – Run Python Functions on Spark Cluster

Functions in any programming language are used to handle a particular task and improve the readability of the overall code. By definition, a function is a block of organized, reusable code that is used to perform a single, related action. Functions provide better modularity for your application and a high degree of code reuse. In this article, we will check how to pass functions to the PySpark driver program so that they execute on the cluster.

Pass Functions to pyspark

Spark's API relies heavily on passing functions in the driver program so that they can be executed on the distributed cluster. There are three recommended ways to pass functions to Spark:

  • Lambda expressions
  • Local defs inside the function calling into Spark
  • Top-level functions in a module

Now let us check the above methods with some examples.

Lambda Expressions in pyspark

Lambda expressions in pyspark are simple anonymous functions that can be written as a single expression.

For example, let us say you are trying to replace all the None values in each row of rdd_source with empty strings. In this case, you can pass a lambda that wraps a list comprehension, something like below.

rdd_output = rdd_source.map(lambda row: [r if r is not None else "" for r in row])
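
For context, here is a minimal runnable sketch of the same lambda, assuming a local SparkContext and a small in-memory RDD (the sample data is illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "lambda-example")

# Sample rows containing None values (illustrative data)
rdd_source = sc.parallelize([["a", None, "c"], [None, "e", None]])

# Lambda expression: replace None with an empty string in every row
rdd_output = rdd_source.map(lambda row: [r if r is not None else "" for r in row])

print(rdd_output.collect())  # [['a', '', 'c'], ['', 'e', '']]
sc.stop()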

Local defs inside the function calling into Spark

For longer code, you can define local functions and pass them to Spark RDD transformations or actions.

from pyspark import SparkContext

if __name__ == "__main__":
    def myFunc(s):
        # Count the words in a single line of the input file
        words = s.split(" ")
        return len(words)

    sc = SparkContext(...)
    sc.textFile("file.txt").map(myFunc)
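
Note that map is a lazy transformation; nothing executes until an action such as collect() runs. A minimal end-to-end sketch, assuming a local master and an in-memory RDD in place of file.txt:

from pyspark import SparkContext

if __name__ == "__main__":
    def myFunc(s):
        return len(s.split(" "))

    sc = SparkContext("local[*]", "local-def-example")
    lines = sc.parallelize(["hello world", "pass functions to pyspark"])
    print(lines.map(myFunc).collect())  # [2, 4]
    sc.stop()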

Top-level functions in a module

It is possible to refer to a top-level function in a module. Be aware, however, of what gets serialized when the function you pass is a method of a class instance.

For example, if we create a new MyClass and call doStuff on it, the map inside references the func method of that MyClass instance, so the whole object needs to be sent to the cluster:

class MyClass(object):
    def func(self, s):
        return s

    def doStuff(self, rdd):
        # self.func is a bound method, so pickling it pulls in the
        # entire MyClass instance and ships it to the executors
        return rdd.map(self.func)
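
If shipping the whole object is a concern, the usual fix (following the Spark programming guide) is to reference a top-level function instead of a bound method. A sketch, where identity is a hypothetical module-level function:

# Top-level function in a module; only the function itself is
# serialized and shipped to executors, not any enclosing object.
def identity(s):
    return s

class MyClass(object):
    def doStuff(self, rdd):
        # Referencing the module-level function avoids pickling MyClass
        return rdd.map(identity)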

If your function performs a simple task, such as splitting words or converting text to upper or lower case, it is better to use a lambda expression, as it keeps the code simpler and avoids the boilerplate of a standalone function.

Hope this helps 🙂