Register Hive UDF jar into pyspark – Steps and Examples

  • Post last modified:July 3, 2019
  • Post category:Apache Spark

Apache Spark is one of the most widely used processing engines because of its fast, in-memory computation. Many organizations use Hive and Spark together: Hive as a data source and Spark as a processing engine. You can use your favorite programming language to interact with Hadoop, and you can write custom UDFs in Java, Python or Scala. To use those UDFs, you have to register them so that you can call them like normal built-in functions. In this article, we will check how to register a Hive UDF jar into pyspark.

What are UDFs in Hive?

In Hive, users can define their own functions to meet certain client requirements or to perform certain tasks. These are known as user defined functions (UDFs). You can write them in Java, Python or Scala, depending on the programming language you know.

Usually, a UDF written in Java is packaged as a jar. You can use that jar to register the UDF in either Hive or Spark.

Register Hive UDF jar into pyspark

As mentioned earlier, you must register the UDFs you have created in order to use them like normal built-in functions.

There are many methods that you can use to register the UDF jar into pyspark. In this article, we will check registering UDFs using the spark-submit command.

  • Add jar to spark-submit during execution

Add jar to Spark-Submit During Execution

This is one of the preferred methods to use a jar file in pyspark or Spark. Just use the --jars parameter. Spark will share those jars with the executors at run time and expose the Java classes they contain. You can then use the Java class to register the user defined function in Spark.

Below is the command that you can use to add jar files to spark execution.

$ spark-submit --jars "/home//app/jars/custom_hive_udf.jar" spark_test.py
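If you launch the job from a Python wrapper instead of typing the command by hand, the same call can be assembled programmatically. The sketch below is only an illustration: the helper name build_submit_command and the /path/to/... jar paths are made up for this example. It relies on --jars taking a single comma-separated list when more than one jar is needed.

```python
import shlex


def build_submit_command(script, jars):
    """Return the spark-submit argument list for a script and its UDF jars."""
    if not jars:
        raise ValueError("at least one jar is required")
    # --jars expects one comma-separated value, not one flag per jar.
    return ["spark-submit", "--jars", ",".join(jars), script]


if __name__ == "__main__":
    # Hypothetical paths, used only for illustration.
    cmd = build_submit_command(
        "spark_test.py",
        ["/path/to/custom_hive_udf.jar", "/path/to/another_udf.jar"],
    )
    # Print the shell-quoted command. To actually launch it, pass the list
    # to subprocess.run(cmd, check=True) on a machine with Spark installed.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a single string avoids shell quoting problems if a path contains spaces.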

Inside your pyspark program, use Spark SQL to execute the CREATE FUNCTION command, referring to the Java UDF class.

For example,

sqlContext.sql("CREATE TEMPORARY FUNCTION nullif AS 'com.example.udf.UDFnullif'")
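To put the pieces together, here is a minimal sketch of what spark_test.py might contain. It rests on several assumptions: a Spark 2.x or later installation where SparkSession replaces sqlContext, Hive support enabled on the session, and a UDF that takes two arguments like SQL NULLIF (the article does not show its signature). The helper names build_session and register_hive_udf are hypothetical, and the function is registered as my_nullif rather than nullif because newer Spark versions appear to ship a built-in nullif that the name could clash with.

```python
def build_session(app_name="hive-udf-example"):
    """Create a SparkSession with Hive support (assumed necessary for Hive UDF classes)."""
    # Imported lazily so the helper below can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )


def register_hive_udf(spark, name, class_name):
    """Register a Java UDF class as a temporary SQL function and return the statement used."""
    statement = "CREATE TEMPORARY FUNCTION {} AS '{}'".format(name, class_name)
    # The class must already be on the classpath, for example via spark-submit --jars.
    spark.sql(statement)
    return statement


# Example use at the bottom of spark_test.py (needs pyspark and the jar):
#   spark = build_session()
#   register_hive_udf(spark, "my_nullif", "com.example.udf.UDFnullif")
#   spark.sql("SELECT my_nullif(1, 1) AS result").show()
#   spark.stop()
```

The function is temporary, so it exists only for the lifetime of the session and has to be registered again in every job that uses it.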


Hope this helps 🙂