How to Create Spark SQL User Defined Functions? Example

A user defined function (UDF) is a function you write to perform a specific task when no built-in function is available for it. In a Hadoop environment, you can write user defined functions using Java, Python, R, etc. In this article, we will check how to create Spark SQL user defined functions, with a Python user defined function as the example.

Spark SQL User-defined Functions

When you migrate a relational database warehouse to Hive and use Spark as the execution engine, you may miss some of the built-in functions you relied on; user defined functions fill those gaps. For example, the T-SQL ISNUMERIC function is not available in Hive or Spark SQL, but you can write your own UDF in Java or Python to check whether a string value is numeric.

The best part about Spark is that it is flexible: it also provides options to register Hive UDF jars.
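
For instance, if you already have a Hive UDF packaged in a jar, you can register it straight from Spark SQL. A minimal sketch, assuming a hypothetical class name and jar path:

sqlContext.sql("""
    CREATE TEMPORARY FUNCTION my_hive_udf
    AS 'com.example.udf.MyHiveUDF'
    USING JAR 'hdfs:///udfs/my-hive-udf.jar'
""")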

Steps to Create User Defined Functions in Spark

Follow the steps below to create a user defined function in Spark. We will use PySpark to demonstrate Spark UDFs.

As an example, we will create a function that checks whether a string value is numeric.

Create Python UDF on PySpark Terminal

The first step is to create, on the PySpark terminal, the Python user defined function that you want to register in Spark.

For example, consider the user defined function below.

def numeric_check(s):
    # Return True if the value can be parsed as a number, False otherwise.
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        # ValueError covers non-numeric strings; TypeError covers
        # None inputs (SQL NULLs passed to the UDF).
        return False

The function tries to convert the given string value to float. If the conversion succeeds, it returns True; if the value raises an error (a ValueError for non-numeric strings, or a TypeError for None), it returns False.
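
You can sanity-check the function on the PySpark terminal before registering it (expected results shown in the comments):

print(numeric_check("23.332"))     # True
print(numeric_check("23h12.332"))  # False
print(numeric_check(None))         # False, since TypeError is also caught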

Import Spark Data Type

The second step is to import the Spark data type. The type should match the one returned by the function created above; the function will return null if there is a type mismatch.

In our example, the function returns a Boolean value.

from pyspark.sql.types import BooleanType

Register numeric_check Function into Spark

The final step is to register the Python function in Spark. Use the sqlContext.udf.register API (or spark.udf.register on the SparkSession in Spark 2.0 and later) to register the user defined function.

sqlContext.udf.register("is_numeric", numeric_check, BooleanType())
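
If you work with DataFrames rather than SQL, the same Python function can be wrapped with the pyspark.sql.functions.udf helper. A minimal sketch, assuming a DataFrame df with a string column named value:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

is_numeric_udf = udf(numeric_check, BooleanType())

# Adds a Boolean column indicating whether each value is numeric
df.withColumn("value_is_numeric", is_numeric_udf(df["value"])).show()
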
Test Spark SQL User Defined Function

Now, run a Spark SQL command to test the UDF.

sqlContext.sql("select is_numeric('23.332')").show()
+----+
| _c0|
+----+
|true|
+----+

sqlContext.sql("select is_numeric('23h12.332')").show()
+-----+
|  _c0|
+-----+
|false|
+-----+
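
Once registered, the UDF can be used anywhere a built-in function can; for example, to filter out non-numeric rows from a temporary table (the table and column names below are hypothetical):

sqlContext.sql("select col1 from sample_table where is_numeric(col1)").show()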

Hope this helps 🙂