Hadoop provides an API so that you can write user-defined functions (UDFs) in your favorite programming language. In this article, we will check how to create a custom function for Hive using Python, that is, how to create a Hive UDF using Python.
What is Hive?
Hive is a data warehouse ecosystem built on top of Hadoop HDFS to perform batch and ad-hoc query execution on large datasets. Apache Hive can handle petabytes of data. Hive is designed for OLAP; it is not suited for OLTP applications.
What are UDFs in Hive?
In Hive, users can define their own functions or methods to meet certain requirements. These are known as UDFs in Hive. User-defined functions can be written in Java, Python or Scala, depending upon the programming language you know. In this article, we will check how to create a Hive UDF using Python.
Usually, a UDF jar is created in Java. You can use that jar to register the UDF in either Hive or Spark.
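For comparison, below is a minimal sketch of registering a Java UDF jar in Hive; the jar path, function name, class name and table used here are hypothetical and only illustrate the registration steps.

-- hypothetical jar path, class name and table
ADD JAR /path/to/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_initcap AS 'com.example.hive.udf.InitCap';
SELECT my_initcap(name) FROM employees;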
Hive UDF using Python
As Hadoop and Hive support many programming APIs, you can create user-defined functions in any of the well-known programming languages and plug them into your Hive query using the built-in TRANSFORM clause that Hive provides.
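The general shape of the TRANSFORM clause is sketched below; the script, table and column names are placeholders, not part of the worked example that follows.

SELECT TRANSFORM (input_col1, input_col2)
USING 'python your_script.py'
AS (output_col1, output_col2)
FROM your_table;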
Hive UDF using Python Example
You can follow the steps below to create a Hive UDF using Python.
Step 1: Create Python Custom UDF Script
The below Python program accepts a string from standard input and performs the INITCAP task, i.e. capitalizes the first letter of each word. This Python script will be our UDF, which we will call from Hive.
import sys

# Hive streams each input row to this script on standard input,
# with columns separated by tabs.
for line in sys.stdin:
    line = line.strip('\n\r')
    fname, lname = line.split('\t')
    # Capitalize the first letter of each name (INITCAP behavior)
    firstname = fname.title()
    lastname = lname.title()
    # Write the result back to Hive as a tab-separated row
    print('\t'.join([firstname, lastname]))
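Before wiring the script into Hive, you can sanity-check it from the shell by piping a tab-separated pair into it, for example echo -e 'john\tdoe' | python initCap.py, which should print John and Doe separated by a tab. The file name initCap.py here matches the path used in the next steps.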
Step 2: Add Python File into Hive
Using the ADD FILE command available with the Hive command line interface (CLI), you can add the Python script to Hive's list of resources. Once this is done, we need not alias the script, as it can be referred to directly by its name.
0: jdbc:hive2://> add FILE /home/cloudera/python_scripts/initCap.py;
Note that you have to copy the Python script to the machine where your HiveServer2 is set up. This is applicable only when you are working on an edge node. Our code has been tested on the Cloudera VM.
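To confirm that the script was registered, you can list the resources added to the current session using Hive's LIST FILES command:

0: jdbc:hive2://> list FILES;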
Step 3: Use the Hive TRANSFORM…AS Construct to use the UDF
The Hive TRANSFORM clause lets you plug your own mappers and/or reducers into query processing. This clause helps you use any external UDF in Hive.
Below is an example of calling the Python UDF using the TRANSFORM clause.
SELECT TRANSFORM (col1)
USING 'python /home/cloudera/python_scripts/initCap.py' AS (firstname, lastname)
FROM (
    SELECT 'vithal\tsamp' AS col1
    UNION ALL
    SELECT 'Ram\tkumar' AS col1
    UNION ALL
    SELECT 'sam\tKumar' AS col1) AS a;
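Since the script title-cases both fields, the query should return the following rows (in some order), assuming the two output columns are declared as above:

Vithal	Samp
Ram	Kumar
Sam	Kumar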
Hope this helps 🙂