Create Spark SQL isdate Function – Date Validation

  • Post last modified:July 30, 2020
  • Post category:Apache Spark
  • Reading time:5 mins read

Many databases, such as SQL Server, support an isdate function. Spark SQL supports many DataFrame methods, and we have already covered Spark SQL date functions in the post “Spark SQL Date and Timestamp Functions”. However, you may have noticed that there is no built-in function to validate date and timestamp values in Spark SQL. Alternatively, you can use Hive date functions to filter out unwanted dates. In this article, we will check how to create a Spark SQL isdate user-defined function (UDF) with an example.

Create Spark SQL isdate Function

The best part about Spark is that it supports a wide range of programming languages such as Java, Scala, Python, and R. You can use any of the supported languages to write a UDF and register it on Spark.

In PySpark, the process of registering a Python function takes a slightly different approach compared to adding a jar file and registering it through Spark SQL.

You can use the following steps to create and register an isdate UDF in Spark. We will use PySpark to demonstrate the registration process.

Create isdate Python Function

First, you need to create an isdate Python function that validates a date string.

Following is the isdate function.

import datetime

def isdate(date_text):
	# Return True if date_text is a valid date in YYYY-MM-DD format.
	try:
		datetime.datetime.strptime(date_text, '%Y-%m-%d')
		return True
	except (ValueError, TypeError):
		# TypeError covers None and other non-string values,
		# which a nullable column can pass to the UDF.
		return False
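As a quick sanity check outside Spark, you can exercise the validator in plain Python. The function is repeated here (also catching TypeError for non-string input) so that the snippet runs on its own:

```python
import datetime

def isdate(date_text):
    # Return True if date_text is a valid date in YYYY-MM-DD format.
    try:
        datetime.datetime.strptime(date_text, '%Y-%m-%d')
        return True
    except (ValueError, TypeError):
        # TypeError covers None and other non-string values.
        return False

print(isdate('2020-01-01'))   # True: well-formed date
print(isdate('2020-13-01'))   # False: month 13 does not exist
print(isdate(None))           # False: non-string input
```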

The function returns True if the date is valid and False otherwise.

Register isdate Function into Spark Context

The next step is to register the Python function created in the previous step with the Spark context so that it is visible to Spark SQL at runtime.

You can make use of the sqlContext.udf.register method available on the Spark SQL context to register it. On Spark 2.x and later, spark.udf.register on the SparkSession is the equivalent method.

sqlContext.udf.register('udf_isdate', isdate)

To explain the above syntax: we are registering the Python isdate function as udf_isdate in the Spark context. Use udf_isdate in subsequent Spark SQL statements. Note that, by default, the registered UDF's return type is string; you can pass an explicit return type (for example, BooleanType()) as the third argument to register.

Use isdate function in your Spark SQL query

Finally, now that the Python function is registered with the Spark context, you can use it directly in Spark SQL statements.

For example,

sqlContext.sql("select udf_isdate('2020-01-001')").show()
+-----------------------+
|udf_isdate(2020-01-001)|
+-----------------------+
|                  false|
+-----------------------+
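The result is false because the %d directive matches at most two digits, so the trailing 1 in 2020-01-001 is left unparsed and strptime raises a ValueError. You can confirm this in plain Python:

```python
import datetime

# '%d' consumes at most two digits, so '2020-01-001' leaves an
# unparsed trailing '1' and strptime raises ValueError.
try:
    datetime.datetime.strptime('2020-01-001', '%Y-%m-%d')
    parsed = True
except ValueError:
    parsed = False

print(parsed)  # False
```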

DateTime Validation in Spark SQL

You can use the following Python function to validate a datetime string in PySpark.

import datetime

def isvaliddatetime(date_text):
	# Return True if date_text is a valid timestamp in YYYY-MM-DD HH:MM:SS format.
	try:
		datetime.datetime.strptime(date_text, '%Y-%m-%d %H:%M:%S')
		return True
	except (ValueError, TypeError):
		# TypeError covers None and other non-string values.
		return False
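As before, a quick plain-Python sanity check (the function is repeated, with TypeError handling, so the snippet is self-contained):

```python
import datetime

def isvaliddatetime(date_text):
    # Return True if date_text is a valid timestamp in YYYY-MM-DD HH:MM:SS format.
    try:
        datetime.datetime.strptime(date_text, '%Y-%m-%d %H:%M:%S')
        return True
    except (ValueError, TypeError):
        return False

print(isvaliddatetime('2020-01-01 01:01:01'))  # True
print(isvaliddatetime('2020-01-01 25:00:00'))  # False: hour out of range
```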

Register it the same way before using it: sqlContext.udf.register('udf_isvaliddatetime', isvaliddatetime). For example,

sqlContext.sql("select udf_isvaliddatetime('2020-01-01 01:01:01')").show()
+----------------------------------------+
|udf_isvaliddatetime(2020-01-01 01:01:01)|
+----------------------------------------+
|                                    true|
+----------------------------------------+

Hope this helps 🙂