Many databases, such as SQL Server, support an isdate function. Spark SQL supports many built-in functions and DataFrame methods, and we have already seen Spark SQL date functions in my other post, “Spark SQL Date and Timestamp Functions“. However, you may have noticed that there is no built-in function to validate date and timestamp values in Spark SQL. Alternatively, you can use Hive date functions to filter out unwanted dates. In this article, we will check how to create a Spark SQL isdate user defined function (UDF) with an example.
Create Spark SQL isdate Function
The best part about Spark is that it supports a wide range of programming languages such as Java, Scala, Python and R. You can use any of the supported languages to write a UDF and register it with Spark.
In PySpark, the Python function registration process takes a slightly different approach compared to adding a jar file and registering it through Spark SQL.
You can use the following steps to create and register an isdate UDF in Spark. We will be using PySpark to demonstrate the UDF registration process.
- Create isdate Python Function
- Register isdate Function into Spark Context
- Use isdate function in your Spark SQL query
Create isdate Python Function
First, create an isdate Python function that validates whether a string is a valid date.
Following is the isdate function.
import datetime

def isdate(date_text):
    try:
        datetime.datetime.strptime(date_text, '%Y-%m-%d')
        return True
    except ValueError:
        return False
The function returns True if the date is valid and False otherwise.
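Before wiring it into Spark, you can sanity-check the function in plain Python. The sample values below are illustrative, not from the original post:

```python
import datetime

def isdate(date_text):
    try:
        datetime.datetime.strptime(date_text, '%Y-%m-%d')
        return True
    except ValueError:
        return False

# A valid ISO-style date parses cleanly
print(isdate('2020-01-01'))   # True
# Month 13 does not exist, so strptime raises ValueError
print(isdate('2020-13-01'))   # False
# Not a date at all
print(isdate('hello'))        # False
```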
Register isdate Function into Spark Context
The next step is to register the Python function created in the previous step with the Spark context so that it is visible to Spark SQL at runtime.
You can use the sqlContext.udf.register method available with the Spark SQL context.
sqlContext.udf.register('udf_isdate', isdate)
In the above syntax, we are registering the isdate function as udf_isdate in the Spark context. Use udf_isdate in subsequent Spark SQL statements. Note that since we did not pass a returnType, PySpark registers the UDF with a string return type; if you want a proper boolean column, you can pass returnType=BooleanType() (imported from pyspark.sql.types) to the register call.
Use isdate function in your Spark SQL query
Finally, once the Python function is registered with the Spark context, you can use it directly in Spark SQL statements.
For example,
sqlContext.sql('select udf_isdate(\'2020-01-001\')').show()
+-----------------------+
|udf_isdate(2020-01-001)|
+-----------------------+
| false|
+-----------------------+
DateTime Validation in Spark SQL
You can use the following Python function to validate a datetime string in PySpark.
import datetime

def isvaliddatetime(date_text):
    try:
        datetime.datetime.strptime(date_text, '%Y-%m-%d %H:%M:%S')
        return True
    except ValueError:
        return False
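If your data mixes plain dates and timestamps, one option (a sketch, not part of the original post) is a single validator that tries several formats in turn:

```python
import datetime

# Formats to accept; extend this list as needed (illustrative assumption)
FORMATS = ['%Y-%m-%d %H:%M:%S', '%Y-%m-%d']

def isvalid(date_text, formats=FORMATS):
    """Return True if date_text matches any of the given formats."""
    for fmt in formats:
        try:
            datetime.datetime.strptime(date_text, fmt)
            return True
        except ValueError:
            pass
    return False

print(isvalid('2020-01-01'))           # True
print(isvalid('2020-01-01 01:01:01'))  # True
print(isvalid('01/01/2020'))           # False
```

You would register this function with sqlContext.udf.register in the same way as isdate above.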
Register the function first, then use it in a query. For example,
sqlContext.udf.register('udf_isvaliddatetime', isvaliddatetime)
sqlContext.sql('select udf_isvaliddatetime(\'2020-01-01 01:01:01\')').show()
+----------------------------------------+
|udf_isvaliddatetime(2020-01-01 01:01:01)|
+----------------------------------------+
| true|
+----------------------------------------+
Related Articles:
- Register Python Function into Pyspark – Example
- Register Hive UDF jar into pyspark – Steps and Examples
- Hive UDF using Python-Use Python Script into Hive-Example
- Spark SQL isnumeric Function Alternative and Example
Hope this helps 🙂