Apache Spark is one of the easiest frameworks for working with different data sources. You can combine heterogeneous data sources with the help of DataFrames. Some applications, such as machine learning models, require only numeric values. You should check the data types of the DataFrame columns before feeding them to an ML model, and cast any columns of a different type to an integer type. In this article, we will check how to perform a Spark DataFrame integer type check and how to convert a column using the CAST function in Spark.
Spark DataFrame Integer Type Check Requirement
As mentioned earlier, if you are building an ML model using the Spark ML library, it expects only numeric data types, such as integers. You should apply the cast function to change the DataFrame column type if it is of a different type.
Test Data
Following is the test DataFrame that we are going to use in the subsequent examples.
testDF = sqlContext.createDataFrame(
    [(1,"111"), (2,"111"), (3,"222"), (4,"222"), (5,"222"), (6,"111"), (7,"333"), (8,"444")],
    ["id", "d_id"]
)
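To make the examples easier to follow, you can display the contents of the test DataFrame with the show method; with the default formatting, the output looks like this:
>>> testDF.show()
+---+----+
| id|d_id|
+---+----+
|  1| 111|
|  2| 111|
|  3| 222|
|  4| 222|
|  5| 222|
|  6| 111|
|  7| 333|
|  8| 444|
+---+----+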
Check Spark DataFrame Schema
Before applying any cast method on a DataFrame column, you should first check the schema of the DataFrame. You can use the DataFrame.schema attribute to verify the DataFrame columns and their types.
For example, consider the below command to display the DataFrame schema.
>>> testDF.schema
StructType(List(StructField(id,LongType,true),StructField(d_id,StringType,true)))
Note that the d_id column is of StringType.
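Alternatively, the printSchema method prints the same information in a tree format, which is easier to read for DataFrames with many columns:
>>> testDF.printSchema()
root
 |-- id: long (nullable = true)
 |-- d_id: string (nullable = true)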
Spark DataFrame Integer Type Check
In this section, we will check how to verify whether a DataFrame column type is integer. You can use the df.dtypes attribute to check the type of each column.
For example, consider the below commands.
>>> dict(testDF.dtypes)['id']
'bigint'
>>> dict(testDF.dtypes)['d_id']
'string'
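For reference, df.dtypes returns the full list of (column name, type) pairs, which the dict call above simply turns into a lookup table:
>>> testDF.dtypes
[('id', 'bigint'), ('d_id', 'string')]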
You can use the above method in an if condition to check whether the column type is integer.
>>> if dict(testDF.dtypes)['id'] == 'bigint':
...     print("DF column is of integer type")
... else:
...     print("DF column is of different type")
...
DF column is of integer type
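Note that Spark has several integer types ('tinyint', 'smallint', 'int' and 'bigint'), so a single string comparison like the one above can miss valid integer columns. Below is a minimal sketch of a reusable check; the function name is_integer_column is my own naming for illustration, not a Spark API:
>>> def is_integer_column(df, col_name):
...     # 'tinyint', 'smallint', 'int' and 'bigint' are the simple string
...     # names for ByteType, ShortType, IntegerType and LongType
...     return dict(df.dtypes)[col_name] in ('tinyint', 'smallint', 'int', 'bigint')
...
>>> is_integer_column(testDF, 'id')
True
>>> is_integer_column(testDF, 'd_id')
False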
You can convert a DataFrame column to a different type using the Spark CAST function. Read more about type conversion in my other post – Spark DataFrame Column Type Conversion using CAST
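As a quick preview, here is a minimal sketch of casting the d_id column from string to integer; the details are covered in the linked post:
>>> from pyspark.sql.types import IntegerType
>>> castedDF = testDF.withColumn("d_id", testDF["d_id"].cast(IntegerType()))
>>> dict(castedDF.dtypes)['d_id']
'int'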
Related Articles,
- Spark Dataset Join Operators using Pyspark – Examples
- How to Update Spark DataFrame Column Values using Pyspark?
- How to Export Spark-SQL Results to CSV?
Hope this helps 🙂