Spark DataFrame Integer Type Check and Example

Apache Spark is one of the easiest frameworks for working with different data sources, and you can combine heterogeneous data sources with the help of DataFrames. Some applications, such as machine learning models, require only integer values. You should check the data type of the DataFrame before feeding it to an ML model, or type cast it to an integer type. In this article, we will check how to perform a Spark DataFrame integer type check and how to convert a column using the Spark CAST function.

Spark DataFrame Integer Type Check Requirement

As mentioned earlier, if you are building an ML model using the Spark ML library, it expects only integer data types. You should apply the cast function to change a DataFrame column's type if it is of a different type.

Test Data

Following is the test DF that we are going to use in the subsequent examples.

testDF = sqlContext.createDataFrame([(1,"111"), (2,"111"), (3,"222"), (4,"222"), (5,"222"), (6,"111"), (7,"333"), (8,"444")], ["id", "d_id"])
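
The sqlContext above is available by default in the PySpark shell on older Spark releases. On Spark 2.x and later, the entry point is typically a SparkSession; here is a minimal, self-contained sketch that builds the same test DataFrame (the app name is an arbitrary choice):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName("integer-type-check").getOrCreate()

# Same test data: id is inferred as bigint (LongType), d_id as string
testDF = spark.createDataFrame(
    [(1, "111"), (2, "111"), (3, "222"), (4, "222"),
     (5, "222"), (6, "111"), (7, "333"), (8, "444")],
    ["id", "d_id"]
)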

Check Spark DataFrame Schema

Before applying any cast method to a DataFrame column, you should first check the schema of the DataFrame. You can use the DataFrame.schema attribute to verify the DataFrame's columns and their types.

For example, consider the following output that displays the DataFrame schema.

>>> testDF.schema
StructType(List(StructField(id,LongType,true),StructField(d_id,StringType,true)))

Note that the d_id column is of StringType.
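
You can also use the printSchema method to display the same information in a tree format:

>>> testDF.printSchema()
root
 |-- id: long (nullable = true)
 |-- d_id: string (nullable = true)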

Spark DataFrame Integer Type Check

In this section, we will check whether a DataFrame column's type is an integer. You can use the DataFrame.dtypes attribute, which returns a list of (column name, type) pairs, to check the type of a column.

For example, consider the examples below.

>>> dict(testDF.dtypes)['id']
'bigint'

>>> dict(testDF.dtypes)['d_id']
'string'
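
For reference, DataFrame.dtypes itself returns the full list of (column, type) pairs, which is why wrapping it in dict works:

>>> testDF.dtypes
[('id', 'bigint'), ('d_id', 'string')]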

You can use the above method in an if condition to check whether a column type is an integer.

>>> if dict(testDF.dtypes)['id'] == 'bigint':
...     print("DF column is of integer type")
... else:
...     print("DF column is of different type")
...
DF column is of integer type
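
Note that Spark has several integer-like types, which DataFrame.dtypes reports as tinyint, smallint, int, and bigint. If you want the check to cover all of them, here is a minimal sketch, assuming the same testDF as above:

# Integer-like type names as reported by DataFrame.dtypes
integer_types = {'tinyint', 'smallint', 'int', 'bigint'}

if dict(testDF.dtypes)['id'] in integer_types:
    print("DF column is of integer type")
else:
    print("DF column is of different type")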

You can convert a DataFrame column to a different type using the Spark CAST function. Read more about type conversion in my other post – Spark DataFrame Column Type Conversion using CAST.
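
As a quick illustration, here is a minimal sketch that casts the string column d_id of our test DataFrame to an integer type:

>>> from pyspark.sql.functions import col
>>> castedDF = testDF.withColumn("d_id", col("d_id").cast("int"))
>>> dict(castedDF.dtypes)['d_id']
'int'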

Hope this helps 🙂