In an earlier post, we discussed how to check whether a Spark DataFrame column is of integer type. Some applications expect a column to be of a specific type; for example, many machine learning models accept only numeric input. In this article, we will see how to perform Spark DataFrame column type conversion using the Spark DataFrame CAST method.
Spark DataFrame Column Type Conversion
You can use the Spark CAST method to convert a DataFrame column to the required data type.
Test Data Frame
Following is the test DataFrame (testDF) that we are going to use in the subsequent examples.
testDF = sqlContext.createDataFrame([(1,"111"), (2,"111"), (3,"222"), (4,"222"), (5,"222"), (6,"111"), (7,"333"), (8,"444")], ["id", "d_id"])
Test Data Frame Schema
Use DF.schema to check the schema or structure of the test DF.
>>> testDF.schema
StructType(List(StructField(id,LongType,true), StructField(d_id,StringType,true)))
Note that column d_id is of StringType. Since it holds integer data, we will convert it to integer type using the Spark DataFrame CAST method.
Spark DataFrame CAST Method
The CAST method converts the column to the given dataType. It is one of the handy methods you can use with a DataFrame column.
Syntax
Following is the CAST method syntax
dataFrame["columnName"].cast(DataType())
Here, dataFrame is the DataFrame you are manipulating, columnName is the name of the DataFrame column, and DataType can be any type from the Spark data type list.
Data Frame Column Type Conversion using CAST
In this section, we will use the CAST function to convert the data type of the data frame column to the desired type.
For example, consider the example below, which converts the d_id column to integer type. The d_id column holds data that is of type integer, so we apply the CAST method with IntegerType.
>>> from pyspark.sql.types import IntegerType
>>> testDF = testDF.withColumn("d_idTmp", testDF["d_id"].cast(IntegerType())).drop("d_id").withColumnRenamed("d_idTmp", "d_id")
Now, verify the schema again to confirm the conversion.
>>> testDF.schema
StructType(List(StructField(id,LongType,true), StructField(d_id,IntegerType,true)))
As you can see, the PySpark DataFrame column type has been converted from string to integer.
Related Articles
- Spark DataFrame Integer Type Check and Example
- How to Update Spark DataFrame Column Values using Pyspark?
- Rename PySpark DataFrame Column – Methods and Examples
- Spark SQL Date and Timestamp Functions and Examples
Hope this helps 🙂