In an earlier post, we discussed how to check whether a Spark DataFrame column is of integer type. Some applications expect a column to be of a specific type; for example, many machine learning models accept only numeric input. In this article, we will see how to perform Spark DataFrame column type conversion using the Spark DataFrame CAST method.
Spark DataFrame Column Type Conversion
You can use the Spark CAST method to convert a DataFrame column to the required data type.
Test Data Frame
Following is the test DataFrame (testDF) that we are going to use in the subsequent examples.
testDF = sqlContext.createDataFrame([(1,"111"), (2,"111"), (3,"222"), (4,"222"), (5,"222"), (6,"111"), (7,"333"), (8,"444")], ["id", "d_id"])
Test Data Frame Schema
Use DF.schema to check the schema or structure of the test DF.
>>> testDF.schema
StructType(List(StructField(id,LongType,true), StructField(d_id,StringType,true)))
Note that column d_id is of StringType. Since it holds integer data, we will convert it to integer type using the Spark DataFrame CAST method.
Spark DataFrame CAST Method
The CAST method converts the column to the given dataType. It is one of the handy methods you can use with a DataFrame column.
Syntax
Following is the CAST method syntax
dataFrame["columnName"].cast(DataType())
Here, dataFrame is the DataFrame you are manipulating, columnName is the name of the DataFrame column, and DataType can be any type from the Spark data type list.
Data Frame Column Type Conversion using CAST
In this section, we will use the CAST function to convert the data type of the data frame column to the desired type.
For example, consider the example below, which converts the d_id column to integer type. The d_id column holds data that is of type integer, so we apply the CAST method with IntegerType.
>>> from pyspark.sql.types import IntegerType
>>> testDF = testDF.withColumn("d_idTmp", testDF["d_id"].cast(IntegerType())).drop("d_id").withColumnRenamed("d_idTmp", "d_id")
Now, verify the schema again to confirm the conversion.
>>> testDF.schema
StructType(List(StructField(id,LongType,true), StructField(d_id,IntegerType,true)))
As you can see, the PySpark DataFrame column type has been converted from string to integer.
Related Articles
- Spark DataFrame Integer Type Check and Example
- How to Update Spark DataFrame Column Values using Pyspark?
- Rename PySpark DataFrame Column – Methods and Examples
- Spark SQL Date and Timestamp Functions and Examples
Hope this helps 🙂