A DataFrame in Spark is a dataset organized into named columns. A Spark DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations. When you work with DataFrames, you will often need to rename a column. In this article, we will look at the methods you can use to rename a PySpark DataFrame column, along with some examples.
Rename PySpark DataFrame Column
As mentioned earlier, we often need to rename one or more columns of a PySpark (or Spark) DataFrame.
Note that we are only changing the column name. We are not replacing the column data or converting its data type.
Following are some methods that you can use to rename DataFrame columns in PySpark.
- Use withColumnRenamed Function
- Use toDF Function to Rename All Columns in DataFrame
- Use DataFrame Column Alias method
Now let us check these methods with examples.
Test Data
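Following is the test DataFrame that we will use in the subsequent methods and examples. The snippet assumes the pyspark shell, where spark is the predefined SparkSession; on older Spark versions, sqlContext.createDataFrame works the same way.
>>> testDF = spark.createDataFrame(
...     [(1, "111"), (2, "111"), (3, "222"), (4, "222"),
...      (5, "222"), (6, "111"), (7, "333"), (8, "444")],
...     ["id", "d_id"])
>>> testDF.show()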
+---+----+
| id|d_id|
+---+----+
|  1| 111|
|  2| 111|
|  3| 222|
|  4| 222|
|  5| 222|
|  6| 111|
|  7| 333|
|  8| 444|
+---+----+
Rename DataFrame Column using withColumnRenamed
This is one of the easiest methods that you can use to rename a DataFrame column. It takes the existing column name and the new name, and returns a new DataFrame.
The following example uses the Spark withColumnRenamed function to rename a DataFrame column.
>>> testDF2 = testDF.withColumnRenamed("d_id", "dept_id")
>>> testDF2.show()
+---+-------+
| id|dept_id|
+---+-------+
|  1|    111|
|  2|    111|
|  3|    222|
|  4|    222|
|  5|    222|
|  6|    111|
|  7|    333|
|  8|    444|
+---+-------+
As you can see, d_id is renamed to dept_id.
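To rename several columns this way, the calls can be chained, because each withColumnRenamed call returns a new DataFrame. The following is a minimal sketch, not a built-in API: rename_columns and mapping are hypothetical names chosen for illustration, and df is assumed to be a PySpark DataFrame. Keep in mind that withColumnRenamed is a no-op when the old column name is not in the schema, so a misspelled name is silently ignored.

```python
from typing import Mapping


def rename_columns(df, mapping: Mapping[str, str]):
    """Rename every column listed in `mapping` (old name -> new name).

    `df` is assumed to be a pyspark.sql.DataFrame. This helper is an
    illustrative sketch, not part of the PySpark API.
    """
    result = df
    for old_name, new_name in mapping.items():
        # withColumnRenamed returns a new DataFrame, so keep the result
        # and apply the next rename to it.
        result = result.withColumnRenamed(old_name, new_name)
    return result
```

With the test DataFrame from above, usage would look like this:
>>> testDF5 = rename_columns(testDF, {"id": "emp_id", "d_id": "dept_id"})
>>> testDF5.columns
['emp_id', 'dept_id']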
Spark toDF Function to Rename All Columns in DataFrame
The toDF() method returns a new DataFrame with the column names that you specify. You pass one name for every column, in order, so it is convenient when you want to rename all the columns at once.
For example, consider the following.
>>> testDF3 = testDF.toDF("id", "dept_id")
>>> testDF3.show()
+---+-------+
| id|dept_id|
+---+-------+
|  1|    111|
|  2|    111|
|  3|    222|
|  4|    222|
|  5|    222|
|  6|    111|
|  7|    333|
|  8|    444|
+---+-------+
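Be careful when you use toDF() to rename columns. The new names are applied by position, so you must supply exactly one name for every column, in the existing column order. If the number of names does not match the number of columns, Spark raises an error.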
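Because toDF() needs the complete list of names, one way to rename only a few columns is to build that list from the existing df.columns. The sketch below is an assumption about how you might do it: renamed_column_list is a hypothetical helper, not a PySpark function, and it works on plain lists of strings.

```python
from typing import List, Mapping, Sequence


def renamed_column_list(columns: Sequence[str],
                        mapping: Mapping[str, str]) -> List[str]:
    """Return `columns` with the names found in `mapping` replaced.

    Order and length are preserved, which is what toDF() requires.
    Illustrative helper; not part of the PySpark API.
    """
    # Columns without an entry in the mapping keep their current name.
    return [mapping.get(name, name) for name in columns]
```

The resulting list can then be unpacked into toDF():
>>> testDF6 = testDF.toDF(*renamed_column_list(testDF.columns, {"d_id": "dept_id"}))
>>> testDF6.columns
['id', 'dept_id']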
Rename DataFrame Column using Alias Method
This is another easy method, and it is used often in PySpark code. An alias gives a column a new name in the result of a select; the original DataFrame is left unchanged.
For example, consider the following Spark SQL example that uses an alias to rename DataFrame columns.
>>> from pyspark.sql.functions import col
>>> testDF4 = testDF.select(col("id").alias("id"), col("d_id").alias("dept_id"))
>>> testDF4.show()
+---+-------+
| id|dept_id|
+---+-------+
|  1|    111|
|  2|    111|
|  3|    222|
|  4|    222|
|  5|    222|
|  6|    111|
|  7|    333|
|  8|    444|
+---+-------+
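The same alias idea can also be written with SQL expressions through selectExpr, which accepts strings such as "d_id AS dept_id". The following is a small sketch that generates those strings from a mapping. The helper name alias_expressions is hypothetical, and the code assumes that column names do not themselves contain backticks.

```python
from typing import List, Mapping, Sequence


def alias_expressions(columns: Sequence[str],
                      mapping: Mapping[str, str]) -> List[str]:
    """Build one "`old` AS `new`" SQL expression per column.

    Columns missing from `mapping` are aliased to their own name.
    Assumes no column name contains a backtick. Illustrative helper;
    not part of the PySpark API.
    """
    # Backticks quote the identifiers so names with spaces or dots still parse.
    return [f"`{name}` AS `{mapping.get(name, name)}`" for name in columns]
```

Passing the expressions to selectExpr gives the renamed columns:
>>> testDF7 = testDF.selectExpr(*alias_expressions(testDF.columns, {"d_id": "dept_id"}))
>>> testDF7.columns
['id', 'dept_id']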
Hope this helps 🙂