Rename PySpark DataFrame Column – Methods and Examples

  • Post last modified: May 31, 2021
  • Post category: Apache Spark

A DataFrame in Spark is a dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations. When you work with DataFrames, you will often need to rename a column. In this article, we will check how to rename a PySpark DataFrame column, the methods available to do so, and some examples.

Rename PySpark DataFrame Column

As mentioned earlier, we often need to rename one or more columns of a PySpark (or Spark) DataFrame.

Note that we are only changing the column name. We are not replacing the column data or converting its data type.

Following are some methods that you can use to rename DataFrame columns in PySpark.

  • Use withColumnRenamed Function
  • Use toDF Function to Rename All Columns in DataFrame
  • Use DataFrame Column Alias Method

Now let us check these methods with examples.

Test Data

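Following is the test DataFrame that we will use in the subsequent methods and examples. The commands are run in the pyspark shell, where sqlContext is already defined; on Spark 2.0 and later, spark.createDataFrame works the same way.

>>> testDF = sqlContext.createDataFrame([(1,"111"), (2,"111"), (3,"222"), (4,"222"), (5,"222"), (6,"111"), (7,"333"), (8,"444")], ["id", "d_id"])
>>> testDF.show()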

+---+----+
| id|d_id|
+---+----+
|  1| 111|
|  2| 111|
|  3| 222|
|  4| 222|
|  5| 222|
|  6| 111|
|  7| 333|
|  8| 444|
+---+----+

Rename DataFrame Column using withColumnRenamed

This is one of the easiest methods that you can use to rename a DataFrame column.

The following example uses the Spark withColumnRenamed function to rename a DataFrame column.

>>> testDF2 = testDF.withColumnRenamed("d_id", "dept_id")
>>> testDF2.show()
+---+-------+
| id|dept_id|
+---+-------+
|  1|    111|
|  2|    111|
|  3|    222|
|  4|    222|
|  5|    222|
|  6|    111|
|  7|    333|
|  8|    444|
+---+-------+

As you can see, d_id is renamed to dept_id.
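
To rename several columns with withColumnRenamed, the call can be chained once per column. The following is a minimal sketch and not part of the original examples: rename_with_mapping is a hypothetical helper name, and emp_id is only an illustrative new name. Keep in mind that withColumnRenamed does nothing when the DataFrame has no column with the given old name, so a typo in a key goes unnoticed.

```python
from functools import reduce


def rename_with_mapping(df, mapping):
    """Return a new DataFrame with each old name in `mapping` renamed.

    Hypothetical helper: applies one withColumnRenamed call per
    (old_name, new_name) pair, feeding each result into the next call.
    """
    return reduce(
        lambda acc, names: acc.withColumnRenamed(names[0], names[1]),
        mapping.items(),
        df,
    )


# Usage in the pyspark shell with the test DataFrame defined earlier:
#   testDF5 = rename_with_mapping(testDF, {"id": "emp_id", "d_id": "dept_id"})
#   testDF5.show()
```

Columns that do not appear in the mapping keep their current names.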

Spark toDF Function to Rename All Columns in DataFrame

The toDF() function returns a new DataFrame with the column names that you specify. You must pass a name for every column, in the same order as the existing columns, so this method is best suited to renaming all columns at once.

For example, consider the following.

>>> testDF3 = testDF.toDF("id", "dept_id")
>>> testDF3.show()
+---+-------+
| id|dept_id|
+---+-------+
|  1|    111|
|  2|    111|
|  3|    222|
|  4|    222|
|  5|    222|
|  6|    111|
|  7|    333|
|  8|    444|
+---+-------+

Be careful when using toDF() to rename columns in a DataFrame. The new names are matched to the columns by position, and the call fails if the number of names does not match the number of columns.
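
Because toDF() takes one name per column, it pairs naturally with a list built from the existing column names. The sketch below assumes that use case: rename_all is a hypothetical helper name, and the upper-case and prefix transformations are illustrative choices only.

```python
def rename_all(df, make_name):
    """Return a new DataFrame whose columns are renamed by `make_name`.

    Hypothetical helper: `make_name` is any function that maps an old
    column name to a new one. The names are passed to toDF() in the
    same order as df.columns, so each column keeps its position.
    """
    new_names = [make_name(name) for name in df.columns]
    return df.toDF(*new_names)


# Usage in the pyspark shell with the test DataFrame defined earlier:
#   rename_all(testDF, str.upper).show()                  # columns ID, D_ID
#   rename_all(testDF, lambda c: "emp_" + c).show()       # columns emp_id, emp_d_id
```

Deriving the list from df.columns keeps the number of names equal to the number of columns.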

Rename DataFrame Column using Alias Method

This is one of the easiest methods and is often used in PySpark code. An alias gives a column a new name in the DataFrame returned by select.

For example, the following code uses an alias in select to rename DataFrame columns.

>>> from pyspark.sql.functions import col
>>> testDF4 = testDF.select(col("id").alias("id"), col("d_id").alias("dept_id"))
>>> testDF4.show()
+---+-------+
| id|dept_id|
+---+-------+
|  1|    111|
|  2|    111|
|  3|    222|
|  4|    222|
|  5|    222|
|  6|    111|
|  7|    333|
|  8|    444|
+---+-------+
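
When only some columns need a new name, the alias approach can be driven by a mapping so that the remaining columns pass through unchanged. This is a minimal sketch under that assumption: select_with_aliases is a hypothetical helper name, and it uses df[name] to obtain each column instead of importing col.

```python
def select_with_aliases(df, mapping):
    """Return a new DataFrame with the columns in `mapping` aliased.

    Hypothetical helper: every column of df is selected in its current
    order; a column listed in `mapping` gets the mapped name, any other
    column is aliased to its own name and so stays as it is.
    """
    aliased = [df[name].alias(mapping.get(name, name)) for name in df.columns]
    return df.select(*aliased)


# Usage in the pyspark shell with the test DataFrame defined earlier:
#   testDF6 = select_with_aliases(testDF, {"d_id": "dept_id"})
#   testDF6.show()
```

Unlike a hand-written select, this keeps every column, so nothing is dropped by accident when the DataFrame has many columns.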

Hope this helps 🙂