How to Add Column with Default Value to Pyspark DataFrame?


Since its inception, Spark has made many improvements and added many useful DataFrame APIs. If you come from a SQL background, you may have noticed that assigning a default value when you add a new column is common practice. This is done to make sure the new column does not hold junk or NULL values. In this article, we will check how to add a column with a default or constant value to a PySpark DataFrame.


Add a Column with Default Value to Pyspark DataFrame

Adding a column with a default or constant value to an existing PySpark DataFrame is a common requirement when you work with datasets that have many columns. During data processing you may need to add new columns to an already existing DataFrame. You can also replace or update the values of an existing column that holds junk records.
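
For example, to overwrite junk or NULL values in an existing column with a default, you can combine withColumn with when and otherwise. This is only a minimal sketch: the someDF DataFrame, the status column, and the 'unknown' default are illustrative, not part of the examples below.

# Overwrite junk or NULL values in an existing column with a default value
# (sketch only -- someDF, the "status" column and the "unknown" default are illustrative)
from pyspark.sql.functions import when, col, lit

cleanedDF = someDF.withColumn(
    'status',
    when(col('status').isNull() | (col('status') == ''), lit('unknown'))
    .otherwise(col('status')))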

Following are the different methods that you can use to add a new column with a constant value to a PySpark DataFrame:

  • lit function
  • PySpark UDF (user-defined function)
  • Spark SQL query

Now, let us check each of these methods briefly with an example.

Test DataFrame

We will be using the following sample DataFrame across all the subsequent examples.

sampleDF = spark.createDataFrame(
    [[1, 'Scott'],
     [2, 'Tiger'],
     [3, 'Jane'],
     [4, 'Jenny'],
     [5, 'Judy'],
     [3, 'Jane'],
     [2, 'Tiger']],
    ('id', 'name'))

+---+-----+
| id| name|
+---+-----+
|  1|Scott|
|  2|Tiger|
|  3| Jane|
|  4|Jenny|
|  5| Judy|
|  3| Jane|
|  2|Tiger|
+---+-----+

Add a Constant or Default Column to a DataFrame using the lit Function

You can use the PySpark SQL function lit to add a column with a constant value to a PySpark DataFrame.

Consider the following example, which adds two columns with constant values.

# Add new default column using lit function
from datetime import date
from pyspark.sql.functions import lit

sampleDF = sampleDF\
	.withColumn('newid', lit(0))\
	.withColumn('joinDate', lit(date.today()))

The following output shows the two new columns with default values.

>>> sampleDF.show()
+---+-----+-----+----------+
| id| name|newid|  joinDate|
+---+-----+-----+----------+
|  1|Scott|    0|2021-06-11|
|  2|Tiger|    0|2021-06-11|
|  3| Jane|    0|2021-06-11|
|  4|Jenny|    0|2021-06-11|
|  5| Judy|    0|2021-06-11|
|  3| Jane|    0|2021-06-11|
|  2|Tiger|    0|2021-06-11|
+---+-----+-----+----------+
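
The lit function also covers the case where the default should simply be NULL. Here is a minimal sketch, assuming you cast the literal so Spark knows the column type; the column name emptyCol is only illustrative.

# Add a NULL default column (sketch; "emptyCol" is just an illustrative name)
from pyspark.sql.functions import lit

sampleDF2 = sampleDF.withColumn('emptyCol', lit(None).cast('string'))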

UDF to Add Default Value to a Spark DataFrame

Secondly, you can create a PySpark UDF (user-defined function) to add a default value column to a DataFrame.

Consider the following example, which adds constant value columns using PySpark UDFs.

# Add new default column using pyspark udf
from datetime import date
from pyspark.sql.functions import udf

@udf("int")
def const_int_col():
    return 0
	
@udf("date")
def const_date_col():
    return date.today()	

sampleDF1 = sampleDF\
	.withColumn('newid', const_int_col())\
	.withColumn('joinDate', const_date_col())

The following output shows the two new columns with default values.

>>> sampleDF1.show()
+---+-----+-----+----------+
| id| name|newid|  joinDate|
+---+-----+-----+----------+
|  1|Scott|    0|2021-06-11|
|  2|Tiger|    0|2021-06-11|
|  3| Jane|    0|2021-06-11|
|  4|Jenny|    0|2021-06-11|
|  5| Judy|    0|2021-06-11|
|  3| Jane|    0|2021-06-11|
|  2|Tiger|    0|2021-06-11|
+---+-----+-----+----------+
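
If you prefer not to use the decorator syntax, the same constant-value UDFs can be built by passing a function and a return type to udf directly. An equivalent sketch:

# Equivalent UDFs built without the decorator syntax (sketch)
from datetime import date
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, DateType

const_int_col = udf(lambda: 0, IntegerType())
const_date_col = udf(lambda: date.today(), DateType())

sampleDF2 = sampleDF\
	.withColumn('newid', const_int_col())\
	.withColumn('joinDate', const_date_col())

Keep in mind that Python UDFs are evaluated row by row in a Python worker, so for a simple constant the lit function is usually the cheaper option.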

Spark SQL to Add Default Value to a Spark DataFrame

Finally, you can use a Spark SQL query to add constant column values to a DataFrame. This approach requires registering a temporary view and rewriting the full select list, so treat it as an alternative rather than the recommended method.

# Add new constant columns using a Spark SQL query
# (use an unquoted 0 so newid is an integer column, matching the earlier examples)
sampleDF.createOrReplaceTempView("sampleDF")
sampleDF1 = spark.sql("select id, name, 0 as newid, current_date() as joinDate from sampleDF")

The following output shows the two new columns with default values.

>>> sampleDF1.show()
+---+-----+-----+----------+
| id| name|newid|  joinDate|
+---+-----+-----+----------+
|  1|Scott|    0|2021-06-11|
|  2|Tiger|    0|2021-06-11|
|  3| Jane|    0|2021-06-11|
|  4|Jenny|    0|2021-06-11|
|  5| Judy|    0|2021-06-11|
|  3| Jane|    0|2021-06-11|
|  2|Tiger|    0|2021-06-11|
+---+-----+-----+----------+
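
If you would rather not register a temporary view, selectExpr runs the same SQL expressions directly on the DataFrame. A minimal sketch:

# Same SQL expressions without registering a temp view (sketch)
sampleDF1 = sampleDF.selectExpr("id", "name", "0 as newid", "current_date() as joinDate")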

Let me know which of the above methods helped you.


Hope this helps 🙂