Since its inception, Spark has made many improvements and added many useful DataFrame APIs. If you come from a SQL background, you might have noticed that assigning a default value to a newly added column is a common practice; it ensures the new column does not hold junk or NULL values. In this article, we will check how to add a column with a default or constant value to a PySpark DataFrame.
Add a Column with Default Value to PySpark DataFrame
Adding a column with a default or constant value to an existing PySpark DataFrame is a common requirement when you work with datasets that have many different columns. During data processing you may need to add new columns to an already existing DataFrame. You can also replace or update the values of a column that already holds junk records.
Following are the different methods that you can use to add a new column with a constant value to a PySpark DataFrame.
- Add DataFrame Constant or Default Column using lit Function
- UDF to Add Default Value to a Spark DataFrame
- Spark SQL to Add Default Value to a Spark DataFrame
Now, let us check these methods briefly with an example.
Test DataFrame
We will be using the following sample DataFrame across all our subsequent examples.
# Create a sample DataFrame; this assumes a SparkSession named `spark`,
# which the PySpark shell provides by default
sampleDF = spark.createDataFrame(
    [[1, 'Scott'],
     [2, 'Tiger'],
     [3, 'Jane'],
     [4, 'Jenny'],
     [5, 'Judy'],
     [3, 'Jane'],
     [2, 'Tiger']], ('id', 'name'))
+---+-----+
| id| name|
+---+-----+
| 1|Scott|
| 2|Tiger|
| 3| Jane|
| 4|Jenny|
| 5| Judy|
| 3| Jane|
| 2|Tiger|
+---+-----+
Add DataFrame Constant or Default Column using lit Function
You can use the PySpark SQL function lit to add columns with a constant value to a PySpark DataFrame. Consider the following example, which adds two columns with constant values.
# Add new default columns using the lit function
from datetime import date
from pyspark.sql.functions import lit

sampleDF = sampleDF \
    .withColumn('newid', lit(0)) \
    .withColumn('joinDate', lit(date.today()))
And the following output shows the two new columns with default values.
>>> sampleDF.show()
+---+-----+-----+----------+
| id| name|newid| joinDate|
+---+-----+-----+----------+
| 1|Scott| 0|2021-06-11|
| 2|Tiger| 0|2021-06-11|
| 3| Jane| 0|2021-06-11|
| 4|Jenny| 0|2021-06-11|
| 5| Judy| 0|2021-06-11|
| 3| Jane| 0|2021-06-11|
| 2|Tiger| 0|2021-06-11|
+---+-----+-----+----------+
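Note that withColumn replaces a column when the given name already exists, so the same lit pattern also covers the update case mentioned earlier. Below is a minimal sketch; the updatedDF variable and the comments column are hypothetical names used only for illustration.
# Overwrite the existing newid column with a different constant
updatedDF = sampleDF.withColumn('newid', lit(100))
# Add a typed NULL default; lit(None) alone has NullType, so cast it explicitly
updatedDF = updatedDF.withColumn('comments', lit(None).cast('string'))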
UDF to Add Default Value to a Spark DataFrame
Alternatively, you can create a PySpark UDF (a.k.a. user defined function) that returns the default value for a column. Consider the following example, which adds columns with constant values using PySpark UDFs.
# Add new default columns using PySpark UDFs
from datetime import date
from pyspark.sql.functions import udf

@udf("int")
def const_int_col():
    return 0

@udf("date")
def const_date_col():
    return date.today()

sampleDF1 = sampleDF \
    .withColumn('newid', const_int_col()) \
    .withColumn('joinDate', const_date_col())
And the following output shows the two new columns with default values.
>>> sampleDF1.show()
+---+-----+-----+----------+
| id| name|newid| joinDate|
+---+-----+-----+----------+
| 1|Scott| 0|2021-06-11|
| 2|Tiger| 0|2021-06-11|
| 3| Jane| 0|2021-06-11|
| 4|Jenny| 0|2021-06-11|
| 5| Judy| 0|2021-06-11|
| 3| Jane| 0|2021-06-11|
| 2|Tiger| 0|2021-06-11|
+---+-----+-----+----------+
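Keep in mind that Python UDFs serialize data between the JVM and the Python workers, so for plain constants the lit function is generally faster. A UDF, or the built-in when and otherwise functions, is more useful when the default depends on per-row logic. A minimal sketch of the conditional case, assuming non-positive ids should be treated as junk:
# Replace non-positive ids with a default value using when/otherwise
from pyspark.sql.functions import when, col, lit

sampleDF2 = sampleDF.withColumn(
    'id', when(col('id') > 0, col('id')).otherwise(lit(-1)))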
Spark SQL to Add Default Value to a Spark DataFrame
And the last method is to use a Spark SQL query to add a constant column value to a DataFrame. This is just an alternate approach and is not recommended, since it requires registering a temporary view first.
# Add new constant columns using a Spark SQL query
# (an unquoted 0 yields an integer column; quoting it as '0' would yield a string)
sampleDF.createOrReplaceTempView("sampleDF")
sampleDF1 = spark.sql(
    "select id, name, 0 as newid, current_date() as joinDate from sampleDF")
And the following output shows the two new columns with default values.
>>> sampleDF1.show()
+---+-----+-----+----------+
| id| name|newid| joinDate|
+---+-----+-----+----------+
| 1|Scott| 0|2021-06-11|
| 2|Tiger| 0|2021-06-11|
| 3| Jane| 0|2021-06-11|
| 4|Jenny| 0|2021-06-11|
| 5| Judy| 0|2021-06-11|
| 3| Jane| 0|2021-06-11|
| 2|Tiger| 0|2021-06-11|
+---+-----+-----+----------+
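If you prefer to skip the temporary view, the selectExpr method accepts the same SQL expressions directly on the DataFrame. A minimal sketch, with constDF as an illustrative name:
# Same constant columns without registering a temp view
constDF = sampleDF.selectExpr(
    "id", "name", "0 as newid", "current_date() as joinDate")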
Let me know which of the above methods helped you.
Related Articles,
- Rename PySpark DataFrame Column – Methods and Examples
- Spark SQL Recursive DataFrame – Pyspark and Scala
- How to Search String in Spark DataFrame? – Scala and PySpark
- How to Use Spark SQL REPLACE on DataFrame?
Hope this helps 🙂