How to Use the withColumn() Function in PySpark | Add & Update Columns
The withColumn() function in PySpark returns a new DataFrame with a column added or an existing column replaced; the original DataFrame is not modified. This guide walks through simple, step-by-step examples of adding, updating, casting, and renaming columns.
📌 Syntax
DataFrame.withColumn(colName, col)
📦 Import Libraries
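from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, expr

# Reuse the active session (notebook or pyspark shell) or start a new one
spark = SparkSession.builder.getOrCreate()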
📋 Create Sample DataFrame
data = [
    (1, "Alice", 5000, "IT"),
    (2, "Bob", 6000, "HR"),
    (3, "Charlie", 7000, "Finance")
]
df = spark.createDataFrame(data, ["id", "name", "salary", "department"])
df.show()
✅ Expected Output
+---+-------+------+----------+
| id|   name|salary|department|
+---+-------+------+----------+
|  1|  Alice|  5000|        IT|
|  2|    Bob|  6000|        HR|
|  3|Charlie|  7000|   Finance|
+---+-------+------+----------+
1️⃣ Add New Column
df_new = df.withColumn("bonus", lit(1000))
df_new.show()
✅ Expected Output
+---+-------+------+----------+-----+
| id|   name|salary|department|bonus|
+---+-------+------+----------+-----+
|  1|  Alice|  5000|        IT| 1000|
|  2|    Bob|  6000|        HR| 1000|
|  3|Charlie|  7000|   Finance| 1000|
+---+-------+------+----------+-----+
2️⃣ Update Existing Column
df_updated = df_new.withColumn("salary", col("salary") * 1.10)
df_updated.show()
✅ Expected Output
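+---+-------+-----------------+----------+-----+
| id|   name|           salary|department|bonus|
+---+-------+-----------------+----------+-----+
|  1|  Alice|           5500.0|        IT| 1000|
|  2|    Bob|6600.000000000001|        HR| 1000|
|  3|Charlie|7700.000000000001|   Finance| 1000|
+---+-------+-----------------+----------+-----+
Multiplying by the float 1.10 makes salary a double column, and two of the rows show the small residue left by binary floating-point arithmetic.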
3️⃣ Use Expressions with expr()
df_expr = df_updated.withColumn("salary_with_bonus", expr("salary + bonus"))
df_expr.show()
✅ Expected Output
+---+-------+-----------------+----------+-----+-----------------+
| id|   name|           salary|department|bonus|salary_with_bonus|
+---+-------+-----------------+----------+-----+-----------------+
|  1|  Alice|           5500.0|        IT| 1000|           6500.0|
|  2|    Bob|6600.000000000001|        HR| 1000|7600.000000000001|
|  3|Charlie|7700.000000000001|   Finance| 1000|           8700.0|
+---+-------+-----------------+----------+-----+-----------------+
4️⃣ Change Column Data Type
df_type_changed = df_expr.withColumn("salary", col("salary").cast("Integer"))
df_type_changed.show()
✅ Expected Output
+---+-------+------+----------+-----+-----------------+
| id|   name|salary|department|bonus|salary_with_bonus|
+---+-------+------+----------+-----+-----------------+
|  1|  Alice|  5500|        IT| 1000|           6500.0|
|  2|    Bob|  6600|        HR| 1000|7600.000000000001|
|  3|Charlie|  7700|   Finance| 1000|           8700.0|
+---+-------+------+----------+-----+-----------------+
5️⃣ Rename a Column
df_renamed = df_type_changed.withColumn("emp_name", col("name")).drop("name")
df_renamed.show()
✅ Expected Output
+---+------+----------+-----+-----------------+--------+
| id|salary|department|bonus|salary_with_bonus|emp_name|
+---+------+----------+-----+-----------------+--------+
|  1|  5500|        IT| 1000|           6500.0|   Alice|
|  2|  6600|        HR| 1000|7600.000000000001|     Bob|
|  3|  7700|   Finance| 1000|           8700.0| Charlie|
+---+------+----------+-----+-----------------+--------+