How to Use withColumn() Function in PySpark to Add & Update Columns | PySpark Tutorial

The withColumn() function in PySpark adds a new column to a DataFrame or replaces an existing one. Because DataFrames are immutable, it always returns a new DataFrame and leaves the original untouched. This guide provides simple, step-by-step examples to help you understand how to use it effectively.

📌 Syntax

DataFrame.withColumn(colName, col)

colName: the name of the new or existing column (a string)
col: a Column expression that produces the values for that column

📦 Import Libraries

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, expr

# Create (or reuse) a SparkSession so the examples below can run
spark = SparkSession.builder.appName("WithColumnTutorial").getOrCreate()

📋 Create Sample DataFrame

data = [
    (1, "Alice", 5000, "IT"),
    (2, "Bob", 6000, "HR"),
    (3, "Charlie", 7000, "Finance")
]

df = spark.createDataFrame(data, ["id", "name", "salary", "department"])
df.show()

✅ Expected Output

+---+-------+------+----------+
| id|   name|salary|department|
+---+-------+------+----------+
|  1|  Alice|  5000|        IT|
|  2|    Bob|  6000|        HR|
|  3|Charlie|  7000|   Finance|
+---+-------+------+----------+

1️⃣ Add New Column

df_new = df.withColumn("bonus", lit(1000))
df_new.show()

✅ Expected Output

+---+-------+------+----------+-----+
| id|   name|salary|department|bonus|
+---+-------+------+----------+-----+
|  1|  Alice|  5000|        IT| 1000|
|  2|    Bob|  6000|        HR| 1000|
|  3|Charlie|  7000|   Finance| 1000|
+---+-------+------+----------+-----+

2️⃣ Update Existing Column

df_updated = df_new.withColumn("salary", col("salary") * 1.10)
df_updated.show()

✅ Expected Output

+---+-------+-------+----------+-----+
| id|   name| salary|department|bonus|
+---+-------+-------+----------+-----+
|  1|  Alice| 5500.0|        IT| 1000|
|  2|    Bob| 6600.0|        HR| 1000|
|  3|Charlie| 7700.0|   Finance| 1000|
+---+-------+-------+----------+-----+

3️⃣ Use Expressions with expr()

df_expr = df_updated.withColumn("salary_with_bonus", expr("salary + bonus"))
df_expr.show()

✅ Expected Output

+---+-------+-------+----------+-----+-----------------+
| id|   name| salary|department|bonus|salary_with_bonus|
+---+-------+-------+----------+-----+-----------------+
|  1|  Alice| 5500.0|        IT| 1000|           6500.0|
|  2|    Bob| 6600.0|        HR| 1000|           7600.0|
|  3|Charlie| 7700.0|   Finance| 1000|           8700.0|
+---+-------+-------+----------+-----+-----------------+

4️⃣ Change Column Data Type

df_type_changed = df_expr.withColumn("salary", col("salary").cast("int"))
df_type_changed.show()

✅ Expected Output

+---+-------+------+----------+-----+-----------------+
| id|   name|salary|department|bonus|salary_with_bonus|
+---+-------+------+----------+-----+-----------------+
|  1|  Alice|  5500|        IT| 1000|           6500.0|
|  2|    Bob|  6600|        HR| 1000|           7600.0|
|  3|Charlie|  7700|   Finance| 1000|           8700.0|
+---+-------+------+----------+-----+-----------------+

5️⃣ Rename a Column

df_renamed = df_type_changed.withColumn("emp_name", col("name")).drop("name")
df_renamed.show()

✅ Expected Output

+---+------+----------+-----+-----------------+--------+
| id|salary|department|bonus|salary_with_bonus|emp_name|
+---+------+----------+-----+-----------------+--------+
|  1|  5500|        IT| 1000|           6500.0|   Alice|
|  2|  6600|        HR| 1000|           7600.0|     Bob|
|  3|  7700|   Finance| 1000|           8700.0| Charlie|
+---+------+----------+-----+-----------------+--------+

🎥 Watch the Tutorial

Click here to watch on YouTube

Author: Aamir Shahzad

For more PySpark tutorials, subscribe to TechBrothersIT