How to Use drop() to Remove Columns from DataFrame | PySpark Tutorial

The drop() method of a PySpark DataFrame removes one or more columns. It returns a new DataFrame and leaves the original unchanged, which makes it useful for data cleaning and transformation tasks.

Sample Data

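from pyspark.sql import SparkSession

# Reuse the active session (for example in a notebook) or create a new one
spark = SparkSession.builder.getOrCreate()

data = [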
    (1, "Alice", 5000, "IT", 25),
    (2, "Bob", 6000, "HR", 30),
    (3, "Charlie", 7000, "Finance", 35),
    (4, "David", 8000, "IT", 40),
    (5, "Eve", 9000, "HR", 45)
]

df = spark.createDataFrame(data, ["id", "name", "salary", "department", "age"])

df.show()

Expected Output

+---+-------+------+----------+---+
| id|   name|salary|department|age|
+---+-------+------+----------+---+
|  1|  Alice|  5000|        IT| 25|
|  2|    Bob|  6000|        HR| 30|
|  3|Charlie|  7000|   Finance| 35|
|  4|  David|  8000|        IT| 40|
|  5|    Eve|  9000|        HR| 45|
+---+-------+------+----------+---+

Example 1: Dropping a Single Column

# Dropping the 'age' column
df_dropped = df.drop("age")
df_dropped.show()

Expected Output

+---+-------+------+----------+
| id|   name|salary|department|
+---+-------+------+----------+
|  1|  Alice|  5000|        IT|
|  2|    Bob|  6000|        HR|
|  3|Charlie|  7000|   Finance|
|  4|  David|  8000|        IT|
|  5|    Eve|  9000|        HR|
+---+-------+------+----------+

Example 2: Dropping Multiple Columns

# Dropping 'id' and 'department' columns
df_multiple_dropped = df.drop("id", "department")
df_multiple_dropped.show()

Expected Output

+-------+------+---+
|   name|salary|age|
+-------+------+---+
|  Alice|  5000| 25|
|    Bob|  6000| 30|
|Charlie|  7000| 35|
|  David|  8000| 40|
|    Eve|  9000| 45|
+-------+------+---+

Example 3: Trying to Drop a Non-Existent Column

# Dropping a column that does not exist ('gender')
df_non_existent = df.drop("gender")
df_non_existent.show()

Expected Output

+---+-------+------+----------+---+
| id|   name|salary|department|age|
+---+-------+------+----------+---+
|  1|  Alice|  5000|        IT| 25|
|  2|    Bob|  6000|        HR| 30|
|  3|Charlie|  7000|   Finance| 35|
|  4|  David|  8000|        IT| 40|
|  5|    Eve|  9000|        HR| 45|
+---+-------+------+----------+---+

Since the column 'gender' does not exist, drop() raises no error and returns a DataFrame with the same columns as the original.
