How to Use the drop() Function in PySpark
The drop() function in PySpark removes one or more columns from a DataFrame. It returns a new DataFrame and leaves the original unchanged, which makes it useful for data cleaning and transformation tasks.
Sample Data
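from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in a notebook); otherwise create one
spark = SparkSession.builder.appName("drop-example").getOrCreate()

# Sample employee records: (id, name, salary, department, age)
data = [
    (1, "Alice", 5000, "IT", 25),
    (2, "Bob", 6000, "HR", 30),
    (3, "Charlie", 7000, "Finance", 35),
    (4, "David", 8000, "IT", 40),
    (5, "Eve", 9000, "HR", 45),
]
df = spark.createDataFrame(data, ["id", "name", "salary", "department", "age"])
df.show()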
Expected Output
+---+-------+------+----------+---+
| id|   name|salary|department|age|
+---+-------+------+----------+---+
|  1|  Alice|  5000|        IT| 25|
|  2|    Bob|  6000|        HR| 30|
|  3|Charlie|  7000|   Finance| 35|
|  4|  David|  8000|        IT| 40|
|  5|    Eve|  9000|        HR| 45|
+---+-------+------+----------+---+
Example 1: Dropping a Single Column
# Dropping the 'age' column
df_dropped = df.drop("age")
df_dropped.show()
Expected Output
+---+-------+------+----------+
| id|   name|salary|department|
+---+-------+------+----------+
|  1|  Alice|  5000|        IT|
|  2|    Bob|  6000|        HR|
|  3|Charlie|  7000|   Finance|
|  4|  David|  8000|        IT|
|  5|    Eve|  9000|        HR|
+---+-------+------+----------+
Example 2: Dropping Multiple Columns
# Dropping 'id' and 'department' columns
df_multiple_dropped = df.drop("id", "department")
df_multiple_dropped.show()
Expected Output
+-------+------+---+
|   name|salary|age|
+-------+------+---+
|  Alice|  5000| 25|
|    Bob|  6000| 30|
|Charlie|  7000| 35|
|  David|  8000| 40|
|    Eve|  9000| 45|
+-------+------+---+
Example 3: Trying to Drop a Non-Existent Column
# Dropping a column that does not exist ('gender')
df_non_existent = df.drop("gender")
df_non_existent.show()
Expected Output
+---+-------+------+----------+---+
| id|   name|salary|department|age|
+---+-------+------+----------+---+
|  1|  Alice|  5000|        IT| 25|
|  2|    Bob|  6000|        HR| 30|
|  3|Charlie|  7000|   Finance| 35|
|  4|  David|  8000|        IT| 40|
|  5|    Eve|  9000|        HR| 45|
+---+-------+------+----------+---+
Since the column 'gender' does not exist, drop() raises no error and returns the DataFrame unchanged.