How to Use dropDuplicates() Function in PySpark
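The dropDuplicates() function in PySpark removes duplicate rows from a DataFrame. It returns a new DataFrame with one row for each distinct combination of values in the compared columns. Spark does not guarantee which row of a duplicate group is kept, so do not rely on it being the first occurrence.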
Sample Data
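from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; skip this in notebooks where `spark` already exists.
# The application name is only an illustrative label.
spark = SparkSession.builder.appName("DropDuplicatesExample").getOrCreate()

# Each tuple is (id, name, department, salary)
data = [
    (1, "Aamir Shahzad", "IT", 6000),
    (2, "Ali", "HR", 7000),
    (3, "Raza", "Finance", 8000),
    (4, "Aamir Shahzad", "IT", 6000),  # Same name, department, and salary as id 1
    (5, "Alice", "IT", 5000),
    (6, "Bob", "HR", 7000),
    (7, "Ali", "HR", 7000),  # Same name, department, and salary as id 2
    (8, "Charlie", "Finance", 9000),
    (9, "David", "IT", 6000),
    (10, "Eve", "HR", 7500)
]
df = spark.createDataFrame(data, ["id", "name", "department", "salary"])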
Show the Full DataFrame
df.show()
Expected Output
+---+-------------+----------+------+
| id|         name|department|salary|
+---+-------------+----------+------+
|  1|Aamir Shahzad|        IT|  6000|
|  2|          Ali|        HR|  7000|
|  3|         Raza|   Finance|  8000|
|  4|Aamir Shahzad|        IT|  6000|
|  5|        Alice|        IT|  5000|
|  6|          Bob|        HR|  7000|
|  7|          Ali|        HR|  7000|
|  8|      Charlie|   Finance|  9000|
|  9|        David|        IT|  6000|
| 10|          Eve|        HR|  7500|
+---+-------------+----------+------+
Example 1: Removing All Duplicate Rows
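# Remove duplicate rows (every column, including id, must match)
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()
Expected Output
Every row in the sample data has a unique id, so no two rows match on all columns. Nothing is removed and all 10 rows are returned (row order may vary). To treat ids 1 and 4, or ids 2 and 7, as duplicates, the id column has to be left out of the comparison, as in Example 2.
+---+-------------+----------+------+
| id|         name|department|salary|
+---+-------------+----------+------+
|  1|Aamir Shahzad|        IT|  6000|
|  2|          Ali|        HR|  7000|
|  3|         Raza|   Finance|  8000|
|  4|Aamir Shahzad|        IT|  6000|
|  5|        Alice|        IT|  5000|
|  6|          Bob|        HR|  7000|
|  7|          Ali|        HR|  7000|
|  8|      Charlie|   Finance|  9000|
|  9|        David|        IT|  6000|
| 10|          Eve|        HR|  7500|
+---+-------------+----------+------+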
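Whether Example 1 removes anything depends on whether any two rows agree in every column, id included. The following plain-Python sketch is not Spark code; it only models the comparison, reusing the sample tuples, and counts exact repeats with and without the id field.

```python
from collections import Counter

# Same rows as the sample data: (id, name, department, salary).
data = [
    (1, "Aamir Shahzad", "IT", 6000),
    (2, "Ali", "HR", 7000),
    (3, "Raza", "Finance", 8000),
    (4, "Aamir Shahzad", "IT", 6000),
    (5, "Alice", "IT", 5000),
    (6, "Bob", "HR", 7000),
    (7, "Ali", "HR", 7000),
    (8, "Charlie", "Finance", 9000),
    (9, "David", "IT", 6000),
    (10, "Eve", "HR", 7500),
]

# Full-row comparison: a row repeats only if every field, id included, matches.
full_row_counts = Counter(data)
full_row_duplicates = {row: n for row, n in full_row_counts.items() if n > 1}
print(full_row_duplicates)
# {}

# Dropping the id field (index 0) from the comparison exposes the repeats.
without_id_counts = Counter(row[1:] for row in data)
repeated_without_id = {row: n for row, n in without_id_counts.items() if n > 1}
print(repeated_without_id)
# {('Aamir Shahzad', 'IT', 6000): 2, ('Ali', 'HR', 7000): 2}
```

The empty first result is why calling dropDuplicates() with no arguments leaves this particular DataFrame unchanged.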
Example 2: Removing Duplicates Based on Specific Columns
# Removing duplicates based on 'name' and 'department'
df_no_duplicates_specific = df.dropDuplicates(["name", "department"])
df_no_duplicates_specific.show()
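Expected Output
Ids 1 and 4 share the same name and department, as do ids 2 and 7, so one row from each pair is dropped. Which row of each pair survives, and the order of the rows, is not guaranteed; a typical result is:
+---+-------------+----------+------+
| id|         name|department|salary|
+---+-------------+----------+------+
|  1|Aamir Shahzad|        IT|  6000|
|  2|          Ali|        HR|  7000|
|  3|         Raza|   Finance|  8000|
|  5|        Alice|        IT|  5000|
|  6|          Bob|        HR|  7000|
|  8|      Charlie|   Finance|  9000|
|  9|        David|        IT|  6000|
| 10|          Eve|        HR|  7500|
+---+-------------+----------+------+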
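The column-subset comparison in Example 2 can be modelled in plain Python. This is a sketch under stated assumptions, not Spark's implementation: the helper name drop_duplicates_first_seen is hypothetical, and it always keeps the first row seen for each key, which Spark itself does not promise.

```python
def drop_duplicates_first_seen(rows, columns, subset=None):
    """Model of key-based de-duplication (hypothetical helper, not a Spark API).

    Keeps the first row encountered for each distinct key. The key is built
    from `subset` if given, otherwise from all columns.
    """
    key_columns = subset if subset is not None else columns
    key_indexes = [columns.index(name) for name in key_columns]
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


columns = ["id", "name", "department", "salary"]
data = [
    (1, "Aamir Shahzad", "IT", 6000),
    (2, "Ali", "HR", 7000),
    (3, "Raza", "Finance", 8000),
    (4, "Aamir Shahzad", "IT", 6000),
    (5, "Alice", "IT", 5000),
    (6, "Bob", "HR", 7000),
    (7, "Ali", "HR", 7000),
    (8, "Charlie", "Finance", 9000),
    (9, "David", "IT", 6000),
    (10, "Eve", "HR", 7500),
]

# Compare on name and department only, as in Example 2.
result = drop_duplicates_first_seen(data, columns, ["name", "department"])
kept_ids = [row[0] for row in result]
print(kept_ids)
# [1, 2, 3, 5, 6, 8, 9, 10]

# With no subset, every column (id included) forms the key, so nothing is dropped.
all_columns_result = drop_duplicates_first_seen(data, columns)
print(len(all_columns_result))
# 10
```

Ids 4 and 7 are dropped because their (name, department) keys were already seen on ids 1 and 2.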
Summary
- dropDuplicates() removes duplicate rows from a DataFrame.
- When columns are specified, it keeps only unique rows based on those columns.
- If no columns are specified, it compares all columns in the DataFrame.