How to Use the dropDuplicates() Function in PySpark | PySpark Tutorial

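The dropDuplicates() function in PySpark removes duplicate rows from a DataFrame and returns a new DataFrame; the original is left unchanged. Called without arguments, it compares every column. Called with a list of column names, it compares only those columns and keeps one row per distinct combination. For a batch DataFrame, Spark does not guarantee which row of a duplicate group is kept, so do not rely on the first occurrence surviving.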

Sample Data

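from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; skip this in notebooks that already define `spark`
spark = SparkSession.builder.appName("dropDuplicatesExample").getOrCreate()

data = [
    (1, "Aamir Shahzad", "IT", 6000),
    (2, "Ali", "HR", 7000),
    (3, "Raza", "Finance", 8000),
    (4, "Aamir Shahzad", "IT", 6000),  # Same as id 1 except for the id
    (5, "Alice", "IT", 5000),
    (6, "Bob", "HR", 7000),
    (7, "Ali", "HR", 7000),            # Same as id 2 except for the id
    (8, "Charlie", "Finance", 9000),
    (9, "David", "IT", 6000),
    (10, "Eve", "HR", 7500)
]

df = spark.createDataFrame(data, ["id", "name", "department", "salary"])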

Show the Full DataFrame

df.show()

Expected Output

+---+-------------+----------+------+
| id|         name|department|salary|
+---+-------------+----------+------+
|  1|Aamir Shahzad|        IT|  6000|
|  2|          Ali|        HR|  7000|
|  3|         Raza|   Finance|  8000|
|  4|Aamir Shahzad|        IT|  6000|
|  5|        Alice|        IT|  5000|
|  6|          Bob|        HR|  7000|
|  7|          Ali|        HR|  7000|
|  8|      Charlie|   Finance|  9000|
|  9|        David|        IT|  6000|
| 10|          Eve|        HR|  7500|
+---+-------------+----------+------+

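Example 1: Removing Duplicate Rows Across All Columns

# With no arguments, every column (including 'id') must match
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.orderBy("id").show()

Expected Output

Every id in the sample data is unique, so no two rows match on all columns and all 10 rows are kept. Rows 4 and 7 repeat earlier rows only in name, department, and salary; Example 2 shows how to compare a subset of columns instead.

+---+-------------+----------+------+
| id|         name|department|salary|
+---+-------------+----------+------+
|  1|Aamir Shahzad|        IT|  6000|
|  2|          Ali|        HR|  7000|
|  3|         Raza|   Finance|  8000|
|  4|Aamir Shahzad|        IT|  6000|
|  5|        Alice|        IT|  5000|
|  6|          Bob|        HR|  7000|
|  7|          Ali|        HR|  7000|
|  8|      Charlie|   Finance|  9000|
|  9|        David|        IT|  6000|
| 10|          Eve|        HR|  7500|
+---+-------------+----------+------+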

Example 2: Removing Duplicates Based on Specific Columns

# Removing duplicates based on 'name' and 'department'
df_no_duplicates_specific = df.dropDuplicates(["name", "department"])
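df_no_duplicates_specific.orderBy("id").show()

Expected Output

One row remains for each (name, department) pair, 8 rows in total. Spark does not guarantee which row of a duplicate group survives, so ids 4 and 7 may appear in place of 1 and 2.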

+---+-------------+----------+------+
| id|         name|department|salary|
+---+-------------+----------+------+
|  1|Aamir Shahzad|        IT|  6000|
|  2|          Ali|        HR|  7000|
|  3|         Raza|   Finance|  8000|
|  5|        Alice|        IT|  5000|
|  6|          Bob|        HR|  7000|
|  8|      Charlie|   Finance|  9000|
|  9|        David|        IT|  6000|
| 10|          Eve|        HR|  7500|
+---+-------------+----------+------+

Summary

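  • dropDuplicates() returns a new DataFrame with duplicate rows removed; the original DataFrame is unchanged.
  • If no columns are specified, it compares all columns, so rows that differ only in a unique id are not treated as duplicates.
  • When columns are specified, it keeps one row for each distinct combination of values in those columns.
  • Spark does not guarantee which row of a duplicate group is kept.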
