Use distinct() to Remove Duplicates from DataFrames | Get Unique Rows with distinct() in PySpark

How to Use distinct() Function in PySpark

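The distinct() function in PySpark removes duplicate rows from a DataFrame. It compares every column, so two rows count as duplicates only when all of their values match. It returns a new DataFrame containing the unique rows and leaves the original DataFrame unchanged, which makes it useful for data cleaning and analysis in big data workflows.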

Example Dataset

id  name    department
1   Alice   IT
2   Bob     HR
9   Aamir   Finance
4   Alice   IT
5   Eve     HR
6   Frank   Finance
7   Bob     HR
8   Grace   IT
9   Aamir   Finance

Create DataFrame

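from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in a notebook); otherwise create one.
spark = SparkSession.builder.appName("distinct-example").getOrCreate()

data = [
    (1, "Alice", "IT"),
    (2, "Bob", "HR"),
    (9, "Aamir", "Finance"),
    (4, "Alice", "IT"),
    (5, "Eve", "HR"),
    (6, "Frank", "Finance"),
    (7, "Bob", "HR"),
    (8, "Grace", "IT"),
    (9, "Aamir", "Finance")  # exact duplicate of the third row
]

# Column names are passed as a list; Spark infers the column types from the data.
df = spark.createDataFrame(data, ["id", "name", "department"])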

df.show()

Removing Duplicate Rows using distinct()

df_distinct = df.distinct()

df_distinct.show()

Getting Unique Values from a Single Column

df.select("department").distinct().show()

Summary

Use distinct() to remove rows that are identical across all columns, or combine it with select() to get the unique values of a specific column. It is commonly used during data preprocessing and cleaning tasks in data engineering projects.
