How to Use the distinct() Function in PySpark
The distinct() function in PySpark removes duplicate rows from a DataFrame. It returns a new DataFrame containing only unique rows, where two rows count as duplicates only if every column value matches. This makes it a valuable tool for data cleaning and analysis in big data workflows.
Example Dataset
id  name   department
1   Alice  IT
2   Bob    HR
9   Aamir  Finance
4   Alice  IT
5   Eve    HR
6   Frank  Finance
7   Bob    HR
8   Grace  IT
9   Aamir  Finance
Create DataFrame
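from pyspark.sql import SparkSession

# Reuse the active session if one exists (for example, in a notebook); otherwise create one.
spark = SparkSession.builder.appName("distinct-example").getOrCreate()

# Sample rows; (9, "Aamir", "Finance") appears twice, so it is an exact duplicate.
data = [
    (1, "Alice", "IT"),
    (2, "Bob", "HR"),
    (9, "Aamir", "Finance"),
    (4, "Alice", "IT"),
    (5, "Eve", "HR"),
    (6, "Frank", "Finance"),
    (7, "Bob", "HR"),
    (8, "Grace", "IT"),
    (9, "Aamir", "Finance"),
]

# Build the DataFrame with explicit column names and display it.
df = spark.createDataFrame(data, ["id", "name", "department"])
df.show()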
Removing Duplicate Rows using distinct()
df_distinct = df.distinct()
df_distinct.show()
Getting Unique Values from a Single Column
df.select("department").distinct().show()
Summary
The distinct() function in PySpark is useful when you need to remove duplicate rows from a DataFrame or get the unique values of a specific column. It is commonly used during data preprocessing and cleaning tasks in data engineering projects.