How to Use dropna() Function in PySpark | Remove Null Values Easily | PySpark Tutorial

The dropna() function in PySpark removes rows that contain NULL (missing) values from a DataFrame. It is a common step in data preprocessing, ensuring your dataset is complete and ready for analysis.

Basic Syntax

DataFrame.dropna(how='any', thresh=None, subset=None)
  • how: 'any' (default) drops a row if any column is null; 'all' drops a row only if every column is null.
  • thresh: Minimum number of non-null values required to retain the row. When set, thresh overrides how.
  • subset: List of column names to consider when checking for nulls; other columns are ignored.

Sample Data

from pyspark.sql import SparkSession
from pyspark.sql import Row

# Initialize SparkSession
spark = SparkSession.builder.appName("DropNAExample").getOrCreate()

# Sample data
data = [
    Row(name='John', age=30, salary=None),
    Row(name='Alice', age=None, salary=5000),
    Row(name='Bob', age=40, salary=6000),
    Row(name=None, age=None, salary=None)
]

# Create DataFrame
df = spark.createDataFrame(data)

# Show original DataFrame
df.show()

Expected Output

+-----+----+------+
| name| age|salary|
+-----+----+------+
| John|  30|  null|
|Alice|null|  5000|
|  Bob|  40|  6000|
| null|null|  null|
+-----+----+------+

Example 1: Remove Rows with Any NULL Values

# Remove rows with any NULL values
df_clean_any = df.dropna()
df_clean_any.show()

Expected Output

+----+---+------+
|name|age|salary|
+----+---+------+
| Bob| 40|  6000|
+----+---+------+

Example 2: Remove Rows Where All Columns Are NULL

# Remove rows where all columns are NULL
df_clean_all = df.dropna(how='all')
df_clean_all.show()

Expected Output

+-----+----+------+
| name| age|salary|
+-----+----+------+
| John|  30|  null|
|Alice|null|  5000|
|  Bob|  40|  6000|
+-----+----+------+

Example 3: Remove Rows with NULLs in a Specific Subset of Columns

# Remove rows with NULLs in 'name' or 'salary'
df_clean_subset = df.dropna(subset=['name', 'salary'])
df_clean_subset.show()

Expected Output

+-----+----+------+
| name| age|salary|
+-----+----+------+
|Alice|null|  5000|
|  Bob|  40|  6000|
+-----+----+------+

Conclusion

The dropna() function in PySpark is a powerful tool for handling missing data efficiently. It allows you to clean your datasets by removing incomplete rows based on flexible criteria, ensuring the quality of your data pipelines.
