How to Use fillna() Function in PySpark | Step-by-Step Guide

Author: Aamir Shahzad

Date: March 2025

Introduction

In this tutorial, we will learn how to handle missing or null values in PySpark DataFrames using the fillna() function. Handling missing data is a critical part of data cleaning in data engineering workflows.

Why Use fillna() in PySpark?

  • Replace NULL values in DataFrame columns with specific values.
  • Apply different replacement values to different columns.
  • Clean your dataset before analysis or feeding it into machine learning models.

Step 1: Import SparkSession and Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkFillnaFunction") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", None, 4000),
    ("Raza", "Marketing", None),
    (None, "Sales", 4500),
    ("Ali", None, None)
]

columns = ["Name", "Department", "Salary"]

df = spark.createDataFrame(data, schema=columns)

df.show()

Expected Output

+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|       null|  4000|
|        Raza|  Marketing|  null|
|        null|      Sales|  4500|
|         Ali|       null|  null|
+------------+-----------+------+

Step 3: Fill All NULL Values

Because fillna() only applies a replacement value to columns of a matching type, chain two calls: "Unknown" fills the string columns (Name, Department) and 0 fills the numeric column (Salary).

df_fill_all = df.fillna("Unknown").fillna(0)

df_fill_all.show()

Expected Output

+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|    Unknown|  4000|
|        Raza|  Marketing|     0|
|     Unknown|      Sales|  4500|
|         Ali|    Unknown|     0|
+------------+-----------+------+

Step 4: Fill NULLs with Column-Specific Values

df_fill_columns = df.fillna({
    "Department": "NA",
    "Salary": 10000
})

df_fill_columns.show()

Expected Output

+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|         NA|  4000|
|        Raza|  Marketing| 10000|
|        null|      Sales|  4500|
|         Ali|         NA| 10000|
+------------+-----------+------+

Step 5: Fill NULLs in a Specific Column Only

df_fill_name = df.fillna("No Name", subset=["Name"])

df_fill_name.show()

Expected Output

+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|       null|  4000|
|        Raza|  Marketing|  null|
|     No Name|      Sales|  4500|
|         Ali|       null|  null|
+------------+-----------+------+

Conclusion

Handling null and missing values is an essential part of data processing in PySpark. The fillna() function provides a simple and flexible way to replace these values, ensuring your data is clean and ready for further analysis or modeling.
