How to Use fillna() Function in PySpark
Author: Aamir Shahzad
Date: March 2025
Introduction
In this tutorial, we will learn how to handle missing or null values in PySpark DataFrames using the fillna()
function. Handling missing data is a critical part of data cleaning in data engineering workflows.
Why Use fillna() in PySpark?
- Replace NULL values in DataFrame columns with specific values.
- Apply different replacement values to different columns.
- Clean your dataset before analysis or feeding it into machine learning models.
Step 1: Import SparkSession and Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySparkFillnaFunction") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", None, 4000),
    ("Raza", "Marketing", None),
    (None, "Sales", 4500),
    ("Ali", None, None)
]
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|       null|  4000|
|        Raza|  Marketing|  null|
|        null|      Sales|  4500|
|         Ali|       null|  null|
+------------+-----------+------+
Step 3: Fill All NULL Values
Fill all NULL values with 'Unknown' for string columns and 0 for numeric columns. Each fillna() call only affects columns whose type matches the replacement value, which is why the two calls are chained.
df_fill_all = df.fillna("Unknown").fillna(0)
df_fill_all.show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|    Unknown|  4000|
|        Raza|  Marketing|     0|
|     Unknown|      Sales|  4500|
|         Ali|    Unknown|     0|
+------------+-----------+------+
Step 4: Fill NULLs with Column-Specific Values
df_fill_columns = df.fillna({
    "Department": "NA",
    "Salary": 10000
})
df_fill_columns.show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|         NA|  4000|
|        Raza|  Marketing| 10000|
|        null|      Sales|  4500|
|         Ali|         NA| 10000|
+------------+-----------+------+
Step 5: Fill NULLs in a Specific Column Only
df_fill_name = df.fillna("No Name", subset=["Name"])
df_fill_name.show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|       null|  4000|
|        Raza|  Marketing|  null|
|     No Name|      Sales|  4500|
|         Ali|       null|  null|
+------------+-----------+------+
Conclusion
Handling null and missing values is an essential part of data processing in PySpark. The fillna()
function provides a simple and flexible way to replace these values, ensuring your data is clean and ready for further analysis or modeling.