How to Use na() and isEmpty() Functions in PySpark
Author: Aamir Shahzad
Published on: March 2025
Introduction
In this blog post, you'll learn how to use the na() and isEmpty() functions in PySpark to handle missing data and to check whether a DataFrame is empty. Both are essential for data preprocessing and validation in big data pipelines.
What is na() in PySpark?
The na() function returns a DataFrameNaFunctions object, which provides methods for handling null values in a DataFrame. Common methods include:
fill() - Replace null values with a specified value.
drop() - Remove rows containing null values.
replace() - Replace specific values.
Example: Using na() Function
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("PySpark_na_and_isEmpty").getOrCreate()
# Sample data with nulls
data = [
("Aamir Shahzad", "Engineering", 5000),
("Ali", None, 4000),
("Raza", "Marketing", None),
("Bob", "Sales", 4200),
("Lisa", None, None)
]
columns = ["Name", "Department", "Salary"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show original DataFrame
df.show()
Expected Output
+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|       null|  4000|
|         Raza|  Marketing|  null|
|          Bob|      Sales|  4200|
|         Lisa|       null|  null|
+-------------+-----------+------+
Fill null values in Department and Salary columns
df_filled = df.na.fill({
"Department": "Not Assigned",
"Salary": 0
})
df_filled.show()
Expected Output
+-------------+------------+------+
|         Name|  Department|Salary|
+-------------+------------+------+
|Aamir Shahzad| Engineering|  5000|
|          Ali|Not Assigned|  4000|
|         Raza|   Marketing|     0|
|          Bob|       Sales|  4200|
|         Lisa|Not Assigned|     0|
+-------------+------------+------+
Drop rows with any null values
df_dropped = df.na.drop()
df_dropped.show()
Expected Output
+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Bob|      Sales|  4200|
+-------------+-----------+------+
Replace a specific value
df_replaced = df.na.replace("Sales", "Business Development")
df_replaced.show()
Expected Output
+-------------+--------------------+------+
|         Name|          Department|Salary|
+-------------+--------------------+------+
|Aamir Shahzad|         Engineering|  5000|
|          Ali|                null|  4000|
|         Raza|           Marketing|  null|
|          Bob|Business Development|  4200|
|         Lisa|                null|  null|
+-------------+--------------------+------+
What is isEmpty() in PySpark?
The isEmpty() function checks whether a DataFrame is empty (has no rows). This is helpful for validating the results of filters, joins, or transformations.
Example: Using isEmpty() Function
# Filter rows with Salary greater than 10000
df_filtered = df.filter(df.Salary > 10000)
# Check if the DataFrame is empty
if df_filtered.isEmpty():
    print("The DataFrame is empty!")
else:
    df_filtered.show()
Expected Output
The DataFrame is empty!
Explanation: There are no rows in the DataFrame where Salary > 10000, so isEmpty() returns True.
Watch the Video Tutorial
For a complete walkthrough of the na() and isEmpty() functions in PySpark, check out the video tutorial below: