PySpark Tutorial: na Functions and isEmpty Explained with Examples


Author: Aamir Shahzad

Published on: March 2025

Introduction

In this blog post, you’ll learn how to use na() and isEmpty() functions in PySpark to handle missing data and validate whether a DataFrame is empty. These functions are crucial for data preprocessing and validation in big data pipelines.

What is na() in PySpark?

Strictly speaking, na is a property on DataFrame (accessed as df.na, without parentheses) that returns a DataFrameNaFunctions object for handling null values. Its most common methods are:

  • fill() - Replace null values with a specified value.
  • drop() - Remove rows containing null values.
  • replace() - Replace specific values.

Example: Using na() Function

from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("PySpark_na_and_isEmpty").getOrCreate()

# Sample data with nulls
data = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Ali", None, 4000),
    ("Raza", "Marketing", None),
    ("Bob", "Sales", 4200),
    ("Lisa", None, None)
]

columns = ["Name", "Department", "Salary"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show original DataFrame
df.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|       null|  4000|
|         Raza|  Marketing|  null|
|          Bob|      Sales|  4200|
|         Lisa|       null|  null|
+-------------+-----------+------+

Fill null values in Department and Salary columns

df_filled = df.na.fill({
    "Department": "Not Assigned",
    "Salary": 0
})

df_filled.show()

Expected Output

+-------------+------------+------+
|         Name|  Department|Salary|
+-------------+------------+------+
|Aamir Shahzad| Engineering|  5000|
|          Ali|Not Assigned|  4000|
|         Raza|   Marketing|     0|
|          Bob|       Sales|  4200|
|         Lisa|Not Assigned|     0|
+-------------+------------+------+

Drop rows with any null values

df_dropped = df.na.drop()
df_dropped.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Bob|      Sales|  4200|
+-------------+-----------+------+

Replace a specific value

df_replaced = df.na.replace("Sales", "Business Development")
df_replaced.show()

Expected Output

+-------------+--------------------+------+
|         Name|          Department|Salary|
+-------------+--------------------+------+
|Aamir Shahzad|         Engineering|  5000|
|          Ali|                null|  4000|
|         Raza|           Marketing|  null|
|          Bob|Business Development|  4200|
|         Lisa|                null|  null|
+-------------+--------------------+------+

What is isEmpty() in PySpark?

The isEmpty() function (added to DataFrame in Spark 3.3) returns True when a DataFrame has no rows. This is helpful for validating the results of filters, joins, or transformations.

Example: Using isEmpty() Function

# Filter rows with Salary greater than 10000
df_filtered = df.filter(df.Salary > 10000)

# Check if DataFrame is empty
if df_filtered.isEmpty():
    print("The DataFrame is empty!")
else:
    df_filtered.show()

Expected Output

The DataFrame is empty!

Explanation: There are no rows in the DataFrame where Salary > 10000, so isEmpty() returns True.

Watch the Video Tutorial

For a complete walkthrough of the na() and isEmpty() functions in PySpark, check out the video tutorial below:

© 2025 Aamir Shahzad. All rights reserved.