How to Use the filter() and where() Functions in PySpark | Step-by-Step Guide

Author: Aamir Shahzad

Date: March 2025

Introduction

In this tutorial, you will learn how to use the filter() and where() functions in PySpark to filter rows in a DataFrame. These functions are essential for data manipulation and play a critical role in transforming datasets for analysis or machine learning tasks.

Why Use filter() and where() in PySpark?

  • Both functions filter rows based on a condition.
  • They return a new DataFrame containing only the rows that satisfy the condition; the original DataFrame is left unchanged.
  • where() is an alias for filter(), so the two behave identically; use whichever reads better.

Step 1: Import SparkSession and Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkFilterFunction") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000)
]

columns = ["Name", "Department", "Salary"]

df = spark.createDataFrame(data, schema=columns)

df.show()

Expected Output

+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|      Sales|  4000|
|        Raza|  Marketing|  3500|
|         Bob|      Sales|  4200|
|        Lisa|Engineering|  6000|
+------------+-----------+------+
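
Before filtering, it can help to confirm the schema Spark inferred from the Python data; plain Python integers become long columns by default:

df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Salary: long (nullable = true)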

Step 3: filter() Example 1 - Filter Rows Where Department is 'Sales'

df.filter(df.Department == "Sales").show()

Expected Output

+----+----------+------+
|Name|Department|Salary|
+----+----------+------+
| Ali|     Sales|  4000|
| Bob|     Sales|  4200|
+----+----------+------+
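
The same condition can also be written with the col() helper from pyspark.sql.functions, which is handy when the DataFrame variable name is long or the column name contains special characters:

from pyspark.sql.functions import col

df.filter(col("Department") == "Sales").show()

This produces the same output as above.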

Step 4: filter() Example 2 - Filter Rows Where Salary is Greater Than 4000

df.filter(df.Salary > 4000).show()

Expected Output

+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Bob|      Sales|  4200|
|        Lisa|Engineering|  6000|
+------------+-----------+------+
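
filter() also accepts a SQL expression as a plain string, which some people find more readable for simple conditions:

df.filter("Salary > 4000").show()

The result is identical to the column-expression version above.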

Step 5: filter() Example 3 - Filter Rows with Multiple Conditions (AND)

df.filter((df.Department == "Engineering") & (df.Salary > 5000)).show()

Expected Output

+----+-----------+------+
|Name| Department|Salary|
+----+-----------+------+
|Lisa|Engineering|  6000|
+----+-----------+------+
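
Note that the parentheses around each condition are required: Python's & operator binds more tightly than ==, so omitting them raises an error. If you prefer, the same filter can be written as a single SQL expression string:

df.filter("Department = 'Engineering' AND Salary > 5000").show()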

Step 6: filter() Example 4 - Filter Rows with Multiple Conditions (OR)

df.filter((df.Department == "Sales") | (df.Salary > 5000)).show()

Expected Output

+----+-----------+------+
|Name| Department|Salary|
+----+-----------+------+
| Ali|      Sales|  4000|
| Bob|      Sales|  4200|
|Lisa|Engineering|  6000|
+----+-----------+------+
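
Conditions can also be negated, either with ~ (logical NOT) or with != for a simple inequality. For example, to keep every row outside the Sales department:

df.filter(~(df.Department == "Sales")).show()

df.filter(df.Department != "Sales").show()

Both return Amir Shahzad, Raza, and Lisa.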

Step 7: Bonus - filter() with isin() Function

df.filter(df.Name.isin("Amir Shahzad", "Raza")).show()

Expected Output

+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|        Raza|  Marketing|  3500|
+------------+-----------+------+
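
Combining isin() with ~ gives you the equivalent of a SQL NOT IN, keeping every row whose Name is not in the list:

df.filter(~df.Name.isin("Amir Shahzad", "Raza")).show()

This returns Ali, Bob, and Lisa.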

Step 8: Using where() Instead of filter()

You can use where() in the same way as filter().

df.where(df.Department == "Sales").show()

Expected Output

+----+----------+------+
|Name|Department|Salary|
+----+----------+------+
| Ali|     Sales|  4000|
| Bob|     Sales|  4200|
+----+----------+------+
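
Both functions can also be chained, and each call narrows the previous result. The following keeps only Sales employees earning more than 4000:

df.where(df.Department == "Sales").where(df.Salary > 4000).show()

Only Bob (Sales, 4200) satisfies both conditions.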

Conclusion

In PySpark, both filter() and where() are used to filter DataFrame rows based on given conditions. They are functionally identical, and you can use whichever makes your code more readable.

© 2025 Aamir Shahzad | PySpark Tutorials