How to Use the filter() and where() Functions in PySpark
Author: Aamir Shahzad
Date: March 2025
Introduction
In this tutorial, you will learn how to use the filter() and where() functions in PySpark to filter rows in a DataFrame. These functions are essential for data manipulation and play a critical role in transforming datasets for analysis or machine learning tasks.
Why Use filter() and where() in PySpark?
- Both functions are used to filter rows based on a condition.
- They return a new DataFrame that satisfies the given condition.
- filter() and where() are equivalent and work the same way in PySpark (see the quick preview below).
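As a quick preview of that equivalence, here is a minimal sketch (assuming a DataFrame df with a Salary column, like the one built in Step 2 below); both calls return exactly the same rows:

from pyspark.sql.functions import col

# where() is an alias of filter(); both lines express the same condition
high_earners_a = df.filter(col("Salary") > 4000)
high_earners_b = df.where(col("Salary") > 4000)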
Step 1: Import SparkSession and Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySparkFilterFunction") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000)
]
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|      Sales|  4000|
|        Raza|  Marketing|  3500|
|         Bob|      Sales|  4200|
|        Lisa|Engineering|  6000|
+------------+-----------+------+
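Passing a plain list of column names, as above, lets Spark infer the column types from the data. If you want explicit types instead, you can supply a schema; a minimal sketch using the same data:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema: same column names, but with declared types
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True)
])

df = spark.createDataFrame(data, schema=schema)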
Step 3: filter() Example 1 - Filter Rows Where Department is 'Sales'
df.filter(df.Department == "Sales").show()
Expected Output
+----+----------+------+
|Name|Department|Salary|
+----+----------+------+
| Ali|     Sales|  4000|
| Bob|     Sales|  4200|
+----+----------+------+
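filter() also accepts a SQL expression string, which some find more readable for simple conditions. This sketch is equivalent to the example above:

# Same result as df.filter(df.Department == "Sales"), written as a SQL string
df.filter("Department = 'Sales'").show()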
Step 4: filter() Example 2 - Filter Rows Where Salary is Greater Than 4000
df.filter(df.Salary > 4000).show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Bob|      Sales|  4200|
|        Lisa|Engineering|  6000|
+------------+-----------+------+
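Instead of referencing the DataFrame variable in the condition, you can build the same column expression with the col() function; this sketch is equivalent to the example above:

from pyspark.sql.functions import col

# Equivalent to df.filter(df.Salary > 4000)
df.filter(col("Salary") > 4000).show()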
Step 5: filter() Example 3 - Filter Rows with Multiple Conditions (AND)
df.filter((df.Department == "Engineering") & (df.Salary > 5000)).show()
Expected Output
+----+-----------+------+
|Name| Department|Salary|
+----+-----------+------+
|Lisa|Engineering|  6000|
+----+-----------+------+
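Note that each condition is wrapped in parentheses because & binds more tightly than comparison operators in Python, and that column expressions use &, |, and ~ rather than and, or, and not. As a small sketch, ~ negates a condition:

# Everyone NOT in Engineering; ~ is logical NOT for column expressions
df.filter(~(df.Department == "Engineering")).show()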
Step 6: filter() Example 4 - Filter Rows with Multiple Conditions (OR)
df.filter((df.Department == "Sales") | (df.Salary > 5000)).show()
Expected Output
+----+-----------+------+
|Name| Department|Salary|
+----+-----------+------+
| Ali|      Sales|  4000|
| Bob|      Sales|  4200|
|Lisa|Engineering|  6000|
+----+-----------+------+
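The same OR condition can also be written as a SQL expression string if you prefer; a sketch equivalent to the example above:

# Equivalent SQL-string form of the OR filter
df.filter("Department = 'Sales' OR Salary > 5000").show()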
Step 7: Bonus - filter() with isin() Function
df.filter(df.Name.isin("Amir Shahzad", "Raza")).show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|        Raza|  Marketing|  3500|
+------------+-----------+------+
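isin() pairs naturally with ~ when you want every row except the listed values; a small sketch:

# All rows whose Name is NOT in the given list
df.filter(~df.Name.isin("Amir Shahzad", "Raza")).show()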
Step 8: Using where() Instead of filter()
You can use where() in the same way as filter().
df.where(df.Department == "Sales").show()
Expected Output
+----+----------+------+
|Name|Department|Salary|
+----+----------+------+
| Ali|     Sales|  4000|
| Bob|     Sales|  4200|
+----+----------+------+
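Because each call returns a new DataFrame, where() (or filter()) calls can be chained, which behaves like combining the conditions with AND; a sketch:

# Chained calls: equivalent to (Department == 'Sales') & (Salary > 4000)
df.where(df.Department == "Sales").where(df.Salary > 4000).show()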
Conclusion
In PySpark, both filter() and where() are used to filter DataFrame rows based on given conditions. They are functionally identical, and you can use whichever makes your code more readable.
Watch the Video Tutorial
For a complete walkthrough of the filter() and where() functions in PySpark, check out this video: