How to Use Intersect() and IntersectAll() in PySpark | Compare DataFrames Easily | PySpark Tutorial

How to Use Intersect() and IntersectAll() Functions in PySpark | Compare DataFrames Easily

How to Use Intersect() and IntersectAll() Functions in PySpark

Author: Aamir Shahzad

Date: March 2025

Introduction

In this tutorial, you will learn how to use the intersect() and intersectAll() functions in PySpark to find common rows between two DataFrames. The intersect() function returns distinct common rows, while intersectAll() returns all common rows, including duplicates.

Why Use Intersect() and IntersectAll()?

  • intersect(): Returns distinct rows that are present in both DataFrames.
  • intersectAll(): Returns all rows, including duplicates, that are present in both DataFrames.
  • Helpful for comparing datasets and finding overlaps.

Step 1: Import SparkSession and Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkIntersectFunction") \
    .getOrCreate()

Step 2: Create Sample DataFrames

# DataFrame 1
data1 = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000),
    ("Aamir Shahzad", "Engineering", 5000)  # duplicate row
]

# DataFrame 2
data2 = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Lisa", "Engineering", 6000)
]

columns = ["Name", "Department", "Salary"]

df1 = spark.createDataFrame(data1, schema=columns)
df2 = spark.createDataFrame(data2, schema=columns)

df1.show()

Expected Output (df1)

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|      Sales|  4000|
|         Raza|  Marketing|  3500|
|          Bob|      Sales|  4200|
|         Lisa|Engineering|  6000|
|Aamir Shahzad|Engineering|  5000|
+-------------+-----------+------+
df2.show()

Expected Output (df2)

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|         Lisa|Engineering|  6000|
+-------------+-----------+------+

Step 3: Using intersect()

# Returns only distinct common rows
intersect_df = df1.intersect(df2)

intersect_df.show()

Expected Output (intersect_df)

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|         Lisa|Engineering|  6000|
+-------------+-----------+------+

Note: Even though there is a duplicate row in df1 for "Aamir Shahzad", intersect() returns distinct matches only.

Step 4: Using intersectAll()

# Returns all common rows, including duplicates
intersect_all_df = df1.intersectAll(df2)

intersect_all_df.show()

Expected Output (intersectAll_df)

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|Aamir Shahzad|Engineering|  5000|
|         Lisa|Engineering|  6000|
+-------------+-----------+------+

Explanation: - df1 has two records for "Aamir Shahzad". - df2 has one record for "Aamir Shahzad". - intersectAll() keeps duplicates from df1 that match the rows in df2.

Conclusion

The intersect() and intersectAll() functions in PySpark are useful for comparing two DataFrames and finding common rows. Use intersect() when you only need distinct matches, and intersectAll() when you want to include duplicates for more detailed comparisons.

Watch the Video Tutorial

For a complete walkthrough on using intersect() and intersectAll() functions in PySpark, watch the video below:

© 2025 Aamir Shahzad | PySpark Tutorials