Welcome To TechBrothersIT

PySpark Tutorial: How to Use subtract() to Compare and Filter DataFrames

In this tutorial, you'll learn how to use the subtract() function in PySpark to find the rows that appear in one DataFrame but not in another. It is a simple way to compare and filter rows in large datasets.

What is subtract() in PySpark?

The subtract() function returns the rows that exist in the first DataFrame but not in the second. It is equivalent to EXCEPT DISTINCT in SQL, so duplicate rows are also removed from the result.

Both DataFrames must have the same number of columns with compatible types; columns are matched by position, not by name.
Commonly used to compare datasets or filter out rows.

Step 1: Create Spark Session

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for this example
spark = SparkSession.builder \
    .appName("PySpark Subtract Example") \
    .getOrCreate()

Step 2: Create Sample DataFrames

# First DataFrame
data1 = [
    (29, "Aamir Shahzad"),
    (35, "Ali Raza"),
    (40, "Bob"),
    (25, "Lisa")
]

columns = ["age", "name"]

df1 = spark.createDataFrame(data1, columns)

print("DataFrame 1:")
df1.show()

# Second DataFrame
data2 = [
    (40, "Bob"),
    (25, "Lisa")
]

df2 = spark.createDataFrame(data2, columns)

print("DataFrame 2:")
df2.show()

DataFrame 1:
+---+--------------+
|age|          name|
+---+--------------+
| 29| Aamir Shahzad|
| 35|      Ali Raza|
| 40|           Bob|
| 25|          Lisa|
+---+--------------+

DataFrame 2:
+---+----+
|age|name|
+---+----+
| 40| Bob|
| 25|Lisa|
+---+----+

Step 3: Using subtract() Function in PySpark

# Subtract df2 from df1
result_df = df1.subtract(df2)

print("Result after subtracting df2 from df1:")
result_df.show()

Expected Output:
+---+--------------+
|age|          name|
+---+--------------+
| 29| Aamir Shahzad|
| 35|      Ali Raza|
+---+--------------+

Why Use subtract()?

Helps identify differences between two DataFrames.
Useful for change data capture (CDC) or data validation.
Removes all rows from df1 that also exist in df2 (based on complete row match).
