How to Use the exceptAll() Function in PySpark
Author: Aamir Shahzad
Date: March 2025
Introduction
The exceptAll() function in PySpark returns the rows from one DataFrame that are not present in another DataFrame. Unlike subtract(), which behaves like EXCEPT DISTINCT and drops duplicates, exceptAll() keeps duplicate rows. This makes exceptAll() a powerful tool when you want to subtract one DataFrame from another while preserving duplicates.
Why Use exceptAll()?
- Removes records present in the second DataFrame from the first one.
- Keeps duplicates from the first DataFrame that don't have matches in the second.
- Useful when you need an exact subtraction while considering duplicates.
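To see the difference in duplicate handling, here is a minimal sketch that contrasts exceptAll() with subtract(); the data and app name below are made up for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExceptAllVsSubtract").getOrCreate()

# "x" appears twice on the left and once on the right
left = spark.createDataFrame([("x",), ("x",), ("y",)], ["value"])
right = spark.createDataFrame([("x",)], ["value"])

# exceptAll() removes only one "x", leaving one "x" and one "y"
left.exceptAll(right).show()

# subtract() behaves like EXCEPT DISTINCT: every "x" is removed, only "y" remains
left.subtract(right).show()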
Step 1: Import SparkSession and Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySparkExceptAllFunction") \
    .getOrCreate()
Step 2: Create Sample DataFrames
# Sample data for DataFrame 1
data1 = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000),
    ("Aamir Shahzad", "Engineering", 5000)
]
# Sample data for DataFrame 2
data2 = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Lisa", "Engineering", 6000)
]
columns = ["Name", "Department", "Salary"]
# Create DataFrames
df1 = spark.createDataFrame(data1, schema=columns)
df2 = spark.createDataFrame(data2, schema=columns)
df1.show()
Expected Output (df1)
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering| 5000|
| Ali| Sales| 4000|
| Raza| Marketing| 3500|
| Bob| Sales| 4200|
| Lisa|Engineering| 6000|
|Aamir Shahzad|Engineering| 5000|
+-------------+-----------+------+
df2.show()
Expected Output (df2)
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering| 5000|
| Lisa|Engineering| 6000|
+-------------+-----------+------+
Step 3: Apply exceptAll() Function
# Subtract df2 from df1 using exceptAll()
result_df = df1.exceptAll(df2)
result_df.show()
Expected Output (result_df)
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering| 5000|
| Ali| Sales| 4000|
| Raza| Marketing| 3500|
| Bob| Sales| 4200|
+-------------+-----------+------+
Explanation:
- df1 had two identical rows for (Aamir Shahzad, Engineering, 5000).
- df2 had one row for (Aamir Shahzad, Engineering, 5000), so only one of the two duplicates was removed.
- The row for Lisa was removed entirely since it appeared exactly once in both DataFrames.
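As a quick sanity check, assuming the df1, df2, and result_df created above, you can verify both the surviving duplicate and the fact that the subtraction is order-sensitive:
# One of the two duplicate "Aamir Shahzad" rows should remain in result_df
print(result_df.filter(result_df.Name == "Aamir Shahzad").count())  # expected: 1

# Order matters: every row of df2 also exists in df1, so the reverse subtraction is empty
print(df2.exceptAll(df1).count())  # expected: 0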
Conclusion
The exceptAll() function in PySpark is useful for subtracting one DataFrame from another while keeping duplicates intact. It's essential when you need to perform data comparison operations that consider multiple occurrences of records. Make sure the schemas of both DataFrames match when using this function.
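Because exceptAll() expects both DataFrames to have matching schemas, a small guard like the sketch below can fail fast on a mismatch before the subtraction runs; safe_except_all is an illustrative helper name, not part of the PySpark API:
# Hypothetical helper: confirm both DataFrames share the same column names and types
def safe_except_all(left_df, right_df):
    if left_df.dtypes != right_df.dtypes:
        raise ValueError(f"Schema mismatch: {left_df.dtypes} vs {right_df.dtypes}")
    return left_df.exceptAll(right_df)

result_df = safe_except_all(df1, df2)
result_df.show()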
Watch the Video Tutorial
For a complete walkthrough on using the exceptAll() function in PySpark, check out this video tutorial: