PySpark crossJoin() Function Tutorial
Cartesian Product of DataFrames in PySpark
In this tutorial, you will learn how to use the crossJoin() function in PySpark to generate the Cartesian product of two DataFrames. This operation pairs every row of the first DataFrame with every row of the second, so the result contains (rows in the first) x (rows in the second) rows and all columns from both DataFrames.
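To make the pairing concrete before bringing in Spark, here is a minimal plain-Python sketch of the same idea using itertools.product. It reuses the sample rows that Step 2 of this tutorial puts into DataFrames; the list names people, hobbies, and pairs are illustrative only. It mimics the row pairing that crossJoin() produces, not how Spark executes the join internally.

```python
from itertools import product

# Same sample rows the tutorial loads into DataFrames in Step 2
people = [
    ("Aamir Shahzad", "Pakistan"),
    ("Ali Raza", "USA"),
    ("Bob", "UK"),
    ("Lisa", "Canada"),
]
hobbies = [
    ("Reading",),
    ("Traveling",),
    ("Cricket",),
]

# Pair every people row with every hobbies row and concatenate the tuples,
# which yields one combined (Name, Country, Hobby) row per pair
pairs = [person + hobby for person, hobby in product(people, hobbies)]

print(len(pairs))  # 4 people x 3 hobbies = 12 rows
for row in pairs:
    print(row)
```

Each person appears once per hobby, which is the same shape as the crossJoin() output shown in Step 3.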
Step 1: Import and Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark crossJoin Example") \
    .getOrCreate()
Step 2: Create Sample Data
# DataFrame 1: People
data_people = [
    ("Aamir Shahzad", "Pakistan"),
    ("Ali Raza", "USA"),
    ("Bob", "UK"),
    ("Lisa", "Canada")
]
df_people = spark.createDataFrame(data_people, ["Name", "Country"])
# DataFrame 2: Hobbies
data_hobbies = [
    ("Reading",),
    ("Traveling",),
    ("Cricket",)
]
df_hobbies = spark.createDataFrame(data_hobbies, ["Hobby"])
People DataFrame Output:
Name | Country
------------------------
Aamir Shahzad | Pakistan
Ali Raza | USA
Bob | UK
Lisa | Canada
Hobbies DataFrame Output:
Hobby
------
Reading
Traveling
Cricket
Step 3: Perform crossJoin()
# Perform Cartesian join
cross_join_result = df_people.crossJoin(df_hobbies)
# Show result
cross_join_result.show(truncate=False)
Output:
Name | Country | Hobby
------------------------------------
Aamir Shahzad | Pakistan | Reading
Aamir Shahzad | Pakistan | Traveling
Aamir Shahzad | Pakistan | Cricket
Ali Raza | USA | Reading
Ali Raza | USA | Traveling
Ali Raza | USA | Cricket
Bob | UK | Reading
Bob | UK | Traveling
Bob | UK | Cricket
Lisa | Canada | Reading
Lisa | Canada | Traveling
Lisa | Canada | Cricket