PySpark crossJoin() Function | Cartesian Product of DataFrames

PySpark crossJoin() Function Tutorial

Cartesian Product of DataFrames in PySpark

In this tutorial, you will learn how to use the crossJoin() function in PySpark to generate a Cartesian product of two DataFrames. The operation pairs every row of the first DataFrame with every row of the second, so crossing a DataFrame of m rows with one of n rows yields m × n rows.
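To make the idea concrete before bringing in Spark, here is a minimal plain-Python sketch of a Cartesian product using itertools.product. The two lists are illustrative stand-ins assumed for this sketch; they are not Spark DataFrames and are not part of the tutorial's own code.

```python
from itertools import product

# Illustrative stand-in data (assumed for this sketch, not Spark DataFrames)
people = ["Bob", "Lisa"]
hobbies = ["Reading", "Traveling", "Cricket"]

# Pair every person with every hobby
pairs = list(product(people, hobbies))

# The number of pairs equals len(people) * len(hobbies)
print(len(pairs))  # 6

for person, hobby in pairs:
    print(person, hobby)
```

crossJoin() applies the same pairing to DataFrame rows rather than list elements.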

Step 1: Import and Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark crossJoin Example") \
    .getOrCreate()

Step 2: Create Sample Data

# DataFrame 1: People
data_people = [
    ("Aamir Shahzad", "Pakistan"),
    ("Ali Raza", "USA"),
    ("Bob", "UK"),
    ("Lisa", "Canada")
]
df_people = spark.createDataFrame(data_people, ["Name", "Country"])

# DataFrame 2: Hobbies
data_hobbies = [
    ("Reading",),
    ("Traveling",),
    ("Cricket",)
]
df_hobbies = spark.createDataFrame(data_hobbies, ["Hobby"])

People DataFrame Output:

Name           Country
-------------  --------
Aamir Shahzad  Pakistan
Ali Raza       USA
Bob            UK
Lisa           Canada

Hobbies DataFrame Output:

Hobby
------
Reading
Traveling
Cricket

Step 3: Perform crossJoin()

# Perform Cartesian join
cross_join_result = df_people.crossJoin(df_hobbies)

# Show result
cross_join_result.show(truncate=False)

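Output (12 rows; row order may vary):

Name           Country   Hobby
-------------  --------  ---------
Aamir Shahzad  Pakistan  Reading
Aamir Shahzad  Pakistan  Traveling
Aamir Shahzad  Pakistan  Cricket
Ali Raza       USA       Reading
Ali Raza       USA       Traveling
Ali Raza       USA       Cricket
Bob            UK        Reading
Bob            UK        Traveling
Bob            UK        Cricket
Lisa           Canada    Reading
Lisa           Canada    Traveling
Lisa           Canada    Cricket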
