PySpark Tutorial: unionByName() Function for Combining DataFrames
This tutorial demonstrates how to use the unionByName() function in PySpark to combine two DataFrames by matching column names.
What is unionByName() in PySpark?
The unionByName() function appends the rows of one DataFrame to another, aligning columns by name rather than by position, so the column order of the two DataFrames does not need to match.
- Column names must match in both DataFrames; by default, a column that exists in only one of them raises an error
- Useful when schemas are the same but column orders differ
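To make the name-based alignment concrete, here is a minimal plain-Python model of the idea. It is only an illustration, not Spark's implementation: the helper `union_by_name` and the list-of-tuples representation are assumptions made for this sketch.

```python
def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Toy model: reorder rows_b into cols_a's column order, then append."""
    if set(cols_a) != set(cols_b):
        raise ValueError("column names must match")
    # For each column of the first table, find its position in the second.
    positions = [cols_b.index(name) for name in cols_a]
    reordered = [tuple(row[i] for i in positions) for row in rows_b]
    return cols_a, rows_a + reordered


cols, rows = union_by_name(
    ["Name", "Country", "Age"], [("Ali Raza", "USA", 30)],
    ["Name", "Age", "Country"], [("Bob", 45, "UK")],
)
print(cols)
print(rows)
```

The second table's values are moved into the first table's column order before the rows are appended, which is the behavior the tutorial relies on.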
Step 1: Create a Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark unionByName Example").getOrCreate()
Step 2: Create Sample DataFrames
# DataFrame 1
data1 = [("Aamir Shahzad", "Pakistan", 25),
         ("Ali Raza", "USA", 30)]
df1 = spark.createDataFrame(data1, ["Name", "Country", "Age"])
# DataFrame 2 (same columns, different column order)
data2 = [("Bob", 45, "UK"),
         ("Lisa", 35, "Canada")]
df2 = spark.createDataFrame(data2, ["Name", "Age", "Country"])
DataFrame 1:
+-------------+--------+---+
|         Name| Country|Age|
+-------------+--------+---+
|Aamir Shahzad|Pakistan| 25|
|     Ali Raza|     USA| 30|
+-------------+--------+---+
DataFrame 2:
+----+---+-------+
|Name|Age|Country|
+----+---+-------+
| Bob| 45|     UK|
|Lisa| 35| Canada|
+----+---+-------+
Step 3: Use unionByName() to Combine DataFrames
union_df = df1.unionByName(df2)
Step 4: Show Result
print("Union Result:")
union_df.show()
Union Result:
+-------------+--------+---+
|         Name| Country|Age|
+-------------+--------+---+
|Aamir Shahzad|Pakistan| 25|
|     Ali Raza|     USA| 30|
|          Bob|      UK| 45|
|         Lisa|  Canada| 35|
+-------------+--------+---+
Why Use unionByName()?
- Safer alternative to union() when column orders might differ
- Prevents data from being mismatched due to incorrect column alignment
- Great for combining data from different sources with consistent column names