Difference Between union() and unionAll() in PySpark
In this PySpark tutorial, you'll learn how to use the union() and unionAll() functions to combine two DataFrames. These functions are essential for merging datasets in big data processing with Apache Spark.
What is union()?
The union() function merges two DataFrames with the same schema and returns a new DataFrame containing all rows from both, including duplicates. Chain distinct() after it when you need duplicate rows removed, as in the sketch below.
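A minimal sketch of this pattern, assuming an active SparkSession named spark (the full worked example follows below):
# union() keeps every row; chain distinct() to drop duplicates
rows = spark.createDataFrame([(1, "a")], ["id", "label"])
rows.union(rows).count()             # 2 -- the duplicate row is kept
rows.union(rows).distinct().count()  # 1 -- distinct() removes it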
What is unionAll()?
In modern versions of PySpark (2.0 and later), unionAll() is simply an alias for union(). It includes all rows, even duplicates, and does not remove them unless you apply distinct(). The short sketch below shows the two calls behaving identically.
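A quick sketch of the alias behavior, again assuming an active SparkSession named spark:
# unionAll() behaves exactly like union(): duplicates are kept in both cases
rows = spark.createDataFrame([(1, "a")], ["id", "label"])
rows.union(rows).count()     # 2
rows.unionAll(rows).count()  # 2 -- identical result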
Why Use union() and unionAll()?
- To combine data from different sources
- To merge datasets with the same schema
- For data aggregation and analysis
Example: union() and unionAll() in PySpark
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("PySparkUnionExample").getOrCreate()
# Create DataFrame 1
data1 = [("Aamir Shahzad", "Engineering", 5000),
("Ali", "Sales", 4000)]
columns = ["Name", "Department", "Salary"]
df1 = spark.createDataFrame(data1, schema=columns)
# Create DataFrame 2
data2 = [("Raza", "Marketing", 3500),
("Ali", "Sales", 4000)] # Duplicate row for demonstration
df2 = spark.createDataFrame(data2, schema=columns)
DataFrame 1
df1.show()
Expected Output
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering| 5000|
| Ali| Sales| 4000|
+-------------+-----------+------+
DataFrame 2
df2.show()
Expected Output
+-----+-----------+------+
| Name| Department|Salary|
+-----+-----------+------+
| Raza| Marketing| 3500|
| Ali| Sales| 4000|
+-----+-----------+------+
union() Example (Apply distinct() for Unique Rows)
df_union = df1.union(df2).distinct()
df_union.show()
Expected Output
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering| 5000|
| Ali| Sales| 4000|
| Raza| Marketing| 3500|
+-------------+-----------+------+
unionAll() Example (Includes Duplicates)
df_unionAll = df1.unionAll(df2)
df_unionAll.show()
Expected Output
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering| 5000|
| Ali| Sales| 4000|
| Raza| Marketing| 3500|
| Ali| Sales| 4000|
+-------------+-----------+------+
Row Count Comparison
print("Row count after union (with distinct):", df_union.count())
print("Row count after unionAll (with duplicates):", df_unionAll.count())
Expected Output
Row count after union (with distinct): 3
Row count after unionAll (with duplicates): 4
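Note that the deduplication above comes from distinct(), not from union() itself. A short check, reusing df1 and df2 from the example:
# union() alone keeps the duplicate "Ali" row, just like unionAll()
print("Row count after union (no distinct):", df1.union(df2).count())
Expected Output
Row count after union (no distinct): 4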
Key Points to Remember
- Both DataFrames must have the same schema for union() and unionAll().
- union() returns a DataFrame with all rows; apply distinct() if you need to remove duplicates.
- unionAll() (legacy) includes all duplicates; in modern PySpark, union() behaves like unionAll().
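One practical caveat behind the "same schema" rule: union() and unionAll() match columns by position, not by name. The sketch below is an illustration only; it reuses the spark session from the example and mentions unionByName(), which is available in recent PySpark releases as the name-based alternative:
# union() pairs columns by position, so mismatched column order scrambles the data
df_a = spark.createDataFrame([("Aamir Shahzad", "Engineering")], ["Name", "Department"])
df_b = spark.createDataFrame([("Marketing", "Raza")], ["Department", "Name"])
df_a.union(df_b).show()        # "Marketing" lands in the Name column: columns are matched by position
df_a.unionByName(df_b).show()  # unionByName() matches columns by name, so the values line up correctly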