Difference Between union() and unionAll() in PySpark | Step-by-Step Guide

In this PySpark tutorial, you'll learn how to use the union() and unionAll() functions to combine two DataFrames. These functions are essential for merging datasets in big data processing with Apache Spark.

What is union()?

The union() function combines two DataFrames with the same schema and returns a new DataFrame containing all rows from both, including duplicates. Unlike SQL's UNION, it does not deduplicate on its own; chain distinct() if you only want unique rows.

What is unionAll()?

In modern PySpark (Spark 2.0 and later), the unionAll() function behaves exactly like union(): it keeps all rows, including duplicates, and never applies distinct() for you. The name is kept for backward compatibility and mirrors SQL's UNION ALL.
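To make the relationship concrete, here is a minimal sketch (the DataFrame names and values below are made up for illustration) showing that union() and unionAll() return the same result in current PySpark, and that duplicates only disappear when you chain distinct():

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnionVsUnionAllSketch").getOrCreate()

left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
right = spark.createDataFrame([(2, "b"), (3, "c")], ["id", "val"])

print(left.union(right).count())             # 4 rows - duplicates kept
print(left.unionAll(right).count())          # 4 rows - same result as union()
print(left.union(right).distinct().count())  # 3 rows - duplicates removed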

Why Use union() and unionAll()?

  • To combine data from different sources
  • To merge datasets with the same schema
  • For data aggregation and analysis

Example: union() and unionAll() in PySpark

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("PySparkUnionExample").getOrCreate()

# Create DataFrame 1
data1 = [("Aamir Shahzad", "Engineering", 5000),
         ("Ali", "Sales", 4000)]

columns = ["Name", "Department", "Salary"]

df1 = spark.createDataFrame(data1, schema=columns)

# Create DataFrame 2
data2 = [("Raza", "Marketing", 3500),
         ("Ali", "Sales", 4000)]  # Duplicate row for demonstration

df2 = spark.createDataFrame(data2, schema=columns)

DataFrame 1

df1.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|      Sales|  4000|
+-------------+-----------+------+

DataFrame 2

df2.show()

Expected Output

+-----+-----------+------+
| Name| Department|Salary|
+-----+-----------+------+
| Raza|  Marketing|  3500|
|  Ali|      Sales|  4000|
+-----+-----------+------+

union() Example (Apply distinct() for Unique Rows)

df_union = df1.union(df2).distinct()
df_union.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|      Sales|  4000|
|         Raza|  Marketing|  3500|
+-------------+-----------+------+

unionAll() Example (Includes Duplicates)

df_unionAll = df1.unionAll(df2)
df_unionAll.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|      Sales|  4000|
|         Raza|  Marketing|  3500|
|          Ali|      Sales|  4000|
+-------------+-----------+------+

Row Count Comparison

print("Row count after union (with distinct):", df_union.count())
print("Row count after unionAll (with duplicates):", df_unionAll.count())

Expected Output

Row count after union (with distinct): 3
Row count after unionAll (with duplicates): 4
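If you are more comfortable with SQL, the same semantics can be expressed through temporary views. The sketch below reuses df1 and df2 from the example above (the view names are arbitrary): union() alone corresponds to UNION ALL, while union() followed by distinct() corresponds to UNION.

# Register the DataFrames as temporary views (view names are illustrative)
df1.createOrReplaceTempView("employees_1")
df2.createOrReplaceTempView("employees_2")

# UNION removes duplicates, like df1.union(df2).distinct()
spark.sql("SELECT * FROM employees_1 UNION SELECT * FROM employees_2").show()

# UNION ALL keeps duplicates, like df1.union(df2) or df1.unionAll(df2)
spark.sql("SELECT * FROM employees_1 UNION ALL SELECT * FROM employees_2").show()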

Key Points to Remember

  • Both DataFrames must have the same number of columns with compatible types; union() and unionAll() match columns by position, not by name (see the sketch after this list).
  • union() returns all rows from both DataFrames; chain distinct() if you need to remove duplicates.
  • unionAll() is the legacy name and also keeps duplicates; in modern PySpark it behaves exactly like union().
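Because union() and unionAll() match columns by position, two DataFrames whose columns are in a different order will be combined incorrectly or fail on incompatible types. The sketch below uses made-up data to show how unionByName() matches columns by name instead:

# df_a and df_b hold the same kind of data, but their columns are in a different order
df_a = spark.createDataFrame([("Aamir Shahzad", 5000)], ["Name", "Salary"])
df_b = spark.createDataFrame([(4000, "Ali")], ["Salary", "Name"])

# df_a.union(df_b) would pair columns by position, mixing names and salaries
# unionByName() matches columns by name and produces the expected result
df_a.unionByName(df_b).show()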

Watch the Full Tutorial on YouTube