PySpark coalesce() Function Tutorial: Optimize Partitioning for Faster Spark Jobs
This tutorial will help you understand how to use the coalesce() function in PySpark to reduce the number of partitions in a DataFrame and improve job performance.
1. What is coalesce() in PySpark?
- coalesce() reduces the number of partitions in a DataFrame.
- It is preferred over repartition() when reducing partitions because it avoids a full data shuffle.
- It is ideal for reducing small-file problems and preparing data for output operations (see the sketch after this list).
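To see the shuffle difference, compare the physical plans: repartition() inserts an Exchange (shuffle) operator, while coalesce() does not. A minimal self-contained sketch (it builds its own session and a throwaway spark.range DataFrame just for this comparison):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce vs repartition plans").getOrCreate()

demo = spark.range(100)  # throwaway DataFrame just for inspecting the plans

# repartition() shuffles: the plan contains an Exchange operator
demo.repartition(1).explain()

# coalesce() merges existing partitions in place: no Exchange in the plan
demo.coalesce(1).explain()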
2. Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark coalesce() Example") \
    .getOrCreate()
3. Create Sample DataFrame
data = [
(1, "Aamir Shahzad", 35),
(2, "Ali Raza", 30),
(3, "Bob", 25),
(4, "Lisa", 28)
]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+
4. Check Number of Partitions Before coalesce()
print("Partitions before coalesce:", df.rdd.getNumPartitions())
Partitions before coalesce: 4

Your count may differ: for a local list, createDataFrame splits the data across the cluster's default parallelism, which typically depends on the number of available cores.
5. Apply coalesce() to Reduce Partitions
df_coalesced = df.coalesce(1)
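Note that coalesce() can only reduce the partition count; asking for more partitions than the DataFrame already has is a no-op, and you need repartition() to scale up. A minimal sketch, assuming the four-partition df from above:

# coalesce() never increases partitions: with 4 input partitions this prints 4, not 8
print(df.coalesce(8).rdd.getNumPartitions())

# repartition() can increase as well as decrease, at the cost of a shuffle: prints 8
print(df.repartition(8).rdd.getNumPartitions())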
6. Check Number of Partitions After coalesce()
print("Partitions after coalesce:", df_coalesced.rdd.getNumPartitions())
Partitions after coalesce: 1
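To inspect how the rows are actually distributed, not just how many partitions exist, the RDD glom() method collects each partition into a list. A small sketch using the df and df_coalesced DataFrames from above (the exact per-partition counts depend on your environment):

# Number of rows in each partition, before and after coalescing
print(df.rdd.glom().map(len).collect())            # e.g. [1, 1, 1, 1]
print(df_coalesced.rdd.glom().map(len).collect())  # [4]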
7. Show Transformed Data
df_coalesced.show()
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+
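A common real-world use of coalesce(1) is writing a single output file instead of one part file per partition. A minimal sketch of that pattern; the output path "/tmp/people_csv" is just a placeholder:

# coalesce(1) before a write produces a single part file in the target directory
# "/tmp/people_csv" is a hypothetical path; point it anywhere writable
df_coalesced.write.mode("overwrite").option("header", True).csv("/tmp/people_csv")

Keep in mind that coalesce(1) funnels all rows through a single task, so reserve this pattern for outputs small enough to be handled comfortably by one executor.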