PySpark coalesce() Function Tutorial: Optimize Partitioning for Faster Spark Jobs
This tutorial will help you understand how to use the coalesce() function in PySpark to reduce the number of partitions in a DataFrame and improve job performance.
1. What is coalesce() in PySpark?
- coalesce() reduces the number of partitions in a DataFrame.
- It is preferred over repartition() when reducing partitions because it avoids a full data shuffle.
- It is ideal for reducing small-file problems and preparing data for output operations (see the sketch after this list).
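To see the shuffle difference, compare the physical plans: repartition() inserts an Exchange (shuffle) operator, while coalesce() does not. A minimal self-contained sketch (it builds its own session and a throwaway spark.range DataFrame just for this comparison):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce vs repartition plans").getOrCreate()

demo = spark.range(100)  # throwaway DataFrame just for inspecting the plans

# repartition() shuffles: the plan contains an Exchange operator
demo.repartition(1).explain()

# coalesce() merges existing partitions in place: no Exchange in the plan
demo.coalesce(1).explain()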
2. Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark coalesce() Example") \
    .getOrCreate()
3. Create Sample DataFrame
data = [
(1, "Aamir Shahzad", 35),
(2, "Ali Raza", 30),
(3, "Bob", 25),
(4, "Lisa", 28)
]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+
4. Check Number of Partitions Before coalesce()
print("Partitions before coalesce:", df.rdd.getNumPartitions())
Partitions before coalesce: 4

Your count may differ: for a local list, createDataFrame splits the data across the cluster's default parallelism, which typically depends on the number of available cores.
5. Apply coalesce() to Reduce Partitions
df_coalesced = df.coalesce(1)
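Note that coalesce() can only reduce the partition count; asking for more partitions than the DataFrame already has is a no-op, and you need repartition() to scale up. A minimal sketch, assuming the four-partition df from above:

# coalesce() never increases partitions: with 4 input partitions this prints 4, not 8
print(df.coalesce(8).rdd.getNumPartitions())

# repartition() can increase as well as decrease, at the cost of a shuffle: prints 8
print(df.repartition(8).rdd.getNumPartitions())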
6. Check Number of Partitions After coalesce()
print("Partitions after coalesce:", df_coalesced.rdd.getNumPartitions())
Partitions after coalesce: 1
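To inspect how the rows are actually distributed, not just how many partitions exist, the RDD glom() method collects each partition into a list. A small sketch using the df and df_coalesced DataFrames from above (the exact per-partition counts depend on your environment):

# Number of rows in each partition, before and after coalescing
print(df.rdd.glom().map(len).collect())            # e.g. [1, 1, 1, 1]
print(df_coalesced.rdd.glom().map(len).collect())  # [4]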
7. Show Transformed Data
df_coalesced.show()
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+
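A common real-world use of coalesce(1) is writing a single output file instead of one part file per partition. A minimal sketch of that pattern; the output path "/tmp/people_csv" is just a placeholder:

# coalesce(1) before a write produces a single part file in the target directory
# "/tmp/people_csv" is a hypothetical path; point it anywhere writable
df_coalesced.write.mode("overwrite").option("header", True).csv("/tmp/people_csv")

Keep in mind that coalesce(1) funnels all rows through a single task, so reserve this pattern for outputs small enough to be handled comfortably by one executor.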