PySpark repartition() Function Tutorial: Optimize Data Partitioning for Better Performance
The repartition()
function in PySpark allows you to increase or decrease the number of partitions of a DataFrame. This helps in improving parallelism, managing data skew, and optimizing performance in distributed Spark jobs.
1. Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("PySpark repartition() Example") \
.getOrCreate()
2. Create Sample DataFrame
data = [
(1, "Aamir Shahzad", 35),
(2, "Ali Raza", 30),
(3, "Bob", 25),
(4, "Lisa", 28)
]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show()
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+
3. Check Number of Partitions Before repartition()
print("Partitions before repartition():", df.rdd.getNumPartitions())
Partitions before repartition(): 1
4. Apply repartition()
# Example 1: Increase partitions to 4
df_repartitioned_4 = df.repartition(4)
print("Partitions after repartition(4):", df_repartitioned_4.rdd.getNumPartitions())
Partitions after repartition(4): 4
# Example 2: Decrease partitions to 2
df_repartitioned_2 = df.repartition(2)
print("Partitions after repartition(2):", df_repartitioned_2.rdd.getNumPartitions())
Partitions after repartition(2): 2
5. repartition() by Column (Optional)
# repartition by "age"
df_by_age = df.repartition(2, "age")
print("Partitions after repartition by age:", df_by_age.rdd.getNumPartitions())
Partitions after repartition by age: 2
6. Show Data after Repartition
# Show content to verify repartition didn't alter data
df_repartitioned_2.show()
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+
✅ When to Use repartition()
- When you need balanced partitions for parallel processing
- To increase partitions for better parallelism
- To decrease partitions for optimized small output files
📊 Summary
repartition()
can increase or decrease partitions- Helps with load balancing and performance tuning
- Good for parallel writes and skewed data correction