PySpark repartition() Function Tutorial: Optimize Data Partitioning for Better Performance

PySpark repartition() Function Tutorial: Optimize Data Partitioning for Better Performance

PySpark repartition() Function Tutorial: Optimize Data Partitioning for Better Performance

The repartition() function in PySpark allows you to increase or decrease the number of partitions of a DataFrame. This helps in improving parallelism, managing data skew, and optimizing performance in distributed Spark jobs.

1. Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark repartition() Example") \
    .getOrCreate()

2. Create Sample DataFrame

data = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28)
]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+

3. Check Number of Partitions Before repartition()

print("Partitions before repartition():", df.rdd.getNumPartitions())
Partitions before repartition(): 1

4. Apply repartition()

# Example 1: Increase partitions to 4
df_repartitioned_4 = df.repartition(4)
print("Partitions after repartition(4):", df_repartitioned_4.rdd.getNumPartitions())
Partitions after repartition(4): 4
# Example 2: Decrease partitions to 2
df_repartitioned_2 = df.repartition(2)
print("Partitions after repartition(2):", df_repartitioned_2.rdd.getNumPartitions())
Partitions after repartition(2): 2

5. repartition() by Column (Optional)

# repartition by "age"
df_by_age = df.repartition(2, "age")
print("Partitions after repartition by age:", df_by_age.rdd.getNumPartitions())
Partitions after repartition by age: 2

6. Show Data after Repartition

# Show content to verify repartition didn't alter data
df_repartitioned_2.show()
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+

✅ When to Use repartition()

  • When you need balanced partitions for parallel processing
  • To increase partitions for better parallelism
  • To decrease partitions for optimized small output files

📊 Summary

  • repartition() can increase or decrease partitions
  • Helps with load balancing and performance tuning
  • Good for parallel writes and skewed data correction

📺 Watch the Full Tutorial Video

▶️ Watch on YouTube