Welcome To TechBrothersIT

PySpark repartition() Function Tutorial: Optimize Data Partitioning for Better Performance

The repartition() function in PySpark returns a new DataFrame with the number of partitions you specify, which can be higher or lower than the current count. It performs a full shuffle of the data, which helps improve parallelism, reduce data skew, and tune performance in distributed Spark jobs.

1. Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark repartition() Example") \
    .getOrCreate()

2. Create Sample DataFrame

data = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28)
]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()

+---+-------------+---+
| id|         name|age|
+---+-------------+---+
|  1|Aamir Shahzad| 35|
|  2|     Ali Raza| 30|
|  3|          Bob| 25|
|  4|         Lisa| 28|
+---+-------------+---+

3. Check Number of Partitions Before repartition()

print("Partitions before repartition():", df.rdd.getNumPartitions())

Partitions before repartition(): 1

4. Apply repartition()

# Example 1: Increase partitions to 4
df_repartitioned_4 = df.repartition(4)
print("Partitions after repartition(4):", df_repartitioned_4.rdd.getNumPartitions())

Partitions after repartition(4): 4

# Example 2: Decrease partitions from 4 to 2
# (start from the 4-partition DataFrame so the count actually goes down)
df_repartitioned_2 = df_repartitioned_4.repartition(2)
print("Partitions after repartition(2):", df_repartitioned_2.rdd.getNumPartitions())

Partitions after repartition(2): 2
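
To make the effect of repartition(n) more concrete, here is a minimal plain-Python sketch of spreading rows evenly across a fixed number of buckets. It assumes simple round-robin dealing; the function name round_robin_partitions and its start parameter are hypothetical, and Spark's real shuffle differs in detail (for example, in where it starts dealing rows).

```python
# Plain-Python model of even redistribution; this is not Spark's actual code.
def round_robin_partitions(rows, num_partitions, start=0):
    """Deal rows into num_partitions buckets one at a time, beginning at `start`."""
    buckets = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        buckets[(start + i) % num_partitions].append(row)
    return buckets


rows = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28),
]

four = round_robin_partitions(rows, 4)
two = round_robin_partitions(rows, 2)

print([len(p) for p in four])  # [1, 1, 1, 1]
print([len(p) for p in two])   # [2, 2]
```

In both cases every row ends up in exactly one bucket and the bucket sizes stay balanced, which is the property that makes repartition(n) useful for parallel processing.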

5. repartition() by Column (Optional)

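# Repartition into 2 partitions using the "age" column as the partitioning key,
# so rows with the same age land in the same partition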
df_by_age = df.repartition(2, "age")
print("Partitions after repartition by age:", df_by_age.rdd.getNumPartitions())

Partitions after repartition by age: 2
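
The idea behind repartitioning by a column can be sketched in plain Python: hash the key, take it modulo the partition count, and place the row in that bucket. This is only an illustration under stated assumptions. The helper hash_partition is hypothetical, zlib.crc32 is a stand-in for Spark's own hash function, and the fifth row ("Sara") is a made-up addition so that two rows share the same age.

```python
import zlib


# Plain-Python model of key-based partitioning; this is not Spark's actual code.
def hash_partition(rows, num_partitions, key_index):
    """Place each row in the bucket chosen by hashing the value at key_index."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        key_bytes = str(row[key_index]).encode("utf-8")
        # zlib.crc32 is a stand-in hash chosen because it is deterministic.
        bucket_id = zlib.crc32(key_bytes) % num_partitions
        buckets[bucket_id].append(row)
    return buckets


rows = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28),
    (5, "Sara", 25),  # hypothetical extra row sharing age 25 with Bob
]

by_age = hash_partition(rows, 2, key_index=2)
for bucket_id, bucket in enumerate(by_age):
    print(bucket_id, bucket)
```

Because the bucket depends only on the key, both rows with age 25 always end up together. The same property means that a heavily repeated key can produce uneven partitions, so the choice of column matters.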

6. Show Data after Repartition

# Show the content to verify that repartition() kept the same rows
# (the row order is not guaranteed to match the original)
df_repartitioned_2.show()

+---+-------------+---+
| id|         name|age|
+---+-------------+---+
|  1|Aamir Shahzad| 35|
|  2|     Ali Raza| 30|
|  3|          Bob| 25|
|  4|         Lisa| 28|
+---+-------------+---+
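
Since a shuffle can change row order, comparing the printed output line by line is not a reliable check. A minimal sketch of an order-insensitive comparison follows; the helper same_rows is a hypothetical name, and the rows are plain tuples standing in for collected DataFrame rows.

```python
from collections import Counter


def same_rows(before, after):
    """Return True when both collections hold the same rows with the same counts."""
    return Counter(before) == Counter(after)


before = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28),
]

# Same rows in a different order, as might happen after a shuffle.
after = [
    (3, "Bob", 25),
    (1, "Aamir Shahzad", 35),
    (4, "Lisa", 28),
    (2, "Ali Raza", 30),
]

print(same_rows(before, after))      # True
print(same_rows(before, after[:3]))  # False
```

For a small DataFrame, the same check should work on the lists returned by df.collect() and df_repartitioned_2.collect(); for large data, collecting everything to the driver is not advisable.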

✅ When to Use repartition()

When you need balanced partitions for parallel processing
To increase partitions for better parallelism
To decrease partitions so that a job writes fewer, larger output files instead of many small ones
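
When deciding how many partitions to ask for, one common rule of thumb is to aim for a target size per partition. The sketch below assumes a target of 128 MiB, which is an illustrative default and not a value from this tutorial; the function suggest_num_partitions and its parameters are hypothetical.

```python
import math


def suggest_num_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024,
                           min_partitions=1):
    """Estimate a partition count so each partition is near the target size."""
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target must be positive")
    return max(min_partitions, math.ceil(total_bytes / target_partition_bytes))


print(suggest_num_partitions(10 * 1024**3))  # 80  (10 GiB of data)
print(suggest_num_partitions(50 * 1024**2))  # 1   (50 MiB of data)
```

The result could then be passed to df.repartition(), though the right number also depends on cluster cores and the workload, so treat it as a starting point to tune.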

📊 Summary

repartition() can increase or decrease partitions
Helps with load balancing and performance tuning
Useful for parallel writes and for correcting skewed data

