What Are RDD Partitions in PySpark?
Understanding how RDD partitions work is crucial for optimizing your PySpark applications. This guide walks through what partitions are, how to check an RDD's partition count, and how to tune parallelism with the `repartition()` and `coalesce()` methods.
🔍 What is an RDD Partition?
Partitions are the logical chunks into which Spark splits an RDD's data. Each partition is processed by a single task on a worker node, so the partition count sets the upper bound on how much work can run in parallel. More partitions generally mean more parallelism and better scalability, though a very large number of tiny partitions adds scheduling overhead.
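For intuition, here is a minimal, self-contained sketch (the app name and the `demo_rdd` variable are illustrative, not part of the walkthrough below) that uses `mapPartitionsWithIndex()` to tag each element with the partition it lives in:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionIntuition").getOrCreate()

# Spread ten numbers across 4 partitions; each partition is handled by one task
demo_rdd = spark.sparkContext.parallelize(range(10), 4)

def tag_with_partition(index, iterator):
    # index identifies the partition the current task is processing
    return ((index, value) for value in iterator)

print(demo_rdd.mapPartitionsWithIndex(tag_with_partition).collect())
# e.g. [(0, 0), (0, 1), (1, 2), (1, 3), (1, 4), ...] -- (partition index, element) pairs
```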
📌 Step 1: Create RDD with Default Partitions
data = ["Aamir", "Ali", "Raza", "Bob", "Lisa"]
rdd = spark.sparkContext.parallelize(data)
print("Original Partitions:", rdd.getNumPartitions())
📌 Step 2: Repartition RDD to 5
# repartition() triggers a full shuffle and can increase or decrease the partition count
rdd_repart = rdd.repartition(5)
print("After Repartition to 5:", rdd_repart.getNumPartitions())
📌 Step 3: Coalesce Back to 2
# coalesce() merges existing partitions without a full shuffle (by default it only reduces the count)
rdd_coalesce = rdd_repart.coalesce(2)
print("After Coalesce to 2:", rdd_coalesce.getNumPartitions())
📌 Step 4: Show How Data is Distributed
print("Data in Each Partition (Final RDD):")
print(rdd_coalesce.glom().collect())
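`glom()` returns one list per partition, so the final call prints a list of two lists whose exact contents depend on how the partitions were merged; the split shown below is only one possible outcome.

```python
# Possible output (the exact grouping varies):
# Data in Each Partition (Final RDD):
# [['Aamir', 'Ali'], ['Raza', 'Bob', 'Lisa']]

# Stop the SparkSession once you are done experimenting
spark.stop()
```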