Mastering PySpark PairRDD Transformations
groupByKey, reduceByKey, sortByKey Explained with Real Data
In this tutorial, we explore three essential Pair RDD transformations in PySpark: groupByKey(), reduceByKey(), and sortByKey(). These functions enable you to group, aggregate, and sort data using key-value pairs, which are powerful operations for distributed data processing.
🔹 Step 1: Sample Data
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName("PairRDDDemo").getOrCreate()

# (name, amount) pairs: the name acts as the key
data = [
    ("Alice", 300),
    ("Bob", 150),
    ("Alice", 200),
    ("Raza", 450),
    ("Bob", 100),
    ("Raza", 50)
]

# Distribute the list across the cluster as a Pair RDD
rdd = spark.sparkContext.parallelize(data)
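To confirm what the RDD holds before transforming it, a quick collect() (fine for a tiny sample like this, though it pulls everything to the driver) prints the raw pairs:

print(rdd.collect())
# [('Alice', 300), ('Bob', 150), ('Alice', 200), ('Raza', 450), ('Bob', 100), ('Raza', 50)]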
🔹 Step 2: groupByKey()
# Group all values under each key; yields (key, iterable-of-values) pairs
grouped_rdd = rdd.groupByKey()

for k, v in grouped_rdd.collect():
    print(k, list(v))   # e.g. Alice [300, 200]
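Keep in mind that groupByKey() shuffles every individual value across the network before grouping, so it can be expensive on large datasets. If the end goal is an aggregate, one option is to chain mapValues() onto the grouped result. Here is a minimal sketch that sums each group; the next step shows the more efficient reduceByKey() way to get the same totals:

# Sum the grouped values per key; same result as reduceByKey below,
# but every raw value is shuffled first
totals = rdd.groupByKey().mapValues(sum)
print(totals.collect())   # e.g. [('Alice', 500), ('Bob', 250), ('Raza', 500)] (order may vary)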
🔹 Step 3: reduceByKey()
# Merge values per key with the given function; here, sum the amounts
reduced_rdd = rdd.reduceByKey(lambda a, b: a + b)
print(reduced_rdd.collect())   # e.g. [('Alice', 500), ('Bob', 250), ('Raza', 500)] (order may vary)
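The merge function is not limited to addition: any associative and commutative two-argument function works, because Spark applies it both within partitions (map-side combining) and again across partitions. As a small sketch, Python's built-in max finds each key's largest single amount:

# Largest single amount per key instead of the sum
max_rdd = rdd.reduceByKey(max)
print(max_rdd.collect())   # e.g. [('Alice', 300), ('Bob', 150), ('Raza', 450)] (order may vary)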
🔹 Step 4: sortByKey()
# Sort the (key, value) pairs by key, ascending by default
sorted_rdd = reduced_rdd.sortByKey()
print(sorted_rdd.collect())   # [('Alice', 500), ('Bob', 250), ('Raza', 500)]
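sortByKey() also accepts an ascending flag, and if you want to order by the aggregated amounts rather than the names, the more general sortBy() takes a key function. A short sketch of both:

# Keys in reverse alphabetical order
print(reduced_rdd.sortByKey(ascending=False).collect())
# [('Raza', 500), ('Bob', 250), ('Alice', 500)]

# Sort by value (total amount), highest first
print(reduced_rdd.sortBy(lambda kv: kv[1], ascending=False).collect())
# [('Alice', 500), ('Raza', 500), ('Bob', 250)] (tie order may vary)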