Mastering PySpark PairRDD Transformations

groupByKey, reduceByKey, sortByKey Explained with Real Data

In this tutorial, we explore three essential PairRDD transformations in PySpark: groupByKey(), reduceByKey(), and sortByKey(). These functions let you group, aggregate, and sort key-value pairs, which are fundamental operations in distributed data processing.

🔹 Step 1: Sample Data

from pyspark.sql import SparkSession

# Create a SparkSession, the entry point to Spark
spark = SparkSession.builder.appName("PairRDDTransformations").getOrCreate()

# Key-value pairs of (name, amount)
data = [
    ("Alice", 300),
    ("Bob", 150),
    ("Alice", 200),
    ("Raza", 450),
    ("Bob", 100),
    ("Raza", 50)
]

# Distribute the local list across the cluster as an RDD
rdd = spark.sparkContext.parallelize(data)
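
Any RDD whose elements are two-element tuples behaves as a PairRDD in PySpark, so the key-based transformations below are available on it. A quick sanity check (output shown as a comment):

print(rdd.collect())
# [('Alice', 300), ('Bob', 150), ('Alice', 200), ('Raza', 450), ('Bob', 100), ('Raza', 50)]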

🔹 Step 2: groupByKey()

# Collect all values sharing a key into one iterable per key
grouped_rdd = rdd.groupByKey()
for k, v in grouped_rdd.collect():
    print(k, list(v))
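
groupByKey() returns a lazy iterable per key (a ResultIterable), which is why list(v) is needed to display the values. Key order from collect() is not guaranteed, but with the sample data the output looks like:

Alice [300, 200]
Bob [150, 100]
Raza [450, 50]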

🔹 Step 3: reduceByKey()

# Merge the values for each key with an associative function (here: addition)
reduced_rdd = rdd.reduceByKey(lambda a, b: a + b)
print(reduced_rdd.collect())
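
This sums the amounts per key, producing the per-person totals (key order may vary):

[('Alice', 500), ('Bob', 250), ('Raza', 500)]

Prefer reduceByKey() over groupByKey() for aggregations: it merges values on each partition before the shuffle, so far less data moves across the network.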

🔹 Step 4: sortByKey()

# Sort the (key, value) pairs by key; ascending order is the default
sorted_rdd = reduced_rdd.sortByKey()
print(sorted_rdd.collect())
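
Because the keys are strings, the pairs come back in alphabetical order:

[('Alice', 500), ('Bob', 250), ('Raza', 500)]

Pass ascending=False to sort in reverse. If you want to rank by the totals rather than the names, sortBy() with a key function is a handy companion; a minimal sketch:

# Sort by value (the total) in descending order
by_value = reduced_rdd.sortBy(lambda kv: kv[1], ascending=False)
print(by_value.collect())
# e.g. [('Alice', 500), ('Raza', 500), ('Bob', 250)]; the two 500s may swap, since ties have no defined order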

📺 Watch the Full Video Tutorial
