Optimize Spark Shuffles: groupByKey vs reduceByKey
Learn how Spark handles shuffles behind the scenes and why reduceByKey is more efficient than groupByKey.
🧠 What You'll Learn
- What is a Spark shuffle?
- Why reduceByKey is more performant than groupByKey
- Real code examples with data output
- Tips to improve Spark job efficiency
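A quick definition first: a shuffle is Spark's redistribution of records across partitions (and usually across executors and the network) so that all values sharing a key end up together. Wide transformations such as groupByKey and reduceByKey both trigger one. You can spot the shuffle boundary in an RDD's lineage; here is a minimal sketch, assuming a throwaway local SparkSession (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# Assumption: a local session used only to inspect lineage.
spark = SparkSession.builder.master("local[*]").appName("ShuffleLineage").getOrCreate()

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# toDebugString() returns the RDD lineage as bytes; the ShuffledRDD line
# marks the stage boundary where data is redistributed across partitions.
print(pairs.groupByKey().toDebugString().decode())
```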
⚙️ Code Example: groupByKey
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ShuffleDemo").getOrCreate()

data = [("Aamir", 100), ("Ali", 200), ("Aamir", 300), ("Raza", 150), ("Ali", 50)]
rdd = spark.sparkContext.parallelize(data)

# Every (key, value) pair is shuffled so that all values for a key
# land on one partition before anything is combined.
grouped = rdd.groupByKey()

print("groupByKey Result:")
for k, v in grouped.collect():
    print(k, list(v))  # v is an iterable, not a list; convert it to display
```
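With this data, the output should contain the following groups (key order can vary between runs, and the order of values inside a group is not guaranteed either):

```
Aamir [100, 300]
Ali [200, 50]
Raza [150]
```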
⚙️ Code Example: reduceByKey
```python
# Values are combined within each partition first (a map-side combine),
# so only one partial sum per key per partition crosses the network.
reduced = rdd.reduceByKey(lambda a, b: a + b)

print("reduceByKey Result:")
print(reduced.collect())
```
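Here collect() should return the per-key sums, again in no guaranteed order:

```
[('Aamir', 400), ('Ali', 250), ('Raza', 150)]
```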
📊 Output Comparison
- groupByKey: Ships every (key, value) pair across the network and materializes all of a key's values on one partition, which also risks memory pressure for hot keys.
- reduceByKey: Performs local aggregation (a map-side combine) within each partition before the shuffle, so only one partial result per key per partition is transferred; see the sketch below.
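The practical tip: whenever groupByKey only feeds an aggregation, rewrite it with reduceByKey (or aggregateByKey). The result is identical, but far less data crosses the shuffle. A minimal sketch, reusing the rdd built in the examples above:

```python
from operator import add

# Anti-pattern: shuffle every value, then sum on the reduce side.
sums_via_group = rdd.groupByKey().mapValues(sum)

# Preferred: partial sums are computed per partition before the shuffle.
sums_via_reduce = rdd.reduceByKey(add)

# Both produce the same (key, total) pairs.
assert sorted(sums_via_group.collect()) == sorted(sums_via_reduce.collect())
```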