Optimize Spark Shuffles: groupByKey vs reduceByKey

A shuffle is the redistribution of data across partitions (and usually across the network) so that records with the same key end up together. Learn how Spark handles shuffles behind the scenes and why reduceByKey is more efficient than groupByKey.

🧠 What You'll Learn

  • What is a Spark shuffle?
  • Why reduceByKey is more performant than groupByKey
  • Real code examples with data output
  • Tips to improve Spark job efficiency
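
⚙️ Setup: SparkSession

The code examples below assume a SparkSession bound to the variable spark, as you get for free in the PySpark shell. If you are running a standalone script, a minimal setup looks like this (the appName and master values are placeholders; adjust them for your environment):

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession; the examples below
# expect the variable name `spark`.
spark = SparkSession.builder \
    .appName("ShuffleDemo") \
    .master("local[*]") \
    .getOrCreate()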

⚙️ Code Example: groupByKey

data = [("Aamir", 100), ("Ali", 200), ("Aamir", 300), ("Raza", 150), ("Ali", 50)]
rdd = spark.sparkContext.parallelize(data)
grouped = rdd.groupByKey()
print("GroupByKey Result:")
for k, v in grouped.collect():
    print(k, list(v))
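
On this sample data, the loop prints one line per key (key order can vary from run to run, since it depends on partitioning):

Aamir [100, 300]
Ali [200, 50]
Raza [150]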

⚙️ Code Example: reduceByKey

# reduceByKey combines values within each partition first (map-side
# aggregation), then shuffles only one partial sum per key per partition
reduced = rdd.reduceByKey(lambda a, b: a + b)
print("reduceByKey Result:")
print(reduced.collect())
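
With the same input, collect() returns the per-key sums (again, key order may vary):

[('Aamir', 400), ('Ali', 250), ('Raza', 150)]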

📊 Output Comparison

  • groupByKey: shuffles every (key, value) record across the network and materializes all values for a key on a single executor, which wastes bandwidth and can cause memory pressure for heavily skewed keys.
  • reduceByKey: applies the reduce function inside each partition before the shuffle, so only one partial result per key per partition crosses the network. You can see the shuffle boundary in the RDD lineage, as shown below.
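
To inspect where the shuffle happens, print the RDD lineage with toDebugString(). This is a minimal sketch; the exact stage layout in the output depends on your Spark version, but both plans will show a shuffle boundary:

# toDebugString() returns bytes in PySpark, so decode it for printing
print(grouped.toDebugString().decode("utf-8"))
print(reduced.toDebugString().decode("utf-8"))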

