Boost PySpark Performance with Broadcast Variables & Accumulators
🚀 Introduction
In distributed computing with PySpark, efficiently sharing large lookup datasets and tracking global counters can significantly improve job performance. This tutorial shows how broadcast variables avoid shipping the same data with every task (and, in join scenarios, help eliminate shuffles), and how accumulators let you collect counters and metrics across parallel computations.
📘 What Are Broadcast Variables?
Broadcast variables allow you to cache a large read-only dataset on all worker nodes, preventing it from being re-sent with every task.
# Broadcast example
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
rdd = sc.parallelize(["apple", "pear", "banana", "mango", "kiwi"])
lookup_set = {"apple", "banana", "orange", "grape", "kiwi"}
broadcast_var = sc.broadcast(lookup_set)
# Use the broadcast value inside a transformation
matches = rdd.filter(lambda word: word in broadcast_var.value)
print(matches.collect())  # ['apple', 'banana', 'kiwi']
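The same idea powers broadcast joins: when one side of a join is small, Spark can ship it to every executor instead of shuffling the large side across the cluster. Here is a minimal sketch using the DataFrame API's broadcast() hint; the table names and columns are illustrative, not from the example above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
spark = SparkSession.builder.getOrCreate()
# Hypothetical data: a larger fact table and a small dimension table
orders = spark.createDataFrame(
    [(1, "apple"), (2, "banana"), (3, "pear")], ["order_id", "fruit"])
prices = spark.createDataFrame(
    [("apple", 1.2), ("banana", 0.5)], ["fruit", "price"])
# broadcast() ships the small table to every executor,
# so the join runs without shuffling the large side
joined = orders.join(broadcast(prices), on="fruit")
joined.show()
When a broadcast variable created with sc.broadcast is no longer needed, you can call broadcast_var.unpersist() to free the cached copies on the executors.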
🧮 What Are Accumulators?
Accumulators are shared variables that tasks can only add to, while the driver reads the aggregated value. They are typically used for counters and metrics in parallel computations.
# Accumulator example
rdd = sc.parallelize(["home", "about", "home", "contact", "home"])
clicks = sc.accumulator(0)

def count_home_clicks(page):
    if page == "home":
        clicks.add(1)
    return page

click_rdd = rdd.map(count_home_clicks)
click_rdd.count()  # map is lazy; run an action so the accumulator actually updates
print("Accumulator Value for 'home' clicks:", clicks.value)  # 3