RDD Persistence in PySpark
Understand how RDD caching and persistence work in PySpark, especially the differences between MEMORY_ONLY and MEMORY_AND_DISK.
📌 Why Use Persistence?
When working with large datasets and performing iterative operations (such as loops or ML algorithms), recomputing an RDD from its lineage for every action can be time-consuming. Caching or persisting keeps the RDD in memory or on disk so later actions can reuse it instead of recomputing it.
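For instance, here is a rough sketch of the pattern; the RDD and the numbers below are made up for illustration and are not part of the lesson's example:
from pyspark import SparkContext

sc = SparkContext("local", "Why Persist Demo")

# Hypothetical expensive-to-recompute RDD (size is made up for illustration)
numbers = sc.parallelize(range(1_000_000))
squared = numbers.map(lambda x: x * x)

# cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
squared.cache()

print(squared.count())  # first action: computes the RDD and caches it
print(squared.sum())    # later actions reuse the cached partitions instead of recomputing
print(squared.max())

sc.stop()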
🧠 Storage Levels
- MEMORY_ONLY: Keeps partitions in RAM only; partitions that do not fit are not cached and are recomputed when needed (the job does not fail, but the recomputation costs time).
- MEMORY_AND_DISK: Keeps partitions in RAM and spills the ones that do not fit to disk.
The short sketch below shows how each level is requested with persist().
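A minimal sketch, assuming a throwaway RDD used only to demonstrate the calls:
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "Storage Levels Demo")
rdd = sc.parallelize(["a", "b", "c"])

# Keep partitions only in RAM (equivalent to rdd.cache())
rdd.persist(StorageLevel.MEMORY_ONLY)
print(rdd.getStorageLevel())

# An RDD must be unpersisted before a different level can be assigned
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.getStorageLevel())

sc.stop()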
⚙️ Example Code
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "RDD Persistence Demo")

data = ["Ali", "Aamir", "Fatima", "Aamir", "Ali"]
rdd = sc.parallelize(data)

# Pair each name with 1 so it can be counted
mapped_rdd = rdd.map(lambda name: (name, 1))
print(mapped_rdd.collect())

# Persist the RDD in memory, spilling to disk if needed.
# Persistence is lazy: the data is actually cached the next time an action runs.
mapped_rdd.persist(StorageLevel.MEMORY_AND_DISK)

# Count occurrences of each name (the first action after persist() fills the cache)
print("Count by Key:")
print(mapped_rdd.reduceByKey(lambda a, b: a + b).collect())

# Group by key and display the values (reuses the cached partitions)
print("Group by Key:")
print(mapped_rdd.groupByKey().mapValues(list).collect())

# Release the cached data and shut down the context when done
mapped_rdd.unpersist()
sc.stop()
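As a quick check, and assuming the same mapped_rdd from the example above, the RDD's cache status can be inspected before unpersist() is called:
# Check cache status before calling unpersist()
print(mapped_rdd.is_cached)          # True once persist() has been called
print(mapped_rdd.getStorageLevel())  # reports the level assigned by persist()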