RDD Persistence in PySpark Explained | MEMORY_ONLY vs MEMORY_AND_DISK with Examples #pyspark

Understand how RDD caching and persistence work in PySpark, especially the differences between MEMORY_ONLY and MEMORY_AND_DISK.

📌 Why Use Persistence?

When you work with large datasets and run iterative operations (loops, ML algorithms), Spark re-computes an RDD from its lineage every time an action touches it, which can be expensive. Caching or persisting stores the computed RDD in memory (and optionally on disk) so that later actions reuse it instead of recomputing it.
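As a minimal sketch (the dataset, app name, and loop are illustrative, not part of the main example below), caching an RDD that is reused across several actions avoids recomputing it each time:

from pyspark import SparkContext

sc = SparkContext("local", "Cache Demo")

# Build an RDD that would otherwise be recomputed by every action (illustrative)
squares = sc.parallelize(range(1, 1001)).map(lambda n: n * n)

# cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs
squares.cache()

# Each action below reuses the cached partitions instead of re-running the map
for _ in range(3):
    print(squares.sum())

sc.stop()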

🧠 Storage Levels

  • MEMORY_ONLY: Stores the RDD partitions in RAM; partitions that do not fit are not cached and are recomputed from the lineage when needed.
  • MEMORY_AND_DISK: Stores partitions in RAM and spills the ones that do not fit to disk, so they are read back from disk instead of being recomputed (see the short sketch after this list).
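A minimal sketch (the app name and data are illustrative) showing how to request each level explicitly and confirm it with getStorageLevel():

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "Storage Level Demo")

# MEMORY_ONLY: partitions that don't fit in RAM are recomputed when needed
rdd_mem = sc.parallelize(range(100)).persist(StorageLevel.MEMORY_ONLY)

# MEMORY_AND_DISK: partitions that don't fit in RAM are spilled to disk
rdd_disk = sc.parallelize(range(100)).persist(StorageLevel.MEMORY_AND_DISK)

print(rdd_mem.getStorageLevel())   # e.g. Memory Serialized 1x Replicated
print(rdd_disk.getStorageLevel())  # e.g. Disk Memory Serialized 1x Replicated

sc.stop()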

⚙️ Example Code

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "RDD Persistence Demo")
data = ["Ali", "Aamir", "Fatima", "Aamir", "Ali"]
rdd = sc.parallelize(data)

# Map each name to a (name, 1) pair
mapped_rdd = rdd.map(lambda name: (name, 1))

# Persist the RDD in memory, spilling to disk if needed.
# persist() is lazy: the data is cached the first time an action runs on it.
mapped_rdd.persist(StorageLevel.MEMORY_AND_DISK)

# First action: materializes and caches the (name, 1) pairs
print(mapped_rdd.collect())

# Count occurrences of each name
print("Count by Key:")
print(mapped_rdd.reduceByKey(lambda a, b: a + b).collect())

# Group by key and display the values
print("Group by Key:")
print(mapped_rdd.groupByKey().mapValues(list).collect())

# Release the cached data when done
mapped_rdd.unpersist()
sc.stop()
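Because persist() is lazy, the (name, 1) pairs are only materialized and cached when the first action (collect) runs; the later reduceByKey and groupByKey actions then read the cached partitions instead of re-running the map. Calling unpersist() afterwards releases the storage held by the cached RDD.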

🎥 Watch Video Tutorial

Some of the contents