Welcome To TechBrothersIT

PySpark Tutorial: How to Use cache() to Improve Spark Performance

In this tutorial, you'll learn how to use the cache() function in PySpark to optimize performance by storing intermediate results in memory.

📌 What is cache() in PySpark?

cache() is an optimization technique that stores a DataFrame (RDD) in memory after an action is triggered. It avoids recomputing the DataFrame for future actions, improving performance.

🔥 Why Use cache()? (Benefits)

Speeds up jobs that reuse the same DataFrame
Saves recomputation time in iterative algorithms
Useful for exploratory data analysis (EDA)
Optimizes performance in joins and repeated filters

Step 1: Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \\
    .appName("PySpark cache() Example") \\
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28)
]

columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()

+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+

Step 3: Cache the DataFrame

# Apply cache
df.cache()

Step 4: Trigger Action to Materialize Cache

# Action like count() triggers caching
df.count()

Output: 4

Step 5: Perform Actions on Cached Data

df.show()
df.filter(df.age > 28).show()

+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
+---+--------------+---+

Step 6: Check if DataFrame is Cached

print("Is DataFrame cached?", df.is_cached)

Is DataFrame cached? True

Step 7: Remove Cache (Unpersist)

df.unpersist()
print("Is DataFrame cached after unpersist?", df.is_cached)

Is DataFrame cached after unpersist? False

📺 Watch the Full Tutorial Video

▶️ Watch on YouTube