How to Use Cache in PySpark to Improve Spark Performance | PySpark Tutorial

In this tutorial, you'll learn how to use the cache() function in PySpark to optimize performance by storing intermediate results in memory.

📌 What is cache() in PySpark?

cache() is an optimization technique that marks a DataFrame (or RDD) for storage. The data is not stored immediately: it is materialized the first time an action runs, and subsequent actions then reuse the cached data instead of recomputing the full lineage, improving performance. For DataFrames, cache() uses the MEMORY_AND_DISK storage level by default.

🔥 Why Use cache()? (Benefits)

  • Speeds up jobs that reuse the same DataFrame
  • Saves recomputation time in iterative algorithms
  • Useful for exploratory data analysis (EDA)
  • Optimizes performance in joins and repeated filters

Step 1: Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark cache() Example") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28)
]

columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()
+---+-------------+---+
| id|         name|age|
+---+-------------+---+
|  1|Aamir Shahzad| 35|
|  2|     Ali Raza| 30|
|  3|          Bob| 25|
|  4|         Lisa| 28|
+---+-------------+---+

Step 3: Cache the DataFrame

# Mark the DataFrame for caching (lazy — nothing is stored yet)
df.cache()

Step 4: Trigger Action to Materialize Cache

# Action like count() triggers caching
df.count()
Output: 4

Step 5: Perform Actions on Cached Data

df.show()
df.filter(df.age > 28).show()
+---+-------------+---+
| id|         name|age|
+---+-------------+---+
|  1|Aamir Shahzad| 35|
|  2|     Ali Raza| 30|
|  3|          Bob| 25|
|  4|         Lisa| 28|
+---+-------------+---+

+---+-------------+---+
| id|         name|age|
+---+-------------+---+
|  1|Aamir Shahzad| 35|
|  2|     Ali Raza| 30|
+---+-------------+---+

Step 6: Check if DataFrame is Cached

print("Is DataFrame cached?", df.is_cached)
Is DataFrame cached? True

Step 7: Remove Cache (Unpersist)

df.unpersist()
print("Is DataFrame cached after unpersist?", df.is_cached)
Is DataFrame cached after unpersist? False

📺 Watch the Full Tutorial Video

▶️ Watch on YouTube

Author: Aamir Shahzad

© 2024 PySpark Tutorials. All rights reserved.