PySpark Tutorial: How to Use cache() to Improve Spark Performance
In this tutorial, you'll learn how to use the cache() method in PySpark to improve performance by keeping intermediate results in memory instead of recomputing them.
📌 What is cache() in PySpark?
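cache() is an optimization that marks a DataFrame (or RDD) to be kept in memory once it has been computed. It is lazy: calling cache() stores nothing by itself, and the data is materialized the first time an action such as count() runs. Later actions then read the stored result instead of recomputing the DataFrame from its source. For DataFrames, the default storage level also spills to disk when the data does not fit in memory.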
🔥 Why Use cache()? (Benefits)
- Speeds up jobs that reuse the same DataFrame
- Saves recomputation time in iterative algorithms
- Useful for exploratory data analysis (EDA)
- Helps when the same DataFrame feeds several joins or repeated filters
Step 1: Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark cache() Example") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28)
]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()
+---+--------------+---+
| id|          name|age|
+---+--------------+---+
|  1| Aamir Shahzad| 35|
|  2|      Ali Raza| 30|
|  3|           Bob| 25|
|  4|          Lisa| 28|
+---+--------------+---+
Step 3: Cache the DataFrame
# Mark the DataFrame for caching (lazy: nothing is stored until an action runs)
df.cache()
Step 4: Trigger Action to Materialize Cache
# An action such as count() computes the DataFrame and fills the cache
df.count()
Output: 4
Step 5: Perform Actions on Cached Data
df.show()
df.filter(df.age > 28).show()
+---+--------------+---+
| id|          name|age|
+---+--------------+---+
|  1| Aamir Shahzad| 35|
|  2|      Ali Raza| 30|
|  3|           Bob| 25|
|  4|          Lisa| 28|
+---+--------------+---+
+---+--------------+---+
| id|          name|age|
+---+--------------+---+
|  1| Aamir Shahzad| 35|
|  2|      Ali Raza| 30|
+---+--------------+---+
Step 6: Check if DataFrame is Cached
print("Is DataFrame cached?", df.is_cached)
Is DataFrame cached? True
Step 7: Remove Cache (Unpersist)
df.unpersist()
print("Is DataFrame cached after unpersist?", df.is_cached)
Is DataFrame cached after unpersist? False