Understanding RDD Actions in PySpark
Learn the difference between collect(), count(), and reduce() in PySpark through examples and output.
📘 Introduction
In PySpark, RDD actions trigger the execution of the transformations recorded on an RDD and return a result to the driver. Unlike transformations, which are lazily evaluated, actions cause Spark to actually process the data.
🧪 Sample RDD
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a local Spark installation
data = [10, 20, 30, 40, 50, 60]
rdd = spark.sparkContext.parallelize(data)
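Transformations on this RDD illustrate the laziness described above. A minimal sketch, using the hypothetical name doubled for the transformed RDD:

# A transformation: nothing runs yet; Spark only records the lineage
doubled = rdd.map(lambda x: x * 2)

No job executes until an action, such as the ones below, is called.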
🔹 collect()
Returns all elements to the driver:
rdd.collect()
# Output: [10, 20, 30, 40, 50, 60]
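Calling collect() on the doubled RDD sketched above would finally trigger the deferred map():

doubled.collect()
# Expected output: [20, 40, 60, 80, 100, 120]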
🔹 count()
Returns the number of elements in the RDD:
rdd.count()
# Output: 6
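count() is often chained after a transformation such as filter(); a minimal sketch on the same sample RDD:

# Count the elements greater than 30 (i.e. 40, 50, 60)
rdd.filter(lambda x: x > 30).count()
# Expected output: 3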
🔹 reduce()
Aggregates the elements of the RDD using a two-argument function, which should be commutative and associative because Spark applies it in parallel across partitions:
rdd.reduce(lambda a, b: a + b)
# Output: 210
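Any commutative and associative function works with reduce(); for example, a sketch that finds the maximum of the same sample RDD:

rdd.reduce(lambda a, b: a if a > b else b)
# Expected output: 60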
💡 Summary
- collect() → Brings all data to the driver (use with caution on large datasets; see the take() sketch after this list)
- count() → Returns number of elements
- reduce() → Returns aggregated result
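As a bounded alternative to collect() on large datasets, the take(n) action returns only the first n elements to the driver; a minimal sketch:

rdd.take(3)
# Expected output: [10, 20, 30]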