Understanding RDD Actions in PySpark: collect(), count(), reduce() Explained

Learn the difference between collect(), count(), and reduce() in PySpark through examples and output.

📘 Introduction

In PySpark, RDD actions trigger the execution of transformations and return results to the driver. Unlike transformations, which are lazy, actions cause Spark to actually process the data.

🧪 Sample RDD

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [10, 20, 30, 40, 50, 60]
rdd = spark.sparkContext.parallelize(data)
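
Transformations on this RDD are lazy: the map() call below returns immediately without touching the data, and Spark only runs the job when an action (such as collect(), covered next) is invoked:

squared = rdd.map(lambda x: x * x)  # lazy: builds lineage, computes nothing
squared.collect()                   # action: triggers the actual computation
# Output: [100, 400, 900, 1600, 2500, 3600]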

🔹 collect()

Returns all elements of the RDD to the driver as a Python list:

rdd.collect()
# Output: [10, 20, 30, 40, 50, 60]
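
Because collect() ships every element back to the driver, it can exhaust driver memory on large datasets. When only a few elements are needed, take(n) is a bounded alternative:

rdd.take(3)
# Output: [10, 20, 30]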

🔹 count()

Returns the number of elements in the RDD:

rdd.count()
# Output: 6
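
Like any action, count() also runs any pending transformations, which makes it a quick way to check the size of a filtered RDD:

rdd.filter(lambda x: x > 30).count()
# Output: 3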

🔹 reduce()

Aggregates the elements of the RDD using a binary function:

rdd.reduce(lambda a, b: a + b)
# Output: 210
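
Spark applies the function first within each partition and then combines the partial results, so the function should be commutative and associative. The same pattern works for other aggregations, such as taking the maximum:

rdd.reduce(lambda a, b: a if a > b else b)
# Output: 60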

💡 Summary

  • collect() → Brings all data to the driver (use with caution on large datasets)
  • count() → Returns number of elements
  • reduce() → Returns aggregated result
