What is RDD in PySpark? | A Beginner’s Guide to Apache Spark’s Core Data Structure | PySpark Tutorial

🔥 What is RDD in PySpark?

RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark, representing an immutable, distributed collection of objects. This tutorial helps you understand what RDDs are, how they work, and when to use them.

📘 Definition of RDD

An RDD is Spark’s low-level data abstraction, exposed in PySpark through the SparkContext. It provides:

  • Fault tolerance: lost partitions are rebuilt by replaying the RDD’s lineage (its recorded chain of transformations), so node failures do not lose data
  • Immutability: once created, an RDD cannot be changed; every transformation returns a new RDD
  • Partitioning: the data is split into partitions that are processed in parallel across the cluster

🧪 Create RDD Example

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("rdd_intro").getOrCreate()
sc = spark.sparkContext

data = [("Alice", 28), ("Bob", 35), ("Charlie", 40), ("Diana", 23)]
rdd = sc.parallelize(data, 2)
print("Partition Count:", rdd.getNumPartitions())

Output:

Partition Count: 2
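
To see how those four rows were distributed, glom() groups the elements of each partition into a list, so collecting it shows the contents of every partition. A quick check reusing the rdd created above (with this data it typically prints the first two tuples in one partition and the last two in the other):

# glom() turns each partition into a list of its elements,
# so this shows exactly how the rows were split across the 2 partitions
print(rdd.glom().collect())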

🔄 Basic Transformations

# map() transformation to format data
mapped = rdd.map(lambda x: f"{x[0]} is {x[1]} years old")
for item in mapped.collect():
    print(item)

Output:

Alice is 28 years old
Bob is 35 years old
Charlie is 40 years old
Diana is 23 years old
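
A transformation like map() never modifies its source RDD; it returns a new one, and Spark records the chain of transformations (the lineage) so lost partitions can be recomputed after a failure. The short sketch below reuses the rdd from above; filter(), count(), and toDebugString() (which returns bytes in PySpark) are standard RDD operations, though the exact lineage text varies by Spark version:

# filter() produces a new RDD; the original rdd is left untouched
over_30 = rdd.filter(lambda x: x[1] > 30)
print("Over 30:", over_30.collect())   # [('Bob', 35), ('Charlie', 40)]
print("Original count:", rdd.count())  # still 4

# toDebugString() prints the lineage Spark would replay to rebuild lost partitions
print(over_30.toDebugString().decode("utf-8"))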

⚖️ RDD vs DataFrame

  • RDD: lower-level API that gives you fine-grained control over how each record is processed; no built-in query optimizer
  • DataFrame: higher-level, schema-aware API optimized by Spark’s Catalyst engine; supports SQL queries (the two interoperate, as the sketch below shows)
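
An RDD of tuples can be promoted to a DataFrame by supplying column names, and every DataFrame exposes its underlying RDD of Row objects. A minimal sketch reusing rdd and spark from above (the column names "name" and "age" are just illustrative):

# Give the (name, age) tuples a schema to get an optimized DataFrame
df = spark.createDataFrame(rdd, ["name", "age"])
df.show()

# Drop back down to the RDD API when you need it
print(df.rdd.take(2))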

📌 When to Use RDD

  • When you need fine-grained control over transformations
  • When you’re working with unstructured or complex data
  • When DataFrame/Dataset APIs do not support your logic (see the key-value sketch after this list)
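
As one example of that fine-grained control, key-value RDDs support per-key aggregation with reduceByKey(), which is handy when the input does not fit a flat tabular schema. A small sketch on made-up word data (the words list and the resulting counts are purely illustrative):

# Classic word count: pair each word with 1, then sum the counts per key
words = sc.parallelize(["spark", "rdd", "spark", "pyspark", "rdd", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())   # e.g. [('spark', 3), ('rdd', 2), ('pyspark', 1)]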
