What is RDD in PySpark?
RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark, representing an immutable, distributed collection of objects. This tutorial helps you understand what RDDs are, how they work, and when to use them.
Definition of RDD
RDD is the low-level data abstraction in PySpark. It provides:
- Fault tolerance: lost partitions can be recomputed from their lineage after a node failure
- Immutability: once created, an RDD cannot be changed; transformations always produce new RDDs (see the sketch after this list)
- Partitioning: data is split into partitions that are distributed across the cluster's nodes
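Because the rest of the RDD API builds on immutability, it is worth seeing it directly. A minimal sketch, assuming a SparkContext named sc like the one created in the next section: a transformation returns a new RDD and leaves the original untouched.

# numbers is never modified; map() returns a separate RDD
numbers = sc.parallelize([1, 2, 3, 4])
doubled = numbers.map(lambda n: n * 2)
print(numbers.collect())  # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8]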
Create RDD Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("rdd_intro").getOrCreate()
sc = spark.sparkContext
data = [("Alice", 28), ("Bob", 35), ("Charlie", 40), ("Diana", 23)]
rdd = sc.parallelize(data, 2)
print("Partition Count:", rdd.getNumPartitions())
Output:
Partition Count: 2
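To see how those four records were split across the two partitions, glom() groups the elements of each partition into a list. The split shown in the comment is illustrative; it depends on how parallelize() distributes the data.

# glom() collects each partition's elements into a list
print(rdd.glom().collect())
# e.g. [[('Alice', 28), ('Bob', 35)], [('Charlie', 40), ('Diana', 23)]]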
Basic Transformations
# map() transformation to format data
mapped = rdd.map(lambda x: f"{x[0]} is {x[1]} years old")
for item in mapped.collect():
    print(item)
Output:
Alice is 28 years old
Bob is 35 years old
Charlie is 40 years old
Diana is 23 years old
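map() is only one of the available transformations. As a further illustration on the same rdd, the sketch below chains filter() with the collect() and count() actions to keep only people older than 30.

# filter() keeps records matching a predicate; count() triggers the job
over_30 = rdd.filter(lambda x: x[1] > 30)
print(over_30.collect())  # [('Bob', 35), ('Charlie', 40)]
print(over_30.count())    # 2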
⚖️ RDD vs DataFrame
- RDD: lower-level API with more control; suited to custom or complex transformations
- DataFrame: higher-level API optimized by the Catalyst engine; supports SQL queries
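The two APIs also interoperate. A minimal sketch, reusing the rdd from above and assuming the column names "name" and "age": convert the RDD of tuples into a DataFrame to get the optimized API, and drop back to an RDD of Row objects via df.rdd when needed.

# Build a DataFrame from the RDD of (name, age) tuples
df = spark.createDataFrame(rdd, ["name", "age"])
df.show()

# df.rdd exposes the underlying RDD of Row objects
rows = df.rdd.map(lambda row: (row["name"], row["age"]))
print(rows.collect())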
When to Use RDD
- When you need fine-grained control over transformations (see the sketch after this list)
- When you’re working with unstructured or complex data
- When DataFrame/Dataset APIs do not support your logic
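As one example of that fine-grained control, here is a minimal sketch using mapPartitions() on the same rdd: the function is called once per partition with an iterator over its records, so any expensive setup (such as opening a connection) can happen once per partition rather than once per record.

# Summarize each partition as (record_count, total_age)
def summarize_partition(records):
    ages = [age for _, age in records]
    yield (len(ages), sum(ages))

per_partition = rdd.mapPartitions(summarize_partition)
print(per_partition.collect())  # e.g. [(2, 63), (2, 63)]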