PySpark RDD Transformations Explained
This post covers the most essential RDD transformations in PySpark. We'll walk through map, flatMap, filter, and distinct, then combine them with reduceByKey in a classic Word Count example. These operations are fundamental building blocks for big data processing.
1. Create a Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("rdd_transformations_demo").getOrCreate()
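getOrCreate() returns the already-running session if one exists in the process; otherwise it starts a new one. A quick sanity check (the version string will depend on your installation):
print(spark.version)                # Spark version, e.g. '3.x.x'
print(spark.sparkContext.appName)   # 'rdd_transformations_demo'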
2. Load Lines into RDD
lines = spark.sparkContext.parallelize([
    "Aamir loves Spark",
    "Ali and Raza love PySpark",
    "Spark is fast",
    "Ali loves big data"
])
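If your text lived in a file rather than an in-memory list, you could load it with textFile instead; the path below is just a placeholder:
lines = spark.sparkContext.textFile("words.txt")  # hypothetical path; yields one RDD element per line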
3. Tokenize Text into Words
words_rdd = lines.flatMap(lambda line: line.split(" "))  # flatMap flattens the per-line word lists into a single RDD of words
words_rdd.collect()
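On the sample data above, collect() returns all 15 tokens in their original order:
['Aamir', 'loves', 'Spark', 'Ali', 'and', 'Raza', 'love', 'PySpark', 'Spark', 'is', 'fast', 'Ali', 'loves', 'big', 'data']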
4. Convert Words to (word, 1)
pair_rdd = words_rdd.map(lambda word: (word, 1))  # map emits exactly one output element per input word
pair_rdd.collect()
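Each word becomes a (word, 1) key-value pair, ready for aggregation; the first few of the 15 pairs look like:
[('Aamir', 1), ('loves', 1), ('Spark', 1), ('Ali', 1), ('and', 1), ...]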
5. Filter Out Short Words
filtered_rdd = words_rdd.filter(lambda word: len(word) > 2)  # keep only words longer than two characters (drops "is")
filtered_rdd.collect()
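On this sample the length predicate removes only "is", leaving 14 words:
['Aamir', 'loves', 'Spark', 'Ali', 'and', 'Raza', 'love', 'PySpark', 'Spark', 'fast', 'Ali', 'loves', 'big', 'data']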
6. Distinct Words
distinct_rdd = words_rdd.distinct()  # drops duplicate words; result order is not guaranteed
distinct_rdd.collect()
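The sample contains 12 distinct words (order may vary between runs):
['Aamir', 'loves', 'Spark', 'Ali', 'and', 'Raza', 'love', 'PySpark', 'is', 'fast', 'big', 'data']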
7. Count Word Occurrences
word_count_rdd = pair_rdd.reduceByKey(lambda a, b: a + b)  # sums the 1s for each distinct word key
word_count_rdd.collect()
✅ Output Example (order may vary between runs):
[('Aamir', 1), ('loves', 2), ('Spark', 2), ('Ali', 2), ('and', 1), ('Raza', 1), ('love', 1), ('PySpark', 1), ('is', 1), ('fast', 1), ('big', 1), ('data', 1)]
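For reference, here is a minimal sketch that chains the same steps into one expression, sorts the result by descending count, and stops the session when you are done:
word_counts = (
    lines.flatMap(lambda line: line.split(" "))      # tokenize each line
         .map(lambda word: (word, 1))                # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)            # sum counts per word
         .sortBy(lambda kv: kv[1], ascending=False)  # most frequent words first
         .collect()
)
print(word_counts)
spark.stop()  # release the session's resources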