PySpark RDD Transformations Explained | map, flatMap, filter, distinct, Word Count

This post covers the most essential RDD transformations in PySpark: map, flatMap, filter, and distinct. We'll walk through each one and then combine them in a classic Word Count example, a pattern that is fundamental to big data processing.

1. Create a Spark Session

from pyspark.sql import SparkSession

# The entry point; RDD operations are reached through spark.sparkContext.
spark = SparkSession.builder.appName("rdd_transformations_demo").getOrCreate()
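
If you are running this on a single machine rather than a cluster, you can set the master explicitly. A minimal sketch, assuming a local development setup:

# "local[*]" runs Spark locally with one worker thread per CPU core.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("rdd_transformations_demo")
         .getOrCreate())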

2. Load Lines into RDD

# Distribute a small in-memory list of sentences as an RDD.
lines = spark.sparkContext.parallelize([
    "Aamir loves Spark",
    "Ali and Raza love PySpark",
    "Spark is fast",
    "Ali loves big data"
])
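
A quick sanity check on the new RDD (the count is deterministic for this input; the partition count depends on your configuration):

lines.count()               # 4 -- one element per sentence
lines.getNumPartitions()    # depends on spark.default.parallelism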

3. Tokenize Text into Words

# flatMap applies the function to each line, then flattens the resulting
# lists into a single RDD of individual words.
words_rdd = lines.flatMap(lambda line: line.split(" "))
words_rdd.collect()
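
collect() returns the elements in partition order, so for this input the result is a flat list of words. Contrast this with map, which keeps one output per input and leaves the lists nested:

# flatMap result:
# ['Aamir', 'loves', 'Spark', 'Ali', 'and', 'Raza', 'love', 'PySpark',
#  'Spark', 'is', 'fast', 'Ali', 'loves', 'big', 'data']

nested_rdd = lines.map(lambda line: line.split(" "))
nested_rdd.collect()
# [['Aamir', 'loves', 'Spark'], ['Ali', 'and', 'Raza', 'love', 'PySpark'],
#  ['Spark', 'is', 'fast'], ['Ali', 'loves', 'big', 'data']]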

4. Convert Words to (word, 1)

# map emits exactly one output per input: each word becomes a (word, 1) pair.
pair_rdd = words_rdd.map(lambda word: (word, 1))
pair_rdd.collect()
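
For the sample sentences, the first few pairs look like this:

pair_rdd.take(5)
# [('Aamir', 1), ('loves', 1), ('Spark', 1), ('Ali', 1), ('and', 1)]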

5. Filter Out Short Words

# filter keeps only the elements for which the predicate returns True;
# here we drop the two-letter word "is".
filtered_rdd = words_rdd.filter(lambda word: word != "is")
filtered_rdd.collect()
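
The predicate above removes the one stop word in this dataset. A length-based predicate generalizes the idea of dropping short words (the threshold of 3 is an arbitrary choice for illustration):

no_short_rdd = words_rdd.filter(lambda word: len(word) >= 3)
no_short_rdd.collect()   # drops 'is'; every other word here has 3+ characters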

6. Distinct Words

# distinct removes duplicate words; it requires a shuffle under the hood.
distinct_rdd = words_rdd.distinct()
distinct_rdd.collect()
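
Because of that shuffle, the order of the collected result can change between runs. Sorting on the driver gives a stable view:

sorted(distinct_rdd.collect())
# ['Aamir', 'Ali', 'PySpark', 'Raza', 'Spark', 'and', 'big',
#  'data', 'fast', 'is', 'love', 'loves']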

7. Count Word Occurrences

# reduceByKey merges the values of each key with the given function,
# summing the 1s into a count per word.
word_count_rdd = pair_rdd.reduceByKey(lambda a, b: a + b)
word_count_rdd.collect()

✅ Output Example (order may vary between runs because reduceByKey shuffles the data):

[('Aamir', 1), ('loves', 2), ('Spark', 2), ('Ali', 2), ('and', 1), ('Raza', 1), ('love', 1), ('PySpark', 1), ('is', 1), ('fast', 1), ('big', 1), ('data', 1)]
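
To present the counts in a stable order, you can sort by count before collecting (ties between equal counts may still land in either order):

top_words = word_count_rdd.sortBy(lambda kv: kv[1], ascending=False)
top_words.take(3)
# e.g. [('loves', 2), ('Spark', 2), ('Ali', 2)]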

