PySpark Tutorial: PySpark transform() Function | How to Apply Custom Transformations to DataFrames

PySpark Tutorial: How to Use transform() for Custom DataFrame Transformations

In this tutorial, you will learn how to use the transform() function in PySpark to apply custom reusable transformations on DataFrames. This is a great way to simplify complex logic and make your code cleaner!

What is transform() in PySpark?

The transform() function allows you to apply a custom function to a DataFrame. It’s a cleaner way to chain multiple operations, especially when applying reusable logic.

  • Helps make your code more reusable.
  • Applies custom transformations on DataFrames.
  • Keeps your DataFrame pipelines clean and modular.
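
Under the hood, df.transform(func) essentially just calls func(df) and returns the result, so the function you pass in should take a DataFrame and return a DataFrame. Here is a minimal sketch of that contract (identity_step is a throwaway name used only for illustration):

from pyspark.sql import DataFrame

# A transform()-compatible function: takes a DataFrame, returns a DataFrame.
def identity_step(input_df: DataFrame) -> DataFrame:
    return input_df

# These two calls produce the same result:
#   result = identity_step(some_df)
#   result = some_df.transform(identity_step)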

Step 1: Create Spark Session

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder \
    .appName("PySpark transform() Example") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    (1, "Aamir Shahzad", 5000),
    (2, "Ali Raza", 6000),
    (3, "Bob", 5500),
    (4, "Lisa", 7000)
]

columns = ["ID", "Name", "Salary"]

df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show()
Original DataFrame:
+---+--------------+------+
| ID|          Name|Salary|
+---+--------------+------+
|  1| Aamir Shahzad|  5000|
|  2|      Ali Raza|  6000|
|  3|           Bob|  5500|
|  4|          Lisa|  7000|
+---+--------------+------+

Step 3: Define a Transformation Function

def add_bonus(input_df):
    # 1. Uppercase the Name column
    # 2. Add a new column "Bonus" which is 10% of Salary
    return input_df.withColumn("Name_Upper", upper(col("Name"))) \
                   .withColumn("Bonus", col("Salary") * 0.10)

Step 4: Apply transform() Function

df_transformed = df.transform(add_bonus)

print("Transformed DataFrame:")
df_transformed.show()
Transformed DataFrame:
+---+--------------+------+--------------+------+
| ID|          Name|Salary|    Name_Upper| Bonus|
+---+--------------+------+--------------+------+
|  1| Aamir Shahzad|  5000| AAMIR SHAHZAD| 500.0|
|  2|      Ali Raza|  6000|      ALI RAZA| 600.0|
|  3|           Bob|  5500|           BOB| 550.0|
|  4|          Lisa|  7000|          LISA| 700.0|
+---+--------------+------+--------------+------+
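
Because transform() returns a DataFrame, calls can be chained to build a readable, top-to-bottom pipeline. The sketch below reuses add_bonus from Step 3 and adds a second, purely illustrative step (add_tax is not defined anywhere above):

def add_tax(input_df):
    # Illustrative second step: 5% tax on Salary
    return input_df.withColumn("Tax", col("Salary") * 0.05)

df_pipeline = df.transform(add_bonus).transform(add_tax)
df_pipeline.show()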

Why Use transform()?

  • Cleaner, modular code with reusable logic.
  • Perfect for applying consistent transformations across multiple DataFrames.
  • Improves readability and testability in complex pipelines (see the testing sketch after this list).
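
For example, because add_bonus is an ordinary Python function, it can be applied to any DataFrame that has Name and Salary columns and tested in isolation. A minimal testing sketch, assuming the spark session and add_bonus defined in the steps above and plain assert statements:

def test_add_bonus():
    # Small in-memory input with the same columns as the tutorial DataFrame
    test_df = spark.createDataFrame([(99, "Test User", 1000)], ["ID", "Name", "Salary"])

    result = test_df.transform(add_bonus).collect()[0]

    assert result["Name_Upper"] == "TEST USER"
    assert result["Bonus"] == 100.0

test_add_bonus()
print("add_bonus test passed")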

📺 Watch the Full Tutorial Video

For a complete walkthrough, watch the video below:

▶️ Watch on YouTube

Author: Aamir Shahzad

© 2024 PySpark Tutorials. All rights reserved.