PySpark Tutorial: How to Use transform() for Custom DataFrame Transformations
In this tutorial, you will learn how to use the transform() function in PySpark to apply custom, reusable transformations to DataFrames. This is a great way to simplify complex logic and keep your code clean!
What is transform() in PySpark?
The transform() function lets you apply a custom function to a DataFrame: you pass in a function that takes a DataFrame and returns a DataFrame, and transform() calls it for you, as shown in the sketch below. This makes it a clean way to chain multiple operations, especially when the same logic is reused across pipelines.
- Helps make your code more reusable.
- Applies custom transformations on DataFrames.
- Keeps your DataFrame pipelines clean and modular.
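To make the mechanics concrete, here is a minimal sketch (assuming a SparkSession and a DataFrame df already exist; clean_names and add_raise are hypothetical function names, not part of the example that follows). The key point: df.transform(f) is simply f(df), which is what lets chains read top to bottom.

from pyspark.sql.functions import col, upper

def clean_names(input_df):
    # Hypothetical step: uppercase the Name column in place.
    return input_df.withColumn("Name", upper(col("Name")))

def add_raise(input_df):
    # Hypothetical step: give everyone a 5% raise.
    return input_df.withColumn("Salary", col("Salary") * 1.05)

# df.transform(f) is equivalent to f(df), so this chain is the
# same as add_raise(clean_names(df)), but reads in apply order:
result = df.transform(clean_names).transform(add_raise)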
Step 1: Create Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper
spark = SparkSession.builder \
    .appName("PySpark transform() Example") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    (1, "Aamir Shahzad", 5000),
    (2, "Ali Raza", 6000),
    (3, "Bob", 5500),
    (4, "Lisa", 7000)
]
columns = ["ID", "Name", "Salary"]
df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show()
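With the data above, df.show() should print a table roughly like this (show() right-aligns cell values by default):

+--+-------------+------+
|ID|         Name|Salary|
+--+-------------+------+
| 1|Aamir Shahzad|  5000|
| 2|     Ali Raza|  6000|
| 3|          Bob|  5500|
| 4|         Lisa|  7000|
+--+-------------+------+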
Step 3: Define a Transformation Function
def add_bonus(input_df):
    # 1. Add a Name_Upper column with the Name column uppercased
    # 2. Add a Bonus column equal to 10% of Salary
    return input_df.withColumn("Name_Upper", upper(col("Name"))) \
                   .withColumn("Bonus", col("Salary") * 0.10)
Step 4: Apply the Transformation with transform()
df_transformed = df.transform(add_bonus)
print("Transformed DataFrame:")
df_transformed.show()
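With the sample data, the transformed DataFrame should look roughly like this, with Bonus computed as 10% of Salary:

+--+-------------+------+-------------+-----+
|ID|         Name|Salary|   Name_Upper|Bonus|
+--+-------------+------+-------------+-----+
| 1|Aamir Shahzad|  5000|AAMIR SHAHZAD|500.0|
| 2|     Ali Raza|  6000|     ALI RAZA|600.0|
| 3|          Bob|  5500|          BOB|550.0|
| 4|         Lisa|  7000|         LISA|700.0|
+--+-------------+------+-------------+-----+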
Why Use transform()?
- Cleaner, modular code with reusable logic.
- Perfect for applying consistent transformations across multiple DataFrames (see the sketch after this list).
- Improves readability and testability in complex pipelines.
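As a final sketch of the reuse point above, the same function can be applied verbatim to any DataFrame that has the required columns (contractors_df is a hypothetical second DataFrame):

# add_bonus works on any DataFrame with Name and Salary columns,
# so the same tested logic can be shared across pipelines:
employees_with_bonus = df.transform(add_bonus)
# contractors_with_bonus = contractors_df.transform(add_bonus)

# Transforms also compose cleanly with built-in DataFrame methods:
high_bonus = df.transform(add_bonus).filter(col("Bonus") > 500)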