Boost PySpark Performance with pandas_udf | Beginner-Friendly Tutorial with Real Examples | PySpark Tutorial


🚀 Speed Up PySpark with pandas_udf() – Easy Tutorial

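Want faster PySpark jobs? This tutorial shows how to use pandas_udf() to process data in batches. Spark passes each batch to your function as a pandas Series (transferred with Apache Arrow), so one vectorized pandas call replaces the per-row Python calls of a regular UDF. In practice this usually makes a pandas_udf noticeably faster than a regular Python UDF.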
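
To make the "batches" idea concrete, here is a minimal sketch in plain Python and pandas, with no Spark involved. The function names length_one_row and length_one_batch are made up for illustration; they only mimic the calling shape of a regular UDF and of a pandas_udf body.

```python
import pandas as pd


def length_one_row(name: str) -> int:
    # Calling shape of a regular Python UDF: one value in, one value out.
    return len(name)


def length_one_batch(names: pd.Series) -> pd.Series:
    # Calling shape of a pandas_udf body: a whole Series in, a Series out.
    return names.str.len()


batch = pd.Series(["apple", "banana", "kiwi"])

# Row-at-a-time: the function is called three times.
row_results = [length_one_row(name) for name in batch]

# Batch-at-a-time: the function is called once for all three values.
batch_results = length_one_batch(batch).tolist()

print(row_results)    # [5, 6, 4]
print(batch_results)  # [5, 6, 4]
```

Both approaches give the same answer. The difference is how many times Python code has to be entered, which is where the batch style tends to save time on large columns.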

📘 Sample DataFrame

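from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuses the existing session in a notebook or shell
data = [("apple",), ("banana",), ("kiwi",)]
df = spark.createDataFrame(data, ["fruit"])
df.show()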

Output:

+------+
| fruit|
+------+
| apple|
|banana|
|  kiwi|
+------+

⚡ Step 1: Define a pandas_udf to Get Length of Fruit Name

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType
import pandas as pd

@pandas_udf(IntegerType())
def fruit_length(series: pd.Series) -> pd.Series:
    # Receives a batch of fruit names as a pandas Series and returns a Series of the same length.
    return series.str.len()

df = df.withColumn("length", fruit_length(df["fruit"]))
df.select("fruit", "length").show()

Output:

+------+------+
| fruit|length|
+------+------+
| apple|     5|
|banana|     6|
|  kiwi|     4|
+------+------+
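
Real data often contains missing names. The following is a small pandas-only sketch of what series.str.len() returns in that case. The sample values and variable names are assumptions for illustration, and it only shows the pandas side, not how the result is written back into the Spark column.

```python
import pandas as pd

# dtype=object mirrors a plain string column that contains a missing value.
names = pd.Series(["apple", None, "kiwi"], dtype=object)

lengths = names.str.len()
print(lengths.dtype)  # float64, because NaN marks the missing name

# Optional: pandas' nullable integer dtype keeps whole numbers plus a missing marker.
nullable_lengths = lengths.astype("Int64")
print(nullable_lengths.isna().tolist())  # [False, True, False]
```

The takeaway is that a missing name does not raise an error here; it simply produces a missing length.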

🎯 Step 2: Classify Fruit Based on Length

from pyspark.sql.types import StringType

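@pandas_udf(StringType())
def classify_fruit(series: pd.Series) -> pd.Series:
    # Series.apply calls the lambda once per name in the batch; names longer than 5 characters are "long name".
    return series.apply(lambda name: "long name" if len(name) > 5 else "short name")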

df = df.withColumn("length_category", classify_fruit(df["fruit"]))
df.select("fruit", "length_category").show()

Output:

+------+---------------+
| fruit|length_category|
+------+---------------+
| apple|     short name|
|banana|      long name|
|  kiwi|     short name|
+------+---------------+
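
The classification above uses series.apply, which runs a Python lambda for every name and would raise a TypeError on a missing name, because len(None) is not defined. Below is one possible vectorized variant, sketched in plain pandas and NumPy so it can be tried without Spark. The function name classify_fruit_vectorized and the choice to keep missing names missing are assumptions, not part of the original tutorial.

```python
import numpy as np
import pandas as pd


def classify_fruit_vectorized(series: pd.Series) -> pd.Series:
    """Label each name by length using whole-column operations."""
    lengths = series.str.len()  # NaN where the name is missing
    # NaN > 5 evaluates to False, so missing names are labelled "short name" at this point.
    labels = np.where(lengths > 5, "long name", "short name")
    result = pd.Series(labels, index=series.index, dtype=object)
    # Assumption: a missing name should stay missing rather than be called "short name".
    return result.where(series.notna())


sample = pd.Series(["apple", "banana", "kiwi", None], dtype=object)
categories = classify_fruit_vectorized(sample)
print(categories.tolist()[:3])  # ['short name', 'long name', 'short name']
```

To try it in Spark, the same function could be wrapped with pandas_udf(StringType()) in the same way as classify_fruit in Step 2.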

