Create Custom Column Logic in PySpark Using udf() | Easy Guide with Real Examples | PySpark Tutorial

Create Custom Column Logic with udf() in PySpark

Sometimes PySpark's built-in functions aren't enough. That's when you use udf() to apply your own Python logic, row by row, to the columns of a Spark DataFrame.

📘 Sample Data

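from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuses the active session if one exists

data = [("apple", 2), ("banana", 3), ("kiwi", 1)]
df = spark.createDataFrame(data, ["fruit", "quantity"])
df.show()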

Output:

+------+--------+
| fruit|quantity|
+------+--------+
| apple|       2|
|banana|       3|
|  kiwi|       1|
+------+--------+

🧠 Step 1: Define a Python Function to Tag Price

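from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Plain Python function; Spark calls it once per row.
def tag_price(qty: int) -> str:
    return "expensive" if qty >= 3 else "cheap"

# Wrap the function as a UDF and declare that it returns a string.
tag_price_udf = udf(tag_price, StringType())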
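
Spark passes null column values to a Python UDF as None, so tag_price as written would raise a TypeError on a null quantity (None >= 3 is not a valid comparison). A minimal null-tolerant sketch follows. The name tag_price_safe is hypothetical, and returning None for a missing quantity is an assumption about the desired behavior. Because it is a plain Python function, it can be checked without a Spark session:

```python
from typing import Optional


def tag_price_safe(qty: Optional[int]) -> Optional[str]:
    # Hypothetical null-tolerant variant of tag_price.
    # Assumption: a missing quantity should produce a missing tag.
    if qty is None:
        return None
    return "expensive" if qty >= 3 else "cheap"


# Plain-Python checks, no Spark session required.
for qty in (1, 3, None):
    print(qty, "->", tag_price_safe(qty))
```

It can be wrapped the same way as tag_price: udf(tag_price_safe, StringType()).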

⚡ Step 2: Apply UDF to Create a New Column

df = df.withColumn("price_tag", tag_price_udf(df["quantity"]))
df.select("fruit", "quantity", "price_tag").show()

Output:

+------+--------+---------+
| fruit|quantity|price_tag|
+------+--------+---------+
| apple|       2|    cheap|
|banana|       3|expensive|
|  kiwi|       1|    cheap|
+------+--------+---------+

🔤 Step 3: Another UDF to Transform Fruit Name

def shout_name(name: str) -> str:
    return name.upper() + "!"

shout_udf = udf(shout_name, StringType())
df = df.withColumn("fruit_shout", shout_udf(df["fruit"]))
df.select("fruit", "fruit_shout").show()

Output:

+------+-----------+
| fruit|fruit_shout|
+------+-----------+
| apple|     APPLE!|
|banana|    BANANA!|
|  kiwi|      KIWI!|
+------+-----------+
