Create Custom Column Logic with udf() in PySpark
Sometimes PySpark's built-in functions just aren't enough. That's when you reach for udf(), which wraps an ordinary Python function so Spark can apply it to a DataFrame column, one row at a time.
📘 Sample Data
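from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuses the active session if there is one
data = [("apple", 2), ("banana", 3), ("kiwi", 1)]
df = spark.createDataFrame(data, ["fruit", "quantity"])
df.show()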
Output:
+------+--------+
| fruit|quantity|
+------+--------+
| apple|       2|
|banana|       3|
|  kiwi|       1|
+------+--------+
🧠 Step 1: Define a Python Function to Tag Price
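from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def tag_price(qty: int) -> str:
    # Label a row by its quantity: 3 or more counts as "expensive"
    return "expensive" if qty >= 3 else "cheap"

# Wrap the function as a UDF and declare that it returns a string
tag_price_udf = udf(tag_price, StringType())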
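One thing the function above does not cover is missing data: Spark hands a SQL NULL to a Python UDF as None, and in Python 3 the comparison `None >= 3` raises a TypeError. The following is a minimal sketch of a null-tolerant variant, not part of the original walkthrough. The names `tag_price_safe` and `make_tag_price_udf` are hypothetical, and passing nulls through unchanged is an assumption about the behavior you want.

```python
from typing import Optional


def tag_price_safe(qty: Optional[int]) -> Optional[str]:
    """Same rule as tag_price, but a missing quantity stays missing."""
    # Spark passes SQL NULL into a Python UDF as None; comparing None >= 3
    # would raise TypeError, so return early instead.
    if qty is None:
        return None
    return "expensive" if qty >= 3 else "cheap"


def make_tag_price_udf():
    """Wrap tag_price_safe as a Spark UDF (hypothetical helper)."""
    # Imported here so the plain function above can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(tag_price_safe, StringType())
```

With a DataFrame like the sample one, this could be applied as `df.withColumn("price_tag", make_tag_price_udf()(df["quantity"]))`. A None returned from the function shows up as a null in the new column.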
⚡ Step 2: Apply UDF to Create a New Column
df = df.withColumn("price_tag", tag_price_udf(df["quantity"]))
df.select("fruit", "quantity", "price_tag").show()
Output:
+------+--------+---------+
| fruit|quantity|price_tag|
+------+--------+---------+
| apple|       2|    cheap|
|banana|       3|expensive|
|  kiwi|       1|    cheap|
+------+--------+---------+
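Because a UDF simply calls the wrapped Python function on each row's value, you can sanity-check the logic without starting Spark. The sketch below reuses the tutorial's `tag_price` and sample rows; the `tagged` list is an illustrative name and stands in for the new column.

```python
def tag_price(qty: int) -> str:
    # Same rule as the tutorial's UDF: 3 or more counts as "expensive"
    return "expensive" if qty >= 3 else "cheap"


# The sample rows used to build the DataFrame
data = [("apple", 2), ("banana", 3), ("kiwi", 1)]

# Call the function once per row, as the UDF does for each quantity value
tagged = [(fruit, qty, tag_price(qty)) for fruit, qty in data]

for row in tagged:
    print(row)
```

Checking the plain function first keeps logic errors separate from Spark-side issues such as a mismatched return type.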
🔤 Step 3: Another UDF to Transform Fruit Name
def shout_name(name: str) -> str:
    # Upper-case the fruit name and append an exclamation mark
    return name.upper() + "!"

shout_udf = udf(shout_name, StringType())
df = df.withColumn("fruit_shout", shout_udf(df["fruit"]))
df.select("fruit", "fruit_shout").show()
Output:
+------+-----------+
| fruit|fruit_shout|
+------+-----------+
| apple|     APPLE!|
|banana|    BANANA!|
|  kiwi|      KIWI!|
+------+-----------+
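For comparison, this particular transformation can also be written with built-in column functions, which helps show where a UDF is actually needed. This is a sketch rather than part of the original walkthrough: the function name `add_fruit_shout_builtin` is hypothetical, and it assumes a DataFrame with a string column named `fruit`.

```python
def add_fruit_shout_builtin(df):
    """Add a fruit_shout column using built-in functions instead of a UDF."""
    # Imported inside the function so PySpark is only needed when it is called.
    from pyspark.sql import functions as F

    # upper() upper-cases the column, lit("!") is a constant column,
    # and concat() joins them into one string column.
    return df.withColumn(
        "fruit_shout",
        F.concat(F.upper(F.col("fruit")), F.lit("!")),
    )
```

Calling `add_fruit_shout_builtin(df).select("fruit", "fruit_shout").show()` should give the same values as the UDF version for the sample data. One difference to keep in mind: `concat` yields null when any input is null, while the Python `shout_name` would raise an error on a None value.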