How to Create a UDTF (User Defined Table Function) in PySpark
Need to return multiple rows per input row in Spark? This guide shows how to define and use UDTFs (User Defined Table Functions) in PySpark with the @udtf decorator, which is available in Spark 3.5 and later.
📘 Step 1: Define a Class to Split Words
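from pyspark.sql.functions import udtf

class SplitWords:
    def eval(self, text: str):
        # Each yielded tuple becomes one output row.
        for word in text.split(" "):
            yield (word,)

# Wrap the class as a UDTF and declare the output schema.
split_words_udtf = udtf(SplitWords, returnType="word: string")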
Use Case: This takes a sentence and emits each word as a row.
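Because eval is an ordinary Python generator, its row logic can be checked without a Spark session. The sketch below does that and also shows one edge case: split(" ") emits an empty word when the input has consecutive spaces. SplitWordsSafe is a hypothetical variant added here for illustration, not part of the original example; it assumes that skipping empty words and NULL input is the behavior you want.

```python
class SplitWords:
    def eval(self, text: str):
        # Same logic as in Step 1: one tuple per word.
        for word in text.split(" "):
            yield (word,)


class SplitWordsSafe:
    """Hypothetical variant: ignores repeated whitespace and NULL input."""

    def eval(self, text):
        if text is None:
            return  # emit no rows for NULL input
        # split() with no argument collapses runs of whitespace.
        for word in text.split():
            yield (word,)


# Call eval directly to preview the rows each class would emit.
rows = list(SplitWords().eval("pyspark is powerful"))
rows_with_gap = list(SplitWords().eval("a  b"))      # two spaces between a and b
safe_rows = list(SplitWordsSafe().eval("a  b"))
null_rows = list(SplitWordsSafe().eval(None))

print(rows)
print(rows_with_gap)
print(safe_rows)
print(null_rows)
```

Either class can be passed to udtf(...) with returnType="word: string" in the same way as above.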
⚡ Step 2: Call the UDTF Directly
from pyspark.sql.functions import lit
split_words_udtf(lit("pyspark is powerful")).show()
Output:
+--------+
|    word|
+--------+
| pyspark|
|      is|
|powerful|
+--------+
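A UDTF can also be registered for use in SQL, which allows applying it to every row of a table with a LATERAL join. The following is a minimal sketch, assuming Spark 3.5+ and an active SparkSession passed in as spark. The function name query_with_sql, the SQL name split_words, and the sentences view are illustrative names chosen here, not part of the original example.

```python
class SplitWords:
    def eval(self, text: str):
        for word in text.split(" "):
            yield (word,)


def query_with_sql(spark):
    """Register SplitWords as a SQL table function and query it.

    Assumes `spark` is an active SparkSession on Spark 3.5 or later.
    """
    # Imported here so the class above stays usable without Spark installed.
    from pyspark.sql.functions import udtf

    split_words_udtf = udtf(SplitWords, returnType="word: string")
    # "split_words" is an illustrative name for the SQL function.
    spark.udtf.register("split_words", split_words_udtf)

    # Call the table function in a FROM clause with a literal argument.
    literal_df = spark.sql("SELECT * FROM split_words('pyspark is powerful')")

    # Apply it to each row of a (hypothetical) table via a LATERAL join.
    spark.createDataFrame(
        [(1, "pyspark is powerful"), (2, "hello world")], ["id", "text"]
    ).createOrReplaceTempView("sentences")
    lateral_df = spark.sql(
        "SELECT id, word FROM sentences, LATERAL split_words(text)"
    )
    return literal_df, lateral_df
```

With a running session, calling query_with_sql(spark) and then .show() on the second DataFrame should list one row per (id, word) pair.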
📊 Step 3: Another UDTF Returning Two Columns
@udtf(returnType="num: int, plus_one: int")
class PlusOne:
    def eval(self, x: int):
        # One row with two columns, matching the declared schema.
        yield x, x + 1

PlusOne(lit(5)).show()
Output:
+---+--------+
|num|plus_one|
+---+--------+
|  5|       6|
+---+--------+