PySpark Tutorial: How to Build User-Defined Table Functions (UDTFs) in PySpark | Split Rows

Create UDTF in PySpark | Generate Multiple Rows per Input

How to Create a UDTF (User-Defined Table Function) in PySpark

Need to return multiple rows per input row in Spark? Learn how to define and use UDTFs (User-Defined Table Functions) in PySpark, first by wrapping a class with udtf() and then with the @udtf decorator. Python UDTFs are available in Spark 3.5 and later.

📘 Step 1: Define a Class to Split Words

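from pyspark.sql.functions import udtf

class SplitWords:
    # eval() runs once per input row; each yielded tuple becomes one output row.
    def eval(self, text: str):
        for word in text.split(" "):
            yield (word,)

# Wrap the class and declare the output schema as a DDL string.
split_words_udtf = udtf(SplitWords, returnType="word: string")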

Use Case: This takes a sentence and emits each word as a row.
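
Because the handler is an ordinary Python class, its eval logic can be checked without a Spark session before it is wrapped with udtf. The sketch below repeats SplitWords so that it is self-contained and calls eval directly. The helper name rows_for is hypothetical and exists only for this illustration.

```python
class SplitWords:
    def eval(self, text: str):
        # Each yielded tuple corresponds to one output row with a single "word" column.
        for word in text.split(" "):
            yield (word,)


def rows_for(handler_cls, *args):
    """Hypothetical helper: collect the tuples that eval() yields for one input row."""
    return list(handler_cls().eval(*args))


rows = rows_for(SplitWords, "pyspark is powerful")
print(rows)  # [('pyspark',), ('is',), ('powerful',)]
```

Note that split(" ") keeps empty strings when the input contains consecutive spaces. Calling text.split() with no argument would drop them, if that behavior is preferred.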

⚡ Step 2: Call the UDTF Directly

from pyspark.sql.functions import lit

split_words_udtf(lit("pyspark is powerful")).show()

Output:

+--------+
|    word|
+--------+
| pyspark|
|      is|
|powerful|
+--------+

📊 Step 3: Another UDTF Returning Two Columns

@udtf(returnType="num: int, plus_one: int")
class PlusOne:
    def eval(self, x: int):
        yield x, x + 1

PlusOne(lit(5)).show()

Output:

+---+--------+
|num|plus_one|
+---+--------+
|  5|       6|
+---+--------+
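
A SQL NULL reaches eval as Python None, so x + 1 in PlusOne would raise a TypeError for such a row. Here is a minimal sketch of one way to handle this, assuming that skipping the row is acceptable: yield nothing when the input is None. The class name PlusOneNullSafe is hypothetical, and the class is shown without the @udtf decorator so that it runs without Spark.

```python
class PlusOneNullSafe:
    def eval(self, x: int):
        # Emitting no rows is allowed: a None input simply produces no output.
        if x is None:
            return
        yield x, x + 1


print(list(PlusOneNullSafe().eval(5)))     # [(5, 6)]
print(list(PlusOneNullSafe().eval(None)))  # []
```

To use it in Spark, it can be wrapped the same way as the earlier classes, for example with udtf(PlusOneNullSafe, returnType="num: int, plus_one: int").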
