How to Perform Unpivot
in PySpark | Convert Columns to Rows
Author: Aamir Shahzad
Published: March 2025
📘 Introduction
In PySpark, unpivoting means converting columns into rows — the opposite of a pivot operation. While PySpark doesn't have a built-in unpivot()
function, you can easily achieve this using selectExpr()
and stack()
.
🧾 Sample Dataset
We’ll work with a DataFrame that contains sales data across three years:
Name Sales_2023 Sales_2024 Sales_2025
Aamir Shahzad 800 900 1000
Ali 500 600 1001
Raza 450 550 102
Bob 700 750 103
Lisa 620 None 109
🔧 Create DataFrame in PySpark
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("UnpivotExample").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", 800, 900, 1000),
("Ali", 500, 600, 1001),
("Raza", 450, 550, 102),
("Bob", 700, 750, 103),
("Lisa", 620, None, 109)
]
columns = ["Name", "Sales_2023", "Sales_2024", "Sales_2025"]
df = spark.createDataFrame(data, columns)
df.show()
✅ Expected Output
+-------------+-----------+-----------+-----------+
| Name|Sales_2023|Sales_2024|Sales_2025|
+-------------+-----------+-----------+-----------+
|Aamir Shahzad| 800| 900| 1000|
| Ali| 500| 600| 1001|
| Raza| 450| 550| 102|
| Bob| 700| 750| 103|
| Lisa| 620| null| 109|
+-------------+-----------+-----------+-----------+
🔁 Unpivot Using stack()
and selectExpr()
Convert sales columns into rows using stack()
for each year:
unpivotDF = df.selectExpr(
"Name",
"stack(3, '2023', Sales_2023, '2024', Sales_2024, '2025', Sales_2025) as (Year, Sales)"
)
unpivotDF.show()
✅ Expected Output
+-------------+----+-----+
| Name|Year|Sales|
+-------------+----+-----+
|Aamir Shahzad|2023| 800|
|Aamir Shahzad|2024| 900|
|Aamir Shahzad|2025| 1000|
| Ali|2023| 500|
| Ali|2024| 600|
| Ali|2025| 1001|
| Raza|2023| 450|
| Raza|2024| 550|
| Raza|2025| 102|
| Bob|2023| 700|
| Bob|2024| 750|
| Bob|2025| 103|
| Lisa|2023| 620|
| Lisa|2024| null|
| Lisa|2025| 109|
+-------------+----+-----+
📌 Explanation
stack(3, ...)
tells PySpark to generate 3 rows per input row.- Each pair of values (e.g.,
'2023', Sales_2023
) is used to populate new rows. - The result is a normalized table format that's easier to process and analyze.