How to Perform Unpivot
in PySpark | Convert Columns to Rows
Author: Aamir Shahzad
Published: March 2025
๐ Introduction
In PySpark, unpivoting means converting columns into rows — the opposite of a pivot operation. While PySpark doesn't have a built-in unpivot()
function, you can easily achieve this using selectExpr()
and stack()
.
๐งพ Sample Dataset
We’ll work with a DataFrame that contains sales data across three years:
Name Sales_2023 Sales_2024 Sales_2025
Aamir Shahzad 800 900 1000
Ali 500 600 1001
Raza 450 550 102
Bob 700 750 103
Lisa 620 None 109
๐ง Create DataFrame in PySpark
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("UnpivotExample").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", 800, 900, 1000),
("Ali", 500, 600, 1001),
("Raza", 450, 550, 102),
("Bob", 700, 750, 103),
("Lisa", 620, None, 109)
]
columns = ["Name", "Sales_2023", "Sales_2024", "Sales_2025"]
df = spark.createDataFrame(data, columns)
df.show()
✅ Expected Output
+-------------+-----------+-----------+-----------+
| Name|Sales_2023|Sales_2024|Sales_2025|
+-------------+-----------+-----------+-----------+
|Aamir Shahzad| 800| 900| 1000|
| Ali| 500| 600| 1001|
| Raza| 450| 550| 102|
| Bob| 700| 750| 103|
| Lisa| 620| null| 109|
+-------------+-----------+-----------+-----------+
๐ Unpivot Using stack()
and selectExpr()
Convert sales columns into rows using stack()
for each year:
unpivotDF = df.selectExpr(
"Name",
"stack(3, '2023', Sales_2023, '2024', Sales_2024, '2025', Sales_2025) as (Year, Sales)"
)
unpivotDF.show()
✅ Expected Output
+-------------+----+-----+
| Name|Year|Sales|
+-------------+----+-----+
|Aamir Shahzad|2023| 800|
|Aamir Shahzad|2024| 900|
|Aamir Shahzad|2025| 1000|
| Ali|2023| 500|
| Ali|2024| 600|
| Ali|2025| 1001|
| Raza|2023| 450|
| Raza|2024| 550|
| Raza|2025| 102|
| Bob|2023| 700|
| Bob|2024| 750|
| Bob|2025| 103|
| Lisa|2023| 620|
| Lisa|2024| null|
| Lisa|2025| 109|
+-------------+----+-----+
๐ Explanation
stack(3, ...)
tells PySpark to generate 3 rows per input row.- Each pair of values (e.g.,
'2023', Sales_2023
) is used to populate new rows. - The result is a normalized table format that's easier to process and analyze.