How to Use pivot()
Function in PySpark | Step-by-Step Guide
Author: Aamir Shahzad
Published: March 2025
๐ Introduction
The pivot()
function in PySpark lets you rotate data in a DataFrame. It's used to transform rows into columns—helpful when summarizing data for reporting and analytics.
๐งพ Sample Dataset
Name Year Product Revenue
Aamir Shahzad 2023 Product A 500
Aamir Shahzad 2023 Product B 300
Ali 2023 Product A 400
Raza 2023 Product B 200
Bob 2024 Product A 700
Lisa 2024 Product B 600
Ali 2024 Product A 300
Raza 2024 Product B 500
Aamir Shahzad 2024 Product A 800
Lisa 2023 Product A 650
Lisa 2025 Product A 650
Lisa 2026 Product C 1100
๐ง Create DataFrame in PySpark
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("PivotFunctionExample").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", 2023, "Product A", 500),
("Aamir Shahzad", 2023, "Product B", 300),
("Ali", 2023, "Product A", 400),
("Raza", 2023, "Product B", 200),
("Bob", 2024, "Product A", 700),
("Lisa", 2024, "Product B", 600),
("Ali", 2024, "Product A", 300),
("Raza", 2024, "Product B", 500),
("Aamir Shahzad", 2024, "Product A", 800),
("Lisa", 2023, "Product A", 650),
("Lisa", 2025, "Product A", 650),
("Lisa", 2026, "Product C", 1100)
]
columns = ["Name", "Year", "Product", "Revenue"]
df = spark.createDataFrame(data, columns)
# Show original DataFrame
df.show()
๐ Pivot Example 1: Pivot on Product
Rotate the Product
column into individual columns and summarize Revenue
using sum()
.
from pyspark.sql.functions import sum
pivot_df = df.groupBy("Name", "Year").pivot("Product").agg(sum("Revenue"))
pivot_df.show()
✅ Expected Output
+-------------+----+---------+---------+---------+
| Name|Year|Product A|Product B|Product C|
+-------------+----+---------+---------+---------+
| Bob|2024| 700| null| null|
|Aamir Shahzad|2023| 500| 300| null|
|Aamir Shahzad|2024| 800| null| null|
| Ali|2023| 400| null| null|
| Ali|2024| 300| null| null|
| Raza|2023| null| 200| null|
| Raza|2024| null| 500| null|
| Lisa|2023| 650| null| null|
| Lisa|2024| null| 600| null|
| Lisa|2025| 650| null| null|
| Lisa|2026| null| null| 1100|
+-------------+----+---------+---------+---------+
๐ Pivot Example 2: Pivot on Year
Rotate the Year
column to display annual revenue for each person across all years.
pivot_df_year = df.groupBy("Name").pivot("Year").agg(sum("Revenue"))
pivot_df_year.show()
✅ Expected Output
+-------------+----+----+----+----+
| Name|2023|2024|2025|2026|
+-------------+----+----+----+----+
| Bob|null| 700|null|null|
|Aamir Shahzad| 800| 800|null|null|
| Ali| 400| 300|null|null|
| Raza| 200| 500|null|null|
| Lisa| 650| 600| 650|1100|
+-------------+----+----+----+----+
๐ Explanation
pivot("Product")
creates a column for each product type.pivot("Year")
creates a column for each year.agg(sum("Revenue"))
aggregates revenue values during the pivot.- Missing values appear as
null
.