How to Use Pivot Function in PySpark | Transform and Summarize Data Easily | PySpark Tutorial

How to Use Pivot Function in PySpark | Step-by-Step Guide

How to Use pivot() Function in PySpark | Step-by-Step Guide

Author: Aamir Shahzad

Published: March 2025

๐Ÿ“˜ Introduction

The pivot() function in PySpark lets you rotate data in a DataFrame. It's used to transform rows into columns—helpful when summarizing data for reporting and analytics.

๐Ÿงพ Sample Dataset

Name            Year    Product     Revenue
Aamir Shahzad   2023    Product A   500
Aamir Shahzad   2023    Product B   300
Ali             2023    Product A   400
Raza            2023    Product B   200
Bob             2024    Product A   700
Lisa            2024    Product B   600
Ali             2024    Product A   300
Raza            2024    Product B   500
Aamir Shahzad   2024    Product A   800
Lisa            2023    Product A   650
Lisa            2025    Product A   650
Lisa            2026    Product C   1100

๐Ÿ”ง Create DataFrame in PySpark

from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("PivotFunctionExample").getOrCreate()

# Sample data
data = [
    ("Aamir Shahzad", 2023, "Product A", 500),
    ("Aamir Shahzad", 2023, "Product B", 300),
    ("Ali", 2023, "Product A", 400),
    ("Raza", 2023, "Product B", 200),
    ("Bob", 2024, "Product A", 700),
    ("Lisa", 2024, "Product B", 600),
    ("Ali", 2024, "Product A", 300),
    ("Raza", 2024, "Product B", 500),
    ("Aamir Shahzad", 2024, "Product A", 800),
    ("Lisa", 2023, "Product A", 650),
    ("Lisa", 2025, "Product A", 650),
    ("Lisa", 2026, "Product C", 1100)
]

columns = ["Name", "Year", "Product", "Revenue"]
df = spark.createDataFrame(data, columns)

# Show original DataFrame
df.show()

๐Ÿ“Š Pivot Example 1: Pivot on Product

Rotate the Product column into individual columns and summarize Revenue using sum().

from pyspark.sql.functions import sum

pivot_df = df.groupBy("Name", "Year").pivot("Product").agg(sum("Revenue"))
pivot_df.show()

✅ Expected Output

+-------------+----+---------+---------+---------+
|         Name|Year|Product A|Product B|Product C|
+-------------+----+---------+---------+---------+
|          Bob|2024|      700|     null|     null|
|Aamir Shahzad|2023|      500|      300|     null|
|Aamir Shahzad|2024|      800|     null|     null|
|          Ali|2023|      400|     null|     null|
|          Ali|2024|      300|     null|     null|
|         Raza|2023|     null|      200|     null|
|         Raza|2024|     null|      500|     null|
|         Lisa|2023|      650|     null|     null|
|         Lisa|2024|     null|      600|     null|
|         Lisa|2025|      650|     null|     null|
|         Lisa|2026|     null|     null|     1100|
+-------------+----+---------+---------+---------+

๐Ÿ“Š Pivot Example 2: Pivot on Year

Rotate the Year column to display annual revenue for each person across all years.

pivot_df_year = df.groupBy("Name").pivot("Year").agg(sum("Revenue"))
pivot_df_year.show()

✅ Expected Output

+-------------+----+----+----+----+
|         Name|2023|2024|2025|2026|
+-------------+----+----+----+----+
|          Bob|null| 700|null|null|
|Aamir Shahzad| 800| 800|null|null|
|          Ali| 400| 300|null|null|
|         Raza| 200| 500|null|null|
|         Lisa| 650| 600| 650|1100|
+-------------+----+----+----+----+

๐Ÿ“Œ Explanation

  • pivot("Product") creates a column for each product type.
  • pivot("Year") creates a column for each year.
  • agg(sum("Revenue")) aggregates revenue values during the pivot.
  • Missing values appear as null.

๐ŸŽฅ Video Tutorial

Watch on YouTube

© 2025 Aamir Shahzad. All rights reserved.

Visit TechBrothersIT for more tutorials.