PySpark Tutorial: How to Use Cube for GroupBy and Aggregations

Author: Aamir Shahzad

Published on: March 2025

Introduction

In this tutorial, you will learn how to use the cube() function in PySpark. cube() performs multi-dimensional aggregations, similar to OLAP cube operations, computing subtotals and grand totals in a single pass.

What is cube() in PySpark?

The cube() function computes aggregates for all combinations of a group of columns, including subtotals and a grand total. For n columns it evaluates all 2^n grouping sets, so you get multi-level aggregations in a single query instead of running several separate groupBy calls.
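To make the "all combinations" idea concrete, the grouping sets that cube() aggregates over can be enumerated with plain Python (a sketch, independent of Spark; the function name is illustrative):

from itertools import combinations

def cube_grouping_sets(columns):
    """Enumerate the grouping sets cube() aggregates over:
    every subset of the given columns, down to the empty set
    (which is the grand total)."""
    sets = []
    for r in range(len(columns), -1, -1):
        for combo in combinations(columns, r):
            sets.append(combo)
    return sets

# For two columns, cube() produces 2^2 = 4 grouping sets
print(cube_grouping_sets(["Department", "Name"]))
# [('Department', 'Name'), ('Department',), ('Name',), ()]

Each tuple corresponds to one aggregation level in the cube output; columns absent from a tuple appear as null in that level's rows.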

Sample Dataset

Here is the sample dataset that we'll be using for this tutorial:

Name           Department    Salary
Aamir Shahzad   IT            5000
Ali Raza        HR            4000
Bob             Finance       4500
Lisa            HR            4000

Create DataFrame in PySpark

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum  # note: shadows Python's built-in sum()

# Create Spark session
spark = SparkSession.builder.appName("CubeFunctionExample").getOrCreate()

# Sample data
data = [
    ("Aamir Shahzad", "IT", 5000),
    ("Ali Raza", "HR", 4000),
    ("Bob", "Finance", 4500),
    ("Lisa", "HR", 4000)
]

# Create DataFrame
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)

# Show DataFrame
df.show()

Using cube() Function in PySpark

cube_df = df.cube("Department", "Name").agg(sum("Salary").alias("Total_Salary"))

# Show results
cube_df.orderBy("Department", "Name").show()

Expected Output

+----------+-------------+------------+
|Department|Name         |Total_Salary|
+----------+-------------+------------+
|null      |null         |       17500|
|null      |Aamir Shahzad|        5000|
|null      |Ali Raza     |        4000|
|null      |Bob          |        4500|
|null      |Lisa         |        4000|
|Finance   |null         |        4500|
|Finance   |Bob          |        4500|
|HR        |null         |        8000|
|HR        |Ali Raza     |        4000|
|HR        |Lisa         |        4000|
|IT        |null         |        5000|
|IT        |Aamir Shahzad|        5000|
+----------+-------------+------------+

Note that cube() also produces rows where Department is null but Name is not (per-Name totals across departments), and that orderBy sorts nulls first by default in ascending order.

Explanation

The cube() function generates aggregates for every combination of the grouped columns. For this example:

  • Total salary for each Name within each Department
  • Total salary for each Department (Name is null)
  • Total salary for each Name across departments (Department is null)
  • Grand total of all salaries (both Department and Name are null)
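In cube output, a null can mean either a subtotal row or a genuine null value in the data. PySpark's grouping() function (from pyspark.sql.functions) returns 1 when a column's null comes from the aggregation level rather than the data. A sketch that rebuilds the same DataFrame and keeps only the per-Department subtotal rows (the app name is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, grouping

spark = SparkSession.builder.appName("CubeGroupingExample").getOrCreate()

df = spark.createDataFrame(
    [("Aamir Shahzad", "IT", 5000), ("Ali Raza", "HR", 4000),
     ("Bob", "Finance", 4500), ("Lisa", "HR", 4000)],
    ["Name", "Department", "Salary"],
)

# grouping("Name") is 1 on rows where Name was rolled up (subtotal rows)
result = (
    df.cube("Department", "Name")
      .agg(
          sum("Salary").alias("Total_Salary"),
          grouping("Name").alias("Name_Subtotal"),
          grouping("Department").alias("Dept_Subtotal"),
      )
)

# Keep only per-Department subtotals: Name rolled up, Department kept
result.filter("Name_Subtotal = 1 AND Dept_Subtotal = 0") \
      .select("Department", "Total_Salary") \
      .orderBy("Department") \
      .show()

Filtering on grouping() flags is more reliable than filtering on null, because it cannot be confused by rows whose column value is actually null in the source data.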

Watch the Video Tutorial

For a complete walkthrough of the cube() function in PySpark, check out this video tutorial:

© 2025 Aamir Shahzad. All rights reserved.