How to Use the cube() Function in PySpark
Author: Aamir Shahzad
Published on: March 2025
Introduction
In this tutorial, you will learn how to use the cube() function in PySpark. The cube() function is useful for multi-dimensional aggregations, similar to OLAP cube operations, because it computes several levels of grouping in a single query.
What is cube() in PySpark?
The cube() function computes aggregates for all combinations of a group of columns, including subtotals and the grand total. It is ideal when you need multi-level aggregations in a single query, simplifying complex data analysis tasks.
Sample Dataset
Here is the sample dataset that we'll be using for this tutorial:
Name            Department   Salary
Aamir Shahzad   IT           5000
Ali Raza        HR           4000
Bob             Finance      4500
Lisa            HR           4000
Create DataFrame in PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum
# Create Spark session
spark = SparkSession.builder.appName("CubeFunctionExample").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", "IT", 5000),
("Ali Raza", "HR", 4000),
("Bob", "Finance", 4500),
("Lisa", "HR", 4000)
]
# Create DataFrame
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)
# Show DataFrame
df.show()
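For reference, df.show() should display the four sample rows, similar to the following:
+-------------+----------+------+
|         Name|Department|Salary|
+-------------+----------+------+
|Aamir Shahzad|        IT|  5000|
|     Ali Raza|        HR|  4000|
|          Bob|   Finance|  4500|
|         Lisa|        HR|  4000|
+-------------+----------+------+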
Using the cube() Function in PySpark
cube_df = df.cube("Department", "Name").agg(sum("Salary").alias("Total_Salary"))
# Show results
cube_df.orderBy("Department", "Name").show()
Expected Output
+----------+-------------+------------+
|Department|         Name|Total_Salary|
+----------+-------------+------------+
|      null|         null|       17500|
|      null|Aamir Shahzad|        5000|
|      null|     Ali Raza|        4000|
|      null|          Bob|        4500|
|      null|         Lisa|        4000|
|   Finance|         null|        4500|
|   Finance|          Bob|        4500|
|        HR|         null|        8000|
|        HR|     Ali Raza|        4000|
|        HR|         Lisa|        4000|
|        IT|         null|        5000|
|        IT|Aamir Shahzad|        5000|
+----------+-------------+------------+
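For comparison, the same multi-level aggregation can be written in Spark SQL (Spark 3.x syntax) with GROUPING SETS, which makes the four grouping combinations explicit. This is a minimal sketch; the temporary view name "employees" is a placeholder introduced only for this example:

# Register the DataFrame as a temporary view so it can be queried with SQL.
# The view name "employees" is a placeholder used only for this sketch.
df.createOrReplaceTempView("employees")

sql_cube_df = spark.sql("""
    SELECT Department, Name, SUM(Salary) AS Total_Salary
    FROM employees
    GROUP BY GROUPING SETS ((Department, Name), (Department), (Name), ())
""")

# Should produce the same rows as df.cube("Department", "Name")
sql_cube_df.orderBy("Department", "Name").show()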
Explanation
The cube() function generates aggregations at every combination of the grouping columns:
- Total salary for each Name within each Department
- Total salary for each Department, across all Names (Name shown as null)
- Total salary for each Name, across all Departments (Department shown as null)
- Grand total of all salaries (both Department and Name shown as null)
A short sketch below shows how to tell these subtotal rows apart from ordinary data rows.
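In the cube output, null acts as a subtotal marker, so it can be confused with genuine null values in the data. The sketch below uses PySpark's grouping() function to flag which columns were rolled up for each row; the column names name_rolled_up and dept_rolled_up are just illustrative:

from pyspark.sql.functions import grouping, sum

# grouping(col) returns 1 when col was rolled up for that row (a subtotal row)
# and 0 when the row holds an actual value for col.
flagged_df = df.cube("Department", "Name").agg(
    sum("Salary").alias("Total_Salary"),
    grouping("Name").alias("name_rolled_up"),
    grouping("Department").alias("dept_rolled_up")
)

# Keep only the per-Department subtotal rows (Name rolled up, Department kept).
dept_totals = flagged_df.filter(
    (flagged_df.name_rolled_up == 1) & (flagged_df.dept_rolled_up == 0)
)
dept_totals.show()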
Watch the Video Tutorial
For a complete walkthrough of the cube() function in PySpark, check out this video tutorial: