How to Use the cube() Function in PySpark
Author: Aamir Shahzad
Published on: March 2025
Introduction
In this tutorial, you will learn how to use the cube() function in PySpark. The cube() function is useful for multi-dimensional aggregations, similar to OLAP cube operations, because it computes several levels of grouping in a single query.
What is cube() in PySpark?
The cube() function computes aggregates for all combinations of a group of columns, including subtotals and the grand total. It is ideal when you need multi-level aggregations in a single query, simplifying complex data analysis tasks.
Sample Dataset
Here is the sample dataset that we'll be using for this tutorial:
Name            Department   Salary
Aamir Shahzad   IT           5000
Ali Raza        HR           4000
Bob             Finance      4500
Lisa            HR           4000
Create DataFrame in PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum
# Create Spark session
spark = SparkSession.builder.appName("CubeFunctionExample").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", "IT", 5000),
("Ali Raza", "HR", 4000),
("Bob", "Finance", 4500),
("Lisa", "HR", 4000)
]
# Create DataFrame
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)
# Show DataFrame
df.show()
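For reference, df.show() should display the four sample rows, similar to the following:
+-------------+----------+------+
|         Name|Department|Salary|
+-------------+----------+------+
|Aamir Shahzad|        IT|  5000|
|     Ali Raza|        HR|  4000|
|          Bob|   Finance|  4500|
|         Lisa|        HR|  4000|
+-------------+----------+------+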
Using the cube() Function in PySpark
cube_df = df.cube("Department", "Name").agg(sum("Salary").alias("Total_Salary"))
# Show results
cube_df.orderBy("Department", "Name").show()
Expected Output
+----------+-------------+------------+
|Department|         Name|Total_Salary|
+----------+-------------+------------+
|      null|         null|       17500|
|      null|Aamir Shahzad|        5000|
|      null|     Ali Raza|        4000|
|      null|          Bob|        4500|
|      null|         Lisa|        4000|
|   Finance|         null|        4500|
|   Finance|          Bob|        4500|
|        HR|         null|        8000|
|        HR|     Ali Raza|        4000|
|        HR|         Lisa|        4000|
|        IT|         null|        5000|
|        IT|Aamir Shahzad|        5000|
+----------+-------------+------------+
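For comparison, the same multi-level aggregation can be written in Spark SQL (Spark 3.x syntax) with GROUPING SETS, which makes the four grouping combinations explicit. This is a minimal sketch; the temporary view name "employees" is a placeholder introduced only for this example:

# Register the DataFrame as a temporary view so it can be queried with SQL.
# The view name "employees" is a placeholder used only for this sketch.
df.createOrReplaceTempView("employees")

sql_cube_df = spark.sql("""
    SELECT Department, Name, SUM(Salary) AS Total_Salary
    FROM employees
    GROUP BY GROUPING SETS ((Department, Name), (Department), (Name), ())
""")

# Should produce the same rows as df.cube("Department", "Name")
sql_cube_df.orderBy("Department", "Name").show()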
Explanation
The cube() function generates aggregations at every combination of the grouping columns:
- Total salary for each Name within each Department
- Total salary for each Department, across all Names (Name shown as null)
- Total salary for each Name, across all Departments (Department shown as null)
- Grand total of all salaries (both Department and Name shown as null)
A short sketch below shows how to tell these subtotal rows apart from ordinary data rows.
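In the cube output, null acts as a subtotal marker, so it can be confused with genuine null values in the data. The sketch below uses PySpark's grouping() function to flag which columns were rolled up for each row; the column names name_rolled_up and dept_rolled_up are just illustrative:

from pyspark.sql.functions import grouping, sum

# grouping(col) returns 1 when col was rolled up for that row (a subtotal row)
# and 0 when the row holds an actual value for col.
flagged_df = df.cube("Department", "Name").agg(
    sum("Salary").alias("Total_Salary"),
    grouping("Name").alias("name_rolled_up"),
    grouping("Department").alias("dept_rolled_up")
)

# Keep only the per-Department subtotal rows (Name rolled up, Department kept).
dept_totals = flagged_df.filter(
    (flagged_df.name_rolled_up == 1) & (flagged_df.dept_rolled_up == 0)
)
dept_totals.show()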
Watch the Video Tutorial
For a complete walkthrough of the cube() function in PySpark, check out this video tutorial: