How to Use agg() Function in PySpark | Step-by-Step Guide
Author: Aamir Shahzad
Published: March 2025
📘 Introduction
The agg() function in PySpark applies one or more aggregate functions in a single call, most often on grouped data. It is part of the DataFrame API and is typically used together with the groupBy() method.
📌 What is agg() in PySpark?
The agg() method is ideal when you want to compute multiple statistics such as sum(), avg(), min(), max(), and count() in a single transformation. It keeps the code concise, and all of the requested aggregates are computed in one grouped operation instead of several separate ones.
🧾 Sample Dataset
Name          | Department | Salary
Aamir Shahzad | IT         | 5000
Ali Raza      | HR         | 4000
Bob           | Finance    | 4500
Lisa          | HR         | 4000
🔧 Create DataFrame in PySpark
from pyspark.sql import SparkSession
# Note: importing sum, min, and max by name shadows Python's built-ins in this script
from pyspark.sql.functions import sum, avg, min, max, count

# Create Spark session
spark = SparkSession.builder.appName("AggFunctionExample").getOrCreate()

# Sample data
data = [
    ("Aamir Shahzad", "IT", 5000),
    ("Ali Raza", "HR", 4000),
    ("Bob", "Finance", 4500),
    ("Lisa", "HR", 4000)
]

# Create DataFrame
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()
📊 Apply Multiple Aggregations Using agg()
# Group by Department and apply multiple aggregations
agg_df = df.groupBy("Department").agg(
    sum("Salary").alias("Total_Salary"),
    avg("Salary").alias("Average_Salary"),
    min("Salary").alias("Min_Salary"),
    max("Salary").alias("Max_Salary"),
    count("Name").alias("Employee_Count")
)
# Show results
agg_df.show()
✅ Expected Output
+----------+------------+--------------+----------+----------+--------------+
|Department|Total_Salary|Average_Salary|Min_Salary|Max_Salary|Employee_Count|
+----------+------------+--------------+----------+----------+--------------+
|   Finance|        4500|        4500.0|      4500|      4500|             1|
|        HR|        8000|        4000.0|      4000|      4000|             2|
|        IT|        5000|        5000.0|      5000|      5000|             1|
+----------+------------+--------------+----------+----------+--------------+
Note: the row order may differ on your machine, because groupBy() does not guarantee an ordering.
📌 Explanation
- sum("Salary"): Total salary for each department
- avg("Salary"): Average salary for each department
- min("Salary"): Minimum salary in each department
- max("Salary"): Maximum salary in each department
- count("Name"): Number of employees in each department (rows where Name is not null)
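To check the numbers by hand, the same five statistics can be reproduced in plain Python without Spark. This is only a cross-check of what each aggregate means on the sample data, not a description of how Spark computes it; the names salaries_by_dept and summary are made up for this sketch.

```python
from collections import defaultdict

data = [
    ("Aamir Shahzad", "IT", 5000),
    ("Ali Raza", "HR", 4000),
    ("Bob", "Finance", 4500),
    ("Lisa", "HR", 4000),
]

# Group the salaries by department, mirroring groupBy("Department")
salaries_by_dept = defaultdict(list)
for name, department, salary in data:
    salaries_by_dept[department].append(salary)

# Compute the same statistics that the agg() call requests
summary = {
    dept: {
        "Total_Salary": sum(salaries),
        "Average_Salary": sum(salaries) / len(salaries),
        "Min_Salary": min(salaries),
        "Max_Salary": max(salaries),
        # Every row in the sample has a Name, so the row count matches count("Name")
        "Employee_Count": len(salaries),
    }
    for dept, salaries in sorted(salaries_by_dept.items())
}

for dept, stats in summary.items():
    print(dept, stats)
```

For HR, the two salaries of 4000 give a total of 8000 and an average of 4000.0, matching the expected output.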