How to Aggregate Data Using agg() Function in PySpark | PySpark Tutorial

How to Use agg() Function in PySpark | Step-by-Step Guide

Author: Aamir Shahzad

Published: March 2025

📘 Introduction

The agg() function in PySpark applies one or more aggregate functions in a single call. It is part of the DataFrame API and is most often chained after groupBy() to aggregate each group. Called directly on a DataFrame, it aggregates all rows as one group.

📌 What is agg() in PySpark?

The agg() method is ideal when you want to compute several statistics, such as sum(), avg(), min(), max(), and count(), in a single transformation. Because all the aggregates are declared in one call, the code stays compact and Spark computes them together instead of running a separate aggregation for each statistic.

🧾 Sample Dataset

Name           Department    Salary
Aamir Shahzad   IT            5000
Ali Raza        HR            4000
Bob             Finance       4500
Lisa            HR            4000

🔧 Create DataFrame in PySpark

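from pyspark.sql import SparkSession
# Note: importing sum, min, and max from pyspark.sql.functions shadows
# Python's built-in functions of the same names in this script.
from pyspark.sql.functions import sum, avg, min, max, count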

# Create Spark session
spark = SparkSession.builder.appName("AggFunctionExample").getOrCreate()

# Sample data
data = [
    ("Aamir Shahzad", "IT", 5000),
    ("Ali Raza", "HR", 4000),
    ("Bob", "Finance", 4500),
    ("Lisa", "HR", 4000)
]

# Create DataFrame
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

📊 Apply Multiple Aggregations Using agg()

# Group by Department and apply multiple aggregations
agg_df = df.groupBy("Department").agg(
    sum("Salary").alias("Total_Salary"),
    avg("Salary").alias("Average_Salary"),
    min("Salary").alias("Min_Salary"),
    max("Salary").alias("Max_Salary"),
    count("Name").alias("Employee_Count")
)

# Show results
agg_df.show()

✅ Expected Output

+----------+------------+--------------+----------+----------+--------------+
|Department|Total_Salary|Average_Salary|Min_Salary|Max_Salary|Employee_Count|
+----------+------------+--------------+----------+----------+--------------+
|   Finance|        4500|        4500.0|      4500|      4500|             1|
|        HR|        8000|        4000.0|      4000|      4000|             2|
|        IT|        5000|        5000.0|      5000|      5000|             1|
+----------+------------+--------------+----------+----------+--------------+

Row order can differ between runs because groupBy() does not guarantee ordering. Add .orderBy("Department") if you need a stable order.
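The figures in the table above can be cross-checked without Spark. The following is a plain-Python sketch, not part of the PySpark workflow, that groups the same four rows by department and computes the same five statistics. The names rows, salaries_by_dept, and summary are assumptions introduced for this check.

```python
from collections import defaultdict

rows = [
    ("Aamir Shahzad", "IT", 5000),
    ("Ali Raza", "HR", 4000),
    ("Bob", "Finance", 4500),
    ("Lisa", "HR", 4000),
]

# Collect the salaries that belong to each department.
salaries_by_dept = defaultdict(list)
for _name, dept, salary in rows:
    salaries_by_dept[dept].append(salary)

# Compute the same five statistics per department.
summary = {
    dept: {
        "Total_Salary": sum(vals),
        "Average_Salary": sum(vals) / len(vals),
        "Min_Salary": min(vals),
        "Max_Salary": max(vals),
        "Employee_Count": len(vals),
    }
    for dept, vals in salaries_by_dept.items()
}

# Print in alphabetical order of department.
for dept in sorted(summary):
    print(dept, summary[dept])
```

HR is the only department with two employees, which is why its total (8000) differs from its minimum and maximum (both 4000).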

📌 Explanation

  • sum("Salary"): Total salary for each department
  • avg("Salary"): Average salary for each department
  • min("Salary"): Minimum salary in each department
  • max("Salary"): Maximum salary in each department
  • count("Name"): Number of employees in each department (counts only rows where Name is not null)

© 2025 Aamir Shahzad. All rights reserved.

Visit TechBrothersIT for more tutorials.