How to Use groupBy() in PySpark
Author: Aamir Shahzad
Date: March 2025
Introduction
The groupBy() function in PySpark groups rows by one or more columns so that aggregate functions such as count, sum, avg, min, and max can be applied to each group. It works much like the SQL GROUP BY clause.
Step 1: Import SparkSession and Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySparkGroupByFunction") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000)
]
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()
Expected Output
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering| 5000|
| Ali| Sales| 4000|
| Raza| Marketing| 3500|
| Bob| Sales| 4200|
| Lisa|Engineering| 6000|
+-------------+-----------+------+
Step 3: groupBy() Example 1 - Count Employees in Each Department
df.groupBy("Department").count().show()
Expected Output
+-----------+-----+
| Department|count|
+-----------+-----+
|Engineering| 2|
| Marketing| 1|
| Sales| 2|
+-----------+-----+
Step 4: groupBy() Example 2 - Total Salary by Department
from pyspark.sql.functions import sum  # note: this shadows Python's built-in sum
df.groupBy("Department") \
    .agg(sum("Salary").alias("Total_Salary")) \
    .show()
Expected Output
+-----------+------------+
| Department|Total_Salary|
+-----------+------------+
|Engineering| 11000|
| Marketing| 3500|
| Sales| 8200|
+-----------+------------+
Step 5: groupBy() Example 3 - Average, Min, and Max Salary by Department
from pyspark.sql.functions import avg, min, max  # min and max also shadow built-ins
df.groupBy("Department") \
    .agg(
        avg("Salary").alias("Average_Salary"),
        min("Salary").alias("Min_Salary"),
        max("Salary").alias("Max_Salary")
    ).show()
Expected Output
+-----------+--------------+----------+----------+
| Department|Average_Salary|Min_Salary|Max_Salary|
+-----------+--------------+----------+----------+
|Engineering| 5500.0| 5000| 6000|
| Marketing| 3500.0| 3500| 3500|
| Sales| 4100.0| 4000| 4200|
+-----------+--------------+----------+----------+
Step 6: groupBy() Example 4 - Group By Name and Department
df.groupBy("Name", "Department") \
    .sum("Salary") \
    .show()
Expected Output
+-------------+-----------+-----------+
| Name| Department|sum(Salary)|
+-------------+-----------+-----------+
|Aamir Shahzad|Engineering| 5000|
| Ali| Sales| 4000|
| Raza| Marketing| 3500|
| Bob| Sales| 4200|
| Lisa|Engineering| 6000|
+-------------+-----------+-----------+
Conclusion
Using groupBy() in PySpark lets you aggregate and summarize data effectively. Combined with aggregate functions such as count, sum, avg, min, and max, it supports complex analysis directly on your Spark DataFrames.
Watch the Video Tutorial
For a complete walkthrough of groupBy() in PySpark, check out the video tutorial below: