How to Use groupBy() in PySpark

Author: Aamir Shahzad

Date: March 2025

Introduction

The groupBy() function in PySpark groups the rows of a DataFrame by one or more columns so you can apply aggregate functions such as count, sum, avg, min, and max to each group. It works much like the SQL GROUP BY clause.
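
Because it mirrors the SQL GROUP BY clause, the same aggregation can also be expressed in Spark SQL once a DataFrame is registered as a temporary view. A minimal sketch, using the sample DataFrame built in Step 2 below and an illustrative view name of employees:

# Register the DataFrame as a temporary view (the name "employees" is illustrative)
df.createOrReplaceTempView("employees")

# SQL equivalent of df.groupBy("Department").count()
spark.sql("""
    SELECT Department, COUNT(*) AS count
    FROM employees
    GROUP BY Department
""").show()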

Step 1: Import SparkSession and Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkGroupByFunction") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000)
]

columns = ["Name", "Department", "Salary"]

df = spark.createDataFrame(data, schema=columns)

df.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|      Sales|  4000|
|         Raza|  Marketing|  3500|
|          Bob|      Sales|  4200|
|         Lisa|Engineering|  6000|
+-------------+-----------+------+
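
Before aggregating, it can help to confirm the schema Spark inferred from the Python data; Salary should come through as a numeric (long) column:

df.printSchema()

Expected Output

root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Salary: long (nullable = true)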

Step 3: groupBy() Example 1 - Count Employees in Each Department

df.groupBy("Department").count().show()

Expected Output

+-----------+-----+
| Department|count|
+-----------+-----+
|Engineering|    2|
|  Marketing|    1|
|      Sales|    2|
+-----------+-----+
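
Note that Spark does not guarantee the row order of a grouped result; the alphabetical ordering above is incidental. If you need a deterministic order, sort explicitly. A minimal sketch:

# Sort the grouped counts by department name for a stable display order
df.groupBy("Department").count().orderBy("Department").show()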

Step 4: groupBy() Example 2 - Total Salary by Department

from pyspark.sql.functions import sum  # note: this shadows Python's built-in sum() in this script

df.groupBy("Department") \
  .agg(sum("Salary").alias("Total_Salary")) \
  .show()

Expected Output

+-----------+------------+
| Department|Total_Salary|
+-----------+------------+
|Engineering|       11000|
|  Marketing|        3500|
|      Sales|        8200|
+-----------+------------+
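
If you do not need a custom column name, agg() also accepts a dictionary that maps a column name to an aggregate function name; the result column is then auto-named sum(Salary) instead of Total_Salary:

# Dictionary form of agg(): {"column": "aggregate function name"}
df.groupBy("Department").agg({"Salary": "sum"}).show()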

Step 5: groupBy() Example 3 - Average, Min, and Max Salary by Department

from pyspark.sql.functions import avg, min, max  # note: min and max shadow Python's built-ins here

df.groupBy("Department") \
  .agg(
      avg("Salary").alias("Average_Salary"),
      min("Salary").alias("Min_Salary"),
      max("Salary").alias("Max_Salary")
  ).show()

Expected Output

+-----------+--------------+----------+----------+
| Department|Average_Salary|Min_Salary|Max_Salary|
+-----------+--------------+----------+----------+
|Engineering|        5500.0|      5000|      6000|
|  Marketing|        3500.0|      3500|      3500|
|      Sales|        4100.0|      4000|      4200|
+-----------+--------------+----------+----------+
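
Aggregates can be freely combined in a single agg() call. The sketch below adds a per-department head count and rounds the average to two decimals; the alias Employee_Count is just an illustrative name, and importing round under an alias avoids shadowing Python's built-in round():

from pyspark.sql.functions import avg, count, round as sql_round

df.groupBy("Department") \
  .agg(
      sql_round(avg("Salary"), 2).alias("Average_Salary"),
      count("*").alias("Employee_Count")
  ).show()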

Step 6: groupBy() Example 4 - Group By Name and Department

df.groupBy("Name", "Department") \
  .sum("Salary") \
  .show()

Expected Output

+-------------+-----------+-----------+
|         Name| Department|sum(Salary)|
+-------------+-----------+-----------+
|Aamir Shahzad|Engineering|       5000|
|          Ali|      Sales|       4000|
|         Raza|  Marketing|       3500|
|          Bob|      Sales|       4200|
|         Lisa|Engineering|       6000|
+-------------+-----------+-----------+
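
Because every Name/Department pair in this sample is unique, each group contains a single row, so the sums above are simply the original salaries. To control the output column name instead of the default sum(Salary), combine groupBy() with agg() and an alias (Total_Salary here is an illustrative name):

from pyspark.sql.functions import sum  # note: shadows Python's built-in sum()

df.groupBy("Name", "Department") \
  .agg(sum("Salary").alias("Total_Salary")) \
  .show()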

Conclusion

Using groupBy() in PySpark allows you to aggregate and summarize data effectively. You can combine it with various aggregate functions to perform complex data analysis directly on your Spark DataFrames.

Watch the Video Tutorial

For a complete walkthrough of groupBy() in PySpark, check out the video tutorial below:

© 2025 Aamir Shahzad | PySpark Tutorials