PySpark Aggregations: count(), count_distinct(), first(), last() Explained with Examples

Learn how to use the count(), count_distinct(), any_value(), first(), and last() aggregation functions with a real-world PySpark DataFrame. A beginner-friendly walkthrough.

📌 Step 1: Sample Dataset

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import count, count_distinct, any_value, first, last

spark = SparkSession.builder.appName("BasicAggregations").getOrCreate()

data = [
  ("Aamir", "New York", 31),
  ("Sara", "San Francisco", 25),
  ("John", "Los Angeles", 35),
  ("Lina", "Chicago", 28),
  ("Aamir", "Lahore", 30),
  ("John", "Los Angeles", 35)
]

schema = StructType([
  StructField("name", StringType(), True),
  StructField("city", StringType(), True),
  StructField("age", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()

🖥️ Output:

+-----+-------------+---+
| name|         city|age|
+-----+-------------+---+
|Aamir|     New York| 31|
| Sara|San Francisco| 25|
| John|  Los Angeles| 35|
| Lina|      Chicago| 28|
|Aamir|       Lahore| 30|
| John|  Los Angeles| 35|
+-----+-------------+---+

📊 Step 2: Aggregation Functions

1️⃣ count()

count() returns the number of non-null values in a column for each group.

df.groupBy("name").agg(count("name").alias("name_count")).show()

2️⃣ count_distinct()

count_distinct() returns the number of unique values in a column for each group. (It was added in PySpark 3.2; older versions use countDistinct() instead.)

df.groupBy("name").agg(count_distinct("age").alias("distinct_age_count")).show()

3️⃣ any_value()

any_value() returns an arbitrary value from the group. It is non-deterministic and is useful when any representative value will do. (It is a recent addition; it requires PySpark 3.5 or later.)

df.groupBy("name").agg(any_value("city").alias("any_city")).show()

4️⃣ first()

first() returns the first value in each group. Without an explicit ordering, the result is non-deterministic; pass ignorenulls=True to skip null values.

df.groupBy("name").agg(first("city").alias("first_city")).show()

5️⃣ last()

last() returns the last value in each group. Like first(), it is non-deterministic unless the data has an explicit ordering.

df.groupBy("name").agg(last("city").alias("last_city")).show()

