Basic Aggregations in PySpark
Learn how to use the count(), count_distinct(), any_value(), first(), and last() functions with real-world PySpark DataFrames. A beginner-friendly walkthrough.
📌 Step 1: Sample Dataset
from pyspark.sql import SparkSession
from pyspark.sql.functions import any_value, count, count_distinct, first, last
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("BasicAggregations").getOrCreate()

# Sample data: note the duplicate John row and the two Aamir rows with different ages
data = [
    ("Aamir", "New York", 31),
    ("Sara", "San Francisco", 25),
    ("John", "Los Angeles", 35),
    ("Lina", "Chicago", 28),
    ("Aamir", "Lahore", 30),
    ("John", "Los Angeles", 35)
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("city", StringType(), True),
    StructField("age", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()
🖥️ Output:
+-----+-------------+---+
| name| city|age|
+-----+-------------+---+
|Aamir| New York| 31|
| Sara|San Francisco| 25|
| John| Los Angeles| 35|
| Lina| Chicago| 28|
|Aamir| Lahore| 30|
| John| Los Angeles| 35|
+-----+-------------+---+
📊 Step 2: Aggregation Functions
1️⃣ count()
df.groupBy("name").agg(count("name").alias("name_count")).show()
2️⃣ count_distinct()
df.groupBy("name").agg(count_distinct("age").alias("distinct_age_count")).show()
3️⃣ any_value()
df.groupBy("name").agg(any_value("city").alias("any_city")).show()
4️⃣ first()
df.groupBy("name").agg(first("city").alias("first_city")).show()
5️⃣ last()
df.groupBy("name").agg(last("city").alias("last_city")).show()