🚀 Mastering Window Functions in PySpark
This post walks through the PySpark window functions rank(), dense_rank(), percent_rank(), lag(), lead(), and ntile() using a small worked example. A window function computes a value for each row from a group of related rows (a partition) without collapsing those rows, which makes it a natural fit for rankings, row-to-row comparisons, and bucketing.
📘 Sample Dataset
+--------+----------+-----+----------+
|Employee|Department|Sales|Sale_Date |
+--------+----------+-----+----------+
|Alice   |Sales     |5000 |2024-01-01|
|Ben     |Sales     |5500 |2024-01-01|
|Cara    |Marketing |4800 |2024-01-01|
|Dan     |HR        |5300 |2024-01-01|
|Alice   |Sales     |6200 |2024-01-02|
|Ben     |Sales     |5500 |2024-01-02|
|Cara    |Marketing |5100 |2024-01-02|
|Dan     |HR        |6000 |2024-01-02|
|Alice   |Sales     |7000 |2024-01-03|
|Ben     |Sales     |6400 |2024-01-03|
+--------+----------+-----+----------+
⚙️ Code Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import rank, dense_rank, percent_rank, lag, lead, ntile
from pyspark.sql.window import Window
# Start Spark session
spark = SparkSession.builder.appName("WindowFunctionsDemo").getOrCreate()
# Sample DataFrame
data = [
("Alice", "Sales", 5000, "2024-01-01"),
("Ben", "Sales", 5500, "2024-01-01"),
("Cara", "Marketing", 4800, "2024-01-01"),
("Dan", "HR", 5300, "2024-01-01"),
("Alice", "Sales", 6200, "2024-01-02"),
("Ben", "Sales", 5500, "2024-01-02"),
("Cara", "Marketing", 5100, "2024-01-02"),
("Dan", "HR", 6000, "2024-01-02"),
("Alice", "Sales", 7000, "2024-01-03"),
("Ben", "Sales", 6400, "2024-01-03"),
]
schema = ["employee", "department", "sales", "sale_date"]
df = spark.createDataFrame(data, schema)
df.show()
📊 Applying Window Functions
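# Window: one partition per employee, rows ordered by sale date
# (ISO-formatted date strings sort chronologically)
window_spec = Window.partitionBy("employee").orderBy("sale_date")
# rank() and dense_rank() differ only on ties; each employee has one row per date here, so both simply number the rows
df.withColumn("rank", rank().over(window_spec)).show()
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
# percent_rank() = (rank - 1) / (rows in partition - 1)
df.withColumn("percent_rank", percent_rank().over(window_spec)).show()
# lag()/lead(): the employee's sales on the previous/next date; null at the partition edge
df.withColumn("lag", lag("sales", 1).over(window_spec)).show()
df.withColumn("lead", lead("sales", 1).over(window_spec)).show()
# ntile(2): split each employee's ordered rows into 2 buckets
df.withColumn("ntile", ntile(2).over(window_spec)).show()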
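To see what these functions compute without starting a Spark session, the per-partition logic can be sketched in plain Python. This is only an illustration of the semantics, not how Spark implements them: the helper names (rank_values, dense_rank_values, lag_values, lead_values) are hypothetical and not part of PySpark, and Python's None stands in for Spark's null. The input is Ben's three sales figures from the sample dataset, which contain a tie and so show where rank() and dense_rank() diverge when the window is ordered by sales instead of by date.

```python
def rank_values(values):
    """RANK over an already sorted list: ties share a rank, then numbering skips ahead."""
    ranks = []
    for i, v in enumerate(values):
        if i > 0 and v == values[i - 1]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(i + 1)       # otherwise: 1-based row position
    return ranks


def dense_rank_values(values):
    """DENSE_RANK over an already sorted list: ties share a rank, no gaps afterwards."""
    ranks = []
    for i, v in enumerate(values):
        if i == 0:
            ranks.append(1)
        elif v == values[i - 1]:
            ranks.append(ranks[-1])       # tie: reuse the previous rank
        else:
            ranks.append(ranks[-1] + 1)   # next distinct value: next rank
    return ranks


def lag_values(values, offset=1):
    """LAG: the value `offset` rows earlier in the partition, or None if there is none."""
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]


def lead_values(values, offset=1):
    """LEAD: the value `offset` rows later in the partition, or None if there is none."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]


# Ben's sales from the sample dataset, ordered by sale_date
# (this happens to be ascending order by sales as well)
ben_sales = [5500, 5500, 6400]

print(rank_values(ben_sales))        # [1, 1, 3]
print(dense_rank_values(ben_sales))  # [1, 1, 2]
print(lag_values(ben_sales))         # [None, 5500, 5500]
print(lead_values(ben_sales))        # [5500, 6400, None]
```

The two tied 5500 rows share rank 1 in both cases; rank() then jumps to 3 while dense_rank() continues with 2. With the date-ordered window used in this post there are no ties within a partition, so the two functions return identical columns.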