PySpark Window Functions Explained | Rank, Dense_Rank, Lead, Lag, NTILE | Real-World Demo | PySpark Tutorial

🚀 Mastering Window Functions in PySpark

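This post walks through the PySpark window functions rank(), dense_rank(), percent_rank(), lag(), lead(), and ntile() using a small sales dataset. A window function computes a value for each row from a group of related rows (the window) without collapsing those rows into one, which makes these functions useful for ranking rows, comparing a row with its neighbours, and splitting partitions into buckets.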

📘 Sample Dataset

+--------+----------+-----+----------+
|Employee|Department|Sales|Sale_Date |
+--------+----------+-----+----------+
|Alice   |Sales     |5000 |2024-01-01|
|Ben     |Sales     |5500 |2024-01-01|
|Cara    |Marketing |4800 |2024-01-01|
|Dan     |HR        |5300 |2024-01-01|
|Alice   |Sales     |6200 |2024-01-02|
|Ben     |Sales     |5500 |2024-01-02|
|Cara    |Marketing |5100 |2024-01-02|
|Dan     |HR        |6000 |2024-01-02|
|Alice   |Sales     |7000 |2024-01-03|
|Ben     |Sales     |6400 |2024-01-03|
+--------+----------+-----+----------+

⚙️ Code Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import rank, dense_rank, percent_rank, lag, lead, ntile
from pyspark.sql.window import Window

# Start Spark session
spark = SparkSession.builder.appName("WindowFunctionsDemo").getOrCreate()

# Sample DataFrame
data = [
    ("Alice", "Sales", 5000, "2024-01-01"),
    ("Ben", "Sales", 5500, "2024-01-01"),
    ("Cara", "Marketing", 4800, "2024-01-01"),
    ("Dan", "HR", 5300, "2024-01-01"),
    ("Alice", "Sales", 6200, "2024-01-02"),
    ("Ben", "Sales", 5500, "2024-01-02"),
    ("Cara", "Marketing", 5100, "2024-01-02"),
    ("Dan", "HR", 6000, "2024-01-02"),
    ("Alice", "Sales", 7000, "2024-01-03"),
    ("Ben", "Sales", 6400, "2024-01-03"),
]
schema = ["employee", "department", "sales", "sale_date"]
df = spark.createDataFrame(data, schema)
df.show()

📊 Applying Window Functions

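# One window per employee, with rows ordered by sale date.
# sale_date is an ISO-formatted string, so string order matches date order.
window_spec = Window.partitionBy("employee").orderBy("sale_date")

# Ranking functions: position of each row within its employee's window.
df.withColumn("rank", rank().over(window_spec)).show()
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
df.withColumn("percent_rank", percent_rank().over(window_spec)).show()

# Offset functions: the same employee's previous and next sales value
# (null where no such row exists).
df.withColumn("lag", lag("sales", 1).over(window_spec)).show()
df.withColumn("lead", lead("sales", 1).over(window_spec)).show()

# ntile(2): split each employee's ordered rows into two buckets.
df.withColumn("ntile", ntile(2).over(window_spec)).show()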
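With the window above, every employee has exactly one row per date, so rank() and dense_rank() return the same values and the difference between them never shows up. The sketch below re-implements the semantics of each function in plain Python (no Spark session needed) to make that difference visible. The helper names are hypothetical and exist only for this illustration. The ranking examples also assume a different window from the one used above: the Sales department's rows ordered by sales in descending order, chosen because it contains a tie (Ben's two 5500 rows).

```python
# Plain-Python sketch of window-function semantics.
# All helper names are hypothetical; they are not part of PySpark.

def rank_values(values):
    """rank(): tied values share a rank and the following rank is skipped."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in ordered]

def dense_rank_values(values):
    """dense_rank(): tied values share a rank and no rank is skipped."""
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in sorted(values, reverse=True)]

def percent_rank_values(values):
    """percent_rank(): (rank - 1) / (rows - 1); 0.0 for a one-row window."""
    n = len(values)
    return [(r - 1) / (n - 1) if n > 1 else 0.0 for r in rank_values(values)]

def lag_values(values, offset=1):
    """lag(): the value `offset` rows earlier, or None if there is none."""
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]

def lead_values(values, offset=1):
    """lead(): the value `offset` rows later, or None if there is none."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]

def ntile_values(n_rows, buckets):
    """ntile(): bucket numbers for ordered rows; earlier buckets get the extras."""
    base, extra = divmod(n_rows, buckets)
    result = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        result.extend([bucket] * size)
    return result

# Assumed window: Sales department rows, ordered by sales descending.
sales_dept = [5000, 5500, 6200, 5500, 7000, 6400]
print(sorted(sales_dept, reverse=True))  # [7000, 6400, 6200, 5500, 5500, 5000]
print(rank_values(sales_dept))           # [1, 2, 3, 4, 4, 6]
print(dense_rank_values(sales_dept))     # [1, 2, 3, 4, 4, 5]

# The post's window: Alice's rows, ordered by sale_date ascending.
alice_sales = [5000, 6200, 7000]
print(lag_values(alice_sales))               # [None, 5000, 6200]
print(lead_values(alice_sales))              # [6200, 7000, None]
print(ntile_values(len(alice_sales), 2))     # [1, 1, 2]
```

After the tie at 5500, rank() jumps from 4 to 6 while dense_rank() continues with 5. In Spark the missing lag() and lead() values appear as null rather than None.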
