How to Use approxQuantile() in PySpark | Quick Guide to Percentiles & Median

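The approxQuantile() function in PySpark estimates percentiles, including the median, of a numeric column. It is especially useful on large datasets, where computing exact quantiles is costly. It takes a column name, a list of probabilities between 0 and 1, and a relativeError, and it returns a list of floats, one per requested probability.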

1. Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("approxQuantile Example") \
    .getOrCreate()

2. Create Sample DataFrame

data = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28),
    (5, "John", 40),
    (6, "Sara", 50)
]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)
df.show()
+---+--------------+---+
| id|          name|age|
+---+--------------+---+
|  1| Aamir Shahzad| 35|
|  2|      Ali Raza| 30|
|  3|           Bob| 25|
|  4|          Lisa| 28|
|  5|          John| 40|
|  6|          Sara| 50|
+---+--------------+---+

3. Use approxQuantile()

Example 1: Median (50th percentile)

median_age = df.approxQuantile("age", [0.5], 0.01)
print("Median Age:", median_age)
Median Age: [30.0]
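Note that the result above is 30.0, not 32.5. approxQuantile() returns a value that actually occurs in the column rather than averaging the two middle values of an even-sized dataset. The following plain-Python sketch (no Spark needed) contrasts the two conventions on the same ages, using the standard library's statistics module purely for illustration:

```python
import statistics

# Same ages as in the sample DataFrame
ages = [35, 30, 25, 28, 40, 50]

# Textbook median: averages the two middle values (30 and 35)
interpolated_median = statistics.median(ages)

# "Low" median: picks the lower of the two middle values,
# which is an element that really exists in the data
low_median = statistics.median_low(ages)

print("Interpolated median:", interpolated_median)  # 32.5
print("Low median:", low_median)                    # 30
```

If you need the interpolated median, you have to compute it another way; approxQuantile() does not interpolate between rows.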

Example 2: 25th, 50th, and 75th Percentiles

quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)
print("25th, 50th, and 75th Percentiles:", quantiles)
25th, 50th, and 75th Percentiles: [28.0, 30.0, 40.0]

Example 3: Min, Median, Max

min_median_max = df.approxQuantile("age", [0.0, 0.5, 1.0], 0.01)
print("Min, Median, and Max Age:", min_median_max)
Min, Median, and Max Age: [25.0, 30.0, 50.0]

4. Control Accuracy with relativeError

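# relativeError must be >= 0
# Lower relativeError = more accurate but slower (and more memory)
# Higher relativeError = less accurate but faster
# relativeError = 0 computes exact quantiles, which can be very expensive on large data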

# Example: Set relativeError to 0.1 (faster but less accurate)
quantiles_fast = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.1)
print("Quantiles with higher relative error:", quantiles_fast)
Quantiles with higher relative error: [28.0, 30.0, 40.0]
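The results are identical here because the dataset has only six rows. What relativeError controls is how far the rank of the returned value may drift from the target rank: per the Spark documentation, for N rows, probability p, and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The plain-Python sketch below lists which values would satisfy that bound. The helper name acceptable_values and the 1-based rank convention are assumptions made for illustration; this is not how Spark computes the result internally.

```python
import math

def acceptable_values(values, p, err):
    """Values whose rank falls inside the documented bound
    floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)."""
    ordered = sorted(values)
    n = len(ordered)
    low = math.floor((p - err) * n)
    high = math.ceil((p + err) * n)
    # Assumption: rank is the 1-based position in sorted order
    return [x for rank, x in enumerate(ordered, start=1) if low <= rank <= high]

ages = [35, 30, 25, 28, 40, 50]

# Tight error: only values near the middle qualify
print(acceptable_values(ages, 0.5, 0.01))  # [28, 30, 35]

# Loose error: a much wider range of values qualifies
print(acceptable_values(ages, 0.5, 0.25))  # [25, 28, 30, 35, 40]
```

On a large column, this widening is what lets Spark trade accuracy for speed and memory.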

Author: Aamir Shahzad

© 2024 PySpark Tutorials. All rights reserved.