How to Use approxQuantile() in PySpark
Quick Guide to Percentiles & Median
The approxQuantile() function in PySpark estimates percentiles and median values quickly and efficiently. This is especially useful on large datasets, where computing exact quantiles would require a costly full scan and sort of the data.
1. Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("approxQuantile Example") \
    .getOrCreate()
2. Create Sample DataFrame
data = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28),
    (5, "John", 40),
    (6, "Sara", 50)
]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()
+---+--------------+---+
| id| name|age|
+---+--------------+---+
| 1| Aamir Shahzad| 35|
| 2| Ali Raza| 30|
| 3| Bob| 25|
| 4| Lisa| 28|
| 5| John| 40|
| 6| Sara| 50|
+---+--------------+---+
3. Use approxQuantile()
Example 1: Median (50th percentile)
median_age = df.approxQuantile("age", [0.5], 0.01)
print("Median Age:", median_age)
Median Age: [30.0]
Example 2: 25th, 50th, and 75th Percentiles
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)
print("25th, 50th, and 75th Percentiles:", quantiles)
25th, 50th, and 75th Percentiles: [28.0, 30.0, 40.0]
Example 3: Min, Median, Max
min_median_max = df.approxQuantile("age", [0.0, 0.5, 1.0], 0.01)
print("Min, Median, and Max Age:", min_median_max)
Min, Median, and Max Age: [25.0, 30.0, 50.0]
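Notice that the reported median is 30.0, while the exact median of [25, 28, 30, 35, 40, 50] is 32.5: approxQuantile() returns an actual value from the column rather than interpolating between the two middle values. You can sanity-check the approximate results against exact quantiles using Python's built-in statistics module (a standalone check on the same sample data, no Spark required):

```python
import statistics

ages = [35, 30, 25, 28, 40, 50]

# Exact median interpolates between the two middle values: (30 + 35) / 2
print("Exact median:", statistics.median(ages))  # 32.5

# Exact quartiles (default 'exclusive' method)
print("Exact quartiles:", statistics.quantiles(ages, n=4))  # [27.25, 32.5, 42.5]
```

On a dataset this small the difference is visible; on large datasets the approximation error shrinks relative to the spread of the data.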
4. Control Accuracy with relativeError
# Lower relativeError = more accurate but slower
# Higher relativeError = less accurate but faster
# Example: Set relativeError to 0.1 (faster but less accurate)
quantiles_fast = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.1)
print("Quantiles with higher relative error:", quantiles_fast)
Quantiles with higher relative error: [28.0, 30.0, 40.0]
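The relativeError parameter has a precise meaning: per the Spark documentation, for a requested probability p over N rows, approxQuantile() is guaranteed to return a value whose rank r in the sorted column satisfies floor((p - err) * N) <= r <= ceil((p + err) * N). A small helper (illustrative only, not part of PySpark) makes that bound concrete for the 6-row sample above:

```python
import math

def rank_bounds(p, err, n):
    """Acceptable 1-based rank window for probability p, relativeError err, n rows."""
    return math.floor((p - err) * n), math.ceil((p + err) * n)

# Median (p=0.5) of 6 rows with relativeError=0.1:
lo, hi = rank_bounds(0.5, 0.1, 6)
print(lo, hi)  # 2 4
# Sorted ages are [25, 28, 30, 35, 40, 50], so any of 28, 30, or 35
# would be an acceptable answer -- the observed 30.0 falls in that window.
```

This is why a larger relativeError is faster: the algorithm can keep a much smaller summary of the data while still meeting the looser rank guarantee.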