PySpark Tutorial: How to Use describe() for DataFrame Statistics | PySpark Tutorial for Data Engineers


The describe() function in PySpark computes summary statistics for the columns of a DataFrame: count, mean, standard deviation (stddev), min, and max. It covers both numeric and string columns; for string columns, mean and stddev are returned as null, while count, min, and max are still computed.

Sample Data

data = [
    (1, "Alice", 5000, 25),
    (2, "Bob", 6000, 30),
    (3, "Charlie", 7000, 35),
    (4, "David", 8000, 40),
    (5, "Eve", 9000, 45),
    (6, "Frank", 10000, 50),
    (7, "Grace", 11000, 55),
    (8, "Hannah", 12000, 60),
    (9, "Ian", 13000, 65),
    (10, "Jack", 14000, 70)
]

Create DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DescribeExample").getOrCreate()
df = spark.createDataFrame(data, ["id", "name", "salary", "age"])

Show the Full DataFrame

df.show()

Example 1: Basic Usage of describe()

print("Summary statistics for all columns:")
df.describe().show()

Example 2: describe() for Specific Columns

print("Summary statistics for 'salary' and 'age':")
df.describe("salary", "age").show()

Sample Output

+-------+------------------+------------------+
|summary|            salary|               age|
+-------+------------------+------------------+
|  count|                10|                10|
|   mean|            9500.0|              47.5|
| stddev|3027.6503540974913|15.138251770487457|
|    min|              5000|                25|
|    max|             14000|                70|
+-------+------------------+------------------+
