PySpark Tutorial: DataFrame.summary() for Statistical Summary in One Command

In this tutorial, you will learn how to use summary() in PySpark to quickly generate statistical summaries of your data. Perfect for data exploration and quick analysis!

What is summary() in PySpark?

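summary() is a DataFrame method that computes descriptive statistics for the numeric and string columns of a PySpark DataFrame and returns them as a new DataFrame.

When called with no arguments, it shows:

  • count: Number of non-null values
  • mean: Average (numeric columns only)
  • stddev: Sample standard deviation (numeric columns only)
  • min: Minimum value
  • 25%, 50%, 75%: Approximate percentiles (numeric columns only)
  • max: Maximum value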

Step 1: Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Summary Example") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    (29, "Aamir Shahzad", 5000),
    (35, "Ali Raza", 6000),
    (40, "Bob", 5500),
    (25, "Lisa", 5200)
]

columns = ["age", "name", "salary"]

df = spark.createDataFrame(data, columns)
df.show()
+---+-------------+------+
|age|         name|salary|
+---+-------------+------+
| 29|Aamir Shahzad|  5000|
| 35|     Ali Raza|  6000|
| 40|          Bob|  5500|
| 25|         Lisa|  5200|
+---+-------------+------+

Step 3: Use summary() Function

Example 1: Default Summary Statistics

df.summary().show()
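+-------+-----+-------------+------+
|summary|  age|         name|salary|
+-------+-----+-------------+------+
|  count|    4|            4|     4|
|   mean|32.25|         null|5425.0|
| stddev| 6.60|         null|434.93|
|    min|   25|Aamir Shahzad|  5000|
|    25%|   25|         null|  5000|
|    50%|   29|         null|  5200|
|    75%|   35|         null|  5500|
|    max|   40|         Lisa|  6000|
+-------+-----+-------------+------+

The stddev values are rounded to two decimals here for readability; Spark prints them at full precision. The 25%, 50%, and 75% rows are approximate percentiles, and they are null for string columns, as are mean and stddev.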
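
One way to cross-check the mean and stddev rows without Spark is Python's standard-library statistics module. This is a minimal sketch, assuming the stddev row is the sample standard deviation (n - 1 in the denominator), which is what Spark's stddev aggregate computes. The variable names are illustrative only.

```python
import statistics

# Same numeric values as the sample DataFrame in Step 2.
ages = [29, 35, 40, 25]
salaries = [5000, 6000, 5500, 5200]

# statistics.stdev() is the sample standard deviation (n - 1 denominator).
age_mean = statistics.mean(ages)
age_stddev = statistics.stdev(ages)
salary_mean = statistics.mean(salaries)
salary_stddev = statistics.stdev(salaries)

print(f"age:    mean={age_mean}, stddev={age_stddev:.2f}")
print(f"salary: mean={salary_mean}, stddev={salary_stddev:.2f}")
```

If the printed numbers differ from what summary() reports on your own data, a likely cause is null values, which Spark leaves out of these aggregates.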

Example 2: Custom Summary Statistics

df.summary("count", "min", "max").show()
+-------+---+-------------+------+
|summary|age|         name|salary|
+-------+---+-------------+------+
|  count|  4|            4|     4|
|    min| 25|Aamir Shahzad|  5000|
|    max| 40|         Lisa|  6000|
+-------+---+-------------+------+

Best Practices

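  • Use summary() for a quick first look at a DataFrame during data exploration.
  • Numeric columns get every statistic; string columns get only count, min, and max (compared alphabetically), with null for the rest.
  • describe() returns a fixed subset (count, mean, stddev, min, max); prefer summary() when you need percentiles or want to choose which statistics to compute.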

Author: Aamir Shahzad

© 2025 PySpark Tutorials. All rights reserved.