PySpark Tutorial: DataFrame.summary() for Statistical Summary in One Command
In this tutorial, you will learn how to use summary() in PySpark to quickly generate statistical summaries of your data. Perfect for data exploration and quick analysis!
What is summary() in PySpark?
The summary() function provides descriptive statistics for numeric and string columns in a PySpark DataFrame.
By default, it shows:
- count: Number of non-null values
- mean: Average for numeric columns
- stddev: Sample standard deviation for numeric columns
- min: Minimum value
- 25%, 50%, 75%: Approximate percentiles for numeric columns
- max: Maximum value
Step 1: Create Spark Session
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for this example
spark = SparkSession.builder \
    .appName("PySpark Summary Example") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    (29, "Aamir Shahzad", 5000),
    (35, "Ali Raza", 6000),
    (40, "Bob", 5500),
    (25, "Lisa", 5200)
]
columns = ["age", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.show()
+---+-------------+------+
|age|         name|salary|
+---+-------------+------+
| 29|Aamir Shahzad|  5000|
| 35|     Ali Raza|  6000|
| 40|          Bob|  5500|
| 25|         Lisa|  5200|
+---+-------------+------+
Step 3: Use summary() Function
Example 1: Default Summary Statistics
df.summary().show()
+-------+-----+-------------+------+
|summary|  age|         name|salary|
+-------+-----+-------------+------+
|  count|    4|            4|     4|
|   mean|32.25|         null|5425.0|
| stddev| 6.60|         null|434.93|
|    min|   25|Aamir Shahzad|  5000|
|    25%|   25|         null|  5000|
|    50%|   29|         null|  5200|
|    75%|   35|         null|  5500|
|    max|   40|         Lisa|  6000|
+-------+-----+-------------+------+
Note: the stddev values are rounded here for readability; Spark prints them at full precision. The percentile rows are approximate, and the mean, stddev, and percentile rows are null for the string column.
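As a cross-check, the count, mean, standard deviation, min, and max for the age and salary columns can be reproduced in plain Python. This is a minimal sketch that relies on Spark's stddev being the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes. The helper name basic_stats is hypothetical and not part of PySpark.

```python
import statistics

# Same values as the age and salary columns of the sample DataFrame.
ages = [29, 35, 40, 25]
salaries = [5000, 6000, 5500, 5200]


def basic_stats(values):
    """Return the non-percentile statistics that summary() reports."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


age_stats = basic_stats(ages)
salary_stats = basic_stats(salaries)

print(age_stats)
print(salary_stats)
```

A check like this is only practical for small samples, but it is a quick way to confirm which standard deviation formula a tool is using.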
Example 2: Custom Summary Statistics
df.summary("count", "min", "max").show()
+-------+---+-------------+------+
|summary|age|         name|salary|
+-------+---+-------------+------+
|  count|  4|            4|     4|
|    min| 25|Aamir Shahzad|  5000|
|    max| 40|         Lisa|  6000|
+-------+---+-------------+------+
Best Practices
- summary() is quick and easy for descriptive statistics.
- Use it for numeric columns and get counts/min/max for string columns.
- Use describe() when count, mean, stddev, min, and max are enough; summary() adds the 25%, 50%, and 75% percentiles and lets you choose which statistics to compute.